Patent Translate
Powered by EPO and Google
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
PROBLEM TO BE SOLVED: To minimize the amount of traffic traveling between conference sites.
SOLUTION: In a method and system for distributing an audio signal from one of a plurality of audio sources to an output, a first transmitter unicasts one audio signal from each of a plurality of audio sources to a plurality of first-level voice bridge servers. A second transmitter unicasts at least one audio signal from each first-level bridge server to at least one second-level audio-signal bridge server. A third transmitter unicasts an output signal from each second-level bridge server to the first-level bridge servers. A fourth transmitter unicasts the output signal from the first-level bridge servers to the audio sources. In an alternative embodiment, the third transmitter multicasts the selected audio signal, as an output from the second-level bridge server, to at least one of the first-level bridge servers and the audio sources. [Selected figure] Figure 2D
Method and system for distributing audio signals
The present invention relates generally to methods and systems for distributing audio signals,
and more particularly to methods and systems for distributing audio signals for use in an audio
or audio / video conferencing system.
As known in the art, one of the major components in a voice conferencing system or a voice /
video conferencing system is a voice processing device.
The audio processing device is responsible for receiving audio from various sites connected to
the conferencing system and distributing the audio to the various sites.
There are two classic types of speech processing devices: the voice switch and the voice mixer. In the case of a voice switch, time-compressed voice data from one site is passed, unaltered, either to no other site or to one or more of the other sites. These voice switches are used, for example, with sites using the "push to talk" method, which send time-compressed voice only when there is speech or some other audible signal to be communicated. In some voice switches, "push to talk" at the site is automatic, eliminating the need for the user to actually push a button. In any case, the voice switch does not actually decode the time-compressed voice signal. Rather, it simply determines which audio source each site will receive, and then routes the time-compressed audio to the appropriate site or sites. The operation of the switch can be based on one or a combination of the following: a control protocol allowing a user to request to speak; a control protocol allowing a user to request to listen to a particular site; and a decision mechanism for transferring voice received from one site to another. Usually, the voice switch is configured so that no site receives the voice it is itself sending.
Voice switches are very efficient to implement, as they do not need to decode time-compressed
voice signals. Thus, a single conferencing server or bridge can support multiple sites.
Furthermore, since the voice switch simply routes time-compressed voice as it is received, the voice signal is not degraded by transcoding (ie, decompression followed by recompression of the decompressed signal) or other signal-processing losses. Furthermore, voice switches introduce relatively little delay, since no transcoding is required.
On the other hand, since the voice switch does not decode the time-compressed speech it receives, but only passes it on to one or more sites, the decisions it can make about how to route the time-compressed speech are limited. In particular, because the input to the switch is compressed voice, the switch cannot use acoustic energy detection in making routing decisions. Furthermore, if participants at more than one site are speaking, the voice switch has to select only one of those sites as the source of the voice to be passed through the switch. In other words, because the switch only routes time-compressed audio from one site, it cannot mix audio from several sites.
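The behavior of a pure voice switch described above can be sketched in a few lines; this is an illustrative Python sketch, in which the packet contents, site names, and routing table are assumptions for illustration, not details taken from the patent:

```python
def route_compressed_packets(packets_by_site, routes):
    """Route time-compressed voice packets without decoding them.

    packets_by_site: dict mapping a site name to its opaque,
    still-compressed packet (the switch never inspects the payload).
    routes: dict mapping each receiving site to the source site it
    should hear (or None to receive nothing).
    """
    delivered = {}
    for receiver, source in routes.items():
        # A voice switch never sends a site its own audio back.
        if source is not None and source != receiver:
            delivered[receiver] = packets_by_site[source]
    return delivered

# Example: sites B, C and D are routed A's packet; A hears B instead.
packets = {"A": b"\x01A-compressed", "B": b"\x02B-compressed",
           "C": b"\x03C-compressed", "D": b"\x04D-compressed"}
routes = {"A": "B", "B": "A", "C": "A", "D": "A"}
out = route_compressed_packets(packets, routes)
```

Note that the payloads pass through untouched, which is why the switch incurs no transcoding loss and very little delay.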
The audio mixer works with time-decompressed, i.e. uncompressed, audio. For each site in the conferencing system, the audio mixer combines the audio from the other selected sites, re-encodes (i.e. time-compresses) the combined audio, and outputs it as time-compressed audio to the receiving site. When there are a large number of sites, a selector is used to select only some of the sites and discard the non-selected sites, thus reducing noise. Because uncompressed audio is available at the selector, the selector can make its selection based on the relative amount of acoustic energy in the audio received from each site. Other signal-processing techniques can be used to make this decision as well. With a voice mixer, when participants at multiple sites are speaking, participants at the other sites can hear all of them. On the other hand, the speech mixer has to decompress, mix, and then recompress the received speech. This three-step process degrades the quality of the original voice, and this effect can be particularly undesirable if multiple conference servers are cascaded to serve a single meeting. Additionally, the signal processing adds delay to voice propagation. Furthermore, once audio signals are mixed they cannot be unmixed, which limits the topologies available to a distributed conference server.
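The three-step decompress/mix/recompress path of a conventional mixer can be sketched as follows. This is a minimal Python sketch; the "codec" here is a trivial stand-in, since the patent does not specify any particular compression scheme:

```python
def decode(packet):
    # Stand-in decoder: the real codec is unspecified in the patent.
    return list(packet)

def encode(samples):
    # Stand-in encoder: this is the lossy recompression step that
    # degrades quality each time a mixer is traversed.
    return bytes(s & 0xFF for s in samples)

def mix_for_site(receiver, compressed_by_site, selected_sites):
    """For one receiving site, sum the decoded audio of the other
    selected sites sample-wise, then re-encode the composite."""
    streams = [decode(compressed_by_site[s])
               for s in selected_sites if s != receiver]
    mixed = [sum(column) for column in zip(*streams)]
    return encode(mixed)

out = mix_for_site("C",
                   {"A": b"\x10\x20", "B": b"\x01\x02", "D": b"\x05\x05"},
                   selected_sites=["A", "B"])
```

Because every output passes through `encode`, a mixer needs one encoder per receiving site in the conventional design, which is exactly the cost the invention later avoids.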
JP-A-4-84553 JP-A-2-92137 JP-A-4-369153
For example, in a transatlantic conference it is advantageous to minimize the amount of traffic actually traveling across the ocean; minimizing the amount of such traffic is the problem to be solved.
The invention relates to a method for distributing an audio signal from one of a plurality of audio
sources to an output.
A feature of the invention comprises unicasting one voice signal from each of a plurality of voice sources to a plurality of first-level voice bridge servers; unicasting at least one voice signal from each of the first-level bridge servers to at least one second-level voice-signal bridge server; and unicasting an output signal from each second-level bridge server first to at least one of said first-level bridge servers, and then on to an audio source.
In one variant, the last unicasting step is replaced by multicasting the selected voice signal, as output from said second-level bridge server, to at least one of the first-level bridge servers and the audio sources.
In another aspect, the invention relates to a system for distributing an audio signal from one of a plurality of audio sources to an output. The system comprises a first transmitter that unicasts a voice signal from each of a plurality of voice sources to a plurality of first-level voice bridge servers; a second transmitter that unicasts at least one voice signal from each of the first-level bridge servers to at least one second-level voice-signal bridge server; a third transmitter that unicasts an output signal from each of the second-level bridge servers to the first-level bridge servers; and a fourth transmitter that unicasts an output signal from the first-level bridge servers to the audio sources. In another aspect of the invention, the third transmitter multicasts selected audio signals from the second-level bridge server to at least one of the first-level bridge servers and the audio sources.
These and other features of the present invention, as well as the invention itself, will be more
readily apparent from the following detailed description considered in conjunction with the
accompanying drawings.
Referring now to FIG. 1, there is shown the audio portion of an audio conferencing system or audio / video conferencing system 10 in which a plurality of sites, here four sites, for example site "A", site "B", site "C" and site "D", are connected to one another through one server or bridge 12. Included within the bridge 12 are a plurality of speech processing units 14a, 14b, 14c and 14d (FIG. 2), coupled to corresponding ones of the sites "A", "B", "C" and "D", for example through an RTP / RTCP transport circuit.
The voice data received from and sent to the various remote sites A, B, C and D by the bridge 12 is compressed voice, typically compressed voice packets. The bridge 12 operates to selectively transfer and / or mix audio from the various sites so that each site can participate in the conference. Although four sites are shown in this illustrated embodiment, more or fewer sites can also be handled. Furthermore, although the bridge 12 in the illustrated embodiment of the present invention operates primarily in software, the bridge can also be implemented in hardware, software, or a combination of both (although, naturally, this does not apply to any analog operation described below). Although the sites are illustrated as directly connected, they can be connected to the bridge in a number of different ways, such as through the public switched telephone network, wirelessly, by direct connection, or through any other of various communication paths including, for example, a local area network, or any combination of these.
Referring now to FIG. 2, an audio processing device that can be used with the present invention will be described. In a first particular embodiment, each of the speech processing devices 14a-14d is structurally identical. One example, audio processing device 14c, is shown here in detail. In this embodiment of the present invention, each of the voice processing devices 14a-14d distributes to its connected site a voice signal from one of the other sites. Thus, considering site "C", the corresponding audio processing device 14c distributes audio signals from one of sites "A", "B" and "D" to site "C".
More specifically, each of the sites "A" to "D" sends and receives time-compressed voice packets,
here for example through RTP / RTCP transport. However, it should also be understood that
other transports can also be used and that the audio signal to and from the site need not be
packet based. The time-compressed audio signals from the audio sources at sites "A"-"D" are
provided to audio processing section 15 of bridge 12 on lines 16a-16d, respectively, as shown.
The time-compressed speech signals on lines 16a-16d are likewise passed to time expanders or
decoders 18a-18d, respectively, as shown. The decoders 18a-18d produce decompressed or
uncompressed audio signals on lines 19a-19d, respectively. Both time-compressed and
uncompressed audio signals on lines 16a-16d and 19a-19d, respectively, are provided to audio
processor section 15 as shown. The audio signals supplied from the bridge 12 to sites "A", "B", "C" and "D" are supplied to those sites as time-compressed audio signals on lines 20a-20d, respectively, from the audio processing section 15, as shown.
In more detail, referring to the example voice processing device 14c: the voice processing device 14c includes a switch 22 for receiving the compressed voice signals on lines 16a, 16b and 16d from the plurality of sites "A", "B" and "D", respectively. Switch 22 selectively couples one of the plurality of compressed audio signals on lines 16a, 16b and 16d to site "C" in accordance with, and based on, the control signal on line 24. The selector 26 receives the uncompressed audio signals on lines 19a, 19b and 19d from sites "A", "B" and "D", respectively. Selector 26 includes a speech probability detector; it determines which one of the sites "A", "B" or "D" has the highest speech probability, and generates the corresponding control signal on line 24. The one of the sites "A", "B" or "D" with the highest speech probability is thus coupled to site "C". In applications where site "C" may receive multiple audio streams, selector 26 can be modified appropriately to select more than one of the sites "A", "B" or "D" for coupling to site "C". It should be noted that, as an equivalent alternative to the selector 26, it is possible to use an acoustic energy detector or other device.
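The division of labor between selector 26 and switch 22 can be sketched as follows: audio is decoded only to drive the selection decision, while the stream that is actually forwarded remains compressed. This is an illustrative Python sketch; the energy measure stands in for the speech-probability detector (the patent allows an acoustic energy detector as an equivalent), and the data values are invented for illustration:

```python
def acoustic_energy(samples):
    # Simple energy measure standing in for the speech-probability
    # detector in selector 26.
    return sum(s * s for s in samples)

def select_and_switch(receiver, compressed, decoded):
    """Pick the other site with the most acoustic energy (selector 26)
    and forward its still-compressed stream unchanged (switch 22)."""
    candidates = [site for site in decoded if site != receiver]
    chosen = max(candidates, key=lambda site: acoustic_energy(decoded[site]))
    return chosen, compressed[chosen]

site, packet = select_and_switch(
    "C",
    compressed={"A": b"a", "B": b"b", "D": b"d"},
    decoded={"A": [1, 1], "B": [9, 9], "C": [0], "D": [2, 2]},
)
```

The forwarded packet is the compressed input itself, so no transcoding loss is introduced even though decoding was performed for the decision.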
The computational efficiency of the speech processing devices 14a-14d lies between that of a pure speech switch and that of a speech mixer. The audio processing devices 14a-14d decode all the audio signals they receive, but do not perform any mixing or re-compression. The decoded, uncompressed speech is used only to provide drive information to the selector 26, which, using the speech detector 22a in the selector 26, operates the switch 22 to enable the distribution of the compressed speech signal. The delays in the processing devices 14a-14d likewise lie between the delays of a pure voice switch and those of a voice mixer.
Furthermore, the audio quality of the original audio signal stream is not degraded, since the audio processing devices 14a-14d switch the time-compressed audio signals without processing the compressed signals (i.e. without expanding them). This holds even though the control signal to the selector 26 is derived from uncompressed speech. If the site connected to the bridge is a conventional endpoint that can receive one audio stream and play it out of its speakers, the site is provided with improved meeting behavior compared to a pure voice switch. The selector operates in such a way that the loudest voice is heard by everyone at all of the sites "A"-"D", except that the site with the loudest voice hears the second loudest voice. (Note that when the selector 26 at the loudest-voice site selects its loudest input, that input is the "second loudest" input to the bridge.) When two participants at two different sites are holding a discussion, they each hear the other. If they do not interrupt each other, everyone hears the entire conversation. It should be noted that this behavior does not depend on any special features such as detectors at the endpoint site or "push to talk" buttons.
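The loudest-speaker routing rule just described can be stated compactly in code. This is an illustrative sketch; the energy values are invented, and ties are broken by dictionary order, a detail the patent does not address:

```python
def routing_by_loudness(energy_by_site):
    """Everyone hears the loudest site, except the loudest site itself,
    which hears the second loudest (no site is echoed its own audio)."""
    ranked = sorted(energy_by_site, key=energy_by_site.get, reverse=True)
    loudest, second = ranked[0], ranked[1]
    return {site: (second if site == loudest else loudest)
            for site in energy_by_site}

routes = routing_by_loudness({"A": 5.0, "B": 9.0, "C": 1.0, "D": 0.5})
```

With these values, site "B" (the loudest) hears "A" (the second loudest), and "A", "C" and "D" all hear "B", matching the conversational behavior described above.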
Some endpoint sites can accept multiple audio streams and do their own local mixing. With this feature at the endpoint site, quality of service can be improved over conventional voice mixers. First of all, if mixing is done at the endpoint site, no additional encoding (mixing) steps are required in the audio processing unit 12. Thus, delay and signal degradation are reduced. However, additional switches 22 and more complex selectors may be required. Moreover, endpoint sites can typically receive (and mix) only a relatively small number of audio streams. This constraint limits the ability of the conference bridge to use the multi-stream capability of such endpoints, and in large conferences the endpoint is quickly overloaded.
The speech processing devices 14a-14d can provide a solution to this problem. Specifically, the selector 26 and switch(es) 22 output the loudest speakers (two or more) in separate streams to the local-mixing endpoint site. The system can limit the number of streams it outputs to the number that a given site can receive. By automatically selecting the loudest speakers for the endpoint site, the processing devices 14a-14d can provide the highest possible signal quality while avoiding overloading the endpoint site.
In another aspect of the voice processing device, the loudest streams may be multicast to multiple sites. That is, it is possible for a single voice processing section 15 to receive the voice from all sites and to control a switch whose output is multicast to all sites. In this mode, each endpoint site must automatically ignore its own transmitted stream. Similarly, the conference bridge 12 must ensure that the total number of multicast streams does not exceed the capabilities of any site that is instructed to accept them.
Thus, when the voice processing devices 14a-14d are in communication with endpoint sites, there are at least three useful applications for them. First, each endpoint site unicasts its audio stream to the conference bridge 12. The bridge 12 uses the selector 26 to select one or more streams at the switches 22a, ..., 22n to be unicast back to each endpoint site (see FIG. 2A). Second, each endpoint site unicasts its voice stream(s) to the conference bridge 12. The bridge 12 uses the selector 26 to select one or more streams at the switches 22a, ..., 22n, and multicasts the selected streams to all sites (see FIG. 2B). Third, each endpoint site multicasts its voice stream(s). These can be received by the other connected sites as well as by the bridge 12. The bridge selects one or more streams and multicasts them on different multicast addresses. A combination of these approaches is of course possible, for example in situations where only some of the sites can receive multicast transmissions.
An embodiment of a distributed conferencing system with multiple bridges according to the present invention will be described below. Another application of the speech processing devices 14a-14d is in supporting distributed conferencing. Distributed (or cascaded) conferencing is performed on multiple bridges (see FIG. 2C). Sometimes this is done because the conference is large. In other situations, the conference is distributed to optimize bandwidth utilization. For example, in a transatlantic conference it is advantageous to minimize the amount of traffic that actually travels across the ocean. This goal is achieved by using two conference bridges 12a and 12b (e.g. one in Europe, one in North America). Each bridge acts as one site for the other bridge. The ability of the speech processing devices 14a-14d to select, from all of the streams, those audio streams that contain speech, without degrading the speech, is very useful in distributed conferencing. For example, referring to FIG. 2D, each endpoint site 90 in the conference unicasts its voice to its bridge 92. Each bridge can then multicast the "active" voice streams to the multicast group 96. The second-level bridge 98 can then re-select the active voice streams from the first-level bridges to further reduce the number of streams. Higher levels in the bridge hierarchy can be added if needed. The endpoint sites can simply accept the multicast voice 100 from the top-level bridge, here the bridge 98. Other topologies are possible as well (e.g., using only unicast transmission as illustrated in FIG. 2E).
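The bandwidth argument above can be made concrete with a simple count. This is an illustrative Python sketch; the site counts and the three-active-stream limit are assumptions chosen for the example, not figures from the patent:

```python
def transoceanic_streams(local_sites, active_streams):
    """Voice streams that must cross the ocean, per direction.

    Without a regional first-level bridge, every local site's stream
    must be carried to the remote bridge. With a first-level bridge
    that selects only the streams containing speech, at most
    active_streams cross the ocean, however many sites there are.
    """
    without_bridge = local_sites
    with_bridge = min(active_streams, local_sites)
    return without_bridge, with_bridge

# A hypothetical 50-site regional conference forwarding 3 active streams.
naive, bridged = transoceanic_streams(local_sites=50, active_streams=3)
```

The cross-ocean traffic thus stops growing with the number of local sites once a first-level bridge performs the selection, which is the point of the cascaded topology of FIG. 2D.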
Referring now to FIG. 3, there is shown an alternative speech processing device 14'C that can replace the speech processing device of FIG. 2. The audio processing device 14'C operates as a mixer and is adapted to operate, for example, with an endpoint site that can receive only one audio stream. This also applies in situations where bandwidth is so constrained that only one audio stream can be transmitted to an endpoint site, even if the endpoint site could receive multiple audio streams.
First, consider how compressed speech is produced for, for example, site "C". The speech processing device 14'C includes, as before, a switch 22' connected to the compressed audio signals on lines 16a, 16b and 16d (FIG. 1) from sites "A", "B" and "D", respectively. The switch 22' selectively couples one of the plurality of compressed audio signals on lines 16a, 16b and 16d to the input 32 of the selector (or switch) 34 in accordance with the control signal on line 24'. A selector 26' is connected to the uncompressed audio signals on lines 19a, 19b and 19d, decoded from the compressed signals from sites "A", "B" and "D", respectively. Selector 26' includes a speech probability detector and determines which one of the sites "A", "B" or "D" has the highest (or loudest) speech probability, generating a control signal on line 24'. The one of the sites "A", "B" or "D" with the highest (or loudest) speech probability is thus coupled to the selector 34 at input 32.
The uncompressed audio signals on lines 19a, 19b and 19d from sites "A", "B" and "D" are likewise provided to an audio mixer 28 to generate an uncompressed composite audio signal on line 35. The mixed, uncompressed audio signal is provided to a time-compression encoder 29 to produce a corresponding compressed composite audio signal on line 31. The compressed composite speech signal generated by the encoder 29 is supplied to the other one of the pair of inputs 30, 32 of the selector 34, here the input 30. As described above, the output of the switch 22' is supplied to the input 32 of the selector 34. Thus, the selector 34 receives at one input 32 the one of the compressed speech signals from sites "A", "B" and "D" containing the most probable (or loudest) speaker, while its second input 30 receives the time-compressed composite (mixed) audio signal generated by the encoder 29.
In addition to determining the speech probability at each of the sites "A", "B" and "D", the selector 26' likewise determines whether more than one person is speaking at those sites. If multiple people are talking at the same time (a double-talk, triple-talk, etc. condition), a logic "1" signal is provided on line 36. Otherwise, the selector 26' generates a logic "0" signal. Line 36 is supplied to the selector 34 as well as to the enable terminals (EN) of the mixer 28 and the encoder 29. If the logic signal on line 36 is a logic "1", indicating that more than one person is speaking at the same time, the mixer 28 and encoder 29 are enabled and the selector 34 connects the time-compressed composite audio signal generated by the encoder 29 to site "C" on line 20c. Otherwise, if only one person is speaking, i.e. the logic signal on line 36 is a logic "0", the mixer 28 and encoder 29 are not enabled, and the selector 34 couples the selected one of the compressed speech signals on lines 16a, 16b and 16d, the one with the highest speech probability, to site "C" on line 20c.
Thus, when the voice processing device 14'C is used, the uncompressed speech used for speech detection in the selector 26' is, when more than one person is speaking at a time at different connected sites (i.e. during "double talk", for example), selectively mixed in the mixer 28 (as described below) and then encoded, or time-compressed, in the encoder 29. This mixed, compressed composite speech is transmitted through the selector 34 to the endpoint site, here site "C". When "double talk" is not occurring, the mixer 28 and the encoder 29 are not required and thus not activated; when they are implemented in software, this saves a great deal of computational resources in the bridge. Furthermore, there is no transcoding loss at all except when the audio processing device is in mixing mode.
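The conditional path through selector 34 can be sketched as follows. This is an illustrative Python sketch; the function names and the stub mixer are hypothetical, standing in for the mixer 28 / encoder 29 chain and the compressed streams on lines 16a, 16b and 16d:

```python
def output_for_site(talking_sites, compressed_by_site, mix_and_encode,
                    loudest_site):
    """Selector-34 logic: during double talk, enable the mixer/encoder
    and send the compressed composite; otherwise pass the loudest
    site's already-compressed stream untouched (no transcoding loss)."""
    double_talk = len(talking_sites) > 1          # logic "1" on line 36
    if double_talk:
        return mix_and_encode(talking_sites)      # mixer 28 + encoder 29
    return compressed_by_site[loudest_site]       # switch 22' path

mix_calls = []
def fake_mix(sites):
    # Stub for the mixer/encoder; records that it was invoked.
    mix_calls.append(sorted(sites))
    return b"mixed-composite"

single = output_for_site({"A"}, {"A": b"a-comp"}, fake_mix, "A")
double = output_for_site({"A", "B"}, {"A": b"a-comp", "B": b"b-comp"},
                         fake_mix, "A")
```

Note that `fake_mix` runs only on the double-talk call, mirroring how the encoder stays idle, and costs nothing, whenever a single speaker holds the floor.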
If the selector 26' determines that more than one speaker is present, it signals the mixer 28 on line 37, indicating on which input lines the speakers' signals can be found. The mixer 28 can mix two, three or more inputs to produce its mixed output on line 35, depending on its configuration. The mixing level will depend on the bridge configuration, including in particular the number of connected sites and the desirability of listening to two or more speakers at the same time. The speakers selected will also depend on some minimum threshold level of speech, and will typically be selected from among the sites exceeding this level. Alternatively, the two or three loudest speakers can be selected.
It should be noted that the compressed audio signals on lines 16a, 16b and 16d can all be provided to aligners 40a, 40b and 40d, respectively, prior to the switch 22' and the selector 26'. Typically, there is only one aligner per site. The purpose of the aligner is to equalize any delay between the incoming streams from sites "A", "B", "C" and "D" (in this embodiment). In many cases the use of an aligner is optional, as the audio streams are already synchronized.
It should be understood that only one mixer/encoder is required per conference. In this configuration, for example, during double talk, the loudest speaker hears the second loudest speaker, the second loudest speaker hears the loudest speaker, and everyone else hears the mix of the two loudest speakers. The system can be expanded to mix more than two of the loudest speakers. In these cases, multiple encoders are required. For example, if three speakers are to be included in the mix, four encoders are needed (regardless of the number of connected sites). The loudest speaker then hears the mix of the second and third loudest speakers, the second loudest speaker hears the mix of the loudest and third loudest speakers, and the third loudest speaker hears the mix of the two loudest speakers. Everyone else hears the mix of all three loudest speakers. It should be noted that the computational requirements of supporting the larger number of encoders are not severe, as "triple talk" is even more rare than "double talk".
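The encoder count follows directly from the who-hears-what rule: with N simultaneous speakers there are N distinct "everyone but me" mixes plus one all-speakers mix for the non-speakers, i.e. N+1 encoders. A small sketch makes this explicit (speaker labels are hypothetical):

```python
def mixes_needed(speakers):
    """Enumerate the distinct mixes, and hence encoders, needed when
    the given speakers talk simultaneously: each speaker hears the
    other N-1 speakers, and everyone else hears all N speakers."""
    mixes = [tuple(s for s in speakers if s != me) for me in speakers]
    mixes.append(tuple(speakers))  # the mix heard by the non-speakers
    return mixes

m = mixes_needed(["S1", "S2", "S3"])
```

For three speakers this yields four mixes, matching the four encoders stated above, independent of how many sites are connected.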
In a typical implementation of the bridge 12 according to the invention, in which up to three speakers can be mixed together, the bridge is connected, as before, to four sites: site A, site B, site C and site D. Compressed audio from each site is passed through RTP/RTCP transport to the respective aligners 40a, 40b, 40c and 40d and then to the respective decoders 18a, 18b, 18c and 18d. The uncompressed outputs of the decoders are sent to each of the four mixers 28a, 28b, 28c and 28d, and also to the selector 26'. The selector controls each of the mixers on line 37, and each mixer, when activated, produces a respective output signal for sites A, B, C and D, which is passed to the respective encoder 29a, 29b, 29c or 29d to produce a mixed output. The outputs of the encoders are directed to a crosspoint switch 100. The crosspoint switch also receives the compressed inputs from the sites, output by the aligners on lines 16a, 16b, 16c and 16d, while the uncompressed signals on lines 19a, 19b, 19c and 19d go to the selector. The output of the selector, determined by the speech detected at the sites, controls the crosspoint switch to select either the compressed voice on lines 16 or the outputs of the encoders 29 for presentation to the various sites on the lines labeled A, B, C and D. In this manner, a mix of up to three speakers can be provided using four mixers and four encoders, but only one selector and one crosspoint switch. In a preferred embodiment of the invention, the mixers, encoders, selector and aligners are all implemented in software. Thus, not only is the speech quality enhanced, but the computational savings are considerable whenever no mixer is required and no separate transcoding is performed.
The following table illustrates the results for various interruption rates. The table assumes that each interruption lasts 2 seconds and that there are 5 possible speakers. Even with a high interruption probability, the average encoder load per conference is very low, here less than one encoder per conference. It should be noted that a conventional speech mixer requires one encoder per endpoint site.
[Table not reproduced in this translation.]
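Since the table itself is not reproduced, the flavor of the calculation can be sketched as a duty-cycle estimate: an encoder is busy only while an interruption is in progress. This is an illustrative sketch under stated assumptions (the 2-second interruption length comes from the text; the interruption rate below is invented, and the patent's actual figures may differ):

```python
def average_encoder_load(interruptions_per_minute, interruption_seconds=2.0):
    """Rough average number of busy encoders per conference: the
    fraction of time a double-talk interruption is in progress,
    capped at one encoder continuously busy."""
    duty_cycle = interruptions_per_minute * interruption_seconds / 60.0
    return min(duty_cycle, 1.0)

# Hypothetical rate: one 2-second interruption every 10 seconds.
load = average_encoder_load(interruptions_per_minute=6)
```

Even at this fairly aggressive interruption rate the average load is well below one encoder per conference, consistent with the claim above, whereas a conventional mixer would hold one encoder per endpoint site busy at all times.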
The system shown in FIG. 2 can be used with endpoint sites that receive only one audio stream, as described above. The endpoint site gets all of the benefits of full audio mixing, while the computational load on the bridge 12 is much lower. Furthermore, the quality of speech during uninterrupted speech is superior to that of speech mixers, since no extra transcoding is performed during this time. This is particularly advantageous in the bridge-to-bridge connection of FIG. 2C. When conventional mixers are used, transcoding losses limit the number of servers that can be cascaded; three bridges is the standard recommended limit. For voice processing devices such as the processing device 14'C, transcoding loss occurs only during interruptions. Under such circumstances it is usually difficult to follow the meeting anyway, so the additional transcoding loss is of less concern. Thus, the present invention increases the number of bridges that can be cascaded.
Other features are within the spirit and scope of the appended claims.
FIG. 1 is a block diagram of a bridged audio / video conferencing system.
FIG. 2 is a block diagram of a conference system having an audio processing device usable in the present invention.
FIG. 2A is a block diagram of one particular configuration of an audio processing device.
FIG. 2B is a block diagram of a second preferred configuration of the speech processing device.
FIG. 2C is a schematic block diagram of a cascaded bridge connection according to the invention.
FIG. 2D is a schematic block diagram of a multilevel bridge topology according to the invention.
FIG. 2E is a schematic block diagram of a multilevel bridge topology according to the invention.
FIG. 3 is a block diagram of an alternative embodiment of a speech processing device adapted for use in the conferencing system of FIG. 1.
FIG. 4 is a block diagram of a particular alternative bridge configuration for mixing up to three speakers.
description, jp2004140850