Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007151038
To remove echoes effectively with a small amount of calculation, a band selection unit holds the signal components of the selected frequency bands in the sound emission signal band-divided by the sound emission signal dividing means 1, and removes the components of the bands not selected. From the sound collection signal band-divided by the sound collection signal dividing means 2, on the other hand, the components of the bands selected for the sound emission signal are removed, and the components not selected are held. The sound emission signal synthesis means 4 synthesizes the sound emission signal from which the components of the predetermined bands have been removed, and the sound collection signal synthesis means 5 synthesizes the sound collection signal, which holds the components of the bands where the sound emission signal was removed and lacks the components of the bands where the sound emission signal was held. As a result, the frequency components of the sound emission signal and the sound collection signal do not overlap. [Selected figure] Figure 1
Voice processing device
[0001]
The present invention relates to a voice processing apparatus, and more particularly to a voice processing apparatus that executes voice processing when a full-duplex call is performed in a voice communication system including a speaker and a microphone.
[0002]
08-05-2019
1
In a speech communication system in which each device is equipped with a speaker and a microphone, such as a hands-free telephone or a video conference system, the voice picked up by the microphone of the far-end device is sent to the near-end device and emitted from its speaker. Conversely, the voice of the near-end talker picked up by the microphone of the near-end device is sent to the far-end device and emitted from the far-end speaker. The other party's voice emitted from the speaker at each end therefore reaches the local microphone. If no processing is performed, this voice is sent back to the other device, causing a phenomenon called "echo," in which the user hears his or her own voice from the speaker with a slight delay. When the echo grows large, it re-enters the microphone, the system loops, and howling occurs.
[0003]
Conventionally, in such a loudspeaker communication system, an echo canceller is incorporated as a voice processing device for preventing echo and howling. A general echo canceller measures the impulse response between the speaker and the microphone using an adaptive filter, generates a pseudo echo by convolving this impulse response with the reference signal emitted from the speaker, and removes the echo component by subtracting the pseudo echo from the signal picked up by the microphone.
[0004]
However, because the impulse response between the speaker and the microphone changes with the temperature of the room or the movement of people, a full-band echo canceller configured with fixed coefficients cannot cancel the echo sufficiently. Against this background, a band-division echo canceller has been proposed that divides the audio signal into a plurality of bands and performs echo cancellation in each band (see, for example, Patent Document 1). [Patent Document 1] JP-A-59-64932
[0005]
However, it is difficult for a voice processing apparatus that removes echo with a conventional adaptive filter to generate an accurate pseudo echo, so the amount of calculation needed to eliminate the echo sufficiently becomes enormous.
[0006]
As described above, the impulse response between the speaker and the microphone changes even when a person in the room merely moves, altering the sound reflections, and the adaptive filter needs a certain amount of time to follow such a change and converge again. Moreover, by the very principle of the adaptive filter, it cannot adapt to frequency components that are not contained in the sound emitted from the speaker; convergence is fast for sound containing all frequencies, such as white noise, but when a human voice is emitted from the speaker, as in a video conference, convergence is known to take considerable time. Because an accurate pseudo echo cannot be generated during the interval from a system change until the adaptive filter reconverges, residual echo or howling results. The band-division type has the same problem, since it also uses adaptive filters.
[0007]
In addition, the computational load of an adaptive filter is generally larger than that of a fast Fourier transform (FFT) or a filter bank, which is a problem when it is used in a low-cost system. In particular, when applied to audio signal processing in a large space such as a gymnasium, the distance from the speaker to the microphone is long and the reverberation time is long, so the adaptive filter is known to require a long tap length. In that case the amount of calculation increases further and the burden becomes heavy.
[0008]
The present invention has been made in view of these points, and its object is to provide a voice processing device that reduces the amount of calculation required for echo removal, performs echo processing effectively, and eliminates the waiting time from a system fluctuation to reconvergence.
[0009]
To solve the above problems, the present invention provides, in a voice processing apparatus that executes voice processing for full-duplex calls in a voice communication system provided with a speaker and a microphone, a speech processing apparatus comprising sound emission signal dividing means, sound collection signal dividing means, band selection means, sound emission signal synthesis means, and sound collection signal synthesis means.
The sound emission signal dividing means divides the sound emission signal, acquired from the other device and to be output from the speaker, into a plurality of frequency bands. The sound collection signal dividing means divides the sound collection signal input from the microphone into the same frequency bands. The band selection means partitions a predetermined frequency band range covering the plurality of frequency bands into bands in which the sound emission signal is selected and bands in which the sound collection signal is selected, and removes, in each band, the signal component of whichever signal is not selected there. The sound emission signal synthesis means synthesizes the band-divided sound emission signal from which the components of the bands not selected by the band selection means have been removed. Similarly, the sound collection signal synthesis means synthesizes the band-divided sound collection signal from which the components of the bands not selected by the band selection means have been removed.
[0010]
According to such a voice processing apparatus, the band selection means partitions the predetermined frequency band range covering the plurality of frequency bands into bands for the sound emission signal and bands for the sound collection signal. In each band belonging to this range, either the sound emission signal or the sound collection signal is selected; the component of the selected signal is held and the component of the other is removed. The sound emission signal dividing means divides the sound emission signal, acquired from the other device and to be output from the speaker, into a plurality of frequency bands and outputs them to the band selection means. For the band-divided sound emission signal, the band selection means holds the components of the selected bands, removes the components of the non-selected bands, and outputs the result to the sound emission signal synthesis means. The synthesis means then synthesizes the sound emission signal from which the non-selected components have been removed and outputs it to the speaker, so the speaker emits an audio signal (sound emission signal) lacking those band components. Meanwhile, the sound collection signal dividing means divides the sound collection signal input from the microphone into the same frequency bands as the sound emission signal and outputs them to the band selection means. For the sound collection signal, the band selection means holds the components of the bands in which the sound emission signal is not selected, removes the components of the bands in which the sound emission signal is selected, and outputs the result to the sound collection signal synthesis means. The sound collection signal synthesis means then synthesizes the sound collection signal from which those components have been removed. As a result, the sound collection signal sent to the other device no longer contains the band components on which the speaker output picked up by the microphone is superimposed.
[0011]
In the voice processing apparatus according to the present invention, the voice emitted from the speaker and the voice collected by the microphone are divided into a plurality of frequency regions, and in each region where one signal's component is effective, the other signal's component is removed, so the frequency components of the emitted sound signal and the collected sound signal do not overlap. Therefore, although the sound emitted from the speaker is superimposed on the sound collected by the microphone, the superimposed component falls only in the regions from which the collected signal's components are removed. As a result, two-way simultaneous calls can be realized in a speech communication system without echo or howling.
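This complementarity can be checked numerically. Below is a minimal sketch, not the patent's filter bank: plain FFT-bin masks stand in for the band division, and `n_bands`, `band_mask`, and the other names are invented for illustration. It shows that the mask applied to the emitted signal and the mask applied to the collected signal never keep the same frequency bin.

```python
import numpy as np

def band_mask(n_bins, n_bands, keep_odd):
    """Build a 0/1 mask over FFT bins: the spectrum is cut into n_bands
    equal bands, and each band is either kept (1) or removed (0).
    keep_odd=True keeps channels 1, 3, 5, ... (band index b is channel b+1)."""
    mask = np.zeros(n_bins)
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    for b in range(n_bands):
        if (b % 2 == 0) == keep_odd:
            mask[edges[b]:edges[b + 1]] = 1.0
    return mask

n_bins, n_bands = 512, 8
emit_mask = band_mask(n_bins, n_bands, keep_odd=True)     # sound emission path
pickup_mask = band_mask(n_bins, n_bands, keep_odd=False)  # sound collection path

# Complementary masks: no FFT bin is kept by both paths, and every bin
# is kept by exactly one path, so the two spectra can never overlap.
assert np.all(emit_mask * pickup_mask == 0)
assert np.all(emit_mask + pickup_mask == 1)
```

Because the two masks sum to one everywhere, every frequency bin is carried by exactly one of the two paths, which is precisely the non-overlap property the apparatus relies on.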
[0012]
In addition, compared with an adaptive filter, which requires a certain convergence time whenever the system fluctuates and a large amount of calculation, no convergence time is needed and the amount of calculation can be reduced. The effect is therefore maintained even when the microphone or the speaker is moved, or when a talker moves.
[0013]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings. First, the concept of the invention applied to the embodiment will be described, and
then the specific contents of the embodiment will be described. FIG. 1 is a conceptual view of the
invention applied to the embodiment.
[0014]
The sound processing apparatus according to the present invention comprises sound emission signal dividing means 1 for dividing the sound emission signal into a plurality of frequency bands; sound collection signal dividing means 2 for dividing the sound collection signal into a plurality of frequency bands; band selection means 3 for selecting, among the plurality of frequency bands, the bands to be used so that the bands containing components of the divided sound emission signal and those containing components of the sound collection signal do not overlap; sound emission signal synthesis means 4 for synthesizing the band-selected sound emission signal; and sound collection signal synthesis means 5 for synthesizing the band-selected sound collection signal.
[0015]
The sound emission signal dividing means 1 receives the sound emission signal acquired from the other device and to be output from the speaker, that is, the audio signal collected by the other device's microphone, and divides it into a plurality of frequency bands by a time-to-frequency transform such as the Fourier transform or by multi-rate signal processing using a filter bank. The divided sound emission signal is output to the band selection means 3.
[0016]
When the sound collection signal dividing means 2 receives the sound collection signal input from the microphone, that is, an audio signal on which the sound emission signal output from the speaker is superimposed, it divides this signal into a plurality of frequency bands by the same method as the sound emission signal dividing means 1. The divided sound collection signal is output to the band selection means 3.
[0017]
The band selection means 3 receives the sound emission signal and the sound collection signal, each band-divided into a plurality of frequency bands, and determines, within a predetermined frequency band range, the bands in which the sound emission signal is selected and the bands in which the sound collection signal is selected. The predetermined range may be the entire frequency band or only part of it. In a band where the sound emission signal is selected, the component of the sound emission signal in that band is held and the component of the sound collection signal in that band is removed. Conversely, in a band where the sound collection signal is selected, the component of the sound emission signal in that band is removed and the component of the sound collection signal is held. As a result, the bands holding components of the audio signal output to the speaker (sound emission signal) and those holding components of the audio signal input from the microphone (sound collection signal) do not overlap. Therefore, even though the sound emission signal is superimposed on the sound collection signal input from the microphone, it can be removed. The component of a removed band may also be interpolated from the components of the bands that are not removed. The selection of the frequency band range and of the individual bands is described in detail in the embodiment.
[0018]
The sound emission signal synthesis means 4 synthesizes the sound emission signal that has been band-divided by the sound emission signal dividing means 1 and from which the band selection means 3 has removed the components of the predetermined bands, and outputs it to the speaker.
[0019]
The sound collection signal synthesis means 5 synthesizes the sound collection signal that has been band-divided by the sound collection signal dividing means 2, holding the components of the bands from which the sound emission signal was removed by the band selection means 3 and lacking the components of the bands in which the sound emission signal was held. The synthesized sound collection signal is sent to the other device, where it serves as the sound emission signal output from that device's speaker.
[0020]
In a voice processing apparatus of this configuration, when an audio signal collected by the device at the other end of a full-duplex call is received, voice processing is performed before the signal is emitted from the speaker. The other party's collected signal, acquired from the other device, is used as the sound emission signal and is divided into a plurality of frequency bands by the sound emission signal dividing means 1. After the band selection means 3 removes the components of the predetermined bands, the band-divided signal is synthesized by the sound emission signal synthesis means 4 and emitted from the speaker. The audio signal emitted from the speaker therefore lacks the components of the predetermined bands.
[0021]
The audio signal emitted from the speaker enters the voice processing device through the microphone together with the voice of the talker; on the collected signal input from the microphone, the speaker output is superimposed on the talker's voice. The sound collection signal dividing means 2 divides this collected signal into the same plurality of frequency bands as the sound emission signal dividing means 1 and outputs them to the band selection means 3. In the processing of the band selection means 3, the collected signal's components in the bands selected for the sound emission signal are removed, while its components in the bands not selected for the sound emission signal, that is, the bands selected for the collected signal, are not removed. The components of the bands on which the sound emission signal is superimposed are thus removed from the band-divided collected signal. The sound collection signal synthesis means 5 then synthesizes the collected signal and transmits it to the destination device. Because the collected signal sent to the other device has had the speaker-derived component, that is, the echo component, removed, howling is also prevented.
[0022]
As described above, the voice processing apparatus of the present invention realizes two-way simultaneous calls while suppressing echo and howling. Since echo and howling are suppressed by keeping the frequency components of the audio signal output to the speaker (sound emission signal) and the audio signal input from the microphone (sound collection signal) from overlapping, the apparatus also has the advantages, compared with an adaptive filter, of a smaller amount of calculation, immediate tolerance of system fluctuations, and no convergence time.
[0023]
Hereinafter, the embodiment will be described in detail with reference to the drawings, taking as an example its application to the audio processing unit of a video conference system. FIG. 2 shows the configuration of the video conference system according to the embodiment of the present invention. Image-related processing units not relevant to the description of the present invention are omitted from the figure.
[0024]
In the video conference system of the present embodiment, a conference terminal 10a, to which a speaker 21a and a microphone 22a are connected, and a conference terminal 10b, to which a speaker 21b and a microphone 22b are connected, are linked by a communication line 23. In the following, the conference terminal 10a placed near a given talker is called the near-end device 10a, and the conference terminal 10b located far from that talker is called the far-end device 10b. The near-end device 10a and the far-end device 10b have the same configuration, and the internal block diagram of the far-end device 10b is omitted from the figure. The communication line 23 is a general digital communication line such as Ethernet (registered trademark).
[0025]
The speaker 21a connected to the near-end device 10a emits the sound that was collected by the microphone 22b of the far-end device 10b and processed by the near-end device 10a. The microphone 22a connected to the near-end device 10a picks up the voice of the conference attendee at the near end; at the same time, the sound emitted from the speaker 21a arrives through the air and is collected superimposed on it. The same applies to the far-end device 10b.
[0026]
The internal configurations of the near-end device 10a and the far-end device 10b are described below, taking the near-end device 10a as an example. The near-end device 10a comprises a D/A converter 11 connected to the speaker 21a, an A/D converter 12 connected to the microphone 22a, a signal processing unit 13 that processes the audio signal, a voice codec 14 that encodes and decodes the audio signal, and a communication unit 15 connected to the communication line 23.
[0027]
The D/A converter 11 converts the digital audio data processed by the signal processing unit 13 into an analog signal, which is amplified by an amplifier (not shown) and emitted from the speaker 21a. The A/D converter 12 converts the analog audio signal of the voice collected by the microphone 22a, after amplification by an amplifier (not shown), into digital audio data. The signal processing unit 13 is constituted by a digital signal processor (DSP); it converts the input and output audio data into the desired form and at the same time performs the audio processing that prevents the frequency components of the collected sound and the emitted sound from overlapping. The details of this voice processing are described later. The voice codec 14 encodes the audio data derived from the microphone 22a input and sent from the signal processing unit 13 into the code standardized for teleconference system communication, and decodes the encoded voice data sent from the far-end device 10b via the communication unit 15, passing it to the signal processing unit 13. The communication unit 15 exchanges input/output data, including the encoded voice data, with the far-end device 10b over the communication line 23 using a predetermined digital data communication protocol.
[0028]
Next, the audio processing performed by the signal processing unit 13 is described in detail. First, as a first embodiment, a signal processing unit is described that divides the entire frequency range of the audio signal into bands in which the emitted sound's components are selected and bands in which the collected sound's components are selected.
[0029]
FIG. 3 shows the configuration of the signal processing unit according to the first embodiment of this invention. The signal processing unit 30 is incorporated in the signal processing unit 13 of the conference terminal shown in FIG. 2. The signal processing unit 30 comprises an analysis filter bank 31 that divides the sound emission signal into a plurality of frequency bands, a synthesis filter bank 32 that synthesizes the band-divided sound emission signal, an analysis filter bank 33 that divides the sound collection signal into a plurality of frequency bands, and a synthesis filter bank 34 that synthesizes the band-divided sound collection signal.
[0030]
The analysis filter bank 31 serves as the sound emission signal dividing means, and divides the audio data input from the voice codec 14 into 128 frequency bands from low to high. In the following, the channels are numbered in order from the lowest band, channel 1, to the highest band, channel 128. This band-division processing uses, for example, the DFT filter bank described in "Subband adaptive filter using perfect reconstruction DFT filter bank" by Kazunobu Watanabe et al. (Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J79-A, No. 8, pp. 1385-1393, August 1996). The technique of dividing a signal into bands, downsampling, applying signal processing, and then re-synthesizing is called multi-rate signal processing. Besides the DFT filter bank, various band-division methods, such as the QMF filter bank, are known for different applications; the embodiment describes the case of a DFT filter bank, but band division may be performed by another method, including methods defined by a time-to-frequency transform and its inverse, such as the Fourier transform. A DFT filter bank is divided into an analysis part and a synthesis part, and audio data band-divided by analysis is known to be re-synthesizable into the original audio data by the synthesis filter bank. Although the re-synthesized signal may differ slightly from the original depending on the method, the system can be configured so that this has no substantial influence.
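The analysis/synthesis round trip can be illustrated with a toy stand-in. In this hedged sketch, a plain FFT split replaces the cited DFT filter bank (the real filter bank involves windowing and downsampling, omitted here), and the function names `analyze` and `synthesize` are invented; it only demonstrates the perfect-reconstruction property.

```python
import numpy as np

def analyze(x, n_bands=128):
    """Split the spectrum of x into n_bands contiguous bands
    (a stand-in for the analysis filter bank; no downsampling)."""
    spectrum = np.fft.rfft(x)
    edges = np.linspace(0, len(spectrum), n_bands + 1).astype(int)
    return [spectrum[edges[b]:edges[b + 1]] for b in range(n_bands)]

def synthesize(bands, n):
    """Concatenate the bands and invert (the synthesis side)."""
    return np.fft.irfft(np.concatenate(bands), n)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = synthesize(analyze(x), len(x))

# Analysis followed by synthesis reproduces the original audio data
# up to numerical precision.
assert np.allclose(x, y)
```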
[0031]
The synthesis filter bank 32 serves as the sound emission signal synthesis means. It receives the audio data from which the band selection means (not shown) has removed the components of the even-numbered channels among the 128 channels band-divided by the analysis filter bank 31, and synthesizes the components of the entire band into one audio signal. The audio signal is output to the speaker 21a via the D/A converter 11.
[0032]
The analysis filter bank 33 serves as the sound collection signal dividing means and, like the analysis filter bank 31, divides the audio data input from the A/D converter 12 into 128 frequency bands from low to high. The analysis filter bank 33 has the same configuration as the analysis filter bank 31.
[0033]
The synthesis filter bank 34 serves as the sound collection signal synthesis means. It receives the audio data from which the band selection means (not shown) has removed the components of the odd-numbered channels among the 128 channels band-divided by the analysis filter bank 33, and synthesizes the components of the entire band into one audio signal. The audio signal is transmitted to the other device via the voice codec 14.
[0034]
The band selection means selects the frequency bands (channels) carried by each audio signal so that the frequency components of the audio signal emitted from the speaker 21a and of the audio signal collected by the microphone 22a do not overlap. Here, of the channels numbered 1 to 128 from the low band upward, the odd-numbered channels (1, 3, ..., 127) are selected for the emitted audio signal and the even-numbered channels (2, 4, ..., 128) for the audio signal collected by the microphone. That is, the emitted signal uses the components of the bands corresponding to the odd-numbered channels, and its even-numbered components are removed; the collected signal has its odd-numbered components removed and uses its even-numbered components. By separating the channels in this way, the emitted signal is prevented from remaining superimposed on the collected signal.
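The channel assignment above amounts to partitioning the 128 channels into two disjoint sets. A small illustrative check (the variable names are not from the patent):

```python
# Channels are numbered 1..128 from the lowest band upward.
emission_channels = {ch for ch in range(1, 129) if ch % 2 == 1}  # 1, 3, ..., 127
pickup_channels = {ch for ch in range(1, 129) if ch % 2 == 0}    # 2, 4, ..., 128

assert len(emission_channels) == 64 and len(pickup_channels) == 64
assert emission_channels & pickup_channels == set()              # never overlap
assert emission_channels | pickup_channels == set(range(1, 129)) # cover all bands
```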
[0035]
The audio processing executed by this signal processing unit 30 is described below using flowcharts. First, the processing of the audio signal emitted from the speaker (hereinafter referred to as speaker audio processing) is described. FIG. 4 is a flowchart showing the speaker audio processing procedure of the near-end device incorporating the signal processing unit of the first embodiment. The far-end device 10b performs the voice processing in the same procedure.
[0036]
[Step S01] The communication unit 15 receives the encoded voice data from the far-end device 10b via the communication line 23. [Step S02] The voice codec 14 decodes the voice data received in step S01, producing, for example, 16-bit linear PCM digital audio data sampled at 32 kHz. The digital audio data is sent to the signal processing unit 30, which is implemented on a DSP.
[0037]
[Step S03] The signal processing unit 30 applies the band-division processing of the analysis filter bank 31 to the input audio data. Here, the DFT filter bank divides the audio data received from the other device into 128 frequency channels from the low band to the high band.
[0038]
[Step S04] Among the 128 channels, the components of the channels assigned to the emitted signal (here, the odd-numbered channels) are passed directly to the synthesis filter bank 32, while the components of the channels not assigned to the emitted signal (here, the even-numbered channels) are not output to it. That is, the even-numbered outputs of the 128 channels of the analysis filter bank 31 are output to the synthesis filter bank 32 as 0. The components of the even-numbered frequency bands are thereby removed from the emitted signal.
[0039]
[Step S05] The synthesis filter bank 32 receives the audio data from which the components of the bands not allocated to the emitted signal (here, the even-numbered channels) have been removed, synthesizes the components of all bands into one audio signal, and sends it to the D/A converter 11.
[0040]
[Step S06] The D / A converter 11 converts the sound signal for emission synthesized by the
synthesis filter bank 32 into an analog sound signal, and outputs the analog sound signal to the
speaker 21a.
[Step S07] The analog audio signal obtained from the D / A converter 11 is amplified by an
amplifier and emitted from the speaker 21a.
[0041]
By executing the speaker audio processing procedure described above, the components of the even-numbered bands among the 128 channels are removed from the audio signal received from the other device; the signal is then synthesized into one audio signal and emitted from the speaker 21a.
[0042]
Next, the processing of the audio signal collected from the microphone (hereinafter referred to as microphone audio processing) is described. FIG. 5 is a flowchart showing the microphone audio processing procedure of the near-end device incorporating the signal processing unit of the first embodiment. The far-end device 10b performs the voice processing in the same procedure.
[0043]
[Step S11] The voice of the conference attendee at the near end and the sound emitted from the speaker 21a are collected by the microphone 22a. The collected sound is converted by the A/D converter 12 into 16-bit linear PCM digital audio data sampled at 32 kHz, and the converted data is output to the analysis filter bank 33.
[0044]
[Step S12] The analysis filter bank 33 receives the audio data generated in step S11 and divides it into 128 channels of audio data, in the same way as the analysis filter bank 31 for the sound emission signal.
[0045]
[Step S13] Of the 128 channels, the signal components of the channels assigned to the collected sound signal (here, the even-numbered channels, the reverse of the sound emission signal) are sent directly to the synthesis filter bank 34, while the signal components of the channels assigned to the sound emission signal (here, the odd-numbered channels) are not output to the synthesis filter bank 34.
That is, of the 128 channel outputs of the analysis filter bank 33, the odd-numbered outputs, the opposite of those used for the sound emission signal, are output to the synthesis filter bank 34 as 0. As a result, the signal components of the odd-numbered frequency bands of the collected audio signal are removed.
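The channel removal in step S13 (and, with the parity reversed, the speaker-side step S04) can be sketched as follows. This is an illustrative Python sketch only; the function name and the flat one-value-per-channel array layout are assumptions, not part of the embodiment.

```python
import numpy as np

def zero_opposite_channels(subbands, keep_even=True):
    """Zero the sub-band channels assigned to the opposite signal path.

    subbands: length-128 array, one value per analysis-filter-bank channel
              (1-based channel k lives at index k-1).
    keep_even=True keeps the even-numbered channels (the collected-sound
    side of step S13) and outputs the odd-numbered channels as 0.
    """
    out = subbands.copy()
    for k in range(1, len(subbands) + 1):
        if (k % 2 == 0) != keep_even:
            out[k - 1] = 0.0  # this channel belongs to the other path
    return out
```

Calling the same routine with keep_even=False reproduces the speaker-side removal of the even-numbered channels, so the two paths never share a frequency band.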
[0046]
[Step S14] The synthesis filter bank 34 receives the voice data from which the signal components of the frequency bands not allocated to the collected voice signal (here, the odd-numbered channels used for the sound emission signal) have been removed, synthesizes the signal components of the entire band into one voice signal, and sends it to the voice codec 14.
[0047]
[Step S15] The audio codec 14 encodes the audio signal input from the synthesis filter bank 34
into a predetermined code, and sends the encoded signal to the communication unit 15.
[Step S16] The communication unit 15 transmits the encoded voice data to the far-end device
10b via the communication line 23.
[0048]
By performing the above-described processing procedure, the signal components of the odd-numbered frequency bands, out of the 128 channels into which the audio signal collected by the microphone 22a is divided, are removed. The remaining components are synthesized into one voice signal, encoded, and transmitted by the communication unit 15 to the far-end device 10b.
As described above, the audio signal emitted from the speaker 21a and collected by the
microphone 22a holds only the signal component of the frequency band corresponding to the
odd-numbered channel.
Therefore, by removing the signal components of the odd-numbered channels from the collected audio signal, the signal components derived from the audio signal emitted from the speaker 21a are removed from the collected audio signal. That is, the echo component is removed from the collected audio signal.
[0050]
As described above, in the first embodiment, the frequency components of the sound emitted from the speaker and of the sound collected by the microphone are divided into even- and odd-numbered frequency bands so that they do not overlap. Echo and howling can thereby be suppressed, realizing two-way simultaneous calls.
[0051]
In the above description, the channels are divided into even-numbered and odd-numbered channels; however, any division can be used as long as the frequency components of the microphone and the speaker do not overlap, and there are various ways of dividing them.
[0052]
The channels for microphone sound processing and speaker sound processing may also be divided into groups of two or more frequency bands, instead of alternating one by one between even and odd numbers. For example, the channels may be divided alternately two by two: channels selected by microphone audio processing: 1, 2, 5, 6, ..., 125, 126; channels selected by speaker audio processing: 3, 4, 7, 8, ..., 127, 128.
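The two-by-two alternating division described above can be generated as in the following sketch (the function name and list-based representation are illustrative):

```python
def alternating_groups(num_channels=128, group=2):
    """Split channels 1..num_channels into two sets, alternating `group`
    channels at a time: with group=2 the microphone side gets 1, 2, 5, 6, ...
    and the speaker side gets 3, 4, 7, 8, ..."""
    mic, spk = [], []
    for k in range(1, num_channels + 1):
        block = (k - 1) // group  # which group-sized block channel k is in
        (mic if block % 2 == 0 else spk).append(k)
    return mic, spk
```

Setting group=1 recovers the even/odd division of the first embodiment; any group size keeps the two sets disjoint, which is the only property the method requires.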
[0053]
Besides the method of simply preselecting the channels, there is also a method of selecting them dynamically according to the features of the sound. In the audio processing of the present embodiment, some components of the audio are removed, which may affect the sound quality. For example, when removing some components for audio compression, a method using the auditory masking effect is known as a way of reducing the audible impact.
[0054]
FIG. 6 is a diagram showing the characteristics of the audio signal. The figure is a graph of the output of the audio to be emitted from the speaker at a given time, processed by the analysis filter bank 31; the horizontal axis shows the 128 channels of the filter bank arranged from low to high frequency.
[0055]
In this figure, the characteristic has peaks at channels 15, 60, and 78. In this case, channel selection that prevents overlap of the frequency components between the speaker audio processing and the microphone audio processing, using the auditory masking effect, is performed, for example, as follows.
[0056]
Channels to be selected by speaker audio processing: 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120, 123, 126. Channels to be selected by microphone audio processing: the channels of 1 to 128 other than the above. That is, in the speaker audio processing, every third channel is selected (skipping two each time) so that the peaks at channels 15, 60, and 78 are included, and in the microphone audio processing the other channels are selected. The selection method using the auditory masking effect is not limited to the above spacing and can be adjusted to any value. Moreover, the peaks of the frequency components of the audio can be detected not only from the spectrum at a certain point in time but also from a time-averaged spectrum.
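A simplified sketch of such peak-aligned channel selection follows. It aligns only the single strongest peak onto a stride-3 grid, whereas the selection above places the peaks at channels 15, 60, and 78 on the grid together; the function name and single-peak simplification are illustrative assumptions.

```python
import numpy as np

def masking_based_selection(power, num_channels=128, stride=3):
    """Select speaker channels on a regular stride whose offset is chosen so
    that the strongest spectral peak lands on a selected channel; the
    microphone side receives all remaining channels. Returns 1-based lists."""
    peak = int(np.argmax(power)) + 1  # 1-based channel of the strongest peak
    offset = peak % stride            # grid that contains the peak
    spk = [k for k in range(1, num_channels + 1) if k % stride == offset]
    mic = [k for k in range(1, num_channels + 1) if k % stride != offset]
    return spk, mic
```

With a spectrum peaking at channel 15, the speaker side receives 3, 6, ..., 126, matching the selection listed above, and the microphone side the remaining channels.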
[0057]
In the above description, a filter bank is used to divide the signal into frequency components so that the frequency components of the microphone sound processing and the speaker sound processing do not overlap; however, it is also possible to use a Fourier transform, which performs a similar time-frequency conversion, in place of the filter bank. In this case, the analysis filter banks 31 and 33 may be replaced by 128-point Fourier transforms, and the synthesis filter banks 32 and 34 by the corresponding inverse transforms. Besides filter banks and Fourier transforms, the time-frequency transform can also be performed using the discrete cosine transform (DCT) or the wavelet transform.
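The Fourier-transform variant can be illustrated for a single 128-sample frame as follows. This is only a sketch showing that zeroing alternate bins removes those bands; a practical system would process windowed, overlapping frames.

```python
import numpy as np

# A 128-point FFT plays the role of the analysis filter bank and the
# inverse FFT that of the synthesis bank.
x = np.random.default_rng(0).standard_normal(128)
X = np.fft.fft(x)   # analysis: time -> 128 frequency bins
X[1::2] = 0         # remove the "odd" bins (bins 1, 3, ..., 127)
y = np.fft.ifft(X)  # synthesis: frequency -> time
```

Because bins k and 128-k form conjugate pairs and are zeroed together here, the modified spectrum stays conjugate-symmetric and the synthesized frame is (numerically) real.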
[0058]
In the first embodiment, the frequency components are removed by setting them to 0, but it is also possible to set them to a small value other than 0, chosen according to the environmental noise. Next, a second embodiment will be described. In the first embodiment, the entire frequency range of the audio signal is divided into frequency bands that select the signal components of the sound emission and frequency bands that select the signal components of the sound collection. In the second embodiment, such voice processing is performed on only a part of the frequency range, and voice processing using an adaptive filter is performed in the other frequency bands.
[0059]
FIG. 7 is a diagram showing the configuration of a signal processing unit according to the second
embodiment. The components of the processing function of the signal processing unit in the
second embodiment are the same as the components of the first embodiment shown in FIG. Thus,
the functions of the second embodiment will be described using the reference numerals of the
components shown in FIG.
[0060]
In the second embodiment, it is assumed that the band division by the analysis filter banks 31 and 33 produces 256 channels, numbered in order so that the first channel outputs the lowest frequency component and the 256th channel outputs the highest frequency component. Audio processing using adaptive filters 41, 42, 43, ..., 44 is then performed for the first to 128th channels, and the same processing as in the first embodiment is performed for the 129th to 256th channels. Note that the audio processing using the adaptive filters is applied to the first to 128th channels because the lower frequency band generally contains most of the audio signal.
[0061]
With regard to the 129th to 256th channels, among the signal components of the channels
divided into bands by the analysis filter bank 31, the signal components of even-numbered
channels are removed and sent to the synthesis filter bank 32 as speaker voice processing.
Further, as microphone sound processing, among the signal components of the channels divided
by the analysis filter bank 33, signal components of odd-numbered channels are removed and
sent to the synthesis filter bank 34. By performing the above processing, the same effect as that
of the first embodiment can be obtained for the frequency band to which the 129th to 256th
channels belong.
[0062]
Next, the audio processing procedure of the second embodiment will be described. FIG. 8 is a
flowchart showing an audio processing procedure of the signal processing unit according to the
second embodiment. The description of the same processing procedure as that of the first
embodiment will be omitted.
[0063]
Speaker audio processing will be described first. [Step S21] The analysis filter bank 31 divides the audio signal to be emitted into 256 channels, from the first channel, the lowest frequency component, up to the 256th channel.
[0064]
[Step S22] Of the 256 channels, the even-numbered outputs of the 129th to 256th channels are set to 0 and output to the synthesis filter bank 32. [Step S23] Of the 256 channels, the signal components of all of the first to 128th channels are sent to the synthesis filter bank 32, and are also sent to the adaptive filters (processing units) 41, 42, 43, ..., 44 provided for the respective channels, as the reference voice of the adaptive filter processing.
[0065]
[Step S24] The signal components of the entire band are synthesized into one audio signal. This audio signal passes through the D/A converter 11 and is output from the speaker 21a. Next, the microphone sound processing will be described.
[0066]
[Step S25] At the same time as the speaker audio processing, the audio signal input through the
microphone 22a is band-divided into 256 channels by the analysis filter bank 33 having the
same configuration as the analysis filter bank 31.
[0067]
[Step S26] Of the 256 channels, for the 129th to 256th channels, odd-numbered outputs are set
to 0 in the opposite manner to the speaker audio processing, and are output to the synthesis
filter bank 34.
That is, as in the first embodiment, for channels 129 to 256 the output sound to the speaker 21a contains only odd-channel components and the input sound from the microphone 22a contains only even-channel components, so that the same frequency components do not overlap.
[0068]
[Step S27] Of the 256 channels, the signal components of the first to 128th channels are sent to the adaptive filters (processing units) 41, 42, 43, ..., 44 provided for the respective channels.
[0069]
[Step S28] For the first to 128th channels, the adaptive filters (processing units) 41, 42, 43, ..., 44 use the audio signal emitted to the speaker, input in step S23, as the reference signal, and process the audio signal picked up by the microphone, input in step S27, as the objective signal, using an adaptive filter based on the LMS (Least Mean Square) algorithm.
Since the adaptive filter processing itself is not directly related to the present invention, its description is omitted. The adaptive filtering produces pseudo echoes.
[0070]
[Step S29] For the first to 128th channels, the adaptive filters (processing units) 41, 42, 43, ..., 44 subtract the pseudo echo calculated in step S28 from the sound signal picked up by the microphone, input in step S27, and send the result to the synthesis filter bank 34.
[0071]
[Step S30] The synthesis filter bank 34 synthesizes the signal components of the 129th to 256th channels input in step S26 and the signal components of the first to 128th channels input in step S29, over the entire band, into one voice signal.
This voice signal is transmitted to the other device via the voice codec 14.
[0072]
In the second embodiment described above, by combining the configuration of the first embodiment with echo cancellation by adaptive filters, the signal components that strongly affect the sound quality are processed by the adaptive filters, while the signal components with little effect on the sound quality are processed by the configuration of the first embodiment. As a result, in the frequency bands to which the first embodiment is applied, the amount of calculation can be reduced as in the first embodiment, so an optimum system can be designed in consideration of both the sound quality and the amount of calculation.
[0073]
The processing of the 129th to 256th channels of the second embodiment is the same as the processing of the first embodiment, and the same modifications as in the first embodiment are possible. In addition to the LMS algorithm, various adaptive filter methods are known, and other methods can also be used. Also, various control methods for improving the performance of adaptive filters are known, and the performance can be improved by applying them to the adaptive processing (step S28).
[0074]
In multi-rate signal processing using a DFT filter bank or the like, since a frequency conversion is performed, a method of down-sampling the filter outputs to reduce the amount of calculation is known. In the present embodiment in particular, this method can reduce the amount of calculation of the adaptive filter processing.
[0075]
In the first and second embodiments described above, a method of removing some components so that the frequency components of the microphone sound and the speaker sound do not overlap has been described in detail. In this configuration, the frequency components removed from the microphone sound are the components contained in the sound emitted from the speaker, and the frequency components removed from the sound emitted from the speaker are the components not removed from the sound collected by the microphone; the two are in a complementary relationship. When a near-end device and a far-end device of this identical configuration are connected, the frequency components removed from the far-end device's microphone sound are exactly the frequency components not removed from the sound emitted by the near-end device's speaker. The components of the voice to be output at the near end have therefore already been removed at the far end, and a situation occurs in which nothing can be heard. The following describes some ways to solve this problem.
[0076]
As the first voice processing, the microphone voice can be sent to the far-end device after interpolating the removed components with the unremoved components. The details of this interpolation method are as follows. In the first embodiment, it was described that the odd-numbered channels out of the 128 channels from the analysis filter bank 33 are sent to the synthesis filter bank 34 as 0. In step S13 of the microphone sound processing of the first embodiment shown in FIG. 5, each odd channel to be removed is instead replaced with the data of the next even channel, which is not removed. For example, the first channel takes the same data as the second channel; similarly, the third channel takes the data of the fourth channel, the fifth that of the sixth, and so on up to the 128th channel. Apart from this interpolation operation, the processing can be realized with the same configuration as that of the first embodiment.
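The first interpolation method can be sketched as follows (illustrative function name; the channels are held in a flat list, with 1-based channel k at index k-1):

```python
def interpolate_removed_odd(subbands):
    """First interpolation method: instead of sending the removed odd
    channels as 0, copy into each odd channel (1-based) the data of the
    following even channel: ch1 <- ch2, ch3 <- ch4, ..., ch127 <- ch128."""
    out = list(subbands)
    for i in range(0, len(out) - 1, 2):  # i: 0-based index of channels 1, 3, 5, ...
        out[i] = out[i + 1]
    return out
```

The far-end device then receives a signal covering all frequency bands, so its speaker processing, which keeps only the odd channels, still has data to emit.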
[0077]
As the second sound processing, when there are removed components in the sound sent from the far-end device, those components are complemented before being output from the speaker. The details of this interpolation method are as follows. In the first embodiment, it was described that the even-numbered channels among the 128 channel outputs of the analysis filter bank 31 are sent to the synthesis filter bank 32 as 0. After the process of step S03 in the speaker audio processing of the first embodiment shown in FIG. 4, it is determined whether any frequency component has been removed. In this method, the power of each filter output is integrated over a fixed time, and when it is less than a threshold value, that frequency component is judged to have been removed. If a component is judged to have been removed, it is replaced with the data of the adjacent odd channel. For example, when the component of the second channel is judged to have been removed, the same data as that of the first channel is used. After this determination and operation, the process proceeds to step S04. Apart from the above determination and operation, the configuration is the same as that of the first embodiment.
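The detection-and-complement operation of this second method can be sketched as follows; the threshold value, function name, and (channels, samples) array layout are illustrative assumptions.

```python
import numpy as np

def complement_removed(channels, threshold=1e-6):
    """Detect channels whose power, integrated over the frame, falls below a
    threshold (judged 'removed' at the far end) and replace them with the
    data of an adjacent surviving channel.

    channels: array of shape (num_channels, num_samples)."""
    power = np.sum(channels ** 2, axis=1)  # integrated power per channel
    removed = power < threshold
    out = channels.copy()
    for k in np.flatnonzero(removed):
        # prefer the previous channel if it survived, otherwise the next one
        src = k - 1 if k > 0 and not removed[k - 1] else k + 1
        if 0 <= src < len(channels) and not removed[src]:
            out[k] = channels[src]
    return out
```

Integrating over a fixed window, rather than testing single samples, keeps a momentarily quiet but present channel from being misjudged as removed.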
[0078]
As the third approach, the frequency components other than those removed from the far-end device's microphone sound are regarded as the frequency components to be removed from the near-end device's microphone sound. As described above, the frequency components of the sound emitted from the speaker and those of the sound collected by the microphone are configured in a complementary relationship. Therefore, if this relationship is reversed between the near-end device and the far-end device, the situation in which no sound is output, with the near-end device's speaker trying to emit the very components removed from the far-end device's microphone sound, does not occur. In step S04 of the speaker audio processing of the first embodiment shown in FIG. 4, it was described that the outputs of the even channels are set to 0; here, the configuration is as follows. For the outputs from step S03, the power of the even channels and the power of the odd channels are each integrated over a fixed time. The two values are compared, and the channel group, even or odd, with the smaller power is set to 0. Further, in step S13 of the microphone sound processing shown in FIG. 5, the odd channels are set to 0 when the even channels were set to 0 in the speaker processing, and the even channels are set to 0 when the odd channels were. Apart from the above determination and operation, the processing can be realized with the same configuration as that of the first embodiment. In the above description, the presence or absence of removed components is determined from the filter bank output power, but a configuration is also possible in which the far-end device and the near-end device send each other, as a control signal, an indication of which components have been removed.
[0079]
By performing any of the above processes, it is possible to avoid the problem that the frequency components removed by the near-end device's microphone processing are selected as the components to be output from the far-end device's speaker, in which case those components have already been removed and no signal remains to be output.
[0080]
FIG. 1 is a conceptual diagram of the invention applied to an embodiment.
FIG. 2 is a diagram showing the configuration of a video conference system according to an embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of the signal processing unit of the first embodiment of the present invention.
FIG. 4 is a flowchart showing the speaker audio processing procedure of the near-end device including the signal processing unit of the first embodiment.
FIG. 5 is a flowchart showing the microphone sound processing procedure of the near-end device including the signal processing unit of the first embodiment.
FIG. 6 is a diagram showing the characteristics of the audio signal.
FIG. 7 is a diagram showing the configuration of the signal processing unit of the second embodiment.
FIG. 8 is a flowchart showing the audio processing procedure of the signal processing unit of the second embodiment.
Explanation of sign
[0081]
1 ... sound emission signal division means, 2 ... sound collection signal division means, 3 ... band
selection means, 4 ... sound emission signal synthesis means, 5 ... sound collection signal
synthesis means