Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007053511
PROBLEM TO BE SOLVED: To perform suppression processing on an audio input signal effectively. SOLUTION: An echo cancellation means 1 receives an audio input signal and removes an echo component. A noise learning means 2 extracts a noise component from the audio signal from which the echo has been removed and calculates stationary noise. A noise cancellation means 3 removes noise, based on the stationary noise, from the audio signal from which the echo has been removed. Further, an echo learning means 4 uses the audio output signal to learn the residual echo that the echo cancellation means 1 has not removed. An echo suppression means 5 then removes the residual echo from the audio signal, output from the noise cancellation means 3, from which the echo and the noise have been removed. Meanwhile, a speaker volume estimation means 6 calculates an estimated speaker volume by removing the stationary noise and the residual echo from the audio signal from which the echo has been removed. A suppression means 7 suppresses the audio signal as necessary according to the estimated speaker volume.
[Selected figure] Figure 1
Voice processing device and microphone device
[0001]
The present invention relates to a voice processing device and a microphone device, and more particularly to a voice processing device and a microphone device that remove echo and noise from a voice input signal to extract a voice signal and perform suppression processing according to the magnitude of the voice signal.
[0002]
Conventionally, there are conference systems, typified by video conference systems, that allow a conference to be conducted between multiple points by transmitting and receiving voice and video via a plurality of terminals installed at those points, such as remote locations.
[0003]
In a microphone device (hereinafter, referred to as a microphone) used in such a conference
system or the like, a voice processing unit that processes a voice input signal collected by the
microphone is mounted.
In the audio processing unit, processing is performed on the audio input signal by an echo
canceler for preventing echo of the audio output from its own speaker from getting into the
microphone and a noise canceler for removing stationary background noise and the like.
Furthermore, a suppressor suppresses the output of the microphone, based on the voice signal processed by the echo canceller and the noise canceller, except while the speaker in front of the microphone is actually speaking, so that unnecessary noise and echo are prevented from being sent to the other party's device.
[0004]
However, noise and echo change depending on the situation of the room in which the microphone is installed (for example, a noisy place or a place where sound is easily reflected) and on the situation of the conference (whether it is single talk or double talk, and so on). For this reason, it is not easy to remove only the noise and echo components from the audio signal input through the microphone.
[0005]
Therefore, there is a device that adjusts an echo suppressor by generating a test signal, amplifying it and outputting it through the speaker, and then analyzing the signal input to the microphone (see, for example, Patent Document 1).
[Patent Document 1] Patent No. 3601164 (paragraphs [0018] to [0022], FIG. 1)
[0006]
However, the conventional voice processing apparatus has a problem in that the suppressor may not operate effectively against unnecessary noise and echo that cannot be removed by the echo canceller and the noise canceller.
[0007]
In the conventional speech processing apparatus, the suppressor determines that the signal is speech from the speaker when the speech input signal after processing by the echo canceller and the noise canceller is larger than a threshold, and turns on.
That is, a voice signal based on the processed voice input signal is output. On the other hand, when the processed voice input signal is smaller than the threshold, it is determined that the signal is not the speaker's voice, the suppressor turns off, and no voice signal is output. However, if there is noise or echo that cannot be removed by the echo canceller or the noise canceller, the suppressor may erroneously recognize it as the speaker's voice signal and transmit the noise or echo at that moment to the other party's device, which is a problem.
[0008]
In order to prevent a voice signal from being transmitted to the other party's device when the convergence of the echo canceller is not sufficient, the threshold for recognizing the speaker's voice signal may be set high. However, if the threshold is raised too much, the voice of the speaker in front of the microphone may not be transmitted to the other party's device during double talk. Conversely, if the threshold is lowered, echo becomes noticeable when the echo canceller cannot converge sufficiently during single talk.
[0009]
In addition, when the echo suppressor is adjusted by a test signal, if the state of the room or the call conditions change with the passage of time after the adjustment, the difference from the state at the time of adjustment becomes large, and there is a problem that the suppressor can no longer function effectively. Moreover, requiring the user to issue an operation instruction for each adjustment is troublesome for the user and is not practical.
[0010]
The present invention has been made in view of these points, and its object is to provide an audio processing apparatus that performs suppression processing effectively so as to transmit an audio signal causing little discomfort to the other party's apparatus, and a microphone apparatus equipped with the audio processing apparatus.
[0011]
In order to solve the above problems, the present invention provides an audio processing apparatus that removes echo and noise from an audio input signal to extract an audio signal and performs suppression processing according to the magnitude of the audio signal.
This speech processing apparatus comprises echo cancellation means, noise learning means,
noise cancellation means, speaker volume estimation means, and suppression means. The echo
cancellation means removes an echo component mixed in the audio input signal by the
wraparound of the audio output. The noise learning means extracts the noise component from
the voice signal from which the echo cancellation means has removed the echo component, and
learns the stationary noise from the noise component. The noise cancellation means removes
noise from the audio signal from which the echo component has been removed based on the
stationary noise. The speaker volume estimation means subtracts the stationary noise from the
voice signal from which the echo component has been removed to calculate an estimated speaker
volume. The suppressing means suppresses the voice signal from which noise and echo have
been removed by the echo canceling means and the noise canceling means according to the
estimated speaker volume.
[0012]
According to such a speech processing apparatus, the echo cancellation means receives the speech input signal, removes the echo component, and sends the speech input signal from which the echo component has been removed to the noise learning means, the noise cancellation means, and the speaker volume estimation means. The noise learning means extracts noise components from this signal and, using the extracted noise components, learns the stationary noise that is generated constantly. In learning, for example, a plurality of noise components extracted in each processing cycle are statistically processed to calculate the stationary noise. The learned stationary noise is output to the noise cancellation means and the speaker volume estimation means. The noise cancellation means removes noise, based on the stationary noise, from the voice input signal from which the echo has been removed, and outputs the result to the suppression means. Meanwhile, the speaker volume estimation means removes the stationary noise from the voice input signal from which the echo has been removed, calculates an estimated speaker volume, and outputs it to the suppression means. The suppression means determines whether or not to perform suppression according to the estimated speaker volume, and, when suppressing, suppresses the audio signal from which the echo and noise have been removed, received from the noise cancellation means. As a result, suppression processing is performed according to the estimated speaker volume.
[0013]
Moreover, in order to solve the above problems, a microphone apparatus incorporating the above voice processing apparatus is provided. The microphone apparatus includes: an audio input means that converts collected sound into a digital signal and outputs it as an audio input signal; an echo cancellation means that removes an echo component mixed into the audio input signal by the wraparound of the audio output; a noise learning means that extracts a noise component from the audio signal from which the echo component has been removed and learns stationary noise from the noise component; a noise cancellation means that further removes noise, based on the stationary noise, from the audio signal from which the echo component has been removed; a speaker volume estimation means that calculates an estimated speaker volume by subtracting the stationary noise from the audio signal from which the echo component has been removed; and a suppression means that suppresses, according to the estimated speaker volume, the audio signal from which the echo component and the noise have been removed by the echo cancellation means and the noise cancellation means.
[0014]
In such a microphone device, the voice input signal based on the sound collected by the audio input means has its echo removed by the echo cancellation means, after which a noise component is extracted by the noise learning means to calculate the stationary noise. The speaker volume estimation means subtracts the stationary noise from the echo-removed speech signal to calculate an estimated speaker volume. Meanwhile, the noise cancellation means further removes noise from the audio signal from which the echo has been removed. The suppression means then performs suppression processing on this voice signal according to the estimated speaker volume.
[0015]
In the present invention, since the suppression processing of the voice signal to be transmitted to the other party is performed according to the estimated volume of the speaker's voice, noise or echo that cannot be removed is prevented from being erroneously recognized as the speaker's voice signal and transmitted to the other party's device. As a result, the other party's device can receive an audio signal that is easy to hear and causes little discomfort.
[0016]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings. First, the concept of the invention applied to the embodiment will be described, and
then the specific contents of the embodiment will be described. FIG. 1 is a conceptual view of the
invention applied to the embodiment.
[0017]
The speech processing apparatus according to the present invention comprises an echo
canceling unit 1, a noise learning unit 2, a noise canceling unit 3, an echo learning unit 4, an
echo suppressing unit 5, a speaker volume estimating unit 6, and a suppressing unit 7.
[0018]
When the echo cancellation means 1 receives an audio input signal, it removes the echo component that has been mixed into the audio input signal by the wraparound of the audio output from the speaker or the like.
The echo component is the audio signal component in which the sound output from the speaker is picked up again by its own microphone and returned to the other party's apparatus. Therefore, the echo component is predicted and calculated using the audio signal output to the speaker. Since the sound emitted from the speaker reaches the microphone with a slight delay, the echo component is predicted taking this delay into account. The voice signal from which the echo component has been removed is output to the noise learning means 2 and the noise cancellation means 3.
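The paragraph above says only that the echo component is predicted from the loudspeaker signal with its propagation delay taken into account; it does not name an algorithm. Purely as an illustration, the following sketch assumes a normalized LMS (NLMS) adaptive filter, one common way to realize such prediction; all names and constants are hypothetical.

```python
import numpy as np

def nlms_echo_cancel(mic, spk, filter_len=256, mu=0.5, eps=1e-8):
    """Remove the echo of `spk` (loudspeaker signal) from `mic` with an NLMS filter.

    mic, spk: 1-D float arrays of equal length at the same sample rate.
    Returns the echo-cancelled signal e[n] = mic[n] - w^T x[n].
    """
    w = np.zeros(filter_len)          # adaptive estimate of the speaker-to-microphone echo path
    buf = np.zeros(filter_len)        # most recent loudspeaker samples (delay line)
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = spk[n]
        echo_hat = w @ buf            # predicted echo at this sample
        e = mic[n] - echo_hat         # microphone signal minus predicted echo
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)   # NLMS weight update
        # (A practical canceller would freeze adaptation during double talk.)
    return out
```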
[0019]
The noise learning means 2 receives the audio signal from which the echo component has been removed by the echo cancellation means 1, extracts the noise component, and learns stationary noise from the noise component. The learning of stationary noise is performed by dividing the speech signal into several frequency bands and processing each band. The voice signal input to the noise learning means 2 has already had its echo component removed by the echo cancellation means 1; if the echo component has been removed completely and the speaker's voice is not included, this voice signal can be regarded as noise. When the signal is the speaker's voice, features indicating speech appear in the waveform, so whether or not it is the speaker's voice can be determined by analyzing the harmonic structure or the like. Therefore, the input voice signal is analyzed to determine whether the speaker's voice is included, and stationary noise is learned only when it is determined that the speaker's voice is not included. Also, if the magnitude of the noise component increases suddenly compared with the noise components collected so far, it is not stationary noise that occurs constantly, so it is excluded from the learning as well. In the learning process, stationary noise is calculated by statistically processing the noise component data obtained in this manner. The calculated stationary noise is output to the noise cancellation means 3 and the speaker volume estimation means 6.
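As a concrete picture of the learning described above, the sketch below keeps a per-band stationary noise estimate that is updated only for frames judged to contain no speech and no sudden level jump. The smoothing constant, the jump test, and the externally supplied speech flag are assumptions for illustration, not taken from the patent.

```python
import numpy as np

class StationaryNoiseLearner:
    """Per-frequency-band stationary noise estimate (illustrative only)."""

    def __init__(self, n_bands, alpha=0.95, jump_ratio=3.0):
        self.noise = np.full(n_bands, 1e-6)   # running noise power per band
        self.alpha = alpha                     # smoothing factor of the running average
        self.jump_ratio = jump_ratio           # reject bands that jump this far above the estimate

    def update(self, band_power, speech_present):
        """band_power: power of the echo-cancelled frame in each band."""
        if speech_present:
            return self.noise                  # do not learn while the speaker is talking
        sudden = band_power > self.jump_ratio * self.noise
        keep = ~sudden                         # non-stationary bursts are excluded from learning
        self.noise[keep] = (self.alpha * self.noise[keep]
                            + (1.0 - self.alpha) * band_power[keep])
        return self.noise
```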
[0020]
The noise cancellation means 3 receives the audio signal from which the echo has been removed by the echo cancellation means 1 and removes noise from it based on the stationary noise calculated by the noise learning means 2. This processing is also performed independently for each frequency band. Since noise fluctuates, the noise contained in the voice signal input to the noise cancellation means 3 and the stationary noise learned by the noise learning means 2 are not exactly the same. For this reason, if the learned stationary noise is subtracted from the input audio signal as it is, the remaining musical noise becomes noticeable. On the other hand, if too much is subtracted, the voice becomes unnatural, like a robot. Therefore, the noise is removed in such a way that the result remains a natural voice, for example by adjusting the amount of subtraction so that the residual noise is not noticeable. Such processing is well known and is not described here. The speech signal from which the noise has been removed is sent to the echo learning means 4 and the echo suppression means 5.
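The paragraph above notes that subtracting the learned noise exactly leaves musical noise, while subtracting too much makes the voice robotic. One common compromise, shown here purely as an assumed example of that adjustment, is spectral subtraction with an over-subtraction factor and a spectral floor; the factor values are illustrative.

```python
import numpy as np

def cancel_noise(frame_power, noise_power, oversub=1.5, floor=0.05):
    """Subtract the learned stationary noise from one frame, per frequency band.

    frame_power, noise_power: per-band power of the echo-cancelled frame
    and of the learned stationary noise.
    """
    residual = frame_power - oversub * noise_power   # subtract slightly more than the estimate
    floor_power = floor * frame_power                # but never fall below a fraction of the input
    return np.maximum(residual, floor_power)         # keeps the result sounding natural
```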
[0021]
The echo learning means 4 performs learning for estimating the residual echo on which the echo cancellation means 1 has not converged. Echo learning is also processed independently for each frequency band, as in the learning of stationary noise. The echo learning means 4 calculates the ratio of the power of the audio signal from which the echo component and the noise component have been removed by the echo cancellation means 1 and the noise cancellation means 3 to the power of the audio output signal output to the speaker, and learns the attenuation level of the echo based on this ratio. Since it takes time for the sound emitted from the speaker to reach the microphone, the ratio is calculated using an audio output signal from slightly earlier. Note that this time difference does not have to be a very strict value; the sound may be assumed to continue over a certain width of time. Then, the learned echo attenuation level is multiplied by the audio output signal to calculate an estimated residual echo, which is output to the echo suppression means 5 and the speaker volume estimation means 6. The learning of the echo is performed only while the speaker is emitting a non-stationary sound. Also, in the case of double talk, the speech signal of the other speaker is mixed in and the ratio cannot be calculated correctly, so learning is performed only in the case of single talk.
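A minimal per-band sketch of the power-ratio learning described above follows. The frame delay, the smoothing factor, and the way the single-talk and non-stationarity conditions are passed in are all assumptions for illustration.

```python
import numpy as np

class EchoAttenuationLearner:
    """Learn per-band echo attenuation as the ratio of the cleaned signal power to the
    (delayed) loudspeaker output power, and predict the residual echo from it."""

    def __init__(self, n_bands, delay_frames=3, alpha=0.9):
        self.atten = np.zeros(n_bands)        # learned attenuation level per band
        self.alpha = alpha
        self.spk_history = [np.zeros(n_bands)] * delay_frames   # delayed speaker frames

    def _delayed_spk(self, spk_power):
        self.spk_history.append(spk_power)
        return self.spk_history.pop(0)        # speaker frame from slightly earlier

    def update(self, cleaned_power, spk_power, single_talk, spk_nonstationary):
        spk_delayed = self._delayed_spk(spk_power)
        if single_talk and spk_nonstationary:              # learn only in single talk and
            ratio = cleaned_power / (spk_delayed + 1e-12)  # only on non-stationary output
            self.atten = self.alpha * self.atten + (1.0 - self.alpha) * ratio
        return self.atten * spk_delayed        # estimated residual echo for this frame
```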
[0022]
The echo suppression means 5 removes the residual echo from the audio signal input from the noise cancellation means 3, based on the estimated residual echo calculated by the echo learning means 4, and outputs the result to the suppression means 7.
[0023]
The speaker volume estimation means 6 subtracts the stationary noise calculated by the noise learning means 2 from the echo-removed speech signal acquired from the echo cancellation means 1 to calculate an estimated speaker volume.
Furthermore, based on the estimated residual echo calculated by the echo learning means 4, the echo component that has not converged is also removed, if necessary. In the case of single talk, the estimated speaker volume is further set to a value smaller than the calculated value. Whether the situation is single talk is determined using, as appropriate, a known method such as the Geigel algorithm. By estimating the speaker volume to be small in the case of single talk, the suppression processing of the speech signal can be started earlier, and the apparent echo convergence time can be shortened. Furthermore, if necessary, the estimated speaker volume may be estimated to be smaller according to the magnitude of the sound output from the microphone. The estimated speaker volume is output to the suppression means 7.
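The subtraction described above can be pictured as follows: the estimated speaker volume is the echo-cancelled power minus the stationary noise and the estimated residual echo, optionally scaled down during single talk. The dB conversion and the single-talk scaling factor are illustrative assumptions.

```python
import numpy as np

def estimate_speaker_volume(echo_cancelled_power, noise_power, residual_echo_power,
                            single_talk=False, single_talk_scale=0.5):
    """Return the estimated speaker volume in dB, computed from per-band powers."""
    per_band = echo_cancelled_power - noise_power - residual_echo_power
    per_band = np.maximum(per_band, 0.0)          # what remains is attributed to the talker
    if single_talk:
        per_band = single_talk_scale * per_band   # estimate smaller during single talk
    total = per_band.sum() + 1e-12
    return 10.0 * np.log10(total)                 # estimated speaker volume in dB
```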
[0024]
The suppression means 7 suppresses the audio signal after echo cancellation and noise cancellation according to the estimated speaker volume, and outputs it as the audio signal to be transmitted to the other party. The suppression means 7 compares the estimated speaker volume with a predetermined threshold range and performs the following suppression processing according to the comparison result. If the estimated speaker volume exceeds the threshold range, no suppression is performed and the speech signal is output as it is. If the estimated speaker volume is below the threshold range, the speech signal is suppressed and no speech is transmitted to the other party. If the estimated speaker volume is within the threshold range, the amount of suppression is determined by a preset function, and suppression is performed at that suppression level. The function can be set arbitrarily.
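One way to realize the three cases above is a gain curve that is 0 below the threshold range, 1 above it, and follows the preset function inside it. The sketch below assumes a linear ramp as that function and borrows the 40 dB to 60 dB range from the example given later for FIG. 5.

```python
def suppression_gain(volume_db, low_db=40.0, high_db=60.0):
    """Gain applied to the cleaned audio signal according to the estimated speaker volume."""
    if volume_db >= high_db:
        return 1.0                                    # above the range: pass the signal as is
    if volume_db <= low_db:
        return 0.0                                    # below the range: suppress completely
    return (volume_db - low_db) / (high_db - low_db)  # inside the range: preset function (linear here)
```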
[0025]
The operation of the audio signal processing apparatus having such a configuration will be
described. The audio signal processing apparatus receives an audio input signal collected by a
microphone or the like and an audio output signal output from a speaker or the like. The echo
cancellation means 1 removes an echo component predicted from the audio output signal from
the audio input signal. When the speech signal from which the echo component has been
removed is input, the noise learning means 2 determines whether or not the speaker's speech is included in the speech signal, and extracts the noise component if it is not included. If the noise component changes rapidly, it is determined that the noise component is not stationary, and it is discarded. Then, stationary noise is learned from the noise components collected in this way.
The stationary noise is notified to the noise cancellation unit 3 and the speaker volume
estimation unit 6. The noise cancellation means 3 removes noise from the audio signal from
which the echo has been removed by the echo cancellation means 1 based on the stationary
noise. On the other hand, the echo learning means 4 learns the attenuation level of the echo from
the speech output signal and the speech signal, and calculates an estimated residual echo. The
echo suppression means 5 further uses the estimated residual echo to remove residual echo
components from the speech signal.
[0026]
As described above, after the echo is removed from the voice input signal, the stationary noise learned by the noise learning means 2 and the estimated residual echo based on the echo attenuation level learned by the echo learning means 4 are further removed, and an easy-to-hear speech signal containing almost no noise or echo component is generated.
[0027]
Meanwhile, the speaker volume estimation means 6 removes the stationary noise and the estimated residual echo from the speech signal output from the echo cancellation means 1 and calculates an estimated speaker volume.
When the suppression means 7 receives, from the echo suppression means 5, an audio signal from which echo, noise, and residual echo have been removed, it performs suppression processing according to the estimated speaker volume and outputs the audio signal to be transmitted to the other party.
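To make the flow between the means 1 to 7 easier to follow, here is a deliberately simplified per-frame sketch in which each means is reduced to one or two lines operating on per-band power values. The crude echo model, all constants, and the threshold values are placeholders, not the patent's implementation.

```python
import numpy as np

def process_frame(mic_power, spk_power, state, alpha=0.95):
    """One frame of per-band power through means 1-7 (toy model of the flow above)."""
    # Means 1: crude echo cancellation (scaled subtraction of the speaker power).
    x = np.maximum(mic_power - state["echo_path"] * spk_power, 0.0)
    # Means 2: update stationary noise only when the frame is unlikely to contain speech.
    if x.sum() < 2.0 * state["noise"].sum():
        state["noise"] = alpha * state["noise"] + (1 - alpha) * x
    # Means 3: noise cancellation based on the learned stationary noise.
    y = np.maximum(x - state["noise"], 0.0)
    # Means 4: residual echo estimate from the learned attenuation level.
    r = state["residual_atten"] * spk_power
    # Means 5: residual echo suppression.
    z = np.maximum(y - r, 0.0)
    # Means 6: estimated speaker volume in dB (echo-cancelled minus noise and residual echo).
    v_db = 10 * np.log10(np.maximum(x - state["noise"] - r, 0.0).sum() + 1e-12)
    # Means 7: suppression gain from the estimated speaker volume (thresholds are placeholders).
    gain = 1.0 if v_db > 60 else 0.0 if v_db < 40 else (v_db - 40) / 20
    return gain * z

# Example state for 8 frequency bands; every number here is a placeholder.
state = {"echo_path": 0.1, "noise": np.full(8, 1e-3), "residual_atten": 0.05}
out = process_frame(np.full(8, 0.02), np.full(8, 0.05), state)
```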
[0028]
As described above, since the suppression processing is performed based on the estimated speaker volume, a voice signal is not transmitted to the other party in response to sounds other than the speaker's voice, even if the echo canceller and the noise canceller do not work perfectly. As a result, for example, the phenomenon in which sound picked up by a microphone that is near a noise source rather than in front of the speaker is transmitted to the other party without being suppressed can be eliminated, and only the speaker's voice can be transmitted to the other party.
[0029]
The above processing is performed independently for each frequency band. By processing each frequency band separately, highly accurate results can be obtained. Hereinafter, a case where the embodiment is applied to the audio processing of a video conference system will be described in detail with reference to the drawings. FIG. 2 is a block diagram of a microphone applied to the video conference system according to the embodiment.
[0030]
In the video conference system according to the embodiment, the microphone 1 (100) and the microphone 2 (101) are cascade-connected to the video conference system main body (hereinafter referred to as the main body) 200 by the communication paths 301 and 302 and the power signal paths 311 and 312. Since the microphones have the same configuration, the microphone 1 (100) will be described below.
[0031]
The microphone 1 (100) includes an audio signal processing unit 110 that performs audio processing, a power control circuit 130 that performs power processing, a DC-DC converter 131, a serial I/F FPGA (field programmable gate array) 140 that controls serial communication, a microphone on/off switch 150, a sound collection unit 160 for inputting sound, and an A/D converter 161. The serial I/F FPGA 140 is hereinafter referred to as the serial I/F 140.
[0032]
The audio signal processing unit 110 removes echo and noise from the audio input signal, performs suppression processing, and generates an audio signal to be transmitted to another device. As shown in the figure, when multiple microphones are connected, the audio signal of another cascaded microphone received via the serial I/F 140 (Cascade In) is added to the audio signal of its own microphone, and the sum is sent out via the serial I/F 140 (Cascade Out). In addition, control commands (not shown) are input via the serial I/F 140 and processing is performed according to those commands (Control I/O).
[0033]
The power control circuit 130 sends the DC power supplied from the upstream main body 200 to the DC-DC converter 131, determines whether to supply power downstream, and, when doing so, controls the supply of power to the microphone 2 (101) via the power signal path 312.
[0034]
The serial I/F 140 receives the downlink data transmitted from the main body 200, performs predetermined processing, and outputs the downlink data to the downstream microphone 2 (101).
It also performs processing such as adding the voice signal of its own microphone to the voice information in the uplink data received from the downstream microphone 2 (101), and then outputs the result to the upstream main body 200. Hereinafter, the downlink data and uplink data to be communicated are collectively referred to as communication commands.
[0035]
The on/off switch 150 is an external switch for switching the microphone 1 (100) on and off. When it is off, the audio signal processing unit 110 does not output the audio input signal detected by its own microphone to the outside.
[0036]
The sound collection unit 160 inputs external sound and sends it to the A / D converter 161. The
A / D converter 161 converts the analog audio signal generated by the sound collection unit 160
into a digital signal and outputs the digital signal to the audio signal processing unit 110.
[0037]
The main unit 200 exchanges information with the cascaded microphones 1 (100) and 2 (101)
via communication commands, and manages these microphones. In addition, it communicates
with a main unit installed in another room or the like via a network to exchange audio signals.
[0038]
An external DC power supply 400 is connected to each microphone and supplies power as
needed. Details of the audio signal processing unit 110 will be described. FIG. 3 is a block
diagram showing the configuration of the audio signal processing unit according to the present
embodiment.
[0039]
The audio signal processing unit 110 according to the embodiment includes an echo canceller 111, a noise level learning unit 112, a noise canceller 113, a speaker noise level learning unit 114, a speaker sound noise canceller 115, an echo attenuation level learning unit 116, a residual echo level estimation unit 117, an echo suppressor 118, a speaker volume estimation unit 119, and a suppressor 120.
[0040]
The echo canceller 111 receives the audio input signal that was input from the sound collection unit 160 and converted into a digital signal by the A/D converter 161, and removes the echo component. The noise level learning unit 112 corresponds to the noise learning means 2 and learns stationary noise based on the audio signal from which the echo component has been removed. The noise canceller 113, which corresponds to the noise cancellation means 3, removes noise from the audio signal based on the stationary noise level learned by the noise level learning unit 112.
[0041]
The speaker noise level learning unit 114 learns the noise level included in the audio signal
output from the speaker. Similar to the noise level learning unit 112, learning is performed using
the sound output signal of the speaker in the steady state. Similar to the noise canceller 113, the
speaker sound noise canceller 115 removes the stationary noise component from the audio
output signal based on the speaker noise level learned by the speaker noise level learning unit
114.
[0042]
The echo attenuation level learning unit 116 and the residual echo level estimation unit 117 correspond to the echo learning means 4. The echo attenuation level learning unit 116 learns the echo attenuation level from the ratio between the audio output signal and the audio input signal after echo and noise removal. The residual echo level estimation unit 117 calculates an estimated residual echo level using the echo attenuation level learned by the echo attenuation level learning unit 116.
[0043]
The echo suppressor 118 uses the estimated residual echo calculated by the residual echo level
estimation unit 117 to remove the residual echo from the voice signal from which the echo and
noise have been removed by the echo canceller 111 and the noise canceller 113.
[0044]
The speaker volume estimation unit 119 subtracts the stationary noise and the estimated residual echo from the voice signal from which the echo canceller 111 has removed the echo to calculate the estimated speaker volume.
The suppressor 120 decides, according to the magnitude of the estimated speaker volume calculated by the speaker volume estimation unit 119, whether to output the voice signal from which echo and noise have been removed by the echo canceller 111, the noise canceller 113, and the echo suppressor 118, or to suppress it. In addition, when suppressing, the suppression level is also determined.
[0045]
In the audio signal processing unit 110 having such a configuration, the echo canceller 111
performs processing for removing an echo component of the audio input signal recorded by the
microphone based on the audio output signal output from the speaker. The voice signal from
which the echo component has been removed is transmitted to the noise level learning unit 112
and the noise canceller 113.
[0046]
For example, in a state where the speaker is not speaking, if echo is removed by the echo
canceler 111, what is mainly included in the speech signal will be stationary noise. The noise
level learning unit 112 extracts a noise component included in the audio signal from which the
echo canceler 111 has removed the echo, and learns stationary noise from the extracted noise
component. This process is performed independently for each frequency domain. Similarly, the
speaker noise level learning unit 114 learns stationary noise on the audio output signal side. At
this time, the noise canceller 113 and the echo suppressor 118 also work, and processing for
removing noise and residual echo from the audio signal is performed.
[0047]
The speech signal obtained when the speaker is not speaking is mostly stationary noise
components such as background noise. When the speaker is not speaking, the estimated speaker
volume calculated by the speaker volume estimation means 6 also has a low value, so that the
speech signal is suppressed by the suppression means 7 and the speech is not transmitted to the
other party.
[0048]
In addition, when transient noise occurs in this state, the noise level learning unit 112 does not learn stationary noise from a noise component that has changed rapidly, so the value of the stationary noise does not change. Since the speaker volume estimation unit 119 looks at the level of the speaker's voice rather than the overall level, the possibility of misrecognition is low, although misrecognition could still occur depending on the nature of the noise. However, in the present embodiment, since the signal is divided into several frequency bands and processing is performed independently for each band, even if misrecognition occurs in some of the frequency bands, the state as a whole can be recognized correctly.
[0049]
When the speaker starts speaking, the noise level learning unit 112 interrupts its learning. The noise canceller 113 removes noise from the audio signal based on the stationary noise calculated by the noise level learning unit 112; the noise removal is processed so that it does not sound unnatural to the listener. Meanwhile, the echo attenuation level learning unit 116 starts learning the echo attenuation level, and the residual echo level estimation unit 117 calculates an estimated residual echo from the learned echo attenuation level and the noise-removed speech output signal. The residual echo is further removed from the audio signal by the echo suppressor 118, resulting in an easy-to-hear speech signal.
[0050]
Further, the speaker volume estimation means 6 removes the stationary noise and the estimated residual echo from the speech signal from which the echo has been removed by the echo canceller 111, and calculates an estimated speaker volume. FIG. 4 is a diagram for explaining the speaker volume estimation processing according to the present embodiment. (A) is the voice signal after echo cancellation by the echo canceller, (B) is the stationary noise signal learned by the noise level learning unit, (C) is the estimated residual echo signal calculated by the residual echo level estimation unit, and (D) is the audio signal of the estimated speaker volume calculated by the speaker volume estimation unit.
[0051]
The voice signal after echo cancellation shown in (A) is a signal in which the stationary noise, the residual echo, and the speaker's voice overlap. Based on the stationary noise signal learned by the noise level learning unit 112 and the estimated residual echo signal calculated by the residual echo level estimation unit 117, the estimated noise component and the estimated residual echo component are removed from the voice signal after echo cancellation, and the speaker's volume is obtained.
[0052]
As can be seen from the figure, since the audio signal has characteristic waveforms in each frequency band, processing each band independently yields highly accurate results in each individual band.
[0053]
In the case of single talk, the speaker volume estimation unit 119 sets the estimated speaker volume to a small value.
For example, even after the talker has finished speaking, an audio signal that has wrapped around from the speaker is still mixed into the audio input signal. If the echo canceller 111 cannot remove this completely, the voice signal that cannot be removed (residual echo) is input to the speaker volume estimation unit 119. Even if it cannot be removed by means of the residual echo level estimation unit 117 either, the estimated speaker volume is estimated to be small, so the speech signal is suppressed early by the suppressor 120. This makes it possible to shorten the apparent convergence time.
[0054]
The processing of the suppressor 120 will now be described. FIG. 5 is a diagram showing an example of the suppression processing by the suppressor of the present embodiment. In the illustrated example, the threshold range is 40 dB to 60 dB. When the estimated speaker volume exceeds 60 dB, the suppressor level is 1.0 and the audio signal is output as it is. On the other hand, when the estimated speaker volume falls below 40 dB, the suppressor level is 0.0 and no voice signal is output. Between 40 dB and 60 dB, the previous suppressor level is maintained. That is, when the estimated speaker volume increases from below 40 dB, the suppressor level of 0.0 is maintained until the volume exceeds 60 dB. Conversely, when it decreases from above 60 dB, the suppressor level of 1.0 is maintained until the volume falls below 40 dB.
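The behavior of FIG. 5 can be written as a small state machine that keeps the previous on/off level inside the 40 dB to 60 dB range; the following sketch reflects that reading, with the initial level chosen as an assumption.

```python
class HysteresisSuppressor:
    """Suppressor level with hysteresis between 40 dB and 60 dB (as in FIG. 5)."""

    def __init__(self, low_db=40.0, high_db=60.0):
        self.low_db = low_db
        self.high_db = high_db
        self.level = 0.0                  # start suppressed (assumed initial state)

    def step(self, volume_db):
        if volume_db > self.high_db:
            self.level = 1.0              # above 60 dB: output the audio signal as is
        elif volume_db < self.low_db:
            self.level = 0.0              # below 40 dB: output nothing
        # between 40 dB and 60 dB the previous level is kept unchanged
        return self.level
```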
[0055]
Conventionally, since the sound signal was switched on and off at a single threshold, when the volume fluctuated around that threshold the sound signal was repeatedly switched on and off, and the resulting noise was noticeable and unpleasant. In the present embodiment, since the state does not change within the threshold range, such an unpleasant phenomenon does not occur.
[0056]
Note that this function is merely an example, and the behavior can be set arbitrarily, for example by changing the suppressor level in a step-like manner within the predetermined threshold range. As described above, the microphone of the present embodiment is particularly effective in a configuration such as a video conference system in which a plurality of microphones, each capable of echo cancellation and other processing, are connected. Of course, even with a single microphone, unnecessary switching of the suppression function on and off can be prevented.
[0057]
FIG. 1 is a conceptual diagram of the invention applied to the embodiment. FIG. 2 is a block diagram of the microphone applied to the video conference system of the embodiment. FIG. 3 is a block diagram showing the configuration of the audio signal processing unit of the present embodiment. FIG. 4 is a diagram explaining the speaker volume estimation processing of the present embodiment. FIG. 5 is a diagram showing an example of the suppression processing by the suppressor of the present embodiment.
Explanation of sign
[0058]
DESCRIPTION OF SYMBOLS: 1 ... echo cancellation means, 2 ... noise learning means, 3 ... noise cancellation means, 4 ... echo learning means, 5 ... echo suppression means, 6 ... speaker volume estimation means, 7 ... suppression means