Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2009147654
An object of the present invention is to accurately remove an echo component contained in collected voice. Band division units 41a and 41b divide a first audio signal generated by a first microphone and a second audio signal generated by a second microphone into predetermined frequency bands. Sound source separation units 42-1 to 42-1024 then separate, for each predetermined frequency band and based on the powers of the first and second microphones, the echo component of the sound emitted by the first sound source that is contained in the sound emitted by the second sound source. A band synthesis unit 43 synthesizes the first and second audio signals, from which the echo components have been separated by the sound source separation units 42-1 to 42-1024, into an audio signal containing the sound emitted by the first sound source, and also synthesizes the separated echo component of the first sound source into its own audio signal. [Selected figure] Figure 2
Voice processing apparatus, voice processing system and voice processing program
[0001]
The present invention relates to a voice processing apparatus, a voice processing system, and a
voice processing program for suppressing the effects of echo and howling by processing voices
collected in an environment such as a conference room where a plurality of speakers speak.
[0002]
Conventionally, in order to facilitate a conference held simultaneously at distant places (hereinafter, the respective conference rooms are referred to as first and second conference rooms), a video conference system is used in which video / audio processing devices installed in the respective conference rooms allow the speakers to talk to each other and display one another's situation. This video conference system (hereinafter also referred to as a loud-speaking call system) is equipped with a plurality of video / audio processing devices capable of showing the situation in each other's conference room and emitting the speech contents of the speaker. In the following description, it is assumed that a video / audio processing apparatus is installed in each of the first and second conference rooms.
[0003]
The video / audio processing apparatus includes a microphone for picking up voice during a meeting, a camera for photographing a speaker, a signal processing unit for performing predetermined processing on the voice picked up by the microphone, and a speaker for emitting the speech contents received from the other conference room. The video / audio processing devices installed in the respective conference rooms are connected via a communication line. By transmitting and receiving the recorded video / audio data to each other, each device displays the state of the other conference room and emits the utterance content.
[0004]
In such a video conference system, the sound emitted by the speaker is reflected off a wall or the like and input to the microphone. If no processing is performed on the input audio, the audio data is sent back to the other video / audio processing device. As a result, a person speaking in the second conference room hears his or her own speech from the speaker a little later. Such a phenomenon is called an "echo". When the echo becomes large, the sound emitted by the speaker is input to the microphone again, forming a loop in the speech communication system, which causes howling.
[0005]
Conventionally, in order to prevent echo and howling, a technique called an echo canceller is used. In general, an echo canceller first uses an adaptive filter to estimate the impulse response between the speaker and the microphone. A pseudo echo is then generated by convolving the estimated impulse response with the reference signal emitted from the speaker, and this pseudo echo is subtracted from the voice input to the microphone. By subtracting the pseudo echo, the unnecessary voice that causes echo and howling can be removed.
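As a concrete illustration of this conventional approach, the following Python sketch implements an NLMS adaptive filter that estimates the impulse response and subtracts the resulting pseudo echo. The filter length, step size, and names are illustrative assumptions, not taken from the patent.

import numpy as np

def nlms_echo_cancel(mic, ref, taps=1024, mu=0.5, eps=1e-8):
    """Subtract a pseudo echo from `mic` using the loudspeaker reference `ref`."""
    w = np.zeros(taps)                    # adaptive estimate of the impulse response
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]         # most recent reference samples
        pseudo_echo = w @ x               # impulse response convolved with reference
        e = mic[n] - pseudo_echo          # residual after subtracting the pseudo echo
        w += mu * e * x / (x @ x + eps)   # NLMS coefficient update
        out[n] = e
    return out

A long reverberation time forces a large `taps`, which is exactly the growth in computation discussed in paragraph [0008] below.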
[0006]
Patent Document 1 discloses a technique for separating stereo signals, whose channels are mixed with one another at the time of sound collection, back into the original channel signals, even at a low SN ratio, with a small amount of calculation. [Patent Document 1] Japanese Unexamined Patent Application Publication No. 2003-271167
[0007]
Incidentally, the impulse response between the speaker and the microphone changes easily whenever the reflection paths of the voice change, for example, when an attendee of the video conference moves his or her body. It therefore takes some time for the adaptive filter to follow the change, recalculate the impulse response, and generate the pseudo echo. For example, it takes about 10 seconds to obtain the pseudo echo from the direct sound from the speaker and the sound reflected from the walls. Thus, after the speech communication system changes (for example, when a speaker wearing a pin-type microphone moves), the pseudo echo cannot be generated accurately until the adaptive filter converges and echo cancellation becomes possible again. For this reason, when the impulse response changes, a large echo may be returned and, in severe cases, howling may be caused.
[0008]
Also, in general, the amount of computation of an adaptive filter is larger than that of a fast Fourier transform (FFT) or a filter bank. For this reason, if an adaptive filter is used in a signal processing apparatus with low processing performance, high-speed computation cannot be performed. Furthermore, when trying to cancel echo using an adaptive filter in a large space such as a gymnasium, the distance from the speaker to the microphone becomes long, the reverberation time becomes long, and a long tap length is required for the adaptive filter. As a result, the amount of computation increases further, and an effective solution has been sought.
[0009]
In addition to processing using an adaptive filter, a technique as described in Patent Document 1 has also been proposed. In the prior art, however, instantaneous values are used when obtaining the matrix parameters, and since instantaneous values fluctuate from moment to moment, applying them directly to the matrix parameters makes the parameters unstable. For this reason, the sound source separation processing for separating echo or noise from the sound collected from a sound source (for example, a speaking person) cannot be realized accurately.
[0010]
The present invention has been made in view of such a situation, and it is an object of the present
invention to accurately remove an echo component contained in collected voice.
[0011]
The present invention is suited to processing audio signals generated by a plurality of microphones when the microphones collect the voice emitted by a first sound source together with the voice emitted by a second sound source, which contains the collected voice of the first sound source as an echo component. That is, at least a first audio signal generated by a first microphone and a second audio signal generated by a second microphone among the plurality of microphones are divided into predetermined frequency bands. Then, based on the powers of the first and second microphones, the echo component of the voice emitted by the first sound source, which is contained in the voice emitted by the second sound source, is separated for each predetermined frequency band of the divided first and second audio signals. Finally, the first and second audio signals from which the echo component of the first sound source has been separated are synthesized into an audio signal containing the sound emitted by the first sound source, and the separated echo component of the first sound source is synthesized into its own audio signal.
[0012]
By doing this, an audio signal from which an echo component has been removed can be obtained.
[0013]
According to the present invention, for example, when the voice of a speaking person as the first sound source and the sound emitted by the speaker (loudspeaker) as the second sound source are picked up by a plurality of microphones, the echo component in which the speaking person's voice is contained can be removed from the collected sound. For this reason, even in a communication system in which an echo occurs in the collected voice with the related art, neither echo nor howling occurs, and a voice signal consisting only of the voice of the speaking person as the first sound source can be obtained. This has the effect of enhancing the quality of the obtained audio signal.
[0014]
Hereinafter, an embodiment of the present invention will be described with reference to the attached drawings. The present embodiment will be described using, as an example of a video / audio processing system for processing video data and audio data, a video conference system 10 capable of transmitting and receiving video data and audio data in real time between remote locations.
[0015]
FIG. 1 is a block diagram showing a configuration example of the video conference system 10. In the first and second conference rooms, located apart from each other, video / audio processing devices 1 and 21 capable of processing video data and audio data are installed. The video / audio processing devices 1 and 21 are mutually connected by a digital communication line 9, such as Ethernet (registered trademark), capable of carrying digital data. The video / audio processing devices 1 and 21 are centrally controlled via the communication line 9 by a control device 31 that controls data transmission timing and the like. The following describes the case where the video / audio processing devices 1 and 21 are installed at two bases (the first and second conference rooms), but video / audio processing apparatuses may also be installed at three or more bases.
[0016]
Hereinafter, an exemplary internal configuration of the video / audio processing apparatus 1 will be described. Since the video / audio processing device 21 has substantially the same configuration as the video / audio processing device 1, the illustration and detailed description of its internal blocks will be omitted.
[0017]
The video / audio processing apparatus 1 includes a first microphone 2a and a second microphone 2b that pick up voices uttered by a speaker and generate analog voice data. The video / audio processing apparatus 1 also includes analog / digital (A/D) conversion units 3a and 3b, which amplify the analog audio data supplied from the first microphone 2a and the second microphone 2b with an amplifier (not shown) and convert it into digital audio data, and a signal processing unit 4 that performs predetermined processing on the digital audio data.
[0018]
The first microphone 2a, placed near the speaking person, and the second microphone 2b, placed near the speaker 7, generate analog voice data from the voices they respectively collect. The first microphone 2a and the second microphone 2b collect the voices uttered by a speaker participating in the first conference room, and also collect, superimposed via the air, the sound emitted from the speaker 7. The analog voice data supplied from the first microphone 2a and the second microphone 2b are converted by the analog / digital converters 3a and 3b into digital voice data of, for example, 48 kHz sampling, 16-bit PCM (Pulse-Code Modulation). The converted digital audio data is supplied to the signal processing unit 4 one sample at a time.
[0019]
Incidentally, in this example, a speaking person (not shown) is denoted as a first sound source S1, and the speaker 7 is denoted as a second sound source S2. The audio uttered by the speaker is looped through the video / audio processing apparatus 1 and emitted from the speaker 7. That is, the sound emitted by the second sound source S2, installed at the first base (the first conference room) among the plurality of bases, is the voice that was collected at the second base (the second conference room) and is emitted at the first base. Then, a first transfer characteristic H11(ω) of the first microphone 2a that picks up the sound emitted by the first sound source S1 and a second transfer characteristic H21(ω) of the second microphone 2b that picks up the sound emitted by the first sound source S1 are determined. In addition, a third transfer characteristic H12(ω) of the first microphone 2a that collects the sound emitted by the second sound source S2 and a fourth transfer characteristic H22(ω) of the second microphone 2b that collects the sound emitted by the second sound source S2 are determined. These first to fourth transfer characteristics are parameters used for speech separation in the signal processing unit 4 described later.
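These four characteristics define a 2 × 2 frequency-domain mixing model; as a sketch consistent with the definitions above (the notation is assumed, with X_n the signal at microphone n and S_m the signal of sound source m):

\[
\begin{pmatrix} X_1(\omega) \\ X_2(\omega) \end{pmatrix}
= H(\omega)
\begin{pmatrix} S_1(\omega) \\ S_2(\omega) \end{pmatrix},
\qquad
H(\omega) =
\begin{pmatrix}
H_{11}(\omega) & H_{12}(\omega) \\
H_{21}(\omega) & H_{22}(\omega)
\end{pmatrix}.
\]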
[0020]
The signal processing unit 4 is configured by a digital signal processor (DSP: Digital Signal
Processor). Details of processing performed by the signal processing unit 4 will be described
later.
[0021]
In addition, the video / audio processing apparatus 1 includes an audio codec unit 5 that encodes the digital audio data supplied from the signal processing unit 4 into a code defined in the communication standard of the video conference system 10. The audio codec unit 5 also has a function of decoding encoded digital audio data received from the video / audio processing device 21 via the communication unit 8, which is a communication interface. The video / audio processing apparatus 1 further includes a digital / analog (D/A) converter 6 for converting the digital audio data supplied from the audio codec unit 5 into analog audio data, and a speaker 7 for amplifying, with an amplifier (not shown), and emitting the analog audio data supplied from the digital / analog converter 6.
[0022]
The video / audio processing apparatus 1 also includes a camera 11 that photographs a speaker and generates analog video data, and an analog / digital converter 14 that converts the analog video data supplied from the camera 11 into digital video data. The digital video data converted by the analog / digital converter 14 is supplied to a video signal processing unit 4a and subjected to predetermined processing.
[0023]
The video / audio processing device 1 further includes a video codec unit 15 that encodes the digital video data subjected to the predetermined processing by the signal processing unit 4a, a digital / analog converter 16 that converts the digital video data supplied from the video codec unit 15 into analog video data, and a display unit 17 that amplifies, with an amplifier (not shown), the analog video data supplied from the digital / analog converter 16 and displays a video.
[0024]
The communication unit 8 controls the communication of digital video / audio data with the video / audio processing device 21 and the control device 31, which are the counterpart devices. The communication unit 8 divides the digital audio data encoded by the audio codec unit 5 according to a predetermined encoding method (for example, the MPEG (Moving Picture Experts Group)-4 AAC (Advanced Audio Coding) method or the G.728 method) and the digital video data encoded in a predetermined system by the video codec unit 15 into packets according to a predetermined protocol, and transmits them to the video / audio processing device 21 through the communication line 9.
[0025]
The video / audio processing device 1 also receives packets of digital video / audio data from the video / audio processing device 21. The communication unit 8 reassembles the received packets, which are then decoded by the audio codec unit 5 and the video codec unit 15. The decoded digital audio data is subjected to predetermined processing by the signal processing unit 4, converted by the D/A conversion unit 6, amplified by an amplifier (not shown), and emitted by the speaker 7. Similarly, the decoded digital video data is subjected to predetermined processing by the signal processing unit 4a, converted by the D/A conversion unit 16, and displayed as an image by the display unit 17.
[0026]
The display unit 17 displays the states of the speakers gathered in the first and second conference rooms on a divided screen. Therefore, even if the first and second conference rooms are far apart, the speakers can conduct a conference without feeling the distance between them.
[0027]
Next, an example of the internal configuration of the signal processing unit 4 will be described with reference to the block diagram of FIG. 2. The signal processing unit 4 according to the present embodiment is characterized by the predetermined processing it performs on digital audio data.
[0028]
The signal processing unit 4 includes band division units 41a and 41b, which convert the audio signals contained in the digital audio data supplied from the analog / digital converters 3a and 3b from the time domain to the frequency domain and divide them into 1024 channel bands; sound source separation units 42-1 to 42-1024, which perform sound source separation to remove the echo component and the noise component contained in the collected voice from the band-divided audio signals; and a band synthesis unit 43, which synthesizes, band by band, the audio signals from which the echo component and the noise component have been removed, to generate digital speech data. Note that removing only the echo component from an audio signal is also referred to as sound source separation. The digital speech data synthesized by the band synthesis unit 43 is supplied to the speech codec unit 5 and subjected to predetermined processing.
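As a rough orientation, the following Python sketch mirrors this structure: band division (units 41a and 41b), a per-band separation rule (units 42-1 to 42-1024), and band synthesis (unit 43). The STFT parameters are illustrative assumptions; a sketch of the per-band rule itself is given after the discussion of equations (4) to (6) below.

import numpy as np
from scipy.signal import stft, istft

def process(x1, x2, separate_band, fs=48000, nbands=1024):
    """x1, x2: time-domain signals from microphones 2a and 2b."""
    # Band division units 41a and 41b: time domain -> frequency bands
    _, _, X1 = stft(x1, fs=fs, nperseg=2 * nbands)
    _, _, X2 = stft(x2, fs=fs, nperseg=2 * nbands)
    Y1 = np.empty_like(X1)
    # Sound source separation units 42-1 to 42-1024: one rule per band
    for n in range(X1.shape[0]):
        Y1[n, :] = separate_band(X1[n, :], X2[n, :])
    # Band synthesis unit 43: frequency bands -> time domain
    _, y1 = istft(Y1, fs=fs, nperseg=2 * nbands)
    return y1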
[0029]
Analog voice data supplied from the first microphone 2a and the second microphone 2b are
converted into digital voice data by the analog / digital converters 3a and 3b. The converted
digital voice data is sent to the band division units 41a and 41b.
[0030]
The band division units 41a and 41b perform band division processing that divides the audio signal contained in the digital audio data into predetermined frequency bands. For example, a Fourier transform is used for the band division processing. The Fourier transform converts the time domain into the frequency domain and, after processing, the data can be resynthesized into time-domain data by an inverse Fourier transform. Alternatively, as the band division processing performed by the band division units 41a and 41b, a technique such as the DFT (Discrete Fourier Transform) filter bank described in well-known reference 1 (Kazunobu Wataguchi, "Subband adaptive filter using a perfectly reconstructing DFT filter bank", Journal of the Institute of Electronics, Information and Communication Engineers, August 1996, Vol. J79-A, No. 8, pp. 1385-1393) may be used. Moreover, although in this example the band division units 41a and 41b are provided corresponding to the first microphone 2a and the second microphone 2b, respectively, the audio signals generated by the plurality of microphones may instead be divided into predetermined frequency bands using a single band division unit.
[0031]
In the audio signals band-divided by the band division units 41a and 41b, the channels are numbered in order of band, with, for example, the lowest band as the first channel and the highest band as the 1024th channel. The audio signals of the same channel (for example, the n-th channel) output from the band division units 41a and 41b are supplied to the sound source separation unit 42-n. Thus, the audio signals of the first channel are supplied to the sound source separation unit 42-1, the audio signals of the second channel are supplied to the sound source separation unit 42-2, and likewise the audio signals of the 1024th channel output from the band division units 41a and 41b are supplied to the sound source separation unit 42-1024.
[0032]
The sound source separation units 42-1 to 42-1024 perform sound source separation based on the powers of the first microphone 2a and the second microphone 2b. That is, the echo component of the sound emitted by the first sound source S1, which is contained in the sound emitted by the second sound source S2, is separated for each predetermined frequency band of the audio signals divided by the band division units 41a and 41b.
[0033]
In addition, the sound source separation units 42-1 to 42-1024 also have a function of removing stationary noise, which is generated constantly with little variation over time. In this case, in order to remove the stationary noise from the collected voice, the sound source separation units 42-1 to 42-1024 separate the first and second audio signals into a stationary signal containing the noise component and a non-stationary signal. The noise component contained in the stationary signal is then suppressed, and the echo component of the sound emitted by the first sound source, which is contained in the sound emitted by the second sound source, is separated from the non-stationary signal.
[0034]
The band synthesis unit 43 receives the audio signals separated by the sound source separation units 42-1 to 42-1024. From the separated signals of each predetermined frequency band, it synthesizes an audio signal containing the sound emitted by the first sound source, and also synthesizes an audio signal containing the echo components of the first sound source separated for each predetermined frequency band. The band synthesis unit 43 then sends the synthesized speech signals to the speech codec unit 5 as digital speech data in a format that can be processed by the other processing units.
[0035]
A conventional sound source separation unit (corresponding to the sound source separation units 42-1 to 42-1024 in this example) separated the echo contained in the audio signal, and extracted only the voice emitted by the speaker, using sound source separation based on estimating the incident angle of each frequency component of the input signals acquired by multiple microphones. The basic processing of the sound source separation method SAFIA is described in well-known reference 2 (Mariko Aoki et al., "Improvement of the performance of the sound source separation method SAFIA under reverberation", Journal of the Institute of Electronics, Information and Communication Engineers, Vol. J87-A, No. 9, pp. 1171-1186) and well-known reference 3 (Mariko Aoki et al., "Separation and extraction of a nearby sound source under high noise using the sound source separation method SAFIA", Journal of the Institute of Electronics, Information and Communication Engineers, April 2005, Vol. J88-A, No. 4, pp. 468-479). The conventional sound source separation method, however, selects frequencies based only on the power difference between the microphones and does not obtain the indoor impulse response as adaptive processing does. For this reason, the number of required parameters is small, and the method is not easily affected even if the speech communication system changes.
[0036]
When sound source separation is performed using the conventional sound source separation method SAFIA, the matrix parameter H(ω) is obtained using the following equation (1). Among the variables, ω is the frequency, i is the time at which the first microphone 2a and the second microphone 2b pick up the sounds emitted by the first sound source S1 and the second sound source S2, TH1 is a first threshold, TH2 is a second threshold, and E is a function giving an expected value. The matrix parameter H(ω) is a (2 × 2) mixing matrix whose elements are the transfer characteristics (frequency responses) Hnm(ω) from the sound source Sm to the microphone n. H11(ω, i) indicates the first transfer characteristic from the first sound source S1 to the first microphone 2a; H21(ω, i) the second transfer characteristic from the first sound source S1 to the second microphone 2b; H12(ω, i) the third transfer characteristic from the second sound source S2 to the first microphone 2a; and H22(ω, i) the fourth transfer characteristic from the second sound source S2 to the second microphone 2b.
[0037]
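A plausible form of equation (1), assuming the standard SAFIA-style band-selection estimate and the definitions above (the threshold conditions selecting the frames in which each source dominates are an assumption, with X1(ω, i) and X2(ω, i) the instantaneous spectral values of the two microphones):

\[
\frac{H_{21}(\omega)}{H_{11}(\omega)} = E_i\!\left[\frac{X_2(\omega,i)}{X_1(\omega,i)} \;\middle|\; \frac{X_1(\omega,i)}{X_2(\omega,i)} > \mathrm{TH1}\right],
\qquad
\frac{H_{12}(\omega)}{H_{22}(\omega)} = E_i\!\left[\frac{X_1(\omega,i)}{X_2(\omega,i)} \;\middle|\; \frac{X_2(\omega,i)}{X_1(\omega,i)} > \mathrm{TH2}\right].
\]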
[0038]
Next, the power obtained from the audio signal generated by the first microphone 2a is defined as a first power X1(n), and the power obtained from the audio signal generated by the second microphone 2b is defined as a second power X2(n). The first power X1(n) and the second power X2(n) are time-varying values and are time-averaged over a predetermined period.
[0039]
Then, according to the following equations (2) and (3), speech separation is performed by obtaining the time-varying first audio signal Y1(ω, i), which is the audio emitted by the first sound source S1, and the time-varying second audio signal Y2(ω, i), which is the audio emitted by the second sound source S2. The first audio signal Y1(ω, i) is an audio signal containing the voice of the speaker, which is the target sound. The second audio signal Y2(ω, i) is an audio signal containing the audio of the echo component.
[0040]
[0041]
Equation (2) is an equation for obtaining the instantaneous values of the first audio signal Y1(ω, i) and the second audio signal Y2(ω, i).
[0042]
[0043]
Equation (3) is an equation for obtaining the first audio signal Y1(ω, i) and the second audio signal Y2(ω, i) by time-averaging the matrix parameter H(ω) obtained by equation (1).
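Under the 2 × 2 mixing model sketched earlier, a plausible reading of equations (2) and (3), consistent with the two descriptions just given (the notation is assumed, not reproduced from the patent):

\[
\text{(2):}\quad
\begin{pmatrix} Y_1(\omega,i) \\ Y_2(\omega,i) \end{pmatrix}
= H(\omega,i)^{-1}
\begin{pmatrix} X_1(\omega,i) \\ X_2(\omega,i) \end{pmatrix},
\qquad
\text{(3):}\quad
\begin{pmatrix} Y_1(\omega,i) \\ Y_2(\omega,i) \end{pmatrix}
= \bigl(E_i[H(\omega,i)]\bigr)^{-1}
\begin{pmatrix} X_1(\omega,i) \\ X_2(\omega,i) \end{pmatrix}.
\]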
[0044]
Incidentally, when the sound source separation method SAFIA is used as-is in an actual environment, the echo component cannot be accurately separated from the voice of the speaker. Therefore, when sound source separation processing is performed using the sound source separation units 42-1 to 42-1024 of the present embodiment, the following equations (4) to (6) are used instead of the conventional sound source separation processing. Each variable is the same as defined for equations (1) to (3) of the conventional sound source separation processing described above. However, the frequency ω is a value determined for each predetermined frequency band divided by the band division units 41a and 41b. Further, the function E is used to obtain the average of the values for which the value obtained by dividing the first power X1(ω) by the second power X2(ω) is larger than the first threshold TH1, and the average of the values for which the value obtained by dividing the first power X1(ω) by the second power X2(ω) is larger than the second threshold TH2.
[0045]
[0046]
Equation (4) is an equation for obtaining the ratio of the second transfer characteristic H21(ω) to the first transfer characteristic H11(ω) and the ratio of the third transfer characteristic H12(ω) to the fourth transfer characteristic H22(ω). The matrix parameter H(ω) obtained by equation (4) is a time-varying value. In equation (4), the value is obtained by further time-averaging the ratio of the time-averaged first power X1(n) to the time-averaged second power X2(n). Therefore, the ratio of the time-averaged first power X1(n) to the second power X2(n), as obtained by the sound source separation units 42-1 to 42-1024 of the present embodiment, differs from the conventional sound source separation method, which takes the time average of the ratio of the first power X1(n) to the second power X2(n).
[0047]
[0048]
Equation (5) is an equation for obtaining, using the matrix parameter H(ω) obtained by equation (4), the first power X1(n), and the second power X2(n), the time-averaged first audio signal Y1(ω), which is the sound emitted by the first sound source S1, and the time-averaged second audio signal Y2(ω), which is the sound emitted by the second sound source S2.
[0049]
[0050]
Equation (6) is an equation for obtaining, based on the first audio signal Y1(ω) and the second audio signal Y2(ω) obtained by equation (5) and on the first power X1(n) and the second power X2(n), the time-varying first audio signal Y1(ω, i), which is the audio emitted by the first sound source S1, and the time-varying second audio signal Y2(ω, i), which is the audio emitted by the second sound source S2.
[0051]
In this example, the power values of the first microphone 2a and the second microphone 2b are used in equation (4) for obtaining the matrix parameter H(ω). Therefore, the matrix parameter H(ω) is obtained with high accuracy. Furthermore, in equations (5) and (6) for sound source separation, since the suppression amount obtained from the power values is applied to the instantaneous values, the result is not easily affected by fluctuations of the instantaneous values.
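The following Python sketch is a hypothetical per-band rule in the spirit of equations (4) to (6): the transfer-characteristic ratios are estimated from time-averaged powers over frames selected by the thresholds (equation (4)), a suppression amount is derived from the averaged powers (equation (5)), and that averaged suppression is applied to the instantaneous values (equation (6)). The thresholds and the power-domain approximation are assumptions, not values or formulas from the patent.

import numpy as np

def separate_band(X1, X2, th1=4.0, th2=4.0, eps=1e-12):
    """X1, X2: complex spectra of one band over time (microphones 2a and 2b)."""
    P1, P2 = np.abs(X1) ** 2, np.abs(X2) ** 2
    P1m, P2m = P1.mean(), P2.mean()              # time-averaged powers
    # Equation (4) analogue: ratios from frames where one source dominates
    s1 = P1 / (P2 + eps) > th1                   # frames dominated by source S1
    s2 = P2 / (P1 + eps) > th2                   # frames dominated by source S2
    r21 = P2[s1].mean() / (P1[s1].mean() + eps) if s1.any() else 0.0  # ~|H21/H11|^2
    r12 = P1[s2].mean() / (P2[s2].mean() + eps) if s2.any() else 0.0  # ~|H12/H22|^2
    # Equation (5) analogue: target power at microphone 2a from averaged powers
    Y1m = max(P1m - r12 * P2m, 0.0) / max(1.0 - r12 * r21, eps)
    # Equation (6) analogue: apply the averaged suppression to instantaneous values
    gain = np.sqrt(Y1m / (P1m + eps))
    return gain * X1

This function has the signature expected by the `process` sketch shown earlier, so the two pieces can be combined for experimentation.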
[0052]
Here, in an actual environment, the performance of the sound source separation processing was evaluated on digital audio data obtained using the conventional technique and on digital audio data obtained via the signal processing unit 4 of this example. An example of the results is shown in Table 1 below.
[0053]
[0054]
Among the evaluation measures shown in Table 1, SDR (Signal to Distortion Ratio) indicates, with the speaker's voice as the target sound (Signal), the ratio of the target sound to the distortion (Distortion) that arises in the target sound as a result of sound source separation. The larger the value of SDR, the smaller the amount of distortion of the target sound.
[0055]
Further, NRR (Noise Reduction Ratio) is the value obtained by subtracting the SN ratio before sound source separation from the SN ratio after sound source separation, and indicates the improvement in the SN ratio achieved by sound source separation. The larger the NRR, the more the echo other than the target sound is suppressed, meaning that the sound source separation performance is higher.
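In symbols, assuming the usual decibel conventions (a restatement of the two definitions above, not formulas reproduced from the patent):

\[
\mathrm{SDR} = 10 \log_{10} \frac{P_{\mathrm{target}}}{P_{\mathrm{distortion}}}\ [\mathrm{dB}],
\qquad
\mathrm{NRR} = \mathrm{SNR}_{\mathrm{after}} - \mathrm{SNR}_{\mathrm{before}}\ [\mathrm{dB}].
\]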
[0056]
That is, with the conventional method, even when sound source separation is performed, the sound quality is poor and echo remains. In contrast, when sound source separation was performed using the signal processing unit 4 of this example, the results showed that the echo was reliably separated from the target sound and that the sound source separation performance was enhanced.
[0057]
Incidentally, the conventional sound source separation method SAFIA does not remove stationary noise, which is generated constantly with little fluctuation over time. For this reason, the reproduced voice contained stationary noise and its sound quality was poor. In addition, an actual environment includes non-stationary noise that occurs suddenly, and the reproduced voice may therefore contain non-stationary noise as well. One cause of the remaining stationary and non-stationary noise is that, conventionally, instantaneous values are used when obtaining the matrix parameter H(ω), and the matrix parameter H(ω) obtained from the instantaneous values is applied directly. That is, the variables for separating the noise components fluctuate from moment to moment.
[0058]
In the present example, in order to cope with an actual environment in which stationary noise and non-stationary noise occur, equations (4) to (6) are extended into the following equations (7) to (9). Equations (7) to (9) are used to remove the effects of stationary noise and non-stationary noise. Each variable is the same as defined for equations (1) to (3) of the conventional sound source separation processing described above.
[0059]
[0060]
Equation (7) is an equation for obtaining the ratio of the second transfer characteristic H21(ω) to the first transfer characteristic H11(ω) and the ratio of the third transfer characteristic H12(ω) to the fourth transfer characteristic H22(ω). The matrix parameter H(ω) obtained by equation (7) is a time-varying value. In equation (7), the first noise component N1(ω) input to the first microphone 2a is subtracted from the first power X1(n). Similarly, the second noise component N2(ω) input to the second microphone 2b is subtracted from the second power X2(n).
[0061]
[0062]
Equation (8) is an equation for obtaining, using the matrix parameter H(ω) obtained by equation (7), the first power X1(n), and the second power X2(n), the time-averaged first audio signal Y1(ω), which is the sound emitted by the first sound source S1, and the time-averaged second audio signal Y2(ω), which is the sound emitted by the second sound source S2.
[0063]
[0064]
Equation (9) is an equation for obtaining, based on the first audio signal Y1(ω) and the second audio signal Y2(ω) obtained by equation (8) and on the first power X1(n) and the second power X2(n), the time-varying first audio signal Y1(ω, i), which is the audio emitted by the first sound source S1, and the time-varying second audio signal Y2(ω, i), which is the audio emitted by the second sound source S2.
[0065]
As described above, equations (7) to (9) perform the calculation while excluding the influence of the stationary noise (the first noise component N1(ω) and the second noise component N2(ω)). For this reason, the audio signal obtained as the result of the calculation is not affected by the stationary noise component. Also, in equations (8) and (9) for sound source separation, the stationary noise component is removed first, and the inverse matrix is then applied to remove the non-stationary noise component. For this reason, the stationary noise component and the non-stationary noise component can be removed simultaneously.
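As an illustration of the structure just described (a sketch under assumptions, not the patent's exact formulas), the noise-reduced powers

\[
X_1'(\omega) = X_1(n) - N_1(\omega), \qquad X_2'(\omega) = X_2(n) - N_2(\omega)
\]

would replace X1(n) and X2(n) in the equation-(4)-style estimate of H(ω) (equation (7)), after which the inverse of H(ω) is applied to the noise-reduced signals to obtain Y1 and Y2 (equations (8) and (9)).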
[0066]
According to the present embodiment described above, in an echo canceller for preventing echo and howling, the echo contained in the collected voice of the speaker can be suppressed with a smaller amount of computation and in a shorter time than with the prior art based on adaptive processing. For this reason, the speaker does not hear his or her own speech content as an echo, which has the effect that the speech is not disturbed.
[0067]
When echo cancellation is performed, the matrix parameter H is obtained based on the ratio of the time-averaged powers of the first microphone 2a and the second microphone 2b. Compared with the conventional case, in which the power ratio of the two microphones is determined at each moment and the matrix parameter H is derived from it, the sound source separation processing of this example shows less temporal variation. That is, the influence of shock noise, sudden sounds, and the like can be removed. For this reason, there is an effect that sound source separation can be performed with high accuracy using a stably determined matrix parameter H.
[0068]
Also, by obtaining the power values and taking their average, the influence of momentarily large or small values can be removed. This is reflected in the experimental results (Table 1), which show a performance difference in the SDR and NRR of the reproduced voice. That is, in equation (1), the fluctuation of the ratio of instantaneous values is large, and the parameters cannot be determined correctly unless the averaging time is sufficiently long. In equation (4), by contrast, the parameters can be determined stably by averaging the power ratios of the first microphone 2a and the second microphone 2b. Stabilizing the parameters in this way greatly improves the sound quality.
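In symbols, the contrast discussed here is between averaging the ratio and taking the ratio of averages (a restatement with the notation used above):

\[
\underbrace{E_i\!\left[\frac{X_1(\omega,i)}{X_2(\omega,i)}\right]}_{\text{conventional, equation (1)}}
\quad\text{versus}\quad
\underbrace{\frac{E_i[X_1(\omega,i)]}{E_i[X_2(\omega,i)]}}_{\text{this embodiment, equation (4)}}.
\]

A single frame in which X2(ω, i) is nearly zero can dominate the left-hand average, whereas it barely moves the right-hand ratio of averages.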
[0069]
Also, unlike the case of an adaptive filter, no echo returns during a convergence period when the speech communication system changes. Furthermore, the processing can be realized using a filter bank or a Fourier transform, for which methods of reducing the amount of calculation have already been proposed, so it can be realized with a smaller amount of calculation than when an adaptive filter is used. Further, in the present invention, regardless of the reverberation time of the room, there is no increase in the amount of calculation such as an adaptive filter requires in order to cope with a long reverberation time.
[0070]
In addition, it becomes possible to remove the stationary noise component and the echo component simultaneously. Conventionally, it has been difficult to remove the stationary noise component from the collected voice, but by performing the voice separation processing according to the present embodiment, a voice from which both the stationary noise component and the echo component have been removed can be obtained. For this reason, there is an effect that the listener can hear the reproduced sound easily.
[0071]
Further, the echo component and the stationary noise component can be removed from the collected sound simply by placing the second microphone 2b near the speaker 7, which is the second sound source S2. In this case, it is sufficient to provide one microphone for the speaker, so the system configuration can be realized easily. The sound source separation processing performed by the sound source separation units 42-1 to 42-1024 in the present example is performed based on the power difference between the first microphone 2a and the second microphone 2b, while taking parameter errors and stationary noise into consideration so that it can be applied to a real environment; as a result, a high-quality reproduced voice (only the voice of the speaker) can be obtained.
[0072]
Although the above embodiment has been described using the example of a video conference system that transmits and receives voice bidirectionally, the invention may be applied to any system using bidirectional communication, for example, voice communication by telephone.
[0073]
In addition, the series of processes in the above-described embodiment can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the program constituting the software is either installed in dedicated hardware, or installed and executed on a general-purpose personal computer or the like that can execute various functions by installing various programs.
15-04-2019
20
[0074]
In addition, the functions of the embodiment described above can, needless to say, also be achieved by supplying a recording medium recording the program code of software that realizes those functions to a system or apparatus, and having a computer (or a control apparatus such as a CPU) of the system or apparatus read out and execute the program code stored in the recording medium.
[0075]
As the recording medium for supplying the program code in this case, for example, a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM (Compact Disc Read Only Memory), a CD-R, a magnetic tape, a non-volatile memory card, a ROM, or the like can be used.
[0076]
Further, the functions of the above-described embodiment are realized not only by executing the program code read out by the computer, but also in the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiment.
[0077]
Furthermore, the present invention is not limited to the embodiment described above, and it goes without saying that various other configurations can be adopted without departing from the scope of the present invention. For example, although the video / audio processing devices 1 and 21 are configured to be controlled by the control device 31, the video / audio processing devices 1 and 21 may instead control the timing at which they mutually transmit and receive digital video / audio data in a peer-to-peer manner.
[0078]
FIG. 1 is a block diagram showing an example of the internal configuration of a video conference system in an embodiment of the present invention. FIG. 2 is a block diagram showing an example of the internal configuration of the signal processing unit in an embodiment of the present invention.
Explanation of reference numerals
[0079]
1: video / audio processing apparatus; 2a, 2b: microphones; 3a, 3b: analog / digital conversion units; 4: signal processing unit; 5: audio codec unit; 6: digital / analog conversion unit; 7: speaker; 8: communication unit; 9: communication line; 10: video conference system; 21: video / audio processing device; 31: control device; 41a, 41b: band division units; 42-1 to 42-1024: sound source separation units; 43: band synthesis unit