Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPH10313497
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
method of separating and extracting at least one sound source signal from a signal in which a
plurality of acoustic signals emitted from a plurality of sound sources such as audio signal
sources and various environmental sound sources are mixed. The present invention relates to a
sound source separation device used and a recording medium recording a program for executing
the method by a computer.
[0002]
A sound source separation device of this type is applied to various apparatuses, such as a sound pickup device for television conferencing, a sound pickup device for transmitting a speech signal uttered in a noisy environment, and a sound pickup device for an apparatus that identifies the type of a sound source. Conventional sound source separation techniques estimate the fundamental frequency of each signal in the frequency domain, extract its harmonic structure, and collect and synthesize the components belonging to the same sound source.
[0003]
However, this method has the following problems: (1) the separable signals are limited to those having a harmonic structure, such as voiced vowels and musical tones; (2) estimating the fundamental frequency generally takes a long time, making real-time sound source separation difficult; and (3) frequency components of other sound sources mix into the extracted signal because of errors in estimating the harmonic structure, so the separation accuracy is insufficient.
[0004]
SUMMARY OF THE INVENTION An object of the present invention is to provide a sound source separation method, device, and program recording medium that can separate and extract even the acoustic signal of a sound source having no harmonic structure, that is, that enable sound source separation independent of the type of sound source, and that enable sound source separation in real time.
[0005]
Another object of the present invention is to provide a sound source separation method, device, and program recording medium with high separation accuracy and little noise contamination.
[0006]
A sound source separation method according to the present invention uses a plurality of microphones spaced apart from one another. In a band division process, the output channel signal of each microphone is divided into frequency bands narrow enough that mainly only one sound source signal component is present in each band. For each common band of the divided output channel signals, the difference in a signal parameter that changes with the positions of the sound sources relative to the microphones, namely the level (power) or the arrival time, is detected as a per-band inter-channel parameter value difference. Based on the per-band inter-channel parameter value difference of each band, a sound source signal determination process judges from which sound source the signal in each band of each divided output channel signal was input. Based on this determination, a sound source signal selection process selects, from the band-divided output channel signals, the band signals input from the same sound source, and a sound source synthesis process synthesizes the band signals selected as coming from the same sound source into a sound source signal.
[0007]
According to one embodiment of the sound source separation method of the present invention, the per-band level of each output channel signal divided in the band division process is detected, and a sound source that is not producing sound is detected based on a comparison of these detected per-band levels. The detection signal for the silent sound source suppresses, among the sound source signals synthesized in the sound source synthesis process, the synthesized signal corresponding to that silent sound source.
[0008]
According to another embodiment of the sound source separation method of the present invention, the arrival time difference at the microphones of each output channel signal divided in the band division process is detected for each common band, and a sound source that is not producing sound is detected based on a comparison, within each band, of the detected per-band arrival time differences between channels. The detection signal for the silent sound source suppresses the corresponding synthesized signal among the sound source signals synthesized in the sound source synthesis process.
[0009]
DESCRIPTION OF THE PREFERRED EMBODIMENT FIG. 1 shows an embodiment of the present
invention.
The microphones 1 and 2 are spaced apart by, for example, about 20 cm; they collect the acoustic signals from the sound sources A and B and convert them into electric signals.
The output of the microphone 1 is referred to as an L channel signal, and the output of the
microphone 2 is referred to as an R channel signal.
The L channel signal and the R channel signal are supplied to the inter-channel time difference / level difference detection unit 3 and to the band division unit 4; the band division unit 4 divides each of them into a plurality of frequency band signals, which are supplied to the per-band inter-channel time difference / level difference detection unit 5 and to the sound source determination signal selection unit 6.
According to the detection outputs of the detection units 3 and 5, the selection unit 6 selects, for each band, one of the channel signals as an A component or a B component, and the selected per-band A component signals and B component signals are synthesized by the sound source signal synthesis units 7A and 7B, respectively, so that a sound source A signal and a sound source B signal are separated and output.
When the sound source A is closer to the microphone 1 than to the microphone 2, the signal SA1 reaching the microphone 1 from the sound source A arrives earlier than the signal SA2 reaching the microphone 2 from the sound source A, and its level is also larger. Likewise, when the sound source B is closer to the microphone 2, of the signals SB1 and SB2 arriving from the sound source B at the microphones 1 and 2 respectively, the signal reaching the microphone 2 arrives earlier and its level is also larger.
In this way, the present invention uses the change in the acoustic signals reaching the two microphones 1 and 2 caused by the position of the sound source relative to the microphones 1 and 2, in this example the differences in arrival time and level between the two signals.
[0010]
The device shown in FIG. 1 operates as follows. As shown in FIG. 2, signals from the two sound sources A and B are captured by the microphones 1 and 2 (S01). The inter-channel time difference / level difference detection unit 3 detects an inter-channel time difference or level difference from the L channel signal and the R channel signal. Here, a case in which the cross correlation function of the L channel signal and the R channel signal is used as the parameter for detecting the time difference will be described. As shown in FIG. 3, samples L(t) and R(t) of the L channel signal and the R channel signal are first read (S02), and the cross correlation function between these samples is calculated (S03). This calculation obtains the cross correlation between the two channel signals at the same sample time, and then the cross correlations when one channel signal is shifted with respect to the other by one sample, by two samples, and so on; together these form the cross correlation function. A large number of these cross correlations are obtained and normalized by power to generate a histogram (S04). Next, the time differences Δα1 and Δα2 at which the histogram takes its first and second largest cumulative frequencies are determined (S05). These time differences Δα1 and Δα2 are converted into inter-channel time differences Δτ1 and Δτ2, respectively, by the following equations and output (S06).
[0011]
Δτ1 = 1000 × Δα1 / F (1) Δτ2 = 1000 × Δα2 / F (2) where F is the sampling frequency, and the factor of 1000 is used to scale the values up for convenience of calculation. The time differences Δτ1 and Δτ2 are the inter-channel time differences between the L channel signal and the R channel signal for the signals of the sound sources A and B, respectively.
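The cross-correlation search of steps S02 to S06 and equations (1) and (2) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function and parameter names are our own, and a circular shift stands in for the finite shift described in the text.

```python
import numpy as np

def interchannel_time_differences(l, r, fs, max_lag, n_peaks=2):
    """Estimate the n_peaks dominant inter-channel delays (illustrative names)."""
    # Cross correlation of the two channel signals for every shift in
    # [-max_lag, max_lag] samples (S03), normalised by power (S04).
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.array([np.dot(l, np.roll(r, k)) for k in lags])
    corr /= np.sqrt(np.dot(l, l) * np.dot(r, r))
    # Sample shifts giving the largest correlations (S05)...
    best = lags[np.argsort(corr)[::-1][:n_peaks]]
    # ...converted by eqs. (1)-(2): delta_tau = 1000 * delta_alpha / F (S06).
    return [1000.0 * int(a) / fs for a in best]
```

For example, a channel that lags its partner by 5 samples at F = 8000 Hz yields Δτ = 1000 × 5 / 8000 = 0.625 as the top-ranked value.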
[0012]
Returning to the explanation of FIGS. 1 and 2, the band division unit 4 divides the L channel signal and the R channel signal into the signals L(f1), L(f2), ..., L(fn) and the signals R(f1), R(f2), ..., R(fn) of the respective frequency bands (S04). This division is performed by, for example, applying a discrete Fourier transform to each channel signal to convert it into a frequency domain signal and then dividing it into the frequency bands. The band division is made fine enough that, owing to the differences between the frequency characteristics of the signals of the sound sources A and B, mainly only the signal component of one sound source is present in each band. For example, the power spectrum of the sound source A is obtained as shown in FIG. 4A and the power spectrum of the sound source B as shown in FIG. 4B, and the division is performed with a bandwidth Δf that allows the respective spectra to be separated. In this case, in each band the spectrum of one sound source is negligible with respect to the spectrum of the other sound source, as indicated by the corresponding broken-line spectra. Also, as can be seen from FIGS. 4A and 4B, the signals may instead be divided with a bandwidth 2Δf; in other words, it is not necessary for each band to contain only one spectrum. The discrete Fourier transform is performed, for example, every 20 to 40 ms.
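The DFT-based band division of step S04 can be sketched as below. This is an assumed arrangement for illustration only: one frame is transformed and its bins are grouped into bands of a fixed width, corresponding to the bandwidth Δf of FIG. 4.

```python
import numpy as np

def split_into_bands(x, fs, band_hz):
    """Divide one frame into frequency-band spectra of width band_hz (sketch)."""
    spectrum = np.fft.rfft(x)                      # discrete Fourier transform
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)    # bin centre frequencies
    n_bands = int(freqs[-1] // band_hz) + 1        # enough bands to cover Nyquist
    # Group the DFT bins so band i covers [i*band_hz, (i+1)*band_hz).
    return [spectrum[(freqs >= i * band_hz) & (freqs < (i + 1) * band_hz)]
            for i in range(n_bands)]
```

A 1000 Hz tone framed at F = 8000 Hz with Δf = 500 Hz lands in the band covering 1000 to 1500 Hz, and concatenating all bands recovers the full spectrum.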
[0013]
Next, the per-band inter-channel time difference / level difference detection unit 5 detects an inter-channel time difference or level difference between the corresponding band signals, that is, between L(f1) and R(f1), ..., L(fn) and R(fn) (S05). Here, the per-band inter-channel time difference is determined uniquely by using the inter-channel time differences Δτ1 and Δτ2 detected by the inter-channel time difference / level difference detection unit 3. The equations used for this detection are as follows.
[0014]
Δτ1 − (Δφi/(2πfi) + ki1/fi) = εi1 (3) Δτ2 − (Δφi/(2πfi) + ki2/fi) = εi2 (4) where i = 1, 2, ..., n and Δφi is the phase difference between the signal L(fi) and the signal R(fi). In these equations, the integers ki1 and ki2 are determined so as to minimize εi1 and εi2. Next, by comparing the minimized values εi1 and εi2, the time difference Δτj (j = 1 or 2) giving the smaller value is set as the inter-channel time difference Δτij of the band i. That is, the inter-channel time difference of the sound source signal occupying that band is adopted.
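The role of equations (3) and (4) can be shown in a short sketch. Assuming the delays are expressed in seconds for simplicity (the text scales by 1000), each candidate delay τ is tested by choosing the integer k that minimises its residual ε, and the candidate with the smallest residual claims the band; the names below are illustrative, not from the patent.

```python
import numpy as np

def resolve_band_source(delta_phi, fi, taus):
    """Return the index of the candidate delay best explaining the band's phase."""
    residuals = []
    for tau in taus:
        base = delta_phi / (2.0 * np.pi * fi)  # phase difference as a delay
        # Integer k minimising |tau - (base + k/fi)|, i.e. eqs. (3)-(4).
        k = round((tau - base) * fi)
        residuals.append(abs(tau - (base + k / fi)))
    return int(np.argmin(residuals))
```

With candidate delays of +0.3 ms and −0.4 ms, a band whose wrapped phase difference was produced by the +0.3 ms source is assigned to that source even though the raw phase alone is ambiguous.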
[0015]
The sound source determination signal selection unit 6 uses the per-band inter-channel time differences Δτ1j to Δτnj detected by the per-band inter-channel time difference / level difference detection unit 5, and the sound source signal determination unit 601 determines which of the corresponding band signals L(f1) to L(fn) and R(f1) to R(fn) is to be selected (S06). For example, a case will be described in which, of the time differences Δτ1 and Δτ2 calculated by the inter-channel time difference / level difference detection unit 3, Δτ1 is the inter-channel time difference of the signal from the sound source A close to the L-side microphone and Δτ2 is the inter-channel time difference of the signal from the sound source B close to the R-side microphone.
[0016]
In this case, in a band i where the time difference Δτij calculated by the per-band inter-channel time difference / level difference detection unit 5 equals Δτ1, the sound source signal determination unit 601 opens the gate 602Li so that the L-side input signal L(fi) is output unchanged as SA(fi), and closes the gate 602Ri of the R-side input signal R(fi) of the band i so that SB(fi) is output as 0. Conversely, in a band i where the time difference Δτij equals Δτ2, the L side outputs SA(fi) = 0 and the R side outputs the input signal R(fi) unchanged as SB(fi). That is, as shown in FIG. 1, the band signals L(f1) to L(fn) are supplied to the sound source signal synthesis unit 7A through the gates 602L1 to 602Ln, respectively, and the band signals R(f1) to R(fn) are supplied to the sound source signal synthesis unit 7B through the gates 602R1 to 602Rn, respectively. The sound source signal determination unit 601 in the sound source determination signal selection unit 6 receives Δτ1j to Δτnj; for a band i where Δτij is judged to be Δτ1, it generates the gate control signals CLi = 1 and CRi = 0 so that the corresponding gate 602Li is opened and 602Ri is closed, and for a band i where Δτij is judged to be Δτ2, it generates the gate control signals CLi = 0 and CRi = 1 so that the corresponding gate 602Li is closed and 602Ri is opened. The above description is of a functional configuration; the processing is actually performed by, for example, a digital signal processor.
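The gate behaviour can be expressed compactly as a pair of masks. This is only a sketch of the functional configuration described above, with illustrative names: a band judged to carry source A passes L(fi) to SA(fi) and forces SB(fi) to 0, and vice versa.

```python
import numpy as np

def gate_bands(l_bands, r_bands, assigned_to_a):
    """Apply the gate control signals CLi/CRi to the band signals (sketch)."""
    # Gates 602L1..602Ln: open where the band belongs to source A.
    sa = [li if a else np.zeros_like(li)
          for li, a in zip(l_bands, assigned_to_a)]
    # Gates 602R1..602Rn: open where the band belongs to source B.
    sb = [np.zeros_like(ri) if a else ri
          for ri, a in zip(r_bands, assigned_to_a)]
    return sa, sb
```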
[0017]
The signals SA(f1) to SA(fn) are synthesized by the sound source signal synthesis unit 7A, in this band-division example by an inverse Fourier transform, and output as the signal SA to the output terminal tA; the signals SB(f1) to SB(fn) are similarly synthesized by the sound source signal synthesis unit 7B and output as the signal SB to the output terminal tB. As is apparent from the above description, the device of the present invention determines, for each small band into which each channel signal is divided, from which sound source that band component originates, and outputs all of the determined components. That is, provided the frequency components of the signals of the sound sources A and B do not overlap, processing is performed without dropping any particular frequency band, so the signals of the sound sources A and B can be separated while maintaining high sound quality compared with the conventional method of extracting only the harmonic structure.
[0018]
In the above description, the determination conditions in the sound source signal determination unit 601 were based only on the inter-channel time differences detected by the inter-channel time difference / level difference detection unit 3 and by the per-band inter-channel time difference / level difference detection unit 5. Next, an embodiment will be described in which the determination is made using the inter-channel level difference. In this embodiment, as shown in FIG. 5, the L channel signal and the R channel signal are captured by the microphones 1 and 2 (S02), and the level difference ΔL between the L channel signal and the R channel signal is detected in the detection unit 3 (FIG. 1) (S03). As in step S04 of FIG. 2, the L channel signal and the R channel signal are divided into the n per-band channel signals L(f1) to L(fn) and R(f1) to R(fn), respectively (S04), and the per-band inter-channel level differences ΔL1, ΔL2, ..., ΔLn are detected for each pair of per-band channel signals, that is, for L(f1) and R(f1), L(f2) and R(f2), ..., L(fn) and R(fn) (S05).
[0019]
A human voice can be regarded as stationary for about 20 ms to 40 ms. Therefore, every 20 ms to 40 ms, the sound source signal determination unit 601 (FIG. 1) compares the sign (+ or −) of the logarithm of the inter-channel level difference ΔL with the sign of the logarithm of each per-band inter-channel level difference ΔLi, and calculates the percentage of all bands in which the two signs agree. If the signs agree in at least a predetermined fraction of the bands, for example 80% or more (S06, S07), the determination for the next 20 ms to 40 ms is made using only the inter-channel level difference ΔL (S08); if the signs agree in fewer than 80% of the bands, the determination for the next 20 ms to 40 ms is made band by band using the per-band inter-channel level differences ΔLi (S09). As for the determination itself, when all bands are determined by the inter-channel level difference ΔL: if ΔL is positive, the L channel signal L(t) is output unchanged as the signal SA, and the R channel signal R(t) is output as the signal SB = 0; conversely, if ΔL is 0 or less, the L channel signal L(t) is output as the signal SA = 0, and the R channel signal R(t) is output unchanged as the signal SB. This description applies when the inter-channel level difference is the value obtained by subtracting the R side from the L side. When the determination is made for each band using the per-band inter-channel level differences ΔLi: if ΔLi is positive in a band fi, the L-side divided signal L(fi) is output unchanged as the signal SA(fi), and the signal SB(fi) = 0; conversely, if the level difference ΔLi is 0 or less, the divided signal L(fi) is output as the signal SA(fi) = 0 on the L side, and the divided signal R(fi) is output as the signal SB(fi) on the R side. In this way, the gate control signals CL1 to CLn and CR1 to CRn are output from the sound source signal determination unit 601, and the gates 602L1 to 602Ln and 602R1 to 602Rn are controlled. Here too, the per-band inter-channel level difference is the value obtained by subtracting the R side from the L side, as before. The signals SA(f1) to SA(fn) and the signals SB(f1) to SB(fn) are synthesized as in the previous embodiment and output as the signals SA and SB to the output terminals tA and tB, respectively (S10).
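The sign-agreement test of steps S06 and S07 can be sketched as below. The function name and the return labels are illustrative; the 80% figure follows the text, and we read "80% or more" as choosing the whole-frame decision.

```python
import numpy as np

def level_decision_mode(delta_l, delta_l_bands, agree_frac=0.8):
    """Choose frame-wide vs per-band level-difference judgement (sketch)."""
    # Fraction of bands whose per-band level difference shares the sign of
    # the broadband inter-channel level difference delta_l.
    agree = float(np.mean(np.sign(delta_l_bands) == np.sign(delta_l)))
    return "frame" if agree >= agree_frac else "per-band"
```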
[0020]
In the embodiments above, only one of the arrival time difference and the level difference is used as the determination condition in the sound source signal determination unit 601. However, when only the level difference is used, the levels of L(fi) and R(fi) may be nearly equal in the low frequency range, in which case it becomes difficult to obtain the level difference accurately. When only the time difference is used, it may be difficult to calculate the time difference correctly in the high frequency range because of phase rotation. For these reasons, it can be more advantageous to use the time difference in the low frequency range and the level difference in the high frequency range than to use a single parameter over the entire band.
[0021]
Therefore, an embodiment in which the sound source signal determination unit 601 uses both the per-band inter-channel time difference and the per-band inter-channel level difference will be described with reference to the following drawings. The functional block configuration of this embodiment is the same as that of FIG. 1, but the processing in the inter-channel time difference / level difference detection unit 3, the per-band inter-channel time difference / level difference detection unit 5, and the sound source signal determination unit 601 differs as follows. The inter-channel time difference / level difference detection unit 3 outputs a single time difference Δτ, for example the average of the absolute values of the detected time differences Δτ1 and Δτ2, or one of them if Δτ1 and Δτ2 are relatively close in value. Although the inter-channel time differences Δτ1, Δτ2, and Δτ are calculated before the channel signals L(t) and R(t) are divided into bands on the frequency axis, they may instead be calculated after the band division.
[0022]
As shown in FIG. 5, the L channel signal L(t) and the R channel signal R(t) are read frame by frame (for example, every 20 to 40 ms) (S02), and the band division unit 4 divides each of the L channel signal and the R channel signal into a plurality of frequency bands. In this example, a Hanning window is applied to the L channel signal L(t) and the R channel signal R(t) (S03), and the divided signals L(f1) to L(fn) and R(f1) to R(fn) are obtained by Fourier transformation (S04).
[0023]
Next, the per-band inter-channel time difference / level difference detection unit 5 checks whether the frequency fi of a divided signal is in the band at or below 1/(2Δτ) (hereinafter referred to as the low range; Δτ is the inter-channel time difference) (S05); if so, it outputs the per-band inter-channel phase difference Δφi (S08). It then checks whether the frequency fi of the divided signal is greater than 1/(2Δτ) and less than 1/Δτ (hereinafter referred to as the middle range) (S06); if the signal is in the middle range, it outputs the per-band phase difference Δφi and the level difference ΔLi (S09). Finally, it checks whether the frequency fi of the divided signal is at or above 1/Δτ (hereinafter referred to as the high range) (S07); if so, it outputs the per-band level difference ΔLi (S10).
[0024]
The sound source signal determination unit 601 uses the per-band inter-channel phase differences and level differences detected by the per-band inter-channel time difference / level difference detection unit 5 to determine which of L(f1) to L(fn) and R(f1) to R(fn) to output. In this example, the phase difference Δφi and the level difference ΔLi are the values calculated by subtracting the R-side value from the L-side value.
[0025]
For the signals L(fi) and R(fi) determined to be in the low range, it is first checked whether the phase difference Δφi is greater than or equal to π, as shown in FIG. 7 (S15); if so, 2π is subtracted from it (S17). If Δφi is less than π in step S15, it is checked whether it is less than or equal to −π (S16); if so, 2π is added to it (S18), and otherwise Δφi is used as it is (S19). The per-band phase difference Δφi determined in steps S17, S18, and S19 is converted into a time difference Δσi by the following equation (S20).
[0026]
Δσi = 1000 · Δφi / (2πfi) (5) When the divided signals L(fi) and R(fi) are determined to be in the middle range, the phase difference Δφi is determined uniquely using the per-band level difference ΔL(fi), as shown in FIG. 8. That is, it is checked whether ΔL(fi) is positive (S23). If it is positive, it is checked whether the per-band phase difference Δφi is positive (S24); if so, Δφi is output as it is (S26), and if not, the value obtained by adding 2π to Δφi is output as Δφi (S27). If ΔL(fi) is not positive in step S23, it is checked whether the per-band phase difference Δφi is negative (S25); if so, Δφi is output as it is (S28), and if not, the value obtained by subtracting 2π from Δφi is output as Δφi (S29). The Δφi resulting from any of steps S26 to S29 is converted into the per-band time difference Δσi by the following equation (S30).
[0027]
Δσi = 1000 · Δφi / (2πfi) (6) In this way, the per-band time differences Δσi in the low and middle ranges and the per-band level differences ΔL(fi) in the high range are obtained, and the sound source signals are discriminated from them as follows. As shown in FIG. 9, the time difference Δσi in the low and middle ranges and the level difference ΔLi in the high range are used to assign each frequency component of both channels to one of the corresponding sound sources. Specifically, in the low and middle ranges, it is checked whether the per-band time difference Δσi obtained in FIGS. 7 and 8 is positive (S34). If it is positive, the L-side band channel signal L(fi) is output as the signal SA(fi), and 0 is output as the signal SB(fi) (S36). Conversely, if the per-band time difference Δσi is not positive in step S34, 0 is output as SA(fi), and the R-side band channel signal R(fi) is output as SB(fi) (S37).
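The mid-range branch of FIG. 8 and the conversion of equation (6) can be sketched together. The function name is ours; the sign of the level difference selects the 2π branch of the phase difference, and the factor 1000 follows the text.

```python
import numpy as np

def midband_time_difference(delta_phi, delta_l, fi):
    """FIG. 8 branch (S23-S29) followed by eq. (6) (illustrative sketch)."""
    if delta_l > 0 and delta_phi <= 0:
        delta_phi += 2.0 * np.pi      # S27: positive level, non-positive phase
    elif delta_l <= 0 and delta_phi >= 0:
        delta_phi -= 2.0 * np.pi      # S29: non-positive level, non-negative phase
    return 1000.0 * delta_phi / (2.0 * np.pi * fi)   # eq. (6)
```

Note that the resulting Δσi always takes the same sign as ΔL(fi), which is what makes the sign test of FIG. 9 consistent across the ranges.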
[0028]
In the high range, it is checked whether the per-band level difference ΔL(fi) detected in step S10 of FIG. 6 is positive (S35); if it is positive, the L-side band channel signal L(fi) is output as the signal SA(fi), and 0 is output as SB(fi) (S38). If the level difference ΔLi is not positive in step S35, 0 is output as SA(fi), and the R-side band channel signal R(fi) is output as SB(fi) (S39).
[0029]
The L side or the R side is output for each band as described above, the frequency components assigned to each source are added over the entire band in the sound source signal synthesis units 7A and 7B (S40), each summed signal is inverse Fourier transformed (S41), and the transformed signals SA and SB are output (S42). As described above, by using for each frequency range the parameter that is advantageous for sound source separation, this embodiment realizes sound source separation with higher performance than when a single parameter is used over the entire band.
[0030]
The present invention is also applicable when the number of sound sources is three or more. As an example, sound source separation using the arrival time differences at the microphones will be described for the case of three sound sources and two microphones. In this case, when the inter-channel time difference / level difference detection unit 3 calculates the inter-channel time difference between the L channel signal and the R channel signal for each sound source, the inter-channel time differences Δτ1, Δτ2, and Δτ3 of the respective sound source signals are calculated by finding the time points taking the first to third largest cumulative frequencies (peak values) of the histogram, as shown in FIG. 10. The per-band inter-channel time difference / level difference detection unit 5 likewise assigns the per-band inter-channel time difference of each band to one of Δτ1 to Δτ3; this assignment is made in the same way as with equations (3) and (4) of the embodiment described above. The operation of the sound source signal determination unit 601 will be described for an example in which Δτ1 > 0, Δτ2 > 0, and Δτ3 < 0. Here, Δτ1, Δτ2, and Δτ3 are assumed to be the inter-channel time differences of the sound sources A, B, and C, respectively, and these values are again calculated by subtracting the R-side value from the L-side value. In this case, the sound sources A and B are near the L-side microphone 1 and the sound source C is near the R-side microphone 2. Therefore, the signal of the sound source A can be separated by adding and outputting the band signals of the L channel whose per-band inter-channel time difference is Δτ1, and the signal of the sound source B by adding and outputting the band signals of the L channel whose per-band inter-channel time difference is Δτ2. Further, the signal of the sound source C is separated by adding and outputting the band signals of the R channel whose per-band inter-channel time difference is Δτ3.
[0031]
In the above, the sound source signals are separated and the separated sound source signals SA and SB are output individually. However, when, for example, one sound source A is the voice of a speaker and the other sound source B is noise, the present invention can also be applied to separating and extracting the signal sound of the sound source A mixed with the noise while suppressing the noise. In that case, the sound source signal synthesis unit 7A in FIG. 1 may be retained, and the sound source signal synthesis unit 7B and the gates 602R1 to 602Rn inside the frame 9 indicated by the one-dot chain line may be omitted.
[0032]
When one sound source has a wider frequency band than the other and the frequency bands are known in advance, the frequency bands in which the two sound source signals do not overlap need not be subjected to the separation processing of FIG. 1. For example, when the frequency band of the signal A(t) of the sound source A is f1 to fm but the frequency band of the signal B(t) of the sound source B is f1 to fn (fn > fm), the signals of the non-overlapping bands fm+1 to fn are separated from the outputs of the microphones 1 and 2, and the signals of the bands fm+1 to fn are not subjected to the determination processing of the sound source signal determination unit 601, and in some cases not to the processing of the per-band inter-channel time difference / level difference detection unit 5 either. The sound source signal determination unit 601 controls the sound source signal selection unit 602 so that the divided band channel signals R(fm+1) to R(fn) are output as SB(fm+1) to SB(fn) to be selected as the signal of the sound source B, and SA(fm+1) to SA(fn) are output as 0. That is, the gates 602Lm+1 to 602Ln are normally closed, and the gates 602Rm+1 to 602Rn are normally open.
[0033]
In the above description, the per-band inter-channel time difference Δσi and the per-band inter-channel level difference ΔLi were judged only by whether they were positive or negative; that is, for every band signal, 0 was used as the threshold for deciding which microphone the sound source was closer to. This corresponds to the case where the sound source A and the sound source B are positioned symmetrically with respect to the perpendicular bisector of the line connecting the microphones 1 and 2. When this relationship does not hold, the determination threshold may be set as follows.
[0034]
Let ΔLA be the per-band level difference and ΔτA the inter-channel time difference with which the signal of the sound source A reaches the microphones 1 and 2, and let ΔLB be the per-band level difference and ΔτB the per-band inter-channel time difference with which the signal of the sound source B reaches the microphones 1 and 2. Then the threshold ΔLth of the per-band inter-channel level difference may be set to ΔLth = (ΔLA + ΔLB)/2, and the threshold Δτth of the inter-channel time difference to Δτth = (ΔτA + ΔτB)/2. In the embodiments described above, ΔLB = −ΔLA and ΔτB = −ΔτA, so ΔLth = 0 and Δτth = 0. The microphones 1 and 2 are positioned so that the two sound sources lie on different sides of them, allowing the sound sources A and B to be separated. The thresholds ΔLth and Δτth may be made variable so that they can be adjusted until good separation is obtained.
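The threshold rule is just the midpoint of the two sources' parameter values; a one-line sketch, with illustrative names, makes the symmetric special case explicit.

```python
def decision_thresholds(dl_a, dl_b, dtau_a, dtau_b):
    """Midpoint thresholds of the text: dl_* are the per-band level differences
    and dtau_* the inter-channel time differences of sources A and B
    (L side minus R side). Symmetric placement gives thresholds of zero."""
    return (dl_a + dl_b) / 2.0, (dtau_a + dtau_b) / 2.0
```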
[0035]
In the embodiments above, the per-band inter-channel time difference or the per-band inter-channel level difference may contain errors due to the influence of reverberation and diffraction in a room, and the sound source signals may then not be separated with high accuracy. An embodiment that mitigates this problem will be described next. As shown in FIG. 11, microphones M1, M2, and M3 are disposed at the vertices of an equilateral triangle with sides of, for example, 20 cm. The space is divided on the basis of the directivity characteristics of the microphones M1 to M3, and the divided regions are called sound source zones. If the microphones M1 to M3 are all omnidirectional and have the same characteristics, the space is divided into six zones Z1 to Z6, as shown in FIG. 12; that is, the straight lines passing through the microphones M1, M2, and M3 and the center point Cp form six zones Z1 to Z6 divided at equal angular intervals around Cp. The sound source A is located in the zone Z3 and the sound source B in the zone Z4. Each sound source zone is thus determined from the arrangement and characteristics of the microphones M1 to M3 so that one sound source belongs to one sound source zone.
[0036]
In FIG. 11, the band division unit 41 divides the acoustic signal S1 of the first channel collected by the microphone M1 into n frequency band signals S1(f1) to S1(fn), the band division unit 42 divides the acoustic signal S2 of the second channel collected by the microphone M2 into n frequency band signals S2(f1) to S2(fn), and the band division unit 43 divides the acoustic signal S3 of the third channel collected by the microphone M3 into n frequency band signals S3(f1) to S3(fn). The bands f1 to fn are common to the band division units 41 to 43, and such band division can be performed with a discrete Fourier transformer.
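The band division by discrete Fourier transform mentioned here can be sketched as follows (illustrative only; the naive DFT and the grouping of bins into equal contiguous bands are assumptions, not the patented implementation):

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (O(N^2), for illustration only)."""
    N = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N)) for k in range(N)]

def band_split(signal, n_bands):
    """Split a channel signal into n frequency bands: transform it and
    group the non-negative-frequency bins into contiguous bands."""
    spectrum = dft(signal)
    bins = spectrum[:len(spectrum) // 2 + 1]   # keep bins up to Nyquist
    size = max(1, len(bins) // n_bands)
    return [bins[b * size:(b + 1) * size] for b in range(n_bands)]

# A 3-cycle tone over 32 samples concentrates its energy in the lowest band.
tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
bands = band_split(tone, 4)
```

In the embodiment, the units 41 to 43 would each apply such a division with a common set of bands f1 to fn, one unit per microphone channel.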
[0037]
The sound source separation unit 80 separates the sound source signals using the method described with reference to FIGS. 1 to 10. However, since there are three microphones in FIG. 11, the same processing is performed for each pairwise combination of the signals of the three channels. The band division units 41 to 43 can therefore also serve as the band division unit within the sound source separation unit 80. The levels (powers) P(S1f1) to P(S1fn) of the band signals S1(f1) to S1(fn) obtained by the band division unit 41 are detected by the band level (power) detection unit 51. Similarly, the levels P(S2f1) to P(S2fn) and P(S3f1) to P(S3fn) of the band signals S2(f1) to S2(fn) and S3(f1) to S3(fn) obtained by the band division units 42 and 43 are detected by the band level detection units 52 and 53, respectively. Band level detection can also be realized with a Fourier transformer: each channel signal may be decomposed into spectra by the discrete Fourier transform and the power of each spectral component determined, or equivalently, a power spectrum may be obtained for each channel signal and then band-split. In this way each channel signal of the microphones M1 to M3 is divided into bands by the band level detection unit 50, and the level (power) of each band is output.
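The remark that a power spectrum may be obtained and then band-split can be sketched as follows (an illustrative fragment; the function name and the equal-width grouping of bins are assumptions):

```python
def power_spectrum_bands(spectrum, n_bands):
    """Band-split the power spectrum: take each bin's power |X[k]|^2,
    then sum over contiguous groups of bins to get per-band levels."""
    powers = [abs(x) ** 2 for x in spectrum]
    size = max(1, len(powers) // n_bands)
    return [sum(powers[b * size:(b + 1) * size]) for b in range(n_bands)]
```

Applied to each channel's spectrum, this yields the per-band levels P(Scf1) to P(Scfn) that the detection units 51 to 53 output.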
[0038]
Meanwhile, the full band level detection unit 61 detects the level (power) P(S1) of all frequency components of the acoustic signal S1 of the first channel collected by the microphone M1, and the full band level detection units 62 and 63 detect the levels P(S2) and P(S3) of all frequency components of the acoustic signals S2 and S3 of the second and third channels collected by the microphones M2 and M3, respectively.
[0039]
The sound source state determination unit 70 determines, by computer processing, the sound source zones in which no sound is being emitted.
First, the band levels P(S1f1) to P(S1fn), P(S2f1) to P(S2fn), and P(S3f1) to P(S3fn) obtained by the band level detection unit 50 are compared with one another for the same band, and the channel with the largest level is identified for each of the bands f1 to fn.
[0040]
By setting the number n of band divisions to a predetermined number or more, it can be assumed, as described above, that only one sound source's acoustic signal is contained in any one band. The levels P(S1fi), P(S2fi), and P(S3fi) of the same band fi can therefore be regarded as levels of sound from the same sound source. Consequently, when there is a difference between the levels P(S1fi), P(S2fi), and P(S3fi) of the same band across the first to third channels, the level is largest on the channel of the microphone closest to that sound source.
[0041]
As a result of the above processing, the channel with the largest level is assigned to each of the bands f1 to fn. The total numbers χ1, χ2, χ3 of bands in which each of the first to third channels has the largest level are then calculated over the n bands. The microphone of the channel with the larger total can be considered closer to the sound source. If the total is, for example, about 90n/100 or more, it can safely be determined that the sound source is close to the microphone of that channel; however, if the largest total is 53n/100 and the second largest is 49n/100, it is not clear whether the sound source is close to the corresponding microphone. Therefore, when a total exceeds a preset reference value ThP, for example about n/3, it is determined that the sound source is closest to the microphone of the channel corresponding to that total.
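The counting described in this paragraph, assigning each band to the channel with the largest level and testing the totals χ1, χ2, χ3 against ThP, can be sketched as follows (names and data layout are illustrative assumptions):

```python
def loudest_channel_counts(band_levels):
    """band_levels[c][i] is the level of band i on channel c.
    Returns χ: for each channel, the number of bands it is loudest in."""
    n_channels = len(band_levels)
    n_bands = len(band_levels[0])
    chi = [0] * n_channels
    for i in range(n_bands):
        best = max(range(n_channels), key=lambda c: band_levels[c][i])
        chi[best] += 1
    return chi

def nearest_channel(chi, th_p):
    """Channel whose band total exceeds ThP, or None if none dominates."""
    best = max(range(len(chi)), key=lambda c: chi[c])
    return best if chi[best] > th_p else None
```

With the example figures of the paragraph, a total near 90n/100 clears any reasonable ThP, while totals of 53n/100 against 49n/100 would both fail a ThP much above n/2.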
[0042]
The sound source state determination unit 70 also receives the levels P(S1) to P(S3) of each channel detected by the full band level detection unit 60, and when all of these levels are equal to or below a predetermined reference value ThR, it determines that no sound source is emitting sound in any zone. Control signals are generated based on the determination results of the sound source state determination unit 70, and the signal suppression unit 90 suppresses the acoustic signals SA and SB separated by the sound source separation unit 80. That is, the acoustic signal SA is suppressed (attenuated or deleted) by the control signal SAi, the acoustic signal SB by the control signal SBi, and both acoustic signals SA and SB by the control signal SABi. For example, normally closed switches 9A and 9B are provided in the signal suppression unit 90, and the output terminals tA and tB of the sound source separation unit 80 are connected to the output terminals tA' and tB' through the normally closed switches 9A and 9B, respectively. The switch 9A is opened by the control signal SAi, the switch 9B by the control signal SBi, and both switches 9A and 9B by the control signal SABi. As a matter of course, the frame of the signal separated by the sound source separation unit 80 and the frame of the signal from which the control signal used for suppression by the signal suppression unit 90 is generated are the same. The generation of the suppression (control) signals SAi, SBi, and SABi will now be described in an easy-to-understand manner.
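The action of the signal suppression unit 90 amounts to gating each separated signal by its control signal. A minimal sketch, substituting silence for suppressed signals (the function name and dictionary layout are assumptions):

```python
def suppress(signals, control):
    """signals: dict of separated source name → sample list.
    control: set of source names to suppress, e.g. {'SB'} for the control
    signal SBi or {'SA', 'SB'} for SABi. Suppressed outputs become silence."""
    return {name: ([0.0] * len(s) if name in control else s)
            for name, s in signals.items()}
```

This mirrors the normally closed switches 9A and 9B: a control signal "opens" the switch, and the corresponding output terminal carries silence.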
[0043]
Now, suppose the sound sources A and B and the microphones M1 to M3 are arranged as shown in FIG. 12, the zones Z1 to Z6 are determined accordingly, and the sound sources A and B are located in the separate zones Z3 and Z4, respectively. At this time, the distances SA1, SA2, and SA3 from the sound source A to the microphones M1 to M3 satisfy SA2 < SA3 < SA1, and the distances SB1, SB2, and SB3 from the sound source B to the microphones M1 to M3 satisfy SB3 < SB2 < SB1.
[0044]
When all the detection signals P(S1) to P(S3) of the full band level detection unit 60 are smaller than the reference value ThR, the sound sources A and B are considered not to be sounding (for example, not speaking), and both acoustic signals SA and SB are suppressed. At this time, the output acoustic signals SA and SB become silence signals (101 and 102 in FIG. 13). When only the sound source A is sounding, the frequency components of all bands of its acoustic signal reach the microphone M2 at the largest sound pressure level (power), so the band total χ2 of the channel of the microphone M2 becomes the largest.
[0045]
Likewise, when only the sound source B is sounding, the frequency components of all bands of its acoustic signal reach the microphone M3 at the largest sound pressure level, so the band total χ3 of the channel of the microphone M3 becomes the largest. Furthermore, when the sound sources A and B are both sounding, the numbers of bands arriving at the largest sound pressure level are closely matched between the microphones M2 and M3.
[0046]
Therefore, when the total number of bands whose acoustic signal reaches a microphone at the largest sound pressure level exceeds the reference value ThP described above, it is determined that a sounding source is present in the zone governed by that microphone; in this way, the zone of a sounding source can be detected. In the above example, when only the sound source A is sounding, only χ2 exceeds the reference value ThP, so it is detected that the sounding source is present in the zone Z3 governed by the microphone M2, and the acoustic signal SB is suppressed by the control signal SBi so that only the acoustic signal SA is output (103 and 104 in FIG. 13).
[0047]
Similarly, when only the sound source B is sounding, only χ3 exceeds the reference value ThP, so it is detected that the sounding source is present in the zone Z4 governed by the microphone M3, and the acoustic signal SA is suppressed by the control signal SAi so that only the acoustic signal SB is output (105 and 106 in FIG. 13). Furthermore, when the sound sources A and B are both sounding and both χ2 and χ3 exceed the reference value ThP, priority can be given, for example, to the sound source A, and processing performed as if only the sound source A were sounding; the procedure of FIG. 13 does exactly this. If neither χ2 nor χ3 has reached the reference value ThP, it is determined that both sound sources A and B are sounding as long as the levels P(S1) to P(S3) exceed the reference value ThR; none of the control signals SAi, SBi, and SABi is output, and the signal suppression unit 90 does not suppress the separated signals SA and SB (107 in FIG. 13).
[0048]
As described above, the signal suppression unit 90 suppresses, among the sound source signals SA and SB separated by the sound source separation unit 80, those corresponding to sound sources determined by the sound source state determination unit 70 not to be sounding, so that unwanted sounds are suppressed. When a sound source C is added in the zone Z6 as shown in FIG. 14, relative to the state shown in FIG. 12, the sound source separation unit 80 outputs, in addition to the signal SA corresponding to the sound source A and the signal SB corresponding to the sound source B, a signal SC corresponding to the sound source C (not shown).
[0049]
In addition to the control signal SAi for suppressing the signal SA and the control signal SBi for suppressing the signal SB, the sound source state determination unit 70 then outputs to the signal suppression unit 90 a control signal SCi for suppressing the signal SC; and besides the control signal SABi for suppressing the signals SA and SB, it outputs a control signal SBCi for suppressing the signals SB and SC, a control signal SACi for suppressing the signals SA and SC, and a control signal SABCi for suppressing all the signals SA, SB, and SC. The sound source state determination unit 70 performs processing as shown in FIG. 15.
[0050]
First, when none of the levels P(S1) to P(S3) exceeds the reference value ThR, it is determined that none of the sound sources A to C is sounding, and the sound source state determination unit 70 outputs SABCi; as a result, all of the signals SA, SB, and SC are suppressed (201 and 202 in FIG. 15). Next, when one of the sound sources A, B, and C is sounding alone, one of P(S1) to P(S3) becomes larger than ThR and, as in the two-source case, the level of the channel of the microphone closest to the sounding source is the largest, so one of the channel band totals χ1, χ2, and χ3 exceeds the reference value ThP. When only the sound source C is sounding, χ1 exceeds ThP, and the control signal SABi is output to suppress the signals SA and SB (203 and 204 in FIG. 15). When only the sound source A is sounding, the control signal SBCi is output to suppress the signals SB and SC. Furthermore, when only the sound source B is sounding, the control signal SACi is output to suppress the signals SA and SC (205 to 208 in FIG. 15).
[0051]
Next, when any two of the three sound sources A to C are sounding, the number of bands in which the microphone of the zone corresponding to the non-sounding source has the largest level becomes smaller than those of the other microphones. For example, when only the sound source C is not sounding, the band total χ1 of the microphone M1 becomes smaller than the band totals χ2 and χ3 of the other two microphones M2 and M3.
[0052]
Therefore, a reference value ThQ (< ThP) is set in advance; if χ1 becomes equal to or less than ThQ, it is determined that, of the zones Z5 and Z6 obtained by bisecting the space between the microphones M1 and M3, no sound source is emitting a signal in the zone Z6 closer to the microphone M1. It is likewise determined that, of the zones Z1 and Z2 obtained by bisecting the space between the microphones M1 and M2, no sound source is emitting a signal in the zone Z1 closer to the microphone M1.
[0053]
That is, it is determined that the sound sources in the zones Z1 and Z6 are not emitting signals. Since the sound source in these zones is the sound source C, it is determined that the sound source C is not emitting a signal; in other words, only the sound sources A and B are emitting signals, and the control signal SCi is generated to suppress the signal SC. In the state shown in FIG. 14, when just one of the three sound sources A to C is not sounding, the band totals χ1, χ2, and χ3 are usually all at or below the reference value ThP for every microphone. In FIG. 15, steps 203, 205, and 207 are therefore passed, and in step 209 it is checked whether χ1 is equal to or less than the reference value ThQ. If only the sound source C is not sounding, χ1 ≤ ThQ, and the control signal SCi is generated (210 in FIG. 15). If χ1 is not equal to or less than ThQ in step 209, it is checked in turn whether χ2 and χ3 are equal to or less than ThQ; if so, it is determined that only the sound source A or only the sound source B, respectively, is not sounding, and the control signal SAi or SBi is generated (211 to 214 in FIG. 15).
[0054]
If it is determined in step 213 that χ3 is not equal to or less than ThQ, it is determined that all of the sound sources A, B, and C are sounding, and no control signal is generated (215 in FIG. 15). In this case, the reference value ThP is about 2n/3 to 3n/4 and the reference value ThQ about n/2 to 2n/3; for example, when ThP is about 2n/3, ThQ is about n/2.
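Steps 201 to 215 of FIG. 15 for the three-source case can be summarized as follows (an illustrative sketch; the returned names denote the control signals, and the zone-to-microphone pairing C↔M1, A↔M2, B↔M3 follows FIG. 14):

```python
def three_source_control(levels, chi, th_r, th_p, th_q):
    """Sketch of the FIG. 15 flow. chi = [χ1, χ2, χ3] for microphones
    M1, M2, M3; returns a control signal name or None (all sounding)."""
    if all(p <= th_r for p in levels):
        return 'SABCi'               # 201-202: all three sources silent
    # 203-208: exactly one source sounding
    if chi[0] > th_p:
        return 'SABi'                # only C sounding → suppress A and B
    if chi[1] > th_p:
        return 'SBCi'                # only A sounding → suppress B and C
    if chi[2] > th_p:
        return 'SACi'                # only B sounding → suppress A and C
    # 209-214: exactly one source silent
    if chi[0] <= th_q:
        return 'SCi'                 # C silent
    if chi[1] <= th_q:
        return 'SAi'                 # A silent
    if chi[2] <= th_q:
        return 'SBi'                 # B silent
    return None                      # 215: all three sounding
```

With n = 12 bands, the suggested values would be roughly ThP ≈ 8 and ThQ ≈ 6 (the sketch below in the tests uses a smaller ThQ purely for illustration).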
[0055]
In the above example the space is divided into six zones Z1 to Z6; however, as shown in FIG. 16, it may instead be divided into three zones Z1 to Z3 by the dotted lines running from the center point Cp through the midpoints between the microphones, and the sound source states can be determined in the same way. In this case, for example, when only the sound source A is sounding, the band total χ2 of the channel of the microphone M2 is the largest, so it is determined that a sound source is present in the zone Z2 governed by the microphone M2. When only the sound source B is sounding, χ3 is the largest, and it is determined that a sound source is present in the zone Z3. Further, when χ1 is equal to or less than the preset value ThQ, it is determined that the sound source in the zone Z1, common to the two halves into which the space is bisected between the microphones M1 and M2 and between the microphones M1 and M3, is not sounding. By the above processing, the state of the sound sources can be determined with a three-way division just as with a six-way division.
[0056]
Although the reference values ThR, ThP, and ThQ have been described as taking the same values for all the microphones M1 to M3, they may be varied as appropriate for each microphone. Further, although the case of three sound sources and three microphones has been described, sound source zones can be detected in the same way whenever the number of microphones is equal to or greater than the number of sound sources.
[0057]
For example, in the case of four sound sources, four microphones divide the space into four zones, as in the division method of FIG. 16, so that the microphone of each channel governs one sound source. In the sound source state determination in this case, whether all four sound sources are silent or exactly one is sounding is determined by the same processing as steps 201 to 208 in FIG. 15. If neither applies, whether exactly one of the four is silent is determined by the same processing as steps 209 to 214 in FIG. 15; if none is silent, it is determined by the same processing as step 215 in FIG. 15 that all the sound sources are sounding. When three of the four sound sources are sounding (one is silent), the result may be left as it is; however, to further select the one of the three sounding sources that is closest to silence, finer control is performed as follows: the reference value is changed from ThQ to ThS (ThP > ThS > ThQ), and processing portions similar to steps 209 to 214 in FIG. 15 are provided after each of steps 210, 212, and 214 in FIG. 15 to determine the one of the three sound sources that is close to silence.
[0058]
Thus, as the number of sound sources increases, repeating the processing of steps 209 to 214 in FIG. 15 makes it possible to determine two or more sound sources that are silent or close to silence; however, the determination reference value ThS approaches ThP as the number of repetitions increases. The processing procedure described above is shown in FIG. 17 for the case of four microphones and four sound sources. First, the first to fourth channel signals S1 to S4 are taken from the microphones M1 to M4 (S01), and the levels P(S1) to P(S4) of these channel signals are detected (S02). Whether P(S1) to P(S4) are all equal to or less than the reference value ThR is checked (S03); if so, the control signal SABCDi is generated and all the separated signals are suppressed (S04). If any level is not smaller than the reference value ThR in step S03, the channel signals S1 to S4 are divided into n bands, and the levels P(S1fi), P(S2fi), P(S3fi), P(S4fi) (i = 1, ..., n) of the respective bands are obtained (S05). For each band fi, the channel M (M being one of 1, 2, 3, and 4) with the largest level of that band is determined (S06), and the totals χ1, χ2, χ3, χ4 over all n bands are obtained (S07). The largest of χ1, χ2, χ3, χ4 is determined (S08), and whether this χM is equal to or greater than a reference value ThP1 (e.g., n/3) is checked (S09); if so, a control signal is generated to suppress the separated acoustic signals of the channels other than the channel M (S010). The process may also move to step S010 immediately from step S08.
[0059]
If it is determined in step S09 that χM is not equal to or greater than the reference value, whether there is a channel M with χM equal to or less than the reference value ThQ is checked (S011). If none is at or below ThQ, all sound sources are considered to be sounding, and no control signal is generated (S012). If in step S011 there is a channel M with χM equal to or less than ThQ, a control signal SMi for suppressing the sound source signal separated for that channel M is generated (S013).
[0060]
To also suppress sources that are silent or close to silence among the separated sound source signals other than those suppressed by the control signal SMi, 1 is added to S (S014; S is initialized to 0 in advance), and whether S equals M - 1 (M here being the number of sound sources) is checked (S015); if not, ThQ is increased by ΔQ and the process returns to step S011 (S016). Step S011 is thus executed while increasing ThQ by ΔQ, within a range not exceeding ThP, until S reaches M - 1. If S = M - 1 in step S015, control signals SMi are generated to suppress the separated sound source signals corresponding to every channel M whose χM is at or below the current ThQ (S013). If necessary, the process may move to step S013 before S = M - 1 in step S015.
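One reading of steps S011 to S016, repeatedly raising ThQ by ΔQ within the limit ThP while collecting near-silent channels, can be sketched as follows (illustrative; the accumulation into a set and the stopping rule are assumptions drawn from the text):

```python
def near_silent_channels(chi, th_p, th_q, d_q, n_sources):
    """Collect channels whose band count χM stays at or below the rising
    threshold ThQ over up to n_sources - 1 passes (S ← S + 1 per pass);
    a control signal SMi would be generated for each channel returned."""
    silent = set()
    for _ in range(n_sources - 1):        # loop until S = M - 1
        silent.update(m for m, c in enumerate(chi) if c <= th_q)
        if th_q + d_q >= th_p:            # never let ThQ reach ThP
            break
        th_q += d_q                       # S016: ThQ ← ThQ + ΔQ
    return sorted(silent)
```

As ThQ creeps toward ThP, channels with progressively larger band counts are classified as near-silent, which matches the remark that ThS approaches ThP as the repetitions increase.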
[0061]
After χ1 to χ4 are calculated in step S07, it may also be checked whether any of them is equal to or greater than ThP2 (for example, 2n/3); if so, the process proceeds to step S010, and if not, to step S011 (S017). Although in the above description the control signal for the signal suppression unit 90 is generated from the per-band inter-channel level differences of the channel signals S1 to S3 of the microphones M1 to M3 in order to improve the accuracy of sound source separation, the control signal can also be generated from the per-band inter-channel time differences.
[0062]
This example is shown in FIG. 18, with corresponding parts given the same reference numerals. In this embodiment, arrival time difference signals An(S1f1) to An(S1fn) are detected by the band time difference detection unit 101 from the signals S1(f1) to S1(fn) of the bands f1 to fn obtained by the band division unit 41. Similarly, arrival time difference signals An(S2f1) to An(S2fn) and An(S3f1) to An(S3fn) are detected by the band time difference detection units 102 and 103 from the band signals S2(f1) to S2(fn) and S3(f1) to S3(fn) obtained by the band division units 42 and 43, respectively.
[0063]
In the processing for obtaining these arrival time difference signals, for example, the phase (or group delay) of the signal of each band is calculated by the Fourier transform, and by comparing the phases of the signals S1(fi), S2(fi), S3(fi) (i = 1, 2, ..., n) of the same band fi with one another, a signal corresponding to the arrival time difference of the same sound source signal can be obtained. In this case too, the division by the band division unit 40 is made fine enough that only one sound source's signal component can be considered to exist in any one band.
[0064]
If, taking any one of the microphones M1 to M3 as the reference, the arrival time difference with respect to the reference microphone is set to 0, then whether a signal arrived at each of the other microphones earlier or later than at the reference microphone can be judged, so the arrival time difference can be represented by a signed value. In this case, taking the reference microphone to be M1, for example, the arrival time difference signals An(S1f1) to An(S1fn) are all zero.
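The conversion from a per-band phase difference to a signed arrival time difference can be sketched as follows (illustrative; it assumes the true delay is small enough that the wrapped phase is unambiguous in the band, i.e. |2πfτ| < π):

```python
import cmath
import math

def arrival_time_diff(bin_ref, bin_other, freq_hz):
    """Signed arrival time difference from the phase difference of the same
    DFT bin on two channels. Positive result: the signal reaches the
    'other' microphone later than the reference microphone."""
    dphi = cmath.phase(bin_other) - cmath.phase(bin_ref)
    # wrap the phase difference into (-π, π]
    while dphi <= -math.pi:
        dphi += 2 * math.pi
    while dphi > math.pi:
        dphi -= 2 * math.pi
    # a delay τ multiplies the spectrum by exp(-2πifτ), hence the minus sign
    return -dphi / (2 * math.pi * freq_hz)
```

Applying this per band with M1 as the reference yields signed signals in the role of An(S2f1) to An(S2fn) and An(S3f1) to An(S3fn).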
[0065]
The sound source state determination unit 110 determines, by computer processing, which sound sources are not emitting sound. First, the arrival time difference signals An(S1f1) to An(S1fn), An(S2f1) to An(S2fn), and An(S3f1) to An(S3fn) obtained by the band time difference detection unit 100 are compared with one another for the same band. As a result, for each of the bands f1 to fn, the channel at which the signal arrives earliest can be determined.
[0066]
The total number of bands in which the signal is determined to arrive earliest is then calculated for each channel and compared between channels. The microphone of the channel with the larger total can be considered closer to the sound source. When the total for a channel exceeds the preset reference value ThP, it is determined that a sound source is present in the zone governed by the microphone of that channel.
[0067]
The levels P(S1) to P(S3) of each channel detected by the full band level detection unit 60 are also input to the sound source state determination unit 110; when the level of a channel is equal to or less than the preset reference value ThR, it is determined that there is no sounding source in the zone governed by the microphone of that channel. Now suppose the microphones M1 to M3 are arranged with respect to the sound sources A and B as shown in FIG. 12, with the band total χ1 for the channel of the microphone M1 and the band totals χ2 and χ3 for the channels of the microphones M2 and M3, respectively.
[0068]
In this case too, processing may proceed in the same manner as the procedure shown in FIG. 13. That is, first, when all the detection signals P(S1) to P(S3) of the full band level detection unit 60 are smaller than the reference value ThR (101), the sound sources A and B are considered not to be sounding, SABi is generated (102), and both sound source signals SA and SB are suppressed. At this time, the output signals SA' and SB' become silence signals.
[0069]
When only the sound source A is sounding, the frequency components of all bands of its sound source signal reach the microphone M2 earliest, so the band total χ2 of the channel of the microphone M2 is the largest. Likewise, when only the sound source B is sounding, the frequency components of all bands of its sound source signal reach the microphone M3 earliest, so the band total χ3 of the channel of the microphone M3 is the largest.
[0070]
Furthermore, when the sound sources A and B are both sounding, the numbers of bands arriving earliest are closely matched between the microphones M2 and M3. Therefore, when the total number of bands for which the sound source signal reaches a microphone earliest exceeds the reference value ThP, it is determined that a sound source is present in the zone governed by that microphone and is emitting sound.
[0071]
In the above example, when only the sound source A is sounding, only χ2 exceeds the reference value ThP (103 in FIG. 13), so it is detected that the sounding source is present in the zone Z3 governed by the microphone M2; the control signal SBi is generated (104), the acoustic signal SB is suppressed, and only the signal SA is output. Likewise, when only the sound source B is sounding, only χ3 exceeds the reference value ThP (105), so it is detected that the sounding source is present in the zone Z4 governed by the microphone M3; the control signal SAi is generated (106), the signal SA is suppressed, and only the signal SB is output.
[0072]
In this example ThP is set to, for example, about n/3, and when the sound sources A and B are both sounding, both χ2 and χ3 may exceed the reference value ThP. In this case, as in the processing procedure of FIG. 13, priority can be given to one sound source, in this example A, and only the signal separated for the sound source A output. If neither χ2 nor χ3 has reached the reference value ThP, it is determined that both sound sources A and B are sounding as long as the levels P(S1) to P(S3) exceed the reference value ThR; the control signals SAi, SBi, and SABi are not output (107 in FIG. 13), and the signal suppression unit 90 does not suppress the signals SA and SB.
[0073]
When a sound source C is added in the zone Z6 as shown in FIG. 14, relative to the state shown in FIG. 12, the sound source separation unit 80 outputs, in addition to the signal SA corresponding to the sound source A and the signal SB corresponding to the sound source B, a signal SC corresponding to the sound source C (not shown). Correspondingly, in addition to the control signal SAi for suppressing the signal SA and the control signal SBi for suppressing the signal SB, the sound source state determination unit 110 outputs a control signal SCi for suppressing the signal SC; and besides the control signal SABi for suppressing the signals SA and SB, it outputs a control signal SBCi for suppressing the signals SB and SC, a control signal SACi for suppressing the signals SA and SC, and a control signal SABCi for suppressing all the signals SA, SB, and SC. The sound source state determination unit 110 then performs the same processing as shown in FIG. 15 described above.
[0074]
First, when none of the levels P(S1) to P(S3) exceeds the reference value ThR, it is determined that none of the sound sources A to C is sounding; the sound source state determination unit 110 outputs SABCi, and all of the signals SA, SB, and SC are suppressed. Next, when one of the sound sources A, B, and C is sounding alone, as in the two-source case, the arrival time at the channel of the microphone closest to the sounding source is the earliest, so one of the channel band totals χ1, χ2, χ3 exceeds the reference value ThP. When only the sound source C is sounding, the control signal SABi is output to suppress the signals SA and SB. When only the sound source A is sounding, the control signal SBCi is output to suppress the signals SB and SC. Furthermore, when only the sound source B is sounding, the control signal SACi is output to suppress the signals SA and SC (203 to 208 in FIG. 15).
[0075]
Next, when any two of the three sound sources A to C are sounding, the number of bands arriving earliest at the microphone of the zone corresponding to the non-sounding source becomes smaller than those of the other microphones. For example, when only the sound source C is not sounding, the number of bands χ1 arriving earliest at the microphone M1 becomes smaller than the band totals χ2 and χ3 of the other two microphones M2 and M3.
[0076]
Therefore, a reference value ThQ (< ThP) is set in advance; if χ1 becomes equal to or less than ThQ, it is determined that, of the zones Z5 and Z6 obtained by bisecting the space between the microphones M1 and M3, no sound source is emitting a signal in the zone Z6 closer to the microphone M1, and further that, of the zones Z1 and Z2 obtained by bisecting the space between the microphones M1 and M2, no sound source is emitting a signal in the zone Z1 closer to the microphone M1.
[0077]
That is, it is determined that the sound sources in the zones Z1 and Z6 are not emitting signals. Since the sound source in these zones is the sound source C, it is determined that the sound source C is not emitting a signal; in other words, only the sound sources A and B are emitting signals, and the control signal SCi is generated to suppress the signal SC (209 and 210 in FIG. 15). The cases in which only the sound source A or only the sound source B is not emitting a signal are determined in the same way (211 to 214 in FIG. 15).
[0078]
If it is determined that none of χ1, χ2, χ3 is equal to or less than the reference value ThQ, it is determined that all of the sound sources A, B, C emit signals (215 in FIG. 15). In the above example, the space is divided into six zones Z1 to Z6; however, as shown in FIG. 16, the sound source state can be determined in the same way even if it is divided into three zones. In this case, for example, when only the sound source A is sounding, the number of bands χ2 of the channel of the microphone M2 is the largest, so it is determined that the sound source is in the zone Z2 managed by the microphone M2. Likewise, when only the sound source B is sounding, χ3 is the largest, so it is determined that the sound source is in the zone Z3. If χ1 is less than or equal to the preset value ThQ, it is determined that the sound source in zone Z1, of the two halves into which the space is divided by the microphones M1 and M3, does not emit a signal. By the above processing, even if the space is divided into three zones, the state of the sound sources can be determined as in the case of six zones.
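As an illustrative sketch only, not the patented implementation, the decision cascade of 201 to 215 in FIG. 15 can be written as a function of the counts χ1 to χ3. The function name and the zone-to-source mapping (M2 covering source A, M3 source B, M1 source C, as in the three-zone example above) are assumptions:

```python
def classify_sources(chi, th_p, th_q):
    """Classify the state of sound sources A, B, C from the counts
    chi = (chi1, chi2, chi3) of bands arriving fastest at microphones
    M1, M2, M3. Returns the set of sources judged to be emitting.
    Hypothetical sketch of the decision cascade 201-215 in FIG. 15."""
    chi1, chi2, chi3 = chi
    # One dominant microphone (count >= ThP): a single source is sounding.
    if chi2 >= th_p:          # zone of M2 -> only source A
        return {"A"}
    if chi3 >= th_p:          # zone of M3 -> only source B
        return {"B"}
    if chi1 >= th_p:          # zone of M1 -> only source C
        return {"C"}
    # One starved microphone (count <= ThQ): the source in its zone is silent.
    if chi1 <= th_q:
        return {"A", "B"}     # source C silent
    if chi2 <= th_q:
        return {"B", "C"}     # source A silent
    if chi3 <= th_q:
        return {"A", "C"}     # source B silent
    return {"A", "B", "C"}    # all three sounding
```

With ThP set well above ThQ, the first group of tests catches the single-source cases and the second group the single-silent-source cases, roughly mirroring the two stages of FIG. 15.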
[0079]
The reference values ThP and ThQ in the above case may be set in the same manner as in the case of using the band levels described earlier. Although the reference values ThR, ThP, and ThQ have been described as taking the same value for all the microphones M1 to M3, they may be changed as appropriate for each microphone. Further, although the case of three sound sources and three microphones has been described above, sound source zones can be detected in the same way whenever the number of microphones is equal to or greater than the number of sound sources; the procedure is the same as in the case of using the band levels described above. For example, with four sound sources, when three of the four are sounding (one is silent), the processing may be left as it is; however, to select the one closest to silence among those three, the reference value is changed from ThQ to ThS (ThP > ThS > ThQ), and processing steps similar to 209 to 214 in FIG. 15 are performed in the stages following 210, 212, and 214 in FIG. 15. The same applies to determining one silent sound source out of three.
[0080]
If the time difference is used instead of the level in the process shown in FIG. 17, the processing procedure of FIG. 17 can also be applied to the suppression of unnecessary signals using the arrival time difference shown in FIG. 18. In the above, the output channel signal of each microphone is first divided into bands; however, when using band-by-band levels, the power spectrum of each channel may first be determined and then divided into bands. Such an example is shown in FIG. 19 with the same reference numerals as in FIGS. 1 and 11, and only the parts that differ from these are described. In this example, each channel signal from the microphones 1 and 2 is converted into a power spectrum, for example by a fast Fourier transform, in the power spectrum decomposition unit 300, and is then divided by the band division unit 4 into bands narrow enough that each band mainly contains only one source signal, to obtain band-by-band levels. In this case, each band-by-band level supplied to the sound source signal selection unit 602 is accompanied by the phase component of the original spectrum so that the sound source signal synthesis unit 7 can reproduce the sound source signal.
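A minimal pure-Python sketch of this variant follows; the function names are hypothetical, the naive DFT stands in for the fast Fourier transform mentioned in the text, and the fixed number of bins per band is an assumption:

```python
import cmath
import math

def dft(x):
    # Naive DFT; a real implementation would use an FFT as in the text.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def band_levels_with_phase(x, bins_per_band):
    """Convert one channel signal to a spectrum, group the bins into
    fixed-width bands, and return per band the power level together with
    the original bin phases (kept so a synthesis stage can reconstruct
    the selected sound source signal)."""
    spec = dft(x)
    half = spec[: len(spec) // 2 + 1]          # non-redundant half spectrum
    bands = []
    for start in range(0, len(half), bins_per_band):
        chunk = half[start:start + bins_per_band]
        level = sum(abs(c) ** 2 for c in chunk)      # band power level
        phases = [cmath.phase(c) for c in chunk]     # retained phases
        bands.append((level, phases))
    return bands
```

Keeping the per-bin phases alongside each band level is what allows a later synthesis stage to invert the transform and reproduce the sound source signal.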
[0081]
Further, each band level is supplied to the inter-band level difference detection unit 5 and the sound source state determination unit 70, and the other operations are the same as those described with reference to FIG. 1 and FIG. 11. In the embodiment described with reference to FIG. 2, it may be determined from which sound source each band division signal has arrived using only the corresponding inter-channel time difference, without using the inter-channel level difference. Similarly, in the embodiment described with reference to FIG. 5, it may be determined from which sound source each band division signal has arrived using only the corresponding inter-channel level difference, without using the inter-channel time difference. The detection of the inter-channel level difference in the embodiment of FIG. 5 may use the levels before conversion to logarithmic levels. The numbers of frequency bands used by the band division unit 4 in FIG. 1, the band division units 40 in FIGS. 11 and 18, the band division unit 233 in FIG. 20, and the band division unit 241 need not be the same; they may differ from one another depending on the required accuracy. The band division unit 233 in FIG. 20 may first obtain the power spectrum of the input signal and then divide it into a plurality of frequency bands for subsequent processing.
[0082]
Hereinafter, experimental examples to which the present invention shown in FIG. 6 to FIG. 9 is applied will be described. The present invention was applied to the three combinations of two sound source signals shown in FIG. 20, and the frequency resolution given by the band division unit 4 was varied to evaluate the separated signals physically and subjectively. The mixed signals before separation processing were created on a computer by adding the two sources after giving them only an inter-channel time difference and level difference. The given inter-channel time difference and level difference are 0.47 ms and 2 dB, respectively.
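The construction of such a test mixture can be sketched as follows; the helper name and the sample rate used in the example are assumptions, while the 0.47 ms and 2 dB values are taken from the text:

```python
def mix_two_sources(s1, s2, fs, time_diff_s=0.00047, level_diff_db=2.0):
    """Create a two-channel mixture on a computer by giving each source an
    opposite inter-channel time difference and level difference. A sketch:
    the patent states 0.47 ms and 2 dB but not the sample rate, so the
    delay is rounded to whole samples here."""
    d = round(fs * time_diff_s)              # delay in samples
    g = 10.0 ** (-level_diff_db / 20.0)      # attenuation of the far channel

    def delayed(x, n):
        return [0.0] * n + list(x)

    n = max(len(s1), len(s2)) + d

    def pad(x):
        return list(x) + [0.0] * (n - len(x))

    # Source 1 reaches channel 1 first and louder; source 2 the opposite.
    ch1 = [a + g * b for a, b in zip(pad(s1), pad(delayed(s2, d)))]
    ch2 = [g * a + b for a, b in zip(pad(delayed(s1, d)), pad(s2))]
    return ch1, ch2
```

At a sampling rate of 10 kHz, for instance, the 0.47 ms difference rounds to a 5-sample delay between the channels.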
[0083]
The frequency resolution of the band division unit 4 was set to five values: about 5 Hz, 10 Hz, 20 Hz, 40 Hz, and 80 Hz. The signals separated at these resolutions and the original signal (OS), a total of six types of signals, were evaluated. The signal band is about 5 kHz. Quantitative evaluation was performed as follows. If the separation of the mixed signal is complete, the original signal and the separated signal become equal; that is, their correlation coefficient is 1. Therefore, the correlation coefficient between the original signal and the processed signal was calculated for each sound as a physical quantity measuring the degree of separation.
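The degree-of-separation measure described above is the ordinary Pearson correlation coefficient, which can be computed as in this sketch (the function name is ours):

```python
import math

def correlation_coefficient(x, y):
    """Pearson correlation between the original signal x and the separated
    signal y; a value of 1.0 means the separated signal matches the
    original up to gain and offset, i.e. complete separation."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Because the measure is normalized, a separated signal that differs from the original only by an overall gain still scores 1.0.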
[0084]
The results are shown by the dashed lines in FIG. 22. For every combination, the speech showed a considerably low correlation value at a frequency resolution of 80 Hz, but no significant difference was observed among the other resolutions. For the bird's call, there was no significant difference among the frequency resolutions used. Subjective evaluation was performed as follows.
[0085]
The subjects were five Japanese in their twenties and thirties with normal hearing. For each sound source, the separated sounds at the five frequency resolutions and the original sound were presented diotically through headphones in random order, and the sound quality was rated on a five-point scale. The presentation time of each sound was about 4 seconds. The results are shown by the solid lines in FIG. 22. The separated sound S1 received the highest rating at a frequency resolution of 10 Hz, and there were significant differences (α < 0.05) between the ratings for all conditions. For the separated sounds S2 to S4 and S6, the rating at a frequency resolution of 20 Hz was the highest, but there was no significant difference between 20 Hz and 10 Hz; there were significant differences between the 20 Hz sound and 5 Hz, 40 Hz, and 80 Hz, respectively. From these results, it was found that for speech an optimum frequency resolution exists regardless of the combination to be separated; in this experiment, about 10 to 20 Hz is the optimum value. The separated sound S5 (bird's call) was rated highest at 40 Hz, but significant differences existed only between 40 Hz and 5 Hz, and between 20 Hz and 5 Hz. In every case, there was a significant difference between the sound after separation processing and the original sound.
[0086]
The effects of the present invention are shown in FIGS. 21 and 23. FIG. 21 shows the spectrum 201 of the mixed voice of a male voice and a female voice before separation processing, and the spectra 202 and 203 of the male voice S1 and the female voice S2 after separation processing according to the present invention. FIG. 23 shows the waveforms of the original male voice S1 and female voice S2 before separation processing at A and B, the waveform of the mixed voice at C, and the waveforms of the male voice S1 and the female voice S2 after separation processing at D and E, respectively. It can be seen from FIG. 21 that unnecessary components are suppressed, and from FIG. 23 that the speech after separation processing is restored with the same quality as the original speech.
[0087]
The resolution of the band division is preferably about 10 to 20 Hz for voice; resolutions of 5 Hz or less or of 50 Hz or more are not preferable. The band division method is not limited to the Fourier transform; the signal may instead be divided by band-pass filters. Next, an experimental example is described in which the signal suppression unit 90 performs signal suppression by determining the sound source state using the level difference, as shown in FIG. 11. Using two microphones, the two sound sources A and B were placed at a distance of 1.5 m from a dummy head with an angular separation of 90 degrees (45 degrees to the right and 45 degrees to the left with respect to the midpoint of the two microphones), and the sound was collected at equal sound pressure levels in a variable reverberation room with a reverberation time of 0.2 s (500 Hz). The combinations of the mixed sounds and the separated sounds used are S1 to S4 in FIG. 20.
[0088]
For the separated voices S1 to S4, the ratio of the number of frames determined to be silent to the number of silent frames of the original sound was calculated. Silence was correctly detected in more than 90% of the frames in every case:

Male voice (S1): 99%
Female voice (S2): 93%
Female voice 1 (S3): 92%
Female voice 2 (S4): 95%

The sounds separated by the basic method shown in FIGS. 6 to 9 and by the improved method shown in FIG. 11 were presented diotically through headphones in random order and rated for low degree of noise mixing and for low feeling of discontinuity. The separated sounds used were S1 to S4 described above, and the subjects were five Japanese in their twenties to thirties with normal hearing. The presentation time of each sound was about 4 seconds, and the number of trials for each sound was three. As a result, for the degree of noise mixing, 91.7% of the ratings judged the improved method to mix in less noise, against 8.3% for the basic method. On the other hand, for the feeling of discontinuity, 20.0% of the ratings favored the improved method and 80.0% favored the basic method, but no significant difference from the improved method was observed.
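The detection rate reported above can be computed as in the following sketch, assuming frame-wise true/false silence decisions for the original and the separated sound (the function and its interface are hypothetical):

```python
def silence_detection_rate(original_silent, detected_silent):
    """Fraction of the original sound's silent frames that the separation
    process also judged silent -- the quantity reported as the detection
    rate in the experiment."""
    silent_frames = [i for i, s in enumerate(original_silent) if s]
    if not silent_frames:
        return 1.0  # nothing to detect
    hits = sum(1 for i in silent_frames if detected_silent[i])
    return hits / len(silent_frames)
```

A rate near 1.0 means almost every genuinely silent frame of the original was recognized as silent after separation.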
[0089]
Next, in order to make a relative evaluation of the separation performance, the degrees of separation of the following five types of sounds were compared by subjective evaluation. (1) Original sound: the original sound of each source before mixing. (2) Basic method (computer): a sound obtained by separating, by the basic method, a mixed signal created on a computer by adding the sources after giving them an inter-channel time difference (0.47 ms) and level difference (2 dB). (3) Improved method (real environment): a sound obtained by separating, by the improved method, the mixed sound picked up under the conditions used in the preceding silent-section detection experiment. (4) Basic method (real environment): a sound obtained by separating, by the basic method, the mixed sound picked up under the same conditions. (5) Mixed sound: the mixed sound picked up under the same conditions, without separation processing.
[0090]
With respect to the first two mixed sounds in FIG. 20, a total of 20 types of sounds, namely the original sounds, the sounds processed by the methods (2) to (4) above, and the mixed sounds, were presented diotically through headphones in random order, and the degree of separation was rated on a seven-point scale: seven points for "most separated" and one point for "least separated". The subjects, the presentation time of each sound, and the number of trials are the same as in the evaluation of the degree of noise mixing.
[0091]
The results are shown in FIG. 24, where the results for all sound sources (S0) are shown at A, for the male voice (S1) at B, for the female voice (S2) at C, for female voice 1 (S3) at D, and for female voice 2 (S4) at E. The result analyzed over all the sound sources (S0) and the results analyzed for each type of sound source (S1) to (S4) show almost the same tendency. In every case S0 to S4, the separation was rated highest for "(1) Original sound", followed in order by "(2) Basic method (computer)", "(3) Improved method (real environment)", "(4) Basic method (real environment)", and "(5) Mixed sound". That is, in a real environment the improved method is better than the basic method.
[0092]
As described above, according to the present invention, each channel signal from a plurality of microphones is divided into bands narrow enough that the main component of each band consists only of the component of one sound source signal; the level and arrival time of each band are detected, and from these it is determined to which sound source signal each band belongs. The sound source signals can thereby be correctly separated, and processing in real time is possible.
[0093]
In particular, by detecting a sound source that is not sounding and suppressing its components, accurate separation is possible even in a reverberant room or the like.
[0094]
Brief description of the drawings
[0095]
1 is a block diagram showing a functional configuration of an embodiment of the sound source
separation device of the present invention.
[0096]
2 is a flow chart showing the processing procedure of the embodiment of the sound source
separation method of the present invention.
[0097]
3 is a flowchart showing an example of a processing procedure for obtaining the inter-channel time differences Δτ1 and Δτ2 in FIG. 2.
[0098]
FIGS. 4A and 4B each show an example of the spectrum of two sound source signals.
[0099]
5 is a flow chart showing the processing procedure of the embodiment of performing sound
source separation using the inter-channel level difference in the sound source separation method
of the present invention.
[0100]
6 is a flow chart showing a part of the processing procedure of the embodiment using the inter-channel level difference and the inter-channel arrival time difference in the sound source separation method of the present invention.
[0101]
7 is a flowchart showing the continuation of step S08 in FIG. 6.
[0102]
8 is a flowchart showing the continuation of step S09 in FIG. 6.
[0103]
9 is a flowchart showing the continuation of step S10 in FIG. 6 and of steps S20 and S30 in FIGS. 7 and 8.
[0104]
10 is a block diagram showing a functional configuration of an embodiment for separating sound
source signals having different frequency bands.
[0105]
11 is a block diagram showing a functional configuration of the embodiment of the sound source
separation device of the present invention to which a configuration for suppressing the
unnecessary sound source signal using the level difference is added.
[0106]
12 is a diagram showing an example of the arrangement of two microphones, the zones covered by the microphones, and two sound sources.
[0107]
13 is a flow chart showing an example of the processing procedure for detecting the zone of the sound source and generating the suppression control signal when one sound source is sounding.
[0108]
14 is a diagram showing an example of the arrangement of three microphones, their corresponding zones, and three sound sources.
[0109]
15 is a flow chart showing an example of the process of detecting the zone of the sound source
and the process of generating the suppression control signal when there are three sound sources.
[0110]
16 is a diagram showing an example in which the space is divided into three zones by three microphones, and an example of the arrangement of sound sources.
[0111]
17 is a flow chart showing an example of a processing procedure for generating a control signal for suppressing the synthesized signal of a sound source that is not sounding in the sound source separation device of the present invention.
[0112]
18 is a block diagram showing a functional configuration of the embodiment of the sound source
separation device of the present invention to which a configuration for suppressing the
unnecessary sound source signal using the arrival time difference is added.
[0113]
19 is a block diagram showing a functional configuration of an embodiment in the case where
band division is performed after the power spectrum is obtained by the sound source separation
device according to the present invention.
[0114]
FIG. 20 shows the types of sound sources used in the experiments of the present invention.
[0115]
21 is a diagram showing speech spectra before and after processing by the method of the embodiment shown in FIG. 6 to FIG. 9.
[0116]
22 is a diagram showing the results of the subjective evaluation experiment using the method of the embodiment shown in FIG. 6 to FIG. 9.
[0117]
23 is a diagram showing the audio waveforms after processing by the method of the embodiment shown in FIG. 6 to FIG. 9 and the corresponding original audio waveforms.
[0118]
24 is a diagram showing experimental results for the sound source separation method shown in FIG. 6 to FIG. 9 and the sound source separation device shown in FIG. 11.