Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2009047803
The present invention provides an acoustic signal processing method capable of generating appropriate weighting coefficients and realizing a high noise suppression effect without performing complicated calculations. According to one embodiment, the method comprises: preparing at least one dictionary of weighting factors used for weighting, learned so as to optimize an evaluation function defined by a weighted learning acoustic signal and a target acoustic signal corresponding to the learning acoustic signal; estimating a noise component included in an input acoustic signal; obtaining a feature amount dependent on the noise component of the input acoustic signal; selecting a weighting factor corresponding to the feature amount from the dictionary; and weighting the input acoustic signal using the selected weighting factor to generate a processed output acoustic signal. [Selected figure] Figure 1
Acoustic signal processing method and apparatus
[0001]
The present invention relates to an acoustic signal processing method and apparatus capable of
suppressing noise components in an input acoustic signal.
[0002]
When making a call on a cell phone or cordless phone, ambient noise mixed in with the speaker's
voice interferes with the call.
Also, when speech recognition technology is used in a real environment, ambient noise can be a factor that reduces the recognition rate. A noise canceller is often used as one method of solving such noise problems.
[0003]
Among noise cancellers, the Minimum Mean-Square Error (MMSE) method disclosed in Non-Patent Documents 1 and 2 achieves both a high noise suppression amount and a high subjective evaluation value, and is widely used as a comprehensively excellent method. In the MMSE method, an estimated value of the target acoustic signal is obtained by multiplying each frequency component of the input acoustic signal from the microphone by a weighting factor. To determine the weighting factor, it is assumed that the target acoustic signal and the noise component included in the input acoustic signal follow independent Gaussian distributions, and the weighting factor is determined analytically.
[0004]
On the other hand, Non-Patent Document 3 describes a noise suppression technique using a plurality of microphones: a method of performing noise suppression effectively by configuring a Wiener filter using the cross spectrum between channels. Y. Ephraim, D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. ASSP vol. 32, pp. 1109-1121, 1984. Y. Ephraim, D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Trans. ASSP vol. 33, pp. 443-445, 1985. R. Zelinski, "A Microphone Array with Adaptive Post-filtering for Noise Reduction," IEEE ICASSP 88, pp. 2578-2581, 1988.
[0005]
A method of analytically determining a weighting coefficient by applying a statistical model such as a Gaussian distribution to the target acoustic signal or noise component requires complicated function calculations, so the amount of computation increases. In addition, the target acoustic signal and the noise component do not necessarily follow the statistical model assumed in advance, such as a Gaussian distribution; when they deviate from the statistical model, the obtained weighting coefficient is no longer appropriate and the noise suppression performance is degraded.
[0006]
An object of the present invention is to make it possible to generate appropriate weighting coefficients without complicated calculations, thereby realizing a high noise suppression effect.
[0007]
According to a first aspect of the present invention, there is provided an acoustic signal processing method comprising: preparing, in at least one dictionary, weighting coefficients used for weighting, learned so as to optimize an evaluation function defined by a weighted learning acoustic signal and a target acoustic signal corresponding to the learning acoustic signal; estimating a noise component included in an input acoustic signal; obtaining a feature amount dependent on the noise component of the input acoustic signal; selecting a weighting factor corresponding to the feature amount from the dictionary; and weighting the input acoustic signal using the selected weighting factor to generate a processed output acoustic signal.
[0008]
According to a second aspect of the present invention, there is provided an acoustic signal processing method comprising: calculating at least one feature quantity representing correlation between channels of input acoustic signals of a plurality of channels; selecting, according to the feature quantity, a weighting factor previously obtained by learning from at least one dictionary; performing signal processing including weighted addition on the input acoustic signals of the plurality of channels to generate an integrated acoustic signal; and weighting the integrated acoustic signal using the weighting factor to generate a processed output acoustic signal.
[0009]
According to a third aspect of the present invention, there is provided a program causing a computer to perform acoustic signal processing, comprising: a process of preparing, in at least one dictionary, weighting factors used for weighting, learned so as to optimize an evaluation function defined by a weighted learning acoustic signal and a target acoustic signal corresponding to the learning acoustic signal; a process of estimating a noise component included in an input acoustic signal; a process of obtaining a feature amount dependent on the noise component of the input acoustic signal; a process of selecting a weighting factor corresponding to the feature amount from the dictionary; and a process of weighting the input acoustic signal using the selected weighting factor to generate a processed output acoustic signal.
[0010]
According to a fourth aspect of the present invention, there is provided a program causing a computer to perform acoustic signal processing, comprising: a process of calculating at least one feature quantity representing a correlation between channels of input acoustic signals of a plurality of channels; a process of selecting, according to the feature quantity, a weighting factor previously obtained by learning from at least one dictionary; a process of performing signal processing including weighted addition on the input acoustic signals of the plurality of channels to generate an integrated acoustic signal; and a process of weighting the integrated acoustic signal using the weighting factor to generate a processed output acoustic signal.
[0011]
According to the present invention, since the weighting factors are obtained by learning, they can be obtained simply by referring to the learning result, without performing complicated calculations.
Also, since the characteristics of the signal are reflected directly in the weighting coefficients without passing through a statistical model, when the statistical properties of the target speech or noise differ from those of an assumed statistical model, a noise suppression effect higher than that of methods using statistical models such as MMSE can be realized.
[0012]
Hereinafter, embodiments of the present invention will be described.
First Embodiment: As shown in FIG. 1, in the acoustic signal processing device according to the first embodiment of the present invention, N channels of input acoustic signals from a plurality (N) of microphones 101-1 to 101-N are input to the feature amount calculation unit 102 and to the weighting units 105-1 to 105-N.
The feature amount calculation unit 102 calculates the feature amount of the input acoustic signal by processing that includes estimation of the noise component contained in the input acoustic signal.
The weighting factor dictionary 103 stores a large number of weighting factors obtained in advance by the learning unit 100.
[0013]
The selection unit 104 selects, from the weighting factor dictionary 103, a weighting factor corresponding to the feature quantity calculated by the feature amount calculation unit 102. The weighting units 105-1 to 105-N multiply the input acoustic signals by the weighting factors selected by the selection unit 104, thereby generating output acoustic signals in which noise is suppressed.
[0014]
Next, the processing procedure of this embodiment will be described with reference to the flowchart of FIG. 2. The electrical signals output from the microphones 101-1 to 101-N, that is, the input acoustic signals x1(t) to xN(t) (N ≥ 1), are input to the feature amount calculation unit 102. The feature amount calculation unit 102 estimates the noise components contained in the input acoustic signals x1(t) to xN(t) (step S11) and calculates feature quantities of the input acoustic signals x1(t) to xN(t) that depend on those noise components (step S12). An example of such a feature quantity is the signal-to-noise ratio (SNR) given by the following equation.
[0015]
SNRn(t) = SGn(t) / NSn(t) … (1)

where SGn(t) and NSn(t) are the powers of the signal component and the noise component of the input acoustic signal, n is the channel number (the number of the microphones 101-1 to 101-N), and t is the time.
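As a concrete illustration, the following is a minimal Python sketch of the per-frame SNR feature of equation (1); the frame length, the flooring constant, and the assumption that the noise power NS has already been estimated from a signal-absent section are illustrative choices, not part of the embodiment.

    import numpy as np

    def snr_feature(x_n, noise_power, frame_len=256):
        # SNR_n(t) = SG_n(t) / NS_n(t), computed frame by frame for one channel.
        # noise_power plays the role of NS_n(t), assumed estimated beforehand.
        n_frames = len(x_n) // frame_len
        snr = np.empty(n_frames)
        for l in range(n_frames):
            frame = x_n[l * frame_len:(l + 1) * frame_len]
            sg = np.mean(frame ** 2)               # short-time power of the input
            snr[l] = sg / max(noise_power, 1e-12)  # floor avoids division by zero
        return snr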
[0016]
The estimation of the noise component is usually performed using the input acoustic signal in a section where the desired signal component (the target acoustic signal) is absent.
The SNRn (t) of equation (1) may be updated sequentially or may be averaged over a certain time
width.
[0017]
Next, the selection unit 104 selects a weighting factor corresponding to SNRn(t) from the weighting factor dictionary 103 (step S13). The weighting factor dictionary 103 stores weighting factors learned in advance for each value of SNRn(t). The learning is described in detail later.
[0018]
Finally, the weighting units 105-1 to 105-N multiply the input acoustic signals x1(t) to xN(t) by the weighting factors selected by the selection unit 104, generating output acoustic signals y1(t) to yN(t) in which noise is suppressed (step S14).
[0019]
In the weighting factor dictionary 103, the weighting factors may be prepared independently for
each channel, or may be common among the channels.
When the microphones 101-1 to 101-N are adjacent to each other, sharing the weighting factors among the channels makes it possible to reduce the storage capacity used for the weighting factor dictionary 103 without degrading performance.
[0020]
The feature amount calculation unit 102 may calculate the feature amount independently for each channel, but it is also effective to reduce statistical variation by averaging the powers of the signal components and the noise components of the input acoustic signals x1(t) to xN(t) over a plurality of channels. Various other configurations of the feature amount are possible, such as obtaining a feature amount independently for each channel and forming a vector with each feature amount as an element, to be used as a multi-dimensional feature amount.
[0021]
When the weighting units 105-1 to 105-N perform filtering in the time domain, the output acoustic signal yn(t) is expressed as the convolution of the weighting coefficients wn with the input acoustic signal xn(t):

yn(t) = Σ_{i=0}^{L−1} wn(i) xn(t − i) … (2)
[0022]
Here the weighting factor is written wn = {wn(0), wn(1), ..., wn(L−1)}, where L is the filter length.
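Equation (2) is ordinary FIR filtering, so a sketch is one line of Python (the truncation to the input length is an illustrative choice):

    import numpy as np

    def weight_time_domain(x_n, w_n):
        # y_n(t) = sum_{i=0}^{L-1} w_n(i) * x_n(t - i): convolution with the
        # selected weighting factor, i.e. time-domain FIR filtering.
        return np.convolve(x_n, w_n)[:len(x_n)]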
[0023]
According to the present embodiment, the weighting factor used for weighting is selected, based on the feature amount of the input acoustic signal, from the weighting factor dictionary 103 obtained by prior learning. In situations where the types of noise are limited, such as inside a car, the noise suppression performance can thereby be improved more effectively than with a general statistical model. What matters here is how the learning unit 100 performs the prior learning; detailed learning methods are described in the following embodiments.
[0024]
Second Embodiment: In the acoustic signal processing apparatus according to the second embodiment of the present invention shown in FIG. 3, the input acoustic signals from the microphones 101-1 to 101-N (N ≥ 1) are input to the Fourier transform units 110-1 to 110-N and converted from time-domain signals to frequency-domain signals.
[0025]
The feature amount calculation unit 102 includes a noise estimation unit 108 that estimates the noise components in the input acoustic signal from the output signals of the Fourier transform units 110-1 to 110-N, an a priori SNR calculation unit 106 that calculates the a priori SNR of the input acoustic signal, and an a posteriori SNR calculation unit 107 that calculates the a posteriori SNR of the input acoustic signal.
The calculated a priori SNR and a posteriori SNR are given to the selection unit 104 and used to select a weighting factor from the weighting factor dictionary 103.
[0026]
In the weighting units 105-1 to 105-N, the output signals from the Fourier transform units 110-1 to 110-N are weighted by the weighting factors selected by the selection unit 104. The weighted signals are converted into time-domain output acoustic signals by the inverse Fourier transform units 111-1 to 111-N.
[0027]
Next, the operation principle of the present embodiment will be described. The input acoustic signal xn(t) from the n-th microphone 101-n is converted into frequency components Yn(l, k) by the Fourier transform unit 110-n, where l is the frame number and k is the frequency number. The Fourier transform is usually performed for every predetermined frame length (L samples) to obtain K frequency components. In practice, roughly half of the K frequency components are symmetric counterparts, so it is common to exclude them from the processing. When a signal already converted into the frequency domain is supplied as the input acoustic signal, the Fourier transform units 110-1 to 110-N are unnecessary. In the following description the channel number n is omitted, and Yn(l, k) is written Y(l, k).
[0028]
In the present embodiment,

Y(l, k) = X(l, k) + N(l, k) … (3)

[0029]
that is, the input acoustic signal Y(l, k) is expressed as the sum of the target acoustic signal X(l, k) and the noise component N(l, k), as in equation (3).
[0030]
The noise estimation unit 108 estimates statistical properties of the noise; the simplest example is the average value of the noise power (referred to as the estimated noise power).
There are various methods for calculating the estimated noise power. For example, a simple method is to detect a noise section and calculate the average power of the detected section.
Other methods are described in detail in Rainer Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001 (Reference 4) and in the references cited therein; various methods have been studied.
[0031]
Next, the operation of the a posteriori SNR calculation unit 107 will be described. The a posteriori SNR is defined as the ratio of the power of the input acoustic signal to the power of the noise component:

[0032]

γ(l, k) = R²(l, k) / λd(l, k) … (4)

where R²(l, k) and λd(l, k) are, respectively, the power of the input acoustic signal (the square of the amplitude spectrum) and the power of the estimated noise component in the k-th band of the l-th frame.
[0033]
Next, the operation of the a priori SNR calculation unit 106 will be described. The a priori SNR is defined as the ratio of the power of the target acoustic signal contained in the input acoustic signal to the power of the noise component. Since the target acoustic signal cannot be observed directly, an estimate of the a priori SNR is used. A representative calculation method is the decision-directed approach described in Non-Patent Document 1:

[0034]

ξ(l, k) = α G(l−1, k)² γ(l−1, k) + (1 − α) P[γ(l, k) − 1] … (5)

Here, G(l−1, k) is the weighting coefficient of the preceding frame, α is a smoothing coefficient, and P[·] is an operation that replaces the bracketed value with 0 if it is negative. Various modifications of the a priori SNR calculation are conceivable, such as using P[γ(l, k) − 1] by itself or adaptively changing α in equation (5).
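Equation (5) can be evaluated frame by frame as sketched below (Python); the value alpha = 0.98 is a common choice in the literature, not one prescribed here:

    import numpy as np

    def prior_snr_dd(gamma, G_prev, gamma_prev, alpha=0.98):
        # xi(l, k) = alpha * G(l-1, k)^2 * gamma(l-1, k)
        #            + (1 - alpha) * P[gamma(l, k) - 1]
        # P[.] replaces negative values with 0, as stated in paragraph [0034].
        return alpha * G_prev ** 2 * gamma_prev \
            + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)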
[0035]
Although the a priori SNR and the a posteriori SNR described above are expressed in the form of signal-to-noise ratios, it is also possible to treat the numerator and denominator independently. For example, the a posteriori SNR can be separated into a two-dimensional vector (R²(l, k), λd(l, k)) whose elements are the numerator and denominator of equation (4), and the a priori SNR into a two-dimensional vector whose elements are the numerator and denominator of equation (5). A method using only some of these elements (for example, a total of three dimensions consisting of the first element of the a priori SNR together with the a posteriori SNR) is also possible. Furthermore, it is possible to include the SNRs of the input acoustic signals of the other channels, configuring one feature quantity from the SNRs of the input acoustic signals of all the channels and sharing it among all the channels.
[0036]
Next, the operation of the selection unit 104 will be described. The selection unit 104 selects, from the weighting factor dictionary 103, a weighting factor corresponding to the a priori SNR ξ(l, k) and the a posteriori SNR γ(l, k) input from the feature amount calculation unit 102, that is, corresponding to the feature amount f(l, k) = (ξ(l, k), γ(l, k)). The weighting factor dictionary 103 stores a large number of weighting factors learned in advance.
[0037]
A simple method of associating the feature amount f(l, k) = (ξ(l, k), γ(l, k)) with the weighting coefficient W(l, k) in the weighting coefficient dictionary 103 is to prepare representative feature quantities (representative points) together with a weighting factor for each, select the representative vector closest to the input feature quantity, and output the weighting factor corresponding to that representative vector. More generally, using a function F that takes the feature amount as input, the correspondence between the feature amount f(l, k) = (ξ(l, k), γ(l, k)) and the weighting factor W(l, k) is expressed as

W(l, k) = F(ξ(l, k), γ(l, k)) … (6)
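A minimal sketch of the table-lookup realization of F in equation (6) (Python; the array shapes and the Euclidean distance are assumptions of this sketch):

    import numpy as np

    def select_weight(xi, gamma, reps, dictionary):
        # xi, gamma: (K,) a priori / a posteriori SNR for one frame.
        # reps: (C, 2) representative (xi, gamma) points from learning.
        # dictionary: (C, K) weighting factors, one row per cluster.
        f = np.stack([xi, gamma], axis=-1)               # f(l, k), shape (K, 2)
        d = np.linalg.norm(f[:, None, :] - reps[None, :, :], axis=-1)  # (K, C)
        nearest = np.argmin(d, axis=1)                   # closest representative
        return dictionary[nearest, np.arange(len(xi))]   # W(l, k) per frequency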
[0038]
Finally, the weighting units 105-1 to 105-N multiply the input spectra, that is, the frequency-domain signals from the Fourier transform units 110-1 to 110-N, by the weighting factor to obtain an estimated value of the target acoustic signal:

X̂(l, k) = W(l, k) Y(l, k) … (7)
[0039]
The signal of equation (7) may, as necessary, be inverse-transformed by the inverse Fourier transform units 111-1 to 111-N and used as a time-domain signal. Alternatively, the time-domain representation of the inverse transform of equation (7),

[0040]

x̂(t) = w(t) * y(t)

may be used, where * represents the convolution shown in equation (2); this can be realized as time-domain filtering.
[0041]
In Non-Patent Documents 1 and 2, the weighting factor W(l, k) is determined analytically on the assumption that the target acoustic signal and the noise component follow Gaussian distributions. If the acoustic signal actually treated exhibits statistical properties close to this assumption, the techniques of Non-Patent Documents 1 and 2 are effective; however, actual acoustic signals do not necessarily follow a Gaussian distribution. Research has also been conducted on applying the Laplace distribution or the gamma distribution, but problems remain, such as computational complexity and the need to compromise with approximate solutions. Furthermore, actual acoustic signals often have more complicated distributions than these, and the very assumption of a statistical model often becomes a problem.
[0042]
In this embodiment, to solve this problem, the function F() of equation (6) is learned in advance using the target acoustic signal and signals close to the noise components actually to be handled, instead of assuming a statistical model; in actual use of the acoustic signal processing apparatus, the weighting factor is determined according to the function F(). As a result, although the benefit is limited to environments similar to the one at the time of learning, good performance is obtained under those conditions. For example, when the acoustic signal processing device according to the present embodiment is mounted in a car, good noise suppression performance while driving can be realized by learning in advance using driving noise.
[0043]
Another advantage of this embodiment is that, since the weighting coefficient stored in the weighting coefficient dictionary 103 is looked up based on the feature amount of the input acoustic signal, there is no need to derive the weighting coefficient from a complicated calculation formula. Even with the conventional method, weighting factors can be calculated in advance for discrete values of the a priori SNR and the a posteriori SNR (in 1 dB steps, for example) and provided as table data; the present embodiment provides a way of making such table data of weighting factors better suited to the actual environment.
[0044]
Hereinafter, the learning method of the weighting factors in the present embodiment will be described. First, a learning acoustic signal is prepared as the input acoustic signal, and a target acoustic signal is prepared as the ideal output acoustic signal. For example, when it is desired to emphasize only the speech in a speech signal buried in noise, the learning acoustic signal is a speech signal with noise superimposed on it, and the target acoustic signal is the speech-only signal. These signals are typically prepared by adding noise components to speech signals on a computer and by using the speech signals alone.
[0045]
Next, the learning acoustic signal and the target acoustic signal are Fourier-transformed frame by frame to obtain their respective frequency components X(l, k) and S(l, k), where l is the frame number and k is the frequency component number. Next, the feature amount f(l, k) is calculated from X(l, k). As many values of f(l, k) are obtained as there are frames of the learning input acoustic signal; these are classified into a predetermined number of clusters by a clustering algorithm such as the LBG algorithm. The centroid of each cluster is stored as a representative point and used for clustering at the time of processing.
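For reference, a compact LBG-style codebook trainer using binary splitting with k-means refinement (Python; the split perturbation, the iteration count, and the assumption that the number of clusters is a power of two are illustrative):

    import numpy as np

    def lbg(features, n_clusters, iters=20, eps=1e-3):
        # features: (M, D) feature vectors f(l, k) gathered from learning data.
        # Returns (n_clusters, D) representative points (cluster centroids);
        # n_clusters is assumed to be a power of two here.
        code = features.mean(axis=0, keepdims=True)
        while len(code) < n_clusters:
            code = np.vstack([code * (1 + eps), code * (1 - eps)])  # split
            for _ in range(iters):                                  # refine
                d = ((features[:, None, :] - code[None, :, :]) ** 2).sum(axis=-1)
                idx = np.argmin(d, axis=1)
                code = np.array([features[idx == c].mean(axis=0)
                                 if np.any(idx == c) else code[c]
                                 for c in range(len(code))])
        return code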
[0046]
The weighting factor is obtained by setting a predetermined evaluation function and optimizing it for each cluster. For example, consider the evaluation function given by the sum of the powers of the errors between the amplitude of the learning acoustic signal X(l, k) classified into the i-th cluster Ci, multiplied by the weight Wi(k), and the amplitude of the corresponding target acoustic signal S(l, k):

Ji(k) = Σ_{l∈Ci} ( Wi(k) |X(l, k)| − |S(l, k)| )² … (9)

[0047]
We seek the Wi(k) that minimizes Ji(k). By partially differentiating Ji(k) with respect to Wi(k) and setting the result to zero,

[0048]

Wi(k) = Σ_{l∈Ci} |X(l, k)| |S(l, k)| / Σ_{l∈Ci} |X(l, k)|²

is obtained. One Wi(k) is determined for each frequency component k and for each of the clusters.
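Writing out the minimization explicitly (a standard scalar least-squares step, shown here for completeness):

    \frac{\partial J_i(k)}{\partial W_i(k)}
      = 2 \sum_{l \in C_i} \bigl( W_i(k)\,|X(l,k)| - |S(l,k)| \bigr)\,|X(l,k)| = 0
    \;\;\Longrightarrow\;\;
    W_i(k) = \frac{\sum_{l \in C_i} |X(l,k)|\,|S(l,k)|}{\sum_{l \in C_i} |X(l,k)|^{2}}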
[0049]
In the evaluation function of equation (9), all frames classified into the cluster Ci are treated with the same weight, but a different weight may be used for each frame. For example, a weighted sum of the error powers,

Ji(k) = Σ_{l∈Ci} A(l, k) ( Wi(k) |X(l, k)| − |S(l, k)| )²

[0050]
can also be used as the evaluation function. This makes it possible to control the weighting factor according to the purpose; for example, a weighting factor Wi(k) that emphasizes speech sections is obtained by setting A(l, k) to a large value for the frames corresponding to speech sections.
[0051]
In the present embodiment the weighting factor is determined for each frequency component k, but it may also be determined in units of subbands, each composed of a set of a plurality of frequency components. In that case, a simple choice is to represent the evaluation function Q(p) of the p-th subband as, for example, the sum of the distortions of the frequency components k corresponding to that subband:

Q(p) = Σ_{k ∈ subband p} Ji(k)

[0052]
The weighting factor Wi(k) can then be determined by minimizing this evaluation function in the same manner as described above.
[0053]
Third Embodiment: Next, a third embodiment of the present invention will be described with reference to FIG. 4. The acoustic signal processing apparatus shown in FIG. 4 is the same as that of the second embodiment except that a weighting factor calculation unit 120 is added in front of the weighting unit 105. In equation (6) the weighting factor is determined directly from the feature quantity (ξ(n, k), γ(n, k)); in the present embodiment, a parameter for determining the weighting factor is selected instead:

W(n, k) = P{ F(ξ(n, k), γ(n, k)) }
[0054]
That is, the weighting factor is determined using a function P{} whose parameters are the coefficients obtained by F(). For example, in the spectral subtraction described in S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP vol. 27, pp. 113-120, 1979 (Reference 5), which is often used as a simple noise suppression method, the estimated value of the amplitude of the target acoustic signal is

|X̂(n, k)| = |Y(n, k)| − N(n, k) … (14)

[0055]
where N(n, k) is the amplitude of the estimated noise, equal to sqrt(λd(n, k)). Following the common practice of giving X̂(n, k) the phase of Y(n, k), equation (14) can be transformed into

X̂(n, k) = (1 − N(n, k) / |Y(n, k)|) Y(n, k) … (15)

[0056]
If the first factor on the right side of equation (15) is written as

[0057]

Gss(n, k) = 1 − N(n, k) / |Y(n, k)|

[0058]
then X̂(n, k) = Gss(n, k) Y(n, k), which has the same form as equation (7). Suppose now that the parameter selected from the weighting factor dictionary 103 is α, that is, α = F(ξ(n, k), γ(n, k)), and that the function P() is defined as

[0059]

Gss(n, k) = 1 − α N(n, k) / |Y(n, k)|

which represents the weighting factor Gss(n, k). By selecting the weighting-factor parameter α in this way, instead of obtaining the weighting factor directly from the weighting factor dictionary 103, an improvement in the estimation accuracy of the parameter at the time of learning can be expected.
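A sketch of the parameterized gain described above (Python; flooring the gain at zero is an illustrative safeguard, not part of the embodiment):

    import numpy as np

    def gain_ss(Y_mag, N_mag, alpha):
        # Gss(n, k) = 1 - alpha * N(n, k) / |Y(n, k)|, with alpha selected from
        # the dictionary as alpha = F(xi(n, k), gamma(n, k)).
        g = 1.0 - alpha * N_mag / np.maximum(Y_mag, 1e-12)
        return np.maximum(g, 0.0)   # floor negative gains at zero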
[0060]
Fourth Embodiment: In the acoustic signal processing apparatus according to the fourth embodiment of the present invention, as shown in FIG. 5, the a priori SNR calculation unit 106 is removed from the acoustic signal processing apparatus of FIG. 3 according to the second embodiment. In the present embodiment the feature amount input to the selection unit 104 is only the a posteriori SNR γ(l, k), which has the advantage that the search for representative points in the selection unit 104 is faster than in the second embodiment.
[0061]
Fifth Embodiment: In the acoustic signal processing apparatus according to the fifth embodiment of the present invention, as shown in FIG. 6, the a posteriori SNR calculation unit 107 is removed from the acoustic signal processing apparatus of FIG. 3 according to the second embodiment. In the present embodiment the feature amount input to the selection unit 104 is only the a priori SNR ξ(l, k), which likewise has the advantage that the search for representative points in the selection unit 104 is faster than in the second embodiment.
[0062]
Sixth Embodiment: FIG. 7 shows an acoustic signal processing apparatus according to a sixth embodiment of the present invention, in which a switch 402 and a plurality of weighting factor dictionaries 103-1 to 103-M are added to the acoustic signal processing apparatus according to the second embodiment shown in FIG. 3. Although FIG. 7 shows the case of a single microphone 101 for simplicity, a plurality of microphones may be used as above.
[0063]
Next, the operation of this embodiment will be described. The operation is basically the same as that of the second embodiment, except that the switch 402 switches among the weighting coefficient dictionaries 103-1 to 103-M. In accordance with the control signal 401, the switch 402 selects one of the M weighting factor dictionaries 103-1 to 103-M. For example, considering in-vehicle applications, weighting factor dictionaries 103-1 to 103-M can be prepared corresponding to various vehicle speeds and switched according to the vehicle speed. This makes it possible to use the optimal weighting coefficient dictionary for each vehicle speed, so that higher noise suppression performance can be realized.
[0064]
Seventh Embodiment: FIG. 8 shows an acoustic signal processing apparatus according to a seventh embodiment of the present invention, in which the switch 402 of FIG. 7 is replaced by a weighting adder 403. The weighting adder 403 smooths, by weighted averaging, the weighting factors output from all of the weighting factor dictionaries 103-1 to 103-M, or the weighting factors selected from a subset of them. In the weighting adder 403, fixed weights may be used for the weighted addition, or variable weights controlled according to a control signal may be used.
[0065]
Eighth Embodiment: As shown in FIG. 9, in the acoustic signal processing device according to the eighth embodiment of the present invention, N channels of input acoustic signals from a plurality (N) of microphones 101-1 to 101-N are input to the inter-channel feature quantity calculation unit 202 and to the weighting units 105-1 to 105-N of the array unit 201. The inter-channel feature quantity calculation unit 202 calculates a feature quantity representing the difference between channels of the input acoustic signal (referred to in this specification as an inter-channel feature quantity) and passes it to the selection unit 204. The selection unit 204 selects one weighting coefficient associated with the inter-channel feature quantity from the weighting coefficient dictionary 203, which stores a large number of weighting coefficients.
[0066]
On the other hand, the input acoustic signals weighted by the weighting units 105-1 to 105-N in the array unit 201 are integrated by the adder 205 and output from the array unit 201 as an integrated acoustic signal. The integrated acoustic signal is weighted according to the weighting factor selected by the selection unit 204 in the noise suppression unit 206, and an output acoustic signal in which the target acoustic signal (for example, the voice of a specific speaker) is emphasized is generated.
[0067]
Next, the processing procedure of the present embodiment will be described according to the flowchart of FIG. 10. The inter-channel feature quantity calculation unit 202 calculates inter-channel feature quantities of the input acoustic signals (denoted x1 to xN) output from the microphones 101-1 to 101-N (step S21). When digital signal processing is used, the input acoustic signals x1 to xN are digital signals discretized in the time direction by an analog-digital converter (not shown), represented, for example, as x(t) using a time index t. If the input acoustic signals x1 to xN are discretized, the inter-channel feature quantities are also discretized. As specific examples of the inter-channel feature quantity, the correlation coefficient of the input acoustic signals x1 to xN, the cross spectrum, or the SNR (signal-to-noise ratio) can be used, as described later.
[0068]
Next, based on the inter-channel feature quantity calculated in step S21, the selection unit 204 selects the weighting factor associated with the inter-channel feature quantity from the weighting factor dictionary 203 (step S22); that is, the selected weighting factor is extracted from the weighting factor dictionary 203. The correspondence between the inter-channel feature quantity and the weighting factor is determined in advance; the simplest method is to place the discretized inter-channel feature quantities and the weighting factors in one-to-one correspondence. As a more efficient method of association, the inter-channel feature quantities may be divided into groups using a clustering method such as LBG, with a corresponding weighting factor assigned to each group. It is also conceivable to associate weighting coefficients via a statistical distribution such as a GMM (Gaussian mixture model), using a weighted sum of the outputs of the mixture components. Various methods of association can thus be considered, to be chosen in consideration of the amount of calculation, the amount of memory, and so on. The weighting factor A thus selected by the selection unit 204 is set in the noise suppression unit 206.
[0069]
On the other hand, the input acoustic signals x1 to xN are also sent to the weighting units 105-1 to 105-N of the array unit 201, where directivity control is performed by weighted addition and an integrated acoustic signal is output (step S23).
[0070]
Next, the integrated acoustic signal is weighted with the weighting factor A by the noise suppression unit 206, and an output acoustic signal in which the speech signal is emphasized is obtained (step S24).
[0071]
Next, the inter-channel feature quantity calculation unit 202 will be described in detail.
The inter-channel feature quantity is, as described above, a quantity representing the relationship between the N channels of input acoustic signals x1 to xN from the N microphones 101-1 to 101-N; specific examples include the correlation coefficient, the cross spectrum, and the SNR.
With x(t) and y(t) denoting the input acoustic signals from two microphones, the correlation coefficient is given by

ρ = E{x(t) y(t)} / sqrt( E{x(t)²} E{y(t)²} )
[0072]
where E{} denotes an expected value or a time average. If the input acoustic signal has more than two channels, the correlation can be calculated in the same manner, for example by summing over channel pairs.

[0073]
Here, xp(n) and xq(n) denote the p-th and q-th input acoustic signals, respectively, and the sum Σpq runs over all combinations of xp and xq excluding duplicates. In the frequency domain, this correlation coefficient is expressed as
[0074]
ρ = Σf Wx1x2(f) / sqrt( Σf Wx1x1(f) · Σf Wx2x2(f) )

where f is a frequency component obtained by the discrete Fourier transform, Wx1x2(f) is the cross spectrum between the input signals, Wx1x1(f) and Wx2x2(f) are the power spectra of the input acoustic signals x1(n) and x2(n), and Σf denotes the sum over all frequency components.
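A Python sketch of this frequency-domain correlation, estimated from STFT frames of two channels (averaging over frames as the spectrum estimate is an illustrative choice):

    import numpy as np

    def correlation_feature(X1, X2):
        # X1, X2: (frames, K) STFT coefficients of channels x1 and x2.
        Wx1x2 = np.mean(X1 * np.conj(X2), axis=0)   # cross spectrum
        Wx1x1 = np.mean(np.abs(X1) ** 2, axis=0)    # power spectrum of x1
        Wx2x2 = np.mean(np.abs(X2) ** 2, axis=0)    # power spectrum of x2
        return np.real(np.sum(Wx1x2)) / np.sqrt(np.sum(Wx1x1) * np.sum(Wx2x2))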
[0075]
The cross spectrum Wx1x2(f), or its normalized form γ(f), may itself be used as the feature quantity. It is also possible to configure the feature quantity as a three-dimensional vector by combining the cross spectrum Wx1x2(f) with the power spectra Wx1x1(f) and Wx2x2(f), or as a vector combining Wx1x1(f) + Wx2x2(f), which represents the power of all the channels, the power spectrum Wyy(f) of the array output, and the cross spectrum Wx1x2(f). Further, a section in which the target acoustic signal is absent may be detected and the power spectrum Wnn(f) of that section used as one of the feature quantities, or used to correct the other feature quantities (by subtraction from the power spectrum, etc.). The frequency-domain representation can be extended to more than two channels in the same way as the time-domain one. Alternatively, other correlation measures such as generalized correlation functions may be used; see, for example, C. H. Knapp and G. C. Carter, "The Generalized Correlation Method for Estimation of Time Delay," IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-24, No. 4, pp. 320-327, 1976 (Reference 6).
[0076]
The SNR is the ratio of the power S of the signal component to the power N of the noise component, defined as SNR = S / N; usually the SNR is converted to a decibel value before use. N can be measured in a section where the target acoustic signal is absent. Since S cannot be observed directly, either the input acoustic signal is used as it is, or S is estimated indirectly using a method such as the decision-directed method disclosed in Non-Patent Document 1. Besides obtaining the SNR for each channel and using it as the feature quantity, the average or the sum of the SNRs over all channels may be used as the feature quantity. Furthermore, SNRs obtained by different calculation methods may be combined.
[0077]
Next, the array unit 201 will be described. In the present embodiment the array unit 201 is not particularly limited, and any array can be used. A simple example is the delay-and-sum array, in which the array weights W are adjusted so that the phase differences of the signals arriving from the target direction become zero (in-phase addition); W is a complex number, and the phase alignment is achieved through its argument. Well-known examples of adaptive arrays include Griffiths-Jim type arrays, Directionally Constrained Minimization of Power (DCMP), and minimum variance beamformers. In addition, various methods such as those based on ICA (Independent Component Analysis) have been proposed in recent years; the target acoustic signal is emphasized using these methods.
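To illustrate the delay-and-sum principle, a frequency-domain sketch follows (Python; the steering delays are assumed known, and the sign convention of the phase alignment is one common choice):

    import numpy as np

    def delay_and_sum(X, delays, fs, nfft):
        # X: (N, K) one STFT frame per channel (K = nfft // 2 + 1 rfft bins).
        # delays: (N,) arrival-time delays (seconds) toward the target direction.
        freqs = np.arange(X.shape[1]) * fs / nfft
        W = np.exp(2j * np.pi * freqs[None, :] * delays[:, None]) / X.shape[0]
        return np.sum(W * X, axis=0)   # in-phase sum: the integrated signal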
[0078]
The integrated acoustic signal in which the target acoustic signal is emphasized still contains residual noise. In particular, diffuse noise cannot be sufficiently suppressed by array processing, which performs noise suppression using spatial information. The noise suppression unit 206 suppresses such noise. Conventionally, this kind of noise suppression is called a post-filter and is regarded as part of array processing; in conventional methods, analytically deriving the weighting factor based on the Wiener filter is the mainstream approach.
[0079]
In the present embodiment, on the other hand, the noise suppression processing is realized by selecting the weighting factor based on the inter-channel feature quantity. Specifically, a weighting factor is selected, based on the inter-channel feature quantity, from the weighting factor dictionary 203 learned in advance; the noise suppression unit 206 then convolves the selected weighting factor with the integrated acoustic signal or, in the case of frequency-domain processing, multiplies the integrated acoustic signal by the selected weighting factor.
[0080]
By learning the weighting factor in advance using the tendencies of the inter-channel feature quantity exhibited by the noise component to be suppressed, high suppression performance can be achieved in a noise environment similar to that at the time of learning. For learning, a squared-error minimization criterion with respect to the aforementioned target acoustic signal is used.
[0081]
Ninth Embodiment: In the acoustic signal processing device according to the ninth embodiment of the present invention shown in FIG. 11, Fourier transform units 110-1 to 110-N, which convert the N-channel input acoustic signals into frequency-domain signals, and an inverse Fourier transform unit 111, which converts the frequency-domain acoustic signal after array processing and noise suppression back into a time-domain signal, are added to the acoustic signal processing device of FIG. 9 according to the eighth embodiment. Along with this addition, the array unit 201 with its weighting units 105-1 to 105-N and adder 205, and the noise suppression unit 206, are replaced by an array unit 301 with weighting units 304-1 to 304-N and an adder 305, and a noise suppression unit 306, all operating in the frequency domain.
[0082]
As is well known in the digital signal processing art, a convolution in the time domain corresponds to a multiplication in the frequency domain. In this embodiment, the N-channel input acoustic signals are converted into frequency-domain signals by the Fourier transform units 110-1 to 110-N, array processing and noise suppression are performed, and the noise-suppressed signal is returned to the time domain by the inverse Fourier transform unit 111. In terms of signal processing, therefore, this embodiment performs processing equivalent to that of the eighth embodiment, in which the processing is done in the time domain. In this case, the output signal Y(k) of the adder 305 is expressed not as a convolution as in equation (2) but as a product:

Y(k) = Σn Wn(k) Xn(k)
[0083]
where k is a frequency index.
[0084]
Similarly, the calculation in the noise suppression unit 306 is also expressed as a product:

[0085]

Z(k) = G(k) Y(k)

where G(k) is the selected weighting factor.
The inverse Fourier transform unit 111 performs an inverse Fourier transform on the output signal Z(k) of the noise suppression unit 306 to obtain the output acoustic signal z(t) in the time domain.
It is also possible to use the frequency-domain output signal Z(k) of the noise suppression unit 306 as it is, for example as a parameter for speech recognition.
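Putting the two products together, one frame of the processing of this embodiment reduces to a few lines (Python; rfft-style spectra and the symbol G for the selected weighting factor are assumptions of this sketch):

    import numpy as np

    def array_and_postfilter(X, W, G):
        # X: (N, K) channel spectra; W: (N, K) array weights; G: (K,) gain,
        # with K = nfft // 2 + 1 rfft bins assumed.
        Y = np.sum(W * X, axis=0)    # Y(k) = sum_n W_n(k) X_n(k)
        Z = G * Y                    # Z(k) = G(k) Y(k)
        return np.fft.irfft(Z)       # back to the time domain (unit 111)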
[0086]
An advantage of converting the input acoustic signal into the frequency domain before processing, as in the present embodiment, is that the amount of calculation may be reduced depending on the filter orders of the array unit 301 and the noise suppression unit 306, and that, since each frequency component can be processed independently, it is easy to cope with complex noise such as reverberation.
[0087]
Tenth Embodiment: FIG. 12 shows an acoustic signal processing device according to a tenth embodiment of the present invention, in which a matching unit 501 and a representative point dictionary 502 are added to the acoustic signal processing device of FIG. 11 according to the ninth embodiment.
As shown in FIG. 13, the representative point dictionary 502 stores the feature quantities of a plurality (I) of representative points, obtained by the LBG method or the like, in association with index IDs. Here, a representative point is the representative point of each cluster obtained when the inter-channel feature quantities are clustered.
[0088]
The processing procedure of the acoustic signal processing device of FIG. 12 is shown in the flowchart of FIG. 14, in which the processing of the Fourier transform units 110-1 to 110-N and the inverse Fourier transform unit 111 is omitted. The inter-channel feature quantity calculation unit 202 calculates an inter-channel feature quantity of the N-channel acoustic signal after the Fourier transform (step S31). Next, the matching unit 501 compares the inter-channel feature quantity with the feature quantities of the plurality (I) of representative points stored in the representative point dictionary 502, and calculates the distance between them (step S32).
[0089]
The index ID of the representative point whose feature quantity minimizes the distance to the inter-channel feature quantity is sent from the matching unit 501 to the selection unit 204, and the selection unit 204 selects and extracts the weighting coefficient corresponding to the index ID from the weighting coefficient dictionary 203 (step S33). The weighting factor thus selected by the selection unit 204 is set in the noise suppression unit 306.
[0090]
On the other hand, the input acoustic signals converted into the frequency domain by the Fourier transform units 110-1 to 110-N are input to the weighting units 304-1 to 304-N of the array unit 301 to obtain an integrated acoustic signal (step S34).
[0091]
Next, the noise suppression unit 306 computes from the integrated acoustic signal an output signal in which noise is suppressed in accordance with the weighting factor set in step S33, and an output acoustic signal in which the target speech signal is emphasized is obtained (step S35).
The output acoustic signal from the noise suppression unit 306 is subjected to an inverse Fourier transform in the inverse Fourier transform unit 111 to become the output acoustic signal in the time domain.
[0092]
Eleventh Embodiment: As shown in FIG. 15, the acoustic signal processing device according to the eleventh embodiment of the present invention includes a plurality of (M) weight control units 600-1 to 600-M, each having the inter-channel feature quantity calculation unit 202, the weighting factor dictionary 203, and the selection unit 204 described in the ninth embodiment.
[0093]
The weight control units 600-1 to 600-M are switched by the input switch 602 and the output switch 603 in accordance with the control signal 601.
That is, the set of N-channel input acoustic signals from the microphones 101-1 to 101-N is routed by the input switch 602 to one of the weight control units 600-1 to 600-M, whose inter-channel feature quantity calculation unit 202 calculates the inter-channel feature quantity. In the weight control unit to which the input acoustic signal set is routed, the selection unit 204 selects the weighting coefficient corresponding to the inter-channel feature quantity from the weighting coefficient dictionary 203. The selected weighting factor is supplied to the noise suppression unit 206 via the output switch 603.
[0094]
Meanwhile, the N-channel acoustic signals from the weighting units 105-1 to 105-N are combined by the adder 205 and output from the array unit 201 as an integrated acoustic signal. The integrated acoustic signal is subjected to noise suppression by the noise suppression unit 206 using the weighting factor selected by the selection unit 204, and an output acoustic signal in which the target speech signal is emphasized is generated.
[0095]
The weighting factor dictionary 203 is created in advance by learning in an acoustic environment close to the actual use environment. In practice, various acoustic environments must be envisaged; for example, the acoustic environment inside a car varies greatly depending on the vehicle type. The weighting coefficient dictionaries 203 in the weight control units 600-1 to 600-M are therefore learned under different acoustic environments. At the time of acoustic signal processing, the weight control units 600-1 to 600-M are switched according to the actual use environment, and weighting is performed using the factor selected by the selection unit 204 from the weighting coefficient dictionary 203 learned under the acoustic environment identical or most similar to the actual one; acoustic signal processing suited to the actual use environment can thus be performed.
[0096]
The control signal 601 used to switch among the weight control units 600-1 to 600-M may be generated, for example, by a button operation by the user; it may be generated automatically, using as an index a parameter derived from the input acoustic signal, such as the signal-to-noise ratio (SNR); or it may be generated using an external parameter, such as the vehicle speed, as an index.
[0097]
Since each of the weight control units 600-1 to 600-M includes its own inter-channel feature quantity calculation unit 202, a calculation method and parameters suited to the acoustic environment corresponding to each weight control unit can be used, and a more accurate inter-channel feature quantity can therefore be expected.
[0098]
The acoustic signal processing according to the embodiments of the present invention described above can be realized not only by hardware but also by software using a computer such as a personal computer.
Therefore, according to the present invention, it is possible to provide such a program, or a computer-readable storage medium storing the program.
[0099]
The present invention is not limited to the above embodiments as they are; at the implementation stage, the constituent elements can be modified and embodied without departing from the scope of the invention. In addition, various inventions can be formed by appropriate combinations of the plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in an embodiment. Furthermore, components of different embodiments may be combined as appropriate.
[0100]
FIG. 1 is a block diagram showing an acoustic signal processing apparatus according to the first embodiment.
FIG. 2 is a flowchart showing the processing procedure in the first embodiment.
FIG. 3 is a block diagram showing an acoustic signal processing apparatus according to the second embodiment.
FIG. 4 is a block diagram showing an acoustic signal processing apparatus according to the third embodiment.
FIG. 5 is a block diagram showing an acoustic signal processing apparatus according to the fourth embodiment.
FIG. 6 is a block diagram showing an acoustic signal processing apparatus according to the fifth embodiment.
FIG. 7 is a block diagram showing an acoustic signal processing apparatus according to the sixth embodiment.
FIG. 8 is a block diagram showing an acoustic signal processing apparatus according to the seventh embodiment.
FIG. 9 is a block diagram showing an acoustic signal processing apparatus according to the eighth embodiment.
FIG. 10 is a flowchart showing the processing procedure in the eighth embodiment.
FIG. 11 is a block diagram showing an acoustic signal processing apparatus according to the ninth embodiment.
FIG. 12 is a block diagram showing an acoustic signal processing apparatus according to the tenth embodiment.
FIG. 13 is a diagram showing the contents of the representative point dictionary in FIG. 12.
FIG. 14 is a flowchart showing the processing procedure in the tenth embodiment.
FIG. 15 is a block diagram showing an acoustic signal processing apparatus according to the eleventh embodiment.
Explanation of signs
[0101]
100: learning unit; 101-1 to 101-N: microphones; 102: feature amount calculation unit; 103, 103-1 to 103-M: weighting coefficient dictionary; 104: selection unit; 105-1 to 105-N: weighting units; 106: a priori SNR calculation unit; 107: a posteriori SNR calculation unit; 108: noise estimation unit; 110-1 to 110-N: Fourier transform units; 111, 111-1 to 111-N: inverse Fourier transform units; 120: weighting coefficient calculation unit; 201: array unit; 202: inter-channel feature quantity calculation unit; 203: weighting coefficient dictionary; 204: selection unit; 205: adder; 206: noise suppression unit; 401: control signal; 402: switch; 403: weighting adder; 501: matching unit; 502: representative point dictionary; 600-1 to 600-M: weight control units; 601: control signal; 602: input switching unit; 603: output switching unit.