Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007010897
An object of the present invention is to emphasize a target sound signal by alleviating the
problem of target sound removal under reverberation in a car or the like. An inter-channel
feature quantity calculating unit calculates an inter-channel feature quantity representing the
difference between channels for the input acoustic signals of a plurality of channels from the
microphones 101-1 to 101-N. A selection unit 104 selects, from a weighting factor dictionary
prepared in advance, the weighting coefficients of the plurality of channels associated with the
feature quantity, and the input acoustic signals are weighted and added by weighting units 105-1
to 105-N and an adder 106 using the selected weighting coefficients to generate an output
acoustic signal. [Selected figure] Figure 1
Acoustic signal processing method, apparatus and program
[0001]
The present invention relates to microphone array technology, which is one of the noise
suppression techniques used in hands-free speech and speech recognition, and more particularly
to an acoustic signal processing method, apparatus and program for emphasizing and outputting
a target audio signal contained in an input acoustic signal.
[0002]
When speech recognition technology is used in a real environment, ambient noise has a
significant effect on the recognition rate.
For example, in a car, there are many noises such as engine noise, wind noise, the sound of
oncoming and overtaking vehicles, and the sound of car audio devices. These noises are mixed
with the voice of the speaker and input to the speech recognition device, which causes the
recognition rate to be greatly reduced. One way to solve such noise problems is to use a
microphone array. The microphone array performs signal processing on input sound signals
from a plurality of microphones, and emphasizes and outputs a target sound signal which is a
speaker's voice.
[0003]
An adaptive microphone array is known that suppresses noise by automatically steering a dead
angle, that is, a direction of low sound reception sensitivity, toward the direction of noise arrival.
The adaptive microphone array is generally designed to suppress noise under the constraint
condition that the signal from the target sound direction is not suppressed. As a result, it is
possible, for example, to suppress noise from the side without suppressing the target audio
signal coming from the front direction.
[0004]
However, in a real environment, even the voice of a speaker who is in front is reflected by
surrounding obstacles such as walls and arrives from various directions, which is the so-called
reverberation problem. A classical adaptive microphone array does not take reverberation into
consideration. As a result, when the adaptive microphone array is used under reverberation, a
problem called "target sound removal" occurs, in which the target sound signal that should be
emphasized is erroneously suppressed.
[0005]
When the influence of reverberation is known, that is, when the transfer function from the sound
source to the microphone is known, methods have been proposed to avoid the problem of target
sound removal. For example, Non-Patent Document 1 proposes a method of applying a matched
filter, obtained from the transfer function represented in the form of an impulse response, to the
input acoustic signal from a microphone. Non-Patent Document 2, on the other hand, describes a
method of reducing reverberation by converting the input acoustic signal into a cepstrum and
suppressing its high-order terms. J. L. Flanagan, A. C. Surendran and E. E. Jan, "Spatially Selective
Sound Capture for Speech and Audio Processing", Speech Communication, 13, pp. 207-222, 1993.
A. V. Oppenheim and R. W. Schafer, "Digital Signal Processing", Prentice Hall, pp. 519-524, 1975.
[0006]
The method of Non-Patent Document 1 needs to know the impulse response in advance, and to
do so it is necessary to measure the impulse response in the environment in which the system is
actually used. Since there are many factors that affect the transfer function in an automobile,
such as the number of passengers, luggage, and the opening and closing of windows, it is
difficult to put a method based on a known impulse response into practice.
[0007]
Non-Patent Document 2, on the other hand, exploits the tendency of reverberation components to
appear in the higher-order terms of the cepstrum. However, since the direct wave and the
reverberation components are not completely separated there, the extent to which the
reverberation components harmful to the adaptive microphone array can be removed depends on
the application.
[0008]
In particular, in a narrow space such as the inside of a car, many reflection components are
concentrated within a short time, and these reflection components interfere with the direct wave
and greatly deform the spectrum.
Therefore, the direct wave and the reverberation components cannot be sufficiently separated by
the cepstrum method, and it is difficult to avoid the target sound removal caused by the
influence of reverberation.
[0009]
As described above, the prior art cannot sufficiently remove the reverberation components that
cause target sound removal in a microphone array used in a narrow space such as an automobile.
[0010]
An object of the present invention is to provide an acoustic signal processing method, apparatus
and program for enhancing a target audio signal by alleviating the problem of target sound
removal under reverberation.
[0011]
According to one aspect of the present invention, a feature quantity representing the difference
between channels of input acoustic signals of a plurality of channels is determined, weighting
coefficients of the plurality of channels associated with the feature quantity are selected from a
weighting coefficient dictionary prepared in advance, and the input acoustic signals are weighted
and added using the selected weighting coefficients to generate an output acoustic signal.
[0012]
In another aspect of the present invention, feature quantities representing the difference between
channels of input acoustic signals of a plurality of channels are clustered to generate a plurality of
clusters, the centroid of each cluster is determined, the weighting coefficients of the plurality of
channels associated with the cluster whose centroid has the minimum distance from the feature
quantity are selected from a weighting coefficient dictionary prepared in advance, and the input
acoustic signals are weighted and added using the selected weighting coefficients to generate an
output acoustic signal.
[0013]
According to still another aspect of the present invention, the distances between a feature
quantity representing the difference between channels of input acoustic signals of a plurality of
channels and a plurality of representative points prepared in advance are determined, the
representative point with the minimum distance is found, the weighting coefficients of the
plurality of channels associated with that representative point are selected from a weighting
coefficient dictionary prepared in advance, and the input acoustic signals are weighted and added
using the selected weighting coefficients to generate an output acoustic signal.
[0014]
According to the present invention, since the weighting factors are selected based on the
inter-channel feature quantities of a plurality of input acoustic signals, the problem of target
sound removal under reverberation can easily be avoided by learning the weighting factors.
[0015]
Hereinafter, some embodiments of the present invention will be described with reference to the
drawings.
[0016]
First Embodiment As shown in FIG. 1, an acoustic signal processing device according to a first
embodiment of the present invention comprises an inter-channel feature quantity calculation unit
102 which calculates the inter-channel feature quantity of the N-channel input acoustic signals
received from a plurality (N) of microphones 101-1 to 101-N, a weighting factor dictionary 103
which stores a plurality of weighting factors (hereinafter also referred to as weighting
coefficients), a selection unit 104 which selects weighting factors from the weighting factor
dictionary 103 based on the inter-channel feature quantity, weighting units 105-1 to 105-N which
apply the selected weighting factors to the input acoustic signals x1 to xN, and an adder 106
which adds the output signals of the weighting units 105-1 to 105-N to obtain an output acoustic
signal in which the target audio signal is emphasized.
[0017]
Next, the processing procedure of the present embodiment will be described according to the
flowchart of FIG.
[0018]
Input acoustic signals x1 to xN from the microphones 101-1 to 101-N are input to the
inter-channel feature quantity calculation unit 102, and the inter-channel feature quantity is
calculated (step S11).
When digital signal processing technology is used, x1 to xN are discretized in the time direction
by an A/D converter (not shown) and expressed, for example, as x1(t) using a time index t.
The inter-channel feature amount is an amount representing the difference between the channels
of the input acoustic signals x1 to xN, and a specific example thereof will be described later.
If the input acoustic signals x1 to xN are discretized, the inter-channel feature quantities are also
discretized.
[0019]
Next, the weighting factors w1 to wN associated with the inter-channel feature quantity are
selected from the weighting factor dictionary 103 by the selection unit 104 based on the
inter-channel feature quantity (step S12).
The correspondence between the inter-channel feature quantity and the weighting factors w1 to
wN is determined in advance; the simplest method is to associate the discretized inter-channel
feature quantity with the weighting factors w1 to wN on a one-to-one basis.
[0020]
As a more efficient method of association, as will be described later in a third embodiment, there
is also a method of grouping the inter-channel feature quantities using a clustering method such
as LBG and associating the weighting factors w1 to wN with each group of inter-channel feature
quantities.
A method may also be considered in which a statistical distribution such as a GMM (Gaussian
mixture model) is used to associate the distribution with the weighting coefficients w1 to wN.
As described above, various methods can be considered for the correspondence, and they are
chosen in consideration of the amount of calculation, the amount of memory, and the like.
[0021]
The weighting factors w1 to wN selected by the selection unit 104 in this manner are set in the
weighting units 105-1 to 105-N.
The input acoustic signals x1 to xN are weighted according to the weighting factors w1 to wN by
the weighting units 105-1 to 105-N, and then added by the addition unit 106 to obtain the
output acoustic signal y in which the target sound signal is emphasized. (Step S13).
[0022]
In digital signal processing in the time domain, weighting is expressed as convolution.
The weighting factors w1 to wN are in the form of filter coefficients, wn = {wn(0), wn(1), ...,
wn(L−1)}, n = 1, 2, ..., N, where L is the filter length, and the output signal y is the sum of the
convolutions of the respective channels.
[0023]
y(t) = Σ_{n=1…N} wn(t) * xn(t)   …(1)
[0024]
Here, * represents convolution, which is given by
[0025]
wn(t) * xn(t) = Σ_{l=0…L−1} wn(l) xn(t − l)   …(2)
[0026]
The timing of updating the weighting factor wn is, for example, in units of samples, or in units of
predetermined frames.
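As an illustration of equations (1) and (2), the following minimal NumPy sketch performs the time-domain weighted addition (hypothetical code, not part of the patent; the signal length and the weight values are made-up):

```python
import numpy as np

def weighted_sum_time_domain(x, w):
    """Equations (1)-(2): y(t) = sum_n wn(t) * xn(t), a convolution per channel.

    x: array of shape (N, T) -- N-channel input acoustic signals
    w: array of shape (N, L) -- selected filter (weighting) coefficients
    """
    N, T = x.shape
    y = np.zeros(T)
    for n in range(N):
        # full convolution truncated to T samples, as in equation (2)
        y += np.convolve(x[n], w[n])[:T]
    return y

# toy example: N = 2 channels, identical weights of 0.5 emphasize the common (front) component
x = np.random.randn(2, 1600)
w = np.array([[0.5], [0.5]])        # filter length L = 1
y = weighted_sum_time_domain(x, w)
```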
[0027]
Next, the inter-channel feature amount will be described.
The inter-channel feature quantity is, as described above, a quantity representing the difference
between the channels of the N-channel input acoustic signals x1 to xN from the N microphones
101-1 to 101-N, and various forms of it are conceivable, as follows.
[0028]
Consider first the arrival time difference τ of the input acoustic signals x1 to xN as the
inter-channel feature quantity, for the case N = 2. As shown in FIG. 3, when the input acoustic
signals x1 to xN arrive from the front of the array of microphones 101-1 to 101-N, τ = 0. As
shown in FIG. 4, when the input acoustic signals x1 to xN arrive from a direction deviated from
the front by an angle θ, a delay of τ = d sin θ / c occurs. Here, c is the speed of sound, and d is
the distance between the microphones.
[0029]
Here, assuming that the arrival time difference τ can be detected, a relatively large weighting
factor, for example (0.5, 0.5), is associated with τ = 0, and a relatively small weighting factor, for
example (0, 0), is associated with values other than τ = 0; in this way it is possible to emphasize
only the input acoustic signal from the front. When τ is treated discretely, various discretizations
are possible: a time unit corresponding to the minimum angle that the array of microphones
101-1 to 101-N can resolve, a time corresponding to a constant angular step such as one degree,
or a fixed time interval independent of the angle.
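The association between the discretized arrival time difference and the weighting factors can be sketched as follows (hypothetical code; the sound speed, microphone spacing, discretization step and dictionary contents are assumed values, not taken from the patent):

```python
import numpy as np

C = 340.0        # speed of sound [m/s] (assumed)
D = 0.1          # microphone spacing d [m] (assumed)
TAU_STEP = 1e-4  # discretization step for the arrival time difference [s] (assumed)

def arrival_time_difference(theta_deg):
    """tau = d * sin(theta) / c for a two-microphone array."""
    return D * np.sin(np.deg2rad(theta_deg)) / C

def select_weights(tau):
    """Toy weighting factor dictionary keyed by the discretized tau:
    tau = 0 (front) -> (0.5, 0.5), everything else -> (0, 0)."""
    tau_index = int(round(tau / TAU_STEP))
    return (0.5, 0.5) if tau_index == 0 else (0.0, 0.0)

print(select_weights(arrival_time_difference(0.0)))   # (0.5, 0.5): emphasize the front
print(select_weights(arrival_time_difference(30.0)))  # (0.0, 0.0): suppress off-axis sound
```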
[0030]
In many conventionally used microphone arrays, the output signal is obtained by weighting and
adding the input acoustic signals from the respective microphones. There are various microphone
array methods, but the difference between them is basically the method of determining the
weighting factor w. In many adaptive microphone arrays, the
weighting factor w is analytically determined based on the input acoustic signals. For example, in
Directionally Constrained Minimization of Power (DCMP), which is one of the adaptive microphone
array methods, the weighting factor w is expressed as
[0031]
w = h · inv(Rxx) c / ( c^h inv(Rxx) c )   …(3)
[0032]
Here, Rxx is an inter-channel correlation matrix of the input acoustic signal, inv () is an inverse
matrix, <h> is conjugate transposition, w and c are vectors, and h is a scalar. The vector c is also
called a constraint vector. It is possible to design the response in the direction indicated by the
vector c to be the desired response h. It is also possible to set a plurality of constraints, in which
case c is a matrix and h is a vector. Usually, the constraint vector is designed as the target sound
direction, and the desired response is designed as 1.
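For reference, equation (3) can be evaluated numerically as in the sketch below (hypothetical code; the correlation matrix and constraint vector are made-up values, and the formula is implemented as reconstructed above):

```python
import numpy as np

def dcmp_weights(Rxx, c, h=1.0):
    """Equation (3): w = h inv(Rxx) c / (c^h inv(Rxx) c) for a single constraint."""
    Rinv_c = np.linalg.solve(Rxx, c)          # inv(Rxx) c without forming the inverse
    return Rinv_c * h / (np.conj(c) @ Rinv_c)

# toy example: 2 microphones, constraint vector for the front direction (zero delay)
Rxx = np.array([[1.0, 0.3], [0.3, 1.0]], dtype=complex)  # assumed correlation matrix
c = np.array([1.0, 1.0], dtype=complex)                  # front direction
w = dcmp_weights(Rxx, c)
print(np.conj(c) @ w)  # ~1: the response toward the constraint direction is preserved
```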
[0033]
In DCMP, the weighting coefficients are determined adaptively based on the input acoustic signals
from the microphones, so a high noise suppression capability can be realized with a smaller
number of microphones than with a fixed array such as a delay-and-sum array. However, because
the constraint vector c determined in advance does not necessarily coincide with the direction
from which the target sound actually arrives, owing to the interference of sound waves under
reverberation, the problem of "target sound removal" occurs, in which the target sound signal is
regarded as noise and suppressed. As described above, an adaptive array that adaptively forms its
directivity based on the input acoustic signals is significantly affected by reverberation, and the
problem of "target sound removal" cannot be avoided.
[0034]
On the other hand, in the method of the present embodiment, in which the weighting factors are
set based on the inter-channel feature quantity, target sound removal can be suppressed by
learning the weighting factors. For example, if an acoustic signal emitted from the front exhibits
an arrival time difference shifted to τ0 due to reflections, then by making the weighting factors
corresponding to τ0 relatively large, for example (0.5, 0.5), and making the weighting factors
corresponding to values of τ other than τ0 relatively small, for example (0, 0), the problem of
target sound removal can be avoided. The learning of the weighting factors, that is, the
correspondence between the inter-channel feature quantity and the weighting factors established
when creating the weighting factor dictionary 103, is performed in advance by a method to be
described later. As a method of determining the arrival time difference τ, the cross-power
spectrum phase (CSP) method can be mentioned. In the CSP method with N = 2, the CSP
coefficient is obtained as
[0035]
CSP(t) = IFT{ X1(f) conj(X2(f)) / ( |X1(f)| |X2(f)| ) }   …(4)
[0036]
CSP(t) is the CSP coefficient, Xn(f) is the Fourier transform of xn(t), IFT{} is the inverse Fourier
transform, conj() denotes the complex conjugate, and || represents the absolute value. Since the
CSP coefficient is the inverse Fourier transform of the whitened cross spectrum, it has a pulse-like
peak at the time t corresponding to the arrival time difference τ. Therefore, the arrival time
difference τ can be found by searching for the maximum of the CSP coefficient.
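A possible implementation of the CSP computation of equation (4) is sketched below (hypothetical code; the FFT-based estimation and the small regularization constant in the whitening are assumptions):

```python
import numpy as np

def csp_delay(x_a, x_b):
    """Equation (4): the peak of IFT{ Xa conj(Xb) / (|Xa||Xb|) } gives the lag of x_a
    relative to x_b (positive if x_a is delayed)."""
    n = len(x_a)
    Xa, Xb = np.fft.rfft(x_a, n), np.fft.rfft(x_b, n)
    cross = Xa * np.conj(Xb)
    cross /= np.abs(cross) + 1e-12        # whitening; small constant avoids division by zero
    csp = np.fft.irfft(cross, n)
    lag = int(np.argmax(csp))
    return lag if lag < n // 2 else lag - n

fs = 16000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
x2 = np.roll(x1, 3)                        # channel 2 lags channel 1 by 3 samples
print(csp_delay(x2, x1))                   # approximately 3
```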
[0037]
As the inter-channel feature quantity based on the arrival time difference, it is also possible to
use the complex coherence in addition to the arrival time difference itself. The complex coherence
of X1(f) and X2(f) is represented by
[0038]
Coh(f) = E{ X1(f) conj(X2(f)) } / sqrt( E{ |X1(f)|^2 } E{ |X2(f)|^2 } )   …(5)
[0039]
Coh(f) is the complex coherence, and E{} is the expected value in the time direction (more
strictly, the ensemble average). Coherence is used in the field of signal processing as a quantity
that describes the relationship between two signals. A signal that is uncorrelated between
channels, such as diffuse noise, has a small absolute value of coherence, whereas a directional
signal has a large one. Since the time difference between channels appears as the phase
component of the coherence, a directional signal can be distinguished by its phase as coming
either from the target sound direction or from another direction. By using these properties as
feature quantities, it is possible to distinguish between diffuse noise, the target sound signal, and
directional noise. As can be seen from equation (5), coherence is a function of frequency, so it is
well suited to the second embodiment described later; when it is used in the time domain, there
are various ways of using it, such as averaging it in the frequency direction or taking its value at a
representative frequency. Coherence is generally defined for N channels and is not limited to N = 2
as in this example.
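A frame-averaged estimate of the complex coherence of equation (5) might look as follows (hypothetical code; the frame length and the averaging scheme used to approximate E{} are assumptions):

```python
import numpy as np

def complex_coherence(x1, x2, frame_len=512):
    """Equation (5): Coh(f) = E{X1 conj(X2)} / sqrt(E{|X1|^2} E{|X2|^2}),
    with the expectation approximated by averaging over frames."""
    n_frames = min(len(x1), len(x2)) // frame_len
    f1 = x1[:n_frames * frame_len].reshape(n_frames, frame_len)
    f2 = x2[:n_frames * frame_len].reshape(n_frames, frame_len)
    X1, X2 = np.fft.rfft(f1, axis=1), np.fft.rfft(f2, axis=1)
    num = np.mean(X1 * np.conj(X2), axis=0)
    den = np.sqrt(np.mean(np.abs(X1) ** 2, axis=0) * np.mean(np.abs(X2) ** 2, axis=0))
    return num / (den + 1e-12)

# a directional (common) signal gives |Coh| near 1; uncorrelated noise gives small |Coh|
s = np.random.randn(16000)
coh_dir = complex_coherence(s, np.roll(s, 2))
coh_noise = complex_coherence(np.random.randn(16000), np.random.randn(16000))
print(np.mean(np.abs(coh_dir)), np.mean(np.abs(coh_noise)))
```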
[0040]
As the inter-channel feature quantity, a generalized correlation function can also be used in
addition to feature quantities based on the arrival time difference. For the generalized correlation
function, see, for example, C. H. Knapp and G. C. Carter, "The Generalized Correlation Method for
Estimation of Time Delay", IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-24, No. 4,
pp. 320-327, 1976. The generalized correlation function GCC(t) is defined as
[0041]
GCC(t) = IFT{ Φ(f) G12(f) }   …(6)
[0042]
Here, IFT is the inverse Fourier transform, Φ(f) is a weighting coefficient, and G12(f) is the cross
power spectrum between the channels. There are various methods for determining Φ(f), the
details of which are described in the above document. For example, the weighting coefficient
Φml(f) according to the maximum likelihood estimation method is expressed by the following
equation.
[0043]
Φml(f) = |γ12(f)|^2 / ( |G12(f)| (1 − |γ12(f)|^2) )   …(7)
[0044]
Here, |γ12(f)|^2 is the magnitude-squared coherence.
As in the case of CSP, the strength of the correlation between channels and the direction of the
sound source can be determined from the value of t that gives the maximum of GCC(t) and from
that maximum value itself.
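Equations (6) and (7) can be combined into a sketch like the following (hypothetical code; the spectra G11, G22 and G12 are estimated by simple frame averaging, which is an assumption):

```python
import numpy as np

def gcc_ml(x_a, x_b, frame_len=512):
    """Equations (6)-(7): GCC(t) = IFT{ Phi_ml(f) G12(f) } with the maximum-likelihood
    weight Phi_ml = |gamma12|^2 / (|G12| (1 - |gamma12|^2)); returns the peak lag."""
    n_frames = min(len(x_a), len(x_b)) // frame_len
    F1 = np.fft.rfft(x_a[:n_frames * frame_len].reshape(n_frames, frame_len), axis=1)
    F2 = np.fft.rfft(x_b[:n_frames * frame_len].reshape(n_frames, frame_len), axis=1)
    G11 = np.mean(np.abs(F1) ** 2, axis=0)
    G22 = np.mean(np.abs(F2) ** 2, axis=0)
    G12 = np.mean(F1 * np.conj(F2), axis=0)
    gamma2 = np.abs(G12) ** 2 / (G11 * G22 + 1e-12)         # magnitude-squared coherence
    phi = gamma2 / (np.abs(G12) * (1.0 - gamma2) + 1e-12)   # equation (7)
    gcc = np.fft.irfft(phi * G12, frame_len)
    lag = int(np.argmax(gcc))
    return lag if lag < frame_len // 2 else lag - frame_len

s = np.random.randn(16000)
x1 = np.roll(s, 4) + 0.3 * np.random.randn(16000)
x2 = s + 0.3 * np.random.randn(16000)
print(gcc_ml(x1, x2))   # approximately 4 (lag of channel 1 relative to channel 2)
```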
[0045]
As described above, according to the present embodiment, the relationship between the
inter-channel feature quantity and the weighting factors w1 to wN is obtained by learning. Even if
the direction information of the input acoustic signals x1 to xN is disturbed by reverberation or
the like, that disturbance is captured by the learning, so the target sound signal can be
emphasized without causing the problem of "target sound removal".
[0046]
Second Embodiment FIG. 5 shows an acoustic signal processing device according to a second
embodiment of the present invention.
In the present embodiment, Fourier transform units 201-1 to 201-N and an inverse Fourier
transform unit 207 are added to the acoustic signal processing device of the first embodiment
shown in FIG. 1, and the weighting units 105-1 to 105-N of FIG. 1 are replaced by weighting units
205-1 to 205-N that perform multiplication in the frequency domain. As is well known in digital
signal processing, a convolution operation in the time domain corresponds to a product in the
frequency domain. In this embodiment, the weighted addition is performed after the input
acoustic signals x1 to xN have been converted into the frequency domain by the Fourier
transform units 201-1 to 201-N. Thereafter, the inverse Fourier transform unit 207 converts the
result back into a time-domain signal to generate the output acoustic signal. In terms of signal
processing, the present embodiment performs processing equivalent to that of the first
embodiment, which operates in the time domain. The output signal of the addition unit 106
corresponding to equation (1) is expressed by the following equation in the form of a product
instead of a convolution.
[0047]
Y(k) = Σ_{n=1…N} Wn(k) Xn(k)   …(8)
[0048]
Here, k is a frequency index.
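For illustration, the following hypothetical NumPy sketch processes one frame as in equation (8); the frame length, the FFT routines and the toy weight-selection rule stand in for the Fourier transform units, the selection unit and the weighting units, and are assumptions rather than the patent's implementation:

```python
import numpy as np

def process_frame(x_frame, select_weights):
    """One frame of the second embodiment: FFT -> per-frequency weighted sum -> inverse FFT.

    x_frame: array of shape (N, L)         -- one frame of the N input channels
    select_weights: callable(X) -> (N, K)  -- weights chosen per frequency bin
    """
    X = np.fft.rfft(x_frame, axis=1)           # Fourier transform of each channel
    W = select_weights(X)                      # weight selection from inter-channel features of X
    Y = np.sum(W * X, axis=0)                  # equation (8): Y(k) = sum_n Wn(k) Xn(k)
    return np.fft.irfft(Y, x_frame.shape[1])   # back to a time-domain waveform

def select_weights(X):
    """Toy rule: pass bins where the two channels are nearly in phase (front direction)."""
    phase_diff = np.angle(X[0] * np.conj(X[1]))
    keep = (np.abs(phase_diff) < 0.2).astype(float)
    return np.stack([0.5 * keep, 0.5 * keep])

frame = np.random.randn(2, 512)
y = process_frame(frame, select_weights)
```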
[0049]
By inverse Fourier transforming the output signal Y(k) of the addition unit 106, an output acoustic
signal y(t) having a time-domain waveform is generated.
The advantages of converting to the frequency domain in this way are that the amount of
calculation may be reduced depending on the order of the weights of the weighting units 105-1
to 105-N, that processing can be performed independently for each frequency, and that
reverberation is easy to express.
As a supplement to the latter point, waveform interference due to reverberation generally differs
in intensity and phase from one frequency to another. That is, the interference may be strong at
one frequency while having little influence at another, so the variation in the frequency direction
is severe. In such a case, more precise processing is possible by processing each frequency
independently. It is also possible, for convenience of the calculation amount, to group several
frequencies together and process them as sub-bands.
Third Embodiment In the third embodiment of the present invention, as shown in FIG. 6, a
clustering unit 208 and a clustering dictionary 209 are added to the acoustic signal processing
apparatus of FIG. 5 according to the second embodiment. The clustering dictionary 209 stores I
centroids obtained by the LBG method.
[0050]
The processing procedure of this embodiment will be described with reference to FIG. 7. First, as
in the second embodiment, the input acoustic signals x1 to xN from the microphones 101-1 to
101-N are converted into the frequency domain by the Fourier transform units 205-1 to 205-N
and then input to the inter-channel feature quantity calculation unit 102, which calculates the
inter-channel feature quantity (step S21).
[0051]
Next, the inter-channel feature quantities are clustered by the clustering unit 208 with reference
to the clustering dictionary 209 to generate a plurality of clusters (step S22).
The centroid (center of gravity), that is, the representative point, of each cluster is determined by
calculation (step S23). The distances between the calculated centroid and the I centroids in the
clustering dictionary 209 are then calculated (step S24).
[0052]
The clustering unit 208 sends to the selection unit 204 the index number of the centroid that
minimizes the calculated distance (the representative point with the minimum distance).
The selection unit 204 selects the weighting factors corresponding to that index number from the
weighting factor dictionary 103 and sends them to the weighting units 105-1 to 105-N (step S25).
[0053]
In the weighting units 105-1 to 105-N, the input acoustic signals converted into the frequency
domain by the Fourier transform units 205-1 to 205-N are weighted according to the weighting
coefficients, and are added by the addition unit 206 (step S26). Thereafter, the weighted-added
signal is converted into a time domain waveform by the inverse Fourier transform unit 207 to
generate an output sound signal in which the target sound signal is emphasized.
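The selection in steps S22 to S25 can be illustrated by the sketch below (hypothetical code; the feature dimensionality, the centroids and the dictionary contents are made-up values):

```python
import numpy as np

def select_weight_index(feature, centroids):
    """Find the dictionary centroid nearest to the inter-channel feature of the current
    frame and return its index into the weighting factor dictionary.

    feature:   array of shape (D,)   -- inter-channel feature quantity of the frame
    centroids: array of shape (I, D) -- I centroids stored in the clustering dictionary 209
    """
    distances = np.linalg.norm(centroids - feature, axis=1)
    return int(np.argmin(distances))

# toy dictionaries: 3 clusters of 2-dimensional features, each tied to one weight vector
centroids = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 0.5]])
weight_dictionary = np.array([[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]])  # weighting factor dictionary

i = select_weight_index(np.array([0.1, -0.05]), centroids)
w = weight_dictionary[i]   # weights passed on to the weighting units
```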
[0054]
Next, a method of creating the weighting factor dictionary 103 of the present embodiment by
learning will be described. The inter-channel feature quantity has a certain distribution for each
sound source position and analysis frame, and the distribution is continuous. Therefore, when the
weighting factors are discretized, it is necessary to associate the inter-channel feature quantity
with the weighting factors. Although there are various methods for this association, described
here is a method in which the inter-channel feature quantities are clustered in advance by the
LBG algorithm and the weighting factors are associated with the number of the cluster whose
centroid has the minimum distance from the inter-channel feature quantity. That is, the average
value of the inter-channel feature quantity is determined for each cluster, and one set of
weighting factors is associated with each cluster.
[0055]
When creating the clustering dictionary 209, N-channel learning input acoustic signals are
obtained by receiving, with the microphones 101-1 to 101-N, a series of sounds emitted from a
sound source whose position is varied under an assumed reverberation environment. The
inter-channel feature quantity is calculated for these signals as described above, and the LBG
algorithm is applied to it. Next, the weighting factor dictionary 103 corresponding to the clusters
is created as follows.
[0056]
The relationship between the input acoustic signal and the output acoustic signal in the
frequency domain is expressed by the following equation.
[0057]
Y(k) = W(k)^h X(k)   …(9)
[0058]
Here, X(k) = {X1(k), X2(k), ..., XN(k)} is a vector consisting of the input acoustic signals of the
respective channels, and W(k) is likewise a vector consisting of the weighting coefficients of the
respective channels.
k represents a frequency index and <h> represents conjugate transpose.
[0059]
Let X(m, k) denote the learning input acoustic signal of the m-th frame from the microphones, let
Y(m, k) denote the output acoustic signal obtained by weighting and adding the learning input
acoustic signal X(m, k) according to the weighting factors, and let S(m, k) denote the target
signal, that is, the desired value of Y(m, k). X(m, k), Y(m, k) and S(m, k) constitute the learning
data of the m-th frame. Hereinafter, the frequency index k is omitted from the notation.
[0060]
Let M be the total number of frames of learning data generated in various environments such as
different sound source positions, and attach a frame index to each frame. The inter-channel
feature quantities of the learning input acoustic signals are clustered, and the set of frame indices
whose inter-channel feature quantities belong to the i-th cluster is denoted by Ci. Next, the error
between the output acoustic signal of the learning data belonging to the i-th cluster and the
target signal is determined. This error is, for example, the sum Ji of the squared errors between
the output signals of the learning data belonging to the i-th cluster and the target signals, and is
expressed by the following equation.
[0061]
Ji = Σ_{m∈Ci} |Y(m) − S(m)|^2 = Σ_{m∈Ci} |W^h X(m) − S(m)|^2   …(10)
[0062]
Let Wi, which minimizes Ji in equation (10), be the weighting factor corresponding to the i-th
cluster.
The weighting factor Wi is obtained by setting the partial derivative of Ji with respect to W to
zero, which gives
[0063]
Wi = inv(Rxx) rxs   …(11)
[0064]
where
[0065]
Rxx = E{ X(m) X(m)^h },   rxs = E{ X(m) conj(S(m)) }   …(12)
[0066]
Here, E{} represents an expected value (taken over the frames m belonging to Ci).
[0067]
This is performed for all clusters, and Wi (i = 1, 2, ..., I) is recorded as the weighting factor
dictionary 103, where I is the total number of clusters.
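The per-cluster learning of equations (10) to (12) for one frequency bin might be sketched as follows (hypothetical code; the cluster assignments and the learning data are assumed to be given, and the closed-form solution follows the reconstruction above):

```python
import numpy as np

def learn_weight_dictionary(X, S, cluster_of_frame, n_clusters):
    """For each cluster i, choose Wi minimizing Ji = sum_{m in Ci} |Wi^h X(m) - S(m)|^2,
    i.e. Wi = inv(E{X X^h}) E{X conj(S)}, as in equations (10)-(12).

    X: array of shape (M, N) -- learning input signals, M frames, N channels (one frequency bin)
    S: array of shape (M,)   -- target (desired) signal per frame
    cluster_of_frame: array of shape (M,) with cluster indices 0..n_clusters-1
    """
    dictionary = np.zeros((n_clusters, X.shape[1]), dtype=complex)
    for i in range(n_clusters):
        Xi = X[cluster_of_frame == i]              # frames whose features fall in cluster i
        Si = S[cluster_of_frame == i]
        Rxx = Xi.T @ Xi.conj() / len(Xi)           # E{ X X^h }
        rxs = Xi.T @ Si.conj() / len(Xi)           # E{ X conj(S) }
        dictionary[i] = np.linalg.solve(Rxx, rxs)  # Wi of equation (11)
    return dictionary

# toy check: the target is generated by a known weight vector, which the learning recovers
M = 200
X = np.random.randn(M, 2) + 1j * np.random.randn(M, 2)
S = X @ np.array([0.5, 0.5])                       # W_true^h X with real W_true = (0.5, 0.5)
W = learn_weight_dictionary(X, S, np.zeros(M, dtype=int), n_clusters=1)
print(W[0])                                        # approximately [0.5, 0.5]
```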
[0068]
As the correspondence between the inter-channel feature quantity and the weighting factors,
various methods can be considered, such as a statistical method using a GMM, and the
correspondence is not limited to that of the present embodiment. Moreover, although the method
of setting the weighting coefficients in the frequency domain has been described in this
embodiment, it is also possible to set the weighting coefficients in the time domain.
[0069]
Fourth Embodiment In a fourth embodiment of the present invention, as shown in FIG. 8, the
microphones 101-1 to 101-N and the acoustic signal processing device 100 described in any of
the first to third embodiments are disposed in a room 602 in which speakers 601-1 and 601-2 are
present. The room 602 is, for example, the inside of a car. The acoustic signal processing device
100 sets the target sound direction to the direction of the speaker 601-1, and its weighting factor
dictionary has been created by performing the learning described in the third embodiment in an
environment identical or reasonably similar to the room 602. Therefore, the utterance of the
speaker 601-1 is not suppressed, and only the utterance of the speaker 601-2 is suppressed.
[0070]
In practice, there are fluctuation factors such as the load in the vehicle and the state of the
windows, in addition to fluctuations related to the sound source such as the sitting position, body
shape and seat position of the person. At the time of learning, these variations are included in the
learning data so that the design is robust to such fluctuation factors, but additional learning is
conceivable when it is desired to optimize further for a particular situation. For example, the
speaker 601-1 is asked to make several utterances, and based on these, the clustering dictionary
and the weighting factor dictionary (not shown) included in the acoustic signal processing
apparatus 100 are updated. Similarly, it is also possible to have the speaker 601-2 speak and
update the dictionaries so as to suppress that voice.
[0071]
Fifth Embodiment In the fifth embodiment of the present invention, as shown in FIG. 9, the
microphones 101-1 and 101-2 are disposed on both sides of a robot head 701, that is, at the ear
positions, and are connected to the acoustic signal processing apparatus 100 described in any
one of the first to third embodiments.
[0072]
With the microphones 101-1 and 101-2 installed on the robot head 701 in this way, the direction
information of the arriving acoustic waves is likely to be disturbed, similarly to reverberation, by
the complex diffraction of sound waves at the head 701.
That is, when the microphones 101-1 and 101-2 are arranged on the robot head 701 in this
manner, the head 701 is present as an obstacle on the straight line connecting a microphone and
the sound source. For example, when there is a sound source on the left side of the robot head
701, the direct sound reaches the microphone 101-2 located at the left ear, but for the
microphone 101-1 located at the right ear the head 701 acts as an obstacle, so the direct sound
does not arrive; instead, a diffracted wave that has traveled around the head 701 arrives.
[0073]
Such diffraction effects are laborious to analyze mathematically. Therefore, when there is an
obstacle between the microphones, for example when the microphones are placed at the ears of
the robot head 701 as shown in FIG. 9, or when an obstacle such as a pillar or a wall lies between
them, it is difficult to estimate the sound source direction.
[0074]
According to the first to third embodiments of the present invention, even if an obstacle is
present on the straight line connecting a microphone and the sound source in this way, the
influence of diffraction due to the obstacle is taken into the acoustic signal processing apparatus
by learning, and it becomes possible to emphasize only the target sound signal from a specific
direction.
[0075]
Sixth Embodiment FIG. 10 shows an echo canceller which is an acoustic signal processing device
according to a sixth embodiment of the present invention.
In the echo canceller of this embodiment, the microphones 101-1 to 101-N, the acoustic signal
processing device 100, the transmitter 802 and the loudspeaker 803 are disposed in a room 801
such as the interior of a car. When a hands-free call is made using a telephone, a personal digital
assistant (PDA), a personal computer (PC) or the like, the component (echo) of the sound emitted
from the loudspeaker 803 may be picked up and sent to the other end of the call. An echo
canceller is generally used to prevent this.
[0076]
In the present embodiment, the acoustic signal processing apparatus 100 makes use of the fact
that directivity can be formed by learning: by learning in advance with the target signal set to 0
for the acoustic signal emitted from the loudspeaker 803, that signal can be suppressed. At the
same time, by learning to pass acoustic signals from the front direction, the voice of the speaker
can be passed while the sound from the loudspeaker 803 is suppressed. Applying this principle,
learning can also be performed, for example, so as to suppress music played from a loudspeaker
in a car.
[0077]
The acoustic signal processing described in the first to sixth embodiments can also be realized,
for example, by using a general-purpose computer device as the basic hardware. That is, the
acoustic signal processing described above can be realized by causing a processor mounted in
the computer device to execute a program. The program may be installed in the computer device
in advance, or may be stored in a storage medium such as a CD-ROM or distributed through a
network and then installed in the computer device as appropriate.
[0078]
The present invention is not limited to the above embodiments as they are; at the implementation
stage, the constituent elements can be modified without departing from the scope of the
invention. In addition, various inventions can be formed by appropriately combining the plurality
of constituent elements disclosed in the above embodiments. For example, some components
may be deleted from the complete set of components shown in an embodiment. Furthermore,
components of different embodiments may be combined as appropriate.
[0079]
FIG. 1 is a block diagram of the acoustic signal processing device according to the first
embodiment of the present invention; FIG. 2 is a flowchart showing the processing procedure
according to the first embodiment; FIG. 3 is a diagram for explaining the method of setting the
weighting coefficients in the first embodiment; FIG. 4 is a further diagram for explaining the
method of setting the weighting coefficients in the first embodiment; FIG. 5 is a block diagram of
the acoustic signal processing apparatus according to the second embodiment of the present
invention; FIG. 6 is a block diagram of the acoustic signal processing apparatus according to the
third embodiment of the present invention; FIG. 7 is a flowchart showing the processing
procedure according to the third embodiment; FIG. 8 is a schematic plan view showing a usage
example of the acoustic signal processing apparatus in the fourth embodiment of the present
invention; FIG. 9 is a schematic plan view showing a usage example of the acoustic signal
processing apparatus in the fifth embodiment of the present invention; FIG. 10 is a block diagram
of an echo canceller using the acoustic signal processing device according to the sixth
embodiment of the present invention.
Explanation of sign
[0080]
101-1 to 101-N: microphone; 102: inter-channel feature quantity calculation unit; 103: weighting
coefficient dictionary; 104: selection unit; 105-1 to 105-N: weighting unit; 106: adder; 204:
selection unit; 205-1 to 205-N: Fourier transform unit; 207: inverse Fourier transform unit; 208:
clustering unit; 209: clustering dictionary