Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2005234246
PROBLEM TO BE SOLVED: To provide a sound source separation method and system capable of
accurately separating a target sound from a plurality of noises.
SOLUTION: In separating a target sound SC from noises SR and SL, a sound insulator 20 is
disposed at the boundary between a first space and a second space partitioned by a plane that
includes the arrival direction of the target sound SC. First and second sound receiving devices
30, 40, provided on the first-space and second-space sides of the sound insulator 20, form
directivity in four directions, receive sound, and perform frequency analysis. First layer
processing (for example, SAFIA) then separates the noise SR on the first space side from the
other sounds (SC, SL), and separates the noise SL on the second space side from the other
sounds (SC, SR). The target sound SC is then separated in second layer processing (for example,
spectral subtraction, SS). Preferably, spectrum integration is performed as third layer
processing. [Selected figure] Figure 1
Sound source separation method and system thereof
[0001]
The present invention relates to a sound source separation method and system for separating a
target sound and noise, and can be used, for example, to perform hands-free speech recognition
with microphones installed on the left and right sides of a robot head.
[0002]
In ordinary voice recognition, the voice uttered by the speaker is recorded by a close-talking
microphone placed near the mouth, and recognition processing is performed.
10-04-2019
1
However, in many applications, such as dialogue with a robot, voice-based operation of a car
navigation system, or the creation of meeting minutes, imposing the use of a close-talking
microphone on the user is unnatural. In such applications, it is desirable that voices be
recorded and recognized by microphones installed on the system side.
[0003]
To address this problem, a method called SAFIA has been proposed that separates sounds by
exploiting the difference in the sound pressure reaching each microphone, which arises from the
difference in the positional relationship between each microphone and each sound source (see
Patent Document 1). SAFIA separates sounds by band selection: the output signals of a plurality
of fixed microphones are spectrum-analyzed, and for each frequency band the sound of that band
is attributed to the microphone giving the largest power (see FIG. 3 described later).
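The band-selection idea can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name is ours, and the inputs are assumed to be per-band magnitude spectra obtained from a short-time Fourier transform of each microphone signal.

```python
import numpy as np

def band_selection(spec_a, spec_b):
    """SAFIA-style band selection on two magnitude spectra.

    For each frequency band, the sound of that band is attributed to
    whichever input shows the larger power; the other output is zeroed
    in that band.
    """
    dominant_a = np.abs(spec_a) >= np.abs(spec_b)  # bands where input A wins
    out_a = np.where(dominant_a, spec_a, 0.0)      # A keeps its dominant bands
    out_b = np.where(~dominant_a, spec_b, 0.0)     # B keeps the rest
    return out_a, out_b
```

Because the decision is made independently per band, the processing stays simple even when the two sources overlap in time, which is the property the first layer processing relies on.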
[0004]
Japanese Patent No. 3355598 (paragraphs [0006], [0007], FIG. 1, and abstract)
[0005]
However, although the above-described SAFIA can separate two overlapping sounds well, when
there are three or more sound sources the separation performance deteriorates markedly, even
though separation is theoretically possible.
Therefore, in the presence of multiple noise sources, it is difficult to accurately separate the
target sound from the multiple noises.
[0006]
An object of the present invention is to provide a sound source separation method and system
capable of accurately separating a target sound from a plurality of noises.
[0007]
The present invention is a sound source separation method for separating a target sound and
noise. A sound insulator is disposed at the boundary between a first space and a second space
partitioned by a plane that includes the arrival direction of the target sound. A first sound
receiving device, provided on the first space side of the sound insulator, forms directivity in
two directions: a first space side main direction, which coincides with the arrival direction of
the target sound or forms a small angle with it, and a first space side sub direction, which
forms a larger angle with the arrival direction of the target sound than the main direction
does; it receives sound and performs frequency analysis of the received signals. In parallel, a
second sound receiving device, provided on the second space side of the sound insulator,
likewise forms directivity in two directions: a second space side main direction, coinciding
with or at a small angle to the arrival direction of the target sound, and a second space side
sub direction, at a larger angle to it than the main direction; it receives sound and performs
frequency analysis of the received signals. Then, as first layer processing, a first space side
noise separation process separates the noise on the first space side from the other sounds,
using the spectrum obtained by directing directivity in the second space side main direction
with the second sound receiving device and the spectrum obtained by directing directivity in
the first space side sub direction with the first sound receiving device; likewise, a second
space side noise separation process separates the noise on the second space side from the other
sounds, using the spectrum obtained by directing directivity in the first space side main
direction with the first sound receiving device and the spectrum obtained by directing
directivity in the second space side sub direction with the second sound receiving device.
Then, as second layer processing, a first target sound separation process separates the target
sound using the spectrum of the first space side noise separated by the first space side noise
separation process and the spectrum of the sounds other than the second space side noise
separated by the second space side noise separation process, and/or a second target sound
separation process separates the target sound using the spectrum of the second space side noise
separated by the second space side noise separation process and the spectrum of the sounds
other than the first space side noise separated by the first space side noise separation
process.
[0008]
Here, "forming directivity in two directions, receiving sound, and performing frequency
analysis of the received signals" can be realized with two directional microphones (so-called
fixed microphones) directed in different directions, whether fixedly installed with respect to
the sound insulator or installed swingably, with frequency analysis performed on each received
signal. It also includes, for example, forming two directivities by directivity control using
the output signals of a plurality of nondirectional or directional microphones constituting a
microphone array device, and performing frequency analysis on each signal obtained by that
directivity control.
[0009]
Directivity control with a microphone array, as in the latter case, is a known technique;
examples include directivity control by a delayed-sum array (beam forming) and directivity
control using an adaptive array such as Directionally Constrained Minimization of Power (DCMP).
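As one illustration of such known directivity control, a delayed-sum beamformer can be sketched as follows. This is a generic frequency-domain implementation under our own naming and sign conventions, not a detail of the patent:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, look_dir, fs, c=343.0):
    """Minimal frequency-domain delay-and-sum beamformer.

    signals:       (n_mics, n_samples) array of time-domain signals
    mic_positions: (n_mics, 3) microphone coordinates in metres
    look_dir:      unit 3-vector pointing toward the desired source
    fs:            sampling rate in Hz; c is the speed of sound
    """
    n_mics, n = signals.shape
    spec = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # Plane-wave delay of each mic relative to the origin (sign convention
    # assumed: mics farther along look_dir receive the wavefront earlier).
    delays = mic_positions @ look_dir / c
    # Phase-shift each channel to time-align the look direction, then average.
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spec * steering).mean(axis=0), n=n)
```

Signals arriving from the look direction add coherently after the alignment, while sounds from other directions are attenuated, which is how a main or sub directivity can be pointed electronically rather than by physically aiming a microphone.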
[0010]
"Forming directivity" also includes forming directivity by using the presence of the sound
insulator. For example, to "form directivity" in the "first space side sub direction" and/or
the "second space side sub direction", a nondirectional microphone may be used: with the sound
insulator placed between the two spaces, the microphone can hardly receive sound from the
opposite space but easily receives sound from the space in which it is installed, so that a
directivity directed toward its own space is effectively formed.
[0011]
The phrase "forming directivity in two directions and receiving sound" refers to the two
directivities whose received signals are effectively used in the hierarchical sound source
separation processing performed after sound reception.
Therefore, the present invention also covers, for example, the case where three or more
directional microphones are installed but the output signals of only two of them are
selectively used in the subsequent hierarchical processing, and the case where all the output
signals are used but only those of substantially two directional microphones function
effectively; in either case there are effectively two directional microphones.
The same applies to directivity control in three or more directions by a microphone array
device: the case where two of the output signals obtained by such control are selectively used
to perform the subsequent hierarchical source separation processing is also covered by the
present invention.
[0012]
Also, the "first space side sub direction" and/or the "second space side sub direction" need
not be fixed directions; for example, if the direction of a noise is known, directivity may be
formed toward that noise. That is, the first space side sub direction and/or the second space
side sub direction may be made to coincide or substantially coincide with the arrival direction
of the noise.
[0013]
Furthermore, "forming a small angle with the arrival direction of the target sound" means that
the first space side main direction forms a relatively smaller angle with the arrival direction
of the target sound than the first space side sub direction does, and that the second space
side main direction forms a relatively smaller angle than the second space side sub direction
does. Similarly, "forming a large angle with the arrival direction of the target sound" means
that the first space side sub direction forms a relatively larger angle with the arrival
direction of the target sound than the first space side main direction does, and that the
second space side sub direction forms a relatively larger angle than the second space side main
direction does.
[0014]
From the viewpoint of improving the separation accuracy of the target sound, it is preferable
that the "first space side main direction" and the "second space side main direction" coincide
or substantially coincide with the arrival direction of the target sound. Even when they do
not, it is preferable to set them in directions that are plane-symmetric across the sound
insulator, from the viewpoint of obtaining an equivalent separation effect in the processing
performed in parallel (as a pair) in each layer.
[0015]
Also, the "first space side sub direction" and the "second space side sub direction" do not
necessarily have to be plane-symmetric directions across the sound insulator, but it is
preferable to set them plane-symmetrically across the sound insulator, from the viewpoint of
obtaining an equivalent separation effect in the processing performed in parallel (as a pair)
in each layer.
[0016]
In the sound source separation method of the present invention, the first and second sound
receiving devices are installed with the sound insulator sandwiched between them, and each of
them receives sound with directivity formed in two directions.
For this reason, owing to the presence of the sound insulator and the formation of the four
directivities, each of the spectra obtained by directing directivity in the four directions is
a mixed spectrum in which the spectrum of the target sound, the spectrum of the first space
side noise, and the spectrum of the second space side noise are dominant or weak in a different
combination.
[0017]
That is, in the spectrum obtained by directing directivity in the first space side main
direction, the spectrum of the target sound and the spectrum of the first space side noise are
dominant, while the spectrum of the second space side noise is weak.
In the spectrum obtained by directing directivity in the first space side sub direction, the
spectrum of the first space side noise is dominant, while the spectra of the target sound and
the second space side noise are weak. In the spectrum obtained by directing directivity in the
second space side main direction, the spectrum of the target sound and the spectrum of the
second space side noise are dominant, while the spectrum of the first space side noise is weak.
In the spectrum obtained by directing directivity in the second space side sub direction, the
spectrum of the second space side noise is dominant, while the spectra of the target sound and
the first space side noise are weak.
[0018]
Therefore, by performing the first layer processing and the second layer processing using the
four spectra obtained by directing directivity in these four directions, the target sound can
be accurately separated even in the presence of noise on both the first space side and the
second space side, thereby achieving the above object.
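As a concrete illustration of this hierarchy, the following sketch assumes the preferred choices described later in this document: band selection (SAFIA) for the first layer, spectral subtraction for the second, and Minimization for the third. All function and variable names are ours, and the four inputs are treated as per-band power spectra:

```python
import numpy as np

def hierarchical_separation(main1, sub1, main2, sub2, alpha=1.0):
    """Hierarchical separation from the four directional power spectra.

    main1/sub1: first space side main/sub direction power spectra
    main2/sub2: second space side main/sub direction power spectra
    """
    # Layer 1: band selection pairs each side's sub spectrum with the
    # OPPOSITE side's main spectrum, separating that side's noise.
    noise1 = np.where(sub1 >= main2, sub1, 0.0)    # first space side noise
    others1 = np.where(main2 > sub1, main2, 0.0)   # sounds other than noise1
    noise2 = np.where(sub2 >= main1, sub2, 0.0)    # second space side noise
    others2 = np.where(main1 > sub2, main1, 0.0)   # sounds other than noise2

    # Layer 2: spectral subtraction of each noise estimate from the
    # opposite "other sounds" spectrum yields two target-sound estimates.
    target_a = np.maximum(others2 - alpha * noise1, 0.0)
    target_b = np.maximum(others1 - alpha * noise2, 0.0)

    # Layer 3: keep the smaller power per band (Minimization).
    return np.minimum(target_a, target_b)
```

The point of the structure is that each layer only ever compares two spectra at a time, so the three-source situation is reduced to a cascade of two-source problems.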
[0019]
In the sound source separation method described above, it is desirable that both the first and
second target sound separation processes be performed as the second layer processing, and that
spectrum integration then be performed as third layer processing: using the spectrum of the
target sound separated in the first target sound separation processing and the spectrum of the
target sound separated in the second target sound separation processing, either their powers
are added for each frequency band, or the powers are compared for each frequency band and the
smaller power is attributed to the spectrum of the target sound.
[0020]
Here, "to add" also includes the case where the sum is multiplied by a proportional coefficient
(for example, multiplying by 1/2, i.e. averaging).
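The two integration variants can be sketched as follows on per-band power spectra. This is a minimal illustration; the function name and the choice of 1/2 as the proportional coefficient for addition are ours:

```python
import numpy as np

def integrate_spectra(power1, power2, mode="min"):
    """Third-layer integration of two target-sound power spectra.

    mode="add": add the per-band powers (here scaled by the proportional
                coefficient 1/2, i.e. averaging) -- Addition;
    mode="min": attribute the smaller per-band power to the target sound,
                suppressing residual noise -- Minimization.
    """
    if mode == "add":
        return 0.5 * (power1 + power2)
    return np.minimum(power1, power2)
```

Minimization works because each of the two estimates over-states the target in the bands where its own residual noise survives, so the per-band minimum discards whichever estimate is more contaminated.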
[0021]
Thus, when spectrum integration is performed as third layer processing, a target sound with
much higher separation accuracy is obtained than that produced by the first or second target
sound separation processing alone in the second layer.
[0022]
That is, with the method of adding the two obtained signals (hereinafter, Addition), the
addition emphasizes only the target sound.
[0023]
With the method of attributing the smaller power per frequency band to the spectrum of the
target sound (hereinafter, Minimization), the following applies: the spectra of the target
sound obtained up to the second layer processing contain residual noise that could not be
eliminated even in the second layer, so some influence of the noise on the first space side and
the second space side may remain.
For this reason, the spectrum of the target sound obtained by the second layer processing tends
to be observed with values larger than the spectrum the target sound originally contains.
Therefore, the influence of the noise on the first space side and the second space side can be
eliminated by attributing, for each frequency band, the smaller of the two separately obtained
powers to the target sound.
[0024]
Note that, as shown in the experimental results described later (see FIG. 6), the spectrum
integration processing that attributes the smaller power to the spectrum of the target sound
(Minimization) is preferable to the additive spectrum integration processing (Addition) in that
higher separation accuracy is obtained.
[0025]
Furthermore, in the sound source separation method described above, the first space side noise
separation processing and the second space side noise separation processing as the first layer
processing may each compare, for each frequency band, the magnitudes of the powers of the same
frequency band of the two spectra, and perform band selection in which the larger power in each
band is attributed to the spectrum obtained by separation.
[0026]
Thus, when band selection is performed as the first layer processing (that is, when sound
source separation is performed by the so-called SAFIA technique), effective separation can be
achieved with relatively simple processing.
[0027]
In the sound source separation method described above, it is desirable that the first target
sound separation processing as the second layer processing be spectral subtraction in which,
from the power of each frequency band of the spectrum of the sounds other than the second space
side noise separated by the second space side noise separation processing of the first layer,
the power of the same frequency band of the spectrum of the first space side noise separated by
the first space side noise separation processing, multiplied by a proportional coefficient, is
subtracted; and that the second target sound separation processing as the second layer
processing be spectral subtraction in which, from the power of each frequency band of the
spectrum of the sounds other than the first space side noise separated by the first space side
noise separation processing of the first layer, the power of the same frequency band of the
spectrum of the second space side noise separated by the second space side noise separation
processing, multiplied by a proportional coefficient, is subtracted.
[0028]
As described above, when spectral subtraction (SS) is performed as the second layer processing,
separation of the target sound with high accuracy is realized.
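This per-band power subtraction can be sketched as follows. The clipping of negative results to zero is a common safeguard in spectral subtraction that we add for robustness; the function name and default coefficient are ours:

```python
import numpy as np

def spectral_subtraction(mixture_power, noise_power, alpha=1.0):
    """Per-band power spectral subtraction with proportional coefficient alpha.

    From the power of each frequency band of the 'target plus residual
    noise' spectrum, subtract alpha times the power of the same band of
    the separated noise spectrum; negative results are clipped to zero.
    """
    return np.maximum(mixture_power - alpha * noise_power, 0.0)
```

The proportional coefficient alpha allows over- or under-subtraction to be tuned against the accuracy of the noise estimate produced by the first layer.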
[0029]
Alternatively, in the sound source separation method described above, the first and second
target sound separation processes as the second layer processing may compare, for each
frequency band, the magnitudes of the powers of the same frequency band of the two spectra, and
perform band selection in which the larger power in each band is attributed to the spectrum
obtained by separation.
[0030]
As described above, even when band selection is performed as the second layer processing (that
is, when sound source separation is performed by the so-called SAFIA technique), highly
accurate separation of the target sound is realized.
However, as shown in the experimental results described later (see FIG. 6), performing spectral
subtraction as the second layer processing is preferable in that higher separation accuracy is
obtained.
[0031]
Furthermore, in the sound source separation method described above, it is desirable that the
first sound receiving apparatus be configured using two directional microphones arranged to
direct directivity in the first space side main direction and the first space side sub
direction, respectively, and that the second sound receiving apparatus be configured using two
directional microphones arranged to direct directivity in the second space side main direction
and the second space side sub direction, respectively.
[0032]
When sound is received using four directional microphones as described above, sound source
separation with high accuracy can be realized with a simple configuration, so that the cost of
equipment can be reduced.
[0033]
In the sound source separation method described above, it is desirable that the first space
side main direction and the second space side main direction coincide or substantially coincide
with the arrival direction of the target sound, and that the first space side sub direction and
the second space side sub direction be orthogonal or substantially orthogonal to the arrival
direction of the target sound.
[0034]
When sound is received with the four directions set in this way, coinciding or almost
coinciding with the arrival direction of the target sound and orthogonal or substantially
orthogonal to it, sound reception and sound source separation can be performed even when the
directions of the noises are unknown.
[0035]
Further, the sound source separation method of the present invention described above can be
realized by the following sound source separation system of the present invention.
[0036]
That is, the present invention is a sound source separation system for separating a target
sound and noise, comprising: a sound insulator disposed at the boundary between a first space
and a second space partitioned by a plane that includes the arrival direction of the target
sound; a first sound receiving device, provided on the first space side of the sound insulator,
that forms directivity in two directions, a first space side main direction coinciding with or
at a small angle to the arrival direction of the target sound and a first space side sub
direction at a larger angle to it than the main direction, receives sound, and performs
frequency analysis of the received signals; a second sound receiving device, provided on the
second space side of the sound insulator, that forms directivity in two directions, a second
space side main direction coinciding with or at a small angle to the arrival direction of the
target sound and a second space side sub direction at a larger angle to it than the main
direction, receives sound, and performs frequency analysis of the received signals; first layer
processing means that performs first space side noise separation processing, which separates
the first space side noise from the other sounds using the spectrum obtained by directing
directivity in the second space side main direction with the second sound receiving device and
the spectrum obtained by directing directivity in the first space side sub direction with the
first sound receiving device, and second space side noise separation processing, which
separates the second space side noise from the other sounds using the spectrum obtained by
directing directivity in the first space side main direction with the first sound receiving
device and the spectrum obtained by directing directivity in the second space side sub
direction with the second sound receiving device; and second layer processing means that
performs first target sound separation processing, which separates the target sound using the
spectrum of the first space side noise separated by the first space side noise separation
processing and the spectrum of the sounds other than the second space side noise separated by
the second space side noise separation processing, and/or second target sound separation
processing, which separates the target sound using the spectrum of the second space side noise
separated by the second space side noise separation processing and the spectrum of the sounds
other than the first space side noise separated by the first space side noise separation
processing.
[0037]
With such a sound source separation system of the present invention, the operation and effects
obtained by the above-described sound source separation method of the present invention are
obtained as they are, thereby achieving the above object.
[0038]
In the sound source separation system described above, it is desirable that the second layer
processing means be configured to perform both the first and second target sound separation
processes, and that a third layer processing means be provided which performs spectrum
integration processing: using the spectrum of the target sound separated in the first target
sound separation processing and the spectrum of the target sound separated in the second target
sound separation processing by the second layer processing means, either their powers are added
for each frequency band, or the powers are compared for each frequency band and the smaller
power is attributed to the spectrum of the target sound.
[0039]
Furthermore, in the sound source separation system described above, the first space side noise
separation processing and the second space side noise separation processing by the first layer
processing means may each compare, for each frequency band, the magnitudes of the powers of the
same frequency band of the two spectra, and perform band selection in which the larger power in
each band is attributed to the spectrum obtained by separation.
[0040]
In the sound source separation system described above, it is desirable that the first target
sound separation processing by the second layer processing means be spectral subtraction in
which, from the power of each frequency band of the spectrum of the sounds other than the
second space side noise separated by the second space side noise separation processing of the
first layer processing means, the power of the same frequency band of the spectrum of the first
space side noise separated by the first space side noise separation processing, multiplied by a
proportional coefficient, is subtracted; and that the second target sound separation processing
by the second layer processing means be spectral subtraction in which, from the power of each
frequency band of the spectrum of the sounds other than the first space side noise separated by
the first space side noise separation processing, the power of the same frequency band of the
spectrum of the second space side noise separated by the second space side noise separation
processing, multiplied by a proportional coefficient, is subtracted.
[0041]
In the sound source separation system described above, the first and second target sound
separation processes by the second layer processing means may compare, for each frequency band,
the magnitudes of the powers of the same frequency band of the two spectra, and perform band
selection in which the larger power in each band is attributed to the spectrum obtained by
separation.
[0042]
Furthermore, in the sound source separation system described above, it is desirable that the
first sound receiving apparatus be configured to include two directional microphones arranged
to direct directivity in the first space side main direction and the first space side sub
direction, respectively, and that the second sound receiving apparatus be configured to include
two directional microphones arranged to direct directivity in the second space side main
direction and the second space side sub direction, respectively.
[0043]
In the sound source separation system described above, it is desirable that the first space
side main direction and the second space side main direction coincide or substantially coincide
with the arrival direction of the target sound, and that the first space side sub direction and
the second space side sub direction be orthogonal or substantially orthogonal to the arrival
direction of the target sound.
[0044]
Further, the present invention is a sound source separation system for separating a target
sound and noise, characterized by comprising: a sound insulator disposed at the boundary
between a first space and a second space partitioned by a plane that includes the arrival
direction of the target sound; a first sound receiving device, provided on the first space side
of the sound insulator, that forms directivity in two directions, a first space side main
direction coinciding with or at a small angle to the arrival direction of the target sound and
a first space side sub direction at a larger angle to it than the main direction, receives
sound, and performs frequency analysis of the received signals; and a second sound receiving
device, provided on the second space side of the sound insulator, that forms directivity in two
directions, a second space side main direction coinciding with or at a small angle to the
arrival direction of the target sound and a second space side sub direction at a larger angle
to it than the main direction, receives sound, and performs frequency analysis of the received
signals.
[0045]
When a sound source separation system provided with the above-described sound insulator, first
sound receiving device, and second sound receiving device is configured, the above-described
first layer processing and second layer processing, and where appropriate third layer
processing, can then be performed.
[0046]
As described above, according to the present invention, the first and second sound receiving
devices are installed with the sound insulator sandwiched between them, and each of them
receives sound with directivity formed in two directions. Each of the spectra obtained by
directing directivity in the four directions is therefore a mixed spectrum in which the
spectrum of the target sound, the spectrum of the first space side noise, and the spectrum of
the second space side noise are dominant or weak in a different combination, so hierarchical
sound source separation processing using these four spectra has the effect that the target
sound can be accurately separated from a plurality of noises.
[0047]
An embodiment of the present invention will be described below with reference to the drawings.
FIG. 1 shows the overall configuration of a sound source separation system 10 according to the
present embodiment.
A detailed configuration of part of the sound source separation system 10 is shown in FIG. 2.
FIG. 3 is an explanatory diagram of band selection processing (SAFIA) performed in the first layer
processing by the sound source separation system 10.
[0048]
Referring to FIGS. 1 and 2, the sound source separation system 10 is a system that performs
processing for separating a target sound and noise. It comprises a robot head 20, which is a
sound insulator disposed at the boundary between a first space and a second space partitioned
by a plane including the arrival direction of the target sound; a first sound receiving device
30 provided on the first space side; a second sound receiving device 40 provided on the second
space side; and first layer processing means 50, second layer processing means 60, and third
layer processing means 70, which hierarchically perform sound source separation processing
using the received-sound signals of the first sound receiving device 30 and the second sound
receiving device 40.
[0049]
Here, the sound insulation body in the present invention is not limited in shape, size, application,
etc. as long as it is an object having a sound insulation function, but in the present embodiment,
it will be described as the robot head 20 as an example.
Therefore, in the present embodiment, the first space is the right space of the robot head 20, and
the second space is the left space of the robot head 20.
Further, in the present embodiment, the target sound is the sound emitted from the sound source SC located in front (directly in front) of the robot head 20 (hereinafter, the target sound is denoted SC without distinguishing it from the code of the sound source SC). The noise on the first space side is the sound emitted from the sound source SR on the right space side (hereinafter denoted noise SR without distinguishing it from the code of the sound source SR). The noise on the second space side is the sound emitted from the sound source SL on the left space side (hereinafter denoted noise SL without distinguishing it from the code of the sound source SL).
[0050]
As shown in FIG. 2, the first sound receiving device 30 has a directional microphone 31 directed in a direction that matches or substantially matches the arrival direction of the target sound SC (the first space side main direction in the present invention); it is labeled RF-Mic in the figure in the sense that it is provided on the right (Right) and directed to the front (Front). It also has a directional microphone 32 directed in a direction orthogonal or substantially orthogonal to the arrival direction of the target sound SC (the first space side sub direction in the present invention); it is labeled RR-Mic in the figure in the sense that it is provided on the right (Right) and directed to the right (Right). The device further has frequency analysis means 33 and 34 for analyzing the frequencies of the output signals of these directional microphones 31 and 32.
[0051]
As shown in FIG. 2, the second sound receiving device 40 has a directional microphone 41 directed in a direction that matches or substantially matches the arrival direction of the target sound SC (the second space side main direction in the present invention); it is labeled LF-Mic in the figure in the sense that it is provided on the left (Left) and directed to the front (Front). It also has a directional microphone 42 directed in a direction orthogonal or substantially orthogonal to the arrival direction of the target sound SC (the second space side sub direction in the present invention); it is labeled LL-Mic in the figure in the sense that it is provided on the left (Left) and directed to the left (Left). The device further has frequency analysis means 43 and 44 for analyzing the frequencies of the output signals of these directional microphones 41 and 42.
[0052]
For example, a fast Fourier transform (FFT), a generalized harmonic analysis (GHA), or the like
can be adopted as the frequency analysis performed by each of the frequency analysis means 33,
34, 43, and 44.
Although these frequency analysis means 33, 34, 43, and 44 are described as being divided into four for convenience of explanation, they may actually be realized by one computer (including an analyzer) or one central processing unit (CPU). Further, the frequency analysis means 33, 34 of the first sound receiving device 30 and the frequency analysis means 43, 44 of the second sound receiving device 40 need not be separately provided on the first space side and the second space side as shown in the figure; it is only the directional microphones 31, 32 and the directional microphones 41, 42, which are the sound receiving units, that must be separately provided on the first space side and the second space side.
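The frequency analysis performed by each analysis means can be sketched as follows. This is a minimal illustration assuming NumPy; the 2048-point FFT size and Hanning window follow the experimental conditions described later, while the function name and test signal are illustrative and not part of the patent.

```python
import numpy as np

def analyze_frame(signal, frame_len=2048):
    """Frequency analysis of one frame: Hanning window followed by FFT,
    returning the amplitude spectrum of one sound reception signal."""
    window = np.hanning(frame_len)
    frame = signal[:frame_len] * window
    return np.abs(np.fft.rfft(frame))

# The four analysis means 33, 34, 43, 44 can share this one routine,
# consistent with their realization by a single computer or CPU.
rng = np.random.default_rng(0)
mixed = rng.standard_normal(2048)   # stand-in for one received mixed signal
spectrum = analyze_frame(mixed)     # 1025 amplitude bins for a 2048-point FFT
```

Generalized harmonic analysis (GHA) could be substituted for the FFT here, as the text notes; only the spectrum estimate changes, not the downstream hierarchical processing.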
[0053]
The first layer processing means 50 performs first space side noise separation processing 51, which separates the noise SR on the first space side from the other sounds (SC, SL) using the spectrum obtained from the output signal of the directional microphone (LF-Mic) 41 directed to the second space side main direction and the spectrum obtained from the output signal of the directional microphone (RR-Mic) 32 directed to the first space side sub direction, that is, between LF and RR. It also performs second space side noise separation processing 52, which separates the noise SL on the second space side from the other sounds (SC, SR) using the spectrum obtained from the output signal of the directional microphone (RF-Mic) 31 directed to the first space side main direction and the spectrum obtained from the output signal of the directional microphone (LL-Mic) 42 directed to the second space side sub direction, that is, between RF and LL.
[0054]
In the present embodiment, the first space side noise separation processing 51 and the second space side noise separation processing 52 performed by the first layer processing means 50 are, as an example, band selection processing (SAFIA) in which the magnitudes of the powers in the same frequency band are compared between two spectra and the larger power in each frequency band is attributed to the spectrum obtained by separation (see FIG. 3).
[0055]
The second layer processing means 60 performs first target sound separation processing 61, which separates the target sound SC using the spectrum of the noise SR on the first space side separated in the first space side noise separation processing 51 by the first layer processing means 50 and the spectrum of the sounds (SC, SR) other than the noise SL on the second space side separated in the second space side noise separation processing 52 by the first layer processing means 50, that is, between RR and RF. It also performs second target sound separation processing 62, which separates the target sound SC using the spectrum of the noise SL on the second space side separated in the second space side noise separation processing 52 by the first layer processing means 50 and the spectrum of the sounds (SC, SL) other than the noise SR on the first space side separated in the first space side noise separation processing 51 by the first layer processing means 50, that is, between LL and LF.
[0056]
In the present embodiment, the first target sound separation processing 61 by the second layer processing means 60 is, as an example, spectral subtraction: from the power of each frequency band of the spectrum of the sounds (SC, SR) other than the noise SL on the second space side separated in the second space side noise separation processing 52 by the first layer processing means 50, the value obtained by multiplying the power of the same frequency band of the spectrum of the noise SR on the first space side separated in the first space side noise separation processing 51 by a proportional coefficient is subtracted. Similarly, the second target sound separation processing 62 by the second layer processing means 60 is spectral subtraction: from the power of each frequency band of the spectrum of the sounds (SC, SL) other than the noise SR on the first space side separated in the first space side noise separation processing 51, the value obtained by multiplying the power of the same frequency band of the spectrum of the noise SL on the second space side separated in the second space side noise separation processing 52 by a proportional coefficient is subtracted.
[0057]
The third layer processing means 70 performs spectrum integration processing 71 using the spectrum of the target sound SC separated in the first target sound separation processing 61 by the second layer processing means 60 and the spectrum of the target sound SC separated in the second target sound separation processing 62 by the second layer processing means 60. In the present embodiment, the spectrum integration processing 71 by the third layer processing means 70 is, as an example, processing in which the magnitudes of the powers are compared for each frequency band and the smaller (inferior) power is attributed to the spectrum of the target sound SC obtained after processing.
[0058]
The frequency analysis means 33 and 34 of the first sound receiving device 30, the frequency analysis means 43 and 44 of the second sound receiving device 40, the first layer processing means 50, the second layer processing means 60, and the third layer processing means 70 are realized by a central processing unit (CPU) provided inside a computer (including an analyzer) and one or more programs defining the operation procedure of the CPU. Each of these means 33, 34, 43, 44, 50, 60, 70 may be realized by one computer or by separate computers; for example, the frequency analysis means 33, 34 of the first sound receiving device 30 and the frequency analysis means 43, 44 of the second sound receiving device 40 may be realized by one computer while the first layer processing means 50, the second layer processing means 60, and the third layer processing means 70 are realized by another, or the means 33, 34, 43, 44, 50, 60, 70 may be combined appropriately and realized by a plurality of computers.
[0059]
In such an embodiment, the sound source separation system 10 is used to separate the target
sound SC and the noises SR and SL as follows.
[0060]
First, the mixed sound of the target sound SC and the noises SR and SL is received by the directional microphone (RF-Mic) 31 and the directional microphone (RR-Mic) 32 of the first sound receiving device 30 and by the directional microphone (LF-Mic) 41 and the directional microphone (LL-Mic) 42 of the second sound receiving device 40. Frequency analysis is then performed on the sound reception signals of these directional microphones 31, 32, 41, and 42 by the frequency analysis means 33, 34, 43, and 44 to obtain the spectrum of each sound reception signal.
[0061]
At this time, in the spectrum obtained from the sound reception signal of the directional microphone (RF-Mic) 31, the spectrum of the target sound SC and the spectrum of the noise SR on the first space side are dominant, while the spectrum of the noise SL on the second space side is inferior. Hereinafter, a state in which the spectrum of SL is inferior to the spectra of SC and SR is written by attaching the superscript S to the inferior spectrum, as (SC, SR, SL<S>).
[0062]
Also, in the spectrum obtained from the sound reception signal of the directional microphone (RR-Mic) 32, the spectrum of the noise SR on the first space side is dominant, while the spectrum of the target sound SC and the spectrum of the noise SL on the second space side are inferior, so it can be written as (SC<S>, SR, SL<S>).
Furthermore, while the spectrum obtained from the sound reception signal of the directional
microphone (LF-Mic) 41 is superior to the spectrum of the target sound SC and the spectrum of
the noise SL on the second space side, Since the spectrum of the noise SR is inferior, it can be
expressed as (SC, SR <S>, SL).
[0064]
The spectrum obtained from the sound reception signal of the directional microphone (LL-Mic)
42 is dominated by the spectrum of the noise SL on the second space side, whereas the spectrum
of the target sound SC and the noise on the first space side Since it is inferior to the spectrum of
SR, it can be written as (SC <S>, SR <S>, SL).
[0065]
Next, the first layer processing means 50 performs the first space side noise separation processing 51 and the second space side noise separation processing 52 by band selection (SAFIA) as the first layer processing. The contents of the first space side noise separation processing 51 at this time will be described with reference to FIG. 3; the contents of the second space side noise separation processing 52 are the same.
[0066]
In FIG. 3, in the spectrum obtained from the output signal of the directional microphone (LF-Mic) 41 directed to the second space side main direction, let the power (amplitude value) of the frequency band f1 be α1 and the power of the frequency band f2 be α2. On the other hand, in the spectrum obtained from the output signal of the directional microphone (RR-Mic) 32 directed to the first space side sub direction, let the power of the frequency band f1 be β1 and the power of the frequency band f2 be β2.
[0067]
At this time, the power α1 of the frequency band f1 and the power β1 of the same frequency band f1 are compared in magnitude. Here, as illustrated, since α1 > β1, the larger power α1 is selected, and this power α1 is attributed to the directional microphone (LF-Mic) 41. That is, since the spectrum obtained from the sound reception signal of the directional microphone (LF-Mic) 41 is (SC, SR<S>, SL), the larger power α1 is attributed to the spectrum obtained by removing the inferior SR from this spectrum, that is, the (SC, SL) spectrum. The smaller power β1 is discarded without being used for processing, that is, without being assigned to any separated spectrum.
[0068]
Further, the power α2 of the frequency band f2 and the power β2 of the same frequency band f2 are compared in magnitude. Here, as illustrated, since β2 > α2, the larger power β2 is selected, and this power β2 is attributed to the directional microphone (RR-Mic) 32. That is, since the spectrum obtained from the sound reception signal of the directional microphone (RR-Mic) 32 is (SC<S>, SR, SL<S>), the larger power β2 is attributed to the spectrum obtained by removing the inferior SC and SL from this spectrum, that is, the SR spectrum. The smaller power α2 is discarded without being used for processing, that is, without being assigned to any separated spectrum.
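The band selection just described can be written compactly: for each band, compare the two powers, attribute the larger to the separated spectrum of the channel it came from, and discard the smaller. This is a sketch assuming NumPy; the α/β values are illustrative numbers chosen only to reproduce the α1 > β1 and β2 > α2 cases of FIG. 3.

```python
import numpy as np

def safia_select(spec_a, spec_b):
    """Band selection (SAFIA): per frequency band, keep the larger power
    in the separated spectrum of its own channel; discard the smaller."""
    a_wins = spec_a > spec_b
    sep_a = np.where(a_wins, spec_a, 0.0)   # bands attributed to channel A
    sep_b = np.where(~a_wins, spec_b, 0.0)  # bands attributed to channel B
    return sep_a, sep_b

# Bands (f1, f2) with alpha1 > beta1 and beta2 > alpha2, as in FIG. 3.
lf_powers = np.array([0.9, 0.2])  # LF-Mic spectrum (SC, SR<S>, SL): alpha1, alpha2
rr_powers = np.array([0.3, 0.8])  # RR-Mic spectrum (SC<S>, SR, SL<S>): beta1, beta2
sep_lf, sep_rr = safia_select(lf_powers, rr_powers)
# alpha1 is attributed to the (SC, SL) spectrum; beta2 to the SR spectrum.
```

The same routine applied between RF and LL yields the second space side separation, with the roles of the (SC, SR) and SL spectra exchanged.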
[0069]
Subsequently, the second layer processing means 60 performs the first and second target sound separation processings 61 and 62 by spectral subtraction (SS) as the second layer processing. In the first target sound separation processing 61, for each frequency band, from the power γ of the spectrum of the sounds (SC, SR) other than the second space side noise SL separated in the second space side noise separation processing 52 by the first layer processing means 50, the value (K × δ) obtained by multiplying the power δ of the spectrum of the noise SR on the first space side separated in the first space side noise separation processing 51 by a proportionality factor K is subtracted. That is, the calculated value γ − K × δ is the power of each frequency band of the spectrum of the target sound SC obtained after separation; thus, the target sound SC is separated in the form of removing the spectrum of SR from the spectrum of (SC, SR). In a frequency band in which the power γ of the (SC, SR) spectrum is smaller than the value K × δ, the calculated value may be replaced by a minimum value predetermined by some rule (a constant value for each frequency band may be used, or a value proportional to the power of each frequency band of the (SC, SR) spectrum may be used), or by zero (zero is usually unnatural, but in the present embodiment it is considered acceptable because band selection by SAFIA has already been performed in the first layer processing).
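The subtraction γ − K × δ with flooring of negative bands can be sketched as follows, assuming NumPy. K = 1.0 and the band powers are illustrative, and the zero floor is the option the text says is acceptable after SAFIA.

```python
import numpy as np

def spectral_subtract(gamma, delta, k=1.0, floor=0.0):
    """Second layer SS: the separated target sound power is gamma - k*delta,
    floored in bands where the mixture power is smaller than k*delta."""
    return np.maximum(gamma - k * delta, floor)

gamma = np.array([1.0, 0.5, 0.2])  # per-band power of the (SC, SR) spectrum
delta = np.array([0.3, 0.1, 0.4])  # per-band power of the separated SR spectrum
sc_est = spectral_subtract(gamma, delta)  # third band is floored to zero
```

A per-band minimum value or a value proportional to γ could be passed as `floor` instead of zero, matching the alternatives listed above.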
[0070]
The same applies to the second target sound separation process 62, and the target sound SC is
separated in such a manner that the spectrum of SL is removed from the spectrum of (SC, SL).
[0071]
Thereafter, the third layer processing means 70 performs the spectrum integration processing 71 by minimization. At this time, for each frequency band, the power of the spectrum of the target sound SC separated in the first target sound separation processing 61 by the second layer processing means 60 and the power of the spectrum of the target sound SC separated in the second target sound separation processing 62 by the second layer processing means 60 are compared in magnitude, and the smaller power is attributed to the spectrum of the target sound SC obtained after processing. In this way, the target sound SC and the noises SR and SL on the first space side and the second space side can be separated with high precision.
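The minimization used for the spectrum integration processing 71 amounts to a per-band minimum of the two target sound estimates. A sketch assuming NumPy; the two estimates below are illustrative values.

```python
import numpy as np

def integrate_min(sc_est_1, sc_est_2):
    """Third layer integration: per frequency band, attribute the smaller
    (more reliable, less noise-contaminated) power to the final SC spectrum."""
    return np.minimum(sc_est_1, sc_est_2)

est_rr_rf = np.array([0.7, 0.4, 0.0])  # SC from first target sound separation 61
est_ll_lf = np.array([0.6, 0.5, 0.1])  # SC from second target sound separation 62
sc_final = integrate_min(est_rr_rf, est_ll_lf)
```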
[0072]
According to this embodiment, the following effects can be obtained. That is, since the four directional microphones 31, 32, 41, and 42 are provided with the robot head 20 as the sound insulator interposed between them, the presence of the sound insulator and the formation of four directivities make each of the spectra obtained from the sound reception signals of the directional microphones 31, 32, 41, and 42 a mixed spectrum in which the spectrum of the target sound SC, the spectrum of the noise SR on the first space side, and the spectrum of the noise SL on the second space side have superiority and inferiority in different states: (SC, SR, SL<S>), (SC<S>, SR, SL<S>), (SC, SR<S>, SL), and (SC<S>, SR<S>, SL).
[0073]
Therefore, by performing hierarchical sound source separation processing by the first layer processing means 50 and the second layer processing means 60 using the four spectra obtained from the sound reception signals of these four directional microphones 31, 32, 41, and 42, it is possible to accurately separate the target sound SC even in a situation where the noises SR and SL exist on the first space side and the second space side with respect to the target sound SC.
[0074]
In other words, by utilizing the magnitude relationship of the sound pressure generated by the
robot head 20 acting as a barrier, it is possible to realize sound source separation that is not
dependent on the environment and that does not require strict estimation of the transfer
characteristics.
[0075]
Further, since the sound source separation system 10 performs the spectrum integration processing 71 by the third layer processing means 70 as the third layer processing, it is possible to obtain the target sound SC with higher separation accuracy than the target sound SC obtained by the first or second target sound separation processing 61, 62 of the second layer processing means 60 alone.
[0076]
Furthermore, since the sound source separation system 10 performs the first space side noise separation processing 51 and the second space side noise separation processing 52 by band selection (SAFIA) as the first layer processing by the first layer processing means 50, separation can be performed with comparatively simple processing.
[0077]
Then, since the sound source separation system 10 performs the first and second target sound separation processings 61 and 62 by spectral subtraction (SS) as the second layer processing by the second layer processing means 60, highly accurate separation of the target sound SC can be realized.
[0078]
Further, since the first sound receiving device 30 and the second sound receiving device 40 are configured using the four directional microphones 31, 32, 41, 42, highly accurate sound source separation can be realized with a simple configuration, and the equipment cost can be reduced.
[0079]
Since the four directional microphones 31, 32, 41, 42 are directed in directions coincident or nearly coincident with, and orthogonal or substantially orthogonal to, the arrival direction of the target sound SC, sound reception and sound source separation can be performed effectively even when the directions of the noises SR and SL are unknown.
[0080]
The following comparative experiments were conducted to confirm the effects of the present
invention.
[0081]
<Recording conditions> Simultaneous speech of three speakers was recorded. Recording was performed with a sampling frequency of 32 kHz and 16-bit quantization. Three loudspeakers SC, SR, and SL were placed at the positions shown in FIG. 4 as sound sources in place of human speakers. The distance d from the robot head 20, which is the sound insulator (in this experiment, only the shell of the robot head), to each of the loudspeakers SC, SR, and SL was d = 100 cm, and the loudspeakers SR and SL serving as noise sources were placed in directions at θ = 60 degrees from the front direction of the robot head 20 (the direction of the loudspeaker SC serving as the target sound source). Audio-Technica ATM15a microphones were used as the directional microphones, and a total of four directional microphones were arranged in the directions shown by the thick arrows in FIG. 4.
[0082]
A total of 100 sentences were selected as the target speech SC from 20 male speakers in the
newspaper reading speech corpus (ASJ-JNAS) of the Acoustical Society of Japan.
Similarly, voices of male speakers from JNAS who were not recognition targets were used as the disturbing voices (noises) SR and SL. The volume of the sound reproduced from each of the loudspeakers SC, SR, and SL was adjusted so that the speech lengths were approximately equal and the speech energies of the target sound and the disturbing sounds were equal. As evaluation sets, two sets were prepared in which the target voices were the same but the disturbing voices differed.
[0083]
<Recognition conditions> The recognition performance was evaluated for the eight processing methods (A) to (H) of the speech data shown in FIG. 5. Here, (A) is the case where the sound is received by one directional microphone directed at the target sound source SC and no subsequent hierarchical separation processing is performed. The frame length and FFT size for processing were 2048 points, and the frame shift was 512 points. A Hanning window was used as the analysis window. Then, 20,000-word continuous speech recognition was performed on the processed speech. The acoustic feature quantities used for recognition are shown below.
[0084]
(Feature amount calculation parameters) (1) Pre-emphasis: 1 − 0.97z<−1> (2) Frame length: 25 ms (3) Frame period: 10 ms (4) Frequency analysis: 12-channel equally mel-spaced filter bank (5) Feature amount (25 dimensions): MFCC + ΔMFCC + Δpower
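The pre-emphasis 1 − 0.97z<−1> in (1) is a first-order FIR high-pass filter, y[n] = x[n] − 0.97·x[n−1]. A sketch assuming NumPy; the input samples are illustrative.

```python
import numpy as np

def pre_emphasis(x, coef=0.97):
    """Apply the filter 1 - coef * z^-1: y[n] = x[n] - coef * x[n-1]."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                   # first sample passes through unchanged
    y[1:] = x[1:] - coef * x[:-1]
    return y

# A constant (DC) input is strongly attenuated after the first sample.
y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
```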
[0085]
As the acoustic model, training was performed using about 20,000 sentences of speech from about 100 male speakers of ASJ-JNAS. The language model was a 20,000-word-vocabulary trigram provided by CSRC, and the decoder was one developed by the present applicant.
[0086]
<Experimental Results> FIG. 6 shows the recognition results for the three-speaker condition. The vertical axis of the bar graph is the word recognition accuracy, obtained by subtracting the substitution error count S, the insertion error count I, and the dropout (deletion) error count D from the total utterance count T and dividing by T: {T − (S + I + D)} / T.
[0087]
According to FIG. 6, when only SAFIA is performed as the first layer processing (B), the word recognition accuracy is 0.7%, and it can be seen that sound source separation cannot be achieved by this alone.
[0088]
When the second layer processing is added, error reduction relative to (B) is seen both when SAFIA is performed (C) and when spectral subtraction (SS) is performed (F). This proves that hierarchical processing is effective. Also, comparing (C) and (F), it can be seen that in the second layer processing, spectral subtraction (SS) is more effective than SAFIA.
[0089]
By performing the spectrum integration processing as the third layer processing, the recognition performance could be further improved. Comparing (D) with (E), and (G) with (H), it can be seen that minimization is more effective than addition as the third layer processing. Addition is highly effective in emphasizing the target voice SC, but is less effective in removing the disturbing voices SR and SL. On the other hand, minimization selects the more reliable spectrum and can therefore be said to be effective in removing the disturbing voices SR and SL. When hierarchical processing is performed in the order of SAFIA, spectral subtraction (SS), and minimization (H), the recognition accuracy is the highest at 68.9%.
[0090]
Therefore, it is understood that the recognition accuracy can be improved by installing four directional microphones and performing hierarchical sound source separation processing utilizing the structure of the robot head 20, which is a sound insulator. In the simultaneous speech recognition experiment with three speakers, performing the three layers of separation processing in the order of SAFIA, spectral subtraction (SS), and minimization succeeded in reducing errors by 72% compared to a single remote microphone. From the above, the effects of the present invention were clearly demonstrated.
[0091]
The present invention is not limited to the above-described embodiment, and modifications and
the like within the scope where the object of the present invention can be achieved are included
in the present invention.
[0092]
That is, in the above embodiment, the directivities toward the first space side main direction, the first space side sub direction, the second space side main direction, and the second space side sub direction in the present invention were formed using the four directional microphones 31, 32, 41, and 42. However, these four directivities may instead be formed using a microphone array device configured from a plurality of nondirectional or directional microphones. Alternatively, the directivities toward the first space side sub direction and the second space side sub direction may be realized not by directional microphones but by nondirectional microphones utilizing the presence of the robot head 20, which is a sound insulator. In the latter case, the directivities toward the first space side main direction and the second space side main direction are realized by directional microphones as in the above embodiment, while the directivities toward the first space side sub direction and the second space side sub direction are realized by nondirectional microphones (a combination of a nondirectional microphone and the sound insulator).
[0093]
In the above embodiment, the first layer processing means 50 performs band selection (SAFIA) as the first layer processing, but the first layer processing in the present invention is not limited to SAFIA; any processing that can separate the noise SR on the first space side from the other sounds (SC, SL) and separate the noise SL on the second space side from the other sounds (SC, SR) is sufficient.
[0094]
Furthermore, in the above embodiment, the second layer processing means 60 performs spectral subtraction (SS) as the second layer processing, but the second layer processing in the present invention is not limited to SS; for example, band selection (SAFIA) may be used. However, as shown in the experimental results of FIG. 6 described above, SS is preferable to SAFIA from the viewpoint of improving separation accuracy. When band selection (SAFIA) is performed as the second layer processing, SAFIA is performed using the spectrum of the noise SR on the first space side and the spectrum of the sounds (SC, SR) other than the noise SL on the second space side, that is, between RR and RF; when the spectral power of (SC, SR) is larger than the spectral power of SR, the larger power is attributed to the spectrum of SC obtained by separation. Likewise, SAFIA is performed using the spectrum of the noise SL on the second space side and the spectrum of the sounds (SC, SL) other than the noise SR on the first space side, that is, between LL and LF; when the spectral power of (SC, SL) is larger than the spectral power of SL, the larger power is attributed to the spectrum of SC obtained by separation.
[0095]
In the above embodiment, the sound source separation system 10 performs the spectrum integration processing 71 as the third layer processing by the third layer processing means 70, but the third layer processing may be omitted. However, as shown in the experimental results of FIG. 6 described above, it is preferable to perform the third layer processing from the viewpoint of improving the separation accuracy of the target sound SC.
[0096]
As described above, the sound source separation method and system according to the present
invention are suitable for use in, for example, hands-free speech recognition with microphones
installed on the left and right sides of the robot head.
[0097]
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1: Overall block diagram of the sound source separation system of one embodiment of the present invention.
FIG. 2: Detailed block diagram of part of the sound source separation system of the embodiment.
FIG. 3: Explanatory drawing of the band selection processing (SAFIA) performed as the first layer processing by the sound source separation system of the embodiment.
FIG. 4: Diagram showing the recording environment at the time of the experiment.
FIG. 5: Diagram showing the contents of the eight processing methods (A) to (H) of the speech data used in the experiment.
FIG. 6: Diagram showing the experimental results.
Explanation of sign
[0098]
DESCRIPTION OF SYMBOLS 10: sound source separation system; 20: robot head which is a sound insulator; 30: first sound receiving device; 31, 32, 41, 42: directional microphones; 40: second sound receiving device; 50: first layer processing means; 51: first space side noise separation processing; 52: second space side noise separation processing; 60: second layer processing means; 61: first target sound separation processing; 62: second target sound separation processing; 70: third layer processing means; 71: spectrum integration processing; SC: target sound; SR: first space side noise; SL: second space side noise