JP2016127459
Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016127459
Abstract: [Problem] To suppress the collection of background noise components even when strong background noise exists around the sound source of a target sound. [Solution] The present invention relates to a sound collecting device comprising: directivity forming means that forms directivity in the direction of a target area with respect to the output of a microphone array; target area sound extraction means that suppresses non-target area sound components in the output of the directivity forming means to extract the target area sound; amplitude spectrum ratio calculation means that calculates an amplitude spectrum ratio for each frequency from the outputs and adds the ratios; coherence calculation means that calculates the coherence for each frequency from the output of the directivity forming means and adds them; an area sound determination section that determines the presence or absence of the target area sound using the amplitude spectrum ratio addition value and the coherence addition value; and output means that outputs the target area sound when it is determined to be present and does not output it otherwise. [Selected figure] Figure 1
Sound collecting device, program and method
[0001]
The present invention relates to a sound collection device and program, and can be applied to,
for example, a sound collection device and program that emphasizes sound in a specific area and
suppresses sound in other areas.
[0002]
03-05-2019
1
Conventionally, a beamformer (hereinafter referred to as BF) is known as a technology for separating and collecting only the sound arriving from a specific direction (hereinafter also referred to as the "target direction") in an environment where a plurality of sound sources exist (see Non-Patent Document 1).
BF is a technology that forms directivity by exploiting the time difference between the signals arriving at each microphone.
[0003]
Conventional BFs can be roughly divided into two types: additive and subtractive. In particular, the subtractive BF has the advantage that directivity can be formed with fewer microphones than the additive BF. An apparatus applying the conventional subtractive BF is described in Patent Document 1.
[0004]
Hereinafter, a configuration example of a conventional subtractive BF will be described.
[0005]
FIG. 12 is an explanatory view showing a configuration example of a sound collection device PS
to which a conventional subtraction type BF is applied.
[0006]
A sound collection device PS shown in FIG. 12 is for extracting a target sound (a sound in a target
direction) from the output of a microphone array MA configured using two microphones M1 and
M2.
[0007]
In FIG. 12, the sound signals captured by the microphones M1 and M2 are shown as x 1 (t) and x
2 (t), respectively.
Further, the sound collection device PS shown in FIG. 12 has a delay device DEL and a subtracter
SUB.
[0008]
The delay unit DEL calculates the time difference τ_L between the signals x1(t) and x2(t) arriving at the microphones M1 and M2, and adds a delay so that the phases of the target sound coincide.
Hereinafter, the signal obtained by delaying x1(t) by the time difference τ_L is denoted x1(t − τ_L).
[0009]
The delay unit DEL calculates the time difference τ_L according to the following equation (1), where d represents the distance between the microphones M1 and M2, c represents the speed of sound, and θ_L represents the angle of the target direction measured from the perpendicular to the straight line connecting the microphones M1 and M2. τ_L = (d sin θ_L) / c (1)
[0010]
Here, when the blind spot is on the microphone M1 side with respect to the midpoint of the microphones M1 and M2, delay processing is applied to the input signal x1(t) of the microphone M1, and the subtractor SUB subtracts x1(t − τ_L) from x2(t) according to the following equation (2). α(t) = x2(t) − x1(t − τ_L) (2)
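As an illustration, the delay-and-subtract operation of equations (1) and (2) can be sketched in Python as follows. The function name, the integer-sample rounding of the delay, and the default speed of sound are assumptions made for this sketch; practical implementations use fractional-delay filters.

```python
import numpy as np

def delay_and_subtract(x1, x2, d, theta_l, fs, c=343.0):
    """Subtractive BF sketch following equations (1) and (2).

    x1, x2  -- signals captured by microphones M1 and M2
    d       -- distance between M1 and M2 in metres
    theta_l -- target direction, measured from the perpendicular to
               the line connecting M1 and M2 (radians)
    fs      -- sampling rate in Hz
    c       -- speed of sound in m/s
    """
    # Equation (1): time difference of the target sound between the mics.
    tau_l = d * np.sin(theta_l) / c
    # Delay x1 by tau_l, rounded to whole samples for this sketch.
    n = int(round(tau_l * fs))
    x1_delayed = np.concatenate([np.zeros(n), x1[:len(x1) - n]]) if n > 0 else x1
    # Equation (2): the subtraction places a null (blind spot) in the
    # direction from which the two signals arrive in phase.
    return x2 - x1_delayed
```

For θ_L = 0 the delay vanishes and the output is simply x2(t) − x1(t), which corresponds to the bidirectional (figure-eight) case described below.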
[0011]
The subtractor SUB can also perform the subtraction in the frequency domain, in which case the above equation (2) can be expressed as the following equation (3). A(ω) = X2(ω) − e^(−jωτ_L) X1(ω) (3)
[0012]
Here, when θ_L = ±π/2, the directivity formed by the microphone array MA is a cardioid unidirectional pattern as shown in FIG. 13A. On the other hand, when θ_L = 0 or π, the directivity formed by the microphone array MA is a figure-eight bidirectional pattern as shown in FIG. 13B.
Hereinafter, a filter that forms a unidirectional pattern from the input signals is referred to as a unidirectional filter, and a filter that forms a bidirectional pattern is referred to as a bidirectional filter.
In addition, by applying spectral subtraction (hereinafter also simply referred to as "SS") in the subtractor SUB, strong directivity can be formed toward the blind spot of the bidirectional pattern.
[0013]
When directivity is formed by SS, the subtractor SUB can perform the subtraction using the following equation (4). Although the input signal X1 of the microphone M1 is used in equation (4), the same effect is obtained with the input signal X2 of the microphone M2. In equation (4), β is a coefficient for adjusting the intensity of the SS. When the result of equation (4) becomes negative, the subtractor SUB may perform flooring processing, replacing the negative value with zero or with a reduced version of the original value. By extracting, through SS, the sounds present outside the target-area direction and subtracting the amplitude spectrum of those sounds from the amplitude spectrum of the input signal, the subtractor SUB can emphasize the target area sound. |Y(ω)| = |X1(ω)| − β|A(ω)| (4)
[0014]
When it is desired to pick up only the sound present in a specific area (hereinafter referred to as the "target area sound"), a conventional sound pickup apparatus using only the subtractive BF may also collect the sound of sources present around the target area (hereinafter referred to as "non-target area sound").
[0015]
Therefore, Patent Document 1 proposes a process (hereinafter referred to as "target area sound collection processing") that, as shown in FIG. 14, uses a plurality of microphone arrays to direct directivity toward the target area from different directions, crossing the directivities at the target area, and thereby collects the target area sound.
In this method, the ratio of the power of the target area sound contained in the BF output of each microphone array is first estimated and used as a correction coefficient.
[0016]
FIG. 14 shows an example of the prior art in which the target area sound is picked up using two microphone arrays MA1 and MA2. When the target area sound is picked up using the two microphone arrays MA1 and MA2, the correction coefficients of the target area sound power are calculated by, for example, the following equations (5) and (6), or the following equations (7) and (8).
[0017]
In the above equations (5) to (8), Y_1k(n) and Y_2k(n) are the amplitude spectra of the BF outputs of the microphone arrays MA1 and MA2, N is the total number of frequency bins, k is the frequency bin index, and α1(n) and α2(n) represent the power correction coefficients for the respective BF outputs. Further, in equations (5) to (8), mode represents a mode value and median represents a median value.
Thereafter, by correcting each BF output with its correction coefficient and performing SS, the non-target area sound present in the target direction can be extracted. Furthermore, the target area sound can be extracted by spectrally subtracting the extracted non-target area sound from each BF output. To extract the non-target area sound N1(n) present in the target direction as viewed from the microphone array MA1, the product of the BF output Y2(n) of the microphone array MA2 and the power correction coefficient α2(n) is spectrally subtracted from the BF output Y1(n) of the microphone array MA1, as in the following equation (9). Similarly, the non-target area sound N2(n) present in the target direction as viewed from the microphone array MA2 is extracted according to the following equation (10). N1(n) = Y1(n) − α2(n) Y2(n) (9) N2(n) = Y2(n) − α1(n) Y1(n) (10)
[0018]
Then, according to equations (11) and (12), the non-target area sound is spectrally subtracted from each BF output Y1(n) and Y2(n) to extract the target area sound collection signals Z1(n) and Z2(n). In the following equations (11) and (12), γ1(n) and γ2(n) are coefficients for adjusting the intensity of the SS. Z1(n) = Y1(n) − γ1(n) N1(n) (11) Z2(n) = Y2(n) − γ2(n) N2(n) (12)
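The two-array extraction of equations (9) to (12) can be sketched as follows. Since equations (5) to (8) are not reproduced in this text, the correction coefficients are assumed here to be the median over frequency bins of the ratio of the two BF amplitude spectra, and the flooring of negative values is likewise an illustrative choice rather than the patent's exact prescription.

```python
import numpy as np

def correction_coefficients(Y1, Y2, eps=1e-12):
    # Assumed reading of equations (5)-(8): the median over frequency
    # bins of the ratio between the BF amplitude spectra. alpha2 scales
    # Y2's target component to Y1's level, and vice versa for alpha1.
    alpha1 = np.median(Y2 / (Y1 + eps))
    alpha2 = np.median(Y1 / (Y2 + eps))
    return alpha1, alpha2

def extract_target_area_sound(Y1, Y2, gamma1=1.0, gamma2=1.0):
    alpha1, alpha2 = correction_coefficients(Y1, Y2)
    # Equations (9)-(10): non-target area sound seen from each array.
    N1 = np.maximum(Y1 - alpha2 * Y2, 0.0)
    N2 = np.maximum(Y2 - alpha1 * Y1, 0.0)
    # Equations (11)-(12): subtract the non-target components to obtain
    # the target area sound collection signals.
    Z1 = np.maximum(Y1 - gamma1 * N1, 0.0)
    Z2 = np.maximum(Y2 - gamma2 * N2, 0.0)
    return Z1, Z2
```

When the two BF spectra are proportional (both arrays see only the target area sound), the estimated non-target components N1 and N2 are zero and the BF outputs pass through essentially unchanged.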
[0019]
As described above, if the technology described in Patent Document 1 is used, the target area sound can be collected even when non-target area sound exists around the target area.
[0020]
JP 2014-72708 A
[0021]
Asano, "Sound Technology Series 16: Array Signal Processing of Sound: Localization, Tracking and Separation of Sound Sources", edited by the Acoustical Society of Japan, Corona Publishing, February 25, 2011
[0022]
However, even if the technique described in Patent Document 1 is used, when the background noise is strong (for example, when the target area is a crowded place such as an event hall, or a place where music or the like is playing nearby), noise that cannot be removed by the target area sound collection processing gives rise to unpleasant artifacts such as musical noise.
In the conventional sound pickup apparatus these abnormal sounds are masked to some extent by the target area sound, but when the target area sound is absent only the abnormal sounds are heard, which may make the listener uncomfortable.
[0023]
Therefore, there is a demand for a sound collection device, program and method that suppress the collection of background noise components even when strong background noise exists around the sound source of the target sound.
[0024]
A sound collecting apparatus according to a first aspect of the present invention comprises: (1) directivity forming means for forming directivity in the direction of a target area with respect to the output of a microphone array; (2) target area sound extraction means for extracting, from the output of the directivity forming means, non-target area sound present in the direction of the target area, and for suppressing the extracted non-target area sound component in the output of the directivity forming means to extract the target area sound; (3) amplitude spectrum ratio calculation means for calculating an amplitude spectrum from the output of the target area sound extraction means, calculating an amplitude spectrum ratio for each frequency using that amplitude spectrum and the amplitude spectrum of the input signal of the microphone array, and adding the amplitude spectrum ratios of the frequencies to calculate an amplitude spectrum ratio addition value; (4) coherence calculation means for calculating the coherence for each frequency from the output of the directivity forming means and adding the coherences of the frequencies to calculate a coherence addition value; (5) area sound determination means for determining the presence or absence of the target area sound using the coherence addition value calculated by the coherence calculation means and the amplitude spectrum ratio addition value calculated by the amplitude spectrum ratio calculation means; and (6) output means for outputting the target area sound extracted by the target area sound extraction means when the area sound determination means determines that the target area sound is present, and for not outputting it when the area sound determination means determines that the target area sound is absent.
[0025]
A sound collection program according to a second aspect of the present invention causes a computer to function as: (1) directivity forming means for forming directivity in the direction of a target area with respect to the output of a microphone array; (2) target area sound extraction means for extracting, from the output of the directivity forming means, non-target area sound present in the direction of the target area, and for suppressing the extracted non-target area sound component in the output of the directivity forming means to extract the target area sound; (3) amplitude spectrum ratio calculation means for calculating an amplitude spectrum from the output of the target area sound extraction means, calculating an amplitude spectrum ratio for each frequency using that amplitude spectrum and the amplitude spectrum of the input signal of the microphone array, and adding the amplitude spectrum ratios of the frequencies to calculate an amplitude spectrum ratio addition value; (4) coherence calculation means for calculating the coherence for each frequency from the output of the directivity forming means and adding the coherences of the frequencies to calculate a coherence addition value; (5) area sound determination means for determining the presence or absence of the target area sound using the amplitude spectrum ratio addition value calculated by the amplitude spectrum ratio calculation means and the coherence addition value calculated by the coherence calculation means; and (6) output means for outputting the target area sound extracted by the target area sound extraction means when the area sound determination means determines that the target area sound is present, and for not outputting it when the area sound determination means determines that the target area sound is absent.
[0026]
A third aspect of the present invention is a sound collection method performed by a sound collection device, wherein: (1) the sound collection device comprises directivity forming means, target area sound extraction means, amplitude spectrum ratio calculation means, coherence calculation means, area sound determination means, and output means; (2) the directivity forming means forms directivity in the direction of a target area with respect to the output of a microphone array; (3) the target area sound extraction means extracts, from the output of the directivity forming means, non-target area sound present in the direction of the target area, and suppresses the extracted non-target area sound component in the output of the directivity forming means to extract the target area sound; (4) the amplitude spectrum ratio calculation means calculates an amplitude spectrum from the output of the target area sound extraction means, calculates an amplitude spectrum ratio for each frequency using that amplitude spectrum and the amplitude spectrum of the input signal of the microphone array, and adds the amplitude spectrum ratios of the frequencies to calculate an amplitude spectrum ratio addition value; (5) the coherence calculation means calculates the coherence for each frequency from the output of the directivity forming means and adds the coherences of the frequencies to calculate a coherence addition value; (6) the area sound determination means determines the presence or absence of the target area sound using the coherence addition value calculated by the coherence calculation means and the amplitude spectrum ratio addition value calculated by the amplitude spectrum ratio calculation means; and (7) the output means outputs the target area sound extracted by the target area sound extraction means when the area sound determination means determines that the target area sound is present, and does not output it when the area sound determination means determines that the target area sound is absent.
[0027]
According to the present invention, even when strong background noise exists around the sound
source of the target sound, it is possible to suppress the collection of the background noise
component.
[0028]
FIG. 1 is a block diagram showing the functional configuration of the sound collection device according to the first embodiment.
FIG. 2 is an explanatory view showing an example of the positional relationship of the microphones constituting the microphone array according to the first embodiment.
FIG. 3 is an explanatory view showing the directional characteristics that the sound collection device according to the first embodiment forms using a microphone array.
FIG. 4 is an explanatory view showing an example of the positional relationship between the microphone arrays and the target area according to the first embodiment.
FIG. 5 is an explanatory view showing the change in the amplitude spectrum of each component in the sound collection device according to the first embodiment.
FIG. 6 is an explanatory view showing the time change of the amplitude spectrum ratio addition value calculated by the sound collection device according to the first embodiment (part 1: without reverberation).
FIG. 7 is an explanatory view showing the time change of the amplitude spectrum ratio addition value calculated by the sound collection device according to the first embodiment (part 2: with reverberation).
FIG. 8 is an explanatory view showing the time change of the coherence addition value calculated by the sound collection device according to the first embodiment (part 1: without reverberation).
FIG. 9 is an explanatory view showing the time change of the coherence addition value calculated by the sound collection device according to the first embodiment (part 2: with reverberation).
FIG. 10 is an explanatory view showing the rules (threshold update rules, etc.) used when the sound collection device according to the first embodiment performs target area sound section determination.
FIG. 11 is a block diagram showing the functional configuration of the sound collection device according to the second embodiment.
FIG. 12 is a view showing the directivity characteristics formed by a subtractive beamformer using two microphones in a conventional sound collection apparatus.
FIG. 13 is an explanatory view explaining an example of the directional characteristics formed by conventional directional filters.
FIG. 14 is an explanatory view showing a configuration example in which the directivities of the beamformers (BF) of two microphone arrays are aimed at the target area from different directions in a conventional sound collection apparatus.
[0029]
(A) First Embodiment A first embodiment of the sound collecting device, program and method according to the present invention will be described in detail below with reference to the drawings.
[0030]
(A-1) Configuration of First Embodiment FIG. 1 is a block diagram showing a functional
configuration of the sound collection device 100 of the first embodiment.
[0031]
The sound collection device 100 performs target area sound collection processing for collecting
a target area sound from a sound source of a target area using the two microphone arrays MA1
and MA2.
[0032]
The microphone arrays MA1 and MA2 are disposed at arbitrary positions in the space where the target area exists.
The microphone arrays MA may be placed anywhere relative to the target area as long as the directivities of the microphone arrays MA overlap only in the target area, as shown in FIG. 4, for example.
Each microphone array MA is composed of two or more microphones 21, and each microphone 21 picks up an acoustic signal.
In this embodiment, three microphones M1, M2, and M3 are arranged in each microphone array MA; that is, each microphone array MA constitutes a 3-ch microphone array.
[0033]
FIG. 2 is an explanatory view showing the positional relationship between the microphones M1,
M2, and M3 in each microphone array MA.
[0034]
As shown in FIG. 2, in each microphone array MA, the two microphones M1 and M2 are arranged on a line orthogonal to the direction of the target area, and the microphone M3 is arranged on a straight line that is orthogonal to the line connecting the microphones M1 and M2 and passes through one of them.
Here, the distance between the microphones M3 and M2 is the same as the distance between the microphones M1 and M2; that is, the three microphones M1, M2, and M3 are arranged at the vertices of a right-angled isosceles triangle.
[0035]
The sound collection device 100 includes data input units 1 (1-1, 1-2), directivity forming units 2 (2-1, 2-2), a delay correction unit 3, a space coordinate data storage unit 4, a power correction coefficient calculation unit 5, a target area sound extraction unit 6, an amplitude spectrum calculation unit 7, a coherence calculation unit 8, and an area sound determination unit 9. The detailed processing of each functional block constituting the sound collection device 100 will be described later.
[0036]
The sound collection device 100 may be configured entirely in hardware (for example, as a dedicated chip), or partly or wholly as software (a program). The sound collection device 100 may be configured, for example, by installing the sound collection program according to the embodiment in a computer having a processor and a memory.
[0037]
(A-2) Operation of the First Embodiment Next, the operation (the sound collecting method of the
embodiment) of the sound collecting device 100 of the first embodiment having the
configuration as described above will be described.
[0038]
The data input units 1-1 and 1-2 receive the analog acoustic signals captured by the microphone arrays MA1 and MA2, respectively, convert them into digital signals, and supply the digital signals to the directivity forming units 2-1 and 2-2.
[0039]
The directivity forming units 2-1 and 2-2 perform processing for forming the directivity of the microphone arrays MA1 and MA2, respectively (that is, forming directivity from the signals supplied from the microphone arrays MA1 and MA2).
[0040]
The directivity forming unit 2 first converts each input signal from the time domain to the frequency domain using the fast Fourier transform.
In this embodiment, each directivity forming unit 2 forms a bidirectional filter with the microphones M1 and M2, which are arranged on a line orthogonal to the direction of the target area, and forms, with the microphones M1 and M3 arranged on a line parallel to the target direction, a unidirectional filter that directs its blind spot in the target direction.
[0041]
Specifically, the directivity forming unit 2 sets θ_L = 0 and performs the operations of the above equations (1) and (3) on the outputs of the microphones M1 and M2 to form the bidirectional filter.
Further, the directivity forming unit 2 sets θ_L = −π/2 and performs the operations of the above equations (1) and (3) on the outputs of the microphones M1 and M3 to form the unidirectional filter.
[0042]
FIG. 3 shows the directivity characteristics formed at the output of the microphone array MA by the above-mentioned bidirectional filter and unidirectional filter.
In FIG. 3, the hatched area indicates the overlapping portion of the bidirectional filter and the unidirectional filter. As shown in FIG. 3, the bidirectional filter and a part of the unidirectional filter overlap, but performing SS makes it possible to eliminate this overlapping portion. Specifically, the directivity forming unit 2 can remove the overlapping portion by performing SS in accordance with the following equation (13), where A_BD represents the bidirectional amplitude spectrum, A_UD represents the unidirectional amplitude spectrum, and A_UD′ represents the amplitude spectrum in which the component common to A_UD and A_BD has been eliminated. The directivity forming unit 2 may perform flooring processing when A_UD′ becomes negative as a result of the SS of equation (13).
[0043]
Then, by spectrally subtracting these two directional components A_BD and A_UD′ from the input signal according to the following equation (14), the directivity forming unit 2 can obtain a signal Y (hereinafter also referred to as the "BF output") that has sharp directivity only toward the front of the microphone array MA, i.e., the target direction. In equation (14), X_DS represents the amplitude spectrum obtained by adding and averaging the input signals (the outputs of the microphones M1, M2, and M3), and β1 and β2 are coefficients for adjusting the intensity of the SS. Hereinafter, the BF output based on the output of the microphone array MA1 is denoted Y1, and the BF output based on the output of the microphone array MA2 is denoted Y2. Y = X_DS − β1 A_BD − β2 A_UD′ (14)
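The BF output of equations (13) and (14) can be sketched as follows. Equation (13) is not reproduced in this text, so the overlap removal is assumed here to be a plain spectral subtraction with flooring, and X_DS is taken as the simple average of the three input amplitude spectra; both are illustrative readings rather than the patent's exact formulas.

```python
import numpy as np

def bf_output(X1, X2, X3, A_BD, A_UD, beta1=1.0, beta2=1.0):
    """Sharp forward directivity from one 3-ch microphone array.

    X1, X2, X3 -- amplitude spectra of the three microphone inputs
    A_BD, A_UD -- amplitude spectra of the bidirectional and
                  unidirectional filter outputs
    """
    # Assumed form of equation (13): remove the component that the
    # unidirectional filter shares with the bidirectional one.
    A_UD_prime = np.maximum(A_UD - A_BD, 0.0)
    # Equation (14): subtract both directional components from the
    # added-and-averaged input spectrum X_DS, with flooring.
    X_DS = (X1 + X2 + X3) / 3.0
    return np.maximum(X_DS - beta1 * A_BD - beta2 * A_UD_prime, 0.0)
```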
[0044]
The directivity forming units 2-1 and 2-2 each perform the BF processing described above to form directivity in the direction of the target area for the microphone arrays MA1 and MA2. Because this BF processing forms the directivity of each microphone array MA only toward the front, the influence of reverberation arriving from the direction opposite to the target area as viewed from the microphone array MA can be reduced. In addition, by performing this BF processing, each directivity forming unit 2 suppresses in advance the non-target area sound located behind each microphone array, which improves the SN ratio of the target area sound collection processing.
[0045]
The space coordinate data storage unit 4 holds the position information of all target areas (position information on the extent of each target area) and the position information of each microphone array MA (position information of each microphone 21 constituting each microphone array MA).
The specific format and units of the position information stored in the space coordinate data storage unit 4 are not limited, as long as the relative positional relationship between the target area and each microphone array MA can be recognized.
[0046]
The delay correction unit 3 calculates and corrects a delay generated due to a difference in
distance between the target area and each microphone array MA.
[0047]
The delay correction unit 3 first obtains the position of the target area and the position of each microphone array MA from the position information held in the space coordinate data storage unit 4, and calculates the difference in the arrival time of the target area sound at each microphone array MA.
Next, taking the microphone array MA arranged farthest from the target area as the reference, the delay correction unit 3 adds delays so that the target area sound reaches all the microphone arrays MA simultaneously. Specifically, the delay correction unit 3 adds a delay to one of Y1 and Y2 so that their phases coincide.
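The delay calculation above can be sketched from the stored coordinates as follows. The function name, coordinate handling, and the default speed of sound are illustrative assumptions.

```python
import numpy as np

def alignment_delays(area_pos, array_positions, c=343.0):
    """Extra delay, per microphone array, that aligns all BF outputs
    with the array farthest from the target area."""
    dists = np.array([np.linalg.norm(np.asarray(p) - np.asarray(area_pos))
                      for p in array_positions])
    arrival = dists / c               # propagation time to each array
    return arrival.max() - arrival    # delay to add to each BF output
```

The array farthest from the target area receives zero additional delay; nearer arrays are delayed so that the target area sound arrives in phase in all BF outputs.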
[0048]
The power correction coefficient calculation unit 5 calculates correction coefficients for bringing the power of the target area sound component contained in each BF output (Y1, Y2) to the same level. Specifically, the power correction coefficient calculation unit 5 calculates the correction coefficients according to the above equations (5) and (6) or the above equations (7) and (8).
[0049]
The target area sound extraction unit 6 corrects the BF outputs Y1 and Y2 with the correction coefficients calculated by the power correction coefficient calculation unit 5. Specifically, the target area sound extraction unit 6 applies the above equations (9) and (10) to obtain the corrected non-target area sounds N1 and N2.
[0050]
In addition, using N1 and N2 obtained with the correction coefficients, the target area sound extraction unit 6 spectrally subtracts the non-target area sound (noise) to obtain the target area sound collection signals Z1 and Z2 (signals in which the target area sound has been picked up). Specifically, the target area sound extraction unit 6 obtains Z1 and Z2 according to the above equations (11) and (12).
[0051]
Next, the processing outline of the amplitude spectrum calculation unit 7, the coherence
calculation unit 8 and the area sound determination unit 9 will be described.
[0052]
The area sound determination unit 9 distinguishes sections in which the target area sound is present (hereinafter referred to as "target area sound sections") from sections in which it is absent (hereinafter referred to as "non-target area sound sections"), and suppresses the generation of abnormal sound by not outputting the area-collected sound in the non-target area sound sections.
In this embodiment, it is assumed that noise (non-target area sound) is always present. To determine whether the target area sound exists, the area sound determination unit 9 uses two feature quantities: the amplitude spectrum ratio between the input signal and the output after the area collection processing (hereinafter referred to as the "area sound output"), i.e., the ratio of area sound output to input signal, and the coherence between the BF outputs.
[0053]
FIG. 5 is an explanatory view showing a change in the amplitude spectrum of the target area
sound and the non-target area sound in the area sound collection process.
[0054]
When a sound source is present in the target area, the target area sound is contained in common in the input signal X1 and the area sound output Z1, so the amplitude spectrum ratio of the target area sound component takes a value close to one.
On the other hand, since the non-target area sound component is suppressed in the area sound output, its amplitude spectrum ratio takes a small value. The area sound collection processing also performs several SS operations on the other background noise components, so even without a dedicated noise suppression process beforehand, these components are suppressed to some extent and their amplitude spectrum ratios also take small values. Conversely, when the target area sound is absent, the area sound output contains only weak residual noise compared with the input signal, so the amplitude spectrum ratio takes a small value over the entire frequency range. Owing to this property, when the amplitude spectrum ratios determined at the individual frequencies are added together, a large difference appears between the case where the target area sound is present and the case where it is absent.
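The feature described above can be sketched as follows. The function names and the plain threshold test are illustrative; as described later, the patent combines this value with a coherence feature and threshold update rules.

```python
import numpy as np

def spectrum_ratio_sum(X_mag, Z_mag, eps=1e-12):
    """Amplitude spectrum ratio addition value: the per-frequency ratio
    of the area sound output Z to the input spectrum X, summed over all
    frequency bins. Ratios near 1 (target area sound present in both
    signals) make the sum large; suppressed bins contribute little."""
    return float(np.sum(Z_mag / (X_mag + eps)))

def target_area_sound_present(X_mag, Z_mag, threshold):
    """Sketch of the section decision based on this single feature."""
    return spectrum_ratio_sum(X_mag, Z_mag) >= threshold
```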
[0055]
FIG. 6 shows the time change of the amplitude spectrum ratio addition value when a target area sound and two non-target area sounds actually exist. The waveform W11 in FIG. 6 is the waveform of the input sound in which all sound sources are mixed, the waveform W12 is the waveform of the target area sound contained in the input sound, and the waveform W13 indicates the amplitude spectrum ratio addition value. As shown in FIG. 6, the amplitude spectrum ratio addition value is large in the sections in which the target area sound exists.
[0056]
FIG. 6 shows the amplitude spectrum ratio addition value in an environment with almost no reverberation; the time variation of the amplitude spectrum ratio addition value in a reverberant environment is as shown in FIG. 7.
[0057]
The waveform W21 in FIG. 7 is the waveform of the input sound in which all the sound sources are mixed. The waveform W22 in FIG. 7 is the waveform of the target area sound within the input sound. The waveform W23 in FIG. 7 indicates the amplitude spectrum ratio addition value. As shown in FIG. 7, under reverberation, reflected non-target area sounds may fall simultaneously within the directivity of each microphone array. In this state the non-target area sound is treated as the target area sound, and it remains in the area sound output. For this reason the amplitude spectrum ratio addition value becomes large even in non-target area sound sections, as shown in FIG. 7, and the threshold must therefore be set higher than in an environment without reverberation.
[0058]
When determining the presence or absence of the target area sound from the amplitude spectrum ratio addition value, it is desirable to measure the strength of the reverberation in advance for each area in order to set a suitable threshold. Therefore, in this embodiment, the coherence between the BF outputs is also used to determine the presence or absence of the target area sound. Coherence is a feature quantity that indicates the degree of relationship between two signals and takes a value between 0 and 1; the closer the value is to 1, the stronger the relationship between the two signals. When a sound source is present in the target area, the target area sound is included in common in each BF output, so the coherence of the target area sound component becomes large. Conversely, when the target area sound does not exist, the non-target area sounds included in the BF outputs differ from each other, so the coherence becomes small. Moreover, since the two microphone arrays MA1 and MA2 are separated, the background noise components in the BF outputs also differ, and the coherence is likewise small. Owing to this property, when the coherences obtained at each frequency are all added, a large difference appears between the case where the target area sound is present and the case where it is not.
[0059]
FIGS. 8 and 9 show the time variation of the coherence addition value when a target area sound and two non-target area sounds are actually present. FIG. 8 shows the temporal change of the coherence addition value in an environment with almost no reverberation, and FIG. 9 shows it under reverberation.
[0060]
Waveforms W31 and W41 in FIGS. 8 and 9 are waveforms of input sounds in which all the
sound sources are mixed. Also, waveforms W32 and W42 in FIGS. 8 and 9 are waveforms of
target area sounds in the input sound. Further, waveforms W33 and W43 in FIGS. 8 and 9
respectively indicate coherence addition values.
[0061]
It can be seen from FIGS. 8 and 9 that the coherence addition value is large in the target area sound sections. Comparing FIGS. 6 to 9, the coherence addition value is inferior to the amplitude spectrum ratio addition value in detecting weak target area sound sections, but is less susceptible to reverberation.
[0062]
Using this property of the coherence addition value, the target area sound extraction unit 6 updates, under reverberation, the threshold of the amplitude spectrum ratio addition value (the threshold used to determine target area sound sections). The timing at which the target area sound extraction unit 6 updates the threshold is decided by first comparing the amplitude spectrum ratio addition value and the coherence addition value with their respective preset thresholds and then comparing the two determination results. If the two determination results agree, the target area sound extraction unit 6 outputs the area sound output as it is in a target area sound section, and in a non-target area sound section outputs silence, or the input sound with its gain reduced, instead of the area sound output data. If the two determinations differ, however, reverberation may be causing an erroneous determination.
[0063]
Therefore, when the amplitude spectrum ratio addition value indicates a target area sound section but the coherence addition value indicates a non-target area sound section, the target area sound extraction unit 6 makes the determination using the history of past determination results. In the example of this embodiment, the target area sound extraction unit 6 gives priority to the determination based on the amplitude spectrum ratio addition value while this disagreement has continued for fewer than a certain number of times; if it continues for the certain number of times or more, it is considered highly likely that the threshold of the amplitude spectrum ratio addition value is being exceeded in non-target area sound sections, so the threshold of the amplitude spectrum ratio addition value is raised. The target area sound extraction unit 6 then performs the determination based on the amplitude spectrum ratio addition value again.
[0064]
Similarly, when the amplitude spectrum ratio addition value indicates a non-target area sound section but the coherence addition value indicates a target area sound section, the target area sound extraction unit 6 makes the determination using the history of past determination results in the same manner. In the example of this embodiment, the target area sound extraction unit 6 gives priority to the determination based on the amplitude spectrum ratio addition value while this disagreement has continued for fewer than the predetermined number of times; if it continues for the predetermined number of times or more, the threshold of the amplitude spectrum ratio addition value is highly likely to be too high, so the threshold is lowered and the determination based on the amplitude spectrum ratio addition value is performed again.
[0065]
The target area sound extraction unit 6 may also obtain the correlation coefficient between the amplitude spectrum ratio addition value and the coherence addition value and use it to update the threshold of the amplitude spectrum ratio addition value. For example, in this embodiment the target area sound extraction unit 6 may take moving averages of the amplitude spectrum ratio addition value and the coherence addition value and then obtain the correlation coefficient of the two feature quantities. The correlation coefficient is high in target area sound sections regardless of the presence or absence of reverberation, and it is also high in non-target area sound sections without reverberation. In reverberant non-target area sound sections, however, the amplitude spectrum ratio addition value is affected by the reverberation and the correlation becomes low. It is therefore desirable that the target area sound extraction unit 6 raise the threshold of the amplitude spectrum ratio addition value when the correlation coefficient falls below a predetermined value, thereby setting a threshold suited to the reverberation.
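A sketch of this correlation-based update, assuming the two feature tracks are kept as arrays; the window length, correlation threshold, and step size are illustrative tuning parameters not given in the original.

```python
import numpy as np

def update_threshold_by_correlation(ratio_hist, coh_hist, threshold,
                                    win=8, corr_min=0.5, step=0.1):
    """Take moving averages of the two feature tracks, correlate them,
    and raise the amplitude-spectrum-ratio threshold when the Pearson
    correlation falls below corr_min. win/corr_min/step are assumptions."""
    kernel = np.ones(win) / win
    ma_ratio = np.convolve(ratio_hist, kernel, mode="valid")
    ma_coh = np.convolve(coh_hist, kernel, mode="valid")
    corr = np.corrcoef(ma_ratio, ma_coh)[0, 1]
    if corr < corr_min:
        threshold += step  # low correlation suggests reverberant non-target section
    return threshold, float(corr)
```

With strongly correlated tracks the threshold is left unchanged; an anticorrelated pair, as in a reverberant non-target section, triggers the raise.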
[0066]
Next, detailed processing of the amplitude spectrum ratio calculation unit 7 will be described.
[0067]
The amplitude spectrum ratio calculation unit 7 calculates the amplitude spectrum ratio from the input signals supplied from the data input units 1-1 and 1-2 and the area sound outputs Z 1 and Z 2 supplied from the target area sound extraction unit 6, and then adds the amplitude spectrum ratios over all frequencies to obtain the amplitude spectrum ratio addition value.
[0068]
Specifically, the amplitude spectrum ratio calculation unit 7 first obtains the input signals supplied from the data input units 1-1 and 1-2 and the area sound outputs Z 1 and Z 2 supplied from the target area sound extraction unit 6, and calculates the amplitude spectrum ratio. For example, the amplitude spectrum ratio calculation unit 7 calculates the amplitude spectrum ratio between the area sound output Z 1 or Z 2 and the corresponding input signal for each frequency using the following equations (15) and (16). Then, using the following equations (17) and (18), the amplitude spectrum ratios of all frequencies are added to obtain the amplitude spectrum ratio addition value.
R 1i = Z 1i / W x1i ... (15)
R 2i = Z 2i / W x2i ... (16)
U 1 = Σ i=m to n R 1i ... (17)
U 2 = Σ i=m to n R 2i ... (18)
Here, in equations (15) and (16), W x1 is the amplitude spectrum of the input signal of the microphone array MA1, and W x2 is the amplitude spectrum of the input signal of the microphone array MA2. Z 1 is the amplitude spectrum of the area sound output when the area sound collection process is performed with the microphone array MA1 as the main array, and Z 2 is the amplitude spectrum of the area sound output when the process is performed with the microphone array MA2 as the main array. U 1, obtained by equation (17), is the sum of the amplitude spectrum ratios R 1i of each frequency in the band from the lower frequency limit m to the upper frequency limit n, and U 2, obtained by equation (18), is the corresponding sum of the amplitude spectrum ratios R 2i. The amplitude spectrum ratio calculation unit 7 may limit the frequency band to be calculated; for example, the above calculation may be performed with the calculation target limited to 100 Hz to 6 kHz, where sound information is sufficiently included.
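The per-frequency ratio and band-limited sum described above can be sketched as follows; this is a minimal sketch assuming magnitude spectra held as NumPy arrays, and the `eps` guard against division by zero is an implementation assumption not in the original.

```python
import numpy as np

def amplitude_spectrum_ratio_sum(area_out_spec, input_spec, m, n, eps=1e-12):
    """Per-frequency amplitude spectrum ratio R_i = Z_i / W_i (Eqs. 15/16),
    summed over frequency bins m..n inclusive (Eqs. 17/18)."""
    ratio = np.abs(area_out_spec) / (np.abs(input_spec) + eps)
    return float(np.sum(ratio[m:n + 1]))
```

When the area sound output matches the input (a target area sound component), each ratio is close to one and the sum approaches the number of bins in the band; suppressed components drive the sum down, which is the property the determination exploits.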
[0069]
The amplitude spectrum ratio is calculated using equation (15) or (16) according to which microphone array MA is used as the main array in the area sound collection process, and the addition is likewise performed using equation (17) or (18). Specifically, equations (15) and (17) are used when the microphone array MA1 is the main array in the area sound collection process, and equations (16) and (18) are used when the microphone array MA2 is the main array.
[0070]
Next, the detailed processing of the coherence calculator 8 will be described.
[0071]
The coherence calculation unit 8 obtains the BF outputs of the microphone arrays MA1 and MA2 from the directivity forming units 2-1 and 2-2, calculates the coherence for each frequency, and then adds the coherences over all frequencies to obtain the coherence addition value. The coherence calculation unit 8 calculates the coherence for each frequency according to the following equation (19) and performs the addition according to equation (20). The coherence calculation unit 8 uses the phase of the input signal of each microphone array as the phase information of the BF outputs Y 1 and Y 2 required when calculating the coherence. At this time, the coherence calculation unit 8 may limit the frequency band; for example, it may obtain the coherence addition value from the band of 100 Hz to 6 kHz, where sound information is sufficiently included.
C i = |P Y1Y2 (i)|^2 / (P Y1Y1 (i) P Y2Y2 (i)) ... (19)
H = Σ i=m to n C i ... (20)
Here, C is the coherence, P Y1Y2 is the cross spectrum of the BF outputs Y 1 and Y 2 of the microphone arrays, P Y1Y1 and P Y2Y2 are the power spectra of Y 1 and Y 2 respectively, m and n are the lower and upper frequency limits respectively, and H is the value obtained by adding the coherence of each frequency. The Y 1 and Y 2 used to calculate the cross spectrum and the power spectra can also use past information; in this case Y 1 and Y 2 are updated by equations (21) and (22), respectively. Here, α is a coefficient that determines how much past information is used, and its value ranges from 0 to 1.
Y 1 (t) = αY 1 (t) + (1−α)Y 1 (t−1) ... (21)
Y 2 (t) = αY 2 (t) + (1−α)Y 2 (t−1) ... (22)
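The coherence computation can be sketched as below. This is an illustrative reading, not the patented implementation: the cross spectrum and power spectra are recursively smoothed across frames (a Welch-style assumption beyond the text, since a magnitude-squared coherence computed from a single snapshot is identically 1), with `alpha` playing the role of the α in equations (21) and (22).

```python
import numpy as np

class CoherenceAccumulator:
    """Per-frequency coherence between two BF output spectra, summed over
    bins m..n (Eqs. 19, 20). alpha and eps are assumed parameters."""
    def __init__(self, alpha=0.5, eps=1e-12):
        self.alpha, self.eps = alpha, eps
        self.p12 = self.p11 = self.p22 = None

    def add_value(self, Y1, Y2, m, n):
        # instantaneous cross spectrum and power spectra for this frame
        c12, c11, c22 = Y1 * np.conj(Y2), np.abs(Y1) ** 2, np.abs(Y2) ** 2
        if self.p12 is None:                 # first frame: initialize estimates
            self.p12, self.p11, self.p22 = c12, c11, c22
        else:                                # recursive smoothing across frames
            a = self.alpha
            self.p12 = a * c12 + (1 - a) * self.p12
            self.p11 = a * c11 + (1 - a) * self.p11
            self.p22 = a * c22 + (1 - a) * self.p22
        coh = np.abs(self.p12) ** 2 / (self.p11 * self.p22 + self.eps)  # Eq. (19)
        return float(np.sum(coh[m:n + 1]))   # Eq. (20)
```

Identical BF outputs drive the band sum toward the number of bins, while outputs that decorrelate from frame to frame drive it toward zero, matching the behaviour described for target and non-target sections.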
[0072]
Next, the detailed processing of the area sound determination unit 9 will be described.
[0073]
The area sound determination unit 9 compares the coherence addition value calculated by the coherence calculation unit 8 with a preset threshold to determine whether the target area sound is present. When it determines that the target area sound exists, the area sound determination unit 9 outputs the target area sound pickup signals (Z 1, Z 2) as they are; when it determines that the target area sound does not exist, it outputs silence data (for example, preset dummy data) instead of the target area sound pickup signals (Z 1, Z 2). The area sound determination unit 9 may output a signal obtained by reducing the gain of the input signal instead of the silence data. Furthermore, the area sound determination unit 9 may add processing (corresponding to a hangover function) that determines that the target area sound exists, regardless of the coherence addition value, for a few seconds after the coherence addition value has exceeded the threshold by a certain amount or more.
[0074]
The format of the signal output from the area sound determination unit 9 is not limited. For example, the target area sound pickup signals Z 1 and Z 2 may be output based on the outputs of all the microphone arrays MA, or only a part of the target area sound pickup signals (for example, one of Z 1 and Z 2) may be output.
[0075]
FIG. 10 is an explanatory view showing an example of the threshold update rule performed by
the area sound determination unit 9.
[0076]
First, the area sound determination unit 9 evaluates the amplitude spectrum ratio addition value and the coherence addition value against their respective preset thresholds. It then compares the two determination results; if they agree, it adopts that result and performs the output processing.
When the two determinations differ, with the amplitude spectrum ratio addition value indicating a target area sound section and the coherence addition value indicating a non-target area sound section, the determination of the amplitude spectrum ratio addition value is followed while the same disagreement has continued for fewer than a certain number of times. When it continues for the certain number of times or more, however, it is highly likely that the threshold of the amplitude spectrum ratio addition value is being exceeded in non-target area sound sections owing to the influence of reverberation, so the area sound determination unit 9 raises the threshold of the amplitude spectrum ratio addition value and then performs the determination based on the amplitude spectrum ratio addition value again. Conversely, when the amplitude spectrum ratio addition value indicates a non-target area sound section and the coherence addition value indicates a target area sound section, the determination of the amplitude spectrum ratio addition value is followed while the same disagreement has continued for fewer than a certain number of times; when it continues for the certain number of times or more, the threshold of the amplitude spectrum ratio addition value may be too high, so the area sound determination unit 9 lowers the threshold and then performs the determination based on the amplitude spectrum ratio addition value again.
The threshold of the amplitude spectrum ratio addition value may also be updated based on the correlation coefficient between the amplitude spectrum ratio addition value and the coherence addition value. In this case, the area sound determination unit 9 first obtains moving averages of the amplitude spectrum ratio addition value and the coherence addition value, and then obtains the correlation coefficient from the two moving averages. The correlation coefficient is high in target area sound sections regardless of the presence or absence of reverberation, and it is also high in non-target area sound sections without reverberation. In reverberant non-target area sound sections, however, the amplitude spectrum ratio addition value is affected by the reverberation and the correlation becomes low. Using this characteristic, when the correlation coefficient falls below a predetermined value, the area sound determination unit 9 judges the section to be a non-target area sound section and raises the threshold of the amplitude spectrum ratio addition value.
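The history-based part of the update rule in FIG. 10 can be sketched as a small state machine; `limit` (the "certain number of times") and `step` are assumed tuning values not given in the original.

```python
class ThresholdUpdater:
    """Track disagreements between the two section decisions. ratio_dec
    and coh_dec are booleans (True = target area sound section); a run
    of `limit` disagreements in the same direction raises or lowers the
    amplitude-spectrum-ratio threshold by `step`."""
    def __init__(self, threshold, limit=5, step=0.1):
        self.threshold, self.limit, self.step = threshold, limit, step
        self.raise_run = 0   # ratio says target, coherence says not
        self.lower_run = 0   # ratio says not, coherence says target

    def update(self, ratio_dec, coh_dec):
        if ratio_dec == coh_dec:
            self.raise_run = self.lower_run = 0
            return ratio_dec
        if ratio_dec and not coh_dec:
            self.raise_run += 1
            self.lower_run = 0
            if self.raise_run >= self.limit:
                self.threshold += self.step   # reverberation suspected
                self.raise_run = 0
        else:
            self.lower_run += 1
            self.raise_run = 0
            if self.lower_run >= self.limit:
                self.threshold -= self.step   # threshold likely too high
                self.lower_run = 0
        return ratio_dec  # below the limit, follow the ratio decision
```

Agreement resets both counters, so only a sustained one-sided disagreement, the signature of a mis-set threshold, actually moves it.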
[0077]
(A-3) Effects of the First Embodiment According to the first embodiment, the following effects
can be achieved.
[0078]
The sound collection device 100 according to the first embodiment determines the sections in which the target area sound is present and the sections in which it is not, and does not output the area pickup sound in the sections where it is absent, thereby suppressing the generation of abnormal sound.
Furthermore, in the sound collection device 100 according to the first embodiment, when the coherence addition value is evaluated against a preset threshold and it is determined that the target area sound does not exist, silence, or the input sound with its gain reduced, is output instead of the output data from which the target area sound was extracted (the area sound output). In this way, the sound collection device 100 according to the first embodiment determines the presence or absence of the target area sound and does not output the area sound output data when the target area sound is absent, so the generation of abnormal noise when there is no target area sound can be suppressed.
[0079]
Further, as described above, since the sound collection device 100 determines the presence or absence of the target area sound using both the amplitude spectrum ratio addition value and the coherence addition value, the presence or absence of the target area sound can be determined accurately regardless of the presence or absence of reverberation.
[0080]
(B) Second Embodiment A second embodiment of the sound collection device, program and method according to the present invention will be described in detail with reference to the drawings.
[0081]
(B-1) Configuration and Operation of Second Embodiment FIG. 11 is a block diagram showing a
functional configuration of the sound collection device 100A of the second embodiment.
[0082]
The sound collection device 100A of the second embodiment differs from the first embodiment in that two noise suppression units 10 (10-1 and 10-2) are added. The noise suppression units 10-1 and 10-2 are inserted between the data input units 1-1 and 1-2 and the directivity forming units 2-1 and 2-2, respectively. The outputs of the noise suppression units 10-1 and 10-2 are also supplied to the amplitude spectrum ratio calculation unit 7.
[0083]
The noise suppression units 10-1 and 10-2 each suppress noise (sounds other than the target area sound) in the signals supplied from the data input units 1-1 and 1-2 (the audio signals supplied from the microphones M of the microphone arrays MA), using the determination result of the area sound determination unit 9 (the detection result of the sections in which the target area sound exists), and supply the results to the directivity forming units 2-1 and 2-2 and the amplitude spectrum ratio calculation unit 7.
[0084]
The noise suppression unit 10 adjusts the noise suppression process using the result of the area sound determination unit 9, in the manner of voice activity detection (VAD). Usually, when noise suppression is performed in a sound collection device, the input signal is classified into speech sections and noise sections using VAD, and a filter is learned in the noise sections. When the non-target area sound in the input signal is speech, it is determined to be a speech section in normal VAD processing; in the determination of the area sound determination unit 9 of this embodiment, however, even speech is treated as noise if it is not the target area sound. Therefore, the noise suppression unit 10 uses the determination result of the area sound determination unit 9 to detect the target area sound sections (sections in which the target area sound exists) and the non-target area sound sections (sections in which the target area sound does not exist and only non-target area sound is present). For example, the noise suppression unit 10 can regard a sound section outside the target area sound sections as a non-target area sound section. The noise suppression unit 10 then treats the non-target area sound sections as noise sections and performs filter learning and adjustment of the filter gain by the same processing as existing VAD.
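A minimal sketch of this VAD-like use of the area decision, using simple spectral subtraction with a noise estimate learned only in non-target sections; `beta` (the learning rate) and `floor` (the spectral floor) are assumed parameters, and the actual device may use SS, Wiener, or MMSE-STSA filtering instead.

```python
import numpy as np

class AreaGuidedSuppressor:
    """Learn the noise magnitude spectrum (recursive average) only in
    frames judged to contain no target area sound, then subtract it from
    every frame. A spectral floor keeps the output non-negative."""
    def __init__(self, nbins, beta=0.9, floor=0.05):
        self.noise = np.zeros(nbins)
        self.beta, self.floor = beta, floor

    def process(self, mag, target_present):
        if not target_present:  # non-target section: update the noise estimate
            self.noise = self.beta * self.noise + (1 - self.beta) * mag
        out = mag - self.noise  # spectral subtraction
        return np.maximum(out, self.floor * mag)  # apply the spectral floor
```

Because the noise estimate is driven by the area decision rather than an ordinary VAD, non-target speech is also absorbed into the noise model, matching the behaviour described above.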
[0085]
For example, when it is determined that the target area sound does not exist, the noise suppression unit 10 can perform further filter learning. In addition, when the target area sound does not exist, the noise suppression unit 10 may apply a stronger filter gain than when it exists.
[0086]
The determination that the noise suppression unit 10 receives from the area sound determination unit 9 is the immediately preceding result in the time series (the processing result for time n−1), but it is also possible to receive the current processing result (for time n), perform the noise suppression process, and then perform the area sound collection process again. As the noise suppression method, various methods such as SS (spectral subtraction), the Wiener filter, and the Minimum Mean Square Error Short-Time Spectral Amplitude (MMSE-STSA) method can be used.
[0087]
(B-3) Effects of the Second Embodiment According to the second embodiment, the following
effects can be achieved in addition to the effects of the first embodiment.
[0088]
In the second embodiment, by providing the noise suppression unit 10, it is possible to collect
the target area sound with higher accuracy than in the first embodiment.
[0089]
Further, since the noise suppression unit 10 can perform the noise suppression process using the determination result (the non-target area sound sections) of the area sound determination unit 9, noise suppression better suited to target area sound collection than conventional noise suppression processing can be performed.
[0090]
(C) Other Embodiments The present invention is not limited to the above-described embodiments,
and may include modified embodiments as exemplified below.
[0091]
(C-1) In each of the above embodiments, the acoustic signals acquired by the microphones are processed in real time, but the acoustic signals acquired by the microphones may instead be stored in a storage medium and later read from the medium and processed to obtain the emphasis signal of the target sound and the target area sound. When a storage medium is used in this way, the place where the microphones are installed and the place where the target sound and the target area sound are extracted may be separated. Similarly, even when performing real-time processing, the place where the microphones are installed and the place where the extraction processing of the target sound and the target area sound is performed may be separated, with the signals supplied to the remote place by communication.
[0092]
(C-2) Although the microphone array MA used in the above-described sound collection device
has been described as a 3-ch microphone array, a 2-ch microphone array (a microphone array
including two microphones) may be applied.
In this case, the directivity forming process by the directivity forming unit can be replaced with
various existing filter processes.
[0093]
(C-3) In the above sound collection device, the configuration for collecting the target area sound from the outputs of two microphone arrays has been described, but the target area sound may also be collected from the outputs of three or more microphone arrays. In that case, the coherence calculation unit 8 may calculate the coherence addition value by matching the phases of the BF outputs of all the microphone arrays.
[0094]
100: sound collection device; 1, 1-1, 1-2: data input units; 2, 2-1, 2-2: directivity forming units; 3: delay correction unit; 4: spatial coordinate data storage unit; 5: power correction coefficient calculation unit; 6: target area sound extraction unit; 7: amplitude spectrum ratio calculation unit; 8: coherence calculation unit; 9: area sound determination unit; MA, MA1, MA2: microphone arrays; M, M1, M2, M3: microphones.