Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016042613
Abstract: Provided are a target speech segment detection apparatus, a target speech segment detection method, a target speech segment detection program, an audio signal processing device, and a server capable of improving the detection performance of a target speech segment even in a large noise environment by reducing the influence of noise when calculating the average coherence. From an input sound signal, a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction are formed, and a coherence coefficient calculation unit 14 calculates a coherence coefficient for each frequency. An average coherence calculation unit determines, for each frequency, the strength of the influence of the noise signal component included in the input sound signal based on the per-frequency coherence coefficients, and calculates the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal component is small. A target voice section determination unit 16 determines whether a section of the input sound signal belongs to a target voice section based on the average coherence. [Selected figure]
Figure 1
Target voice section detection apparatus, target voice section detection method, target voice
section detection program, voice signal processing apparatus and server
[0001]
The present invention relates to a target voice section detection apparatus, a target voice section detection method, a target voice section detection program, a voice signal processing apparatus, and a server, and is applicable, for example, to audio signal processing in voice-based communication devices and servers such as telephones and video-conference systems.
11-04-2019
[0002]
For example, mobile terminals (e.g., smartphones and mobile phones) and in-vehicle devices now commonly incorporate a voice recognition function or a voice call function that processes input speech. As a result, speech signal processing is being used in increasingly severe noise environments.
In order for the audio signal processing function to maintain its performance under a severe noise environment, it is preferable to extract the speech uttered by the user separately from the noise. To extract the voice accurately, a technology is needed that distinguishes and detects the section in which the speaker is speaking (the target voice section) from the section in which the speaker is not speaking and only background noise is present (the background noise section).
[0003]
As methods of distinguishing the target voice section from the background noise section, there is a method of detection based on the level difference between the voice signal and the noise signal, and a method using coherence as described in Patent Document 1.
[0004]
The technology described in Patent Document 1 calculates, for each frequency band, a coherence coefficient corresponding to the correlation between two signals obtained by forming two directivities having dead angles to the left and right of the microphones, and detects the target voice section based on the magnitude of the average coherence obtained by averaging the coherence coefficients over the entire frequency band.
Since the magnitude of the average coherence is a feature directly linked to the arrival direction of the target voice, the technology described in Patent Document 1 can be said to detect the target voice section based on the arrival direction of the target voice. Therefore, unlike methods based on the level difference of the voice signal, the target voice section can be detected even when the target voice is buried in loud noise and the difference between the target voice level and the noise level is difficult to discern.
[0005]
JP, 2013-061421, A
[0006]
However, as mentioned above, in recent years users have come to use mobile terminals and in-vehicle devices under increasingly severe noise environments, where large noise drives the SN ratio toward 0 or even negative values. In such conditions, even with the method described in Patent Document 1, the target voice is affected by the noise, the feature of the target voice fades, and the detection performance of the target voice section can be degraded.
[0007]
For example, when the SN ratio becomes negative, as in a car traveling at high speed, part of the coherence coefficients calculated for each frequency band are affected by the noise, and the feature of the target voice fades.
As a result, the average coherence obtained by averaging the coherence coefficients over all frequencies is also indirectly affected by the noise, the characteristic difference between the target voice section and the noise section becomes small, and the detection performance of the target voice section is degraded.
[0008]
Therefore, there is a need for a target voice segment detection device, a target voice segment
detection method, a target voice segment detection program, a voice signal processing device,
and a server that can accurately detect a target voice segment even in a large noise environment.
[0009]
The present invention has been made to solve the above-mentioned problems, and adopts the
following configuration.
[0010]
According to a first aspect of the present invention, there is provided a target voice section detection apparatus comprising: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction, each formed based on an input sound signal; (2) average coherence calculation means for determining, for each frequency, the strength of the influence of noise signal components included in the input sound signal based on the coherence coefficients for each frequency calculated by the coherence coefficient calculation means, and for calculating the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal components is small; and (3) target voice section determination means for determining whether a section of the input sound signal belongs to a target voice section based on the average coherence calculated by the average coherence calculation means.
[0011]
In a target speech segment detection method according to a second aspect of the present invention, (1) coherence coefficient calculation means calculates, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction, each formed based on an input sound signal; (2) average coherence calculation means determines, for each frequency, the strength of the influence of the noise signal component included in the input sound signal based on the coherence coefficients for each frequency calculated by the coherence coefficient calculation means, and calculates the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal component is small; and (3) target voice section determination means determines whether a section of the input sound signal belongs to a target voice section based on the average coherence calculated by the average coherence calculation means.
[0012]
According to a third aspect of the present invention, there is provided a target speech segment detection program causing a computer to function as: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction, each formed based on an input sound signal; (2) average coherence calculation means for determining, for each frequency, the strength of the influence of the noise signal component contained in the input sound signal based on the coherence coefficients for each frequency calculated by the coherence coefficient calculation means, and for calculating the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal component is small; and (3) target speech segment determination means for determining whether a section of the input sound signal belongs to a target speech segment based on the average coherence calculated by the average coherence calculation means.
[0013]
An audio signal processing apparatus according to a fourth aspect of the present invention is an audio signal processing apparatus for performing predetermined audio signal processing based on an input sound signal of ambient sound captured by at least two microphones, comprising: coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction, each formed based on the input sound signal; average coherence calculation means for determining, for each frequency, the strength of the influence of the noise signal component included in the input sound signal based on the coherence coefficients for each frequency calculated by the coherence coefficient calculation means, and for calculating the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal component is small; and a target speech segment determination unit that determines whether a section of the input sound signal belongs to a target speech segment based on the average coherence calculated by the average coherence calculation means.
[0014]
A server according to a fifth aspect of the present invention is a server that performs predetermined audio signal processing based on an input sound signal of ambient sound captured by at least two microphones, comprising: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a dead angle in a first predetermined direction and a second directivity signal having a dead angle in a second predetermined direction, each formed based on the input sound signal; (2) average coherence calculation means for determining, for each frequency, the strength of the influence of the noise signal component included in the input sound signal based on the coherence coefficients for each frequency calculated by the coherence coefficient calculation means, and for calculating the average coherence using the coherence coefficients in the frequency bands where the influence of the noise signal component is small; and (3) a target speech segment determination unit that determines whether a section of the input sound signal belongs to a target speech segment based on the average coherence calculated by the average coherence calculation means.
[0015]
According to the present invention, even under a large noise environment, the influence of noise
can be reduced to calculate the average coherence, and the detection performance of the target
speech segment can be improved.
[0016]
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the configuration of a target voice section detection device according to an embodiment.
FIG. 2 is an explanatory view briefly illustrating the schematic characteristics of a target voice and a noise signal under a large noise environment.
FIG. 3 is a block diagram showing the configuration of an average coherence calculation unit according to the embodiment.
FIG. 4 is a block diagram showing the configuration of a target voice section determination unit according to the embodiment.
FIG. 5 is a flowchart showing an operation example of the average coherence calculation process in the average coherence calculation unit 15 according to the embodiment.
[0017]
(A) Main Embodiment In the following, an embodiment of a target voice section detection apparatus, a target voice section detection method, a target voice section detection program, a voice signal processing apparatus, and a server according to the present invention will be described in detail with reference to the drawings.
[0018]
(A-1) Configuration of Embodiment The target voice section detection device according to this embodiment is applied to equipment on which a pair of microphones is mounted or externally attached.
For example, it can be widely applied to devices such as smartphones, tablet terminals, teleconference devices, and in-vehicle devices that have a pair of microphones mounted or externally attached and that perform audio signal processing on the audio collected by the pair of microphones.
[0019]
The “audio signal processing device” described in the claims is a concept covering any device that has an audio signal processing function using an input sound signal of ambient sound captured by at least two microphones; for example, it can be applied to portable terminals (e.g., smartphones, tablet terminals, and mobile phones), notebook personal computers, personal computers, game terminals, video conference devices, in-vehicle devices, and the like.
[0020]
In the following, the target voice section detection device according to this embodiment will be
described by exemplifying a case where a pair of microphones are mounted.
[0021]
FIG. 1 is a block diagram showing the configuration of a target speech segment detection device
1 according to this embodiment.
[0022]
The target voice section detection device 1 according to this embodiment may be constructed by connecting various hardware components, and some of its functions (apart from components such as the speaker, the microphones, the analog/digital converter (A/D converter), and the digital/analog converter (D/A converter)) may be realized by a program execution configuration such as a CPU, ROM, and RAM.
Regardless of which construction method is applied, the detailed functional configuration of the target speech segment detection device 1 is the configuration shown in FIG. 1.
When a program is applied, the program may be written into the memory of the target voice segment detection device 1 at the time of shipment, or may be installed by downloading.
For example, in the latter case, the program may be prepared as a smartphone application, and a user who needs it may download and install it via the Internet.
[0023]
As shown in FIG. 1, the target voice section detection device 1 according to this embodiment includes a microphone m_1, a microphone m_2, an FFT (Fast Fourier Transform) unit 11, a first directivity forming unit 12, a second directivity forming unit 13, a coherence coefficient calculation unit 14, an average coherence calculation unit 15, and a target voice section determination unit 16.
[0024]
The microphones m_1 and m_2 capture ambient sound and convert it into an electrical signal (analog signal).
The microphones m_1 and m_2 preferably have directivity so as to mainly capture sound arriving from the front.
The microphones m_1 and m_2 are connected to the FFT unit 11 via an A/D conversion unit (not shown), and the input audio signals captured by the microphones m_1 and m_2 are given to the FFT unit 11 as digital signals s1(n) and s2(n).
Each of the microphones m_1 and m_2 may be provided, for example, in the housing of a device on which the target voice section detection device 1 is mounted, or may be externally attached and connected to the device.
[0025]
The FFT unit 11 converts the digital signals s1(n) and s2(n) of the input voice signal captured by the microphones m_1 and m_2 from the time domain to the frequency domain to calculate the frequency domain signals X1(f, K) and X2(f, K).
Note that “n” is a parameter representing time, “f” is a parameter representing frequency, and “K” is a parameter representing the frame number of an analysis frame.
For example, based on the input signal s1(n), the FFT unit 11 defines a predetermined N samples as one analysis frame and performs fast Fourier transform processing for each analysis frame, thereby converting the input signal s1(n) into the frequency domain signal X1(f, K).
In the following, when the order of the frames poses no particular problem, the notation “K” may be omitted.
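The per-frame conversion performed by the FFT unit 11 can be sketched as follows. This is a minimal illustration assuming non-overlapping frames of N = 512 samples and NumPy's real FFT; the frame length and the absence of windowing are assumptions, not taken from the source.

```python
import numpy as np

def stft_frames(s, n_fft=512):
    """Split s(n) into analysis frames of n_fft samples and apply a fast
    Fourier transform to each, yielding X(f, K) as in the FFT unit 11
    description (rows: frame index K, columns: frequency bin f)."""
    n_frames = len(s) // n_fft
    X = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for K in range(n_frames):
        frame = s[K * n_fft:(K + 1) * n_fft]
        X[K] = np.fft.rfft(frame)  # real-input FFT: n_fft//2 + 1 bins
    return X
```
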
[0026]
The first directivity forming unit 12 and the second directivity forming unit 13 perform delay-and-subtract processing on the two frequency domain signals from the FFT unit 11 to form directivities having a dead angle in a predetermined direction. The first directivity forming unit 12 and the second directivity forming unit 13 provide the coherence coefficient calculation unit 14 with the signals B1(f, K) and B2(f, K) in which directivity having a dead angle in a predetermined direction is formed.
[0027]
Based on the two frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 11, the first directivity forming unit 12 calculates, according to equation (1), a signal B1(f) having strong directivity in, for example, the right direction.
[0028]
Similarly, based on the two frequency domain signals X1(f, K) and X2(f, K) from the FFT unit 11, the second directivity forming unit 13 calculates, according to equation (2), a signal B2(f) having strong directivity in, for example, the left direction relative to the front.
The signals B1(f) and B2(f) are represented in complex form.
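Equations (1) and (2) themselves are not reproduced in this translation. As an assumption, a common delay-and-subtract form that places a dead angle on one side for B1 and on the opposite side for B2 might look like the sketch below; the microphone spacing d, sampling rate fs, and sound speed c are hypothetical values, not taken from the source.

```python
import numpy as np

def delay_subtract_pair(X1, X2, n_fft=512, fs=16000, d=0.02, c=340.0):
    """Assumed delay-and-subtract directivity formation: B1 has a dead
    angle toward one predetermined direction, B2 toward the opposite one."""
    f = np.fft.rfftfreq(n_fft, d=1.0 / fs)   # frequency (Hz) of each bin
    tau = d / c                              # inter-microphone delay (s)
    phase = np.exp(-1j * 2 * np.pi * f * tau)
    B1 = X1 - X2 * phase   # null toward one side
    B2 = X2 - X1 * phase   # null toward the other side
    return B1, B2
```
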
[0029]
The coherence coefficient calculation unit 14 uses the signals B1(f, K) and B2(f, K) obtained by the first directivity forming unit 12 and the second directivity forming unit 13 to calculate the coherence coefficient cor(f, K) for each frequency according to equation (3). In equation (3), B2(f)* denotes the complex conjugate of B2(f). The coherence coefficient calculation unit 14 gives the obtained coherence coefficient cor(f, K) to the average coherence calculation unit 15.
[0030]
In this embodiment, the coherence coefficient calculation unit 14 does not itself calculate the average coherence AVE_COR using equation (4); however, since AVE_COR is referred to in the description below, the equation for calculating it is written down as equation (4). The coherence AVE_COR shown in equation (4) is the average value of the coherence coefficients cor(f) over all frequencies f1 to fm.
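Equations (3) and (4) are likewise not reproduced in the translation. The sketch below assumes one commonly used instantaneous coherence form (cross-power magnitude normalized by the mean power of the two directivity signals); the exact normalization in equation (3) may differ.

```python
import numpy as np

def coherence_coefficients(B1, B2, eps=1e-12):
    """Per-frequency coherence coefficient cor(f) between B1(f) and B2(f).
    Assumed form: |B1(f) * B2(f)^*| divided by the mean power of the two
    signals (eps guards against division by zero)."""
    num = np.abs(B1 * np.conj(B2))
    den = 0.5 * (np.abs(B1) ** 2 + np.abs(B2) ** 2) + eps
    return num / den

def average_coherence_all(cor):
    """AVE_COR of equation (4): plain average of cor(f) over all bins."""
    return float(np.mean(cor))
```
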
[0031]
The average coherence calculation unit 15 determines the magnitude of the influence of noise for each frequency based on the coherence coefficients cor(f, K) obtained by the coherence coefficient calculation unit 14, and calculates the average coherence AVE_COR(K) using only the coherence coefficients of the frequency bands in which the influence of noise is small.
[0032]
Here, the average coherence calculation unit 15 will be described.
For example, in a high-noise environment such as inside a running car, the target voice is buried in the noise. FIG. 2 is an explanatory view briefly illustrating the schematic characteristics of the target voice and the noise signal under a large noise environment. In FIG. 2, the horizontal axis represents frequency, and the vertical axis represents signal power. As shown in FIG. 2, the power of the noise signal is concentrated in the low band and is small in the high band, so the content of the noise signal component differs for each frequency band. Therefore, there are bands where the noise signal has a large influence on the speech signal and bands where it has a small influence.
[0033]
Accordingly, in the coherence coefficient for each frequency, (a) in frequency bands where the influence of the noise signal component is large, the characteristic of the target voice fades, so the value of the coherence coefficient does not fluctuate greatly regardless of the presence or absence of the target voice; (b) in frequency bands where the influence of the noise signal component is small, the characteristic of the target speech remains, so the coherence coefficient changes rapidly in sections where the target speech is present.
[0034]
Therefore, in this embodiment, the average coherence calculation unit 15 determines, for each frequency, whether or not the influence of the noise signal component is large based on the behavior of the coherence coefficient at that frequency.
The average coherence calculation unit 15 then rejects the coherence coefficients of the frequency bands where the influence of the noise signal component is large so that they do not contribute to the calculation of the average coherence, and controls the calculation so that only the coherence coefficients of the frequency bands where the influence of noise is small contribute. As a result, even under a large noise environment, the average coherence can be calculated with the influence of the noise signal component reduced, and the detection performance of the target voice section can be improved.
[0035]
FIG. 3 is a block diagram showing the configuration of the average coherence calculation unit 15 according to this embodiment. In FIG. 3, the average coherence calculation unit 15 according to this embodiment includes a long-term average value calculation unit 151, a noise influence degree determination unit 152, an addition unit 153, a counter unit 154, an average coherence calculation unit 155, and a per-frequency long-term average value storage unit 156.
[0036]
The long-term average value calculation unit 151 calculates the long-term average value long_cor(f, K) of the coherence coefficient for each frequency using the coherence coefficient cor(f, K) of each frequency obtained by the coherence coefficient calculation unit 14.
[0037]
The noise influence degree determination unit 152 determines the degree of influence of noise for each frequency by comparing the ratio of the long-term average value long_cor(f, K) of the coherence coefficient obtained by the long-term average value calculation unit 151 to the coherence coefficient cor(f, K) with a predetermined threshold Θ.
In this embodiment, the case where the noise influence degree determination unit 152 calculates the ratio between the long-term average value long_cor(f, K) and the coherence coefficient cor(f, K) is exemplified, but the determination is not limited to this; the difference between the long-term average value long_cor(f, K) and the coherence coefficient cor(f, K) may instead be compared with a threshold.
[0038]
This determination method rests on the observation that the magnitude of the effect of noise can be estimated as follows. As described above, in a large noise environment, in frequency bands where the influence of the noise signal component is large, the target voice signal component is buried in the noise signal component, the feature of the target voice signal is weakened, and the coherence coefficient shows no large change. On the other hand, in frequency bands where the influence of the noise signal component is small, the characteristic of the target audio signal remains, so the coherence coefficient changes rapidly under the influence of the target audio signal component.
[0039]
Therefore, the noise influence degree determination unit 152 compares, for each frequency, the ratio or difference between the long-term average value long_cor(f, K) of the coherence coefficient and the coherence coefficient cor(f, K) with the predetermined threshold Θ. If the ratio or difference is greater than or equal to the threshold Θ, it is determined that the contribution of the signal component derived from the target voice is large and the influence of the noise signal component is small; if the ratio or difference is smaller than the threshold Θ, it is determined that the contribution of the signal component derived from the target voice is small and the influence of the noise signal component is large.
[0040]
The addition unit 153 adds up only the coherence coefficients of the frequencies determined by the noise influence degree determination unit 152 to be little affected by the noise signal component.
The addition unit 153 also initializes the accumulated value of the coherence coefficients at each frame, so that the accumulated value of the coherence coefficients of the frequencies little affected by the noise signal component is obtained anew for each frame.
[0041]
The counter unit 154 counts the number of coherence coefficients added by the addition unit
153. That is, the counter unit 154 increments the counter value each time the addition unit 153
adds the coherence coefficient. Further, the counter unit 154 initializes a counter value for each
frame in order to count the number of coherence coefficients added for each frame.
[0042]
The average coherence calculation unit 155 calculates the average coherence AVE_COR(K) by dividing the accumulated value of the coherence coefficients obtained by the addition unit 153 by the counter value counted by the counter unit 154. The average coherence AVE_COR(K) obtained by the average coherence calculation unit 155 is supplied to the target voice section determination unit 16 as the output of the average coherence calculation unit 15.
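The interplay of the noise influence degree determination unit 152, the addition unit 153, the counter unit 154, and the average coherence calculation unit 155 can be sketched as below. The orientation of the ratio test follows equation (6) as translated (long_cor/cor ≥ Θ meaning the target-voice contribution is large), and the threshold value theta is hypothetical.

```python
def noise_aware_average_coherence(cor, long_cor, theta=0.3):
    """Keep only the bins judged to have a small noise influence, then
    average them to obtain AVE_COR (units 152-155, sketched)."""
    total = 0.0   # addition unit 153
    count = 0     # counter unit 154
    for f in range(len(cor)):
        ratio = long_cor[f] / max(cor[f], 1e-12)
        if ratio >= theta:          # noise influence judged small
            total += cor[f]
            count += 1
    # average coherence calculation unit 155: sum divided by count
    return total / count if count > 0 else 0.0
```
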
[0043]
The per-frequency long-term average value storage unit 156 stores the past long-term average value of the coherence coefficient of each frequency, which is used when the long-term average value calculation unit 151 calculates the long-term average value of the coherence coefficient for each frequency.
[0044]
The target voice segment determination unit 16 determines a target voice segment based on the
average coherence AVE_COR (K) obtained by the average coherence calculator 15.
[0045]
FIG. 4 is a block diagram showing the configuration of the target voice section determination unit
16 according to this embodiment.
In FIG. 4, the target voice section determination unit 16 includes an average coherence
acquisition unit 161, a threshold comparison and determination unit 162, and a determination
result output unit 163.
[0046]
The average coherence acquisition unit 161 acquires the average coherence AVE_COR (K)
obtained by the average coherence calculation unit 15.
[0047]
The threshold comparison/determination unit 162 compares the average coherence AVE_COR(K) acquired by the average coherence acquisition unit 161 with a target speech segment determination threshold. When the average coherence AVE_COR(K) is larger than the target speech segment determination threshold, the frame is determined to be a target voice section; otherwise, the frame is determined to be a background noise section.
[0048]
When the threshold comparison/determination unit 162 determines that the frame is a target speech section, the determination result output unit 163 substitutes “1” into the variable res storing the determination result and outputs it to the configuration unit in the subsequent stage; when the frame is determined to be a background noise section, “0” is substituted into the variable res and the variable res is output to the subsequent configuration unit.
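As a minimal sketch, the decision logic of units 162 and 163 reduces to a single comparison; the threshold value used here is hypothetical, not a value given in the source.

```python
def decide_target_voice(ave_cor, threshold=0.5):
    """Target voice section determination unit 16, sketched:
    returns res = 1 for a target voice section, 0 for background noise."""
    res = 1 if ave_cor > threshold else 0
    return res
```
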
[0049]
(A-2) Operation of Embodiment Next, the processing operation of the target speech segment detection method in the target speech segment detection device 1 according to the embodiment will be described in detail with reference to the drawings.
[0050]
The input sound signals (analog signals) captured by the pair of microphones m_1 and m_2 are converted into digital signals by an A/D converter (not shown), and the digital signals s1(n) and s2(n) are supplied to the FFT unit 11.
[0051]
In the FFT unit 11, the digital signals s1(n) and s2(n) are each converted from the time domain to the frequency domain, and the frequency domain signals X1(f, K) and X2(f, K) are provided to the first directivity forming unit 12 and the second directivity forming unit 13.
[0052]
The first directivity forming unit 12 and the second directivity forming unit 13 generate the signals B1(f, K) and B2(f, K) having a dead angle in a predetermined direction, and give the generated signals B1(f, K) and B2(f, K) to the coherence coefficient calculation unit 14.
[0053]
In the coherence coefficient calculation unit 14, the coherence coefficient cor(f, K) is calculated according to equation (3) on the basis of the signal B1(f, K) from the first directivity forming unit 12 and the signal B2(f, K) from the second directivity forming unit 13.
The obtained coherence coefficient cor(f, K) is given to the average coherence calculation unit 15.
[0054]
The average coherence calculation unit 15 determines the strength of the influence of noise for each frequency based on the coherence coefficient cor(f, K) of each frequency, and calculates the average coherence AVE_COR(K) using only the coherence coefficients of the bands where the influence of noise is small.
[0055]
FIG. 5 is a flowchart showing an operation example of the average coherence calculation process in the average coherence calculation unit 15 according to the embodiment.
[0056]
At S101, an average coherence AVE_COR (K) and a counter value (COUNT) indicating the number
of coherence coefficients at frequencies at which the influence of noise is small are initialized.
[0057]
Next, in order to determine the magnitude of the influence of noise for every frequency, the processing of S102 to S106 is looped over each frequency.
In S102, the loop starts (START) from a predetermined frequency bin f; when the processing for that frequency bin is completed, the value of the frequency bin f is incremented (denoted as “f++” in FIG. 5), and the loop is repeated until END.
[0058]
At S103, a long-term average value long_cor (f, K) of the coherence factor of the frequency is
calculated.
Here, equation (5) can be used to calculate the long-term average value long_cor (f, K) of the
coherence coefficient.
[0059]
Equation (5) is a relational expression that calculates the long-term average value long_cor(f, K) by weighted averaging of the past long-term average value long_cor(f, K−1) of the coherence coefficient of the frequency and the current coherence coefficient cor(f, K).
[0060]
Here, α is a value representing the weight given to the long-term average value long_cor(f, K−1) and the current coherence coefficient cor(f, K), and can take any value in 0 < α < 1.
For example, when α is a value close to “0”, a long-term average value long_cor(f, K) in which the influence of the past long-term average value long_cor(f, K−1) is increased can be calculated.
On the other hand, when α is a value close to “1”, a long-term average value long_cor(f, K) in which the influence of the coherence coefficient cor(f, K) of the current frame is increased can be calculated.
Note that α may be a fixed value or a variable value.
Furthermore, α may have the same value or different values for each frequency.
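From this description (a weighted average in which α near 1 emphasizes the current frame and α near 0 emphasizes the past value), equation (5) presumably has the exponential-smoothing form below; the exact expression is an assumption, since the equation itself is not reproduced in the translation.

```python
def update_long_term_average(long_cor_prev, cor_now, alpha=0.1):
    """Assumed form of equation (5):
    long_cor(f, K) = (1 - alpha) * long_cor(f, K-1) + alpha * cor(f, K).
    alpha near 0 weights the past long-term value; alpha near 1 weights
    the current coefficient."""
    return (1.0 - alpha) * long_cor_prev + alpha * cor_now
```

With α = 0.5 this reduces to the arithmetic mean of the past value and the current coefficient, matching the remark in paragraph [0062].
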
[0061]
Further, the long-term average value long_cor (f, K−1) of the past coherence coefficient in equation (5) may be calculated using coherence coefficients over an arbitrary frame length.
The arbitrary frame length may be different for each frequency.
[0062]
In this embodiment, the long-term average value long_cor (f, K) of the coherence coefficient is
calculated using the equation (5), but any other calculation method may be used.
For example, an arithmetic mean may be used as another calculation method.
In the case of the arithmetic mean, for example, setting α = 0.5 in equation (5) weights the influence of the past long-term average value long_cor (f, K−1) and the current coherence coefficient cor (f, K) equally, yielding the long-term average value long_cor (f, K) of the current frame as their arithmetic mean.
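The recursive update described in paragraphs [0059] to [0062] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and the exact form of equation (5) is not reproduced in this text, so the formula below is an assumption consistent with the α discussion (α near 0 emphasizes the past average, α near 1 the current frame, α = 0.5 gives the arithmetic mean).

```python
import numpy as np

def update_long_term_coherence(long_cor_prev, cor, alpha=0.5):
    """Recursively update the long-term average coherence per frequency bin.

    Assumed form of equation (5):
        long_cor(f, K) = (1 - alpha) * long_cor(f, K-1) + alpha * cor(f, K)

    alpha close to 0 increases the influence of the past long-term
    average; alpha close to 1 increases the influence of the current
    frame; alpha = 0.5 gives the arithmetic mean of the two terms.
    alpha may be a scalar (same for all frequencies) or a per-frequency
    array (different value for each frequency), as the text allows.
    """
    alpha = np.asarray(alpha)
    return (1.0 - alpha) * np.asarray(long_cor_prev) + alpha * np.asarray(cor)
```

As noted in paragraph [0060], α may be fixed or variable; a per-frequency array for `alpha` covers the case where each bin uses its own weight.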
[0063]
In S104, in order to detect an abrupt change in the value of the coherence coefficient, the ratio of the long-term average value long_cor (f, K) calculated in S103 to the coherence coefficient cor (f, K) of the current frame is calculated and compared with the threshold Θ.
Then, if the ratio is equal to or more than the threshold value Θ, it is determined that the effect
of the target voice is large, and the process proceeds to S105.
If the ratio is less than the threshold Θ, it is determined that the influence of noise is large and
the influence of the target voice is small, and the process proceeds to S106.
[0064]
long_cor (f, K) / cor (f, K) ≧ Θ … (6)

In equation (6), the threshold Θ can take any value; for example, it may be a fixed value or a variable value. Furthermore, the threshold Θ may be the same for all frequencies or may differ for each frequency.
[0065]
In S105, when the above ratio is equal to or greater than the threshold Θ and the effect of the target voice is therefore judged large (that is, the effect of noise is judged small), the coherence coefficient cor (f, K) of that frequency band is added, and the counter value that counts the number of added coherence coefficients is incremented (denoted "COUNT++" in FIG. 4).
[0066]
In S105, the coherence coefficient cor (f, K) of a frequency band whose ratio is equal to or greater than the threshold Θ, and which is therefore judged to be strongly affected by the target voice, is added to the average coherence AVE_COR (K). Conversely, the coherence coefficient of a frequency band whose ratio is less than the threshold Θ, and in which the influence of the target voice is judged small, is not added and does not contribute to the average coherence AVE_COR (K). The above-described processing of S102 to S106 is looped until all the frequencies have been processed.
[0067]
At S107, the average coherence AVE_COR (K) is obtained by dividing the accumulated sum AVE_COR (K) by the counter value (COUNT). The obtained average coherence AVE_COR (K) is then given to the target speech section judging unit 16.
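The per-frame processing of S101 to S107 described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the placeholder threshold value are hypothetical, and the return value when no frequency bin qualifies is a convention chosen here, not specified in the text.

```python
def average_coherence(cor, long_cor, theta=1.0):
    """Average the coherence coefficients over the frequency bins where
    the influence of noise is judged small (steps S101 to S107).

    cor      -- coherence coefficients cor(f, K) of the current frame
    long_cor -- long-term average values long_cor(f, K) from equation (5)
    theta    -- threshold of the ratio test of equation (6); placeholder value
    """
    ave_cor = 0.0  # S101: initialize the accumulator ...
    count = 0      # ... and the counter of contributing bins (COUNT)
    for f in range(len(cor)):            # S102: loop over frequency bins
        ratio = long_cor[f] / cor[f]     # S104: ratio test of equation (6)
        if ratio >= theta:               # target voice judged dominant
            ave_cor += cor[f]            # S105: add to AVE_COR(K) ...
            count += 1                   # ... and increment COUNT
        # S106: otherwise the bin is skipped and does not contribute
    if count == 0:
        return 0.0                       # assumed convention: no bin qualified
    return ave_cor / count               # S107: divide by the counter value
```

With `theta = 1.0`, a bin contributes only when the current coherence has not risen above its long-term average, matching the abrupt-change test of S104.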
[0068]
At S108, the analysis frame index K is incremented (denoted "K++" in FIG. 4), and the process is repeated for the next frame.
[0069]
The target voice section determining unit 16 compares the average coherence AVE_COR (K) calculated by the average coherence calculator 15 with a predetermined threshold. If the average coherence AVE_COR (K) is equal to or greater than the threshold, the section is determined to be a target voice section; if it is less than the threshold, the section is determined to be a background noise section.
Then, in the case of a target voice section, the target voice section determining unit 16 substitutes "1" for the variable res that stores the determination result; in the case of a background noise section, it substitutes "0" for res. The determination result is given to the components of the subsequent stage.
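The judgment in paragraph [0069] can be illustrated as follows; the function name and the threshold value are hypothetical, and the text does not specify a particular threshold.

```python
def judge_target_voice(ave_cor, threshold=0.5):
    """Target voice section judgment of paragraph [0069]: return the
    determination result res, i.e. 1 when AVE_COR(K) is equal to or
    greater than the threshold (target voice section), and 0 otherwise
    (background noise section). The threshold value is a placeholder.
    """
    return 1 if ave_cor >= threshold else 0
```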
[0070]
(A-3) Effects of the First Embodiment As described above, according to the first embodiment, even in a large noise environment, the frequency bands in which the influence of the noise signal component is small are selected, and the average coherence is calculated so that only the coherence coefficients of those frequency bands contribute. As a result, the target speech segment detection performance under large noise can be improved.
[0071]
(B) Other Embodiments In addition to the modified embodiments mentioned in the above description, the present invention can also be applied to the following modified embodiments.
[0072]
(B-1) As in the above-described embodiment, applying the present invention to a communication apparatus such as a television conference system or a mobile phone can be expected to improve the detection performance of the target voice section.
[0073]
Moreover, in the embodiment described above, the large noise environment inside a running vehicle such as a motor vehicle or a train was illustrated.
However, the intended large noise environment is any environment whose characteristic is that the power of the noise signal component is strong in the low band and tends to become smaller as the frequency becomes higher. Therefore, the same effect as the above-described embodiment can also be obtained, for example, when the device user is outdoors in a place where cars or trains travel, at an airfield, under a railway overpass, or the like.
[0074]
(B-2) In the above-described embodiment, the average coherence calculator determines the strength of the influence of the noise signal component based on the coherence coefficient for each frequency, but the determination may instead be made using modGI, a corrected form of the gradient index (GI: Gradient Index).
[0075]
(B-3) In the embodiment described above, the audio signal processing device alone executes all the processing, but processing such as the detection of a target voice section may be delegated to an external server for execution.
For example, when the audio signal processing device is a smartphone or the like, the system may be configured as a so-called cloud system: the input sound signal acquired by the audio signal processing device is transmitted to the external server, and the external server performs the detection of the target voice section.
The "server" in the claims includes a server that constitutes a cloud system as described above.
[0076]
(B-4) In the embodiment described above, an apparatus and a program that immediately process an input sound signal captured by a pair of microphones are shown, but the present invention is also applicable to the case where the signals captured by the pair of microphones are recorded on a recording medium and later reproduced.
[0077]
(B-5) In the embodiment described above, the case where the audio signal processing apparatus
has two microphones as a pair is exemplified, but the audio signal processing apparatus may
have three or more microphones.
Even when the audio signal processing apparatus has three or more microphones, the present invention can be applied by forming a plurality of directivity signals, each having a dead angle in a predetermined direction, based on the input sound signals captured by the microphones.
[0078]
DESCRIPTION OF SYMBOLS 1 ... Target voice section detection apparatus, M_1 and M_2 ... Microphone, 11 ... FFT (fast Fourier transform) unit, 12 ... First directivity forming unit, 13 ... Second directivity forming unit, 14 ... Coherence coefficient calculation unit, 15 ... Average coherence calculation unit, 16 ... Target voice section judgment unit