Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016038405
PROBLEM TO BE SOLVED: To provide a target sound segment detection device capable of appropriately separating a target sound segment, containing the target sound from an intended sound source, from other non-target sound segments, even when an unintended sound such as an impact sound is present. SOLUTION: In the target sound segment detection device according to the present invention, a coherence reflecting the correlation of each frequency component is calculated from a plurality of input sound signals obtained by different microphones, or from a plurality of input sound signals to which predetermined processing has been applied. A quantity called modGI, reflecting the number of times the slope direction of the coherence changes and the magnitude of those changes, is then calculated and compared with a threshold to determine whether the current input sound signal belongs to a target sound segment or a non-target sound segment. [Selected figure] Figure 2
Sound collection and sound emission device, target sound segment detection device and target
sound segment detection program
[0001]
The present invention relates to a sound collection/sound emission device, a target sound segment detection device, and a target sound segment detection program. It can be applied, for example, to a communication terminal or a voice recognition device that needs to separate, from sound captured by a microphone, a segment containing sound arriving from a sound source in a predetermined direction (hereinafter referred to as a target sound segment; in this segment, sounds other than the target sound, such as background noise, may be superimposed) from other segments (hereinafter referred to as non-target sound segments; the target sound is absent in these segments).
16-04-2019
1
[0002]
For example, when a call voice is input to a smartphone, or when a voice command is input to an audio device, a smartphone, or the like, it is preferable that the device receiving the voice extract only the voice arriving from the front, where the user's mouth is assumed to be, separately from voices arriving from other directions and from music, noise, and the like (hereinafter referred to as non-target sound).
[0003]
Recently, as shown in FIG. 9, configurations have come into use in which a pair of speakers 3L and 3R are arranged on and connected to both sides of a sound collecting device 2 having a communication function, such as a portable terminal (for example, a smartphone or a tablet terminal), and the resulting sound collecting and emitting apparatus 1 is used for calls with a remote party. With the same configuration, a usage is also considered in which, while sound (music) from music files recorded in the sound collecting device 2 or acquired from a music distribution site on the Internet is emitted from the speakers 3L and 3R on both sides, the device receives commands uttered by voice from the front of the microphone of the sound collecting device 2.
[0004]
However, when music or the like is emitted from the speakers 3L and 3R on both sides (hereinafter, the non-target sound due to such emitted sound is referred to as disturbance sound), and the target sound arriving from the front is extracted to convey the uttered content to the other party, or a voice command is recognized through voice recognition processing and the corresponding processing is executed, the sound emitted from the speakers 3L and 3R becomes an interference sound (noise) and greatly reduces speech quality and the speech recognition rate.
[0005]
Therefore, the present applicant has proposed the technique shown in FIG. 10 and described below in the specification and drawings of Japanese Patent Application No. 2013-199981.
[0006]
The sound source data (for example, music data) read from the sound source data storage units 21L and 21R are converted into analog signals by the corresponding D/A conversion units 22L and 22R and are then emitted from the speakers 3L and 3R.
This music, together with the voice uttered by the user toward the sound collection/sound emission device 10, is captured by both microphones 4L and 4R.
The input sound signals captured by the microphones 4L and 4R are converted into digital signals inputL and inputR by the corresponding A/D converters 31L and 31R, respectively, and supplied to the emitted non-target sound canceller processing unit 32.
The sound source data sigL and sigR are also supplied to the emitted non-target sound canceller processing unit 32.
[0007]
The emitted non-target sound canceller processing unit 32 is configured by adapting the stereo echo canceller technique of the acoustic echo canceller (see Non-Patent Document 1). In the unit 32, the non-target sound (disturbance sound) is removed by subtracting internally generated pseudo non-target sound signals from the input sound signals (digital signals) inputL and inputR, yielding the input sound signals ECoutL and ECoutR from which the emitted non-target sound has been removed.
[0008]
Combining sound source separation processing with the input sound signals ECoutL and ECoutR obtained in this manner has also been studied. For example, the voice switch processing described in Patent Document 1 can multiply the input sound signal of a non-target sound segment by an attenuation gain to greatly attenuate its amplitude and thereby further emphasize the target sound. In the technology described in Patent Document 1, the target sound segment is detected, and the voice switch processing is driven, based on the behavior that the coherence takes a large value when the input sound arrives from the front and a small value when it arrives from the side.
[0009]
JP 2013-126026 A; JP 2014-106337 A
[0010]
Kitawaki Nobuhiko, "Digital Voice and Audio Technology (Future Network Technology Series)", published by the Telecommunications Association, pp. 218-243, 1999
[0011]
As described above, since the disturbance sound is emitted from speakers provided in the vicinity of the microphones, the sound source data (for example, music data) may take a large amplitude and, depending on the instruments used (for example, drums) and the number of simultaneous performers, may contain many frequency components.
[0012]
For example, when the disturbing sound includes an impulsive sound such as a drum sound, the coherence may take a value equal to or higher than that in a target sound segment.
In such a case, a segment detected as a target sound segment based on the coherence may actually be a non-target sound segment mainly containing the disturbance sound.
That is, there is a possibility of misclassification between target sound segments and non-target sound segments.
[0013]
If a target sound segment or a non-target sound segment is erroneously detected, the accuracy of the various processes that use the detection result (speech recognition processing, speech enhancement processing (noise suppression processing), and the like) also decreases.
For example, sound quality may be degraded by excessive suppression processing in segments where the target sound and the interference sound overlap, or degraded because the suppression processing does not function in segments containing only the interference sound.
[0014]
Therefore, a sound collection/sound emission device, a target sound segment detection device, and a target sound segment detection program are desired that can properly separate the target sound segment, containing the target sound from the intended sound source, from the non-target sound segments not containing the target sound, even in situations where an unintended sound such as an impact sound is present.
[0015]
The target sound segment detection device according to the first aspect of the present invention comprises: (1) first feature quantity calculation means for calculating a first feature quantity from a plurality of input sound signals, obtained when at least two microphones capture ambient sound or obtained by executing predetermined processing on such signals, in which a target sound from a sound source in a predetermined direction and non-target sounds arriving from other directions are mixed; (2) second feature quantity calculation means for obtaining a second feature quantity that treats the obtained first feature quantity as a time-varying signal and reflects the number of times the slope direction of the signal waveform changes and the magnitude of those changes; and (3) target sound segment detection means for comparing the second feature quantity with a threshold and detecting whether the input sound signal at the processing time belongs to a target sound segment or a non-target sound segment.
[0016]
The target sound segment detection program according to the second aspect of the present invention causes a computer to function as: (1) first feature quantity calculation means for calculating a first feature quantity from a plurality of input sound signals, obtained when at least two microphones capture ambient sound or obtained by executing predetermined processing on such signals, in which a target sound from a sound source in a predetermined direction and non-target sounds arriving from other directions are mixed; (2) second feature quantity calculation means for obtaining a second feature quantity that treats the obtained first feature quantity as a time-varying signal and reflects the number of times the slope direction of the signal waveform changes and the magnitude of those changes; and (3) target sound segment detection means for comparing the second feature quantity with a threshold and detecting whether the input sound signal at the processing time belongs to a target sound segment or a non-target sound segment.
[0017]
The sound collection/sound emission device according to the third aspect of the present invention has a sound collection unit in which at least two microphones capture ambient sound, and a sound emission unit that emits sound from one or more speakers, and comprises: (1) an emitted non-target sound removal device that receives the sound signal emitted by the sound emission unit, generates a pseudo emitted non-target sound signal simulating the unintended sound that accompanies the emitted sound captured by each microphone, and removes the emitted non-target sound captured by each microphone by subtracting the pseudo signal from the input sound signal of that microphone; and (2) a target sound segment detection device, placed after the emitted non-target sound removal device, that detects whether the sound signal output from the removal device belongs to a target sound segment, in which the target sound and other non-target sounds are mixed, or to a non-target sound segment containing only non-target sound; wherein (3) the target sound segment detection device of the first aspect of the present invention is applied as this target sound segment detection device.
[0018]
According to the present invention, a sound collection/sound emission device, a target sound segment detection device, and a target sound segment detection program can be realized that appropriately separate the target sound segment, containing the target sound from the intended sound source, from the non-target sound segments not containing the target sound, even in the presence of an unintended sound such as an impact sound.
[0019]
A block diagram showing the configuration of the sound collection/sound emission device of the first embodiment.
A block diagram showing the detailed configuration of the target sound segment detection unit in the sound collection/sound emission device of the first embodiment.
A characteristic diagram showing the time change of the output signal from the emitted non-target sound canceller processing unit in the sound collection/sound emission device of the first embodiment, and of the coherence calculated from it.
A characteristic diagram showing the time change of the output signal from the emitted non-target sound canceller processing unit in the sound collection/sound emission device of the first embodiment, and of the modGI value of the coherence calculated from it.
A flowchart showing the operation of the threshold comparison unit within the target sound segment detection unit in the sound collection/sound emission device of the first embodiment.
A block diagram showing the detailed configuration of the target sound segment detection unit in the sound collection/sound emission device of the second embodiment.
A block diagram showing the detailed configuration of the target sound segment detection unit in the sound collection/sound emission device of the third embodiment.
An explanatory diagram of the method of determining the detection threshold used by the target sound segment detection unit of the sound collection/sound emission device of the third embodiment.
An explanatory diagram showing speakers connected to a conventional sound collection/sound emission device having microphones.
A block diagram showing the configuration of a previous proposal in which components emitted by the speakers and captured by the microphones are removed.
[0020]
(A) First Embodiment A first embodiment of a sound collection/sound emission device, a target sound segment detection device, and a target sound segment detection program according to the present invention will be described below with reference to the drawings.
[0021]
(A-1) Configuration of the First Embodiment In the sound collection/sound emission device of the first embodiment, a pair of microphones is either built in or externally attached, and a pair of speakers is likewise either built in or externally attached.
For example, in a sound collection/sound emission device built around a sound collecting device such as a smartphone or a tablet terminal, the pair of microphones is built in and the pair of speakers is externally attached. In the case of a sound collection/sound emission device corresponding to a speaker-integrated audio device, both the pair of microphones and the pair of speakers are built in. As described above, various connection forms of the pair of microphones and the pair of speakers are possible, and any of them may be applied.
[0022]
In the following description, it is assumed that the sound collection/sound emission device of the first embodiment is configured such that a pair of microphones is built in and a pair of speakers is externally attached, as shown in FIG. 9 described above. For the constituent elements already described with reference to FIG. 9 and FIG. 10, the same reference numerals as those used in FIG. 9 and FIG. 10 are used as they are.
[0023]
FIG. 1 is a block diagram showing the configuration of the sound collection/sound emission device 10 of the first embodiment.
[0024]
The sound collection/sound emission device 10 of the first embodiment may be constructed by connecting various hardware components, or its functions, excluding some components (for example, the speakers, microphones, analog/digital converters (A/D converters), and digital/analog converters (D/A converters)), may be realized by a program-execution configuration such as a CPU, ROM, and RAM.
Whichever construction method is applied, the detailed functional configuration of the sound collection/sound emission device 10 is as shown in FIG. 1. When a program is applied, the program may be written into the memory of the sound collection/sound emission device 10 at the time of shipment, or may be installed by downloading. In the latter case, for example, the program may be prepared as a smartphone application that users download and install via the Internet.
[0025]
In FIG. 1, the sound collection/sound emission device 10 of the first embodiment includes a sound emission unit 20 and a sound collection unit 30.
[0026]
The sound emission unit 20 has the same configuration as an existing sound emission unit.
It includes the sound source data storage units 21L and 21R for the L and R channels, the D/A conversion units 22L and 22R, and the speakers 3L and 3R.
[0027]
The sound collection unit 30, on the other hand, includes the microphones 4L and 4R for the L and R channels, the A/D conversion units 31L and 31R, the emitted non-target sound canceller processing unit 32, and the target sound segment detection unit 33 shown in FIG. 2. Here, the whole of the sound collection unit 30, having an input terminal for the sound source data described later, may be constructed as a sound source separation unit and made commercially available. Alternatively, the portion formed by the A/D conversion units 31L and 31R, the emitted non-target sound canceller processing unit 32, and the target sound segment detection unit 33, having an input terminal for the sound source data described later, may be constructed as a sound source separation unit and made commercially available. That is, the sound collection unit 30 of the sound collection/sound emission device 10 may in particular be constructed using such a sound source separation unit.
[0028]
The sound source data storage units 21L and 21R store sound source data (digital signals) sigL and sigR for the L and R channels, respectively, and read out and output the sound source data sigL and sigR under the control of a sound emission control unit (not shown). The sound source data sigL and sigR may be, for example, music data, or audio data such as electronic book read-aloud data. Each of the sound source data storage units 21L and 21R may be a recording medium access device loaded with a recording medium such as a CD-ROM, or may be configured by a storage unit of the device that stores sound source data acquired by communication from an external source such as a site on the Internet. Each of them may also correspond to an external device connected via, for example, a USB connector. Furthermore, although each of the sound source data storage units 21L and 21R is named a "storage unit", the concept also includes configurations that output received sound source data in real time, such as a digital audio broadcast receiver.
[0029]
The D/A conversion units 22L and 22R convert the sound source data sigL and sigR output from the corresponding sound source data storage units 21L and 21R into analog signals and give them to the corresponding speakers 3L and 3R.
[0030]
The speakers 3L and 3R emit (output as sound) the sound source signals supplied from the corresponding D/A conversion units 22L and 22R.
The sound or voice emitted from the speakers 3L and 3R is not intended to be captured by the microphones 4L and 4R, but from the viewpoint of the capture function of the microphones 4L and 4R it can become an interference sound.
[0031]
In the above, the music and sound emitted from the speakers 3L and 3R have been described as digital signals (sound source data), but a record player, an audio cassette tape recorder, an AM or FM radio receiver, or the like that outputs an acoustic or audio signal as an analog signal may take the place of the sound source data storage units 21L and 21R. In this case, the D/A conversion units 22L and 22R are omitted, and L/R-channel A/D conversion units are separately provided to convert the analog acoustic or audio signal into a digital signal, which is given to the emitted non-target sound canceller processing unit 32.
[0032]
Each of the microphones 4L and 4R captures ambient sound and converts it into an electrical signal (analog signal); the pair of microphones 4L and 4R yields a stereo signal. Each microphone has directivity such that it mainly captures sound arriving from the front of the sound collection/sound emission device 10, but it also captures the sound emitted from the speakers 3L and 3R arranged on both sides. The speakers 3L and 3R are preferably disposed on both sides of the pair of microphones 4L and 4R, but the arrangement is not limited to this.
[0033]
Each of the microphones 4L and 4R is attached, for example, inside a cylinder provided in the housing of the sound collection/sound emission device 10. A sound insulation member made of synthetic resin is provided on the inner surface of the cylinder so that, when the microphones 4L and 4R are attached, there is no path for sound to pass between the inside and the outside of the housing. This prevents, as much as possible, the microphones 4L and 4R from capturing noise generated inside the housing and sound that enters the housing from outside by reflection.
[0034]
The A/D conversion units 31L and 31R convert the input sound signals captured by the corresponding microphones 4L and 4R into digital signals inputL and inputR, respectively, and supply them to the emitted non-target sound canceller processing unit 32. Each of the A/D conversion units 31L and 31R converts, for example, at the same sampling rate as that of the sound source data sigL and sigR.
[0035]
The sound source data sigL and sigR output from the sound source data storage units 21L and 21R are also supplied to the emitted non-target sound canceller processing unit 32. Here, the sampling rates of the four digital signals input to the unit 32 must be uniform. For example, when the sampling rates of the sound source data sigL and sigR, downloaded from an Internet site and stored in the sound source data storage units 21L and 21R, differ from the sampling rates of the digital signals inputL and inputR from the A/D conversion units 31L and 31R, the downloaded sound source data sigL and sigR should be given as they are to the D/A conversion units 22L and 22R, while versions of sigL and sigR whose sampling rates have been converted should be given to the emitted non-target sound canceller processing unit 32.
[0036]
The emitted non-target sound canceller processing unit 32 removes (or reduces), on the basis of the sound source data sigL and sigR output from the sound source data storage units 21L and 21R, the non-target sound components due to emission from the speakers 3L and 3R (hereinafter referred to as emitted non-target sound as appropriate) contained in the input sound signals (digital signals) inputL and inputR, and gives the input sound signals ECoutL and ECoutR after the removal processing to the target sound segment detection unit 33.
[0037]
Here, the unnecessary sound seen from the target sound (the emitted non-target sound, an interference sound), emitted from the speakers 3L and 3R and captured by the microphones 4L and 4R, can be regarded as the same as the acoustic echo that is a problem in telephone communication.
Therefore, in the first embodiment, the emitted non-target sound canceller processing unit 32 is configured by adapting acoustic echo canceller technology. For example, Non-Patent Document 1 describes a "stereo echo canceller", and in the first embodiment it is assumed that the canceller shown in FIG. 3.71 or FIG. 3.75 of Non-Patent Document 1 is applied as the unit 32. Note that the configuration in FIG. 3.73 of Non-Patent Document 1, in which a monaural echo canceller that removes the component of the L-channel speaker sound captured by the L-channel microphone and a monaural echo canceller that removes the component of the R-channel speaker sound captured by the L-channel microphone are cascaded to obtain the input sound signal ECoutL after removal processing, while a monaural echo canceller that removes the component of the L-channel speaker sound captured by the R-channel microphone and a monaural echo canceller that removes the component of the R-channel speaker sound captured by the R-channel microphone are cascaded to obtain the input sound signal ECoutR after removal processing, also belongs to the category of stereo echo cancellers and can be applied to the first embodiment.
[0038]
The target sound segment detection unit 33 has the detailed configuration shown in FIG. 2. On the basis of the input sound signals ECoutL and ECoutR from which the emitted non-target sound has been removed, it detects the segments containing the target sound (target sound segments) separately from the other segments (non-target sound segments), and outputs a detection result out indicating whether the current segment is a target sound segment or a non-target sound segment.
[0039]
In Patent Document 1 described above, a segment containing the target sound from the front is detected based on the fact that the coherence calculated from the sound signals obtained by the pair of microphones takes a large value for sound (sound signals) arriving from a predetermined direction (the front of the device) and a small value for sound arriving from other directions.
However, as described in the problem section, when the non-target sound includes an impact sound, there is a risk of false detection.
[0040]
Therefore, the target sound segment detection unit 33 of the first embodiment has a detailed configuration based on a new detection method, rather than a detection method based on the magnitude of the coherence.
[0041]
In FIG. 2, the target sound section detection unit 33 includes an FFT (Fast Fourier Transform)
unit 41, a coherence calculation unit 42, a modGI calculation unit 43, and a threshold
comparison unit 44.
[0042]
The FFT unit 41 converts the input sound signals ECoutL(n) and ECoutR(n), which are time-domain signals from which the emitted non-target sound has been removed, into the frequency-domain signals YL(f, K) and YR(f, K), respectively, and gives them to the coherence calculation unit 42.
Here, "n" is a parameter representing time, and "f" is a parameter representing frequency.
[0043]
Now, let the input sound signal ECoutL(n) be represented by the input signal s1(n).
From the input signal s1(n), an analysis frame FRAME1(K) consisting of a predetermined N samples is constructed. An example of forming the analysis frame FRAME1(K) from the input signal s1(n) is shown in equation (1). Here, K is an index representing the order of the frames, expressed as a positive integer; a smaller K denotes an older analysis frame and a larger K a newer one. In the following description, the index of the latest analysis frame to be analyzed is assumed to be K unless otherwise specified. When the order of the frames does not matter, K may be omitted (see equations (3) to (6) described later).
[0044]
The FFT unit 41 converts the input signal s1(n) into the frequency-domain signal YL(f, K) by performing fast Fourier transform processing for each analysis frame. Here, YL(f, K) is not a single value but is composed of the spectral components of a plurality of frequencies f1 to fm, as shown in equation (2).
YL(f, K) = {YL(f1, K), YL(f2, K), …, YL(fm, K)} …(2)
[0045]
The FFT unit 41 performs similar processing on the input sound signal ECoutR (n) to obtain a
frequency domain signal YR (f, K).
[0046]
The coherence calculation unit 42 calculates the coherence COH(K), according to equations (3) to (6), from the frequency-domain signals YL(f, K) and YR(f, K) obtained from the input sound signals ECoutL(n) and ECoutR(n) from which the emitted non-target sound has been removed. The coherence COH(K) is calculated as the average of the coherence coefficients coef(f, K) over all frequencies f1 to fm. Equation (5) is the calculation formula for the coherence coefficient coef(f, K), and B1(f) and B2(f) in equation (5) are signals given directivity according to equations (3) and (4), respectively.
[0047]
The modGI calculation unit 43 calculates the modGI value modGI(K) for the coherence COH(K) and supplies it to the threshold comparison unit 44. When the coherence COH(K) is denoted by s(K), the calculation formula of modGI(K) is given by equation (7).
[0048]
The threshold comparison unit 44 compares the modGI value modGI(K) for the coherence COH(K) with the threshold Ψ. When modGI(K) is smaller than the threshold Ψ, the detection result out(K) is set to a value representing a target sound segment; when modGI(K) is greater than or equal to the threshold Ψ, out(K) is set to a value representing a non-target sound segment. The obtained detection result out(K) is given to an external processing unit (not shown).
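The decision rule of the threshold comparison unit 44 can be sketched as follows. The concrete value of the threshold Ψ and the representation of the detection result out(K) are not specified in this text, so the string labels below are placeholders.

```python
def detect_segment(modgi_k, threshold):
    """Rule of the threshold comparison unit 44:
    modGI(K) < threshold (Psi)  -> target sound segment,
    modGI(K) >= threshold       -> non-target sound segment.
    Labels are illustrative stand-ins for out(K)."""
    return "target" if modgi_k < threshold else "non-target"
```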
[0049]
Here, the modGI value will be briefly described (see Patent Document 2 for details).
modGI is a modified version of the gradient index (hereinafter referred to as GI).
[0050]
For the GI before modification, see the reference: Naofumi Aoki, "A Band Extension Technique for Narrow Band Telephony Speech Based on Full Wave Rectification", IEICE Trans. Commun., Vol. E93-B(3), pp. 729-731, 2010.
[0051]
GI is an index measuring the number of times the slope direction of a signal waveform changes and the magnitude of those changes. GI is obtained by dividing the sum of the absolute differences between successive samples at points where the slope direction changes by the square root of the frame power. Therefore, GI tends to increase as the number of slope changes in one frame increases, and also as the amount of change at each slope reversal increases.
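Read literally, the description above can be sketched as follows. This is an interpretive sketch of the wording in this paragraph, not the exact formula from the Aoki reference.

```python
import numpy as np

def gradient_index(frame):
    """Sketch of GI per the description: sum of absolute
    successive-sample differences at slope reversals, divided by
    the square root of the frame power."""
    frame = np.asarray(frame, dtype=float)
    d = np.diff(frame)
    # Points where the slope direction changes sign
    sign_change = np.sign(d[1:]) != np.sign(d[:-1])
    num = np.sum(np.abs(d[1:][sign_change]))
    den = np.sqrt(np.sum(frame ** 2))     # square root of the frame power
    return num / den if den > 0 else 0.0
```

A rapidly zigzagging frame yields a larger GI than a monotone one, consistent with the tendency stated above.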
[0052]
However, since GI uses a variable Δn(n) that takes only the binary values 0 or 2, large jumps over time occur frequently, and the value swings irregularly between large and small; that is, GI has the property of fluctuating wildly.
[0053]
In view of this property that the GI value fluctuates wildly (exhibits large jumps), modGI was proposed as a feature quantity that maintains a high correlation with GI while suppressing the large jumps.
For an arbitrary signal subject to feature calculation (the coherence in the present application), modGI is the power of the second-order difference of that signal normalized by the power of the signal itself (variants in which this is multiplied by a constant are also included).
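Following the definition just given, modGI can be sketched directly (the optional constant multiplier is omitted):

```python
import numpy as np

def mod_gradient_index(signal):
    """Sketch of modGI per the definition above: the power of the
    second-order difference of the signal, normalized by the power of
    the signal itself (an optional constant factor is omitted here)."""
    signal = np.asarray(signal, dtype=float)
    second_diff = np.diff(signal, n=2)            # second-order difference
    power = np.sum(signal ** 2)                   # power of the signal
    return float(np.sum(second_diff ** 2) / power) if power > 0 else 0.0
```

Because the second-order difference is large wherever the slope reverses sharply, a wildly fluctuating signal yields a large modGI, while a smooth one yields a value near zero.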
[0054]
Since modGI has a high correlation with GI, it functions as a stable indicator for measuring the
number of times the inclination direction of the signal waveform changes and its magnitude.
[0055]
The reason why the modGI value for coherence is applied to the detection of the target sound section will be described below.
[0056]
Coherence is calculated from the output signal (ECoutL or ECoutR) of the sound emission non-target sound canceller processing unit 32. When the characteristics of the coherence are compared between a target sound section, in which the target sound and the interference sound (emitted sound) overlap, and a non-target sound section, in which the interference sound exists alone, the following differences are observed.
FIG. 3 is a characteristic diagram showing the time variation of the original signal (ECoutL or ECoutR) before the coherence calculation and of the coherence obtained by the calculation.
[0057]
In a non-target sound section in which the interference sound is present alone, the coherence takes a large value only at the moments when an impact sound, such as a drum sound, occurs in the sound emitted from the speakers 3L and 3R.
Most of the sound source components (emitted sound components) captured by the microphones 4L and 4R are removed by the sound emission non-target sound canceller processing unit 32. However, an impact sound such as a drum sound, which has a wide range of frequency components and a large level, is not sufficiently eliminated even after passing through the canceller processing unit 32. The portions where the level of the original signal before the coherence calculation in FIG. 3 rises instantaneously correspond to the impact sounds. In the coherence obtained by the calculation as well, the value becomes large only at the moments when the impact sounds occur. In the remainder of the non-target sound section, even if an emitted source sound is present, the coherence falls to about the same range as when only background noise is present. Therefore, when the interference sound is one in which impact sounds occur intermittently, behavior like "surge → sudden decrease → fine fluctuation comparable to background noise" is repeated, and the sign of the coherence slope changes frequently.
[0058]
On the other hand, in a target sound section in which both the target sound and the interference sound exist, the coherence rises at the moments when impact sounds occur, but because the target sound is present in the remaining intervals, the coherence maintains a moderate magnitude. Therefore, the fluctuation of the coherence is smaller than in the non-target sound section of the interference sound alone, and the fluctuation of its inclination is also small.
[0059]
As apparent from FIG. 3, it can be seen that there is no significant difference between the
dynamic range in the non-target sound section and the dynamic range in the target sound
section.
[0060]
As described above, there is no difference in the dynamic range of the calculated coherence between the non-target sound section, in which the interference sound is present alone, and the target sound section, in which both the target sound and the interference sound are present. However, because the number and magnitude of changes in the inclination direction of the calculated coherence differ between the two, the modGI described above can be applied as an index for discriminating between the non-target sound section and the target sound section.
[0061]
FIG. 4 shows the change in the modGI values calculated for the coherence.
FIG. 4 also shows the original signal (ECoutL or ECoutR) before the coherence calculation.
[0062]
When the modGI values of the target sound section and the non-target sound section are compared, it can be seen that modGI takes a large value in the non-target sound section, in which the interference sound is present alone, and a small value in the target sound section.
It is also clear that the range of modGI differs depending on the presence or absence of the target sound. From the above, it can be understood that, by focusing on the magnitude of modGI, it is possible to detect even a target sound section in which the interference sound overlaps the target sound, which has conventionally been difficult.
[0063]
(A-2) Operation of the First Embodiment Next, the operation of the sound collection and sound emission device 10 of the first embodiment will be described. In the following description, it is assumed as appropriate that the sound source data is music data and that the target sound is speech uttered by a user located in front of the sound collection and sound emission device 10.
[0064]
The sound source data (music data) read from the sound source data storage units 21L and 21R is converted into analog signals by the corresponding D/A conversion units 22L and 22R and then emitted from the speakers 3L and 3R. When such music is playing from the sound collection and sound emission device 10, the voice uttered by the user toward the device is captured by both microphones 4L and 4R. At this time, since music is also being emitted, the music from the speaker 3L is captured by both microphones 4L and 4R, and the music from the speaker 3R is likewise captured by both microphones. Furthermore, ambient background noise (such as the operating noise of an air conditioner or the traveling noise of a nearby vehicle) is also captured by the two microphones 4L and 4R.
[0065]
That is, the input sound signal obtained by each of the microphones 4L and 4R contains, in addition to the target sound of the user's voice, background noise and the non-target sound (interference sound) of the music emitted by the device itself.
[0066]
The input sound signals captured by the microphones 4L and 4R are converted into digital signals inputL and inputR by the corresponding A/D conversion units 31L and 31R and supplied to the sound emission non-target sound canceller processing unit 32.
The canceller processing unit 32 is also supplied with the sound source data sigL and sigR.
[0067]
In the sound emission non-target sound canceller processing unit 32, the emitted non-target sound is eliminated from the input sound signal (digital signal) inputL of the L channel by subtracting an internally generated pseudo emitted non-target sound signal, yielding the signal ECoutL; similarly, the emitted non-target sound is eliminated from the input sound signal (digital signal) inputR of the R channel, yielding the signal ECoutR. The pair of signals ECoutL and ECoutR thus obtained, from which the emitted non-target sound has been removed, is given to the target sound section detection unit 33.
[0068]
The target sound section detection unit 33, to which the pair of signals ECoutL and ECoutR with the emitted non-target sound removed is supplied, operates as follows.
[0069]
In the FFT unit 41, the signals ECoutL(n) and ECoutR(n), which are time domain signals from which the non-target sound has been eliminated, are converted into the frequency domain signals YL(f, K) and YR(f, K), respectively, and provided to the coherence calculation unit 42.
[0070]
In the coherence calculation unit 42, the coherence COH(K) is calculated according to equations (3) to (6) based on the frequency domain signals YL(f, K) and YR(f, K) from the FFT unit 41, and the obtained coherence COH(K) is given to the modGI calculation unit 43.
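Equations (3) to (6) are not reproduced in this passage; as an assumption, the sketch below uses a common magnitude-squared coherence formulation in which the cross- and auto-spectra are recursively smoothed over frames (the smoothing factor alpha is illustrative), and it may differ in detail from the document's own equations.

```python
import numpy as np

def coherence_sequence(YL_frames, YR_frames, alpha=0.8):
    """Hypothetical sketch of frame-wise coherence COH(K): per-bin
    magnitude-squared coherence from recursively smoothed cross- and
    auto-spectra, averaged over frequency bins. Without smoothing, a
    single-frame estimate would be identically 1."""
    num_bins = YL_frames.shape[1]
    Pxy = np.zeros(num_bins, dtype=complex)   # smoothed cross-spectrum
    Pxx = np.zeros(num_bins)                  # smoothed auto-spectra
    Pyy = np.zeros(num_bins)
    coh = []
    for YL, YR in zip(YL_frames, YR_frames):
        Pxy = alpha * Pxy + (1 - alpha) * YL * np.conj(YR)
        Pxx = alpha * Pxx + (1 - alpha) * np.abs(YL) ** 2
        Pyy = alpha * Pyy + (1 - alpha) * np.abs(YR) ** 2
        coh_f = np.abs(Pxy) ** 2 / np.maximum(Pxx * Pyy, 1e-12)
        coh.append(float(np.mean(coh_f)))     # COH(K): average over bins
    return coh
```

Identical left/right spectra drive the value toward 1, while uncorrelated spectra drive it toward 0, which is the behavior the following sections exploit.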
[0071]
The modGI calculation unit 43 calculates modGI(K) for the coherence COH(K) according to the above-mentioned equation (7), and gives the obtained modGI(K) to the threshold comparison unit 44.
[0072]
FIG. 5 is a flowchart showing the operation of the threshold comparison unit 44 in the target sound section detection unit 33.
Although the threshold comparison unit 44 need not be implemented in software, its operation can be represented by the flowchart of FIG. 5.
[0073]
When a new frame becomes the comparison target, the threshold comparison unit 44 takes in the modGI value modGI(K) for that frame (step S101) and compares it with the threshold Ψ (step S102).
When modGI(K) is smaller than the threshold Ψ, the threshold comparison unit 44 sets the detection result out(K) to a value representing the target sound section (step S103); when modGI(K) is greater than or equal to the threshold Ψ, it sets out(K) to a value representing the non-target sound section (step S104).
The obtained detection result out(K) is then sent to an external processing unit (not shown), and the next frame is set as the new comparison target (step S105).
This processing is repeated until there is no more input of modGI values modGI(K) (in other words, no more signal input to the target sound section detection unit 33).
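The flow of steps S101 to S105 can be sketched as follows; the string labels stand in for the unspecified values of out(K), and the threshold is supplied by the caller.

```python
def detect_sections(modgi_values, threshold):
    """Sketch of the threshold comparison unit 44 (steps S101-S105):
    a frame is labeled a target sound section when its modGI value is
    below the threshold Psi, otherwise a non-target sound section."""
    results = []
    for modgi_k in modgi_values:          # S101: take in modGI(K)
        if modgi_k < threshold:           # S102: compare with Psi
            results.append("target")      # S103: target sound section
        else:
            results.append("non-target")  # S104: non-target sound section
    return results                        # S105: advance to next frame
```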
[0074]
The detection result out(K) for the target sound section obtained as described above is used by a subsequent processing unit (not shown). For example, if the subsequent processing unit is a noise suppression processing unit such as a voice switch, the detection result out(K) is used to switch the values of the noise suppression and speech enhancement parameters between the target sound section and the non-target sound section. If the subsequent processing unit is a speech encoding unit, the detection result out(K) is used so that only the input speech signal of the target sound section is encoded.
[0075]
(A-3) Effects of the First Embodiment According to the first embodiment, the target sound section and the non-target sound section are distinguished by exploiting the difference between them in the behavior of modGI with respect to coherence. As a result, a target sound section in which the target sound and an emitted impact sound are captured simultaneously by the microphones, which has conventionally been difficult to detect, can also be correctly distinguished from the non-target sound section.
[0076]
As a result, the accuracy of various processes that use the detection result for the target sound section can be improved.
That is, the first embodiment can contribute to improving the characteristics of devices to which the sound collection and sound emission device, the target sound segment detection device, and the target sound segment detection program are applied. For example, an improvement in the speech quality of a communication device or in the recognition performance of a speech recognition system can be expected.
[0077]
(B) Second Embodiment Next, a second embodiment of a sound collection and sound emission
device, a target sound segment detection device and a target sound segment detection program
according to the present invention will be described with reference to the drawings.
[0078]
The overall configuration of the sound collection and sound emission device of the second embodiment (denoted by the same reference numeral "10" as in the first embodiment) can also be represented by FIG. 1, which was used in the description of the first embodiment.
[0079]
However, in the sound collection and sound emission device 10 of the second embodiment, the internal configuration of the target sound section detection unit (denoted by the reference numeral "33A") differs from that of the first embodiment.
[0080]
The target sound segment detection unit 33A of the second embodiment also distinguishes the target sound section from the non-target sound section by using modGI, the stable index for measuring the number of times the inclination direction of the signal waveform changes and its magnitude; however, instead of using the modGI value directly, it makes the distinction based on statistics of the modGI value.
Hereinafter, the second embodiment will be described focusing on the differences from the first embodiment.
[0081]
FIG. 6 is a block diagram showing the detailed configuration of the target sound segment detection unit 33A in the second embodiment; parts that are the same as or correspond to those in FIG. 2 of the first embodiment are given the same reference numerals.
[0082]
In FIG. 6, a target sound section detection unit 33A of the second embodiment includes an FFT
unit 41, a coherence calculation unit 42, a modGI calculation unit 43, a statistic calculation unit
45, and a threshold comparison unit 44A.
[0083]
The functions of the FFT unit 41, the coherence calculation unit 42, and the modGI calculation unit 43 are the same as in the first embodiment, and their description is therefore omitted.
[0084]
The statistic calculation unit 45 calculates the average value AVEmodGI(K) and the variance value VARmodGI(K) of the modGI values modGI(K) to modGI(K-M) within the immediately preceding predetermined period.
Instead of the average value, a statistic representing another representative value, such as the median, may be calculated; instead of the variance value, a statistic representing another degree of variation, such as the standard deviation, may be calculated.
[0085]
When the average value AVEmodGI(K) is smaller than the average threshold α and the variance value VARmodGI(K) is smaller than the variance threshold β, the threshold comparison unit 44A of the second embodiment sets the detection result out(K) to a value representing the target sound section; otherwise, it sets out(K) to a value representing the non-target sound section.
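The decision rule of the second embodiment can be sketched as below; the threshold values α and β are application-dependent assumptions.

```python
import statistics

def detect_by_statistics(recent_modgi, alpha, beta):
    """Sketch of the second embodiment's decision: a frame is judged a
    target sound section only when both the mean (AVEmodGI) and the
    variance (VARmodGI) of the modGI values over the immediately
    preceding period fall below the thresholds alpha and beta."""
    ave = statistics.fmean(recent_modgi)      # AVEmodGI(K)
    var = statistics.pvariance(recent_modgi)  # VARmodGI(K)
    return "target" if (ave < alpha and var < beta) else "non-target"
```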
[0086]
As can be seen from FIG. 4 described above, the modGI value for the coherence is generally small in both level and variation width in the target sound section, and generally large in the non-target sound section.
Therefore, the target sound section and the non-target sound section can be distinguished based on a representative statistic such as the average value, and they can likewise be distinguished based on a statistic indicating the degree of dispersion, such as the variance value.
In the second embodiment, the target sound section and the non-target sound section are detected by using the average value AVEmodGI(K) and the variance value VARmodGI(K) in combination.
[0087]
As modifications of the second embodiment, the target sound section or the non-target sound section may be detected using only the average value AVEmodGI(K), or using only the variance value VARmodGI(K), or using the modGI value modGI(K), the average value AVEmodGI(K), and the variance value VARmodGI(K) in combination.
[0088]
When a plurality of parameters are used for detection, they may be applied in parallel or sequentially.
As an example of the latter, when the first parameter indicates a non-target sound section, the determination ends there; when the first parameter indicates a target sound section, the process proceeds to a determination by the second parameter. When the second parameter indicates a non-target sound section, the frame is determined to be a non-target sound section; when the second parameter indicates a target sound section, a determination by a third parameter may be performed.
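The sequential scheme just described can be sketched as a cascade; the individual parameter checks are illustrative placeholders.

```python
def sequential_decision(parameter_checks):
    """Sketch of the sequential multi-parameter determination: each
    check returns True when its parameter indicates a target sound
    section. The first False ends the cascade as non-target; the frame
    is judged a target sound section only if every check passes."""
    for check in parameter_checks:
        if not check():                # this parameter says non-target
            return "non-target"        # stop without further determination
    return "target"                    # all parameters indicated target
```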
[0089]
According to the second embodiment, substantially the same effects as those of the first embodiment can be obtained.
In addition, since the target sound section and the non-target sound section are detected based on statistics of the modGI value, the risk of false detection due to instantaneous changes in the modGI value can be reduced.
[0090]
(C) Third Embodiment Next, a third embodiment of a sound collection and sound emission device, a target sound segment detection device and a target sound segment detection program according to the present invention will be described with reference to the drawings.
[0091]
The overall configuration of the sound collection and sound emission device of the third embodiment (denoted by the same reference numeral "10" as in the first embodiment) can also be represented by FIG. 1, which was used in the description of the first embodiment.
[0092]
However, in the sound collection and sound emission device 10 of the third embodiment, the internal configuration of the target sound section detection unit (denoted by the reference numeral "33B") differs from that of the first embodiment.
[0093]
The target sound segment detection unit 33B of the third embodiment also distinguishes the target sound section from the non-target sound section by using modGI, the stable index for measuring the number of times the inclination direction of the signal waveform changes and its magnitude; however, the threshold Ψ compared with the modGI value is changed dynamically.
Hereinafter, the third embodiment will be described focusing on the differences from the first embodiment.
[0094]
FIG. 7 is a block diagram showing the detailed configuration of the target sound segment detection unit 33B in the third embodiment; parts that are the same as or correspond to those in FIG. 2 of the first embodiment are given the same reference numerals.
[0095]
In FIG. 7, the target sound segment detection unit 33B of the third embodiment includes an FFT
unit 41, a coherence calculation unit 42, a modGI calculation unit 43, a threshold value
determination unit 46, and a threshold comparison unit 44B.
[0096]
The functions of the FFT unit 41, the coherence calculation unit 42, and the modGI calculation
unit 43 are the same as those in the first embodiment, and thus the description thereof will be
omitted.
[0097]
The threshold determination unit 46 determines the threshold Ψ(K) to be applied at the current time based on the modGI value modGI(K) from the modGI calculation unit 43.
Depending on the threshold determination method described later, the detection result out(K) is also given to the threshold determination unit 46, and past detection results are also used in determining the threshold Ψ(K).
[0098]
The threshold comparison unit 44B of the third embodiment detects the target sound section using not the fixed threshold Ψ but the threshold Ψ(K) determined for the current time by the threshold determination unit 46; except for this point, it is the same as in the first embodiment.
[0099]
Hereinafter, two examples of methods by which the threshold determination unit 46 determines the threshold Ψ(K) will be described.
[0100]
In the first method, the threshold Ψ(K) is set to the average of the modGI values modGI(K-1) to modGI(K-M) over the immediately preceding predetermined period, multiplied by a predetermined factor (which may be one).
[0101]
In the second method, as shown in FIG. 8, the threshold Ψ(K) is obtained by multiplying the minimum modGI value within the immediately preceding continuous period of non-target sound section frames by a predetermined factor (which may be one).
If the detection result of the immediately preceding frame K-1 is a target sound section, the past is searched from the current detection target frame for non-target sound section frames older than that target sound section, the continuous period of those non-target sound section frames is identified, and the threshold Ψ(K) is determined from it.
If the detection result of the immediately preceding frame K-1 is a non-target sound section (FIG. 8 shows this case), the continuous period of non-target sound section frames extending back from frame K-1 is identified, that period is taken as the immediately preceding continuous period of non-target sound section frames, and the threshold Ψ(K) is determined from it.
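Both determination methods can be sketched in one helper; the factor and the choice of the preceding period are assumptions, and identifying the preceding run of non-target sound section frames is left to the caller, who passes in its modGI values.

```python
import statistics

def dynamic_threshold(preceding_modgi, factor=1.0, method="average"):
    """Sketch of the threshold determination unit 46: Psi(K) is either
    the average of the modGI values over the immediately preceding
    period (first method) or the minimum modGI value within the
    immediately preceding run of non-target sound section frames
    (second method), multiplied by a predetermined factor."""
    if method == "average":                     # first method
        base = statistics.fmean(preceding_modgi)
    else:                                       # second method: "minimum"
        base = min(preceding_modgi)
    return factor * base
```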
[0102]
Also in the third embodiment, substantially the same effects as in the first embodiment can be obtained.
In addition, since the threshold Ψ compared with the modGI value is changed dynamically, a further improvement in detection accuracy can be expected.
[0103]
(D) Other Embodiments Various modified embodiments were mentioned in the descriptions of the above embodiments; further modified embodiments such as those exemplified below can also be mentioned.
[0104]
In the above embodiments, modGI is calculated using equation (7), which is the same as equation (13) of Patent Document 2; however, modGI may instead be calculated according to another equation described in Patent Document 2.
Also, since GI before modification is likewise an index for measuring the number of times the inclination direction of the signal waveform changes and its magnitude, GI may be applied instead of modGI in each of the above embodiments.
When modGI or GI is applied, the value need not be used as it is; it may be used to detect the target sound section after a predetermined functional operation, such as taking the logarithm, is performed.
[0105]
In the above embodiments, the target sound segment detection unit detects the target sound section from the output signals of the sound emission non-target sound canceller processing unit, but the target sound section may be detected from other signals.
For example, the input signals inputL and inputR, which are captured by the microphones and converted into digital signals, contain both target sound sections and non-target sound sections, and can therefore also be used for detection of the target sound section.
[0106]
In each of the above embodiments, the modGI value for the coherence is used to detect the target sound section; however, the modGI value of a directional signal other than the coherence may be used instead.
For example, a front-suppression signal, in which the component in the front direction of the device is suppressed, may be formed from the pair of input sound signals inputL and inputR, or from the pair of signals ECoutL and ECoutR from which the non-target sound has been eliminated, and the modGI value of that signal may be used to detect the target sound section.
[0107]
In each of the above embodiments, the modGI value is compared with a single threshold to detect the target sound section or the non-target sound section; however, the modGI value may be compared with two (or three or more) thresholds so as to detect a target sound section, a non-target sound section, or an intermediate section.
For example, when the detection result is used by a voice switch, different attenuation amounts may be applied in the three types of section.
Further, when the detection result is an intermediate section, a parameter other than the modGI value may be applied to determine whether the section is a target sound section or a non-target sound section.
In this case, modGI need not be the first applied parameter; it may be the second or a later one.
[0108]
In each of the above embodiments, two speakers are shown, but one speaker, or three or more speakers, may be used. Likewise, the number of microphones is not limited to two and may be three or more. The internal configuration of the sound emission non-target sound canceller processing unit 32 may be designed in consideration of the number of emitted sound paths determined by the numbers of speakers and microphones.
[0109]
In each of the above embodiments, the sound collection and sound emission device alone executes all processing; however, the detection of the target sound section may be entrusted to an external server. For example, when the sound collection and sound emission device is a smartphone, the system may be configured as a so-called cloud, with the target sound section detected in such a way that the user is unaware of the external server. The term "sound collection and sound emission device" in the claims includes cases in which an external server invisible to the user performs the processing.
[0110]
In each of the above embodiments, an apparatus and program for immediately processing a
signal captured by a pair of microphones are shown, but the present invention is not limited to
the case where a signal captured by a pair of microphones is recorded on a recording medium
and reproduced. It can apply.
[0111]
Further, the technical idea of the present invention can be applied even when there are no speakers on either side of the pair of microphones.
For example, when a voice command is issued to a car navigation system while car audio is being emitted, if the interfering sound source is known, the sound emission non-target sound canceller processing unit can operate effectively in the stage preceding the signal processing unit, so the present invention is effective.
[0112]
DESCRIPTION OF SYMBOLS 10 ... sound collection and sound emission device, 20 ... sound emission unit, 21L, 21R ... sound source data storage unit, 22L, 22R ... D/A conversion unit, 3L, 3R ... speaker, 30 ... sound collection unit, 4L, 4R ... microphone, 31L, 31R ... A/D conversion unit, 32 ... sound emission non-target sound canceller processing unit, 33, 33A, 33B ... target sound section detection unit, 41 ... FFT unit, 42 ... coherence calculation unit, 43 ... modGI calculation unit, 44, 44A, 44B ... threshold comparison unit, 45 ... statistic calculation unit, 46 ... threshold determination unit.