Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2001318687
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that, even in an environment with ambient noise, removes the background noise superimposed on the input speech and recognizes the input speech by collating the feature quantities of the denoised speech with the feature quantities of standard speech patterns prepared in advance.
[0002]
2. Description of the Related Art The speech recognition rate drops when background noise is superimposed on speech uttered in an environment with ambient noise. FIG. 8 is a block diagram showing the configuration of a conventional speech recognition apparatus that recognizes speech in such a noisy environment. In the figure, t is time, K is the number of noise sources, x(t) is the signal observed by the microphone, s(t) is the speech signal uttered by the speaker, nk(t) is the noise signal output from noise source k (1 ≤ k ≤ K), hs(t) is the impulse response from the speaker to the microphone, hkn(t) is the impulse response from noise source k to the microphone, and * denotes a convolution operation. The components of the speech recognition apparatus other than the microphone are shown collectively as a recognition processing unit, which is configured using speech recognition technology well known in the field. As shown in the figure, the observation signal of the microphone is the speech signal with the noise signals superimposed on it. A speech recognition apparatus that must collate a noiseless speech pattern with standard speech patterns therefore has to remove the superimposed noise before performing recognition. The noise signal at the observation point is the sum of the noise signals output from the individual noise sources. In the following, the combination of the noises output from the plural noise sources, as observed at the observation point, is called a "noise pattern", and the explanation assumes a virtual noise source that outputs this noise pattern.
[0003]
As a simple and effective method for removing superimposed noise, the two-input spectral subtraction method (hereinafter referred to as the two-input SS method), which uses a voice microphone and a noise microphone, is widely used. FIG. 9 shows the configuration of a conventional speech recognition apparatus using the two-input SS method, as described, for example, in "Voice recognition in a car using a two-input noise elimination method" (Signal Technical Report SP 89-81). In the figure, 101 denotes a voice microphone for collecting speech on which background noise is superimposed; 102, a noise microphone for mainly collecting background noise; 103, noise-superimposed speech spectrum computing means that frequency-converts the noise-superimposed speech signal output from the voice microphone 101 and outputs the power spectrum of the noise-superimposed speech in time series; 104, noise spectrum computing means that frequency-converts the noise pattern signal output from the noise microphone 102 and outputs the power spectrum of the noise pattern in time series; 105, a correction filter memory that stores one filter for correcting the difference, between the voice microphone 101 and the noise microphone 102, in the frequency characteristics of the transfer characteristics with respect to the noise pattern; 106, noise spectrum correction means that corrects the power spectrum of the noise pattern output from the noise spectrum computing means 104 using the correction filter stored in the correction filter memory 105 and outputs the power spectrum of the corrected noise pattern in time series; 107, denoised speech spectrum computing means that subtracts the power spectrum of the corrected noise pattern from the power spectrum of the noise-superimposed speech output from the noise-superimposed speech spectrum computing means 103 and outputs the power spectrum of the denoised speech in time series; 108, feature vector computing means that generates a feature vector from the power spectrum of the denoised speech output from the denoised speech spectrum computing means 107 and outputs the feature vector in time series; 109, a matching pattern memory that stores in advance the feature vectors of a plurality of noise-free standard speech patterns for comparison; and 110, matching means that collates the feature vector output from the feature vector computing means 108 with the feature vectors of the standard speech patterns stored in the matching pattern memory 109 and outputs, as the recognition result, the recognition candidate giving the maximum likelihood.
[0004]
Next, the operation will be described. The voice microphone 101 is generally disposed in the
vicinity of the speaker to collect voice superimposed with background noise. The noise
microphone 102 is generally placed at a distance from the speaker and mainly collects
background noise. This conventional speech recognition apparatus is configured on the
assumption that the leakage of speech into the noise microphone 102 is so small as to be
negligible.
[0005]
The noise-superimposed speech spectrum computing means 103 frequency-converts the noise-superimposed speech signal output from the voice microphone 101 using an FFT (Fast Fourier Transform) for each analysis frame, shifted every fixed time, and outputs in time series the power spectrum, for each analysis frame, of the noise-superimposed speech signal. Here, let X1(z) be the z-transform of the noise-superimposed speech signal, S(z) the z-transform of the speech signal, N(z) the z-transform of the noise pattern signal, G11(z) the transfer characteristic from the speaker to the voice microphone 101, and G21(z) the transfer characteristic from the virtual noise source for the noise pattern to the voice microphone 101; then the following equation (1) holds. X1(z) = G11(z)·S(z) + G21(z)·N(z) (1)
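For reference, the framewise power-spectrum computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the frame length, frame shift, window and sampling rate are assumed values.

```python
import numpy as np

def frame_power_spectra(signal, frame_len=256, frame_shift=128):
    """Split a signal into overlapping analysis frames and return the
    power spectrum of each frame (rows = frames, columns = frequency bins)."""
    window = np.hanning(frame_len)              # assumed analysis window
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len] * window
        spectrum = np.fft.rfft(frame)           # FFT of the analysis frame
        spectra[i] = np.abs(spectrum) ** 2      # power spectrum |X|^2
    return spectra

# Hypothetical example: X1 would come from the voice microphone,
# a second call with the noise-microphone signal would give X2.
fs = 16000                                      # assumed sampling rate
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)
X1 = frame_power_spectra(x1)                    # time series of power spectra X1i(omega)
```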
[0006]
Assuming that no signal delay occurs across multiple analysis frames, equation (1) can be expressed as equation (2) below. X1i(ω) = G11(ω)·Si(ω) + G21(ω)·Ni(ω) (2) In equation (2), ω is the angular frequency, X1i(ω) is the power spectrum of the noise-superimposed speech signal output by the voice microphone in analysis frame i, Si(ω) is the power spectrum of the speech uttered by the speaker in analysis frame i, Ni(ω) is the power spectrum of the noise pattern output by the virtual noise source in analysis frame i, G11(ω) is the frequency characteristic (filter) of the transfer characteristic from the speaker to the voice microphone, and G21(ω) is the frequency characteristic (filter) of the transfer characteristic from the virtual noise source to the voice microphone. Since speech recognition does not require phase information, the following description is given in the frequency domain with phase information ignored, unless otherwise specified.
[0007]
The noise spectrum computing means 104 frequency-converts the noise pattern signal output from the noise microphone 102 using an FFT (Fast Fourier Transform) for each analysis frame, shifted every fixed time, and outputs in time series the power spectrum, for each analysis frame, of the noise pattern signal. The power spectrum X2i(ω) of the noise pattern in analysis frame i is expressed by the following equation (3), where G22(ω) is the frequency characteristic of the transfer characteristic from the virtual noise source for the noise pattern to the noise microphone 102. X2i(ω) = G22(ω)·Ni(ω) (3)
[0008]
The correction filter memory 105 stores a filter H21(ω) = G21(ω)/G22(ω) for correcting the difference, between the voice microphone 101 and the noise microphone 102, in the frequency characteristics of the transfer characteristics with respect to the noise pattern. In the above-mentioned "in-vehicle voice recognition using a two-input noise removal method", the correction filter is calculated from the noise section immediately before the voice section using equation (4) below, and its value is stored. In equation (4), Ts denotes the analysis frame number at the beginning of the voice section; according to equation (4), the correction filter is the average, over the 20 frames immediately before the voice section, of the ratio of each frequency component of the noise-pattern power spectrum observed at the voice microphone to that observed at the noise microphone.
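A minimal sketch of this filter estimation, assuming the power spectra of both microphones are already available as frame-by-frame arrays; the array names, the small epsilon guard and the toy data are illustrative assumptions, while the 20-frame window and the per-bin averaging of the ratio follow the description of equation (4) above.

```python
import numpy as np

def estimate_h21(X1_noise, X2_noise, n_frames=20, eps=1e-12):
    """Estimate the correction filter H21(w) = G21(w)/G22(w) from a noise-only
    section: average, over the last n_frames before the speech section, of the
    per-bin ratio of the voice-microphone spectrum to the noise-microphone spectrum."""
    ratios = X1_noise[-n_frames:] / (X2_noise[-n_frames:] + eps)
    return ratios.mean(axis=0)            # one value per frequency bin

# X1_noise, X2_noise: (frames, bins) power spectra observed during the noise
# section immediately before the voice section (hypothetical example data).
X2_noise = np.random.rand(40, 129) + 0.5
H21_true = np.linspace(0.8, 1.2, 129)
X1_noise = X2_noise * H21_true
H21 = estimate_h21(X1_noise, X2_noise)    # close to H21_true in this noiseless toy case
```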
[0009]
The noise spectrum correction means 106 corrects the power spectrum of the noise pattern using the correction filter stored in the correction filter memory 105, and outputs the power spectrum of the corrected noise pattern in time series. The power spectrum X2'i(ω) of the corrected noise pattern in analysis frame i is expressed by the following equation (5). X2'i(ω) = H21(ω)·X2i(ω) (5)
[0010]
The denoised speech spectrum computing means 107 subtracts, for each analysis frame, the power spectrum of the corrected noise pattern output from the noise spectrum correction means 106 from the power spectrum of the noise-superimposed speech output in time series from the noise-superimposed speech spectrum computing means 103, and outputs the resulting power spectrum of the denoised speech in time series. The power spectrum S'i(ω) of the denoised speech in analysis frame i is expressed by the following equation (6). In equation (6), α is a parameter for adjusting the amount of the corrected-noise-pattern power spectrum that is subtracted, β is a parameter that sets the lower limit of each frequency component of the denoised speech power spectrum in order to prevent excessive subtraction of the corrected-noise-pattern power spectrum, and max{} is a function that returns the largest of the elements in the braces. S'i(ω) = max{X1i(ω) − α·X2'i(ω), β} (6)
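The subtraction of equation (6) can be sketched as follows; this is a minimal illustration, and the values chosen for alpha, beta and the example spectra are assumptions not given in the source.

```python
import numpy as np

def spectral_subtraction(X1, X2_corrected, alpha=1.0, beta=1e-3):
    """Two-input spectral subtraction in the sense of equation (6): subtract the
    corrected noise-pattern spectrum from the noise-superimposed speech spectrum
    and floor each frequency component at beta to avoid over-subtraction."""
    return np.maximum(X1 - alpha * X2_corrected, beta)

# X1: power spectrum of the noise-superimposed speech in one analysis frame,
# X2c: corrected noise-pattern spectrum H21(w) * X2(w) (hypothetical data).
X1 = np.abs(np.random.randn(129)) + 1.0
X2c = np.abs(np.random.randn(129)) * 0.5
S_denoised = spectral_subtraction(X1, X2c, alpha=1.0, beta=1e-3)
```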
[0011]
Here, since H21(ω) = G21(ω)/G22(ω), equations (3) and (5) give X2'i(ω) = G21(ω)·Ni(ω). Substituting this and equation (2) into equation (6), when α = 1 we obtain S'i(ω) = G11(ω)·Si(ω), that is, the power spectrum of the speech from which the noise has been removed.
[0012]
The feature vector computing means 108 converts the power spectrum of the denoised speech output in time series by the denoised speech spectrum computing means 107 into a feature vector representing acoustic features used in speech recognition, such as LPC cepstra, and outputs the feature vectors in time series.
[0013]
The matching means 110 collates the feature vector output from the feature vector computing means 108 with the feature vectors of the noise-free standard speech patterns stored in the matching pattern memory 109, and outputs, as the recognition result, the recognition candidate that gives the maximum likelihood.
[0014]
SUMMARY OF THE INVENTION Since the conventional speech recognition apparatus using the two-input SS method is configured as described above, it operates relatively well when the leakage of speech into the noise microphone is small enough to be neglected and the variation in the frequency characteristics of the transfer characteristics with respect to the noise pattern between the voice microphone and the noise microphone is small, that is, when the noise source is fixed. However, when the speech leakage into the noise microphone cannot be ignored, or when there are multiple noise sources and the frequency characteristics of the transfer characteristics with respect to the noise pattern of the voice microphone and the noise microphone change from moment to moment, there is a problem that accurate noise removal cannot be performed and recognition performance is degraded.
[0015]
The present invention has been made to solve the above problems, and it is an object of the present invention to provide a speech recognition apparatus capable of accurately recognizing speech in a noisy environment even when speech leaks into the noise microphone.
[0016]
It is a further object of the present invention to provide a speech recognition apparatus capable of accurately recognizing speech in a noisy environment even when the frequency characteristics of the transfer characteristics with respect to the noise pattern of the voice microphone and the noise microphone change from moment to moment.
[0017]
A speech recognition apparatus according to the present invention comprises: a voice microphone for collecting speech on which background noise is superimposed; a noise microphone for mainly collecting background noise; noise-superimposed speech spectrum computing means for frequency-converting the noise-superimposed speech signal output from the voice microphone and outputting the power spectrum of the noise-superimposed speech in time series; noise spectrum computing means for frequency-converting the noise pattern signal output from the noise microphone and outputting in time series the power spectrum of the noise pattern into which speech has leaked; noise-superimposed speech spectrum correction means for correcting the power spectrum of the noise-superimposed speech using a filter that corrects the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to speech, and outputting the power spectrum of the corrected noise-superimposed speech in time series; leaked speech removal means for subtracting the power spectrum of the corrected noise-superimposed speech from the power spectrum of the noise pattern into which speech has leaked, and outputting in time series the power spectrum of the noise pattern from which the leaked speech has been removed; leaked-speech-removal noise spectrum correction means for correcting the power spectrum of the noise pattern from which the leaked speech has been removed, using a filter that corrects the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to the noise pattern, and outputting the power spectrum of the corrected noise pattern in time series; denoised speech spectrum computing means for subtracting the power spectrum of the corrected noise pattern from the power spectrum of the noise-superimposed speech and outputting the power spectrum of the denoised speech in time series; and a recognition processing unit that executes speech recognition processing based on the power spectrum of the denoised speech.
[0018]
The speech recognition apparatus according to the present invention further comprises: speaker position detecting means for detecting the position of the speaker with a sensor and outputting the position data in time series; a voice-correction correction filter memory storing a plurality of correction filters for correcting the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to speech; and voice-correction correction filter selection means for selecting, from the voice-correction correction filter memory, the correction filter corresponding to the speaker position data output from the speaker position detecting means and outputting the correction filter in time series to the noise-superimposed speech spectrum correction means.
[0019]
A speech recognition apparatus according to the present invention comprises: a voice microphone for collecting speech on which background noise is superimposed; a noise microphone for mainly collecting background noise; noise-superimposed speech spectrum computing means for frequency-converting the noise-superimposed speech signal output from the voice microphone and outputting the power spectrum of the noise-superimposed speech in time series; noise spectrum computing means for frequency-converting the noise pattern signal output from the noise microphone and outputting the power spectrum of the noise pattern in time series; a noise-correction correction filter memory storing a plurality of correction filters for correcting the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to the noise pattern; a representative noise spectrum memory storing, for each of the plurality of stored correction filters, the power spectrum of a noise pattern from which leaked speech has been removed; noise spectrum selection means for calculating the distance values between a power spectrum of a noise pattern from which leaked speech has been removed and the power spectra of the plurality of leaked-speech-removed noise patterns stored in the representative noise spectrum memory, selecting from the representative noise spectrum memory the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series; noise-correction correction filter selection means for selecting, from the noise-correction correction filter memory, the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means and outputting the correction filter in time series; noise spectrum correction means for correcting the power spectrum of the noise pattern using the correction filter output from the noise-correction correction filter selection means and outputting the power spectrum of the corrected noise pattern in time series; denoised speech spectrum computing means for subtracting the power spectrum of the corrected noise pattern from the power spectrum of the noise-superimposed speech and outputting the power spectrum of the denoised speech in time series; and a recognition processing unit that executes speech recognition processing based on the power spectrum of the denoised speech.
[0020]
A speech recognition apparatus according to the present invention further comprises: a noise-correction correction filter memory storing a plurality of correction filters for correcting the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to the noise pattern; a representative noise spectrum memory storing, for each of the plurality of correction filters stored in the noise-correction correction filter memory, the power spectrum of a noise pattern from which leaked speech has been removed; noise spectrum selection means for calculating the distance values between the power spectrum of a noise pattern from which leaked speech has been removed and the power spectra of the plurality of leaked-speech-removed noise patterns stored in the representative noise spectrum memory, selecting from the representative noise spectrum memory the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series; and noise-correction correction filter selection means for selecting, from the noise-correction correction filter memory, the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means and outputting it in time series to the leaked-speech-removal noise spectrum correction means.
[0021]
A speech recognition apparatus according to the present invention comprises: a voice microphone for collecting speech on which background noise is superimposed; a noise microphone for mainly collecting background noise; noise-superimposed speech spectrum computing means for frequency-converting the noise-superimposed speech signal output from the voice microphone and outputting the power spectrum of the noise-superimposed speech in time series; noise spectrum computing means for frequency-converting the noise pattern signal output from the noise microphone and outputting in time series the power spectrum of the noise pattern into which speech has leaked; noise-superimposed speech spectrum correction means for correcting the power spectrum of the noise-superimposed speech using a filter that corrects the difference, between the voice microphone and the noise microphone, in the frequency characteristics of the transfer characteristics with respect to speech, and outputting the power spectrum of the corrected noise-superimposed speech in time series; leaked speech removal means for subtracting the power spectrum of the corrected noise-superimposed speech from the power spectrum of the noise pattern into which speech has leaked, and outputting in time series the power spectrum of the noise pattern from which the leaked speech has been removed; a first representative noise spectrum memory storing the power spectra of a plurality of noise patterns from which leaked speech has been removed; a second representative noise spectrum memory storing the power spectra of a plurality of superimposed noise patterns respectively corresponding to the power spectra of the plurality of leaked-speech-removed noise patterns stored in the first representative noise spectrum memory; first noise spectrum selection means for calculating the distance values between the power spectrum of the collected leaked-speech-removed noise pattern and the power spectra of the plurality of leaked-speech-removed noise patterns stored in the first representative noise spectrum memory, selecting from the first representative noise spectrum memory the power spectrum of the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series; second noise spectrum selection means for selecting, from the second representative noise spectrum memory, the power spectrum of the superimposed noise pattern corresponding to the noise pattern identification signal output from the first noise spectrum selection means and outputting it in the same time series; denoised speech spectrum computing means for subtracting the power spectrum of the superimposed noise pattern selected by the second noise spectrum selection means from the power spectrum of the noise-superimposed speech and outputting the power spectrum of the denoised speech in time series; and a recognition processing unit that executes speech recognition processing based on the power spectrum of the denoised speech.
[0022]
A speech recognition apparatus according to the present invention further comprises: noise power level computing means for calculating a noise power level from the noise pattern signal output from the noise microphone and outputting the noise power level in time series; voice section detecting means for determining voice sections based on the noise-superimposed speech signal output from the voice microphone and the noise pattern signal output from the noise microphone, and outputting in time series an identification signal indicating whether or not the current section is a voice section; correction filter learning determination means for outputting an identification signal indicating that learning of the correction filter is to be performed when the noise power level output from the noise power level computing means is less than or equal to a threshold and the identification signal output from the voice section detecting means indicates a voice section; and correction filter learning means for learning, when the identification signal output from the correction filter learning determination means indicates that correction filter learning is to be performed, the correction filter corresponding to the speaker position data output from the speaker position detecting means, based on the power spectrum of the noise-superimposed speech output from the noise-superimposed speech spectrum computing means and the power spectrum of the noise pattern output from the noise spectrum computing means, and outputting the correction filter.
[0023]
A speech recognition apparatus according to the present invention further comprises: noise power level computing means for calculating a noise power level from the noise pattern signal output from the noise microphone and outputting the noise power level in time series; noise section detecting means for determining noise sections based on the noise-superimposed speech signal output from the voice microphone and the noise pattern signal output from the noise microphone, and outputting in time series an identification signal indicating whether or not the current section is a noise section; noise spectrum learning determination means for outputting in time series an identification signal indicating that noise spectrum learning is to be performed when the noise power level output from the noise power level computing means is equal to or higher than a threshold and the identification signal output from the noise section detecting means indicates a noise section; first noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means indicates that noise spectrum learning is to be performed, the power spectra of representative leaked-speech-removed noise patterns from the power spectra of the leaked-speech-removed noise patterns output from the leaked speech removal means, and outputting those power spectra; and second noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means indicates that noise spectrum learning is to be performed, the power spectra of the superimposed noise patterns corresponding to the power spectra of the representative leaked-speech-removed noise patterns output from the first noise spectrum learning means, from the power spectrum of the noise-superimposed speech output from the noise-superimposed speech spectrum computing means, and outputting those power spectra.
[0024]
In the speech recognition apparatus according to the present invention, the first noise spectrum learning means comprises a first noise spectrum memory that stores the power spectra of the plurality of leaked-speech-removed noise patterns output from the leaked speech removal means, and first clustering means that clusters the power spectra of the plurality of leaked-speech-removed noise patterns stored in the first noise spectrum memory so as to minimize the sum of the distance values between the power spectrum serving as the centroid of each class and the power spectra of the noise patterns included in that class, and outputs the centroid of each class as the power spectrum of a representative leaked-speech-removed noise pattern. The second noise spectrum learning means comprises a second noise spectrum memory that stores the power spectra of a plurality of superimposed noise patterns output in the same analysis frames as the plurality of leaked-speech-removed noise patterns stored in the first noise spectrum memory, and second clustering means that clusters the power spectra of the plurality of superimposed noise patterns stored in the second noise spectrum memory so as to reflect the clustering result of the first clustering means, and outputs the centroid of each class as the power spectrum of a representative superimposed noise pattern.
[0025]
In the speech recognition apparatus according to the present invention, the first noise spectrum learning means comprises: a first noise spectrum memory that stores the power spectra of the plurality of leaked-speech-removed noise patterns output from the leaked speech removal means; spectral outline parameter computing means for calculating, from the power spectra of the leaked-speech-removed noise patterns stored in the first noise spectrum memory, a parameter representing the outline of the power spectrum and outputting the parameter; spectral intensity parameter computing means for calculating, from the power spectra of the leaked-speech-removed noise patterns stored in the first noise spectrum memory, a parameter representing the intensity of the power spectrum and outputting the parameter; and weighted clustering means that clusters the power spectra of the plurality of leaked-speech-removed noise patterns stored in the first noise spectrum memory using distance values calculated by applying weights to the parameter representing the outline of the power spectrum output from the spectral outline parameter computing means and the parameter representing the intensity of the power spectrum output from the spectral intensity parameter computing means, and outputs the power spectra of representative leaked-speech-removed noise patterns.
[0026]
BEST MODE FOR CARRYING OUT THE INVENTION Hereinafter, an embodiment of the present
invention will be described.
Embodiment 1
FIG. 1 is a diagram showing the configuration of a speech recognition system according to a first
embodiment of the present invention.
In the figure, 1 denotes a voice microphone for collecting speech on which background noise is superimposed; 2, a noise microphone for mainly collecting background noise; 3, noise-superimposed speech spectrum computing means that frequency-converts the noise-superimposed speech signal output from the voice microphone 1 and outputs the power spectrum of the noise-superimposed speech in time series; 4, noise spectrum computing means that frequency-converts the noise pattern signal output from the noise microphone 2 and outputs in time series the power spectrum of the noise pattern into which speech has leaked; 5, speaker position detecting means that detects the position of the speaker with a sensor and outputs the position data in time series; 6, a voice-correction correction filter memory that stores one or more filters for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to speech; 7, voice-correction correction filter selection means that selects, from the voice-correction correction filter memory 6, the correction filter corresponding to the speaker position data output from the speaker position detecting means 5 and outputs the correction filter in time series; 8, noise-superimposed speech spectrum correction means that corrects the power spectrum of the corresponding noise-superimposed speech using the correction filter output from the voice-correction correction filter selection means 7 and outputs the power spectrum of the corrected noise-superimposed speech in time series; 9, leaked speech removal means that subtracts the power spectrum of the corrected noise-superimposed speech from the power spectrum, output from the noise spectrum computing means 4, of the noise pattern into which speech has leaked, and outputs in time series the power spectrum of the noise pattern from which the leaked speech has been removed; 10, a noise-correction correction filter memory that stores one or more filters for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to the noise pattern; 11, a representative noise spectrum memory that stores, for each correction filter stored in the noise-correction correction filter memory 10, the power spectrum of a noise pattern from which representative leaked speech has been removed; 12, noise spectrum selection means that calculates the distance values between the power spectrum of the leaked-speech-removed noise pattern output from the leaked speech removal means 9 and the power spectra of the plurality of representative leaked-speech-removed noise patterns, selects from the representative noise spectrum memory 11 the noise pattern giving the shortest distance value, and outputs a signal identifying that noise pattern in time series; 13, noise-correction correction filter selection means that selects, from the noise-correction correction filter memory 10, the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means 12 and outputs it in time series; 14, leaked-speech-removal noise spectrum correction means that corrects the power spectrum of the leaked-speech-removed noise pattern using the correction filter output from the noise-correction correction filter selection means 13 and outputs the power spectrum of the corrected noise pattern in time series; 15, denoised speech spectrum computing means that subtracts the power spectrum of the corrected noise pattern from the power spectrum of the noise-superimposed speech and outputs the power spectrum of the denoised speech in time series; 16, feature vector computing means that generates a feature vector from the power spectrum of the denoised speech and outputs the feature vectors in time series; 17, a matching pattern memory that stores in advance the feature vectors of a plurality of noise-free standard speech patterns for comparison; and 18, matching means that collates the feature vectors output in time series from the feature vector computing means 16 with the feature vectors of the noise-free standard speech patterns stored in the matching pattern memory 17 and outputs, as the recognition result, the recognition candidate giving the maximum likelihood. The feature vector computing means 16, the matching pattern memory 17 and the matching means 18 can be regarded collectively as a recognition processing unit that executes speech recognition processing based on the power spectrum of the denoised speech.
[0027]
Next, the operation will be described.
The voice microphone 1 is generally disposed in the vicinity of the speaker to collect voice
superimposed with background noise.
The noise microphone 2 is generally placed at a position distant from the speaker and mainly
collects background noise.
The speech recognition apparatus according to the first embodiment of the present invention is configured on the assumption of an environment in which there are a plurality of noise sources, the noise sources change over time, and the leakage of speech into the noise microphone 2 is not small enough to be neglected.
[0028]
The noise superimposed speech spectrum computing means 3 performs frequency conversion on
the noise superimposed speech signal output from the speech microphone 1 using FFT (Fast
Fourier Transform) for each analysis frame shifted every predetermined time, The power
spectrum for each analysis frame for the noise-superimposed speech signal is output in time
series.
At this time, the power spectrum X1i (ω) related to the noise-superimposed speech in the
analysis frame i is expressed by the following equation (7).
In equation (7), Si(ω) is the power spectrum of the speech uttered by the speaker in analysis frame i, Ni(ω) is the power spectrum of the noise pattern output by the virtual noise source in analysis frame i, G11(x(i), y(i))(ω) is the frequency characteristic (filter) of the transfer characteristic from the speaker at speaker position (x(i), y(i)) to the voice microphone 1 in analysis frame i, and G21,i(ω) is the frequency characteristic (filter) of the transfer characteristic from the virtual noise source to the voice microphone 1 in analysis frame i. X1i(ω) = G11(x(i), y(i))(ω)·Si(ω) + G21,i(ω)·Ni(ω) (7)
[0029]
Similarly, the noise spectrum computing means 4 frequency-converts the noise pattern signal output from the noise microphone 2 using an FFT (Fast Fourier Transform) for each analysis frame, shifted every fixed time, and outputs in time series the power spectrum, for each analysis frame, of the noise pattern signal into which speech has leaked. The power spectrum X2i(ω) of the noise pattern into which speech has leaked in analysis frame i is expressed by the following equation (8). In equation (8), G12(x(i), y(i))(ω) is the frequency characteristic (filter) of the transfer characteristic from the speaker at speaker position (x(i), y(i)) to the noise microphone 2 in analysis frame i, and G22,i(ω) is the frequency characteristic (filter) of the transfer characteristic from the virtual noise source to the noise microphone 2 in analysis frame i. X2i(ω) = G12(x(i), y(i))(ω)·Si(ω) + G22,i(ω)·Ni(ω) (8)
[0030]
The speaker position detecting means 5 detects the position where the speaker is present by the
sensor, and outputs the speaker position data (x (i), y (i)) in time series for each analysis frame i.
[0031]
The voice-correction correction filter memory 6 stores, for each speaker position (x, y), a filter W12(x, y)(ω) = G12(x, y)(ω)/G11(x, y)(ω), learned in advance, for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to speech. Here, the learning method of the correction filter will be described. The correction filter for each speaker position is learned in advance from speech sections uttered in a noise-free environment or an environment in which noise is negligible. In this case, the power spectrum X1j(ω)voice of the signal output from the voice microphone 1 in analysis frame j and the power spectrum X2j(ω)voice of the signal output from the noise microphone 2 are expressed by the following equation (9). Equation (9) is derived by deleting the second terms of equations (7) and (8) under the assumption that the background noise is negligible.
[0032]
Therefore, the filter W12(x(j), y(j))(ω) for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to speech at the speaker position (x(j), y(j)) is derived using the following equation (10).
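A minimal sketch of this per-position learning of W12, assuming noise-free calibration utterances recorded at each speaker position and frame-by-frame power spectra from both microphones; the position grid, the averaging over frames, the epsilon guard and the toy data are illustrative assumptions.

```python
import numpy as np

def learn_w12(X1_voice, X2_voice, eps=1e-12):
    """Learn W12(x, y)(w) = G12(x, y)(w) / G11(x, y)(w) for one speaker position
    from noise-free speech: the per-bin ratio of the noise-microphone spectrum to
    the voice-microphone spectrum (in the sense of equation (10)), averaged over frames."""
    return (X2_voice / (X1_voice + eps)).mean(axis=0)

# Build the voice-correction correction filter memory: one filter per assumed position.
w12_memory = {}
for position in [(0.0, 0.5), (0.3, 0.5), (0.6, 0.5)]:       # hypothetical speaker positions
    X1_voice = np.random.rand(50, 129) + 0.5                # hypothetical calibration spectra
    X2_voice = 0.2 * X1_voice                               # leaked speech at the noise mic
    w12_memory[position] = learn_w12(X1_voice, X2_voice)
```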
[0033]
The voice-correction correction filter selection means 7 selects, from the voice-correction correction filter memory 6, the correction filter W12(x(i), y(i))(ω) = G12(x(i), y(i))(ω)/G11(x(i), y(i))(ω) corresponding to the speaker position data (x(i), y(i)) in analysis frame i output in time series from the speaker position detecting means 5, and outputs the correction filter in time series for each analysis frame i.
[0034]
The noise-superimposed speech spectrum correction means 8 corrects the power spectrum of the noise-superimposed speech using the correction filter output from the voice-correction correction filter selection means 7, and outputs the power spectrum of the corrected noise-superimposed speech in time series. The power spectrum X1'i(ω) of the corrected noise-superimposed speech in each analysis frame i is expressed by the following equation (11). X1'i(ω) = W12(x(i), y(i))(ω)·X1i(ω) (11)
[0035]
The leaked speech removal means 9 subtracts the power spectrum of the corrected noise-superimposed speech output from the noise-superimposed speech spectrum correction means 8 from the power spectrum of the noise pattern into which speech has leaked, output from the noise spectrum computing means 4, and outputs in time series the power spectrum of the noise pattern from which the leaked speech has been removed. The power spectrum Y2i(ω) of the noise pattern from which the leaked speech has been removed in analysis frame i is expressed by the following equation (12).
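Equations (11) and (12) amount to scaling the voice-microphone spectrum by the selected W12 filter and subtracting it from the noise-microphone spectrum. A minimal sketch follows; the array names and toy values are illustrative, and the flooring at zero is an added safeguard against negative power values, not something stated in the source.

```python
import numpy as np

def remove_leaked_speech(X1, X2, w12, floor=0.0):
    """Remove the speech that leaked into the noise microphone.
    X1:  power spectrum of the noise-superimposed speech (voice mic), one frame
    X2:  power spectrum of the noise pattern with leaked speech (noise mic)
    w12: correction filter W12(x(i), y(i))(w) selected for the current speaker position
    Returns Y2(w), the noise-pattern spectrum with the leaked speech removed."""
    X1_corrected = w12 * X1                        # corrected noise-superimposed speech, eq. (11)
    return np.maximum(X2 - X1_corrected, floor)    # subtraction as in eq. (12), floored at 0

# Hypothetical single-frame example.
X1 = np.abs(np.random.randn(129)) + 1.0
w12 = np.full(129, 0.2)
X2 = 0.2 * X1 + 0.3                                # leaked speech plus background noise
Y2 = remove_leaked_speech(X1, X2, w12)
```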
[0036]
The noise-correction correction filter memory 10 stores N filters, where N is an appropriate number according to the types of noise pattern assumed, each correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to the noise pattern; the filters are obtained by prior learning using noise sections. The representative noise spectrum memory 11 stores the power spectrum of the noise pattern corresponding to each of the N correction filters stored in the noise-correction correction filter memory 10.
[0037]
Hereinafter, the learning method and the storage method of the correction filters stored in the noise-correction correction filter memory 10 and of the power spectra of the noise patterns corresponding to those correction filters will be described. In a noise section, the power spectrum X1j(ω)noise observed by the voice microphone in analysis frame j is expressed by the following equation (13). Since a noise section contains no speech, equation (13) is derived by deleting the first term of equation (7). X1j(ω)noise = G21,j(ω)·Nj(ω) (13)
[0038]
Therefore, the filter WΩ(j)21(ω) that corrects the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to the noise pattern produced by the combination Ω(j) = {N1j, N2j, ..., NKj} of the K noises output from the K noise sources in analysis frame j is expressed by the following equation (14).
[0039]
Although the noise pattern produced by the combination of noises from the K noise sources in analysis frame j is unknown, if Ω(j1) = Ω(j2) in analysis frames j1 and j2, it can be considered that WΩ(j1)21(ω) = WΩ(j2)21(ω). Therefore, the values of X1j(ω)noise / Y2j(ω) output in time series are clustered into an appropriate number N of classes. Clustering is performed so as to minimize the evaluation function represented by the following equation (15). In equation (15), Wn21(ω) is the centroid of class n, Θ(n) is the set of time-series numbers of the elements of class n, and dis(X, Y) is a function that returns the distance value between power spectrum X and power spectrum Y. The centroid Wn21(ω) of each class is derived from the following equation (16), in which Mn is the number of elements of class n. After clustering ends, the N filters Wn21(ω) are output as representative correction filters and stored in the noise-correction correction filter memory 10.
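Read this way, the clustering of equations (15) and (16) is essentially a k-means-style procedure over the per-frame ratios X1j(ω)noise / Y2j(ω). The sketch below follows that reading; the number of classes N, the squared-Euclidean distance, the iteration count and the toy data are assumptions, since the source does not fix them.

```python
import numpy as np

def cluster_correction_filters(ratios, n_classes=4, n_iter=20, seed=0):
    """Cluster per-frame filter estimates (rows = frames, cols = frequency bins)
    into n_classes classes, reducing the sum of distances to the class centroids
    (equation (15)); each centroid is the mean of its class (equation (16))."""
    rng = np.random.default_rng(seed)
    centroids = ratios[rng.choice(len(ratios), n_classes, replace=False)]
    for _ in range(n_iter):
        # assign each frame to the nearest centroid
        dists = ((ratios[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its class
        for n in range(n_classes):
            if np.any(labels == n):
                centroids[n] = ratios[labels == n].mean(axis=0)
    return centroids, labels

# ratios[j] = X1j(w)noise / Y2j(w) collected over noise sections (hypothetical data).
ratios = np.vstack([np.random.rand(60, 129) + c for c in (0.5, 1.5, 3.0, 5.0)])
Wn21, labels = cluster_correction_filters(ratios, n_classes=4)   # representative filters
```

The same class labels can then be used to average the corresponding Y2j(ω) spectra per class, which yields the representative noise spectra Y2n(ω) of equation (17) stored in the representative noise spectrum memory 11.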
[0040]
Similarly, based on the clustering result of X1j(ω)noise / Y2j(ω), the power spectra Y2j(ω) of the noise patterns from which the leaked speech output in time series has been removed are classified into N classes, and the centroid of each class n (1 ≤ n ≤ N) is stored in the representative noise spectrum memory 11 as the power spectrum Y2n(ω) of a representative leaked-speech-removed noise pattern. The centroid Y2n(ω) of each class is derived from the following equation (17), in which Mn is the number of elements of class n. As described above, the N correction filters Wn21(ω) are stored so as to correspond to the noise patterns divided into the N classes, and based on the correspondence between the N pairs of Y2n(ω) and Wn21(ω), a correction filter WΩ(j)21(ω) corresponding to the noise pattern Y2j(ω) in an arbitrary frame j can be derived. That is, although the number of noise patterns produced by combinations of the K noises output from the K noise sources is considered practically infinite, the noise pattern most similar to an arbitrary noise pattern Y2j(ω) is selected from the N representative noise patterns stored in the representative noise spectrum memory 11, and the correction filter Wn21(ω) corresponding to that most similar noise pattern Y2n(ω) is used as the correction filter WΩ(j)21(ω) in frame j.
[0041]
The noise spectrum selection means 12 calculates the distance values between the power spectrum of the leaked-speech-removed noise pattern output in time series from the leaked speech removal means 9 and the power spectra of the N representative noise patterns stored in the representative noise spectrum memory 11, selects from the representative noise spectrum memory 11 the representative noise pattern that gives the shortest distance value to the power spectrum of the leaked-speech-removed noise pattern, and outputs a signal identifying that noise pattern. The power spectrum Y2l(i)(ω) of the noise pattern giving the shortest distance value is expressed by equation (18). In equation (18), dis(X, Y) is a function that returns the distance between power spectrum X and power spectrum Y, and l(i) is the number of the noise pattern that gives the shortest distance value in analysis frame i.
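Selecting the representative noise pattern with the shortest distance (equation (18)) reduces to an argmin over the stored spectra. The sketch below assumes a Euclidean distance for dis(X, Y), since the source does not fix the distance measure, and uses illustrative data.

```python
import numpy as np

def select_noise_pattern(Y2, representative_Y2):
    """Return the index l(i) of the stored representative noise pattern closest
    to the observed leaked-speech-removed noise spectrum Y2 (cf. equation (18))."""
    dists = np.linalg.norm(representative_Y2 - Y2, axis=1)   # dis(X, Y) assumed Euclidean
    return int(dists.argmin())

# representative_Y2: (N, bins) spectra stored in the representative noise spectrum
# memory 11; Y2: current frame's spectrum from the leaked speech removal means 9.
representative_Y2 = np.random.rand(4, 129)
Y2 = representative_Y2[2] + 0.01 * np.random.randn(129)
l_i = select_noise_pattern(Y2, representative_Y2)            # -> 2 for this toy example
# The correction filter Wn21 with index l_i is then applied to Y2 (equation (19)).
```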
[0042]
The noise-correction correction filter selection means 13 selects, from the noise-correction correction filter memory 10, the correction filter Wl(i)21(ω) corresponding to the noise pattern identification signal output in time series from the noise spectrum selection means 12, and outputs it in time series. The leaked-speech-removal noise spectrum correction means 14 corrects, using the correction filter output from the noise-correction correction filter selection means 13, the power spectrum of the leaked-speech-removed noise pattern output from the leaked speech removal means 9, and outputs the power spectrum of the corrected noise pattern in time series. The power spectrum Y2'i(ω) of the corrected noise pattern is expressed by the following equation (19). Y2'i(ω) = Wl(i)21(ω)·Y2i(ω) (19)
[0043]
The denoised speech spectrum computing means 15 subtracts the power spectrum of the corrected noise pattern from the power spectrum of the noise-superimposed speech, and outputs the power spectrum S'i(ω) of the denoised speech in time series. The power spectrum S'i(ω) of the denoised speech in analysis frame i is expressed by the following equation (20). In equation (20), α is a parameter for adjusting the amount of the corrected-noise-pattern power spectrum that is subtracted, β is a parameter that sets the lower limit of each frequency component of the denoised speech power spectrum in order to prevent excessive subtraction of the corrected-noise-pattern power spectrum, and max{} is a function that returns the largest of the elements in the braces. S'i(ω) = max{X1i(ω) − α·Y2'i(ω), β} (20)
[0044]
Here, if the correction filter WΩ(i)21(ω) for the noise pattern produced by the combination Ω(i) of the K noises output by the K noise sources in analysis frame i has been properly stored in the prior learning, then Wl(i)21(ω) = WΩ(i)21(ω). Therefore, from equations (12) and (19), Y2'i(ω) = G21,i(ω)·Ni(ω). Substituting this equation and equation (7) into equation (20), when α = 1 we obtain S'i(ω) = G11(x(i), y(i))(ω)·Si(ω), that is, the power spectrum of the speech from which the noise has been removed.
[0045]
The operations relating to the feature vector computing means 16, the matching pattern memory
17 and the matching means 18 are the same as the feature vector computing means 108, the
matching pattern memory 109 and the matching means 110 of the prior art, and therefore the
description thereof is omitted.
[0046]
As described above, according to the first embodiment, the leaked speech removal means is provided, which subtracts the power spectrum of the corrected noise-superimposed speech from the power spectrum of the noise pattern into which speech has leaked and outputs in time series the power spectrum of the noise pattern from which the leaked speech has been removed. Therefore, even if speech leaks into the noise microphone, the leaked speech is removed from the noise pattern, the resulting noise pattern can be accurately removed from the noise-superimposed speech, and the speech recognition performance can be improved.
[0047]
Further, the speaker position detecting means 5 that detects the speaker position and outputs the speaker position data in time series for each analysis frame, the voice-correction correction filter memory 6 that stores, for each speaker position, a plurality of filters learned in advance for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to speech, and the voice-correction correction filter selection means 7 that selects the correction filter corresponding to the speaker position are provided. The appropriate correction filter can therefore be selected according to the speaker's position, the power spectrum of the leaked speech can be accurately removed from the power spectrum of the noise pattern into which speech has leaked, the noise pattern can be accurately removed from the noise-superimposed speech, and the speech recognition performance can be improved further.
[0048]
Further provided are the noise-correction correction filter memory 10 that stores a plurality of filters for correcting the difference, between the voice microphone 1 and the noise microphone 2, in the frequency characteristics of the transfer characteristics with respect to noise, the representative noise spectrum memory 11 that stores the power spectrum of the noise pattern corresponding to each correction filter stored in the noise-correction correction filter memory 10, the noise spectrum selection means 12 that calculates the distance values between the power spectrum of the leaked-speech-removed noise pattern and the power spectra of the plurality of noise patterns stored in the representative noise spectrum memory 11, selects from the representative noise spectrum memory 11 the noise pattern giving the shortest distance value, and outputs a signal identifying that noise pattern in time series, and the noise-correction correction filter selection means 13 that selects, from the noise-correction correction filter memory 10, the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means 12 and outputs it in time series. A suitable correction filter can therefore be selected according to the leaked-speech-removed noise pattern to generate the power spectrum of the corrected noise pattern, the power spectrum of the corrected noise pattern can be accurately removed from the power spectrum of the noise-superimposed speech, and the speech recognition performance can be improved.
[0049]
Embodiment 2
FIG. 2 is a diagram showing the configuration of a speech recognition system according to a
second embodiment of the present invention.
In FIG. 2, the same reference numerals as those in FIG. 1 denote the same or corresponding parts, and their description is omitted. 21 denotes a first representative noise spectrum memory that stores the power spectra of a plurality of noise patterns from which representative leaked speech has been removed; 22, a second representative noise spectrum memory that stores the power spectra of a plurality of representative superimposed noise patterns for the noise-superimposed speech; 23, first noise spectrum selection means that calculates the distance values between the power spectrum of the leaked-speech-removed noise pattern output from the leaked speech removal means 9 and the power spectra of the plurality of representative noise patterns stored in the first representative noise spectrum memory 21, selects from the first representative noise spectrum memory 21 the representative noise pattern giving the shortest distance value, and outputs a signal identifying that representative noise pattern in time series; 24, second noise spectrum selection means that selects, from the second representative noise spectrum memory 22, the power spectrum of the superimposed noise pattern corresponding to the representative noise pattern identification signal output from the first noise spectrum selection means 23 and outputs it in time series; and the denoised speech spectrum computing means, which subtracts the power spectrum of the superimposed noise pattern selected by the second noise spectrum selection means 24 from the power spectrum of the noise-superimposed speech output from the noise-superimposed speech spectrum computing means 3 and outputs the power spectrum of the denoised speech in time series.
[0050]
Next, the operation will be described.
The operations of the components from the voice microphone 1 through the leaked speech removal means 9, and from the feature vector computing means 16 through the matching means 18, are the same as in the first embodiment, and their description is therefore omitted.
[0051]
The first representative noise spectrum memory 21 stores N power spectra of noise patterns from which leaked speech has been removed, where N is an appropriate number according to the types of noise pattern assumed; they are obtained by prior learning using noise sections. The second representative noise spectrum memory 22 stores the power spectra of the superimposed noise patterns corresponding to the N leaked-speech-removed noise patterns stored in the first representative noise spectrum memory 21, also obtained by prior learning using noise sections.
[0052]
Hereinafter, the learning method and the storage method of the power spectra of the leaked-speech-removed noise patterns and of the power spectra of the superimposed noise patterns will be described. In a noise section, what the voice microphone 1 outputs in analysis frame j is the superimposed noise component that is superimposed on the noise-superimposed speech, and its power spectrum X1j(ω)noise is expressed by the following equation (21). Since a noise section contains no speech, equation (21) is derived by deleting the first term of equation (7). X1j(ω)noise = G21,j(ω)·Nj(ω) (21) This is the power spectrum of the superimposed noise pattern superimposed on the noise-superimposed speech, and is defined as Y1j(ω). Y1j(ω) = G21,j(ω)·Nj(ω) (22)
[0053]
If the power spectrum Y1j(ω) of the superimposed noise pattern superimposed on the noise-superimposed speech in analysis frame j can be estimated, noise removal can be performed by subtracting the estimated Y1j(ω) from the power spectrum of the noise-superimposed speech. Therefore, in order to estimate Y1j(ω) from the power spectrum Y2j(ω) of the leaked-speech-removed noise pattern, the mapping from the power spectrum Y2j(ω) of the leaked-speech-removed noise pattern to the power spectrum Y1j(ω) of the superimposed noise pattern is learned by the following procedure.
[0054]
In analysis frame j, the noise pattern determined by the combination Ω(j) = {N1j, N2j, ..., NKj} of the K noises output by the K noise sources is unknown; however, if Ω(j1) = Ω(j2) for analysis frames j1 and j2, the power spectra relating to the noise patterns from which leaked speech has been removed can be regarded as equal, that is, Y2j1(ω) = Y2j2(ω). Therefore, the power spectra Y2j(ω) relating to the noise patterns from which leaked speech has been removed, output in time series, are clustered into an appropriate number N of classes. Clustering is performed such that the evaluation function D expressed by the following equation (23) is minimized. In equation (23), Y2n(ω) is the centroid of class n, Θ(n) is the set of time-series numbers of the elements of class n, and dis(X, Y) is a function that returns the distance value between power spectrum X and power spectrum Y. The centroid Y2n(ω) of each class is derived using equation (17). After clustering ends, the N centroids Y2n(ω) are output as the power spectra relating to representative noise patterns from which leaked speech has been removed, and stored in the first representative noise spectrum memory 21.
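As an illustrative sketch only (not a reproduction of the patented procedure), the clustering that minimizes equation (23) can be realized with a simple k-means-style loop over the frame-wise power spectra; the array names and the squared-Euclidean distance below are assumptions.

```python
import numpy as np

def cluster_noise_spectra(Y2, n_classes, n_iter=20, seed=0):
    """Cluster time-series power spectra Y2 (shape: frames x bins) into
    n_classes classes by minimizing the sum of distances to class centroids,
    in the spirit of equation (23). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    centroids = Y2[rng.choice(len(Y2), n_classes, replace=False)].copy()
    labels = np.zeros(len(Y2), dtype=int)
    for _ in range(n_iter):
        # Assign each frame to the nearest centroid (squared Euclidean distance assumed).
        dists = ((Y2[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its class members (cf. equation (17)).
        for n in range(n_classes):
            members = Y2[labels == n]
            if len(members) > 0:
                centroids[n] = members.mean(axis=0)
    return centroids, labels
```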
[0055]
Similarly, the power spectra Y1j(ω) relating to the superimposed noise patterns output in time series from the voice microphone 1 are classified into N classes based on the clustering result for Y2j(ω), and the centroid of each class n (1 ≦ n ≦ N) is stored in the second representative noise spectrum memory 22 as the power spectrum Y1n(ω) relating to a representative superimposed noise pattern. The centroid Y1n(ω) of each class is derived from the following equation (24). In equation (24), Θ(n) is the set of time-series numbers of the elements of class n resulting from the clustering, performed by the clustering means, of the power spectra relating to the noise patterns from which leaked speech has been removed, and Mn is the number of elements of class n.
[0056]
As described above, the N spectra Y1n(ω) and Y2n(ω) are stored so as to correspond to the noise patterns divided into N classes. Based on the correspondence between the N pairs of Y2n(ω) and Y1n(ω), the power spectrum relating to the superimposed noise pattern superimposed on the noise superimposed speech can be derived for the power spectrum relating to the noise pattern from which leaked speech has been removed in any frame j. That is, although the number of noise patterns from which leaked speech has been removed, determined by the combination of the K noises output by the K noise sources, can be regarded as almost infinite, for any such noise pattern the most similar one is selected from the N noise patterns from which leaked speech has been removed stored in the first representative noise spectrum memory 21, and the power spectrum relating to the superimposed noise pattern corresponding to that most similar noise pattern is selected from the second representative noise spectrum memory 22 and used as the power spectrum relating to the superimposed noise pattern in frame j.
[0057]
The first noise spectrum selection means 23 calculates the distance values between the power spectrum relating to the noise pattern from which leaked speech has been removed, output from the leaked speech removal means 9, and each of the N power spectra relating to the noise patterns from which leaked speech has been removed stored in the first representative noise spectrum memory 21, selects from the first representative noise spectrum memory 21 the representative noise pattern giving the shortest distance value, and outputs a signal identifying that noise pattern. The number l(i) of the power spectrum relating to the noise pattern from which leaked speech has been removed that gives the shortest distance value in analysis frame i is derived using equation (25). In equation (25), dis(X, Y) is a function that returns the distance value between power spectrum X and power spectrum Y.
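Selecting the representative index l(i) of equation (25) amounts to a nearest-neighbour search over the stored representative spectra. A minimal sketch follows; the array names and the squared-Euclidean distance are assumptions for illustration.

```python
import numpy as np

def select_noise_class(Y2_i, representatives):
    """Return the index l(i) of the stored representative spectrum closest to
    the observed leaked-speech-removed noise spectrum Y2_i (cf. equation (25)).
    representatives: array of shape (N, bins); squared Euclidean distance assumed."""
    dists = ((representatives - Y2_i[None, :]) ** 2).sum(axis=1)
    return int(dists.argmin())
```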
[0058]
The second noise spectrum selection means 24 selects from the second representative noise spectrum memory 22 the power spectrum Y1l(i)(ω) relating to the superimposed noise pattern corresponding to the noise pattern identification signal output in time series from the first noise spectrum selection means 23, and outputs it in time series.
[0059]
The noise removal speech spectrum calculation means 25 subtracts the power spectrum relating to the superimposed noise pattern output from the second noise spectrum selection means 24 from the power spectrum relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3, and outputs the power spectrum S'i(ω) relating to the noise-removed speech in time series.
At this time, the power spectrum S'i(ω) relating to the noise-removed speech in analysis frame i is derived using the following equation (26). In equation (26), α is a parameter for adjusting the subtraction amount of the power spectrum relating to the superimposed noise pattern, and β is a parameter for setting the lower limit value of each frequency component of the power spectrum relating to the noise-removed speech, so as to prevent excessive subtraction of the power spectrum relating to the superimposed noise pattern. Also, max{} is a function that returns the largest element among the elements in the braces. S'i(ω) = max{X1i(ω) - αY1l(i)(ω), β} (26)
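A minimal sketch of the subtraction in equation (26), assuming the spectra are held as NumPy arrays and that the names alpha and beta are illustrative stand-ins for the scalar tuning parameters:

```python
import numpy as np

def spectral_subtract(X1_i, Y1_sel, alpha=1.0, beta=1e-3):
    """Equation (26): subtract alpha times the selected superimposed-noise
    spectrum from the noise-superimposed speech spectrum, flooring each
    frequency component at beta to avoid over-subtraction."""
    return np.maximum(X1_i - alpha * Y1_sel, beta)
```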
[0060]
Here, if the power spectrum relating to the superimposed noise pattern has been appropriately learned for the combination Ω(i) of the K noises output by the K noise sources in analysis frame i, then Y1l(i)(ω) = G21,i(ω)·Ni(ω). Substituting this and equation (7) into equation (26) with α = 1 gives S'i(ω) = G11(x(i), y(i))(ω)·Si(ω), so that the power spectrum of the speech from which noise has been removed can be obtained.
[0061]
FIG. 3 is a block diagram showing the processing procedure for obtaining the power spectrum relating to the noise-removed speech. As already stated, X1i(ω) is the power spectrum relating to the noise superimposed speech, X2i(ω) is the power spectrum relating to the noise pattern into which speech has leaked, Y2i(ω) is the power spectrum relating to the noise pattern from which the leaked speech has been removed, Y1l(i)(ω) is the power spectrum relating to the estimated superimposed noise pattern, S'i(ω) is the power spectrum relating to the noise-removed speech, and W12(ω) is a correction filter. As shown in FIG. 3, the power spectrum X1i(ω) relating to the noise superimposed speech, to which the correction filter W12(ω) has been applied, is subtracted from the power spectrum X2i(ω) relating to the noise pattern into which speech has leaked, so that the power spectrum Y2i(ω) relating to the noise pattern from which the leaked speech has been removed is obtained. Next, from the mapping relationship between Y2n(ω) and Y1n(ω) obtained by prior learning, the power spectrum Y1l(i)(ω) relating to the superimposed noise pattern corresponding to Y2i(ω) is estimated. Finally, by subtracting the estimated Y1l(i)(ω) from the power spectrum X1i(ω) relating to the noise superimposed speech, the power spectrum S'i(ω) relating to the noise-removed speech is obtained.
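Putting the steps of FIG. 3 together, a rough end-to-end sketch per analysis frame might look like the following; the function and array names are assumptions, and the direction of the leaked-speech subtraction follows the description of the leaked speech removal means.

```python
import numpy as np

def denoise_frame(X1_i, X2_i, W12, Y2_reps, Y1_reps, alpha=1.0, beta=1e-3):
    """One analysis frame of the FIG. 3 procedure (illustrative sketch).
    X1_i: power spectrum of the noise-superimposed speech (voice microphone)
    X2_i: power spectrum of the noise pattern into which speech has leaked (noise microphone)
    W12:  correction filter for the speech leaked into the noise microphone
    Y2_reps, Y1_reps: representative spectra learned beforehand (memories 21 and 22)."""
    # Remove the leaked speech from the noise microphone spectrum.
    Y2_i = np.maximum(X2_i - W12 * X1_i, 0.0)
    # Select the nearest representative leaked-speech-removed noise pattern (equation (25)).
    l_i = int(((Y2_reps - Y2_i[None, :]) ** 2).sum(axis=1).argmin())
    # Estimate the superimposed noise pattern and subtract it (equation (26)).
    return np.maximum(X1_i - alpha * Y1_reps[l_i], beta)
```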
[0062]
As described above, according to the second embodiment, since the leaked speech removal means 9, the speaker position detection means 5, the speech correction correction filter memory 6 and the speech correction correction filter selection means 7 are provided, the same effects as in the first embodiment can be obtained. Furthermore, the apparatus is configured to comprise: the first representative noise spectrum memory 21 for storing a plurality of power spectra relating to noise patterns from which leaked speech has been removed; the second representative noise spectrum memory 22 for storing a plurality of power spectra relating to superimposed noise patterns; the first noise spectrum selection means 23 for calculating the distance values between the power spectrum relating to the noise pattern from which leaked speech has been removed and the power spectra relating to the plurality of noise patterns stored in the first representative noise spectrum memory 21, selecting from the first representative noise spectrum memory 21 the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series; and the second noise spectrum selection means 24 for selecting from the second representative noise spectrum memory 22 the power spectrum relating to the superimposed noise pattern corresponding to the noise pattern identification signal output from the first noise spectrum selection means 23 and outputting it in time series. Since a power spectrum relating to an appropriate superimposed noise pattern can thus be selected according to the noise pattern from which leaked speech has been removed, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, the speech recognition performance can be further improved even when the transfer characteristics of the voice microphone 1 and the noise microphone 2 change from moment to moment.
[0063]
Third Embodiment
In the speech recognition apparatus according to the second embodiment, learning relating to the correction filters and the power spectra of the noise patterns must be carried out in advance; therefore, in an environment containing noise patterns not included in the prior learning data, it is expected that noise removal cannot be performed accurately. The third embodiment is characterized in that it comprises learning means for performing the learning relating to the correction filters and the power spectra of the noise patterns in the environment where speech recognition is actually performed.
[0064]
FIG. 4 is a diagram showing the configuration of a speech recognition system according to a third embodiment of the present invention. In FIG. 4, the same reference numerals as in FIGS. 1 and 2 indicate the same or corresponding parts, and their description is therefore omitted. Reference numeral 31 denotes noise power level calculation means for calculating a noise power level from the noise pattern signal output from the noise microphone 2 and outputting the noise power level in time series; 32 denotes voice section detection means for determining a voice section based on the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2, and outputting in time series an identification signal indicating whether or not the current frame is a voice section; 33 denotes noise section detection means for determining a noise section based on the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2, and outputting in time series an identification signal indicating whether or not the current frame is a noise section; 34 denotes correction filter learning determination means for outputting in time series an identification signal indicating that the correction filter is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or less than a threshold and the identification signal output from the voice section detection means 32 indicates a voice section; 35 denotes noise spectrum learning determination means for outputting in time series an identification signal indicating that the noise spectrum is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or greater than a threshold and the identification signal output from the noise section detection means 33 indicates a noise section; 36 denotes correction filter learning means for learning, when the identification signal output from the correction filter learning determination means 34 indicates that the correction filter is to be learned, the correction filter corresponding to the speaker position data output from the speaker position detection means 5 based on the power spectrum relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3 and the power spectrum relating to the noise pattern output from the noise spectrum calculation means 4, and outputting the correction filter; 37 denotes first noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, the power spectra relating to representative noise patterns from which leaked speech has been removed based on the power spectra relating to the noise patterns from which leaked speech has been removed output from the leaked speech removal means 9, and outputting these power spectra; and 38 denotes second noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, the power spectra relating to the superimposed noise patterns corresponding to the noise patterns from which representative leaked speech has been removed output from the first noise spectrum learning means 37, based on the power spectra relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3, and outputting these power spectra.
[0065]
FIG. 5 is a diagram showing the internal configuration of the first noise spectrum learning means. In FIG. 5, reference numeral 41 denotes a first noise spectrum memory for storing a plurality of power spectra relating to the noise patterns from which leaked speech has been removed, output from the leaked speech removal means 9; 42 denotes first clustering means for performing clustering on the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputting the power spectra corresponding to the centroids in the clustering result as the power spectra relating to representative noise patterns from which leaked speech has been removed.
[0066]
FIG. 6 is a diagram showing the internal configuration of the second noise spectrum learning means. In FIG. 6, reference numeral 43 denotes a second noise spectrum memory for storing a plurality of power spectra relating to superimposed noise patterns, output from the noise superimposed speech spectrum calculation means 3 in the same analysis frames as the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41; 44 denotes second clustering means for performing clustering on the power spectra relating to the plurality of superimposed noise patterns stored in the second noise spectrum memory 43 based on the clustering result of the first clustering means 42, and outputting the power spectra corresponding to the centroids in the clustering result as the power spectra relating to representative superimposed noise patterns.
[0067]
Next, the operation will be described. The operations relating to the components from the voice microphone 1 through the leaked speech removal means 9, the operations relating to the feature vector calculation means 16 through the verification means 18, and the operations relating to the first representative noise spectrum memory 21 through the noise removal speech spectrum calculation means 25 are the same as in the second embodiment, and their description is therefore omitted.
[0068]
The noise power level calculation means 31 calculates the noise power level from the noise pattern signal output from the noise microphone 2 and outputs the noise power level in time series. Assuming that the noise pattern signal output from the noise microphone 2 at time t is x2(t), the noise power level LEVi in analysis frame i can be derived from the following equation (27). In equation (27), x2(t) is the noise pattern signal output by the noise microphone 2 at time t, M is the shift amount of the analysis frame, and L is the number of samples in one analysis frame.
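Equation (27) itself is not reproduced in this text, but a frame power computation consistent with the stated symbols (frame shift M, frame length L) might be sketched as follows; the mean-square form is an assumption.

```python
import numpy as np

def noise_power_level(x2, i, M, L):
    """Illustrative noise power level LEVi for analysis frame i of the noise
    signal x2, using frame shift M and frame length L (symbols of equation (27))."""
    frame = x2[i * M : i * M + L]
    return float(np.sum(frame.astype(float) ** 2) / L)
```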
[0069]
The voice section detection means 32 determines a voice section from the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2, and outputs in time series an identification signal indicating whether or not the current frame is a voice section. Whether or not analysis frame i is a voice section is determined by whether or not the following equation (28) is satisfied. In equation (28), P1i is the power of the noise superimposed voice signal in analysis frame i, P2i is the power of the noise pattern signal in analysis frame i, and THv is a threshold value for voice section determination.
[0070]
The noise section detection means 33 determines a noise section based on the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2, and outputs in time series an identification signal indicating whether or not the current frame is a noise section. Whether or not analysis frame i is a noise section is determined by whether or not the following equation (29) is satisfied. In equation (29), P1i is the power of the noise superimposed voice signal in analysis frame i, P2i is the power of the noise pattern signal in analysis frame i, and THn is a threshold value for noise section determination.
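Equations (28) and (29) are not reproduced in this text; a plausible two-microphone decision using the stated quantities P1i, P2i, THv and THn compares the voice-to-noise power ratio against the thresholds, as sketched below. The ratio form is an assumption made for illustration only.

```python
def is_voice_section(P1_i, P2_i, TH_v):
    # Treat frame i as a voice section when the voice-microphone power
    # sufficiently exceeds the noise-microphone power (assumed form of eq. (28)).
    return P1_i > TH_v * P2_i

def is_noise_section(P1_i, P2_i, TH_n):
    # Treat frame i as a noise section when the two powers are comparable,
    # i.e. no speech appears to be present (assumed form of eq. (29)).
    return P1_i <= TH_n * P2_i
```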
[0071]
The correction filter learning determination means 34 outputs in time series an identification signal indicating that the correction filter is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or less than the threshold and the identification signal output from the voice section detection means 32 indicates a voice section. That is, an identification signal indicating that the correction filter is to be learned is output in time series in voice sections uttered in an environment where the noise power level of the background noise is small and the influence of the background noise can be ignored.
[0072]
The noise spectrum learning determination means 35 outputs in time series an identification signal indicating that the noise spectrum is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or greater than the threshold and the identification signal output from the noise section detection means 33 indicates a noise section. That is, an identification signal indicating that the noise spectrum is to be learned is output in time series in noise sections where the noise power level of the background noise is large and no speech is uttered.
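For illustration, the gating performed by means 34 and 35 reduces to two boolean conditions; whether the two determinations share a single threshold is not stated here, so separate parameters are assumed below.

```python
def should_learn_correction_filter(LEV_i, voice_section, TH_quiet):
    # Means 34: learn the correction filter in quiet voice sections.
    return LEV_i <= TH_quiet and voice_section

def should_learn_noise_spectrum(LEV_i, noise_section, TH_loud):
    # Means 35: learn the noise spectra in loud noise-only sections.
    return LEV_i >= TH_loud and noise_section
```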
[0073]
The correction filter learning means 36 learns, when the identification signal output from the correction filter learning determination means 34 indicates that the correction filter is to be learned, the correction filter W12(x(i), y(i))(ω) corresponding to the speaker position data (x(i), y(i)) output from the speaker position detection means 5, based on the power spectrum relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3 and the power spectrum relating to the noise pattern output from the noise spectrum calculation means 4, and outputs the correction filter. The learned correction filter is stored in the speech correction correction filter memory 6. When speech is uttered in an environment where the background noise can be ignored, the power spectrum X1j(ω) relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3 in analysis frame j and the power spectrum X2j(ω) relating to the noise pattern output from the noise spectrum calculation means 4 can be expressed by the following equation (30). Equation (30) is derived by deleting the second terms of equations (7) and (8) under the assumption that the background noise is negligible. Therefore, the filter W12(x(j), y(j))(ω) for correcting the difference in the frequency characteristics relating to the transfer characteristics of the voice microphone 1 and the noise microphone 2 at the speaker position (x(j), y(j)) is derived using the following equation (31).
[0074]
The first noise spectrum learning means 37 learns, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, the power spectra relating to representative noise patterns from which leaked speech has been removed, based on the power spectra relating to the noise patterns from which leaked speech has been removed output from the leaked speech removal means 9, and outputs these power spectra. The learned power spectra relating to the representative noise patterns from which leaked speech has been removed are stored in the first representative noise spectrum memory 21. The first noise spectrum learning means 37 comprises the first noise spectrum memory 41 and the first clustering means 42.
[0075]
The first noise spectrum memory 41 stores a plurality of power spectra relating to the noise patterns from which leaked speech has been removed, output from the leaked speech removal means 9.
[0076]
The first clustering means 42 performs clustering on the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputs the power spectra corresponding to the centroids in the clustering result as the power spectra relating to representative noise patterns from which leaked speech has been removed.
Clustering is performed such that the evaluation function D expressed by equation (32) is minimized. In equation (32), N is the number of classes, Y2n(ω) is the centroid of class n, Θ(n) is the set of time-series numbers of the elements of class n, i is the time-series number of a power spectrum relating to a noise pattern from which leaked speech has been removed stored in the first noise spectrum memory 41, and dis(X, Y) is a function that returns the distance value between power spectrum X and power spectrum Y. The centroid Y2n(ω) of each class is derived using equation (17). After clustering ends, the N centroids Y2n(ω) are output as the power spectra relating to representative noise patterns from which leaked speech has been removed, and stored in the first representative noise spectrum memory 21.
[0077]
The second noise spectrum learning means 38 learns, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, the power spectra relating to the superimposed noise patterns corresponding to the power spectra relating to the representative noise patterns from which leaked speech has been removed output from the first noise spectrum learning means 37, based on the power spectra relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3, and outputs these power spectra. The learned power spectra relating to the representative superimposed noise patterns are stored in the second representative noise spectrum memory 22. The second noise spectrum learning means 38 comprises the second noise spectrum memory 43 and the second clustering means 44.
[0078]
The second noise spectrum memory 43 stores a plurality of power spectra relating to superimposed noise patterns, output in the same analysis frames as the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41. In a noise section, the power spectrum relating to the noise superimposed speech output by the voice microphone 1 in analysis frame j is expressed by the following equation (33), which can be derived by deleting the first term of equation (7) because a noise section contains no speech. X1j(ω) = G21,j(ω)·Nj(ω) (33) This is the power spectrum relating to the superimposed noise pattern superimposed on the noise superimposed speech, and it is defined as Y1j(ω) in the same way as before. Y1j(ω) = G21,j(ω)·Nj(ω) (34) That is, the second noise spectrum memory 43 stores the power spectrum Y1i(ω) relating to the superimposed noise pattern output from the noise superimposed speech spectrum calculation means 3 in the same analysis frame i in which each power spectrum Y2i(ω) relating to a noise pattern from which leaked speech has been removed, stored in the first noise spectrum memory 41, was output.
[0079]
The second clustering means 44 performs clustering on the power spectra relating to the plurality of superimposed noise patterns stored in the second noise spectrum memory 43 based on the clustering result of the first clustering means 42, and outputs the power spectra corresponding to the centroids in the clustering result as the power spectra relating to representative superimposed noise patterns. The centroid Y1n(ω) of each class is derived using equation (24). After clustering ends, the N centroids Y1n(ω) are output as the power spectra relating to representative superimposed noise patterns, and stored in the second representative noise spectrum memory 22.
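As an illustrative sketch (array names assumed), the second clustering means reuses the class assignments obtained for the leaked-speech-removed spectra and simply averages the superimposed-noise spectra of each class, in the spirit of the centroid computation of equation (24):

```python
import numpy as np

def superimposed_noise_centroids(Y1, labels, n_classes):
    """Average the superimposed-noise spectra Y1 (frames x bins) within each
    class given by `labels` (the clustering result obtained for the Y2 spectra),
    yielding the representative spectra Y1n stored in memory 22."""
    centroids = np.zeros((n_classes, Y1.shape[1]))
    for n in range(n_classes):
        members = Y1[labels == n]
        if len(members) > 0:
            centroids[n] = members.mean(axis=0)   # Mn elements averaged per class
    return centroids
```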
[0080]
As described above, the N spectra Y1n(ω) and Y2n(ω) are stored so as to correspond to the noise patterns classified into N classes, and based on the correspondence between the N pairs of Y2n(ω) and Y1n(ω), the power spectrum relating to the superimposed noise pattern superimposed on the noise superimposed speech can be derived for the power spectrum relating to the noise pattern from which leaked speech has been removed in any frame j.
[0081]
As described above, according to the third embodiment, the apparatus is configured to comprise: the noise power level calculation means 31 that calculates the noise power level from the noise pattern signal output from the noise microphone 2 and outputs the noise power level in time series; the voice section detection means 32 that detects a voice section based on the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2 and outputs in time series a signal identifying whether or not the current frame is a voice section; the correction filter learning determination means 34 that outputs in time series an identification signal indicating that the correction filter is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or less than the threshold and the identification signal output from the voice section detection means 32 indicates a voice section; and the correction filter learning means 36 that, when the identification signal output from the correction filter learning determination means 34 indicates that the correction filter is to be learned, learns the correction filter corresponding to the speaker position data output from the speaker position detection means 5 based on the power spectrum relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3 and the power spectrum relating to the noise pattern output from the noise spectrum calculation means 4, and outputs the correction filter. Therefore, even when an utterance is made at a speaker position that could not be covered by the prior learning, the power spectrum relating to the noise superimposed speech can be correctly corrected and the leaked speech can be accurately removed from the power spectrum relating to the noise pattern into which speech has leaked, so that the speech recognition performance can be improved.
[0082]
The apparatus is also configured to comprise: the noise power level calculation means 31 that calculates the noise power level from the noise pattern signal output from the noise microphone 2 and outputs the noise power level in time series; the noise section detection means 33 that detects a noise section based on the noise superimposed voice signal output from the voice microphone 1 and the noise pattern signal output from the noise microphone 2 and outputs in time series a signal identifying whether or not the current frame is a noise section; the noise spectrum learning determination means 35 that outputs in time series an identification signal indicating that the noise spectrum is to be learned when the noise power level output from the noise power level calculation means 31 is equal to or greater than the threshold and the identification signal output from the noise section detection means 33 indicates a noise section; the first noise spectrum learning means 37 that, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, learns the power spectra relating to representative noise patterns from which leaked speech has been removed based on the power spectra relating to the noise patterns from which leaked speech has been removed output from the leaked speech removal means 9, and outputs these power spectra; and the second noise spectrum learning means 38 that, when the identification signal output from the noise spectrum learning determination means 35 indicates that the noise spectrum is to be learned, learns the power spectra relating to the superimposed noise patterns corresponding to the power spectra relating to the representative noise patterns from which leaked speech has been removed output from the first noise spectrum learning means 37, based on the power spectra relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means 3, and outputs these power spectra. Therefore, even when a noise pattern that could not be covered by the prior learning is superimposed on the speech, a power spectrum relating to an appropriate superimposed noise pattern can be selected according to the noise pattern from which leaked speech has been removed, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be further improved.
[0083]
Furthermore, the first noise spectrum learning means 37 comprises the first noise spectrum memory 41 for storing the power spectra relating to the plurality of noise patterns from which leaked speech has been removed output from the leaked speech removal means 9, and the first clustering means 42 for performing clustering on the plurality of power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41 so as to minimize the total sum of the distance values between the centroid of each class and the power spectra of the noise patterns included in that class, and outputting the centroid of each class as the power spectrum relating to a representative noise pattern from which leaked speech has been removed; and the second noise spectrum learning means 38 comprises the second noise spectrum memory 43 for storing the power spectra relating to the plurality of superimposed noise patterns output in the same analysis frames as the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and the second clustering means 44 for performing clustering on the power spectra relating to the plurality of superimposed noise patterns stored in the second noise spectrum memory 43 so as to reflect the clustering result of the first clustering means 42, and outputting the centroid of each class as the power spectrum relating to a representative superimposed noise pattern. Therefore, appropriate clustering is performed on the noise patterns from which leaked speech has been removed so as to minimize the total sum of the distance values between the centroid of each class and the power spectra included in that class, and the centroid of each class is stored as a representative power spectrum both for the noise pattern from which leaked speech has been removed and for the superimposed noise pattern, so that the mapping relationship between the power spectra relating to the noise patterns from which leaked speech has been removed and the power spectra relating to the superimposed noise patterns can be precisely learned. The power spectrum relating to the superimposed noise pattern can thus be accurately removed from the power spectrum relating to the noise superimposed speech, and the speech recognition performance can be further improved.
[0084]
Fourth Embodiment
In the speech recognition apparatuses according to the second and third embodiments, the mapping relationship between the noise patterns from which leaked speech has been removed and the superimposed noise patterns is learned by simple clustering. Therefore, when the fluctuation of the noise power level is large, for example, there is a risk that a mapping relationship having resolution only in the noise intensity direction and no resolution in the noise type direction is learned, and as a result noise cannot be removed accurately.
The speech recognition apparatus according to the fourth embodiment is therefore characterized in that the mapping relationship between the noise patterns from which leaked speech has been removed and the superimposed noise patterns is learned more precisely by raising the clustering accuracy.
[0085]
FIG. 7 is a diagram showing the internal configuration of the first noise spectrum learning means in a speech recognition system according to a fourth embodiment of the present invention. In FIG. 7, the same reference numerals as in FIG. 5 denote the same or corresponding parts, and their description is therefore omitted.
Reference numeral 51 denotes spectrum outline parameter calculation means for calculating parameters representing the outline of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputting the parameters; 52 denotes spectrum intensity parameter calculation means for calculating a parameter representing the intensity of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputting the parameter; 53 denotes weighting clustering means for clustering the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, using distance values calculated by applying weights to the parameters representing the outline of the power spectrum output from the spectrum outline parameter calculation means 51 and the parameter representing the intensity of the power spectrum output from the spectrum intensity parameter calculation means 52, and outputting the power spectra relating to representative noise patterns from which leaked speech has been removed.
[0086]
Next, the operation will be described. The spectrum outline parameter calculation means 51 calculates parameters representing the outline of the power spectrum from the power spectrum relating to the noise pattern from which leaked speech has been removed, output from the leaked speech removal means 9, and outputs them in time series. Specifically, the cepstrum Ci(p) of Y2i(ω) is obtained from equation (35), and Ci(p) (1 ≦ p ≦ P) is used as the parameters representing the outline of the power spectrum, where P is the order of the cepstrum. In equation (35), F-1 denotes the inverse FFT. Ci(p) = F-1(ln(Y2i(ω))) (35)
[0087]
The spectrum intensity parameter calculation means 52 calculates a parameter representing the intensity of the power spectrum from the power spectrum relating to the noise pattern from which leaked speech has been removed, output from the leaked speech removal means 9, and outputs it in time series. Specifically, the cepstrum Ci(p) of Y2i(ω) is obtained from equation (35), and Ci(0) is used as the parameter representing the intensity of the power spectrum.
[0088]
The weighting clustering means 53 clusters the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, using distance values calculated by applying weights to the parameters representing the outline of the power spectrum output from the spectrum outline parameter calculation means 51 and the parameter representing the intensity of the power spectrum output from the spectrum intensity parameter calculation means 52, and outputs the power spectra relating to representative noise patterns from which leaked speech has been removed. Clustering is performed such that the evaluation function D expressed by the following equation (36) is minimized. In equation (36), Cn(p) is the centroid of class n, Θ(n) is the set of time-series numbers of the elements of class n, and dis(X, Y) is a function that returns the distance value between cepstrum X and cepstrum Y over the specified range of orders. W is a weighting factor that determines the ratio of the contributions to the overall distance value of the parameters representing the outline of the power spectrum and the parameter representing the intensity of the power spectrum. The centroid Y2n(ω) of each class is derived using equation (17). After clustering ends, the N centroids Y2n(ω) are output as the power spectra relating to representative noise patterns from which leaked speech has been removed, and stored in the first representative noise spectrum memory 21.
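The weighted distance underlying equation (36) can be sketched as follows; the squared-difference form and the way the weight W scales the intensity term are assumptions, and only the split into outline terms C(1..P) and intensity term C(0) follows the text.

```python
import numpy as np

def weighted_cepstral_distance(c0_x, cx, c0_y, cy, W):
    """Distance between two cepstral parameter sets: cx, cy are the outline
    terms C(1..P), c0_x, c0_y the intensity terms C(0); W balances the two
    contributions in the spirit of equation (36)."""
    outline_term = float(np.sum((cx - cy) ** 2))
    intensity_term = (c0_x - c0_y) ** 2
    return outline_term + W * intensity_term
```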
[0089]
As described above, according to the fourth embodiment, the apparatus is configured to comprise: the spectrum outline parameter calculation means 51 for calculating parameters representing the outline of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputting the parameters; the spectrum intensity parameter calculation means 52 for calculating a parameter representing the intensity of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, and outputting the parameter; and the weighting clustering means 53 for clustering the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory 41, using distance values calculated by applying weights to the parameters representing the outline of the power spectrum output from the spectrum outline parameter calculation means 51 and the parameter representing the intensity of the power spectrum output from the spectrum intensity parameter calculation means 52, and outputting the power spectra relating to representative noise patterns from which leaked speech has been removed. Therefore, even in an environment where the noise power level fluctuates sharply, more precise clustering becomes possible by adjusting the weight applied to the parameter representing the noise intensity, the mapping relationship between the power spectra relating to the noise patterns from which leaked speech has been removed and the power spectra relating to the superimposed noise patterns can be learned more precisely, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be further improved.
[0090]
As described above, according to the present invention, the apparatus is configured to comprise noise superimposed speech spectrum correction means for correcting the power spectrum relating to the noise superimposed speech using a filter that corrects the difference between the frequency characteristics relating to the transfer characteristics of the voice microphone and the noise microphone with respect to speech, and outputting the power spectrum relating to the corrected noise superimposed speech in time series, and leaked speech removal means for subtracting the power spectrum relating to the corrected noise superimposed speech from the power spectrum relating to the noise pattern into which speech has leaked, and outputting in time series the power spectrum relating to the noise pattern from which the leaked speech has been removed. Therefore, even when speech leaks into the noise microphone, the leaked speech can be removed from the noise pattern and the noise pattern can be removed from the noise superimposed speech, so that the speech recognition performance can be improved.
[0091]
According to the present invention, the apparatus is configured to comprise speaker position detection means for detecting the position of the speaker with a sensor and outputting the position data in time series, a speech correction correction filter memory for storing a plurality of correction filters for correcting the difference between the frequency characteristics relating to the transfer characteristics of the voice microphone and the noise microphone with respect to speech, and speech correction correction filter selection means for selecting from the speech correction correction filter memory the correction filter corresponding to the speaker position data output from the speaker position detection means and outputting it in time series to the noise superimposed speech spectrum correction means. Therefore, an appropriate correction filter is selected according to the speaker position, the power spectrum relating to the leaked speech can be accurately removed from the power spectrum relating to the noise pattern into which speech has leaked, and noise removal from the noise superimposed speech can be performed accurately, so that the speech recognition performance can be further improved.
[0092]
According to the present invention, the apparatus is configured to comprise a noise correction correction filter memory for storing a plurality of correction filters for correcting the difference between the frequency characteristics relating to the transfer characteristics of the voice microphone and the noise microphone with respect to noise patterns, a representative noise spectrum memory for storing the power spectra relating to the noise patterns respectively corresponding to the plurality of correction filters stored in the noise correction correction filter memory, noise spectrum selection means for calculating the distance values between the power spectrum relating to the collected noise pattern and the power spectra relating to the plurality of noise patterns stored in the representative noise spectrum memory, selecting from the representative noise spectrum memory the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series, and noise correction correction filter selection means for selecting from the noise correction correction filter memory the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means and outputting it in time series. Therefore, an appropriate correction filter is selected according to the noise pattern collected by the noise microphone and the power spectrum relating to the noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be improved.
[0093]
According to the present invention, the apparatus is configured to comprise a noise correction correction filter memory for storing a plurality of correction filters for correcting the difference between the frequency characteristics relating to the transfer characteristics of the voice microphone and the noise microphone with respect to noise patterns, a representative noise spectrum memory for storing the power spectra relating to the noise patterns respectively corresponding to the plurality of correction filters stored in the noise correction correction filter memory, noise spectrum selection means for calculating the distance values between the power spectrum relating to the noise pattern from which leaked speech has been removed and the power spectra relating to the plurality of noise patterns stored in the representative noise spectrum memory, selecting from the representative noise spectrum memory the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series, and noise correction correction filter selection means for selecting from the noise correction correction filter memory the correction filter corresponding to the noise pattern identification signal output from the noise spectrum selection means and outputting it in time series to the leaked speech removal noise spectrum correction means. Therefore, an appropriate correction filter is selected according to the noise pattern from which the leaked speech has been removed and the power spectrum relating to the noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be improved.
[0094]
According to the present invention, the apparatus is configured to comprise a first representative noise spectrum memory for storing a plurality of power spectra relating to noise patterns from which leaked speech has been removed, a second representative noise spectrum memory for storing the power spectra relating to the plurality of superimposed noise patterns respectively corresponding to the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first representative noise spectrum memory, first noise spectrum selection means for calculating the distance values between the power spectrum relating to the noise pattern from which leaked speech has been removed and the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first representative noise spectrum memory, selecting from the first representative noise spectrum memory the noise pattern giving the shortest distance value, and outputting a signal identifying that noise pattern in time series, and second noise spectrum selection means for selecting from the second representative noise spectrum memory the power spectrum relating to the superimposed noise pattern corresponding to the noise pattern identification signal output from the first noise spectrum selection means and outputting it in time series. Therefore, a power spectrum relating to an appropriate superimposed noise pattern is selected according to the noise pattern from which leaked speech has been removed, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be further improved even when the frequency characteristics relating to the transfer characteristics of the voice microphone and the noise microphone with respect to noise patterns change from moment to moment.
[0095]
According to the present invention, the apparatus is configured to comprise noise power level calculation means for calculating a noise power level from the noise pattern signal output from the noise microphone and outputting the noise power level in time series, voice section detection means for determining a voice section based on the noise superimposed voice signal output from the voice microphone and the noise pattern signal output from the noise microphone and outputting in time series an identification signal indicating whether or not the current frame is a voice section, correction filter learning determination means for outputting in time series an identification signal indicating that the correction filter is to be learned when the noise power level output from the noise power level calculation means is equal to or less than the threshold and the identification signal output from the voice section detection means indicates a voice section, and correction filter learning means for learning, when the identification signal output from the correction filter learning determination means indicates that the correction filter is to be learned, the correction filter corresponding to the speaker position data output from the speaker position detection means based on the power spectrum relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means and the power spectrum relating to the noise pattern output from the noise spectrum calculation means, and outputting the correction filter. Therefore, even when an utterance is made at a speaker position that could not be covered by the prior learning, the power spectrum relating to the noise superimposed speech can be correctly corrected and noise removal from the power spectrum relating to the noise pattern into which speech has leaked can be performed correctly, so that the speech recognition performance can be improved.
[0096]
According to the present invention, the apparatus is configured to comprise noise power level calculation means for calculating a noise power level from the noise pattern signal output from the noise microphone and outputting the noise power level in time series, noise section detection means for determining a noise section based on the noise superimposed voice signal output from the voice microphone and the noise pattern signal output from the noise microphone and outputting in time series an identification signal indicating whether or not the current frame is a noise section, noise spectrum learning determination means for outputting in time series an identification signal indicating that the noise spectrum is to be learned when the noise power level output from the noise power level calculation means is equal to or greater than the threshold and the identification signal output from the noise section detection means indicates a noise section, first noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means indicates that the noise spectrum is to be learned, the power spectra relating to representative noise patterns from which leaked speech has been removed based on the power spectra relating to the noise patterns from which leaked speech has been removed output from the leaked speech removal means, and outputting these power spectra, and second noise spectrum learning means for learning, when the identification signal output from the noise spectrum learning determination means indicates that the noise spectrum is to be learned, the power spectra relating to the superimposed noise patterns corresponding to the power spectra relating to the representative noise patterns from which leaked speech has been removed output from the first noise spectrum learning means, based on the power spectra relating to the noise superimposed speech output from the noise superimposed speech spectrum calculation means, and outputting these power spectra. Therefore, even when a noise pattern that could not be covered by the prior learning is superimposed on the speech, a power spectrum relating to an appropriate superimposed noise pattern can be selected according to the noise pattern from which leaked speech has been removed, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be further improved.
[0097]
According to the present invention, the first noise spectrum learning means is configured to comprise a first noise spectrum memory for storing the power spectra relating to the plurality of noise patterns from which leaked speech has been removed output from the leaked speech removal means, and first clustering means for performing clustering on the plurality of power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory so as to minimize the total sum of the distance values between the centroid of each class and the power spectra of the noise patterns included in that class, and outputting the centroid of each class as the power spectrum relating to a representative noise pattern from which leaked speech has been removed; and the second noise spectrum learning means is configured to comprise a second noise spectrum memory for storing the power spectra relating to the plurality of superimposed noise patterns output in the same analysis frames as the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory, and second clustering means for performing clustering on the power spectra relating to the plurality of superimposed noise patterns stored in the second noise spectrum memory so as to reflect the clustering result of the first clustering means, and outputting the centroid of each class as the power spectrum relating to a representative superimposed noise pattern. Therefore, appropriate clustering is performed on the noise patterns from which leaked speech has been removed so as to minimize the total sum of the distance values between the centroid of each class and the power spectra included in that class, and the centroid of each class is stored as a representative power spectrum both for the noise pattern from which leaked speech has been removed and for the superimposed noise pattern, so that the mapping relationship between the power spectra relating to the noise patterns from which leaked speech has been removed and the power spectra relating to the superimposed noise patterns can be precisely learned; the power spectrum relating to the superimposed noise pattern can thus be accurately removed from the power spectrum relating to the noise superimposed speech, and the speech recognition performance can be further improved.
[0098]
According to the present invention, the apparatus is configured to comprise a first noise spectrum memory for storing the plurality of power spectra relating to the noise patterns from which leaked speech has been removed output from the leaked speech removal means, spectrum outline parameter calculation means for calculating parameters representing the outline of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory, and outputting the parameters, spectrum intensity parameter calculation means for calculating a parameter representing the intensity of the power spectrum from the power spectra relating to the noise patterns from which leaked speech has been removed stored in the first noise spectrum memory, and outputting the parameter, and weighting clustering means for clustering the power spectra relating to the plurality of noise patterns from which leaked speech has been removed stored in the first noise spectrum memory, using distance values calculated by applying weights to the parameters representing the outline of the power spectrum output from the spectrum outline parameter calculation means and the parameter representing the intensity of the power spectrum output from the spectrum intensity parameter calculation means, and outputting the power spectra relating to representative noise patterns from which leaked speech has been removed. Therefore, even in an environment where the noise power level fluctuates sharply, more precise clustering becomes possible by adjusting the weight applied to the parameter representing the noise intensity, the mapping relationship between the power spectra relating to the noise patterns from which leaked speech has been removed and the power spectra relating to the superimposed noise patterns can be learned more precisely, and the power spectrum relating to the superimposed noise pattern can be accurately removed from the power spectrum relating to the noise superimposed speech, so that the speech recognition performance can be further improved.
[0099]
Brief description of the drawings
[0100]
FIG. 1 is a diagram showing a configuration of a speech recognition device according to a first
embodiment of the present invention.
[0101]
FIG. 2 is a diagram showing the configuration of a speech recognition system according to a
second embodiment of the present invention.
[0102]
FIG. 3 is a block diagram showing a processing procedure for obtaining a power spectrum related
to the denoised speech.
[0103]
FIG. 4 is a diagram showing the configuration of a speech recognition system according to a third
embodiment of the present invention.
[0104]
FIG. 5 is a diagram showing an internal configuration of a first noise spectrum learning means.
[0105]
FIG. 6 is a diagram showing an internal configuration of second noise spectrum learning means.
[0106]
FIG. 7 is a diagram showing an internal configuration of a first noise spectrum learning means of
a speech recognition system according to a fourth embodiment of the present invention.
[0107]
FIG. 8 is a block diagram showing a configuration of a conventional speech recognition apparatus
that recognizes speech in an environment where there is noise or the like.
[0108]
FIG. 9 is a diagram showing the configuration of a conventional speech recognition apparatus
using a two-input SS method.
[0109]
Explanation of Reference Numerals
[0110]
1 voice microphone, 2 noise microphone, 3 noise superimposed speech spectrum calculation means, 4 noise spectrum calculation means, 5 speaker position detection means, 6 voice correction correction filter memory, 7 voice correction correction filter selection means, 8 noise superimposed speech spectrum correction means, 9 leaked speech removal means, 10 noise correction correction filter memory, 11 representative noise spectrum memory, 12 noise spectrum selection means, 13 noise correction correction filter selection means, 14 leaked speech removal noise spectrum correction means (noise spectrum correction means), 15, 25 noise removal speech spectrum calculation means, 16 feature vector calculation means, 17 reference pattern memory, 18 verification means, 21 first representative noise spectrum memory, 22 second representative noise spectrum memory, 23 first noise spectrum selection means, 24 second noise spectrum selection means, 31 noise power level calculation means, 32 voice section detection means, 33 noise section detection means, 34 correction filter learning determination means, 35 noise spectrum learning determination means, 36 correction filter learning means, 37 first noise spectrum learning means, 38 second noise spectrum learning means, 41 first noise spectrum memory, 42 first clustering means, 43 second noise spectrum memory, 44 second clustering means, 51 spectrum outline parameter calculation means, 52 spectrum intensity parameter calculation means, 53 weighting clustering means.