DESCRIPTION JP2010251937

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
PROBLEM TO BE SOLVED: To provide a voice processing apparatus capable of properly processing a
voice signal obtained in an environment where speech and noise are mixed. SOLUTION: The apparatus
comprises amplification means 12 for amplifying, based on a set gain value, an audio signal output
from a microphone in response to an utterance of a user; audio processing means 120 for processing
the audio signal amplified by the amplification means 12; peak value detecting means 130 for
detecting the peak value of the audio signal amplified by the amplification means 12 when the user
speaks; means 140 for generating speech peak value distribution information representing the
statistical distribution of the peak values detected by the peak value detecting means 130; and
gain value determination means 150 for determining the gain value to be set in the amplification
means 12 based on the speech peak value distribution information and a predetermined reference
peak value range. [Selected figure] Figure 1
Voice processing device
[0001]
The present invention relates to a voice processing device that amplifies a voice signal output
from a microphone in response to a user's speech based on a set gain value, and processes the
amplified voice signal according to a predetermined method such as voice recognition processing.
[0002]
Conventionally, there is a speech recognition apparatus (speech processing apparatus) that
10-04-2019
1
recognizes a speech by processing a speech signal output from a microphone in response to a
user's speech according to a speech recognition algorithm.
In this type of voice processing apparatus, the voice signal output from the microphone is
amplified based on the set gain value (amplification factor), and the amplified voice signal is
supplied to a voice recognition engine (voice processing means). It is generally known to provide
an AGC (Automatic Gain Control) circuit that keeps the level of the voice signal within a
predetermined level, in order to prevent the recognition rate from dropping when the level of the
voice signal supplied to the voice recognition engine exceeds that level (see, for example, Patent
Document 1).
[0003]
JP 2001-117585 A
[0004]
By the way, it is conceivable to use a voice processing device such as a voice recognition device
as a human interface (HI) of a vehicle-mounted device.
In this case, the voice of the occupant (user) is picked up by a microphone installed in the
cabin, but in addition to the user's voice, various noises are also input, such as the running
noise of the vehicle, the operating noise of the air conditioner, and the irregular traveling
noise of oncoming vehicles.
[0005]
The generally used AGC circuit described above is suitable for adjusting the signal level of the
speech to be recognized, but it is not necessarily suitable for adjusting the level of a speech
signal in which various noises and the intended speech are mixed. It is important to adjust an
audio signal in which such noises and the target speech are mixed, as a whole, to a certain level,
for example a level suitable for processing by a speech recognition engine.
[0006]
The present invention has been made in view of such circumstances, and provides a voice
processing apparatus capable of properly processing a voice signal obtained in an environment
where speech and noise coexist.
[0007]
An audio processing apparatus according to the present invention comprises amplifying means for
amplifying, based on a set gain value, an audio signal output from a microphone in response to a
user's speech, and speech processing means for processing the audio signal amplified by the
amplifying means according to a predetermined method. The apparatus further comprises: speech peak
value detecting means for detecting a peak value of the speech signal amplified by the amplifying
means when the user speaks; means for generating speech peak value distribution information
representing a statistical distribution of the peak values of the speech signal detected by the
speech peak value detecting means; and gain value determining means for determining the gain value
to be set in the amplifying means based on the speech peak value distribution information and a
predetermined reference peak value range.
[0008]
With such a configuration, the peak value of the audio signal amplified by the amplification
means is detected when the user speaks, and audio peak value distribution information
representing the statistical distribution of the peak values of the audio signal is generated.
Then, the gain value in the amplification means is determined based on the speech peak value
distribution information and the reference peak value range.
For example, the gain value in the amplification means can be determined such that the
distribution range of the peak value of the audio signal represented by the audio peak value
distribution information approaches the reference peak value range.
[0009]
The statistical distribution of peak values of the audio signal is a distribution of peak values of
the detected audio signal, and may represent a detected frequency distribution of each peak
value of the audio signal. Further, the reference peak value range can be determined based on
the peak value range of the audio signal properly processed by the audio processing means for
processing the amplified audio signal.
[0010]
Further, the voice processing apparatus according to the present invention may have determination
means for determining whether or not the peak value of the audio signal detected by the audio peak
value detecting means is within the reference peak value range. When it is determined that the
peak value of the audio signal is within the reference peak value range, the gain value
determination means can be configured to maintain the already set gain value without determining a
new gain value based on the audio peak value distribution information and the reference peak value
range.
[0011]
With such a configuration, when the detected peak value of the audio signal is within the
reference peak value range, the audio signal level is regarded as appropriate and the gain value
of the amplification means is maintained.
[0012]
Further, in the voice processing device according to the present invention, the gain value
determining means may have means for determining whether or not the width of the peak value
distribution range represented by the voice peak value distribution information is equal to or
less than the width of the reference peak value range. When the width of the peak value
distribution range is determined to be equal to or less than the width of the reference peak value
range, the gain value may be determined based on the change in the gain value necessary to bring
each peak value of the peak value distribution range into the reference peak value range.
[0013]
With such a configuration, when the width of the statistical distribution range (peak value
distribution range) of the peak values of the amplified speech signal obtained so far is equal to
or less than the width of the reference peak value range, the speech signal is amplified with a
gain value determined based on the change in the gain value necessary to bring that peak value
distribution range into the reference peak value range. As a result, more of the peak values of
the audio signal newly output from the microphone and amplified by the amplification means can be
made to fall within the reference peak value range.
[0014]
Furthermore, the voice processing device according to the present invention can be configured to
have first voice peak value distribution information updating means that updates the voice peak
value distribution information so that it represents a statistical distribution of the peak values
of the voice signal in the peak value distribution range as brought within the reference peak
value range.
[0015]
With such a configuration, the audio peak value distribution information representing the
statistical distribution of the peak values of the audio signal detected so far can be updated to
represent the statistical distribution of the peak values of the audio signal as amplified based
on the new gain value.
[0016]
Further, in the voice processing device according to the present invention, the gain value
determining means may determine the gain value by taking the difference between the peak value at
the middle of the peak value distribution range and the peak value at the middle of the reference
peak value range as the change in the gain value.
[0017]
With such a configuration, the gain value is determined based on the change in gain value
required to bring the peak value distribution range into the center of the reference peak value
range.
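
The midpoint rule in the two paragraphs above can be sketched in Python as follows. This is only an
illustration under stated assumptions: peak values and gain are taken to be in dB, so that a gain
change simply adds to the current gain, and the function name and all numeric values are
hypothetical rather than taken from the specification.

```python
def center_alignment_gain_change(y_min_db, y_max_db, d_min_db, d_max_db):
    """Gain change (dB) that moves the midpoint of the peak value distribution
    range [y_min_db, y_max_db] onto the midpoint of the reference peak value
    range [d_min_db, d_max_db]."""
    if (y_max_db - y_min_db) > (d_max_db - d_min_db):
        raise ValueError("distribution range is wider than the reference range")
    distribution_mid = (y_min_db + y_max_db) / 2.0
    reference_mid = (d_min_db + d_max_db) / 2.0
    return reference_mid - distribution_mid


# Hypothetical values: peaks spread over -16..-8 dB, reference range -14..-2 dB.
g_now_db = 20.0
delta_db = center_alignment_gain_change(-16.0, -8.0, -14.0, -2.0)
g_new_db = g_now_db + delta_db
```

With these sample values the distribution midpoint is -12 dB and the reference midpoint is -8 dB,
so the gain is raised by 4 dB.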
[0018]
Further, in the voice processing device according to the present invention, the gain value
determining means may have main range determining means that, when it is determined that the width
of the peak value distribution range is not equal to or less than the width of the reference peak
value range, determines from the peak value distribution range a detected peak value main range,
which is a range within the width of the reference peak value range in which the total frequency
of the peak values is the largest. The gain value may then be determined based on the change in
the gain value required to bring each peak value of the detected peak value main range within the
reference peak value range.
[0019]
With such a configuration, when the width of the statistical distribution range (peak value
distribution range) of the peak values of the amplified speech signal obtained so far is not equal
to or less than the width of the reference peak value range, the audio signal is amplified with a
gain value determined based on the change in the gain value necessary to bring into the reference
peak value range the range (detected peak value main range) in which, within the width of the
reference peak value range, the total frequency of the peak values is the largest in the peak
value distribution range. As a result, more of the peak values of the audio signal newly output
from the microphone and amplified by the amplification means can fall within the reference peak
value range.
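
One way to picture the search for the detected peak value main range is a sliding window over the
histogram bins. The Python sketch below is an assumption-laden illustration, not the method of the
specification: it assumes equal-width bins identified by their lower edges, a window exactly as
wide as the reference peak value range, and that the first window wins on ties. The function name
and the counts are hypothetical.

```python
def find_main_range(bin_lower_edges_db, counts, ref_width_db):
    """Return (low_db, high_db, total) for the window of width ref_width_db
    that contains the largest total frequency of peak values."""
    bin_width = bin_lower_edges_db[1] - bin_lower_edges_db[0]
    window = int(round(ref_width_db / bin_width))  # window size in bins
    if window < 1 or window > len(counts):
        raise ValueError("reference width does not fit the histogram")
    total = sum(counts[:window])
    best_start, best_total = 0, total
    for start in range(1, len(counts) - window + 1):
        # Slide by one bin: add the entering bin, drop the leaving bin.
        total += counts[start + window - 1] - counts[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    low = bin_lower_edges_db[best_start]
    return low, low + window * bin_width, best_total


# Illustrative 2 dB bins from -22 dB to -2 dB with made-up counts.
edges = [-22, -20, -18, -16, -14, -12, -10, -8, -6, -4]
counts = [1, 5, 8, 6, 7, 7, 7, 6, 4, 3]
main_range = find_main_range(edges, counts, ref_width_db=12)
```

For these made-up counts the best 12 dB window runs from -18 dB to -6 dB and holds 41 of the 54
peak values.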
[0020]
Furthermore, the speech processing device according to the present invention can be configured to
have second speech peak value distribution information updating means that updates the speech peak
value distribution information so that it represents a statistical distribution of the peak values
of the speech signal in the speech peak value distribution range that includes the detected peak
value main range as brought within the reference peak value range.
[0021]
With such a configuration, the audio peak value distribution information representing the
statistical distribution of the peak values of the audio signal detected so far can be updated to
represent the statistical distribution of the peak values of the audio signal as amplified based
on the new gain value.
[0022]
Further, in the voice processing device according to the present invention, the gain value
determining means may determine the gain value by taking the difference between the peak value at
a boundary of the detected peak value main range and the peak value at the corresponding boundary
of the reference peak value range as the change in the gain value.
[0023]
With such a configuration, the gain value is determined based on the change in gain value
required to make the boundary of the detected peak value main range in the peak value
distribution range coincide with the corresponding boundary of the reference peak value range.
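
The boundary rule above reduces to a single subtraction once both ranges are expressed in dB. The
following Python sketch assumes dB units and an additive gain change; the function name, the
`use_upper` switch, and the numbers are hypothetical.

```python
def boundary_alignment_gain_change(main_low_db, main_high_db,
                                   d_min_db, d_max_db, use_upper=True):
    """Gain change (dB) that makes one boundary of the detected peak value
    main range coincide with the corresponding boundary of the reference
    peak value range."""
    if use_upper:
        return d_max_db - main_high_db
    return d_min_db - main_low_db


# Hypothetical values: main range -18..-6 dB, reference range -14..-2 dB.
delta_upper_db = boundary_alignment_gain_change(-18.0, -6.0, -14.0, -2.0)
delta_lower_db = boundary_alignment_gain_change(-18.0, -6.0, -14.0, -2.0,
                                                use_upper=False)
```

When the main range has the same width as the reference range, as in this example, aligning the
upper boundary and aligning the lower boundary give the same change.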
[0024]
Further, in the voice processing device according to the present invention, the main range
determining means can be configured to determine, as the detected peak value main range, a range
in the peak value distribution range that has the same width as the reference peak value range and
in which the total frequency of the peak values is the largest.
[0025]
With such a configuration, the audio signal is amplified with a gain value determined based on the
change in the gain value necessary to bring into the above-mentioned reference peak value range
the range (detected peak value main range) that has the same width as the reference peak value
range and the largest total frequency of peak values within the statistical distribution range
(peak value distribution range) of the peak values of the amplified speech signal obtained so far.
As a result, more of the peak values of the audio signal newly output from the microphone and
amplified by the amplification means can fall within the reference peak value range.
[0026]
Further, in the voice processing apparatus according to the present invention, the voice peak
value detecting means may have sampling means for sampling the level of the voice signal amplified
by the amplification means when the user speaks, and peak value determining means for determining
the peak value of the voice signal based on the sampled voice signal levels. The sampling means
limits the sampled voice signal level to a predetermined value when the voice signal level exceeds
that predetermined value, and the peak value determining means can be configured to have
estimation means for estimating the peak value of the audio signal based on the sampled audio
signal levels when the sampled audio signal levels include the predetermined value.
[0027]
With such a configuration, even if the audio signal level sampled by the sampling means is
limited to the predetermined value, the peak value that the audio signal would originally have had
is estimated, so the estimated peak value can be used to determine a more accurate gain value.
[0028]
The aforementioned audio peak value distribution information may include a histogram
representing the frequency of each peak value of the detected audio signal.
[0029]
According to the speech processing apparatus of the present invention, the gain value in the
amplification means is determined based on speech peak value distribution information, which
represents a statistical distribution of the peak values of the speech signal amplified by the
amplification means when the user speaks, and on the predetermined reference peak value range. It
is therefore possible to adjust a speech signal in which various noises and the target speech are
mixed, as a whole, to the reference peak value range, and as a result to properly process an audio
signal obtained in an environment where speech and noise are mixed.
[0030]
It is a block diagram showing the composition of the speech processing unit concerning one
embodiment of the present invention.
It is a flow chart which shows a flow of basic processing of a speech processing unit.
It is a flowchart which shows the specific flow of the peak value calculation process in the
process shown in FIG.
It is a figure showing the presumed method of peak value.
It is a flowchart (part 1) which shows the specific flow of the gain calculation process in the
process shown in FIG.
It is a flowchart (part 2) which shows the specific flow of the gain calculation process in the
process shown in FIG.
It is a figure which shows an example of a histogram representing the statistical distribution of
the peak values of the detected audio signal.
FIG. 7 is a diagram showing a state in which the peak value distribution range of the histogram
shown in FIG. 6 is moved so as to be within the dynamic range (reference peak value range) of
the speech recognition engine.
It is a figure which shows another example of a histogram representing the statistical
distribution of the peak values of the detected audio signal.
It is a figure which shows the examination range for determining the detection peak value main
range in the histogram shown in FIG.
FIG. 9 is a diagram showing a state in which the histogram is moved so that the main range of
detected peak values in the histogram shown in FIG. 8 falls within the dynamic range (reference
peak value range) of the speech recognition engine.
It is a figure showing the table showing the gain value decided for every user, and the set of the
peak value of an audio signal.
[0031]
Embodiments of the present invention will be described using the drawings.
[0032]
A voice processing apparatus according to an embodiment of the present invention is configured
as shown in FIG.
The voice processing device is, for example, a voice recognition device used as a human interface
(HI) of a vehicle-mounted device.
[0033]
In FIG. 1, the speech recognition apparatus includes a processing unit 10, a microphone 11, a
microphone amplifier 12, and an A / D converter 13.
The microphone 11 is installed in the vehicle compartment and outputs an audio signal
corresponding to the audio input when the user (passenger) utters.
The microphone amplifier 12 amplifies the audio signal from the microphone 11 based on the
set gain value (amplification factor).
The A / D converter 13 converts the audio signal amplified by the microphone amplifier 12 into
a digital value (digital audio signal).
The processing unit 10 takes in the digital audio signal from the A / D converter 13 as audio
data.
[0034]
The processing unit 10 is, for example, a computer unit including a CPU, and has a voice data
storage unit (voice buffer) 110, a voice recognition engine 120 (voice processing means), a peak
value calculation processing unit 130 (voice peak value detection means), a histogram calculation
processing unit 140 (means for generating speech peak value distribution information), and a gain
calculation processing unit 150 (gain value determination means).
The voice data storage unit 110 stores voice data that the processing unit 10 takes in from the A
/ D converter 13 when the user utters.
The voice recognition engine 120 processes voice data read from the voice data storage unit 110
according to a predetermined voice recognition algorithm, and generates recognition data for the
user's uttered voice.
[0035]
The peak value calculation processing unit 130 has a data maximum amplitude calculation unit
131, an excessive input determination unit 132, and a protrusion amount estimation unit 133.
The data maximum amplitude calculation unit 131 samples the audio data (representing the audio
signal level) stored in the audio data storage unit 110 at predetermined time intervals, and
determines the maximum amplitude value, that is, the peak value, of the audio data based on the
sampled audio signal levels (amplitudes).
The number of output bits of the data maximum amplitude calculation unit 131 is limited, so the
peak value it determines is limited to a predetermined value represented by that number of bits.
The excessive input determination unit 132 determines whether all the output bits of the data
maximum amplitude calculation unit 131 other than the sign bit are “1” (full bit), that is,
whether the output of the data maximum amplitude calculation unit 131 is limited to the
predetermined value because of an excessive voice input.
When the excessive input determination unit 132 determines that the output of the data maximum
amplitude calculation unit 131 is limited to the predetermined value, the protrusion amount
estimation unit 133 estimates the difference (protrusion amount) between the predetermined value,
which is the limit value of the peak value, and the actual peak value exceeding it.
[0036]
The histogram calculation processing unit 140 has an utterance peak value storage unit 141 and a
histogram calculation unit 142. The utterance peak value storage unit 141 stores, for each
utterance of the user, the peak value of the voice data obtained by the peak value calculation
processing unit 130. The histogram calculation unit 142 generates a histogram (speech peak value
distribution information) representing a statistical distribution of the peak values already
obtained and stored in the utterance peak value storage unit 141.
The gain calculation processing unit 150 includes an in-range determination unit 151 and a gain
update unit 152. The in-range determination unit 151 determines whether the width of the
distribution range of peak values represented by the histogram generated by the histogram
calculation unit 142 is smaller than the width of the dynamic range (reference peak value range)
of the speech recognition engine 120. The gain update unit 152 updates the gain value to be set in
the microphone amplifier 12 by a method that depends on the determination result of the in-range
determination unit 151. The in-range determination unit 151 also determines whether the peak value
obtained by the peak value calculation processing unit 130 is within the dynamic range of the
speech recognition engine 120, and the gain update unit 152 decides whether or not to update the
gain value according to that determination result.
[0037]
The processing unit 10 in the speech recognition apparatus configured as described above
executes processing according to the procedure shown in FIG. 2 to adjust the gain value to be set
to the microphone amplifier 12.
[0038]
In FIG. 2, the processing unit 10 sets the gain value for the microphone amplifier 12 to an initial
value (S1).
When the speech recognition device is used for the first time, a predetermined value set at the
time of shipment from the factory is used as the initial value; if the speech recognition device
has already been used as part of the vehicle equipment, the gain value last set during that use is
used as the initial value. When the initial value of the gain value for the microphone amplifier
12 has been set, the processing unit 10 determines whether the talk switch has been operated (S2).
In this voice recognition device, the user operates the talk switch and then speaks. The audio
signal output from the microphone 11 in response to the uttered voice is amplified by the
microphone amplifier 12 based on the gain value set to the initial value, and the amplified audio
signal is taken into the processing unit 10 as voice data via the A / D converter 13.
[0039]
If it is determined that the talk switch has been operated (YES in S2), the processing unit 10
stores the acquired voice data in the voice data storage unit 110 and supplies it to the voice
recognition engine 120 (S3). The voice recognition engine 120 determines, based on the supplied
voice data, whether or not the user's speech has ended, and the processing unit 10 determines
whether a determination result indicating the end of the speech has been obtained from the voice
recognition engine 120 (S4). The processing unit 10 continues to store the voice data
corresponding to the speech in the voice data storage unit 110 and to supply it to the voice
recognition engine 120 (S3) until that determination result is obtained (S4).
[0040]
When the determination result indicating the end of the speech is obtained from the speech
recognition engine 120 (YES in S4), the peak value calculation processing unit 130 of the
processing unit 10 calculates the peak value Ym of the speech from the speech data fetched this
time (S5). The peak value calculation processing unit 130 executes this processing (peak value
calculation processing) in accordance with the procedure shown in FIG.
[0041]
In FIG. 3, the peak value calculation processing unit 130 acquires the speech start sample point
ns and the speech end sample point ne from the speech recognition engine 120 as speech position
information (S51). Once the speech position information has been acquired, the data maximum
amplitude calculation unit 131 samples the speech data stored in the speech data storage unit 110
at predetermined time intervals, starting from the speech start sample point ns, to obtain the
sampling value y (n), while the excessive input determination unit 132 determines whether the
sampling value y (n) is a value represented by a full bit (S52). If it is determined to be a full
bit (YES in S52), the sample point n (representing the sample timing) is stored (S53) on the
assumption that there has been a voice input of excessive level.
[0042]
The data maximum amplitude calculation unit 131 determines whether the obtained sampling value
y (n) is larger than the maximum value Ym at that time (S54), and if so (YES in S54), sets the
sampling value y (n) as the new maximum value Ym (S55). The data maximum amplitude calculation
unit 131, together with the excessive input determination unit 132, repeats the above-described
processing (S52, S53, S54, S55) for each sample point n until the processing for the final sample
point ne is completed (S56).
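
The scan described in steps S52 to S56 can be sketched as a single loop. The Python below is a
minimal illustration under stated assumptions: samples are signed 16-bit integers, the amplitude is
taken as the absolute value, and "full bit" is treated as a magnitude of at least 32767. The
function and variable names are hypothetical.

```python
FULL_SCALE = 32767  # assumption: largest positive signed 16-bit value


def scan_peak(samples, ns, ne, full_scale=FULL_SCALE):
    """Scan samples[ns..ne] and return (y_max, clipped_points).

    y_max is the largest amplitude seen; clipped_points lists the sample
    points whose amplitude reached the full-scale limit.
    """
    y_max = 0
    clipped_points = []
    for n in range(ns, ne + 1):
        level = abs(samples[n])
        if level >= full_scale:        # S52: full-bit value?
            clipped_points.append(n)   # S53: remember the sample point
        if level > y_max:              # S54: larger than the current maximum?
            y_max = level              # S55: new maximum value Ym
    return y_max, clipped_points       # loop end corresponds to S56


# Hypothetical utterance occupying sample points 1..5.
samples = [0, 1200, -8000, 32767, 32767, 15000, -300]
y_max, clipped_points = scan_peak(samples, ns=1, ne=5)
```

Here two consecutive sample points (3 and 4) are recorded as clipped, which is the situation that
the later check in S57 looks for.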
[0043]
When the processing for the final sample point ne is completed (YES in S56), the peak value
calculation processing unit 130 determines whether a plurality of consecutive sample points have
been stored as a result of the excessive input level determination by the excessive input
determination unit 132 (S57). If no such run of consecutive sample points has been stored (NO in
S57), the parameters (sampling value y (n), speech start sample point ns, and speech end sample
point ne) are initialized and the processing ends (S60). The maximum value Ym set at that time is
then sent to the histogram calculation processing unit 140 and the gain calculation processing
unit 150 as the peak value of the voice signal representing the speech.
[0044]
On the other hand, when a plurality of consecutive sample points have been stored (YES in S57),
the protrusion amount estimation unit 133 executes protrusion amount estimation processing (S58).
In this processing, as shown in FIG. 4, the difference (protrusion amount) α (dB) between the
predetermined value to which the sampling value is limited (“32768”, corresponding to a signed
16-bit full bit) and the actual true value exceeding that predetermined value is estimated.
Specifically, for the plurality of sampling points (ni + 2,..., ni + 9) stored as corresponding to
the limit value represented by the full bit, the intersection is calculated between the
approximation straight line of the sampling value series passing through the limit value (sampling
value) at the starting point ni + 2 and the approximation straight line of the sampling value
series passing through the limit value (sampling value) at the end point ni + 9, and the
difference between the position (value) of the intersection and the limit value represented by the
full bit is calculated as the protrusion amount α. The value of the intersection serves as an
estimate of the true value of the voice signal. Among the estimates of the true value calculated
in this manner, the largest is taken as the estimated value Ys of the peak value of the audio
signal.
[0045]
As a method of obtaining the estimated value Ys of the peak value, a method using spline
interpolation or a method using DFT may be considered other than the method of obtaining from
the intersection of two straight lines.
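
The two-line intersection method can be illustrated with the Python sketch below. It simplifies
the "approximation straight lines" to two-point lines (the last unclipped sample before the run
and the first clipped sample, then the last clipped sample and the first unclipped sample after
it), which is an assumption made for brevity. The default limit of 32767, the function name, and
the sample data are likewise hypothetical, and the samples are treated as non-negative magnitudes.

```python
import math


def estimate_clipped_peak(samples, start, end, limit=32767):
    """Estimate the true peak hidden by a clipped run samples[start..end].

    Returns (estimated_peak, protrusion_db), where protrusion_db is the
    level of the estimated peak above the limit value in dB.
    """
    if start < 1 or end + 1 >= len(samples):
        raise ValueError("need one unclipped sample on each side of the run")
    rise = limit - samples[start - 1]  # slope of the rising line per sample
    fall = samples[end + 1] - limit    # slope of the falling line per sample
    if rise <= 0 or fall >= 0:
        raise ValueError("the two lines do not meet above the limit")
    # Solve limit + rise*(x - start) == limit + fall*(x - end) for x.
    x = (rise * start - fall * end) / (rise - fall)
    peak = limit + rise * (x - start)
    protrusion_db = 20.0 * math.log10(peak / limit)
    return peak, protrusion_db


# Hypothetical magnitudes: a symmetric ramp clipped at sample points 2..8.
samples = [24767, 28767] + [32767] * 7 + [28767, 24767]
peak, alpha_db = estimate_clipped_peak(samples, start=2, end=8)
```

For this symmetric example the lines meet midway through the run at a value of 44767, about
2.7 dB above the limit. A least-squares fit over several samples on each side would be closer to
an "approximation straight line" and could replace the two-point slopes.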
[0046]
When the protrusion amount estimation processing by the protrusion amount estimation unit 133
described above is completed, the maximum value Ym is updated to the estimated value Ys if the
peak value Ys estimated in this processing is larger than the maximum value Ym at that time.
Thereafter, the parameters described above are initialized (S60), and the maximum value Ym set
at that time is sent to the histogram calculation processing unit 140 and the gain calculation
processing unit 150 as the peak value of the speech signal representing the speech.
[0047]
Returning to FIG. 2, when the peak value Ym of the speech signal representing the speech has been
obtained as described above (S5), the histogram calculation processing unit 140 of the processing
unit 10 updates the histogram H based on the peak value Ym sent from the peak value calculation
processing unit 130 (S6). Specifically, the peak value Ym from the peak value calculation
processing unit 130 is stored in the utterance peak value storage unit 141, and the histogram
calculation unit 142 generates the histogram H based on this peak value Ym and the peak values
stored in the utterance peak value storage unit 141 so far.
[0048]
For example, as shown in FIG. 6, suppose that the peak value Ym of the audio signal detected this
time falls in the range "-16 dB to -14 dB" of the level range up to 0 dB, which corresponds to the
full-bit peak value. The histogram Hm-1, in which the frequency of the peak value in the range
"-14 dB to -12 dB" is 7, that in the range "-12 dB to -10 dB" is 8, and that in the range "-10 dB
to -8 dB" is 6 (the frequency of the peak value in the other ranges is 0), is then updated to the
histogram Hm, in which the frequency of the peak value in the range "-16 dB to -14 dB" is 1.
Likewise, as shown in FIG. 8, suppose that the peak value Ym of the audio signal detected this
time again falls in the range "-16 dB to -14 dB". The histogram Hm-1, in which the frequency of
the peak value in the range "-22 dB to -20 dB" is 1, that in the range "-20 dB to -18 dB" is 5,
that in the range "-18 dB to -16 dB" is 8, those in the ranges "-14 dB to -12 dB", "-12 dB to -10
dB", and "-10 dB to -8 dB" are 7, that in the range "-8 dB to -6 dB" is 6, that in the range "-6
dB to -4 dB" is 4, and that in the range "-4 dB to -2 dB" is 3 (the frequency of the peak value in
the ranges outside "-22 dB to -2 dB" is 0), is then updated to the histogram Hm, in which the
frequency of the peak value in the range "-16 dB to -14 dB" is incremented to 6.
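
The histogram update for the FIG. 6 example can be sketched with a counter keyed by the lower edge
of each 2 dB range. This Python illustration assumes that a range includes its lower edge and
excludes its upper edge, which the text does not specify; the function names and the peak value of
-15.2 dB are hypothetical.

```python
import math
from collections import Counter

BIN_WIDTH_DB = 2.0  # the examples use 2 dB wide ranges


def bin_lower_edge(peak_db, bin_width=BIN_WIDTH_DB):
    """Lower edge (dB) of the range that contains peak_db."""
    return math.floor(peak_db / bin_width) * bin_width


def update_histogram(hist, peak_db):
    """Count one more utterance peak in the range that contains peak_db."""
    hist[bin_lower_edge(peak_db)] += 1
    return hist


# Histogram Hm-1 of the FIG. 6 example, keyed by the lower edge of each range.
hist = Counter({-14.0: 7, -12.0: 8, -10.0: 6})
update_histogram(hist, -15.2)  # a peak inside "-16 dB to -14 dB"
```

After the call, the range starting at -16 dB holds one peak value and the other counts are
unchanged, matching the update from Hm-1 to Hm described for FIG. 6.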
[0049]
When the histogram representing the statistical distribution of the peak values of the audio
signal detected as described above is updated (generated), in the processing unit 10, the gain
calculation processing unit 150 executes a gain calculation process (S7). The specific procedure
of this gain calculation is shown in FIGS. 5A and 5B.
[0050]
In FIG. 5A, the gain calculation processing unit 150 fetches the data necessary for gain
calculation, specifically the current peak value Ym calculated by the peak value calculation
processing unit 130 and information on the histogram H held by the histogram calculation
processing unit 140 (S71). Then, the in-range determination unit 151 determines whether the
acquired peak value Ym deviates from the dynamic range (maximum value Dmax, minimum value Dmin)
of the speech recognition engine 120 (S72). This dynamic range is the range of amplitude levels
at which the speech recognition engine 120 can properly perform speech recognition processing on
the speech signal. When the in-range determination unit 151 determines that the peak value Ym
does not deviate from the dynamic range of the speech recognition engine 120 (NO in S72), the
level of the audio signal amplified with the currently set gain value is regarded as being within
the appropriate range, and the gain updating unit 152 maintains the gain value (Gnow) already set
in the microphone amplifier 12 (Gnew = Gnow) (S73). The gain calculation process then ends (see
FIG. 5B).
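The S72/S73 decision can be sketched as below. The function names are hypothetical, and treating the boundaries Dmin and Dmax as inside the range is an assumption, since the text does not say whether the limits are inclusive.

```python
def deviates_from_dynamic_range(peak_db, d_min, d_max):
    """Return True when the peak lies outside [d_min, d_max] (YES branch of S72).

    Inclusive boundaries are an assumption, not stated in the description.
    """
    return peak_db < d_min or peak_db > d_max


def gain_when_in_range(peak_db, d_min, d_max, g_now):
    """Return g_now unchanged (S73) when the peak is in range.

    Returns None when the peak is out of range, signalling that the
    FIG. 5B procedure has to compute a new gain instead.
    """
    if deviates_from_dynamic_range(peak_db, d_min, d_max):
        return None
    return g_now
```

With the dynamic range of the later examples (-14 dB to -6 dB), a peak of -10 dB leaves the gain untouched, while a peak of -15 dB hands control to the FIG. 5B branch.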
[0051]
On the other hand, when the in-range determination unit 151 determines that the detected peak
value Ym deviates from the dynamic range of the speech recognition engine 120 (YES in S72), the
gain calculation processing unit 150 continues with processing according to the procedure shown
in FIG. 5B.
[0052]
In FIG. 5B, the gain updating unit 152 calculates the maximum value Ymax and the minimum value
Ymin of the peak values distributed in the histogram H (see FIGS. 6 and 8) acquired from the
histogram calculation processing unit 140. Next, the gain updating unit 152 calculates the width
(Dmax−Dmin) of the dynamic range of the speech recognition engine 120 and the width (Ymax−Ymin)
over which the peak values of the audio signal are distributed in the histogram (S75). The gain
updating unit 152 then determines whether the width (Ymax−Ymin) of the distribution range of
peak values in the histogram is equal to or less than the width (Dmax−Dmin) of the dynamic range
of the speech recognition engine 120 (S76). When it is determined that the width (Ymax−Ymin) is
equal to or less than the width (Dmax−Dmin) of the dynamic range (YES in S76), the gain updating
unit 152 calculates the change in gain value necessary to bring each peak value in the histogram
within the dynamic range and, based on that change, calculates a new gain value (Gnew) to replace
the gain value (Gnow) currently set in the microphone amplifier 12 (S77, S78).
[0053]
Specifically, an intermediate value Ymid of the distribution range of peak values in the
histogram and an intermediate value Dmid of the dynamic range of the speech recognition engine
120 are each calculated (S77), and the change in gain value (Dmid−Ymid) necessary to move the
histogram so that Ymid matches Dmid is calculated. Then, based on this change (Dmid−Ymid), a new
gain value Gnew is calculated according to the following equation (S78):
Gnew = Gnow + (Dmid−Ymid).
[0054]
For example, as shown in FIG. 6, suppose that the width (Ymax−Ymin = 8 dB) of the distribution
range of peak values (Ymin = -16 dB to Ymax = -8 dB) in the histogram to which the peak value Ym
detected this time has been added is equal to (or less than) the width (Dmax−Dmin = 8 dB) of the
dynamic range (Dmin = -14 dB to Dmax = -6 dB) of the speech recognition engine 120. In that case,
as shown in FIG. 7, the change in gain value (Dmid−Ymid = 2 dB) necessary to move the histogram
so that the middle value Ymid = -12 dB of the distribution range of peak values matches the middle
value Dmid = -10 dB of the dynamic range is obtained, and a new gain value Gnew is calculated
according to Gnew = Gnow + 2 dB. That is, the gain value of the microphone amplifier 12 is
increased by 2 dB. This is because the statistical distribution range of the peak values
represented by the histogram lies, as a whole, 2 dB below the dynamic range of the speech
recognition engine 120; increasing the gain value of the microphone amplifier 12 by 2 dB brings
the range of the statistical distribution (histogram) of the peak values within the dynamic range.
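The midpoint alignment of S77 and S78 amounts to a few lines of arithmetic. The sketch below uses hypothetical function names and assumes all quantities are expressed in dB, so that a gain change is a simple addition. The value Gnow = 20 dB is an arbitrary illustration, not taken from the description.

```python
def midpoint_gain_change(y_min, y_max, d_min, d_max):
    """Return Dmid - Ymid, the gain change that centers the peak distribution
    [y_min, y_max] on the dynamic range [d_min, d_max] (S77).

    Only valid for the YES branch of S76, i.e. when the distribution is no
    wider than the dynamic range.
    """
    if (y_max - y_min) > (d_max - d_min):
        raise ValueError("distribution is wider than the dynamic range (NO in S76)")
    y_mid = (y_min + y_max) / 2
    d_mid = (d_min + d_max) / 2
    return d_mid - y_mid


def new_gain(g_now, y_min, y_max, d_min, d_max):
    """Return Gnew = Gnow + (Dmid - Ymid) (S78)."""
    return g_now + midpoint_gain_change(y_min, y_max, d_min, d_max)


# Values of the FIG. 6 / FIG. 7 example; g_now is hypothetical.
change = midpoint_gain_change(-16.0, -8.0, -14.0, -6.0)
g_new = new_gain(20.0, -16.0, -8.0, -14.0, -6.0)
```

With the example values, Ymid is -12 dB and Dmid is -10 dB, so the change is +2 dB, matching the text.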
[0055]
After that, each of the peak values of the speech signal stored so far in the speech peak value
storage unit 141 of the histogram calculation processing unit 140 is changed by the amount of the
change in gain value (for example, +2 dB) (first speech peak value distribution information
updating means: S81). That is, the histogram H of peak values of the speech signal is updated so
as to fall within the dynamic range (reference peak value range) of the speech recognition engine
120. Specifically, the histogram Hm is updated, as shown in FIG. 7, so as to be contained in the
dynamic range (Dmax, Dmin) of the speech recognition engine 120. As a result, the updated
histogram Hm (see FIG. 7) represents a histogram of the peak values that the audio signal would
have had if amplified with the new gain value (2 dB increase).
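The update of the stored peak values in S81 can be illustrated as below. The function name and the sample peak values are hypothetical. The only behavior taken from the text is that every stored peak is offset by the gain change, so that the stored history describes the signal as it would look under the new gain.

```python
def shift_stored_peaks(peaks_db, gain_change_db):
    """Return the stored peak values offset by the gain change (S81)."""
    return [p + gain_change_db for p in peaks_db]


# Hypothetical stored peaks lying in the FIG. 6 distribution range (-16 dB to -8 dB).
stored = [-15.0, -13.0, -11.0, -9.0]
shifted = shift_stored_peaks(stored, 2.0)

# After a +2 dB change, every peak lies inside the dynamic range -14 dB to -6 dB.
all_in_range = all(-14.0 <= p <= -6.0 for p in shifted)
```

Shifting the history instead of discarding it lets the next utterance be judged against statistics that already reflect the new amplifier setting.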
[0056]
On the other hand, when it is determined that the width (Ymax−Ymin) of the distribution range
of peak values in the histogram is larger than the width (Dmax−Dmin) of the dynamic range (NO
in S76), the gain updating unit 152 calculates, within the distribution range of peak values in
the histogram, the range that has the same width as the dynamic range of the speech recognition
engine 120 and in which the total frequency of the peak values it contains is maximum
(hereinafter, the detected peak value main range) (S79). Specifically, when the histogram of the
peak values of the audio signal shown in FIG. 8 is obtained, ranges S1, S2, S3, and S4 are set,
each having the same width as the dynamic range and each including the range (-16 dB to -14 dB)
that contains the peak value Ym detected this time. Then the total frequency of the peak values
included in each of the ranges S1, S2, S3, and S4 is calculated. For example, the total frequency
of the range S1 is "28 (= 6 + 7 + 8 + 7)", the total frequency of the range S2 is "29 (= 8 + 6 +
7 + 8)", the total frequency of the range S3 is "26 (= 5 + 8 + 6 + 7)", and the total frequency
of the range S4 is "20 (= 1 + 5 + 8 + 6)". Of the ranges S1 to S4, the range S2, whose total
frequency is maximum, is determined as a candidate for the detected peak value main range.
Further, the total frequency "28 (= 7 + 8 + 7 + 6)" is calculated for the range Sm-1 of the
histogram that coincides with the dynamic range of the speech recognition engine 120. This range
Sm-1 is the detected peak value main range determined at the previous utterance.
[0057]
The total frequency of the range S2, determined as the candidate for the detected peak value
main range, is compared with the total frequency of the range Sm-1, which coincides with the
dynamic range. If the total frequency of the range S2 is equal to or higher than that of the
range Sm-1, the range S2 is determined as the new detected peak value main range Sm.
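The selection of the detected peak value main range (S79) is essentially a sliding-window maximum over the histogram. The sketch below reproduces the S1 to S4 totals of the FIG. 8 example. Several details are assumptions rather than statements from the description: bins are 2 dB wide and keyed by their lower edge, ties between candidates go to the first window examined, and the previous range Sm-1 is kept when the best candidate has a lower total (the text only states the "equal to or higher" case). All names are hypothetical.

```python
BIN_WIDTH_DB = 2  # assumed bin width


def window_total(hist, start, width, bin_width=BIN_WIDTH_DB):
    """Sum the frequencies of the bins covering [start, start + width)."""
    return sum(hist.get(start + i * bin_width, 0) for i in range(width // bin_width))


def select_main_range(hist, current_bin, d_min, d_max, bin_width=BIN_WIDTH_DB):
    """Return (range_min, range_max, total) of the detected peak value main range.

    Candidate windows have the width of the dynamic range and all contain
    the bin of the current peak (ranges S1 to S4 in the FIG. 8 example).
    """
    width = d_max - d_min
    starts = [current_bin - i * bin_width for i in range(width // bin_width)]
    totals = {s: window_total(hist, s, width, bin_width) for s in starts}
    best = max(starts, key=lambda s: totals[s])
    prev_total = window_total(hist, d_min, width, bin_width)  # range Sm-1
    if totals[best] >= prev_total:
        return best, best + width, totals[best]
    return d_min, d_max, prev_total  # assumption: otherwise keep Sm-1


# Histogram Hm of the FIG. 8 example (after the current peak was added).
hm = {-22: 1, -20: 5, -18: 8, -16: 6, -14: 7, -12: 8, -10: 7, -8: 6, -6: 4, -4: 3}
main_range = select_main_range(hm, current_bin=-16, d_min=-14, d_max=-6)
```

For this histogram the candidates total 28, 29, 26, and 20, so the window from -18 dB to -10 dB (S2) is chosen, and its total of 29 is not lower than the 28 of Sm-1.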
[0058]
Next, the gain updating unit 152 calculates the change in gain value necessary to bring the peak
values of the detected peak value main range (S2) within the dynamic range of the speech
recognition engine 120 and, based on that change, calculates a new gain value (Gnew) to replace
the gain value (Gnow) currently set in the microphone amplifier 12 (S80).
[0059]
Specifically, the change in gain value (Dmin−Rmin) necessary to move the detected peak value
main range (S2) so that its minimum value Rmin (-18 dB) matches the minimum value Dmin (-14 dB)
of the dynamic range of the speech recognition engine 120 is calculated.
Then, based on this change (Dmin−Rmin), a new gain value Gnew is calculated according to the
following equation: Gnew = Gnow + (Dmin−Rmin).
[0060]
For example, the change in gain value (Dmin−Rmin = 4 dB) necessary to move the detected peak
value main range S2 (and with it the histogram) so that its minimum value Rmin = -18 dB shown in
FIG. 9 matches the minimum value Dmin = -14 dB of the dynamic range of the speech recognition
engine 120 is obtained, and a new gain value Gnew is calculated according to Gnew = Gnow + 4 dB.
That is, the gain value of the microphone amplifier 12 is increased by 4 dB. This is because the
range of the histogram in which the total frequency is large (the detected peak value main range)
lies, as a whole, 4 dB below the dynamic range of the speech recognition engine 120. The gain
value of the microphone amplifier 12 is therefore increased by 4 dB so that this main range of
the statistical distribution (histogram) falls within the dynamic range.
[0061]
After that, each of the peak values of the speech signal stored so far in the speech peak value
storage unit 141 of the histogram calculation processing unit 140 is changed by the amount of the
change in gain value (for example, +4 dB) (second speech peak value distribution information
updating means: S81). That is, the histogram H of the peak values of the speech signal is updated
so that the detected peak value main range falls within the dynamic range (reference peak value
range) of the speech recognition engine 120. Specifically, the histogram Hm shown in FIGS. 8 and 9
is updated to the histogram Hm shown in FIG. 10 so that the detected peak value main range is
contained in the dynamic range (Dmax, Dmin) of the speech recognition engine 120. Thus, the
updated histogram Hm (see FIG. 10) represents the histogram of the peak values that the audio
signal would have had if amplified with the new gain value (4 dB increase).
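The lower-edge alignment of S80 and the subsequent histogram shift of S81 can be sketched together. The function name is hypothetical, the histogram is keyed by the lower edge of each 2 dB bin (an assumed representation), and Gnow = 20 dB is an arbitrary illustrative value.

```python
def align_main_range(hist, r_min, d_min, g_now):
    """Return (g_new, shifted_hist) for the NO branch of S76.

    The gain change Dmin - Rmin moves the lower edge of the detected peak
    value main range onto the lower edge of the dynamic range, and every
    histogram bin is shifted by the same amount.
    """
    delta = d_min - r_min
    shifted = {edge + delta: count for edge, count in hist.items()}
    return g_now + delta, shifted


# Histogram Hm of FIGS. 8 and 9, with main range S2 starting at Rmin = -18 dB.
hm = {-22: 1, -20: 5, -18: 8, -16: 6, -14: 7, -12: 8, -10: 7, -8: 6, -6: 4, -4: 3}
g_new, hm_shifted = align_main_range(hm, r_min=-18, d_min=-14, g_now=20.0)

# Frequencies that now fall inside the dynamic range -14 dB to -6 dB.
in_range_total = sum(hm_shifted.get(edge, 0) for edge in (-14, -12, -10, -8))
```

After the +4 dB shift, the 29 peaks of S2 occupy the dynamic range, while peaks outside S2 remain outside it. Covering the bulk of the distribution rather than all of it is the point of this branch.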
[0062]
As described above, when the gain value Gnew to be set in the microphone amplifier 12 is
determined, the gain value is set in the microphone amplifier 12 as the gain value for the next
speech (S8). Thereafter, the processing unit 10 updates the gain value set in the microphone
amplifier 12 by the same process each time the talk switch is operated, that is, whenever the user
utters.
[0063]
According to the speech recognition apparatus described above, the gain value of the microphone
amplifier 12 is determined based on the histogram representing the statistical distribution of
the peak values of the speech signal amplified by the microphone amplifier 12 when the user
speaks, and on the dynamic range of the speech recognition engine 120. The gain can therefore be
adjusted so that a speech signal in which various noises and the intended speech are mixed falls
within the dynamic range of the speech recognition engine 120 with the statistically highest
probability. As a result, it becomes possible to properly perform speech recognition processing
on a speech signal obtained in an environment where speech and noise are mixed.
[0064]
In the speech recognition apparatus described above, the gain value set in the microphone
amplifier 12 and the set of peak values of the captured speech signals can also be managed for
each user, for example as shown in FIG. 11.
In this case, in the process of setting initial values (see S1) in the process shown in FIG. 2,
the gain value Gnow corresponding to the user who speaks is set as the initial value. An initial
histogram can also be created using the set of peak values corresponding to the user who speaks
(see S6). Such a speech recognition apparatus is suitable for in-vehicle equipment mounted in a
vehicle that is used by a plurality of people.
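Per-user management of the gain value and peak history could be organized as a small lookup table. The sketch below is purely illustrative: the `UserProfile` class, the `profile_for` helper, the user identifiers, and the default gain are all hypothetical, and the description does not say how users are identified.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Gain value and stored peak values kept for one user (cf. FIG. 11)."""
    gain_db: float
    peaks_db: list = field(default_factory=list)


profiles = {}


def profile_for(user_id, default_gain_db=0.0):
    """Return the profile of user_id, creating it with a default gain if absent."""
    return profiles.setdefault(user_id, UserProfile(default_gain_db))


# Initial setting (S1): pick up the speaking user's gain and peak history.
driver = profile_for("driver", default_gain_db=12.0)
driver.peaks_db.append(-15.0)
```

The stored `peaks_db` list is what an initial histogram for that user would be built from at S6.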
[0065]
Further, although the gain value is updated at every utterance in the above-described speech
recognition apparatus, the present invention is not limited to this. The gain value may instead
be updated each time a predetermined number of utterances have been made, based on the speech
data captured during that period.
Furthermore, the frequencies may be multiplied by a weight (a forgetting factor less than 1) that
weakens the influence of the peak values of past utterances as utterances accumulate.
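One possible reading of the forgetting factor is sketched below: before the new peak is counted, every existing frequency is multiplied by a factor below 1, so older utterances contribute progressively less. The description does not give the exact weighting scheme, so this exponential decay, the function name, and the factor values are assumptions.

```python
def update_with_forgetting(hist, peak_bin, factor=0.9):
    """Return a new histogram in which past frequencies are decayed by
    `factor` (assumed 0 < factor < 1) and the bin of the new peak gains 1.
    """
    decayed = {edge: count * factor for edge, count in hist.items()}
    decayed[peak_bin] = decayed.get(peak_bin, 0.0) + 1.0
    return decayed


# Two past peaks in the -16 dB bin and one in the -14 dB bin; a new peak
# arrives in the -16 dB bin and past counts are halved (factor 0.5).
weighted = update_with_forgetting({-16: 2.0, -14: 1.0}, peak_bin=-16, factor=0.5)
```

Because frequencies become fractional, any comparison of range totals has to work on floating-point sums rather than integer counts.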
[0066]
The present invention is not limited to a speech recognition device, and can be applied to a
general speech processing device that converts human speech into speech signals and processes
them.
[0067]
As described above, the voice processing device according to the present invention has the effect
of being able to properly process a voice signal obtained in an environment where speech and
noise are mixed. It is useful as a voice processing apparatus that amplifies, based on a set gain
value, a voice signal output from a microphone in response to the user's speech, and processes
the amplified voice signal according to a predetermined method such as voice recognition
processing.
[0068]
10 processing unit
11 microphone
12 microphone amplifier (amplification means)
13 A/D converter
110 voice data storage unit
120 voice recognition engine (voice processing means)
130 peak value calculation processing unit
131 data maximum amplitude calculation unit
132 excess input determination unit
133 projection amount estimation unit
140 histogram calculation processing unit
141 utterance peak value storage unit
142 histogram calculation unit
150 gain calculation unit
151 in-range determination unit
152 gain update unit