Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2001337694
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
sound source position estimation method, a speech recognition method and a speech
enhancement method.
[0002]
2. Description of the Related Art In an automobile, for example, various noises such as running noise and radio audio are present. As a result, in-car speech recognition suffers from a low recognition rate. Moreover, in such a high-noise environment the S/N ratio of the target voice (the speech signal to be recognized) is generally low, so noise removal and speech enhancement are not easy.
[0003]
SUMMARY OF THE INVENTION The present invention has been made in view of the above circumstances, and aims to provide a sound source position estimation method, a speech recognition method and a speech enhancement method that can improve the recognition rate of a target speech signal in a noisy environment such as the interior of a car.
[0004]
According to a first aspect of the present invention, there is provided a sound source position estimation method in which audio signals emitted from a sound source are received by a plurality (M) of receiving units and the position (x, y) of the sound source is estimated in two dimensions, that is, in the x-y directions. Here, (xi, yi) are the coordinates of each receiver and * represents complex conjugate transposition. The left side of equation (b) is a vector quantity represented by equation (e), where Sq (t) is the amplitude of the sound wave received by the M receivers. In equation (c), the vq are components of the eigenvectors V = [v1, ..., vM] of the correlation matrix of s (t), where s (t) in equation (f) represents the received signal when K sound waves arrive at each of the M receivers, and nm (t) represents a noise component in each receiver.
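The equation images referenced above, (b), (c), (e) and (f), are not reproduced in this machine translation. As a reading aid, the following LaTeX sketch restates the quantities defined above in the standard narrow-band MUSIC form they imply; the symbols a(x, y), R, Rn, P(x, y), the propagation speed c, and the numbering (1) to (4), which follows paragraph [0010], are assumptions made for this sketch, not text from the patent.

```latex
% Presumed form of the referenced equations (a sketch, not verbatim from the patent).
% a(x,y): steering vector toward candidate position (x,y); tau_i: propagation delay to receiver i.
\begin{align}
\mathbf{a}(x,y) &= \bigl[\, e^{-j\omega\tau_1},\ \dots,\ e^{-j\omega\tau_M} \,\bigr]^{T},
  \qquad \tau_i = \tfrac{1}{c}\sqrt{(x-x_i)^2 + (y-y_i)^2} \tag{1} \\
\mathbf{s}(t) &= \sum_{q=1}^{K} \mathbf{a}_q\, S_q(t) + \mathbf{n}(t),
  \qquad \mathbf{n}(t) = \bigl[\, n_1(t),\ \dots,\ n_M(t) \,\bigr]^{T} \tag{2} \\
\mathbf{R}_n &= \sum_{i=K+1}^{M} \mathbf{v}_i \mathbf{v}_i^{*},
  \qquad \mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_M]
  \ \text{the eigenvectors of } \mathbf{R} = E\{\mathbf{s}(t)\,\mathbf{s}^{*}(t)\} \tag{3} \\
P(x,y) &= \frac{\mathbf{a}^{*}(x,y)\,\mathbf{a}(x,y)}
               {\mathbf{a}^{*}(x,y)\,\mathbf{R}_n\,\mathbf{a}(x,y)} \tag{4}
\end{align}
```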
[0005]
The speech recognition method according to claim 2 uses a delay-and-sum array unit, a speech recognition unit, and receiving units. The position of the sound source is first estimated; based on this estimate, the delay-and-sum array unit forms a directivity characteristic toward the sound source position to emphasize the target voice, and the speech recognition unit then recognizes the input signal obtained from the receiving units as speech.
[0006]
The speech enhancement method according to claim 3 uses a delay-and-sum array unit, a speech recognition unit, a pitch extraction unit, a speech synthesis unit, and receiving units, and comprises the following steps:
(1) Estimating the position of the sound source and, based on this estimate, forming a directivity characteristic toward the sound source position with the delay-and-sum array unit to emphasize the target voice.
(2) Recognizing, with the speech recognition unit, the input signal obtained from the receiving units as speech.
(3) Extracting the pitch frequency with the pitch extraction unit, based on the output of the delay-and-sum array unit.
(4) Combining the output of the pitch extraction unit and the output of the speech recognition unit with the speech synthesis unit to obtain a speech output.
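For orientation only, the following Python sketch shows how the four steps could be chained. Each callable (estimate_position, enhance, recognize, extract_pitch, synthesize) is a hypothetical placeholder standing in for the units named in the claims, not an implementation taken from the patent.

```python
def speech_enhancement_pipeline(mic_signals, mic_xy, fs,
                                estimate_position, enhance, recognize,
                                extract_pitch, synthesize):
    """Chain the four claimed steps; every callable is supplied by the caller."""
    # (1) Estimate the sound source position and steer the array toward it.
    src_xy = estimate_position(mic_signals, mic_xy, fs)
    enhanced = enhance(mic_signals, mic_xy, src_xy, fs)
    # (2) Recognize the enhanced signal as speech.
    text = recognize(enhanced, fs)
    # (3) Extract the pitch frequency from the delay-and-sum output.
    pitch_hz = extract_pitch(enhanced, fs)
    # (4) Combine recognition output and pitch into the final speech output.
    return synthesize(text, pitch_hz, fs)
```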
[0007]
According to a fourth aspect of the present invention, in the speech recognition method
according to the second aspect, the method of estimating the position of the sound source is a
sound source position estimation method according to the first aspect.
[0008]
According to a fifth aspect of the present invention, there is provided the speech enhancement method according to the third aspect, in which the method of estimating the position of the sound source is the sound source position estimation method according to the first aspect.
[0009]
That is, the method may be the sound source position estimation method according to claim 1, the speech recognition method according to claim 2 or 4, or the speech enhancement method according to claim 3 or 5.
[0010]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS A sound source position estimation method, and a speech recognition method and a speech enhancement method using it, according to an embodiment of the present invention will be described with reference to the attached drawings. First, the outline of the speech recognition apparatus used in this embodiment will be described with reference to FIG. 1. In this embodiment, a speaker (a human) is assumed as the sound source 1; however, sound from a loudspeaker may also be used, and the type of sound source is not limited. The speaker serving as the sound source 1 is assumed to be sitting, for example, in the driver's seat, the passenger's seat or a rear seat of a car. The apparatus comprises, as its main components, a plurality of receiving units 2, a sound source position estimation unit 3, a delay-and-sum array unit 4, a speech recognition unit 5, a pitch extraction unit 6 and a speech synthesis unit 7. In this embodiment, a microphone is used as each receiving unit 2, and the receiving units 2 as a whole constitute a microphone array.

First, an audio signal transmitted from the sound source 1 (this refers to any signal in the audio band; the source and the transmission medium are not limited) is received by the receiving units 2 and converted into input signals (generally electrical signals). The input signals are fed to the sound source position estimation unit 3, which performs position estimation on a two-dimensional plane (the x-y plane) as follows. First, the premise of the position estimation is described. Let Sq (t) be the amplitude of the sound wave that arrives from coordinates (x, y) and is received by the receiving units 2. Representing the amplitudes at the M receiving units 2 as a vector gives the following equation (1), where ω is the angular frequency, T denotes transposition, (xi, yi) are the coordinates of the i-th microphone, and τi is the deviation of the sound-wave arrival time caused by the sound source position. From this, the received signal s (t) when K sound waves are arriving can be expressed as the sum of the contributions of the individual sound waves, as in the following equation (2), where nm (t) is the noise component at each microphone. Next, using the eigenvectors V = [v1, ..., vM] of the correlation matrix R of s (t), the matrix Rn is formed by the following equation (3), where * denotes complex conjugate transposition. Since the direction vector of the arriving signal and the noise subspace can be regarded as orthogonal, finding the maximum of P (x, y) in the following equation (4) allows the position of the sound source on the two-dimensional plane to be estimated (a presumed form of equations (1) to (4) is sketched after paragraph [0004] above). That is, the sound source position estimation unit 3 estimates the position of the sound source on the two-dimensional plane by finding the maximum of P (x, y), and the estimated position is sent to the delay-and-sum array unit 4.

If the search range of the sound source position is limited in the sound source position estimation unit 3, both the amount of computation and the number of erroneous estimates can be reduced; in a car, for example, this advantage is obtained by restricting the search range to the vicinity of the seats. Furthermore, instead of using the input signals from the receiving units 2 as they are, the sound source position estimation unit 3 can restrict processing to the 100 Hz to 4 kHz band in which speech mainly lies, which also improves the estimation accuracy.
[0011]
The delay-and-sum (DS) array unit 4 adds a delay to the signal received by each receiving unit 2 and then sums the signals, thereby enhancing the sound at a target position (see, for reference, Toshiro Ohashi, Yoshio Yamazaki, Yutaka Kanada, "Sound System and Digital Processing", The Institute of Electronics, Information and Communication Engineers, March 1995). In this embodiment, the position estimated by the sound source position estimation unit 3 is taken as the target position, a directivity characteristic is formed toward those coordinates, and the target voice is emphasized. On the low-frequency side, however, the delay-and-sum array is not sufficiently effective, so components that contain almost no speech, for example the 0 Hz to 100 Hz components, are removed by a filter (not shown). Since the configuration of the delay-and-sum array itself is known in the art, a more detailed description is omitted.
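For the delay-and-sum operation itself, a minimal Python sketch follows, under the same assumed 340 m/s propagation speed as the previous listing and with integer-sample delays; the function name delay_and_sum and the argument layout are illustrative, not from the patent.

```python
import numpy as np

C = 340.0  # speed of sound [m/s] (assumed)

def delay_and_sum(signals, mic_xy, target_xy, fs):
    """Steer the array toward target_xy by delaying and averaging the channels.

    signals: array of shape (M, n_samples), one row per microphone.
    """
    tau = np.hypot(mic_xy[:, 0] - target_xy[0], mic_xy[:, 1] - target_xy[1]) / C
    tau -= tau.min()                          # delays relative to the nearest microphone
    shifts = np.round(tau * fs).astype(int)   # integer-sample approximation of the delays
    n = signals.shape[1] - shifts.max()
    aligned = np.stack([sig[s : s + n] for sig, s in zip(signals, shifts)])
    return aligned.mean(axis=0)               # in-phase sum emphasizes the target position
```

Rounding the steering delays to whole samples is only an approximation; fractional-delay filtering would allow finer steering, and, as noted above, the 0 Hz to 100 Hz band would be removed beforehand by a separate high-pass filter.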
[0012]
The speech recognition unit 5 performs speech recognition on the input signal in which the target speech has been emphasized by the delay-and-sum array unit 4; the target voice can therefore be recognized with high accuracy. The pitch extraction unit 6 extracts the pitch frequency from the input signal. The speech synthesis unit 7 synthesizes a speech signal from the output of the pitch extraction unit 6 and the output of the speech recognition unit 5, which improves the perceived quality of the speech signal obtained through the speech recognition unit 5. As a result, when the audio signal itself must be acquired, as in a car telephone, an audio signal of high quality can be obtained. The configurations of the speech recognition unit 5, the pitch extraction unit 6 and the speech synthesis unit 7 are themselves known, and their description is omitted.
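The patent does not say how the pitch extraction unit 6 obtains the pitch frequency; the autocorrelation-based Python sketch below is one conventional way to do it per frame, with the frame length, search band and voicing threshold chosen as assumptions.

```python
import numpy as np

def pitch_by_autocorrelation(frame, fs, fmin=80.0, fmax=400.0):
    """Estimate the pitch frequency of one speech frame via autocorrelation.

    Returns 0.0 when no clear periodicity is found (unvoiced frame).
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lag_min = int(fs / fmax)                      # shortest period searched
    lag_max = min(int(fs / fmin), len(ac) - 1)    # longest period searched
    if lag_max <= lag_min or ac[0] <= 0:
        return 0.0
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag if ac[lag] > 0.3 * ac[0] else 0.0  # simple voicing threshold

# Example: a 200 Hz sine should come back as roughly 200 Hz.
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
print(pitch_by_autocorrelation(np.sin(2 * np.pi * 200.0 * t), fs))
```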
[0013]
EXPERIMENTAL EXAMPLE An example in which speaker position estimation and speech recognition were performed by simulation according to the method of this embodiment is described; the scenario is recognition of a speaker's utterance inside a car. The simulation conditions were as follows, and the results are shown in FIG. 2.
(Simulation conditions)
Number of microphones (receiving units): 16
Noise sources: 6 locations (in-vehicle noise such as running noise; S/N ratio 0 dB)
Change of speaker position: from xy coordinates (0.2, 0.2) (the left rear seat) to coordinates (0.4, 1) at the 43rd frame
FIGS. 2(a) and 2(b) show the estimated speaker position in each frame. In the voiced sections (the periods during which speech is being emitted), the estimate switches from coordinates (0.2, 0.2) to coordinates (0.4, 1) around the 43rd frame, from which it can be seen that the position is estimated almost correctly.
[0014]
Furthermore, it can be seen from FIGS. 2(c) to 2(d) that the method of this embodiment makes it possible to recognize speech that is close to the original speech.
[0015]
In the present embodiment, x and y denote two arbitrarily selected dimensions and are not limited to the horizontal plane.
In addition, the specific means for realizing each part of the present embodiment may be hardware, software, a network, a combination of these, or any other means, as is obvious to those skilled in the art. Furthermore, the descriptions of the above embodiment and examples are merely illustrative and do not define essential configurations of the present invention; the configuration of each part is not limited to the above as long as the purpose of the present invention can be achieved.
[0016]
According to the estimation method of the first aspect, the position of the sound source in two dimensions can be estimated relatively accurately.
[0017]
According to the speech recognition method of the second aspect, relatively accurate speech recognition can be performed in a noisy environment.
[0018]
According to the speech enhancement method of the third aspect, it is possible to obtain perceptually high-quality speech from the speech information acquired in a noisy environment.
[0019]
According to the speech recognition method of the fourth aspect, since speech recognition can be
performed based on the sound source position estimated relatively accurately, the accuracy of
speech recognition can be improved.
[0020]
According to the speech enhancement method of the fifth aspect, speech enhancement can be performed based on a sound source position estimated relatively accurately, so the auditory quality of the obtained speech can be improved further.
[0021]
Brief description of the drawings
[0022]
FIG. 1 is a functional block diagram for explaining a speech recognition method according to an embodiment of the present invention.
[0023]
FIG. 2 is a set of graphs showing the results of an embodiment of the present invention, in which (a) shows the relationship between the estimated x-axis position and the frame number, (b) shows the relationship between the estimated y-axis position and the frame number, (c) shows the time waveform of the original speech signal, (d) shows the time waveform of the original speech signal with noise added, and the last panel shows the time waveform of the speech signal obtained by the present method.
[0024]
Explanation of signs
[0025]
DESCRIPTION OF SYMBOLS
1: sound source
2: receiving unit
3: sound source position estimation unit
4: delay-and-sum array unit
5: speech recognition unit
6: pitch extraction unit
7: speech synthesis unit