Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP6345327
Abstract: To accurately extract speech and improve the accuracy of speech recognition. A voice
extraction apparatus according to the present application includes a forming unit, an acquiring
unit, an emphasizing unit, a generating unit, and a selecting unit. The forming unit forms
directivity in advance, by beam forming processing, for each microphone in a microphone array in
which a plurality of channels are formed by having a plurality of microphones. The acquiring
unit acquires the observation signal, which is the signal of the sound received by each channel. The
emphasizing unit emphasizes the observation signal of each channel according to the directivity of
each microphone formed by the forming unit to generate an emphasis signal. The
generating unit generates, for each channel, a frequency distribution of the amplitudes of the
emphasis signal generated by the emphasizing unit. The selecting unit selects, from among the
channels, the channel corresponding to the speech signal used for speech recognition, based on the
frequency distribution corresponding to each channel generated by the generating unit. [Selected
figure] Figure 4
Speech extraction apparatus, speech extraction method and speech extraction program
[0001]
The present invention relates to a voice extraction device, a voice extraction method, and a voice
extraction program.
[0002]
Recently, devices equipped with a voice UI (User Interface) have attracted attention worldwide.
Such a device is assumed to be used in an environment where the distance between the speaker and
the device is long, and in such an environment the performance of speech recognition degrades
when the influence of noise and reverberation is strong. Therefore, since the accuracy of voice
recognition is important in a device equipped with such a voice UI and in a system including the
device, a configuration that is robust against noise, reverberation, and the like is required.
[0003]
As a device equipped with such a voice UI, there have been proposed, for example, devices that
use a plurality of microphones to perform sound source localization, which estimates the direction
of the speaker, and beam forming processing, which enhances the speech coming from the direction
of the speaker estimated by the sound source localization.
[0004]
Patent Document 1: Japanese Patent Application Laid-Open No. 2002-091469
Patent Document 2: Japanese Published Patent Application No. 2014-510481
[0005]
However, in the above-mentioned prior art, when an error occurs in the estimation of the sound
source localization, the observed speech is distorted, and there is a problem that the performance
of speech recognition is deteriorated.
[0006]
The present application has been made in view of the above, and an object of the present
invention is to provide a voice extraction device, a voice extraction method, and a voice extraction
program that can appropriately extract voice and improve the accuracy of voice recognition.
[0007]
A voice extraction apparatus according to the present invention includes: a forming unit that forms
directivity in advance, by beam forming processing, for each microphone of a microphone array in
which a plurality of channels are formed by having a plurality of microphones; an acquiring unit
that acquires an observation signal, which is a signal of the sound received by each channel; an
emphasizing unit that generates an emphasis signal by emphasizing the observation signal of each
channel according to the directivity of each microphone formed by the forming unit; a generating
unit that generates, for each channel, a frequency distribution of the amplitude of the emphasis
signal generated by the emphasizing unit; and a selecting unit that selects, from among the
channels, a channel corresponding to the audio signal used for speech recognition, based on the
frequency distribution corresponding to each channel generated by the generating unit.
[0008]
According to one aspect of the embodiment, it is possible to appropriately extract speech to
improve the accuracy of speech recognition.
[0009]
FIG. 1 is a diagram showing an example of extraction processing according to the embodiment.
FIG. 2 is a diagram showing an example of a conventional speech recognition system.
FIG. 3 is a diagram showing an example of the configuration of the speech recognition system
according to the embodiment.
FIG. 4 is a diagram showing an example of the configuration of the speech extraction apparatus
according to the embodiment.
FIG. 5 is a diagram illustrating an example of a calculation result storage unit according to the
embodiment.
FIG. 6 is a diagram for explaining the directivity of the microphone array device according to the
embodiment.
FIG. 7 is a diagram showing an example of the frequency distribution of the amplitude of the
emphasis signal according to the embodiment. FIG. 8 is a diagram showing an example of a
method of calculating the kurtosis of the frequency distribution of the amplitude according to the
embodiment. FIG. 9 is a diagram showing an example of the configuration of the recognition device
according to the embodiment. FIG. 10 is a flowchart showing an example of processing of the
speech recognition system according to the embodiment. FIG. 11 is a diagram illustrating an
example of the extraction process according to the modification. FIG. 12 is a diagram showing an
example of the configuration of a voice extraction device according to the modification. FIG. 13 is a
diagram illustrating an example of a hardware configuration of a computer for realizing the
functions of the voice extraction device. FIG. 14 is a diagram showing an example of verification
results of the character correct accuracy of each system. FIG. 15 is a diagram showing an example
of the processing time of the extraction process of each system.
[0010]
Hereinafter, a voice extraction apparatus, a voice extraction method, and a mode for
implementing a voice extraction program according to the present application (hereinafter,
referred to as “embodiment”) will be described in detail with reference to the drawings. Note
that the speech extraction apparatus, speech extraction method, and speech extraction program
according to the present application are not limited by this embodiment. Moreover, the
embodiments can be combined as appropriate as long as the processing contents do not contradict
each other. In the following embodiments, the same parts are denoted by the same reference
numerals, and duplicate descriptions are omitted.
[0011]
〔1. Extraction Process〕 FIG. 1 is a diagram showing an example of the extraction process
according to the embodiment. An example of the extraction process according to the present
embodiment will be described with reference to FIG. 1. FIG. 1 shows an example in which the voice
extraction device 20 according to the present invention emphasizes, for a voice signal based on the
voice of the speaker received by the microphone array device 10 (hereinafter sometimes referred
to as the "observation signal"), the observation signal of each channel according to the directivity
formed in advance corresponding to each microphone of the microphone array device 10, selects a
channel based on the kurtosis of the frequency distribution of the amplitude of the emphasis
signal, and executes extraction processing for outputting the observation signal corresponding to
the selected channel. Here, a channel refers to each sound receiving unit that receives the voice of
the speaker in the microphone array device 10; specifically, it corresponds to each microphone for
which directivity is formed as described above.
[0012]
As described above, the voice extraction device 20 shown in FIG. 1 is an apparatus that
emphasizes the observation signal of each channel, with respect to an observation signal based on
voice received by the microphone array device 10, according to the directivity previously formed
corresponding to each microphone of the microphone array device 10, selects a channel based on
the kurtosis of the frequency distribution of the amplitude of the emphasis signal, and extracts and
outputs the observation signal corresponding to the selected channel. As shown in FIG. 1, the
speech extraction apparatus 20 has a directivity formation / emphasis function 61 and a channel
selection function 62 as its functions.
[0013]
The microphone array device 10 shown in FIG. 1 has a plurality of microphones for receiving the
sound of the surrounding environment, and transmits the sound received by each microphone to
the voice extraction device 20 as an observation signal. For example, as shown in FIG. 1, the
microphones of the microphone array device 10 are arranged in a circle at equal intervals in the
casing of the device body.
[0014]
The example illustrated in FIG. 1 shows eight microphones arranged in a circle at equal intervals
as the plurality of microphones included in the microphone array device 10, but the present
invention is not limited to this. That is, the plurality of microphones may be disposed, for example,
in a rectangular shape, or may be disposed three-dimensionally instead of on the same plane.
[0015]
Further, the microphone array device 10 is not limited to being configured as a single device
provided with a plurality of microphones; for example, the plurality of microphones may be
arranged independently rather than integrated into one device. For example, a plurality of
microphones may be individually arranged on the walls of a room where the speaker is present. In
either case, however, the relative positional relationship of the microphones needs to be
determined in advance.
[0016]
In the following example, the microphone array device 10 is described as having eight
microphones.
[0017]
The recognition device 30 shown in FIG. 1 is a server device that receives the observation signal
output by the extraction processing of the speech extraction device 20, executes speech
recognition processing on the observation signal, converts it into text representing the content of
the observation signal, and outputs the text.
[0018]
FIG. 2 is a diagram showing an example of a conventional speech recognition system.
Here, with reference to FIG. 2, an outline of processing of the conventional speech recognition
system will be described.
As shown in FIG. 2, the conventional speech recognition system shown as an example includes,
for example, a microphone array device 110, a speech extraction device 120, and a recognition
device 130.
[0019]
The microphone array device 110 has a function similar to that of the microphone array device
10 according to the present embodiment described above: it has a plurality of microphones that
receive the sound of the surrounding environment, and transmits the sound received by each
microphone to the speech extraction device 120 as an observation signal.
[0020]
The voice extraction device 120 is an apparatus that estimates the direction of the sound source
by sound source localization from the observation signals based on the sound received by each
microphone of the microphone array device 110, forms directivity in the estimated direction by
beam forming processing, and emphasizes the observation signal based on the formed directivity
to generate (extract) an emphasis signal.
As shown in FIG. 2, the speech extraction apparatus 120 has a sound source localization function
161 and a directivity formation / emphasis function 162 as its functions.
[0021]
The sound source localization function 161 is a function of estimating the direction of the sound
source from the observation signal based on the sound received by each microphone of the
microphone array device 110 by sound source localization. As a method of sound source
localization, for example, MUSIC (MUltiple SIgnal Classification) method, GCC-PHAT (Generalized
Cross-Correlation methods with PHAse Transform) and the like can be mentioned. The MUSIC
method is a method of estimating the sound source direction using a spatial correlation matrix of
noise and an array manifold vector recorded in advance. GCC-PHAT is a method of estimating the
sound source direction by calculating the cross-correlation function of the observation signals of
the microphones in the frequency domain. As compared with the viewpoint of operation load,
since the MUSIC method needs to perform eigenvalue expansion of the spatial correlation matrix,
GCC-PHAT can reduce the operation processing load.
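To make the GCC-PHAT idea concrete, the following is a minimal illustrative sketch, not taken from the patent, of the cross-correlation-based delay estimate between two microphone signals, written in Python with NumPy; the function name and the small regularization constant are assumptions of this example.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs, max_tau=None):
    """Estimate the time delay between two microphone signals with GCC-PHAT.

    The cross-power spectrum is normalized by its magnitude (the PHAse
    Transform), so only phase information contributes to the correlation.
    """
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # delay in seconds
```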
[0022]
The directivity formation / emphasis function 162 is a function that forms directivity toward the
sound source (speaker) estimated by the sound source localization function 161 by beam forming
processing, and emphasizes the observation signal based on the formed directivity to generate an
emphasis signal. Examples of the beam forming process include the DS (Delay-and-Sum) method
(delay-sum method) and MVDR (Minimum Variance Distortionless Response). MVDR suppresses
noise in the surrounding environment under the constraint that distortion in the direction in which
directivity is formed is small, and it is known to be effective for speech recognition if the sound
source direction can be estimated correctly; however, since the inverse of the spatial correlation
matrix of the noise needs to be estimated, its amount of computation is high. On the other hand,
the DS method is superior to MVDR in terms of the amount of computation because there is no
need to estimate the spatial correlation matrix, so if the aim is to reduce the computational
processing load, it is desirable to adopt the DS method.
[0023]
The recognition device 130 is a device that receives the emphasis signal extracted and output by
the speech extraction device 120, executes speech recognition processing on the emphasis signal,
converts it into text representing the content of the signal, and outputs the converted text. The
speech recognition process uses, for example, a recognition algorithm based on a deep neural
network or a hidden Markov model.
[0024]
In the conventional speech recognition system as described above, every time an observation
signal is received from the microphone array device 110, sound source localization is performed
to estimate the direction of the sound source (speaker), and directivity in that direction is formed
by beam forming processing, so there is a problem that the processing load is large.
Furthermore, if an error occurs in the estimation of the sound source localization, the
emphasis signal generated from the observation signal is distorted, and there is also a
problem that the performance of speech recognition is degraded.
[0025]
Therefore, the speech recognition system 1 according to the present embodiment executes the
process described below (in particular, the extraction process by the speech extraction apparatus
20): directivity is formed in advance so that speech coming from the direction directly facing each
microphone of the microphone array device 10 is emphasized, the observation signal of each
channel is emphasized according to that directivity, a channel is selected based on the kurtosis of
the frequency distribution of the amplitude of the emphasis signal, and the observation signal
corresponding to the selected channel is extracted. As a result, it is not necessary to form
directivity each time an observation signal is received, and instead of the estimation of the sound
source direction by sound source localization in the conventional voice recognition system
described above, the channel is selected based on the frequency distribution of the amplitude
generated from the emphasis signal of each channel emphasized according to the formed
directivity (specifically, the kurtosis calculated from the distribution). Although this channel
selection corresponds to the sound source localization function of the conventional speech
recognition system, it is not necessary to execute the above-described sound source localization
processing with its high computational load. Therefore, compared with the conventional speech
recognition system, the load of arithmetic processing can be reduced, and since distortion of the
signal can be suppressed by appropriately extracting the speech, the accuracy of speech
recognition can be improved. Hereinafter, referring back to FIG. 1, an example of the
process of the speech recognition system 1 according to the present embodiment (in particular,
the extraction process of the speech extraction apparatus 20) will be described along its flow.
[0026]
The voice extraction device 20 forms directivity in advance so that voice coming from a direction
directly facing each microphone of the microphone array device 10 is emphasized (step S11).
The specific content of the formation of directivity will be described later with reference to FIG. 6.
[0027]
The speaker U01 then speaks toward the microphone array device 10 in a state in which
directivity has been formed in advance in the direction directly facing each microphone (each
channel) of the microphone array device 10 (step S12). Then, the microphone array device 10
transmits the sound received by each microphone as an observation signal to the voice extraction
device 20 (step S13).
[0028]
When the speech extraction device 20 receives the observation signals from the microphone array
device 10, it emphasizes, as the directivity formation / emphasis function 61, the observation
signal of each channel according to the directivity formed in advance to generate an emphasis
signal (step S14). Further, as the channel selection function 62, the speech extraction apparatus 20
generates, for each channel, the frequency distribution of the amplitude of the emphasis signal
(for example, the frequency distribution 51 shown in FIG. 1) based on the emphasis signal of each
channel (step S15). In addition, as the channel selection function 62, the speech extraction
apparatus 20 calculates the kurtosis of the frequency distribution of the amplitude of the
generated emphasis signal of each channel (step S16). At this time, as in the calculation result
information 52 shown in FIG. 1, the speech extraction device 20 stores the calculated kurtosis of
each channel in association with the microphone ID of each microphone of the microphone array
device 10.
[0029]
Further, as the channel selection function 62, the voice extraction device 20 selects a channel for
outputting an observation signal to the recognition device 30 based on the calculated kurtosis of
each channel (step S17). Specifically, the speech extraction apparatus 20 selects the channel
corresponding to the largest kurtosis among the kurtosis values of the channels. At this time, the
voice extraction device 20 stores the selection flag in association with the microphone ID of the
microphone of the microphone array device 10, as shown in the calculation result information 52
in FIG. 1. The speech extraction device 20 then extracts, from the observation signals of the sound
received by each microphone of the microphone array device 10, the observation signal
corresponding to the channel selected in step S17, and outputs it to the recognition device 30
(step S18).
[0030]
The recognition device 30 executes speech recognition processing on the observation signal
received (input) from the speech extraction device 20, and converts it into text (step S19). Then,
the recognition device 30 outputs the text converted (generated) from the observation signal to
an external device that uses the text (step S20).
[0031]
By the processing of the speech recognition system 1 as described above, it is not necessary to
form directivity every time an observation signal is received, and instead of the estimation of the
sound source direction by sound source localization in the conventional speech recognition
system described above, channels are selected based on the frequency distribution of the
amplitude generated from the emphasis signal of each channel emphasized according to the
formed directivity (specifically, the kurtosis calculated from the distribution). Therefore, compared
with the conventional speech recognition system, the load of arithmetic processing can be
reduced, and since distortion of the signal can be suppressed by appropriately extracting the
speech, the accuracy of speech recognition can be improved.
[0032]
Hereinafter, the speech extraction apparatus 20 performing such processing and the
configuration of the speech recognition system 1 including the speech extraction apparatus 20
will be described in detail.
[0033]
〔2. Configuration of Speech Recognition System〕 FIG. 3 is a diagram showing an example of the
configuration of the speech recognition system according to the embodiment. The configuration of
the speech recognition system 1 according to the present embodiment will be described with
reference to FIG. 3.
[0034]
As shown in FIG. 3, the speech recognition system 1 according to the present embodiment
includes a microphone array device 10, a speech extraction device 20, and a recognition device
30. The microphone array device 10 is connected to the voice extraction device 20, and
transmits the received voice signal to the voice extraction device 20. The voice extraction device
20 is communicably connected to the recognition device 30 by wire or wireless via the network
N.
[0035]
Although the voice recognition system 1 shown in FIG. 3 includes one microphone array device
10 and one voice extraction device 20, the present invention is not limited to this, and a plurality
of microphone array devices 10 and a plurality of voice extraction devices 20 may be included.
Further, a plurality of microphone array devices 10 may be connected to one voice extraction
device 20. Further, FIG. 3 shows an example in which the microphone array device 10 is directly
connected to the voice extraction device 20; however, the present invention is not limited thereto,
and the two may be communicably connected wirelessly, or communicably connected via a wired
or wireless network.
[0036]
The microphone array device 10 is a device that has a plurality of microphones for receiving the
sound of the surrounding environment and transmits the sound received by each microphone to
the voice extraction device 20 as an observation signal. The microphones of the microphone
array device 10 are, for example, arranged in a circle at equal intervals as shown in FIG. 1.
[0037]
The voice extraction device 20 is a device that executes the above-described extraction
processing based on an observation signal based on voice received by each microphone of the
microphone array device 10. The voice extraction device 20 is realized by, for example, a
computer such as a PC (Personal Computer), a workstation, or a dedicated device.
[0038]
The recognition device 30 is a device that receives the observation signal output by the extraction
processing of the speech extraction device 20, executes speech recognition processing on the
observation signal, converts the observation signal into text, and outputs the converted text. The
recognition device 30 is realized by, for example, a computer such as a PC or a workstation.
[0039]
Although the microphone array device 10, the voice extraction device 20, and the recognition
device 30 are shown as independent devices in FIG. 3, the microphone array device 10 and the
voice extraction device 20 may be integrated into one device, the voice extraction device 20 and
the recognition device 30 may be integrated into one device, or the microphone array device 10,
the voice extraction device 20, and the recognition device 30 may all be integrated and configured
as one device.
[0040]
〔3. Configuration of Speech Extraction Device〕 FIG. 4 is a diagram showing a configuration
example of the speech extraction device according to the embodiment. The configuration of the
speech extraction apparatus 20 according to the present embodiment will be described with
reference to FIG. 4.
[0041]
As shown in FIG. 4, the speech extraction apparatus 20 according to the present embodiment
includes a communication unit 210, a storage unit 220, a control unit 230, and a communication
unit 240. The voice extraction device 20 may also include an input unit (for example, a mouse or
a keyboard) that receives various operations from an administrator or the like who uses the voice
extraction device 20, and a display unit (for example, a liquid crystal display or an organic EL
(Electro-Luminescence) display) that displays various information.
[0042]
(Regarding Communication Unit 210) The communication unit 210 is a functional unit that
communicates information with the microphone array device 10. Specifically, the communication
unit 210 receives, for example, the voice of the speaker received by the microphone array device
10 as a voice signal (observation signal). The communication unit 210 is realized by a
communication I / F 1600 (for example, a USB (Universal Serial Bus) interface or the like) shown
in FIG. 13 described later.
[0043]
(Regarding Storage Unit 220) The storage unit 220 is a functional unit that stores various
information used for the processing of the voice extraction device 20. The storage unit 220
stores, for example, the parameters that determine the directivity formed by the formation unit
232 of the control unit 230 described later, information on the frequency distribution of the
amplitude of the emphasis signal generated by the generation unit 234, and the kurtosis of the
frequency distribution calculated by the calculation unit 235. As shown in FIG. 4, the storage unit
220 includes a calculation result storage unit 221. The storage unit 220 is realized by a RAM
(Random Access Memory) 1200 shown in FIG. 13 described later, an auxiliary storage device
1400 (an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like), or a recording medium
1900 (a DVD-RW (Digital Versatile Disc Rewritable) or the like).
[0044]
(Regarding Calculation Result Storage Unit 221) The calculation result storage unit 221 stores,
for example, the kurtosis of the frequency distribution calculated by the calculation unit 235
described later.
[0045]
FIG. 5 is a diagram illustrating an example of a calculation result storage unit according to the
embodiment.
An example of the calculation result storage unit 221 according to the present embodiment will
be described with reference to FIG. 5. In the example illustrated in FIG. 5, the calculation result
storage unit 221 stores the “microphone ID (CH)”, the “kurtosis”, and the “selection flag”
in association with each other.
[0046]
The “microphone ID (CH)” is information for identifying each microphone (i.e., channel) of the
microphone array device 10. As described above, when the microphone array device 10 has eight
microphones, identification information “1” to “8” is assigned to each microphone (each
channel), for example, as shown in FIG. 5.
[0047]
"Kartosis" is a value indicating the kurtosis for the frequency distribution of the amplitude of the
emphasis signal emphasized according to the directivity formed for the corresponding channel.
The calculation method of "cartosis" will be described later with reference to FIG.
[0048]
The “selection flag” is flag information indicating which channel has been selected by the
selection unit 236 of the control unit 230 described later. In the example shown in FIG. 5, “1”
indicates that the corresponding channel has been selected, and “0” indicates that the
corresponding channel has not been selected. That is, since the selection unit 236 selects the
channel with the largest kurtosis as described later, the figure indicates that channel “5”, which
has the largest kurtosis “2.29” shown in FIG. 5, is selected.
[0049]
That is, the example of the calculation result storage unit 221 illustrated in FIG. 5 indicates that,
for the microphone ID (CH) “5”, the kurtosis is “2.29” and the selection flag is “1”.
[0050]
The configuration of the calculation result storage unit 221 illustrated in FIG. 5 is an example,
and other information may be included.
For example, the calculation result storage unit 221 may store information of the date and time
when the channel is selected by the selection unit 236 in association with the above-described
information.
[0051]
Further, although the calculation result storage unit 221 shown in FIG. 5 holds information in the
form of a table, it is not limited to this; the information may be in any format as long as the values
of the fields of the table can be managed in association with each other.
[0052]
(Regarding Control Unit 230) The control unit 230 is a functional unit that controls the overall
operation of the voice extraction device 20.
As shown in FIG. 4, the control unit 230 includes an acquisition unit 231, a formation unit 232, an
emphasizing unit 233, a generation unit 234, a calculation unit 235, a selection unit 236, and an
output unit 237. The control unit 230 is realized by the CPU (Central Processing Unit) 1100
shown in FIG. 13, described later, executing a program stored in the ROM (Read Only Memory)
1300, the auxiliary storage device 1400, or the like, using the RAM 1200 as a work area.
[0053]
Note that some or all of the above-described functional units of the control unit 230 may be
realized not by a program, which is software, but by a hardware circuit such as a
field-programmable gate array (FPGA) or an application specific integrated circuit (ASIC).
[0054]
Also, each functional unit of the control unit 230 shown in FIG. 4 conceptually represents a
function, and the configuration is not limited to this.
For example, a plurality of functional units illustrated as independent functional units of the
control unit 230 in FIG. 4 may be configured as one functional unit. Conversely, the function
possessed by one functional unit of the control unit 230 in FIG. 4 may be divided and configured
as a plurality of functional units.
[0055]
(Regarding Acquisition Unit 231) The acquisition unit 231 is a functional unit that acquires, as an
observation signal, the sound received by each microphone (each channel) of the microphone
array device 10 through the communication unit 210. The acquisition unit 231 sends the
acquired observation signals of each microphone to the emphasis unit 233.
[0056]
(Regarding Forming Unit 232) The forming unit 232 is a functional unit that forms directivity so
that voice coming from a direction directly facing each microphone of the microphone array
device 10 is emphasized. The forming unit 232 forms directivity corresponding to each
microphone in advance as the first process of the extraction process by the speech extraction
apparatus 20. The function of the formation unit 232 is included in the directivity formation /
emphasis function 61 shown in FIG. 1 described above.
[0057]
Here, the formation of directivity refers to a process (beam forming processing) of determining
parameters that enhance speech coming from the direction directly facing each microphone.
Specifically, in order to emphasize voice coming from the direction directly facing a specific
microphone, a process may be performed of adding an arbitrary delay to the observation signal of
the voice received by each microphone and then weighting and adding the signals (for example,
maximizing the weight of the observation signal of the sound received by the specific microphone
and minimizing the weight of the observation signal of the sound received by the microphone
arranged at the position farthest from the specific microphone). In this way, the process of
determining specific values of the weights used as parameters for each observation signal in the
process of emphasizing speech coming from the direction directly facing a specific microphone is
referred to as the formation of directivity. Then, as will be described later, the emphasizing unit
233 is a functional unit that emphasizes the voice coming from the direction facing the specific
microphone by using the parameters determined by the formation of the directivity.
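For a sense of what forming directivity in advance could look like for the circular array of FIG. 6, the sketch below precomputes, under far-field and uniform-circular-array assumptions (the array radius and sound speed are made-up values, not from the patent), one table of per-channel delays for each of the eight facing directions; these delays would then serve as the fixed parameters applied by the emphasizing unit.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s (assumed)
RADIUS = 0.05            # array radius in meters (assumed, not from the patent)
NUM_MICS = 8

# Microphone positions on a circle at equal angular intervals, as in FIG. 6.
angles = 2 * np.pi * np.arange(NUM_MICS) / NUM_MICS
mic_xy = RADIUS * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def steering_delays(look_angle):
    """Far-field per-channel delays (seconds) aligning a plane wave from look_angle."""
    direction = np.array([np.cos(look_angle), np.sin(look_angle)])
    projections = mic_xy @ direction
    # Mics with a larger projection toward the source receive the wavefront
    # earlier, so they are delayed more to line all channels up.
    return (projections - projections.min()) / SPEED_OF_SOUND

# Formed in advance: one delay table per facing direction (500a to 500h in FIG. 6).
delay_tables = {ch: steering_delays(angles[ch]) for ch in range(NUM_MICS)}
```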
[0058]
The beamforming processing, which is the formation of the directivity, may be performed by a
known method such as the above-described DS method or MVDR. However, if the purpose is to
reduce the processing load by calculation, it is preferable to use the DS method.
[0059]
FIG. 6 is a diagram for explaining the directivity of the microphone array device according to the
embodiment. The formation of directivity by the forming unit 232 according to the present
embodiment will be described with reference to FIG.
[0060]
As shown in FIG. 6, the microphone array apparatus 10 according to the present embodiment
includes microphones 10a to 10h as its plurality of microphones. The forming unit 232 performs,
for example, a process of determining parameters such that the voice coming from the directivity
forming direction 500a, which is the direction directly facing the microphone 10a of the
microphone array device 10, is emphasized. Similarly, the forming unit 232 performs a process of
determining parameters such that voices coming from the directivity forming directions 500b to
500h, which are the directions directly facing the microphones 10b to 10h respectively, are
emphasized.
[0061]
(Regarding Emphasizing Unit 233) The emphasizing unit 233 is a functional unit that emphasizes
the observation signal of each channel according to the directivity formed by the forming unit
232 to generate an emphasis signal. Specifically, when emphasizing the voice directly facing a
specific microphone (channel), the emphasizing unit 233 weights and adds the observation signals
of the voice received on each channel using the parameters determined by the forming unit 232,
thereby emphasizing the voice directly facing that particular microphone. Hereinafter,
emphasizing the voice facing the microphone of a specific channel using the observation signals of
the voice received by each channel and the parameters corresponding to the directivity of that
specific channel determined by the forming unit 232 may be referred to simply as emphasizing
the observation signal of that channel. The emphasizing unit 233 then sends the generated
emphasis signal of each channel to the generating unit 234. The function of the emphasizing unit
233 is included in the directivity formation / emphasis function 61 shown in FIG. 1 described
above.
[0062]
(Regarding Generation Unit 234) The generation unit 234 is a functional unit that generates the
frequency distribution of the amplitude of the enhancement signal for each channel based on the
enhancement signal of each channel enhanced by the enhancement unit 233. The generation
unit 234 causes the storage unit 220 to store information on the frequency distribution of the
amplitude of the emphasis signal generated for each channel. The function of the generation unit
234 is included in the channel selection function 62 shown in FIG. 1 described above.
[0063]
FIG. 7 is a diagram showing an example of the frequency distribution of the amplitude of the
emphasis signal according to the embodiment. The frequency distribution of the amplitude
generated by the generation unit 234 will be described with reference to FIG. 7.
[0064]
The emphasis signal, which is a voice signal generated by the emphasizing unit 233, contains
signals of various frequency components. The graph shown in FIG. 7 is a distribution generated
by counting, for each frequency bin, how often signals of each amplitude occur, for example at
equally spaced instants within a predetermined time. The generation unit 234 thus generates the
frequency distribution (histogram) of the amplitude of the emphasis signal shown in FIG. 7 for
each frequency bin. Then, the generation unit 234 sends the information on the frequency
distribution of the amplitude of the emphasis signal of each generated channel to the calculation
unit 235.
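A minimal sketch of generating such per-bin amplitude histograms, assuming Python with NumPy and SciPy; the STFT settings and the bin count are arbitrary choices for illustration, not values from the patent.

```python
import numpy as np
from scipy.signal import stft

def amplitude_histograms(emphasis_signal, fs, num_bins=50):
    """Histogram of |X(i, j)| over frames i, for each frequency bin j."""
    _, _, X = stft(emphasis_signal, fs=fs)   # X has shape (freq_bins, frames)
    amplitudes = np.abs(X)
    histograms = []
    for j in range(amplitudes.shape[0]):
        counts, edges = np.histogram(amplitudes[j], bins=num_bins)
        histograms.append((counts, edges))
    return histograms
```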
[0065]
(Regarding Calculation Unit 235) The calculation unit 235 is a functional unit that calculates a
kurtosis for the frequency distribution of the amplitude of the emphasis signal of each channel
generated by the generation unit 234. Here, the kurtosis is a value indicating the sharpness of the
distribution shape at and around the peak of the frequency distribution of the amplitude (for
example, the frequency distribution shown in FIG. 7).
[0066]
For example, the frequency distribution shown in FIG. 7B is a distribution in which the peak and
its vicinity are sharp and the peak is at a position high relative to the tails, whereas the frequency
distribution shown in FIG. 7A is such that the peak and its vicinity are rounded and the peak is
not much higher than the tails. In this case, the kurtosis calculated for the frequency distribution
of FIG. 7B is higher than the kurtosis calculated for the frequency distribution of FIG. 7A.
[0067]
FIG. 8 is a diagram showing an example of a method of calculating the kurtosis of the frequency
distribution of the amplitude according to the embodiment. An example of the method by which
the calculation unit 235 calculates the kurtosis of the frequency distribution of the amplitude will
be described with reference to FIG. 8.
[0068]
First, the generation unit 234 performs an STFT (Short-Time Fourier Transform) on the emphasis
signal generated by the emphasizing unit 233 (step S21). As a result, the frequency components
are extracted from the emphasis signal. In the example shown in FIG. 8, the components of J
frequency bins are extracted. Then, the generation unit 234 obtains the amplitude spectra
|X(i, 0)|, |X(i, 1)|, ..., |X(i, J)| (step S22). The generation unit 234 generates the above-described
frequency distribution of the amplitude from the amplitude spectrum of each frequency bin.
[0069]
Next, the calculation unit 235 calculates the kurtosis for each frequency bin from the frequency
distribution of the amplitude based on the amplitude spectrum (step S23). The calculation unit
235 calculates the kurtosis for each frequency bin, for example, according to the following
equation (1).
[0070]
$$K_j = \frac{M\!\left[\,|X(i,j)|^4\,\right]}{\left(M\!\left[\,|X(i,j)|^2\,\right]\right)^2} \qquad (1)$$
[0071]
In equation (1), K_j is the kurtosis corresponding to the j-th frequency bin, |X(i, j)| is the amplitude
spectrum in the i-th frame, and M[x^n] is the n-th moment.
The moment M[x^n] is defined by the following equation (2).
[0072]
$$M\!\left[x^n\right] = \int x^n\, p(x)\, dx \qquad (2)$$
[0073]
In equation (2), p (x) is a probability density function that follows the distribution of variable x.
[0074]
Then, the calculation unit 235 calculates the average value (K̄) of the kurtosis values calculated
for each frequency bin according to the following equation (3) (step S24), and takes this average
value as the kurtosis of the frequency distribution of the amplitude corresponding to the channel
of interest.
[0075]
$$\bar{K} = \frac{1}{J}\sum_{j=1}^{J} K_j \qquad (3)$$
[0076]
The calculation unit 235 executes the calculation processing of steps S21 to S24 described above
for each channel.
The calculation unit 235 stores the kurtosis corresponding to each calculated channel in the
calculation result storage unit 221.
Specifically, as shown in FIG. 5, the calculation unit 235 stores the calculated kurtosis of each
channel in the calculation result storage unit 221 in association with the microphone ID of each
microphone of the microphone array device 10.
The function of the calculation unit 235 is included in the channel selection function 62 shown in
FIG. 1 described above.
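Putting steps S21 to S24 and the selection of step S17 together, a hedged Python/NumPy sketch might look as follows. Note that it computes the moments of equation (1) directly from the amplitude samples, which is equivalent to computing them from the amplitude histogram; the helper names are invented for this example.

```python
import numpy as np
from scipy.signal import stft

def channel_kurtosis(emphasis_signal, fs):
    """Average per-bin kurtosis of the amplitude spectrum (steps S21 to S24)."""
    _, _, X = stft(emphasis_signal, fs=fs)       # step S21: STFT
    amp = np.abs(X)                              # step S22: amplitude spectra
    m2 = np.mean(amp ** 2, axis=1)               # second moment per frequency bin
    m4 = np.mean(amp ** 4, axis=1)               # fourth moment per frequency bin
    k_per_bin = m4 / (m2 ** 2 + 1e-12)           # step S23: equation (1)
    return float(np.mean(k_per_bin))             # step S24: equation (3)

def select_channel(emphasis_signals, fs):
    """Return the index of the channel with the largest kurtosis (step S17)."""
    scores = [channel_kurtosis(sig, fs) for sig in emphasis_signals]
    return int(np.argmax(scores))
```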
[0077]
(Regarding Selection Unit 236) The selection unit 236 is a functional unit that selects the channel
for outputting an observation signal to the recognition device 30 based on the kurtosis of each
channel calculated by the calculation unit 235.
Specifically, the selection unit 236 selects the channel corresponding to the largest kurtosis among
the kurtosis values of the channels. As shown in FIG. 5, the selection unit 236 stores the selection
flag in the calculation result storage unit 221 in association with the microphone ID of the
microphone of the microphone array device 10. The function of the selection unit 236 is included
in the channel selection function 62 shown in FIG. 1 described above.
[0078]
The reason for using the kurtosis to select the channel that outputs the observation signal in this
way is as follows. The distribution of a speech signal follows a distribution such as the Laplace
distribution, whereas the distribution of a signal in which a plurality of noise sources are mixed
has the property of being close to a normal distribution. That is, for each channel, the kurtosis of
the frequency distribution of the amplitude of the audio signal (here, the emphasis signal) is
expected to be higher when speech corresponding to that channel is present than when speech is
absent.
[0079]
(Regarding Output Unit 237) The output unit 237 is a functional unit that extracts, from the
observation signals of the sound received by each microphone of the microphone array device 10,
the observation signal corresponding to the channel selected by the selection unit 236, and
outputs it to the recognition device 30 via the communication unit 240.
The function of the output unit 237 is included in the channel selection function 62 shown in
FIG. 1 described above.
[0080]
(Regarding Communication Unit 240) The communication unit 240 is a functional unit that
communicates information with the recognition device 30. Specifically, the communication unit
240 transmits an observation signal corresponding to the channel selected by the selection unit
236 to the recognition device 30 via the network N, for example, by the function of the output
unit 237. The communication unit 240 is realized by a network I / F 1500 (for example, a NIC
(Network Interface Card) etc.) shown in FIG. 13 described later.
[0081]
〔4. Configuration of Recognition Device〕 FIG. 9 is a diagram showing a configuration example of
the recognition device according to the embodiment. The configuration of the recognition device
30 according to the present embodiment will be described with reference to FIG. 9.
[0082]
As shown in FIG. 9, the recognition device 30 according to the present embodiment includes a
communication unit 310, a storage unit 320, and a control unit 330. The recognition device 30
may also include an input unit (for example, a mouse or a keyboard) that receives various
operations from an administrator or the like who uses the recognition device 30, and a display
unit (for example, a liquid crystal display or an organic EL display) that displays various
information.
[0083]
(Regarding Communication Unit 310) The communication unit 310 is a functional unit that
communicates information with the voice extraction device 20 according to the present
embodiment. Specifically, the communication unit 310 receives, via the network N, the
observation signal output after the voice extraction device 20 performs the extraction process on
the observation signals of the sound received by the microphone array device 10. The
communication unit 310 is realized by a network I/F 1500 (for example, a NIC (Network Interface
Card)) shown in FIG. 13 described later.
[0084]
(Regarding Storage Unit 320) The storage unit 320 is a functional unit that stores various
information to be used for the process of the recognition device 30. The storage unit 320 stores,
for example, data of observation signals acquired by an acquisition unit 331 of the control unit
330 described later, data of texts generated by speech recognition processing by the speech
recognition unit 332, and the like. The storage unit 320 is realized by at least one of a RAM 1200
shown in FIG. 13 described later, an auxiliary storage device 1400 (HDD or SSD etc.), or a
recording medium 1900 (DVD-RW etc).
[0085]
(Regarding Control Unit 330) The control unit 330 is a functional unit that controls the operation
of the entire recognition device 30. As shown in FIG. 9, the control unit 330 includes an
acquisition unit 331, a speech recognition unit 332, and an output unit 333. The control unit
330 is realized by the CPU 1100 shown in FIG. 13 described later executing programs stored in
the ROM 1300 and the auxiliary storage device 1400 etc. using the RAM 1200 as a work area.
[0086]
Note that some or all of the above-described functional units of the control unit 330 may be
realized by a hardware circuit such as an FPGA or an ASIC instead of a program that is software.
[0087]
Further, each functional unit of the control unit 330 shown in FIG. 9 conceptually represents a
function, and the configuration is not limited to this.
For example, a plurality of functional units illustrated as independent functional units of the
control unit 330 in FIG. 9 may be configured as one functional unit. Conversely, the function
possessed by one functional unit of the control unit 330 in FIG. 9 may be divided and configured
as a plurality of functional units.
[0088]
(Regarding Acquisition Unit 331) The acquisition unit 331 is a functional unit that acquires, via
the communication unit 310, the observation signal output after the voice extraction device 20
performs the extraction process on the observation signals of the sound received by the
microphone array device 10. The acquisition unit 331 sends the acquired observation signal to
the speech recognition unit 332.
[0089]
(Regarding the Speech Recognition Unit 332) The speech recognition unit 332 is a functional
unit that executes speech recognition processing on the observation signal acquired by the
acquisition unit 331 and converts it into text. Here, the speech recognition process may be
performed by a known algorithm such as a recognition algorithm using a deep neural network.
The speech recognition unit 332 sends the text converted from the observation signal to the
output unit 333.
[0090]
(Regarding Output Unit 333) The output unit 333 is a functional unit that outputs the text
converted from the observation signal by the voice recognition unit 332 to an external device
that uses the text via the communication unit 310. The text converted from the observation
signal by the speech recognition unit 332 does not necessarily have to be output to the outside,
and may be output to an application executed in the recognition device 30.
[0091]
〔5. Flow of Process〕 FIG. 10 is a flowchart showing an example of the processing of the speech
recognition system according to the embodiment. The flow of processing of the speech
recognition system 1 according to the present embodiment will be described with reference to
FIG. 10.
[0092]
(Step S101) The formation unit 232 of the speech extraction apparatus 20 forms directivity in
advance so that speech coming from a direction directly facing each microphone of the
microphone array device 10 is emphasized. Then, the process proceeds to step S102.
[0093]
(Step S102) When the microphone array device 10 receives the voice of the speaker with each
microphone (step S102: Yes), the process proceeds to step S103; when no voice is received (step
S102: No), the process ends.
[0094]
(Step S103) The microphone array device 10 transmits the sound received by each microphone
(each channel) to the sound extraction device 20 as an observation signal.
Then, the process proceeds to step S104.
[0095]
(Step S104) The emphasizing unit 233 of the voice extraction device 20 emphasizes the
observation signal of each channel acquired by the acquiring unit 231 in accordance with the
directivity formed by the forming unit 232 and generates an emphasizing signal. Then, the
process proceeds to step S105.
[0096]
(Step S105) Based on the emphasis signal of each channel emphasized by the emphasis unit 233,
the generation unit 234 of the speech extraction device 20 generates a frequency distribution of
the amplitude of the emphasis signal for each channel. Then, the process proceeds to step S106.
[0097]
(Step S106) The calculation unit 235 of the speech extraction device 20 calculates the kurtosis of
the frequency distribution of the amplitude of the emphasis signal of each channel generated by
the generation unit 234. At this time, the calculation unit 235 stores the kurtosis corresponding to
each calculated channel in the calculation result storage unit 221. Then, the process proceeds to
step S107.
[0098]
(Step S107) The selection unit 236 of the speech extraction device 20 selects the channel for
outputting an observation signal to the recognition device 30 based on the kurtosis of each
channel calculated by the calculation unit 235. Specifically, the selection unit 236 selects the
channel corresponding to the largest kurtosis among the kurtosis values of the channels. Then,
the output unit
237 of the voice extraction device 20 outputs the observation signal corresponding to the
channel selected by the selection unit 236 to the recognition device 30 via the communication
unit 240. Then, the process proceeds to step S108.
[0099]
(Step S108) The acquisition unit 331 of the recognition device 30 acquires, through the
communication unit 310, the observation signal output after the speech extraction device 20
executes the extraction process. The speech recognition unit 332 of the recognition device 30
executes speech recognition processing on the observation signal acquired by the acquisition
unit 331, and converts it into text. The output unit 333 of the recognition device 30 outputs the
text converted from the observation signal by the speech recognition unit 332, via the
communication unit 310, to an external device that uses the text. Then, the process ends.
[0100]
The processes of the speech recognition system 1 according to the present embodiment are
executed by steps S101 to S108 as described above. Specifically, after the directivity is formed by
the forming unit 232 in step S101, steps S102 to S108 are repeatedly executed.
[0101]
〔6. Modifications〕 The voice extraction device 20 described above may be implemented in
various forms other than the above-described embodiment. In the following, other embodiments
of the speech extraction device will be described.
[0102]
〔6−1. Output of Emphasis Signal〕 FIG. 11 is a diagram showing an example of extraction
processing according to a modification. In the above-described embodiment, an example was
shown in which the observation signal corresponding to the selected channel is output to the
recognition device 30. Here, a process of outputting the emphasis signal corresponding to the
selected channel to the recognition device 30 will be described with reference to FIG. 11.
[0103]
The speech extraction device 20a shown in FIG. 11 is an apparatus that emphasizes the
observation signal of each channel, with respect to the observation signals based on the sound
received by the microphone array device 10, according to the directivity formed in advance
corresponding to each microphone of the microphone array device 10, selects a channel based on
the kurtosis of the frequency distribution of the amplitude of the emphasis signal, and extracts
and outputs the emphasis signal corresponding to the selected channel. As shown in FIG. 11, the
speech extraction apparatus 20a has a directivity formation / emphasis function 61 and a channel
selection function 62a as its functions.
[0104]
The directivity formation / emphasis function 61 is a function similar to the directivity formation
/ emphasis function 61 of the speech extraction apparatus 20 shown in FIG.
[0105]
The channel selection function 62a is a function that selects a channel based on the kurtosis of
the frequency distribution of the amplitude of the emphasis signal generated by the directivity
formation / emphasis function 61, and extracts and outputs the emphasis signal corresponding to
the selected channel.
[0106]
FIG. 12 is a diagram showing an example of the configuration of a voice extraction device
according to a modification.
The configuration of the speech extraction device 20a according to the present modification will
be described with reference to FIG.
[0107]
As shown in FIG. 12, the speech extraction apparatus 20a according to the present modification
includes a communication unit 210, a storage unit 220, a control unit 230a, and a
communication unit 240.
The voice extraction device 20a may also include an input unit (for example, a mouse or a
keyboard) that receives various operations from an administrator or the like who uses the voice
extraction device 20a, and a display unit (for example, a liquid crystal display or an organic EL
display) that displays various information. The functions of the communication unit 210, the
storage unit 220, and the communication unit 240 are the same as the functions described above
with reference to FIG. 4.
[0108]
The control unit 230a is a functional unit that controls the operation of the entire speech
extraction device 20a. As illustrated in FIG. 12, the control unit 230a includes an acquisition unit
231, a formation unit 232, an emphasizing unit 233, a generation unit 234, a calculation unit 235,
a selection unit 236, and an output unit 237a. The control unit 230a is realized by the CPU 1100
shown in FIG. 13 described later executing programs stored in the ROM 1300 and the auxiliary
storage device 1400 etc. using the RAM 1200 as a work area. The functions of the acquiring unit
231, the forming unit 232, the emphasizing unit 233, the generating unit 234, the calculating
unit 235, and the selecting unit 236 are the same as the functions described in FIG. 4 described
above. The functions of the forming unit 232 and the emphasizing unit 233 are included in the
directivity forming / emphasis function 61 shown in FIG. 11 described above. The functions of
the generation unit 234, calculation unit 235, selection unit 236, and output unit 237a are
included in the channel selection function 62a shown in FIG. 11 described above.
[0109]
Note that some or all of the above-described functional units of the control unit 230a may be
realized not by a program that is software but by a hardware circuit such as an FPGA or an ASIC.
[0110]
Further, each functional unit of the control unit 230a illustrated in FIG. 12 conceptually
represents a function, and the configuration is not limited to this.
For example, a plurality of functional units illustrated as independent functional units of the
control unit 230a in FIG. 12 may be configured as one functional unit. Conversely, the function
possessed by one functional unit of the control unit 230a in FIG. 12 may be divided and
configured as a plurality of functional units.
[0111]
The output unit 237a is a functional unit that extracts, from the emphasis signals generated for
the channels, the emphasis signal corresponding to the channel selected by the selection unit 236,
and outputs it to the recognition device 30 via the communication unit 240. The function of the
output unit 237a is included in the channel selection function 62a shown in FIG. 11 described
above.
[0112]
As described above, in FIG. 4 the observation signal corresponding to the selected channel is
output to the recognition device 30, but as in this modification shown in FIG. 12, the emphasis
signal of the selected channel may be output instead. In this way, as with the voice extraction
device 20 according to the above-described embodiment, by appropriately extracting the voice it
is possible to suppress the occurrence of signal distortion and improve the accuracy of voice
recognition.
[0113]
〔6−2. Selection by Other Index Values Based on the Frequency Distribution]
In the above-described embodiment, the calculation unit 235 calculates the kurtosis of the frequency distribution of the amplitudes of the emphasis signal of each channel generated by the generation unit 234, and the selection unit 236 selects the channel corresponding to the largest kurtosis among the kurtosis values of the channels calculated by the calculation unit 235. However, the present invention is not limited to this; for example, the channel that outputs the observation signal (or the emphasis signal) to the recognition device 30 may be selected by one of the following methods.
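As a concrete illustration of the kurtosis-based selection recapped above, the following is a minimal sketch in Python, assuming each channel's emphasis signal is available as a one-dimensional NumPy array of time-domain samples; the function and variable names are illustrative and not taken from the patent.

    import numpy as np
    from scipy.stats import kurtosis

    def select_channel_by_kurtosis(emphasis_signals):
        # Score each channel by the kurtosis of its amplitude distribution;
        # a sharper, heavier-tailed distribution suggests speech rather than
        # diffuse noise. Return the index of the best-scoring channel.
        scores = [kurtosis(np.abs(sig)) for sig in emphasis_signals]
        return int(np.argmax(scores))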
[0114]
For example, the selection unit 236 may select one or more channels each corresponding to a kurtosis equal to or greater than a predetermined threshold among the calculated kurtosis values of the channels, and the output unit 237 (237a) may, for example, average or synthesize the observation signals (or emphasis signals) corresponding to the selected one or more channels and output the result to the recognition device 30. In this case, the number of channels selected by the selection unit 236 may have an upper limit.
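A hedged sketch of this threshold-based variant follows; THRESHOLD and MAX_CHANNELS are assumed tuning values, not figures from the patent, and the signals are assumed to have equal length.

    import numpy as np
    from scipy.stats import kurtosis

    THRESHOLD = 3.0     # assumed threshold; would be tuned per deployment
    MAX_CHANNELS = 4    # assumed upper limit on the number of selected channels

    def average_channels_above_threshold(signals):
        # Keep channels whose kurtosis meets the threshold (best first, up to
        # the limit) and average their signals; fall back to the single best
        # channel if none qualifies.
        scores = np.array([kurtosis(np.abs(s)) for s in signals])
        order = np.argsort(scores)[::-1]
        picked = [i for i in order if scores[i] >= THRESHOLD][:MAX_CHANNELS]
        if not picked:
            picked = [int(order[0])]
        return np.mean([signals[i] for i in picked], axis=0)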
[0115]
Further, for example, the calculation unit 235 may calculate a different index value from the frequency distribution of the amplitudes of the generated emphasis signals of the respective channels instead of the kurtosis. For example, the calculation unit 235 may calculate an index value such as the variance, the average value, or the mode of the frequency distribution, the height between the peak and the bottom of the frequency distribution, or the width of the distribution at a predetermined position relative to the peak, as in the sketch below. In this case, the selection unit 236 may select the channel that outputs the observation signal (or emphasis signal) based on the calculated index value.
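The sketch below illustrates, under the same assumptions as before, how such alternative index values could be computed from a histogram of signal amplitudes; the bin count and the half-peak width measure are arbitrary illustrative choices.

    import numpy as np

    def histogram_index_values(signal, bins=64):
        # Build the frequency distribution (histogram) of amplitudes, then
        # derive several candidate index values from it.
        amplitudes = np.abs(signal)
        counts, edges = np.histogram(amplitudes, bins=bins)
        centers = (edges[:-1] + edges[1:]) / 2
        peak = counts.max()
        return {
            "variance": float(np.var(amplitudes)),
            "mean": float(np.mean(amplitudes)),
            "mode": float(centers[np.argmax(counts)]),      # most frequent amplitude bin
            "peak_to_bottom": float(peak - counts.min()),   # height of the distribution
            "half_peak_width": int(np.count_nonzero(counts >= peak / 2)),
        }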
[0116]
Also, for example, a model (pattern) of the frequency distribution of the amplitudes of human (speaker) speech signals may be prepared in advance, and the calculation unit 235 may calculate, as an index value, the similarity between the frequency distribution of the amplitudes of the generated emphasis signal of each channel and the model. In this case, for example, the selection unit 236 may select the channel corresponding to the emphasis signal having the highest degree of similarity with the frequency distribution model.
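A minimal sketch of this similarity-based selection follows, using cosine similarity between normalized amplitude histograms; the patent does not fix the similarity metric, so this choice (and the histogram parameters) is an assumption, and model_hist must be built with the same bins and range.

    import numpy as np

    def select_channel_by_similarity(signals, model_hist, bins=64, amp_range=(0.0, 1.0)):
        # Compare each channel's amplitude histogram with a model histogram
        # of speech prepared in advance; return the most similar channel.
        def amplitude_hist(sig):
            h, _ = np.histogram(np.abs(sig), bins=bins, range=amp_range, density=True)
            return h
        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        sims = [cosine(amplitude_hist(s), model_hist) for s in signals]
        return int(np.argmax(sims))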
[0117]
Thus, by the methods described above, which are based on the frequency distribution of the amplitudes of the emphasis signal, the voice can be appropriately extracted similarly to the voice extraction device 20 according to the above-described embodiment, so that the occurrence of signal distortion can be suppressed and the accuracy of speech recognition can be improved.
[0118]
〔7. Hardware Configuration]
FIG. 13 is a diagram showing an example of a hardware configuration of a computer that realizes the functions of the voice extraction device. The voice extraction device 20 and the recognition device 30 according to the embodiment described above are realized by, for example, a computer 1000 configured as shown in FIG. 13. Hereinafter, the voice extraction device 20 will be described as an example.
[0119]
The computer 1000 includes a CPU 1100, a RAM 1200, a ROM 1300, an auxiliary storage device 1400, a network I/F (interface) 1500, a communication I/F (interface) 1600, an input/output I/F (interface) 1700, and a media I/F (interface) 1800. The CPU 1100, the RAM 1200, the ROM 1300, the auxiliary storage device 1400, the network I/F 1500, the communication I/F 1600, the input/output I/F 1700, and the media I/F 1800 are connected by a bus 1950 so that they can communicate data with one another.
[0120]
The CPU 1100 is an arithmetic device that operates based on programs stored in the ROM 1300 or the auxiliary storage device 1400 and controls each part. The ROM 1300 is a non-volatile storage device that stores a boot program and a basic input/output system (BIOS) executed by the CPU 1100 when the computer 1000 starts up, programs that depend on the hardware of the computer 1000, and the like.
[0121]
The auxiliary storage device 1400 is a non-volatile storage device that stores a program executed
by the CPU 1100, data used by the program, and the like. The auxiliary storage device 1400 is,
for example, an HDD or an SSD.
[0122]
The network I/F 1500 is a communication interface that receives data from other devices via the communication network 600 (corresponding to the network N shown in FIG. 3) and sends the data to the CPU 1100, and that transmits data generated by the CPU 1100 to other devices via the communication network 600. The network I/F 1500 is, for example, an NIC.
[0123]
The communication I/F 1600 is a communication interface for communicating data with peripheral devices. The communication I/F 1600 is, for example, a USB interface or a serial port.
[0124]
The CPU 1100 controls an output device such as a display or a printer and an input device such as a keyboard or a mouse via the input/output I/F 1700. The CPU 1100 acquires data from the input device via the input/output I/F 1700. The CPU 1100 also outputs generated data to the output device via the input/output I/F 1700.
[0125]
The media I/F 1800 is an interface that reads a program or data stored in the recording medium 1900 and provides it to the CPU 1100 via the RAM 1200. The CPU 1100 loads the provided program from the recording medium 1900 onto the RAM 1200 via the media I/F 1800 and executes the loaded program. The recording medium 1900 is, for example, an optical recording medium such as a digital versatile disc (DVD) or a phase change rewritable disc (PD), a magneto-optical recording medium such as a magneto-optical disc (MO), a tape medium, a magnetic recording medium, or a semiconductor memory.
[0126]
For example, when the computer 1000 functions as the voice extraction device 20 according to the embodiment, the CPU 1100 of the computer 1000 realizes the functions of the control unit 230 by executing a program loaded on the RAM 1200. In addition, the auxiliary storage device 1400 stores the data held in the storage unit 220. The CPU 1100 of the computer 1000 reads these programs from the recording medium 1900 and executes them; as another example, these programs may be acquired from another device via the communication network 600.
[0127]
The hardware configuration of the computer 1000 illustrated in FIG. 13 is merely an example; the computer 1000 need not include all the components illustrated in FIG. 13 and may include other components.
[0128]
〔8. Others]
Among the processes described in the above embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above documents and drawings may be changed arbitrarily unless otherwise specified. For example, the various kinds of information shown in each figure are not limited to the illustrated information.
[0129]
Further, each component of each device illustrated is functionally conceptual and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or a part thereof may be functionally or physically distributed or integrated in arbitrary units depending on various loads, usage conditions, and the like. For example, the generation unit 234 and the calculation unit 235 illustrated in FIG. 4 may be integrated. Also, for example, the information stored in the storage unit 220 may be stored in a predetermined storage device provided externally via the network N.
[0130]
In the above-described embodiment, an example has been shown in which the voice extraction device 20 performs the emphasis processing of emphasizing the observation signal of each channel according to the directivity to generate an emphasis signal, and the generation processing of generating, for each channel, the frequency distribution of the amplitudes of the emphasis signal based on the emphasis signal of each emphasized channel. However, the above-described voice extraction device 20 may be separated into an emphasis device that performs the emphasis processing and a generation device that performs the generation processing. In this case, the emphasis device has at least the emphasizing unit 233, and the generation device has at least the generation unit 234. The processing by the above-described voice extraction device 20 is then realized by the voice recognition system 1 that includes both the emphasis device and the generation device.
[0131]
Moreover, the embodiment described above and its modification can be combined as appropriate as long as the processing contents do not contradict each other.
[0132]
〔9. Effects]
As described above, the voice extraction device 20 (20a) according to the embodiment includes the forming unit 232, the acquisition unit 231, the emphasizing unit 233, the generation unit 234, and the selection unit 236. The forming unit 232 forms directivity in advance by beamforming processing for each microphone in the microphone array device 10, in which a plurality of channels are formed by having a plurality of microphones. The acquisition unit 231 acquires an observation signal, which is a signal of the sound received by each channel. The emphasizing unit 233 emphasizes the observation signal of each channel according to the directivity of each microphone formed by the forming unit 232 to generate an emphasis signal. The generation unit 234 generates, for each channel, a frequency distribution of the amplitudes of the emphasis signal generated by the emphasizing unit 233. The selection unit 236 selects, from among the channels, the channel corresponding to the speech signal used for speech recognition, based on the frequency distribution corresponding to each channel generated by the generation unit 234. A minimal end-to-end sketch of this flow is given below.
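The sketch assumes a simple time-domain delay-and-sum enhancer whose per-channel delay tables stand in for the directivity formed in advance; all names are illustrative, and np.roll's wrap-around is a simplification of a real fractional-delay implementation.

    import numpy as np
    from scipy.stats import kurtosis

    def delay_and_sum(mic_signals, delays):
        # Align each microphone signal with its precomputed integer delay
        # and average; np.roll wraps samples around, which is acceptable
        # for a sketch but not for production audio.
        shifted = [np.roll(sig, -d) for sig, d in zip(mic_signals, delays)]
        return np.mean(shifted, axis=0)

    def extract_best_channel(mic_signals, delay_table):
        # One beamformed "channel" per steering direction; the directions
        # and their delay sets were designed offline, in advance.
        enhanced = [delay_and_sum(mic_signals, d) for d in delay_table]
        scores = [kurtosis(np.abs(sig)) for sig in enhanced]
        best = int(np.argmax(scores))
        return best, enhanced[best]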
[0133]
In this way, it is not necessary to form directivity every time an observation signal is received, and channels are selected based on the frequency distribution of amplitudes generated from the emphasis signal of each channel, which is emphasized based on the directivity formed in advance, without the sound source localization of the conventional speech recognition system. Although this channel selection corresponds to the sound source localization function of the conventional speech recognition system, it is not necessary to execute the computationally expensive sound source localization processing. Therefore, the load of arithmetic processing can be reduced, and the occurrence of signal distortion can be suppressed by appropriately extracting the speech, so that the accuracy of speech recognition can be improved.
[0134]
In addition, the voice extraction device 20 (20a) according to the embodiment further includes the output unit 237 (237a). The output unit 237 (237a) outputs the voice signal corresponding to the channel selected by the selection unit 236, among the channels of the microphone array device 10, to the recognition device 30 that performs voice recognition.
[0135]
As described above, the voice extraction device 20 (20a) according to the embodiment appropriately extracts the voice in which the occurrence of signal distortion is suppressed and outputs the corresponding voice signal, so that the accuracy of voice recognition in the recognition device 30 can be improved.
[0136]
Further, based on the frequency distribution corresponding to each channel generated by the
generation unit 234, the selection unit 236 selects a channel corresponding to an observation
signal as an audio signal used for speech recognition among the channels.
The output unit 237 outputs the observation signal corresponding to the channel selected by the
selection unit 236 to the recognition device 30.
[0137]
Thus, the speech extraction device 20 according to the embodiment may output an observation signal as the speech signal used for speech recognition in the recognition device 30. As a result, even if a defect occurs in the beamforming processing by the forming unit 232 or in the emphasis processing of the observation signal by the emphasizing unit 233 and distortion occurs in the emphasis signal, the undistorted observation signal is output as it is, so that the accuracy of speech recognition can be improved.
[0138]
Further, based on the frequency distribution corresponding to each channel generated by the
generation unit 234, the selection unit 236 selects a channel corresponding to an emphasis
signal as a speech signal used for speech recognition among the channels. The output unit 237a
outputs an emphasis signal corresponding to the channel selected by the selection unit 236 to
the recognition device 30.
[0139]
Thus, the speech extraction device 20a according to the modification of the embodiment may output an emphasis signal as the speech signal used for speech recognition in the recognition device 30. In this way, the accuracy of speech recognition can be improved by outputting the emphasis signal, which is the emphasized speech signal corresponding to the appropriately selected channel.
[0140]
In addition, the voice extraction device 20 (20a) according to the embodiment further includes a
calculation unit 235. The calculation unit 235 calculates an index value for the frequency
distribution corresponding to each channel generated by the generation unit 234. The selection
unit 236 selects a channel corresponding to the audio signal used for speech recognition among
the channels based on the index value calculated by the calculation unit 235.
[0141]
As described above, the voice extraction device 20 (20a) according to the embodiment may use the index value for the frequency distribution calculated by the calculation unit 235 to select the channel corresponding to the speech signal used for speech recognition. As a result, the channel can be selected based on an index value that appropriately indicates the characteristics of the frequency distribution, so the voice can be appropriately extracted and the occurrence of signal distortion can be suppressed, improving the accuracy of speech recognition.
[0142]
Further, the calculation unit 235 calculates the kurtosis of the frequency distribution corresponding to each channel as the index value. The selection unit 236 selects, from among the channels, the channel corresponding to the speech signal used for speech recognition, based on the kurtosis values calculated by the calculation unit 235.
[0143]
Thus, the voice extraction device 20 (20a) according to the embodiment may use the kurtosis of the frequency distribution calculated by the calculation unit 235 to select the channel corresponding to the speech signal used for speech recognition. As a result, the channel can be selected based on the kurtosis, which appropriately indicates the characteristics of the frequency distribution, so the voice can be appropriately extracted and the occurrence of signal distortion can be suppressed, improving the accuracy of speech recognition.
[0144]
Further, the selection unit 236 selects the channel corresponding to the largest kurtosis among the kurtosis values corresponding to the channels calculated by the calculation unit 235.
[0145]
This makes it possible to select the channel corresponding to the emphasis signal that is clearly emphasized relative to the observation signal, so the voice can be appropriately extracted and the occurrence of signal distortion can be suppressed, improving the accuracy of speech recognition.
[0146]
The calculation unit 235 calculates, for each channel, the degree of similarity between the frequency distribution corresponding to the channel and a predetermined model of the frequency distribution of the amplitudes of a speech signal.
The selection unit 236 selects the channel corresponding to the highest similarity among the similarities corresponding to the respective channels calculated by the calculation unit 235.
[0147]
Thus, the voice extraction device 20 (20a) according to the embodiment may use the similarity, calculated by the calculation unit 235, between the frequency distribution corresponding to each channel and a predetermined model of the frequency distribution of the amplitudes of a speech signal in order to select the channel corresponding to the speech signal used for speech recognition.
As a result, the channel corresponding to the emphasis signal determined to be closest to the model speech signal can be selected, so the voice can be appropriately extracted and the occurrence of signal distortion can be suppressed. Therefore, the accuracy of speech recognition can be improved.
[0148]
FIG. 14 is a diagram showing an example of the verification results of the character correct accuracy of each system. With reference to FIG. 14, an example of the verification results of the character correct accuracy in the recognition device 30 when the voice extraction device 20 according to the above-described embodiment shown in FIGS. 1 and 4 is used, and when the voice extraction device 20a according to the above-described modification shown in FIGS. 11 and 12 is used, will be described. In this example, verification was performed under the following conditions.
[0149]
- Number of elements (microphones) in the microphone array device: 8
- Microphone array shape: circular, radius 3.7 cm
- Speech used for training the speech recognition model: speech with added noise and reverberation
- Evaluation data: 9900 command utterances recorded in a real environment
- Recording conditions: 4 combinations of 6 rooms; 6 sets of microphone and speaker position combinations
[0150]
More specifically, the character correct accuracy was compared among the following five systems, <1> to <5>.
[0151]
<1> channel_select (enh): A speech recognition system using the voice extraction device 20a according to the above-described modification, with the DS method used for the beamforming processing.
<2> channel_select (obs): A speech recognition system using the voice extraction device 20 according to the above-described embodiment, with the DS method used for the beamforming processing.
<3> Static: A system in which only the one microphone located in front of the speaker, among the microphones in the static microphone array device, is used to receive sound.
<4> BeamformIt: The conventional speech recognition system shown in FIG. 2, with GCC-PHAT used for sound source localization and the DS method used for the beamforming processing; when performing sound source localization, the Viterbi algorithm is further applied to the GCC-PHAT result (a sketch of GCC-PHAT is given after this list).
<5> BeamformIt (channel_select): A speech recognition system that uses, as its input, the observation signal of the channel selected in BeamformIt.
As shown in the result of the character correct accuracy shown in FIG. 14, it was confirmed that
the conventional speech recognition systems BeamformIt and BeamformIt (channel_select) have
degraded performance compared to static. This seems to be due to the fact that in a noise and
reverberant environment, sound source localization is difficult and beamforming processing has
failed.
[0153]
On the other hand, it was confirmed that channel_select (obs), the speech recognition system according to the above-described embodiment, improves recognition performance compared to Static. From this, it is considered that channel_select (obs) can select a channel that is effective for speech recognition. Moreover, it was confirmed that channel_select (enh), the speech recognition system according to the above-described modification, showed the highest performance in this verification. This is considered to be because channel selection by kurtosis improves the selection performance over the conventional speech recognition system and demonstrates the effect of forming directivity in advance by the beamforming processing.
[0154]
FIG. 15 is a diagram showing an example of the processing time of the extraction processing of each system. With reference to FIG. 15, the comparison results of the computation time between the processing by the system using the voice extraction device 20 according to the above-described embodiment shown in FIGS. 1 and 4 (channel_select (obs) described above) and the processing by the conventional speech recognition system (BeamformIt described above) will be described. In this example, the computation time was compared under the following conditions.
[0155]
- Machine specification: Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00 GHz
- Measurement method: measured with the Linux (registered trademark) time command, using user time
- Average and standard deviation calculated over the processing of 4980 utterances
[0156]
As shown by the comparison of computation times in FIG. 15, it was confirmed that the speech recognition system according to the present embodiment can reduce the computation time significantly compared to the conventional speech recognition system.
[0157]
1 voice recognition system; 10 microphone array device; 20 voice extraction device; 30 recognition device; 210 communication unit; 220 storage unit; 221 calculation result storage unit; 230 control unit; 231 acquisition unit; 232 forming unit; 233 emphasizing unit; 234 generation unit; 235 calculation unit; 236 selection unit; 237 output unit; 240 communication unit