Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2009282645
PROBLEM TO BE SOLVED: To enable fine input operations, such as moving a cursor to small icons and buttons, by voice or exhalation in environments where the hands cannot be used.
SOLUTION: The duration of a sound emitted from the user's nose and mouth is specified based on voice data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, and the utterance position of that sound is specified three-dimensionally. A duration determination means determines whether the specified duration is longer than a predetermined time, a display means performs display for the user, and a display control means controls the display means so as to change the display mode on the display means in accordance with the determination result of the duration determination means and the utterance position specified by the utterance position specifying means. [Selected figure] Figure 1
Information processing device
[0001]
The present invention relates to an information processing apparatus that reflects a user's operation on a display or the like by means of exhalation or voice uttered at a position in three-dimensional space.
[0002]
A keyboard and a mouse are widely used as input means for a computer.
Since both presuppose operation by hand, they are an obstacle to computer access for persons with disabilities who cannot use their hands. Input means that do not use the hands include those based on "eye gaze", the "tongue", "voice", and the like. The "eye gaze" method uses a head-mounted device, so it is difficult for a disabled person to put it on without assistance. The "tongue" method places a sensor in the oral cavity, which raises hygiene problems in addition to the difficulty of fitting the device. The "voice" method requires no sensors to be worn, since the user speaks into a microphone placed on a table, but it presupposes that the user can utter voice commands clearly enough for speech recognition.
[0003]
For people who find it difficult to operate a mouse or other device by hand and who also find it difficult to utter clear voice commands that speech recognition can handle, voice pointing devices are being developed that specify the position of an utterance, such as a voice or an exhalation sound, by microphone array processing. For example, Patent Document 1 (Japanese Patent Application Laid-Open No. 2004-280301) discloses a voice pointing device, and an interface using it, that operates a cursor when the user moves the mouth or face while blowing breath toward a microphone array. It determines the moving direction of the cursor from the two-dimensional position of the input sound detected by a microphone array arranged on a plane, and controls the speed of cursor movement by the intensity of the input sound. In addition, Patent Document 2 (Japanese Patent Application Laid-Open No. 2007-228135) describes a voice pointing device with improved robustness to noise, applied to an electric wheelchair whose direction of travel can be indicated by the user facing the desired direction and emitting a sound such as a voice, an exhalation sound, or a whistle. In this electric wheelchair example, two microphone arrays, which are the sensor units of the pointing device, are mounted on the tips of the left and right armrests to estimate the two-dimensional position of the sound source. JP 2004-280301 A; JP 2007-228135 A
[0004]
However, when the input operation is performed with only the intensity of the input sound controlling the speed of cursor movement, as in the method of Patent Document 1, fine input such as moving the cursor to small icons and buttons becomes difficult. Moreover, because the method is vulnerable to interference from ambient noise, its practicality in ordinary noisy environments was low.
[0005]
The inventor has already developed a three-dimensional voice pointing device in which the pointing device is miniaturized so that it can be used on a desktop, and in which the microphone array is configured three-dimensionally so that three-dimensional sound source position estimation can be performed with high accuracy. By using this three-dimensional voice pointing device developed by the inventor, the problem of vulnerability to noise could be avoided.
[0006]
However, even with such a three-dimensional voice pointing device, there is much room for
improvement in the interface regarding the problem that fine control such as cursor movement
to small icons and buttons is difficult.
[0007]
In view of the problems of the prior art described above, an object of the present invention is to provide an information processing apparatus that reflects a user's operation on a display or the like, using exhalation or voice, based on the user's utterance position in three-dimensional space, even in a noisy environment. Another object of the present invention is to provide an information processing apparatus capable of fine input operations such as moving the cursor to small icons and buttons.
[0008]
In order to solve the problems described above, the invention according to claim 1 is an information processing apparatus comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; utterance position specifying means for three-dimensionally specifying the utterance position of a sound emitted from the user's nose and mouth, based on voice data acquired by the microphone array; display means for performing display to the user; and display control means for controlling the display means, wherein the display control means performs control to change the position of a cursor displayed on the display means in accordance with the utterance position specified by the utterance position specifying means.
[0009]
The invention according to claim 2 is an information processing apparatus comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; duration specifying means for specifying the duration of a sound emitted from the user's nose and mouth, based on voice data acquired by the microphone array; utterance position specifying means for three-dimensionally specifying the utterance position of the sound emitted from the user's nose and mouth, based on the voice data acquired by the microphone array; duration determination means for determining whether the duration specified by the duration specifying means is longer than a predetermined time; display means for performing display to the user; and display control means for controlling the display means, wherein the display control means performs control to change the display mode on the display means in accordance with the determination result of the duration determination means and the utterance position specified by the utterance position specifying means.
[0010]
The invention according to claim 3 is an information processing apparatus comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; duration specifying means for specifying the duration of a sound emitted from the user's nose and mouth, based on voice data acquired by the microphone array; utterance position specifying means for three-dimensionally specifying the utterance position of the sound emitted from the user's nose and mouth, based on the voice data acquired by the microphone array; duration determination means for determining whether the duration specified by the duration specifying means is longer than a predetermined time; display means for performing display to the user; and display control means for controlling the display means, wherein the display control means performs control to execute or cancel enlarged display on the display means in accordance with the determination result of the duration determination means and the utterance position specified by the utterance position specifying means.
[0011]
The invention according to claim 4 is an information processing apparatus comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; duration specifying means for specifying the duration of a sound emitted from the user's nose and mouth, based on voice data acquired by the microphone array; utterance position specifying means for three-dimensionally specifying the utterance position of the sound emitted from the user's nose and mouth, based on the voice data acquired by the microphone array; duration determination means for determining whether the duration specified by the duration specifying means is longer than a predetermined time; display means for performing display to the user; and display control means for controlling the display means, wherein the display control means performs control to change the magnification of the display on the display means in accordance with the determination result of the duration determination means and the utterance position specified by the utterance position specifying means.
[0012]
The invention according to claim 5 is an information processing apparatus comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; duration specifying means for specifying the duration of a sound emitted from the user's nose and mouth, based on voice data acquired by the microphone array; utterance position specifying means for three-dimensionally specifying the utterance position of the sound emitted from the user's nose and mouth, based on the voice data acquired by the microphone array; duration determination means for determining whether the duration specified by the duration specifying means is longer than a predetermined time; display means for performing display to the user; and display control means for controlling the display means, wherein the display control means performs control to change the movement amount of the cursor displayed on the display means in accordance with the determination result of the duration determination means and the utterance position specified by the utterance position specifying means.
[0013]
The invention according to claim 6 is the information processing apparatus according to claim 3 or 4, wherein the display control means changes the movement amount of the cursor displayed on the display means in accordance with a change in the magnification of the display on the display means.
[0014]
The invention according to claim 7 is the information processing apparatus according to claim 6, further comprising volume specifying means for specifying the volume of the sound emitted from the user's nose and mouth based on the voice data acquired by the microphone array, wherein the display control means controls the movement amount of the cursor displayed on the display means so as to be proportional to the volume specified by the volume specifying means, or to the logarithmic value of that volume.
[0015]
The invention according to claim 8 is the information processing apparatus according to claim 6, wherein the display control means controls the movement amount of the cursor displayed on the display means so as to be proportional to the distance between the utterance position specified by the utterance position specifying means and a predetermined line segment.
[0016]
The invention according to claim 9 is the information processing apparatus according to any one of claims 1 to 8, wherein the display control means performs control to display the utterance position specified by the utterance position specifying means on the display means.
[0017]
According to the information processing apparatus of the present invention, the user's operation can be reflected on a display or the like, using exhalation or voice, based on the user's position in three-dimensional space, even in a noisy environment.
[0018]
Further, according to the information processing apparatus of the present invention, it is possible
to perform detailed input operations such as cursor movement to small icons and buttons.
[0019]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings.
FIG. 1 is a perspective view showing an appearance of an information processing apparatus
according to an embodiment of the present invention, and FIG. 2 is a perspective view of an
appearance of an interface apparatus used in the information processing apparatus according to
the embodiment of the present invention. FIG. 3 is a block diagram of the information processing
apparatus according to the embodiment of the present invention.
[0020]
In FIGS. 1 and 2, reference numeral 10 denotes an information processing apparatus, 20 a computer main unit, 30 a display unit, 100 an interface device, 200 a microphone array, 201 a silicon microphone, 202 a wind screen, 210 a stand, 211 a main support post, 212 a left support post, 213 a right support post, 280 a microphone amplifier, 290 an AD conversion unit, 300 a CPU, 400 a storage unit, and 500 a connection port unit.
[0021]
The information processing apparatus 10 includes an interface device 100, which is an alternative to an input pointing device such as a mouse, a computer main unit 20 that receives input from the interface device 100 and performs arithmetic processing and the like based on that input, and a display unit 30 that presents the output to the user.
The computer main unit 20 is a general-purpose information processing mechanism including a CPU (not shown), a ROM (not shown) that holds programs running on the CPU, an HDD (not shown), a RAM (not shown) that serves as a work area for the CPU, and interface means (not shown) for connecting to other devices; for example, a general-purpose personal computer can be used.
The configuration expressed as "display control means" in the claims is realized by the CPU of the computer main unit 20, a program running on the CPU, a video RAM (not shown), and the like.
Since the configuration and operation of such a computer main unit 20 are well known, detailed description is omitted.
The configuration described as "display means" in the claims is the display unit 30, and a general display can be used for it as well.
Although a general-purpose personal computer or the like can be used as the computer main unit 20, the present invention is not limited to this, and various other computers can be used.
[0022]
Hereinafter, the interface device 100, which characterizes the information processing apparatus 10 according to the present invention, will be described in detail.
FIG. 2 shows the configuration of the user interface unit of the interface device 100, which functions as an input device for a computer or the like based on sounds emitted from the user's nasal and oral cavities, as illustrated.
Note that such an interface device 100 can be used not only for input to a computer but also for input to an electrical appliance or a vehicle.
[0023]
The exterior of the interface device 100 consists of a main support post 211 erected on a stand 210, left and right support posts 212 and 213 branching from the main support post 211, and a group of microphones provided on each post; it is designed to stand on a tabletop.
More specifically, silicon microphones 201 are mounted on a substrate (not shown) at 3 cm intervals on each of the main support post 211, the left support post 212, and the right support post 213, and the microphone array 200 is configured from a total of twelve microphones.
The interface device 100 of the present embodiment is described on the basis of twelve silicon microphones 201; however, the number of silicon microphones 201 only needs to be three or more, and the present invention is not limited to the use of twelve silicon microphones 201.
Noise resistance degrades when the number of silicon microphones 201 is small, while the processing load of the audio data becomes heavy when the number is large; therefore, in the present embodiment, the microphone array 200 is configured with twelve silicon microphones 201 as described above.
As the silicon microphone 201, a small silicon microphone of about 3 mm × 5 mm is adopted.
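Purely as an illustration of the arrangement described above, the following Python sketch lays out twelve microphone coordinates at 3 cm spacing on three posts. The post offsets, heights, and origin are assumed values for this sketch only; the actual geometry of FIG. 2 is not given numerically in the text.

```python
import numpy as np

SPACING = 0.03  # 3 cm between adjacent silicon microphones on each post

def build_microphone_positions():
    """Return a (12, 3) array of assumed microphone positions in metres.

    Four microphones on the main (vertical) post and four on each of the
    left and right posts; the offsets below are illustrative only.
    """
    positions = []
    # Main post: four microphones stacked vertically (along z).
    for i in range(4):
        positions.append((0.0, 0.0, 0.10 + i * SPACING))
    # Left and right posts: four microphones each, extending horizontally
    # (along x) at an assumed branch height of 0.25 m.
    for i in range(4):
        positions.append((-(0.05 + i * SPACING), 0.0, 0.25))
    for i in range(4):
        positions.append((0.05 + i * SPACING, 0.0, 0.25))
    return np.array(positions)

if __name__ == "__main__":
    mics = build_microphone_positions()
    print(mics.shape)  # -> (12, 3)
```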
[0024]
The four silicon microphones 201 disposed on each post are covered by a wind screen 202, which shields them from wind noise.
The microphone group on the left support post 212 and the microphone group on the right support post 213 are arranged in a substantially horizontal layout, while the microphone group on the main support post 211 is arranged vertically.
[0025]
FIG. 3 is a block diagram of the interface device 100. The output of the microphone array 200, composed of twelve silicon microphones 201, is amplified by the microphone amplifier 280, converted from analog to digital by the AD conversion unit 290, and then input to the CPU 300. The storage unit 400 consists of a ROM holding the program that runs on the CPU 300 and a RAM serving as a work area for the CPU 300. The CPU 300 operates based on the program stored in the storage unit 400 and thereby functions as the interface device 100 of the present invention.
[0026]
Note that each of the means described in the claims, such as the "duration specifying means", "utterance position specifying means", "duration determination means", and "volume specifying means", is realized by the CPU 300 operating based on the program stored in the storage unit 400.
[0027]
Further, the storage unit 400 stores and holds an event database described later.
The connection port unit 500 is an interface unit for connecting to another device such as the
computer main unit 20, and can use a known device such as USB.
[0028]
The usage of the interface device 100 configured as described above will now be described. Although various embodiments are described individually below, each embodiment can be realized by changing the program stored in the storage unit 400. In addition, an interface device configured by arbitrarily combining the various embodiments described individually below is also included in the interface device of the present embodiment.
[0029]
FIG. 4 shows an example of how the interface device according to the embodiment of the present invention is used. The interface device 100 of the present embodiment identifies to which region the utterance position estimated in three-dimensional space belongs.
[0030]
Hereinafter, the term "utterance" includes any kind of sound emitted from the user's nose and mouth. Sounds emitted from the user's nose and mouth include, for example, tongue clicks; for typical use, however, short sounds from the user such as "Shu" or "Pu", and continuous sustained sounds such as "Shoo" or "A", are assumed.
[0031]
In the embodiment shown in FIG. 4, an utterance detection area R is defined for the user, and only utterances made within this area are detected, so that sounds from outside the utterance detection area R are treated as noise.
[0032]
Then, the position from which the utterance was made is specified within the defined utterance detection area R of the user.
The configuration that performs this specification is expressed as "utterance position specifying means" in the claims.
[0033]
Further, as described later, the utterance detection area R is divided into virtual spaces, and information about in which of these virtually divided spaces the utterance occurred is used.
[0034]
Also, the duration from the start to the end of an utterance is specified within the defined utterance detection area of the user. That is, a distinction is made between short utterances by the user such as "Shu" and "Pu" and continuous sustained utterances such as "Shoo" and "A". The configuration that performs this specification is expressed as "duration specifying means" in the claims.
[0035]
Also, the volume of the user's utterance is specified in the defined user's utterance detection area.
The configuration for performing such specification is expressed as "volume specifying means" in
the claims.
[0036]
The processing of the interface device in the embodiment as described above will be described.
FIG. 5 is a diagram showing a flowchart of processing of the interface device according to the
embodiment of the present invention.
[0037]
When the process is started in step S100, the process proceeds to step S101, and audio data is
fetched from the microphone array 200. More specifically, in this step, the analog signal of the
sound output from the microphone array 200 is amplified by the microphone amplifier 280, then
converted into a digital signal by the AD conversion unit 290, and temporarily stored in the
storage unit 400.
[0038]
In the next step S102, three-dimensional information on the user's utterance position and the arrival directions of ambient noise is specified. More specifically, the user's utterance position and the ambient noise arrival directions are specified in three-dimensional space using the methods described in Japanese Patent Application Laid-Open Nos. 2007-228135 and 2008-67854 and Japanese Patent Application No. 2006-240721 by the inventors of the present application.
[0039]
Next, in step S103, it is determined whether or not there is a user's utterance. In this step, the
user's utterance is detected using the method described in Japanese Patent Application No.
2006-240721, and if the user's utterance is not detected, the process is repeated from step
S101. If the user's utterance is detected, the process proceeds to step S104.
[0040]
In step S104, suppression of ambient noise is performed. In this step, sound source separation
processing for suppressing ambient noise and emphasizing the user's speech is performed using
the method described in Japanese Patent Application No. 2006-240721.
[0041]
In step S105, the duration of the user's utterance is specified. That is, in this step, processing for
specifying the time from the start to the end of the continuous sound of the user's utterance is
performed.
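The patent does not give the algorithm for this step (it relies on the cited applications), but a simple frame-energy sketch of measuring the duration of a continuous sound might look as follows; the frame length, hop, and threshold are assumptions.

```python
import numpy as np

def utterance_duration(samples, sample_rate=16000, frame_len=512,
                       hop=256, energy_threshold=1e-4):
    """Estimate the duration (seconds) of the first continuous sound.

    A frame counts as "voiced" when its mean squared amplitude exceeds an
    assumed threshold; the duration covers the first contiguous voiced run.
    """
    voiced = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len]
        voiced.append(np.mean(frame ** 2) > energy_threshold)
    duration_frames = 0
    run = 0
    for v in voiced:
        if v:
            run += 1
            duration_frames = max(duration_frames, run)
        elif duration_frames:
            break  # first contiguous voiced run has ended
        else:
            run = 0
    return duration_frames * hop / sample_rate
```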
[0042]
In step S106, a three-dimensional utterance position is identified. More specifically, it identifies
which region the utterance position estimated in the three-dimensional space belongs to. For
example, as shown in FIG. 3, the user's utterance detection area is defined, and the utterance
detection area is further divided into eight areas. Then, in the eight divided areas, it is specified in
which area the utterance is detected.
[0043]
In step S107, the volume of the user's utterance is specified. This is done by measuring a parameter corresponding to loudness, such as the power of the sound.
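A minimal sketch of this measurement, assuming the separated user signal is available as a sample array, is shown below; the reference level used for the decibel conversion is an arbitrary choice.

```python
import numpy as np

def utterance_volume(samples, ref=1e-6):
    """Return the power of the utterance and a decibel value relative to `ref`.

    Power (mean squared amplitude) stands in for the "parameter corresponding
    to loudness" mentioned in the text; `ref` is an assumed reference level.
    """
    power = float(np.mean(np.asarray(samples, dtype=float) ** 2))
    volume_db = 10.0 * np.log10(max(power, 1e-12) / ref)
    return power, volume_db
```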
[0044]
In step S108, a subroutine for event identification processing is executed. The event database held in the storage unit 400 stores events defined, for example, by utterance duration, utterance detection position, utterance volume, and the like. That is, combinations of utterance duration, utterance detection position, utterance volume, and so on with events are defined and held in the event database, and the event identification process refers to the information in this database.
[0045]
In the event database, for example, an event defined as a short utterance in the front upper-left area of FIG. 3 is registered in advance. Then, in the event identification process of step S108, it is determined whether the utterance position is at that position, whether the utterance duration is equal to or less than a certain threshold, and whether the utterance volume is equal to or greater than a predetermined level; if all of these conditions are met, it is determined that the event has occurred.
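One way to picture the event database and the condition check described in these two paragraphs is the sketch below. The region names, thresholds, and event labels are illustrative assumptions, not values from the patent.

```python
from dataclasses import dataclass

@dataclass
class EventRule:
    """One event database entry: a combination of region, duration, and volume."""
    region: str            # utterance detection position (virtual sub-space)
    max_duration: float    # utterance must be no longer than this (seconds)
    min_volume_db: float   # utterance must be at least this loud
    event: str             # event to emit when all conditions hold

# Assumed example entries; the real database is held in the storage unit 400.
EVENT_DATABASE = [
    EventRule(region="front_upper_left", max_duration=0.3, min_volume_db=40.0,
              event="LEFT_CLICK"),
    EventRule(region="front_upper_right", max_duration=0.3, min_volume_db=40.0,
              event="RIGHT_CLICK"),
]

def identify_event(region, duration, volume_db):
    """Return the first event whose conditions are all satisfied, else None."""
    for rule in EVENT_DATABASE:
        if (region == rule.region
                and duration <= rule.max_duration
                and volume_db >= rule.min_volume_db):
            return rule.event
    return None

print(identify_event("front_upper_left", 0.2, 45.0))  # -> "LEFT_CLICK"
```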
[0046]
In step S109, it is determined whether there is a corresponding event. That is, it is checked whether an event matching the event database was detected in step S108; if no event was detected, the process returns to step S101, and if an event was detected, the process proceeds to step S110.
[0047]
In step S110, an event detection signal is transmitted to the computer main body 20.
[0048]
Typical processing on the application side is shown in the dotted box.
Hereinafter, typical processing assumed on the application side will be described. In step S201,
the process continues to wait for reception of an event detection signal sent from the interface
apparatus of the present invention. If an event detection signal is received, the process proceeds
to step S202. In step S202, an appropriate process corresponding to the received event detection
signal is performed. Then, the process returns to step S201.
[0049]
In the interface apparatus 100, for example, when an utterance in a front, upper left divided area
is detected, an event detection signal corresponding to the left click of the mouse is generated
and transmitted to the computer main unit 20 side. The computer main body 20 having received
such an event detection signal executes a process corresponding to the left click of the mouse in
step S202.
[0050]
As described above, according to the information processing apparatus of the present embodiment, the user's exhalation sound or voice and its utterance position are specified three-dimensionally even in a noisy environment, and the computer main unit 20 can execute the processing corresponding to the specified items.
[0051]
Next, a first embodiment of the subroutine processing in step S108 will be described.
FIG. 6 is a flowchart showing the subroutine processing of the interface device according to the first embodiment of the present invention, and FIGS. 7 and 8 are views showing examples of the virtual spaces defined in the utterance detection area R in the information processing apparatus according to the first embodiment of the present invention.
[0052]
In FIG. 6, when the event specifying process subroutine is started in step S300, the process
proceeds to step S301, in which it is determined whether the specified utterance duration is
longer than a predetermined time. In this determination step, it is determined whether the user is
making a short utterance such as "Shu" or "Pu" or a long utterance such as "Shoo" or "A".
[0053]
When the determination result in step S301 is YES, the process proceeds to step S302, and when
the determination result is NO, the process proceeds to step S304.
[0054]
In step S302, an event is identified based on the virtual space A.
This virtual space A is as shown in FIG. 7. The virtual space A is divided into four spaces A1, A2, A3, and A4, and an utterance in each space is defined as an event corresponding to "up", "down", "right", or "left" of a general cross key. For example, when the user pronounces a long "Shoo" in the area A1, an event detection signal corresponding to "up" of the cross key is generated.
[0055]
In step S303, a cursor movement amount proportional to the utterance volume (or its logarithmic value) is specified. That is, an event detection signal is generated that moves the cursor by a larger amount at a time as the user speaks more loudly. In response to such an event detection signal, the display unit 30 performs display control such that the cursor moves by a larger amount at a time as the user speaks more loudly. For example, when the user utters a relatively loud and long "Shoo" in the area A1, an event detection signal is generated such that the cursor moves quickly in the "up" direction of the cross key.
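As a sketch of the movement amount determined in step S303, the function below scales the step with the volume or with its logarithm; the gain constants and the volume units are assumptions.

```python
import math

def cursor_step(volume, use_log=True, gain=50.0, log_gain=10.0):
    """Return a cursor movement amount (pixels) for one event detection signal.

    The amount is proportional either to the utterance volume itself or to its
    logarithmic value; `gain` and `log_gain` are assumed scaling constants.
    """
    if use_log:
        return max(0.0, log_gain * math.log10(max(volume, 1e-12)))
    return gain * volume
```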
[0056]
In step S304, an event is identified based on the virtual space B. This virtual space B is as shown in FIG. 8. The virtual space B is divided into two spaces B1 and B2, and an utterance in each space is defined as an event corresponding to a "right click" or "left click" of a general mouse, respectively. For example, when the user utters a short "Shu" in the area B1, an event detection signal corresponding to a "right click" of the mouse is generated.
[0057]
In step S305, left click or right click is identified as described above.
[0058]
In step S306, the process returns.
[0059]
The computer main unit 20 controls the display on the display unit 30 according to the event detection signal generated by the interface device 100 as described above.
That is, for example, when the user pronounces a long "Shoo" in the area A1, the display unit 30 is controlled so that the cursor moves upward.
When the user pronounces a short "Shu" in the area B1, the display unit 30 performs display control corresponding to a right click of the mouse.
[0060]
In the above embodiment, the simple case was described in which the virtual space A is divided into four spaces A1, A2, A3, and A4 and an utterance in each space is assigned to "up", "down", "right", or "left" of a general cross key; however, the virtual space can also be divided more finely. That is, when the user speaks at length between A1 and A3, the interface device 100 may generate an event detection signal that causes the cursor to move diagonally upward, and display control may be performed so that the cursor moves diagonally upward on the display unit 30. In other words, the cursor moves straight up when an utterance is detected directly above the origin of the X-Y plane, and moves at 45° toward the upper right when an utterance is detected in the direction 45° to the upper right. Also, the movement amount of the cursor may be determined in proportion to the distance from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center. The movement amount of the cursor may also be determined in proportion to both this distance and the utterance volume.
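The finer division and the distance-based movement amount described here can be sketched as follows: the movement direction is taken from the angle of the utterance position in the X-Y plane, and its magnitude grows with the distance from the boundary segment O-O' and with the volume. The gain constants are assumptions.

```python
import math

def cursor_delta(x, y, distance_from_oo, volume, k_dist=200.0, k_vol=1.0):
    """Return a (dx, dy) cursor movement from an utterance position.

    The direction follows the angle of (x, y) around the origin of the X-Y
    plane (straight up for a position directly above the origin, 45 degrees
    upper-right for a position 45 degrees to the upper right), and the
    magnitude is proportional to the distance from the line segment O-O'
    and to the utterance volume. k_dist and k_vol are assumed gains.
    """
    angle = math.atan2(y, x)
    magnitude = k_dist * distance_from_oo * k_vol * volume
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

# Example: an utterance straight above the origin moves the cursor straight up.
print(cursor_delta(0.0, 0.05, distance_from_oo=0.05, volume=0.5))
```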
[0061]
According to such an information processing apparatus of the present invention, the user's operation can be reflected on a display or the like, using exhalation or voice, based on the user's position in three-dimensional space, even in a noisy environment.
[0062]
Next, a second embodiment of the present invention will be described.
FIG. 9 is a flowchart showing the subroutine processing of the interface device according to the second embodiment of the present invention, and FIG. 10 is a view showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the second embodiment of the present invention.
[0063]
In step S400, when the subroutine of the event identification process is started, the process
proceeds to step S401, and it is determined whether the identified utterance continuation time is
longer than a predetermined time. In this determination step, it is determined whether the user is
making a short utterance such as "Shu" or "Pu" or a long utterance such as "Shoo" or "A".
[0064]
In step S402, an event is identified based on the virtual space A. This virtual space A is the same as that shown in FIG. 7: it is divided into four spaces A1, A2, A3, and A4, and an utterance in each space is defined as an event corresponding to "up", "down", "right", or "left" of a general cross key. For example, when the user pronounces a long "Shoo" in the area A1, an event detection signal corresponding to "up" of the cross key is generated.
[0065]
In step S403, a cursor movement amount proportional to the utterance volume (or its logarithmic value) is specified. That is, an event detection signal is generated that moves the cursor by a larger amount at a time as the user speaks more loudly, and in response to it the display unit 30 performs display control such that the cursor moves by a larger amount at a time as the user speaks more loudly.
[0066]
In step S404, an event is identified based on the virtual space C. This virtual space C is as shown in FIG. 10. The virtual space C is divided into three spaces C1, C2, and C3, and an utterance in each space is defined as an event corresponding to a "right click" or "left click" of a general mouse, or to "execute/cancel enlarged display", respectively. For example, when the user utters a short "Shu" in the area C1, an event detection signal corresponding to a "right click" of the mouse is generated. When the user utters a short "Shu" in the area C3, an event detection signal corresponding to a command to execute enlarged display of the area near the cursor, or to cancel the enlarged display, is generated.
[0067]
In the present embodiment, such an event detection signal is transmitted to the computer main unit 20, and based on it the computer main unit 20 performs display control of the display unit 30, so that fine input operations such as moving the cursor to small icons and buttons become possible.
[0068]
In step S405, a left click, a right click, or execution/cancellation of enlarged display is identified as described above.
[0069]
In step S406, the process returns.
[0070]
The computer main unit 20 controls the display on the display unit 30 according to the event detection signal generated by the interface device 100 as described above.
That is, when the user utters a short "Shu" or the like in the area C3, display control is performed on the display unit 30 to execute enlarged display of the area near the cursor, or to cancel the enlarged display.
In other words, a short utterance by the user in the area C3 acts like a toggle switch turning enlarged display ON and OFF; by using this switch function the user can enlarge the display as needed and carry out fine input work.
Thus, the information processing apparatus 10 of the present invention makes it possible to perform fine input operations such as moving the cursor to small icons and buttons.
[0071]
In the above embodiment, the simple case was described in which the virtual space A is divided into four spaces A1, A2, A3, and A4 and an utterance in each space is assigned to "up", "down", "right", or "left" of a general cross key; however, as described above, the virtual space can also be divided more finely. Also, the movement amount of the cursor may be determined in proportion to the distance from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center. The movement amount of the cursor may also be determined in proportion to both this distance and the utterance volume.
[0072]
Next, a third embodiment of the present invention will be described. FIG. 11 is a flowchart of the subroutine processing of the interface device according to the third embodiment of the present invention, and FIG. 12 is a view showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the third embodiment of the present invention.
[0073]
In FIG. 11, when the subroutine of the event identification process is started in step S500, the process proceeds to step S501, where it is determined whether the specified utterance duration is longer than a predetermined time. In this determination step, it is determined whether the user is making a short utterance such as "Shu" or "Pu" or a long utterance such as "Shoo" or "A". In this embodiment as well, the virtual space used within the utterance detection area R is switched depending on the duration of the user's continuous utterance.
[0074]
When the determination result of step S501 is YES, the process proceeds to step S502, and when
the determination result is NO, the process proceeds to step S506.
[0075]
In step S502, an event is identified based on the virtual space D.
This virtual space D is as shown in FIG. 12. The virtual space D is divided into five spaces D1, D2, D3, D4, and D5, and an utterance in each of D1 to D4 is defined as an event corresponding to "up", "down", "right", or "left" of a general cross key. For example, when the user pronounces a long "Shoo" in the area D1, an event detection signal corresponding to "up" of the cross key is generated. The area D5, set on the far side of the utterance detection area R, is a space provided so that the user can change the enlargement ratio of the display on the display unit 30. In changing the enlargement ratio, the ratio is set according to the distance between the PQRS plane and the utterance position: within the region D5, the closer the utterance position is to the PQRS plane, the larger the enlargement ratio, and the farther it is from the PQRS plane, the smaller the enlargement ratio, so the device can be used intuitively. For example, as the user utters a long "Shoo" in the area D5 and approaches the PQRS plane while continuing the utterance, the interface device 100 generates event detection signals such that the enlargement ratio increases; these are transmitted to the computer main unit 20, and display control is performed so as to increase the enlargement ratio of the display on the display unit 30.
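As a sketch of how the enlargement ratio in area D5 might follow the distance to the PQRS plane, the function below interpolates between a maximum zoom at the plane and no zoom at the far edge of D5. The distance range and zoom limits are assumptions, since the patent states only the qualitative relationship (closer means larger).

```python
def enlargement_ratio(distance_to_pqrs, d_max=0.15, zoom_min=1.0, zoom_max=4.0):
    """Return a display enlargement ratio from the distance to the PQRS plane.

    The ratio is largest when the utterance position touches the plane
    (distance 0) and shrinks linearly to `zoom_min` at distance `d_max`;
    all numeric limits are assumed values for this sketch.
    """
    d = min(max(distance_to_pqrs, 0.0), d_max)
    return zoom_max - (zoom_max - zoom_min) * (d / d_max)

print(enlargement_ratio(0.0))    # -> 4.0 (closest: largest magnification)
print(enlargement_ratio(0.15))   # -> 1.0 (farthest: no magnification)
```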
[0076]
In step S503, it is determined whether the utterance position is in D1, D2, D3, or D4. When the
determination result of step S503 is YES, the process proceeds to step S504, and when the
determination result is NO, the process proceeds to step S505.
[0077]
In step S504, a cursor movement amount proportional to the utterance volume (or its logarithmic value) is specified. That is, an event detection signal is generated that moves the cursor by a larger amount at a time as the user speaks more loudly, and in response to it the display unit 30 performs display control such that the cursor moves by a larger amount at a time as the user speaks more loudly. For example, when the user utters a relatively loud and long "Shoo" in the area D1, an event detection signal is generated such that the cursor moves quickly in the "up" direction of the cross key.
[0078]
In step S505, the enlargement ratio is specified by the method described above: the closer the utterance position is to the PQRS plane, the larger the enlargement ratio, and the farther it is from the PQRS plane, the smaller the enlargement ratio.
[0079]
In step S506, an event is identified based on the virtual space B.
This virtual space B is the same as that shown in FIG. 8: it is divided into two spaces B1 and B2, and an utterance in each space is defined as an event corresponding to a "right click" or "left click" of a general mouse, respectively. For example, when the user utters a short "Shu" in the area B1, an event detection signal corresponding to a "right click" of the mouse is generated.
[0080]
In step S507, left click or right click is identified as described above.
[0081]
In step S508, the process returns.
[0082]
The computer main unit 20 controls the display on the display unit 30 according to the event detection signal generated by the interface device 100 as described above.
That is, display control is performed so that the enlargement ratio of the display on the display unit 30 increases as the user approaches the PQRS plane while pronouncing a long "Shoo" or the like in the area D5, and decreases as the user moves away from the PQRS plane while pronouncing a long "Shoo" or the like in the area D5.
By changing the enlargement ratio in this way, the user can enlarge the display as needed and carry out fine input work. Thus, the information processing apparatus 10 of the present invention makes it possible to perform fine input operations such as moving the cursor to small icons and buttons.
[0083]
In the above embodiment, the simple case was described in which an utterance in each of the spaces D1, D2, D3, and D4 of the virtual space D is assigned to "up", "down", "right", or "left" of a general cross key; however, the virtual space can also be divided more finely. That is, when the user speaks at length between D1 and D3, the interface device 100 may generate an event detection signal that causes the cursor to move diagonally upward, and display control may be performed so that the cursor moves diagonally upward on the display unit 30. In other words, the cursor moves straight up when an utterance is detected directly above the origin of the X-Y plane, and moves at 45° toward the upper right when an utterance is detected in the direction 45° to the upper right. Also, the movement amount of the cursor may be determined in proportion to the distance from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center. The movement amount of the cursor may also be determined in proportion to both this distance and the utterance volume.
[0084]
Next, a fourth embodiment of the present invention will be described. FIG. 13 is a flowchart showing the subroutine processing of the interface device according to the fourth embodiment of the present invention, and FIG. 14 is a view showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the fourth embodiment of the present invention.
[0085]
In FIG. 13, when the event identification processing subroutine is started in step S600, the process proceeds to step S601, where it is determined whether the specified utterance duration is longer than a predetermined time. In this determination step, it is determined whether the user is making a short utterance such as "Shu" or "Pu" or a long utterance such as "Shoo" or "A". In this embodiment as well, the virtual space used within the utterance detection area R is switched depending on the duration of the user's continuous utterance.
[0086]
When the determination result in step S601 is YES, the process proceeds to step S602, and when
the determination result is NO, the process proceeds to step S604.
[0087]
In step S602, an event is identified based on the virtual space A.
This virtual space A is the same as that shown in FIG. 7: it is divided into four spaces A1, A2, A3, and A4, and an utterance in each space is defined as an event corresponding to "up", "down", "right", or "left" of a general cross key. For example, when the user pronounces a long "Shoo" in the area A1, an event detection signal corresponding to "up" of the cross key is generated.
[0088]
In step S603, a cursor movement amount proportional to the utterance volume (or its logarithmic value) is specified. That is, an event detection signal is generated that moves the cursor by a larger amount at a time as the user speaks more loudly, and in response to it the display unit 30 performs display control such that the cursor moves by a larger amount at a time as the user speaks more loudly. For example, when the user utters a relatively loud and long "Shoo" in the area A1, an event detection signal is generated such that the cursor moves quickly in the "up" direction of the cross key.
[0089]
In step S604, an event is identified based on the virtual space E. This virtual space E is as shown in FIG. 14. The virtual space E is divided into six spaces E1, E2, E3, E4, E5, and E6; an utterance in the area E1 or E2 is defined as an event corresponding to a "right click" or "left click" of a general mouse, respectively. For example, when the user utters a short "Shu" in the area E1, an event detection signal corresponding to a "right click" of the mouse is generated.
[0090]
In addition, an utterance in each of the spaces E3, E4, E5, and E6, the space area divided into four on the far side, corresponds to "up", "down", "right", or "left" of a general cross key and is defined as an event in which the movement amount of the cursor is small. For example, when the user pronounces a short "Shu" in the area E3, an event detection signal corresponding to a small cursor movement toward "up" is generated. In response to such an event detection signal, the display unit 30 performs display control such that the cursor moves only slightly. That is, the user can move the cursor finely by making short utterances in the space areas E3, E4, E5, and E6 as needed.
[0091]
In step S605, it is determined whether the utterance position is in E1 or E2. When the
determination result in step S605 is YES, the process proceeds to step S606, and when the
determination result is NO, the process proceeds to step S607.
[0092]
In step S606, left click or right click is identified as described above.
[0093]
Further, in step S607, as described above, a small amount of cursor movement is specified for
each cross direction.
[0094]
In step S608, the process returns.
[0095]
The computer main unit 20 controls the display on the display unit 30 according to the event detection signal generated by the interface device 100 as described above.
That is, in this embodiment in particular, when there is a short utterance in any of the spaces E3, E4, E5, and E6, the space area divided into four on the far side, display control is performed so that the movement amount of the cursor is small, and the user can move the cursor finely by making short utterances in the space areas E3, E4, E5, and E6 as needed.
Thus, the information processing apparatus 10 of the present invention makes it possible to perform fine input operations such as moving the cursor to small icons and buttons.
[0096]
In the above embodiment, the simple case was described in which the virtual space A is divided into four spaces A1, A2, A3, and A4 and an utterance in each space is assigned to "up", "down", "right", or "left" of a general cross key; however, the virtual space can also be divided more finely.
That is, when the user speaks at length between A1 and A3, the interface device 100 may generate an event detection signal that causes the cursor to move diagonally upward, and display control may be performed so that the cursor moves diagonally upward on the display unit 30. In other words, the cursor moves straight up when an utterance is detected directly above the origin of the X-Y plane, and moves at 45° toward the upper right when an utterance is detected in the direction 45° to the upper right. Also, the movement amount of the cursor may be determined in proportion to the distance from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center. The movement amount of the cursor may also be determined in proportion to both this distance and the utterance volume.
[0097]
Further, in the above embodiment, the simple case was described in which, on the far side of the virtual space E, the space is divided into four spaces E3, E4, E5, and E6 assigned to "up", "down", "right", and "left"; however, this virtual space can also be divided more finely. That is, when an utterance is made between E3 and E5, the interface device 100 may generate an event detection signal that causes the cursor to move diagonally upward, and display control may be performed so that the cursor moves diagonally upward on the display unit 30. In other words, the cursor moves straight up when an utterance is detected directly above the origin of the X-Y plane, and moves at 45° toward the upper right when an utterance is detected in the direction 45° to the upper right.
[0098]
Next, a fifth embodiment of the present invention will be described. FIG. 15 is a diagram showing
a flowchart of subroutine processing of the interface apparatus according to the fifth
embodiment of the present invention. The present embodiment is a modification of the second
embodiment, and step S403 of the flowchart shown in FIG. 9 is changed to step S403 '.
Therefore, this step will be mainly described.
[0099]
In step S403' of the subroutine processing of the interface device according to the fifth embodiment, when the cursor movement amount is specified, it is made proportional to the utterance volume (or its logarithmic value) and inversely proportional to the enlargement ratio. For example, when the enlargement ratio is doubled, the movement amount of the cursor is halved.
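A compact sketch of step S403', under the same assumed gains as before, is shown below: the movement amount grows with the volume (or its logarithm) and is divided by the current enlargement ratio, so doubling the zoom halves the movement.

```python
import math

def cursor_step_zoomed(volume, enlargement_ratio, use_log=True,
                       gain=50.0, log_gain=10.0):
    """Cursor movement proportional to volume (or its log), divided by the zoom.

    With enlargement_ratio == 2.0 the returned amount is half of the amount
    at enlargement_ratio == 1.0; gain and log_gain are assumed constants.
    """
    base = (max(0.0, log_gain * math.log10(max(volume, 1e-12)))
            if use_log else gain * volume)
    return base / max(enlargement_ratio, 1e-6)

print(cursor_step_zoomed(volume=100.0, enlargement_ratio=1.0))  # -> 20.0
print(cursor_step_zoomed(volume=100.0, enlargement_ratio=2.0))  # -> 10.0 (half)
```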
[0100]
In this embodiment, the screen is shown at normal magnification when the cursor is to be moved over a wide range, and at an increased magnification when the cursor is to be moved only slightly. If the movement amount of the cursor on the magnified screen were left the same as before enlargement, however, the cursor would move too far on the enlarged screen and operation would become difficult. To avoid this, the apparent movement amount of the cursor in the enlarged display is kept constant by making the actual cursor movement amount inversely proportional to the enlargement ratio. For example, at 2× display, the actual movement amount of the cursor is halved. When selecting an object such as a small icon or button, the enlargement ratio is increased so that the apparent object becomes larger and easier to select.
[0101]
Next, a sixth embodiment of the present invention will be described. FIG. 16 is a flowchart of the subroutine processing of the interface device according to the sixth embodiment of the present invention, and FIG. 17 is a view showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the sixth embodiment of the present invention.
[0102]
The present embodiment is a modification of the second embodiment, in which step S403 of the flowchart shown in FIG. 9 is changed to step S403''. Therefore, this step will be mainly described.
[0103]
In step S403'' of the subroutine processing of the interface device according to the sixth embodiment, when the cursor movement amount is specified, it is made proportional to the distance d and inversely proportional to the enlargement ratio. For example, when the enlargement ratio is doubled, the movement amount of the cursor is halved.
[0104]
In this embodiment, the movement amount of the cursor is determined in proportion to the distance d from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center.
[0105]
In addition, by making the actual cursor movement amount inversely proportional to the enlargement ratio, the apparent movement amount of the cursor in the enlarged display is kept constant, so that when selecting an object such as a small icon or button, the enlargement ratio can be increased to make the apparent object larger and easier to select.
[0106]
Next, a seventh embodiment of the present invention will be described.
FIG. 18 is a diagram showing a flowchart of subroutine processing of the interface apparatus
according to the seventh embodiment of the present invention. The present embodiment is a
modification of the third embodiment, and step S504 of the flowchart shown in FIG. 11 is
changed to step S504 '. Therefore, this step will be mainly described.
[0107]
In step S504' of the subroutine processing of the interface device according to the seventh embodiment, when the cursor movement amount is specified, it is made proportional to the utterance volume (or its logarithmic value) and inversely proportional to the enlargement ratio. For example, when the enlargement ratio is doubled, the movement amount of the cursor is halved. In this way, by making the actual cursor movement amount inversely proportional to the enlargement ratio, the apparent movement amount of the cursor in the enlarged display is kept constant, so that when selecting an object such as a small icon or button, the enlargement ratio can be increased to make the apparent object larger and easier to select.
[0108]
Next, an eighth embodiment of the present invention will be described. FIG. 19 is a flowchart showing the subroutine processing of the interface device according to the eighth embodiment of the present invention, and FIG. 20 is a view showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the eighth embodiment of the present invention.
[0109]
The present embodiment is a modification of the third embodiment, in which step S504 of the flowchart shown in FIG. 11 is changed to step S504''. Therefore, this step will be mainly described.
[0110]
In step S504'' of the subroutine processing of the interface device according to the eighth embodiment, when the cursor movement amount is specified, it is made proportional to the distance d and inversely proportional to the enlargement ratio. For example, when the enlargement ratio is doubled, the movement amount of the cursor is halved.
[0111]
In this embodiment, the movement amount of the cursor is determined in proportion to the distance d from a predetermined line segment (O-O') located at the boundary of the divided areas; that is, display control is performed so that the cursor moves by a larger amount at a time as the utterance position moves away from the center.
[0112]
In addition, by making the actual cursor movement amount inversely proportional to the enlargement ratio, the apparent movement amount of the cursor in the enlarged display is kept constant, so that when selecting an object such as a small icon or button, the enlargement ratio can be increased to make the apparent object larger and easier to select.
[0113]
Next, a ninth embodiment of the present invention will be described.
This embodiment is used in combination with any of the embodiments described above. FIG. 21 is a view showing a display example on the display unit in the information processing apparatus according to the ninth embodiment of the present invention. In FIG. 21, a window 31 graphically indicates at which position in the utterance detection area R the interface device 100 recognizes the user's utterance position.
[0114]
Unlike a pointing device such as a mouse, whose position can be confirmed visually, with the information processing apparatus 10 according to the present invention it is not easy for users to grasp to which position in the three-dimensional utterance detection area R their utterance corresponds. In addition, the volume of the voice changes with differences between sounds such as "Shi" and "Shoo", which is also difficult to grasp.
[0115]
Therefore, displaying the current utterance position on the display unit 30 makes it easy for the
user to grasp his / her utterance position. The display displays the utterance position as a circle
in a pseudo three-dimensional area. Note that the spatial position may be easily grasped by
displaying shadows on each plane (X-Y, Y-Z, Z-X). In addition, a graph of changes in the volume
of the utterance during a predetermined period of time may be displayed sequentially. This
makes it easier to understand the relationship between the utterance volume and the speed of
the cursor.
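As an illustration of the shadow display just mentioned, the three-dimensional utterance position could be projected onto the three coordinate planes roughly as follows; the function name is an assumption made only for this sketch.

def plane_shadows(x, y, z):
    # Project the 3-D utterance position onto the X-Y, Y-Z and Z-X planes so
    # that its spatial position is easier to grasp on a 2-D display.
    return {"X-Y": (x, y), "Y-Z": (y, z), "Z-X": (z, x)}

print(plane_shadows(0.1, 0.3, 0.5))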
[0116]
Next, an experiment relating to the feeling of use of the information processing apparatus 10 of the present invention will be described. The interface device 100, which is a three-axis microphone array, was installed in front of the display unit 30 of the Windows (registered trademark) machine to be operated. For the evaluation, five subjects performed cursor operations toward a target using the information processing apparatus 10 of the present invention and were then interviewed about the feeling of use.
[0117]
In the cursor operation experiment, a target was displayed on the display unit 30, and one trial consisted of moving the cursor onto the target and left-clicking it; the arrival time and the movement path from the start position were recorded. The display used has a resolution of 1600 × 1200. The target size is 16 × 16, about the same size as the close button of a Windows (registered trademark) window. In addition, in order to verify the effectiveness of the zoom function, three trials were conducted for each subject under the following three conditions: 1) no zoom function (first embodiment); 2) ON/OFF switching of the zoom by a click operation (second embodiment); 3) ON/OFF switching of the zoom by the utterance position (third embodiment). For each condition, each subject practiced until the operation was judged to be possible before the trials were conducted. With either zoom function, the time to reach the target is reduced compared to the case without zoom. From this, it can be said that detailed operation of the cursor is possible with the zoom, indicating the effectiveness of the zoom function. Comparing the two zoom functions, ON/OFF switching of the zoom by the utterance position reaches the target in a shorter time in most trials. This is considered to be because, with ON/OFF switching by a click, it was necessary to stop the movement once in order to perform the click. However, in some trials, ON/OFF switching by position took a longer time to reach the target. This is because the utterance position in the Z-axis direction changed when the user did not intend it; the zoom magnification then changes and the cursor moves at an unexpected speed, making control difficult. This tendency was more likely to appear in subjects whose practice time was shorter. As a result of the interviews with the subjects, the following points were found regarding the feeling of use of the system: it is difficult to control the utterance position in the back-and-forth (Z-axis) direction, so the zoom magnification cannot be controlled as intended; fine adjustment of the cursor position cannot be performed (without the zoom function); and the subjects wanted to be able to make fine adjustments not by a click but by a short utterance.
[0118]
As can be seen from the above, many subjects felt that it was difficult to control the zoom and tended to move the cursor to the target without using the zoom function as much as possible. It was found that it is not easy to control the utterance position in the Z-axis coordinate direction, that is, in the direction perpendicular to the display unit 30, in absolute terms. The reason is that when moving the cursor to the left and right, the neck is rotated to move the utterance position to the left and right. When the neck is rotated, the distance from the display also changes, and an unintended change in the user's Z-axis coordinate appears. This causes the magnification to change unexpectedly and confuses the user. When the practice time is increased, Z-axis coordinate control of the utterance position becomes possible, but many subjects operated so as not to move back and forth as much as possible, that is, so as not to use the zoom function. There were also problems with visual aspects: if the target is not within the magnified view, the target is lost sight of; it is difficult to operate while watching the visual feedback; and the cursor tends to be lost sight of. The visual feedback is displayed at the lower right of the display unit 30 so as not to disturb the operation, which makes it difficult to check the display while operating. In addition, since the cursor is more difficult to control than with a mouse or the like, it often moves farther than expected and may be lost. Therefore, it is conceivable to display the visual feedback superimposed on the cursor.
[0119]
Next, the elemental technology in the processing of the interface device 100 will be described.
[0120]
The interface device 100 requires, even in the presence of ambient noise, the three-dimensional position of the user's voice and the user's voice separated from the noise.
The five processes of the interface device 100, a three-dimensional voice pointing device, that are necessary to extract such information will be described below: 1. estimation of the user's utterance position (estimation of a near-distance sound source); 2. estimation of the arrival direction of ambient noise (estimation of the sound-wave arrival direction of a far-distance sound source); 3. detection of the user's speech; 4. sound source separation; and 5. speech recognition processing (Japanese Patent Application No. 2003-320183). 1. Estimation of the User's Utterance Position (Estimation of a Near-Distance Sound Source) A method of estimating, with the microphone array, the position of a sound source located within a short distance of about 1 m from the microphone array will be described below.
[0121]
The plurality of microphones can be arranged at arbitrary positions in the three-dimensional space.
[0122]
The sound signal output from a sound source placed at an arbitrary position in the three-dimensional space
[0123]
is received with Q microphones placed at arbitrary positions.
The distance Rq between the sound source and each microphone can be obtained by the
following equation.
[0124]
The propagation time τq from the sound source to each microphone can be obtained by the following equation, where the sound velocity is v.
[0125]
The gain gq of the narrow-band signal of center frequency ω received by each microphone, relative to that of the sound source, is generally defined as a function of the distance Rq between the sound source and the microphone and of the center frequency ω.
[0126]
For example, a function such as the following equation, obtained experimentally with the gain expressed as a function of the distance Rq, is used.
[0127]
The transfer characteristic between the sound source and each microphone for a narrow-band signal at center frequency ω is expressed as follows.
[0128]
Then, a position vector a(ω, P0) representing the sound source at the position P0 is defined as a complex vector whose elements are the transfer characteristics between the sound source and each microphone for the narrow-band signal, as in the following equation.
[0129]
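Purely as an illustration of how such a position vector could be assembled, the following sketch assumes a gain of the form gq = 1/Rq and a phase delay of ω·τq; the gain function of this specification is the experimentally obtained one mentioned above, and all names and numerical values below are chosen only for this sketch.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value of the sound velocity v

def position_vector(omega, source_pos, mic_positions):
    # One complex element per microphone, built from the distance-dependent
    # gain and the propagation delay between the source and that microphone.
    p0 = np.asarray(source_pos, dtype=float)
    mics = np.asarray(mic_positions, dtype=float)   # shape (Q, 3)
    r = np.linalg.norm(mics - p0, axis=1)           # distances Rq
    tau = r / SPEED_OF_SOUND                        # propagation times tau_q
    gain = 1.0 / np.maximum(r, 1e-6)                # assumed gain model gq(Rq)
    return gain * np.exp(-1j * omega * tau)         # transfer characteristic per microphone

# Example: a source 0.3 m in front of a small three-microphone array.
mics = [[-0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
a = position_vector(omega=2 * np.pi * 1000.0, source_pos=[0.0, 0.3, 0.0], mic_positions=mics)
print(a.shape, np.abs(a))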
The estimation of the sound source position is performed by the following procedure using the MUSIC method (a method that obtains the signal subspace and the noise subspace by eigenvalue decomposition of the correlation matrix, and examines candidate positions by means of the reciprocal of the inner product between an arbitrary sound source position vector and the noise subspace).
Denote the short-time Fourier transform of the q-th microphone input by
[0130]
The observation vector is then defined as follows using this transform.
[0131]
Here, n is an index of the frame time.
The correlation matrix is determined by the following equation from N consecutive observation vectors.
[0132]
Let the eigenvalues of this correlation matrix, arranged in descending order, be
[0133]
and the corresponding eigenvectors be
[0134]
Then, the number of sound sources S is estimated by the following equation.
[0135]
Alternatively, it is possible to set a threshold for the eigenvalues and set the number of
eigenvalues exceeding the threshold as the number of sound sources S.
Define the matrix Rn(ω) from the basis vectors of the noise subspace as follows.
[0136]
For the frequency band
[0137]
and the search area U of the sound source position estimation
[0138]
calculate the following function F(P).
[0139]
Then, a coordinate vector at which the function F (P) takes a maximal value is determined.
Here, it is assumed that P1, P2,..., Ps are estimated as coordinate vectors giving S maximum
values.
Next, the power of the sound source in each of the coordinate vectors is determined by the
following equation.
[0140]
Then, two thresholds Fthr and Pthr are prepared, and when F (Ps) and P (Ps) in each position
vector satisfy the following condition,
[0141]
It is determined that an utterance has occurred at the coordinate vector Ps within the N consecutive frame times.
The estimation process of the sound source position processes N consecutive frames as one
block.
In order to estimate the sound source position more stably, the number of frames N is increased,
and / or it is determined that an utterance has occurred if the condition of the equation (30) is
satisfied in all the consecutive Nb blocks.
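A minimal sketch of the procedure described above (correlation matrix, eigenvalue decomposition into signal and noise subspaces, evaluation of a MUSIC-type function over candidate positions) is given below; the helper names, the toy data and the identity steering function are assumptions made only for this illustration.

import numpy as np

def music_spectrum(obs_frames, steering_fn, candidate_positions, num_sources=None, eig_threshold=None):
    # obs_frames: N frames of Q-channel narrow-band spectra, shape (N, Q).
    X = np.asarray(obs_frames)
    R = (X.conj().T @ X) / X.shape[0]                   # correlation matrix (Q, Q)
    eigvals, eigvecs = np.linalg.eigh(R)                # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
    if num_sources is None:                             # threshold on the eigenvalues
        num_sources = int(np.sum(eigvals > eig_threshold))
    En = eigvecs[:, num_sources:]                       # noise-subspace basis vectors
    spectrum = []
    for pos in candidate_positions:
        a = steering_fn(pos)                            # position vector a(omega, P)
        denom = np.linalg.norm(En.conj().T @ a) ** 2    # projection onto the noise subspace
        spectrum.append((a.conj() @ a).real / max(denom, 1e-12))
    return np.asarray(spectrum)                         # peaks near true source positions

# Toy usage: with an identity steering function the candidate "positions" can be
# passed directly as steering vectors (values are arbitrary, for illustration only).
rng = np.random.default_rng(0)
true_a = np.array([1.0, 0.8 * np.exp(-0.5j), 0.8 * np.exp(-1.0j)])
s = rng.standard_normal(50) + 1j * rng.standard_normal(50)
frames = np.outer(s, true_a) + 0.01 * (rng.standard_normal((50, 3)) + 1j * rng.standard_normal((50, 3)))
cands = [true_a, np.ones(3, dtype=complex)]
print(music_spectrum(frames, steering_fn=lambda v: v, candidate_positions=cands, num_sources=1))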
The number of blocks is set arbitrarily. In the case where the sound source is moving slowly enough that it can be regarded as approximately stationary within the time of N consecutive frames, the movement trajectory of the sound source can be captured by the above method. 2. Estimation of the Direction of Arrival of Ambient Noise (Estimation of the Sound-Wave Arrival Direction of a Far-Distance Sound Source) A method of estimating, with the microphone array, the direction from which the sound wave of a sound source located far from the microphone array arrives will be described below. The plurality of microphones can be arranged at arbitrary positions in the three-dimensional space. Sound waves coming from a long distance can be regarded as being observed as plane waves.
[0142]
FIG. 22 is an explanatory view for explaining a sound receiving function using the microphone
array of the present invention. FIG. 22 shows, as an example, a case where three microphones
m1, m2 and m3 arranged at arbitrary positions receive sound waves coming from a sound
source. In FIG. 22, a point c indicates a reference point around which the arrival direction of the
sound wave is estimated. In FIG. 22, the plane s shows the cross section of the plane wave
including the reference point c. The normal vector n of the plane s is defined as the following
equation, with the direction of the vector opposite to the propagation direction of the sound
wave.
[0143]
The sound-wave arrival direction of the sound source in the three-dimensional space is represented by two parameters (θ, φ). The sound wave coming from the direction (θ, φ) is received by each microphone, the received signal is decomposed into narrow-band signals by computing its Fourier transform, and the gain and phase are determined for each narrow-band signal of each received signal and expressed as a complex number. The vector having these complex values for all of the received signals as its elements, for each narrow-band signal, is defined as the position vector of the sound source. In the following processing, the sound wave coming from the direction (θ, φ) is represented by this position vector. The position vector is specifically determined as follows. The distance rq between the q-th microphone and the plane s is determined by the following equation.
[0144]
The distance rq is positive if the microphone is located on the sound source side with respect to
the plane s, and conversely, takes a negative value if it is on the opposite side of the sound
source. Assuming that the sound velocity is v, the propagation time Tq between the microphone
and the plane s is expressed by the following equation.
[0145]
The gain of the amplitude at a position separated by the distance rq, relative to the amplitude in the plane s, is defined as follows as a function of the center frequency ω of the narrow-band signal and the distance rq.
[0146]
The phase difference at a position separated by the distance rq, relative to the phase in the plane s, is expressed by the following equation.
[0147]
From the above, the gain and the phase difference of the narrow band signal observed by each
microphone are represented by the following equation with reference to the plane s.
[0148]
When sound waves coming from the (θ, φ) direction are observed with Q microphones, the position vector of the sound source is defined by the following equation as a vector whose elements are the values obtained according to equation (26) for each microphone.
[0149]
Once the position vector of the sound source is defined, the direction of arrival estimation of the
sound wave is performed using the MUSIC method.
Using the matrix Rn(ω) given by equation (15), with the search region I of the sound-wave arrival direction estimation taken as
[0150]
calculate the following function J(θ, φ).
[0151]
Then, the direction (θ, φ) in which the function J (θ, φ) gives the maximum value is
determined.
Here, it is assumed that there are K sound sources, and K sound-wave arrival directions ((θ1, φ1), ..., (θK, φK)) giving maximal values are estimated.
Next, the power of the sound source in each sound wave arrival direction is determined by the
following equation.
[0152]
Then, two thresholds Jthr and Qthr are prepared, and when J (θk, φk) and Q (θk, φk) in each
direction of arrival satisfy the following conditions,
[0153]
It is determined that there is an utterance in the direction of arrival (θk, φk) within the N consecutive frame times.
The process of estimating the direction of arrival of sound waves treats N consecutive frames as one block. In order to estimate the direction of arrival more stably, the number of frames N is increased, and/or it is determined that the sound wave has arrived from that direction if the condition of equation (31) is satisfied in all of the consecutive Nb blocks. The number of blocks is set arbitrarily. If the sound source is moving slowly enough that it can be regarded as approximately stationary within the time of N consecutive frames, the above method can capture the movement trajectory of the direction of arrival of the sound wave.
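For the far-distance case only the position vector changes: each element carries the gain and phase corresponding to the plane wave at that microphone. The following sketch assumes unit gain and one particular (θ, φ) parameterization of the direction; both are assumptions made only for this illustration, and the same MUSIC-type search can then be run over a grid of directions.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def farfield_position_vector(omega, theta, phi, mic_positions, ref_point=(0.0, 0.0, 0.0)):
    # A plane wave from direction (theta, phi) is characterised, per microphone,
    # by the signed distance rq to the wavefront through the reference point c
    # and the corresponding phase omega * rq / v (unit gain assumed here).
    n = np.array([np.cos(phi) * np.cos(theta),
                  np.cos(phi) * np.sin(theta),
                  np.sin(phi)])                          # assumed parameterization of the normal vector
    mics = np.asarray(mic_positions, dtype=float) - np.asarray(ref_point, dtype=float)
    r = mics @ n                                         # signed distances rq to the plane s
    return np.exp(1j * omega * r / SPEED_OF_SOUND)

mics = [[-0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
a = farfield_position_vector(2 * np.pi * 1000.0, theta=np.deg2rad(30), phi=0.0, mic_positions=mics)
print(np.angle(a))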
[0154]
The position estimation result of the near-distance sound source and the sound-wave arrival direction estimation result of the far-distance sound source play an important role in the subsequent speech detection processing and sound source separation processing. However, when a near-distance sound source and a far-distance sound source occur simultaneously and the power of the near-distance sound source becomes significantly larger than that of the sound wave coming from the far-distance sound source, the arrival direction of the far-distance sound source may not be estimated well. In such a case, the arrival direction estimated for the far-distance sound source just before the near-distance sound source occurred is used. 3. User Speech Detection When there are multiple sound sources, it is generally difficult to identify which sound source is the speech to be recognized. On the other hand, in a system employing a speech interface, it is possible to determine in advance a user speech area representing at what position the user of the system speaks relative to the system. In this case, even if a plurality of sound sources exist around the system, as long as the position of each sound source and the arrival direction of each sound wave can be estimated by the above-described method, the system can select the sound source entering the user speech area assumed in advance, which makes it possible to easily identify the user's voice.
[0155]
By satisfying the conditions of Expression (20) and Expression (31), the presence of a sound source is detected; the user's utterance is then detected when, in addition, the conditions on the position of the sound source and on the arrival direction of the sound wave are satisfied. This detection result plays an important role, as speech-section information, in the speech recognition processing. When speech recognition is performed, it is necessary to detect the start point and the end point of the speech section from the input signal. However, it is not always easy to detect a speech section in an environment where ambient noise is present. In general, if the start point of the speech section is shifted, the speech recognition accuracy is significantly degraded. On the other hand, even if there are a plurality of sound sources, the function represented by equation (18) or (29) shows a sharp peak at the position of the sound source or at the arrival direction of the sound wave. Therefore, the speech recognition apparatus of the present invention, which performs speech-section detection using this information, has the advantage of being able to perform robust speech-section detection even in the presence of a plurality of ambient noise sources and of maintaining high speech recognition accuracy.
[0156]
For example, user utterance areas such as those shown in FIG. 23 can be defined. FIG. 23 is a functional explanatory diagram of the speech detection processing according to the present invention. Although the drawing shows only the X-Y plane for the sake of simplicity, in general any user utterance area can be defined similarly in the three-dimensional space. In FIG. 23, assuming processing that uses eight microphones m1 to m8 arranged at arbitrary positions, a user utterance area is defined within each of the search area of the short-distance sound source and the search area of the far-distance sound source. The search area of the short-distance sound source is a rectangular area whose diagonal is the straight line connecting the two points (PxL, PyL) and (PxH, PyH); within it, two rectangular areas, whose diagonals are the straight lines connecting the two points (PTxL1, PTyL1) and (PTxH1, PTyH1) and the two points (PTxL2, PTyL2) and (PTxH2, PTyH2), are defined as the user utterance areas. Therefore, the user's voice can be identified by selecting, from among the sound source positions determined by equation (20) to contain a voice, the one whose coordinate vector lies within the user utterance area.
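A minimal sketch of this selection step for the near-distance case is shown below; the function name and the numerical coordinates are placeholders chosen only for this illustration.

def in_user_area(pos, corner_low, corner_high):
    # A detected sound-source position is accepted as the user's voice only if
    # every coordinate lies inside the rectangular user utterance area.
    return all(lo <= p <= hi for p, lo, hi in zip(pos, corner_low, corner_high))

# Two user utterance areas as in FIG. 23 (coordinates are placeholders).
user_areas = [((0.1, 0.1), (0.3, 0.3)), ((0.5, 0.1), (0.7, 0.3))]
detected = (0.22, 0.15)
print(any(in_user_area(detected, lo, hi) for lo, hi in user_areas))  # True -> treat as the user's voice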
[0157]
On the other hand, with respect to the search area of the far-distance sound source, the directions from the angle θL to the angle θH around the point c are defined as the search area, and within it the directions from the angle θTL1 to the angle θTH1 are defined as the user utterance area. Therefore, among the sound sources existing at a long distance, the user's voice can be identified by selecting, from among the arrival directions determined by equation (31) to contain a voice, the one whose arrival direction lies within the user utterance area. 4. Sound Source Separation A sound source separation process that emphasizes the user's voice and suppresses ambient noise, using the position estimation result of the sound source whose speech has been detected or the estimation result of the arrival direction of the sound wave, will be described below. The utterance position or arrival direction of the user's voice is obtained by the utterance detection process, and the source positions or arrival directions of the ambient noise have already been estimated. Using these estimation results, the sound source position vectors of Equations (8) and (27), and σ representing the variance of the omnidirectional noise, the matrix V(ω) is defined by the following equation.
[0158]
Let the eigenvalues of this correlation matrix, arranged in descending order, be
[0159]
and the corresponding eigenvectors be
[0160]
Here, since the correlation matrix V(ω) contains (S + K) sound sources, S near-distance sources and K far-distance sources, the matrix Z(ω) is defined from the (S + K) largest eigenvalues and the corresponding eigenvectors as follows.
[0161]
Then, a separation filter W (ω) for emphasizing the voice of the user present in the coordinate
vector P at a short distance is given by the following equation.
[0162]
By multiplying the separation filter of equation (36) by the observation vector of equation (10),
the voice v (ω) of the user present in the coordinate vector P can be obtained.
[0163]
The waveform signal of the emphasized user speech can be obtained by calculating the inverse
Fourier transform of equation (37).
[0164]
On the other hand, the separation filter M (ω) in the case of emphasizing the voice of the user
who is in the far direction (θ, φ) is given by the following equation.
[0165]
By multiplying the separation filter of equation (38) by the observation vector of equation (10),
the emphasized speech v (ω) of the user in the direction (θ, φ) is obtained.
[0166]
The waveform signal of the emphasized user speech can be obtained by calculating the inverse
Fourier transform of equation (37).
In the case where the sound source is moving at such a speed that the sound source can be seen
to be approximately stationary within the time of N consecutive frames, the above-described
method provides the enhanced voice of the moving user.
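The exact forms of the separation filters W(ω) and M(ω) are given by equations that are not reproduced in this text. As a stand-in that uses the same ingredients (the correlation information of the user and the noise sources, and the user's position vector), the following sketch applies a standard minimum-variance (MVDR-type) beamformer; it illustrates where such a filter plugs in, not the specific filter of this specification.

import numpy as np

def mvdr_separation_filter(V, a_user, diag_load=1e-6):
    # Stand-in separation filter: unit gain toward the user's position vector,
    # minimum output power otherwise.  V is a correlation matrix built from the
    # position vectors of the user and the noise sources.
    Q = V.shape[0]
    Vinv = np.linalg.inv(V + diag_load * np.eye(Q))       # regularised inverse
    return Vinv @ a_user / (a_user.conj() @ Vinv @ a_user)

def separate(w, x):
    # Apply the filter to an observation vector x(omega) to obtain the
    # emphasised narrow-band user speech v(omega).
    return w.conj() @ x

# Toy example: three channels, the user's steering vector and one observation.
a_user = np.array([1.0, 1.0j, -1.0])
V = np.outer(a_user, a_user.conj()) + 0.1 * np.eye(3)
x = 2.0 * a_user + 0.01 * np.ones(3)
w = mvdr_separation_filter(V, a_user)
print(separate(w, x))  # close to 2.0, plus a small leakage term

A waveform is then recovered, as described in the text, by taking the inverse Fourier transform over the narrow-band outputs.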
5.
Speech Recognition Processing Although the sound source separation processing is effective against directional noise, some noise remains in the case of nondirectional noise. Also, not much noise suppression can be expected for noise generated in a short time, such as sudden noise. Therefore, for the recognition of the user voice emphasized by the sound source separation processing, the influence of the residual noise is reduced by using a speech recognition engine incorporating, for example, the feature correction method described in Japanese Patent Application No. 2003-320183, "correction method of background noise distortion and speech recognition system using the same". The present invention is not limited to the method of Japanese Patent Application No. 2003-320183 for the speech recognition engine, and it is also conceivable to use a speech recognition engine in which various other noise-robust methods are implemented.
[0167]
The feature correction method described in Japanese Patent Application No. 2003-320183 performs feature amount correction of noise-superimposed speech based on the Hidden Markov Models (HMMs) that the speech recognition engine holds in advance as template models for speech recognition. The HMMs are trained on Mel-Frequency Cepstrum Coefficients (MFCCs) obtained from clean speech without noise. For this reason, there is the advantage that it is relatively easy to incorporate the feature correction method into an existing recognition engine without having to prepare new parameters for feature correction. In this method, the noise is regarded as consisting of a stationary component and a non-stationary component that changes temporarily, and the stationary component of the noise is estimated from several frames immediately before the speech.
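A minimal sketch of this stationary-component estimate, assuming MFCC-like feature frames and a hypothetical number of pre-speech frames, is as follows.

import numpy as np

def estimate_stationary_noise(feature_frames, speech_start, num_pre_frames=10):
    # The stationary component of the noise is estimated as the mean of the
    # feature vectors of the frames immediately before the speech section starts.
    first = max(0, speech_start - num_pre_frames)
    return np.mean(feature_frames[first:speech_start], axis=0)

# 100 frames of 13-dimensional features, speech detected at frame 40.
frames = np.random.default_rng(0).standard_normal((100, 13))
print(estimate_stationary_noise(frames, speech_start=40).shape)  # (13,)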
[0168]
A copy of the distribution possessed by the HMM is generated, and a stationary component of
the estimated noise is added to generate a feature amount distribution of the stationary noise
superimposed speech. By evaluating the a posteriori probability of the feature quantity of the
observed noise-superimposed speech with the feature quantity distribution of this stationary
noise-superimposed speech, distortion due to a stationary component of noise is absorbed.
However, since distortion due to the non-stationary component of noise is not taken into
consideration by this processing alone, the posterior probability obtained by the above means is
not accurate when the non-stationary component of noise is present. On the other hand, by using
an HMM for feature correction, it is possible to use the temporal structure of the feature time
series and the accumulated output probability obtained along it. By assigning a weight calculated
from the accumulated output probability to the above-mentioned posterior probability, it is
possible to improve the reliability of the posterior probability degraded by the temporarily
changing non-stationary component of noise.
[0169]
When speech recognition is performed, it is necessary to detect the start time point and the end
time point of the speech section from the input signal. However, it is not always easy to detect a
speech segment in a noise environment where ambient noise is present. In particular, since the
speech recognition engine incorporating the feature correction estimates stationary features of
ambient noise from several frames immediately before the start of speech, recognition accuracy
is significantly degraded if the start point of the speech section deviates. On the other hand, even
if there are a plurality of sound sources, the function represented by the equation (18) or (29)
shows a sharp peak at the position where the sound source is located or the arrival direction of
the sound wave. Therefore, the speech recognition apparatus according to the present invention, which performs speech-section detection using this information, can perform robust speech-section detection even in the presence of a plurality of ambient noise sources, and can maintain high speech recognition accuracy.
[0170]
As described above, according to the information processing apparatus of the present invention,
it is possible to display the user's operation on a display or the like using exhalation or sound
even in a noisy environment, using the position of the user in the three-dimensional space.
[0171]
Further, according to the information processing apparatus of the present invention, it is possible
to perform detailed input operations such as cursor movement to small icons and buttons.
[0172]
FIG. 1 is a perspective view showing the appearance of an information processing apparatus according to an embodiment of the present invention. FIG. 2 is a perspective view showing the appearance of an interface device used in the information processing apparatus according to the embodiment of the present invention. FIG. 3 is a diagram showing the block configuration of the information processing apparatus according to the embodiment of the present invention. FIG. 4 is a diagram showing an example of a usage form of the interface apparatus according to the embodiment of the present invention. FIG. 5 is a diagram showing a flowchart of the processing of the interface apparatus according to the embodiment of the present invention. FIG. 6 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the first embodiment of the present invention. FIG. 7 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the first embodiment of the present invention. FIG. 8 is a diagram showing another example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the first embodiment of the present invention. FIG. 9 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the second embodiment of the present invention. FIG. 10 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the second embodiment of the present invention. FIG. 11 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the third embodiment of the present invention. FIG. 12 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the third embodiment of the present invention. FIG. 13 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the fourth embodiment of the present invention. FIG. 14 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the fourth embodiment of the present invention. FIG. 15 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the fifth embodiment of the present invention. FIG. 16 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the sixth embodiment of the present invention. FIG. 17 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the sixth embodiment of the present invention. FIG. 18 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the seventh embodiment of the present invention. FIG. 19 is a diagram showing a flowchart of the subroutine processing of the interface apparatus according to the eighth embodiment of the present invention. FIG. 20 is a diagram showing an example of the virtual space defined in the utterance detection area R in the information processing apparatus according to the eighth embodiment of the present invention. FIG. 21 is a diagram showing a display example on the display unit in the information processing apparatus according to the ninth embodiment of the present invention. FIG. 22 is an explanatory view for explaining the sound receiving function using the microphone array of the present invention. FIG. 23 is a functional explanatory diagram of the speech detection processing according to the present invention.
Explanation of Reference Signs
[0173]
DESCRIPTION OF SYMBOLS 10: information processing apparatus, 20: computer main body, 30: display unit, 31: window, 100: interface apparatus, 200: microphone array, 201: silicon microphone, 202: wind screen, 210: stand, 211: main support, 212: left support, 213: right support, 280: microphone amplifier, 290: AD converter, 300: CPU, 400: storage unit, 500: connection port unit