JPWO2014132533

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPWO2014132533
Abstract: Provided is a voice input device capable of reducing false recognition of a user's voice
when there is a sound source unintended by the recording person, such as a person speaking at a
different distance in the same direction as the user whose voice is to be acquired. The voice input
device arranges the camera and the microphone array so that the reference point (41) of the camera
and the reference point (22) of the microphone array are separated by a predetermined distance L,
and calculates, based on the image input by the camera, the camera reference user angle θ and the
camera reference user distance D with the camera as the reference. Then, the voice input device
calculates the microphone array reference user angle α, which takes the microphone array as the
reference, based on the camera reference user angle θ and the camera reference user distance D, and
controls the directivity angle of the microphone array to be the microphone array reference user
angle α.
Voice input device and image display device provided with the voice input device
[0001]
The present invention relates to an audio input device including a microphone array having a
plurality of microphones and capable of inputting audio from a specific direction, and an image
display device including the audio input device.
[0002]
There is known a method of changing the directivity angle of voice input using a microphone
array having a plurality of microphones, emphasizing voice information from a specific direction,
03-05-2019
1
and suppressing voice information from other directions.
By mounting this on the voice input device and directing the directivity angle of the microphone
array in the direction in which the voice is generated, the recognition accuracy of voice
recognition can be improved. When voice input is performed by such a method, the direction of
voice uttered by the user is specified, and the directivity angle of the microphone array is
directed to the direction of the voice uttered.
[0003]
However, since the direction is specified only from the voice information, the directivity angle
may be directed toward voice information not intended by the user, such as the voice of a person
other than the user, and that unintended voice information may be input and recognized.
[0004]
Therefore, a technique has been proposed to reduce the possibility of unintended voice input by
detecting the direction of the user from image information captured by a camera and directing
the directivity angle of the voice input only in that direction (see Patent Document 1).
The technology described in Patent Document 1 improves the accuracy of voice recognition by
specifying a dynamically changing speaker direction using an image captured by a camera and
controlling the directivity direction of the microphone to be the speaker direction.
[0005]
Patent Document 1: JP 2009-225379 A
[0006]
However, the technology described in Patent Document 1 has the following problems.
If there is a sound source unintended by the recording person in the same direction as the user's
direction detected by the camera, for example when another speaker is behind the voice acquisition
target user as seen from the camera, voice unintended by the user may be input even when the
directivity angle is directed toward the user, and erroneous recognition may occur in the voice
recognition process. That is, with this technology, a voice other than the user's voice coming from
the same direction as the user, such as the voice of a person at a different distance from the
target user, is picked up and recognized, causing false recognition of the user's voice.
[0007]
The present invention has been made in view of the above circumstances, and its object is to
provide a voice input device capable of reducing false recognition of the user's voice when there
is a sound source unintended by the recording person, such as a person speaking at a different
distance in the same direction as the voice acquisition target user, and an image display device
provided with the voice input device.
[0008]
In order to solve the above problems, a first technical means of the present invention is a voice
input device including a microphone array having a plurality of microphones for acquiring voice, a
camera having an imaging element and inputting an image by photographing, and a directivity angle
control unit for controlling a directivity angle of the microphone array based on the image,
wherein the camera and the microphone array are spaced apart by a predetermined distance, and the
voice input device comprises: a camera reference user angle calculation unit that calculates, based
on the image, a camera reference user angle that is an angle of a user position of a voice
acquisition target user with respect to the camera; a camera reference user distance calculation
unit that calculates, based on the image, a camera reference user distance that is a distance to
the user position with the camera as a reference; and a microphone array reference user angle
calculation unit that calculates a microphone array reference user angle that is an angle of the
user position with respect to the microphone array, based on the predetermined distance, the camera
reference user angle, and the camera reference user distance; and wherein the directivity angle
control unit controls the directivity angle of the microphone array to be the microphone array
reference user angle.
[0009]
A second technical means of the present invention is the first technical means, wherein the
microphone array reference user angle calculation unit calculates a microphone array reference
user angle range based on the predetermined distance, the camera reference user distance, the
camera reference user angle, and an angle calculation error range according to the camera
reference user angle, and the directivity angle control unit controls the directivity angle of the
microphone array to be the microphone array reference user angle range.
[0010]
A third technical means of the present invention is the first technical means, wherein the
microphone array reference user angle calculation unit calculates a microphone array reference
user angle range based on the predetermined distance, the camera reference user angle, the camera
reference user distance, and an angle calculation error range according to the camera reference
user angle and/or a distance calculation error range according to the camera reference user
distance, and the directivity angle control unit controls the directivity angle of the microphone
array to be the microphone array reference user angle range.
[0011]
A fourth technical means of the present invention is any one of the first to third technical
means, wherein a face area of a subject in the image is detected, and the camera reference user
angle and/or the camera reference user distance is calculated with the position of the face area
as the user position.
[0012]
A fifth technical means of the present invention is an image display device provided with the
voice input device according to any one of the first to fourth technical means.
[0013]
According to the present invention, in a voice input device capable of controlling the directivity
angle, even when there is a sound source unintended by the recording person, such as a person
speaking at a different distance in the same direction as the voice acquisition target user, only
the voice of the user at the target distance is input, so that false recognition of the user's
voice can be reduced and the user's voice can be acquired with low noise.
[0014]
FIG. 1 is a block diagram showing one configuration example of a voice input device according to
Embodiment 1 of the present invention.
FIG. 2 is an overhead view showing one configuration example of the microphone array in the voice
input device of FIG. 1.
FIG. 3 is an overhead view showing another configuration example of the microphone array in the
voice input device of FIG. 1.
FIG. 4 is an overhead view showing an example of the arrangement relationship between the camera
and the microphone array in the voice input device of FIG. 1.
FIG. 5 is a diagram of a perspective projection camera model for explaining an example of the
camera reference user angle calculation method in the voice input device of FIG. 1.
FIG. 6 is a diagram for explaining the method of calculating the microphone array reference user
angle in the voice input device of FIG. 1, and is an overhead view showing an example in which the
user is located between the camera and the microphone array.
FIG. 7 is a diagram for explaining the method of calculating the microphone array reference user
angle in the voice input device of FIG. 1, and is an overhead view showing an example in which the
user is located in the same direction as viewed from the camera and the microphone array and at a
position near the microphone array.
FIG. 8 is a diagram for explaining the method of calculating the microphone array reference user
angle in the voice input device of FIG. 1, and is an overhead view showing an example in which the
user is located in the same direction as viewed from the camera and the microphone array and at a
position near the camera.
A diagram showing one configuration example of the microphone array in the voice input device
according to Embodiment 2 of the present invention.
A diagram for explaining an example of the method of calculating the microphone array reference
user angle range in the voice input device according to Embodiment 2 of the present invention,
being an overhead view showing an example in which the calculation error range is large.
A diagram for explaining an example of the method of calculating the microphone array reference
user angle range in the voice input device according to Embodiment 2 of the present invention,
being an overhead view showing an example in which the calculation error range is small.
A block diagram showing one configuration example of the speech recognition apparatus according to
Embodiment 4 of the present invention.
A block diagram showing one configuration example of the image display apparatus according to
Embodiment 5 of the present invention.
[0015]
Embodiment 1
Hereinafter, Embodiment 1 of the present invention will be specifically described
with reference to FIGS. 1 to 8. FIG. 1 is a block diagram showing one configuration example of a
voice input device according to the present embodiment, FIG. 2 is an overhead view showing one
configuration example of a microphone array in the voice input device of FIG. 1, and FIG. 3 is an
overhead view showing another configuration example of the microphone array in the voice input
device of FIG. 1.
[0016]
As shown in FIG. 1, the voice input device 1 according to this embodiment includes a microphone
array 11, a camera 12, an in-image user coordinate detection unit 13, a camera reference user
angle calculation unit 14, a camera reference user distance calculation unit 15, a microphone
array reference user angle calculation unit 16, and a directivity angle control unit 17.
[0017]
The microphone array 11 has a plurality of microphones for acquiring voice.
Each microphone may be any audio input element; FIG. 2 shows an example in which the
microphone array 11 is configured by arranging a plurality of superdirective microphones 21.
Each superdirective microphone 21 is a microphone having sharp directivity, such as a gun
microphone, and the microphones are arranged such that their sound collection directions 23 are
substantially radial around an arbitrary point in space. In the present embodiment, the point at
the center of the sound collection directions 23 is referred to as a microphone array reference
point 22. However, the microphone array reference point 22 does not have to be exactly one point,
and the sound collection directions may intersect in substantially the same area.
[0018]
In the present embodiment, the reference direction of the directivity angle of the microphone
array 11 is the central direction of the arrangement of the microphone array 11 (the direction
connecting the center of the arrangement and the microphone array reference point 22), and is
called the microphone array front direction 24. Also, a virtual straight line extending in the
microphone array front direction 24 through the microphone array reference point 22 is referred to
as a microphone array axis 25. That is, the straight line connecting the center of the arrangement
and the microphone array reference point 22 is the microphone array axis 25. By configuring the
microphone array 11 in this manner, the directivity angle control unit 17 can control the
directivity angle of the input audio information.
[0019]
In the present embodiment, control of the directivity angle is realized by the configuration shown
in FIG. 2, but the method of controlling the directivity angle is not limited to this. For example,
a plurality of microphones with weak directivity may be used as the microphone array 11, and the
directivity angle may be controlled by calculating the differences in arrival time and volume of
the sound reaching each microphone. The arrangement of the microphones illustrated in FIG. 3 is an
arrangement example for realizing the microphone array 11 by controlling the directivity angle
through calculation of the time difference (difference in sound arrival time) of the sound reaching
each microphone 31. The microphone 31 is a microphone with weak directivity or a microphone without
directivity. For a given pointing direction 32 (directivity angle), the voice arrives at each
microphone 31 at a different time. By shifting the voices of the respective microphones 31 by the
difference in arrival time and superposing them, it is possible to emphasize only the voice
component from the pointing direction 32. The point serving as a reference in determining the
pointing direction 32 is the microphone array reference point 22, and the microphone array front
direction 24 and the microphone array axis 25 are defined as in the configuration of FIG. 2. With
the microphone array 11 configured in this way, the directivity angle control unit 17 can control
the directivity angle of the input audio information.
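The shift-and-superpose operation described for the arrangement of FIG. 3 is commonly known as delay-and-sum beamforming. The following is a minimal sketch of it, not the implementation of the present embodiment. It assumes a straight-line microphone layout, a plane wave, delays rounded to whole samples, and a speed of sound of 343 m/s; the function names and parameters are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_x, angle_deg, sample_rate):
    """Per-microphone arrival delays, in whole samples, for a plane wave.

    mic_x: microphone positions in metres along a line (hypothetical layout).
    angle_deg: pointing direction measured from the array front direction,
    positive toward +x (sign convention chosen for this sketch).
    """
    s = math.sin(math.radians(angle_deg))
    # A wave arriving from the +x side reaches microphones with larger x earlier.
    arrival = [-x * s / SPEED_OF_SOUND for x in mic_x]
    earliest = min(arrival)
    return [round((t - earliest) * sample_rate) for t in arrival]


def delay_and_sum(channels, delays):
    """Shift each channel by its arrival delay, then average the channels."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]
```

Sound from the pointing direction adds up coherently after the shift, while sound from other directions is misaligned and is attenuated by the averaging.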
[0020]
The camera 12 is an image input unit that has an imaging element and inputs, by shooting, an
image, or an image together with shooting information about the image (for example, including a
focal distance in the image). Examples of the imaging element include solid-state imaging
devices such as a charge coupled device (CCD) sensor and a complementary metal oxide
semiconductor (CMOS) sensor. The camera 12 includes an optical component such as a lens in
addition to the imaging element, and can be mounted on the voice input device 1 as a camera
module. It is preferable to use a camera with a wide imaging angle of view, such as a camera
equipped with a super wide-angle lens or a fisheye lens, as the camera 12 because it can capture
the voice acquisition target user (hereinafter simply referred to as the “user”) over a wide range.
[0021]
Then, the camera 12 in the present embodiment transmits the photographed image (and
photographing information) to the in-image user coordinate detection unit 13. Note that the
camera 12 does not have to be dedicated to still image shooting, and may be dedicated to moving
image shooting or may be a combination of moving image and still image. The image used by the
in-image user coordinate detection unit 13 is basically a still image, and when a moving image is
used, a frame immediately before the time of voice acquisition may be used.
[0022]
FIG. 4 is an overhead view showing an example of the arrangement relationship between the
camera 12 and the microphone array 11 in the present embodiment. As shown in FIG. 4, the
camera 12 and the microphone array 11 are disposed such that their reference points 41 and 22
are separated by a predetermined distance L. Since the user position changes more in the
horizontal direction than in the vertical direction, the camera 12 and the microphone array 11 are
preferably spaced apart horizontally so that the predetermined distance L lies in the horizontal
direction. This is because, as described later, controlling the pointing direction in the
direction of separation makes it possible to output a desired voice with reduced noise as the
input voice. The camera reference point 41 is a point used as a
reference when calculating the angle (direction) in which the user is present with respect to the
camera 12 within the imaging range 40 of the camera 12. Although an example in which the
camera focus is set to the camera reference point 41 is given in the present embodiment, the
present invention is not limited thereto. Of course, the camera 12 and the microphone array 11
may be arranged apart from each other in the horizontal direction, and the pointing direction
may be controlled accordingly.
[0023]
The camera 12 and the microphone array 11 are arranged such that the camera reference point
41 and the microphone array reference point 22 do not have the same position in the horizontal
direction. This is because, as described above, the change in the user position is larger in the
horizontal direction than in the vertical direction.
[0024]
The positional relationship between the camera 12 and the microphone array 11 is known at the
design and manufacturing stages when both are built into the voice input device 1. On the other
hand, in a structure in which the camera 12 and the microphone array 11 can move independently,
their positional relationship can change, so the positional relationship between the camera 12 and
the microphone array 11 is measured in advance and then fixed so that it does not change after the
measurement. In the present embodiment, the camera optical axis 42 and the microphone array axis 25
are substantially parallel, and the camera photographing direction 43 and the microphone array
front direction 24 are the same direction. The microphone array 11 and the camera 12 are installed
in such a positional relationship that the microphone array reference point 22 lies on a straight
line 44 that passes through the camera reference point 41 and is substantially perpendicular to the
camera optical axis 42. Further, the orientation of the microphone array 11 is set so that the
straight line 44 lies in the same plane in which the sound collection directions 23 (FIG. 2)
spread radially.
[0025]
The in-image user coordinate detection unit 13 detects position information of the user from the
image (input image) captured by the camera 12. The user position can be detected, for example,
by detecting a face area from image information. The two-dimensional coordinates of the user's
face in the image and the size of the user's face in the image are calculated. A commonly used
method can be used to detect the face. For example, there is a method in which an image of the
face of a standard person is stored in advance as reference data, and the face is detected from a
correlation value with the reference data. Here, images of the faces of both a standard man and a
standard woman may be held as reference data, and images of the faces of standard men and women
for each generation may also be held as reference data. A specific user's face may also be held;
such an example will be described later as a third embodiment.
[0026]
The in-image user coordinate detection unit 13 transmits the two-dimensional coordinates of the
user's face in the image to the camera reference user angle calculation unit 14, and transmits the
size of the user's face in the image to the camera reference user distance calculation unit 15.
Detection of the face area is preferable as the process for detecting the position of the user
because the face contains the mouth, which emits the voice. Of course, the mouth area in the face
may be detected instead.
[0027]
When there are a plurality of users in the image, the in-image user coordinate detection unit 13
transmits the two-dimensional coordinates of each user's face in the image to the camera reference
user angle calculation unit 14, and transmits the size of each user's face to the camera reference
user distance calculation unit 15. If a large number of users are detected, the amount of data
increases, so only a specific number of detection results may be transmitted, selected according
to the detected two-dimensional coordinate position of the face or the size of the detected face.
[0028]
The camera reference user angle calculation unit 14 calculates a camera reference user angle
based on the input image. Here, the camera reference user angle is an angle of the user position
of the voice acquisition target user with reference to the camera 12.
[0029]
The camera reference user angle calculation unit 14 in the present embodiment calculates the
camera reference user angle, which is an angle indicating the direction in which the user is
present with respect to the camera reference point 41, from the in-image user face coordinates,
which are the two-dimensional coordinates in the image transmitted from the in-image user
coordinate detection unit 13. The camera reference user angle is calculated by using a perspective
projection camera model.
[0030]
This calculation method will be described with reference to FIG. FIG. 5 is a diagram of a
perspective projection camera model for explaining an example of a camera reference user angle
calculation method in the voice input device 1. This perspective projection camera model is a
model of the camera 12 viewed from a direction perpendicular to the camera optical axis 42 and
the straight line 44. If the focal length of the camera 12, the pixel pitch of the sensor (image
sensor), and the intersection coordinates of the sensor pixel and the optical axis 42 are known,
the position of the projection plane 51 can be calculated. The angle indicated by the direction 53
of the user face coordinates 52 in the image on the projection surface 51 from the camera
reference point 41 is the camera reference user angle θ. Thus, the camera reference user angle
θ can be calculated. The camera reference user angle calculation unit 14 transmits the
calculated camera reference user angle θ to the microphone array reference user angle
calculation unit 16.
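The perspective projection calculation can be sketched as follows. This is an illustrative example under stated assumptions rather than the method of the embodiment: it considers only the horizontal image coordinate, ignores lens distortion, and takes positive angles to the right of the optical axis. The function name and parameters are hypothetical.

```python
import math


def camera_user_angle(u_px, cx_px, pixel_pitch_mm, focal_length_mm):
    """Camera reference user angle (degrees) under a pinhole camera model.

    u_px: horizontal pixel coordinate of the detected face.
    cx_px: horizontal pixel coordinate where the optical axis meets the sensor.
    pixel_pitch_mm: physical size of one sensor pixel.
    focal_length_mm: focal length of the lens.
    """
    # Offset of the face from the optical axis, measured on the sensor.
    offset_mm = (u_px - cx_px) * pixel_pitch_mm
    return math.degrees(math.atan2(offset_mm, focal_length_mm))
```

A face at the image center gives 0 degrees, and the angle grows as the face moves away from the center.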
[0031]
When the camera 12 is a fisheye camera, a projection camera model in which the projection surface
51 is a sphere centered on the camera reference point 41 is used. If the angle of view of the
fisheye camera, the number of pixels of the radius of the image circle, and the intersection
coordinates of the pixel and the optical axis 42 are known, the position of the projection surface
51 can be calculated, so the camera reference user angle θ can be calculated by the same method.
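For the fisheye case, one possible sketch is shown below. It assumes an equidistant fisheye projection, in which the angle is proportional to the pixel distance from the image center; the text does not name a specific fisheye projection, so this is an assumption, and the function name and parameters are hypothetical.

```python
def fisheye_user_angle(u_px, cx_px, image_circle_radius_px, field_of_view_deg):
    """Camera reference user angle (degrees), assuming an equidistant fisheye.

    u_px: horizontal pixel coordinate of the detected face.
    cx_px: horizontal pixel coordinate where the optical axis meets the sensor.
    image_circle_radius_px: radius of the image circle in pixels.
    field_of_view_deg: full angle of view covered by the image circle.
    """
    # The edge of the image circle corresponds to half the angle of view.
    return (u_px - cx_px) / image_circle_radius_px * (field_of_view_deg / 2.0)
```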
[0032]
The camera reference user distance calculation unit 15 calculates the camera reference user
distance based on the input image.
Here, the camera reference user distance is the distance to the user position with reference to the
camera 12.
[0033]
The camera reference user distance calculation unit 15 in the present embodiment calculates the
camera reference user distance from the camera reference point to the user from the size of the
user's face in the image transmitted from the in-image user coordinate detection unit 13. If the
image of a person's face that the in-image user coordinate detection unit 13 uses as reference data
is captured with the camera 12 and the distance at the time of capture is measured, the camera
reference user distance can be calculated by the following equation (1).
D = d · (t / t′) ... (1)
[0034]
In equation (1), D is the camera reference user distance to be calculated, d is the camera
reference user distance at the time the reference data was shot, t is the number of vertical
pixels of the reference data, and t′ is the number of vertical pixels of the extraction range sent
from the in-image user coordinate detection unit 13.
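Equation (1) can be written directly as a small function. This is a minimal sketch; the function and parameter names are hypothetical, and the input check is an addition not described in the text.

```python
def camera_user_distance(ref_distance, ref_face_px, detected_face_px):
    """Camera reference user distance D = d * (t / t') from equation (1).

    ref_distance: distance d measured when the reference face was shot.
    ref_face_px: vertical pixel count t of the reference face.
    detected_face_px: vertical pixel count t' of the detected face.
    """
    if detected_face_px <= 0:
        raise ValueError("detected face height must be positive")
    return ref_distance * (ref_face_px / detected_face_px)
```

For example, a face that appears half as tall as the reference face is estimated to be twice as far away as the reference distance.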
[0035]
Here, since the size of the face differs depending on age and gender, when reference data for
each gender or generation (reference data with different face sizes) is not stored, it is
preferable that the age and gender be estimated from the face detected by the in-image user
coordinate detection unit 13 and used in calculating the camera reference user distance D, because
this reduces the error of the distance calculation.
This can be realized by setting the relationship between face size and distance for a reference
age and increasing or decreasing the distance depending on the estimated age. It can also be
realized by setting the relationship between face size and distance for each age. For example,
when faces of the same size are detected, the distances of a child and an adult differ, and the
child is closer than the adult. Therefore, for example, when the detected face is a child's, the
calculated distance may be adjusted at a predetermined rate. The camera reference user distance
calculation unit 15 transmits the calculated camera reference user distance D to the microphone
array reference user angle calculation unit 16.
[0036]
Further, in this embodiment, the distance D from the camera reference point 41 to the user is
calculated from the size of the user's face in the image, but the method for calculating the
distance from the camera reference point 41 to the user is not limited to this, and various types
of shooting information obtained when the camera 12 acquires the image can also be used. For
example, if the camera 12 has an autofocus function, the focus distance at which the detected face
area is in focus can be used as the distance to the user position. In addition, a TOF (Time Of
Flight) camera using infrared rays or a multi-viewpoint camera may be used as the camera 12 to
obtain parallax information of the user's face area as shooting information, and the distance
from the camera reference point 41 to the user may be calculated from the parallax information.
Furthermore, the various distance calculation methods described above can be combined.
[0037]
The in-image user coordinate detection unit 13 in the present embodiment detects the face area
of the subject (mainly a person, but possibly an animal) in the input image, and the camera
reference user angle calculation unit 14 and the camera reference user distance calculation unit
15 calculate the camera reference user angle θ and the camera reference user distance D,
respectively, with the position of the face area as the user position. However, only one of the
camera reference user angle θ and the camera reference user distance D may be calculated based on
the detection of the face area. Alternatively, either the camera reference user angle θ or the
camera reference user distance D can be calculated by detecting the user position without using
face area detection. This point is the same in the other embodiments.
[0038]
Although the configuration example in which the in-image user coordinate detection unit 13 is
provided separately from the camera reference user angle calculation unit 14 and the camera
reference user distance calculation unit 15 has been described, this configuration is equivalent
to providing the user position specifying function of the in-image user coordinate detection unit
13 in both the camera reference user angle calculation unit 14 and the camera reference user
distance calculation unit 15.
[0039]
The microphone array reference user angle calculation unit 16 calculates the microphone array
reference user angle based on the predetermined distance L, the camera reference user angle θ
transmitted from the camera reference user angle calculation unit 14, and the camera reference
user distance D transmitted from the camera reference user distance calculation unit 15.
Here, the microphone array reference user angle is the angle of the user position with respect to
the microphone array 11, and in this example is the user angle with respect to the microphone
array reference point 22.
[0040]
FIG. 6 is a view for explaining the method of calculating the microphone array reference user
angle, and is a bird's-eye view showing an example in which the user is positioned between the
camera 12 and the microphone array 11. Once the camera reference user angle θ and the
camera reference user distance D are calculated, the position of the user 61 with respect to the
microphone array reference point 22 can be calculated by geometrical calculation. Specifically,
since the positional relationship between the camera reference point 41 and the microphone
array reference point 22, including the predetermined distance L, is set in advance when the voice
input device 1 is designed, the user's microphone array reference user angle α with respect to
the microphone array reference point 22 can be calculated. For example, even when another speaker
62 speaks behind the user as seen from the camera reference point 41 and the camera cannot detect
the other speaker 62, the directivity angle of the microphone array 11 is directed only toward the
user 61, so the voice of the user 61 can be properly input. The
microphone array reference user angle α calculated by the microphone array reference user
angle calculation unit 16 is transmitted to the directivity angle control unit 17.
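The geometrical calculation can be sketched as follows. The text does not give an explicit formula, so this is one possible formulation under stated assumptions: the camera reference point is placed at the origin with the optical axis along +y, the microphone array reference point is at (L, 0) on the line perpendicular to the optical axis, the two axes are parallel, D is the straight-line distance from the camera reference point, and positive angles point toward +x. The function name is hypothetical.

```python
import math


def mic_array_user_angle(theta_deg, distance, baseline):
    """Microphone array reference user angle alpha, in degrees.

    theta_deg: camera reference user angle (theta).
    distance: camera reference user distance (D).
    baseline: predetermined distance (L) between the two reference points.
    """
    theta = math.radians(theta_deg)
    # User position in the camera-centred coordinate system assumed above.
    x = distance * math.sin(theta)
    y = distance * math.cos(theta)
    # Angle of the same position as seen from the microphone array at (L, 0).
    return math.degrees(math.atan2(x - baseline, y))
```

With a user straight ahead of the camera at D = 1 and L = 1, the microphone array sees the user at 45 degrees to its left under this sign convention.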
[0041]
Although FIG. 6 describes the case where the user 61 is located between the camera 12 and the
microphone array 11, the present invention can be similarly applied to the case where the user is
in the same direction as viewed from each reference point. This point will be described with
reference to FIGS. 7 and 8. FIG. 7 is an overhead view showing an example in which the user is
positioned in the same direction as viewed from the camera 12 and the microphone array 11 and near
the microphone array 11, and FIG. 8 is an overhead view showing an example in which the user is
positioned in the same direction as viewed from the camera 12 and the microphone array 11 and near
the camera 12.
[0042]
Whether the user 61 is positioned on the right as viewed from the reference points 22 and 41 as
illustrated in FIG. 7, or on the left as viewed from the reference points 22 and 41 as illustrated
in FIG. 8, the position of the user 61 relative to the microphone array reference point 22 can be
calculated by geometrical calculation from the camera reference user angle θ and the camera
reference user distance D, as in the case of FIG. 6.
[0043]
As described above, the positional relationship between the camera reference point 41 and the
microphone array reference point 22 can be used to calculate the microphone array reference
user angle α of the user 61 based on the microphone array reference point 22.
However, when the camera 12 and the microphone array 11 are arranged such that the camera
reference point 41 and the microphone array reference point 22 are at the same position, and
another speaker 62 is, for example, behind the user 61 at substantially the same angle as viewed
from the camera, directivity angle control cannot separate the voice of the user 61 from the voice
of the other speaker 62 and input them independently. In addition, because the two overlap in the
image captured by the camera 12, they cannot be detected separately either. Therefore, the
camera 12 and the microphone array 11 are disposed in a state where they are separated by the
predetermined distance L in the direction in which the angle calculated by the camera reference
user angle calculation unit 14 changes. As a result, even when another speaker 62 is at
substantially the same angle as the user 61 when viewed from the camera 12, it is possible to
independently input the voice of each speaker. The predetermined distance L may be any value.
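
The geometrical calculation of the angle α from θ, D and L can be sketched as follows. This is
only an illustration under an assumed geometry that the text does not state: the camera reference
point 41 is placed at the origin, the microphone array reference point 22 at (L, 0), both reference
axes point along +y, and angles are positive toward +x. The function and parameter names are
hypothetical.

```python
import math


def mic_array_user_angle(theta_deg, distance, baseline):
    """Return the user angle alpha (degrees) as seen from the microphone array.

    Assumed geometry (not taken from the source): camera reference point at
    the origin, microphone array reference point at (baseline, 0), both
    reference axes along +y, angles positive toward +x.
    """
    theta = math.radians(theta_deg)
    x = distance * math.sin(theta)  # user position relative to the camera
    y = distance * math.cos(theta)
    return math.degrees(math.atan2(x - baseline, y))


# Two speakers at the same camera angle but at different distances.
near = mic_array_user_angle(10.0, 1.0, 0.3)
far = mic_array_user_angle(10.0, 3.0, 0.3)
```

Under these assumptions, `near` and `far` differ whenever the baseline is non-zero, whereas with a
baseline of zero both reduce to the camera angle θ, which is the situation the paragraph above
describes as inseparable.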
[0044]
The directivity angle control unit 17 controls the directivity angle of the microphone array 11
based on the image input by the camera 12. As a main feature of the present invention, the
directivity angle control unit 17 controls the directivity angle using the microphone array
reference user angle α described above. That is, the directivity angle control unit 17 directs the
directivity angle of the microphone array 11 toward the microphone array reference user angle α
transmitted from the microphone array reference user angle calculation unit 16, and outputs the
sound transmitted from the microphone array 11. In FIGS. 6 to 8, for convenience, the angles θ
and α are represented as angles formed with the axes 42 and 25 regardless of whether they are
positive or negative; which position is taken as 0 degrees and which direction is taken as positive
may be decided arbitrarily.
[0045]
In the present embodiment, such control can be realized by selecting the voice corresponding to
the microphone array reference user angle α from the voice information of the superdirective
microphones 21 (or the microphones 31 of FIG. 3) sent from the microphone array 11. Here,
when the microphone array 11 uses microphones with weak directivity, voice from a specific
angle can be acquired as voice information by taking into account the difference in arrival time
of the voice.
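
One common way of using arrival-time differences is delay-and-sum beamforming. The text does
not name a specific technique, so the following is a minimal sketch of that substitute, assuming a
linear array along the x axis, a far-field source, and an angle α measured from the array's
broadside axis. All names and the speed-of-sound constant are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(mic_positions_m, alpha_deg):
    """Per-microphone delays (seconds) that align a plane wave arriving from alpha.

    Assumes a linear array along x, a far-field source, and alpha measured
    from broadside (positive toward +x).
    """
    s = math.sin(math.radians(alpha_deg))
    # A wave from +alpha reaches microphones at larger x earlier.
    arrival = [-x * s / SPEED_OF_SOUND for x in mic_positions_m]
    latest = max(arrival)
    # Delay every channel so that it lines up with the latest arrival.
    return [latest - a for a in arrival]


def delay_and_sum(channels, delays_s, sample_rate):
    """Average the channels after delaying each one (rounded to whole samples)."""
    shifts = [round(d * sample_rate) for d in delays_s]
    length = min(len(ch) for ch in channels)
    out = []
    for n in range(length):
        acc = 0.0
        for ch, k in zip(channels, shifts):
            if 0 <= n - k < length:
                acc += ch[n - k]
        out.append(acc / len(channels))
    return out
```

Sound from the steered angle adds coherently after alignment, while sound from other angles is
averaged with mismatched timing and is therefore attenuated.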
[0046]
Further, in the present embodiment, although the angle θ and the angle α have been described
as single values, the calculated values contain error and have a width, so a predetermined
allowable range can be set for each angle. For example, the allowable range of the calculated
angle can be set in advance as a fixed range such as ±5°. The allowable range does not have to be
set in advance as a fixed angular range, and such an example will be described later as a second
embodiment.
[0047]
As described above, the camera 12 and the microphone array 11 are disposed in a state of being
separated by the predetermined distance L in the plane where the camera reference user angle θ
changes, and the directivity angle of the microphone array 11 is directed to the position at the
target distance (the above distance D) from the camera 12. Consequently, when another user
(another speaker) 62 is at substantially the same angle as the user 61 when viewed from the
camera 12, the voice of each user can be input independently. By separating the voice of the user
who is the voice acquisition target from the voice of the other user and inputting only the former,
false recognition of the user's voice can be reduced. Note that by applying a moving image as the
input image, the directivity angle is switched successively to the direction that matches the
position of the user, so that misrecognition can be reduced even if the user moves.
[0048]
As described above, according to the present invention, in a voice input device capable of
controlling the directivity angle, when there is a sound source unintended by the recording
person (the user operating the device, who is basically different from the above-mentioned
speaker), such as a person speaking at a different distance in the same direction as the speaker
(user) who is the voice acquisition target, the distance is determined and only the voice of the
user who is at the target distance is input. False recognition of the user's voice can thereby be
reduced, and the user's speech can be acquired with low noise.
[0049]
Second Embodiment
The second embodiment of the present invention will be specifically described below with
reference to FIGS. 9 to 11.
FIG. 9 is a view showing a configuration example of a microphone array in the voice input device
according to the present embodiment. FIGS. 10 and 11 are diagrams for explaining an example of
a method of calculating the microphone array reference user angle range in the voice input
device according to the present embodiment. FIG. 10 shows an example where the calculation
error range is large, and FIG. 11 shows an example where the calculation error range is small.
[0050]
A block diagram showing a configuration example of the voice input device in the present
embodiment is the same as FIG. 1 shown in the first embodiment, and the detailed description of
each part is omitted. The operations of the microphone array 11 and the camera 12 are also the
same as those described in the first embodiment.
[0051]
The directivity angle control unit 17 in the present embodiment is configured to be able to
control the directivity range (the range indicated by directivity angles) of the microphone array
11 of FIG. 1, and the microphone array 11 consists of a plurality of microphones that allow such
control. In this embodiment, it is desirable to use a plurality of microphones 31 with weak or no
directivity as the microphone array 11, as exemplified in FIG. 9, and such an example will be
described. However, a plurality of microphones with strong directivity can also be used.
[0052]
The directivity angle control unit 17 emphasizes the voice in each directivity direction
(directivity angle) contained in the directivity range (directivity area) 91 from which voice is to
be acquired, in the same manner as in the first embodiment, and suppresses voice from other
angles. Here, as in the example of FIG. 3, the directivity range is controlled by using the
difference in voice arrival time at the plurality of microphones 31. Thus, the directivity angle
control unit 17 can control the directivity range of the microphone array 11. Hereinafter, this
control will be specifically described. Even with a plurality of microphones with strong
directivity, such control is possible by driving the one or more microphones that are directed to
the directivity range indicated by the directivity angles.
[0053]
The camera 12 transmits the captured image (and the shooting information) to the in-image user
coordinate detection unit 13, as in the first embodiment. Here, for example, when a fisheye
camera is used as the camera 12, the user's face can be detected over a wide range because the
imaging angle of view is wide, but since the image is circular, the resolution differs between the
image center and the image periphery. Consequently, the resolution of the image captured by the
camera 12 differs between when the camera reference user angle (angle θ in FIG. 6) with respect
to the camera optical axis is small and when it is large, and the angle calculation accuracy differs
accordingly. Therefore, in the present embodiment, the camera reference user angle calculation
unit 14 performs an operation that takes the angle calculation error range into account so that
the desired effect can be appropriately obtained even if the calculation accuracy of the angles
differs.
[0054]
The camera reference user angle calculation unit 14 can reduce the operation failure due to the
angle calculation error by setting the allowable range when calculating the camera reference
user angle θ from the user face coordinates in the image. Therefore, when calculating the
camera reference user angle θ from the in-image user face coordinates, the camera reference
user angle calculation unit 14 also calculates an angle calculation error range at the calculated
angle θ.
[0055]
As the angle calculation error range, for example, the camera reference user angles θa and θb
(see FIG. 10) obtained when the user face coordinates in the image of the camera 12 deviate by
one pixel to the left and to the right are calculated, and the range bounded by the camera
reference user angles θa and θb is taken as an angle calculation error range 101. At this time,
when the resolution of the captured image is high and the angle calculation error is small at a
certain camera reference user angle, the change in angle when the user's face coordinates in the
image deviate by one pixel is small. On the other hand, when the resolution of the photographed
image is low and the angle calculation error is large, the angle change when the user's face
coordinates in the image deviate by one pixel becomes large. That is, the angle calculation error
range differs depending on the calculated camera reference user angle θ.
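
The one-pixel rule above can be sketched as follows. The pixel-to-angle mapping depends on the
lens and is not given in this section, so the sketch takes it as a parameter. The pinhole mapping
shown is only a hypothetical stand-in, and a fisheye lens would need its own mapping, which also
determines how the width of the range varies with θ.

```python
import math


def pinhole_angle_deg(u, center_u, focal_px):
    """Hypothetical pixel-to-angle mapping for an ideal pinhole camera."""
    return math.degrees(math.atan2(u - center_u, focal_px))


def angle_error_range(u, pixel_to_angle, shift_px=1):
    """Return (theta_a, theta, theta_b) for a face at horizontal pixel u.

    theta_a and theta_b are the angles obtained when the face coordinate
    deviates by shift_px pixels to the left and to the right.
    """
    theta_a = pixel_to_angle(u - shift_px)
    theta = pixel_to_angle(u)
    theta_b = pixel_to_angle(u + shift_px)
    return theta_a, theta, theta_b


# Assumed example values: image centre at pixel 320, focal length 500 pixels.
to_angle = lambda u: pinhole_angle_deg(u, center_u=320.0, focal_px=500.0)
theta_a, theta, theta_b = angle_error_range(400, to_angle)
```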
[0056]
For comparison, consider the case described in the first embodiment, in which the allowable
range of the error is set as a constant error range regardless of the calculated camera reference
user angle θ. With such a setting, if the angle calculation error range of the high-resolution area
at the center of the input image is used as the reference, the angle calculation error becomes
large in the peripheral area (edge side) of the input image, where the resolution is low, and the
desired voice may be excluded. On the other hand, if the angle calculation error range of the
low-resolution region at the periphery of the input image is used as the reference, an excessive
tolerance is provided in the region at the image center despite its high resolution and small angle
calculation error, and the desired voice may contain noise.
[0057]
In order to prevent such a situation, in the present embodiment, it is assumed that the angle
calculation error range is different depending on the calculated camera reference user angle θ,
and the allowable range of the error is set depending on the angle θ. The angle calculation error
range is small at the center of the image, and the angle calculation error range is large in the
peripheral region of the input image. As a result, noise reduction and desired voice acquisition
can be appropriately performed.
[0058]
Here, although the angle calculation error range has been described as a range in the case where
one pixel is shifted to the left and right, the angle calculation error range can be appropriately set
according to the shooting conditions such as the camera resolution. Although the angle
calculation error range can be calculated each time from the calculated angle θ, a LUT (Look Up
Table) in which the angle calculation error range is defined for each calculated angle θ may be
held.
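
A look-up table of this kind could be held as a sorted list of angle bins. The sketch below is only
one possible layout. The bin boundaries and half-widths are made-up illustrative values, not data
from the source, and the names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical table: (upper bound of |theta| in degrees, error half-width in degrees).
# The values are illustrative only and do not come from the source.
ANGLE_ERROR_LUT = [(15.0, 0.5), (30.0, 1.0), (45.0, 2.0), (90.0, 4.0)]


def angle_error_half_width(theta_deg):
    """Look up the allowable half-width for the calculated angle theta."""
    bounds = [bound for bound, _ in ANGLE_ERROR_LUT]
    index = bisect_left(bounds, abs(theta_deg))
    index = min(index, len(ANGLE_ERROR_LUT) - 1)  # clamp angles beyond the last bin
    return ANGLE_ERROR_LUT[index][1]
```

Looking up a precomputed range avoids repeating the per-pixel calculation for every frame, at
the cost of quantizing the error range to the bin boundaries.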
[0059]
The camera reference user angle calculation unit 14 in the present embodiment transmits the
calculated camera reference user angle θ and the angle calculation error range 101 to the
microphone array reference user angle calculation unit 16.
[0060]
The camera reference user distance calculation unit 15 can reduce malfunction due to a distance
calculation error by setting an allowable range when calculating the camera reference user
distance D from the size of the user's face in the image transmitted from the in-image user
coordinate detection unit 13.
Therefore, when calculating the camera reference user distance D from the size of the user's face
in the image, the camera reference user distance calculation unit 15 also calculates a distance
calculation error range.
[0061]
In this embodiment, for example, distances Da and Db (see FIG. 10) are calculated for the cases
where the size of the user's face in the image is increased by one pixel on each of the left and
right sides and where it is decreased by one pixel on each of the left and right sides, and a range
having the calculated distances Da and Db as its boundaries is taken as a distance calculation
error range 102. At this time, when the camera reference user distance D is short, the change in
the user distance when the size of the user face in the image changes by one pixel is small. On the
other hand, when the distance D is long, the change in the user distance is large when the size of
the user face in the image changes by one pixel. That is, the distance calculation error differs
depending on the calculated camera reference user distance D.
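
The dependence of the distance error on D can be illustrated numerically. Equation (1) is not
reproduced in this section, so the sketch below assumes the inverse-proportional form
D = d * t / t', where d is a reference distance, t the face size in pixels at that distance and t' the
detected face size. That form, the function names and the `delta_px` parameter are assumptions
made for illustration.

```python
def camera_distance(d_ref, t_ref, t_detected):
    """Assumed form of equation (1): D = d * t / t'."""
    return d_ref * t_ref / t_detected


def distance_error_range(d_ref, t_ref, t_detected, delta_px=1):
    """Return (Da, D, Db) for a face that is delta_px larger, as detected, and delta_px smaller."""
    if t_detected <= delta_px:
        raise ValueError("detected face is too small for the given delta_px")
    d_a = camera_distance(d_ref, t_ref, t_detected + delta_px)
    d = camera_distance(d_ref, t_ref, t_detected)
    d_b = camera_distance(d_ref, t_ref, t_detected - delta_px)
    return d_a, d, d_b


# Assumed reference: a face 100 pixels tall at 1.0 m.
near = distance_error_range(1.0, 100, 100)  # user at about 1 m
far = distance_error_range(1.0, 100, 50)    # user at about 2 m
```

Under this assumption the width `far[2] - far[0]` is larger than `near[2] - near[0]`, which matches
the statement that the error range grows with the distance D.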
[0062]
For comparison, consider the case where the allowable range of error is set on the assumption
that the distance calculation error range is constant regardless of the calculated camera
reference user distance D. With such a setting, if the distance calculation error range for a short
distance D is used as the reference, the distance calculation error may become large when the
distance D is long, and a desired user voice may be excluded. On the other hand, if the distance
calculation error range for a long distance D is used as the reference, an excessive allowable
range is provided when the distance D is short despite the small distance calculation error, and
noise may be included in the desired user voice.
[0063]
Further, the distance calculation error range 102 also changes depending on the camera
reference user angle θ at the time of calculation. For example, in the case of using a camera
equipped with a fisheye lens as the camera 12, if the camera reference user angle θ increases,
the resolution of the face area decreases. When the resolution of the captured image is high and
the distance calculation error is small, the change in user distance is small when the size of the
user face in the image is shifted by one pixel. On the other hand, when the resolution of the
captured image is low and the distance calculation error is large, the change in the user distance
becomes large when the size of the user face in the image is shifted by one pixel. That is, the
distance calculation error differs depending on the calculated camera reference user angle θ.
[0064]
For comparison, consider the case where the allowable range of error is set on the assumption
that the distance calculation error range is constant regardless of the calculated camera
reference user angle θ. With such a setting, if the distance calculation error range of the
high-resolution region at the center of the input image (corresponding to the region with a small
angle θ) is used as the reference, the distance calculation error increases in the peripheral region
of the input image, where the resolution is low, and the desired user voice may be excluded. On
the other hand, if the distance calculation error range of the low-resolution region at the
periphery of the input image (corresponding to the region with a large angle θ) is used as the
reference, an excessive tolerance is provided in the region at the center of the image, where the
resolution is high and the distance calculation error is small, and the desired user speech may
contain noise.
[0065]
In order to prevent such a situation, in this embodiment, the distance calculation error range is
treated as differing depending on the camera reference user distance D and the camera reference
user angle θ, and the tolerance of the error is set depending on the distance D and the angle θ.
The distances Da and Db are calculated for the cases where the size of the user's face in the
image is increased by one pixel on each of the left and right sides and where it is decreased by
one pixel on each of the left and right sides, and the distance calculation error range 102 is set
with the calculated distances Da and Db as its boundaries. As a result, the distance calculation
error range 102 is small at the center of the input image (corresponding to the region where the
angle θ is small) and large in the peripheral region of the input image (corresponding to the edge
side and the region where the angle θ is large). Likewise, the distance calculation error range
102 is small when the distance D is short and large when the distance D is long. This is
preferable because noise reduction and desired voice acquisition can be performed
appropriately.
[0066]
Here, the distance calculation error range has been described as a range in the case where the
size of the user's face in the image changes by one pixel on the left and right, but it can be set
appropriately according to shooting conditions such as camera resolution. Also, although it is
possible to calculate the distance calculation error range each time from the angle θ at the time
of calculation and the calculated distance D, a LUT in which the distance calculation error range
is defined for each calculated angle θ and calculated distance D may be held instead.
[0067]
Further, in the present embodiment, the distance calculation error range is treated as differing
depending on the distance D and the angle θ, and the allowable range of the error is set
depending on the distance D and the angle θ. However, the distance calculation error range may
be treated as depending on only one of the distance D and the angle θ, and the allowable range of
the error may be set depending on the distance D or the angle θ alone. Also in this case, a LUT in
which the distance calculation error range is defined for each calculated distance D, or a LUT in
which the distance calculation error range is defined for each angle θ at the time of calculation,
may be held, and the distance calculation error range may be obtained by referring to the LUT.
[0068]
The camera reference user distance calculation unit 15 of the present embodiment transmits the
calculated camera reference user distance D, which is the distance from the camera reference
point 41 to the user, and the distance calculation error range 102 to the microphone array
reference user angle calculation unit 16.
[0069]
The microphone array reference user angle calculation unit 16 calculates the microphone array
reference user angle range, which is the range of the user angle referenced to the microphone
array reference point 22, based on the predetermined distance L, the camera reference user
angle θ, the angle calculation error range 101, the camera reference user distance D, and the
distance calculation error range 102.
[0070]
A method of calculating the microphone array reference user angle range will be described with
reference to FIG. 10 and FIG. 11.
The microphone array reference user angle calculation unit 16 calculates, by geometrical
calculation, a range 103 in which the user 61 can be estimated to be present, from the
predetermined distance L, the camera reference user angle θ and the angle calculation error
range 101 transmitted from the camera reference user angle calculation unit 14, and the camera
reference user distance D and the distance calculation error range 102 transmitted from the
camera reference user distance calculation unit 15.
Here, the range 103 is the common part of the angle calculation error range 101 and the
distance calculation error range 102.
[0071]
Specifically, since the positional relationship between the camera reference point 41 and the
microphone array reference point 22 is set in advance, including the predetermined distance L,
when designing the voice input device 1, the microphone array reference user angle range 104
referenced to the microphone array reference point 22 can be calculated so as to include the
range 103 in which the user 61 can be estimated to be present. At this time, a range 103 larger
than the face of the user 61 may be detected, as shown in FIG. 10, or a range 103 smaller than
the face of the user 61 may be detected, as shown in FIG. 11. However, since the range 103 is
calculated with an appropriate error based on the distance D and the angle θ, this range 103 can
be said to be an appropriate range in view of the error, and the microphone array reference user
angle range 104 obtained from the range 103 is also an appropriate range.
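
One way to obtain an angle range that includes the range 103 is to sample the region bounded by
θa, θb, Da and Db and take the extreme angles seen from the microphone array. The sketch below
is a sampled approximation under an assumed geometry that is not stated in the source: camera
reference point at the origin, microphone array reference point at (L, 0), both axes along +y. The
function names are hypothetical.

```python
import math


def mic_angle_deg(theta_deg, distance, baseline):
    """Angle seen from the microphone array under the assumed geometry."""
    theta = math.radians(theta_deg)
    return math.degrees(math.atan2(distance * math.sin(theta) - baseline,
                                   distance * math.cos(theta)))


def mic_angle_range(theta_a, theta_b, d_a, d_b, baseline, steps=16):
    """Approximate (alpha_min, alpha_max) over the region theta_a..theta_b, d_a..d_b."""
    alphas = []
    for i in range(steps + 1):
        theta = theta_a + (theta_b - theta_a) * i / steps
        for j in range(steps + 1):
            dist = d_a + (d_b - d_a) * j / steps
            alphas.append(mic_angle_deg(theta, dist, baseline))
    return min(alphas), max(alphas)


# Assumed example: theta in 9..11 degrees, D in 0.9..1.1 m, baseline 0.3 m.
alpha_min, alpha_max = mic_angle_range(9.0, 11.0, 0.9, 1.1, 0.3)
```

Sampling includes the boundary of the region, where the extreme angles occur, so a finer grid
tightens the approximation. A closed-form bound could replace it if the geometry is fixed.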
[0072]
The microphone array reference user angle range 104 calculated by the microphone array
reference user angle calculation unit 16 is transmitted to the directivity angle control unit 17. At
this time, the microphone array reference user angle (the angle of the central value of the range
104) may be simultaneously transmitted.
[0073]
The directivity angle control unit 17 in the present embodiment controls the directivity range 91
of the microphone array 11 so that the voice of the microphone array reference user angle range
104 transmitted from the microphone array reference user angle calculation unit 16 can be
acquired. That is, the directivity angle control unit 17 in the present embodiment controls the
directivity range 91 which is the directivity angle of the microphone array 11 to be the
microphone array reference user angle range 104. Thus, the directivity range of the microphone
array 11 can be appropriately controlled in accordance with the microphone array reference
user angle range 104.
[0074]
According to the voice input device 1 of the present embodiment, an appropriate microphone
array reference user angle range 104 can be calculated according to the resolution for each
camera reference user angle θ of the camera 12 and the resolution for each camera reference
user distance D, and the directivity range of the microphone array can be set appropriately. In
addition, by transmitting the microphone array reference user angle (the angle of the central
value of the range 104), or by determining it as the central angle of the microphone array
reference user angle range 104, it is also possible to perform control such as raising the voice
input level in the directivity direction indicated by the microphone array reference user angle
(the above-mentioned angle α) and lowering the voice input level toward the ends of the range
104.
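
The level control described above could take many shapes. The text does not specify one, so the
following sketch assumes a simple linear taper from full gain at the central angle to a reduced
gain at the ends of the range 104, and zero outside it. The function name and the `edge_gain`
parameter are hypothetical.

```python
def input_gain(angle_deg, range_min_deg, range_max_deg, edge_gain=0.2):
    """Gain 1.0 at the centre of the range, falling linearly to edge_gain at its ends.

    Angles outside the range receive a gain of 0.0.
    """
    if not range_min_deg <= angle_deg <= range_max_deg:
        return 0.0
    center = (range_min_deg + range_max_deg) / 2.0
    half_width = (range_max_deg - range_min_deg) / 2.0
    if half_width == 0.0:
        return 1.0
    offset = abs(angle_deg - center) / half_width  # 0 at the centre, 1 at the ends
    return 1.0 - (1.0 - edge_gain) * offset
```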
[0075]
In the above, an example was given in which the microphone array reference user angle range
104 is set as a result of calculating both the angle calculation error range 101 and the distance
calculation error range 102. However, such setting can also be implemented using only the angle
calculation error range 101 or only the distance calculation error range 102, and in either case
the effect of realizing noise reduction and desired voice acquisition can be obtained. When such
setting is performed using only the angle calculation error range 101, the microphone array
reference user angle range 104 is set so as to enclose the arc defined by the distance D and the
angle calculation error range 101. When such setting is performed using only the distance
calculation error range 102, the microphone array reference user angle range 104 is set so as to
enclose the line segment defined by the angle θ and the distance calculation error range 102.
[0076]
That is, the microphone array reference user angle calculation unit 16 may calculate the
microphone array reference user angle range based on the predetermined distance, the camera
reference user distance, the camera reference user angle, and the angle calculation error range.
Alternatively, the microphone array reference user angle calculation unit 16 may calculate the
microphone array reference user angle range based on the predetermined distance, the camera
reference user angle, the camera reference user distance, and the distance calculation error
range.
[0077]
Also, the above example of calculating the angle calculation error range and the distance
calculation error range is useful even when the user position is detected by a method other than
the detection of the face area, because the edge side of the input image is often out of focus
compared to the center portion, so that the calculation errors of the camera reference user angle
and the camera reference user distance can be said to often become large there.
[0078]
As described above, in the present embodiment, the microphone array reference user angle
range is calculated according to the camera reference user angle and / or the camera reference
user distance, more specifically, according to the resolution for each camera reference user angle
and / or the resolution for each camera reference user distance, whereby the directivity range of
the microphone array can be appropriately controlled.
[0079]
Third Embodiment
The third embodiment of the present invention will be specifically described below.
A block diagram showing a configuration example of the voice input device in the present
embodiment is the same as FIG. 1 shown in the first embodiment, and the detailed description of
each part is omitted.
The operations of the microphone array 11 and the camera 12 are also the same as those
described in the first embodiment. However, the features of the present embodiment can also be
applied to the second embodiment.
[0080]
The in-image user coordinate detection unit 13 in the present embodiment identifies an
individual user from the image captured by the camera 12 and detects the position information
of the user. The identification of the user can be realized, for example, by performing template
matching using, as template images, images obtained by changing in several steps the
enlargement ratio of a previously stored photograph of the user's face (hereinafter, user face).
[0081]
More specifically, a range with the same numbers of vertical and horizontal pixels as the template
image is extracted as a window image from the image in which the user face is to be detected,
and the sum of the pixel value differences between all the window image pixels and the
corresponding template image pixels is calculated. The difference is calculated for every window
image that can be extracted from the image in which the user face is to be detected, and the
extraction range of the window image with the smallest difference is searched for. If that
smallest difference is less than or equal to a threshold, the two-dimensional coordinates of the
extraction range within the image are taken as the two-dimensional coordinates of the user's
face, and, for example, the number of vertical pixels of the extraction range is taken as the size of
the user's face. When there is no extraction range in which the difference is the smallest and is
less than or equal to the threshold, it is determined that the user's face does not exist in the
image.
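
The window search described above can be sketched as follows for a single template. The sketch
assumes that the difference is the sum of absolute pixel differences and that images are plain 2-D
lists of grey levels. These choices and the function name are assumptions, not details confirmed
by the text.

```python
def find_face(image, template, threshold):
    """Return (row, col, difference) of the best window, or None if above threshold.

    image and template are 2-D lists of grey-level pixel values; the
    difference is the sum of absolute pixel differences (an assumption).
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            diff = sum(abs(image[r + i][c + j] - template[i][j])
                       for i in range(th) for j in range(tw))
            if best is None or diff < best[2]:
                best = (r, c, diff)
    if best is None or best[2] > threshold:
        return None  # no window is close enough: no face in the image
    return best
```

Running the same search once per scaled template and keeping the overall best match would give
both the face position and, through the template height, the face size.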
[0082]
The in-image user coordinate detection unit 13 transmits the in-image user face coordinates and
the user identification information (such as an ID or a name indicating which user it is) to the
camera reference user angle calculation unit 14, and transmits the size of each user's face in the
image and the user identification information to the camera reference user distance calculation
unit 15.
[0083]
The camera reference user angle calculation unit 14 calculates the camera reference user angle
from the in-image face coordinates of the identified user, and associates the camera reference
user angle with the user identification information indicating which user it belongs to.
The calculation of the camera reference user angle can be realized by the same method as that of
the first embodiment. The camera reference user angle calculation unit 14 transmits the camera
reference user angle, with the user identification information added, to the microphone array
reference user angle calculation unit 16.
[0084]
The camera reference user distance calculation unit 15 calculates the camera reference user
distance from the camera reference point to each user from the size of the user's face in the
image transmitted from the in-image user coordinate detection unit 13, and associates the
camera reference user distance with the user identification information indicating which user it
belongs to. The camera reference user distance can be calculated by equation (1), as in the first
embodiment, by capturing with the camera 12 the image of the user face to be used as a template
image by the in-image user coordinate detection unit 13 and measuring the distance at the time
of capture.
[0085]
The variables in equation (1) in this embodiment are as follows: D is the camera reference user
distance to be calculated, d is the camera reference user distance at the time of capturing the
template image, t is the number of vertical pixels of the template image, and t' is the number of
vertical pixels of the extraction range sent from the in-image user coordinate detection unit 13.
Here, since the size of the face differs depending on the user, calculating the camera reference
user distance using the distance at the time of capturing the template image of the user
identified by the in-image user coordinate detection unit 13 yields a camera reference user
distance that reflects the size of that user's face, and the error of the distance calculation can be
reduced. The camera reference user distance calculation unit 15 transmits the camera reference
user distance, with the user identification information added, to the microphone array reference
user angle calculation unit 16.
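
The per-user calculation can be illustrated with a small table of reference data. Equation (1) is
not reproduced in this section, so the sketch assumes the inverse-proportional form
D = d * t / t' suggested by the variable definitions above. The user IDs and reference values are
made up for illustration.

```python
# Hypothetical per-user reference data: capture distance d (metres) and
# template height t (pixels). The values are illustrative only.
TEMPLATES = {
    "user_a": {"d": 1.0, "t": 120},
    "user_b": {"d": 1.0, "t": 100},
}


def user_distance(user_id, t_detected):
    """Assumed form of equation (1): D = d * t / t', using the user's own template."""
    ref = TEMPLATES[user_id]
    return ref["d"] * ref["t"] / t_detected


# The same detected face height maps to different distances for different users.
distance_a = user_distance("user_a", 60)
distance_b = user_distance("user_b", 60)
```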
[0086]
The microphone array reference user angle calculation unit 16 is based on the predetermined
distance L, the camera reference user angle θ transmitted from the camera reference user angle
calculation unit 14, and the camera reference user distance D transmitted from the camera
reference user distance calculation unit 15. Then, a microphone array reference user angle which
is a user angle based on the microphone array reference point 22 is calculated. The microphone
array reference user angle can be calculated in the same manner as in the first embodiment from
the camera reference user angle θ and the camera reference user distance D to which the same
user specifying information is added in addition to the predetermined distance L.
[0087]
The microphone array reference user angle calculation unit 16 adds the user identification
information transmitted from the camera reference user angle calculation unit 14 and the
camera reference user distance calculation unit 15 to the microphone array reference user angle
α. The microphone array reference user angle calculation unit 16 transmits the microphone
array reference user angle α, with the user identification information added, to the directivity
angle control unit 17.
[0088]
The directivity angle control unit 17 controls the directivity angle of the microphone array 11
toward the microphone array reference user angle α transmitted from the microphone array
reference user angle calculation unit 16, and outputs the sound transmitted from the microphone
array 11. As a result, distance calculation can be performed in consideration of the actual size of
the user's face, and directivity angle control in the directivity angle control unit 17 can be
performed with high accuracy.
[0089]
In the present embodiment, the user's face is photographed in advance and the face photographing
data is associated with the distance to form the reference data. Alternatively, a face attribute
detected by the in-image user coordinate detection unit 13 may be used as the user specifying
information to correct the camera reference user distance. The face attribute may be set by
gender, age, and the like; for example, the assumed face size may be set larger for males than for
females, and larger for adults than for children. Conventional methods can be used to determine
each attribute, and the attribute can be estimated from, for example, similarity to reference data.
By calculating the user distance in consideration of the face attribute in this way, the directivity
angle control in the directivity angle control unit 17 can be performed accurately.
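As a rough illustration of how a face attribute could feed into the distance estimate, the sketch below uses a simple pinhole-camera approximation. The pinhole model, the function name, and every face-width value are assumptions introduced for illustration; the source only states that the assumed face size may differ by attribute.

```python
# Illustrative sketch only: the pinhole-camera model and every number below
# are assumptions, not values taken from the source.

# Hypothetical real-world face widths in metres, keyed by face attribute.
ASSUMED_FACE_WIDTH_M = {
    "adult_male": 0.16,
    "adult_female": 0.15,
    "child": 0.13,
}


def estimate_camera_reference_distance(face_width_px, focal_length_px, attribute):
    """Estimate the camera reference user distance D in metres.

    Pinhole approximation: D = focal_length_px * real_width_m / face_width_px,
    where real_width_m is chosen from the detected face attribute.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    real_width_m = ASSUMED_FACE_WIDTH_M[attribute]
    return focal_length_px * real_width_m / face_width_px


if __name__ == "__main__":
    # The same apparent face width maps to different distances per attribute.
    for attribute in ASSUMED_FACE_WIDTH_M:
        d = estimate_camera_reference_distance(100.0, 1000.0, attribute)
        print(f"{attribute}: {d:.2f} m")
```

With a fixed apparent face width, a smaller assumed real face width yields a shorter estimated distance, which is why ignoring the attribute would bias the distance for, say, a child.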
[0090]
Fourth Embodiment
The fourth embodiment of the present invention will be specifically described below with
reference to FIG. 12. FIG. 12 is a block diagram showing an example of the
configuration of the speech recognition apparatus according to the fourth embodiment of the
present invention. As shown in FIG. 12, the speech recognition apparatus in the present
embodiment includes the speech input apparatus 1 described in the third embodiment and the
speech recognition unit 121. The present embodiment can, however, also be applied to the speech
input apparatus of either the first or the second embodiment.
[0091]
The speech recognition unit 121 analyzes the speech input result transmitted from the speech
input device 1 and outputs the analysis result. A conventional analysis method may be used; for
example, speech may be analyzed and recognized by comparison with reference data. The output
result is converted into text data and used, for example, to switch an electric device on and off.
Because the speech recognition apparatus according to the present embodiment includes the speech
input apparatus 1, which can acquire the desired speech with low noise, it can recognize speech
with high accuracy.
[0092]
Furthermore, the directivity angle control unit 17 in the present embodiment may attach the user
identification information associated with the microphone array reference user angle α used for
directivity angle control to the sound transmitted from the microphone array 11, and output the
sound together with that information. This makes it possible to use, in speech recognition, a
recognition method with high accuracy for the particular user, which is preferable because false
recognition can be reduced. For example, the reference data for speech recognition is updated
from the user identification information and the speech input results, and speech recognition is
performed using the reference data for each user. As a result, speech recognition that takes
user-specific speech features into account can be realized. General sound parameters, such as
intonation and frequency, can be used as the feature amounts. Thus, by adding the user
identification information to the voice input, the reference data for voice recognition can be
changed according to the user, and false recognition can be reduced.
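The per-user reference data described above can be pictured as a small store keyed by the user identification information. The following is a minimal sketch under stated assumptions: the class name, the use of a running mean, and the feature vector layout are all hypothetical and are not specified in the source.

```python
# Illustrative sketch only: the running-mean update rule and all names here
# are assumptions; the source does not specify how reference data is updated.


class PerUserReferenceData:
    """Keeps one running-mean feature vector per user identifier."""

    def __init__(self):
        self._mean = {}   # user_id -> list of mean feature values
        self._count = {}  # user_id -> number of utterances seen

    def update(self, user_id, features):
        """Fold a new feature vector (e.g. pitch, intonation) into the mean."""
        features = [float(f) for f in features]
        n = self._count.get(user_id, 0)
        if n == 0:
            self._mean[user_id] = features
        else:
            old = self._mean[user_id]
            if len(old) != len(features):
                raise ValueError("feature vector length changed")
            self._mean[user_id] = [
                (m * n + f) / (n + 1) for m, f in zip(old, features)
            ]
        self._count[user_id] = n + 1

    def reference_for(self, user_id):
        """Return the stored reference vector, or None for an unknown user."""
        return self._mean.get(user_id)


if __name__ == "__main__":
    store = PerUserReferenceData()
    store.update("user_a", [100.0, 2.0])
    store.update("user_a", [200.0, 4.0])
    print(store.reference_for("user_a"))
    print(store.reference_for("user_b"))
```

A recognizer could look up `reference_for(user_id)` using the identifier attached to each sound segment and fall back to generic reference data when the result is `None`.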
[0093]
Fifth Embodiment
The fifth embodiment of the present invention will be specifically described below with reference
to FIG. 13. FIG. 13 is a block diagram showing an example of the configuration
of an image display apparatus according to Embodiment 5 of the present invention. As shown in
FIG. 13, the image display device in the present embodiment includes the voice recognition device
described in the fourth embodiment, the control unit 131, and the image display unit 132.
Although the image display device according to the present embodiment is described on the
premise that it includes the voice recognition unit 121, the voice recognition unit 121 may be
configured as a separate device; it suffices for the image display device according to the present
embodiment to include the voice input device 1 described in the first to third embodiments.
[0094]
The control unit 131 controls the display content of the image display unit 132, the power on /
off, the volume, and the like based on the speech recognition result for each user transmitted
from the speech recognition unit 121. As a result, the image display apparatus of the present
embodiment can perform an operation by voice. Preferably, the voice input device 1 is mounted
on a remote controller of the image display device.
[0095]
Examples of such operations include changing the television broadcast channel, calling up a
program guide, and designating a program. Because the image display apparatus according to the
present embodiment includes the voice input device 1, which can appropriately acquire the desired
voice with low noise, it can recognize voice with high accuracy, and voice operation can therefore
be carried out more accurately.
[0096]
The camera 12 may capture an image when such an operation is performed. Further, in any of the
first to fifth embodiments, the camera 12 in the voice input device 1 may capture an image during
the voice input or immediately before or after it, or when a predetermined operation is performed
(for example, when a voice input button is pressed, or when a shutter button separate from the
voice input operation is pressed).
[0097]
Further, not only in this embodiment, if the voice input device 1 is provided with an image display
unit, the user position may be designated by the user touching the face of the target user while
the image captured by the camera 12 is displayed. The camera reference user angle or the camera
reference user distance can then be calculated based on the designated position.
[0098]
As described above, according to the present embodiment, it is possible to provide an image
display apparatus provided with an audio input device capable of appropriately controlling the
directivity range of the microphone array 11.
Furthermore, since the image display apparatus according to the present embodiment includes
the user position detection function such as the camera 12 and the in-image user coordinate
detection unit 13, the in-image user coordinates of a plurality of users can be detected. In
addition, the image of the camera 12 can be displayed on the image display unit 132. In addition,
in the present embodiment, the voice input device 1 can separately input voices of a plurality of
users with low noise. Therefore, the image display device according to the present embodiment can
be used to build, for example, a video conference system that cuts out and displays the images of a
plurality of users and distributes the image and sound of each user separately. Examples of the
image display apparatus in this embodiment also include a mobile telephone (including what is
called a smartphone) and a tablet.
[0099]
(Other Embodiments) For example, each component of the apparatus according to the present
invention illustrated in FIG. 1, FIG. 12 and FIG. 13, other than the microphone array 11, the
camera 12, and the image display unit 132, can be realized by hardware such as a digital signal
processor (DSP), a memory, a bus, an interface, and a peripheral device such as a remote control,
together with software executable on that hardware. Part of the hardware can be mounted as an
integrated circuit (IC) chip set, in which case the software may be stored in the memory. In
addition, all the components of the present invention may be configured by hardware, and in that
case as well, part of the hardware can be mounted as an integrated circuit (IC) chip set.
[0100]
Further, the object of the present invention is also achieved by supplying a recording medium
storing the program code of software that realizes the functions of the various configuration
examples described above to a voice input device, or to a device including it (which may be a
general-purpose computer), in which the microphone array 11 and the camera 12 are built in or
connected, and by having a microprocessor or DSP in that device execute the program code. In this
case, the program code of the software itself implements the functions of the various
configuration examples described above, and the present invention can be constituted by the
program code itself, or by a recording medium (an external recording medium or an internal storage
device) on which the program code is recorded, with the control side reading and executing the
code.
Examples of the external recording medium include various media such as an optical disc such as
a CD-ROM or a DVD-ROM, and a nonvolatile semiconductor memory such as a memory card.
Examples of the internal storage device include various devices such as hard disks and
semiconductor memories. The program code can also be downloaded from the Internet and executed,
or received from broadcast waves and executed.
[0101]
The voice input device according to the present invention, and the voice recognition device and
the image display device provided with the voice input device, have been described above. As is
clear from the processing procedure described, the present invention may also take the form of a
voice input direction changing method in a voice input device provided with a microphone array
and a camera. This voice input direction changing method includes: a step in which the camera
reference user angle calculation unit calculates, based on the image input by the camera, a camera
reference user angle, which is the angle of the user position of the voice acquisition target user
with reference to the camera; a step in which the camera reference user distance calculation unit
calculates, based on the image, a camera reference user distance, which is the distance to the
user position with reference to the camera; a step in which the microphone array reference user
angle calculation unit calculates a microphone array reference user angle, which is the angle of
the user position with reference to the microphone array, based on the predetermined distance,
the camera reference user angle, and the camera reference user distance; and a step in which the
directivity angle control unit controls the pointing angle of the microphone array so that it
becomes the microphone array reference user angle. The other application examples are as
described for the voice input device, and their description is omitted.
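The angle conversion step of this method can be sketched with plane geometry. The coordinate convention below is an assumption made for illustration and may differ from the one used in the first embodiment: the camera reference point is placed at the origin, the microphone array reference point is offset by L along the x axis, both devices face the +y direction, and angles are measured from that front direction, positive toward +x. The function name is hypothetical.

```python
import math

# Illustrative sketch only. Assumed convention (not taken from the source):
#   camera reference point at (0, 0), microphone array reference point at
#   (L, 0), both facing +y; angles measured from +y, positive toward +x.


def microphone_array_reference_angle(theta_deg, distance_d, offset_l):
    """Convert (theta, D) seen from the camera into alpha seen from the array.

    theta_deg  : camera reference user angle, in degrees
    distance_d : camera reference user distance D (same unit as offset_l)
    offset_l   : predetermined distance L between camera and array
    Returns the microphone array reference user angle alpha, in degrees.
    """
    if distance_d <= 0:
        raise ValueError("distance_d must be positive")
    theta = math.radians(theta_deg)
    user_x = distance_d * math.sin(theta)  # user position in camera frame
    user_y = distance_d * math.cos(theta)
    # Shift the origin to the microphone array and take the bearing from +y.
    return math.degrees(math.atan2(user_x - offset_l, user_y))


if __name__ == "__main__":
    # Two speakers straight ahead of the camera, at different distances,
    # are seen at different angles from the offset microphone array.
    for d in (1.0, 2.0):
        alpha = microphone_array_reference_angle(0.0, d, 1.0)
        print(f"D = {d} -> alpha = {alpha:.2f} deg")
```

Under this convention, speakers that share the same camera reference user angle but stand at different distances map to different microphone array reference user angles, which is what allows the pointing angle to single out the target user.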
[0102]
The program code itself is, in other words, a program for causing a computer to execute the voice
input direction changing method. That is, this program causes the computer to execute the steps
of: calculating, based on the image input by the camera, a camera reference user angle, which is
the angle of the user position of the voice acquisition target user with reference to the camera;
calculating, based on the image, a camera reference user distance, which is the distance to the
user position with reference to the camera; calculating a microphone array reference user angle,
which is the angle of the user position with reference to the microphone array, based on the
predetermined distance, the camera reference user angle, and the camera reference user distance;
and controlling the pointing angle of the microphone array so that it becomes the microphone array
reference user angle. The other application examples are as described for the voice input device,
and their description is omitted.
[0103]
As described above, the voice input device according to the present invention includes a
microphone array having a plurality of microphones for acquiring voice, a camera having an
imaging element and inputting an image by photographing, and a directivity angle control unit
that controls the directivity angle of the microphone array based on the image input by the
camera. The camera and the microphone array are spaced apart by a predetermined distance. The
voice input device further includes: a camera reference user angle calculation unit that
calculates, based on the image, a camera reference user angle, which is the angle of the user
position of the voice acquisition target user with reference to the camera; a camera reference
user distance calculation unit that calculates, based on the image, a camera reference user
distance, which is the distance to the user position with reference to the camera; and a
microphone array reference user angle calculation unit that calculates a microphone array
reference user angle, which is the angle of the user position with reference to the microphone
array, based on the predetermined distance, the camera reference user angle, and the camera
reference user distance. The directivity angle control unit controls the pointing angle of the
microphone array so that it becomes the microphone array reference user angle.
[0104]
As a result, in a voice input device capable of controlling the pointing angle, even when there is
a sound source unintended by the person using the voice input device, such as a person (basically
a person other than the speaker) who is speaking at a different distance in the same direction as
the speaker (user) who is the voice acquisition target, the distance is determined and only the
voice of the user at the target distance is input. False recognition of the user's voice is thereby
reduced, and the user's speech can be acquired with low noise.
[0105]
In addition, the microphone array reference user angle calculation unit may calculate a microphone
array reference user angle range based on the predetermined distance, the camera reference user
distance, the camera reference user angle, and an angle calculation error range according to the
camera reference user angle, and the directivity angle control unit may control the directivity
angle of the microphone array so that it falls within the microphone array reference user angle
range.
Since the microphone array reference user angle range is calculated according to the camera
reference user angle, the directivity range of the microphone array can be appropriately
controlled.
[0106]
Alternatively, the microphone array reference user angle calculation unit may calculate the
microphone array reference user angle range based on the predetermined distance, the camera
reference user angle, the camera reference user distance, and a distance calculation error range
according to the camera reference user angle and / or the camera reference user distance, and the
directivity angle control unit may control the directivity angle of the microphone array so that it
falls within the microphone array reference user angle range.
Since the microphone array reference user angle range is calculated according to the camera
reference user distance, the directivity range of the microphone array can be appropriately
controlled.
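The range calculation in the two preceding paragraphs can be sketched by propagating the error bounds through the angle conversion. This is only one possible reading: the sketch evaluates the four corners of the error box and returns the enclosing range, which may differ from how the source combines the angle and distance error ranges. It assumes the same hypothetical geometry as before (camera at the origin, array at (L, 0), both facing +y) and assumes that the nearest distance exceeds |L| and that the angles stay within ±90 degrees, so that the extremes occur at the corners.

```python
import math
from itertools import product

# Illustrative sketch only. Geometry, names, and the corner-evaluation
# strategy are assumptions, not the method stated in the source.


def mic_angle_deg(theta_deg, distance_d, offset_l):
    """Bearing of the user from the array, in degrees (assumed geometry)."""
    theta = math.radians(theta_deg)
    return math.degrees(
        math.atan2(distance_d * math.sin(theta) - offset_l,
                   distance_d * math.cos(theta))
    )


def microphone_array_reference_angle_range(theta_deg, distance_d, offset_l,
                                           theta_err_deg=0.0, dist_err=0.0):
    """Return (alpha_min, alpha_max) in degrees over the error box.

    theta_err_deg : half-width of the angle calculation error range
    dist_err      : half-width of the distance calculation error range
    """
    if distance_d - dist_err <= abs(offset_l):
        raise ValueError("nearest distance must exceed |offset_l|")
    if abs(theta_deg) + theta_err_deg >= 90.0:
        raise ValueError("angles must stay within +/-90 degrees")
    thetas = (theta_deg - theta_err_deg, theta_deg + theta_err_deg)
    dists = (distance_d - dist_err, distance_d + dist_err)
    alphas = [mic_angle_deg(t, d, offset_l) for t, d in product(thetas, dists)]
    return min(alphas), max(alphas)


if __name__ == "__main__":
    lo, hi = microphone_array_reference_angle_range(
        0.0, 3.0, 1.0, theta_err_deg=2.0, dist_err=0.5)
    print(f"alpha range: {lo:.2f} to {hi:.2f} deg")
```

Setting both error half-widths to zero collapses the range to the single microphone array reference user angle.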
[0107]
The camera reference user angle and / or the camera reference user distance may be calculated
by detecting a face area of a subject in the image and setting the position of the face area as the
user position. Detecting the face area is preferable as a process for detecting the position of the
user because the face contains the mouth, which emits the voice.
[0108]
An image display apparatus according to the present invention is characterized by including the
above-described voice input device. Thus, it is possible to provide an image display device
provided with an audio input device capable of appropriately controlling the directivity range of
the microphone array.
[0109]
D: camera reference user distance, L: distance between camera and microphone array
(predetermined distance), θ: camera reference user angle, α: directivity direction, 1: voice
input device, 11: microphone array, 12: camera, 13: in-image user coordinate detection unit,
14: camera reference user angle calculation unit, 15: camera reference user distance calculation
unit, 16: microphone array reference user angle calculation unit, 17: directivity angle control
unit, 21: superdirective microphone, 22: microphone array reference point, 23: sound collecting
direction, 24: microphone array front direction, 25: microphone array axis, 31: microphone,
32: camera reference point, 40: shooting range, 41: camera reference point, 42: camera optical
axis, 43: camera shooting direction, 44: straight line, 51: projection plane of camera, 52: user
face coordinates in image, 61: user, 62: another speaker, 91: directivity range (directed area),
101: angle calculation error range, 102: distance calculation error range, 103: common range,
104: microphone array reference user angle range, 121: voice recognition unit, 131: control unit,
132: image display unit.