Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2010010857
A voice input robot capable of effectively supporting voice communication in which a plurality of people participate is provided. SOLUTION: The robot includes a voice input unit 111 for receiving an input of voice, a sound source position estimation unit 121 for estimating the sound source position of voice received by the voice input unit 111, and an operation unit 112 for changing the position of the voice input unit 111. The sound source position estimation unit 121 estimates the sound source positions of a plurality of voices received by the voice input unit 111, and the operation unit 112 changes the positional relationship between the voice input unit 111 and the sound source positions of the plurality of voices based on the estimation result of the sound source position estimation unit 121. [Selected figure] Figure 1
Voice Input Robot, Teleconference Support System, Teleconference Support Method
[0001]
The present invention relates to a voice input robot having a voice input unit, a teleconference
support system having the robot, and a teleconference support method using the robot.
[0002]
[Background Art] Conventionally, regarding robot apparatuses, a technique has been proposed that aims to provide "a robot apparatus capable of performing a more natural operation on an object to improve entertainment characteristics, and a behavior control method for the robot apparatus". In this technique, the robot apparatus 1 includes a CCD camera 22, a microphone 24, a moving body detection module 32 for detecting a moving body from image data, a face detection module 33 for detecting a human face, a sound source direction estimation module 34 for estimating a sound source direction from voice data, and control means for controlling movement toward any of the moving body direction based on the moving body detection result, the face direction based on the face detection result, and the estimated sound source direction. When a face is detected while the robot walks in the moving body direction or the estimated sound source direction, the control means controls the robot to move in the face direction, and controls it to stop walking when it approaches the object whose face is to be detected within a predetermined range (Patent Document 1).
10-04-2019
[0003]
Also proposed is a technique aiming to provide "an action control apparatus for an autonomous action robot that responds to a human with pet-like behavior and can sense human affinity". This robot comprises an image input device 1 with a stereo camera; a person detection device that detects a person by image processing and tracks the face region of the person; a distance calculation device 3 that calculates distance from the images of the stereo camera; a person identification device 4 that identifies a person from the information in a person information storage unit 5; a voice input device 6 composed of a microphone attached to the body; a sound source direction detection device 7; a voice recognition device 8; ultrasonic sensors 9 installed at the front, back, left and right of the robot, which send obstacle information to an obstacle detection device 10; a touch sensor 11 that sends to an action control device 12 a signal that can identify whether it was hit or stroked; a leg motor 13 that drives two wheels; a head motor 14 for rotating the head; and an audio output device 15 attached to the mouth of the robot (Patent Document 2).
[0004]
Further, regarding interactive robots, a technique has been proposed that aims to provide "an interactive robot capable of improving speech recognition accuracy without increasing the operation load of the human interacting with the robot". This speech recognition interactive robot 400 includes sound source direction estimation means for estimating the sound source direction of a target speech to be subjected to speech recognition; a moving unit that moves the interactive robot itself in the direction estimated by the sound source direction estimation means; a target voice acquisition unit that acquires the target voice at the position after the movement by the moving unit; and voice recognition means that performs voice recognition on the target voice acquired by the target voice acquisition unit (Patent Document 3).
[0005]
JP-A-2004-130427 (abstract) JP-A-2003-326479 (abstract) JP-A-2006-181651 (abstract)
[0006]
For example, in an environment in which voice dialogue is performed through a microphone, as in remote communication, the speaker's voice may be difficult to hear depending on the positional relationship between the speaker and the microphone.
In particular, when there are multiple speakers, the ease of hearing differs for each speaker due to differences in each speaker's speech volume, distance and positional relationship to the microphone, and the like.
[0007]
Under such circumstances, a person who is in a position to listen to the sound collected by the microphone (in the case of remote communication, this corresponds to the other party in the remote area) tries to improve the situation by telling the speaker, for example, that the speech is difficult to hear. However, such exchanges interrupt the speech, hinder the smooth progress of communication, and give the participants extra stress.
[0008]
In order to solve these problems, it is possible to improve the performance of the microphones that collect the voice or to increase the number of installed microphones, but the cost of maintaining such an environment is high.
[0009]
On the other hand, the techniques described in the above-mentioned Patent Documents 1 to 3 disclose that the sound source position is estimated from acquired voice and that the robot moves in that direction.
This can be regarded as intended to input voice at a position close to the speaker.
However, this action is for human-robot interaction, not for facilitating remote communication.
[0010]
For example, it is conceivable to optimize the distance between the robot and the dialogue partner by moving the robot using the techniques described in Patent Documents 1 to 3. However, in an environment where multiple people participate in communication, such as a teleconference, optimizing only the relationship between the robot and one dialogue partner does not necessarily optimize the progress of the conference as a whole. That is, in an environment in which a plurality of persons participate in a conference, in other words, in an environment where it is required to collect the voices generated from a plurality of sound sources as a whole, the techniques described in Patent Documents 1 to 3 are not necessarily suitable.
[0011]
Therefore, a voice input robot that can effectively support voice communication in which a
plurality of people participate has been desired.
[0012]
A voice input robot according to the present invention comprises a voice input unit that receives voice input, a sound source position estimation unit that estimates the sound source position of voice received by the voice input unit, and an operation unit that changes the position of the voice input unit. The sound source position estimation unit estimates the sound source positions of a plurality of voices received by the voice input unit, and the operation unit changes the positional relationship between the voice input unit and the sound source positions of the plurality of voices based on the estimation result of the sound source position estimation unit.
[0013]
According to the voice input robot according to the present invention, since voices generated
from a plurality of sound source positions can be collected as a whole, voice communication in
which a plurality of people participate can be effectively supported.
[0014]
Embodiment 1
FIG. 1 is a block diagram of a teleconference support system according to a first embodiment of
the present invention.
The remote conference support system according to the first embodiment includes a voice input
robot 100 and a conference terminal 200.
The voice input robot 100 and the conference terminal 200 are remotely connected via a
network 300 such as a LAN (Local Area Network) or the Internet, for example.
[0015]
The voice input robot 100 includes a robot body unit 110 and a robot control unit 120. The robot body unit 110 includes the main body casing of the voice input robot 100 and each component attached to the casing; its specific configuration will be described later. The robot control unit 120 controls the operation of the voice input robot 100; its specific configuration will also be described later. The robot control unit 120 and each of its components can be configured by hardware, such as a circuit device that realizes the corresponding function, or by software that defines the operation of an arithmetic device such as a microcomputer or a CPU (Central Processing Unit). Necessary storage devices and network interfaces are provided as appropriate.
[0016]
The robot body unit 110 and the robot control unit 120 may be housed in the same casing, or, for example, the robot control unit 120 may be placed outside, separated from the robot body unit 110, with the two communicating by wire or radio.
[0017]
The robot body 110 includes a voice input unit 111 and an operation unit 112.
[0018]
The voice input unit 111 is composed of, for example, a microphone array provided with a
plurality of microphones, and collects voices around the voice input robot 100.
In order to enable the voice input robot 100 to collect voice from all directions without changing
its posture, it is preferable to configure the voice input unit 111 with a microphone array.
For example, a plurality of unidirectional microphones may be arranged on a circumference, with their directivity pointing outward from the circle.
The voice collected by the voice input unit 111 is output to a voice information processing unit
121 described later.
[0019]
The operation unit 112 has the function of changing the spatial position of the voice input unit 111, within the space where the voice input robot 100 exists, based on instructions from the operation determination unit 123. A specific configuration example of the operation unit 112 will be described later with reference to FIG. 2.
[0020]
The robot control unit 120 includes a voice information processing unit 121, a statistical
processing unit 122, an operation determination unit 123, a database 124, and a setting unit
125.
[0021]
The voice information processing unit 121 receives the voice collected by the voice input unit
111, estimates the sound source position of the voice, and calculates the volume of the estimated
sound source.
The estimation results and calculation results are stored in the database 124. Any suitable known technique may be used to estimate the sound source position. Also, the voice information processing unit 121 transmits the
voice received from the voice input unit 111 to the conference terminal 200 via the network
300.
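How such quantities might be computed is sketched below. This is a simplified stand-in (an assumption of this description): real systems typically estimate direction by time-difference-of-arrival or beamforming, whereas here volume is a plain RMS measure and direction is a volume-weighted average of the microphones' facing directions.

```python
import math

def rms_volume(samples):
    """Root-mean-square of a voice frame, used as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_direction(frames, mic_facings):
    """Naive sound-source direction estimate: average the facing
    directions of the microphones, weighted by each mic's RMS volume.
    frames: one list of samples per microphone; mic_facings: matching
    (x, y) unit vectors. Returns a unit vector toward the source."""
    vols = [rms_volume(f) for f in frames]
    x = sum(v * fx for v, (fx, fy) in zip(vols, mic_facings))
    y = sum(v * fy for v, (fx, fy) in zip(vols, mic_facings))
    norm = math.hypot(x, y) or 1.0
    return (x / norm, y / norm)
```

A source in front of the +x-facing microphone produces a louder frame there, so the weighted average leans toward +x.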
[0022]
The statistical processing unit 122 performs statistical processing, described later with reference to FIGS. 3 to 5, on the data accumulated in the database 124 and the setting information received by the setting unit 125, to create a voice distribution map of the voice environment of the space where the voice input robot 100 exists. The created map is stored in the database 124. The targets of the statistical processing performed by the statistical processing unit 122 are the above-described information processed by the voice information processing unit 121, that is, the estimated position of each sound source, the volume at the estimated sound source position, the time (sampling time), and the like.
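One possible shape for such a voice distribution map is sketched below. The per-source record and the aggregation step are assumptions of this description (the patent does not specify a data structure); each sampled observation is counted as one utterance, a simplification.

```python
from dataclasses import dataclass, field

@dataclass
class SourceStats:
    position: tuple                              # estimated (x, y) of the source
    volumes: list = field(default_factory=list)  # volume per sampling period
    utterances: int = 0                          # observations counted as utterances

    @property
    def mean_volume(self):
        return sum(self.volumes) / len(self.volumes) if self.volumes else 0.0

def build_map(observations):
    """Aggregate (source_id, position, volume) observations into a
    'voice distribution map': one statistics record per sound source."""
    vmap = {}
    for source_id, position, volume in observations:
        stats = vmap.setdefault(source_id, SourceStats(position))
        stats.volumes.append(volume)
        stats.utterances += 1
    return vmap
```

The resulting map carries exactly the quantities listed above (position, volume, occurrence count) that the later figures plot.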
[0023]
The operation determination unit 123 determines whether or not to change the spatial position of the voice input unit 111, and the destination position of the change, from the voice distribution map generated by the statistical processing unit 122 and the setting information received by the setting unit 125. The determined result is output to the operation unit 112 as a movement command.
[0024]
The database 124 holds the above-described information processed by the voice information
processing unit 121, that is, the estimated position of the sound source, the volume of the
estimated sound source position, and the like in chronological order. The database 124 can be
configured using a storage device such as a hard disk drive (HDD) that stores information to be
held. The storage format of the information may be arbitrary.
[0025]
The setting unit 125 receives an input of setting information for setting a sound environment
and a sound collection situation desired by the listener of the voice, that is, how the listener
wants to listen to the voice from the speaker. Specific examples of setting contents will be
described later. For example, the setting information may be received via a network interface or a screen input. The content of the movement command output by the operation determination unit 123 is determined from the setting information received by the setting unit 125 and the voice distribution map generated by the statistical processing unit 122.
[0026]
The “voice environment desired by the voice listener side” received by the setting unit 125
refers to, for example, the following (1) to (3).
[0027]
(1) I want to listen to the speech from a position approximately equidistant from each speaker.
In this case, voices from multiple sound sources are acquired simultaneously. Since the collected volume directly reflects each speaker's speech volume, it is easy to read the speakers' emotions from the loudness of their voices.
[0028]
(2) I want to hear each speaker's utterances at the same volume. In this case as well, voices from multiple sound sources are acquired simultaneously. The perceived strength of each utterance or assertion is less affected by differences in volume.
[0029]
(3) I want to listen in a situation where it is easy to hear the utterances of a specific speaker. This corresponds to a situation in which a specific speaker makes many remarks, for example, one in which a certain speaker is explaining material. In this case, it is assumed that there is a demand to pay attention to the position of the speaker who speaks frequently, decreasing the volume when that speaker's voice is too loud and increasing it when the voice is too quiet.
[0030]
As described above, since the sound collection state can be reconfigured through the setting unit 125, the voice input robot 100 can effectively support communication between humans. This differs from technology premised on communication between a human and a robot, which only performs operations according to a programmed purpose, like the interactive robots described in Patent Documents 1 to 3 above.
[0031]
The "sound source position estimation unit" in the first embodiment corresponds to the voice
information processing unit 121. Further, the “operation unit” corresponds to the operation
unit 112 and the operation determination unit 123 that determines the content of the operation.
[0032]
The conference terminal 200 is a terminal used by a remote conference participant, and can be configured using a computer such as a laptop. The conference terminal 200 includes a voice output unit 210, which is composed of, for example, a speaker. The conference terminal 200 receives the voice transmitted by the robot control unit 120 via the network 300 and outputs it from the voice output unit 210. By listening to this output, the remote conference participant can hear the voices of the conference participants around the voice input robot 100.
[0033]
The configuration of the teleconference support system according to the first embodiment has
been described above. Next, a specific configuration example of the operation unit 112 will be
described.
[0034]
FIG. 2 shows examples of the external configuration of the voice input robot 100. FIG. 2(a) is a self-propelled configuration example, and FIG. 2(b) is a fixed movable configuration example.
[0035]
In the self-propelled configuration shown in FIG. 2(a), the operation unit 112 is a vehicle movable in any direction on a plane, and a plurality of voice input units 111, each composed of a microphone, are installed on the vehicle's pedestal. The operation unit 112 drives the wheels based on instructions from the operation determination unit 123 and moves the voice input robot 100 in the instructed direction.
[0036]
In the fixed movable configuration shown in FIG. 2(b), the operation unit 112 is a movable swing arm fixed to a bottom pedestal, and a plurality of voice input units 111, each composed of a microphone, are installed on a pedestal fixed to the movable swing arm. The operation unit 112 moves the spatial position of the voice input unit 111 by changing the attitude (yaw and pitch angles) and the length of the arm based on instructions from the operation determination unit 123.
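For the fixed movable type, the mapping from the arm's yaw, pitch, and length to the spatial position of the voice input unit is simple forward kinematics. A minimal sketch, assuming the arm pivots about a single point at the base (the patent does not detail the arm geometry):

```python
import math

def mic_position(yaw, pitch, arm_length, base=(0.0, 0.0, 0.0)):
    """Position of the voice input unit at the tip of a swing arm.
    yaw: rotation about the vertical axis (radians); pitch: elevation
    from horizontal (radians); arm_length: extended length (meters)."""
    bx, by, bz = base
    horizontal = arm_length * math.cos(pitch)
    return (bx + horizontal * math.cos(yaw),
            by + horizontal * math.sin(yaw),
            bz + arm_length * math.sin(pitch))
```

Inverting this relation gives the yaw, pitch, and length commands needed to place the microphone pedestal at a destination decided by the operation determination unit 123.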
[0037]
The specific configuration examples of the operation unit 112 have been described above. Next, examples of the voice distribution map created by the statistical processing unit 122 will be described in relation to the "voice environment desired by the voice listener side" input to the setting unit 125.
[0038]
FIG. 3 is an example of a voice distribution map created based only on the sound source positions. Hereinafter, the process of changing the spatial position of the voice input unit 111 will be described with reference to FIG. 3. Here, the setting information "(1) I want to listen to the speech from a position approximately equidistant from each speaker" is input to the setting unit 125.
[0039]
FIG. 3A shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, and the black triangle indicates the initial position of the voice input robot 100. In the state of FIG. 3A, the voice input robot 100 is closest to conference participant 2, and the distances to the other conference participants are longer.
[0040]
The voice information processing unit 121 receives speech voices of the conference participants
1 to 3 from the voice input unit 111 within a predetermined sampling time, estimates the sound
source position of each conference participant, and stores the estimated sound source position in
the database 124. The statistical processing unit 122 creates an audio distribution map in which
the positions of the respective conference participants are mapped on the two-dimensional plane
coordinates as shown in FIG. 3A using the estimation results of the sound source positions of the
respective conference participants.
[0041]
FIG. 3B is a diagram showing how the operation determination unit 123 determines the movement destination of the voice input robot 100. Based on the voice distribution map shown in FIG. 3A and the setting information received by the setting unit 125, the operation determination unit 123 determines the movement destination of the voice input robot 100 so that the distance between the voice input robot 100 (or the voice input unit 111; the same applies hereinafter) and each conference participant becomes equal.
[0042]
FIG. 3C is a voice distribution map after the voice input robot 100 has moved. As the spatial
position of the voice input robot 100 moves, the distance between the voice input robot 100 and
each conference participant becomes equal.
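For three non-collinear speakers, the point equidistant from all of them is the circumcenter of the triangle they form. A minimal sketch of this computation (an illustrative assumption; the patent does not specify how the destination is calculated):

```python
def equidistant_point(p1, p2, p3):
    """Circumcenter of three speaker positions: the point equidistant
    from all three. Assumes the speakers are not collinear."""
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy)
```

With more than three speakers an exactly equidistant point generally does not exist, so a practical system would instead minimize the spread of the distances.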
[0043]
FIG. 4 is an example of a voice distribution map created based on the sound source positions and the volume of each sound source. Hereinafter, the process of changing the spatial position of the voice input unit 111 will be described with reference to FIG. 4. Here, the setting information "(2) I want to hear each speaker's utterances at the same volume" is input to the setting unit 125.
[0044]
FIG. 4A shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, the size of each circle indicates the speech volume of that participant, and the black triangle indicates the initial position of the voice input robot 100. In the state of FIG. 4A, the voice input robot 100 is closest to conference participant 1, and accordingly the volume collected from conference participant 1 is the largest.
[0045]
The voice information processing unit 121 receives the speech voices of conference participants 1 to 3 from the voice input unit 111 within a predetermined sampling time, estimates the sound source position of each conference participant, and stores the estimated sound source positions in the database 124. In addition, the speech volume of each conference participant is calculated and stored in the database 124. The speech volume referred to here is, for example, the maximum or minimum volume within the sampling time, or the average volume within the sampling time. The statistical processing unit 122 uses the estimation results of the sound source positions to create a voice distribution map in which the position and speech volume of each conference participant are mapped onto two-dimensional plane coordinates, as shown in FIG. 4A.
[0046]
FIG. 4B is a diagram showing how the operation determination unit 123 determines the movement destination of the voice input robot 100. Based on the voice distribution map shown in FIG. 4A and the setting information received by the setting unit 125, the operation determination unit 123 determines the movement destination of the voice input robot 100 so that the volume collected from each conference participant becomes equal.
[0047]
FIG. 4C is a voice distribution map after the voice input robot 100 has moved. As a result of the move, the volume collected from each conference participant (the size of each circle) becomes equal.
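One way such a destination could be found is sketched below. The received level is modeled as source volume divided by squared distance (an inverse-square assumption of this description, not the patent), and a coarse grid search picks the position where the levels from all speakers are most nearly equal:

```python
def equal_volume_position(speakers, extent=5.0, step=0.1):
    """Grid-search a robot position at which the collected volume from
    every speaker is as equal as possible.
    speakers: list of ((x, y), source_volume) pairs.
    Received level is modeled as source_volume / distance**2."""
    def spread(px, py):
        levels = []
        for (sx, sy), vol in speakers:
            d2 = (px - sx) ** 2 + (py - sy) ** 2
            if d2 < 1e-6:          # avoid sitting on top of a speaker
                return float("inf")
            levels.append(vol / d2)
        mean = sum(levels) / len(levels)
        return sum((l - mean) ** 2 for l in levels)

    return min(((x * step, y * step)
                for x in range(int(-extent / step), int(extent / step) + 1)
                for y in range(int(-extent / step), int(extent / step) + 1)),
               key=lambda p: spread(*p))
```

For two equally loud speakers, any point on their perpendicular bisector equalizes the levels, so the search returns a position on that line.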
[0048]
FIG. 5 is an example of a voice distribution map created based on the sound source positions, the volume of each sound source, and the utterance frequency of each sound source. Hereinafter, the process of changing the spatial position of the voice input unit 111 will be described with reference to FIG. 5. Here, the setting information "(3) I want to listen in a situation where it is easy to hear the utterances of a specific speaker" is input to the setting unit 125.
[0049]
FIG. 5A shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, the size of each circle indicates the speech volume of that participant, the number of rings around each circle indicates the number of utterances, and the black triangle indicates the initial position of the voice input robot 100. In this example, the listener side wishes for a situation in which it is easy to hear the speech of conference participant 3.
[0050]
The voice information processing unit 121 receives speech voices of the conference participants
1 to 3 from the voice input unit 111 within a predetermined sampling time, estimates the sound
source position of each conference participant, and stores the estimated sound source position in
the database 124. Further, the speech volume and the number of speeches of each conference
participant are calculated and stored in the database 124. The statistical processing unit 122 uses the estimation results of the sound source positions to create a voice distribution map in which the position, speech volume, and number of utterances of each conference participant are mapped onto two-dimensional plane coordinates, as shown in FIG. 5A.
[0051]
FIG. 5B is a diagram showing how the operation determination unit 123 determines the movement destination of the voice input robot 100. Based on the voice distribution map shown in FIG. 5A and the setting information received by the setting unit 125, the operation determination unit 123 determines the movement destination of the voice input robot 100 so that the volume collected from conference participant 3 becomes the largest.
[0052]
FIG. 5C is a voice distribution map after the voice input robot 100 has moved. As a result of the move, the volume collected from conference participant 3 (the size of the circle) becomes the largest, and the volumes collected from the other conference participants become smaller. Since the number of utterances itself does not change when the voice input robot 100 moves, the number of rings around each circle does not change.
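A destination for this focus setting could be chosen analogously: modeling the received level as source volume divided by squared distance (the same inverse-square assumption as before, introduced by this description rather than the patent), search for the position that maximizes the target speaker's share of the total collected level:

```python
def focus_position(speakers, target_index, extent=5.0, step=0.1, min_d=0.3):
    """Grid-search a robot position that maximizes the collected level of
    one target speaker relative to the total level of all speakers.
    speakers: list of ((x, y), source_volume); level = volume / distance**2."""
    def score(px, py):
        levels = []
        for (sx, sy), vol in speakers:
            d2 = (px - sx) ** 2 + (py - sy) ** 2
            if d2 < min_d ** 2:    # keep a minimum distance to any speaker
                return -1.0
            levels.append(vol / d2)
        return levels[target_index] / sum(levels)

    return max(((x * step, y * step)
                for x in range(int(-extent / step), int(extent / step) + 1)
                for y in range(int(-extent / step), int(extent / step) + 1)),
               key=lambda p: score(*p))
```

The `min_d` guard is a practical assumption: without it, the optimum degenerates to sitting directly on the target speaker.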
[0053]
The example of the sound distribution map created by the statistical processing unit 122 has
been described above.
[0054]
Note that, in consideration of the sound generated by the voice input robot 100 itself and of the change in the sound collection state caused by its movement, the operation determination unit 123 does not issue a movement command immediately after determining the movement destination. Instead, it instructs the operation unit 112 to move only when one of the following conditions is satisfied.
[0055]
(Condition 1) A state where no sound is generated from each sound source continues for a
certain unit time.
(Condition 2) The volume generated from each sound source is in a state below a certain level.
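These two conditions amount to a simple check over the volume samples observed in the most recent unit time. A sketch, with the thresholds as illustrative assumptions:

```python
def movement_permitted(recent_volumes, silence_threshold=0.01,
                       level_threshold=0.1):
    """Decide whether the robot may start moving.
    recent_volumes: per-source lists of volume samples for the last
    unit time, e.g. {"speaker1": [0.0, 0.0], "speaker2": [0.05, 0.02]}.
    Condition 1: every source has been silent for the whole unit time.
    Condition 2: every source's volume stays below a certain level."""
    samples = [v for vs in recent_volumes.values() for v in vs]
    all_silent = all(v <= silence_threshold for v in samples)  # Condition 1
    all_quiet = all(v <= level_threshold for v in samples)     # Condition 2
    return all_silent or all_quiet
```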
[0056]
In addition, while the voice input robot 100 is moving, statistical processing is suspended, likewise in consideration of the sound generated by the voice input robot 100 itself and of the change in the sound collection state caused by its movement.
Specifically, the operation determination unit 123 may instruct the statistical processing unit 122 to that effect.
[0057]
FIG. 6 is an operation flow for improving the voice environment by operating the voice input
robot 100 so as to achieve a desired voice condition (sound collection state of the voice input
robot 100) on the listener side. Here, the scene of a teleconference is assumed. Hereinafter, each
step of FIG. 6 will be described.
[0058]
(S601) The following steps are repeated until the exchange of voice through the voice input unit
111 is completed. The end of the voice exchange means, for example, the end of the
teleconference. (S602) The voice input unit 111 acquires the voice of the space where the voice
input robot 100 exists, in this case, the voice of the conference room on the speech side. The
acquired voice is transmitted to the robot control unit 120.
[0059]
(S603) Based on the voice received from the voice input unit 111, the voice information processing unit 121 performs arithmetic processing such as estimating the sound source positions and calculating the volume and the number of utterances of each estimated sound source. Also, the voice received from the voice input unit 111 is transmitted to
the conference terminal 200. (S604) The voice information processing unit 121 stores the result
of step S603 in the database 124. (S605) If the voice input robot 100 is moving, the process
proceeds to step S611. If the voice input robot 100 is not moving, the process proceeds to step
S606.
[0060]
(S606) The statistical processing unit 122 executes the above-described statistical processing
based on each data stored in the database 124 and the setting information (voice environment
desired by the listener) received by the setting unit 125. (S607) The statistical processing unit
122 creates an audio distribution map as described in FIGS. 3 to 5 based on the processing result
of step S606. The created speech distribution map is stored in the database 124 in an arbitrary
data format.
[0061]
(S608) Based on the voice distribution map created in step S607 and the setting information received by the setting unit 125, the operation determination unit 123 determines whether the position of the voice input robot 100 needs to be changed in order to achieve the voice environment desired by the listener. If the position needs to be changed, the process proceeds to step S609; if not, the process returns to step S602 and repeats. (S609) The operation determination unit 123
determines the movement destination position of the voice input robot 100 based on the voice
distribution map created in step S607 and the setting information received by the setting unit
125.
[0062]
(S610) The operation determination unit 123 determines whether the movement or operation of the voice input robot 100 may be started. This judgment consists of checking whether the above-mentioned Condition 1 or Condition 2 is fulfilled. If the movement or operation of the voice input
robot 100 is permitted, the process proceeds to step S611. If the movement / operation is not
permitted, the process returns to step S602 to repeat the process. (S611) The operation
determination unit 123 issues an operation command to the operation unit 112. The operating
unit 112 drives the voice input robot 100 based on the operation command to change the spatial
position of the voice input unit 111.
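The S601-S611 flow above can be sketched structurally as one control loop. The objects and method names below are hypothetical stand-ins for the units of FIG. 1 (they are not defined by the patent); each call is annotated with the step it mirrors:

```python
def support_loop(robot, terminal, settings, database):
    """One structural sketch of the FIG. 6 control flow."""
    while not terminal.conference_ended():              # S601
        voice = robot.voice_input()                     # S602
        info = robot.process(voice)                     # S603: position, volume
        terminal.send(voice)                            # S603: relay to listener
        database.store(info)                            # S604
        if robot.is_moving():                           # S605: skip re-planning
            continue
        stats = database.statistics(settings)           # S606
        vmap = database.make_distribution_map(stats)    # S607
        if not robot.needs_reposition(vmap, settings):  # S608
            continue
        destination = robot.decide_destination(vmap, settings)  # S609
        if robot.movement_permitted():                  # S610: Conditions 1-2
            robot.move_to(destination)                  # S611
```

Note the sketch simplifies the S605 branch: while moving, it merely skips statistical processing rather than re-issuing operation commands.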
[0063]
The flow for improving the voice environment by operating the voice input robot 100 has been
described above. By operating the voice input robot 100, the sound collection state of the voice
input unit 111 is changed to the state desired by the listener.
[0064]
As described above, according to the first embodiment, voices generated from a plurality of sound source positions can be collected as a whole under conditions matching the setting information received by the setting unit 125. As such, voice communication in which multiple people participate can be effectively supported.
[0065]
Further, according to the first embodiment, in an environment in which voice is exchanged through voice input means, such as a remote conference, the voice input robot 100 can be moved so as to obtain the voice condition desired by the listener (the sound collection state of the voice input unit 111), thereby improving the voice environment.
[0066]
Further, according to the first embodiment, in addition to the listener-side advantage of obtaining the desired voice environment, there are also advantages for the speaker side.
[0067]
In conventional remote conference technology, the speaker receives little feedback on how his or her speech is heard by the listener.
For example, the only available feedback is verbal, such as the listener saying "I cannot hear you well."
Therefore, if the listener does not speak up, the speaker obtains no feedback at all.
Moreover, smooth communication is hindered if listeners must give such verbal feedback every time.
[0068]
Regarding this problem, according to the first embodiment, the very fact that the voice input robot 100 moves in the conference space on the speaker side conveys to the speaker that the listener wants the sound collection state improved. For example, by watching the voice input robot 100 approach, a speaker can notice that his or her speech may not be heard well by the listener.
[0069]
In this regard, it is also conceivable to improve the sound collection state by software processing, such as amplification of the audio signal. In the first embodiment, by contrast, the physical movement of the voice input robot 100 improves the sound collection state and gives feedback to the speaker at the same time.
[0070]
Second Embodiment In the first embodiment, in consideration of the sound generated by the voice input robot 100 itself and the change in the sound collection state caused by its movement, the voice input robot 100 was not permitted to move until predetermined conditions were satisfied.
[0071]
With such operation, a time lag occurs between the movement instruction to the voice input robot 100 and its actual movement. The feedback that the robot's movement indirectly conveys to the speaker, namely the listener's request, is therefore delayed. While this feedback is delayed, the listener's request goes unmet and the hard-to-hear condition persists.
[0072]
Therefore, the second embodiment aims to eliminate the feedback delay described above and to alert the speaker so as to prompt an improvement of the speech situation (for example, the speaker changing position or raising the volume).
[0073]
FIG. 7 is a block diagram of a teleconference support system according to a second embodiment
of the present invention.
The remote conference support system according to the second embodiment adds a display unit 113 to the robot body 110, in addition to the configuration described with FIG. 1 in the first embodiment. The rest of the configuration is substantially the same as in FIG. 1, so the differences are mainly described below.
[0074]
The display unit 113 is a functional unit that displays the moving direction and destination of the voice input robot 100 based on an instruction from the operation determination unit 123. After determining the destination position and direction of the voice input robot 100 based on the statistical processing of the statistical processing unit 122, the operation determination unit 123 instructs the display unit 113 to display that direction before instructing the operation unit 112 to move.
[0075]
In this way, when a movement instruction for the voice input robot 100 is generated, its content is displayed in advance rather than first becoming apparent through the actual movement, so the speaker can indirectly know how his or her voice is being heard on the listener side. Further, since only a display is performed, no change in the voice environment is caused by movement of the voice input robot 100.
[0076]
Moreover, displaying the moving direction and destination notifies the speaker that the voice input robot 100 is about to move, which has the following effect: to let the voice input robot 100 start moving, the speaker can temporarily interrupt his or her speech and leave a pause until the robot's movement is completed.
[0077]
FIG. 8 is a diagram showing configuration examples of the display unit 113. FIG. 8(a) shows an example in which the display unit 113 is configured using a projector, and FIG. 8(b) shows an example in which it is configured using LEDs (Light Emitting Diodes).
[0078]
In the example of FIG. 8(a), the display unit 113 configured with a projector projects the direction in which the voice input robot 100 is about to move into the space around the robot, using figures or characters such as arrows. Specifically, for example, the moving direction may be represented by the direction of an arrow and the moving distance by its length. Other methods, and expressions other than arrows and characters, may also be used.
[0079]
In the example of FIG. 8(b), a plurality of LEDs are arranged circumferentially around the voice input robot 100, and the moving direction is displayed by lighting the LED in the direction in which the robot is about to move.
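The LED scheme of FIG. 8(b) reduces to picking, from a ring of LEDs, the one closest to the planned movement direction. A minimal sketch follows; the count of eight evenly spaced LEDs, with index 0 at the robot's front and indices increasing counter-clockwise, is an assumed layout, since the patent only states that the LEDs are arranged circumferentially.

```python
import math


def led_to_light(move_dx, move_dy, num_leds=8):
    """Return the index of the LED to light for a planned movement
    vector (move_dx, move_dy), among num_leds LEDs spaced evenly
    around the robot. LED 0 faces +x; indices run counter-clockwise."""
    # angle of the movement vector, normalized to [0, 2*pi)
    angle = math.atan2(move_dy, move_dx) % (2 * math.pi)
    sector = 2 * math.pi / num_leds
    # round to the nearest LED direction
    return int((angle + sector / 2) // sector) % num_leds
```

For instance, a move straight ahead lights LED 0, a move to the robot's left (+y) lights LED 2, and a move backward lights LED 4.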
[0080]
In both FIGS. 8(a) and 8(b), the display is turned off when the voice input robot 100 is not moving.
[0081]
FIG. 9 is an operation flow in the second embodiment for improving the voice environment by operating the voice input robot 100 so as to achieve the voice condition desired by the listener (the sound collection state of the voice input unit 111).
As in FIG. 6, assume a teleconference scene.
Hereinafter, each step of FIG. 9 will be described.
[0082]
(S901)-(S909) These steps are the same as steps S601-S609 in FIG. 6, so their description is omitted.
(S910) The operation determination unit 123 instructs the display unit 113 to display the moving direction of the voice input robot 100. The display unit 113 displays the moving direction of the voice input robot 100 based on the instruction.
[0083]
(S911) The operation determination unit 123 determines whether the movement or operation of the voice input robot 100 may be started, that is, whether the conditions 1 and 2 described in the first embodiment are satisfied. If the movement of the voice input robot 100 is permitted, the process proceeds to step S912; if not, the loop returns to step S901. (S912) This step is the same as step S611 of FIG. 6, so its description is omitted.
[0084]
The flow of improving the voice environment by operating the voice input robot 100 in the
second embodiment has been described above.
[0085]
The display content of the display unit 113 need not be limited to information on the movement destination of the voice input robot 100.
That is, it suffices that the sound collection state the listener hears can be fed back to the speaker in some form, directly or indirectly. For example, the display may convey the listener's intention, such as not wanting to interrupt the utterance but having a question about its current content. Such a display can facilitate communication.
[0086]
As described above, by providing means for notifying the speaker of at least information
suggesting a sound collection state, the same effect as that of the second embodiment can be
exhibited.
[0087]
As described above, according to the second embodiment, since the display unit 113 displays in advance that the voice input robot 100 is about to move, the speaker can indirectly receive feedback on how the voice is heard on the listener side.
As a result, the speaker can speak with the listener in mind. A speaker who receives this feedback can take measures such as changing his or her speaking manner based on the direction in which the voice input robot 100 is moving, or pausing speech so that the robot's movement start condition is satisfied.
[0088]
Third Embodiment In the first and second embodiments above, the voice information processing unit 121 transmits the voice received from the voice input unit 111 to the conference terminal 200 as it is. Instead, the voice information processing unit 121 may, as necessary, apply speaker-side noise removal or other noise cancellation processing to the received voice before transmitting it to the conference terminal 200.
[0089]
The noise on the speaker side here is, for example, a fan operation sound of a PC. When noise
cancellation processing is performed, necessary statistical processing and learning processing
may be performed using the voice data stored in the database 124.
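As one hedged illustration of such speaker-side noise removal, a simple noise gate against a stationary noise source like a PC fan might look like the sketch below. The patent does not specify any algorithm; the frame size, the minimum-frame noise-floor estimate, and the gating margin are all assumptions for illustration.

```python
def estimate_noise_floor(samples, frame=256):
    """Estimate a stationary noise level (e.g. a PC fan hum) as the
    minimum mean absolute amplitude over fixed-size frames, on the
    assumption that some frame contains noise only."""
    means = [sum(abs(s) for s in samples[i:i + frame]) / frame
             for i in range(0, len(samples) - frame + 1, frame)]
    return min(means)


def noise_gate(samples, floor, margin=2.0):
    """Zero out samples whose magnitude stays below margin * floor,
    passing louder (presumably speech) samples through unchanged."""
    thresh = margin * floor
    return [s if abs(s) > thresh else 0.0 for s in samples]
```

In practice the statistical and learning processing mentioned above, using the voice data stored in the database 124, could refine the floor estimate over time; a real system would more likely use spectral methods than this time-domain gate.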
[0090]
Fourth Embodiment In the prior art described in Patent Documents 1 to 3, the interactive robot estimates sound source positions in order to narrow down the direction of its subsequent operation, and sound sources outside the candidate motion directions are excluded from processing. In contrast, the first to third embodiments above perform no such selection, such as excluding sound sources based on estimated position or volume. This difference arises because the prior art assumes one-on-one interaction between the interactive robot and a speaker, whereas the present invention aims to collect the voices of a plurality of speakers. That is, in the present invention, no sound source position needs to be excluded from processing, so no selection of sound source positions is performed.
[0091]
However, a voice output means such as a loudspeaker that reproduces the listener side's sound at the speaker side is unnecessary for sound source position estimation, and may exceptionally be excluded from processing. This applies to all of the embodiments described above.
[0092]
Fifth Embodiment Although FIGS. 3 to 5 illustrated voice distribution maps represented in two-dimensional plane coordinates, the voice distribution may instead be mapped in three-dimensional space coordinates. For example, the volume of speech or the number of utterances can be expressed as height. In the latter case, the concentric circles serve as contour lines, producing an image in which height is represented.
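A voice distribution map carrying volume and utterance count as the "height" dimension could be accumulated as sketched below. The grid-cell size and the per-cell data layout are assumptions made for illustration; the patent does not prescribe a data structure.

```python
from collections import defaultdict


def build_voice_map(utterances, cell=0.5):
    """Accumulate utterances into a grid keyed by (x, y) cell index.

    utterances: iterable of (x, y, volume) detected speech events.
    Each cell records total volume and utterance count, either of
    which can then be rendered as height on a third axis, as
    suggested in the fifth embodiment.
    """
    grid = defaultdict(lambda: {"volume": 0.0, "count": 0})
    for x, y, vol in utterances:
        key = (int(x // cell), int(y // cell))
        grid[key]["volume"] += vol
        grid[key]["count"] += 1
    return dict(grid)
```

Rendering `count` as height yields the utterance-frequency map of FIG. 5, while rendering `volume` yields the volume-weighted map of FIG. 4, both now in relief rather than as contour circles.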
[0093]
Furthermore, the arrangement of the voice input unit 111 or the movement range of the voice input robot 100 may be extended into three dimensions, with appropriate moving means provided. For example, in a remote conference a speaker may open a laptop computer in front of himself or herself, and the laptop then acts as a wall that affects sound collection. Extending the arrangement of the voice input unit 111 and the movement range of the voice input robot 100 in the height direction as described above therefore enables more flexible sound collection.
[0094]
Sixth Embodiment In the first to fifth embodiments described above, the robot control unit 120 may be provided with a functional unit that estimates the self-position of the voice input robot 100. For example, in the self-propelled configuration of FIG. 2A, the self-position is estimated from values such as the rotation direction of the wheels, the number of rotations, and the wheel diameter. In the fixed movable configuration of FIG. 2B, it is estimated from values such as the arm length and the arm attitude (yaw and pitch angles). With self-position estimation, the voice distribution maps described in FIGS. 3 to 5 become maps in an absolute coordinate system rather than a relative coordinate system centered on the position of the voice input robot 100. The ideal position of the voice input robot 100 in absolute coordinates is determined from, for example, the maximum and minimum volumes of the sound sources and the frequency of utterances on the absolute coordinate axes, and can thus be determined quickly.
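For the self-propelled case, estimating self-position from wheel rotation direction, rotation count, and wheel diameter is standard differential-drive dead reckoning. A sketch follows; the wheel diameter and axle spacing are assumed values, and signed rotation counts encode the rotation direction.

```python
import math


def update_pose(x, y, theta, left_turns, right_turns,
                wheel_diameter=0.1, wheel_base=0.3):
    """Dead-reckoning pose update for a two-wheeled self-propelled
    robot (cf. FIG. 2A) from signed wheel rotation counts.
    Geometry values are illustrative assumptions (metres, radians)."""
    circ = math.pi * wheel_diameter
    dl = left_turns * circ           # distance rolled by the left wheel
    dr = right_turns * circ          # distance rolled by the right wheel
    d = (dl + dr) / 2.0              # distance moved by the robot centre
    dtheta = (dr - dl) / wheel_base  # change in heading
    # integrate along the average heading (small-step approximation)
    x += d * math.cos(theta + dtheta / 2.0)
    y += d * math.sin(theta + dtheta / 2.0)
    return x, y, theta + dtheta
```

One full forward turn of both wheels advances the robot by one wheel circumference; opposite-sign turns rotate it in place, leaving (x, y) unchanged. Feeding the resulting pose into the map-building step converts robot-relative sound source positions into the absolute coordinates discussed above.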
[0095]
Seventh Embodiment In the first to sixth embodiments above, for convenience of explanation, the voice input robot 100 was installed on the speaker side and the conference terminal 200 on the listener side. However, in two-way communication such as a remote conference, both parties speak, so the voice input robot 100 and the conference terminal 200 may be installed at both sites so that the environments are equivalent.
[0096]
FIG. 1 is a configuration diagram of a teleconference support system according to the first embodiment. FIG. 2 is a view showing an example of the external configuration of the voice input robot 100. FIG. 3 is an example of a voice distribution map created based only on sound source positions. FIG. 4 is an example of a voice distribution map created based on sound source positions and the volume of each sound source. FIG. 5 is an example of a voice distribution map created based on sound source positions, the volume of each sound source, and the utterance frequency of each sound source. FIG. 6 is an operation flow for improving the voice environment by operating the voice input robot 100 so as to achieve the voice condition desired on the listener side. FIG. 7 is a configuration diagram of a teleconference support system according to the second embodiment. FIG. 8 is a diagram showing configuration examples of the display unit 113. FIG. 9 is an operation flow in the second embodiment for operating the voice input robot 100 so as to achieve the voice condition desired by the listener and improve the voice environment.
Explanation of Reference Signs
[0097]
100 voice input robot, 110 robot body unit, 111 voice input unit, 112 operation unit, 113 display unit, 120 robot control unit, 121 voice information processing unit, 122 statistical processing unit, 123 operation determination unit, 124 database, 125 setting unit.