JP2013134312
Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2013134312
Abstract: An object of the present invention is to identify, among the sounds acquired by voice acquisition means, a sound that includes a collision sound generated by a collision of the device body. A voice analysis device according to the present invention includes a device body (30) and a strap (40) that is connected to the device body (30) and is used to hang the device body from the user's neck. A first microphone (11) is provided in the device body (30), and a second microphone (12) is provided at a position such that, when the strap (40) is hung on the neck, its distance from the device body (30) is greater than the distance from the device body (30) to the first microphone (11). The voice analysis device further includes an identification unit that compares the first sound pressure, which is the sound pressure of the sound acquired by the first microphone (11), with the second sound pressure, which is the sound pressure of the sound acquired by the second microphone (12), and identifies a sound whose first sound pressure exceeds the second sound pressure by a predetermined value or more. [Selected figure] Figure 2
Voice analyzer
[0001]
The present invention relates to a voice analysis device.
[0002]
Patent Document 1 discloses the following prior art.
03-05-2019
1
This prior art reduces the drop in the speech recognition rate caused by differences in microphone installation locations. The prior art is a speech recognition apparatus comprising: a microphone that picks up speech; an analysis unit that A/D-converts the speech signal obtained from that microphone and calculates its spectrum; a microphone that picks up noise; an analysis unit that A/D-converts the noise signal obtained from that microphone and calculates its spectrum; a voice compensation unit that corrects the speech-side spectrum by taking the difference between the speech-side spectrum and the noise-side spectrum; a registration processing unit that stores, as a standard pattern, the speech signal corrected during registration processing; a recognition processing unit that recognizes speech during voice recognition by comparing the corrected speech signal with the standard patterns stored in the registration processing unit; and a position setting unit that sets the optimum installation positions of the microphones.
[0003]
Further, Patent Document 2 discloses the following prior art. This prior art is a standard pattern creation method in which a plurality of speech patterns are averaged, characterized in that unstable components inevitably present in speech production are excluded from the averaging of the speech patterns. Specifically, a standard pattern is created by averaging only normal speech signals, excluding speech produced immediately after sudden noise and patterns in which the tail of the utterance is missing.
[0004]
JP-A-7-191688
JP-A-63-226691
[0005]
An object of the present invention is to identify, among the sounds acquired by voice acquisition means, a sound that includes a collision sound generated by a collision of the device body.
[0006]
The invention according to claim 1 is a voice analysis device comprising: a device body; a strap that is connected to the device body and is used to hang the device body from the user's neck; first voice acquisition means, provided in the strap or the device body, for acquiring sound; second voice acquisition means for acquiring sound, provided at a position such that, when the strap is hung on the neck, the distance of its sound wave propagation path from the device body is greater than the distance of the sound wave propagation path from the device body to the first voice acquisition means; and an identification unit that identifies a sound whose first sound pressure is larger than the second sound pressure by a predetermined value or more, based on a comparison between the first sound pressure, which is the sound pressure of the sound acquired by the first voice acquisition means, and the second sound pressure, which is the sound pressure of the sound acquired by the second voice acquisition means.
The invention according to claim 2 is the voice analysis device according to claim 1, characterized in that the first voice acquisition means is provided in the device body and the second voice acquisition means is provided in the strap.
The invention according to claim 3 is the voice analysis device according to claim 1 or 2, characterized in that the identification unit identifies, based on a comparison between the first sound pressure and the second sound pressure, whether a sound acquired by the first voice acquisition means and the second voice acquisition means is the speech of the user wearing the strap on his or her neck or the speech of another person. The invention according to claim 4 is the voice analysis device according to any one of claims 1 to 3, characterized in that the identification unit identifies, based on a comparison between the first sound pressure and the second sound pressure of sounds other than the sound already identified by the identification unit, whether a sound acquired by the first voice acquisition means and the second voice acquisition means is the speech of the user wearing the strap on his or her neck or the speech of another person.
[0007]
According to the invention of claim 1, it is possible to identify, among the sounds acquired by the voice acquisition means, a sound including a collision sound generated by a collision of the device body. According to the invention of claim 2, compared with the case where the present invention is not used, it is possible to identify such a sound more accurately among the sounds acquired by the voice acquisition means. According to the invention of claim 3, it is possible to identify whether the speaker is the wearer based on non-linguistic information of the acquired voice. According to the invention of claim 4, it is possible to identify whether the speaker is the wearer in a state where at least part of the noise has been removed from the voice acquired by the voice acquisition means.
[0008]
The figures are as follows. A diagram showing a configuration example of the speech analysis system according to the present embodiment. A diagram showing a configuration example of the terminal device in the present embodiment. A diagram showing the positional relationship between the microphones and the mouths (speaking parts) of the wearer and another person. A diagram showing the relationship between sound pressure (input volume) and the distance of the sound wave propagation path between a microphone and a sound source. A diagram showing a method of distinguishing the wearer's own speech from another person's speech. A diagram showing the relationship between the sound pressure at the microphones and a collision sound. A diagram showing the positional relationship between the device body and the microphones. A diagram showing the relationship between sound pressure (input volume) and the distance of the sound wave propagation path between a microphone and a sound source. A diagram showing the relationship between the method of identifying the speaker and the method of identifying that an acquired voice contains a collision sound. A flowchart showing the operation of the terminal device in the present embodiment. A diagram showing voice data at the time the terminal device in the present embodiment acquires speech containing a collision sound. A diagram showing a situation in which a plurality of wearers, each wearing the terminal device of the present embodiment, are having a conversation. A diagram showing an example of the speech information of each terminal device in the conversation situation of the preceding figure. A diagram showing an example of the functional configuration of the host device in the present embodiment.
[0009]
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. <System Configuration Example> FIG. 1 is a diagram showing a configuration example of the speech analysis system according to the present embodiment. As shown in FIG. 1, the system of the present embodiment comprises a terminal device 10 and a host device 20, connected via a wireless communication line. For the wireless communication line, a line conforming to an existing standard such as Wi-Fi (Wireless Fidelity) (registered trademark), Bluetooth (registered trademark), ZigBee (registered trademark), or UWB (Ultra Wideband) may be used. Although only one terminal device 10 appears in the illustrated example, the terminal device 10 is, as described in detail later, worn and used by each user, and in practice as many terminal devices 10 as there are users are prepared. Hereinafter, a user wearing the terminal device 10 is referred to as a wearer.
[0010]
The terminal device 10 includes, as voice acquisition means, at least one pair of microphones (a first microphone 11 and a second microphone 12) and amplifiers (a first amplifier 13 and a second amplifier 14). As processing means, the terminal device 10 includes a voice analysis unit 15 that analyzes the acquired voice and a data transmission unit 16 that transmits the analysis result to the host device 20; it further includes a power supply unit 17.
[0011]
The first microphone 11 and the second microphone 12 are arranged at positions that differ in the distance of the sound wave propagation path from the wearer's mouth (speaking part) (hereinafter simply referred to as the "distance"). Here, the first microphone 11 is placed at a position far from the wearer's mouth (speaking part) (e.g., about 35 cm), and the second microphone 12 at a position near it (e.g., about 10 cm). As the first microphone 11 and the second microphone 12 of the present embodiment, various existing types of microphone, such as dynamic or condenser microphones, may be used. In particular, an omnidirectional MEMS (Micro Electro Mechanical Systems) microphone is preferable.
[0012]
The first amplifier 13 and the second amplifier 14 amplify the electrical signals (audio signals) that the first microphone 11 and the second microphone 12 output in accordance with the acquired sound. As the first amplifier 13 and the second amplifier 14 of the present embodiment, existing operational amplifiers or the like may be used.
[0013]
The voice analysis unit 15 analyzes the voice signals output from the first amplifier 13 and the second amplifier 14, and determines whether the voice acquired by the first microphone 11 and the second microphone 12 was uttered by the wearer of the terminal device 10 or by another person. That is, the voice analysis unit 15 functions as an identification unit that identifies the speaker based on the voice acquired by the first microphone 11 and the second microphone 12. The specific processing for speaker identification will be described later.
[0014]
The data transmission unit 16 transmits the acquired data, including the analysis result from the voice analysis unit 15 and the ID of the terminal device 10, to the host device 20 via the above-described wireless communication line. Depending on the processing performed in the host device 20, the transmitted information may include, in addition to the analysis result, information such as the acquisition times and sound pressures of the voices acquired by the first microphone 11 and the second microphone 12. The terminal device 10 may also be provided with a data storage unit that stores the analysis results of the voice analysis unit 15, and the stored data for a certain period may be transmitted in a batch. The data may also be transmitted over a wired line.
[0015]
The power supply unit 17 supplies power to the first microphone 11, the second microphone 12,
the first amplifier 13, the second amplifier 14, the voice analysis unit 15, and the data
transmission unit 16 described above. As a power supply, for example, an existing power supply
such as a dry battery or a rechargeable battery is used. Further, the power supply unit 17
includes known circuits such as a voltage conversion circuit and a charge control circuit, as
necessary.
[0016]
The host device 20 includes a data receiving unit 21 that receives the data transmitted from the terminal device 10, a data storage unit 22 that stores the received data, a data analysis unit 23 that analyzes the stored data, and an output unit 24 that outputs the analysis result. The host device 20 is realized by, for example, an information processing device such as a personal computer. As described above, a plurality of terminal devices 10 are used in the present embodiment, and the host device 20 receives data from each of them.
[0017]
The data receiving unit 21 corresponds to the above-described wireless communication line; it receives the data from each terminal device 10 and passes it to the data storage unit 22. The data storage unit 22 is realized by, for example, a storage device such as the magnetic disk device of a personal computer, and stores the received data acquired from the data receiving unit 21 for each speaker. The speaker is identified by collating the terminal ID transmitted from the terminal device 10 with the speaker names and terminal IDs registered in the host device 20 in advance. The wearer's name may also be transmitted from the terminal device 10 instead of the terminal ID.
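The collation described above amounts to a table lookup. A minimal sketch, with illustrative names and entries:

```python
# Speaker names and terminal IDs registered in the host device 20 in advance
# (the entries are illustrative).
registered_speakers = {
    "terminal-001": "Speaker A",
    "terminal-002": "Speaker B",
}

def speaker_for(terminal_id):
    """Collate a received terminal ID with the pre-registered speaker name."""
    return registered_speakers.get(terminal_id, "unknown")

print(speaker_for("terminal-001"))  # "Speaker A"
print(speaker_for("terminal-999"))  # "unknown": no registration for this ID
```

Storing the received data "for each speaker" then reduces to keying the stored records by the name this lookup returns.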
[0018]
The data analysis unit 23 is realized by, for example, the program-controlled CPU of a personal computer, and analyzes the data stored in the data storage unit 22. The specific analysis content and method can vary according to the purpose and mode of use of the system of the present embodiment. For example, the frequency of conversation between wearers of terminal devices 10 and each wearer's tendencies in conversation partners may be analyzed, or the relationships between interlocutors may be inferred from information such as the length and sound pressure of each utterance in a conversation.
[0019]
The output unit 24 outputs the analysis result produced by the data analysis unit 23, or produces output based on the analysis result. The output means can take various forms, such as on-screen display, print output by a printer, or audio output, depending on the purpose and mode of use of the system and on the content and format of the analysis result.
[0020]
<Example of Configuration of Terminal Device> FIG. 2 is a diagram showing an example of the configuration of the terminal device 10. As described above, the terminal device 10 is worn and used by each user. To make it wearable, the terminal device 10 of the present embodiment comprises, as shown in FIG. 2, a device body 30 and a strap 40 connected to the device body 30. In the illustrated configuration, the user puts the strap 40 around the neck and hangs the device body 30 from the neck.
[0021]
The device body 30 comprises a thin rectangular parallelepiped case 31, formed of metal, resin, or the like, which houses at least the circuits realizing the first amplifier 13, the second amplifier 14, the voice analysis unit 15, the data transmission unit 16, and the power supply unit 17, as well as the power supply (battery) of the power supply unit 17. The case 31 may be provided with a pocket into which an ID card or the like displaying ID information, such as the name or affiliation of the wearer, can be inserted. Such ID information may also be printed on the surface of the case 31 itself, or a sticker on which the ID information is written may be attached.
[0022]
The strap 40 is provided with the first microphone 11 and the second microphone 12 (hereinafter referred to as the microphones 11 and 12 when they need not be distinguished). The microphones 11 and 12 are connected to the first amplifier 13 and the second amplifier 14 housed in the device body 30 by cables (electric wires or the like) passing through the inside of the strap 40. Various existing materials, such as leather, synthetic leather, natural fibers such as cotton, synthetic fibers such as resin, and metal, may be used for the strap 40. The strap may also be given a coating of silicone resin, fluororesin, or the like.
[0023]
The strap 40 has a tubular structure, and the microphones 11 and 12 are housed inside it. Providing the microphones 11 and 12 inside the strap 40 prevents them from being damaged or soiled and keeps conversation partners from becoming conscious of their presence. The first microphone 11, which is placed at a position far from the wearer's mouth (speaking part), may instead be incorporated in the case 31 and thus provided in the device body 30. In the present embodiment, the case where the first microphone 11 is provided in the strap 40 will be described as an example.
[0024]
Referring to FIG. 2, the first microphone 11, which is an example of the first voice acquisition means, is provided at the end of the strap 40 connected to the device body 30 (for example, within 10 cm of the center of the device body 30). As a result, when the wearer puts the strap 40 around the neck and lets the device body 30 hang down, the first microphone 11 is located roughly 30 cm to 40 cm from the wearer's mouth (speaking part). When the first microphone 11 is provided in the device body 30 instead, the distance from the wearer's mouth (speaking part) to the first microphone 11 is approximately the same.
[0025]
The second microphone 12, which is an example of the second voice acquisition means, is provided at a position away from the end of the strap 40 connected to the device body 30 (for example, about 25 cm to 35 cm from the center of the device body 30). Thus, when the wearer hangs the strap 40 around the neck and lets the device body 30 hang down, the second microphone 12 is located at the wearer's neck (for example, at a position touching the collarbone), about 10 cm to 20 cm from the wearer's mouth (speaking part).
[0026]
The terminal device 10 of the present embodiment is not limited to the configuration shown in FIG. 2. For example, the positions of the microphones 11 and 12 may be specified so that the distance (of the sound wave propagation path) from the first microphone 11 to the wearer's mouth (speaking part) is roughly several times the distance (of the sound wave propagation path) from the second microphone 12 to the wearer's mouth (speaking part). The microphones 11 and 12 are also not limited to being provided on the strap 40 as described above, and may be attached to the wearer by various methods. For example, the first microphone 11 and the second microphone 12 may each be fixed individually to clothing with a pin or the like. A dedicated attachment designed to fix the first microphone 11 and the second microphone 12 in the desired positional relationship may also be prepared and worn.
[0027]
Further, the device body 30 is not limited to the configuration shown in FIG. 2, in which it is connected to the strap 40 and hung from the wearer's neck, as long as it can be carried easily. For example, instead of a strap as in the present embodiment, it may be attached to clothing or the body with a clip or a belt, or simply carried in a pocket or the like. The functions of receiving, amplifying, and analyzing the audio signals from the microphones 11 and 12 may also be realized by a mobile phone or another existing portable electronic information terminal. Note, however, that when the first microphone 11 is provided in the device body 30, the positional relationship between the first microphone 11 and the second microphone 12 must be maintained as described above, so the position of the device body 30 while carried is specified.
[0028]
Furthermore, the microphones 11 and 12 may be connected to the device body 30 (or the voice analysis unit 15) by wireless communication rather than by cable. Although the first amplifier 13, the second amplifier 14, the voice analysis unit 15, the data transmission unit 16, and the power supply unit 17 are housed in the single case 31 in the above configuration example, they may be divided among a plurality of cases. For example, the power supply unit 17 need not be housed in the case 31, and the device may be used connected to an external power supply.
[0029]
<Identification of Speaker (Self or Other) Based on Non-Linguistic Information of Acquired Voice> Next, the method of identifying a speaker in the present embodiment will be described. The system according to the present embodiment uses the information of the voices acquired by the two microphones 11 and 12 provided in the terminal device 10 to discriminate between the voice of the wearer of the terminal device 10 and the voice of another person. In other words, the present embodiment identifies the speaker of the acquired voice as self or other. Furthermore, in the present embodiment, the speaker is identified not from linguistic information obtained by morphological analysis or dictionary lookup, but from non-linguistic information such as sound pressure (the input volume at the microphones 11 and 12). That is, the speaker of the voice is identified from the speaking situation specified by the non-linguistic information, not from the speech content specified by linguistic information.
[0030]
As described with reference to FIGS. 1 and 2, in the present embodiment, the first microphone 11 of the terminal device 10 is placed at a position far from the wearer's mouth (speaking part), and the second microphone 12 at a position close to it. That is, when the wearer's mouth (speaking part) is regarded as a sound source, the distance between the first microphone 11 and the sound source and the distance between the second microphone 12 and the sound source differ greatly: specifically, the former is about 1.5 to 4 times the latter. Here, the sound pressure of the voice acquired at the microphones 11 and 12 attenuates as the distance between microphone and sound source increases (distance attenuation). Therefore, for the wearer's own speech, the sound pressure acquired at the first microphone 11 and the sound pressure acquired at the second microphone 12 differ greatly.
[0031]
On the other hand, consider the case where the mouth (speaking part) of a person other than the wearer is the sound source. Since that other person is at some distance from the wearer, the distance between the first microphone 11 and the sound source and the distance between the second microphone 12 and the sound source do not differ greatly. Some difference between the two may arise depending on the other person's position relative to the wearer, but the distance between the first microphone 11 and the sound source will not be several times the distance between the second microphone 12 and the sound source, as it is when the wearer's mouth (speaking part) is the sound source. Therefore, for another person's speech, the sound pressure acquired at the first microphone 11 and the sound pressure acquired at the second microphone 12 do not differ greatly, as they do for the wearer's own speech.
[0032]
FIG. 3 is a diagram showing the positional relationship between the mouths (speaking parts) of the wearer and another person and the microphones 11 and 12. In the relationship shown in FIG. 3, the distance between the sound source a, the wearer's mouth (speaking part), and the first microphone 11 is La1, and the distance between the sound source a and the second microphone 12 is La2. The distance between the sound source b, the other person's mouth (speaking part), and the first microphone 11 is Lb1, and the distance between the sound source b and the second microphone 12 is Lb2. In this case, the following relationships hold:

La1 > La2 (La1 ≈ 1.5 × La2 to 4 × La2)
Lb1 ≈ Lb2
[0033]
FIG. 4 is a diagram showing the relationship between sound pressure (input volume) and the distance between the microphones 11 and 12 and a sound source. As described above, sound pressure attenuates with the distance between microphone and sound source. In FIG. 4, comparing the sound pressure (first sound pressure) Ga1 at distance La1 with the sound pressure (second sound pressure) Ga2 at distance La2, the sound pressure Ga2 is about four times the sound pressure Ga1. On the other hand, since the distances Lb1 and Lb2 are approximately equal, the sound pressure Gb1 at distance Lb1 and the sound pressure Gb2 at distance Lb2 are also approximately equal. Therefore, the present embodiment uses this difference in sound pressure ratios to discriminate between the wearer's own speech and another person's speech in the acquired voice. Although the distances Lb1 and Lb2 are 60 cm in the example shown in FIG. 4, this merely illustrates that the sound pressures Gb1 and Gb2 are almost equal; the distances Lb1 and Lb2 are not limited to the values shown in the figure.
[0034]
FIG. 5 is a diagram showing a method of distinguishing the wearer's own speech from another person's speech. As described with reference to FIG. 4, for the wearer's own speech, the sound pressure Ga2 at the second microphone 12 is several times (for example, about four times) the sound pressure Ga1 at the first microphone 11. For another person's speech, the sound pressure Gb2 at the second microphone 12 is approximately equal to (about one times) the sound pressure Gb1 at the first microphone 11. Therefore, in the present embodiment, a threshold (first threshold) is set on the ratio of the sound pressure at the second microphone 12 to the sound pressure at the first microphone 11. The first threshold is set to a value between the sound pressure ratio for the wearer's own speech and the sound pressure ratio for another person's speech. A voice whose sound pressure ratio is larger than the first threshold is judged to be the wearer's own speech, and a voice whose sound pressure ratio is smaller than the first threshold is judged to be another person's speech. In the example shown in FIG. 5, the first threshold is 2: the sound pressure ratio Ga2/Ga1 exceeds the first threshold of 2, so the voice is judged to be the wearer's own speech, while the sound pressure ratio Gb2/Gb1 is smaller than the first threshold of 2, so the voice is judged to be another person's speech.
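The decision rule of FIG. 5 can be sketched in a few lines. The function name is illustrative; the first threshold of 2 is taken from the example in the text:

```python
FIRST_THRESHOLD = 2.0  # between the wearer's ratio (~4) and another person's (~1)

def identify_speaker(ga1, ga2, threshold=FIRST_THRESHOLD):
    """Classify an utterance from the sound pressures at the two microphones.

    ga1: sound pressure at the first microphone (far from the wearer's mouth)
    ga2: sound pressure at the second microphone (near the wearer's mouth)
    """
    ratio = ga2 / ga1
    return "wearer" if ratio > threshold else "other"

print(identify_speaker(ga1=1.0, ga2=4.0))  # ratio 4 > 2, so "wearer"
print(identify_speaker(ga1=1.0, ga2=1.1))  # ratio ~1 < 2, so "other"
```

Because the rule uses only a ratio of sound pressures, it depends on the speaking situation (the geometry), not on the speech content, matching the non-linguistic identification described in [0029].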
[0035]
<Identification of an Acquired Voice Containing a Collision Sound> As described above, the user of the terminal device 10 puts the strap 40 around the neck and hangs the device body 30 from the neck. While the user carries the terminal device 10 from the neck, the terminal device 10 may swing, for example when the user moves, and the device body 30 may collide with other objects. When the device body 30 collides with another object, a collision sound is generated: for example, when the device body 30 strikes a part of the user's body, a desk, or an ID card, mobile phone, or other item that the user carries from the neck besides the terminal device 10. This collision sound is then picked up by the microphones 11 and 12 together with the voices of the wearer and others.
[0036]
When the microphones 11 and 12 pick up the collision sound generated by the device body 30 striking another object, the wearer's own speech in the acquired voice may be misrecognized as another person's speech. The relationship between acquiring a collision sound and misrecognizing the wearer's speech as another person's speech is described below. FIG. 6 is a diagram showing the relationship between the sound pressure at the microphones 11 and 12 and a collision sound. Specifically, FIG. 6(a) shows the change in sound pressure at the microphones 11 and 12 when they acquire speech containing a collision sound, and FIG. 6(b) shows the change in the sound pressure ratio of the microphones 11 and 12 shown in FIG. 6(a).
[0037]
In the terminal device 10 of the present embodiment, the collision sound acquired by the first microphone 11 is larger than the collision sound acquired by the second microphone 12. To explain further, the collision sound is generated in a short time (for example, about 0.3 ms) compared with speech. In FIG. 6(a), comparing the average sound pressure (average gain) Ga1 of the first microphone 11 with the average sound pressure (average gain) Ga2 of the second microphone 12 while the terminal device 10 is colliding with another object (within the dash-dotted lines), the sound pressure Ga1 is larger than the sound pressure Ga2. This is because the first microphone 11 is closer than the second microphone 12 to the device body 30, which generates the collision sound. Further, in FIG. 6(b), the ratio of the average sound pressure Ga2 of the second microphone 12 to the average sound pressure Ga1 of the first microphone 11 while the terminal device 10 is colliding with another object (within the dash-dotted lines) is smaller than the sound pressure ratio outside the collision.
[0038]
Now, it will be described in more detail that the collision sound acquired by the first microphone
11 is larger than the collision sound acquired by the second microphone 12. FIG. 7 is a diagram
showing the positional relationship between the device body 30 and the microphones 11 and 12.
In the relationship shown in FIG. 7, the distance between the sound source S which is the center
of the apparatus body 30 and the first microphone 11 is Ls1, and the distance between the
sound source S and the second microphone 12 is Ls2. Then, as described above, for example, the
first microphone 11 is provided at a position within 10 cm from the center of the device body 30,
and the second microphone 12 is provided at a position around 25 cm to 35 cm from the center
of the device body 30. In this case, the following relationship holds: Ls1 < Ls2 (Ls2 ≈ 2.5 × Ls1 to 3.5 × Ls1). When the first microphone 11 is provided in the apparatus main body 30 itself, the distance Ls1 is further reduced.
[0039]
FIG. 8 is a diagram showing the relationship between the distance of the sound wave propagation
path between the microphones 11 and 12 and the sound source and the sound pressure (input
sound volume). As described above, the sound pressure attenuates in accordance with the
distance between the microphones 11 and 12 and the sound source. When sound pressure Gs1
in the case of distance Ls1 and sound pressure Gs2 in the case of distance Ls2 are compared in
FIG. 8, sound pressure Gs2 is about 0.3 times the sound pressure Gs1. When the first microphone
11 is provided in the device body 30, the distance Ls1 is further reduced, and the sound pressure
Gs1 is further increased accordingly. Therefore, in this case, the sound pressure Gs2 is further
smaller than 0.3 times the sound pressure Gs1.
[0040]
FIG. 9 is a diagram showing the relationship between the method of identifying the speaker and the method of identifying that the acquired voice contains a collision sound. As shown in FIG. 9, in the present embodiment, a sound whose sound pressure ratio is larger than the first threshold (that is, a sound for which the sound pressure Ga2 of the second microphone 12 is more than twice the sound pressure Ga1 of the first microphone 11) is identified as the uttered voice of the wearer. However, even in a section where the wearer is speaking, when the sound pressure Ga1 of the first microphone 11 becomes large under the influence of a collision sound, the sound pressure ratio becomes smaller than the first threshold, and the section can be identified as another person's utterance. In addition, when the wearer speaks, gestures are often involved, so a collision sound is more likely to be generated at the device main body 30. In this case, therefore, sections in which the wearer is actually speaking are frequently identified as sections uttered by another person.
[0041]
Therefore, in the present embodiment, the following configuration is adopted to determine whether or not the acquired voice contains a collision sound, thereby suppressing the influence of the collision sound on the discrimination between the speech of the wearer and the speech of another person. Specifically, in the present embodiment, a threshold (second threshold) is set for the ratio of the sound pressure of the second microphone 12 to the sound pressure of the first microphone 11.
[0042]
This utilizes the fact that the ratio of the sound pressure of the second microphone 12 to the sound pressure of the first microphone 11 tends to differ between an acquired voice including a collision sound and an acquired voice not including one. To explain further, as described with reference to FIG. 8, when a collision sound occurs, the sound pressure Gs2 of the second microphone 12 is a fraction of the sound pressure Gs1 of the first microphone 11 (for example, about 0.3 times). On the other hand, as described above, with respect to the wearer's own speech, the sound pressure Ga2 of the second microphone 12 is several times (for example, about 4 times) the sound pressure Ga1 of the first microphone 11, and with respect to another person's speech, the sound pressure Gb2 of the second microphone 12 is approximately equal to the sound pressure Gb1 of the first microphone 11 (for example, about 1 time).
[0043]
Therefore, an appropriate value between the sound pressure ratio of another person's speech
and the sound pressure ratio of the acquired sound when the collision sound is generated is set
as the second threshold. Then, the sound whose sound pressure ratio is smaller than the second
threshold is determined as the acquired sound including the collision sound, and the sound
whose sound pressure ratio is larger than the second threshold is determined as the acquired
sound not including the collision sound. In the present embodiment, when it is determined that
the acquired voice contains a collision sound, the discrimination between the speech of the
wearer and the speech of another person is not performed. In the example shown in FIG. 9, the
second threshold is 0.4, and the sound pressure ratio Ga2 / Ga1 and the sound pressure ratio
Gb2 / Gb1 are larger than the second threshold 0.4, so it is determined that the sound does not
include collision sound. Since the sound pressure ratio Gs2 / Gs1 is smaller than the second
threshold value 0.4, it is determined that the sound contains a collision sound. The first threshold
and the second threshold described above are merely examples, and can be changed according to
the environment in which the system of the present embodiment is used.
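The two-threshold decision rule above can be written as a short function. The threshold values 2 and 0.4 come from the text (and, as the text notes, are merely examples); the function itself is only an illustrative sketch:

```python
FIRST_THRESHOLD = 2.0   # wearer's speech if the ratio exceeds this (from FIG. 9)
SECOND_THRESHOLD = 0.4  # collision sound if the ratio falls below this (from FIG. 9)

def identify(g1, g2):
    """Identify a sound from the average sound pressures of the two microphones.

    g1: average sound pressure at the first microphone (near the device body)
    g2: average sound pressure at the second microphone (near the mouth)
    """
    ratio = g2 / g1
    if ratio > FIRST_THRESHOLD:
        return "wearer"      # wearer's own uttered voice
    if ratio > SECOND_THRESHOLD:
        return "other"       # another person's uttered voice
    return "collision"       # acquired voice including a collision sound

print(identify(1.0, 4.0))  # wearer    (ratio about 4, as for Ga2/Ga1)
print(identify(1.0, 1.0))  # other     (ratio about 1, as for Gb2/Gb1)
print(identify(1.0, 0.3))  # collision (ratio about 0.3, as for Gs2/Gs1)
```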
[0044]
Now, the sound acquired by the microphones 11 and 12 includes, in addition to speech sounds and collision sounds, sounds of the environment in which the terminal device 10 is used (environmental sounds such as the operating sound of air conditioning and footsteps accompanying the wearer's walking). The relationship of the distances between the sound source of the environmental sound and the microphones 11 and 12 is similar to that of another person's speech. That is, according to the example shown in FIGS. 4 and 5, the distance between the noise source c and the first microphone 11 is Lc1, and the distance between the noise source c and the second microphone 12 is Lc2; the distance Lc1 and the distance Lc2 approximate each other. The sound pressure ratio Gc2/Gc1 in the sound acquired by the microphones 11 and 12 therefore becomes smaller than the first threshold of 2. Such environmental sound, however, is separated from the uttered voice and removed by filtering processing using an existing technique such as a band-pass filter or a gain filter.
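The embodiment does not specify the filters beyond "a band-pass filter, a gain filter, and the like." As a hedged stand-in, a crude band-pass effect can be built from the difference of two moving averages (this is not the embodiment's actual filter, only a minimal sketch of the idea of removing out-of-band components):

```python
def moving_average(signal, window):
    # Running mean over the last `window` samples.
    out, acc = [], 0.0
    for i, s in enumerate(signal):
        acc += s
        if i >= window:
            acc -= signal[i - window]
        out.append(acc / min(i + 1, window))
    return out

def crude_bandpass(signal, short=3, long=15):
    # Difference of two moving averages: the long window removes DC offset
    # and slow drift, the short window smooths high-frequency noise.
    smooth = moving_average(signal, short)
    drift = moving_average(signal, long)
    return [a - b for a, b in zip(smooth, drift)]

# A constant component (e.g. a steady background level) is removed entirely:
print(max(abs(v) for v in crude_bandpass([0.5] * 50)))  # 0.0
```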
[0045]
<Operation Example of Terminal Device> FIG. 10 is a flowchart showing the operation of the
terminal device 10 in the present embodiment. As shown in FIG. 10, when the microphones 11 and 12 of the terminal device 10 acquire a sound, electric signals (audio signals) corresponding to the acquired sound are transmitted from the microphones 11 and 12 to the first amplifier 13 and the second amplifier 14 (step 1001). When the first amplifier 13 and the second amplifier 14 acquire the audio signals from the microphones 11 and 12, they amplify the signals and send them to the voice analysis unit 15 (step 1002).
[0046]
The voice analysis unit 15 performs filtering processing on the signals amplified by the first amplifier 13 and the second amplifier 14 to remove the environmental sound component from the signals (step 1003). Next, for the signals from which the noise component has been removed, the voice analysis unit 15 determines the average sound pressure of the voice acquired by each of the microphones 11 and 12 in fixed time units (for example, several tenths of a second to several hundredths of a second) (step 1004).
[0047]
If there is a gain in the average sound pressure of each of the microphones 11 and 12 determined in step 1004 (Yes in step 1005), the voice analysis unit 15 determines that there is a voice (an utterance has been made), and then determines the ratio (sound pressure ratio) between the average sound pressure at the first microphone 11 and the average sound pressure at the second microphone 12 (step 1006). If the sound pressure ratio obtained in step 1006 is larger than the first threshold (Yes in step 1007), the voice analysis unit 15 determines that the uttered voice is the voice of the wearer's own speech (step 1008). If the sound pressure ratio obtained in step 1006 is smaller than the first threshold (No in step 1007) and larger than the second threshold (Yes in step 1009), the voice analysis unit 15 determines that the uttered voice is a voice uttered by another person (step 1010). Furthermore, if the sound pressure ratio obtained in step 1006 is smaller than the first threshold (No in step 1007) and also smaller than the second threshold (No in step 1009), the voice analysis unit 15 determines that the acquired sound includes a collision sound, and recognizes the acquired sound including the collision sound as noise. Note that, in the present embodiment, when it is determined that the acquired sound includes a collision sound, as described above, the voice analysis unit 15 does not distinguish between the speech of the wearer and the speech of another person.
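The flow of steps 1004 to 1010 can be condensed into one decision function over a fixed analysis window. Using RMS as the "average sound pressure" and the particular threshold values are assumptions made for illustration:

```python
import math

def average_gain(samples):
    # RMS over one fixed-length analysis window (step 1004); used here as a
    # stand-in for the "average sound pressure".
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def analyze_window(mic1, mic2, first_thr=2.0, second_thr=0.4):
    g1 = average_gain(mic1)
    g2 = average_gain(mic2)
    if g1 == 0.0:                      # step 1005: no gain -> no utterance
        return "silence"
    ratio = g2 / g1                    # step 1006
    if ratio > first_thr:
        return "wearer"                # step 1008
    if ratio > second_thr:
        return "other"                 # step 1010
    return "collision"                 # No at both steps 1007 and 1009 -> noise

print(analyze_window([0.0] * 4, [0.0] * 4))  # silence
print(analyze_window([0.1] * 4, [0.4] * 4))  # wearer    (ratio 4)
print(analyze_window([0.4] * 4, [0.1] * 4))  # collision (ratio 0.25)
```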
[0048]
Further, when there is no gain in the average sound pressure of each of the microphones 11 and 12 obtained in step 1004 (No in step 1005), the voice analysis unit 15 determines that there is no voice (no utterance has been made) (step 1011).
[0049]
Thereafter, the voice analysis unit (identification unit) 15 transmits the information obtained in the processing of steps 1004 to 1011 (the presence or absence of an utterance and information on the speaker) to the host device 20 via the data transmission unit 16 as an analysis result (step 1012).
The length of time of utterance of each speaker (the wearer or others), the value of the gain of
the average sound pressure, and other additional information may be transmitted to the host
device 20 together with the analysis result. At this time, if it is determined No at Step 1009, that
is, if it is determined that the acquired voice contains a collision sound, the voice analysis unit 15
transmits the analysis result without identifying the speaker.
[0050]
In the present embodiment, by comparing the sound pressure of the first microphone 11 with the sound pressure of the second microphone 12, it is determined whether an uttered voice is the wearer's own speech or another person's speech. However, the identification of the speaker according to the present embodiment may be performed based on nonverbal information extracted from the audio signals themselves acquired by the microphones 11 and 12, and is not limited to the comparison of sound pressures.
[0051]
For example, the voice acquisition time (output time of the voice signal) at the first microphone 11 may be compared with the voice acquisition time at the second microphone 12. In this case, for the wearer's own uttered voice, the difference between the distance from the wearer's mouth (speaking part) to the first microphone 11 and the distance from the mouth to the second microphone 12 is large, so a certain difference (time difference) occurs in the voice acquisition times. On the other hand, for another person's speech, the difference between the distance from the speaker's mouth (speaking part) to the first microphone 11 and the distance from the mouth to the second microphone 12 is small, so the time difference between the voice acquisition times is smaller than in the case of the wearer's own uttered voice. Therefore, a first threshold may be set for the time difference between the voice acquisition times; when the time difference is larger than the first threshold, it is determined that the wearer is speaking, and when the time difference is smaller than the first threshold, it is determined that another person is speaking.
[0052]
In addition, when the voice acquisition time at the first microphone 11 and the voice acquisition time at the second microphone 12 are compared, the difference between the distance from the device main body 30, which generates the collision sound, to the first microphone 11 and the distance from the device main body 30 to the second microphone 12 is large, so a certain difference (time difference) occurs in the voice acquisition times of an acquired voice including a collision sound. To explain further, the voice acquisition time of the first microphone 11 is earlier than that of the second microphone 12. On the other hand, when no collision sound is included (the wearer's own uttered voice or another person's uttered voice), the voice acquisition time of the first microphone 11 and that of the second microphone 12 are substantially the same. Therefore, a second threshold may be set for the time difference between the voice acquisition times; a voice whose time difference is larger than the second threshold is determined to be an acquired voice including a collision sound, and a voice whose time difference is smaller than the second threshold is determined to be an acquired voice not including a collision sound.
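The arrival-time comparison can be illustrated with propagation path lengths and the speed of sound. The specific distances and the 343 m/s figure are illustrative assumptions, not values from the embodiment:

```python
SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

def arrival_time_diff(path1_m, path2_m):
    # Absolute difference in arrival time at the two microphones, in seconds.
    return abs(path1_m - path2_m) / SPEED_OF_SOUND

# Collision sound from the device body: ~5 cm to the first microphone,
# ~30 cm to the second microphone (illustrative).
dt_collision = arrival_time_diff(0.05, 0.30)

# Another speaker ~1.5 m away: both propagation paths are nearly equal.
dt_other = arrival_time_diff(1.50, 1.45)

# A threshold placed between the two time differences separates the cases.
print(dt_collision > dt_other)  # True
```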
[0053]
<Operation Example of Voice Analysis Unit 15 Acquiring Voice Including Collision Sound> Here, an operation example of the voice analysis unit 15 when a speech voice including a collision sound is obtained will be described. FIG. 11 is a diagram showing voice data when the terminal device 10 according to the present embodiment acquires a speech voice including a collision sound. Specifically, FIG. 11(a) is a diagram showing the change in the microphone input of the microphones 11 and 12 in the case where, unlike the present embodiment, it is not identified that a collision sound is included, and FIG. 11(b) is a diagram showing the change in the microphone input of the microphones 11 and 12 when it is identified that a collision sound is included. Further, in FIGS. 11(a) and 11(b), a section identified as the wearer's uttered voice is represented as microphone input 1, and a section identified as another person's uttered voice is represented as microphone input 0.
[0054]
First, unlike the system of the present embodiment, a case will be described in which the voice analysis unit 15 does not identify whether a voice is an acquired voice including a collision sound. In this case, when the voice analysis unit 15 analyzes the voice acquired when a collision sound is generated in a section where the wearer is speaking, the analysis result is as shown in FIG. 11(a). As shown in FIG. 11(a), when it is not determined whether the acquired voice includes a collision sound, sections identified as another person's voice (sections in which the microphone input is 0) occur within the section where the wearer is speaking (see the symbol α in the drawing), because the sound pressure Ga1 of the first microphone 11 becomes large under the influence of the collision sound. To explain further, although these are sections where the wearer is speaking, they are identified as sections where the wearer does not speak (silent sections).
[0055]
On the other hand, when the voice analysis unit 15 of the present embodiment identifies whether a voice is an acquired voice including a collision sound, the analysis result is as shown in FIG. 11(b). That is, as shown in FIG. 11(b), the voice acquired in the section where the wearer is speaking (see the symbol α in the figure) is recognized as the wearer's voice without being affected by the collision sound. Here, as described above, the voice analysis unit 15 of the present embodiment does not distinguish between the wearer's own uttered voice and another person's uttered voice when it identifies an acquired voice as including a collision sound. For such a voice, the identification result (the wearer's own uttered voice or another person's uttered voice) of the acquired voice obtained immediately before the voice identified as including the collision sound is continued.
[0056]
<Example of Application of System and Function of Host Device> In the system of the present
embodiment, information related to speech (hereinafter referred to as speech information)
obtained as described above by a plurality of terminal devices 10 is collected in the host device
20. The host device 20 uses the information obtained from the plurality of terminal devices 10 to
perform various analyses in accordance with the purpose and mode of use of the system. Hereinafter, an example in which this embodiment is used as a system for acquiring information about the communication of a plurality of wearers will be explained.
[0057]
FIG. 12 is a view showing a state in which a plurality of wearers wearing the terminal device 10
of the present embodiment are in conversation. FIG. 13 is a diagram showing an example of the
utterance information of each of the terminal devices 10A and 10B in the conversation situation
of FIG. As shown in FIG. 12, consider the case where two wearers A and B, who respectively wear
the terminal device 10, are in conversation. At this time, the voice recognized as the utterance of
the wearer in the terminal device 10A of the wearer A is recognized as the utterance of the other
person in the terminal device 10B of the wearer B. On the contrary, the voice recognized as the
speech of the wearer in the terminal device 10B is recognized as the speech of the other person
in the terminal device 10A.
[0058]
Speech information is sent to the host device 20 independently from the terminal device 10A and the terminal device 10B. At this time, as shown in FIG. 13, the utterance information acquired from the terminal device 10A and that acquired from the terminal device 10B are opposite to each other in the identification result of the speaker (the wearer and the other person), but the information indicating the utterance status, such as the length of utterance time and the timing at which the speaker is switched, approximates. Therefore, the host device 20 in this application example compares the information acquired from the terminal device 10A with the information acquired from the terminal device 10B, determines that these pieces of information indicate the same utterance situation, and recognizes that the wearer A and the wearer B are in conversation. Here, as the information indicating the utterance status, at least time information on the utterances is used, such as the length of the utterance time in each utterance of each speaker mentioned above, the start time and end time of each utterance, and the time (timing) at which the speaker is switched. Note that only part of this time information on the utterances may be used to determine the utterance status of a particular conversation, or other information may be used additionally.
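One hedged way to sketch the detection of the correspondence shown in FIG. 13: align the two terminals' per-interval speaker labels over the same clock period and check that one device's "wearer" intervals line up with the other's "other" intervals. The interval representation and the 80% agreement tolerance are assumptions, not part of the embodiment:

```python
# Label pairs that are consistent with two devices hearing the same conversation.
MIRRORED = {("wearer", "other"), ("other", "wearer"), ("silence", "silence")}

def same_conversation(labels_a, labels_b, tolerance=0.8):
    # labels_a / labels_b: speaker labels per fixed time interval, as reported
    # by two terminal devices over the same period.
    if len(labels_a) != len(labels_b) or not labels_a:
        return False
    hits = sum((a, b) in MIRRORED for a, b in zip(labels_a, labels_b))
    return hits / len(labels_a) >= tolerance

a = ["wearer", "wearer", "silence", "other", "other"]
b = ["other", "other", "silence", "wearer", "wearer"]
print(same_conversation(a, b))                # True: mirrored labels
print(same_conversation(a, ["silence"] * 5))  # False: unrelated device
```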
[0059]
FIG. 14 is a diagram showing an example of a functional configuration of the host device 20 in the present application example. In this application example, the host device 20 includes a conversation information detection unit 201 that detects, from among the speech information acquired from the terminal devices 10, the speech information (hereinafter, conversation information) from the terminal devices 10 of wearers who are having a conversation, and a conversation information analysis unit 202 that analyzes the detected conversation information. The conversation information detection unit 201 and the conversation information analysis unit 202 are realized as functions of the data analysis unit 23.
[0060]
Speech information is also sent to the host device 20 from terminal devices 10 other than the
terminal device 10A and the terminal device 10B. The speech information from each of the
terminal devices 10 received by the data receiving unit 21 is stored in the data storage unit 22.
Then, the conversation information detection unit 201 of the data analysis unit 23 reads out the
speech information of each terminal device 10 stored in the data storage unit 22, and detects
conversation information which is speech information related to a specific conversation.
[0061]
As shown in FIG. 13 described above, characteristic correspondences different from the speech information of the other terminal devices 10 are extracted from the speech information of the terminal device 10A and the speech information of the terminal device 10B. The conversation information detection unit 201 compares the utterance information acquired from each of the terminal devices 10 stored in the data storage unit 22, detects, from among the utterance information acquired from the plurality of terminal devices 10, the utterance information having the correspondence described above, and identifies it as utterance information pertaining to the same conversation. Since utterance information is sent to the host device 20 from the plurality of terminal devices 10 as needed, the conversation information detection unit 201 performs the above processing while sequentially separating the utterance information into fixed time units, for example, and determines whether conversation information pertaining to a specific conversation is included.
[0062]
The condition for the conversation information detection unit 201 to detect conversation
information related to a specific conversation from the speech information of the plurality of
terminal devices 10 is not limited to the correspondence shown in FIG. 13 described above.
Conversation information related to a specific conversation may be detected from any of plural
pieces of utterance information by any method.
[0063]
In the above-mentioned example, two wearers each wearing the terminal device 10 are shown in conversation, but the number of persons participating in a conversation is not limited to two. When three or more wearers are in conversation, in the terminal device 10 worn by each wearer, the uttered voice of the wearer of that device is recognized as the wearer's own uttered voice and is distinguished from the uttered voices of the others (two or more people). However, the information indicating the utterance status, such as the utterance time and the timing at which the speaker is switched, approximates among the information acquired by each terminal device 10. Therefore, as in the case of the two-person conversation above, the conversation information detection unit 201 detects the utterance information acquired from the terminal devices 10 of the wearers participating in the same conversation and distinguishes it from the utterance information acquired from the terminal devices 10 of wearers not participating in the conversation.
[0064]
Next, the conversation information analysis unit 202 analyzes the conversation information detected by the conversation information detection unit 201 and extracts features of the conversation. In the present embodiment, as a specific example, the features of a conversation are extracted based on three evaluation criteria: the degree of interaction, the degree of listening, and the degree of conversational activity. Here, the degree of interaction represents the balance of the speech frequency of the conversation participants. The degree of listening represents the degree to which each individual conversation participant listens to the other participants. The degree of conversational activity represents the density of utterances in the entire conversation.
[0065]
The degree of interaction is specified by the number of speaker changes during a conversation and the variation in the time until the speaker changes (the time during which one speaker speaks continuously). This is obtained from the number of times the speaker is switched and the times at which the speaker is switched in the conversation information for a fixed time. The value (level) of the degree of interaction is assumed to be larger as the number of speaker changes is larger and the variation in each speaker's continuous speech time is smaller. This evaluation criterion is common to all conversation information (the speech information of each terminal device 10) related to the same conversation.
[0066]
The degree of listening is specified, for each conversation participant, by the ratio of the speech time of the others to the participant's own speech time in the speech information. For example, with the following equation, a larger value is assumed to mean a larger value (level) of the degree of listening. Degree of listening = (speaking time of another person) ÷ (speaking time of the wearer) This evaluation criterion differs for each piece of utterance information acquired from the terminal device 10 of each conversation participant, even for conversation information related to the same conversation.
[0067]
The degree of conversational activity is an index representing the so-called liveliness of a conversation, and is specified by the ratio of silent time (time during which none of the conversation participants is speaking) to the entire conversation information. The shorter the total silent time, the more it means that one of the conversation participants is speaking, and the larger the value (level) of conversational activity is assumed to be. This evaluation criterion is common to all conversation information (the speech information of each terminal device 10) related to the same conversation.
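The three evaluation criteria above can be sketched as follows. Only the listening-degree equation is given in the text; the particular formulas used here for the degree of interaction and the degree of activity are illustrative assumptions that merely respect the stated tendencies (more speaker changes and less variation give a higher interaction level; less silence gives a higher activity level):

```python
def listening_degree(wearer_time, others_time):
    # Degree of listening = (speaking time of another person)
    #                     / (speaking time of the wearer)   [from the text]
    return others_time / wearer_time

def interaction_degree(turn_lengths):
    # turn_lengths: continuous speech durations, one per speaker turn.
    # Higher with more speaker changes and smaller variation (assumed form).
    n = len(turn_lengths)
    mean = sum(turn_lengths) / n
    variance = sum((t - mean) ** 2 for t in turn_lengths) / n
    return n / (1.0 + variance)

def activity_degree(total_time, silent_time):
    # Shorter total silence over the whole conversation -> higher activity
    # (assumed form).
    return 1.0 - silent_time / total_time

print(listening_degree(10.0, 30.0))  # 3.0: listened three times as long as spoken
print(activity_degree(60.0, 6.0))    # 0.9: little silence, lively conversation
print(interaction_degree([5.0, 5.0, 5.0]) > interaction_degree([1.0, 13.0, 1.0]))  # True
```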
[0068]
As described above, the analysis of conversation information by the conversation information analysis unit 202 extracts the features of the conversation related to that conversation information. The above analysis also identifies how each participant participates in the conversation. Note that the above evaluation criteria are only examples of information representing the features of a conversation; evaluation criteria may be set according to the purpose and mode of use of the system of this embodiment by adopting other evaluation items or adding weights to each item.
[0069]
By performing the above-described analysis on the various pieces of conversation information detected by the conversation information detection unit 201 from the speech information stored in the data storage unit 22, communication tendencies in the entire group of wearers of the terminal devices 10 can be analyzed. Specifically, for example, by examining the correlation between the frequency of occurrence of conversation and the number of conversation participants, the time at which the conversation was conducted, the degree of interaction, the value of the activity level, and so on, it is determined what kinds of conversation tend to take place in the group of wearers.
[0070]
Moreover, the communication tendency of an individual wearer can be analyzed by performing the above-mentioned analysis on a plurality of pieces of conversation information of that specific wearer. The manner in which a particular wearer participates in a conversation may show a certain tendency depending on conditions such as the conversation partner and the number of conversation participants. Therefore, by examining a plurality of pieces of conversation information for a specific wearer, it is expected that features can be detected such as a high degree of interaction in conversations with a specific partner, or an increase in the degree of listening when the number of conversation participants increases.
[0071]
Note that the process of identifying speech information and the process of analyzing conversation information described above merely show application examples of the system according to the present embodiment, and do not limit the purpose or mode of use of the system according to the present embodiment, the functions of the host device 20, and so on. A processing function for executing various analyses and investigations on the utterance information acquired by the terminal device 10 of the present embodiment may be realized as a function of the host device 20.
[0072]
Now, in the above, it has been described that the voice analysis unit 15 identifies whether the acquired voice contains a collision sound after identifying whether it is a voice uttered by the wearer itself or a voice uttered by another person; however, the present invention is not limited to this, as long as it is configured both to identify whether a voice is uttered by the wearer or by another person and to identify whether the acquired voice includes a collision sound. For example, it may be configured to identify whether a voice is uttered by the wearer itself or by another person after identifying whether the acquired voice includes a collision sound.
[0073]
In the above description, it has been described that when the voice analysis unit 15 determines that a voice is an acquired voice including a collision sound, discrimination between the speech of the wearer and the speech of another person is not performed; however, the present invention is not limited thereto. For example, after determining that the acquired voice includes a collision sound, the voice analysis unit 15 may separate and remove the collision sound from the voices acquired by the first microphone 11 and the second microphone 12 (remove the noise by performing filtering processing), and identify that the wearer's uttered voice was obtained at the time the voice was acquired. This suppresses misidentifying an acquired voice that is actually the wearer's voice as the voice of another person.
[0074]
DESCRIPTION OF SYMBOLS 10: terminal device, 11: first microphone, 12: second microphone, 13: first amplifier, 14: second amplifier, 15: voice analysis unit, 16: data transmission unit, 17: power supply unit, 20: host device, 21: data reception unit, 22: data storage unit, 23: data analysis unit, 24: output unit, 30: device main body, 40: strap, 201: conversation information detection unit, 202: conversation information analysis unit