Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016206261
Abstract: A conversation processing apparatus and a conversation processing method capable of measuring the degree of excitement of a conference. According to one embodiment, a conversation processing device includes a sound collection unit that collects the voice signals of a plurality of speakers, a conversation recording unit that records the voice signal of each speaker collected by the sound collection unit, and a conversation analysis unit that analyzes the conversation between any two speakers from the voice signals recorded in the conversation recording unit and calculates the degree of excitement in the conversation between those two speakers. [Selected figure] Figure 6
Conversation processing apparatus and conversation processing method
[0001]
The present invention relates to a conversation processing apparatus and a conversation
processing method.
[0002]
A device has been proposed that records the speech content of each speaker in a voice conference by collecting the voices of a plurality of speakers with a plurality of microphones and performing sound source separation processing on the collected voices (see, for example, Patent Document 1).
[0003]
In addition, in group discussions held at work, in classes, or in hiring interviews, it is desirable that the meeting become lively and that the discussion move forward. Moreover, in a group discussion it is necessary to identify who is the main person responsible for enlivening the meeting.
[0004]
JP 2007-295104 A
[0005]
However, the technique described in Patent Document 1 cannot measure the degree of excitement of a conference.
[0006]
The present invention has been made in view of the above-described points, and an object of the
present invention is to provide a conversation processing apparatus and a conversation
processing method capable of measuring the degree of excitement of a conference.
[0007]
(1) In order to achieve the above object, a conversation processing device according to an aspect of the present invention includes a sound collection unit that collects the voice signals of a plurality of speakers, a conversation recording unit that records the voice signal of each speaker collected by the sound collection unit, and a conversation analysis unit that analyzes the conversation between any two speakers from the voice signals recorded in the conversation recording unit and calculates the degree of excitement in the conversation between the two speakers.
[0008]
(2) Further, in the conversation processing device according to the aspect of the present invention, the degree of excitement may be based on the influence that each of the two speakers exerts on the conversation over time.
(3) Further, in the conversation processing device according to the aspect of the present invention, the conversation analysis unit may calculate the degree of excitement of the two arbitrary speakers using a heat equation.
[0009]
(4) Further, in the conversation processing device according to the aspect of the present invention, the conversation analysis unit may select all pairs of two arbitrary speakers from the plurality of speakers, calculate the degree of excitement for each selected pair, generate a weighted complete graph composed of nodes and edges using the calculated degree of excitement of each pair as the weight of the corresponding edge, and estimate the central person of the conversation based on the generated weighted complete graph.
[0010]
(5) Further, in the conversation processing device according to the aspect of the present invention, the conversation analysis unit may normalize the volume of the audio signals in the conversation between the two arbitrary speakers.
(6) Further, in the conversation processing device according to one aspect of the present invention, the conversation analysis unit may normalize the volume of the audio signals in the conversation between the two arbitrary speakers based on the length of each single utterance.
[0011]
(7) Further, in the conversation processing device according to one aspect of the present invention, the conversation analysis unit may calculate the degree of excitement of the two arbitrary speakers at predetermined time intervals.
(8) Further, in the conversation processing device according to one aspect of the present invention, the conversation analysis unit may determine that the conversation has become lively when the degree of excitement of the two arbitrary speakers is larger than a predetermined threshold.
[0012]
(9) Further, the conversation processing device according to the aspect of the present invention may further include a sound source localization unit that localizes sound source positions using the audio signals collected by the sound collection unit, and a sound source separation unit that separates the audio signal of each speaker based on the localization result of the sound source localization unit, and the conversation recording unit may record the voice signals of the speakers separated by the sound source separation unit.
[0013]
(10) In order to achieve the above object, a conversation processing method according to an aspect of the present invention includes a sound collection procedure in which a sound collection unit collects the voice signals of a plurality of speakers, a conversation recording procedure in which the voice signal of each speaker picked up by the sound collection procedure is recorded, and a conversation analysis procedure in which a conversation analysis unit analyzes the conversation between any two speakers from the voice signals recorded by the conversation recording procedure and calculates the degree of excitement in the conversation between the two speakers.
[0014]
According to the configurations of (1) and (10) described above, it is possible to measure the degree of excitement between speakers using the voice signals of any two speakers.
According to the configurations of (2) and (3) described above, the degree of excitement can be calculated by solving the heat equation of a heat propagation model into which each utterance of the two extracted speakers is injected in time series.
According to the configuration of (4) described above, the central person of the conversation can be estimated using the weighted complete graph.
[0015]
According to the configurations of (5) and (6) described above, even if the volume differs between the two selected speakers, the influence of the volume difference can be reduced by normalization.
In addition, according to the configuration of (6), by normalizing the volume in consideration of the utterance time of each speaker, the degree of excitement can be calculated appropriately even when the utterance of one speaker becomes long.
According to the configuration of (7) described above, since the degree of excitement is calculated at every predetermined time, the time change of the degree of excitement in the conference can be provided.
According to the configuration of (8) described above, it can be determined whether or not the conference has become lively. According to the configuration of (9) described above, for example, the sound source separation unit performs sound source separation on the sound signals collected by a microphone array, and the degree of excitement between speakers can be measured using any two of the separated sound signals.
[0016]
FIG. 1 is a diagram showing the excitement degree model in the case where there are two speakers according to the embodiment. FIG. 2 is a diagram explaining an example of the Pareto distribution used for normalization according to the embodiment. FIG. 3 is a diagram showing an example of the time change of the excitement degree hAB in a conversation between speakers A and B according to the embodiment. FIG. 4 is a diagram showing the weighted complete graph in the case where there are three speakers according to the embodiment. FIG. 5 is a diagram showing an example of the time change of the excitement degrees hAB, hBC, and hAC and of the average h̄ of the three excitement degrees in a conversation among speakers A, B, and C according to the embodiment. FIG. 6 is a block diagram showing the configuration of a conversation processing device according to the embodiment. FIG. 7 is a diagram showing an example of the information recorded in the conversation recording unit according to the embodiment. FIG. 8 is a flowchart showing an example of the processing performed by the conversation processing device according to the embodiment. FIG. 9 is a diagram showing an example of the time change of the excitement degree hxy(t) when the value of the conversation diffusion rate D according to the embodiment is changed. FIG. 10 is a diagram showing an example of the estimation result obtained by the contribution degree calculation unit for a conversation among three speakers according to the embodiment.
[0017]
<Summary of the Invention> The summary of the invention will be described. In a conference in which a plurality of speakers participate, the conversation processing device 1 (see FIG. 6) separates and records the speech of each speaker. The conversation processing device 1 sequentially selects the conversations of every pair of two speakers from the recorded voices. For example, when the participants in the conference are two persons A and B, the only pair to be selected is AB (= BA). When the participants are three persons A, B, and C, the selected pairs are AB (= BA), AC (= CA), and BC (= CB). In the present embodiment, a conversation is treated as a supply of heat and the excitement of the conversation as the transfer of that heat through a space, and a heat propagation model is used to calculate the degree of excitement, which indicates how lively the conversation is. Using the two selected voices, the conversation processing device 1 calculates the degree of excitement of the conversation at each time with the heat equation. The conversation processing device 1 then generates a weighted complete graph from the calculated degrees of excitement. Using the generated weighted complete graph, the conversation processing device 1 calculates the degree to which each speaker's utterances contribute to the conference at each predetermined time (hereinafter referred to as the speech contribution degree) and thereby estimates the central person of the conference at each predetermined time. The predetermined time is, for example, one second.
[0018]
First, the degree of excitement used in the present embodiment will be described. FIG. 1 is a diagram showing the excitement degree model in the case where there are two speakers according to the present embodiment. In FIG. 1, the horizontal axis is the x-axis direction and represents the position where each speaker is present, and the vertical axis represents the degree of excitement. As shown in FIG. 1, speaker A is located at one end (x = 0) of the x-axis and speaker B is located at the other end (x = 1). In the present embodiment, a heat transfer model is used as the excitement degree model. The speech of speaker A corresponds to heat added from x = 0, and the speech of speaker B corresponds to heat added from x = 1. In this excitement degree model, when one of the two persons speaks, heat is supplied from the end, x = 0 or x = 1, where that speaker is located. When the two speakers are not speaking at the same time, the amount of heat supplied from the two ends decreases. In the present embodiment, as shown in FIG. 1, the temperature u of the heat equation at the center position (x = 1/2) between speakers A and B is defined as the degree of excitement hAB. The subscript AB denotes the pair of speakers A and B.
[0019]
Next, the audio signals supplied to the excitement degree model will be described. First, the terms used in the present embodiment are defined. The number of participants in the conversation is M, and the index of each speaker is m (∈ {1, ..., M}). In one conference, the total number of utterances made by speaker m is Im, and the index of an utterance of speaker m is im (∈ {1, ..., Im}). Let t_im be the start time of utterance im by speaker m, and let v_im be the volume of utterance im of speaker m (hereinafter also referred to as the utterance volume).
[0020]
Here, an example with two speakers (M = 2) will be described. When speaker 1 speaks at time t_i1, the volume v_i1 is supplied from one end of the excitement degree model. When speaker 2 speaks at time t_i2 after time t_i1, the volume v_i2 is supplied from the other end of the excitement degree model. Thereafter, each time speaker 1 or speaker 2 speaks, the volume v_im is supplied from the corresponding end of the excitement degree model.
[0021]
Next, calculation of the degree of excitement will be described. The heat equation when there is a
conductor on the number line x is expressed as the following equation (1).
[0022]
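A plausible form of equation (1), based on the variable definitions in the following paragraph (the equation image itself is not reproduced in this text), is the one-dimensional heat conduction equation:

\[ c\rho\,\frac{\partial u(x,t)}{\partial t} = K\,\frac{\partial^2 u(x,t)}{\partial x^2} \quad\Longleftrightarrow\quad \frac{\partial u}{\partial t} = a\,\frac{\partial^2 u}{\partial x^2}, \qquad a = \frac{K}{c\rho} \]   ... (1)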
[0023]
In equation (1), c is the specific heat of the conductor, ρ is the density of the conductor, u is the temperature at position x in the conductor, K is the thermal conductivity, and a is the thermal diffusivity.
In the present embodiment, the temperature u is replaced with the conversation excitement degree hAB between speakers A and B, and the thermal diffusivity a is replaced with the conversation diffusion rate D. The larger the value of the conversation diffusion rate D, the faster speech propagates; the smaller the value, the slower speech propagates. In the present embodiment, the amount of heat supplied is replaced with the amount of speech. As a result, equation (1) is replaced by equation (2).
[0024]
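Under the substitutions described above (u replaced by hAB and a replaced by D), equation (2), whose image is not reproduced here, presumably takes the form:

\[ \frac{\partial h_{AB}(x,t)}{\partial t} = D\,\frac{\partial^2 h_{AB}(x,t)}{\partial x^2} \]   ... (2)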
[0025]
Assuming that the position of one end to which speech is supplied is 0 and the position of the other end is 1, the boundary conditions of equation (2) are expressed by the following equation (3). Further, as shown in equation (3), the degree of excitement hAB at time 0 is assumed to be 0.
[0026]
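From the description in the surrounding paragraphs, the boundary and initial conditions of equation (3), whose image is not reproduced here, are presumably:

\[ h_{AB}(0,t) = f_1(t), \qquad h_{AB}(1,t) = f_2(t), \qquad h_{AB}(x,0) = 0 \]   ... (3)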
[0027]
In equation (3), f1(t) is a function indicating the influence of speaker A on the degree of excitement hAB, and is based on the loudness and frequency of the utterances of speaker A. Likewise, f2(t) is a function indicating the influence of speaker B on the degree of excitement hAB, and is based on the loudness and frequency of the utterances of speaker B. That is, in the present embodiment, the amounts of heat (amounts of speech) f1(t) and f2(t) supplied from the two ends change with time t. Further, in the present embodiment, to account for the volume difference of the voice signals between the speakers, the volume of each speaker's voice signal is assumed to follow a Pareto distribution as shown in FIG. 2, and f1(t) and f2(t) are normalized accordingly. In the present embodiment, normalization is performed so that the volume falls in the range of 0 to 1.
[0028]
FIG. 2 is a diagram for explaining an example of the Pareto distribution used in normalization according to the present embodiment. In FIG. 2, the horizontal axis represents the volume and the vertical axis represents the frequency of utterances. A curve g1 represents the Pareto distribution curve used when normalizing the volume. Note that FIG. 2 is only an example; the distribution used for normalization is not limited to the Pareto distribution, and another statistical distribution may be used.
[0029]
Furthermore, it is assumed that the longer only one of the speakers keeps speaking, the less excited the conference is, that is, the lower the degree of excitement hAB. For this reason, it is desirable for the amount of speech that is supplied to decrease as the utterance of one speaker becomes longer. Therefore, in this embodiment, the functions f1(t) and f2(t) are defined so as to be normalized in proportion to the volume and to decay exponentially with the time elapsed since the utterance. As a result, the functions f1(t) and f2(t) are expressed by the following equation (4).
[0030]
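The equation image is not reproduced here; from the description in [0029] and [0031], a plausible form of equation (4) is a sum over the utterances of speaker m, each weighted by its normalized volume and decaying exponentially from its start time:

\[ f_m(t) = \sum_{i_m:\; t \ge t_{i_m}} v_{i_m}\, e^{-\alpha (t - t_{i_m})}, \qquad m \in \{1, 2\} \]   ... (4)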
[0031]
In equation (4), m is 1 or 2, and t represents time. v is the normalized volume value, and t_i is the start time of an utterance. α is an attenuation constant representing the decrease of the contribution of utterance i according to the time elapsed since its start time t_i; that is, the attenuation constant α is a coefficient representing the decrease in activity when a specific speaker keeps speaking without the speakers taking turns. Thus, equation (4) represents a "conversation" as the sum over a set of "utterances". In the present embodiment, the degree of excitement hAB calculated in this manner is used as an edge weight in a graph. If there are two speakers A and B, the nodes are A and B, and the weight of the edge between nodes A and B is the degree of excitement hAB.
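The patent gives no source code; the following Python sketch only illustrates how the boundary functions f1(t), f2(t) and the excitement degree hAB(t) at the midpoint could be computed, with equation (2) solved by an explicit finite difference scheme. The function names, grid size, time step, and parameter values (D, α, the threshold) are illustrative assumptions, not values from the specification.

```python
import numpy as np

def excitement_curve(utterances_a, utterances_b, duration,
                     D=1.0, alpha=0.5, nx=51, dt=None):
    """Sketch of the excitement degree h_AB(t) from the heat-propagation model.

    utterances_a / utterances_b: lists of (start_time, normalized_volume)
    pairs for speaker A (at x = 0) and speaker B (at x = 1); volumes are
    assumed to already be normalized to the range 0..1.
    Returns (times, h_mid), where h_mid is h_AB evaluated at x = 1/2.
    """
    dx = 1.0 / (nx - 1)
    if dt is None:
        dt = 0.4 * dx * dx / D              # keep the explicit scheme stable
    steps = int(duration / dt)

    def boundary(utterances, t):
        # f_m(t): sum of past utterance volumes, each decaying exponentially
        # with the time elapsed since its start (attenuation constant alpha).
        return sum(v * np.exp(-alpha * (t - t0)) for t0, v in utterances if t >= t0)

    u = np.zeros(nx)                        # initial condition h_AB(x, 0) = 0
    times, h_mid = [], []
    for n in range(steps):
        t = n * dt
        u[0] = boundary(utterances_a, t)    # heat injected from speaker A's end
        u[-1] = boundary(utterances_b, t)   # heat injected from speaker B's end
        # explicit finite-difference update of dh/dt = D * d2h/dx2
        u[1:-1] += D * dt / dx ** 2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
        times.append(t)
        h_mid.append(u[nx // 2])
    return np.array(times), np.array(h_mid)

# Example: speakers A and B alternate utterances over a 10-second exchange.
t, h_ab = excitement_curve(
    utterances_a=[(0.5, 0.8), (3.0, 0.6), (6.0, 0.9)],
    utterances_b=[(1.5, 0.7), (4.5, 0.8), (7.5, 0.5)],
    duration=10.0)
lively = h_ab >= 0.3   # 0.3 plays the role of the threshold (broken line g3), assumed
```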
[0032]
<Example of Temporal Change of the Degree of Excitement> Next, an example of the time change of the degree of excitement hAB in a conversation between speakers A and B will be described. FIG. 3 is a diagram showing an example of the time change of the excitement degree hAB in a conversation between speakers A and B according to the present embodiment. In FIG. 3, the horizontal axis represents time and the vertical axis represents the degree of excitement hAB. A curve g2 represents the degree of excitement hAB(t) over time, and a broken line g3 represents the threshold used to determine whether the conversation is lively. In the example illustrated in FIG. 3, the conference takes place between times t0 and t6, and the conversation is determined to be lively between times t1 and t3 and between times t4 and t5, the periods in which the degree of excitement is equal to or larger than the threshold g3. Further, in the example shown in FIG. 3, the degree of excitement hAB(t) takes its largest value at time t2.
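As an illustration of the threshold determination just described, the sketch below extracts the periods (such as t1 to t3 and t4 to t5 in FIG. 3) during which a computed excitement curve stays at or above a threshold. The threshold value and the synthetic curve are assumptions for demonstration only.

```python
import numpy as np

def lively_intervals(times, h, threshold):
    """Return (start, end) periods where the excitement degree is at or above
    the threshold (the broken line g3 in FIG. 3). Illustrative sketch only."""
    intervals, start = [], None
    for t, above in zip(times, h >= threshold):
        if above and start is None:
            start = t                      # excitement crosses the threshold
        elif not above and start is not None:
            intervals.append((start, t))   # excitement falls below it again
            start = None
    if start is not None:                  # still above at the end of the data
        intervals.append((start, times[-1]))
    return intervals

# Example with synthetic data: two bumps rising above a threshold of 0.3.
t = np.linspace(0.0, 10.0, 200)
h = 0.5 * np.exp(-((t - 3.0) / 1.0) ** 2) + 0.45 * np.exp(-((t - 7.0) / 0.8) ** 2)
print(lively_intervals(t, h, threshold=0.3))
```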
[0033]
<Description of the Case of Three Speakers> Next, the case of three speakers A, B, and C will be described. When there are three speakers, the conversation processing device 1 sequentially extracts, from the recorded voice signals, the voice signals of speakers A and B, the voice signals of speakers A and C, and the voice signals of speakers B and C. The conversation processing device 1 calculates the degrees of excitement hAB, hAC, and hBC between each pair of speakers by replacing hAB in equation (2) with hAC or hBC as appropriate. The conversation processing device 1 then generates a weighted complete graph as shown in FIG. 4 using the calculated degrees of excitement hAB, hAC, and hBC. A complete graph is a graph in which an edge exists between every pair of nodes, and a weighted graph is a graph in which the edges carry weights. FIG. 4 is a diagram showing the weighted complete graph in the case of three speakers according to the present embodiment.
[0034]
As shown in FIG. 4, in the case of three speakers A, B, and C, the nodes are A, B, and C; the weight of the edge between nodes A and B is the degree of excitement hAB, the weight of the edge between nodes A and C is the degree of excitement hAC, and the weight of the edge between nodes B and C is the degree of excitement hBC. When there are four speakers, a weighted complete graph with four vertices (nodes) is used, and when there are m speakers, a weighted complete graph with m vertices is used.
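A small sketch of how such a weighted complete graph could be represented for m speakers, using the pairwise excitement degrees as edge weights. The function name, the frozenset keys, and the example values are illustrative assumptions rather than part of the specification.

```python
import itertools
import numpy as np

def adjacency_from_pairs(speakers, h_pairs):
    """Build the adjacency matrix of the weighted complete graph.

    speakers: list of identifiers, e.g. ["A", "B", "C"].
    h_pairs:  dict mapping a frozenset pair to its excitement degree at some
              time, e.g. {frozenset("AB"): 0.8, ...}.
    Diagonal entries (a speaker with itself) are 0, and h_xy = h_yx, so the
    resulting matrix is symmetric.
    """
    m = len(speakers)
    n_matrix = np.zeros((m, m))
    for i, j in itertools.combinations(range(m), 2):
        h = h_pairs[frozenset((speakers[i], speakers[j]))]
        n_matrix[i, j] = n_matrix[j, i] = h
    return n_matrix

# Three speakers: edges AB, AC, and BC carry the pairwise excitement degrees.
N = adjacency_from_pairs(["A", "B", "C"],
                         {frozenset("AB"): 0.8,
                          frozenset("AC"): 0.3,
                          frozenset("BC"): 0.6})
```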
[0035]
Next, an example of the degrees of excitement when there are three speakers will be described. FIG. 5 is a diagram showing an example of the time change of the degrees of excitement hAB, hBC, and hAC and of the average h̄ of the three degrees of excitement in a conversation among speakers A, B, and C according to the present embodiment. In FIG. 5, the horizontal axis represents time and the vertical axis represents the degree of excitement. The broken line g3 represents the threshold used to determine whether the conversation is lively. The curve g11 represents the degree of excitement hAB(t) over time in the conversation between speakers A and B, the curve g12 represents the degree of excitement hBC(t) over time in the conversation between speakers B and C, and the curve g13 represents the degree of excitement hAC(t) over time in the conversation between speakers A and C. The curve g14 represents the average h̄(t) of the degrees of excitement, h̄(t) = (1/3)(hAB(t) + hBC(t) + hAC(t)).
[0036]
The example shown in FIG. 5 indicates that a conference was held between times t0 and t14. For speakers A and B, curve g11 indicates that the conversation was lively between times t1 and t5 and between times t9 and t10. For speakers B and C, curve g12 indicates that the conversation was lively from time t6 to t11. For speakers A and C, curve g13 indicates that the conversation was lively between times t2 and t4 and between times t8 and t13. For speakers A, B, and C as a whole, curve g14 indicates that the conversation was lively between times t3 and t6 and between times t7 and t12.
[0037]
As shown in FIG. 5, according to the present embodiment, it is possible to measure not only the lively periods of the conference as a whole but also the time change of the degree of excitement for any combination of two participants in the conference. Such a result can be used as a reference, for example, when the leader of the conference considers whether speaker B or speaker C would be a better partner for speaker A when speaker A participates in the conference.
[0038]
<Estimation of the Speaker Contributing to the Excitement of the Conference> Next, estimation of the speaker who contributes to the excitement of the conference will be described, taking the case of three speakers as an example. In the case of three speakers A, B, and C, as described above, the nodes of the graph are A, B, and C, and the edge weights are the degrees of excitement hAB, hBC, and hAC. The adjacency matrix N of the weighted complete graph for these three speakers is expressed as the following equation (5).
[0039]
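The equation image is not reproduced here; from the description in the following paragraph, a plausible form of equation (5) is:

\[ N = \begin{pmatrix} h_{AA} & h_{AB} & h_{AC} \\ h_{BA} & h_{BB} & h_{BC} \\ h_{CA} & h_{CB} & h_{CC} \end{pmatrix} = \begin{pmatrix} 0 & h_{AB} & h_{AC} \\ h_{AB} & 0 & h_{BC} \\ h_{AC} & h_{BC} & 0 \end{pmatrix} \]   ... (5)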
[0040]
In the second term of equation (5), the degrees of excitement hAA, hBB, and hCC of a speaker with himself or herself are assumed to be 0. In addition, the degree of excitement satisfies hxy = hyx (where x, y ∈ {A, B, C}). As a result, the second term of equation (5) can be written as the third term of equation (5). Here, according to the Perron-Frobenius theorem, the components of the eigenvector for the largest eigenvalue of a matrix whose components are all nonnegative all have the same sign. Since every component of the adjacency matrix N of the generated graph is nonnegative, as shown in equation (5), the components of the eigenvector for the largest eigenvalue are all of the same sign. In this embodiment, the eigenvector R for the largest eigenvalue is defined as the speech contribution degree in the conversation. The conversation processing device 1 calculates the eigenvector R of the adjacency matrix N by the following equation (6).
[0041]
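Equation (6), whose image is not reproduced here, is presumably the eigenvalue equation:

\[ N R = \lambda R \]   ... (6)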
[0042]
In equation (6), λ is a real number and represents an eigenvalue.
The eigenvector R shown in equation (6) is expressed as the following equation (7).
[0043]
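From the description in the following paragraph, equation (7), whose image is not reproduced here, is presumably:

\[ R = \begin{pmatrix} C_A \\ C_B \\ C_C \end{pmatrix} \]   ... (7)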
[0044]
In equation (7), CA is the speech contribution degree of speaker A, CB is the speech contribution degree of speaker B, and CC is the speech contribution degree of speaker C. The speech contribution degree C indicates the degree to which a speaker contributes to the excitement of the conference, and the speaker with the largest speech contribution degree C is the person at the center of the conversation (see Reference 1). In the present embodiment, by calculating the speech contribution degree C at every predetermined time, the temporal transition of the central person of the conversation can be analyzed. Note that the above method of calculating the speech contribution degree C is only an example and is not restrictive; the conversation processing device 1 may calculate the speech contribution degree C using another method for calculating centrality in graph theory.
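A sketch of how the speech contribution degrees could be obtained from the adjacency matrix N as the eigenvector of its largest eigenvalue (equation (6)). The sign flip, normalization, and example numbers are illustrative assumptions; the patent only requires some graph-centrality measure.

```python
import numpy as np

def contribution_degrees(n_matrix):
    """Speech contribution degrees: the eigenvector of the adjacency matrix
    for its largest eigenvalue. Because the matrix is symmetric and
    nonnegative, that eigenvector's components share one sign; they are
    flipped to be positive and normalized here for easy comparison."""
    eigvals, eigvecs = np.linalg.eigh(n_matrix)   # symmetric eigendecomposition
    r = eigvecs[:, np.argmax(eigvals)]            # eigenvector of the largest eigenvalue
    if r.sum() < 0:
        r = -r
    return r / r.sum()

# Using the three-speaker adjacency matrix from the previous sketch:
N = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.6],
              [0.3, 0.6, 0.0]])
C = contribution_degrees(N)      # C[1] (speaker B) is the largest component here
print(dict(zip("ABC", C.round(3))))
```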
[0045]
[Reference 1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford InfoLab, Technical Report, 1999.
[0046]
<Configuration of Conversation Processing Device 1> Next, the configuration of the conversation processing device 1 will be described.
FIG. 6 is a block diagram showing the configuration of the conversation processing device 1 according to the present embodiment. As shown in FIG. 6, the conversation processing device 1 includes a sound collection unit 11, an audio signal acquisition unit 12, a sound source localization unit 13, a sound source separation unit 14, a conversation recording unit 15, an operation unit 16, a conversation analysis unit 17, and an analysis result output unit 18.
[0047]
The sound collection unit 11 is composed of microphones 101-1 to 101-N (N is an integer of 2 or more). The sound collection unit 11 may be a microphone array, or may be tie-pin microphones (also referred to as pin microphones) attached to the individual speakers. When it is not necessary to distinguish the microphones 101-1 to 101-N, they are simply referred to as the microphones 101. The sound collection unit 11 converts the collected voice of each speaker into an electrical signal and outputs the converted voice signal to the audio signal acquisition unit 12. Note that the sound collection unit 11 may transmit the recorded N-channel audio signals to the audio signal acquisition unit 12 either wirelessly or by wire; at the time of transmission, the audio signals need only be synchronized between the channels.
[0048]
The audio signal acquisition unit 12 acquires N audio signals recorded by the N microphones
101 of the sound collection unit 11. The audio signal acquisition unit 12 generates an input
signal in the frequency domain by performing Fourier transform for each of the acquired N audio
signals in the time domain. The audio signal acquisition unit 12 outputs the N audio signals
subjected to Fourier transform to the sound source localization unit 13 and the sound source
separation unit 14.
[0049]
The sound source localization unit 13 estimates the azimuth angle of each sound source, that is, each speaker (this is also referred to as sound source localization), based on the N audio signals input from the audio signal acquisition unit 12, and outputs the estimated azimuth information for each sound source to the sound source separation unit 14. The sound source localization unit 13 estimates the azimuth angle using, for example, the MUSIC (Multiple Signal Classification) method. Note that other sound source direction estimation methods may be used to estimate the azimuth angle, such as the beam forming method, the WDS-BF (Weighted Delay and Sum Beam Forming) method, or GSVD-MUSIC (Generalized Singular Value Decomposition-Multiple Signal Classification), which uses a generalized singular value decomposition.
[0050]
The sound source separation unit 14 acquires the N audio signals output from the audio signal acquisition unit 12 and the azimuth angle information for each sound source output from the sound source localization unit 13. The sound source separation unit 14 separates the acquired N audio signals into a voice signal for each speaker using, for example, the geometric high-order decorrelation-based source separation (GHDSS) method. Alternatively, the sound source separation unit 14 may perform the sound source separation process using, for example, the Independent Component Analysis (ICA) method. The sound source separation unit 14 causes the conversation recording unit 15 to record identification information that identifies each speaker in association with the separated voice signal of that speaker. Note that the sound source separation unit 14 may separate the voice signal of each speaker after first separating noise from the speakers' speech using, for example, a transfer function of the room stored in the unit itself. In this case, the sound source separation unit 14 may, for example, calculate an acoustic feature amount for each of the N audio signals and separate the voice signal of each speaker based on the calculated acoustic feature amounts and the azimuth angle information input from the sound source localization unit 13.
[0051]
In the conversation recording unit 15, as shown in FIG. 7, the date and time at which the voice signals of a conference were recorded (also referred to as the recording date and time), the identification information of each speaker, and the separated voice signal of each speaker are recorded in association with one another for every conference. FIG. 7 is a diagram showing an example of the information recorded in the conversation recording unit 15 according to the present embodiment. The example shown in FIG. 7 is for the case of three speakers. As shown in FIG. 7, identification information m (m is one of A, B, and C) and an audio signal m are associated with each other, and the recording date and time is further associated with them. The information shown in FIG. 7 is recorded in the conversation recording unit 15 for each conference.
[0052]
Returning to FIG. 6, the description of the conversation processing device 1 will be continued.
The operation unit 16 receives the user's operation, and outputs the received operation
information to the conversation analysis unit 17. The operation information includes, for
example, meeting selection information indicating which of the recorded meetings is to be
analyzed, analysis start information indicating the start of analysis, and the like.
[0053]
The conversation analysis unit 17 includes a sound source selection unit 171, a volume normalization unit 172, an excitement degree calculation unit 173, a graph generation unit 174, and a contribution degree calculation unit 175. Each functional unit of the conversation analysis unit 17 performs its processing at predetermined time intervals.
[0054]
The sound source selection unit 171 starts the analysis of a conference in accordance with the analysis start information included in the operation information output from the operation unit 16. In accordance with the conference selection information included in the operation information output from the operation unit 16, the sound source selection unit 171 reads out the voice signals and identification information of the designated conference from among those recorded in the conversation recording unit 15. The sound source selection unit 171 then sequentially selects every pair of two voice signals from the read voice signals, covering all pairs determined by the number of pieces of identification information. Specifically, in the example shown in FIG. 7, the sound source selection unit 171 selects the audio signals A and B of identification information A and B, the audio signals B and C of identification information B and C, and the audio signals A and C of identification information A and C. The sound source selection unit 171 sequentially outputs each selected pair of audio signals and its identification information to the volume normalization unit 172. Note that the sound source selection unit 171 outputs the pairs of audio signals determined by the number of pieces of identification information to the volume normalization unit 172 sequentially, for example in a time-division manner within a predetermined time.
[0055]
The volume normalization unit 172 normalizes the volume of the two voice signals output from the sound source selection unit 171 by calculating the functions f1(t) and f2(t) for each speaker using the above-described equation (4). The volume normalization unit 172 associates identification information with each of the calculated f1(t) and f2(t) and outputs them to the excitement degree calculation unit 173. Note that the volume normalization unit 172 calculates the functions f1(t) and f2(t) for each pair of audio signals in the combinations determined by the number of pieces of identification information, for example within a predetermined time.
[0056]
The excitement degree calculation unit 173 calculates the degree of excitement hxy(t) between the two audio signals, that is, between the two speakers, by solving the heat equation of equation (2), for example by the finite difference method, using the functions f1(t) and f2(t) output from the volume normalization unit 172 and the boundary conditions of equation (3) described above. The excitement degree calculation unit 173 associates the calculated excitement degree hxy(t) with the identification information and sequentially outputs it to the graph generation unit 174. For example, in the example shown in FIG. 7, the excitement degree calculation unit 173 associates the calculated excitement degree hAB with the selected identification information A and B, associates the excitement degree hBC with the identification information B and C, associates the excitement degree hAC with the identification information A and C, and outputs the results to the graph generation unit 174. Note that the excitement degree calculation unit 173 calculates the excitement degree hxy(t) for each pair of audio signals in the combinations determined by the number of pieces of identification information, for example within a predetermined time. Further, the excitement degree calculation unit 173 calculates the average h̄(t) (see FIG. 5) of the excitement degrees of all pairs of speakers. Using the threshold stored in the unit itself, the excitement degree calculation unit 173 generates an image representing the time change of the excitement degree hxy(t) for each pair, as shown in FIG. 5, and an image representing the time change of the average h̄(t) of the excitement degrees, and outputs the generated images to the analysis result output unit 18.
[0057]
The graph generation unit 174 generates a weighted complete graph by a known method using
the degree of excitement h xy (t) output from the degree of excitement calculation unit 173 and
the identification information. The graph generation unit 174 generates the adjacency matrix N
of the generated graph by Equation (5), and outputs the generated adjacency matrix N to the
contribution degree calculation unit 175.
[0058]
The contribution degree calculation unit 175 calculates the eigenvector R at each predetermined time by equation (6), using the adjacency matrix N output from the graph generation unit 174. The contribution degree calculation unit 175 estimates the central person of the conversation at each predetermined time based on the calculated eigenvector R, and outputs the estimation result (for example, FIG. 10) to the analysis result output unit 18. An example of the estimation result will be described later.
[0059]
The analysis result output unit 18 outputs, as the analysis result, at least one of the images output by the excitement degree calculation unit 173 and the estimation result output by the contribution degree calculation unit 175 to an external device (not shown) connected to the conversation processing device 1, a display unit (not shown), a printer (not shown) connected to the conversation processing device 1, or the like.
[0060]
<Process Performed by Conversation Processing Device 1> Next, an example of a processing procedure performed by the conversation processing device 1 will be described.
FIG. 8 is a flowchart showing an example of the processing performed by the conversation processing device 1 according to the present embodiment. In the following process, the voice signals of the conference have already been acquired, and the source-separated voice signal and identification information of each speaker are recorded in the conversation recording unit 15. The following processing is performed after the user issues an analysis instruction for the audio signals of the conference via the operation unit 16.
[0061]
(Step S1) In accordance with the conference selection information included in the operation information output from the operation unit 16, the sound source selection unit 171 reads out the audio signals and identification information of the designated conference from among those recorded in the conversation recording unit 15. Subsequently, the sound source selection unit 171 selects two arbitrary voice signals (the voice signals of two speakers) from the read voice signals, covering all pairs determined by the number of pieces of identification information.
[0062]
(Step S2) The volume normalization unit 172 normalizes the volume of the two speech signals selected by the sound source selection unit 171 by calculating the functions f1(t) and f2(t) for each speaker using the above-mentioned equation (4).
[0063]
(Step S3) The excitement degree calculation unit 173 estimates the degree of excitement hxy(t) between the two speakers by solving the heat equation of equation (2) using the functions f1(t) and f2(t) calculated by the volume normalization unit 172 and the boundary conditions of equation (3) described above.
Subsequently, the excitement degree calculation unit 173 calculates the average h̄(t) of the excitement degrees of all pairs of speakers and, using the threshold stored in the unit itself, generates an image representing the time change of the excitement degree hxy(t) for each pair and an image representing the time change of the average h̄(t) of the excitement degrees.
[0064]
(Step S4) The sound source selection unit 171 determines whether all pairs have been selected in
step S1. If the sound source selection unit 171 determines that selection of all pairs is completed
(step S4; YES), the process proceeds to step S5, and if it is determined that selection of all pairs is
not completed (step S4; NO), the process returns to step S1.
[0065]
(Step S5) The graph generation unit 174 generates a weighted complete graph by a known method using the degrees of excitement hxy(t) estimated by the excitement degree calculation unit 173 and the identification information.
[0066]
(Step S6) The contribution degree calculation unit 175 calculates the eigenvector R by equation (6) at each predetermined time, using the adjacency matrix N generated by the graph generation unit 174 for every pair of speakers and for each predetermined time. Subsequently, the contribution degree calculation unit 175 estimates the central person of the conversation at each predetermined time based on the calculated eigenvector R. The analysis result output unit 18 then outputs, as the analysis result, at least one of the information indicating the central person of the conversation at each predetermined time estimated by the contribution degree calculation unit 175 and the images generated by the excitement degree calculation unit 173 to an external device (not shown) or the like. With this, the processing performed by the conversation processing device 1 ends.
[0067]
<Experimental Results> Next, an example of experimental results obtained using the conversation processing device 1 of the present embodiment will be described. The experiment was conducted by recording a conference in which three speakers participated. First, an example of the result of changing the value of the conversation diffusion rate D in the above-mentioned equation (2) will be described. FIG. 9 is a diagram showing an example of the time change of the excitement degree hxy(t) when the value of the conversation diffusion rate D according to the present embodiment is changed. In FIG. 9, the horizontal axis represents time and the vertical axis represents the degree of excitement. In the example shown in FIG. 9, the curve g16 is an example in which the conversation diffusion rate D has a value of 1, and the curve g17 is an example in which the conversation diffusion rate D has a value of 20. As shown in FIG. 9, the smaller the value of the conversation diffusion rate D, the smoother the curve of the time change of the degree of excitement hxy(t). The value of the conversation diffusion rate D and the threshold used to determine whether the conference is lively may be set in advance by the user of the conversation processing device 1. Alternatively, the time change of the excitement degree hxy(t) as shown in FIG. 9 may be displayed on a display unit (not shown) connected to the conversation processing device 1, and the user may set the conversation diffusion rate D by operating the operation unit 16 while looking at the displayed image. In this case, for example, the correspondence between the value of the conversation diffusion rate D and the threshold may be stored in the contribution degree calculation unit 175.
[0068]
Next, an example of the estimation result obtained by the contribution degree calculation unit 175 for a conversation among three speakers will be described. FIG. 10 is a diagram illustrating an example of the estimation result obtained by the contribution degree calculation unit 175 for a conversation among three speakers according to the present embodiment. In FIG. 10, the horizontal axis represents time and the vertical axis represents the speech contribution degree C. Further, in FIG. 10, the curve g21 represents the speech contribution degree CA of the speaker with identification information A, the curve g22 represents the speech contribution degree CB of the speaker with identification information B, and the curve g23 represents the speech contribution degree CC of the speaker with identification information C.
[0069]
In the example shown in FIG. 10, the speech contribution degree C is highest for speaker B corresponding to identification information B, next highest for speaker A corresponding to identification information A, and lower for speaker C corresponding to identification information C than for the other two speakers. Also, in the example shown in FIG. 10, the speech contribution degree CA of speaker A is high at the beginning of the meeting, but the speech contribution degree CB of speaker B then becomes higher than CA and remains high thereafter. Using an estimation result such as that shown in FIG. 10 output from the conversation processing device 1, the user can know how the central person of the conference changed over time and who the central person was who enlivened the conference as a whole.
[0070]
Note that the example of the estimation result shown in FIG. 10 is an example, and the way of
expressing the estimation result is not limited to this. For example, the analysis result output unit
18 may display the estimation result in a three-dimensional image by arranging an image of
change for each speaker with time on the horizontal axis and the speech contribution degree C
on the vertical axis.
[0071]
As described above, the conversation processing device 1 according to the present embodiment includes the sound collection unit 11 that collects the voice signals of a plurality of speakers, the conversation recording unit 15 that records the voice signals of the speakers collected by the sound collection unit, and the conversation analysis unit 17 that analyzes the conversation between any two speakers from the voice signals recorded in the conversation recording unit and calculates the degree of excitement in the conversation between the two speakers. With this configuration, according to the present embodiment, it is possible to measure the degree of excitement between speakers using the voice signals of any two persons.
[0072]
Further, in the conversation processing device 1 according to the present embodiment, the degree of excitement is based on the influence that each of the two arbitrary speakers exerts on the conversation over time. Further, in the conversation processing device 1 of the present embodiment, the conversation analysis unit 17 calculates the degree of excitement of any two speakers using a heat equation (for example, equation (2)). With this configuration, in the present embodiment, when the two extracted speakers speak, each utterance is injected into the heat propagation model in time series, and the degree of excitement can be calculated by solving the heat equation of this heat propagation model.
[0073]
Further, in the conversation processing device 1 of the present embodiment, the conversation analysis unit 17 selects all pairs of two arbitrary speakers from the plurality of speakers, calculates the degree of excitement for each pair, generates a weighted complete graph composed of nodes and edges using the calculated degree of excitement of each pair as the weight of the corresponding edge, and estimates the central person of the conversation based on the generated weighted complete graph. With this configuration, according to the present embodiment, it is possible to estimate the central person of the conversation using the weighted complete graph.
[0074]
Further, in the conversation processing device 1 of the present embodiment, the conversation analysis unit 17 normalizes the volume of the audio signals in the conversation between any two speakers. In addition, the conversation analysis unit 17 normalizes the volume of the voice signals in the conversation between the two arbitrary speakers based on the length of each single utterance. With this configuration, according to the present embodiment, even if the volume differs between the two selected speakers, the influence of the volume difference can be reduced by the normalization. Further, by normalizing the volume in consideration of the utterance time of each speaker, the degree of excitement can be calculated appropriately even when the utterance of one speaker becomes long.
[0075]
Further, in the conversation processing device 1 of the present embodiment, the conversation analysis unit 17 calculates the degree of excitement of any two speakers at every predetermined time. With this configuration, according to the present embodiment, since the degree of excitement is calculated at every predetermined time, the time change of the degree of excitement in a conference can be provided.
[0076]
Further, in the conversation processing device 1 of the present embodiment, the conversation analysis unit 17 determines that the conversation has become lively when the degree of excitement of any two speakers is larger than a predetermined threshold. With this configuration, according to the present embodiment, it can be determined whether or not the conference has become lively.
[0077]
In addition, the conversation processing device 1 according to the present embodiment includes the sound source localization unit 13 that localizes sound source positions using the audio signals collected by the sound collection unit 11, and the sound source separation unit 14 that separates the audio signal of each speaker based on the localization result of the sound source localization unit, and the conversation recording unit 15 records the voice signal of each speaker separated by the sound source separation unit. With this configuration, for example, the sound source separation unit 14 performs sound source separation on the audio signals collected by a microphone array, and the degree of excitement between speakers can be measured using the audio signals of any two of the separated speakers.
[0078]
The conversation processing device 1 described in the present embodiment may be applied to, for example, an IC recorder or a minutes generation device. Furthermore, the conversation processing device 1 may be realized by installing an application that performs the functions of the conversation processing device 1 on a smartphone, a tablet terminal, or the like.
[0079]
Note that calculation of the speech contribution degree, estimation of the central person of the conversation, and the like may be performed by recording a program for realizing the functions of the conversation processing device 1 of the present invention on a computer-readable recording medium and causing a computer system to read and execute the program recorded on the recording medium. Here, the "computer system" includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes a medium that holds the program for a certain period of time, such as a volatile memory (RAM) inside a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0080]
The program may be transmitted from a computer system in which the program is stored in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" for transmitting the program is a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. Further, the program may realize only a part of the functions described above. Furthermore, it may be a so-called difference file (difference program) that realizes the above-described functions in combination with a program already recorded in the computer system.
[0081]
DESCRIPTION OF SYMBOLS 1: conversation processing device, 11: sound collection unit, 12: audio signal acquisition unit, 13: sound source localization unit, 14: sound source separation unit, 15: conversation recording unit, 16: operation unit, 17: conversation analysis unit, 18: analysis result output unit, 171: sound source selection unit, 172: volume normalization unit, 173: excitement degree calculation unit, 174: graph generation unit, 175: contribution degree calculation unit, hAB, hAC, hBC, hxy: degree of excitement, h̄: average degree of excitement, C, CA, CB, CC: speech contribution degree