Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016045389
Abstract: An evaluation test simulating a conversational MOS test in a loudspeaker communication system is performed with a small number of operations. Call quality is evaluated using data having a data structure that includes data of a first channel including a first acoustic signal at a first end of the system, and data of a second channel including a superimposed signal based on a signal derived from the first acoustic signal and a second acoustic signal at a second end of the system. [Selected figure] Figure 1
Data structure, data generation apparatus, data generation method, and program
[0001]
The present invention relates to a technology for evaluating speech quality, and more
particularly to a quality evaluation test technology for a loudspeaker communication system.
[0002]
In order to subjectively evaluate the voice quality of an echo canceller, a conversational MOS (Mean Opinion Score) test using real devices is essential (see, for example, Non-Patent Document 1).
[0003]
Ryo Takahashi, Atsushi Kurashima, Hitoshi Aoki, "Integrated Speech Quality Estimation Technology for Broadband Voice Communication Services," NTT Technical Journal, February 2006, pp. 60-63 (2006)
[0004]
However, since acquiring a conversational MOS requires a great deal of work, it is difficult to carry out such subjective evaluation. This problem is not limited to the subjective evaluation of the voice quality of an echo canceller; it is common to any subjective evaluation of voice quality in a loudspeaker communication system.
[0005]
An object of the present invention is to conduct an evaluation test simulating a conversational MOS test in a loudspeaker communication system with a small number of operations.
[0006]
The present invention provides a data structure for quality evaluation, the data structure having data of a first channel including a first acoustic signal at a first end of a system, and data of a second channel including a superimposed signal in which a signal derived from the first acoustic signal is superimposed on a second acoustic signal at a second end of the system. When the data of the first channel and the data of the second channel are read into a sound quality evaluation apparatus, the sound represented by the data of the first channel is reproduced from one channel of a binaural sound reproducing apparatus, the sound represented by the data of the second channel is reproduced from the other channel of the binaural sound reproducing apparatus, and information indicating the evaluation result is input to the sound quality evaluation apparatus.
[0007]
By using data having such a data structure, it is possible to perform an evaluation test simulating a conversational MOS test in a loudspeaker communication system with a small number of operations.
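As an illustration only (not part of the original disclosure), the two-channel data structure described above can be pictured as a pair of synchronized sample arrays per test item. The following is a minimal Python sketch; the class name, field names, and sampling rate are assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class QualityEvaluationItem:
    """One test item of the quality-evaluation data structure.

    rch: data of the first channel, containing the first acoustic signal
         (the near-end speaker's direct sound).
    lch: data of the second channel, containing the superimposed signal
         (far-end speech with wraparound of the near-end speech) or, for
         the reference item, a comparison signal based on the far-end speech.
    """
    rch: np.ndarray           # first-channel samples
    lch: np.ndarray           # second-channel samples
    sample_rate: int = 16000  # assumed sampling frequency

    def as_stereo(self) -> np.ndarray:
        """Interleave the two channels for binaural (stereo) playback:
        column 0 carries Lch, column 1 carries Rch."""
        n = min(len(self.rch), len(self.lch))
        return np.stack([self.lch[:n], self.rch[:n]], axis=1)
```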
[0008]
FIG. 1 is a block diagram illustrating the functional configuration of the data generation apparatus of the first embodiment.
FIG. 2 is a conceptual diagram for explaining the data structure generated by the data generation device of the first embodiment.
FIG. 3 is a diagram illustrating the data structure generated by the data generation device of the first embodiment.
FIG. 4 is a block diagram illustrating the functional configuration of the data generation device of the second embodiment.
FIG. 5A is a block diagram illustrating the communication environment simulation processing unit of FIG. 4.
FIG. 5B is a block diagram illustrating the signal processing unit of FIG. 4.
FIG. 6 is a block diagram illustrating the functional configuration of the sound quality evaluation apparatus of the third embodiment.
FIG. 7 is a view exemplifying the display contents in the sound quality evaluation test of the third embodiment.
FIGS. 8 to 12 are diagrams illustrating an acoustic quality evaluation method.
[0009]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. First Embodiment <Evaluation Test Simulating a Conversational MOS Test in a Loudspeaker Communication System> First, an evaluation test simulating a conversational MOS test in a loudspeaker communication system is described conceptually. In this evaluation test, a near-end speaker and a far-end speaker talk through the loudspeaker communication system, and an evaluator located on the near-end speaker side evaluates the quality of the loudspeaker communication system. A loudspeaker communication system is a communication system for transmitting and receiving acoustic signals between terminal devices each provided with a microphone and a speaker, in which at least part of the sound output from the speaker of a terminal device is received by the microphone of that terminal device (which causes wraparound of sound). Examples of a loudspeaker communication system are an audio conference system and a video conference system.
[0010]
In the loudspeaker communication system illustrated in FIG. 2, the near-end speaker's voice is received by the microphone on the near-end speaker side, an acoustic signal obtained based on it is transmitted to the far-end speaker side via the network, and the sound represented by that acoustic signal is output from the speaker on the far-end speaker side. Likewise, the sound on the far-end speaker side is received by the microphone on the far-end speaker side, an acoustic signal obtained based on it is transmitted to the near-end speaker side via the network, and the sound represented by that acoustic signal is output from the speaker on the near-end speaker side. However, at least part of the sound output from the speaker on the far-end speaker side is also received by the microphone on the far-end speaker side. That is, the sound received by the microphone on the far-end speaker side is the far-end speaker's voice with the wraparound (acoustic echo) of the near-end speaker's voice superimposed on it. Further, the acoustic signal transmitted to the near-end speaker side may be derived from a processed signal obtained by performing predetermined "signal processing" on the signal representing the sound received by the microphone on the far-end speaker side, or may be obtained without performing such signal processing. The "signal processing" may be any process; an example of "signal processing" is processing including at least one of echo cancellation processing and noise cancellation processing.
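Purely as an illustration of the wraparound described above (not the patent's apparatus), a minimal sketch of the mixture at the far-end microphone could look like the following; the delay and gain values are assumptions.

```python
import numpy as np

def far_end_microphone_mix(near_voice: np.ndarray,
                           far_voice: np.ndarray,
                           echo_delay: int,
                           echo_gain: float = 0.3) -> np.ndarray:
    """Illustrative mix at the far-end microphone: the far-end voice plus
    the wraparound (acoustic echo) of the near-end voice, delayed by the
    transmission and room path (echo_delay samples, assumed)."""
    n = max(len(far_voice), len(near_voice) + echo_delay)
    mix = np.zeros(n)
    mix[:len(far_voice)] += far_voice
    mix[echo_delay:echo_delay + len(near_voice)] += echo_gain * near_voice
    return mix
```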
[0011]
The evaluator uses a binaural sound reproducing apparatus such as headphones or earphones to listen to the direct sound from the near-end speaker with one ear (for example, the non-dominant ear, e.g., the right ear) and to the sound output from the speaker on the near-end speaker side with the other ear (for example, the dominant ear, e.g., the left ear), and subjectively evaluates the call quality (opinion evaluation). In this embodiment, the channel on the side of the direct sound from the near-end speaker is denoted "Rch", and the channel on the side of the sound output from the speaker on the near-end speaker side is denoted "Lch". As described above, the sound output from the speaker on the near-end speaker side is the far-end speaker's voice with the acoustic echo of the near-end speaker's voice superimposed on it; the acoustic signal received by the microphone on the far-end speaker side and obtained based on that sound is transmitted to the near-end speaker side and output from the speaker on the near-end speaker side. Therefore, the acoustic echo component of the near-end speaker's voice contained in the sound output from the speaker on the near-end speaker side is delayed relative to the direct sound of the near-end speaker's voice (by the time the acoustic signal takes for one round trip between the near-end speaker side and the far-end speaker side). Also, the component of the far-end speaker's voice contained in the sound output from the speaker on the near-end speaker side is delayed relative to the moment the far-end speaker's voice is uttered (by the time the acoustic signal takes to be transmitted from the far-end speaker side to the near-end speaker side). Here, the set of an acoustic signal representing the direct sound from the near-end speaker and an acoustic signal representing the sound output from the speaker on the near-end speaker side when there is wraparound of sound on the far-end speaker side is called a "deterioration signal". In particular, the "deterioration signal" not subjected to the above-mentioned "signal processing" is referred to as "deterioration signal D1", and the "deterioration signal" subjected to the "signal processing" is referred to as "deterioration signal D2". Also, for reference, the set of an acoustic signal representing the direct sound from the near-end speaker and an acoustic signal representing the sound output from the speaker on the near-end speaker side on the assumption that there is no wraparound of sound on the far-end speaker side is referred to as a "reference signal". The evaluator subjectively evaluates the speech quality by comparing, for example, any of "deterioration signal D1", "deterioration signal D2", and the "reference signal" with one another.
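As an illustration only, the three test items just defined can be represented as pairs of synchronized Rch/Lch arrays, with Rch intended for the non-dominant ear and Lch for the dominant ear. The helper name below is hypothetical.

```python
import numpy as np

def make_test_items(near_direct: np.ndarray,
                    lch_reference: np.ndarray,
                    lch_degraded_d1: np.ndarray,
                    lch_degraded_d2: np.ndarray) -> dict:
    """Illustrative grouping of the three evaluation items: each item is
    an (Rch, Lch) pair sharing the same near-end direct-sound channel."""
    return {
        "reference signal":        (near_direct, lch_reference),
        "deterioration signal D1": (near_direct, lch_degraded_d1),  # no signal processing
        "deterioration signal D2": (near_direct, lch_degraded_d2),  # with signal processing
    }
```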
[0012]
<Data Generation Device> Next, a data generation device that generates the data structure for performing an evaluation test simulating a conversational MOS test in a loudspeaker communication system is exemplified. As exemplified in FIG. 1, the data generation device 1 according to the present embodiment includes a near-end speaker acoustic signal storage unit 101, a far-end speaker acoustic signal storage unit 102, reproduction units 103 and 104, speakers 105 and 106, a microphone 107, a time adjustment processing unit 108, a recording processing unit 109, a near-end terminal unit 110, a far-end terminal unit 120, output units 131, 132, 141, 142, 151, and 152, and a data storage unit 180. The far-end terminal unit 120 includes a signal processing unit 121, and the near-end terminal unit 110 and the far-end terminal unit 120 are configured to be able to communicate via a network (NW). At least the speakers 105 and 106 and the microphone 107 are disposed in the same room. The data generation apparatus 1 is, for example, a device configured by one or more general-purpose or special-purpose computers, each having a processor (hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory) or a ROM (read-only memory), connected to a speaker and a microphone and executing a predetermined program. Each computer may have a single processor and memory, or may have a plurality of processors and memories. The program may be installed in the computer or may be stored in advance in a ROM or the like. Also, some or all of the processing units may be configured using an electronic circuit that realizes the processing function by itself, instead of an electronic circuit (circuitry), such as a CPU, that realizes the functional configuration by reading a program. Further, an electronic circuit constituting one device may include a plurality of CPUs.
[0013]
<Data Generation Process> Next, the data generation process of this embodiment will be described. As pre-processing, data of a near-end speaker acoustic signal (the first acoustic signal at the first end of the system) representing a sound corresponding to the direct sound of the near-end speaker (the speech of the near-end speaker) heard by the evaluator is stored in the near-end speaker acoustic signal storage unit 101, and data of a far-end speaker acoustic signal (the second acoustic signal at the second end of the system) representing a sound corresponding to the direct sound of the far-end speaker (the speech of the far-end speaker) is stored in the far-end speaker acoustic signal storage unit 102. The near-end speaker acoustic signal and the far-end speaker acoustic signal of the present embodiment are both time-series acoustic signals and are obtained, for example, based on voices recorded in a soundproof room. However, this does not limit the present invention, and at least one of the near-end speaker acoustic signal and the far-end speaker acoustic signal may be recorded in a normal indoor environment. Further, in this embodiment there is no restriction on the speech timing between the near-end speaker's voice represented by the near-end speaker acoustic signal and the far-end speaker's voice represented by the far-end speaker acoustic signal (that is, on the relative time at which the far-end speaker's voice is uttered with respect to the time at which the near-end speaker's voice is uttered, for example, the overlap between the near-end speaker's voice and the far-end speaker's voice). However, this does not limit the present invention, and any restriction may be placed on the speech timing between the near-end speaker's voice and the far-end speaker's voice. In addition, there is no restriction on who the near-end speaker and the far-end speaker are; they may be persons other than the evaluator, or at least one of them may be the same person as the evaluator.
[0014]
Based on the above premise, the data structure for performing the above-mentioned evaluation test is generated as follows. The reproduction unit 103 extracts the data of the near-end speaker acoustic signal from the near-end speaker acoustic signal storage unit 101 and outputs the near-end speaker acoustic signal. The near-end speaker acoustic signal output from the reproduction unit 103 is sent to the output units 131, 141, and 151 and to the near-end terminal unit 110. The output units 131, 141, and 151 respectively output the sent near-end speaker acoustic signal (the first acoustic signal at the first end of the system) as the Rch data of "deterioration signal D1", "deterioration signal D2", and "reference signal" (the data of the first channel including the first acoustic signal at the first end of the system). In addition, the near-end terminal unit 110 transmits the sent near-end speaker acoustic signal to the far-end terminal unit 120 via the network. The far-end terminal unit 120 sends the transmitted near-end speaker acoustic signal (a signal derived from the first acoustic signal) to the speaker 105, and the speaker 105 outputs the sound represented by that near-end speaker acoustic signal (a reproduction signal derived from the first acoustic signal sent to the second end of the system).
[0015]
The reproduction unit 104 extracts the data of the far-end speaker acoustic signal from the far-end speaker acoustic signal storage unit 102 and outputs the far-end speaker acoustic signal. The far-end speaker acoustic signal output from the reproduction unit 104 is sent to the time adjustment processing unit 108 and the speaker 106. The time adjustment processing unit 108 delays the sent far-end speaker acoustic signal and sends it to the output unit 152. The delay amount τ in the time adjustment processing unit 108 simulates the transmission delay amount B from the far-end terminal unit 120 to the near-end terminal unit 110 and is determined, for example, based on the transmission delay amount B. For example, the transmission delay amount B from the far-end terminal unit 120 to the near-end terminal unit 110, a predicted value of the transmission delay amount B, an average value of the transmission delay amount B, or any function value thereof such as an approximate value or a correction value is used as the delay amount τ in the time adjustment processing unit 108. Here, "an approximate value of α" means a value belonging to the range from α − β1 to α + β2, where β1 and β2 are positive values (for example, constants), and either β1 = β2 or β1 ≠ β2 is acceptable. Further, the transmission delay amount B is approximately half of a round-trip delay amount C (the time it takes for the near-end speaker acoustic signal to be transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, for the sound representing it to be output from the speaker 105, and for the signal obtained by receiving that sound to be transmitted from the far-end terminal unit 120 to the near-end terminal unit 110). Therefore, the delay amount τ may be determined based on the delay amount C; for example, half the delay amount C, half a predicted value of the delay amount C, half an average value of the delay amount C, or any function value thereof may be used as the delay amount τ. The delay amount τ may be a fixed value or may be determined based on the actually measured transmission delay amount B. However, depending on the network environment, the delay amounts of the forward path and the return path may differ. Also, if the near-end terminal unit 110, the far-end terminal unit 120, the signal processing unit 121, or the network environment changes, the transmission delay amount B and the delay amount C change, so it is desirable that the delay amount τ be determined in accordance with such changes. The output unit 152 outputs the far-end speaker acoustic signal delayed by the time adjustment processing unit 108 (the reference acoustic signal; a second comparison signal based on the second acoustic signal) as the Lch data of the "reference signal" (data of the second channel including the second comparison signal).
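As an illustration only of the time adjustment just described, a minimal sketch is shown below; it assumes the delay is expressed in samples and simply sets τ to roughly half of a measured round-trip delay C, which is one of the options named above.

```python
import numpy as np

def delay_samples(signal: np.ndarray, delay: int) -> np.ndarray:
    """Delay a time-series signal by `delay` samples (zero-padded)."""
    return np.concatenate([np.zeros(delay), signal])

def reference_lch(far_voice: np.ndarray, round_trip_delay_c: int) -> np.ndarray:
    """Illustrative time adjustment: choose tau to be about half of the
    round-trip delay C (tau approximately equals B) and delay the far-end
    speaker signal by tau to form the Lch reference acoustic signal."""
    tau = round_trip_delay_c // 2
    return delay_samples(far_voice, tau)
```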
[0016]
The speaker 106 outputs the sound represented by the sent far-end speaker acoustic signal (the second acoustic signal at the second end of the system), that is, a reproduction signal derived from the second acoustic signal at the second end. The sound output from the speaker 105 and the sound output from the speaker 106 are superimposed in the indoor space and received by the microphone 107. The sound reception signal obtained by the microphone 107 (a signal based on the signal derived from the first acoustic signal and the second acoustic signal) is sent to the signal processing unit 121 of the far-end terminal unit 120. The signal processing unit 121 can control whether or not signal processing is performed on the received sound reception signal. When the signal processing is performed, the signal processing unit 121 performs the signal processing on the received sound reception signal to obtain a processed signal, and the far-end terminal unit 120 transmits the processed signal to the near-end terminal unit 110 (the first end side) via the network. The near-end speaker acoustic signal transmitted from the near-end terminal unit 110 to the far-end terminal unit 120 via the network (the near-end speaker acoustic signal input to the speaker 105) may be used for this signal processing. On the other hand, when the signal processing is not performed, the far-end terminal unit 120 transmits the sound reception signal sent to the signal processing unit 121 to the near-end terminal unit 110 (the first end side) via the network as it is. Further, the signal processing unit 121 sends, for example, information representing the presence or absence of the signal processing to the recording processing unit 109. Alternatively, the signal processing unit 121 may perform the signal processing on the sent sound reception signal to obtain a processed signal, the far-end terminal unit 120 may transmit the processed signal to the near-end terminal unit 110 via the network, and in addition the same sound reception signal as used for the signal processing, or a sound reception signal obtained under the same conditions, may be transmitted to the near-end terminal unit 110 via the network. That is, the series of processes for the case in which the signal processing is performed may be applied to one of two sound reception signals that are identical or can be regarded as identical, and the series of processes for the case in which the signal processing is not performed may be applied to the other. Here, "the same conditions" means that at least the data generation device 1, the near-end speaker acoustic signal, the far-end speaker acoustic signal, and the speech timing are the same. The "signal processing" may be any processing; an example of "signal processing" is processing including at least one of echo cancellation processing and noise cancellation processing. The echo cancellation processing refers to processing by an echo canceller in the broad sense for reducing an echo, where processing by an echo canceller in the broad sense means the whole of the processing for reducing the echo. The processing by the broad-sense echo canceller may be realized, for example, only by a narrow-sense echo canceller using an adaptive filter, by a voice switch, by echo reduction, by a combination of at least some of these techniques, or by a combination with other techniques (see, for example, "Knowledge Base 'Knowledge Forest', group 2-6, section 5, Acoustic Echo Canceller," The Institute of Electronics, Information and Communication Engineers). The noise cancellation processing means processing for suppressing or removing noise components generated around the far-end terminal microphone due to environmental noise other than the far-end speaker's voice. Environmental noise refers to, for example, the air-conditioning sound of an office, the sound inside a car while traveling, the sound of cars passing at an intersection, the sound of insects, keyboard typing, and the voices of multiple people (babble); its loudness and whether it occurs indoors or outdoors do not matter.
[0017]
The signal transmitted from the far-end terminal unit 120 via the network (a superimposed signal based on the signal derived from the first acoustic signal and the second acoustic signal at the second end of the system) is input to the near-end terminal unit 110 and sent to the recording processing unit 109. Here, when the signal processing unit 121 is executing the signal processing (when the signal processing is ON), the recording processing unit 109 sends the transmitted signal (a superimposed signal derived from the processed signal obtained by performing the signal processing on the signal based on the signal derived from the first acoustic signal and the second acoustic signal) to the output unit 142. The output unit 142 outputs the sent signal (evaluation target acoustic signal T2) as the Lch data of "deterioration signal D2" (data of the second channel including the superimposed signal). On the other hand, when the signal processing unit 121 does not execute the signal processing (when the signal processing is OFF), the recording processing unit 109 sends the transmitted signal (a first comparison signal obtained by sending the sound reception signal to the first end) to the output unit 132. The output unit 132 outputs the sent signal (evaluation target acoustic signal T1) as the Lch data of "deterioration signal D1" (data of the second channel including the superimposed signal).
[0018]
The combination of the data of the Rch near-end speaker acoustic signal output from the output unit 131 and the data of the Lch evaluation target acoustic signal T1 output from the output unit 132 is stored in the data storage unit 180 as "deterioration signal D1". The combination of the data of the Rch near-end speaker acoustic signal output from the output unit 141 and the data of the Lch evaluation target acoustic signal T2 output from the output unit 142 is stored in the data storage unit 180 as "deterioration signal D2". The set of the data of the Rch near-end speaker acoustic signal output from the output unit 151 and the data of the Lch reference acoustic signal output from the output unit 152 is stored in the data storage unit 180 as the "reference signal". The Rch near-end speaker acoustic signals of "deterioration signal D1", "deterioration signal D2", and "reference signal" corresponding to the same time interval are identical to each other. Therefore, it is not necessary to store the data of the same Rch near-end speaker acoustic signal in the data storage unit 180 separately for each of "deterioration signal D1", "deterioration signal D2", and "reference signal". Of course, the data of the same Rch near-end speaker acoustic signal may be stored in the data storage unit 180 for each of "deterioration signal D1", "deterioration signal D2", and "reference signal".
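As an illustration only of the storage remark above (shared Rch data rather than three duplicates), a sketch follows; the dictionary layout and key names are assumptions.

```python
import numpy as np

def store_data_structure(rch_near_end: np.ndarray,
                         lch_t1: np.ndarray,
                         lch_t2: np.ndarray,
                         lch_reference: np.ndarray) -> dict:
    """Illustrative layout: the identical Rch near-end signal is stored
    once and each test item refers to it, while each item keeps its own
    Lch data (T1, T2, or the reference acoustic signal)."""
    storage = {"rch_near_end": rch_near_end}
    storage["deterioration signal D1"] = {"rch": "rch_near_end", "lch": lch_t1}
    storage["deterioration signal D2"] = {"rch": "rch_near_end", "lch": lch_t2}
    storage["reference signal"] = {"rch": "rch_near_end", "lch": lch_reference}
    return storage
```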
[0019]
The reference signal, the deterioration signal D1, and the deterioration signal D2 obtained as described above are illustrated in FIG. 3. In the example of FIG. 3, the series of processes for the case of performing the signal processing is applied to one of the two sound reception signals that are identical or can be regarded as identical, as described above, and the series of processes for the case of not performing the signal processing is applied to the other, so that both "deterioration signal D2" with the signal processing performed and "deterioration signal D1" without the signal processing are obtained. Further, in the example of FIG. 3, processing including echo cancellation processing is used as the "signal processing".
[0020]
The data structure of the "reference signal" in the present embodiment includes Rch data including the above-described near-end speaker acoustic signal (data of the first channel including the first acoustic signal at the first end of the system) and Lch data including the reference acoustic signal based on the above-described far-end speaker acoustic signal (data of the second channel including the second comparison signal based on the second acoustic signal at the second end). The data structure of "deterioration signal D1" of this embodiment includes Rch data including the above-described near-end speaker acoustic signal (data of the first channel including the first acoustic signal at the first end of the system) and Lch data including the above-described evaluation target acoustic signal T1 (data of the second channel including the superimposed signal based on the signal derived from the first acoustic signal and the second acoustic signal at the second end of the system). The evaluation target acoustic signal T1 is a "first comparison signal" obtained without performing the signal processing. The data structure of "deterioration signal D2" of this embodiment includes Rch data including the above-described near-end speaker acoustic signal (data of the first channel including the first acoustic signal at the first end of the system) and Lch data including the above-described evaluation target acoustic signal T2 (data of the second channel including the superimposed signal derived from the processed signal obtained by performing the signal processing on the signal based on the signal derived from the first acoustic signal and the second acoustic signal). Note that "Lch data including evaluation target acoustic signal T1" and "Lch data including evaluation target acoustic signal T2" both correspond to "data of the second channel including a superimposed signal based on the signal derived from the first acoustic signal and the second acoustic signal at the second end of the system". In particular, among such data including the "superimposed signal", the "Lch data including evaluation target acoustic signal T2" is data derived from the processed signal obtained by performing the signal processing on the signal based on the signal derived from the first acoustic signal and the second acoustic signal.
[0021]
As illustrated in FIG. 3, the time interval a-b of the Rch data of the "reference signal", "deterioration signal D1", and "deterioration signal D2" contains the near-end speaker acoustic signal (first acoustic signals identical to each other). The acoustic echo component of the near-end speaker acoustic signal is contained in the time interval e-d' of the Lch data of "deterioration signal D1" and "deterioration signal D2". The acoustic echo component is a signal derived from the above-described near-end speaker acoustic signal (a signal derived from the first acoustic signal), but it is delayed relative to the near-end speaker acoustic signal by the time interval a-e (delay amount C). The delay amount C corresponds to the time it takes for the near-end speaker acoustic signal to be transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, for the sound representing it to be output from the speaker 105 and received by the microphone 107, and further for the resulting signal to be transmitted from the far-end terminal unit 120 to the near-end terminal unit 110.
[0022]
The time interval c-d of the Lch data of the "reference signal" contains the far-end speaker acoustic signal component based on the far-end speaker acoustic signal (the 2-2 component based on the second acoustic signal); the far-end speaker acoustic signal component based on the far-end speaker acoustic signal (the 2-1 component based on the second acoustic signal) is superimposed in the time interval c'-d' of the Lch data of "deterioration signal D1"; and the far-end speaker acoustic signal component based on the far-end speaker acoustic signal (the first component based on the second acoustic signal) is superimposed in the time interval c'-d' of the Lch data of "deterioration signal D2". There is a time difference a-c' from the start point a of the Rch near-end speaker acoustic signal of "deterioration signal D1" and "deterioration signal D2" to the start point c' of the Lch far-end speaker acoustic signal component. Further, there is a time difference a-c from the start point a of the Rch near-end speaker acoustic signal of the "reference signal" to the start point c of the Lch far-end speaker acoustic signal component. Here, the time difference a-c' in "deterioration signal D1" and "deterioration signal D2" corresponds to the sum A + B of the time difference A between the start timing of the near-end speaker acoustic signal and the start timing of the far-end speaker acoustic signal and the transmission delay amount B until the signal is transmitted from the far-end terminal unit 120 to the near-end terminal unit 110. On the other hand, the time difference a-c in the "reference signal" corresponds to the sum A + τ of the time difference A and the delay amount τ in the time adjustment processing unit 108. As described above, since the delay amount τ is determined based on the transmission delay amount B, the delay amount τ and the transmission delay amount B match or approximate each other, and therefore the time difference a-c can be made to match or approximate the time difference a-c'. In an evaluation test using such a data structure, the time from the output of the near-end speaker acoustic signal on Rch of "deterioration signal D2" to the output of the far-end speaker acoustic signal component on Lch and the time from the output of the near-end speaker acoustic signal on Rch of the "reference signal" to the output of the far-end speaker acoustic signal component on Lch can be made to match or approximate each other. Similarly, the time from the output of the near-end speaker acoustic signal on Rch of "deterioration signal D1" to the output of the far-end speaker acoustic signal component on Lch and the time from the output of the near-end speaker acoustic signal on Rch of the "reference signal" to the output of the far-end speaker acoustic signal component on Lch can be made to match or approximate each other. Furthermore, the time from the output of the near-end speaker acoustic signal on Rch of "deterioration signal D1" to the output of the far-end speaker acoustic signal component on Lch and the time from the output of the near-end speaker acoustic signal on Rch of "deterioration signal D2" to the output of the far-end speaker acoustic signal component on Lch can be made to match or approximate each other.
That is, the superimposed signal includes the first component based on the second acoustic signal, the comparison signal includes the second component (the 2-1 component or the 2-2 component) based on the second acoustic signal, and the time from the output of the first acoustic signal on the first channel to the output of the first component on the second channel and the time from the output of the first acoustic signal on the first channel to the output of the second component on the second channel can be made to match or approximate each other. Although FIG. 3 exemplifies a situation in which the near-end speaker speaks before the far-end speaker, there are also cases where the far-end speaker speaks before the near-end speaker and the time difference a-c' ≈ 0. For example, when the time difference A between the start timing of the near-end speaker acoustic signal and the start timing of the far-end speaker acoustic signal is equal to the transmission delay amount B until the signal is transmitted from the far-end terminal unit 120 to the near-end terminal unit 110, the time difference a-c' = A − B ≈ 0. Furthermore, when the far-end speaker starts talking earlier than the near-end speaker by more than the transmission delay amount B, the positional relationship of the waveforms is reversed, and the start point c' of the Lch far-end speaker acoustic signal component may be earlier than the start point a of the Rch near-end speaker acoustic signal of "deterioration signal D1" and "deterioration signal D2". Even in such a case, time adjustment can be performed similarly.
[0023]
Further, in the above data structure, the data of the Rch near-end speaker acoustic signal is associated with the data of the Lch reference acoustic signal as the "reference signal", the data of the Rch near-end speaker acoustic signal is associated with the data of the Lch evaluation target acoustic signal T1 as "deterioration signal D1", and the data of the Rch near-end speaker acoustic signal is associated with the data of the Lch evaluation target acoustic signal T2 as "deterioration signal D2". In an evaluation test using such a data structure, it is possible to perform control that outputs the reference acoustic signal on Lch while outputting the near-end speaker acoustic signal on Rch, and control that outputs the evaluation target acoustic signal T1 on Lch while outputting the near-end speaker acoustic signal on Rch. Similarly, it is possible to perform control that outputs the reference acoustic signal on Lch while outputting the near-end speaker acoustic signal on Rch, and control that outputs the evaluation target acoustic signal T2 on Lch while outputting the near-end speaker acoustic signal on Rch. Further, it is possible to perform control that outputs the evaluation target acoustic signal T1 on Lch while outputting the near-end speaker acoustic signal on Rch, and control that outputs the evaluation target acoustic signal T2 on Lch while outputting the near-end speaker acoustic signal on Rch. That is, it is possible to perform control that outputs the comparison signal on the second channel while outputting the first acoustic signal on the first channel, and control that outputs the superimposed signal on the second channel while outputting the first acoustic signal on the first channel.
[0024]
During the evaluation test, the "reference signal", "deterioration signal D1", and "deterioration signal D2" are reproduced in some order. The reproduced sound of the Rch signal of the "reference signal", "deterioration signal D1", and "deterioration signal D2" is output from, for example, the right speaker of the binaural sound reproduction apparatus, and the reproduced sound of the Lch signal is output from, for example, the left speaker of this binaural sound reproduction apparatus (stereo reproduction). The evaluator wears the binaural sound reproducing apparatus on both ears and listens to these stereo reproduced sounds to subjectively evaluate the speech quality. At this time, it is desirable that the evaluator listen to the reproduced sound of the Lch signal with the dominant ear (for example, the left ear) and listen to the reproduced sound of the Rch signal with the non-dominant ear (for example, the right ear). Details of the evaluation test will be described in the third embodiment.
[0025]
Modification of the First Embodiment: In the first embodiment, the far-end speaker acoustic signal delayed by the delay amount τ is used as the Lch reference acoustic signal of the "reference signal". This is to match or approximate, between the "reference signal" on the one hand and "deterioration signal D1" and "deterioration signal D2" on the other, the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) (for example, to match or approximate the time interval a-c in FIG. 3 and the time interval a-c' in FIG. 3). However, this purpose can also be achieved by other means; an illustrative sketch follows this paragraph. For example, the far-end speaker acoustic signal output from the reproduction unit 104 may be output from the output unit 152 as the Lch reference acoustic signal of the "reference signal" without being delayed, and a signal obtained by advancing the near-end speaker acoustic signal output from the reproduction unit 103 in time by τ (a time shift opposite to the delay) may be used as the Rch near-end speaker acoustic signal of the "reference signal". Alternatively, the far-end speaker acoustic signal output from the reproduction unit 104 and delayed by τ − T may be output from the output unit 152 as the Lch reference acoustic signal of the "reference signal", and a near-end speaker acoustic signal obtained by advancing the output of the reproduction unit 103 in time by T may be used as the Rch near-end speaker acoustic signal of the "reference signal", where the value of T satisfies, for example, 0 ≤ T ≤ τ. Alternatively, the data structure may be one that allows the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) to be matched or approximated between the "reference signal" and "deterioration signal D1" and "deterioration signal D2" when the evaluation test is performed. For example, the data structure may have the file names of the "reference signal", "deterioration signal D1", and "deterioration signal D2" and time information of the signals constituting them; the data structure may further have information for specifying the delay amount τ. In such a case, the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) need not be matched or approximated between the "reference signal" and "deterioration signal D1" and "deterioration signal D2" as stored in the data storage unit 180. The point is that the data structure should allow the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) to be matched or approximated, in some way, between the "reference signal" and "deterioration signal D1" and "deterioration signal D2". Furthermore, depending on the environment, evaluation tests may be conducted without adjusting the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) between the "reference signal" and "deterioration signal D1" and "deterioration signal D2". In such a case, the data structure may be one in which it is impossible to match or approximate this time interval between the "reference signal" and "deterioration signal D1" and "deterioration signal D2". Also, the data structure may be one in which the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) does not match between "deterioration signal D1" and "deterioration signal D2".
[0026]
Second Embodiment: The second embodiment is a modification of the first embodiment and is a data generation apparatus that electrically simulates the communication environment and the indoor environment and generates the data structure for performing the evaluation test. In the following, differences from the matters already described are mainly explained, and matters already described are simplified using the reference numerals used for them.
[0027]
<Data Generation Device> As illustrated in FIG. 4, the data generation device 2 of the present embodiment includes a near-end speaker acoustic signal storage unit 101, a far-end speaker acoustic signal storage unit 102, a time adjustment processing unit 208, a communication environment simulation processing unit 260, a signal processing unit 270, output units 131, 132, 141, 142, 151, and 152, and a data storage unit 180. The data generation device 2 is, for example, a device configured by one or more general-purpose or dedicated computers capable of processing acoustic signals executing a predetermined program. In addition, some or all of the processing units may be configured using an electronic circuit that realizes the processing function by itself.
[0028]
The communication environment simulation processing unit 260 performs communication environment simulation processing that electrically simulates the communication environment and the surrounding environment (spatial transfer system). The communication environment simulation processing includes at least superimposing a signal obtained by performing processing including a first time adjustment process on the near-end speaker acoustic signal (first acoustic signal) and a signal obtained by performing processing including a second time adjustment process on the far-end speaker acoustic signal (second acoustic signal). Furthermore, the communication environment simulation processing may include a process of superimposing at least one of a pseudo echo and pseudo noise. For example, as illustrated in FIG. 5A, the communication environment simulation processing unit 260 includes time adjustment processing units 264 and 266, a pseudo echo generation unit 265, an addition unit 267, input units 261 and 262, and an output unit 263. Furthermore, the communication environment simulation processing unit 260 may include a pseudo noise source 268. The pseudo noise source 268 simulates arbitrary environmental noise, other than the far-end speaker's voice, generated around the far-end terminal microphone.
[0029]
The signal processing unit 270 performs predetermined signal processing on an input signal and outputs the result. As in the first embodiment, the "signal processing" may be any processing; an example of "signal processing" is processing including at least one of echo cancellation processing and noise cancellation processing. The echo cancellation processing is processing by a broad-sense echo canceller for reducing an echo. For example, as illustrated in FIG. 5B, the signal processing unit 270 includes input units 271 and 272, an output unit 273, an addition unit 274, an adaptive filter 275, and a time adjustment processing unit 276. The signal processing unit 270 may further include a noise estimation unit 278 and a multiplication unit 277. In addition, although the echo canceller is configured using the adaptive filter 275 in FIG. 5B, the echo canceller may be configured using a voice switch, echo reduction, or other techniques, or a combination of these with the adaptive filter 275.
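Purely as an illustration of a narrow-sense echo canceller based on an adaptive filter (one common realization, not necessarily what the adaptive filter 275 implements), a minimal NLMS sketch is shown below; the tap count and step size are assumptions.

```python
import numpy as np

def nlms_echo_cancel(reference: np.ndarray,
                     mic_signal: np.ndarray,
                     taps: int = 256,
                     mu: float = 0.5,
                     eps: float = 1e-6) -> np.ndarray:
    """Minimal NLMS adaptive-filter echo canceller sketch: estimate the
    echo of the reference signal contained in the microphone signal and
    subtract it, returning the residual (echo-cancelled) signal."""
    w = np.zeros(taps)                      # adaptive filter coefficients
    x_buf = np.zeros(taps)                  # most recent reference samples
    out = np.zeros(len(mic_signal))
    for n in range(len(mic_signal)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = reference[n] if n < len(reference) else 0.0
        echo_estimate = w @ x_buf
        e = mic_signal[n] - echo_estimate   # error = echo-cancelled sample
        w += mu * e * x_buf / (x_buf @ x_buf + eps)
        out[n] = e
    return out
```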
[0030]
Next, the data generation process according to the present embodiment will be described. As in the first embodiment, first, as pre-processing, the data of the near-end speaker acoustic signal (first acoustic signal) is stored in the near-end speaker acoustic signal storage unit 101, and the data of the far-end speaker acoustic signal (second acoustic signal) is stored in the far-end speaker acoustic signal storage unit 102. Based on the above premise, the data structure for performing the above-mentioned evaluation test is generated as follows.
[0031]
The near-end speaker acoustic signal is extracted from the near-end speaker acoustic signal storage unit 101 and sent to the output units 131, 141, and 151, to the input unit 262 of the communication environment simulation processing unit 260, and to the input unit 272 of the signal processing unit 270. The far-end speaker acoustic signal is extracted from the far-end speaker acoustic signal storage unit 102 and input to the time adjustment processing unit 208 and to the input unit 261 of the communication environment simulation processing unit 260.
[0032]
The output units 131, 141, and 151 output the sent near-end speaker acoustic signal (first acoustic signal) as the Rch data of "deterioration signal D1", "deterioration signal D2", and "reference signal" (data of the first channel including the first acoustic signal).
[0033]
The communication environment simulation processing unit 260 performs the above-described communication environment simulation processing on the far-end speaker acoustic signal (second acoustic signal) and the near-end speaker acoustic signal (first acoustic signal) input to the input units 261 and 262, and the simulation signal obtained thereby is output from the output unit 263.
In the example of FIG. 5A, the far-end speaker acoustic signal input to the input unit 261 is input to the time adjustment processing unit 266, and the near-end speaker acoustic signal input to the input unit 262 is input to the time adjustment processing unit 264. The time adjustment processing unit 266 gives a delay of the delay amount B' to the far-end speaker acoustic signal and sends the signal obtained thereby to the addition unit 267 (first time adjustment processing). The time adjustment processing unit 264 gives a delay of the delay amount C' to the near-end speaker acoustic signal and sends the delayed near-end speaker acoustic signal to the pseudo echo generation unit 265 (second time adjustment processing). The pseudo echo generation unit 265 creates a pseudo echo using the delayed near-end speaker acoustic signal (for example, it generates, as the pseudo echo, a signal that simulates the spatial transfer system that would arise when the sound of the near-end speaker acoustic signal (first acoustic signal) reproduced by the speaker on the far-end speaker side is picked up by the microphone on the far-end speaker side, together with the waveform distortion at the time of pickup), and sends the signal obtained thereby to the addition unit 267. The addition unit 267 superimposes the signal obtained by the first time adjustment processing and the signal obtained by the pseudo echo generation unit 265 after the second time adjustment processing. When the pseudo noise source 268 is present, the addition unit 267 may further superimpose the pseudo noise signal output from the pseudo noise source 268. The signal obtained by the addition unit 267 is sent to the output unit 263, and the output unit 263 outputs it as the simulation signal.
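As an illustration only of the electrical simulation of FIG. 5A described above, a minimal sketch is given below, assuming delays in samples and a hypothetical impulse response for the pseudo echo path; it is not the patent's specific implementation.

```python
from typing import Optional
import numpy as np

def simulate_communication_environment(near_voice: np.ndarray,
                                       far_voice: np.ndarray,
                                       delay_b: int,
                                       delay_c: int,
                                       pseudo_echo_ir: np.ndarray,
                                       pseudo_noise: Optional[np.ndarray] = None
                                       ) -> np.ndarray:
    """Illustrative simulation signal: delay the far-end signal by B'
    (first time adjustment), delay the near-end signal by C' (second time
    adjustment) and convolve it with an assumed pickup impulse response
    to form a pseudo echo, then add everything, optionally with noise."""
    def delayed(x, d):
        return np.concatenate([np.zeros(d), x])

    far_part = delayed(far_voice, delay_b)
    echo_part = np.convolve(delayed(near_voice, delay_c), pseudo_echo_ir)
    n = max(len(far_part), len(echo_part),
            len(pseudo_noise) if pseudo_noise is not None else 0)
    mix = np.zeros(n)
    mix[:len(far_part)] += far_part
    mix[:len(echo_part)] += echo_part
    if pseudo_noise is not None:
        mix[:len(pseudo_noise)] += pseudo_noise
    return mix
```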
[0034]
The above-mentioned delay amount B' simulates, for example, the transmission delay amount B of the first embodiment (the transmission delay amount from the far-end terminal unit 120 to the near-end terminal unit 110). On the other hand, the delay amount C' simulates, for example, the delay amount C of the first embodiment (the time until a signal is transmitted from the near-end terminal unit 110 to the far-end terminal unit 120, a sound representing it is output from the speaker 105, and the signal obtained by receiving that sound is transmitted from the far-end terminal unit 120 to the near-end terminal unit 110). Therefore, it is desirable that B' < C' (for example, C' = 2 × B'). However, this does not limit the present invention, and B' = C', B' > C', or B' = C' = 0 is also possible.
[0035]
The simulation signal output from the output unit 263 is input to the output unit 132 and to the input unit 271 of the signal processing unit 270. The output unit 132 outputs the sent simulation signal (evaluation target acoustic signal T1; the first comparison signal) as the Lch data of "deterioration signal D1" (data of the second channel including the superimposed signal).
[0036]
The signal processing unit 270 performs the signal processing on the simulation signal using the simulation signal input to the input unit 271 and the near-end speaker acoustic signal input to the input unit 272, and obtains a superimposed signal. In the example of FIG. 5B, echo cancellation processing is performed by the addition unit 274 using the simulation signal and the signal obtained by applying the adaptive filter 275 to the near-end speaker acoustic signal delayed by the time adjustment processing unit 276, and, when the noise estimation unit 278 and the multiplication unit 277 are provided, noise cancellation processing is further performed, thereby obtaining the superimposed signal. The obtained superimposed signal is output from the output unit 273. As an example of the noise cancellation processing, the noise estimation unit 278 estimates the stationary noise level of the pseudo noise emitted by the pseudo noise source 268 of FIG. 5A in a period in which neither the near-end speaker acoustic signal nor the far-end speaker acoustic signal is present, and the multiplication unit 277 multiplies the output signal of the addition unit 274 by a gain value so that its amplitude is suppressed according to the estimated stationary noise level (see, for example, Yoichi Haneda, Masashi Tanaka, Junko Sasaki, Akitoshi Kataoka, "Acoustic echo canceller with noise suppression and echo suppression functions," Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J87-A, No. 4, pp. 448-457 (April 2004), etc.). The output unit 273 sends the superimposed signal (a superimposed signal derived from the processed signal obtained by performing the signal processing on the signal based on the signal derived from the first acoustic signal and the second acoustic signal) to the output unit 142. The output unit 142 outputs the sent superimposed signal (evaluation target acoustic signal T2) as the Lch data of "deterioration signal D2" (data of the second channel including the superimposed signal).
[0037]
In addition, the time adjustment processing unit 208 delays the input far-end speaker acoustic signal by the delay amount τ' and sends the delayed far-end speaker acoustic signal to the output unit 152. The delay amount τ' of the present embodiment corresponds, for example, to the above-described delay amount B'; for example, the delay amount B', or an approximate value or a correction value (function value) of the delay amount B', is used as the delay amount τ'. Alternatively, the delay amount τ' may correspond to the delay amount C'; for example, τ' may be C'/2 or a function value of C'/2. Alternatively, the delay amount τ' may correspond to both the delay amount B' and the delay amount C'. The output unit 152 outputs the far-end speaker acoustic signal delayed by the time adjustment processing unit 208 (the reference acoustic signal; a second comparison signal based on the second acoustic signal) as the Lch data of the "reference signal" (data of the second channel including the second comparison signal).
[0038]
The data structure as illustrated in FIG. 3 can also be obtained by the above processing. The obtained data structure is stored in the data storage unit 180.
[0039]
[Modification of the Second Embodiment] In the second embodiment, the delay processes of the time adjustment processing units 208, 264, 266, and 276 make the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) match or approximate between the "reference signal", "deterioration signal D1", and "deterioration signal D2" (the time interval a-c in FIG. 3 matches or approximates the time interval a-c'). However, as in the modification of the first embodiment, this object can also be achieved by other means. For example, the far-end speaker acoustic signal read out from the far-end speaker acoustic signal storage unit 102 may be output from the output unit 152 as the Lch reference acoustic signal of the "reference signal" without delay, and a signal obtained by advancing in time by τ' the near-end speaker acoustic signal read out from the near-end speaker acoustic signal storage unit 101 may be used as the Rch near-end speaker acoustic signal of the "reference signal". The point is to include one or more time adjustment processing units that perform at least one of the following: (1) matching or approximating the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of "deterioration signal D2" to the output of the far-end speaker acoustic signal component (first component) contained in its Lch evaluation target acoustic signal T2 (superimposed signal) with the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the "reference signal" to the output of the far-end speaker acoustic signal component (2-2 component) contained in its Lch reference acoustic signal; and (2) matching or approximating the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of "deterioration signal D1" to the output of the far-end speaker acoustic signal component (2-1 component) contained in its Lch evaluation target acoustic signal T1 with the time from the output of the Rch near-end speaker acoustic signal (first acoustic signal) of the "reference signal" to the output of the far-end speaker acoustic signal component (2-2 component) contained in its Lch reference acoustic signal. In addition, depending on the processing at the time of the evaluation test, the data structure may be one that allows the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) to be matched or approximated between the "reference signal", "deterioration signal D1", and "deterioration signal D2". The point is that the data structure should allow this time interval to be matched or approximated, in some way, between the "reference signal", "deterioration signal D1", and "deterioration signal D2".
Furthermore, depending on the environment, evaluation tests may be conducted without adjusting the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch) between the "reference signal", "deterioration signal D1", and "deterioration signal D2". In such a case, the data structure may be one in which it is impossible to match or approximate this time interval between the "reference signal", "deterioration signal D1", and "deterioration signal D2".
[0040]
Third Embodiment In the third embodiment, a quality evaluation method using the data structure
generated as described above will be described.
[0041]
<Sound Quality Evaluation Device> As illustrated in FIG. 6, the sound quality evaluation device 3 of this embodiment includes a data storage unit 180, a counting result storage unit 305, a reproduction control unit 301, a display control unit 302, a counting unit 303, a control unit 304, sound output processing units 310-n, display units 320-n, and input units 330-n, where n = 1, ..., N and N is an integer of 1 or more (for example, N is 1 or more and 4 or less). The sound quality evaluation apparatus 3 is, for example, an apparatus configured by one or more general-purpose or dedicated computers, each having a display device (a display or the like) and an input device (a keyboard, a mouse, or the like), executing a predetermined program. In addition, some or all of the processing units may be configured using an electronic circuit that realizes the processing function by itself.
[0042]
<Sound Quality Evaluation Process> Under the control of the control unit 304, the sound quality evaluation apparatus 3 uses the above-described data structure to perform an evaluation test simulating the conversational MOS test in the above-described loudspeaker communication system.
[0043]
For n = 1, ..., N, Rch (first channel: for example, the right channel), which is one channel of the binaural sound reproducing apparatus 340-n, is connected to the output unit 311-n of the sound output processing unit 310-n, and Lch (second channel: for example, the left channel), which is the other channel of the binaural sound reproducing apparatus 340-n, is connected to the output unit 312-n.
The binaural sound reproducing apparatus 340-n is a sound reproducing apparatus capable of stereo reproduction that includes a speaker dedicated to one ear, which outputs the sound of the channel Rch, and a speaker dedicated to the other ear, which outputs the sound of the channel Lch. Specific examples of the binaural sound reproducing apparatus 340-n are headphones and earphones. The evaluator 350-n wears the binaural sound reproducing apparatus 340-n, subjectively evaluates the sound output from the binaural sound reproducing apparatus 340-n according to the display content output from the display unit 320-n, and inputs the evaluation result to the input unit 330-n. It is desirable that the evaluator 350-n wear the speaker that outputs the sound of the channel Lch on his or her dominant ear (for example, the left ear) and the speaker that outputs the sound of the channel Rch on the non-dominant ear (for example, the right ear). These processes will be described in detail below.
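As a minimal sketch of this channel assignment, the two signals of one stimulus could be packed into a two-channel WAV file whose left channel feeds Lch and whose right channel feeds Rch; the file format, file name, sampling rate, and 16-bit quantization below are illustrative assumptions, not part of the described apparatus.

import wave
import numpy as np

FS = 16000  # sampling rate (Hz); illustrative assumption


def write_stereo_stimulus(path, rch, lch, fs=FS):
    """Write a 2-channel WAV file: channel 0 = Lch (left ear side), channel 1 = Rch (near-end signal)."""
    n = max(len(rch), len(lch))
    rch = np.pad(rch, (0, n - len(rch)))
    lch = np.pad(lch, (0, n - len(lch)))
    stereo = np.stack([lch, rch], axis=1)        # shape (samples, 2), left channel first
    pcm = (np.clip(stereo, -1.0, 1.0) * 32767).astype("<i2")  # 16-bit little-endian PCM
    with wave.open(path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(fs)
        w.writeframes(pcm.tobytes())


# Example: the "reference signal" stimulus with dummy data.
near_end = 0.1 * np.random.randn(2 * FS)    # Rch: near-end speaker acoustic signal
reference = 0.1 * np.random.randn(2 * FS)   # Lch: reference acoustic signal
write_stereo_stimulus("reference.wav", near_end, reference)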
[0044]
According to the control of the control unit 304 (the control content will be described later), the reproduction control unit 301 extracts any of the "reference signal", "deterioration signal D 1", and "deterioration signal D 2" of the above-described data structure from the data storage unit 180 and sends it to the sound output processing unit 310-n (where n = 1, ..., N). At this time, processing may be performed to match or approximate the time interval between the start of the near-end speaker acoustic signal (Rch) and the start of the far-end speaker acoustic signal component (Lch). The sound output processing unit 310-n performs the following processing according to the sent signal. In the following, the sound represented by the reference acoustic signal of the "reference signal" is referred to as the "reference sound", and the sounds represented by the evaluation target acoustic signal T 1 of "deterioration signal D 1" and by the evaluation target acoustic signal T 2 of "deterioration signal D 2" are referred to as "evaluation sounds".
[0045]
<< When the "reference signal" is sent >> When the "reference signal" is sent, the sound output processing unit 310-n (where n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of the "reference signal" from the output unit 311-n to Rch (first channel), which is one channel of the binaural sound reproducing apparatus 340-n, and outputs the reference acoustic signal of the "reference signal" from the output unit 312-n to Lch (second channel), which is the other channel of the binaural sound reproducing apparatus 340-n (first process).
[0046]
<< When "deterioration signal D 1" is sent >> When "deterioration signal D 1" is sent, the sound output processing unit 310-n (where n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of "deterioration signal D 1" from the output unit 311-n to Rch (first channel) of the binaural sound reproducing apparatus 340-n, and outputs the evaluation target acoustic signal T 1 of "deterioration signal D 1" (a superimposed signal representing an evaluation sound based on the signal derived from the first acoustic signal and the second acoustic signal) from the output unit 312-n to Lch (second channel) of the binaural sound reproducing apparatus 340-n (second process).
[0047]
<< When "deterioration signal D 2" is sent >> When "deterioration signal D 2" is sent, the sound output processing unit 310-n (where n = 1, ..., N) outputs the near-end speaker acoustic signal (first acoustic signal) of "deterioration signal D 2" from the output unit 311-n to Rch (first channel) of the binaural sound reproducing apparatus 340-n, and outputs the evaluation target acoustic signal T 2 of "deterioration signal D 2" (a superimposed signal representing an evaluation sound based on a signal derived from the first acoustic signal and the second acoustic signal; however, this superimposed signal is derived from the processed signal obtained by performing signal processing on the signal based on the signal derived from the first acoustic signal and the second acoustic signal) from the output unit 312-n to Lch (second channel) of the binaural sound reproducing apparatus 340-n (second process).
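The first and second processes above differ only in which stored signal is routed to which output unit. A minimal sketch, assuming the stored data are exposed as a simple dictionary keyed by signal name (an illustrative layout, not the claimed data structure itself), is:

# Hypothetical in-memory representation of the stored data structure.
data_storage = {
    "reference": {"rch": "near_end_ref", "lch": "reference_acoustic_signal"},
    "D1":        {"rch": "near_end_d1",  "lch": "evaluation_target_T1"},
    "D2":        {"rch": "near_end_d2",  "lch": "evaluation_target_T2"},
}


def route(signal_name):
    """Return (signal for output unit 311-n / Rch, signal for output unit 312-n / Lch)."""
    entry = data_storage[signal_name]
    return entry["rch"], entry["lch"]


# First process:  reference -> Rch = near-end signal, Lch = reference acoustic signal.
# Second process: D1 or D2  -> Rch = near-end signal, Lch = evaluation target T1 or T2.
for name in ("reference", "D1", "D2"):
    rch, lch = route(name)
    print(f"{name}: Rch <- {rch}, Lch <- {lch}")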
[0048]
The display control unit 302 sends display information to the display unit 320-n (where n = 1, ..., N) according to the control of the control unit 304 (the control content will be described later). According to the sent display information, the display unit 320-n displays evaluation categories that include three or more categories, each consisting of a combination of whether or not the difference between the reference sound and the evaluation sound can be recognized and one of two or more levels of difficulty in listening to the evaluation sound. The evaluator 350-n subjectively evaluates the sound output from the binaural sound reproducing apparatus 340-n according to this display. Here, the "reference sound" corresponds to the acoustic signal received from the far-end speaker in an ideal state. By presenting it together with the "near-end speaker sound", which corresponds to the direct sound from the near-end speaker, the ideal state of the loudspeaker communication system can be simulated. Presenting the "near-end speaker sound" simultaneously with the reference acoustic signal also makes it easy to distinguish between the wraparound (acoustic echo) of the near-end speaker's voice and the far-end speaker's speech. By constantly comparing the "evaluation sound" with the "reference sound", it is possible to evaluate objectively and subjectively how close to, or how far from, the ideal state the communication system under evaluation is. If only the evaluation sound were presented and evaluated, the far-end speaker's utterances, the far-end speaker's ambient noise, and the like would be judged as deterioration factors, and the system would very likely be rated low. By always comparing with the "reference sound", deterioration factors other than the communication system are excluded from the evaluation targets, and accurate evaluation values with little variation can be obtained. Furthermore, these evaluation categories define evaluation criteria not only for the deterioration of the evaluation sound with respect to the reference sound but also for the difficulty (ease) of listening to the evaluation sound. By displaying evaluation categories that combine the degree of deterioration from the reference sound and the ease of listening in this way, it becomes clearer what criteria should be used for the evaluation than when evaluation categories focusing only on deterioration are displayed, as in the conventional DCR (degradation category rating), and evaluation variation can be kept small even in environments where a plurality of factors are intricately intertwined. In addition, when evaluation criteria for the "ease" of listening to the evaluation sound (positive evaluation criteria) are displayed rather than evaluation criteria for the "difficulty" of listening to the evaluation sound (negative evaluation criteria), the selections made by the evaluators 350-n become stricter and the evaluation accuracy is improved. This is based on physiological natural laws.
[0049]
Preferably, the evaluation categories include four or more categories, each consisting of a combination of whether or not the difference between the reference sound and the evaluation sound can be recognized and one of the levels of difficulty in listening to the evaluation sound. Setting evaluation criteria with three or more levels of difficulty in listening to the evaluation sound can further improve the evaluation accuracy. In particular, it is desirable that the evaluation categories include one category representing that no difference between the reference sound and the evaluation sound is recognized, and four categories consisting of combinations of the recognition that there is a difference between the reference sound and the evaluation sound with four levels of difficulty in listening to the evaluation sound. A specific example of such evaluation categories is as follows: "there is no difference from the reference sound" and "there is a difference" represent whether or not the difference between the reference sound and the evaluation sound can be recognized, while "no problem in listening", "slightly hard to hear", "hard to hear", and "very hard to hear" represent the degree of difficulty in listening to the evaluation sound. Each evaluation category in this example is associated with a value representing an evaluation from 1 to 5, where a larger value indicates higher quality. Here, the categories are set on the assumption that the "reference sound" represents the ideal state, but it is also conceivable that the "evaluation sound" is rated higher than the "reference sound" owing, for example, to the effect of a noise canceller in the communication system under evaluation. In that case, a category such as "there is a difference, but it is easy to hear" may be included as a higher category.
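For concreteness, the five evaluation categories of this example can be held as a mapping from evaluation values to labels. The exact label wording below is a reconstruction of the example given in the text, and the mapping itself is only an illustrative data layout, not part of the described apparatus.

# Reconstructed example of the evaluation categories (values 1 to 5, larger = higher quality).
EVALUATION_CATEGORIES = {
    5: "There is no difference from the reference sound",
    4: "There is a difference, but no problem in listening",
    3: "There is a difference, and it is slightly hard to hear",
    2: "There is a difference, and it is hard to hear",
    1: "There is a difference, and it is very hard to hear",
}

# Optional higher category for the case where the evaluation sound is judged better than the
# reference (for example, because of a noise canceller in the system under evaluation).
EXTENDED_CATEGORY = {6: "There is a difference, but it is easy to hear"}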
[0050]
The evaluation categories focusing only on deterioration that are used in the conventional DCR (degradation category rating) are shown below. It can be seen that, compared with the evaluation categories in Table 1, they contain more subjective and introspective expressions.
[0051]
Furthermore, the display information output by the display control unit 302 may include information instructing the evaluation of the ease of listening to the evaluation sound, and the display unit 320-n may further perform a display instructing the evaluation of the ease of listening to the evaluation sound (a display indicating "what to evaluate"). For example, the display unit 320-n may display "Please evaluate the ease of listening to the "female voice (left)" of the evaluation sound". In this example, "left" indicates the output of the Lch (second channel) side speaker for the "reference signal", "deterioration signal D 1", and "deterioration signal D 2". As described above, each evaluation category is a combination of whether or not the difference between the reference sound and the evaluation sound can be recognized and the degree of difficulty in listening to the evaluation sound. Physiologically, humans are sensitive to the presence or absence of a difference, and the presence or absence of a difference between a reference sound and an evaluation sound can be evaluated even without paying particular attention to it. On the other hand, the ease of listening cannot be evaluated appropriately unless attention is directed to it. By having the display unit 320-n further perform a display instructing the evaluation of the ease of listening to the evaluation sound on the basis of this natural law, the evaluation accuracy can be improved and the evaluation variation can be reduced. In addition, when a display instructing the evaluation of the "difficulty" of listening to the evaluation sound is used as the display indicating what is to be evaluated, the evaluator 350-n physiologically tends to pay too much attention to details and to rate even slight deterioration as affecting the ease of listening. Using a display instructing the evaluation of the "ease" of listening to the evaluation sound as the display indicating what is to be evaluated makes the evaluation by the evaluator 350-n appropriate, so that the evaluation accuracy can be improved and the evaluation variation can be reduced.
[0052]
Furthermore, the display information output by the display control unit 302 may include information for displaying what to focus on, and the display unit 320-n may display "what to focus on". For example, the display unit 320-n may perform a display indicating an instruction to focus on the reference sound in the above-described "first process" and a display indicating an instruction to focus on the evaluation sound in the "second process". For example, the display unit 320-n displays "Reference sound (1): please pay attention to the "female voice (left)"" in the "first process", displays "Evaluation sound (1): please pay attention to the "female voice (left)"" in the "second process" that outputs "deterioration signal D 1", and displays "Evaluation sound (2): please pay attention to the "female voice (left)"" in the "second process" that outputs "deterioration signal D 2". In this way, the evaluation target is clarified, so that the evaluator 350-n focuses on the evaluation target acoustic signal (the far-end speaker sound side) and does not focus on the near-end speaker sound side. In addition, by changing the displays of "what to focus on" and "what to evaluate" output from the display unit 320-n in accordance with the signal output from the sound output processing unit 310-n, the generation timing of the acoustic signal to be evaluated can be recognized visually.
[0053]
The evaluator 350-n who has performed the subjective evaluation inputs the evaluation value I-n, which is information representing the category selected from the evaluation categories (information representing the evaluation result), to the input unit 330-n. FIG. 7 illustrates a display screen 321 displayed by the display unit 320-n. The display screen 321 includes a focused content presentation unit 3211 that displays "what to focus on", an evaluation instruction presentation unit 3212 that displays "what to evaluate", an evaluation category presentation unit 3213 that displays the evaluation categories, icons 3214 to 3218 that are touched or clicked to input the values "1" to "5" representing the evaluation (evaluation value I-n), and an icon 3219 that is touched or clicked to confirm the input. The evaluator 350-n subjectively evaluates the sound output from the binaural sound reproducing apparatus 340-n according to the displays of the focused content presentation unit 3211, the evaluation instruction presentation unit 3212, and the evaluation category presentation unit 3213, touches or clicks whichever of the icons 3214 to 3218 corresponds to the evaluation, and then touches or clicks the icon 3219 for confirmation. While the icons 3214 to 3219 are active and until the icon 3219 is touched or clicked, the evaluator 350-n can touch or click the icons 3214 to 3218 to reselect a value any number of times. As a result, the evaluation value I-n representing the category selected from the evaluation categories is input to the input unit 330-n. In order to keep the evaluation conditions identical, it is desirable that the above-described evaluation test be performed simultaneously by all the evaluators 350-n (where n = 1, ..., N). When there is an evaluator whose evaluation has not been confirmed for a predetermined time or longer, a screen display prompting that evaluator to confirm and a screen display asking the other evaluators to wait may be performed.
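A minimal sketch of the input side of such a screen, written with Python's standard tkinter toolkit, is shown below. The widget names and layout are illustrative assumptions that only loosely mirror the presentation units 3211 to 3213 and the icons 3214 to 3219; this is not the actual display screen 321.

import tkinter as tk


def run_rating_screen(focus_text, instruction_text, categories):
    """Show focus/instruction texts, selectable values 1 to 5, and a confirm button.
    Returns the confirmed evaluation value, or None if the window is closed."""
    result = {"confirmed": None}
    root = tk.Tk()
    root.title("Evaluation input (sketch)")
    tk.Label(root, text=focus_text).pack()        # roughly presentation unit 3211
    tk.Label(root, text=instruction_text).pack()  # roughly presentation unit 3212
    tk.Label(root, text="\n".join(f"{v}: {c}" for v, c in sorted(categories.items()))).pack()  # roughly 3213
    selected = tk.IntVar(value=0)
    row = tk.Frame(root)
    row.pack()
    for v in (1, 2, 3, 4, 5):                     # roughly icons 3214 to 3218; re-selectable
        tk.Radiobutton(row, text=str(v), value=v, variable=selected).pack(side=tk.LEFT)

    def confirm():                                # roughly icon 3219
        if selected.get():
            result["confirmed"] = selected.get()
            root.destroy()

    tk.Button(root, text="Confirm", command=confirm).pack()
    root.mainloop()
    return result["confirmed"]


# Example usage (requires a display):
# value = run_rating_screen("Evaluation sound (1): please pay attention to the female voice (left)",
#                           "Please evaluate the ease of listening",
#                           {5: "No difference from the reference sound", 1: "Very hard to hear"})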
[0054]
The evaluation value I-n input to the input unit 330-n is sent to the aggregation unit 303. The aggregation unit 303 aggregates the evaluation values I-n and stores the aggregation result obtained thereby in the aggregation result storage unit 305. For example, the aggregation result is stored together with an ID representing the evaluator 350-n, the acoustic signal used in the evaluation test (such as "deterioration signal D 2"), and its conditions. The aggregation result of the evaluation values I-n may be the set of the evaluation values I-n itself, or may be the maximum value, minimum value, average value, variance, or the like for each acoustic signal used in the evaluation test. The maximum value, minimum value, average value, variance, or the like obtained after excluding the evaluation values I-n of evaluators 350-n whose evaluation content is questionable may also be used as the aggregation result. Other detailed analyses may be performed by other processing devices.
[0055]
<< Control Content of Control Unit 304 >> Next, the control content of the control unit 304 will be illustrated using FIG. 8 to FIG. 12. The horizontal axis in these figures represents the time axis, with later positions corresponding to later times. The row "Lch" in these figures represents the sound to be output from the Lch-side speaker of the binaural sound reproducing apparatus 340-n, and the row "Rch" represents the sound to be output from the Rch-side speaker of the binaural sound reproducing apparatus 340-n. The row "3211" represents the presentation content (what to focus on) of the focused content presentation unit 3211, the row "3212" represents the presentation content (what to evaluate) of the evaluation instruction presentation unit 3212, and the row "3213" represents the presentation content (evaluation categories) of the evaluation category presentation unit 3213.
[0056]
<< Example of FIG. 8 >> In the example of FIG. 8, the reproduction control unit 301 first reads the "reference signal" from the data storage unit 180 and sends it to the sound output processing unit 310-n (where n = 1, ..., N). The sound output processing unit 310-n outputs the reference acoustic signal of the "reference signal" from the output unit 312-n and outputs the near-end speaker acoustic signal of the "reference signal" from the output unit 311-n. As a result, the "reference sound" represented by the reference acoustic signal is output from the Lch of the binaural sound reproducing apparatus 340-n, and the "near-end speaker sound", corresponding to the direct sound from the near-end speaker, is output from the Rch. At this time, the display control unit 302 sends display information representing the focused content F 1 and the evaluation categories to the display unit 320-n. The focused content F 1 is content indicating an instruction to focus on the reference sound (Lch) (for example, "Reference sound (1): please pay attention to the "female voice (left)""). The evaluation categories are the above-described evaluation categories, which include three or more categories, each consisting of a combination of whether or not the difference between the reference sound and the evaluation sound can be recognized and one of two or more levels of difficulty in listening to the evaluation sound. The display unit 320-n presents the focused content F 1 on the focused content presentation unit 3211 and presents the evaluation categories on the evaluation category presentation unit 3213 (step S1).
[0057]
Next, the reproduction control unit 301 reads "deterioration signal D 2" from the data storage unit 180 and sends it to the sound output processing unit 310-n (where n = 1, ..., N). The sound output processing unit 310-n outputs the evaluation target acoustic signal T 2 of "deterioration signal D 2" from the output unit 312-n and outputs the near-end speaker acoustic signal of "deterioration signal D 2" from the output unit 311-n. As a result, the "evaluation sound" represented by the evaluation target acoustic signal T 2 of "deterioration signal D 2" is output from the Lch of the binaural sound reproducing apparatus 340-n, and the "near-end speaker sound" represented by the near-end speaker acoustic signal is output from the Rch. At this time, the display control unit 302 sends display information representing the focused content F 2, the evaluation instruction S 1, and the evaluation categories to the display unit 320-n. The focused content F 2 is content indicating an instruction to focus on the evaluation sound (Lch) (for example, "Evaluation sound (1): please pay attention to the "female voice (left)""). The evaluation instruction S 1 is an instruction to evaluate the ease of listening to the evaluation sound (Lch) (for example, "Please evaluate the ease of listening to the "female voice (left)" of the evaluation sound"). The display unit 320-n presents the focused content F 2 on the focused content presentation unit 3211, presents the evaluation instruction S 1 on the evaluation instruction presentation unit 3212, and presents the evaluation categories on the evaluation category presentation unit 3213 (step S2).
[0058]
Next, step S1 is executed again (step S3), and step S2 is executed again (step S4). The repetition
of step S1 and step S2 may be performed three or more times.
[0059]
Thereafter, the icons 3214 to 3219 are activated, and the evaluation value I-n and the input indicating confirmation are received from the input unit 330-n (step S5).
[0060]
Furthermore, processing in which "deterioration signal D 2" in steps S1 to S5 is replaced with "deterioration signal D 1" and "evaluation target acoustic signal T 2" is replaced with "evaluation target acoustic signal T 1" may also be executed.
In addition, the presentation of the evaluation categories by the evaluation category presentation unit 3213 may be performed continuously throughout steps S1 to S5, or the presentation of the evaluation categories may disappear each time a step is completed.
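The presentation order of FIG. 8 (reference, evaluation, reference, evaluation, then a single input) can be sketched as a simple schedule in which playback and display are placeholder callbacks standing in for the sound output processing unit and the display unit; the callback names and signal labels are illustrative assumptions.

def run_fig8_trial(play, show, collect_value, repetitions=2):
    """Alternate reference and evaluation presentations, then collect one evaluation value.
    play(lch, rch) and show(focus, instruction) are placeholders for the sound output
    processing unit and the display unit."""
    for _ in range(repetitions):                  # steps S1/S3 and S2/S4
        show(focus="reference sound (Lch)", instruction=None)
        play(lch="reference_acoustic_signal", rch="near_end_ref")
        show(focus="evaluation sound (Lch)", instruction="rate ease of listening")
        play(lch="evaluation_target_T2", rch="near_end_d2")
    return collect_value()                        # step S5


# Example with trivial placeholders.
log = []
value = run_fig8_trial(
    play=lambda lch, rch: log.append(("play", lch, rch)),
    show=lambda focus, instruction: log.append(("show", focus, instruction)),
    collect_value=lambda: 4,
)
print(value, log)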
[0061]
<< Example of FIG. 9 >> In the example of FIG. 9, a pair of sounds to be compared is randomly selected from the "reference sound", the "evaluation sound" represented by the evaluation target acoustic signal T 1, and the "evaluation sound" represented by the evaluation target acoustic signal T 2, and the selected sounds are output in order.
[0062]
Specific examples of the process are shown below.
First, the reproduction control unit 301 randomly selects a pair to be compared from the "reference signal", "deterioration signal D 1", and "deterioration signal D 2". Examples of pairs to be compared are the pair consisting of the "reference signal" and "deterioration signal D 1", the pair consisting of the "reference signal" and "deterioration signal D 2", and the pair consisting of "deterioration signal D 1" and "deterioration signal D 2". Of the signals constituting the pair to be compared, the signal output first is referred to as the "first output signal", and the signal output later is referred to as the "second output signal". Either of the signals constituting the pair to be compared may be output first. For example, when the pair consisting of the "reference signal" and "deterioration signal D 1" is compared, the "reference signal" may be the "first output signal" and "deterioration signal D 1" the "second output signal", or the "reference signal" may be the "second output signal" and "deterioration signal D 1" the "first output signal".
[0063]
Next, the "reference sound or evaluation sound" corresponding to the "first output signal" is output from the Lch, and the "near-end speaker sound" corresponding to the "first output signal" is output from the Rch (step S21). The process of step S21 when the "first output signal" is the "reference signal" is the same as step S1 described above. The process of step S21 when the "first output signal" is "deterioration signal D 2" is the same as step S2 described above, except that the evaluation instruction S 1 is not presented on the evaluation instruction presentation unit 3212. The process of step S21 when the "first output signal" is "deterioration signal D 1" is the process of step S2 described above with "deterioration signal D 2" replaced by "deterioration signal D 1" and "evaluation target acoustic signal T 2" replaced by "evaluation target acoustic signal T 1", except that the evaluation instruction S 1 is not presented on the evaluation instruction presentation unit 3212.
[0064]
Next, "reference sound or evaluation sound" corresponding to "second output signal" is output
from Lch, and "near end speaker sound" corresponding to "second output signal" is output from
Rch (step S22) . The process of step S22 when the “second output signal” is the “reference
signal” is a process of presenting the evaluation instruction S 1 to the evaluation instruction
presenting unit 3212 in addition to the above-described step S1. The process of step S21 when
the “second output signal” is the “deterioration signal D 2” is the same as step S2 described
above. In the process of step S21 when the "second output signal" is "deterioration signal D1",
"deterioration signal D2" is replaced with "deterioration signal D1" in the process of step S2
described above, and "evaluation target" It is the process which substituted acoustic signal T2
"by" evaluation object acoustic signal T1. "
[0065]
Finally, the input of the evaluation value and its confirmation are performed (step S5).
[0066]
In addition, as a modification of steps S21 and S22, whether the sound output from the Lch is the "reference sound" or an "evaluation sound" need not be indicated.
That is, instead of the focused content F 1 and the focused content F 2, content representing an instruction to focus on the Lch (for example, "Please pay attention to the "female voice (left)"") may be presented. In this case, the evaluator 350-n performs the subjective evaluation without being told whether the sound being presented is the "reference sound" or an "evaluation sound".
[0067]
<< Example of FIG. 10 >> In the example of FIG. 10, the "reference sound" is output first, and the "hidden reference sound", the "evaluation sound" represented by the evaluation target acoustic signal T 1, or the "evaluation sound" represented by the evaluation target acoustic signal T 2 is output second and third. When the "hidden reference sound" is output second, the "evaluation sound" represented by the evaluation target acoustic signal T 1 or the "evaluation sound" represented by the evaluation target acoustic signal T 2 is output third (pattern 1). Conversely, when the "evaluation sound" represented by the evaluation target acoustic signal T 1 or the "evaluation sound" represented by the evaluation target acoustic signal T 2 is output second, the "hidden reference sound" is output third (pattern 2). The "hidden reference sound" is a "reference sound" that is output without any indication that it is a "reference sound". Whether pattern 1 or pattern 2 is used is determined randomly.
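The random choice between pattern 1 and pattern 2 can be sketched as follows; randomly choosing between the evaluation sounds T 1 and T 2 is an additional illustrative assumption, not something prescribed above.

import random


def fig10_presentation_order(rng=random):
    """Return the three presentation slots of FIG. 10: slot 1 is always the labelled reference
    sound; slots 2 and 3 contain the hidden reference and one evaluation sound in random order."""
    evaluation = rng.choice(["evaluation_T1", "evaluation_T2"])
    if rng.random() < 0.5:                       # pattern 1
        second, third = "hidden_reference", evaluation
    else:                                        # pattern 2
        second, third = evaluation, "hidden_reference"
    return ["reference", second, third]


print(fig10_presentation_order())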
[0068]
Specific examples of the process are shown below.
[0069]
First, "reference sound" corresponding to "reference signal" is output from Lch, and "near end
speaker sound" corresponding to "reference signal" is output from Rch (step S31).
The process of step S31 is the same as step S21 described above.
[0070]
Next, the reproduction control unit 301 randomly selects pattern 1 or pattern 2. When pattern 1 is selected, the "hidden reference sound" corresponding to the "reference signal" is first output from the Lch and the "near-end speaker sound" corresponding to the "reference signal" is output from the Rch (step S32); next, the "evaluation sound" represented by the evaluation target acoustic signal T 1 of "deterioration signal D 1" or the "evaluation sound" represented by the evaluation target acoustic signal T 2 of "deterioration signal D 2" is output from the Lch, and the "near-end speaker sound" corresponding to "deterioration signal D 1" or "deterioration signal D 2" is output from the Rch (step S33). When pattern 2 is selected, on the other hand, the "evaluation sound" represented by the evaluation target acoustic signal T 1 or the "evaluation sound" represented by the evaluation target acoustic signal T 2 is output from the Lch and the "near-end speaker sound" corresponding to "deterioration signal D 1" or "deterioration signal D 2" is output from the Rch (step S32); next, the "hidden reference sound" corresponding to the "reference signal" is output from the Lch and the "near-end speaker sound" corresponding to the "reference signal" is output from the Rch (step S33).
[0071]
The process of outputting the "hidden reference sound" corresponding to the "reference signal" from the Lch and outputting the "near-end speaker sound" corresponding to the "reference signal" from the Rch is the same as step S1 described above, except that the focused content F 2 is presented on the focused content presentation unit 3211 in place of the focused content F 1 and the evaluation instruction S 1 is presented on the evaluation instruction presentation unit 3212. The process of outputting from the Lch the "evaluation sound" represented by the evaluation target acoustic signal T 1 or the "evaluation sound" represented by the evaluation target acoustic signal T 2 and outputting from the Rch the "near-end speaker sound" corresponding to "deterioration signal D 1" or "deterioration signal D 2" is the same as the process of step S2 described above, or as the process of step S2 with "deterioration signal D 2" replaced by "deterioration signal D 1" and "evaluation target acoustic signal T 2" replaced by "evaluation target acoustic signal T 1".
[0072]
Finally, the input of the evaluation value and its confirmation are performed (step S5). However, the evaluator 350-n judges which of the sounds output in steps S32 and S33 is the evaluation sound and inputs an evaluation value only for the sound judged to be the evaluation sound. The sound not judged to be the evaluation sound is automatically treated as having been judged to be the "hidden reference sound", and the evaluation value "5" is given to the hidden reference sound. In addition, the evaluator 350-n may be allowed, by inputting instructions to the input unit 330-n, to execute steps S31 to S33 any number of times in any desired order before step S5.
[0073]
<< Example of FIG. 11 >> In the example of FIG. 11 as well, the "reference sound" is output first, and then, according to pattern 1 or pattern 2 selected at random, the "hidden reference sound", the "evaluation sound" represented by the evaluation target acoustic signal T 1, or the "evaluation sound" represented by the evaluation target acoustic signal T 2 is output second and third. However, an evaluation value is input for each of the second and third outputs (steps S132 and S133), and finally only the confirmed evaluation values are input (step S105). Of the sounds output in steps S132 and S133, the evaluator 350-n inputs the evaluation value "5" for the sound judged to be the "hidden reference sound" and inputs his or her own evaluation value for the sound judged to be an "evaluation sound". The other details are the same as in the example of FIG.
[0074]
<< Example of FIG. 12 >> In the example of FIG. 12, the "reference sound" is output first (step S41), "evaluation sound 1" to "evaluation sound x" are output second to (x+1)-th (x is an integer of 3 or more; for example, x is 14 or less) (steps S42-1 to S42-x), and then the input of the evaluation values and their confirmation are performed (step S5). Here, "evaluation sound 1" to "evaluation sound x" include at least one of the "evaluation sound" represented by the evaluation target acoustic signal T 1 and the "evaluation sound" represented by the evaluation target acoustic signal T 2, a "hidden reference sound", and one or more "anchor sounds". An "anchor sound" is a sound that serves as a reference of poor acoustic quality. When a plurality of anchor sounds are included, they may serve as references of progressively worse sound quality. In step S5, an evaluation value is input for each of the sounds output in steps S42-1 to S42-x. The output order of "evaluation sound 1" to "evaluation sound x" is determined randomly. The evaluator 350-n may also be allowed, by inputting instructions to the input unit 330-n, to execute steps S42-1 to S42-x any number of times in any desired order before step S5. The other details are similar to the example of FIG.
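A minimal sketch of the randomized ordering of "evaluation sound 1" to "evaluation sound x" is shown below; the particular composition of the set (two evaluation sounds, two anchors of progressively worse quality, one hidden reference) is an illustrative assumption.

import random


def fig12_order(evaluation_sounds, anchors, include_hidden_reference=True, rng=random):
    """Build the randomized list of sounds output in steps S42-1 to S42-x."""
    pool = list(evaluation_sounds) + list(anchors)
    if include_hidden_reference:
        pool.append("hidden_reference")
    rng.shuffle(pool)
    return pool


# Example: x = 5 presentations after the initial reference sound.
order = fig12_order(["evaluation_T1", "evaluation_T2"], ["anchor_low", "anchor_lower"])
print(order)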
[0075]
[Other Modifications, Etc.] The present invention is not limited to the above-described embodiments. For example, the reference signal or the deterioration signals may be obtained based on acoustic signals other than voice (music, background sound, etc.). The reference signal and the deterioration signals also need not be time-series signals. In addition, the various processes described above may be executed not only in chronological order as described but also in parallel or individually, depending on the processing capability of the apparatus executing the processes or as necessary. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.
[0076]
When the above configuration is implemented by a computer, the processing content of the
function that each device should have is described by a program. The above processing functions
are realized on a computer by executing this program on a computer. The program describing
the processing content can be recorded in a computer readable recording medium. An example
of a computer readable recording medium is a non-transitory recording medium. Examples of
such recording media are magnetic recording devices, optical disks, magneto-optical recording
media, semiconductor memories and the like.
[0077]
This program is distributed, for example, by selling, transferring, lending, etc. a portable
recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore,
this program may be stored in a storage device of a server computer, and the program may be
distributed by transferring the program from the server computer to another computer via a
network.
[0078]
For example, a computer that executes such a program first temporarily stores, in its own storage device, the program recorded on a portable recording medium or transferred from a server computer. When executing the processing, this computer reads the program stored in its own storage device and executes the processing according to the read program. As other execution forms of this program, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or the computer may sequentially execute processing according to the received program each time the program is transferred from the server computer to this computer. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service in which the processing functions are realized only by execution instructions and result acquisition, without transferring the program from the server computer to this computer.
[0079]
In the above embodiment, the processing function of the present apparatus is realized by
executing a predetermined program on a computer, but at least a part of these processing
functions may be realized by hardware.
[0080]
1, 2 Data generation device 3 Sound quality evaluation device