Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPWO2017061023
Abstract: An audio signal processing apparatus and method that separate the sound of each sound source even when a plurality of devices record sound asynchronously. Each of the plurality of devices is instructed to output a reference signal of a different frequency. In response to the instruction, the reference signals output from the speakers of the devices are received, together with the audio signals input to the microphones of the devices while the reference signals are being output. A time shift amount is calculated for each device from the received reference signals and audio signals, and the plurality of audio signals input to the microphones of the devices are separated based on the calculated time shift amounts.
Audio signal processing method and apparatus
[0001]
The present invention relates to an audio signal processing method and apparatus for separating
a sound in which a plurality of sound sources are mixed.
[0002]
As background art in this technical field, there are Patent Document 1 and Patent Document 2. Patent Document 1 discloses a technique in which "the complex spectrum of the observed signal picked up by two microphones is obtained, and the inter-microphone phase difference of the complex spectrum is calculated for each time-frequency point. The parameters of a probability model representing the complex spectrum of the observed signal corresponding to each sound source at each time-frequency point and the distribution of the inter-microphone phase difference are estimated, using the complex spectrum of the observation signal observed by one microphone, the inter-microphone phase difference obtained by the feature creation unit, and prior information representing the distribution of the complex spectrum of the sound source signal. Using the contribution rate of each sound source to the complex spectrum and the inter-microphone phase difference of the observed signal at each time-frequency point, obtained from the estimated parameters, the complex spectrum of each sound source is extracted from that contribution rate and the complex spectrum of the observed signal, and converted into a time-domain separated signal" (see abstract).
Patent Document 2 discloses a technique in which "the sound pressure and frequency characteristic measurement device in the speech recognition system takes in environmental noise during a time period without speech input from the microphone, and measures the temporal variation of the sound pressure and frequency characteristics. The speech recognition availability determination device determines whether speech recognition is 'good', 'possible', or 'impossible' based on the measured sound pressure of the environmental noise and the amount of temporal change of the frequency characteristics. The determination result of the speech recognition availability determination device is notified to the user by the status notification device."
11-04-2019
[0003]
JP 2013-186383 A
JP 2003-271596 A
[0004]
The present invention relates to audio signal processing for separating a signal in which a plurality of sounds are mixed and extracting the sound of each sound source. Patent Document 1 describes an apparatus and a method for extracting the sound of each sound source using sound signals recorded by a plurality of microphones as input. However, this method presupposes that the microphones record synchronously. When the recording systems are not synchronized, the phase difference between the observed signals changes due to differences in recording start timing or in sampling frequency, so the separation performance is degraded, and if speech recognition is performed in a later stage, the recognition performance also decreases.
[0005]
Patent Document 2 describes a method of determining the ease of speech recognition using the sound pressure of ambient environmental noise, but does not mention how to determine the deterioration of speech recognition performance caused by asynchronous recording devices.
[0006]
An object of the present invention is to provide an audio signal processing method and apparatus
for separating the sound of each sound source even when a plurality of devices input sounds
asynchronously.
[0007]
In order to solve the above problems, the present invention adopts the following configuration as an audio signal processing method in a system including a plurality of devices each provided with a microphone and a speaker: each of the plurality of devices is instructed to output a reference signal of a different frequency; in response to the instruction, the reference signals output from the speakers of the devices are received, together with the audio signals input to the microphones of the devices while the reference signals are being output; a time shift amount for each device is calculated from the received reference signals and audio signals; the plurality of audio signals input to the microphones of the devices are separated based on the calculated time shift amounts; and the separated audio signals are output.
[0008]
According to the present invention, it is possible to separate the sound of each sound source even when a plurality of devices record the input sounds asynchronously.
[0009]
FIG. 1 is a block diagram of an audio signal processing apparatus according to one embodiment of the present invention and audio input/output devices.
FIG. 2 is a block diagram of a configuration in which an audio input/output device performs the audio signal processing.
FIG. 3 is a block diagram of an audio signal processing device having an audio input/output function.
FIG. 4 is a functional block diagram of the signal processing apparatus 100 and the audio input/output devices 110 and 120 of the present embodiment.
FIG. 5 is a processing flowchart of the signal processing apparatus 100 of the present embodiment.
FIG. 6 is a processing flow explaining the time shift amount calculation process (S502) in detail.
FIG. 7 is an example of the reference signals output from the devices and the microphone input signal of each device while the reference signals are output.
FIG. 8 is an example in which the microphone input signals of the devices are time-aligned using the calculated time shift amounts.
FIG. 9 is a processing flow explaining the separation performance evaluation process (S505) in detail.
FIG. 10 is an example in which two mixed audio signals are separated into two signals with low separation performance and into two signals with high separation performance.
FIG. 11 is a processing flow explaining the sampling mismatch calculation process (S507) in detail.
[0010]
Hereinafter, embodiments of the present invention will be described in detail with reference to
the drawings.
[0011]
In this embodiment, an example of the signal processing apparatus 100 that performs sound
source separation on voices recorded asynchronously by a plurality of devices will be described.
[0012]
FIG. 1 shows a block diagram of the sound source separation system according to this embodiment. The sound source separation system in the present embodiment has a configuration in which two audio input/output devices 110 and 120 communicate wirelessly with a signal processing apparatus 100 that performs source separation.
[0013]
Each audio input/output device (110 and 120) is composed of a microphone (111 and 121), a speaker (112 and 122), an A/D converter (113 and 123), a D/A converter (114 and 124), a central processing unit (115 and 125), a memory (116 and 126), a storage medium (117 and 127), and a communication control unit (118 and 128).
Examples of devices having such a configuration include portable terminals such as smartphones and tablet PCs.
[0014]
The signal processing apparatus 100 includes a central processing unit 101, a memory 102, a
storage medium 103, and a communication control unit 104.
[0015]
In each device (110 and 120), audio is converted from a digital signal to an analog signal by the D/A converter (114 and 124) and then output from the speaker (112 and 122).
At the same time, the microphone (111 and 121) records ambient sounds, and the recorded analog signal is converted into a digital signal by the A/D converter (113 and 123). This audio input and output is performed asynchronously.
[0016]
The central processing units (115 and 125) store the digital signal output to the speaker and the digital signal input from the microphone in the memories (116 and 126). The communication control units (118 and 128) transmit the speaker output signal and the microphone input signal stored in the memory to the communication control unit 104 on the signal processing apparatus 100 side.
[0017]
The central processing unit 101 of the signal processing apparatus 100 stores the signal received from each device (110 or 120) in the memory 102 and then performs the sound source separation processing. The central processing unit 101 also has a function of transmitting a reference signal to the communication control units (118 and 128) of each device through the communication control unit 104, as processing necessary for sound source separation.
[0018]
This series of processes is executed by programs stored in the respective storage media 103, 117, and 127.
[0019]
In the configuration of FIG. 1, the devices 110 and 120 and the signal processing apparatus 100 communicate wirelessly. However, as shown in FIG. 2, a configuration in which one or both of the devices execute the sound source separation processing is also possible. In this case, the central processing unit (205 and/or 215) in one or both devices has the function of performing the sound source separation processing. Further, as shown in FIG. 3, a configuration in which a signal processing apparatus 300 itself has an audio input/output function, without using independent devices (that is, a configuration in which the devices 110 and 120 and the signal processing apparatus 100 in FIG. 1 are integrated), is also possible. In the present embodiment, the configuration in FIG. 1 will be described as an example.
[0020]
FIG. 4 is a functional block diagram of the signal processing apparatus 100 and the voice input /
output devices 110 and 120 of this embodiment.
[0021]
In each device (110, 120), the data transmission/reception unit (411 and 421) receives the reference signal and the speaker output signal from the data transmission/reception unit 402 on the signal processing apparatus 100 side, converts them to analog signals with the D/A conversion unit (413 and 423), and outputs them from the speaker (112 and 122).
At the same time, ambient sounds recorded by the microphone (111 and 121) are converted into digital signals by the A/D conversion unit (412 and 422), and then sent by the data transmission/reception unit (411 and 421) to the data transmission/reception unit 402 on the signal processing apparatus 100 side.
[0022]
The time shift amount calculation unit 401 in the signal processing apparatus 100 outputs a reference signal from the speakers (112, 122) through the data transmission/reception units (402, 411, 421) and the D/A conversion units (413, 423) of the devices, in order to calculate the time shift amount between the microphone input signals of the devices and the time shift amount between the microphone input and the speaker output of each device. The time shift amount calculation unit 401 then receives the microphone input and speaker output signals of each device through the data transmission/reception units (402, 411, 421) and calculates the time shift amounts.
[0023]
The signal separation unit 403 receives the microphone input and the speaker output from the data transmission/reception unit 402 and the time shift amounts calculated by the time shift amount calculation unit 401, and performs signal separation and echo removal. Here, an echo refers to sound that is output from a speaker and picked up by a surrounding microphone. The signal separation unit 403 outputs the separated signals, the microphone input, and the speaker output to the separation performance evaluation unit 404.
[0024]
The separation performance evaluation unit 404 receives the separated signals transmitted from the signal separation unit 403 as input and evaluates the separation performance. If the separation performance is equal to or less than a threshold value, it transmits a mode switching instruction to the time shift amount calculation unit 401, which then performs the time shift amount calculation processing again.
[0025]
The sampling mismatch calculation unit 405 sequentially calculates the time shift amount caused by errors in the sampling frequency, using the microphone input, the separated signals, and the speaker output transmitted from the separation performance evaluation unit 404 as input, and feeds the result back to the signal separation unit 403.
[0026]
The sampling mismatch calculation unit 405 outputs the separated signals to the post-processing unit 406. The post-processing unit 406 performs some processing using the received separated signals and outputs a sound representing the processing result from the speaker of each device through the data transmission/reception unit 402.
An example of the processing by the post-processing unit 406 is speech recognition processing, in which speech recognition is performed on the separated signals, the recognition result is translated into another language, and the translated speech is output from a speaker.
[0027]
FIG. 5 is a processing flowchart of the signal processing apparatus 100 of the present embodiment. After processing starts (S501), the time shift amount calculation unit 401 first calculates the time shift amount between the microphone input signals of the devices and the time shift amount between the microphone input and the speaker output of each device (S502). Thereafter, each device continuously performs audio input and output, and keeps transmitting the microphone input and the speaker output to the signal processing apparatus 100 (S503). Next, the signal separation unit 403 performs sound source separation and echo removal on the microphone input signals (S504). Next, the separation performance evaluation unit 404 evaluates the separation performance of the separated signals (S505).
[0028]
In the evaluation process of S505, when the separation performance is equal to or less than the threshold (S506: Yes), it is determined that synchronization between the input and output of the devices has been lost, and the time shift calculation process (S502) is performed again. If the separation performance exceeds the threshold (S506: No), the sampling mismatch calculation unit 405 sequentially calculates the time shift amount caused by the sampling frequency error of each device (S507). Then, post-processing such as speech recognition is performed, and output to the speaker is performed as necessary (S508). The sound source separation from the microphone input and the speaker output, the separation performance evaluation, the sampling mismatch calculation, and the post-processing (S503 to S508) are performed repeatedly.
The details of each process are described below.
[0029]
FIG. 6 is a processing flow that explains the time shift amount calculation process (S502) in FIG. 5 in detail. First, the time shift amount calculation unit 401 causes the speakers 112 and 122 to output reference signals through the data transmission/reception units 402, 411, and 421 (S602). Next, each device transmits the speaker output signal and the microphone input signal of the time period in which the reference signal was output to the time shift amount calculation unit 401 through the data transmission/reception units 411, 421, and 402 (S603). Then, the time shift amount calculation unit 401 calculates the time shift amount between the microphone inputs of the devices and the time shift amount between the microphone input and the speaker output of each device (S604).
[0030]
FIG. 7 shows an example of the reference signal output from each device and the microphone input signal of each device while the reference signal is output. If A/D conversion and D/A conversion were operating in synchronization in each device, the reference signals would be observed at the same timing in the speaker output signal and the microphone input signal. However, when A/D conversion and D/A conversion are not synchronized, there is a time shift between the speaker output and the microphone input of each device due to the processing delay in the device. The microphone input signals of the devices also have a time shift due to the difference in recording start timing (see FIG. 7).
[0031]
In the time shift amount calculation process (S502 in FIG. 5), these time shift amounts are calculated. One method is to calculate the cross-correlation function between corresponding reference signals and obtain the time shift amount between the signals from the time at which the cross-correlation coefficient peaks. In some cases, however, the cross-correlation function of non-corresponding reference signals is computed and an incorrect time shift amount is obtained.
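The peak-picking step described above can be sketched in a few lines of NumPy. The function name and the toy signals below are illustrative, not taken from the patent:

```python
import numpy as np

def estimate_time_shift(ref, mic):
    """Return the lag (in samples) of `mic` relative to `ref`, taken from
    the peak of their cross-correlation function."""
    corr = np.correlate(mic, ref, mode="full")
    # In "full" mode, index len(ref) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(ref) - 1)

# A noise-like reference burst and a copy of it delayed by 5 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(64)
mic = np.concatenate([np.zeros(5), ref])
print(estimate_time_shift(ref, mic))  # → 5
```

Noise-like reference signals give a sharp correlation peak; a pure tone would produce periodic near-maxima, which is one reason distinct bursts are easier to locate reliably.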
[0032]
In FIG. 7, the reference signals are output in the order of device 1 and then device 2, and each reference signal is recorded by the microphones of both devices. Of the two reference signals recorded by a microphone, the signal recorded first should correspond to the reference signal of device 1, and the signal recorded later to the reference signal of device 2. However, when the output interval of the reference signals is short and the reference signals overlap, the cross-correlation function of non-corresponding reference signals may be computed, and the time shift amount may not be calculated correctly. As a countermeasure, in this embodiment, each device outputs a reference signal in a frequency band unique to that device. By restricting the cross-correlation computation to the frequency band assigned to each device, the cross-correlation of non-corresponding reference signals takes a low value, so the time shift amount can be calculated stably.
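The band-restricted correlation can be illustrated as follows. The brick-wall FFT filter, the band assignments, and the burst timings are all assumptions chosen for the sketch; the patent does not specify a filtering method:

```python
import numpy as np

def bandpass_fft(x, fs, f_lo, f_hi):
    """Brick-wall bandpass: zero all FFT bins outside [f_lo, f_hi] Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def lag_in_band(ref, mic, fs, f_lo, f_hi):
    """Cross-correlate only within one device's assigned frequency band."""
    r = bandpass_fft(ref, fs, f_lo, f_hi)
    m = bandpass_fft(mic, fs, f_lo, f_hi)
    corr = np.correlate(m, r, mode="full")
    return int(np.argmax(corr)) - (len(r) - 1)

fs = 8000
rng = np.random.default_rng(1)
# Device-specific reference bursts confined to disjoint bands.
ref1 = bandpass_fft(rng.standard_normal(400), fs, 300, 700)     # device 1's band
ref2 = bandpass_fft(rng.standard_normal(400), fs, 1300, 1700)   # device 2's band
mic = np.zeros(2000)
mic[100:500] += ref1   # device 1's burst arrives at sample 100
mic[150:550] += ref2   # device 2's burst overlaps it from sample 150
print(lag_in_band(ref1, mic, fs, 300, 700))    # close to 100
print(lag_in_band(ref2, mic, fs, 1300, 1700))  # close to 150
```

Even though the two bursts overlap in time, filtering into each device's band suppresses the non-corresponding reference, so each lag is recovered correctly.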
[0033]
In the present embodiment, the reference signal is output in the audible range. Alternatively, by outputting a reference signal in an inaudible range, such as ultrasound, at predetermined intervals (or constantly), the time shift amount can be calculated at any time in parallel with the separation processing.
[0034]
The signal processing device 100 uses the time shift amounts calculated by the time shift amount calculation unit 401 to time-align the asynchronous microphone input and speaker output signals.
[0035]
FIG. 8 is an example in which the microphone input signals of the devices are time-aligned using the calculated time shift amounts.
If the signals are not aligned in time, it is difficult to apply conventional multi-microphone sound source separation and echo canceling methods. The reason is that, as described above, conventional sound source separation and echo canceling methods presuppose synchronization between the microphones, and between the microphones and the speakers.
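Once the start offsets are known, the alignment itself amounts to trimming each recording to a common time origin. A minimal sketch, under the illustrative convention that each offset counts how many samples late that device started:

```python
import numpy as np

def align(signals, start_offsets):
    """Trim recordings so that sample 0 of every output refers to the same
    instant. start_offsets[i] = samples by which device i started late."""
    t0 = max(start_offsets)                            # latest common start
    trimmed = [s[t0 - off:] for s, off in zip(signals, start_offsets)]
    n = min(len(s) for s in trimmed)                   # common length
    return [s[:n] for s in trimmed]

x = np.arange(100.0)       # a shared sound field, sample by sample
s1, s2 = x[3:], x[7:]      # device 1 started 3 samples late, device 2 started 7
a1, a2 = align([s1, s2], [3, 7])
print(np.array_equal(a1, a2))  # → True
```

Fractional-sample offsets and sampling-rate drift (handled later by the sampling mismatch calculation) would require interpolation or resampling rather than integer trimming.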
[0036]
Therefore, in the present embodiment, sound source separation and echo canceling become applicable by aligning the signals in time using the time shift amounts calculated by the time shift amount calculation unit 401. Known microphone array methods and echo canceling methods are used for the sound source separation and the echo canceling.
[0037]
FIG. 9 is a processing flow that explains the separation performance evaluation process (S505 in FIG. 5) in detail. In this process, the separation performance evaluation unit 404 evaluates the separation performance by calculating the similarity, the correlation coefficient, or the like between the signals separated by the signal separation unit 403. For example, the similarity between the separated signals is calculated (S802), and the reciprocal of the calculated similarity is used as the performance evaluation value (S803).
[0038]
FIG. 10 shows an example in which two mixed voice signals are separated into two signals with low separation performance and into two signals with high separation performance. Since the mixed voices are basically utterances with mutually independent content, if separation is performed with high performance, the separated signals become independent voices that are not similar to each other. Conversely, when the separation performance is low, each voice remains as noise in the other separated signal, so the separated signals become similar. Using this property, the separation performance is evaluated with the similarity or the correlation coefficient between the separated signals.
[0039]
As the degree of similarity, for example, the Euclidean distance between the signals can be measured and its reciprocal used. Using the obtained similarity or correlation coefficient, for example its reciprocal is used as an index of separation performance, and if the value is equal to or less than a predetermined threshold, it can be determined that separation has not been performed correctly. Alternatively, the similarity or correlation coefficient can be used as it is, and separation can be judged incorrect if the value is equal to or greater than a predetermined threshold.
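The reciprocal-of-similarity evaluation (S802/S803) can be illustrated with the absolute correlation coefficient as the similarity measure. The threshold value and the synthetic "well separated" versus "residual-heavy" signals below are assumptions for the sketch:

```python
import numpy as np

def separation_score(sep1, sep2, eps=1e-12):
    """Reciprocal of the similarity between two separated signals, using the
    absolute Pearson correlation as the similarity; higher means better
    separation."""
    similarity = abs(np.corrcoef(sep1, sep2)[0, 1])
    return 1.0 / (similarity + eps)

rng = np.random.default_rng(2)
src1, src2 = rng.standard_normal(5000), rng.standard_normal(5000)
good = separation_score(src1, src2)                            # well separated
bad = separation_score(src1 + 0.9 * src2, src2 + 0.9 * src1)   # heavy residual
threshold = 5.0  # illustrative value
print(good > threshold, bad > threshold)  # → True False
```

Independent sources correlate near zero (high score), while outputs that each retain a large residual of the other source correlate strongly (low score), which is exactly the property the text exploits.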
[0040]
In this embodiment, if the evaluation processing by the separation performance evaluation unit 404 determines that separation has not been performed correctly, it is concluded that the time shift amount was not calculated correctly, and the time shift calculation process (S502) is performed again. As a result, even if the time alignment between the signals becomes inaccurate in the middle of the separation processing, this can be detected automatically and the time shift calculation performed again.
[0041]
FIG. 11 is a processing flow that explains the sampling mismatch calculation process (S507 in FIG. 5) in detail. The sampling mismatch calculation unit 405 calculates the time shift amount between the microphone input signals of the devices by computing the cross-correlation function between them (S1002). It then calculates the time shift amount between the microphone input and the speaker output of each device by computing the cross-correlation function between the echo component after separation and the speaker output signal (S1003).
[0042]
Even if the time shift amount is calculated once in process S502, it changes while the separation processing and post-processing continue, because each device has an error in its sampling frequency. The time shift amount therefore needs to be recalculated sequentially, but outputting the reference signal each time would interfere with the post-processing. For this reason, in the sampling mismatch calculation process S507, the time shift amount is calculated sequentially using the microphone input and the speaker output instead of the reference signal.
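Sequential recalculation can be sketched by cross-correlating successive windows of two microphone signals; the slowly growing lag in the simulation below mimics the drift caused by a small sampling frequency mismatch. The window and hop sizes, and the way the drift is synthesized, are assumptions for illustration:

```python
import numpy as np

def windowed_lags(x, y, win, hop):
    """Track the time shift of y relative to x over time by cross-correlating
    successive windows, instead of replaying a reference signal."""
    lags = []
    for start in range(0, min(len(x), len(y)) - win, hop):
        xw = x[start:start + win]
        yw = y[start:start + win]
        corr = np.correlate(yw, xw, mode="full")
        lags.append(int(np.argmax(corr)) - (win - 1))
    return lags

rng = np.random.default_rng(3)
x = rng.standard_normal(6000)
# y gains one extra sample of delay every 2000 samples, mimicking a small
# sampling-frequency error between two devices.
y = np.concatenate([x[:2000], x[1999:3999], x[3998:5998]])
print(windowed_lags(x, y, win=500, hop=2000))  # → [0, 1, 2]
```

The steadily increasing lag estimates are what the sampling mismatch calculation unit feeds back to the signal separation unit so the alignment can be corrected without a fresh reference burst.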
[0043]
First, the time shift amount between the microphone input signals of the devices (S1002) can be calculated by computing the cross-correlation function between the microphone input signals before sound source separation and searching for its peak. Next, the time shift amount between the microphone input and the speaker output of each device is calculated (S1003). Because external sound is mixed into the microphone input signal in addition to the echo component of the speaker output, the cross-correlation function is computed between the echo component obtained by the sound source separation processing and the speaker output, and its peak is searched for, to obtain the time shift amount between the microphone input and the speaker output of each device.
[0044]
As described above, when a sound in the inaudible range is used as the reference signal, the time shift amount can also be calculated sequentially by outputting the reference signal at predetermined intervals (or constantly).
[0045]
100 audio signal processing apparatus
101 central processing unit of the audio signal processing apparatus 100
102 memory of the audio signal processing apparatus 100
103 storage medium of the audio signal processing apparatus 100
104 communication control device of the audio signal processing apparatus 100
110 audio input/output device 1
111 microphone of audio input/output device 1 (110)
112 speaker of audio input/output device 1 (110)
120 audio input/output device 2
121 microphone of audio input/output device 2 (120)
122 speaker of audio input/output device 2 (120)
401 time shift amount calculation unit
402 data transmission/reception unit
403 signal separation unit
404 separation performance evaluation unit
405 sampling mismatch calculation unit
406 post-processing unit
411 data transmission/reception unit in audio input/output device 1 (110)
412 A/D conversion unit in audio input/output device 1 (110)
413 D/A conversion unit in audio input/output device 1 (110)
421 data transmission/reception unit in audio input/output device 2 (120)
422 A/D conversion unit in audio input/output device 2 (120)
423 D/A converter in audio input/output device 2 (120)