Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007215163
When sound source separation processing is performed by a BSS method based on the ICA method, separated signals corresponding to a specific sound source can be output through a specific output terminal even when the position of the sound source with respect to the microphones moves (the sound source can be tracked). A frequency analysis unit performs a frequency analysis calculation on each separated signal y1i of a predetermined time length output through the first output channels Op1i, thereby calculating a feature quantity of each separated signal y1i, and determines the replacement state of the separated signals y1i by automatically evaluating the transition of the feature quantities. Further, based on the result of this replacement determination, the output buffer 22 switches which separated signal y1i output through a first output channel Op1i is output through which of the second output channels Op2i. [Selected figure] Figure 1
Sound source separation device, program for sound source separation device and sound source
separation method
[0001]
The present invention relates to a sound source separation apparatus that, in a state where a plurality of sound sources and a plurality of voice input means exist in a predetermined acoustic space, sequentially receives through each of the voice input means a plurality of mixed voice signals on which the sound source signals from the respective sound sources are superimposed, and that sequentially generates a plurality of separated signals corresponding to the sound source signals by performing sound source separation processing by a blind source separation method based on an independent component analysis method; to a program for the sound source separation apparatus; and to a sound source separation method.
[0002]
When a plurality of sound sources and a plurality of microphones (voice input means) exist in a predetermined acoustic space, an audio signal in which the individual voice signals (hereinafter referred to as sound source signals) from the plurality of sound sources are superimposed (hereinafter referred to as a mixed audio signal) is input to each of the plurality of microphones.
A method of sound source separation processing that identifies (separates) each of the sound source signals based only on the plurality of mixed speech signals input in this manner is called a blind source separation method (hereinafter referred to as the BSS method). Furthermore, one form of BSS sound source separation processing is BSS sound source separation processing based on independent component analysis (hereinafter referred to as the ICA method). The BSS method based on the ICA method is a processing method that exploits the fact that the sound source signals are statistically independent: a predetermined separation matrix (inverse of the mixing matrix) is optimized, and the sound source signals are identified (separated) by applying filter processing with the optimized separation matrix to the plurality of input mixed speech signals (time-series voice signals). The separation matrix is optimized by sequential calculation (learning calculation) based on the signals (separated signals) identified (separated) by filter processing with the separation matrix set at a certain point in time, and the separation matrix so computed is used thereafter. Here, in BSS sound source separation processing based on the ICA method, the separated signals are output through the same number of output terminals (which may be called output channels) as the number of inputs of the mixed speech signals (= the number of microphones). Such BSS sound source separation processing based on the ICA method is described in detail, for example, in Non-Patent Document 1 and Non-Patent Document 2. On the other hand, in BSS sound source separation processing based on the ICA method a separation matrix is obtained by learning calculation, and various techniques are conventionally known for estimating the direction of arrival (DOA: Direction of Arrival) of the sound sources based on that separation matrix. For example, Non-Patent Document 3 and Non-Patent Document 4 show a technique for estimating the DOA by multiplying the separation matrix by a steering vector.
Non-Patent Document 1: Hiroshi Saruwatari, "Basics of blind source separation using array signal processing", Technical Report of IEICE, EA2001-7, pp. 49-56, April 2001.
Non-Patent Document 2: Tomoya Takatani et al., "High-fidelity blind source separation using ICA based on SIMO model", Technical Report of IEICE, US2002-87, EA2002-108, January 2003.
Non-Patent Document 3: Takeshi Nishikawa et al., "Blind Source Separation of Three or More Sources Based on Fast Convergence Algorithm Integrating ICA and Beamforming", Proceedings of the Acoustical Society of Japan, 1-6-13, March 2005.
Non-Patent Document 4: Hiroshi Saruwatari et al., "Blind source separation for speech based on fast-convergence algorithm with ICA and beamforming", EUROSPEECH 2001, pp. 2603-2606.
[0003]
Incidentally, in BSS sound source separation processing based on the ICA method, when the positions of the sound sources with respect to the microphones move and the directions (left-right directions) of the plurality of sound sources with respect to the microphones are thereby interchanged, the separated signals output to the respective output channels are also interchanged.
Consequently, in conventional BSS sound source separation processing based on the ICA method, there is the problem that when the position of a sound source with respect to the microphones moves, a specific sound source cannot be tracked; that is, the separated signal corresponding to a specific sound source cannot always be output through a specific output terminal.
In addition, while the directions of the sound sources are changing, a state can arise in which two sound sources, originally present one in each of the sound collection ranges of two adjacent microphones, become biased toward the sound collection range of one of those microphones (hereinafter referred to as the uneven distribution state of the sound sources). In BSS sound source separation processing based on the ICA method, high sound source separation performance is obtained when the sound collection ranges of the plurality of microphones correspond one-to-one to the positions of the plurality of sound sources, and it is known that the occurrence of the uneven distribution state of the sound sources causes the problem that the sound sources cannot be properly separated. This is because no method of solving the permutation problem in the ICA method with a practical calculation load has been realized for the uneven distribution state of the sound sources. The permutation problem is described, for example, in paragraph 0008 of Patent Document 1. The present invention has therefore been made in view of the above circumstances, and its object is to provide a sound source separation device, a program for the sound source separation device, and a sound source separation method which, when performing sound source separation processing by the BSS method based on the ICA method, can output the separated signal corresponding to a specific sound source through a specific output terminal even when the position of the sound source with respect to the microphones moves (that is, can track the sound source), and which can avoid, as much as possible, the problem that a plurality of sound sources become unevenly distributed in the sound collection range of one microphone and appropriate sound source separation can no longer be performed.
[0004]
In order to achieve the above object, the present invention is constituted as a sound source separation device provided with means (hereinafter referred to as sequential sound source separation means) which, in a state where a plurality of sound sources and a plurality of voice input means (microphones) exist in a predetermined acoustic space, sequentially generates a plurality of separated signals corresponding to the sound source signals (signals identifying the sound source signals) by performing sound source separation processing by a blind source separation method based on the independent component analysis method (hereinafter referred to as the ICA-BSS sound source separation method) on a plurality of mixed speech signals which are sequentially input through each of the voice input means and on which the sound source signals from the respective sound sources are superimposed, and which outputs each of the plurality of separated signals through each of a plurality of output terminals (hereinafter referred to as first output terminals); or as a program for the sound source separation device (a computer program) that causes a processor of such a sound source separation device to execute predetermined procedures; or as a sound source separation method having the corresponding steps. Its features are: a feature quantity calculation/recording procedure for calculating, for each of the first output terminals, the feature quantity of the separated signal for each predetermined time length and temporarily storing it in predetermined storage means; a signal change determination procedure for determining the replacement state of the separated signals output through the first output terminals by automatically evaluating, for each of the first output terminals, the transition of the feature quantities temporarily stored by the feature quantity calculation/recording procedure; and an output switching procedure for switching, based on the determination result of the signal change determination procedure, which of the separated signals output through the first output terminals is to be output through which of one or more other output terminals different from the first output terminals (hereinafter referred to as second output terminals). The invention is constituted as an apparatus comprising means for performing these procedures, as a program for a sound source separation device that causes a processor to execute them, or as a sound source separation method including the respective steps. Note that the separated signal for each predetermined time length referred to here is not necessarily obtained by dividing the entire sequentially generated separated signal into consecutive segments of the predetermined time length. It is a concept that also includes, for example, a separated signal of the predetermined time length taken from an arbitrary time point, such as the separated signal of the predetermined time length generated from the time point at which each predetermined period longer than the predetermined time length elapses, or from the time point at which each calculation of the feature quantity is completed.
[0005]
According to the above configuration, when the position of the sound sources with respect to the voice input means moves and the directions (left-right directions) of the plurality of sound sources with respect to the voice input means are thereby interchanged, the feature quantities of the separated signals output from the first output terminals are interchanged accordingly, and the transmission path of each separated signal from a first output terminal to a second output terminal is switched according to the state of that interchange. Here, as the feature quantity calculation/recording procedure, it is conceivable, for example, to calculate a frequency feature quantity based on a frequency analysis calculation of the separated signal for each predetermined time length. As a more specific example of the feature quantity calculation/recording procedure, it is conceivable to calculate, as the feature quantity, the peak frequency in the power spectrum of the separated signal for each predetermined time length; in this case, it is conceivable that the signal change determination procedure determines the replacement state of the separated signals by comparing past peak frequencies with the current peak frequency.
[0006]
In addition, it is more preferable that the sound source separation device according to the present invention further includes each of the components shown in (1) to (3) below.
(1) Specific sound source direction estimation means for estimating, based on the separation matrix calculated by the learning calculation performed in the sound source separation processing by the blind source separation method based on the independent component analysis method (the ICA-BSS sound source separation method), the direction in which each of two sound sources (hereinafter referred to as specific sound sources) present respectively in the sound collection ranges of two predetermined adjacent voice input means (hereinafter referred to as specific voice input means) exists.
(2) A voice input means orientation adjustment mechanism for adjusting the orientation of the plurality of voice input means as a whole.
(3) Voice input means orientation control means for controlling the voice input means orientation adjustment mechanism so that the middle direction between the orientations of the specific voice input means is directed toward the middle direction between the directions in which the specific sound sources estimated by the specific sound source direction estimation means exist.
If the sound source separation device according to the present invention further includes the components shown in (1) to (3) above, then, in a situation where the sound sources can move and it is desired to track one of them, it is possible to avoid as much as possible the occurrence of the uneven distribution state of the sound sources caused by the sound source to be tracked and the sound source adjacent to it (the two specific sound sources) coming close to each other. As the method of estimating the direction (DOA) of a sound source based on the separation matrix, a conventionally known method may be adopted.
[0007]
Incidentally, even if the uneven distribution state of the sound sources does not occur and the state in which one specific sound source is present in each of the sound collection ranges of the two specific voice input means is maintained, when the direction in which a specific sound source exists with respect to the specific voice input means changes significantly, the convergence of the separation matrix in the learning calculation takes a long time, or the sound source separation performance is degraded. On the other hand, suppose that sound source separation processing by the ICA-BSS sound source separation method is performed in a state where the directions of the specific sound sources are fixed to predetermined reference directions (hereinafter referred to as a reference state) and the learning calculation is performed sufficiently, yielding a separation matrix (hereinafter referred to as a reference separation matrix) that has converged sufficiently for the directions of the specific sound sources at that time. Then, when this reference separation matrix is used as the initial value (initial matrix) of the separation matrix for the learning calculation whenever the reference state, or a state close to it, holds, the time required for the learning calculation is relatively short, and a new separation matrix with high separation performance is obtained (even with a small number of sequential calculations). It is therefore even more preferable that the sound source separation device according to the present invention further includes each of the components shown in (4) and (5) below in addition to the components shown in (1) to (3) above. Here, it is assumed that initial matrix candidate information, which associates a plurality of reference directions indicating directions in which a specific sound source may exist with a plurality of candidates for the initial matrix serving as the initial value of the separation matrix, is stored in advance in predetermined storage means. The plurality of initial matrix candidates in the initial matrix candidate information correspond to the reference separation matrices obtained by the learning calculation in each of a plurality of kinds of reference states.
(4) After-control specific sound source direction calculation means for calculating the direction in which each specific sound source exists with respect to the orientation of the specific voice input means after control by the voice input means orientation control means, based on the estimation result of the specific sound source direction estimation means and the adjustment amount of the orientation of the plurality of voice input means produced by that control.
(5) Initial matrix selection means for selecting the initial matrix to be used for the next learning calculation from among the plurality of initial matrix candidates in the initial matrix candidate information, based on the calculation result of the after-control specific sound source direction calculation means.
More specifically, it is conceivable that the initial matrix selection means identifies, from the initial matrix candidate information, the reference direction closest to the direction in which the specific sound source exists after control by the voice input means orientation control means (hereinafter referred to as the post-control specific sound source direction), identifies the initial matrix candidate corresponding to the identified reference direction, and sets it as the initial matrix to be used for the next learning calculation.
For example, it is conceivable to perform the selection of the initial matrix by the initial matrix selection means when the change in the post-control specific sound source direction (for example, the difference between the previously calculated value and the currently calculated value) is equal to or larger than a predetermined angle. Thereby, even when the direction of a specific sound source with respect to the specific voice input means changes largely, an appropriate initial matrix is selected (set) according to that change, so the problem that the convergence of the separation matrix in the learning calculation takes a long time and the sound source separation performance is degraded can be avoided.
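As a concrete illustration of component (5), the following Python sketch implements the nearest-reference-direction selection rule described above. It is only a sketch under stated assumptions: the candidate table layout, the function names, and the trigger threshold are hypothetical, not taken from the patent.

# Hypothetical sketch of the initial-matrix selection rule (component (5)).
# candidates: list of (reference_direction_deg, reference_matrix) pairs,
# i.e. the "initial matrix candidate information" stored in advance.

def select_initial_matrix(theta_after_control, candidates):
    """Pick the candidate whose reference direction is closest to the
    specific-sound-source direction measured after orientation control."""
    best = min(candidates,
               key=lambda cand: abs(cand[0] - theta_after_control))
    return best[1]

def should_reselect(theta_prev, theta_now, threshold_deg=10.0):
    """Reselect only when the post-control direction has moved by a
    predetermined angle or more (threshold is an assumed value)."""
    return abs(theta_now - theta_prev) >= threshold_deg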
[0008]
According to the present invention, when the positions of the sound sources with respect to the voice input means (microphones) move and the directions (left-right directions) of the plurality of sound sources with respect to the voice input means are thereby interchanged, the feature quantities of the separated signals output from the first output terminals are interchanged, and the transmission path of each separated signal from a first output terminal to a second output terminal is switched according to the state of that interchange. As a result, the separated signal corresponding to a specific sound source can be output through a specific output terminal (second output terminal); that is, it becomes possible to track the sound source. Further, by performing control that directs the middle direction between the orientations of the two specific voice input means toward the middle direction between the (estimated) directions in which the specific sound sources exist, it is possible to avoid as much as possible a situation in which the uneven distribution state of the sound sources occurs and proper sound source separation can no longer be performed. Furthermore, by selecting the initial matrix to be used for the next learning calculation according to the direction in which a specific sound source exists after the orientation control of the voice input means has been performed, even when the direction in which the specific sound source exists with respect to the specific voice input means changes largely, the problems that the convergence of the separation matrix in the learning calculation takes a long time and that the sound source separation performance is degraded can be avoided, and high sound source separation performance can be maintained.
[0009]
Hereinafter, an embodiment of the present invention will be described with reference to the accompanying drawings for understanding of the present invention. The following embodiment is an example embodying the present invention and is not of a nature that limits the technical scope of the present invention. Here, FIG. 1 is a block diagram showing a schematic configuration of the sound source separation device X according to the embodiment of the present invention, FIG. 2 is a diagram for explaining the operation of an output buffer provided in the sound source separation device X, FIG. 3 is a flowchart showing the procedure of the output channel switching processing performed by the sound source separation device X, FIG. 4 is a block diagram showing a schematic configuration of a sound source separation unit Z1 that performs BSS sound source separation processing based on the TDICA method, FIG. 5 is a block diagram showing a schematic configuration of a sound source separation unit Z2 that performs BSS sound source separation processing based on the FDICA method, FIG. 6 is a block diagram showing a schematic configuration of a sound source separation device X′ which is an application example of the sound source separation device X, FIG. 7 schematically shows how the orientation of the microphones is adjusted by the rotation control unit, and FIG. 8 is a flowchart showing the procedure of the microphone orientation control and initial matrix setting processing by the sound source separation device X′.
[0010]
First, before describing the embodiment of the present invention, ICA-BSS sound source separation units (sequential sound source separation means applicable as components of the present invention) will be described with reference to the block diagrams shown in FIG. 4 and FIG. 5. Each of the sound source separation units Z1 and Z2 shown below performs the following process (sequential sound source separation processing): in a state where a plurality of sound sources and a plurality of microphones (voice input means) exist in a predetermined acoustic space, when a plurality of mixed speech signals, which are signals on which the individual speech signals (hereinafter referred to as sound source signals) from the respective sound sources are superimposed, are sequentially input through the respective microphones, the mixed speech signals are subjected to ICA-BSS sound source separation processing, and a plurality of separated signals (signals identifying the sound source signals) corresponding to the sound source signals are sequentially generated.
[0011]
FIG. 4 is a block diagram showing a schematic configuration of a conventional sound source separation unit Z1 that performs BSS sound source separation processing based on the time-domain independent component analysis method (hereinafter referred to as the TDICA method), which is one kind of ICA-BSS method. The details of this process are given in Non-Patent Document 1, Non-Patent Document 2, and the like. The sound source separation unit Z1 performs sound source separation by applying filter processing with the separation matrix W(z) to the two-channel (the number of microphones) mixed speech signals x1(t) and x2(t) obtained when the sound source signals S1(t) and S2(t) (the individual sound signals of each sound source) from the two sound sources 1 and 2 are input through the two microphones (voice input means) 111 and 112. The mixed speech signals x1(t) and x2(t) are signals digitized at a predetermined sampling period, but the A/D conversion means is omitted from FIGS. 4 and 5. Although FIG. 4 shows an example in which sound source separation is performed on the basis of the two-channel (the number of microphones) mixed speech signals x1(t) and x2(t) obtained by inputting the sound source signals S1(t) and S2(t) (individual audio signals) from the two sound sources 1 and 2 through the two microphones (voice input means) 111 and 112, the same applies to more than two channels. For sound source separation by the ICA-BSS method, it is sufficient that (the number n of channels of the input mixed speech signals (i.e., the number of microphones)) ≥ (the number m of sound sources). Sound source signals from a plurality of sound sources are superimposed on each of the mixed sound signals x1(t) and x2(t) collected by the plurality of microphones 111 and 112. Hereinafter, the mixed speech signals x1(t) and x2(t) are generically referred to as x(t). This mixed speech signal x(t) is expressed as a spatio-temporal convolution of the sound source signals S(t), as in equation (1) below. The theory of sound source separation by TDICA is based on the idea that, since the individual sources in the sound source signal S(t) are statistically independent, S(t) can be estimated if x(t) is known, and therefore the sound sources can be separated. Denoting the separation matrix used for the sound source separation processing by W(z), the separated signal (that is, the identification signal) y(t) is expressed by equation (2) below. W(z) is obtained from the output y(t) by sequential calculation (learning calculation).
Separated signals are obtained for as many channels as there are inputs. Note that for sound source synthesis processing, an array corresponding to the inverse operation may be formed based on the information of W(z) and the inverse operation performed with it. An initial value (initial matrix) of the separation matrix is set in advance for performing the sequential calculation of the separation matrix W(z). By performing sound source separation by such an ICA-BSS method, for example, a sound source signal of singing voice and a sound source signal of a musical instrument are separated (identified) from multi-channel mixed sound signals in which a human singing voice and the sound of an instrument such as a guitar are mixed. Here, equation (2) can be rewritten as equation (3) below. The separation filter (separation matrix) W(n) in equation (3) is updated sequentially by equation (4) below. That is, the current W(n) (update j+1) is obtained by applying the output y(t) obtained with the previous filter (update j) to equation (4).
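Equations (1) through (4) were drawn as images in the original publication and are missing from this translation. The following LaTeX reconstruction is a sketch based on the surrounding definitions and the standard TDICA formulation in Non-Patent Documents 1 and 2; the filter lengths N and D, the step size α, the score function φ(·), and the exact form of the update (here the holonomy-constrained natural gradient with an off-diagonal operator) are assumptions and may differ from the patent's own figures.

\begin{align}
x(t) &= \sum_{n=0}^{N-1} A(n)\,s(t-n) \;=\; A(z)\,s(t) & (1)\\
y(t) &= W(z)\,x(t) & (2)\\
y(t) &= \sum_{n=0}^{D-1} W(n)\,x(t-n) & (3)\\
W_{j+1}(n) &= W_{j}(n) - \alpha \sum_{d=0}^{D-1} \operatorname{offdiag}\!\big\langle \varphi\big(y(t)\big)\, y(t-n+d)^{\mathsf{T}} \big\rangle_{t}\; W_{j}(d) & (4)
\end{align}

Here A(z) denotes the (unknown) mixing system, ⟨·⟩_t denotes time averaging over the learning block, and offdiag⟨·⟩ zeroes the diagonal so that only the cross-channel dependence drives the update.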
[0012]
Next, the conventional sound source separation unit Z2, which performs sound source separation processing based on the FDICA method (Frequency-Domain ICA), a kind of ICA-BSS method, will be described with reference to the block diagram shown in FIG. 5. In the FDICA method, the input mixed speech signal x(t) is first divided by the ST-DFT processing unit 13 into frames of a predetermined period, and a short-time discrete Fourier transform (hereinafter referred to as ST-DFT processing) is applied to each frame, whereby the observed signal is analyzed over short time windows. Then, for the signal of each channel after the ST-DFT processing (the signal of each frequency component), the separation filter processing unit 11f performs separation filter processing based on the separation matrix W(f) to identify the sound sources (identify the sound source signals). Here, when f is a frequency bin and m is an analysis frame number, the separated signal (identification signal) y(f, m) can be expressed as equation (5) below. The update equation of the separation filter W(f) can be expressed, for example, as equation (6) below. According to the FDICA method, the sound source separation processing is treated as an instantaneous mixing problem in each narrow band, and the separation filter (separation matrix) W(f) can be updated relatively easily and stably.
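Equations (5) and (6) are likewise missing from the translation. A hedged LaTeX reconstruction from the definitions above and the standard FDICA update (see also Non-Patent Documents 3 and 4) is:

\begin{align}
y(f, m) &= W(f)\,x(f, m) & (5)\\
W_{j+1}(f) &= W_{j}(f) - \eta \operatorname{offdiag}\!\big\langle \varphi\big(y(f, m)\big)\, y(f, m)^{\mathsf{H}} \big\rangle_{m}\; W_{j}(f) & (6)
\end{align}

where x(f, m) is the ST-DFT of the observed mixture, ⟨·⟩_m averages over analysis frames, (·)^H is the conjugate transpose, and η is a step size; the patent's actual nonlinearity and normalization may differ.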
[0013]
Hereinafter, the sound source separation device X according to the embodiment of the present invention will be described with reference to the block diagram shown in FIG. 1. The sound source separation device X includes a plurality of microphones 111 and 112 (voice input means) disposed in an acoustic space in which a plurality of sound sources 1 and 2 exist, and, from a plurality of mixed speech signals xi(t) which are sequentially input through the microphones 111 and 112 and on which the sound source signals (individual speech signals) from each of the sound sources 1 and 2 are superimposed, sequentially generates separated signals y1i(t) (that is, identification signals corresponding to the sound source signals) obtained by separating (identifying) the sound source signals, and outputs them in real time to speakers (audio output means).
[0014]
As shown in FIG. 1, the sound source separation device X is configured to include an ADC (A/D converter) 21, a separation calculation processing unit 11, a learning calculation unit 12, an output buffer 22, a DAC (D/A converter) 23, a frequency analysis unit 24, a data storage unit 25, and the like. Here, the learning calculation unit 12 and the separation calculation processing unit 11 together constitute a sound source separation unit 10. It is conceivable that the constituent elements of the sound source separation unit 10 and the frequency analysis unit 24 are each realized by an arithmetic processor such as a DSP (Digital Signal Processor), storage means such as a ROM storing the programs executed by that processor, a RAM, and other peripheral devices. Alternatively, it is conceivable that a computer having a single CPU and its peripheral devices executes program modules corresponding to the processing performed by each component. It is also conceivable to provide a program for a sound source separation device that causes a predetermined computer (including a processor provided in the sound source separation device) to execute the processing of each component. FIG. 1 shows an example in which the number of channels (that is, the number of microphones) of the input mixed speech signals xi(t) is two, but as long as (number of channels n) ≥ (number of sound sources m), three or more channels can be realized with the same configuration.
[0015]
The ADC 21 converts the analog mixed speech signal input from each of the plurality of microphones 111 and 112 into a digital mixed speech signal xi(t) by sampling at a predetermined sampling period. For example, when each sound source signal Si(t) is a human voice signal, it may be digitized at a sampling frequency of about 8 kHz. The separation calculation processing unit 11 performs sound source separation processing (a sequential sound source separation procedure) in which a matrix operation using the separation matrix W is applied to the mixed speech signals xi(t) sequentially input through the microphones 111 and 112, a plurality of separated signals y1i(t) corresponding to the sound source signals Si(t) are thereby sequentially generated, and each of the plurality of separated signals y1i(t) is output through one of a plurality of output terminals Op1i (hereinafter referred to as first output channels) (an example of the sequential sound source separation means). The microphones 111 and 112 are both disposed in a predetermined acoustic space in which the plurality of sound sources 1 and 2 exist. Here, there are as many first output channels Op1i for the separated signals y1i(t) as there are inputs of the mixed speech signals (= the number of microphones). In the example shown in FIG. 1, i is 1 or 2 (two channels). The learning calculation unit 12 sequentially calculates the separation matrix W used by the separation calculation processing unit 11 by performing the learning calculation of the separation matrix W in the ICA-BSS sound source separation processing using the mixed speech signals xi(t) of a predetermined time length. Note that, since the mixed speech signals xi(t) are sampled and digitized at a predetermined period, specifying a time length of the mixed speech signals xi(t) is synonymous with specifying a number of samples of them. Here, the separation matrix calculation (learning calculation) by the learning calculation unit 12 and the sound source separation processing (matrix operation processing) executed by the separation calculation processing unit 11 based on that separation matrix correspond to those of the sound source separation unit Z1 (TDICA method) shown in FIG. 4, or to the update processing and separation filter processing of the separation matrix (separation filter) based on the FDICA method shown in FIG. 5. The separation filter processing units 11t and 11f illustrated in FIGS. 4 and 5 correspond to the separation calculation processing unit 11.
[0016]
The data storage unit 25 is a storage unit for storing various data read and written by the frequency analysis unit 24, and comprises, for example, a RAM, an EEPROM, a flash memory, or the like. The frequency analysis unit 24 performs frequency analysis calculations and various processes based on the calculation results; the details will be described later. The DAC 23 (D/A converter) converts the digital separated signals y21 and y22 (generically referred to as y2i) output from the output buffer 22 through the second output channels Op21 and Op22 (described later) into analog signals. The converted analog signals are output as sound through predetermined speakers.
[0017]
The output buffer 22 is a so-called ping-pong buffer and has an input/output system for a plurality of channels. Hereinafter, the plurality of input terminals Ip1 and Ip2 of the output buffer 22 are referred to as input channels, and its plurality of output terminals Op21 and Op22 are referred to as second output channels. The example shown in FIG. 1 has an input/output system for two channels. The operation of the output buffer 22 will be described below with reference to FIG. 2. The output buffer 22 comprises two FIFO buffers for each input channel. In each buffer shown in FIG. 2, the right side in the figure represents the top address side. Hereinafter, the two buffers corresponding to the input channel Ip1 are referred to as buffers M1a and M1b, and the two buffers corresponding to the input channel Ip2 as buffers M2a and M2b. Here, since the input channels Ipi of the output buffer 22 and the first output channels Op1i of the preceding sound source separation unit 10 are connected in a fixed one-to-one manner, one input channel Ip1 is equivalent to the first output channel Op11, and the other input channel Ip2 is equivalent to the first output channel Op12.
[0018]
First, when the separated signals y11 and y12 (generically referred to as y1i) are initially input from the input channels Ip1 and Ip2, each separated signal y1i is accumulated in one of the buffers M1a and M2a (hereinafter referred to as the a-side buffers), sequentially from the top address until the memory becomes full. Next, when the a-side buffers Mia are filled (the memory becomes full), the subsequently input separated signals y1i are accumulated in the other buffers M1b and M2b (hereinafter referred to as the b-side buffers), sequentially from the top address until those become full. In parallel with the signal accumulation in the b-side buffers M1b and M2b, the signals accumulated in the a-side buffers M1a and M2a are output through the second output channels Op2i, sequentially from the one stored at the top address. FIG. 2A represents the situation in which signals are being sequentially accumulated in the b-side buffers M1b and M2b while, in parallel, the signals accumulated in the a-side buffers M1a and M2a are being output through the second output channels Op2i. In the figure, the arrows labeled CH1Pt and CH2Pt indicate the designated positions of the pointers that designate the signal to be output to each second output channel Op2i: CH1Pt indicates the designated position of the pointer for one second output channel Op21, and CH2Pt that of the pointer for the other second output channel Op22. The designated positions of the pointers CH1Pt and CH2Pt corresponding to the second output channels Op2i move sequentially at a constant speed from the top address side of the a-side buffers M1a and M2a or the b-side buffers M1b and M2b toward the rear end side, and the designated signals are thereby output through the respective second output channels Op2i.
[0019]
Then, when all the signals accumulated in the a-side buffers M1a and M2a have been output, the designated positions of the pointers CH1Pt and CH2Pt subsequently move to the top addresses of the b-side buffers M1b and M2b, and the signals accumulated in the b-side buffers M1b and M2b are output through the second output channels Op2i, sequentially from the one stored at the top address. In addition, since the signal accumulation in the b-side buffers M1b and M2b is completed almost simultaneously with the completion of the output of all the signals in the a-side buffers M1a and M2a, the subsequently input separated signals y1i are accumulated in the a-side buffers M1a and M2a in parallel with the signal output from the b-side buffers M1b and M2b. FIG. 2B represents the situation in which signals are being sequentially accumulated in the a-side buffers M1a and M2a while, in parallel, the signals accumulated in the b-side buffers M1b and M2b are being output through the second output channels Op2i. By repeating the above operation, the separated signals y1i are output in real time through the second output channels Op2i with a predetermined delay time.
[0020]
In addition, by switching the setting of the correspondence relationship of the output channels, the output buffer 22 can switch which of the separated signals y11 and y12 output through the first output channels Op11 and Op12 (that is, input through the input channels Ip1 and Ip2) is output through which of the two (plurality of) second output channels Op2i. Here, setting the correspondence relationship of the output channels means setting whether each of the pointers CH1Pt and CH2Pt corresponding to the second output channels Op21 and Op22 points to the buffers M1a and M1b on the input channel Ip1 side (that is, the first output channel Op11 side) or to the buffers M2a and M2b on the input channel Ip2 side (that is, the first output channel Op12 side). In this embodiment, the correspondence relationship of the output channels is set by the frequency analysis unit 24. Hereinafter, the setting that associates the first output channel Op1x (input channel Ipx) with the second output channel Op2y (x and y being channel numbers) is expressed as "(x => y)".
[0021]
FIGS. 2A and 2B described above show the case where the correspondence relationship of the output channels is "(1 => 1) and (2 => 2)"; that is, the pointer CH1Pt points to the buffer M1a or M1b, and the pointer CH2Pt points to the buffer M2a or M2b. On the other hand, FIG. 2C shows an example of the case where the correspondence of the output channels is "(1 => 2) and (2 => 1)". The example shown in FIG. 2C shows a state where the pointer CH1Pt points to the buffer M2b and the pointer CH2Pt points to the buffer M1b. Thus, the signal accumulated in the b-side buffer M1b (i.e., the signal input through the input channel Ip1) is output through the second output channel Op22, and the signal accumulated in the b-side buffer M2b (i.e., the signal input through the input channel Ip2) is output through the second output channel Op21. The correspondence between the first output channels Op1i (input channels Ipi) and the second output channels Op2i is switched between the state shown in FIG. 2C and the state shown in FIGS. 2A and 2B.
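The double-buffered output stage and the switchable channel correspondence described in paragraphs [0017] to [0021] can be summarized by the following Python sketch. It is a simplified, block-wise model under stated assumptions: the class and method names are hypothetical, and the real device streams samples continuously via the pointers CH1Pt and CH2Pt rather than emitting whole blocks.

# Minimal ping-pong output buffer with switchable channel correspondence.
class PingPongOutputBuffer:
    def __init__(self, n_channels=2, size=1024):
        self.size = size
        # Two FIFO buffers per input channel: the "a side" and the "b side".
        self.a = [[] for _ in range(n_channels)]
        self.b = [[] for _ in range(n_channels)]
        self.filling_a = True
        # mapping[y] = x means second output channel Op2(y+1) reads the
        # signal of first output channel Op1(x+1), i.e. "(x => y)".
        self.mapping = list(range(n_channels))

    def push(self, samples_per_channel):
        """Accumulate one batch of separated samples y1i per input channel;
        when one side fills, return its contents, routed per the mapping."""
        target = self.a if self.filling_a else self.b
        for ch, samples in enumerate(samples_per_channel):
            target[ch].extend(samples)
        if all(len(buf) >= self.size for buf in target):
            out = [target[self.mapping[ch]][:self.size]
                   for ch in range(len(target))]   # one block per Op2i
            for buf in target:
                del buf[:self.size]
            self.filling_a = not self.filling_a    # flip to the other side
            return out
        return None

    def set_mapping(self, mapping):
        """Switch the output-channel correspondence, e.g. [1, 0] for
        "(1 => 2) and (2 => 1)"; applied from the next output block."""
        self.mapping = list(mapping)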
[0022]
Next, the procedure of the output channel switching processing in the sound source separation device X will be described with reference to the flowchart shown in FIG. 3. The processing shown in FIG. 3 is assumed to start from the time point at which the learning calculation unit 12 has begun learning the separation matrix W and the separation calculation processing unit 11 has begun the sequential sound source separation processing based on the learned separation matrix W. S1, S2, and so on below denote the identification numbers of the processing steps. First, the correspondence relationship of the output channels is initialized by the frequency analysis unit 24, and the setting result is recorded in the data storage unit 25 (S1). The initial setting may be, for example, a predetermined correspondence, or a correspondence according to information input by the user through a predetermined operation input unit. For example, the correspondence relationship of the output channels is set to "(1 => 1) and (2 => 2)". Next, each of the separated signals y1i output through the first output channels Op1i is acquired (captured) by the frequency analysis unit 24 for the latest predetermined time length (S2). For example, 1024 samples (that is, about 1/8 sec) of the separated signal y1i sampled (digitized) at a sampling frequency of 8 kHz are acquired. The acquired separated signals y1i are temporarily stored in the main storage memory provided in the frequency analysis unit 24. Hereinafter, the separated signal y1i of the predetermined time length acquired here is referred to as one frame of the separated signal y1i.
[0023]
Next, for each of the first output channels Op1i, the frequency analysis unit 24 performs a frequency analysis calculation on the one-frame separated signal y1i acquired in step S2, thereby calculating a frequency feature quantity of the separated signal y1i (S3, S4). More specifically, window function processing is first performed on each separated signal y1i acquired in step S2 (S3), and an FFT analysis calculation is performed on the windowed separated signal (S4). Furthermore, based on the power spectrum of each one-frame separated signal y1i obtained by the FFT analysis calculation, the peak frequency in the power spectrum is derived as the feature quantity of each separated signal y1i (S4). Here, as the peak frequency in the power spectrum, it is conceivable to use, for example, among the frequencies at which the power peaks, the one with the largest peak value, or those whose peak values fall within a predetermined rank from the largest. As the frequency analysis calculation, besides the FFT (Fast Fourier Transform) analysis calculation, well-known frequency analysis methods can be adopted, such as analysis calculations based on the maximum entropy method (MEM) or on an autoregressive (AR) model. Besides the peak frequency in the power spectrum, the distribution range of the power spectrum (for example, the range of frequencies having power of a predetermined level or more) and the like are also conceivable as the feature quantity of the separated signal y1i.
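A minimal NumPy sketch of the per-frame feature extraction of steps S3 and S4 (windowing, FFT, peak frequency of the power spectrum) might look as follows; the Hann window and the frame interface are illustrative assumptions, and the patent also allows MEM- or AR-based analysis instead of the FFT.

import numpy as np

def peak_frequency(frame, fs=8000.0):
    """Return the frequency (Hz) of the largest power-spectrum peak of one
    frame of a separated signal y1i."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))      # window function (S3)
    spectrum = np.fft.rfft(windowed)               # FFT analysis (S4)
    power = np.abs(spectrum) ** 2                  # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs[np.argmax(power)]                 # peak frequency (feature)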
[0024]
Next, the frequency analysis unit 24 judges whether or not feature quantities based on the separated signals y1i for the past m frames are stored (accumulated) in the data storage unit 25 (S5). If it is determined in step S5 that the feature quantities for the past m frames are not yet stored, the feature quantities (peak frequencies, etc.) calculated in step S4 are additionally recorded (temporarily stored) in the data storage unit 25 for each second output channel Op2i (S11), and the process returns to step S2 described above. Thus, the processing of steps S2 to S4 described above is repeated until the feature quantities for the past m frames have accumulated in the data storage unit 25. As a result, for example, when the sampling frequency of the separated signal y1i (= the sampling frequency of the mixed speech signal xi(t)) is 8 kHz, one frame is a signal of 1024 samples, and m = 24, feature quantities based on about 3 seconds of the separated signal y1i are stored in the data storage unit 25. The sampling frequency, the number of samples in one frame, and the number m of accumulated frames are set to appropriate values according to the application.
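(As a check on the figures above: one frame of 1024 samples at 8 kHz spans 1024/8000 ≈ 0.128 seconds, so m = 24 frames cover 24 × 0.128 ≈ 3.07 seconds, consistent with the stated value of about 3 seconds.)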
[0025]
Furthermore, from then on, the feature quantities are recorded in the data storage unit 25 for each second output channel Op2i according to the current setting of the correspondence relationship of the output channels (that is, according to the switching status of the output of the separated signals). For example, if the memory areas corresponding to the second output channels Op21 and Op22 are denoted Mem1 and Mem2 respectively, then when the correspondence relationship of the output channels is "(1 => 1) and (2 => 2)", the feature quantity based on the separated signal y11 output through the first output channel Op11 is recorded in Mem1, and the feature quantity based on the separated signal y12 output through the first output channel Op12 is recorded in Mem2. On the other hand, when the correspondence of the output channels is "(1 => 2) and (2 => 1)", the feature quantity based on the separated signal y11 output through the first output channel Op11 is recorded in Mem2, and the feature quantity based on the separated signal y12 output through the first output channel Op12 is recorded in Mem1. As a result, as long as the tracking of the sound sources (the switching of the output channels in the output buffer 22) is performed properly, feature quantities of the separated signals corresponding to the same sound source are always accumulated in each of the memory areas Mem1 and Mem2.
[0026]
On the other hand, if it is determined in step S5 that the feature quantities for the past m frames are stored, the frequency analysis unit 24 performs processing for determining the replacement state of the separated signals y1i output through the first output channels Op1i (S6: an example of the signal change determination procedure). The determination of the replacement state is performed, for each of the first output channels Op1i, by automatically evaluating the transition of the feature quantities (their change with the passage of time) based on the past feature quantities stored in the data storage unit 25 and the latest feature quantities calculated in step S4. More specifically, it is conceivable to determine the replacement state by comparing each of the peak frequencies for the past m frames stored in the data storage unit 25 with the current (latest) peak frequencies calculated in step S4. For example, consider the case where the current correspondence of the output channels is "(1 => 1) and (2 => 2)". In this case, it is conceivable to determine that the separated signals y1i have been interchanged when one of the peak frequencies for the past m frames corresponding to the second output channel Op21 matches or approximates the current peak frequency calculated for the separated signal y12 of the first output channel Op12, or when one of the peak frequencies for the past m frames corresponding to the second output channel Op22 matches or approximates the current peak frequency calculated for the separated signal y11 of the first output channel Op11. Note that "approximates" means, for example, that even if there is a difference between the peak frequencies, the difference is within a predetermined error range.
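For the two-channel case, the interchange determination of step S6 reduces to the cross-comparison sketched below in Python; the tolerance value and function names are hypothetical, not specified in the patent.

def swapped(history_ch1, history_ch2, latest_ch1, latest_ch2, tol_hz=20.0):
    """history_chN: peak frequencies of the past m frames accumulated for
    second output channel Op2N; latest_chN: current peak frequency of the
    separated signal on first output channel Op1N."""
    def close(history, peak):
        # "Matches or approximates": within a predetermined error range.
        return any(abs(h - peak) <= tol_hz for h in history)
    # Interchanged if channel 1's history now matches channel 2's signal,
    # or channel 2's history now matches channel 1's signal.
    return close(history_ch1, latest_ch2) or close(history_ch2, latest_ch1)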
[0027]
Next, when it is determined in step S6 that the separated signals y1i of the first output channels Op1i have been interchanged, the frequency analysis unit 24 changes the setting of the correspondence relationship of the output channels so that the separated signals y2i of the second output channels Op2i are not interchanged, and stores the changed correspondence setting in the data storage unit 25 (S8). Further, the setting information of the changed output channel correspondence is notified from the frequency analysis unit 24 to the output buffer 22, and in response the output buffer 22 switches the output destinations (second output channels Op2i) of the separated signals (S9). By the processing of steps S7 to S9 by the frequency analysis unit 24 and the output buffer 22 described above, it is switched, based on the determination result of the signal change determination in step S6, which separated signal y1i output through a first output channel Op1i is output through which of the second output channels Op2i (an example of the output switching procedure). The switching of the output channels is performed in accordance with the timing at which the separated signals y1i determined to have been interchanged are output through the second output channels Op2i. By the above processing, the sound source separation device X operates as follows. That is, when the positions of the sound sources 1 and 2 with respect to the microphones 111 and 112 move and the directions (left-right directions) of the plurality of sound sources 1 and 2 with respect to the microphones 111 and 112 are thereby interchanged, the feature quantities of the separated signals y1i output from the first output channels Op1i are interchanged, and the state of that interchange is determined in step S6. Then, according to the determination result, the transmission paths of the separated signals from the first output channels Op1i to the second output channels Op2i are switched by the processing of steps S7 to S9. As a result, as long as the replacement determination (S6) of the separated signals y1i is performed correctly, each of the second output channels Op2i always outputs the separated signal y2i corresponding to the same sound source. That is, it becomes possible to track the sound sources.
[0028]
Furthermore, if it is determined in step S7 that no interchange of the separated signals y1i has occurred, or after the processing of step S9 is completed, the frequency analysis unit 24 replaces the oldest of the feature quantities for the past m frames stored (accumulated) in the data storage unit 25 with the current (latest) feature quantities calculated in step S4 (that is, updates the stored contents to those for the latest past m frames) (S10), and the process then returns to step S2. Thereafter, the processing of steps S2 to S10 is repeated. The processing of steps S4 and S10 executed by the frequency analysis unit 24, in which the feature quantity of the separated signal y1i for each predetermined time length is calculated for each first output channel Op1i (first output terminal) and temporarily stored in the data storage unit 25, is an example of the feature quantity calculation/recording procedure.
[0029]
Next, a sound source separation device X′, which is an application example of the sound source separation device X, will be described. First, the configuration of the sound source separation device X′ will be described with reference to the block diagram shown in FIG. 6. As shown in FIG. 6, in addition to the same components as the sound source separation device X, the sound source separation device X′ further includes a DOA estimation unit 31, a rotation control unit 40, and a microphone rotation mechanism 50. The DOA estimation unit 31 obtains the separation matrix W calculated by the learning calculation performed by the learning calculation unit 12 (that is, the learning calculation of the separation matrix W performed in the ICA-BSS sound source separation processing) and, based on the separation matrix W, executes a DOA estimation calculation that estimates the directions (hereinafter referred to as the specific sound source directions θa and θb) in which the two sound sources 1 and 2 (hereinafter referred to as the specific sound sources) present respectively in the sound collection ranges of two predetermined adjacent microphones 111 and 112 (hereinafter referred to as the specific microphones) exist (an example of the specific sound source direction estimation means). Like the learning calculation unit 12 and the others, the DOA estimation unit 31 is also realized by, for example, an arithmetic processor such as a DSP and storage means such as a ROM storing the program executed by that processor. Here, the specific sound source directions θa and θb are relative angles with respect to the microphone front direction, which represents the orientation of the whole set of microphones whose relative orientations are fixed to one another. In the example shown in FIG. 6, the middle direction R0 between the orientations of the specific microphones 111 and 112 is the microphone front direction, and the directions of the specific sound sources 1 and 2 are relative angles with respect to the microphone front direction R0. In the example shown in FIG. 6, since the total number of microphones is two, both (all) of them are the specific microphones; when the total number of microphones is three or more, two microphones designated in advance from among them are the specific microphones. Specifically, when one sound source to be tracked among the movable sound sources and the sound source next to it (two sound sources) are taken as the specific sound sources, the two microphones whose main sound collection ranges cover the positions of those specific sound sources are designated as the specific microphones. Information on which microphones are the specific microphones is stored in the data storage unit 25 in advance, and the DOA estimation unit 31 reads out and acquires that information. The DOA estimation unit 31 estimates (calculates) the specific sound source directions θa and θb by executing, for example, the DOA estimation processing described in Non-Patent Document 3 and Non-Patent Document 4.
More specifically, the specific sound source directions θa and θb (DOA) are estimated by multiplying the separation matrix W obtained from the learning calculation unit 12 by a steering vector. When the DOA estimation processing described in Non-Patent Document 3 and Non-Patent Document 4 is performed, the BSS sound source separation processing based on the FDICA method shown in FIG. 5 (sound source separation unit Z2) is adopted as the sound source separation processing.
[0030]
Hereinafter, the DOA estimation process (hereinafter referred to as the DOA estimation process
based on the dead angle characteristic) shown in the non-patent document 3 and the non-patent
document 4 will be described. The sound source separation process by the ICA method is a
process of calculating a matrix (separation matrix) representing a spatial dead angle filter by
learning operation, and removing sound from a certain direction by filter processing using the
separation matrix. The DOA estimation process based on the dead angle characteristic calculates
the spatial dead angle represented by the separation matrix for each frequency bin, and obtains
the direction (angle) of the sound source by calculating the average value of the spatial dead
angle for each frequency bin. presume. For example, in a sound source separation apparatus that
collects the sound of two sound sources with two microphones, the DOA estimation process
based on the dead angle characteristic executes the following calculation. In the following
10-04-2019
19
In the following description, the subscript k is the microphone identification number (k = 1, 2), the subscript l is the sound source identification number (l = 1, 2), f is the frequency, fm is the frequency bin with identification number m (m = 1, 2, ...), Wlk(f) is an element of the separation matrix obtained by the learning calculation in the sound source separation processing of the BSS method based on the FDICA method, c is the speed of sound, dk (d1 or d2) is the distance from the middle position of the two microphones to each microphone (half of the distance between the microphones, i.e., d1 = d2), and θ1 and θ2 are the DOAs of the two sound sources. First, according to equation (7) (corresponding to equation (12) in Non-Patent Document 4), the sound source angle information Fl(f, θ) is calculated for each frequency bin for l = 1 and l = 2. Next, the DOAs (angles) θ1(fm) and θ2(fm) for each frequency bin are obtained according to equations (8) and (9) (corresponding to equations (13) and (14) in Non-Patent Document 4). Then, for θ1(fm) calculated for each frequency bin, the average value over all frequency bins is calculated, and this average value is taken as the direction θ1 of one sound source. Similarly, for θ2(fm) calculated for each frequency bin, the average value over all frequency bins is calculated, and this average value is taken as the direction θ2 of the other sound source. The DOA estimation unit 31 also executes other processes, which will be described later.
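The following is a minimal sketch of the dead-angle DOA idea described above; it is not a reproduction of the patent's equations (7) to (9), which are not shown in this text. Each row of the per-bin separation matrix W(f) is treated as a null beamformer, the angle at which its directivity magnitude is smallest is taken as the per-bin dead angle, and the per-bin angles are averaged over all frequency bins. The function and variable names, and the grid search over candidate angles, are illustrative assumptions.

```python
import numpy as np

def estimate_doa_dead_angle(W, freqs, d, c=343.0, n_angles=181):
    """Estimate two DOAs from a per-bin 2x2 separation matrix.

    W     : complex array of shape (n_bins, 2, 2); W[m, l, k] is Wlk(fm)
    freqs : array of shape (n_bins,), center frequency of each bin in Hz
    d     : half the microphone spacing in meters (d1 = d2 = d)
    Returns (theta1, theta2) in degrees, averaged over frequency bins.
    """
    thetas = np.linspace(-90.0, 90.0, n_angles)   # candidate angles
    mic_pos = np.array([+d, -d])                  # signed mic offsets from midpoint
    per_bin = np.zeros((len(freqs), 2))
    for m, f in enumerate(freqs):
        # Steering vectors for every candidate angle: shape (n_angles, 2)
        sv = np.exp(1j * 2.0 * np.pi * f / c
                    * np.outer(np.sin(np.deg2rad(thetas)), mic_pos))
        nulls = []
        for l in range(2):
            # Directivity pattern |sum_k Wlk(f) exp(j 2*pi*f*dk*sin(theta)/c)|;
            # the spatial dead angle (null) of row l points at the removed source.
            pattern = np.abs(sv @ W[m, l, :])
            nulls.append(thetas[np.argmin(pattern)])
        # Sort so theta1(fm) <= theta2(fm) before averaging (cf. eqs. (8), (9))
        per_bin[m] = np.sort(nulls)
    return per_bin[:, 0].mean(), per_bin[:, 1].mean()
```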
[0031]
The microphone rotation mechanism 50 is a mechanism that adjusts the overall orientation of the plurality of microphones 111 and 112 by rotating, as a whole, the plurality of microphones 111 and 112 whose relative orientations are fixed with respect to each other (an example of the voice input means orientation adjustment mechanism). The microphone rotation mechanism 50 includes a microphone holding unit 51 and a servomotor 52. The microphone holding unit 51 is a member for holding (supporting) all the microphones 111 and 112 in a state in which their relative orientations are fixed. The servomotor 52, for example a stepping motor, is a drive source for rotating the microphone holding unit 51 about a predetermined rotation axis and holding it at an arbitrary orientation (rotation angle). By having the servomotor 52 rotate the microphone holding unit 51 about the predetermined rotation axis and hold it at a desired orientation, the microphone front direction R0 can be directed in any desired direction.
[0032]
The rotation control unit 40 is a controller that controls the microphone rotation mechanism 50 (here, the servomotor 52) so as to hold the microphone front direction R0, which is the middle direction of the orientations of the specific microphones 111 and 112, toward a desired direction. Specifically, the rotation control unit 40 acquires information on the specific sound source directions θa and θb from the DOA estimation unit 31, and outputs a control command to the microphone rotation mechanism 50 based on that information. In this way, the microphone front direction R0 is controlled so as to point in the middle of the directions (the specific sound source directions θa and θb) in which the specific sound sources 1 and 2 estimated by the DOA estimation unit 31 are present (an example of the voice input means direction control means). That is, the rotation control unit 40 controls the rotation shaft of the servomotor 52 to rotate by (θa + θb)/2 and stop.
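As a hedged illustration of this control rule (the patent does not specify a software interface, so the servo driver call below is hypothetical), the rotation command is simply the bisector of the two estimated directions:

```python
def adjustment_angle(theta_a: float, theta_b: float) -> float:
    """Relative rotation (delta-psi, in degrees) that re-centers the microphone
    front direction R0 on the bisector of the two estimated source directions."""
    return (theta_a + theta_b) / 2.0

def point_front_between_sources(servo, theta_a: float, theta_b: float) -> None:
    # `servo.rotate_by` is a hypothetical driver call; a real stepping-motor
    # interface would convert the angle to steps and hold the final position.
    servo.rotate_by(adjustment_angle(theta_a, theta_b))
```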
[0033]
Hereinafter, how the orientation of the microphones (the microphone front direction R0) is adjusted by the rotation control unit 40 will be described with reference to FIG. 7. FIG. 7 is a diagram (plan view) schematically showing how the orientation of the microphones (the microphone front direction R0) is adjusted by the rotation control unit 40. For example, assume that in the initial state, as shown in FIG. 7(a), the two specific sound sources 1 and 2 are sufficiently separated from each other (a state in which the difference |θa − θb| between the specific sound source directions is large), and are each present in the sound collection range of the corresponding specific microphone 111 or 112. Further, assume that in the state shown in FIG. 7(a), a separation matrix W sufficiently learned by the learning operation unit 12 has been obtained. The state shown in FIG. 7(a) roughly corresponds to (θa, θb) = (60°, −60°) (the second pattern described later). Then, as shown in FIG. 7(b), suppose that one or both of the two specific sound sources 1 and 2 move from the initial state within the sound collection ranges of the corresponding specific microphones 111 and 112, so that the positions of the two specific sound sources 1 and 2 come close to each other (the difference |θa − θb| between the specific sound source directions becomes small). The example shown in FIG. 7(b) is one in which only the specific sound source 2 has moved, from the position P1 in the initial state shown in FIG. 7(a) to the position P2. Here, if in the state shown in FIG. 7(b) the orientation of the specific microphones 111 and 112 (the microphone front direction R0) is held unchanged, then, as indicated by the thick dashed arrow in FIG. 7(b), only a slight further movement of the specific sound source 2, which lies in a direction close to the microphone front direction R0, causes the two specific sound sources 1 and 2 to become unevenly distributed within the sound collection range of the single specific microphone 112, so that the learning operation unit 12 and the separation calculation processing unit 11 can no longer separate the sound sources. Therefore, as shown in FIG. 7(c), the rotation control unit 40 directs the microphone front direction R0 to the middle of the specific sound source directions θa and θb estimated by the DOA estimation unit 31; that is, the orientation of the entire microphone assembly is adjusted (controlled) by a predetermined angle Δψ so that the post-control directions θa′ and θb′ of the specific sound sources 1 and 2 (the specific sound source directions after the direction control) are symmetrical with respect to the microphone front direction R0 (θa′ = −θb′). As a result, as indicated by the thick dashed arrow in FIG. 7(c), even if the specific sound source 2 lying in a direction close to the microphone front direction R0 moves somewhat, the two specific sound sources 1 and 2 each remain within the sound collection range of the corresponding specific microphone, and the state in which sound source separation by the learning operation unit 12 and the separation operation processing unit 11 becomes impossible can be avoided.
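As a hypothetical numerical illustration of this adjustment: if the DOA estimation unit 31 yields θa = 80° and θb = −40°, the rotation control unit 40 rotates the entire microphone assembly by Δψ = (80° + (−40°))/2 = 20°; the post-control directions then become θa′ = 80° − 20° = 60° and θb′ = −40° − 20° = −60°, symmetric about the microphone front direction R0 (θa′ = −θb′).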
[0034]
In addition, when one or both of the specific sound sources 1 and 2 move largely, so that the positional relationship between the specific microphones 111 and 112 and the specific sound sources 1 and 2 changes, for example, from the state shown in FIG. 7(a) to the state shown in FIG. 7(b), a problem may occur in the learning calculation by the learning operation unit 12: the convergence of the separation matrix W may take a long time, or the sound source separation performance may deteriorate. Therefore, based on the specific sound source directions θa′ and θb′ after the direction control, the DOA estimation unit 31 sets, as the initial matrix W0 (the initial value of the separation matrix W) to be used for the next learning calculation, a separation matrix previously obtained by sufficiently performing the learning calculation in a state in which the specific sound sources 1 and 2 existed in those directions or in directions close to them. The details will be described later.
[0035]
In the sound source separation device X′, initial matrix candidate information, which represents combinations of a plurality of reference directions of the specific sound source directions and a plurality of corresponding candidates for the initial matrix W0 (the initial value of the separation matrix W), is stored in advance in the data storage unit 25 as information to be referred to when setting the initial matrix W0. For example, when five patterns (hereinafter referred to as the first to fifth patterns) of reference directions of the specific sound source directions (θa, θb) are defined, namely (30°, −30°), (60°, −60°), (90°, −90°), (120°, −120°) and (150°, −150°), the initial matrix candidate information associates identification information of each of the five patterns with the initial matrix W0 suitable for that pattern. Here, each initial matrix W0 included in the initial matrix candidate information is a separation matrix (a reference separation matrix) obtained by performing the sound source separation processing of the ICA-BSS sound source separation method by the learning operation unit 12 and the separation calculation processing unit 11, with the learning calculation carried out sufficiently, in a state (a reference state) in which the specific sound source directions θa and θb are fixed to the reference directions of one of the five patterns. That is, a separation matrix W that has sufficiently converged under the condition that the specific sound sources 1 and 2 are arranged in the reference directions is set as the initial matrix W0. When an initial matrix W0 set in this manner is used in the learning calculation in the reference state or in a state close to it, a new separation matrix W with high separation performance can be obtained in a relatively short learning time (with a small number of sequential calculations). The sound source separation device X′ executes a process (the initial matrix setting process described later) of selecting, from the initial matrix candidate information, the initial matrix W0 to be used by the learning operation unit 12; its contents will be described later. The initial matrix candidate information may instead be stored in an external memory (for example, a flash memory) accessible by the DOA estimation unit 31 via a predetermined communication interface or memory interface.
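A minimal sketch of the initial-matrix lookup described in this paragraph, assuming the five symmetric reference patterns listed above and a dictionary-style store standing in for the data storage unit 25 (all names illustrative):

```python
# Reference directions (theta_a, theta_b) in degrees; for each pattern the
# candidate store holds a reference separation matrix learned offline in
# that reference state (placeholders for the stored candidates W0).
REFERENCE_PATTERNS = {
    1: (30.0, -30.0),
    2: (60.0, -60.0),
    3: (90.0, -90.0),
    4: (120.0, -120.0),
    5: (150.0, -150.0),
}

def select_initial_matrix(theta_a_prime, theta_b_prime, candidate_store):
    """Pick the stored W0 whose reference direction is closest to the
    post-control directions (theta_a', theta_b')."""
    best = min(
        REFERENCE_PATTERNS,
        key=lambda p: abs(REFERENCE_PATTERNS[p][0] - theta_a_prime)
                      + abs(REFERENCE_PATTERNS[p][1] - theta_b_prime),
    )
    return candidate_store[best]  # reference separation matrix for that pattern
```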
[0036]
Next, with reference to the flowchart shown in FIG. 8, the procedure of the microphone direction control and initial matrix setting processing by the sound source separation device X′ will be described. The process shown in FIG. 8 is executed in parallel with the real-time sound source separation process by the separation operation processing unit 11 and with the output channel switching process shown in FIG. 3, and is carried out each time a newly learned separation matrix W is obtained. In the following, S21, S22, and so on denote identification symbols of processing steps. First, the DOA estimation unit 31 monitors whether a new separation matrix W for which learning by the learning operation unit 12 has been completed has been obtained (i.e., whether the learning calculation has finished) (S21). When the DOA estimation unit 31 detects that such a new learned separation matrix W has been obtained by the learning operation unit 12, it acquires the new separation matrix W from the learning operation unit 12 (S22).
[0037]
Next, the DOA estimation unit 31 executes the estimation calculation of the specific sound source directions θa and θb based on the new separation matrix W obtained from the learning operation unit 12, and stores the estimation result in the data storage unit 25 (S23). Furthermore, the estimation result of the specific sound source directions θa and θb is handed over from the DOA estimation unit 31 to the rotation control unit 40, and the rotation control unit 40 calculates the adjustment angle Δψ (= (θa + θb)/2) of the microphone orientation based on the estimation results of θa and θb acquired from the DOA estimation unit 31 (S24). Next, the rotation control unit 40 controls the microphone rotation mechanism 50 to adjust the orientation of the entire microphone assembly by the adjustment angle Δψ (S25). The microphone front direction R0 is thereby directed to the middle of the specific sound source directions θa and θb estimated by the DOA estimation unit 31. By the process of step S25, the positional relationship between the specific microphones 111 and 112 and the specific sound sources 1 and 2 changes, for example, from the state shown in FIG. 7(b) to the state shown in FIG. 7(c). As a result, a situation in which the two specific sound sources 1 and 2 are unevenly distributed in the sound collection range of one of the specific microphones 111 and 112 and sound source separation cannot be performed properly is avoided as far as possible. In step S24 or step S25, the adjustment angle Δψ of the microphone orientation is handed over from the rotation control unit 40 to the DOA estimation unit 31.
[0038]
Meanwhile, based on the estimation result of the specific sound source directions θa and θb obtained in step S23 and the adjustment angle Δψ of the microphone orientation acquired from the rotation control unit 40, the DOA estimation unit 31 calculates the directions θa′ and θb′ in which the specific sound sources 1 and 2 exist after the orientation control (the specific sound source directions after the direction control), and records the calculation result in the data storage unit 25 (S26, an example of the post-control specific sound source direction calculation means). Specifically, θa′ = −θb′ = (θa − Δψ). So that the change in the post-control specific sound source directions θa′ and θb′ can be calculated, the data storage unit 25 holds a predetermined number of the most recent calculation results. Next, the DOA estimation unit 31 determines whether the change in the post-control specific sound source directions θa′ and θb′ (for example, the difference between the previous calculated value and the current calculated value) is equal to or larger than a predetermined set value (for example, 30°) (S27). When the DOA estimation unit 31 determines that the change in the post-control specific sound source directions θa′ and θb′ is equal to or larger than the set value, it selects, from the plurality of initial matrix candidates in the initial matrix candidate information, the initial matrix W0 to be used for the next learning calculation by the learning operation unit 12, in correspondence with the post-control specific sound source directions θa′ and θb′ calculated in step S26 (in the process of the post-control specific sound source direction calculation means), and delivers the selected initial matrix W0 to the learning operation unit 12 (S28, an example of the initial matrix selection means). More specifically, the DOA estimation unit 31 identifies, in the initial matrix candidate information, the reference direction closest to the post-control specific sound source directions θa′ and θb′ calculated in step S26, selects the candidate initial matrix W0 corresponding to that reference direction, and sets it as the initial matrix W0 to be used for the next learning calculation. As a result, the initial matrix W0 used for the next learning calculation by the learning operation unit 12 is updated to the one handed over from the DOA estimation unit 31. Consequently, even if the directions of the specific sound sources 1 and 2 change significantly, an initial matrix W0 appropriate to that change is selected (set), so the problems that the convergence of the separation matrix W in the next learning calculation takes a long time and that the sound source separation performance deteriorates can be avoided. On the other hand, when the DOA estimation unit 31 determines in step S27 that the change in the post-control specific sound source directions θa′ and θb′ is less than the set value, it skips the process of step S28. In that case, the separation matrix W most recently obtained by learning at that point is carried over as the initial matrix W0 used for the next learning calculation by the learning operation unit 12. Thereafter, the processes of steps S22 to S28 are repeated each time the learning calculation is performed by the learning operation unit 12 and a new learned separation matrix W is obtained.
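Putting steps S21 to S28 together, one hypothetical realization of the loop might look as follows. Every name is illustrative: `learner`, `rotator`, and `store` stand in for the learning operation unit 12, the rotation control unit 40 (with the microphone rotation mechanism 50), and the data storage unit 25, and `estimate_doa` / `select_initial_matrix` are the routines sketched earlier.

```python
CHANGE_THRESHOLD_DEG = 30.0   # example set value for step S27

def direction_control_loop(learner, rotator, store, estimate_doa,
                           select_initial_matrix):
    """Schematic of steps S21-S28 (all interfaces are assumptions)."""
    prev_theta_a_p = None
    while True:
        W = learner.wait_for_new_separation_matrix()   # S21, S22
        theta_a, theta_b = estimate_doa(W)             # S23: DOA estimation
        delta_psi = (theta_a + theta_b) / 2.0          # S24: adjustment angle
        rotator.rotate_by(delta_psi)                   # S25: reorient microphones
        theta_a_p = theta_a - delta_psi                # S26: post-control direction
        store.record(theta_a_p, -theta_a_p)            #      (theta_a' = -theta_b')
        if (prev_theta_a_p is not None
                and abs(theta_a_p - prev_theta_a_p) >= CHANGE_THRESHOLD_DEG):
            # S27/S28: large change -> re-seed learning with a stored candidate W0
            learner.set_initial_matrix(
                select_initial_matrix(theta_a_p, -theta_a_p, store.candidates))
        # Otherwise the most recently learned W carries over as the next W0.
        prev_theta_a_p = theta_a_p
```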
[0039]
As described above, the sound source separation device X′ controls the orientation of the two specific microphones 111 and 112 so that the middle direction of their orientations is directed to the middle of the directions (estimated directions) in which the specific sound sources 1 and 2 exist (S25). As a result, the situation in which the sound sources become unevenly distributed and proper sound source separation cannot be performed is avoided as far as possible. In addition, the sound source separation device X′ selects the initial matrix W0 to be used for the next learning calculation according to the directions θa′ and θb′ in which the specific sound sources 1 and 2 exist after the direction control of the microphones is performed (S28). Thereby, even when the directions in which the specific sound sources 1 and 2 are present change largely, the problems that the convergence of the separation matrix W in the learning calculation takes a long time and that the sound source separation performance deteriorates are avoided, and high sound source separation performance can be maintained.
[0040]
The present invention is applicable to a sound source separation device.
[0041]
FIG. 1 is a block diagram showing a schematic configuration of a sound source separation device X according to an embodiment of the present invention. FIG. 2 is a diagram explaining the operation of the output buffer provided in the sound source separation device X. FIG. 3 is a flowchart illustrating the procedure of the output channel switching processing performed by the sound source separation device X. FIG. 4 is a block diagram showing a schematic configuration of a sound source separation unit Z1 that performs sound source separation processing of the BSS method based on the TDICA method. FIG. 5 is a block diagram showing a schematic configuration of a sound source separation unit Z2 that performs sound source separation processing of the BSS method based on the FDICA method. FIG. 6 is a block diagram showing a schematic configuration of a sound source separation device X′ which is an application example of the sound source separation device X. FIG. 7 is a diagram schematically showing how the orientation of the microphones is adjusted by the sound source separation device X′. FIG. 8 is a flowchart showing the procedure of the microphone direction control and initial matrix setting processing by the sound source separation device X′.
Explanation of Reference Numerals
[0042]
X, X′: sound source separation devices according to embodiments of the present invention; 1, 2, ...: sound sources; 10: sound source separation unit; 11: separation operation processing unit; 12: learning operation unit; DAC: D/A converter; 24: frequency analysis unit; 25: data storage unit; 31: DOA estimation unit; 40: rotation control unit; 50: microphone rotation mechanism; 51: microphone holding unit; 52: servomotor; 111, 112: microphones; S1, S2, ...: processing procedure (steps); Ip1, Ip2: input channels of the output buffer; Op11, Op12: first output channels (first output terminals); Op21, Op22: second output channels (second output terminals); M1a, M1b, M2a, M2b: buffers