DESCRIPTION JP2009272876

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
A high-quality two-output signal that preserves the spatial sound information of a target sound is
provided. A sound source separation and emphasis system includes: reception units 111 and 112,
which record sound signals generated from a number of sound sources at two different points; a
filter processing unit 120, which takes the received signals XL and XR from the reception units
111 and 112 as inputs and processes them so that the signal of the target sound source is not
included in the difference signal between the received signals at the two points; a mask filter
estimation unit 130, which receives the output signal of the filter processing unit 120 and the
signals XL and XR from the reception units 111 and 112 and estimates time-frequency filter
coefficients that emphasize the target sound source; and a mask filter 140, which multiplies the
signals XL and XR from the reception units by the signal G from the estimation unit 130 to reduce
non-target sound and emphasize the target sound while preserving its spatial sound information.
[Selected figure] Figure 1
Sound source separation and emphasis system
[0001]
The present invention relates to a technology that, in an environment where a plurality of sounds
arrive from different directions, selectively separates and emphasizes a target sound by using a
plurality of reception units (for example, microphones), and outputs a 2-output (2-channel)
acoustic signal that preserves the spatial sound information of the separated sound.
[0002]
10-04-2019
1
In general, humans can converse and understand speech even in noisy environments. This is because
the characteristics of the acoustic signals arriving at the left and right ears change depending
on the position of the sound source, and humans can detect this change. This is commonly known as
the cocktail party effect. Selective binaural listening algorithms (cocktail-party effect
algorithms), which selectively pick up a particular sound in an environment where multiple sounds
arrive from different directions, have been studied extensively from the viewpoint of realizing
the human auditory mechanism.
[0003]
In Patent Document 1, a frequency domain binaural model (FDBM: Frequency Domain Binaural Model) is
used. Acoustic signals generated from a plurality of sound sources are input from the left and
right reception units. The interaural phase difference (IPD) is determined from the cross spectrum
of the left and right input signals, and the interaural level difference (ILD) is determined from
the level difference between the power spectra of the left and right input signals. By comparing
the IPD or ILD obtained for each frequency band with those stored in a database, sound source
direction candidates are determined for each frequency band, and the directions with a high
appearance frequency among the candidates obtained across the frequency bands are taken as the
estimated sound source directions. In this way, Patent Document 1 proposes a method of estimating
a plurality of sound source directions that exist two-dimensionally (left, right, up and down) in
an environment in which a plurality of sounds are generated.
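As a rough illustration of the two cues mentioned above, the following sketch computes a per-bin IPD from the cross spectrum and a per-bin ILD from the power spectra of one frame. The function name `ipd_ild`, the Hann window, the frame length of 512 samples, and the small constant `eps` are assumptions made for this example; Patent Document 1 may define its frequency bands and cues differently.

```python
import numpy as np

def ipd_ild(x_left, x_right, n_fft=512):
    """Return per-bin IPD (radians) and ILD (dB) for one frame (illustrative only)."""
    window = np.hanning(n_fft)                  # assumed analysis window
    XL = np.fft.rfft(x_left[:n_fft] * window)   # left spectrum
    XR = np.fft.rfft(x_right[:n_fft] * window)  # right spectrum
    eps = 1e-12                                 # avoids division by zero
    cross = XL * np.conj(XR)                    # cross spectrum
    ipd = np.angle(cross)                       # phase of the cross spectrum
    ild = 10.0 * np.log10((np.abs(XL) ** 2 + eps) / (np.abs(XR) ** 2 + eps))
    return ipd, ild
```

For a right channel that is simply the left channel at half amplitude, this gives an IPD of zero and an ILD of about 6 dB in every bin.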
[0004]
In Patent Document 2, the frequency transform of Patent Document 1 is replaced with a wavelet
transform, which matches the human auditory filter and reduces the database capacity and the
number of calculations to about 1/10 of those in Patent Document 1. As in Patent Document 1, a
method is proposed for estimating a plurality of sound source directions that exist
two-dimensionally (left, right, up and down) in an environment where a plurality of sounds are
generated.
[0005]
The method of Non-Patent Document 1 consists of two blocks: a block that uses an adaptive filter
to cancel the target signal and estimate the noise component, and a block that calculates a
binary mask. The adaptive filter cancels the target signal from the input signal to produce a
noise-only signal.
The ratio OIR (Output Input Ratio) of this noise-only signal to the input signal of one reference
microphone is then determined.
OIR is given by the following equation. Here, XL (ω, l) is the input signal on one side (left),
and N (ω, l) is the output signal of the adaptive filter. Based on the OIR value, a decision is
made according to the following equation to create a binary mask BM.
Patent Document 1: JP-A-2004-325284, "Method of estimating sound source direction, system
therefor, and method of separating plural sound sources, system therefor"
Patent Document 2: JP 2007-240605A, "Method of separating sound source using complex wavelet
transform, and source separation system"
Non-Patent Document 1: N. Roman et al., "Binaural segregation in multisource reverberant
environments," J. Acoust. Soc. Am., 120, 6, 2006.
[0006]
The methods of Patent Document 1 and Patent Document 2 have low separation ability, and a large
distortion remains in the separated sound source. In Non-Patent Document 1, the sound source
output is a single channel, so the spatial sound information carried by a stereo output cannot be
used. Furthermore, since the sound source separation filter switches steeply between 1 and 0 in
the time-frequency domain, the error is presumably large at the boundaries between time-frequency
regions in which the target sound source is present and those in which it is not.
[0007]
In order to solve the above problems, the present invention focuses on a two-output system among
systems with a plurality of sound source outputs, and aims to obtain a high-quality two-output
signal in which the spatial sound information of the target sound is preserved.
[0008]
In order to achieve the above object, the present invention is a sound source separation and
emphasis system comprising: two reception units that receive signals from a plurality of sound
sources and output two channels of received signals; a filter unit that takes one of the two
channels of received signals as input and subtracts its output signal from the other received
signal, thereby removing the target signal component from the received signal and leaving a noise
signal other than the target; a mask filter coefficient estimation unit that estimates, for each
time-frequency region, the proportion of the target sound in the received signal from the ratio
between the noise signal extracted by the filter unit and the two channels of received signals,
and uses the estimated proportion to estimate time-frequency mask filter coefficients for removing
unnecessary components from the received signals; and a mask filter unit that receives the
estimated mask filter coefficients and the two channels of received signals and outputs two
channels of output signals. The system extracts the target sound while maintaining the spatial
information of the sound source.
The filter unit may perform adaptive filter processing with one of the two channels of received
signals as input, and may select one of two configurations for subtracting the output signal from
the other received signal. The two reception units that output the two channels of received
signals may be selected from m reception units (m ≥ 2, m an integer). In that case, the filter
unit performs adaptive filtering with one of the two channels of received signals from the
selected reception units as input, and may select one of the configurations for subtracting the
output signal from the other of the two channels. The reception units may be microphones installed
at different positions relative to the plurality of sound sources, and the system may be applied
to a hearing aid, for example.
[0009]
In the sound source separation and emphasis system of the present application, an acoustic signal
that preserves spatial sound information can be obtained from the two output channels. Further,
by using a 2-channel signal selected from the received signals at m points (m ≥ 2, m an integer),
the target sound can be extracted with high accuracy while the spatial information of the sound
source is effectively preserved. This is because, depending on the arrangement of the target sound
source and the noise sources, the accuracy with which only the target signal component is
eliminated may deteriorate, and in that case the result of combining signals from other points
can be used instead.
[0010]
Next, a multiple-input, two-output sound source separation and emphasis system according to an
embodiment of the present invention will be described with reference to the drawings. Note that
the present invention is not limited by the following embodiments. Hereinafter, each unit may
process an analog signal directly, or the signal may be digitized and each unit realized by
digital processing. When digital processing is performed, the processing may be realized by a
program executed by a processor.
[0011]
FIG. 1 shows the configuration of a two-output sound source separation and emphasis system
according to an embodiment of the present invention. FIG. 1 shows an example with inputs from two
points. As shown in FIG. 1, the system has: reception units 111 and 112, which record sound
signals generated from a number of sound sources at two different points; a filter processing
unit 120, which receives the received signals XR and XL from the reception units 111 and 112 and
processes them so that the signal of the target sound source is not included in the difference
signal between the received signals XL and XR at the two points; a mask filter estimation unit
130, which receives the output signal of the filter processing unit 120 and the signals XL and XR
from the reception units 111 and 112 and estimates time-frequency filter coefficients that
emphasize the target sound source; and a mask filter 140, which multiplies the signals XL and XR
from the reception units by the signal G from the mask filter estimation unit 130 to reduce
non-target sound and emphasize the target sound while preserving its spatial sound information.
[0012]
The reception units 111 and 112 record acoustic signals generated from a plurality of sound
sources at two points, for example by means of two microphones. The recorded two-channel acoustic
signals are each converted into electrical signal data and passed to the filter processing unit
120. At this time, the electrical signals may be converted to digital signals by analog/digital
conversion.
[0013]
The filter processing unit 120 processes one of the signals XL and XR passed from the reception
units 111 and 112 (XR in FIG. 1) with the adaptive filter 121, and takes the difference between
the filter output signal and the signal from the other point (XL in FIG. 1). The adaptive filter
is controlled so that the target sound source component contained in this difference signal
becomes small, that is, so that the difference signal consists mainly of the noise signal N. The
output signal N of the filter processing unit 120 is passed to the mask filter estimation unit
130.
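The structure of the filter processing unit can be illustrated with a short sketch. The text only says "adaptive filter" and does not name an algorithm or say how adaptation is controlled when noise is present, so the time-domain NLMS update, the tap count, the step size `mu`, and the function name `nlms_cancel` below are all assumptions made for illustration. The filter is fed one channel and adapts so that its output predicts the other channel; whatever it cannot predict remains in the residual, which plays the role of N.

```python
import numpy as np

def nlms_cancel(x_in, x_other, n_taps=16, mu=0.5, eps=1e-8):
    """Adaptively filter x_in to predict x_other; return (residual, weights).

    Illustrative NLMS sketch, not the patent's specified algorithm.
    """
    w = np.zeros(n_taps)            # adaptive filter coefficients
    buf = np.zeros(n_taps)          # most recent input samples, newest first
    residual = np.zeros(len(x_other))
    for n in range(len(x_other)):
        buf = np.roll(buf, 1)
        buf[0] = x_in[n]
        y = w @ buf                 # filter output
        e = x_other[n] - y          # difference signal (candidate for N)
        w += mu * e * buf / (buf @ buf + eps)  # normalized LMS update
        residual[n] = e
    return residual, w
```

If the two channels contain only the target, related by a short delay and a gain, the residual shrinks toward zero as the filter converges, which is the "target removed" behaviour described above.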
[0014]
The mask filter estimation unit 130 obtains the ratio between the signal data XL and XR from the
reception units 111 and 112 and the output signal N of the filter processing unit 120, performs
non-linear determination processing, and estimates the mask filter coefficients. The estimation
process is as follows. Let the received signals be XL (ω, l) and XR (ω, l) (ω is the frequency
band number and l is the time frame number), and let N (ω, l) be the difference signal between
the output of the adaptive filter in the filter processing unit 120 and the signal from the point
other than the filter input. R (ω, l) is then defined as the discrimination index.
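The defining equation for R (ω, l) is not reproduced in this text. The sketch below shows one form that is consistent with the surrounding description, namely the magnitude of N divided by the average magnitude of the two received signals. This exact form (magnitudes rather than powers, an arithmetic mean, the constant `eps`) and the function name `discrimination_index` are assumptions, not the patent's stated equation (1).

```python
import numpy as np

def discrimination_index(XL, XR, N, eps=1e-12):
    """Assumed form of R: |N| relative to the mean input magnitude, per bin."""
    mean_input = 0.5 * (np.abs(XL) + np.abs(XR))  # average of the two inputs
    return np.abs(N) / (mean_input + eps)
```

With this form, a bin that contains only noise gives R close to 1, and a bin dominated by the target gives R much smaller than 1.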
[0015]
From equation (1), R (ω, l) ≈ 1 means that the average value of the input signals and the noise
component are almost equal. That is, in this case the input signal is considered to consist of
noise components and not to include the target signal. Conversely, if R (ω, l) << 1, the average
value of the input signals is larger than the noise component, and the input signal can be
considered to contain a large amount of the target signal. From this, the mask filter coefficient
G (ω, l) can be estimated based on the value of R (ω, l), for example, according to the following
equation. Here, XL (ω, l) and XR (ω, l) are the signals from the two reception units 111 and 112.
[0016]
Various functions of the magnitudes of XL (ω, l), XR (ω, l) and N (ω, l) may be used as
appropriate for the mask filter coefficient G, and G is not limited to the above equation. For
example, one good realization is to estimate G using a threshold together with equation (2).
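Since equation (2) and the threshold are not reproduced in this text, the following is only a sketch of one mapping that matches the qualitative description: G near 1 where R is small (target dominant) and G near 0 where R is near or above 1 (noise dominant). The linear form `1 - R`, the optional `threshold`, the `floor` parameter, and the function name `mask_from_index` are all assumptions.

```python
import numpy as np

def mask_from_index(R, threshold=None, floor=0.0):
    """Map the discrimination index R to a mask gain in [floor, 1] (assumed form)."""
    R = np.asarray(R, dtype=float)
    soft = np.clip(1.0 - R, floor, 1.0)      # smooth gain instead of a hard 0/1 step
    if threshold is None:
        return soft
    # Optional hard decision: bins judged noise-dominated are set to the floor.
    return np.where(R < threshold, soft, floor)
```

A smooth gain of this kind avoids the abrupt 0/1 transitions that the text criticizes in the binary mask of Non-Patent Document 1.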
[0017]
The mask filter unit 140 multiplies each of the input signals XL (ω, l) and XR (ω, l) by the
output G (ω, l) of the mask filter estimation unit 130 to generate the output signals ŜL (ω, l)
and ŜR (ω, l). In the system of the present invention, only the amplitude of the signal is
changed, and the same filter is used for the left and right channels, so ITD and ILD are
preserved in the output signal. The target signal can therefore be emphasized while its spatial
information is preserved, and the listener's selective binaural hearing ability can still be used
on the processed signal. This is effective, for example, in applications that involve listening
with both ears, such as hearing aids and listening to music over headphones.
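The preservation of the interaural cues follows directly from multiplying both channels by the same real, positive gain, as the short sketch below illustrates. The function names `apply_mask` and `interaural_cues` are hypothetical and chosen only for this example.

```python
import numpy as np

def apply_mask(XL, XR, G):
    """Apply the same real-valued gain G to both channels, bin by bin."""
    return G * XL, G * XR

def interaural_cues(XL, XR):
    """Return (ILD in dB, IPD in radians) per bin."""
    ild = 20.0 * np.log10(np.abs(XL) / np.abs(XR))
    ipd = np.angle(XL * np.conj(XR))
    return ild, ipd
```

Because G cancels in the ratio between the two output channels, the level difference and the phase difference (and hence the time difference) in every bin with G > 0 are the same before and after masking.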
[0018]
Another Embodiment 1
FIG. 2 shows another embodiment of the filter processing unit 120 of the embodiment of FIG. 1.
The configuration shown in FIG. 2 (a) is the same as that of the filter processing unit 120 shown
in FIG. 1: the XR signal is processed by the adaptive filter 121, and the filter is adaptively
controlled so that the target sound source component included in the difference signal between
the filter output signal and the XL signal is reduced (the noise signal N is increased). In the
configuration shown in FIG. 2 (b), contrary to FIG. 2 (a), the XL signal is processed by the
adaptive filter 123, and the filter is adaptively controlled so that the target sound source
component included in the difference signal between the filter output signal and the XR signal
becomes smaller (the noise signal N becomes larger). In this embodiment, both configurations
shown in FIG. 2 (a) and FIG. 2 (b) are prepared. For example, when the target sound source is on
the right side and is to be emphasized, the configuration of FIG. 2 (a) is selected, and when the
target sound source is on the left side and is to be emphasized, the configuration of FIG. 2 (b)
is selected. A switching circuit (not shown) is provided to select between the two
configurations, and the configuration of the filter processing unit 120 is changed according to
the position of the target sound source. In this way, the optimal configuration of the filter
processing unit can be selected according to the position of the target sound source.
[0019]
Another Embodiment 2
The configuration shown in FIG. 2 is for the case where two reception units, such as microphones,
are provided. However, the number of reception units is not limited to two. It is also possible
to provide m (m ≥ 2, m an integer) reception units installed at different positions, together
with adaptive filters connected to them. In this case, a switching circuit is provided that can
select two of the m reception units according to the position of the target sound source and
connect the adaptive filter to the two selected reception units. As in FIG. 2, the difference is
taken between the output signal of the selected adaptive filter and the input signal from the
point other than the adaptive filter input, and the filter is adaptively controlled so that the
target sound source component contained in the difference signal becomes smaller (the noise
signal N becomes larger).
The two reception units may be selected so as to improve the accuracy with which only the signal
component of the target sound source is eliminated by the adaptive filter. In this configuration,
a pair of signals is taken out from the multiple inputs (m inputs), the target source signal is
first eliminated using an adaptive filter, and the remaining noise components are estimated. From
the ratio of the one or more noise components estimated in this way to the input signal, the
multiplication coefficient is set close to 1 for bins on the time-frequency plane (a certain time
and a certain frequency subrange, hereinafter referred to as frequency bins) in which the
proportion of the target signal is large, and close to 0 for bins in which the proportion of the
target signal is small. The target signal can thereby be extracted.
[0020]
Next, an example simulation is shown. In the simulation using the sound source separation and
emphasis system having the configuration shown in FIG. 1, the sound source and noise signals were
convolved with the head-related transfer function (HRTF) of the KEMAR dummy head distributed by
MIT. In Non-Patent Document 1, the microphone with the better SN ratio is used as a reference and
its signal is used as the denominator. For this reason, the accuracy varies depending on the
position of the noise component and deteriorates when the SN ratio at the reference microphone
becomes worse. In the system of the present invention shown in FIG. 1, on the other hand, this
problem is solved because the ratio to the average value of the input signals is used, as shown
in equation (1). This is shown in FIG. 3. Comparing Roman's System in the graph of FIG. 3 with
the system of the present invention (Proposed System), Roman's System is unbalanced between left
and right around 0 degrees, whereas the Proposed System is almost evenly balanced between left
and right.
Simulations were performed under various conditions. The distance between the loudspeaker and the
receiver was 1.4 m, and all sound sources were at an elevation angle of 0°. The horizontal angle
was 0° in the front direction, positive on the right side and negative on the left side, and SN
ratios of −5, 0, 5 and 10 dB were used. The SN ratio was always evaluated on one channel. When
multiple noise components were present, they were all added together, and the SN ratio was
determined at the left ear. The target signal was a sentence from the NTT-AT phoneme-balanced
1000-sentence wideband speech database, rendered here as "I cannot but be embarrassed when I
remember the mysterious impressions and surprises." The talker is MIY (male). A single target
signal was placed at a horizontal angle of 0°. For the position and type of the noise sources,
the three conditions shown in FIG. 4 were set and the simulation was performed.
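The test-signal construction described above can be sketched as follows. The helper names `spatialize` and `mix_at_snr`, and the idea of passing a left/right impulse-response pair as plain arrays, are assumptions for illustration; loading the actual KEMAR responses is outside the scope of this sketch. Following the text, the SN ratio is set on the left channel only, and several noise sources would be summed before scaling.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right impulse-response pair."""
    return np.stack([np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)])

def mix_at_snr(target_2ch, noise_2ch, snr_db):
    """Scale the noise so that the left-channel SNR equals snr_db, then mix."""
    p_target = np.mean(target_2ch[0] ** 2)   # left-ear target power
    p_noise = np.mean(noise_2ch[0] ** 2)     # left-ear noise power
    gain = np.sqrt(p_target / (p_noise * 10.0 ** (snr_db / 10.0)))
    return target_2ch + gain * noise_2ch, gain
```

The same gain is applied to both noise channels so that the interaural cues of the noise source are not altered by the scaling.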
[0021]
Segmental SNR and log-spectral distance (LSD) were used as the evaluation indices. The higher the
value of Segmental SNR, the better the signal-to-noise ratio, and the smaller the value of LSD,
the smaller the distortion of the sound source. These calculation formulas are shown below. Here,
s (·) is the target signal, ŝ (·) is the audio signal after processing, L is the total number of
frames, and K is the number of samples in one frame. δ is a parameter set so that the dynamic
range of the logarithmic spectrum is within about 50 dB.
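The formulas themselves are not reproduced in this text, so the sketch below uses commonly used definitions of the two measures. Treat the exact form as an assumption: non-overlapping frames of K samples, a per-frame SNR averaged in dB, and a log spectrum clipped from below at δ, taken here as 50 dB under the peak of the target spectrum. The function names and the constant `eps` are also illustrative.

```python
import numpy as np

def frames(x, K):
    """Split x into L non-overlapping frames of K samples (the tail is dropped)."""
    L = len(x) // K
    return x[: L * K].reshape(L, K)

def segmental_snr(s, s_hat, K=256, eps=1e-12):
    """Mean over frames of the per-frame SNR in dB (assumed definition)."""
    S, E = frames(s, K), frames(s - s_hat, K)
    ratio = (np.sum(S ** 2, axis=1) + eps) / (np.sum(E ** 2, axis=1) + eps)
    return float(np.mean(10.0 * np.log10(ratio)))

def log_spectral_distance(s, s_hat, K=256, dyn_range_db=50.0, eps=1e-12):
    """Mean over frames of the RMS difference between clipped log spectra, in dB."""
    S = np.abs(np.fft.rfft(frames(s, K), axis=1))
    S_hat = np.abs(np.fft.rfft(frames(s_hat, K), axis=1))
    delta = 20.0 * np.log10(S.max() + eps) - dyn_range_db  # floor of the log spectrum
    LS = np.maximum(20.0 * np.log10(S + eps), delta)
    LS_hat = np.maximum(20.0 * np.log10(S_hat + eps), delta)
    return float(np.mean(np.sqrt(np.mean((LS - LS_hat) ** 2, axis=1))))
```

With these definitions, a processed signal identical to the target gives an LSD of exactly zero, and a processed signal equal to 0.9 times the target gives a Segmental SNR of 20 dB.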
[0022]
Simulation experiments were conducted, and the results for Segmental SNR and LSD under each
condition are shown in FIG. 5. The LSD, which indicates distortion, is at the same level as in
Non-Patent Document 1, while the Segmental SNR, which indicates separation performance, is
improved by about 1 to 2 dB. This demonstrates the advantage of the present invention.
[0023]
In addition, simulation experiments confirmed that the output signal obtained with the present
invention retains the spatial sound information. FIG. 6 shows the true 2ch level difference of
the target sound and the 2ch level difference of the separated sound. In the sections where the
target sound exists, the level difference of the 2ch signal after the target sound source has
been separated and extracted is equal to the true level difference of the target sound obtained
from the 2ch input without noise. This shows the advantage of the present invention.
[0024]
According to the present invention, compared with the conventional method, there are the
following advantages: (1) it is not necessary to determine a reference microphone, so a certain
accuracy can be expected regardless of the position of the noise component; (2) both outputs can
be used, one for each ear; (3) the ITD and ILD are preserved because the same filter is applied
to the left and right input signals, so the target sound source can be emphasized while the
binaural spatial information is retained. Therefore, as shown in FIG. 7, if the sound source
separation and emphasis system of the present invention is applied to a hearing aid, hearing
assistance for hearing-impaired listeners is expected to improve, because the listener's own
selective binaural hearing ability can be used in addition to the sound source emphasis.
[0025]
FIG. 1 is a block diagram showing the configuration of a sound source separation and emphasis
system according to an embodiment of the invention.
FIG. 2 is a diagram showing other configuration examples of the filter processing unit in the
sound source separation and emphasis system of FIG. 1.
FIG. 3 is a diagram showing the separation performance of the present invention (Proposed System)
and Non-Patent Document 1 (Roman's System) when the noise direction is changed, for a target
sound from a male speaker at 0 degrees and white noise as the noise.
FIG. 4 is a diagram showing the sound source direction conditions in the simulation evaluating
separation of the target sound in environments with several noises from various directions.
FIG. 5 shows the performance evaluation results of the present invention (Proposed System) and
Non-Patent Document 1 (Roman's System) in the simulation evaluating separation of the target
sound in environments with several noises from various directions.
FIG. 6 shows the true 2ch level difference of the target sound and the 2ch level difference of
the separated sound, indicating that the separated output signal according to the present
invention preserves the spatial sound information of the target sound.
FIG. 7 is a diagram showing the configuration when this system is applied to a hearing aid.