DESCRIPTION JPH11168791

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
[0001]
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an audio source detection method and apparatus comprising
microphone means for receiving an audio signal and detection means for detecting audio in the
received audio signal.
[0002]
Background of the Invention
Telephone conversations are often disturbed by echo. This applies in particular to full-duplex
telephones, which have four different states: idle, near-end speech, far-end speech, and double
talk. Echo usually occurs when speech arrives from the far end: the received far-end signal is
reproduced by the loudspeaker and returns through the microphone to the far end. The echo problem
arises in particular in hands-free operation, where the loudspeaker reproduces sound at high
volume into the surroundings, so that the sound from the loudspeaker easily returns to the
microphone.
[0003]
Adaptive signal processing is employed to remove the echo. In hands-free mobile phone
applications, known echo cancellers and echo suppressors can be used to effectively remove
acoustic feedback, i.e. the acoustic echo coupled from the loudspeaker to the microphone. The
echo canceller can be implemented using an adaptive digital filter that suppresses the echo
signal in the outgoing signal, that is, the signal that originates from the far end, whenever a
far-end signal is present on the receive side. In this way, the far-end signal is prevented from
returning to the far end. The parameters of the adaptive filter are usually updated whenever
far-end speech occurs, so that the prevailing conditions are taken into account as accurately as
possible. An echo suppressor, in turn, is used to attenuate the transmitted near-end signal.
[0004]
The situation where near-end speech and far-end speech occur simultaneously is called a
double-talk situation. During double talk, the echo canceller cannot effectively remove the echo
signal. The reason is that the echo signal is summed with the near-end signal to be transmitted,
in which case the echo canceller cannot form an accurate model of the echo signal to be removed.
In such a case, the echo canceller's adaptive filter cannot properly adapt to the acoustic
response of the space between the loudspeaker and the microphone, and thus, if the near-end
speech signal is present, the echo cannot be removed from the transmitted signal. Therefore,
double-talk detectors are often used to eliminate the disturbing effect of double talk on the
echo canceller. Normally, a double-talk situation is detected by detecting whether near-end
speech is present at the same time as far-end speech. During double talk, the parameters of the
echo canceller's adaptive filter are not updated; that is, updating of the adaptive filter must
be interrupted while the near-end talker is speaking. The echo suppressor also needs information
on the near-end talker's speech activity so that the signal transmitted while the near-end talker
is speaking is not excessively attenuated.
[0005]
In addition to echo cancellation and suppression, the interruptible transmission used in GSM
mobile phones requires information on near-end speech activity. The concept of interruptible
transmission is to transmit the speech signal only during speech activity; that is, to save
power, the near-end signal is not transmitted while the near-end talker is silent. In order to
avoid excessive fluctuations in the background noise level due to interruptible transmission,
some kind of comfort noise can be transmitted during the idle periods while still saving bits in
the transmission. To that end, near-end speech activity must be detected accurately, quickly and
reliably so that the interruptible transmission of GSM does not degrade the quality of the
transmitted speech.
[0006]
FIG. 1 shows a conventionally known arrangement 1 for echo cancellation and double-talk
detection. The near-end signal 3 arrives from the microphone 2, and speech in it is detected
using the near-end speech activity detector 4, a VAD (voice activity detector). The far-end
signal 5 arrives from the input connection I (which may be the input connector of a hands-free
device, the wire connector of a fixed telephone, or the path from the antenna to the receiving
branch of a mobile telephone), speech in it is detected in the far-end speech activity detector 6
(VAD), and it is finally reproduced by the loudspeaker 7. The near-end signal 3 and the far-end
signal 5 are both fed to a double-talk detector 8 for detecting double talk and to an adaptive
filter 9 for adapting to the acoustic response of the echo path 13. The adaptive filter 9 also
receives the output of the double-talk detector 8 as an input, since the filter is not adapted
(its parameters are not updated) during double talk. In order to perform echo cancellation, the
model 10 formed by the adaptive filter is subtracted from the near-end signal 3 in an
adder/subtractor 11. The echo canceller output signal 12, from which (part of) the echo has been
canceled, is fed to the output connection O (which may be the output connector of a hands-free
device, the wire connector of a fixed telephone, or the path from the transmitting branch to the
antenna of a mobile telephone). The echo canceller shown in FIG. 1 may be integrated into the
phone (which then comprises, e.g., a loudspeaker and a microphone for hands-free calls) or may be
implemented in a separate hands-free device.
[0007]
Several methods for detecting double talk have been proposed. However, many of them are quite
simple and some are unreliable. Most double talk detectors are based on the power ratio between
the loudspeaker signal and / or the microphone signal and / or the signal after the echo
canceller. The advantages of these detectors are simplicity and speed, and their disadvantage is
their unreliability.
[0008]
Also known are detectors based on the correlation between the loudspeaker signal and/or the
microphone signal and/or the signal after the echo canceller. These detectors are based on the
idea that the loudspeaker signal and a microphone signal containing only echo (or the signal
after the echo canceller) are strongly correlated, but that the correlation is reduced when a
near-end signal is added to the microphone signal. The disadvantages of these detectors are their
slow detection speed, the (possibly incorrect) assumption that the near-end and far-end signals
are uncorrelated, and the changes that the echo path causes in the loudspeaker signal, as a
result of which the correlation is degraded even when there is no near-end signal.
[0009]
Also known is a double-talk detector based on comparing autocorrelations of the same signal, in
which case the detector can recognize speech in the near-end signal and thus detect the presence
of the near-end signal. Although such detectors require little computing power, they suffer from
the same problems as described above because they, too, are based on correlation.
[0010]
In the article by Kuo S.M. and Pan Z., "Microphone system for canceling acoustic echoes for
large-scale video conferencing", ICSPAT '94 Proceedings, 1994, pp. 7-12, two microphones directed
in opposite directions are used to eliminate noise and acoustic echoes and to recognize the
different speech situations mentioned at the outset. However, that method does not offer any
particular improvement in the recognition of double talk, which is implemented solely on the
basis of the output power of the echo canceller.
[0011]
In the article by Affes S. and Grenier Y., "Source subspace tracking array of
microphones for double-talk situations", ICSPAT '96 Bulletin, Volume 2, 1996, pp. 909-912,
echo and background noise cancellers with a microphone-vector structure are proposed. The
proposed echo canceller filters out the signals arriving from a spatially selected direction
while keeping the signals arriving from the desired direction. This echo canceller can also
operate during double-talk situations. However, the document presents neither near-end speech
activity detection nor double-talk detection using a multi-microphone solution (also called a
microphone vector).
[0012]
SUMMARY OF THE INVENTION A method and apparatus for detecting near-end speech activity and
recognizing a double-talk situation have now been invented.
[0013]
SUMMARY OF THE INVENTION The present invention is based on the concept of detecting a
near-end speech signal based on the direction from which it arrives.
In hands-free applications, where the loudspeaker signal arrives from a direction distinct from
that of the near-end talker's speech signal, the near-end speech signal can be distinguished from
the loudspeaker signal based on its angle of arrival. In the present invention, detection is
performed using several microphones (a microphone vector) that pick up sound from different
directions and/or different points.
[0014]
The outputs of the microphone vector are first band-pass filtered into narrowband signals, and
the angle of arrival is estimated from the signal matrix formed by the filtered signals. The
estimation yields a spatial spectrum, from which the directions of arrival are determined based
on the peaks that occur in the spectrum. The estimated arrival direction of the near-end speech
signal and the estimated arrival direction of the loudspeaker signal are updated based on the
arrival directions found. These estimates of the direction of arrival make the final VAD decision
easier. When the direction-of-arrival estimator detects a sufficiently strong spectral peak in a
direction sufficiently close to the estimated direction of arrival of the near-end speech signal,
the near-end talker is considered to be speaking; that is, near-end speech activity is detected.
[0015]
The determination of double talk requires information on far-end speech activity in addition to
near-end speech activity. This information can be obtained using known voice activity detectors,
for example power-level-based voice activity detectors.
[0016]
The device according to the invention comprises means for determining the direction of arrival of
the received signal, means for storing the estimated direction of arrival of the speech of a
particular source, means for comparing the direction of arrival of the received signal with the
estimated direction of arrival, and means for indicating that the speech originates from the
particular source if, according to the comparison, the direction of arrival of the received
signal agrees with the estimated direction of arrival within a tolerance.
[0017]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention will now be
described in detail with reference to the drawings.
[0018]
FIG. 2 shows a block diagram of a detector according to the invention for detecting near-end
speech activity and recognizing double talk.
In the present invention, several microphones 2a, 2b, ..., 2M are used, and the microphones are
preferably connected as a so-called microphone vector 2.
The vector has at least two microphones, preferably three, four or more. Each microphone produces
a single signal 3a, 3b, ..., 3M, so that when M microphones (M is an integer) are used, M
time-varying signals are obtained, which form one signal vector with M time-varying elements.
[0019]
The outputs 3a, 3b, ..., 3M are first band-pass filtered in band-pass filter 14 to produce
narrowband signals 19a, 19b, ..., 19M. Band-pass filtering is performed for the estimation of the
direction angle because accurate super-resolution spectrum estimation methods only work on
narrowband signals. Band-pass filtering may be implemented, for example, using a fast Fourier
transform (FFT), windowing and interleaving. The frequency range of the band-pass filter is
determined based on the distance between the microphones in the microphone vector. According to
the Nyquist sampling theorem, the spatial sampling frequency must be at least twice the spatial
frequency of the signal, so the following is obtained as the band frequency (point frequency) of
the band-pass filter 14: f = C / (2d), where C is the speed of sound in air (343 m/s at 20 °C)
and d is the distance between the microphones.
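As a small worked example of the relation f = C / (2d) given above, the following Python sketch computes the band frequency for a given microphone spacing. The function name and the 5 cm spacing are illustrative assumptions and are not taken from the description.

```python
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air at 20 °C, as stated in the text


def band_frequency_hz(mic_spacing_m, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Return the band frequency f = C / (2 d) for microphone spacing d in metres."""
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    return speed_of_sound / (2.0 * mic_spacing_m)


# Hypothetical spacing of 5 cm between microphones: roughly 3430 Hz.
example_frequency = band_frequency_hz(0.05)
print(round(example_frequency))
```

A larger spacing lowers the usable band frequency; doubling d to 10 cm halves f to about 1715 Hz.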
[0020]
The directions of arrival are estimated from the filtered signals 19a, 19b, ..., 19M in the
estimation unit 15 using some known estimation method, for example MUSIC (Multiple Signal
Classification).
[0021]
This estimation method yields a spatial spectrum, from which the direction of arrival of the
signal is determined on the basis of the peaks occurring in the spectrum.
FIG. 3 shows an example of the spatial spectrum of such a microphone vector signal. The
direction of arrival can be determined from the spectrum shown in FIG. 3, for example by
examining the derivative of the spectral curve. The zero points of the derivative at which the
derivative changes from positive to negative, which, as is known, mark the peak positions of the
curve, are returned as arrival directions. Thus, in FIG. 3, two signals arrive at the microphone
vector: one from the direction of 10° and one from the direction of 40°. Furthermore, it may be
required that a spectral peak considered to be a direction of arrival has a certain minimum
amplitude (e.g. 5 dB). In the drawing, the coverage of the spectrum is shown as 90°; in practice,
directions can be detected in the range of ±90°. The calculation of the derivative and the check
of whether the minimum amplitude condition is fulfilled are preferably carried out in software
using a digital signal processor. The estimation unit 15 gives the arrival directions 16 of the
signals as its output.
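The peak search described above can be sketched as follows in Python. This is only an illustration under stated assumptions: the spectrum is taken to be sampled on a regular angle grid, the derivative is approximated by first differences, and the 5 dB minimum is applied to the absolute level of the peak. The function name and the sample values are hypothetical.

```python
def find_arrival_directions(angles_deg, spectrum_db, min_peak_db=5.0):
    """Return (angle, level) pairs where the spectrum has a sufficiently strong peak.

    A peak is a sample where the discrete derivative changes from positive
    to non-positive and whose level is at least min_peak_db (an assumption
    about how the minimum amplitude is measured).
    """
    peaks = []
    for i in range(1, len(spectrum_db) - 1):
        slope_before = spectrum_db[i] - spectrum_db[i - 1]
        slope_after = spectrum_db[i + 1] - spectrum_db[i]
        if slope_before > 0 and slope_after <= 0 and spectrum_db[i] >= min_peak_db:
            peaks.append((angles_deg[i], spectrum_db[i]))
    return peaks


# Made-up spectrum with strong peaks at 10° and 40° and a weak one at 80°.
angles = list(range(0, 91, 10))
spectrum = [1.0, 12.0, 4.0, 2.0, 9.0, 3.0, 1.0, 0.5, 4.0, 0.2]
print(find_arrival_directions(angles, spectrum))  # [(10, 12.0), (40, 9.0)]
```

The weak peak at 80° is rejected by the minimum amplitude condition, which is the purpose of that check.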
[0022]
The estimated arrival directions of the near-end speech signal 3 and the loudspeaker signal 5 are
updated in block 17 based on the arrival directions obtained. The estimated directions of arrival
are formed by averaging the directions of arrival obtained from the spectral peaks. If it is
approximately known from which direction the signals arrive, the effect of erroneous peaks
occurring in the spatial spectrum can be minimized: an erroneous peak is disregarded unless it
occurs in an estimated arrival direction. FIG. 4 shows the arrangement of the microphones 2 and
the loudspeaker 7 of a conventional hands-free device in a car, the loudspeaker usually being in
front of the microphone vector 2 in a direction of 0° ± 40°. The position of the loudspeaker
relative to the microphone vector may, however, vary significantly. The microphones 2a, 2b, ...,
2M are arranged at specific distances from each other in specific directions. These distances and
directions must be known to the arrival direction estimation algorithm described below. A more
detailed description will now be given of the averaging of the far-end and near-end arrival
directions performed by the signal source position and direction determination block 17.
[0023]
The estimation of the far-end arrival direction is based on averaging the arrival angles 16
obtained from the spectrum estimation unit 15. Averaging is performed only when there is speech
at the far end, which is determined using the far-end VAD 6, whose output is fed to the decision
block 17. The averaging is preferably carried out in the time domain, for example using IIR
filtering. The basic assumption is that there are two signal sources arriving from different
directions, namely the near-end signal 3 and the far-end signal 5. Furthermore, it is assumed
that the directions of arrival change relatively slowly compared with the rate at which
observations are made. When the spectrum estimation unit 15 gives the arrival direction vector
doa (in degrees) as its output, the estimate fdoa (in degrees) of the far-end arrival direction
vector is updated by averaging in such a way that each new direction estimate affects the
component of fdoa closest to it. In the updating, weighting is applied such that the closer the
detected direction is to the relevant fdoa component, the more it updates that component. The
direction of the loudspeaker signal, and thus the directions of the reverberation signals it
induces in the spectrum, change only very slightly, in which case this weighting reduces the
effect of accidental, erroneous peaks in the spectrum. At the same time, the probability of
occurrence pdoa of the fdoa component in question is updated: the closer the new value is to the
relevant direction estimate, the more it is increased. In addition, the strength powdoa of the
relevant fdoa component is updated based on the power of the corresponding spectral peak. The
far-end arrival direction estimation vector fdoa thus contains the arrival angles of (M−1)
signals. The vector pdoa contains the probabilities of the corresponding directions of arrival in
the range [0, 1], and powdoa contains the corresponding normalized powers in the range [0, 1].
[0024]
Here, the arrival direction of the far-end signal 5 can be assumed to be the component of the
far-end arrival direction estimate fdoa that has the highest probability and power and is closest
to the most recently determined far-end arrival direction. Since the estimate is updated only
when there is speech at the far end, it can be assumed that the near-end signal 3 (in this case
double talk) is present less than 50% of the time. In other words, the basic assumption is that
double talk occurs during less than half of the far-end speech activity time. The arrival
direction of the direct far-end signal (the loudspeaker direction) can be separated from the
arrival directions of the reflected loudspeaker signals based on the power of the spectral peak
corresponding to each arrival direction: the signal arriving directly at the microphone usually
produces a stronger peak in the spatial spectrum than a signal attenuated along a reverberation
path.
[0025]
The following is a description of the algorithm for estimating the arrival direction with
reference to FIG. At step 100, an initialization is performed as follows: fdoa, pdoa and powdoa
each contain (M−1) components, and doa contains L components (1 ≤ L ≤ M−1). The components of
fdoa are initialized with different values:
fdoa(n) = −90 + n * 180/M, 1 ≤ n ≤ M−1
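The initialization of step 100 can be illustrated with a short Python sketch. The function name is hypothetical, and starting pdoa and powdoa at zero is an assumption, since the description only specifies the initial values of fdoa.

```python
def init_estimates(num_mics):
    """Initialize fdoa, pdoa and powdoa, each with (M - 1) components."""
    if num_mics < 2:
        raise ValueError("at least two microphones are required")
    # fdoa(n) = -90 + n * 180 / M for n = 1 .. M-1 spreads the estimates over (-90°, 90°).
    fdoa = [-90.0 + n * 180.0 / num_mics for n in range(1, num_mics)]
    pdoa = [0.0] * (num_mics - 1)    # assumed initial probabilities
    powdoa = [0.0] * (num_mics - 1)  # assumed initial normalized powers
    return fdoa, pdoa, powdoa


fdoa, pdoa, powdoa = init_estimates(4)
print(fdoa)  # [-45.0, 0.0, 45.0]
```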
[0026]
Step 101: Match the estimates (the components of fdoa) to the detected arrival directions (doa)
as follows. Calculate the distance of each detected direction of arrival from each estimate.
Select the detected direction doa(i) and the estimate fdoa(n) that are closest to each other.
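The pairing in step 101 amounts to a search for the smallest distance over all combinations of detected directions and estimates. A minimal Python sketch, with a hypothetical function name and made-up angles, could look like this:

```python
def closest_pair(doa, fdoa):
    """Return (i, n, distance) for the detected direction doa[i] and the
    estimate fdoa[n] that are closest to each other, or None if either list is empty."""
    best = None
    for i, detected in enumerate(doa):
        for n, estimate in enumerate(fdoa):
            dist = abs(detected - estimate)
            if best is None or dist < best[2]:
                best = (i, n, dist)
    return best


# Two detected peaks matched against three estimates (values are illustrative).
print(closest_pair([12.0, 38.0], [-45.0, 0.0, 45.0]))  # (1, 2, 7.0)
```

Indices here are zero-based, whereas the description counts components from 1.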
[0027]
Step 102: Update the estimate fdoa(n) according to how close it is to the detected arrival
direction doa(i). The closer they are, the more the detected direction changes the estimate.
That is,
fdoa(n) = α0 * fdoa(n) + (1−α0) * doa(i),
where α0 is, for example, a linear or exponential function of the distance (for linear
dependency 5). By adjusting the lower and upper limits of the update coefficient α0 and of the
distance d, that is α0_min, α0_max and d_min, d_max, one can affect not only the speed of the
update but also the distance up to which a peak still affects the estimate. For example, when the
maximum distance is set to 40° (d_min = 0°, d_max = 40°) and the maximum value of the update
coefficient is set to 1 (α0_min = 0.99, α0_max = 1.0), erroneous spectral peaks further than 40°
from an estimate do not update it and thus do not introduce any error at all. In this way the
effects of the above-mentioned spurious signals on the estimate can be eliminated.
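The update of step 102 with a linear dependence of α0 on distance can be sketched in Python as follows. The function names are hypothetical, and the linear interpolation between (d_min, α0_min) and (d_max, α0_max) is one possible reading of the description; the default limits are the example values given in the text.

```python
def update_coefficient(dist_deg, d_min=0.0, d_max=40.0, a_min=0.99, a_max=1.0):
    """Linear update coefficient: a_min at d_min, rising to a_max at d_max and beyond."""
    if dist_deg <= d_min:
        return a_min
    if dist_deg >= d_max:
        return a_max
    fraction = (dist_deg - d_min) / (d_max - d_min)
    return a_min + fraction * (a_max - a_min)


def update_estimate(fdoa_n, doa_i):
    """fdoa(n) = a0 * fdoa(n) + (1 - a0) * doa(i), with a0 depending on the distance."""
    a0 = update_coefficient(abs(doa_i - fdoa_n))
    return a0 * fdoa_n + (1.0 - a0) * doa_i


print(update_estimate(0.0, 20.0))  # moves slightly toward 20°
print(update_estimate(0.0, 50.0))  # 0.0: a peak beyond 40° leaves the estimate unchanged
```

With α0_max = 1.0 the second term vanishes for distant peaks, which is how spurious peaks are kept from disturbing the estimate.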
[0028]
Step 103: Increase the probability of occurrence pdoa of the estimate according to how close the
detected arrival direction is to the estimate. In the following, the function of distance is
assumed to be linear; other functions, such as an exponential function, are also possible.
pdoa(n) = α1 * pdoa(n) + (1−α1) * (1 − dist/180),
where α1 is, for example, 0.9, and dist is the distance, in the range [0, 180], between the
observed value and the estimate.
[0029]
Step 104: Also update the power estimate powdoa with the power of the detected spectral peak as
follows.
powdoa(n) = α3 * powdoa(n) + (1−α3) * Pow / Powmax,
where α3 is, for example, 0.9, Pow is the power of the spectral peak, and Powmax is the maximum
power.
[0030]
Step 105: Determine whether further pairs of detected arrival directions and estimates can be
found; if so, repeat steps 101-104 for the remaining pairs. Step 106: Reduce the probability of
occurrence and the power of those estimates for which no direction of arrival was detected, for
example by setting dist = 180 and Pow = 0.
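The recursive updates of steps 103, 104 and 106 can be illustrated together in Python. The function names are hypothetical; the smoothing constants default to the example value 0.9 given in the description, and the decay for unmatched estimates simply reuses the two update formulas with dist = 180 and Pow = 0.

```python
def update_probability(pdoa_n, dist_deg, alpha1=0.9):
    """Step 103: pdoa(n) = a1 * pdoa(n) + (1 - a1) * (1 - dist / 180)."""
    return alpha1 * pdoa_n + (1.0 - alpha1) * (1.0 - dist_deg / 180.0)


def update_power(powdoa_n, peak_power, max_power, alpha3=0.9):
    """Step 104: powdoa(n) = a3 * powdoa(n) + (1 - a3) * Pow / Powmax."""
    return alpha3 * powdoa_n + (1.0 - alpha3) * peak_power / max_power


def decay_unmatched(pdoa_n, powdoa_n):
    """Step 106: estimates without a detected direction get dist = 180 and Pow = 0."""
    return update_probability(pdoa_n, 180.0), update_power(powdoa_n, 0.0, 1.0)


print(update_probability(0.5, 0.0))   # a perfect match raises the probability
print(decay_unmatched(0.5, 0.5))      # both values shrink by the factor 0.9
```

Both quantities stay within [0, 1] as long as they start there and Pow does not exceed Powmax.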
[0031]
Step 107: Select as the direction of the loudspeaker the estimate that has the highest
probability of occurrence and power and that is closest to the most recent estimate of the
loudspeaker direction, for example by maximizing the following expression:
a * pdoa(k) + b * powdoa(k) + c * distance(k), k = 1, ..., M−1, (1)
where a, b and c are weighting factors, for example 1/3, and distance(k) is the distance in
degrees between the estimate fdoa(k) and the previously estimated direction of the loudspeaker.
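The selection in step 107 can be sketched in Python as follows. One point is an assumption rather than something stated in the description: so that maximizing expression (1) favors estimates close to the previous loudspeaker direction, the distance term is converted here into a closeness value 1 − distance/180 in the range [0, 1]. The function name and sample values are hypothetical.

```python
def select_speaker_direction(fdoa, pdoa, powdoa, previous_dir, a=1/3, b=1/3, c=1/3):
    """Return the component of fdoa that maximizes the weighted score of expression (1)."""

    def score(k):
        # Assumption: distance enters as closeness, so nearer estimates score higher.
        closeness = 1.0 - abs(fdoa[k] - previous_dir) / 180.0
        return a * pdoa[k] + b * powdoa[k] + c * closeness

    best_k = max(range(len(fdoa)), key=score)
    return fdoa[best_k]


# Three tracked estimates; the middle one is probable, strong and near the old direction.
print(select_speaker_direction([-45.0, 5.0, 45.0], [0.2, 0.9, 0.4], [0.1, 0.8, 0.5], 0.0))  # 5.0
```

The weighting factors a, b and c control how strongly the previous decision is preferred over probability and power.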
[0032]
So far, the estimation of the arrival direction of the far-end speech signal has been described.
The estimation of the arrival direction of the near-end speech signal is described below. It
follows the procedure and algorithm described above, so that an estimate ndoa of the near-end
direction of arrival is obtained by replacing fdoa with ndoa in the above algorithm. The
estimation is performed when the far-end speech activity detector 6 indicates that no speech is
arriving from the far end. In that case the spectrum detected in the estimator 15 contains either
no peaks (direction-of-arrival angles) or 1 to (M−1) peaks corresponding to the direction of the
near-end signal and/or to spurious signals and reverberations. As described above, the direction
indicated by the most frequently recurring and strongest peak of the spatial spectrum is selected
as the direction of the near-end talker. Furthermore, the near-end talker can be assumed to sit
in a direction of about 0° ± 30° with respect to the microphone vector, in which case the initial
value of the near-end talker's direction estimate can be set to 0° and, in the selection of the
direction, the previously estimated direction can be strongly weighted.
[0033]
These estimated arrival directions fdoa, ndoa are fed to the detection block 18, which performs
the final detection. When the arrival direction estimation unit 15 detects a sufficiently strong
spectral peak in a direction sufficiently close to the estimated arrival direction of the
near-end speech signal, the near-end talker is determined to be speaking; that is, near-end
speech activity is detected. This comparison is made in the detector 18 based on the signals
arriving from blocks 15 and 17. The final decision on near-end speech activity is thus made using
the spectral peaks and the (averaged) direction-of-arrival estimates. If any spectral peak is
closer to the near-end arrival direction estimate (or one of its reverberation estimates) than to
the far-end estimate, and is also within a predetermined error tolerance of the near-end
estimate, near-end speech is detected. The tolerance is, for example, 10 degrees.
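The decision rule of detection block 18 can be written as a few lines of Python. This is a sketch under simplifying assumptions: ndoa and fdoa are taken to be the single selected near-end and far-end directions (reverberation estimates are left out), and the peaks passed in are assumed to have already met the minimum strength condition. The function name and the angles are hypothetical.

```python
def near_end_active(peaks_deg, ndoa, fdoa, tolerance_deg=10.0):
    """Return True if some peak is closer to the near-end estimate than to the
    far-end estimate and lies within the tolerance of the near-end estimate."""
    for peak in peaks_deg:
        to_near = abs(peak - ndoa)
        to_far = abs(peak - fdoa)
        if to_near < to_far and to_near <= tolerance_deg:
            return True
    return False


# Near-end talker estimated at 0°, loudspeaker at 40° (illustrative values).
print(near_end_active([38.0, 4.0], ndoa=0.0, fdoa=40.0))  # True: the peak at 4° qualifies
print(near_end_active([38.0], ndoa=0.0, fdoa=40.0))       # False: only the loudspeaker peak
```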
[0034]
The determination of double talk requires information on far-end speech activity in addition to
near-end speech activity. This information is fed from the far-end speech activity detector 6 to
the double-talk detector 18, which detects a double-talk situation when the near-end speech
activity detector (described above) and the far-end speech activity detector 6 detect speech
simultaneously. As far as the far-end signal is concerned, any VAD algorithm may be used to
detect speech activity. The double-talk result is obtained using a simple AND operation on the
near-end and far-end speech activity values, i.e. 1 (speech) and 0 (no speech).
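The AND operation, together with the four states of a full-duplex connection named in the background section, can be illustrated as follows. The function name and the state labels are chosen for this sketch only.

```python
def classify_state(near_active, far_active):
    """Map the two speech activity flags to one of the four full-duplex states."""
    double_talk = near_active and far_active  # the simple AND operation
    if double_talk:
        return "double talk"
    if near_active:
        return "near-end speech"
    if far_active:
        return "far-end speech"
    return "idle"


for near in (False, True):
    for far in (False, True):
        print(int(near), int(far), classify_state(near, far))
```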
[0035]
The function of the transient state (transition) detector TD will be described below with reference
to FIG. This detector is optional for the call activity detector / double talk detector according to
the invention and is therefore shown using dotted lines in the figure. Since the estimation of the
direction of arrival is performed on narrowband signals, it is difficult to detect abrupt near-end
signal changes (transient states). It is therefore possible to use a parallel detector TD optimized
for the detection of transients. After each transient is detected, the direction-of-arrival
detector is used to check the correctness of the decision. If the detector according to the
invention itself detects signal changes sufficiently quickly, for example in less than 20 ms, it
is not necessary to use a transient detector TD.
[0036]
In principle, a conventional VAD can be used as a transient detector. However, since certain
arrival direction angles can be attenuated by means of the multi-microphone structure, the
transient detector TD can be realized in such a way that the estimated direction of the
loudspeaker signal is attenuated. In this case, the probability that a detected transient belongs
to the near-end signal is high. Attenuation in the loudspeaker direction can be realized in many
different ways. The easiest way is to use an adaptive two-microphone structure. In principle, any
two of the microphones 2a, 2b, ..., 2M can be used for this, since two microphone signals are
sufficient to realize the attenuation. If the adaptation is controlled using the decision of the
direction-of-arrival estimator (i.e. the adaptation is performed only in the presence of the
far-end signal), attenuation in the desired direction is obtained. The adaptation is easier if
the detection is done in a certain frequency range (such as 1 kHz to 2 kHz). Within the transient
detector, the frequency division can be performed directly on the signal obtained from the
microphone, for example using an FFT or a band-pass filter.
[0037]
The actual transient detector TD compares the instantaneous power P(n) of the signal at instant
n with the noise estimate N(n). Here P(n) is the power of the microphone signal (or of a
microphone signal in which the direction of the loudspeaker signal has been attenuated), and the
noise estimate N(n) is the corresponding power averaged using its previous values, the averaging
being controlled by the decision of the entire system so that it is performed when there is no
speech at all. Information on the moments when there is no speech can be taken from block 18 to
the transient detector TD (dotted arrow). The values P(n) and N(n) can be calculated in the
transient detector from the signals arriving from the microphones. Methods of calculating the
power values P(n) and N(n) are known, and they can be implemented in the transient detector TD
using, for example, an IIR (infinite impulse response) filter. If the difference between P(n) and
N(n) is sufficiently large, a transient is determined to have been detected. Recursive averaging,
N(n+1) = αN(n) + (1−α)P(n), is used to update the noise estimate N(n), where α is a time constant
(generally about 0.9) controlling the averaging.
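A minimal Python sketch of such a transient detector is given below. The class name, the threshold for a "sufficiently large" difference and the initial noise level are hypothetical; the noise update follows the recursive averaging N(n+1) = αN(n) + (1−α)P(n) and is applied only when the overall decision reports no speech.

```python
class TransientDetector:
    """Compare instantaneous power P(n) with a recursively averaged noise estimate N(n)."""

    def __init__(self, alpha=0.9, threshold=1.0, initial_noise=0.0):
        self.alpha = alpha          # time constant, about 0.9 according to the text
        self.threshold = threshold  # assumed value for a "sufficiently large" difference
        self.noise = initial_noise  # current noise estimate N(n)

    def step(self, power, speech_present):
        """Process one power value; return True if a transient is detected."""
        transient = (power - self.noise) > self.threshold
        if not speech_present:
            # N(n+1) = alpha * N(n) + (1 - alpha) * P(n), updated only without speech
            self.noise = self.alpha * self.noise + (1.0 - self.alpha) * power
        return transient


detector = TransientDetector(threshold=1.0, initial_noise=0.1)
frames = [(0.1, False), (0.12, False), (5.0, True)]  # (P(n), speech flag from block 18)
flags = [detector.step(power, speech) for power, speech in frames]
print(flags)  # [False, False, True]
```

Freezing the noise estimate during speech keeps loud speech from raising N(n) and masking later transients.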
[0038]
The transient detector complements the function of the spatial detector according to the
invention. Although near-end speech could be detected by the transient detector TD alone,
reliable detection is obtained through the direction determination of the arrival direction
estimation unit 15. Transients falsely detected from mere echo (not from the near-end signal) can
be corrected using the direction given by the arrival direction estimator 15. Echo-induced
transients during near-end speech need not be considered if the directional attenuation works
well enough. Near-end speech that begins during echo can likewise be detected as a clear
transient, and the result can be checked using the direction-of-arrival detector. The output of
the transient detector TD is fed to block 18 (dotted line).
[0039]
Near-end speech activity and double talk can also be determined by a statistical pattern
recognition approach based on the output of the arrival direction estimator 15. According to this
approach, detection of call activity based on direction of arrival (DOA) angle estimation could be
improved using statistical information. Pattern recognition techniques such as neural networks
and hidden Markov models (HMMs) have been successfully applied to many similar tasks. The
strength of the pattern recognition method is that it can be trained. Given a sufficient amount of
training data, models can be estimated for each state of the system (near-end speech, far-end
speech, double talk, silence). These models can then be used to optimally detect the state of the
system. It goes without saying that the detection process is only optimal if the modeling
assumptions are correct.
[0040]
The following describes schematically how to use the HMM for detecting multi-microphone call
activity. Since the input to the system is still derived from the spatial spectrum, the DOA angle of
the signal (s) is still the decisive factor according to the invention. Moreover, the transient
detection component (reference TD) described above can be used as before.
[0041]
The first step in pattern recognition using HMMs is to define a model network. As mentioned
above, the full-duplex telephone system has four states (models): near-end speech, far-end
speech, double talk, and silence. Each model can be modeled with a multi-state HMM, but
single-state HMMs can be used as a starting point. Another possible improvement is to enforce a
minimum duration in each state in order to prevent oscillation between states.
[0042]
In theory, transitions can occur between any two models, but in practice direct transitions
between the silence and double-talk models and between the near-end and far-end models are
ignored. The actual transitions are as shown in FIG.
[0043]
Once the model structure is defined, it must be decided what kind of probability distribution to
use.
The standard approach to speech recognition is to model each state by a Gaussian probability
density function (pdf), which is also a preferred starting point in this example. Alternatively, any
pdf could be used. The training of the model pdf is ideal by estimating the most likely parameters
from the labeled training data (which can tell which state the system is at a given moment) as
shown in Figure 9 To be done. An alternative is to start with a certain basic model and adapt the
system online, called unsupervised training. Referring again to speech recognition, there are
several online caller adaptation techniques that can be applied to this. In summary, using current
data, the condition that produces the greatest likelihood is adapted with the greatest weighting.
10-04-2019
14
The more adaptation data is available, the more weight is given to the update. A clear problem with
unsupervised training is the risk of adapting the wrong model in the case of misclassification.
If the initial parameters can be estimated from a small number of supervised training samples,
the likelihood of correct adaptation is higher. Furthermore, the far-end channel (speaker) is
separate from the remaining channels, and this information can be used: if far-end activity is
present, only the far-end and double-talk models can be adapted, and so on.
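As a rough illustration of training from labeled data, the sketch below fits one Gaussian per state by maximum likelihood. It assumes a single scalar feature per frame (for instance a DOA angle in degrees); the function name `train_gaussians`, the variance floor, and the scalar feature are assumptions made for the example and are not taken from the description.

```python
from collections import defaultdict

def train_gaussians(features, labels, var_floor=1e-6):
    """Maximum-likelihood mean and variance per labeled state.

    features: one scalar per frame (assumed feature, e.g. a DOA angle).
    labels:   the known state of each frame.
    Returns {state: (mean, variance)}.
    """
    grouped = defaultdict(list)
    for x, state in zip(features, labels):
        grouped[state].append(x)

    models = {}
    for state, xs in grouped.items():
        mean = sum(xs) / len(xs)
        # The ML variance estimate divides by N, not N - 1.
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        # Assumed safeguard: keep the variance strictly positive.
        models[state] = (mean, max(var, var_floor))
    return models
```

The same per-state parameters could serve as the initial model for online adaptation, with updates restricted to the far-end and double-talk models whenever far-end activity is present.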
[0044]
The actual detection (recognition) is quite simple: at any time, the model with the highest
probability is selected. Of course, additional information such as far-end call activity can be used
to further enhance detection performance. A logical refinement of this alternative approach is to
use more elaborate HMMs. For example, the HMM representing each system state could consist of
three states: a transition into the model, a state representing the stationary part of the model, and
a transition out of the model. Also, mixtures of Gaussian pdfs could be used to increase the
accuracy of the pdf modeling.
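The selection rule described above can be sketched as follows, again assuming one scalar feature per frame and single-Gaussian state models stored as `(mean, variance)` pairs. The names `log_gaussian` and `classify`, and the way far-end activity narrows the candidate set, are assumptions for illustration; the description only says that such information can be used.

```python
import math

def log_gaussian(x, mean, var):
    """Log of the Gaussian pdf with the given mean and variance at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x, models, far_end_active=None):
    """Pick the state whose model gives x the highest likelihood.

    models: {state: (mean, variance)}.
    far_end_active: optional side information; True keeps only the far-end
    and double-talk models, False keeps only the others (assumed rule).
    """
    with_far_end = ("far_end", "double_talk")
    if far_end_active is None:
        candidates = models
    else:
        candidates = {s: m for s, m in models.items()
                      if (s in with_far_end) == far_end_active}
        # Fall back to all models if the restriction leaves nothing.
        candidates = candidates or models
    return max(candidates, key=lambda s: log_gaussian(x, *candidates[s]))
```

Log-likelihoods are compared instead of raw densities, which gives the same ranking while avoiding numerical underflow for features far from a model mean.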
[0045]
When the detector according to the invention is used in a hands-free application in a car, the
transient detector can be modified so that the direction of the late reverberation of the signal is
taken into account. In such cases, transient detection can be improved by attenuating several
estimated arrival directions of the loudspeaker signal, rather than a single estimated arrival
direction.
[0046]
The advantages of the spatial speech activity detector according to the present invention over
conventional methods are its ability to recognize both double-talk situations and near-end speech
activity, its speed, and its reliability. Because it is based on the direction of arrival of the
speech signal, the detector is inherently very reliable. Differences between the power levels of
the speech signals have no significant effect on the result, so the detector also recognizes
near-end speech signals that are much lower in power than the speaker signal. In addition, the
decision is not affected by the operation of a separate device, such as an adaptive echo canceller.
Double-talk detectors often use a threshold level that depends on the speech signal and the ambient
noise level, and the presence of a double-talk situation is decided on the basis of that threshold.
The parameters of the present detector are for the most part constant, so the problems described
above do not arise. The speed of recognition can be increased by using the optional transient
detector.
[0047]
In any case, hands-free equipment already performs many of the operations needed by the spatial
detector according to the present invention, such as far-end speech activity detection and ambient
noise estimation, so the detector can reuse calculations that are already being performed.
[0048]
The detector according to the invention can be used in hands-free equipment, for example in car
kits for mobile telephones or in the hands-free equipment of car telephones (for example as part
of echo cancellers and transmission logic).
The invention is also suitable for so-called hands-free phones, in which the hands-free equipment
is built into the phone.
[0049]
FIG. 6 shows by way of example a mobile station according to the invention in which a spatial
near-end speech / double-talk detector 80 according to the invention is used. The speech signal
to be transmitted from the microphone vector 2 is sampled in the A/D converter 20 and then passed
to block TX, where baseband processing (e.g. speech coding, channel coding, interleaving), mixing
and modulation to radio frequency, and transmission take place. From block TX, the signal is passed
via the duplex filter DPLX and the antenna ANT to the air interface. The detector 80 can be used,
for example, to control an echo canceller or to control the transmitter TX in intermittent
transmission. On the receive side, the normal operations of the receiver branch RX are performed,
such as demodulation, deinterleaving, channel decoding and speech decoding, after which far-end
speech activity is detected in the detector 6 and the signal is converted into analog form by the
D/A converter 23 and reproduced by the speaker 7. The invention can be implemented in a separate
hands-free device by installing blocks 2, 7, 20, 23 and 80 of FIG. 6 in a separate hands-free
device that has connections to the mobile station for the input, output and control signals (30,
50, near-end VAD, DT). The invention can also be used in a conference-call arrangement with one or
more microphones and speakers on a tabletop for conference calls, or in connection with a computer
for making calls over the Internet, in which case the microphones and speakers may be located, for
example, in a video display. Thus, the present invention is suitable for all types of hands-free
devices.
[0050]
The implementation methods and embodiments of the invention have been described above
using examples. It will be obvious to those skilled in the art that the present invention is not
limited to the details of the embodiments presented above, and that the present invention can be
realized in other embodiments without departing from its features. The disclosed embodiments
are to be considered as illustrative and not restrictive. Accordingly, the possibilities of practicing
and using the present invention are limited only by the claims which follow. Therefore, different
embodiments of the present invention and equivalent embodiments specified by each claim are
included in the scope of the present invention.
[0051]
Brief description of the drawings
[0052]
FIG. 1 is a block diagram of a conventionally known echo canceller.
[0053]
FIG. 2 is a block diagram of a detector according to the invention.
[0054]
FIG. 3 is a spatial spectrum diagram of the microphone vector signal.
[0055]
FIG. 4 is an installation view of a microphone and a speaker in a car.
[0056]
FIG. 5 is a diagram showing the update factor used to estimate the direction of arrival (in
degrees) as a function of distance.
[0057]
FIG. 6 shows a mobile station according to the present invention.
[0058]
FIG. 7 is a flow chart showing the estimation of the arrival direction.
[0059]
FIG. 8 is a diagram showing the transitions between different states in the alternative embodiment.
[0060]
FIG. 9 is a diagram showing the labeled training data.
[0061]
Explanation of reference signs
[0062]
2 ... microphone vector
3 ... near end signal
5 ... far end signal
6 ... far end call activity detector
7 ... speaker
14 ... band filter
15 ... arrival direction estimation device
17 ... arrival direction determination block
18 ... double talk detector
TD ... transient detector