DESCRIPTION JP2012248986
Abstract: The present invention provides a video conference apparatus that is highly compatible with existing video conference system equipment and that can hold a realistic video conference in which the image and the sound image direction match, without a significant increase in cost. The video conference apparatus includes, as the apparatus 10 on the utterance side, two left and right microphones 11L and 11R and a camera 15, and, as the apparatus 30 on the receiving side, a speaker array in which three or more speakers 34 are arranged horizontally and a display 35. The apparatus 10 generates, from the two input left and right audio signals, a monaural audio signal and sound image direction information indicating the sound image direction. The apparatus 30 converts the sound image direction indicated by the sound image direction information received from the utterance side into the sound image direction obtained when the speaker array is regarded as the line connecting the positions of the two microphones on the utterance side, and outputs the sound indicated by the monaural audio signal from one or more speakers corresponding to the converted sound image direction so that the monaural audio signal is localized in the converted sound image direction. [Selected figure] Figure 1
Video conferencing device
[0001]
The present invention relates to a video conference apparatus used as a terminal for performing
a video conference.
[0002]
2. Description of the Related Art
Conventionally, so-called video conference systems, which transmit an image captured by a camera and audio collected by a microphone to a remote location via an electrical or optical line, have been widely used.
[0003]
For the quality of such a video conference system, it is very important that the direction of the sound image matches the image.
If the voice uttered by a subject in the image is heard from the direction of that subject, it gives the sense that the remote party is present in the same space (a sense of realism), conversation and collaborative work are promoted, and work efficiency with the remote partner increases.
[0004]
Methods of estimating the direction of the sound image generally fall into two categories: estimation from the captured image and estimation from the collected sound.
[0005]
In the method of estimating from the captured image, the speaker is identified by image recognition and the position of the speaker is estimated from the image information.
The position information is then transmitted together with the video and audio signals, and the sound image is localized at that position on the decoding side (see, for example, Patent Document 1).
[0006]
The method of estimating from the collected sound uses a plurality of microphones and estimates the position of the speaker from the sound pressure differences and time differences of the sound entering each microphone.
The subsequent localization of the sound image is the same as described above.
[0007]
Patent Document 1: Japanese Patent No. 4327822
[0008]
Regarding the above-described method of estimating the sound image position from the collected sound, it is generally possible to estimate the direction of the sound image if the sound is collected using a plurality of microphones.
However, such estimation is usually performed using a microphone array composed of several or more microphones, and there is the problem that the cost of the microphone array is high.
[0009]
On the other hand, the above-described method of estimating the sound image position from the captured image requires image recognition of the speaker, and the current level of image recognition technology cannot cope with a large number of people. Even where it is possible, the amount of computation required for video signal processing is much larger than that for audio signal processing, so a high-specification arithmetic processing unit must be mounted, which raises the product cost.
[0010]
As described above, it is desirable to reduce the number of microphones, the number of speakers, and the amount of signal processing computation as much as possible in order to reduce the product cost, but it is difficult to estimate the sound image position with conventional video conference system equipment while reducing the number of microphones and the amount of computation.
[0011]
Most of the standardized methods adopted as the audio encoding and decoding method in the video conference systems currently in wide use transmit monaural audio, so when a stereo (2 ch) audio signal or a signal with a larger number of channels is transmitted, there is no compatibility among devices of different manufacturers.
Thus, a monaural audio signal is the only signal that can be communicated with compatibility among devices of different manufacturers, and being able to transmit the audio signal over the monaural audio signal transmission path is an essential function of a compatible device.
[0012]
However, if a method of estimating the sound image position from the collected sound is adopted, two or more channels of audio signals collected by the plurality of microphones must be transmitted over the network, even though it is desirable to maintain compatibility between devices of different manufacturers. The same compatibility is also desirable when other methods are adopted.
[0013]
The present invention has been made in view of the above circumstances, and its object is to provide a video conference apparatus that is highly compatible with existing video conference system equipment and that can hold a realistic video conference in which the image and the sound image direction match, without a significant increase in cost.
[0014]
In order to solve the problems described above, a first technical means of the present invention is a video conference apparatus that includes two left and right microphones for speech, a camera that photographs the space of the subjects whose sound is collected by the two microphones, a speaker array in which three or more speakers are horizontally arranged, and a display, and that communicates with another video conference apparatus via a network. The apparatus includes a receiver-side audio signal processing unit that processes a monaural audio signal and sound image direction information indicating the sound image direction of the monaural audio signal, which are generated from the two left and right audio signals collected by the two microphones of the other video conference apparatus and transmitted. The receiver-side audio signal processing unit converts the sound image direction indicated by the sound image direction information into the sound image direction obtained when the speaker array is regarded as the line connecting the positions of the two microphones of the other video conference apparatus, and performs processing of outputting the sound indicated by the monaural audio signal from one or more speakers corresponding to the converted sound image direction so that the monaural audio signal is localized in the converted sound image direction.
[0015]
A second technical means is the first technical means further comprising an utterance-side audio signal processing unit that generates, from the two left and right audio signals input from the two microphones, a monaural audio signal and sound image direction information indicating the sound image direction of that monaural audio signal.
[0016]
A third technical means is characterized in that in the first or second technical means, the two
microphones are non-directional microphones spaced apart at both ends of the display.
[0017]
A fourth technical means is characterized in that in the first or second technical means, the two
microphones are directional microphones which are installed apart from each other at both ends
of the display.
[0018]
According to a fifth technical means, in the first or second technical means, the two microphones
are directional microphones disposed adjacent to an upper portion or a lower portion of the
display.
[0019]
According to the present invention, it is possible to provide a video conference apparatus that is highly compatible with existing video conference system equipment and that can hold a realistic video conference in which the image and the sound image direction match, without a significant increase in cost.
[0020]
FIG. 1 is a diagram showing a configuration example of a video conference system using the video conference apparatus according to the present invention.
FIG. 2 is a block diagram showing a configuration example of the audio signal processing unit in the encoding unit of the video conference system of FIG. 1.
FIG. 3 is a diagram showing how audio data are stored in a buffer in the audio signal processing unit of FIG. 2.
FIG. 4 is an enlarged view of part of the waveform of an audio signal input to the post-processing unit of FIG. 2.
FIG. 5 is a schematic diagram for explaining the waveform discontinuities that arise at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete Fourier transformed and the DC components of the left and right channels are ignored.
FIG. 6 is a schematic diagram for explaining an example of the discontinuity removal processing performed by the post-processing unit of FIG. 2.
FIG. 7 is a diagram showing the result of applying the discontinuity removal processing of FIG. 6 to the audio signal of FIG. 4.
FIG. 8 is a block diagram showing a configuration example of the audio signal processing unit in the decoding unit of the video conference system of FIG. 1.
FIG. 9 is a schematic diagram for explaining a two-channel (2 ch) reproduction system.
FIG. 10 is a schematic diagram showing an example of the speaker array arranged as a speaker group in the video conference system of FIG. 1.
FIG. 11 is a schematic diagram for explaining an example of the positional relationship between a listener, the left and right speakers, and a synthesized sound image.
FIG. 12 is a schematic diagram for explaining an example of the positional relationship between the speaker group and the virtual sound sources used in a wave field synthesis reproduction system.
FIG. 13 is a schematic diagram for explaining an example of the positional relationship between the virtual sound sources of FIG. 12, a listener, and a synthesized sound image.
The remaining figures are external views showing an example of the video conference apparatus according to the present invention and other examples thereof.
[0021]
The video conference apparatus according to the present invention is an apparatus used as a terminal for holding a video conference, and provides a video conference environment by communicating with another video conference apparatus via a network. Roughly described, the video conference apparatus according to the present invention estimates the sound image direction on the utterance side solely from the stereo input audio signal (it can equally be said that the sound image position is estimated), transmits a monaural audio signal together with information indicating its sound image direction, and, on the receiving side, outputs the sound indicated by the monaural audio signal from the speaker array so that it is localized in a direction corresponding to that sound image direction. As a result, video and audio with a sense of realism can be transmitted with a small amount of computation while transmitting only a monaural audio signal.
[0022]
Hereinafter, a configuration example and processing examples of the video conference apparatus according to the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing a configuration example of a video conference system using the video conference apparatus according to the present invention. The device 10 on the utterance side (encoding side) and the device 30 on the receiving side (decoding side) of the video conference system will be described separately, but the video conference apparatus according to the present invention has the functions of both devices 10 and 30.
[0023]
The apparatus 10 on the encoding side includes two microphones 11L and 11R, amplifiers 12L and 12R, an A/D converter 13, an encoding unit 14, and a camera 15. The microphones 11L and 11R are a left-channel microphone and a right-channel microphone, respectively, and the amplifiers 12L and 12R amplify the audio signals collected by the microphones 11L and 11R, respectively. The A/D converter 13 converts the left and right audio signals output from the amplifiers 12L and 12R into left and right digital audio signals. The camera 15 photographs the space of the subjects whose sound is collected by the two microphones 11L and 11R and outputs it as a digital video signal, on which image processing is performed as necessary.
[0024]
The encoding unit 14 encodes the left and right digital audio signals output from the A/D converter 13 and encodes the digital video signal output from the camera 15. The encoding unit 14 has an utterance-side audio signal processing unit, which is one of the main features of the present invention. The utterance-side audio signal processing unit generates, from the two left and right audio signals input from the two microphones 11L and 11R, a monaural audio signal and sound image direction information indicating the sound image direction of the monaural audio signal. The utterance-side audio signal processing unit may instead be provided separately from the encoding unit 14.
[0025]
The device 10 on the encoding side also includes a communication unit (not shown) for
transmitting the encoded digital data to the device 30 on the decoding side. The digital data to be
transmitted includes the sound image direction information in addition to the monaural audio
signal and the video signal.
[0026]
The device 30 on the decoding side includes a communication unit (not shown) that receives data
transmitted from the device 10 on the coding side. Although not particularly described below, the
exchange of data between the device 10 and the device 30 is normally performed via a server
that manages the exchange of data. This exchange of data may be performed, for example, via a
dedicated high security server of the video conference system, or may be performed via a general
chat server.
[0027]
The decoding side device 30 further includes a decoding unit 31, a D / A converter 32, a plurality
of amplifiers 33, and three or more speakers 34. Three or more speakers 34 are arranged
horizontally and form a speaker array.
[0028]
The apparatus 30 on the decoding side also has a receiver-side audio signal processing unit, which is one of the main features of the present invention. As briefly described above for the audio-related processing, the receiver-side audio signal processing unit processes the monaural audio signal and the sound image direction information received from the other video conference apparatus. Although a detailed example of the receiver-side audio signal processing unit will be described later, it converts the sound image direction indicated by the sound image direction information into the sound image direction obtained when the speaker array formed of the three or more speakers 34 is regarded as the line connecting the positions of the two microphones in the device on the utterance side, and performs processing of outputting the sound indicated by the monaural audio signal from one or more speakers corresponding to the converted sound image direction so that the monaural audio signal is localized in the converted sound image direction.
[0029]
An outline of each component of the apparatus 30 on the decoding side will now be given. First, the decoding unit 31 decodes the digital data received from the other video conference apparatus, passes the video signal of the decoded digital data to the display 35, and passes the monaural audio signal and the sound image direction information to the D/A converter 32. The display 35 displays the video represented by the video signal input from the decoding unit 31. Using the sound image direction information, the D/A converter 32 converts the monaural audio signal into analog audio signals for the sound image direction indicated by that information and outputs them to the amplifiers 33 corresponding to the relevant speakers 34. Each amplifier 33 outputs the input analog audio signal to its corresponding speaker 34. As a result, the corresponding sound is output from the speakers 34 indicated by the sound image direction information. The audio output may be synchronized with the video display by existing techniques.
[0030]
Focusing on the audio in a video conference system having such a configuration, an outline of the flow from sound collection to reproduction will be described. First, the sound emitted by the talker is collected by the two left and right microphones 11L and 11R and amplified by the amplifiers 12L and 12R, respectively. It is then sampled by the A/D converter 13 to become discrete audio signals, which are input to the encoding unit 14. The encoding unit 14 processes the input discrete audio signals to generate and encode a monaural audio signal and sound image direction information. The encoded data are transmitted by the communication unit to the apparatus 30 on the decoding side via the network N.
[0031]
The transmitted signal is received via the network N by the communication unit of the apparatus 30 on the decoding side. The received coded data are decoded by the decoding unit 31, and as a result an audio signal reflecting the position indicated by the sound image direction information is generated. It is converted into analog signals by the D/A converter 32 and reproduced by the speakers 34 through the amplifiers 33.
[0032]
The utterance-side audio signal processing unit, which is the portion of the encoding unit 14 related to audio signal processing, will be described with reference to FIG. 2. FIG. 2 is a block diagram showing a configuration example of the audio signal processing unit in the encoding unit of the video conference system of FIG. 1.
[0033]
The audio signal processing unit 20 illustrated in FIG. 2 includes a preprocessing unit 21, a discrete Fourier transform unit 22, a signal separation and extraction unit 23, gain adjustment units 24L, 24S and 24R, a synthesis unit 25, an inverse discrete Fourier transform unit 26, a post-processing unit 27, and a compression encoding processing unit 28.
[0034]
The preprocessing unit 21 reads the input left and right audio signals and performs a window function operation.
The discrete Fourier transform unit 22 transforms these signals from the time domain representation to the frequency domain representation. The signal separation and extraction unit 23 separates the transformed audio signals into a correlated signal and uncorrelated signals separated from each of the left and right channels, and also extracts sound image direction information. The extracted sound image direction information is output to the compression encoding processing unit 28.
[0035]
The gain adjustment units 24L, 24S and 24R receive the respective separated signal components (the left uncorrelated signal, the correlated signal, and the right uncorrelated signal) and perform scaling processing, that is, multiplication by gain coefficients for increasing or decreasing the gain. The synthesis unit 25 adds the scaled audio signals and outputs the result to the inverse discrete Fourier transform unit 26. Since the synthesis unit 25 performs addition processing (that is, superposition processing) of three audio signals, it can also be called an addition unit or a superposition unit.
[0036]
The inverse discrete Fourier transform unit 26 returns the added audio signal to the time domain and outputs it to the post-processing unit 27. The post-processing unit 27 performs noise removal processing on the output signal from the inverse discrete Fourier transform unit 26 and outputs the processed signal to the compression encoding processing unit 28. The compression encoding processing unit 28 then compresses and encodes the audio signal post-processed by the post-processing unit 27 and the sound image direction information extracted by the signal separation and extraction unit 23.
[0037]
Hereinafter, specific processing examples of the respective units of the audio signal processing unit 20 will be described, also with reference to FIG. 3. FIG. 3 is a diagram showing how audio data are stored in a buffer in the audio signal processing unit of FIG. 2.
[0038]
First, the preprocessing unit 21 will be described. The preprocessing unit 21 reads audio data (audio signal data) of half the length of one segment from the A/D converter 13 of FIG. 1. Here, it is assumed that the audio signal is sampled by the A/D converter 13 at a sampling frequency of, for example, 16 kHz. A segment is an audio data section consisting of a group of sample points of a certain length, and here refers to the section length that is later subjected to the discrete Fourier transform; its value is, for example, 1024. In this example, 512 points of audio data, half the length of one segment, are read out.
[0039]
The read 512 points of audio data are stored in the buffer 3 as illustrated in FIG. 3. This buffer 3 is designed to hold the audio signal waveform of the previous segment and discards older segments. The previous half segment of data and the latest half segment of data are concatenated to form one segment of audio data, and a window function operation is performed on that data. In other words, every sample is read twice for the window function operation, as sketched below.
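The following Python/NumPy sketch illustrates this half-segment (50 percent overlap) buffering scheme together with the Hann windowing described in the following paragraphs. The segment length M = 1024, the 16 kHz sampling rate, and all function and variable names are illustrative assumptions, not part of the embodiment; the window w(m) = sin²(πm/M) is assumed from the sin²/cos² relation given in paragraphs [0041] and [0042].

```python
import numpy as np

M = 1024          # one segment (assumed, as in the example above)
HALF = M // 2     # 512 samples are read per iteration

def hann(M):
    # assumed window: w(m) = sin^2(pi*m/M); two half-segment reads sum to 1
    m = np.arange(M)
    return np.sin(np.pi * m / M) ** 2

def segments(x):
    """Yield windowed one-segment frames built from the previous and the
    latest half segment, so every sample is read (and windowed) twice."""
    w = hann(M)
    prev = np.zeros(HALF)                        # buffer 3: last half segment
    for start in range(0, len(x) - HALF + 1, HALF):
        latest = x[start:start + HALF]
        frame = np.concatenate([prev, latest])   # one segment of audio data
        prev = latest
        yield w * frame

# usage sketch: two seconds of a 16 kHz test signal
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(2 * fs) / fs)
frames = list(segments(x))
```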
[0040]
In the window function operation, one segment of audio data is multiplied by the well-known Hann window given by the following equation (1). Here, m is a natural number, and M is an even number equal to one segment length. Assuming that the stereo input signals are xL(m) and xR(m), respectively, the audio signals x'L(m) and x'R(m) after multiplication by the window function are
[0041]
calculated as x'L(m) = w(m)xL(m), x'R(m) = w(m)xR(m) (2). When this Hann window is used, an input signal xL(m0) at a sample point m0 (where M/2 ≤ m0 < M), for example, is multiplied by sin²((m0/M)π). In the next reading, the same sample point is read at index m0 − M/2,
[0042]
and is multiplied by cos²((m0/M)π). Here, since sin²((m0/M)π) + cos²((m0/M)π) = 1, the signal read
without any correction is shifted by a half segment. By adding them, the original signal is
completely restored.
[0043]
The voice data thus obtained is discrete Fourier transformed by the discrete Fourier transform
unit 22 as in the following equation (3) to obtain voice data in the frequency domain. Here, DFT
represents a discrete Fourier transform, k is a natural number, and 0 ≦ k <M. XL (k) and XR (k)
are complex numbers. XL(k)=DFT(x′L(n)) 、
XR(k)=DFT(x′R(n)) (3)
[0044]
Next, the signal separation and extraction unit 23 will be described. The signal separation and extraction unit 23 divides the obtained frequency-domain audio data into small bands. For the division, the equivalent rectangular bandwidth (ERB) is used, and the range from 0 Hz to half the sampling frequency is divided into bands of one ERB width each. The number of divisions up to a given upper-limit frequency fmax [Hz], that is, the maximum value I of the index of the bands divided by the ERB, is given by the following equation: I = floor(21.4 log10(0.00437 fmax + 1)) (4), where floor(a) is the floor function and represents the largest integer not exceeding the real number a.
[0045]
The center frequency Fc^(i) (1 ≤ i ≤ I) [Hz] of each ERB band (hereinafter, small band) is then given by the following equation.
[0046]
Further, the bandwidth b^(i) [Hz] of the ERB at that frequency can be obtained by the following equation: b^(i) = 24.7(0.00437 Fc^(i) + 1) (6). Therefore, by shifting the center frequency toward the low side and the high side by half the ERB width, the boundary frequencies FL^(i) and FU^(i) on both sides of the i-th small band can be obtained. The i-th small band thus contains the KL^(i)-th through the KU^(i)-th line spectra. Here, KL^(i) and KU^(i) are expressed by the following equations (7) and (8), respectively: KL^(i) = ceil(21.4 log10(0.00437 FL^(i) + 1)) (7), KU^(i) = floor(21.4 log10(0.00437 FU^(i) + 1)) (8), where ceil(a) is the ceiling function and represents the smallest integer not less than the real number a. Furthermore, the line spectrum after the discrete Fourier transform is symmetric about M/2 (M being an even number), except for the DC component, for example XL(0). That is, XL(k) and XL(M−k) are complex conjugates of each other in the range 0 < k < M/2. Therefore, in the following, only the range KU^(i) ≤ M/2 is treated as the object of analysis, and the range k > M/2 is treated in the same way as the symmetric line spectra with which it is in a complex conjugate relationship.
[0047]
A specific example is as follows. If the sampling frequency is 16000 Hz, then I = 33, and the spectrum is divided into 33 small bands. The DC component, however, is not subject to division and is not included in any small band. The reason is that, although the normalized correlation coefficient of the left and right channels is determined by the method described below, the DC component has only a real part as a complex number, so the DC components of the left and right channels are always perfectly in phase and their normalized correlation coefficient is 1; assigning the DC component to the correlation coefficient calculation would therefore be an inappropriate process. There are also line spectrum components corresponding to frequencies higher than the highest small band, but since they have little auditory impact and usually have small values, they may simply be included in the highest small band.
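The following NumPy sketch, offered as an illustrative assumption rather than the embodiment itself, reproduces this ERB-based division: it computes the number of small bands I from equation (4) and the band boundaries from equations (6) to (8) for a 16 kHz sampling rate. The center-frequency formula, which inverts equation (4), is an assumption, since equation (5) is not reproduced above.

```python
import numpy as np

def erb_bands(fs=16000):
    """Divide 0 .. fs/2 into ERB-wide small bands, following eq. (4) and (6)-(8)."""
    fmax = fs / 2.0
    I = int(np.floor(21.4 * np.log10(0.00437 * fmax + 1)))      # eq. (4)
    bands = []
    for i in range(1, I + 1):
        # assumed inverse of eq. (4): center frequency whose ERB number is i - 0.5
        Fc = (10 ** ((i - 0.5) / 21.4) - 1) / 0.00437
        b = 24.7 * (0.00437 * Fc + 1)                            # eq. (6)
        FL, FU = Fc - b / 2, Fc + b / 2                          # boundary frequencies
        KL = int(np.ceil(21.4 * np.log10(0.00437 * max(FL, 0.0) + 1)))   # eq. (7)
        KU = int(np.floor(21.4 * np.log10(0.00437 * FU + 1)))           # eq. (8)
        bands.append((Fc, KL, KU))
    return I, bands

I, bands = erb_bands()
print(I)   # 33 for fs = 16000 Hz, as in the example above
```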
[0048]
Next, for each small band determined in this way, the normalized correlation coefficient of the left channel and the right channel is obtained using the following equation.
[0049]
The normalized correlation coefficient d^(i) represents how strongly the audio signals of the left and right channels are correlated, and takes a real value between 0 and 1.
It is 1 if the signals are exactly the same and 0 if the signals are completely uncorrelated.
When the powers PL^(i) and PR^(i) of the audio signals of the left and right channels are both 0, extraction of the correlated signal and the uncorrelated signals is impossible for that small band, so its processing is skipped and processing moves on to the next small band. When only one of PL^(i) and PR^(i) is 0, equation (9) cannot be evaluated, so the normalized correlation coefficient is set to d^(i) = 0 and processing of that small band continues.
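Equation (9) itself is not reproduced in this text; the sketch below therefore assumes one common form of a frequency-domain normalized correlation coefficient (the magnitude of the normalized cross power) together with the zero-power handling just described. The function name and the exact normalization are assumptions.

```python
import numpy as np

def normalized_correlation(XL, XR):
    """d in [0, 1] for one small band of line spectra XL, XR (complex arrays).
    Returns None when both powers are zero (the band is skipped), and 0 when
    exactly one power is zero, as described above."""
    PL = np.sum(np.abs(XL) ** 2)
    PR = np.sum(np.abs(XR) ** 2)
    if PL == 0 and PR == 0:
        return None                      # skip this small band
    if PL == 0 or PR == 0:
        return 0.0                       # continue with d = 0
    cross = np.abs(np.sum(XL * np.conj(XR)))   # assumed numerator of eq. (9)
    return float(cross / np.sqrt(PL * PR))

# usage sketch: a fully correlated pair yields d = 1
k = np.arange(8)
XL = np.exp(1j * k)
XR = 0.5 * XL
print(round(normalized_correlation(XL, XR), 3))
```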
[0050]
Next, using the normalized correlation coefficient d^(i), transform coefficients for separating and extracting the correlated signal and the uncorrelated signals from the audio signals of the left and right channels are determined, and the correlated signal and the uncorrelated signals are separated and extracted from the audio signals of the left and right channels using the determined transform coefficients. Both the correlated signal and the uncorrelated signals are extracted as estimated audio signals.
[0051]
A processing example of the calculation of the transform coefficients and the separation and extraction of the signals will now be described. Here, a model is adopted in which the signal of each of the left and right channels is composed of an uncorrelated signal and a correlated signal, and in which the same correlated signal is output from the left and right. The direction of the sound image synthesized by the correlated signal output from the left and right is determined by the balance of the left and right sound pressures of the correlated signal. According to this model, the input signals xL(m) and xR(m) are expressed by the following equation: xL(m) = s(m) + nL(m), xR(m) = αs(m) + nR(m) (13). Here, s(m) is the correlated signal common to the left and right, nL(m) is the left-channel audio signal minus the correlated signal s(m) and can be defined as the uncorrelated signal of the left channel, and nR(m) is the right-channel audio signal minus the correlated signal s(m) and can be defined as the uncorrelated signal of the right channel. Further, α is a positive real number representing the left and right sound pressure balance of the correlated signal.
[0052]
The audio signals x′L(m) and x′R(m) after the window function multiplication described above in equation (2) are, by equation (13), expressed by the following equation (14), where s′(m), n′L(m) and n′R(m) are s(m), nL(m) and nR(m) multiplied by the window function, respectively: x′L(m) = w(m){s(m) + nL(m)} = s′(m) + n′L(m), x′R(m) = w(m){αs(m) + nR(m)} = αs′(m) + n′R(m) (14).
[0053]
By applying the discrete Fourier transform to equation (14), the following equation (15) is obtained, where S(k), NL(k) and NR(k) are the discrete Fourier transforms of s′(m), n′L(m) and n′R(m), respectively: XL(k) = S(k) + NL(k), XR(k) = αS(k) + NR(k) (15).
[0054]
Therefore, the audio signals XL^(i)(k) and XR^(i)(k) in the i-th small band can be expressed as XL^(i)(k) = S^(i)(k) + NL^(i)(k), XR^(i)(k) = α^(i) S^(i)(k) + NR^(i)(k), where KL^(i) ≤ k ≤ KU^(i) (16). Here, α^(i) represents α in the i-th small band. In the following, the correlated signal S^(i)(k) and the uncorrelated signals NL^(i)(k) and NR^(i)(k) in the i-th small band are taken to be S^(i)(k) = S(k), NL^(i)(k) = NL(k), NR^(i)(k) = NR(k), where KL^(i) ≤ k ≤ KU^(i) (17).
[0055]
From equation (16), the powers PL^(i) and PR^(i) in equation (12) are PL^(i) = PS^(i) + PN^(i), PR^(i) = [α^(i)]² PS^(i) + PN^(i) (18). Here, PS^(i) and PN^(i) are the powers of the correlated signal and of the uncorrelated signal in the i-th small band, respectively, and are expressed as follows. It is assumed here that the sound pressures of the left and right uncorrelated signals are equal.
[0056]
Further, equation (9) can be expressed by the following equations (10) to (12). In this calculation it is assumed that S(k), NL(k) and NR(k) are mutually orthogonal, that is, that the power of each of their cross products is zero.
[0057]
By solving equation (18) and equation (20), the following equation is obtained.
[0058]
These values are used to estimate the correlated signal and the uncorrelated signals in each small band.
If the estimated value est(S^(i)(k)) of the correlated signal S^(i)(k) in the i-th small band is written, using parameters μ1 and μ2, as est(S^(i)(k)) = μ1 XL^(i)(k) + μ2 XR^(i)(k) (23), the estimation error ε is expressed as ε = est(S^(i)(k)) − S^(i)(k) (24). Here, est(A) denotes an estimated value of A. When the squared error ε² is minimized, ε is orthogonal to XL^(i)(k) and XR^(i)(k), so E[ε · XL^(i)(k)] = 0, E[ε · XR^(i)(k)] = 0 (25). Using equations (16), (19) and (21) to (24), the following simultaneous equations can be derived from equation (25): (1 − μ1 − μ2 α^(i)) PS^(i) − μ1 PN^(i) = 0, α^(i)(1 − μ1 − μ2 α^(i)) PS^(i) − μ2 PN^(i) = 0 (26).
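As the next paragraph notes, equation (26) can be solved for the parameters; algebraically it yields μ2 = α^(i) μ1 and μ1 = PS^(i) / ((1 + [α^(i)]²) PS^(i) + PN^(i)). The sketch below applies these coefficients to estimate the correlated signal in one small band. The values of α^(i), PS^(i) and PN^(i) are assumed to have been obtained beforehand from equations (21) and (22), which are not reproduced in this text, so this is a sketch under that assumption rather than a complete implementation.

```python
import numpy as np

def estimate_correlated(XL, XR, alpha, PS, PN):
    """est(S) = mu1*XL + mu2*XR for one small band, eq. (23) with the
    solution of eq. (26). alpha, PS and PN are assumed to come from
    eqs. (21)-(22), which are not reproduced here."""
    mu1 = PS / ((1.0 + alpha ** 2) * PS + PN)
    mu2 = alpha * mu1
    return mu1 * XL + mu2 * XR

# usage sketch with a synthetic band: S is common, NL/NR are uncorrelated
rng = np.random.default_rng(0)
S = rng.standard_normal(16) + 1j * rng.standard_normal(16)
NL = 0.1 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))
NR = 0.1 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))
alpha = 0.8
XL, XR = S + NL, alpha * S + NR
PS = np.mean(np.abs(S) ** 2)
PN = np.mean(np.abs(NL) ** 2)
est_S = estimate_correlated(XL, XR, alpha, PS, PN)
```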
[0059]
By solving equation (26), each parameter can be obtained as follows. The power Pest(S)^(i) of the estimated value est(S^(i)(k)) obtained in this way is found by squaring both sides of equation (23): Pest(S)^(i) = (μ1 + α^(i) μ2)² PS^(i) + (μ1² + μ2²) PN^(i) (28). Since this power must match PS^(i), the estimated value is scaled on the basis of this equation as follows. Note that est′(A) denotes a scaled estimated value of A.
[0060]
[0061]
Similarly, if the estimated values est(NL^(i)(k)) and est(NR^(i)(k)) of the uncorrelated signals NL^(i)(k) and NR^(i)(k) of the left and right channels in the i-th small band are written as est(NL^(i)(k)) = μ3 XL^(i)(k) + μ4 XR^(i)(k) (30) and est(NR^(i)(k)) = μ5 XL^(i)(k) + μ6 XR^(i)(k) (31), the parameters μ3 to μ6
[0062]
can be obtained.
The estimated values est(NL^(i)(k)) and est(NR^(i)(k)) obtained in this way are also scaled according to the following equations, in the same manner as described above.
[0063]
[0064]
The parameters μ1 to μ6 shown in equations (27), (32) and (33) and the scaling coefficients shown in equations (29), (34) and (35) correspond to the transform coefficients obtained in step S86.
Then, in step S87, the correlated signal and the uncorrelated signals (the uncorrelated signal of the left channel and the uncorrelated signal of the right channel) are estimated by calculation using these transform coefficients (equations (23), (30) and (31)).
[0065]
The signal separation and extraction unit 23 outputs the signals separated in this way, but, as described below, it outputs them after performing the process of assigning them to virtual sound sources.
For this purpose, the audio signal processing unit 20 has the gain adjustment units 24L and 24R for the left and right channels and the gain adjustment unit 24S for the correlated signal. The signal separation and extraction unit 23 outputs the uncorrelated signal est′(NL^(i)(k)) separated from the left channel to the gain adjustment unit 24L for the left channel, outputs the uncorrelated signal est′(NR^(i)(k)) separated from the right channel to the gain adjustment unit 24R for the right channel, and outputs the correlated signal est′(S^(i)(k)) separated from both channels to the gain adjustment unit 24S for the correlated signal.
[0066]
Furthermore, the signal separation and extraction unit 23 outputs α^(i) of equation (21) to the compression encoding processing unit 28 as the sound image direction information. Since this value in each small band indicates the balance of the left and right sound pressures of the correlated signal component, as shown in equation (13), the sound image position can be identified if this value and the distance between the microphones are known.
[0067]
The gain adjustment units 24L, 24S and 24R scale the respective signals. Background noise is usually mixed into the sounds input from the left and right microphones 11L and 11R, but such background noise has a low correlation between the left and right channels, so there is a high probability that it is separated as the uncorrelated signals. The speaker's voice, on the other hand, is mainly separated as the correlated signal. Therefore, if the gain adjustment units 24L and 24R reduce the uncorrelated signals of the left and right channels relative to the correlated signal, or the gain adjustment unit 24S increases the correlated signal relative to the left and right uncorrelated signals, the background noise can be suppressed and, as a result, the input speech can be made clearer.
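A minimal sketch of this noise-suppressing gain adjustment is shown below; the gain values (a 6 dB emphasis of the correlated signal and a 6 dB attenuation of the uncorrelated signals) and the function name are illustrative assumptions.

```python
import numpy as np

def adjust_gains(est_S, est_NL, est_NR, g_corr=2.0, g_uncorr=0.5):
    """Scale the separated components of one small band (units 24S, 24L, 24R):
    emphasise the correlated (speech) signal, attenuate the uncorrelated
    (background-noise) signals, then sum them into one monaural spectrum,
    as the synthesis unit 25 does in the following paragraph."""
    return g_corr * est_S + g_uncorr * est_NL + g_uncorr * est_NR

# usage sketch with arbitrary complex band data
band = np.ones(8, dtype=complex)
mono_band = adjust_gains(band, 0.2 * band, 0.2 * band)
```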
[0068]
Next, in the synthesis unit 25, the three scaled signals are added in every small band, and the three summed signals are further added into a single signal. The inverse discrete Fourier transform unit 26 then performs the inverse discrete Fourier transform to obtain a monaural audio signal. In this monaural audio signal, the left and right audio signals have been converted into a single monaural signal and the noise components have additionally been suppressed.
[0069]
The monaural audio signal obtained in this way is output to the post-processing unit 27. As described for equation (3), the signal subjected to the discrete Fourier transform is the signal after multiplication by the window function, so the signal obtained by the inverse transform is also still multiplied by the window function. The window function is the function shown in equation (1), and reading is performed while shifting by a half segment length, so, as described above, the restored data are obtained by adding the result to the output buffer while shifting it by a half segment length from the head of the previously processed segment.
[0070]
Next, the post-processing unit 27 will be described. The post-processing unit 27 performs noise removal processing. The noise targeted by this removal will be described with reference to FIG. 4. FIG. 4 is an enlarged view of part of the waveform of an audio signal input to the post-processing unit 27 of FIG. 2. The audio signal 40 shown in FIG. 4 has a discontinuity near its center 41. Because many such discontinuities are contained in the data that reach the post-processing unit 27 through the signal separation and extraction unit 23, they are perceived as annoying noise during reproduction. Such discontinuities occur because this audio signal processing system ignores the DC component in its processing, that is, it does not take the line spectrum of the DC component into account.
[0071]
FIG. 5 shows a waveform that illustrates this schematically. More specifically, FIG. 5 is a schematic diagram for explaining the waveform discontinuities that arise at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete Fourier transformed and the DC components of the left and right channels are ignored. In the graph 50 shown in FIG. 5, the horizontal axis represents time; for example, the symbol (M−2)^(l) indicates the (M−2)-th sample point of the l-th segment. The vertical axis of the graph 50 is the value of the output signal at those sample points. As can be seen from this graph 50, a discontinuity occurs in the portion from the end of the l-th segment to the beginning of the (l+1)-th segment.
[0072]
Noise removal processing is performed to deal with this problem. Any method may be used as long as the noise can be removed by eliminating the waveform discontinuities, but here an example of such processing, which solves the problem described with FIG. 5, will be described concretely with reference to FIGS. 6 and 7. FIG. 6 is a schematic diagram for explaining an example of the discontinuity removal processing performed by the post-processing unit 27 of FIG. 2, that is, a method of removing the waveform discontinuities that occur at segment boundaries after the inverse discrete Fourier transform when the audio signals of the left and right channels are discrete Fourier transformed and their DC components are ignored. FIG. 7 is a diagram showing the result of applying the discontinuity removal processing of FIG. 6 to the audio signal of FIG. 4.
[0073]
In the example of discontinuity removal processing performed by the post-processing unit 27, shown in the graph 60 of FIG. 6 as the removal example for the graph 50 of FIG. 5, the first derivative values of the l-th and (l+1)-th segments are made to match at the boundary. Specifically, the post-processing unit 27 adds a DC component (bias) to the waveform of the (l+1)-th segment so that the value at the beginning of the (l+1)-th segment maintains the slope of the last two points of the l-th segment. As a result, the output audio signal y″j(m) after processing is expressed as y″j(m) = y′j(m) + B (36), where y′j(m) denotes the output audio signal before processing. B is a constant representing the bias; after the previous output audio signal and the output audio signal of the current processing are added in the output buffer, the waveform becomes continuous as shown by the graph 60 in FIG. 6.
[0074]
If only the discontinuity removal processing described with FIG. 6 is applied, however, the bias components may accumulate and the amplitude of the waveform may overflow. It is therefore preferable to make the added bias component (DC component) converge by decreasing its magnitude over time, as in the following equation. Here, decreasing over time means decreasing in proportion to the elapsed time from the point of addition, for example from the start of each processing segment or from the discontinuity point: y″j(m) = y′j(m) + B × ((M − mσ)/M) (37), where σ is a parameter that adjusts the degree of decrease and is, for example, 0.5. For the bias to decrease, both B and σ are positive. Furthermore, when the absolute value of the bias obtained for addition exceeds a certain value, σ may be dynamically increased or decreased according to that value; the timing of the change may be the next processing segment. Changing σ, which corresponds to the proportionality constant of the decrease, according to the absolute value of the bias (the magnitude of the amplitude of the DC component) acts as a feedback function and gives the same effect. However, these methods do not guarantee that the amplitude of the speech waveform will not overflow.
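The following is a minimal sketch of equations (36) and (37), under assumptions: the bias B is taken as the value that continues the slope of the last two samples of the previous segment, σ = 0.5, and the safety-valve threshold of the following paragraph is included as a hypothetical parameter b_max.

```python
import numpy as np

def remove_discontinuity(prev_tail, segment, sigma=0.5, b_max=0.1):
    """Eq. (36)/(37): add a temporally decaying DC bias to `segment` so that
    its first sample continues the slope of the last two samples of the
    previous segment (`prev_tail`, length >= 2)."""
    M = len(segment)
    expected = prev_tail[-1] + (prev_tail[-1] - prev_tail[-2])   # keep the slope
    B = expected - segment[0]                                    # bias, eq. (36)
    if abs(B) >= b_max:          # safety valve: do not add an excessive bias
        return segment.copy()
    m = np.arange(M)
    return segment + B * (M - m * sigma) / M                     # eq. (37)

# usage sketch: a segment that starts with a jump relative to the previous one
prev = np.array([0.00, 0.01])
seg = 0.05 + 0.01 * np.arange(8)
print(remove_discontinuity(prev, seg))
```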
[0075]
Therefore, as a safety-valve function, a process may be added in which the bias term of the second term of equation (37) is not added when, for example, the bias value reaches or exceeds a certain (predetermined) value. That is, the post-processing unit 27 preferably performs the addition of the DC component (performs the discontinuity removal) only when the amplitude of the DC component obtained for addition is less than the predetermined value. With this method, bias components do not accumulate.
[0076]
In addition, when the audio signal is close to white noise, such as in a consonant portion of speech, the waveform of the audio signal changes sharply and the original waveform is itself already close to discontinuous. If the above-described discontinuity removal processing is applied to such an audio signal, it attempts to force continuity onto a waveform that is inherently close to discontinuous, and the waveform may instead be distorted.
[0077]
To address this problem, the post-processing unit 27 preferably performs the discontinuity removal processing (noise removal processing) as follows. Use is made of the fact that, when a signal such as a consonant portion of speech is close to white noise, the number of times the waveform of the input audio signal crosses 0 within a predetermined time (for example, within a processing segment or half of one) increases markedly compared with other portions. Where the 0 level is taken may be decided arbitrarily. The number of times the output audio signal (at least the audio signal after the inverse discrete Fourier transform) crosses 0 within the half segment length is therefore counted, and if it is equal to or greater than a predetermined value (a predetermined number of times), the bias term of the second term on the right side of equation (36) or (37) is not added in the following segment processing. That is, the discontinuity removal processing is performed only in the other portions. The counting may be performed on the speech waveform for a fixed time regardless of segment boundaries, or on the speech waveforms of a plurality of processing segments; in either case, whether or not to add the bias term in the next segment processing is decided from the counting result.
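The zero-crossing criterion above can be sketched as follows; the threshold value is an illustrative assumption.

```python
import numpy as np

def bias_allowed(half_segment, max_crossings=50):
    """Count zero crossings in a half segment and decide whether the bias
    term of eq. (36)/(37) may be added in the next segment processing."""
    signs = np.sign(half_segment)
    crossings = int(np.count_nonzero(np.diff(signs)))
    return crossings < max_crossings

# usage sketch: noise-like content suppresses the bias addition
rng = np.random.default_rng(1)
print(bias_allowed(np.sin(np.linspace(0, 4 * np.pi, 512))))   # True
print(bias_allowed(rng.standard_normal(512)))                 # False
```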
[0078]
As shown by the audio signal 70 in FIG. 7, it can be seen that the discontinuity in the audio signal 40 of FIG. 4 (near the center 41) has been made continuous by this processing. In this way the discontinuities, and hence the noise, can be eliminated.
[0079]
The monaural audio signal obtained in this way and the sound image direction information output from the signal separation and extraction unit 23 are encoded by the compression encoding processing unit 28. The monaural audio signal may be encoded according to a widely used speech coding standard such as G.711, G.722, G.723.1, G.728 or G.729, or according to a proprietary protocol such as that of a Voice over Internet Protocol (VoIP) application.
[0080]
As for the sound image direction information α^(i), a value of 1 means that the left and right sound pressures are equal and the sound image is estimated to be at an equal distance from the left and right microphones. The value can therefore be expected to fall with equal probability on either side of 1. To quantize such a value, if, for example, the value α′^(i) converted as follows is used, the value falls within the range of −1 to 1 and can be quantized efficiently: α′^(i) = (α^(i) − 1)/(α^(i) + 1) (38).
[0081]
The converted α′^(i) is linearly quantized, for example into 16 steps (4 bits). Since this value is required for each small band, 4 bits are required per value. In a video conference it is rare for a plurality of speakers to speak at the same time, and it is also rare for a speaker to move around while speaking, so for the encoding of α′^(i) the difference from the value of the previous frame or of the adjacent small band may be encoded. In either case, since the value can be expected to be close to 0, further information compression is possible by additionally applying Huffman coding to it.
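A small sketch of equation (38) and the 4-bit linear quantization described above follows; the exact step mapping of the quantizer is an assumption (any uniform 16-level mapping of the interval from −1 to 1 would serve).

```python
import numpy as np

def encode_direction(alpha):
    """Map alpha (> 0, eq. (13)) to alpha' in (-1, 1) by eq. (38) and
    linearly quantize it to 16 steps (4 bits)."""
    a_prime = (alpha - 1.0) / (alpha + 1.0)          # eq. (38)
    return int(np.clip(np.round((a_prime + 1.0) / 2.0 * 15), 0, 15))

def decode_direction(q):
    """Inverse mapping back to alpha' and alpha (assumed reconstruction)."""
    a_prime = q / 15.0 * 2.0 - 1.0
    alpha = (1.0 + a_prime) / (1.0 - a_prime)
    return a_prime, alpha

print(encode_direction(1.0))                       # equal sound pressure -> mid code
print(decode_direction(encode_direction(0.8)))
```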
[0082]
The audio signal described above can be transmitted as it is by an existing method such as ITU-T H.323, but the sound image position information must be transmitted separately from the audio signal. As a method for this, if a user-definable bit field is available, it may be transmitted there. If not, then in the case of lossless coding such as G.711 or G.722 it may be embedded in the audio signal, for example by assigning to it lower bits that do not significantly affect the perceived sound. In the case of other, lossy coding, it may for example be embedded, in the manner of a QR code, in an unimportant part of the image data. The sound image position information is transmitted in this way.
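As an illustration of the lower-bit embedding mentioned for losslessly transported audio, the sketch below hides a 4-bit direction code in the least significant bits of consecutive PCM samples. The bit allocation (one bit per sample) and the 16-bit sample format are assumptions.

```python
import numpy as np

def embed_direction(pcm, code, nbits=4):
    """Write the nbits of `code` into the LSBs of the first nbits samples of a
    16-bit PCM block (transport assumed to be lossless, as described above)."""
    out = pcm.copy()
    for b in range(nbits):
        bit = (code >> b) & 1
        out[b] = (out[b] & ~1) | bit
    return out

def extract_direction(pcm, nbits=4):
    code = 0
    for b in range(nbits):
        code |= (int(pcm[b]) & 1) << b
    return code

block = np.array([1000, -2000, 3000, -4000, 500], dtype=np.int16)
stego = embed_direction(block, 0b1011)
print(extract_direction(stego))   # 11
```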
[0083]
Next, the receiver-side audio signal processing unit, which is the portion of the decoding unit 31 of FIG. 1 related to audio signal processing, will be described with reference to FIG. 8. FIG. 8 is a block diagram showing a configuration example of the audio signal processing unit in the decoding unit of the video conference system of FIG. 1.
[0084]
The audio signal processing unit 80 illustrated in FIG. 8 includes a decoding processing unit 81, a preprocessing unit 82, a discrete Fourier transform unit 83, a reproduction signal generation unit 84, an inverse discrete Fourier transform unit 85, and a post-processing unit 86.
[0085]
The decoding processing unit 81 extracts the monaural audio signal and the sound image direction information from the received code words.
The monaural audio signal is output to the preprocessing unit 82, and the sound image direction information is output to the reproduction signal generation unit 84. The preprocessing unit 82 performs the window function operation in the same way as the preprocessing unit 21 on the encoding side. On the encoding side the operation was performed on each of the stereo audio signals, whereas on the decoding side it is performed on the monaural audio signal, and the result is output to the discrete Fourier transform unit 83. The discrete Fourier transform unit 83 performs the discrete Fourier transform in the same manner as on the encoding side, divides the signal into small bands as described above, and outputs the result to the reproduction signal generation unit 84.
[0086]
The reproduction signal generation unit 84 receives the Fourier-transformed signal and the sound image direction information, and generates reproduction signals. In doing so, the reproduction signal generation unit 84 essentially converts the sound image direction indicated by the sound image direction information into the sound image direction obtained when the speaker array is regarded as the line connecting the positions of the two microphones, and converts the monaural audio signal into signals for outputting the sound it represents from one or more speakers corresponding to the converted sound image direction, so that the monaural audio signal is localized in the converted sound image direction. As a result, sound localized in the converted sound image direction can be output from one or more speakers.
[0087]
This audio output will now be described in more detail. Regarding the reproduction method, it is well known that, as shown schematically in FIG. 9, in a stereo (2 ch) reproduction method using two speakers 91L and 91R the sound image direction is heard correctly only by a viewer within the sweet spot 92. With this method it is difficult to make the image and the sound image direction coincide for each of a plurality of conference participants.
[0088]
Therefore, as described with FIG. 1, the apparatus 30 on the decoding side arranges the speaker array 101 linearly in the horizontal direction as shown in FIG. 10, and outputs sound only from the speakers corresponding to the sound image direction. As a result, a sweet spot 102 wider than the sweet spot 92 is obtained, and every participant can localize the sound image near the talker. More preferably, the reproduced sound may be output by a wave field synthesis reproduction method, such as the Wave Field Synthesis (WFS) method, which provides an even wider sweet spot using a speaker array arranged linearly in the horizontal direction.
[0089]
This wave field synthesis reproduction method can be regarded as one implementation of the sound-source-object-oriented reproduction method. The sound-source-object-oriented reproduction method treats every sound as emitted by some sound source object, and each sound source object (hereinafter referred to as a "virtual sound source") contains its own position information and audio signal. Taking music content as an example, each virtual sound source contains the sound of one instrument and the position information at which that instrument is placed. A listener facing the speaker array in the acoustic space provided by a wave field synthesis reproduction method such as the WFS method perceives the sound actually emitted from the speaker array as if it were being emitted from the virtual sound sources behind the speaker array.
[0090]
This wave field synthesis reproduction method requires input signals that represent the virtual sound sources. In general, each virtual sound source must contain one channel of audio signal and the position information of that virtual sound source. Taking the music content mentioned above as an example, this means, for instance, an audio signal recorded for each instrument together with the position information of that instrument; in a video conference using the present invention, it is the position information of each talker.
[0091]
Hereinafter, an example in which wave field synthesis reproduction is performed by such an array of speakers will be described, focusing on the processing of the reproduction signal generation unit 84 with reference to FIGS. 11 to 13. FIG. 11 is a schematic diagram for explaining an example of the positional relationship between a listener, the left and right speakers, and a synthesized sound image; FIG. 12 is a schematic diagram for explaining an example of the positional relationship between the speaker group and the virtual sound sources used in the wave field synthesis reproduction method; and FIG. 13 is a schematic diagram for explaining an example of the positional relationship between the virtual sound sources of FIG. 12, a listener, and a synthesized sound image.
[0092]
Now, as in the positional relationship 110 shown in FIG. 11, let θ0 be the spread angle between the line drawn from the listener 113 to the midpoint of the left and right speakers 111L and 111R and the line drawn from the listener 113 to the center of either speaker 111L or 111R, and let θ be the spread angle made with the line drawn from the listener 113 to the position of the estimated synthesized sound image 112. It is generally known that, when the same audio signal is output from the left and right speakers 111L and 111R with different sound pressure balances, the direction of the synthesized sound image 112 produced by the output sound can be approximated by the following equation using the above-mentioned parameter α representing the sound pressure balance (hereinafter, the sine law of stereophonic sound).
[0093]
[0094]
Therefore, the direction θ^(i) of the synthesized sound image of the correlated signal in the i-th small band is obtained by the following equation. Here, θ0 is a value determined in advance and may be, for example, θ0 = π/6 [rad].
[0095]
Since what is actually transmitted is the value of equation (38), equation (40) can be rewritten as θ^(i) = sin⁻¹(α′^(i) sin θ0) (41).
[0096]
Next, as shown in FIG. 12, a plurality of virtual sound sources of the wave field synthesis reproduction method are assumed and are arranged behind the speaker array 121 (corresponding to the speaker array 101 in FIG. 10).
In this case, the reproduction signal generation unit 84 converts the 2 ch audio signal into audio signals equal in number to the virtual sound sources. For example, if the number of channels after conversion is five, the channels are regarded as the virtual sound sources 122a to 122e of the wave field synthesis reproduction method and arranged as in the positional relationship 120 shown in FIG. 12. The intervals between adjacent virtual sound sources 122a to 122e are equal. In this conversion example, therefore, the 2 ch audio signal is converted into five audio signals.
[0097]
The reproduction signal generation unit 84 assigns the input monaural audio signal, after the discrete Fourier transform, to two adjacent virtual sound sources among the five virtual sound sources 122a to 122e. As a premise, the assignment is made inside the two ends (the virtual sound sources 122a and 122e) of the five virtual sound sources; that is, the five virtual sound sources 122a to 122e are arranged so as to lie within the spread angle formed by the two speakers in 2 ch stereo reproduction. Then, from the estimated direction of the synthesized sound image, the two adjacent virtual sound sources that sandwich the synthesized sound image are determined, the allocation of the sound pressure balance to these two virtual sound sources is adjusted, and reproduction is performed so that the synthesized sound image is produced by these two virtual sound sources.
[0098]
Therefore, as in the positional relationship 130 shown in FIG. 13, let θ′0 be the spread angle formed by the line drawn from the listener 133 to the midpoint of the virtual sound sources 122a and 122e at both ends and the line drawn to the virtual sound source 122e at the end, and let θ′ be the spread angle formed with the line drawn from the listener 133 to the synthesized sound image 131. Furthermore, let φ0 be the spread angle formed by the line drawn from the listener 133 to the midpoint of the two virtual sound sources 122c and 122d sandwiching the synthesized sound image 131 and the line drawn from the listener 133 to the midpoint of the virtual sound sources 122a and 122e at both ends (the line drawn from the listener 133 to the virtual sound source 122c), and let φ be the spread angle formed by the former line and the line drawn from the listener 133 to the synthesized sound image 131. Here, φ0 is a positive real number. A method of assigning to the virtual sound sources using these variables will be described.
[0099]
First, scaling based on the difference in spread angle is performed as follows. θ′ = (θ′0 / θ0) θ (42) In this way, the difference in spread angle due to the arrangement of the virtual sound sources is taken into account in the conversion. The values of θ′0 and θ0 may be adjusted at the time of system installation of the audio data reproduction apparatus, and no problem occurs even if the values of θ′0 and θ0 are not equal. Hereinafter, the description assumes θ0 = π / 6 [rad] and θ′0 = π / 4 [rad].
[0100]
Next, assume that the direction θ <(i)> of the i-th synthesized sound image is estimated by equation (41) to be, for example, θ <(i)> = π / 15 [rad]; then θ′ <(i)> = π / 10 [rad] is obtained from equation (42). Then, when there are five virtual sound sources, as shown in FIG. 13, the synthesized sound image 131 is located between the third virtual sound source 122c and the fourth virtual sound source 122d counting from the left. When there are five virtual sound sources, φ0 ≒ 0.078π [rad] between the third virtual sound source 122c and the fourth virtual sound source 122d follows from θ′0 = π / 4 [rad]. If φ in the i-th small band is denoted φ <(i)>, then φ <(i)> = θ′ <(i)> − φ0 ≒ 0.022π [rad]. In this way, the direction of the synthesized sound image generated by the correlation signal in each small band is represented by the relative angle from the direction of the two virtual sound sources that sandwich it. Then, as described above, the synthesized sound image is generated by the two virtual sound sources 122c and 122d. For that purpose, it is sufficient to adjust the sound pressure balance of the output sound signals from the two virtual sound sources 122c and 122d, and as the adjustment method, the sine law in stereophonic sound used as equation (39) is used again.
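The chain of equation (42), the selection of the bracketing pair of virtual sound sources, and the relative angles φ0 and φ can be sketched as follows. This Python sketch assumes virtual sound sources equally spaced on a straight line behind the speaker array, a listener on the center axis, and the definition of φ0 implied by the numerical example above; none of the names are taken from the description.

    import math

    def bracket_virtual_sources(theta, theta0, theta_p0, num_sources=5):
        # Equation (42): scale the estimated direction to the virtual-source layout.
        theta_p = (theta_p0 / theta0) * theta
        # Equally spaced lateral positions; the end sources subtend +/- theta_p0
        # when the listener distance is normalized to 1.
        half_width = math.tan(theta_p0)
        xs = [(-1 + 2 * j / (num_sources - 1)) * half_width for j in range(num_sources)]
        angles = [math.atan(x) for x in xs]      # direction of each virtual source
        # Find the two adjacent sources sandwiching theta_p (assumed inside the ends).
        for j in range(num_sources - 1):
            if angles[j] <= theta_p <= angles[j + 1]:
                mid = math.atan((xs[j] + xs[j + 1]) / 2)
                phi0 = mid - angles[j]           # angle from the pair midpoint to the lower source
                phi = theta_p - mid              # image direction relative to the pair midpoint
                return j, j + 1, phi0, phi       # zero-based indices: 2 and 3 mean the third and fourth sources
        raise ValueError("synthesized image outside the virtual-source span")

    # Example from the text: theta = pi/15, theta0 = pi/6, theta'0 = pi/4
    # returns indices 2 and 3, phi0 of about 0.078*pi and phi of about 0.022*pi.
    print(bracket_virtual_sources(math.pi / 15, math.pi / 6, math.pi / 4))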
[0101]
Here, of the two virtual sound sources 122c and 122d sandwiching the synthesized sound image generated by the correlation signal in the i-th small band, let g1 be the scaling factor for the third virtual sound source 122c and g2 be the scaling factor for the fourth virtual sound source 122d. Then an audio signal of g1 · est′(S <(i)> (k)) is output from the third virtual sound source 122c, and an audio signal of g2 · est′(S <(i)> (k)) is output from the fourth virtual sound source 122d. Here, g1 and g2 should satisfy the following equation according to the sine law in stereophonic sound.
[0102]
On the other hand, g1 and g2 are normalized using α <(i)> so that the sum of the powers from the third virtual sound source 122c and the fourth virtual sound source 122d is equal to the power of the correlation signal at the time of 2ch sound collection on the encoding side, as in the following equation: g1 <2> + g2 <2> = 1 + [α <(i)>] <2> (44). Here, α <(i)> can be obtained from the received α′ <(i)> by performing the reverse operation of equation (38).
[0103]
By combining these, g1 and g2 are obtained as equation (45) below, and by substituting the above-mentioned φ <(i)> and φ0 into equation (45), g1 and g2 are calculated. Based on the scaling factors calculated in this manner, as described above, an audio signal of g1 · est′(S <(i)> (k)) is assigned to the third virtual sound source 122c, and an audio signal of g2 · est′(S <(i)> (k)) is assigned to the fourth virtual sound source 122d. Then, as described above, the uncorrelated signals are assigned to the virtual sound sources 122a and 122e at both ends. That is, est′(NL <(i)> (k)) is assigned to the first virtual sound source 122a, and est′(NR <(i)> (k)) is assigned to the fifth virtual sound source 122e.
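Since equations (43) and (45) are not reproduced above, a minimal sketch of the gain calculation is given below under the assumption that the sine law for the bracketing pair takes the form sin φ / sin φ0 = (g2 − g1) / (g1 + g2), with φ measured toward the fourth virtual sound source 122d, combined with the power normalization of equation (44); the sign convention and the names are assumptions, not part of the description.

    import math

    def pair_gains(phi, phi0, alpha):
        # Assumed sine law for the bracketing pair:
        #   sin(phi) / sin(phi0) = (g2 - g1) / (g1 + g2)
        # Power normalization, equation (44):
        #   g1**2 + g2**2 = 1 + alpha**2
        r = math.sin(phi) / math.sin(phi0)       # in [-1, 1] while the image stays inside the pair
        power = 1.0 + alpha ** 2
        if abs(1.0 - r) < 1e-12:                 # image exactly at the fourth source
            return 0.0, math.sqrt(power)
        t = (1.0 + r) / (1.0 - r)                # ratio g2 / g1
        g1 = math.sqrt(power / (1.0 + t ** 2))
        g2 = t * g1
        return g1, g2

    # Example with the values from the text: phi ~ 0.022*pi, phi0 ~ 0.078*pi, alpha = 1.
    print(pair_gains(0.022 * math.pi, 0.078 * math.pi, 1.0))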
[0104]
Unlike this example, if the estimated direction of the synthesized sound image is between the first and second virtual sound sources, both g1 · est′(S <(i)> (k)) and est′(NL <(i)> (k)) are assigned to the first virtual sound source 122a. Similarly, if the estimated direction of the synthesized sound image is between the fourth and fifth virtual sound sources, both g2 · est′(S <(i)> (k)) and est′(NR <(i)> (k)) are assigned to the fifth virtual sound source 122e.
[0105]
As described above, the correlation signal and the uncorrelated signals of the left and right channels are allocated for the i-th small band. Such processing is performed for all the small bands. As a result, assuming that the number of virtual sound sources is J, output sound signals Y1 (k), ..., YJ (k) in the frequency domain are obtained for each virtual sound source (output channel).
[0106]
Then, the discrete Fourier inverse transform unit 85 performs a discrete Fourier inverse transform on each of the obtained output channels as in the following equation to obtain an output speech signal y′j (m) in the time domain. Here, DFT <-1> represents the discrete Fourier inverse transform. y′j (m) = DFT <−1> (Yj (k)) (1 ≦ j ≦ J) (46) Here, as described for equation (3), the discrete Fourier transformed signal is a signal that has been multiplied by a window function, so the signal y′j (m) obtained by the inverse transform is also in the state of having been multiplied by the window function. The window function is the function shown in equation (1), and reading is performed while shifting by half the segment length; therefore, as described above, the converted data is obtained by adding each frame to the output buffer while shifting by half the segment length from the head of the segment processed immediately before.
[0107]
The post-processing unit 86 of FIG. 8 performs noise removal processing on the converted data, as in the processing on the encoding side. In this way, the output sound for each speaker is obtained.
[0108]
Here, the speaker array 121 has been described on the premise that it is installed facing the front so as to emit sound toward the front of the display in order to reduce the amount of calculation, but the present invention is not limited thereto.
[0109]
Further, although an example assuming five virtual sound sources has been shown, reproduction sounds may be assigned directly to actual speakers instead of virtual sound sources in the same manner as described above.
In that case, the output sound for one small band is reproduced from only one speaker or from two adjacent speakers.
[0110]
When audio is picked up, encoded, transmitted, and decoded by the above-mentioned video conference system, it becomes possible to transmit to a remote place video and audio with a sense of reality in which the video and the sound image direction are matched. Furthermore, in the video conference apparatus according to the present invention, since noise can be reduced along with the process of extracting the sound image direction information at the time of encoding as described above, clear sound quality can be transmitted.
[0111]
In the present invention, since the uttering side only needs to transmit a monaural audio signal and slight additional information (sound image direction information) together with the video signal, such a configuration is easy to add to a terminal of an existing video conference system. Further, even if such a configuration is not added to the terminal of the existing video conference system, the video conference itself can still be carried out, for example, by having the video conference apparatus on the receiving side fix the sound image direction information to a predetermined direction such as the center. Furthermore, it is also possible to cope with the case where a stereo audio signal is received. For example, when a stereo audio signal is received, the sound image direction may be obtained using the audio signal processing unit 20 on the speech side of FIG. 2, and the sound may be output from the speaker array using the audio signal processing unit 80 on the receiving side of FIG. 8. As described above, the video conference apparatus according to the present invention exchanges monaural audio signals, and therefore has high compatibility with the devices of existing video conference systems.
[0112]
Further, in the video conference apparatus according to the present invention, such effects can be obtained merely by transmitting sound image direction information and allocating the monaural audio signal to the speaker array based on that information, so there is no need to greatly increase the cost.
[0113]
Further, in consideration of a two-way communication video conference system, it is preferable to add to the video conference apparatus according to the present invention an echo canceling system which cancels the voice from the speakers picked up by the microphones.
In the present invention, since monaural sound is used, the one-input one-output echo canceling system widely used in conventional telephones and video conference systems can be employed, and a complex multiple-input multiple-output echo canceling system is not needed.
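For reference, a one-input one-output echo canceller of the kind mentioned here is commonly realized with an adaptive FIR filter; the following Python sketch uses an NLMS update, and the filter length, step size, and names are assumptions rather than part of the description.

    import numpy as np

    def nlms_echo_cancel(far_end, mic, taps=256, mu=0.5, eps=1e-6):
        # far_end and mic are equal-length 1-D arrays: the loudspeaker signal and
        # the microphone signal.  An adaptive FIR filter models the
        # loudspeaker-to-microphone path and its output is subtracted from the
        # microphone signal.
        w = np.zeros(taps)
        x_buf = np.zeros(taps)
        out = np.zeros(len(mic))
        for n in range(len(mic)):
            x_buf = np.roll(x_buf, 1)
            x_buf[0] = far_end[n]
            echo_est = w @ x_buf
            e = mic[n] - echo_est                          # residual after echo removal
            w += mu * e * x_buf / (x_buf @ x_buf + eps)    # NLMS update
            out[n] = e
        return out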
[0114]
Next, the arrangement of the speakers and microphones will be described with reference to FIGS. 14 to 18. FIGS. 14 to 18 are each an external view showing an example of the video conference apparatus according to the present invention, that is, a video conference apparatus having both the functions of the device 10 on the encoding side and the device 30 on the decoding side of the video conference system of FIG. 1. In each of FIGS. 14 to 18, the number of speakers constituting the speaker array is not limited to that illustrated and may be any plural number.
[0115]
As in the video conference apparatus 140 shown in FIG. 14, two microphones 142L and 142R may be disposed apart at both ends of the display 141, and a speaker array 143 (an array of eight speakers in this example) may be disposed below the display 141. When the two microphones 142L and 142R are installed apart as shown in FIG. 14, it is desirable that the microphones be omnidirectional so that the spatial range in which the utterer may be present is covered widely and the position can be estimated. However, for a video conference apparatus intended for the case where the utterer is often positioned on either the left or right side of the display, it is preferable to install directional microphones at the positions of the microphones 142L and 142R shown in FIG. 14.
[0116]
Further, as in the video conference apparatus 150 shown in FIG. 15, the two microphones 152L and 152R may be disposed at the upper part of the display 151 so as to increase the distance from the speaker array 153 provided at the lower part of the display 151, thereby reducing the output sound that re-enters the microphones 152L and 152R. Alternatively, the present invention may be applied to a large display in which a plurality of (in this example, four) displays 161a to 161d are combined, as in the video conference apparatus 160 shown in FIG. 16. That is, two microphones 162L and 162R may be disposed at both ends of the large display, and a speaker array 163 (an array of 15 speakers in this example) may be disposed below the large display.
[0117]
As for the arrangement of the speaker array, the speaker array 173 may be arranged at the upper part of the display 171, with the two microphones 172L and 172R arranged apart at both ends of the display, as in the video conference apparatus 170 shown in FIG. 17.
[0118]
Further, as in the video conference apparatus 180 shown in FIG. 18, the speaker array 183 may be disposed at the lower part of the display 181, and the two directional microphones 182L and 182R may be installed on top of it so that they face outward to the left and right.
The microphones 182L and 182R may be disposed slightly below the speaker array 183, or may be disposed above the display 181. Thus, the two microphones may be directional microphones disposed adjacent to the top or bottom of the display.
[0119]
Further, the video conference apparatus according to the present invention uses a speaker array arranged in the horizontal direction, and the coincidence between the image and the sound image in the vertical direction is not considered. The reason is that human sound image perception is generally less accurate in the vertical direction than in the horizontal direction; therefore, if the sound image and the image are made to coincide in the horizontal direction, the distance between the image and the sound image becomes relatively small, and in addition, with the aid of the so-called ventriloquism effect, the audio synchronized with the video is perceived as coming from the position of the video, so that a system in which the audio appears to be heard from the utterer shown in the video can be provided.
[0120]
The wavefront synthesis reproduction method applicable to the present invention may be any method as long as it uses a speaker array (a plurality of speakers) as described above and outputs sound from those speakers so that a sound image is formed at a virtual sound source. In addition to the WFS method described above, there are various methods, such as a method using the precedence effect (Haas effect), a phenomenon related to human sound image perception. The precedence effect refers to the effect that, when the same sound is reproduced from a plurality of sound sources and there is a small time difference between the sounds reaching the listener from the respective sources, the sound image is localized in the direction of the source whose sound arrives first. By using this effect, it is possible to make a listener perceive a sound image at a virtual sound source position. However, it is difficult to make the sound image clearly perceptible by this effect alone. Humans also have the property of perceiving the sound image in the direction from which the sound pressure is felt to be highest. Therefore, in the video conference apparatus, the precedence effect and the perception of the maximum sound pressure direction can be combined so that a small number of speakers make the listener perceive a sound image in the direction of the virtual sound source.
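One simple way to combine the two cues is to drive each speaker with a delay and an attenuation that grow with its distance from the virtual sound source, so that the speaker nearest the virtual source radiates both earliest and loudest. The following Python sketch only illustrates this idea; the geometry, sampling rate, and names are assumptions and are not taken from the description.

    import numpy as np

    def drive_signals(mono, speaker_xs, virtual_src, fs=48000, c=343.0):
        # mono: monaural input samples; speaker_xs: x coordinates [m] of the
        # speakers on a line at y = 0; virtual_src: (x, y) of the virtual source
        # behind the array (y < 0).  Each speaker gets the input delayed by the
        # propagation time from the virtual source and attenuated by 1/distance,
        # so the nearest speaker is both the earliest (precedence effect) and the
        # loudest (maximum sound pressure direction).
        mono = np.asarray(mono, dtype=float)
        vx, vy = virtual_src
        dists = np.hypot(np.asarray(speaker_xs) - vx, vy)
        delays = np.round((dists - dists.min()) / c * fs).astype(int)
        gains = dists.min() / dists
        out = np.zeros((len(speaker_xs), len(mono) + delays.max()))
        for ch, (d, g) in enumerate(zip(delays, gains)):
            out[ch, d:d + len(mono)] = g * mono
        return out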
[0121]
Further, the components of the video conference apparatus, for example the constituent elements of the audio signal processing unit 20 illustrated in FIG. 2 and of the audio signal processing unit 80 illustrated in FIG. 8, or the constituent elements 13, 14, 31 and 32 illustrated in FIG. 1, can be realized by hardware such as a microprocessor (or DSP: Digital Signal Processor), memory, bus, interface, and peripheral devices, together with software executable on that hardware. Part or all of the hardware can be mounted as an integrated circuit / IC (Integrated Circuit) chip set, in which case the software may be stored in the memory. Alternatively, all the components of the present invention may be configured by hardware, and in that case as well, part or all of the hardware can be mounted as an integrated circuit / IC chip set.
[0122]
In addition, the object of the present invention is also achieved by supplying a recording medium recording program code of software for realizing the functions of the various configuration examples described above to a device such as a general-purpose computer serving as a video conference apparatus, and having a microprocessor or DSP in the device execute the program code. In this case, the program code itself of the software implements the functions of the various configuration examples described above, and the present invention can be configured by the program code itself, or by a recording medium (an external recording medium or an internal storage device) recording the program code, from which the controlling side reads and executes the code.
Examples of the external recording medium include various media such as optical discs such as CD-ROMs and DVD-ROMs, and nonvolatile semiconductor memories such as memory cards. Examples of the internal storage device include various devices such as hard disks and semiconductor memories. The program code can also be downloaded from the Internet and executed, or received from broadcast waves and executed.
[0123]
DESCRIPTION OF SYMBOLS 3: buffer, 10: device on the encoding side, 11L, 11R: microphone, 12L, 12R: amplifier, 13: A/D converter, 14: encoding unit, 15: camera, 20: audio signal processing unit on the utterance side, 21: pre-processing unit, 22: discrete Fourier transform unit, 23: signal separation and extraction unit, 24L, 24S, 24R: gain adjustment unit, 25: combining unit, 26: discrete Fourier inverse transform unit, 27: post-processing unit, 28: compression encoding processing unit, 30: device on the decoding side, 31: decoding unit, 32: D/A converter, 33: amplifier, 34: speaker, 35: display, 80: audio signal processing unit on the receiving side, 81: decoding processing unit, 82: pre-processing unit, 83: discrete Fourier transform unit, 84: reproduction signal generation unit, 85: discrete Fourier inverse transform unit, 86: post-processing unit