Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2015502716
A microphone positioning device is provided. The device comprises a power distribution determiner (10) and a spatial information estimator (20). The power distribution determiner (10) determines a spatial power density, indicating power values for a plurality of places in the environment, based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment. The spatial information estimator (20) estimates acoustic spatial information based on the spatial power density. [Selected figure] Figure 1
Apparatus and method for microphone positioning based on spatial power density
[0001]
The present invention relates to audio signal processing, and more particularly to an automatic
microphone positioning apparatus and method.
[0002]
Audio signal processing is becoming increasingly important.
In particular, spatial sound recording is employed in many applications. Spatial sound recording aims at capturing a sound field with the aid of a plurality of microphones such that, on the reproduction side, the listener perceives the sound image as if he were at the recording location.
[0003]
The standard approach to spatial sound recording usually uses spaced omnidirectional microphones (e.g., AB stereophony), coincident directional microphones (e.g., intensity stereophony), or more sophisticated microphones, such as the B-format microphone used in Ambisonics (see, e.g., Non-Patent Document 1).
[0004]
Spatial microphones, such as directional microphones, microphone arrays, etc. can record spatial
sound.
The term "spatial microphone" refers to any device (e.g., a combination of directional
microphones, a microphone array, etc.) for the acquisition of spatial sound that is capable of
acquiring the direction of arrival of the sound.
[0005]
For sound reproduction, existing non-parametric approaches derive the desired audio
reproduction signal directly from the recorded microphone signal. The major disadvantage of
these approaches is that the spatial image being recorded is always relative to the spatial
microphone used.
[0006]
In many applications, it is not possible or practical to place the spatial microphone at a desired
position, for example, near one or more sound sources. In this case, it is advantageous to arrange
the spatial microphones further from the active sound source and still be able to capture the
sound scene as desired.
[0007]
Some applications employ two or more real spatial microphones. It should be noted that the term "real spatial microphone" refers to a microphone of the desired type or a combination of microphones (e.g., directional microphones, pairs of directional microphones as used in common stereo microphones, as well as microphone arrays) that physically exists.
[0008]
For each real spatial microphone, the direction of arrival (DOA) can be estimated in the time-frequency domain. Using the information collected by the real spatial microphones, together with knowledge of their relative positions, it becomes possible to compute the output signal of a spatial microphone virtually placed (at will) at an arbitrary position in the environment. Hereinafter, this spatial microphone is referred to as a "virtual spatial microphone".
[0009]
In such applications, the position and orientation of the one or more virtual microphones need to be input manually. However, it would be advantageous if the optimal position and/or orientation of the one or more virtual microphones could be determined automatically.
[0010]
It would be advantageous if apparatus and methods were available to determine where to place
the virtual microphones, where to place the physical microphones, or to determine the optimal
viewing position. Furthermore, it would be advantageous if it could be decided how to position
the microphone in the optimal orientation. The terms "microphone positioning" and "positioning
information" relate to how to determine the proper position of the microphone or viewer and
how to determine the proper orientation of the microphone or viewer.
[0011]
US 61/287,596: An Apparatus and a Method for Converting a First Parametric Spatial Audio Signal into a Second Parametric Spatial Audio Signal
[0012]
1. Michael A. Gerzon, "Ambisonics in multichannel broadcasting and video," J. Audio Eng. Soc., 33(11):859-871, 1985
2. V. Pulkki, "Directional audio coding in spatial sound reproduction and stereo upmixing," in Proceedings of the AES 28th International Conference, pp. 251-258, Piteå, Sweden, June 30 - July 2, 2006
3. V. Pulkki, "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc., vol. 55, no. 6, pp. 503-516, June 2007
4. C. Faller, "Microphone front-ends for spatial audio coders," in Proceedings of the AES 125th International Convention, San Francisco, Oct. 2008
5. M. Kallinger, H. Ochsenfeld, G. Del Galdo, F. Kuech, D. Mahne, R. Schultz-Amling, and O. Thiergart, "A spatial filtering approach for directional audio coding," in Audio Engineering Society Convention 126, Munich, Germany, May 2009
6. R. Schultz-Amling, F. Kuech, O. Thiergart, and M. Kallinger, "Acoustical zooming based on a parametric sound field representation," in Audio Engineering Society Convention 128, London, UK, May 2010
7. J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, "Interactive teleconferencing combining spatial audio object coding and DirAC technology," in Audio Engineering Society Convention 128, London, UK, May 2010
8. E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999
9. A. Kuntz and R. Rabenstein, "Limitations in the extrapolation of wave fields from circular measurements," in 15th European Signal Processing Conference (EUSIPCO 2007), 2007
10. A. Walther and C. Faller, "Linear simulation of spaced microphone arrays using b-format recordings," in Audio Engineering Society Convention 128, London, UK, May 2010
11. S. Rickard and O. Yilmaz, "On the approximate W-disjoint orthogonality of speech," in Acoustics, Speech and Signal Processing, 2002 (ICASSP 2002), IEEE International Conference on, April 2002, vol. 1
12. R. Roy, A. Paulraj, and T. Kailath, "Direction-of-arrival estimation by subspace rotation methods - ESPRIT," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Stanford, CA, USA, April 1986
13. R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986
14. J. Michael Steele, "Optimal triangulation of random samples in the plane," The Annals of Probability, vol. 10, no. 3 (Aug. 1982), pp. 548-553
15. F. J. Fahy, Sound Intensity, Essex: Elsevier Science Publishers Ltd., 1989
16. R. Schultz-Amling, F. Kuech, M. Kallinger, G. Del Galdo, T. Ahonen, and V. Pulkki, "Planar microphone array processing for the analysis and reproduction of spatial audio using directional audio coding," in Audio Engineering Society Convention 124, Amsterdam, The Netherlands, May 2008
17. M. Kallinger, F. Kuech, R. Schultz-Amling, G. Del Galdo, T. Ahonen, and V. Pulkki, "Enhanced direction estimation using microphone arrays for directional audio coding," in Hands-Free Speech Communication and Microphone Arrays, 2008 (HSCMA 2008), May 2008, pp. 45-48
18. R. K. Furness, "Ambisonics - an overview," in AES 8th International Conference, April 1990, pp. 181-189
19. Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets, "Generating virtual microphone signals using geometrical information gathered by distributed arrays," in Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), Edinburgh, United Kingdom, May 2011
20. Ville Pulkki, "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc., 55(6):503-516, June 2007
[0013]
The object of the present invention is to provide an improved concept for microphone
positioning.
[0014]
The object of the invention is solved by an apparatus according to claim 1, a method according to
claim 17 and a computer program according to claim 18.
[0015]
An apparatus is provided for determining an optimal microphone position or an optimal viewing
position.
The apparatus comprises a spatial power distribution determiner and a spatial information
estimator.
The spatial power distribution determiner is configured to determine a spatial power density, indicating power values for a plurality of places in an environment, based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment. The spatial information estimator is configured to estimate acoustic spatial information based on the spatial power density.
[0016]
Hereinafter, the term "virtual microphone" usually refers to any type of microphone. In
03-05-2019
6
particular, the term "virtual microphone" relates to both virtual space or non-spatial
microphones, physically existing non-spatial or non-spatial microphones for which positioning
information is to be determined.
[0017]
The spatial information estimator is adapted to determine an optimal position of the virtual
microphone in the environment or an optimal orientation of the virtual microphone based on the
spatial power density determined by the spatial power distribution determiner. The spatial power
density is determined by the spatial power distribution determiner based on the power value of
the sound source and the corresponding position information.
[0018]
Aspects thus provide an automatic way of determining the optimal position and/or orientation of one or more microphones, e.g., one or more virtual microphones, for representing the sound scene.
[0019]
In some aspects, the spatial power distribution determiner may, for example, be adapted to utilize optional information provided by a significance metric that represents a measure of confidence in the estimates of the ESS positions.
[0020]
For example, in some aspects, the diffuseness of sound Psi can be used as a significance metric. The term (1 - Psi) can then simply be multiplied by the power value of the source when computing the spatial power distribution, so that diffuse sound contributes less than direct sound to the determination of the spatial power distribution.
[0021]
An important effect of the proposed concepts is that they can be applied independently of the room conditions and do not require any a priori information on the number or position of speakers and/or physical sound sources. This makes the system autonomous and able to adapt to any kind of scenario using only sound analysis. In prior-art approaches, a priori information must be available in order to determine the optimal position and/or orientation of one or more microphones. This either constrains the application or requires estimations to be carried out, which limits the accuracy. By adopting the above embodiments, this becomes unnecessary. The position of the virtual microphone (or virtual microphones) is computed by performing a semi-blind scene analysis and can then be changed according to the requirements of the target application.
[0022]
Unlike other methods for estimating the optimal position and/or orientation of a virtual microphone, the proposed method does not require any information on a possible geometric scene. For example, neither a priori information on the number of active sound sources (e.g., the number of participants in a meeting) nor any information on the relative positions of the active sound sources (e.g., the arrangement of the participants in the conference room) is required. Information on the sound is derived only from the properties of the active sound sources, referred to as "effective sound sources" (ESS), which represent the sound scene. The ESSs model the spatial sound scene in that one or more ESSs are active at a certain time instant or in a certain time-frequency bin. Hereinafter, the term "physical sound source" is used to denote a real sound source of the sound scene, for example, a speaker, while the term "effective sound source" (ESS) (also called "sound source") is used to denote a sound event that is active in a single time instant or time-frequency bin. Each ESS is characterized by a position and by a power. This information makes it possible to construct a spatial power distribution, e.g., a spatial power density, and to determine the optimal position or orientation of the virtual microphone.
[0023]
The parameters of the ESS are obtained, for example, by adopting the concept described below
for an apparatus for generating an audio output signal of a virtual microphone at a configurable
virtual position. In the following, position estimation of a sound event in an apparatus for
generating an audio output signal of a virtual microphone is described with particular reference
to FIGS. 15-17. The concepts described there can be employed to determine the location of the
effective sound source. In the following, propagation compensation in a device for generating an
audio output signal of a virtual microphone is described with particular reference to FIGS. 17-20.
The concepts described there can be employed to determine the power of the effective sound
source.
[0024]
According to one aspect, the spatial information estimator can comprise a sound scene center
estimator for estimating the position of the center of the sound scene in the environment. The
spatial information estimator may further comprise a microphone position calculator for
calculating the position of the microphone as acoustic spatial information based on the position
of the center of the sound scene.
[0025]
According to another aspect, the microphone position calculator can be adapted to calculate the position of the microphone, wherein the microphone is a virtual microphone.
[0026]
Further, according to another aspect, the sound scene center estimator is configured to calculate
a centroid of spatial power density to estimate a center of the sound scene.
[0027]
In a further aspect, the sound scene center estimator may be configured to determine a power delay profile based on the spatial power density and to determine a root mean square (RMS) delay based on the power delay profile for each of a plurality of locations in the environment.
The sound scene center estimator may then be configured to determine, as the center of the sound scene, the location of the plurality of locations having the smallest RMS delay.
[0028]
In another aspect, the sound scene center estimator may be adapted to perform a circular integration to estimate the center of the sound scene. When the environment is a two-dimensional environment, Γ(x, y) denotes the spatial power density and C(r,o)(x, y) denotes a circle; the sound scene center estimator may, for example, perform the circular integration by convolving the spatial power density with the circle, applying the equation g(x, y) = Γ(x, y) * C(r,o)(x, y), and may be configured to determine a circular integration value for each of a plurality of locations in the environment.
[0029]
Alternatively, if the environment is a three-dimensional environment, Γ(x, y, z) denotes the spatial power density and C(r,o)(x, y, z) denotes a sphere, and the sound scene center estimator performs the circular integration by convolving the spatial power density with the sphere, applying the equation g(x, y, z) = Γ(x, y, z) * C(r,o)(x, y, z), and is configured to determine a circular integration value for each of a plurality of locations in the environment.
[0030]
Further, according to one aspect, the sound scene center estimator is configured to determine the
maximum value of the circular integral value of each of the plurality of locations in the
environment to estimate the center of the sound scene.
[0031]
In a further aspect, the microphone position calculator can be adapted to determine the line of
maximum width of the plurality of lines passing through the center of the sound scene in the
environment.
Each of the plurality of lines passing through the center of the sound scene can have an energy
width, and the line of greatest width is defined as the line among the plurality of lines passing
through the center of the sound scene having the greatest energy width .
[0032]
According to one aspect, the energy width of a considered line of the plurality of lines may indicate the maximum length of a line segment on the considered line, such that the first point delimiting the line segment and the other, second point delimiting the line segment both have power values, indicated by the spatial power density, that are greater than or equal to a predetermined power value.
The microphone position calculator is configured to determine the position of the microphone such that a second line, passing through the center of the sound scene and the position of the microphone, is orthogonal to the line of maximum width.
[0033]
According to one aspect, the microphone position calculator can be configured to apply a singular value decomposition to a matrix having a plurality of columns.
The columns of the matrix may indicate positions of places in the environment relative to the center of the sound scene. Furthermore, the columns of the matrix may indicate only the positions of places having power values, indicated by the spatial power density, that are greater than a predetermined threshold.
[0034]
According to another aspect, the spatial information estimator can comprise an orientation determiner for determining the orientation of the microphone based on the spatial power density. The orientation determiner can be adapted to determine the orientation of the microphone such that the microphone is directed toward the center of the sound scene. With r_max defining the maximum distance considered from the microphone, the orientation determiner is configured to determine an integration value f(φ) for each of a plurality of orientations φ by applying an equation, and to determine the orientation of the microphone based on the determined integration values f(φ).
[0035]
According to another aspect, the spatial power distribution determiner is configured to determine the spatial power density for a plurality of locations in the environment for a time-frequency bin (k, n) by applying a first equation if the environment is a two-dimensional environment, or by applying a second equation if the environment is a three-dimensional environment.
[0036]
Here, k is a frequency index and n is a time index, x, y and z are the coordinates of the plurality of locations, power_i(k, n) is the power value of the i-th effective sound source for the time-frequency bin (k, n), x_ESSi, y_ESSi and z_ESSi are the coordinates of the i-th effective sound source, γ_i is a scalar value that can express an index of how reliable the position estimate of each effective sound source is, and g is a function depending on x, y, z, x_ESSi, y_ESSi, z_ESSi, k, n and γ_i.
[0037]
Embodiments of the present invention will be described with reference to the accompanying
drawings.
[0038]
FIG. 1 shows a microphone positioning device according to one embodiment.
FIG. 2 shows a microphone positioning device according to another embodiment.
FIG. 3 shows the inputs and outputs of a microphone positioning device according to one embodiment.
FIGS. 4A to 4C illustrate multiple application scenarios for a microphone positioning device.
FIG. 5 shows a spatial power distribution determiner 21 according to one embodiment.
FIG. 6A is a graph showing a delta function for constructing the function g.
FIG. 6B is a graph showing a distribution function for constructing the function g.
FIG. 7 illustrates a spatial information estimator according to one embodiment.
FIG. 8 shows a spatial information estimator according to a further embodiment.
FIG. 9 shows the microphone position/orientation calculator 44 in more detail according to another embodiment.
FIGS. 10A to 10C illustrate optimization based on the projected energy width according to one embodiment.
FIG. 11 shows a spatial information estimator further comprising an orientation determiner according to another embodiment.
FIG. 12 shows an apparatus for generating an audio output signal according to one embodiment.
FIG. 13 illustrates the inputs and outputs of an apparatus and a method for generating an audio output signal according to one embodiment.
FIG. 14 shows the basic structure of an apparatus for generating an audio output signal according to an embodiment, comprising a sound event position estimator and an information computation module.
FIG. 15 illustrates an exemplary scenario in which the real spatial microphones are represented as uniform linear arrays of three microphones each.
FIG. 16 shows two spatial microphones in 3D for estimating the direction of arrival in 3D space.
FIG. 17 illustrates the geometry where the isotropic point-like source of the current time-frequency bin (k, n) is located at a position p_IPLS(k, n).
FIG. 18 illustrates an information computation module according to one embodiment.
FIG. 19 illustrates an information computation module according to another embodiment.
FIG. 20 shows the positions of two real spatial microphones, a localized sound event and a virtual spatial microphone.
FIG. 21 illustrates how to obtain the direction of arrival relative to a virtual microphone according to one embodiment.
FIG. 22 illustrates a possible way to derive the DOA of the sound from the point of view of the virtual microphone according to one embodiment.
FIG. 23 illustrates an information computation block comprising a diffuseness computation unit according to an embodiment.
FIG. 24 illustrates a diffuseness computation unit according to one embodiment.
FIG. 25 shows a scenario where sound event position estimation is not possible.
FIG. 26 shows the positions of two real spatial microphones, localized sound events and a virtual microphone.
FIGS. 27A to 27C illustrate scenarios in which two microphone arrays receive direct sound, sound reflected by a wall, and diffuse sound.
[0039]
FIG. 1 shows a microphone positioning device according to one embodiment. The apparatus comprises a spatial power distribution determiner 10 and a spatial information estimator 20. The spatial power distribution determiner 10 is configured to determine a spatial power density spd, indicating power values for a plurality of places in an environment, based on sound source information ssi indicating at least one power value and at least one position value of one or more effective sound sources (ESS) located in the environment. The spatial information estimator 20 is configured to estimate acoustic spatial information aspi based on the spatial power density.
[0040]
FIG. 2 shows a microphone positioning device according to another embodiment. The device comprises a spatial power distribution determiner 21 for determining a spatial power density (SPD), indicating power values for a plurality of locations in the environment, based on effective sound source information indicating one or more power values and one or more position values of one or more effective sound sources located in the environment. The apparatus further comprises a spatial information estimator 22 for estimating the position and/or orientation of a virtual microphone (VM) based on the spatial power density.
[0041]
FIG. 3 shows the inputs and outputs of a microphone positioning device according to one embodiment. The inputs 91, 92, ..., 9N to the device consist, for example, of the power, e.g., the absolute value of the squared sound pressure, and the position, e.g., 2D or 3D Cartesian coordinates, of the effective sound sources (ESS) that represent the sound scene (sound field).
[0042]
The effective sound sources may, for example, be equivalent to the isotropic point-like sources (IPLS) described below for the device for generating an audio output signal of a virtual microphone at a configurable virtual position.
[0043]
At the output, the position and orientation of one or more virtual microphones are returned.
In the following, the term "physical sound source" is used to denote a real sound source of the sound scene, for example, a speaker, while the term "effective sound source" (ESS) (also called "sound source") is used to denote a sound event that is active in a single time instant or time-frequency bin; the term is also used for the IPLS described below with respect to the device for generating the audio output signal of a virtual microphone at a configurable virtual position.
[0044]
Furthermore, the term "sound source" is intended to include both physical sound sources and
useful sound sources.
[0045]
The inputs 91, 92, ..., 9N of the device according to the embodiment of FIG. 2 comprise information on the positions and corresponding powers of N effective sound sources localized in a time instant or time-frequency bin, as described below for the device for generating an audio output signal of a virtual microphone at a configurable virtual position, and as also described in Non-Patent Document 19.
[0046]
For example, this information may consist of the output 106 in FIG. 14 of the information computation module of the device, discussed below, for generating the audio output signal of a virtual microphone at a configurable virtual position, for 1, 2, ..., N different frequency bins when a short-time Fourier transform (STFT) is applied.
[0047]
With respect to the microphone positioning device, different operating modes can be active during a predetermined time interval, each mode corresponding to a different scenario for positioning and orienting one or more virtual microphones.
The microphone positioning device can be employed in multiple application scenarios.
[0048]
In the first application scenario, N omnidirectional virtual microphones are placed inside the
sound scene (FIG. 4A).
Thus, in this application scenario, a large number of virtual microphones cover the entire sound
scene.
[0049]
In a second application scenario, a single virtual microphone is positioned at the acoustic center
of the sound scene.
For example, an omnidirectional virtual microphone, a cardioid virtual microphone, or a virtual
spatial microphone (such as a B-formatted microphone) is positioned such that all participants
are properly captured (FIG. 4B).
[0050]
In the third application scenario, one spatial microphone is placed "outside" of the sound scene.
For example, as shown in FIG. 4C, virtual stereo microphones are arranged so as to obtain a wide
spatial image.
[0051]
In the fourth application scenario, the virtual microphone is located at a fixed (predetermined) position, while the optimal orientation of the virtual microphone is estimated. For example, the position of the virtual microphone may be predefined and only the orientation may be computed automatically.
[0052]
Note that all of the above applications can include temporal adaptability. For example, the
position / orientation of the virtual spot microphone follows one speaker as the speaker moves in
the room.
[0053]
In FIGS. 2 and 3, optional information is given, for example, by a significance metric 13 which represents a measure of confidence in the estimates of the ESS positions. Such a metric may, for example, be derived as described below for the device for generating an audio output signal of a virtual microphone at a configurable virtual position (when two or more microphone arrays are used, as described), or from the diffuseness parameters computed as in [20].
[0054]
The metric can be expressed jointly for all of the inputs 91, ..., 9N (e.g., using a constant metric value for all inputs) or can be defined separately for each input 91, ..., 9N. The outputs 15 and 16 of the device of FIG. 2 may comprise the position and/or orientation of one or more virtual microphones. Depending on the application, outputs (positions and orientations) for multiple virtual microphones can be generated, each corresponding to a particular virtual microphone.
[0055]
FIG. 5 shows a spatial power distribution determiner 21 according to one embodiment. The spatial power distribution determiner comprises a spatial power distribution main processing unit 31 and a spatial power distribution post-processing unit 32. The spatial power distribution determiner 21 is configured to determine (or rather compute) a modified spatial power density (SPD), hereinafter denoted Γ(x, y, z, k, n), which expresses the power localized at a given point (x, y, z) in space for each time-frequency bin (k, n). The SPD is generated by integrating the power values 91, ..., 9N at the positions of the effective sound sources, which are the inputs to the spatial power distribution determiner 21.
[0056]
The computation of the SPD for a time-frequency bin (k, n) is performed according to the following equation, where (x, y, z) are the coordinates of the system and x_ESSi, y_ESSi and z_ESSi are the coordinates of the effective sound source i. The significance metric 13, γ_i, indicates an index of how reliable the position estimate of each effective sound source is; by default, the significance metric may be equal to 1. Here, power_i and the coordinates x_ESSi, y_ESSi and z_ESSi correspond to the input 9i (see FIGS. 2 and 3). Furthermore, for convenience of notation, the (k, n) dependence will not be written out hereinafter; however, the equations that follow still refer to the particular time-frequency bin (k, n) under consideration.
[0057]
The SPD generated by the spatial power distribution main processing unit 31 (see FIG. 5) can be further processed by the spatial power distribution post-processing unit 32 (post-processing and temporal integration of the SPD), for example, integrated over time by employing an autoregressive filter. Any type of post-processing filter may be applied to the SPD in order to increase the robustness against outliers of the sound scene (i.e., those resulting from incorrect position estimates). Such a post-processing filter may be, for example, a low-pass filter or a morphological (erosion, dilation) filter.
[0058]
In calculating the position and/or orientation of one or more virtual microphones, optional SPD-dependent parameters can be employed. Such a parameter may, for example, refer to forbidden and/or preferred areas of the room in which the virtual microphone (VM) may be placed, or may select a particular SPD range that fulfills certain predefined rules.
[0059]
As can be seen from equation (1), g is a function of the significance metric γ (or γ_i), which is equal to 1 by default. Otherwise, γ can be used to take the different contributions into account. For example, if σ^2 is the variance of the position estimate, γ can be set to γ = 1/σ^2.
[0060]
Alternatively, the average diffuseness value Ψ computed at the microphone arrays can be used, so that γ = 1 - Ψ.
[0061]
This allows γ to be chosen to decrease for less reliable estimates and to increase for more
reliable ones.
[0062]
There are numerous possibilities for constructing the function g.
Two examples that are particularly useful in practice are:
[0063]
In the first function, δ(x), δ(y) and δ(z) denote delta functions (see FIG. 6A, which shows a delta function).
In the second function, s = [x y z]^T, μ = [μ_x μ_y μ_z]^T is the mean vector, and Σ_γ is the covariance matrix of the Gaussian distribution function g (see FIG. 6B, which shows the distribution function). The covariance matrix is computed using a formula which depends on the choice of γ, for example, in the 1D case of the scenario with γ = 1/σ^2.
[0064]
As can be seen from equation (3), the function g is described by a distribution function around the effective sound source position given by the inputs 91, ..., 9N. Here, for example, the significance metric is the inverse of the variance of the Gaussian distribution. If the estimate of the source position is highly reliable, the corresponding distribution will be rather narrow, while a less reliable estimate corresponds to a high variance and hence to a broad distribution; see, for example, FIG. 6B, which shows a 1D example.
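By way of illustration only (this sketch is not part of the patent; the grid layout and all function and variable names are assumptions), a 2D spatial power density can be accumulated on a grid as a sum of power-weighted Gaussian blobs, one per effective sound source, using γ_i = 1/σ_i^2 as the significance metric:

```python
import numpy as np

def spatial_power_density(grid_x, grid_y, ess_positions, ess_powers, ess_variances):
    """Accumulate a 2D SPD on a grid: one Gaussian blob per effective sound source.

    ess_positions: (N, 2) estimated ESS positions for the current time-frequency bin
    ess_powers:    (N,)   corresponding power values
    ess_variances: (N,)   variances sigma^2 of the position estimates
                          (gamma_i = 1 / sigma_i^2 as suggested in the text)
    """
    X, Y = np.meshgrid(grid_x, grid_y)          # candidate locations (x, y)
    spd = np.zeros_like(X, dtype=float)
    for (x_ess, y_ess), power, var in zip(ess_positions, ess_powers, ess_variances):
        # Gaussian g centered at the ESS position; its width grows with the
        # position-estimation variance, so unreliable estimates spread their
        # power over a broader area (cf. FIG. 6B).
        g = np.exp(-((X - x_ess) ** 2 + (Y - y_ess) ** 2) / (2.0 * var))
        g /= 2.0 * np.pi * var                  # normalize the 2D Gaussian
        spd += power * g
    return spd
```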
[0065]
FIG. 7 shows a spatial information estimator 22 according to one embodiment. The spatial
information estimator comprises a sound scene center estimator 41 for estimating the position of
the center of the sound scene in the environment. Furthermore, the spatial information estimator
comprises a microphone position calculator 42 for calculating the position of the microphone as
acoustic spatial information based on the position of the center of the sound scene.
[0066]
FIG. 8 shows a spatial information estimator 22 according to a further embodiment. The spatial
information estimator comprises a virtual microphone position calculator 44 arranged to
calculate the position of the virtual microphone and to determine the orientation of the virtual
microphone. Thus, virtual microphone position calculator 44 is also referred to as microphone
position / orientation calculator 44.
[0067]
The spatial information estimator 22 of FIG. 8 uses the previously generated SPD 23 as an input.
It returns as output the position 15 and the orientation 16 of one or more virtual microphones, depending on the intended application. The first processing block, the sound scene
center estimator 41, provides an estimate of the center of the sound scene. The output 43 of the
block 41, eg the position of the sound scene center, is then supplied as an input to a second
processing block, the virtual microphone position / orientation calculator 44. The virtual
microphone position / orientation calculator 44 performs an actual estimation of the final
position 15 and orientation 16 of one or more virtual microphones, depending on the application
of interest.
[0068]
The sound scene center estimator 41 provides an estimate of the sound scene center. The output
of the sound scene center estimator 41 is then provided as an input to the microphone position /
orientation calculator 44. The microphone position / orientation calculator 44 makes an actual
estimation of the final position 15 and / or orientation 16 of one or more virtual microphones
according to the mode of operation characterizing the application of interest.
[0069]
Embodiments of the sound scene center estimator will now be described in more detail. There are several possible concepts for obtaining the center of the sound scene.
[0070]
According to a first concept of a first embodiment, the center of the sound scene is obtained by computing the center of gravity of the SPD Γ(x, y, z). The value of Γ(x, y, z) is interpreted as the mass present at the point (x, y, z) in space.
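A minimal sketch of this first concept, treating each SPD value as a mass located at its grid point (names are illustrative, not taken from the patent):

```python
import numpy as np

def sound_scene_center(spd, grid_x, grid_y):
    """Center of the sound scene as the centroid (center of gravity) of the SPD."""
    X, Y = np.meshgrid(grid_x, grid_y)
    total = spd.sum()
    cx = (X * spd).sum() / total    # mass-weighted mean x-coordinate
    cy = (Y * spd).sum() / total    # mass-weighted mean y-coordinate
    return cx, cy
```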
[0071]
According to a second concept of a second embodiment, the position in space with the smallest temporal dispersion of the channel is found. This is achieved by considering the root mean square (RMS) delay spread. First, for each point p = (x0, y0) in space, a power delay profile (PDP) A_p(τ) is computed based on the SPD Γ(x, y) using, for example,
[0072]
Then, from A_p(τ), the RMS delay spread is computed using a formula in which τ̄_p denotes the mean delay of A_p(τ). The position p for which the delay spread τ_RMS,p is smallest indicates the center of the sound scene.
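The formulas themselves are not reproduced in this text, so the following sketch reflects only one plausible reading of the description: the power delay profile is formed by binning the SPD values by their propagation delay from the candidate point, and the RMS spread around the mean delay is then computed.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def rms_delay_spread(spd, grid_x, grid_y, p, n_bins=64):
    """RMS delay spread seen from a candidate point p = (x0, y0)."""
    X, Y = np.meshgrid(grid_x, grid_y)
    tau = np.hypot(X - p[0], Y - p[1]) / C_SOUND        # propagation delay per cell
    # Power delay profile A_p(tau): SPD power collected per delay bin.
    pdp, edges = np.histogram(tau, bins=n_bins, weights=spd)
    centers = 0.5 * (edges[:-1] + edges[1:])
    total = pdp.sum()
    mean_tau = (centers * pdp).sum() / total            # mean delay of A_p(tau)
    return np.sqrt(((centers - mean_tau) ** 2 * pdp).sum() / total)

# The scene center is the grid point with the smallest RMS delay spread.
```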
[0073]
According to a third concept of a third embodiment, which can be adopted as an alternative for the sound scene center estimation, a "circular integration" is proposed. For example, in the 2D case, the SPD Γ(x, y) is convolved with a circle C(r,o) according to a formula in which r is the radius of the circle and o defines the center of the circle. The radius r may be constant or may vary according to the power value at the point (x, y). For example, a high power at the point (x, y) may correspond to a large radius, while a low power corresponds to a small radius. Additional dependencies on the power are also possible. One such example would be to convolve the circle with a bivariate Gaussian function before using it to construct the function g(x, y). According to such an embodiment, the covariance matrix of the bivariate Gaussian function becomes dependent on the power at the position (x, y), i.e., high power corresponds to low variance while low power corresponds to high variance.
[0074]
Once g (x, y) is computed, the center of the sound scene can be determined according to the
following equation:
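As the equations are not reproduced here, the sketch below shows one plausible implementation of the circular integration of [0028]-[0030] with a constant radius; SciPy's FFT-based convolution performs the 2D convolution, and the scene center is taken as the location of the maximum of g(x, y).

```python
import numpy as np
from scipy.signal import fftconvolve

def center_by_circular_integration(spd, grid_x, grid_y, radius_cells):
    """Sound scene center via circular integration: convolve the SPD with a
    circular kernel and pick the location of the maximum.

    radius_cells: radius r of the circle, in grid cells (assumed constant here;
                  the text also allows a power-dependent radius).
    """
    r = int(radius_cells)
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    circle = (xx ** 2 + yy ** 2 <= r ** 2).astype(float)   # kernel C(r, o)

    g = fftconvolve(spd, circle, mode="same")   # g(x, y) = Gamma(x, y) * C(r, o)(x, y)
    iy, ix = np.unravel_index(np.argmax(g), g.shape)
    return grid_x[ix], grid_y[iy]
```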
[0075]
In a further embodiment, this concept is extended to 3D by using 3D convolution of Γ (x, y, z) on
a sphere as well.
[0076]
FIG. 9 illustrates the microphone position/orientation calculator 44 according to another embodiment in more detail.
The sound scene center 43, along with the SPD 23, is provided as an input to the microphone position/orientation calculator 44.
In the microphone position/orientation calculator 44, depending on the operation required by the target application, the information about the center 43 of the sound scene can be copied directly to the output, e.g., for the scenario with one virtual microphone located at the acoustic center of the sound scene: if, for example, the application scenario of FIG. 4B is applicable, it can be used directly as the position of the virtual microphone. Alternatively, the information about the center 43 of the sound scene can be used as a modification parameter inside the microphone position/orientation calculator 44.
[0077]
Different concepts may be applied to calculate the position of the microphone, for example, optimization based on the projected energy width, or optimization based on principal component analysis.
[0078]
For purposes of explanation, it is assumed that the microphone position is computed according to the application scenario of FIG. 4C, i.e., for the scenario with one spatial microphone outside the sound scene.
However, the following description is equally applicable to any other application scenario.
[0079]
In the following, the concept for estimating the position of the virtual microphone according to
the already mentioned embodiments is explained in more detail.
[0080]
Optimization based on the projected energy width defines a set of M equally spaced lines passing through the center of the sound scene.
For example, in a 2D scenario, the SPD Γ(x, y) is projected orthogonally onto each of these lines and summed along the projection.
[0081]
FIGS. 10A-10C show optimization based on the projected energy width. In FIG. 10A, the projected power function P_proj is computed for each of the lines l_1, ..., l_i, ..., l_M. Then, as shown in FIG. 10B, the corresponding width of this function is computed, for example defined as the -3 dB width, i.e., the distance between the leftmost and the rightmost point of the segment corresponding to a power level higher than a predetermined power level of, for example, -3 dB. Subsequently, the line with the largest width is identified, and the virtual microphone is placed in the direction orthogonal to it. The orientation of the virtual microphone may be set so that it points toward the center of the sound scene, as described in the next section. Using this approach, two possible virtual microphone (VM) positions are obtained, as the VM can be placed on either the positive or the negative side of the orthogonal direction.
[0082]
For example, the distance at which the VM is placed can be computed based on geometrical considerations together with the opening angle of the virtual microphone; this is shown in FIG. 10C. The distance at which the VM is placed varies depending on the operating mode specific to the target application. This means that a triangle is constructed such that the width_i of FIG. 10C represents one side of the triangle and the center of gravity COG is the center point of that side. The third vertex of the triangle is found by taking the line orthogonal to that side through the COG and defining it as the bisector of the VM opening angle α. The length of the bisector then gives the distance between the position of the VM and the center of the sound scene.
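A grid-based sketch of the projected-energy-width search of [0080]-[0082] might look as follows; the number of lines, the -3 dB criterion and all names are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def widest_projection_direction(spd, grid_x, grid_y, center, n_lines=36, level_db=-3.0):
    """Return the largest projected energy width and the direction orthogonal to
    the widest line (i.e., the direction in which the VM would be placed)."""
    X, Y = np.meshgrid(grid_x, grid_y)
    dx, dy = (X - center[0]).ravel(), (Y - center[1]).ravel()
    w = spd.ravel()

    best_width, best_angle = -1.0, 0.0
    for angle in np.linspace(0.0, np.pi, n_lines, endpoint=False):
        u = np.array([np.cos(angle), np.sin(angle)])      # direction of line l_i
        s = dx * u[0] + dy * u[1]                          # coordinate along l_i
        # Projected power function P_proj: SPD summed into bins along the line.
        p_proj, edges = np.histogram(s, bins=100, weights=w)
        thr = p_proj.max() * 10.0 ** (level_db / 10.0)     # e.g. -3 dB level
        above = np.nonzero(p_proj >= thr)[0]
        width = edges[above[-1] + 1] - edges[above[0]]
        if width > best_width:
            best_width, best_angle = width, angle

    # The VM lies along the direction orthogonal to the widest line.
    return best_width, best_angle + np.pi / 2.0
```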
[0083]
According to another embodiment, the described concept of optimization based on the projected energy is extended to 3D. In this case, M^2 equally spaced planes (in the azimuth and elevation directions) are defined instead of M lines, and the width corresponds to the diameter of the circle containing the largest part of the projected energy. The final position is obtained by placing the VM on the normal to the plane with the maximum circle diameter. According to one embodiment, the distance from the center of the sound scene to the VM position may again be computed, as in the 2D case, from geometrical considerations and an opening angle specified by the mode of operation.
[0084]
According to another embodiment, optimization based on principal component analysis is employed. Optimization based on principal-component-analysis processing directly uses the information available from the SPD. First, the SPD Γ(x, y, z) is quantized and a thresholding filter is applied to the quantized data set. This discards all points whose energy level is lower than a predetermined threshold. Afterwards, the remaining points h_i = [h_x,i, h_y,i, h_z,i]^T are mean-centered (that is, the coordinates of the center of the sound scene are subtracted from the coordinates of the i-th point) and arranged in a matrix H as follows, where N defines the number of points after applying the threshold. Then, a singular value decomposition (SVD) is applied to H so that it is factored into the following product:
[0085]
The first column of U represents the principal component and has the highest variability within the data set. The second column of U is orthogonal to the first and represents the direction in which to place the VM. The width is implicitly given by the first singular value in the matrix Σ. Knowing this width as well as the direction, the position and orientation of the VM can be calculated as described for the optimization method based on the projected energy width above with reference to FIGS. 10A-10C.
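A possible 2D reading of this SVD-based procedure is sketched below; the width measure and the names are assumptions made for illustration only.

```python
import numpy as np

def vm_direction_by_pca(spd, grid_x, grid_y, center, threshold):
    """Placement direction of the VM via SVD of thresholded, mean-centered SPD points."""
    X, Y = np.meshgrid(grid_x, grid_y)
    mask = spd >= threshold                       # discard points below the threshold
    pts = np.column_stack((X[mask], Y[mask]))     # remaining points h_i
    H = pts - np.asarray(center, dtype=float)     # mean-centering on the scene center

    # H = U * S * Vt; with points as rows, the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    vm_direction = Vt[1]                          # orthogonal to the principal component
    width = 2.0 * S[0] / np.sqrt(len(H))          # one possible width measure (~2x RMS spread)
    return vm_direction, width
```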
[0086]
In other embodiments, these methods are applied to 2D problems. This is straightforward, as it simply means ignoring/removing the z-axis component from the equations and the discussion above.
[0087]
For other applications, such as the application scenario of FIG. 4A (multiple virtual microphones covering the entire sound scene), different concepts, such as iterative optimization techniques, may be employed. In a first step, the position with the largest value of the SPD is identified. This specifies the position of the first of the total of N virtual microphones. Following this, all energy surrounding this position (i.e., up to a predetermined distance) is removed from the SPD. The previous steps are repeated until the positions of all N virtual microphones are known. If N is not defined, the iteration is performed until the maximum value of the SPD falls below a predetermined threshold.
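The iterative procedure could be sketched as follows; the clearing radius and stopping threshold are assumptions.

```python
import numpy as np

def place_virtual_microphones(spd, grid_x, grid_y, n_mics=None,
                              clear_radius=1.0, stop_threshold=0.0):
    """Greedy placement of several VMs: repeatedly pick the SPD maximum, record it,
    then zero the SPD within clear_radius of that position."""
    X, Y = np.meshgrid(grid_x, grid_y)
    spd = spd.copy()
    positions = []
    while True:
        iy, ix = np.unravel_index(np.argmax(spd), spd.shape)
        if spd[iy, ix] <= stop_threshold:
            break                                  # remaining energy below threshold
        pos = (grid_x[ix], grid_y[iy])
        positions.append(pos)
        # Remove all energy surrounding this position (up to clear_radius).
        spd[np.hypot(X - pos[0], Y - pos[1]) <= clear_radius] = 0.0
        if n_mics is not None and len(positions) == n_mics:
            break
    return positions
```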
[0088]
FIG. 11 shows another embodiment. In this embodiment, the spatial information estimator 22
further comprises an orientation determiner 45. The orientation determiner 45 is configured to
determine the (appropriate) orientation 16 of the microphone based on the spatial power density
23.
[0089]
Hereinafter, orientation estimation will be described. Since it is assumed that the virtual microphone is directed toward the center of the sound scene, the optimization approaches based on the projected energy width and on principal component analysis implicitly compute the orientation 16 of the virtual microphone.
[0090]
However, for some other application scenarios it may be appropriate to compute the orientation explicitly, for example, in applications where the virtual microphone is located at a fixed position and only its optimal orientation is to be estimated. In this case, the orientation should be determined so that the virtual microphone picks up most of the energy of the sound scene.
[0091]
According to one embodiment, to determine the orientation of the virtual microphone, the possible directions φ are first sampled and the energy is integrated over each of those directions. This yields the following function f(φ), where r_max is defined as the maximum distance from the VM and controls the pickup range of the VM. Then, the final orientation φ of the VM is computed as given below, where w_φ(φ) is a weighting function based on the pickup characteristic of the VM. w_φ(φ) defines how much the energy coming from the direction φ is amplified, taking into account, for example, a predetermined look direction of the VM and a specific pickup pattern.
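A sketch of this explicit orientation search, assuming a uniform weighting w_φ(φ) = 1 (i.e., an omnidirectional pickup) and a discretized SPD:

```python
import numpy as np

def best_orientation(spd, grid_x, grid_y, vm_pos, r_max, n_dirs=72):
    """Integrate the SPD energy along each sampled direction phi up to r_max and
    return the direction with the most energy."""
    X, Y = np.meshgrid(grid_x, grid_y)
    dx, dy = X - vm_pos[0], Y - vm_pos[1]
    dist = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)

    phis = np.linspace(-np.pi, np.pi, n_dirs, endpoint=False)
    sector = 2.0 * np.pi / n_dirs
    f = np.empty(n_dirs)
    for i, phi in enumerate(phis):
        # f(phi): SPD summed inside a narrow sector around phi, out to r_max.
        dphi = np.angle(np.exp(1j * (angle - phi)))   # wrapped angular difference
        f[i] = spd[(np.abs(dphi) <= sector / 2.0) & (dist <= r_max)].sum()
    return phis[np.argmax(f)]
```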
[0092]
In the following, an apparatus for generating an audio output signal that simulates the recording
of a virtual microphone at a configurable virtual position in an environment will be described. A
microphone positioning device according to one of the above embodiments may be employed to
determine a virtual position for the device for generating the audio output signal.
[0093]
FIG. 12 shows an apparatus for generating an audio output signal that simulates the recording of a virtual microphone at a configurable virtual position posVmic in an environment. The apparatus comprises a sound event position estimator 110 and an information computation module 120. The sound event position estimator 110 receives first direction information di1 from a first real spatial microphone and second direction information di2 from a second real spatial microphone. The sound event position estimator 110 is configured to estimate a sound source position ssp indicating the position of a sound source in the environment that emits a sound wave, based on the first direction information di1 provided by the first real spatial microphone located at a first real microphone position pos1mic in the environment, and based on the second direction information di2 provided by the second real spatial microphone located at a second real microphone position in the environment. The information computation module 120 is configured to generate the audio output signal based on a first recorded audio input signal is1 recorded by the first real spatial microphone, based on the first real microphone position pos1mic and based on the virtual position posVmic of the virtual microphone. The information computation module 120 comprises a propagation compensator configured to generate a first modified audio signal by modifying the first recorded audio input signal is1, adjusting an amplitude value, a magnitude value or a phase value of the first recorded audio input signal is1, in order to compensate for a first delay or amplitude decay between the arrival of the sound wave emitted by the sound source at the first real spatial microphone and the arrival of the sound wave at the virtual microphone, so as to obtain the audio output signal.
[0094]
FIG. 13 shows the inputs and outputs of the apparatus and of the method according to one embodiment. Information from two or more real spatial microphones 111, 112, ..., 11N is input to the apparatus / processed by the method. This information comprises the audio signals picked up by the real spatial microphones as well as direction information from the real spatial microphones, e.g., direction of arrival (DOA) estimates. The audio signals and the direction information, such as the direction of arrival estimates, are expressed in the time-frequency domain. If, for example, a 2D geometry reconstruction is desired and a legacy short-time Fourier transform (STFT) domain is chosen for the representation of the signals, the DOA can be expressed as azimuth angles dependent on k and n, i.e., the frequency and time indices.
[0095]
In an embodiment, the localization of sound events in space, as well as the description of the position of the virtual microphone, is performed based on the positions and orientations of the real and virtual spatial microphones in a common coordinate system. This information can be represented by the inputs 121, ..., 12N and 104 in FIG. 13. The input 104 may additionally specify properties of the virtual spatial microphone, such as its position and pickup pattern, as described below. If the virtual spatial microphone comprises a plurality of virtual sensors, their positions and the corresponding different pickup patterns are considered.
[0096]
The output of the device or corresponding method may, if desired, be one or more sound signals
105 picked up by a spatial microphone defined and arranged as specified by 104. In addition, the
device (or rather the method) can provide an output corresponding to spatial side information
106 that can be estimated by employing a virtual spatial microphone.
[0097]
FIG. 14 shows an apparatus according to an embodiment comprising two main processing units,
a sound event position estimator 201 and an information computation module 202. The sound event position estimator 201 performs a geometrical reconstruction based on the DOAs comprised in the inputs 111, ..., 11N and based on knowledge of the position and orientation of the real spatial microphones where the DOAs were computed. The output 205 of the sound event position estimator comprises the position estimates (in 2D or 3D) of the sound sources where the sound events occur for each time and frequency bin. The second processing block 202 is an information computation module. According to the embodiment of FIG. 14, the second processing block 202 computes the virtual microphone signal and spatial side information; it is therefore also referred to as virtual microphone signal and side information computation block 202. The virtual microphone signal and side information computation block 202 uses the sound event positions 205 to process the audio signals comprised in 111, ..., 11N and outputs the virtual microphone audio signal 105. If required, block 202 may also compute the spatial side information 106 corresponding to the virtual spatial microphone. The following embodiments illustrate possibilities of how blocks 201 and 202 may operate.
[0098]
In the following, position estimation of a sound event position estimator according to an
embodiment will be described in more detail.
[0099]
Depending on the dimension of the problem (2D or 3D) and the number of spatial microphones,
several solutions to position estimation are possible.
[0100]
If there are two spatial microphones in 2D (the simplest case), simple triangulation is possible.
FIG. 15 shows an exemplary scenario in which the real spatial microphones are represented by uniform linear arrays (ULAs) of three microphones each.
The DOA, expressed as the azimuth angles a1(k, n) and a2(k, n), is computed for the time-frequency bin (k, n). This is achieved by employing a suitable DOA estimator, such as ESPRIT [12] or MUSIC [13], on the pressure signals transformed to the time-frequency domain.
[0101]
In FIG. 15, two real spatial microphone arrays 410 and 420 are shown as the two real spatial microphones. The two estimated DOAs a1(k, n) and a2(k, n) are represented by two lines: the first line 430 represents DOA a1(k, n) and the second line 440 represents DOA a2(k, n). The triangulation is made possible via simple geometrical considerations, knowing the position and orientation of each array.
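For illustration, the 2D triangulation from the two DOAs could be implemented as below (names are illustrative; the parallel-line case discussed in the next paragraph is reported as a failure):

```python
import numpy as np

def triangulate_2d(p1, phi1, p2, phi2):
    """Intersect the two DOA rays fired from array positions p1 and p2.

    phi1, phi2: azimuth angles a1(k, n), a2(k, n) in radians, global frame.
    Returns None when the lines are (nearly) parallel and triangulation fails.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d1 = np.array([np.cos(phi1), np.sin(phi1)])   # unit vector along line 430
    d2 = np.array([np.cos(phi2), np.sin(phi2)])   # unit vector along line 440
    # Solve p1 + t1 * d1 = p2 + t2 * d2 for t1, t2.
    A = np.column_stack((d1, -d2))
    if abs(np.linalg.det(A)) < 1e-9:
        return None                               # parallel DOAs: no intersection
    t1, _ = np.linalg.solve(A, p2 - p1)
    return p1 + t1 * d1
```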
[0102]
Triangulation is not possible if the two lines 430 and 440 are exactly parallel; in real applications, however, this is very unlikely. Moreover, not all triangulation results correspond to physical or feasible positions of sound events in the considered space. For example, the estimated position of a sound event may be too far away, or even outside the assumed space, so that the DOAs probably do not correspond to any sound event that can be physically interpreted with the model used. Such results may be caused by sensor noise or by too strong room reverberation. Therefore, according to one embodiment, such undesirable results are flagged so that the information computation module 202 can treat them properly.
[0103]
FIG. 16 shows a scenario in which the position of a sound event is estimated in 3D space. Suitable spatial microphones are employed, for example, planar or 3D microphone arrays. In FIG. 16, a first spatial microphone 510, e.g., a first 3D microphone array, and a second spatial microphone 520, e.g., a second 3D microphone array, are shown. The DOA in 3D space can be expressed, for example, by azimuth and elevation angles, and can be represented by the unit vectors 530 and 540. Two lines 550 and 560 are projected according to the DOAs. In 3D, even with very reliable estimates, the two lines 550 and 560 projected according to the DOAs might not intersect. However, the triangulation can still be performed, for example, by choosing the midpoint of the shortest segment connecting the two lines.
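A sketch of this 3D variant, returning the midpoint of the shortest segment between the two (possibly non-intersecting) DOA lines; the closed-form solution for the closest points is standard geometry, not taken from the patent:

```python
import numpy as np

def triangulate_3d(p1, e1, p2, e2):
    """p1, p2: array positions; e1, e2: DOA unit vectors (lines 550 and 560)."""
    p1, e1 = np.asarray(p1, float), np.asarray(e1, float)
    p2, e2 = np.asarray(p2, float), np.asarray(e2, float)
    w0 = p1 - p2
    a, b, c = e1 @ e1, e1 @ e2, e2 @ e2
    d, e = e1 @ w0, e2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None                        # (nearly) parallel lines
    t1 = (b * e - c * d) / denom           # parameter of the closest point on line 1
    t2 = (a * e - b * d) / denom           # parameter of the closest point on line 2
    closest1 = p1 + t1 * e1
    closest2 = p2 + t2 * e2
    return 0.5 * (closest1 + closest2)     # midpoint of the shortest segment
```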
[0104]
As in the 2D case, the triangulation may fail or may yield infeasible results for certain combinations of directions; such results may then also be flagged, e.g., to the information computation module 202 of FIG. 14.
[0105]
If there are more than two spatial microphones, several solutions are possible.
For example, the triangulation described above is performed for all pairs of real spatial
microphones (1 and 2, 1 and 3 and 2 and 3 for N = 3). Then, the average of the positions
obtained (x and y, and also along z if 3D is considered) may be taken.
[0106]
Alternatively, more complex concepts may be used. For example, the probabilistic approach
described in Non-Patent Document 14 may be applied.
[0107]
According to one embodiment, the sound field can be analyzed in the time-frequency domain, obtained, for example, via a short-time Fourier transform (STFT), where k and n denote the frequency index k and the time index n, respectively. The complex pressure P_v(k, n) at an arbitrary position p_v is modeled, for a given k and n, as a single spherical wave emitted by a narrow-band isotropic point-like source, e.g., by employing a formula in which P_IPLS(k, n) is the signal emitted by the IPLS at its position p_IPLS(k, n). The complex factor γ(k, p_IPLS, p_v) expresses the propagation from p_IPLS(k, n) to p_v, e.g., it introduces appropriate phase and magnitude modifications. Here, the assumption is applied that in each time-frequency bin only one IPLS is active. Nevertheless, multiple narrow-band IPLSs located at different positions may also be active at a single time instant.
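The patent text does not spell out the concrete form of γ(k, p_IPLS, p_v); a common free-field choice, shown here purely as an assumption, combines 1/r amplitude decay with a phase shift corresponding to the travel time:

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def propagation_factor(freq_hz, p_ipls, p_v):
    """Illustrative gamma(k, p_IPLS, p_v): free-field spherical wave with 1/r decay."""
    r = np.linalg.norm(np.asarray(p_v, float) - np.asarray(p_ipls, float))
    wavenumber = 2.0 * np.pi * freq_hz / C_SOUND
    return np.exp(-1j * wavenumber * r) / r

# Model: P_v(k, n) = P_IPLS(k, n) * gamma(k, p_IPLS, p_v)
```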
[0108]
Each IPLS models direct sound or individual room reflections. Ideally, the position p IPLS (k, n)
may correspond to an actual sound source located in the room or a mirror image sound source
located outside. Therefore, the position p IPLS (k, n) will also indicate the position of the sound
event.
[0109]
It should be noted that the term "actual sound source" refers to an actual sound source physically present in the recording environment, such as a speaker or an instrument. On the other hand, the terms "sound source", "sound event" or "IPLS" refer to effective sound sources that are active at a given time instant or in a given time-frequency bin; these sound sources may, for example, represent real sources or mirror image sources.
[0110]
FIGS. 27A-27B show microphone arrays localizing sound sources. The localized sound sources have
different physical interpretations depending on their nature. When the microphone arrays
receive direct sound, they localize the position of the actual sound source (e.g., the speaker).
When the microphone array receives the reflection, the microphone array localizes the position
of the mirror image source. The mirror image source is also a sound source.
[0111]
FIG. 27A shows a scenario where two microphone arrays 151 and 152 receive direct sound from
an actual sound source (physically present sound source) 153.
[0112]
FIG. 27B shows a scenario in which two microphone arrays 161 and 162 receive reflected sound, the sound having been reflected by a wall. Because of the reflection, the microphone arrays 161 and 162 localize the position from which the sound appears to arrive at the position of a mirror-image source 165, which differs from the position of the speaker 163.
[0113]
Both the actual sound source 153 of FIG. 27A and the mirror-image source 165 of FIG. 27B are sound sources.
[0114]
FIG. 27C shows a scenario where two microphone arrays 171 and 172 receive diffuse sound and
can not localize the sound source.
[0115]
The single-wave model is accurate only for mildly reverberant environments, provided that the source signals fulfill the W-disjoint orthogonality (WDO) condition, i.e., that their time-frequency overlap is sufficiently small.
This is usually correct for speech signals, as shown, for example, in [11].
[0116]
Nevertheless, the model also provides good estimates for other environments and is therefore also applicable to those environments.
[0117]
In the following, the estimation of the position p IPLS (k, n) according to one embodiment is described. The position p IPLS (k, n) of the IPLS active in a certain time-frequency bin, i.e., the estimate of a sound event in that time-frequency bin, is estimated via triangulation on the basis of the direction of arrival (DOA) of the sound measured at at least two different observation points.
[0118]
FIG. 17 shows the geometry for the case in which the IPLS of the current time-frequency slot (k, n) is located at the unknown position p IPLS (k, n). In order to determine the required DOA information, two real spatial microphones of known geometry, position and orientation, here two microphone arrays placed at positions 610 and 620 respectively, are employed. The vectors p 1 and p 2 point to the positions 610 and 620, respectively. The orientations of the arrays are defined by the unit vectors c 1 and c 2. The DOA of the sound is determined at positions 610 and 620 for each (k, n) using, for example, a DOA estimation algorithm as provided by DirAC analysis (see Non-Patent Document 2 and Non-Patent Document 3). Thereby, a first point-of-view unit vector e 1 <POV> (k, n) and a second point-of-view unit vector e 2 <POV> (k, n), relating to the points of view of the microphone arrays (both not shown in FIG. 17), are provided as output of the DirAC analysis. For example, when operating in 2D, the first point-of-view unit vector is e 1 <POV> (k, n) = [cos(φ 1 (k, n)), sin(φ 1 (k, n))] <T>.
[0119]
Here, φ 1 (k, n) denotes the azimuth angle of the DOA estimated at the first microphone array, as depicted in FIG. 17. The corresponding DOA unit vectors e 1 (k, n) and e 2 (k, n), expressed with respect to the global coordinate system at the origin, may be computed by applying the coordinate transformations e 1 (k, n) = R 1 · e 1 <POV> (k, n) and e 2 (k, n) = R 2 · e 2 <POV> (k, n), where R is a coordinate transformation matrix depending on the array orientation, e.g., on c 1 = [c 1,x , c 1,y ] <T> when operating in 2D. To carry out the triangulation, the direction vectors d 1 (k, n) and d 2 (k, n) may be calculated as d 1 (k, n) = d 1 (k, n) e 1 (k, n) and d 2 (k, n) = d 2 (k, n) e 2 (k, n), where d 1 (k, n) = || d 1 (k, n) || and d 2 (k, n) = || d 2 (k, n) || are the unknown distances between the IPLS and the two microphone arrays. Equation (6), p 1 + d 1 (k, n) = p 2 + d 2 (k, n), can be solved for d 1 (k, n). Finally, the position p IPLS (k, n) of the IPLS is given by p IPLS (k, n) = d 1 (k, n) e 1 (k, n) + p 1.
[0120]
In another embodiment, equation (6) can be solved for d 2 (k, n), and p IPLS (k, n) is computed by
similarly adopting d 2 (k, n).
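For illustration only, a minimal 2D sketch of this triangulation, assuming the azimuth angles have already been transformed to the global coordinate system (all names are hypothetical):

import numpy as np

def triangulate_2d(p1, azimuth1, p2, azimuth2):
    # DOA unit vectors in the global coordinate system.
    e1 = np.array([np.cos(azimuth1), np.sin(azimuth1)])
    e2 = np.array([np.cos(azimuth2), np.sin(azimuth2)])
    # Solve p1 + d1*e1 = p2 + d2*e2 for the unknown distances d1 and d2.
    A = np.column_stack((e1, -e2))
    rhs = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    d1, d2 = np.linalg.solve(A, rhs)              # fails if e1 and e2 are parallel
    return np.asarray(p1, dtype=float) + d1 * e1  # estimated IPLS position

For example, triangulate_2d([0.0, 0.0], np.deg2rad(45), [1.0, 0.0], np.deg2rad(135)) returns approximately [0.5, 0.5].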
[0121]
Equation (6) always provides a solution when operating in 2D, unless e 1 (k, n) and e 2 (k, n) are
parallel.
However, when more than two microphone arrays are used, or when operating in 3D, a solution may not exist because the direction vectors d may not intersect. According to one embodiment, in this case the point closest to all direction vectors d may be computed, and this point may be used as the position of the IPLS.
[0122]
In one embodiment, all the observation points p 1, p 2, ... should be located such that the sound emitted by the IPLS falls into the same temporal block n. This requirement can easily be fulfilled when the distance Δ between any two of the observation points is smaller than Δ max = c · n FFT (1 − R) / f s, where n FFT is the window length of the STFT, 0 ≤ R < 1 specifies the overlap between successive time frames, f s is the sampling frequency, and c is the speed of sound. For example, for a 1024-point STFT at 48 kHz with 50% overlap (R = 0.5), the maximum spacing between the arrays that fulfills the above requirement is Δ = 3.65 m.
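A brief numerical check of this spacing requirement, assuming a speed of sound of about 343 m/s (the exact value underlying the quoted 3.65 m is not stated above):

n_fft = 1024      # STFT window length in samples
R = 0.5           # overlap between successive time frames
f_s = 48000.0     # sampling frequency in Hz
c = 343.0         # assumed speed of sound in m/s

delta_max = c * n_fft * (1.0 - R) / f_s
print(delta_max)  # about 3.66 m, consistent with the value quoted above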
[0123]
In the following, the information computation module 202 according to an embodiment, e.g., the virtual microphone signal and side information computation module, is described in more detail.
[0124]
FIG. 18 shows a schematic diagram of an information computation module 202 according to one
embodiment.
The information computation module 202 includes a propagation compensator 500, a combiner 510, and a spectrum weighting unit 520. The information computation module 202 receives the sound source position estimates ssp estimated by the sound event position estimator, one or more audio input signals recorded by the one or more real spatial microphones, the positions posRealMic of the one or more real spatial microphones, and the virtual position posVmic of the virtual microphone. It outputs an audio output signal os representing the audio signal of the virtual microphone.
[0125]
FIG. 19 shows an information operation module according to another embodiment. The
information operation module of FIG. 19 includes a propagation compensator 500, a combiner
510, and a spectrum weighting unit 520. The propagation compensator 500 includes a
propagation parameter calculation module 501 and a propagation compensation module 504.
The combiner 510 comprises a combining factor calculation module 502 and a combining
module 505. The spectrum weighting unit 520 includes a spectrum weighting operation unit
503, a spectrum weighting application module 506, and a spatial side information operation
module 507.
[0126]
In order to calculate the audio signal of the virtual microphone, geometric information, for example the positions and orientations of the real spatial microphones 121 ... 12N, the position, orientation and characteristics of the virtual spatial microphone 104, and the position estimates of the sound events 205, are input to the information computation module 202, in particular to the propagation parameter calculation module 501 of the propagation compensator 500, to the combining factor calculation module 502 of the combiner 510, and to the spectrum weighting operation unit 503 of the spectrum weighting unit 520. The propagation parameter calculation module 501, the combining factor calculation module 502 and the spectrum weighting operation unit 503 compute the parameters used in the modification of the audio signals 111 ... 11N in the propagation compensation module 504, the combining module 505 and the spectrum weighting application module 506.
[0127]
In the information computation module 202, the audio signals 111 ... 11N are first modified to compensate for the effects caused by the different propagation lengths between the sound event positions and the real spatial microphones. The signals may then be combined, for example, to improve the signal-to-noise ratio (SNR). Finally, the resulting signal may be spectrally weighted to take into account the directional pick-up pattern of the virtual microphone as well as any distance-dependent gain function. These three steps are described in more detail below.
[0128]
Propagation compensation is now described in more detail. The upper part of FIG. 20 shows two real spatial microphones (a first microphone array 910 and a second microphone array 920), the position of a localized sound event 930 for a time-frequency bin (k, n), and the position 940 of the virtual spatial microphone.
[0129]
The lower part of FIG. 20 shows a time axis. It is assumed that a sound event is emitted at time t0
and then propagates to real and virtual spatial microphones. Since not only the amplitude but
also the time delay of arrival changes with distance, the longer the propagation distance, the
weaker the amplitude and the longer the time delay of arrival.
[0130]
The signals in the two real arrays can only be compared if their relative delay Dt12 is small. If
this is not the case, one of these two signals needs to be realigned in time to compensate for the
relative delay Dt12, and possibly scaled to compensate for the different attenuations.
[0131]
Compensating the delay between the arrival at the virtual microphone and the arrival at the real microphone arrays (at one of the real spatial microphones) changes the delay independently of the localization of the sound event and is therefore superfluous for most applications.
[0132]
Returning to FIG. 19, the propagation parameter calculation module 501 is configured to compute the delays to be corrected for each real spatial microphone and for each sound event. If desired, it also computes the gain factors to be considered in order to compensate for the different amplitude decays.
[0133]
The propagation compensation module 504 is configured to correct the audio signal using this
information. If the signal is shifted by a small amount of time (compared to the filter bank time
window), a simple phase rotation is sufficient. If the delay is large, more complex
implementations are required.
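As a sketch of the simple phase-rotation case, assuming STFT-domain coefficients and a delay that is small compared to the analysis window (the function and variable names are hypothetical, and the sign of the exponent depends on the STFT convention used):

import numpy as np

def compensate_small_delay(stft_frame, delay_seconds, f_s, n_fft):
    # Apply a per-bin phase rotation equivalent to shifting the signal by delay_seconds.
    bin_freqs = np.arange(len(stft_frame)) * f_s / n_fft   # centre frequency of each bin in Hz
    return np.asarray(stft_frame) * np.exp(1j * 2.0 * np.pi * bin_freqs * delay_seconds)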
[0134]
In the following, propagation compensation for the virtual microphone according to one embodiment is described in more detail with reference to FIG. 17, which shows, in particular, the position 610 of a first real spatial microphone and the position 620 of a second real spatial microphone.
[0135]
In the embodiments described herein, it is assumed that at least one first recorded audio input signal, e.g., a pressure signal of at least one of the real spatial microphones (e.g., microphone arrays), for example the pressure signal of a first real spatial microphone, is available. The microphone considered is referred to as the reference microphone, its position as the reference position p ref, and its pressure signal as the reference pressure signal P ref (k, n). However, propagation compensation may be conducted not only with respect to a single pressure signal, but also with respect to the pressure signals of a plurality of, or of all, the real spatial microphones.
[0136]
The relationship between the pressure signal P IPLS (k, n) emitted by the IPLS and the reference pressure signal P ref (k, n) of the reference microphone located at p ref can be expressed by equation (9), P ref (k, n) = P IPLS (k, n) · γ(k, p IPLS, p ref).
[0137]
In general, the complex factor γ(k, p a, p b) expresses the phase rotation and the amplitude decay introduced by the propagation of the spherical wave from its origin in p a to p b. However, practical tests have indicated that considering only the amplitude decay in γ leads to plausible impressions of the virtual microphone signal with significantly fewer artifacts compared to the case in which the phase rotation is also considered.
[0138]
The sound energy that can be measured at a certain point in space depends strongly on the distance r from the sound source, in FIG. 6 from the position p IPLS of the sound source. In many situations, this dependency can be modeled with sufficient accuracy using well-known physical principles, for example the 1/r decay of the sound pressure in the far field of a point source. When the distance of the reference microphone, e.g., the first real spatial microphone, from the sound source is known, and when the distance of the virtual microphone from the sound source is also known, the sound energy at the position of the virtual microphone can be estimated from the signal and the energy of the reference microphone, e.g., the first real spatial microphone. This means that the output signal of the virtual microphone can be obtained by applying an appropriate gain to the reference pressure signal.
[0139]
It is assumed that the first real spatial microphone is the reference microphone, i.e., p ref = p 1. In FIG. 17, the virtual microphone is located at p v. Since the geometry in FIG. 17 is known in detail, the distance d 1 (k, n) = || d 1 (k, n) || between the reference microphone (in FIG. 17: the first real spatial microphone) and the IPLS can easily be determined, as well as the distance s(k, n) = || s(k, n) ||, i.e., the distance between the virtual microphone and the IPLS.
[0140]
The sound pressure P v (k, n) at the position of the virtual microphone is calculated by combining equation (1) and equation (9), yielding P v (k, n) = [γ(k, p IPLS, p v) / γ(k, p IPLS, p ref)] · P ref (k, n).
[0141]
As mentioned above, in some embodiments the factor γ considers only the amplitude decay due to the propagation. Assuming, for instance, that the sound pressure decreases with 1/r, the sound pressure at the virtual microphone can be estimated as P v (k, n) = [d 1 (k, n) / s(k, n)] · P ref (k, n).
[0142]
When the model of equation (1) holds, e.g., when only direct sound is present, equation (12) correctly reconstructs the magnitude information. However, in the case of pure diffuse sound fields, e.g., when the model assumptions are not met, the presented method leads to an implicit dereverberation of the signal when the virtual microphone is moved away from the positions of the sensor arrays. In fact, as discussed above, in diffuse sound fields most IPLS are expected to be localized near the two sensor arrays. Therefore, when the virtual microphone is moved away from these positions, the distance s = || s || in FIG. 17 is likely to increase. Accordingly, the magnitude of the reference pressure is decreased when applying the weighting according to equation (11). Correspondingly, when the virtual microphone is moved closer to an actual sound source, the time-frequency bins corresponding to the direct sound are amplified, such that the overall audio signal is perceived as less diffuse. By adjusting the rule in equation (12), the amplification of the direct sound and the suppression of the diffuse sound can be controlled at will.
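A minimal sketch of this magnitude-only propagation compensation, assuming the 1/r pressure decay discussed above (the function and variable names are hypothetical):

import numpy as np

def compensate_propagation(P_ref, p_ipls, p_ref, p_v):
    # Distances reference microphone <-> IPLS and virtual microphone <-> IPLS.
    d1 = np.linalg.norm(np.asarray(p_ipls, dtype=float) - np.asarray(p_ref, dtype=float))
    s = np.linalg.norm(np.asarray(p_ipls, dtype=float) - np.asarray(p_v, dtype=float))
    # Magnitude-only compensation: scale the reference pressure by the distance ratio.
    return (d1 / s) * P_ref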
[0143]
By performing propagation compensation on the recorded audio input signal (eg, pressure
signal) of the first real spatial microphone, a first modified audio signal is obtained.
[0144]
In an embodiment, the second modified audio signal may be obtained by performing propagation
compensation on the recorded second audio input signal (second pressure signal) of the second
real spatial microphone.
[0145]
In another embodiment, a further modified audio signal may be obtained by performing propagation compensation on a recorded further audio input signal (a further pressure signal) of a further real spatial microphone.
[0146]
The combining in blocks 502 and 505 of FIG. 19 according to one embodiment is now described in more detail. It is assumed that two or more audio signals from a plurality of different real spatial microphones have been modified to compensate for the different propagation paths, so as to obtain two or more modified audio signals. Once the audio signals from the different real spatial microphones have been modified to compensate for the different propagation paths, they can be combined to improve the audio quality. By doing so, for example, the SNR may be increased or the reverberation may be reduced.
[0147]
Possible solutions for the combining include: a weighted average, e.g., considering the SNR, the distance to the virtual microphone, or the diffusivity estimated by the real spatial microphones, whereby conventional solutions such as Maximum Ratio Combining (MRC) or Equal Gain Combining (EQC) may be employed; a linear combination of some or all of the modified audio signals to obtain a combined signal, where the modified audio signals may be weighted in the linear combination; or selection, e.g., using only one signal, for example depending on the SNR, the distance or the diffusivity.
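As one possible illustration of such a weighted average, assuming an SNR-based weighting rule (the concrete rule is not prescribed above; all names are hypothetical):

import numpy as np

def combine_modified_signals(signals, snrs_db):
    # signals: propagation-compensated STFT coefficients, one per real spatial microphone.
    # Weight each signal by its linear SNR and normalize the weights to sum to one.
    weights = 10.0 ** (np.asarray(snrs_db, dtype=float) / 10.0)
    weights = weights / np.sum(weights)
    return sum(w * np.asarray(s) for w, s in zip(weights, signals))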
[0148]
The task of module 502 is, if applicable, to compute the parameters for the combining, which is carried out in module 505.
[0149]
Spectral weighting according to embodiments is now described in more detail. For this, reference is made to blocks 503 and 506 of FIG. 19. In this final step, the audio signal resulting from the combining or from the propagation compensation of the input audio signals is weighted in the time-frequency domain according to the spatial characteristics of the virtual spatial microphone as specified by input 104 and/or according to the reconstructed geometry (given in 205).
[0150]
For each time-frequency bin, DOA for a virtual microphone as shown in FIG. 21 can be easily
obtained by geometric reconstruction. Furthermore, the distance between the virtual microphone
and the position of the sound event can be easily calculated.
[0151]
The weights for the time-frequency bins are then calculated taking into account the type of
virtual microphone desired.
[0152]
In the case of a directional microphone, the spectral weights may be computed according to a predefined pick-up pattern. For example, according to one embodiment, a cardioid microphone may have a pick-up pattern defined by the function g(θ), g(θ) = 0.5 + 0.5 cos(θ), where θ is the angle between the looking direction of the virtual spatial microphone and the DOA of the sound from the point of view of the virtual microphone.
[0153]
Another possibility is an artistic (non-physical) damping function. In some applications, it may be
desirable to suppress sound events away from the virtual microphone by a factor greater than
that which characterizes free area propagation. To this end, some embodiments introduce an
additional weighting function that depends on the distance between the virtual microphone and
the sound event. In one embodiment, only sound events within a predetermined distance (e.g., in
meters) from the virtual microphone should be picked up.
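A small sketch of such spectral weights, combining the cardioid pick-up pattern g(θ) = 0.5 + 0.5 cos(θ) given above with a distance window; the concrete distance rule is an assumption made for illustration only (all names are hypothetical):

import numpy as np

def spectral_weight(theta, distance, max_distance=3.0):
    # Cardioid pick-up pattern: theta is the angle between the looking direction of the
    # virtual microphone and the DOA of the sound from its point of view.
    g = 0.5 + 0.5 * np.cos(theta)
    # Hypothetical artistic damping: keep only sound events within max_distance metres.
    window = 1.0 if distance <= max_distance else 0.0
    return g * window

The resulting weight is then applied per time-frequency bin to the combined or propagation-compensated signal.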
[0154]
With regard to the directivity of the virtual microphone, any directivity pattern can be applied to
the virtual microphone. By doing so, for example, a certain sound source can be separated from a
complex sound scene.
[0155]
Since the DOA of the sound can be computed at the position p v of the virtual microphone, arbitrary directivities can be realized for the virtual microphone. Here, c v is a unit vector describing the orientation of the virtual microphone. For example, letting P v (k, n) denote the combined signal or the propagation-compensated modified audio signal, the output of a virtual microphone with cardioid directivity can be computed by applying the corresponding directivity weighting. The directivity patterns that can potentially be generated in this way depend on the accuracy of the position estimation.
[0156]
In an embodiment, one or more real non-spatial microphones, for example an omnidirectional microphone or a directional microphone such as a cardioid, are placed in the sound scene in addition to the real spatial microphones, in order to further improve the sound quality of the virtual microphone signals 105. These microphones are not used to gather any geometric information, but only to provide a cleaner audio signal. These microphones may be placed closer to the sound sources than the spatial microphones. In this case, according to one embodiment, the audio signals of the real non-spatial microphones and their positions are simply fed to the propagation compensation module 504 of FIG. 19 for processing, instead of the audio signals of the real spatial microphones. Propagation compensation is then conducted for the one or more recorded audio signals of the non-spatial microphones with respect to the positions of the one or more non-spatial microphones. Thereby, an embodiment is realized using additional non-spatial microphones.
[0157]
In a further embodiment, the computation of spatial side information for the virtual microphone is realized. To compute the spatial side information 106 of the microphone, the information computation module 202 of FIG. 19 comprises a spatial side information calculation module 507, which is configured to receive as inputs the positions 205 of the sound sources and the position, orientation and characteristics 104 of the virtual microphone. In certain embodiments, depending on the side information 106 that needs to be computed, the audio signal 105 of the virtual microphone may also be taken into account as an input to the spatial side information calculation module 507.
[0158]
The output of the spatial side information calculation module 507 is the side information 106 of the virtual microphone. This side information may be, for example, the DOA or the diffusivity of the sound for each time-frequency bin (k, n) from the point of view of the virtual microphone. Another possible side information is, for example, the active sound intensity vector Ia(k, n) that would have been measured at the position of the virtual microphone. How these parameters can be derived is now described.
[0159]
According to one embodiment, DOA estimation for the virtual spatial microphone is realized. The information computation module 120 is configured to estimate the direction of arrival at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event, as illustrated in FIG. 22.
[0160]
FIG. 22 shows a possible way to derive the DOA of the sound from the point of view of the virtual microphone. The position of the sound event, provided by block 205 in FIG. 19, can be described for each time-frequency bin (k, n) by a position vector r(k, n), the position vector of the sound event. Similarly, the position of the virtual microphone, provided as input 104 in FIG. 19, can be described by a position vector s(k, n), the position vector of the virtual microphone. The looking direction of the virtual microphone can be described by a vector v(k, n). The DOA relative to the virtual microphone is given by a(k, n); it represents the angle between v and the sound propagation path h(k, n). The propagation path h(k, n) can be determined from the position vectors r(k, n) and s(k, n).
[0161]
The desired DOA a(k, n) can then be calculated for each (k, n), for example via the definition of the inner product of h(k, n) and v(k, n), i.e., a(k, n) = arccos( h(k, n) · v(k, n) / ( || h(k, n) || || v(k, n) || ) ).
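A minimal sketch of this DOA computation; the propagation path vector is assumed here to be h(k, n) = r(k, n) - s(k, n), i.e., pointing from the virtual microphone towards the sound event, which is one possible convention (all names are hypothetical):

import numpy as np

def doa_at_virtual_microphone(r, s, v):
    # r: position vector of the sound event, s: position of the virtual microphone,
    # v: looking direction of the virtual microphone.
    h = np.asarray(r, dtype=float) - np.asarray(s, dtype=float)   # assumed propagation path vector
    v = np.asarray(v, dtype=float)
    cos_a = np.dot(h, v) / (np.linalg.norm(h) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))                   # DOA angle a(k, n) in radians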
[0162]
In another embodiment, the information computation module 120 may be configured to estimate the active sound intensity at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event, as illustrated in FIG. 22.
[0163]
From the DOA a(k, n) defined above, the active sound intensity Ia(k, n) at the position of the virtual microphone can be derived. For this, it is assumed that the virtual microphone audio signal 105 in FIG. 19 corresponds to the output of an omnidirectional microphone, e.g., it is assumed that the virtual microphone is an omnidirectional microphone. Furthermore, the looking direction v in FIG. 22 is assumed to be parallel to the x-axis of the coordinate system. Since the desired active sound intensity vector Ia(k, n) describes the flow of energy through the position of the virtual microphone, Ia(k, n) can be computed accordingly, where [] <T> denotes a transposed vector, ρ is the density of the air, and P v (k, n) is the sound pressure measured by the virtual spatial microphone, e.g., the output 105 of block 506 in FIG. 19.
[0164]
If the quantities are expressed in a general coordinate system, but the active intensity vector is still to be calculated at the position of the virtual microphone, a correspondingly generalized expression may be applied.
[0165]
The diffusivity of sound expresses how diffuse the sound field is in a given time-frequency slot (see, for example, Non-Patent Document 2). The diffusivity is expressed by a value Ψ, where 0 ≤ Ψ ≤ 1. A diffusivity of 1 indicates that the total sound field energy is completely diffuse. This information is important, for example, in the reproduction of spatial sound. Conventionally, the diffusivity is computed at the specific point in space at which a microphone array is placed.
[0166]
According to one embodiment, the diffusivity may be computed as an additional parameter of the side information generated for the virtual microphone (VM), which can be placed at an arbitrary position in the sound scene. Thereby, an apparatus that also calculates the diffusivity besides the audio signal at the virtual position of the virtual microphone can generate a DirAC stream, i.e., an audio signal, a direction of arrival and a diffusivity, for an arbitrary point in the sound scene. The DirAC stream may be further processed, stored, transmitted and played back on an arbitrary multi-loudspeaker setup. In this case, the listener experiences the sound scene as if he or she were at the position specified by the virtual microphone and were looking in the direction specified by its orientation.
[0167]
FIG. 23 shows an information computation block according to an embodiment comprising a diffusivity computation unit 801 for computing the diffusivity at the virtual microphone. The information computation block 202 is configured to receive, in addition to the inputs of FIG. 14, the inputs 111 to 11N, which also include the diffusivities measured at the real spatial microphones. Let Ψ <(SM1)> to Ψ <(SMN)> denote these values. These additional inputs are fed to the information computation module 202. The output 103 of the diffusivity computation unit 801 is the diffusivity parameter computed at the position of the virtual microphone.
[0168]
The diffusivity computation unit 801 of one embodiment is described in more detail below. According to one embodiment, the energies of the direct and diffuse sound at each of the N spatial microphones are estimated. Then, using the information on the positions of the IPLS and the information on the position of the virtual spatial microphone, N estimates of these energies at the position of the virtual microphone can be obtained. Finally, the estimates can be combined to improve the estimation accuracy, and the diffusivity parameter at the virtual microphone can readily be computed.
[0169]
E dir <(SM1)> to E dir <(SMN)> and E diff <(SM1)> to E diff <(SMN)> denote the estimates of the energies of the direct and diffuse sound for the N spatial microphones, computed by the energy analysis unit 810. For the i-th spatial microphone, if P i is the complex pressure signal and Ψ i is the diffusivity, the energies can be computed, for example, from P i and Ψ i (an illustrative sketch is given after paragraph [0173] below).
[0170]
Since the energy of the diffuse sound should be equal at all positions, an estimate of the diffuse sound energy E diff <(VM)> at the virtual microphone can be computed simply by averaging E diff <(SM1)> to E diff <(SMN)>, e.g., in a diffusivity combining unit 820.
[0171]
A more effective combination of the estimates E diff <(SM1)> to E diff <(SMN)> could be carried out by considering the variance of the estimators, for instance, by considering the SNR.
[0172]
Due to the propagation, the energy of the direct sound depends on the distance to the sound source. Therefore, E dir <(SM1)> to E dir <(SMN)> may be adjusted to take this into account. This may be carried out, for example, by a direct sound propagation adjustment unit 830. For example, assuming that the energy of the direct sound field decays with 1 over the distance squared, the estimate of the direct sound at the virtual microphone for the i-th spatial microphone can be obtained by scaling E dir <(SMi)> with the squared ratio of the distance between the IPLS and the i-th spatial microphone to the distance between the IPLS and the virtual microphone.
[0173]
Similarly to the diffusivity combining unit 820, the estimates of the direct sound energy obtained at the different spatial microphones can be combined, e.g., by a direct sound combining unit 840. The result is E dir <(VM)>, e.g., the estimate of the direct sound energy at the virtual microphone. The diffusivity Ψ <(VM)> at the virtual microphone may then be computed, for example, by a diffusivity sub-calculator 850, e.g., as Ψ <(VM)> = E diff <(VM)> / (E diff <(VM)> + E dir <(VM)>).
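The following sketch illustrates one plausible instantiation of these steps; the direct/diffuse energy split, the averaging and the inverse-square distance adjustment follow the description above, but the concrete expressions (e.g., omitted constant factors) and all names are assumptions made for illustration:

import numpy as np

def diffusivity_at_virtual_mic(pressures, psis, dists_sm_to_ipls, dist_vm_to_ipls):
    # pressures: complex pressure signals P_i of the N spatial microphones (one time-frequency bin).
    # psis: diffusivity estimates Psi_i of the N spatial microphones, with 0 <= Psi_i <= 1.
    energies = np.abs(np.asarray(pressures)) ** 2
    psis = np.asarray(psis, dtype=float)
    e_dir_sm = (1.0 - psis) * energies           # assumed direct-energy estimates per microphone
    e_diff_sm = psis * energies                  # assumed diffuse-energy estimates per microphone
    # Diffuse energy is assumed equal everywhere: average the per-microphone estimates.
    e_diff_vm = np.mean(e_diff_sm)
    # Direct energy decays with 1/r^2: adjust each estimate to the virtual microphone position.
    ratios = np.asarray(dists_sm_to_ipls, dtype=float) / float(dist_vm_to_ipls)
    e_dir_vm = np.mean(e_dir_sm * ratios ** 2)
    # Diffusivity at the virtual microphone as the diffuse fraction of the total energy.
    return e_diff_vm / (e_diff_vm + e_dir_vm)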
[0174]
As mentioned above, in some cases the sound event position estimation carried out by the sound event position estimator fails, for example in the case of a wrong direction-of-arrival estimation. FIG. 25 illustrates such a scenario. In these cases, regardless of the diffusivity parameters estimated at the different spatial microphones and received as inputs 111 to 11N, the diffusivity 103 of the virtual microphone may be set to 1 (i.e., completely diffuse), since no spatially coherent reproduction is possible.
[0175]
Additionally, the reliability of the DOA estimates at the N spatial microphones may be considered. This may be expressed, for example, in terms of the variance of the DOA estimator or in terms of the SNR. Such information may be taken into account by the diffusivity sub-calculator 850, so that the diffusivity 103 of the VM can be artificially increased when the DOA estimates are unreliable. In fact, in that case the position estimates 205 will also be unreliable.
[0176]
FIG. 26 illustrates a virtual output signal generator 991 according to one embodiment. The virtual output signal generator 991 comprises a microphone positioning device 992 according to one of the above embodiments, which comprises a microphone position calculator 993. Furthermore, the virtual output signal generator comprises an audio output signal generator 994 according to one of the above embodiments. The output signal generated by the audio output signal generator 994 is the virtual output signal vos. The microphone position calculator 993 of the microphone positioning device 992 is configured to calculate the position of a microphone as a calculated microphone position cmp. The audio output signal generator 994 is configured to simulate a recording of a virtual microphone at the calculated microphone position calculated by the microphone positioning device 992. Thereby, the microphone positioning device 992 calculates the virtual position of the virtual microphone for the audio output signal generator 994.
[0177]
While several aspects have been described in the context of an apparatus, it will be appreciated
that the blocks or devices also correspond to method steps or features of method steps, and that
these aspects also represent a description of the corresponding method. Similarly, the aspects
described in connection with the method steps also represent a description of corresponding
blocks, details or features of the corresponding device.
[0178]
The decomposed signal according to the present invention can be stored on a digital storage medium or can be transmitted on a transmission medium, such as a wireless transmission medium or a wired transmission medium such as the Internet.
[0179]
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a flexible disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM (registered trademark) or a flash memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
[0180]
Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system such that one of the methods described herein is performed.
[0181]
Generally, embodiments of the present invention may be implemented as a computer program
product with program code, the program code being operable to perform one of the methods
when the computer program product is run on a computer.
The program code may, for example, be stored on a machine-readable carrier. Another embodiment comprises a computer program, stored on a machine-readable carrier, for performing one of the methods described herein.
[0182]
In other words, an embodiment of the inventive method is a computer program comprising
program code for performing one of the methods described herein when the computer program
runs on a computer.
[0183]
A further embodiment of the inventive method is, therefore, a data carrier (i.e., a digital storage medium or a computer-readable medium) comprising, recorded thereon, a computer program for performing one of the methods described herein.
[0184]
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example via the Internet.
[0185]
A further embodiment consists of processing means, for example a computer or programmable
logic device, configured or adapted to perform one of the methods described herein.
A further embodiment consists of a computer installed with a computer program for performing
one of the methods described herein.
[0186]
In some embodiments, programmable logic devices (eg, field programmable gate arrays) may be
used to perform some or all of the functions of the methods described herein. In some
embodiments, a field programmable gate array can cooperate with a microprocessor to perform
one of the methods described herein. Usually, those methods are suitably performed by any
hardware device.
[0187]
The embodiments described above are merely illustrative of the principles of the present
invention. It is understood that variations and modifications of the arrangements and details
described herein will be apparent to those skilled in the art. Accordingly, it is intended that the
invention not be limited by the specific details set forth in the description and illustration of the embodiments herein, but only by the scope of the appended claims.