DESCRIPTION JP2010130411

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
An object is to improve the accuracy of tracking a speaker. According to one embodiment, a multiple signal section estimation apparatus of the present invention includes a sensor unit, a speech signal section estimation unit, a speaker direction estimation unit, a face position detection unit, and an information integration unit. The speech signal section estimation unit takes as input the frequency spectrum obtained by frequency analysis of the acoustic signal from the microphones, and estimates the existence probability of speech over the entire region of a plane centered on the sensor unit. The speaker direction estimation unit estimates the existence probability of the speaker in each region using the frequency spectrum of the acoustic signal. The face position detection unit receives the video signal from the camera and estimates the existence probability of a discourse participant in each region based on the direction of the center of gravity of the participant's face. The information integration unit receives the speech existence probability, the speaker existence probability, and the discourse participant existence probability as input, and calculates, for each region, the probability that a discourse participant spoke there. [Selected figure] Figure 1
Multiple signal interval estimation apparatus and method and program therefor
[0001]
The present invention relates to a multiple signal interval estimation apparatus that estimates which participant spoke when from recorded data, such as audio and video of a meeting or conversation involving one or more participants, and to a method and program therefor.
[0002]
10-04-2019
1
If audio and video data of a discourse between people, such as a meeting or conversation, are recorded, and these data can be automatically analyzed and given appropriate indexes, efficient access to necessary information becomes possible, which will lead to the realization of technology for automatically generating conference proceedings and summaries.
The most basic information for this kind of automatic indexing is "who spoke when," which must be captured. To detect "when," it is necessary to exclude sections without speech from the observation data and to detect sections containing speech. In addition, to detect "who," it is necessary to classify which of the speech sections obtained by speech section detection belongs to which speaker.
[0003]
Such a technique is referred to as speaker determination technique, and is a technique for
classifying a speaker on a speech section obtained by performing speech section detection using
acoustic information recorded by a microphone array (non-patent document) 1) and techniques
for stochastically integrating speech zone detection and speaker classification techniques (NonPatent Document 2) have been proposed. Tranter, S. E. and Reynolds, D. A., "An overview of
automatic speaker diversion systems," IEEE Trans. On Audio, Speech, and Language Processing,
vol. 14, pp. 1557-1565, 2006. Araki, S., Fujimoto, M., Ishizuka, K., Sawada, H., and Makino, S. “A
DOA based speaker diarization system for real meetings,” “Proceedings of the 5th Joint
Workshop on Hands-free Speech Communication and Microphone Arrays, pp. 29-32, 2008.
[0004]
However, speaker diarization using only audio has the problem that accuracy degrades when a speaker moves without emitting speech. In that case, when a discourse participant speaks again after moving during a silent period, it is necessary to determine anew whether a speaker is present immediately after the speech begins. Methods have also been considered that improve speaker-tracking accuracy by using a video signal together with the audio information. Examples include a technique for tracking a person with high accuracy based on the presence or absence of the person's movement and of speech, a technique for prioritizing video or audio information according to the level of detection accuracy, a technique for detecting the position of a speaker using both a speech likelihood and an image likelihood, and a technique for recording a meeting by analyzing face images to estimate which person is drawing attention and detecting whether that person is speaking. However, such technologies deal deterministically with information obtained from elemental technologies such as speech section detection and moving-object detection, so a low-accuracy elemental technology propagates its errors to the later processing stages (becomes a bottleneck), and there was a problem that the performance of the whole system is reduced.
[0005]
The present invention has been made in view of this point, and its object is to provide a multiple signal interval estimation apparatus with improved speaker-tracking accuracy, obtained by combining information from an acoustic signal observed by microphones with information from a video signal observed by a camera, as well as a method and program therefor.
[0006]
A multiple signal section estimation apparatus according to the present invention includes a sensor unit, a speech signal section estimation unit, a speaker direction estimation unit, a face position detection unit, and an information integration unit.
The sensor unit includes a plurality of microphones and one or more cameras, and outputs an acoustic signal and a video signal. The speech signal section estimation unit takes as input the frequency spectrum obtained by frequency analysis of the acoustic signal from the microphones, and estimates the existence probability of speech over the entire region of a plane centered on the sensor unit. The speaker direction estimation unit estimates the existence probability of the speaker in each of the regions using the frequency spectrum. The face position detection unit receives the video signal from the camera, and estimates the existence probability of a discourse participant in each of the regions based on the direction of the center of gravity of the participant's face. The information integration unit receives the speech existence probability, the speaker existence probability, and the discourse participant existence probability as input, and calculates, for each region, the probability that a discourse participant spoke there.
[0007]
Conventional technology using only acoustic signals cannot track the position of a discourse participant who moves in the absence of speech. In this invention, since the face position detection unit estimates the existence probability of the discourse participant using the video signal as input, it becomes possible to track the participant's position even when the participant moves without speaking. It therefore becomes unnecessary to determine anew whether a speaker is present when speech resumes after movement, and the presence or absence of a speaker can be detected with high accuracy from the moment each speaker starts talking. In addition, since the information integration unit integrates the speech existence probability, the speaker existence probability, and the discourse participant existence probability to calculate the probability that a discourse participant spoke, the system is made less susceptible to overall performance degradation caused by a low-accuracy elemental technology.
[0008]
[Basic idea of the present invention] The multiple signal section estimation apparatus of the present invention divides a planar space centered on a sensor unit, which is provided with a plurality of microphones and one or more cameras, into R discrete regions. For each region r = 1, 2, ..., R, it introduces qr, a binary value representing the presence or absence of a discourse participant (qr = 0 means no discourse participant is present in region r; qr = 1 means a participant is present in region r), and ar, a binary value representing the presence or absence of speech (ar = 0 means no speech in region r; ar = 1 means speech is present in region r). Also, let Xr be the frequency spectrum of the acoustic signal obtained from region r, Dr the spatial power distribution of the acoustic signal obtained from region r, and Vr the observed video signal obtained from region r. Given these observations, the conditional probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that qr = 1 and ar = 1 is computed, and threshold processing is applied to it to estimate when, and in which direction viewed from the sensor unit, speech occurred. Note that the threshold used in this threshold processing and the value of R used to divide the planar space centered on the sensor unit into R regions may, for example, be recorded in advance in a storage unit (not shown in FIG. 1) within the multiple signal section estimation apparatus.
[0009]
The method of estimating speech with this conditional probability is described below as the basic idea of the present invention. Given the frequency spectrum Xr, the spatial power distribution Dr, and the observed video signal Vr, and assuming that they are mutually independent, the conditional probability p (ar = 1, qr = 1 | Xr, Dr, Vr) can be written as equation (1).
[0010]
[0011]
Here, assuming equation (2), the conditional probability p can be expressed by equation (3).
[0012]
Further, assuming equation (4), this can be expressed by equation (5).
Applying Bayes' theorem to equation (5) and assuming that the prior probabilities p (qr = 1) and p (ar = 1) are both constant, the approximation of equation (6) holds.
The conditional probability p takes a larger value the higher the probability that a discourse participant is in region r and the higher the probability that there is an utterance in region r.
[0013]
[0014]
According to the multiple signal interval estimation apparatus of the present invention, the value obtained by calculating and multiplying the speech existence probability p (ar = 1 | Xr), the speaker existence probability p (ar = 1, qr = 1 | Dr), and the discourse participant existence probability p (qr = 1 | Vr) is taken as an approximation of the conditional probability that a discourse participant is present and speaking.
Then, the speaker is identified by applying threshold processing to this approximate value of the conditional probability. The result of the threshold processing is a determination of whether a discourse participant is present and speaking in region r; by then performing the classification shown in equation (20) described later, the speaker can be identified. Thus, according to the present invention, the discourse participant existence probability p (qr = 1 | Vr) determined from the video signal Vr is integrated with the speech existence probability p (ar = 1 | Xr) determined from the acoustic signal and with the speaker existence probability p (ar = 1, qr = 1 | Dr) to obtain the probability of speaking, so it becomes possible to track the position of a discourse participant even when the participant moves without speaking. In addition, because all three probability values are calculated and integrated before any decision is made (rather than applying judgment processing such as thresholding each time a single probability value is calculated), low reliability in any one probability value does not become a bottleneck.
[0015]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings. The same reference numerals are assigned to the same components in the drawings,
and the description will not be repeated.
[0016]
FIG. 1 shows an example of the functional configuration of the multiple signal section estimation apparatus 100 of the present invention. FIG. 2 shows the operation flow. The multiple signal section estimation apparatus 100 includes a sensor unit 3, a speech signal section estimation unit 4, a speaker direction estimation unit 5, a face position detection unit 6, and an information integration unit 7. Each unit other than the sensor unit 3 is realized, for example, by reading a predetermined program into a computer provided with a ROM, a RAM, a CPU, and the like, and having the CPU execute the program.
[0017]
The sensor unit 3 includes a plurality of microphones 1 and one or more cameras 2 and outputs
an audio signal and an image signal (step S3, FIG. 2). The acoustic signal is, for example, a digital
signal obtained by sampling at 16 kHz the sound collected by three microphones arranged on the
same horizontal plane. The video signal is, for example, a digital signal of 30 frames / second,
which is photographed by one or more cameras arranged so as to be able to photograph an
omnidirectional space around the sensor unit 3.
[0018]
The observation signals, that is, the audio signal and the video signal, are cut out frame by frame by multiplying a signal of, for example, 32 ms time length by a window function while moving along the time axis in steps of 16 ms. For example, the Hanning window w (n) shown in equation (7) is multiplied and cut out.
[0019]
[0020]
Here, n represents the n-th sample point, and L represents the number of sample points of the
cutout waveform.
L is, for example, 512 points. Discrete Fourier transform is performed on the observation signal
cut out as this frame, and the signal waveform in the time domain is converted to the frequency
spectrum in the frequency domain. Assuming that the observation signal of the i-th frame is xi
(n), its frequency spectrum Xi (k) can be obtained by equation (8).
[0021]
[0022]
Here, j represents the imaginary unit, and k represents a discrete point (frequency bin) obtained by dividing the sampling frequency equally into K parts.
As K, a value at least as large as the frame length L, for example 512, is used. In FIG. 1, the A/D converters provided between the sensor unit 3 and the speech signal section estimation unit 4, the speaker direction estimation unit 5, and the face position detection unit 6, as well as the discrete Fourier transform means, are omitted.
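The framing and transform of equations (7) and (8) can be sketched as follows. This is an illustrative reading using standard NumPy routines, with the 16 kHz sampling rate and the 32 ms frame / 16 ms shift parameters given in the text.

```python
import numpy as np

def frame_spectra(x, frame_len=512, shift=256):
    """Cut the signal into overlapping frames, apply a Hanning window
    (equation (7)), and take the DFT of each frame (equation (8))."""
    w = np.hanning(frame_len)                      # w(n), n = 0 .. L-1
    n_frames = (len(x) - frame_len) // shift + 1
    spectra = []
    for i in range(n_frames):
        xi = x[i * shift : i * shift + frame_len]  # i-th frame x_i(n)
        spectra.append(np.fft.fft(w * xi))         # X_i(k), K = frame_len bins
    return np.array(spectra)

# 16 kHz sampling: a 32 ms frame is 512 samples, a 16 ms shift is 256 samples
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                    # 1 s test tone
X = frame_spectra(x)
print(X.shape)                                     # (number of frames, K)
```

The frequency of bin k is then k * fs / K, as used in the direction estimation below.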
[0023]
The speech signal section estimation unit 4 takes the frequency-analyzed acoustic signal as input, divides a plane centered on the sensor unit 3 into a plurality of regions r, and estimates the speech existence probability p (ar = 1 | Xi (k)) in each region using the frequency spectrum Xi (k) of the acoustic signal (step S4). The speaker direction estimation unit 5 takes the frequency-analyzed acoustic signal as input and estimates the speaker existence probability p (ar = 1, qr = 1 | Dr) in each region using the spatial power distribution of the acoustic signal (step S5).
[0024]
The face position detection unit 6 takes the video signal as input and estimates the discourse participant existence probability p (qr = 1 | Vr) in each region based on the direction of the center of gravity of the participant's face (step S6). The information integration unit 7 takes the speech existence probability p (ar = 1 | Xr), the speaker existence probability p (ar = 1, qr = 1 | Dr), and the discourse participant existence probability p (qr = 1 | Vr) as input, and calculates, for each region, the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that a discourse participant spoke there (step S7).
[0025]
Because the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that a discourse participant spoke is calculated using the video signal, it becomes possible to track a participant's position even when the participant moves without speaking. Next, a specific method of obtaining each probability value will be described.
[0026]
[Speech Signal Section Estimation Unit] FIG. 3 shows an example of the functional configuration of the speech signal section estimation unit 4. The speech signal section estimation unit 4 includes a priori / a posteriori SN ratio calculation means 40, likelihood ratio Λ calculation means 41, and speech presence probability calculation means 42. The a priori / a posteriori SN ratio calculation means 40 estimates the power λi <N> (k) of the noise signal in frame i from the frequency spectrum Xi (k) using, for example, a Kalman filter, and from it determines the a posteriori signal-to-noise ratio (a posteriori SN ratio) γi (k) shown in equation (9) and the a priori signal-to-noise ratio (a priori SN ratio) ξi (k) shown in equation (10).
[0027]
[0028]
The likelihood ratio Λ calculation means 41 outputs a likelihood ratio Λ indicating the degree of presence of the target signal, using the a posteriori SN ratio γi (k) and the a priori SN ratio ξi (k) obtained above.
The likelihood ratio Λ can be expressed as the ratio of the likelihood p (Xi (k) | H0) that the i-th frame at frequency k does not contain the target signal to the likelihood p (Xi (k) | H1) that it contains the target signal in addition to the noise. Each likelihood is defined by equations (11) and (12).
[0029]
[0030]
Here, λi <S> (k) is the power at the frequency k of the target signal in the i-th frame.
By calculating the ratio of these likelihoods, the likelihood ratio Λi (k) is calculated (Equation
(13)).
[0031]
[0032]
Here, the definition of the a priori SN ratio ξi (k) is shown in equation (14).
Since λi <S> (k) cannot be obtained directly, the a priori SN ratio ξi (k) is obtained by the means described above. The likelihood ratio Λ calculation means 41 outputs as the likelihood ratio Λi, for example, the value obtained by averaging the likelihood ratios Λi (k) thus obtained over all frequencies k (equation (15)).
[0033]
[0034]
The speech presence probability calculation means 42 receives the likelihood ratio Λi and estimates the speech presence probability over the entire region by equation (16).
[0035]
[0036]
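The likelihood-ratio test of paragraphs [0026] to [0034] can be sketched as below. The equation images (9) to (16) are not reproduced in this text, so the standard Gaussian statistical-model likelihood ratio (a Sohn-style voice activity detector) is assumed here, with p = Λ / (1 + Λ) assumed for equation (16); the a priori SN ratio is estimated crudely instead of with a Kalman filter.

```python
import numpy as np

def speech_presence_prob(X, noise_power):
    """Sketch of the likelihood-ratio speech presence estimate.
    X: complex frequency spectra, one row per frame."""
    gamma = np.abs(X) ** 2 / noise_power                 # a posteriori SN ratio, eq (9)
    xi = np.maximum(gamma - 1.0, 1e-3)                   # crude a priori SN ratio, eqs (10)/(14)
    # per-bin likelihood ratio Lambda_i(k), eq (13); exponent clipped to avoid overflow
    lam_k = np.exp(np.minimum(gamma * xi / (1.0 + xi), 50.0)) / (1.0 + xi)
    lam = np.mean(lam_k, axis=-1)                        # average over all k, eq (15)
    return lam / (1.0 + lam)                             # presence probability, eq (16) assumed

rng = np.random.default_rng(0)
noise = rng.normal(size=(2, 512)) + 1j * rng.normal(size=(2, 512))
speech = 10 * noise                                      # same frames at 100x the power
p_noise = speech_presence_prob(noise, 2.0)
p_speech = speech_presence_prob(speech, 2.0)
print(p_noise, p_speech)                                 # probability is higher for the loud frames
```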
[Speaker Direction Estimation Unit] FIG. 4 shows an example of a functional configuration of the
speaker direction estimation unit 5.
The speaker direction estimation unit 5 includes a signal arrival direction calculation unit 50, a
classification unit 51, and a speaker presence probability calculation unit 52.
Signal arrival direction calculation means 50 receives the frequency spectrum Xi (k) and
calculates the signal arrival direction at each frequency bin (k).
First, the arrival time difference τi <m> (k) of the acoustic signal shown in equation (17) and the arrival time difference vector τi (k) <→> shown in equation (18) are determined. (The notation of the variable names is as given in the expressions.)
[0037]
[0038]
Here, f is the frequency (Hz) of frequency bin k.
Xi <m> (k) is the frequency spectrum of the signal observed by microphone m (m = 1 ... M).
Let the 0th microphone be the reference microphone and let expression (19) give the distance vectors between the reference microphone and the other microphones; then, using the arrival time difference vector τi (k) <→> and the relationship shown in equation (19), the azimuth angle θi (k) and elevation angle φi (k) of the incoming acoustic signal can be determined. Here, M is the total number of microphones. The value of M may be determined in advance and recorded in the recording unit of the multiple signal section estimation apparatus 100, and the speaker direction estimation unit 5 may read the total number M of microphones from the recording unit. Likewise, which microphone to use as the reference microphone is determined in advance, information specifying the reference microphone is recorded in advance in the recording unit, and the speaker direction estimation unit 5 reads that information.
[0039]
Here, vs is the speed of sound (about 344 m / sec), and D <-1> is a generalized inverse matrix of
D.
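The direction-of-arrival computation of equations (17) to (19) can be sketched as follows. Since the equation images are not reproduced, the phase-difference reading of equation (17) and the sign conventions are assumptions; the sketch is planar (azimuth only), as paragraph [0040] adopts for simplicity.

```python
import numpy as np

VS = 344.0  # speed of sound (m/s), paragraph [0039]

def doa_azimuth(X_mics, k, fs, mic_pos):
    """Azimuth of arrival at frequency bin k.
    X_mics: one frame's spectra, one row per microphone (row 0 = reference).
    mic_pos: planar microphone coordinates in meters (row 0 = reference)."""
    K = X_mics.shape[1]
    f = k * fs / K                                       # frequency of bin k (Hz)
    # arrival time differences vs. the reference mic from phase differences,
    # eqs (17)-(18); phase wrapping is ignored for small arrays
    phase = np.angle(X_mics[1:, k] * np.conj(X_mics[0, k]))
    tau = phase / (2 * np.pi * f)
    D = mic_pos[1:] - mic_pos[0]                         # distance vectors, eq (19)
    u = VS * np.linalg.pinv(D) @ tau                     # D^{-1} is the generalized inverse
    return np.degrees(np.arctan2(u[1], u[0]))            # azimuth theta_i(k)

# three mics at the vertices of a 4 cm equilateral triangle (paragraph [0065])
mic_pos = 0.04 * np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
# simulate a far-field tone arriving from azimuth 60 degrees
fs, K, k = 16000, 512, 40
f = k * fs / K
direction = np.array([np.cos(np.radians(60)), np.sin(np.radians(60))])
arrival = -(mic_pos @ direction) / VS                    # nearer mics receive earlier
X = np.zeros((3, K), dtype=complex)
X[:, k] = np.exp(-2j * np.pi * f * arrival)
theta = doa_azimuth(X, k, fs, mic_pos)
print(f"{theta:.1f} degrees")
```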
[0040]
The signal arrival direction calculation means 50 outputs the arrival direction of the acoustic signal thus obtained as the azimuth angle θi (k) and the elevation angle φi (k).
Hereinafter, only the azimuth angle is used as the signal arrival direction for the sake of simplicity.
[0041]
The classification unit 51 classifies the azimuth angle θi (k) as shown in the equation (20).
[0042]
Here, θn represents the center of gravity of the cluster representing the n-th speaker.
The threshold is given externally, for example 15 degrees. The threshold may also be determined in advance and recorded in the recording unit of the multiple signal section estimation apparatus 100, and the classification unit may read it from the recording unit. Each cluster can also be generated based on the spatial power distribution estimated in frames having a high probability of the presence of speech, as described later.
[0043]
The speaker presence probability calculating means 52 receives each cluster Cn (θi (k)) and
calculates the speaker's presence probability p (ar = 1, qr = 1 | Dr) by equation (21).
[0044]
[0045]
Here, K represents the total number of frequency bins of the frequency spectrum obtained by the discrete Fourier transform.
Cn represents the cluster in which the n-th speaker is present.
For example, if the n-th speaker exists in regions r1 to r2, regions r1 to r2 constitute Cn.
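The classification of equation (20) and the speaker presence probability of equation (21) can be sketched as below. The exact form of equation (21) is not reproduced in the text, so a simple normalized bin count (the fraction of the K frequency bins whose azimuth falls in cluster Cn) is assumed.

```python
import numpy as np

def speaker_presence_prob(theta_k, centroids, threshold=15.0):
    """Assign each bin's arrival azimuth theta_i(k) to the nearest cluster
    centroid theta_n within the threshold (eq (20)), then estimate each
    speaker's presence probability as that cluster's share of the K bins
    (assumed reading of eq (21))."""
    theta_k = np.asarray(theta_k, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    K = len(theta_k)
    probs = np.zeros(len(centroids))
    for theta in theta_k:
        d = np.abs(centroids - theta)
        n = np.argmin(d)
        if d[n] <= threshold:            # within the 15-degree threshold -> cluster C_n
            probs[n] += 1.0 / K          # normalized count over the K bins
    return probs

theta_k = [28.0, 31.0, 33.0, 29.0, 119.0, 32.0, 200.0]   # per-bin azimuths (degrees)
p = speaker_presence_prob(theta_k, centroids=[30.0, 120.0])
print(p)    # most bins fall in the first speaker's cluster; 200 matches no cluster
```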
[0046]
[Face Position Detection Unit] FIG. 5 shows an example of the functional configuration of the face position detection unit 6. The face position detection unit 6 includes face position detection / tracking means 60 and discourse participant presence probability calculation means 61, and estimates the existence probability p (qr = 1 | Vr) of the discourse participant using the video signal as input.
[0047]
The face position detection / tracking means 60 takes as input, for example, a video signal covering all directions captured by an omnidirectional camera equipped with two fisheye lenses, and outputs the direction θn of the center of gravity of each discourse participant's face. The center-of-gravity direction of a participant's face can be determined, for example, by using the face detection and tracking method based on template matching and a particle filter described in the reference "Mateo Lozano, O. and Otsuka, K., 'Simultaneous and fast 3D tracking of multiple faces in video sequences by using a particle filter,' J. Signal Processing Systems, DOI 10.1007/s11265-008-0250-2, in press".
[0048]
The discourse participant presence probability calculation means 61 receives the center-of-gravity direction θn of the discourse participant's face as input, and calculates the existence probability p (qr = 1 | Vr) using the Gaussian distribution function N (θn (τ), σ <2>), as shown in equation (22).
[0049]
[0050]
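The Gaussian model of equation (22) can be sketched as follows. The original equation is not reproduced, so the value of σ, the peak normalization, and the use of a maximum over detected faces are all illustrative assumptions: the presence probability for a region direction θr is high when some face's center-of-gravity direction θn is close to it.

```python
import math

def face_presence_prob(theta_r, theta_faces, sigma=10.0):
    """Presence probability of a discourse participant in region direction
    theta_r, using a Gaussian N(theta_n, sigma^2) around each detected
    face direction theta_n (assumed reading of eq (22))."""
    def gauss(x, mu):
        # unnormalized Gaussian, peaking at 1 in a face's direction
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return max(gauss(theta_r, mu) for mu in theta_faces)

faces = [45.0, 160.0]                      # face centroid directions (degrees)
print(face_presence_prob(45.0, faces))     # region aligned with a face
print(face_presence_prob(90.0, faces))     # far from both faces
```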
[Information Integration Unit] The information integration unit 7 takes as input the speech existence probability estimated by the speech signal section estimation unit 4, the speaker existence probability estimated by the speaker direction estimation unit 5, and the discourse participant existence probability estimated by the face position detection unit 6, and integrates these probability values to calculate the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that a discourse participant spoke in the specific region.
[0051]
The probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that the discourse participant spoke in the specific region is determined, for example, by multiplying the probabilities as shown in equation (23).
[0052]
[0053]
Alternatively, a weight may be given to each probability value according to its reliability, as shown in equation (24).
[0054]
Further, the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that the discourse participant spoke may be obtained as a sum of logarithms, as shown in equation (25).
[0055]
[0056]
FIG. 6 shows an example of the functional configuration of the multiple signal interval estimation
apparatus 160 according to the second embodiment of the present invention.
The multiple signal section estimation apparatus 160 modifies the operations of the speaker direction estimation unit 5 and the speech signal section estimation unit 4 of the first embodiment.
[0057]
The speaker direction estimation unit 60 of the multiple signal section estimation apparatus 160 divides the plane centered on the sensor unit 3 into a plurality of regions, calculates the spatial power distribution of the acoustic signal in each region, and estimates the speaker existence probability in each region.
The speaker direction estimation unit 60 first uses the azimuth angle θi (k) output from the signal arrival direction calculation means 50 to generate the time-frequency mask Maski (k, r) shown in equation (26), which extracts only the signal arriving from a certain range.
[0058]
[0059]
Here, Θr represents a certain range of arrival directions of the signal to be extracted, and r is an index representing that range (r = 1 ... R).
To extract a signal in the range, for example, a = 0 and b = 1 are used.
[0060]
Next, using the frequency spectrum Xi (k) and the time-frequency mask Maski (k, r), the power Pi (r) of the acoustic signal arriving from range r, which is used to estimate the signal power distribution in space (the spatial power distribution), can be calculated by equation (27).
[0061]
The space power distribution is estimated by calculating equation (27) for the entire region.
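The masking and power summation of equations (26) and (27) can be sketched as below. Summing the masked power |Xi(k)|^2 over the bins is an assumed reading of equation (27), whose image is not reproduced in the text.

```python
import numpy as np

def spatial_power(X, theta_k, regions):
    """Build the binary time-frequency mask Mask_i(k, r) that selects bins
    whose arrival azimuth falls in region r (a = 0, b = 1, eq (26)), then
    sum the masked spectral power to get P_i(r) (assumed eq (27))."""
    P = np.zeros(len(regions))
    for r, (lo, hi) in enumerate(regions):
        mask = (theta_k >= lo) & (theta_k < hi)      # Mask_i(k, r), eq (26)
        P[r] = np.sum(mask * np.abs(X) ** 2)         # P_i(r), eq (27)
    return P

regions = [(0, 90), (90, 180), (180, 270), (270, 360)]   # R = 4 azimuth ranges
theta_k = np.array([10.0, 45.0, 100.0, 310.0])           # per-bin azimuths
X = np.array([1.0, 2.0, 3.0, 1.0])                       # per-bin amplitudes
P = spatial_power(X, theta_k, regions)
print(P)    # power concentrated in the regions the bins point to
```

Computing P over all r yields the spatial power distribution used in place of Dr.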
[0062]
The speech signal section estimation unit 61 estimates the speech existence probability by performing the same calculation as in the first embodiment using the spatial power distribution.
The information integration unit 7 of the second embodiment determines the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that a discourse participant spoke, using the spatial power distribution.
[0063]
The idea of generating the time-frequency mask Maski (k, r) may also be introduced into the speech signal section estimation unit 4 of the first embodiment to estimate the speech presence probability for each region r.
In this case, since the speech signal section estimation unit 4 also estimates the probability per region r, an improvement in the estimation accuracy of the speech probability of the multiple signal section estimation apparatus 100 can be expected.
[0064]
In addition, a target signal presence / absence determination unit 8 may be provided to determine whether a discourse participant is speaking in each region r, using the probability output by the information integration unit 7.
The target signal presence / absence determination unit 8 holds an utterance threshold T for determining the presence or absence of an utterance: if the probability p (ar = 1, qr = 1 | Xr, Dr, Vr) that the discourse participant spoke exceeds the utterance threshold T, it outputs "1", meaning that the participant is speaking in region r; otherwise it outputs "0", meaning there is no utterance.
Providing such a target signal presence / absence determination unit 8 makes the apparatus easier to use as a multiple signal section estimation apparatus.
The utterance threshold T may be a fixed value or a value that changes with time.
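The thresholding performed by the determination unit 8 reduces to a one-line rule; the value 0.5 for T below is purely illustrative, since the text leaves T open (fixed or time-varying).

```python
def detect_utterance(p, threshold_T=0.5):
    """Target signal presence/absence determination ([0064]): output 1 if
    the integrated probability p(ar=1, qr=1 | Xr, Dr, Vr) for region r
    exceeds the utterance threshold T, else 0. T = 0.5 is an assumption."""
    return 1 if p > threshold_T else 0

print(detect_utterance(0.7))   # participant speaking in region r
print(detect_utterance(0.3))   # no utterance
```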
[0065]
[Evaluation Experiment] To confirm the effect of the present invention, we conducted an evaluation experiment in which an audio signal and a video signal observed using three microphones and two cameras were analyzed by the multiple signal section estimation apparatus 100 of the present invention.
The experimental conditions are as follows.
FIG. 7 shows the recording environment of the audio and video signals.
We recorded the audio and video signals of a four-person conversation around a round table 70 in a conference room with a reverberation time of about 350 ms. Three omnidirectional microphones 1a, 1b, and 1c were arranged at the center of the circular table 70, for example at the vertices of an equilateral triangle with 4 cm sides, and two cameras 2a and 2b equipped with fisheye lenses were placed centered on the equilateral triangle so as to cover all directions.
[0066]
The sampling rate of the audio signal is 16 kHz, and the video signal is 30 frames per second. The frame length for signal analysis is 64 ms and the frame shift is 32 ms. The threshold used for speaker classification was 15 degrees. The speaker diarization error rate (DER) was used as the evaluation measure. DER was calculated by equation (28) as the sum of three error times, namely false-alarm speech time (FST), missed speech time (MST), and speaker error time (SET), divided by the total speech time.
[0067]
[0068]
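Equation (28) as described in the text can be sketched directly; the numeric values below are illustrative, not experimental results from the document.

```python
def diarization_error_rate(fst, mst, set_time, total_speech_time):
    """Speaker diarization error rate, eq (28): the sum of false-alarm
    speech time (FST), missed speech time (MST), and speaker error time
    (SET), divided by the total speech time."""
    return (fst + mst + set_time) / total_speech_time

# e.g. 12 s false alarms, 20 s misses, 8 s speaker confusions over 400 s of speech
print(diarization_error_rate(12.0, 20.0, 8.0, 400.0))   # a DER of 10%
```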
The results are shown in Table 1.
[0069]
With the method of the present invention, the speaker diarization error rate DER improved by 3.5%.
The result output from the target signal presence / absence determination unit 8 in this case is shown in FIG. 8.
The horizontal axis in FIG. 8 is time (seconds), and the vertical axis is direction (degrees). The symbol ● indicates that an utterance is present.
[0070]
The multiple signal section estimation apparatus and method of the present invention described above are not limited to the above embodiments, and various modifications can be made without departing from the scope of the present invention. For example, instead of using a time-frequency mask to estimate the spatial power distribution, a spatial spectrum obtained by a delay-and-sum method or the like (reference: Jojiro Oga, Yoshio Yamazaki, Yutaka Kanada, "Sound System and Digital Processing," The Institute of Electronics, Information and Communication Engineers) may be used.
[0071]
In addition, the processing described for the above apparatus and method need not only be executed chronologically in the order of description; it may also be executed in parallel or individually according to the processing capability of the executing apparatus or as needed.
[0072]
Further, when the processing means of the above apparatus are realized by a computer, the processing content of the functions that each apparatus should have is described by a program. Then, by executing this program on the computer, the processing means of each apparatus are realized on the computer.
[0073]
The program describing the processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk drive, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable) / RW (Rewritable) as the optical disc; an MO (Magneto-Optical Disc) as the magneto-optical recording medium; and a flash memory as the semiconductor memory.
[0074]
Further, this program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
[0075]
Further, each means may be configured by executing a predetermined program on a computer, or at least a part of its processing content may be realized as hardware. Values that may be determined in advance and recorded in the recording unit of the multiple signal interval estimation apparatus may instead, for example, be input from outside the apparatus via an input / output unit (not shown in FIG. 1). Alternatively, each processing function that acquires input values from the outside and uses them may acquire the input values via the input unit and record them in a memory or the like within that processing function.
[0076]
FIG. 1 is a diagram showing a functional configuration example of the multiple signal section estimation apparatus 100 of the present invention. FIG. 2 is a diagram showing the operation flow of the multiple signal section estimation apparatus 100. FIG. 3 is a diagram showing an example of the functional configuration of the speech signal section estimation unit 4. FIG. 4 is a diagram showing an example of the functional configuration of the speaker direction estimation unit 5. FIG. 5 is a diagram showing an example of the functional configuration of the face position detection unit 6. FIG. 6 is a diagram showing an example of the functional configuration of the multiple signal section estimation apparatus 160. FIG. 7 is a diagram showing the recording environment of the audio and video signals in the evaluation experiment. FIG. 8 is a diagram showing the result output by the target signal presence / absence determination unit 8 in the evaluation experiment.