DESCRIPTION JP2016181770
Abstract: When directivity is to be formed toward a location, or the vicinity of a location, for which directivity has previously been formed using an adaptive filter, the convergence time of the filter coefficients is shortened so that directivity can be formed quickly. A directivity forming unit (11) forms directivity, for the sound collected by a microphone array device (MA), in the pointing direction designated by a user. When a history point RP exists in the vicinity of the position designated by the user, the directivity forming unit 11 reads the pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) corresponding to that history point from the history table 40, sets the filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and performs directivity formation processing while learning the filter coefficients of the adaptive filters. [Selected figure] Figure 3
Sound pickup system
[0001]
The present invention relates to a sound pickup system for picking up sound.
[0002]
Conventionally, in a surveillance system installed at a predetermined position (for example, a ceiling or a wall) of a store (for example, a retail store or a bank), a shopping street, or a public place (for example, a station or a library), a plurality of camera devices are connected via a network, and video data (still images and moving images) of a predetermined range to be monitored is viewed on a monitoring device installed at one place. The same applies hereinafter.
[0003]
However, with video-only monitoring there is a limit to the amount of information that can be obtained, and there is an increasing demand for monitoring systems that also obtain audio data in order to perform monitoring by voice.
[0004]
In response to this demand, some camera devices are equipped with a microphone, superimpose audio data on video data, and transmit them over the network. However, the microphones used here are mostly omnidirectional, and even the unidirectional ones have wide-angle directional characteristics, so the sound that the user wants to hear is often buried in noise.
[0005]
In a microphone array, as a beam forming method for directing directivity in a target direction, there is a classical method called the delay-and-sum method, which utilizes the difference in arrival time of sound at each microphone. This method has the advantage that its characteristics are stable, but many microphones are required to obtain sufficient performance, so the scale of the entire system becomes large. On the other hand, an adaptive array, represented for example by the Griffith-Jim type array, has the advantage that large suppression performance can be realized with a small number of microphones; however, since the filter characteristics are adaptively updated based on the surrounding voice and noise, performance degrades when the environment deviates from the assumed one, the optimum filter characteristics must be relearned, and it takes a long time for the suppression performance to improve.
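To make the delay-and-sum method concrete, the following is a minimal sketch (not part of the patent): each microphone signal is delayed by the arrival-time difference for the target direction, and the aligned signals are averaged. The plane-wave model, array geometry, sampling rate, and function names are assumptions for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def delay_and_sum(signals, mic_positions, direction, fs):
    """Delay-and-sum beamformer (illustrative sketch).

    signals:       (n_mics, n_samples) array of microphone signals
    mic_positions: (n_mics, 3) microphone coordinates in meters
    direction:     unit vector pointing from the array toward the source
    fs:            sampling rate in Hz
    """
    n_mics, n_samples = signals.shape
    # Arrival-time differences of a plane wave at each microphone
    delays = mic_positions @ direction / SPEED_OF_SOUND  # seconds
    delays -= delays.min()  # make all delays non-negative
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(delays[m] * fs))  # integer-sample delay for simplicity
        out[shift:] += signals[m, :n_samples - shift]
    return out / n_mics
```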
[0006]
On the other hand, there is a known technique in which a filter coefficient for a specific position is obtained in advance; when the sound collection direction is that specific position, a beam is formed using the filter coefficient obtained in advance, and at other positions the filter coefficients are updated adaptively, switching between the two modes, adaptive and fixed (see, for example, Patent Document 1).
[0007]
Japanese Patent Publication No. 2003-514481
[0008]
However, in the configuration of Patent Document 1, when the noise environment at a specific position changes, for example due to a change in people's traffic or a change in the store interior, the best result is not guaranteed even if beam formation is performed with the filter coefficient obtained in advance.
In addition, when the adaptive mode operates, there is no solution to the fact that it takes time for the suppression performance to increase.
[0009]
The present invention has been made in view of the above-described conventional situation, and it is an object of the present invention to provide a sound collection system that, when forming directivity toward a location, or the vicinity of a location, having a history in which directivity was formed using an adaptive filter, shortens the convergence time of the filter coefficients and forms directivity quickly.
[0010]
The present invention provides a sound collection system comprising: a sound collection unit having a plurality of microphones and configured to collect sound; an operation unit through which a pointing direction indicating the emphasis direction of the sound is designated; an emphasizing processing unit for emphasizing, using an adaptive filter, the sound of the pointing direction designated by the operation unit from the sound data collected by the sound collection unit; and a storage unit for storing the filter coefficients of the adaptive filter in association with the pointing direction, wherein the emphasizing processing unit performs emphasis processing on the sound data using the filter coefficients stored in the storage unit that correspond to a direction within a predetermined range including the pointing direction designated by the operation unit.
[0011]
According to the present invention, when directivity is formed toward a location, or the vicinity of a location, having a history in which directivity was formed using an adaptive filter, the convergence time of the filter coefficients is shortened and directivity can be formed quickly.
[0012]
Block diagram showing the configuration of the sound collection system according to the first embodiment
Block diagram showing the configuration of the voice processing device
Diagram showing the configuration of the microphone array device and the directivity forming unit
Diagram showing the configuration of the adaptive filter
Diagram showing the registered contents of the history table
Diagram showing the screen of the display on which the voice map is displayed
Flowchart showing a sound collection processing procedure
Diagram showing the screen of the display on which the voice map including the pointing direction designated by the user is displayed
Diagram showing the registered contents of the history table in which the pointing direction and filter coefficients of the history point closest to the position (pointing direction) designated by the user are selected
Flowchart showing the automatic update procedure of the history table
Block diagram showing the configuration of the sound collection system in the second embodiment
Diagram showing the configuration of the audio processing unit
Diagram for explaining the combination of the camera device and the microphone array device installed coaxially
Schematic diagram showing an example of usage of the sound collection system
Flowchart showing a sound collection processing procedure
Diagram showing the omnidirectional image displayed on the screen of the display
Schematic diagram showing an example of usage of the sound collection system in the modification of the second embodiment
Explanatory diagram of the operation outline of the sound collection system when a person is designated in the image displayed on the display
Block diagram showing the configuration of the sound collection system in the third embodiment
Diagram showing the registered contents of the preset table
Flowchart showing a sound collection processing procedure
Flowchart showing a sound collection processing procedure continued from the preceding flowchart
Diagram of the screen of the display and the sound output operation of the speaker
Diagram for explaining the relationship between sound production and adaptive filter learning
Flowchart showing a tracking processing procedure by the video processing unit
Flowchart showing a directivity processing procedure by the audio processing unit
Flowchart showing a learning processing procedure of the adaptive filter
Diagram for explaining a learning operation for a target person on an omnidirectional image
[0013]
Hereinafter, embodiments of the sound collection system according to the present invention will
be described with reference to the drawings.
The sound collection system of the present embodiment uses an adaptive array (adaptive microphone array), which automatically forms a directivity characteristic according to the ambient noise environment, as an example of the enhancement processing.
An adaptive microphone array can achieve large suppression performance with a small number of microphones.
[0014]
First Embodiment FIG. 1 is a block diagram showing a configuration of a sound collection system
5 according to a first embodiment.
The sound collection system 5 is installed at, for example, a store, and has a configuration in
which the microphone array device MA and the voice processing device 10 are connected via the
network 8.
[0015]
The microphone array device MA, as an example of the sound collection unit, is installed on the ceiling of a store as an adaptive microphone array. It is an omnidirectional microphone array in which a plurality (eight in this case) of microphones M1 to Mn (see FIG. 3) are arranged concentrically facing downward, and it can pick up sound in the store.
The microphone array device MA picks up sound around the target area using the microphones M1 to Mn, and transmits the sound data picked up by the microphones M1 to Mn to the voice processing device 10 via the network 8.
Each of the microphones M1 to Mn may be a nondirectional microphone, a bidirectional microphone, a unidirectional microphone, a sharp directional microphone, or a combination thereof.
[0016]
FIG. 2 is a block diagram showing the configuration of the voice processing device 10. The voice processing device 10 forms directivity in the pointing direction designated by the user and outputs the sound collected by the microphone array device MA. The voice processing device 10 includes a communication unit 15, a pointing direction calculation unit 12, a directivity forming unit 11, a filter coefficient holding unit 14, a history detection / update unit 13, an input / output control unit 16, an operation unit 17, a display 18, and a voice output unit 19.
[0017]
The communication unit 15 receives a packet of audio data transmitted from the microphone
array device MA via the network 8 and outputs the packet to the directivity forming unit 11.
[0018]
The pointing direction calculation unit 12 calculates the sound source direction as seen from the microphone array device MA based on the position on the voice map (see FIG. 6) input from the operation unit 17.
In the present embodiment, the sound source direction is represented by a horizontal angle θ and a vertical angle φ centered on the microphone array device MA. The horizontal angle θ is an angle in the horizontal plane (X-Y plane) with the center of the microphone array device MA as the origin, and the vertical angle φ is the inclination from the Z axis passing through the center of the microphone array device MA. For example, when the sound source is near the microphone array device MA, that is, near the Z axis, the vertical angle φ is detected as a small value.
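As an illustration of this angle convention (a hypothetical helper, not taken from the patent), a designated position relative to the center of the microphone array device MA can be converted to (θ, φ) as follows:

```python
import math

def pointing_direction(x, y, z):
    """Convert a designated position (x, y, z), relative to the center of the
    microphone array device MA, into (theta, phi) in degrees.
    theta: horizontal angle in the X-Y plane; phi: inclination from the Z axis."""
    theta = math.degrees(math.atan2(y, x)) % 360.0
    r = math.sqrt(x * x + y * y)
    phi = math.degrees(math.atan2(r, z))  # 0 deg on the Z axis, 90 deg at the horizon
    return theta, phi

# A source almost directly below the ceiling-mounted array yields a small phi,
# matching the behavior described above.
print(pointing_direction(0.1, 0.1, 2.0))
```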
[0019]
The directivity forming unit 11, as an example of the emphasizing processing unit, uses the sound data transferred from the microphone array device MA and, by directivity control processing, adds the sound data collected by each of the microphones M1 to Mn so as to emphasize (amplify) the sound (volume level) arriving from a specific direction relative to the position of each of the microphones M1 to Mn, thereby generating sound data having directivity in that specific direction. The specific direction (emphasis direction) is the direction from the microphone array device MA toward the position designated via the operation unit 17. The directivity control processing of sound data for forming the directivity of the sound collected by the microphone array device MA is a known technique using delay-and-sum, as disclosed in, for example, Japanese Patent Laid-Open No. 2014-143678.
[0020]
Furthermore, the directivity forming unit 11 can further emphasize the target sound by subtracting the ambient noise estimated by the adaptive filters F1 to Fn-1 (see FIG. 3 and FIG. 4) from the delay-and-sum signal formed by the adder AD1.
[0021]
When directivity is formed by the directivity forming unit 11, the history detection / update unit 13, as an example of the filter coefficient updating unit, registers (updates or adds) the pointing direction and the filter coefficients of the adaptive filters F1 to Fn-1 described later (see FIG. 3 and FIG. 4) in the history table 40 (see FIG. 5) stored in the filter coefficient holding unit 14.
Further, when a pointing direction is designated by the user via the operation unit 17 and the pointing direction calculation unit 12, the history detection / update unit 13 detects whether correspondence information between the pointing direction (horizontal angle θ, vertical angle φ) and the filter coefficients w1 to wq is registered in the history table 40.
[0022]
The filter coefficient holding unit 14, as an example of the storage unit, stores the history table 40 in which histories of the filter coefficients w1 to wq corresponding to the pointing direction (horizontal angle θ, vertical angle φ) are registered.
[0023]
The input / output control unit 16 controls input / output of various data to the operation unit
17, the display 18, and the audio output unit 19.
[0024]
The directivity forming unit 11, the pointing direction calculation unit 12, the history detection / update unit 13, the filter coefficient holding unit 14, and the input / output control unit 16 are configured by, for example, a PC (Personal Computer) using a central processing unit (CPU), a micro processing unit (MPU), or a DSP (Digital Signal Processor).
The PC executes control processing for the overall operation of each unit, input / output processing of data with other units, data calculation processing, and data storage processing.
[0025]
The operation unit 17 is disposed, for example, in correspondence with the screen of the display 18, and is configured using a touch panel or touch pad that can be operated with a user's finger or a stylus pen.
The operation unit 17 outputs data of a designated place (coordinates) for emphasizing
(amplifying) the volume level of the audio data to the directivity forming unit 11 according to the
user's operation. The operation unit 17 may be configured using a pointing device such as a
mouse or a keyboard.
[0026]
The display 18, as an example of the display unit, displays the voice map (see FIG. 6) on which the pointing direction (horizontal angle θ, vertical angle φ) from the microphone array device MA toward a sound source is represented by a sound source mark. The voice output unit 19 outputs voice based on the voice data collected by the microphone array device MA and transferred through the network 8, or based on the voice data that has been emphasized in a specific direction by the directivity forming unit 11.
[0027]
FIG. 3 is a diagram showing the configuration of the microphone array device MA and the directivity forming unit 11. The microphone array device MA includes n microphones M1 to Mn, n amplifiers P1 to Pn, n A/D converters L1 to Ln, an encoding unit 31, and a communication unit 32.
[0028]
The microphones M1 to Mn, which constitute the omnidirectional microphone array, are, for example, ECMs (high-quality small electret condenser microphones), and are disposed on the front surface of the housing of the microphone array device MA. The n amplifiers P1 to Pn respectively amplify the audio signals output from the n microphones M1 to Mn. The n A/D converters L1 to Ln convert the analog audio signals output from the n amplifiers P1 to Pn into digital audio data.
[0029]
The encoding unit 31 generates audio data packets for digital audio signals output from the A / D
converters L1 to Ln. The communication unit 32 transmits the packet of audio data generated by
the encoding unit 31 to the audio processing device 10 via the network 8.
[0030]
As the main part of the adaptive microphone array, the directivity forming unit 11 adds delays to the voice signals and then takes differences (delay-and-difference) to form blind spots (low sensitivity) in the noise directions, automatically creating a directional characteristic according to the surrounding noise environment. The directivity forming unit 11 has n delay units D1 to Dn, n-1 subtracters HA1 to HAn-1, n-1 subtracters HB1 to HBn-1, adders AD1 and AD2, and n-1 adaptive filters F1 to Fn-1.
[0031]
The principle of the adaptive microphone array (for example, the Griffith-Jim adaptive microphone array) is as follows. The adaptive microphone array calculates, from the arrival direction of the target signal and the arrangement of the microphones, the time difference with which the target signal is received by each microphone. The adaptive microphone array brings the target sound in phase by adding a delay corresponding to this time difference to the reception signal of each microphone. When differences between adjacent in-phase signals are taken, the target sounds cancel each other, and signals (noise signals) containing only noise are obtained. The adaptive microphone array passes each noise signal through an adaptive filter and then subtracts it from the delay-and-sum output to obtain the target sound with ambient noise suppressed.
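The following is a minimal sketch of this principle under stated assumptions (the signals are taken as already delay-aligned, an LMS update stands in for the adaptive algorithm, and all names are illustrative, not the patent's implementation):

```python
import numpy as np

def griffith_jim(aligned, filters, mu=0.01):
    """Sample-by-sample sketch of a Griffith-Jim type adaptive beamformer.

    aligned: (n_mics, n_samples) microphone signals already delay-aligned
             to the target direction
    filters: (n_mics - 1, q) adaptive filter coefficients (updated in place)
    Returns the enhanced output signal.
    """
    n_mics, n_samples = aligned.shape
    n_refs, q = filters.shape
    fbf = aligned.mean(axis=0)            # fixed beamformer: delay-and-sum
    refs = aligned[:-1] - aligned[1:]     # adjacent differences cancel the target
    hist = np.zeros((n_refs, q))          # tapped delay lines of the noise references
    out = np.empty(n_samples)
    for k in range(n_samples):
        hist = np.roll(hist, 1, axis=1)
        hist[:, 0] = refs[:, k]
        noise_est = float(np.sum(filters * hist))  # estimated ambient noise
        e = fbf[k] - noise_est                     # enhanced output sample
        out[k] = e
        filters += mu * e * hist                   # LMS update of the coefficients
    return out
```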
[0032]
The delay devices D1 to Dn delay the audio signals output from the n microphones M1 to Mn and amplified by the n amplifiers P1 to Pn, each by a delay time corresponding to the direction calculated by the pointing direction calculation unit 12. The n-1 subtracters HA1 to HAn-1 output difference signals (inter-channel difference signals) between adjacent microphones. The adder AD1 adds the delayed audio signals from the delay units D1 to Dn to obtain a delay-and-sum audio signal.
[0033]
The n-1 adaptive filters F1 to Fn-1 receive the inter-channel difference signals output from the n-1 subtracters HA1 to HAn-1 and output noise components. The n-1 subtracters HB1 to HBn-1 respectively subtract the noise components output from the adaptive filters F1 to Fn-1 from the delay-and-sum audio signal produced by the adder AD1. The adder AD2 adds the signals obtained by the n-1 subtracters HB1 to HBn-1 and outputs the result as the output audio signal.
[0034]
The audio output unit 19 includes an amplifier 19A and a speaker 19B, amplifies the audio signal
output from the directivity forming unit 11 by the amplifier 19A, and outputs the audio signal
from the speaker 19B.
[0035]
FIG. 4 is a diagram showing the configuration of adaptive filters F1 to Fn-1.
Since the adaptive filters F1 to Fn-1 all have the same configuration with filter coefficients w1 to wq, only the adaptive filter F1 will be described here.
[0036]
The adaptive filter F1 is, for example, an FIR (Finite Impulse Response) filter, and includes q delay devices DF1 to DFq, q variable amplifiers PF1 to PFq, q adders AF1 to AFq, a subtracter HF1, and an adaptive algorithm ARG.
[0037]
In the adaptive filter F1, the q delay devices DF1 to DFq are connected in series to sequentially
delay the input signal x (k).
Here, the input signal x(k) is the inter-channel (inter-microphone-element) difference signal described above, that is, audio data having a blind spot (low sensitivity) in the target direction and therefore containing mainly noise. The q variable amplifiers PF1 to PFq amplify the signals of the delay units DF1 to DFq with gains according to the filter coefficients w1 to wq, respectively. The signals output from the q variable amplifiers PF1 to PFq are sequentially added by the q adders AF1 to AFq and output as the output signal y(k), which is the estimated inter-channel noise component. The subtracter HF1 outputs the difference between the output signal y(k) and the reference signal (ideal signal) to the adaptive algorithm ARG. The adaptive algorithm ARG updates the filter coefficients w1 to wq so as to minimize the difference between the output signal y(k) and the reference signal, using a known method such as the least squares method.
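Mapped onto code, the adaptive filter F1 is a tapped-delay-line FIR filter; the sketch below uses an NLMS update as one example of such a known method (an assumption for illustration; the patent names only the least squares method):

```python
import numpy as np

class AdaptiveFIR:
    """Sketch of one adaptive filter F1: a q-tap FIR filter (delay devices
    DF1..DFq, variable amplifiers PF1..PFq, adders AF1..AFq) whose
    coefficients are updated by an adaptive algorithm (NLMS here)."""

    def __init__(self, q, mu=0.5, eps=1e-8):
        self.w = np.zeros(q)   # filter coefficients w1..wq
        self.x = np.zeros(q)   # tapped delay line: x(k), x(k-1), ..., x(k-q+1)
        self.mu = mu
        self.eps = eps

    def step(self, x_k, ref_k):
        """x_k: inter-channel difference sample; ref_k: reference (ideal) sample.
        Returns the estimated noise component y(k)."""
        self.x = np.roll(self.x, 1)
        self.x[0] = x_k
        y_k = float(self.w @ self.x)   # output signal y(k): weighted sum of taps
        e_k = ref_k - y_k              # difference output by the subtracter HF1
        # NLMS update minimizing the squared difference (the role of ARG)
        self.w += self.mu * e_k * self.x / (self.x @ self.x + self.eps)
        return y_k
```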
[0038]
FIG. 5 is a diagram showing the registered contents of the history table 40 stored in the filter coefficient holding unit 14. When directivity is formed by the directivity forming unit 11, the pointing direction and the filter coefficients w1 to wq used in the adaptive filters F1 to Fn-1 are registered in the history table 40 in association with each other. For example, as the n-th history entry, a horizontal angle θn and a vertical angle φn representing a pointing direction are registered in association with filter coefficients (wn1, ..., wnq).
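A minimal sketch of such a history entry (field names and the example coefficient values are illustrative, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HistoryPoint:
    """One row of the history table 40 (illustrative field names)."""
    theta: float                    # pointing direction: horizontal angle (deg)
    phi: float                      # pointing direction: vertical angle (deg)
    coeffs: List[float] = field(default_factory=list)  # filter coefficients w1..wq

# e.g. the history point RP3 at (150 deg, 65 deg) with placeholder coefficients
history_table = [HistoryPoint(theta=150.0, phi=65.0, coeffs=[0.12, -0.03, 0.07])]
```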
[0039]
FIG. 6 is a view showing the screen of the display 18 on which the voice map 50 is displayed. The voice map 50 is drawn with three concentric circles 50h, 50i, and 50j centered on the position O of the microphone array device MA, and radial line segments 50m that divide the central angle of these concentric circles into 12 parts. Of the three concentric circles, the innermost circle 50h corresponds to the vertical angle φ = 30°, the middle circle 50i to the vertical angle φ = 60°, and the outermost circle 50j to the vertical angle φ = 90°.
[0040]
On the voice map 50, the history points RP1 to RP6 registered in the history table 40 are displayed. The history points RP1 to RP6 are collectively referred to simply as history points RP when it is not necessary to distinguish them. The closer a history point RP is to the center point O, the closer it is to the position of the microphone array device MA. The line segment 50m extending horizontally to the right from the center point O has a central angle of 0°, which corresponds to the horizontal angle θ = 0°. The 12 dividing line segments 50m represent horizontal angles from 0° to 360° in steps of 30°. For example, the history point RP3 is drawn at the coordinates (θ, φ) = (150°, 65°) on the voice map 50. Here, the history points RP are displayed on the screen of the display 18 so as to overlap the voice map 50, but they need not be displayed.
[0041]
The operation of the sound collection system 5 having the above configuration will now be described.
[0042]
FIG. 7 is a flowchart showing a sound collection process procedure.
First, the pointing direction calculation unit 12 in the voice processing device 10 receives a position (pointing direction) on the voice map 50 designated by the user via the operation unit 17 and calculates the designated position (S1). FIG. 8 is a view showing the screen of the display 18 on which the voice map 50 including the pointing direction designated by the user is displayed. When the user moves the cursor 17z on the voice map 50 using the mouse as the operation unit 17 and clicks on the place where he or she wants to hear the sound, the display 18 shows the designated position (pointing direction) TP on the voice map 50 as a black circle.
[0043]
The history detection / update unit 13 calculates the distance between the designated position and each history point, and searches the history points RP registered in the history table 40 stored in the filter coefficient holding unit 14 (S2). It then determines whether or not there is a history point RP within a predetermined distance (a direction within a predetermined range) on the voice map 50 from the designated position (S3). If there is no history point RP, the history detection / update unit 13 sets the filter coefficients to default values (S5).
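A minimal sketch of the search in steps S2 and S3, reusing the HistoryPoint records sketched above (the distance measure on the voice map and the threshold value are assumptions; the text requires only a "direction within a predetermined range"):

```python
import math

def map_xy(theta, phi):
    """Position of a direction on the voice map 50: phi is the radius and
    theta the angle, following the polar layout described for FIG. 6."""
    t = math.radians(theta)
    return phi * math.cos(t), phi * math.sin(t)

def nearest_history_point(theta, phi, table, max_dist=10.0):
    """Steps S2-S3: return the closest HistoryPoint within a predetermined
    distance on the voice map, or None (in which case defaults are used, S5)."""
    px, py = map_xy(theta, phi)
    best, best_d = None, float("inf")
    for rp in table:
        rx, ry = map_xy(rp.theta, rp.phi)
        d = math.hypot(px - rx, py - ry)
        if d < best_d:
            best, best_d = rp, d
    return best if best_d <= max_dist else None
```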
[0044]
On the other hand, when there is a history point RP, the history detection / update unit 13 reads the pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) corresponding to that history point from the history table 40 stored in the filter coefficient holding unit 14 (S4). Here, the designated position TP is closest to the history point RP2. FIG. 9 is a view showing the registered contents of the history table 40 in which the pointing direction and filter coefficients of the history point closest to the position (pointing direction) designated by the user are selected. In the history table 40, the pointing direction (θ2, φ2) and the filter coefficients (w21, ..., w2q) of the history point RP2, surrounded by the thick frame 40z, are selected.
[0045]
The directivity forming unit 11 sets the filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and performs directivity formation processing while learning the adaptive filters F1 to Fn-1 with the history point RP as the starting point (S6).
[0046]
On the other hand, when there is no history point RP within the predetermined distance on the voice map 50 from the designated position, the directivity forming unit 11 sets the predetermined default filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and in step S6 performs directivity formation processing while learning the adaptive filters F1 to Fn-1 with the initial position as the starting point.
[0047]
When the directivity forming unit 11 performs the directivity forming process, the voice output
unit 19 outputs voice based on the voice data of the directivity direction in which the directivity
is formed (S7).
[0048]
The history detection / update unit 13 determines the presence or absence of the history point RP as found in step S3 (S8).
If there is a history point RP, the history detection / update unit 13 updates the history table 40 (S9).
In updating the history table 40, only the filter coefficients (w1, ..., wq) are overwritten, without changing the pointing direction (θ, φ) corresponding to the history point. Alternatively, both the pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) may be overwritten, or the history table 40 may be left as it is without being updated. On the other hand, when there is no history point RP in step S8, the history detection / update unit 13 adds the new pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) formed by the directivity forming unit 11 to the history table 40 (S10). Then, after the process of step S9 or step S10, the operation of the voice processing device 10 returns to step S1.
[0049]
FIG. 10 is a flowchart showing the automatic update procedure of the history table 40. This automatic update processing is performed at regular time intervals (periodically) by the voice processing device 10 and automatically updates the filter coefficients (w1, ..., wq) registered in the history table 40. The automatic update processing may be performed when the voice processing device 10 is not performing directivity formation processing or the like, that is, when the load on the CPU or the like that realizes the voice processing device 10 is relatively light. Also, in a store, since the surrounding noise environment differs greatly between busy hours, other business hours, and after closing, it is desirable to update the optimal filter coefficients according to the time period. In addition, filter coefficients may be registered for each time period in the history table used in a store or the like; for example, busy periods, idle periods, and after-closing periods may be registered separately.
[0050]
First, the history detection / update unit 13 initializes a variable i representing the order of the history points registered in the history table 40 to the value 0 (S21). Subsequently, the history detection / update unit 13 adds 1 to the variable i (S22), and reads the pointing direction and the filter coefficients of the record corresponding to the variable i from the history table 40 (S23).
[0051]
The directivity forming unit 11 sets the filter coefficients read by the history detection / update unit 13 in the adaptive filters F1 to Fn-1 and performs learning (S24). In the learning of the adaptive filters F1 to Fn-1, the directivity forming unit 11 calculates the filter coefficients w1 to wq as described above, using the sound data collected by the microphone array device MA.
[0052]
The history detection / update unit 13 updates the history table 40 using the filter coefficients w1 to wq obtained as a result of the learning (S25). Thereafter, the history detection / update unit 13 determines whether the variable i has reached the number n of history points RP registered in the history table 40 (S26). If the number n of registered history points RP has not been reached, the history detection / update unit 13 returns to step S22 and performs the same processing. On the other hand, when the number n of registered history points RP has been reached, the history detection / update unit 13 ends this operation.
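The loop of steps S21 to S26 can be sketched as follows (the relearn callback and the one-hour period are hypothetical placeholders, not specified by the patent):

```python
import time

def auto_update(history_table, relearn, period_s=3600.0):
    """Sketch of the periodic update of FIG. 10 (steps S21 to S26).

    'relearn' is a hypothetical callback that runs adaptive-filter learning
    for one pointing direction and returns refreshed coefficients."""
    while True:
        for rp in history_table:                             # i = 1 .. n (S22, S23)
            rp.coeffs = relearn(rp.theta, rp.phi, rp.coeffs) # learning (S24)
            # the table entry now holds the relearned coefficients (S25)
        time.sleep(period_s)                                 # repeat at regular intervals
```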
[0053]
In this sound collection system 5, the directivity forming unit 11 forms directivity in the pointing direction designated by the user for the sound collected by the microphone array device MA. When there is a history point RP in the vicinity of the position designated by the user, the directivity forming unit 11 reads the pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) corresponding to that history point from the history table 40, sets the filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and performs directivity formation processing while learning the filter coefficients of the adaptive filters. Therefore, the time until the filter coefficients converge to appropriate values is shortened.
[0054]
As described above, in the sound collection system 5 of the first embodiment, when directivity is formed toward a location, or the vicinity of a location, having a history in which directivity was formed using an adaptive filter, the convergence time of the filter coefficients can be shortened and directivity can be formed quickly.
[0055]
Further, in the sound collection system 5 of the present embodiment, the position (pointing direction) to be heard can be designated on the voice map 50, so a camera device is not necessary and the system configuration is simplified. Further, by displaying the history points RP on the voice map 50, the positions of the learned history points can be grasped.
[0056]
Although the user designates the sound source direction in the present embodiment, the sound source direction may instead be determined by estimating in which direction a sound source is present from the sound collected by the microphone array device MA. Also, although the configuration of the directivity forming unit 11 combines the delay-and-sum of the audio signals with adaptive filters, the present invention is not limited to this configuration, and the configuration of the adaptive filter itself is not limited to the Griffith-Jim type.
[0057]
Second Embodiment In the first embodiment, the microphone array device is used alone, but in
the second embodiment, the camera device and the microphone array device are used in
combination.
[0058]
FIG. 11 is a block diagram showing a configuration of a sound collection system 5A in the second
embodiment.
In the sound collection system according to the second embodiment, the same components as
those of the first embodiment are denoted by the same reference numerals, and the description
thereof will be omitted.
[0059]
The sound collection system 5A is installed in a store or the like, and has a configuration in
which a camera device CA for monitoring, a microphone array device MA, a recorder device 7,
and a monitoring device 6 are mutually connected via a network 8. The camera device CA and
the microphone array device MA are coaxially installed as described later (see FIG. 13). The
microphone array device MA has the same configuration as that of the first embodiment.
[0060]
The camera device CA, as an example of the imaging unit, is an omnidirectional camera in which a fisheye lens CAy is mounted on the front surface of a housing CAz (see FIG. 13), and captures omnidirectional images (still and moving images) of the area below it. The camera device CA is not limited to an omnidirectional camera and may be a fixed camera having a fixed angle of view; in the case of a fixed camera, the camera device CA captures an image of a location (preset position) registered in advance. The camera device CA transfers the captured video data to the monitoring device 6 via the network 8 and causes the recorder device 7 to record it.
[0061]
The recorder device 7 includes a control unit (not shown) for controlling each process such as
data recording, and a recording unit (not shown) for storing video data and audio data. The
recorder device 7 associates and records the video data captured by the camera device CA and
the audio data collected by the microphone array device MA.
[0062]
The monitoring device 6 is configured by a PC (Personal Computer) and monitors the images captured by the camera device CA and the sound collected by the microphone array device MA. In addition to the communication unit 15, the input / output control unit 16, the operation unit 17, the display 18, and the audio output unit 19 shown in the first embodiment, it includes an audio processing unit 21 and a video processing unit 22.
[0063]
The video processing unit 22 controls the display 18 to display images based on the video data captured by the camera device CA in accordance with operation instructions from the user, without being interlocked with the audio processing unit 21.
The audio processing unit 21 forms directivity in the pointing direction designated by the user and outputs the sound collected by the microphone array device MA. FIG. 12 is a diagram showing the configuration of the audio processing unit 21. The audio processing unit 21 includes a pointing direction calculation unit 12A in addition to the directivity forming unit 11, the history detection / update unit 13, and the filter coefficient holding unit 14 shown in the first embodiment.
[0064]
When the user designates, on the display 18 showing the image captured by the camera device CA, a position at which he or she wants to listen, the pointing direction calculation unit 12A calculates the pointing direction of the sound collected by the microphone array device MA. Here, since the microphone array device MA and the camera device CA are coaxially arranged, their coordinate systems substantially match. Therefore, when the position the user wants to hear is designated on the screen of the display 18 on which the image captured by the camera device CA is displayed, the pointing direction calculation unit 12A calculates the pointing direction (sound source direction) corresponding to that position. The pointing direction is represented by the horizontal angle θ and the vertical angle φ centered on the microphone array device MA, as described above.
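For illustration, a clicked pixel on the omnidirectional image can be converted to (θ, φ) as follows; the equidistant fisheye projection assumed here is not specified by the patent and is only one plausible model:

```python
import math

def pixel_to_direction(u, v, cx, cy, radius):
    """Map a clicked pixel (u, v) on the omnidirectional image to (theta, phi),
    assuming an equidistant fisheye projection centered at (cx, cy) with the
    image edge (radius) at phi = 90 deg. The projection model is an assumption."""
    dx, dy = u - cx, v - cy
    theta = math.degrees(math.atan2(dy, dx)) % 360.0
    phi = 90.0 * math.hypot(dx, dy) / radius  # distance from center -> inclination
    return theta, min(phi, 90.0)
```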
[0065]
The directivity forming unit 11 uses the audio data transferred directly from the microphone array device MA or recorded in the recorder device 7 and, by directivity control processing, adds the audio data collected by each of the microphones M1 to Mn so as to emphasize (amplify) the sound (volume level) in a specific direction from the position of each of the microphones M1 to Mn, thereby forming directivity in that specific direction and generating audio data in which noise from other directions is subtracted by the adaptive filters. The history detection / update unit 13 and the filter coefficient holding unit 14 are the same as those in the first embodiment, so their description is omitted.
[0066]
The display 18 displays video (images) based on the video data captured by the camera device CA and transferred via the network 8 or recorded in the recorder device 7 (see FIG. 16). The voice output unit 19 outputs voice based on the audio data collected by the microphone array device MA and transferred via the network 8 or recorded in the recorder device 7, or based on that audio data after emphasis processing by the directivity forming unit 11.
[0067]
FIG. 13 is a diagram for explaining the combination of the camera device CA and the microphone array device MA installed coaxially. The camera device CA is an omnidirectional camera that has a disk-shaped housing CAz with a fisheye lens CAy mounted on its front surface. The microphone array device MA is an omnidirectional microphone array having a ring-shaped housing MAz in which the plurality of microphones M1 to Mn (see FIG. 3) are arranged concentrically. The sound collection system 5A is attached to the ceiling RF (see FIG. 14) in a state where the housing CAz of the camera device CA is accommodated inside the opening MAr formed in the housing MAz of the microphone array device MA, that is, in a state where the camera device CA and the microphone array device MA are united.
[0068]
The operation of the sound collection system 5A having the above configuration will now be described.
[0069]
FIG. 14 is a schematic view showing an example of usage of the sound collection system 5A.
For example, the camera apparatus CA and the microphone array apparatus MA are installed on
the ceiling RF of a store in a combined state. In addition, a speaker 87 is placed on the floor
surface FLR located almost directly below the ceiling RF on which the camera device CA and the
microphone array device MA are installed. Further, on the floor surface FLR, four persons 55, 56,
57, 58 stand, and the person 55 and the person 56 have a conversation, and the person 57 and a
person 58 have a conversation. The speaker 87 is located between these two sets, and music is
flowing from the speaker 87. The camera apparatus CA captures an image of the four persons
55, 56, 57, 58 from a position slightly above. In addition, the microphone array device MA picks
up the sound of the entire store.
[0070]
FIG. 15 is a flowchart showing a sound collection process procedure. Step processing identical to that shown in the flowchart of FIG. 7 of the first embodiment is given the same step number, and its explanation is omitted. FIG. 16 is a view showing the omnidirectional image 60 displayed on the screen of the display 18. As described above, since the camera device CA and the microphone array device MA are coaxially installed and the sound collection position and the imaging position substantially coincide, a position designated on the omnidirectional image 60 captured by the camera device CA corresponds to the pointing direction of the sound collected by the microphone array device MA. First, the user operates the mouse serving as the operation unit 17 to move the cursor 17z displayed on the screen of the display 18, and clicks to designate the place where he or she wants to hear the sound. Here, the position of the person 55 is designated with the cursor 17z.
[0071]
The pointing direction calculation unit 12A receives the position on the omnidirectional image
60 designated by the user via the operation unit 17 (S1A). The pointing direction calculation unit
12A acquires the pointing direction of sound based on the designated image position (S1B).
[0072]
After the pointing direction of the sound is acquired, the processing of the audio processing unit 21 is the same as the processing from step S2 onward in the first embodiment. That is, in step S2 the history detection / update unit 13 calculates the distance between the designated position and each history point and searches the history points RP registered in the history table 40 stored in the filter coefficient holding unit 14, and in step S3 it determines whether or not there is a history point RP within a predetermined distance on the voice map 50 from the designated position. Then, after the process of step S9 or S10, the audio processing unit 21 returns to the process of step S1A.
[0073]
As described above, the sound collection system 5A of the second embodiment displays the omnidirectional image captured by the camera device CA on the screen of the display 18, and outputs the sound of the pointing direction designated by the user from the speaker 19B. Therefore, both images and sound can be monitored. Furthermore, the user can directly designate on the image the place (pointing direction) he or she wants to hear while looking at the omnidirectional image 60 displayed on the screen of the display 18. Since the place to listen to can be designated while checking the image, convenience and operability are improved.
[0074]
Further, by recording the audio data and the video data in the recorder 7, not only real-time
reproduction but also offline reproduction becomes possible afterward. Also, on the image
displayed on the display 18, the user can specify the pointing direction of the sound.
[0075]
Further, since the microphone array device MA and the camera device CA are coaxially arranged, their coordinate systems substantially coincide. Therefore, the position on the image captured by the camera device CA can be used directly as the pointing direction of the microphone array device MA, and the pointing direction is easily determined without performing calibration for converting the position on the image into the pointing direction.
[0076]
Further, the filter coefficients used when directivity is formed by the directivity forming unit are registered as a history in the history table 40 stored in the filter coefficient holding unit 14. When a filter coefficient is already stored in the filter coefficient holding unit 14, the latest filter coefficients can always be maintained by updating them. On the other hand, when no corresponding filter coefficient is stored in the filter coefficient holding unit 14, the number of registered filter coefficients can be increased by adding a new one, which increases the chance that filter coefficients registered in the history table 40 are used in subsequent directivity formation processing, leading to a shortening of the convergence time of the filter coefficients.
[0077]
(Modification of Second Embodiment) In the modification of the second embodiment, a PTZ camera is used as the camera device; it is an example of an optical-axis-movable camera whose imaging direction can be changed in the pan and tilt directions, whose angle of view can be set by a zoom operation, and whose magnification can be adjusted. The PTZ camera is installed at a position away from the microphone array device. Therefore, calibration of the relationship (positional relationship) between the coordinate system of the PTZ camera and the coordinate system of the microphone array device must be performed in advance so that the pointing direction from the microphone array device toward a position designated on the image captured by the PTZ camera can be calculated. This calibration is a known technique, described for example in WO 2014/125835. By performing this calibration, even if the direction of the PTZ camera is changed, information on the imaging direction is always obtained from the PTZ camera, so the pointing direction of the microphone array device can be calculated.
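A minimal sketch of how such a calibration result might be applied (the rigid-transform form of the calibration and the single assumed target depth are illustrative; the patent only states that the calibration itself is a known technique):

```python
import math
import numpy as np

def camera_ray_to_pointing_direction(ray_cam, R, t, depth):
    """Given a viewing ray (unit vector) in the PTZ camera's coordinate system,
    a calibrated rotation R (3x3) and translation t (3,) from camera coordinates
    to microphone-array coordinates, and an assumed target depth along the ray,
    return (theta, phi) as seen from the microphone array device MA."""
    p_cam = depth * np.asarray(ray_cam)   # point on the ray, camera coordinates
    p_mic = R @ p_cam + t                 # same point in array coordinates
    theta = math.degrees(math.atan2(p_mic[1], p_mic[0])) % 360.0
    phi = math.degrees(math.atan2(math.hypot(p_mic[0], p_mic[1]), p_mic[2]))
    return theta, phi
```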
[0078]
FIG. 17 is a schematic view showing an example of usage of the sound collection system 5A1 in
the modification of the second embodiment. For example, one PTZ camera CB and one
microphone array device MA are installed on the ceiling RF of the store, and a speaker 87 is
placed on the floor FLR of the store.
[0079]
Two persons 53 and 54 who are on the floor FLR are in conversation while standing up. Music
flows from a speaker 87 placed at a position slightly away from the two persons 53 and 54. In
addition, the PTZ camera CB captures an image of a point (place) to be monitored including the
persons 53 and 54. Furthermore, the microphone array device MA picks up the sound of the
entire store. The screen of the display 18 displays an image captured by the PTZ camera CB.
Further, the speaker 19B outputs the conversation of the two persons 53 and 54 or the music in
the store.
[0080]
FIG. 18 is an explanatory diagram of the operation outline of the sound collection system 5A1 when the person 53 is designated in the image displayed on the display 18. For example, the user clicks near the head of the person 53 displayed on the screen of the display 18 with the cursor 17z to designate the place to be heard.
[0081]
The audio processing unit 21 uses the audio data collected by the microphone array device MA to generate audio data in which the sound (volume level) in the pointing direction toward the position designated by the user is emphasized (amplified). Further, the audio processing unit 21 causes the speaker 19B to output the audio in synchronization with the image captured by the PTZ camera CB. As a result, the sound at the position designated by the user is emphasized, and the voice of the person 53 (for example, "Hello" in FIG. 18) is output from the speaker 19B at a large volume. On the other hand, the music flowing from the speaker 87 (see "♪" in FIG. 18), which is closer to the microphone array device MA than the persons 53 and 54 but is not at the position designated by the user, is not emphasized, and is output at a small volume compared with the voice of the person 53.
[0082]
FIG. 19 is a flowchart of the sound collection process. Step processing identical to that shown in the flowchart of FIG. 15 of the second embodiment is given the same step number, and its explanation is omitted.
[0083]
First, the pointing direction calculation unit 12A in the audio processing unit 21 performs calibration to obtain the relationship between the coordinate system of the PTZ camera CB and the coordinate system of the microphone array device MA (S1Z). After this, the processing of steps S1A to S7 is the same as in the second embodiment.
[0084]
The user operates the mouse serving as the operation unit 17 to move the cursor 17z displayed on the screen of the display 18, and clicks to designate the place where he or she wants to hear the sound. The pointing direction calculation unit 12A receives the position on the image designated by the user via the operation unit 17 in step S1A. In step S1B, the pointing direction calculation unit 12A acquires the pointing direction of the sound based on the designated image position.
[0085]
Then, in step S7, when the directivity forming unit 11 performs the directivity forming process,
the audio output unit 19 outputs the voice based on the voice data of the directivity direction in
which the directivity is formed. After that, the history detection / update unit 13 updates the
history table 40 in step S9 regardless of the presence or absence of the history in step S3. Then,
after the process of step S9, the audio processing unit 21 returns to the process of step S1A.
[0086]
As described above, in the modification of the second embodiment, even if the PTZ camera is installed at a position away from the microphone array device, the pointing direction, that is, the direction from the microphone array device toward the position designated on the image captured by the PTZ camera, can be calculated by calibrating the relationship between the coordinate system of the PTZ camera and the coordinate system of the microphone array device in advance. Thus, the microphone array device can be installed at an arbitrary position without being restricted by the installation location of the PTZ camera, and conversely the PTZ camera can be installed at an arbitrary position without being restricted by the installation location of the microphone array device.
[0087]
Third Embodiment The second embodiment shows the case where there is one camera device,
but the third embodiment shows the case where a plurality of camera devices are used.
[0088]
FIG. 20 is a block diagram showing a configuration of a sound collection system 5B in the third
embodiment.
In the sound collection system of the third embodiment, the same components as those of the
first and second embodiments are denoted by the same reference numerals, and the description
thereof will be omitted.
[0089]
The sound collection system 5B is installed in a store or the like, and has a configuration in which a plurality of monitoring camera devices CC1 to CCn, the microphone array device MA, the recorder device 7, and a monitoring device 6A are mutually connected via the network 8.
[0090]
Here, PTZ cameras and fixed cameras are used as the plurality of camera devices CC1 to CCn. The PTZ cameras have PTZ information set so as to capture an image of a previously registered location (preset position). A fixed camera does not need to have camera preset values (PTZ information). The plurality of camera devices CC1 to CCn transfer the captured video data to the monitoring device 6A via the network 8 and cause the recorder device 7 to record it. Although the case where the plurality of camera devices are PTZ cameras or fixed cameras is shown here, an omnidirectional camera may also be included. Further, when it is not necessary to distinguish the individual camera devices CC1 to CCn, they are simply referred to as camera devices CC.
[0091]
The monitoring device 6A has a preset table memory 23 and a history data table memory 24 in
addition to the configuration shown in the second embodiment. The preset table memory 23
stores a preset table 45. The history data table memory 24 stores the history table 40 described
in the first embodiment.
[0092]
FIG. 21 shows the registered contents of the preset table 45. The location, microphone preset values, camera IP address, and camera preset values are registered in the preset table 45. The pointing direction and the filter coefficients are registered as the microphone preset values for each location. The filter coefficients are determined by performing learning control when forming the directivity of the sound in the pointing direction. The camera preset values are the pan angle, tilt angle, and zoom magnification.
[0093]
For example, for the location register R1, pointing direction: (θ1, φ1), filter coefficients: (w11, ..., w1q), camera IP address: "165.254.10.11", pan angle: "Null", tilt angle: "Null", and zoom magnification: "Null" are registered. Since the pan angle, tilt angle, and zoom magnification are all "Null", that is, there is no PTZ information, the camera device that images the register R1 is a fixed camera. For the location magazine shelf T1, pointing direction: (θ3, φ3), filter coefficients: (w31, ..., w3q), camera IP address: "165.254.10.13", pan angle: "30°", tilt angle: "45°", and zoom magnification: "×3" are registered; the camera device that images the magazine shelf T1 is a PTZ camera. In addition, for this PTZ camera, camera preset values (PTZ information) for imaging the magazine shelf T2, namely pan angle: "-15°", tilt angle: "30°", and zoom magnification: "×5", are also registered. The PTZ camera captures images in the imaging direction corresponding to the preset values. By using a PTZ camera, the imaging area to be monitored can be easily changed.
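A minimal sketch of such preset entries (field names and numeric values are illustrative; "Null" PTZ fields mark a fixed camera):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PresetEntry:
    """One row of the preset table 45 (illustrative field names).
    pan/tilt/zoom of None ('Null') means the camera is a fixed camera."""
    location: str
    theta: float                  # microphone preset: pointing direction (deg)
    phi: float
    coeffs: List[float]           # microphone preset: filter coefficients
    camera_ip: str
    pan: Optional[float] = None
    tilt: Optional[float] = None
    zoom: Optional[float] = None

preset_table = [
    PresetEntry("register R1", 10.0, 40.0, [0.1, 0.2], "165.254.10.11"),
    PresetEntry("magazine shelf T1", 150.0, 65.0, [0.3, 0.1],
                "165.254.10.13", pan=30.0, tilt=45.0, zoom=3.0),
]

def is_fixed_camera(e: PresetEntry) -> bool:
    return e.pan is None and e.tilt is None and e.zoom is None
```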
[0094]
The operation of the sound collection system 5B having the above configuration will now be described.
[0095]
FIGS. 22 and 23 are flowcharts showing the sound collection process.
Step processing identical to that shown in the flowchart of FIG. 15 of the second embodiment is given the same step number, and its explanation is omitted.
[0096]
First, the pointing direction calculation unit 12A in the audio processing unit 21 performs calibration to obtain the relationship between the coordinate system of each PTZ camera and the coordinate system of the microphone array device MA (S31). This calibration is a known technique, as described above. Further, the directivity forming unit 11 in the audio processing unit 21 acquires each pointing direction registered in the preset table 45 and forms the directivity of the sound in that pointing direction, thereby performing the preset processing of calculating the filter coefficients (S32).
[0097]
The monitoring device 6A receives the camera device CC selected by the user via the operation
unit 17 (S33). The monitoring device 6A determines whether the camera device CC selected by
the user is a fixed camera (S34). When the camera is a fixed camera, the directivity forming unit
11 reads the preset information registered in the preset table 45 (S35). The preset information
includes the pointing direction and the filter coefficient corresponding to the position at which
the fixed camera captures an image.
[0098]
On the other hand, if it is a PTZ camera in step S34, the pointing direction calculation unit 12A reads imaging direction information from the camera device CC (S36). The imaging direction information includes a pan angle, a tilt angle, and a zoom factor. The pointing direction calculation unit 12A determines whether the imaging direction information matches the camera preset values registered in the preset table 45, that is, whether the camera device CC faces the preset position (S37). When the camera device CC faces the preset position, the directivity forming unit 11 reads the preset information (including the pointing direction and the filter coefficients) registered in the preset table 45 in step S35.
[0099]
On the other hand, when the camera device CC does not face the preset position in step S37, the pointing direction calculation unit 12A uses the calibration to estimate in which direction the sound source lies for the sound collected by the microphone array device MA (S38). Thereafter, the processes of steps S39 to S44 are the same as the processes of steps S2 to S7 shown in FIG. 7 of the first embodiment. That is, the history detection / update unit 13 calculates the distance between the designated position and each history point, searches the history points RP registered in the history table 40 stored in the filter coefficient holding unit 14 (S39), and determines whether there is a history point RP within a predetermined distance on the voice map 50 from the designated position (S40).
[0100]
When there is a history point RP, the history detection / update unit 13 reads the pointing
direction (θ, φ) and the filter coefficients (w1, ..., wq) corresponding to that history point from
the history table 40 stored in the filter coefficient holding unit 14 (S41). The directivity
forming unit 11 sets the filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and
performs directivity formation processing while learning the adaptive filters F1 to Fn-1 with the
history point RP as the start point (S43).
[0101]
On the other hand, when there is no history point RP within the predetermined distance on the
voice map 50 from the designated position, the directivity forming unit 11 sets predetermined
default filter coefficients (w1, ..., wq) in the adaptive filters F1 to Fn-1, and in step S43
performs directivity formation processing while learning the adaptive filters F1 to Fn-1 with
the initial position as the start point. When the directivity forming unit 11 performs the
directivity formation process, the voice output unit 19 outputs voice based on the voice data of
the directivity direction in which the directivity is formed (S44).
[0102]
The history detection / update unit 13 performs table update processing for updating the history
table 40 and the preset table 45 (S45). Thereafter, the operation of the monitoring device 6A
returns to the process of step S33.
[0103]
FIG. 24 is a flowchart showing the table update processing procedure in step S45. First, the
monitoring device 6A determines whether the camera device CC selected by the user via the
operation unit 17 is a fixed camera (S51). If the camera is a fixed camera, the history detection /
update unit 13 updates the preset information registered in the preset table 45 (S52). Here, only
the filter coefficient of the preset information (microphone preset value) is updated. Thereafter,
the monitoring device 6A returns to the original process.
[0104]
On the other hand, if the selected camera device CC is not a fixed camera in step S51, that is, if
it is a PTZ camera, the pointing direction calculation unit 12A reads imaging direction
information from the camera device CC (S53). The imaging direction information includes a pan
angle, a tilt angle, and a zoom factor. The pointing direction calculation unit 12A determines
whether the imaging direction information matches a camera preset value registered in the
preset table 45, that is, whether the camera device CC faces a preset position (S54). When the
camera device CC faces a preset position, the history detection / updating unit 13 updates the
preset information registered in the preset table 45 in step S52. Here, too, only the filter
coefficients of the preset information (the microphone preset values) are updated.
[0105]
On the other hand, if the camera device CC does not face a preset position in step S54, it is
determined whether the preset button 76 (see FIG. 25) has been pressed to instruct a preset
setting (S55). When a preset setting is instructed, the history detection / updating unit 13 adds
the preset information to the preset table 45 (S56). Thereafter, the monitoring device 6A
returns to the original process. When no preset setting is instructed in step S55, the history
detection / updating unit 13 determines whether the pointing direction from the microphone
array device MA toward the imaging direction is in the vicinity of a history point RP (S57).
[0106]
If the pointing direction is in the vicinity of the history point RP, the history detection / update
unit 13 updates the filter coefficients registered in the history table 40 (S58). On the other hand,
if the pointing direction is not in the vicinity of the history point RP, the history detection /
updating unit 13 adds a new history point (pointing direction and filter coefficient) to the history
table 40 (S59). After the process of step S58 or S59, the monitoring device 6A returns to the
original process.
[0107]
FIG. 25 is a diagram showing the screen of the display 18 and the sound output operation of
the speaker 19B. On the left side of the screen of the display 18, a pull-down menu 70 of
various items is displayed. Here, the pull-down menu of the device tree is expanded, and the
camera device CC2 is in a selected state. A monitor screen 72, on which the image captured by
the selected camera device CC2 is displayed, is arranged in the upper part of the approximate
center of the screen of the display 18. An operation panel 73 is disposed below the
approximate center of the screen of the display 18. The operation panel 73 includes a
brightness button 75 for adjusting the brightness of the image, a focus button 78 for adjusting
the focus of the image captured by the camera devices CC1 to CCn, a selection button 74 for
selecting one of the camera devices CC1 to CCn, a zoom button 77 for performing a zooming
operation, and a preset button 76 to be pressed when a new preset position is added. The voice
processing unit 21 outputs the voice “Welcome” from the speaker 19B based on the voice
data in which directivity has been formed by the selected microphone array device MA.
[0108]
As described above, in the sound collection system 5B according to the third embodiment, a
large number of microphone preset values (including filter coefficients) corresponding to the
locations imaged by a plurality of camera devices, including fixed cameras and PTZ cameras,
can be registered. In addition, since directivity is formed in the pointing direction specified by
the user using filter coefficients selected from among the many registered filter coefficients,
the convergence time of the filter coefficients is shortened and directivity can be formed
quickly. Also, by updating both the preset table and the history table using the results of the
emphasis processing (directivity formation processing), optimal filter coefficients can be
obtained. Using both the preset table and the history table increases the probability that the
desired pointing direction falls within the predetermined range around a registered microphone
preset value or history point. Therefore, directivity can be formed quickly.
[0109]
Fourth Embodiment: In the first, second, and third embodiments, the position at which the user
wants to hear sound among the sounds collected by the microphone array device MA is a
location designated by the user. The fourth embodiment shows a case in which the user
designates a person, and when the designated person (e.g., a pedestrian) moves, the user hears
the voice emitted by that person while the person is tracked. When a person such as a
pedestrian is tracked and the voice is reproduced, the monitoring device automatically tracks
the target person with its image processing function so that the pointing direction follows the
target person.
[0110]
FIG. 26 is a diagram for explaining the relationship between the target person's speech and the
learning of the adaptive filter when tracking the target person 81. While tracking the target
person 81, the directivity forming unit stops the learning function when the target person 81
starts speaking during the directivity formation process (see times T2 and T4), and resumes
learning of the adaptive filter from the time it is determined that the voice has stopped (see
times T3 and T5). The reason the learning function is stopped while the target sound (the
target person's speech) is present is as follows. Although the adaptive array (adaptive
microphone array) theoretically drives the gain for noise in the target direction to zero, in
practice, because of various factors such as variation in the characteristics of the individual
microphones and the influence of reflections caused by mounting them in a housing, the gain
in the target direction does not become zero, and the target sound would also be judged as
noise and attenuated.
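A minimal sketch of the gating shown in FIG. 26; the frame-energy detector and its threshold stand in for whatever speech detection the system actually uses, and both are assumptions:

```python
import numpy as np

def target_speaking(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Crude energy-based check for target speech in one audio frame."""
    return float(np.mean(frame ** 2)) > threshold

def gated_adaptation(frame: np.ndarray, w: np.ndarray, adapt_step) -> tuple:
    """Freeze learning while the target speaks (times T2, T4) and run the
    adaptation step otherwise (resumed at times T3, T5)."""
    if target_speaking(frame):
        return w, False          # coefficients held, learning stopped
    return adapt_step(frame, w), True
```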
[0111]
When resuming learning, the directivity forming unit compares the distance from the
resumption position to the immediately preceding learning stop position with the distance to
the position of a history point near the resumption position, and loads the filter coefficients of
the selected one to start learning. Here, when the learning stop position and the history point
position (history information position) are compared and one of them is selected, the learning
stop position, being closer in time, can be considered better suited to the current situation.
Therefore, different weights are attached to the learning stop position and the history
information position (giving priority to the learning stop position) when making the selection.
As an example, the directivity forming unit compares the distance to the learning stop position
with the distance to the history information position, and adopts the filter coefficients of the
learning stop position if the distance to the learning stop position is less than twice the
distance to the history information position.
[0112]
The sound pickup system in the fourth embodiment has substantially the same configuration as
that of the second embodiment. Components identical to those of the second embodiment are
given the same reference numerals, and their description is omitted. The camera apparatus CA
of the fourth embodiment is an omnidirectional camera and is installed coaxially with the
microphone array apparatus MA. Therefore, the coordinate system of the camera apparatus CA
and the coordinate system of the microphone array apparatus MA substantially coincide with
each other.
[0113]
FIG. 27 is a flowchart showing the tracking processing procedure performed by the video
processing unit 22. The video processing unit 22 receives the designation of the target person
81 made by the user via the operation unit 17, and notifies the voice processing unit 21 that
the target person 81 has been designated (S61). The video processing unit 22 then tracks the
target person 81 (S62). In this tracking process, the video processing unit 22 executes its image
processing function and tracks the target person 81 using a known tracking technique. For
example, the video processing unit 22 calculates the difference between the captured image
and a background image, and tracks an image portion whose difference area is equal to or
larger than a predetermined value as the target person 81, as sketched below. Alternatively, the
video processing unit 22 may extract a feature amount of the target person 81 and track the
image portion in which that feature amount appears as the target person 81.
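The background-difference variant could be sketched with OpenCV roughly as follows; the blur, the binarization threshold, and the minimum area are assumptions, and any equivalent tracker would serve:

```python
import cv2

def track_by_difference(frame, background, min_area: int = 500):
    """Return the centroid (x, y) of the largest changed region, treated
    as the target person 81, or None if nothing exceeds min_area."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, bg)                     # frame vs. background
    diff = cv2.GaussianBlur(diff, (5, 5), 0)         # suppress pixel noise
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    c = max(contours, key=cv2.contourArea)
    if cv2.contourArea(c) < min_area:                # "predetermined value"
        return None
    m = cv2.moments(c)
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])
```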
[0114]
The video processing unit 22 calculates the position (target position) of the target person 81
being tracked in the camera coordinate system, and notifies the voice processing unit 21 of the
target position information (S63). The video processing unit 22 determines whether an
instruction to end tracking has been received from the user via the operation unit 17 (S64).
When no instruction to end tracking has been received, the video processing unit 22 returns to
step S62 and continues the tracking of the target person 81 described above. On the other
hand, when an instruction to end tracking is received, the video processing unit 22 notifies the
voice processing unit 21 of the end of tracking (end of sound collection) (S65), and ends this
operation. As described above, once tracking is started by designating the target person 81, the
video processing unit 22 calculates the position (target position) of the target person 81 at
predetermined intervals until the tracking end instruction is given, and notifies the voice
processing unit 21 of the target position information.
[0115]
FIG. 28 is a flowchart showing the directivity formation processing procedure performed by
the audio processing unit 21. Upon receiving the notification from the video processing unit 22
that the target person 81 has been designated, the pointing direction calculation unit 12A
starts this operation. First, the pointing direction calculation unit 12A acquires the target
position (position on the image) information notified by the video processing unit 22 (S71).
The pointing direction calculation unit 12A calculates the pointing direction of the sound based
on the designated position on the image, and passes it to the history detection / updating unit
13 (S72, see symbol k). When the pointing direction is input, the history detection / updating
unit 13 starts learning of the adaptive filters F1 to Fn-1, and outputs the filter coefficients
(w1, ..., wq) obtained during learning to the directivity forming unit 11 (see symbol j). The
learning process of the adaptive filters F1 to Fn-1 will be described later with reference to
FIGS. 29 and 30.
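Because the camera and the microphone array are coaxial, step S72 reduces to a geometric mapping from the pixel position to (θ, φ). A sketch assuming an equidistant fisheye projection; the projection model and the 180-degree field of view are assumptions, not details given in the embodiment:

```python
import math

def pixel_to_direction(u: float, v: float, cx: float, cy: float,
                       r_max: float, fov_deg: float = 180.0):
    """Map a pixel on the omnidirectional image to a pointing direction.

    (cx, cy) is the image centre and r_max the image radius in pixels.
    Returns (theta, phi) in degrees: theta measured from the optical axis
    (equidistant model: radius grows linearly with theta), phi the azimuth
    around it."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    theta = (r / r_max) * (fov_deg / 2.0)
    phi = math.degrees(math.atan2(dy, dx))
    return theta, phi
```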
[0116]
The directivity forming unit 11 sets the filter coefficients (w1, ..., wq) obtained in this way in
the adaptive filters F1 to Fn-1 while they are being learned, and performs directivity formation
processing (S73). When the directivity forming unit 11 performs the directivity formation
process, the voice output unit 19 outputs voice based on the voice data of the directivity
direction in which the directivity is formed (S74).
[0117]
The audio processing unit 21 determines whether the notification of the end of tracking has
been received from the video processing unit 22 (S75). If the notification has not been received,
the audio processing unit 21 returns to step S71, acquires the target position information, and
repeats the same processing. On the other hand, when the notification of the end of tracking is
received from the video processing unit 22 in step S75, the directivity forming unit 11 ends the
processing for forming the directivity of the audio and switches to non-directional operation
(S76). After this, the audio processing unit 21 ends this operation.
[0118]
FIGS. 29 and 30 are flowcharts showing the learning processing procedure of the adaptive
filter. When the pointing direction is input from the pointing direction calculation unit 12A (see
symbol k), the history detection / updating unit 13 is activated and reads the pointing direction
(S81). The history detection / updating unit 13 determines whether the video processing unit
22 is starting tracking of the target person 81, is in the middle of tracking, or is ending
tracking (S82).
[0119]
When tracking is started, the history detection / update unit 13 calculates the distance between
the designated position and each history point, searches the history table 40 stored in the filter
coefficient holding unit 14 for a history point RP (S83), and determines whether there is a
history point RP within a predetermined distance from the designated position (S84). If there is
no history point RP, the history detection / update unit 13 sets the filter coefficients to the
default values (S86).
[0120]
On the other hand, when there is a history point RP, the history detection / update unit 13
reads the pointing direction (θ, φ) and the filter coefficients (w1, ..., wq) corresponding to that
history point from the history table 40 stored in the filter coefficient holding unit 14 (S85).
[0121]
The directivity forming unit 11 performs learning of the adaptive filters F1 to Fn-1 (S87), and
outputs the filter coefficients obtained as a result of learning (S88, see symbol j). The processes
of steps S87 and S88 are performed while the target person 81 is not speaking (utterance
stopped). The history detection / update unit 13 determines whether to stop learning of the
adaptive filters F1 to Fn-1 (S89). If learning of the adaptive filters F1 to Fn-1 is not stopped,
the history detection / updating unit 13 returns to step S87 and repeats the same processing.
On the other hand, when learning of the adaptive filters F1 to Fn-1 is stopped, the history
detection / updating unit 13 holds the filter coefficients (S90) and adds them to the history
table 40 stored in the filter coefficient holding unit 14 (S91). The history detection / update
unit 13 may also omit adding these filter coefficients to the history table 40.
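Steps S90 and S91 amount to snapshotting the converged coefficients and, optionally, registering them as a new history point. A minimal sketch, with the history table represented as a plain list:

```python
from typing import List

def on_learning_stopped(history: List[dict], theta: float, phi: float,
                        coeffs: List[float], register: bool = True) -> List[float]:
    """Hold the coefficients at the stop time (S90) and, unless
    registration is skipped, add them to the history table 40 (S91)."""
    held = list(coeffs)          # snapshot handed back when learning resumes
    if register:
        history.append({"theta": theta, "phi": phi, "coeffs": held})
    return held
```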
[0122]
If tracking is in progress in step S82, the history detection / updating unit 13 determines
whether learning of the adaptive filters F1 to Fn-1 is in progress (S92). When learning is in
progress, the history detection / update unit 13 proceeds to the process of step S87 described
above. On the other hand, when learning is not being performed, the history detection / updating
unit 13 determines whether the directivity forming unit 11 has started learning (S93).
[0123]
When learning has not started, the history detection / updating unit 13 outputs the held filter
coefficients to the directivity forming unit 11 (S94, see symbol j). Likewise, when the filter
coefficients are added to the history table 40 in step S91, the history detection / updating unit
13 outputs the held filter coefficients to the directivity forming unit 11.
[0124]
On the other hand, when learning is started in step S93, the history detection / updating unit
13 compares the immediately preceding learning stop position with the position of the nearby
history point (history information position) (S95). As an example, when the distance to the
immediately preceding learning stop position is less than twice the distance to the history
information position, the history detection / updating unit 13 uses the filter coefficients of the
immediately preceding learning stop position.
[0125]
The history detection / update unit 13 determines whether to use the filter coefficients of a
history point registered in the history table 40 (S96). When the filter coefficients of the history
point are not used, the history detection / updating unit 13 uses the filter coefficients of the
immediately preceding learning stop position (S97). On the other hand, when the filter
coefficients of the history point are used, the history detection / updating unit 13 uses those
coefficients (S98). After the process of step S97 or S98, the history detection / updating unit 13
proceeds to the process of step S87 described above.
[0126]
As described above, when the target person 81 speaks during learning of the adaptive filter,
the history detection / updating unit 13 stops learning, holds the filter coefficients at the stop
time points (times T2 and T4) in step S90, and sends them to the directivity forming unit 11 in
step S94 (see symbol j). On the other hand, when the target person 81 stops speaking while
learning is stopped in step S92, the history detection / updating unit 13 resumes learning in
step S93 (times T3 and T5). At this time, the immediately preceding learning stop position and
the nearby history information position are compared in step S95, and one set of filter
coefficients is sent to the directivity forming unit 11 at the start of learning in step S87 (see
symbol j). When tracking is ended in step S82, the history detection / updating unit 13 ends
the learning of the adaptive filter.
[0127]
FIG. 31 is a diagram for explaining the learning operation for the target person 81 on the
omnidirectional image 90. In the omnidirectional image 90 displayed on the screen of the
display 18, six history points RP1 to RP6 are set. The history points need not be displayed. At
the start of tracking, when the vicinity of the start point (time T1) is searched for history,
history point RP2 is found nearby, so the history detection / update unit 13 sets its filter
coefficients in the adaptive filters F1 to Fn-1, and the directivity forming unit 11 starts learning
of the adaptive filters F1 to Fn-1.
[0128]
In addition, if the pointing direction is input during tracking, the history detection / updating
unit 13 determines whether learning is in progress. If learning is in progress (period T1 to T2),
it continues learning of the adaptive filter and sends the filter coefficients to the directivity
forming unit 11. If learning is stopped (period T2 to T3), the history detection / updating unit
13 waits until learning resumes, and sends the held filter coefficients to the directivity forming
unit 11 when learning resumes.
[0129]
Also, when learning is resumed at the position of time T3, the distance to the nearby history
point RP1 is compared with the distance to the immediately preceding learning stop position
(the position at time T2). Although the distance to the history point RP1 is shorter, with the
above-mentioned weighting the distance to the immediately preceding learning stop position
(the position at time T2) is less than twice the distance to the history point RP1, so the
directivity forming unit 11 adopts the filter coefficients of the learning stop position (the
position at time T2) and sets them in the adaptive filters F1 to Fn-1.
[0130]
In addition, when learning is resumed at the position of time T5, the distance to the nearby
history point RP6 is compared with the distance to the immediately preceding learning stop
position (the position at time T4). The distance to the history point RP6 is shorter, and even
with the weighting described above the distance to the immediately preceding learning stop
position (the position at time T4) is not less than twice the distance to the history point RP6,
so the directivity forming unit 11 adopts the filter coefficients of the history point RP6 and sets
them in the adaptive filters F1 to Fn-1.
[0131]
As described above, when the target person 81 is designated, the audio processing unit 21
starts directivity formation processing and acquires the position on the image from the video
processing unit 22 at regular intervals until the tracking end instruction is given. Further, when
the pointing direction is calculated, the audio processing unit 21 outputs it to the history
detecting / updating unit 13. When the pointing direction is input, the history detecting /
updating unit 13 sets the filter coefficients, performs learning, and notifies the directivity
forming unit 11 of the result. The directivity forming unit 11 performs the directivity formation
process using the notified filter coefficients.
[0132]
As described above, in the sound collection system according to the fourth embodiment, when
a person such as a pedestrian is tracked and the voice is reproduced, the monitoring device
automatically tracks the target person with its image processing function while directing the
directivity toward the target person. This makes it possible to listen to the voice emitted by a
person while tracking the person specified by the user. In addition, learning of the adaptive
filter is stopped while the person is speaking and performed while the person is not speaking,
so the voice emitted by the person can be prevented from being judged as noise and
attenuated.
[0133]
Further, when the person stops speaking and learning of the adaptive filter is started, the
directivity forming unit 11 can select either the filter coefficients learned immediately before or
the filter coefficients of a history point RP, so more suitable filter coefficients can be set in
advance, which shortens the convergence time of the filter coefficients. In addition, since the
filter coefficients of the adaptive filter learned immediately before are prioritized, the
coefficients can be kept close to those suitable for directivity formation.
[0134]
Here, weighting is performed simply by distance alone, but the recording time of each history
entry may be kept in the history table so that time can be included as a factor in the weighting
calculation. For example, the weight may be increased the more recent the recorded date and
time is.
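One way to add such a time factor, sketched below: scale each candidate's distance by a penalty that grows with age, so that more recent entries are favoured. The exponential form and the one-hour half-life are purely illustrative assumptions:

```python
import time

def age_weighted_distance(distance: float, recorded_at: float,
                          half_life_s: float = 3600.0) -> float:
    """Effective distance for comparison: an entry recorded one half-life
    ago counts as twice as far away as a freshly recorded one."""
    age = max(0.0, time.time() - recorded_at)
    return distance * (2.0 ** (age / half_life_s))
```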
[0135]
Also, although a person such as a pedestrian is tracked automatically by the image processing
function, the user may instead click or drag the mouse to follow the pedestrian's movement,
and the sound collection system may automatically update the pointing direction at fixed time
intervals or every predetermined distance. In addition, when recorded data is reproduced on
the recorder device, if several movement points are designated once and the trajectory of the
person's movement is connected by a curve before reproduction, the number of points that
must be designated can be reduced.
[0136]
Although various embodiments have been described above with reference to the drawings, it
goes without saying that the present invention is not limited to these examples. It will be
apparent to those skilled in the art that various modifications and alterations can be conceived
within the scope of the appended claims, and it is understood that these also naturally fall
within the technical scope of the present invention.
[0137]
For example, although information in which the pointing direction and the filter coefficients
are associated with each other is registered in the history table, further information, such as
the number of history entries, may also be associated and registered.
[0138]
INDUSTRIAL APPLICABILITY
The present invention is useful for a sound collection system that uses an adaptive filter to
form directivity in a specified pointing direction when sound is collected.
[0139]
5, 5A, 5A1, 5B Sound pickup system; 6, 6A Monitoring device; 7 Recorder device; 8 Network;
10 Voice processing device; 11 Directivity forming unit; 12, 12A Pointing direction calculation
unit; 13 History detection / update unit; 14 Filter coefficient holding unit; 15 Communication
unit; 16 Input / output control unit; 17 Operation unit; 17z Cursor; 18 Display; 19 Audio
output unit; 19A Amplifier; 19B, 87 Speaker; 21 Audio processing unit; 22 Video processing
unit; 23 Preset table memory; 24 History data table memory; 31 Coding unit; 32
Communication unit; 40 History table; 40z Thick frame; 45 Preset table; 50 Voice map; 50h,
50i, 50j Concentric circles; 50m Line segment; 53, 54, 55, 56, 57, 58, 81 Person; 60, 90
Omnidirectional image; 70 Pull-down menu; 72 Monitor screen; 73 Operation panel; 74
Selection button; 75 Brightness button; 76 Preset button; 77 Zoom button; 78 Focus button;
AD1, AD2, AF1 to AFq Adders; ARG Adaptive algorithm; CA, CB, CC1 to CCn Camera device;
CAy Fisheye lens; CAz, MAz Housing; D1 to Dn, DF1 to DFq Delay elements; F1 to Fn-1
Adaptive filters; FLR Floor surface; HA1 to HAn-1, HB1 to HBn-1, HF1 Subtractors; L1 to Ln
A/D converters; M1 to Mn Microphones; MA Microphone array device; MAr Opening; O Center
point; P1 to Pn Amplifiers; PF1 to PFq+1 Variable amplifiers; RF Ceiling; RP, RP1 to RP6
History points; TP Designated position