Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2012058314
The present invention provides an acoustic processing system which, for the safety of people around a machine, extracts the speech of a person at the position whose speech should be extracted and instantaneously extracts speech useful for avoiding danger. The sound processing system includes: a sound input unit 201 consisting of a plurality of microphones for picking up sound; a risk degree calculation unit 206 for calculating the degree of risk associated with contact with a surrounding person or object due to the operation of the machine; a sound extraction unit 203 which receives the signal output from the sound input unit 201 as an input and outputs a separated signal according to the degree of risk calculated by the risk degree calculation unit 206; and a sound output unit 219 which outputs the separated signal output from the sound extraction unit 203. [Selected figure] Figure 2
Sound processing system and machine using the same
[0001]
The present invention relates to sound processing technology suitable for an operator or driver operating a relatively large machine such as a construction machine, a vehicle, or a work machine to grasp the situation of people around the machine, and in particular relates to a sound processing system suitable for the safety of the people around the machine and to a technology that is effective when applied to a machine using the same.
[0002]
In a relatively large machine such as a construction machine, a vehicle, or a work machine, an operator or a driver (hereinafter referred to as an operator) must constantly grasp the situation of the people around the machine for their safety and avoid danger each time it arises.
One of the important pieces of information for the operator to grasp the situation of the people around the machine is the speech uttered by those people.
[0003]
To pick up the voices of surrounding people, it is assumed that a microphone is installed on the outside of the machine and that the collected sound is presented to the operator so that the operator can grasp the situation of the surrounding people. The sound collected by the microphone contains not only the voices of the surrounding people but also the engine sound, mechanical driving sound, excavating sound, and other noise accompanying the operation of the machine, so only the human voice needs to be extracted from the collected sound and presented to the operator.
[0004]
If sound source separation technology using a plurality of microphones (microphone arrays) is
used, it is possible to extract only the sound coming from a specific position. However, there are
the following two issues.
[0005]
First, in sound source separation, the problem is that it is necessary to specify the position from which speech is to be extracted, that is, the position where a person is present. For example, in a sound source separation method based on position estimation assuming sparsity (for example, Patent Document 1), sound source separation is performed by applying a filter in which the designated extraction position is treated as the target sound source position and the other positions as disturbing sound source positions. For this reason, the position must be specified. There is also a technique called blind source separation, in which the sound of each source is extracted without specifying the position of the sound source, but even in that case the problem remains of determining which of the plurality of acquired acoustic signals should be extracted.
[0006]
The second problem is that there is a trade-off between the "accuracy" of sound source separation and the filter adaptation time. Accuracy here means how close the extracted sound is to the sound of the original target sound source. In general, in an adaptive method that extracts with high accuracy (for example, the independent component analysis in Non-Patent Document 1), the filter cannot be adapted from the instantaneous input signal alone, so the operator cannot instantaneously grasp the situation of the surrounding people and judge how to avoid danger (hereinafter, "instantaneously" means within a time sufficiently shorter than the time from the reception of the sound until the operator performs the danger avoidance action).
[0007]
On the other hand, there are sound source separation algorithms that can extract using only the instantaneous input signal (for example, the binary masking in Non-Patent Document 2), but their accuracy is generally low and noise remains mixed in, making it difficult for the operator to recognize what the person is saying. There is also the problem that the operator is constantly exposed to the residual, unseparated noise.
[0008]
Further, there is a method that selects between independent component analysis and binary masking based on a volume difference, depending on the situation, in order to achieve both real-time processing and separation accuracy (for example, Patent Document 2). Patent Document 2 shows an example in which the selection is performed based on the convergence of the separation matrix of the independent component analysis.
[0009]
JP 2007-47427 A, JP 2007-33825 A
[0010]
T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, "Blind separation of binaural sound mixtures using SIMO-model-based independent component analysis," ICASSP 2004, vol. 4, pp. 113-116, 2004.
O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, July 2004.
M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs," ICASSP 2007, vol. I, pp. 117-120, 2007.
[0011]
In Patent Document 2 mentioned above, the merit of selecting based on the convergence criterion is the stability that the separation accuracy does not fall below that of binary masking. In the present invention, in which the safety of the surrounding people is the most important, instantaneousness is required so that danger can be avoided, and this problem cannot be solved by the invention of Patent Document 2, which emphasizes the stability of separation accuracy. In addition, the problem of specifying the position to be extracted, described above, cannot be solved at all.
[0012]
Therefore, the present invention has been made to solve the above-mentioned problems, and its typical object is to provide an acoustic processing system which, for the safety of the people around a machine, extracts the voice of the person at the position whose voice should be extracted and instantaneously extracts speech useful for avoiding danger.
[0013]
The above and other objects and novel features of the present invention will be apparent from
the description of the present specification and the accompanying drawings.
[0014]
The outline of typical ones of the inventions disclosed in the present application will be briefly
described as follows.
[0015]
That is, a typical sound processing system includes: a sound input unit including a plurality of microphones for picking up sound; a risk degree calculation unit that calculates the degree of risk associated with contact with a surrounding person or object due to the operation of the machine; a sound extraction unit that receives the signal output from the sound input unit as an input and outputs a separated signal according to the degree of risk calculated by the risk degree calculation unit; and a sound output unit that outputs the separated signal output from the sound extraction unit.
Furthermore, the system may have the following characteristics.
[0016]
The sound extraction unit is composed of a plurality of sound source separation units, each of which has as its extraction position a position with a relatively high degree of risk.
The extraction method of each sound source separation unit is a method that can extract instantaneously when the risk of the corresponding extraction position is high, and a method that can extract with high accuracy when the risk of the extraction position is low.
[0017]
The degree of danger is calculated from the detection result of the motion state of the machine and of the position of a person.
The motion state of the machine is estimated by the machine motion state estimation unit based on information from sensors installed on the work machine or on a machine operation signal. Person detection is performed by combining the voice non-voice discrimination result and the moving object detection result based on video. Voice non-voice discrimination is realized by a sound source position estimation unit that estimates the sound source position from the signal output by the sound input unit, and a voice non-voice discrimination unit that discriminates voice from non-voice based on the sound source position output by the sound source position estimation unit. Moving object detection is realized by an image input unit including one or more cameras, such as visible light cameras or infrared cameras, and a moving object detection unit that detects a moving object based on the image output from the image input unit. Further, the sound source position estimation unit changes its estimation method, and the moving object detection unit changes its detection method, according to the degree of danger for each position.
[0018]
The system may further include an image output unit that displays an image according to the degree of risk, an external output sound generation unit that generates an output sound directed to the outside of the machine based on the degree of danger, an external sound output unit that outputs the externally directed output sound generated by the external output sound generation unit, and a machine control unit that controls the operation of the machine based on the degree of danger.
[0019]
The effects obtained by typical ones of the inventions disclosed in the present application will be
briefly described as follows.
[0020]
That is, according to a typical sound processing system, it is possible to provide a sound processing system which, for the safety of the people around a machine, extracts the voice of the person at the position whose voice should be extracted and instantaneously extracts speech useful for avoiding danger.
[0021]
A diagram showing an example of the hardware configuration of the sound processing system in Embodiment 1 of the present invention.
A diagram showing an example of the block configuration of the sound processing system in Embodiment 1 of the present invention.
A diagram showing an example of the block configuration of the sound input unit shown in FIG. 2.
A diagram showing an example of the block configuration of the sound source position estimation unit shown in FIG. 2.
A diagram showing an example of the block configuration of the moving body detection unit shown in FIG. 2.
A diagram showing an example of the block configuration of the sound extraction unit shown in FIG. 2.
A diagram showing an example of the data structure of the frequency domain signal Xf (f, τ) in a certain frame τ.
A diagram showing an example of the block configuration when method 2, selected by the sound source separation unit, is a sparsity-based adaptive minimum variance beamformer.
A flowchart showing an example of the processing flow of the sound extraction unit shown in FIG. 2.
A diagram showing an example of the block configuration of the sound processing system in Embodiment 3 of the present invention.
A diagram showing an example of the block configuration of the sound processing system in Embodiment 4 of the present invention.
A flowchart showing an example of the SPIRE algorithm in the sound source position estimation unit shown in FIG. 2.
A diagram showing an example of the external appearance when the sound processing system in Embodiment 1 of the present invention is applied to a construction machine.
[0022]
Hereinafter, an embodiment of the present invention will be described in detail based on the
drawings, taking a sound processing system integrated with, for example, a construction machine
as an example. In all the drawings for describing the embodiments, the same reference numeral
is attached to the same member in principle, and the repetitive description thereof will be
omitted.
[0023]
First Embodiment The first embodiment of the present invention will be described below with
reference to FIGS. 1 to 9, 12 and 13.
[0024]
FIG. 1 is a diagram showing an example of a hardware configuration of the sound processing
system according to the first embodiment of the present invention.
[0025]
The hardware configuration of the sound processing system 100 according to the present embodiment includes microphone arrays 1011 to 101M, speaker arrays 1021 to 102S, visible light cameras 1031 to 103A, infrared cameras 1041 to 104B, a microphone 105, headphones 106, an A / D-D / A converter 107, a central processing unit 108, a volatile memory 109, a storage medium 110, an image display device 111, audio cables 1141 to 114M, 1151 to 115S, 116 and 117, a monitor cable 118, digital cables 119, 1201 to 120A and 1211 to 121B, and the like.
The sound processing system 100 is integrated with a construction machine including a work machine 112, a machine operation input unit 113, and the like.
[0026]
The microphone arrays 1011 to 101M are microphone groups attached to the outside of the
construction machine and each array is composed of N microphones.
The speaker arrays 1021 to 102S are a group of speaker arrays mounted on the outside of the construction machine.
[0027]
The visible light cameras 1031 to 103A are a group of visible light cameras mounted outside the
construction machine. The infrared cameras 1041 to 104B are an infrared camera group
mounted on the outside of the construction machine.
[0028]
The microphone 105 is a microphone worn by the operator. The headphones 106 are
headphones worn by the operator.
[0029]
The A / D-D / A converter 107 is an A / D-D / A conversion device that converts the signals output from the microphone arrays 1011 to 101M and the signal output from the microphone 105 into digital data, and at the same time outputs analog sound pressure signals to the speaker arrays 1021 to 102S and the headphones 106.
[0030]
The central processing unit 108 is a central processing unit that processes the output of the A /
D-D / A converter 107.
The volatile memory 109 is a volatile memory that temporarily stores data of arithmetic
processing in the central processing unit 108 and the like. The storage medium 110 is a storage
medium for storing information such as a program. The image display device 111 is a display
device that displays information, images, and the like of arithmetic processing in the central
processing unit 108.
[0031]
The audio cables 1141 to 114M are cables for connecting the microphone arrays 1011 to 101M
and the A / D-D / A conversion device 107. The audio cables 1151 to 115S are cables for
connecting the speaker arrays 1021 to 102S and the A / D-D / A conversion device 107. The
audio cable 116 is a cable that connects the microphone 105 and the A / D-D / A conversion
device 107. The audio cable 117 is a cable for connecting the headphone 106 and the A / D-D /
A converter 107.
[0032]
The monitor cable 118 is a cable that connects the image display device 111 and the central
processing unit 108.
[0033]
The digital cable 119 is a cable that connects the A / D-D / A converter 107 and the central
processing unit 108.
The digital cables 1201 to 120A are cables that connect the visible light cameras 1031 to 103A
and the central processing unit 108. The digital cables 1211 to 121B are cables for connecting
the infrared cameras 1041 to 104B and the central processing unit 108.
[0034]
The work machine 112 is a construction machine having an arm and the like. The machine
operation input unit 113 is a part for inputting various operations of the construction machine.
[0035]
The operation of the hardware of the sound processing system 100 configured as described
above is as follows.
[0036]
Sound pressure data output from the microphone arrays 1011 to 101M are sent to the A / D-D /
A converter 107 via the audio cables 1141 to 114M.
The sound pressure data from the microphone arrays 1011 to 101M are converted into digital
sound pressure data by the A / D-D / A converter 107, respectively. In this conversion,
conversion timing is converted synchronously between signals. The converted digital sound
pressure data is sent to the central processing unit 108 through the digital cable 119, and the
central processing unit 108 performs acoustic signal processing. The digital sound pressure data
after the acoustic signal processing is sent to the A / D-D / A converter 107 via the digital cable
119. The digital sound pressure data from the central processing unit 108 is converted into
analog sound pressure data by the A / D-D / A converter 107 and is output from the headphone
106 through the audio cable 117.
[0037]
The digital sound pressure data X collected by the microphone arrays 1011 to 101M and sent to the central processing unit 108 contain the voices of the workers outside the work machine 112 mixed with noise such as the engine sound and arm driving sound emitted by the work machine 112. In the central processing unit 108, the risk degree H for each position is calculated based on the digital sound pressure data X, the image data VI obtained from the visible light cameras 1031 to 103A, the image data II obtained from the infrared cameras 1041 to 104B, the operation signals obtained from the machine operation input unit 113, and the speed information of the work machine 112. The degree of danger H is stored in the volatile memory 109. The central processing unit 108 changes the sound source position estimation method and the moving body detection method based on the degree of danger H, and sets positions where the degree of danger is relatively high as sound extraction positions. Sound extraction is performed by a method that can extract instantaneously for positions where the degree of danger is high, and by a method that can extract with high accuracy for positions where the degree of danger is low. The extraction signal Y is sent to the A / D-D / A converter 107 via the digital cable 119, converted into an analog signal, and output from the headphones 106 via the audio cable 117.
[0038]
The degree of danger H for each position stored in the volatile memory 109 is converted into an image in the central processing unit 108 and output to the image display device 111 via the monitor cable 118.
[0039]
The audio signal collected by the microphone 105 is converted into digital sound pressure data by the A / D-D / A converter 107 through the audio cable 116 and is input to the central processing unit 108 through the digital cable 119.
Further, directivity filters using the speaker arrays 1021 to 102S are stored in advance in the storage medium 110 for each position to which the directivity is to be directed. For the digital sound pressure data, a directivity filter that directs the directivity to a position where the degree of danger H is relatively high is selected and convolved to generate multi-channel digital signal data. The multi-channel digital signal data is input to the A / D-D / A converter 107 via the digital cable 119, converted into multi-channel analog signals by the A / D-D / A converter 107, and output from the speaker arrays 1021 to 102S through the audio cables 1151 to 115S.
[0040]
The central processing unit 108 controls, for the work machine 112, the type of movement, the
moving speed, the type of operation, the operation speed, and the like according to the danger
level H.
[0041]
The digital cable 119 uses a USB cable or the like.
A USB cable, a LAN cable or the like is used as the digital cables 1201 to 120A and the digital
cables 1211 to 121B.
[0042]
FIG. 13 is a view showing an example of the appearance when the sound processing system 100
according to the present embodiment is applied to a construction machine. FIG. 13 is a schematic
view of the construction machine as viewed from above.
[0043]
In the example of FIG. 13, the construction machine includes a cabinet 13001, an engine unit
13002, an arm unit 13003 and the like. Microphone arrays 1011 to 1014 are arranged at four
corners outside the construction machine. An operator operates in the cabinet 13001.
[0044]
For example, when the present invention is not used, almost no external sound can be heard inside the cabinet 13001. In addition, the construction machine itself has noise sources such as the engine unit 13002 and the arm unit 13003, so even if the operator listens to the sounds collected by the microphone arrays 1011 to 1014 as they are, the voices of surrounding people are buried in that noise and can hardly be heard. The present invention solves these problems.
[0045]
FIG. 2 is a diagram showing an example of a block configuration of the sound processing system 100 according to the present embodiment. The block configuration shown in FIG. 2 is a functional configuration realized by software, by the central processing unit 108 shown in FIG. 1 reading and executing a program stored in the storage medium 110. However, some of the components include the hardware configuration shown in FIG. 1.
[0046]
The sound processing system 100 according to the present embodiment includes: a sound input unit 201; a sound source position estimation unit 202 connected to the sound input unit 201; a sound extraction unit 203 connected to the sound input unit 201; a voice non-voice discrimination unit 204 connected to the sound source position estimation unit 202; a person detection unit 205 connected to the voice non-voice discrimination unit 204; a danger degree calculation unit 206 connected to the person detection unit 205 and leading to the sound source position estimation unit 202 and the sound extraction unit 203; a machine sensor input unit 207; a machine motion state estimation unit 209 connected to the machine sensor input unit 207 and leading to the danger degree calculation unit 206; a visible light input unit 210; an infrared input unit 211; a moving object detection unit 212 connected to the visible light input unit 210, the infrared input unit 211, and the danger degree calculation unit 206, and also connected to the person detection unit 205; a video output unit 213 connected to the person detection unit 205 and the danger degree calculation unit 206; an operator voice input unit 215; an external output sound generation unit 216 connected to the operator voice input unit 215 and the danger degree calculation unit 206; an external sound output unit 217 connected to the external output sound generation unit 216; a machine operation control unit 218 connected to the danger degree calculation unit 206; a sound output unit 219 connected to the sound extraction unit 203; and a machine operation input unit 221 connected to the machine motion state estimation unit 209.
[0047]
The voice non-voice discrimination unit 204 and the machine motion state estimation unit 209 use the machine dimensions 208.
The sound source position estimation unit 202 and the sound extraction unit 203 use the information of the microphone arrangement 214. The moving object detection unit 212 uses the camera projection matrix 220.
[0048]
The main functions (some components include the hardware configuration) by software of the
sound processing system 100 configured as described above are as follows.
[0049]
The sound input unit 201 is a functional unit including a plurality of microphones that collect
sound.
Details will be described later with reference to FIG. The sound source position estimation unit
202 is a functional unit that estimates the sound source position from the signal output from the
sound input unit 201 or estimates the sound source position from the signal output from the
sound extraction unit 203. Further, the sound source position estimation unit 202 changes the
estimation method based on the degree of danger for each position output by the degree of
danger calculation unit 206. Details will be described later with reference to FIG. The sound
extraction unit 203 is a functional unit that receives the signal output from the sound input unit
201 as an input and outputs a separation signal according to the degree of danger calculated by
the danger degree calculation unit 206. The sound extraction unit 203 includes a plurality of
sound source separation units, each sound source separation unit sets an extraction position
according to the degree of danger, and the sound source separation unit changes the separation
method according to the degree of danger. Details will be described later with reference to FIG.
[0050]
The voice non-voice determination unit 204 is a functional unit that discriminates voice from non-voice based on the sound source position output from the sound source position estimation unit 202. The person detection unit 205 is a functional unit that detects the position of a person based on the result of the voice non-voice determination output from the voice non-voice determination unit 204. The person detection unit 205 also performs person detection based on the signal output from the moving object detection unit 212.
[0051]
The degree-of-risk calculation unit 206 is a functional unit that calculates the degree of danger
associated with the contact with the surrounding person or object by the operation of the
machine. The degree-of-risk calculator 206 calculates the degree of danger for each position.
Furthermore, the degree-of-risk calculation unit 206 calculates the degree of danger based on the motion state output by the machine motion state estimation unit 209, or based on the person position detection result output by the person detection unit 205. The machine motion state estimation unit 209 is a functional unit that estimates the motion state of the machine based on information from sensors installed on the machine or on a machine operation signal.
[0052]
The image input unit is a functional unit that includes a visible light input unit 210 and an
infrared input unit 211, and includes one or more visible light cameras or infrared cameras. The
moving body detection unit 212 is a functional unit that performs moving body detection based
on the video output from the video input unit. Also, the moving object detection unit 212
changes the detection method based on the degree of danger for each position output by the
degree of danger calculation unit 206. Details will be described later with reference to FIG. The
video output unit 213 is a functional unit that displays a video based on the degree of danger
output by the degree-of-risk calculation unit 206.
[0053]
The external-oriented output sound generation unit 216 is a functional unit that generates an
external-oriented output sound with respect to the outside of the machine based on the degree of
risk that the degree-of-risk calculation unit 206 outputs. The external sound output unit 217 is a
functional unit that outputs the external output sound generated by the external output sound
generation unit 216.
[0054]
The machine operation control unit 218 is a functional unit that controls the operation of the
machine based on the degree of risk outputted by the degree-of-risk calculation unit 206. The
sound output unit 219 is a functional unit that outputs the separation signal output from the
sound extraction unit 203.
[0055]
In the following, the main functional units by software of the sound processing system 100 will
be described in detail.
[0056]
An example of the block configuration of the sound input unit 201 is shown in FIG.
The sound input unit 201 includes a multi-channel AD converter 301, a multi-channel frame
processing unit 302, a multi-channel short-term frequency analysis unit 303, and the like. The
multi-channel AD converter 301 is included in the A / D-D / A converter 107.
[0057]
In the sound input unit 201, the multi-channel analog sound pressure data obtained from the microphone arrays 1011 to 101M are converted into digital sound pressure data x_11 (t) to x_MN (t) by the multi-channel AD converter 301. Here, t is the discrete time for each sampling period. The converted digital sound pressure data x_11 (t) to x_MN (t) are passed to the multi-channel frame processing unit 302.
[0058]
The multi-channel frame processing unit 302 copies x_ij (t) for t = τs to t = τs + F_s − 1 into Xf_ij (t, τ) for t = 0 to t = F_s − 1. Here, τ is called the frame index, and is incremented by one after the processing from the multi-channel frame processing unit 302 to the sound output unit 219 is completed. s is called the frame shift and means the number of samples shifted for each frame. F_s is called the frame size and means the number of samples processed at one time per frame. i is an index (1, ..., M) denoting the microphone array number. j is an index (1, ..., N) denoting the microphone number.
[0059]
After that, Xf_ij (t, τ) is passed to the multi-channel short-time frequency analysis unit 303. The multi-channel short-time frequency analysis unit 303 performs DC component removal and window processing such as a Hamming window, Hanning window, or Blackman window on Xf_ij (t, τ), and then performs a short-time Fourier transform to obtain the frequency domain signal Xf_ij (f, τ). Let F be the number of frequency bins here. Xf_ij (f, τ) in a certain frame τ has a data structure as shown in FIG. The frequency domain signal Xf_ij (f, τ) is sent to the sound source position estimation unit 202 and the sound extraction unit 203.
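The framing and short-time frequency analysis described in [0058] and [0059] can be illustrated by the following minimal sketch in Python. It is not taken from the embodiment: the Hanning window choice, the helper name stft_frame, and the parameter values are assumptions made only for illustration.

import numpy as np

def stft_frame(x, tau, s, F_s):
    # Frame tau of one channel x(t) -> frequency domain signal Xf(f, tau).
    frame = x[tau * s : tau * s + F_s].astype(float)  # samples t = tau*s .. tau*s + F_s - 1
    frame -= frame.mean()                             # cut the DC component
    frame *= np.hanning(F_s)                          # window processing (Hanning window)
    return np.fft.rfft(frame)                         # short-time Fourier transform

# Example: one frame of a single microphone channel, frame size 512, frame shift 256.
x = np.random.randn(16000)                            # stand-in for digital sound pressure data x_ij(t)
Xf = stft_frame(x, tau=10, s=256, F_s=512)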
[0060]
FIG. 4 shows an example of the block configuration of the sound source position estimation unit 202. The sound source position estimation unit 202 includes per-frequency direction estimation units 4011 to 401M, a direction estimation integration unit 402, and the like.
[0061]
First, the per-frequency direction estimation unit 401i estimates the direction of arrival θ_i (f) of sound for each frequency index f from the multi-channel frequency domain signals Xf_i1 (f, τ) to Xf_iN (f, τ) corresponding to one microphone array 101i. If the number of microphone elements in the microphone array is two, θ is estimated by [Equation 1].
[0062]
[0063]
Here, ρ (f, τ) is the phase difference, at frame τ and frequency index f, between the input signals of the two microphone elements.
freq (f) is the frequency (Hz) of the frequency index f, and is calculated by [Equation 2].
[0064]
[0065]
Here, FS is the sampling rate of the A / D converter.
Let d be the physical spacing (m) of the two microphone elements, and c the speed of sound (m / s). Strictly speaking, the speed of sound changes depending on the temperature and the density of the medium, but it is usually fixed at a single value such as 340 m / s. In the noise removal processing here, since the same processing may be performed separately for each time-frequency point based on the above-mentioned "sparseness" assumption, the time-frequency suffix (f, τ) is omitted in the following.
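As a concrete illustration of the quantities defined around [Equation 1] and [Equation 2], the following Python sketch computes a direction estimate from a two-microphone phase difference. The patented equations are not reproduced here; the bin-to-Hz conversion and the arcsin relation below are the standard far-field textbook forms and are given only as an assumed reading of the surrounding definitions.

import numpy as np

def freq_hz(f, FS, F):
    # assumed convention for [Equation 2]: frequency index f -> Hz, with F frequency bins over 0 .. FS/2
    return f * FS / (2.0 * F)

def doa_two_mics(rho, f, FS, F, d, c=340.0):
    # rho: phase difference (rad) between the two microphone inputs at this time-frequency point
    # d:   microphone spacing (m), c: speed of sound (m / s)
    # standard far-field relation (assumed): rho = 2*pi*freq*d*sin(theta)/c
    arg = c * rho / (2.0 * np.pi * freq_hz(f, FS, F) * d)
    return np.arcsin(np.clip(arg, -1.0, 1.0))

theta = doa_two_mics(rho=0.3, f=50, FS=16000, F=257, d=0.05)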
[0066]
When the number of microphone elements in the microphone array is three or more, it is
possible to calculate the direction with high accuracy by the SPIRE algorithm (see Non-Patent
Document 3). Also in the SPIRE algorithm, the same processing is performed separately for each
time-frequency based on the above-mentioned "sparseness" assumption. FIG. 12 shows a
flowchart of the SPIRE algorithm.
[0067]
First, in the SPIRE algorithm, the placement of the microphone elements is read (S1201). Next, in the SPIRE algorithm, the microphone elements constituting each microphone pair are selected so that each microphone pair is composed of two microphone elements (S1202). At this time, it is desirable to choose the microphone elements so that the spacing between the two elements constituting a pair is different for each microphone pair.
[0068]
Next, the SPIRE algorithm sorts each microphone pair in order from the smallest microphone
interval, and stores them in the microphone pair queue (S1203). Here, let l be an index for
identifying one microphone pair, l = 1 be the microphone pair with the shortest microphone
interval, and l = L be the microphone pair with the longest microphone interval. A comparison
operation is performed to determine whether the number of elements in the microphone pair
queue is zero (S1204). While the number of elements is not 0 (S1204-No), S1205 and S1206
described below are repeated.
[0069]
That is, processing is next performed to read the one microphone pair l with the shortest spacing from the microphone pair queue and remove it from the queue (S1205). Then, in the subsequent phase difference estimation processing, an integer n_l satisfying [Equation 3] is first found for the read pair l (S1206). Since the range enclosed by the inequality corresponds to 2π, exactly one solution is always found. Then, [Equation 4] is executed.
[0070]
[0071]
[0072]
Further, before performing the above-described processing for l = 1, [Equation 5] is set as the initial value.
S1205 and S1206 are repeated until the number of elements in the microphone pair queue becomes 0 (S1204-Yes), at which point the direction is calculated from the phase difference according to [Equation 6] to obtain θ (f, τ) (S1207).
[0073]
[0074]
[0075]
Here, d_l is the distance between the microphone elements of the l-th microphone pair.
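The loop S1203 to S1207 can be sketched as follows. This is only an interpretation of the stepwise phase difference restoration idea from the surrounding text and Non-Patent Document 3; [Equation 3] to [Equation 6] themselves are not reproduced, and the unwrapping rule used below (keep each restored phase difference within π of the prediction scaled from the previous, shorter pair) is an assumption.

import numpy as np

def spire_direction(rho, d, freq, c=340.0):
    # rho  : measured phase differences, one per microphone pair, each in (-pi, pi]
    # d    : pair spacings (m), assumed sorted from shortest to longest (S1203)
    # freq : frequency (Hz) of this frequency index
    rho_hat = 0.0
    d_prev = None
    for rho_l, d_l in zip(rho, d):          # S1205: take pairs in order of increasing spacing
        if d_prev is None:
            pred = 0.0                      # initial value standing in for [Equation 5]
        else:
            pred = rho_hat * d_l / d_prev   # prediction scaled from the previous pair
        # S1206: choose the integer n_l so the restored phase lies within pi of the prediction
        n_l = np.round((pred - rho_l) / (2.0 * np.pi))
        rho_hat = rho_l + 2.0 * np.pi * n_l
        d_prev = d_l
    # S1207: direction from the restored phase difference of the longest pair
    arg = c * rho_hat / (2.0 * np.pi * freq * d_prev)
    return np.arcsin(np.clip(arg, -1.0, 1.0))

theta = spire_direction(rho=[0.4, -2.0, 1.1], d=[0.02, 0.08, 0.24], freq=2000.0)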
[0076]
The estimation accuracy of sound source direction estimation is known to increase as the microphone spacing increases. However, if the microphone spacing is longer than half the wavelength of the signal whose direction is being estimated, a single direction cannot be identified from the phase difference between the microphones: it is known that two or more directions giving the same phase difference can exist (spatial aliasing).
The SPIRE method is provided with a mechanism for selecting, out of the two or more candidate directions produced by a long microphone spacing, the direction closest to the sound source direction obtained from a short microphone spacing.
Therefore, it has the advantage that the sound source direction can be estimated with high accuracy even at long microphone spacings where spatial aliasing occurs.
[0077]
The direction estimation results θ_i (f, τ) output from the per-frequency direction estimation units 4011 to 401M are input to the direction estimation integration unit 402.
A position histogram h (p, τ), which takes larger values at position indexes p where a sound source is present, can be obtained by [Equation 7].
[0078]
[0079]
Here, by using [Equation 8], in which the addition process of [Equation 7] is thinned out according to the danger degree map data H (p, τ) calculated in the previous frame, the position histogram can be calculated with high followability for positions where the degree of danger is high.
[0080]
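A minimal sketch of how a position histogram of the kind described in [0077] to [0079] might be accumulated, with the addition thinned out for low-risk positions. The mapping from a direction estimate to position indexes (positions_on_ray) is a hypothetical helper, and [Equation 7] and [Equation 8] are not reproduced; the thinning rule below is an assumption.

import numpy as np

def update_position_histogram(h, thetas, H_prev, positions_on_ray, frame_index, thin_period=4):
    # h                : histogram over position indexes p (modified in place)
    # thetas           : dict {mic array index i: array of theta_i(f, tau) over frequency indexes}
    # H_prev           : danger degree map H(p, tau-1) from the previous frame
    # positions_on_ray : hypothetical helper mapping (array index, theta) -> position indexes p
    risk_threshold = 0.5
    for i, th_i in thetas.items():
        for f, theta in enumerate(th_i):
            for p in positions_on_ray(i, theta):
                # in the spirit of [Equation 8]: low-risk positions are accumulated only every
                # thin_period frames, high-risk positions every frame
                if H_prev[p] >= risk_threshold or frame_index % thin_period == 0:
                    h[p] += 1
    return h

# Example with a trivial ray model that maps each direction to a single position index.
h = np.zeros(360)
H_prev = np.zeros(360)
ray = lambda i, theta: [int(np.degrees(theta)) % 360]
h = update_position_histogram(h, {0: np.full(257, 0.3)}, H_prev, ray, frame_index=0)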
[0081]
The speech non-speech discrimination unit 204 generates a speech non-speech discrimination map v (p, τ) representing the presence or absence of speech at each position p, based on the position histogram h (p, τ) input from the sound source position estimation unit 202.
For the speech non-speech discrimination, h (p, τ) is regarded as the noise-containing speech signal of the person present at position p, noise estimation based on MCRA is performed, and the discrimination may then be performed using a general algorithm such as a discrimination method [Equation 9] based on the a posteriori signal-to-noise ratio (post-SNR) γ (p, τ); the choice of algorithm does not make an essential functional difference.
[0082]
[0083]
In addition, calculation cost can be reduced by setting v (p, τ) to be always 0 with respect to p
inside the machine based on the size 208 of the machine.
The voice non-voice discrimination map v (p, τ) is sent to the person detection unit 205.
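A simplified stand-in for the discrimination in [0081] to [0083]: the recursive noise tracker below is a crude substitute for MCRA, and the threshold test stands in for the undisclosed [Equation 9]; none of it is quoted from the embodiment.

import numpy as np

def update_speech_map(h, noise, v, alpha=0.95, gamma_threshold=3.0):
    # h     : position histogram h(p, tau)
    # noise : running per-position noise estimate (crude substitute for MCRA)
    # v     : voice non-voice map v(p, tau), updated in place
    noise[:] = np.where(v == 0, alpha * noise + (1 - alpha) * h, noise)  # track the noise floor
    gamma = h / np.maximum(noise, 1e-6)                                  # post-SNR-like ratio
    v[:] = (gamma > gamma_threshold).astype(int)                         # stand-in for [Equation 9]
    return v, noise

v, noise = update_speech_map(h=np.ones(360), noise=np.full(360, 0.1), v=np.zeros(360, dtype=int))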
[0084]
A visible light ray input unit 210 composed of visible light cameras 1031 to 103A sends visible
light image data VI to a moving body detection unit 212.
[0085]
An infrared input unit 211 including infrared cameras 1041 to 104B sends infrared image data II
to the moving body detection unit 212.
[0086]
FIG. 5 shows an example of a block configuration of the moving object detection unit 212.
The moving body detection unit 212 includes a background difference / interframe difference
calculation unit 501, a body surface detection unit 502, a visual pyramid intersection calculation
unit 503, and the like.
[0087]
The background difference / interframe difference calculation unit 501 calculates images EI_1 to
EI_A in which object regions are extracted by background difference processing and interframe
difference processing on the respective images based on the visible light ray image data VI_1 to
VI_A.
The body surface detection unit 502 calculates, based on the infrared image data II_1 to II_B,
images BI_1 to BI_B in which pixel regions having high temperatures for the respective images
are extracted as the body surface region.
The view pyramid intersection calculation unit 503 back projects the view cones of the object
regions of the images EI_1 to EI_A and the body surface regions of the images BI_1 to BI_B into a
three-dimensional space based on the camera projection matrix 220.
The moving object presence map e (p, τ) is updated as in [Equation 11] for the regions where the visual volumes intersect, within the three-dimensional region, obtained by [Equation 10], where the fields of view of the cameras overlap.
[0088]
[0089]
[0090]
Here, [Equation 12], in which the backprojection process of [Equation 10] is thinned out according to the risk map data H (p, τ) calculated in the previous frame, is also used; this raises the followability of the e (p, τ) calculation for positions where the degree of risk is high.
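A sketch of the per-camera steps of [0087], with a hypothetical backproject helper standing in for the view cone intersection of [Equation 10] and a simple smoothed update standing in for [Equation 11]; none of these are the patented equations, and the thresholds are illustrative assumptions.

import numpy as np

def object_mask(frame, background, prev_frame, diff_th=20.0):
    # background difference OR inter-frame difference (EI_a in [0087]); frames assumed float arrays
    return (np.abs(frame - background) > diff_th) | (np.abs(frame - prev_frame) > diff_th)

def body_surface_mask(ir_frame, temp_th=200.0):
    # high-temperature pixel regions of the infrared image (BI_b in [0087])
    return ir_frame > temp_th

def update_presence_map(e, masks, backproject, decay=0.9):
    # masks       : list of 2-D boolean masks, one per camera
    # backproject : hypothetical helper mapping a mask to a boolean map over position indexes p
    vol = np.logical_and.reduce([backproject(m) for m in masks])   # intersection of back-projections
    e[:] = decay * e + (1 - decay) * vol.astype(float)             # stand-in for [Equation 11]
    return e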
[0091]
[0092]
The person detection unit 205 calculates a person detection map d (p, τ) according to [Equation 13] based on the voice non-voice discrimination map v (p, τ) and the moving object presence map e (p, τ).
Here, wv is a weighting factor of 0 or more and 1 or less.
[0093]
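Given that wv is described as a weight between 0 and 1 combining the two maps, one plausible form of [Equation 13], stated here only as an assumption and not quoted from the embodiment, is the convex combination d (p, τ) = wv · v (p, τ) + (1 − wv) · e (p, τ).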
[0094]
The machine sensor input unit 207 includes, for example, sensors such as a speedometer of the machine and a hydraulic sensor of the machine's arm, and outputs the sensor signals as a vector C (t) = (c_1 (t), ..., c_Ω (t)).
[0095]
The machine motion state estimation unit 209 obtains the three-dimensional position P_k (t) of each small part z_k from the machine dimensions 208.
Here, k (k = 1, ..., K) is a part index.
In addition, it is assumed that a table giving the motion velocity vector V (t) = (V_1 (t), ..., V_K (t)) of the small parts z_k for each combination of the sensor signal vector C (t) and the position vector P (t) = (P_1 (t), ..., P_K (t)) is stored in the storage medium 110 in advance.
This table can easily be obtained by simulation at design time.
From this table, the velocity V_k (t) of the small part z_k is obtained.
[0096]
Further, the operation signal μ (t) is obtained from the machine operation input unit 221.
By also storing a table of the corresponding accelerations A (t) = (A_1 (t), ..., A_K (t)) for each combination of the operation signal μ (t) and P (t), the acceleration A_k (t) of the small part z_k is obtained from μ (t).
The predicted position P (t + Δt) of the small part z_k at time t + Δt can be obtained by [Equation 14]. Finally, a map g (p, t) of the shortest time until contact is obtained by [Equation 15].
[0097]
[0098]
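A sketch of the position prediction in [0095] and [0096]: the constant-acceleration form below is the standard kinematic relation and is only an assumed reading of [Equation 14], and the contact-time map is a simplified stand-in for [Equation 15]; horizon, step, and radius are illustrative parameters.

import numpy as np

def predict_position(P_k, V_k, A_k, dt):
    # assumed constant-acceleration reading of [Equation 14]
    return P_k + V_k * dt + 0.5 * A_k * dt * dt

def shortest_contact_time(positions, P, V, A, horizon=3.0, step=0.1, radius=0.5):
    # For each position p, the earliest time at which any small part z_k is predicted to come
    # within `radius` of p, or `horizon` if none does (stand-in for [Equation 15]).
    g = np.full(len(positions), horizon)
    for t in np.arange(0.0, horizon, step):
        pred = np.array([predict_position(P[k], V[k], A[k], t) for k in range(len(P))])
        for p_idx, p in enumerate(positions):
            if t < g[p_idx] and np.any(np.linalg.norm(pred - p, axis=1) < radius):
                g[p_idx] = t
    return g

# Example with two machine parts and two positions of interest.
P = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
V = np.array([[0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
A = np.zeros((2, 3))
g = shortest_contact_time(np.array([[2.0, 0.0, 1.0], [0.0, 5.0, 1.0]]), P, V, A)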
[0099]
The degree-of-risk calculation unit 206 calculates the danger degree map H (p, τ) by [Equation 16], based on the person detection map d (p, τ) input from the person detection unit 205 and the map g (p, t) of the shortest contact time input from the machine motion state estimation unit 209.
Here, ε and the other coefficient in [Equation 16] are appropriate constants.
[0100]
[0101]
The image output unit 213 superimposes the person detection map d (p, τ) and the danger
degree map H (p, τ) and presents the result.
[0102]
The sound extraction unit 203 extracts the extraction signal Yf (f, τ) based on the frequency domain signals Xf_11 (f, τ) to Xf_MN (f, τ) input from the sound input unit 201 and the risk map H (p, τ).
[0103]
An example of a block configuration of the sound extraction unit 203 is shown in FIG.
The sound extraction unit 203 includes an extraction direction selection unit 601, sound source
separation units 6021 to 602R, a mixing unit 603, and the like.
[0104]
First, the extraction direction selection unit 601 sorts H (p, τ) of all position indexes p, and
determines the top R positions p_1 to p_R as extraction positions.
The sound source separation units 6021 to 602R correspond to the extraction positions p_1 to
p_R, respectively.
A flowchart of the r-th sound source separation unit 602r (for example, 602R) is shown in FIG.
[0105]
In S901, the cases are divided according to whether H (p_r, τ) > T_h or H (p_r, τ) ≦ T_h.
When H (p_r, τ) is high (H (p_r, τ) > T_h) (S901-Yes), it is determined that high speed is particularly required, and method 1, which is a method that can extract instantaneously, is selected in S902. Method 1 may be, for example, binary masking: when the direction θ (f, τ) obtained for each frequency index by a direction estimation algorithm such as the SPIRE method described above overlaps the extraction position p_r, the frequency component is kept, and when it does not overlap, the component is set to 0.
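A sketch of the binary masking mentioned as method 1. The test of whether a direction "overlaps" the extraction position is delegated to a hypothetical helper direction_hits_position; the mask itself is the usual keep-or-zero rule.

import numpy as np

def binary_mask(Xf_ref, theta, p_r, direction_hits_position):
    # Xf_ref : reference-channel frequency domain signal, one complex value per frequency index
    # theta  : per-frequency direction estimates theta(f, tau)
    # direction_hits_position : hypothetical helper (theta_f, p_r) -> bool
    Yf = np.zeros_like(Xf_ref)
    for f in range(len(Xf_ref)):
        if direction_hits_position(theta[f], p_r):
            Yf[f] = Xf_ref[f]      # keep the frequency component
        # otherwise the component stays 0
    return Yf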
[0106]
On the other hand, when H (p_r, τ) is relatively low (H (p_r, τ) ≦ T_h) (S901-No), it is determined that high-accuracy extraction is required for smooth communication, and method 2, which is a method that can extract with high accuracy, is selected in step S903.
[0107]
FIG. 8 shows an example of the block configuration in the case of a minimum variance beamformer adapted based on sparsity, as an example of method 2.
Method 2 consists of a target sound / noise separation unit 801, a target sound steering vector update unit 802, a noise covariance matrix update unit 803, a filter update unit 804, and a filter multiplication unit 805, and its detailed configuration will be described based on FIG. 8.
[0108]
The target sound / noise separation unit 801 separates the input into a target sound signal X_des (f, τ) and a noise signal X_int (f, τ) ([Equation 17]) according to the direction θ (f, τ) obtained for each frequency index by the direction estimation algorithm, in the same manner as the binary masking described above. X_des (f, τ) is sent from the target sound / noise separation unit 801 to the target sound steering vector update unit 802. X_int (f, τ) is sent from the target sound / noise separation unit 801 to the noise covariance matrix update unit 803.
[0109]
[0110]
The target sound steering vector update unit 802 updates the target sound steering vector a (f, τ) = [a_0 (f, τ), ..., a_M−1 (f, τ)]^T based on [Equation 18].
Here, γs is an appropriate constant parameter of 0 or more and less than 1. Of course, for stability, the update may be performed only when | X_des_i (f, τ) | is sufficiently large.
[0111]
[0112]
The noise covariance matrix update unit 803 updates the noise covariance matrix R (f, τ) based on [Equation 19].
Here, X_int (f, τ) = [X_int_0 (f, τ), ..., X_int_M−1 (f, τ)]^T, and γn is an appropriate constant parameter of 0 or more and less than 1. Of course, for stability, the update may be performed only when | X_int (f, τ) | is sufficiently large.
[0113]
[0114]
The filter update unit 804 calculates the filter w (f, τ) from the target sound steering vector a (f, τ) and the noise covariance matrix R (f, τ) based on [Equation 20].
Here, γw is an appropriate constant parameter of 0 or more and less than 1.
[0115]
[0116]
Finally, the filter multiplication unit 805 multiplies Xf (f, τ) = [Xf_0 (f, τ), ..., Xf_M−1 (f, τ)]^T by the filter based on [Equation 21] to obtain the signal Yf (f, τ) in which the sound arriving from the designated direction is extracted.
[0117]
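A sketch of the recursive updates in [0110] to [0116], assuming the standard minimum variance (MVDR) forms; [Equation 18] to [Equation 21] are not reproduced, and the exponential smoothing of a, R, and w with γs, γn, and γw follows the common pattern rather than the patent's exact expressions.

import numpy as np

def mvdr_step(Xf, X_des, X_int, a, R, w, gamma_s=0.99, gamma_n=0.99, gamma_w=0.9, eps=1e-6):
    # Xf, X_des, X_int : length-M complex vectors (all mics; target-dominated; noise-dominated)
    # a : target sound steering vector, R : noise covariance matrix, w : extraction filter
    if np.abs(X_des[0]) > eps:
        # steering vector update (in the spirit of [Equation 18]), normalized to the first mic
        a = gamma_s * a + (1.0 - gamma_s) * X_des / X_des[0]
    # noise covariance update (in the spirit of [Equation 19])
    R = gamma_n * R + (1.0 - gamma_n) * np.outer(X_int, X_int.conj())
    # MVDR filter (in the spirit of [Equation 20]), smoothed with gamma_w
    Rinv_a = np.linalg.solve(R + eps * np.eye(len(a)), a)
    w = gamma_w * w + (1.0 - gamma_w) * Rinv_a / (a.conj() @ Rinv_a)
    # filter multiplication (in the spirit of [Equation 21]): extracted signal for this frequency
    Yf = w.conj() @ Xf
    return Yf, a, R, w

# Example for M = 4 microphones at one time-frequency point.
M = 4
a = np.ones(M, dtype=complex); R = np.eye(M, dtype=complex); w = np.ones(M, dtype=complex) / M
Xf = np.random.randn(M) + 1j * np.random.randn(M)
Yf, a, R, w = mvdr_step(Xf, Xf, 0.1 * Xf, a, R, w)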
[0118]
In this example, the minimum variance beamformer adapted based on sparsity is used as method 2. However, method 2 may instead use ICA, which is another high-accuracy extraction method.
Since ICA uses higher-order statistics, a speech signal of about several seconds is required for adaptation; while it is therefore difficult to extract instantaneously, highly accurate extraction is possible.
Although only the two methods 1 and 2 are selected and executed in this example, the number of methods may be three or more, and they may be selected and executed according to the degree of risk.
[0119]
The mixing unit 603 mixes the frequency domain signals output from the sound source
separation units 6021 to 602R, and outputs an extraction signal Yf (f, τ).
[0120]
The frequency domain frame signal Yf (f, τ) calculated by the above procedure is sent to the sound output unit 219, where it is converted to the time domain signal y (t, τ) by an inverse FFT.
y (t, τ) is overlap-added every frame period with compensation for the window function to form y (t), and y (t) is output from the headphones 106 via D / A conversion.
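A sketch of the synthesis step in [0120]: inverse FFT of each frame followed by overlap-add with window compensation. The window-compensation strategy (dividing by the summed squared window) is one common choice and is assumed here, not taken from the embodiment.

import numpy as np

def overlap_add(frames_Yf, s, F_s):
    # frames_Yf : list of rfft spectra Yf(f, tau), one per frame
    # s         : frame shift, F_s : frame size
    win = np.hanning(F_s)
    n_frames = len(frames_Yf)
    y = np.zeros(s * (n_frames - 1) + F_s)
    wsum = np.zeros_like(y)
    for tau, Yf in enumerate(frames_Yf):
        y_frame = np.fft.irfft(Yf, F_s)            # y(t, tau)
        y[tau * s : tau * s + F_s] += win * y_frame
        wsum[tau * s : tau * s + F_s] += win ** 2
    return y / np.maximum(wsum, 1e-6)              # compensate for the analysis/synthesis windows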
[0121]
The external output sound generation unit 216 selects, based on the risk map H (p, τ), a filter that gives the speaker array directivity toward a position p_r at which H (p, τ) is large.
The voice signal input from the operator voice input unit 215, which includes the microphone 105 on the operator side, is multiplied by the above filter to generate a multi-channel signal, and the external sound output unit 217 outputs it from the speaker arrays 1021 to 102S via D / A conversion.
[0122]
The machine operation control unit 218 decelerates or stops the operation of the machine when
the danger degree map H (p, τ) is very large with respect to a certain p.
[0123]
According to the sound processing system in the present embodiment described above, the
following effects can be obtained.
(1) Since the danger degree calculation unit 206 calculates the degree of danger for each position and the sound extraction unit 203 automatically selects positions with a high degree of danger as extraction positions, it is possible to extract the voice of a person present at a position with a high degree of danger, whose voice should be extracted for safety.
(2) Since the sound extraction unit 203 selects a method that can extract instantaneously for the sound source separation unit whose extraction position is a position with a high degree of danger, the voice of a person at a position with a high degree of danger is extracted in real time. Thereby, the operator can avoid danger instantaneously.
(3) In the sound extraction unit 203, the sound source separation unit whose extraction position is a position with a relatively low degree of danger selects the high-accuracy separation method, and therefore outputs an extracted voice with little residual noise. As a result, the operator can recognize the content of the voice of the surrounding person, and a smooth conversation can be held between the operator and the surrounding person through the external sound output unit 217.
(4) Since the sound source position estimation unit 202 changes its estimation method and the moving body detection unit 212 changes its detection method according to the degree of danger for each position calculated by the danger degree calculation unit 206, calculation is performed preferentially for positions with a high degree of danger and the frequency of calculation for positions with a low degree of danger can be reduced, so the update interval of the danger degree calculation is shortened for positions with a high degree of danger, where prompt action by the operator is required.
(5) Since the degree of danger is presented visually as an image on the video output unit 213, danger can be avoided even when the operator cannot use hearing for some reason, such as when talking by telephone or radio.
(6) Since the external sound output unit 217 directs its directivity to positions where the degree of danger is high, it can call the attention of people around the machine even in an environment where it is difficult to hear because of the noise of the machine.
(7) Since the machine operation control unit 218 urgently controls the machine itself to avoid danger when the degree of danger is high, an accident may be avoided even when the operator's avoidance judgment is not in time.
[0124]
Second Embodiment The second embodiment of the present invention will be described below
with reference to FIG.
[0125]
In the first embodiment, an example was described in which the r-th sound source separation unit 602r (for example, 602R) of the sound extraction unit 203 switches the method for each position. The present embodiment is an example applied to a configuration in which the method is switched not for each position but only over time.
[0126]
According to the sound processing system of this embodiment having such a configuration, in addition to the effects of the first embodiment, there is the effect that, for example, even in a configuration in which all sound source separation units select method 1 when H (p, τ) > T_h for a certain p, periods of high risk can be extracted in real time and periods of low risk can be extracted with high accuracy.
[0127]
Third Embodiment The third embodiment of the present invention will be described below with
reference to FIG.
FIG. 10 is a diagram showing an example of a block configuration of the sound processing
system according to the present embodiment.
[0128]
In this embodiment, the visible light input unit 210, the infrared input unit 211, the moving object detection unit 212, the video output unit 213, the operator voice input unit 215, the external output sound generation unit 216, the external sound output unit 217, the machine operation control unit 218, and the camera projection matrix 220 are not included.
[0129]
That is, as shown in FIG. 10, the sound processing system according to the present embodiment includes a sound input unit 201, a sound source position estimation unit 202, a sound extraction unit 203, a voice non-voice determination unit 204, a person detection unit 205, a danger degree calculation unit 206, a machine sensor input unit 207, a machine motion state estimation unit 209, a sound output unit 219, a machine operation input unit 221, and the like, and each functional unit has the same function as in the first embodiment.
[0130]
According to the sound processing system of this embodiment having such a configuration, the following effects (1) to (4), that is, the effects of the first embodiment except for (5) to (7), can be obtained.
(1) Since the danger degree calculation unit 206 calculates the degree of danger for each position and the sound extraction unit 203 automatically selects positions with a high degree of danger as extraction positions, it is possible to extract the voice of a person present at a position with a high degree of danger, whose voice should be extracted for safety.
(2) Since the sound extraction unit 203 selects a method that can extract instantaneously for the sound source separation unit whose extraction position is a position with a high degree of danger, the voice of a person at a position with a high degree of danger is extracted in real time. Thereby, the operator can avoid danger instantaneously.
(3) In the sound extraction unit 203, the sound source separation unit whose extraction position is a position with a relatively low degree of danger selects the high-accuracy separation method, and therefore outputs an extracted voice with little residual noise. As a result, the operator can recognize the content of the voice of the surrounding person.
(4) Since the sound source position estimation unit 202 changes its estimation method according to the degree of danger for each position calculated by the danger degree calculation unit 206, calculation is performed preferentially for positions with a high degree of danger and the frequency of calculation for positions with a low degree of danger can be reduced, so the update interval of the danger degree calculation is shortened for positions with a high degree of danger, where prompt action by the operator is required.
[0131]
Fourth Embodiment The fourth embodiment of the present invention will be described below
with reference to FIG. FIG. 11 is a diagram showing an example of a block configuration of the
sound processing system according to the present embodiment.
[0132]
The present embodiment has a configuration in which, in addition to the omissions of the third embodiment, the sound source position estimation unit 202, the voice non-voice determination unit 204, and the person detection unit 205 are not provided.
[0133]
That is, as shown in FIG. 11, the sound processing system according to the present embodiment includes a sound input unit 201, a sound extraction unit 203, a risk degree calculation unit 206, a machine sensor input unit 207, a machine motion state estimation unit 209, a sound output unit 219, a machine operation input unit 221, and the like, and each functional unit has the same function as in the first embodiment.
[0134]
According to the sound processing system of this embodiment having such a configuration, the following effects (1) to (3), that is, the effects of the third embodiment except for (4), can be obtained.
(1) Even when the person detection unit is not provided, the danger degree calculation unit 206 calculates the degree of danger for each position and the sound extraction unit 203 automatically selects positions with a high degree of danger as extraction positions, so it is possible to extract the voice of a person present at a position with a high degree of danger, whose voice should be extracted for safety.
(2) Since the sound extraction unit 203 selects a method that can extract instantaneously for the sound source separation unit whose extraction position is a position with a high degree of danger, the voice of a person at a position with a high degree of danger is extracted in real time. Thereby, the operator can avoid danger instantaneously.
(3) In the sound extraction unit 203, the sound source separation unit whose extraction position is a position with a relatively low degree of danger selects the high-accuracy separation method, and therefore outputs an extracted voice with little residual noise. As a result, the operator can recognize the content of the voice of the surrounding person.
[0135]
As described above, the invention made by the present inventors has been explained concretely based on embodiments, but it goes without saying that the present invention is not limited to the above-mentioned embodiments and can be variously modified within a range not departing from its gist.
[0136]
For example, in the above embodiments, a configuration example in which the sound processing system is integrated with a construction machine has been described, but the present invention is applicable not only to construction machines but also to general vehicles, work machines, and the like.
[0137]
The sound processing system according to the present invention relates to sound processing technology suitable for an operator or driver operating a relatively large machine such as a construction machine, a vehicle, or a work machine to grasp the situation of people around the machine, and in particular is applicable to an acoustic processing system suitable for the safety of the people around a machine and to a machine using the same.
[0138]
DESCRIPTION OF SYMBOLS: 100 ... sound processing system, 1011 to 101M ... microphone array, 1021 to 102S ... speaker array, 1031 to 103A ... visible light camera, 1041 to 104B ... infrared camera, 105 ... microphone, 106 ... headphones, 107 ... A / D-D / A converter, 108 ... central processing unit, 109 ... volatile memory, 110 ... storage medium, 111 ... image display device, 112 ... work machine, 113 ... machine operation input unit, 1141 to 114M, 1151 to 115S, 116, 117 ... audio cable, 118 ... monitor cable, 119, 1201 to 120A, 1211 to 121B ... digital cable, 201 ... sound input unit, 202 ... sound source position estimation unit, 203 ... sound extraction unit, 204 ... voice non-voice discrimination unit, 205 ... person detection unit, 206 ... risk degree calculation unit, 207 ... machine sensor input unit, 208 ... machine dimensions, 209 ... machine motion state estimation unit, 210 ... visible light input unit, 211 ... infrared input unit, 212 ... moving body detection unit, 213 ... video output unit, 214 ... microphone arrangement, 215 ... operator voice input unit, 216 ... external output sound generation unit, 217 ... external sound output unit, 218 ... machine operation control unit, 219 ... sound output unit, 220 ... camera projection matrix, 221 ... machine operation input unit, 301 ... multi-channel AD converter, 302 ... multi-channel frame processing unit, 303 ... multi-channel short-time frequency analysis unit, 4011 to 401M ... per-frequency direction estimation unit, 402 ... direction estimation integration unit, 501 ... background difference / interframe difference calculation unit, 502 ... body surface detection unit, 503 ... view cone intersection calculation unit, 601 ... extraction direction selection unit, 6021 to 602R ... sound source separation unit, 603 ... mixing unit, 801 ... target sound / noise separation unit, 802 ... target sound steering vector update unit, 803 ... noise covariance matrix update unit, 804 ... filter update unit, 805 ... filter multiplication unit, 13001 ... cabinet, 13002 ... engine unit, 13003 ... arm unit.