Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPH08286680
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a sound extraction apparatus, and more particularly to a sound extraction apparatus for extracting a sound emitted by an object (in the present invention, "object" means a human or a thing expected to emit a sound).
[0002]
2. Description of the Related Art Conventionally, when inspecting the degree of deterioration of a structure such as a building or a bridge, the degree of aging has mainly been examined by calculating, through simulation, the squeaks and similar sounds emitted from a predetermined portion of the structure. However, since the value obtained by such a simulation is only a predicted value, it has been desired, for a more precise inspection, to distinguish the actual squeaking sound emitted from a predetermined part of the structure from the surrounding noise and to extract it.
[0003]
On the other hand, a technique is known in which audio signals of sounds collected by a plurality of microphones are superimposed on the same time axis, each signal is delayed by an amount corresponding to the distance between its microphone and the target sound source, and the delayed signals are then averaged, so that only the sound emitted from the target sound source is extracted. This technique has also been applied to a hand-held video camera that performs shooting and recording simultaneously: by extracting the sound emitted from the subject during shooting, the focus of the recording is matched to the focus of the image of the subject (see Japanese Patent Application Laid-Open No. 5-308553).
[0004]
However, the above-mentioned technology relating to the hand-held video camera is effective only for a single subject located within a narrow area inside the field of view of the video camera. Moreover, because the camera is hand-held, the number of microphones attached to it is small and the spacing between them is narrow, so the effect of noise was large and it was relatively difficult to collect highly realistic sound.
[0005]
By the way, there is a conventional image recognition technique in which a plurality of television cameras are arranged on a ceiling and the position of an object present in a room is detected based on the image information photographed by those television cameras.
[0006]
However, when an object moves, the moving object is photographed by turning the plurality of television cameras to follow it and performing focus adjustment. Since the television cameras must be moved and refocused in this way, there is a problem that a delay occurs before image data including the object is obtained.
[0007]
The present invention has been made in consideration of the above-mentioned facts, and its first object is to provide a sound extraction device that combines the image recognition technology for position detection mentioned above with the sound extraction technology, so that the position of an object whose position is uncertain can be detected and the sound emitted by the object can be extracted based on that position.
Another object of the present invention is to provide a sound extraction device that can detect the position of the object more efficiently and then extract the sound emitted by the object based on the detected position.
[0008]
According to the first aspect of the present invention, there is provided a sound extraction device comprising: photographing means for photographing an area including an object serving as a sound source; image recognition means for recognizing the position of the object from the image information of the area photographed by the photographing means; a plurality of microphones arranged at predetermined positions for collecting the sound emitted by the object; and extraction means which selects the time-series data of a plurality of collected sounds from among the time-series data of the sounds collected by the respective microphones, shifts the selected time-series data, based on the position of the object recognized by the image recognition means and the positions of the microphones that collected the selected sounds, so that the sound emitted by the object is synchronized, and extracts the sound emitted by the object by averaging the shifted time-series data.
[0009]
In the first aspect of the present invention, the photographing means photographs an area
including an object as a sound source, and the image recognition means recognizes the position
of the object from the image information of the area photographed by the photographing means.
[0010]
For example, the position of the target person A is recognized from the image information of the room 50 photographed by each of the plurality of television cameras 16 installed on the ceiling 52 as shown in FIG. 2.
That is, from the image information, the head P of the target person A is extracted as a region having feature amounts specific to the human head, such as a substantially spherical shape largely covered with hair and therefore containing many black portions.
Then, among the large number of rectangular-parallelepiped regions obtained by virtually dividing the room 50 equally along the arrow X, Y and Z directions, it is recognized to which region the extracted head P corresponds.
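As an illustration of this region-based position recognition, the following sketch maps an estimated 3-D head position to the index of the virtually divided rectangular-parallelepiped cell. It is only a minimal sketch; the room dimensions, the 16-per-axis division and all function names are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def locate_in_grid(point_xyz, room_dims, divisions=(16, 16, 16)):
    """Map a 3-D point (e.g. the estimated centre of head P) to the index of
    the rectangular-parallelepiped cell obtained by equally dividing the room
    along the X, Y and Z directions.  All names here are illustrative."""
    point = np.asarray(point_xyz, dtype=float)
    dims = np.asarray(room_dims, dtype=float)
    divs = np.asarray(divisions)
    # Size of one cell along each axis.
    cell = dims / divs
    # Integer cell index, clipped so points on the far wall stay in range.
    idx = np.clip((point // cell).astype(int), 0, divs - 1)
    return tuple(idx)

# Example: a head detected at (1.2 m, 2.5 m, 1.6 m) from the room origin,
# in an assumed 4 m x 4 m x 3 m room divided 16 x 16 x 16.
print(locate_in_grid((1.2, 2.5, 1.6), (4.0, 4.0, 3.0)))
```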
[0011]
On the other hand, the sound emitted by the object is collected by a plurality of microphones disposed at predetermined positions. For example, as shown in FIG. 2, microphones disposed on the ceiling of a room in which two persons are present collect sounds including the voices of the two target persons A and B and some noise. It is assumed that the time-series data of the sound collected by each microphone has the waveform shown in FIG. 1A (although the number of microphones is set to seven here for convenience of explanation, the present invention is not limited thereto, and the number may be increased further).
[0012]
As shown in FIG. 1A, in the waveform of the time-series data of the sound collected by each microphone, the portions corresponding to the voice of the target person A and to the voice of the target person B are shifted along the time axis (horizontal axis). That is, the time required for a target person's voice to reach a microphone differs depending on the distance between that person and the microphone. For example, since the microphone 1 is close to the target person A and far from the target person B, in the time-series data of the microphone 1 the portion corresponding to the voice of the target person A appears first along the time axis, followed by the portion corresponding to the voice of the target person B.
[0013]
The extraction means selects the time-series data of a plurality of collected sounds from among the time-series data of the sounds collected by the respective microphones. Here, the time-series data of the sounds collected by all the microphones may be selected, or, as in the invention according to claim 8 described later, the time-series data of the sounds collected by microphones separated from the position of the object by a predetermined distance or more may be excluded and the time-series data of the other collected sounds selected.
[0014]
Then, based on the position of the object recognized by the image recognition means and the positions of the microphones that collected the selected sounds, the extraction means shifts the time-series data of the selected sounds so that the sound emitted by the object is synchronized.
[0015]
For example, taking the case where the target person A in FIG. 2 is the object and the voice of the target person A is to be extracted, the distance between the head P of the target person A and each microphone is divided by the speed of sound to calculate the delay with which each microphone collects the voice of the target person A.
Then, as shown in FIG. 1B, for each microphone, time-series data is obtained by delaying the time-series data of the sound collected by that microphone along the time axis by the corresponding delay time. As a result, the portions corresponding to the voice of the target person A become approximately synchronized (aligned in phase) along the time axis across the microphones, while the portions corresponding to the voice of the target person B and to other noises remain out of phase.
[0016]
Furthermore, the extraction means extracts the sound emitted by the object by averaging the shifted time-series data. For example, all the time-series data of the microphones 1 to 7 shown in FIG. 1B are added in synchronization (superimposed), and the amplitude of the waveform after the addition is divided by the number of microphones, 7. As a result, as shown in FIG. 1C, the portions corresponding to the voice of the target person B and to other noises become extremely small in amplitude after this arithmetic averaging, so that only the portion corresponding to the voice of the target person A is extracted.
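The shift-and-average processing described in paragraphs [0011] to [0016] corresponds to what is commonly called delay-and-sum processing. The following sketch shows the idea under simplifying assumptions (a fixed speed of sound and sample-accurate shifts); it is illustrative only and is not the patent's implementation.

```python
import numpy as np

SPEED_OF_SOUND = 340.0  # m/s, assumed value for ordinary room temperature

def delay_and_sum(signals, mic_positions, source_position, fs):
    """Shift each microphone's time series so the target's voice is in phase,
    then average.  `signals` is a (num_mics, num_samples) array, positions are
    in metres, and fs is the sample rate in Hz.  All names are illustrative."""
    signals = np.asarray(signals, dtype=float)
    mics = np.asarray(mic_positions, dtype=float)
    src = np.asarray(source_position, dtype=float)

    # Propagation time from the object to each microphone.
    delays = np.linalg.norm(mics - src, axis=1) / SPEED_OF_SOUND
    # Advance each channel so the earliest-arriving channel needs no shift.
    shifts = np.round((delays - delays.min()) * fs).astype(int)

    aligned = np.zeros_like(signals)
    for ch, shift in enumerate(shifts):
        # Discard the samples recorded before the target's sound arrived.
        aligned[ch, :signals.shape[1] - shift] = signals[ch, shift:]

    # Averaging: the in-phase target component survives, while uncorrelated
    # noise and other voices shrink roughly in proportion to 1 / num_mics.
    return aligned.mean(axis=0)
```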
[0017]
As described above, according to the first aspect of the present invention, the position of the
object can be recognized, and the sound emitted by the object can be distinguished from
surrounding noise based on the position.
[0018]
Further, in order to achieve the first object, the invention according to claim 2 is characterized in that, in the invention according to claim 1, the image recognition means also recognizes, from the image information of the area including the object, the direction in which the object emits a sound, and, based on the position of the object and the direction in which it emits sound, re-recognizes as the position of the object a position at which the sound emitted by the object can be extracted well.
[0019]
In the second aspect of the invention, the image recognition means also recognizes the direction in which the object emits a sound from the image information of the area including the object. For example, the direction in which the target person A shown in FIG. 2 emits a sound (voice) is recognized as follows.
That is, after the head P has been recognized in the manner described above, the trunk S located under the head P is recognized, and, based on the general feature that in the trunk S the chest width L2 is smaller than the shoulder width L1, the target person A is estimated to be facing either the arrow V direction or the opposite direction. Next, based on the general feature that the ratio of hair on the surface of the head P is higher on the side where the face is not located than on the side where the face is located, and because the degree of blackness is higher toward the back side of the drawing in FIG. 2, the head P is presumed to be facing in the arrow V direction, and the direction in which the target person A emits a voice is recognized as the arrow V direction.
[0020]
Furthermore, based on the position of the object and the direction in which the object emits sound, the image recognition means re-recognizes, as the position of the object, a position at which the sound emitted by the object can be extracted favorably, that is, a position at which all frequency components of the original sound, from the low range to the high range, can be extracted without omission at substantially the same level (for example, a position separated from the position of the object by a predetermined distance (several tens of centimeters) in the direction in which the sound is emitted).
[0021]
The sound extraction described above is then performed based on the position of the object recognized in this way, that is, the position at which the sound emitted by the object can be extracted well. In particular, when the directivity of the sound emitted by the object is strong, this makes it possible to perform sound extraction with higher accuracy.
[0022]
Also, in order to achieve the first object, the invention according to claim 3 is characterized in that, in the invention according to claim 1, when the object moves, the photographing means follows the movement of the object and photographs the area including the object.
[0023]
According to the third aspect of the invention, when the object moves, the photographing means follows the movement of the object and photographs the area including the object.
Thereby, the image recognition means recognizes the position of the moving object from the image information of the photographed area, and the extraction means extracts the sound from the moving object in the manner described above, based on the position recognized by the image recognition means.
Therefore, the sound from a moving object can be extracted.
[0024]
Further, in order to achieve the first object, the invention according to claim 4 is characterized in that, in the invention according to claim 1, when there are a plurality of objects, the photographing means photographs an area including the plurality of objects, the image recognition means recognizes the position of each of the plurality of objects from the image information of the photographed area, and the extraction means extracts the sound from each of the plurality of objects.
[0025]
According to the fourth aspect of the invention, when there are a plurality of objects, the photographing means photographs an area including the plurality of objects, and the image recognition means recognizes the position of each of the plurality of objects from the image information of the photographed area.
Then, based on the positions recognized by the image recognition means, the extraction means extracts the sound from each of the plurality of objects in the manner described above.
Thereby, the sound from each of a plurality of objects can also be extracted.
[0026]
In order to achieve the first object, the invention according to claim 5 is characterized in that, in the invention according to claim 1, the device further includes detection means for detecting an acoustic environment state that is a factor affecting at least one of the speed of sound and the sound propagation path in the area including the object and the plurality of microphones, and the extraction means corrects the shift of the time-series data of the collected sound, based on the changed acoustic environment state, when the acoustic environment state detected by the detection means changes.
[0027]
In the invention according to the fifth aspect, the detection means detects an acoustic environment state, such as temperature, wind force or wind direction, which is a factor affecting at least one of the speed of sound and the sound propagation path in the area including the object and the plurality of microphones.
Then, when the acoustic environment state detected by the detection means changes, the extraction means corrects the shift of the time-series data of the collected sound based on the changed acoustic environment state, for example as follows.
[0028]
That is, by referring to a sound-speed correction table in which the ratio of the speed of sound at each detected temperature to a standard speed of sound is calculated in advance and stored against the detected temperature, the delay time of sound collection at each microphone is corrected based on the ratio between the speed of sound corresponding to the detected temperature and the standard speed of sound, and the delay operation is performed according to the corrected delay time. Alternatively, the delay time of sound collection at each microphone may be corrected by dividing the distance between the position of the object and the position of each microphone by the speed of sound corresponding to the detected temperature, and the delay operation performed according to the corrected delay time.
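A minimal sketch of the temperature correction described above follows; the linear speed-of-sound approximation (c ≈ 331.5 + 0.6·T) and the function names are assumptions of this sketch, not values specified in the patent.

```python
def corrected_delay(distance_m, temperature_c):
    """Recompute a per-microphone delay with the speed of sound at the
    detected temperature instead of the standard value (illustrative)."""
    # Common approximation: c ≈ 331.5 + 0.6 * T (m/s, T in degrees Celsius).
    speed = 331.5 + 0.6 * temperature_c
    return distance_m / speed

def corrected_delay_via_ratio(standard_delay_s, temperature_c, standard_speed=340.0):
    """Same correction expressed as a ratio against a standard speed of sound,
    as in the sound-speed correction table described above (illustrative)."""
    speed = 331.5 + 0.6 * temperature_c
    return standard_delay_s * (standard_speed / speed)
```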
[0029]
Also, for example, the bending (change) of the sound propagation path from the position of the object to the position of each microphone under the detected wind force and wind direction is estimated based on propagation-path information obtained in advance by simulation for various assumed wind force and wind direction values, the delay time of sound collection at each microphone is corrected by dividing the distance along the estimated propagation path by the speed of sound, and the delay operation is then performed.
[0030]
As described above, it is possible to extract sound with high accuracy according to the change of
the acoustic environment state.
[0031]
In order to achieve the first object, the invention according to claim 6 is characterized in that, in the invention according to claim 1, the extraction means weights and averages the time-series data of the collected sound in the high range based on information relating to the directivity of the high range.
[0032]
FIG. 11 shows the regions over which the components of each frequency band of a sound propagate.
It can be seen that the lower range propagates over a wide area, while the higher range propagates almost only in the direction in which the sound is emitted (arrow D).
That is, the directivity of sound differs depending on the frequency band: directivity is generally lower in the low range and stronger in the high range.
Therefore, a microphone positioned in the direction in which the object emits sound collects frequency components over almost the entire range from the low range to the high range, whereas the other microphones hardly collect the high-range components.
[0033]
However, according to the sixth aspect of the present invention, the extraction means performs weighted averaging based on the information relating to the directivity of the high range, so as to correct the imbalance between the high-range sound collected by the microphone positioned in the direction in which the object emits the sound and the high-range sound collected by the other microphones. This prevents the high range from becoming relatively weaker than the low range.
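A hedged sketch of this weighted averaging of the high range: the low band is averaged as before, while the high band is combined with per-microphone weights. The 2 kHz band split, the filter design and all names are assumptions made for illustration only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def weighted_high_band_average(aligned, weights, fs, split_hz=2000.0):
    """Average the low band normally, but apply per-microphone weights to the
    high band so that microphones off the object's facing direction (which
    barely pick up high frequencies) do not dilute it.  Illustrative sketch."""
    aligned = np.asarray(aligned, dtype=float)      # (num_mics, num_samples)
    weights = np.asarray(weights, dtype=float)      # one weight per microphone

    lp = butter(4, split_hz, btype="low", fs=fs, output="sos")
    hp = butter(4, split_hz, btype="high", fs=fs, output="sos")
    low = sosfilt(lp, aligned, axis=1)
    high = sosfilt(hp, aligned, axis=1)

    low_avg = low.mean(axis=0)
    # Weighted average of the high band; weights would favour the microphone
    # located in the direction in which the object emits the sound.
    high_avg = (weights[:, None] * high).sum(axis=0) / weights.sum()
    return low_avg + high_avg
```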
[0034]
Further, in order to achieve the first object, the invention according to claim 7 is characterized in that, in the invention according to claim 1, the image recognition means also recognizes, from the image information of the area including the object, the direction in which the object emits sound and the position and orientation of a sound-reflecting surface located around the object, and the extraction means shifts the time-series data of the selected collected sound so as to synchronize either the direct sound from the object or the reflected sound reflected by the reflecting surface, based on the positions of the microphones that collected the selected sounds, the position of the object, the direction in which the object emits sound, and the position and orientation of the reflecting surface.
[0035]
By the way, when the sound emitted by the object is reflected by some reflecting surface and then collected by a microphone, the reflected sound is usually very weak compared with the direct sound, so that it is automatically removed together with the other noise components by the averaging.
However, when the reflecting surface is located close to the object and in the direction in which the object emits sound, and a microphone is located in the traveling direction of the reflected sound from the reflecting surface, the reflected sound can be larger than the direct sound, and it is then rather effective to extract the sound emitted by the object by collecting the sound reflected by the reflecting surface.
[0036]
Therefore, according to the seventh aspect of the present invention, the image recognition means further recognizes, from the image information of the area including the object, the direction in which the object emits sound and the position and orientation of the sound-reflecting surface located around the object. The extraction means then shifts the time-series data of the selected collected sound so as to synchronize either the direct sound from the object or the reflected sound reflected by the reflecting surface, based on the positions of the microphones that collected the selected sounds, the position of the object, the direction in which the object emits the sound, and the position and orientation of the reflecting surface.
[0037]
For example, when the reflecting surface is located near the object and in the direction in which the object emits sound, and the microphone is located in the traveling direction of the reflected sound from the reflecting surface, the delay operation is performed according to the propagation time of the sound reflected by the reflecting surface. In this way, more appropriate sound extraction can be performed in accordance with the arrangement of reflecting surfaces and the like around the object.
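For a planar reflecting surface, the propagation time along the reflected path can be estimated with a mirror-image construction, as sketched below. The mirror-image method itself is an assumption of this sketch; the patent only states that the delay operation follows the reflected path's propagation time.

```python
import numpy as np

def reflected_path_delay(source, mic, wall_point, wall_normal, speed=340.0):
    """Delay of the path source -> reflecting surface -> microphone for a
    planar reflector, via the mirror image of the source (illustrative)."""
    source = np.asarray(source, float)
    mic = np.asarray(mic, float)
    p = np.asarray(wall_point, float)
    n = np.asarray(wall_normal, float)
    n = n / np.linalg.norm(n)

    # Mirror the source across the reflecting plane; the reflected path length
    # equals the straight-line distance from the mirrored source to the mic.
    mirrored = source - 2.0 * np.dot(source - p, n) * n
    return np.linalg.norm(mic - mirrored) / speed
```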
[0038]
In order to achieve the first object, the invention according to claim 8 is characterized in that, in the invention according to claim 1, the extraction means excludes from the selection targets, among the time-series data of the sounds collected by the respective microphones, the time-series data of the sounds collected by the microphones located at a predetermined distance or more from the position of the object.
[0039]
Generally, sound is attenuated according to its propagation distance. Therefore, when the sound emitted by the object reaches a microphone over a long propagation distance, the sound collected by that microphone contains only a small component of the sound emitted by the object, and its contribution to forming the time-series data of the object's sound is small.
[0040]
Therefore, according to the eighth aspect of the present invention, the extraction means excludes from the selection the time-series data of the sounds collected by the microphones distant from the position of the object, that is, the microphones located at or beyond a predetermined distance determined in advance by experiment.
Thereby, the processing load relating to the extraction of the sound (the shift and averaging processing by the extraction means) can be reduced without lowering the accuracy of the extraction.
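A minimal sketch of the microphone selection of claim 8: only the indices of microphones within an experimentally determined distance of the object are kept for the subsequent shift-and-average processing. The 3 m default threshold is an assumed value, not one from the patent.

```python
import numpy as np

def select_microphones(mic_positions, object_position, max_distance_m=3.0):
    """Return the indices of microphones close enough to the object to be used
    in the shift-and-average processing (illustrative sketch)."""
    mics = np.asarray(mic_positions, dtype=float)
    obj = np.asarray(object_position, dtype=float)
    distances = np.linalg.norm(mics - obj, axis=1)
    return np.flatnonzero(distances < max_distance_m)
```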
[0041]
For the same purpose, among the plurality of microphones, the collected sound of a microphone whose collected-sound volume is small, that is, smaller than a predetermined volume level obtained in advance by experiment, may also be excluded from the selection targets.
[0042]
Also, in order to achieve the first object, the invention according to claim 9 is characterized in that the invention according to claim 1 further includes output means for outputting the sound emitted by the object and extracted by the extraction means to a predetermined speech recognition device.
[0043]
By the way, in the invention according to claim 1, the signal-to-noise ratio can be improved by shifting and averaging the time-series data of the sounds collected by many microphones placed around the object as described above, so that the sound of the object is extracted.
It is thus possible to obtain a sound with a higher signal-to-noise ratio than a sound collected by an ordinary microphone alone.
Such good-quality sound can be used as an input to a speech recognition device.
[0044]
Therefore, in the invention according to the ninth aspect, the output means outputs the sound of the object extracted by the extraction means to a predetermined speech recognition device.
Thus, voices uttered by one or more persons in the area where sound can be extracted by the sound extraction device can be input to the voice recognition device. In particular, the present invention can be applied to the case where an elderly person or a person with a physical disability uses voice, via the voice recognition device, to control the on/off and the like of home appliances.
[0045]
Further, in order to achieve the second object, the invention according to claim 10 provides a sound extraction device comprising: photographing means including a wide-angle fixed-focus lens disposed at a predetermined position, for photographing an area including an object serving as a sound source; image recognition means for recognizing the position of the object from the image information of the area photographed by the photographing means; a plurality of microphones arranged at predetermined positions for collecting the sound emitted by the object; and extraction means which selects the time-series data of a plurality of collected sounds from among the time-series data of the sounds collected by the respective microphones, shifts the selected time-series data, based on the position of the object recognized by the image recognition means and the positions of the microphones that collected the selected sounds, so that the sound emitted by the object is synchronized, and extracts the sound emitted by the object by averaging the shifted time-series data.
[0046]
In the invention according to the tenth aspect, the photographing means includes a wide-angle fixed-focus lens disposed at a predetermined position.
This makes it possible to photograph the object, whether it is moving or stationary, without changing the orientation of the photographing means (for example, a television camera) to follow its movement. In addition, objects such as things, people and animals are generally fixed in height from the floor or the ground, and the wide-angle fixed-focus lens has a large depth of focus, so that the object can be photographed without performing focus adjustment.
[0047]
As described above, the position of the object can be recognized promptly, without changing the orientation of the photographing means to follow the movement of the object and without adjusting the focus. Moreover, since a mechanical operation mechanism for changing the orientation of the photographing means and adjusting the focus is unnecessary, the structure of the photographing means and of the sound extraction apparatus can be simplified, and durability can be improved by reducing the number of mechanically operating parts.
[0048]
In addition, the wide-angle fixed-focus lens can be arranged, for example, not only on a flat portion such as the ceiling of a room, but also at a corner formed by two surfaces such as the ceiling and a wall, or at a corner formed by a total of three surfaces such as the ceiling and two walls.
[0049]
In the invention according to claim 10, the position of the object is recognized by the image recognition means from the image information of the area photographed by the photographing means in the same manner as in the invention according to claim 1, and the extraction means can distinguish the sound emitted by the object from the surrounding noise and extract it.
[0050]
In order to achieve the second object, the invention according to claim 11 is characterized in that, in the invention according to claim 10, a plurality of the photographing means are provided, each photographing means further including an area sensor disposed at the imaging point formed by the wide-angle fixed-focus lens, and the image recognition means includes shape recognition means for recognizing the shape of the object by processing the mutually different pieces of imaging information photographed by the plurality of photographing means, and three-dimensional coordinate computing means for computing the three-dimensional coordinates of the object recognized by the shape recognition means.
[0051]
According to the eleventh aspect of the present invention, a plurality of photographing means are
provided, and each photographing means further includes an area sensor disposed at an imaging
point by the wide-angle fixed focus lens.
That is, the image of the object taken through the wide-angle fixed focus lens is imaged on the
area sensor by the imaging means.
In this manner, the regions including the object are photographed from different positions by the
plurality of photographing means.
[0052]
The shape recognition means processes the different pieces of imaging information photographed by the plurality of photographing means and recognizes the shape of the object.
The shape of the object may be recognized, for example, as in the invention according to claim 12 described later, by finding the region formed by the minute spaces occupied by the object among the large number of cube-like minute spaces obtained by virtually subdividing the three-dimensional space along each of the X-axis, Y-axis and Z-axis directions. Alternatively, for example, the different pieces of imaging information from the plurality of photographing means may be converted into planar image information, image information of at least the front, back, left side, right side and plan (top) views of the object may be determined from the converted image information, and the determined image information may be synthesized to recognize the shape.
[0053]
The three-dimensional coordinates of the object recognized by the shape recognition means, or of a predetermined part of the object, are calculated by the three-dimensional coordinate computing means. In general, an object is not a point but a collection of points, so the three-dimensional coordinates of all the points included in the object may be calculated, or the three-dimensional coordinates of all the points belonging to the boundary between the object and the surrounding three-dimensional space may be calculated. Further, a specific position may be set (stored) in advance and its three-dimensional coordinates calculated to obtain, for example, the height, the length, and the like.
[0054]
Thus, the three-dimensional coordinates of the object can be quickly obtained by the shape
recognition means and the three-dimensional coordinate calculation means constituting the
image recognition means, and the position of the object can be recognized rapidly.
[0055]
In order to achieve the second object, the invention according to claim 12 is characterized in that, in the invention according to claim 11, the shape recognition means recognizes the shape of the object by finding, based on the different pieces of image information photographed by the plurality of photographing means, the region formed by the minute spaces occupied by the object among the large number of cubic minute spaces obtained by virtually subdividing the three-dimensional space along each of the X-axis, Y-axis and Z-axis directions.
[0056]
According to the invention set forth in claim 12, the shape recognition means recognizes the shape of the object by finding the region formed by the minute spaces occupied by the object among the large number of cubic minute spaces obtained by virtually subdividing the three-dimensional space along each of the X-axis, Y-axis and Z-axis directions.
Here, the minute spaces can be subdivided down to the limit of the resolution of the area sensor. Therefore, the shape of the object can be recognized in detail.
[0057]
In order to achieve the second object, the invention according to claim 13 is characterized in that, in the invention according to claim 11, the shape recognition means recognizes the shape of the object by extracting, based on the different pieces of image information photographed by the plurality of photographing means and for each photographing means, the minute spaces included within the viewing angle under which the object is projected from that photographing means, among the large number of cubic minute spaces obtained by virtually subdividing the three-dimensional space along each of the X-axis, Y-axis and Z-axis directions, and by determining the region formed by the minute spaces included in all of the extracted sets.
[0058]
In the image information photographed by the plurality of photographing means according to claim 11 described above, shadow (dead-spot) areas occur, as illustrated in the drawings.
Therefore, as set forth in claim 13, the shape recognition means extracts, based on the different pieces of image information photographed by the plurality of photographing means and for each photographing means, the minute spaces included within the viewing angle under which the object is projected from that photographing means, among the large number of cubic minute spaces obtained by virtually subdividing the three-dimensional space along the X-axis, Y-axis and Z-axis directions, and recognizes the shape of the object by finding the region formed by the minute spaces included in all of the extracted sets.
[0059]
By recognizing the shape of the object in this way, it is possible to eliminate the area of the
shadow (dead spot) and to accurately recognize the shape of the object.
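The intersection of viewing cones described in claims 12 and 13 is essentially a visual-hull (voxel-carving) computation. The sketch below assumes helper objects (a per-camera projection function and a silhouette mask) that are not part of the patent, and only illustrates the idea of keeping the minute spaces seen as "object" by every photographing means.

```python
import numpy as np

def carve_visual_hull(voxel_centers, cameras):
    """Keep only the small cubic spaces whose centres fall inside the object's
    silhouette in every camera.  `cameras` is a list of (project, silhouette)
    pairs, where `project` maps a 3-D point to pixel coordinates and
    `silhouette` is a boolean image; both are assumed helpers."""
    voxels = np.asarray(voxel_centers, dtype=float)
    keep = np.ones(len(voxels), dtype=bool)
    for project, silhouette in cameras:
        for i, v in enumerate(voxels):
            if not keep[i]:
                continue
            u, w = project(v)                      # pixel coordinates (col, row)
            inside = (0 <= u < silhouette.shape[1] and
                      0 <= w < silhouette.shape[0] and
                      silhouette[int(w), int(u)])
            keep[i] = bool(inside)
    # The region formed by the surviving voxels approximates the object's shape.
    return voxels[keep]
```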
[0060]
In order to achieve the second object, the invention according to claim 14 is characterized in that, in the invention according to claim 10, a plurality of the photographing means are provided, each photographing means further including an area sensor disposed at the imaging point formed by the wide-angle fixed-focus lens, and the image recognition means acquires the two-dimensional coordinates of the object image formed on the area sensor of each photographing means and recognizes the position of the object based on the plurality of acquired two-dimensional coordinates.
[0061]
In the invention according to the fourteenth aspect, since three-dimensional coordinates can be calculated back from the two-dimensional coordinates of the corresponding points on the area sensors of the plurality of photographing means, the two-dimensional coordinates of the image formed on each area sensor can be acquired, and the position (three-dimensional coordinates) of the object can be recognized accurately based on the plurality of acquired two-dimensional coordinates.
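Recovering the 3-D position from two 2-D area-sensor coordinates amounts to intersecting the back-projected rays of the two photographing means. The sketch below finds the midpoint of the shortest segment between two such rays; converting pixel coordinates into ray directions, and all names, are assumptions of this sketch rather than details from the patent.

```python
import numpy as np

def triangulate(cam1_origin, ray1_dir, cam2_origin, ray2_dir):
    """Midpoint of the shortest segment between two back-projected rays,
    used here as the object's 3-D position (assumes non-parallel rays)."""
    o1, d1 = np.asarray(cam1_origin, float), np.asarray(ray1_dir, float)
    o2, d2 = np.asarray(cam2_origin, float), np.asarray(ray2_dir, float)
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)

    # Solve for the parameters t1, t2 of the closest points on the two rays.
    b = o2 - o1
    a11, a12, a22 = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a11 * a22 - a12 * a12
    t1 = (a22 * (d1 @ b) - a12 * (d2 @ b)) / denom
    t2 = (a12 * (d1 @ b) - a11 * (d2 @ b)) / denom
    return (o1 + t1 * d1 + o2 + t2 * d2) / 2.0
```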
In order to achieve the second object, the invention according to claim 15 provides a sound extraction device comprising: photographing means including a wide-angle fixed-focus lens disposed at a predetermined position and an area sensor disposed at the imaging point formed by the lens, for photographing an area including an object serving as a sound source; reflecting means disposed in the vicinity of the photographing means, for reflecting an image of the object so that it is formed on the area sensor; image recognition means which acquires the two-dimensional coordinates, on the area sensor, of the object image formed on the area sensor via the reflecting means and of the object image formed on the area sensor without being reflected by the reflecting means, and recognizes the position of the object by calculating the three-dimensional coordinates of the object based on the plurality of acquired two-dimensional coordinates; a plurality of microphones arranged at different positions for collecting the sound emitted by the object; and extraction means which selects the time-series data of a plurality of collected sounds from among the time-series data of the sounds collected by the respective microphones, shifts the selected time-series data, based on the position of the object recognized by the image recognition means and the positions of the microphones that collected the selected sounds, so that the sound emitted by the object is synchronized, and extracts the sound emitted by the object by averaging the shifted time-series data.
[0062]
In the invention according to the fifteenth aspect, the photographing means comprises a wide-angle fixed-focus lens disposed at a predetermined position and an area sensor disposed at the imaging point formed by the lens.
In addition, an area including the object is photographed by the photographing means.
[0063]
The reflecting means is disposed in the vicinity of the photographing means. For example, as shown in (G) to (L) of FIG. 24, the reflecting means may be disposed along a wall, may be L-shaped, or may be curved. The reflecting means reflects the image of the object so that the image is formed on the area sensor.
[0064]
Then, the image recognition means acquires the two-dimensional coordinates, on the area sensor, of the object image reflected by the reflecting means and formed on the area sensor, and of the object image formed on the area sensor without being reflected by the reflecting means.
In this way, even with only one photographing means, a plurality of object images are formed on the area sensor of that single photographing means, and a plurality of two-dimensional coordinates are acquired. Therefore, as in the invention according to claim 14 described above, the image recognition means can accurately recognize the position (three-dimensional coordinates) of the object based on the plurality of acquired two-dimensional coordinates.
[0065]
As described above, since another object image can be formed on the area sensor by the reflecting means, the three-dimensional coordinates of the object can be calculated and its position recognized accurately even with only one photographing means.
[0066]
In the invention according to claim 15, the position of the object is recognized by the image recognition means as described above, and the extraction means can distinguish the sound emitted by the object from the surrounding noise and extract it, in the same manner as in the invention according to claim 1.
[0067]
Also, in order to achieve the first and second objects, the invention according to claim 16 is characterized in that, in the invention according to any one of claims 10 to 15, the image recognition means also recognizes, from the image information of the area including the object, the direction in which the object emits a sound, and, based on the position of the object and the direction in which it emits sound, re-recognizes as the position of the object a position at which the sound emitted by the object can be favorably extracted.
[0068]
In the invention according to the sixteenth aspect, as in the invention according to the second aspect described above, the extraction of sound is performed based on the re-recognized position of the object, that is, the position at which the sound emitted by the object can be extracted well. Therefore, particularly when the directivity of the sound emitted by the object is strong, or when the part (surface) of the object that emits the sound is large, the sound can be extracted with higher accuracy.
[0069]
In order to achieve the first and second objects, the invention according to claim 17 is characterized in that, in the invention according to any one of claims 10 to 15, when there are a plurality of objects, the photographing means photographs an area including the plurality of objects, the image recognition means recognizes the position of each of the plurality of objects from the image information of the photographed area, and the extraction means extracts the sound from each of the plurality of objects.
[0070]
According to the seventeenth aspect of the present invention, when there are a plurality of objects, the photographing means photographs an area including the plurality of objects, and, as in the fourth aspect of the invention described above, the image recognition means recognizes the position of each of the plurality of objects from the image information of the photographed area.
Then, based on the positions recognized by the image recognition means, the extraction means extracts the sound from each of the plurality of objects in the same manner as in the invention according to claim 1.
Thereby, the sound from each of a plurality of objects can also be extracted.
[0071]
Also, in order to achieve the first and second objects, the invention according to claim 18 is characterized in that the invention according to any one of claims 10 to 15 further includes detection means for detecting an acoustic environment state that is a factor affecting at least one of the speed of sound and the sound propagation path in the area including the object and the plurality of microphones, and the extraction means corrects the shift of the time-series data of the collected sound based on the changed acoustic environment state when the acoustic environment state detected by the detection means changes.
[0072]
According to the eighteenth aspect of the present invention, when the acoustic environment state detected by the detection means changes, the extraction means corrects the shift of the time-series data of the collected sound based on the changed acoustic environment state, in the same manner as in the invention according to the fifth aspect described above.
As a result, the sound can be extracted with high accuracy in accordance with changes in the acoustic environment state.
[0073]
Also, in order to achieve the first and second objects, the invention according to claim 19 is characterized in that, in the invention according to any one of claims 10 to 15, the extraction means weights and averages the time-series data of the collected sound in the high range based on information relating to the directivity of the high range.
[0074]
In the invention according to claim 19, as in the invention according to claim 6 described above, the extraction means performs weighted averaging based on the information relating to the directivity of the high range, so as to correct the imbalance between the high-range sound collected by the microphone positioned in the direction in which the object emits sound and the high-range sound collected by the other microphones.
This prevents the high range from becoming relatively weaker than the low range.
[0075]
In order to achieve the first and second objects, the invention according to claim 20 is characterized in that, in the invention according to any one of claims 10 to 15, the image recognition means further recognizes, from the image information of the area including the object, the direction in which the object emits a sound and the position and orientation of the sound-reflecting surface located around the object, and the extraction means shifts the time-series data of the selected collected sound so as to synchronize either the direct sound from the object or the reflected sound reflected by the reflecting surface, based on the positions of the microphones that collected the selected sounds, the position of the object, the direction in which the object emits sound, and the position and orientation of the reflecting surface.
[0076]
According to the twentieth aspect of the present invention, as in the seventh aspect of the invention described above, when, for example, the reflecting surface is located close to the object and in the direction in which the object emits sound, and the microphone is positioned in the traveling direction of the reflected sound from the reflecting surface, the delay operation is executed according to the propagation time of the sound reflected by the reflecting surface.
In this way, more appropriate sound extraction can be performed in accordance with the arrangement of reflecting surfaces and the like around the object.
[0077]
Further, in order to achieve the first and second objects, the invention according to claim 21 is characterized in that, in the invention according to any one of claims 10 to 15, the extraction means excludes from the selection targets, among the time-series data of the sounds collected by the respective microphones, the time-series data of the sounds collected by the microphones located at a predetermined distance or more from the position of the object.
[0078]
According to the twenty-first aspect of the present invention, as in the eighth aspect, the extraction means excludes from the selection the sounds collected by the microphones distant from the position of the object, that is, the microphones located at or beyond a predetermined distance determined in advance by experiment.
Thereby, the processing load relating to the extraction of the sound (the shift and averaging processing by the extraction means) can be reduced without lowering the accuracy of the extraction.
[0079]
In order to achieve the first and second objects, the invention according to claim 22 is characterized in that the invention according to any one of claims 10 to 15 further includes output means for outputting the sound emitted by the object and extracted by the extraction means to a predetermined speech recognition device.
[0080]
According to the twenty-second aspect of the invention, the output means outputs the sound of
the object extracted by the extraction means to a predetermined speech recognition device, as in
the case of the ninth aspect of the invention.
Thus, it is possible to input voices uttered by one or more persons in the area where the sound
can be extracted by the sound extraction device to the voice recognition device.
[0081]
DESCRIPTION OF THE PREFERRED EMBODIMENTS [First Embodiment] The first embodiment of
the present invention will be described below with reference to the drawings.
In the first embodiment, an example in which only the voice of the target person A in the
predetermined room 50 shown in FIG. 2 is extracted is shown.
[0082]
As shown in FIGS. 2 and 3, the sound extraction device 10 according to the first embodiment comprises: a plurality of television cameras 16 disposed at predetermined positions on the ceiling 52 of the room 50; an extraction position calculation processor 14 connected to the television cameras 16, for setting the extraction position of sound based on the image information captured by the television cameras 16; a microphone array unit 18 including a plurality (n) of microphones 22 arranged in a matrix at substantially equal intervals on the ceiling 52 (an 8 × 8 arrangement in FIG. 2); an audio extraction board 12 connected to each of the microphones 22, for extracting the voice of the target person from the sounds collected by the microphones 22; and an output terminal board 20 for outputting the extracted sound.
[0083]
Each microphone 22 includes a sound collecting unit 24, an amplifier filter 26 connected to the sound collecting unit 24 for noise cutting and amplification of the audio signal, and an A/D converter 28 connected to the amplifier filter 26 for converting the analog signal into a digital signal.
The extraction position calculation processor 14 includes a CPU 14A, a ROM 14B, a RAM 14C mainly used as a working storage area, and an input/output controller (hereinafter referred to as I/O) 14D. The CPU 14A, the ROM 14B, the RAM 14C, and the I/O 14D are connected to one another by a bus 14E.
[0084]
The voice extraction board 12 includes: an input buffer memory group 32 comprising n input buffer memories i (i: 1, 2, ..., n) connected one-to-one to the microphones 22 via the digital circuit 30, for temporarily storing the voice data transmitted from the microphones 22; a processor 34 connected to each input buffer memory i, for controlling the entire voice extraction board 12 and the like; an output buffer memory group 44 comprising n output buffer memories i (i: 1, 2, ..., n) connected to the processor 34, for temporarily storing the audio data corresponding to each of the microphones 22 output from the processor 34; an adder 46 connected to each output buffer memory i, for adding the audio data corresponding to each of the microphones 22 output from each output buffer memory i; and a D/A converter 48 for converting the digital signals into analog signals.
The processor 34 includes a CPU 38, a ROM 40, a RAM 42, and an I/O 36, similarly to the extraction position calculation processor 14, and these are connected to one another by a bus 37.
The input buffer memories i, the output buffer memories i, and the extraction position calculation processor 14 described above are connected to the I/O 36. Further, in order to transmit control signals and the like for synchronizing the operation of the components of the sound extraction device 10, the processor 34 is connected by a control signal line 43 to each component, that is, to each microphone 22, the input buffer memory group 32, the output buffer memory group 44, the adder 46, and the D/A converter 48. The ROM 40 stores in advance a control program for the voice extraction process described later, position information on the arrangement position of each of the microphones 22, a delay table described later, and the like.
[0085]
Further, the output terminal board 20 is provided with an audio output terminal 21, and the
audio output terminal 21 is connected to the D / A converter 48 of the audio extraction board
12.
[0086]
In the ROM 14B built in the extraction position calculation processor 14, position information
indicating the arrangement position of the television camera 16 and a control program of
extraction position calculation processing described later are stored in advance.
[0087]
Next, the operation of the first embodiment will be described.
When the start button (not shown) of the sound extraction device 10 is turned on by the operator, the control routine of the extraction position calculation process shown in FIG. 4 is executed by the CPU 14A of the extraction position calculation processor 14, and the control routine of the voice extraction process shown in FIG. 5 is executed by the CPU 38 of the voice extraction board 12.
Each of these control routines is repeatedly executed at predetermined time intervals.
[0088]
First, the control routine of the extraction position calculation process shown in FIG. 4 will be described. In step 102, the imaging information from each television camera 16 is fetched. In the next step 104, the position of the head P of the target person A (see FIG. 2) is calculated from the fetched imaging information. As this position, information indicating in which of the many rectangular-parallelepiped regions, obtained by virtually dividing the room 50 equally along the arrow X, Y and Z directions, the target person A is located can be used, as shown in FIG. 2 as an example. FIG. 2 shows, as an example, the case where the room 50 is equally divided into 16 in each direction. That is, in step 104, a region having feature amounts specific to the human head, such as a substantially spherical shape largely covered with hair and containing many black portions, is extracted from the photographed image as the region corresponding to the head P, and the position of the head P in the virtual three-dimensional coordinates described above is calculated based on the position of the extracted region in the photographed image.
[0089]
Further, in step 104, the orientation of the head P of the target person A is also estimated. That is, first, the trunk S located under the head P shown in FIG. 2 is recognized, and, based on the general feature that in the trunk S the chest width L2 is smaller than the shoulder width L1, it is estimated from the sizes of the chest width L2 and the shoulder width L1 that the target person A is facing either the arrow V direction or the opposite direction. Next, based on the general feature that the ratio of hair on the surface of the head P is higher on the side where the face is not located than on the side where the face is located, and because the degree of blackness is higher toward the back side of the drawing in FIG. 2, it is estimated that the target person A is facing in the arrow V direction.
[0090]
In the next step 106, a position separated by a predetermined distance (for example, about 30
centimeters) in the arrow V direction from the position of the head P obtained in step 104 is set
as the extraction position for the target person A. Then, at the next step 108, position
information of the set extraction position is transmitted to the voice extraction board 12.
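The extraction-position setting of step 106 is a simple offset of the recognized head position along the facing direction. A minimal sketch follows (the roughly 30 cm offset follows the text; the names are illustrative):

```python
import numpy as np

def extraction_position(head_position, facing_direction, offset_m=0.3):
    """Place the extraction position a fixed distance from the head P along
    the direction V in which the target person is judged to be facing."""
    head = np.asarray(head_position, dtype=float)
    v = np.asarray(facing_direction, dtype=float)
    return head + offset_m * v / np.linalg.norm(v)
```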
[0091]
Next, the control routine of the voice extraction process executed by the CPU 38 of the processor 34 provided in the voice extraction board 12, shown in FIG. 5, will be described. In step 200, it is determined whether the information on the extraction position transmitted from the extraction position calculation processor 14 in step 108 described above has been received. If the information on the extraction position has not been received, the control routine is ended; if it has been received, the process proceeds to step 202. In step 202, based on the installation position information of each of the microphones 22 read out from the ROM 40 and the received extraction position information, the microphones 22 installed at positions separated from the extraction position by a predetermined distance or more are excluded, and the microphones 22 suitable for extracting the sound at the extraction position are selected.
[0092]
On the other hand, the sound emitted from the target person A is first captured by the sound collecting unit 24 of each microphone 22, its noise is cut by the amplifier filter 26, and it is amplified at a predetermined amplification factor into an audio signal. These audio signals are then converted into digitized audio data by the A/D converter 28.
[0093]
Then, in step 203 of the voice extraction process, the audio data collected and converted as described above is fetched from each of the microphones 22 selected in step 202 through the digital circuit 30, and the audio data is written to the input buffer memory i corresponding to each microphone 22. That is, audio data corresponding to an audio signal such as that shown in FIG. 1A is written to the input buffer memory i. At this time, the data is written sequentially from a predetermined reference address of the input buffer memory i. When the voice extraction processing routine is executed the next time, a new reference address shifted from the previous reference address by a predetermined number of addresses is set, and the data is written sequentially from the new reference address. When writing to the input buffer memory i has been completed three times, the reference address is returned to the leading address of the input buffer memory i on the fourth time, and the voice data is again written sequentially from the leading address. In this way, the input buffer memory i is used as a so-called ring buffer.
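A minimal sketch of this ring-buffer usage of the input buffer memory i: the write position (reference address) advances by one block per cycle and wraps back to the start of the buffer. The capacity and block size are illustrative, not values from the patent.

```python
class RingBuffer:
    """Sketch of how each input buffer memory i is used in step 203:
    successive blocks of audio data are written at a reference address that
    advances each cycle and wraps around to the start of the buffer."""

    def __init__(self, capacity, block_size):
        self.data = [0] * capacity
        self.block_size = block_size
        self.write_pos = 0          # the "reference address" for the next block

    def write_block(self, samples):
        assert len(samples) == self.block_size
        for k, s in enumerate(samples):
            self.data[(self.write_pos + k) % len(self.data)] = s
        # Advance the reference address; wrap around when the end is reached.
        self.write_pos = (self.write_pos + self.block_size) % len(self.data)
```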
[0094]
In the next step 212, the delay time corresponding to the distance between one of the selected microphones 22 and the extraction position is fetched from the delay table stored in advance in the ROM 40. The delay table records, for each candidate extraction position that can occur within the room 50, the propagation time (delay time) obtained by dividing the distance between that extraction position and each microphone 22 by the speed of sound at standard room temperature; one such table is prepared in advance for each candidate extraction position within the room 50.
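The following minimal Python sketch shows how such a delay table could be built in advance (the data layout, the candidate-position grid and the value used for the speed of sound are assumptions; the patent only states that the distance is divided by the speed of sound at standard room temperature).

import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at standard room temperature (about 20 deg C)

def build_delay_table(candidate_positions, mic_positions):
    # For every candidate extraction position, record the propagation time (delay)
    # to every microphone: distance divided by the speed of sound.
    table = {}
    for p_idx, (px, py, pz) in enumerate(candidate_positions):
        delays = []
        for (mx, my, mz) in mic_positions:
            dist = math.sqrt((px - mx) ** 2 + (py - my) ** 2 + (pz - mz) ** 2)
            delays.append(dist / SPEED_OF_SOUND)  # delay time in seconds
        table[p_idx] = delays
    return table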
[0095]
In the next step 214, the audio data from the one microphone 22 is read out of the input buffer memory i, starting from an address shifted from the predetermined reference address (i.e., the write start address of the input buffer memory i) by the number of memory addresses corresponding to the delay time. Thereby, the audio data written to the input buffer memory i before the sound emitted by the target person A reached the one microphone 22 is cut off, and only the sound that was emitted by the target person A and reached the one microphone 22 is taken out.
[0096]
Then, in the next step 216, the extracted audio data is written to the output buffer memory i
corresponding to the one microphone 22. That is, audio data corresponding to the audio signal
as shown in FIG. 1B is written to the output buffer memory i. The output buffer memory i is also
used as a so-called ring buffer as in the above-mentioned input buffer memory i.
[0097]
The above steps 212, 214 and 216 are then performed for all of the selected microphones. When the processes of steps 212, 214 and 216 have been executed for all of the selected microphones, an affirmative determination is made at step 218, the process proceeds to step 220, and the adder 46 is caused to add the audio data corresponding to each of the selected microphones.
[0098]
In the next step 222, the added audio data is output to the D/A converter 48 with the decimal point position shifted upward by INT(log2 M) digits. As a result, substantially the same result is obtained as when the added audio data is divided by the number M of microphones. Alternatively, the processor 34 may take in the calculation result of the adder 46 and perform an ordinary division.
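A small sketch of this shift-based averaging, assuming integer audio samples (the function name is hypothetical), is shown below; the shifted result differs from exact division by M only by a constant gain.

import math

def average_by_shift(summed_sample: int, num_mics: int) -> int:
    # Shift the binary point upward by INT(log2 M) digits, i.e. divide the summed
    # data by 2**INT(log2 M); for M = 7 this divides by 4 rather than 7.
    shift = int(math.log2(num_mics))
    return summed_sample >> shift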
[0099]
Thereafter, the audio data output from the adder 46 is converted by the D/A converter 48 into an analog audio signal as shown in FIG. 1C, and the converted audio signal is sent to the audio output terminal 21 of the output terminal board 20. By connecting an audio reproduction device or the like to the audio output terminal 21, the extracted voice of the target person A can be reproduced and heard.
[0100]
As is apparent from the above description, by performing the above-described delay operation and averaging on the sounds collected by the plurality of (seven in the example of FIG. 1) microphones 22, the noise components other than the voice of the target person A become so small in amplitude that substantially only the voice of the target person A is extracted.
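The overall processing thus amounts to delay-and-sum beamforming; a minimal Python sketch is given below (sampling rate, data layout and function names are assumptions), in which each microphone signal is read with its own delay so that the sound from the extraction position lines up before the samples are averaged.

def delay_and_sum(mic_signals, delays_s, fs):
    # mic_signals: list of per-microphone sample sequences, delays_s: per-microphone
    # propagation delays in seconds, fs: sampling frequency in Hz.
    delays_smp = [round(d * fs) for d in delays_s]
    length = min(len(sig) - d for sig, d in zip(mic_signals, delays_smp))
    out = []
    for n in range(length):
        acc = sum(sig[n + d] for sig, d in zip(mic_signals, delays_smp))
        out.append(acc / len(mic_signals))  # addition followed by averaging
    return out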
[0101]
Further, the extraction position calculation process (FIG. 4) and the voice extraction process (FIG. 5) are repeatedly performed at predetermined time intervals.
Thereby, when the target person A moves, the inside of the room 50 is continuously photographed by the plurality of television cameras 16, the position and orientation of the head P, which change as the target person A moves, are determined based on the image information, and the extraction position is set according to the position and orientation of the head P at that time. The voice extraction board 12 then performs the voice extraction process according to that extraction position, so that the voice can be extracted even while the target person A is moving.
[0102]
In the voice extraction process of the first embodiment, an example was given in which the microphones close to the set extraction position (for example, seven microphones) are selected, and only the audio data from the selected microphones is taken in and written to the input buffer memories. However, it is also possible to first take in the audio data from all (n) microphones and write it to the respective input buffer memories, and then read out of the input buffer memories only the audio data of the selected microphones (for example, seven microphones), shifted by the memory addresses corresponding to the delay times.
[0103]
Further, in the voice extraction process of the present invention, the sound of the target person (or target object) is collected by a large number of microphones arranged near the extraction position, and by performing the delay operation and averaging described above on the collected audio signals, sound extraction with an improved signal-to-noise ratio can be performed.
It is thus possible to extract a sound having a higher signal-to-noise ratio than the sound collected by a single ordinary microphone. Such good-quality sound can be used as an input to a speech recognition device. That is, the voice of the person (one or more persons) who is in the area where sound can be extracted by the sound extraction device can be input to the voice
recognition device.
[0104]
Second Embodiment Next, a second embodiment of the present invention will be described. In
the second embodiment, an example is shown in which the voice of the target person A and the
voice of the target person B in the predetermined room 50 shown in FIG. 2 are separately
extracted. The same parts as those in the first embodiment are denoted by the same reference
numerals, and the description will be omitted.
[0105]
As shown in FIG. 6, the sound extraction device 10 according to the second embodiment is provided with a plurality (N) of the voice extraction boards 12 described in the first embodiment, and an audio data relay board 56 for connecting each microphone 22 to each voice extraction board 12 is further installed. The extraction position calculation processor 14 is connected to the processor 34 provided in each voice extraction board 12. Furthermore, the output terminal board 20 is provided with an audio output terminal 21 corresponding to each voice extraction board 12, and each audio output terminal 21 is connected to the D/A converter 48 of the corresponding voice extraction board 12.
[0106]
Next, the operation of the second embodiment will be described. When a start button (not shown) of the sound extraction device 10 is turned on by the operator, the control routine of the extraction position calculation process for a plurality of extraction positions shown in FIG. 7 is executed by the CPU 14A of the extraction position arithmetic processor 14, and the same control routine of the voice extraction process as in the first embodiment (FIG. 5) is executed by the CPU 38 of each of the two voice extraction boards 12.
[0107]
The control routine of the extraction position calculation process shown in FIG. 7 will be described. In the following description, the target persons A and B are referred to as target persons 1 and 2 for convenience. In step 102, the image information captured by each television camera 16 is taken in, and in the next step 103, "2" is assigned to the variable K as the number of target persons and the variable L is initialized to "1".
[0108]
In the next step 105, the position and orientation of the head of the target person L (that is, target person 1) are calculated in the same manner as in the first embodiment, and in the next step 107, an extraction position L (i.e., extraction position 1) for extracting the voice of the target person L is set. Then, in the next step 109, the information on the extraction position L is transmitted to the corresponding voice extraction board L.
[0109]
In the next step 110, it is determined whether the processing of steps 105, 107 and 109 has been completed for all target persons, by determining whether the variable L is equal to the variable K indicating the number of target persons. The first time, the determination is negative, and the routine proceeds to step 112, where the variable L is incremented by one. As a result, the value of the variable L becomes "2".
[0110]
Thereafter, the process returns to step 105, and the processing of steps 105, 107, and 109 is
performed on the target person L (ie, target person 2). When those processes are completed, step
110 is affirmed because the variables L and K are equal, and the control routine is ended.
[0111]
The voice extraction boards 12 respectively corresponding to the target persons 1 and 2 each receive the information on the extraction position 1 or the extraction position 2 transmitted from the extraction position arithmetic processor 14 in step 109, and execute, based on the received information, the same voice extraction process of FIG. 5 as in the first embodiment. Although the detailed description is omitted, the voices of the target persons 1 and 2 can be extracted independently by the voice extraction processes in the voice extraction boards 12 respectively corresponding to the target persons 1 and 2.
[0112]
In the second embodiment, an example is shown in which a plurality of voice extraction boards 12 are provided and each voice extraction board 12 extracts the sound from one extraction position. However, in cases where high immediacy of voice extraction is not required, the voice extraction process may be performed sequentially on each of a plurality of extraction positions by a single voice extraction board 12.
[0113]
Third Embodiment Next, a third embodiment of the present invention will be described.
The third embodiment shows an example in which only the voice of the target person A in the room 50 shown in FIG. 2 is extracted while taking into account the influence of temperature changes in the room 50. The same parts as those in the first embodiment are denoted by the same reference numerals, and their description is omitted.
[0114]
As shown in FIG. 8, the sound extraction device 10 according to the third embodiment includes a plurality of temperature sensors 58, one installed at each of a plurality of temperature measurement points in the room 50. Each temperature sensor 58 is connected to the I/O 36 of the processor 34. Further, temperature distribution information for estimating the temperature distribution in the room 50 from the temperatures measured by the temperature sensors 58 at the plurality of temperature measurement points is stored in advance in the ROM 40 of the processor 34.
[0115]
Next, the operation of the third embodiment will be described. When the start button (not shown) of the sound extraction device 10 is turned on by the operator, the same control routine of the extraction position calculation process as in the first embodiment (FIG. 4) is executed by the CPU 14A of the extraction position calculation processor 14, and the control routine of the voice extraction process shown in FIG. 9 is executed by the CPU 38 of the voice extraction board 12. In the following, the description of the extraction position calculation process is omitted, and the voice extraction process according to the third embodiment will be described with reference to FIG. 9.
[0116]
At step 203, for each of the selected microphones 22, the audio data from the microphone 22 is taken in and written to the corresponding input buffer memory i, and at the next step 204, the distance between one of the selected microphones 22 and the extraction position is calculated.
[0117]
In the next step 205, the temperature at each predetermined temperature measurement point of the room 50 is taken in from the plurality of temperature sensors 58, and in the next step 206, the temperature distribution in the room 50 is estimated from the measured temperatures by referring to the temperature distribution information stored in the ROM 40, and the average temperature along the propagation path of the sound from the extraction position to the microphone 22 is calculated.
[0118]
In the next step 207, the speed of sound on the sound propagation path is calculated based on the average temperature on that path, and in the next step 208, the distance between the microphone 22 and the extraction position calculated in step 204 is divided by the calculated sound speed to obtain the propagation time of the sound to the microphone 22, that is, the delay time for the microphone 22.
Then, in the next step 209, the calculated delay time is stored, in association with the identification number of the microphone 22, in the delay table secured in the RAM 42.
The delay table in the third embodiment is used as a temporary storage area that temporarily holds the calculated delay time for each microphone 22.
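As an illustration of steps 207 and 208, a minimal Python sketch follows; the linear temperature formula is a common textbook approximation and is an assumption here, as are the function names, since the patent does not state which formula the processor 34 actually uses.

def speed_of_sound(temp_c):
    # Common linear approximation for the speed of sound in air.
    return 331.5 + 0.6 * temp_c  # m/s

def delay_for_mic(distance_m, avg_path_temp_c):
    # Steps 207-208: divide the microphone-to-extraction-position distance by the
    # sound speed derived from the average temperature on the propagation path.
    return distance_m / speed_of_sound(avg_path_temp_c)

# Example: a 5 m path at an average of 28 deg C gives a slightly shorter delay
# than the same path at 20 deg C.
print(delay_for_mic(5.0, 28.0), delay_for_mic(5.0, 20.0))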
[0119]
The above steps 204 to 209 are executed for each of the selected microphones 22. When the execution for all the selected microphones 22 is completed, a delay table in which the delay time for each of the selected microphones 22 is recorded is complete. After that, as in the first embodiment, in step 214 the audio data from each microphone 22 is read out of the input buffer memory i at an address shifted by the number of memory addresses corresponding to the delay time obtained from the delay table for that microphone 22. In the next step 216, the extracted audio data is written to the output buffer memory i.
[0120]
When the processing of these steps 214 and 216 is completed for all of the selected
microphones 22, an affirmative result is obtained at step 218 and the process proceeds to step
220. At steps 220 and 222, the audio data in each of the selected microphones 22 is averaged
and output to the D / A converter 48. The audio data is converted into an analog audio signal by
the D / A converter 48, and the converted audio signal is output to the audio output terminal 21
of the output terminal board 20.
[0121]
As described above, according to the third embodiment, it is possible to perform sound extraction
with high accuracy according to the change in temperature in the room 50.
[0122]
In addition, the sound extraction device 10 of the present invention can, in the same manner as above, extract a sound while taking into account the bending of the sound propagation path caused by wind (wind direction and wind force).
For example, as shown in FIG. 10, consider the case where a squeaking sound emitted from a specific measurement portion 66A of the iron bridge 66 is extracted when the train 64 traveling in the arrow R direction crosses the iron bridge 66. Since this is an outdoor acoustic environment, the wind as well as the temperature influences the sound propagation. For example, the propagation path of the sound emitted from the measurement portion 66A of the iron bridge 66 and reaching one microphone 22A is not the straight path indicated by the broken line K1 but the curved path indicated by the solid line K2, and the propagation path length L1 (the length of the curved path) is longer than the distance L2 (the length of the straight path) between the measurement portion 66A and the microphone 22A. Therefore, in the sound extraction device 10, the wind force is detected by the wind power meter 60 and the wind direction is detected by the wind direction meter 62. The extraction position arithmetic processor 14 or the processor 34 of the voice extraction board 12 then calculates what kind of (curved) path the sound propagation path takes under the influence of the wind force and wind direction, and how much longer the propagation path length L1 becomes than the distance L2, and the delay time for the microphone 22A is calculated based on the calculated propagation path length L1. The propagation path lengths of the sound are obtained in the same way for the other microphones 22, and their delay times are calculated. Then, based on the calculated delay times, the subsequent delay operation and addition averaging are performed to extract the sound emitted from the measurement portion 66A. In this manner, sound extraction can be performed in consideration of the bending of the sound propagation path due to the influence of wind (wind direction and wind force).
[0123]
Fourth Embodiment Next, the fourth embodiment of the present invention will be described. In
the fourth embodiment, an example is shown in which the voice of the target person C in the
room 50 shown in FIG. 11 is extracted in consideration of the difference in directivity due to the
frequency in the voice. The same parts as those in the first embodiment are denoted by the same
reference numerals, and the description will be omitted.
[0124]
The configuration of the sound extraction device 10 in the fourth embodiment is the same as the
configuration of the sound extraction device 10 in the first embodiment described above, and
thus the description thereof will be omitted. However, in the ROM 40 in the processor 34 of the
voice extraction board 12, a weighting table in which weighting constants to be described later
are recorded is stored in advance.
[0125]
Next, the operation of the fourth embodiment will be described. First, the difference in directivity depending on the frequency of a sound will be described. As shown in FIG. 11, the directivity of sound differs depending on the frequency: the lower the frequency, the weaker the directivity, and the higher the frequency, the stronger the directivity. Therefore, a microphone located in the direction D in which the target person C emits a voice collects sounds over almost the entire frequency range from low to high frequencies, whereas the other microphones collect mainly the low-frequency sounds and collect little of the high-frequency sounds.
[0126]
Therefore, in the fourth embodiment, an example is shown in which the above problem is solved by performing a weighting operation on both the high-range volume of the sound collected by the microphone 22 positioned in the direction D and the high-range volume of the sound collected by the other microphones 22, so as to correct the imbalance between them.
[0127]
Note that the extraction position calculation process is the same as that of the first embodiment, so its description is omitted, and the voice extraction process will be described with reference to FIG. 12.
[0128]
In steps 200, 202 and 203, as in the first embodiment, microphones are selected based on the extraction position information received from the extraction position arithmetic processor 14, and the audio data from the selected microphones is taken in and written to the input buffer memories i.
In the next step 213, the delay time corresponding to the relative position of the extraction position with respect to one microphone 22 is taken from the delay table, and the weighting constant corresponding to the relative position of the extraction position with respect to that microphone 22 and to the direction of the voice is taken from the weighting table.
The weighting constant corresponding to the microphone 22 located in the direction D in which the target person C emits the voice is set to a value relatively smaller than the weighting constants corresponding to the microphones 22 located at positions deviating from the direction D.
[0129]
In the next step 214, as in the first embodiment, the audio data from the microphone 22 is read out of the input buffer memory i shifted by the number of memory addresses corresponding to the delay time, and in the next step 217 the high-frequency components of the data are weighted (amplified or reduced) according to the above weighting constant and written to the output buffer memory i.
[0130]
The above steps 213, 214 and 217 are executed for each of the selected microphones 22.
Thus, the high-frequency components of the sound collected by the microphone 22 positioned in the direction D are reduced in level, while the high-frequency components of the sound collected by the microphones 22 positioned away from the direction D are amplified in level.
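A minimal Python sketch of such a high-range weighting step is shown below; the one-pole low-pass split, the smoothing constant and the function name are assumptions, since the text only states that the high-frequency components are weighted by the constant read from the weighting table.

def weight_high_band(samples, weight, alpha=0.2):
    # Rough sketch of step 217: split the signal into a low band (one-pole low-pass)
    # and a high band (residual), scale only the high band by the weighting constant,
    # and recombine.  The actual filter used on the board is not specified in the text.
    out = []
    low = 0.0
    for s in samples:
        low = low + alpha * (s - low)    # crude low-pass estimate
        high = s - low                   # high-frequency residual
        out.append(low + weight * high)  # weight < 1 attenuates, weight > 1 amplifies
    return out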
[0131]
In the next steps 220 and 222, the audio data in each of the selected microphones 22 is
averaged and output to the D / A converter 48. The audio data is converted into an analog audio
signal by the D / A converter 48, and the converted audio signal is output to the audio output
terminal 21 of the output terminal board 20.
[0132]
According to the fourth embodiment, the level imbalance between the high-frequency components of the sound collected by the microphone 22 positioned in the direction D and the high-frequency components of the sound collected by the microphones 22 positioned away from the direction D can be corrected, and the attenuation of high-frequency sound relative to low-frequency sound caused by the strong directivity of high-frequency sound can be prevented.
[0133]
In the first to fourth embodiments, an example is shown in which only the direct sound emitted
from the target person (or the target object) and directly reaching the microphone is extracted.
Generally, the reflected sound that has been reflected from a wall surface or the like as a
reflecting surface and reaches the microphone is removed together with other noise components
by performing averaging, since the magnitude thereof is much smaller than that of the direct
sound.
[0134]
However, when a wall is close to the target person and located in the direction in which the target person emits sound, the reflected sound from the wall can be larger than the direct sound; in such a case, collecting the reflected sound instead gives a more effective extraction of the sound emitted by the target person.
[0135]
Therefore, when it is recognized, based on the image information captured by the television cameras 16, that a wall surface is close to the target person and lies in the direction in which the target person emits sound, the voice extraction process executed by the CPU 38 of the processor 34 may adopt, as the distance between each microphone and the extraction position used for calculating the delay time, not the straight-line distance between the two but the propagation distance of the reflected sound reflected by the wall surface, calculate the delay time according to that propagation distance, and perform the delay operation according to the delay time thus obtained.
[0136]
As a result, the sound that has directly reached each microphone from the target person is
removed as a noise component, and instead, the reflected sound that has been reflected by the
wall surface and that has reached each microphone is extracted as the sound of the target
person.
As described above, when the reflected sound reaching each microphone is more suitable for
extracting the sound of the target person (target object) than the direct sound, the reflected
sound can be extracted.
[0137]
In addition, the sound extraction device of the present invention is applicable, besides the embodiments described above, as follows. For example, when the voice of a questioner in the audience is to be amplified in a lecture hall, the audience is photographed with a plurality of television cameras, and when a clerk designates the questioner using a mouse or the like, the extraction position calculation processor sets the vicinity of the mouth of the questioner as the extraction position. The sound extraction board then extracts the sound from that extraction position, and the extracted sound is output from a predetermined loudspeaker. As a result, it is not necessary to carry a microphone to the position of each questioner in the audience, which helps the lecture proceed smoothly.
[0138]
Further, for example, when the sound emitted from a moving object having a fixed movement path, such as the train 64 shown in FIG. 10, is to be continuously extracted (traced) over time, a plurality of extraction positions at substantially equal intervals along the movement path (for example, portions 66B, 66C and 66D of the iron bridge) may be set in advance, and the sounds at these extraction positions may be extracted sequentially as time passes. In this case, the process of grasping the movement of the moving object from the images captured by the television cameras 16 in order to set the extraction position becomes unnecessary, and the sound can be traced even when the object moves quickly.
[0139]
Fifth Embodiment Next, a fifth embodiment according to the present invention will be described. In the fifth embodiment, when the sound of an object is extracted, an image including the object is photographed by a plurality of television cameras provided with wide-angle fixed focus lenses, and the position of the object is recognized based on the image data.
[0140]
As shown in FIG. 13, a plurality of (four, as an example) television cameras 16 are installed on the ceiling 52, and each television camera 16 is provided with a fisheye lens 16A as a wide-angle fixed focus lens. The viewing angle of each fisheye lens 16A is preset to 90° or more. Therefore, the object can be photographed without moving the television cameras 16, regardless of whether the object is moving or stationary.
[0141]
There are various types of fisheye lens, such as the equidistant projection type, the stereographic projection type, the equisolid angle projection type and the orthographic projection type, and any of them can be used in this embodiment; an example using an equidistant projection type fisheye lens will be described below. Each television camera 16 also includes a CCD (Charge-Coupled Device) area image sensor 16B (see FIG. 18). In addition, objects such as articles, people and animals generally have a roughly fixed height from the floor or the ground, and the fisheye lens 16A as a wide-angle fixed focus lens has a large depth of focus, so that an object image can be formed clearly on the CCD area image sensor 16B even without a focusing mechanism. In this manner, each of the plurality of television cameras 16 photographs a predetermined area including the object from a different position.
[0142]
Next, the operation of the fifth embodiment will be described. When the operator designates the target person A as the object and turns on the start button (not shown) of the sound extraction device 10, the control routine of the sound extraction process shown in FIG. 5, the same as in the first embodiment, is started by the CPU 38, and the control routine of the extraction position calculation process shown in FIG. 14 is started by the CPU 14A. In the following, the description of the voice extraction process is omitted, and the extraction position calculation process in the fifth embodiment will be described with reference to FIG. 14 and the subsequent figures.
[0143]
At step 120 shown in FIG. 14, object classification processing is performed. In this object classification process, the subroutine shown in FIG. 15 is executed. At step 140 in FIG. 15, the image data A obtained when the object (target person A) is not present in the room 50 is read from the ROM 14B, and at the next step 142, the image data B photographed by each television camera 16 is taken in and stored in the RAM 14C. In the next step 144, the difference between the image data B and the image data A is taken to recognize the target person A present in the room 50 (see FIG. 17).
[0144]
Next, at step 146, a timer for a predetermined time T is set, and at next step 148, the process
waits for the predetermined time T, and when time-out occurs, it proceeds to step 150.
[0145]
In step 150, the image data C captured by each television camera 16 (that is, image data obtained after the predetermined time T has elapsed since the image data B) is taken in.
Then, in the next step 152, the image data B stored in the RAM 14C is read out and compared with the image data C, and in the next step 154, it is determined based on the comparison result whether the target person A has moved.
[0146]
If the target person A has not moved (is stationary), a negative determination is made in step 154, and the process returns to the main routine of FIG. 14. On the other hand, when the target person A has moved, an affirmative determination is made in step 154, the process proceeds to step 156, the traveling direction of the target person A is determined from the difference between the image data B and the image data C (see FIG. 17), and the front and back of the target person A are determined from the traveling direction. Then, in the next step 158, the information on the traveling direction and the front and back of the target person A is stored in the RAM 14C, and the process returns to the main routine of FIG. 14.
[0147]
In the next step 122, the position and height of the target person A are calculated. As shown in FIG. 18, let f be the focal length of the equidistant projection type fisheye lens 16A fixed at the point O, H be the distance from the point O to the point Q obtained by dropping a perpendicular onto the floor surface 54 of the room 50, R be the distance from the point Q to the point P at which the target person A stands on the floor surface 54, and h be the height of the target person A (the distance between the point P' and the point P, where the point P' is the tip of the target person A on the ceiling side). In addition, let θ be the angle POQ, θ' be the angle P'OQ, h' be the height of the object image on the CCD surface of the CCD area image sensor 16B, p be the image point of the object image h' corresponding to the point P, p' be the image point of the object image h' corresponding to the point P', r be the distance from the image center o of the CCD surface to the point p, and r' be the distance from the image center o to the point p'. The angles θ and θ' and the distances r and r' are then given by the following equations (1) to (4).
[0148]
θ = tan⁻¹(R/H) ... (1)
θ' = tan⁻¹{R/(H−h)} ... (2)
r = fθ ... (3)
r' = fθ' ... (4)
Therefore, the height h and the distance R can be obtained by the following equations (5) and (6).
[0149]
h = H{1 − tan(r/f)/tan(r'/f)} ... (5)
R = H·tan(r/f) ... (6)
The distance H and the focal length f are predetermined, and the equations (5) and (6) are stored in the ROM 14B. Therefore, in step 122, equation (5) is read out from the ROM 14B and the height h is calculated from the information on the CCD surface of one television camera 16, equation (6) is read out and the distances R are calculated, and the two-dimensional position of the target person A is calculated from the two distances R thus obtained.
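A minimal Python sketch of equations (5) and (6) is given below (argument names are hypothetical; no handling of degenerate values such as r' = 0 is included).

import math

def height_and_distance(r, r_prime, H, f):
    # Equations (5) and (6) for the equidistant projection fisheye lens:
    # r and r' are the image distances of the points p and p' from the image
    # centre, H is the lens height above the floor, f the focal length.
    h = H * (1.0 - math.tan(r / f) / math.tan(r_prime / f))  # object height, eq. (5)
    R = H * math.tan(r / f)                                  # floor distance, eq. (6)
    return h, R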
[0150]
In the next step 124, matrix-like minute spaces (hereinafter referred to as voxels), obtained by virtually dividing the three-dimensional space along the X, Y and Z directions centering on the position calculated in step 122, are set. Thus, the image data C is converted into a collection of voxels. FIG. 19 conceptually shows the voxels occupied by the target person A when the target person A is projected from the four television cameras A, B, C and D.
[0151]
That is, when the target person A is projected from each television camera, the voxels located within the viewing angle and covered by the target person A, including the shadow (blind spot) portions RA, RB, RC and RD, are set as the voxels occupied by the target person A. The voxels can be subdivided down to the resolution limit of the CCD area image sensor 16B.
[0152]
In the next step 126, the first narrowing down is carried out to limit the voxels occupied by the
target person A in the image data based on the height h of the target person A as follows.
[0153]
Since the height h of the target person A can generally be set in advance from the average height of an adult, when the target person A is projected from each television camera as shown in FIGS. 20(A) to (D), those voxels positioned within the viewing angle of the target person A whose height is in the range of 0 to h are narrowed down as the voxels occupied by the target person A.
The area formed by the narrowed-down voxels is referred to here as the first narrowed-down area.
[0154]
Next, in step 128, a second narrowing-down is performed from the first narrowed-down areas in the respective image data to the area in which all of them overlap. As a result, the shadow areas RA, RB, RC and RD shown in FIG. 19 are excluded, and the voxels are narrowed down to the voxels 70 actually occupied by the target person A. In the next step 130, the position and shape of the object are accurately recognized from these voxels 70. Since the voxels can be subdivided down to the resolution limit of the CCD area image sensor 16B, the shape of the object can also be recognized in detail.
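A minimal sketch of the two narrowing-down steps, using hypothetical data structures (each camera's occupied voxels represented as a set of integer indices), is shown below.

def narrow_down_voxels(per_camera_voxels, height_limit_index):
    # per_camera_voxels: one set of (ix, iy, iz) voxel indices per camera, containing
    # every voxel within the viewing angle of the target person (shadow regions included).
    # First narrowing: keep only voxels whose height index is at most the known height h.
    first = [{v for v in cam if v[2] <= height_limit_index} for cam in per_camera_voxels]
    # Second narrowing: keep only voxels occupied in every camera's projection,
    # which removes the shadow (blind-spot) portions.
    return set.intersection(*first)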
[0155]
In the next step 132, as shown in FIG. 22, the voxels 70 are converted into a dummy model 72 based on dimensions such as their height and thickness, on information on human characteristics stored in advance in the ROM 14B (such as the color difference of the head, the positions of the eyes, nose, mouth and ears, the length of the arms, the positions of the hands, the direction of the toes and the degrees of freedom of the joints), and, when the target person A is moving, on the information on the traveling direction and the front and back of the target person A stored in the RAM 14C.
[0156]
In the next step 134, the subroutine of the extraction position setting process shown in FIG. 16 is executed.
In step 160 of FIG. 16, a predetermined number (two, as an example) of television cameras photographing the head of the target person A are selected, and the two-dimensional coordinates corresponding to the position of the head of the target person A on the CCD surface of each selected television camera are taken in. In selecting the television cameras, for example, they may be selected in descending order of the size of the object image of the target person A, or the television cameras capturing the front of the target person A may be selected. The two selected television cameras are referred to as camera L and camera R, respectively.
[0157]
In the next step 162, three-dimensional coordinates are calculated. As shown in FIG. 23, let the three-dimensional coordinates C of the camera L be (X, 0, Z) and the three-dimensional coordinates C' of the camera R be (X', 0, Z). Further, let the coordinates PL on the CCD surface of the camera L corresponding to the position of the head of the target person A be (α1, β1), r be the distance from the image center OL of the CCD surface of the camera L to the coordinates PL, the coordinates PR on the CCD surface of the camera R corresponding to the position of the head of the target person A be (α1', β1'), r' be the distance from the image center OR of the CCD surface of the camera R to the coordinates PR, and let (x, y, z) be the three-dimensional coordinates P of the head of the target person A, that is, the point at which the two rays of light corresponding to the coordinates PL and the coordinates PR intersect.
[0158]
Also, let (X, 0, z) be the coordinates of the point of intersection S between a perpendicular drawn parallel to the Z axis from the three-dimensional position of the camera L and the plane perpendicular to the Z axis containing the point P, and let (X', 0, z) be the coordinates of the point of intersection S' between a perpendicular drawn parallel to the Z axis from the three-dimensional position of the camera R and the same plane. Further, let θ1 be the angle PCS, θ1' the angle PC'S', φ the angle PSS', and φ' the angle PS'S.
[0159]
The distance r from the image center OL to the image on the CCD surface is obtained by the
equation (3) described above.
[0160]
Further, α1 and β1 are respectively
α1 = fθ1·cos(π − φ) = −fθ1·cos φ, β1 = fθ1·sin(π − φ) = fθ1·sin φ ... (7)
Here,
sin φ = y / √{(x−X)² + y²}, cos φ = (x−X) / √{(x−X)² + y²} ... (8)
so α1 and β1 can be written as
α1 = −fθ1·(x−X) / √{(x−X)² + y²} ... (9)
β1 = fθ1·y / √{(x−X)² + y²} ... (10)
By dividing equation (10) by equation (9),
y = (β1/α1)·(X − x) ... (11)
and similarly for the camera R,
y = (β1'/α1')·(X' − x) ... (12)
Eliminating y from equations (11) and (12),
x = (α1·β1'·X' − α1'·β1·X) / (α1·β1' − α1'·β1) ... (13)
and the X coordinate of the three-dimensional coordinates P can be obtained.
[0161]
Next, eliminating x from equations (11) and (13),
y = β1·β1'·(X − X') / (α1·β1' − α1'·β1) ... (14)
and the Y coordinate of the three-dimensional coordinates P can be obtained.
[0162]
Since θ1 = tan⁻¹[√{(x−X)² + y²} / (Z − z)], from equations (7) and (8),
β1/(f·sin φ) = tan⁻¹[√{(x−X)² + y²} / (Z − z)]
z = Z − √{(x−X)² + y²} / tan[(β1/f) × √{(x−X)² + y²} / y] ... (15)
Also, from equation (11), √{(x−X)² + y²} = (x−X) × √{1 + (β1/α1)²}, and from equations (11) and (14), (x−X) = (X'−X) / {1 − (α1'/α1) × (β1/β1')}, so equation (15) becomes
z = Z + [(X'−X) × √{1 + (β1/α1)²} / {1 − (α1'/α1) × (β1/β1')}] / tan{√(α1² + β1²) / f} ... (16)
and the Z coordinate of the three-dimensional coordinates P can be obtained.
[0163]
Since the three-dimensional coordinates of each television camera 16 are predetermined, in step 162 the equations (13), (14) and (16) are read out from the ROM 14B, and the values of the coordinates PL (α1, β1) on the CCD surface of the camera L and the coordinates PR (α1', β1') on the CCD surface of the camera R taken in at step 160 are substituted into equations (13), (14) and (16), whereby the three-dimensional coordinates P (x, y, z) of the head of the target person A can be obtained.
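A minimal Python sketch of equations (13), (14) and (16) follows (argument names are hypothetical, and degenerate camera geometries that make the denominators zero are not handled).

import math

def head_3d(alpha1, beta1, alpha1p, beta1p, X, Xp, Z, f):
    # Equations (13), (14) and (16): (alpha1, beta1) and (alpha1p, beta1p) are the image
    # coordinates of the head on cameras L and R, (X, 0, Z) and (Xp, 0, Z) the camera
    # positions, and f the focal length of the equidistant projection fisheye lens.
    denom = alpha1 * beta1p - alpha1p * beta1
    x = (alpha1 * beta1p * Xp - alpha1p * beta1 * X) / denom            # eq. (13)
    y = beta1 * beta1p * (X - Xp) / denom                               # eq. (14)
    z = Z + ((Xp - X) * math.sqrt(1.0 + (beta1 / alpha1) ** 2)
             / (1.0 - (alpha1p / alpha1) * (beta1 / beta1p))
             ) / math.tan(math.sqrt(alpha1 ** 2 + beta1 ** 2) / f)      # eq. (16)
    return x, y, z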
[0164]
In the next step 164, the direction of the head of the target person A is estimated as in the first
embodiment described above (similar to the process in step 104 of FIG. 4).
In the next step 166, a position separated by a predetermined distance (for example, about 30
centimeters) in the arrow V direction (see FIG. 13) from the position of the head obtained in step
162 is set as the extraction position for the target person A.
Then, at the next step 168, the position information of the set extraction position is transmitted
to the voice extraction board 12 and the process returns.
[0165]
As described above, according to the fifth embodiment, since imaging is performed using the
wide-angle fixed focus lens 16A, there is no need to move the television camera 16 or to perform
focus adjustment.
Therefore, it is possible to shorten the time until the object (target person A) is captured, and it is
possible to quickly recognize the position of the object.
[0166]
In addition, since no mechanism for changing the direction of the television camera or adjusting the focus is needed, the task of capturing an object can be automated, and the elimination of drive parts allows the durability and reliability of the television camera to be enhanced.
[0167]
Further, since one object is photographed by a plurality of television cameras, the three-dimensional coordinates can be calculated even when, for example, furniture or another obstacle obstructs the field of view of some of the cameras.
[0168]
In addition, since the television camera is disposed on the ceiling of a room constituting a threedimensional space, the wall surface can be effectively used.
[0169]
In the fifth embodiment, the plurality of television cameras 16 are arranged on the ceiling 52, but as shown in (A) to (F) of FIG. 24, they may be arranged near or embedded in a wall, or arranged at a corner formed by two surfaces (the ceiling and a wall) or at a corner formed by three surfaces (the ceiling and two walls).
Furthermore, as shown in (M) to (O) of FIG. 24, the equidistant projection type fisheye lens 16A may be directed toward the center of the room.
[0170]
Further, in the fifth embodiment, the image data A obtained when the object is not present in the room 50 is read out at step 140 of the object classification process shown in FIG. 15; however, step 140 may be omitted, and the object may be recognized based on the image data B captured by each television camera 16 and the image data C captured after the predetermined time T has elapsed since the image data B.
[0171]
In the fifth embodiment, an image including an object is photographed using two television
cameras, but three or more television cameras may be used.
[0172]
In the fifth embodiment, the equidistant projection type fisheye lens is used, but as described above, the three-dimensional coordinates of the head of the target person A can be calculated in the same manner even if an equisolid angle projection type, a stereographic projection type or an orthographic projection type fisheye lens is used.
The following equations (1)' to (6)', which correspond to equations (1) to (6), apply when an equisolid angle projection type fisheye lens is used.
[0173]
θ = tan⁻¹(R/H) ... (1)'
θ' = tan⁻¹{R/(H−h)} ... (2)'
r = 2f·sin(θ/2) ... (3)'
r' = 2f·sin(θ'/2) ... (4)'
h = H[1 − tan{2sin⁻¹(r/2f)}/tan{2sin⁻¹(r'/2f)}] ... (5)'
R = H·tan{2sin⁻¹(r/2f)} ... (6)'
Sixth Embodiment Next, a sixth embodiment according to the present invention will be described. In the sixth embodiment, in extracting the sound of an object, the three-dimensional coordinates of the object are calculated, and the position of the object is thereby recognized, based on image data that includes the object and is obtained using one television camera and one mirror.
Since the sixth embodiment is substantially the same as the fifth embodiment, the same reference
numerals are given to the same parts in FIGS. 13 to 16 and the description will be omitted.
[0174]
As shown in FIG. 25, on one side of each television camera 16, a mirror 74, which is elongated in the vertical direction (Z direction) and parallel to the direction (X direction) of one end face of the CCD area image sensor 16B, is fixed to the ceiling 52.
[0175]
Next, the various quantities relating to the equidistant projection type fisheye lens 16A, the CCD area image sensor 16B and the mirror 74 according to the sixth embodiment will be described with reference to FIGS. 25 and 26.
FIG. 26 shows these quantities in detail for the case where the distance between the equidistant projection type fisheye lens 16A and the CCD area image sensor 16B is regarded as negligible.
[0176]
As shown in FIG. 25, the center of the upper end of the mirror 74, which lies on the same XY plane as the CCD surface of the CCD area image sensor 16B, is taken as the origin O (0, 0, 0) of the three-dimensional coordinates.
The image center H of the CCD surface is separated from the origin O by a distance h in the Y direction, and the three-dimensional coordinates of the image center H are (0, h, 0). The three-dimensional coordinates of a predetermined portion (for example, the head) P of the target person A are (x, y, z), and the light emitted from the point P is refracted by the equidistant projection type fisheye lens 16A to form an image at a point D on the CCD surface; the two-dimensional coordinates of the point D on the CCD surface are taken as (αD, βD). The light emitted from the point P and reflected by the mirror 74 is refracted by the equidistant projection type fisheye lens 16A to form an image at a point R on the CCD surface; the two-dimensional coordinates of the point R on the CCD surface are taken as (αR, βR). If a virtual television camera 17 without the mirror 74 is assumed and the three-dimensional coordinates of the image center H' of its CCD surface are (0, −h, 0), the light emitted from the point P is refracted by the virtual equidistant projection type fisheye lens 17A to form an image at a point R' on the CCD surface of the virtual CCD area image sensor 17B, and the point R and the virtual point R' are symmetrical to each other. The distance from the image center H to the point D on the CCD surface is rD, and the distance from the image center H to the point R on the CCD surface is rR.
[0177]
As shown in FIG. 26, an arbitrary point on a perpendicular drawn from the point H in the Z direction is taken as the point V, and an arbitrary point on a perpendicular drawn from the point H' in the Z direction is taken as the point V'. The angle PHV is defined as the angle θD, and the angle PH'V' as the angle θR'. Also, the point represented by the three-dimensional coordinates (x, y, 0) is taken as the point S, the distance between the point S and the point H as the distance BD, the distance between the point S and the point H' as the distance BR', the distance between the point P and the point H as the distance AD, and the distance between the point P and the point H' as the distance AR'.
[0178]
Next, the operation of the sixth embodiment will be described. In step 160 of the extraction position setting process shown in FIG. 16, one television camera 16 photographing the target person A is selected (for example, the television camera for which the distance rD is smallest is selected), and the two-dimensional coordinate values of the point D (αD, βD) and the point R (αR, βR) on the CCD surface corresponding to the position of the predetermined portion P of the target person A are taken in.
[0179]
In the next step 162, three-dimensional coordinates are calculated. The quantities described
above will now be further described with reference to FIGS. 25 and 26.
[0180]
The angles θD and θR' are respectively
θD = tan⁻¹(BD/Q) = tan⁻¹[√{(y−h)² + x²} / z]
θR' = tan⁻¹(BR'/Q) = tan⁻¹[√{(y+h)² + x²} / z]
and therefore, from equation (3) above, the distances rD and rR are expressed by the following equations.
[0181]
rD = f·tan⁻¹[√{(y−h)² + x²} / z]
rR = f·tan⁻¹[√{(y+h)² + x²} / z]
αD = rD·cos(π − φD) = −rD·cos φD ... (17)
βD = rD·sin(π − φD) = rD·sin φD ... (18)
αR = rR·cos φR' (∵ φR' = φR) ... (19)
βR = rR·sin φR' (∵ φR' = φR) ... (20)
cos φD = (y−h) / √{(y−h)² + x²} ... (21)
sin φD = x / √{(y−h)² + x²} ... (22)
cos φR' = (y+h) / √{(y+h)² + x²} ... (23)
sin φR' = x / √{(y+h)² + x²} ... (24)
Therefore, from equations (17) and (21) and from equations (18) and (22),
αD = −f·θD·(y−h) / √{(y−h)² + x²} ... (25)
βD = f·θD·x / √{(y−h)² + x²} ... (26)
Eliminating f·θD from these two equations,
y = h − (αD/βD)·x ... (27)
Similarly,
αR = f·θR'·(y+h) / √{(y+h)² + x²} ... (28)
βR = f·θR'·x / √{(y+h)² + x²} ... (29)
y = −h + (αR/βR)·x ... (30)
From equations (27) and (30),
x = 2h·βD·βR / (αD·βR + αR·βD) ... (31)
and the X coordinate of the three-dimensional coordinates P can be obtained.
[0182]
Next, substituting equation (31) into equation (27), the Y coordinate of the three-dimensional coordinates P can be obtained as
y = h·(αR·βD − αD·βR) / (αD·βR + αR·βD) ... (32)
[0183]
Further, since βD = rD·sin φD = f·θD·sin φD = f·tan⁻¹[√{(y−h)² + x²} / z]·sin φD, this equation can be rearranged as
z = √{(y−h)² + x²} / tan(βD / (f·sin φD)) = √{(y−h)² + x²} / tan[(βD/f) × √{(y−h)² + x²} / x]
Here, from equations (31) and (32), √{(y−h)² + x²} = 2h·βR·√(αD² + βD²) / (αD·βR + αR·βD), so
z = 2h·βR·√(αD² + βD²) / [(αD·βR + αR·βD) × tan{√(αD² + βD²) / f}] ... (33)
and the Z coordinate of the three-dimensional coordinates P can be obtained.
[0184]
The distance h from the mirror 74 to the image center H of the CCD surface is predetermined. Therefore, at step 162, the equations (31), (32) and (33) are read out from the ROM 14B, and the two-dimensional coordinate values of the point D (αD, βD) and the point R (αR, βR) on the CCD surface taken in at step 160 are substituted into them to calculate the three-dimensional coordinates P (x, y, z) of the head of the target person A.
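A minimal Python sketch of equations (31) to (33) follows (argument names are hypothetical; degenerate cases such as a zero denominator are not handled).

import math

def head_3d_with_mirror(aD, bD, aR, bR, h, f):
    # Equations (31) to (33): (aD, bD) is the direct image point D and (aR, bR) the
    # mirror image point R on the CCD surface, h is the distance from the mirror to
    # the image centre H, and f the focal length of the equidistant fisheye lens.
    denom = aD * bR + aR * bD
    x = 2.0 * h * bD * bR / denom                         # eq. (31)
    y = h * (aR * bD - aD * bR) / denom                   # eq. (32)
    rD = math.sqrt(aD ** 2 + bD ** 2)                     # image distance of point D
    z = 2.0 * h * bR * rD / (denom * math.tan(rD / f))    # eq. (33)
    return x, y, z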
[0185]
As described above, according to the sixth embodiment, the three-dimensional coordinates of the head of the target person A can be calculated with a single television camera, so the number of television cameras installed on the ceiling 52 can be reduced.
[0186]
In the sixth embodiment, an example has been described in which the three-dimensional coordinates of the object are calculated using one television camera and one mirror installed on the ceiling 52; however, as shown in (G) to (L) of FIG. 24, the mirror may be mounted on a wall, or one television camera and a plurality of mirrors may be used.
Curved mirrors may also be used.
When a plurality of mirrors are used, more object images are formed on the CCD surface, so that the three-dimensional coordinates can be calculated as described above even if a blind spot is caused by another object (such as furniture or a pillar).
[0187]
Seventh Embodiment Next, a seventh embodiment according to the present invention will be described. In the seventh embodiment, an example is shown in which, in extracting the sound of an object, the shape of the object is recognized without setting voxels. The seventh embodiment is substantially the same as the fifth embodiment, and therefore the same parts as those in FIGS. 13 to 16 are denoted with the same reference numerals and their description is omitted. Further, in the seventh embodiment, in order to simplify the description, it is assumed that the target person A is photographed by the four television cameras A, B, C and D.
[0188]
The extraction position calculation processor 14 in the seventh embodiment has the function of converting the distorted image data including the object image into planarized image data, obtaining, based on the converted image data, image data of at least the front, back, left side, right side and plane of the object image, and synthesizing the obtained image data to recognize the object.
[0189]
Next, the operation of the seventh embodiment will be described.
At step 121 of the extraction position calculation process shown in FIG. 27, the image data photographed by the television cameras A, B, C and D is taken in. As shown in FIGS. 29(A) to (D), the images of the image data taken in at step 121 are distorted. In the next step 123, the image data of these distorted images is converted into planarized image data, and image data as shown in FIGS. 30(A) to (D) is obtained.
[0190]
In the next step 125, image data of the front, back, left side, right side and plane of the target person A is obtained from the planarized image data. FIGS. 31(A) to (C) respectively show the image data of the front, the right side and the plane of the target person A obtained in step 125. In the next step 127, the image data of the front, back, left side, right side and plane obtained in step 125 is synthesized, whereby the shape of the object can be recognized. In the next step 134, the extraction position setting process shown in FIG. 16 is executed in the same manner as in the fifth embodiment, based on the synthesized image data of the target person A.
[0191]
As described above, according to the seventh embodiment, the shape of an object can be
recognized without setting voxels.
[0192]
Although a visible-light camera is used as the television camera 16 in the first to seventh embodiments, photographing may also be performed in a wavelength range other than visible light, for example with an infrared camera.
In this way, since the object can be photographed even when the room lights are off, the device can also be used as a crime prevention device or a monitoring device.
[0193]
In the fifth to seventh embodiments, examples have been described in which the television camera 16 having the fisheye lens 16A as a wide-angle fixed focus lens and the CCD area image sensor 16B is applied to the sound extraction device 10 of the first embodiment; the same effect can be obtained by applying the television camera 16 equipped with the fisheye lens 16A as a wide-angle fixed focus lens and the CCD area image sensor 16B to the sound extraction devices 10 of the second to fourth embodiments.
[0194]
As apparent from the above description, the present invention includes the following technical
aspects.
[0195]
The sound extraction device according to any one of claims 1 to 14, wherein the photographing
means is disposed on a ceiling of a room constituting a three-dimensional space.
[0196]
The sound extraction device according to any one of claims 1 to 14, wherein the imaging unit
captures an image in a wavelength range other than visible light.
[0197]
The sound extraction device according to any one of claims 10 to 14, wherein the shape recognition means converts distorted image information including an object image into planarized image information, obtains, based on the converted image information, image information of at least the front, back, left side, right side and plane of the object image, and combines the obtained image information to recognize the object.
[0198]
The sound extraction device according to any one of claims 10 to 14, further comprising storage means in which at least one piece of characteristic information on features of human beings, such as height, thickness, head, arms, hands, feet, face, eyes, nose, mouth, ears, toes and joints, is stored in advance, wherein the shape recognition means reads the characteristic information stored in the storage means and recognizes that the object is a person based on the characteristic information and the image information photographed by the photographing means.
[0199]
According to the first aspect of the present invention, the position of the object can be
recognized, and the sound emitted by the object can be extracted and distinguished from the
surrounding noise based on the position.
[0200]
Further, according to the inventions of claims 2 and 16, the effect is obtained that the sound can be extracted with higher accuracy, particularly when the directivity of the sound emitted by the object is strong or when the sound-emitting part (surface) of the object is large.
[0201]
Further, according to the third aspect of the present invention, it is possible to obtain an effect
that the sound from the moving object can be extracted.
[0202]
Further, according to the inventions of claims 4 and 17, it is possible to obtain an effect that the
sound from each of the plurality of objects can be extracted also with respect to a plurality of
objects.
[0203]
Also, according to the inventions of claims 5 and 18, the effect that it is possible to perform the
extraction of the sound with high accuracy according to the change of the state of the acoustic
environment is obtained.
[0204]
Further, according to the inventions of claims 6 and 19, it is possible to obtain the effect that it is
possible to prevent the high tone range from becoming relatively weaker than the low frequency
component.
[0205]
Further, according to the seventh and twentieth aspects of the present invention, it is possible to
obtain an effect that sound extraction can be performed more appropriately according to the
arrangement state of the reflection surface and the like around the object.
[0206]
Further, according to the inventions of claims 8 and 21, the load of processing (shift and
extraction processing by the extraction means) relating to extraction of sound can be reduced
without lowering the accuracy of extraction of sound. The effect is obtained.
[0207]
Further, according to the inventions of claims 9 and 22, it is possible to input voices uttered by a
person (one or more persons) within a region where sound extraction can be performed by the
sound extraction device to the speech recognition device. The effect is obtained.
[0208]
According to the tenth aspect of the present invention, it is possible to rapidly recognize the
position of the object without changing the direction of the photographing means following the
movement of the object and adjusting the focus.
In addition, since there is no need for a mechanical operation mechanism for changing the
direction of the photographing means and the focus adjustment, the structure of the
photographing means and the sound extraction device can be simplified, and the mechanical
operation parts can be reduced. In addition, the effect that the durability can be improved can be
obtained.
Furthermore, the effect is also obtained that the sound emitted by the object can be distinguished
from surrounding noise and extracted based on the position of the recognized object.
[0209]
Further, according to the invention of claim 11, the effect is obtained that the shape recognition means and the three-dimensional coordinate calculation means constituting the image recognition means allow the three-dimensional coordinates of the object to be obtained quickly, so that the position of the object can be recognized rapidly.
[0210]
Further, according to the invention of claim 12, since the image can be subdivided into minute areas down to the resolution limit of the area sensor, the effect is obtained that the shape of the object can be recognized in detail by obtaining, from the image information, the minute areas occupied by the object.
[0211]
Further, according to the invention of claim 13, the effect is obtained that the shadow areas in the different pieces of image information photographed by the plurality of photographing means can be excluded, so that the shape of the object can be recognized correctly.
[0212]
Further, according to the invention of claim 14, the effect is obtained that the two-dimensional coordinates formed on the area sensor can be acquired, and the three-dimensional coordinates of the object can be calculated based on the plurality of acquired two-dimensional coordinates.
[0213]
Further, according to the invention of claim 15, since an image of the object can be formed on the area sensor by the reflection means, the position of the object can be recognized rapidly even when there is only one photographing means. In addition, the effect is obtained that the sound emitted by the object can be distinguished from the surrounding noise and extracted based on that position.
[0214]
Brief description of the drawings
[0215]
1 is a schematic view showing the principle of sound collection according to the present
invention.
[0216]
2 is a schematic view showing a sound collection environment according to the first to fourth
embodiments.
[0217]
3 is a schematic configuration diagram of a sound extraction device according to the first and
fourth embodiments.
[0218]
4 is a flow chart showing a control routine executed by the sound collection position arithmetic
processor according to the first, third and fourth embodiments.
[0219]
5 is a flow chart showing a control routine executed by the processor of the speech extraction
board according to the first and second embodiments.
[0220]
6 is a schematic configuration diagram of a sound extraction device according to a second
embodiment.
[0221]
7 is a flow chart showing a control routine executed by the sound collection position calculation
processor according to the second embodiment.
[0222]
FIG. 8 is a schematic configuration diagram of a sound extraction device according to a third embodiment.
[0223]
9 is a flow chart showing a control routine executed by the processor of the speech extraction
board according to the third embodiment.
[0224]
FIG. 10 is a diagram showing a configuration example in which the sound extraction device of the present invention is applied to sound extraction outdoors.
[0225]
11 is a schematic view showing the difference in directivity by the sound range of the sound
according to the fourth embodiment.
[0226]
12 is a flow chart showing a control routine executed by the processor of the speech extraction
board according to the fourth embodiment.
[0227]
FIG. 13 is a schematic view showing a sound collection environment according to the fifth to seventh embodiments.
[0228]
FIG. 14 is a flow chart showing a control routine executed by the sound collection position
calculation processor according to the fifth and sixth embodiments.
[0229]
15 is a flow diagram showing a subroutine of the object classification process.
[0230]
FIG. 16 is a flow chart showing a subroutine of the extraction position setting process.
[0231]
17 is an explanatory diagram for explaining the concept of separating the object.
[0232]
18 is an explanatory view for explaining various quantities such as the height of the object.
[0233]
19 is an explanatory diagram for explaining the relationship between the shadow portion of the
object and the voxel.
[0234]
FIG. 20A is a view showing voxels based on the image data of the television camera A, FIG. 20B is a view showing voxels based on the image data of the television camera B, FIG. 20C is a view showing voxels based on the image data of the television camera C, and FIG. 20D is a view showing voxels based on the image data of the television camera D.
[0235]
21 is an explanatory view for explaining the concept of voxels narrowed down by the second
narrowing down.
[0236]
22 is an explanatory view for explaining the concept of converting the voxels narrowed down by
the second narrowing down to a dummy model.
[0237]
FIG. 23 is a conceptual diagram for explaining various quantities when three-dimensional coordinates are calculated using two television cameras.
[0238]
FIG. 24 is a view showing various arrangements of a television camera or a mirror.
[0239]
25 is a block diagram of a three-dimensional position recognition device according to the sixth
embodiment.
[0240]
26 is an explanatory view for explaining the position of the CCD area image sensor and the like
of the sixth embodiment.
[0241]
FIG. 27 is a flow chart showing a control routine executed by the sound collection position
calculation processor according to the seventh embodiment.
[0242]
FIG. 28 is a plan view showing the arrangement of objects and a television camera according to
the seventh embodiment.
[0243]
FIG. 29A is a view showing an image of the image data of the television camera A, FIG. 29B is a view showing an image of the image data of the television camera B, FIG. 29C is a view showing an image of the image data of the television camera C, and FIG. 29D is a view showing an image of the image data of the television camera D.
[0244]
FIG. 30A is a view showing an image when the distorted image data of the television camera A is converted into planarized image data, FIG. 30B is a view showing an image when the distorted image data of the television camera B is converted into planarized image data, FIG. 30C is a view showing an image when the distorted image data of the television camera C is converted into planarized image data, and FIG. 30D is a view showing an image when the distorted image data of the television camera D is converted into planarized image data.
[0245]
FIG. 31A is a view showing an image of image data taken from directly in front, FIG. 31B is a view showing an image of image data taken from directly beside, and FIG. 31C is a view showing an image of image data taken from directly above.
[0246]
Explanation of reference numerals
[0247]
10: sound extraction device, 12: voice extraction board, 14: extraction position calculation processor, 16: television camera (photographing means), 16A: equidistant projection type fisheye lens (wide-angle fixed focus lens), 16B: CCD area image sensor (area sensor), 21: audio output terminal, 22: microphone, 32: input buffer memory, 34: processor, 44: output buffer memory, 46: adder, 58: temperature sensor, 60: wind speed meter, 62: wind direction meter, 74: mirror (reflection means)
i. Then, when the voice extraction processing routine is executed the next time, a new reference address shifted from the previous reference address by a predetermined number of addresses is set, and the voice data is written sequentially from the new reference address. When writing to the input buffer memory i has been performed three times in this way, the reference address is returned to the leading address of the input buffer memory i on the fourth time, and the voice data is again written sequentially from the leading address. Thus, the input buffer memory i is used as a so-called ring buffer.
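To make this ring-buffer behaviour concrete, the following is a minimal Python sketch; the block length, number of blocks, and variable names are illustrative assumptions and are not taken from the embodiment.

```python
BLOCK = 1024                     # voice samples written per execution of the routine (assumed)
N_BLOCKS = 4                     # after three shifted writes, the fourth wraps to the head
BUF_LEN = BLOCK * N_BLOCKS

buffer_i = [0] * BUF_LEN         # stands in for "input buffer memory i"
reference_address = 0            # write start address for the current execution

def write_voice_block(samples):
    """Write one block of voice data sequentially from the current reference address,
    then shift the reference address for the next execution, wrapping to the head."""
    global reference_address
    for k, s in enumerate(samples[:BLOCK]):
        buffer_i[(reference_address + k) % BUF_LEN] = s
    reference_address = (reference_address + BLOCK) % BUF_LEN
```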
[0094]
In the next step 212, the delay time corresponding to the distance between the position of one of the selected microphones 22 and the extraction position is fetched from the delay table stored in advance in the ROM 40. The delay table records, for each candidate extraction position that can be set within the range of the room 50, the propagation time (delay time) obtained by dividing the distance between that extraction position and each microphone 22 by the speed of sound at standard room temperature; the table is prepared in advance for all of the extraction position candidates within the room 50.
[0095]
In the next step 214, the audio data from the one microphone 22 is taken out of the input buffer memory i starting from an address shifted, by the number of memory addresses corresponding to the delay time, from the predetermined reference address (that is, the write start address of the input buffer memory i). Thereby, the sound data written to the input buffer memory i before the sound emitted by the target person A reached the one microphone 22 is cut off, and only the sound that was emitted by the target person A and reached the one microphone 22 is taken out.
[0096]
Then, in the next step 216, the extracted audio data is written to the output buffer memory i
corresponding to the one microphone 22. That is, audio data corresponding to the audio signal
as shown in FIG. 1B is written to the output buffer memory i. The output buffer memory i is also
used as a so-called ring buffer as in the above-mentioned input buffer memory i.
[0097]
The above steps 212, 214, and 216 are then performed for all of the selected microphones. When the processing of steps 212, 214, and 216 has been executed for all of the selected microphones, the determination at step 218 is affirmative, the process proceeds to step 220, and the adder 46 is caused to add the audio data corresponding to each of the selected microphones.
[0098]
In the next step 222, the added audio data is output to the D/A converter 48 with its point position shifted upward by INT(log2 M) digits. As a result, substantially the same result is obtained as when the added audio data is divided by the number M of microphones. Alternatively, the processor 34 may take in the calculation result of the adder 46 and perform an ordinary division.
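Read as a binary shift, the scaling of step 222 can be sketched as below; treating the samples as integers and the point shift as a right shift by INT(log2 M) bits is an assumption about the board's number format, not something stated in the embodiment.

```python
import math

def scale_added_audio(summed_sample, num_mics):
    """Approximate division by the number of microphones M by shifting the point
    position by INT(log2 M) digits (here: a right shift by that many bits)."""
    shift = int(math.log2(num_mics))   # INT(log2 M); e.g. 2 when M = 7
    return summed_sample >> shift      # differs from exact division only by a constant gain
```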
[0099]
Thereafter, the audio data output from the adder 46 is converted by the D/A converter 48 into an analog audio signal as shown in FIG. 1C, and the converted audio signal is sent to the audio output terminal 21 of the output terminal board 20. By connecting an audio reproduction device or the like to the audio output terminal 21, the extracted voice of the target person A can be reproduced and heard.
[0100]
As is apparent from the above description, by performing the above-described delay operation and averaging on the sounds collected by a plurality of microphones 22 (seven in the example of FIG. 1), the noise components other than the voice of the target person A become so small in amplitude that only the voice of the target person A is extracted.
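Combining the steps above, a compact delay-and-sum sketch over the selected microphones might look like the following; the data layout and names are assumptions, and real hardware would pipeline this processing rather than compute it in a loop.

```python
def delay_and_sum(buffers, reference_address, delay_offsets, block_len):
    """Shift each selected microphone's data by its own delay, add the shifted data
    across microphones, and average, so that the sound arriving from the extraction
    position adds up in phase while surrounding noise tends to average out."""
    num_mics = len(buffers)
    out = [0.0] * block_len
    for buf, offset in zip(buffers, delay_offsets):
        start = (reference_address + offset) % len(buf)
        for k in range(block_len):
            out[k] += buf[(start + k) % len(buf)]
    return [v / num_mics for v in out]
```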
[0101]
Further, the extraction position calculation process (FIG. 4) and the voice extraction process (FIG.
5) are repeatedly performed at predetermined time intervals.
Thereby, when the target person A moves, the inside of the room 50 is continuously photographed by the plurality of television cameras 16, the position and orientation of the head P, which change with the movement of the target person A, are determined from the image information, and the extraction position is set according to the position and orientation of the head P at that time. The voice extraction board 12 then performs the voice extraction process according to that extraction position, so that the voice can be extracted even while the target person A is moving.
[0102]
In the voice extraction process of the first embodiment, an example was given in which the microphones close to the set extraction position (for example, seven microphones) are selected and only the voice data from the selected microphones is taken in and written to the input buffer memories. However, it is also possible to take in the audio data from all (n) microphones once, write it to the respective input buffer memories, and then take out only the audio data from the selected microphones (for example, seven microphones) from the input buffer memories while shifting the memory addresses by the amounts corresponding to the delay times.
[0103]
Further, in the voice extraction process of the present invention, the sound of the target person (or target object) is collected by a large number of microphones arranged near the extraction position, and the collected voice signals are subjected to the delay operation and averaging described above, so that sound extraction can be performed with an improved signal-to-noise ratio. Moreover, it is possible to extract a sound having a higher signal-to-noise ratio than the sound collected by an ordinary microphone. Such good-quality sound can be used as an input to a speech recognition device. That is, the voice of a person (one or more persons) who is within the area in which the sound extraction device can extract sound can be input to the speech recognition device.
[0104]
Second Embodiment Next, a second embodiment of the present invention will be described. In
the second embodiment, an example is shown in which the voice of the target person A and the
voice of the target person B in the predetermined room 50 shown in FIG. 2 are separately
extracted. The same parts as those in the first embodiment are denoted by the same reference
numerals, and the description will be omitted.
[0105]
As shown in FIG. 6, the sound extraction device 10 according to the second embodiment is provided with a plurality (N) of the voice extraction boards 12 described in the first embodiment, and an audio data relay board 56 that connects each microphone 22 to each voice extraction board 12 is further installed. The extraction position calculation processor 14 is connected to the processor 34 provided on each voice extraction board 12. Further, the output terminal board 20 is provided with an audio output terminal 21 corresponding to each voice extraction board 12, and each audio output terminal 21 is connected to the D/A converter 48 of the corresponding voice extraction board 12.
[0106]
Next, the operation of the second embodiment will be described. When a start button (not shown) of the sound extraction device 10 is turned on by the operator, the control routine of the extraction position calculation processing for a plurality of extraction positions shown in FIG. 7 is executed by the extraction position calculation processor 14, and the same control routine of the voice extraction processing as in the first embodiment, shown in FIG. 5, is executed by the CPU 38 of each of the two voice extraction boards 12.
[0107]
The control routine of the extraction position calculation process shown in FIG. 7 will be
described. In the following description, target persons A and B will be referred to as target
persons 1 and 2 for convenience. In step 102, shooting information from each television camera
16 is fetched, and in the next step 103, "2" is substituted for the variable K as the number of
target persons and the variable L is initialized to "1".
[0108]
In the next step 105, the position and orientation of the head of the target person L (that is, target person 1) are calculated in the same manner as in the first embodiment, and in the next step 107, an extraction position L (that is, extraction position 1) for extracting the voice of the target person L is set. Then, in the next step 109, the information on the extraction position L is transmitted to the corresponding voice extraction board L.
[0109]
In the next step 110, it is determined whether or not the processing of steps 105, 107, and 109 has been completed for all target persons, by checking whether the variable L is equal to the variable K indicating the number of target persons. In this case the determination is initially negative, so the routine proceeds to step 112, where the variable L is incremented by one. As a result, the value of the variable L becomes "2".
[0110]
Thereafter, the process returns to step 105, and the processing of steps 105, 107, and 109 is performed for the target person L (that is, target person 2). When that processing is completed, the determination at step 110 is affirmative because the variables L and K are equal, and the control routine ends.
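The loop of steps 105 to 112 can be summarized by the sketch below; the way the extraction position is derived from the head position and orientation, and the callback standing in for "transmit to the voice extraction board L", are assumptions made only to keep the example self-contained.

```python
def extraction_position_calculation(head_states, send_to_board):
    """For each target person L = 1..K: use the computed position and orientation of the
    head, set the extraction position L, and transmit it to the corresponding board."""
    K = len(head_states)                               # number of target persons
    for L in range(1, K + 1):
        position, orientation = head_states[L - 1]
        # Illustrative assumption: place the extraction position slightly in front of
        # the head, along its orientation (roughly where the mouth would be).
        extraction_position = tuple(p + 0.1 * o for p, o in zip(position, orientation))
        send_to_board(L, extraction_position)
```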
[0111]
The voice extraction boards 12 respectively corresponding to the target persons 1 and 2 receive
the information of the extraction position 1 or the extraction position 2 transmitted from the
extracti