DESCRIPTION JP2010154259
[PROBLEMS] To attenuate noise and obtain a good voice even for a sound source that emits voice intermittently, and to display characters appropriately according to the person who emitted the voice. An object position detection unit (24b) calculates the distance and direction to a subject, and a voice position detection unit (12) calculates the distance and direction to a sound source. An associating unit associates the subject and the sound source as the same object based on the distance and direction to the subject and the distance and direction to the sound source; a tracking control unit (40b) tracks the associated subject image; directional characteristic adjusting units (13a, 13b) adjust the directional characteristics of a microphone array (11) based on the tracking result and the distance or direction of the subject or the sound source; voice recognition units (15a, 15b) convert voice into a character string based on voice data generated by the microphone array (11) whose directional characteristics have been adjusted; and an output control unit (40d) generates output data for displaying the character string on the screen according to the subject image. [Selected figure] Figure 1
Image and sound processing device
[0001]
The present invention relates to an image and sound processing apparatus.
[0002]
In a general video camera, light collected by a lens is converted into an electrical signal by an imaging device, and the image data processed by the camera and the audio data converted into an electrical signal by a microphone are separately compressed and recorded on a recording medium.
Then, at the time of reproduction, the image data and the audio data recorded on the recording medium are decompressed and output to an output device such as a television.
[0003]
Patent Document 1 proposes an image processing apparatus that performs voice recognition processing on input voice in a digital camera, converts the recognized voice into characters, and superimposes the characters on a still image for display.
[0004]
Further, Patent Document 2 proposes a device that, so that the user can reliably and easily recognize the voice content emitted by a person displayed in the screen, detects the movement of the mouth of the displayed person, converts the voice emitted by the person into characters, and displays the characters near the detected mouth on the screen.
[0005]
Furthermore, Patent Document 3 proposes a display device that displays voice-recognized speech as characters superimposed on an image in a telop (caption) format.
JP-A-11-55614 JP-A-9-233442 JP-A-11-41538
[0006]
However, in the techniques described in Patent Documents 1 to 3, when a plurality of persons displayed in the screen speak alternately, the directivity characteristic of the microphone is not adjusted to the person who spoke, so noise may not be attenuated, a good voice may not be acquired, and the user may miss the timing of recording.
[0007]
Also, with a technique such as that described in Patent Document 3, which merely converts the voice emitted by a person into characters and displays them near the detected mouth on the screen, when a plurality of persons are displayed at positions close to each other in the screen, the user cannot recognize which person emitted the voice.
[0008]
The present invention has been made in view of the above problems, and it is an object of the present invention to provide an image/sound processing apparatus that attenuates noise and obtains a good voice even for a sound source that emits voice intermittently, and that appropriately displays characters according to the person who emitted the voice.
[0009]
In order to achieve the above object, according to a first feature of the image and sound processing apparatus of the present invention, an image and sound processing apparatus for displaying characters according to an object that emits sound comprises: an imaging unit that converts light collected from a subject by an optical system into an electric signal to generate image data; a microphone array in which a plurality of microphones that convert sound emitted from a sound source into electric signals to generate sound data are arranged at predetermined intervals; an object position detection unit that calculates the distance from the image and sound processing apparatus to the subject and the direction of the subject with respect to the image and sound processing apparatus based on the image data generated by the imaging unit; a sound position detection unit that calculates the distance from the image and sound processing apparatus to the sound source and the direction of the sound source with respect to the image and sound processing apparatus based on the sound data; an associating unit that associates the subject and the sound source as the same object based on the distance and direction of the subject calculated by the object position detection unit and the distance and direction of the sound source calculated by the sound position detection unit; a directivity characteristic adjustment unit that adjusts the directivity characteristics of the microphone array based on the distance and direction of the subject calculated by the object position detection unit or the distance and direction of the sound source calculated by the sound position detection unit; a speech recognition unit that converts speech into a character string based on the speech data generated by the microphone array whose directivity characteristics have been adjusted by the directivity characteristic adjustment unit; and an output control unit that causes an output unit to generate output data for displaying the character string converted by the speech recognition unit on the screen according to the subject.
[0010]
In order to achieve the above object, according to a second feature of the image/sound processing apparatus of the present invention, the apparatus further comprises a tracking control unit that tracks, on the image data, the subject corresponding to the object associated by the associating unit, and the directivity characteristic adjustment unit adjusts the directivity of the microphone array based on the tracking result of the tracking control unit and the distance and direction of the subject calculated by the object position detection unit or the distance and direction of the sound source calculated by the sound position detection unit.
[0011]
In order to achieve the above object, according to a third feature of the image/sound processing apparatus of the present invention, the apparatus further comprises: an object detection unit that detects feature information of a subject from the image data generated by the imaging unit; a human classification information storage unit that stores, as human classification information, human feature information in association with a human classification classified based on the human feature information; an object recognition unit that extracts, based on the human classification information, the human classification corresponding to the feature information of the subject detected by the object detection unit; and a translation unit that translates the character string converted by the speech recognition unit from a language according to the human classification extracted by the object recognition unit into a native language set in advance, wherein the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen according to the subject image detected by the object detection unit.
[0012]
In order to achieve the above object, according to a fourth feature of the image and sound processing apparatus of the present invention, the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen in the vicinity of the subject image detected by the object detection unit.
[0013]
In order to achieve the above object, according to a fifth feature of the image and sound processing apparatus of the present invention, the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen based on the size of the subject in the image data generated by the imaging unit.
[0014]
In order to achieve the above object, according to a sixth feature of the image and sound processing apparatus of the present invention, the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen based on the direction of the subject in the image data generated by the imaging unit.
[0015]
In order to achieve the above object, according to a seventh feature of the image/sound processing apparatus of the present invention, the output control unit determines at least one of a color and a font of the character string converted by the translation unit based on the type of subject in the image data generated by the imaging unit, and causes the output unit to generate output data for displaying the converted character string on the screen in the determined color or font.
[0016]
In order to achieve the above object, according to an eighth feature of the image and sound processing apparatus of the present invention, when the object recognition unit determines that the subject is a human, the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen at a position near the mouth of the human.
[0017]
In order to achieve the above object, according to a ninth feature of the image/voice processing apparatus of the present invention, when the object recognition unit determines that the subject is a human, the output control unit causes the output unit to generate output data for displaying the character string converted by the translation unit on the screen, tilted according to the angle of the human's head.
[0018]
According to the image/sound processing apparatus of the present invention, noise can be attenuated even for a sound source that emits sound intermittently so that good sound is obtained, and characters can be appropriately displayed according to the person who emitted the sound.
[0019]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings.
[0020]
In one embodiment of the present invention, an example of an image/sound processing apparatus that attenuates noise even for a sound source that emits sound intermittently so as to obtain good sound, and that appropriately displays characters according to the person who emitted the sound, will be described.
[0021]
<Configuration of Image and Audio Processing Device> FIG. 1 is a configuration diagram showing
a configuration of an image and audio processing device according to an embodiment of the
present invention.
[0022]
An image/sound processing apparatus 1 according to an embodiment of the present invention includes a microphone array 11, an audio position detection unit 12, a first directivity characteristic adjustment unit 13a, a second directivity characteristic adjustment unit 13b, a first voice detection unit 14a, a second voice detection unit 14b, a first voice recognition unit 15a, a second voice recognition unit 15b, a dictionary storage unit 16, a first translation unit 17a, a second translation unit 17b, a voice compression unit 18, a recording voice generation unit 19, a camera 21 having an imaging unit, a camera processing unit 22, a motion sensor 23, a detection unit 24, a motion vector detection unit 25, a moving image compression unit 26, a character combining unit 27, a human classification information storage unit 31, an image reference feature information storage unit 32, a directivity characteristic priority storage unit 34, a CPU 40, an operation unit 41, a direction sensor 42, a recording unit 43, an audio output unit 44, and a display unit 45.
[0023]
The microphone array 11 includes a first microphone 11a, a second microphone 11b, and a third microphone 11c, which are disposed at predetermined intervals of, for example, about 10 mm, and converts sound emitted from a sound source into voice data.
[0024]
The audio position detection unit 12 calculates the distance from the image / sound processing
device 1 to the sound source and the direction of the sound source with respect to the image /
sound processing device 1 based on the sound data generated by the microphone array 11.
[0025]
The first directional characteristic adjustment unit 13a adjusts the directivity characteristics by superimposing the voice data generated by the first microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate the time differences between the voices reaching those microphones, based on the tracking result of the tracking control unit 40b of the CPU 40 described later, and the distance and direction of the subject calculated by the object position detection unit 24b of the detection unit 24 or the distance and direction of the sound source calculated by the audio position detection unit 12.
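As a concrete illustration of this delay compensation, the following is a minimal sketch of delay-and-sum processing over three microphone channels; the sample rate, the per-microphone delays (which would be derived from the subject or sound-source direction), and the function name are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def delay_and_sum(ch_a, ch_b, ch_c, delays_samples):
    """Align three microphone channels by their estimated arrival delays
    (in samples) and superimpose them, reinforcing sound that arrives from
    the steered direction while attenuating sound from other directions."""
    aligned = []
    for ch, d in zip((ch_a, ch_b, ch_c), delays_samples):
        # Shift each channel backwards by its delay so all channels line up.
        aligned.append(np.roll(ch, -d))
    return np.mean(aligned, axis=0)

# Example with hypothetical signals: the source reaches microphone 11b one
# sample and microphone 11c two samples later than microphone 11a.
fs = 16000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)
enhanced = delay_and_sum(source, np.roll(source, 1), np.roll(source, 2),
                         delays_samples=(0, 1, 2))
```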
[0026]
The second directivity characteristic adjustment unit 13 b has the same configuration as the first
directivity characteristic adjustment unit 13 a.
[0027]
The first voice detection unit 14a extracts voice feature information from the voice data whose
directivity characteristic has been adjusted by the first directivity characteristic adjustment unit
13a.
Specifically, the first voice detection unit 14a extracts the volume, tone color information, and the
like from the voice whose directivity characteristic has been adjusted, and supplies these to the
CPU 40 as voice feature information.
[0028]
The second speech detection unit 14 b has the same configuration as the first speech detection
unit 14 a.
[0029]
The first speech recognition unit 15a converts speech into a character string based on speech
data generated by the microphone array 11 whose directivity characteristic has been adjusted by
the first directivity characteristic adjustment unit 13a.
Specifically, the first voice recognition unit 15a converts the voice into a character string based on the voice data generated by the microphone array 11, using the type of subject specified by the object recognition unit 24c described later and the dictionary data for each type of subject stored in the dictionary storage unit 16 described later.
[0030]
The second speech recognition unit 15b has the same configuration as the first speech
recognition unit 15a.
[0031]
The dictionary storage unit 16 stores dictionary data for each type of subject such as, for
example, a dog, a cat, a car, a human, and the like.
[0032]
The first translation unit 17a translates the character string converted by the first speech recognition unit 15a from the language corresponding to the human classification extracted by the object recognition unit 24c into a native language set by an input operation of the operation unit 41 described later.
[0033]
The second translation unit 17 b has the same configuration as the first translation unit 17 a.
[0034]
The audio compression unit 18 compresses the recording audio data generated by the recording
audio generating unit 19 described later according to a predetermined compression method, and
causes the recording unit 43 described later to record the compressed recording audio data.
[0035]
The recording sound generation unit 19 synthesizes the sound data supplied from the microphone array 11 and the sound data supplied from the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b, and converts them into the number of audio channels required for recording in the recording unit 43 described later (for example, 2 channels in stereo recording).
Specifically, the recording sound generation unit 19 synthesizes the voice data so that, when the subject image whose face is recognized by the object recognition unit 24c described later is determined, from its sound volume and mouth movement, to be emitting sound, the voice data supplied from the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b are recorded, and, when the human being who is the sound source does not emit voice, the voice data supplied from the microphone array 11 are recorded; it then generates output data and supplies them to the voice compression unit 18 and the voice output unit 44.
Thereby, even when there is ambient noise, it is possible to clearly record or output the voice
emitted by the human being as the sound source.
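As a rough sketch of the switching behavior described above, the following per-frame selection chooses between the beamformed (directivity-adjusted) audio and the raw microphone-array audio depending on whether the recognized face is judged to be speaking; the frame structure, field names, and the volume threshold are assumptions for illustration, not the patent's actual implementation.

```python
def build_recording_audio(frames):
    """frames: iterable of dicts with keys
       'beamformed' (directivity-adjusted audio), 'raw' (microphone-array audio),
       'mouth_moving' (bool from face recognition), 'volume' (float).
    Returns the per-frame audio chosen for recording or output."""
    output = []
    for f in frames:
        # Treat the subject as the active sound source when the recognized
        # face's mouth is moving and the volume exceeds a threshold
        # (the threshold value is illustrative only).
        is_speaking = f["mouth_moving"] and f["volume"] > 0.1
        output.append(f["beamformed"] if is_speaking else f["raw"])
    return output
```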
[0036]
The camera 21 includes a zoom lens 21a and an imaging element 21b.
The zoom lens 21a adjusts the angle of view based on the zoom magnification set by the operation signal supplied from the operation unit 41 described later, and condenses light from the subject through an optical system (not shown). The imaging element 21b converts the collected light into an electrical signal.
[0037]
The camera processing unit 22 converts the electric signal supplied from the camera 21 into image data such as RGB signals, a luminance signal Y, and color difference signals Cr and Cb.
[0038]
The motion sensor 23 includes, for example, a gyro sensor or the like, detects the motion of the
image / sound processing device 1, and supplies the motion to the CPU 40 and the detection unit
24.
[0039]
The detection unit 24 includes an object detection unit 24 a, an object position detection unit 24
b, and an object recognition unit 24 c.
[0040]
The object detection unit 24 a detects feature information of a subject image from the image data
generated by the camera processing unit 22.
For example, the object detection unit 24a detects the shape and color of a subject image as
feature information from image data.
Further, when the object recognition unit 24c described later determines that the type of the subject is "human", the object detection unit 24a further detects the skin color, pupil color, contour, hair color, and costume as feature information of the subject.
[0041]
The object position detection unit 24 b calculates the distance from the image and sound
processing device 1 to the object of the image data and the direction of the object with respect to
the image and sound processing device 1 based on the image data generated by the camera
processing unit 22.
[0042]
The object recognition unit 24c recognizes a subject image.
Specifically, the object recognition unit 24 c specifies the type of subject based on the shape and
color extracted by the object detection unit 24 a and the image reference feature information
stored in the image reference feature information storage unit 32.
Then, when the specified type of subject is "human", the object recognition unit 24c extracts the human classification corresponding to the feature information of the detected subject image, based on the human classification information stored in the human classification information storage unit 31 described later.
Furthermore, the object recognition unit 24c performs face recognition when the specified type of subject is "human".
[0043]
The motion vector detection unit 25 detects the motion of the image data generated by the
camera processing unit 22, and supplies the motion to the CPU 40 and the detection unit 24.
[0044]
The moving image compression unit 26 compresses the image data generated by the camera
processing unit 22 according to a predetermined compression method, and supplies the
compressed image data to a recording unit 43 described later.
[0045]
The character combining unit 27 superimposes a character string on the image data generated
by the camera processing unit 22 according to an instruction of an output control unit 40 d of
the CPU 40 described later, and causes the display unit 45 to display the character string.
[0046]
The human classification information storage unit 31 stores human characteristic information
and human classification classified based on the characteristic information as human
classification information in association with each other.
[0047]
FIG. 2 is a diagram showing an example of human classification information stored in the human
classification information storage unit 31 included in the image / voice processing apparatus 1
according to an embodiment of the present invention.
[0048]
As shown in FIG. 2, the column name "human classification" (symbol 51), the column name "skin color" (symbol 52), the column name "pupil color" (symbol 53), the column name "contour" (symbol 54), the column name "hair color" (symbol 55), and the column name "costume feature" (symbol 56) are associated and stored as human classification information.
[0049]
The image reference feature information storage unit 32 stores the type of the subject and the
image reference feature information in association with each other.
[0050]
FIG. 3 is a diagram showing an example of the image reference feature information stored in the
image reference feature information storage unit 32 included in the image / voice processing
apparatus 1 according to an embodiment of the present invention.
[0051]
As shown in FIG. 3, the column name “type” (reference numeral 61) and the column name
“image reference feature information” (reference numeral 62) are stored in association with
each other.
The image reference characteristic information 62 includes a column name “type” (reference
numeral 62 a), a column name “color” (reference numeral 62 b), and a column name
“reference dimension” (reference numeral 62 c).
[0052]
The directivity characteristic priority storage unit 34 stores the priorities of the types of the subject and the sound source supplied from the operation unit 41 described later.
Until a priority is designated via the operation unit 41, the CPU 40 described later performs processing in accordance with a predetermined priority stored in advance in the directivity characteristic priority storage unit 34.
[0053]
The CPU 40 centrally controls the image and sound processing apparatus 1.
Further, the CPU 40 functionally includes an association unit 40a, a tracking control unit 40b, a
directivity adjustment control unit 40c, and an output control unit 40d.
[0054]
The associating unit 40a associates the subject and the sound source as the same object based on the distance and direction of the subject calculated by the object position detection unit 24b and the distance and direction of the sound source calculated by the sound position detection unit 12.
[0055]
The tracking control unit 40b divides the image displayed on the display unit 45 into a plurality of blocks based on the image data and detects the movement of each block, thereby tracking the movement of the subject corresponding to the object associated by the associating unit 40a.
[0056]
The directivity adjustment control unit 40c causes the first directional characteristic adjusting unit 13a or the second directional characteristic adjusting unit 13b to adjust the directional characteristics based on the tracking result of the tracking control unit 40b and the distance and direction of the subject calculated by the object position detection unit 24b or the distance and direction of the sound source calculated by the sound position detection unit 12.
[0057]
The output control unit 40d causes the recording unit 43 or the character combining unit 27 to generate output data for displaying the character string converted by the first voice recognition unit 15a or the second voice recognition unit 15b on the screen according to the subject image.
[0058]
The operation unit 41 generates various operation signals based on the user's operation, such as an operation signal requesting the start or end of photographing and a signal setting the native language into which the first translation unit 17a or the second translation unit 17b translates, and supplies the generated operation signals to the CPU 40.
[0059]
The direction sensor 42 detects the direction in which the image / voice processing apparatus 1
is facing, and supplies the detected direction data to the CPU 40 and the detection unit 24.
[0060]
The recording unit 43 records, in synchronization, the recording audio data supplied from the audio compression unit 18, the moving image data supplied from the moving image compression unit 26, and the character string supplied from the CPU 40, according to the instruction of the output control unit 40d of the CPU 40.
[0061]
The audio output unit 44 includes an audio output device such as a speaker, and outputs audio
based on the recorded audio data supplied from the recorded audio generation unit 19.
[0062]
The display unit 45 includes an image output device such as an organic EL (electroluminescence)
display or a liquid crystal display, and displays an image based on the image data supplied from
the character combining unit 27.
[0063]
Next, the operation of the image sound processing apparatus 1 according to an embodiment of
the present invention will be described.
[0064]
FIG. 4 is a flow chart showing the processing flow of the image and sound processing apparatus
1 according to an embodiment of the present invention.
[0065]
First, when an electric signal is supplied from the camera 21 (step S101), the camera processing unit 22 of the image/voice processing apparatus 1 converts the supplied electric signal into an RGB signal, a luminance signal Y, color difference signals Cr and Cb, and the like to generate image data.
[0066]
Next, the object position detection unit 24b corrects shake based on the movement of the image and sound processing apparatus 1 detected by the motion sensor 23 and the direction of the image and sound processing apparatus 1 detected by the direction sensor 42 (step S102).
For example, the object position detection unit 24b selects the range of image data to be cut out from the image data supplied from the camera processing unit 22 so as to cancel the movement of the image and sound processing apparatus 1 detected by the motion sensor 23, and supplies the cut-out image data to the object detection unit 24a.
[0067]
Then, the object detection unit 24a detects the feature information of the subject image from the
image data whose shake has been corrected (step S103).
For example, the object detection unit 24a detects the shape and color of the subject image from
the image data as feature information of the subject image.
[0068]
Next, the object recognition unit 24c recognizes a subject image (step S104).
Specifically, the object recognition unit 24 c specifies the type of subject based on the shape and
color extracted by the object detection unit 24 a and the image reference feature information
stored in the image reference feature information storage unit 32.
Then, when the type of the specified subject is "human", the object recognition unit 24c extracts the human classification corresponding to the feature information of the subject image detected in step S103, based on the human classification information stored in the human classification information storage unit 31.
[0069]
FIG. 5 is a view for explaining the processing by the object detection unit 24a and the object
recognition unit 24c included in the image / voice processing apparatus 1 according to an
embodiment of the present invention.
[0070]
As shown in FIG. 5, the subject A and the subject B appear on the screen captured by the camera 21. The object recognition unit 24c therefore extracts "human" as the type of the subject A and the subject B, and the object detection unit 24a further detects the skin color, pupil color, contour, hair color, and costume as feature information of the subject A and the subject B.
[0071]
Then, based on the human classification information stored in the human classification information storage unit 31, the object recognition unit 24c extracts the human classification corresponding to the detected skin color, pupil color, contour, hair color, and costume.
[0072]
Next, the object position detection unit 24b calculates the distance from the image and sound
processing device 1 to the object and the direction of the object with respect to the image and
sound processing device 1 based on the image data whose shake is corrected (step S105).
For example, the object position detection unit 24b calculates the distance from the image/sound processing device 1 to the subject in the image data and the direction of the subject relative to the image/voice processing apparatus 1 based on the angle of view determined by the zoom magnification set for the zoom lens 21a of the camera 21 and the focus information for the subject.
[0073]
FIG. 6 is a diagram for explaining the process of calculating the direction of the subject by the
object position detection unit 24b included in the image sound processing apparatus 1 according
to an embodiment of the present invention.
[0074]
As shown in FIG. 6, the subject A and the subject B appear on the screen captured by the camera 21.
Assuming that the angle of view of the camera 21 is ±θc, when the image/sound processing apparatus 1 is viewed from above on the x-y plane, the object position detection unit 24b determines that the subject A detected by the object detection unit 24a is present on a straight line 201 in the +θ3 direction.
[0075]
Then, the object position detection unit 24b calculates the distance from the image / sound
processing device 1 to the subject based on the image data whose shake has been corrected.
[0076]
FIG. 7 is a diagram for explaining the process of calculating the distance of the subject by the
object position detection unit 24b included in the image sound processing apparatus 1 according
to an embodiment of the present invention.
[0077]
When the subject A or B is within the focus range of the camera 21, the object position detection unit 24b calculates the distance from the focus information.
[0078]
As shown in FIG. 7, when the subject A is within the focus range, the object position detection unit 24b calculates the distance d1 between the camera 21 and the subject A from the focus information.
[0079]
When the subject A or B is out of the focus range of the camera 21, the object position detection unit 24b extracts, based on the image reference feature information stored in the image reference feature information storage unit 32, the reference dimension of the subject corresponding to the feature information of the subject image in the image data, and calculates the distance from the camera 21 to the subject of the image data based on the extracted reference dimension of the subject and the angle of view of the camera 21.
[0080]
For example, when the subject B shown in FIG. 7 is out of the focus range, the object position detection unit 24b extracts, from the image reference feature information stored in the image reference feature information storage unit 32, the reference dimension L2 corresponding to the subject type specified in step S104.
[0081]
Then, assuming that the height of the screen shown in FIG. 5 is Hc, the vertical length of the face of the subject B is H2, and the angle of view is θc, the object position detection unit 24b calculates the angle θ2 using Equation 1 below.
[0082]
θ2 = θc × H2 / Hc (Equation 1)
Next, the object position detection unit 24b calculates the distance d2 from the extracted reference dimension L2 and the calculated angle θ2, using Equation 2 below.
[0083]
d2 = L2 / tan θ2 (Equation 2)
Thus, the object position detection unit 24b can calculate the distance from the image and sound processing apparatus 1 to the subject and the direction of the subject with respect to the image and sound processing apparatus 1 based on the shake-corrected image data.
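The following snippet reproduces Equations 1 and 2 numerically; the sample values for the screen height Hc, the face height H2, the angle of view θc, and the reference dimension L2 are made-up figures used only to show the arithmetic.

```python
import math

def distance_from_reference(Hc_px, H2_px, theta_c_deg, L2_m):
    """Estimate the subject distance when it is out of the focus range.
    Equation 1: theta2 = theta_c * H2 / Hc
    Equation 2: d2 = L2 / tan(theta2)"""
    theta2_deg = theta_c_deg * H2_px / Hc_px            # Equation 1
    return L2_m / math.tan(math.radians(theta2_deg))    # Equation 2

# Hypothetical values: 1080-px-high screen, the face spans 60 px,
# 30-degree vertical angle of view, 0.25 m reference face height.
d2 = distance_from_reference(Hc_px=1080, H2_px=60, theta_c_deg=30.0, L2_m=0.25)
```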
[0084]
Next, when audio data are supplied from the first microphone 11a, the second microphone 11b, and the third microphone 11c (step S106), the audio position detection unit 12 corrects shake based on the movement of the image/sound processing device 1 detected by the motion sensor 23 and the direction of the image/sound processing device 1 detected by the direction sensor 42 (step S107).
[0085]
As shown in FIG. 4, next, the audio position detection unit 12 calculates the distance from the
image audio processing device 1 to the sound source and the direction of the sound source with
respect to the image audio processing device 1 based on the corrected audio data ( Step S108).
[0086]
FIG. 8 is a diagram for explaining calculation processing of the direction and distance of the
sound source by the audio position detection unit 12 included in the image / voice processing
apparatus 1 according to an embodiment of the present invention.
[0087]
As shown in FIG. 8, since the first microphone 11a, the second microphone 11b, and the third microphone 11c are disposed a predetermined distance apart, the voice uttered by the sound source A reaches each microphone with a different delay time.
[0088]
Specifically, as shown in FIG. 8, assuming that the time from when the sound source A emits a sound until it reaches the first microphone 11a is t0, the time until it reaches the second microphone 11b is (t0 + t1), and the time until it reaches the third microphone 11c is (t0 + t2).
[0089]
Therefore, the voice position detection unit 12 calculates the delay times t1 and t2 of the voices input to the microphones by comparing the phases of the voices input to the first microphone 11a, the second microphone 11b, and the third microphone 11c, and, based on the calculated delay times t1 and t2, calculates the distance from the image and sound processing device 1 to the sound source and the direction of the sound source with respect to the image and sound processing device 1.
[0090]
FIG. 9 shows an example of phase comparison of the audio waveforms input to the first microphone 11a, the second microphone 11b, and the third microphone 11c included in the image/sound processing apparatus 1 according to an embodiment of the present invention.
[0091]
As shown in FIG. 9, since the sound emitted from the sound source A and reaching the first microphone 11a has a peak at time T10, the audio position detection unit 12 sets this peak time T10 as a reference.
Then, the audio position detection unit 12 sets as the delay time t1 the time from T10 to the time T11 at which a similar peak appears in the audio waveform that reached the second microphone 11b.
Further, the audio position detection unit 12 sets as the delay time t2 the time from T10 to the time T12 at which a similar peak appears in the audio waveform that reached the third microphone 11c.
[0092]
Then, based on the calculated delay times t1 and t2, the sound position detection unit 12 calculates the distance from the image sound processing device 1 to the sound source and the direction of the sound source with respect to the image sound processing device 1.
Specifically, assuming that the sound velocity is v, the distance from the sound source A to the first microphone 11a is v·t0, the distance from the sound source A to the second microphone 11b is v·(t0 + t1), and the distance from the sound source A to the third microphone 11c is v·(t0 + t2).
The voice position detection unit 12 then finds the point separated from the first microphone 11a, the second microphone 11b, and the third microphone 11c by v·t0, v·(t0 + t1), and v·(t0 + t2), respectively. That is, when circles centered on the first microphone 11a, the second microphone 11b, and the third microphone 11c with radii v·t0, v·(t0 + t1), and v·(t0 + t2) are drawn, the point where they overlap is determined as the position of the sound source A.
[0093]
Thereby, the audio position detection unit 12 can calculate the distance from the image audio
processing device 1 to the sound source and the direction of the sound source with respect to the
image audio processing device 1 based on the corrected audio data.
[0094]
Note that, for example, when the sound source A and the sound source B emit sound simultaneously, the sound position detection unit 12 calculates the distance from the image/sound processing device 1 to each sound source and the direction of each sound source with respect to the image/sound processing device 1 using, for example, the technology described in Japanese Patent Laid-Open No. 2006-227328.
Specifically, the audio position detection unit 12 determines whether each band-division signal obtained by band division is a signal in which a plurality of sound sources overlap or a signal originating from only one sound source, and calculates the sound source direction using only the frequency components in which the sound sources do not overlap.
[0095]
Next, the associating unit 40a of the CPU 40 determines whether the sound source can be associated with the subject, based on the distance from the image/voice processing device 1 to the subject and the direction of the subject with respect to the image/voice processing device 1 calculated in step S105, and the distance from the image/voice processing device 1 to the sound source and the direction of the sound source with respect to the image/voice processing device 1 calculated in step S108 (step S109).
[0096]
For example, if there is an overlap between a predetermined peripheral range of the position specified by the distance from the image and sound processing device 1 to the subject and the direction of the subject with respect to the image and sound processing device 1 calculated in step S105, and a predetermined peripheral range of the position specified by the distance from the image/voice processing device 1 to the sound source and the direction of the sound source with respect to the image/voice processing device 1 calculated in step S108, the associating unit 40a determines that the subject and the sound source can be associated as the same object.
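One way to picture this overlap test, as a hedged sketch: convert each (distance, direction) pair into an x-y position, surround each position with a fixed tolerance radius, and associate the subject and the sound source when the two circles intersect. The tolerance value and the coordinate convention are assumptions, not values given in the patent.

```python
import math

def to_xy(distance, direction_deg):
    """Position relative to the apparatus; direction is measured from its optical axis."""
    rad = math.radians(direction_deg)
    return distance * math.sin(rad), distance * math.cos(rad)

def can_associate(subject, source, tolerance=0.5):
    """subject, source: (distance_m, direction_deg).
    True when the peripheral ranges (circles of radius `tolerance`) overlap."""
    sx, sy = to_xy(*subject)
    ax, ay = to_xy(*source)
    return math.hypot(sx - ax, sy - ay) <= 2 * tolerance

# Hypothetical measurements: subject at 3.0 m, +10 deg; source at 3.2 m, +12 deg.
same_object = can_associate((3.0, 10.0), (3.2, 12.0))  # -> True
```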
[0097]
If it is determined in step S109 that the sound source can be associated with the subject, the associating unit 40a associates the distance from the image/voice processing device 1 to the subject and the direction of the subject with respect to the image/voice processing device 1 calculated in step S105 with the distance from the image/sound processing device 1 to the sound source and the direction of the sound source with respect to the image/sound processing device 1 calculated in step S108 (step S110).
[0098]
Next, the tracking control unit 40b of the CPU 40 divides the image displayed on the display unit 45 into a plurality of blocks based on the image data, and tracks the movement of the subject by detecting the movement of each block (step S111).
[0099]
Specifically, the tracking control unit 40b divides the screen displayed based on the image data into a plurality of blocks, and detects whether the subject has moved based on the motion vector of each block detected by the motion vector detection unit 25.
The motion vector may be detected from either the luminance signal or the color signal.
[0100]
Further, even when there is no moving object in the screen, the tracking control unit 40b constantly recognizes the entire image on the screen and estimates the subject from its contour, color, and the like.
Image recognition is performed on the subject based on its feature information, and the subject is compared with the subject detected so far.
If the difference between the feature information of the current subject and that of the subject detected so far is smaller than a predetermined value, it is determined that they are the same subject.
Thereby, the tracking control unit 40b can track the subject within the screen.
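The block-based tracking can be pictured roughly as follows: the frame is split into blocks, a motion vector is taken per block, and the tracked subject's position is updated from the vector of the block it lies in, with a feature-comparison fallback when nothing moves. The block size and the similarity threshold are assumptions for illustration only.

```python
import numpy as np

def update_subject_position(position, block_vectors, block_size=16):
    """position: (x, y) of the tracked subject in pixels.
    block_vectors: array of shape (rows, cols, 2) holding one motion vector
    per block. Moves the subject by the vector of the block it lies in."""
    col = int(position[0]) // block_size
    row = int(position[1]) // block_size
    dx, dy = block_vectors[row, col]
    return position[0] + dx, position[1] + dy

def same_subject(features_now, features_prev, threshold=0.2):
    """Fallback when nothing moves: compare feature vectors (contour, colour,
    etc.) and treat the subject as identical if they are sufficiently close."""
    diff = np.linalg.norm(np.asarray(features_now) - np.asarray(features_prev))
    return diff < threshold
```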
[0101]
Then, according to an instruction of the directivity adjustment control unit 40c of the CPU 40, the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b adjusts the directivity characteristic by superimposing the voice data generated by the first microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate the time differences between the voices that have reached those microphones (step S112).
The directivity characteristic adjustment process will be described later.
[0102]
Next, when audio data are supplied from the first microphone 11a, the second microphone 11b, and the third microphone 11c (step S113), the audio position detection unit 12 corrects shake based on the movement of the image/sound processing device 1 detected by the motion sensor 23 (step S114).
[0103]
Next, the first voice detection unit 14a or the second voice detection unit 14b detects feature information of the shake-corrected voice supplied from the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b (step S115).
For example, the first voice detection unit 14a extracts volume and tone information as voice feature information from the shake-corrected voice data.
[0104]
Then, the first voice recognition unit 15a or the second voice recognition unit 15b converts the voice into a character string based on the voice data generated by the microphone array 11 whose directivity characteristic has been adjusted by the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b, respectively (step S116).
Specifically, the first voice recognition unit 15a or the second voice recognition unit 15b converts the voice into a character string based on the voice data generated by the microphone array 11, using the type of subject specified in step S104 and the dictionary data for each type of subject stored in the dictionary storage unit 16.
For example, when the type of subject specified in step S104 is "dog", the voice is converted into a character string based on the voice data generated by the microphone array 11 using the dictionary data for dogs stored in the dictionary storage unit 16.
As described above, since the speech is converted into a character string based on the dictionary data stored for each type of subject, the conversion into a character string can be performed with higher accuracy.
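A hedged sketch of this dictionary switching is shown below: the dictionary matching the recognized subject type is looked up and handed to the recognizer. The dictionary contents and the `recognize` callable are placeholders, not an actual speech-recognition API or the patent's dictionary data.

```python
# Hypothetical per-type dictionaries; real dictionary data would be far richer.
DICTIONARIES = {
    "human": {"vocabulary": ["hello", "thank you", "goodbye"]},
    "dog":   {"vocabulary": ["bark", "whine", "growl"]},
    "cat":   {"vocabulary": ["meow", "purr"]},
}

def to_text(voice_data, subject_type, recognize):
    """Pick the dictionary for the specified subject type and convert the
    directivity-adjusted voice data into a character string with it."""
    dictionary = DICTIONARIES.get(subject_type, DICTIONARIES["human"])
    return recognize(voice_data, dictionary)
```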
[0105]
Next, the first translation unit 17a or the second translation unit 17b translates the character string converted by the first speech recognition unit 15a or the second speech recognition unit 15b from the language corresponding to the human classification extracted by the object recognition unit 24c into the native language set in advance based on the operation of the operation unit 41 (step S117).
Specifically, when the human classification extracted by the object recognition unit 24c is "yellow race", the first translation unit 17a or the second translation unit 17b causes the display unit 45 to display a list of language candidates used in the Asian region, such as Japanese, Chinese, and Korean.
Then, when a selection signal for selecting one language from the displayed language candidates is supplied from the operation unit 41 by the user's selection operation, the first translation unit 17a or the second translation unit 17b translates the character string converted by the first speech recognition unit 15a or the second speech recognition unit 15b from the selected language into the native language set in advance based on the operation of the operation unit 41.
[0106]
At this time, when the user does not perform a language selection operation, the first translation unit 17a or the second translation unit 17b estimates the most suitable language from the input speech and performs the translation into the native language accordingly.
[0107]
Next, the first translation unit 17a or the second translation unit 17b divides the character string
translated in step S117 into clauses (step S118).
[0108]
Then, the output control unit 40d determines whether the subject image tracked by the tracking
control unit 40b in step S111 is within the range of the screen (step S119).
[0109]
When it is determined in step S119 that the subject image is out of the range of the screen (NO), the output control unit 40d displays the character string at the edge of the screen, for each clause divided in step S118, based on the direction of the sound source with respect to the image/sound processing apparatus 1 detected by the sound position detection unit 12 (step S120).
[0110]
FIG. 10 shows an example of the screen when the output control unit 40d included in the image /
voice processing apparatus 1 according to the embodiment of the present invention displays a
character string on the screen end.
[0111]
As shown in FIG. 10, when it is determined that the subject image is out of the range of the screen, the output control unit 40d displays a character string 402, for each clause divided in step S118, at the edge of the screen 401 along whichever of the four sides of the screen is closest to the direction of the sound source with respect to the image/sound processing apparatus 1 detected by the sound position detection unit 12.
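A sketch of picking the screen edge nearest to the sound source direction is given below; the mapping from direction to edge (left/right chosen by azimuth, top/bottom by elevation) is an assumed simplification of what FIG. 10 depicts, not the patent's stated rule.

```python
def nearest_edge(azimuth_deg, elevation_deg):
    """Choose which of the four screen sides is closest to an off-screen
    sound source, given its direction relative to the camera axis."""
    if abs(azimuth_deg) >= abs(elevation_deg):
        return "right" if azimuth_deg > 0 else "left"
    return "top" if elevation_deg > 0 else "bottom"

# Example: a source well to the left of the field of view.
edge = nearest_edge(azimuth_deg=-40.0, elevation_deg=5.0)  # -> "left"
```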
[0112]
On the other hand, when it is determined in step S119 that the subject image is within the range of the screen (YES), the output control unit 40d calculates the inclination of the head of the subject image tracked in step S111 (step S121).
[0113]
Next, the output control unit 40d causes the character combining unit 27 or the recording unit 43 to generate output data for superimposing and displaying the character string, for each clause divided in step S118, according to the subject; the character combining unit 27 displays a screen on the display unit 45 based on the output data, or the recording unit 43 records the output data (step S122).
[0114]
FIG. 11 shows examples of screens displayed on the display unit 45 based on the output data by the character synthesis unit 27 included in the image/voice processing apparatus 1 according to an embodiment of the present invention.
(a) and (b) show examples of screens displaying a character string when the subject image in the screen is relatively large, and (c) and (d) show examples of screens displaying a character string when the subject image in the screen is relatively small.
[0115]
As shown in FIG. 11A, for example, when the character string is relatively long, the output control unit 40d breaks the character string 403 onto a new line so as not to exceed a predetermined number of characters determined based on the width L3 of the subject image A2 in the screen.
Similarly, in the case shown in FIG. 11C, the output control unit 40d breaks the character string 403 onto a new line so as not to exceed the predetermined number of characters determined based on the horizontal widths L3 and L4 of the subject images A2 and A3 in the screen.
[0116]
Further, as shown in FIG. 11B, for example, when the character string is relatively short, the output control unit 40d displays the character string 404 with the largest font size at which the string does not exceed the width of the subject image, based on the width L3 of the subject image A2 in the screen.
Also in the case shown in FIG. 11D, the output control unit 40d displays the character string 404 with the largest font size that does not exceed the lateral width L4 of the subject image, based on the lateral width L4 of the subject image A3 in the screen.
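The wrapping and font-sizing behavior of FIGS. 11A to 11D can be approximated as below; the character-width model (a fixed pixel width per character at a given font size) is an assumption made only to keep the example short, not a rule stated in the patent.

```python
def layout_string(text, subject_width_px, base_font_px=24, px_per_char_ratio=0.6):
    """Long strings: wrap so no line exceeds the character budget derived
    from the subject width. Short strings: enlarge the font as far as the
    subject width allows. Returns (lines, font_size_px)."""
    max_chars = max(1, int(subject_width_px / (base_font_px * px_per_char_ratio)))
    if len(text) > max_chars:
        lines = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        return lines, base_font_px
    # Largest font at which the whole string still fits within the subject width.
    font = int(subject_width_px / (len(text) * px_per_char_ratio))
    return [text], min(font, 4 * base_font_px)
```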
[0117]
Further, the output control unit 40d displays a character string for each clause divided in step
S118 according to the direction of the subject image.
[0118]
FIG. 12 shows an example of a screen when the output control unit 40d included in the image /
voice processing apparatus 1 according to an embodiment of the present invention displays a
character string on the screen.
(a) shows an example of a screen displaying a character string when the subject image in the screen faces the front, (b) shows an example when the subject image faces backward, (c) shows an example when the subject image faces sideways, and (d) shows an example when the subject image faces obliquely downward toward the screen.
[0119]
As shown in FIG. 12A, for example, when the object recognition unit 24c determines that the subject image A4 in the screen is facing the front, the output control unit 40d displays the character string 405 below the subject image A4 in the screen.
[0120]
As shown in FIG. 12B, for example, when the object recognition unit 24c determines that the subject image A5 in the screen is facing backward, the output control unit 40d displays the character string 405 so as to overlap the subject image A5 in the screen.
[0121]
As shown in FIG. 12C, for example, when the object recognition unit 24c determines that the subject image A6 in the screen is facing sideways, the output control unit 40d displays the character string 405 at a position near the mouth of the subject image A6 in the screen.
[0122]
As shown in FIG. 12D, for example, when the object recognition unit 24c determines that the subject image A7 in the screen faces obliquely downward, the output control unit 40d displays the character string 405 at a position near the mouth of the subject image A7 in the screen, tilted in accordance with the inclination of the head of the subject image calculated in step S121.
[0123]
FIGS. 13 (a) and 13 (b) are diagrams for explaining the oblique display of a character string by
the output control unit 40d included in the image sound processing apparatus 1 according to an
embodiment of the present invention.
[0124]
As shown in FIG. 13A, the object recognition unit 24c performs face detection based on the
subject image A7 tracked by the tracking control unit 40b, thereby determining the face
detection frame 501 and the mouth position detection frame 502.
[0125]
Then, the output control unit 40d calculates the angle of the face detection frame 501 in which
the face is detected, as the rotation angle r of the tilt of the head of the subject image A7.
[0126]
As shown in FIG. 13B, the output control unit 40d rotates the character string 503 by the
rotation angle r in the direction in which the face detection frame 501 is inclined to obtain the
character string 503A.
Then, the output control unit 40d superimposes the character string 503A rotated by the
rotation angle r on a position near the mouth of the subject image A7.
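The tilt applied in FIG. 13B amounts to rotating the rendered string by the angle r of the face detection frame about a point near the mouth. The sketch below rotates the corner points of the string's bounding box; the anchor point and the screen-coordinate convention are assumptions for illustration.

```python
import math

def rotate_about(point, anchor, angle_deg):
    """Rotate a 2-D point about an anchor by angle_deg (screen coordinates)."""
    r = math.radians(angle_deg)
    dx, dy = point[0] - anchor[0], point[1] - anchor[1]
    return (anchor[0] + dx * math.cos(r) - dy * math.sin(r),
            anchor[1] + dx * math.sin(r) + dy * math.cos(r))

def place_tilted_string(bbox_corners, mouth_pos, head_angle_deg):
    """Tilt the character string's bounding box by the head angle r and
    anchor it at the position near the mouth."""
    return [rotate_about(p, mouth_pos, head_angle_deg) for p in bbox_corners]
```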
[0127]
Further, the output control unit 40d may determine the color and/or font of the character string converted by the first translation unit 17a or the second translation unit 17b based on the type of subject identified by the object recognition unit 24c in step S104, and generate output data for displaying the converted character string on the screen in the determined color and/or font.
[0128]
Next, the output control unit 40d determines whether the display of the character string divided into clauses is complete (step S123). When it is determined that the display of the character string has ended, the CPU 40 determines whether an operation signal requesting the end of shooting has been supplied from the operation unit 41 (step S124), and when it is determined that such an operation signal has been supplied (YES), the processing ends.
[0129]
<Directional Characteristic Adjustment Process> Next, directional characteristic adjustment
processing in the image / sound processing apparatus 1 according to an embodiment of the
present invention will be described.
[0130]
FIG. 14 is a flow chart showing the processing flow of the directivity characteristic adjustment processing in the image/voice processing apparatus 1 according to an embodiment of the present invention.
[0131]
As shown in FIG. 14, the directivity adjustment control unit 40c of the CPU 40 determines whether at least one of the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b is available (step S201).
Specifically, the CPU 40 determines whether there is a first directivity characteristic adjustment unit 13a or second directivity characteristic adjustment unit 13b that is not currently performing directivity characteristic adjustment.
[0132]
When it is determined in step S201 that neither can be used, that is, both the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b are performing directivity characteristic adjustment (NO), the directivity adjustment control unit 40c extracts the directivity characteristic priorities stored in the directivity characteristic priority storage unit 34 (step S202).
Specifically, the directivity adjustment control unit 40c extracts from the directivity characteristic priority storage unit 34 the directivity characteristic priorities corresponding to the type of the subject whose motion is being tracked in step S111 and the types of the subjects whose directivity characteristics are being adjusted by the first directivity characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b.
[0133]
Next, the directivity adjustment control unit 40c determines whether the directivity characteristic priority of the subject whose motion is being tracked in step S113 is higher than the directivity characteristic priority of the subject whose directivity characteristics are being adjusted by the first directivity characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b (step S203).
[0134]
When it is determined in step S203 that the directivity characteristic priority of the subject
whose movement is being tracked in step S113 is higher than the directivity characteristic
priority of the subject whose directivity characteristics are being adjusted by the first directivity
characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b
(YES), the first directivity characteristic adjustment unit 13a or the second directivity
characteristic adjustment unit 13b performs directivity characteristic adjustment based on an
instruction from the directivity adjustment control unit 40c (step S204).
Specifically, based on the tracking result of the tracking control unit 40b, the first directivity
characteristic adjustment unit 13a or the second directivity characteristic adjustment unit 13b
adjusts the directivity characteristics by superimposing the voice data generated by the first
microphone 11a, the second microphone 11b, and the third microphone 11c so as to eliminate
the time differences between the voices reaching the first microphone 11a, the second
microphone 11b, and the third microphone 11c.
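The superposition in step S204 amounts to delay-and-sum beamforming: each microphone signal is shifted by the arrival delay estimated from the tracked direction and distance before the channels are summed. The sketch below assumes three single-channel NumPy arrays and precomputed per-microphone delays in samples; the actual adjustment units operate on the live data from the microphone array 11.

```python
import numpy as np

def delay_and_sum(signals, delays_in_samples):
    """Superimpose microphone signals after removing their mutual time differences."""
    length = min(len(s) for s in signals)
    output = np.zeros(length)
    for signal, delay in zip(signals, delays_in_samples):
        # Shift each channel forward by its estimated arrival delay so the target
        # voice lines up across channels (wrap-around at the buffer edge is
        # ignored in this sketch).
        output += np.roll(np.asarray(signal[:length], dtype=float), -delay)
    # Voices arriving from the tracked direction add coherently; off-axis noise
    # stays misaligned and is relatively attenuated in the average.
    return output / len(signals)
```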
[0135]
As described above, according to the image / sound processing apparatus 1 which is an
embodiment of the present invention, the subject and the sound source are associated with each
other as the same object based on the distance and direction of the subject and the distance and
direction of the sound source, the tracking control unit 40b tracks the associated subject, and the
first directivity characteristic adjustment unit 13a and the second directivity characteristic
adjustment unit 13b adjust the directivity characteristic of the microphone array 11 based on the
tracking result of the tracking control unit 40b and the distance and direction of the subject or
the sound source. Therefore, even when the sound source goes out of the angle of view of the
camera 21 or when the sound source generates voice only intermittently, the noise is attenuated
by adjusting the directivity characteristic of the microphone array 11 without the voice position
detection unit 12 and the object position detection unit 24b having to recalculate the position of
the object, and a good voice can be obtained.
[0136]
Further, according to the image / sound processing apparatus 1 as one embodiment of the
present invention, the character string converted by the first voice recognition unit 15a or the
second voice recognition unit 15b is displayed on the screen according to the subject image.
Therefore, characters can be displayed appropriately according to the person who made the
voice.
[0137]
Although the image / sound processing apparatus 1 according to an embodiment of the present
invention is provided with two directivity characteristic adjustment units (the first directivity
characteristic adjustment unit 13a and the second directivity characteristic adjustment unit 13b)
and two voice detection units (the first voice detection unit 14a and the second voice detection
unit 14b), the present invention is not limited to this, and a plurality of directivity characteristic
adjustment units and a plurality of voice detection units may be provided.
[0138]
FIG. 1 is a block diagram showing the configuration of an image / sound processing apparatus
according to an embodiment of the present invention.
It is a diagram showing an example of the human classification information stored in the human
classification information storage unit 31 provided in the image / sound processing apparatus
according to an embodiment of the present invention.
It is a diagram showing an example of the image reference feature information stored in the
image reference feature information storage unit 32 provided in the image / sound processing
apparatus according to an embodiment of the present invention.
It is a flowchart showing the processing flow of the image / sound processing apparatus
according to an embodiment of the present invention.
It is a diagram explaining the processing by the object detection unit and the object recognition
unit provided in the image / sound processing apparatus according to an embodiment of the
present invention.
It is a diagram explaining the calculation processing of the direction of the subject by the object
position detection unit provided in the image / sound processing apparatus according to an
embodiment of the present invention.
It is a diagram explaining the calculation processing of the distance of the subject by the object
position detection unit provided in the image / sound processing apparatus according to an
embodiment of the present invention.
It is a diagram explaining the calculation processing of the direction and distance of the sound
source by the voice position detection unit provided in the image / sound processing apparatus
according to an embodiment of the present invention.
It is a diagram showing an example of the phase comparison of the voice waveforms input to the
first microphone, the second microphone, and the third microphone provided in the image /
sound processing apparatus according to an embodiment of the present invention.
It is a diagram showing an example of the screen in the case where the output control unit
provided in the image / sound processing apparatus according to an embodiment of the present
invention displays a character string at the edge of the screen.
It is a diagram showing an example of the screen displayed on the display unit based on output
data by the character synthesis unit provided in the image / sound processing apparatus
according to an embodiment of the present invention, where (a) and (b) show examples of the
screen displaying a character string when the subject image in the screen is relatively large, and
(c) and (d) show examples of the screen displaying a character string when the subject image in
the screen is relatively small.
It is a diagram showing an example of the screen in the case where the output control unit
provided in the image / sound processing apparatus according to an embodiment of the present
invention displays a character string on the screen, where (a) shows an example of the screen
displaying a character string when the subject image in the screen faces the front, (b) shows an
example of the screen displaying a character string when the subject image in the screen faces
the back, (c) shows an example of the screen displaying a character string when the subject
image in the screen faces the horizontal direction, and (d) shows an example of the screen
displaying a character string when the subject image in the screen faces diagonally downward.
It is a diagram explaining the slanted display of a character string by the output control unit
provided in the image / sound processing apparatus according to an embodiment of the present
invention.
It is a flowchart showing the processing flow of the directivity characteristic adjustment
processing in the image / sound processing apparatus according to an embodiment of the
present invention.
Explanation of Reference Numerals
[0139]
DESCRIPTION OF SYMBOLS 1: image / sound processing apparatus, 11: microphone array, 11a:
first microphone, 11b: second microphone, 11c: third microphone, 12: voice position detection
unit, 13a: first directivity characteristic adjustment unit, 13b: second directivity characteristic
adjustment unit, 14a: first voice detection unit, 14b: second voice detection unit, 15a: first voice
recognition unit, 15b: second voice recognition unit, 16: dictionary storage unit, 17a: first
translation unit, 17b: second translation unit, 18: voice compression unit, 19: recording voice
generation unit, 21: camera, 22: camera processing unit, 23: motion sensor, 24: detection unit,
24a: object detection unit, 24b: object position detection unit, 24c: object recognition unit, 25:
motion vector detection unit, 26: video compression unit, 27: character synthesis unit, 31: human
classification information storage unit, 32: image reference feature information storage unit, 33:
voice reference feature information storage unit, 34: directivity characteristic priority storage
unit, 40: CPU, 40a: association unit, 40b: tracking control unit, 40c: directivity adjustment control
unit, 40d: output control unit, 41: operation unit, 42: directional sensor, 43: recording unit, 44:
audio output unit, 45: display
10-04-2019
34