Patent Translate Powered by EPO and Google

Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2012039275

11-04-2019

The present invention provides a technique for estimating reflected sound information from a collected sound signal. SOLUTION: Template information, a set of templates (TP) each representing the transfer characteristic for each frequency between the p-th (1 ≦ p ≦ P) position and the positions of M microphones, is prepared in advance. Using the observation signal and the template information, the p-th complex amplitude is determined so that the power of the residual signal, obtained by subtracting from the observation signal the p-th reflected sound represented by the p-th TP multiplied by the p-th complex amplitude, is minimized. This residual power is determined for each p, and the TP giving the minimum among them is selected. In the vicinity of the direction D determined by the position corresponding to the selected TP, the direction of arrival of the reflected sound is estimated by correcting the direction D so that the power of the residual signal, obtained by subtracting from the observation signal the transfer characteristic function multiplied by the complex amplitude, is minimized. [Selected figure] Figure 4

Reflected sound information estimation device, reflected sound information estimation method, program

[0001] The present invention relates to a technique for estimating information about reflected sound (arrival amplitude, arrival direction) from a collected sound signal obtained by picking up a sound signal with microphones.

[0002] A system for exchanging voice information, such as a telephone or a voice conference system, is generally called a voice communication system.
In a voice communication system, obtaining information about reflected sound (arrival amplitude, arrival direction, etc.) is very important. In a reverberant environment such as a conference room, the signals collected through the microphones contain not only the direct sound coming straight from a sound source such as a speaker, but also reflected sound returned from the floor, walls, or ceiling. If a speaker's remarks are recorded in such a reverberant environment, they become difficult to hear because delayed reflected sound is mixed with the direct sound. If the arrival information of each reflected sound can be estimated from the collected signal and the reflected sound removed, the voice can be restored to an easily audible state. Non-Patent Document 1 can be cited as a conventional study on estimating reflected sound information.

[0003] A functional configuration for realizing the technology disclosed in Non-Patent Document 1 is shown in FIG. 1. The procedure in this technology is as follows.

[0004] 1. The sound source signal emitted from the impulse sound source 100 is collected using the 4-channel microphones 110-1, 110-2, 110-3, and 110-4. The AD conversion unit 120 converts the collected analog signals into a digital signal x<→>(t) = [x1(t), x2(t), x3(t), x4(t)]<T>. Here, [·]<T> represents transposition, and t is an index of discrete time. It is assumed that the four microphones are disposed at the vertices of a regular tetrahedron.

[0005] 2. The impulse response calculation unit 130 receives the digital signal x<→>(t) = [x1(t), x2(t), x3(t), x4(t)]<T> and calculates an impulse response for each microphone, h<→>(t) = [h1(t), h2(t), h3(t), h4(t)]<T>. The impulse response may be calculated using the TSP method, the M-sequence method, or any other method.
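The impulse-response measurement in step 2 can be sketched as frequency-domain inverse filtering, which is the idea underlying TSP- and M-sequence-style measurement. The document does not fix a particular algorithm, so the function name and the synthetic test signal below are illustrative assumptions.

```python
import numpy as np

def estimate_impulse_response(excitation, recording):
    """Impulse response by frequency-domain inverse filtering: a generic
    sketch of the measurement mentioned in [0005], not the patent's own
    algorithm.  Both inputs are 1-D arrays of equal length; the excitation
    spectrum must have no zero bins (true for TSP and maximum-length
    sequences)."""
    S = np.fft.rfft(excitation)
    Y = np.fft.rfft(recording)
    return np.fft.irfft(Y / S, n=len(excitation))

# Toy check with a made-up excitation and a known two-tap response.
rng = np.random.default_rng(1)
s = rng.standard_normal(256)
h_true = np.zeros(256); h_true[3] = 1.0; h_true[40] = 0.5
y = np.fft.irfft(np.fft.rfft(s) * np.fft.rfft(h_true), n=256)  # circular conv
h_hat = estimate_impulse_response(s, y)
print(np.allclose(h_hat, h_true))
```

With a real recording, windowing and noise averaging would be needed on top of this sketch.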
[0006] 3. The virtual sound source calculation unit 140 receives the 4-channel impulse response h<→>(t) = [h1(t), h2(t), h3(t), h4(t)]<T> and outputs virtual sound source information v<→> = [v<→>1, ..., v<→>D]<T>. D represents the number of virtual sound sources. A virtual sound source is a sound source assumed to exist virtually in order to represent the arrival amplitude, arrival direction, and arrival time of each reflected sound. The virtual sound source is explained with reference to FIG. 2, which depicts a path along which a sound source signal reflected by the right wall is received by a microphone. The sound source signal (reflected sound) reflected from the right wall is equivalent to a signal coming directly from the position labeled "virtual sound source" (subject, however, to attenuation by the reflection at the wall and to distance attenuation).

[0007] The details of this prior art are as follows. When the impulse response is measured at four closely spaced sound receiving points (microphone positions), slight differences occur in the arrival times of each reflected sound. As shown in FIG. 3, the arrival times t1n, t2n, t3n, t4n (1 ≦ n ≦ D) of the n-th reflected wave at the respective sound receiving points are associated with one another by correlation. Letting d be the side length of the regular tetrahedron microphone array and c the speed of sound, each item of virtual sound source information vn<→> = [Xn, Yn, Zn, Sn]<T> can be obtained. Here, Xn, Yn, and Zn represent the position of the n-th virtual sound source (see Equations (1) to (3)), which carries information corresponding to the arrival direction and arrival time of each reflected sound. Sn represents the strength of the n-th virtual sound source, obtained by averaging the amplitudes of the n-th reflected sound associated across the four channels.
[0008] Yoshio Yamazaki et al., "Spatial information of a hall determined by impulse responses at four adjacent points," Proceedings of the Acoustical Society of Japan, 1981, pp. 759-760.

[0009] In the prior art, in order to estimate the "arrival amplitude," "arrival direction," and "arrival time" of reflected sound, called virtual sound source information, an impulse response must be prepared in advance. However, preparing an impulse response requires observation using a special signal, so the condition that impulse responses at every position be prepared beforehand is not realistic.

[0010] Therefore, an object of the present invention is to provide a technique for estimating reflected sound information (the "arrival direction" and "arrival amplitude" of reflected sound) from a collected signal without using a special signal.

[0011] Let P be a predetermined integer of 2 or more and p an integer with 1 ≦ p ≦ P. Template information, a set of templates each representing the transfer characteristic for each frequency between the p-th position and the positions where M microphones are arranged, is prepared in advance. Using the template information together with the observation signals, obtained by converting into the frequency domain the M signals picked up by the M microphones, (1) the p-th complex amplitude is determined so that the power of the residual signal, obtained by subtracting from the observation signal the p-th reflected sound represented by the p-th template multiplied by the p-th complex amplitude, is minimized; this residual power is determined for each p, and the template giving the minimum power among them is selected.
(2) In the vicinity of the direction D determined by the position corresponding to the selected template, the direction of arrival of the reflected sound is estimated by correcting the direction D so that the power of the residual signal E, obtained by subtracting from the observation signal the product of the complex amplitude and a function simulating the transfer characteristic for each frequency between an arbitrary position in space and each microphone (the transfer characteristic function), is minimized.

[0012] Let P be a predetermined integer of 2 or more and p an integer with 1 ≦ p ≦ P. Template information, a set of templates each representing the transfer characteristic for each frequency between the p-th position and the positions where M microphones are arranged, is prepared in advance. Using the template information together with the observation signals, obtained by converting into the frequency domain the M signals picked up by the M microphones, (1) the p-th complex amplitude is determined so that the power of the residual signal, obtained by subtracting from the observation signal the p-th reflected sound represented by the p-th template multiplied by the p-th complex amplitude, is minimized; this residual power is determined for each p, and the template giving the minimum power among them is selected.
(2) In the vicinity of the direction D determined by the position corresponding to the selected template, the direction of arrival of the reflected sound is estimated by correcting the direction D so that the power of the residual signal E, obtained by subtracting from the observation signal the product of the complex amplitude and a function simulating the transfer characteristic for each frequency between an arbitrary position in space and each microphone (the transfer characteristic function), is minimized; and the complex amplitude multiplied by the transfer characteristic function corresponding to that direction of arrival is estimated as the arrival amplitude of the reflected sound.

[0013] Let P be a predetermined integer of 2 or more and p an integer satisfying 1 ≦ p ≦ P. Template information, a set of templates each representing the transfer characteristic for each frequency between the p-th position and the positions where M microphones are disposed, is prepared in advance. Let Q be a predetermined integer of 1 or more and q an integer with 1 ≦ q ≦ Q. Using the template information together with the observation signals, obtained by converting into the frequency domain the M sound pickup signals collected by the M microphones, for each q: (1) the p-th complex amplitude is determined so that the power of the residual signal, obtained by subtracting from the q-th minimum residual signal (the first minimum residual signal being the observation signal) the p-th reflected sound represented by the p-th template multiplied by the p-th complex amplitude, is minimized.
The power of the (q+1)-th residual signal, obtained by subtracting from the q-th minimum residual signal the p-th reflected sound represented by the p-th template multiplied by the determined p-th complex amplitude, is determined for each p, and the template giving the minimum power among them is selected. (2) In the vicinity of the direction D determined by the position corresponding to the selected template, the direction of arrival of the reflected sound is estimated by correcting the direction D so that the power of the residual signal E, obtained by subtracting from the q-th minimum residual signal the product of the complex amplitude and a function simulating the transfer characteristic for each frequency between an arbitrary position in space and each microphone (the transfer characteristic function), is minimized.

[0014] Let P be a predetermined integer of 2 or more and p an integer satisfying 1 ≦ p ≦ P. Template information, a set of templates each representing the transfer characteristic for each frequency between the p-th position and the positions where M microphones are disposed, is prepared in advance. Let Q be a predetermined integer of 1 or more and q an integer with 1 ≦ q ≦ Q. Using the template information together with the observation signals, obtained by converting into the frequency domain the M sound pickup signals collected by the M microphones, for each q: (1) the p-th complex amplitude is determined so that the power of the residual signal, obtained by subtracting from the q-th minimum residual signal (the first minimum residual signal being the observation signal) the p-th reflected sound represented by the p-th template multiplied by the p-th complex amplitude, is minimized.
The power of the (q+1)-th residual signal, obtained by subtracting from the q-th minimum residual signal the p-th reflected sound represented by the p-th template multiplied by the determined p-th complex amplitude, is determined for each p, and the template giving the minimum power among them is selected. (2) In the vicinity of the direction D determined by the position corresponding to the selected template, the direction of arrival of the reflected sound is estimated by correcting the direction D so that the power of the residual signal E, obtained by subtracting from the q-th minimum residual signal the product of the complex amplitude and a function simulating the transfer characteristic for each frequency between an arbitrary position in space and each microphone (the transfer characteristic function), is minimized; and the complex amplitude multiplied by the transfer characteristic function corresponding to that direction of arrival is estimated as the arrival amplitude of the reflected sound.

[0015] The power of the residual signal may be the power obtained by summing over all frequencies. In that case, the direction of arrival is estimated by correcting the direction D so that the power of the residual signal E summed over all frequencies is minimized.

[0016] Let ω be the frequency, Ω the set of frequencies ω, i the imaginary unit, and c the speed of sound, and let Spm(ω) be the transfer characteristic between the p-th position [xp, yp, zp] and the position [um, vm, wm] where the m-th (1 ≦ m ≦ M) microphone is arranged. With the template defined as Sp(ω) = {Sp1(ω), ..., SpM(ω)} (ω∈Ω), a template generation process that generates the template information {S1(ω), ..., SP(ω)} (ω∈Ω) may be included.

[0017] For example, the transfer characteristic Rm(ω) (1 ≦ m ≦ M) for each frequency between an arbitrary position [x, y, z] in space and each position [um, vm, wm] where the M microphones are arranged is expressed by the following equation, where ω is the frequency, i is the imaginary unit, and c is the speed of sound.
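The equation of paragraph [0017] does not survive in this extract. A common free-field choice for a transfer characteristic between a point and a microphone is the point-source (spherical-wave) model, sketched below; the specific functional form is an assumption, not a quote from the patent.

```python
import math, cmath

def transfer_characteristic(src, mic, omega, c=340.0):
    """Transfer characteristic between a point [x, y, z] and a microphone
    position [um, vm, wm] at angular frequency omega.  Assumes the usual
    free-field point-source model exp(-i*omega*r/c)/r, with r the
    source-microphone distance; the patent's own equation is not
    reproduced in this extract."""
    r = math.dist(src, mic)
    return cmath.exp(-1j * omega * r / c) / r

# One metre away: unit magnitude times the propagation phase.
R = transfer_characteristic([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], 2 * math.pi * 1000.0)
print(round(abs(R), 6))   # 1.0
```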
[0018] According to the present invention, template information, a set of templates representing the transfer characteristics for each frequency between positions in space (or on a plane) and the positions where M microphones are arranged, is created in advance, and the observation signal is decomposed into one or more reflected sounds on the basis of it. It is therefore possible to estimate the reflected sound information from the collected sound signal without using a special sound source signal to obtain an impulse response. Once the reflected sound information is obtained, it can be applied to uses that conventional voice information processing technology cannot realize, such as sound source direction estimation and voice emphasis (picking up distant sound, or picking up sound selectively by distance).

[0019] FIG. 1 shows the functional configuration of a conventional reflected sound information estimation technique. FIG. 2 is a diagram for explaining a virtual sound source. FIG. 3 is a diagram for explaining the matching of reflected sounds in the prior art. FIG. 4 is a block diagram showing the functional configuration of a reflected sound information estimation apparatus according to the first embodiment. FIG. 5 is a diagram showing the processing procedure of a reflected sound information estimation method according to the first embodiment. FIG. 6 shows a configuration example of a two-dimensional microphone array. FIG. 7 is a diagram for explaining the transfer characteristic between the p-th point [xp, yp, zp] and the m-th sound receiving point [um, vm, wm]. FIG. 8 is a diagram for explaining that the sound pressure distribution in a certain plane, observed using the microphone array shown in FIG. 6, is obtained by superposition of, for example, the direct sound, reflected sound 1, and reflected sound 2. FIG. 9 is a diagram for explaining the principle of the present invention.
FIG. 10 is a diagram showing the processing procedure of a reflected sound information estimation method according to the second embodiment. FIG. 11(A) is a diagram for explaining that, ideally, only information on the estimated arrival direction should be extracted; FIG. 11(B) is a diagram for explaining that in practice information on directions other than the estimated arrival direction is also mixed in. FIG. 12 is a diagram for explaining that summing the power of the residual signal over all frequencies reduces the influence of directions other than the estimated arrival direction. FIG. 13 shows the sound pressure distribution obtained with a practical-level two-dimensional matrix microphone array and its decomposition.

[0020] First Embodiment: The present invention estimates the "arrival direction," or the "arrival amplitude" and the "arrival direction," of a sound signal (sound source signal) emitted from a sound source, such as a speech signal, from the signals (collected signals) picked up by a microphone array composed of a plurality of microphones. The functional configuration and processing flow of the first embodiment are shown in FIG. 4 and FIG. 5.

[0021] The sound source signal emitted from the sound source 200 is picked up using the M-channel microphones 210-1, ..., 210-M (step S1). M is preferably a value greater than 4. The AD conversion unit 220 converts the collected analog signals into a digital signal xx<→>(t) = [xx1(t), ..., xxM(t)]<T> (step S2). Here, [·]<T> represents transposition, and t is an index of discrete time.

[0022] It is desirable to arrange the M microphones at equal intervals in two or three dimensions. This is to uniquely determine the correspondence between the arrival direction of the reflected sound and the template (described later; it simulates the transfer characteristic of the reflected sound).
Although the present invention can in principle be practiced even if the microphones are arranged one-dimensionally or at unequal intervals, a two-dimensional or three-dimensional equal-interval arrangement is desirable so that the relationship between the transfer characteristic of a reflected sound and its arrival direction remains one to one. An example in which microphones are arranged at equal intervals on a two-dimensional plane is shown in FIG. 6. The microphone spacing d is preferably set to satisfy the spatial sampling theorem, that is, to a value satisfying Equation (4), d ≦ c/(2f), where c is the speed of sound and f is the frequency to be analyzed. For example, when analyzing a frequency of 4 kHz, it is preferable to set the microphone spacing to about 4 cm.

[0023] The frame division unit 230 receives the digital signal xx<→>(t) = [xx1(t), ..., xxM(t)]<T> output from the AD conversion unit 220 and outputs, for each channel, a signal x<→>(k) = [x1(k), ..., xM(k)]<T> divided into sets (frames) of a plurality of digital samples (step S3). k is an index representing the frame number. Frame division is a process of buffering and outputting W points of each channel's digital signal xxi(t) (1 ≦ i ≦ M). W depends on the sampling frequency; for 16 kHz sampling, around 512 points is appropriate.

[0024] The frequency domain conversion unit 240 receives the digital signal x<→>(k) of each frame as input, and converts and outputs it as the signal X<→>(ω, k) = [X1(ω, k), ..., XM(ω, k)]<T> in the frequency domain (step S4). This signal X<→>(ω, k) is called the observation signal. Here, ω indicates the index of the discrete frequency (since the frequency f and the angular frequency ω are related by ω = 2πf, the frequency index ω may be identified with this angular frequency ω.
Hereinafter, the "index of frequency" is simply referred to as "frequency" with respect to ω), and k indicates the index of the frame. The discrete Fourier transform is one method of converting to the frequency domain, but any other method that converts to the frequency domain may be used. The observation signal X<→>(ω, k) in the frequency domain is output for each frequency ω and frame k.

[0025] The template generation unit 250 generates, for each frequency ω, template information S<→>(ω) = [S1<→>(ω), ..., SP<→>(ω)] (<∀>ω∈Ω; Ω is the set of frequency indexes ω), which is a set of P templates Sp<→>(ω) (written in vector notation for convenience of calculation) (step Sp). This process is usually performed prior to the processes of steps S1-S4. P represents the total number of templates and is set in advance to an integer value of 2 or more. The larger the total number of templates P, the more accurate the estimation of the reflected sound information, but since the amount of calculation increases, it is preferable to set, for example, P = about 1000. This process is performed in advance, before observing a signal with the microphones. As long as neither the microphone positions (for example, the inter-microphone distance d) nor the total number of templates P is changed, it is usually unnecessary to recreate the templates each time. The "template" referred to here simulates the transfer characteristic (sound propagation characteristic) corresponding to the arrival direction of a reflected sound. The p-th (1 ≦ p ≦ P) template Sp<→>(ω) = [Sp1(ω), ..., SpM(ω)]<T> (ω∈Ω) represents the transfer characteristic for each frequency between a predetermined p-th point [xp, yp, zp] and the M sound receiving points (a sound receiving point is a position where a microphone is arranged; the m-th (1 ≦ m ≦ M) sound receiving point is [um, vm, wm]) (see FIG. 7).
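The template generation of [0025] can be sketched as follows. Since Equation (5) is not reproduced in this extract, the sketch assumes a far-field plane-wave model, Spm(ω) = exp(-iωτpm), where τpm is the delay of a plane wave from the direction of point p at microphone m; all numeric values in the usage lines are illustrative.

```python
import numpy as np

def make_templates(points, mics, omegas, c=340.0):
    """Template information {S_p(omega)}: one M-vector per point p and
    frequency omega.  Assumes a far-field plane-wave model (the patent's
    Equation (5) is not reproduced here, so this model is an assumption).
    points: (P, 3) points [xp, yp, zp]; mics: (M, 3) receiving points;
    omegas: (F,) angular frequencies.  Returns an (F, P, M) complex array."""
    points = np.asarray(points, float)
    mics = np.asarray(mics, float)
    units = points / np.linalg.norm(points, axis=1, keepdims=True)  # (P, 3)
    taus = -(units @ mics.T) / c                                    # (P, M)
    # S[f, p, m] = exp(-i * omega_f * tau_pm)
    return np.exp(-1j * np.asarray(omegas)[:, None, None] * taus[None])

mics = np.array([[0.0, 0.0, 0.0], [0.04, 0.0, 0.0]])   # two mics, 4 cm apart
pts = np.array([[10.0, 0.0, 0.0]])                     # one distant point
S = make_templates(pts, mics, [2 * np.pi * 1000.0])
print(S.shape)   # (1, 1, 2)
```

Because the model depends only on the direction of each point, points on a sphere around the origin (as [0027] suggests) give one template per direction.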
An example of a calculation formula for each element Spm(ω) of the p-th template Sp<→>(ω) is shown in Equation (5). The symbol i represents the imaginary unit.

[0026] Direction information θp<→>(ω) is associated with the p-th template Sp<→>(ω). The direction information θp<→>(ω) expresses the direction of the p-th point [xp, yp, zp] seen from the origin of the three-dimensional orthogonal coordinate system that serves as the reference for the position coordinates of the p-th point [xp, yp, zp] and the sound receiving points [um, vm, wm], for example as the two angles of a spherical coordinate system sharing that origin (the polar angle θp,pol and the azimuthal angle θp,az). That is, θp<→>(ω) = [θp,pol(ω), θp,az(ω)]. If the p-th point [xp, yp, zp] is associated with the p-th template Sp<→>(ω), the direction information θp<→>(ω) can be calculated from the position [xp, yp, zp], so it is not a requirement that the p-th template Sp<→>(ω) be directly associated with the direction information θp<→>(ω). Since the three-dimensional orthogonal coordinate system and the spherical coordinate system are mutually convertible (coordinate conversion), the right side of Equation (5) can also be expressed not with the position [x, y, z] but with the direction information θp<→>(ω) = [θp,pol(ω), θp,az(ω)], for example as Equation (5a). Here, d is the microphone interval, the microphone array is a two-dimensional array of Φ rows and Ξ columns (Φ × Ξ = M), and the m-th microphone is assumed to be at row φ and column ξ (1 ≦ φ ≦ Φ, 1 ≦ ξ ≦ Ξ).

[0027] Further, when the templates correspond to directions as in the first embodiment, it is preferable that the P points [xp, yp, zp] (1 ≦ p ≦ P) be at mutually different positions.
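The coordinate conversion mentioned in [0026], from a point [xp, yp, zp] to the direction information [θp,pol, θp,az], can be sketched as below. The angle convention (polar angle from the +z axis, azimuth from +x in the x-y plane) is an assumption, since the extract does not fix one.

```python
import math

def direction_info(x, y, z):
    """Direction information [theta_pol, theta_az] (radians) of the point
    [x, y, z] seen from the common origin, as in paragraph [0026].  The
    angle convention here is assumed, not specified by the extract."""
    r = math.sqrt(x * x + y * y + z * z)
    return math.acos(z / r), math.atan2(y, x)

pol, az = direction_info(0.0, 1.0, 0.0)   # point on the +y axis
print(round(pol, 4), round(az, 4))        # 1.5708 1.5708
```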
For example, the points [xp, yp, zp] may be P different points on a spherical surface centered on the origin, each sufficiently distant from the origin. The reason each point [xp, yp, zp] is taken sufficiently far from the origin is that, although the signal emitted from a sound source or virtual sound source propagates spherically, in a local region sufficiently distant from that source the direct sound or reflected sound can be simulated as a plane wave. The template information should not, however, include templates corresponding to positions in the same direction. It is assumed that the microphone array is arranged in the vicinity (local region) of the origin of the coordinate system.

[0028] The template storage unit 260 stores the template information S<→>(ω) output from the template generation unit 250 and provides it to the reflected sound information estimation unit 270 at analysis time.

[0029] The reflected sound information estimation unit 270 receives the observation signals X<→>(ω, k) in the frequency domain and the template information S<→>(ω), and outputs, for each frequency ω and each frame k, reflected sound information rs<→>(ω, k) = [rs1<→>(ω, k), ..., rsQ<→>(ω, k)]<T>, a set of Q reflected sound information components rsq<→>(ω, k) (written in vector notation for convenience of calculation) (step S5). Here, Q represents the total number of reflected sounds to be estimated and is set in advance to an integer value of 1 or more. The q-th (1 ≦ q ≦ Q) reflected sound information component is rsq<→>(ω, k) = [rsAq(ω, k), rsBq(ω, k)], where rsAq(ω, k) is the arrival amplitude of the q-th reflected sound and rsBq(ω, k) is its arrival direction.

[0030] The principle of estimating the reflected sound information will be described.
An example of the sound pressure distribution on a certain plane, observed using a two-dimensional microphone array as shown in FIG. 6, is shown as a gray scale at the left end of FIG. 8. In these gray-scale views of sound pressure distribution, black parts indicate small sound pressure and white parts indicate large sound pressure. The observed sound pressure distribution mixes not only the sound pressure distribution of the direct sound but also those of the reflected sounds. When the direct sound and the reflected sounds arrive from sufficiently far away, their respective sound pressure distributions on the two-dimensional plane become stripes, like the three gray-scale figures on the right side of FIG. 8. The "darkness" of a striped pattern corresponds to the arrival amplitude of the direct or reflected sound, and its "rotation/period" corresponds to the arrival direction. The example of FIG. 8 shows that the sound pressure distribution of the observation signal is formed by superimposing the sound pressure distributions of the direct sound, reflected sound 1, and reflected sound 2, which have different arrival amplitudes and directions. In the frequency domain, the direct sound and each reflected sound are represented by complex sinusoids whose spatial frequency changes according to the direction of arrival, and the observation signal is represented as a superposition of the complex sinusoids corresponding to the direct sound and the respective reflected sounds. The problem to be solved by the present invention is to estimate the arrival amplitude and/or arrival direction of the reflected sounds using only the observation signal; solving it corresponds to estimating these from the sound pressure distribution depicted at the left end of FIG. 8.

[0031] An outline of the method for estimating the reflected sound information rs<→>(ω, k) will be described with reference to FIG. 9. First, the reflected sound 0 with the strongest power contained in the observation signal observed on a certain two-dimensional plane is estimated (this corresponds to q = 1; having the strongest power, reflected sound 0 is usually understood as the "direct sound"), and reflected sound 0 is subtracted from the observation signal to obtain a residual signal E2. Next, the reflected sound 1 (corresponding to q = 2) with the strongest power contained in the residual signal E2 is estimated, and reflected sound 1 is subtracted from E2 to obtain a new residual signal E3. Next, the reflected sound 2 (corresponding to q = Q = 3) with the strongest power contained in the residual signal E3 is estimated. The case Q = 3 has been described here, but in general, the operation of subtracting from the q-th residual signal Eq (the first residual signal being the observation signal) the reflected sound q-1 with the strongest power (reflected sound 0 being the direct sound) is performed sequentially up to q = Q, yielding Q reflected sound information components (rs1<→>(ω, k), ..., rsQ<→>(ω, k)). The first reflected sound information component rs1<→>(ω, k) corresponds to reflected sound 0 (the direct sound), the second component rs2<→>(ω, k) to reflected sound 1, the third component rs3<→>(ω, k) to reflected sound 2, ..., and the Q-th component rsQ<→>(ω, k) to reflected sound Q-1. Although Q depends on the available computing power and the application using the reflected sound information, it is preferably set to about 30.

[0032] Note that although the sound pressure distributions in FIG. 8 and FIG.
9 are shown as high-resolution gray-scale diagrams, displaying the sound pressure distribution at such high resolution would require a very large number of microphones and is not practical. Even when, as a practical-level two-dimensional matrix microphone array, 100 microphones are used in a 10 × 10 arrangement, only a coarse (low-resolution) sound pressure distribution can be obtained (see FIG. 13). From a practical viewpoint, therefore, it is necessary to accurately estimate the arrival amplitude and arrival direction of the reflected sounds in a situation where only a low-resolution sound pressure distribution is available. In the present invention, a plane wave arriving from an arbitrary position is represented concretely (formulation) in order to improve the spatial resolution, and, so that a reflected sound with small power is not buried when a reflected sound with large power is estimated, the already estimated reflected sound is removed from the observation signal before the next reflected sound is estimated (decomposition). The formulation is as described for the template information, and the decomposition is as described in the outline of the estimation method for the reflected sound information rs<→>(ω, k).

[0033] The method of estimating the reflected sound information rs<→>(ω, k) described above will now be explained in detail. Before the explanation, symbols are defined. The q-th residual signal is expressed as Eq<→>(ω, k) = [Eq1(ω, k), ..., EqM(ω, k)]<T>, and the q-th reflected sound (the direct sound in the case of q = 1) as Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)).
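The successive subtraction outlined in [0031] can be sketched as follows, at a single frequency bin and with the templates used directly in place of the corrected transfer characteristic function, so the direction-refinement step is omitted; the toy templates and amplitudes are illustrative.

```python
import numpy as np

def decompose(X, templates, Q):
    """Greedy decomposition sketched from [0031], at one frequency bin:
    for each template find the complex amplitude minimizing the residual
    power, keep the template giving the overall minimum residual, subtract
    that reflected sound, and repeat Q times (the first 'residual' is the
    observation signal).  templates: (P, M) complex array; X: (M,)."""
    E = X.astype(complex).copy()
    out = []
    for _ in range(Q):
        num = templates.conj() @ E                      # S_p^H E, shape (P,)
        den = np.sum(np.abs(templates) ** 2, axis=1)    # S_p^H S_p
        gains = np.abs(num) ** 2 / den                  # power removed by p
        p = int(np.argmax(gains))                       # min residual power
        A = num[p] / den[p]                             # best amplitude for p
        E = E - A * templates[p]
        out.append((p, A))
    return out, E

# Toy check: a strong and a weak component built from two templates.
T = np.array([[1, 1j, -1, -1j], [1, 1, 1, 1]], dtype=complex)
X = 2.0 * T[0] + 0.1 * T[1]
comps, resid = decompose(X, T, Q=2)
print(comps[0][0], np.allclose(resid, 0))   # 0 True
```

Removing the strong component first is exactly what keeps the weak one from being buried, as [0032] argues.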
Rq<→>(ω, θq<→>(ω, k)) = [R1(ω, θq<→>(ω, k)), ..., RM(ω, θq<→>(ω, k))]<T> is a function simulating the transfer characteristic for each frequency between an arbitrary position [x, y, z] in space and each microphone (hereinafter referred to as the transfer characteristic function); any function that simulates the transfer characteristic for each microphone may be used. Such a transfer characteristic function is used as a component of the reflected sound in order to improve the estimation accuracy of the arrival direction: the template corresponding to the direction considered closest to the arrival direction to be estimated is determined first, and the direction D corresponding to that template is then corrected within its vicinity (this correction is described later as the optimization of the reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k))). In general, each transfer characteristic Rm(ω, θq<→>(ω, k)) constituting the transfer characteristic function is calculated by the same formula as each element Spm(ω) of a template. In this case, the transfer characteristic Rm(ω, θq<→>(ω, k)) for each frequency between the position [x, y, z] in the direction represented by the direction information θq<→>(ω, k) and the m-th sound receiving point [um, vm, wm] is expressed by equation (6). The position [x, y, z] in the direction represented by the direction information θq<→>(ω, k) may be, for example, a position on a spherical surface sufficiently far from the origin of the coordinate system. The reason for placing the position [x, y, z] sufficiently far from the origin is as described above: the position [x, y, z] stands in for the sound source as seen from the local region where the microphone array is arranged.
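Equation (6a) itself is not reproduced in this text, so the following sketch uses a standard far-field plane-wave model for a Φ-row × Ξ-column grid with spacing d; the sign convention, the reference point, and the function and parameter names are assumptions, and the patent's actual formula may differ.

```python
import numpy as np

def transfer_function(omega, pol, azi, Phi, Xi, d, c=340.0):
    """Plane-wave transfer characteristic R_m(omega, theta) for a
    Phi-row x Xi-column two-dimensional microphone array with spacing d.

    omega    : angular frequency [rad/s]
    pol, azi : polar and azimuth angles of the arrival direction [rad]
    c        : speed of sound [m/s]
    Returns an (M,) complex vector, M = Phi * Xi, with microphone m at
    row phi and column xi, enumerated row-major.
    """
    # Unit propagation vector projected onto the array plane (z = 0)
    kx = np.sin(pol) * np.cos(azi)
    ky = np.sin(pol) * np.sin(azi)
    phi, xi = np.meshgrid(np.arange(Phi), np.arange(Xi), indexing="ij")
    # Path-length difference of each sensor relative to the array origin
    delay = (phi * d * kx + xi * d * ky) / c
    return np.exp(-1j * omega * delay).ravel()
```

For a broadside arrival (polar angle 0) all sensors receive the wave in phase, so the vector is all ones; for any direction each element has unit magnitude, consistent with a pure delay model.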
It is preferable that [x, y, z] be any position in space at a distance from which the direct sound from the sound source, or the reflected sound from a virtual sound source, can be approximated as a plane wave within the local region where the microphone array is arranged. Since a three-dimensional orthogonal coordinate system and a spherical coordinate system are mutually convertible (coordinate conversion), the right-hand side of equation (6) can also be expressed not with the position [x, y, z] but with the direction information θq<→>(ω, k) = [θq,pol(ω, k), θq,azi(ω, k)], for example as shown in equation (6a). Here, d is the microphone spacing, the microphone array is a two-dimensional array of Φ rows and Ξ columns (Φ × Ξ = M), and the m-th microphone is assumed to be located at row φ and column ξ (1 ≦ φ ≦ Φ, 1 ≦ ξ ≦ Ξ).

[0034] The coefficient Aq(ω, k) represents the difference between the template Rq<→>(ω, θq<→>(ω, k)) and the reflected sound, caused by the phase of the sound source 200 itself, reflection on walls, attenuation due to distance, and the like; that is, it represents the arrival amplitude. The above-mentioned procedure of subtracting the reflected sounds from the residual signals in ascending order of q is expressed by equation (7), where 1 ≦ q ≦ Q and E1<→>(ω, k) = X<→>(ω, k).

[0035] Next, a method of optimizing the reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)) will be described. The q-th optimized reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)) is determined according to the criterion of minimizing the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) of the (q+1)-th residual signal Eq+1<→>(ω, k) expressed by equation (7). Specifically, noting that the transfer characteristic function Rq<→>(ω, θq<→>(ω, k)) is determined by the direction information θq<→>(ω, k), the optimum values Aq,opt(ω, k) and θq,opt<→>(ω, k) of the parameters Aq(ω, k) and θq<→>(ω, k) representing the reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)) are obtained by equation (8). The symbol H represents conjugate transposition.
[0036] At this time, the q-th reflected sound information component rsq<→>(ω, k) = [rsAq(ω, k), rsBq(ω, k)] is given by equations (9) and (10).

[0037] Various methods can be considered for calculating equation (8); an example is shown here. The optimization method described below is applied to each q in ascending order of q.

[0038] §1 Initial value setting of direction information. First, the initial value θini,q<→>(ω, k) of the direction information θq<→>(ω, k) is determined using the template information S<→>(ω). For this purpose, the template corresponding to the direction considered closest to the arrival direction to be estimated is determined, and the direction information corresponding to the determined template is set as the initial value θini,q<→>(ω, k) of the direction information θq<→>(ω, k).

[0039] To determine such a template from the template information, the reflected sound is, for convenience, represented as Aq(ω, k, g(ω, q))Sg(ω,q)<→>(ω). Here, g(ω, q) represents the index of the template that can most accurately express the q-th reflected sound among the template information. The coefficient Aq(ω, k, g(ω, q)) constituting the reflected sound represents the difference between the template Sg(ω,q)<→>(ω) and the reflected sound, caused by the phase of the sound source 200 itself, reflection on walls, attenuation due to distance, and the like. In this case, the (q+1)-th residual signal Eq+1<→>(ω, k) is expressed as shown in equation (11), where E1<→>(ω, k) = X<→>(ω, k).

[0040] The reflected sound Aq(ω, k, g(ω, q))Sg(ω,q)<→>(ω) is estimated, based on equation (11), according to the criterion of minimizing the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) of the (q+1)-th residual signal Eq+1<→>(ω, k). There are various estimation methods; one of them is described.
Since the reflected sound is composed of two elements, Aq(ω, k, g(ω, q)) and Sg(ω,q)<→>(ω), optimization with respect to both elements is required. <Process 1> and <Process 2> described below are performed for each q in ascending order of q.

[0041] <Process 1> Let Λ be the set obtained by removing the already selected indices {g(ω, 1), ..., g(ω, q−1)} from the whole set {1, ..., p, ..., P}; that is, Λ = {1, ..., p, ..., P} − {g(ω, 1), ..., g(ω, q−1)}. When <Process 1> is performed for the first time, Λ = {1, ..., p, ..., P}. Assuming that the p-th template Sp<→>(ω) is the optimal template for minimizing the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) of the residual signal Eq+1<→>(ω, k), the coefficient Aq(ω, k, p) is obtained by equation (12) based on the least-squares method. Note that at this stage the index q on the left-hand side of equation (12) has no particular meaning.

[0042] <Process 2> Let |Λ| denote the number of elements (cardinality) of the set Λ. Using the |Λ| coefficients Aq(ω, k, p) (p ∈ Λ) obtained by equation (12), the index g(ω, q) of the template Sg(ω,q)<→>(ω) is obtained by equation (13) as the index that minimizes the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) of the residual signal Eq+1<→>(ω, k).

[0043] The initial value θini,q<→>(ω, k) of the direction information θq<→>(ω, k) is therefore given as the direction information θg(ω,q)<→>(ω) = [θg(ω,q),pol(ω), θg(ω,q),azi(ω)] corresponding to the template Sg(ω,q)<→>(ω) whose index is the g(ω, q) obtained by equation (13). That is, θini,q<→>(ω, k) = [θg(ω,q),pol(ω), θg(ω,q),azi(ω)]. Note that the initial value θini,q<→>(ω, k) does not depend on the frame index k.
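<Process 1> and <Process 2> can be sketched as follows, under the assumption that equation (12) is the usual least-squares coefficient Sp<→>(ω)<H>Eq<→>(ω, k) / (Sp<→>(ω)<H>Sp<→>(ω)) and that equation (13) selects the index whose subtraction leaves the smallest residual power; the function and variable names are illustrative, not the patent's.

```python
import numpy as np

def initial_template(E, S, used):
    """<Process 1>/<Process 2>: pick the template index g minimizing the
    residual power, excluding already selected indices.

    E    : (M,) complex residual E_q at one frequency bin and frame
    S    : (P, M) complex template information (one template per row)
    used : set of indices {g(omega,1), ..., g(omega,q-1)} to exclude
    Returns (g, A_g): the best index and its least-squares coefficient.
    """
    P = S.shape[0]
    Lam = [p for p in range(P) if p not in used]     # the set Lambda
    best_g, best_A, best_pow = None, None, np.inf
    for p in Lam:
        # Least-squares coefficient for template p (cf. equation (12))
        A = (S[p].conj() @ E) / (S[p].conj() @ S[p])
        # Residual power when template p is subtracted (cf. equation (13))
        pw = np.sum(np.abs(E - A * S[p]) ** 2)
        if pw < best_pow:
            best_g, best_A, best_pow = p, A, pw
    return best_g, best_A
```

The direction information attached to the returned index then serves as the initial value θini,q<→>(ω, k) of §1.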
[0044] §2 Optimization of the reflected sound. Next, with the initial value θini,q<→>(ω, k) of the direction information θq<→>(ω, k) as the starting point, the reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)) is optimized so as to minimize the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) of the (q+1)-th residual signal Eq+1<→>(ω, k) expressed by equation (7). Since the reflected sound is composed of two elements, the coefficient Aq(ω, k) and the transfer characteristic function Rq<→>(ω, θq<→>(ω, k)), optimization with respect to both elements is required. There are various optimization methods; one of them (a gradient method) is described here. In the illustrated method, correction of the direction information θq<→>(ω, k) and correction of the coefficient Aq(ω, k) are performed alternately a predetermined number of times (δ times), whereby the reflected sound Aq(ω, k)Rq<→>(ω, θq<→>(ω, k)) is optimized. The value of δ is, for example, about 50, but may be 1.

[0045] §2.1 Correction of direction information. Correction of the direction information θq<→>(ω, k) = [θq,pol(ω, k), θq,azi(ω, k)] is performed by updating according to equation (14). When the processing of §2.1 is performed for the first time, the direction information θq<→>(ω, k) on the right-hand side of equation (14) is the initial value θini,q<→>(ω, k) obtained by the processing of §1; when the processing of §2.1 is not the first time, it is the direction information obtained by the immediately preceding processing of §2.1.
Also, when the processing of §2.1 is performed for the first time, the coefficient Aq(ω, k) used in the calculation of the power (Eq+1<→>(ω, k))<H>Eq+1<→>(ω, k) is the Aq(ω, k, p) obtained by equation (12); when the processing of §2.1 is not the first time, the coefficient Aq(ω, k) used in the calculation of the power is the coefficient Aq(ω, k) obtained by the immediately preceding processing of §2.2 (described later). The step widths α1 and α2 are small positive constants determined in consideration of the convergence speed and the like, and are set to, for example, values of about 0.1.

[0046] §2.2 Correction of the coefficient. Correction of the coefficient Aq(ω, k) is performed by obtaining a new coefficient Aq(ω, k) according to equation (15), which is based on the method of least squares. The Rq<→>(ω, θq<→>(ω, k)) used in equation (15) is obtained from equation (6) and the direction information θq<→>(ω, k) obtained by the processing of §2.1.

[0047] The coefficient Aq(ω, k) and the direction information θq<→>(ω, k) obtained at the end of the δ iterations are taken as Aq,opt(ω, k) and θq,opt<→>(ω, k), and give the q-th reflected sound information component rsq<→>(ω, k). That is, the q-th reflected sound information component rsq<→>(ω, k) = [rsAq(ω, k), rsBq(ω, k)] is given by equations (16) and (17).

[0048] By the above processing, Q reflected sound information components rsq<→>(ω, k) = [rsAq(ω, k), rsBq(ω, k)] (q = 1, ..., Q) are obtained. If δ = 1 is set and the correction of the coefficient is not performed, it is also possible to obtain only the arrival direction as the reflected sound information.
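The δ-iteration alternation of §2.1 and §2.2 can be sketched as follows. Equations (14) and (15) are not reproduced in this text, so the direction update is approximated here by a finite-difference gradient step on the residual power and the coefficient update by a least-squares refit; the step size, all names, and the signal model in the usage below are assumptions for the sketch.

```python
import numpy as np

def optimize_reflection(E, R_of, theta0, A0, alpha=0.1, delta=50, eps=1e-4):
    """Alternate direction correction (cf. 2.1) and coefficient
    correction (cf. 2.2) for delta iterations.

    E      : (M,) complex residual from which the component is subtracted
    R_of   : callable theta -> (M,) transfer characteristic R(omega, theta)
    theta0 : initial [polar, azimuth] direction from section 1
    A0     : initial coefficient from the least-squares fit
    alpha  : step width for the gradient step on the angles
    """
    theta = np.asarray(theta0, dtype=float)
    A = A0

    def power(th, a):
        # Residual power (E_{q+1})^H E_{q+1} after subtracting a * R(th)
        return np.sum(np.abs(E - a * R_of(th)) ** 2)

    for _ in range(delta):
        # 2.1: gradient step on the two angles (finite-difference gradient)
        grad = np.zeros(2)
        for i in range(2):
            dth = np.zeros(2)
            dth[i] = eps
            grad[i] = (power(theta + dth, A) - power(theta - dth, A)) / (2 * eps)
        theta = theta - alpha * grad
        # 2.2: least-squares refit of the coefficient at the new direction
        R = R_of(theta)
        A = (R.conj() @ E) / (R.conj() @ R)
    return theta, A
```

With δ large enough and a small step width, the pair (θ, A) settles near the minimizer of the residual power, playing the role of (θq,opt<→>, Aq,opt).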
[0049] Second Embodiment. In the first embodiment, the reflected sound information rs<→>(ω, k) is obtained using the template information S<→>(ω), but it is not essential to obtain in advance the template information S<→>(ω), which is a set of P templates Sp<→>(ω). An embodiment in which the template information S<→>(ω) is not obtained in advance will be described as a second embodiment.

[0050] In the second embodiment, the processes of steps S1 to S4 of the first embodiment are performed, the process of step Sp of the first embodiment is unnecessary, and the process of step S5a is performed instead of the process of step S5 of the first embodiment (see FIG. 10). Therefore, duplicate explanation of matters common to the first embodiment is omitted, and the matters different from the first embodiment are described.

[0051] The process of step S5a in the second embodiment differs from that of the first embodiment in "§1 Initial value setting of direction information". The initial value θini,q<→>(ω, k) of the direction information θq<→>(ω, k) is determined, for example, by an arrival direction estimation method such as the beamformer method. The beamformer method spatially scans a directional beam and searches for directions of increased power in the obtained power spectrum. Here, it is assumed that P arrival directions can be estimated by the beamformer method.

[0052] In practice, the power spectrum obtained by the beamformer method may not be steep with respect to the arrival direction; in such a case, arrival directions may be determined at predetermined intervals within, for example, the range of directions whose power spectrum shows a spectral intensity equal to or higher than a predetermined spectral intensity.
As a specific example, suppose that a power spectrum showing a spectral intensity equal to or greater than the predetermined spectral intensity is obtained in the range of polar angle 5° and azimuth angle 10° to 20°. If arrival directions are determined at a predetermined interval of 2°, then (polar angle 5°, azimuth angle 10°), (polar angle 5°, azimuth angle 12°), (polar angle 5°, azimuth angle 14°), (polar angle 5°, azimuth angle 16°), (polar angle 5°, azimuth angle 18°), and (polar angle 5°, azimuth angle 20°) may be set as the arrival directions.

[0053] Further, even when the power spectrum shows a sharp peak in a certain direction, arrival directions may be determined within a predetermined range around that direction instead of simply taking the direction itself as one of the arrival directions. As a specific example, suppose that a power spectrum showing a sharp peak at polar angle 30° and azimuth angle 50° is obtained and that arrival directions are determined within a predetermined range (polar angle ±4°, azimuth angle ±4°, interval 2°). Then every combination of polar angle 26°, 28°, 30°, 32°, 34° with azimuth angle 46°, 48°, 50°, 52°, 54°, that is, the 25 directions from (polar angle 26°, azimuth angle 46°) to (polar angle 34°, azimuth angle 54°), may be set as the arrival directions. Note that whereas P is a fixed value in the first embodiment, in the second embodiment P is a value that depends on the estimation result of the arrival direction estimation method such as the beamformer method.

[0054] Templates are generated for the P arrival directions obtained by the beamformer method. The formula for calculating each element of a template is, for example, equation (6). The "§1 Initial value setting of direction information" described in the first embodiment may then be performed using these P templates (the template information S<→>(ω)). The processing after the initial value setting is as described in the first embodiment.

[0055] <Modifications> In the first embodiment described above, the reflected sound information rs<→>(ω, k) is estimated from the observation signal X<→>(ω, k) for each frequency separately. When the reflected sound information is estimated for each frequency, however, information on directions other than the direction of the virtual sound source to be estimated (the estimated arrival direction) may be included, resulting in errors in the reflected sound information. For example, as shown in FIG. 11(a), it is desirable to extract only the information on the estimated arrival direction, but in fact, as shown in FIG. 11(b), information on directions other than the estimated arrival direction may be mixed in.

[0056] Therefore, in the modification, the estimation error of the reflected sound information is reduced by calculating the power collectively over all frequencies. That is, as shown in FIG. 12, by integrating the power of the residual signal over all the frequencies, the influence of directions other than the estimated arrival direction can be reduced as much as possible.
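The frequency-integrated selection just described can be sketched as follows. The sketch assumes the wideband score is simply the per-frequency residual power (with a per-frequency least-squares amplitude) summed over the analysed band Ω; the original's exact criterion may differ, and all names are illustrative.

```python
import numpy as np

def pick_template_wideband(E, S, Omega):
    """Choose one template index by minimizing the residual power
    summed over the analysed frequency band.

    E     : (W, M) complex residuals, one row per frequency index omega
    S     : (W, P, M) complex templates for each frequency
    Omega : iterable of frequency indices in the analysed band
    Returns the index of the template with the smallest integrated power.
    """
    P = S.shape[1]
    total = np.zeros(P)
    for w in Omega:
        for p in range(P):
            Sp = S[w, p]
            # Per-frequency least-squares amplitude for candidate p
            A = (Sp.conj() @ E[w]) / (Sp.conj() @ Sp)
            # Accumulate the residual power over the band
            total[p] += np.sum(np.abs(E[w] - A * Sp) ** 2)
    return int(np.argmin(total))
```

Because the per-frequency fluctuations of the off-direction power tend to average out across the band, the integrated score favors the true arrival direction more reliably than any single frequency bin.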
Generally, the power in directions other than the estimated arrival direction varies from frequency to frequency, so integrating the power of the residual signal over all frequencies reduces the relative influence of the power in the other directions compared with the power in the estimated arrival direction. Note that in FIG. 12 the power on the vertical axis is a relative value, so the scales of the graphs are not the same.

[0057] The processing in this modification is as follows. Let Ω be the set of indices ω of the frequencies included in the frequency band to be analyzed. For example, when dealing with audio signals, the set of indices corresponding to the 1.0 to 3.0 kHz band may be taken as Ω. Then, the index g(ω, q) of the template Sg(ω,q)<→>(ω) is determined by equation (18) instead of equation (13), and the correction of the direction information θq<→>(ω, k) = [θq,pol(ω, k), θq,azi(ω, k)] is performed by updating according to equation (19) instead of equation (14).

[0058] <Application Example> Reflected sound information is very important acoustic information in human life. For example, a visually impaired person grasps the surrounding environment by listening with the ears to the sound produced by tapping and reflected by walls, ceilings, and the like. Even in daily conversation, there is a difference in ease of conversation between a room in which adequate reflection occurs and an environment in which reflection is relatively small. Hereinafter, examples of services using the reflected sound information estimated by the present invention will be described. The first is an example in which the present invention is incorporated into a conference system. Since the amplitude of a reflected sound changes according to the orientation of a directional sound source, when the reflected sound information is known, the direction in which the sound source is oriented can be estimated.
If such estimation of the sound source orientation is incorporated into a conference system, it can be applied to presenting to whom a user spoke. The second is a system that allows video and audio to be viewed from an arbitrary position. Distant sounds are difficult to pick up because the power of the directly arriving sound is small. If the reflected sound information is known, not only the direct sound but also the reflected sounds can be emphasized and collected, so distant sounds can be emphasized. In the field of speech processing, although sound sources can be selectively picked up according to direction, it is considered very difficult to pick up voices selectively according to distance. If the reflected sound information is known, physical feature quantities corresponding to distance can be obtained, so sound pickup by distance becomes possible. If distant sound pickup and pickup by direction or distance are possible, a sound field corresponding to the position selected by the viewer can be generated in a pseudo manner.

[0059] In a voice communication system, estimating the reflected sound information leads to obtaining information on the sound field that cannot be obtained from the direct sound alone. If the reflected sound information is known, it can be linked to distant sound collection and distance-dependent sound collection, which cannot be achieved by conventional speech enhancement technology, and sound field information that cannot be estimated by conventional sound collection technology (for example, the orientation of the sound source) can be estimated. The estimation of such sound field information leads to the development of speech processing devices that could not be realized by the prior art.
Whereas the prior art concerning the estimation of reflected sound information needed to observe a special signal in order to obtain an impulse response, the present invention has the advantage that the reflected sound information can be obtained from a general observation signal such as a speech signal.

[0060] <Hardware Configuration Example of Reflected Sound Information Estimation Device> The reflected sound information estimation device according to the above-described embodiments has an input unit to which a keyboard and the like can be connected, an output unit to which a liquid crystal display and the like can be connected, a CPU (Central Processing Unit) [which may be provided with a cache memory and the like], memories such as a RAM (Random Access Memory) and a ROM (Read Only Memory), an external storage device such as a hard disk, and a bus that connects the input unit, the output unit, the CPU, the RAM, the ROM, and the external storage device so that data can be exchanged among them. If necessary, the reflected sound information estimation device may also be provided with a device (drive) capable of reading from and writing to a storage medium such as a CD-ROM. A general-purpose computer is an example of a physical entity provided with such hardware resources.

[0061] The external storage device of the reflected sound information estimation device stores the program for estimating the reflected sound information and the data required in the processing of this program [the storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device]. Data and the like obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, and the like. Hereinafter, a storage device that stores data, the addresses of its storage areas, and the like is simply referred to as a "storage unit".
[0062] The storage unit of the reflected sound information estimation device stores a program for performing AD conversion on the analog signal, a program for performing frame division processing, a program for converting the digital signal of each frame into an observation signal in the frequency domain, a program for generating the template information, and a program for estimating the reflected sound information using the observation signal in the frequency domain and the template information.

[0063] In the reflected sound information estimation device, each program stored in the storage unit and the data necessary for the processing of each program are read into the RAM as necessary, and are interpreted and executed by the CPU. The estimation of the reflected sound information is realized when the CPU thereby implements the predetermined functions (the AD conversion unit, the frame division unit, the frequency domain conversion unit, the template generation unit, and the reflected sound information estimation unit).

[0064] <Supplement> The present invention is not limited to the above-described embodiments, and various modifications can be made without departing from the spirit of the present invention. Further, the processing described in the above embodiments may be executed not only in chronological order according to the order of description but also in parallel or individually, depending on the processing capability of the device executing the processing or on necessity.

[0065] When the processing functions of the hardware entity (the reflected sound information estimation device) described in the above embodiments are implemented by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
[0066] The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be any medium such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (Rewritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) or the like as the semiconductor memory.

[0067] This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

[0068] A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. At the time of executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As other execution forms of this program, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or, each time the program is transferred from the server computer to this computer, the computer may sequentially execute the processing according to the received program.
Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer. Note that the program in the present embodiments includes information that is provided for processing by a computer and that conforms to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

[0069] Further, although the hardware entity is configured in the present embodiments by executing a predetermined program on a computer, at least a part of the processing contents may be realized as hardware.
