Patent Translate — Powered by EPO and Google

Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2010175431

Sound source direction estimation device and method thereof, and program

[Abstract] An object of the present invention is to improve the estimation accuracy of a sound source direction. A sound source direction estimation apparatus according to the present invention includes a microphone array consisting of three microphones arranged at the vertices of an equilateral triangle, frequency conversion units that convert the signals received by the microphones of the microphone array into signals in the frequency domain, arrival time difference calculation units that calculate an arrival time difference for each combination of different microphone pairs, and a sound source direction estimation unit that obtains sound source direction candidates from the arrival time differences and classifies them. The sound source direction estimation unit includes a sparsity determination unit that determines, for each frequency bin of the arrival time differences, whether sparsity can be assumed, obtains sound source direction candidates only from the arrival time differences of frequency bins for which sparsity can be assumed, and classifies those candidates. [Selected figure] Figure 1

[0001] The present invention relates to a sound source direction estimation device for detecting the direction of a speaker, used in videophone calls, audio conferences, and the like.

[0002] For example, Non-Patent Document 1 discloses a sound source direction estimation method used in conventional audio conferencing and the like.
04-05-2019 1

The method estimates the directions Sn of N (N ≥ 2) different sound sources using a microphone array consisting of three microphones 1, 2, 3 arranged at the vertices of an equilateral triangle, as shown in FIG. 13. The operation of the sound source direction estimation apparatus 300 will be described with reference to FIG. 14.

[0003] The sound source direction estimation device 300 includes the three microphones 1, 2, 3 disposed at the vertices of an equilateral triangle, frequency conversion units 11, 12, 13, arrival time difference calculation units 21, 22, 23, and a sound source direction estimation unit 150. A signal xi(n) at time sample n received by the microphones 1, 2, 3 is input to the frequency conversion units 11, 12, 13 and converted into a frequency-domain signal Xi(ω, m) obtained for each frame, a frame being a set of a plurality of time samples. Here, m and ω respectively denote the number of the signal frame subjected to the frequency conversion and the frequency of the converted signal. The frequency-converted microphone reception signals are input to the arrival time difference calculation units 21, 22, 23. The arrival time difference calculation units 21, 22, 23 evaluate Equation (1) for each of the three combinations of different microphone pairs and output the arrival time differences τij(ω, m) (i, j = 1, 2, 3, i ≠ j), where i and j denote microphone numbers.

[0004]

[0005] The arrival time difference τij is input to the sound source direction estimation unit 150, and the estimated sound source direction θn^ is output. (In the drawings, the symbol ^ is written directly above the character.) An example of the functional configuration of the sound source direction estimation unit 150 is shown in FIG. 15. The sound source direction estimation unit 150 includes a vectorization unit 151, a sound source direction calculation unit 152, and a histogram calculation unit 153.
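Equation (1) itself is not reproduced in this machine translation. A common frequency-domain formulation, shown here only as a sketch (the function name `tdoa_per_bin` is illustrative, not from the patent), computes each bin's arrival time difference from the phase of the cross-spectrum of the two microphone spectra:

```python
import numpy as np

def tdoa_per_bin(Xi, Xj, fs, nfft):
    """Per-bin arrival time difference between two microphone spectra.

    Xi, Xj : complex STFT coefficients of microphones i and j for one
             frame m (positive-frequency bins, length nfft // 2).
    fs     : sampling frequency in Hz.
    Returns tau_ij(omega, m) in seconds for each bin: the cross-spectrum
    phase divided by the bin's angular frequency (a plausible form of
    the patent's Eq. (1), not its verbatim content).
    """
    # Angular frequency of each bin in rad/s; DC is skipped.
    omega = 2 * np.pi * fs * np.arange(1, len(Xi) + 1) / nfft
    cross = Xi * np.conj(Xj)          # cross-spectrum of the pair
    return np.angle(cross) / omega    # phase delay -> time difference
```

For a pure delay between the two channels, every bin for which the phase does not wrap returns the same time difference; bins containing a mixture of sources return inconsistent values, which is exactly what the sparsity determination later exploits.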
The vectorization unit 151 receives the arrival time differences τ12(ω, m), τ23(ω, m) and τ31(ω, m) output by the arrival time difference calculation units 21, 22, 23 and outputs the arrival time difference vector t(ω, m) shown in Expression (2). That is, the vectorization unit 151 simply arranges the input arrival time differences τij(ω, m) into a vector.

[0006]

[0007] The sound source direction calculation unit 152 multiplies the input arrival time difference vector t(ω, m) from the left by the coordinate transformation matrix D given by Equation (4), as in Equation (3), and obtains the sound source direction candidate θ′(ω, m) from the first and second elements of the result by the calculation of Equation (5).

[0008]

[0009] The histogram calculation unit 153 obtains a histogram from the input sound source direction candidates θ′(ω, m) and outputs the directions giving the peaks of the histogram as the sound source direction estimates θa^ (a = 1, ..., A′), where A′ is the maximum number of simultaneous sound sources given in advance.

[0010] Here, the histogram is calculated by classifying all the sound source direction candidates θ′(ω, m) obtained in the frequency bins of a plurality of consecutive frames into bins of a predetermined angle width. As the number of frames used when obtaining the histogram, a number of frames corresponding to a length of time over which the sound source does not move is selected.
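Equations (2)–(5) are likewise not reproduced in the translation. The sketch below assumes the matrix D of Eq. (4) is the pseudo-inverse of the matrix mapping the source direction to the theoretical delay vector, derived here from an assumed geometry (side length `d_side = 0.10` m and `c = 340` m/s are illustrative values, not from the patent); the direction candidate then follows from the first and second elements via atan2, in the manner of Eqs. (3) and (5):

```python
import numpy as np

d_side = 0.10            # mic spacing in meters (illustrative)
c = 340.0                # speed of sound in m/s
r = d_side / np.sqrt(3)  # circumradius of the equilateral triangle

# Mic 1 in the 0-degree direction (cf. FIG. 13); mics 2 and 3 at 120/240 deg.
ang = np.deg2rad([0.0, 120.0, 240.0])
P = r * np.stack([np.cos(ang), np.sin(ang)], axis=1)  # 3 x 2 mic positions

# Rows follow the pair order (1,2), (2,3), (3,1) of tau_12, tau_23, tau_31,
# so that t(theta) = M @ u for a unit direction vector u.
M = np.array([P[1] - P[0], P[2] - P[1], P[0] - P[2]]) / c
D = np.linalg.pinv(M)    # an assumed form of the matrix D of Eq. (4)

def direction_candidate(t_vec):
    """Direction candidate theta' in degrees from a delay vector t(omega, m):
    left-multiply by D, then atan2 on the two resulting elements."""
    u = D @ t_vec
    return np.rad2deg(np.arctan2(u[1], u[0])) % 360.0
```

Because M has full column rank, a noise-free delay vector generated by a single source is mapped back exactly to that source's direction; mixed-source bins yield delay vectors off the single-source manifold and hence scattered candidates.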
For example, when the frame length is 16 ms and it is considered that the sound source does not move for about 0.5 seconds, the histogram is calculated using the sound source direction candidates θ′(ω, m) obtained in each of 30 frames. Assuming that the sampling frequency of the signal is 16 kHz and the frequency conversion method is, for example, a short-time Fourier transform using 256-point data, the number of sound source direction candidates θ′(ω, m) equals the number of frequency bins, 3840 (128 × 30).

[0011] Masao Matsuo, Yusuke Hioka and Nozomu Hamada, "Estimating DOA of multiple speech signals by improved histogram mapping method," Proceedings of IWAENC 2005, pp. 129-132.

[0012] The conventional method performs its processing under the assumption that, when the sound source signals are non-stationary signals such as speech whose components concentrate at specific frequencies, any frequency bin at any time contains the component of only one of the plural sound sources, that is, that there is sparsity in the time-frequency domain.

[0013] [What is sparsity] Here, a signal is said to be sparse when the energy of the target signal is concentrated in a small part of the region of interest (in many cases, the time-frequency region) and is 0 in most other regions.

[0014] However, in general, as the number of sound sources increases, the assumption of signal sparsity breaks down, so that the conventional technology cannot estimate the sound source directions with sufficient accuracy. For example, when speakers located in different directions speak at the same time, the estimation accuracy of the sound source direction is degraded.
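The histogram procedure of [0010] can be sketched as follows. The 5-degree bin width and the two-peak readout are illustrative choices (the patent only says "predetermined angle widths" and "maximum number of simultaneous sound sources A′"):

```python
import numpy as np

def estimate_directions(candidates, bin_width=5.0, n_sources=2):
    """Classify direction candidates theta'(omega, m) into angle bins and
    return the centers of the bins with the highest counts.

    candidates : 1-D array of angles in [0, 360), collected over the
                 frequency bins of several consecutive frames
                 (e.g. 128 bins x 30 frames = 3840 candidates).
    n_sources  : the maximum number of simultaneous sources A'.
    """
    edges = np.arange(0.0, 360.0 + bin_width, bin_width)
    hist, _ = np.histogram(candidates, bins=edges)
    top = np.argsort(hist)[::-1][:n_sources]   # peaks of the histogram
    return sorted((edges[top] + bin_width / 2).tolist())
```

With candidates clustered around two true directions, the returned bin centers approximate the two source directions; the quality of those peaks is exactly what degrades when non-sparse bins contaminate the histogram.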
Further, in an actual environment, sounds other than voice often occur, and many of them, such as the sound of an air conditioner or the fan of a personal computer, are stationary signals whose components spread over a wide frequency range. Since such sounds cannot be assumed to be sparse, superimposing them on the target sound causes the estimation accuracy of the sound source direction to deteriorate further.

[0015] The present invention has been made in view of this point, and its object is to provide a sound source direction estimation device, a method, and a program therefor capable of accurately estimating the directions of speakers located in different directions even when they speak at the same time.

[0016] The sound source direction estimation apparatus according to the present invention comprises a microphone array consisting of three microphones arranged at the vertices of an equilateral triangle, frequency conversion units that convert the signals received by the microphones of the microphone array into signals in the frequency domain, arrival time difference calculation units that calculate an arrival time difference for each combination of different microphone pairs, and a sound source direction estimation unit that obtains sound source direction candidates from the arrival time differences and classifies them. The sound source direction estimation unit includes a sparsity determination unit that determines, for each frequency bin of the arrival time differences, whether sparsity can be assumed, obtains sound source direction candidates only from the arrival time differences of frequency bins for which sparsity can be assumed, and classifies those candidates.
[0017] According to the present invention, the sparsity determination unit removes the arrival time differences of frequency bins for which the sparsity of the sound sources cannot be assumed, and sound source candidates are obtained from the arrival time differences of the remaining frequency bins, for which sparsity can be assumed. Therefore, even if speakers at different positions speak at the same time, the sound source direction estimation apparatus according to the present invention excludes the arrival time differences of frequency bins in which both voices are mixed, and estimates each direction based on arrival time differences originating from a single sound source. Consequently, sound source direction estimation can be performed with high accuracy.

[0018] [Brief description of the drawings]
FIG. 1 is a diagram showing a functional configuration example of a sound source direction estimation apparatus 100 of this invention.
FIG. 2 is a diagram showing an operation flow of the sound source direction estimation apparatus 100.
FIG. 3 is a diagram showing an example of a functional configuration of a sound source direction estimation unit 30.
FIG. 4 is a diagram showing an example of a functional configuration of a sparsity determination unit 34.
FIG. 5 is a diagram showing an operation flow of the sparsity determination unit 34.
FIG. 6 is a diagram showing an example of an arrival time difference vector and arrival time difference orthonormal vectors.
FIG. 7 is a diagram showing an example of the vector orthogonality P(θ) when there are multiple sound sources.
FIG. 8 is a diagram showing an example of the vector orthogonality P(θ) when the number of sound sources is one.
FIG. 9 is a diagram showing an example of the functional configuration of a sparsity determination unit 34′.
FIG. 10 is a diagram showing the operation flow of the sparsity determination unit 34′.
FIG. 11 is a diagram showing an example of the result of estimating the sound source directions with the conventional sound source direction estimation apparatus 300.
FIG. 12 is a diagram showing an example of the result of estimating the sound source directions with the sound source direction estimation apparatus 100 of this invention.
FIG. 13 is a diagram showing the plane of the microphone array.
FIG. 14 is a diagram showing a functional configuration example of the conventional sound source direction estimation apparatus 300.
FIG. 15 is a diagram showing a functional configuration example of the conventional sound source direction estimation unit 150.

[0019] Hereinafter, embodiments of the present invention will be described with reference to the drawings. The same reference numerals are given to the same parts in the plural drawings, and duplicate description is omitted.

[0020] FIG. 1 shows an example of the functional configuration of a sound source direction estimation apparatus 100 according to the present invention. The sound source direction estimation apparatus 100 includes a microphone array consisting of three microphones, frequency conversion units 11, 12, 13, arrival time difference calculation units 21, 22, 23, and a sound source direction estimation unit 30. The sound source direction estimation apparatus 100 differs from the sound source direction estimation apparatus 300 described in the prior art only in that the sound source direction estimation unit 30 includes a sparsity determination unit 34 and in the processing procedure that uses its determination result.

[0021] The parts whose operation is the same as that of the conventional sound source direction estimation apparatus 300 will be described briefly with reference to the operation flow of FIG. 2. The frequency conversion units 11, 12, 13 convert the signals received by the microphones 1, 2, 3 into signals in the frequency domain (step S11).
The arrival time difference calculation units 21, 22, 23 calculate the arrival time differences τij(ω, m) (τ12(ω, m), τ23(ω, m), τ31(ω, m)) for each combination of different microphone pairs among the microphones 1, 2, 3 (step S21). The sound source direction estimation unit 30 obtains sound source direction candidates θ′(ω, m) from the arrival time differences τij(ω, m) and classifies them (step S30).

[0022] The sound source direction estimation apparatus 100 according to the present invention is novel in that the sound source direction estimation unit 30 includes the sparsity determination unit 34, which determines for each frequency bin of the arrival time differences τij(ω, m) whether sparsity can be assumed. The sound source direction estimation unit 30 obtains sound source direction candidates only from the arrival time differences τij(ω, m) of the frequency bins for which the sparsity determination unit 34 has determined that sparsity can be assumed, and classifies them (step S30). The sparsity determination is performed for each frame m and each frequency bin ω. Therefore, even if speakers at different positions speak simultaneously, the arrival time differences τij(ω, m) of the frequency bins in which both voices are mixed are excluded, so that each sound source direction can be estimated with high accuracy.

[0023] FIG. 3 shows an example of the functional configuration of the sound source direction estimation unit 30. The sound source direction estimation unit 30 includes a vectorization unit 151, a sparsity determination unit 34, a sound source direction calculation unit 152′, and a histogram calculation unit 153. As is apparent from comparison with the functional configuration example (FIG.
15) of the sound source direction estimation apparatus 300 according to the prior art, the sound source direction estimation unit 30 differs from the conventional sound source direction estimation unit 150 in that the sparsity determination unit 34 is placed between the vectorization unit 151 and the sound source direction calculation unit 152′, and in that the sound source direction calculation unit 152′ calculates the sound source direction with reference to its determination result.

[0024] An example of the functional configuration of the sparsity determination unit 34 of this embodiment is shown in FIG. 4, and its operation flow is shown in FIG. 5. The sparsity determination unit 34 includes an orthogonal matrix calculation unit 35, a vector orthogonality calculation unit 36, and an orthogonality determination unit 38. The orthogonal matrix calculation unit 35 receives the arrival time difference vector t(ω, m) output from the vectorization unit 151 as input, and outputs two arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) orthogonal to the arrival time difference vector t(ω, m) (step S35). These orthonormal vectors can be determined, for example, by Gram-Schmidt orthonormalization (see G. Strang, "Linear Algebra and Its Applications," Industrial Books, pp. 141-143).

[0025] The arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) are input to the vector orthogonality calculation unit 36, which determines their orthogonality to the theoretical value te(θ) of the arrival time difference vector (step S36). The theoretical value te(θ) of the arrival time difference vector can be calculated by Equation (6).

[0026]

[0027] Here, d is the length of one side of the triangle formed by the microphones 1, 2, 3 arranged at its vertices (see FIG. 13), and c is the speed of sound.
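The Gram-Schmidt orthonormalization of step S35 can be sketched as follows. This is a generic construction of two orthonormal vectors orthogonal to a given vector in R³ (in the spirit of the cited Strang reference), not the patent's exact procedure:

```python
import numpy as np

def orthonormal_complement(t):
    """Two orthonormal vectors t_perp1, t_perp2 orthogonal to t in R^3,
    obtained by Gram-Schmidt on the standard basis: start from t itself,
    then orthogonalize e1, e2, e3 against what has been accepted so far."""
    t = np.asarray(t, dtype=float)
    basis = [t / np.linalg.norm(t)]
    for e in np.eye(3):
        # Subtract the projections onto all previously accepted vectors.
        v = e - sum((e @ b) * b for b in basis)
        if np.linalg.norm(v) > 1e-10:   # skip nearly dependent candidates
            basis.append(v / np.linalg.norm(v))
        if len(basis) == 3:
            break
    return basis[1], basis[2]
```

Any pair spanning the orthogonal complement of t(ω, m) works equally well for the orthogonality test that follows; Gram-Schmidt is simply one convenient way to produce such a pair.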
Thus, te(θ) is a theoretical value that can be calculated independently of the measured values. As the theoretical value te(θ) of the arrival time difference vector, values recorded in the recording unit 37 may be read out sequentially as shown in FIG. 4, or values recorded in advance in the vector orthogonality calculation unit 36 may be used.

[0028] Here, the purpose of obtaining the two arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) orthogonal to the arrival time difference vector t(ω, m) will be explained. FIG. 6 shows the arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) for an arbitrary arrival time difference vector t(ω, m). To determine the direction of this arrival time difference vector t(ω, m), it suffices to check whether a vector whose direction is known is orthogonal to the arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m). If it is orthogonal to both, the direction of the arrival time difference vector t(ω, m) is the same as the direction of the known vector.

[0029] The vector orthogonality calculation unit 36 calculates the orthogonality P(θ) between the arrival time difference orthonormal vectors t⊥1(ω, m), t⊥2(ω, m) and the theoretical arrival time difference vector te(θ) using Equation (7) (step S36).

[0030]

[0031] Equation (7) is evaluated, for the arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) corresponding to each arrival time difference vector t(ω, m), against the theoretical arrival time difference vectors te(θ) for all directions from 0 to 359 degrees. Since the direction of the theoretical arrival time difference vector te(θ) in Equation (7) is known, when the theoretical value and the arrival time difference orthonormal vectors t⊥1(ω, m), t⊥2(ω, m) are orthogonal to each other, the first and second terms of the denominator of Equation (7) each become 0.
In that case the orthogonality P(θ) takes a large value. Conversely, at an angle different from the theoretical one, the first and second terms of the denominator of Equation (7) take finite values, so the orthogonality P(θ) becomes small.

[0032] In this way, by obtaining the arrival time difference orthonormal vectors t⊥1(ω, m) and t⊥2(ω, m) orthogonal to the arrival time difference vector t(ω, m) and evaluating whether they are orthogonal to the theoretical arrival time difference vector te(θ), it can be determined whether the arrival time difference vector t(ω, m) is a vector produced by a single sound source or a vector produced by a mixture of the signals of several sound sources.

[0033] Specific examples of the orthogonality P(θ) calculated by Equation (7) are shown in FIG. 7 and FIG. 8. The horizontal axis is the arrival direction of the signal in degrees, and the vertical axis is the maximum vector orthogonality max P(θ). Here, the 0-degree direction is the direction of the microphone 1 viewed from the center of the microphone array when the microphone array is placed on a desk (FIG. 13). FIG. 7 shows the maximum vector orthogonality max P(θ) at each angle when the angle between the sound source 1, located at an angle of 10 degrees, and another sound source 2 is changed from 0 to 360 degrees. The maximum vector orthogonality max P(θ) shows a large value of about 32 only when the angles of the sound source 1 and the sound source 2 coincide, and a small value of about 12 or less in the other directions.

[0034] FIG. 8 shows the maximum vector orthogonality max P(θ) when the angle of the sound source is changed from 0 to 360 degrees with only one sound source present. The maximum vector orthogonality max P(θ) for all signal arrival directions indicates the same magnitude (about 32) as at the angle of 10 degrees in FIG. 7.
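Since Equations (6) and (7) are not reproduced in the translation, the following sketch uses a plausible form of each: te(θ) is built directly from an assumed microphone geometry, and the orthogonality measure grows when te(θ) is orthogonal to both t⊥1 and t⊥2. The side length, speed of sound, normalization, and the small regularizer `eps` are all assumptions, so the absolute values will not match the ~32 reported for FIGs. 7 and 8, but the large-at-match, small-elsewhere behavior is the same:

```python
import numpy as np

def te_theoretical(theta_deg, d=0.10, c=340.0):
    """A plausible form of Eq. (6): theoretical delay vector
    [tau_12, tau_23, tau_31] for a far-field source at angle theta_deg,
    for mics at the vertices of an equilateral triangle of side d
    (mic 1 in the 0-degree direction, cf. FIG. 13)."""
    r = d / np.sqrt(3)
    ang = np.deg2rad([0.0, 120.0, 240.0])
    P = r * np.stack([np.cos(ang), np.sin(ang)], axis=1)
    th = np.deg2rad(theta_deg)
    u = np.array([np.cos(th), np.sin(th)])
    return np.array([(P[1] - P[0]) @ u,
                     (P[2] - P[1]) @ u,
                     (P[0] - P[2]) @ u]) / c

def orthogonality(te, t_perp1, t_perp2, eps=1e-3):
    """One plausible reading of Eq. (7): large when te(theta) is
    orthogonal to both t_perp1 and t_perp2; te is normalized and eps
    keeps the value finite at an exact match."""
    te = te / np.linalg.norm(te)
    return 1.0 / (abs(te @ t_perp1) ** 2 + abs(te @ t_perp2) ** 2 + eps)
```

Scanning θ from 0 to 359 degrees and taking the maximum of this quantity reproduces the peak-at-the-true-direction behavior described for FIGs. 7 and 8.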
[0035] The orthogonality determination unit 38 determines the orthogonality of the arrival time difference vector t(ω, m) by comparing the orthogonality P(θ) with a threshold value Th (step S38). An arrival time difference vector t(ω, m) with high orthogonality is a vector from a sound source at one fixed position, that is, an arrival time difference vector t(ω, m) for which sparsity can be assumed. Conversely, sparsity cannot be assumed for an arrival time difference vector t(ω, m) with a small orthogonality P(θ).

[0036] As shown in Equation (8), whether or not sparsity can be assumed is determined with the threshold value Th set, for example, to 15 (step S380).

[0037]

[0038] If the orthogonality P(θ) is larger than Th = 15, the sparsity determination result NJ(ω, m) is set to 1 (step S382); if it is smaller, NJ(ω, m) is set to 0 (step S381). The arrival time difference vector t(ω, m) is updated (step S384) until the determination has been made for all arrival time difference vectors t(ω, m) (Y in step S383). Thus, the sparsity of the arrival time difference vectors t(ω, m) of all frames m and frequency bins ω is determined.

[0039] The sound source direction calculation unit 152′ refers to the sparsity determination result NJ(ω, m) and calculates the sound source direction candidate θ′(ω, m) of Equation (5) only for the arrival time difference vectors t(ω, m) with NJ(ω, m) = 1, and outputs it to the histogram calculation unit 153. The calculation of the sound source direction candidates θ′(ω, m), and the operation of the histogram calculation unit 153 of obtaining a histogram and taking the angles giving its peak values as the sound source directions, are the same as in the prior art.
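The thresholding of steps S380–S382 can be sketched directly; the threshold value 15 is the example given in the text, and the sample P values are merely illustrative numbers in the range reported for FIGs. 7 and 8:

```python
import numpy as np

TH = 15.0  # the example threshold Th of Eq. (8)

def sparsity_mask(P_max):
    """Sparsity determination NJ(omega, m) of steps S380-S382:
    1 where the peak orthogonality max P(theta) exceeds Th, else 0.

    P_max : array of max-over-theta orthogonality values, one entry per
            (frequency bin omega, frame m) pair."""
    return (np.asarray(P_max) > TH).astype(int)

# Illustrative values: single-source bins near 32, mixed bins at 12 or less.
NJ = sparsity_mask([32.0, 12.0, 8.0, 31.5])   # -> [1, 0, 0, 1]
```

The sound source direction calculation unit 152′ would then evaluate θ′(ω, m) only where NJ(ω, m) = 1, so that only single-source bins contribute to the histogram.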
[0040] As described above, since the sound source direction estimation apparatus 100 estimates the sound source directions using only the arrival time difference vectors t(ω, m) of the frequency bins for which sparsity can be assumed, each sound source direction can be estimated accurately even when speakers at different positions speak at the same time. Although the sparsity determination has been described using the method of obtaining orthonormal vectors to the arrival time difference vector, the present invention is not limited to this method. Another embodiment of the sparsity determination method is described next.

[0041] The sparsity determination method of the second embodiment evaluates sparsity by evaluating the difference between the directions of the arrival time difference vector t(ω, m) and the theoretical arrival time difference vector te(θ). FIG. 9 shows an example of the functional configuration of the sparsity determination unit 34′ of the second embodiment. The sparsity determination unit 34′ includes an inter-vector distance calculation unit 90 and a vector match determination unit 91.

[0042] The inter-vector distance calculation unit 90 receives the arrival time difference vector t(ω, m) as input, normalizes it by its own magnitude, and calculates, by Equation (9), the distance P′(θ): the absolute value of the result of subtracting from it the theoretical value te(θ) of the arrival time difference vector normalized by its own magnitude.

[0043]

[0044] Here, |te(θ)| is the magnitude of the theoretical arrival time difference vector calculated by Equation (6). As the theoretical value te(θ) of the arrival time difference vector, values recorded in the recording unit 37′ may be read out sequentially as shown in FIG.
9, or values recorded in advance in the inter-vector distance calculation unit 90 may be used.

[0045] The distance P′(θ) becomes 0 when the direction of the arrival time difference vector t(ω, m) matches the direction of the theoretical value te(θ) of the arrival time difference vector. Therefore, from the magnitude of this value it can be determined whether the arrival time difference vector t(ω, m) is a vector from one sound source or a vector influenced by other sound sources. That is, whether or not the arrival time difference vector t(ω, m) is one for which sparsity can be assumed can be determined from the magnitude of the distance P′(θ).

[0046] In the second embodiment, the magnitude of the distance P′(θ) is judged by the vector match determination unit 91 (step S91). Contrary to the first embodiment, the smaller the distance P′(θ), the more likely it is that the arrival time difference vector t(ω, m) is one for which sparsity can be assumed. The other processes are the same as in the first embodiment. In this way, too, the presence or absence of sparsity of the arrival time difference vector t(ω, m) can be determined.

[0047] [Simulation results] In order to confirm the effects of the present invention, the sound source direction estimation performance of the conventional sound source direction estimation apparatus 300 and that of the sound source direction estimation apparatus 100 of the present invention were compared. The simulation was carried out under the condition that the sound sources were a male speaker located in the direction of 10 degrees and a female speaker located in the direction of 20 degrees, with white noise having no sparsity superimposed at an SN ratio of 10 dB.

[0048] The resulting histograms are shown in FIG. 11 and FIG. 12. The horizontal axis is the arrival direction of the signal in degrees, and the vertical axis is the frequency. FIG.
11 is the histogram obtained by the conventional sound source direction estimation apparatus 300. The peaks of the histogram are offset toward the directions of 5 degrees and 15 degrees. FIG. 12 is the histogram obtained by the sound source direction estimation apparatus 100 of the present invention. Two distinct peaks occur correctly in the directions of 10 degrees and 20 degrees, and the peaks stand out clearly in comparison with FIG. 11. It was thus confirmed that the sound source direction estimation accuracy of the sound source direction estimation apparatus 100 of the present invention is high.

[0049] The sound source direction estimation apparatus and method of the present invention described above are not limited to the above-described embodiments, and can be modified as appropriate without departing from the spirit of the present invention. For example, the processes described for the above apparatus and method need not only be performed chronologically in the order of description; they may also be performed in parallel or individually, depending on the processing capability of the apparatus performing the processing or as the need arises.

[0050] Further, when the processing means in the above-mentioned apparatus are realized by a computer, the processing content of the functions that each apparatus should have is described by a program. Then, by executing this program on the computer, the processing means of each apparatus are realized on the computer.

[0051] The program describing the processing content can be recorded in a computer-readable recording medium. As the computer-readable recording medium, any medium such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory may be used.
Specifically, for example, a hard disk drive, a flexible disk, a magnetic tape or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (Rewritable) or the like as the optical disc; an MO (Magneto-Optical Disc) or the like as the magneto-optical recording medium; and a flash memory or the like as the semiconductor memory.

[0052] Further, this program is distributed, for example, by selling, transferring or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Furthermore, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

[0053] Further, each means may be configured by executing a predetermined program on a computer, or at least a part of the processing content may be realized as hardware.