Patent Translate
Powered by EPO and Google

Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

10-05-2019

DESCRIPTION JPH11304906

[0001] BACKGROUND OF THE INVENTION The present invention relates to a method of estimating the position of a sound source from signals received by a plurality of microphones (hereinafter sometimes referred to simply as "microphones") in an automatic speaker-tracking camera, an automatic speaker-tracking directional sound collector, and the like, and to a recording medium therefor.

[0002] 2. Description of the Related Art In this section, the principle of sound source position estimation is described first, followed by the delay-sum method and the cross-correlation method, which are conventional sound source position estimation methods.

[0003] §1. Preliminary Description First, the conditions necessary for estimating the sound source position are described. Estimating the position of a sound source from signals received by a plurality of microphones is basically equivalent to identifying, by the congruence conditions of triangles, three or more triangles in space that have the sound source as a common vertex. Each triangle has one side on a straight line passing through two microphones, and the sound source is at the vertex opposite that side. There are three congruence conditions for triangles: three sides equal, two sides and the included angle equal, and two angles and the included side equal. All of these conditions require the length of at least one side. In sound source position estimation the positional relationship of the microphones is known, so this side is taken on a straight line passing through two or more microphones.
Now, for simplicity, consider the relationship between these congruence conditions and position estimation when the sound source lies in a plane.

[0004] Three sides equal: this condition means that a triangle is uniquely determined if the lengths of its three sides are known. In this case the vertices of the triangle are the sound source and two microphones, and the distances between them are the lengths of the three sides.

[0005] Two angles and the included side equal: this condition means that a triangle is uniquely determined if two angles and the length of the side between them are known. When this condition is used, three microphones are arranged in a straight line and the included side is taken on this line, with its two end points at the midpoints between the central microphone and the microphones to its left and right. The remaining vertex is the sound source. The two angles in the condition are then the angles at which the sound source is seen from the right and center microphones and from the left and center microphones.

[0006] Two sides and the included angle equal: this condition means that a triangle is uniquely determined if two sides and the angle between them are known. Of the two sides, one is taken on the straight line connecting two microphones, with its end points at the midpoint of the two microphones and at one of the microphones. The other side has the sound source and the microphone as its end points.

[0007] Next, the symbols needed to describe the conventional methods are defined. FIG. 4 is a diagram explaining how sound waves are received by a plurality of microphones, where 11 is a sound source, 12-1 to 12-M are microphones, 13 is an A/D converter, s(k) is the sound source signal at time k, M is the total number of microphones, d(m) (m = 1, 2, ..., M) is the distance between the sound source and the m-th microphone, and x(m, k) is the received signal of the m-th microphone at time k.
In the present specification, time is expressed as discrete time and is represented by the integer k. Usually, in addition to the sound that reaches a microphone directly from the sound source, there is reflected sound that reaches the microphone after being reflected by walls, the floor, and so on, but reflected sound is not considered in this explanation. It is further assumed that the positions of the microphones 12-1 to 12-M are known.

[0008] The sound velocity divided by the sampling frequency is called the normalized sound velocity and is denoted by c. The received signal x(m, k) of the m-th microphone at time k is equal to the sound source signal d(m)/c earlier, so the following equation (1) holds.

x(m, k) = s(k - d(m)/c) ... (1)

[0009] Equation (1) shows that the distance between the sound source and a microphone is converted into the time difference between the sound source signal and the signal received at that microphone. That is, if the time difference is known, the distance to the sound source is known, and the position of the sound source follows from the distances. In addition, if a reference microphone is chosen (microphone 12-1 in this specification) and its received signal is regarded as a delayed sound source signal, then

x(m, k) = s(k - d(m)/c)
= s(k - d(m)/c + d(1)/c - d(1)/c)
= s(k - d(1)/c - (d(m) - d(1))/c)
= x(1, k - (d(m) - d(1))/c) ... (2)

In equation (1) the unknowns are the time differences between the sound source signal and the received signal of each microphone, whereas in equation (2) the unknowns are the time difference between microphone 1 and the sound source and the time differences between microphone 1 and the other microphones.

[0010] FIG. 5 shows a plane wave incident on two microphones and explains what the time difference between the signals of the two microphones means geometrically.
In FIG. 5, the broken line 21 represents an equiphase surface of the sound wave and depicts how the incident sound wave first reaches the microphone 12-1 and then arrives at the microphone 12-2. From FIG. 5, the arrival time difference of the sound wave is obtained by dividing the product of the microphone spacing and the cosine of the incident angle θ by the normalized sound velocity, as in the following equation (3).

arrival time difference = microphone spacing × cos(θ) / c ... (3)

Rearranging equation (3) gives

θ = arccos(c × arrival time difference / microphone spacing) ... (4)

Therefore the incident angle θ can be calculated if the arrival time difference and the microphone spacing are known.

[0011] §2. Conventional sound source position estimation methods There are various conventional sound source position estimation methods, but this specification describes the two that belong to the simplest class, the delay-sum method and the cross-correlation method.

[0012] (Conventional Method 1) Delay-Sum Method In the above classification, the delay-sum method uses the three-sides-equal condition. A plurality of positions where the sound source is likely to be present are assumed in advance, and the one among them that best satisfies a criterion is taken as the sound source position. The criterion is computed not from the distances between the microphones and the sound source themselves but from the arrival time differences between the reference microphone and the other microphones. According to equation (2), the signals received at all the microphones are time-shifted versions of the signal received at the reference microphone.
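Equations (3) and (4) can be sketched in Python as follows. This is a minimal illustration rather than part of the specification: the function name `incident_angle`, the default sound speed (340 m/s) and sampling frequency (16 kHz), and the clamping of the arccos argument against rounding are assumptions made for the sketch. The time difference is expressed in samples, matching the discrete time used in this document.

```python
import math

def incident_angle(time_difference, mic_spacing_m, sound_speed=340.0, fs=16000.0):
    """Incident angle in radians from equation (4).

    time_difference is in samples; c is the normalized sound velocity
    (metres travelled per sample).
    """
    c = sound_speed / fs
    ratio = c * time_difference / mic_spacing_m
    # Assumption: clamp to [-1, 1] so that rounding or measurement error
    # cannot push the argument outside the domain of arccos.
    ratio = max(-1.0, min(1.0, ratio))
    return math.acos(ratio)

# Equation (3) gives the time difference for a 60 degree incident angle
# at 0.5 m spacing; equation (4) should then recover the angle.
c = 340.0 / 16000.0
dt = 0.5 * math.cos(math.radians(60.0)) / c
theta_deg = math.degrees(incident_angle(dt, 0.5))
```

A time difference of zero corresponds to broadside incidence (θ = 90 degrees), and the largest possible time difference, spacing / c, corresponds to θ = 0.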
Consider the signal obtained by delaying or advancing these signals and adding them. The power of the added signal is maximized when the signal of microphone m (m = 1, 2, ..., M) is advanced by (d(m) - d(1))/c (or delayed by (d(1) - d(m))/c), so that all microphone signals are in phase with x(1, k). In practice a signal cannot be advanced, so x(1, k) is delayed by Dsup (> d(m) - d(1)) and the signals of all the microphones are delayed so as to be in phase with x(1, k - Dsup/c). Thus, in the delay-sum method, the power of the signal obtained by delaying and adding the received signals is taken as the criterion. Prepared sets of delays are applied to the received signals, and the position corresponding to the set of delays that gives the maximum sum-signal power is taken as the sound source position. The specific position estimation procedure is described below.

[0013] FIG. 6 is a diagram explaining the signal flow of the delay-sum method, where 31 is a delay unit, 32 is an adder, D(i, m) (i = 1, 2, ..., I) is, as defined by the following equation (5), the delay of the m-th microphone such that the signals of all the microphones are in phase when the sound source is at the i-th assumed sound source position, I is the number of assumed sound source positions, and y(i, k) is the output signal of the adder corresponding to the delays D(i, m), referred to as the delay sum.

D(i, m) = Dsup - d(i, m)/c ... (5)

where d(i, m) is the distance between the i-th assumed sound source position and microphone m. The minus sign is consistent with advancing the signal of microphone m as described above and with the choice D(i, m) = Dsup - d(m)/c used in the noise analysis.

[0014] The signal x(m, k) received by the microphone 12 is delayed by D(i, m) in the delay unit 31 and then added in the adder 32 to become the output signal y(i, k). The output signal y(i, k) is calculated by the following equation (6).

y(i, k) = Σ x(m, k - D(i, m)) ... (6)

where Σ is taken over the microphone number m. The estimated sound source position is the position corresponding to the index i given by equation (7), that is, the i that maximizes the delay-sum power E[|y(i, k)|^2].
Here, |a|^2 denotes the square of a, and E[a] denotes the average of a. These calculation procedures are shown in FIG.

[0015] (Conventional Method 2) Cross-Correlation Method In the above classification, the cross-correlation method uses the two-angles-and-included-side condition. In the cross-correlation method, the time difference between the signals of the reference microphone and each of the other microphones is taken to be the time difference that gives the maximum value of the cross-correlation function, and the incident angle is obtained from this time difference and the microphone spacing. The cross-correlation function r(τ, 1, m) of the signals of the reference microphone 12-1 and the microphone 12-m is defined by the following equation (8).

r(τ, 1, m) = E[x(1, k) x(m, k + τ)] ... (8)

Since no time difference can exceed (microphone spacing / normalized sound velocity), the cross-correlation function is computed for τ in the range from -(microphone spacing / normalized sound velocity) to +(microphone spacing / normalized sound velocity). If there is no noise, the cross-correlation function takes its maximum value at τ(m) = (d(m) - d(1))/c, so the time difference that gives the maximum value of the cross-correlation function can be regarded as the arrival time difference between the microphones. When cross-correlation is used, a special microphone arrangement is chosen so that the sound source position can be calculated easily. As an example, a method generally called triangulation is described. In triangulation, as shown in FIG. 8, three microphones are arranged in a straight line.
Taking the microphone 12-1 as the reference microphone, the delay times between the signal of the reference microphone and the signals of the microphones 12-2 and 12-3 are determined from the cross-correlation function. Then, from equation (4), these two delay times give the incident angle θ2 of the sound wave for the microphone pair 12-1 and 12-2 and the incident angle θ3 for the microphone pair 12-1 and 12-3. The triangle formed by the sound source, the midpoint of the microphones 12-1 and 12-2, and the midpoint of the microphones 12-1 and 12-3 is uniquely determined by the two angles and the included side, and so the sound source position is determined. The above procedure is shown in FIG.

[0016] §3. Comparison between conventional methods The amount of computation and the noise resistance of the conventional delay-sum and cross-correlation methods are compared. First, the amount of computation excluding the averaging operation is compared. The main computation of the delay-sum method is the calculation of the delay sum y(i, k) (i = 1, 2, ..., I) of equation (6), so its amount of computation is estimated as the product of the number of microphones M and the number I of assumed sound source positions. In the cross-correlation method, on the other hand, the main computation is equation (8), and its amount of computation is estimated as about twice the product of (the number of microphones - 1) and the average microphone spacing divided by the normalized sound velocity.
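The peak search on equation (8) described in the cross-correlation method can be sketched as follows. This is a hedged illustration, not the specification's implementation: the function names, the sample-mean estimate of E[.], the default sound speed and sampling frequency, and the noise-free toy signal are all assumptions. The lag range follows the text, from -(spacing / c) to +(spacing / c).

```python
import math
import random

def cross_correlation(x1, xm, tau):
    """Sample estimate of r(tau, 1, m) = E[x(1, k) x(m, k + tau)], equation (8)."""
    products = [x1[k] * xm[k + tau]
                for k in range(len(x1))
                if 0 <= k + tau < len(xm)]
    return sum(products) / len(products) if products else 0.0

def arrival_time_difference(x1, xm, mic_spacing_m, sound_speed=340.0, fs=16000.0):
    """Lag in samples that maximizes equation (8) within +/- spacing / c."""
    c = sound_speed / fs                    # normalized sound velocity
    max_lag = math.ceil(mic_spacing_m / c)  # no physical lag exceeds this
    return max(range(-max_lag, max_lag + 1),
               key=lambda tau: cross_correlation(x1, xm, tau))

# Noise-free toy data (assumption): microphone m receives the reference
# signal 5 samples later, i.e. (d(m) - d(1)) / c = 5.
rng = random.Random(0)
s = [rng.gauss(0.0, 1.0) for _ in range(500)]
x1 = s
xm = [0.0] * 5 + s[:-5]
tau_hat = arrival_time_difference(x1, xm, mic_spacing_m=0.5)
```

With a 0.5 m spacing at 16 kHz the search covers 49 lags, which is the "about twice the spacing divided by the normalized sound velocity" operation count given in the comparison of the conventional methods.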
Comparing the amounts of computation of the two methods, both are roughly proportional to the number of microphones. They differ in that the delay-sum method is proportional to the number I of assumed sound source positions, whereas the cross-correlation method is proportional to the microphone spacing divided by the normalized sound velocity.

[0017] The number of assumed sound source positions varies with the application, but in applications where a camera is pointed at the speaker's position or the speaker's voice is selectively collected, high position resolution is required and the number reaches tens of thousands. For example, in the case of a speaker-tracking camera, if the speaker's position is searched with the camera at the center over a horizontal angle range of 120 degrees and an elevation range of 30 degrees at 1 degree resolution, and over a distance of 5 m at 50 cm resolution, then approximately I = 120 × 30 × (5 / 0.5) = 36000.

[0018] On the other hand, the microphone spacing of the cross-correlation method divided by the normalized sound velocity is at most about 100. For example, when the microphone spacing is 50 cm, the sampling frequency is 16 kHz, and the sound velocity is 340 m/s, 2 × 0.5 / (340 / 16000) = 47. From the above, the amount of computation of the delay-sum method is 100 to 1000 times that of the cross-correlation method.

[0019] Next, the noise resistance is compared. Although not considered in FIG. 4, when the sound source position is actually estimated, room reverberation and background noise are present and degrade the position estimation performance.
If everything other than the direct sound from the sound source is collectively called noise and written n(m, k), the received signal of microphone m at time k is

x(m, k) = s(k - d(m)/c) + n(m, k) ... (9)

[0020] Noise resistance depends on the microphone arrangement and the frequency band of the sound source, so it is hard to state in general. Here we discuss, for y of equation (6) and r of equation (8) when the microphone delay times are set correctly, the ratio of the power of the sound source signal s to the absolute value of the other components. In the case of the delay-sum method, if D(i, m) = Dsup - d(m)/c in equation (6), the sound source signal s is in phase, as in the following equation (10).

[0021] At this time, the power of the delay sum is given by equation (11), in which the power of the sound source signal s appears, the Σ in the fourth term on the second line is taken over the microphone numbers m, m' (m ≠ m'), and the other Σ are taken over m.

[0022] Let us examine the right side of equation (11) in detail. The first term is the power of the sound source signal multiplied by M^2 (M in-phase copies of s are summed), and if there is no noise, only this term appears. The second term is the correlation between the sound source signal and the noise. The noise contains components correlated with the sound source, such as reverberation, and the average value E[s(k - Dsup) n(m, k - D(i, m))] of this correlation is denoted ηsn. The third term is the sum of the noise powers of the microphones. It has no effect on the position estimate, because the noise power of each microphone is constant with respect to the sound source position i = 1, 2, ..., I. Finally, the fourth term is the average of the cross-correlation of the noise, denoted ηnn.

[0023] Based on the above discussion, the meaningful part of the right side of equation (11) can be rewritten as follows.
The ratio of the power of the sound source signal s to the absolute value of the remaining terms is as follows, where abs(x) represents the absolute value of x. From equation (13), it can be seen that the power ratio of the sound source signal s increases in proportion to the number M of microphones.

[0024] In the cross-correlation method, on the other hand, when there is no noise, s is in phase and r(τ, 1, m) is largest at τ = (d(m) - d(1))/c. In the presence of noise, r at τ = (d(m) - d(1))/c is as follows. As with the delay-sum method, the ratio of the power of the sound source signal s in r to the absolute value of the remaining terms is as follows. This value does not depend on the number M of microphones and is the same as that of the delay-sum method for M = 2. Therefore, comparing the two methods, the delay-sum method has a ratio of the sound source signal s that is larger by a factor of M/2 and is superior in noise resistance.

[0025] As described above, the conventional delay-sum method has the problem that its amount of computation is large, and the cross-correlation method has the problem that its noise resistance is inferior to that of the delay-sum method. The present invention solves the above problem of the delay-sum method by preliminarily estimating the sound source position with the cross-correlation method and thereby narrowing the sound source position search range, which makes it possible to obtain the same sound source position estimation performance as the delay-sum method while maintaining the noise resistance.

[0026] According to the present invention, a method of processing signals received by a plurality of microphones and estimating the position of a sound source is characterized by having: a first step of calculating the cross-correlation function of the received signals for all the microphones; a second step of obtaining the time difference that gives the maximum value of the cross-correlation function between the reference microphone and each of the other microphones and setting it as the preliminary estimated time difference; a third step of searching, in the vicinity of the preliminary estimated time difference, for the time difference that maximizes the power of the delay sum for the microphones and setting it as the estimated time difference; and a fourth step of calculating the sound source position based on the estimated time difference.

[0027] DETAILED DESCRIPTION OF THE INVENTION The present invention is described in detail below. As a starting point, the calculation formula of the delay-sum method is shown again. This formula can be expanded as follows, where the Σ in the second term on the second line is taken over the microphone numbers m, m' (m ≠ m') and the other Σ are taken over m. In equation (19), the first term on the right side is the power of the signal of the m-th microphone, which is common to all sound source positions i and does not affect the result of equation (18) for finding the maximum value.

[0028] The second term on the right side is the sum of M(M-1)/2 cross-correlation functions. If each cross-correlation function E[x(m, k - D(i, m)) x(m', k - D(i, m'))] is calculated over the range that D(i, m) - D(i, m') can take when the sound source position i is changed, that is, at most from -(distance between microphones m and m' / normalized sound velocity) to +(distance between microphones m and m' / normalized sound velocity), the required value always lies within it. Thus the power of the delay sum can be evaluated by a sum of cross-correlation functions.
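The expansion described in paragraphs [0027] and [0028] can be written out as follows. This is a reconstruction from the surrounding description (a sum of M signal powers plus twice a sum of M(M-1)/2 cross terms), offered as a sketch and not as the original typeset equation (19). The two-index cross-correlation r(τ, m, m') is an assumed generalization of equation (8) to an arbitrary microphone pair, and the last line assumes the signals are stationary.

```latex
% Reconstruction under stated assumptions; not the original typeset equation.
\begin{align*}
E\left[\,|y(i,k)|^{2}\right]
  &= E\left[\Bigl(\sum_{m=1}^{M} x\bigl(m,\,k - D(i,m)\bigr)\Bigr)^{2}\right] \\
  &= \sum_{m=1}^{M} E\left[x\bigl(m,\,k - D(i,m)\bigr)^{2}\right]
   + 2 \sum_{m < m'} E\left[x\bigl(m,\,k - D(i,m)\bigr)\,
                            x\bigl(m',\,k - D(i,m')\bigr)\right] \\
% Assumed notation: r(\tau, m, m') = E[x(m,k)\,x(m',k+\tau)].
E\left[x\bigl(m,\,k - D(i,m)\bigr)\,x\bigl(m',\,k - D(i,m')\bigr)\right]
  &= r\bigl(D(i,m) - D(i,m'),\; m,\; m'\bigr)
\end{align*}
```

The sum over m < m' has M(M-1)/2 terms, which matches the count given in paragraph [0028]. The last line shows why only the delay difference D(i, m) - D(i, m') matters, so that a cross-correlation function computed once over the physically possible lag range covers every assumed position i.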
[0029] The reason the delay-sum method requires a large amount of computation is that the number of assumed sound source positions must be large to obtain sufficient performance. To solve this problem, the present invention first estimates the time differences between the microphones by the cross-correlation function, which gives a rough estimate of the time difference (that is, of the sound source position), and then, in the vicinity of that time difference, searches for the time difference that maximizes the delay-sum power in the same manner as the delay-sum method. Finally, the position is calculated back from the time difference thus determined.

[0030] Specifically:
Step 1: Calculate the cross-correlation function.
Step 2: Estimate the arrival time difference of the sound wave between the microphone 12-1 and the microphone 12-m by the cross-correlation method. This time difference is denoted τ(m).
Step 3: In the vicinity of the time difference obtained in step 2, that is, within the range
τ(m) - δτ(m) ≤ t(m) ≤ τ(m) + δτ(m) (where δτ(m) > 0) ... (20)
search for the t(m) that maximizes the following expression (21). Typical values of δτ are between 1 and 5.
Σ E[x(m, k - t(m)) x(m', k - t(m'))] ... (21)
Step 4: Determine the sound source position based on the arrival time difference t(m) obtained in step 3. The conversion between time difference and position is carried out by the triangulation described for the cross-correlation method or by solving simultaneous equations for the position.

[0031] The above procedure is shown in FIG. The means for carrying out the present invention is configured, as an example, as the sound source position estimation unit 710 (see FIGS. 2 and 3) described later. Specifically, the sound source position estimation unit 710 is a computer device including a CPU (central processing unit) and its peripheral circuits. The procedure shown in FIG. 1 is stored as a control program in a semiconductor memory (ROM, RAM, etc.) or other recording medium (magnetic disk, etc.) in the sound source position estimation unit 710, and the CPU of the sound source position estimation unit 710 executes the sound source position estimation method of the present invention based on this control program.

[0032] Equation (21) is obtained from the second term on the right side of equation (19) by omitting the coefficient 2 and replacing D(i, m) and D(i, m') with t(m) and t(m'), respectively. In equation (19) the assumed sound source position i is the variable, whereas in equation (21) the delay time t(m) is the variable. As described in the comparison of the conventional methods, the number of assumed sound source positions is several thousand to several tens of thousands. In equation (21), by contrast, the number of delay-time combinations is at most several thousand for M = 4 or so.

[0033] The amount of computation and the noise resistance of the present invention are now described. To estimate the amount of computation, first, the calculation of the cross-correlation function in step 1 requires a number of operations equal to the product of M(M-1) and the average microphone spacing divided by the normalized sound velocity. Sound source position estimation then executes steps 2 to 4, which requires further operations. Let us compare this amount of computation with that of the delay-sum method. For example, with M = 4, δτ(m) = 2, and the other conditions the same as in the example of "§3. Comparison between conventional methods", the delay-sum method requires 36000 × M = 144000 operations, and the present invention requires about one hundredth of that.
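Steps 1 to 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the control program of the specification: all function names are hypothetical, E[.] is estimated by a sample mean, the step 3 search is a plain exhaustive search over the window of expression (20), and the toy data are noise-free. One further assumption concerns sign: the sketch treats t(m) as an arrival time difference relative to the reference microphone and aligns each channel accordingly, so that the pairwise sum peaks at the true arrival time differences implied by equation (2). Step 4 is omitted because it depends on the microphone geometry.

```python
import itertools
import random

def corr(xa, xb, lag):
    """Sample estimate of E[xa(k) xb(k + lag)]."""
    products = [xa[k] * xb[k + lag]
                for k in range(len(xa))
                if 0 <= k + lag < len(xb)]
    return sum(products) / len(products) if products else 0.0

def preliminary_time_differences(signals, max_lag):
    """Steps 1-2: tau(m) that maximizes the cross-correlation with microphone 1."""
    ref = signals[0]
    lags = range(-max_lag, max_lag + 1)
    return [0] + [max(lags, key=lambda lag, x=x: corr(ref, x, lag))
                  for x in signals[1:]]

def refine_time_differences(signals, tau, delta):
    """Step 3: search tau(m) +/- delta for the largest pairwise-correlation sum."""
    n_mics = len(signals)
    windows = [range(tau[m] - delta, tau[m] + delta + 1)
               for m in range(1, n_mics)]

    def score(t):
        # Sum over the M(M-1)/2 microphone pairs, in the spirit of expression (21).
        return sum(corr(signals[a], signals[b], t[b] - t[a])
                   for a, b in itertools.combinations(range(n_mics), 2))

    candidates = ((0,) + rest for rest in itertools.product(*windows))
    return list(max(candidates, key=score))

def delayed(s, d):
    """Copy of s delayed by d samples (advanced if d is negative), zero padded."""
    return [s[k - d] if 0 <= k - d < len(s) else 0.0 for k in range(len(s))]

# Toy data (assumption): four microphones whose arrival time differences
# relative to microphone 1 are 0, 3, -2 and 4 samples.
rng = random.Random(1)
s = [rng.gauss(0.0, 1.0) for _ in range(300)]
signals = [delayed(s, d) for d in (0, 3, -2, 4)]

tau = preliminary_time_differences(signals, max_lag=8)
t = refine_time_differences(signals, tau, delta=2)
```

With M = 4 and δτ = 2 this exhaustive search evaluates 5 × 5 × 5 = 125 delay combinations, which is in line with the statement that the number of delay-time combinations stays within a few thousand for M = 4 or so.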
As for noise resistance, the present invention evaluates the power of the delay sum in the same manner as the delay-sum method, so it has the same noise resistance as the delay-sum method and is superior to the cross-correlation method.

[0034] Supplementary matters for implementing the present invention are described below. The time difference has been described, for the cross-correlation function and elsewhere, as an integer time whose unit is the reciprocal of the sampling frequency, but sufficient resolution may not be obtained with integer time. In such a case, the cross-correlation function is interpolated and steps 2 and 3 are repeated. Interpolation also requires cross-correlation values on either side of the value to be interpolated, so the range over which the cross-correlation is calculated must be extended by the amount used in the interpolation.

[0035] The size of δτ(m) contributes greatly to the amount of computation of the present invention. If a more accurate τ(m) is obtained in step 2, δτ(m) can be chosen smaller, and the amount of computation can be reduced. To obtain a more accurate τ(m) in step 2, the microphones other than the reference microphone can be divided into two or more groups, and the delay-sum method or steps 2 and 3 of the present invention can be applied to each group together with the reference microphone.

[0036] A sound wave with a frequency of about 300 to 500 Hz or less has a longer wavelength, so its amplitude changes less for the same arrival time difference than that of a higher-frequency sound wave, and it therefore carries less information useful for direction estimation.
Nevertheless, the sound source (voice) has large power in this band, and the band overlaps the low-order resonance frequencies of the room, which have long decay times. Reflected sound waves arriving from directions other than that of the sound source, such as from walls or the ceiling, therefore increase and cause errors in the estimation of the sound source direction. For these reasons, the frequency band of the signal used to calculate the cross-correlation function is set to about 300 to 500 Hz or above.

[0037] The embodiment of the present invention has been described in detail with reference to the drawings, but the specific configuration is not limited to this embodiment, and design changes and the like within the scope of the present invention are also included in the present invention.

[0038] DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The first embodiment is an automatic speaker-tracking directional sound collector. FIG. 2 shows the functional configuration of an automatic speaker-tracking directional sound collector to which the method of the present invention is applied. A speech signal is supplied as the input signal x(k, m) to the sound source position estimation unit 710 and the delay unit 31. The output of the delay unit 31 is changed in amplitude by the weight 720 and then added by the adder 32 to become the output y(k) of the automatic speaker-tracking directional sound collector. The sound source position estimation unit 710 estimates the position of the sound source and sends the estimated sound source position to the delay/weight calculation unit 730, and the delay/weight calculation unit 730 determines the delay times and the weights so as to maximize the signal-to-noise ratio of the output y(k).

[0039] The automatic speaker-tracking directional sound collector is a device for selectively collecting only the voice of the speaker in a high speed voice communication system such as a video conference system.
Conventional desktop microphones have the problem that unpleasant sounds, such as paper or a pen striking the desk, are easily picked up, so sound collection by microphones placed on or around walls, ceilings, displays, and the like is required. In that case, the farther the microphones are from the speaker, the lower the signal-to-noise ratio per microphone. To compensate for this, it is necessary to receive the sound with a plurality of microphones and to delay, weight, and add their signals appropriately, and the sound source position must be estimated in order to obtain the appropriate delay times and weights.

[0040] If the estimated sound source position is incorrect, the power of the high-frequency band of the output voice falls and the sound becomes muffled, so more accurate sound source position estimation is required. Compared with the conventional delay-sum method, the present invention requires a much smaller amount of computation, so the resolution can be raised further with the same processing device and the sound source position can be estimated more accurately. Compared with the cross-correlation method, which is also a conventional method, the present invention has superior noise resistance, so the sound source position can again be estimated more accurately. As a result, by using the method of the present invention, higher-quality voice can be received than with the conventional methods.

[0041] The second embodiment is an automatic speaker-tracking video camera. FIG. 3 shows the functional configuration of an automatic speaker-tracking video camera to which the method of the present invention is applied.
An audio signal is supplied to the sound source position estimation unit 710 as the input signal x(k, m). The sound source position estimation unit 710 estimates the position of the sound source and sends the estimated sound source position to the camera control unit 810. The camera control unit 810 issues a control signal to the video camera 820, and the video camera 820 changes its horizontal angle, elevation angle, and zoom according to the control signal.

[0042] The automatic speaker-tracking video camera is a device for automatically bringing the speaker into the field of view of the video camera in a video conference system or the like. In a conference with several participants, a conventional fixed video camera has the problem that it may not be clear who the speaker is. When a person operates the video camera, there is the problem that it takes time and effort. For this reason, there is a need for a video camera that automatically follows the speaker. To fit a person properly into the field of view of a video camera, the position of the sound source must be known with high accuracy. If the estimated sound source position is incorrect, the speaker may fall outside the screen, or the image of the speaker may be too small to serve its purpose.

[0043] Compared with the conventional delay-sum method, the present invention requires a much smaller amount of computation, so the resolution can be raised further with the same processing device and the sound source position can be estimated more accurately. Compared with the cross-correlation method, which is also a conventional method, the present invention has superior noise resistance, so the sound source position can again be estimated more accurately.
As a result, by using the method of the present invention, the speaker can be fitted into the field of view of the video camera more appropriately than with the conventional methods.

[0044] The third embodiment is an abnormal-sound automatic tracking surveillance camera. This embodiment is the second embodiment with the speaker replaced by an abnormal sound source, so it is not described in detail.

[0045] According to the present invention, compared with the conventional delay-sum method, the amount of computation can be reduced while maintaining the sound source estimation accuracy. In other words, given the same arithmetic device, the present invention can estimate the position more accurately by raising the resolution and can expand the sound source search range.
