Patent Translate, powered by EPO and Google. Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2013175869
Abstract: To obtain a direct-to-indirect ratio estimate of an acoustic signal with high accuracy. A power estimate of a frequency domain signal is obtained from the frequency domain signals produced by converting the sound reception signals picked up by a plurality of microphones into the frequency domain. A power estimate of a direct sound suppression signal is also obtained; this signal is produced either by applying, to the frequency domain signal, processing that suppresses signal components arriving from the direct sound source direction, or by applying that suppression processing to the sound reception signal and converting the result into the frequency domain. The power estimate of the direct sound suppression signal is corrected using a directional shape correction coefficient, which is obtained from the function representing the directivity characteristic that suppresses signal components arriving from the direct sound source direction, to give a power estimate of the indirect sound. The power estimate of the frequency domain signal and the power estimate of the indirect sound are then used to obtain a direct-to-indirect ratio estimate, which represents the ratio of the power estimate of the direct sound to the power estimate of the indirect sound.
[Selected figure] Figure 9
Acoustic signal enhancement device, distance determination device, methods thereof, and program
[0001] The present invention relates to an acoustic signal enhancement device, a distance determination device, methods thereof, and a program. It can be applied to, for example, voice calls and hands-free operation of devices by voice input, and is used to emphasize and pick up only the sound of a sound source located within a specific distance range from the microphone.
[0002] In the prior art shown in Patent Document 1, the sound reception signals of a microphone array are converted into the frequency domain and, in order to obtain the direct-to-indirect ratio, the respective powers of the direct sound and the indirect sound are calculated using the spatial correlation matrix (see, for example, paragraphs [0025] to [0039] of the first embodiment).
[0003] Patent Document 1: JP 2011-55211 A
[0004] The method disclosed in Patent Document 1 cannot distinguish between direct sound and indirect sound arriving from the same direction, so all sound arriving from the direction of the direct sound is judged to be direct sound. As a result, the direct sound power may be overestimated (or the indirect sound power underestimated), and the direct-to-indirect ratio finally obtained becomes larger than the true value. The present invention has been made in view of these points, and its object is to provide a technique for accurately obtaining a direct-to-indirect ratio estimate of an acoustic signal.
[0005] According to the present invention, a direct-to-indirect ratio estimate of an acoustic signal is obtained as follows.
A power estimate of a frequency domain signal is obtained from the frequency domain signals produced by converting the sound reception signals picked up by the plurality of microphones of a microphone array into the frequency domain. In addition, a power estimate of a direct sound suppression signal is obtained; this signal is produced either by applying, to the frequency domain signal, processing that suppresses signal components arriving from the direct sound source direction, or by applying that suppression processing to the sound reception signal and converting the result into the frequency domain. Using a directional shape correction coefficient obtained from the function representing the directivity characteristic that suppresses signal components arriving from the direct sound source direction, the power estimate of the direct sound suppression signal is corrected to give a power estimate of the indirect sound. The power estimate of the frequency domain signal and the power estimate of the indirect sound are then used to obtain a direct-to-indirect ratio estimate representing the ratio of the power estimate of the direct sound to the power estimate of the indirect sound.
[0006] In the present invention, the indirect sound arriving from the direct sound source direction is distinguished from the direct sound when the power of the indirect sound is estimated. The direct-to-indirect ratio estimate of the acoustic signal can therefore be obtained more accurately than with the conventional method.
[0007] FIG. 1: A figure showing an example of a scene in which the acoustic signal enhancement device is used.
FIG. 2: A figure illustrating the propagation paths of sound indoors.
FIG. 3: A figure illustrating the relationship between the direct-to-indirect ratio estimate and the distance from the microphone to the sound source.
FIG. 4: A conceptual diagram illustrating the shape of the directivity.
FIG. 5: A conceptual diagram for explaining the directional shape correction coefficient.
FIG. 6: A figure illustrating the functional configuration of the acoustic signal enhancement device.
FIG. 7: A figure illustrating the functional configuration of the processing target signal generation unit.
FIG. 8: A figure illustrating the frequency domain conversion units and the direct-to-indirect ratio calculation unit.
FIG. 9: A figure illustrating the functional configuration of the direct-to-indirect ratio calculation unit.
FIG. 10: A figure illustrating the operation flow of the acoustic signal enhancement device.
FIG. 11: A figure illustrating the functional configuration of the distance determination device.
FIG. 12: A figure for explaining the coordinate system.
[0008] Hereinafter, embodiments of the present invention will be described with reference to the drawings. The same reference numerals are given to the same components in the drawings, and their description will not be repeated. In the following description, symbols such as “¯” and “^” used in the text should properly be written directly above the preceding character, but because of the limitations of text notation they are written immediately after that character. In formulas, these symbols are written in their proper positions.
[0009] [Principle] Before the embodiments are described, the principle underlying each embodiment is explained. In the first embodiment, a single microphone array is used to emphasize and pick up the sound emitted from a direct sound source within a specific distance range from the microphone array. In the second embodiment, the distance between the sound source position and the microphone array is determined.
[0010] FIG. 1 illustrates a scene in which the acoustic signal enhancement device of the first embodiment is used. Assume, for example, that a small microphone array 11 is surrounded by four speakers 12 to 14. A television 16, a telephone 17, and a loudspeaker 18 for in-house broadcasting are also arranged in the conference room. In such a scene, the aim is to pick up only the utterances of the speakers 12 to 14 (direct sound sources) located within a predetermined distance range centered on the small microphone array 11 (inside the circle indicated by a broken line), without picking up the in-house broadcast, telephone sounds, and the like.
[0011] In the first embodiment, in order to discriminate the distance from the microphone array to the sound source, attention is paid to the ratio between the power estimate of the direct sound contained in the sound reception signal and the power estimate of the indirect sound (reverberation and the like). Hereinafter, a value representing the ratio of the power estimate of the direct sound to the power estimate of the indirect sound is referred to as the “direct-to-indirect ratio estimate”. For example, the direct-to-indirect ratio estimate may be the power estimate of the direct sound divided by the power estimate of the indirect sound, the power estimate of the indirect sound divided by the power estimate of the direct sound, or a function value of either of these. A power estimate means a value that increases as the power increases. Examples of power estimates are the power, the power spectrum, the power spectral density, monotonically increasing function values of the amplitude, and estimates of these. FIG. 2 shows the propagation paths of sound from the sound source 21 to the microphone 22 when the microphone 22 is placed indoors and sound is recorded.
The direct sound is the sound wave, indicated by a thick solid line, that reaches the microphone 22 directly from the sound source 21. The indirect sound is the sound wave, indicated by broken lines, that reaches the microphone 22 after the sound emitted from the sound source 21 has been reflected by a wall, the floor, the ceiling, or the like.
[0012] FIG. 3 illustrates the relationship between the direct-to-indirect ratio estimate and the distance from the microphone to the sound source. The horizontal axis in FIG. 3 is the distance from the microphone to the sound source, and the vertical axis is the direct-to-indirect ratio estimate. In FIG. 3, the power estimate of the direct sound divided by the power estimate of the indirect sound is used as the direct-to-indirect ratio estimate. In general, the indirect sound has a roughly constant magnitude that does not depend on the distance from the microphone. The magnitude of the direct sound, on the other hand, decreases monotonically as the distance from the sound source to the microphone increases. Therefore the direct-to-indirect ratio estimate obtained by dividing the power estimate of the direct sound by the power estimate of the indirect sound decreases monotonically with increasing distance, just as the magnitude of the direct sound does.
[0013] The acoustic signal enhancement device according to the first embodiment obtains the direct-to-indirect ratio estimate from the sound reception signals and thereby estimates the distance from the microphone array to the source of the direct sound (the direct sound source) contained in the sound reception signals. In this way the acoustic signal enhancement device can identify sound sources lying within a predetermined distance range around the microphone array 11.
The acoustic signal enhancement device adjusts the amplitude of the processing target signal according to the direct-to-indirect ratio estimate, thereby emphasizing the sound emitted from sound sources within the desired distance range and relatively suppressing other sounds (noise). The distance determination device according to the second embodiment obtains the direct-to-indirect ratio estimate from the sound reception signals and determines the distance between the direct sound source position and the microphone array.
[0014] The principle by which the direct-to-indirect ratio estimate can be obtained accurately is explained below.
<Indirect sound isotropic arrival model> The proposed method introduces a signal model that takes the isotropy of the indirect sound into account. Although an example using the power spectral density or its estimate as the power estimate is described here, this does not limit the present invention. Converting the sound reception signal of the $m$-th microphone of a microphone array consisting of $M$ ($M \geq 2$) microphones into the frequency domain, by a short-time Fourier transform or the like, gives the following frequency domain signal $X_m(\omega, t)$.
$$X_m(\omega, t) = \left(H_D^{(m)}(\omega) + H_R^{(m)}(\omega)\right) S(\omega, t) \qquad (1)$$
Here, $\omega$ is the frequency, $H_D^{(m)}(\omega)$ is the transfer function of the direct sound from the direct sound source to the $m$-th microphone, $H_R^{(m)}(\omega)$ is the transfer function of the indirect sound from the direct sound source to the $m$-th microphone, and $S(\omega, t)$ is the signal obtained by converting the sound at the direct sound source into the frequency domain. $t$ is the index of a frame, which is a predetermined time interval, and the frame corresponding to index $t$ is written “frame $t$”.
[0015] Here, it is assumed that the direct sound is coherent, whereas the indirect sound is diffuse because its main component is reverberation.
That is, in terms of direction of arrival, the direct sound arrives only from the direction of the sound source, whereas the indirect sound arrives with uniform power from all directions (a property hereinafter called “isotropy”). The proposed method estimates the indirect sound power by exploiting this difference in spatial arrival characteristics, and from it obtains the direct-to-indirect ratio estimate.
[0016] As preconditions, it is assumed that the arrival direction of the direct sound (hereinafter the “direct sound source direction”) is known, that direct sound and indirect sound arriving from any direction can be regarded as plane waves, and that the direct sound and the indirect sound are uncorrelated with each other. Under these assumptions, the transfer functions $H_D^{(m)}(\omega)$ and $H_R^{(m)}(\omega)$ of the direct sound and the indirect sound from the direct sound source to the $m$-th microphone can be expressed as follows. Here, $H_{D\mathrm{ref}}(\omega)$ is the direct sound component of the transfer function from the direct sound source to a reference point of the microphone array (the “reference point”), and $H_{R\mathrm{ref},\theta}(\omega)$ is the indirect sound component arriving at the reference point from direction $\theta$. The reference point may be inside or outside the microphone array. Inside the microphone array means, for example, on a straight line passing through the microphones that constitute the array, inside a planar region bounded by line segments passing through those microphones, or inside a solid bounded by planes passing through those microphones. Outside the microphone array means any position that is not inside it. For example, the distance between each microphone of the array and the reference point is shorter than the distance between that microphone and the direct sound source.
Examples of the reference point are the center point of the microphone array and the position of any one of the microphones. $\tau_\theta^{(m)}$ is given by the following equation.
$$\tau_\theta^{(m)} = -\frac{1}{c}\, u^{T} p_m$$
Here, the position of the $m$-th microphone is $p_m = [p_{m,x}, p_{m,y}, p_{m,z}]^{T}$, the unit vector representing the arrival direction is $u = [\sin\theta, \cos\theta, 0]^{T}$ as shown in FIGS. 12A and 12B, and $c$ is the propagation velocity of the sound wave. $\tau_\theta^{(m)}$ is the propagation delay, from the reference point to the $m$-th microphone, of sound arriving from direction $\theta$; $\theta_D$ is the direct sound source direction as seen from the reference point; $j$ is the imaginary unit; and $e$ is the base of the natural logarithm. Integration over $\theta$ is performed in the range $0 \leq \theta < 2\pi$ (the same applies to the integrals below).
[0017] That is, each of the transfer functions $H_D^{(m)}(\omega)$ and $H_R^{(m)}(\omega)$ of the direct sound and the indirect sound can be decomposed into a transfer function component from the direct sound source to the reference point and a phase difference component due to the propagation delay from the reference point to the $m$-th microphone. Therefore the microphone array input vector $X(\omega, t) = [X_1(\omega, t), \ldots, X_M(\omega, t)]^{T}$, whose elements are the frequency domain signals $X_m(\omega, t)$ ($m \in \{1, \ldots, M\}$), is expressed by the following equation. Here, $\alpha^{T}$ denotes the transpose of $\alpha$, $S_D(\omega, t) = H_{D\mathrm{ref}}(\omega) S(\omega, t)$, and $S_{R,\theta}(\omega, t) = H_{R\mathrm{ref},\theta}(\omega) S(\omega, t)$. $A_\theta(\omega)$ is an $M$-dimensional vector whose $m$-th element is the transfer function of the path from the reference point to the $m$-th microphone for sound of frequency $\omega$ arriving at the microphone array from direction $\theta$. An example of $A_\theta(\omega)$ is the array manifold vector for direction $\theta$ as seen from the reference point of the microphone array. An example in which $A_\theta(\omega)$ is the array manifold vector is shown below.
[0018] Each element of the array manifold vector depends on the propagation delay $\tau_\theta^{(m)}$.
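The delay equation above can be illustrated with a short sketch. It assumes that each element of the array manifold vector is the pure phase term $e^{-j\omega\tau_\theta^{(m)}}$; that form, the function names, the 4 cm two-microphone geometry, and the sound speed of 340 m/s are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def propagation_delays(mic_positions, theta, c=340.0):
    """tau_theta^(m) = -(1/c) u^T p_m for every microphone m.

    mic_positions: array of shape (M, 3), coordinates relative to the
    reference point. theta is the arrival direction in radians.
    """
    u = np.array([np.sin(theta), np.cos(theta), 0.0])  # unit direction vector
    return -(mic_positions @ u) / c

def manifold_vector(mic_positions, theta, omega, c=340.0):
    """Assumed form of A_theta(omega): one unit-magnitude phase term per mic."""
    tau = propagation_delays(mic_positions, theta, c)
    return np.exp(-1j * omega * tau)

# Hypothetical geometry: two microphones 4 cm apart on the x axis,
# reference point at the array center.
mics = np.array([[-0.02, 0.0, 0.0], [0.02, 0.0, 0.0]])
a = manifold_vector(mics, theta=np.pi / 2, omega=2 * np.pi * 1000.0)
```

Because every element has unit magnitude, only the inter-microphone phase differences carry the directional information, which matches the plane-wave assumption stated in [0016].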
When the direct sound and the indirect sound can be regarded as plane waves, the propagation delay $\tau_\theta^{(m)}$ depends on the direction $\theta$ and on the position of each microphone relative to the reference point of the microphone array. For details of the array manifold vector, see, for example, Reference 1: Futoshi Asano, "Array signal processing of sound: localization, tracking and separation of sound sources" (The Acoustical Society of Japan, Acoustic Technology Series), Corona Publishing Co., Ltd., Feb. 25, 2011, ISBN 978-4-339-01116-6, Chapter 1 (pp. 1-26).
[0019] When an arbitrary beamformer (BF) is applied to this microphone array input, the power spectral density (PSD) $P_{BF}(\omega)$ of its output is as follows. Here, $P_D(\omega) = E[|S_D(\omega, t)|^2]_t$ and $P_{R,\theta}(\omega) = E[|S_{R,\theta}(\omega, t)|^2]_t$. $W(\omega) = [W_1(\omega), \ldots, W_M(\omega)]^{T}$ is the vector whose elements are the filter coefficients $W_1(\omega), \ldots, W_M(\omega)$ (see Chapter 4.1 (pp. 70-71) of Reference 1). $U(\omega)$ is the $M \times M$ matrix whose $(p, q)$ component ($p, q \in \{1, \ldots, M\}$) is $U_{pq}(\omega) = E[X_p(\omega, t) X_q^{*}(\omega, t)]_t$ (the spatial correlation matrix of the microphone array input signals). $E[\alpha(t)]_t$ denotes the expectation of $\alpha(t)$ over $t$, $\alpha^{H}$ denotes the complex conjugate transpose of $\alpha$, and $\alpha^{*}$ denotes the complex conjugate of $\alpha$. $D(\omega, \theta)$ is a function (with $\omega$ and $\theta$ as its domain) representing the directivity characteristic formed by the beamformer. That is, $D(\omega, \theta)$ represents the shape of the directivity formed by the beamformer. For example, $D(\omega, \theta)$ is expressed as follows.
[0020] In a sound field in which the indirect sound can be assumed to arrive isotropically at the microphone array, $P_{R,\theta}(\omega)$ in Equation (4) can be replaced by a value $\bar{P}_R(\omega)$ that does not depend on $\theta$. In this case, Equation (4) can be modified as follows.
[0021] Next, consider a beamformer that suppresses signal components arriving from the direct sound source direction $\theta_D$.
In other words, consider a beamformer whose directivity pattern (see, for example, FIG. 4) has a null (a point of low directional sensitivity) pointed at the direct sound source direction $\theta_D$. Put another way, consider a beamformer that realizes a directivity characteristic with a spatial notch in the direct sound source direction $\theta_D$. Such a beamformer can easily be designed from the information on the direct sound source direction $\theta_D$. For example, a filter represented by the "blocking matrix" described in Chapter 4.6 (pp. 90-97) of Reference 1 can be used as such a beamformer. A beamformer that suppresses signal components arriving from the direct sound source direction $\theta_D$ ideally reduces those components to zero. That is, ideally $D(\omega, \theta_D) = 0$. Denoting the output power spectral density of such an ideal beamformer by $P_{ND}(\omega)$, the following holds from Equation (6).
[0022] If $D(\omega, \theta)$ represented a directivity characteristic that does not suppress the indirect sound component for any $\theta$, $P_{ND}(\omega)$ could be regarded as the power spectral density $P_R(\omega)$ of the indirect sound. However, it is difficult to suppress only the sound arriving from direction $\theta_D$ while obtaining a directivity characteristic that does not suppress signals arriving from directions $\theta \neq \theta_D$ at all. Even if such a characteristic could be obtained, the beamformer assumed here suppresses the sound arriving from the direct sound source direction $\theta_D$, including its indirect sound component, so at least the indirect sound component arriving from direction $\theta_D$ is suppressed. Therefore $D(\omega, \theta)$ cannot be said to represent a directivity characteristic that does not suppress the indirect sound component for all $\theta$.
[0023] Therefore, in the proposed method, a directional shape correction coefficient $R(\omega)$ is obtained from $D(\omega, \theta)$, and $P_{ND}(\omega)$ is corrected with $R(\omega)$ to estimate the power spectral density $P_R(\omega)$ of the indirect sound.
$$P_R(\omega) = R(\omega)\, P_{ND}(\omega) \qquad (9)$$
[0024] For example, taking the maximum value of $|D(\omega, \theta)|^2$ at each frequency $\omega$ to be $\max_{\theta'} |D(\omega, \theta')|^2$, the directional shape correction coefficient $R(\omega)$ may be set as follows (specific example 1 of the directional shape correction coefficient). Here, $\theta'$ means the $\theta$ that maximizes $|D(\omega, \theta)|^2$. The numerator and denominator of Equation (10) are illustrated schematically in the conceptual diagram of the directional shape correction coefficient.
[0025] Alternatively, the directional shape correction coefficient $R(\omega)$ may be set with the average value of $|D(\omega, \theta'')|^2$ ($\theta'' \in \Theta$) over a specific angular region $\Theta$ as the numerator (specific example 2 of the directional shape correction coefficient). Here, $\|\Theta\|$ is a rational number greater than 0 that represents the size of the angular region $\Theta$. For example, $\|\Theta\|$ satisfies the following. Specific examples of the angular region $\Theta$ are: any angular region excluding the direct sound source direction $\theta_D$; an angular region including the direction opposite to the direct sound source direction $\theta_D$; an angular region including the direction $\theta$ that maximizes $|D(\omega, \theta)|^2$; and, among predetermined candidate angular regions, the one that maximizes the average value of $|D(\omega, \theta)|^2$.
[0026] The directional shape correction coefficient $R(\omega)$ may also be obtained by multiplying Equation (10) or Equation (11) by a correction constant (specific example 3 of the directional shape correction coefficient). In this case the correction constant may be frequency-dependent, taking acoustic characteristics into account, or frequency-independent.
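The correction step of Equation (9) can be sketched numerically. Several parts of this example are assumptions rather than the patent's own definitions: the directivity is taken as $D(\omega, \theta) = W^{H}(\omega) A_\theta(\omega)$, the null beamformer is a simple two-microphone design (a stand-in for the blocking matrix of Reference 1), and $R(\omega)$ is computed as the peak of $|D|^2$ divided by its mean over all directions, which is only one plausible reading of specific example 1. All function names and numeric values are hypothetical.

```python
import numpy as np

C = 340.0  # assumed speed of sound in m/s

def manifold_vector(mic_positions, theta, omega, c=C):
    """Assumed A_theta(omega): phase terms from tau = -(1/c) u^T p_m."""
    u = np.array([np.sin(theta), np.cos(theta), 0.0])
    tau = -(mic_positions @ u) / c
    return np.exp(-1j * omega * tau)

def null_weights(mic_positions, theta_d, omega):
    """Two-microphone weights W chosen so that W^H A_{theta_d} = 0."""
    a1, a2 = manifold_vector(mic_positions, theta_d, omega)
    return 0.5 * np.conj(np.array([a2, -a1]))

def directivity(w, mic_positions, thetas, omega):
    """D(omega, theta) = W^H A_theta(omega) on a grid of directions."""
    return np.array([np.vdot(w, manifold_vector(mic_positions, th, omega))
                     for th in thetas])

def correction_coefficient(d):
    """Assumed R(omega): peak of |D|^2 over its angular mean."""
    d2 = np.abs(d) ** 2
    return d2.max() / d2.mean()

mics = np.array([[-0.02, 0.0, 0.0], [0.02, 0.0, 0.0]])
omega = 2 * np.pi * 1000.0
theta_d = np.pi / 2                               # assumed known source direction
thetas = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)

w = null_weights(mics, theta_d, omega)
D = directivity(w, mics, thetas, omega)
R = correction_coefficient(D)

p_nd = 2.0               # hypothetical PSD of the null-beamformer output
indirect_psd = R * p_nd  # Equation (9): P_R = R * P_ND
```

With this choice $R(\omega) \geq 1$, so the correction always raises the null-beamformer output power, compensating for the indirect sound that the beamformer removes along with the direct sound.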
Alternatively, the numerator of Equation (8) may be a frequency-dependent or frequency-independent constant (specific example 4 of the directional shape correction coefficient).
[0027] In the proposed method, the output power spectral density $P_{ND}(\omega)$ of the beamformer is corrected with the directional shape correction coefficient $R(\omega)$ to obtain an estimate $P_R(\omega)$ of the power spectral density of the indirect sound. This compensates for the indirect sound component that arrives from the direct sound source direction $\theta_D$ and is suppressed by the beamformer, so the estimate $P_R(\omega)$ of the power spectral density of the indirect sound can be obtained with high accuracy.
[0028] Once the estimate $P_R(\omega)$ of the power spectral density of the indirect sound has been obtained, the direct-to-indirect ratio estimate DRR can be obtained from it and from an estimate $P_X(\omega)$ of the power spectral density computed from the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$. For example, the following direct-to-indirect ratio estimate DRR can be obtained (specific example 1 of the direct-to-indirect ratio estimate DRR).
[0029] Alternatively, the direct-to-indirect ratio estimate DRR may be expressed in decibels as follows (specific example 2 of the direct-to-indirect ratio estimate DRR).
[0030] Alternatively, a direct-to-indirect ratio estimate $\mathrm{DRR}(\omega)$ may be obtained for each frequency $\omega$ as follows (specific example 3 of the direct-to-indirect ratio estimate DRR).
[0031] Alternatively, the direct-to-indirect ratio estimate may be a value obtained by multiplying any of Equations (12) to (15) by a constant (specific example 4 of the direct-to-indirect ratio estimate DRR), the reciprocal of any of Equations (12) to (15) (specific example 5), or a value obtained by multiplying the reciprocal of any of Equations (12) to (15) by a constant (specific example 6).
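The ratio forms described in [0028] to [0030] can be sketched as follows. The sketch assumes that the direct sound power is taken as the total power $P_X(\omega)$ minus the indirect power $P_R(\omega)$, clipped at zero, and that the wideband ratio sums both over frequency. These choices and the function names are illustrative assumptions and may differ from Equations (12) to (15).

```python
import numpy as np

def drr_estimate(p_x, p_r, eps=1e-12):
    """Wideband ratio: (sum of direct power) / (sum of indirect power)."""
    p_x = np.asarray(p_x, dtype=float)
    p_r = np.asarray(p_r, dtype=float)
    p_d = np.maximum(p_x - p_r, 0.0)        # assumed direct power, never negative
    return p_d.sum() / max(p_r.sum(), eps)  # eps guards against division by zero

def drr_estimate_db(p_x, p_r, eps=1e-12):
    """The same ratio expressed in decibels."""
    return 10.0 * np.log10(max(drr_estimate(p_x, p_r, eps), eps))

def drr_per_frequency(p_x, p_r, eps=1e-12):
    """One ratio per frequency bin."""
    p_x = np.asarray(p_x, dtype=float)
    p_r = np.asarray(p_r, dtype=float)
    return np.maximum(p_x - p_r, 0.0) / np.maximum(p_r, eps)

# Hypothetical PSD values for two frequency bins.
ratio = drr_estimate([4.0, 4.0], [1.0, 1.0])
```

Clipping the difference at zero keeps the estimate meaningful when estimation noise makes the indirect power estimate exceed the total power estimate.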
In addition, a monotonically increasing function value of any of Equations (12) to (15) may be used as the direct-to-indirect ratio estimate (specific example 7 of the direct-to-indirect ratio estimate DRR).
[0032] Alternatively, $P_{ND}(\omega)$, $P_X(\omega)$, and $P_R(\omega)$ may be obtained only from the sound reception signals corresponding to a block $L$ consisting of the $K$ frames $t = (L-1)+1, \ldots, (L-1)+K$; the direct-to-indirect ratio estimate DRR or $\mathrm{DRR}(\omega)$ (specific examples 1 to 7) is then obtained for each block $L$ and used as the direct-to-indirect ratio estimate $\mathrm{DRR}_L$ or $\mathrm{DRR}_L(\omega)$ of block $L$ (specific example 8 of the direct-to-indirect ratio estimate DRR). Here, $K$ is an integer constant of 1 or more, and $L$ is an integer index of 1 or more identifying the block. In this case, the direct-to-indirect ratio estimate $\mathrm{DRR}_L$ or $\mathrm{DRR}_L(\omega)$ may be obtained for blocks with $K = 1$ (specific example 9 of the direct-to-indirect ratio estimate DRR). In the following, a block with $K = 1$ is treated as synonymous with a frame. Various other ratio estimates are also conceivable. In the following, all such estimates are referred to collectively as the "direct-to-indirect ratio estimate DRR".
[0033] The same reasoning applies when there are several direct sound sources at different positions and therefore several direct sound source directions $\theta_D$.
[0034] FIG. 6 illustrates the functional configuration of the acoustic signal enhancement device 400 according to the first embodiment, and FIG. 10 illustrates the operation flow of the acoustic signal enhancement device 400.
The acoustic signal enhancement device 400 according to the first embodiment includes a microphone array 41 consisting of a plurality of microphones $m_1, \ldots, m_M$, a plurality of frequency domain conversion units 421 to 42M, a processing target signal generation unit 43, a direct-to-indirect ratio calculation unit 44, a target signal adjustment unit 45, and an inverse frequency domain conversion unit 46. The target signal adjustment unit 45 includes a filter coefficient calculation unit 451 and a multiplication unit 452. Each functional component other than the microphone array 41 is realized, for example, by loading a predetermined program into a computer comprising a ROM (read-only memory), a RAM (random-access memory), a CPU (central processing unit), and so on, and having the CPU execute the program. The microphones $m_1, \ldots, m_M$ are arranged at mutually different positions.
[0035] The sound reception signals $x_1(n), \ldots, x_M(n)$ picked up by the microphones $m_1, \ldots, m_M$ are input to the frequency domain conversion units 421, ..., 42M. $n$ represents real time. The frequency domain conversion units 421, ..., 42M convert the sound reception signals $x_1(n), \ldots, x_M(n)$ into digital signals and, for each frame, convert them into the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$ and output these (step S42). For example, the frequency domain conversion units 421, ..., 42M sample the sound reception signals $x_1(n), \ldots, x_M(n)$ at a sampling frequency of 16 kHz to convert them into digital signals, take, for example, 256 samples as one frame, and perform a discrete Fourier transform on each frame to generate and output the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$ (step S42).
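Step S42 with the example values above (16 kHz sampling, 256 samples per frame) can be sketched as follows. The text does not specify an analysis window or frame overlap, so this sketch assumes non-overlapping, unwindowed frames; the function name and the 1 kHz test tone are hypothetical.

```python
import numpy as np

FS = 16000       # sampling frequency in Hz (example value from the text)
FRAME_LEN = 256  # samples per frame (example value from the text)

def to_frequency_domain(x, frame_len=FRAME_LEN):
    """Split each channel into frames and apply a DFT to every frame.

    x: array of shape (M, N) holding M digitized microphone signals.
    Returns X of shape (M, T, frame_len // 2 + 1), i.e. X_m(omega, t).
    Assumption: frames do not overlap and no window is applied.
    """
    x = np.asarray(x, dtype=float)
    n_frames = x.shape[-1] // frame_len
    trimmed = x[..., : n_frames * frame_len]  # drop the incomplete last frame
    frames = trimmed.reshape(*x.shape[:-1], n_frames, frame_len)
    return np.fft.rfft(frames, axis=-1)       # one-sided spectrum of real input

# Hypothetical input: one second of a 1 kHz tone on two microphones.
n = np.arange(FS)
tone = np.sin(2 * np.pi * 1000.0 * n / FS)
X = to_frequency_domain(np.stack([tone, tone]))
```

With these parameters each frequency bin is 62.5 Hz wide, so the 1 kHz tone falls exactly on bin 16.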
Here, the sound reception signal $x_m(n)$ ($m \in \{1, \ldots, M\}$) is the acoustic signal picked up by the microphone $m_m$, and the frequency domain signal $X_m(\omega, t)$ corresponds to the sound reception signal $x_m(n)$. The A/D converters that convert the sound reception signals $x_1(n), \ldots, x_M(n)$ into digital signals are omitted from the drawing.
[0036] The processing target signal generation unit 43 receives the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$ as input, and generates and outputs a processing target signal $Y(\omega, t)$ (step S43). Details of the processing target signal generation unit 43 and step S43 are described later.
[0037] The direct-to-indirect ratio calculation unit 44 receives the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$ as input, and generates and outputs the direct-to-indirect ratio estimate DRR from them (step S44). Details of the direct-to-indirect ratio calculation unit 44 and step S44 are described later.
[0038] The target signal adjustment unit 45 receives the processing target signal $Y(\omega, t)$ and the direct-to-indirect ratio estimate DRR as input, adjusts the amplitude of the processing target signal $Y(\omega, t)$ according to the direct-to-indirect ratio estimate DRR, and generates and outputs a processed signal $Z(\omega, t)$. In other words, the target signal adjustment unit 45 multiplies the processing target signal $Y(\omega, t)$ by a gain (filter coefficient) whose magnitude depends on the direct-to-indirect ratio estimate DRR, and thereby generates and outputs the processed signal $Z(\omega, t)$ (step S45).
[0039] The magnitude of the gain determined according to the direct-to-indirect ratio estimate DRR depends on the distance range from the microphone array 41 within which sound emitted from the direct sound source is to be emphasized.
For example, when sound emitted from a direct sound source close to the microphone array 41 is to be emphasized, the gain by which the processing target signal is multiplied when the ratio of the power estimate of the direct sound to the power estimate of the indirect sound, as represented by the direct-to-indirect ratio estimate DRR, is greater than a predetermined threshold is larger than the gain by which it is multiplied when that ratio is smaller than the threshold. Conversely, when sound emitted from a direct sound source far from the microphone array 41 is to be emphasized, the gain $G(\omega, t)$ by which the processing target signal is multiplied when that ratio is greater than a predetermined threshold is smaller than the gain by which it is multiplied when the ratio is smaller than the threshold.
[0040] The way in which the gain is determined from the ratio of the power estimate of the direct sound to the power estimate of the indirect sound, as represented by the direct-to-indirect ratio estimate DRR, is not limited to the comparison with a predetermined threshold described above. For example, when sound emitted from a direct sound source close to the microphone array 41 is to be emphasized, the gain by which the processing target signal is multiplied when that ratio is a first value is larger than the gain by which it is multiplied when the ratio is a second value smaller than the first value.
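The threshold rule of [0039] can be sketched as a two-level gain. The sketch assumes that DRR is expressed as direct power divided by indirect power, so that a larger value means a closer source; the two gain levels, the threshold, and the function names are hypothetical.

```python
import numpy as np

def gain_from_drr(drr, threshold, emphasize="near", g_high=1.0, g_low=0.1):
    """Pick a gain by comparing the DRR estimate with a threshold.

    emphasize="near": larger gain when drr exceeds the threshold.
    emphasize="far":  smaller gain when drr exceeds the threshold.
    g_high and g_low are illustrative values, not taken from the text.
    """
    above = drr > threshold
    if emphasize == "near":
        return g_high if above else g_low
    return g_low if above else g_high

def adjust_target_signal(y, drr, threshold, emphasize="near"):
    """Z(omega, t) = G * Y(omega, t) for one frame (step S45)."""
    return gain_from_drr(drr, threshold, emphasize) * np.asarray(y)

# Hypothetical frame with three frequency bins and a DRR above the threshold.
y_frame = np.array([1.0 + 1.0j, 2.0, -1.0j])
z_near = adjust_target_signal(y_frame, drr=3.0, threshold=1.0, emphasize="near")
z_far = adjust_target_signal(y_frame, drr=3.0, threshold=1.0, emphasize="far")
```

A gain that varies smoothly and monotonically with DRR would also satisfy the first-value and second-value condition of [0040]; the hard switch here is only the simplest case.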
Likewise, when sound emitted from a direct sound source far from the microphone array 41 is to be emphasized, the gain $G(\omega, t)$ by which the processing target signal is multiplied when the ratio of the power estimate of the direct sound to the power estimate of the indirect sound, as represented by the direct-to-indirect ratio estimate DRR, is a first value is smaller than the gain by which it is multiplied when the ratio is a second value smaller than the first value. Details of the target signal adjustment unit 45 and step S45 are described later.
[0041] The inverse frequency domain conversion unit 46 converts the input processed signal $Z(\omega, t)$ into a time domain signal $z(n')$ and outputs it (step S46). $n'$ represents discrete time. For example, the inverse frequency domain conversion unit 46 converts the processed signal $Z(\omega, t)$ into the time domain signal $z(n')$ by an inverse Fourier transform and outputs it.
[0042] The operations from step S41 to step S46 are repeated, for example, until all of the sound reception signals $x_1(n), \ldots, x_M(n)$ picked up by the microphones $m_1, \ldots, m_M$ have been processed. With the operation described above, the microphone array can, for example, emphasize sound from within a specific distance range while relatively suppressing sound from outside that range. More specific examples of each unit and step are given below.
[0043] [Processing target signal generation unit 43 / step S43] One example of the processing target signal $Y(\omega, t)$ is a composite of the frequency domain signals $X_1(\omega, t), \ldots, X_M(\omega, t)$. Another example is the frequency domain signal $X_{m'}(\omega, t)$ corresponding to any one microphone $m'$ ($m' \in \{1, \ldots, M\}$), or a weighted value of $X_{m'}(\omega, t)$.
[0044] FIG. 7 shows an example of the functional configuration of the processing target signal generation unit 43.
The processing target signal generation unit 43 illustrated in FIG. 7 includes a plurality of weight multiplication units 4311 to 431M and an addition unit 432. The frequency domain signals X1(ω, t), ..., XM(ω, t) are input to the weight multiplication units 4311 to 431M, respectively. The weight multiplication units 4311 to 431M multiply the frequency domain signals X1(ω, t), ..., XM(ω, t) by the weighting coefficients w1(ω), ..., wM(ω), respectively, and generate and output the weighted frequency domain signals w1(ω)X1(ω, t), ..., wM(ω)XM(ω, t).
[0045] For example, when M omnidirectional microphones m1, ..., mM are used, setting w1(ω) = ... = wM(ω) = 1/M makes the average of the M frequency domain signals X1(ω, t), ..., XM(ω, t) the processing target signal Y(ω, t), which stabilizes Y(ω, t). Setting w1 = 1 and wm = 0 (m ∈ {2, ..., M}) uses only the sound reception signal of the specific microphone m1. When M directional microphones m1, ..., mM are used, setting wm′ = 1 for a specified microphone and wm″ = 0 for the other microphones uses only the sound reception signal of the specified microphone mm′, so that an arbitrary directivity can be obtained. In addition, the weighting coefficients of a delay-and-sum or other weighted beamformer, as described in Reference 2 (Ohga, Yamazaki, Kanada, "Acoustic system and digital signal processing", published by the Institute of Electronics, Information and Communication Engineers), may be used as w1(ω), ..., wM(ω) to realize an arbitrary directivity. Furthermore, when another microphone is present near the desired sound source, the signal obtained by converting the observation signal of that microphone into the frequency domain may be used as the output of the processing target signal generation unit 43.
[0046] The weighted frequency domain signals w1(ω)X1(ω, t), ..., wM(ω)XM(ω, t) are input to the addition unit 432.
The addition unit 432 adds the weighted frequency domain signals w1(ω)X1(ω, t), ..., wM(ω)XM(ω, t) and outputs the processing target signal Y(ω, t). At that time, the propagation delay of each of the microphones m1, ..., mM with respect to the reference point described above may be corrected.
[0047] [Direct-to-indirect ratio calculation unit 44 / step S44] The following is an example in which the power spectral density, or an estimate of it, is used as the power estimate. As illustrated in FIG. 9, the direct-to-indirect ratio calculation unit 44 includes a received power estimation unit 441, a weighting coefficient storage unit 442, a directivity forming unit 443, a direct sound suppression power estimation unit 444, a directional shape analysis unit 445, an indirect sound power estimation unit 446, and a direct-to-indirect ratio estimation unit 447.
[0048] As illustrated in FIGS. 8 and 9, the frequency domain signals X1(ω, t), ..., XM(ω, t) output from the frequency domain conversion units 421, ..., 42M are input to the received power estimation unit 441 and the directivity forming unit 443. The received power estimation unit 441 uses the frequency domain signals X1(ω, t), ..., XM(ω, t) to generate and output a power estimate of the frequency domain signal corresponding to the sound reception signal. This power estimate may be the power estimate of the frequency domain signal Xm(ω, t) corresponding to any one microphone mm (m ∈ {1, ..., M}), or a weighted average of the power estimates of the frequency domain signals X1(ω, t), ..., XM(ω, t). In the first embodiment, the power spectral density PX(ω) is obtained as the power estimate of the frequency domain signal corresponding to the sound reception signal. The example here obtains the power spectral density PX(ω) for each block L consisting of the K frames (L−1)+1, ..., (L−1)+K, and the power spectral density PX(ω) obtained in block L is written PX,L(ω).
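The block-wise power estimate described here can be illustrated with a short sketch. Equations (16) and (17) are not reproduced in this translation, so the sketch assumes that the power spectral density of a block is the average of |X(ω, t)|² over the K frames of the block. The function names `block_psd_single` and `block_psd_weighted` and the weights `alphas` are hypothetical and chosen only for illustration.

```python
def block_psd_single(frames):
    """Estimate the block power spectral density for one microphone.

    frames: list of K frames, each a list of complex spectrum bins X(w, t).
    Assumption: the estimate is the frame average of |X(w, t)|^2.
    """
    num_frames = len(frames)
    num_bins = len(frames[0])
    return [
        sum(abs(frame[w]) ** 2 for frame in frames) / num_frames
        for w in range(num_bins)
    ]


def block_psd_weighted(frames_per_mic, alphas):
    """Weighted average of the per-microphone block estimates.

    frames_per_mic: list of M entries, each a list of K frames.
    alphas: M hypothetical averaging weights (for example 1/M each).
    """
    psds = [block_psd_single(frames) for frames in frames_per_mic]
    num_bins = len(psds[0])
    return [
        sum(a * psd[w] for a, psd in zip(alphas, psds))
        for w in range(num_bins)
    ]


# One block of K = 2 frames with 2 frequency bins.
frames = [[1 + 0j, 2j], [1j, 0j]]
p_single = block_psd_single(frames)
p_avg = block_psd_weighted([frames, frames], [0.5, 0.5])
```

The same averaging could be applied to the direct sound suppression signal to obtain its block power estimate, again under the assumption stated above.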
Equation (16) is an example in which the power spectral density of one microphone mm is used as PX,L(ω), and equation (17) is an example in which the weighted average of the power spectral densities of the frequency domain signals X1(ω, t), ..., XM(ω, t) is used as PX,L(ω).
[0049] The weighting coefficient storage unit 442 stores the filter coefficients W1(ω), ..., WM(ω) of a beamformer that suppresses the signal components arriving from the direct sound source direction θD described above. The directivity forming unit 443 uses the filter coefficients W1(ω), ..., WM(ω) read from the weighting coefficient storage unit 442 to suppress, in the input frequency domain signals X1(ω, t), ..., XM(ω, t), the signal components arriving from the direct sound source direction θD, and generates and outputs the resulting direct sound suppression signal ND(ω, t). For example, the directivity forming unit 443 generates the direct sound suppression signal ND(ω, t) as follows.
[0050] The direct sound suppression power estimation unit 444 receives the direct sound suppression signal ND(ω, t), generates a power estimate of the direct sound suppression signal ND(ω, t), and outputs it. In the first embodiment, the power spectral density PND(ω) is obtained as the power estimate of the direct sound suppression signal ND(ω, t). The example here obtains the power spectral density PND(ω) for each block L, and the power spectral density PND(ω) obtained in block L is written PND,L(ω).
[0051] The directional shape analysis unit 445 uses the filter coefficients W1(ω), ..., WM(ω) read from the weighting coefficient storage unit 442 to generate and output a function D(ω, θ) representing the directivity characteristic, that is, the directional shape, formed by the beamformer that suppresses the signal components arriving from the direct sound source direction θD described above.
For example, the directional shape analysis unit 445 obtains in advance information such as Aθ(ω) corresponding to the reference point of the microphone array 41 and to the microphones m1, ..., mM, and uses this information together with the filter coefficients W1(ω), ..., WM(ω) to generate D(ω, θ) according to, for example, equation (5). The directional shape analysis unit 445 then uses D(ω, θ) to generate and output a directional shape correction coefficient R(ω). Examples of the directional shape correction coefficient R(ω) are specific examples 1 to 4 of the directional shape correction coefficient described above.
[0052] The power spectral density PND,L(ω), which is the power estimate of the direct sound suppression signal ND(ω, t), and the directional shape correction coefficient R(ω) are input to the indirect sound power estimation unit 446. The indirect sound power estimation unit 446 corrects the power spectral density PND,L(ω) using the directional shape correction coefficient R(ω), and generates and outputs a power estimate of the indirect sound. In the first embodiment, the estimate PR(ω) of the power spectral density of the indirect sound is obtained for each block L as follows. The estimate PR(ω) in block L is written PR,L(ω).
PR,L(ω) = R(ω) PND,L(ω) (20)
[0053] The power spectral density PX,L(ω), which is the power estimate of the frequency domain signal, and the estimate PR,L(ω) of the power spectral density, which is the power estimate of the indirect sound, are input to the direct-to-indirect ratio estimation unit 447. The direct-to-indirect ratio estimation unit 447 uses these to generate and output the direct-to-indirect ratio estimate DRR of the frequency domain signals X1(ω, t), ..., XM(ω, t).
Examples of the direct-to-indirect ratio estimate DRR are specific examples 1 to 9 of the direct-to-indirect ratio estimate DRR. In the first embodiment, PX(ω) in specific examples 1 to 9 is replaced with PX,L(ω) and PR(ω) is replaced with PR,L(ω), so that the direct-to-indirect ratio estimate DRRL or DRRL(ω) is obtained.
[0054] [Target signal adjustment unit 45 / step S45] As illustrated in FIG. 6, the target signal adjustment unit 45 includes, for example, a filter coefficient calculation unit 451 and a multiplication unit 452. The filter coefficient calculation unit 451 receives the direct-to-indirect ratio estimate DRRL or DRRL(ω) as input, and determines and outputs a gain (filter coefficient) G(ω, t) whose magnitude corresponds to DRRL or DRRL(ω).
[0055] When the direct-to-indirect ratio estimate DRRL(ω) for each frequency ω is input, the filter coefficient calculation unit 451 determines, for example, the gain G(ω, t) for each frequency ω and each frame t = (L−1)+1, ..., (L−1)+K belonging to block L according to the corresponding estimate DRRL(ω). When the direct-to-indirect ratio estimate DRRL, which does not depend on the frequency ω, is input, the filter coefficient calculation unit 451 determines, for example, the gains G(ω, t) at all frequencies ω for each frame t = (L−1)+1, ..., (L−1)+K belonging to block L according to the estimate DRRL.
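The chain from the block power estimates to the ratio estimate, described in paragraphs [0052] and [0053], can be sketched as follows. Equation (20) is taken from the text. The form of the ratio is an assumption: specific examples 1 to 9 are not part of this section, so the sketch takes the direct-sound power to be the received power minus the indirect-sound power, floored at zero. The names `indirect_power`, `drr_per_bin` and the guard value `eps` are hypothetical.

```python
def indirect_power(R, P_ND):
    """Equation (20): P_R,L(w) = R(w) * P_ND,L(w) for every frequency bin."""
    return [r * p for r, p in zip(R, P_ND)]


def drr_per_bin(P_X, P_R, eps=1e-12):
    """Per-frequency ratio of direct-sound power to indirect-sound power.

    Assumption (not taken from the text): direct power = P_X - P_R,
    floored at zero; eps only guards against division by zero.
    """
    return [
        max(px - pr, 0.0) / max(pr, eps)
        for px, pr in zip(P_X, P_R)
    ]


# Two frequency bins of one block.
R = [2.0, 1.0]        # directional shape correction coefficient R(w)
P_ND = [1.0, 3.0]     # power of the direct sound suppression signal
P_X = [6.0, 3.0]      # power of the received signal
P_R = indirect_power(R, P_ND)
drr = drr_per_bin(P_X, P_R)
```

A frequency-independent estimate could then be formed from these per-bin values, for example by averaging, if a single value per block is preferred.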
[0056] As described above, when the sound emitted from a direct sound source close to the microphone array 41 is to be emphasized, for example, the gain G(ω, t) (t = (L−1)+1, ..., (L−1)+K) applied when the ratio of the direct-sound power estimate to the indirect-sound power estimate, as represented by the direct-to-indirect ratio estimate DRRL or DRRL(ω), is larger than a predetermined threshold is larger than the gain G(ω, t) applied when the ratio is smaller than the threshold. When the sound emitted from a direct sound source far from the microphone array 41 is to be emphasized, for example, the gain G(ω, t) (t = (L−1)+1, ..., (L−1)+K) applied when the ratio is larger than a predetermined threshold is smaller than the gain G(ω, t) applied when the ratio is smaller than the threshold.
[0057] For example, when the gain G(ω, t) is determined as in equation (21) or (22), the sound emitted from a direct sound source closer than the specific distance range can be emphasized. Here, t = (L−1)+1, ..., (L−1)+K. An arbitrary value between the minimum and maximum values of the direct-to-indirect ratio estimate DRRL or DRRL(ω) is set as the threshold Th1. As the threshold Th1 approaches the minimum value (0), the sound quality improves. Conversely, as the threshold Th1 approaches the maximum value, the noise suppression effect increases, but the distortion of the sound reception signal grows and the sound quality degrades. The threshold Th1 therefore involves a trade-off between sound quality and noise suppression, and it is determined empirically according to the purpose of use with this trade-off in mind.
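The threshold-based gain can be sketched as below. Equations (21) and (22) are not reproduced in this translation, so the sketch assumes a gain of 1 when the ratio exceeds the threshold and 0 otherwise for near-source emphasis, with the two values swapped for far-source emphasis. How equality with the threshold is handled is also an assumption. The names `binary_gain`, `gains_for_block` and `apply_gain` are hypothetical.

```python
def binary_gain(drr, threshold, emphasize_near=True):
    """Return a 0/1 gain from one ratio estimate.

    Assumption: near-source emphasis passes the signal when drr > threshold;
    far-source emphasis passes it when drr <= threshold.
    """
    above = drr > threshold
    if emphasize_near:
        return 1.0 if above else 0.0
    return 0.0 if above else 1.0


def gains_for_block(drr_bins, threshold, emphasize_near=True):
    """One gain per frequency bin, shared by all frames of the block."""
    return [binary_gain(d, threshold, emphasize_near) for d in drr_bins]


def apply_gain(frames, gains):
    """Z(w, t) = G(w, t) * Y(w, t) for every frame of the block."""
    return [[g * y for g, y in zip(gains, frame)] for frame in frames]


# One block with a single frame and two frequency bins.
gains = gains_for_block([2.0, 0.5], threshold=1.0)
Z = apply_gain([[1 + 1j, 2 + 0j]], gains)
```

Replacing the values 1.0 and 0.0 with other pairs, such as 0.9 and 0.1, would give a softer suppression at the cost of less separation between near and far sources.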
[0058] Alternatively, when the gain G(ω, t) is determined as in equation (23) or (24), for example, the sound emitted from a direct sound source farther than the specific distance range can be emphasized. Here, t = (L−1)+1, ..., (L−1)+K. An arbitrary value between the minimum and maximum values of the direct-to-indirect ratio estimate DRRL or DRRL(ω) is set as the threshold Th2.
[0059] Equations (21) to (24) are examples in which the gain G(ω, t) takes the value 0 or 1, but this does not limit the present invention. The gain G(ω, t) may be set to one of two other values (for example, 0.1 and 0.9) according to the result of the threshold comparison. The gain G(ω, t) may also be a real number of 1 or more; that is, a gain G(ω, t) that amplifies the processing target signal Y(ω, t) may be determined. A gain G(ω, t) that strongly suppresses the processing target signal Y(ω, t) (for example, a value of 0.1 or less) may also be determined. In addition, as described above, the way the magnitude of the gain is determined from the direct-to-indirect ratio estimate DRR is not limited to comparing the ratio of the direct-sound power estimate to the indirect-sound power estimate with a predetermined threshold. In that case, when the sound emitted from a direct sound source close to the microphone array 41 is to be emphasized, for example, the gain G(ω, t) (t = (L−1)+1, ..., (L−1)+K) applied when the ratio represented by DRRL or DRRL(ω) is a first value is larger than the gain G(ω, t) applied when the ratio is a second value smaller than the first value.
When the sound emitted from a direct sound source far from the microphone array 41 is to be emphasized, for example, the gain G(ω, t) (t = (L−1)+1, ..., (L−1)+K) applied when the ratio of the direct-sound power estimate to the indirect-sound power estimate, as represented by DRRL or DRRL(ω), is a first value is smaller than the gain G(ω, t) applied when the ratio is a second value smaller than the first value. That is, instead of determining the gain G(ω, t) by a threshold comparison, the direct-to-indirect ratio estimate itself, or a function value of it, may be used as the gain G(ω, t). For example, the gain G(ω, t) may be determined as in the following equations (25) to (28).
G(ω, t) = DRRL (t = (L−1)+1, ..., (L−1)+K) (25)
G(ω, t) = DRRL(ω) (t = (L−1)+1, ..., (L−1)+K) (26)
G(ω, t) = F(DRRL) (t = (L−1)+1, ..., (L−1)+K) (27)
G(ω, t) = F(DRRL(ω)) (t = (L−1)+1, ..., (L−1)+K) (28)
Here, F is a function such as a monotonically increasing function or a monotonically decreasing function.
[0060] The gain G(ω, t) output from the filter coefficient calculation unit 451 and the processing target signal Y(ω, t) output from the processing target signal generation unit 43 are input to the multiplication unit 452. The multiplication unit 452 multiplies the processing target signal Y(ω, t) by the gain G(ω, t) to generate and output the processed signal Z(ω, t) = G(ω, t) Y(ω, t).
[0061] In the second embodiment, a perspective (far/near) determination of the sound source is performed using the direct-to-indirect ratio estimate obtained in the same manner as in the first embodiment, and a perspective determination result is generated.
That is, two values are used: a determination value corresponding to the direct-to-indirect ratio estimate obtained from the sound reception signal received in a determination section consisting of one or more frames, and a reference value corresponding to a plurality of direct-to-indirect ratio estimates obtained from the sound reception signal received in a reference section consisting of more frames than the determination section. The perspective determination of the direct sound source in the determination section is performed by a comparison using these two values.
[0062] Examples of the determination section are a frame and a block. Examples of the reference section are a plurality of frames, a block, and a plurality of blocks. In the case of real-time processing, the reference section is a section no later than the determination section. In the case of batch processing, the reference section may be a section before the determination section, a section after it, or a section that includes it. Examples of the determination value are the direct-to-indirect ratio estimate itself and a function value of it, such as the value of a monotonically increasing function of the estimate. Examples of the reference value are the average, the expected value, and a weighted sum of the plurality of direct-to-indirect ratio estimates obtained from the sound reception signal received in the reference section.
In the perspective determination of the direct sound source, a comparison using the determination value and the reference value outputs, for example, a value representing a first perspective determination result (for example, 1) when the reference value is larger than the determination value, and a value representing a second perspective determination result (for example, 2) otherwise. Examples of the comparison using the determination value and the reference value include comparing the determination value with the reference value directly, and comparing a value representing the ratio of the determination value to the reference value with a threshold. One of the first and second perspective determination results means "the direct sound source is far" and the other means "the direct sound source is close". Which of the two results means "the direct sound source is far" depends on how the direct-to-indirect ratio estimate is defined. For example, when the determination value represents the ratio of the direct-sound power estimate to the indirect-sound power estimate in the determination section, and the reference value represents the average of that ratio over the reference section, a perspective determination result indicating that the direct sound source is far is generated if the determination value is smaller than the reference value, and a result indicating that the direct sound source is close is generated otherwise. Specific examples are described below.
[0063] FIG. 11 illustrates an example of the functional configuration of the perspective determination device 120 according to the second embodiment.
The perspective determination device 120 includes a microphone array 41, a plurality of frequency domain conversion units 411 to 41m, a direct-to-indirect ratio calculation unit 44, and a perspective determination unit 121. The perspective determination unit 121 includes a storage unit 1211 and a determination unit 1212. The microphone array 41, the plurality of frequency domain conversion units 411 to 41m, and the direct-to-indirect ratio calculation unit 44 are the same as those of the acoustic signal enhancement device 400. In the perspective determination device 120 as well, each functional component other than the microphone array 41 is realized by loading a predetermined program into a computer that includes, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.
[0064] When a plurality of direct sound sources at different distances from the reference point of the microphone array 41 emit sound at different times, the perspective determination device 120 determines whether the source of the sound received at a given time is far or near. The plurality of direct sound sources may lie in the same direct sound source direction or in different directions. The perspective determination unit 121 of the perspective determination device 120 includes a frequency averaging unit 1210, a storage unit 1211, and a determination unit 1212.
[0065] The case where DRRL is used as the direct-to-indirect ratio equivalent value is illustrated below, but this does not limit the present invention. Instead of DRRL, the direct-to-indirect ratio equivalent value DRRL(ω) corresponding to any one frequency ω may be used, or the weighted sum over frequency of DRRL(ω) given by equation (29) may be used. Here, γ(ω) is a weighting coefficient, and an example of γ(ω) is 1/Γ, where Γ is the total number of frequencies.
For example, when the frequency domain conversion units 421 to 42M perform a short-time Fourier transform, Γ is the total number of frequency bins.
[0066] In addition, the case illustrated below is one in which the direct-to-indirect ratio equivalent value DRRL is the ratio of the direct-sound power estimate to the indirect-sound power estimate, but, as described above, other values may also be used as the direct-to-indirect ratio equivalent value.
[0067] The DRRL output from the direct-to-indirect ratio calculation unit 44 is input to the storage unit 1211 and the determination unit 1212. The storage unit 1211 stores, for example, the direct-to-indirect ratio equivalent values DRRL of the past σ blocks (σ is an integer of 2 or more; an example of the reference section), and outputs a reference value DRR′ corresponding to the DRRL of those σ blocks. The reference value DRR′ is, for example, the average of the stored DRRL of the σ blocks, or the average of the minimum and maximum of the stored DRRL of the σ blocks.
[0068] The determination unit 1212 takes the direct-to-indirect ratio equivalent value DRRL of block L (an example of the determination section) as the determination value and compares the reference value DRR′ with the determination value DRRL. If DRR′ > DRRL, the determination unit 1212 outputs a perspective determination result Y(L) indicating that the source is far (for example, Y(L) = 1). Otherwise, the determination unit 1212 outputs a perspective determination result Y(L) indicating that the source is close (for example, Y(L) = 0). The perspective determination result Y(L) indicates whether the sound reception signal in block L is sound from a relatively close direct sound source or from a relatively far direct sound source.
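The operation of the storage unit 1211 and the determination unit 1212 can be sketched as follows. The sketch uses the average of the stored values as the reference value DRR′, which is one of the examples given in the text. The class name `PerspectiveJudge` is hypothetical, and returning `None` until σ past blocks have been stored is an assumption, since the text does not say what happens before the reference section is filled.

```python
from collections import deque


class PerspectiveJudge:
    """Far/near decision per block from a running reference value."""

    def __init__(self, sigma):
        if sigma < 2:
            raise ValueError("sigma must be an integer of 2 or more")
        self.sigma = sigma
        # Holds the ratio values of the most recent sigma blocks.
        self.history = deque(maxlen=sigma)

    def judge(self, drr_block):
        """Return 1 (far), 0 (close), or None while the history is short."""
        if len(self.history) < self.sigma:
            result = None  # assumption: no decision without a full reference
        else:
            reference = sum(self.history) / len(self.history)
            result = 1 if reference > drr_block else 0
        # The current block joins the reference section for later blocks.
        self.history.append(drr_block)
        return result


judge = PerspectiveJudge(sigma=2)
results = [judge.judge(v) for v in [4.0, 2.0, 1.0, 5.0]]
```

In this run the third block is compared with a reference of 3.0 and is judged far, and the fourth block is compared with a reference of 1.5 and is judged close.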
By using this perspective determination result Y(L), the sequentially received sound reception signal can be separated according to the distance between the microphone and the direct sound source. That is, the sounds of a plurality of direct sound sources can be sorted according to their distance from the microphone.
[0069] The present invention is not limited to the embodiments described above. For example, part of the processing performed in the frequency domain in the description above may be performed in the time domain. In the embodiments above, the directivity forming unit 443 applies the frequency-domain filter coefficients W1(ω), ..., WM(ω) to the frequency domain signals X1(ω, t), ..., XM(ω, t) to generate the direct sound suppression signal ND(ω, t), in which the signal components arriving from the direct sound source direction θD are suppressed. However, the digital sound reception signals x1(n), ..., xM(n) may instead be processed in the time domain to suppress the signal components arriving from the direct sound source direction, and the resulting signal may be converted into the frequency domain to generate the direct sound suppression signal ND(ω, t). That is, time-domain filter coefficients corresponding to the filter coefficients W1(ω), ..., WM(ω) may be convolved with the digital sound reception signals x1(n), ..., xM(n), and the result may be converted into the frequency domain to generate the direct sound suppression signal ND(ω, t). Further, in the embodiments above, the processing target signal generation unit 43 outputs the processing target signal Y(ω, t), and the target signal adjustment unit 45 multiplies the processing target signal Y(ω, t) by a gain whose magnitude corresponds to the direct-to-indirect ratio estimate DRR.
However, the processing target signal generation unit 43 may output a signal y(n′) obtained by converting Y(ω, t) into the time domain, and the target signal adjustment unit 45 may multiply the time domain signal y(n′) by a gain whose magnitude corresponds to the direct-to-indirect ratio estimate DRR. In this case, the inverse frequency domain conversion unit 46 is unnecessary.
[0070] Some of the functional components of the acoustic signal enhancement device 400 and the perspective determination device 120 may be realized by external devices. For example, the acoustic signal enhancement device 400 or the perspective determination device 120 may omit the microphone array and be connected to an external microphone array to realize the same function. Similarly, the acoustic signal enhancement device 400 and the perspective determination device 120 may omit the frequency domain conversion units and the inverse frequency domain conversion unit, and realize the same function using external frequency domain conversion units and an external inverse frequency domain conversion unit.
[0071] In addition, the various processes described above need not be executed in the chronological order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.
[0072] When the above configuration is realized by a computer, the processing content of the functions that each device should have is described by a program. The processing functions described above are realized on the computer by executing this program on the computer.
[0073] The program describing the processing content can be recorded on a computer-readable recording medium.
An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media are magnetic recording devices, optical disks, magneto-optical recording media, and semiconductor memories.
[0074] This program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
[0075] A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes the processing according to the read program. As other forms of execution, the computer may read the program directly from the portable recording medium and execute the processing according to it, or it may execute the processing according to the received program sequentially each time the program is transferred to it from the server computer.
[0076] In the embodiments, each device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized by hardware.
[0077] 400 acoustic signal enhancement device; 120 perspective determination device
