APPLIED STOCHASTIC MODELS AND DATA ANALYSIS, VOL. 10, 91-102 (1994)

INFORMATION THEORY APPROACH IN EFFICIENCY MEASUREMENT

JATI K. SENGUPTA
Department of Economics, University of California, Santa Barbara, CA 93106-9210, U.S.A.

SUMMARY

The current non-parametric method of measuring productive efficiency of input-output systems is generalized here in the stochastic case in terms of an information theory approach based on the concept of entropy. Use of maximum entropy as a method of finding the most probable distribution of the input-output data set and as a predictive criterion is illustrated for production systems with multiple inputs and outputs.

KEY WORDS: Principle of maximum entropy; Efficiency measurement by data envelopment analysis; Use of conditional entropy

1. INTRODUCTION

The non-parametric approach of data envelopment analysis (DEA) has been frequently used in recent literature^1-3 to specify and measure productive efficiency. The two most attractive features of this approach are its flexibility and its emphasis on a data-based method of estimating efficiency. In the stochastic case, however, the statistical distribution of the input-output data plays an important role in the specification and estimation of the production frontier, but the DEA model fails to incorporate it in any significant way. It is clearly plausible that the estimate of the efficiency surface by the DEA method would differ according to whether the data set follows one distribution or another. Given a sample set of observed data points, one may therefore ask: 'What is the most plausible form of the true distribution that generates the given data?' Information theory and the associated probability distribution, called the maximum entropy (ME) distribution, seek to provide an answer to this question.
The ME principle states that if the statistician's decision problem is to fit a distribution as an estimator of some true distribution, he should formulate his prior knowledge of the latter distribution in the form of constraints on the distribution to be fitted; he should then choose the most uninformative distribution subject to these constraints, with entropy used as the measure of uninformativeness. Clearly the DEA model, which basically computes a sequence of n linear programming (LP) problems to test whether each of the n data points is on the production frontier, is quite flexible in terms of allowing prior knowledge as additional constraints. There is a second way of incorporating the data distribution aspect into a DEA model. This is through a predictive criterion adjoined to the original model. The information theory approach can be used here to generalize the predictive power criterion when the underlying distribution need not be normal. Finally, when the DEA model is considered in its dynamic version with a production function involving lagged inputs, the question arises: what should be the order of the lags? For time series data, the technology is frequently represented in the production function by lagged capital inputs, and the question of how many lags there should be in the frontier production function assumes special importance. The information theory approach can be profitably used here to determine the optimal number of time lags in a dynamic production frontier. Our object here is to develop and apply two basic concepts of information theory, the entropy and the mutual information statistic, in the DEA framework, thereby generalizing the scope of applicability of the latter.

CCC 8755-0024/94/020091-12 © 1994 by John Wiley & Sons, Ltd.
Received 2 April 1992; Revised 22 December 1993
These applications are intended to develop a joint method of modelling and estimating productive efficiency in a stochastic view of the DEA model, where the set of sample observations is divided into two subsets: one containing the efficient units and the other the non-efficient ones.

2. ENTROPY IN DEA MODELS

Lack of complete knowledge about the random state of nature is a pervasive characteristic of most decision models. For the DEA model this is due to imprecise knowledge of the probability distributions of inputs and outputs. In the standard DEA model we observe the m inputs x_j = (x_ij) and the single output y_j for each decision-making unit (DMU) j ∈ I_n = {1, 2, ..., n}, and formulate the following LP model for unit k:

    min g_k = x_k'β = Σ_{i=1}^m x_ki β_i                          (1)
    s.t. β ∈ C(β) = {β | Xβ ≥ y; β ≥ 0}

The unit k ∈ I_n is efficient if it lies on the convex hull of the convex set, i.e. if it holds that

    x_k'β*(k) = y_k   and   s_k = y_k* − y_k = 0

where s_k is the slack variable, which equals zero in non-degenerate cases, and y_k* = x_k'β*(k) is the potential or maximum output. Then, by varying k in the index set I_n in the objective function, the whole efficiency surface can be traced out from the optimal solutions of the family of LP models of the form (1). In the case where the data set (X, y) is subject to a stochastic generating mechanism, the objective function of (1) may be written more generally as a convex loss function L(q, p), where q = Xβ − y is the potential loss of output and p = (p_j) is the probability of occurrence of the jth state of nature, assuming a discrete distribution framework.
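As a minimal numerical sketch of model (1), the LP for each reference unit k can be solved with an off-the-shelf solver. The data below are invented for illustration (four units, two inputs, one output); they are not taken from the paper's empirical section:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical input-output data: n = 4 units, m = 2 inputs, single output.
X = np.array([[2.0, 3.0],
              [4.0, 2.0],
              [3.0, 5.0],
              [6.0, 6.0]])
y = np.array([2.0, 3.0, 2.5, 3.0])

def dea_lp(k, X, y):
    """Model (1): min x_k' beta  s.t.  X beta >= y,  beta >= 0."""
    return linprog(c=X[k], A_ub=-X, b_ub=-y, bounds=(0, None), method="highs")

for k in range(len(y)):
    res = dea_lp(k, X, y)
    # g_k* = x_k' beta*(k) is the potential output y_k*; the slack
    # s_k = y_k* - y_k is zero exactly when unit k is on the frontier.
    slack = res.fun - y[k]
    print(k, round(res.fun, 4), "efficient" if slack < 1e-8 else "inefficient")
```

With these toy numbers, units 0 and 1 turn out to lie on the frontier while units 2 and 3 do not, illustrating how varying k traces out the efficiency surface.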
In the production frontier literature, the following specifications of the loss function have been frequently used:

    L(q, p) = x̄'β − ȳ                                            (2a)
    L(q, p) = x_k'β − y_k                                         (2b)
    L(q, p) = Σ_{j=1}^n (x_j'β − y_j)²                            (2c)

Both Farrell^2 and the DEA approach^3 used the formulation (2b) to minimize the loss function L(q, p) = x_k'β − y_k, assuming that the reference unit k has the highest probability of realizing a zero loss level. Timmer^4 and Johansen^5 used the formulation (2a) to minimize the average loss function ḡ = x̄'β − ȳ, where x̄, ȳ denote the respective averages of inputs and output. The formulation (2c) is based on the least squares norm, which is appropriate if the loss vector q follows a normal distribution. Note that one common assumption underlying all three specifications above is that the probability p = (p_j) of the random state of nature is known. In reality, however, this is rarely the case. Timmer equated the sample means of inputs and output to the population means as estimators and minimized the mean loss function ḡ = x̄'β − ȳ. For each objective function g_k = x_k'β for the kth reference unit in (1), the DEA model assumes the probability p_k to be one; hence it follows that if unit k is not efficient for the model having p_k = 1, it cannot be efficient for any other model having an objective function min g_r = x_r'β, where r ≠ k. The case of incomplete knowledge about the random states of nature can be handled in two different ways. One is to introduce partial information as a compromise between complete ignorance and complete knowledge of the probabilities p_j, and thereby assume that the DM can rank the random states of nature in terms of their probabilities, e.g. p_1 ≥ p_2 ≥ ... ≥ p_n. This line has been developed by Fishburn^6 and Kmietowicz and Pearman,^7 who have derived optimal sets of decision rules incorporating incomplete information in a partial order as above.
A second method of incorporating partial information is to allow it as prior information before deriving the estimates of the unknown probabilities p_j of the random states of nature. Note that Timmer^4 does precisely this in his method of moments approach when he replaces the population means of inputs and output by their sample counterparts. Recently Kopp and Mullahy^8 have introduced higher-order sample moment restrictions (e.g. sample moments of orders two to four) as a basis for identifying and estimating the technical efficiency parameters. In particular, they have empirically tested the reasonableness of the assumption of a half-normal distribution for the error terms e_j = x_j'β − y_j in the cost data for the U.S. electric power industry. Their empirical results failed to support the hypothesis of a half-normal specification. This suggests the need to approximate the most appropriate distribution generating the input-output data set. Information theory, with its entropy approach, provides a most convenient method for approximating the true distribution underlying the data set. Now consider Timmer's transformation (2a) of the DEA model, where the probabilities p_j of the random states of nature generating the output data are not completely known. We assume, however, that sample observations on output are available and that the sample information is given by the sample moments, as in the method of moments. For simplicity, we assume that the first moment, i.e. the sample mean, is given by

    Σ_{j=1}^n p_j y_j = ȳ                                         (3)

where ȳ is the sample mean output. Given this prior information ȳ, one may now ask: 'Which way of assigning the prior probabilities p_j in the output distribution makes the fewest assumptions?' Since the estimated distribution should be widely applicable, statistical decision theory suggests that we choose the most uninformative distribution.
As a measure of uninformativeness the concept of entropy has been frequently applied, where the entropy associated with a distribution is defined as

    H = −Σ_{j=1}^n p_j ln p_j          (discrete case)
    H = −∫ p(y) ln p(y) dy             (continuous case)

Here entropy H can be interpreted in two ways: either as a measure of average uninformativeness (i.e. prior uncertainty) about the random states of nature, or as the average information obtained when a given realization of the state of nature is observed. In either case, we maximize entropy H under the prior information summarized by the first sample moment condition (3). This yields the maximum entropy (ME) principle for determining the probabilities p_j. On solving this nonlinear programming (NLP) problem one obtains the exponential density

    p_j = (1/ȳ) exp(−y_j/ȳ);   y_j ≥ 0                            (4)

This may be called an optimal density, since it maximizes entropy subject to the given prior knowledge. Several interesting extensions of this approach may easily be conceived. First, prior knowledge in the form of sample moments of orders higher than one may be assumed to be known, in which case the DM can specify a sequential method of revising the optimal density estimates. Secondly, the sample moment constraints are used only as summary statistics for specifying inadequate prior knowledge. One may have to apply a criterion of goodness of fit when several competing statistics seem plausible. Thus, if several values ȳ^(i) of ȳ seem plausible, we have to apply the chi-squared test to determine which of the optimal exponential densities fits the observed data best. Finally, once the prior densities have been optimally determined by the ME principle, we may proceed to estimate the parameters of the production frontier either by a parametric procedure (e.g. the maximum likelihood method) or by a non-parametric method.
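The ME solution (4) can be sketched numerically on a discrete grid. The support points and sample mean below are invented for illustration; the Lagrange multiplier of the mean constraint is found by a one-dimensional root search, and the resulting density is compared with another distribution satisfying the same mean constraint:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical output support points y_j and sample mean (illustrative values).
y_vals = np.linspace(0.1, 10.0, 50)
y_bar = 2.5

# The ME density subject to sum(p) = 1 and sum(p*y) = y_bar has the exponential
# form p_j proportional to exp(-lam*y_j); solve for the multiplier lam.
def mean_gap(lam):
    w = np.exp(-lam * y_vals)
    return (w / w.sum()) @ y_vals - y_bar

lam = brentq(mean_gap, 1e-6, 10.0)
p_me = np.exp(-lam * y_vals)
p_me /= p_me.sum()

def entropy(p):
    q = p[p > 0]
    return -(q * np.log(q)).sum()

# Any other distribution with the same mean has lower entropy, e.g. a
# two-point distribution on the endpoints of the grid:
p_two = np.zeros_like(y_vals)
w_hi = (y_bar - y_vals[0]) / (y_vals[-1] - y_vals[0])
p_two[-1], p_two[0] = w_hi, 1.0 - w_hi
print("H(ME) =", entropy(p_me), " H(two-point) =", entropy(p_two))
```

The log-probabilities ln p_j are linear in y_j, i.e. the discrete analogue of the exponential density (4).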
Taking the non-parametric case and the first sample moment as the only prior knowledge, we obtain the following transformation of the DEA model:

    min_{β∈C(β)} max_{p∈C(p)} ψ(β, p) = (β'X − y')p + H(p)        (5)

where C(β) is defined as in (1) and C(p) = {p | Σ_j p_j y_j = ȳ; Σ_j p_j = 1; p_j ≥ 0}. This model has a number of interesting features. For example, it can easily be proved that if the observed data set (X, y) has all positive elements, then there must exist a saddle point (β°, p°) that solves the maximum entropy model (5). Furthermore, if the probability vector p is known or preassigned, the resulting model always yields an optimal solution for the efficiency parameters that minimizes the sum of the absolute values of the errors. Note that Timmer's transformation, which minimizes the objective function ḡ = Σ_{i=1}^m x̄_i β_i, is a special case of this entropy model (5) when the probabilities p_j are assumed known or preassigned. Secondly, if we assume the prior information to be such that ȳ lies in the interval 0 ≤ a ≤ ȳ ≤ b, then the entropy H(p) is maximized with respect to both p_j and ȳ, and this density estimator p_j possesses a number of desirable features. First of all, as Akaike^9 has shown, the standard maximum likelihood principle of estimation may be viewed as a special case of maximizing the entropy, where entropy is defined by the log likelihood function. Secondly, the density estimator (p_j) belongs to the class of non-parametric estimates increasingly applied in modern times.^10 Note, however, that the prior information need not be specified by the first moment condition alone. For instance, a second moment condition may be preassigned by including in the constraint set C(p) the condition

    Σ_{j=1}^n p_j (β'x_j − y_j − μ)² = s²

Thus other distributions, like the normal, gamma, etc., can be derived.
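The claim that adding a second-moment condition yields a normal-type density can be checked numerically. In the sketch below the residual grid and the preassigned moments μ, s² are invented for illustration; the two Lagrange multipliers are found with a standard root solver, and the resulting ME density p_j ∝ exp(−λ₁e_j − λ₂e_j²) is a discretized normal:

```python
import numpy as np
from scipy.optimize import fsolve

# Hypothetical residual grid e_j = beta'x_j - y_j and preassigned first and
# second moments (illustrative values only).
e = np.linspace(-4.0, 4.0, 81)
mu, s2 = 0.5, 1.2

# With mean and variance constraints the ME density is
# p_j proportional to exp(-l1*e_j - l2*e_j**2); solve for (l1, l2).
def moment_gaps(lams):
    l1, l2 = lams
    w = np.exp(-l1 * e - l2 * e**2)
    p = w / w.sum()
    m = p @ e
    return [m - mu, p @ (e - m) ** 2 - s2]

l1, l2 = fsolve(moment_gaps, x0=[0.0, 0.1])
p = np.exp(-l1 * e - l2 * e**2)
p /= p.sum()
# ln p_j is quadratic in e_j: the shape of a normal density with mean mu,
# variance s2, as the text asserts for the two-moment case.
print("mean =", p @ e, " variance =", p @ (e - p @ e) ** 2)
```

With only the first moment constrained one recovers the exponential family of (4); with both moments, the normal; other constraint sets give the gamma and related families.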
For any specific empirical case, however, one has to choose between the alternative densities for the error q_j = β'x_j − y_j, and this choice problem can be resolved by means of a chi-squared test of goodness of fit.

3. MEASURING DYNAMIC EFFICIENCY

The concept of productive efficiency used by Farrell and the DEA models makes no distinction between current and capital inputs, and in this sense it is static, since the production function introduces no dynamic considerations, either through time lags between the inputs and output or through the presence of capital inputs which may generate output beyond the current period. One general way to incorporate the dynamic elements into a static frontier is to introduce time lags between the capital inputs and the output:

    y_t = β_0 + Σ_{i=1}^m β_i x_it + Σ_{i=0}^r γ_i z_{t−i} − ε_t

where x_it is the current input and z_{t−i} is a proxy for various dynamic inputs. One could, of course, replace the proxy variable by a vector of dynamic inputs such as capital of different types (i.e. its rates of utilization) or different vintages, provided such data are available. The proxy variable can also be interpreted in terms of the theory of adjustment cost, where it is postulated that firms tend to adjust to a long-run dynamic production frontier. Two types of empirical problems arise in estimating the dynamic production frontier above. The first is that of optimally determining the order r of the maximum lag associated with the dynamic inputs z_{t−i}. The second arises when the input coefficients γ_i follow an adjustment process of a distributed lag model, where the marginal impact of lagged capital inputs declines over time, so that distant inputs have negligible effects on current output. In both cases one could apply the maximum entropy principle to determine the optimal lag r_0, say, and the optimal lag distribution.
The optimal order of the lag in the first case may be determined by maximizing the correlation determinant associated with the problem, where the latter is related to the entropy concept. In the second case, one may rewrite the model as

    y_t = β_0 + Σ_{i=1}^m β_i x_it + Σ_{i=0}^r p_i(γ) z_{t−i} − ε_t

where p_i(γ) is the probability density of the lag distribution. If we assume that the mean lag is given by prior information with a range (0, a), then the maximum entropy principle yields the optimal density as the geometric form

    p_i(γ) = (1 − λ)λ^i,   i = 0, 1, 2, ...;   0 < λ < 1

where λ is determined by the mean-lag condition. This has the property that the more distant the lag, the less important its influence on output. The optimal value r_0 of r, the maximum time lag, can also be determined by the ME principle, since any value higher than r_0 would not increase the value of entropy. Sengupta^11 has discussed elsewhere several applications of the ME principle in recent economic models.

4. PREDICTIVE USE OF INFORMATION THEORY

We now consider the use of information theory in terms of its predictive power and how it can be incorporated in the multiple-output case of a DEA model. In the case of multiple outputs, the specification and estimation of the production frontier raise additional problems not considered by current econometric applications. If several outputs are combined into a single weighted output, as in the theory of canonical correlation, the choice of weights becomes an important issue. We consider here a prediction criterion and show how this correlation-based criterion can be used on the basis of the entropy principle. Consider the case where each unit j ∈ I_n has s outputs (y_jh) and m inputs (x_ji). Then we would test for each point k ∈ I_n whether it is efficient by solving the following LP model:

    min_{α,β} g_k = Σ_{i=1}^m x_ki β_i                            (6a)
    s.t. Σ_{i=1}^m x_ji β_i − Σ_{h=1}^s y_jh α_h ≥ 0,  j ∈ I_n
         Σ_{h=1}^s y_kh α_h = 1;   α, β ≥ 0

Clearly, if the elements α_h were given or known, then the composite output y_j^c = Σ_h y_jh α_h could be used as a single output and our earlier discussions readily applied.
However, the weights α = (α_h) defining the composite output are not generally known, although prior information may sometimes be subjectively available. The standard DEA model (6a) determines the optimal weights (α, β) without giving any consideration to the correlation between the composite input x_j^c = x_j'β and the composite output y_j^c = y_j'α. This is in sharp contrast with the regression approach to the response function, where these weights are determined so as to maximize the correlation R^c between x_j^c and y_j^c, j ∈ I_n:

    R^c = (α'V_yx β) / [(α'V_yy α)(β'V_xx β)]^{1/2}               (6b)

where V_pq denotes the variance-covariance matrix of the two vectors p and q. It is clear that the DEA model would improve considerably in terms of predictive power if this correlation measure R^c could be incorporated. Note that the equality relation Σ_h y_kh α_h = 1 in (6a) is used as a normalization condition. If this relation is dropped, then the objective function should be reformulated as

    min g_k = Σ_{i=0}^m x_ki β_i − Σ_{h=0}^s y_kh α_h

where β_0, α_0 are intercept terms that are also incorporated in the constraints for each fixed j. Here the observed data set D = (x_j, y_j; j ∈ I_n) comprises input (x_j) and output (y_j) vectors of dimensions m + 1 and s + 1, respectively, for each unit j = 1, 2, ..., n. Let α*(k), β*(k) be the optimal solutions of the extended model. Then the unit or firm k is efficient by the DEA efficiency criterion if

    β*'(k)x_k − α*'(k)y_k = 0

and the solution vector is non-degenerate; the latter condition is needed for uniqueness. By varying k in the index set I_n = {1, 2, ..., n} and solving n LP models of the form (6a), all the efficient units can be determined. Let D_1 be the subset of the entire data set D which contains the efficient points only.
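The correlation measure (6b) can be computed as a standard canonical correlation problem: (R^c)² is the largest eigenvalue of V_xx⁻¹ V_xy V_yy⁻¹ V_yx. The sketch below uses synthetic data invented for illustration and verifies the eigenvalue against the direct correlation of the resulting composite variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic multi-input, multi-output data (illustrative only): outputs are
# linearly related to inputs plus noise.
n, m, s = 200, 3, 2
X = rng.normal(size=(n, m))
Y = X @ rng.normal(size=(m, s)) + 0.5 * rng.normal(size=(n, s))

Vxx = np.cov(X, rowvar=False)
Vyy = np.cov(Y, rowvar=False)
Vxy = np.cov(np.hstack([X, Y]), rowvar=False)[:m, m:]

# (R^c)^2 in (6b) is the largest eigenvalue of Vxx^{-1} Vxy Vyy^{-1} Vyx.
M = np.linalg.solve(Vxx, Vxy) @ np.linalg.solve(Vyy, Vxy.T)
vals, vecs = np.linalg.eig(M)
idx = np.argmax(vals.real)
r2 = vals.real[idx]

# Recover the weight vectors: beta is the eigenvector, alpha = Vyy^{-1} Vyx beta;
# the correlation of the composite variables x'beta and y'alpha equals R^c.
beta = vecs[:, idx].real
alpha = np.linalg.solve(Vyy, Vxy.T @ beta)
r_direct = np.corrcoef(X @ beta, Y @ alpha)[0, 1]
print("(R^c)^2 =", r2, " direct correlation =", r_direct)
```
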
For all j ∈ D_1, define two random variables y^c = α'y and x^c = β'x as the composite output and the composite input, respectively. Now we define a measure of the mutual information in the random variable y^c relative to the other variable x^c by

    I(y^c, x^c) = H(y^c) + H(x^c) − H(y^c, x^c)                   (7)

where H(·) is the entropy defined before. Thus, if the probability densities of y^c, x^c and (y^c, x^c) are denoted by f(y^c), f(x^c) and f(y^c, x^c) and assumed continuous, then the mutual information statistic I(y^c, x^c) can also be expressed in terms of the conditional entropy of y^c given x^c:

    I(y^c, x^c) = H(y^c) − H(y^c | x^c)                           (8)

where

    H(y^c) = E[−ln f(y^c)] = −∫ f(y^c) ln f(y^c) dy^c
    H(y^c | x^c) = E[−ln f(y^c | x^c)]

Now suppose (y^c, x^c) have a joint bivariate normal density with correlation coefficient ρ; then the information about y^c contained in the random variable x^c can easily be computed from (8) as

    I(y^c, x^c) = −(1/2) ln(1 − ρ²)                               (9)

It is clear from (8) that if y^c and x^c are statistically independent, then I(y^c, x^c) = 0. Parzen^12 has shown that the maximum likelihood estimator of I(y^c, x^c), i.e.

    Î(y^c, x^c) = −(1/2) ln(1 − r̂²)

where r̂ is the sample correlation coefficient, can be used to test the statistical independence of y^c and x^c. We may employ this statistic in our framework in two different ways. One is to choose the vectors α and β so as to maximize the mutual information I(y^c, x^c) between the composite output and the composite input. This is equivalent to maximizing their squared correlation (R^c)² defined in (6b). In the case of a single output, this amounts to choosing the parameter vector β so as to maximize the predictive power of x^c in explaining the variations in output. Note that this result would hold asymptotically if the variables (y^c, x^c) are not normal but tend to joint normality only asymptotically.
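Relations (9) and the Parzen estimator can be illustrated on a simulated bivariate normal sample; the correlation ρ and sample size are chosen arbitrarily for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated bivariate normal composite variables with known correlation rho.
rho, n = 0.8, 5000
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
yc, xc = z[:, 0], z[:, 1]

r_hat = np.corrcoef(yc, xc)[0, 1]
I_hat = -0.5 * np.log(1.0 - r_hat**2)   # Parzen's estimator of I(y^c, x^c)
I_true = -0.5 * np.log(1.0 - rho**2)    # population value from (9)
print("I_hat =", I_hat, " I_true =", I_true)
```

When y^c and x^c are independent, r̂ ≈ 0 and the statistic is near zero, which is what makes it usable as an independence test.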
Furthermore, prior information may be allowed as additional constraints before one maximizes the information statistic I(y^c, x^c). Sengupta^13 has discussed several types of transformation which may convert a bivariate non-normal density to approximate normality. A second way to incorporate the predictive power is to reformulate the multiple-output model (6a) so that the objective function reflects the maximum of the mutual information I(y^c, x^c). This yields the following model:

    max_{α,β} z = α'ȳ − β'x̄ + I(y^c, x^c)                        (10)
    s.t. Xβ − Yα ≥ 0
         α'V_yy α = 1;   β'V_xx β = 1

where (x̄, ȳ) are the mean input and output vectors and the last two equalities specify the normalization conditions. This generalized model has a number of flexible features. First of all, if the composite variables are normally distributed, then the objective function can be reformulated as

    max z = α'ȳ − β'x̄ + α'V_yx β

where V_yx is the covariance matrix of the output and input vectors. This model is no longer linear, and hence its optimal solutions (α*, β*) are more diversified than the LP solutions used in the DEA model. Secondly, since the errors ε_j are additive in the formulation, one can apply a simple transformation to obtain a least-squares-like formulation as

    α'y_j = β'x_j + β_0 + u_j;   β_0 = −μ,   u_j = μ − ε_j

where μ is the unknown mean of the error term ε_j and the new disturbance term u_j satisfies most of the least squares conditions, e.g. zero mean and constant variance. Note that if approximate normality holds, then this method can be improved further by utilizing any prior information about μ, e.g. that it belongs to the interval a ≤ μ ≤ b where a, b are fixed constants (e.g. a truncated normal). Clearly the information theory approach can handle such prior information by adjoining additional constraints to the generalized model (10). Finally, one can evaluate alternative ways of aggregating outputs into a composite variable.
For instance, one may preassign equal weights α_h = 1 for all h, so that y^c is merely the sum of all types of output, and then compare it with an optimal composite output y*^c where the optimal weights α* = (α_h*) are used. Clearly, the predictive power of the optimal-weight model would be much higher, since this objective is already built into the model.

5. EMPIRICAL APPLICATIONS

Our empirical applications consider input-output data, all in logarithmic units, for selected public elementary school districts in California for the years 1976-77 and 1977-78, over which separate regression estimates of educational production functions were made in earlier studies by the present author. Statistics on enrollment, average teacher salary and standardized test scores are all obtained from the published official statistics. Out of a larger set of 35 school districts we selected 25 in the three contiguous counties of Santa Barbara, Ventura and San Luis Obispo, on the basis of separate homogeneity tests based on the Goldfeld-Quandt statistic. The four output variables were the standardized test scores in reading (y_1), writing (y_2), spelling (y_3) and mathematics (y_4). Two measures of aggregate output can then be defined. One is the composite output y^c = α'y based on the optimal weights; the other is the average output ȳ = (y_1 + y_2 + y_3 + y_4)/4 with equal weight on each output. As input variables we had a choice of eight variables, of which the following four were utilized in our LP estimates: x_1, the average instructional expenditure; x_2, the proportion of minority enrollment; x_3, the average class size; and x_4, a proxy variable representing the parental socioeconomic background. Again, one can define a composite input x^c = β'x and an average input, though the latter may not be very meaningful owing to the diverse nature of the four inputs.
In our previous studies^14 we applied the DEA model to the multiple-output, multiple-input case, but since the results are not much different from those of the single composite output case, we report here only the latter results. Also, the input-output data set may be divided into two groups according to whether the LP solutions of the DEA model are degenerate or non-degenerate. The first sample group, with n_1 = 16, contains only the non-degenerate cases, while the remaining samples (n_2 = 9) comprise the degenerate cases where some of the parameter estimates are zero. On the basis of these statistical data, three types of illustrative application are made. One is the application of the DEA model (1), by which the data set of 25 school districts is decomposed into a subset D_1 of efficient units containing 35% of the districts and the remaining subset D_2 containing 65% of the districts. The overall frequency distribution of the efficiency ratio e_j, defined as e_j = 1 − (y_j/y_j*) where y_j* is the efficient output for j, appears as shown in Table I. Normality testing by the Shapiro-Wilk statistic strongly rejected the null hypothesis that the underlying distribution is normal. The exponential density derived under the ME principle with the first moment restriction was then tested by the Kolmogorov-Smirnov statistic, and it was not rejected at the 5% level of significance. The second application compares the multiple-output situation in three cases in terms of the mean square error (MSE) of prediction. The first case uses the average of the four observed outputs (ȳ); the second uses the composite output (y_S) from the mean LP model when the terms I(y^c, x^c), β'V_xx β and α'V_yy α are dropped; and the third case uses the composite output (y_R), which maximizes the sample correlation between the inputs and outputs. The results are given in Table II.
It is clear that by incorporating the measure of mutual information defined in (10), the predictive power of the DEA model improves considerably. Since the composite output y_R may be interpreted as a canonical variate, one may conclude that the use of canonical variates maximizing the correlation between the inputs and the outputs helps improve the predictive power of the DEA model, as it does in a multiple regression model. This is observed more strongly when the data points are suitably clustered around cone domains of the mean; e.g. for data points close to the mean or the median LP model the predictive power is observed to be the highest.

Table I
    Domain of e_j     Frequency (%)
    0-0.04            36
    0.05-0.07         20
    0.08-0.11         28
    0.12-0.15         4
    >0.15             12

Table II. MSE over LP solutions
    Output measure    Non-degenerate cases    Degenerate cases    Total
                      n_1 = 16                n_2 = 9             n = 25
    ȳ                 0.141                   0.172               0.164
    y_S               0.008                   0.045               0.032
    y_R               0.002                   0.027               0.011

It is of some interest here to report briefly some previous calculations of canonical correlation coefficients of different orders, analysed elsewhere,^14 which confirm that the two-stage procedure of estimation of the production frontier considerably improves the goodness of fit of the estimated model. Denoting the composite output by y^c_(k) for the kth canonical correlation, k = 1, 2, 3, 4, and the squared correlation coefficient by r²_(k), the results are given in Table III. It is clear that the DEA model outperforms the ordinary canonical approach in terms of the predictive r² criterion. Thus, if the efficient units are first identified and screened, and the ordinary canonical correlation then applied, it tends to improve the predictive power. Secondly, the first two squared canonical correlations r²_(1), r²_(2) are found to be statistically significant at the 5% level in terms of Fisher's z-statistic. By the same test, the third- and fourth-order canonical correlations are not significant.
This makes a strong case for ignoring the third- and fourth-order canonical variates in our composite output transformation applied to the DEA models.

Table III
    k                                        1       2       3       4
    A. Ordinary canonical approach: r²_(k)   0.83    0.50    0.20    0.0004
    B. Two-stage DEA approach: r²_(k)        0.70    0.49    0.09    0.01

Finally, we consider the third application, which uses time series data for the 10 years 1976-86 to evaluate the marginal impact of one of the most important inputs, x_3t = z_t, the average class size. Over the years the changes in class size have reflected the rising trend of minority enrollment and the declining trend of budget allocations for school districts. The regression estimates based on the subset D_1 of efficient units for each year yield the results given in Table IV for the production frontier

    y_t = β_0 + Σ_{i=1}^3 β_i x_it + Σ_{i=0}^3 γ_i z_{t−i}

over the period 1976-86.

Table IV (t-values in parentheses)
    β_0     −11.4*   (−9.1)
    β_1     −0.11*   (−2.5)
    β_2     4.9**    (10.2)
    β_3     5.3**    (10.2)
    γ_0     −1.6**   (−10.3)
    γ_1     −0.01    (0.85)
    γ_2     0.02     (0.21)
    γ_3     0.004    (0.1)
    R²      0.96

When we include all the data, containing both the efficient and the inefficient units, the class size coefficients γ_0 and γ_1, γ_2, γ_3 tend to be reduced considerably, with γ_0 turning out to be statistically insignificant at the 5% level of the t-test. Also, the maximum lag turned out to be r_0 = 2.0, suggesting that this input may not have played a critical role in a dynamic sense. The optimal lag distribution determined by the ME principle has the following geometric form:

    p_i(γ) = (0.14)(0.86)^i,   i = 0, 1, 2, ...

This suggests that the regression coefficient of 0.86 associated with the static model, which ignores the time effects, probably overestimates the marginal impact of class size on student performance.
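The geometric lag density reported above can be checked directly: for p_i = (1 − λ)λ^i the probabilities sum to one and the implied mean lag is λ/(1 − λ). The sketch below uses the paper's reported decay parameter λ = 0.86 and a truncation point chosen far beyond any relevant lag:

```python
import numpy as np

# Geometric ME lag density with the decay parameter reported in Section 5:
# p_i = (0.14) * (0.86)**i, i = 0, 1, 2, ...
lam = 0.86
i = np.arange(200)                # truncation; 0.86**200 is negligible
p = (1.0 - lam) * lam**i

total = p.sum()                   # should be ~1 up to truncation error
mean_lag = (p * i).sum()          # implied mean lag, lam / (1 - lam)
print("total mass =", total, " mean lag =", mean_lag)
```

The mean lag implied by λ = 0.86 is λ/(1 − λ) ≈ 6.1 periods, while the mass is concentrated on the early lags, consistent with the property that distant lags matter progressively less.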
6. CONCLUDING REMARKS

Our methods of modelling the maximum entropy approach in the current theory of estimation of production frontiers have emphasized three of the most important aspects of data-based modelling of economic systems. First, the approach seeks to utilize the observed data characteristics, in the form of the mean and other higher moments, to estimate a best approximation of the true distribution underlying the data. It then incorporates the distribution of the data set in specifying a two-stage process of estimation: the first stage determines the distribution most consistent with the observed data and the decision maker's prior knowledge about them, while the second stage uses that distribution to estimate the parameters of the production frontier. For example, if the observed data are generated by a gamma distribution, as specified by the maximum entropy approach, then the maximum likelihood method can be applied at the second stage to estimate the parameters of the gamma model. Secondly, by relating this approach to the method of moments and to canonical correlation for multiple outputs, we have attempted to show its generality and usefulness in situations where significant departures from normality are suspected. It thus provides a more general framework as an alternative to traditional least squares. Finally, although we have applied it here in the context of a production frontier, the maximum entropy principle can obviously be applied in other areas of operations research, such as demand and cost studies and reliability and replacement models. Modelling complex systems that have subsystems provides a very natural framework for applying the maximum entropy approach, as various applications^15 in image processing and game theory models show.

ACKNOWLEDGEMENT

The author would like to thank an anonymous referee for helpful suggestions.

REFERENCES

1. L. M. Seiford and R. M.
Thrall, 'Recent developments in DEA: the mathematical programming approach to frontier analysis', Journal of Econometrics, 46(1), 7-38 (1990).
2. M. J. Farrell, 'The measurement of productive efficiency', Journal of the Royal Statistical Society, Series A, 120, 253-290 (1957).
3. A. Charnes, W. W. Cooper and E. Rhodes, 'Measuring the efficiency of decision-making units', European Journal of Operational Research, 2, 429-444 (1978).
4. C. P. Timmer, 'Using a probabilistic frontier production function to measure technical efficiency', Journal of Political Economy, 79, 776-794 (1971).
5. L. Johansen, Production Functions, North-Holland, Amsterdam, 1972.
6. P. C. Fishburn, Decision and Value Theory, Wiley, New York, 1964.
7. Z. W. Kmietowicz and A. D. Pearman, Decision Theory and Incomplete Knowledge, Gower, Aldershot, U.K., 1981.
8. R. J. Kopp and J. Mullahy, 'Moment-based estimation and testing of stochastic frontier models', Journal of Econometrics, 46, 165-184 (1990).
9. H. Akaike, 'On entropy maximization principle', in Applications of Statistics, North-Holland, Amsterdam, 1977.
10. R. L. Eubank, Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York, 1988.
11. J. K. Sengupta, 'Maximum entropy in applied econometric research', International Journal of Systems Science, 22, 1941-1951 (1991).
12. E. Parzen, 'Time series model identification by estimating information', in Studies in Econometrics, Time Series and Multivariate Statistics, Academic Press, New York, 1983.
13. J. K. Sengupta, 'Transformations in stochastic DEA models', Journal of Econometrics, 46, 109-124 (1990).
14. J. K. Sengupta, 'Data envelopment with maximum correlation', International Journal of Systems Science, 20, 2085-2093 (1989).
15. J. Skilling and R. K. Bryan, 'Maximum entropy image reconstruction: general algorithm', Monthly Notices of the Royal Astronomical Society, 211, 111-124 (1984).