APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY
Appl. Stochastic Models Bus. Ind. 2010; 26:448–472
Published online 15 September 2009 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asmb.803

Divergences without probability vectors and their applications

Athanasios Sachlas and Takis Papaioannou
Department of Statistics and Insurance Science, University of Piraeus, 185 34 Piraeus, Greece

SUMMARY
In general, divergences and measures of information are defined for probability vectors. In some cases, however, divergences are "informally" used to measure the discrepancy between vectors that are not necessarily probability vectors. In this paper we examine whether divergences with nonprobability vectors in their arguments share the properties of probabilistic or information theoretic divergences. The results indicate that divergences with nonprobability vectors share, under some conditions, some of the properties of probabilistic or information theoretic divergences and can therefore be considered and used as information measures. We then use these divergences in the problem of actuarial graduation of mortality rates. Copyright 2009 John Wiley & Sons, Ltd.

Received 8 February 2008; Revised 24 July 2009; Accepted 25 July 2009

KEY WORDS: Kullback–Leibler divergence; Cressie–Read divergence; divergence with nonprobability vectors; graduation of mortality rates

1. INTRODUCTION

There are many practical problems where nonprobability vectors are involved. One such problem is the actuarial graduation of mortality rates. Although divergences and/or measures of information are defined for probability vectors, in practice they are used with nonprobability vectors as well. The main purpose of this paper is to explore the properties of divergences without probability vectors and to provide an application in the actuarial field.
A bivariate function D(f, g) of two functions or vectors f and g is a measure of divergence if D(f, g) ≥ 0 with equality if and only if f = g (see [1]). This is the minimal requirement for a measure D(f, g) to be a "kind" of distance between f and g. In [2, p. 2] it is mentioned that a coefficient with the property of increasing as the two distributions involved move "further from each other" will be called a divergence measure between two probability distributions. For other requirements see [3, 4]. There are many measures of divergence, many of which originate from or are connected with information theory. For a good review see Chapter 1 of [2], Chapter 7 of [3], [5–7] and the references cited therein. In this paper we concentrate on two of the most important divergences in statistics, the Kullback–Leibler and the Cressie–Read power divergences. Papaioannou in [6, 7] presents in detail the properties of information measures. These are nonnegativity, additivity–subadditivity, conditional inequality, maximal information, invariance under sufficient transformations, convexity, loss of information, sufficiency in experiments, appearance in Cramér–Rao inequalities, invariance under parametric transformations, nuisance parameter inequality, order preserving property and limiting property. Measures of divergence, measures of information and their properties are still a topic under research.

Correspondence to: Takis Papaioannou, Department of Statistics and Insurance Science, University of Piraeus, 185 34 Piraeus, Greece. E-mail: [email protected]. Part of the work of the second author was done while visiting the Department of Mathematics and Statistics of the University of Cyprus.
New measures of divergence are proposed and their properties are investigated in [8], while Papaioannou and Ferentinos [9] examine the Fisher information number in the light of the properties of classical statistical information theory. However, there is no universal agreement among statisticians on which properties constitute or define a measure of statistical information, as the approach is mostly operational rather than axiomatic [9]. The aim of this paper is to examine the properties of measures of divergence when they involve nonprobability vectors and to present an application in the graduation problem of actuarial science. The motivation stems from the work of Brockett [10]. An important conclusion of this investigation is that an additional constraint, (vi) of Section 4.1 below, should be included in the divergence minimization process. Inclusion of this constraint dramatically improves goodness-of-fit results. Here we have to mention the work of Csiszar [11], who considers linear inverse problems (problems with linear constraints) with n-dimensional real vectors, vectors with positive components, or probability vectors. His aim is to determine logically consistent rules for selecting such a vector. His selection–projection rules minimize "distances" between such vectors, or functions thereof, subject to linear constraints. Several postulates–axioms characterize the projection rules. As corollaries, axiomatic characterizations of the methods of least squares, minimum discrimination information and maximum entropy are obtained. In this context he presents an extension of the Kullback–Leibler divergence to positive vectors that are not necessarily probability vectors. More precisely, he added the quantity Σ_i q_i − Σ_i p_i to the standard Kullback–Leibler measure of directed divergence, that is, he defined

I^KL(p, q) = Σ_i [ p_i ln(p_i / q_i) − p_i + q_i ]

where p = (p_1, ..., p_n)^T and q = (q_1, ..., q_n)^T are just vectors with n real positive components.
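Csiszar's extended divergence is simple to compute directly from this formula; the following is a minimal sketch (the function name and the test vectors are ours, not from the paper):

```python
import math

def extended_kl(p, q):
    """Csiszar's extended Kullback-Leibler divergence for positive,
    not necessarily probability, vectors:
    sum_i [ p_i*ln(p_i/q_i) - p_i + q_i ]."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

# The components need not sum to one; the measure is still nonnegative
# and vanishes exactly when p = q.
p = [0.3, 0.5, 0.9]   # sums to 1.7
q = [0.4, 0.4, 0.8]   # sums to 1.6
print(extended_kl(p, q))   # positive
print(extended_kl(p, p))   # 0.0
```

Each summand p_i ln(p_i/q_i) − p_i + q_i is itself nonnegative (since x ln(x/y) ≥ x − y for positive x, y), which is why the extension keeps the full nonnegativity property without any condition on the component sums.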
The above I^KL-divergence is nonnegative and equals zero if and only if p = q; thus, it satisfies the minimal requirement for a function to be a measure of divergence. In the case that the components of p and q sum to one, the above-mentioned divergence reduces to the standard Kullback–Leibler directed divergence. We will refer to this measure as the extended Kullback–Leibler divergence and denote it by I^KL_ext(p, q). For other, more axiomatic than information theoretic, properties of I^KL_ext(p, q), see [11].

The paper is organized as follows: In Sections 2 and 3 we study the properties of the Kullback–Leibler and Cressie–Read power divergences for nonprobability vectors in the light of statistical information theory. In Section 4 we describe an application of divergence measures involving nonprobability arguments in the actuarial field, supported by a numerical investigation. Section 5 contains concluding remarks.

2. KULLBACK–LEIBLER DIRECTED DIVERGENCE INVOLVING NONPROBABILITY VECTORS

The most popular measure of divergence between two n-dimensional probability vectors p* and q* is the Kullback–Leibler measure of information given by

I^KL(p*, q*) = Σ_i p*_i ln(p*_i / q*_i)

(see [12, pp. 6–7]). It is a measure of directed divergence in the sense that it does not share all the properties of a distance or metric (it is not symmetric and does not satisfy the triangle inequality) and therefore cannot be considered a pure distance. The Kullback–Leibler directed divergence is defined for probability vectors and shares most of the properties that all information measures share.
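The lack of symmetry is easy to exhibit numerically; a small sketch with two arbitrary probability vectors of our choosing:

```python
import math

def kl(p, q):
    """Kullback-Leibler directed divergence between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
# The two directed divergences differ, so I_KL is not a metric.
print(kl(p, q))   # approx 0.233
print(kl(q, p))   # approx 0.209
```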
In the sequel we investigate whether the Kullback–Leibler directed divergence between two nonprobability vectors can be considered a measure of information, by examining its properties in the light of the general properties of measures of information and divergence.

Definition 1
The Kullback–Leibler directed divergence between two n-dimensional nonprobability vectors p and q is defined by

D^KL(p, q) = Σ_{i=1}^n p_i ln(p_i / q_i)    (1)

where p = (p_1, ..., p_n)^T > 0 and q = (q_1, ..., q_n)^T > 0 with Σ_{i=1}^n p_i ≠ 1 and Σ_{i=1}^n q_i ≠ 1.

Lemma 1
For the Kullback–Leibler directed divergence with nonprobability vectors, it holds that

D^KL(p, q) = ( Σ_{i=1}^n p_i ) [ I^KL(p*, q*) + ln k ]

where k = Σ_{i=1}^n p_i / Σ_{i=1}^n q_i, and I^KL(p*, q*) is the Kullback–Leibler measure involving the probability vectors p* and q*, the elements of which are the normalized elements of p and q, that is, p*_i = p_i / Σ_{j=1}^n p_j and q*_i = q_i / Σ_{j=1}^n q_j, i = 1, ..., n.

Proof
The result follows by simple algebra.

Proposition 1 (The nonnegativity property)

D^KL(p, q) ≥ 0    (2)

if one of the following conditions holds:

(i) Σ_{i=1}^n p_i ≥ Σ_{i=1}^n q_i;
(ii) Σ_{i=1}^n p_i < Σ_{i=1}^n q_i and ln k > −I^KL(p*, q*), where k = Σ_{i=1}^n p_i / Σ_{i=1}^n q_i.

Equality in (2) holds if p = q or ln k = −I^KL(p*, q*). Moreover, if Σ_{i=1}^n p_i = Σ_{i=1}^n q_i then D^KL(p, q) = 0 if and only if p = q.

Proof
The proof is obvious on account of the nonnegativity property of I^KL(p*, q*).

Note that D^KL(p, q) = 0 does not necessarily imply p = q unless Σ_{i=1}^n p_i = Σ_{i=1}^n q_i. Thus the minimal requirement for using D^KL(p, q) as a measure of divergence is Σ_{i=1}^n p_i = Σ_{i=1}^n q_i. Numerically, this can be seen with the vectors p = (0.1584514, 0.2201928, 0.7247736)^T and q = (0.4, 0.8, 0.4)^T, with Σ_i p_i ≠ Σ_i q_i, for which D^KL(p, q) = 4.78877×10^{-7} is almost zero, while their Euclidean distance is ||p − q|| = 0.707107.
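Both Lemma 1 and the numerical remark above can be checked directly; a minimal sketch (the helper names are ours):

```python
import math

def d_kl(p, q):
    """Kullback-Leibler directed divergence for positive nonprobability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def d_kl_lemma1(p, q):
    """The same quantity via Lemma 1: (sum_i p_i) * [ I_KL(p*, q*) + ln k ]."""
    sp, sq = sum(p), sum(q)
    k = sp / sq
    # I_KL between the normalized vectors p* and q*
    i_kl = sum((pi / sp) * math.log((pi / sp) / (qi / sq)) for pi, qi in zip(p, q))
    return sp * (i_kl + math.log(k))

p = [0.1584514, 0.2201928, 0.7247736]
q = [0.4, 0.8, 0.4]
print(d_kl(p, q))                           # almost zero, although p != q
print(abs(d_kl(p, q) - d_kl_lemma1(p, q)))  # the two forms agree
```

The near-zero value occurs precisely because I^KL(p*, q*) and ln k cancel; this is the situation the minimal requirement Σ_i p_i = Σ_i q_i is designed to rule out.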
Definition 1 has obvious extensions to the bivariate and multivariate cases. We present the related definitions for the bivariate case.

Definition 2 (Bivariate divergence)
Let p_i(x, y), i = 1, 2, be two bivariate measures (nonprobability functions) associated with two discrete variables X, Y in R^2, for which Σ_x Σ_y p_i(x, y) ≠ 1. We define the Kullback–Leibler directed divergence between two bivariate nonprobability functions p_1, p_2 as

D^KL_{X,Y}(p_1, p_2) = Σ_x Σ_y p_1(x, y) ln( p_1(x, y) / p_2(x, y) )

Definition 3 (Conditional divergence)
For the discrete variables X, Y and the bivariate nonprobability functions p_i(x, y), i = 1, 2, as given above, let f_i(x) = Σ_y p_i(x, y), h_i(y|x) = p_i(x, y)/f_i(x), g_i(y) = Σ_x p_i(x, y) and r_i(x|y) = p_i(x, y)/g_i(y), i = 1, 2. We set

D^KL_{Y|X=x}(h_1, h_2) = Σ_y h_1(y|x) ln( h_1(y|x) / h_2(y|x) )

and define

D^KL_{Y|X}(h_1, h_2) = E_X[ D^KL_{Y|X=x}(h_1, h_2) ] = Σ_x f_1(x) Σ_y h_1(y|x) ln( h_1(y|x) / h_2(y|x) )

Conditional divergences D^KL_{X|Y=y}(r_1, r_2) and D^KL_{X|Y}(r_1, r_2) are defined analogously.

Proposition 2 (Strong additivity)
Let p_1, p_2 be two bivariate nonprobability functions associated with two discrete variables X, Y in R^2 as in Definition 2. Then

D^KL_{X,Y}(p_1, p_2) = D^KL_X(f_1, f_2) + D^KL_{Y|X}(h_1, h_2) = D^KL_Y(g_1, g_2) + D^KL_{X|Y}(r_1, r_2)

where the functions f_i, h_i, g_i, r_i, i = 1, 2, are as in Definition 3.
Proof
For the variables X, Y we have

D^KL_X(f_1, f_2) + D^KL_{Y|X}(h_1, h_2)
  = Σ_x f_1(x) ln( f_1(x)/f_2(x) ) + Σ_x f_1(x) Σ_y ( p_1(x, y)/f_1(x) ) ln( [ p_1(x, y) f_2(x) ] / [ p_2(x, y) f_1(x) ] )
  = Σ_x f_1(x) ln( f_1(x)/f_2(x) ) + Σ_x Σ_y p_1(x, y) [ ln( p_1(x, y)/p_2(x, y) ) + ln( f_2(x)/f_1(x) ) ]
  = Σ_x f_1(x) ln( f_1(x)/f_2(x) ) + Σ_x Σ_y p_1(x, y) ln( p_1(x, y)/p_2(x, y) ) + Σ_x f_1(x) ln( f_2(x)/f_1(x) )
  = D^KL_{X,Y}(p_1, p_2)

In a similar way we prove that D^KL_{X,Y}(p_1, p_2) = D^KL_Y(g_1, g_2) + D^KL_{X|Y}(r_1, r_2).

Corollary 1
(i) D^KL_{X,Y}(p_1, p_2) ≥ D^KL_X(f_1, f_2) with equality if and only if D^KL_{Y|X}(h_1, h_2) = 0;
(ii) D^KL_{X,Y}(p_1, p_2) ≥ D^KL_Y(g_1, g_2) with equality if and only if D^KL_{X|Y}(r_1, r_2) = 0;
(iii) D^KL_{X,Y}(p_1, p_2) ≥ D^KL_{Y|X}(h_1, h_2) with equality if and only if D^KL_X(f_1, f_2) = 0;
(iv) D^KL_{X,Y}(p_1, p_2) ≥ D^KL_{X|Y}(r_1, r_2) with equality if and only if D^KL_Y(g_1, g_2) = 0.

In all the above cases equality holds if and only if the normalized values of X, Y are independent. The normalized values of X, Y form two random variables X*, Y* with discrete joint probability mass function p*_i(x, y) = p_i(x, y) / Σ_x Σ_y p_i(x, y) and marginal and conditional probability mass functions as follows: X* ~ f*_i, Y*|X* ~ h*_i, Y* ~ g*_i, X*|Y* ~ r*_i. For the random variables X*, Y* we have

I^KL_{X*,Y*}(p*_1, p*_2) = Σ_x Σ_y p*_1(x, y) ln( p*_1(x, y) / p*_2(x, y) )

Proposition 3 (Weak additivity)
If h_i(y|x) = g_i(y), and thus p_i(x, y) = f_i(x) g_i(y), i = 1, 2, then the random variables X*, Y*, produced by normalizing X, Y as indicated above, are independent, and

D^KL_{X,Y}(p_1, p_2) = D^KL_X(f_1, f_2) + D^KL_Y(g_1, g_2) − α ln β

where α = Σ_y g_1(y) = Σ_x f_1(x) and β = Σ_y g_1(y) / Σ_y g_2(y) = Σ_x f_1(x) / Σ_x f_2(x).

Proof
On account of the definitions given above, it is easy to see that p*_i(x, y) = f*_i(x) g*_i(y), which means that the random variables X*, Y* are independent. We know that
( f 1 , f 2 )+ IY ? (g1 , g2 ) Copyright q 2009 John Wiley & Sons, Ltd. Appl. Stochastic Models Bus. Ind. 2010; 26:448?472 DOI: 10.1002/asmb 453 DIVERGENCES AND THEIR APPLICATIONS Then for the variables X, Y and in view of Lemma 1 we have that D KL X,Y ( p1 , p2 ) = = x y x y ? ? p1 (x, y) [I XKL ? ,Y ? ( p1 , p2 )+ln s] ? ? KL ? ? p1 (x, y) [I XKL ? ( f 1 , f 2 )+ IY ? (g1 , g2 )+ln s] (3) where s = x y p1 (x, y)/ x y p2 (x, y). As x f i (x) = y gi (y) = x y pi (x, y), i = 1, 2, we have that = x y p1 (x, y) and s = . Therefore by Equation (3) we have that KL ? ? KL ? ? D KL X,Y ( p1 , p2 ) = I X ? ( f 1 , f 2 )+IY ? (g1 , g2 )+ ln = KL D KL X ( f 1 , f 2 )? ln + DY (g1 , g2 )? ln + ln = KL D KL X ( f 1 , f 2 )+ DY (g1 , g2 )? ln The weak additivity holds if x f 1 (x) = x f 2 (x) or y g1 (y) = y g2 (y). Proposition 4 (Maximal information and sufficiency) Let Y = T (X ) be a measurable transformation of X and pi = pi (x), gi = gi (y), i = 1, 2. Then KL D KL X ( p1 , p2 )DY (g1 , g2 ) with equality if and only if Y ? is sufficient with respect to the pair of distributions p1? and p2? , Y ? and X ? being the normalized versions of Y and X , respectively. Proof Let gi (y) be the measure associated with Y . Then gi (y) = x:T (x)=y pi (x). Setting a = x p1 (x), b = y g1 (y) and in view of Lemma 1, the following inequalities are equivalent: KL D KL X ( p1 , p2 ) DY (g1 , g2 ) ? ? KL ? ? ? a[I XKL ? ( p1 , p2 )+ln c] b[IY ? (g1 , g2 )+ln d] where pi? (x) = pi (x)/ x pi (x), gi? (y) = gi (y)/ y gi (y), c = x p1 (x)/ x p2 (x), d = y g1 (y)/ y g2 (y), and X ? , Y ? are the random variables derived by probabilitizing the values of X, Y . As x pi (x) = y gi (y), i = 1, 2, and thus a = b and c = d, the last inequality is equivalent to ? ? KL ? ? I XKL ? ( p1 , p2 )IY ? (g1 , g2 ) which is true as X ? and Y ? are random variables and Y ? is a measurable transformation of X ? . Equality holds if and only if the statistic Y ? = T (X ? ) is sufficient (cf. [2, pp. 
11–12], [6], [12, p. 21]).

Proposition 5
D^KL(p, q) ≥ I^KL(p*, q*) when one of the following conditions holds: (i) Σ_{i=1}^n p_i = Σ_{i=1}^n q_i ≥ 1; (ii) Σ_{i=1}^n p_i > Σ_{i=1}^n q_i and Σ_{i=1}^n p_i ≥ 1. The inequality is reversed when (iii) Σ_{i=1}^n p_i < Σ_{i=1}^n q_i and Σ_{i=1}^n p_i < 1.

Proof
It follows easily from Lemma 1 by simple algebraic arguments.

One basic property of measures of information and divergence is the limiting property. This property asserts that a sequence {X_n} of random variables converges to a random variable X in distribution as n → ∞ if and only if I_{X_n} → I_X, where I denotes an information measure. Under some conditions the limiting property holds for several measures of information, including the Kullback–Leibler divergence; see [12, 13]. In the next proposition we investigate whether the limiting property holds when the Kullback–Leibler divergence has nonprobability vectors in its arguments.

Proposition 6 (The limiting property)
Let {p_n} be a sequence of nonprobability vectors, bounded from above. Then p_n → p if and only if D^KL(p_n, p) → 0.

Proof
Let p_n → p. Using Lemma 1 we have

lim_{n→∞} D^KL(p_n, p) = lim_{n→∞} ( Σ_i p_n(i) ) [ lim_{n→∞} I^KL(p*_n, p*) + lim_{n→∞} ln( Σ_i p_n(i) / Σ_i p(i) ) ] = 0

because lim_{n→∞} I^KL(p*_n, p*) = 0 and lim_{n→∞} ln( Σ_i p_n(i) / Σ_i p(i) ) = 0.

On the other hand, let D^KL(p_n, p) → 0. Then

lim_{n→∞} Σ_i p(i) φ( p_n(i)/p(i) ) = 0

where φ(x) = x ln x, x > 0, is a continuous function with φ(1) = 0. Suppose that p_n → p does not hold. Then there is a subsequence n_1 < n_2 < ... < n_s < ... of integers and a vector q such that

lim_{s→∞} p_{n_s} = q and p ≠ q    (4)

Because φ is continuous we have

lim_{s→∞} Σ_i p(i) φ( p_{n_s}(i)/p(i) ) = Σ_i p(i) φ( q(i)/p(i) )

However, Σ_i p(i) φ( p_{n_s}(i)/p(i) ) is a subsequence of Σ_i p(i) φ( p_n(i)/p(i) ), which converges to φ(1) = 0. Thus

Σ_i p(i) φ( q(i)/p(i) ) = φ(1) = 0
which is possible only if p(i) = q(i), contradicting Equation (4). Thus p_n → p, so the limiting property holds for the Kullback–Leibler directed divergence.

As expected, the Kullback–Leibler directed divergence D^KL(p, q) with nonprobability vectors p and q does not in general share the properties of the Kullback–Leibler directed divergence with probability vectors p* and q*. Under certain conditions, some of them are satisfied. More precisely, D^KL(p, q) is nonnegative, additive, invariant under sufficient transformations and greater than I^KL(p*, q*); it satisfies the property of maximal information and the limiting property. So, in general terms, it can be regarded as a measure of divergence and therefore can be used whenever we do not have probability vectors, provided that Σ_i p_i = Σ_i q_i.

3. POWER DIRECTED DIVERGENCE WITHOUT PROBABILITY VECTORS

Cressie and Read introduced in [14] the power divergence between two probability vectors p* and q* for goodness-of-fit purposes. It is defined by

I^CR_λ(p*, q*) = (1 / [λ(λ+1)]) Σ_{i=1}^n p*_i [ (p*_i / q*_i)^λ − 1 ]

where λ is a real-valued parameter. The values at λ = 0, −1 are defined by continuity. For λ → 0 we have I^CR_λ(p*, q*) → Σ_{i=1}^n p*_i ln(p*_i/q*_i), which is the Kullback–Leibler directed divergence, while for λ → −1 we have I^CR_λ(p*, q*) → I^KL(q*, p*). The power divergence has the properties of other measures of divergence such as nonnegativity, continuity, nonadditivity and strong nonadditivity. We note that I^CR_λ(p*, q*) is a directed divergence [14]. We also have the family of power divergence statistics, 2n I^CR_λ(p̂, p_0), primarily used for goodness-of-fit purposes [14]. Here p̂ = X/n is the sample proportion, X is multinomial M(n, p) and p_0 is the probability model of interest. Members of this family are (i) the chi-squared statistic X^2 for λ = 1, (ii) the G^2 statistic for λ → 0, (iii) the modified likelihood ratio statistic for λ →
−1, (iv) the Freeman–Tukey statistic F^2 for λ = −1/2 and (v) the Neyman-modified X^2 statistic for λ = −2. As an alternative to the X^2 and G^2 statistics, Cressie and Read proposed in [14] the power divergence statistic with λ = 2/3, which lies between them.

Let us now see what happens in the case where we do not have probability vectors. We first define the power divergence of order λ for nonprobability vectors.

Definition 4
We define as

D^CR_λ(p, q) = (1 / [λ(λ+1)]) Σ_i p_i [ (p_i / q_i)^λ − 1 ], λ ∈ R    (5)

the Cressie–Read power divergence of order λ between two nonprobability vectors p = (p_1, ..., p_n)^T > 0 and q = (q_1, ..., q_n)^T > 0, where Σ_i p_i ≠ 1 and Σ_i q_i ≠ 1.

In this section we shall examine the information theoretic and divergence properties of this measure. We shall also assume that λ ≠ 0 and λ ≠ −1.

Lemma 2
For the Cressie–Read directed divergence involving nonprobability vectors p, q, it holds that

D^CR_λ(p, q) = ( Σ_i p_i ) k^λ [ I^CR_λ(p*, q*) − (1 − k^λ) / (λ(λ+1) k^λ) ]

where I^CR_λ(p*, q*) is the Cressie–Read directed divergence between the two probability vectors p* and q* defined in Lemma 1 and k = Σ_i p_i / Σ_i q_i.

Proof
Switching from (p, q) to (p*, q*) we have

D^CR_λ(p, q) = (1 / [λ(λ+1)]) Σ_i p_i [ ( k p*_i / q*_i )^λ − 1 ]
  = ( Σ_i p_i ) { k^λ (1 / [λ(λ+1)]) Σ_i p*_i [ (p*_i / q*_i)^λ − 1 ] + (k^λ − 1) / (λ(λ+1)) }
  = ( Σ_i p_i ) k^λ [ I^CR_λ(p*, q*) − (1 − k^λ) / (λ(λ+1) k^λ) ]

Proposition 7 (The nonnegativity property)
Let

m = (1 − k^λ) / (λ(λ+1) k^λ)

Then D^CR_λ(p, q) ≥ 0 if one of the following conditions holds:
(i) Σ_i p_i = Σ_i q_i;
(ii) Σ_i p_i > Σ_i q_i and λ ∉ (−1, 0);
(iii) Σ_i p_i < Σ_i q_i and λ ∈ (−1, 0);
(iv) m < I^CR_λ(p*, q*).

As for equality we have the following: (a) if Σ_i p_i = Σ_i q_i, equality holds if and only if p = q; (b) if Σ_i p_i > Σ_i q_i or Σ_i p_i < Σ_i q_i, equality holds if m = I^CR_λ(p*, q*). In summary, if Σ_i p_i = Σ_i q_i then D^CR_λ(p, q) ≥ 0 with equality if and only if p = q.
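The decomposition in Lemma 2 is easy to verify numerically; a minimal sketch (the function names and the test vectors are ours):

```python
def d_cr(p, q, lam):
    """Cressie-Read power divergence of order lam between positive vectors
    (lam != 0, -1); coincides with the probability-vector version when the
    components sum to one."""
    c = 1.0 / (lam * (lam + 1.0))
    return c * sum(pi * ((pi / qi) ** lam - 1.0) for pi, qi in zip(p, q))

def d_cr_lemma2(p, q, lam):
    """The same quantity via Lemma 2:
    (sum p) * k^lam * [ I_CR(p*, q*) - (1 - k^lam) / (lam*(lam+1)*k^lam) ]."""
    sp, sq = sum(p), sum(q)
    k = sp / sq
    icr = d_cr([pi / sp for pi in p], [qi / sq for qi in q], lam)
    m = (1.0 - k ** lam) / (lam * (lam + 1.0) * k ** lam)
    return sp * k ** lam * (icr - m)

p = [0.3, 0.5, 0.9]
q = [0.4, 0.4, 0.8]
for lam in (-0.4, 0.5, 2.0 / 3.0, 1.2):
    assert abs(d_cr(p, q, lam) - d_cr_lemma2(p, q, lam)) < 1e-10
```

As with the Kullback–Leibler case, D^CR_λ(p, p) = 0 for any admissible λ, while D^CR_λ(p, q) = 0 alone does not force p = q unless the component sums agree.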
Proof
By Lemma 2 we know that D^CR_λ(p, q) = ( Σ_i p_i ) k^λ [ I^CR_λ(p*, q*) − m ]. We have the following:

If Σ_i p_i = Σ_i q_i, then k = 1 and, independently of the value of λ, m = 0; consequently D^CR_λ(p, q) = ( Σ_i p_i ) I^CR_λ(p*, q*) ≥ 0.

If Σ_i p_i > Σ_i q_i, then k > 1 and m < 0 if λ ∉ (−1, 0); in this case D^CR_λ(p, q) ≥ 0. More generally, whenever m ≤ I^CR_λ(p*, q*) we have D^CR_λ(p, q) ≥ 0.

If Σ_i p_i < Σ_i q_i, then k < 1 and m < 0 if λ ∈ (−1, 0). So, in this case we have the same conclusion as above.

Summarizing, conditions (i)–(iv) of the proposition imply D^CR_λ(p, q) ≥ 0. It is easy to see that when Σ_i p_i = Σ_i q_i equality holds if p = q. Also, when Σ_i p_i > Σ_i q_i or Σ_i p_i < Σ_i q_i, equality holds if m = I^CR_λ(p*, q*).

Note that D^CR_λ(p, q) = 0 does not necessarily imply p = q unless Σ_i p_i = Σ_i q_i. Again we have that the minimal requirement for using D^CR_λ(p, q) as a measure of divergence is Σ_i p_i = Σ_i q_i, regardless of the value of λ.

Proposition 8
D^CR_λ(p, q) ≥ I^CR_λ(p*, q*) when one of the following conditions holds: (i) Σ_i p_i = Σ_i q_i; (ii) Σ_i p_i > Σ_i q_i and λ ∉ (−1, 0); (iii) Σ_i p_i < Σ_i q_i and λ ∈ (−1, 0). Equality holds if m = I^CR_λ(p*, q*), independently of the value of λ, where m is as in Proposition 7.

Proof
By Lemma 2 we know that D^CR_λ(p, q) = ( Σ_i p_i ) k^λ [ I^CR_λ(p*, q*) − m ]. We have the following three situations:

If Σ_i p_i = Σ_i q_i, then k = 1 and, independently of the value of λ, m = 0; consequently D^CR_λ(p, q) ≥ I^CR_λ(p*, q*).

If Σ_i p_i > Σ_i q_i (or equivalently k > 1) and m < 0 (when λ ∉ (−1, 0)), D^CR_λ(p, q) ≥ I^CR_λ(p*, q*) always holds. In the case that Σ_i p_i > Σ_i q_i and m > 0 (when λ ∈ (−1, 0)), if m > I^CR_λ(p*, q*) then D^CR_λ(p, q) < 0, which is impossible, while if m < I^CR_λ(p*, q*) then D^CR_λ(p, q) < I^CR_λ(p*, q*).

If Σ_i p_i < Σ_i q_i (or equivalently k < 1) and m < 0 (when λ ∈ (−1, 0)), D^CR_λ(p, q) ≥ I^CR_λ(p*, q*) always holds.
Summarizing, we have that (i)–(iii) imply D^CR_λ(p, q) ≥ I^CR_λ(p*, q*).

Definition 5 (Bivariate divergence)
In the framework of Definition 2, we define the Cressie–Read directed divergence between two bivariate nonprobability functions p_1, p_2 as

D^CR_{X,Y}(p_1, p_2) = (1 / [λ(λ+1)]) Σ_x Σ_y p_1(x, y) [ ( p_1(x, y) / p_2(x, y) )^λ − 1 ]

Definition 6 (Conditional divergence)
In the framework of Definition 3, we set

D^CR_{Y|X=x}(h_1, h_2) = (1 / [λ(λ+1)]) Σ_y h_1(y|x) [ ( h_1(y|x) / h_2(y|x) )^λ − 1 ]

and

D^CR_{Y|X}(h_1, h_2) = E_X[ D^CR_{Y|X=x}(h_1, h_2) ] = (1 / [λ(λ+1)]) Σ_x f_1(x) Σ_y h_1(y|x) [ ( h_1(y|x) / h_2(y|x) )^λ − 1 ]

for the variable X. The conditional divergence D^CR_{X|Y}(r_1, r_2) is defined in an analogous way.

Strong additivity is not satisfied for the power divergence even with probability vectors, as one can easily see with the following numerical example involving two trinomial distributions. If (X*, Y*, Z*) is trinomial M(n, p_i1, p_i2, p_i3), p_i1 + p_i2 + p_i3 = 1, i = 1, 2, then using standard results, some algebra and obvious notation, we have

I^CR_{X*,Y*}(p*_1, p*_2) = (1 / [λ(λ+1)]) Σ_{x,y,z} C(n; x, y, z) p_11^x p_12^y p_13^z [ ( p_11^x p_12^y p_13^z / (p_21^x p_22^y p_23^z) )^λ − 1 ]

I^CR_{X*}(f*_1, f*_2) = (1 / [λ(λ+1)]) Σ_x C(n, x) p_11^x q_11^{n−x} [ ( p_11^x q_11^{n−x} / (p_21^x q_21^{n−x}) )^λ − 1 ], q_i1 = 1 − p_i1

and

I^CR_{Y*|X*}(h*_1, h*_2) = (1 / [λ(λ+1)]) Σ_{x,y,z} C(n; x, y, z) p_11^x p_12^y p_13^z [ ( p_12^y p_13^z q_21^{n−x} / (p_22^y p_23^z q_11^{n−x}) )^λ − 1 ]

where C(n; x, y, z) is the trinomial coefficient and x + y + z = n. For n = 5, p_11 = 0.2, p_12 = 0.2, p_13 = 0.6, p_21 = 0.3, p_22 = 0.4, p_23 = 0.3 and λ = 1.2, we obtain

I^CR_{X*}(f*_1, f*_2) + I^CR_{Y*|X*}(h*_1, h*_2) = 0.133 + 2.037 = 2.17 < I^CR_{X*,Y*}(p*_1, p*_2) = 3.451

For n = 5, the same p_ij's and λ = −0.4, we obtain
I^CR_{X*}(f*_1, f*_2) + I^CR_{Y*|X*}(h*_1, h*_2) = 0.132 + 0.804 = 0.936 > I^CR_{X*,Y*}(p*_1, p*_2) = 0.877

A further numerical investigation revealed that when λ > 0 the subadditivity property holds, while when λ < 0 the superadditivity property holds. Equality holds only when λ = 0, which is the case of the Kullback–Leibler divergence. No convenient expression was obtained in the case of nonprobability vectors. For weak additivity, we have the following proposition.

Proposition 9 (Weak additivity)
If h_i(y|x) = g_i(y), and thus p_i(x, y) = f_i(x) g_i(y), i = 1, 2, so that the random variables X* and Y*, which are the "standardized" versions of X, Y, are independent, then

(a) D^CR_{X,Y}(p_1, p_2) = D^CR_X(f_1, f_2) + D^CR_Y(g_1, g_2) + p_1·· β^λ λ(λ+1) I^CR_{X*}(f*_1, f*_2) I^CR_{Y*}(g*_1, g*_2) + p_1·· (1 − β^λ) / (λ(λ+1))

where p_i·· = Σ_x Σ_y p_i(x, y), i = 1, 2, and β = p_1·· / p_2··;

(b) D^CR_{X,Y}(p_1, p_2) = D^CR_X(f_1, f_2) + D^CR_Y(g_1, g_2) if β = 1 and one of the marginal pairs (f*_1, f*_2), (g*_1, g*_2) is identical.

Proof
(a) We have already seen in Proposition 3 that the random variables X*, Y* are independent. We know that (cf. [3])

I^CR_{X*,Y*}(p*_1, p*_2) = I^CR_{X*}(f*_1, f*_2) + I^CR_{Y*}(g*_1, g*_2) + λ(λ+1) I^CR_{X*}(f*_1, f*_2) I^CR_{Y*}(g*_1, g*_2)

Then using Lemma 2 we have

D^CR_{X,Y}(p_1, p_2) = p_1·· β^λ [ I^CR_{X*,Y*}(p*_1, p*_2) − (1 − β^λ) / (λ(λ+1) β^λ) ]
  = p_1·· β^λ [ I^CR_{X*}(f*_1, f*_2) + I^CR_{Y*}(g*_1, g*_2) + λ(λ+1) I^CR_{X*}(f*_1, f*_2) I^CR_{Y*}(g*_1, g*_2) − (1 − β^λ) / (λ(λ+1) β^λ) ]
  = D^CR_X(f_1, f_2) + D^CR_Y(g_1, g_2) + p_1·· β^λ λ(λ+1) I^CR_{X*}(f*_1, f*_2) I^CR_{Y*}(g*_1, g*_2) + p_1·· (1 − β^λ) / (λ(λ+1))

(b) If p_1·· = p_2··, then β = 1, and regardless of the value of λ the last term of the above equation equals 0. Moreover, if f*_1 = f*_2 or g*_1 = g*_2, then I^CR_{X*}(f*_1, f*_2) I^CR_{Y*}(g*_1, g*_2) = 0.
Thus D^CR_{X,Y}(p_1, p_2) = D^CR_X(f_1, f_2) + D^CR_Y(g_1, g_2).

Proposition 10 (Maximal information and sufficiency)
Let Y = T(X) be a measurable transformation of X. Then

D^CR_X(p_1, p_2) ≥ D^CR_Y(g_1, g_2)

when c > 1, where c = ( Σ_x p_1(x) / Σ_x p_2(x) )^λ, with equality if and only if Y* is "sufficient" as explained in Proposition 4, where p_i = p_i(x), g_i = g_i(y), i = 1, 2.

Proof
Let g_i(y) be the measure associated with Y. Then g_i(y) = Σ_{x: T(x)=y} p_i(x). The following inequalities are equivalent:

D^CR_X(p_1, p_2) ≥ D^CR_Y(g_1, g_2)
⇔ ( Σ_x p_1(x) ) c [ I^CR_{X*}(p*_1, p*_2) − k ] ≥ ( Σ_y g_1(y) ) d [ I^CR_{Y*}(g*_1, g*_2) − l ]

where

c = ( Σ_x p_1(x) / Σ_x p_2(x) )^λ, d = ( Σ_y g_1(y) / Σ_y g_2(y) )^λ,
k = (1 − c) / (λ(λ+1) c) and l = (1 − d) / (λ(λ+1) d)

As Σ_x p_i(x) = Σ_y g_i(y), i = 1, 2, and thus c = d and k = l, the last inequality is equivalent to

I^CR_{X*}(p*_1, p*_2) ≥ I^CR_{Y*}(g*_1, g*_2)

which holds whenever c > 1. Equality holds if and only if the statistic Y* = T(X*) is sufficient ([2, pp. 11–12]).

Zografos et al. proved in [13] that the limiting property holds for Csiszar's measure of divergence (φ-divergence), defined as

I^C_φ(f_1, f_2) = ∫ f_2(x) φ( f_1(x) / f_2(x) ) dx

where φ is a real-valued convex function satisfying certain conditions. The Cressie–Read divergence can be obtained from Csiszar's measure by taking φ(x) = [λ(λ+1)]^{-1} (x^{λ+1} − x) in the discrete version of the measure [2]. So the limiting property holds for the Cressie–Read divergence as well. In the next proposition we investigate whether the limiting property holds in the case where we do not have probability vectors.

Proposition 11 (The limiting property)
Let {p_n} be a sequence of nonprobability vectors. Then p_n → p if and only if D^CR_λ(p_n, p) → 0.

Proof
Using Lemma 2 we have that
lim_{n→∞} D^CR_λ(p_n, p) = lim_{n→∞} ( Σ_i p_n(i) ) lim_{n→∞} k^λ [ lim_{n→∞} I^CR_λ(p*_n, p*) − (1 − k^λ) / (λ(λ+1) k^λ) ] = 0

because lim_{n→∞} I^CR_λ(p*_n, p*) = 0 and lim_{n→∞} k = 1.

On the other hand, let D^CR_λ(p_n, p) → 0. Then, ignoring the constant 1/[λ(λ+1)],

lim_{n→∞} Σ_i p_n(i) [ ( p_n(i)/p(i) )^λ − 1 ] = 0, or lim_{n→∞} Σ_i p(i) φ( p_n(i)/p(i) ) = 0

where φ(x) = x^{λ+1} − x, x > 0, λ ≠ 0, −1, is a continuous function with φ(1) = 0. Repeating the argument in the second part of the proof of Proposition 6, we obtain p_n → p. Thus the limiting property holds.

We have already seen that the power directed divergence D^CR_λ(p, q), under some conditions, is nonnegative, additive, greater than I^CR_λ(p*, q*) and invariant under sufficient transformations. It also shares the property of maximal information and the basic limiting property. So, we can regard D^CR_λ(p, q) as a measure of divergence, provided that Σ_i p_i = Σ_i q_i.

4. AN ACTUARIAL APPLICATION OF DIVERGENCES INVOLVING NONPROBABILITY VECTORS

In this section we describe how measures of divergence can be used to smooth raw mortality rates. We first present some basic notions of actuarial graduation, while in the sequel we provide a numerical illustration.

4.1. Graduation of mortality rates

In order to describe the actual but unknown mortality pattern of a population, the actuary calculates from raw data crude mortality rates, death probabilities or forces of mortality, which usually form an irregular series. Because of this, it is common to revise the initial estimates with the aim of producing smoother estimates, using a procedure called graduation. There are several methods of graduation, classified into parametric curve fitting and nonparametric smoothing methods. For more details on the topic, the interested reader is referred to [15–24] and the references therein. A method of graduation using information theoretic ideas was first introduced by Brockett and Zhang in [25].
More specifically, they tried in [26] to construct a smooth series of n death probabilities {v_{x_i}} at age x_i, i = 1, 2, ..., n, which is as close as possible to the observed series {u_{x_i}}, under the assumption that the true but unknown underlying mortality pattern is (i) smooth, (ii) increasing with age x, that is, monotone, and (iii) more steeply increasing at higher ages, that is, convex. They also assumed that (iv) the total number of deaths in the graduated data equals the total number of deaths in the observed data, and (v) the total of graduated ages at death equals the total of observed ages at death. By total age at death we mean the sum over ages of the number of deaths at each age multiplied by the corresponding age. The last two constraints imply that the average age at death is required to be the same for the observed and the graduated mortality data. In the sequel, when we use x = 1, 2, ..., n, we shall mean the corresponding ages x_1, x_2, ..., x_n. Mathematically, the five constraints are written as follows:

(i) Σ_x (Δ^3 v_x)^2 ≤ M, where M is a predetermined positive constant and Δ^3 v_x = −v_x + 3v_{x+1} − 3v_{x+2} + v_{x+3};
(ii) Δv_x ≥ 0, where Δv_x = v_{x+1} − v_x;
(iii) Δ^2 v_x ≥ 0, where Δ^2 v_x = v_x − 2v_{x+1} + v_{x+2};
(iv) Σ_x l_x v_x = Σ_x l_x u_x, where l_x is the number of people at risk at age x; and
(v) Σ_x x l_x v_x = Σ_x x l_x u_x.

To obtain the graduated values, Zhang and Brockett in [26] minimize the Kullback–Leibler divergence

D^KL(v, u) = Σ_x v_x ln( v_x / u_x )

between the crude death probabilities u = (u_1, ..., u_n)^T and the new death probabilities v = (v_1, ..., v_n)^T, subject to the constraints (i)–(v), by considering a dual minimization problem. However, the mortality rates (death probabilities) u and v are not probability vectors, as we have Σ_{x=1}^n u_x > 1 and Σ_{x=1}^n v_x > 1.
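Constraints (i)–(v) translate directly into code; the following sketch checks them for a candidate graduated series (the helper functions, the tolerance and the data are ours and purely illustrative):

```python
def diffs(v):
    """Forward differences Delta v_x = v_{x+1} - v_x."""
    return [v[i + 1] - v[i] for i in range(len(v) - 1)]

def satisfies_constraints(v, u, l, M, tol=1e-9):
    """Check constraints (i)-(v) for graduated rates v against crude rates u,
    with exposures l (number at risk) and smoothness bound M."""
    d1 = diffs(v)          # Delta v_x
    d2 = diffs(d1)         # Delta^2 v_x
    d3 = diffs(d2)         # Delta^3 v_x
    smooth   = sum(d * d for d in d3) <= M                 # (i)
    monotone = all(d >= 0 for d in d1)                     # (ii)
    convex   = all(d >= 0 for d in d2)                     # (iii)
    deaths   = abs(sum(lx * vx for lx, vx in zip(l, v))
                   - sum(lx * ux for lx, ux in zip(l, u))) < tol          # (iv)
    ages     = abs(sum(x * lx * vx for x, (lx, vx) in enumerate(zip(l, v), 1))
                   - sum(x * lx * ux for x, (lx, ux) in enumerate(zip(l, u), 1))) < tol  # (v)
    return smooth and monotone and convex and deaths and ages

# A smooth, increasing, convex series trivially satisfies (i)-(v) against itself.
v = [0.01, 0.02, 0.04, 0.07, 0.11]
print(satisfies_constraints(v, v, [100] * 5, M=1e-6))   # True
```

Note that repeated forward differencing yields Δ^3 v_x = v_{x+3} − 3v_{x+2} + 3v_{x+1} − v_x, which is the same quantity as in constraint (i).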
Brockett in [10, p. 104] states that "D^KL(v, u) = Σ_{x=1}^n v_x ln(v_x/u_x) is still a measure of fit even in the nonprobability situation because the mortality rates are nonnegative and because of the assumed constraints". In view of the discussion and results in Section 2, the appropriate constraint to use here is

(vi) Σ_{x=1}^n v_x = Σ_{x=1}^n u_x

and not conditions (iv) and (v) of [26]. It is easy to see via a counter-example that conditions (iv) and (v) do not imply (vi). It may be necessary, however, to use them on actuarial grounds.

A new and unifying way to obtain the graduated values v_x is to minimize the Cressie–Read power divergence

D^CR_λ(v, u) = [1/(λ(λ+1))] Σ_x v_x[(v_x/u_x)^λ − 1]

between the death probabilities u and v for given λ, subject to constraints (i)–(v) and/or (vi), that is, v ≥ 0 and g_i(v) = ½ v^T D_i v + b_i^T v + c_i ≤ 0, i = 1, 2, ..., r+1, where for each i, D_i is a positive-semidefinite matrix and b_i, c_i are constants. It is easy to see that the constraints (i)–(v) may be written in the form of g_i(v). For more details see [26]. We note that in this case we have r = 2(n+1) constraints, where n is the number of ungraduated values. The minimization is done for various values of the parameter λ, and in this way we can interpret the resulting series of graduated values as the series that satisfies the constraints and is least distinguishable, in the sense of the Cressie–Read directed divergence, from the series of crude values {u_x}. It is obvious that if we choose λ = 0, we perform graduation through the Kullback–Leibler directed divergence that Zhang and Brockett [26] described.

In view of the work of Csiszar [11] mentioned in Section 1, one could also consider the extended Cressie and Read power divergence for the positive vectors u and v,

I^CR_ext,λ(v, u) = [1/(λ(λ+1))] Σ_x [v_x((v_x/u_x)^λ − 1) − v_x + u_x], λ ∈ R.

The values at λ = 0, −1 are defined by continuity. For λ → 0, the above-mentioned measure reduces to the extended Kullback–Leibler directed divergence I^KL_ext(v, u). For λ → −1 it becomes I^KL_ext(u, v).
For these values of λ, it is known that the full nonnegativity property is satisfied. Setting g_x = v_x/u_x we obtain

I^CR_ext,λ(v, u) = [1/(λ(λ+1))] Σ_x u_x (g_x^(λ+1) − 2g_x + 1).

For λ = 1, I^CR_ext,1(v, u) = ½ Σ_x u_x (g_x − 1)² ≥ 0 with equality if and only if v = u. For other values of λ, an investigation of the minimum of the function h(y) = y^(λ+1) − 2y + 1, y > 0, revealed that I^CR_ext,λ can be either strictly positive or strictly negative and thus it does not satisfy the full nonnegativity property. Real minima of I^CR_ext,λ occur only for λ > 0 and, for example, for λ = 2 we have that I^CR_ext,2(v, u) = 0 if and only if v = u or v = ((√5 − 1)/2)u. Again, if v and u are vectors with positive components summing to 1, then the above-mentioned divergence reduces to the standard Cressie and Read power divergence. Additionally, if Σ_x v_x = Σ_x u_x, the extended measure is identical to the Cressie and Read directed divergence of order λ between nonprobability vectors defined in Definition 4. Csiszar's extended definition essentially incorporates constraint (vi) into the measure. It is now obvious that in the information theoretic graduation problem, we must either incorporate constraint (vi) or use I^CR_ext,λ(v, u) with λ = 0, 1 or −1.

Finally, for reasons of completeness, we state the Whittaker–Henderson (WH) method of graduating mortality rates, a method widely used in actuarial practice which is a precedent of the spline smoothing method (see [15, 18, 20, 24]): minimize

F + hS,

where F = Σ_x w_x (u_x − v_x)² is a measure of fit, w_x = l_x/(u_x(1 − u_x)) are weights with l_x being the number of people at risk at age x, S = Σ_{x=1}^{n−z} (Δ^z v_x)² is a measure of smoothness of the graduated values, Δ is the difference operator, and z is generally taken as 2, 3 or 4. Here we shall take z = 3.
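Since the WH criterion is quadratic in v, its minimizer solves a linear system. The following minimal sketch (our illustration, with made-up rates and unit weights, not the paper's data) implements the method for z = 3:

```python
import numpy as np

def wh_graduate(u, w, h, z=3):
    # Whittaker-Henderson graduation: minimize
    #   sum w_x (u_x - v_x)^2 + h * sum (Delta^z v_x)^2
    # by solving the normal equations (W + h K^T K) v = W u,
    # where K is the z-th order difference matrix.
    n = len(u)
    W = np.diag(w)
    K = np.diff(np.eye(n), n=z, axis=0)  # (n - z) x n matrix of z-th differences
    return np.linalg.solve(W + h * K.T @ K, W @ u)

# Toy example with made-up crude rates and equal weights (illustrative only).
u = np.array([0.020, 0.026, 0.030, 0.036, 0.035, 0.055, 0.049, 0.059])
w = np.ones_like(u)
v = wh_graduate(u, w, h=5.0)
# As h -> 0 the criterion is dominated by fit, and v returns the crude rates.
```

Because v minimizes F + hS and F(u) = 0, the graduated series always satisfies S(v) ≤ S(u), so the solution is never rougher than the data.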
Other choices of S involve exponential-type smoothing, with S = Σ_{x=1}^{n−3} (Δ³v_x − (c−1)Δ²v_x)², where c is a constant appearing, for example, in Makeham's law ([20, p. 63]). The underlying assumption is that the numbers of deaths d_x follow a binomial distribution B(l_x, v_x) for each x. The parameter h is usually taken equal to the average of the weights w_x. S and F are the two basic elements of actuarial graduation. The smaller the value of S, the better for graduation, but S and F are in competition.

In the following section we provide a numerical illustration, in order to see how the information theoretic graduation methods perform, to compare them with the WH method and to try to find the best value of λ. In addition, the role of I^CR_ext,λ(v, u) is examined.

4.2. Numerical investigation

For the illustration we use three different data sets of death probabilities. The first comes from [20, p. 162] and will be denoted by L85. The second comes from the Actuarial Society of Hong Kong [27], refers to males insured for more than two years, and will be denoted by HK01M. The third also comes from the same Society, refers to females insured for more than two years, and will be denoted by HK01F. These data sets are of different sizes. Specifically, the L85 data set consists of 20 death probabilities belonging to ages 75–94 (computed from a total of 79 880 observations). From HK01M we have used 16 death probabilities for ages 70–85 (computed from a total of 13 678 observations), while from HK01F we have taken 20 death probabilities for ages 70–89 (computed from a total of 18 341 observations).

We have performed several graduations for each data set, using different values of the parameter λ and the constraints of smoothness, monotonicity, convexity, the two actuarial constraints and constraint (vi) of Section 4.1. Among them are the values of λ =
1, 0, −1, −1/2 and −2, which give the χ² statistic, the Kullback–Leibler divergence, the modified likelihood ratio statistic, the Freeman–Tukey statistic F² and the Neyman-modified χ², respectively. We have also used the value λ = 2/3 that Cressie and Read proposed in [14]. The value of M in the first constraint is different for each data set, and was computed for each data set through graduation by the WH method with h equal to the average of the weights w_x. This was done in order to compare our results with those obtained through the WH method. The values of h and M are as follows: h = 80 786.8 and M = 0.00004 for the L85 data set, h = 29 354.6 and M = 0.00003 for the HK01M data set, and h = 54 547 and M = 0.000015 for the HK01F data set.

As expected, different choices of the parameter λ lead to different graduated values. In order to compare the several graduations for each data set we computed, after graduation, the values of S and F. Furthermore, the performance of our proposed methods (models) was assessed with the log-likelihood, deviance and χ² goodness-of-fit statistics evaluated at the graduated values. As we have assumed that d_x ~ B(l_x, v_x), the log-likelihood, excluding constants, is

log L(v) = Σ_{x=1}^n [d_x log v_x + (l_x − d_x) log(1 − v_x)].

Knowing the log-likelihood function we can calculate the deviance,

D(v) = 2 log L(u) − 2 log L(v).

Finally, we can measure the discrepancy between the observed and the expected deaths with the corresponding χ² statistic,

χ² = Σ_{x=1}^n (d_x − l_x v_x)² / [l_x v_x (1 − v_x)].

We want graduations with maximum log L(v) and minimum deviance and χ². The ungraduated and graduated values for the HK01M data set with the CR method, the five constraints and λ = 0, 2/3, 1, 2, along with those obtained through the WH method, are given in Table I.
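The three assessment statistics above are straightforward to compute. The sketch below (with hypothetical exposures and deaths, not the paper's data) implements them under the binomial assumption d_x ~ B(l_x, v_x); the crude rates u_x = d_x/l_x give the saturated fit:

```python
import numpy as np

def log_lik(v, d, l):
    # Binomial log-likelihood, constants dropped:
    # sum d_x log v_x + (l_x - d_x) log(1 - v_x)
    v = np.asarray(v, float)
    return float(np.sum(d * np.log(v) + (l - d) * np.log(1.0 - v)))

def deviance(v, u, d, l):
    # Deviance against the saturated (crude) rates u: 2 log L(u) - 2 log L(v)
    return 2.0 * log_lik(u, d, l) - 2.0 * log_lik(v, d, l)

def chi_square(v, d, l):
    # Pearson chi-square between observed and expected deaths
    v = np.asarray(v, float)
    e = l * v
    return float(np.sum((d - e) ** 2 / (e * (1.0 - v))))

# Hypothetical exposures, deaths and a candidate graduated series.
l = np.array([1000.0, 900.0, 800.0])
d = np.array([20.0, 25.0, 28.0])
u = d / l                           # crude rates: the saturated fit
v = np.array([0.021, 0.027, 0.034])
print(log_lik(v, d, l), deviance(v, u, d, l), chi_square(v, d, l))
```

Since u maximizes each per-age binomial likelihood, the deviance of any graduated series is nonnegative, which is why smaller deviance means better fidelity.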
In Table II we give the corresponding smoothness and goodness-of-fit results, along with the minimum value of the Cressie and Read divergence. From Table II we see that almost all the graduations with the Cressie–Read power divergence give the same value for the smoothness measure S, which is the value of M in the smoothness constraint (i), while the values of the goodness-of-fit measures F, deviance and χ² decrease as λ increases. We observe that in all the cases we obtain negative values of I^CR_λ(v, u), as Σ_x v_x < Σ_x u_x. Note that S cannot exceed its corresponding value of the WH method, due to constraint (i). The best choice for λ is 2. Similar results were also obtained for the other two data sets, with the difference that for the L85 data set the best choice of λ is 2/3 or 1. Comparing the results with those obtained by the WH method, we have equivalent results as far as smoothness is concerned, while we do not have good fidelity F. In terms of χ² and deviance, the winner is WH.

A further numerical investigation threw further light on the role, or choice, of λ. In Figure 1 we have plotted S versus λ for the three data sets. The dotted line in each plot denotes the value of M in the smoothness constraint (i). We can see that the three plots present the same pattern. When −∞ < λ < −1, S takes a value near the value of M. Then, when −1 < λ < −0.5, S takes a very small value, almost equal to zero, and for the remaining values of λ it again takes a value near the value of M. So, for values of λ between −1 and −0.5, the method oversmooths the data.

In Figure 2 we present the analogous plots concerning the measure of fit F. We can again see the same pattern for the three data sets. For values of λ smaller than −1, the measure of fit increases, up to its maximum value. This means that graduation is not acceptable as the graduated values depart too far from the crude values.
When λ takes a value almost equal to −1, F decreases, and it is stabilized for the remaining values of λ. Comparing graduations of the three data sets with the power divergence statistic for λ = 0 versus the graduation with the Kullback–Leibler divergence (λ → 0), we see that in terms of fidelity F we obtain almost the same results for values of λ > −1.

Table I. Graduations with: (a) the CR divergence and five constraints and (b) the WH method (w/o constraints) for the HK01M data set.

x    u_x      v_x (λ=0)  v_x (λ=2/3)  v_x (λ=1)  v_x (λ=2)  v_x (WH)
70   0.01923  0.01567    0.01583      0.01596    0.01618    0.01928
71   0.02563  0.02239    0.02248      0.02255    0.02268    0.02546
72   0.02992  0.02911    0.02913      0.02915    0.02917    0.03045
73   0.03585  0.03584    0.03578      0.03574    0.03567    0.03451
74   0.03899  0.04256    0.04243      0.04234    0.04217    0.03785
75   0.03523  0.04928    0.04908      0.04894    0.04867    0.04160
76   0.05543  0.05601    0.05573      0.05553    0.05516    0.04671
77   0.04939  0.06273    0.06239      0.06213    0.06166    0.05164
78   0.05906  0.06945    0.06904      0.06872    0.06816    0.05621
79   0.07503  0.07617    0.07569      0.07532    0.07466    0.06110
80   0.04848  0.08290    0.08234      0.08192    0.08115    0.06830
81   0.11692  0.08962    0.08899      0.08851    0.08765    0.08085
82   0.06816  0.09634    0.09564      0.09511    0.09483    0.10057
83   0.23598  0.10307    0.10365      0.10454    0.10674    0.12914
84   0.11659  0.10979    0.11477      0.11869    0.12536    0.16700
85   0.29152  0.12154    0.13346      0.14123    0.15274    0.21458

Table II. Smoothness and goodness-of-fit measures for graduations with: (a) the CR divergence and five constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                λ=0       λ=2/3     λ=1       λ=2       WH
S               0.000025  0.000025  0.000025  0.000025  0.000025
F               41.403    38.690    36.919    34.213    19.964
Deviance        36.37     34.33     32.95     30.81     18.29
log-likelihood  −2090.91  −2089.89  −2089.20  −2088.13  −2081.87
χ²              41.29     38.54     36.75     34.01     19.73
I^CR_λ(v, u)    −0.103    −0.023    −0.006    −0.019    −0.067*

*For the WH column, I^CR_λ(v, u) is evaluated at λ = 0.
However, as far as smoothness S is concerned, graduation of the L85 data set via the Kullback–Leibler divergence gives a very small value for S, which means that the method oversmooths the data. The same result is also obtained by minimizing the power divergence statistic with λ ∈ (−1, 0). For the HK01M data set, the minimization of the Kullback–Leibler divergence gives the same results as the minimization of the power divergence statistic with λ < −1 and λ > 0. Finally, for the HK01F data set, graduation through the Kullback–Leibler divergence oversmooths the data, something that also happens using the power divergence statistic with −1 < λ < −0.5. Our final conclusion is that the choice of λ = 2/3 suggested in [14] on grounds of statistical power is also a good choice for graduation.

Figure 1. Smoothness S versus λ (five constraints) (S × 10^-4): (a) L85; (b) HK01M; and (c) HK01F.

Next we graduated the same data sets using, apart from the constraints (i)–(v) of Section 4.1, constraint (vi) as well, which, as we saw, is the minimal requirement for a measure with nonprobability vectors to be a measure of divergence. The results for λ = 0, 2/3, 1, 2 and the WH method for the HK01M data set are given in Table III. From Table IV we see that almost all graduations through the Cressie–Read power divergence give the same value for the smoothness measure S, which is the value of M in the smoothness constraint (i), while the value of F increases as λ increases. We also see a dramatic improvement in deviance and χ² in comparison with Table II. Here we observe that the values of the I^CR_λ(v, u) measure are positive, and this is because of the use of constraint (vi).
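The sign change just noted has a simple explanation that can be checked numerically. In this sketch (our note, with made-up series), two vectors with a common total Σv = Σu are written as v = kp, u = kq with p, q probability vectors, so D^CR_λ(v, u) = k·D^CR_λ(p, q) ≥ 0:

```python
def d_cr(p, q, lam):
    # Cressie-Read directed divergence:
    # (1 / (lam (lam + 1))) * sum p_i [ (p_i / q_i)^lam - 1 ]
    s = sum(pi * ((pi / qi) ** lam - 1.0) for pi, qi in zip(p, q))
    return s / (lam * (lam + 1.0))

# Two made-up mortality-like series with the same total, as constraint (vi) requires.
v = [0.02, 0.05, 0.13]
u = [0.04, 0.06, 0.10]   # sum(v) == sum(u) == 0.20

# Scaling out the common total k reduces the computation to probability vectors,
# where the Cressie-Read divergence is known to be nonnegative.
k = sum(v)
p = [vi / k for vi in v]
q = [ui / k for ui in u]
for lam in (2.0 / 3.0, 1.0, 2.0):
    print(lam, d_cr(v, u, lam), k * d_cr(p, q, lam))
```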
As far as the best choice of λ is concerned, we have the same results as before. The results are now almost equivalent to those of the WH method as far as both smoothness and goodness of fit are concerned. As far as smoothness is concerned, each choice of λ gives almost the same value for the measure S. More specifically, we have S ≈ 4 × 10^-5 for the L85 data set, S ≈ 3 × 10^-5 for the HK01M data set and S ≈ 15 × 10^-6 for the HK01F data set. Comparing the results between graduations with five and six constraints, the use of the additional constraint (vi) improves the results, as now we do not have the oversmoothing effect.

In Figure 3 we present the plots concerning the measure of fit F. For the L85 data set we see that F ≈ 60 for λ < 0, while for positive values of λ, F increases. For the HK01M data set we have F ≈ 17 for all values of λ ≠ 2, while for λ = 2, F ≈ 19.5. Finally, for the HK01F data set we see that there are no major differences, as F ∈ (9.9, 12).

Figure 2. Goodness-of-fit F versus λ (five constraints): (a) L85; (b) HK01M; and (c) HK01F.

Comparing again the results between the graduations with five and six constraints, the additional constraint improves the results, as the graduated values present a much better fidelity. The graduation of the data sets using constraints (i), (ii), (iii) and (vi) gave results similar to those of Table IV, with a slight increase in the value of the goodness-of-fit measures. Finally, the results of graduation through the extended Cressie and Read power divergence I^CR_ext,λ(v, u) are given in Table V. The minimization was done subject to constraints (i)–(v). The values of S are equal or almost equal for all chosen values of λ in Tables II and IV.
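The behaviour of the extended measure described in Section 4.1 can also be verified numerically. The following sketch (ours, with made-up positive rates) evaluates I^CR_ext,λ and exhibits, for λ = 2, the spurious zero at v = ((√5 − 1)/2)u alongside the genuine zero at v = u:

```python
import math

PHI = (math.sqrt(5.0) - 1.0) / 2.0  # root of y**3 - 2*y + 1 = 0 other than y = 1

def i_ext_cr(v, u, lam):
    # Extended (Csiszar-type) Cressie-Read divergence:
    # (1 / (lam (lam + 1))) * sum [ v ((v/u)^lam - 1) - v + u ]
    s = sum(vi * ((vi / ui) ** lam - 1.0) - vi + ui for vi, ui in zip(v, u))
    return s / (lam * (lam + 1.0))

u = [0.02, 0.03, 0.05]           # made-up positive rates
v_same = list(u)                 # v = u
v_phi = [PHI * ui for ui in u]   # v = ((sqrt(5) - 1) / 2) u

# For lam = 1 the measure equals (1/2) sum u (v/u - 1)^2 >= 0, but for lam = 2
# it also vanishes at v_phi, so the full nonnegativity property fails.
print(i_ext_cr(v_same, u, 2.0), i_ext_cr(v_phi, u, 2.0))
```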
Comparing the goodness-of-fit with that given by the minimization of the Cressie and Read power divergence subject to the five constraints (Table II), we see a dramatically better fidelity. However, the fidelity provided by the Cressie and Read power divergence subject to the six constraints that we propose (Table IV) is better for every chosen value of the parameter λ. The extended Cressie and Read power divergence performs in a manner similar to the Cressie and Read power divergence with six constraints.

We then conducted a predictive analysis of our method. Assuming that the underlying pattern of mortality follows Makeham's model v_x = a + bc^x, where a, b, c are proper parameters and x is the age, we used a time-based training-test split. We split the data sets into two equal intervals, graduated the first interval with the CR methods, fitted Makeham's model and, using this model, compared the predicted values with the ungraduated values of the second interval. The resulting MSEs for the second interval appear in Table VI. Results for the Cressie and Read method with six constraints are comparable with the other cases. Compared with the MSE obtained through the WH method, which is equal to 0.00882, we have better performance of the Cressie and Read methods.
The better predictive performance is obtained by minimizing the extended Cressie and Read power divergence with five constraints, and in particular for λ = 0, that is, the extended Kullback–Leibler directed divergence. This is one of the three values of λ for which the extended measure satisfies the minimal requirement for a measure with nonprobability vectors to be a measure of divergence. Similar are the results of the predictive analysis for the L85 and the HK01F data sets. However, we have to note that the predictive analysis results depend on the selected parametric model, in our case Makeham's model.

Table III. Graduations with: (a) the CR divergence and six constraints and (b) the WH method (w/o constraints) for the HK01M data set.

x    u_x      v_x (λ=0)  v_x (λ=2/3)  v_x (λ=1)  v_x (λ=2)  v_x (WH)
70   0.01923  0.02025    0.02023      0.02022    0.02267    0.01928
71   0.02563  0.02484    0.02483      0.02483    0.02513    0.02546
72   0.02992  0.02944    0.02944      0.02944    0.02785    0.03045
73   0.03585  0.03404    0.03405      0.03406    0.03126    0.03451
74   0.03899  0.03863    0.03866      0.03867    0.03581    0.03785
75   0.03523  0.04324    0.04329      0.04331    0.04150    0.04160
76   0.05543  0.04797    0.04799      0.04800    0.04800    0.04671
77   0.04939  0.05270    0.05269      0.05269    0.05451    0.05164
78   0.05906  0.05761    0.05760      0.05759    0.06101    0.05621
79   0.07503  0.06377    0.06378      0.06378    0.06815    0.06110
80   0.04848  0.07338    0.07341      0.07342    0.07777    0.06830
81   0.11692  0.08960    0.08949      0.08947    0.09297    0.08085
82   0.06816  0.11474    0.11452      0.11447    0.11625    0.10057
83   0.23598  0.15117    0.15093      0.15087    0.15019    0.12914
84   0.11659  0.19989    0.19993      0.19995    0.19591    0.16700
85   0.29152  0.26231    0.26272      0.26281    0.25458    0.21458

Table IV. Smoothness and goodness-of-fit measures for graduations with: (a) the CR divergence and six constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                λ=0       λ=2/3     λ=1       λ=2       WH
S               0.000029  0.000029  0.00003   0.00003   0.000025
F               17.00     17.03     17.04     19.55     19.964
Deviance        17.11     17.13     17.13     19.83     18.29
Log-likelihood  −2081.28  −2081.29  −2081.30  −2082.64  −2081.87
χ²              16.77     16.80     16.80     19.33     19.73
I^CR_λ(v, u)    0.070     0.073     0.075     0.086     −0.067*

*For the WH column, I^CR_λ(v, u) is evaluated at λ = 0.
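The Makeham fitting step of the predictive analysis can be sketched as follows. This is our simplified illustration on synthetic data: since the model v_x = a + b c^x is linear in (a, b) for fixed c, a grid search over c with a linear least-squares solve at each point suffices; it is not necessarily the authors' fitting procedure.

```python
import numpy as np

def fit_makeham(x, v, c_grid=None):
    # Least-squares fit of Makeham's curve v_x = a + b * c**x by scanning
    # candidate values of c and solving for (a, b) linearly at each one.
    if c_grid is None:
        c_grid = np.linspace(1.01, 1.25, 241)
    best = None
    for c in c_grid:
        A = np.column_stack([np.ones_like(x), c ** x])
        coef, *_ = np.linalg.lstsq(A, v, rcond=None)
        sse = float(np.sum((A @ coef - v) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], c)
    return best[1], best[2], best[3]

# Synthetic sanity check: recover a known Makeham curve.
x = np.arange(70, 80, dtype=float)
a0, b0, c0 = 0.001, 1e-5, 1.10
v = a0 + b0 * c0 ** x
a, b, c = fit_makeham(x, v)
mse = float(np.mean((a + b * c ** x - v) ** 2))
```

In the paper's exercise the curve is fitted to the graduated first half of each data set and the fitted curve's predictions are then scored by MSE against the crude rates of the second half.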
A better approach would be the minimization of the Cressie–Read divergence D^CR_λ(f(x), u) over functions f, where v has been replaced by an unknown function f(x) = (f(x_1), f(x_2), ..., f(x_n))^T of the ages x, subject to integral constraints analogous to the previous constraints (i)–(vi). This is a calculus of variations problem and it is believed that its solution would be a spline function. We intend to explore this in our future research.

Figure 3. Goodness-of-fit F versus λ (six constraints): (a) L85; (b) HK01M; and (c) HK01F.

Table V. Smoothness and goodness-of-fit measures for graduations with: (a) the ext. CR divergence and five constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                  λ=0       λ=2/3     λ=1       λ=2       WH
S                 0.00003   0.00003   0.00003   0.00003   0.000025
F                 17.50     19.80     19.12     25.94     19.96
Deviance          17.46     20.85     18.68     24.27     18.29
Log-likelihood    −2081.46  −2083.16  −2082.07  −2084.86  −2081.87
χ²                17.50     19.80     19.12     25.94     19.73
I^CR_ext(v, u)    0.065     0.066     0.067     0.047     0.069*

*For the WH column, I^CR_ext(v, u) is evaluated at λ = 0.

5. CONCLUSIONS AND COMMENTS

After a theoretical evaluation of the Kullback–Leibler divergence D^KL(p, q) involving nonprobability vectors, we have concluded that this measure shares some of the properties of the Kullback–Leibler directed divergence with probability vectors. Under some conditions, D^KL(p, q) is nonnegative, additive and invariant under sufficient transformations. The property of maximal information and the limiting property are satisfied as well. Thus, we may regard D^KL(p, q) as a measure of information. A minimal requirement for D^KL(p, q) to be a measure of divergence is Σ_i p_i = Σ_i q_i.
Similar results were obtained for the Cressie–Read power divergence D^CR_λ(p, q) with nonprobability vectors.

As an application of the previous results, we explored the use of the general Cressie–Read power divergences in order to obtain graduated values. A numerical illustration, minimizing the power divergence for various values of λ with constraints (i)–(v) and/or (vi), gave results equivalent, in terms of smoothness, to those of other methods of graduation, such as the widely used WH method. The use of constraint (vi), Σ_x u_x = Σ_x v_x, is considered necessary, as this is the minimal requirement for a measure with nonprobability vectors to be a measure of divergence. The numerical results supported the use of this additional condition and showed considerable improvement in goodness of fit. However, we cannot say straightforwardly which value of the parameter λ is best for graduation. For graduations with constraints (i)–(v) we have the following: values of λ < −1 give unacceptable results as far as goodness of fit is concerned and as such should be avoided. As far as smoothness S is concerned, values of λ ∈ (−1, −0.5) oversmooth the data. For λ ≥ 0 the various divergences give similar results in terms of smoothness and fit; thus, the value λ = 2/3 suggested in [14] from statistical considerations is a good choice. This topic is under further investigation by the authors. For graduations with the additional constraint (vi), the results improve considerably.

Table VI. MSEs for graduations with: (a) the CR divergence (five constraints); (b) the CR divergence (six constraints); and (c) the ext. CR divergence (five constraints) for the HK01M data set.

                                 λ=0      λ=2/3    λ=1      λ=2
MSE (CR, five constraints)       0.00854  0.00854  0.00854  0.00854
MSE (CR, six constraints)        0.00868  0.00868  0.00868  0.00868
MSE (ext. CR, five constraints)  0.00553  0.00831  0.00854  0.00854

MSE(WH) = 0.00882.
The value of λ seems not to affect the smoothness S, while it has a slight effect on the goodness of fit.

In the light of the work of Csiszar [11], who extended the Kullback–Leibler directed divergence to nonnegative functions and vectors by adding the difference Σ_i q_i − Σ_i p_i to its expression, we applied the same adjustment to the Cressie–Read power divergence. The extended measure satisfies the minimal requirement only for λ = 0, 1 and −1. The minimization of the extended Cressie–Read power divergence for various values of λ, subject to constraints (i)–(v), gave results equivalent, in terms of smoothness, to those of the other methods used in the numerical investigation. As far as goodness of fit is concerned, there is a clear improvement compared with the results of minimizing the Cressie–Read power divergence with the five constraints. However, they are almost equivalent to those of the minimization of the Cressie–Read power divergence with the six constraints. Therefore, we can conclude that the additional sixth constraint that we propose is necessary not only from the theoretical but also from the practical point of view.

The similarity of results between the WH method and the power divergence minimization under the said constraints allows us to claim that the two graduation methods are nearly equivalent. This is supported not only by the numerical investigation but also by the fact that in the WH method we minimize a form of the Lagrangian function F + hS, while in power divergence graduation we minimize F subject to, among others, a constraint on S, which in turn leads to a similar Lagrangian.

A predictive analysis using Makeham's mortality model and the MSE criterion showed that the CR method with six constraints, the 'extended' CR method and the WH method are comparable.
It appears not to be realistic to seek a method that is best in terms of all comparison criteria, some of which are competing.

ACKNOWLEDGEMENTS

The authors would like to thank the Editor, the Associate Editor and the Referee for their valuable comments and suggestions, and in particular for bringing to our attention the paper of Csiszar [11], which strengthened our investigation.

REFERENCES

1. Basu A, Harris IR, Hjort NL, Jones MC. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998; 85(3):549–559.
2. Pardo L. Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC: London, 2006.
3. Read TRC, Cressie NAC. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer: New York, 1988.
4. Mathai AM, Rathie PN. Basic Concepts in Information Theory and Statistics. Wiley: New York, 1975.
5. Liese F, Vajda I. Convex Statistical Distances. B. G. Teubner: Leipzig, 1987.
6. Papaioannou T. Measures of information. In Encyclopedia of Statistical Sciences, vol. 5, Kotz S, Johnson NL (eds). Wiley: New York, 1985; 391–497.
7. Papaioannou T. On distances and measures of information: a case of diversity. In Probability and Statistical Models with Applications, Charalambides CA, Koutras MV, Balakrishnan N (eds). Chapman & Hall/CRC: London, 2001; 503–515.
8. Mattheou K. On new developments in statistical inference for measures of divergence. Ph.D. Thesis, University of Cyprus, Nicosia, Cyprus, 2007.
9. Papaioannou T, Ferentinos K. On two forms of Fisher's information number. Communications in Statistics – Theory and Methods 2005; 34:1461–1470.
10. Brockett PL. Information theoretic approach to actuarial science: a unification and extension of relevant theory and applications. Transactions of the Society of Actuaries 1991; 43:73–114.
11. Csiszar I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Annals of Statistics 1991; 19(4):2032–2066.
12. Kullback S.
Information Theory and Statistics. Wiley: New York, 1959.
13. Zografos K, Ferentinos K, Papaioannou T. Limiting properties of some measures of information. Annals of the Institute of Statistical Mathematics B 1989; 41(3):451–460.
14. Cressie NAC, Read TRC. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society B 1984; 46(3):440–464.
15. Benjamin P, Pollard JH. The Analysis of Mortality and Other Actuarial Statistics. Heinemann: London, 1980.
16. Debon A, Montes F, Sala R. A comparison of parametric models for mortality graduation. Application to mortality data for the Valencia region (Spain). SORT 2005; 29(2):269–288.
17. Debon A, Montes F, Sala R. A comparison of nonparametric methods in the graduation of mortality: application to data from the Valencia region (Spain). International Statistical Review 2006; 74(2):215–233.
18. Haberman S. Actuarial methods. In Encyclopedia of Biostatistics, vol. 1, Armitage P, Colton Th (eds). Wiley: New York, 1998; 37–49.
19. Haberman S, Renshaw AE. A simple graphical method for the comparison of two mortality experiences. Applied Stochastic Models in Business and Industry 1999; 15:333–352.
20. London D. Graduation: The Revision of Estimates. ACTEX Publications: Winsted, Connecticut, 1985.
21. Miller MD. Elements of Graduation. Actuarial Society of America: New York, 1949.
22. Nielsen JP. Smoothing and prediction with a view to actuarial science, biostatistics and finance. Scandinavian Actuarial Journal 2003; 1:51–74.
23. Neves CdR, Migon HS. Bayesian graduation of mortality rates: an application to mathematical reserve evaluation. Insurance: Mathematics and Economics 2007; 40:424–434.
24. Wang JL. Smoothing hazard rates. In Encyclopedia of Biostatistics, vol. 5 (2nd edn), Armitage P, Colton Th (eds). Wiley: New York, 2005; 4986–4997.
25. Brockett PL, Zhang J.
Information theoretical mortality graduation. Scandinavian Actuarial Journal 1986; 131–140.
26. Zhang J, Brockett PL. Quadratically constrained information theoretic analysis. SIAM Journal on Applied Mathematics 1987; 47(4):871–885.
27. The Actuarial Society of Hong Kong. Report on Hong Kong Assured Lives Mortality 2001, 2001. Available from: http://www.actuaries.org.hk/upload/File/ESR01.pdf.
