APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY
Appl. Stochastic Models Bus. Ind. 2010; 26:448–472
Published online 15 September 2009 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asmb.803
Divergences without probability vectors and their applications
Athanasios Sachlas and Takis Papaioannou*,†,‡
Department of Statistics and Insurance Science, University of Piraeus, 185 34 Piraeus, Greece
SUMMARY
In general, divergences and measures of information are defined for probability vectors. However, in some cases, divergences are 'informally' used to measure the discrepancy between vectors which are not necessarily probability vectors. In this paper we examine whether divergences with nonprobability vectors in their arguments share the properties of probabilistic or information theoretic divergences. The results indicate that divergences with nonprobability vectors share, under some conditions, some of the properties of probabilistic or information theoretic divergences and therefore can be considered and used as information measures. We then use these divergences in the problem of actuarial graduation of mortality rates. Copyright © 2009 John Wiley & Sons, Ltd.
Received 8 February 2008; Revised 24 July 2009; Accepted 25 July 2009
KEY WORDS: Kullback–Leibler divergence; Cressie–Read divergence; divergence with nonprobability vectors
1. INTRODUCTION
There are many practical problems where nonprobability vectors are involved. One such problem
is the actuarial graduation of mortality rates. Although divergences and/or measures of information
are defined for probability vectors, in practice they are used with nonprobability vectors as well.
The main purpose of this paper is to explore the properties of divergences without probability
vectors and provide an application in the actuarial field.
A bivariate function $D(f, g)$ of two functions or vectors $f$ and $g$ is a measure of divergence if $D(f, g) \geq 0$ with equality if and only if $f = g$ (see [1]). This is the minimal requirement for a measure $D(f, g)$ to be a 'kind' of distance between $f$ and $g$. In [2, p. 2] it is mentioned that a coefficient with the property of increasing as the two distributions involved move 'further from
*Correspondence to: Takis Papaioannou, Department of Statistics and Insurance Science, University of Piraeus, 185 34 Piraeus, Greece.
†E-mail: [email protected]
‡Part of the work of the second author was done while visiting the Department of Mathematics and Statistics of the University of Cyprus.
Copyright © 2009 John Wiley & Sons, Ltd.
each other' will be called a divergence measure between two probability distributions. For other requirements see [3, 4].
There are many measures of divergence, many of which originate from or are connected with information theory. For a good review see Chapter 1 of [2], Chapter 7 of [3], [5–7] and references cited therein. In this paper we concentrate on two of the most important divergences in statistics, the Kullback–Leibler and the Cressie–Read power divergences.
Papaioannou in [6, 7] presents in detail the properties of information measures. These are invariance under sufficient transformations, convexity, loss of information, sufficiency in experiments, appearance in Cramér–Rao inequalities, invariance under parametric transformations, the nuisance parameter inequality, the order-preserving property and the limiting property.
Measures of divergence, measures of information and their properties are still a topic under
research. New measures of divergence are proposed and their properties are investigated in [8],
while Papaioannou and Ferentinos [9] examine the Fisher information number in light of the
properties of the classical statistical information theory. However, there is no universal agreement
among statisticians on which properties constitute or define a measure of statistical information as
the approach is mostly operational rather than axiomatic [9]. The aim of this paper is to examine
the properties of measures of divergence when they involve nonprobability vectors and to present
an application in the graduation problem of actuarial science. The motivation stems from the work
of Brockett [10]. An important conclusion of this investigation is that an additional constraint, (vi)
of Section 4.1 below, should be included in the divergence minimization process. Inclusion of this
constraint dramatically improves goodness-of-fit results.
Here we have to mention the work of Csiszár [11], who considers linear inverse problems (problems with linear constraints) with n-dimensional real vectors, vectors with positive components, or probability vectors. His aim is to determine logically consistent rules for selecting such a vector. His selection–projection rules minimize 'distances' between such vectors, or functions thereof, subject to linear constraints. Several postulates–axioms characterize the projection rules. As corollaries, axiomatic characterizations of the methods of least squares, minimum discrimination information and maximum entropy are obtained. In this context he presents an extension of the Kullback–Leibler divergence to positive vectors that are not necessarily probability vectors. More precisely, he added the quantity $\sum_i q_i - \sum_i p_i$ to the standard Kullback–Leibler measure of directed divergence, that is, he defined

$$I^{KL}(\mathbf{p}, \mathbf{q}) = \sum_i \left( p_i \ln \frac{p_i}{q_i} - p_i + q_i \right)$$

where $\mathbf{p} = (p_1, \ldots, p_n)^T$, $\mathbf{q} = (q_1, \ldots, q_n)^T$ are just vectors with n real positive components.
The above $I^{KL}$-divergence is nonnegative and equals zero if and only if $\mathbf{p} = \mathbf{q}$; thus, it satisfies the minimal requirement for a function to be a measure of divergence. In case the components of $\mathbf{p}$ and $\mathbf{q}$ sum to one, the above-mentioned divergence reduces to the standard Kullback–Leibler directed divergence. We will refer to this measure as the extended Kullback–Leibler divergence and denote it by $I^{KL}_{ext}(\mathbf{p}, \mathbf{q})$. For other, more axiomatic than information theoretic, properties of $I^{KL}_{ext}(\mathbf{p}, \mathbf{q})$, see [11].
The paper is organized as follows: In Sections 2 and 3 we study the properties of the Kullback–Leibler and Cressie–Read power divergences for nonprobability vectors in the light of statistical
information theory. In Section 4 we describe an application of divergence measures involving
nonprobability arguments in the actuarial field supported by a numerical investigation. Section 5
contains concluding remarks.
2. KULLBACK–LEIBLER DIRECTED DIVERGENCE INVOLVING NONPROBABILITY VECTORS

The most popular measure of divergence between two n-dimensional probability vectors $\mathbf{p}^\circ$ and $\mathbf{q}^\circ$ is the Kullback–Leibler measure of information given by

$$I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ) = \sum_i p_i^\circ \ln \frac{p_i^\circ}{q_i^\circ}$$

(see [12, pp. 6–7]). It is a measure of directed divergence in the sense that it does not share all the properties of a distance or metric (it is not symmetric and does not satisfy the triangle inequality) and therefore cannot be considered a pure distance. The Kullback–Leibler directed divergence is defined for probability vectors and shares most of the properties that all information measures share.
In the sequel we investigate whether the Kullback–Leibler directed divergence between two
nonprobability vectors can be considered as a measure of information, by examining its properties
in the light of general properties of measures of information and divergence.
Definition 1
The Kullback–Leibler directed divergence between two n-dimensional nonprobability vectors $\mathbf{p}$ and $\mathbf{q}$ is defined by

$$D^{KL}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^n p_i \ln \frac{p_i}{q_i} \qquad (1)$$

where $\mathbf{p} = (p_1, \ldots, p_n)^T > 0$ and $\mathbf{q} = (q_1, \ldots, q_n)^T > 0$ with $\sum_{i=1}^n p_i \neq 1$ and $\sum_{i=1}^n q_i \neq 1$.
Lemma 1
For the Kullback–Leibler directed divergence with nonprobability vectors, it holds that

$$D^{KL}(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^n p_i \right) \left[ I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ) + \ln k \right]$$

where $k = \sum_{i=1}^n p_i / \sum_{i=1}^n q_i$, and $I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ is the Kullback–Leibler measure involving the probability vectors $\mathbf{p}^\circ$ and $\mathbf{q}^\circ$, the elements of which are the normalized elements of $\mathbf{p}$ and $\mathbf{q}$, that is, $p_i^\circ = p_i / \sum_{j=1}^n p_j$ and $q_i^\circ = q_i / \sum_{j=1}^n q_j$, $i = 1, \ldots, n$.

Proof
The result follows by simple algebra.
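The identity of Lemma 1 is purely algebraic and can be confirmed numerically; a minimal Python sketch (the vectors are illustrative):

```python
import math

def d_kl(p, q):
    # D^KL for positive, not necessarily probability, vectors
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def lemma1_rhs(p, q):
    # (sum p_i) * [ I^KL(p0, q0) + ln k ] with p0, q0 the normalized vectors
    sp, sq = sum(p), sum(q)
    k = sp / sq
    p0 = [pi / sp for pi in p]
    q0 = [qi / sq for qi in q]
    i_kl = sum(a * math.log(a / b) for a, b in zip(p0, q0))
    return sp * (i_kl + math.log(k))

p = [0.3, 0.9, 1.4]   # sums to 2.6
q = [0.5, 0.7, 1.0]   # sums to 2.2

print(abs(d_kl(p, q) - lemma1_rhs(p, q)) < 1e-12)  # True
```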
Proposition 1 (The nonnegativity property)

$$D^{KL}(\mathbf{p}, \mathbf{q}) \geq 0 \qquad (2)$$

if one of the following conditions holds: (i) $\sum_{i=1}^n p_i \geq \sum_{i=1}^n q_i$; (ii) $\sum_{i=1}^n p_i < \sum_{i=1}^n q_i$ and $\ln k > -I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$, where $k = \sum_{i=1}^n p_i / \sum_{i=1}^n q_i$.
Equality in (2) holds if $\mathbf{p} = \mathbf{q}$ or $\ln k = -I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$. Moreover, if $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$ then $D^{KL}(\mathbf{p}, \mathbf{q}) = 0$ if and only if $\mathbf{p} = \mathbf{q}$.

Proof
The proof is obvious on account of the nonnegativity property of $I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.

Note that $D^{KL}(\mathbf{p}, \mathbf{q}) = 0$ does not necessarily imply $\mathbf{p} = \mathbf{q}$ unless $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$. Thus the minimal requirement for using $D^{KL}(\mathbf{p}, \mathbf{q})$ as a measure of divergence is $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i$. Numerically this can be seen with the vectors $\mathbf{p} = (0.1584514, 0.2201928, 0.7247736)^T$ and $\mathbf{q} = (0.4, 0.8, 0.4)^T$, $\sum_i p_i \neq \sum_i q_i$, for which $D^{KL}(\mathbf{p}, \mathbf{q}) = 4.78877 \times 10^{-7}$ is almost zero, while their Euclidean distance is $\|\mathbf{p} - \mathbf{q}\| = 0.707107$.
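The numerical illustration above is easy to reproduce (Python, with the vectors copied from the text):

```python
import math

p = [0.1584514, 0.2201928, 0.7247736]
q = [0.4, 0.8, 0.4]

d_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
euclid = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(sum(p), sum(q))   # unequal component sums
print(d_kl)             # almost zero, even though p != q
print(euclid)           # about 0.707107
```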
Definition 1 has obvious extensions to the bivariate and multivariate cases. We present the
related definitions for the bivariate case.
Definition 2 (Bivariate divergence)
Let $p_i(x, y)$, $i = 1, 2$, be two bivariate measures (nonprobability functions) associated with two discrete variables $X$, $Y$ in $R^2$ for which it holds $\sum_x \sum_y p_i(x, y) \neq 1$. We define the Kullback–Leibler directed divergence between two bivariate nonprobability functions $p_1$, $p_2$ as

$$D^{KL}_{X,Y}(p_1, p_2) = \sum_x \sum_y p_1(x, y) \ln \frac{p_1(x, y)}{p_2(x, y)}$$
Definition 3 (Conditional divergence)
For the discrete variables $X$, $Y$ and the bivariate nonprobability functions $p_i(x, y)$, $i = 1, 2$, as given above, let $f_i(x) = \sum_y p_i(x, y)$, $h_i(y|x) = p_i(x, y)/f_i(x)$, $g_i(y) = \sum_x p_i(x, y)$ and $r_i(x|y) = p_i(x, y)/g_i(y)$, $i = 1, 2$. We set

$$D^{KL}_{Y|X=x}(h_1, h_2) = \sum_y h_1(y|x) \ln \frac{h_1(y|x)}{h_2(y|x)}$$

and define

$$D^{KL}_{Y|X}(h_1, h_2) = E_X[D^{KL}_{Y|X=x}(h_1, h_2)] = \sum_x f_1(x) \sum_y h_1(y|x) \ln \frac{h_1(y|x)}{h_2(y|x)}$$

Conditional divergences $D^{KL}_{X|Y=y}(r_1, r_2)$ and $D^{KL}_{X|Y}(r_1, r_2)$ are defined analogously.
Proposition 2 (Strong additivity)
Let $p_1$, $p_2$ be two bivariate nonprobability functions associated with two discrete variables $X$, $Y$ in $R^2$ as in Definition 2. Then

$$D^{KL}_{X,Y}(p_1, p_2) = D^{KL}_X(f_1, f_2) + D^{KL}_{Y|X}(h_1, h_2) = D^{KL}_Y(g_1, g_2) + D^{KL}_{X|Y}(r_1, r_2)$$

where the functions $f_i$, $h_i$, $g_i$, $r_i$, $i = 1, 2$, are as in Definition 3.
Proof
For the variables $X$, $Y$, using $h_i(y|x) = p_i(x, y)/f_i(x)$, we have that

$$D^{KL}_X(f_1, f_2) + D^{KL}_{Y|X}(h_1, h_2) = \sum_x f_1(x) \ln \frac{f_1(x)}{f_2(x)} + \sum_x \sum_y p_1(x, y) \ln \frac{p_1(x, y) f_2(x)}{p_2(x, y) f_1(x)}$$

$$= \sum_x f_1(x) \ln \frac{f_1(x)}{f_2(x)} + \sum_x \sum_y p_1(x, y) \left[ \ln \frac{p_1(x, y)}{p_2(x, y)} + \ln \frac{f_2(x)}{f_1(x)} \right]$$

$$= \sum_x f_1(x) \ln \frac{f_1(x)}{f_2(x)} + \sum_x \sum_y p_1(x, y) \ln \frac{p_1(x, y)}{p_2(x, y)} + \sum_x f_1(x) \ln \frac{f_2(x)}{f_1(x)} = D^{KL}_{X,Y}(p_1, p_2)$$

since $\sum_y p_1(x, y) = f_1(x)$. In a similar way, we prove that $D^{KL}_{X,Y}(p_1, p_2) = D^{KL}_Y(g_1, g_2) + D^{KL}_{X|Y}(r_1, r_2)$.
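The strong additivity identity is purely algebraic and can be confirmed on a small illustrative table (the 2×2 values below are arbitrary positive numbers, not data from the paper):

```python
import math

# Two bivariate nonprobability tables p1, p2 on a 2x2 grid
p1 = {(0, 0): 0.3, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.6}
p2 = {(0, 0): 0.2, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.9}

xs = [0, 1]
ys = [0, 1]

def f(p, x):  # marginal measure in x: f(x) = sum_y p(x, y)
    return sum(p[(x, y)] for y in ys)

def joint_div(pa, pb):   # D^KL_{X,Y}
    return sum(pa[k] * math.log(pa[k] / pb[k]) for k in pa)

def marg_div(pa, pb):    # D^KL_X
    return sum(f(pa, x) * math.log(f(pa, x) / f(pb, x)) for x in xs)

def cond_div(pa, pb):    # D^KL_{Y|X} = sum_x f1(x) sum_y h1 ln(h1/h2)
    total = 0.0
    for x in xs:
        fa, fb = f(pa, x), f(pb, x)
        for y in ys:
            h1 = pa[(x, y)] / fa
            h2 = pb[(x, y)] / fb
            total += fa * h1 * math.log(h1 / h2)
    return total

lhs = joint_div(p1, p2)
rhs = marg_div(p1, p2) + cond_div(p1, p2)
print(abs(lhs - rhs) < 1e-12)  # True: strong additivity
```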
Corollary 1
(i) $D^{KL}_{X,Y}(p_1, p_2) \geq D^{KL}_X(f_1, f_2)$ with equality if and only if $D^{KL}_{Y|X}(h_1, h_2) = 0$;
(ii) $D^{KL}_{X,Y}(p_1, p_2) \geq D^{KL}_Y(g_1, g_2)$ with equality if and only if $D^{KL}_{X|Y}(r_1, r_2) = 0$;
(iii) $D^{KL}_{X,Y}(p_1, p_2) \geq D^{KL}_{Y|X}(h_1, h_2)$ with equality if and only if $D^{KL}_X(f_1, f_2) = 0$;
(iv) $D^{KL}_{X,Y}(p_1, p_2) \geq D^{KL}_{X|Y}(r_1, r_2)$ with equality if and only if $D^{KL}_Y(g_1, g_2) = 0$.
In all the above cases equality holds if and only if the normalized values of $X$, $Y$ are independent.
The normalized values of $X$, $Y$ form two random variables $X^\circ$, $Y^\circ$ with discrete joint probability mass function $p_i^\circ(x, y) = p_i(x, y)/\sum_x \sum_y p_i(x, y)$ and marginal and conditional probability mass functions as follows: $X^\circ \sim f_i^\circ$, $Y^\circ|X^\circ \sim h_i^\circ$, $Y^\circ \sim g_i^\circ$, $X^\circ|Y^\circ \sim r_i^\circ$. For the random variables $X^\circ$, $Y^\circ$ we have

$$I^{KL}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = \sum_x \sum_y p_1^\circ(x, y) \ln \frac{p_1^\circ(x, y)}{p_2^\circ(x, y)}$$

Proposition 3 (Weak additivity)
If $h_i(y|x) = g_i(y)$ and thus $p_i(x, y) = f_i(x) g_i(y)$, $i = 1, 2$, we have that the random variables $X^\circ$, $Y^\circ$, produced by normalizing $X$, $Y$ as indicated above, are independent, and

$$D^{KL}_{X,Y}(p_1, p_2) = D^{KL}_X(f_1, f_2) + D^{KL}_Y(g_1, g_2) - \alpha \ln \beta$$

where $\alpha = \sum_y g_1(y) = \sum_x f_1(x)$ and $\beta = \sum_y g_1(y)/\sum_y g_2(y) = \sum_x f_1(x)/\sum_x f_2(x)$.
Proof
On account of the definitions given above, it is easy to see that

$$p_i^\circ(x, y) = f_i^\circ(x)\, g_i^\circ(y)$$

which means that the random variables $X^\circ$, $Y^\circ$ are independent. We know that

$$I^{KL}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = I^{KL}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{KL}_{Y^\circ}(g_1^\circ, g_2^\circ)$$

Then for the variables $X$, $Y$ and in view of Lemma 1 we have that

$$D^{KL}_{X,Y}(p_1, p_2) = \sum_x \sum_y p_1(x, y) \left[ I^{KL}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) + \ln s \right] = \sum_x \sum_y p_1(x, y) \left[ I^{KL}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{KL}_{Y^\circ}(g_1^\circ, g_2^\circ) + \ln s \right] \qquad (3)$$

where $s = \sum_x \sum_y p_1(x, y)/\sum_x \sum_y p_2(x, y)$. As $\sum_x f_i(x) = \sum_y g_i(y) = \sum_x \sum_y p_i(x, y)$, $i = 1, 2$, we have that $\alpha = \sum_x \sum_y p_1(x, y)$ and $s = \beta$. Therefore by Equation (3) we have that

$$D^{KL}_{X,Y}(p_1, p_2) = \alpha I^{KL}_{X^\circ}(f_1^\circ, f_2^\circ) + \alpha I^{KL}_{Y^\circ}(g_1^\circ, g_2^\circ) + \alpha \ln \beta = D^{KL}_X(f_1, f_2) - \alpha \ln \beta + D^{KL}_Y(g_1, g_2) - \alpha \ln \beta + \alpha \ln \beta = D^{KL}_X(f_1, f_2) + D^{KL}_Y(g_1, g_2) - \alpha \ln \beta$$

Weak additivity holds if $\sum_x f_1(x) = \sum_x f_2(x)$ or $\sum_y g_1(y) = \sum_y g_2(y)$, since then $\beta = 1$ and the term $\alpha \ln \beta$ vanishes.
Proposition 4 (Maximal information and sufficiency)
Let $Y = T(X)$ be a measurable transformation of $X$ and $p_i = p_i(x)$, $g_i = g_i(y)$, $i = 1, 2$. Then

$$D^{KL}_X(p_1, p_2) \geq D^{KL}_Y(g_1, g_2)$$

with equality if and only if $Y^\circ$ is sufficient with respect to the pair of distributions $p_1^\circ$ and $p_2^\circ$, $Y^\circ$ and $X^\circ$ being the normalized versions of $Y$ and $X$, respectively.

Proof
Let $g_i(y)$ be the measure associated with $Y$. Then $g_i(y) = \sum_{x: T(x)=y} p_i(x)$. Setting $a = \sum_x p_1(x)$, $b = \sum_y g_1(y)$ and in view of Lemma 1, the following inequalities are equivalent:

$$D^{KL}_X(p_1, p_2) \geq D^{KL}_Y(g_1, g_2)$$

$$\Leftrightarrow \quad a\left[ I^{KL}_{X^\circ}(p_1^\circ, p_2^\circ) + \ln c \right] \geq b\left[ I^{KL}_{Y^\circ}(g_1^\circ, g_2^\circ) + \ln d \right]$$

where $p_i^\circ(x) = p_i(x)/\sum_x p_i(x)$, $g_i^\circ(y) = g_i(y)/\sum_y g_i(y)$, $c = \sum_x p_1(x)/\sum_x p_2(x)$, $d = \sum_y g_1(y)/\sum_y g_2(y)$, and $X^\circ$, $Y^\circ$ are the random variables derived by probabilitizing the values of $X$, $Y$. As $\sum_x p_i(x) = \sum_y g_i(y)$, $i = 1, 2$, and thus $a = b$ and $c = d$, the last inequality is equivalent to

$$I^{KL}_{X^\circ}(p_1^\circ, p_2^\circ) \geq I^{KL}_{Y^\circ}(g_1^\circ, g_2^\circ)$$

which is true as $X^\circ$ and $Y^\circ$ are random variables and $Y^\circ$ is a measurable transformation of $X^\circ$. Equality holds if and only if the statistic $Y^\circ = T(X^\circ)$ is sufficient (cf. [2, pp. 11–12], [6], [12, p. 21]).
Proposition 5
$D^{KL}(\mathbf{p}, \mathbf{q}) \geq I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ when one of the following conditions holds: (i) $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i \geq 1$; (ii) $\sum_{i=1}^n p_i > \sum_{i=1}^n q_i$ and $\sum_{i=1}^n p_i \geq 1$; (iii) $\sum_{i=1}^n p_i < \sum_{i=1}^n q_i$ and $\sum_{i=1}^n p_i < 1$.

Proof
It follows easily from Lemma 1 by simple algebraic arguments.
One basic property of measures of information and divergence is the limiting property. This property asserts that a sequence $\{X_n\}$ of random variables converges in distribution to a random variable $X$ as $n \to \infty$ if and only if $I_{X_n} \to I_X$, where $I$ denotes an information measure. Under some conditions the limiting property holds for several measures of information, including the Kullback–Leibler divergence. See [12, 13].
In the next proposition we investigate whether the limiting property holds when the Kullback–Leibler divergence has nonprobability vectors in its arguments.
Proposition 6 (The limiting property)
Let $\{\mathbf{p}_n\}$ be a sequence of nonprobability vectors bounded from above. Then $\mathbf{p}_n \to \mathbf{p}$ if and only if $D^{KL}(\mathbf{p}_n, \mathbf{p}) \to 0$.

Proof
Let $\mathbf{p}_n \to \mathbf{p}$. Using Lemma 1 we have

$$\lim_{n \to \infty} D^{KL}(\mathbf{p}_n, \mathbf{p}) = \lim_{n \to \infty} \left( \sum_i p_n(i) \right) \left[ \lim_{n \to \infty} I^{KL}(\mathbf{p}_n^\circ, \mathbf{p}^\circ) + \lim_{n \to \infty} \ln \frac{\sum_i p_n(i)}{\sum_i p(i)} \right] = 0$$

because $\lim_{n \to \infty} I^{KL}(\mathbf{p}_n^\circ, \mathbf{p}^\circ) = 0$ and $\lim_{n \to \infty} \ln(\sum_i p_n(i)/\sum_i p(i)) = 0$.

On the other hand, let $D^{KL}(\mathbf{p}_n, \mathbf{p}) \to 0$. Then

$$\lim_{n \to \infty} \sum_i p(i)\, \phi\!\left( \frac{p_n(i)}{p(i)} \right) = 0$$

where $\phi(x) = x \ln x$, $x > 0$, is a continuous function with $\phi(1) = 0$.
Suppose that $\mathbf{p}_n \to \mathbf{p}$ does not hold. Then there is a subsequence $n_1 < n_2 < \cdots < n_s < \cdots$ of integers and a vector $\mathbf{q}$ such that

$$\lim_{s \to \infty} \mathbf{p}_{n_s} = \mathbf{q} \quad \text{and} \quad \mathbf{p} \neq \mathbf{q} \qquad (4)$$

Because $\phi$ is continuous we have that

$$\lim_{s \to \infty} \sum_i p(i)\, \phi\!\left( \frac{p_{n_s}(i)}{p(i)} \right) = \sum_i p(i)\, \phi\!\left( \frac{q(i)}{p(i)} \right)$$

However, $\sum_i p(i) \phi(p_{n_s}(i)/p(i))$ is a subsequence of $\sum_i p(i) \phi(p_n(i)/p(i))$, which converges to 0. Thus

$$\sum_i p(i)\, \phi\!\left( \frac{q(i)}{p(i)} \right) = 0$$

which is possible only if $p(i) = q(i)$, which contradicts Equation (4). Thus we have that $\mathbf{p}_n \to \mathbf{p}$, so the limiting property holds for the Kullback–Leibler directed divergence.
As expected, the Kullback–Leibler directed divergence $D^{KL}(\mathbf{p}, \mathbf{q})$ with nonprobability vectors $\mathbf{p}$ and $\mathbf{q}$ does not in general share the properties of the Kullback–Leibler directed divergence with probability vectors $\mathbf{p}^\circ$ and $\mathbf{q}^\circ$. Under certain conditions, some of them are satisfied. More precisely, $D^{KL}(\mathbf{p}, \mathbf{q})$ is nonnegative, additive, invariant under sufficient transformations and greater than $I^{KL}(\mathbf{p}^\circ, \mathbf{q}^\circ)$; it satisfies the property of maximal information and the limiting property. So, in general terms, it can be regarded as a measure of divergence and therefore can be used whenever we do not have probability vectors, provided that $\sum_i p_i = \sum_i q_i$.
3. POWER DIRECTED DIVERGENCE WITHOUT PROBABILITY VECTORS

Cressie and Read introduced in [14] the power divergence between two probability vectors $\mathbf{p}^\circ$ and $\mathbf{q}^\circ$ for goodness-of-fit purposes. It is defined by

$$I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) = \frac{1}{\lambda(\lambda+1)} \sum_{i=1}^n p_i^\circ \left[ \left( \frac{p_i^\circ}{q_i^\circ} \right)^{\lambda} - 1 \right]$$

where $\lambda$ is a real-valued parameter. The values at $\lambda = 0, -1$ are defined by continuity. For $\lambda \to 0$ we have $I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) = \sum_{i=1}^n p_i^\circ \ln(p_i^\circ/q_i^\circ)$, which is the Kullback–Leibler directed divergence, while for $\lambda \to -1$ we have $I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) = I^{KL}(\mathbf{q}^\circ, \mathbf{p}^\circ)$. The power divergence has the properties of other measures of divergence. We note that $I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ is a directed divergence [14].

We also have the family of power divergence statistics, $2n I^{CR}(\hat{\mathbf{p}}, \mathbf{p}_0)$, primarily used for goodness-of-fit purposes [14]. Here $\hat{\mathbf{p}} = \mathbf{X}/n$ is the vector of sample proportions, $\mathbf{X}$ is multinomial $M(n, \mathbf{p})$ and $\mathbf{p}_0$ is the probability model of interest. Members of this family are (i) the chi-squared statistic $\chi^2$ for $\lambda = 1$, (ii) the $G^2$ statistic for $\lambda \to 0$, (iii) the modified likelihood ratio statistic for $\lambda \to -1$, (iv) the Freeman–Tukey statistic $F^2$ for $\lambda = -\frac{1}{2}$ and (v) the Neyman-modified $\chi^2$ statistic for $\lambda = -2$. As an alternative to the $\chi^2$ and $G^2$ statistics, Cressie and Read proposed in [14] the power divergence statistic with $\lambda = \frac{2}{3}$, which lies between them.
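The familiar members of the family can be checked numerically. In the Python sketch below (the proportions are illustrative), $\lambda = 1$ reproduces Pearson's $\chi^2$ exactly, and $\lambda$ near 0 approaches $G^2$:

```python
import math

def cr_divergence(p, q, lam):
    # Cressie-Read power divergence for probability vectors p and q
    if lam == 0:  # Kullback-Leibler limit
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return sum(pi * ((pi / qi) ** lam - 1) for pi, qi in zip(p, q)) / (lam * (lam + 1))

p_hat = [0.25, 0.35, 0.40]   # sample proportions (illustrative)
p0 = [0.30, 0.30, 0.40]      # hypothesized model
n = 100                      # sample size

# lam = 1: 2n * I^CR equals Pearson's chi-squared statistic sum (O-E)^2 / E
chi2 = sum((n * ph - n * p) ** 2 / (n * p) for ph, p in zip(p_hat, p0))
print(2 * n * cr_divergence(p_hat, p0, 1.0), chi2)

# lam -> 0: 2n * I^CR approaches the likelihood ratio statistic G^2
g2 = 2 * n * sum(ph * math.log(ph / p) for ph, p in zip(p_hat, p0))
print(2 * n * cr_divergence(p_hat, p0, 1e-8), g2)
```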
Let us now see what happens in the case we do not have probability vectors. We first define the power divergence of order $\lambda$ for nonprobability vectors.

Definition 4
We define as

$$D^{CR}(\mathbf{p}, \mathbf{q}) = \frac{1}{\lambda(\lambda+1)} \sum_i p_i \left[ \left( \frac{p_i}{q_i} \right)^{\lambda} - 1 \right], \quad \lambda \in R \qquad (5)$$

the Cressie–Read power divergence of order $\lambda$ between two nonprobability vectors $\mathbf{p} = (p_1, \ldots, p_n)^T > 0$ and $\mathbf{q} = (q_1, \ldots, q_n)^T > 0$, where $\sum_i p_i \neq 1$ and $\sum_i q_i \neq 1$.
In this section we shall examine the information theoretic and divergence properties of this measure. We shall also assume that $\lambda \neq 0$ and $\lambda \neq -1$.
Lemma 2
For the Cressie–Read directed divergence involving nonprobability vectors $\mathbf{p}$, $\mathbf{q}$, it holds that

$$D^{CR}(\mathbf{p}, \mathbf{q}) = \left( \sum_i p_i \right) k^{\lambda} \left[ I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) - \frac{1 - k^{\lambda}}{k^{\lambda}} \frac{1}{\lambda(\lambda+1)} \right]$$

where $I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ is the Cressie–Read directed divergence between the two probability vectors $\mathbf{p}^\circ$ and $\mathbf{q}^\circ$ defined in Lemma 1 and $k = \sum_i p_i / \sum_i q_i$.

Proof
Switching from $(\mathbf{p}, \mathbf{q})$ to $(\mathbf{p}^\circ, \mathbf{q}^\circ)$ we have

$$D^{CR}(\mathbf{p}, \mathbf{q}) = \frac{1}{\lambda(\lambda+1)} \left( \sum_i p_i \right) \sum_i p_i^\circ \left[ k^{\lambda} \left( \frac{p_i^\circ}{q_i^\circ} \right)^{\lambda} - 1 \right] = \left( \sum_i p_i \right) \frac{k^{\lambda}}{\lambda(\lambda+1)} \sum_i p_i^\circ \left[ \left( \frac{p_i^\circ}{q_i^\circ} \right)^{\lambda} - \frac{1}{k^{\lambda}} \right] = \left( \sum_i p_i \right) k^{\lambda} \left[ I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) - \frac{1 - k^{\lambda}}{k^{\lambda}} \frac{1}{\lambda(\lambda+1)} \right]$$
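As with Lemma 1, the decomposition of Lemma 2 is an algebraic identity and can be confirmed numerically for several values of $\lambda$ (illustrative vectors):

```python
import math

def d_cr(p, q, lam):
    # Cressie-Read power divergence for positive (nonprobability) vectors
    return sum(pi * ((pi / qi) ** lam - 1) for pi, qi in zip(p, q)) / (lam * (lam + 1))

def lemma2_rhs(p, q, lam):
    # (sum p_i) * k^lam * [ I^CR(p0, q0) - (1 - k^lam) / (k^lam * lam * (lam + 1)) ]
    sp, sq = sum(p), sum(q)
    k = sp / sq
    p0 = [pi / sp for pi in p]
    q0 = [qi / sq for qi in q]
    i_cr = d_cr(p0, q0, lam)  # same formula applied to probability vectors
    m = (1 - k ** lam) / (k ** lam * lam * (lam + 1))
    return sp * k ** lam * (i_cr - m)

p = [0.3, 0.9, 1.4]
q = [0.5, 0.7, 1.0]
for lam in (0.5, 1.0, 2.0, -0.5, 2 / 3):
    assert abs(d_cr(p, q, lam) - lemma2_rhs(p, q, lam)) < 1e-10
print("Lemma 2 identity verified")
```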
Proposition 7 (The nonnegativity property)
Let

$$m = \frac{1 - k^{\lambda}}{k^{\lambda}} \frac{1}{\lambda(\lambda+1)}$$

Then

$$D^{CR}(\mathbf{p}, \mathbf{q}) \geq 0$$

if one of the following conditions holds:
(i) $\sum_i p_i = \sum_i q_i$;
(ii) $\sum_i p_i > \sum_i q_i$ and $\lambda \notin (-1, 0)$;
(iii) $\sum_i p_i < \sum_i q_i$ and $\lambda \in (-1, 0)$;
(iv) $m < I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.
As for the equality we have the following:
(a) If $\sum_i p_i = \sum_i q_i$, equality holds if and only if $\mathbf{p} = \mathbf{q}$;
(b) if $\sum_i p_i > \sum_i q_i$ or $\sum_i p_i < \sum_i q_i$, equality holds if $m = I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.
In summary, if $\sum_i p_i = \sum_i q_i$ then $D^{CR}(\mathbf{p}, \mathbf{q}) \geq 0$ with equality if and only if $\mathbf{p} = \mathbf{q}$.
Proof
By Lemma 2 we know that $D^{CR}(\mathbf{p}, \mathbf{q}) = (\sum_i p_i) k^{\lambda} [I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) - m]$. We have the following:
If $\sum_i p_i = \sum_i q_i$, then $k = 1$ and, independently of the value of $\lambda$, it holds that $m = 0$ and consequently $D^{CR}(\mathbf{p}, \mathbf{q}) = (\sum_i p_i) I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) \geq 0$.
If $\sum_i p_i > \sum_i q_i$, then $k > 1$ and $m < 0$ if $\lambda \notin (-1, 0)$. In this case we have that $D^{CR}(\mathbf{p}, \mathbf{q}) \geq 0$ if $m \leq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$, while when $m \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ we have $D^{CR}(\mathbf{p}, \mathbf{q}) \leq 0$.
If $\sum_i p_i < \sum_i q_i$, then $k < 1$ and $m < 0$ if $\lambda \in (-1, 0)$. So, in this case we have the same conclusion as above.
Summarizing, conditions (i)–(iv) of the proposition imply $D^{CR}(\mathbf{p}, \mathbf{q}) \geq 0$. It is easy to see that when $\sum_i p_i = \sum_i q_i$ equality holds if $\mathbf{p} = \mathbf{q}$. Also, when $\sum_i p_i > \sum_i q_i$ or $\sum_i p_i < \sum_i q_i$, equality holds if $m = I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.

Note that $D^{CR}(\mathbf{p}, \mathbf{q}) = 0$ does not necessarily imply $\mathbf{p} = \mathbf{q}$ unless $\sum_i p_i = \sum_i q_i$. Again here we have that the minimal requirement for using $D^{CR}(\mathbf{p}, \mathbf{q})$ as a measure of divergence is $\sum_i p_i = \sum_i q_i$, regardless of the value of $\lambda$.
Proposition 8
$D^{CR}(\mathbf{p}, \mathbf{q}) \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ when one of the following conditions holds: (i) $\sum_i p_i = \sum_i q_i$; (ii) $\sum_i p_i > \sum_i q_i$ and $\lambda \notin (-1, 0)$; (iii) $\sum_i p_i < \sum_i q_i$ and $\lambda \in (-1, 0)$. Equality holds if $m = I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$, independently of the value of $\lambda$, where $m$ is as in Proposition 7.

Proof
By Lemma 2 we know that $D^{CR}(\mathbf{p}, \mathbf{q}) = (\sum_i p_i) k^{\lambda} [I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ) - m]$. We have the following three situations:
If $\sum_i p_i = \sum_i q_i$, then $k = 1$ and, independently of the value of $\lambda$, it holds that $m = 0$ and consequently $D^{CR}(\mathbf{p}, \mathbf{q}) \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.
If $\sum_i p_i > \sum_i q_i$ (or equivalently $k > 1$) and $m < 0$ (when $\lambda \notin (-1, 0)$), we have that $D^{CR}(\mathbf{p}, \mathbf{q}) \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ always holds. In case $\sum_i p_i > \sum_i q_i$ and $m > 0$ (when $\lambda \in (-1, 0)$), we have that if $m > I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ then $D^{CR}(\mathbf{p}, \mathbf{q}) < 0$, which is impossible, while if $m < I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ then $D^{CR}(\mathbf{p}, \mathbf{q}) < I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.
If $\sum_i p_i < \sum_i q_i$ (or equivalently $k < 1$) and $m < 0$ (when $\lambda \in (-1, 0)$), we have that $D^{CR}(\mathbf{p}, \mathbf{q}) \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ always holds.
Summarizing, we have that (i)–(iii) imply $D^{CR}(\mathbf{p}, \mathbf{q}) \geq I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$.
Definition 5 (Bivariate divergence)
In the framework of Definition 2 we define the Cressie–Read directed divergence between two bivariate nonprobability functions $p_1$, $p_2$ as

$$D^{CR}_{X,Y}(p_1, p_2) = \frac{1}{\lambda(\lambda+1)} \sum_x \sum_y p_1(x, y) \left[ \left( \frac{p_1(x, y)}{p_2(x, y)} \right)^{\lambda} - 1 \right]$$
Definition 6 (Conditional divergence)
In the framework of Definition 3, we set

$$D^{CR}_{Y|X=x}(h_1, h_2) = \frac{1}{\lambda(\lambda+1)} \sum_y h_1(y|x) \left[ \left( \frac{h_1(y|x)}{h_2(y|x)} \right)^{\lambda} - 1 \right]$$

and

$$D^{CR}_{Y|X}(h_1, h_2) = E_X[D^{CR}_{Y|X=x}(h_1, h_2)] = \frac{1}{\lambda(\lambda+1)} \sum_x f_1(x) \sum_y h_1(y|x) \left[ \left( \frac{h_1(y|x)}{h_2(y|x)} \right)^{\lambda} - 1 \right]$$

for the variable $X$. The conditional divergence $D^{CR}_{X|Y}(r_1, r_2)$ is defined in an analogous way.
Strong additivity is not satisfied for the power divergence with probability vectors, as one can easily see with the following numerical example involving two trinomial distributions. If $(X^\circ, Y^\circ, Z^\circ)$ is trinomial $M(n, p_{i1}, p_{i2}, p_{i3})$, $p_{i1} + p_{i2} + p_{i3} = 1$, $i = 1, 2$, using standard results, some algebra and obvious notation, we have

$$I^{CR}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = \frac{1}{\lambda(\lambda+1)} \sum_{x,y,z} \binom{n}{x, y, z} p_{11}^x p_{12}^y p_{13}^z \left[ \left( \frac{p_{11}}{p_{21}} \right)^{\lambda x} \left( \frac{p_{12}}{p_{22}} \right)^{\lambda y} \left( \frac{p_{13}}{p_{23}} \right)^{\lambda z} - 1 \right]$$

$$I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) = \frac{1}{\lambda(\lambda+1)} \sum_x \binom{n}{x} p_{11}^x q_{11}^{n-x} \left[ \left( \frac{p_{11}}{p_{21}} \right)^{\lambda x} \left( \frac{q_{11}}{q_{21}} \right)^{\lambda(n-x)} - 1 \right], \quad q_{i1} = 1 - p_{i1}$$

and

$$I^{CR}_{Y^\circ|X^\circ}(h_1^\circ, h_2^\circ) = \frac{1}{\lambda(\lambda+1)} \sum_{x,y,z} \binom{n}{x, y, z} p_{11}^x p_{12}^y p_{13}^z \left[ \left( \frac{p_{12}}{p_{22}} \right)^{\lambda y} \left( \frac{p_{13}}{p_{23}} \right)^{\lambda z} \left( \frac{q_{21}}{q_{11}} \right)^{\lambda(n-x)} - 1 \right]$$

where $\binom{n}{x,y,z}$ is the trinomial coefficient and $x + y + z = n$. For $n = 5$, $p_{11} = 0.2$, $p_{12} = 0.2$, $p_{13} = 0.6$, $p_{21} = 0.3$, $p_{22} = 0.4$, $p_{23} = 0.3$ and $\lambda = 1.2$, we obtain

$$I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{CR}_{Y^\circ|X^\circ}(h_1^\circ, h_2^\circ) = 0.133 + 2.037 = 2.17 < I^{CR}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = 3.451$$

For $n = 5$, the same $p_{ij}$'s and $\lambda = -0.4$ we obtain

$$I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{CR}_{Y^\circ|X^\circ}(h_1^\circ, h_2^\circ) = 0.132 + 0.804 = 0.936 > I^{CR}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = 0.877$$

A further numerical investigation revealed that when $\lambda > 0$ the subadditivity property holds, while when $\lambda < 0$ the superadditivity property holds. Equality holds only when $\lambda = 0$, which is the case of the Kullback–Leibler divergence.
No convenient expression was obtained in the case of nonprobability vectors. For weak additivity, we have the following proposition.

Proposition 9 (Weak additivity)
If $h_i(y|x) = g_i(y)$ and thus $p_i(x, y) = f_i(x) g_i(y)$, $i = 1, 2$, so that the random variables $X^\circ$ and $Y^\circ$, which are the 'standardized' values of $X$, $Y$, are independent, then
(a)

$$D^{CR}_{X,Y}(p_1, p_2) = D^{CR}_X(f_1, f_2) + D^{CR}_Y(g_1, g_2) + p_{1\cdot\cdot}\, \beta^{\lambda} \lambda(\lambda+1)\, I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ)\, I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) + p_{1\cdot\cdot} (1 - \beta^{\lambda}) \frac{1}{\lambda(\lambda+1)}$$

where $p_{i\cdot\cdot} = \sum_x \sum_y p_i(x, y)$, $i = 1, 2$, and $\beta = p_{1\cdot\cdot}/p_{2\cdot\cdot}$;
(b) $D^{CR}_{X,Y}(p_1, p_2) = D^{CR}_X(f_1, f_2) + D^{CR}_Y(g_1, g_2)$ if $\beta = 1$ and if one of the marginal pairs $(f_1^\circ, f_2^\circ)$, $(g_1^\circ, g_2^\circ)$ are identical.
Proof
(a) We have already seen in Proposition 3 that the random variables $X^\circ$, $Y^\circ$ are independent. We know that (cf. [3])

$$I^{CR}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) = I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) + \lambda(\lambda+1) I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ)$$

Then using Lemma 2 we have

$$D^{CR}_{X,Y}(p_1, p_2) = p_{1\cdot\cdot} \beta^{\lambda} \left[ I^{CR}_{X^\circ,Y^\circ}(p_1^\circ, p_2^\circ) - \frac{1 - \beta^{\lambda}}{\beta^{\lambda} \lambda(\lambda+1)} \right]$$

$$= p_{1\cdot\cdot} \beta^{\lambda} \left[ I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) + I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) + \lambda(\lambda+1) I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) - \frac{1 - \beta^{\lambda}}{\beta^{\lambda} \lambda(\lambda+1)} \right]$$

$$= D^{CR}_X(f_1, f_2) + D^{CR}_Y(g_1, g_2) + p_{1\cdot\cdot} \beta^{\lambda} \left[ \lambda(\lambda+1) I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) + \frac{1 - \beta^{\lambda}}{\beta^{\lambda} \lambda(\lambda+1)} \right]$$

$$= D^{CR}_X(f_1, f_2) + D^{CR}_Y(g_1, g_2) + p_{1\cdot\cdot} \beta^{\lambda} \lambda(\lambda+1) I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) + p_{1\cdot\cdot} (1 - \beta^{\lambda}) \frac{1}{\lambda(\lambda+1)}$$

(b) If $p_{1\cdot\cdot} = p_{2\cdot\cdot}$, then $\beta = 1$. Thus, regardless of the value of $\lambda$, the last term of the above equation equals 0. Moreover, if $f_1^\circ = f_2^\circ$ or $g_1^\circ = g_2^\circ$ then $I^{CR}_{X^\circ}(f_1^\circ, f_2^\circ) I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) = 0$. Thus $D^{CR}_{X,Y}(p_1, p_2) = D^{CR}_X(f_1, f_2) + D^{CR}_Y(g_1, g_2)$.
Proposition 10 (Maximal information and sufficiency)
Let $Y = T(X)$ be a measurable transformation of $X$. Then

$$D^{CR}_X(p_1, p_2) \geq D^{CR}_Y(g_1, g_2)$$

when $c > 1$, where $c = (\sum_x p_1(x)/\sum_x p_2(x))^{\lambda}$, with equality if and only if $Y^\circ$ is 'sufficient' as explained in Proposition 4, where $p_i = p_i(x)$, $g_i = g_i(y)$, $i = 1, 2$.
Proof
Let $g_i(y)$ be the measure associated with $Y$. Then $g_i(y) = \sum_{x: T(x)=y} p_i(x)$. The following inequalities are equivalent:

$$D^{CR}_X(p_1, p_2) \geq D^{CR}_Y(g_1, g_2)$$

$$\Leftrightarrow \quad \left( \sum_x p_1(x) \right) c \left[ I^{CR}_{X^\circ}(p_1^\circ, p_2^\circ) - k \right] \geq \left( \sum_y g_1(y) \right) d \left[ I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ) - l \right]$$

where

$$c = \left( \frac{\sum_x p_1(x)}{\sum_x p_2(x)} \right)^{\lambda}, \quad d = \left( \frac{\sum_y g_1(y)}{\sum_y g_2(y)} \right)^{\lambda}, \quad k = \frac{1-c}{c} \frac{1}{\lambda(\lambda+1)} \quad \text{and} \quad l = \frac{1-d}{d} \frac{1}{\lambda(\lambda+1)}$$

As $\sum_x p_i(x) = \sum_y g_i(y)$, $i = 1, 2$, and thus $c = d$ and $k = l$, the last inequality is equivalent to

$$I^{CR}_{X^\circ}(p_1^\circ, p_2^\circ) \geq I^{CR}_{Y^\circ}(g_1^\circ, g_2^\circ)$$

which holds whenever $c > 1$. Equality holds if and only if the statistic $Y^\circ = T(X^\circ)$ is sufficient ([2, pp. 11–12]).
Zografos et al. proved in [13] that the limiting property holds for Csiszár's measure of divergence ($\phi$-divergence) defined as

$$I^C_\phi(f_1, f_2) = \int f_2(x)\, \phi\!\left( \frac{f_1(x)}{f_2(x)} \right) dx$$

where $\phi$ is a real-valued convex function satisfying certain conditions.
The Cressie and Read divergence can be obtained from Csiszár's measure by taking $\phi(x) = [\lambda(\lambda+1)]^{-1}(x^{\lambda+1} - x)$ in the discrete version of the measure [2]. So the limiting property holds for the Cressie and Read divergence as well. In the next proposition we investigate whether the limiting property holds in case we do not have probability vectors.
Proposition 11 (The limiting property)
Let $\{\mathbf{p}_n\}$ be a sequence of nonprobability vectors. Then $\mathbf{p}_n \to \mathbf{p}$ if and only if $D^{CR}(\mathbf{p}_n, \mathbf{p}) \to 0$.

Proof
Using Lemma 2 we have that

$$\lim_{n \to \infty} D^{CR}(\mathbf{p}_n, \mathbf{p}) = \lim_{n \to \infty} \left( \sum_i p_n(i) \right) \lim_{n \to \infty} k^{\lambda} \lim_{n \to \infty} \left[ I^{CR}(\mathbf{p}_n^\circ, \mathbf{p}^\circ) - \frac{1 - k^{\lambda}}{k^{\lambda}} \frac{1}{\lambda(\lambda+1)} \right] = 0$$

because $\lim_{n \to \infty} I^{CR}(\mathbf{p}_n^\circ, \mathbf{p}^\circ) = 0$ and $\lim_{n \to \infty} k = 1$.

On the other hand, let $D^{CR}(\mathbf{p}_n, \mathbf{p}) \to 0$. Then, ignoring the constant $1/(\lambda(\lambda+1))$,

$$\lim_{n \to \infty} \sum_i p_n(i) \left[ \left( \frac{p_n(i)}{p(i)} \right)^{\lambda} - 1 \right] = 0 \quad \text{or} \quad \lim_{n \to \infty} \sum_i p(i)\, \phi\!\left( \frac{p_n(i)}{p(i)} \right) = 0$$

where the function $\phi(x) = x^{\lambda+1} - x$, $x > 0$, $\lambda \neq 0, -1$, is a continuous function for which it holds $\phi(1) = 0$. Repeating the argument in the second part of the proof of Proposition 6, we obtain $\mathbf{p}_n \to \mathbf{p}$. Thus the limiting property holds.
We have already seen that the power directed divergence $D^{CR}(\mathbf{p}, \mathbf{q})$ is, under some conditions, nonnegative, additive, greater than $I^{CR}(\mathbf{p}^\circ, \mathbf{q}^\circ)$ and invariant under sufficient transformations. It also shares the property of maximal information and the basic limiting property. So, we can regard $D^{CR}(\mathbf{p}, \mathbf{q})$ as a measure of divergence, provided that $\sum_i p_i = \sum_i q_i$.
4. AN ACTUARIAL APPLICATION OF DIVERGENCES INVOLVING NONPROBABILITY VECTORS

In this section we shall describe how measures of divergence can be used in order to smooth raw mortality data, and we provide a numerical illustration.
In order to describe the actual but unknown mortality pattern of a population, the actuary calculates from raw data crude mortality rates, death probabilities or forces of mortality, which usually form an irregular series. Because of this, it is common to revise the initial estimates with the aim of producing smoother estimates, through a procedure called graduation. There are several methods of graduation, classified into parametric curve fitting and nonparametric smoothing methods. For more details on the topic, the interested reader is referred to [15–24] and references therein.
A method of graduation using information theoretic ideas was first introduced by Brockett and Zhang in [25]. More specifically, they tried in [26] to construct a smooth series of n death probabilities $\{v_{x_i}\}$ at age $x_i$, $i = 1, 2, \ldots, n$, which is as close as possible to the observed series $\{u_{x_i}\}$; in addition, they assumed that the true but unknown underlying mortality pattern is (i) smooth, (ii) increasing with age $x$, that is, monotone, and (iii) more steeply increasing at higher ages, that is, convex. They also assumed that (iv) the total number of deaths in the graduated data equals the total number of deaths in the observed data, and (v) the total of graduated ages at death equals the total of observed ages at death. By total age at death, we mean the sum over ages of the number of deaths at each age multiplied by the corresponding age. The last two constraints imply that the average age at death is required to be the same for the observed and the graduated mortality data. In the sequel, when we use $x = 1, 2, \ldots, n$, we shall mean the corresponding ages $x_1, x_2, \ldots, x_n$. Mathematically, the five constraints are written as follows: (i) $\sum_x (\Delta^3 v_x)^2 \leq M$, where $M$ is a predetermined positive constant and $\Delta^3 v_x = -v_x + 3v_{x+1} - 3v_{x+2} + v_{x+3}$; (ii) $\Delta v_x \geq 0$, where $\Delta v_x = v_{x+1} - v_x$; (iii) $\Delta^2 v_x \geq 0$, where $\Delta^2 v_x = v_x - 2v_{x+1} + v_{x+2}$; (iv) $\sum_x l_x v_x = \sum_x l_x u_x$, where $l_x$ is the number of people at risk at age $x$; and (v) $\sum_x x l_x v_x = \sum_x x l_x u_x$.
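The five constraints translate directly into code. A short Python sketch with an illustrative (hypothetical, not real mortality) series, where `l` plays the role of the exposures $l_x$:

```python
# Forward difference operators used in the graduation constraints
def d1(v, x): return v[x + 1] - v[x]                          # monotonicity
def d2(v, x): return v[x] - 2 * v[x + 1] + v[x + 2]           # convexity
def d3(v, x): return -v[x] + 3 * v[x + 1] - 3 * v[x + 2] + v[x + 3]  # smoothness

# Illustrative graduated and crude death probabilities and exposures
v = [0.010, 0.012, 0.015, 0.019, 0.024, 0.030]
u = [0.011, 0.011, 0.016, 0.018, 0.025, 0.029]
l = [1000, 980, 950, 900, 840, 760]

smooth = sum(d3(v, x) ** 2 for x in range(len(v) - 3))   # constraint (i): compare to M
mono = all(d1(v, x) >= 0 for x in range(len(v) - 1))     # constraint (ii)
convex = all(d2(v, x) >= 0 for x in range(len(v) - 2))   # constraint (iii)
deaths_v = sum(lx * vx for lx, vx in zip(l, v))          # constraint (iv), left side
deaths_u = sum(lx * ux for lx, ux in zip(l, u))          # constraint (iv), right side
print(smooth, mono, convex, deaths_v, deaths_u)
```

With these made-up numbers the candidate series is smooth, monotone and convex, but the totals of constraint (iv) do not match exactly; an optimizer would enforce the equality constraints rather than merely check them.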
To obtain the graduated values, Zhang and Brockett minimize in [26] the Kullback?Leibler
divergence
vx
D KL (v, u) = vx ln
ux
x
between the crude death probabilities u = (u 1 , . . . , u n )T and the new death probabilities
v = (v1 , . . . , vn )T , subject to the constraints (i)?(v) by considering a dual problem of minimization.
However, the mortality rates (death probabilities) u and v are not probability vectors, as we have
Σ_{x=1}^n u_x > 1 and Σ_{x=1}^n v_x > 1. Brockett in [10, p. 104] states that "D^KL(v, u) = Σ_{x=1}^n v_x ln(v_x/u_x) is
still a measure of fit even in the nonprobability situation because the mortality rates are nonnegative
and because of the assumed constraints". In view of the discussion and results in Section 2, the
appropriate constraint to use here is

(vi) Σ_{x=1}^n v_x = Σ_{x=1}^n u_x

and not conditions (iv) and (v) of [26]. It is easy to see via a counter-example that conditions (iv)
and (v) do not imply (vi). It may be necessary, however, to use them on actuarial grounds.
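A two-line numerical check (our code; the vectors are invented for illustration) shows why constraint (vi) matters: with unequal totals the Kullback–Leibler sum can go negative, while with equal totals the log-sum inequality forces it to be nonnegative, vanishing only at v = u.

```python
import numpy as np

def d_kl(v, u):
    # Kullback-Leibler directed divergence applied to positive,
    # not necessarily probability, vectors
    v, u = np.asarray(v, dtype=float), np.asarray(u, dtype=float)
    return float(np.sum(v * np.log(v / u)))

# Totals differ (0.8 versus 1.6): the "divergence" goes negative
neg = d_kl([0.5, 0.3], [1.0, 0.6])

# Totals match (both 0.8): the log-sum inequality gives a nonnegative value
pos = d_kl([0.5, 0.3], [0.6, 0.2])
zero = d_kl([0.5, 0.3], [0.5, 0.3])   # zero exactly when v = u
```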
A new and unifying way to obtain the graduated values v_x is to minimize the Cressie–Read
power divergence

    D^CR(v, u) = (1/(λ(λ+1))) Σ_x v_x [(v_x/u_x)^λ − 1]

between the death probabilities u and v for a given λ, subject to constraints (i)–(v) and/or (vi), that
is, v ≥ 0 and g_i(v) = ½ vᵀD_i v + b_iᵀv + c_i ≤ 0, i = 1, 2, ..., r+1, where for each i, D_i, b_i, and c_i are a
positive-semidefinite matrix, a vector, and a constant, respectively. It is easy to see that the constraints (i)–(v)
may be written in the form of g_i(v). For more details see [26]. We note that in this case we have
r = 2(n+1) constraints, where n is the number of ungraduated values. The minimization is done
for various values of the parameter λ, and in this way we can interpret the resulting series of
graduated values as the series that satisfies the constraints and is least distinguishable, in the sense
of the Cressie–Read directed divergence, from the series of the crude values {u_x}. It is obvious
that if we choose λ = 0, we perform graduation through the Kullback–Leibler directed divergence
that Zhang and Brockett [26] described.
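As an illustration of the primal problem only (the paper follows [26] in solving a dual formulation instead), the Cressie–Read objective and the linear constraints can be handed to a general-purpose solver. The rates and exposures below are invented toy numbers, not the paper's data sets; SLSQP is our choice of solver, not the authors'; and the smoothness constraint (i) is omitted here for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def cr_divergence(v, u, lam):
    # Cressie-Read power divergence; lam -> 0 recovers the Kullback-Leibler form
    if abs(lam) < 1e-10:
        return float(np.sum(v * np.log(v / u)))
    return float(np.sum(v * ((v / u) ** lam - 1.0)) / (lam * (lam + 1.0)))

u = np.array([0.020, 0.026, 0.030, 0.036, 0.039])   # toy crude rates
l = np.array([1000.0, 950.0, 900.0, 850.0, 800.0])  # toy exposures
x = np.arange(1, len(u) + 1)                        # ages coded 1..n

cons = [
    {"type": "ineq", "fun": lambda v: v[1:] - v[:-1]},                  # (ii)  monotone
    {"type": "ineq", "fun": lambda v: v[:-2] - 2.0*v[1:-1] + v[2:]},    # (iii) convex
    {"type": "eq",   "fun": lambda v: l @ (v - u)},                     # (iv)  total deaths
    {"type": "eq",   "fun": lambda v: (x * l) @ (v - u)},               # (v)   total ages at death
]

res = minimize(cr_divergence, x0=u, args=(u, 2.0 / 3.0),
               bounds=[(1e-6, 1.0)] * len(u), constraints=cons, method="SLSQP")
v_grad = res.x   # graduated series for lam = 2/3
```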
In view of the work of Csiszar [11] mentioned in Section 1, one could also consider the extended
Cressie and Read power divergence for the positive vectors u and v

    I_ext^CR(v, u) = (1/(λ(λ+1))) Σ_x { v_x [(v_x/u_x)^λ − 1] − v_x + u_x },  λ ∈ ℝ.

The values at λ = 0, −1 are defined by continuity. For λ → 0, the above-mentioned measure reduces
to the extended Kullback–Leibler directed divergence I_ext^KL(v, u). For λ → −1 it becomes I_ext^KL(u, v).
For these values of λ, it is known that the full nonnegativity property is satisfied. Setting g_x = v_x/u_x
we obtain

    I_ext^CR(v, u) = (1/(λ(λ+1))) Σ_x u_x (g_x^{λ+1} − 2g_x + 1).
DIVERGENCES AND THEIR APPLICATIONS
For λ = 1, I_ext^CR(v, u) = ½ Σ_x u_x (g_x − 1)² ≥ 0, with equality if and only if v = u. For other values of
λ, an investigation of the minimum of the function h(y) = y^{λ+1} − 2y + 1, y > 0, revealed that I_ext^CR
can be either strictly positive or strictly negative and thus it does not satisfy the full nonnegativity
property. Real minima of I_ext^CR occur only for λ > 0 and, for example, for λ = 2 we have that
I_ext^CR(v, u) = 0 if and only if v = u or v = ((√5 − 1)/2) u. Again, if v and u are vectors with positive
components of sum 1, then the above-mentioned divergence reduces to the standard Cressie and
Read divergence. If Σ_x v_x = Σ_x u_x, the extended measure is identical to
the Cressie and Read directed divergence of order λ between nonprobability vectors defined in
Definition 4. Csiszar's extended definition essentially incorporates constraint (vi) into the measure.
It is now obvious that in the information theoretic graduation problem, we must either incorporate
constraint (vi) or we use I_ext^CR(v, u) with λ = 0 or 1 or −1.
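These claims can be verified numerically. The sketch below (our code; the two-component vector u is chosen arbitrarily) checks the λ = 1 nonnegativity, the extra zero at v = ((√5 − 1)/2)u for λ = 2, and a strictly negative value for λ = 2, where h(0.8) = 0.8³ − 1.6 + 1 < 0:

```python
import numpy as np

def ext_cr(v, u, lam):
    # Extended Cressie-Read divergence: the Csiszar-type correction -v_x + u_x
    # inside the sum gives I_ext = sum_x u_x (g^(lam+1) - 2g + 1) / (lam(lam+1))
    return float(np.sum(v * ((v / u) ** lam - 1.0) - v + u) / (lam * (lam + 1.0)))

u = np.array([0.4, 0.6])
g = (np.sqrt(5.0) - 1.0) / 2.0       # second positive root of h(y) = y^3 - 2y + 1

lam1_pos = ext_cr(1.3 * u, u, 1.0)   # lam = 1: equals (1/2) sum u (g-1)^2 >= 0
lam2_zero = ext_cr(g * u, u, 2.0)    # lam = 2: vanishes although v != u
lam2_neg = ext_cr(0.8 * u, u, 2.0)   # lam = 2: h(0.8) < 0, so the measure is negative
```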
Finally, for reasons of completeness, we state the Whittaker–Henderson (WH) method of graduating
mortality rates, a method widely used in actuarial practice, which is a precursor of the spline
smoothing method (see [15, 18, 20, 24]):

    minimize F + hS,

where F = Σ_x w_x (u_x − v_x)² is a measure of fit, w_x = l_x/(u_x(1 − u_x)) are weights with l_x being
the number of people at risk at age x, S = Σ_{x=1}^{n−z} (Δ^z v_x)² is a measure of smoothness of the
graduated values, Δ is the difference operator, and z is generally taken as 2, 3, or 4. Here we shall take
z = 3. Other choices of S involve exponential-type smoothing with S = Σ_{x=1}^{n−3} (Δ³v_x − (c − 1)Δ²v_x)²,
where c is a constant appearing, for example, in Makeham's law ([20, p. 63]). The underlying
assumption is that the d_x follow a binomial distribution B(l_x, v_x) for each x. The parameter h is
usually taken equal to the average of the weights w_x. S and F are the two basic elements of actuarial
graduation. The smaller the value of S, the better for graduation, but S and F are in competition.
In the following section we provide a numerical illustration in order to see how the information
theoretic graduation methods perform, compare them with the WH method, and try to find the best
value of λ. In addition, the role of I_ext^CR(v, u) is examined.
4.2. Numerical investigation
For the illustration, we will use three different data sets of death probabilities. The first one comes
from [20, p. 162] and will be denoted by L85. The second comes from the Actuarial Society of
Hong Kong [27], refers to males insured for more than two years, and will be denoted by HK01M.
The third also comes from the same Society, refers to females insured for more than two years, and
will be denoted by HK01F. The data sets are of different sizes: the L85 data set consists of 20 death
probabilities for ages 75–94 (computed from a total of 79 880 observations); from HK01M we have
used 16 death probabilities for ages 70–85 (computed from a total of 13 678 observations); and from
HK01F we have taken 20 death probabilities for ages 70–89 (computed from a total of 18 341
observations).
We have performed several graduations for each data set, using different values of the parameter
λ and the constraints of smoothness, monotonicity, convexity, the two actuarial constraints, and
constraint (vi) of Section 4.1. Among them are the values λ = 1, 0, −1, −½, and −2, which give
the χ² statistic, Kullback–Leibler divergence, modified likelihood ratio statistic, Freeman–Tukey
statistic F², and Neyman-modified χ², respectively. We have also used the value λ = 2/3 that
Cressie and Read proposed in [14]. The value of M in the first constraint is different in each data
set, and was computed for each data set through graduations by the WH method with h the average
of weights wx . This was done in order to compare our results with those obtained through the WH
method. The values of h and M are as follows: h = 80 786.8 and M = 0.00004 for the L85 data
set, h = 29 354.6 and M = 0.00003 for the HK01M data set and h = 54 547 and M = 0.000015 for
the HK01F data set.
As expected, different choices of the parameter λ lead to different graduated values. In order
to compare the several graduations for each data set, we computed, after graduation, the values
of S and F. Furthermore, the performance of our proposed methods (models) was assessed with
the log-likelihood, deviance, and χ² goodness-of-fit statistics evaluated at the graduated values.
As we have assumed that d_x ~ B(l_x, v_x), the log-likelihood, excluding constants, is

    log L(v) = Σ_{x=1}^n [d_x log v_x + (l_x − d_x) log(1 − v_x)].

Knowing the log-likelihood function we can calculate the deviance,

    D(v) = 2 log L(u) − 2 log L(v).

Finally, we can measure the discrepancy between the observed and the expected deaths with the
corresponding χ² statistic,

    χ² = Σ_{x=1}^n (d_x − l_x v_x)² / (l_x v_x (1 − v_x)).

We want graduations with maximum log L(v) and minimum deviance and χ².
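The three criteria above are direct to compute. The sketch below (our code, with invented exposures and death counts) mirrors the formulas, taking u = d/l as the crude rates:

```python
import numpy as np

def log_lik(v, d, l):
    # Binomial log-likelihood with constants dropped
    return float(np.sum(d * np.log(v) + (l - d) * np.log(1.0 - v)))

def deviance(v, u, d, l):
    return 2.0 * (log_lik(u, d, l) - log_lik(v, d, l))

def chi_square(v, d, l):
    return float(np.sum((d - l * v) ** 2 / (l * v * (1.0 - v))))

l = np.array([1000.0, 900.0, 800.0])   # toy exposures
d = np.array([20.0, 27.0, 32.0])       # toy observed deaths
u = d / l                              # crude rates
v = np.array([0.021, 0.029, 0.041])    # some candidate graduated rates
```

Since u maximizes the binomial likelihood, deviance(v, u, d, l) ≥ 0 for any v, and both the deviance and χ² vanish at v = u.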
The ungraduated and graduated values for the HK01M data set with the CR method, the five
constraints, and λ = 0, 2/3, 1, 2, along with those obtained through the WH method, are given in
Table I. In Table II we give the corresponding smoothness and goodness-of-fit results along with
the minimum value of the Cressie and Read divergence. From Table II we see that almost all
the graduations with the Cressie–Read power divergence give the same value for the smoothness
measure S, which is the value of M in the smoothness constraint (i), while the values of the
goodness-of-fit measures F, deviance, and χ² decrease as λ increases. We observe that in all
the cases we obtain negative values of I^CR(v, u), as Σ_x v_x < Σ_x u_x. Note that S cannot exceed its
corresponding value of the WH method due to constraint (i). The best choice for λ is 2. Similar
results were also obtained for the other two data sets, with the difference that for the L85 data set
the best choice of λ is 2/3 or 1. Comparing the results with those obtained by the WH method, we
have equivalent results as far as smoothness is concerned, while we do not have good fidelity F.
In terms of χ² and deviance, the winner is WH.
A further numerical investigation threw further light on the role or choice of λ. In Figure 1,
we have plotted S versus λ for the three data sets. The dotted line in each plot denotes the value
of M in the smoothness constraint (i). We can see that the three plots present the same pattern.
When −∞ < λ < −1, S takes a value near the value of M. Then, when −1 < λ < −0.5, S takes a
very small value, almost equal to zero, and for the remaining values of λ it again takes a value
near the value of M. So, for values of λ between −1 and −0.5, the method oversmooths the data.
In Figure 2, we present the analogous plots concerning the measure of fit F. We can also see the
same pattern for the three data sets. For values of λ smaller than −1, the measure of fit increases,
up to its maximum value. This means that graduation is not acceptable, as the graduated values depart
Table I. Graduations with: (a) the CR divergence and five constraints and (b) the WH method (w/o
constraints) for the HK01M data set.

x    u_x      v_x(λ=0)  v_x(λ=2/3)  v_x(λ=1)  v_x(λ=2)  v_x(WH)
70   0.01923  0.01567   0.01583     0.01596   0.01618   0.01928
71   0.02563  0.02239   0.02248     0.02255   0.02268   0.02546
72   0.02992  0.02911   0.02913     0.02915   0.02917   0.03045
73   0.03585  0.03584   0.03578     0.03574   0.03567   0.03451
74   0.03899  0.04256   0.04243     0.04234   0.04217   0.03785
75   0.03523  0.04928   0.04908     0.04894   0.04867   0.04160
76   0.05543  0.05601   0.05573     0.05553   0.05516   0.04671
77   0.04939  0.06273   0.06239     0.06213   0.06166   0.05164
78   0.05906  0.06945   0.06904     0.06872   0.06816   0.05621
79   0.07503  0.07617   0.07569     0.07532   0.07466   0.06110
80   0.04848  0.08290   0.08234     0.08192   0.08115   0.06830
81   0.11692  0.08962   0.08899     0.08851   0.08765   0.08085
82   0.06816  0.09634   0.09564     0.09511   0.09483   0.10057
83   0.23598  0.10307   0.10365     0.10454   0.10674   0.12914
84   0.11659  0.10979   0.11477     0.11869   0.12536   0.16700
85   0.29152  0.12154   0.13346     0.14123   0.15274   0.21458
Table II. Smoothness and goodness-of-fit measures for graduations with: (a) the CR divergence and five
constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                 λ=0       λ=2/3     λ=1       λ=2       WH
S                0.000025  0.000025  0.000025  0.000025  0.000025
F                41.403    38.690    36.919    34.213    19.964
Deviance         36.37     34.33     32.95     30.81     18.29
log-likelihood   −2090.91  −2089.89  −2089.20  −2088.13  −2081.87
χ²               41.29     38.54     36.75     34.01     19.73
I^CR(v, u)       −0.103    −0.023    −0.006    −0.019    −0.067

Note: for the WH column, I^CR(v, u) is evaluated at λ = 0.
too far from the crude values. When λ takes a value almost equal to −1, F decreases, and it is
stabilized for the remaining values of λ.
Comparing graduations of the three data sets with the power divergence statistic for λ ≠ 0 versus
the graduation with the Kullback–Leibler divergence (λ → 0), we see that in terms of fidelity
F we obtain almost the same results for values of λ > −1. However, as far as smoothness S is
concerned, graduation of the L85 data set via the Kullback–Leibler divergence gives a very small
value for S, which means that the method oversmooths the data. The same result is also
obtained by minimizing the power divergence statistic with λ ∈ (−1, 0). For the HK01M data set,
the minimization of the Kullback–Leibler divergence gives the same results as the minimization of
the power divergence statistic with λ < −1 and λ > 0. Finally, for the HK01F data set, graduation
through the Kullback–Leibler divergence oversmooths the data, something that also happens using
the power divergence statistic with −1 < λ < 0.5. Our final conclusion is that the choice of λ = 2/3
suggested in [14] on grounds of statistical power is also a good choice for graduation.
Figure 1. Smoothness S versus λ (five constraints) (S ×10⁻⁴): (a) L85; (b) HK01M; and (c) HK01F.
Next we graduated the same data sets using, apart from the constraints (i)–(v) of Section 4.1,
constraint (vi) as well, which, as we saw, is the minimal requirement for a measure with
nonprobability vectors to be a measure of divergence. The results for λ = 0, 2/3, 1, 2 and the WH
method for the HK01M data set are given in Table III. From Table IV we see that almost all
graduations through the Cressie–Read power divergence give the same value for the smoothness
measure S, which is the value of M in the smoothness constraint (i), while the value of F increases
as λ increases. We also see a dramatic improvement in deviance and χ² in comparison with Table
II. Here we observe that the values of the I^CR(v, u) measure are positive, and this is because of
the use of constraint (vi). As far as the best choice of λ is concerned, we have the same results as
before. The results are now almost equivalent to those of the WH method as far as both smoothness
and goodness of fit are concerned.
As far as smoothness is concerned, each choice of λ gives almost the same value for the measure
S. More specifically, we have S ≈ 4×10⁻⁵ for the L85 data set, S ≈ 3×10⁻⁵ for the HK01M data
set, and S ≈ 15×10⁻⁶ for the HK01F data set. Comparing the results between graduations with
five and six constraints, the use of the additional constraint (vi) improves the results, as now we do
not have the oversmoothing effect.
In Figure 3, we present the plots concerning the measure of fit F. For the L85 data set, we see
that F ≈ 60 for λ < 0, while for positive values of λ, F increases. For the HK01M
data set, we have F ≈ 17 for all values of λ ≠ 2, while for λ = 2, F ≈ 19.5. Finally, for the
HK01F data set, we see that there are no major differences, as F ∈ (9.9, 12). Comparing again the
Figure 2. Goodness-of-fit F versus λ (five constraints): (a) L85; (b) HK01M; and (c) HK01F.
results between the graduations with five and six constraints, the additional constraint improves
the results, as the graduated values present a much better fidelity.
The graduation of the data sets using constraints (i), (ii), (iii), and (vi) gave results similar to
those of Table IV, with a slight increase in the value of the goodness-of-fit measures. Finally, the
results of graduation through the extended Cressie and Read power divergence I_ext^CR(v, u) are given
in Table V. The minimization was done subject to constraints (i)–(v). The values of S are equal
or almost equal, for all chosen values of λ, to those in Tables II and IV. Comparing the goodness of fit
with that given by the minimization of the Cressie and Read power divergence subject to the five
constraints (Table II), we see a dramatically better fidelity. However, the fidelity provided by the
Cressie and Read power divergence subject to the six constraints that we propose (Table IV) is
better for every chosen value of the parameter λ. The extended Cressie and Read power divergence
performs in a manner similar to the Cressie and Read power divergence with six constraints.
We then conducted a predictive analysis of our method. Assuming that the underlying pattern
of mortality follows Makeham's model v_x = a + bc^x, where a, b, c are parameters and x is
the age, we used a time-based training-test split. We split the data sets into two equal intervals,
graduated the first interval with the CR methods, fitted Makeham's model, and then compared
the predicted values obtained by this model with the ungraduated values of the second interval.
The resulting MSEs for the second interval appear in Table VI. Results for the
Cressie and Read method with six constraints are comparable with the other cases. Compared with
the MSE obtained through the WH method, which is equal to 0.00882, we have better performance
Table III. Graduations with: (a) the CR divergence and six constraints and (b) the WH method (w/o
constraints) for the HK01M data set.

x    u_x      v_x(λ=0)  v_x(λ=2/3)  v_x(λ=1)  v_x(λ=2)  v_x(WH)
70   0.01923  0.02025   0.02023     0.02022   0.02267   0.01928
71   0.02563  0.02484   0.02483     0.02483   0.02513   0.02546
72   0.02992  0.02944   0.02944     0.02944   0.02785   0.03045
73   0.03585  0.03404   0.03405     0.03406   0.03126   0.03451
74   0.03899  0.03863   0.03866     0.03867   0.03581   0.03785
75   0.03523  0.04324   0.04329     0.04331   0.04150   0.04160
76   0.05543  0.04797   0.04799     0.04800   0.04800   0.04671
77   0.04939  0.05270   0.05269     0.05269   0.05451   0.05164
78   0.05906  0.05761   0.05760     0.05759   0.06101   0.05621
79   0.07503  0.06377   0.06378     0.06378   0.06815   0.06110
80   0.04848  0.07338   0.07341     0.07342   0.07777   0.06830
81   0.11692  0.08960   0.08949     0.08947   0.09297   0.08085
82   0.06816  0.11474   0.11452     0.11447   0.11625   0.10057
83   0.23598  0.15117   0.15093     0.15087   0.15019   0.12914
84   0.11659  0.19989   0.19993     0.19995   0.19591   0.16700
85   0.29152  0.26231   0.26272     0.26281   0.25458   0.21458
Table IV. Smoothness and goodness-of-fit measures for graduations with: (a) the CR divergence and six
constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                 λ=0       λ=2/3     λ=1       λ=2       WH
S                0.000029  0.000029  0.00003   0.00003   0.000025
F                17.00     17.03     17.04     19.55     19.964
Deviance         17.11     17.13     17.13     19.83     18.29
Log-likelihood   −2081.28  −2081.29  −2081.30  −2082.64  −2081.87
χ²               16.77     16.80     16.80     19.33     19.73
I^CR(v, u)       0.070     0.073     0.075     0.086     −0.067

Note: for the WH column, I^CR(v, u) is evaluated at λ = 0.
of the Cressie and Read methods. The best predictive performance is obtained by minimizing
the extended Cressie and Read power divergence with five constraints, and in particular for λ = 0,
that is, the extended Kullback–Leibler directed divergence. This is one of the three values of λ
for which the extended measure satisfies the minimal requirement for a measure with nonprobability
vectors to be a measure of divergence. Similar are the results of the predictive analysis for the L85
and the HK01F data sets. However, we have to note that the predictive analysis results depend
on the selected parametric model, in our case Makeham's model. A better approach would be the
minimization of the Cressie–Read divergence D^CR(f(x), u) over functions f, where v has been
replaced by an unknown function f(x) = (f(x_1), f(x_2), ..., f(x_n))ᵀ of the ages x, subject to integral
constraints analogous to the previous constraints (i)–(vi). This is a calculus of variations problem
and it is believed that its solution would be a spline function. We intend to explore this in our future
research.
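The prediction step above can be sketched as follows (our code; the "graduated" rates are generated from a known Makeham law rather than taken from the paper's data sets, and scipy's curve_fit stands in for whatever fitting routine one prefers):

```python
import numpy as np
from scipy.optimize import curve_fit

def makeham(x, a, b, c):
    # Makeham's law: v_x = a + b * c**x
    return a + b * c ** x

x_train = np.arange(70, 78)            # first half: used for fitting
x_test = np.arange(78, 86)             # second half: held out
true_rate = lambda x: 0.001 + 0.0001 * 1.1 ** x
v_train = true_rate(x_train)           # stand-in for graduated rates

# Fit Makeham's model to the "graduated" first interval, then predict the second
params, _ = curve_fit(makeham, x_train, v_train, p0=[0.001, 0.0001, 1.1])
pred = makeham(x_test, *params)
mse = float(np.mean((pred - true_rate(x_test)) ** 2))
```

With real data, v_train would be the output of a CR or WH graduation and the MSE would be computed against the ungraduated rates of the test interval, as in Table VI.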
Figure 3. Goodness-of-fit F versus λ (six constraints): (a) L85; (b) HK01M; and (c) HK01F.
Table V. Smoothness and goodness-of-fit measures for graduations with: (a) the ext. CR divergence and
five constraints and (b) the WH method (w/o constraints) for the HK01M data set.

                 λ=0       λ=2/3     λ=1       λ=2       WH
S                0.00003   0.00003   0.00003   0.00003   0.000025
F                17.50     19.80     19.12     25.94     19.96
Deviance         17.46     20.85     18.68     24.27     18.29
Log-likelihood   −2081.46  −2083.16  −2082.07  −2084.86  −2081.87
χ²               17.50     19.80     19.12     25.94     19.73
I_ext^CR(v, u)   0.065     0.066     0.067     0.047     0.069

Note: for the WH column, I_ext^CR(v, u) is evaluated at λ = 0.
After a theoretical evaluation of the Kullback–Leibler divergence D^KL(p, q) involving
nonprobability vectors, we have concluded that this measure shares some of the properties of the
Kullback–Leibler directed divergence with probability vectors. Under some conditions, D^KL(p, q)
is nonnegative, additive, and invariant under sufficient transformations. The property of maximal
information and the limiting property are satisfied as well. Thus, we may regard D^KL(p, q) as a
measure of information. A minimal requirement for D^KL(p, q) to be a measure of divergence is
Table VI. MSEs for graduations with: (a) the CR divergence (five constraints); (b) the CR divergence (six
constraints); and (c) the ext. CR divergence (five constraints) for the HK01M data set.

                                 λ=0      λ=2/3    λ=1      λ=2
MSE (CR, five constraints)       0.00854  0.00854  0.00854  0.00854
MSE (CR, six constraints)        0.00868  0.00868  0.00868  0.00868
MSE (ext. CR, five constraints)  0.00553  0.00831  0.00854  0.00854

Note: MSE(WH) = 0.00882.
Σ_i p_i = Σ_i q_i. Similar results were obtained for the Cressie–Read power divergence D^CR(p, q)
with nonprobability vectors.
As an application of the previous results, we explored the use of the general Cressie–Read
power divergences in order to obtain graduated values. A numerical illustration, minimizing the
power divergence for various values of λ, with constraints (i)–(v) and/or (vi), gave results equivalent,
in terms of smoothness, to those of other methods of graduation such as the widely used
WH method. The use of constraint (vi), Σ_x u_x = Σ_x v_x, is considered to be necessary, as this is the
minimal requirement for a measure with nonprobability vectors to be a measure of divergence.
The numerical results supported the use of this additional condition and showed considerable
improvement in goodness of fit.
However, we cannot say straightforwardly which value of the parameter λ is best for
graduation. For graduations with constraints (i)–(v) we have the following: values of λ < −1 give
unacceptable results as far as goodness of fit is concerned and as such should be avoided. As
far as smoothness S is concerned, values of λ ∈ (−1, −0.5) oversmooth the data. For λ ≥ 0 the various
divergences give similar results in terms of smoothness and fit; thus, the value λ = 2/3 suggested in
[14] from statistical considerations is a good choice. This topic is under further investigation by
the authors. For graduations with the additional constraint (vi), the results improve considerably.
The value of λ seems not to affect the smoothness S, and it has only a slight effect on the
goodness of fit.
In the light of the work of Csiszar [11], who extended the Kullback–Leibler directed divergence
to nonnegative functions and vectors by adding the difference Σ_i q_i − Σ_i p_i to its expression, we
applied the same adjustment to the Cressie–Read power divergence. The extended measure satisfies
the minimal requirement only for λ = 0, 1, and −1. The minimization of the extended Cressie–Read
power divergence for various values of λ, subject to constraints (i)–(v), gave results equivalent,
in terms of smoothness, to those of the other methods used in the numerical investigation. As
far as goodness of fit is concerned, there is a clear improvement compared with the results of
minimizing the Cressie–Read power divergence with the five constraints. However, they are almost
equivalent to those of the minimization of the Cressie–Read power divergence with the six
constraints. Therefore, we can conclude that the additional sixth constraint that we propose is
necessary not only from the theoretical but also from the practical point of view.
The similarity of results between the WH method and the power divergence minimization under
the said constraints allows us to claim that the two graduation methods are nearly equivalent. This
is supported not only by the numerical investigation but also by the fact that in the WH method
we minimize a form of the Lagrangian function F + hS, while in power divergence graduation we
minimize F subject to, among others, a constraint on S, which in turn leads to a similar Lagrangian.
A predictive analysis using Makeham's mortality model and the MSE criterion showed that
the CR methods with six constraints, the "extended" CR method, and the WH method are comparable. It
appears not to be realistic to seek a method that is best in terms of all comparison criteria, some
of which are competing.
ACKNOWLEDGEMENTS
The authors would like to thank the Editor, the Associate Editor, and the Referee for their valuable
comments and suggestions and in particular for bringing to our attention the paper of Csiszar [11] that
strengthened our investigation.
REFERENCES
1. Basu A, Harris IR, Hjort NL, Jones MC. Robust and efficient estimation by minimising a density power
divergence. Biometrika 1998; 85(3):549–559.
2. Pardo L. Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC: London, 2006.
3. Read TRC, Cressie NAC. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer: New York, 1988.
4. Mathai AM, Rathie PN. Basic Concepts in Information Theory and Statistics. Wiley: New York, 1975.
5. Liese F, Vajda I. Convex Statistical Distances. B. G. Teubner: Leipzig, 1987.
6. Papaioannou T. Measures of information. In Encyclopedia of Statistical Sciences, vol. 5, Kotz S, Johnson NL
(eds). Wiley: New York, 1985; 391–497.
7. Papaioannou T. On distances and measures of information: a case of diversity. In Probability and Statistical
Models with Applications, Charalambides CA, Koutras MV, Balakrishnan N (eds). Chapman & Hall/CRC:
London, 2001; 503–515.
8. Mattheou K. On new developments in statistical inference for measures of divergence. Ph.D. Thesis, University
of Cyprus, Nicosia, Cyprus, 2007.
9. Papaioannou T, Ferentinos K. On two forms of Fisher's information number. Communications in Statistics—Theory
and Methods 2005; 34:1461–1470.
10. Brockett PL. Information theoretic approach to actuarial science: a unification and extension of relevant theory
and applications. Transactions of the Society of Actuaries 1991; 43:73–114.
11. Csiszar I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse
problems. Annals of Statistics 1991; 19(4):2032–2066.
12. Kullback S. Information Theory and Statistics. Wiley: New York, 1959.
13. Zografos K, Ferentinos K, Papaioannou T. Limiting properties of some measures of information. Annals of the
Institute of Statistical Mathematics B 1989; 41(3):451–460.
14. Cressie NAC, Read TRC. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society B 1984;
46(3):440–464.
15. Benjamin P, Pollard JH. The Analysis of Mortality and Other Actuarial Statistics. Heinemann: London, 1980.
16. Debon A, Montes F, Sala R. A comparison of parametric models for mortality graduation. Application to mortality
data for the Valencia region (Spain). SORT 2005; 29(2):269–288.
17. Debon A, Montes F, Sala R. A comparison of nonparametric methods in the graduation of mortality: application
to data from the Valencia region (Spain). International Statistical Review 2006; 74(2):215–233.
18. Haberman S. Actuarial methods. In Encyclopedia of Biostatistics, vol. 1, Armitage P, Colton Th (eds). Wiley:
New York, 1998; 37–49.
19. Haberman S, Renshaw AE. A simple graphical method for the comparison of two mortality experiences. Applied
Stochastic Models in Business and Industry 1999; 15:333–352.
20. London D. Graduation: The Revision of Estimates. ACTEX Publications: Winsted, Connecticut, 1985.
21. Miller MD. Elements of Graduation. Actuarial Society of America: New York, 1949.
22. Nielsen JP. Smoothing and prediction with a view to actuarial science, biostatistics and finance. Scandinavian
Actuarial Journal 2003; 1:51–74.
23. Neves CdR, Migon HS. Bayesian graduation of mortality rates: an application to mathematical reserve evaluation.
Insurance: Mathematics and Economics 2007; 40:424–434.
24. Wang JL. Smoothing hazard rates. In Encyclopedia of Biostatistics, vol. 5 (2nd edn), Armitage P, Colton Th
(eds). Wiley: New York, 2005; 4986–4997.
25. Brockett PL, Zhang J. Information theoretical mortality graduation. Scandinavian Actuarial Journal 1986;
131–140.
26. Zhang J, Brockett PL. Quadratically constrained information theoretic analysis. SIAM Journal on Applied
Mathematics 1987; 47(4):871–885.
27. The Actuarial Society of Hong Kong. Report on Hong Kong Assured Lives Mortality 2001, 2001. Available
```