Preliminaries
Consider the function $f(x) = \cos(x)$, its derivative $f'(x) = -\sin(x)$, and its
antiderivative $F(x) = \sin(x) + C$. These formulas were studied in calculus. The former
is used to determine the slope $m = f'(x_0)$ of the curve $y = f(x)$ at a point $(x_0, f(x_0))$,
and the latter is used to compute the area under the curve for $a \le x \le b$.
The slope at the point $(\pi/2, 0)$ is $m = f'(\pi/2) = -1$ and can be used to find the
tangent line at this point (see Figure 1.1(a)):
$$y_{\tan} = m\left(x - \frac{\pi}{2}\right) + 0 = f'\left(\frac{\pi}{2}\right)\left(x - \frac{\pi}{2}\right) = -x + \frac{\pi}{2}.$$
Figure 1.1 (a) The tangent line to the curve y = cos(x) at the point (π/2, 0).
Figure 1.1 (b) The area under the curve y = cos(x) over the interval [0, π/2].
The area under the curve for 0 ≤ x ≤ π/2 is computed using an integral (see Figure
1.1(b)):
$$\text{area} = \int_0^{\pi/2} \cos(x)\,dx = F\left(\frac{\pi}{2}\right) - F(0) = \sin\left(\frac{\pi}{2}\right) - 0 = 1.$$
These are some of the results that we will need to use from calculus.
1.1 Review of Calculus
It is assumed that the reader is familiar with the notation and subject matter covered in
the undergraduate calculus sequence. This should have included the topics of limits,
continuity, differentiation, integration, sequences, and series. Throughout the book we
refer to the following results.
Limits and Continuity
Definition 1.1. Assume that f(x) is defined on an open interval containing x = x0,
except possibly at x = x0 itself. Then f is said to have the limit L at x = x0, and we
write
(1) $\lim_{x \to x_0} f(x) = L,$
if given any $\varepsilon > 0$ there exists a $\delta > 0$ such that $|f(x) - L| < \varepsilon$ whenever
$0 < |x - x_0| < \delta$. When the h-increment notation $x = x_0 + h$ is used, equation (1)
becomes
(2) $\lim_{h \to 0} f(x_0 + h) = L.$
Definition 1.2. Assume that f(x) is defined on an open interval containing x = x0.
Then f is said to be continuous at x = x0 if
(3) $\lim_{x \to x_0} f(x) = f(x_0).$
The function f is said to be continuous on a set S if it is continuous at each point
x ∈ S. The notation $C^n(S)$ stands for the set of all functions f such that f and its
first n derivatives are continuous on S. When S is an interval, say [a, b], then the
notation $C^n[a, b]$ is used. As an example, consider the function $f(x) = x^{4/3}$ on the
interval [−1, 1]. Clearly, f(x) and $f'(x) = (4/3)x^{1/3}$ are continuous on [−1, 1],
while $f''(x) = (4/9)x^{-2/3}$ is not continuous at x = 0.
Definition 1.3. Suppose that $\{x_n\}_{n=1}^{\infty}$ is an infinite sequence. Then the sequence is
said to have the limit L, and we write
(4) $\lim_{n \to \infty} x_n = L,$
if given any $\varepsilon > 0$, there exists a positive integer $N = N(\varepsilon)$ such that n > N implies
that $|x_n - L| < \varepsilon$.
When a sequence has a limit, we say that it is a convergent sequence. Another
commonly used notation is “$x_n \to L$ as $n \to \infty$.” Equation (4) is equivalent to
(5) $\lim_{n \to \infty} (x_n - L) = 0.$
Thus we can view the sequence $\{\varepsilon_n\}_{n=1}^{\infty} = \{x_n - L\}_{n=1}^{\infty}$ as an error sequence. The
following theorem relates the concepts of continuity and convergent sequence.
Theorem 1.1. Assume that f(x) is defined on the set S and x0 ∈ S. The following
statements are equivalent:
(a) The function f is continuous at x0.
(b) If $\lim_{n \to \infty} x_n = x_0$, then
(6) $\lim_{n \to \infty} f(x_n) = f(x_0).$
Theorem 1.2 (Intermediate Value Theorem). Assume that f ∈ C[a, b] and L is
any number between f (a) and f (b). Then there exists a number c, with c ∈ (a, b),
such that f (c) = L.
Example 1.1. The function f(x) = cos(x − 1) is continuous over [0, 1], and the constant
L = 0.8 lies between f(0) = cos(−1) ≈ 0.540302 and f(1) = cos(0) = 1. The solution to
f(x) = 0.8 over [0, 1] is c1 = 0.356499. Similarly, f(x) is continuous over [1, 2.5], and
L = 0.8 lies between f(2.5) = cos(1.5) ≈ 0.070737 and f(1) = 1. The solution
to f(x) = 0.8 over [1, 2.5] is c2 = 1.643501. These two cases are shown in Figure 1.2.
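The intermediate value theorem guarantees that c1 and c2 exist but does not say how to find them. One common way to locate such points numerically is bisection, which is not the method used in the text; the following Python sketch (the helper name `bisect` and the tolerance are assumptions of this illustration) applies it to g(x) = f(x) − L on each interval.

```python
import math

def bisect(g, lo, hi, tol=1e-10):
    """Halve [lo, hi] repeatedly; g(lo) and g(hi) must have opposite signs."""
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("g(lo) and g(hi) must have opposite signs")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid                  # sign change is in [lo, mid]
        else:
            lo, g_lo = mid, g_mid     # sign change is in [mid, hi]
    return 0.5 * (lo + hi)

def f(x):
    return math.cos(x - 1.0)

L = 0.8
c1 = bisect(lambda x: f(x) - L, 0.0, 1.0)
c2 = bisect(lambda x: f(x) - L, 1.0, 2.5)
print(round(c1, 6), round(c2, 6))
```

Bisection relies on exactly the hypotheses of Theorem 1.2: continuity on the interval and a target value lying between the endpoint values.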
Figure 1.2 The intermediate value theorem applied to the function f(x) = cos(x − 1)
over [0, 1] and over the interval [1, 2.5].
Figure 1.3 The extreme value theorem applied to the function
f(x) = 35 + 59.5x − 66.5x^2 + 15x^3 over the interval [0, 3].
Theorem 1.3 (Extreme Value Theorem for a Continuous Function). Assume that
f ∈ C[a, b]. Then there exists a lower bound M1, an upper bound M2, and two
numbers x1, x2 ∈ [a, b] such that
(7) $M_1 = f(x_1) \le f(x) \le f(x_2) = M_2$ whenever $x \in [a, b]$.
We sometimes express this by writing
(8) $M_1 = f(x_1) = \min_{a \le x \le b}\{f(x)\}$ and $M_2 = f(x_2) = \max_{a \le x \le b}\{f(x)\}.$
Differentiable Functions
Definition 1.4. Assume that f(x) is defined on an open interval containing x0. Then
f is said to be differentiable at x0 if
(9) $\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}$
exists. When this limit exists, it is denoted by $f'(x_0)$ and is called the derivative of f
at x0. An equivalent way to express this limit is to use the h-increment notation:
(10) $\lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} = f'(x_0).$
A function that has a derivative at each point in a set S is said to be differentiable
on S. Note that the number $m = f'(x_0)$ is the slope of the tangent line to the graph of
the function y = f(x) at the point (x0, f(x0)).
Theorem 1.4. If f (x) is differentiable at x = x0, then f (x) is continuous at x = x0.
It follows from Theorem 1.3 that if a function f is differentiable on a closed interval
[a, b], then its extreme values occur at the endpoints of the interval or at the critical
points (solutions of $f'(x) = 0$) in the open interval (a, b).
Example 1.2. The function $f(x) = 15x^3 - 66.5x^2 + 59.5x + 35$ is differentiable on [0, 3].
The solutions to $f'(x) = 45x^2 - 133x + 59.5 = 0$ are x1 = 0.54955 and x2 = 2.40601.
The maximum and minimum values of f on [0, 3] are:
min{f(0), f(3), f(x1), f(x2)} = min{35, 20, 50.10438, 2.11850} = 2.11850
and
max{f(0), f(3), f(x1), f(x2)} = max{35, 20, 50.10438, 2.11850} = 50.10438
(see Figure 1.3).
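The endpoint-and-critical-point procedure of Example 1.2 can be checked with a few lines of Python. This is a minimal sketch: the critical points come from the quadratic formula applied to the derivative of the cubic, and the variable names are chosen here for illustration.

```python
import math

def f(x):
    return 15 * x**3 - 66.5 * x**2 + 59.5 * x + 35

# Derivative of f is 45x^2 - 133x + 59.5; solve it with the quadratic formula.
qa, qb, qc = 45.0, -133.0, 59.5
root_disc = math.sqrt(qb * qb - 4 * qa * qc)
x1 = (-qb - root_disc) / (2 * qa)
x2 = (-qb + root_disc) / (2 * qa)

# Extreme values on [0, 3] occur at an endpoint or at a critical point.
candidates = [0.0, 3.0, x1, x2]
values = [f(x) for x in candidates]
f_min, f_max = min(values), max(values)
print(x1, x2, f_min, f_max)
```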
Theorem 1.5 (Rolle’s Theorem). Assume that f ∈ C[a, b] and that $f'(x)$ exists for
all x ∈ (a, b). If f(a) = f(b) = 0, then there exists a number c, with c ∈ (a, b), such
that $f'(c) = 0$.
Theorem 1.6 (Mean Value Theorem). Assume that f ∈ C[a, b] and that $f'(x)$
exists for all x ∈ (a, b). Then there exists a number c, with c ∈ (a, b), such that
(11) $f'(c) = \frac{f(b) - f(a)}{b - a}.$
Geometrically, the mean value theorem says that there is at least one number c ∈
(a, b) such that the slope of the tangent line to the graph of y = f (x) at the point
(c, f (c)) equals the slope of the secant line through the points (a, f (a)) and (b, f (b)).
Example 1.3. The function f(x) = sin(x) is continuous on the closed interval [0.1, 2.1]
and differentiable on the open interval (0.1, 2.1). Thus, by the mean value theorem, there
is a number c such that
$$f'(c) = \frac{f(2.1) - f(0.1)}{2.1 - 0.1} = \frac{0.863209 - 0.099833}{2.1 - 0.1} = 0.381688.$$
The solution to $f'(c) = \cos(c) = 0.381688$ in the interval (0.1, 2.1) is c = 1.179174.
The graphs of f(x), the secant line y = 0.381688(x − 0.1) + 0.099833 = 0.381688x + 0.061665,
and the tangent line y = 0.381688x + 0.474215 are shown in Figure 1.4.
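The numbers in Example 1.3 can be reproduced directly. The sketch below assumes only the standard `math` module; because cos is one-to-one on (0, π), `math.acos` returns the unique c in (0.1, 2.1) whose tangent slope equals the secant slope.

```python
import math

a, b = 0.1, 2.1
slope = (math.sin(b) - math.sin(a)) / (b - a)   # slope of the secant line
c = math.acos(slope)                            # solves cos(c) = slope
tangent_intercept = math.sin(c) - slope * c     # tangent: y = slope*x + intercept
print(slope, c, tangent_intercept)
```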
Figure 1.4 The mean value theorem applied to f(x) = sin(x) over the interval [0.1, 2.1].
Theorem 1.7 (Generalized Rolle’s Theorem). Assume that f ∈ C[a, b] and that
$f'(x), f''(x), \ldots, f^{(n)}(x)$ exist over (a, b) and $x_0, x_1, \ldots, x_n \in [a, b]$. If $f(x_j) = 0$
for j = 0, 1, . . . , n, then there exists a number c, with c ∈ (a, b), such that $f^{(n)}(c) = 0$.
Integrals
Theorem 1.8 (First Fundamental Theorem). If f is continuous over [a, b] and F
is any antiderivative of f on [a, b], then
(12) $\int_a^b f(x)\,dx = F(b) - F(a)$ where $F'(x) = f(x)$.
Theorem 1.9 (Second Fundamental Theorem). If f is continuous over [a, b] and
x ∈ (a, b), then
(13) $\frac{d}{dx}\int_a^x f(t)\,dt = f(x).$
Example 1.4. The function f(x) = cos(x) satisfies the hypotheses of Theorem 1.9 over
the interval [0, π/2]; thus by the chain rule
$$\frac{d}{dx}\int_0^{x^2} \cos(t)\,dt = \cos(x^2)\,(x^2)' = 2x\cos(x^2).$$
Theorem 1.10 (Mean Value Theorem for Integrals). Assume that f ∈ C[a, b].
Then there exists a number c, with c ∈ (a, b), such that
$$\frac{1}{b - a}\int_a^b f(x)\,dx = f(c).$$
The value f(c) is the average value of f over the interval [a, b].
Figure 1.5 The mean value theorem for integrals applied to
f(x) = sin(x) + (1/3) sin(3x) over the interval [0, 2.5].
Example 1.5. The function $f(x) = \sin(x) + \frac{1}{3}\sin(3x)$ satisfies the hypotheses of Theorem
1.10 over the interval [0, 2.5]. An antiderivative of f(x) is $F(x) = -\cos(x) - \frac{1}{9}\cos(3x)$.
The average value of the function f(x) over the interval [0, 2.5] is
$$\frac{1}{2.5 - 0}\int_0^{2.5} f(x)\,dx = \frac{F(2.5) - F(0)}{2.5} = \frac{0.762629 - (-1.111111)}{2.5} = \frac{1.873740}{2.5} = 0.749496.$$
There are three solutions to the equation f(c) = 0.749496 over the interval [0, 2.5]:
c1 = 0.440566, c2 = 1.268010, and c3 = 1.873583. The area of the rectangle with
base b − a = 2.5 and height $f(c_j) = 0.749496$ is $f(c_j)(b - a) = 1.873740$. The area
of the rectangle has the same numerical value as the integral of f(x) taken over the interval
[0, 2.5]. A comparison of the area under the curve y = f(x) and that of the rectangle
can be seen in Figure 1.5.
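A short Python sketch can confirm the average value and the three points c_j of Example 1.5. The bisection helper and the three bracketing intervals are assumptions of this illustration (the brackets were picked so that f(x) minus the average changes sign exactly once in each); the text itself does not say how the c_j were computed.

```python
import math

def f(x):
    return math.sin(x) + math.sin(3 * x) / 3

def F(x):
    # An antiderivative of f
    return -math.cos(x) - math.cos(3 * x) / 9

a, b = 0.0, 2.5
average = (F(b) - F(a)) / (b - a)

def bisect(g, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes g changes sign exactly once there."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)

# Hypothetical brackets, each containing one solution of f(c) = average
brackets = [(0.0, 0.6), (1.0, 1.5), (1.5, 2.5)]
roots = [bisect(lambda x: f(x) - average, lo, hi) for lo, hi in brackets]
print(average, roots)
```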
Theorem 1.11 (Weighted Integral Mean Value Theorem). Assume that f, g ∈
C[a, b] and g(x) ≥ 0 for x ∈ [a, b]. Then there exists a number c, with c ∈ (a, b),
such that
(14) $\int_a^b f(x)g(x)\,dx = f(c)\int_a^b g(x)\,dx.$
Example 1.6. The functions f(x) = sin(x) and $g(x) = x^2$ satisfy the hypotheses of
Theorem 1.11 over the interval [0, π/2]. Thus there exists a number c such that
$$\sin(c) = \frac{\int_0^{\pi/2} x^2 \sin(x)\,dx}{\int_0^{\pi/2} x^2\,dx} = \frac{1.14159}{1.29193} = 0.883631$$
or $c = \sin^{-1}(0.883631) = 1.08356$.
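The two integrals in Example 1.6 have closed forms (the numerator equals π − 2), but they can also be approximated numerically. The sketch below uses composite Simpson's rule, which is a choice made for this illustration and is not a method introduced in this section; the function name `simpson` and the number of subintervals are assumptions.

```python
import math

def simpson(g, a, b, n=200):
    """Composite Simpson's rule with n subintervals (n must be even)."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

a, b = 0.0, math.pi / 2
numerator = simpson(lambda x: x**2 * math.sin(x), a, b)
denominator = simpson(lambda x: x**2, a, b)
ratio = numerator / denominator
c = math.asin(ratio)   # sin(c) = ratio, with c in (0, pi/2)
print(numerator, denominator, ratio, c)
```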
Series
Definition 1.5. Let $\{a_n\}_{n=1}^{\infty}$ be a sequence. Then $\sum_{n=1}^{\infty} a_n$ is an infinite series. The
nth partial sum is $S_n = \sum_{k=1}^{n} a_k$. The infinite series converges if and only if the
sequence $\{S_n\}_{n=1}^{\infty}$ converges to a limit S, that is,
(15) $\lim_{n \to \infty} S_n = \lim_{n \to \infty} \sum_{k=1}^{n} a_k = S.$
If a series does not converge, we say that it diverges.
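Example 1.7. Consider the infinite sequence $\{a_n\}_{n=1}^{\infty} = \left\{\frac{1}{n(n+1)}\right\}_{n=1}^{\infty}$. Then the nth
partial sum is
$$S_n = \sum_{k=1}^{n} \frac{1}{k(k+1)} = \sum_{k=1}^{n} \left(\frac{1}{k} - \frac{1}{k+1}\right) = 1 - \frac{1}{n+1}.$$
Therefore, the sum of the infinite series is
$$S = \lim_{n \to \infty} S_n = \lim_{n \to \infty} \left(1 - \frac{1}{n+1}\right) = 1.$$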
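The telescoping identity in Example 1.7 can be verified exactly with rational arithmetic. This sketch uses Python's standard `fractions` module; the function name `partial_sum` is chosen here for illustration.

```python
from fractions import Fraction

def partial_sum(n):
    """S_n = sum of 1/(k(k+1)) for k = 1..n, computed exactly."""
    return sum(Fraction(1, k * (k + 1)) for k in range(1, n + 1))

for n in (1, 10, 100):
    s = partial_sum(n)
    # The telescoping formula predicts S_n = 1 - 1/(n+1)
    assert s == 1 - Fraction(1, n + 1)
    print(n, s, float(s))
```

The printed values approach 1, in agreement with the limit S = 1.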
Theorem 1.12 (Taylor’s Theorem). Assume that $f \in C^{n+1}[a, b]$ and let $x_0 \in [a, b]$.
Then, for every x ∈ (a, b), there exists a number c = c(x) (the value of c
depends on the value of x) that lies between x0 and x such that
(16) $f(x) = P_n(x) + R_n(x),$
where
(17) $P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$
and
(18) $R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - x_0)^{n+1}.$
Example 1.8. The function f(x) = sin(x) satisfies the hypotheses of Theorem 1.12. The
Taylor polynomial $P_n(x)$ of degree n = 9 expanded about $x_0 = 0$ is obtained by evaluating
the following derivatives at x = 0 and substituting the numerical values into formula (17).
$$\begin{aligned}
f(x) &= \sin(x), & f(0) &= 0,\\
f'(x) &= \cos(x), & f'(0) &= 1,\\
f''(x) &= -\sin(x), & f''(0) &= 0,\\
f^{(3)}(x) &= -\cos(x), & f^{(3)}(0) &= -1,\\
&\;\;\vdots & &\;\;\vdots\\
f^{(9)}(x) &= \cos(x), & f^{(9)}(0) &= 1,
\end{aligned}$$
$$P_9(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}.$$
A graph of both f and P9 over the interval [0, 2π] is shown in Figure 1.6.
Figure 1.6 The graph of f(x) = sin(x) and the Taylor polynomial
P9(x) = x − x^3/3! + x^5/5! − x^7/7! + x^9/9!.
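To see how well the degree-9 Taylor polynomial tracks sin(x), the following Python sketch evaluates it and compares the actual error with the bound obtained from the remainder formula (18). The bound uses the fact that every derivative of sin is at most 1 in absolute value; the function names are assumptions of this illustration.

```python
import math

def taylor_sin(x, degree=9):
    """Taylor polynomial of sin about x0 = 0, keeping terms up to x**degree."""
    total = 0.0
    for k in range(1, degree + 1, 2):            # only odd powers are nonzero
        sign = -1.0 if (k // 2) % 2 else 1.0     # +, -, +, - for k = 1, 3, 5, 7, ...
        total += sign * x**k / math.factorial(k)
    return total

def remainder_bound(x, degree=9):
    """|R_n(x)| <= |x|**(n+1) / (n+1)!, since |f^(n+1)(c)| <= 1 for sin."""
    return abs(x) ** (degree + 1) / math.factorial(degree + 1)

for x in (0.5, 1.0, 2.0, 4.0):
    error = abs(taylor_sin(x) - math.sin(x))
    print(x, error, remainder_bound(x))
```

The error grows quickly as x moves away from the expansion point x0 = 0.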
Corollary 1.1. If $P_n(x)$ is the Taylor polynomial of degree n given in Theorem 1.12,
then
(19) $P_n^{(k)}(x_0) = f^{(k)}(x_0)$ for k = 0, 1, . . . , n.
Evaluation of a Polynomial
Let the polynomial P(x) of degree n have the form
(20) $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_2 x^2 + a_1 x + a_0.$
Horner’s method or synthetic division is a technique for evaluating polynomials. It
can be thought of as nested multiplication. For example, a fifth-degree polynomial can
be written in the nested multiplication form
$$P_5(x) = ((((a_5 x + a_4)x + a_3)x + a_2)x + a_1)x + a_0.$$
Theorem 1.13 (Horner’s Method for Polynomial Evaluation). Assume that P(x)
is the polynomial given in equation (20) and x = c is a number for which P(c) is to be
evaluated.
Set $b_n = a_n$ and compute
(21) $b_k = a_k + c\,b_{k+1}$ for k = n − 1, n − 2, . . ., 1, 0;
then $b_0 = P(c)$. Moreover, if
(22) $Q_0(x) = b_n x^{n-1} + b_{n-1} x^{n-2} + \cdots + b_3 x^2 + b_2 x + b_1,$
then
(23) $P(x) = (x - c)Q_0(x) + R_0,$
where $Q_0(x)$ is the quotient polynomial of degree n − 1 and $R_0 = b_0 = P(c)$ is the
remainder.
Proof. Substituting the right side of equation (22) for $Q_0(x)$ and $b_0$ for $R_0$ in equation
(23) yields
(24) $$\begin{aligned}
P(x) &= (x - c)(b_n x^{n-1} + b_{n-1} x^{n-2} + \cdots + b_3 x^2 + b_2 x + b_1) + b_0\\
&= b_n x^n + (b_{n-1} - c\,b_n)x^{n-1} + \cdots + (b_2 - c\,b_3)x^2 + (b_1 - c\,b_2)x + (b_0 - c\,b_1).
\end{aligned}$$
The numbers $b_k$ are determined by comparing the coefficients of $x^k$ in equations (20)
and (24), as shown in Table 1.1.
The value $P(c) = b_0$ is easily obtained by substituting x = c into equation (23)
and using the fact that $R_0 = b_0$:
(25) $P(c) = (c - c)Q_0(c) + R_0 = b_0.$ •
The recursive formula for bk given in (21) is easy to implement with a computer.
A simple algorithm is
    % MATLAB indexes from 1, so a(k+1) stores a_k and b(k+1) stores b_k
    b(n+1) = a(n+1);
    for k = n:-1:1
        b(k) = a(k) + c*b(k+1);
    end
    % b(1) now holds b_0 = P(c)
Table 1.1 Coefficients b_k for Horner’s Method

x^k        Comparing (20) and (24)          Solving for b_k
x^n        a_n = b_n                        b_n = a_n
x^(n-1)    a_(n-1) = b_(n-1) − c b_n        b_(n-1) = a_(n-1) + c b_n
...        ...                              ...
x^k        a_k = b_k − c b_(k+1)            b_k = a_k + c b_(k+1)
...        ...                              ...
x^0        a_0 = b_0 − c b_1                b_0 = a_0 + c b_1

Table 1.2 Horner’s Table for the Synthetic Division Process

Input    a_n    a_(n-1)    a_(n-2)     ...   a_k         ...   a_2     a_1     a_0
c               c b_n      c b_(n-1)   ...   c b_(k+1)   ...   c b_3   c b_2   c b_1
Output   b_n    b_(n-1)    b_(n-2)     ...   b_k         ...   b_2     b_1     b_0 = P(c)
When Horner’s method is performed by hand, it is easier to write the coefficients of
P(x) on a line and perform the calculation bk = ak + cbk+1 below ak in a column.
The format for this procedure is illustrated in Table 1.2.
Example 1.9. Use synthetic division (Horner’s method) to find P(3) for the polynomial
$P(x) = x^5 - 6x^4 + 8x^3 + 8x^2 + 4x - 40$.

          a5    a4    a3    a2    a1    a0
Input      1    −6     8     8     4   −40
c = 3            3    −9    −3    15    57
Output     1    −3    −1     5    19    17 = P(3) = b0
          b5    b4    b3    b2    b1

Therefore, P(3) = 17.
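The recursion (21) translates almost line for line into Python. The sketch below is one possible arrangement, not the book's program: the function name `horner` and the choice to list coefficients from the highest degree down are assumptions made here. It returns both the value P(c) and the coefficients of the quotient polynomial Q0(x) from equation (22).

```python
def horner(coeffs, c):
    """Evaluate a polynomial by nested multiplication.

    coeffs lists a_n, a_(n-1), ..., a_0 (highest degree first).
    Returns (b_0, [b_n, ..., b_1]): the value P(c) and the
    coefficients of the quotient Q0(x) in P(x) = (x - c)Q0(x) + b_0.
    """
    b = [coeffs[0]]                   # b_n = a_n
    for a_k in coeffs[1:]:
        b.append(a_k + c * b[-1])     # b_k = a_k + c*b_(k+1)
    return b[-1], b[:-1]

# The polynomial of Example 1.9: x^5 - 6x^4 + 8x^3 + 8x^2 + 4x - 40
value, quotient = horner([1, -6, 8, 8, 4, -40], 3)
print(value, quotient)
```

Each coefficient is touched once, so evaluating a degree-n polynomial costs n multiplications and n additions.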
Numerical Methods Using Matlab, 4th Edition, 2004
John H. Mathews and Kurtis K. Fink
ISBN: 0-13-065248-2
Prentice-Hall Inc.
Upper Saddle River, New Jersey, USA
http://vig.prenhall.com/
readna.blogspot.com
Monday, December 23, 2013
Saturday, December 7, 2013
Correlation and regression
Abstract
The
present review introduces methods of analyzing the relationship between
two quantitative variables. The calculation and interpretation of the
sample product moment correlation coefficient and the linear regression
equation are discussed and illustrated. Common misuses of the techniques
are considered. Tests and confidence intervals for the population
parameters are described, and failures of the underlying assumptions are
highlighted.
Keywords: coefficient of determination, correlation coefficient, least squares regression line
Introduction
The
most commonly used techniques for investigating the relationship
between two quantitative variables are correlation and linear
regression. Correlation quantifies the strength of the linear
relationship between a pair of variables, whereas regression expresses
the relationship in the form of an equation. For example, in patients
attending an accident and emergency unit (A&E), we could use
correlation and regression to determine whether there is a relationship
between age and urea level, and whether the level of urea can be
predicted for a given age.
Scatter diagram
When investigating a relationship between two variables, the first step is
to show the data values graphically on a scatter diagram. Consider the
data given in Table 1.
These are the ages (years) and the logarithmically transformed
admission serum urea (natural logarithm [ln] urea) for 20 patients
attending an A&E. The reason for transforming the urea levels was to
obtain a more Normal distribution [1]. The scatter diagram for ln urea and age (Fig. 1) suggests there is a positive linear relationship between these variables.
Correlation
On a scatter diagram, the closer the points lie to a straight line, the
stronger the linear relationship between two variables. To quantify the
strength of the relationship, we can calculate the correlation
coefficient. In algebraic notation, if we have two variables x and y,
and the data take the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is given by the following equation:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \,\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
where $\bar{x}$ is the mean of the x values, and $\bar{y}$ is the mean of the y values.
This is the product moment correlation coefficient (or Pearson correlation
coefficient). The value of r always lies between -1 and +1. A value of
the correlation coefficient close to +1 indicates a strong positive
linear relationship (i.e. one variable increases with the other; Fig. 2).
A value close to -1 indicates a strong negative linear relationship
(i.e. one variable decreases as the other increases; Fig. 3). A value close to 0 indicates no linear relationship (Fig. 4); however, there could be a nonlinear relationship between the variables (Fig. 5).
For
the A&E data, the correlation coefficient is 0.62, indicating a
moderate positive linear relationship between the two variables.
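The product moment correlation coefficient is straightforward to compute from paired data. The Python sketch below follows the usual definition (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The data values are hypothetical and are NOT the A&E data from Table 1, which is not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Product moment (Pearson) correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration values, not the A&E data
ages = [25, 40, 52, 60, 71, 83]
ln_urea = [1.1, 1.5, 1.3, 1.9, 1.8, 2.2]
r = pearson_r(ages, ln_urea)
print(r)
```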
Hypothesis test of correlation
We can use the correlation coefficient to test whether there is a linear
relationship between the variables in the population as a whole. The
null hypothesis is that the population correlation coefficient equals 0.
The value of r can be compared with those given in Table 2, or alternatively exact P values
can be obtained from most statistical packages. For the A&E data, r
= 0.62 with a sample size of 20 is greater than the value highlighted
bold in Table 2 for P = 0.01, indicating a P value
of less than 0.01. Therefore, there is sufficient evidence to suggest
that the true population correlation coefficient is not 0 and that there
is a linear relationship between ln urea and age.
Confidence interval for the population correlation coefficient
Although
the hypothesis test indicates whether there is a linear relationship,
it gives no indication of the strength of that relationship. This
additional information can be obtained from a confidence interval for
the population correlation coefficient.
To calculate a confidence interval, r must be transformed to give a Normal distribution
making use of Fisher's z transformation [2]:
$$z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$
The standard error [3] of $z_r$ is approximately:
$$\frac{1}{\sqrt{n - 3}}$$
and hence a 95% confidence interval for the true population value for the transformed correlation coefficient $z_r$ is given by $z_r$ - (1.96 × standard error) to $z_r$ + (1.96 × standard error). Because $z_r$ is Normally distributed, 1.96 standard errors either side of the statistic will give a 95% confidence interval.
For the A&E data the transformed correlation coefficient $z_r$ between ln urea and age is:
$$z_r = \frac{1}{2}\ln\left(\frac{1 + 0.62}{1 - 0.62}\right) = 0.725$$
The standard error of $z_r$ is:
$$\frac{1}{\sqrt{20 - 3}} \approx 0.242$$
The 95% confidence interval for $z_r$ is therefore 0.725 - (1.96 × 0.242) to 0.725 + (1.96 × 0.242), giving 0.251 to 1.199.
We must use the inverse of Fisher's transformation on the lower and upper
limits of this confidence interval to obtain the 95% confidence interval
for the correlation coefficient. The lower limit is:
$$\frac{e^{2 \times 0.251} - 1}{e^{2 \times 0.251} + 1}$$
giving 0.25 and the upper limit is:
$$\frac{e^{2 \times 1.199} - 1}{e^{2 \times 1.199} + 1}$$
giving 0.83. Therefore, we are 95% confident that the population correlation coefficient is between 0.25 and 0.83.
The
width of the confidence interval clearly depends on the sample size,
and therefore it is possible to calculate the sample size required for a
given level of accuracy. For an example, see Bland [4].
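The confidence interval calculation can be sketched in Python as follows. The function name `fisher_ci` is an assumption of this illustration; the transformation is the standard Fisher z (the inverse hyperbolic tangent of r) with approximate standard error 1/√(n − 3), and the back-transformation is the hyperbolic tangent. Because the code carries unrounded intermediate values, the limits can differ from the hand calculation in the second decimal place.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation coefficient."""
    z_r = 0.5 * math.log((1 + r) / (1 - r))    # Fisher's z, same as atanh(r)
    se = 1 / math.sqrt(n - 3)                  # approximate standard error of z_r
    z_lo, z_hi = z_r - z_crit * se, z_r + z_crit * se
    # tanh(z) = (e^(2z) - 1) / (e^(2z) + 1) inverts the transformation
    return math.tanh(z_lo), math.tanh(z_hi)

lower, upper = fisher_ci(0.62, 20)
print(round(lower, 3), round(upper, 3))
```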
Misuse of correlation
There are a number of common situations in which the correlation coefficient can be misinterpreted.
One
of the most common errors in interpreting the correlation coefficient
is failure to consider that there may be a third variable related to
both of the variables being investigated, which is responsible for the
apparent correlation. Correlation does not imply causation. To
strengthen the case for causality, consideration must be given to other
possible underlying variables and to whether the relationship holds in
other populations.
A nonlinear relationship may exist
between two variables that would be inadequately described, or possibly
even undetected, by the correlation coefficient.
A data set may sometimes comprise distinct subgroups, for example males and
females. This could result in clusters of points leading to an inflated
correlation coefficient (Fig. 6). A single outlier may produce the same sort of effect.
Subgroups in the data resulting in a misleading correlation. All data: r = 0.57; males: r = -0.41; females: r = -0.26.
It
is important that the values of one variable are not determined in
advance or restricted to a certain range. This may lead to an invalid
estimate of the true correlation coefficient because the subjects are
not a random sample.
Another situation
in which a correlation coefficient is sometimes misinterpreted is when
comparing two methods of measurement. A high correlation can be
incorrectly taken to mean that there is agreement between the two
methods. An analysis that investigates the differences between pairs of
observations, such as that formulated by Bland and Altman [5], is more appropriate.
Regression
In
the A&E example we are interested in the effect of age (the
predictor or x variable) on ln urea (the response or y variable). We
want to estimate the underlying linear relationship so that we can
predict ln urea (and hence urea) for a given age. Regression can be used
to find the equation of this line. This line is usually referred to as
the regression line.
Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis.
Equation of a straight line
The equation of a straight line is given by y = a + bx, where the
coefficients a and b are the intercept of the line on the y axis and the
gradient, respectively. The equation of the regression line for the
A&E data (Fig. 7) is as follows: ln urea = 0.72 + (0.017 × age) (calculated using the
method of least squares, which is described below). The gradient of this
line is 0.017, which indicates that for an increase of 1 year in age
the expected increase in ln urea is 0.017 units (and hence urea is expected
to increase by a factor of $e^{0.017} \approx 1.02$, i.e. by about 2%). The predicted ln urea of a patient
aged 60 years, for example, is 0.72 + (0.017 × 60) = 1.74 units. This
transforms to a urea level of $e^{1.74}$ = 5.70 mmol/l. The y
intercept is 0.72, meaning that if the line were projected back to age =
0, then the ln urea value would be 0.72. However, this is not a
meaningful value because age = 0 is a long way outside the range of the
data and therefore there is no reason to believe that the straight line
would still be appropriate.
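The prediction and back-transformation above amount to a few lines of arithmetic. In this Python sketch the coefficients 0.72 and 0.017 are those quoted for the A&E regression line; the function name is chosen for illustration. Since the model is linear in ln urea, a one-year increase in age multiplies the predicted urea by e raised to the gradient.

```python
import math

intercept, gradient = 0.72, 0.017   # coefficients quoted for the A&E data

def predicted_ln_urea(age):
    return intercept + gradient * age

ln_urea_60 = predicted_ln_urea(60)
urea_60 = math.exp(ln_urea_60)          # back-transform to mmol/l
ratio_per_year = math.exp(gradient)     # multiplicative change in urea per year
print(ln_urea_60, urea_60, ratio_per_year)
```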
Method of least squares
The regression line is obtained using the method of least squares. Any line
y = a + bx that we draw through the points gives a predicted or fitted
value of y for each value of x in the data set. For a particular value
of x the vertical difference between the observed and fitted value of y
is known as the deviation, or residual (Fig. 8).
The method of least squares finds the values of a and b that minimise
the sum of the squares of all the deviations. This gives the following
formulae for calculating a and b:
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
Usually, these values would be calculated using a statistical package or the statistical functions on a calculator.
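For readers without a statistical package, the least squares coefficients can be computed directly. The Python sketch below uses the standard closed-form solution (gradient equal to the sum of cross-products of deviations divided by the sum of squared x deviations, intercept chosen so the line passes through the point of means). The data are hypothetical points lying exactly on y = 6 + 2x, so the expected answer is known in advance.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # gradient
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical points on the line y = 6 + 2x
xs = [1, 2, 3, 4, 5]
ys = [8, 10, 12, 14, 16]
a, b = least_squares(xs, ys)
print(a, b)
```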
Hypothesis tests and confidence intervals
We
can test the null hypotheses that the population intercept and gradient
are each equal to 0 using test statistics given by the estimate of the
coefficient divided by its standard error.
The test statistics are compared with the t distribution on n - 2 (sample
size - number of regression coefficients) degrees of freedom [4].
The 95% confidence interval for each of the population coefficients is calculated as follows: coefficient ± ($t_{n-2}$ × the standard error), where $t_{n-2}$ is the 5% point for a t distribution with n - 2 degrees of freedom.
For the A&E data, the output (Table 3) was obtained from a statistical package. The P value
for the coefficient of ln urea (0.004) gives strong evidence against
the null hypothesis, indicating that the population coefficient is not 0
and that there is a linear relationship between ln urea and age. The
coefficient of ln urea is the gradient of the regression line and its
hypothesis test is equivalent to the test of the population correlation
coefficient discussed above. The P value for the constant of
0.054 provides insufficient evidence to indicate that the population
coefficient is different from 0. Although the intercept is not
significant, it is still appropriate to keep it in the equation. There
are some situations in which a straight line passing through the origin
is known to be appropriate for the data, and in this case a special
regression analysis can be carried out that omits the constant [6].
Analysis of variance
As
stated above, the method of least squares minimizes the sum of squares
of the deviations of the points about the regression line. Consider the
small data set illustrated in Fig. 9.
This figure shows that, for a particular value of x, the distance of y
from the mean of y (the total deviation) is the sum of the distance of
the fitted y value from the mean (the deviation explained by the
regression) and the distance from y to the line (the deviation not
explained by the regression).
The regression line for these data is given by y = 6 + 2x. The observed, fitted values and deviations are given in Table 4.
The sum of squared deviations can be compared with the total variation
in y, which is measured by the sum of squares of the deviations of y
from the mean of y. Table 4
illustrates the relationship between the sums of squares. Total sum of
squares = sum of squares explained by the regression line + sum of
squares not explained by the regression line. The explained sum of
squares is referred to as the 'regression sum of squares' and the
unexplained sum of squares is referred to as the 'residual sum of
squares'.
This partitioning of the total sum of squares can be presented in an analysis of variance table (Table 5).
The total degrees of freedom = n - 1, the regression degrees of freedom
= 1, and the residual degrees of freedom = n - 2 (total - regression
degrees of freedom). The mean squares are the sums of squares divided by
their degrees of freedom.
If
there were no linear relationship between the variables then the
regression mean squares would be approximately the same as the residual
mean squares. We can test the null hypothesis that there is no linear
relationship using an F test. The test statistic is calculated as the
regression mean square divided by the residual mean square, and a P value may be obtained by comparison of the test statistic with the F distribution with 1 and n - 2 degrees of freedom [2]. Usually, this analysis is carried out using a statistical package that will produce an exact P value.
In fact, the F test from the analysis of variance is equivalent to the t
test of the gradient for regression with only one predictor. This is
not the case with more than one predictor, but this will be the subject
of a future review. As discussed above, the test for gradient is also
equivalent to that for the correlation, giving three tests with
identical P values. Therefore, when there is only one predictor variable it does not matter which of these tests is used.
The analysis of variance for the A&E data (Table 6) gives a P value of 0.006 (the same P value as obtained previously), again indicating a linear relationship between ln urea and age.
Coefficient of determination
Another useful quantity that can be obtained from the analysis of variance is the coefficient of determination ($R^2$).
It is the proportion of the total variation in y accounted for by the regression model. Values of $R^2$ close to 1 imply that most of the variability in y is explained by the regression model. $R^2$ is the same as $r^2$ in regression when there is only one predictor variable.
For the A&E data, $R^2$ = 1.462/3.804 = 0.38 (i.e. the same as $0.62^2$),
and therefore age accounts for 38% of the total variation in ln urea.
This means that 62% of the variation in ln urea is not accounted for by
age differences. This may be due to inherent variability in ln urea or
to other unknown factors that affect the level of ln urea.
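The analysis of variance quantities follow from the sums of squares by simple arithmetic. The Python sketch below starts from the regression and total sums of squares quoted above for the A&E data (1.462 and 3.804, with n = 20) and derives the residual sum of squares, the mean squares, the F statistic and the coefficient of determination. Converting F to a P value needs the F distribution, which is left to a statistical package here.

```python
n = 20
ss_regression = 1.462
ss_total = 3.804
ss_residual = ss_total - ss_regression      # unexplained variation

df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual   # compare with F on 1 and n - 2 df
r_squared = ss_regression / ss_total        # coefficient of determination

print(ss_residual, ms_residual, f_statistic, r_squared)
```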
Prediction
The
fitted value of y for a given value of x is an estimate of the
population mean of y for that particular value of x. As such it can be
used to provide a confidence interval for the population mean [3]. The fitted values change as x changes, and therefore the confidence intervals will also change.
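The 95% confidence interval for the fitted value of y for a particular value of x, say $x_p$, is again calculated as fitted y ± ($t_{n-2}$ × the standard error). The standard error is given by:
$$s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
where s is the residual standard deviation (the square root of the residual mean square).
Fig. 10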
shows the range of confidence intervals for the A&E data. For
example, the 95% confidence interval for the population mean ln urea for
a patient aged 60 years is 1.56 to 1.92 units. This transforms to urea
values of 4.76 to 6.82 mmol/l.
Regression line, its 95% confidence interval and the 95% prediction interval for individual patients.
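The fitted value for y also provides a predicted value for an individual, and a prediction interval or reference range [3] can be obtained (Fig. 10). The prediction interval is calculated in the same way as the confidence interval but the standard error is given by:
$$s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
where s is the residual standard deviation (the square root of the residual mean square).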
For
example, the 95% prediction interval for the ln urea for a patient aged
60 years is 0.97 to 2.52 units. This transforms to urea values of 2.64
to 12.43 mmol/l.
Both confidence intervals and prediction intervals become wider for values of the predictor variable further from the mean.
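The widening of both intervals away from the mean of x can be seen numerically. The Python sketch below uses the standard textbook standard errors for a fitted mean and for an individual prediction in simple linear regression; the data are hypothetical (not the A&E data), and the critical value 2.776 is the two-sided 5% point of the t distribution on 4 degrees of freedom, supplied by hand because the Python standard library has no t distribution.

```python
import math

def regression_intervals(xs, ys, x_p, t_crit):
    """Fitted value at x_p with confidence and prediction intervals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(ss_res / (n - 2))              # residual standard deviation
    fitted = a + b * x_p
    leverage = 1 / n + (x_p - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)            # for the population mean of y
    se_pred = s * math.sqrt(1 + leverage)        # for an individual value of y
    ci = (fitted - t_crit * se_mean, fitted + t_crit * se_mean)
    pi = (fitted - t_crit * se_pred, fitted + t_crit * se_pred)
    return fitted, ci, pi

# Hypothetical data: n = 6, so n - 2 = 4 degrees of freedom and t = 2.776
xs = [25, 40, 52, 60, 71, 83]
ys = [1.1, 1.5, 1.3, 1.9, 1.8, 2.2]
near = regression_intervals(xs, ys, 60, 2.776)
far = regression_intervals(xs, ys, 83, 2.776)
print(near)
print(far)
```

The interval at x = 83 is wider than at x = 60 because 83 lies further from the mean of the x values.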
Assumptions and limitations
The
use of correlation and regression depends on some underlying
assumptions. The observations are assumed to be independent. For
correlation both variables should be random variables, but for
regression only the response variable y must be random. In carrying out
hypothesis tests or calculating confidence intervals for the regression
parameters, the response variable should have a Normal distribution and
the variability of y should be the same for each value of the predictor
variable. The same assumptions are needed in testing the null hypothesis
that the correlation is 0, but in order to interpret confidence
intervals for the correlation coefficient both variables must be
Normally distributed. Both correlation and regression assume that the
relationship between the two variables is linear.
A
scatter diagram of the data provides an initial check of the assumptions
for regression. The assumptions can be assessed in more detail by
looking at plots of the residuals [4,7].
Commonly, the residuals are plotted against the fitted values. If the
relationship is linear and the variability constant, then the residuals
should be evenly scattered around 0 along the range of fitted values
(Fig. 11).
(a) Scatter diagram of y against x suggests that the relationship is nonlinear. (b) Plot of residuals against fitted values in panel a; the curvature of the relationship is shown more clearly. (c) Scatter diagram of y against x suggests that the variability ...
In
addition, a Normal plot of residuals can be produced. This is a plot of
the residuals against the values they would be expected to take if they
came from a standard Normal distribution (Normal scores). If the
residuals are Normally distributed, then this plot will show a straight
line. (A standard Normal distribution is a Normal distribution with mean
= 0 and standard deviation = 1.) Normal plots are usually available in
statistical packages.
Figs 12 and 13
show the residual plots for the A&E data. The plot of fitted values
against residuals suggests that the assumptions of linearity and
constant variance are satisfied. The Normal plot suggests that the
distribution of the residuals is Normal.
When
using a regression equation for prediction, errors in prediction may
not be just random but also be due to inadequacies in the model. In
particular, extrapolating beyond the range of the data is very risky.
A
phenomenon to be aware of that may arise with repeated measurements on
individuals is regression to the mean. For example, if repeat measures
of blood pressure are taken, then patients with higher than average
values on their first reading will tend to have lower readings on their
second measurement. Therefore, the difference between their second and
first measurements will tend to be negative. The converse is true for
patients with lower than average readings on their first measurement,
resulting in an apparent rise in blood pressure. This could lead to
misleading interpretations; for example, there may appear to be a
negative correlation between change in blood pressure and initial blood
pressure.
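Regression to the mean can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: each hypothetical patient has a stable underlying blood pressure, and each reading adds independent measurement noise. The means, standard deviations, sample size and seed are all arbitrary choices.

```python
import random
from statistics import mean

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed model: stable underlying value plus independent noise per reading.
true_bp = [rng.gauss(120, 10) for _ in range(2000)]
first = [t + rng.gauss(0, 10) for t in true_bp]
second = [t + rng.gauss(0, 10) for t in true_bp]
change = [s - f for f, s in zip(first, second)]

first_mean = mean(first)
change_high = mean(c for f, c in zip(first, change) if f > first_mean)
change_low = mean(c for f, c in zip(first, change) if f <= first_mean)
r = pearson(first, change)

print(f"mean change, high first reading: {change_high:+.2f}")
print(f"mean change, low first reading:  {change_low:+.2f}")
print(f"correlation of change with first reading: {r:+.2f}")
```

No treatment effect is built into the simulation, yet the change is negatively correlated with the first reading, which is the pattern described above.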
Conclusion
Both
correlation and simple linear regression can be used to examine the
presence of a linear relationship between two variables providing
certain assumptions about the data are satisfied. The results of the
analysis, however, need to be interpreted with care, particularly when
looking for a causal relationship or when using the regression equation
for prediction. Multiple and logistic regression will be the subject of
future reviews.
Random Variables
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and continuous.
Discrete Random Variables
A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, .... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor's surgery, and the number of defective light bulbs in a box of ten. The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values. It is also sometimes called the probability function or the probability mass function. (Definitions taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X = xi) = pi. The probabilities pi must satisfy the following:
- 1: 0 ≤ pi ≤ 1 for each i
- 2: p1 + p2 + ... + pk = 1.
Example
Suppose a random variable X can take the values 1, 2, 3, or 4. The probabilities associated with each outcome are described by the following table:
Outcome      1    2    3    4
Probability  0.1  0.3  0.4  0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by the complement rule. This distribution may also be described by a probability histogram.
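The calculations in this example can be reproduced in a few lines. The sketch below stores the table as a dictionary and uses exact fractions to avoid rounding; the names `pmf`, `is_valid_pmf` and `prob` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

# Probability table from the example: outcome -> probability.
pmf = {
    1: Fraction(1, 10),
    2: Fraction(3, 10),
    3: Fraction(4, 10),
    4: Fraction(2, 10),
}

def is_valid_pmf(pmf):
    """Each probability lies between 0 and 1, and together they sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    return in_range and sum(pmf.values()) == 1

def prob(pmf, event):
    """P(event): add up the probabilities of the outcomes in the event."""
    return sum(p for x, p in pmf.items() if event(x))

p_2_or_3 = prob(pmf, lambda x: x in (2, 3))  # P(X = 2 or X = 3)
p_gt_1 = 1 - pmf[1]                          # complement rule: 1 - P(X = 1)

print(is_valid_pmf(pmf), float(p_2_or_3), float(p_gt_1))
```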
All random variables (discrete and continuous) have a cumulative distribution function. It is a function giving the probability that the random variable X is less than or equal to x, for every value x. For a discrete random variable, the cumulative distribution function is found by summing up the probabilities. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Example
The probability that X is less than or equal to 1 is 0.1,
the probability that X is less than or equal to 2 is 0.1+0.3 = 0.4,
the probability that X is less than or equal to 3 is 0.1+0.3+0.4 = 0.8, and
the probability that X is less than or equal to 4 is 0.1+0.3+0.4+0.2 = 1.
The cumulative distribution of this random variable can also be displayed as a probability histogram.
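The running sums in this example amount to a cumulative sum over the probability table. A minimal sketch, using exact fractions and the illustrative names `cdf_steps` and `cdf`:

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4]
probs = [Fraction(1, 10), Fraction(3, 10), Fraction(4, 10), Fraction(2, 10)]

# Running totals give P(X <= x) at each possible value x.
cdf_steps = dict(zip(values, accumulate(probs)))

def cdf(x):
    """P(X <= x) for any real x: sum the probabilities of values <= x."""
    return sum((p for v, p in zip(values, probs) if v <= x), Fraction(0))

for v in values:
    print(f"P(X <= {v}) = {float(cdf_steps[v])}")
```

Between the listed values the function is flat, so for instance `cdf(2.5)` equals `cdf(2)`.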
Continuous Random Variables
A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values, and is represented by the area under a curve (in advanced mathematics, this is known as an integral). The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the probability that X is in the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve, which represents a function p(x), must satisfy the following:
- 1: The curve has no negative values (p(x) ≥ 0 for all x)
- 2: The total area under the curve is equal to 1.
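The two conditions, and the idea of probability as area, can be checked numerically. The sketch below assumes a hypothetical density p(x) = 2x on the interval (0,1), chosen only for illustration, and approximates areas with a midpoint sum; the function names are likewise illustrative.

```python
def density(x):
    """Hypothetical density: p(x) = 2x on (0, 1), and 0 elsewhere."""
    return 2.0 * x if 0.0 < x < 1.0 else 0.0

def area(f, lo, hi, steps=10_000):
    """Approximate the area under f between lo and hi by a midpoint sum."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width

# Condition 1: the curve has no negative values (checked on a grid).
no_negative_values = all(density(k / 100) >= 0 for k in range(-100, 201))

# Condition 2: the total area under the curve is 1.
total = area(density, 0.0, 1.0)

# P(A) for A = (0.5, 1) is the area above A and under the curve.
p_upper_half = area(density, 0.5, 1.0)

print(no_negative_values, round(total, 6), round(p_upper_half, 6))
```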
The Uniform Distribution
A random number generator acting over an interval of numbers (a,b) has a continuous distribution. Since any interval of numbers of equal width has an equal probability of being observed, the curve describing the distribution is a rectangle, with constant height across the interval and 0 height elsewhere. Since the area under the curve must be equal to 1, the length of the interval determines the height of the curve: the height is 1/(b-a). For random number generators over the intervals (4,5), (2,6), (5,5.5), and (3,5), the density curves therefore have heights 1, 0.25, 2, and 0.5 respectively. The distributions corresponding to these curves are known as uniform distributions.
For the interval (2,6), the probability that X is less than 3 or greater than 5 is P(X < 3 or X > 5) = P(X < 3) + P(X > 5) = (3-2)*0.25 + (6-5)*0.25 = 0.25 + 0.25 = 0.5.
The uniform distribution is often used to simulate data. Suppose you would like to simulate data for 10 rolls of a regular 6-sided die. Using the MINITAB "RAND" command with the "UNIF" subcommand generates 10 numbers in the interval (0,6):
MTB > RAND 10 c2;
SUBC> unif 0 6.
Assign the discrete random variable X to the values 1, 2, 3, 4, 5, or 6 according to the generated uniform value U as follows:
if 0<U<1, X=1
if 1<U<2, X=2
if 2<U<3, X=3
if 3<U<4, X=4
if 4<U<5, X=5
if 5<U<6, X=6.
Use the generated MINITAB data to assign X to a value for each roll of the die:
Uniform Data   X Value
4.53786        5
5.77474        6
3.69518        4
1.03929        2
4.23835        5
0.37096        1
0.75272        1
5.56563        6
0.89045        1
3.18086        4
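The same assignment rule can be written as a short function. This is a sketch in Python rather than MINITAB; `die_face` is a hypothetical helper name, and the seed used for the extra simulated rolls is an arbitrary choice.

```python
import math
import random

def die_face(u):
    """Map a uniform value u in (0, 6) to a die face 1..6.

    Values in (k-1, k) map to face k. An exact integer has probability 0
    under a continuous distribution, so how it is assigned does not matter.
    """
    return min(math.floor(u) + 1, 6)

# The ten uniform values generated by MINITAB in the table above.
uniform_data = [4.53786, 5.77474, 3.69518, 1.03929, 4.23835,
                0.37096, 0.75272, 5.56563, 0.89045, 3.18086]
faces = [die_face(u) for u in uniform_data]
print(faces)

# A fresh simulation of 10 rolls using Python's own generator.
rng = random.Random(1)
simulated = [die_face(rng.uniform(0, 6)) for _ in range(10)]
print(simulated)
```

Applied to the MINITAB values, the function reproduces the X column of the table.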
Another type of continuous density curve is the normal distribution. The area under the curve is not easy to calculate for a normal random variable X with mean
Random variables and probability distributions
There are two types of random variable - discrete and continuous. A random variable has either an associated probability distribution (discrete random variable) or probability density function (continuous random variable).
Stating the expected value gives a general impression of the behaviour of some random variable without giving full details of its probability distribution (if it is discrete) or its probability density function (if it is continuous). Two random variables with the same expected value can have very different distributions. There are other useful descriptive measures which affect the shape of the distribution, for example variance. The expected value of a random variable X is symbolised by E(X) or µ.
Stating the variance gives an impression of how closely concentrated round the expected value the distribution is; it is a measure of the 'spread' of a distribution about its average value. Variance is symbolised by V(X) or Var(X) or σ².
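For a discrete random variable, both measures are weighted sums over the probability table. The sketch below uses the four-outcome table from the earlier example as `pmf_a`, and a second, hypothetical distribution `pmf_b` constructed to have the same expected value but a much larger spread; the function names are illustrative.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X): each value weighted by its probability."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): expected squared distance from the expected value."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Table from the earlier example: outcomes 1..4.
pmf_a = {1: Fraction(1, 10), 2: Fraction(3, 10),
         3: Fraction(4, 10), 4: Fraction(2, 10)}

# Hypothetical two-point distribution with the same expected value.
pmf_b = {0: Fraction(1, 2), Fraction(27, 5): Fraction(1, 2)}

for name, pmf in (("a", pmf_a), ("b", pmf_b)):
    print(name, float(expected_value(pmf)), float(variance(pmf)))
```

Both distributions have expected value 2.7, but the variances are 0.81 and 7.29, which shows why the expected value alone does not describe a distribution.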
For a continuous random variable, the cumulative distribution function is the integral of its probability density function.
The probability-probability (P-P) plot is constructed using the theoretical cumulative distribution function, F(x), of the specified model. The values in the sample of data, in order from smallest to largest, are denoted x(1), x(2), ..., x(n). For i = 1, 2, ..., n, F(x(i)) is plotted against (i-0.5)/n.
The quantile-quantile (Q-Q) plot is constructed using the theoretical cumulative distribution function, F(x), of the specified model. The values in the sample of data, in order from smallest to largest, are denoted x(1), x(2), ..., x(n). For i = 1, 2, ..., n, x(i) is plotted against F^(-1)((i-0.5)/n).
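The coordinates of both plots follow directly from these definitions. As a sketch, the example below takes the specified model to be a Uniform distribution on (0,6) and reuses the ten MINITAB values from the die simulation as the sample; the choice of model and the function names are assumptions made for illustration.

```python
def uniform_cdf(x, a, b):
    """F(x) for a Uniform distribution on (a, b)."""
    return min(max((x - a) / (b - a), 0.0), 1.0)

def uniform_inv_cdf(p, a, b):
    """Inverse of F for a Uniform distribution on (a, b)."""
    return a + p * (b - a)

def pp_points(data, cdf):
    """P-P plot: F(x(i)) (vertical) against (i - 0.5)/n (horizontal)."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, cdf(x)) for i, x in enumerate(ordered, start=1)]

def qq_points(data, inv_cdf):
    """Q-Q plot: x(i) (vertical) against F^(-1)((i - 0.5)/n) (horizontal)."""
    ordered = sorted(data)
    n = len(ordered)
    return [(inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]

sample = [4.53786, 5.77474, 3.69518, 1.03929, 4.23835,
          0.37096, 0.75272, 5.56563, 0.89045, 3.18086]

pp = pp_points(sample, lambda x: uniform_cdf(x, 0.0, 6.0))
qq = qq_points(sample, lambda p: uniform_inv_cdf(p, 0.0, 6.0))

for (p_theory, p_model), (q_theory, x_obs) in zip(pp, qq):
    print(f"P-P: ({p_theory:.2f}, {p_model:.3f})   Q-Q: ({q_theory:.2f}, {x_obs:.3f})")
```

In either plot, points close to the line y = x suggest that the model fits the sample reasonably well.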
Many distributions arising in practice can be approximated by a Normal distribution. Other random variables may be transformed to normality. The simplest case of the normal distribution, known as the Standard Normal Distribution, has expected value zero and variance one. This is written as N(0,1).
The Poisson distribution can sometimes be used to approximate the Binomial distribution with parameters n and p. When the number of observations n is large, and the success probability p is small, the Bi(n,p) distribution approaches the Poisson distribution with the parameter given by m = np. This is useful since the computations involved in calculating binomial probabilities are greatly reduced.
Typically, a binomial random variable is the number of successes in a series of trials, for example, the number of 'heads' occurring when a coin is tossed 50 times.
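The quality of the Poisson approximation can be checked by comparing the two sets of probabilities side by side. In this sketch the values n = 1000 and p = 0.002 are hypothetical, chosen so that n is large and p is small; the function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bi(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, m):
    """P(X = k) for a Poisson distribution with parameter m."""
    return math.exp(-m) * m ** k / math.factorial(k)

n, p = 1000, 0.002   # hypothetical: many trials, small success probability
m = n * p            # Poisson parameter m = np

rows = [(k, binomial_pmf(k, n, p), poisson_pmf(k, m)) for k in range(8)]
max_gap = max(abs(b - q) for _, b, q in rows)

for k, b, q in rows:
    print(f"k={k}  binomial={b:.5f}  poisson={q:.5f}")
print(f"largest difference: {max_gap:.5f}")
```

The Poisson column needs only one exponential and a factorial per value, whereas the exact binomial probabilities involve large binomial coefficients.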
The Geometric distribution is related to the Binomial distribution in that both are based on independent trials in which the probability of success is constant and equal to p. However, a Geometric random variable is the number of trials until the first failure, whereas a Binomial random variable is the number of successes in n trials.
A continuous random variable X is said to follow a Uniform distribution with parameters a and b. The Uniform distribution has expected value E(X) = (a+b)/2 and variance (b-a)^2/12.
This is very useful when it comes to inference. For example, it allows us (if the sample size is fairly large) to use hypothesis tests which assume normality even if our data appear non-normal. This is because the tests use the sample mean