Statistics

Text reference: Lecture Notes Series: Engineering Mathematics Volume 1, 2nd Edition, Pearson 2006.

5.1 Sampling Theory

A sample is good when it represents its population well. With a good sample, instead of having to examine the whole population to learn its characteristics, we can turn to the sample for inference about the population. Since we will be dealing with samples, we first cover some theory.

5.1.1 Random Variable

Recall that a random experiment is an experiment whose outcomes cannot be determined with certainty. A random variable is essentially a measurement from a random experiment. Its formal definition follows:

Let S be a sample space. A real-valued function X defined on S is called a random variable, i.e., $X : S \to \mathbb{R}$.

Random variables are usually denoted by capital letters X, Y, Z, ..., with the lower-case x, y, z, ... as their values. The set containing all possible values of X is called the range of X and is denoted $R_X$.

5.1.2 Random Sample

Suppose a random experiment is carried out n times, and let X1, X2, X3, ..., Xn denote the results. These are n mutually independent random variables, each with the same (but possibly unknown) probability function f(x). The random variables X1, X2, X3, ..., Xn are then said to constitute a random sample from a distribution that has the probability function f(x).

5.1.3 Statistic

Instead of looking at each element in a sample, we derive certain statistics from the sample that give us an idea of its characteristics. For example, the mean is a statistic that tells us the location of the sample, while the variance tells us its spread. A more formal definition is given below:

The random variable Y = H(X1, X2, X3, ..., Xn) is called a statistic, where H is a real-valued function and X1, X2, X3, ..., Xn denote a random sample of size n from a given distribution.
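The definitions above can be illustrated with a short simulation: repeat a random experiment n times to obtain a random sample, then compute a statistic from it. The die-rolling experiment here is an assumed example for the demo, not taken from the notes.

```python
import random
from statistics import mean

# Repeat a random experiment (rolling a fair die, an assumed example)
# n times to get a random sample X1, ..., Xn, then compute a statistic
# Y = H(X1, ..., Xn) from it.
random.seed(0)
n = 10
sample = [random.randint(1, 6) for _ in range(n)]  # a random sample of size n
y = mean(sample)                                   # the sample mean, a statistic
print(sample, y)
```

Any other real-valued function of the sample (the maximum, the variance, and so on) is equally a statistic.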
Two important statistics:

(a) The mean of the random sample (or sample mean),
$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$,

(b) The variance of the random sample (or sample variance),
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$ -----------(*)
or
$S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}$.

Remark: Can you derive the latter from (*)?

It is often assumed that a sample comes from a normally distributed population, $N(\mu, \sigma^2)$. In estimating these population parameters (characteristics), we need sample statistics, usually the sample mean $\bar{X}$ and the sample variance $s^2$. In order to do this, we need a rough idea of the distribution of these statistics.

Theorem 5.1.1
Let X1, X2, X3, ..., Xn be a random sample of size n from the normal distribution $N(\mu, \sigma^2)$. Then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is also normally distributed, with $E(\bar{X}_n) = \mu$ and $\mathrm{var}(\bar{X}_n) = \frac{\sigma^2}{n}$.

From the above, note that as n increases, the variance of the sample mean decreases, which means that the sample mean approaches the population mean. This fact is highlighted by the important theorem below:

Theorem 5.1.2 (Central Limit Theorem)
Let X1, X2, X3, ..., Xn be a random sample from a distribution with $E(X_i) = \mu$ and $\mathrm{var}(X_i) = \sigma^2$, $1 \le i \le n$. If n is sufficiently large, then $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ has approximately a normal distribution with mean $\mu_{\bar{X}_n} = \mu$ and variance $\sigma^2_{\bar{X}_n} = \frac{\sigma^2}{n}$. Furthermore, the distribution of $Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ is approximately standard normal (i.e. Zn ~ N(0, 1)).

The normal approximation for $\bar{X}$ is good if n ≥ 30 for any population. For n < 30, the normal approximation for $\bar{X}$ is good only for a population that is more or less normal itself.

Example 5.1.1
A factory produces bulbs whose lifetime is approximately normally distributed with mean 600 hours and standard deviation 18 hours.
Find the probability that the average lifetime is less than 585 hours, given a sample of size n = 9.

Solution: Let $\bar{X}$ be the average lifetime of a bulb. Then $\mu_{\bar{X}} = 600$ and $\sigma_{\bar{X}} = \frac{18}{\sqrt{9}} = 6$, so
$P(\bar{X} < 585) = P\left(Z < \frac{585 - 600}{6}\right) = P(Z < -2.5) = 1 - P(Z < 2.5) = 0.00621.$

When we have a sample of size n < 30 that comes from a normal population with σ² unknown, it is better to use the t-distribution, hence the definition below:

Definition 5.1.1 (t-distribution)
If $\bar{X}$ and $S^2$ are the mean and variance of a random sample of size n chosen from a normal population with mean μ and variance σ², then $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a t-distribution with n − 1 degrees of freedom.

Remarks: The t-distribution approaches the standard normal distribution as the sample size n → ∞. Also note that, since we do not know the population variance σ², we use the sample variance s² in its place.

Example 5.1.2
Suppose a manufacturer is interested in the average daily production of a machine; more specifically, in the probability that a machine produces on average more than 100 items per day. It is known that production is normally distributed with mean μ and variance σ². The manufacturer measured the production of 11 machines, yielding the following data:
115 82 98 126 109 143 136 92 103 127 150
Solution: Discussed during lectures.

5.2 Statistical Inference

Statistical inference may be divided into two major areas: 1. estimation and 2. tests of hypotheses. When we look at sample statistics in order to estimate the corresponding characteristics of a population, we are using an estimator. Let's look at the different methods of estimation.

5.2.1 Classical Methods of Estimation

5.2.1.1 Point estimator: a sample statistic that produces a single numerical value as the estimate of an unknown parameter.
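The sampling-distribution calculation of Example 5.1.1 can be reproduced with Python's standard library; `statistics.NormalDist` stands in for the statistical table here.

```python
from statistics import NormalDist

# Example 5.1.1: bulb lifetimes ~ N(600, 18^2), sample size n = 9.
mu, sigma, n = 600, 18, 9
sigma_xbar = sigma / n ** 0.5            # standard error of the mean = 6
p = NormalDist(mu, sigma_xbar).cdf(585)  # P(Xbar < 585)
print(round(sigma_xbar, 1), round(p, 5))  # 6.0 0.00621
```

The result agrees with the table value P(Z < −2.5) = 0.00621.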
Remark: The statistic $\bar{X}$ (computed from a sample) is a point estimator of the population mean μ, and the sample variance s² is a point estimator of σ².

In estimating a parameter, we have to be sure that the statistic we plan to use is an unbiased estimator of the parameter. The definition that follows tells us how to determine this.

Definition 5.2.1.1
A statistic $\hat{\Theta}$ is said to be an unbiased estimator of the parameter θ if $E(\hat{\Theta}) = \theta$. Otherwise, it is said to be biased.

Example 5.2.1.1
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is an unbiased estimator of the parameter σ².

Definition 5.2.1.2
Considering all possible unbiased estimators of some parameter θ, the one with the smallest variance is called the most efficient estimator of θ.

If we say that a distance is measured as 5.28 mm, we are giving a point estimate. If, on the other hand, we say that the distance is 5.28 ± 0.02 mm, i.e. the distance lies between 5.26 and 5.30 mm, we are giving an interval estimate.

5.2.1.2 Interval estimator: a random interval in which the true value of the parameter falls with some level of probability.

We always attach a probabilistic statement to an interval estimate. The confidence level gives this probabilistic statement and is usually denoted (1 − α)100%, where α is known as the significance level (usually we choose α as 0.10, 0.05 or 0.01).

Confidence interval: an interval constructed based on the confidence level. For example, a 95% confidence interval for μ means that we are 95% confident that the value of μ lies in that interval.

Normal population: the population from which the sample is selected has an approximately normal distribution.

Let's look at precisely how we conduct interval estimation for a mean, a proportion and a variance.
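Example 5.2.1.1's claim (dividing by n − 1 makes S² unbiased for σ²) can be checked by Monte Carlo; the standard normal population and the simulation sizes below are assumptions for the demo.

```python
import random
from statistics import mean

# Monte Carlo check: S^2 with divisor n-1 is unbiased for sigma^2 = 1,
# while the divisor-n version underestimates it by a factor (n-1)/n.
random.seed(7)
n, reps = 5, 20000
s2_unbiased, s2_biased = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]   # sample from N(0, 1)
    xbar = mean(x)
    ss = sum((xi - xbar) ** 2 for xi in x)
    s2_unbiased.append(ss / (n - 1))
    s2_biased.append(ss / n)
print(round(mean(s2_unbiased), 2))   # close to 1.0
print(round(mean(s2_biased), 2))     # close to (n-1)/n = 0.8
```

With n = 5 the bias of the divisor-n estimator is large (20%), which is why the n − 1 divisor matters most for small samples.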
5.2.1.2.1 Estimating the Mean (Confidence Interval for μ)

(a) Normal population, σ known
If $\bar{x}$ is the mean of a random sample of size n from a population with known variance σ², a (1 − α)100% confidence interval for μ is given by
$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
where $z_{\alpha/2}$ is the Z-value leaving an area of α/2 to the right.

Example 5.2.1.2
Measurements of the weights of a random sample of 200 containers made by a certain machine showed a mean of 0.21 kilograms, and it is known that σ = 0.002 kilograms. Find the 95% confidence interval for the mean weight of all the containers.

Solution: n = 200, (1 − α)100% = 95% so α = 0.05, σ = 0.002, $z_{0.025} = 1.96$. The 95% confidence interval is
$0.21 - 1.96\frac{0.002}{\sqrt{200}} < \mu < 0.21 + 1.96\frac{0.002}{\sqrt{200}}$, i.e. $0.21 \pm 0.000277$,
or 0.2097 < μ < 0.2103.

Remark: What does this interval signify?

(b) Large sample, σ unknown
If $\bar{x}$ is the mean of a random sample of size n from a population with unknown variance σ², a (1 − α)100% confidence interval for μ is given by
$\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}$
where $z_{\alpha/2}$ is the Z-value leaving an area of α/2 to the right and s² is the sample variance.

Example 5.2.1.3
The average calcium content in 36 samples taken from different locations in a river is found to be 1.6 grams per millilitre. Find the 95% confidence interval for the mean calcium content in the river. Assume that the sample standard deviation is 0.2.

Solution: The point estimate for μ is $\bar{x} = 1.6$. The z-value leaving an area of 0.025 to the right, and therefore an area of 0.975 to the left, is $z_{0.025} = 1.96$. Hence the 95% confidence interval is
$1.6 - 1.96\frac{0.2}{\sqrt{36}} < \mu < 1.6 + 1.96\frac{0.2}{\sqrt{36}}$,
which reduces to 1.5347 < μ < 1.6653.

(c) Small sample, σ unknown
If $\bar{x}$ is the mean of a random sample of size n from a normal population with unknown variance σ², a (1 − α)100% confidence interval for μ is given by
$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$
where $t_{\alpha/2}$ is the t-value with (n − 1) degrees of freedom leaving an area of α/2 to the right, and s² is the sample variance.
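Cases (a) and (b) can be sketched as one small helper, checked here against Example 5.2.1.3; `NormalDist.inv_cdf` replaces the z-table.

```python
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, conf=0.95):
    """Two-sided (1-alpha)100% CI for the mean when sigma is known
    (or when s is used in its place for a large sample)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    half = z * sigma / n ** 0.5
    return xbar - half, xbar + half

# Example 5.2.1.3: xbar = 1.6, s = 0.2, n = 36, 95% confidence.
lo, hi = z_confidence_interval(1.6, 0.2, 36)
print(round(lo, 4), round(hi, 4))   # 1.5347 1.6653
```

Raising the confidence level widens the interval, since `z` grows as `alpha` shrinks.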
Remarks: For small samples selected from non-normal populations, we cannot expect our degree of confidence to be accurate. However, for samples of size n ≥ 30, regardless of the shape of most populations, sampling theory guarantees good results.

Example 5.2.1.4
A random sample of 12 clerks of a certain company typed an average of 75.6 words per minute with a standard deviation of 8.2 words per minute. Find a 95% confidence interval for the average number of words typed by all clerks of this company. (Assume a normal distribution for the number of words typed per minute.)

Solution: n = 12, $\bar{x} = 75.6$, s = 8.2, $t_{0.025} = 2.201$ with 11 degrees of freedom. Hence the confidence interval for μ is
$75.6 - 2.201\frac{8.2}{\sqrt{12}} < \mu < 75.6 + 2.201\frac{8.2}{\sqrt{12}}$,
i.e. 70.3899 < μ < 80.8101.

Example 5.2.1.5
A machine is producing containers that are cylindrical in shape. Nine containers are randomly chosen and the diameters are 10.01, 9.97, 10.03, 10.04, 9.99, 9.98, 9.99, 10.01 and 10.03 centimetres. Find a 99% confidence interval for the mean diameter of containers from this machine, assuming an approximately normal distribution.

Solution: n = 9.
$\bar{x} = \frac{10.01 + 9.97 + 10.03 + 10.04 + 9.99 + 9.98 + 9.99 + 10.01 + 10.03}{9} = 10.0056$.
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = 0.0246$.
$t_{0.005} = 3.355$ with 8 degrees of freedom. Hence the confidence interval is
$10.0056 - 3.355\frac{0.0246}{3} < \mu < 10.0056 + 3.355\frac{0.0246}{3}$,
i.e. 9.9781 < μ < 10.0331.

5.2.1.2.2 Estimating a Proportion (Confidence Interval for a Population Proportion)

For a large sample (n ≥ 30), if $\hat{p}$ is the proportion of successes in a random sample of size n and $\hat{q} = 1 - \hat{p}$, an approximate (1 − α)100% confidence interval for the binomial parameter p is given by
$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
where $z_{\alpha/2}$ is the Z-value leaving an area of α/2 to the right.
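Returning to the t-interval of Example 5.2.1.5, the computation can be sketched with the standard library; since Python's stdlib has no t-distribution, the critical value t_{0.005} = 3.355 is taken from the table as a constant.

```python
from statistics import mean, stdev

# Example 5.2.1.5: 99% t-interval for the mean diameter (n = 9, sigma unknown).
data = [10.01, 9.97, 10.03, 10.04, 9.99, 9.98, 9.99, 10.01, 10.03]
n = len(data)
xbar = mean(data)
s = stdev(data)          # sample standard deviation (divisor n - 1)
t = 3.355                # t_{0.005} with 8 degrees of freedom (table value)
half = t * s / n ** 0.5
print(round(xbar, 4), round(s, 4))                   # 10.0056 0.0246
print(round(xbar - half, 4), round(xbar + half, 4))  # bounds ~ 9.9781 and 10.0331
```

With SciPy available, `scipy.stats.t.ppf(0.995, 8)` would replace the table lookup.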
Example 5.2.1.6
In a random sample of n = 600 families owning television sets in a city, it is found that x = 240 subscribe to ASTRO. Find a 95% confidence interval for the actual proportion of families in this city who subscribe to ASTRO.

Solution: The point estimate of p is $\hat{p} = \frac{240}{600} = 0.4$. Using the statistical table, we find $z_{0.025} = 1.96$. Therefore, the 95% confidence interval for p is
$0.4 - 1.96\sqrt{\frac{(0.4)(0.6)}{600}} < p < 0.4 + 1.96\sqrt{\frac{(0.4)(0.6)}{600}}$
which can be simplified to 0.3608 < p < 0.4392.

Remark: Note that the confidence intervals built so far (for the mean and the proportion) are two-sided. What if you are interested in only the lower (or upper) bound of the mean μ? If $\bar{x}$ is the mean of a random sample of size n from a population with known variance σ², then a (1 − α)100% one-sided confidence interval for μ is given by
$\mu > \bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$.
More specifically, the one-sided confidence interval given above is bounded from below; thus $\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$ is known as the lower bound for μ.

Remark: Similarly, $\bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ forms the upper bound for μ. How will the one-sided confidence interval corresponding to this upper bound look?

5.2.1.2.3 Estimating the Variance (Confidence Interval for σ²)

If s² is the variance of a random sample of size n from a normal population, a (1 − α)100% confidence interval for σ² is given by
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$
where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are χ²-values with n − 1 degrees of freedom, leaving areas of α/2 and 1 − α/2, respectively, to the right.

[Figure 1: the chi-squared density with n − 1 degrees of freedom, showing $P(\chi^2_{1-\alpha/2} < \chi^2 < \chi^2_{\alpha/2}) = 1 - \alpha$.]

Example 5.2.1.7
The following are the weights, in grams, of 10 packages of sugar packed by a worker: 454, 451, 458, 450, 451, 459, 458, 459, 452 and 450.
Find a 95% confidence interval for the variance of all such packages of sugar packed by this worker, assuming a normal population.

Solution: First we find
$s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)} = \frac{10(2063112) - 4542^2}{10(9)} = 15.0667$.
To obtain a 95% confidence interval, we have α = 0.05. Then, using the statistical table with 9 degrees of freedom, we find $\chi^2_{0.025} = 19.023$ and $\chi^2_{0.975} = 2.700$. Therefore, the 95% confidence interval for σ² is
$\frac{9(15.0667)}{19.023} < \sigma^2 < \frac{9(15.0667)}{2.700}$
or simply 7.1282 < σ² < 50.2223.

5.2.2 Hypothesis Testing

In order to judge whether a new procedure implemented in a study is better than the existing one, we turn to hypothesis testing. Hypothesis testing is viewed as a formal rule that tells us when we should accept the new procedure as an improvement.

Statistical hypothesis: an assertion or conjecture concerning one or more random variables. It is frequently denoted by symbols such as H0 or H1.

Null hypothesis, H0: a claim (or statement) about a population parameter that is assumed to be true until it is declared false. (Sometimes known as the "no change" hypothesis.)

Alternative hypothesis, H1: the hypothesis that we will accept if we decide to reject the null hypothesis.

Example 5.2.2.1
Suppose a manufacturer observes that the existing procedure gives about 4% defective products. An engineer would like to implement a new procedure to reduce the number of defective products. It is agreed that n = 100 products will be produced using the new procedure. Let X equal the number of these 100 products that are defective. Thus we have the following test:
H0: p = 0.04
H1: p < 0.04
We would like to reject H0 and accept H1, so that the number of defective products is reduced. Since a sample of 100 is taken, it is reasonable to accept H1 if X < 4. If X ≥ 4, then we accept H0.

Test statistic: a value computed from sample data.
Suppose, in the example above, we find that only 3 products are defective under the new procedure. Then the value of the test statistic is 3.

Rejection region (critical region): the set of values of the test statistic that imply rejection of the null hypothesis. In the example above, X < 4 is the rejection region.

Acceptance region: the range of values such that H0 is not rejected if the test statistic falls into it.

Critical point: the border of the rejection region. For the example above, X = 4 is the critical point.

Decision rule: In hypothesis testing, a statistic of the sample data is computed, and its value determines the acceptance or rejection of the null hypothesis. This function of the sample data is known as the test statistic. Specifically, the null hypothesis is accepted if the value of the test statistic falls within an interval of real numbers determined through the application of probability theory, and is rejected otherwise. The set of values of the test statistic that result in acceptance of the null hypothesis is called the acceptance region; the set of values that support rejection of the null hypothesis is called the critical region (rejection region).

When dealing with hypothesis testing, there is always a chance of error: the error of accepting the new procedure as an improvement when, in fact, it is not, or the error of rejecting the new procedure as an improvement when, in fact, it is.

Definition 5.2.2.1 (type I error)
Rejection of the null hypothesis when it is true is called a type I error; P(type I error) = α. The probability α of committing a type I error is also called the level of significance.
Definition 5.2.2.2 (type II error)
Acceptance of the null hypothesis when it is false is called a type II error; P(type II error) = β.

There are four possible situations in testing a statistical hypothesis:
(a) Accept H0 when H0 is true: correct decision.
(b) Accept H0 when H0 is false: type II error.
(c) Reject H0 when H0 is true: type I error.
(d) Reject H0 when H0 is false: correct decision.

Example 5.2.2.2
Referring to Example 5.2.2.1, we might accept the new procedure as an improvement when, in fact, it is not. This mistake is a type I error.
$P(\text{type I error}) = P(X < 4 \text{ when } p = 0.04) = \sum_{x=0}^{3} b(x; 100, 0.04)$

The probability of a type II error is impossible to calculate unless we have a specific alternative hypothesis.

Example 5.2.2.3
Referring to Example 5.2.2.1, if we have the hypotheses
H0: p = 0.04
H1: p = 0.02,
then we are able to calculate the probability of accepting H0 when it is false (type II error):
$P(\text{type II error}) = P(X \ge 4 \text{ when } p = 0.02) = 1 - \sum_{x=0}^{3} b(x; 100, 0.02)$

Definition 5.2.2.3 (power of a test)
The power of a statistical test, 1 − β, is the probability of rejecting the null hypothesis H0 when H1 is in fact true.

Definition 5.2.2.4 (operating characteristic (OC) curve)
The probability of accepting the null hypothesis, as a function of the possible values of the parameter θ, is L(θ). The function L(θ) is called the operating characteristic (or OC) curve. It carries complete information about the probabilities of both types of error: if θ equals a value where the null hypothesis is true, then P(type I error) = 1 − L(θ); when θ has a value where the alternative hypothesis is true, then P(type II error) = L(θ).
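The binomial sums in Examples 5.2.2.2 and 5.2.2.3 can be evaluated directly with `math.comb`, a stand-in for the binomial table:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

# Rejection region X < 4 (i.e. X <= 3), n = 100.
alpha = binom_cdf(3, 100, 0.04)      # type I error: reject when p = 0.04
beta = 1 - binom_cdf(3, 100, 0.02)   # type II error: accept when p = 0.02
print(round(alpha, 4), round(beta, 4))  # alpha ~ 0.43, beta ~ 0.14
```

The large α here shows that the cut-off X < 4 is a lenient rejection rule for this example, not a 5%-level test.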
Example 5.2.2.4
Referring to Example 5.2.2.1, the function L(p) can be calculated as follows:
$L(p) = P(\text{acceptance of } H_0) = P(X \ge 4) = 1 - \sum_{x=0}^{3} b(x; 100, p)$

In tests of hypotheses, we have one- and two-tailed tests, which are stated as follows:

Hypothesis test    | Symbol in H0 | Symbol in H1 | Rejection region
Two-sided          | =            | ≠            | in both tails
One-sided (left)   | = or ≥       | <            | in the left tail
One-sided (right)  | = or ≤       | >            | in the right tail

Procedure for hypothesis testing using a test statistic:
Step 1: State the null hypothesis H0 that θ = θ0.
Step 2: Choose an appropriate alternative hypothesis H1 from one of the alternatives θ < θ0, θ > θ0, or θ ≠ θ0.
Step 3: Choose a significance level of size α.
Step 4: Select the appropriate critical point and establish the critical region.
Step 5: Compute the value of the test statistic from the sample data.
Step 6: Make a decision, following the decision rule.

Example 5.2.2.5
A manufacturer of a certain brand of bulb claims that the average lifetime is more than 1.5 years. State the null and alternative hypotheses to be used in testing this claim and determine where the critical region is located.

Solution: The manufacturer's claim should be rejected only if μ is less than or equal to 1.5 years, and should be accepted if μ is more than 1.5 years. Since the null hypothesis always specifies a single value of the parameter, we test
H0: μ = 1.5, H1: μ > 1.5.
Although we have stated the null hypothesis with an equal sign, it is understood to include any value not specified by the alternative hypothesis. Consequently, the acceptance of H0 does not imply that μ is exactly equal to 1.5 years, but rather that we do not have sufficient evidence favouring H1. Since we have a one-tailed test, the greater-than symbol indicates that the critical region lies entirely in the right tail of the distribution of our test statistic $\bar{X}$.
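The OC curve L(p) of Example 5.2.2.4 can be traced numerically; the probed values of p below are an arbitrary choice for illustration.

```python
from math import comb

def L(p, n=100, c=3):
    """OC curve for Example 5.2.2.4: probability of accepting H0,
    i.e. P(X >= 4) = 1 - P(X <= 3) when X ~ Binomial(n, p)."""
    return 1 - sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
                   for x in range(c + 1))

for p in (0.01, 0.02, 0.04, 0.08):
    print(p, round(L(p), 3))
```

As expected, L(p) increases with p: acceptance of H0 becomes more likely the further p is above the rejection cut-off.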
Testing a Population Mean

(a) Test of hypothesis about a population mean: normal population with σ known, or large sample with σ unknown.

One-tailed test
H0: μ = μ0
H1: μ > μ0 (or H1: μ < μ0)
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (normal population, σ known) or $z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (large sample, σ unknown)
Critical region: $z > z_{\alpha}$ (or $z < -z_{\alpha}$), where $z_{\alpha}$ is the Z-value such that $P(Z > z_{\alpha}) = \alpha$ and s² is the sample variance.

Two-tailed test
H0: μ = μ0
H1: μ ≠ μ0
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (normal population, σ known) or $z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (large sample, σ unknown)
Critical region: $|z| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the Z-value such that $P(Z > z_{\alpha/2}) = \frac{\alpha}{2}$ and s² is the sample variance.

Using the p-value approach for a single-mean test (σ known)

The probability value, more commonly called the p-value, is the smallest significance level at which the null hypothesis would be rejected. With $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, calculate the p-value as follows:
(i) One-tailed test (left-tailed): $p = P(Z < z)$.
(ii) One-tailed test (right-tailed): $p = P(Z > z)$.
(iii) Two-tailed test: $p = P(Z < -|z|) + P(Z > |z|) = 2P(Z > |z|)$.

Procedure for hypothesis testing using the p-value:
Step 1: State the null hypothesis H0 that θ = θ0.
Step 2: Choose an appropriate alternative hypothesis H1 from one of the alternatives θ < θ0, θ > θ0, or θ ≠ θ0.
Step 3: Choose a significance level of size α.
Step 4: Determine the test to use (left-tailed, right-tailed or two-tailed).
Step 5: Compute the p-value.
Step 6: State the decision rule: reject H0 if the p-value is less than α; otherwise, accept H0.

Example 5.2.2.6
A random sample of 100 electronic chips showed an average lifetime of 2.8 years. Assuming a population standard deviation of 0.5 year, does this seem to indicate that the mean lifetime is greater than 2.7 years? Use a 0.05 level of significance.

Solution (using the test statistic):
1. H0: μ = 2.7 years.
2. H1: μ > 2.7 years.
3. α = 0.05.
4. Critical region: z > 1.645, where $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
5. Computations: $\bar{x} = 2.8$ years, σ = 0.5 year, and $z = \frac{2.8 - 2.7}{0.5/\sqrt{100}} = 2$.
Decision: Since z = 2 > 1.645, reject H0 and conclude that the mean lifetime is greater than 2.7 years.

Using the p-value:
The p-value corresponding to z = 2.0 is the area to the right of z = 2.0 under the standard normal curve (Figure 2). Using the statistical table, we have $p = P(Z > 2.0) = 0.0228$. As a result, the evidence in favour of H1 is even stronger than that suggested by a 0.05 level of significance.

[Figure 2: p-value for Example 5.2.2.6.]

Example 5.2.2.7
A manufacturer developed a new type of battery that he claims has a mean lifetime of 10 months with a standard deviation of 0.5 month. Test the hypothesis that μ = 10 months against the alternative that μ ≠ 10 months, if a random sample of 50 batteries is tested and found to have a mean lifetime of 9.8 months. Use a 0.01 level of significance.

Solution (using the test statistic):
1. H0: μ = 10 months.
2. H1: μ ≠ 10 months.
3. α = 0.01.
4. Critical region: z < −2.575 or z > 2.575, where $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
5. Computations: $\bar{x} = 9.8$ months, n = 50, and hence $z = \frac{9.8 - 10}{0.5/\sqrt{50}} = -2.83$.
6. Decision: Reject H0 and conclude that the average lifetime is not equal to 10 months but is, in fact, less than 10 months.

Using the p-value:
Since the test in this example is two-tailed, the desired p-value is twice the area to the left of z = −2.83 (Figure 3). Therefore, using the statistical table, we have $p = P(|Z| > 2.83) = 2P(Z < -2.83) = 0.0046$, which allows us to reject the null hypothesis that μ = 10 months at a level of significance smaller than 0.01.

[Figure 3: p-value for Example 5.2.2.7.]

(b) Test of hypothesis about a population mean: small sample (n < 30), σ unknown.

One-tailed test
H0: μ = μ0
H1: μ > μ0 (or H1: μ < μ0)
Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with n − 1 degrees of freedom
Critical region: $t > t_{\alpha}$ (or $t < -t_{\alpha}$), where $t_{\alpha}$ is the t-value such that $P(t > t_{\alpha}) = \alpha$
Two-tailed test
H0: μ = μ0
H1: μ ≠ μ0
Test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with (n − 1) degrees of freedom
Critical region: $|t| > t_{\alpha/2}$, where $t_{\alpha/2}$ is the t-value such that $P(t > t_{\alpha/2}) = \frac{\alpha}{2}$ with (n − 1) degrees of freedom.

Example 5.2.2.8
The height of students in University ABC is normally distributed, and a lecturer claims that the mean height of these students is 1.68 metres. To test this claim, another lecturer takes a random sample of 16 students and finds that the mean is 1.71 metres and the standard deviation is 0.05 metres. Can the claim made by the lecturer be accepted at the 0.05 level of significance?

Solution:
1. H0: μ = 1.68 metres.
2. H1: μ ≠ 1.68 metres.
3. α = 0.05, α/2 = 0.025.
4. Critical region: t < −2.131 or t > 2.131, where $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with 15 degrees of freedom.
5. Computations: $t = \frac{1.71 - 1.68}{0.05/\sqrt{16}} = 2.4$.
6. Decision: Reject H0.

Example 5.2.2.9
Test the hypothesis that the average diameter of a certain type of battery produced by a factory is 10 millimetres, if the diameters of a random sample of 10 batteries are 10.1, 9.8, 10.1, 10.5, 10.1, 9.7, 9.9, 10.4, 10.3 and 9.8 millimetres. Use a 0.01 level of significance and assume that the distribution of diameters is normal.

Solution:
1. H0: μ = 10.
2. H1: μ ≠ 10.
3. α = 0.01.
4. Critical region: t < −3.25 or t > 3.25, where $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with 9 degrees of freedom.
5. Computations: $\bar{x} = 10.07$ and $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = 0.271$, so $t = \frac{10.07 - 10}{0.271/\sqrt{10}} = 0.82$.
6. Decision: Do not reject H0.

Testing a Population Proportion (large sample)

One-tailed test
H0: p = p0
H1: p > p0 (or H1: p < p0)
Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$, where $q_0 = 1 - p_0$ and $\hat{p}$ is the sample proportion.
Critical region: $z > z_{\alpha}$ (or $z < -z_{\alpha}$).

Two-tailed test
H0: p = p0
H1: p ≠ p0
Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$, where $q_0 = 1 - p_0$ and $\hat{p}$ is the sample proportion.
Critical region: $|z| > z_{\alpha/2}$.
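The small-sample t-test of Example 5.2.2.9 can be carried out directly on the raw data; as before, the table critical value is used as a constant since the stdlib has no t-distribution.

```python
from statistics import mean, stdev

# Example 5.2.2.9: two-tailed t-test of H0: mu = 10 at alpha = 0.01.
data = [10.1, 9.8, 10.1, 10.5, 10.1, 9.7, 9.9, 10.4, 10.3, 9.8]
n = len(data)
xbar, s = mean(data), stdev(data)     # stdev uses the n - 1 divisor
t = (xbar - 10) / (s / n ** 0.5)
t_crit = 3.25                         # t_{0.005} with 9 degrees of freedom (table)
reject = abs(t) > t_crit
print(round(xbar, 2), round(s, 3), round(t, 2), reject)
```

Since |t| is far below the critical value, the data are consistent with a mean diameter of 10 millimetres.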
The sample size n must be sufficiently large for the normal approximation to be valid.

Example 5.2.2.10 (large sample)
A common medicine for relieving serious pain is believed to be only 80% effective. A new medicine is given to a random sample of 100 adults who were suffering from serious pain, and 85 receive relief. Is this sufficient evidence to conclude that the new medicine is superior to the one commonly prescribed? Use a 0.05 level of significance.

Solution:
1. H0: p = 0.8.
2. H1: p > 0.8.
3. α = 0.05.
4. Critical region: z > 1.645.
5. Computations: $\hat{p} = \frac{85}{100} = 0.85$, $q_0 = 1 - 0.8 = 0.2$, and
$z = \frac{0.85 - 0.8}{\sqrt{(0.8)(0.2)/100}} = 1.25$.
6. Decision: Do not reject H0; there is insufficient evidence to conclude that the new medicine is superior.

Testing a Population Variance

One-tailed test
H0: $\sigma^2 = \sigma_0^2$
H1: $\sigma^2 > \sigma_0^2$ (or $\sigma^2 < \sigma_0^2$)
Test statistic: $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$, where s² is the sample variance.
Critical region: $\chi^2 > \chi^2_{\alpha}$ (or $\chi^2 < \chi^2_{1-\alpha}$), where $\chi^2_{\alpha}$ and $\chi^2_{1-\alpha}$ are the values of χ² that locate an area of α to the right and α to the left, respectively, of the chi-squared distribution with n − 1 degrees of freedom.

Two-tailed test
H0: $\sigma^2 = \sigma_0^2$
H1: $\sigma^2 \ne \sigma_0^2$
Test statistic: $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$, where s² is the sample variance.
Critical region: $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$, where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are the values of χ² that locate an area of α/2 to the right and α/2 to the left, respectively, of the chi-squared distribution with n − 1 degrees of freedom.

Example 5.2.2.12
In paper manufacturing, the process that produces paper is considered out of control if the standard deviation of the weights of a piece of paper exceeds 1.25 grams. A random sample of 20 pieces of paper taken during a routine periodic check produced a sample standard deviation of 1.90 grams. At the 0.05 level, is the paper production process out of control?

Solution:
1. H0: $\sigma^2 = 1.25^2$.
2. H1: $\sigma^2 > 1.25^2$.
3. α = 0.05.
4. Critical region: $\chi^2 > \chi^2_{0.05} = 30.14$ with n − 1 = 19 degrees of freedom, where $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$.
5. Computations: $\chi^2 = \frac{(20-1)(1.90)^2}{1.25^2} = 43.8976$.
Decision: Since χ² = 43.8976 > 30.14, reject H0 and conclude that the paper production process is out of control.

5.3 Chi-squared Test of Goodness of Fit

A goodness-of-fit test between observed and expected frequencies is based on the quantity
$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$,
where χ² is a value of a random variable whose sampling distribution is approximated very closely by the chi-squared distribution with k − 1 degrees of freedom. The symbols $o_i$ and $e_i$ represent the observed and expected frequencies, respectively, for the i-th cell.

Decision rule: reject H0 if $\chi^2 > \chi^2_{\alpha}$. The decision criterion described here should not be used unless each of the expected frequencies is at least 5.

Example 5.3.1
A fair die is tossed 180 times and each outcome is noted. Test at the 0.01 level of significance whether the data obtained from the experiment follow a discrete uniform distribution.

Solution: A table as follows is formed:

Face, x | Observed frequency, o_i | Expected frequency, e_i | P(X = x)
1       | 26                      | 30                      | 1/6
2       | 32                      | 30                      | 1/6
3       | 25                      | 30                      | 1/6
4       | 24                      | 30                      | 1/6
5       | 35                      | 30                      | 1/6
6       | 38                      | 30                      | 1/6
Total   | 180                     | 180                     | 1

The expected frequencies are found by multiplying each probability 1/6 by 180.

1. H0: the random variable X has a discrete uniform distribution.
2. H1: the random variable X does not have a discrete uniform distribution.
3. α = 0.01.
4. Critical region: $\chi^2 > \chi^2_{0.01}(5)$, where $\chi^2_{0.01}(5) = 15.09$.
5. Computation: $\chi^2 = \sum_{i=1}^{6} \frac{(o_i - e_i)^2}{e_i} = 5.67$.
6. Decision: Since χ² = 5.67 < 15.09, there is no reason to reject the null hypothesis; the discrete uniform distribution provides a good fit for the distribution of the random variable X.
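The chi-squared statistic of Example 5.3.1 is a one-line sum; the critical value is again taken from the table as a constant.

```python
# Example 5.3.1: chi-squared goodness-of-fit for a fair die (180 tosses).
observed = [26, 32, 25, 24, 35, 38]
expected = [180 / 6] * 6                 # 30 per face under H0
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_crit = 15.09                        # chi^2_{0.01} with 5 df (table value)
print(round(chi2, 2), chi2 > chi2_crit)  # 5.67 False
```

`False` here means the statistic does not fall in the rejection region, so H0 is not rejected.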
Example 5.3.2
Suppose the lifetimes (in hours), X, of 40 bulbs are recorded and classified into a few classes as follows:

Class boundaries (hours) | Observed frequency, o_i
1.5 - 2.0                | 2
2.0 - 2.5                | 4
2.5 - 3.0                | 11
3.0 - 3.5                | 15
3.5 - 4.0                | 7
4.0 - 4.5                | 1

Test at the 0.01 level of significance whether the lifetimes of the bulbs may be approximated by a normal distribution with μ = 3.2 and σ = 0.5.

Solution: A table as follows is formed:

Class boundaries (hours) | o_i | Probability | e_i
1.5 - 2.0                | 2   | 0.0082      | 0.328
2.0 - 2.5                | 4   | 0.0726      | 2.904
2.5 - 3.0                | 11  | 0.2638      | 10.552
3.0 - 3.5                | 15  | 0.3811      | 15.244
3.5 - 4.0                | 7   | 0.2195      | 8.780
4.0 - 4.5                | 1   | 0.0548      | 2.192
Total                    | 40  | 1           | 40

The probabilities and expected frequencies are calculated as follows:
First class: $P(X < 2.0) = P\left(Z < \frac{2.0 - 3.2}{0.5}\right) = P(Z < -2.4) = 0.0082$
Second class: $P(2.0 < X < 2.5) = P(-2.4 < Z < -1.4) = 0.0726$
Third class: $P(2.5 < X < 3.0) = P(-1.4 < Z < -0.4) = 0.2638$
Fourth class: $P(3.0 < X < 3.5) = P(-0.4 < Z < 0.6) = 0.3811$
Fifth class: $P(3.5 < X < 4.0) = P(0.6 < Z < 1.6) = 0.2195$
Sixth class: $P(X > 4.0) = P(Z > 1.6) = 0.0548$

The expected frequencies are found by multiplying the probability for each class by 40. Since some expected frequencies are less than 5, we combine cells: the first three classes form one cell (o = 2 + 4 + 11 = 17, e = 0.328 + 2.904 + 10.552 = 13.784) and the last two classes form another (o = 7 + 1 = 8, e = 8.780 + 2.192 = 10.972). This leaves k = 3 cells:

Combined classes (hours) | o_i | e_i
1.5 - 3.0                | 17  | 13.784
3.0 - 3.5                | 15  | 15.244
3.5 - 4.5                | 8   | 10.972

1. H0: the random variable X has a normal distribution with μ = 3.2 and σ = 0.5.
2. H1: the random variable X does not have a normal distribution with μ = 3.2 and σ = 0.5.
3. α = 0.01.
4. Critical region: $\chi^2 > \chi^2_{0.01}(2)$, where $\chi^2_{0.01}(2) = 9.210$ (k − 1 = 3 − 1 = 2 degrees of freedom).
5. Computation: $\chi^2 = \sum_{i=1}^{3} \frac{(o_i - e_i)^2}{e_i} = 1.559$.
6. Decision: Since χ² = 1.559 < 9.210, there is no reason to reject the null hypothesis; the normal distribution with μ = 3.2 and σ = 0.5 provides a good fit for the distribution of the random variable X.

-end-