ANOVA
Question 1a: Test equality of urban/rural availability of water
ANOVA
The GLM Procedure
Data dictionaries
SAS Data set REGRESSIONDATA
Alphabetic List of Variables and Attributes  
#  Variable  Type  Len  Description  Units 
18  AllC  Num  8  All Cause mortality  Deaths per 100,000 population 
19  Commun  Num  8  Communicable Disease mortality  Deaths per 100,000 population 
10  Region  Char  4  WHO Global Region  
11  Urb_Rur  Char  5  Urban or rural residence  
4  all_pcs  Num  8  Per capita health care spending from all sources  Purchasing power equivalent dollars per person 
1  country  Char  11  Country name  
14  drink00  Num  8  Percent of population with access to clean drinking water in year 2000  Percent 
16  drink05  Num  8  Percent of population with access to clean drinking water in year 2005  Percent 
12  drink95  Num  8  Percent of population with access to clean drinking water in year 1995  Percent 
3  gov_pcs  Num  8  Per capita health care spending from government sources  Purchasing power equivalent dollars per person 
5  income_pc  Num  8  Per capita income  Purchasing power equivalent dollars per person 
20  noncom  Num  8  Noncommunicable Disease mortality  Deaths per 100,000 population 
6  phys_num  Num  8  Number of Physicians  (no units listed; SAS format BEST32.) 
8  phys_pt  Num  8  Physicians per thousand population  Number of Physicians per thousand population 
7  popn  Num  8  Population of Country  (no units listed; SAS format BEST32.) 
15  sanit00  Num  8  Percent of population with access to sanitation in year 2000  Percent 
17  sanit05  Num  8  Percent of population with access to sanitation in year 2005  Percent 
13  sanit95  Num  8  Percent of population with access to sanitation in year 1995  Percent 
9  subregion  Num  8  Subregions within WHO regions  
2  year  Num  8  Calendar year 
SAS Data set Watertrim
Alphabetic List of Variables and Attributes  
#  Variable  Type  Len  Description  Units 
1  Country  Char  11  Country name  
4  Region  Char  4  WHO Global Region  
2  Urb_Rur  Char  5  Urban/Rural  
7  drink00  Num  8  Percent of population with access to clean drinking water in year 2000  Percent 
9  drink05  Num  8  Percent of population with access to clean drinking water in year 2005  Percent 
5  drink95  Num  8  Percent of population with access to clean drinking water in year 1995  Percent 
8  sanit00  Num  8  Percent of population with access to sanitation in year 2000  Percent 
10  sanit05  Num  8  Percent of population with access to sanitation in year 2005  Percent 
6  sanit95  Num  8  Percent of population with access to sanitation in year 1995  Percent 
3  subregion  Num  8  WHO subregion within global region 
Names of WHO Regions
WHO Region Name  Abbreviation 
African Region  AFRO 
Region of the Americas  PAHO 
South-East Asia Region  SEAR 
European Region  EURO 
Eastern Mediterranean Region  EMRO 
Western Pacific Region  WPRO 
PUBH5770 Introduction to Biostatistics Final Exam Project, June 2017
The final is due on 26 June before midnight, by email. Email your finished final exam with the SAS output and the written answers to the questions in two separate files. In the written answers, please refer to the page of the SAS output from which you derived each answer (this applies to most questions, but not all) and mark the location of the answer on the SAS output (highlights, arrows, boxes).
The output has titles with the question number and section.
If you have any questions send them by email and if I can answer them without giving away too much, I will send the question and the answer to everyone.
Answer the following questions based on analysis output you have been provided.
 Availability of drinking water and sanitation.
The file “Water” contains data on the percentage of the population that has access to drinking water and sanitation in rural and urban areas and in the country overall (Total) for 193 countries at three time points: 1995, 2000, and 2005. Countries are categorized into regions and subregions.
For this exercise, the percentages of the population with access to drinking water and sanitation are treated as continuous variables (even though they are actually proportions).
 Analysis of Variance (ANOVA) is used to test whether the availability of drinking water differs between urban and rural areas worldwide in 2005 (drink05).
 What statistic was used in SAS to test for differences in drinking water availability?
 What is the value of this statistic?
 What is the p-value?
 What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?
 A t-test is used to test whether the availability of drinking water differs between urban and rural areas worldwide in 2005 (drink05).
 What type of test was used in SAS to test for differences in water and sanitation availability?
 What is the value of this statistic?
 What is the p-value?
 What in the output suggests that these data may not be suitable for this test? Why?
 ANOVA is used to test whether the difference in the availability of sanitation between urban and rural areas in 2005 varies across regions of the world.
 Is availability of sanitation different across regions of the world?
 What type of test was used in SAS to test for differences in water and sanitation availability?
 What is the value of this statistic?
 What is the p-value?
 Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
 What is the value of the statistic used in SAS to test if urban rural differences are different across regions?
 What is the p-value?
 What proportion of the variance in sanitation availability is explained by the ANOVA model?
 Association of communicable and noncommunicable disease with economic and health resources.
The file “Resources” contains data for mortality rates from communicable disease, noncommunicable disease, and all causes (death per 100,000 per year for 2008) for 193 countries. Additional variables include data on the number of physicians per thousand population, percentage of the population with access to drinking water and sanitation (from file “Water” above), per capita health care spending from government and all sources (purchasing power equivalent dollars), and per capita incomes (purchasing power equivalent dollars).
 Using PROC CORR, look at the correlations between the independent variables.
 Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?
 How does this influence the starting model for regression analysis?
 Identify possible sets of variables for use in regression.
 Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
 What is the final model? Write the equation for it.
 Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
 Evaluate how well the model fits the data.
 How much of the variation in communicable disease mortality is explained by the final model?
 How much is explained by the variables removed from the full model?
 Run the same model for noncommunicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
 What is the final model? Write the equation for it.
 Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
 Evaluate how well the data fits the assumptions for least squares regression.
 The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?
 Run the model from Part C but restrict the year to 2005.
 How do the results differ?
 What might account for the differences in the results of the same model run for two different years?
 The variable Region cannot be included in the PROC REG models because it is categorical.
 How could you determine if Region might have an influence on the rates of communicable and noncommunicable disease?
Solution
Introduction to Biostatistics Final Exam Project
Answer the following questions based on analysis output you have been provided.
 Availability of drinking water and sanitation.
The file “Water” contains data on the percentage of the population that has access to drinking water and sanitation in rural and urban areas and in the country overall (Total) for 193 countries at three time points: 1995, 2000, and 2005. Countries are categorized into regions and subregions.
For this exercise, the percentages of the population with access to drinking water and sanitation are treated as continuous variables (even though they are actually proportions).
 Analysis of Variance (ANOVA) is used to test whether the availability of drinking water differs between urban and rural areas worldwide in 2005 (drink05).
 What statistic was used in SAS to test for differences in drinking water availability?
The F statistic
 What is the value of this statistic?
Page 2. 77.81
 What is the p-value?
Page 2. <0.001
 What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?
Page 2. 17.77%. It can be calculated from the Model and Error sums of squares.
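The calculation behind this percentage can be sketched as follows. This is a minimal illustration in Python rather than SAS; the function names and the sums of squares are hypothetical and are not taken from the exam output.

```python
def r_squared(ss_model: float, ss_error: float) -> float:
    """R-squared = SS_model / SS_total, where SS_total = SS_model + SS_error."""
    return ss_model / (ss_model + ss_error)


def f_statistic(ss_model: float, df_model: int, ss_error: float, df_error: int) -> float:
    """F = mean square for the model divided by mean square for error."""
    return (ss_model / df_model) / (ss_error / df_error)


def implied_error_df(f: float, r2: float, df_model: int = 1) -> float:
    """Rearranges F = (R2 / df_model) / ((1 - R2) / df_error) to solve for df_error."""
    return f * df_model * (1.0 - r2) / r2


# Hypothetical sums of squares (NOT the values on page 2 of the output).
ss_model, ss_error = 20.0, 80.0
r2 = r_squared(ss_model, ss_error)             # 20 / 100
f = f_statistic(ss_model, 1, ss_error, 40)     # (20 / 1) / (80 / 40)
print(f"R-squared = {r2:.4f}, F = {f:.2f}")
```

Assuming one model degree of freedom, the same identity applied to the reported F = 77.81 and R-squared = 17.77% implies roughly 360 error degrees of freedom, which can be checked against the ANOVA table.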
 A t-test is used to test whether the availability of drinking water differs between urban and rural areas worldwide in 2005 (drink05).
 What type of test was used in SAS to test for differences in water and sanitation availability?
Page 5. A two-sample t-test (TTEST Procedure), reported under both the equal-variance and unequal-variance assumptions
 What is the value of this statistic?
Page 5. Equal variances: 8.93. Unequal variances: 8.86
 What is the p-value?
Page 5. < 0.001 for both cases
 What in the output suggests that these data may not be suitable for this test? Why?
Pages 5-6. The histograms and Q-Q plots suggest that drink05 is not normally distributed in either group. This violates the normality assumption of the t-test, so the test may not be valid.
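The two t statistics reported by the procedure differ only in how the standard error is built. The sketch below shows both versions in Python; the function names and the sample values are hypothetical and are not the exam data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Equal-variance t statistic: uses one pooled variance for both groups."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def welch_t(a, b):
    """Unequal-variance t statistic: each group keeps its own variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical percentages with access to drinking water (illustration only).
urban = [98.0, 95.0, 99.0, 90.0, 97.0]
rural = [60.0, 85.0, 40.0, 92.0, 70.0, 55.0]

print(f"pooled t = {pooled_t(urban, rural):.2f}")
print(f"unequal-variance t = {welch_t(urban, rural):.2f}")
```

When the two groups have the same size the two statistics coincide; they diverge as the group sizes and variances become more unequal.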
 ANOVA is used to test whether the difference in the availability of sanitation between urban and rural areas in 2005 varies across regions of the world.
 Is availability of sanitation different across regions of the world?
 What type of test was used in SAS to test for differences in water and sanitation availability?
Page 8. One-way ANOVA test
 What is the value of this statistic?
Page 8. 42.84
 What is the p-value?
Page 8. <0.0001
 Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
 What is the value of the statistic used in SAS to test if urban rural differences are different across regions?
Page 8. 41.03
 What is the p-value?
Page 8. <0.0001
 What proportion of the variance in sanitation availability is explained by the ANOVA model?
Page 8. 52.38%
 Association of communicable and noncommunicable disease with economic and health resources.
The file “Resources” contains data for mortality rates from communicable disease, noncommunicable disease, and all causes (death per 100,000 per year for 2008) for 193 countries. Additional variables include data on the number of physicians per thousand population, percentage of the population with access to drinking water and sanitation (from file “Water” above), per capita health care spending from government and all sources (purchasing power equivalent dollars), and per capita incomes (purchasing power equivalent dollars).
 Using PROC CORR, look at the correlations between the independent variables.
 Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?
Pages 9-10. The highly correlated independent variables are gov_pcs, all_pcs and income_pc; sanit95, sanit00 and sanit05; and drink95, drink00 and drink05.
 How does this influence the starting model for regression analysis?
This problem is called multicollinearity. It can inflate the variance of the coefficient estimates, so the starting model should not include several highly correlated variables together.
 Identify possible sets of variables for use in regression.
Possibly: gov_pcs, phys_pt, drink00 and sanit00
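Screening for redundant variables amounts to computing pairwise correlations and flagging the pairs above some cutoff. The sketch below illustrates the idea in Python; the data values, the helper names and the 0.8 threshold are all assumptions made for illustration, not results from the PROC CORR output.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical values for three of the variables (illustration only).
data = {
    "gov_pcs": [100, 200, 300, 400, 500],
    "all_pcs": [150, 310, 440, 610, 740],
    "phys_pt": [2.0, 0.5, 3.0, 1.0, 2.5],
}

flagged = redundant_pairs(data)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.3f}")
```

Keeping only one variable from each flagged group is one way to arrive at a candidate set like the one listed above.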
 Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
 What is the final model? Write the equation for it.
Page 13.
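The equation itself is not written out here. Assuming the intercept and slope quoted in the interpretation of this model are the estimates shown on page 13, the fitted equation would be:

```latex
\[
\widehat{\text{commun}} = 1207.27 - 11.52 \times \text{sanit00}
\]
```

For example, under this equation a country with 50% access to sanitation in 2000 would have a predicted rate of 1207.27 - 11.52 x 50 = 631.27 deaths per 100,000 population.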
 Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, the communicable disease mortality rate decreases on average by 11.52 deaths per 100,000 population. The intercept, 1207.27, is the average communicable disease mortality rate when access to sanitation in 2000 is 0%.
 Evaluate how well the model fits the data.
Page 13. The goodness of fit of a linear model is usually measured by the R-squared statistic. Here the model explains 75.34% of the variability in communicable disease mortality (the dependent variable). Note, however, that the residuals appear to have different distributions at different levels of the dependent variable, which could mean that the original variables need to be transformed to obtain a truly linear relationship between the independent and dependent variables.
 How much of the variation in communicable disease mortality is explained by the final model?
Page 13. 75.34%
 How much is explained by the variables removed from the full model?
Page 14. 0.40%
 Run the same model for noncommunicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
 What is the final model? Write the equation for it.
Page 20.
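The equation itself is not written out here. Assuming the coefficients quoted in the interpretation of this model are the estimates shown on page 20, the fitted equation would be:

```latex
\[
\widehat{\text{noncom}} = 888.2166 - 2.2257 \times \text{sanit00} - 0.0097 \times \text{income\_pc}
\]
```

For example, under this equation a country with 50% access to sanitation in 2000 and a per capita income of 10,000 would have a predicted rate of 888.2166 - 111.285 - 97 = 679.9316 deaths per 100,000 population.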
 Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, holding per capita income constant, the noncommunicable disease mortality rate decreases on average by 2.2257 deaths per 100,000 population. When per capita income increases by 1 unit, holding access to sanitation constant, the noncommunicable disease mortality rate decreases on average by 0.0097 deaths per 100,000 population. The intercept, 888.2166, is the average noncommunicable disease mortality rate when both independent variables are 0.
 Evaluate how well the data fits the assumptions for least squares regression.
Page 21. The residuals do not appear to deviate much from normality, and they do not appear to be correlated with the fitted values. The independent variables sanit00 (percent of population with access to sanitation in year 2000) and income_pc (per capita income) have a moderately high correlation of 0.61, which does not strictly violate the no-multicollinearity assumption but may inflate the variance of the coefficient estimates. A few observations have a high Cook's distance or high leverage and might be distorting the results.
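One way to put a number on "moderately high" is the variance inflation factor (VIF). In a model with exactly two predictors, the VIF of each predictor is 1 / (1 - r^2), where r is their correlation. The sketch below applies that formula to the reported 0.61; the function name is hypothetical, and the shortcut does not extend to models with more than two predictors.

```python
from math import sqrt


def vif_two_predictors(r: float) -> float:
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r * r)


r = 0.61                      # correlation between sanit00 and income_pc
vif = vif_two_predictors(r)   # 1 / (1 - 0.3721)
se_inflation = sqrt(vif)      # factor by which the standard errors grow

print(f"VIF = {vif:.2f}")
print(f"standard error inflation = {se_inflation:.2f}")
```

A VIF near 1.6 means the coefficient variances are about 60% larger than they would be with uncorrelated predictors, which is consistent with describing the correlation as a concern rather than a violation.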
 The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?
Page 24. The points are clustered at the lowest range of per capita income; they represent low-income countries. This could affect the model, because a linear model is estimated more precisely when the independent variables are spread over a wide range (that is, have high variance).
 Run the model from Part C but restrict the year to 2005.
 How do the results differ?
Page 27. Now income_pc is not included in the final model, phys_pt is included, and R-squared is higher.
 What might account for the differences in the results of the same model run for two different years?
The relationships among the variables may have changed over time. The difference could also simply reflect a difference in sample size.
 The variable Region cannot be included in the PROC REG models because it is categorical.
 How could you determine if Region might have an influence on the rates of communicable and noncommunicable disease?
There are a couple of options. A simple way would be to use an ANOVA with Region as the factor. Another way is to create dummy (indicator) variables that represent the regions and include them in the model.
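The dummy-variable approach can be sketched as follows. This is a Python illustration rather than SAS; the function name, the `region_` prefix and the choice of AFRO as the reference level are assumptions. With six regions, five indicators are enough, because the reference region is represented by all indicators being 0.

```python
# WHO region abbreviations as listed in the data dictionary.
REGIONS = ["AFRO", "PAHO", "SEAR", "EURO", "EMRO", "WPRO"]


def dummy_code(region: str, reference: str = "AFRO") -> dict:
    """Return 0/1 indicators for every region except the reference level."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {f"region_{r}": int(r == region) for r in REGIONS if r != reference}


print(dummy_code("EURO"))   # only region_EURO is 1
print(dummy_code("AFRO"))   # reference level: every indicator is 0
```

Each indicator's regression coefficient then estimates the difference in mean mortality between that region and the reference region, holding the other predictors constant.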

