Names of WHO Regions

WHO Region Name, Abbreviation
African Region, AFRO
Region of the Americas, PAHO
South-East Asia Region, SEAR
European Region, EURO
Eastern Mediterranean Region, EMRO
Western Pacific Region, WPRO

PUBH-5770 Introduction to Biostatistics Final Exam Project, June 2017

The final is due on 26 June before midnight, by email. Email your finished final exam and include the SAS output and the written answers to the questions in two separate files. In the written answers, please refer to the page of the SAS output that you derived your answer from (this applies to most questions, but may not be applicable to some) and indicate the location of the answer on the SAS output (highlights, arrows, boxes).

The output has titles with the question number and section.

If you have any questions send them by email and if I can answer them without giving away too much, I will send the question and the answer to everyone.

Answer the following questions based on analysis output you have been provided.

1. Availability of drinking water and sanitation.

The file “Water” contains data on the percentage of the population that has access to drinking water and sanitation in rural and urban areas and the country overall (Total) in 193 countries at three time points: 1995, 2000, and 2005. Countries are categorized into regions and sub-regions.

For this exercise, the percentages of the population with access to drinking water and sanitation are treated as continuous variables (even though they are actually proportions).

1. Analysis of Variance (ANOVA) is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
1. What statistic was used in SAS to test for differences in drinking water availability?
2. What is the value of this statistic?
• What is the p-Value?
1. What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?
1. A t-test is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
1. What type of test was used in SAS to test for differences in water and sanitation availability?
2. What is the value of this statistic?
• What is the p-Value?
1. What in the output suggests that these data may not be suitable for this test? Why?
1. ANOVA is used to test if the difference in the availability of sanitation between urban and rural areas in 2005 is different across regions of the world.
1. Is availability of sanitation different across regions of the world?
1. What type of test was used in SAS to test for differences in water and sanitation availability?
2. What is the value of this statistic?
• What is the p-Value?
1. Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
1. What is the value of the statistic used in SAS to test if urban rural differences are different across regions?
2. What is the p-Value?
• What proportion of the variance in sanitation availability is explained by the ANOVA model?
1. Association of communicable and non-communicable disease with economic and health resources.

The file “Resources” contains data for mortality rates from communicable disease, non-communicable disease, and all causes (death per 100,000 per year for 2008) for 193 countries. Additional variables include data on the number of physicians per thousand population, percentage of the population with access to drinking water and sanitation (from file “Water” above), per capita health care spending from government and all sources (purchasing power equivalent dollars), and per capita incomes (purchasing power equivalent dollars).

1. Using PROC CORR, look at the correlations between the independent variables.
1. Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?
2. How does this influence the starting model for regression analysis?
• Identify possible sets of variables for use in regression.
1. Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
1. What is the final model? Write the equation for it.
2. Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
• Evaluate how well the model fits the data.
1. How much of the variation in communicable disease mortality is explained by the final model?
2. How much is explained by the variables removed from the full model?
1. Run the same model for non-communicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
1. What is the final model? Write the equation for it.
2. Interpret the model in words. What does it say about the relationship between the independent and dependent variables?
• Evaluate how well the data fits the assumptions for least squares regression.
1. The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?
1. Run the model from Part C but restrict the year to 2005.
1. How do the results differ?
2. What might account for the differences in the results of the same model run for two different years?
2. The variable Region cannot be included in the PROC REG models because it is categorical.
1. How could you determine if Region might have an influence on the rates of communicable and non-communicable disease?

Solution: Introduction to Biostatistics Final Exam Project

Answer the following questions based on analysis output you have been provided.

1. Availability of drinking water and sanitation.

The file “Water” contains data on the percentage of the population that has access to drinking water and sanitation in rural and urban areas and the country overall (Total) in 193 countries at three time points: 1995, 2000, and 2005. Countries are categorized into regions and sub-regions.

For this exercise, the percentages of the population with access to drinking water and sanitation are treated as continuous variables (even though they are actually proportions).

1. Analysis of Variance (ANOVA) is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
1. What statistic was used in SAS to test for differences in drinking water availability?

The F-statistic

1. What is the value of this statistic?

Page 2. 77.81

• What is the p-Value?

Page 2. <0.001

1. What proportion of the variance in drinking water availability is explained by the model? What numbers shown on the output can be used to calculate this statistic?

Page 2. 17.77%. It can be calculated from the Model and Error sums of squares.
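As a rough illustration of that calculation, the proportion of variance explained (R-squared) is the Model sum of squares divided by the sum of the Model and Error sums of squares. The sketch below uses made-up sums of squares and degrees of freedom, not values from the SAS output; the function names are hypothetical.

```python
def r_squared(ss_model: float, ss_error: float) -> float:
    """Proportion of variance explained: SS_model / (SS_model + SS_error)."""
    ss_total = ss_model + ss_error  # corrected total sum of squares
    return ss_model / ss_total


def r_squared_from_f(f_value: float, df_model: int, df_error: int) -> float:
    """The same quantity recovered from the F statistic and its degrees of freedom."""
    return (f_value * df_model) / (f_value * df_model + df_error)


# Hypothetical numbers for illustration only (NOT taken from the SAS output).
ss_model, ss_error = 20.0, 80.0
df_model, df_error = 1, 8
f_value = (ss_model / df_model) / (ss_error / df_error)  # mean square ratio

print(r_squared(ss_model, ss_error))
print(r_squared_from_f(f_value, df_model, df_error))
```

Both routes give the same answer, which is a convenient cross-check when reading an ANOVA table.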

1. A t-test is used to test if the availability of drinking water is different in urban and rural regions worldwide in 2005 (drink05).
1. What type of test was used in SAS to test for differences in water and sanitation availability?

Page 5. A two-sample t-test (the TTEST Procedure), reported under both the equal-variance and unequal-variance assumptions

1. What is the value of this statistic?

Page 5. Equal variances: -8.93. Unequal variances: -8.86

• What is the p-Value?

Page 5. < 0.001 for both cases
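To make the two versions of the statistic concrete, here is a minimal sketch of how the equal-variance (pooled) and unequal-variance t statistics are computed. The data are invented and the function names are hypothetical; the values reported above come from the SAS output, not from this code.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def unequal_variance_t(a, b):
    """Two-sample t statistic without the equal-variance assumption."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up percentages for illustration only (NOT the exam data).
rural = [10, 20, 30]
urban = [40, 50, 60, 70, 80]

print(pooled_t(rural, urban))
print(unequal_variance_t(rural, urban))
```

The two statistics differ only in how the standard error of the difference is estimated, which is why they are usually close when the group variances are similar.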

1. What in the output suggests that these data may not be suitable for this test? Why?

Pages 5-6. The histograms and Q-Q plots suggest that the variable drink05 is not normally distributed in either group. The t-test assumes normality within each group, so the results of the test may not be valid.

1. ANOVA is used to test if the difference in the availability of sanitation between urban and rural areas in 2005 is different across regions of the world.
1. Is availability of sanitation different across regions of the world?
1. What type of test was used in SAS to test for differences in water and sanitation availability?

Page 8. One-Way ANOVA test

1. What is the value of this statistic?

Page 8. 42.84

• What is the p-Value?

Page 8. <0.0001
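For readers who want to see where an ANOVA F statistic comes from, the following is a small sketch of the one-way calculation: the between-group mean square divided by the within-group mean square. The groups are invented, and the function name is hypothetical; it is not the model fitted in the SAS output.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Made-up groups for illustration only (NOT the exam data).
example_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(one_way_anova_f(example_groups))
```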

1. Is the difference in availability of sanitation between urban and rural areas different across regions of the world?
1. What is the value of the statistic used in SAS to test if urban rural differences are different across regions?

Page 8. 41.03

1. What is the p-Value?

Page 8. <0.0001

• What proportion of the variance in sanitation availability is explained by the ANOVA model?

Page 8. 52.38%

1. Association of communicable and non-communicable disease with economic and health resources.

The file “Resources” contains data for mortality rates from communicable disease, non-communicable disease, and all causes (death per 100,000 per year for 2008) for 193 countries. Additional variables include data on the number of physicians per thousand population, percentage of the population with access to drinking water and sanitation (from file “Water” above), per capita health care spending from government and all sources (purchasing power equivalent dollars), and per capita incomes (purchasing power equivalent dollars).

1. Using PROC CORR, look at the correlations between the independent variables.
1. Which variables are likely to be redundant (i.e., they are likely to be similarly associated with the dependent variables)?

Pages 9-10. The highly correlated independent variables are gov_pcs, all_pcs and income_pc; sanit95, sanit00 and sanit05; and drink95, drink00 and drink05.

1. How does this influence the starting model for regression analysis?

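This problem is called multicollinearity, and it can inflate the variance of the coefficient estimators. The starting model should therefore include only one variable from each set of highly correlated variables.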

• Identify possible sets of variables for use in regression.

Possibly: gov_pcs, phys_pt, drink00 and sanit00
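The screening step described above can be sketched as follows: compute pairwise Pearson correlations and flag pairs above a cutoff, then keep one variable from each flagged pair. The 0.8 cutoff is an assumption chosen for illustration, the numbers are invented, and the function names are hypothetical; only the variable names echo the exam.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Made-up values for illustration only (NOT the exam data).
data = {
    "gov_pcs": [100, 200, 300, 400, 500],
    "all_pcs": [150, 310, 430, 610, 740],
    "phys_pt": [2.0, 0.5, 3.0, 1.0, 2.5],
}
print(redundant_pairs(data))
```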

1. Using PROC REG, determine the association between mortality rates from communicable diseases (variable commun) and independent variables for drinking water and sanitation (for year 2000, drink00 and sanit00), income and health care spending, and physicians per thousand population.
1. What is the final model? Write the equation for it.

Page 13. commun = 1207.27 - 11.52 * sanit00

1. Interpret the model in words. What does it say about the relationship between the independent and dependent variables?

When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, the communicable disease mortality rate decreases on average by 11.52 deaths per 100,000 population. The intercept, 1207.27, is the average communicable disease mortality rate when the percentage of the population with access to sanitation in 2000 is 0.
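One way to check this interpretation is to plug values into the fitted line. The sketch below uses the intercept and slope quoted in the interpretation; the function name is hypothetical, and the predictions are only as good as the linear model itself.

```python
INTERCEPT = 1207.27      # average rate when sanit00 is 0 (from the interpretation)
SLOPE_SANIT00 = -11.52   # change in rate per 1 percentage point of sanit00


def predicted_commun(sanit00: float) -> float:
    """Predicted communicable disease mortality (deaths per 100,000)."""
    return INTERCEPT + SLOPE_SANIT00 * sanit00


# A 1-point increase in sanitation access lowers the prediction by 11.52.
difference = predicted_commun(51) - predicted_commun(50)
print(round(difference, 2))
```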

• Evaluate how well the model fits the data.

Page 13. The goodness of fit of a linear model is usually measured by the R-squared statistic. In this case, it shows that the model explains 75.34% of the variability in communicable disease mortality (the dependent variable). It is important to note, though, that the residuals appear to have different probability distributions at different levels of the dependent variable, which could mean that it is necessary to transform the original variables to obtain a truly linear relationship between the independent and dependent variables.

1. How much of the variation in communicable disease mortality is explained by the final model?

Page 13. 75.34%

1. How much is explained by the variables removed from the full model?

Page 14. 0.40%

1. Run the same model for non-communicable disease mortality (variable noncom). Start with the same full model and determine the final reduced model.
1. What is the final model? Write the equation for it.

Page 20. noncom = 888.2166 - 2.2257 * sanit00 - 0.0097 * income_pc

1. Interpret the model in words. What does it say about the relationship between the independent and dependent variables?

When the percentage of the population with access to sanitation in 2000 increases by 1 percentage point, holding per capita income constant, the non-communicable disease mortality rate decreases on average by 2.2257 deaths per 100,000 population. When per capita income increases by 1 unit, holding sanitation access constant, the non-communicable disease mortality rate decreases on average by 0.0097 deaths per 100,000 population. The intercept, 888.2166, is the average non-communicable disease mortality rate when both independent variables are 0.

• Evaluate how well the data fits the assumptions for least squares regression.

Page 21. The residuals in this model do not seem to deviate much from the normality assumption, and they do not seem to be correlated with the fitted values. The independent variables percentage of the population with access to sanitation in 2000 (sanit00) and per capita income (income_pc) have a moderately high correlation of 0.61, which does not strictly violate the no-multicollinearity assumption but may inflate the variance of the coefficient estimators. There are a few observations with a high Cook’s distance or high leverage that might be distorting the results.
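To put the 0.61 correlation in perspective, the variance inflation factor (VIF) can be computed directly when a model has exactly two predictors, since VIF = 1 / (1 - r^2). This is a small sketch under that two-predictor assumption; the function name is hypothetical.

```python
def vif_two_predictors(r: float) -> float:
    """VIF for a model with exactly two predictors whose correlation is r."""
    return 1.0 / (1.0 - r ** 2)


# Correlation between sanit00 and income_pc reported in the answer above.
print(round(vif_two_predictors(0.61), 2))  # 1.59
```

A value of about 1.59 means the coefficient variances are roughly 59% larger than they would be with uncorrelated predictors, well below the cutoffs of 5 or 10 that are often used as rules of thumb.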

1. The graphs show some clustering of the data. Where are these data clustered? What does this clustered data represent in the real world? Do you think this clustering affects the model?

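Page 24. These points are clustered in the lowest range of income per capita. They represent low-income countries. This could affect the model, because a linear model is estimated more reliably when the independent variables vary widely; with most observations concentrated at low incomes, the few high-income countries can have a disproportionate influence on the fit.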

1. Run the model from Part C but restrict the year to 2005.
1. How do the results differ?

Page 27. Now income_pc is not included in the final model and phys_pt is, and R-squared is higher.

1. What might account for the differences in the results of the same model run for two different years?

The behavior of the variables may have changed over time. The difference could also simply be due to the sample size.

1. The variable Region cannot be included in the PROC REG models because it is categorical.
1. How could you determine if Region might have an influence on the rates of communicable and non-communicable disease?

There are a couple of options. A simple way would be to use an ANOVA with Region as the factor. Another way is to create dummy (indicator) variables for the regions, one fewer than the number of regions, and include them in the model.
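The dummy-variable option can be sketched as below: each region except a chosen reference gets a 0/1 indicator column, and those columns can then enter a regression as ordinary numeric predictors. The region values and the choice of reference are made up for illustration, and the function name is hypothetical.

```python
def dummy_code(values, reference):
    """Return {level: [0/1, ...]} indicator columns for every level except the reference."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Made-up region assignments for five countries (NOT the exam data).
regions = ["AFRO", "EURO", "PAHO", "AFRO", "SEAR"]
dummies = dummy_code(regions, reference="AFRO")

for name, column in dummies.items():
    print(name, column)
```

Leaving out one region avoids perfect collinearity with the intercept; each dummy coefficient is then read as the difference from the reference region.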
