Data Source: the data set used for the following exercises is from Health and Retirement Study (HRS), a longitudinal survey of a representative sample of Americans over the age of 50. The HRS is sponsored by the National Institute on Aging (grant number NIA U01AG009740) and is conducted by the University of Michigan. For more details about the data see https://hrsdata.isr.umich.edu/data-products/public-survey-data.
library(readr)
library(e1071)
#Importing the data file from Github into R
hrs <- read.csv("https://raw.githubusercontent.com/qcjoan/rintro/refs/heads/master/data/HRS_r1bmi.csv")
str(hrs) # Look up the observation and variable information
## 'data.frame': 12652 obs. of 11 variables:
## $ hhidpn : int 1010 2010 3010 3020 10001010 10003020 10003030 10004010 10004040 10013010 ...
## $ raedyrs : int 16 8 12 16 12 16 16 16 12 12 ...
## $ raeduc : chr "5.college and above" "1.lt high-school" "3.high-school graduate" "5.college and above" ...
## $ rameduc : num 10 8 8 12 9 12 16 9 9 10 ...
## $ rafeduc : num 10 6 12 8 12 12 14 9 9 10 ...
## $ ragender: chr "1.male" "2.female" "1.male" "2.female" ...
## $ raracem : chr "1.white/caucasian" "1.white/caucasian" "1.white/caucasian" "1.white/caucasian" ...
## $ r1agey_b: int 54 57 56 54 53 58 36 52 46 54 ...
## $ r1agey_e: int 54 57 56 54 53 58 36 52 46 54 ...
## $ r1agey_m: int 54 57 56 54 53 58 36 52 46 54 ...
## $ r1bmi : num 30.7 18.5 25.8 32.4 23.7 30.4 36.6 27 21.6 28 ...
#You can also import the data file from a device into R:
#hrs <- read.csv("the data file directory path, using forward slash")
#head(hrs) # Look at the first 6 cases to check the data
Explore the descriptive statistics mean, median, 25th and 75th quartiles, min, max of the data set. You also can explore descriptive statistics of one variable.
summary(hrs)
## hhidpn raedyrs raeduc rameduc
## Min. : 1010 Min. : 0.00 Length:12652 Min. : 0.000
## 1st Qu.: 30570510 1st Qu.:11.00 Class :character 1st Qu.: 7.000
## Median : 46608015 Median :12.00 Mode :character Median : 9.000
## Mean : 49218161 Mean :12.02 Mean : 9.218
## 3rd Qu.: 71420518 3rd Qu.:14.00 3rd Qu.:12.000
## Max. :208867020 Max. :17.00 Max. :17.000
## NA's :1 NA's :1165
## rafeduc ragender raracem r1agey_b
## Min. : 0.000 Length:12652 Length:12652 Min. :23.00
## 1st Qu.: 6.000 Class :character Class :character 1st Qu.:52.00
## Median : 8.000 Mode :character Mode :character Median :55.00
## Mean : 8.947 Mean :55.25
## 3rd Qu.:12.000 3rd Qu.:59.00
## Max. :17.000 Max. :85.00
## NA's :1576 NA's :1
## r1agey_e r1agey_m r1bmi
## Min. :23.00 Min. :23.00 Min. : 12.8
## 1st Qu.:52.00 1st Qu.:52.00 1st Qu.: 23.6
## Median :55.00 Median :55.00 Median : 26.5
## Mean :55.26 Mean :55.26 Mean : 27.1
## 3rd Qu.:59.00 3rd Qu.:59.00 3rd Qu.: 29.6
## Max. :85.00 Max. :85.00 Max. :102.7
## NA's :1 NA's :1
#summary(hrs$raedyrs) # Explore descriptive statistics of one variable.
In this case, data preparation includes: (1) Creating a new data frame; (2) Dealing with missing values; (3) Rename the variable names.
The new data frame only contains IV (“raedyrs”- years of schooling) and DV (“r1bmi” - body mass index)
hrs.new = data.frame(hrs$r1bmi,hrs$raedyrs)
# hrsnew <- data.frame(hrs$r1bmi,hrs$raedyrs) # Alternative way
Missing values: NA - not available
summary(hrs.new) # View summary of statistics of the data set, notice a missing data in years of schooling.
## hrs.r1bmi hrs.raedyrs
## Min. : 12.8 Min. : 0.00
## 1st Qu.: 23.6 1st Qu.:11.00
## Median : 26.5 Median :12.00
## Mean : 27.1 Mean :12.02
## 3rd Qu.: 29.6 3rd Qu.:14.00
## Max. :102.7 Max. :17.00
## NA's :1
hrs.new = na.omit(hrs.new) # Delete the observation with missing data by using the function na.omit()
summary(hrs.new) # Check the summary statistics again
## hrs.r1bmi hrs.raedyrs
## Min. : 12.8 Min. : 0.00
## 1st Qu.: 23.6 1st Qu.:11.00
## Median : 26.5 Median :12.00
## Mean : 27.1 Mean :12.02
## 3rd Qu.: 29.6 3rd Qu.:14.00
## Max. :102.7 Max. :17.00
colnames(hrs.new) = c("bmi", "edyrs")
summary(hrs.new) # Now look at the summary statistics again
## bmi edyrs
## Min. : 12.8 Min. : 0.00
## 1st Qu.: 23.6 1st Qu.:11.00
## Median : 26.5 Median :12.00
## Mean : 27.1 Mean :12.02
## 3rd Qu.: 29.6 3rd Qu.:14.00
## Max. :102.7 Max. :17.00
This procedure includes: (1) attaching the data to R’s memory; (2) testing normality; (3) Correlation test; (4) simple linear regression test
Attach the data to R’s memory to perform data analysis efficiently.
attach(hrs.new)
# Skewness and Kurtosis values of DV
skewness(hrs.new$bmi) #[-1,1]
## [1] 1.374967
kurtosis(hrs.new$bmi) #[-2,2]
## [1] 6.272418
# Exploring DV with Graphics
#hist(bmi, main="Histogram of BMI", xlab="BMI") # Histogram to check the data distribution
plot(density(bmi), main="Density Plot: BMI", xlab="BMI") # Density plot to check normality
qqnorm(bmi, main="Population Dist. of BMI") # Q-Q plot to check normality
qqline(bmi) # Add the straight theoretical line
boxplot(bmi, ylab="BMI") # Box plot to detect outliers
cor.test(edyrs, bmi,method = "spearman", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: edyrs and bmi
## S = 3.7468e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1102811
m <- lm(bmi ~ edyrs) #Assign a model object from the lm function to a variable.
# Extract statistical information from the model object
summary(m)
##
## Call:
## lm(formula = bmi ~ edyrs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.791 -3.401 -0.701 2.517 76.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.06242 0.17460 166.45 <2e-16 ***
## edyrs -0.16348 0.01402 -11.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.153 on 12649 degrees of freedom
## Multiple R-squared: 0.01064, Adjusted R-squared: 0.01056
## F-statistic: 136 on 1 and 12649 DF, p-value: < 2.2e-16
# Sctter plots with a regression line
plot(edyrs,bmi,xlab="Years of schooling",ylab="BMI") # Scatter plots to visualize the relationship between DV and IV.
abline(m) # Plot a regression line
Export or save the new data frame “hrs.new” in CSV format.
# write.csv(hrs.new, "directory path/hrs_new.csv")