About the data set

Data Source: the data set used for the following exercises is from Health and Retirement Study (HRS), a longitudinal survey of a representative sample of Americans over the age of 50. The HRS is sponsored by the National Institute on Aging (grant number NIA U01AG009740) and is conducted by the University of Michigan. For more details about the data see https://hrsdata.isr.umich.edu/data-products/public-survey-data.

Run required library packages

library(readr)
library(e1071)

Exercise 1: Importing the raw data from a .csv file into R

#Importing the data file from Github into R
hrs <- read.csv("https://raw.githubusercontent.com/qcjoan/rintro/refs/heads/master/data/HRS_r1bmi.csv")
str(hrs) # Look up the observation and variable information
## 'data.frame':    12652 obs. of  11 variables:
##  $ hhidpn  : int  1010 2010 3010 3020 10001010 10003020 10003030 10004010 10004040 10013010 ...
##  $ raedyrs : int  16 8 12 16 12 16 16 16 12 12 ...
##  $ raeduc  : chr  "5.college and above" "1.lt high-school" "3.high-school graduate" "5.college and above" ...
##  $ rameduc : num  10 8 8 12 9 12 16 9 9 10 ...
##  $ rafeduc : num  10 6 12 8 12 12 14 9 9 10 ...
##  $ ragender: chr  "1.male" "2.female" "1.male" "2.female" ...
##  $ raracem : chr  "1.white/caucasian" "1.white/caucasian" "1.white/caucasian" "1.white/caucasian" ...
##  $ r1agey_b: int  54 57 56 54 53 58 36 52 46 54 ...
##  $ r1agey_e: int  54 57 56 54 53 58 36 52 46 54 ...
##  $ r1agey_m: int  54 57 56 54 53 58 36 52 46 54 ...
##  $ r1bmi   : num  30.7 18.5 25.8 32.4 23.7 30.4 36.6 27 21.6 28 ...
#You can also import the data file from a device into R:
#hrs <- read.csv("the data file directory path, using forward slash")
#head(hrs) # Look at the first 6 cases to check the data

Exercise 2: Exploring raw data

Explore the descriptive statistics mean, median, 25th and 75th quartiles, min, max of the data set. You also can explore descriptive statistics of one variable.

summary(hrs) 
##      hhidpn             raedyrs         raeduc             rameduc      
##  Min.   :     1010   Min.   : 0.00   Length:12652       Min.   : 0.000  
##  1st Qu.: 30570510   1st Qu.:11.00   Class :character   1st Qu.: 7.000  
##  Median : 46608015   Median :12.00   Mode  :character   Median : 9.000  
##  Mean   : 49218161   Mean   :12.02                      Mean   : 9.218  
##  3rd Qu.: 71420518   3rd Qu.:14.00                      3rd Qu.:12.000  
##  Max.   :208867020   Max.   :17.00                      Max.   :17.000  
##                      NA's   :1                          NA's   :1165    
##     rafeduc         ragender           raracem             r1agey_b    
##  Min.   : 0.000   Length:12652       Length:12652       Min.   :23.00  
##  1st Qu.: 6.000   Class :character   Class :character   1st Qu.:52.00  
##  Median : 8.000   Mode  :character   Mode  :character   Median :55.00  
##  Mean   : 8.947                                         Mean   :55.25  
##  3rd Qu.:12.000                                         3rd Qu.:59.00  
##  Max.   :17.000                                         Max.   :85.00  
##  NA's   :1576                                           NA's   :1      
##     r1agey_e        r1agey_m         r1bmi      
##  Min.   :23.00   Min.   :23.00   Min.   : 12.8  
##  1st Qu.:52.00   1st Qu.:52.00   1st Qu.: 23.6  
##  Median :55.00   Median :55.00   Median : 26.5  
##  Mean   :55.26   Mean   :55.26   Mean   : 27.1  
##  3rd Qu.:59.00   3rd Qu.:59.00   3rd Qu.: 29.6  
##  Max.   :85.00   Max.   :85.00   Max.   :102.7  
##  NA's   :1       NA's   :1
#summary(hrs$raedyrs) # Explore descriptive statistics of one variable.

Exercise 3: Preparing data for analysis

In this case, data preparation includes: (1) Creating a new data frame; (2) Dealing with missing values; (3) Rename the variable names.

Exercise 3.1: Creating a new data frame

The new data frame only contains IV (“raedyrs”- years of schooling) and DV (“r1bmi” - body mass index)

hrs.new = data.frame(hrs$r1bmi,hrs$raedyrs)
# hrsnew <- data.frame(hrs$r1bmi,hrs$raedyrs) # Alternative way

Exercise 3.2: Dealing with missing values

Missing values: NA - not available

summary(hrs.new) # View summary of statistics of the data set, notice a missing data in years of schooling.
##    hrs.r1bmi      hrs.raedyrs   
##  Min.   : 12.8   Min.   : 0.00  
##  1st Qu.: 23.6   1st Qu.:11.00  
##  Median : 26.5   Median :12.00  
##  Mean   : 27.1   Mean   :12.02  
##  3rd Qu.: 29.6   3rd Qu.:14.00  
##  Max.   :102.7   Max.   :17.00  
##                  NA's   :1
hrs.new = na.omit(hrs.new) # Delete the observation with missing data by using the function na.omit()
summary(hrs.new) # Check the summary statistics again
##    hrs.r1bmi      hrs.raedyrs   
##  Min.   : 12.8   Min.   : 0.00  
##  1st Qu.: 23.6   1st Qu.:11.00  
##  Median : 26.5   Median :12.00  
##  Mean   : 27.1   Mean   :12.02  
##  3rd Qu.: 29.6   3rd Qu.:14.00  
##  Max.   :102.7   Max.   :17.00

Exercise 3.3: Renaming the variable names for simplicity:

colnames(hrs.new) = c("bmi", "edyrs")
summary(hrs.new) # Now look at the summary statistics again
##       bmi            edyrs      
##  Min.   : 12.8   Min.   : 0.00  
##  1st Qu.: 23.6   1st Qu.:11.00  
##  Median : 26.5   Median :12.00  
##  Mean   : 27.1   Mean   :12.02  
##  3rd Qu.: 29.6   3rd Qu.:14.00  
##  Max.   :102.7   Max.   :17.00

Exercise 4: Analyzing data

This procedure includes: (1) attaching the data to R’s memory; (2) testing normality; (3) Correlation test; (4) simple linear regression test

Exercise 4.1: Attaching the data to R’s memory

Attach the data to R’s memory to perform data analysis efficiently.

attach(hrs.new) 

Exercise 4.2: Testing normality including skewness and kurtosis

# Skewness and Kurtosis values of DV
skewness(hrs.new$bmi) #[-1,1]
## [1] 1.374967
kurtosis(hrs.new$bmi) #[-2,2]
## [1] 6.272418
# Exploring DV with Graphics 
#hist(bmi, main="Histogram of BMI", xlab="BMI") # Histogram to check the data distribution
plot(density(bmi), main="Density Plot: BMI", xlab="BMI") # Density plot to check normality

qqnorm(bmi, main="Population Dist. of BMI") # Q-Q plot to check normality 
qqline(bmi) # Add the straight theoretical line

boxplot(bmi, ylab="BMI") # Box plot to detect outliers

Exercise 4.3: Correlation test

cor.test(edyrs, bmi,method = "spearman", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  edyrs and bmi
## S = 3.7468e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1102811

Exercise 4.4: Simple linear regression test

m <- lm(bmi ~ edyrs) #Assign a model object from the lm function to a variable.
# Extract statistical information from the model object
summary(m)
## 
## Call:
## lm(formula = bmi ~ edyrs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.791  -3.401  -0.701   2.517  76.253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.06242    0.17460  166.45   <2e-16 ***
## edyrs       -0.16348    0.01402  -11.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.153 on 12649 degrees of freedom
## Multiple R-squared:  0.01064,    Adjusted R-squared:  0.01056 
## F-statistic:   136 on 1 and 12649 DF,  p-value: < 2.2e-16
# Sctter plots with a regression line
plot(edyrs,bmi,xlab="Years of schooling",ylab="BMI") # Scatter plots to visualize the relationship between DV and IV.
abline(m) # Plot a regression line

Exercise 5: Exporting the new data frame

Export or save the new data frame “hrs.new” in CSV format.

# write.csv(hrs.new, "directory path/hrs_new.csv")