Does Student Debt Correlate with Retirement Funds?

R
R: Tidy Tuesday
Teaching my mom about R through a Tidy Tuesday data exploration
Author

Gus Lipkin

Published

May 12, 2022

Note

Data comes from the May 18, 2021 Tidy Tuesday dataset “Wealth and income over time”

The dataset can be found here

Pre-Intro

I graduated with my BS in Business Analytics a few days ago. My mom came to visit for graduation and stayed for a few days. I wanted to show her some of what I learned at school through a hands on activity so we found a Tidy Tuesday dataset and did some exploring. There’s no particular rhyme or reason to this, but I wanted to share my passion for data with her and thought it would be cool to share with the rest of you as well. While I did provide some guidance, most of the exploration is what she thought might be interesting and wanted to explore.

Important

From here on out, I wrote down my mom’s stream of consciousness thoughts on what we were working on. It’s not necessarily verbatim, but gets her thoughts across.

Italics are my words

Intro

First we load our packages and data.

Do you want to preview the data?

head(debt)
    year     race loan_debt loan_debt_pct
   <int>   <char>     <num>         <num>
1:  2016    White 11108.410     0.3367511
2:  2016    Black 14224.770     0.4183588
3:  2016 Hispanic  7493.999     0.2189689
4:  2013    White  8363.605     0.2845555
5:  2013    Black 10302.660     0.4122773
6:  2013 Hispanic  3177.410     0.1570289
head(retirement)
    year     race retirement
   <int>   <char>      <num>
1:  1989    White  32649.430
2:  1989    Black   5954.398
3:  1989 Hispanic   7121.722
4:  1992    White  36637.760
5:  1992    Black   7798.197
6:  1992 Hispanic   5248.894

Do you want to see summary statistics?

Yeah. Can we do that?

summary(debt)
      year          race             loan_debt       loan_debt_pct    
 Min.   :1989   Length:30          Min.   :  793.1   Min.   :0.09146  
 1st Qu.:1995   Class :character   1st Qu.: 1406.8   1st Qu.:0.13887  
 Median :2002   Mode  :character   Median : 2992.6   Median :0.16049  
 Mean   :2002                      Mean   : 4119.4   Mean   :0.19388  
 3rd Qu.:2010                      3rd Qu.: 5899.2   3rd Qu.:0.21823  
 Max.   :2016                      Max.   :14224.8   Max.   :0.41836  
summary(retirement)
      year          race             retirement    
 Min.   :1989   Length:30          Min.   :  5249  
 1st Qu.:1995   Class :character   1st Qu.: 15021  
 Median :2002   Mode  :character   Median : 21809  
 Mean   :2002                      Mean   : 41411  
 3rd Qu.:2010                      3rd Qu.: 44936  
 Max.   :2016                      Max.   :157884  

Looking at debt

Can we see debt by race?

debt |>
  ggplot() +
  geom_boxplot(aes(x = race, y = loan_debt))

It’s been 40 years since I saw a box and whisker plot in college… But it’s interesting that Hispanics seem to have lower debt than Blacks and Whites who seem to be about equal.

What about debt over time?

debt |>
  ggplot() +
  geom_point(aes(x = year, y = loan_debt, color = race))

The difference in debt between Blacks and Whites increases dramatically over time.

Does the share of families with debt change over time?

debt |>
  ggplot() +
  geom_point(aes(x = loan_debt_pct, y = loan_debt, color = race))

It appears that fewer Hispanics take on student loan debt than Blacks or Whites.

Looking at retirement

Can we copy/paste that code for retirement?

Yes. Yes we can!

retirement |>
  ggplot() +
  geom_boxplot(aes(x = race, y = retirement))

Whites seem to have much more solid retirement savings that Blacks or Hispanics. Even the lowest whisker in the Whites plot seems higher than the highest points in the Black or Hispanics plots.

minMaxRetirement <- retirement |>
  group_by(race) |>
  summarise(min = min(retirement), max = max(retirement))
minMaxRetirement
# A tibble: 3 × 3
  race        min     max
  <chr>     <dbl>   <dbl>
1 Black     5954.  29365.
2 Hispanic  5249.  28581.
3 White    32649. 157884.
max(minMaxRetirement$min) - min(minMaxRetirement$max)
[1] 4068.31

This is terrible! The highest average Hispanic retirement is over $4000 less than the lowest average White retirement.

retirement |>
  ggplot() +
  geom_point(aes(x = year, y = retirement, color = race))

This is not surprising. White retirement savings has increased faster than Black or Hispanic which has been relatively static.

Bringing it together

debt_retirement <- inner_join(debt, retirement, by = c("year", "race"))
head(debt_retirement)
    year     race loan_debt loan_debt_pct retirement
   <int>   <char>     <num>         <num>      <num>
1:  2016    White 11108.410     0.3367511  157884.20
2:  2016    Black 14224.770     0.4183588   25211.85
3:  2016 Hispanic  7493.999     0.2189689   28581.12
4:  2013    White  8363.605     0.2845555  138557.50
5:  2013    Black 10302.660     0.4122773   20440.14
6:  2013 Hispanic  3177.410     0.1570289   10264.48
debt_retirement |>
  ggplot() +
  geom_point(aes(x = loan_debt, y = retirement, color = race))

Loan debt doesn’t appear to have an effect on retirement for Blacks and Hispanics. It does have a large effect for Whites.

linear <- lm(retirement ~ loan_debt, data = debt_retirement[race == "White",])
summary(linear)

Call:
lm(formula = retirement ~ loan_debt, data = debt_retirement[race == 
    "White", ])

Residuals:
   Min     1Q Median     3Q    Max 
-13520 -10442  -6777  12753  22574 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32652.056   8114.951   4.024  0.00382 ** 
loan_debt      12.284      1.419   8.657 2.46e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14480 on 8 degrees of freedom
Multiple R-squared:  0.9036,    Adjusted R-squared:  0.8915 
F-statistic: 74.95 on 1 and 8 DF,  p-value: 2.463e-05

Loan debt doesn’t have the effect we thought for Whites, and seems to have minimal effect for Blacks and Hispanics.

linear <- lm(retirement ~ ., data = debt_retirement)
summary(linear)

Call:
lm(formula = retirement ~ ., data = debt_retirement)

Residuals:
   Min     1Q Median     3Q    Max 
-33124 -14811   5102  11353  38578 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -3.458e+06  1.744e+06  -1.983   0.0590 .  
year           1.733e+03  8.733e+02   1.985   0.0587 .  
raceHispanic   7.743e+02  1.443e+04   0.054   0.9577    
raceWhite      7.357e+04  1.156e+04   6.366 1.39e-06 ***
loan_debt      6.832e-01  5.213e+00   0.131   0.8968    
loan_debt_pct  6.545e+03  2.032e+05   0.032   0.9746    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19760 on 24 degrees of freedom
Multiple R-squared:  0.8291,    Adjusted R-squared:  0.7935 
F-statistic: 23.29 on 5 and 24 DF,  p-value: 1.738e-08

I wonder if people with loan debt haven’t finished school?

Student loan debt could also be affected by family size which we don’t have data for.

What if we account for wealth too?

wealth <- 
  fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-09/race_wealth.csv') |>
  rename(wealth_type = type)
head(wealth)
   wealth_type  year      race wealth_family
        <char> <int>    <char>         <num>
1:     Average  1963 Non-White      19503.84
2:     Average  1963     White     140632.66
3:     Average  1963     Black            NA
4:     Average  1963  Hispanic            NA
5:     Average  1983 Non-White      73233.62
6:     Average  1983     White     324057.60
summary(wealth)
 wealth_type             year          race           wealth_family   
 Length:96          Min.   :1963   Length:96          Min.   :  2467  
 Class :character   1st Qu.:1991   Class :character   1st Qu.: 19559  
 Mode  :character   Median :2000   Mode  :character   Median : 97209  
                    Mean   :1998                      Mean   :158020  
                    3rd Qu.:2008                      3rd Qu.:156895  
                    Max.   :2016                      Max.   :919336  
                                                      NA's   :24      
wealth |>
  ggplot() +
  geom_boxplot(aes(x = race, y = wealth_family))
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Oh. That’s interesting. I think family wealth includes all assets and debts, liquid or not. That’s not a great indication of actual wealth.

It looks like there’s some data missing so lets filter for years greater than or equal to 1985 and for the median, rather than average, income.

wealth |>
  filter(year >= 1985, wealth_type == "Median") |>
  ggplot() +
  geom_point(aes(x = year, y = wealth_family, color = race))
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_point()`).

It looks like there isn’t any data for Non-Whites. The median has increased for Whites, but dropped in 2008, while for others it has remained the same over time.

Add wealth into the mix

df <- left_join(debt_retirement, wealth, by = c("race", "year"))
head(df)
    year     race loan_debt loan_debt_pct retirement wealth_type wealth_family
   <int>   <char>     <num>         <num>      <num>      <char>         <num>
1:  2016    White 11108.410     0.3367511  157884.20     Average      919336.1
2:  2016    White 11108.410     0.3367511  157884.20      Median      171000.0
3:  2016    Black 14224.770     0.4183588   25211.85     Average      139523.1
4:  2016    Black 14224.770     0.4183588   25211.85      Median       17409.0
5:  2016 Hispanic  7493.999     0.2189689   28581.12     Average      191727.3
6:  2016 Hispanic  7493.999     0.2189689   28581.12      Median       20920.0

It’s kinda confusing with wealth_type, let’s change it a bit

df <- df |>
  dcast(year + race + loan_debt + loan_debt_pct + retirement ~ wealth_type, 
        value.var = "wealth_family") |>
  rename(average_wealth = Average, median_wealth = Median)
head(df)
Key: <year, race, loan_debt, loan_debt_pct, retirement>
    year     race loan_debt loan_debt_pct retirement average_wealth
   <int>   <char>     <num>         <num>      <num>          <num>
1:  1989    Black 1160.5680     0.1788198   5954.398       78092.20
2:  1989 Hispanic  897.5826     0.1272523   7121.722       84397.75
3:  1989    White 1100.4070     0.1047123  32649.430      424082.40
4:  1992    Black  927.4824     0.1382771   7798.197       80779.48
5:  1992 Hispanic  793.0610     0.0914574   5248.894       90751.79
6:  1992    White 1321.3030     0.1433340  36637.760      373825.90
   median_wealth
           <num>
1:      8023.198
2:      9329.301
3:    134677.800
4:     16602.980
5:     11387.300
6:    116891.700

“Can we scatter this one?”

Yes. Yes we can.

df |>
  ggplot() +
  geom_point(aes(x = loan_debt, y = average_wealth, color = race))

Hispanics only seem to be doing better with having less loan debt.

Can we predict if increasing loan debt is correlated to average wealth?

summary(lm(average_wealth ~ loan_debt + race, data = df))

Call:
lm(formula = average_wealth ~ loan_debt + race, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-186449  -33795   17625   48869  180878 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.341e+04  3.989e+04   0.336  0.73941    
loan_debt    1.821e+01  5.133e+00   3.547  0.00151 ** 
raceHispanic 7.419e+04  4.435e+04   1.673  0.10639    
raceWhite    5.228e+05  4.219e+04  12.391 2.05e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 94210 on 26 degrees of freedom
Multiple R-squared:  0.8861,    Adjusted R-squared:  0.873 
F-statistic: 67.45 on 3 and 26 DF,  p-value: 2.139e-12

This makes sense and tracks with what we saw in the other regressions.

Bringing it back to retirement

summary(lm(retirement ~ ., data = df))

Call:
lm(formula = retirement ~ ., data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-11882  -4546     -3   2733  16560 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -6.352e+05  6.836e+05  -0.929  0.36291    
year            3.097e+02  3.432e+02   0.902  0.37666    
raceHispanic   -4.747e+03  5.986e+03  -0.793  0.43628    
raceWhite       6.748e+03  1.404e+04   0.481  0.63545    
loan_debt      -2.796e+00  1.906e+00  -1.467  0.15656    
loan_debt_pct   1.130e+05  7.727e+04   1.463  0.15763    
average_wealth  2.716e-01  2.798e-02   9.709 2.06e-09 ***
median_wealth  -5.232e-01  1.756e-01  -2.980  0.00691 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7119 on 22 degrees of freedom
Multiple R-squared:  0.9797,    Adjusted R-squared:  0.9732 
F-statistic: 151.4 on 7 and 22 DF,  p-value: < 2.2e-16

I guess this makes sense. If you have money, you have money for retirement.

Let me show you that R package I made that I always talk about.

dewey::regsearch(df, dependent = "retirement", 
                 independent = colnames(df)[c(1:4, 6:7)], maxvar = 6, 
                 family = "gaussian") |>
  head(12) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", 
                                      "condensed", "responsive"),
                font_size = 14) |>
  scroll_box(height = "250px")
Warning in dewey::regsearch(df, dependent = "retirement", independent =
colnames(df)[c(1:4, : Missing 'interactions' argument. Defaulting to FALSE.
Warning in dewey::regsearch(df, dependent = "retirement", independent =
colnames(df)[c(1:4, : Missing 'multi' argument. Defaulting to FALSE.
[1] "Assembling regresions..."
[1] "Creating 63 formulas. Please be patient, this may take a while."
[1] "Creating regressions..."
[1] "Running 63 regressions. Please be patient, this may take a while."
[1] "Running regressions..."
formula aic bic rSquare warn xIntercept year race loan_debt loan_debt_pct average_wealth median_wealth raceHispanic raceWhite
retirement ~ + average_wealth 650.2739 654.4775 0.93199 NA 0.1895266 NA NA NA NA 0 NA NA NA
retirement ~ + median_wealth 686.7959 690.9995 0.77025 NA 0.3351785 NA NA NA NA NA 0.0000000 NA NA
retirement ~ + average_wealth + median_wealth 629.9331 635.5379 0.96771 NA 0.0036909 NA NA NA NA 0 0.0000088 NA NA
retirement ~ + median_wealth + year 670.4355 676.0403 0.87542 NA 0.0000571 0.0000559 NA NA NA NA 0.0000000 NA NA
retirement ~ + loan_debt + median_wealth 673.2736 678.8784 0.86306 NA 0.1440991 NA NA 0.0002116 NA NA 0.0000000 NA NA
retirement ~ + average_wealth + loan_debt_pct 638.7578 644.3626 0.95666 NA 0.0003023 NA NA NA 0.0005465 0 NA NA NA
retirement ~ + average_wealth + loan_debt 639.8237 645.4284 0.95509 NA 0.0030143 NA NA 0.0009078 NA 0 NA NA NA
retirement ~ + average_wealth + year 640.3430 645.9478 0.95431 NA 0.0011340 0.0011635 NA NA NA 0 NA NA NA
retirement ~ + loan_debt_pct + median_wealth 676.7298 682.3345 0.84633 NA 0.0215718 NA NA NA 0.0010906 NA 0.0000000 NA NA
retirement ~ + average_wealth + loan_debt_pct + median_wealth 624.0222 631.0282 0.97519 NA 0.0001792 NA NA NA 0.0094886 0 0.0001608 NA NA
retirement ~ + average_wealth + loan_debt + loan_debt_pct + median_wealth 621.5912 629.9984 0.97860 NA 0.0002454 NA NA 0.0570795 0.0087417 0 0.0000324 NA NA
retirement ~ + average_wealth + loan_debt + median_wealth 628.0037 635.0097 0.97167 NA 0.0007230 NA NA 0.0675575 NA 0 0.0006058 NA NA

Try choosing a model and defend your choice

I would choose retirement ~ + average_wealth + loan_debt + loan_debt_pct + median_wealth because it has the highest \(R^2\) value.

That’s not quite right. retirement ~ + average_wealth + median_wealth is much better because 96% of the model is explained by those two variables so adding loan_debt_pct and loan_debt does not give a lot of benefit.

What if we considered interaction terms?

dewey::regsearch(df, dependent = "retirement", 
                 independent = colnames(df)[c(1:4, 6:7)], maxvar = 6, 
                 family = "gaussian", interactions = TRUE) |>
  head(12) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", 
                                      "condensed", "responsive"),
                font_size = 14) |>
  scroll_box(height = "250px")
Warning in dewey::regsearch(df, dependent = "retirement", independent =
colnames(df)[c(1:4, : Missing 'multi' argument. Defaulting to FALSE.
[1] "Gathering variables..."
[1] "WARNING: Using interaction terms without multithreading may take a very long time"
[1] "Assembling regresions..."
[1] "Creating 82159 formulas. Please be patient, this may take a while."
[1] "Creating regressions..."
[1] "Running 14828 regressions. Please be patient, this may take a while."
[1] "Running regressions..."
formula aic bic rSquare warn xIntercept year race loan_debt loan_debt_pct average_wealth median_wealth average_wealth.loan_debt average_wealth.loan_debt_pct average_wealth.median_wealth raceHispanic raceWhite average_wealth.raceHispanic average_wealth.raceWhite average_wealth.year loan_debt.loan_debt_pct loan_debt.median_wealth loan_debt.raceHispanic loan_debt.raceWhite loan_debt.year loan_debt_pct.median_wealth loan_debt_pct.raceHispanic loan_debt_pct.raceWhite loan_debt_pct.year median_wealth.raceHispanic median_wealth.raceWhite median_wealth.year raceHispanic.year raceWhite.year loan_debt_pct.loan_debt median_wealth.loan_debt median_wealth.loan_debt_pct raceHispanic.loan_debt raceWhite.loan_debt raceHispanic.loan_debt_pct raceWhite.loan_debt_pct raceHispanic.median_wealth raceWhite.median_wealth year.loan_debt year.loan_debt_pct year.median_wealth year.raceHispanic year.raceWhite
retirement ~ + average_wealth 650.2739 654.4775 0.93199 NA 0.1895266 NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + median_wealth 686.7959 690.9995 0.77025 NA 0.3351785 NA NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth + median_wealth 629.9331 635.5379 0.96771 NA 0.0036909 NA NA NA NA 0.0000000 0.0000088 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + median_wealth + year 670.4355 676.0403 0.87542 NA 0.0000571 0.0000559 NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + loan_debt + median_wealth 673.2736 678.8784 0.86306 NA 0.1440991 NA NA 0.0002116 NA NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth + loan_debt_pct 638.7578 644.3626 0.95666 NA 0.0003023 NA NA NA 0.0005465 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth + loan_debt 639.8237 645.4284 0.95509 NA 0.0030143 NA NA 0.0009078 NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth + year 640.3430 645.9478 0.95431 NA 0.0011340 0.0011635 NA NA NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + loan_debt_pct + median_wealth 676.7298 682.3345 0.84633 NA 0.0215718 NA NA NA 0.0010906 NA 0.0000000 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth*loan_debt_pct + average_wealth*median_wealth + average_wealth*race + average_wealth*year 576.6673 594.8828 0.99700 NA 0.0004973 0.0004812 NA NA 0.0004864 0.0000061 0.0008194 NA 0.0004452 2.50e-06 0.0020289 0.0045869 0.0087832 0.0120152 5.8e-06 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth + loan_debt_pct + median_wealth 624.0222 631.0282 0.97519 NA 0.0001792 NA NA NA 0.0094886 0.0000000 0.0001608 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
retirement ~ + average_wealth*median_wealth + median_wealth*year + loan_debt_pct 590.8267 602.0363 0.99328 NA 0.0081282 0.0076968 NA NA 0.0126352 0.0004052 0.0000001 NA NA 2.14e-05 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1e-07 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

This is pretty cool. Average wealth by itself is pretty good, but when it interacts with loan_debt_pct, median_wealth, race, and year, it seems to produce a really great model.

Of course, we still want to be careful about overfitting.

Parting Thoughts

Race appears to have a huge impact on everything and increasing educational debt increases retirement, but only for Whites. I’d be curious to see the data behind the loans to learn why increasing educational debt does not always correlate with increased wealth.

Post-Conclusion

It’s Gus again. I had a really fun time with this. I like to think my mom learned a lot and I’m hoping you did too.