Structural Predictors of Chronic Absenteeism in NYC Schools
Overview
Analyzed structural predictors of chronic absenteeism across NYC public schools as a final project for MTH 4330 at Baruch College (Spring 2026), with Daisy Chen, Khushi Gandhi, and Safa Tahir. The project frames school-level absenteeism prediction as a supervised regression task using publicly available NYC DOE attendance and demographic data from 2013–2018 (7,607 observations). Rather than modeling individual behavior, we focused on four school-level structural features: Economic Need Index (ENI), % Poverty, % English Language Learners, and % Students with Disabilities. The goal was to understand which institutional conditions are most strongly associated with high chronic absenteeism rates.
Technical Approach
Data & Splits
Two NYC DOE Open Data sources were merged on school identifier (DBN) and year. Splits follow a temporal structure to simulate a realistic forecasting scenario: training on 2013–2016, tuning on 2016–2017, and testing on 2017–2018. This keeps every observation recorded after a known methodological shift in NYC DOE’s ENI and poverty reporting in 2017–18 out of training and tuning. The shift caused a sharp discontinuity in those features and explains the tune-to-test error divergence seen across all models.
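The merge and split described above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the column names (`dbn`, `year`, `pct_chronically_absent`, `eni`), the school-year labels such as `"2015-16"`, and all values are made up for the example, and reading "2013–2016" as the three school years 2013-14 through 2015-16 is an assumption.

```python
# Toy rows standing in for the two NYC DOE tables; every value here is invented.
attendance = [
    {"dbn": "01M015", "year": "2014-15", "pct_chronically_absent": 28.5},
    {"dbn": "01M015", "year": "2016-17", "pct_chronically_absent": 30.1},
    {"dbn": "01M015", "year": "2017-18", "pct_chronically_absent": 33.0},
    {"dbn": "02M047", "year": "2015-16", "pct_chronically_absent": 12.4},
    {"dbn": "02M047", "year": "2017-18", "pct_chronically_absent": 15.2},
]
demographics = [
    {"dbn": "01M015", "year": "2014-15", "eni": 88.0},
    {"dbn": "01M015", "year": "2016-17", "eni": 87.5},
    {"dbn": "01M015", "year": "2017-18", "eni": 93.1},
    {"dbn": "02M047", "year": "2015-16", "eni": 45.0},
    # No demographics row for 02M047 in 2017-18, so the inner join drops it.
]


def merge_on_dbn_year(left, right):
    """Inner-join two lists of row dicts on the (dbn, year) key."""
    right_by_key = {(r["dbn"], r["year"]): r for r in right}
    merged_rows = []
    for row in left:
        match = right_by_key.get((row["dbn"], row["year"]))
        if match is not None:
            merged_rows.append({**row, **match})
    return merged_rows


# Assumed mapping of the write-up's year ranges onto school-year labels.
TRAIN_YEARS = {"2013-14", "2014-15", "2015-16"}
TUNE_YEARS = {"2016-17"}
TEST_YEARS = {"2017-18"}


def temporal_split(rows):
    """Partition rows by school year so later years never reach training."""
    train_rows = [r for r in rows if r["year"] in TRAIN_YEARS]
    tune_rows = [r for r in rows if r["year"] in TUNE_YEARS]
    test_rows = [r for r in rows if r["year"] in TEST_YEARS]
    return train_rows, tune_rows, test_rows


merged = merge_on_dbn_year(attendance, demographics)
train, tune, test = temporal_split(merged)
print(len(merged), len(train), len(tune), len(test))
```

Splitting on the year label rather than on a random row index is what guarantees that no 2017-18 row can end up in the training or tuning sets.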
Models
Four models were evaluated in order of complexity:
- Multiple Linear Regression — OLS baseline on all four predictors
- Polynomial Regression — ENI term extended to degrees 1–10, other predictors kept linear
- Random Forest — 5-fold CV tuning over `mtry` (1–4) and minimum node size (2–20); 500 trees
- XGBoost — 5-fold CV tuning over number of trees, tree depth, and learning rate
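As an illustration of the polynomial regression entry in the list above, the sketch below builds a design matrix with ENI raised to powers 1 through d and the other three predictors left linear, then compares degrees 1–10 by tune-set RMSE. Several things here are assumptions rather than the project's method: the data is synthetic, ENI is rescaled to [0, 1] to keep high powers numerically tame, and choosing the degree by tune RMSE is a guess at the selection rule.

```python
import numpy as np


def design_matrix(eni, others, degree):
    """Columns: intercept, ENI^1..ENI^degree, then the other predictors (linear)."""
    powers = np.column_stack([eni ** k for k in range(1, degree + 1)])
    return np.column_stack([np.ones(len(eni)), powers, others])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def make_split(rng, n):
    """Synthetic data with a curved ENI effect; not the NYC DOE data."""
    eni = rng.uniform(0.0, 1.0, n)            # ENI rescaled to [0, 1]
    others = rng.uniform(0.0, 1.0, (n, 3))    # stand-ins for % Poverty, % ELL, % SWD
    y = 10 + 40 * eni ** 2 + 5 * others[:, 2] + rng.normal(0.0, 1.0, n)
    return eni, others, y


rng = np.random.default_rng(0)
eni_train, others_train, y_train = make_split(rng, 300)
eni_tune, others_tune, y_tune = make_split(rng, 100)

tune_rmse = {}
for degree in range(1, 11):
    # Ordinary least squares fit on the training rows only.
    coef, *_ = np.linalg.lstsq(
        design_matrix(eni_train, others_train, degree), y_train, rcond=None
    )
    predictions = design_matrix(eni_tune, others_tune, degree) @ coef
    tune_rmse[degree] = rmse(y_tune, predictions)

best_degree = min(tune_rmse, key=tune_rmse.get)
print(best_degree, round(tune_rmse[best_degree], 3))
```

Because the synthetic outcome is quadratic in ENI, a degree-1 fit leaves visible curvature in the residuals, which is the pattern a polynomial term is meant to absorb.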
SHAP (TreeSHAP) values were computed on the test set for both ensemble models to examine feature-level contributions.
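To make the quantity behind a SHAP value concrete, here is a brute-force Shapley computation for a four-feature toy model. It is not TreeSHAP and does not use the `shap` library: it enumerates all feature subsets, which is only feasible for a handful of features, and it replaces "absent" features with a single baseline row rather than averaging over background data. The model `toy_model`, its coefficients, and both input rows are hypothetical, so the resulting ranking is not evidence about the real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` come from x; all others come from the baseline row.
        row = [x[i] if i in subset else baseline[i] for i in features]
        return predict(row)

    phi = []
    for i in features:
        rest = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(row):
    """Hypothetical predictor: linear terms plus one ENI x SWD interaction."""
    eni, poverty, ell, swd = row
    return 5.0 + 0.3 * eni + 0.05 * poverty + 0.02 * ell + 0.2 * swd + 0.001 * eni * swd


names = ["ENI", "% Poverty", "% ELL", "% SWD"]
school = [90.0, 85.0, 20.0, 25.0]      # made-up school
reference = [70.0, 70.0, 12.0, 20.0]   # made-up baseline

phi = shapley_values(toy_model, school, reference)
ranking = sorted(zip(names, phi), key=lambda pair: abs(pair[1]), reverse=True)
for name, contribution in ranking:
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to `toy_model(school) - toy_model(reference)`, which is the additivity property that lets SHAP magnitudes be compared across features.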
Results
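XGBoost was the most reliable model overall: among the models in the table it had the lowest test RMSE (10.736) and the highest test R² (0.526), with a moderate train-to-test RMSE gap of 1.456. MLR had the smallest gap (0.342) but a higher test error. Random Forest showed the most severe overfitting: its train R² of 0.857 collapsed to 0.466 on test, a warning sign even after cross-validation pushed toward greater regularization.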
| Model | Train RMSE | Tune RMSE | Test RMSE | Test R² |
|---|---|---|---|---|
| MLR | 10.608 | 10.572 | 10.950 | 0.506 |
| RF | 5.769 | 10.026 | 11.386 | 0.466 |
| XGBoost | 9.280 | 9.697 | 10.736 | 0.526 |
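The train-to-test gaps can be read off the table with a few lines of arithmetic. The sketch below also runs a consistency check: if test R² was computed as 1 − SSE/SST on the test set (an assumption, since the write-up does not give the formula), then each row implies the same standard deviation of the test outcome, sqrt(RMSE² / (1 − R²)).

```python
from math import sqrt

# (train RMSE, test RMSE, test R-squared) copied from the results table.
results = {
    "MLR": (10.608, 10.950, 0.506),
    "RF": (5.769, 11.386, 0.466),
    "XGBoost": (9.280, 10.736, 0.526),
}

# Train-to-test RMSE gap per model.
gaps = {
    model: round(test_rmse - train_rmse, 3)
    for model, (train_rmse, test_rmse, _) in results.items()
}

# Outcome standard deviation implied by each row, assuming R2 = 1 - MSE / Var(y).
implied_sd = {
    model: sqrt(test_rmse ** 2 / (1.0 - r2))
    for model, (_, test_rmse, r2) in results.items()
}

for model in results:
    print(f"{model}: gap={gaps[model]:.3f}, implied test SD={implied_sd[model]:.2f}")
# Gaps: MLR 0.342, RF 5.617, XGBoost 1.456.
# All three rows imply a test-set standard deviation of roughly 15.6.
```

The three implied standard deviations agree closely, which suggests the reported RMSE and R² columns are internally consistent.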
On feature importance, ENI was the dominant predictor by SHAP magnitude—consistent with its role as an aggregate economic need measure. % Students with Disabilities ranked second in XGBoost, ahead of % Poverty and % ELL, suggesting it captures independent structural barriers to attendance (medical appointments, health-related absences) not reducible to economic need.
Reflections
The most significant constraint was a data quality issue: in 2017–18, NYC DOE implemented a new matching process that caused ENI’s modal value to jump from ~72 to ~91—a methodological shift, not a real change in economic conditions. This is the primary driver of the tune-to-test degradation across all models, and it’s why a temporal split was essential. A random split would artificially suppress test error by leaking post-2017 observations into training.
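A reporting shift of this kind can be surfaced with a simple per-year check before any modeling. The sketch below computes the modal ENI (rounded to whole points) for each school year and flags year-over-year jumps. The rows, the field names, and the 10-point threshold are all hypothetical; the toy values are chosen only to mimic the ~72 to ~91 jump described above.

```python
from collections import Counter, defaultdict

# Invented (year, ENI) rows; not taken from the NYC DOE data.
rows = [
    {"year": "2015-16", "eni": 72.1}, {"year": "2015-16", "eni": 71.9},
    {"year": "2015-16", "eni": 70.0},
    {"year": "2016-17", "eni": 72.2}, {"year": "2016-17", "eni": 71.8},
    {"year": "2016-17", "eni": 72.4}, {"year": "2016-17", "eni": 65.0},
    {"year": "2016-17", "eni": 80.1},
    {"year": "2017-18", "eni": 91.0}, {"year": "2017-18", "eni": 90.6},
    {"year": "2017-18", "eni": 91.3}, {"year": "2017-18", "eni": 84.0},
    {"year": "2017-18", "eni": 95.2},
]


def modal_eni_by_year(records):
    """Most common ENI value, rounded to a whole point, for each school year."""
    by_year = defaultdict(list)
    for record in records:
        by_year[record["year"]].append(round(record["eni"]))
    return {
        year: Counter(values).most_common(1)[0][0]
        for year, values in sorted(by_year.items())
    }


def flag_jumps(modes, threshold=10):
    """Return (previous year, year, change) where the mode moves by >= threshold."""
    years = list(modes)
    return [
        (prev, cur, modes[cur] - modes[prev])
        for prev, cur in zip(years, years[1:])
        if abs(modes[cur] - modes[prev]) >= threshold
    ]


modes = modal_eni_by_year(rows)
jumps = flag_jumps(modes)
print(modes)
print(jumps)
```

A flagged year is a natural candidate for the boundary of a temporal split, since rows on either side of it are not measured on the same scale.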
Future directions: a Year × ENI interaction term to explicitly model this temporal instability, and a graduation outcome extension using NYC DOE High School Performance Directories to trace how structural absenteeism drivers translate into longer-term student outcomes.
