Structural Predictors of Chronic Absenteeism in NYC Schools

Overview

Analyzed structural predictors of chronic absenteeism across NYC public schools as a final project for MTH 4330 at Baruch College (Spring 2026), with Daisy Chen, Khushi Gandhi, and Safa Tahir. The project frames school-level absenteeism prediction as a supervised regression task using publicly available NYC DOE attendance and demographic data from 2013–2018 (7,607 observations). Rather than modeling individual behavior, we focused on four school-level structural features: Economic Need Index (ENI), % Poverty, % English Language Learners (ELL), and % Students with Disabilities. The goal was to understand which institutional conditions are most strongly associated with high chronic absenteeism rates.

GitHub


Technical Approach

Data & Splits

Two NYC DOE Open Data sources were merged on school identifier (DBN) and year. Splits follow a temporal structure to simulate a realistic forecasting scenario: training on 2013–2016, tuning on 2016–2017, and testing on 2017–2018. Splitting by time also keeps the test year entirely out of training. NYC DOE changed its ENI and poverty reporting methodology in 2017–18, which caused a sharp discontinuity in those features. A temporal split prevents post-change observations from leaking into training, and the discontinuity explains the tune-to-test error divergence seen across all models.
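
A minimal sketch of the merge and temporal split described above, in plain Python. The field names (dbn, year, eni, pct_chronically_absent), the school-year labels, and the records themselves are assumptions made for illustration; the write-up gives only the year ranges, and the project's actual code may look quite different.

```python
# Assumed school-year labels: the write-up gives only the ranges
# 2013-2016 (train), 2016-2017 (tune), and 2017-2018 (test).
SPLIT_BY_YEAR = {
    "2013-14": "train",
    "2014-15": "train",
    "2015-16": "train",
    "2016-17": "tune",
    "2017-18": "test",
}


def merge_on_dbn_year(attendance, demographics):
    """Inner-join two lists of dict records on the (dbn, year) key."""
    demo_by_key = {(row["dbn"], row["year"]): row for row in demographics}
    merged = []
    for row in attendance:
        match = demo_by_key.get((row["dbn"], row["year"]))
        if match is not None:  # keep only school-years present in both sources
            merged.append({**match, **row})
    return merged


def temporal_split(rows):
    """Assign each merged row to train, tune, or test by school year alone."""
    splits = {"train": [], "tune": [], "test": []}
    for row in rows:
        splits[SPLIT_BY_YEAR[row["year"]]].append(row)
    return splits


# Placeholder records (made-up DBNs and values, not project data).
attendance = [
    {"dbn": "00X001", "year": "2014-15", "pct_chronically_absent": 30.0},
    {"dbn": "00X001", "year": "2016-17", "pct_chronically_absent": 28.0},
    {"dbn": "00X001", "year": "2017-18", "pct_chronically_absent": 33.0},
    {"dbn": "00X002", "year": "2017-18", "pct_chronically_absent": 12.0},
]
demographics = [
    {"dbn": "00X001", "year": "2014-15", "eni": 70.0},
    {"dbn": "00X001", "year": "2016-17", "eni": 72.0},
    {"dbn": "00X001", "year": "2017-18", "eni": 91.0},
]

splits = temporal_split(merge_on_dbn_year(attendance, demographics))
print({name: len(rows) for name, rows in splits.items()})
```

Because rows are assigned by year alone, no 2017–18 record can reach the training or tuning sets, which is the property a random split would not guarantee.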

Models

Four models were evaluated in order of complexity:

  • Multiple Linear Regression — OLS baseline on all four predictors
  • Polynomial Regression — ENI term extended to degrees 1–10, other predictors kept linear
  • Random Forest — 5-fold CV tuning over mtry (1–4) and minimum node size (2–20); 500 trees
  • XGBoost — 5-fold CV tuning over number of trees, tree depth, and learning rate
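
As an illustration of the polynomial regression entry in the list above, the sketch below expands only the ENI term to a given degree, keeps the other predictors linear, and picks the degree with the lowest RMSE on a tuning split. It uses NumPy least squares on synthetic data. Selecting the degree by tune RMSE, rescaling ENI to [0, 1], and all function names are assumptions rather than details confirmed by the write-up.

```python
import numpy as np


def eni_poly_design(eni, others, degree):
    """Columns: intercept, ENI^1..ENI^degree, then the other predictors (linear)."""
    powers = [eni ** d for d in range(1, degree + 1)]
    return np.column_stack([np.ones_like(eni), *powers, others])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def select_degree(train, tune, degrees=range(1, 11)):
    """Fit OLS per degree on train; return (best degree, tune RMSE by degree)."""
    tune_rmse = {}
    for degree in degrees:
        X_train = eni_poly_design(train["eni"], train["others"], degree)
        coef, *_ = np.linalg.lstsq(X_train, train["y"], rcond=None)
        X_tune = eni_poly_design(tune["eni"], tune["others"], degree)
        tune_rmse[degree] = rmse(tune["y"], X_tune @ coef)
    best = min(tune_rmse, key=tune_rmse.get)
    return best, tune_rmse


# Synthetic stand-in data (not project data): the outcome is quadratic in ENI.
rng = np.random.default_rng(0)


def make_split(n):
    eni = rng.uniform(0.0, 1.0, n)  # ENI rescaled to [0, 1] (assumption)
    others = rng.uniform(0.0, 1.0, (n, 3))  # stand-ins for the other three predictors
    noise = rng.normal(0.0, 2.0, n)
    y = 10 + 30 * eni ** 2 + others @ np.array([5.0, 3.0, 4.0]) + noise
    return {"eni": eni, "others": others, "y": y}


train, tune = make_split(400), make_split(150)
best_degree, tune_rmse = select_degree(train, tune)
print(best_degree, round(tune_rmse[best_degree], 3))
```

Rescaling ENI before raising it to the tenth power keeps the design matrix from becoming badly conditioned; with raw values near 100 the highest-order column would reach roughly 10^20.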

SHAP (TreeSHAP) values were computed on the test set for both ensemble models to examine feature-level contributions.


Results

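XGBoost was the most reliable model overall: it had the lowest test RMSE (10.736) and the highest test R² (0.526), and its train-to-test RMSE gap (9.280 to 10.736) was far smaller than Random Forest's. MLR was the most stable from train to test (10.608 to 10.950) but finished with slightly higher test error. Random Forest showed the most severe overfitting: its train R² of 0.857 collapsed to 0.466 on test, a warning sign given that cross-validation had already pushed the tuning toward greater regularization.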

Model     Train RMSE   Tune RMSE   Test RMSE   Test R²
MLR       10.608       10.572      10.950      0.506
RF         5.769       10.026      11.386      0.466
XGBoost    9.280        9.697      10.736      0.526
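
The generalization gaps can be read directly off the reported RMSE values. The short calculation below is only arithmetic on the numbers in the table; the variable names are illustrative.

```python
# (train, tune, test) RMSE per model, copied from the results table.
reported_rmse = {
    "MLR": (10.608, 10.572, 10.950),
    "RF": (5.769, 10.026, 11.386),
    "XGBoost": (9.280, 9.697, 10.736),
}

gaps = {}
for model, (train, tune, test) in reported_rmse.items():
    gaps[model] = {
        "train_to_test": round(test - train, 3),
        "tune_to_test": round(test - tune, 3),
    }

for model, gap in gaps.items():
    print(
        f"{model:<8} train->test {gap['train_to_test']:.3f}"
        f"  tune->test {gap['tune_to_test']:.3f}"
    )
```

Every tune-to-test gap is positive (0.378 for MLR, 1.360 for RF, 1.039 for XGBoost), and Random Forest's train-to-test gap of 5.617 is by far the largest.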

On feature importance, ENI was the dominant predictor by SHAP magnitude, consistent with its role as an aggregate measure of economic need. % Students with Disabilities ranked second in XGBoost, ahead of % Poverty and % ELL, suggesting it captures independent structural barriers to attendance (such as medical appointments and health-related absences) that are not reducible to economic need.
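
One common way to rank features by SHAP magnitude is the mean absolute SHAP value per feature across the evaluated rows. The sketch below assumes that definition, which the write-up does not spell out. The matrix is an invented placeholder that exists only to exercise the function; it says nothing about the project's actual SHAP values.

```python
import numpy as np

FEATURES = ["ENI", "% Poverty", "% ELL", "% Students with Disabilities"]


def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| over rows (rows = schools, columns = features)."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest mean |SHAP| first
    return [(feature_names[i], float(importance[i])) for i in order]


# Invented placeholder SHAP matrix (3 rows x 4 features), NOT project output.
toy_shap = np.array([
    [4.0, -1.0, 0.5, -2.0],
    [-3.0, 0.5, -0.25, 1.5],
    [5.0, -0.5, 0.75, -2.5],
])

ranking = rank_by_mean_abs_shap(toy_shap, FEATURES)
for name, value in ranking:
    print(f"{name:<30} {value:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes predictions up for some schools and down for others would otherwise cancel toward zero and look unimportant.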


Reflections

The most significant constraint was a data quality issue: in 2017–18, NYC DOE implemented a new matching process that caused ENI’s modal value to jump from ~72 to ~91. This was a methodological shift, not a real change in economic conditions. It is the primary driver of the tune-to-test degradation across all models, and it is why a temporal split was essential: a random split would artificially suppress test error by leaking post-change observations into training.

Future directions: a Year × ENI interaction term to explicitly model this temporal instability, and a graduation outcome extension using NYC DOE High School Performance Directories to trace how structural absenteeism drivers translate into longer-term student outcomes.
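
One possible encoding of the Year × ENI interaction, sketched with NumPy: an indicator for the 2017–18 school year multiplied by ENI, so that a model can fit a separate ENI slope after the reporting change. Treating year as a single post-change indicator, the function name, and the example values are all assumptions; a numeric or fully categorical year would be equally valid choices.

```python
import numpy as np


def add_year_eni_interaction(eni, year, shift_year="2017-18"):
    """Return design-matrix columns [ENI, post_shift, ENI * post_shift]."""
    eni = np.asarray(eni, dtype=float)
    post_shift = (np.asarray(year) == shift_year).astype(float)  # 1.0 only in the shift year
    return np.column_stack([eni, post_shift, eni * post_shift])


# Placeholder values (not project data).
eni = [70.0, 72.0, 91.0]
year = ["2015-16", "2016-17", "2017-18"]

X = add_year_eni_interaction(eni, year)
print(X)
```

The interaction coefficient can only be estimated if the fitting data contains some post-change rows, so this extension would likely need at least one additional year of data beyond 2017–18 or a different split design.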