Structural Predictors of Chronic Absenteeism in NYC Schools

Overview

This project analyzed structural predictors of chronic absenteeism across NYC public schools. It was completed as a final project for MTH 4330 at Baruch College (Spring 2026) with Daisy Chen, Khushi Gandhi, and Safa Tahir. The project frames school-level absenteeism prediction as a supervised regression task using publicly available NYC DOE attendance and demographic data from 2013–2018 (7,607 observations). Rather than modeling individual behavior, we focused on four school-level structural features: Economic Need Index (ENI), % Poverty, % English Language Learners, and % Students with Disabilities. The goal was to understand which institutional conditions are most strongly associated with high chronic absenteeism rates.

GitHub

Technical Approach

Data & Splits

Two NYC DOE Open Data sources were merged on school identifier (DBN) and year. Splits follow a temporal structure to simulate a realistic forecasting scenario: training on 2013–2016, tuning on 2016–2017, and testing on 2017–2018. Splitting by year also keeps post-shift observations out of training: NYC DOE changed its ENI and poverty reporting methodology in 2017–18, which caused a sharp discontinuity in those features and explains the tune-to-test error divergence seen across all models.
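The merge and temporal split described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the rows, the column names (`dbn`, `year`, `eni`, `pct_chronically_absent`), and the mapping of school years to splits are all assumptions made for the example.

```python
# Illustrative records only: DBNs, column names, and values are invented.
attendance = [
    {"dbn": "01M015", "year": "2013-14", "pct_chronically_absent": 31.0},
    {"dbn": "01M015", "year": "2016-17", "pct_chronically_absent": 29.5},
    {"dbn": "01M015", "year": "2017-18", "pct_chronically_absent": 33.2},
    {"dbn": "02M047", "year": "2015-16", "pct_chronically_absent": 18.4},
]
demographics = [
    {"dbn": "01M015", "year": "2013-14", "eni": 72.0},
    {"dbn": "01M015", "year": "2016-17", "eni": 73.5},
    {"dbn": "01M015", "year": "2017-18", "eni": 91.0},
    {"dbn": "02M047", "year": "2014-15", "eni": 55.0},  # no attendance match
]

# Assumed reading of the split: three school years train, one tunes, one tests.
SPLIT_BY_YEAR = {
    "2013-14": "train", "2014-15": "train", "2015-16": "train",
    "2016-17": "tune",
    "2017-18": "test",
}


def merge_on_dbn_year(left, right):
    """Inner join two lists of dicts on the (dbn, year) key."""
    right_index = {(r["dbn"], r["year"]): r for r in right}
    merged = []
    for row in left:
        match = right_index.get((row["dbn"], row["year"]))
        if match is not None:
            merged.append({**row, **match})
    return merged


def temporal_split(rows):
    """Assign each merged row to train/tune/test by school year alone."""
    splits = {"train": [], "tune": [], "test": []}
    for row in rows:
        splits[SPLIT_BY_YEAR[row["year"]]].append(row)
    return splits


merged = merge_on_dbn_year(attendance, demographics)
splits = temporal_split(merged)
```

Because assignment depends only on the year, no 2017–18 row can end up in the training set, which is the property a random split would not guarantee.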

Models

Four models were evaluated in order of complexity:

Multiple Linear Regression — OLS baseline on all four predictors
Polynomial Regression — ENI term extended to degrees 1–10, other predictors kept linear
Random Forest — 5-fold CV tuning over mtry (1–4) and minimum node size (2–20); 500 trees
XGBoost — 5-fold CV tuning over number of trees, tree depth, and learning rate
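To make the polynomial specification concrete, here is a minimal sketch of how one design-matrix row could be built when only the ENI term is raised to higher powers and the other three predictors stay linear. The function name, the argument order, and the choice to express predictors on a 0–1 scale are assumptions for illustration, not details taken from the project code.

```python
def polynomial_row(eni, poverty, ell, swd, degree):
    """Build one design row: intercept, ENI^1..ENI^degree, then the linear terms."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    eni_terms = [eni ** d for d in range(1, degree + 1)]
    return [1.0] + eni_terms + [poverty, ell, swd]


# Hypothetical school with predictors on a 0-1 scale (an assumption).
row = polynomial_row(eni=0.5, poverty=0.8, ell=0.1, swd=0.2, degree=3)
print(row)  # [1.0, 0.5, 0.25, 0.125, 0.8, 0.1, 0.2]
```

At degree 1 the row reduces to the same columns as the linear baseline, and each extra degree adds exactly one column, so a degree-10 fit has 14 columns including the intercept.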

SHAP (TreeSHAP) values were computed on the test set for both ensemble models to examine feature-level contributions.
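For readers unfamiliar with what a SHAP value represents, the sketch below computes exact Shapley values for a single row by brute-force enumeration of feature subsets. It is not TreeSHAP, which obtains these quantities efficiently from the tree structure, and it is not the project's code. The toy model, its coefficients, the input row, and the single baseline row are all hypothetical; real SHAP tooling averages over a background dataset rather than one reference point.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row by enumerating every feature subset.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model, not the fitted XGBoost or Random Forest.
def toy_model(row):
    eni, poverty, ell, swd = row
    return 10.0 + 20.0 * eni + 1.0 * poverty + 0.5 * ell + 5.0 * swd + 4.0 * eni * swd


x = [0.9, 0.8, 0.2, 0.3]          # invented school: ENI, poverty, ELL, SWD
baseline = [0.5, 0.5, 0.1, 0.2]   # invented reference school
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property that makes mean absolute SHAP values a natural way to rank features.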

Results

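XGBoost was the strongest model overall, with the lowest test RMSE (10.736) and the highest test R² (0.526). Its train-to-test RMSE gap (1.46) was far smaller than Random Forest’s (5.62), although MLR’s gap was the smallest (0.34). Random Forest showed the most severe overfitting: train R² of 0.857 collapsed to 0.466 on test, a warning sign even though cross-validation had already pushed the tuning toward greater regularization.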

Model	Train RMSE	Tune RMSE	Test RMSE	Test R²
MLR	10.608	10.572	10.950	0.506
RF	5.769	10.026	11.386	0.466
XGBoost	9.280	9.697	10.736	0.526
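The generalization gaps can be read directly off the table. The short sketch below uses only the RMSE values reported above; the helper name and the dictionary layout are illustrative choices.

```python
# RMSE values copied from the results table, in (train, tune, test) order.
rmse = {
    "MLR": (10.608, 10.572, 10.950),
    "RF": (5.769, 10.026, 11.386),
    "XGBoost": (9.280, 9.697, 10.736),
}


def gaps(train, tune, test):
    """Return how much the test RMSE exceeds the train and tune RMSE."""
    return {
        "train_to_test": round(test - train, 3),
        "tune_to_test": round(test - tune, 3),
    }


summary = {model: gaps(*values) for model, values in rmse.items()}
best_test = min(rmse, key=lambda m: rmse[m][2])

for model, g in summary.items():
    print(model, g)
print("lowest test RMSE:", best_test)
```

On these numbers the train-to-test gap is 0.342 for MLR, 1.456 for XGBoost, and 5.617 for Random Forest, while XGBoost has the lowest test RMSE. Test error exceeds tune error for every model, consistent with the 2017–18 reporting shift.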

On feature importance, ENI was the dominant predictor by SHAP magnitude—consistent with its role as an aggregate economic need measure. % Students with Disabilities ranked second in XGBoost, ahead of % Poverty and % ELL, suggesting it captures independent structural barriers to attendance (medical appointments, health-related absences) not reducible to economic need.

Reflections

The most significant constraint was a data quality issue: in 2017–18, NYC DOE implemented a new matching process that caused ENI’s modal value to jump from ~72 to ~91—a methodological shift, not a real change in economic conditions. This is the primary driver of the tune-to-test degradation across all models, and it’s why a temporal split was essential. A random split would artificially suppress test error by leaking post-2017 observations into training.

Future directions: a Year × ENI interaction term to explicitly model this temporal instability, and a graduation outcome extension using NYC DOE High School Performance Directories to trace how structural absenteeism drivers translate into longer-term student outcomes.


Sonia Tyburczy