Education Data Science

Using LLMs To Improve Formulaic Readability Measures: A RoBERTa Approach

Project Year
2024
Abstract

Measuring text complexity is a crucial aspect of how we define and operationalize reading levels and metrics in K-12 curricula and beyond. Applying large language models (LLMs) to measure readability and complexity offers a new opportunity to move beyond older, more rigid formulaic metrics. This project explores the effectiveness of a Robustly Optimized BERT Pretraining Approach (RoBERTa) language model in assessing the readability of textual passages in comparison to human-rated scores. Using the CLEAR Corpus dataset, which comprises passages annotated with readability scores by teachers, we fine-tune the RoBERTa model to predict these scores, aiming to understand the nuances of text complexity as interpreted by machine learning. The research follows a systematic methodology involving exploratory data analysis (EDA) to identify key textual features, followed by rigorous model training. Through quantitative metrics such as Root Mean Squared Error (RMSE) and Pearson correlation, we evaluate the model's predictive performance and how accurately it captures the features of readability. The findings reveal an improvement of up to 23% over formulaic readability measures, highlighting the model's capacity and accuracy as an automated readability assessment tool.
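
To make the methodology concrete, below is a minimal sketch (not the project's actual code) of how RoBERTa can be fine-tuned as a single-output regressor on CLEAR-style readability scores and then evaluated with RMSE and Pearson correlation. The file path and column names ("excerpt", "target") are illustrative assumptions, as are the training hyperparameters.

```python
# Sketch: fine-tune RoBERTa to predict human-rated readability scores,
# then report RMSE and Pearson correlation on a held-out split.
import numpy as np
import pandas as pd
import torch
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import (RobertaTokenizerFast, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

df = pd.read_csv("clear_corpus.csv")  # hypothetical local copy of the CLEAR Corpus
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

class ReadabilityDataset(Dataset):
    """Wraps tokenized passages and their float readability labels."""
    def __init__(self, frame):
        self.enc = tokenizer(frame["excerpt"].tolist(), truncation=True,
                             padding="max_length", max_length=256)
        self.labels = frame["target"].astype("float32").tolist()
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# num_labels=1 with float labels puts the model in regression mode (MSE loss).
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = preds.squeeze()
    rmse = float(np.sqrt(np.mean((preds - labels) ** 2)))
    return {"rmse": rmse, "pearson": float(pearsonr(preds, labels)[0])}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-readability",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ReadabilityDataset(train_df),
    eval_dataset=ReadabilityDataset(val_df),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports eval RMSE and Pearson correlation
```

Framing readability prediction as regression (rather than classification into discrete grade levels) mirrors the continuous, teacher-annotated scores in the CLEAR Corpus and allows direct comparison with formulaic measures via RMSE and correlation.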

EDS Students

Anvit Garg
Class: 2024
Areas of interest: Edtech, Ethical AI in Education, Workforce Upskilling, and Education System Reform