AI, Empathy and Measuring Learning: Tilli x DataGood @ Berkeley Collaboration

- Authored by the brilliant team at DataGood, this blog captures their experience working on the Tilli project. -


In spring 2025, DataGood @ Berkeley teamed up with Tilli to take on a deceptively tough question: how can AI actually help teachers grade messy, handwritten K–5 math, at scale and with empathy? Meet the DataGood team below!


A Research-Led Approach to EdTech Innovation


Tilli inspired us to investigate which models would most effectively power their grading MVP: one that operates in low-bandwidth environments (e.g., WhatsApp), handles diverse handwriting inputs, and delivers actionable insights grounded in both academic performance and social-emotional indicators.


This project was all about the groundwork that comes first: the data experimentation, the computational benchmarking, and the analytical scaffolding that make smart, scalable tools possible.


Over the semester, we built a rigorous model evaluation pipeline, tested a range of models on messy, real-world inputs, and distilled practical insights to guide Tilli’s next steps. Our full-scale benchmarking sprint compared:

  • OCR models like TrOCR and TrOCR-Math (HuggingFace)

  • Multimodal giants like Gemini 2.0 and GPT-4o

  • Advanced LLMs like Claude Sonnet, Claude Opus, OpenAI o1, o3-mini, and more
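
To keep those comparisons apples-to-apples, every model was scored against the same ground-truth transcriptions. As a flavor of what that evaluation pipeline looked like, here is a simplified sketch of a scoring loop using character error rate (CER); the `models` and `samples` interfaces are placeholders we chose for illustration, not our actual benchmark harness.

```python
# Simplified sketch of an OCR benchmarking loop (illustrative only).
# Any callable that maps an image path to a transcription can be a "model".

from typing import Callable, Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def char_error_rate(prediction: str, truth: str) -> float:
    """CER = edit distance divided by the length of the ground truth."""
    return levenshtein(prediction, truth) / max(len(truth), 1)


def benchmark(models: Dict[str, Callable[[str], str]],
              samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Average CER per model over (image_path, ground_truth) pairs."""
    return {
        name: sum(char_error_rate(transcribe(path), truth)
                  for path, truth in samples) / len(samples)
        for name, transcribe in models.items()
    }
```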


What We Learned

Three takeaways kept surfacing:

  • 📷 Garbage in, garbage out. Even the smartest model collapses under noisy inputs. That’s why preprocessing (think binarization, deskewing, noise reduction) turned out to be essential. We proposed a fast image-quality checkpoint before any OCR happens; see the sketch after this list.


  • 🧠 Correctness alone isn’t enough. Teachers need to know why a student made a mistake. So we looked at ways to flag error types, whether they were conceptual, procedural, or just plain careless, to lay the groundwork for better feedback systems.


  • 🔄 Split the load. We found that a hybrid architecture, with lightweight models for OCR and heavy-duty ones for reasoning, strikes the sweet spot between affordability and depth, especially when deployment lives in bandwidth-tight environments.
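
To make that image-quality checkpoint concrete, here is a minimal sketch of the kind of gate-then-clean pass we had in mind, written against OpenCV. The thresholds, the variance-of-Laplacian blur test, and the deskew step are assumptions for illustration, not tuned production values.

```python
# Minimal sketch of a pre-OCR quality gate plus cleanup pass, using OpenCV.
# Thresholds, the blur metric, and the deskew step are illustrative choices.

import cv2
import numpy as np


def passes_quality_check(gray: np.ndarray,
                         min_side: int = 300,
                         min_sharpness: float = 100.0) -> bool:
    """Cheap gate before any OCR: reject tiny or badly blurred photos.

    Variance of the Laplacian is a standard focus measure; low values
    mean the pencil strokes are too smeared to recognize reliably.
    """
    if min(gray.shape[:2]) < min_side:
        return False
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= min_sharpness


def preprocess(gray: np.ndarray) -> np.ndarray:
    """Denoise, binarize, and deskew a grayscale worksheet photo."""
    denoised = cv2.medianBlur(gray, 3)  # noise reduction
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Estimate page skew from the ink pixels. The angle convention here
    # assumes OpenCV >= 4.5, where minAreaRect returns an angle in (0, 90].
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90  # take the smaller of the two possible rotations

    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary, rotation, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```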


Why It Matters

This strategic model evaluation reduces risk, improves reliability, and accelerates Tilli’s product-to-pilot trajectory, keeping every recommendation grounded in data and pedagogy.


Our final handoff included:

  • A hybrid inference pipeline design for OCR + feedback (sketched after this list)

  • A data preprocessing & quality control framework

  • A rubric for scalable model selection, tailored for classrooms, not just labs
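
As a flavor of that handoff, here is a minimal sketch of the hybrid split and the error-type tagging described above. The interfaces and the `Feedback` shape are placeholders we chose for illustration, not Tilli’s production design.

```python
# Minimal sketch of the hybrid OCR + reasoning split (interfaces and the
# Feedback shape are placeholders for illustration).

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ErrorType(Enum):
    """Coarse taxonomy for why an answer was wrong."""
    NONE = "correct"
    CONCEPTUAL = "conceptual"   # misunderstood the underlying idea
    PROCEDURAL = "procedural"   # right idea, wrong steps
    CARELESS = "careless"       # a slip, e.g. a miscopied digit


@dataclass
class Feedback:
    transcription: str
    correct: bool
    error_type: ErrorType
    note: str  # short, teacher-facing explanation


def grade(image_path: str,
          ocr: Callable[[str], str],
          reason: Callable[[str], Feedback]) -> Feedback:
    """Stage 1: a small, cheap OCR model transcribes the student's work.
    Stage 2: a larger reasoning model grades it and tags the error type.
    """
    transcription = ocr(image_path)  # lightweight; can run near the edge
    return reason(transcription)     # heavyweight; invoked only when needed
```

Keeping the stages separate means the cheap OCR step can run on-device or in batches, while the expensive reasoning call happens only when a response actually needs explaining, which is what makes the design viable in bandwidth-tight deployments.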


While these tools aren’t productized yet, they carve a clear path forward for an AI system that’s lightweight, teacher-friendly, and smart enough to read both numbers and nuance!


A Note of Thanks

The Tilli team inspired us with their unwavering focus on students, teachers, and accessibility. By identifying which models deliver the best balance of speed, accuracy, and SEL-awareness, we’ve helped lay the groundwork for AI-powered grading that’s both scalable and empathetic.


Tilli's clarity of purpose and focus on teacher-centered design made this one of our most rewarding collaborations to date! This project provided a roadmap towards a tool that could one day meaningfully support teachers, especially in low-bandwidth classrooms. We’re grateful to have played a part in that vision, and we can’t wait to see where they take it next.


Want to follow the brilliant minds behind this collaboration?

Here are the LinkedIn profiles of the DataGood team members: