Jun 5, 2025 | 🚀 SPARKLE preprint is now live on arXiv! Reinforcement learning has driven impressive gains in LLM reasoning—but what exactly does RL improve? SPARKLE answers this question with a fine-grained evaluation framework that dissects reasoning into plan-following, problem decomposition, and knowledge use. The results are surprising: explicit plans can actually hurt on the hardest problems, yet RL-tuned models remain far more robust and flexible in handling them. We also find that RL delivers clear gains in knowledge integration. And we push back on a common myth: hard problems can be useful for RL—even when they seem unrewarding. SPARKLE shows how to turn those tough cases into real training signal. |
Apr 30, 2025 | 🚀 COSMOS preprint is now available on arXiv! With training-time and test-time adaptation strategies for LLMs exploding in number, figuring out the best one can feel like a wild goose chase. COSMOS makes it easy — predicting performance and cost accurately and efficiently so you don’t have to burn GPU hours testing every option. Smarter choices, fewer experiments. |
Apr 23, 2025 | I passed my qualifying exam! |
Dec 9, 2024 | Attended NeurIPS 2024 in Vancouver and presented two papers. SpatialEval: a fresh look at how language models and vision-language models handle spatial reasoning, evaluated on our new benchmark SpatialEval across TQA, VQA, and VTQA—with some pretty surprising results! GAD: ever wondered whether constrained decoding changes how LLMs actually behave? We proved it does, and came up with the first solution to the resulting distribution distortion problem. |
Sep 25, 2024 | Two first/co-first authored papers are accepted to NeurIPS 2024! 🎉🎉 |