Research
Published papers on AI systems, agents, and self-improvement.
5 Papers
4 ICML 2026
SIA: Self Improving AI with Harness & Weight Updates
Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran
SIA is a language-model agent that simultaneously modifies both task-specific agent scaffolding (tools, prompts, retry mechanisms) and model weights. Tests across legal charge classification, GPU kernel optimization, and RNA denoising show that combining the two improvement methods surpasses prior approaches, with a 25.1% gain over prior SOTA on LawBench and 12.4% faster GPU kernels.
SIA-W: Self-Improving Agents with Test-Time Weight Updates
Prannay Hebbar, Samuel Verboomen, Selvam Palanimalai, Yogendra Manawat, Kunal Bhatia, Vignesh Baskaran
A framework for autonomous self-refinement that evolves the agent's foundational structure (utilities, instructions, workflows) and, once that structure stabilizes, applies test-time reinforcement learning to fine-tune the model weights. Evaluated on LawBench (+16 pp), GPU kernel optimization (−19% runtime), and biological data denoising.
Adaptive Proxy Evaluation for Autonomously Improving ML Agents
Vignesh Baskaran, Prannay Hebbar, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Yogendra Manawat
Addresses the cost/reliability tradeoff in proxy evaluations for autonomous ML systems. Proposes an adaptive proxy that evaluates every candidate under identical conditions and progressively raises fidelity as search converges. MLEvolve (a Monte Carlo Graph Search framework requiring no task-specific tuning) achieved SOTA MAE of 0.1354 on MLE-bench Ventilator Pressure Prediction within 12 hours.
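The adaptive-proxy idea in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the slope-fitting task, the names `make_dataset`, `mae`, and `adaptive_search`, the convergence test (a small score gap between the top two candidates), and the fidelity schedule (doubling the number of evaluation rows). It is not MLEvolve and does not implement Monte Carlo Graph Search; it only shows every candidate in a round being scored on the same rows, with fidelity rising as the search converges.

```python
import random


def make_dataset(n=400, seed=0):
    """Toy regression data: y = 2x plus small Gaussian noise (hypothetical task)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


def mae(slope, rows):
    """Mean absolute error of the model y = slope * x on the given rows."""
    return sum(abs(slope * x - y) for x, y in rows) / len(rows)


def adaptive_search(data, rounds=30, start=20, gap_threshold=0.05, seed=1):
    """Random local search whose proxy fidelity grows as candidates get close.

    Returns the best slope found and the fidelity used in each round.
    """
    rng = random.Random(seed)
    fidelity = start          # number of rows used by the proxy evaluation
    best = 0.0                # initial candidate
    history = []
    for _ in range(rounds):
        # Identical conditions: every candidate in this round sees the same rows.
        subset = data[:fidelity]
        candidates = [best] + [best + rng.gauss(0.0, 0.5) for _ in range(4)]
        scored = sorted((mae(c, subset), c) for c in candidates)
        gap = scored[1][0] - scored[0][0]
        best = scored[0][1]
        history.append(fidelity)
        # Assumed convergence signal: the top two candidates are nearly tied,
        # so a cheap proxy can no longer separate them and fidelity is raised.
        if gap < gap_threshold and fidelity < len(data):
            fidelity = min(len(data), fidelity * 2)
    return best, history


if __name__ == "__main__":
    data = make_dataset()
    slope, fidelities = adaptive_search(data)
    print("best slope:", slope)
    print("fidelity per round:", fidelities)
```

Scoring the incumbent alongside the new candidates on the same subset keeps comparisons fair within a round, which is the property the abstract emphasizes; the gap-based trigger is only one plausible way to decide when a cheap proxy has stopped being informative.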
Socrates: Structured Questioning Unlocks Latent Knowledge in AI Research Agents
Damir Vrabac, Prannay Hebbar, Yogendra Manawat, Selvam Palanimalai, Samuel Verboomen, Gurusha Juneja, Kunal Bhatia, Vignesh Baskaran
Tackles the gap between agents' benchmark performance and their weak performance on practical research tasks, attributing it to insufficient knowledge activation. Proposes Socrates: a two-agent system pairing a tool-equipped Scientist with an advisor that can only ask questions. Improved Kaggle test scores on 4 of 5 MLE-bench tasks, with a mean increase of ~56%.
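A minimal sketch of the question-only pairing described above, under stated assumptions. The function `socratic_loop`, the turn structure, and the rule that an advisor message must end with a question mark are hypothetical choices for illustration; the paper's actual protocol, prompts, and tools are not specified here. Both agents are passed in as plain callables so the sketch runs without any model.

```python
from typing import Callable, List

# Each agent maps the transcript so far to its next message (assumed interface).
Agent = Callable[[List[str]], str]


def socratic_loop(scientist: Agent, advisor: Agent, task: str, turns: int = 3) -> List[str]:
    """Alternate advisor questions and scientist answers for a fixed number of turns."""
    transcript = [task]
    for _ in range(turns):
        question = advisor(transcript)
        # Assumed enforcement of the "can only ask questions" constraint.
        if not question.strip().endswith("?"):
            raise ValueError("advisor may only ask questions")
        transcript.append(question)
        # Only the scientist acts; in a real system this is where tools would run.
        transcript.append(scientist(transcript))
    return transcript


if __name__ == "__main__":
    # Stub agents standing in for language models.
    def stub_advisor(transcript: List[str]) -> str:
        return "What assumption have you not checked yet?"

    def stub_scientist(transcript: List[str]) -> str:
        return f"Answer to: {transcript[-1]}"

    for line in socratic_loop(stub_scientist, stub_advisor, "Improve the baseline model."):
        print(line)
```

Keeping the advisor behind a single validation check makes the constraint explicit: it can steer the Scientist only by prompting it to surface what it already knows, never by supplying answers or calling tools itself.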
AIE-Bench: Benchmarking Agents That Build Agents
Abhishek Mishra, Selvam Palanimalai, Yogendra Manawat, Samuel Verboomen, Prannay Hebbar, Damir Vrabac, Deepak Nathani, Sumeet Motwani, Kunal Bhatia, Vignesh Baskaran
A benchmark for evaluating whether an AI agent can modify another agent to improve it. It pairs a meta-agent, which suggests enhancements, with a target agent, which is the one being improved, and covers both meta-improvement and self-improvement scenarios across terminal-interaction and tool-calling domains.