We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) implement scientific computing workflows on astronomical data and (2) generate visualizations of the results.
We evaluate LLMs along these dimensions using the metrics shown in the leaderboard below.
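As a rough illustration of how percentage-style metrics of this kind can be aggregated, the sketch below computes the share of tasks falling into each outcome category. The label names are illustrative assumptions only, not the exact categories or scoring code used by AstroVisBench.

```python
# Minimal sketch (not the benchmark's scoring code): aggregate per-task
# outcome labels into percentage metrics. Label names are assumptions.
from collections import Counter

def percentage_metrics(labels: list[str]) -> dict[str, float]:
    """Return the percentage of tasks falling into each outcome category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * count / total for label, count in counts.items()}

# Hypothetical judge verdicts for four visualization tasks.
verdicts = ["correct", "minor_error", "major_error", "correct"]
print(percentage_metrics(verdicts))
# -> {'correct': 50.0, 'minor_error': 25.0, 'major_error': 25.0}
```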
We present the results of evaluating several LLMs on AstroVisBench in the interactive leaderboard below. If you would like to test your own models on this benchmark, the code to execute and evaluate model responses is available in our GitHub Repository.
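For a sense of the workflow, here is a hypothetical sketch of generating model completions that could then be passed to the repository's evaluation code. The task file format, field names, and output format are assumptions for illustration only; consult the repository for the actual interface.

```python
# Hypothetical response-generation loop; not the repository's actual API.
import json
from openai import OpenAI  # any chat-completion client could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("tasks.json") as f:  # hypothetical task file
    tasks = json.load(f)

responses = []
for task in tasks:
    completion = client.chat.completions.create(
        model="gpt-4o",  # the model under evaluation
        messages=[{"role": "user", "content": task["prompt"]}],  # assumed field name
    )
    responses.append({
        "task_id": task["id"],  # assumed field name
        "response": completion.choices[0].message.content,
    })

with open("model_responses.json", "w") as f:  # output consumed by the evaluator
    json.dump(responses, f, indent=2)
```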
Please cite our paper if you find our work useful:
@misc{joseph2025astrovisbenchcodebenchmarkscientific,
      title={AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy},
      author={Sebastian Antony Joseph and Syed Murtaza Husain and Stella S. R. Offner and Stéphanie Juneau and Paul Torrey and Adam S. Bolton and Juan P. Farias and Niall Gaffney and Greg Durrett and Junyi Jessy Li},
      year={2025},
      eprint={2505.20538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20538},
}
| Model | NoErr(P) % | VIscore | NoErr(V) % | CorrectV % | VisFail % | MiE % | MaE % |
|---|---|---|---|---|---|---|---|