Research 2026-04-22

STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

arXiv:2604.18177v2 (Announce Type: replace-cross)

Abstract: Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses...

Read the original article on arXiv cs.AI
