CreativityPrism: A Cross-Domain Evaluation Framework for Large Language Model Creativity
arXiv:2510.20091v3. Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no cross-domain and scalable framework to evaluate their creativity across...
A Systematic Yardstick for Machine Creativity
The CreativityPrism preprint on arXiv proposes a cross-domain, scalable framework for assessing large language model creativity. Rather than treating creativity as a monolithic, unmeasurable quality, the researchers aim to evaluate it consistently across different fields (for example, poetry and storytelling, or scientific hypothesis generation and business ideation). This moves beyond the typical binary question "is this text novel?" toward a structured, multi-dimensional assessment.
Why This Matters Now
The timing matters. LLMs like GPT-4, Claude, and Gemini are increasingly deployed in creative work such as marketing copy, code generation, product design, and academic writing, yet there is no standardized creativity benchmark for them. Current evaluations focus heavily on factual accuracy, reasoning, or instruction-following. Creativity is still judged through subjective human assessment or narrow, domain-specific tests (e.g., "write a poem about a cat"). CreativityPrism addresses this gap by proposing a framework that can be applied consistently across domains, so that the creative output of different models can be compared on the same terms.
The framework’s emphasis on "cross-domain" is particularly important. A model that generates novel scientific analogies may fail at producing compelling narrative fiction. Without a cross-domain lens, we risk overestimating a model's general creative capability based on performance in a single, narrow task. CreativityPrism forces evaluators to consider creativity as a multi-faceted construct—likely including dimensions like novelty, fluency, flexibility, elaboration, and usefulness—rather than a single score.
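To make the single-score risk concrete, here is a minimal Python sketch. The domain names, dimension names, and numbers are illustrative assumptions made up for this example; they are not results or an API from the CreativityPrism paper. It compares one overall average with a per-domain profile for a hypothetical model.

```python
from statistics import mean

# Hypothetical scores in [0, 1] for one model. All names and numbers are
# made up for illustration; they are not taken from CreativityPrism.
model_scores = {
    "scientific_analogy": {"novelty": 0.9, "usefulness": 0.8},
    "narrative_fiction": {"novelty": 0.3, "usefulness": 0.4},
}


def overall_score(scores):
    """Collapse every domain and dimension into one number."""
    return mean(v for dims in scores.values() for v in dims.values())


def domain_profile(scores):
    """Keep one averaged score per domain instead of a single number."""
    return {domain: mean(dims.values()) for domain, dims in scores.items()}


def weakest_domain(scores):
    """Return the domain with the lowest averaged score."""
    profile = domain_profile(scores)
    return min(profile, key=profile.get)


if __name__ == "__main__":
    print("overall:", round(overall_score(model_scores), 2))
    print("profile:", domain_profile(model_scores))
    print("weakest:", weakest_domain(model_scores))
```

In this toy data the overall average looks moderate, while the profile shows that the model is strong in one domain and weak in the other. That gap is what a cross-domain view is meant to expose.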
Implications for AI Practitioners
For developers and product teams, this framework offers a practical tool for model selection and fine-tuning. Instead of relying on anecdotal "vibes" about which model is more creative, teams can now run structured evaluations. This is especially valuable for applications like:
- Content generation platforms that need to ensure output doesn't become repetitive or formulaic.
- Research assistants that must generate novel hypotheses, not just rephrase existing knowledge.
- Game design and interactive fiction, where creative divergence is a core product requirement.
However, practitioners should note that CreativityPrism is a framework, not a plug-and-play benchmark. Implementing it will require careful definition of creativity dimensions relevant to their specific domain. The value lies in the structured approach, not in a single magic number.
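As one example of what defining a domain-relevant dimension could look like, a content platform worried about repetitive output might start with a simple lexical diversity measure. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a batch of generations. This is a common diversity heuristic chosen here as an assumption; it is not described in the source as part of CreativityPrism, and the function name `distinct_n` is hypothetical.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a batch of outputs.

    Values near 1.0 mean little repetition across the batch; values near
    0.0 mean the outputs reuse the same phrases. Tokenization is a plain
    lowercase whitespace split, which is a simplifying assumption.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # Guard against empty input or texts shorter than n tokens.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    formulaic = ["buy our great product today", "buy our great product now"]
    varied = ["buy our great product today", "a quiet tool for busy mornings"]
    print(distinct_n(formulaic), distinct_n(varied))
```

A measure like this covers only surface-level repetition. It says nothing about novelty of ideas or usefulness, which is why a multi-dimensional assessment would need other dimensions alongside it.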
Key Takeaways
- CreativityPrism proposes a cross-domain, scalable evaluation framework for LLM creativity, something its abstract says did not previously exist, and moves beyond subjective or narrow tests.
- The framework addresses a critical gap in current AI evaluation, which focuses on accuracy and reasoning but neglects systematic creativity measurement.
- For AI practitioners, this enables data-driven model selection and fine-tuning for creative applications, from marketing to scientific discovery.
- Implementation requires domain-specific adaptation, but the structured approach provides a much-needed foundation for comparing and improving machine creativity.