Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
arXiv:2603.20450v2. Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies...
The recent arXiv paper (2603.20450v2) tackles a growing tension in academic publishing: the gap between policy and enforcement regarding large language model (LLM) use in peer review. Several conferences and journals have adopted rules permitting LLMs only for “polishing, paraphrasing, and grammar correction” of human-written reviews, while banning their use for generating substantive content. The paper’s core finding is that these policies are currently unenforceable: there is no reliable technical method to distinguish a human-written review that an LLM lightly polished from one that an LLM substantially generated.
This matters because peer review is the bedrock of scientific quality control. If reviewers can secretly outsource critical analysis to LLMs, the integrity of the process erodes. A model might produce fluent text but miss nuanced domain-specific errors, fabricate citations, or reproduce biases from its training data, all without the reviewer’s active cognitive engagement. The “polishing exception” creates a particularly slippery slope: once you allow an LLM to rephrase a sentence, how do you audit whether the underlying ideas were also machine-generated? Current detection tools are brittle: they are easily fooled by paraphrasing, and they produce false positives on text written by non-native English speakers.
For AI practitioners, this has immediate practical implications. First, if you are a reviewer, the safest path is to avoid LLM use entirely for review writing, even for grammar. The paper suggests that any LLM involvement creates a plausible deniability problem that undermines trust. Second, if you are developing or deploying LLMs for scientific applications, the finding highlights a demand for “provenance” tools such as watermarking, cryptographic logging of model interactions, or tamper-evident review platforms. The market for auditability solutions in scientific publishing is likely to grow. Third, the paper implicitly warns against over-reliance on LLM detection as a governance strategy. The arms race between generation and detection favors generators, especially as models become more sophisticated.
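To make “cryptographic logging of model interactions” concrete, here is a minimal sketch of a hash-chained, tamper-evident log. It is not taken from the paper. The function names (`append_interaction`, `verify_log`), the entry fields, and the all-zero genesis value are hypothetical choices made for illustration, and only Python's standard `hashlib` and `json` modules are used.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_interaction(log: list, prompt: str, response: str) -> dict:
    """Append one LLM interaction to the hash-chained log and return the entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    payload = {"index": len(log), "prompt": prompt, "response": response}
    entry = {**payload, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Return True only if every entry still matches its recorded hash chain."""
    prev_hash = GENESIS_HASH
    for index, entry in enumerate(log):
        payload = {
            "index": entry["index"],
            "prompt": entry["prompt"],
            "response": entry["response"],
        }
        if entry["index"] != index or entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, payload):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_interaction(log, "Fix grammar: 'The method are sound.'", "The method is sound.")
    print(verify_log(log))          # chain is intact
    log[0]["response"] = "edited"   # simulate after-the-fact tampering
    print(verify_log(log))          # chain no longer verifies
```

Two limits are worth stating. The chain only detects edits if the latest hash is anchored somewhere the reviewer cannot rewrite, such as the review platform itself; otherwise the whole chain can simply be recomputed. It also records only the interactions routed through it, so it cannot show that no other tool was used.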
The broader lesson is that policy without enforceability is performative. Conference organizers and journal editors must either accept that LLM use is inevitable and design new review workflows around it (e.g., requiring reviewers to submit both the raw draft and the polished version), or invest in technical infrastructure that makes enforcement possible. The status quo, in which rules exist on paper with no way to check compliance, invites abuse and damages credibility.
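As a rough illustration of the raw-plus-polished workflow, the sketch below measures how much of a polished review's wording can be aligned, in order, with the submitted raw draft. It is an assumption-laden example, not a method from the paper: `tokenize` and `retained_fraction` are hypothetical helpers, the sample texts are invented, and the comparison relies on `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def retained_fraction(raw: str, polished: str) -> float:
    """Fraction of the polished review's tokens that align, in order, with the raw draft."""
    raw_tokens, polished_tokens = tokenize(raw), tokenize(polished)
    if not polished_tokens:
        return 1.0  # an empty polished text adds nothing beyond the draft
    matcher = difflib.SequenceMatcher(a=raw_tokens, b=polished_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(polished_tokens)


if __name__ == "__main__":
    # Invented sample texts, for illustration only.
    raw = "the experiments is missing a baseline and the proof of lemma 2 skip a case"
    light = "The experiments are missing a baseline, and the proof of Lemma 2 skips a case."
    heavy = "The evaluation omits several strong baselines and the theoretical analysis lacks rigor."
    print(retained_fraction(raw, light))  # most wording carried over from the draft
    print(retained_fraction(raw, heavy))  # little wording carried over from the draft
```

A low score flags a polished review whose wording departs heavily from the draft, which an editor might then inspect by hand; any cutoff would be a policy choice rather than something this sketch can justify. A high score is only a consistency check, because it cannot show that the raw draft itself was written by a human.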
Key Takeaways
- Current policies allowing “polishing only” LLM use in peer review are technically unenforceable due to the inability to distinguish light editing from substantive generation.
- The integrity of peer review is at risk if reviewers can secretly outsource analysis to LLMs, which may introduce errors, biases, or fabricated content.
- Reviewers should avoid using LLMs for review writing entirely, and developers should focus on building provenance and auditability tools for scientific workflows.
- Policy without enforceable technical mechanisms is performative; the community must either redesign review processes around LLM use or invest in infrastructure that makes enforcement possible.