Brian D. Colwell

AI Watermarking Challenges And Limitations 

Posted on June 6, 2025 by Brian Colwell

Watermarking systems face an inherent trade-off between watermark strength (measured by z-score) and text quality (measured by perplexity) across combinations of watermarking parameters: increasing robustness often reduces fidelity and capacity, and stronger watermarks that resist removal are more likely to degrade content quality. In addition, stronger watermarks may introduce detectable artifacts, while higher capacity can make watermarks easier to spot and remove.
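To make the strength side of this trade-off concrete, the detection statistic used by green-list text watermarks (the scheme popularized by Kirchenbauer et al., 2023) is a simple one-proportion z-test. The sketch below is illustrative, not any particular library's implementation, and the parameter names are my own:

```python
import math

def greenlist_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """One-proportion z-test for green-list watermark detection.

    gamma is the fraction of the vocabulary placed on the green list;
    under the null hypothesis (unwatermarked text), each token lands on
    the green list with probability gamma, so the green count follows
    Binomial(num_tokens, gamma).
    """
    expected = gamma * num_tokens
    variance = num_tokens * gamma * (1.0 - gamma)
    return (num_green - expected) / math.sqrt(variance)

# A strongly biased sampler pushes more tokens onto the green list,
# driving the z-score up -- but typically at the cost of higher perplexity.
print(round(greenlist_z_score(num_green=90, num_tokens=200, gamma=0.25), 2))
```

The trade-off is visible in the formula itself: raising the sampling bias raises `num_green` (and the z-score), but every forced green-list token is a token the model would not otherwise have chosen, which is exactly what perplexity penalizes.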

Beyond this well-known trade-off, watermark detection accuracy continues to be problematic when content undergoes legitimate modifications such as compression or format conversion. Real-time watermarking, especially for high-resolution video streams, still imposes significant computational requirements (Springer et al., 2025). Moreover, AI watermarking technologies currently suffer from high false-positive rates (up to 15–20% for text) and struggle with multi-model attribution, where content may be generated or modified by multiple AI systems (Kirchenbauer et al., 2023; Lu et al., 2024).
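For context on those false-positive figures: a detector that thresholds the z-score at some value has a theoretical false-positive rate given by the upper tail of the standard normal distribution, which is easy to compute. The gap between these idealized rates and the 15–20% observed on real, edited text is part of the problem. A minimal sketch:

```python
import math

def false_positive_rate(z_threshold: float) -> float:
    """Theoretical false-positive rate of a one-sided z-test: the
    probability that unwatermarked text exceeds the detection threshold
    purely by chance (upper tail of the standard normal)."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2.0))

# In theory a threshold of z = 4 yields a tiny false-positive rate;
# in practice, edits and distribution shift inflate it far beyond this.
for z in (2.0, 4.0):
    print(z, false_positive_rate(z))
```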

Other major challenges for watermarking systems include the difficulty of reliably watermarking short or heavily edited content, the lack of standardized detection protocols, and vulnerability to attacks such as compression, cropping, rotation, and format conversion (EFF, 2024). Robustness is also limited against adversarial attacks, diffusion purification, discreet alterations, homoglyph and zero-width attacks, generative attacks, paraphrasing, model substitution, and tokenization attacks. Jia et al. (2021) identify a fundamental limitation of existing watermarking strategies: “the watermarking task is learned separately from the primary task.” They explain that naive watermarking can be defeated by an adaptive attacker because the watermarks are outliers to the task distribution: “As long as the adversary queries the watermarked model only on inputs that are sampled from the task distribution, the stolen model will only retain the victim model’s decision surface relevant to the task distribution, and therefore ignore the decision surface learned relevant to watermarking.”
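The homoglyph and zero-width attacks mentioned above are easy to sketch: replacing Latin letters with visually identical Cyrillic code points, or interleaving zero-width characters, changes the token sequence a detector sees while leaving the rendered text unchanged to a human reader. The mappings below are illustrative examples, not an exhaustive confusables table:

```python
# Sketch of homoglyph and zero-width attacks on text watermarks.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes
ZERO_WIDTH_SPACE = "\u200b"

def homoglyph_attack(text: str) -> str:
    """Swap selected Latin letters for visually identical Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def zero_width_attack(text: str) -> str:
    """Interleave invisible zero-width spaces between every character."""
    return ZERO_WIDTH_SPACE.join(text)

original = "watermarked"
spoofed = homoglyph_attack(original)
print(original == spoofed)                               # False: different code points
print(len(zero_width_attack(original)) > len(original))  # True: invisibly padded
```

Either transformation breaks the tokenization the watermark detector relies on, which is why defenses typically start with Unicode normalization and confusable-character canonicalization before detection.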

More specifically, image watermarks can be defeated by diffusion purification, which removes up to 92% of watermarks, and by model substitution attacks, which have been shown to bypass up to 68% of image watermarks (Lu et al., 2024). Text watermarks, meanwhile, are susceptible to paraphrasing and editing: studies show that simple paraphrasing can remove or obscure watermarks in about 30% of cases (Kirchenbauer et al., 2023). Further, for text watermarking, research from ETH Zürich found that statistical watermarks can be reverse-engineered and removed with approximately 85% success rates by analyzing patterns in the AI’s outputs (MIT Technology Review, 2024).

According to Chakraborty et al. (2022), “Existing black-box watermarking techniques are ineffective against model extraction attacks” due to the limitations imposed by computational overhead and edge-device vulnerabilities: “In this new paradigm of edge intelligence, an attacker can directly query a proprietary DL model deployed in an edge device without any need to redirect the queries to a trusted cloud server, thus rendering model protection countermeasures like DAWN (Szyller et al., 2021) practically useless,” they write. More recently, Petrov et al. (2025) found that, while multiple watermarks can coexist, some methods such as RoSteALS tend to overwrite previous watermarks, reducing their accuracy by up to 76%. Finally, for cryptographic watermarking using zero-knowledge proofs, Bagad et al. (2025) note that current proof generation times (approximately 5.4 minutes) could limit scalability for massive content production, a practical challenge for large-scale deployment, though Petrov et al. have identified optimization opportunities that could potentially reduce proof generation times to seconds.

Thanks for reading!
