BeClaude
Research · 2026-05-08

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

Source: arXiv cs.AI

arXiv:2605.05331v1 (announce type: cross)

Abstract: Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside...

arxiv, papers