Research | 2026-05-08

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

arXiv:2605.05331v1 (announce type: cross). Abstract: Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside...

Read Original Article on arXiv cs.AI
