Research | 2026-05-13

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

arXiv:2605.10780v2. Announce Type: cross.

Abstract: Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer,...

