MST: Masked Self-Supervised Transformer for Visual Representation

Figure: MST structure.


To transfer high-level semantic features to downstream dense prediction tasks, we present a novel Masked Self-supervised Transformer (MST), which explicitly captures the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the structures crucial for self-supervised learning. More importantly, the masked tokens, together with the remaining tokens, are recovered by a global image decoder, which preserves the spatial information of the image and is better suited to downstream dense prediction tasks.
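To make the masking idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of attention-guided token masking: each patch token's importance is scored by the average multi-head self-attention it receives, and only low-attention tokens are candidates for masking, so highly attended (crucial) patches are left intact. The function name, the median threshold, and the mask ratio are illustrative assumptions.

```python
import numpy as np

def attention_guided_mask(attn, mask_ratio=0.15, seed=0):
    """Hypothetical sketch of attention-guided masking.

    attn: array of shape (heads, tokens, tokens), the multi-head
          self-attention map over patch tokens (rows sum to 1).
    Returns a boolean array of shape (tokens,); True = token is masked.
    """
    rng = np.random.default_rng(seed)
    # Importance of each token = mean attention it receives,
    # averaged over heads and query positions.
    importance = attn.mean(axis=(0, 1))            # (tokens,)
    n = importance.shape[0]
    # Only tokens with below-median attention may be masked
    # (median threshold is an illustrative choice).
    candidates = np.flatnonzero(importance < np.median(importance))
    k = min(int(mask_ratio * n), candidates.size)
    mask = np.zeros(n, dtype=bool)
    if k > 0:
        mask[rng.choice(candidates, size=k, replace=False)] = True
    return mask

# Toy attention map: 6 heads over 16 patch tokens.
heads, tokens = 6, 16
rng = np.random.default_rng(1)
attn = rng.random((heads, tokens, tokens))
attn /= attn.sum(axis=-1, keepdims=True)           # row-normalise like softmax
mask = attention_guided_mask(attn)
```

In a full pipeline, the masked positions would be replaced by a learnable mask token before all tokens (masked and remaining) are passed to the global image decoder for recovery.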

Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021).