ViTPose: A simple yet powerful transformer baseline for Human Pose Estimation

Figure: Visual pose estimation results of ViTPose on some test images from the MS COCO dataset.
ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking the advantages of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and finetuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.

Human Pose Estimation is a critical task in computer vision, with applications spanning from medical diagnostics and biomechanics analysis to surveillance, augmented reality and robotics.
The challenge? Precisely localizing human anatomical keypoints despite occlusion, truncation, and variations in appearance, ideally in real time. This calls for a robust model that captures the relationships between joints and adapts to varying image resolutions.
While CNNs have been the go-to approach for years, their inherent limitations in capturing long-range dependencies call for a paradigm shift. This blog presents ViTPose, a simple yet powerful Vision Transformer (ViT) baseline that achieves SOTA performance on the COCO test set with a plain, non-hierarchical architecture.


Why Vision Transformers?


Traditional CNN-based methods struggle with:

  • capturing global relationships between joints
  • complex architectures requiring multiple processing stages
  • adapting to complex posture variations
  • limited flexibility and scalability

ViTs address these challenges by leveraging global attention, making them easier to scale across different model sizes while offering superior performance with simpler architectures.

While existing transformer-based approaches still rely on CNN feature extractors, ViTPose demonstrates that a plain ViT can be highly effective for pose estimation.


ViTPose Overview

ViTPose is designed to be:

  • Simple: a plain, non-hierarchical transformer architecture
  • Scalable: ranging from 100M to 1B parameters, adapting to different computational needs
  • Flexible: works with different training strategies, attention mechanisms, datasets, and resolutions
  • Transferable: knowledge from larger models can be distilled into smaller ones

Despite its simplicity, ViTPose achieves SOTA performance.

ViTPose's Simple Architecture


Encoder: a plain, non-hierarchical transformer

  1. Input: an image of size H×W×3 is split into non-overlapping patches of size d×d×3 (d = 16 by default), giving (H/d) × (W/d) patches
  2. Embedding: each patch is projected into a token of dimension C, producing a feature map of size (H/d) × (W/d) × C (see the tokenization sketch below)
  3. Multi-Head Self-Attention (MHSA): models global relationships between all tokens, which implicitly captures dependencies between joints
  4. Transformer layers: the tokens pass through multiple stacked transformer blocks (MHSA followed by a feed-forward network)
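A minimal sketch of the tokenization step, assuming the usual ViT-style patch embedding implemented as a strided convolution; the class name PatchEmbed and the defaults (d = 16, C = 768) are illustrative, not necessarily the repository's exact API.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into d x d patches and project each to a C-dim token."""
    def __init__(self, d=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=d, stride=d)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/d, W/d)
        return x.flatten(2).transpose(1, 2)  # (B, (H/d)*(W/d), C) tokens

tokens = PatchEmbed()(torch.randn(1, 3, 256, 192))
print(tokens.shape)  # torch.Size([1, 192, 768]); (256/16)*(192/16) = 192 tokens
```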

Decoder: the extracted features are decoded into per-keypoint localization heatmaps. ViTPose offers two lightweight decoder variants: a classic decoder with two deconvolution blocks followed by a prediction layer, and an even simpler one that uses bilinear upsampling followed by a convolution (sketched below). ViTPose can also attach multiple decoders to the same encoder to handle several datasets.
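A sketch of the simpler decoder variant under the assumptions above (bilinear 4× upsampling, ReLU, then a 3×3 convolution predicting one heatmap per keypoint); names and default values are illustrative.

```python
import torch
import torch.nn as nn

class SimpleDecoder(nn.Module):
    """Upsample the (C, H/16, W/16) feature map and predict K keypoint heatmaps."""
    def __init__(self, embed_dim=768, num_keypoints=17):
        super().__init__()
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, num_keypoints, kernel_size=3, padding=1),
        )

    def forward(self, feat):       # feat: (B, C, H/16, W/16)
        return self.head(feat)     # (B, K, H/4, W/4) heatmaps

heatmaps = SimpleDecoder()(torch.randn(1, 768, 16, 12))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```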


Scaling ViTPose


ViTPose scales by modifying the number of transformer layers and feature dimensions. For a 256×192 input resolution, the different model variants include:

  • ViTPose-B (Base): 86M parameters
  • ViTPose-L (Large): 307M parameters
  • ViTPose-H (Huge): 632M parameters

Interestingly, the transformer encoder encodes such rich semantics that the choice of decoder becomes less critical.
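For reference, the three variants follow the standard ViT-B/L/H backbone configurations; the depths and dimensions below are the usual ViT values and should be treated as indicative rather than as the repository's exact config.

```python
# Indicative backbone configurations (standard ViT-B/L/H values).
VITPOSE_VARIANTS = {
    "ViTPose-B": {"depth": 12, "embed_dim": 768,  "num_heads": 12, "params": "86M"},
    "ViTPose-L": {"depth": 24, "embed_dim": 1024, "num_heads": 16, "params": "307M"},
    "ViTPose-H": {"depth": 32, "embed_dim": 1280, "num_heads": 16, "params": "632M"},
}
```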

ViTPose Flexibility


Pre-training Strategies: ViTPose achieves comparable performance with pose-specific pre-training and with ImageNet pre-training.

Resolution Scaling: increasing the input image size and reducing the downsampling ratio consistently improve performance.

Efficient attention mechanisms: computing full self-attention across all image patches is computationally expensive, especially at high resolution.
ViTPose explores alternative attention strategies (a window-attention sketch follows this list):

  • Full attention: high accuracy but computationally heavy
  • Window attention: applies self-attention within small local windows
  • Shifted-window attention: alternates regular and shifted windows across layers to improve cross-window interactions
  • Pooling-window attention: pooled summary tokens maintain global context with reduced computation
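A minimal sketch of plain (non-shifted) window attention, assuming square windows that evenly tile the token grid; it illustrates the idea of restricting self-attention to local windows, not the exact implementation used in the repository.

```python
import torch
import torch.nn as nn

def window_attention(feat, attn, window=8):
    """Self-attention restricted to non-overlapping window x window regions.

    feat: (B, H, W, C) token grid; attn: nn.MultiheadAttention(C, heads, batch_first=True).
    Assumes H and W are divisible by `window`.
    """
    B, H, W, C = feat.shape
    # Partition the grid into (H//w)*(W//w) windows of w*w tokens each.
    x = feat.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    # Full attention, but only among tokens that share a window.
    x, _ = attn(x, x, x)
    # Undo the partition to recover the (B, H, W, C) layout.
    x = x.view(B, H // window, W // window, window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

feat = torch.randn(2, 16, 16, 768)
mha = nn.MultiheadAttention(768, num_heads=12, batch_first=True)
print(window_attention(feat, mha).shape)  # torch.Size([2, 16, 16, 768])
```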

Fine-tuning strategy: ViTPose retains strong performance even when only part of the network is fine-tuned, for example freezing the MHSA modules and updating only the feed-forward (FFN) layers (see the sketch below).
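A sketch of the frozen-attention fine-tuning idea, assuming attention sub-modules can be located by the substring "attn" in parameter names (a common but implementation-specific convention).

```python
import torch

def freeze_attention(model):
    """Freeze MHSA weights so that only the FFN (and decoder) layers are fine-tuned."""
    for name, param in model.named_parameters():
        if ".attn." in name:  # assumed naming convention for MHSA sub-modules
            param.requires_grad = False

# Typical usage before building the optimizer:
# freeze_attention(vitpose)
# optimizer = torch.optim.AdamW(p for p in vitpose.parameters() if p.requires_grad)
```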

Tasks: a shared encoder with multiple lightweight decoders handles several pose datasets at little extra cost (sketched below).
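A sketch of the multi-dataset setup, assuming one lightweight decoder per dataset on top of a shared encoder; the class and dataset names are illustrative.

```python
import torch.nn as nn

class MultiTaskViTPose(nn.Module):
    """One shared ViT encoder, one lightweight decoder per pose dataset."""
    def __init__(self, encoder, decoders: dict):
        super().__init__()
        self.encoder = encoder
        self.decoders = nn.ModuleDict(decoders)  # e.g. {"coco": ..., "mpii": ...}

    def forward(self, images, dataset: str):
        feat = self.encoder(images)              # shared features
        return self.decoders[dataset](feat)      # dataset-specific heatmaps
```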


ViTPose Transferability


ViTPose employs Knowledge Distillation (KD) to transfer information from large models to smaller ones.

  • Output (heatmap) distillation: transfers the final heatmap outputs.
    The loss is the MSE between the student's and the teacher's outputs for the same input.
  • Token-based distillation: a learnable knowledge token is first trained with the frozen teacher, then used to guide the student model.
    The loss is the MSE between the ground-truth heatmaps and the output of the student network, which takes the image tokens and the frozen, trained knowledge token as input.

    The final student loss is a combination of the heatmap distillation loss and the token distillation loss.
Using both distillation techniques together improves the performance of the base ViTPose-B model (see the loss sketch below).
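A sketch of the combined distillation objective, assuming MSE losses as described above; the student and teacher heatmaps, and the weighting alpha, are placeholders rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_heatmaps, teacher_heatmaps, gt_heatmaps, alpha=0.5):
    """Combine output (heatmap) distillation with the token-based ground-truth loss.

    student_heatmaps: student output given the image tokens + frozen knowledge token.
    teacher_heatmaps: frozen teacher output for the same image.
    alpha: assumed weighting hyper-parameter between the two terms.
    """
    output_loss = F.mse_loss(student_heatmaps, teacher_heatmaps)  # mimic the teacher
    token_loss = F.mse_loss(student_heatmaps, gt_heatmaps)        # match the ground truth
    return alpha * output_loss + (1 - alpha) * token_loss
```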

Comparison with SOTA Models


ViTPose outperforms CNN-based architectures and hybrid CNN-transformer models, achieving higher accuracy at comparable or faster inference speed.

Trained on multiple datasets, ViTPose-B performs on par with other methods while being faster. The larger ViTPose variants achieve higher accuracy with a better speed-accuracy trade-off.

Limitations & directions for the Skeleton project


One notable limitation of ViTPose is its performance on body-part images: like other models that are not fine-tuned for partial inputs, it struggles when only part of the body is visible.
However, one promising direction is to use a trained ViTPose model to progressively annotate large-scale datasets for the Skeleton project.


Explore ViTPose