zihangJiang/TokenLabeling
PyTorch implementation of "Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet"
| repo name | zihangJiang/TokenLabeling |
| --- | --- |
| repo link | https://github.com/zihangJiang/TokenLabeling |
| homepage | |
| language | Python |
| size (curr.) | 391 kB |
| stars (curr.) | 172 |
| created | 2021-04-20 |
| license | Apache License 2.0 |
Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet (arXiv)
This is a PyTorch implementation of our technical report.
Figure: Comparison between the proposed LV-ViT and other recent transformer-based models. Note that only models with fewer than 100M parameters are shown.
Training Pipeline
Our code is based on pytorch-image-models by Ross Wightman.
LV-ViT Models
Model | Layers | Embedding dim | Image resolution | Params | Top-1 acc. (%) | Download |
---|---|---|---|---|---|---|
LV-ViT-S | 16 | 384 | 224 | 26.15M | 83.3 | link |
LV-ViT-S | 16 | 384 | 384 | 26.30M | 84.4 | link |
LV-ViT-M | 20 | 512 | 224 | 55.83M | 84.0 | link |
LV-ViT-M | 20 | 512 | 384 | 56.03M | 85.4 | link |
LV-ViT-M | 20 | 512 | 448 | 56.13M | 85.5 | link |
LV-ViT-L | 24 | 768 | 448 | 150.47M | 86.2 | link |
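The downloaded checkpoints can be restored into the corresponding model before evaluation or fine-tuning. Below is a minimal illustrative sketch, assuming the repo registers its models with timm under names such as `lvvit_s` / `lvvit_m` via the `tlt.models` package and that the checkpoint is a plain state dict (possibly wrapped under a `"model"` key); the exact names and file layout may differ.

```python
# Minimal sketch of restoring one of the checkpoints listed above.
# Assumptions (not confirmed by this README): the LV-ViT variants are registered
# with timm under names like "lvvit_s", and the checkpoint is a state_dict or
# wraps one under a "model" key.
import torch
import timm
import tlt.models  # assumed import path that registers the LV-ViT variants

model = timm.create_model('lvvit_s', pretrained=False)
ckpt = torch.load('/path/to/checkpoint', map_location='cpu')
state_dict = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
model.load_state_dict(state_dict)
model.eval()
```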
Requirements
torch>=1.4.0 torchvision>=0.5.0 pyyaml timm==0.4.5
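If these are not already available, they can be installed with pip, pinning timm to 0.4.5 as required (quote the version specifiers so the shell does not interpret them):

pip install "torch>=1.4.0" "torchvision>=0.5.0" pyyaml timm==0.4.5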
Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this script. (An illustrative sanity check of the prepared folders is sketched after the directory tree.)
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
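Since this layout follows the standard class-per-folder ImageNet convention, one quick way to check the prepared data is to point `torchvision.datasets.ImageFolder` at each split. This is only an illustrative check, not part of the repo:

```python
# Illustrative sanity check of the prepared ImageNet folders (not part of this repo).
# Expects the layout shown above: imagenet/{train,val}/<wnid>/*.JPEG.
from torchvision import datasets

for split in ('train', 'val'):
    ds = datasets.ImageFolder(f'/path/to/imagenet/{split}')
    # ImageNet-1k should report 1000 classes, ~1.28M train images and 50k val images.
    print(f'{split}: {len(ds.classes)} classes, {len(ds)} images')
```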
Validation
Replace /path/to/imagenet/val with the path to your ImageNet validation set and /path/to/checkpoint with the path to the downloaded checkpoint:
CUDA_VISIBLE_DEVICES=0 bash eval.sh /path/to/imagenet/val /path/to/checkpoint
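What eval.sh reports is standard single-crop top-1 accuracy on the validation split. The following is a minimal sketch of such an evaluation loop, not the script itself; the model name `lvvit_s` is the same assumption as above, and the preprocessing (resize ratio, interpolation, normalization) may differ from what the script actually uses.

```python
# Illustrative single-crop top-1 evaluation loop (not the repo's eval.sh).
# Model name and preprocessing are assumptions; see the comments.
import torch
import timm
from torchvision import datasets, transforms

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = timm.create_model('lvvit_s', pretrained=False)  # assumed registered name
ckpt = torch.load('/path/to/checkpoint', map_location='cpu')
model.load_state_dict(ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt)
model.to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),        # assumed resize for 224-resolution checkpoints
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder('/path/to/imagenet/val', preprocess),
    batch_size=64, num_workers=8, pin_memory=True)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == targets).sum().item()
        total += targets.size(0)
print(f'top-1 accuracy: {100.0 * correct / total:.2f}%')
```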
Label data
We provide the dense label maps generated by NFNet-F6 here. As NFNet-F6 is trained on ImageNet data only, no extra training data is involved.
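For context, these label maps supply patch-level supervision: during training, the token labeling objective adds a soft cross-entropy between each patch token's prediction and its dense label, on top of the usual cross-entropy on the class token. Below is a minimal sketch of this objective; the function names and the auxiliary weight `beta` are illustrative, and the repo's actual loss implementation may differ.

```python
# Sketch of the token labeling objective: class-token cross-entropy plus a
# soft cross-entropy on every patch token against its dense label.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    # Cross-entropy against soft targets (per-token K-class distributions).
    return torch.sum(-soft_targets * F.log_softmax(logits, dim=-1), dim=-1).mean()

def token_labeling_loss(cls_logits, token_logits, cls_target, token_soft_labels, beta=0.5):
    """cls_logits:        (B, K)    class-token predictions
       token_logits:      (B, N, K) per-patch-token predictions
       cls_target:        (B,)      image-level labels
       token_soft_labels: (B, N, K) soft labels from the dense label map
       beta: weight of the auxiliary token-level loss (treated here as a hyperparameter)."""
    loss_cls = F.cross_entropy(cls_logits, cls_target)
    loss_tok = soft_cross_entropy(token_logits, token_soft_labels)
    return loss_cls + beta * loss_tok
```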
Training
Coming soon
Reference
If you use this repo or find it useful, please consider citing:
@article{jiang2021token,
  title={Token Labeling: Training a 85.5\% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet},
  author={Jiang, Zihang and Hou, Qibin and Yuan, Li and Zhou, Daquan and Jin, Xiaojie and Wang, Anran and Feng, Jiashi},
  journal={arXiv preprint arXiv:2104.10858},
  year={2021}
}