June 12, 2021

2317 words 11 mins read

dk-liang/Awesome-Visual-Transformer

dk-liang/Awesome-Visual-Transformer

Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)

repo name dk-liang/Awesome-Visual-Transformer
repo link https://github.com/dk-liang/Awesome-Visual-Transformer
homepage
language
size (curr.) 51 kB
stars (curr.) 1410
created 2020-12-03
license

Awesome Visual-Transformer Awesome

Collect some Transformer with Computer-Vision (CV) papers.

If you find some overlooked papers, please open issues or pull requests.

Papers

Transformer original paper

Technical blog

  • [Chinese Blog] 3W字长文带你轻松入门视觉transformer [Link]
  • [Chinese Blog] Vision Transformer 超详细解读 (原理分析+代码解读) [Link]

Survey

  • Transformers in Vision: A Survey [paper] - 2021.02.22
  • A Survey on Visual Transformer [paper] - 2020.1.30
  • A Survey of Transformers [paper] - 2020.6.09

arXiv papers

  • Space-time Mixing Attention for Video Transformer [paper]
  • Transformed CNNs: recasting pre-trained convolutional layers with self-attention [paper]
  • CAT: Cross Attention in Vision Transformer [paper]
  • Scaling Vision Transformers [paper]
  • DETReg: Unsupervised Pretraining with Region Priors for Object Detection [paper] [code]
  • Chasing Sparsity in Vision Transformers:An End-to-End Exploration [paper]
  • MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [paper]
  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [paper]
  • On Improving Adversarial Transferability of Vision Transformers [paper]
  • Fully Transformer Networks for Semantic ImageSegmentation [paper]
  • Visual Transformer for Task-aware Active Learning [paper] [code]
  • Efficient Training of Visual Transformers with Small-Size Datasets [paper]
  • Reveal of Vision Transformers Robustness against Adversarial Attacks [paper]
  • Person Re-Identification with a Locally Aware Transformer [paper]
  • Refiner: Refining Self-attention for Vision Transformers [paper]
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [paper]
  • ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [paper]
  • Video Instance Segmentation using Inter-Frame Communication Transformers [paper]
  • Transformer in Convolutional Neural Networks [paper] [code]
  • Oriented Object Detection with Transformer [paper]
  • Uformer: A General U-Shaped Transformer for Image Restoration [paper] [code]
  • Patch Slimming for Efficient Vision Transformers [paper]
  • RegionViT: Regional-to-Local Attention for Vision Transformers [paper]
  • Associating Objects with Transformers for Video Object Segmentation [paper] [code]
  • Few-Shot Segmentation via Cycle-Consistent Transformer [paper]
  • Glance-and-Gaze Vision Transformer [paper] [code]
  • Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [paper]
  • Anticipative Video Transformer [paper] [code]
  • DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper] [code]
  • When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [paper] [code]
  • Container: Context Aggregation Network [paper]
  • Unsupervised Out-of-Domain Detection via Pre-trained Transformers [paper]
  • TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classication [paper]
  • [YOLOS] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper] [code]
  • [TransVOS] TransVOS: Video Object Segmentation with Transformers [paper]
  • [KVT] KVT: k-NN Attention for Boosting Vision Transformers [paper]
  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [paper] [code]
  • [SegFormer] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper] [code]
  • [SDNet] SDNet: mutil-branch for single image deraining using swin [paper] [code]
  • [DVT] Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [paper]
  • Dual-stream Network for Visual Recognition [paper]
  • [GazeTR] Gaze Estimation using Transformer [paper] [code]
  • Transformer-Based Deep Image Matching for Generalizable Person Re-identification [paper]
  • Less is More: Pay Less Attention in Vision Transformers [paper]
  • [FoveaTer] FoveaTer: Foveated Transformer for Image Classification [paper]
  • [TransDA] Transformer-Based Source-Free Domain Adaptation [paper] [code]
  • An Attention Free Transformer [paper]
  • [PTNet] PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition [paper] [code]
  • [CogView] CogView: Mastering Text-to-Image Generation via Transformers [paper]
  • [NesT] Aggregating Nested Transformers [paper]
  • [TAPG] Temporal Action Proposal Generation with Transformers [paper]
  • Boosting Crowd Counting with Transformers [paper]
  • [COTR] COTR: Convolution in Transformer Network for End to End Polyp Detection [paper]
  • [TransVOD] End-to-End Video Object Detection with Spatial-Temporal Transformers [paper] [code]
  • Intriguing Properties of Vision Transformers [paper] [code]
  • Combining Transformer Generators with Convolutional Discriminators [paper]
  • Rethinking the Design Principles of Robust Vision Transformer [paper]
  • Vision Transformers are Robust Learners [paper] [code]
  • Manipulation Detection in Satellite Images Using Vision Transformer [paper]
  • [Segmenter] Segmenter: Transformer for Semantic Segmentation [paper] [code]
  • [Swin-Unet] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [paper] [code]
  • Self-Supervised Learning with Swin Transformers [paper] [code]
  • [SCTN] SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [paper]
  • [RelationTrack] RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [paper]
  • [VGTR] Visual Grounding with Transformers [paper]
  • [PST] Visual Composite Set Detection Using Part-and-Sum Transformers [paper]
  • [TrTr] TrTr: Visual Tracking with Transformer [paper] [code]
  • [MOTR] MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper] [code]
  • Attention for Image Registration (AiR): an unsupervised Transformer approach [paper]
  • [TransHash] TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [paper]
  • [ISTR] ISTR: End-to-End Instance Segmentation with Transformers [paper] [code]
  • [CAT] CAT: Cross-Attention Transformer for One-Shot Object Detection [paper]
  • [CoSformer] CoSformer: Detecting Co-Salient Object with Transformers [paper]
  • End-to-End Attention-based Image Captioning [paper]
  • [PMTrans] Pyramid Medical Transformer for Medical Image Segmentation [paper]
  • [HandsFormer] HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction [paper]
  • [GasHis-Transformer] GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [paper]
  • Emerging Properties in Self-Supervised Vision Transformers [paper]
  • [InTra] Inpainting Transformer for Anomaly Detection [paper]
  • [Twins] Twins: Revisiting Spatial Attention Design in Vision Transformers [paper] [code]
  • [MLMSPT] Point Cloud Learning with Transformer [paper]
  • Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [paper]
  • [ConTNet] ConTNet: Why not use convolution and transformer at the same time? [paper] [code]
  • [DTNet] Dual Transformer for Point Cloud Analysis [paper]
  • Improve Vision Transformers Training by Suppressing Over-smoothing [paper] [code]
  • [Visformer] Visformer: The Vision-friendly Transformer [paper] [code]
  • Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [paper]
  • [VST] Visual Saliency Transformer [paper]
  • [M3DeTR] M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [paper] [code]
  • [VidTr] VidTr: Video Transformer Without Convolutions [paper]
  • [Skeletor] Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [paper]
  • [FaceT] Learning to Cluster Faces via Transformer [paper]
  • [MViT] Multiscale Vision Transformers [paper] [code]
  • [VATT] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [paper]
  • [So-ViT] So-ViT: Mind Visual Tokens for Vision Transformer [paper] [code]
  • Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [paper] [code]
  • [TransRPPG] TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [paper]
  • [VideoGPT] VideoGPT: Video Generation using VQ-VAE and Transformers [paper]
  • [M2TR] M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [paper]
  • Transformer Transforms Salient Object Detection and Camouflaged Object Detection [paper]
  • [TransCrowd] TransCrowd: Weakly-Supervised Crowd Counting with Transformer [paper] [code]
  • [TransVG] TransVG: End-to-End Visual Grounding with Transformers [paper]
  • Visual Transformer Pruning [paper]
  • Self-supervised Video Retrieval Transformer Network [paper]
  • Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [paper]
  • [TransGAN] TransGAN: Two Transformers Can Make One Strong GAN [paper] [code]
  • Geometry-Free View Synthesis: Transformers and no 3D Priors [paper] [code]
  • [CoaT] Co-Scale Conv-Attentional Image Transformers [paper] [code]
  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers [paper] [code]
  • [ACTOR] Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [paper]
  • [CIT] Cloth Interactive Transformer for Virtual Try-On [paper] [code]
  • Handwriting Transformers [paper]
  • [SiT] SiT: Self-supervised vIsion Transformer [paper] [code]
  • On the Robustness of Vision Transformers to Adversarial Examples [paper]
  • An Empirical Study of Training Self-Supervised Visual Transformers [paper]
  • A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [paper]
  • [AOT-GAN] Aggregated Contextual Transformations for High-Resolution Image Inpainting [paper] [code]
  • Deepfake Detection Scheme Based on Vision Transformer and Distillation [paper]
  • [ATAG] Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [paper]
  • [LeViT] LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference [paper]
  • [TubeR] TubeR: Tube-Transformer for Action Detection [paper]
  • [AAformer] AAformer: Auto-Aligned Transformer for Person Re-Identification [paper]
  • [TFill] TFill: Image Completion via a Transformer-Based Architecture [paper]
  • Group-Free 3D Object Detection via Transformers [paper] [code]
  • [STGT] Spatial-Temporal Graph Transformer for Multiple Object Tracking [paper]
  • [YOGO] You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module[paper] [code]
  • Going deeper with Image Transformers[paper]
  • [Stark] Learning Spatio-Temporal Transformer for Visual Tracking [paper] [code]
  • [Meta-DETR] Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [paper [code]
  • [DA-DETR] DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [paper]
  • Robust Facial Expression Recognition with Convolutional Visual Transformers [paper]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
  • Spatiotemporal Transformer for Video-based Person Re-identification[paper]
  • [PiT] Rethinking Spatial Dimensions of Vision Transformers [paper] [code]
  • [TransUNet] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [paper] [code]
  • [CvT] CvT: Introducing Convolutions to Vision Transformers [paper] [code]
  • Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper]
  • [TFPose] TFPose: Direct Human Pose Estimation with Transformers [paper]
  • [TransCenter] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [paper]
  • [ViViT] ViViT: A Video Vision Transformer [paper]
  • [CrossViT] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [paper]
  • [TS-CAM] TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [paper]
  • Face Transformer for Recognition [paper]
  • On the Adversarial Robustness of Visual Transformers [paper]
  • Understanding Robustness of Transformers for Image Classification [paper]
  • Lifting Transformer for 3D Human Pose Estimation in Video [paper]
  • [GSA-Net] Global Self-Attention Networks for Image Recognition[paper]
  • High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [paper] [code]
  • [DPT] Vision Transformers for Dense Prediction [paper] [code]
  • [TransFG] TransFG: A Transformer Architecture for Fine-grained Recognition? [paper]
  • [TimeSformer] Is Space-Time Attention All You Need for Video Understanding? [paper]
  • Multi-view 3D Reconstruction with Transformer [paper]
  • Can Vision Transformers Learn without Natural Images? [paper] [code]
  • Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [paper] [code]
  • End-to-End Trainable Multi-Instance Pose Estimation with Transformers [paper]
  • Instance-level Image Retrieval using Reranking Transformers [paper] [code]
  • [BossNAS] BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [paper] [code]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers [paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer [paper]
  • [TNT] Transformer in Transformer [paper] [code]
  • Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [paper]
  • 3D Human Pose Estimation with Spatial and Temporal Transformers [paper] [code]
  • [SUNETR] SUNETR: Transformers for 3D Medical Image Segmentation [paper]
  • Scalable Visual Transformers with Hierarchical Pooling [paper]
  • [ConViT] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [paper]
  • [TransMed] TransMed: Transformers Advance Multi-modal Medical Image Classification [paper]
  • [U-Transformer] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [paper]
  • [SpecTr] SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [paper] [code]
  • [TransBTS] TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [paper] [code]
  • [SSTN] SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [paper]
  • [GANsformer] Generative Adversarial Transformers [paper] [code]
  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [paper] [code]
  • Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [paper] [code]
  • [MedT] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [paper] [code]
  • [CPVT] Do We Really Need Explicit Position Encodings for Vision Transformers? [paper] [code]
  • Deepfake Video Detection Using Convolutional Vision Transformer[paper]
  • Training Vision Transformers for Image Retrieval[paper]
  • [TransReID] TransReID: Transformer-based Object Re-Identification[paper]
  • [VTN] Video Transformer Network[paper]
  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [paper] [code]
  • [BoTNet] Bottleneck Transformers for Visual Recognition [paper]
  • [CPTR] CPTR: Full Transformer Network for Image Captioning [paper]
  • Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [paper] [code]
  • [Trans2Seg] Segmenting Transparent Object in the Wild with Transformer [paper] [code]
  • [SMCA] Fast Convergence of DETR with Spatially Modulated Co-Attention [paper]
  • Investigating the Vision Transformer Model for Image Retrieval Tasks [paper]
  • [Trear] Trear: Transformer-based RGB-D Egocentric Action Recognition [paper]
  • [VisualSparta] VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [paper]
  • [TrackFormer] TrackFormer: Multi-Object Tracking with Transformers [paper]
  • [LETR] Line Segment Detection Using Transformers without Edges [paper]
  • [TAPE] Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [paper]
  • [TRIQ] Transformer for Image Quality Assessment [paper] [code]
  • [TransTrack] TransTrack: Multiple-Object Tracking with Transformer [paper] [code]
  • [TransPose] TransPose: Towards Explainable Human Pose Estimation by Transformer [paper]
  • [DeiT] Training data-efficient image transformers & distillation through attention [paper] [code]
  • [Pointformer] 3D Object Detection with Pointformer [paper]
  • [ViT-FRCNN] Toward Transformer-Based Object Detection [paper]
  • [Taming-transformers] Taming Transformers for High-Resolution Image Synthesis [paper] [code]
  • [SceneFormer] SceneFormer: Indoor Scene Generation with Transformers [paper]
  • [PCT] PCT: Point Cloud Transformer [paper]
  • [METRO] End-to-End Human Pose and Mesh Reconstruction with Transformers [paper]
  • [PointTransformer] Point Transformer [paper]
  • [PED] DETR for Pedestrian Detection[paper]
  • Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry[paper]
  • [C-Tran] General Multi-label Image Classification with Transformers [paper]
  • [TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection [paper]
  • [ACT] End-to-End Object Detection with Adaptive Clustering Transformer [paper]
  • [STTR] Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [paper] [code]
  • [VTs] Visual Transformers: Token-based Image Representation and Processing for Computer Vision [paper]

2021

  • [NDT-Transformer] NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation (ICRA)[paper]
  • VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (ISIE) [paper]
  • Medical Image Segmentation using Squeeze-and-Expansion Transformers (IJCAI) [paper]
  • Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR) [paper]
  • Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer (CVPR) [paper]
  • [HOTR] HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR oral) [paper]
  • High-Resolution Complex Scene Synthesis with Transformers (CVPRW) [paper]
  • [TransFuser] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving (CVPR) [paper] [code]
  • Pose Recognition with Cascade Transformers (CVPR) [paper]
  • Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning (CVPR) [paper]
  • [LoFTR] LoFTR: Detector-Free Local Feature Matching with Transformers (CVPR) [paper] [code]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers (CVPR) [paper]
  • [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (CVPR) [paper] [code]
  • [TransT] Transformer Tracking (CVPR) [paper] [code]
  • Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (CVPR oral) [paper]
  • [VisTR] End-to-End Video Instance Segmentation with Transformers (CVPR) [paper]
  • Transformer Interpretability Beyond Attention Visualization (CVPR) [paper] [code]
  • [IPT] Pre-Trained Image Processing Transformer (CVPR) [paper]
  • [UP-DETR] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (CVPR) [paper]
  • [VTNet] VTNet: Visual Transformer Network for Object Goal Navigation (ICLR)[paper]
  • [Vision Transformer] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR)[paper] [code]
  • [Deformable DETR] Deformable DETR: Deformable Transformers for End-to-End Object Detection (ICLR)[paper] [code]
  • [LAMBDANETWORKS] MODELING LONG-RANGE INTERACTIONS WITHOUT ATTENTION (ICLR) paper] [code]
  • [LSTR] End-to-end Lane Shape Prediction with Transformers (WACV) [paper] [code]

2020

  • [DETR] End-to-End Object Detection with Transformers (ECCV) [paper] [code]
  • [FPT] Feature Pyramid Transformer (CVPR) [paper] [code]
  • [TTSR] Learning Texture Transformer Network for Image Super-Resolution (CVPR) [paper] [code]
  • [STTN] Learning Joint Spatial-Temporal Transformations for Video Inpainting (ECCV) [paper] [code]

Acknowledgement

Thanks the template from Awesome-Crowd-Counting

comments powered by Disqus