“Exploring Plain Vision Transformer Backbones for Object Detection” [link] Excellent read, as usual, from the FAIR team. Strong object detection results with only minor tweaks to the vanilla Vision Transformer (ViT) backbone.