vis4d.op.base.vit
Vision Transformer (ViT) base model.
Classes
| VisionTransformer | Vision Transformer (ViT) model without classification head. |
- class VisionTransformer(img_size=224, patch_size=16, in_channels=3, num_classes=1000, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, init_values=None, class_token=True, no_embed_class=False, pre_norm=False, pos_drop_rate=0.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=None, act_layer=GELU(approximate='none'))[source]
Vision Transformer (ViT) model without classification head.
- A PyTorch impl of `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - https://arxiv.org/abs/2010.11929
- Adapted from:
  - pytorch vision transformer impl
  - timm vision transformer impl
Init VisionTransformer. See the usage sketch following the parameter list below.
- Parameters:
img_size (int, optional) – Input image size. Defaults to 224.
patch_size (int, optional) – Patch size. Defaults to 16.
in_channels (int, optional) – Number of input channels. Defaults to 3.
num_classes (int, optional) – Number of classes. Defaults to 1000.
embed_dim (int, optional) – Embedding dimension. Defaults to 768.
depth (int, optional) – Depth. Defaults to 12.
num_heads (int, optional) – Number of attention heads. Defaults to 12.
mlp_ratio (float, optional) – Ratio of MLP hidden dim to embedding dim. Defaults to 4.0.
qkv_bias (bool, optional) – Whether to add a bias term to the qkv projection. Defaults to True.
init_values (float, optional) – Initial values for layer scale. Defaults to None.
class_token (bool, optional) – Whether to add a class token. Defaults to True.
no_embed_class (bool, optional) – Whether to exclude the class token from the positional embedding. Defaults to False.
pre_norm (bool, optional) – Whether to use pre-norm. Defaults to False.
pos_drop_rate (float, optional) – Positional dropout rate. Defaults to 0.0.
drop_rate (float, optional) – Dropout rate. Defaults to 0.0.
attn_drop_rate (float, optional) – Attention dropout rate. Defaults to 0.0.
drop_path_rate (float, optional) – Drop path rate. Defaults to 0.0.
embed_layer (nn.Module, optional) – Embedding layer. Defaults to PatchEmbed.
norm_layer (nn.Module, optional) – Normalization layer. If None, nn.LayerNorm is used. Defaults to None.
act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU().
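A minimal construction sketch (not part of the documented API; the import path is taken from this page's module name, and the input resolution and batch size are only illustrative):

    import torch

    from vis4d.op.base.vit import VisionTransformer

    # ViT-Base/16-style configuration using the defaults documented above.
    vit = VisionTransformer(
        img_size=224,
        patch_size=16,
        in_channels=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
    )

    # A batch of two RGB images in (B, C, H, W) layout, as expected by forward().
    images = torch.randn(2, 3, 224, 224)
    feats = vit(images)  # list of torch.Tensor feature levels, see __call__ / forward below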
- __call__(data)[source]
Applies the ViT encoder.
- Parameters:
data (torch.Tensor) – Input images to the network, of shape [N, C, H, W].
- Return type:
list[Tensor]
- forward(images)[source]
Forward pass.
- Parameters:
images (torch.Tensor) – Input images tensor of shape (B, C, H, W).
- Returns:
Features of the input images extracted by the ViT encoder. feats[0] is the input images and feats[1] is the output of the patch embedding layer. The remaining elements are the outputs of the transformer blocks, each of shape (B, N, dim), where N is the number of patches and dim is the embedding dimension; the final element is the output of the ViT encoder.
- Return type:
feats (list[torch.Tensor])
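A hedged sketch of inspecting the returned feature list, reusing the vit instance built in the sketch above; the exact list length and token count N depend on the configuration (e.g. class_token), so treat the printed shapes as the source of truth:

    images = torch.randn(2, 3, 224, 224)  # (B, C, H, W)
    feats = vit(images)

    # feats[0]: the raw input images; feats[1]: patch embedding output;
    # remaining entries: per-block token features of shape (B, N, dim).
    for level, feat in enumerate(feats):
        print(level, tuple(feat.shape))

    tokens = feats[-1]  # final output of the ViT encoder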
- property out_channels: list[int]
Return the number of output channels per feature level.
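For completeness, a short sketch of reading the property; that its entries line up one-to-one with the feature levels returned by forward() is an assumption to verify against the printed shapes above:

    channels = vit.out_channels  # list[int], one entry per feature level
    print(channels)
    # Assumption: channels[i] should match the channel / embedding dimension
    # of the corresponding feats[i] from the forward pass above.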