vis4d.op.base.vit
Vision Transformer (ViT) base model.
Classes
| VisionTransformer | Vision Transformer (ViT) model without classification head. |
- class VisionTransformer(img_size=224, patch_size=16, in_channels=3, num_classes=1000, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, init_values=None, class_token=True, no_embed_class=False, pre_norm=False, pos_drop_rate=0.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=None, act_layer=GELU(approximate='none'))[source]
Vision Transformer (ViT) model without classification head.
- A PyTorch impl of `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - https://arxiv.org/abs/2010.11929
- Adapted from:
  - pytorch vision transformer impl
  - timm vision transformer impl
Init VisionTransformer. See the usage sketch following the parameter list below.
- Parameters:
img_size (int, optional) – Input image size. Defaults to 224.
patch_size (int, optional) – Patch size. Defaults to 16.
in_channels (int, optional) – Number of input channels. Defaults to 3.
num_classes (int, optional) – Number of classes. Defaults to 1000.
embed_dim (int, optional) – Embedding dimension. Defaults to 768.
depth (int, optional) – Depth. Defaults to 12.
num_heads (int, optional) – Number of attention heads. Defaults to 12.
mlp_ratio (float, optional) – Ratio of MLP hidden dim to embedding dim. Defaults to 4.0.
qkv_bias (bool, optional) – Whether to add a bias term to the qkv projection. Defaults to True.
init_values (float, optional) – Initial values for layer scale. Defaults to None.
class_token (bool, optional) – Whether to add a class token. Defaults to True.
no_embed_class (bool, optional) – Whether to exclude the class token from the positional embedding. Defaults to False.
pre_norm (bool, optional) – Whether to use pre-norm. Defaults to False.
pos_drop_rate (float, optional) – Positional dropout rate. Defaults to 0.0.
drop_rate (float, optional) – Dropout rate. Defaults to 0.0.
attn_drop_rate (float, optional) – Attention dropout rate. Defaults to 0.0.
drop_path_rate (float, optional) – Drop path rate. Defaults to 0.0.
embed_layer (nn.Module, optional) – Embedding layer. Defaults to PatchEmbed.
norm_layer (nn.Module, optional) – Normalization layer. If None, nn.LayerNorm is used. Defaults to None.
act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU().
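A minimal construction sketch (not part of the documented API; the import path is taken from this page's module name, and the input resolution and batch size are only illustrative):

    import torch

    from vis4d.op.base.vit import VisionTransformer

    # ViT-Base/16-style configuration using the defaults documented above.
    vit = VisionTransformer(
        img_size=224,
        patch_size=16,
        in_channels=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
    )

    # A batch of two RGB images in (B, C, H, W) layout, as expected by forward().
    images = torch.randn(2, 3, 224, 224)
    feats = vit(images)  # list of torch.Tensor feature levels, see __call__ / forward below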
- __call__(data)[source]
Applies the ViT encoder.
- Parameters:
data (torch.Tensor) – Input images to the network, of shape [N, C, H, W].
- Return type:
list[Tensor]
- forward(images)[source]
Forward pass.
- Parameters:
images (torch.Tensor) – Input images tensor of shape (B, C, H, W).
- Returns:
Features of the input images extracted by the ViT encoder. feats[0] is the input images and feats[1] is the output of the patch embedding layer. The remaining elements are the outputs of the transformer blocks, each of shape (B, N, dim), where N is the number of patches and dim is the embedding dimension; the final element is the output of the ViT encoder.
- Return type:
feats (list[torch.Tensor])
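A hedged sketch of inspecting the returned feature list, reusing the vit instance built in the sketch above; the exact list length and token count N depend on the configuration (e.g. class_token), so treat the printed shapes as the source of truth:

    images = torch.randn(2, 3, 224, 224)  # (B, C, H, W)
    feats = vit(images)

    # feats[0]: the raw input images; feats[1]: patch embedding output;
    # remaining entries: per-block token features of shape (B, N, dim).
    for level, feat in enumerate(feats):
        print(level, tuple(feat.shape))

    tokens = feats[-1]  # final output of the ViT encoder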
- property out_channels: list[int]
Return the number of output channels per feature level.
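For completeness, a short sketch of reading the property; that its entries line up one-to-one with the feature levels returned by forward() is an assumption to verify against the printed shapes above:

    channels = vit.out_channels  # list[int], one entry per feature level
    print(channels)
    # Assumption: channels[i] should match the channel / embedding dimension
    # of the corresponding feats[i] from the forward pass above.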