vis4d.op.base.vit

Vision Transformer networks for classification.

Classes

VisionTransformer([img_size, patch_size, ...])

Vision Transformer (ViT) model without classification head.

class VisionTransformer(img_size=224, patch_size=16, in_channels=3, num_classes=1000, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, init_values=None, class_token=True, no_embed_class=False, pre_norm=False, pos_drop_rate=0.0, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, norm_layer=None, act_layer=GELU(approximate='none'))[source]

Vision Transformer (ViT) model without classification head.

A PyTorch implementation of `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - https://arxiv.org/abs/2010.11929

Adapted from:
  • PyTorch Vision Transformer implementation

  • timm Vision Transformer implementation

Init VisionTransformer.

Parameters:
  • img_size (int, optional) – Input image size. Defaults to 224.

  • patch_size (int, optional) – Patch size. Defaults to 16.

  • in_channels (int, optional) – Number of input channels. Defaults to 3.

  • num_classes (int, optional) – Number of classes. Defaults to 1000.

  • embed_dim (int, optional) – Embedding dimension. Defaults to 768.

  • depth (int, optional) – Depth. Defaults to 12.

  • num_heads (int, optional) – Number of attention heads. Defaults to 12.

  • mlp_ratio (float, optional) – Ratio of MLP hidden dim to embedding dim. Defaults to 4.0.

  • qkv_bias (bool, optional) – Whether to add a bias term to the qkv projection. Defaults to True.

  • init_values (float, optional) – Initial values for layer scale. Defaults to None.

  • class_token (bool, optional) – Whether to add a class token. Defaults to True.

  • no_embed_class (bool, optional) – Whether to exclude the class token from the position embedding. Defaults to False.

  • pre_norm (bool, optional) – Whether to use pre-norm. Defaults to False.

  • pos_drop_rate (float, optional) – Positional dropout rate. Defaults to 0.0.

  • drop_rate (float, optional) – Dropout rate. Defaults to 0.0.

  • attn_drop_rate (float, optional) – Attention dropout rate. Defaults to 0.0.

  • drop_path_rate (float, optional) – Drop path rate. Defaults to 0.0.

  • embed_layer (nn.Module, optional) – Embedding layer. Defaults to PatchEmbed.

  • norm_layer (nn.Module, optional) – Normalization layer. If None, nn.LayerNorm is used. Defaults to None.

  • act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU().
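
Example (a minimal construction sketch; the keyword values below are simply the documented defaults, and the import path follows the module name above):

from vis4d.op.base.vit import VisionTransformer

# ViT-Base/16-style encoder built from the documented defaults.
vit = VisionTransformer(
    img_size=224,
    patch_size=16,
    in_channels=3,
    embed_dim=768,
    depth=12,
    num_heads=12,
    mlp_ratio=4.0,
    qkv_bias=True,
)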

__call__(data)[source]

Applies the ViT encoder.

Parameters:

data (Tensor) – Input images to the network, with shape [N, C, W, H].

Return type:

list[Tensor]

forward(images)[source]

Forward pass.

Parameters:

images (torch.Tensor) – Input images tensor of shape (B, C, H, W).

Returns:

Features of the input images extracted by the ViT encoder. feats[0] is the input images, and feats[1] is the output of the patch embedding layer. The remaining elements are the outputs of each transformer block, with shape (B, N, dim), where N is the number of patches and dim is the embedding dimension. The final element is the output of the ViT encoder.

Return type:

feats (list[torch.Tensor])
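
Example (a usage sketch based on the return description above; the dummy input sizes follow the documented defaults, and calling the module directly applies the encoder as described under __call__):

import torch
from vis4d.op.base.vit import VisionTransformer

vit = VisionTransformer()  # documented defaults

# Dummy batch of two RGB images at the default 224x224 resolution.
images = torch.randn(2, 3, 224, 224)

feats = vit(images)  # list[torch.Tensor]

# feats[0] is the input images, feats[1] the patch embedding output;
# the remaining entries have shape (B, N, dim) as documented above.
print(len(feats), feats[-1].shape)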

init_weights()[source]

Init weights using timm’s implementation.

Return type:

None

property out_channels: list[int]

Return the number of output channels per feature level.
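
Continuing the sketch above, the property can be read directly; the concrete values are configuration-dependent, and the expectation that entries for the transformer block outputs equal embed_dim is an assumption rather than something stated here:

# One integer per feature level; for the transformer block outputs this is
# expected (assumption) to equal embed_dim.
channels = vit.out_channels
print(channels)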