vis4d.op.layer.patch_embed

Image to Patch Embedding using Conv2d.

Modified from vision_transformer (https://github.com/google-research/vision_transformer).

Classes

PatchEmbed([img_size, patch_size, ...])

2D Image to Patch Embedding.

class PatchEmbed(img_size=224, patch_size=16, in_channels=3, embed_dim=768, norm_layer=None, flatten=True, bias=True)[source]

2D Image to Patch Embedding.

Initialize PatchEmbed.

Parameters:
  • img_size (int, optional) – Input image size. Defaults to 224.

  • patch_size (int, optional) – Patch size. Defaults to 16.

  • in_channels (int, optional) – Number of input image channels. Defaults to 3.

  • embed_dim (int, optional) – Patch embedding dimension. Defaults to 768.

  • norm_layer (nn.Module, optional) – Normalization layer. Defaults to None, which means no normalization layer.

  • flatten (bool, optional) – Whether to flatten the output tensor. Defaults to True.

  • bias (bool, optional) – Whether to add a bias to the convolution layer. Defaults to True.

Raises:

ValueError – If the input image size is not divisible by the patch size.
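
A short usage sketch (assuming the import path matches this module, vis4d.op.layer.patch_embed):

    >>> from vis4d.op.layer.patch_embed import PatchEmbed
    >>> # 224x224 images cut into 16x16 patches -> (224 / 16) ** 2 = 196 patches.
    >>> patch_embed = PatchEmbed(img_size=224, patch_size=16, in_channels=3, embed_dim=768)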

__call__(data)[source]

Applies the layer.

Parameters:

data (torch.Tensor) – Input tensor of shape (B, C, H, W).

Returns:

Output tensor of shape (B, N, C), where C is the embedding dimension (embed_dim) and N is the number of patches, i.e., N = (H / patch_size) * (W / patch_size).

Return type:

torch.Tensor
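
For example, with the defaults documented above, a batch of two 224x224 RGB images yields 196 patch tokens of dimension 768 (a sketch under those assumptions):

    >>> import torch
    >>> from vis4d.op.layer.patch_embed import PatchEmbed
    >>> patch_embed = PatchEmbed(img_size=224, patch_size=16, in_channels=3, embed_dim=768)
    >>> images = torch.randn(2, 3, 224, 224)
    >>> patch_embed(images).shape
    torch.Size([2, 196, 768])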

forward(x)[source]

Forward function.

Return type:

torch.Tensor
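
The computation follows the vision_transformer recipe this module is modified from: a strided Conv2d projects each non-overlapping patch to embed_dim, and the resulting spatial grid is optionally flattened into a token sequence. A minimal illustrative sketch (not the vis4d source; class and attribute names here are assumptions):

    import torch
    from torch import nn

    class PatchEmbedSketch(nn.Module):
        """Illustrative Conv2d patch embedding, mirroring the documented behavior."""

        def __init__(self, patch_size=16, in_channels=3, embed_dim=768, flatten=True):
            super().__init__()
            self.flatten = flatten
            # Non-overlapping patches: kernel size and stride both equal patch_size.
            self.proj = nn.Conv2d(
                in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.proj(x)  # (B, embed_dim, H / patch_size, W / patch_size)
            if self.flatten:
                # Collapse the spatial grid into a token sequence: (B, N, embed_dim).
                x = x.flatten(2).transpose(1, 2)
            return x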