vis4d.op.layer

Init layers module.

class Attention(dim, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0)[source]

ViT Attention Layer.

Modified from timm (https://github.com/huggingface/pytorch-image-models).

Init attention layer.

Parameters:
  • dim (int) – Input tensor’s dimension.

  • num_heads (int, optional) – Number of attention heads. Defaults to 8.

  • qkv_bias (bool, optional) – Whether to add bias to qkv. Defaults to False.

  • attn_drop (float, optional) – Dropout rate for attention. Defaults to 0.0.

  • proj_drop (float, optional) – Dropout rate for projection. Defaults to 0.0.

__call__(data)[source]

Applies the layer.

Parameters:

data (Tensor) – Input tensor of shape (B, N, dim).

Returns:

Output tensor of the same shape as input.

Return type:

Tensor

forward(x)[source]

Forward pass.

Return type:

Tensor
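
Example (a minimal usage sketch based on the signature and shapes documented above; the tensor sizes are illustrative):

>>> import torch
>>> from vis4d.op.layer import Attention
>>> attn = Attention(dim=256, num_heads=8, qkv_bias=True)
>>> tokens = torch.randn(2, 196, 256)  # (B, N, dim)
>>> out = attn(tokens)                 # same shape as the input
>>> out.shape
torch.Size([2, 196, 256])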

class CSPLayer(in_channels, out_channels, expand_ratio=0.5, num_blocks=1, add_identity=True)[source]

Cross Stage Partial Layer.

Parameters:
  • in_channels (int) – The input channels of the CSP layer.

  • out_channels (int) – The output channels of the CSP layer.

  • expand_ratio (float, optional) – Ratio to adjust the number of channels of the hidden layer. Defaults to 0.5.

  • num_blocks (int, optional) – Number of blocks. Defaults to 1.

  • add_identity (bool, optional) – Whether to add identity in blocks. Defaults to True.

Init.

forward(features)[source]

Forward pass.

Parameters:

features (torch.Tensor) – Input features.

Return type:

Tensor
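
Example (a minimal sketch; the docstring does not spell out the input layout, so an NCHW feature map is assumed here):

>>> import torch
>>> from vis4d.op.layer import CSPLayer
>>> csp = CSPLayer(in_channels=64, out_channels=128, num_blocks=2)
>>> feats = torch.randn(1, 64, 32, 32)  # assumed (B, C, H, W) feature map
>>> out = csp(feats)                    # (B, out_channels, H, W) expected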

class Conv2d(*args, norm=None, activation=None, **kwargs)[source]

Wrapper around Conv2d to support empty inputs and norm/activation.

Creates an instance of the class.

If norm is specified, its weight is initialized to 1.0 and its bias to 0.0.

forward(x)[source]

Forward pass.

Return type:

Tensor
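
Example (a minimal sketch; it assumes the positional and keyword arguments are forwarded to torch.nn.Conv2d, as the wrapper description suggests):

>>> import torch
>>> from torch import nn
>>> from vis4d.op.layer import Conv2d
>>> conv = Conv2d(3, 16, kernel_size=3, padding=1,
...               norm=nn.BatchNorm2d(16), activation=nn.ReLU(inplace=True))
>>> out = conv(torch.randn(2, 3, 64, 64))  # convolution, then norm, then activation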

class DeformConv(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm=None, activation=None)[source]

Wrapper around Deformable Convolution operator with norm/activation.

If norm is specified, its weight is initialized to 1.0 and its bias to 0.0.

Creates an instance of the class.

Parameters:
  • in_channels (int) – Input channels.

  • out_channels (int) – Output channels.

  • kernel_size (int) – Size of convolutional kernel.

  • stride (int, optional) – Stride of convolutional layer. Defaults to 1.

  • padding (int, optional) – Padding of convolutional layer. Defaults to 0.

  • dilation (int, optional) – Dilation of convolutional layer. Defaults to 1.

  • groups (int, optional) – Number of deformable groups. Defaults to 1.

  • bias (bool, optional) – Whether to use bias in convolutional layer. Defaults to True.

  • norm (nn.Module, optional) – Normalization layer. Defaults to None.

  • activation (nn.Module, optional) – Activation layer. Defaults to None.

forward(input_x)[source]

Forward.

Return type:

Tensor

init_weights()[source]

Initialize weights of offset conv layer.

Return type:

None
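
Example (a minimal sketch; an NCHW input is assumed):

>>> import torch
>>> from vis4d.op.layer import DeformConv
>>> dconv = DeformConv(in_channels=64, out_channels=64, kernel_size=3, padding=1)
>>> out = dconv(torch.randn(1, 64, 32, 32))  # assumed (B, C, H, W) input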

class DropPath(drop_prob=0.0, scale_by_keep=True)[source]

DropPath regularizer (Stochastic Depth) per sample.

Init DropPath.

Parameters:
  • drop_prob (float, optional) – Probability of an item to be masked. Defaults to 0.0.

  • scale_by_keep (bool, optional) – Whether to scale the output by the keep probability. Defaults to True.

__call__(data)[source]

Applies the layer.

Parameters:

data (Tensor) – Input tensor of shape [N, …].

Return type:

Tensor

forward(x)[source]

Forward pass.

Return type:

Tensor
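
Example (a minimal sketch; stochastic depth is assumed, as in most dropout-style layers, to be active only in training mode, hence the explicit train() call):

>>> import torch
>>> from vis4d.op.layer import DropPath
>>> drop = DropPath(drop_prob=0.1).train()  # assumed active only in train mode
>>> x = torch.randn(4, 196, 256)            # [N, ...] input
>>> out = drop(x)                           # drops whole samples with probability 0.1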

class PatchEmbed(img_size=224, patch_size=16, in_channels=3, embed_dim=768, norm_layer=None, flatten=True, bias=True)[source]

2D Image to Patch Embedding.

Init PatchEmbed.

Parameters:
  • img_size (int, optional) – Input image’s size. Defaults to 224.

  • patch_size (int, optional) – Patch size. Defaults to 16.

  • in_channels (int, optional) – Number of input image’s channels. Defaults to 3.

  • embed_dim (int, optional) – Patch embedding’s dim. Defaults to 768.

  • norm_layer (nn.Module, optional) – Normalization layer. Defaults to None, which means no normalization layer.

  • flatten (bool, optional) – Whether to flatten the output tensor. Defaults to True.

  • bias (bool, optional) – Whether to add bias to the convolution layer. Defaults to True.

Raises:

ValueError – If the input image’s size is not divisible by the patch size.

__call__(data)[source]

Applies the layer.

Parameters:

data (torch.Tensor) – Input tensor of shape (B, C, H, W).

Returns:

Output tensor of shape (B, N, C), where N is the number of patches (N = H * W).

Return type:

torch.Tensor

forward(x)[source]

Forward function.

Return type:

Tensor
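
Example (a minimal usage sketch based on the documented shapes):

>>> import torch
>>> from vis4d.op.layer import PatchEmbed
>>> embed = PatchEmbed(img_size=224, patch_size=16, in_channels=3, embed_dim=768)
>>> images = torch.randn(2, 3, 224, 224)  # (B, C, H, W)
>>> patches = embed(images)               # (B, N, embed_dim) with N = (224 // 16) ** 2 = 196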

class ResnetBlockFC(size_in, size_out=None, size_h=None)[source]

Fully connected ResNet Block consisting of two linear layers.

Parameters:
  • size_in (int) – Input dimension.

  • size_out (Optional[int]) – Output dimension. If not specified, same as size_in.

  • size_h (Optional[int]) – Hidden dimension. If not specified, same as min(size_in, size_out).

__call__(data)[source]

Applies the layer.

Parameters:

data (Tensor) – Input tensor of shape [N, C].

Return type:

Tensor

forward(data)[source]

Applies the layer.

Parameters:

data (Tensor) – Input tensor of shape [N, C].

Return type:

Tensor
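
Example (a minimal usage sketch based on the documented [N, C] input shape):

>>> import torch
>>> from vis4d.op.layer import ResnetBlockFC
>>> block = ResnetBlockFC(size_in=128, size_out=256)
>>> out = block(torch.randn(32, 128))  # [N, size_in] -> [N, size_out]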

class TransformerBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=GELU(approximate='none'), norm_layer=None)[source]

Transformer block for Vision Transformer.

Init transformer block.

Parameters:
  • dim (int) – Input tensor’s dimension.

  • num_heads (int) – Number of attention heads.

  • mlp_ratio (float, optional) – Ratio of MLP hidden dim to embedding dim. Defaults to 4.0.

  • qkv_bias (bool, optional) – Whether to add bias to qkv. Defaults to False.

  • drop (float, optional) – Dropout rate for attention and projection. Defaults to 0.0.

  • attn_drop (float, optional) – Dropout rate for attention. Defaults to 0.0.

  • init_values (tuple[float, float] | None, optional) – Initial values for layer scale. Defaults to None.

  • drop_path (float, optional) – Dropout rate for drop path. Defaults to 0.0.

  • act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU.

  • norm_layer (nn.Module, optional) – Normalization layer. If None, use nn.LayerNorm.

__call__(data)[source]

Forward pass.

Parameters:

data (torch.Tensor) – Input tensor of shape (B, N, dim).

Returns:

Output tensor of shape (B, N, dim).

Return type:

torch.Tensor

forward(x)[source]

Forward pass.

Return type:

Tensor
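
Example (a minimal usage sketch based on the documented (B, N, dim) shapes):

>>> import torch
>>> from vis4d.op.layer import TransformerBlock
>>> block = TransformerBlock(dim=768, num_heads=12, mlp_ratio=4.0, qkv_bias=True)
>>> tokens = torch.randn(2, 197, 768)  # (B, N, dim)
>>> out = block(tokens)                # (B, N, dim)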

class TransformerBlockMLP(in_features, hidden_features=None, out_features=None, act_layer=GELU(approximate='none'), bias=True, drop=0.0)[source]

MLP as used in Vision Transformer, MLP-Mixer and related networks.

Init MLP.

Parameters:
  • in_features (int) – Number of input features.

  • hidden_features (int, optional) – Number of hidden features. Defaults to None.

  • out_features (int, optional) – Number of output features. Defaults to None.

  • act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU.

  • bias (bool, optional) – Whether to use bias. Defaults to True.

  • drop (float, optional) – Dropout probability. Defaults to 0.0.

__call__(data)[source]

Applies the layer.

Parameters:

data (Tensor) – Input tensor of shape [N, C].

Return type:

Tensor

forward(x)[source]

Forward pass.

Parameters:

x (Tensor) – Input tensor of shape [N, C].

Return type:

Tensor
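
Example (a minimal usage sketch based on the documented [N, C] input shape; hidden_features is chosen here as 4x the input width, mirroring the usual ViT MLP ratio):

>>> import torch
>>> from vis4d.op.layer import TransformerBlockMLP
>>> mlp = TransformerBlockMLP(in_features=768, hidden_features=3072, drop=0.1)
>>> out = mlp(torch.randn(196, 768))  # [N, in_features] input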

class UnetDownConv(in_channels, out_channels, pooling=True, activation='ReLU')[source]

Downsamples a feature map by applying two convolutions and maxpool.

Creates a new downsampling convolution operator.

This operator consists of two convolutions followed by a maxpool operator.

Parameters:
  • in_channels (int) – Input channels.

  • out_channels (int) – Output channels.

  • pooling (bool) – Whether pooling should be applied.

  • activation (str) – Activation that should be applied.

__call__(data)[source]

Applies the operator.

Parameters:

data (Tensor) – Input data.

Returns:

Containing the features before the pooling operation (features) and after (pooled_features).

Return type:

UnetDownConvOut

forward(data)[source]

Applies the operator.

Parameters:

data (Tensor) – Input data.

Returns:

Containing the features before the pooling operation (features) and after (pooled_features).

Return type:

UnetDownConvOut
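
Example (a minimal sketch; an NCHW input is assumed, and the output fields follow the names given above):

>>> import torch
>>> from vis4d.op.layer import UnetDownConv
>>> down = UnetDownConv(in_channels=3, out_channels=64, pooling=True)
>>> out = down(torch.randn(1, 3, 128, 128))            # assumed (B, C, H, W) input
>>> feats, pooled = out.features, out.pooled_features  # before / after max pooling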

class UnetUpConv(in_channels, out_channels, merge_mode='concat', up_mode='transpose')[source]

An operator that performs 2 convolutions and 1 UpConvolution.

A ReLU activation follows each convolution.

Creates a new UpConv operator.

This operator merges two inputs by upsampling one and combining it with the other.

Parameters:
  • in_channels (int) – Number of input channels (low res)

  • out_channels (int) – Number of output channels (high res)

  • merge_mode (str) – How to merge both input channels

  • up_mode (str) – How to upsample the channel with lower resolution

Raises:

ValueError – If upsampling mode is unknown

__call__(from_down, from_up)[source]

Forward pass.

Parameters:
  • from_down (Tensor) – Tensor from the encoder pathway. Assumed to have dimension ‘out_channels’

  • from_up (Tensor) – Upconv’d tensor from the decoder pathway. Assumed to have dimension ‘in_channels’

Return type:

Tensor

forward(from_down, from_up)[source]

Forward pass.

Parameters:
  • from_down (Tensor) – Tensor from the encoder pathway. Assumed to have dimension ‘out_channels’

  • from_up (Tensor) – Upconv’d tensor from the decoder pathway. Assumed to have dimension ‘in_channels’

Return type:

Tensor
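
Example (a minimal sketch; the spatial sizes are chosen so that the upsampled decoder tensor is assumed to match the encoder tensor after one 2x upsampling step):

>>> import torch
>>> from vis4d.op.layer import UnetUpConv
>>> up = UnetUpConv(in_channels=128, out_channels=64, merge_mode='concat', up_mode='transpose')
>>> from_down = torch.randn(1, 64, 64, 64)  # encoder features ('out_channels', high res)
>>> from_up = torch.randn(1, 128, 32, 32)   # decoder features ('in_channels', low res)
>>> out = up(from_down, from_up)            # merged and convolved, 'out_channels' channels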

add_conv_branch(num_branch_convs, last_layer_dim, conv_out_dim, conv_has_bias, norm_cfg, num_groups)[source]

Init conv branch for head.

Return type:

tuple[ModuleList, int]
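
A hypothetical call sketch; the accepted types for norm_cfg and num_groups are not spelled out above, so the values below (no normalization, 32 groups) are assumptions:

>>> from vis4d.op.layer import add_conv_branch
>>> convs, last_dim = add_conv_branch(
...     num_branch_convs=2, last_layer_dim=256, conv_out_dim=256,
...     conv_has_bias=False, norm_cfg=None, num_groups=32,
... )
>>> # convs is the ModuleList of stacked convs; last_dim is the branch's output dimension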

Modules

vis4d.op.layer.attention

Attention layer.

vis4d.op.layer.conv2d

Wrapper for conv2d.

vis4d.op.layer.csp_layer

Cross Stage Partial Layer.

vis4d.op.layer.deform_conv

Wrapper for deformable convolution.

vis4d.op.layer.drop

DropPath (Stochastic Depth) regularization layers.

vis4d.op.layer.mlp

MLP Layers.

vis4d.op.layer.ms_deform_attn

Multi-Scale Deformable Attention Module.

vis4d.op.layer.patch_embed

Image to Patch Embedding using Conv2d.

vis4d.op.layer.positional_encoding

Positional encoding for transformer.

vis4d.op.layer.transformer

Transformer layer.

vis4d.op.layer.util

Utility functions for layer ops.

vis4d.op.layer.weight_init

Model weight initialization.