vis4d.op.layer.transformer

Transformer layer.

Modified from timm (https://github.com/huggingface/pytorch-image-models) and mmdetection (https://github.com/open-mmlab/mmdetection).

Functions

get_clones(module, num)

Create num identical layers.

inverse_sigmoid(x[, eps])

Inverse function of sigmoid.

Classes

FFN([embed_dims, feedforward_channels, ...])

Implements feed-forward networks (FFNs) with identity connection.

LayerScale(dim[, inplace, data_format, ...])

Layer scaler.

TransformerBlock(dim, num_heads[, ...])

Transformer block for Vision Transformer.

class FFN(embed_dims=256, feedforward_channels=1024, num_fcs=2, dropout=0.0, activation='ReLU', inplace=True, dropout_layer=None, add_identity=True, layer_scale_init_value=0.0)[source]

Implements feed-forward networks (FFNs) with identity connection.

Init FFN.

Parameters:
  • embed_dims (int) – The feature dimension. Defaults to 256.

  • feedforward_channels (int) – The hidden dimension of FFNs. Defaults to 1024.

  • num_fcs (int) – The number of fully-connected layers in FFNs. Defaults to 2.

  • dropout (float) – The dropout rate of FFNs. Defaults to 0.0.

  • activation (str) – The activation function of FFNs. Defaults to 'ReLU'.

  • inplace (bool) – Whether to run the activation in-place. Defaults to True.

  • dropout_layer (nn.Module | None, optional) – The dropout layer used when adding the shortcut. Defaults to None. If None, Identity is used.

  • add_identity (bool, optional) – Whether to add the identity connection. Defaults to True.

  • layer_scale_init_value (float) – Initial value of the scale factor in LayerScale. Defaults to 0.0.

forward(x, identity=None)[source]

Forward function for FFN.

The function adds x to the output tensor if identity is None.

Return type:

None
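
The identity connection described above can be illustrated with a minimal sketch. This is not the vis4d implementation: TinyFFN is a hypothetical class that assumes two fully-connected layers with ReLU and omits dropout_layer and LayerScale. It only shows how the identity argument of forward selects the shortcut.

```python
import torch
from torch import nn


class TinyFFN(nn.Module):
    """Hypothetical two-layer FFN with an optional identity connection."""

    def __init__(self, embed_dims=256, feedforward_channels=1024,
                 dropout=0.0, add_identity=True):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(embed_dims, feedforward_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(feedforward_channels, embed_dims),
            nn.Dropout(dropout),
        )
        self.add_identity = add_identity

    def forward(self, x, identity=None):
        out = self.layers(x)
        if not self.add_identity:
            return out
        # Fall back to the input itself when no explicit shortcut is given.
        if identity is None:
            identity = x
        return identity + out


ffn = TinyFFN(embed_dims=8, feedforward_channels=16).eval()
x = torch.randn(2, 5, 8)
y = ffn(x)                                        # shortcut is x
y_zero = ffn(x, identity=torch.zeros_like(x))     # shortcut is all zeros
print(y.shape)  # torch.Size([2, 5, 8])
```

Passing an explicit identity is useful when the shortcut should come from a tensor other than the FFN input, for example the value before a preceding normalization layer.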

class LayerScale(dim, inplace=False, data_format='channels_last', init_values=1e-05)[source]

Layer scaler.

Init layer scaler.

Parameters:
  • dim (int) – Input tensor’s dimension.

  • inplace (bool) – Whether to perform the operation in-place. Default: False.

  • data_format (str) – The input data format, either ‘channels_last’ or ‘channels_first’, representing (B, N, C) and (B, C, H, W) format data respectively. Default: channels_last.

  • init_values (float, optional) – Initial values for layer scale. Defaults to 1e-5.

forward(x)[source]

Forward pass.

Return type:

Tensor
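
As a rough sketch of what layer scaling does, the hypothetical TinyLayerScale below multiplies each channel by a learnable weight initialised to init_values. It assumes that ‘channels_last’ input has the channel on the last axis, as in (B, N, C), and that ‘channels_first’ input has it on axis 1, as in (B, C, H, W). The actual vis4d class may differ in detail.

```python
import torch
from torch import nn


class TinyLayerScale(nn.Module):
    """Hypothetical per-channel scaler following the documented arguments."""

    def __init__(self, dim, inplace=False, data_format="channels_last",
                 init_values=1e-5):
        super().__init__()
        assert data_format in ("channels_last", "channels_first")
        self.inplace = inplace
        self.data_format = data_format
        self.weight = nn.Parameter(torch.ones(dim) * init_values)

    def forward(self, x):
        if self.data_format == "channels_first":
            # Reshape to (1, C, 1, ...) so the weight broadcasts over axis 1.
            shape = (1, -1) + (1,) * (x.dim() - 2)
            weight = self.weight.view(*shape)
        else:
            # The weight already lines up with the last axis.
            weight = self.weight
        return x.mul_(weight) if self.inplace else x * weight


tokens = torch.ones(2, 4, 8)       # (B, N, C)
images = torch.ones(2, 8, 3, 3)    # (B, C, H, W)
scale_last = TinyLayerScale(8, init_values=0.5)
scale_first = TinyLayerScale(8, data_format="channels_first", init_values=0.5)
out_tokens = scale_last(tokens)
out_images = scale_first(images)
```

With the small default init_values, a residual branch wrapped in such a scaler starts out contributing almost nothing, which is the usual motivation for layer scale in deep vision transformers.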

class TransformerBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=GELU(approximate='none'), norm_layer=None)[source]

Transformer block for Vision Transformer.

Init transformer block.

Parameters:
  • dim (int) – Input tensor’s dimension.

  • num_heads (int) – Number of attention heads.

  • mlp_ratio (float, optional) – Ratio of MLP hidden dim to embedding dim. Defaults to 4.0.

  • qkv_bias (bool, optional) – Whether to add a bias to qkv. Defaults to False.

  • drop (float, optional) – Dropout rate for attention and projection. Defaults to 0.0.

  • attn_drop (float, optional) – Dropout rate for attention. Defaults to 0.0.

  • init_values (tuple[float, float] | None, optional) – Initial values for layer scale. Defaults to None.

  • drop_path (float, optional) – Dropout rate for drop path. Defaults to 0.0.

  • act_layer (nn.Module, optional) – Activation layer. Defaults to nn.GELU.

  • norm_layer (nn.Module, optional) – Normalization layer. If None, use nn.LayerNorm.

__call__(data)[source]

Forward pass.

Parameters:

data (torch.Tensor) – Input tensor of shape (B, N, dim).

Returns:

Output tensor of shape (B, N, dim).

Return type:

torch.Tensor

forward(x)[source]

Forward pass.

Return type:

Tensor
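
For orientation, here is a minimal sketch of a pre-norm transformer block in the style of timm. It is an assumption about the general structure, not the vis4d code: TinyBlock is a hypothetical class, it substitutes torch.nn.MultiheadAttention for the library's own attention layer, and it leaves out dropout, drop path and layer scale.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical pre-norm block: attention and MLP, each with a residual."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)  # MLP hidden size from mlp_ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        # Self-attention on the normalized tokens, added back to the input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP on the normalized tokens, again with a residual connection.
        return x + self.mlp(self.norm2(x))


block = TinyBlock(dim=32, num_heads=4).eval()
data = torch.randn(2, 10, 32)   # (B, N, dim)
out = block(data)
print(out.shape)  # torch.Size([2, 10, 32])
```

Because both sub-layers are residual, the output keeps the (B, N, dim) shape of the input, which matches the shapes documented for __call__.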

get_clones(module, num)[source]

Create num identical layers.

Return type:

ModuleList
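
A common way to build such a list, shown here as a sketch rather than the library's exact code, is to deep-copy the module so that every clone starts with the same values but owns its own parameters. The name clone_layers is hypothetical.

```python
import copy

import torch
from torch import nn


def clone_layers(module, num):
    """Return a ModuleList holding num independent deep copies of module."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(num)])


layers = clone_layers(nn.Linear(4, 4), 3)

# The copies start with equal weights but do not share storage.
same_values = torch.equal(layers[0].weight, layers[1].weight)
shared = layers[0].weight.data_ptr() == layers[1].weight.data_ptr()
```

Deep copies matter here: putting the same module object into the list several times would tie the weights of all layers together.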

inverse_sigmoid(x, eps=1e-05)[source]

Inverse function of sigmoid.

Parameters:
  • x (Tensor) – The tensor to apply the inverse sigmoid to.

  • eps (float) – Small value used to avoid numerical overflow. Defaults to 1e-5.

Returns:

The result of applying the inverse sigmoid to x, with the same shape as the input.

Return type:

Tensor
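
The inverse of the sigmoid is the logit, log(x / (1 - x)), which diverges at x = 0 and x = 1. The sketch below shows one common way to keep it finite by clamping both the numerator and the denominator to eps. The function name inverse_sigmoid_sketch is hypothetical, and the clamping scheme is an assumption about how eps is used, not a copy of the vis4d source.

```python
import torch


def inverse_sigmoid_sketch(x, eps=1e-5):
    """Clamped logit: log(x / (1 - x)) with both terms kept at least eps."""
    x = x.clamp(min=0.0, max=1.0)
    numerator = x.clamp(min=eps)
    denominator = (1.0 - x).clamp(min=eps)
    return torch.log(numerator / denominator)


p = torch.tensor([0.1, 0.5, 0.9])
logits = inverse_sigmoid_sketch(p)
recovered = torch.sigmoid(logits)          # round trip back to p

# The boundary values stay finite thanks to the eps clamp.
edges = inverse_sigmoid_sketch(torch.tensor([0.0, 1.0]))
```

This kind of function is typically used to move normalized coordinates into logit space, add an offset there, and map the result back with a sigmoid.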