vis4d.op.detect3d.bevformer.decoder

BEVFormer decoder.

Classes

BEVFormerDecoder([num_layers, embed_dims, ...])

Implements the decoder of the DETR3D transformer.

BEVFormerDecoderLayer([embed_dims, ...])

Implements a decoder layer of the DETR transformer.

DecoderCrossAttention([embed_dims, ...])

Custom Multi-Scale Deformable Attention.

class BEVFormerDecoder(num_layers=6, embed_dims=256, return_intermediate=True)[source]

Implements the decoder of the DETR3D transformer.

Init.

Parameters:
  • num_layers (int) – The number of decoder layers. Default: 6.

  • embed_dims (int) – The embedding dimension. Default: 256.

  • return_intermediate (bool) – Whether to return intermediate results. Default: True.

forward(query, value, reference_points, spatial_shapes, level_start_index, query_pos, reg_branches)[source]

Forward function.

Parameters:
  • query (Tensor) – Input query with shape (num_query, bs, embed_dims).

  • value (Tensor) – Input value with shape (bs, num_query, embed_dims).

  • reference_points (Tensor) – The reference points for the sampling offsets, with shape (bs, num_query, 4) when as_two_stage is set, otherwise (bs, num_query, 2).

  • spatial_shapes (Tensor) – The spatial shapes of feature maps.

  • level_start_index (Tensor) – The start index of each level.

  • query_pos (Tensor) – The query position embedding.

  • reg_branches (list[Module]) – Used for refining the regression results.

Returns:

The output of the decoder with reference points. If return_intermediate is True, the output and reference points of each layer will be stacked and returned.

Return type:

tuple[Tensor, Tensor]
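The effect of return_intermediate can be sketched in plain Python. This is a toy illustration, not the vis4d implementation: scalars stand in for tensors, run_decoder, toy_layers and toy_branches are hypothetical names, and the refinement rule (add the branch output to the reference point in logit space, then apply a sigmoid) is an assumption borrowed from common DETR-style decoders.

```python
import math
from typing import Callable, List, Sequence, Tuple


def inverse_sigmoid(x: float, eps: float = 1e-5) -> float:
    """Map a value in (0, 1) back to logit space, clamped for stability."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def run_decoder(
    query: float,
    reference_point: float,
    layers: Sequence[Callable[[float, float], float]],
    reg_branches: Sequence[Callable[[float], float]],
    return_intermediate: bool = True,
) -> Tuple[List[float], List[float]]:
    """Hypothetical scalar stand-in for a layer-by-layer decoder loop."""
    outputs: List[float] = []
    references: List[float] = []
    for layer, branch in zip(layers, reg_branches):
        query = layer(query, reference_point)
        # Assumed refinement rule: shift the reference point in logit space.
        reference_point = sigmoid(branch(query) + inverse_sigmoid(reference_point))
        if return_intermediate:
            outputs.append(query)
            references.append(reference_point)
    if return_intermediate:
        # One entry per layer, mirroring the stacked output described above.
        return outputs, references
    # Only the final layer's result, wrapped in lists to keep one return type.
    return [query], [reference_point]


# Toy layers add 1 to the query; toy branches predict a zero offset.
toy_layers = [lambda q, ref: q + 1.0] * 3
toy_branches = [lambda q: 0.0] * 3

outputs, references = run_decoder(0.0, 0.5, toy_layers, toy_branches)
last_only, last_ref = run_decoder(
    0.0, 0.5, toy_layers, toy_branches, return_intermediate=False
)
```

With return_intermediate enabled the sketch yields one output and one reference point per layer; with it disabled only the last layer's pair is kept.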

class BEVFormerDecoderLayer(embed_dims=256, feedforward_channels=512, drop_out=0.1)[source]

Implements a decoder layer of the DETR transformer.

Init.

Parameters:
  • embed_dims (int) – The embedding dimension.

  • feedforward_channels (int) – The hidden dimension of FFNs.

  • drop_out (float) – The dropout rate of FFNs.

forward(query, reference_points, value, spatial_shapes, level_start_index, query_pos=None)[source]

Forward.

Parameters:
  • query (Tensor) – The input query with shape (bs, num_queries, dim).

  • reference_points (Tensor) – The reference points for the sampling offsets, with shape (bs, num_query, 4) when as_two_stage is set, otherwise (bs, num_query, 2).

  • value (Tensor, optional) – The input value with shape (bs, num_keys, dim).

  • spatial_shapes (Tensor) – The spatial shapes of feature maps.

  • level_start_index (Tensor) – The start index of each level.

  • query_pos (Tensor, optional) – The positional encoding for query, with the same shape as query. If not None, it is added to query before the forward pass. Defaults to None.

Returns:

The forwarded results with shape (bs, num_queries, dim).

Return type:

Tensor

class DecoderCrossAttention(embed_dims=256, num_heads=8, num_levels=4, num_points=4, im2col_step=64, dropout=0.1, batch_first=False)[source]

Custom Multi-Scale Deformable Attention.

Initialization.

Parameters:
  • embed_dims (int) – The embedding dimension of Attention. Default: 256.

  • num_heads (int) – Parallel attention heads. Default: 8.

  • num_levels (int) – The number of feature maps used in Attention. Default: 4.

  • num_points (int) – The number of sampling points for each query in each head. Default: 4.

  • im2col_step (int) – The step used in image_to_column. Default: 64.

  • dropout (float) – A Dropout layer on inp_identity. Default: 0.1.

  • batch_first (bool) – Whether Key, Query and Value have shape (batch, n, embed_dim) (True) or (n, batch, embed_dim) (False). Default: False.
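The two layouts named by batch_first differ only in the order of the first two axes. The sketch below shows that swap on nested Python lists so it runs without any dependency; swap_batch_and_sequence and the toy sizes are hypothetical. On a 3-D torch tensor the equivalent operation is tensor.permute(1, 0, 2).

```python
from typing import List

Nested3D = List[List[List[float]]]


def swap_batch_and_sequence(x: Nested3D) -> Nested3D:
    """Swap the first two axes: (n, batch, dim) <-> (batch, n, dim)."""
    return [list(group) for group in zip(*x)]


n, bs, dim = 3, 2, 4  # toy sizes, chosen only for illustration

# Sequence-first layout (n, batch, dim), as used when batch_first is False.
# Each value encodes its own (sequence, batch, channel) position.
seq_first = [
    [[float(100 * i + 10 * b + d) for d in range(dim)] for b in range(bs)]
    for i in range(n)
]

# Batch-first layout (batch, n, dim), as used when batch_first is True.
batch_first = swap_batch_and_sequence(seq_first)
```

Applying the swap twice returns the original layout, so the same helper converts in both directions.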

forward(query, reference_points, value, spatial_shapes, level_start_index, key_padding_mask=None, query_pos=None, identity=None)[source]

Forward.

Parameters:
  • query (Tensor) – Query of Transformer with shape (num_query, bs, embed_dims).

  • reference_points (Tensor) – The normalized reference points with shape (bs, num_query, num_levels, 2). All elements are in the range [0, 1], with top-left (0, 0) and bottom-right (1, 1), including the padding area. Alternatively, the shape is (bs, num_query, num_levels, 4), where the two additional values (w, h) form reference boxes.

  • value (Tensor) – The value tensor with shape (num_key, bs, embed_dims).

  • spatial_shapes (Tensor) – The spatial shapes of the features at the different levels, with shape (num_levels, 2). The last dimension represents (h, w).

  • level_start_index (Tensor) – The start index of each level. A tensor with shape (num_levels,) that can be written as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].

  • key_padding_mask (Tensor) – ByteTensor for query, with shape [bs, num_key].

  • query_pos (Tensor) – The positional encoding for query. Default: None.

  • identity (Tensor) – The tensor used for the residual addition, with the same shape as query. Default: None, in which case query is used.

Returns:

The forwarded results with shape [num_query, bs, embed_dims].

Return type:

Tensor
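The relation between spatial_shapes and level_start_index described above is a running sum of h*w per level. A minimal sketch in plain Python follows; the helper names and the four example shapes are hypothetical, and lists stand in for tensors.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def compute_level_start_index(spatial_shapes: Sequence[Tuple[int, int]]) -> List[int]:
    """Return [0, h_0*w_0, h_0*w_0 + h_1*w_1, ...] for (h, w) pairs."""
    sizes = [h * w for h, w in spatial_shapes]
    # Prepend 0 and drop the final total, which equals num_key.
    return [0, *accumulate(sizes)][:-1]


def split_levels(
    flat: Sequence[int], spatial_shapes: Sequence[Tuple[int, int]]
) -> List[List[int]]:
    """Slice a flattened per-key sequence back into one chunk per level."""
    starts = compute_level_start_index(spatial_shapes)
    return [
        list(flat[start:start + h * w])
        for start, (h, w) in zip(starts, spatial_shapes)
    ]


# Hypothetical four-level feature pyramid, given as (h, w) per level.
spatial_shapes = [(50, 50), (25, 25), (13, 13), (7, 7)]
starts = compute_level_start_index(spatial_shapes)  # [0, 2500, 3125, 3294]
num_key = sum(h * w for h, w in spatial_shapes)  # 3343

# Key indices stand in for the flattened value tensor along its num_key axis.
levels = split_levels(range(num_key), spatial_shapes)
```

Each chunk in levels has h*w entries for its level, and the first entry of each chunk is the matching start index.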

init_weights()[source]

Default initialization for Parameters of Module.

Return type:

None