vis4d.op.detect3d.bevformer.spatial_cross_attention

Spatial Cross Attention Module for BEVFormer.

Classes

`MSDeformableAttention3D`([embed_dims, ...])	An attention module used in BEVFormer based on Deformable-Detr.
`SpatialCrossAttention`([embed_dims, ...])	An attention module used in BEVFormer.

class MSDeformableAttention3D(embed_dims=256, num_heads=8, num_levels=4, num_points=8, im2col_step=64, batch_first=True)

An attention module used in BEVFormer based on Deformable-Detr.

Init.

Parameters:

embed_dims (int) – The embedding dimension of Attention. Default: 256.
num_heads (int) – The number of parallel attention heads. Default: 8.
num_levels (int) – The number of feature maps used in Attention. Default: 4.
num_points (int) – The number of sampling points for each query in each head. Default: 8.
im2col_step (int) – The step used in image_to_column. Default: 64.
batch_first (bool) – Whether Key, Query and Value have shape (batch, n, embed_dim) rather than (n, batch, embed_dim). Default: True.

forward(query, reference_points, value, spatial_shapes, level_start_index, key_padding_mask=None, query_pos=None)

Forward.

Parameters:

query (Tensor) – Query of Transformer with shape (bs, num_query, embed_dims).
reference_points (Tensor) – The normalized reference points with shape (bs, num_query, num_levels, 2). All elements lie in the range [0, 1], with (0, 0) at the top-left and (1, 1) at the bottom-right, including the padding area. Alternatively, shape (bs, num_query, num_levels, 4), where the two additional dimensions (w, h) form reference boxes.
value (Tensor) – The value tensor with shape (bs, num_key, embed_dims).
spatial_shapes (Tensor) – Spatial shape of features in different levels. With shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level. A tensor of shape (num_levels,) that can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
key_padding_mask (Tensor) – ByteTensor for value, with shape [bs, num_key]. Default: None.
query_pos (Tensor) – The positional encoding for query. Default: None.

Returns:

Forwarded results with shape [num_query, bs, embed_dims].

Return type:

Tensor
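
The normalization convention described for reference_points can be illustrated with a minimal pure-Python sketch. It is not part of vis4d: the helper names, the (x, y) coordinate order and the image size are assumptions made for illustration, and real inputs would be tensors of shape (bs, num_query, num_levels, 2).

```python
def normalize_reference_point(x_px, y_px, width, height):
    """Map a pixel location to normalized (x, y) coordinates in [0, 1].

    (0, 0) is the top-left corner and (1, 1) the bottom-right corner of the
    padded feature map. The (x, y) order is an assumption of this sketch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Clamp so that points outside the map still land inside [0, 1].
    x = min(max(x_px / width, 0.0), 1.0)
    y = min(max(y_px / height, 0.0), 1.0)
    return x, y


def reference_points_for_levels(x_px, y_px, width, height, num_levels):
    """Repeat one normalized point for every feature level.

    Returns a nested list of shape (num_levels, 2). Reusing the same pair at
    every level is a simplification that ignores per-level padding ratios.
    """
    point = normalize_reference_point(x_px, y_px, width, height)
    return [list(point) for _ in range(num_levels)]


# Hypothetical 1600x900 input with a query located at pixel (400, 225).
points = reference_points_for_levels(400, 225, 1600, 900, num_levels=4)
```

Because the coordinates are resolution independent, the same pair can be compared against feature maps of different sizes without rescaling.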

init_weights()

Default initialization for Parameters of Module.

class SpatialCrossAttention(embed_dims=256, num_cams=6, dropout=0.1, deformable_attention=None)

An attention module used in BEVFormer.

Init.

Parameters:

embed_dims (int) – The embedding dimension of Attention. Default: 256.
num_cams (int) – The number of cameras. Default: 6.
dropout (float) – The dropout probability of the Dropout layer applied on inp_residual. Default: 0.1.
deformable_attention (MSDeformableAttention3D, optional) – The deformable attention module. Default: None. If None, we will use MSDeformableAttention3D with default parameters.

forward(query, reference_points, value, spatial_shapes, level_start_index, bev_mask, query_pos=None)

Forward function of SpatialCrossAttention.

Parameters:

query (Tensor) – Query of Transformer with shape (num_query, bs, embed_dims).
reference_points (Tensor) – The normalized reference points with shape (bs, num_query, 4). All elements lie in the range [0, 1], with (0, 0) at the top-left and (1, 1) at the bottom-right, including the padding area. Alternatively, shape (bs, num_query, num_levels, 4), where the two additional dimensions (w, h) form reference boxes.
value (Tensor) – The value tensor with shape (num_key, bs, embed_dims). (B, N, C, H, W)
spatial_shapes (Tensor) – Spatial shape of features in different levels. With shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level. A tensor of shape (num_levels,) that can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
bev_mask (Tensor) – The mask of BEV features with shape (num_query, bs, num_levels, h, w).
query_pos (Tensor) – The positional encoding for query. Default: None.

Returns:

Forwarded results with shape [num_query, bs, embed_dims].

Return type:

Tensor
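
The relationship between spatial_shapes and level_start_index described above can be sketched in plain Python. The helper names and the feature-map sizes below are hypothetical, and the sketch assumes, as is usual for multi-scale deformable attention, that num_key equals the total number of flattened positions across all levels. In practice both values would be passed as tensors.

```python
from itertools import accumulate


def compute_level_start_index(spatial_shapes):
    """Return the flat start offset of each level.

    ``spatial_shapes`` is a sequence of (h, w) pairs, one per level. The
    result follows the pattern [0, h_0*w_0, h_0*w_0 + h_1*w_1, ...].
    """
    if not spatial_shapes:
        return []
    areas = [h * w for h, w in spatial_shapes]
    # Running sums give the end offset of each level. Shifting them right
    # by one position (and starting at 0) yields the start offsets.
    return [0] + list(accumulate(areas))[:-1]


def compute_num_key(spatial_shapes):
    """Total number of flattened positions over all levels (assumed num_key)."""
    return sum(h * w for h, w in spatial_shapes)


# Hypothetical four-level feature pyramid, each level given as (h, w).
spatial_shapes = [(8, 16), (4, 8), (2, 4), (1, 2)]
level_start_index = compute_level_start_index(spatial_shapes)
num_key = compute_num_key(spatial_shapes)
```

With these shapes, level i occupies positions level_start_index[i] up to, but not including, the next start index (or num_key for the last level) of the flattened value tensor.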

init_weight()

Default initialization for Parameters of Module.