vis4d.op.detect3d.bevformer.encoder

BEVFormer Encoder.

Classes

`BEVFormerEncoder`([num_layers, layer, ...])	Attention with both self and cross attention.
`BEVFormerEncoderLayer`([embed_dims, ...])	BEVFormer encoder layer.

class BEVFormerEncoder(num_layers=6, layer=None, embed_dims=256, num_points_in_pillar=4, point_cloud_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0), return_intermediate=False)

Attention with both self and cross attention.

Init.

Parameters:

num_layers (int) – Number of layers in the encoder.
layer (BEVFormerEncoderLayer, optional) – Encoder layer. Defaults to None. If None, a default layer will be used.
embed_dims (int) – Embedding dimension.
num_points_in_pillar (int) – Number of points in each pillar.
point_cloud_range (Sequence[float]) – Range of the point cloud. Defaults to (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0).
return_intermediate (bool) – Whether to return intermediate outputs.
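
As a rough illustration of how num_points_in_pillar and point_cloud_range could interact, the sketch below samples evenly spaced heights inside one pillar and maps normalized coordinates to metric ones. It assumes the range is ordered (x_min, y_min, z_min, x_max, y_max, z_max) and that heights sit at the centers of equal-height bins; the encoder itself may space the points differently. The helpers `pillar_heights` and `to_metric` are hypothetical and not part of vis4d.

```python
from collections.abc import Sequence

# Default range from the class signature, assumed to be
# (x_min, y_min, z_min, x_max, y_max, z_max) in meters.
DEFAULT_RANGE = (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)


def pillar_heights(
    num_points_in_pillar: int, point_cloud_range: Sequence[float]
) -> list[float]:
    """Return metric z values at the centers of equal-height bins.

    Hypothetical helper: bin-center spacing is an assumption.
    """
    z_min, z_max = point_cloud_range[2], point_cloud_range[5]
    step = (z_max - z_min) / num_points_in_pillar
    return [z_min + (i + 0.5) * step for i in range(num_points_in_pillar)]


def to_metric(
    norm_xyz: Sequence[float], point_cloud_range: Sequence[float]
) -> tuple[float, ...]:
    """Map normalized (x, y, z) in [0, 1] to metric coordinates."""
    mins = point_cloud_range[:3]
    maxs = point_cloud_range[3:]
    return tuple(lo + n * (hi - lo) for n, lo, hi in zip(norm_xyz, mins, maxs))
```

With the default arguments, `pillar_heights(4, DEFAULT_RANGE)` gives `[-4.0, -2.0, 0.0, 2.0]`, and the normalized center `(0.5, 0.5, 0.5)` maps to roughly `(0.0, 0.0, -1.0)`.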

forward(bev_query, value, bev_h, bev_w, bev_pos, spatial_shapes, level_start_index, prev_bev, shift, images_hw, cam_intrinsics, cam_extrinsics, lidar_extrinsics)

Forward.

Parameters:

bev_query (Tensor) – Input BEV query with shape (num_query, batch_size, embed_dims).
value (Tensor) – Input multi-camera features with shape (num_cam, num_value, batch_size, embed_dims).
bev_h (int) – BEV height.
bev_w (int) – BEV width.
bev_pos (Tensor) – BEV positional encoding with shape (batch_size, embed_dims).
spatial_shapes (Tensor) – Spatial shapes of multi-level features with shape (num_levels, 2).
level_start_index (Tensor) – Start index of each level with shape (num_levels, ).
prev_bev (Tensor | None) – Previous BEV features with shape (batch_size, embed_dims).
shift (Tensor) – Shift of each level with shape (num_levels, 2).
images_hw (tuple[int, int]) – Image height and width.
cam_intrinsics (list[Tensor]) – List of camera intrinsics with shape (num_cam, batch_size, 3, 3).
cam_extrinsics (list[Tensor]) – List of camera extrinsics with shape (num_cam, batch_size, 4, 4).
lidar_extrinsics (Tensor) – LiDAR extrinsics with shape (batch_size, 4, 4).

Returns:

Results with shape [batch_size, num_query, embed_dims] when return_intermediate is False; otherwise the shape is [num_layers, batch_size, num_query, embed_dims].

Return type:

Tensor

get_reference_points(bev_h, bev_w, dim, batch_size, device, dtype)

Get the reference points used in spatial cross-attention (SCA) and temporal self-attention (TSA).

Parameters:

bev_h (int) – Height of the BEV feature map.
bev_w (int) – Width of the BEV feature map.
dim (int) – Dimension of the reference points.
batch_size (int) – Batch size.
device (torch.device) – The device where reference_points should be.
dtype (torch.dtype) – The dtype of reference_points.

Returns:

Reference points used in the decoder, with shape (batch_size, num_keys, num_levels, dim).

Return type:

Tensor
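
To make the returned layout concrete, here is a minimal pure-Python sketch of a 2D reference grid. It assumes the points are the normalized centers of the BEV cells, ordered row by row, with a single feature level, so that num_keys equals bev_h * bev_w and dim equals 2. The function `bev_reference_points_2d` is a hypothetical stand-in that uses nested lists instead of tensors; it is not the vis4d implementation.

```python
def bev_reference_points_2d(bev_h: int, bev_w: int, batch_size: int) -> list:
    """Build normalized (x, y) cell centers as nested lists.

    The nesting mirrors (batch_size, num_keys, num_levels, dim) with
    num_keys = bev_h * bev_w, num_levels = 1 and dim = 2. Row-major
    ordering and cell-center sampling are assumptions of this sketch.
    """
    return [
        [
            [[(col + 0.5) / bev_w, (row + 0.5) / bev_h]]  # one level, (x, y)
            for row in range(bev_h)
            for col in range(bev_w)
        ]
        for _ in range(batch_size)
    ]
```

For a 2 x 3 grid, each batch entry holds six keys, and every coordinate lies strictly between 0 and 1 because the points are cell centers rather than cell corners.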

point_sampling(reference_points, images_hw, cam_intrinsics, cam_extrinsics, lidar_extrinsics)

Sample points from reference points.

Return type:

tuple[Tensor, Tensor]
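
The idea behind sampling can be sketched for a single point and a single camera with a standard pinhole projection. The sketch assumes that one 4x4 matrix `ref_to_cam`, taking points from the reference frame into the camera frame, has already been composed from the extrinsics, and that the two returned values are normalized image coordinates and a validity flag. The function `project_point` and its `eps` threshold are hypothetical; how vis4d combines cam_extrinsics and lidar_extrinsics is not shown here.

```python
def project_point(point_xyz, ref_to_cam, intrinsics, image_hw, eps=1e-5):
    """Project one 3D point into one camera (hypothetical helper).

    Returns ((u, v), valid), where u and v are normalized by the image
    width and height, and valid is False for points behind the camera
    or outside the image.
    """
    x, y, z = point_xyz
    homo = (x, y, z, 1.0)
    # Rigid transform into the camera frame (first three rows of the 4x4).
    cam = [sum(ref_to_cam[r][c] * homo[c] for c in range(4)) for r in range(3)]
    # Pinhole projection with the 3x3 intrinsic matrix.
    pix = [sum(intrinsics[r][c] * cam[c] for c in range(3)) for r in range(3)]
    depth = pix[2]
    if depth <= eps:
        return (0.0, 0.0), False  # at or behind the image plane
    height, width = image_hw
    u = pix[0] / depth / width
    v = pix[1] / depth / height
    valid = 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0
    return (u, v), valid
```

A point on the optical axis in front of the camera lands on the principal point and is marked valid; a point with non-positive depth is rejected before the division, which avoids dividing by zero.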

class BEVFormerEncoderLayer(embed_dims=256, self_attn=None, cross_attn=None, feedforward_channels=512, drop_out=0.1)

BEVFormer encoder layer.

Init.

forward(query, value, bev_pos, ref_2d, bev_h, bev_w, spatial_shapes, level_start_index, reference_points_img, bev_mask, prev_bev=None)

Forward function.

self_attn -> norm -> cross_attn -> norm -> ffn -> norm

Returns:

Forwarded results with shape [num_queries, batch_size, embed_dims].

Return type:

Tensor