vis4d.op.detect3d.bevformer.encoder

BEVFormer Encoder.

Classes

BEVFormerEncoder([num_layers, layer, ...])

Encoder applying both self and cross attention.

BEVFormerEncoderLayer([embed_dims, ...])

BEVFormer encoder layer.

class BEVFormerEncoder(num_layers=6, layer=None, embed_dims=256, num_points_in_pillar=4, point_cloud_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0), return_intermediate=False)[source]

Encoder applying both self and cross attention.

Init.

Parameters:
  • num_layers (int) – Number of layers in the encoder.

  • layer (BEVFormerEncoderLayer, optional) – Encoder layer. Defaults to None. If None, a default layer will be used.

  • embed_dims (int) – Embedding dimension.

  • num_points_in_pillar (int) – Number of points in each pillar.

  • point_cloud_range (Sequence[float]) – Range of the point cloud. Defaults to (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0).

  • return_intermediate (bool) – Whether to return intermediate outputs.
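A hedged illustration of how num_points_in_pillar interacts with point_cloud_range: each BEV cell (pillar) is sampled at several heights spread over [z_min, z_max]. The exact spacing used by the library may differ; this is a sketch, not vis4d's code.

```python
def pillar_heights(point_cloud_range, num_points_in_pillar):
    """Sketch: evenly spaced z samples inside one BEV pillar (assumed scheme)."""
    z_min, z_max = point_cloud_range[2], point_cloud_range[5]
    step = (z_max - z_min) / num_points_in_pillar
    # sample at the center of each vertical slice
    return [z_min + (i + 0.5) * step for i in range(num_points_in_pillar)]

# with the default range and 4 points per pillar:
heights = pillar_heights((-51.2, -51.2, -5.0, 51.2, 51.2, 3.0), 4)
# -> [-4.0, -2.0, 0.0, 2.0]
```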

forward(bev_query, value, bev_h, bev_w, bev_pos, spatial_shapes, level_start_index, prev_bev, shift, images_hw, cam_intrinsics, cam_extrinsics, lidar_extrinsics)[source]

Forward.

Parameters:
  • bev_query (Tensor) – Input BEV query with shape (num_query, batch_size, embed_dims).

  • value (Tensor) – Input multi-camera features with shape (num_cam, num_value, batch_size, embed_dims).

  • bev_h (int) – BEV height.

  • bev_w (int) – BEV width.

  • bev_pos (Tensor) – BEV positional encoding with shape (batch_size, embed_dims).

  • spatial_shapes (Tensor) – Spatial shapes of multi-level features with shape (num_levels, 2).

  • level_start_index (Tensor) – Start index of each level with shape (num_levels, ).

  • prev_bev (Tensor | None) – Previous BEV features with shape (batch_size, embed_dims).

  • shift (Tensor) – Shift of each level with shape (num_levels, 2).

  • images_hw (tuple[int, int]) – Image height and width.

  • cam_intrinsics (list[Tensor]) – List of camera intrinsics. In shape (num_cam, batch_size, 3, 3).

  • cam_extrinsics (list[Tensor]) – List of camera extrinsics. In shape (num_cam, batch_size, 4, 4).

  • lidar_extrinsics (Tensor) – LiDAR extrinsics. In shape (batch_size, 4, 4).

Returns:

Results with shape [batch_size, num_query, embed_dims] when return_intermediate is False, otherwise with shape [num_layers, batch_size, num_query, embed_dims].

Return type:

Tensor
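To make the shape conventions above concrete, here is a small bookkeeping sketch. All sizes are illustrative assumptions (a 50x50 BEV grid, batch 1), not values fixed by the library.

```python
# Shape bookkeeping for BEVFormerEncoder.forward (illustrative sizes).
bev_h, bev_w = 50, 50
batch_size, embed_dims = 1, 256
num_layers = 6

num_query = bev_h * bev_w  # one query per BEV grid cell
bev_query_shape = (num_query, batch_size, embed_dims)

# return_intermediate=False -> final layer output only
out_shape = (batch_size, num_query, embed_dims)
# return_intermediate=True -> one output per encoder layer
out_shape_intermediate = (num_layers, batch_size, num_query, embed_dims)
```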

get_reference_points(bev_h, bev_w, dim, batch_size, device, dtype)[source]

Get the reference points used in SCA and TSA.

Parameters:
  • bev_h (int) – Height of the BEV feature map.

  • bev_w (int) – Width of the BEV feature map.

  • dim (int) – Dimension of the reference points.

  • batch_size (int) – Batch size.

  • device (torch.device) – The device where reference_points should be.

  • dtype (torch.dtype) – The dtype of reference_points.

Returns:

Reference points used in the decoder, with shape (batch_size, num_keys, num_levels, dim).

Return type:

Tensor
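As a rough picture of what this returns in the 2D case, reference points are typically the normalized centers of the BEV grid cells. The sketch below is an assumption about the general scheme, not the library's exact implementation (which works on tensors and, for dim=3, also spreads points over pillar heights).

```python
def make_ref_points_2d(bev_h, bev_w):
    """Sketch: one normalized (x, y) cell center per BEV grid cell."""
    points = []
    for y in range(bev_h):
        for x in range(bev_w):
            # cell centers, normalized to [0, 1]
            points.append(((x + 0.5) / bev_w, (y + 0.5) / bev_h))
    return points  # length bev_h * bev_w

pts = make_ref_points_2d(4, 4)  # 16 centers from (0.125, 0.125) to (0.875, 0.875)
```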

point_sampling(reference_points, images_hw, cam_intrinsics, cam_extrinsics, lidar_extrinsics)[source]

Sample points from reference points.

Return type:

tuple[Tensor, Tensor]
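The geometry behind this sampling step is a standard pinhole projection: a 3D reference point is mapped into each camera image via the extrinsics (ego/LiDAR frame to camera frame) and intrinsics (camera frame to pixels), and points behind a camera are masked out. The sketch below illustrates that projection for a single point; the matrices are made up, and vis4d's batched tensor implementation will differ.

```python
def project(point_xyz, cam_extrinsic, cam_intrinsic):
    """Sketch: project one 3D point into pixel coordinates (pinhole model)."""
    x, y, z = point_xyz
    # homogeneous transform into the camera frame (3x4 extrinsic rows)
    xc, yc, zc = (
        sum(cam_extrinsic[r][c] * v for c, v in enumerate((x, y, z, 1.0)))
        for r in range(3)
    )
    if zc <= 0:  # behind the camera -> invalid sample (masked in practice)
        return None
    # perspective division, then apply intrinsics
    u = cam_intrinsic[0][0] * xc / zc + cam_intrinsic[0][2]
    v = cam_intrinsic[1][1] * yc / zc + cam_intrinsic[1][2]
    return u, v

# identity extrinsic; intrinsic with focal length 100, principal point (64, 64)
E = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
K = [[100, 0, 64], [0, 100, 64], [0, 0, 1]]
uv = project((1.0, 0.5, 2.0), E, K)  # -> (114.0, 89.0)
```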

class BEVFormerEncoderLayer(embed_dims=256, self_attn=None, cross_attn=None, feedforward_channels=512, drop_out=0.1)[source]

BEVFormer encoder layer.

Init.

forward(query, value, bev_pos, ref_2d, bev_h, bev_w, spatial_shapes, level_start_index, reference_points_img, bev_mask, prev_bev=None)[source]

Forward function.

self_attn -> norm -> cross_attn -> norm -> ffn -> norm

Returns:

Forwarded results with shape [num_queries, batch_size, embed_dims].

Return type:

Tensor
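The documented flow self_attn -> norm -> cross_attn -> norm -> ffn -> norm can be sketched in PyTorch as below. This uses standard nn.MultiheadAttention as a stand-in for vis4d's temporal self-attention and spatial cross-attention modules, so it is an illustration of the residual/norm wiring only, not the actual layer.

```python
import torch
from torch import nn

class EncoderLayerSketch(nn.Module):
    """Illustrative stand-in for BEVFormerEncoderLayer's residual/norm wiring."""

    def __init__(self, embed_dims=256, feedforward_channels=512, drop_out=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dims, 8, dropout=drop_out)
        self.cross_attn = nn.MultiheadAttention(embed_dims, 8, dropout=drop_out)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dims, feedforward_channels),
            nn.ReLU(),
            nn.Dropout(drop_out),
            nn.Linear(feedforward_channels, embed_dims),
        )
        self.norms = nn.ModuleList(nn.LayerNorm(embed_dims) for _ in range(3))

    def forward(self, query, value):
        # query stays (num_queries, batch_size, embed_dims) throughout
        query = self.norms[0](query + self.self_attn(query, query, query)[0])
        query = self.norms[1](query + self.cross_attn(query, value, value)[0])
        query = self.norms[2](query + self.ffn(query))
        return query

layer = EncoderLayerSketch().eval()
q = torch.randn(2500, 1, 256)  # num_queries=2500 (50x50 BEV grid), batch 1
v = torch.randn(6000, 1, 256)  # flattened multi-camera features
out = layer(q, v)              # same shape as q
```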