vis4d.op.detect3d.bevformer.encoder
BEVFormer Encoder.
Classes
| BEVFormerEncoder | Stack of BEVFormer encoder layers with self- and cross-attention. |
| BEVFormerEncoderLayer | BEVFormer encoder layer. |
- class BEVFormerEncoder(num_layers=6, layer=None, embed_dims=256, num_points_in_pillar=4, point_cloud_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0), return_intermediate=False)
Stack of BEVFormer encoder layers, each combining temporal self-attention and spatial cross-attention.
Init.
- Parameters:
num_layers (int) – Number of layers in the encoder. Defaults to 6.
layer (BEVFormerEncoderLayer, optional) – Encoder layer. Defaults to None, in which case a default layer is used.
embed_dims (int) – Embedding dimension of the BEV queries and image features. Defaults to 256.
num_points_in_pillar (int) – Number of sample points along the height (z) axis of each BEV pillar. Defaults to 4.
point_cloud_range (Sequence[float]) – Range of the point cloud as (x_min, y_min, z_min, x_max, y_max, z_max). Defaults to (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0).
return_intermediate (bool) – Whether to return the outputs of all layers instead of only the last one. Defaults to False.
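The constructor arguments fix the metric geometry of the BEV grid. A plain-Python sketch of that geometry, assuming point_cloud_range is ordered (x_min, y_min, z_min, x_max, y_max, z_max) in meters, that BEV columns (bev_w) run along x and rows (bev_h) along y, and that the pillar sample heights follow the spacing of the original BEVFormer implementation; bev_cell_size and pillar_heights are hypothetical helpers, not part of vis4d:

```python
from typing import List, Sequence, Tuple

# Hypothetical helpers (not part of vis4d) illustrating the BEV grid geometry.
# pc_range is assumed to be (x_min, y_min, z_min, x_max, y_max, z_max) in meters.


def bev_cell_size(pc_range: Sequence[float], bev_h: int, bev_w: int) -> Tuple[float, float]:
    """Metric extent of one BEV cell along x (columns) and y (rows)."""
    x_min, y_min, _, x_max, y_max, _ = pc_range
    return (x_max - x_min) / bev_w, (y_max - y_min) / bev_h


def pillar_heights(pc_range: Sequence[float], num_points_in_pillar: int) -> List[float]:
    """Metric z of each sample point in a pillar.

    Assumes original-BEVFormer spacing: evenly spaced from z_min + 0.5 to z_max - 0.5.
    """
    z_min, z_max = pc_range[2], pc_range[5]
    start, stop = 0.5, (z_max - z_min) - 0.5
    if num_points_in_pillar == 1:
        return [z_min + start]  # mirrors linspace with a single sample
    step = (stop - start) / (num_points_in_pillar - 1)
    return [z_min + start + i * step for i in range(num_points_in_pillar)]


PC_RANGE = (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)
print([round(s, 3) for s in bev_cell_size(PC_RANGE, bev_h=200, bev_w=200)])  # [0.512, 0.512]
print([round(z, 3) for z in pillar_heights(PC_RANGE, 4)])  # [-4.5, -2.167, 0.167, 2.5]
```

With the default range and an example 200×200 grid, each BEV cell covers about 0.512 m × 0.512 m, and the four pillar points sample heights between -4.5 m and 2.5 m.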
- forward(bev_query, value, bev_h, bev_w, bev_pos, spatial_shapes, level_start_index, prev_bev, shift, images_hw, cam_intrinsics, cam_extrinsics, lidar_extrinsics)
Forward.
- Parameters:
bev_query (Tensor) – Input BEV query with shape (num_query, batch_size, embed_dims).
value (Tensor) – Input multi-camera features with shape (num_cam, num_value, batch_size, embed_dims).
bev_h (int) – BEV height.
bev_w (int) – BEV width.
bev_pos (Tensor) – BEV positional encoding with shape (batch_size, embed_dims).
spatial_shapes (Tensor) – Spatial shapes of multi-level features with shape (num_levels, 2).
level_start_index (Tensor) – Start index of each level with shape (num_levels, ).
prev_bev (Tensor | None) – Previous BEV features with shape (batch_size, embed_dims).
shift (Tensor) – Shift of each level with shape (num_levels, 2).
images_hw (tuple[int, int]) – Height and width of the input images.
cam_intrinsics (list[Tensor]) – List of camera intrinsics with shape (num_cam, batch_size, 3, 3).
cam_extrinsics (list[Tensor]) – List of camera extrinsics with shape (num_cam, batch_size, 4, 4).
lidar_extrinsics (Tensor) – LiDAR extrinsics with shape (batch_size, 4, 4).
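spatial_shapes, level_start_index and the num_value axis of value are linked by the usual multi-scale deformable-attention convention: each level's H*W feature map is flattened and the levels are concatenated, so each level starts where the previous one ends. A plain-Python sketch of that bookkeeping, assuming this convention; level_start_indices is a hypothetical helper and the example shapes are illustrative only:

```python
from typing import List, Sequence, Tuple


def level_start_indices(spatial_shapes: Sequence[Tuple[int, int]]) -> Tuple[List[int], int]:
    """Start offset of each level in the flattened feature sequence, plus its total length."""
    starts: List[int] = []
    offset = 0
    for h, w in spatial_shapes:
        starts.append(offset)  # level l begins after all H*W tokens of the previous levels
        offset += h * w
    return starts, offset  # offset now equals num_value


# Example (H, W) per level; values chosen for illustration only.
shapes = [(116, 200), (58, 100), (29, 50), (15, 25)]
starts, num_value = level_start_indices(shapes)
print(starts)     # [0, 23200, 29000, 30450]
print(num_value)  # 30825
```

With PyTorch tensors, the same offsets are commonly computed in Deformable DETR-style code as torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1])).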
- Returns:
- Results with shape [batch_size, num_query, embed_dims] when return_intermediate is False; otherwise, results with shape [num_layers, batch_size, num_query, embed_dims].
- Return type:
Tensor
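The two return shapes differ only by a leading layer axis. A NumPy sketch of how such outputs can be produced, assuming the query is transposed to batch-first once before the layers run and that each layer preserves its input shape; run_encoder and the lambda layers are hypothetical stand-ins, not vis4d code:

```python
from typing import Callable, List

import numpy as np

Layer = Callable[[np.ndarray], np.ndarray]


def run_encoder(bev_query: np.ndarray, layers: List[Layer], return_intermediate: bool = False) -> np.ndarray:
    """Chain the layers and optionally stack every layer's output.

    bev_query: (num_query, batch_size, embed_dims), as documented for forward().
    """
    x = np.transpose(bev_query, (1, 0, 2))  # -> (batch_size, num_query, embed_dims)
    intermediate = []
    for layer in layers:
        x = layer(x)
        if return_intermediate:
            intermediate.append(x)
    if return_intermediate:
        return np.stack(intermediate)  # (num_layers, batch_size, num_query, embed_dims)
    return x  # (batch_size, num_query, embed_dims)


num_query, batch_size, embed_dims, num_layers = 6, 2, 8, 3
layers = [lambda t: t + 1.0 for _ in range(num_layers)]  # stand-ins for encoder layers
query = np.zeros((num_query, batch_size, embed_dims))
print(run_encoder(query, layers).shape)                            # (2, 6, 8)
print(run_encoder(query, layers, return_intermediate=True).shape)  # (3, 2, 6, 8)
```

Stacking keeps the last entry identical to the non-intermediate output, so callers can apply a loss to every layer or only to the final one.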
- get_reference_points(bev_h, bev_w, dim, batch_size, device, dtype)
Get the reference points used in spatial cross-attention (SCA) and temporal self-attention (TSA).
- Parameters:
bev_h (int) – Height of the BEV feature map.
bev_w (int) – Width of the BEV feature map.
dim (int) – Dimension of the reference points.
batch_size (int) – Batch size.
device (torch.device) – The device where reference_points should be.
dtype (torch.dtype) – The dtype of reference_points.
- Returns:
- Reference points used in SCA and TSA, with shape (batch_size, num_keys, num_levels, dim).
- Return type:
Tensor
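For 2D points (dim set to 2, assumed here to select the points used by TSA), a NumPy sketch following the original BEVFormer scheme: one point per BEV cell, placed at the normalized cell center in (x, y) order, with a singleton level axis so the result matches the documented (batch_size, num_keys, num_levels, dim) layout. The helper name bev_reference_points_2d is hypothetical, and whether vis4d uses the same coordinate order is an assumption:

```python
import numpy as np


def bev_reference_points_2d(bev_h: int, bev_w: int, batch_size: int, dtype=np.float32) -> np.ndarray:
    """Normalized BEV cell centers with shape (batch_size, bev_h * bev_w, 1, 2)."""
    # Cell centers in grid units (0.5, 1.5, ..., size - 0.5), normalized to (0, 1).
    ys = np.linspace(0.5, bev_h - 0.5, bev_h, dtype=dtype) / bev_h
    xs = np.linspace(0.5, bev_w - 0.5, bev_w, dtype=dtype) / bev_w
    ref_y, ref_x = np.meshgrid(ys, xs, indexing="ij")  # both (bev_h, bev_w), row-major
    ref = np.stack((ref_x.reshape(-1), ref_y.reshape(-1)), axis=-1)  # (num_keys, 2) as (x, y)
    # Add the batch and level axes, then copy so the result is writable.
    return np.broadcast_to(ref[None, :, None, :], (batch_size, bev_h * bev_w, 1, 2)).copy()


ref = bev_reference_points_2d(bev_h=2, bev_w=4, batch_size=3)
print(ref.shape)      # (3, 8, 1, 2)
print(ref[0, 0, 0])   # [0.125 0.25 ]
```

Key k corresponds to row k // bev_w and column k % bev_w, which is the same row-major order used when a (bev_h, bev_w) feature map is flattened into bev_h * bev_w queries.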
- class BEVFormerEncoderLayer(embed_dims=256, self_attn=None, cross_attn=None, feedforward_channels=512, drop_out=0.1)
BEVFormer encoder layer.
Init.
- forward(query, value, bev_pos, ref_2d, bev_h, bev_w, spatial_shapes, level_start_index, reference_points_img, bev_mask, prev_bev=None)
Forward function.
Applies the sublayers in the order self_attn -> norm -> cross_attn -> norm -> ffn -> norm.
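This order is a post-norm layout. A NumPy sketch, assuming (as in the original BEVFormer) that each sublayer's output is added back to its input through a residual connection before the norm; the attention and feed-forward callables are hypothetical stand-ins that ignore prev_bev, value and the reference points, and the norm here has no learnable scale or bias:

```python
from typing import Callable

import numpy as np

Sublayer = Callable[[np.ndarray], np.ndarray]


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize over the last (embed_dims) axis; no learnable scale or bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder_layer(query: np.ndarray, self_attn: Sublayer, cross_attn: Sublayer, ffn: Sublayer) -> np.ndarray:
    x = layer_norm(query + self_attn(query))  # self_attn -> norm
    x = layer_norm(x + cross_attn(x))         # cross_attn -> norm
    x = layer_norm(x + ffn(x))                # ffn -> norm
    return x


rng = np.random.default_rng(0)
num_queries, batch_size, embed_dims = 5, 2, 8
query = rng.normal(size=(num_queries, batch_size, embed_dims))
w = rng.normal(size=(embed_dims, embed_dims))
out = encoder_layer(
    query,
    self_attn=lambda t: 0.1 * t,           # stand-in for temporal self-attention
    cross_attn=lambda t: 0.1 * t,          # stand-in for spatial cross-attention
    ffn=lambda t: np.maximum(t @ w, 0.0),  # stand-in for the feed-forward network
)
print(out.shape)  # (5, 2, 8)
```

Because the norm comes last, every layer hands a normalized tensor to the next one, which is why the final encoder output needs no extra norm in this layout.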
- Returns:
- Forwarded results with shape [num_queries, batch_size, embed_dims].
- Return type:
Tensor
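The reference_points_img and bev_mask arguments presumably come from projecting the 3D pillar points into each camera using cam_intrinsics, cam_extrinsics and lidar_extrinsics. A single-camera NumPy sketch of the projection used by the original BEVFormer, assuming both extrinsics are 4x4 sensor-to-world transforms and that a point counts as visible only if it lies in front of the camera and strictly inside the image; project_to_image is a hypothetical helper whose conventions may differ from vis4d's:

```python
from typing import Tuple

import numpy as np

EPS = 1e-5  # points with depth <= EPS are treated as behind the camera


def project_to_image(
    points_lidar: np.ndarray,
    cam_intrinsic: np.ndarray,
    cam_extrinsic: np.ndarray,
    lidar_extrinsic: np.ndarray,
    images_hw: Tuple[int, int],
) -> Tuple[np.ndarray, np.ndarray]:
    """Project (N, 3) LiDAR-frame points into one camera.

    Assumes both extrinsics are 4x4 sensor-to-world transforms. Returns image
    coordinates normalized by (width, height) with shape (N, 2) and a boolean mask (N,).
    """
    lidar_to_cam = np.linalg.inv(cam_extrinsic) @ lidar_extrinsic
    homo = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)  # (N, 4)
    pts_cam = (homo @ lidar_to_cam.T)[:, :3]  # (N, 3) in the camera frame
    uvz = pts_cam @ cam_intrinsic.T           # (N, 3) = (u*z, v*z, z)
    depth = uvz[:, 2:3]
    uv = uvz[:, :2] / np.maximum(depth, EPS)  # perspective divide, guarded against z <= 0
    h, w = images_hw
    uv_norm = uv / np.array([w, h], dtype=uv.dtype)
    u, v = uv_norm[:, 0], uv_norm[:, 1]
    mask = (depth[:, 0] > EPS) & (u > 0.0) & (u < 1.0) & (v > 0.0) & (v < 1.0)
    return uv_norm, mask


K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
identity = np.eye(4)  # toy setup: LiDAR, camera and world frames coincide
points = np.array([[0.0, 0.0, 10.0], [0.0, 0.0, -5.0], [10.0, 0.0, 1.0]])
uv, mask = project_to_image(points, K, identity, identity, images_hw=(80, 100))
print(mask)   # [ True False False]
print(uv[0])  # [0.5 0.5]
```

The second point is rejected because it lies behind the camera and the third because it projects outside the image; in the batched module the same computation would run over every camera and batch element.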