vis4d.op.base

Base model module.

class BaseModel(*args, **kwargs)[source]

Abstract base model for feature extraction.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

__call__(images)[source]

Type definition for call implementation.

Parameters:

images (torch.Tensor) – Image input to process.

Returns:

The output feature pyramid.

Return type:

list[torch.Tensor]

abstract forward(images)[source]

Base model forward.

Parameters:

images (Tensor[N, C, H, W]) – Image input to process. Expected to be of type float32.

Raises:

NotImplementedError – This is an abstract class method.

Returns:

The output feature pyramid. The list index represents the level, which in most cases has a downsampling ratio of 2^index. fp[2] is the C2 or P2 in the FPN paper (https://arxiv.org/abs/1612.03144). fp[0] is the original image or a feature map with the same resolution. fp[1] may be a copy of the input image if the network does not generate a feature map at that resolution.

Return type:

fp (list[torch.Tensor])
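
The level convention above can be illustrated with a little arithmetic. The following is only a sketch: the helper name expected_pyramid_sizes is hypothetical (it is not part of vis4d), and it assumes that level i is downsampled by exactly 2^i with floored sizes, which real backbones may not follow exactly because of padding.

```python
from __future__ import annotations


def expected_pyramid_sizes(
    height: int, width: int, num_levels: int
) -> list[tuple[int, int]]:
    """Return the nominal (H, W) of each pyramid level.

    Assumption: level i has a downsampling ratio of exactly 2**i and
    spatial sizes are floored.
    """
    return [
        (height // 2**level, width // 2**level) for level in range(num_levels)
    ]


sizes = expected_pyramid_sizes(416, 416, num_levels=6)
for level, (h, w) in enumerate(sizes):
    print(f"fp[{level}]: ratio {2**level}, size {h}x{w}")
```

Under these assumptions, fp[0] has the input resolution and fp[2] is one quarter of it in each spatial dimension.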

abstract property out_channels: list[int]

Get the number of channels for each level of feature pyramid.

Raises:

NotImplementedError – This is an abstract class method.

Returns:

Number of channels.

Return type:

list[int]
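
Since out_channels is indexed by pyramid level, one way to sanity-check a model is to compare it against the shapes returned by forward. The sketch below works on plain shape tuples so that it does not depend on any library; channels_match is a hypothetical helper, and the shapes shown are made-up values rather than output of a real model.

```python
from __future__ import annotations


def channels_match(
    out_channels: list[int], shapes: list[tuple[int, ...]]
) -> bool:
    """Check that each level's channel dim (index 1 of N, C, H, W) matches."""
    if len(out_channels) != len(shapes):
        return False
    return all(shape[1] == ch for ch, shape in zip(out_channels, shapes))


# Illustrative shapes for a three-level pyramid. In practice they could be
# collected with: [tuple(t.shape) for t in model(images)]
shapes = [(1, 3, 64, 64), (1, 8, 32, 32), (1, 16, 16, 16)]
print(channels_match([3, 8, 16], shapes))
print(channels_match([3, 8, 32], shapes))
```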

class CSPDarknet(arch='P5', deepen_factor=1.0, widen_factor=1.0, out_indices=(2, 3, 4), frozen_stages=-1, arch_ovewrite=None, spp_kernal_sizes=(5, 9, 13), norm_eval=False)[source]

CSP-Darknet backbone used in YOLOv5 and YOLOX.

Parameters:
  • arch (str) – Architecture of CSP-Darknet, from {P5, P6}. Default: P5.

  • deepen_factor (float) – Depth multiplier, multiply number of blocks in CSP layer by this amount. Default: 1.0.

  • widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.

  • out_indices (Sequence[int]) – Output from which stages. Default: (2, 3, 4).

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

  • use_depthwise (bool) – Whether to use depthwise separable convolution. Default: False.

  • arch_ovewrite (list[list[int]], optional) – Overwrite default arch settings. Defaults to None.

  • spp_kernal_sizes (Sequence[int]) – Kernel sizes of the SPP layers. Default: (5, 9, 13).

  • norm_eval (bool) – Whether to set norm layers to eval mode, that is, freeze the running stats (mean and var). Note: This only affects Batch Norm and its variants. Default: False.
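
To make the two multipliers concrete, here is a rough sketch of how deepen_factor and widen_factor could scale a stage. The helper names are hypothetical, and the rounding rules (truncating channels, rounding block counts with a minimum of one) are assumptions rather than something this page specifies. The base widths 256, 512 and 1024 are the channel counts of the default outputs at widen_factor=1.0; the block counts are illustrative values only.

```python
def scale_channels(base_channels: int, widen_factor: float) -> int:
    """Scale a channel count by the width multiplier (assumed: truncation)."""
    return int(base_channels * widen_factor)


def scale_blocks(base_blocks: int, deepen_factor: float) -> int:
    """Scale a block count by the depth multiplier (assumed: round, min 1)."""
    return max(round(base_blocks * deepen_factor), 1)


# Output widths of the default configuration, scaled by widen_factor=0.5.
for base in (256, 512, 1024):
    print(base, "->", scale_channels(base, 0.5))

# Illustrative block counts (not taken from this page), deepen_factor=0.33.
for blocks in (3, 9):
    print(blocks, "->", scale_blocks(blocks, 0.33))
```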

Example

>>> import torch
>>> from vis4d.op.base import CSPDarknet
>>> model = CSPDarknet()
>>> _ = model.eval()
>>> inputs = torch.rand(1, 3, 416, 416)
>>> level_outputs = model(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
...
(1, 256, 52, 52)
(1, 512, 26, 26)
(1, 1024, 13, 13)
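
The spatial sizes in the output above follow from the stage strides. As a sketch, and assuming the relation stride = 2^(index + 1) that can be read off this example for the default P5 architecture (it is not stated elsewhere on this page), the sizes can be reproduced without running the model. The helper stage_stride is hypothetical.

```python
def stage_stride(stage_index: int) -> int:
    """Stride inferred from the example: stage 2 -> 8, 3 -> 16, 4 -> 32."""
    return 2 ** (stage_index + 1)


input_size = 416
out_indices = (2, 3, 4)
sizes = [input_size // stage_stride(i) for i in out_indices]
print(sizes)  # [52, 26, 13]
```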

Init.

forward(images)[source]

Forward pass.

Parameters:

images (torch.Tensor) – Input images.

Return type:

list[Tensor]

train(mode=True)[source]

Override the train mode for the model.

Parameters:

mode (bool) – Whether to set training mode to True.

Return type:

CSPDarknet

class DLA(name=None, levels=(1, 1, 1, 2, 2, 1), channels=(16, 32, 64, 128, 256, 512), block='BasicBlock', residual_root=False, cardinality=32, weights=None, style='imagenet')[source]

DLA base model.

Creates an instance of the class.

forward(images)[source]

DLA forward.

Parameters:

images (Tensor[N, C, H, W]) – Image input to process. Expected to be of type float32 with values ranging from 0 to 255.

Returns:

The output feature pyramid. The list index represents the level, which has a downsampling ratio of 2^index. fp[0] is a feature map at the input image resolution rather than the original image itself.

Return type:

fp (list[Tensor])

load_pretrained_model(weights)[source]

Load pretrained weights.

Return type:

None

property out_channels: list[int]

Get the number of channels for each level of feature pyramid.

Returns:

Number of channels.

Return type:

list[int]

class ResNet(resnet_name, in_channels=3, stem_channels=None, base_channels=64, num_stages=4, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', deep_stem=False, avg_down=False, trainable_layers=5, norm='BatchNorm2d', norm_frozen=True, stages_with_dcn=(False, False, False, False), replace_stride_with_dilation=(False, False, False), use_checkpoint=False, zero_init_residual=True, pretrained=False, weights=None)[source]

ResNet BaseModel.

Create ResNet.

Parameters:
  • resnet_name (str) – Name of the ResNet variant.

  • in_channels (int) – Number of input image channels. Default: 3.

  • stem_channels (int | None) – Number of stem channels. If not specified, it will be the same as base_channels. Default: None.

  • base_channels (int) – Number of base channels of res layer. Default: 64.

  • num_stages (int) – Number of ResNet stages. Default: 4.

  • strides (Sequence[int]) – Strides of the first block of each stage. Default: (1, 2, 2, 2).

  • dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1)

  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.

  • deep_stem (bool) – Replace 7x7 conv in input stem with 3 3x3 conv. Default: False.

  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False.

  • trainable_layers (int, optional) – Number of layers for training or fine-tuning. 5 means all the layers can be fine-tuned. Defaults to 5.

  • norm (str) – Normalization layer str. Default: BatchNorm2d, which means using nn.BatchNorm2d.

  • norm_frozen (bool) – Whether to set norm layers to eval mode. It freezes the running stats (mean and var). Note: This only affects Batch Norm and its variants. Default: True.

  • stages_with_dcn (Sequence[bool]) – Whether each stage uses deformable convolutions. Default: (False, False, False, False).

  • replace_stride_with_dilation (Sequence[bool]) – Whether to replace stride with dilation. Default: (False, False, False).

  • use_checkpoint (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity. Default: True.

  • pretrained (bool) – Whether to load pretrained weights. Default: False.

  • weights (str, optional) – Path to the pretrained model weights. Default: None.
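
The strides argument determines how much each stage downsamples relative to the input. The sketch below assumes a stem that downsamples by 4 before the first stage, as the forward description says of torchvision ResNets. The helper cumulative_strides and the constant STEM_STRIDE are hypothetical names, and the effect of replace_stride_with_dilation is not modelled.

```python
from itertools import accumulate
from operator import mul

# Assumption: the stem reduces the resolution by 4 before the first stage.
STEM_STRIDE = 4


def cumulative_strides(strides, stem_stride=STEM_STRIDE):
    """Return the total downsampling ratio at the output of each stage."""
    return [stem_stride * s for s in accumulate(strides, mul)]


print(cumulative_strides((1, 2, 2, 2)))  # [4, 8, 16, 32]
```

Replacing a stride of 2 with 1 in a later stage keeps that stage at the resolution of the previous one, for example (1, 2, 1, 1) yields ratios of 4, 8, 8 and 8.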

forward(images)[source]

Forward function.

Parameters:

images (Tensor[N, C, H, W]) – Image input to process. Expected to be of type float32 with values ranging from 0 to 255.

Returns:

The output feature pyramid. The list index represents the level, which has a downsampling ratio of 2^index. fp[0] and fp[1] are references to the input images, because the torchvision ResNet downsamples the feature maps by 4 directly. The last feature map downsamples the input image by 64 with a pooling layer on the second-to-last map.

Return type:

fp (list[torch.Tensor])

train(mode=True)[source]

Override the train mode for the model.

Return type:

ResNet

property out_channels: list[int]

Get the number of channels for each level of feature pyramid.

Returns:

Number of channels.

Return type:

list[int]

class ResNetV1c(resnet_name, pretrained=False, weights=None, **kwargs)[source]

ResNetV1c variant with a deeper stem.

Compared with the default ResNet, ResNetV1c replaces the 7x7 conv in the input stem with three 3x3 convs. For more details, please refer to "Bag of Tricks for Image Classification with Convolutional Neural Networks" (https://arxiv.org/abs/1812.01187).

Initialize ResNetV1c.

Parameters:
  • resnet_name (str) – Name of the resnet model.

  • pretrained (bool, optional) – Whether to load ImageNet pre-trained weights. Defaults to False.

  • weights (str, optional) – Path to custom pretrained weights.

  • **kwargs (Any) – Arguments for ResNet.

Modules

vis4d.op.base.base

Base model interface.

vis4d.op.base.csp_darknet

CSP-Darknet base network used in YOLOX.

vis4d.op.base.dla

DLA base model.

vis4d.op.base.pointnet

Operations for PointNet.

vis4d.op.base.pointnetpp

PointNet++ implementation.

vis4d.op.base.resnet

Residual networks base model.

vis4d.op.base.unet

U-Net implementation based on https://arxiv.org/abs/1505.04597.

vis4d.op.base.vgg

VGG networks for classification.

vis4d.op.base.vit

Vision Transformer (ViT) networks for classification.