code_documentation.md
April 26, 2023 ยท View on GitHub
Our dataloaders and data-format
All our dataloaders inherit from source/datasets/base.py. Each dataloader reads images and
pose informations, stored in format specific to each dataset. The __get_item__ method creates
a dictionary for each image index containing the following elements:
idx: the index of the imagergb_path: the path to the RGB image. Will be used to save the renderings with a name.image: the corresponding image, a torch Tensor of shape [3, H, W]. The RGB values are normalized to [0, 1] (not [0, 255]).intr: intrinsics parameters, numpy array of shape [3, 3]pose: world-to-camera transformation matrix, numpy array of shaoe [3, 4]depth_range: depth_range, numpy array of shape [1, 2]scene: string, scene name
Optionally, when the depth or a foreground mask are available:
depth_gt: ground-truth depth map, numpy array of shape [H, W]valid_depth_gt: mask indicating where the depth map is valid, bool numpy array of shape [H, W]fg_mask: foreground segmentation mask, bool numpy array of shape [1, H, W]
In the code base, after creating the dataloader, we call prefetch_all_data, which will batch
all data, and store the corresponding batch in train_data.all.
Images
image = [N, 3, height, width] torch.Tensor of RGB images. Currently we
require all images to have the same resolution.
Extrinsic camera poses
pose = [N, 3, 4] numpy array of extrinsic pose matrices. Should be in world-to-camera format.
pose[i] should be in world-to-camera format, such that we can run
pose_w2c = pose[i]
x_camera = pose_w2c[:3, :3] @ x_world + pose_w2c[:3, 3:4]
to convert a 3D world space point x_world into a camera space point x_camera.
These matrices must be stored in the OpenCV coordinate system convention for camera rotation: x-axis to the right, y-axis down, and z-axis forwards.
The most common conventions are
[right, up, backwards]: OpenGL, NeRF, most graphics code.[right, down, forwards]: OpenCV, COLMAP, most computer vision code.
Fortunately switching from OpenCV/COLMAP to NeRF and inversely is
simple:
you just need to right-multiply the OpenCV pose matrices by np.diag([1, -1, -1, 1]),
which will flip the sign of the y-axis (from down to up) and z-axis (from
forwards to backwards):
camtoworlds_opengl = camtoworlds_opencv @ np.diag([1, -1, -1, 1])
Importantly, you want your scene 3D points to be more or less centered around 0 in the world coordinate space.
In 360 degrees inwards scene, this is ensured by centering the camera-to-world poses around 0.
In scenes like Replica where the cameras are outwards, this is done by substracting the center point of the 3D scene to the camera poses. Check the dataloader for more details.
Intrinsic camera poses
intr= [N, 3, 3] numpy array of intrinsic matrices. These should be in
OpenCV format, e.g.
intr = np.array([
[focal, 0, width/2],
[ 0, focal, height/2],
[ 0, 0, 1],
])
Existing data loaders
We have implemented different dataloaders: