Fast Single-View NeRF with Meta-Learning

In my last post, I described why Neural Radiance Fields (NeRF) are still worth caring about, even in the era of 3D Gaussian Splatting (3DGS).

In this post, we are going to dive into single-view 3D reconstruction / novel-view synthesis. More specifically: given just one image of an object, can we quickly adapt a neural field so that it renders plausible views from new camera angles?

There are many ways to do that. In an upcoming blog, I will show how to directly regress 3DGS parameters from a single image. But here, we will take a different route:

meta-learning.

Instead of training a NeRF from scratch for every object, we train a model that learns a good initialization across many objects. Then, at test time, we only need a small amount of gradient descent on one input view to specialize the model to a new instance.

This idea comes from the paper:

Learned Initializations for Optimizing Coordinate-Based Neural Representations

and the implementation below is a compact PyTorch version of that idea, applied to NeRF-style rendering.

Why this works

Standard NeRF fitting is slow because every new scene starts from random weights. The network has to discover geometry, appearance, and density structure from scratch.

Meta-learning changes the question.

Instead of asking:

How do we optimize a NeRF for one object?

we ask:

Can we learn initial parameters that are already close to a good solution for many objects of the same category?

If the answer is yes, then adapting to a new object becomes much easier. A few gradient steps from a learned initialization can be enough to recover a useful scene representation.

That is exactly the setup in this code: meta-train across many car scenes, then adapt to a new car with a handful of gradient steps. This is classic few-shot adaptation, applied to coordinate-based neural representations.

1. The big picture

Our pipeline has three moving parts:

A NeRF-style renderer

Given ray origins and ray directions, we sample 3D points along the rays, query an MLP for color and density, and integrate the result with volume rendering.

A task distribution

Each task corresponds to one scene or object. In this example, the dataset contains multiple car instances, each with multiple views and camera poses.

A meta-learning algorithm

We use Reptile, a first-order meta-learning method. For each sampled scene:

  1. copy the meta-model
  2. optimize it for a few steps on that scene
  3. move the meta-model parameters toward the adapted weights

After repeating this over many scenes, the meta-model becomes a strong initialization for unseen objects.

2. A brief overview of the paper

The paper’s central idea is simple:

coordinate-based neural representations like NeRF, SIREN, or occupancy networks are powerful, but expensive to optimize from scratch. If scenes come from a shared distribution, then their optimal parameters should have some common structure. So instead of random initialization, we should learn the initialization itself.

In practice, the paper shows that a good initialization can dramatically speed up convergence and improve reconstruction quality when only a few observations are available.

The implementation here follows that philosophy with a particularly clean recipe: one Reptile outer loop, one plain SGD inner loop, and a compact NeRF MLP.

This is a nice example because it shows that meta-learning is not some abstract training trick. It directly changes the user experience: instead of waiting for a full scene optimization, you get a model that is already “primed” to reconstruct cars.

3. The model: a tiny NeRF MLP

Let’s start with the core network:

class NerfModel(nn.Module):
    def __init__(self, hidden_dim=128):
        super(NerfModel, self).__init__()

        self.net = nn.Sequential(nn.Linear(120, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 4))

    def forward(self, o):
        emb_x = torch.cat([torch.cat([torch.sin(o * (2 ** i)), torch.cos(o * (2 ** i))],
                                     dim=-1) for i in torch.linspace(0, 8, 20)], dim=-1)
        h = self.net(emb_x)
        c, sigma = torch.sigmoid(h[:, :3]), torch.relu(h[:, -1])
        return c, sigma

This is a standard coordinate MLP with positional encoding.

The input o is a batch of 3D points of shape [N, 3]. These points are not pixels. They are sampled positions in 3D space along camera rays.

The model first applies a Fourier-style positional encoding:

emb_x = torch.cat([torch.cat([torch.sin(o * (2 ** i)), torch.cos(o * (2 ** i))],
                             dim=-1) for i in torch.linspace(0, 8, 20)], dim=-1)

For each 3D coordinate, we compute multiple sine and cosine functions at increasing frequencies. This is crucial, because raw MLPs tend to learn low-frequency functions first. Without positional encoding, the network would struggle to represent sharp geometry and fine appearance details.
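As a quick sanity check (a standalone sketch, not part of the original code), the encoding width works out to 20 frequencies × 2 functions (sin, cos) × 3 coordinates = 120:

```python
import torch

# A batch of 5 hypothetical sample points in 3D.
o = torch.randn(5, 3)

# Same encoding as in the model: 20 frequencies between 2**0 and 2**8.
emb = torch.cat([torch.cat([torch.sin(o * (2 ** i)), torch.cos(o * (2 ** i))],
                           dim=-1) for i in torch.linspace(0, 8, 20)], dim=-1)
print(emb.shape)  # torch.Size([5, 120]) -> 20 frequencies * 2 functions * 3 coords
```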

Why is the first layer Linear(120, hidden_dim)? Because the positional encoding expands each 3D point into 20 frequencies × 2 functions (sin and cos) × 3 coordinates = 120 features.

The output has 4 channels: three for RGB color and one for volume density. Color is squashed with a sigmoid to stay in [0, 1], while density goes through a ReLU to remain non-negative.

This is slightly simpler than the original NeRF, which also conditions color on viewing direction. Here, the model depends only on 3D position, which is enough for object-centric scenes with mostly view-consistent appearance.

4. Volume rendering in a few lines

The renderer is where the neural field turns into an image.

def compute_accumulated_transmittance(alphas):
    accumulated_transmittance = torch.cumprod(alphas, 1)
    return torch.cat(
        (torch.ones((accumulated_transmittance.shape[0], 1), device=alphas.device),
         accumulated_transmittance[:, :-1]), dim=-1)

This helper computes the amount of light that survives as we march along the ray.

If alpha_i is the opacity at sample i, then the transmittance up to that sample is the product of all previous (1 - alpha_j) terms. Intuitively, a sample sitting deep behind dense material receives almost no light, so it should contribute almost nothing to the pixel. That is exactly what transmittance models.
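A tiny numeric example (illustrative values, not from the dataset) makes the cumprod-and-shift pattern concrete:

```python
import torch

# Three samples along one ray with opacities 0.5, 0.5, 0.9.
alpha = torch.tensor([[0.5, 0.5, 0.9]])
one_minus = 1 - alpha                      # fraction of light surviving each sample
trans = torch.cumprod(one_minus, dim=1)    # [[0.5, 0.25, 0.025]]

# Shift right: sample i sees the product of all *previous* (1 - alpha) terms,
# and the first sample sees full light (transmittance 1).
T = torch.cat((torch.ones(1, 1), trans[:, :-1]), dim=-1)
print(T)  # tensor([[1.0000, 0.5000, 0.2500]])
```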

Now the main rendering function:

def render_rays(nerf_model, ray_origins, ray_directions, hn=0, hf=0.5, nb_bins=192):
    device = ray_origins.device
    t = torch.linspace(hn, hf, nb_bins, device=device).expand(ray_origins.shape[0], nb_bins)

For each ray, we sample nb_bins depth values between near plane hn and far plane hf.

Then we perturb the samples:

mid = (t[:, :-1] + t[:, 1:]) / 2.
lower = torch.cat((t[:, :1], mid), -1)
upper = torch.cat((mid, t[:, -1:]), -1)
u = torch.rand(t.shape, device=device)
t = lower + (upper - lower) * u

This is stratified sampling. Instead of always evaluating at fixed depths, we sample randomly inside each interval. That helps reduce aliasing and improves training stability.
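Here is a minimal sketch of the jittering logic, using hypothetical near/far values and a single ray:

```python
import torch

torch.manual_seed(0)
# Four depth bins between near = 2.0 and far = 6.0 for one ray.
t = torch.linspace(2.0, 6.0, 4).unsqueeze(0)

mid = (t[:, :-1] + t[:, 1:]) / 2.
lower = torch.cat((t[:, :1], mid), -1)
upper = torch.cat((mid, t[:, -1:]), -1)
u = torch.rand(t.shape)
t_jittered = lower + (upper - lower) * u

# Each jittered sample stays inside its own stratum.
assert ((t_jittered >= lower) & (t_jittered <= upper)).all()
```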

Next we compute step sizes:

delta = torch.cat((t[:, 1:] - t[:, :-1], torch.tensor([1e10], device=device).expand(
                       ray_origins.shape[0], 1)), -1)

Each delta is the distance between adjacent samples. The last sample gets a huge value so that its opacity behaves like the terminal bin.

Now we lift ray parameters into 3D points:

x = ray_origins.unsqueeze(1) + t.unsqueeze(2) * ray_directions.unsqueeze(1)

Shape-wise, ray_origins.unsqueeze(1) is [B, 1, 3], t.unsqueeze(2) is [B, nb_bins, 1], and ray_directions.unsqueeze(1) is [B, 1, 3]. After broadcasting, x becomes [B, nb_bins, 3], meaning: for each ray, a set of 3D points along its path.
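The broadcasting can be checked in isolation with dummy tensors:

```python
import torch

B, nb_bins = 4, 8
ray_origins = torch.randn(B, 3)
ray_directions = torch.randn(B, 3)
t = torch.rand(B, nb_bins)

# [B, 1, 3] + [B, nb_bins, 1] * [B, 1, 3] broadcasts to [B, nb_bins, 3].
x = ray_origins.unsqueeze(1) + t.unsqueeze(2) * ray_directions.unsqueeze(1)
print(x.shape)  # torch.Size([4, 8, 3])
```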

We flatten those points and query the NeRF:

colors, sigma = nerf_model(x.reshape(-1, 3))
colors = colors.reshape(x.shape)
sigma = sigma.reshape(x.shape[:-1])

Then comes the standard NeRF opacity equation:

alpha = 1 - torch.exp(-sigma * delta)

High density over a long interval means high opacity.

The weights are then computed as:

weights = compute_accumulated_transmittance(1 - alpha).unsqueeze(2) * alpha.unsqueeze(2)

Each sample contributes in proportion to its own opacity, scaled by how much light actually reaches it.

Finally, pixel color is the weighted average of sample colors:

c = (weights * colors).sum(dim=1)
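To see how opacity and transmittance combine, here is a small standalone example with made-up opacities for two rays:

```python
import torch

# Opacities for two rays, three samples each.
alpha = torch.tensor([[0.8, 0.5, 0.3],    # ray hitting dense content early
                      [0.0, 0.0, 0.0]])   # ray through empty space

# Transmittance: shifted cumulative product of (1 - alpha).
T = torch.cat((torch.ones(2, 1), torch.cumprod(1 - alpha, 1)[:, :-1]), dim=-1)
weights = T * alpha
print(weights.sum(-1))  # tensor([0.9300, 0.0000]) -- total opacity per ray, at most 1
```

The first ray accumulates most of its mass at the first dense sample; the empty ray accumulates nothing, which is exactly the case the white-background correction below handles.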

And this line handles the white background:

weight_sum = weights.sum(-1).sum(-1)
return c + 1 - weight_sum.unsqueeze(-1)

If the ray does not fully hit dense content, the missing mass is assigned to white. This matters because the dataset uses RGBA images composited over white.
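A quick check of the background logic, using dummy values for a ray that hits nothing:

```python
import torch

weights = torch.zeros(1, 4, 1)          # a ray that accumulated no opacity
colors = torch.rand(1, 4, 3)
c = (weights * colors).sum(dim=1)       # zero contribution from samples
weight_sum = weights.sum(-1).sum(-1)    # total opacity = 0
out = c + 1 - weight_sum.unsqueeze(-1)
print(out)  # tensor([[1., 1., 1.]]) -- the missing mass renders as pure white
```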

5. Loading a multi-view object dataset

Now let’s look at the data pipeline.

def load_data(data_path, json_path, train=True, N=25, H=128, W=128):

The dataset is organized into object-centric scenes. Each scene has multiple rendered views, one camera-to-world pose matrix per view, and a shared horizontal field of view stored as camera_angle_x.

The split file determines which scenes belong to training or testing:

with open(json_path, "r") as f:
    data = json.load(f)

scenes = [data_path + f for f in sorted(data['train' if train else 'test'])]

Inside each scene, the function reads the camera metadata and loads up to N views.

For each view, it loads:

img = torch.from_numpy(imread(
    scene_path + f"/{view['file_path'].split('/')[-1]}.png") / 255.)
c2w = torch.tensor(view["transform_matrix"])

focal_length = W / 2. / torch.tan(
    torch.tensor(data["camera_angle_x"]) / 2.)

Then it constructs ray directions for all pixels.

u, v = torch.meshgrid(torch.arange(W), torch.arange(H), indexing='ij')
dirs = torch.stack((v - W / 2, -(u - H / 2),
                    - torch.ones_like(u) * focal_length), dim=-1)
dirs = (c2w[:3, :3] @ dirs[..., None]).squeeze(-1)

This creates camera-space rays and rotates them into world space using the camera pose.

Then we normalize the directions and assign a ray origin equal to the camera center:

scene_rays_d[view_idx] = dirs / torch.linalg.norm(dirs, dim=-1, keepdim=True)
scene_rays_o[view_idx] = torch.zeros_like(scene_rays_d[view_idx]) + c2w[:3, 3]

Finally, the RGBA image is composited onto a white background:

scene_gt_pixels[view_idx] = img[..., :3] * img[..., -1:] + 1 - img[..., -1:]

This matches the white-background correction in the renderer.

At the end, we store three tensors per scene: ray origins, ray directions, and ground-truth pixel colors, each of shape [N, H, W, 3].

This means every scene becomes a giant collection of ray/color supervision pairs.

6. What is a task?

The meta-learning setup hinges on the definition of a task.

@torch.no_grad()
def sample_task(rays_o, rays_d, gt_pixels):
    scene_idx = torch.randint(0, len(rays_o), (1,))
    o, d, gt = rays_o[scene_idx], rays_d[scene_idx], gt_pixels[scene_idx]
    return torch.cat([o.reshape(-1, 3), d.reshape(-1, 3), gt.reshape(-1, 3)], dim=-1)

A task here is simply:

sample one scene, then flatten all rays from all its views into a training set

Each row contains nine values: the ray origin (3), the ray direction (3), and the ground-truth RGB color (3).

So the meta-learner does not optimize on arbitrary mini-batches from the whole dataset. It optimizes on one scene at a time. That is important.

Why?

Because meta-learning needs a task distribution. In this case, each object instance is one task. The model is being trained to adapt quickly to a new object drawn from the same distribution of cars.

7. Inner-loop adaptation

The function below performs task-specific optimization:

def perform_k_training_steps(nerf_model, task, k, optimizer, batch_size=128,
                             device='cpu', hn=2., hf=6., nb_bins=128):
    for _ in range(k):
        indices = torch.randint(task.shape[0], size=[batch_size])
        batch = task[indices]
        ray_origins = batch[:, :3].to(device)
        ray_directions = batch[:, 3:6].to(device)
        ground_truth_px_values = batch[:, 6:].to(device)

        regenerated_px_values = render_rays(nerf_model, ray_origins, ray_directions,
                                            hn=hn, hf=hf, nb_bins=nb_bins)
        loss = nn.functional.mse_loss(ground_truth_px_values, regenerated_px_values)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return nerf_model.parameters()

This is the adaptation loop used both during meta-training and testing.

For k steps:

  1. sample a random batch of rays from the current task
  2. render predicted RGB values
  3. compare with ground truth using MSE
  4. update the model

This is exactly how a NeRF is usually trained, except here it happens inside a meta-learning framework.

A small but important observation: the optimization target is not the full scene loss. We only sample random rays at each step. That keeps training memory-friendly and makes adaptation stochastic, which is standard in NeRF training.

8. Reptile: learning the initialization

Now the interesting part.

def reptile(meta_model, meta_optim, nb_iterations: int, device: str, sample_task: Callable,
            perform_k_training_steps: Callable, k=32):

    for epoch in tqdm(range(nb_iterations)):
        task = sample_task()
        nerf_model = copy.deepcopy(meta_model)
        optimizer = torch.optim.SGD(nerf_model.parameters(), 0.5)
        phi_tilde = perform_k_training_steps(nerf_model, task, k, optimizer, device=device)

        # Update phi
        meta_optim.zero_grad()
        with torch.no_grad():
            for p, g in zip(meta_model.parameters(), phi_tilde):
                p.grad = p - g
        meta_optim.step()

This is Reptile in a few lines.

Let phi be the meta-model parameters.

For each iteration:

Step 1: sample a task

We pick one scene.

Step 2: clone the meta-model

We create a fresh task-specific model initialized from phi.

Step 3: adapt it on the task

We run k steps of SGD, producing adapted parameters phi_tilde.

Step 4: move the meta-model toward phi_tilde

This is the core idea. If a few steps on a task consistently move parameters in a useful direction, then the initialization should drift toward a place from which such adaptation is easy.

The update here is implemented by setting:

p.grad = p - g

and then taking an optimizer step on the meta-model. This approximates:

phi <- phi + epsilon * (phi_tilde - phi)

which is the classic Reptile update.
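With plain SGD as the meta-optimizer and learning rate epsilon, the trick reproduces that interpolation exactly. A one-parameter sketch with toy values:

```python
import torch

phi = torch.nn.Parameter(torch.tensor([1.0]))   # meta-parameter
phi_tilde = torch.tensor([3.0])                  # pretend inner-loop result
eps = 0.1

meta_optim = torch.optim.SGD([phi], lr=eps)
meta_optim.zero_grad()
with torch.no_grad():
    phi.grad = phi - phi_tilde                   # "gradient" = phi - phi_tilde

# SGD step: phi <- phi - eps * (phi - phi_tilde) = phi + eps * (phi_tilde - phi)
meta_optim.step()
print(phi.data)  # tensor([1.2000])
```

With Adam in the outer loop, as in this script, the update direction is the same but the step size is adapted per parameter.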

What makes Reptile appealing is that it is first-order. Unlike MAML, it does not require differentiating through the inner optimization steps. That keeps the implementation simple and the memory cost manageable.

For coordinate-based representations like NeRF, where inner-loop optimization can already be expensive, this simplicity is a big win.

9. Meta-training

The main script starts by constructing the meta-model and optimizer:

device = 'cuda'
meta_model = NerfModel(hidden_dim=256).to(device)
meta_optim = torch.optim.Adam(meta_model.parameters(), lr=5e-5)

Then it loads the training scenes:

rays_o, rays_d, gt_pixels = load_data("data/cars/", "data/car_splits.json",
                                      train=True, N=25, H=128, W=128)

And launches Reptile:

reptile(meta_model, meta_optim, 100_000, device,
        lambda: sample_task(rays_o, rays_d, gt_pixels), perform_k_training_steps, 32)

So meta-training consists of 100,000 Reptile iterations, each of which samples one car scene, adapts a copy of the meta-model for 32 inner SGD steps, and nudges the meta-parameters toward the adapted weights.

By the end, meta_model is no longer a generic NeRF with random weights. It is a car-aware initialization that already encodes category-level priors: the typical shape, scale, and appearance statistics of cars.

That prior is exactly what enables fast single-view adaptation later.

10. Test-time adaptation from a single image

Now for the payoff.

rays_o, rays_d, gt_pixels = load_data("data/cars/", "data/car_splits.json",
                                      train=False, N=25, H=128, W=128)

We switch to unseen test scenes.

For each test object:

nerf_model = copy.deepcopy(meta_model)
optimizer = torch.optim.SGD(nerf_model.parameters(), 0.5)

Again, we start from the learned initialization, not from scratch.

Then comes the key detail:

test_data = torch.cat([rays_o[test_img][0].reshape(-1, 3),
                       rays_d[test_img][0].reshape(-1, 3),
                       gt_pixels[test_img][0].reshape(-1, 3)], dim=-1).to(device)

Only view 0 is used for adaptation.

That means the model is given just one observed image of the test object. From that single view, it runs:

perform_k_training_steps(nerf_model, test_data, 1000, optimizer,
                         batch_size=128, device=device)

After 1000 update steps on that one image, the adapted model is asked to render novel viewpoints:

img = render_rays(nerf_model,
                  rays_o[test_img][i].to(device).reshape(-1, 3),
                  rays_d[test_img][i].to(device).reshape(-1, 3),
                  hn=2., hf=6., nb_bins=128)

This is the fascinating part of the whole method.

The network never saw those target views during adaptation. It only saw one image. Any ability to render plausible unseen views comes from the category prior baked into the meta-learned initialization, plus the 3D consistency enforced by volume rendering.

So this is not just memorizing pixels. It is using a learned prior to infer hidden structure.

11. Why this is interesting

Single-view reconstruction is fundamentally ambiguous. A single image does not uniquely determine the full 3D shape. Many different 3D objects can produce the same photograph.

So the only way this can work is with priors.

This method bakes those priors into the initialization. The model has seen many cars before, so when it gets one new car image, it can quickly settle into a plausible explanation.

That makes this approach especially interesting for object-centric categories such as cars, chairs, faces, and other classes with strong shared structure.

In all of these cases, objects vary, but not arbitrarily. There is shared structure, and meta-learning exploits it.

12. A few implementation notes

This code is compact, but there are a few details worth highlighting.

No viewing-direction conditioning

The original NeRF predicts color as a function of both 3D position and viewing direction. This implementation only uses position, which simplifies the model and is often enough for category-level object datasets.

White background assumption

The renderer adds missing opacity mass back to white, and the ground-truth images are alpha-composited over white as well. These two choices are aligned.

Reptile uses SGD in the inner loop

This is common. The inner optimizer should be simple and aggressive, because it represents fast adaptation.

Adam is used in the outer loop

The meta-parameters are updated more carefully, which helps stabilize training over many outer-loop iterations.

Task = full object, not one image

During meta-training, a task includes rays from many views of a single scene. During testing, adaptation uses only one view. That gap is intentional: it teaches the initialization on richer supervision, but evaluates it in a harder low-shot setting.

13. What to take away

The main lesson of this implementation is not just that NeRF can be meta-learned.

It is that optimization itself can be learned.

Usually, we treat initialization as a boring detail. Random Gaussian weights, Xavier init, Kaiming init, done. But for coordinate-based representations, initialization determines whether adaptation takes minutes, seconds, or fails completely.

By learning the initialization over a distribution of related scenes, we turn NeRF from a per-instance optimizer into a category-aware reconstruction prior.

That is what makes single-view adaptation possible here.

14. Final thoughts

This example sits at an interesting intersection of three ideas: neural fields, differentiable volume rendering, and meta-learning.

It is not the only solution to the problem, and it is probably not the fastest one in practice today. But it is conceptually beautiful. Instead of hard-coding a 3D prior, we let the optimization process itself absorb it.

That makes this a great example of how learning-based 3D methods can go beyond scene fitting and start behaving more like inference systems.

In my next post, I will look at the same single-image setting from the other side: instead of optimizing a neural field from a learned initialization, we will try to directly regress 3DGS parameters from one image.

That should make for a nice comparison: optimize a representation versus predict a representation.
