
Disentangling Transformer Design Choices for Physical Trajectory Prediction

Abstract

Transformers are increasingly used as data-driven surrogates for physical trajectory prediction. We show that Attentional Neural Integral Equations (ANIE), block-causal video transformers (VT), and backpropagation-through-time (BPTT) are instances of a unified 2×2×3 design grid with three axes: (i) weight sharing across sequential blocks of layers, (ii) residuals across blocks, and (iii) how temporal dependencies are modeled (broadcasted / teacher forcing / recursive). This yields a single, flag-controlled implementation of all twelve configurations and enables controlled ablations. Across four PDE datasets (2D Navier-Stokes at two viscosities, 1D Burgers, and Shallow-Water), the dominant factor is the distributional shift between training and evaluation time: teacher forcing variants underperform. Within the non-teacher-forcing variants, sharing weights and omitting block-level residuals generally slightly or moderately help. Two combinations — (Broadcasted, Shared Weights, No Block Residuals) and (Recursive, Shared Weights, No Block Residuals) — are near-optimal across datasets, with worst-case normalized losses of 1.3071× and 1.2323× the per-dataset minima, and geometric-mean normalized losses of 1.1114× and 1.1458×, respectively. Our framing clarifies connections and differences between prior methods, identifies novel optimal configurations, and provides a compact codebase to compose and evaluate trajectory predictors.


Introduction

Transformers are increasingly used as data-driven surrogates for physical trajectory prediction. The Neural Integral Equations framework 1 introduced Attentional Neural Integral Equations (ANIE), an algorithm that parameterizes an integral operator with a transformer and then searches for a fixed point. Framed this way, ANIE appears to occupy a very different algorithmic space from more conventional transformer-based approaches.

In this work, we unify seemingly disparate approaches in a structured design space. We group transformer layers into sequential blocks. Then, we construct a 2×2×3 grid of architectural choices along three axes: (1) whether transformer blocks share weights, (2) whether residuals are added to each block, and (3) how temporal dependencies are modeled. Within this grid, ANIE and more conventional architectures emerge as specific points, which can be compared and recombined.

This framing lets us implement a suite of algorithms in a unified codebase, where each axis is exposed as a single flag and architectural differences correspond to small, transparent code changes. We can then perform controlled experiments across all twelve configurations, allowing us to isolate which design decisions consistently influence performance and to find novel combinations that outperform previous methods.

Across datasets, we find that the largest factor distinguishing configurations is whether there is a distributional shift in the inputs between training and evaluation. Within the subset of configurations without distributional shift, the impact of architectural choices is smaller, but we find that, overall, sharing weights between blocks and omitting residuals around each block lead to lower validation loss. Further, we find two novel combinations that perform nearly optimally on all datasets. Both combinations feature weight sharing, no residuals around each block, and no distributional shift.


Trajectory Prediction

Abstractly, the task is to learn a map from initial conditions to full trajectories of the system. To make the abstraction precise, let I and K denote compact topological spaces representing time and space, respectively. A state is a continuous function s: K → ℝ^q, and a trajectory is a continuous function f: I×K → ℝ^q. We write 𝒮 = C(K, ℝ^q) for the space of states and ℱ = C(I×K, ℝ^q) for the space of trajectories. The learning problem is then to approximate the map 𝒮 → ℱ that sends the initial condition s ∈ 𝒮 to the trajectory f ∈ ℱ.

In practice, we let I and K be finite discretizations of time and space, equipped with the discrete topology. Then we can represent elements of 𝒮 as elements of ℝ^(|K|×q), with each row representing one point in space. Similarly, we can represent elements of ℱ as elements of ℝ^(|I|×|K|×q).

When the spatial dimensionality exceeds one, it is convenient to decompose K into products. For example, in two spatial dimensions, we let K = H×W, where H and W index height and width. States then become elements of ℝ^(|H|×|W|×q), and trajectories become elements of ℝ^(|I|×|H|×|W|×q). This pattern generalizes naturally to higher dimensions, but in our experiments the data involve either one or two spatial dimensions.
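As a concrete illustration of these discretized representations, the following sketch uses the dataset sizes from the Methods section (|I| = 9, a 64×64 grid for Navier-Stokes, |K| = 1024 for Burgers); the tensors are placeholders.

```python
import torch

# Discretized states and trajectories for the 2D (Navier-Stokes) setup.
state = torch.randn(64, 64, 1)          # element of R^(|H| x |W| x q)
trajectory = torch.randn(9, 64, 64, 1)  # element of R^(|I| x |H| x |W| x q)

# In 1D (Burgers), states live on a single spatial axis of |K| = 1024 points.
state_1d = torch.randn(1024, 1)          # element of R^(|K| x q)
trajectory_1d = torch.randn(9, 1024, 1)  # element of R^(|I| x |K| x q)
```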


Transformers for Trajectory Prediction

As a starting point, we define a baseline algorithm that treats trajectory prediction as a special case of video modeling. This approach, which we refer to as the Video Transformer (VT), is closest to standard transformer architectures used in video tasks 2 3. It is also very close to the block causal transformer 4, which was proposed for physical trajectory prediction.

We tokenize the spatio-temporal domain into patches. A token corresponds to one spatial patch at one time step. If K indexes the patches, then a transformer with hidden dimension d acts as T: ℝ^(|I|×|K|×d) → ℝ^(|I|×|K|×d). If there are two spatial dimensions, then we let H and W index the patch height and width and regard T: ℝ^(|I|×|H|×|W|×d) → ℝ^(|I|×|H|×|W|×d). We apply a block causal mask 4, so that each token can attend to all tokens from its own or earlier time steps, but never to the future.
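A minimal sketch of one way to build such a block-causal mask, assuming tokens are ordered time step by time step; the function name and boolean convention (True = may attend) are ours, not the codebase's.

```python
import torch

def block_causal_mask(num_steps, tokens_per_step):
    """Block-causal attention mask: a token may attend to every token in its
    own or earlier time steps, never to future ones."""
    step_of = torch.arange(num_steps).repeat_interleave(tokens_per_step)
    return step_of[:, None] >= step_of[None, :]   # (N, N), N = num_steps * tokens_per_step

# Example: 9 time steps of 16x16 patch tokens -> a 2304x2304 boolean mask.
mask = block_causal_mask(9, 16 * 16)
```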

To interface with the transformer, we define an encoder ψ: ℝ^(|I|×|K|×q) → ℝ^(|I|×|K|×d) that chooses a representation for each patch. For example, ψ may be a patch-wise MLP applied independently across space and time. On the output side, we define a decoder γ: ℝ^(|I|×|K|×d) → ℝ^(|I|×|K|×q) to map back into the original trajectory format.

At training time, the model input is s || f_{<|I|}, i.e. the concatenation of the initial condition s with the trajectory up to, but excluding, the last time step. The model prediction is compared against the full ground-truth trajectory under mean squared error.

At evaluation time, however, we cannot provide the ground-truth trajectory as input. Instead, we generate one step at a time, feeding the model's own prediction f̂_i back in as input for the next step. This is directly analogous to autoregressive sampling in language and video models. If the training loss on a sample reaches zero, then the predictions at every step exactly match the ground truth, so the evaluation loss is also zero.
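A sketch of this evaluation-time rollout, assuming a hypothetical `model` that maps a partial trajectory to per-token outputs of the same length and applies the block-causal mask internally.

```python
import torch

@torch.no_grad()
def rollout(model, s, num_steps):
    """Autoregressive evaluation: feed the model's predictions back as inputs."""
    context = s                                    # (B, 1, K, q) initial condition
    preds = []
    for _ in range(num_steps):
        out = model(context)                       # same length as the input context
        preds.append(out[:, -1:])                  # newest time step, f-hat_i
        context = torch.cat([context, preds[-1]], dim=1)
    return torch.cat(preds, dim=1)                 # (B, num_steps, K, q)
```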


Backpropagation Through Time

A limitation of VT is that the input distribution at training time differs from the one encountered at evaluation time. During training, the input for time step i is the ground truth state f_i. During evaluation, however, the model must instead consume its own predicted state f̂_i. Unless the model is perfectly accurate, these two distributions differ. This distributional shift can accumulate across steps, often causing the evaluation loss to be much larger than the training loss.

One proposed remedy is to make the training procedure match the evaluation procedure by backpropagating through the trajectory generation algorithm. This idea, called backpropagation through time (BPTT), has been proposed in various settings 5 6 7, even beyond transformers.
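A minimal sketch of one BPTT training step under these assumptions: `model.step` is a hypothetical method that predicts the next state from the current context, and the computational graph is kept across the whole rollout so gradients flow through every generated step.

```python
import torch

def bptt_step(model, optimizer, s, f):
    # s: (B, 1, K, q) initial condition; f: (B, |I|, K, q) ground-truth trajectory.
    context, preds = s, []
    for _ in range(f.shape[1]):
        next_state = model.step(context)            # no detach: keep the graph
        preds.append(next_state)
        context = torch.cat([context, next_state], dim=1)
    loss = torch.nn.functional.mse_loss(torch.cat(preds, dim=1), f)
    optimizer.zero_grad()
    loss.backward()                                 # gradients flow through the rollout
    optimizer.step()
    return loss.item()
```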


ANIE

An alternative perspective begins with the observation that many physical systems can be described by integral equations of the form:

y(i, k) = g(i, k) + ∫_{α(i)}^{β(i)} ∫_K G(y(τ, ξ), ξ, τ, i, k) dξ dτ

From the integral equation, we define an associated integral operator as the right-hand side:

κ(y)(i, k) = g(i, k) + ∫_{α(i)}^{β(i)} ∫_K G(y(τ, ξ), ξ, τ, i, k) dξ dτ

Then a solution to the system must be a fixed point of κ.

Attentional Neural Integral Equations (ANIE) 1 leverages this structure by parameterizing an integral operator with a transformer and then searching for a trajectory as the fixed point of that operator. To formalize this, we define a sequence of maps: an encoder ψ: ℝ^(|I|×|K|×q) → ℝ^(|I|×|K|×d), a transformer T: ℝ^(|I|×|K|×d) → ℝ^(|I|×|K|×d), and a decoder γ: ℝ^(|I|×|K|×d) → ℝ^(|I|×|K|×q). These components serve analogous roles to those in VT. The transformer in ANIE does not require any mask, but a mask may be used to enforce integration bounds in the operator.

We approximate the residual Δ(y) = κ(y) − y by the transformer T, so we approximate κ by κ̂(y) = y + T(y). We run m refinement steps with smoothing parameter b:

R(y_i) = y_{i+1} = y_i + b·T(y_i)

When T = Δ, this equals the successive iteration y_{i+1} = (1 − b)·y_i + b·κ(y_i). Prior work 1 found that the choice of b did not significantly change the results, and found only small performance improvements for values larger than m = 3, so we set b = 1 and m = 3 in all experiments.

For any s ∈ 𝒮, we define s1 as the trajectory with constant state s, i.e. s1(i, k) = s(k). ANIE chooses s1 as the initialization, so the forward pass is ANIE(s) = [γ ∘ R^(m) ∘ ψ](s1), where R^(m) denotes m iterations of R.
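A minimal sketch of this forward pass, with `encoder`, `transformer`, and `decoder` as placeholders for ψ, T, and γ.

```python
import torch

def anie_forward(encoder, transformer, decoder, s, num_steps, m=3, b=1.0):
    """Broadcast the initial condition to a constant candidate trajectory s1,
    refine it m times with the learned operator, then decode."""
    x = s.expand(-1, num_steps, -1, -1)   # s1: constant trajectory, (B, |I|, K, q)
    y = encoder(x)                        # latent trajectory, (B, |I|, K, d)
    for _ in range(m):
        y = y + b * transformer(y)        # y_{i+1} = y_i + b * T(y_i)
    return decoder(y)                     # predicted trajectory, (B, |I|, K, q)
```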

If κ̂ is a contraction, then the successive iterations converge to a fixed point of κ̂, equivalently a zero of T, by the Banach fixed point theorem. In general, the existence of a fixed point is unknown. But in practice, because m is a fixed hyperparameter, ANIE defines a map 𝒮 → ℱ regardless of whether the iterations converge as m → ∞.

In this way, ANIE reframes trajectory prediction: instead of autoregressively rolling out a sequence, the model iteratively refines a candidate trajectory with a learned operator.


Design Grid

ANIE may appear unrelated to VT and BPTT at first glance. But we show that they differ by only a small number of architectural choices. The key insight is that the composition of m transformers with L layers each is a transformer with L′ = L·m layers. Conversely, if T is a transformer with L′ layers, we can group its layers into m blocks of L layers each. Then by varying T along three axes, we can recover VT, BPTT, and ANIE as special cases, while also exploring new hybrids.

We define this space as a 2×2×3 grid:

Axis 1 (Weight Sharing):
Shared Weights: shared weights across the m blocks (B_i = B).
Unshared Weights: unshared weights across the m blocks (B_i all distinct).

Axis 2 (Residuals):
Block Residuals: add residual connections around each block.
No Block Residuals: do not add residual connections around each block.

Axis 3 (Temporal Dependencies):
Broadcasted: unmasked; input s1.
Teacher Forcing: block-causal mask; input s || f_{<i} at training time, s || f̂_{<i} at evaluation time.
Recursive: block-causal mask; input s || f̂_{<i}.

"Input sf<i" means the transformer has access to all ground truth tokens for timesteps less than i, including initial conditions. "Input sf^<i" means the transformer has access to the initial conditions and its own predictions for time steps less than i.

In this space, we recover VT as (Unshared Weights, Block Residuals, Teacher Forcing), BPTT as (Unshared Weights, No Block Residuals, Recursive), and ANIE as (Shared Weights, Block Residuals, Broadcasted). Further, with this framing, by choosing values along each axis, we obtain twelve possible configurations, which we evaluate systematically.
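To make the grid concrete, here is a minimal, hypothetical sketch of how a single flag-controlled module can realize all twelve configurations; the encoder, decoder, block constructor, flag names, and mask handling are illustrative placeholders, not the actual codebase.

```python
import torch
import torch.nn as nn

class GridPredictor(nn.Module):
    """Axis 1: share_weights (reuse one block of layers for all m blocks).
    Axis 2: block_residuals (residual connection around each block).
    Axis 3: temporal in {'broadcasted', 'teacher_forcing', 'recursive'}.
    `encoder`, `decoder`, `make_block` stand in for psi, gamma, and a block of L layers."""

    def __init__(self, encoder, decoder, make_block, num_steps, m=3,
                 share_weights=True, block_residuals=False, temporal="broadcasted"):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.blocks = nn.ModuleList([make_block() for _ in range(1 if share_weights else m)])
        self.num_steps, self.m = num_steps, m
        self.share_weights, self.block_residuals = share_weights, block_residuals
        self.temporal = temporal

    def refine(self, y, mask=None):
        # Axes 1 and 2: apply m blocks, optionally shared and with block residuals.
        for i in range(self.m):
            block = self.blocks[0 if self.share_weights else i]
            out = block(y, mask=mask)
            y = y + out if self.block_residuals else out
        return y

    def forward(self, s, f=None):
        # s: (B, 1, K, q) initial condition; f: (B, |I|, K, q) ground truth (training only).
        if self.temporal == "broadcasted":
            x = s.expand(-1, self.num_steps, -1, -1)           # constant trajectory s1, no mask
            return self.decoder(self.refine(self.encoder(x)))
        if self.temporal == "teacher_forcing":
            x = torch.cat([s, f[:, :-1]], dim=1)               # s || f_{<|I|}, block-causal mask
            return self.decoder(self.refine(self.encoder(x), mask="block_causal"))
        # "recursive": roll out step by step, feeding predictions back in.
        context, preds = s, []
        for _ in range(self.num_steps):
            y = self.decoder(self.refine(self.encoder(context), mask="block_causal"))
            preds.append(y[:, -1:])                            # newest time step
            context = torch.cat([context, preds[-1]], dim=1)   # s || f-hat_{<i}
        return torch.cat(preds, dim=1)
```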


Design Choice Strengths

In the discussion below, we propose plausible mechanisms by which each choice could affect inductive biases, optimization dynamics, or information flow in the model. Our list is not exhaustive.

Shared Weights vs. Unshared Weights

Potential Advantages of Shared Weights

With Shared Weights, the model repeatedly applies the same operator, which may provide an inductive bias towards stable iterative refinement of proposal trajectories. An analogy to convolutional neural networks is useful: weight sharing across spatial locations provides a powerful inductive bias for image processing; weight sharing across blocks may play a similar role for trajectory prediction.

Potential Advantages of Unshared Weights

Shared Weights restricts the parameter space to a low-dimensional subspace of the unshared model's parameter space. With enough data, optimization over the larger unshared space is more likely to find a better solution. Thus, while Shared Weights may act as an effective regularizer in low-data regimes, it may also limit expressivity in high-data regimes.

Block Residuals vs. No Block Residuals

Potential Advantages of No Block Residuals

For consistency with 1, we use Post-LN transformer layers instead of Pre-LN transformer layers 8. For Post-LN layers, when the weights equal 0, a transformer layer simply implements layer normalization, and the composition of multiple such layers implements just the final layer normalization.

With No Block Residuals, when the weights equal 0, the composition of multiple blocks similarly implements the final layer normalization. But with Block Residuals, when the weights equal 0, the outputs tend to grow with the number of blocks, preventing iterative refinement. Since weight decay keeps weights close to 0, No Block Residuals may have a stronger inductive bias towards iterative refinement.
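A small numerical illustration of this argument, assuming zero-weight blocks so that each Post-LN block reduces to layer normalization (here without affine parameters).

```python
import torch

ln = torch.nn.LayerNorm(60, elementwise_affine=False)
y = torch.randn(8, 60)

no_res, with_res = y.clone(), y.clone()
for _ in range(3):                        # m = 3 blocks with zero weights
    no_res = ln(no_res)                   # No Block Residuals: stays normalized
    with_res = with_res + ln(with_res)    # Block Residuals: activations accumulate

print(no_res.norm(dim=-1).mean())         # ~ sqrt(60): the scale of a single LayerNorm
print(with_res.norm(dim=-1).mean())       # grows roughly linearly with the number of blocks
```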

Potential Advantages of Block Residuals

For Post-LN transformer layers, layer normalization interferes with the stream of residuals inside the transformer layers. Block Residuals features a clean residual stream across blocks, which may better propagate signal through the network.

Recursive vs. Teacher Forcing vs. Broadcasted

Potential Advantages of Recursive Over Broadcasted and Teacher Forcing

Recursive allocates increasingly deep computational graphs to the prediction of later time-steps, which may be beneficial since later predictions are typically more difficult. Teacher Forcing shares this property during evaluation, but only Recursive has this property during training, so only Recursive receives gradient updates to take advantage of it.

Potential Advantages of Broadcasted and Teacher Forcing Over Recursive

Recursive requires backpropagation through a deeper computational graph, which may exacerbate vanishing or exploding gradients. Although modern techniques like residual connections 9, normalization 10, adaptive optimizers 11, and careful weight initialization 12 help stabilize deep networks, the recursive setting could still be more fragile to architectural or optimization choices.

Potential Advantages of Teacher Forcing Over Broadcasted and Recursive

Teacher Forcing has access to s || f_{<|I|} during training, which is information produced by the physical system itself and may be useful for predicting f. By contrast, s || f̂_{<|I|} contains information computed by the model, which is error-prone. Meanwhile, s1 contains no information beyond s. So the model may require greater capacity to accurately compute f from s1 or s || f̂_{<|I|} than from s || f_{<|I|}.

Potential Advantages of Broadcasted and Recursive Over Teacher Forcing

Teacher Forcing receives input s || f_{<|I|} during training, but receives input s || f̂_{<|I|} during evaluation. This distributional shift may accumulate over time as the model's errors compound, leading to much greater validation loss than training loss.

Potential Advantages of Broadcasted Over Teacher Forcing and Recursive

Broadcasted is unmasked, which permits early time-steps to use the keys and values from later time-steps. This may allow the model to adjust its predictions for global consistency across time.

Potential Advantages of Teacher Forcing and Recursive Over Broadcasted

Broadcasted receives input s1, while Teacher Forcing and Recursive receive s || f_{<|I|} and s || f̂_{<|I|}, respectively. The latter two contain more information than s1, so the model may require greater capacity to accurately compute f from s1.


Methods

Data and Tokenization

The 2D incompressible Navier-Stokes equations describe the motion of a 2D, viscous, incompressible fluid. In vorticity form, the system evolves according to:

∂_t w + u·∇w = νΔw + f
∇·u = 0

where u: I×H×W → ℝ² is the velocity field, w: I×H×W → ℝ is the vorticity defined by w = ∇×u, ν > 0 is the viscosity, and f: I×H×W → ℝ is a forcing term.

We use the Navier-Stokes dataset from 1, generated by a script from the Fourier Neural Operator 13, with viscosity 10^-3. We also generate a new dataset with viscosity 10^-3.5. We use temporal resolution |I| = 9 and the full spatial resolution |H|×|W| = 64×64, with q = 1 channel for vorticity. Each sample consists of initial conditions in ℝ^(64×64×1) and a trajectory in ℝ^(9×64×64×1). We split each dataset into 4000 training and 1000 evaluation samples. To tokenize the input, we partition space into non-overlapping 4×4 patches, yielding a 16×16 grid of patch tokens (|H| = 16, |W| = 16).
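A sketch of this patch tokenization, assuming the trajectory is stored as a (batch, time, height, width, channel) tensor; the function name is ours.

```python
import torch

def patchify_2d(x, patch=4):
    """Split a trajectory (B, I, H, W, q) into non-overlapping patch tokens of
    shape (B, I, H/patch, W/patch, patch*patch*q), e.g. 64x64 -> 16x16 patches."""
    B, I, H, W, q = x.shape
    x = x.reshape(B, I, H // patch, patch, W // patch, patch, q)
    x = x.permute(0, 1, 2, 4, 3, 5, 6)
    return x.reshape(B, I, H // patch, W // patch, patch * patch * q)

tokens = patchify_2d(torch.randn(2, 9, 64, 64, 1), patch=4)   # (2, 9, 16, 16, 16)
```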

Burgers

The 1D Burgers equation is a common PDE across multiple areas of physics, with the interesting property that for small viscosity values, the system nearly forms discontinuities. The system evolves according to:

∂_t u + u ∂_x u = ν ∂_xx u

where u: I×K → ℝ is the scalar field and ν > 0 is the viscosity.

We use the Burgers dataset from 1, generated by a script from 13. We use temporal resolution |I|=9 and the full spatial resolution |K|=1024. We split the dataset into 800 training and 200 evaluation samples. To tokenize the input, we partition space into non-overlapping patches of length 16, yielding 64 patches.

Shallow-Water

The shallow-water equations are derived from the Navier-Stokes equations, modeling incompressible fluids under the assumption that the fluid depth is small compared to the horizontal length scale and that vertical velocity is negligible compared to horizontal velocity. The system evolves according to:

∂_t h + ∂_x(hu) + ∂_y(hv) = 0
∂_t(hu) + ∂_x(u²h + ½g_r h²) + ∂_y(uvh) = −g_r h ∂_x b
∂_t(hv) + ∂_y(v²h + ½g_r h²) + ∂_x(uvh) = −g_r h ∂_y b

where h is the depth, u and v are the velocities in the x and y directions, g_r is the gravitational acceleration, and b is the time-invariant bottom surface elevation.

We use the Shallow-Water dataset from 14, with temporal resolution |I|=9 and the full spatial resolution |H|×|W|=128×128, with q=1 channel for height. Each sample consists of initial conditions in 128×128×1 and a trajectory in 9×128×128×1. We split into 800 training and 200 evaluation samples. To tokenize the input, we partition space into non-overlapping 8×8 patches, yielding |H|=16 and |W|=16.

Optimization

We use the Adam optimizer 11 with weight decay 10^-4 and otherwise use PyTorch's default parameters 15. The learning rate follows a cosine annealing schedule 16 with 101 epochs per half-cycle, plus 10 epochs of linear learning rate warmup 17. We train for 4959 epochs, terminating at the 25th local minimum of the cosine annealing schedule (excluding the warmup minimum). We chose the half-cycle length of 101 for consistency with 1, and 4959 epochs because it is the local minimum of the learning rate closest to 5000.
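A sketch of this optimization setup in PyTorch, assuming CosineAnnealingLR for the half-cycle behavior (it keeps oscillating when stepped past T_max) and LinearLR for the warmup; the placeholder model and warmup start factor are illustrative, not values from the paper.

```python
import torch

model = torch.nn.Linear(60, 60)                      # placeholder model
opt = torch.optim.Adam(model.parameters(), weight_decay=1e-4)

warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=101)
sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[10])

for epoch in range(4959):                            # ends at the 25th cosine minimum
    # ... one epoch of training ...
    sched.step()
```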

Architectures

Transformer T: We use a Galerkin transformer 18, which has shown empirical success in operator-learning tasks and has linear complexity in the number of tokens. It uses softmax-free, multi-head linear attention with layer normalization applied to the keys and values, attention scaled by 1/(number of tokens), and a special initialization for the attention projection matrices; a minimal sketch of this attention form appears after the component descriptions below. We set model dimension d = 60, feedforward dimension 4·d, and ReLU activations. Layers are grouped into m = 3 sequential blocks of L = 4 layers each (L′ = 12 total layers).

Positional encoding: We compare two choices: rotary positional embeddings (RoPE) 19 and the coordinate positional encoding from 1, where space-time coordinates normalized to [0,1] are concatenated as extra channels.

Encoder/Decoder ψ/γ: The encoder is a patch-wise MLP with two hidden layers of width 4·d and ReLU activations. We experiment with and without layer normalization in the encoder. We also vary the decoder between the MLP from 1 (which takes the outputs for all tokens in a time step, with two hidden layers of sizes 64 and 256) and a patch-wise MLP mirroring the encoder.
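A single-head sketch of the Galerkin-type attention described above: no softmax, layer-normalized keys and values, and scaling by the number of tokens. The multi-head structure and the special projection initialization are omitted.

```python
import torch

def galerkin_attention(q, k, v, ln_k, ln_v):
    """q, k, v: (B, n, d); ln_k, ln_v: torch.nn.LayerNorm(d).
    Softmax-free attention: q @ (LN(k)^T @ LN(v)) / n."""
    n = q.shape[1]
    return q @ (ln_k(k).transpose(-2, -1) @ ln_v(v)) / n      # (B, n, d)
```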

In total, we vary the positional encoding, encoder, and decoder choices in two ways each, performing a dense search over these eight combinations for each configuration (see the sketch below). Using the Navier-Stokes dataset with viscosity 10^-3, we tune for 313 epochs. For the full 4959-epoch training runs on all datasets, we use the best-performing architecture identified in this search.
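A sketch of the resulting search loop; the option names are illustrative, not the codebase's flags.

```python
from itertools import product

POS_ENC = ["rope", "coordinate_channels"]
ENCODER = ["patch_mlp", "patch_mlp_layernorm"]
DECODER = ["timestep_mlp", "patch_mlp"]

for pos_enc, enc, dec in product(POS_ENC, ENCODER, DECODER):
    pass  # train this combination for 313 epochs on NS (10^-3) and keep the best one
```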


Results

Table 1. Validation losses across datasets. Best results per dataset are in bold.

Temporal Dependencies Weight Sharing Residuals NS (10^-3) NS (10^-3.5) Burgers SWE
Broadcasted Shared Weights No Block Residuals 2.521e-3 9.992e-2 2.741e-4 2.898e-4
Broadcasted Shared Weights Block Residuals 2.632e-3 1.017e-1 3.270e-4 3.420e-4
Broadcasted Unshared Weights No Block Residuals 2.908e-3 8.578e-2 4.256e-4 5.975e-4
Broadcasted Unshared Weights Block Residuals 3.640e-3 1.445e-1 3.486e-4 2.944e-4
Teacher Forcing Shared Weights No Block Residuals 1.659e-2 4.744e-1 1.202e-3 4.017e-4
Teacher Forcing Shared Weights Block Residuals 1.479e-2 4.854e-1 1.108e-3 3.281e-4
Teacher Forcing Unshared Weights No Block Residuals 1.644e-2 5.908e-1 1.285e-3 6.052e-4
Teacher Forcing Unshared Weights Block Residuals 1.683e-2 5.195e-1 1.077e-3 4.119e-4
Recursive Shared Weights No Block Residuals 1.929e-3 1.000e-1 3.377e-4 3.468e-4
Recursive Shared Weights Block Residuals 2.213e-3 8.560e-2 4.494e-4 4.906e-4
Recursive Unshared Weights No Block Residuals 2.686e-3 1.153e-1 4.043e-4 3.754e-4
Recursive Unshared Weights Block Residuals 2.931e-3 1.013e-1 4.333e-4 5.235e-4

Table 2. Normalized validation loss ratios (validation loss ÷ per-dataset minimum). Best results per dataset are in bold.

Temporal Dependencies Weight Sharing Residuals NS (10^-3) NS (10^-3.5) Burgers SWE
Broadcasted Shared Weights No Block Residuals 1.3071 1.1672 1.0000 1.0000
Broadcasted Shared Weights Block Residuals 1.3645 1.1883 1.1933 1.1803
Broadcasted Unshared Weights No Block Residuals 1.5075 1.0021 1.5529 2.0620
Broadcasted Unshared Weights Block Residuals 1.8869 1.6884 1.2722 1.0160
Teacher Forcing Shared Weights No Block Residuals 8.6010 5.5413 4.3855 1.3864
Teacher Forcing Shared Weights Block Residuals 7.6663 5.6702 4.0431 1.1321
Teacher Forcing Unshared Weights No Block Residuals 8.5233 6.9020 4.6888 2.0886
Teacher Forcing Unshared Weights Block Residuals 8.7233 6.0686 3.9301 1.4214
Recursive Shared Weights No Block Residuals 1.0000 1.1686 1.2323 1.1968
Recursive Shared Weights Block Residuals 1.1474 1.0000 1.6397 1.6930
Recursive Unshared Weights No Block Residuals 1.3926 1.3465 1.4751 1.2953
Recursive Unshared Weights Block Residuals 1.5196 1.1838 1.5811 1.8066

Table 3. Summary across datasets: geometric mean and maximum normalized validation loss per configuration. Best results per summary statistic in bold.

Temporal Dependencies Weight Sharing Residuals Geometric Mean Normalized Loss Worst-Case Normalized Loss
Broadcasted Shared Weights No Block Residuals 1.1114 1.3071
Broadcasted Shared Weights Block Residuals 1.2293 1.3645
Broadcasted Unshared Weights No Block Residuals 1.4830 2.0620
Broadcasted Unshared Weights Block Residuals 1.4245 1.8869
Teacher Forcing Shared Weights No Block Residuals 4.1259 8.6010
Teacher Forcing Shared Weights Block Residuals 3.7558 7.6663
Teacher Forcing Unshared Weights No Block Residuals 4.8992 8.5233
Teacher Forcing Unshared Weights Block Residuals 4.1469 8.7233
Recursive Shared Weights No Block Residuals 1.1458 1.2323
Recursive Shared Weights Block Residuals 1.3359 1.6930
Recursive Unshared Weights No Block Residuals 1.3758 1.4751
Recursive Unshared Weights Block Residuals 1.5056 1.8066
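A short sketch of how Tables 2 and 3 are derived from Table 1: normalize each column by its per-dataset minimum, then summarize each configuration by the geometric mean and the maximum of its ratios.

```python
import math

def summarize(losses):
    """losses: dict mapping configuration -> list of validation losses, one per dataset.
    Returns configuration -> (geometric mean ratio, worst-case ratio)."""
    mins = [min(col) for col in zip(*losses.values())]      # per-dataset minima
    summary = {}
    for cfg, row in losses.items():
        ratios = [x / m for x, m in zip(row, mins)]
        geo = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
        summary[cfg] = (geo, max(ratios))
    return summary
```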

Two configurations are nearly optimal across datasets. (Broadcasted, Shared Weights, No Block Residuals) minimizes the geometric mean normalized validation loss at 1.1114, followed closely by (Recursive, Shared Weights, No Block Residuals) at 1.1458. Considering worst-case normalized validation loss, (Recursive, Shared Weights, No Block Residuals) comes first at 1.2323, followed by (Broadcasted, Shared Weights, No Block Residuals) at 1.3071. With the same two configurations leading on both summary statistics, we suggest these as the default best choices.

Teacher Forcing performs poorly. All Broadcasted and Recursive configurations have worst-case normalized validation loss of at most 2.0620. Meanwhile, on all datasets except SWE, all Teacher Forcing configurations have normalized validation losses ranging from 3.9301 to 8.7233. Because only the Teacher Forcing configurations suffer a distributional shift between training and evaluation, this shift is a plausible explanation for their inferior performance.

Broadcasted and Recursive are competitive. The per-dataset best configuration is Broadcasted on two datasets and Recursive on the other two. Broadcasted beats Recursive in 8/16 matched comparisons, with a geometric mean improvement of 1.0238. With mixed and close results, we do not claim either is generally better.

Shared Weights moderately outperforms Unshared Weights. All optimal configurations have Shared Weights. Excluding Teacher Forcing, Shared Weights beats Unshared Weights in 13/16 matched comparisons, with a geometric mean improvement of 1.2026. This is consistent with the hypothesis that weight sharing provides an inductive bias towards iterative refinement.

No Block Residuals slightly outperforms Block Residuals. Three of the four per-dataset best configurations use No Block Residuals. Excluding Teacher Forcing, No Block Residuals beats Block Residuals in 12/16 matched comparisons, with a geometric mean improvement of 1.0720. This is also consistent with an inductive bias towards iterative refinement when residuals around blocks are absent.


Conclusion

We introduced a unifying 2×2×3 design grid for transformer-based trajectory predictors and showed that VT, BPTT, and ANIE are simply different coordinates within this unified space. This perspective makes implementation differences explicit, exposes a family of hybrid methods, and allows us to perform a controlled ablation study.

We find that no single configuration is optimal across datasets, but (Broadcasted, Shared Weights, No Block Residuals) and (Recursive, Shared Weights, No Block Residuals) are close to optimal on all datasets, and we recommend these two as the best default options.

We find that while Teacher Forcing configurations are sometimes competitive, overall they perform significantly worse than other configurations — a result well explained by the distributional shift from training to evaluation time.

Within the set of Broadcasted and Recursive configurations, all configurations have normalized validation loss ratios of at most 2.0620 on all datasets. Within that set, Shared Weights outperforms Unshared Weights, No Block Residuals outperforms Block Residuals, and Broadcasted and Recursive are competitive with each other. We hypothesize that both Shared Weights and No Block Residuals provide an inductive bias towards stable iterative refinement, which may underlie the performance gains along both dimensions.

We hope our conceptual framework, results, and code provide a clear substrate to build stronger trajectory predictors and to disentangle the source of performance gains.


References

  1. Zappala, E., et al. "Neural Integral Equations." Nature Machine Intelligence 6 (2024): 1046–1062. arXiv:2209.15190.

  2. Weissenborn, D., Täckström, O., and Uszkoreit, J. "Scaling Autoregressive Video Models." ICLR 2020. arXiv:1906.02629.

  3. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. "ViViT: A Video Vision Transformer." ICCV 2021. arXiv:2103.15691.

  4. Doron, M., et al. "BCAT: A Block Causal Transformer for PDE Foundation Models for Fluid Dynamics." arXiv:2501.18972 (2025).

  5. Werbos, P.J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE 78.10 (1990): 1550–1560.

  6. Brandstetter, J., Worrall, D., and Welling, M. "Message Passing Neural PDE Solvers." ICLR 2022. arXiv:2202.03376.

  7. Kochkov, D., et al. "Machine learning-accelerated computational fluid dynamics." PNAS 118.21 (2021).

  8. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.Y. "On Layer Normalization in the Transformer Architecture." ICML 2020. arXiv:2002.04745.

  9. He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." CVPR 2016. arXiv:1512.03385.

  10. Ba, J.L., Kiros, J.R., and Hinton, G.E. "Layer Normalization." arXiv:1607.06450 (2016).

  11. Kingma, D.P. and Ba, J. "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980.

  12. Glorot, X. and Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." AISTATS 2010.

  13. Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR 2021. arXiv:2010.08895.

  14. Takamoto, M., Praditia, T., Leiteritz, R., MacKinlay, D., Alesiani, F., Pflüger, D., and Niepert, M. "PDEBench: An Extensive Benchmark for Scientific Machine Learning." NeurIPS 2022. arXiv:2210.07182.

  15. Paszke, A., et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." NeurIPS 2019. arXiv:1912.01703.

  16. Loshchilov, I. and Hutter, F. "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR 2017. arXiv:1608.03983.

  17. Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762.

  18. Cao, S. "Choose a Transformer: Fourier or Galerkin." NeurIPS 2021. arXiv:2105.14995.

  19. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing 568 (2024). arXiv:2104.09864.