Causality weighting for time-domain PINNs

Part 3 — Training pathologies and remedies

Learning objectives

Diagnose causality violation: late-time residuals collapse before early-time residuals
Implement Wang–Sankaran–Perdikaris (2024) causality-aware weighting
Confirm that causal weighting forces the network to learn time-causally
Apply the technique to 1D advection — a canonical seismic-wave proxy

Time-domain PINNs are notorious for cheating on time. Given a PDE $u_t + \mathcal{N}[u] = 0$ on $(x, t) \in \Omega \times [0, T]$ with initial condition (IC) $u(x, 0) = u_0(x)$, the optimiser can find a smooth function that satisfies the PDE near $t = T$ while violating the IC at $t = 0$. The residual averaged over the whole $(x, t)$ domain is small, but the solution is wrong: the network has solved an easier problem (steady state at $t = T$) instead of the actual evolution problem.

This is the causality violation — pathology #4 from §3.1. Wang, Sankaran & Perdikaris (2024) named it and gave a clean fix.

The causality-aware weight

For a residual $\mathcal{L}_r(\theta; t) = \frac{1}{N_x} \sum_i r(x_i, t; \theta)^2$ at fixed time $t$, define

$$w(t) = \exp\left( -\varepsilon \int_0^t \overline{\mathcal{L}_r}(s) \, ds \right),$$

where $\varepsilon > 0$ is a hyperparameter and $\overline{\mathcal{L}_r}(s)$ is the spatially averaged residual $\mathcal{L}_r(\theta; s)$ at time $s$. The intuition: $w(t)$ is small when the integrated earlier-time residual $\int_0^t \overline{\mathcal{L}_r}(s) \, ds$ is large, and rises towards one as the earlier-time residual falls. Late-time collocation points are downweighted until earlier residuals are small.
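In practice the integral is evaluated on a time grid. One possible discretisation, given here as a sketch rather than as the published scheme, assumes uniform time levels $0 = t_0 < t_1 < \dots < t_{N_t - 1} = T$ with spacing $\Delta t$ and a left Riemann sum (the grid, $N_t$ and $\Delta t$ are notation introduced for this illustration):

```latex
w_i = \exp\!\left( -\varepsilon \, \Delta t \sum_{k=0}^{i-1} \overline{\mathcal{L}_r}(t_k) \right),
\qquad w_0 = 1,
\qquad
\mathcal{L}(\theta) = \frac{1}{N_t} \sum_{i=0}^{N_t - 1} w_i \, \overline{\mathcal{L}_r}(t_i).
```

Because the sum for $w_i$ runs only over $k < i$, the weight at a given time level depends only on residuals at strictly earlier levels. The weights are usually treated as constants when differentiating $\mathcal{L}(\theta)$, so no gradient flows through $w_i$.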

The weight is recomputed each training step. Initially every residual is large, so $w(t) \approx 1$ for small $t$ and $w(t) \to 0$ for large $t$; the optimiser sees only the IC region. As the IC region fits, $w(t)$ at later times rises and the optimiser advances forward in time. The network learns the evolution one time-slab at a time, following the natural causality of the underlying physics.
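The schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the widget's implementation: it assumes the spatially averaged residuals are already available as a list ordered from $t = 0$ to $t = T$ on a uniform time grid, and the function names `causal_weights` and `causal_loss` are hypothetical.

```python
import math

def causal_weights(residuals, dt, eps):
    """Return w_i = exp(-eps * dt * sum_{k<i} residuals[k]) for each time level.

    residuals: spatially averaged squared PDE residual at each time level,
               ordered from t = 0 to t = T (assumed input layout).
    dt:        spacing between time levels (uniform grid assumed).
    eps:       causality hyperparameter epsilon.
    """
    weights = []
    accumulated = 0.0  # left Riemann sum of the residual over earlier times
    for r in residuals:
        weights.append(math.exp(-eps * accumulated))
        accumulated += r * dt
    return weights

def causal_loss(residuals, weights):
    """Weighted mean of the per-time residuals (weights held fixed)."""
    return sum(w * r for w, r in zip(weights, residuals)) / len(residuals)

# Start of training: every time level has a large residual, so only the
# earliest levels carry weight.
untrained = [1.0] * 5
w_start = causal_weights(untrained, dt=0.25, eps=10.0)

# After the first three levels are fitted, the weight front moves forward.
partly_trained = [1e-4, 1e-4, 1e-4, 1.0, 1.0]
w_later = causal_weights(partly_trained, dt=0.25, eps=10.0)
```

In this toy run `w_start` falls from 1 at the first level to $e^{-10}$ at the last, while `w_later` stays close to 1 through the fourth level and only drops at the fifth. In an automatic-differentiation framework the weights would normally be detached from the graph before they multiply the residuals.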

Try it: causality on 1D advection

The widget races uniform-weighted training (§3.1 causality-violation pathology) against causality-weighted training on $u_t + u_x = 0$, $u(x, 0) = \sin(\pi x)$, on $x \in [-1, 1]$, $t \in [0, 1]$. The exact solution is $u(x, t) = \sin(\pi (x - t))$.
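As a quick sanity check, the exact solution can be substituted into the PDE numerically. The sketch below uses central finite differences at a few arbitrarily chosen points; the helper names `u_exact` and `advection_residual` and the step size `h` are assumptions made for this illustration.

```python
import math

def u_exact(x, t):
    """Exact solution u(x, t) = sin(pi * (x - t)) of u_t + u_x = 0."""
    return math.sin(math.pi * (x - t))

def advection_residual(u, x, t, h=1e-4):
    """Central-difference estimate of r = u_t + u_x at a single point."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2.0 * h)
    return u_t + u_x

# The residual of the exact solution vanishes up to finite-difference
# and round-off error.
points = [(-0.7, 0.1), (0.0, 0.5), (0.3, 0.9)]
max_residual = max(abs(advection_residual(u_exact, x, t)) for x, t in points)
```

The same `advection_residual` helper accepts any callable `u(x, t)`, so it can also be pointed at a trained network's prediction to inspect where the PDE is violated.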

What you should observe

Uniform weights: late-time residual sits at or above the early-time residual — the causality-violation signature. The network is solving the problem out of time-order.
Causal weights: both early and late residuals drop by an order of magnitude or more, and crucially the early bin drops first.
The relative-L² improvement on this 1D advection toy is modest because the architecture is small; the per-window residual story is what scales to 2D wave equations, where causality weighting goes from "nice" to "essential". Wang, Sankaran & Perdikaris (2024) demonstrated the same behaviour on the Allen-Cahn equation.
The causality-weight panel (right) shows the schedule: $w(t = 0)$ stays at 1.0, while $w(t = 1)$ starts near zero and rises only as the early-time fit cleans up.
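The early and late residuals compared in these observations can be computed by binning squared residuals in time. The following is a minimal sketch, assuming collocation times and squared residuals are available as flat lists; the function name and the choice of two equal-width bins are illustrative and not taken from the widget.

```python
def mean_residual_by_time_bin(times, sq_residuals, t_end, n_bins=2):
    """Average squared residuals within equal-width time bins on [0, t_end]."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, r2 in zip(times, sq_residuals):
        # Clamp the index so that t == t_end falls in the last bin.
        b = min(int(t / t_end * n_bins), n_bins - 1)
        sums[b] += r2
        counts[b] += 1
    # Empty bins are reported as NaN rather than zero.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

# Hypothetical residual samples at six collocation times.
times = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
sq_residuals = [1.0, 3.0, 2.0, 0.1, 0.2, 0.3]
early, late = mean_residual_by_time_bin(times, sq_residuals, t_end=1.0)
```

Logging `early` and `late` every few hundred training steps gives the two curves to compare: which bin drops first, and by how much.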

For the wave equation

The advection problem above is the smallest non-trivial test case. On the 2D acoustic wave equation $u_{tt} = c^2 \nabla^2 u$ (the Part 4 setting), causality violation is severe. Without causality weighting the wavefront often forms simultaneously across the entire domain, producing checkerboard artefacts; with causality weighting the wavefront propagates outward from the source as it should physically. Wang, Sankaran & Perdikaris (2024) demonstrate the benefit of causal weighting on the Allen-Cahn equation; subsequent papers (Diab et al. 2024) apply the same idea to acoustic FWI.

Choosing ε

The hyperparameter $\varepsilon$ controls how aggressively later times are downweighted. Wang et al. (2024) recommend choosing $\varepsilon$ such that $w(T)$ at the end of the time domain is roughly $10^{-2}$ at the start of training. In our advection demo $\varepsilon = 100$ achieves this. Too small and the weighting has no effect; too large and the optimiser cannot escape the $t = 0$ region. The widget exposes the schedule so you can see this directly.
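The rule of thumb above can be turned into a formula: setting $w(T) = 10^{-2}$ in the definition of $w$ gives $\varepsilon = \ln(100) / \int_0^T \overline{\mathcal{L}_r}(s) \, ds \approx 4.6 / \int_0^T \overline{\mathcal{L}_r}(s) \, ds$, so $\varepsilon = 100$ corresponds to an initial integrated residual of about $0.046$. A minimal sketch of that calculation follows, assuming a uniform time grid and a left Riemann sum for the integral; the function name and the sample residual values are hypothetical.

```python
import math

def epsilon_for_target(residuals, dt, w_target=1e-2):
    """Pick eps so that exp(-eps * integral) equals w_target at t = T.

    residuals: spatially averaged squared residuals at the start of training,
               one per time level from t = 0 to t = T (assumed input layout).
    dt:        uniform spacing between time levels.
    """
    integral = sum(residuals[:-1]) * dt  # left Riemann sum over [0, T]
    if integral <= 0.0:
        raise ValueError("integrated residual must be positive")
    return math.log(1.0 / w_target) / integral

# Hypothetical initial residuals on 11 time levels spanning [0, 1].
initial = [0.05] * 11
eps = epsilon_for_target(initial, dt=0.1)

# Check: the weight at the final time level hits the target.
w_end = math.exp(-eps * sum(initial[:-1]) * 0.1)
```

Since the residual scale depends on the PDE and on the network initialisation, a value obtained this way is a starting point to refine while watching the weight schedule, not a fixed recommendation.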

References

Wang, S., Sankaran, S., Perdikaris, P. (2024). Respecting causality is all you need for training physics-informed neural networks. Comput. Methods Appl. Mech. Engrg.
Mattey, R., Ghosh, S. (2022). A novel sequential method to train physics-informed neural networks. Comput. Methods Appl. Mech. Engrg.