The PINN idea: PDE residual as a loss term

Part 1 — The PINN formulation

Learning objectives

State the PINN loss recipe: data fit + initial/boundary conditions + PDE residual
Recognise that the PDE-residual term forces correct extrapolation in a way pure supervised learning cannot
See concretely that any one term alone is insufficient: data alone fails to extrapolate; IC alone does not constrain shape; PDE alone admits trivial solutions
Compute the PINN parameter gradient by combining standard backprop (for data and IC terms) with input-derivative AD (for the PDE residual)

The fix introduced by Maziar Raissi, Paris Perdikaris, and George Karniadakis (2017–2019) is, on paper, embarrassingly simple: add a loss term that penalises the network for violating the PDE its output is supposed to satisfy. That single change converts a function-fitter into a PDE-solver. This section sets out the formulation and walks through it on the smallest possible example: a 1D first-order ODE.

The PINN loss recipe

For a function $u(\mathbf{x})$ that is supposed to satisfy $\mathcal{N}[u] = 0$ for some differential operator $\mathcal{N}$ on a domain $\Omega$, plus boundary conditions $\mathcal{B}[u] = 0$ on $\partial\Omega$ and possibly some sparse data $\{(\mathbf{x}_i, y_i)\}$, the PINN loss is:

$$L(\theta) = \lambda_d L_{\mathrm{data}} + \lambda_b L_{\mathrm{BC}} + \lambda_p L_{\mathrm{PDE}}$$

where each term is a mean-squared error:

$$L_{\mathrm{data}} = \frac{1}{N_d} \sum_{i=1}^{N_d} (u_\theta(\mathbf{x}_i) - y_i)^2$$

$$L_{\mathrm{BC}} = \frac{1}{N_b} \sum_{i=1}^{N_b} (\mathcal{B}[u_\theta](\mathbf{x}_i^{\mathrm{bnd}}))^2$$

$$L_{\mathrm{PDE}} = \frac{1}{N_c} \sum_{j=1}^{N_c} (\mathcal{N}[u_\theta](\mathbf{x}_j^{\mathrm{coll}}))^2$$

The "collocation points" $\mathbf{x}_j^{\mathrm{coll}}$ are sampled from the interior of $\Omega$; they are where we check that the PDE holds. They do not need to be data points and can be drawn arbitrarily from the domain. The PDE residual $\mathcal{N}[u_\theta]$ is the differential operator applied to the network output, evaluated at each collocation point, using automatic differentiation through the network with respect to its inputs. This is exactly what Part 0 (§0.6–0.7) was setting up.
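The recipe above can be sketched in a few lines of plain Python. This is only an illustration of the weighted sum of mean-squared terms, not code from the textbook's runtime: the function names `mse` and `pinn_loss` are hypothetical, the residuals are made-up numbers, and treating an absent term as zero is an assumption of this sketch.

```python
def mse(values):
    """Mean of squared values; an empty list (term switched off) contributes 0."""
    if not values:
        return 0.0
    return sum(v * v for v in values) / len(values)


def pinn_loss(data_errors, bc_residuals, pde_residuals,
              lam_d=1.0, lam_b=1.0, lam_p=1.0):
    """Weighted sum L = lam_d * L_data + lam_b * L_BC + lam_p * L_PDE."""
    return (lam_d * mse(data_errors)
            + lam_b * mse(bc_residuals)
            + lam_p * mse(pde_residuals))


# Illustrative numbers only: two data errors u_theta(x_i) - y_i, one boundary
# residual, and three PDE residuals evaluated at collocation points.
loss = pinn_loss([0.1, -0.2], [0.05], [0.3, 0.0, -0.3])

# The same call with no data at all: only the BC and PDE terms remain.
loss_no_data = pinn_loss([], [0.05], [0.3, 0.0, -0.3])
```

Here the data term is 0.025, the boundary term 0.0025, and the PDE term 0.06, so `loss` is their sum with unit weights. In practice the weights $\lambda_d, \lambda_b, \lambda_p$ are tuning knobs rather than fixed constants.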

The PINN does standard backprop to minimise $L(\theta)$ over the network parameters. But because $L_{\mathrm{PDE}}$ involves derivatives of the network with respect to its inputs, training requires differentiating through input-derivatives. The runtime that powers this textbook (nn-runtime.js) does this exactly: forwardDerivs(x) returns $u$, $\nabla_x u$, and the diagonal Hessian $\partial^2 u / \partial x_d^2$ at one input point; backwardDerivs(...) propagates loss gradients on those quantities back to the parameters.
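To make "derivative of the network with respect to its input" concrete, here is a minimal sketch that carries a value and its input-derivative side by side through a one-hidden-layer tanh network. It is not the implementation in nn-runtime.js; the function name `forward_with_derivative`, the 1-3-1 architecture, and the weights are all assumptions chosen for illustration. A central finite difference serves as an independent check.

```python
import math


def forward_with_derivative(t, w1, b1, w2, b2):
    """Return (u, du/dt) for a 1-H-1 tanh network at scalar input t."""
    u, du = b2, 0.0
    for w1_i, b1_i, w2_i in zip(w1, b1, w2):
        z = w1_i * t + b1_i        # hidden pre-activation
        dz = w1_i                  # its derivative with respect to t
        h = math.tanh(z)
        dh = (1.0 - h * h) * dz    # tanh'(z) = 1 - tanh(z)^2, times dz/dt
        u += w2_i * h              # output value
        du += w2_i * dh            # output derivative, accumulated alongside
    return u, du


# Hypothetical weights for a 3-unit hidden layer.
w1, b1, w2, b2 = [0.5, -1.0, 2.0], [0.1, 0.0, -0.3], [1.0, 0.5, -0.25], 0.2
u, du = forward_with_derivative(0.7, w1, b1, w2, b2)

# Independent check: central finite difference of u around t = 0.7.
eps = 1e-6
u_plus, _ = forward_with_derivative(0.7 + eps, w1, b1, w2, b2)
u_minus, _ = forward_with_derivative(0.7 - eps, w1, b1, w2, b2)
fd = (u_plus - u_minus) / (2 * eps)
```

The derivative is computed exactly (up to floating point) in the same pass as the value, which is why the residual can be evaluated at any collocation point without labels.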

The smallest possible PINN

To make the recipe concrete, here is the simplest possible target: the linear first-order ODE

$$u'(t) = -k\, u(t),\qquad u(0) = 1,\qquad t \in [0, T].$$

The true solution is $u(t) = e^{-kt}$, an exponential decay. The PINN loss has three parts. Define the PDE residual $R(t) = u'(t) + k\,u(t)$. Then:

$L_{\mathrm{data}} = \frac{1}{N_d}\sum_i (u_\theta(t_i) - y_i)^2$ (over a few sparse, possibly noisy observations; sometimes available, sometimes not)
$L_{\mathrm{IC}} = (u_\theta(0) - 1)^2$ (the initial condition is a single algebraic constraint at $t=0$)
$L_{\mathrm{PDE}} = \frac{1}{N_c}\sum_j R(t_j)^2 = \frac{1}{N_c}\sum_j (u_\theta'(t_j) + k\,u_\theta(t_j))^2$ (the PDE residual is forced to zero at each collocation point $t_j$)
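As a sanity check on these definitions, the sketch below evaluates $L_{\mathrm{IC}}$ and $L_{\mathrm{PDE}}$ for two hand-written candidate functions instead of a network: the exact solution and the constant $u(t) = 1$. The function name `ode_losses`, the values $k = 2$ and $T = 3$, and the evenly spaced collocation grid are assumptions made for illustration.

```python
import math


def ode_losses(u, du, k, T, n_coll=50):
    """Return (L_IC, L_PDE) for u' = -k u, u(0) = 1, given u and its derivative du."""
    l_ic = (u(0.0) - 1.0) ** 2
    # Evenly spaced collocation points on [0, T]; any sampling of the domain would do.
    ts = [T * j / (n_coll - 1) for j in range(n_coll)]
    l_pde = sum((du(t) + k * u(t)) ** 2 for t in ts) / n_coll
    return l_ic, l_pde


k, T = 2.0, 3.0

# Exact solution u = exp(-k t): both terms vanish.
exact = ode_losses(lambda t: math.exp(-k * t),
                   lambda t: -k * math.exp(-k * t), k, T)

# Constant u = 1: satisfies the IC, but R(t) = k everywhere, so L_PDE = k^2.
const = ode_losses(lambda t: 1.0, lambda t: 0.0, k, T)
```

The constant candidate shows why the residual term matters: it is invisible to $L_{\mathrm{IC}}$ but pays $k^2$ in $L_{\mathrm{PDE}}$.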

Try it

The widget builds a 1-32-32-1 Tanh MLP for $u_\theta(t)$ and trains it with Adam. The three loss terms can be toggled independently. Turn each on and off and see what changes:

Just $L_{\mathrm{PDE}}$. On its own, $L_{\mathrm{PDE}}$ has no unique minimiser: $u(t) = c\,e^{-kt}$ satisfies the residual equation for every constant $c$, including the trivial solution $u(t) = 0$ at $c = 0$. Without the IC nothing selects $c = 1$, and training collapses toward a flat low-amplitude curve.
Just $L_{\mathrm{IC}}$. The network learns $u(t) = 1$, a constant. It satisfies $u(0)=1$ but ignores the dynamics.
$L_{\mathrm{IC}} + L_{\mathrm{PDE}}$ (no data!). This is the reveal. Together the two terms uniquely pin down the true exponential. Watch the prediction sweep onto the true curve as training progresses. No labels needed.
Just $L_{\mathrm{data}}$. With four noisy observations on $[0, T/3]$, the MLP fits them and extrapolates badly past $T/3$: the §1.1 failure mode in miniature.
All three on. The most robust setup: data anchors the trajectory, IC fixes the starting value, PDE enforces the dynamics everywhere.
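The first and third observations in this list can be checked without training anything. The sketch below scans the one-parameter family $u(t) = c\,e^{-kt}$ and evaluates both loss terms for a few amplitudes $c$. The helper name `losses_for_amplitude` and the values of $k$, $T$, and the grid size are assumptions for illustration only.

```python
import math

k, T, n_coll = 1.5, 2.0, 20
ts = [T * j / (n_coll - 1) for j in range(n_coll)]  # collocation points on [0, T]


def losses_for_amplitude(c):
    """Return (L_IC, L_PDE) for the candidate u(t) = c * exp(-k t)."""
    u = lambda t: c * math.exp(-k * t)
    du = lambda t: -k * c * math.exp(-k * t)
    l_ic = (u(0.0) - 1.0) ** 2
    l_pde = sum((du(t) + k * u(t)) ** 2 for t in ts) / n_coll
    return l_ic, l_pde


results = {c: losses_for_amplitude(c) for c in (0.0, 0.5, 1.0, 2.0)}

for c, (l_ic, l_pde) in results.items():
    print(f"c = {c}: L_IC = {l_ic:.4f}, L_PDE = {l_pde:.2e}")
```

Every amplitude gives a PDE term that is zero up to rounding, so $L_{\mathrm{PDE}}$ alone cannot tell these curves apart. The IC term equals $(c - 1)^2$ and vanishes only at $c = 1$, which is why the two terms together pin down the true exponential.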

What the runtime is doing

For each collocation point $t_j$ the widget calls forwardDerivs([t_j]) on the MLP. That returns $u(t_j) = \hat{y}$ and $u'(t_j) = \hat{y}'$ via forward-mode AD through the network. The PDE residual $R(t_j) = \hat{y}' + k\,\hat{y}$ is computed as a plain arithmetic expression. The loss contribution is $R^2/N_c$. To get parameter gradients, we apply the chain rule: $\partial L / \partial u = 2 R k / N_c$, $\partial L / \partial u' = 2 R / N_c$. These are passed to backwardDerivs, which back-propagates through the augmented forward pass to give parameter gradients. The widget accumulates these contributions across all collocation points and adds the data and IC term gradients (regular backprop) before taking an Adam step. This is the entire mechanism. There is no other secret.
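The two chain-rule expressions above are easy to verify numerically. The sketch below computes the per-point loss contribution $R^2/N_c$ and its derivatives with respect to $u$ and $u'$, then compares them with central finite differences. The function names and the sample values of $u$, $u'$, $k$, and $N_c$ are hypothetical; this shows only the arithmetic that would be handed to a routine such as backwardDerivs, not that routine itself.

```python
def pde_loss_point(u, du, k, n_coll):
    """Loss contribution R^2 / N_c at one collocation point, with R = u' + k u."""
    r = du + k * u
    return r * r / n_coll


def pde_loss_seeds(u, du, k, n_coll):
    """Analytic (dL/du, dL/du') for the contribution above."""
    r = du + k * u
    return 2.0 * r * k / n_coll, 2.0 * r / n_coll


# Made-up network outputs at a single collocation point.
u, du, k, n_coll = 0.8, -0.5, 2.0, 10
g_u, g_du = pde_loss_seeds(u, du, k, n_coll)

# Central finite differences as an independent check.
eps = 1e-6
fd_u = (pde_loss_point(u + eps, du, k, n_coll)
        - pde_loss_point(u - eps, du, k, n_coll)) / (2 * eps)
fd_du = (pde_loss_point(u, du + eps, k, n_coll)
         - pde_loss_point(u, du - eps, k, n_coll)) / (2 * eps)
```

With these numbers $R = 1.1$, so the seeds are $2 \cdot 1.1 \cdot 2 / 10 = 0.44$ for $u$ and $2 \cdot 1.1 / 10 = 0.22$ for $u'$, and the finite differences agree.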

What you now know

A PINN is a neural network plus a loss function that includes a PDE residual term computed via automatic differentiation with respect to the network's inputs. The PDE residual gives the network a reason to extrapolate correctly that pure supervised learning lacks. The same recipe scales up to harder PDEs: the next section, §1.3, applies it to the canonical 1D Burgers PDE on a 2D domain $(x, t)$.

Pause-and-check. (1) Why does turning off the IC term and leaving only $L_{\mathrm{PDE}}$ on cause the network to drift toward a small-amplitude solution? (2) The collocation points $t_j$ used to evaluate $L_{\mathrm{PDE}}$ have no labels. What anchors the network's prediction at those points? (3) If the true ODE were $u'' + \omega^2 u = 0$ with $u(0)=1$ and $u'(0)=0$ (simple harmonic oscillator), what would the PINN loss look like?

References

Raissi, M., Perdikaris, P., Karniadakis, G.E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear PDEs. J. Comput. Phys. 378, 686–707.
Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., Yang, L. (2021). Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440.
Lu, L., Meng, X., Mao, Z., Karniadakis, G.E. (2021). DeepXDE: A deep learning library for solving differential equations. SIAM Review 63(1), 208–228.
Cuomo, S., Di Cola, V.S., Giampaolo, F., et al. (2022). Scientific machine learning through physics-informed neural networks: Where we are and what is next. J. Sci. Comput. 92(3), 88.