The PINN idea: PDE residual as a loss term

Part 1 — The PINN formulation

Learning objectives

  • State the PINN loss recipe: data fit + initial/boundary conditions + PDE residual
  • Recognise that the PDE-residual term forces correct extrapolation in a way pure supervised learning cannot
  • See concretely that any one term alone is insufficient: data alone fails to extrapolate; IC alone does not constrain shape; PDE alone admits trivial solutions
  • Compute the PINN parameter gradient by combining standard backprop (for data and IC terms) with input-derivative AD (for the PDE residual)

The fix introduced by Maziar Raissi, Paris Perdikaris, and George Karniadakis (2017–2019) is, on paper, embarrassingly simple. We add a loss term that punishes the network for violating the PDE the function is supposed to satisfy. That single change converts a function-fitter into a PDE-solver. This section sets the formulation and walks through it on the smallest possible PDE: a 1D first-order ODE.

The PINN loss recipe

For a function $u(\mathbf{x})$ that is supposed to satisfy $\mathcal{N}[u] = 0$ for some differential operator $\mathcal{N}$ on a domain $\Omega$, plus boundary conditions $\mathcal{B}[u] = 0$ on $\partial\Omega$ and possibly some sparse data $\{(\mathbf{x}_i, y_i)\}$, the PINN loss is:

$$L(\theta) = \lambda_d L_{\mathrm{data}} + \lambda_b L_{\mathrm{BC}} + \lambda_p L_{\mathrm{PDE}}$$

where each term is a mean-squared error:

$$L_{\mathrm{data}} = \frac{1}{N_d} \sum_{i=1}^{N_d} \left(u_\theta(\mathbf{x}_i) - y_i\right)^2$$

$$L_{\mathrm{BC}} = \frac{1}{N_b} \sum_{i=1}^{N_b} \left(\mathcal{B}[u_\theta](\mathbf{x}_i^{\mathrm{bnd}})\right)^2$$

$$L_{\mathrm{PDE}} = \frac{1}{N_c} \sum_{j=1}^{N_c} \left(\mathcal{N}[u_\theta](\mathbf{x}_j^{\mathrm{coll}})\right)^2$$
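As a minimal sketch of how the three terms combine, the snippet below takes lists of residuals that are assumed to have been computed already and returns the weighted sum. The names `mse` and `pinn_loss` are illustrative, not part of any library, and treating an empty list as a zero contribution (a switched-off term) is an assumption of this sketch.

```python
def mse(residuals):
    """Mean of squared residuals; an empty list contributes zero (assumption)."""
    if not residuals:
        return 0.0
    return sum(r * r for r in residuals) / len(residuals)


def pinn_loss(data_res, bc_res, pde_res, lam_d=1.0, lam_b=1.0, lam_p=1.0):
    """Weighted sum of the three mean-squared terms.

    data_res: u_theta(x_i) - y_i at the data points
    bc_res:   B[u_theta] at the boundary points
    pde_res:  N[u_theta] at the collocation points
    """
    return lam_d * mse(data_res) + lam_b * mse(bc_res) + lam_p * mse(pde_res)


# Made-up residual values, purely for illustration.
loss = pinn_loss(data_res=[0.1, -0.2], bc_res=[0.0], pde_res=[0.3, 0.1, -0.1])
print(loss)
```

Only the ratios of the weights matter for where the minimum lies; scaling all three by the same factor rescales the loss without moving its minimiser.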

The "collocation points" $\mathbf{x}_j^{\mathrm{coll}}$ are sampled from the interior of $\Omega$; they are where we are checking the PDE holds. They do not need to be data points; they can be drawn arbitrarily from the domain. The PDE residual $\mathcal{N}[u_\theta]$ is the differential operator applied to the network output, evaluated at each collocation point, using automatic differentiation through the network with respect to its inputs. This is exactly what Part 0 (§0.6–0.7) was setting up.

The PINN does standard backprop to minimise $L(\theta)$ over the network parameters. But because $L_{\mathrm{PDE}}$ involves derivatives of the network with respect to its inputs, training requires differentiating through input-derivatives. The runtime that powers this textbook (nn-runtime.js) does this exactly: forwardDerivs(x) returns $u$, $\nabla_x u$, and the diagonal Hessian $\partial^2 u / \partial x_d^2$ at one input point; backwardDerivs(...) propagates loss gradients on those quantities back to the parameters.
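To make the idea of input-derivatives concrete, here is a small Python sketch for a 1-H-1 tanh MLP with scalar input. It is not the nn-runtime.js implementation: the function name `forward_derivs`, the parameter layout, and the numeric values are assumptions chosen for illustration. The derivatives are carried forward alongside the value and then compared with finite differences.

```python
import math


def forward_derivs(params, x):
    """Return (u, du/dx, d2u/dx2) for a 1-H-1 tanh MLP at scalar input x.

    params = (w1, b1, w2, b2): w1, b1, w2 are lists of length H, b2 is a float.
    u(x) = b2 + sum_h w2[h] * tanh(w1[h] * x + b1[h])
    """
    w1, b1, w2, b2 = params
    u, du, d2u = b2, 0.0, 0.0
    for a, b, c in zip(w1, b1, w2):
        s = math.tanh(a * x + b)
        ds = 1.0 - s * s          # tanh'(z)
        d2s = -2.0 * s * ds       # tanh''(z)
        u += c * s
        du += c * ds * a          # chain rule: dz/dx = a
        d2u += c * d2s * a * a
    return u, du, d2u


# Arbitrary illustrative parameters with H = 3.
params = ([0.7, -1.3, 0.4], [0.1, 0.5, -0.2], [1.1, 0.6, -0.9], 0.05)
x, h = 0.3, 1e-4
u, du, d2u = forward_derivs(params, x)

# Central finite differences as an independent check.
u_plus = forward_derivs(params, x + h)[0]
u_minus = forward_derivs(params, x - h)[0]
fd1 = (u_plus - u_minus) / (2 * h)
fd2 = (u_plus - 2 * u + u_minus) / (h * h)
print(abs(du - fd1), abs(d2u - fd2))
```

The analytic derivatives carry no step-size error, which is why a PINN residual is normally built from automatic differentiation rather than from finite differences of the network output.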

The smallest possible PINN

To make the recipe concrete, here is the simplest possible target: the linear first-order ODE

$$u'(t) = -k\,u(t),\qquad u(0) = 1,\qquad t \in [0, T].$$

The true solution is $u(t) = e^{-kt}$, an exponential decay. The PINN loss has three parts. Define the PDE residual $R(t) = u'(t) + k\,u(t)$. Then:

  • $L_{\mathrm{data}} = \frac{1}{N_d}\sum_i (u_\theta(t_i) - y_i)^2$ (over a few sparse, possibly noisy observations; sometimes available, sometimes not)
  • $L_{\mathrm{IC}} = (u_\theta(0) - 1)^2$ (the initial condition is a single algebraic constraint at $t=0$)
  • $L_{\mathrm{PDE}} = \frac{1}{N_c}\sum_j R(t_j)^2 = \frac{1}{N_c}\sum_j (u_\theta'(t_j) + k\,u_\theta(t_j))^2$ (the PDE residual is forced to zero at each collocation point $t_j$)
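The three terms can be evaluated on any candidate function whose derivative is known. The sketch below does this for closed-form candidates rather than a network; `ode_pinn_losses`, the uniform collocation grid, and the values of `k` and `T` are assumptions made for illustration.

```python
import math


def ode_pinn_losses(u, du, k, t_colloc, data=()):
    """Return (L_data, L_IC, L_PDE) for u' = -k u, u(0) = 1.

    u, du: callables giving the candidate and its derivative
    t_colloc: collocation times; data: iterable of (t_i, y_i) pairs
    """
    data = list(data)
    l_data = sum((u(t) - y) ** 2 for t, y in data) / len(data) if data else 0.0
    l_ic = (u(0.0) - 1.0) ** 2
    l_pde = sum((du(t) + k * u(t)) ** 2 for t in t_colloc) / len(t_colloc)
    return l_data, l_ic, l_pde


k, T, n_c = 1.5, 2.0, 50
t_colloc = [T * j / (n_c - 1) for j in range(n_c)]  # uniform grid (assumption)

# True solution: every term vanishes.
exact = ode_pinn_losses(lambda t: math.exp(-k * t),
                        lambda t: -k * math.exp(-k * t), k, t_colloc,
                        data=[(0.2, math.exp(-k * 0.2))])

# Wrong decay rate 2k: the IC is still met, but the residual is nonzero.
wrong = ode_pinn_losses(lambda t: math.exp(-2 * k * t),
                        lambda t: -2 * k * math.exp(-2 * k * t), k, t_colloc)
print(exact, wrong)
```

The second candidate starts at the right value yet is penalised by the residual term at every collocation point, which is the pressure that steers the network's shape during training.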

Try it

[Interactive figure: ODE PINN widget. Enable JavaScript to interact.]

The widget builds a 1-32-32-1 Tanh MLP for $u_\theta(t)$ and trains it with Adam. The three loss terms are independently toggleable. The reveal is to turn each on and off and see what changes:

  • Just $L_{\mathrm{PDE}}$. One minimiser of $L_{\mathrm{PDE}}$ is $u(t) = 0$, the trivial solution. More generally, $u(t) = c\,e^{-kt}$ has zero residual for every constant $c$, so the PDE term alone cannot fix the amplitude; without the IC, training collapses toward a flat low-amplitude curve.
  • Just $L_{\mathrm{IC}}$. The network learns $u(t) = 1$, a constant. It satisfies $u(0)=1$ and ignores the dynamics; nothing in this term constrains the curve away from $t = 0$.
  • $L_{\mathrm{IC}} + L_{\mathrm{PDE}}$ (no data!). This is the reveal. Together the two terms uniquely pin down the true exponential. Watch the prediction sweep onto the true curve as training progresses. No labels needed.
  • Just $L_{\mathrm{data}}$. With four noisy observations on $[0, T/3]$, the MLP fits them and extrapolates badly past $T/3$: the §1.1 failure mode in miniature.
  • All three on. The most robust setup: data anchors the trajectory, IC fixes the starting value, PDE enforces the dynamics everywhere.
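The toggle outcomes can be checked by hand on two closed-form candidates. This is a worked check on idealised functions, not a statement about the path the optimiser takes:

$$
\begin{aligned}
u(t) = c\,e^{-kt}:&\quad R(t) = -kc\,e^{-kt} + kc\,e^{-kt} = 0,\quad L_{\mathrm{PDE}} = 0,\quad L_{\mathrm{IC}} = (c-1)^2,\\
u(t) \equiv 1:&\quad R(t) = 0 + k = k,\quad L_{\mathrm{PDE}} = k^2,\quad L_{\mathrm{IC}} = 0.
\end{aligned}
$$

Within the first family, $L_{\mathrm{IC}} + L_{\mathrm{PDE}}$ vanishes only for $c = 1$, while the constant candidate satisfies the IC and pays $k^2$ in the PDE term.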

What the runtime is doing

For each collocation point $t_j$ the widget calls forwardDerivs([t_j]) on the MLP. That returns $u(t_j) = \hat{y}$ and $u'(t_j) = \hat{y}'$ via forward-mode AD through the network. The PDE residual $R(t_j) = \hat{y}' + k\,\hat{y}$ is computed as a plain arithmetic expression. The loss contribution is $R^2/N_c$. To get parameter gradients, we apply the chain rule: $\partial L / \partial u = 2 R k / N_c$, $\partial L / \partial u' = 2 R / N_c$. These are passed to backwardDerivs, which back-propagates through the augmented forward pass to give parameter gradients. The widget accumulates these contributions across all collocation points and adds the data and IC term gradients (regular backprop) before taking an Adam step. This is the entire mechanism. There is no other secret.
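The mechanism can be sketched in a few lines of Python for a 1-H-1 tanh MLP. The functions `forward_derivs` and `backward_derivs` below are hypothetical stand-ins named after the runtime's forwardDerivs and backwardDerivs; their signatures, the flat parameter layout, and the numeric values are assumptions, and the data term, IC term, and Adam step are left out. A finite-difference comparison checks the accumulated gradient.

```python
import math


def unpack(theta):
    """Split flat parameters [w1..., b1..., w2..., b2] into their parts."""
    h = (len(theta) - 1) // 3
    return theta[:h], theta[h:2 * h], theta[2 * h:3 * h], theta[3 * h]


def forward_derivs(theta, t):
    """Return u(t) and u'(t) for u = b2 + sum_h w2[h] * tanh(w1[h] * t + b1[h])."""
    w1, b1, w2, b2 = unpack(theta)
    u, du = b2, 0.0
    for a, b, c in zip(w1, b1, w2):
        s = math.tanh(a * t + b)
        u += c * s
        du += c * a * (1.0 - s * s)
    return u, du


def backward_derivs(theta, t, g_u, g_du):
    """Map upstream gradients (dL/du, dL/du') to a flat parameter gradient."""
    w1, b1, w2, _ = unpack(theta)
    g_w1, g_b1, g_w2 = [], [], []
    for a, b, c in zip(w1, b1, w2):
        s = math.tanh(a * t + b)
        ds = 1.0 - s * s
        d2s = -2.0 * s * ds
        g_w1.append(g_u * c * ds * t + g_du * c * (ds + a * d2s * t))
        g_b1.append(g_u * c * ds + g_du * c * a * d2s)
        g_w2.append(g_u * s + g_du * a * ds)
    return g_w1 + g_b1 + g_w2 + [g_u]   # du/db2 = 1 and du'/db2 = 0


def pde_loss_and_grad(theta, k, t_colloc):
    """L_PDE and its gradient, accumulated over the collocation points."""
    n_c = len(t_colloc)
    loss, grad = 0.0, [0.0] * len(theta)
    for t in t_colloc:
        u, du = forward_derivs(theta, t)
        r = du + k * u                   # residual R(t_j)
        loss += r * r / n_c
        # Chain rule from the text: dL/du = 2Rk/N_c, dL/du' = 2R/N_c.
        g = backward_derivs(theta, t, 2.0 * r * k / n_c, 2.0 * r / n_c)
        grad = [acc + gi for acc, gi in zip(grad, g)]
    return loss, grad


theta = [0.7, -1.3, 0.1, 0.5, 1.1, 0.6, 0.05]   # H = 2, arbitrary values
k, t_colloc = 1.5, [0.0, 0.5, 1.0, 1.5, 2.0]
loss, grad = pde_loss_and_grad(theta, k, t_colloc)


def fd(i, eps=1e-6):
    """Central finite difference of L_PDE with respect to theta[i]."""
    up, dn = theta[:], theta[:]
    up[i] += eps
    dn[i] -= eps
    return (pde_loss_and_grad(up, k, t_colloc)[0]
            - pde_loss_and_grad(dn, k, t_colloc)[0]) / (2 * eps)


max_err = max(abs(grad[i] - fd(i)) for i in range(len(theta)))
print(max_err)
```

The backward pass needs the second derivative of tanh because the loss depends on $u'$, whose own parameter derivatives involve one more differentiation of the activation.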

What you now know

A PINN is a neural network plus a loss function that includes a PDE residual term computed via input-derivative auto-differentiation. The PDE residual gives the network a reason to extrapolate correctly that pure supervised learning lacks. The same recipe scales up to harder PDEs: the next section, §1.3, applies it to the canonical 1D Burgers PDE on a 2D domain $(x, t)$.

Pause-and-check. (1) Why does turning off the IC term and leaving only $L_{\mathrm{PDE}}$ on cause the network to drift toward a small-amplitude solution? (2) The collocation points $t_j$ used to evaluate $L_{\mathrm{PDE}}$ have no labels. What anchors the network's prediction at those points? (3) If the true ODE were $u'' + \omega^2 u = 0$ with $u(0)=1$ and $u'(0)=0$ (simple harmonic oscillator), what would the PINN loss look like?

References

  • Raissi, M., Perdikaris, P., Karniadakis, G.E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear PDEs. J. Comput. Phys. 378, 686–707.
  • Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., Yang, L. (2021). Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440.
  • Lu, L., Meng, X., Mao, Z., Karniadakis, G.E. (2021). DeepXDE: A deep learning library for solving differential equations. SIAM Review 63(1), 208–228.
  • Cuomo, S., Di Cola, V.S., Giampaolo, F., et al. (2022). Scientific machine learning through physics-informed neural networks: Where we are and what is next. J. Sci. Comput. 92(3), 88.
