SHOULD analysis on the Himmelblau hypersurface
This notebook shows a practical application of the probabilistic RMSE-reduction workflow and Theorem 3 in the reference paper. We use the Himmelblau graph as a non-linear hypersurface, estimate the probability that reconciliation reduces the point-wise RMSE for a single forecast, and then study empirical calibration over an archive of simulated forecasts.
The notebook is intentionally hands-on: it demonstrates how to build the predictive samples, how to call p_reduction and p_reduction_and_intervals, and how to compare the predicted probabilities with realized improvements in a controlled simulation.
import os
os.environ["JAX_PLATFORMS"] = "cpu"
os.environ["JAX_ENABLE_X64"] = "1"
import numpy as np
import jax
import jax.numpy as jnp
import plotly.graph_objects as go
from jnlr.reconcile import make_solver_alm_optax as make_solver
from jnlr.utils.function_utils import f_impl
from jnlr.should import p_reduction, p_reduction_and_intervals
from jnlr.stats import clopper_pearson_intervals
import jnlr.utils.manifolds as mfs
SEED = 7
rng = np.random.default_rng(SEED)
np.set_printoptions(precision=4, suppress=True)
f = f_impl(mfs.f_himmelblau)
solver = make_solver(f, n_iterations=20)
noise_scale = np.array([0.45, 0.45, 4.0])
print(f"Seed: {SEED}")
print(f"Noise scale: {noise_scale}")
Seed: 7
Noise scale: [0.45 0.45 4. ]
Problem setup
We work with the explicit Himmelblau graph
\[
z = f(x, y), \qquad f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2,
\]
and with its implicit representation built through f_impl, so that reconciliation acts on points in \(\mathbb{R}^3\). The true point lies on the graph, while the forecast is an incoherent perturbation of that point.
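For readers who want the surface written out without the library helpers, here is a minimal standalone sketch of the graph and of an implicit residual. The function graph_residual is our own illustration of the implicit form, not the object that f_impl returns:

```python
def himmelblau(x, y):
    """Himmelblau's function: four global minima, all with value 0."""
    return (x**2 + y - 11.0)**2 + (x + y**2 - 7.0)**2

def graph_residual(z):
    """Implicit form g(x, y, z) = f(x, y) - z; zero exactly on the graph."""
    x, y, h = z
    return himmelblau(x, y) - h

# A point on the graph has zero residual; perturbing z breaks coherence.
on_graph = (3.0, 2.0, himmelblau(3.0, 2.0))
off_graph = (3.0, 2.0, himmelblau(3.0, 2.0) + 4.0)
print(graph_residual(on_graph), graph_residual(off_graph))  # 0.0 -4.0
```

Reconciliation is exactly the act of moving a forecast with non-zero residual back onto the zero set of this constraint.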
xy_true = rng.uniform(-4.0, 4.0, size=(2,))
z_true = np.array([xy_true[0], xy_true[1], float(mfs.f_himmelblau(jnp.asarray(xy_true)))])
eps = rng.normal(size=(3,)) * noise_scale
z_hat = z_true + eps
single_sample_count = 256
eps_samples = rng.normal(size=(single_sample_count, 3)) * noise_scale
z_hat_samples = z_hat[None, :] + eps_samples
z_tilde = np.asarray(solver(jnp.asarray(z_hat[None, :])))[0]
e_hat = float(np.asarray(p_reduction(
solver,
jnp.asarray(z_hat[None, :]),
jnp.asarray(z_hat_samples[None, :, :]),
))[0])
z_tilde_samples = np.asarray(solver(jnp.asarray(z_hat_samples)))
print("True point z_true:", z_true)
print("Forecast z_hat:", z_hat)
print("Reconciled forecast z_tilde:", z_tilde)
print(f"Estimated probability of RMSE reduction: {e_hat:.3f}")
True point z_true: [ 1.0008 3.1777 63.3214]
Forecast z_hat: [ 0.8774 2.7769 61.5027]
Reconciled forecast z_tilde: [ 0.716 2.7859 61.4942]
Estimated probability of RMSE reduction: 0.391
Single-forecast geometry
The figure below shows the true point on the Himmelblau surface, the incoherent forecast, the reconciled forecast, and the predictive samples before and after projection. This is the geometric picture behind the probability estimate returned by p_reduction.
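The probability estimate has a simple Monte Carlo reading: treat each predictive sample as a candidate truth and count how often the reconciled point is closer to it than the raw forecast. The sketch below illustrates that idea on a plane, where the projection is closed-form; it is an assumption about what p_reduction computes internally, not its actual implementation:

```python
import random

# Plane x + y - z = 0 with normal n = (1, 1, -1): projection is closed-form.
N = (1.0, 1.0, -1.0)
NN = sum(c * c for c in N)  # ||n||^2

def project(p):
    """Orthogonal projection onto the plane n . p = 0."""
    t = sum(c * q for c, q in zip(N, p)) / NN
    return tuple(q - t * c for q, c in zip(p, N))

def p_reduction_mc(z_hat, samples):
    """Fraction of candidate truths for which reconciliation reduces the error."""
    z_tilde = project(z_hat)
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    wins = sum(d2(z_tilde, s) < d2(z_hat, s) for s in samples)
    return wins / len(samples)

random.seed(0)
z_hat = (1.0, 1.0, 5.0)  # incoherent: residual 1 + 1 - 5 = -3
samples = [tuple(c + random.gauss(0.0, 1.0) for c in z_hat) for _ in range(2000)]
print(p_reduction_mc(z_hat, samples))
```

Because the candidate truths here are centered on the forecast itself, moving away from z_hat can easily hurt, so estimates well below one half (like the 0.391 above) are entirely plausible.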
x = np.linspace(-4.0, 4.0, 45)
y = np.linspace(-4.0, 4.0, 45)
xx, yy = np.meshgrid(x, y)
surface_points = np.column_stack([xx.ravel(), yy.ravel()])
zz = np.asarray(jax.vmap(mfs.f_himmelblau)(jnp.asarray(surface_points))).reshape(xx.shape)
fig = go.Figure()
fig.add_trace(go.Surface(
x=xx,
y=yy,
z=zz,
opacity=0.55,
colorscale="Viridis",
showscale=False,
name="Himmelblau surface",
))
fig.add_trace(go.Scatter3d(
x=z_hat_samples[:, 0],
y=z_hat_samples[:, 1],
z=z_hat_samples[:, 2],
mode="markers",
marker=dict(size=3, color="#d95f02", opacity=0.35),
name="Predictive samples",
))
fig.add_trace(go.Scatter3d(
x=z_tilde_samples[:, 0],
y=z_tilde_samples[:, 1],
z=z_tilde_samples[:, 2],
mode="markers",
marker=dict(size=3, color="#1b9e77", opacity=0.35),
name="Reconciled samples",
))
fig.add_trace(go.Scatter3d(
x=[z_true[0]], y=[z_true[1]], z=[z_true[2]],
mode="markers",
marker=dict(size=8, color="#2c7fb8", symbol="diamond"),
name="True point",
))
fig.add_trace(go.Scatter3d(
x=[z_hat[0]], y=[z_hat[1]], z=[z_hat[2]],
mode="markers+text",
text=["z_hat"],
textposition="top center",
marker=dict(size=7, color="#d95f02"),
name="Forecast",
))
fig.add_trace(go.Scatter3d(
x=[z_tilde[0]], y=[z_tilde[1]], z=[z_tilde[2]],
mode="markers+text",
text=["z_tilde"],
textposition="top center",
marker=dict(size=7, color="#1b9e77"),
name="Reconciled forecast",
))
fig.add_trace(go.Scatter3d(
x=[z_hat[0], z_tilde[0]],
y=[z_hat[1], z_tilde[1]],
z=[z_hat[2], z_tilde[2]],
mode="lines",
line=dict(color="black", width=6),
name="delta_pi",
))
fig.update_layout(
title="Single forecast and predictive samples on the Himmelblau hypersurface",
scene=dict(
xaxis_title="x",
yaxis_title="y",
zaxis_title="z",
camera=dict(eye=dict(x=1.6, y=1.5, z=0.9)),
),
width=980,
height=760,
)
fig
Archive calibration experiment
We now simulate an archive of forecasts. For each scenario we sample a true point on the manifold, perturb it to obtain an incoherent forecast, build predictive samples around that forecast, estimate the reduction probability with p_reduction, and compare that estimate with the realized event \(y_i = \mathbf{1}\{\lVert \tilde z_i - z_i \rVert < \lVert \hat z_i - z_i \rVert\}\), where \(z_i\) is the true point, \(\hat z_i\) the raw forecast, and \(\tilde z_i\) the reconciled forecast.
This produces the pairs \((e_i, y_i)\) used in the calibration analysis of Theorem 3. To keep the uncertainty bands readable, we summarize the archive with equal-mass bins rather than fixed-width bins: each bin contains a comparable number of scenarios, so the Clopper-Pearson intervals are driven less by sparse tails.
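The uncertainty bands use jnlr.stats.clopper_pearson_intervals. For reference, the textbook Clopper-Pearson bounds need nothing beyond the exact binomial CDF; the standalone sketch below solves for them by bisection and may differ from the library's implementation in edge-case conventions:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def _bisect(f, target, increasing, iters=80):
    """Solve f(p) = target on [0, 1], given the monotonicity of f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha=0.05):
    """Exact (conservative) two-sided 1 - alpha interval for a binomial proportion."""
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper
```

The lower bound solves \(P(X \ge k \mid p) = \alpha/2\) and the upper bound \(P(X \le k \mid p) = \alpha/2\), which is why the interval never undercovers even for small bins.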
archive_size = 240
archive_sample_count = 256
archive_bin_count = 6
xy_archive = rng.uniform(-4.0, 4.0, size=(archive_size, 2))
z_true_archive = np.column_stack([
xy_archive,
np.asarray(jax.vmap(mfs.f_himmelblau)(jnp.asarray(xy_archive)))
])
archive_noise = rng.normal(size=(archive_size, 3)) * noise_scale
z_hat_archive = z_true_archive + archive_noise
archive_sample_noise = rng.normal(size=(archive_size, archive_sample_count, 3)) * noise_scale
z_hat_archive_samples = z_hat_archive[:, None, :] + archive_sample_noise
archive_e = np.asarray(p_reduction(
solver,
jnp.asarray(z_hat_archive),
jnp.asarray(z_hat_archive_samples),
))
z_tilde_archive = np.asarray(solver(jnp.asarray(z_hat_archive)))
pre_error = np.sum((z_hat_archive - z_true_archive) ** 2, axis=1)
post_error = np.sum((z_tilde_archive - z_true_archive) ** 2, axis=1)
archive_y = (post_error < pre_error).astype(int)
print(f"Archive size: {archive_size}")
print(f"Samples per forecast: {archive_sample_count}")
print(f"Equal-mass calibration bins: {archive_bin_count}")
print(f"Mean predicted reduction probability: {np.nanmean(archive_e):.3f}")
print(f"Observed fraction of reductions: {archive_y.mean():.3f}")
print(f"Probability range: [{np.nanmin(archive_e):.3f}, {np.nanmax(archive_e):.3f}]")
Archive size: 240
Samples per forecast: 256
Equal-mass calibration bins: 6
Mean predicted reduction probability: 0.818
Observed fraction of reductions: 0.817
Probability range: [0.008, 1.000]
valid_idx = np.flatnonzero(np.isfinite(archive_e))
sorted_idx = valid_idx[np.argsort(archive_e[valid_idx])]
bin_slices = [chunk for chunk in np.array_split(sorted_idx, archive_bin_count) if len(chunk)]
records = []
for bin_id, bin_idx in enumerate(bin_slices, start=1):
    e_bin = archive_e[bin_idx]
    y_bin = archive_y[bin_idx]
    n_bin = int(len(bin_idx))
    k_bin = int(y_bin.sum())
    empirical = k_bin / n_bin
    lower, upper = clopper_pearson_intervals(k_bin, n_bin, alpha=0.05)
    records.append({
        "bin_id": bin_id,
        "n": n_bin,
        "k": k_bin,
        "empirical": empirical,
        "lower": float(lower),
        "upper": float(upper),
        "mean_prediction": float(np.mean(e_bin)),
        "min_prediction": float(np.min(e_bin)),
        "max_prediction": float(np.max(e_bin)),
    })
for record in records:
    print(
        f"bin {record['bin_id']}: e in [{record['min_prediction']:.3f}, {record['max_prediction']:.3f}], "
        f"n={record['n']:3d}, mean e={record['mean_prediction']:.3f}, empirical={record['empirical']:.3f}, "
        f"CI=[{record['lower']:.3f}, {record['upper']:.3f}]"
    )
bin 1: e in [0.008, 0.562], n= 40, mean e=0.395, empirical=0.325, CI=[0.186, 0.491]
bin 2: e in [0.566, 0.769], n= 40, mean e=0.669, empirical=0.675, CI=[0.509, 0.814]
bin 3: e in [0.769, 0.941], n= 40, mean e=0.868, empirical=0.900, CI=[0.763, 0.972]
bin 4: e in [0.941, 0.996], n= 40, mean e=0.976, empirical=1.000, CI=[0.914, 1.000]
bin 5: e in [0.996, 1.000], n= 40, mean e=0.999, empirical=1.000, CI=[0.914, 1.000]
bin 6: e in [1.000, 1.000], n= 40, mean e=1.000, empirical=1.000, CI=[0.914, 1.000]
Calibration plot
If the estimated probabilities were perfectly calibrated, the bin-wise empirical success rate would lie close to the diagonal. The points below come from equal-mass bins, so each interval is supported by a comparable number of archive elements and the uncertainty bars are more interpretable than with sparse fixed-width bins.
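As a scalar companion to the plot, the same equal-mass bins can be collapsed into an expected-calibration-error style summary: the count-weighted average gap between each bin's mean prediction and its empirical frequency. A self-contained sketch on synthetic, perfectly calibrated pairs (in practice the archive's own \((e_i, y_i)\) would be substituted):

```python
import random

def equal_mass_ece(e, y, n_bins=6):
    """Count-weighted |mean prediction - empirical frequency| over equal-mass bins."""
    pairs = sorted(zip(e, y))
    n = len(pairs)
    edges = [round(b * n / n_bins) for b in range(n_bins + 1)]
    ece = 0.0
    for lo, hi in zip(edges, edges[1:]):
        chunk = pairs[lo:hi]
        mean_e = sum(p for p, _ in chunk) / len(chunk)
        freq = sum(o for _, o in chunk) / len(chunk)
        ece += len(chunk) / n * abs(mean_e - freq)
    return ece

random.seed(1)
e = [random.random() for _ in range(1200)]
y = [1 if random.random() < p else 0 for p in e]  # calibrated by construction
print(f"{equal_mass_ece(e, y):.3f}")
```

For calibrated predictions this value shrinks toward zero as the archive grows; a persistently large value flags the same departures from the diagonal that the plot shows visually.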
centers = np.array([r["mean_prediction"] for r in records])
empirical = np.array([r["empirical"] for r in records])
lower = np.array([r["lower"] for r in records])
upper = np.array([r["upper"] for r in records])
counts = np.array([r["n"] for r in records])
cal_fig = go.Figure()
cal_fig.add_trace(go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
line=dict(color="black", dash="dash"),
name="Ideal calibration",
))
cal_fig.add_trace(go.Scatter(
x=centers,
y=empirical,
mode="markers+text",
text=[f"n={n}" for n in counts],
textposition="top center",
marker=dict(size=10, color="#1f78b4"),
error_y=dict(
type="data",
symmetric=False,
array=upper - empirical,
arrayminus=empirical - lower,
thickness=1.5,
width=5,
),
name="Observed calibration",
))
cal_fig.update_layout(
title="Calibration of the Monte Carlo RMSE-reduction estimate",
xaxis_title="Predicted probability of RMSE reduction",
yaxis_title="Observed frequency of RMSE reduction",
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
width=900,
height=600,
)
cal_fig
Interpretation
For a single forecast, p_reduction and p_reduction_and_intervals provide a practical estimate of how likely reconciliation is to improve the point-wise RMSE under the predictive distribution used to generate samples. Over an archive of forecasts, the calibration plot compares those predicted probabilities with realized improvements and adds finite-sample uncertainty bars through Clopper-Pearson intervals.
This notebook therefore illustrates a practical workflow around the probabilistic theorem: it does not prove the theorem, but it shows how to operationalize the estimate, collect an archive of \((e_i, y_i)\) pairs, and inspect whether the predicted probabilities behave coherently in simulation.