If you see this, something is wrong
First published on Wednesday, Jun 17, 2026 and last modified on Wednesday, Jun 17, 2026 by François Chaplais.
WebMagic
Automatic regression from a set of input columns to a set of output columns, using a nested ladder of penalized spline models, with cross-validated model selection.
Having nested ladders of models means that, if a model belongs to the contained class of models, it is also belongs to the containing class.
Given a CSV data table with \( N\) rows and several numeric columns, you designate some columns as inputs \( X \in \mathbb{R}^{N \times m}\) and some as outputs \( Y \in \mathbb{R}^{N \times n}\) . The tool fits a family of models
and helps you select the most accurate model that is still numerically robust. Each output is predicted independently by the same input space:
with each \( f_j : \mathbb{R}^m \to \mathbb{R}\) fitted by penalized least squares.
The input and output columns are selected by the user in a first phase. Only then can modelization be performed.
Only numeric rows that pass the active search filter are used.
Cross-validation splits the data into \( k = 5\) consecutive blocks (folds) based on the current row order. If your data has a time structure (daily measurements, for instance), the folds follow time: folds 1–4 are trained and fold 5 is validated on the last 20
If the rows are sorted by one of the inputs, the folds will have systematic differences in input range, which can inflate or deflate CV error estimates. For most tabular data with no natural ordering, row order is arbitrary and the results are stable.
Before any model fitting, inputs and outputs are standardized column-wise to zero mean and unit variance:
Means and standard deviations are estimated on the training fold only inside each cross-validation fold, to avoid data leakage. The same scaling is applied to the validation fold before scoring.
Standardization serves two purposes:
Nonlinear regression is implemented by mapping each standardized input through a B-spline basis, converting a nonlinear problem into ordinary linear regression in the transformed space. B-splines have two key advantages over polynomial bases:
A 1-D spline space is defined by a knot vector and a polynomial degree \( d\) . The tool uses \( d = 3\) (cubic splines) throughout.
K — the mesh parameter is the number of interior knots per spline term. It is user-configurable (default 4, range 2–20) and controls how finely the input domain is divided. For \( K\) interior knots \( \xi_1 < \xi_2 < \dots < \xi_K\) placed at quantiles of the training data, the full extended knot vector has the boundary values repeated \( d+1\) times:
This gives a spline space of dimension
For the default \( K = 4\) , each input has \( B = 8\) basis functions. For the refined basis (M3), one midpoint per span is inserted, giving \( K' = 2K - 1\) interior knots and \( B' = 2K + 3\) basis functions (e.g. \( K=4 \to K'=7, B'=11\) ; \( K=8 \to K'=15, B'=19\) ).
Basis function \( N_{i,d}(x)\) is defined recursively:
with the convention \( 0/0 = 0\) . The vector \( \phi(x) = (N_{1,d}(x), \dots, N_{B,d}(x))\) is the feature vector for scalar input \( x\) .
An important property is the partition of unity:
This ensures the intercept is not multicollinear with the basis functions.
Interior knots are placed at empirical quantiles of the training data. For \( K\) knots, they are placed at the \( \frac{i}{K+1} \times 100\) percentiles for \( i = 1, \dots, K\) , distributing basis support according to data density. For the default \( K = 4\) this gives the 20th, 40th, 60th, and 80th percentiles. The boundary knots are set at the minimum and maximum training values; inputs outside this range are clamped at prediction time.
Knot placement is recomputed on the training fold only inside each cross-validation fold, preventing data leakage from validation rows into the spline mesh.
K is the spatial resolution of the spline mesh. Each interior knot marks a breakpoint where the polynomial pieces are joined. Between two consecutive knots, the spline behaves like a degree-3 polynomial; K determines how many such segments cover the input range.
The minimum feature size the model can resolve — the shortest oscillation half-period, or the narrowest bump — is roughly
A signal with \( n_c\) full oscillation cycles over the data range needs at least \( K \approx 2 n_c\) interior knots to be faithfully represented. For \( K = 4\) the model can capture up to 2 full cycles per input; for \( K = 8\) , up to 4 cycles.
Changing K affects every spline-based rung simultaneously:
| Rung | Basis functions per input | Total P (m inputs) |
| M2 | \( B = K + 4\) | \( 1 + Bm\) |
| M3 (refined) | \( B' = 2K + 3\) | \( 1 + B'm\) |
| M4 interaction block | — | \( P_{M3} + K^2\) |
| M5 interaction block | — | \( P_{M4} + K^2\) |
Two consequences stand out:
The total parameter count \( P\) must stay well below \( N\) for cross-validated estimation to be meaningful. A practical guideline is \( P \lesssim N/5\) :
| K | B (M2) | P, M2, m=3 | P, M2, m=5 | P, M4, m=5 |
| 2 | 6 | 19 | 31 | 35 |
| 4 | 8 | 25 | 41 | 57 |
| 8 | 12 | 37 | 61 | 125 |
| 12 | 16 | 49 | 81 | 225 |
| 16 | 20 | 61 | 101 | 357 |
For M4 and M5 with large K, P can grow very quickly. Always check the N column in the ranking table against the EDF column — if \( \text{EDF} \ll P\) , the regularizer has already compressed the model and K is larger than the data supports.
The EDF column in the ranking table is the key signal:
On the scatter plot, representational underfitting looks like all dots clustered near CV RMSE = 1 even at high EDF — indistinguishable from confounding by the chart alone, but distinguishable by looking at EDF: underfitting has EDF ≈ P (the model is using all its capacity), whereas confounding has EDF ≈ 1 (the regularizer has shut the model down).
For an additive model with \( m\) inputs, each input \( r \in \{1,\dots,m\}\) gets its own \( B_r\) -dimensional B-spline basis. The design matrix for \( N\) training rows is
where \( \Phi_r \in \mathbb{R}^{N \times B_r}\) is the spline design block for variable \( r\) , and the leading column of 1s is the intercept. The total parameter count per output is
Coefficients \( C \in \mathbb{R}^{P \times n}\) (one column per output) are found by minimizing the penalized residual sum of squares:
where \( \lambda > 0\) is the regularization strength and \( \Omega = \mathrm{diag}(0, 1, \dots, 1)\) leaves the intercept unpenalized and equally penalizes all basis coefficients. The closed-form solution satisfies the normal equations:
This is solved via SVD for numerical robustness.
The regularization parameter is selected from a grid
by 5-fold cross-validation. The \( \lambda\) minimizing mean CV RMSE is used for the final refit on all data.
The \( N\) rows are split into \( k = 5\) folds of equal size (positional: first \( N/5\) rows → fold 1, etc.). For each fold \( f\) :
Mean and standard deviation of CV RMSE across the 5 folds are returned for each \( \lambda\) .
All RMSE values are reported in standardized output space (each output has mean 0, std 1 on training data). This makes scores directly comparable across different outputs:
| CV RMSE | Interpretation |
| \( \approx 1.0\) | No skill — model does no better than predicting the training mean |
| \( < 0.5\) | Reasonable predictive skill |
| \( < 0.1\) | Strong fit |
A CV RMSE near 1.0 means the inputs carry little information about the outputs (or the relationship is dominated by variation not captured by the inputs).
The condition number of the penalized Gram matrix
is
the ratio of the largest to smallest singular value. It measures numerical ill-conditioning of the fit.
| \( \kappa\) | Interpretation |
| \( < 10^4\) | Very well-conditioned |
| \( 10^4\) to \( 10^8\) | Acceptable |
| \( > 10^8\) | Disqualified — coefficient estimates are numerically unstable |
A large \( \kappa\) indicates near-linear dependency among the columns of \( A\) , which occurs when:
Increasing \( \lambda\) directly reduces \( \kappa\) because \( \Omega\) adds a positive-definite stabilizing term.
Branch A fits in the original (standardized) input space, climbing a nested sequence of model classes M0 → M1 → … → M5.
Nesting means every model is a special case of the next: \( \mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}_5\) . This guarantees that training RMSE cannot increase as you climb.
\( P = 1\) . The model predicts the training mean of each output. CV RMSE ≈ 1.0 by construction (predicting the mean on held-out data in standardized space gives RMSE ≈ 1). Serves as the baseline.
\( P = 1 + m\) . Captures linear relationships between inputs and outputs. The design matrix has one column per input plus the intercept.
with \( B = K + 4\) basis functions per input (\( K\) interior knots, degree 3). \( P = 1 + (K+4)m\) . For the default \( K = 4\) this gives \( B = 8\) and \( P = 1 + 8m\) . Captures smooth nonlinear main effects but no interaction between inputs. Each input is modeled independently by a cubic spline.
Same structure as M2 but the knot vector of each term is refined by inserting one midpoint per span, giving \( K' = 2K - 1\) interior knots and \( B' = 2K + 3\) basis functions per input.
\( P = 1 + (2K+3)m\) . For the default \( K = 4\) this gives \( K' = 7\) , \( B' = 11\) , and \( P = 1 + 11m\) .
The refined knots are a strict superset of the coarse knots (nested spaces: \( \mathcal{F}_{M2} \subseteq \mathcal{F}_{M3}\) ).
Extends M3 by adding a tensor-product spline block for one pair of inputs \( (r, s)\) :
where \( \otimes\) denotes the outer product of the two \( K\) -dimensional coarse feature vectors, giving \( K^2\) additional columns. For the default \( K = 4\) this is 16 columns.
\( P_{M4} = P_{M3} + K^2\) .
The pair \( (r, s)\) is selected greedily: all candidate pairs are evaluated on M3 residuals with a fixed \( \lambda = 1\) , and the pair giving the largest residual reduction is chosen.
Adds a second tensor-product block for a second greedily-selected pair \( (r', s') \ne (r, s)\) .
\( P_{M5} = P_{M4} + K^2\) .
For \( m = 1\) , the ladder stops at M3 (no pairs exist). M5 requires \( m \geq 3\) .
Parameter count formulas (m inputs, K interior knots):
| Model | P (general) | P (m=5, K=4) | P (m=5, K=8) |
| M0 | \( 1\) | 1 | 1 |
| M1 | \( 1 + m\) | 6 | 6 |
| M2 | \( 1 + (K+4)m\) | 41 | 61 |
| M3 | \( 1 + (2K+3)m\) | 56 | 86 |
| M4 | \( P_{M3} + K^2\) | 72 | 150 |
| M5 | \( P_{M4} + K^2\) | 88 | 214 |
Increasing K from 4 to 8 raises M2 from 41 to 61 parameters (for m=5), and causes interaction blocks to grow from 16 to 64 columns — always check that \( P \lesssim N/5\) before running with a large K.
Branch B projects the inputs onto their principal components (PCs) before fitting, then runs similar spline models on the reduced coordinate space.
Given standardized inputs \( \tilde X \in \mathbb{R}^{N \times m}\) , the thin SVD gives
where the columns of \( V \in \mathbb{R}^{m \times m}\) are the PC directions (sorted by descending singular value). The first \( q\) PC scores are
The fraction of variance explained by the first \( q\) PCs is \( \sum_{i=1}^q \sigma_i^2 / \sum_i \sigma_i^2\) .
PCA is refit inside each cross-validation fold on the training rows only, then applied to validation rows. This avoids leakage of global variance structure into the validation estimates.
For each number of PCs \( q \in \{1, 2, 3\}\) (up to \( \min(m,3)\) ), the following models are evaluated:
| Model | Description | P (q PCs, K=4) |
| B\( q\) -lin | Linear on \( q\) PCs | \( 1 + q\) |
| B\( q\) -spl | Additive spline on \( q\) PCs | \( 1 + 8q\) |
| B\( q\) -int\( _1\) | Spline + PC\( _0 \times\) PC\( _1\) interaction | \( 1 + 11q + 16\) |
| B\( q\) -int\( _2\) | Spline + 2 interactions | \( 1 + 11q + 32\) |
Interaction entries appear only for \( q \geq 2\) . Interaction pairs are fixed as \( (\text{PC}_0, \text{PC}_1)\) , \( (\text{PC}_0, \text{PC}_2)\) , … — always anchored to the highest-variance PCs.
Total Branch B entries: \( q_{\max}(q_{\max}+3)/2\) (5 for \( q_{\max}=2\) , 9 for \( q_{\max}=3\) ).
Branch B is useful when:
It will generally perform similarly to Branch A when the inputs are already near-independent, but may outperform Branch A when there is strong multicollinearity.
After running rungs, the recommended model is the simplest rung satisfying both:
This deliberately favors simpler models: if M1 and M3 give nearly the same CV RMSE, M1 is chosen because it has fewer parameters and is more likely to generalize.
The scatter plot displays every run rung as a dot with coordinates:
The ideal corner is top-left: perfect accuracy (\( \text{cvRmse} \to 0\) ) with perfect conditioning (\( \log_{10}\kappa \to 0\) ).
Two reference lines mark practical limits:
The axes are anchored at these absolute thresholds (not auto-scaled), so the chart is visually stable as you add more rungs.
To rank rungs, each point is mapped to a normalized distance from the ideal origin:
This combines accuracy and robustness on a common scale. Both thresholds are physically meaningful:
The rung with the smallest \( d\) is ranked first.
Climbing the ladder increases the nominal parameter count \( P\) , but the ridge regularizer prevents all \( P\) parameters from being active. The effective degrees of freedom (EDF, shown in the ranking table) is the actual complexity the fit exercises:
At \( \lambda = 0\) the model uses all \( P\) parameters (EDF = P); as \( \lambda \to \infty\) the model collapses to the intercept (EDF = 1).
What low EDF means in practice:
The tradeoff in a nutshell. Adding more basis functions (going from M1 to M2, M3, …) gives the model capacity to represent complex shapes, but:
The scatter plot makes this visible: climbing the ladder on data with a weak signal moves points right and up — worse accuracy and worse conditioning — without reducing EDF meaningfully. The right response is not to add more model complexity, but to reconsider the inputs (see §12.6).
Once a model has been run, it can be exported in five formats. All exports are self-contained and require no external library.
| Format | File | Use case |
| JSON |
model.json
|
Model archive, custom loaders |
| JavaScript |
model.js
|
Browser or Node, drop-in predict() function
|
| Python |
model.py
|
Pure Python ≥ 3.8, no NumPy required |
| C |
model.c
|
Embedded systems, ANSI C89/C99 |
| R |
model.R
|
Base R, no packages required |
Each export embeds:
The prediction flow in all formats is:
A CV RMSE near 1.0 across all models means the inputs have little predictive power over the outputs given the current data. This can happen because:
Solution: deseasonalize both inputs and outputs by subtracting a fitted seasonal mean (e.g., monthly or weekly averages) before running the modelization. The residuals will then expose the within-cycle correlation.
A high \( \kappa\) for a given model class usually means either:
Increasing \( \lambda\) (which the λ grid search handles automatically) will reduce \( \kappa\) by adding a stabilizing diagonal to \( G_\lambda\) . If even the maximum \( \lambda = 100\) leaves \( \kappa > 10^8\) , the model class is too complex for the available data — try a simpler rung.
For time-series data (e.g., daily measurements), keep the rows in chronological order. The 5-fold split will then validate the last 20
A model can be accurate without being meaningful, and meaningful without being accurate. The CV RMSE measures only predictive accuracy; it says nothing about whether the learned relationship reflects a genuine causal or physical link between inputs and outputs.
Confounding is the most common source of misleading accuracy. If both an input and an output are driven by a third variable that is not in the model, the fit will capture the shared variation due to that third variable rather than the direct relationship. The model will appear to predict well in-sample but will fail whenever the confounding variable behaves differently.
Conversely, a genuine physical relationship between two variables may be weak in the data if both are dominated by a shared confounding cycle. In that case CV RMSE stays near 1.0 across all model classes — the scatter plot shows all dots near the right edge — and EDF collapses to 1–2 regardless of which rung you run. This is not a failure of the fitting algorithm; it is a correct signal that the inputs, as presented, carry little information about the outputs beyond what the confounder already explains.
Recognizing confounding:
Remedies:
In all cases, the goal is to present the model with inputs and outputs whose shared variation is primarily due to the relationship you wish to study, not to a third variable driving both.
If the DataTable search filter is active, only the visible rows are used for fitting. This can be used deliberately (e.g., restrict fitting to a season or experimental condition), but the resulting model is only valid for inputs in that filtered range.
| Symbol | Meaning |
| \( N\) | Number of data rows |
| \( m\) | Number of input columns |
| \( n\) | Number of output columns |
| \( K\) | Number of interior knots per spline term — user-configurable (default 4, range 2–20) |
| \( K'\) | Refined knot count after one midpoint insertion: \( K' = 2K-1\) |
| \( B\) | Basis functions per coarse spline term: \( B = K + 4\) |
| \( B'\) | Basis functions per refined spline term: \( B' = 2K + 3\) |
| \( P\) | Total number of model parameters per output |
| \( A \in \mathbb{R}^{N \times P}\) | Design matrix |
| \( C \in \mathbb{R}^{P \times n}\) | Coefficient matrix |
| \( \lambda\) | Ridge regularization parameter |
| \( \Omega\) | Ridge penalty matrix: \( \mathrm{diag}(0,1,\dots,1)\) |
| \( \kappa\) | Condition number of \( A^\top A + \lambda\Omega\) |
| \( \text{edf}\) | Effective degrees of freedom: \( \mathrm{tr}(H) = P - \lambda\,\mathrm{tr}(G_\lambda^{-1}\Omega)\) |
| \( \text{cvRmse}\) | Mean cross-validated RMSE in standardized output space |
| \( q\) | Number of principal components (Branch B) |
| \( V\) | PCA loading matrix (columns = PC directions) |