A model trained only on policies you chose to bind cannot tell you which submissions you should have declined. The data you never accepted is precisely the data that would teach the model about adverse selection — and by excluding it, you guarantee the model will systematically underweight risk at the tail where your losses concentrate.
This is not a theoretical concern. It is survivorship bias encoded directly into your feature weights, and it explains why so many first-generation underwriting AI systems perform well on in-sample backtests but produce disappointing loss ratios in production.
The Mechanics of Selection Bias in Underwriting Data
When an underwriter declines a submission, that decision rarely enters any structured data system. The ACORD 125 form might be logged in your submission management platform, but the decline reason, the specific risk factors that drove the decision, and the subsequent claim history of that account (which presumably went elsewhere at a lower premium) are lost. What remains in your training set is a filtered sample: accounts that cleared your underwriters' informal criteria, priced at whatever rate the underwriter felt was adequate, and retained for at least one policy period.
The statistical consequence is a biased estimate of loss costs. If your book systematically declined accounts with three or more prior losses in the ACORD 125 supplemental, your model never learned that three prior losses are a meaningful predictor. It learned, at best, that one or two prior losses are correlated with elevated severity, because that is the most adverse data in your training set. The model cannot extrapolate to the behavior it was never trained to recognize.
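To see the extrapolation failure concretely, consider a toy sketch. The quadratic severity curve and the numbers are assumptions chosen purely for illustration, not carrier data:

```python
import numpy as np

# Synthetic example: assume severity grows nonlinearly with prior loss count.
prior_losses = np.array([0, 1, 2])      # what a filtered (bound-only) book contains
severity = 10.0 * prior_losses ** 2     # assumed "true" convex relationship

# Fit a line on the censored range, as a model trained on bound business would.
slope, intercept = np.polyfit(prior_losses, severity, deg=1)

# Extrapolate to an account with 5 prior losses, outside the training range.
predicted = slope * 5 + intercept       # ~96.7
actual = 10.0 * 5 ** 2                  # 250.0
print(f"predicted {predicted:.1f} vs actual {actual:.1f}")
```

The linear fit is the best the model can do on the data it saw, and it understates tail severity by more than half, which is exactly the shape of the production surprise the article describes.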
In actuarial terms, this is analogous to the problem that makes loss development factors (LDFs) unstable on thin books: the denominator is wrong because the universe of exposure is not the universe you observe.
Why the Problem Is Worse for Commercial Lines Than Personal
Personal auto and homeowners carriers benefit from regulatory and market structure: most submissions in a state are filed at roughly the same rate, acceptance criteria are relatively standardized, and the ISO classification system limits the range of risk variation within a rating territory. The selection effect exists, but it is bounded.
Commercial lines P&C has no equivalent constraint. A commercial general liability submission for a roofing contractor in a PCS wind zone with three years of loss runs showing two claims above $50K will be declined by some carriers and bound by others at premiums that vary by 40% or more. That variation in selection behavior means the bound data at any one carrier tells a highly idiosyncratic story about which risks are "acceptable," not which risks are actually loss-prone.
A regional carrier writing commercial property in the Midwest is not selecting from the same effective universe as a Lloyd's syndicate writing the same SIC codes. Training on one carrier's bound data and expecting the model to generalize is a category error.
The Declined Submission Problem: What You're Missing
The most rigorous fix is to reintroduce declined submissions into model training with appropriate labels. If you log declined submissions in your system — even a minimal record of the ACORD data and the decline reason — you can treat these as negative examples: accounts the underwriter judged too risky to bind at any viable premium. Feeding these into the training process alongside bound policies gives the model exposure to the tail of the risk distribution it needs to classify.
This requires consistent decline logging, which most carriers do not do well. In practice, underwriters record declines in an unstructured note field if they record them at all. Extracting usable features from that data requires natural language processing against a relatively small, noisy corpus — a solvable problem, but not a trivial one.
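A minimal sketch of that extraction step, assuming a hypothetical keyword taxonomy. The categories and patterns below are invented for illustration; a production system would need a richer NLP pipeline and a reviewed taxonomy:

```python
import re

# Hypothetical decline-reason taxonomy (illustrative, not a standard).
DECLINE_PATTERNS = {
    "loss_history": re.compile(r"prior loss|loss run|frequency", re.I),
    "class_appetite": re.compile(r"out of appetite|roofing|class code", re.I),
    "cat_exposure": re.compile(r"wind|coastal|flood zone", re.I),
}

def tag_decline_note(note: str) -> list[str]:
    """Return the decline-reason categories whose keywords appear in a note."""
    return [reason for reason, pat in DECLINE_PATTERNS.items() if pat.search(note)]

tags = tag_decline_note("Declined: two prior losses on the loss runs, coastal wind exposure.")
print(tags)  # ['loss_history', 'cat_exposure']
```

Even this crude tagging turns free-text notes into categorical features that can label declined submissions for training.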
An alternative approach: use public loss data and ISO rating bureau statistics to anchor the model's priors on exposure classes you rarely write. If your book has limited experience in SIC industry group 737 (computer programming and data processing), ISO loss costs for that class provide an external reference point that prevents the model from effectively ignoring that segment.
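A sketch of that anchoring idea using the classical square-root credibility blend. The 1,082-claim full-credibility standard is the textbook value for claim frequency at 90% confidence within ±5%; the loss costs themselves are hypothetical:

```python
def blended_loss_cost(own_loss_cost, iso_loss_cost, claim_count, full_cred_claims=1082):
    """Classical square-root credibility blend of in-house and bureau loss costs.

    Z = min(1, sqrt(n / n_full)); thin classes lean on the ISO benchmark,
    mature classes lean on own experience.
    """
    z = min(1.0, (claim_count / full_cred_claims) ** 0.5)
    return z * own_loss_cost + (1.0 - z) * iso_loss_cost

# A thin class: 40 claims of experience, so the ISO benchmark dominates.
print(round(blended_loss_cost(own_loss_cost=1.80, iso_loss_cost=2.60, claim_count=40), 3))
```

With only 40 claims, Z is about 0.19, so the blended estimate sits much closer to the bureau figure than to the carrier's own thin experience.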
Truncated Experience: The Companion Problem
Selection bias has a companion problem in bound books: truncated loss experience. Policies bound in the last 12 to 18 months have not yet developed to ultimate. The claims triage workflow may have closed the small-dollar losses, but the large bodily injury and property damage claims are still open, with reserves that understate the probable final indemnity amount.
A model that treats recent policy year data as equivalent to mature accident year data will underestimate the loss cost on recent business. The credibility-weighted approach standard in actuarial LDF analysis — applying age-to-age development factors from prior accident years to project current year reserves to ultimate — is the right conceptual framework, but most ML-based scoring systems do not implement it. They use reported loss amounts as the target variable without adjustment.
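The adjustment the text describes can be sketched as follows. The age-to-age factors are invented for illustration, not selected or industry factors:

```python
# Hypothetical age-to-age (ATA) factors: dev age in months -> factor to next age.
ATA_FACTORS = {12: 1.50, 24: 1.20, 36: 1.05, 48: 1.00}

def develop_to_ultimate(reported_loss: float, age_months: int) -> float:
    """Multiply reported losses by all remaining age-to-age factors."""
    factor = 1.0
    for age, ata in sorted(ATA_FACTORS.items()):
        if age >= age_months:
            factor *= ata
    return reported_loss * factor

# A 12-month-old accident year: 1.50 * 1.20 * 1.05 = 1.89 cumulative LDF,
# so $100K reported develops to ~$189K estimated ultimate.
print(develop_to_ultimate(100_000, age_months=12))
```

Using the developed figure rather than the reported figure as the training target is what makes recent and mature accident years comparable.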
The practical consequence is that models trained with recent data as a substantial fraction of the sample will price recent business more aggressively than the mature experience would support. You will not see it in your backtests because the backtest data is also affected. You will see it in your combined ratio two years out.
Correcting for Both Effects: A Practical Framework
The corrections are not complex in principle, though they require data infrastructure most carriers do not have pre-built. The steps are:
First, apply LDF-based development factors to recent accident years before using them as training targets. Use your actuarial team's selected factors, or industry factors from the ISO loss development publication if your experience is too thin to be credible. The goal is to train on estimated ultimate losses, not reported losses.
Second, re-weight your training sample to reduce the influence of recent policy years. Policies that have not yet had two full loss development periods should contribute proportionally less to the parameter estimates than mature accident years where the experience is more stable.
Third, supplement your in-house data with declined submission records and external benchmark loss costs. Even imperfect negative examples improve the model's ability to identify accounts at the tail of the risk distribution that your current underwriters would decline on subjective grounds but the model might otherwise score favorably.
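The three steps above can be sketched together in a single helper. The cumulative LDFs, the down-weighting rule, and the record layout are all assumptions for illustration:

```python
# Hypothetical cumulative LDFs by development age in months.
CUM_LDF = {12: 1.89, 24: 1.26, 36: 1.05, 48: 1.00}

def training_row(reported_loss, age_months, declined=False):
    """Build a (target, label, weight) record for one account.

    Declined submissions carry no loss target; they enter as labeled negative
    examples at full weight. Bound policies are developed to estimated ultimate
    and down-weighted until two development periods (24 months) have elapsed.
    """
    if declined:
        return {"target": None, "label": "declined", "weight": 1.0}
    ultimate = reported_loss * CUM_LDF[age_months]
    weight = 1.0 if age_months >= 24 else 0.5   # assumed down-weighting rule
    return {"target": ultimate, "label": "bound", "weight": weight}

print(training_row(100_000, age_months=12))
print(training_row(100_000, age_months=48))
print(training_row(0, age_months=12, declined=True))
```

The exact weighting schedule should come from your actuarial team; the point of the sketch is that development and maturity weighting happen before the data reaches the model, not inside it.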
What This Means for Interpreting Model Output
Even with these corrections applied, an underwriting model trained on carrier-specific data carries implicit assumptions about the distribution of risks in your submission pipeline. If your marketing strategy shifts — you start writing a new SIC code, you expand into a new state, you begin accepting business from a new wholesaler — the model's predictions become less reliable because the new submissions are outside the training distribution.
Distribution shift is a standard machine learning problem with standard mitigations: monitor feature distributions in production, alert when they diverge from training-time distributions, and retrain on schedule with updated experience. In insurance, the retraining schedule should align with the actuarial year-end process, when mature accident year data becomes available and LDF selections are refreshed. Continuous monitoring between annual retrains catches distribution shift before it materially affects pricing.
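One standard way to monitor feature distributions is the Population Stability Index (PSI); a minimal sketch follows, with the commonly cited 0.25 alert threshold treated as an assumed rule of thumb rather than a universal standard:

```python
import math

def psi(expected_pcts, actual_pcts):
    """Population Stability Index between training-time and production
    feature distributions, binned into matching percentage buckets."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_pcts, actual_pcts))

train_dist = [0.25, 0.25, 0.25, 0.25]   # a feature's bins at training time
prod_dist = [0.05, 0.15, 0.30, 0.50]    # the same bins observed in production

score = psi(train_dist, prod_dist)
# Assumed rule of thumb: PSI > 0.25 signals material distribution shift.
print(f"PSI = {score:.3f}, retrain flag = {score > 0.25}")
```

Running this check per feature on each scoring batch gives the continuous monitoring between annual retrains that the text calls for.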
As we discuss in our article on Loss Development Factors vs. Bornhuetter-Ferguson, the actuarial credibility framework provides a principled way to weight model predictions against prior experience when data is thin. The same logic applies to model monitoring: a model producing predictions well outside the range of historical ISO loss costs for a class should trigger review, not automatic binding.
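That review trigger can be as simple as a range check against bureau benchmarks; a sketch with a hypothetical 50% tolerance band and made-up loss costs:

```python
def requires_review(model_loss_cost, iso_low, iso_high, tolerance=0.5):
    """Flag model predictions that fall well outside the historical ISO
    loss-cost range for the class. The tolerance band is an assumption."""
    low = iso_low * (1 - tolerance)
    high = iso_high * (1 + tolerance)
    return not (low <= model_loss_cost <= high)

print(requires_review(2.40, iso_low=1.80, iso_high=2.60))  # False: bind can proceed
print(requires_review(7.50, iso_low=1.80, iso_high=2.60))  # True: route to underwriter
```

Flagged accounts go to human review rather than automatic binding, which is the behavior the credibility framework implies.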
Conclusion
Training underwriting models exclusively on bound business is the default behavior because bound business is the data that exists in structured form. The fix is not to throw out existing models — it is to understand what they can and cannot learn from the data they were given, and to build in the actuarial adjustments that correct for the most systematic distortions. A model that knows its own limitations is more useful than one that does not.
If your submission scoring system has not been audited for survivorship bias in its training data, that is a concrete next step with a concrete payoff: corrected loss cost estimates that hold up when your book evolves.
Questions about your scoring model's training data?
Contact us at support@riskvertx.com or request a technical review.