Liberty Needs Paperwork

What a risk score owes the person it scores.

Nine-oh-seven. The judge sees a number.

It sits in the corner of a screen like a weather forecast: tidy, confident, ready for reuse. The defendant is still talking, still trying to be understood. But the number has already started doing its work.

Risk score: 8/10.

If a system can put someone in a cage, that system must be easier to inspect, challenge, and appeal than most institutions are comfortable with.

If that sounds expensive, good. Liberty is an expensive interface. The question is: can the person push back?


a number that pretends to be a reason

Pretrial risk assessments were sold as a corrective to messy human judgment and crude money-bail schedules. Replace gut feel with actuarial estimates. Reduce arbitrary disparities. On paper, a reasonable upgrade.

In practice, most tools do something narrower. The Public Safety Assessment (PSA), for example, estimates the likelihood of three outcomes during the pretrial release period: failure to appear, a new criminal arrest, and a new violent criminal arrest. Judges are meant to weigh those estimates alongside other information. 1

That framing is honest. It says: this is a forecast, not a verdict.

Then the real world happens. Dockets are crowded. Time is short. A number looks like relief.

And relief is exactly how authority sneaks in.

If you want to understand algorithmic judging, don’t imagine a robot with a gavel. Imagine something more boring, more plausible, and more dangerous: a score that becomes hard to disagree with, even when nobody can explain it.

A score can be fast. A reason is slower. A remedy is slower than that.

The rest of this essay is about what happens when we confuse the three.

first switchback: prediction is not judgment

A model that outputs “8/10” is doing a familiar machine thing: mapping inputs to outputs. Functionalism, in broad strokes, identifies mental states by what they do rather than by what they’re made of. If something plays the right functional role, maybe it is the same kind of thing.

Tempting lens for AI ethics. If the system behaves like judgment, perhaps it is judgment.

But bail is not just behavior. Bail is state power.

A risk assessment can estimate probabilities. It cannot tell you what you are allowed to do to a person on the basis of those probabilities. That step is moral and legal. That step is judgment.

Judgment carries reasons that can be argued with. It sits inside procedures that recognize fallibility. It assigns responsibility when harm occurs.

A model can’t apologize. It can’t make restitution. It can’t be disbarred or voted out. It can’t be cross-examined in the way due process expects.

So when we ask “can AI make fair ethical decisions,” we’re already slipping. We’re asking a prediction engine to inherit moral authority because it outputs a number in a font we trust.

Use AI to estimate. Use law and human responsibility to decide.

That’s not sentiment. It’s an engineering claim about where accountability lives.

second switchback: “fair” is not one setting

Even if the model is just estimating risk, we still have to ask what kind of “fair” it’s optimized for. This is where the story gets uncomfortable for anyone who wants a single metric.

Fairness has no shortage of definitions. It has a surplus, and they conflict.

Kleinberg, Mullainathan, and Raghavan formalized a core impossibility: outside of special cases (equal base rates, or a perfect predictor), a score cannot be calibrated within each group while also balancing its errors across groups. 6 Chouldechova makes the point directly for recidivism prediction: when prevalence differs across groups, a tool cannot keep the same predictive value and the same false positive and false negative rates for everyone. 8

Street-level translation:

You can tune a system so a given score means the same thing across groups (calibration). Or you can tune it so the system makes similar mistakes across groups (equalized error rates). When base rates differ, you usually don’t get both.
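The conflict is easy to see in a toy simulation. Everything below is invented for illustration, not drawn from any real tool or jurisdiction: two groups get a score that is perfectly calibrated by construction, and the only difference between them is the base rate.

```python
# Toy demonstration: a perfectly calibrated score still produces different
# error rates for groups with different base rates. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def group_error_rates(n, base_rate, threshold=0.5, concentration=10):
    # Each person's true risk varies, but the group average equals base_rate.
    risk = rng.beta(base_rate * concentration,
                    (1 - base_rate) * concentration, size=n)
    outcome = rng.random(n) < risk     # the predicted event occurs, or not
    score = risk                       # calibrated by construction
    flagged = score >= threshold       # the "detain" recommendation
    fpr = flagged[~outcome].mean()     # flagged, among people without the event
    fnr = (~flagged)[outcome].mean()   # missed, among people with the event
    return fpr, fnr

# Same score, same threshold, different base rates.
for name, rate in [("group A", 0.20), ("group B", 0.40)]:
    fpr, fnr = group_error_rates(200_000, rate)
    print(f"{name}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

Run it and the higher-base-rate group shows the higher false positive rate at the same threshold, even though the score means exactly the same thing for everyone. That is the trade the formal results describe.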

So when someone says “the model is fair,” the only honest response is:

Which fairness, traded for which other, and who holds the receipt?

Bring that back into the courtroom. If the defendant sees “8/10,” do they also see the definition of the outcome being predicted, the error rates for people like them, and the fairness tradeoffs chosen by the jurisdiction?

Most of the time, no. They see a number and the posture changes in the room.

A score is not neutral. It’s compressed policy.

third switchback: the data is a footprint, not a mirror

Risk models don’t learn from “crime.” They learn from records: arrests, convictions, failures to appear, supervision violations. Those records are not a clean window into behavior. They’re a footprint left by enforcement choices, prosecutorial discretion, plea bargaining, policing intensity, and resource disparities.

If you’ve ever debugged a production system, you know the feeling: the logs tell you what happened, but they also tell you where the logging is broken.

Prediction systems inherit that problem. They can look objective while replaying historical attention.
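A toy sketch makes the footprint problem concrete. The precinct names, offense rate, and detection probabilities below are assumptions invented for illustration: behavior is identical everywhere, only enforcement intensity differs, and the records diverge anyway.

```python
# Toy sketch: identical underlying behavior, different enforcement intensity,
# very different arrest records. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
offense_rate = 0.10                      # same true behavior in both precincts

detection = {"precinct A": 0.20,         # lightly policed
             "precinct B": 0.60}         # heavily policed

for precinct, p_detect in detection.items():
    offended = rng.random(n) < offense_rate
    arrested = offended & (rng.random(n) < p_detect)
    print(f"{precinct}: true offense rate {offended.mean():.1%}, "
          f"arrest rate in the record {arrested.mean():.1%}")

# A model trained with `arrested` as its label learns that precinct B is
# "riskier", even though behavior was identical by construction.
```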

This is one reason the COMPAS controversy still matters. ProPublica’s “Machine Bias” investigation argued that COMPAS produced racially disparate error patterns, with Black defendants who did not reoffend labeled higher risk nearly twice as often as white defendants. 3 Flores, Bechtel, and Lowenkamp published a rejoinder arguing that ProPublica’s analysis rested on faulty statistics. 15

Sit with that dispute. Even if you think ProPublica’s framing was wrong, the governance problem survives: the score still carries power, still compresses tradeoffs, and the affected person still struggles to contest it.

The math argument does not dissolve the legitimacy argument.

what the number does to humans

Back at the bail hearing. Nobody has to say “the model decides.” The model can “advise” and still become the decision. Humans anchor on numbers. Humans defer to instruments. In a pressured environment, the easiest path becomes the default path.

That’s not a moral failure of judges. It’s a systems reality.

When a score is present, three quiet shifts occur:

Attention narrows. The hearing becomes a search for reasons to justify the score rather than a full investigation of the person.

Disagreement becomes expensive. If the judge departs from the score, they may fear blame later. If they follow it, responsibility diffuses into “the tool.”

The defendant becomes a profile. The room starts to treat a feature vector as a person, because the feature vector speaks in numbers and the person speaks in sentences.

A score doesn’t need a robe to exert judicial gravity. It just needs a busy docket and a lack of contestability.

In State v. Loomis, the Wisconsin Supreme Court held that a trial court’s use of a COMPAS risk assessment at sentencing did not violate due process, while also emphasizing limitations and warnings about how such tools should be used. 5

If you read Loomis as “courts are fine with black boxes,” you’re missing the flinch. The opinion contains caution language precisely because the legitimacy problem is obvious.

GDPR Article 22 encodes a similar instinct more explicitly: a person has the right not to be subject to a decision based solely on automated processing that produces legal or similarly significant effects, with safeguards including the right to obtain human intervention, express a view, and contest the decision. 13

Different domain, same underlying need: if a system can shape your life, you need a handle to grab.

the strict rule

Here’s the rule, stated plainly:

The more an AI system participates in deciding, the more transparent the process must become, and the easier recourse must be.

Not “should.” Must.

This is not an abstract moral preference. It’s a stability requirement. Systems that cannot be contested eventually lose legitimacy, and legitimacy is what keeps compliance from collapsing into force.

Think of it like building a bridge on a public trail. The heavier the load, the more you overbuild the supports. You don’t get to say “it usually holds.”

what strict looks like when you stop hand-waving

If a risk tool is in the loop, the loop needs mechanical guarantees. Some technical, some procedural, all costly in the way serious safeguards are.

Transparency. Disclose the target: what outcome is being predicted, exactly. “New arrest” is not “new crime”; don’t hide that distinction behind a label. Disclose the inputs: if prior arrests matter, say so; if age matters, say so; if the system uses proxies for protected attributes, name them. Disclose performance over time, not once. 12 And prefer simpler models when they perform similarly; Dressel and Farid found that a simple linear predictor using just a couple of features could match a widely used commercial tool. 9 If you can get similar predictive performance with far more transparency, you don’t earn ethical points for complexity.
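Here is what the “prefer the simpler model” check might look like in practice, as a hedged sketch: the file name, column names, and outcome label are placeholders, the features are assumed to be numeric, and the gradient-boosted model stands in for whatever a vendor ships, not for COMPAS itself.

```python
# Sketch of a transparency check: does a two-feature model perform about as
# well as a complex one? Dataset, columns, and label are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("pretrial_outcomes.csv")       # hypothetical, numeric features
y = df["rearrested"]
X_simple = df[["age", "prior_count"]]           # two features, fully disclosable
X_full = df.drop(columns=["rearrested"])        # everything the vendor might use

Xs_tr, Xs_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
    X_simple, X_full, y, test_size=0.3, random_state=0, stratify=y)

simple = LogisticRegression(max_iter=1000).fit(Xs_tr, y_tr)
opaque = GradientBoostingClassifier(random_state=0).fit(Xf_tr, y_tr)

print("simple AUC:", roc_auc_score(y_te, simple.predict_proba(Xs_te)[:, 1]))
print("opaque AUC:", roc_auc_score(y_te, opaque.predict_proba(Xf_te)[:, 1]))
# If the gap is small, the extra complexity is buying opacity, not accuracy.
```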

Contestability. The defendant sees what the system “thinks” it knows, not just the score, but the underlying factors. The defendant can correct errors in the data. The defense can challenge the model’s validity; if the model is proprietary and cannot be meaningfully interrogated, that is a procedural defect, not a business model. And the judge must write a reason that stands without the score. If the score disappeared, the justification should still be coherent. That’s how you prevent “the score ate the hearing.”
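One way to make the score-independent justification more than a slogan is to build it into the record itself. The schema below is purely illustrative, not any court system’s actual form; the point is only that the record cannot be completed without the disclosures and the written reason.

```python
# Illustrative schema only: a decision record that is incomplete until the
# tool's inputs were disclosed and the judge wrote a standalone justification.
from dataclasses import dataclass, field

@dataclass
class PretrialDecisionRecord:
    defendant_id: str
    score: float                              # the tool's output, shown to the defense
    score_inputs: dict                        # every factor the tool saw, open to correction
    corrections_requested: list = field(default_factory=list)
    judge_justification: str = ""             # must stay coherent with the score removed

    def is_complete(self) -> bool:
        # Whether the justification truly stands without the score is a human
        # judgment; this only refuses records that skip the paperwork entirely.
        return bool(self.score_inputs) and bool(self.judge_justification.strip())
```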

Governance. Independent audits, not vendor assurances. Partnership on AI documented serious shortcomings in these tools; their partner consensus view was that current tools should not be used to automate pretrial detention decisions. 10 11 Ongoing monitoring with public reporting; populations change, enforcement changes, policies change. Clear limits on how the score is used; if the score is “advisory,” make departures ordinary and defensible, not rare and risky.
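Ongoing monitoring is mostly unglamorous plumbing. A minimal sketch, assuming a decision log with period, group, score, and observed-outcome columns; the column names, threshold, and 0.05 tolerance are all placeholders for the jurisdiction to set:

```python
# Minimal monitoring sketch: false positive rate by period and group, with a
# drift flag when subgroup rates diverge. Column names are assumptions.
import pandas as pd

def quarterly_fpr_report(log: pd.DataFrame, threshold: float = 0.5,
                         tolerance: float = 0.05) -> pd.DataFrame:
    flagged = log["score"] >= threshold
    # Restrict to people with no new arrest; being flagged there is a false positive.
    non_events = log[log["outcome"] == 0].assign(false_positive=flagged)
    report = (non_events.groupby(["period", "group"])["false_positive"]
                        .mean()
                        .unstack("group"))
    # Flag any period where subgroup rates have drifted too far apart.
    report["drift_flag"] = (report.max(axis=1) - report.min(axis=1)) > tolerance
    return report
```

Publishing a table like that every quarter is the difference between “validated once, years ago” and actual monitoring.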

Human-in-the-loop that isn’t theater. A human who clicks “approve” is not a safeguard. That’s a decorative checkbox. Real human intervention requires authority to override without penalty, time to consider evidence, and a culture that treats disagreement with the model as a professional act, not a liability event.

Otherwise the model is the judge and the human is the printer.

the uncomfortable twist

Even if you could pick a fairness definition and optimize it, the “cost” of fairness can show up as more detention, not less. Corbett-Davies and Goel discuss how fairness constraints interact with decision thresholds and can impose costs. 14
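A toy version of that interaction, with invented numbers rather than anything from the paper: force equal detention rates across two groups whose risk distributions differ, and the thresholds split, so one group ends up detained at noticeably lower estimated risk.

```python
# Toy illustration: equalizing detention rates across groups with different
# risk distributions pushes the thresholds apart. All numbers are invented.
import numpy as np

rng = np.random.default_rng(2)
risk_a = rng.beta(2, 8, 100_000)    # lower-risk group on average (~0.20)
risk_b = rng.beta(4, 6, 100_000)    # higher-risk group on average (~0.40)

# One risk threshold for everyone.
t = 0.5
print(f"single threshold detains {(risk_a >= t).mean():.1%} of A "
      f"and {(risk_b >= t).mean():.1%} of B")

# Constraint: equal detention rates, i.e. detain the top 20% of each group.
t_a, t_b = np.quantile(risk_a, 0.8), np.quantile(risk_b, 0.8)
print(f"equal-rate thresholds: A detained above {t_a:.2f}, B above {t_b:.2f}")
# Group A is now detained at much lower estimated risk than group B; the
# constraint moved the cost, it did not make the cost disappear.
```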

This breaks a comforting story. Fairness work does not automatically push toward more humane outcomes. Sometimes it pushes toward a different distribution of harm.

So the ethical question is never “did we meet the metric?”

It’s “who paid, and can they contest the bill?”

can AI make fair ethical decisions?

Back in the courtroom. The number still sits there: 8/10. The person still stands at the table.

The honest answer:

AI can help estimate. AI cannot, on its own, be a fair ethical decider. Not because machines are spooky, but because fairness here includes contestability, explanation, and responsibility. Those live in institutions, not in tensors.

If we insist on involving AI, then we need to make the system legible enough that the defendant can fight it, and structured enough that the judge cannot hide behind it.

A fair system is not one that never errs. A fair system is one that can admit error without trapping people inside it.

So the test for algorithmic judging is not whether the score is calibrated, or whether it beats human accuracy by two points.

The test is simpler and harsher:

When the number is wrong, can the person make it stop being wrong in time to matter?

That’s what justice looks like when you stop treating a score as a reason.

  • About the Public Safety Assessment (PSA). Advancing Pretrial (2020). 1
  • Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. Angwin et al., ProPublica (2016). 3
  • State v. Loomis (Wisconsin Supreme Court opinion). Supreme Court of Wisconsin (2016). 5
  • Inherent Trade-Offs in the Fair Determination of Risk Scores. Kleinberg, Mullainathan, Raghavan (2016). 6
  • Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Chouldechova (2016). 8
  • The accuracy, fairness, and limits of predicting recidivism. Dressel & Farid, Science Advances (2018). 9
  • Report on Algorithmic Risk Assessment Tools in the U.S. Criminal Justice System. Partnership on AI (2019). 10
  • The Partnership on AI Response to NIST AI RFI (risk assessment tools requirements). Partnership on AI (2021). 11
  • Algorithmic fairness. Hellman, Stanford Encyclopedia of Philosophy (2025). 12
  • GDPR (Regulation (EU) 2016/679), Article 22: Automated individual decision-making. EUR-Lex (2016). 13
  • Algorithmic decision making and the cost of fairness. Corbett-Davies & Goel (2017). 14
  • False Positives, False Negatives, and False Analyses: A Rejoinder to “Machine Bias…”. Flores, Bechtel, Lowenkamp (2016). 15
