The Behavioral Trust Gap in Open Source

Every major supply chain attack exploits the same structural gap. Current security tools can't see it — because they're answering the wrong question.

The gap nobody scans for

In October 2021, an attacker compromised a single npm authentication token and pushed a cryptominer and credential-stealing trojan into ua-parser-js — used by Facebook, Microsoft, Amazon, and Google. Seven million weekly installs. Zero prior warnings from npm audit, Snyk, Dependabot, or any CVE database on Earth.

This wasn't a failure of execution. The tools worked exactly as designed. They scanned their databases and found nothing, because there was nothing to find. No CVE had been filed. No vulnerability had been disclosed. The package was clean right up until the moment it wasn't.

The same pattern played out with event-stream in 2018, when a social engineering attack slipped a cryptocurrency-stealing payload into a package with two million weekly downloads. And again with colors.js in 2022, when the sole maintainer intentionally broke the package, taking down CI pipelines across hundreds of projects. In every case: zero pre-incident warnings from any existing tool.

This is the behavioral trust gap. The distance between "no known vulnerability" and "actually trustworthy." It is the gap that every major supply chain attack exploits. And it is invisible to every security tool that works by scanning for known-bad.


Why the gap exists

Trust in open source is inferred from popularity. Stars, downloads, dependents — these are the signals developers use when choosing packages, and they're the signals that security tools implicitly rely on when they don't flag anything. If a package has 100 million downloads per week, the assumption is that someone is watching. Millions of developers can't all be wrong.

But these are cumulative metrics. They tell you about the past — that a package has been useful to many people over time. They tell you nothing about the present structural conditions. A sole maintainer with 100 million weekly downloads is a high-value target precisely because of those downloads. The trust signal (downloads) and the risk signal (single point of failure) point in opposite directions.

Here's what this looks like in practice, using live data from proof-of-commitment:

Package    Score    Risk        Maintainers    Downloads/wk
chalk      75       CRITICAL    1              411M
esbuild    88       CRITICAL    1              194M
axios      89       CRITICAL    1              101M
Three of the most-depended-upon packages in the npm ecosystem. Quality scores of 75 to 89. Every traditional metric looks healthy. And all three are CRITICAL — because each is controlled by a single maintainer with sole publish access to tens or hundreds of millions of weekly installs.

Current tools cannot see this because they only look for known-bad. npm audit checks an advisory database. Snyk checks its vulnerability feed. Dependabot checks for outdated versions with known issues. All of them answer the same question: does this package have a known vulnerability? And they answer it well. But the question itself has a blind spot. The next axios isn't in any database yet. The structural conditions that will make it possible are visible right now, to anyone measuring the right things.


Behavioral commitment as a signal

If the gap is between "no known vulnerability" and "actually trustworthy," the question is: what would a trustworthiness signal actually look like?

Not a score. A signal. The distinction matters. A score implies precision that doesn't exist — trust isn't a number you can calculate to two decimal places. A signal is directional: it tells you where to look, not what to conclude.

The signal we use is behavioral commitment: patterns of sustained real cost that are structurally hard to fake. Five dimensions, with a toy scoring sketch after the list:

  • Longevity — How long has the project existed? Age isn't virtue, but it's a baseline. A package that has been maintained for 10 years has demonstrated something that a 3-month-old fork hasn't.
  • Momentum — Is the download trend growing, stable, or declining? Declining adoption is a leading indicator of abandonment, and abandoned packages with high download counts are the primary target for social engineering attacks (cf. event-stream).
  • Release cadence — Does the project ship regularly? Release consistency is expensive to fake because it requires sustained real work. A project that releases every few weeks is being actively maintained. A project that hasn't released in two years may still work, but nobody is watching the door.
  • Maintainer depth — How many people can publish? This is the single strongest structural signal. A sole maintainer is a single point of failure — one compromised token, one case of burnout, one social engineering attempt away from an incident. This isn't about the maintainer's competence. It's about the structure.
  • Community governance — Is there a visible GitHub community? Open issues being triaged, pull requests being reviewed, multiple contributors with write access? These are signs of institutional resilience — the project survives the departure of any single person.
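
To make the composite concrete, here is a minimal sketch of how five dimensions like these could fold into one directional signal. The dimension names mirror the list above; the weights, caps, and normalization are illustrative assumptions, not proof-of-commitment's published formula.

```typescript
// Minimal sketch of a five-dimension behavioral signal. Weights, caps,
// and normalization are illustrative assumptions, not the tool's formula.

interface BehavioralInputs {
  ageYears: number;           // longevity
  downloadTrend: number;      // momentum: weekly growth rate (-0.05 = shrinking 5%)
  releasesLastYear: number;   // release cadence
  maintainerCount: number;    // maintainer depth
  activeContributors: number; // community governance proxy
  weeklyDownloads: number;    // feeds the structural flag, not the score
}

const clamp01 = (x: number) => Math.min(Math.max(x, 0), 1);

function behavioralSignal(p: BehavioralInputs) {
  // Each dimension normalized to 0..1 with a soft cap.
  const longevity = clamp01(p.ageYears / 10);
  const momentum = clamp01((p.downloadTrend + 0.1) / 0.2);
  const cadence = clamp01(p.releasesLastYear / 12);
  const depth = clamp01((p.maintainerCount - 1) / 3);
  const governance = clamp01(p.activeContributors / 5);

  // Equal weights for illustration; a real model would tune these.
  const score = Math.round(
    20 * (longevity + momentum + cadence + depth + governance)
  );

  // The structural flag deliberately bypasses the score: one publish
  // identity plus a 10M+ weekly blast radius is flagged no matter how
  // healthy the other dimensions look.
  const critical = p.maintainerCount === 1 && p.weeklyDownloads > 10_000_000;

  return { score, critical };
}
```

The design choice worth noticing is that the CRITICAL flag is independent of the score: a package can do well on every commitment dimension and still be one compromised token away from an incident, which is why chalk can score 75 and still be flagged.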

None of these are novel observations individually. What's novel is treating them as a composite signal and making it cheap to check. The reason structural risk persists isn't that nobody knows about it — it's that nobody measures it automatically. The information exists. The tooling to surface it at decision-time didn't.

These signals are hard to fake precisely because they require sustained real cost. You can buy GitHub stars. You can inflate download counts. You can write an impressive README in an afternoon. But you can't fake three years of consistent biweekly releases, or a governance structure with multiple maintainers who each have meaningful commit history. Berkeley's SWE-bench research showed that coding benchmarks can be gamed to 100% with zero actual bugs fixed — point-in-time metrics are gameable by definition. Behavioral metrics resist gaming because they require real work sustained over real time.
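
Release consistency is also one of the cheapest of these signals to verify against public data: the npm registry packument for any package carries a time field that maps each published version to its timestamp. A rough sketch, with the regularity metric (coefficient of variation of release gaps) as our illustrative choice rather than the tool's exact formula:

```typescript
// Sketch: measure release cadence from npm's public registry packument,
// whose "time" field maps version -> ISO timestamp. The regularity metric
// below is an illustrative choice, not necessarily what the tool computes.

async function releaseCadence(pkg: string) {
  const res = await fetch(`https://registry.npmjs.org/${pkg}`);
  const { time } = (await res.json()) as { time: Record<string, string> };

  const stamps = Object.entries(time)
    .filter(([key]) => key !== "created" && key !== "modified")
    .map(([, iso]) => Date.parse(iso))
    .sort((a, b) => a - b);

  if (stamps.length < 3) return null; // not enough history to judge

  const gapsDays = stamps.slice(1).map((t, i) => (t - stamps[i]) / 86_400_000);
  const mean = gapsDays.reduce((s, g) => s + g, 0) / gapsDays.length;
  const variance =
    gapsDays.reduce((s, g) => s + (g - mean) ** 2, 0) / gapsDays.length;

  return {
    releases: stamps.length,
    meanGapDays: mean,
    // Lower = steadier cadence. Years of low values are expensive to fake:
    // they require real work shipped on a real schedule.
    regularity: Math.sqrt(variance) / mean,
  };
}
```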


A different question

The shift from "scan for known-bad" to "score for structurally-fragile" isn't incremental. It's a different question entirely.

npm audit answers: does this package have a known vulnerability?
Behavioral scoring answers: is this the kind of project that produces vulnerabilities?

The first question is reactive. It requires someone to discover the vulnerability, file an advisory, and propagate it through the database. By the time a database entry exists, the attack has already happened. The response is incident management, not prevention.

The second question is structural. It doesn't predict specific attacks — that's not the claim. It identifies the conditions under which attacks succeed: single points of failure, token concentration, asymmetric blast radius, governance gaps. These conditions are observable months or years before any incident. They were observable before event-stream. Before ua-parser-js. Before colors.js.

This reframe has a practical consequence. If you're a security team deciding which dependencies to monitor closely, CVE scanning tells you to monitor everything equally until something breaks. Structural scoring tells you which packages to watch before anything happens. It's the difference between a smoke detector and a fire inspection. You want both, but only one can prevent fires.

The analogy to financial markets is instructive. Nobody would accept a credit rating system that only flagged defaults after they occurred. The entire point of credit analysis is to assess structural risk before the event. Rating agencies look at debt ratios, governance structures, revenue stability — behavioral commitments, not just past defaults. Open source dependency management is, right now, where credit analysis was before rating agencies existed: purely reactive, and reliably surprised by every crisis.


What we built

proof-of-commitment is an open-source tool that computes behavioral trust scores for npm packages across five dimensions: longevity, momentum, release cadence, maintainer depth, and community governance. It flags CRITICAL when a package has a single maintainer and more than 10 million weekly downloads — the structural profile that produced all three major supply chain attacks analyzed above.

It runs as a CLI (npx proof-of-commitment axios chalk esbuild), a web interface, a GitHub Action that comments on pull requests, and an MCP server that integrates with Claude Desktop, Cursor, and Windsurf. All data comes from public npm and GitHub APIs. The tool is free and the source is open.
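
Both inputs to the CRITICAL flag come from those public endpoints, so the rule itself is cheap to reproduce. A sketch assuming the documented single-maintainer, 10-million-download rule; the endpoint shapes are the real npm APIs, but the code is an illustrative reconstruction, not the tool's source:

```typescript
// Sketch: reproduce the CRITICAL flag from two public npm endpoints.
// The endpoints are the real npm registry and download-counts APIs; the
// rule is the one documented above, reconstructed for illustration.

async function criticalFlag(pkg: string): Promise<boolean> {
  const [packument, stats] = await Promise.all([
    fetch(`https://registry.npmjs.org/${pkg}`).then((r) => r.json()),
    fetch(`https://api.npmjs.org/downloads/point/last-week/${pkg}`).then(
      (r) => r.json()
    ),
  ]);

  const maintainers: unknown[] = packument.maintainers ?? [];
  const weekly: number = stats.downloads ?? 0;

  // The structural profile discussed above: one publish identity,
  // blast radius above 10M installs a week.
  return maintainers.length === 1 && weekly > 10_000_000;
}

// await criticalFlag("chalk") // true, per the table above
```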

What it doesn't do: it doesn't scan code. It doesn't replace npm audit, Snyk, or Dependabot. It doesn't detect malicious payloads or known CVEs. Those tools exist and they work. The behavioral trust gap is the space they don't cover — and that's the only space we're in.


Limitations and open questions

Some honest caveats:

  • Point-in-time, not temporal — The current scoring is a snapshot. It doesn't track behavioral changes over time. A maintainer handoff — the exact attack vector in event-stream — is invisible to a point-in-time score. Temporal diffing is the most important gap to close (a sketch of what it could look like follows this list).
  • Single maintainer ≠ bad — Many of the best-maintained packages in the ecosystem are the work of dedicated individuals. The CRITICAL flag doesn't mean the maintainer is doing a bad job. It means the structure creates concentration risk regardless of individual competence. The right response isn't to abandon these packages — it's to recognize the structural dependency and plan accordingly.
  • npm-first — The scoring currently covers npm. PyPI support exists but is less mature. The behavioral trust gap exists in every package ecosystem; the tooling needs to follow.
  • Gaming will evolve — Any metric that becomes important becomes a target. If behavioral scores start influencing decisions, people will try to game them. The advantage of commitment-based signals is that gaming requires real sustained cost — but the advantage isn't absolute. This is an ongoing arms race, not a solved problem.
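
To make the first caveat concrete, here is a minimal sketch of what the missing temporal layer could look like: diff two stored snapshots of a package's structural profile and alert on exactly the changes a point-in-time score cannot see. The types and names here are hypothetical, since this is the part that does not exist yet.

```typescript
// Hypothetical sketch of temporal diffing: compare two snapshots of a
// package's behavioral profile and surface structural changes. These
// types and names are illustrative; this layer is not built yet.

interface Snapshot {
  takenAt: string;       // ISO date of the snapshot
  maintainers: string[]; // npm usernames with publish access
  score: number;
}

interface TemporalAlert {
  kind: "maintainer-handoff" | "maintainer-added" | "score-drop";
  detail: string;
}

function diffSnapshots(prev: Snapshot, next: Snapshot): TemporalAlert[] {
  const alerts: TemporalAlert[] = [];
  const before = new Set(prev.maintainers);
  const after = new Set(next.maintainers);

  const departed = prev.maintainers.filter((m) => !after.has(m));
  const arrived = next.maintainers.filter((m) => !before.has(m));

  // The event-stream pattern: publish rights moved to a new identity.
  if (departed.length > 0 && arrived.length > 0) {
    alerts.push({
      kind: "maintainer-handoff",
      detail: `publish access moved: -[${departed}] +[${arrived}]`,
    });
  } else if (arrived.length > 0) {
    alerts.push({
      kind: "maintainer-added",
      detail: `new publisher(s): ${arrived.join(", ")}`,
    });
  }

  if (next.score < prev.score - 10) {
    alerts.push({
      kind: "score-drop",
      detail: `score fell ${prev.score} -> ${next.score}`,
    });
  }

  return alerts;
}
```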

The behavioral trust gap isn't a tooling problem. It's a measurement problem. We measure what packages contain — known bugs, outdated versions, CVE matches — because that's what we know how to measure. We don't measure how packages are governed — who controls the publish token, how many people are watching, whether the project structure is resilient or fragile — because nobody built the instruments.

The instruments exist now. The structural conditions that produced event-stream, ua-parser-js, and colors.js are detectable. Not predicted — detected. The conditions are still present in hundreds of high-impact packages today. The question is whether we check.


Commit indexes behavioral commitment — the signals that require real cost and resist gaming. proof-of-commitment is our first tool. It's free and open source.

Case studies: Three npm Disasters That Were Predictable · State of npm Trust: April 2026 · Declarations Are Gameable
