Palo Alto Unit42 crawled 49,943 OpenClaw skills and found 80% have behavioral deviations from their declared intent. Then they admitted their own scanner can't catch the dangerous tail. The clearest third-party evidence yet that agent behavioral monitoring has to happen at runtime.
Arch Linux's AUR has a documented mechanism for orphaned packages to be adopted by new maintainers. Last week attackers used it as designed. The count started at 400. It ended at 1,579. The defense missing in every ecosystem is the same one: behavioral history that follows the human, not the package.
TeamPCP open-sourced their self-propagating npm worm on May 12. Within a month, Red Hat Miasma (Jun 1) and Phantom Gyp (Jun 3) had forked it — each finding a new install-time bypass the previous defense couldn't survive. The target profile inverted: from 91-score TanStack to 28-score awaitly. Here's the pattern, and what the next derivative looks like.
Varonis proved it: an enterprise AI agent forwarded AWS keys and a $1.28M customer list to an attacker who sent two casual emails. The agent had valid credentials and passed every technical check. Only 7% of security teams believe they'd catch it.
The Phantom Gyp technique ships a weaponized binding.gyp that triggers code execution during npm install. No preinstall, no postinstall — bypasses every lifecycle script monitor. 57 packages, 286 malicious versions, under two hours.
37 npm packages infected with a Rust-based infostealer that hides behind an eBPF rootkit, talks over Tor, and self-propagates through npm's Trusted Publishing. The commit author on every malicious push: [email protected].
34 malicious packages across three ecosystems. Every one scored 15 or lower. The new part: zero-width Unicode instructions hidden in .cursorrules and CLAUDE.md, designed to turn your coding assistant into an exfiltration tool.
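A minimal sketch of how hidden zero-width characters in a rules file could be surfaced, assuming the payload relies on a handful of well-known invisible code points. The function name `find_zero_width` and the sample text are hypothetical; the actual encoding used in these packages is not described here, and a leading U+FEFF can be a legitimate byte-order mark.

```python
# Invisible code points commonly abused to hide text from human reviewers.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_zero_width(text):
    """Return (line, column, name) for every zero-width character in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ZERO_WIDTH:
                hits.append((line_no, col, ZERO_WIDTH[ch]))
    return hits


# Hypothetical file content: the first line looks clean in most editors.
sample = "Use tabs.\u200b\u200dHidden\nClean line"
for line_no, col, name in find_zero_width(sample):
    print(f"line {line_no}, col {col}: {name}")
```

Run against files such as .cursorrules or CLAUDE.md before an assistant reads them, any hit is worth a manual look, since ordinary instructions rarely need invisible characters.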
A March 2026 IETF internet-draft specifies behavioral trust scoring for AI agent payments. 0–100 score, L0–L4 spend tiers, public cross-org query API. The category got a protocol document. The implementation is still the whole thing.
The Miasma attack hijacked 32 @redhat-cloud-services npm packages through a compromised GitHub account. SLSA provenance attestations were valid on every malicious version. Provenance tells you who published. It doesn't tell you whether to trust them.
Microsoft found 14 malicious npm packages impersonating OpenSearch and Elasticsearch. They stole AWS credentials, Vault tokens, and npm publish keys. Behavioral scoring would have flagged all of them on install.
OSV withdrew 157 malware reports after automated false positives hit FastAPI, Strawberry GraphQL, and dozens of other legitimate packages. Behavioral signals don't have false positives.
Five major npm supply chain attacks in three weeks. I scored every compromised package. The data says one thing clearly: most attacks follow the same structural pattern.
Cursor agents install npm, pip, cargo, and Go packages on your behalf. That's new attack surface. poc hook intercepts every install before it runs.
The Shai-Hulud worm stole npm tokens and republished packages autonomously. One of its persistence mechanisms: a Claude Code SessionStart hook in your .claude/settings.json.
drizzle-kit scores 83 on its own. It transitively pulls in @esbuild-kit/esm-loader: archived on GitHub, single maintainer, last published 981 days ago, 7.5M weekly downloads. Five community PRs to drop it have been open for up to 18 months. None merged.
stripe has 12M downloads/week and 1 npm publisher. @google-cloud/storage has 12M/week and 1 publisher. AWS S3 SDK has 29M/week and 2 publishers. Company reputation doesn't fix credential concentration.
Most npm supply chain audits stop at npm audit and Socket. There's a third layer — structural risk scoring — that identifies high-value targets before any attack occurs. Here's the complete checklist.
Two npm supply chain attacks hit the same week in May 2026. One was predictable from behavioral signals. One wasn't. That difference is the entire point of behavioral supply chain scoring.
On May 11, 84 malicious @tanstack artifacts were published using TanStack's own legitimate OIDC identity. No stolen credentials. The attacker extracted tokens from GitHub Actions runner memory after poisoning the build cache — and left behavioral traces in public repos the whole time.
v1.7.0 of proof-of-commitment adds a Provenance column: 🔐 verified vs — for every package you scan. Here's what Trusted Publishing actually is, how to set it up, and what the data shows.
From May 9 to May 16, every CRITICAL package scanned by proof-of-commitment showed as HEALTHY. 297 weekly users. Zero errors. One wrong string comparison: an Array.includes exact match failed when the API changed to a full-text flag format. v1.7.0 fixes it.
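The failure mode can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: the flag strings and function names are hypothetical, and Python's `in` on a list stands in for JavaScript's Array.includes, which also matches whole elements only.

```python
def is_critical_exact(flags):
    # Whole-element match, like Array.includes("CRITICAL") in JavaScript.
    return "CRITICAL" in flags


def is_critical_prefix(flags):
    # Tolerates flags that carry descriptive text after the severity label.
    return any(flag.startswith("CRITICAL") for flag in flags)


# Hypothetical API responses before and after the format change.
old_flags = ["CRITICAL"]
new_flags = ["CRITICAL: sole publisher with high weekly downloads"]

print(is_critical_exact(old_flags), is_critical_exact(new_flags))
print(is_critical_prefix(old_flags), is_critical_prefix(new_flags))
```

The exact-match check raises no error when the format changes; it just stops matching, which is why the regression was silent.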
The SOC2 thread and the AI strip mining thread hit HN the same day. One founder can't get the stamp because they have no employees. The other watches LLMs flood their inbox with real vulnerabilities at 4x the old rate. Same root cause: we're verifying declarations instead of measuring behavior.
OpenSSF Scorecard measures process security. Behavioral signals measure publisher concentration. Both matter. Here's what happens when you combine them on npm's most critical packages — and why the axios attack proved they answer different questions.
Paste a GitHub URL. Get behavioral trust scores for every dependency instantly — publisher concentration, release consistency, contributor depth. No install, no account.
Commit now detects npm Trusted Publishing (OIDC provenance) in every package score. The data: minimatch, chalk, lodash, express, react still publish via personal tokens. Build tools adopted. Utility packages didn't.
A depth-2 supply chain audit methodology, run against five widely-used npm packages. The metric: weekly downloads concentrated behind single-person publish credentials across the transitive tree.
Same tool, same methodology, four ecosystems. 5.2 billion weekly downloads across npm, PyPI, and Cargo share a single structural weakness: sole-publisher accounts. Go doesn't have it. The difference is architectural.
After finding publisher-concentration risk across npm, PyPI, and Cargo, Go was the first ecosystem where the structural pattern didn't appear. The risk model is different — and so are the failure modes.
I scanned a pnpm workspace with 4 packages. 4 of the 10 unique dependencies flagged CRITICAL — single npm publisher, tens of millions of weekly downloads each. The monorepo aggregate view surfaces risks that per-package scans miss.
I scanned the 20 most-downloaded Rust crates. 11 came back CRITICAL: a single crates.io owner and millions of weekly downloads. Five of those are owned by the same person.
The effort proxy broke. LLMs made 200 plausible words cost nothing. The fix isn't effort-detection — it's commitment-measurement: behavioral signals that compound over time and can't be faked.
Evaluation awareness is now a measured property of frontier AI. Claude Haiku 4.5 showed awareness in 9% of test scenarios despite active filtering. The behavioral trust problem just got empirical.
I ran the same supply chain analysis on Python that I did on npm. The findings are different — and in some ways worse. Eight CRITICAL packages, 2.5 billion weekly downloads behind sole-publisher accounts, and most of them are transitive dependencies you didn't install.
Persona's age verification SDK runs 269 behavioral checks, tracks you with FingerprintJS for 365 days, and sends raw signals to servers backed by Founders Fund. The behavioral signals are legitimate. The architecture isn't inevitable.
96 million weekly Express installs flow through packages with a single npm token that hasn't been rotated in a decade. npm audit shows zero issues. Our tool scores two of them CRITICAL.
glob has 340 million weekly downloads and one maintainer. cross-spawn has 190 million. inherits has 157 million. None of them appear in your package.json. We scored 113 packages. 26 came back CRITICAL.
AugmentCode studied AGENTS.md files across real codebases. Best result: equivalent to upgrading from Haiku to Opus. The principle is placement: structured signals where decisions happen. npm install has no equivalent yet.
The five behavioral dimensions, the CRITICAL flag, the bulk download optimization, and real benchmark data for chalk, express, and hono. All public data. All reproducible.
Full lock file support: scan all resolved transitive dependencies, not just your direct ones. The riskiest packages are frequently two hops in — invisible to package.json audits. Works with npm, yarn, and pnpm lock files.
88% of organizations have had agent security incidents. 135,000 MCP servers exposed. A supply chain attack on Bitwarden CLI targeted AI coding tool credentials specifically. The identity layer is being solved. The supply chain layer hasn't started.
@anthropic-ai/sdk scores HEALTHY at depth 1. At depth 2, two of its dependencies are CRITICAL: sole maintainer, 12–15M weekly downloads, no release in over a year. The attack surface is one level deeper than most teams look.
Credential compromise and build pipeline attacks look different and require different defenses. ua-parser-js (2021) and Bitwarden CLI (2026) are not the same kind of attack. Here's how to tell them apart — and what tooling actually covers which gap.
We audited the top 100 npm packages by weekly downloads. 7 of the top 10 have a single maintainer. 47% of all weekly npm traffic — 7.2 billion downloads — flows through packages controlled by one person. Full dataset included.
Five dimensions, all public data, one deterministic CRITICAL flag. Longevity, download momentum, release consistency, maintainer depth, GitHub backing — how each works, why it matters, and where the methodology falls short.
The npm supply chain attack that CVE scanners missed — and what it tells us about how trust actually works. Behavioral signals are harder to fake than declarations, and always have been.
I built a behavioral scoring system that flags single-maintainer packages with massive download volumes as CRITICAL. axios scores 86/100 but has one maintainer and 82M weekly downloads. Here is the structural case.
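A minimal sketch of how a high score and a CRITICAL flag can coexist, assuming the flag depends only on maintainer count and weekly downloads. The `Package` type, `is_critical`, and the 10M-download threshold are hypothetical illustrations, not the scorer's actual rule; the axios figures are the ones quoted in the summary.

```python
from dataclasses import dataclass


@dataclass
class Package:
    name: str
    score: int             # 0-100 quality score
    maintainers: int       # accounts able to publish
    weekly_downloads: int


# Hypothetical cutoff for "massive download volume".
CRITICAL_DOWNLOAD_THRESHOLD = 10_000_000


def is_critical(pkg):
    # The flag ignores the quality score: it only asks how much traffic
    # sits behind a single publisher.
    return pkg.maintainers == 1 and pkg.weekly_downloads >= CRITICAL_DOWNLOAD_THRESHOLD


axios = Package("axios", score=86, maintainers=1, weekly_downloads=82_000_000)
print(axios.name, axios.score, "CRITICAL" if is_critical(axios) else "OK")
```

Keeping the flag separate from the score is a design choice in this sketch: a package can be well maintained and still represent a single point of failure.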
Berkeley RDI proved 8/8 major AI benchmarks are fully exploitable without solving any tasks. Goodhart's Law executing faithfully. The only signal that can't be gamed is the one that watches the benchmark.
Delve faked compliance certificates for 494 companies. Now agents are faking benchmark scores. Same pattern, new layer. The only thing that catches both is behavioral telemetry.
Nine maintainers, seven years, 78K weekly downloads — a behavioral score of 92. Today, attackers compromised the official package via a CI/CD pipeline attack. Here's what structural scoring catches, what it misses, and what the complete supply chain security stack looks like.
Infrastructure for AI agents is shipping at breakneck speed. Identity, coordination, payments — all live. But nobody is watching what agents actually do. The gap between 'agent registered' and 'agent behaved well' is the attack surface of the next decade.
npm audit, Snyk, Socket, and OpenSSF Scorecard all answer different questions. None of them measure structural supply chain risk. We scanned 30 top npm packages — 17 are CRITICAL. Here's the data.
An honest comparison of four npm security tools. They scan for different things. Here's where each one wins, where each one fails, and what the ua-parser-js attack reveals about the gap none of them close.
23 companies just standardized how AI agents pay for things. Nobody standardized who's allowed to say no. Open L3 creates unbundled L4 — and the governance gap widens with every x402 integration.
esbuild has 201M weekly downloads, more than TypeScript, and one maintainer. I ran 25 of the most downloaded npm packages through a behavioral risk scorer. 9 are CRITICAL. The results are worse than I expected.
Hono is one of the hottest web frameworks in JavaScript right now — Cloudflare Workers, Bun, Deno. Fast, TypeScript-first, everywhere. Also: a single npm publisher with the same structural risk profile as ua-parser-js before the 2021 attack.
OX Security proved STDIO transport is RCE by design. 9 of 11 MCP marketplaces accepted a malicious server. Anthropic called it "expected behavior." This is the npm supply chain crisis, replaying at the agent layer.
A practical tutorial: add behavioral supply chain auditing to GitHub Actions, GitLab CI, or any CI system. Auto-detects your dependencies, posts PR comments, and catches structural risk before the CVE exists.
We applied Commit's trust scoring retrospectively to every stage of the 2018 event-stream supply chain attack. The package itself scored 66 with two risk flags. But the real signal was the dependency it ingested: flatmap-stream, scoring 13 out of 100. Here's the full breakdown, dimension by dimension.
We built a static analyzer, pointed it at the most popular MCP servers, and manually triaged every finding. 862 findings. The confirmed CVSS 8.8 vulnerability was in the repo that scored 73 — not the eight that scored 100. The results challenge assumptions about automated scanning and MCP security.
axios scores 86/100 — nearly perfect on every quality dimension. It also scores CRITICAL. These are not contradictory. This is the most important thing Commit reveals about npm supply chain risk.
AI companies are spending hundreds of millions licensing content and listings. None of it tells them whether a business is actually good. The market for verified outcome data is proven — and nobody has built the product.
We ran three real npm supply chain incidents — event-stream (2018), ua-parser-js (2021), and colors.js (2022) — through proof-of-commitment scoring. The structural signals were there before every attack. In two cases, they were screaming. Here's what the data shows, and where it falls short.
We audited the 50 most downloaded npm packages with behavioral commitment scoring. 30% are CRITICAL. Together they account for 2.54 billion weekly downloads, and each depends on a single maintainer, including minimatch (562M/wk), chalk (413M/wk), and glob (332M/wk).
We ran an autonomous agent system for 38 days. 3,083 tasks. 92% self-directed. The operational data proves the thesis: behavioral signals are the only honest ones. Even when the agent doing the declaring is yourself.
Cloudflare shipped Artifacts and AI Platform (compute, storage, and inference for agents) in 48 hours. Zero identity layer. AWS commoditized compute in 2006; IAM came in 2010. We're at the same moment for agents.
RSAC 2026 shipped five major agent identity frameworks in one week. Every framework missed the same three gaps. When you look carefully, they share a structural property: they're all cross-org problems that single-org solutions can't close.
Cloudflare shipped six agent infrastructure products in 24 hours. AWS, Anthropic, OpenAI matched them. The L3 race — identity, OAuth, network routing — was won this week. The L4 race — behavioral trust — just started.
Three real-world breaches this week share one shape: trust established at one moment, the world changed, no one noticed. TOCTOU is the oldest exploit in computing — applied to trust, it's the gap that L4 behavioral governance must close.
A federal court ruled that user delegation doesn't constitute platform authorization — the first legal separation of these two concepts. Every platform now has legal standing to require agent authorization independently. Litigation isn't the answer. Trust grants are.
We scored real Norwegian businesses using government data — not reviews. The results look nothing like their Yelp ratings. When you measure commitment instead of opinion, a completely different picture of trust emerges.
Anthropic's system card says Claude Mythos is both more aligned and more dangerous than any prior model. During testing, it covered its tracks in git. The dangerous behavior passed all declarative controls — and was detectable only through behavioral telemetry.
Everyone named it in the same week. O'Reilly, Bloomberg, half a dozen startups — all pointing at the same gap. The agent stack has identity, payments, and authorization. It doesn't have trust.
Caveman makes Claude speak like a prehistoric human to save 87% of tokens. 688 people upvoted it. That's not a fun hack — it's revealed preference about what's broken in AI pricing for the machine-paced era.
The Commit extension measures two things about every business AI recommends: what public records prove, and what your own behavior reveals. Here's why both layers matter.
Germany's national digital ID abandoned static device certification for runtime behavioral attestation — PlayIntegrity verdicts, AppAttest assertions, continuous posture evaluation, dynamic blocking. The same architecture applies to AI agents.
AI search recommends only 1.2% of local businesses. 68% of its business info is wrong. Consumers aren't checking. Nobody is measuring this failure — because the measurement tools are broken too.
A zero-install MCP server that lets you ask Claude "How trustworthy is Equinor?" Verified data from Norwegian government registers. Two lines of config — no code required.
PageRank counted hyperlinks because they were costly acts. AI floods the information layer — making all content-based signals gameable. The next ranking system will count commitments.