How we're using Sourcegraph and a Slack bot to detect vulnerabilities and react quickly
May 13, 2026
Stephanie Jarmak
A Slack bot triages every GitHub advisory, posts a rocket-to-trigger ask in the channel, and on one human reaction runs the full content pipeline: detection queries, blog scaffold, social drafts, a 35-second auto-cut demo. The operator's remaining job is to read the drafts and decide whether they're honest.
At 19:24 UTC on 2026-05-11, within a six-minute window, the npm registry accepted 84 malicious versions across 42 @tanstack/* packages, including @tanstack/zod-adapter@1.166.15, with credential-exfiltration payloads wired into preinstall hooks (Aikido analysis; TanStack postmortem). By the time the maintainers had pulled the bad versions and CVE-2026-45321 was filed at CVSS 9.6, the worm (TeamPCP's "Mini Shai-Hulud," a sequel to the Shai-Hulud 2.0 npm worm that backdoored 796 packages in November) had already self-replicated into other maintainers' accounts through stolen CI tokens (StepSecurity write-up; Snyk advisory).
An hour later, a Slack message sat in #supply-chain-response: the GHSA ID, the CVSS score, the affected package list, a Deep Search prompt ready to paste, five targeted Code Search queries, and one ask on the bottom line, "react 🚀 to run the full pipeline."
Figure 1. The bot's first message. Triaged advisory, detection queries already shaped, one human prompt: react 🚀 to commit to the full response pipeline.
Five minutes after triggering the pipeline, a draft blog section, social posts for LinkedIn, X, and email, a 35-second screen-captured Deep Search demo, a GIF, and verification counts for every Code Search query were all back in the thread. The operator's remaining job was to read the drafts and decide whether they were honest.
The volume problem is now structural
The supply chain numbers from the last 18 months stopped being newsworthy and became the steady state. Sonatype logged 454,600 newly published malicious packages across npm, PyPI, Maven, NuGet and Hugging Face in 2025, lifting the cumulative known-malicious total above 1.23 million (2026 State of the Software Supply Chain). Self-replicating campaigns landed inside that total: 171,740 worm-injected packages hit npm alone, including the Shai-Hulud and Shai-Hulud 2.0 events that compromised more than 1,000 unique packages with combined weekly downloads in the tens of millions (CISA alert; Datadog Security Labs).
The forcing function under those numbers is an exponential rise in code volume. GitHub's most recent telemetry has 46% of net-new code arriving from AI assistants (Medium summary citing GitHub data); Google's CEO has 75% of new code generated by AI internally (DevOps.com coverage); Meta's 2026 internal target is 65% of engineers committing more than 75% AI-authored lines (same coverage). Roughly one in every two new dependency declarations, manifest edits, and CI workflow tweaks is now written by something that does not pause to verify a package name against a real registry. Sonatype's 2025 work found AI coding assistants hallucinate package upgrades about 27.75% of the time, recommending more than 10,000 versions that do not exist (Artur Markus summary).
The asymmetry that makes those two trends a security problem rather than a productivity story is the one Anthropic spelled out in plain language when it released Claude Mythos Preview. Mythos found vulnerabilities in every major operating system and every major web browser when pointed at them, including a 27-year-old OpenBSD bug and a 16-year-old FFmpeg flaw. On a Firefox-JS exploitation benchmark it produced 181 successful exploits where Opus 4.6 produced 2. Anthropic's own framing is that "in the short term, this could be attackers, if frontier labs aren't careful about how they release these models," and that the transitional period before defenders catch up "may be tumultuous regardless." Project Glasswing is the bet that releasing Mythos first to defenders of critical software is enough to keep the equilibrium from tipping.
Sourcegraph's earlier post on the LiteLLM 1.82.7/1.82.8 incident, Detecting supply chain attacks at scale with Deep Search, showed the human-driven version of this loop. A TeamPCP-poisoned PyPI release, a Deep Search prompt that surfaced the exposure pattern, a set of follow-up Code Search queries that exhausted the long tail, and the categorization separating pinned-but-fine from range-spec-exposed repos. That post took an afternoon to write and verify. The next one is going to happen at 3 a.m. on a Sunday, and the one after that is going to happen forty minutes later, and that is what we built our supply-chain-alerts Slack bot to help with.
What the bot does
The pipeline is two long-lived processes (advisory-monitor and a Socket-Mode Slack bot) and a handful of one-shot Node scripts each well under 300 lines. Every script reads an incident descriptor (a small JSON file with the package name, registry, affected versions, safe version, and advisory URL) from disk, and writes its outputs to a directory named after the incident ID. There is no orchestrator process, no in-memory state, no message bus.
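As a sketch, the descriptor and the way every one-shot script consumes it look something like this (field names here are illustrative, not the bot's actual schema):

```ts
// incident.ts -- illustrative only: the post fixes what the descriptor
// contains, not how it names the fields.
import { readFileSync } from "node:fs";

export interface IncidentDescriptor {
  id: string;                  // e.g. "tanstack-zod-adapter-2026-05-12"
  package: string;             // "@tanstack/zod-adapter"
  registry: "npm" | "pypi" | "rubygems" | "go" | "maven" | "actions";
  affectedVersions: string[];  // ["1.166.15"]
  safeVersion: string;         // "1.166.16"
  advisoryUrl: string;         // link to the GHSA entry
}

// Every one-shot script starts the same way: read the descriptor from disk,
// then write its outputs under a directory named after the incident ID.
export function loadIncident(id: string): IncidentDescriptor {
  return JSON.parse(readFileSync(`incidents/${id}.json`, "utf8"));
}
```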
advisory-monitor polls the GitHub Advisory Database every 15 minutes. When an advisory passes triage, three things happen before a human sees the channel. The advisory's GraphQL response is normalized into incidents/<id>.json. A fine-grained PAT scoped to a single repo opens and auto-merges a PR against a private demo-targets repo indexed by our Sourcegraph instance, committing an inert manifest under incidents/<id>/; the indexer picks up the commit within a minute or two. And a Slack message lands in #supply-chain-response carrying the GHSA, CVSS, the package and its registry, a ready-to-paste natural-language prompt for Deep Search, the detection queries, and a 🚀-to-trigger ask.
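A sketch of the auto-PR step with Octokit; the library choice, org name, and branch naming are assumptions, while the repo name and the narrow PAT scope are the ones described above:

```ts
import { Octokit } from "@octokit/rest";

// DEMO_TARGETS_PAT is the fine-grained PAT scoped to the single demo repo,
// kept separate from the broader GITHUB_TOKEN used for advisory reads.
const octokit = new Octokit({ auth: process.env.DEMO_TARGETS_PAT });
const owner = "our-org";       // placeholder
const repo = "demo-targets";

export async function openAndMergeDemoPr(incidentId: string, manifestJson: string) {
  const branch = `incident/${incidentId}`; // naming is illustrative

  // Branch off the default branch's current head.
  const { data: main } = await octokit.rest.git.getRef({ owner, repo, ref: "heads/main" });
  await octokit.rest.git.createRef({ owner, repo, ref: `refs/heads/${branch}`, sha: main.object.sha });

  // Commit the inert manifest under incidents/<id>/.
  await octokit.rest.repos.createOrUpdateFileContents({
    owner, repo, branch,
    path: `incidents/${incidentId}/package.json`,
    message: `Add demo manifest for ${incidentId}`,
    content: Buffer.from(manifestJson).toString("base64"),
  });

  // Open and immediately merge the PR so the indexer sees it on the default branch.
  const { data: pr } = await octokit.rest.pulls.create({
    owner, repo, title: `Demo manifest: ${incidentId}`, head: branch, base: "main",
  });
  await octokit.rest.pulls.merge({ owner, repo, pull_number: pr.number, merge_method: "squash" });
}
```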
When a human reacts 🚀, the bot fires the full content pipeline as subprocesses, running query generation, blog scaffold, social drafts (LinkedIn, X thread, email), and a screen-captured Deep Search demo. Each artifact gets posted back to the original Slack thread as it lands. The query-verification step reports the actual hit count for each detection so the operator can see whether the queries found anything real before reading the draft.
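The trigger itself is a few lines of Bolt in Socket Mode. A sketch: the script names and the incident-lookup helper are placeholders, but the shape, one reaction handler, a handful of subprocesses, replies back into the thread, is the whole bot:

```ts
import { App } from "@slack/bolt";
import { spawn } from "node:child_process";

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  appToken: process.env.SLACK_APP_TOKEN,
  socketMode: true,
});

app.event("reaction_added", async ({ event, client }) => {
  if (event.reaction !== "rocket") return;
  if (event.item.type !== "message") return;
  const threadTs = event.item.ts;

  // Acknowledge in-thread, then fire the one-shot scripts as subprocesses.
  await client.chat.postMessage({
    channel: event.item.channel,
    thread_ts: threadTs,
    text: "🚀 Running the full pipeline…",
  });

  const incidentId = await lookupIncidentForMessage(threadTs); // assumed helper: maps the alert back to its descriptor
  for (const script of ["generate-queries", "blog-scaffold", "social-drafts", "record-demo"]) {
    spawn("node", [`scripts/${script}.js`, incidentId], { stdio: "inherit" });
  }
});

await app.start();
```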
Figure 2. Pipeline start. The bot acknowledges the 🚀 reaction and posts a progress message; subprocesses run in the background.
Figure 3. The bot returns the full artifact set to the same thread: detection queries, blog scaffold excerpt, social drafts, query hit counts, and the cut-down demo video.
The remaining sections are about the three decisions that shaped the system. Each of them is a place where the obvious implementation would have been wrong.
Triage is the only thing standing between you and alert fatigue
A naive advisory monitor pages on every CVSS-7+ in every target registry. That is a few dozen alerts a week, almost none of which justify spinning up a response pipeline. The whole point of the bot is to be cheap to ignore and worth paying attention to when it fires, so the triage filter has to be aggressive without being lossy on the incidents that actually matter.
We layered three signals. The first pass is CVSS ≥ 7 inside our target registry set (npm, pypi, rubygems, go, maven, actions). The second is a buzz check against Hacker News stories whose titles mention the package, restricted to stories posted after the advisory date plus a six-hour grace window; the "after" clause matters because old HN coverage of an unrelated CVE on the same package would otherwise trigger fresh alerts forever. The third is a high-reach fallback: a curated list of packages where blast radius is the story regardless of buzz (axios, requests, lodash, langchain, and so on), combined with a supply-chain keyword check on the advisory description ("malicious", "exfiltrat", "credential", "backdoor", "post-install", "preinstall") so routine CVEs on popular packages don't fire.
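In code, the filter reduces to a single function. This sketch assumes the Hacker News lookup goes through Algolia's public search API, which is one way to do it rather than necessarily ours; the thresholds, registries, and keywords are the ones above:

```ts
// Triage sketch: CVSS + registry gate, then buzz, then high-reach fallback.
const TARGET_ECOSYSTEMS = new Set(["npm", "pypi", "rubygems", "go", "maven", "actions"]);
const HIGH_REACH = new Set(["axios", "requests", "lodash", "langchain"]); // abbreviated
const SUPPLY_CHAIN_WORDS = ["malicious", "exfiltrat", "credential", "backdoor", "post-install", "preinstall"];

type Verdict = "fire" | "park" | "drop";

export async function triage(adv: {
  package: string; ecosystem: string; cvss: number; description: string; publishedAt: Date;
}): Promise<Verdict> {
  if (adv.cvss < 7 || !TARGET_ECOSYSTEMS.has(adv.ecosystem)) return "drop";

  // Buzz check: HN stories mentioning the package, posted after the advisory
  // date (with the six-hour grace window), so stale coverage never re-triggers.
  const since = Math.floor(adv.publishedAt.getTime() / 1000) - 6 * 3600;
  const res = await fetch(
    `https://hn.algolia.com/api/v1/search?query=${encodeURIComponent(adv.package)}` +
    `&tags=story&numericFilters=created_at_i>${since}`
  );
  const { hits } = (await res.json()) as { hits: { title: string }[] };
  const buzz = hits.some((h) => h.title?.toLowerCase().includes(adv.package.toLowerCase()));

  // High-reach fallback: popular package AND a supply-chain keyword in the description.
  const desc = adv.description.toLowerCase();
  const highReach = HIGH_REACH.has(adv.package) && SUPPLY_CHAIN_WORDS.some((w) => desc.includes(w));

  return buzz || highReach ? "fire" : "park";
}
```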
Advisories that pass CVSS but lack buzz or high-reach status get parked in a pending state and re-checked on every poll for 12 hours. Most expire silently. The few that gain coverage during that window almost always become real incidents, and the parking lot is what kept the bot quiet on the TanStack precursor advisories the morning before the worm actually published. The cost of being early is a delay, but the cost of being noisy is the operator muting the channel.
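The parking lot itself is small. The only hard rule is the 12-hour window; where the state lives is an implementation detail, sketched here as a JSON file on disk:

```ts
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Pending advisories: parked on disk (no in-memory state), re-triaged on every
// poll, dropped silently once they are 12 hours old. File name and shape are
// illustrative.
interface Pending { ghsaId: string; firstSeen: number; }

const PENDING_PATH = "state/pending.json";
const TTL_MS = 12 * 60 * 60 * 1000;

export function loadPendingForRecheck(now = Date.now()): Pending[] {
  const parked: Pending[] = existsSync(PENDING_PATH)
    ? JSON.parse(readFileSync(PENDING_PATH, "utf8"))
    : [];
  const alive = parked.filter((p) => now - p.firstSeen <= TTL_MS); // the rest expire silently
  writeFileSync(PENDING_PATH, JSON.stringify(alive, null, 2));
  return alive; // caller re-runs triage on each of these
}
```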
The demo repo is the only place the search can actually find anything
The video shows Deep Search detecting the compromised package. To be detectable the vulnerability has to exist somewhere indexed, and a query that sweeps the index and returns zero hits is technically a working detection but essentially unusable as a demonstration. We needed a deterministic, controlled source of truth where every freshly disclosed package would already be present and indexable (and still safe).
The fix is a dedicated, private demo-targets repo indexed by our Sourcegraph instance, containing nothing but minimal manifests under incidents/<incident-id>/.
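For the @tanstack/zod-adapter incident, the manifest amounts to a single pinned dependency and nothing else; a representative version:

```json
{
  "name": "demo-tanstack-zod-adapter-2026-05-12",
  "private": true,
  "dependencies": {
    "@tanstack/zod-adapter": "1.166.15"
  },
  "scripts": {}
}
```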
Nothing runs from these files. install scripts are blank, CI is off, Dependabot is disabled, and the README documents that nothing in the repo should ever be installed or built. A fine-grained PAT with Pull requests R/W scoped to that one repo handles the auto-PR and auto-merge, kept separate from the broader GITHUB_TOKEN used by advisory-monitor for advisory reads. The blast radius if either token is compromised is bounded by what the token can actually do.
The first version of the auto-PR script embedded a little too honest of a description in the manifest itself ("description": "Demo manifest pinning X@Y for indexing. DO NOT INSTALL."), and Deep Search, correctly, identified the match as a synthetic indexing target and refused to report it as a real consumer. Moving the disclosure out of the manifest body and into the repo README restored the signal: the name: demo-<id> and private: true fields still tell a careful reader what they are looking at, but they no longer give Deep Search a clean rule to discard the match on. The lesson is the boring one: when you are using one agent's output to validate another agent's behavior, the framing the second agent sees has to match the framing real code would have.
The video editor has to know which seconds are worth watching
Deep Search streams reasoning, and we want to show how Deep Search works. It is also, for stretches, just a spinner sitting on the screen while the model thinks. A raw screen capture of a single query runs around 210 seconds for a result that has perhaps 19 seconds of actual content. Posting a three-and-a-half-minute video to a social thread is asking for it to be closed at the ten-second mark.
When I floated this problem to a marketing colleague, saying I was working on getting an agent to do the editing pass automatically, my colleague was a bit skeptical. But thankfully, you can wire up an automated pipeline super easily these days, and what worked was actually pretty old-fashioned: an ffmpeg pipeline that turns "the boring parts" into a numerical signal, and a small Node script that decides which seconds survive.
The first step finds content changes. Run ffmpeg with boxblur=10:1,select='gt(scene,0.003)',showinfo over the source capture; the blur smooths small localized motion (the spinner) so it doesn't trip scene detection, and the 0.003 threshold is deliberately low so character-by-character text streaming registers. Parsing pts_time:X.XXX out of the showinfo output yields a list of timestamps where something visually meaningful changed.
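In Node, that step is one ffmpeg invocation and a regex over its output (a sketch; the filter string is the one above, and showinfo logs to stderr, so that is the stream to parse):

```ts
import { spawn } from "node:child_process";

// Run the blur + scene-change + showinfo filter over the raw capture and
// collect the pts_time of every frame that passed the scene threshold.
export function detectChangeTimestamps(sourcePath: string): Promise<number[]> {
  const filter = "boxblur=10:1,select='gt(scene,0.003)',showinfo";
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", ["-i", sourcePath, "-vf", filter, "-f", "null", "-"]);
    let log = "";
    ff.stderr.on("data", (chunk) => (log += chunk));
    ff.on("close", (code) => {
      if (code !== 0) return reject(new Error(`ffmpeg exited with ${code}`));
      const times = [...log.matchAll(/pts_time:([\d.]+)/g)].map((m) => parseFloat(m[1]));
      resolve(times);
    });
  });
}
```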
Next, cluster those events into bursts. Two consecutive timestamps less than one second apart belong to the same burst. Real activity (typing the prompt, streaming results) produces dense bursts of five or more events. Keyframe artifacts and pixel jitter in the dead zone do not cluster densely.
Then anchor-filter the bursts. Some artifact pairs happen to fire within a second of each other, producing fake two- or three-event "bursts" sprinkled through the static portion of the recording. The two-tier rule that separated signal from noise reads as follows. A burst with five or more events is an anchor and is always kept. A smaller burst is kept only if it falls within 15 seconds of an anchor and before the last anchor's end. The "before the last anchor's end" clause is the one that lets the editor cut everything that happens after Deep Search stops streaming; without it, a stray two-event cluster firing 30 seconds into dead air pins another 30 seconds of dead air into the output.
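The clustering and the anchor rule together come to about thirty lines. A sketch, with the thresholds from above:

```ts
interface Burst { start: number; end: number; count: number; }

// Group change timestamps into bursts: consecutive events less than one
// second apart belong to the same burst.
export function clusterBursts(times: number[], maxGapSeconds = 1): Burst[] {
  const bursts: Burst[] = [];
  for (const t of [...times].sort((a, b) => a - b)) {
    const last = bursts[bursts.length - 1];
    if (last && t - last.end < maxGapSeconds) {
      last.end = t;
      last.count += 1;
    } else {
      bursts.push({ start: t, end: t, count: 1 });
    }
  }
  return bursts;
}

// Two-tier rule: bursts with five or more events are anchors and always
// survive; smaller bursts survive only if they sit within 15 seconds of an
// anchor and do not start after the last anchor ends.
export function anchorFilter(bursts: Burst[], minAnchor = 5, nearSeconds = 15): Burst[] {
  const anchors = bursts.filter((b) => b.count >= minAnchor);
  if (anchors.length === 0) return [];
  const lastAnchorEnd = Math.max(...anchors.map((a) => a.end));
  return bursts.filter(
    (b) =>
      b.count >= minAnchor ||
      (b.start <= lastAnchorEnd &&
        anchors.some((a) => b.start <= a.end + nearSeconds && a.start <= b.end + nearSeconds))
  );
}
```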
The last step is the ffmpeg select filter itself. For each surviving burst, keep [burst.start, burst.end] and a [burst.end, burst.end + 2s] tail. Add a 2s freeze on the final frame so the result stays readable, then pipe through setpts=N/(25*TB) to re-time the kept frames into a continuous 25fps stream. There's probably an alternative approach you could do with a bitmap threshold, but this works just fine so far.
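A sketch of that final render, leaving the freeze frame and title cards out for brevity: build the select expression from the surviving bursts and hand it back to ffmpeg.

```ts
import { spawn } from "node:child_process";

// Keep each surviving burst plus a two-second tail, then re-time the kept
// frames into a continuous 25fps stream with setpts.
export function renderCut(sourcePath: string, outPath: string, bursts: { start: number; end: number }[]) {
  const keep = bursts
    .map((b) => `between(t,${b.start.toFixed(3)},${(b.end + 2).toFixed(3)})`)
    .join("+");
  const filter = `select='${keep}',setpts=N/(25*TB)`;
  return new Promise<void>((resolve, reject) => {
    const ff = spawn("ffmpeg", ["-y", "-i", sourcePath, "-vf", filter, "-an", outPath]);
    ff.on("close", (code) => (code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`))));
  });
}
```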
For the @tanstack/zod-adapter recording, this pipeline took a 210-second raw capture down to a 19-second content portion, or 35 seconds with the title and end cards attached. Typing plays at natural speed, streaming plays at natural speed, and every long spinner pause cuts to a hard two seconds.
Figure 4. The 35-second auto-cut MP4 for the @tanstack/zod-adapter incident: title card, the Deep Search prompt typing in, streaming reasoning, the vulnerable repo surfaced, end card. Source video and edited cut both live under supply-chain-playbook/tanstack-zod-adapter-2026-05-12/.
And here is the blog scaffold the bot writes to disk for the same incident, verbatim. The operator's review pass tightens prose; the queries and version metadata are derived from the incident descriptor and ship without manual entry.
## Am I affected?

### Starting with Deep Search
Paste this into [Deep Search](https://sourcegraph.com/deep-search):
> Do any repositories use @tanstack/zod-adapter version 1.166.15? Context: Malware
> in @tanstack/* packages exfiltrates cloud credentials, GitHub tokens, and SSH
> keys. Are any non-affected repositories employing version pinning to mitigate
> the issue? They should pin to 1.166.16 until a clean release is available.
### From Deep Search to Code Search queries
Deep Search gives you the overview. When you need exhaustive coverage, take what
it found and write targeted Code Search queries. Here are three of the five we used
(queries 2 and 4 follow the same shape):
#### 1. Detect explicit installs of compromised versions
```
("@tanstack/zod-adapter": "1.166.15") file:(package.json)
```
**Result:** Zero matches means no one pinned to the compromised versions directly.
#### 3. Find vulnerable patterns: range-based dependencies
```
"@tanstack/zod-adapter" ("^" OR "~" OR ">=" OR ">") file:(package.json)
```
**Result:** Check if the range includes the affected versions.
#### 5. Confirm defensive patterns
```
"@tanstack/zod-adapter": "1.166.16" file:(package.json)
```
**Result:** These repos have already mitigated the risk.
What lives in the operator's head, and what doesn't
The operator's day-to-day for a fired incident reduces to: read the Slack alert, decide whether the incident is worth covering, react 🚀, wait five minutes, read the drafts, tighten the tone, and publish. They no longer scrape together five Code Search queries by hand, no longer screen-record their own Deep Search session, no longer hand-write a LinkedIn post against a vulnerability they were looking at for the first time forty minutes ago. The work that survived the automation is the work the operator has comparative advantage on: judgment about whether the incident matters to our readers, and prose that sounds like a person (I swear this is at least partly human written, and entirely human reviewed :) ).
Every step of the pipeline is its own script, so editing the Deep Search prompt template, the burst-detection thresholds, or the social-post templates is a single file change. The next 🚀 reaction fires fresh subprocesses that read the updated code straight from disk. There is no service to restart and no deploy to wait on. The architecture would not scale to many concurrent incidents, but in practice supply chain incidents don't fire concurrently, and the operator is the bottleneck either way.
Next: a resource hub, and a bot that does the boring half of the response too
The bot covers disclosure-to-publication. The work that comes before disclosure (pattern-detection sweeps over indexed code for behaviors that look like the previous worm even when no advisory has been filed) is the obvious next layer, and is what Mythos-class models will plausibly be doing on the offensive side regardless. We are building a public Security Alerts Resource Hub on sourcegraph.com that collects the Deep Search prompts, Code Search queries, and pinned-vs-exposed categorizations for every incident the bot has run through, so the next responder, internal or not, starts from a query that has already been tested against real code. The first entries are LiteLLM, the TanStack worm, and the axios SSRF disclosure from March. The point is to make the queries reusable artifacts rather than one-shot screenshots, so that when the next Shai-Hulud variant publishes at 3 a.m. on a Sunday, the answer to "where is this in our code" is a copy-paste away from a working query someone has already shaped.
Related reading
This is one of several Slack bots we've built at Sourcegraph. Two earlier write-ups cover different angles of the same pattern: