Methodology · v1
How biotech.today works
Every number on this site is derived from a specific source, rolled up with an explicit formula, and published with known limitations. If we can't defend it, we don't display it. This page is the reference — chart tooltips in BT Pro deep-link here.
Last updated: April 18, 2026 · Accuracy log
1. Data sources
We ingest from eight channels, each of them public-domain, permissively licensed, or fair-use. Everything that lands in our platform (and everything we can redistribute via BT Pro) is derived from these. Paid third-party databases (Crunchbase, PitchBook, family-office directories) are used only for editorial prep, never inside the platform.
| Channel | Coverage | Licence | Update |
|---|---|---|---|
| PubMed / bioRxiv / medRxiv | Peer-reviewed + preprint papers in life-sciences | Public domain (PubMed) / CC-BY (bioRxiv) | Daily + weekly deep refresh |
| ClinicalTrials.gov | All registered clinical trials worldwide | Public domain (US government) | Daily |
| SEC EDGAR | 8-K, 10-K, 10-Q, S-1, 424B, SC 13D, Form D for US-public biotechs | Public domain (US government) | Daily |
| NIH RePORTER | NIH grants (R01, R21, SBIR/STTR, U01, P01, etc.) | Public domain (US government) | Weekly |
| GitHub | Bio-niche repos ≥ 100 stars (releases, stars, commits, contributors) | API terms (we attribute) | Daily |
| News RSS | FierceBiotech, Endpoints, BusinessWire biotech, PR Newswire health | Publisher RSS feeds (excerpt + link) | Daily |
| X / Twitter (via Apify) | Curated People handles (~500 anchors; scaling to 10K) | Derived analytics only — raw tweets not redistributed | 2× daily |
| Podcast RSS + transcripts | Top 15 biotech + AI-biotech shows (metadata first, then Whisper transcripts) | Publisher RSS + fair-use transcript excerpts | Per episode |
4. Entity linking
We match named entities in ingested content with a three-layer pipeline:
- Rule-based regex matcher over an alias table. Minimum 5-character alias length, whole-word boundary required, case-insensitive.
- Alias blocklist filters common false positives: generic English words that overlap with brand names (`gold`, `cell`, `data`, `trial`, `model`, `human`). See `ALIAS_BLOCKLIST` in `packages/data-pipeline/jobs/pubmed.py`.
- LLM validator (Claude Haiku 4.5) on low-confidence matches: asks "is this actually about this entity?" before the signal is counted toward mindshare. Validator results are auditable.
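The first two layers can be illustrated with a minimal sketch. It assumes the alias table is a simple alias-to-entity mapping; the table contents, entity IDs, and the function names `build_matchers` and `match_entities` are hypothetical and are not taken from the production job. The LLM validator layer is left out, because how match confidence is scored is not described here.

```python
import re

MIN_ALIAS_LEN = 5  # minimum alias length stated above

# Blocklist words as listed above; the production list may be longer.
ALIAS_BLOCKLIST = {"gold", "cell", "data", "trial", "model", "human"}

# Hypothetical alias table (alias -> entity id), for illustration only.
ALIAS_TABLE = {
    "Moderna": "company:moderna",
    "Recursion": "company:recursion",
    "Model": "company:hypothetical-model",  # dropped: on the blocklist
    "BNTX": "company:biontech",             # dropped: shorter than 5 characters
}


def build_matchers(alias_table):
    """Compile one whole-word, case-insensitive pattern per usable alias."""
    matchers = []
    for alias, entity_id in alias_table.items():
        if len(alias) < MIN_ALIAS_LEN:
            continue  # too short to match reliably
        if alias.lower() in ALIAS_BLOCKLIST:
            continue  # generic English word, likely false positive
        pattern = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
        matchers.append((pattern, entity_id))
    return matchers


def match_entities(text, matchers):
    """Return the set of entity ids whose alias appears in the text."""
    return {entity_id for pattern, entity_id in matchers if pattern.search(text)}


matchers = build_matchers(ALIAS_TABLE)
hits = match_entities("MODERNA shared new trial data from its model.", matchers)
```

In this sketch the blocklist and length checks run when the matchers are built, so a blocked alias never produces a match in the first place. Whether the production pipeline filters before or after matching is not stated.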
5. Known limitations
- Company coverage is uneven. Our PubMed co-mention matcher and SEC EDGAR filings together cover well-published and US-listed companies (Moderna, BioNTech, Vertex, Relay, Recursion, Arvinas, Intellia, etc.) but miss private non-US companies that don't appear in English-language papers. International registry ingest (UK Companies House, Swiss ZEFIX, Japan NTA, OpenCorporates) is rolling out to close this gap.
- People coverage is paper-heavy. Scientists with prolific publication records (David Baker, John Jumper, Andrew Hopkins) are well-represented; operators and investors who talk on Twitter but don't co-author papers are not. X / Twitter ingestion via Apify is closing this gap. Until then, the People leaderboard is temporarily unpublished.
- Per-signal noise exists. Around 1–3% of signals hit our alias blocklist or LLM-validator filter after the fact. We publish a weekly accuracy log with the corrections (see below).
- Real-time latency. Ingestion runs 2–4× daily on scheduled jobs; most signals land in the platform within 6 hours of their source publication. Real-time streaming is not on the roadmap — this is an intelligence product, not a news ticker.
6. Accuracy commitments
We publish an accuracy log every week, listing false-positive signals that were caught (by the automated blocklist, the LLM validator, or manual review) and removed. This is radical transparency: honesty is the moat.
If you spot a data error on any BT Pro chart, entity page, or newsletter, please email corrections@biotech.today. We respond within 1 business day and publish corrections in the weekly log.
Latest accuracy log
7. What we redistribute, what we don't
Every signal in BT Pro is either (a) public-domain source data, (b) data we ingested under a permissive licence with attribution, or (c) our own derivation (rankings, trend detection, editorial analysis). Raw third-party content that can't be redistributed (e.g., full tweet text from X via Apify) is stored only internally — the platform publishes aggregate analytics derived from it (counts, rankings, mindshare scores, "highlighted by N people" labels), never the raw content itself. This mirrors Kaito's data model: you're paying for our analysis, not a proxy of someone else's feed.
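As a rough illustration of the raw-versus-derived split, the sketch below turns internally stored mention records into publishable aggregates. The record fields (`handle`, `entity_id`, `text`) and the function name `derive_aggregates` are assumptions made for this example, not the platform's actual schema.

```python
from collections import defaultdict


def derive_aggregates(raw_mentions):
    """Reduce raw mention records to counts and labels; raw text is never copied."""
    people_by_entity = defaultdict(set)
    mention_counts = defaultdict(int)
    for mention in raw_mentions:
        entity_id = mention["entity_id"]
        people_by_entity[entity_id].add(mention["handle"])  # distinct people only
        mention_counts[entity_id] += 1

    aggregates = {}
    for entity_id, handles in people_by_entity.items():
        n = len(handles)
        noun = "person" if n == 1 else "people"
        aggregates[entity_id] = {
            "mentions": mention_counts[entity_id],
            "label": f"highlighted by {n} {noun}",
        }
    return aggregates
```

The output holds only counts and labels keyed by entity, so neither the tweet text nor the individual handles leave internal storage.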
8. Contact
Questions, corrections, or licensing: hello@biotech.today. This page is versioned; major methodology changes are announced in the newsletter.