Methodology · v1

How biotech.today works

Every number on this site is derived from a specific source, rolled up with an explicit formula, and published with known limitations. If we can't defend it, we don't display it. This page is the reference — chart tooltips in BT Pro deep-link here.

Last updated: April 18, 2026 · Accuracy log

1. Data sources

We ingest from eight channels, each of them public-domain, permissively licensed, or used under fair use. Everything that lands in our platform (and everything we can redistribute via BT Pro) is derived from these channels. Paid third-party databases (Crunchbase, PitchBook, family-office directories) are used only for editorial prep, never inside the platform.

Channel	Coverage	Licence	Update
PubMed / bioRxiv / medRxiv	Peer-reviewed and preprint life-sciences papers	Public domain (PubMed) / CC-BY (bioRxiv)	Daily + weekly deep refresh
ClinicalTrials.gov	Clinical trials registered with ClinicalTrials.gov, including studies run outside the US	Public domain (US government)	Daily
SEC EDGAR	8-K, 10-K, 10-Q, S-1, 424B, SC 13D, Form D for US-public biotechs	Public domain (US government)	Daily
NIH RePORTER	NIH grants (R01, R21, SBIR/STTR, U01, P01, etc.)	Public domain (US government)	Weekly
GitHub	Bio-niche repos ≥ 100 stars (releases, stars, commits, contributors)	API terms (we attribute)	Daily
News RSS	FierceBiotech, Endpoints, BusinessWire biotech, PR Newswire health	Publisher RSS feeds (excerpt + link)	Daily
X / Twitter (via Apify)	Curated People handles (~500 anchors; scaling to 10K)	Derived analytics only — raw tweets not redistributed	2× daily
Podcast RSS + transcripts	Top 15 biotech + AI-biotech shows (metadata first, then Whisper transcripts)	Publisher RSS + fair-use transcript excerpts	Per episode

2. Mindshare score

Mindshare is our composite attention metric. For each tracked entity (Tool · Modality · Company · Person · Trial · Paper · Podcast · Conference) on each day, we compute:

raw_score(e, day) = Σ  signal.weight × exp(-age_days / τ)   for signal.event_date ≤ day
τ = 14 days   (trend half-life ≈ 9.7 days)
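
A minimal sketch of the decay-weighted sum above, for readers who want to reproduce a score by hand. The `Signal` record and the function signature are assumptions made for illustration, not the pipeline's actual code; only the formula and τ = 14 days come from this page.

```python
import math
from dataclasses import dataclass
from datetime import date

TAU_DAYS = 14.0  # decay constant τ from the formula above


@dataclass(frozen=True)
class Signal:
    """Hypothetical record shape: one weighted event attached to an entity."""
    weight: float      # per-channel weight, e.g. 1.0 for a paper
    event_date: date


def raw_score(signals, day, tau=TAU_DAYS):
    """Sum weight * exp(-age_days / tau) over signals dated on or before `day`."""
    total = 0.0
    for s in signals:
        age_days = (day - s.event_date).days
        if age_days < 0:
            continue  # event_date is after `day`, so it does not count yet
        total += s.weight * math.exp(-age_days / tau)
    return total


# Half-life implied by exponential decay: tau * ln(2)
half_life_days = TAU_DAYS * math.log(2)

signals = [
    Signal(1.0, date(2026, 4, 18)),   # same-day paper, no decay
    Signal(0.5, date(2026, 4, 4)),    # 14-day-old news item, decayed by exp(-1)
    Signal(0.9, date(2026, 4, 20)),   # future-dated, ignored
]
print(round(raw_score(signals, date(2026, 4, 18)), 3))  # 1.184
print(round(half_life_days, 1))                         # 9.7
```

The half-life figure follows directly from the decay term: a signal's contribution halves when exp(-age_days / τ) = 0.5, that is, at age_days = τ · ln 2.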

We then normalise within-type:

share_pct(e, day) = 100 × raw_score(e, day) / Σ raw_score(e', day)   for e' ∈ type

All shares in a given entity type sum to 100% on a given day. This is deliberate: we rank within-type rather than cross-type so Tools aren't competing for attention against People or Companies. The per-channel signal.weight values live in config.yaml and are tuneable without re-ingesting data.
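
The within-type normalisation can be sketched as follows. The function name, the dict-based input, and the zero-total fallback are assumptions for illustration; the scaling by 100 is chosen so that shares sum to 100%, as stated above.

```python
def share_pct(raw_scores):
    """Convert {entity: raw_score} for ONE entity type on ONE day into percentage shares."""
    total = sum(raw_scores.values())
    if total == 0:
        # Assumption: with no signals at all, every entity gets a 0% share.
        return {entity: 0.0 for entity in raw_scores}
    return {entity: 100.0 * score / total for entity, score in raw_scores.items()}


# Hypothetical raw scores for three tools on the same day
tools = {"tool_a": 3.0, "tool_b": 1.0, "tool_c": 0.0}
print(share_pct(tools))  # {'tool_a': 75.0, 'tool_b': 25.0, 'tool_c': 0.0}
```

Because the denominator only covers entities of the same type, a Tool's share is unaffected by how much attention People or Companies receive that day.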

Current channel weights: papers 1.0 · preprints 0.7 · trials 0.9 · github 0.6 · news 0.5 · grants 0.65 · twitter 0.3 · patents 0.8.

3. Development mindshare (GitHub)

For the Tool category, we also compute a Development Mindshare composite, a code-level companion to the public-attention mindshare above. Per tracked repo:

dev_score(repo) =
    w1 · log(stars)
  + w2 · commits_per_month
  + w3 · unique_contributors_per_month
  + w4 · star_velocity_30d

We index every bio-niche repo with ≥ 100 stars (topics include bioinformatics, drug-discovery, protein-folding, cheminformatics, molecular-dynamics, rna-seq, single-cell, generative-chemistry). Weights live in config.yaml:github.dev_mindshare_weights; current defaults are 1.0 / 0.4 / 0.3 / 0.2.
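
A small sketch of the composite with the default weights quoted above. Two points are assumptions, since this page does not specify them: `log` is taken to be the natural logarithm, and the four inputs are used as raw, unscaled values. The function signature is hypothetical.

```python
import math

# Default w1..w4 quoted above (config.yaml:github.dev_mindshare_weights)
DEFAULT_WEIGHTS = (1.0, 0.4, 0.3, 0.2)


def dev_score(stars, commits_per_month, unique_contributors_per_month,
              star_velocity_30d, weights=DEFAULT_WEIGHTS):
    """Weighted sum from the dev_score formula; assumes natural log and raw inputs."""
    w1, w2, w3, w4 = weights
    return (
        w1 * math.log(stars)                    # stars >= 100 for indexed repos, so log is defined
        + w2 * commits_per_month
        + w3 * unique_contributors_per_month
        + w4 * star_velocity_30d
    )


# Hypothetical repo: 100 stars, 10 commits/month, 5 contributors/month, +20 stars in 30 days
print(round(dev_score(100, 10, 5, 20), 2))  # 14.11
```

Taking the logarithm of the star count keeps very large repos from dominating the score purely on accumulated stars, so recent activity still moves the ranking.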

4. Entity linking

We match named entities in ingested content with a three-layer pipeline:

Rule-based regex matcher over an alias table. Minimum 5-character alias length, whole-word boundary required, case-insensitive.
Alias blocklist filters common false positives — generic English words that overlap with brand names (gold, cell, data, trial, model, human). See ALIAS_BLOCKLIST in packages/data-pipeline/jobs/pubmed.py.
LLM validator (Claude Haiku 4.5) on low-confidence matches — asks "is this actually about this entity?" before the signal is counted toward mindshare. Validator results are auditable.
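
The first two layers can be sketched with the standard `re` module. This is an illustration under stated assumptions, not the code in packages/data-pipeline/jobs/pubmed.py: the function names, the alias-table shape, and the entity IDs are hypothetical, and the blocklist below contains only the example words quoted above. The LLM validator layer is out of scope for this sketch.

```python
import re

MIN_ALIAS_LEN = 5  # minimum alias length stated above
# Example words quoted above; the full list lives in ALIAS_BLOCKLIST.
BLOCKLIST = {"gold", "cell", "data", "trial", "model", "human"}


def build_matcher(alias_table):
    """Compile one case-insensitive, whole-word regex from {alias: entity_id}."""
    usable = {
        alias.lower(): entity_id
        for alias, entity_id in alias_table.items()
        if len(alias) >= MIN_ALIAS_LEN and alias.lower() not in BLOCKLIST
    }
    if not usable:
        return None, usable
    # Longest aliases first so a longer alias wins over its own prefix.
    alternation = "|".join(
        re.escape(alias) for alias in sorted(usable, key=len, reverse=True)
    )
    # (?<!\w) and (?!\w) enforce whole-word boundaries on both sides.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return pattern, usable


def match_entities(text, pattern, usable):
    """Return the set of entity IDs whose aliases appear as whole words in `text`."""
    if pattern is None:
        return set()
    return {usable[m.group(0).lower()] for m in pattern.finditer(text)}


# Hypothetical alias table; "Model" is dropped by the blocklist.
table = {"Moderna": "company:moderna", "Relay": "company:relay", "Model": "company:model"}
pattern, usable = build_matcher(table)
text = "MODERNA shares rose; a new model from Relay."
print(sorted(match_entities(text, pattern, usable)))
# ['company:moderna', 'company:relay']
```

In this sketch the blocklist is applied when the matcher is built, so blocked aliases never produce candidate matches at all; filtering matches after the fact would be an equally valid design.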

5. Known limitations

Company coverage is uneven. Together, our PubMed co-mention matcher and SEC EDGAR filings cover well-published or US-listed companies (Moderna, BioNTech, Vertex, Relay, Recursion, Arvinas, Intellia, etc.) but miss private non-US companies that don't appear in English-language papers. International registry ingest (UK Companies House, Swiss ZEFIX, Japan NTA, OpenCorporates) is rolling out to close this gap.
People coverage is paper-heavy. Scientists with prolific publication records (David Baker, John Jumper, Andrew Hopkins) are well-represented; operators and investors who are active on Twitter but don't co-author papers are not. X / Twitter ingestion via Apify is closing this gap. Until then, the People leaderboard remains unpublished.
Per-signal noise exists. Around 1–3% of signals are caught after the fact by our alias blocklist or LLM validator. We publish a weekly accuracy log with the corrections (see section 6).
Real-time latency. Ingestion runs 2–4× daily on scheduled jobs; most signals land in the platform within 6 hours of their source publication. Real-time streaming is not on the roadmap — this is an intelligence product, not a news ticker.

6. Accuracy commitments

We publish an accuracy log every week, listing false-positive signals that were caught (by the automated blocklist, the LLM validator, or manual review) and removed. This is radical transparency: honesty is the moat.

If you spot a data error on any BT Pro chart, entity page, or newsletter, please email corrections@biotech.today. We respond within 1 business day and publish corrections in the weekly log.

Latest accuracy log

7. What we redistribute, what we don't

Every signal in BT Pro is either (a) public-domain source data, (b) data we ingested under a permissive licence with attribution, or (c) our own derivation (rankings, trend detection, editorial analysis). Raw third-party content that can't be redistributed (e.g., full tweet text from X via Apify) is stored only internally — the platform publishes aggregate analytics derived from it (counts, rankings, mindshare scores, "highlighted by N people" labels), never the raw content itself. This mirrors Kaito's data model: you're paying for our analysis, not a proxy of someone else's feed.

8. Contact

Questions, corrections, or licensing: hello@biotech.today. This page is versioned; major methodology changes are announced in the newsletter.