Generative Design Comparison
ESM-2 vs ProGen2 vs EvoDiff: Protein Language Models & Generation (2026)
Last updated: 2026-04-16
Protein language models have become foundational tools in computational protein science. ESM-2 (Meta) provides rich learned representations used for everything from structure prediction to function annotation. ProGen2 (Salesforce) generates novel protein sequences autoregressively. EvoDiff (Microsoft) takes a different approach: discrete diffusion, which enables order-agnostic and motif-conditioned generation. These tools serve fundamentally different purposes.
Head-to-Head
Structured comparison across key dimensions.
| Dimension | ESM-2 | ProGen2 | EvoDiff |
|---|---|---|---|
| Primary purpose | Protein representation learning (embeddings for downstream tasks) | Autoregressive protein sequence generation | Diffusion-based protein sequence generation with conditioning |
| Architecture | Masked language model (BERT-style transformer) | Autoregressive language model (GPT-style transformer) | Discrete diffusion model (order-agnostic autoregressive diffusion, OADM) |
| Model sizes | 8M to 15B parameters (650M most commonly used) | 151M to 6.4B parameters | ~640M parameters (OADM model) |
| Training data | UniRef50/90 (~250M sequences) | UniRef90, BFD, OAS (>1B sequences including metagenomics and immune repertoires) | UniRef50 + evolutionary alignments (OpenFold MSAs) |
| Can generate sequences? | Limited — masked token infilling only (not designed for generation) | Yes — strong autoregressive generation with family/taxonomy conditioning | Yes — diffusion-based with motif scaffolding and inpainting |
| Zero-shot prediction | Excellent — variant effect prediction, contact maps, secondary structure from embeddings alone | Log-likelihood scoring for fitness; less validated than ESM-2 for zero-shot tasks | Not designed for zero-shot prediction tasks |
| Structure prediction | Yes — ESMFold uses ESM-2 embeddings (single-sequence, no MSA needed) | No — sequence only | No — sequence only (validate with ESMFold/AF2 downstream) |
| License | MIT | MIT | MIT |
| On Platform | Partial (via ESMFold) | No | No |
| Key limitation | Not a generative model; masked infilling ≠ coherent full-sequence design | Left-to-right generation only; cannot condition on internal motifs or do inpainting | Smaller model; generated proteins need downstream structural validation; less adopted than ESM-2 |
When to Use Each
ESM-2
You need protein embeddings for downstream ML tasks. You want zero-shot variant effect prediction. You need structure prediction (ESMFold). You're building a classifier, regressor, or search system on top of protein representations.
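ESM-2's zero-shot variant effect prediction is commonly done with a masked-marginal score: mask the mutated position, run the model, and compare the log-probabilities the masked LM assigns to the mutant and wild-type residues. The sketch below shows only that arithmetic; the toy log-probability table is a stand-in for a real ESM-2 forward pass (mask position, log-softmax over the amino-acid vocabulary), which would require the model weights.

```python
import math

# Toy masked-LM output: log-probabilities over amino acids at ONE masked
# position. In real use these come from ESM-2: mask position i, run the
# model, take log_softmax over the amino-acid vocabulary at that position.
toy_log_probs = {"A": math.log(0.50), "V": math.log(0.30), "G": math.log(0.20)}

def masked_marginal_score(log_probs, wt_aa, mut_aa):
    """Zero-shot variant effect score: log p(mutant) - log p(wild type)
    at the masked position. Negative means the model prefers wild type,
    i.e. the variant is predicted to be deleterious."""
    return log_probs[mut_aa] - log_probs[wt_aa]

score = masked_marginal_score(toy_log_probs, wt_aa="A", mut_aa="V")
print(round(score, 3))  # log(0.30 / 0.50) ≈ -0.511
```

The same score can be summed over positions to rank multi-mutant variants, which is how ESM-2 log-likelihoods are typically used for fitness ranking.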
ProGen2
You want to generate novel protein sequences. You need controllable generation conditioned on protein family, taxonomy, or function. You want the largest available autoregressive protein model (6.4B parameters).
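ProGen2 generates left to right: at each step the model produces logits over the amino-acid vocabulary, and the next residue is sampled (typically with temperature scaling). The sketch below shows that sampling loop with a dummy logits function standing in for the model's forward pass; `toy_next_token_logits` and `sample_sequence` are illustrative names, not ProGen2 API.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_token_logits(prefix):
    """Stand-in for an autoregressive forward pass (e.g. ProGen2):
    feed the tokenized prefix, read the last position's logits.
    Here we return deterministic dummy scores per candidate residue."""
    return [float(hash((prefix, aa)) % 97) / 10.0 for aa in AMINO_ACIDS]

def sample_sequence(prompt, length, temperature=0.8, seed=0):
    """Left-to-right sampling: temperature-scaled softmax over the
    vocabulary at each step, then draw the next residue."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(length):
        scaled = [l / temperature for l in toy_next_token_logits(seq)]
        m = max(scaled)  # subtract max for numerical stability
        weights = [math.exp(l - m) for l in scaled]
        seq += rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]
    return seq

generated = sample_sequence("M", length=20)
print(generated)
```

With the real model, conditioning on family or taxonomy amounts to prepending control tags to the prompt before this same loop runs.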
EvoDiff
You need motif-conditioned protein generation (scaffold a sequence around a fixed motif). You want order-agnostic generation (not left-to-right). You need sequence inpainting or infilling capabilities.
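The motif-scaffolding workflow EvoDiff enables can be sketched as order-agnostic unmasking: start from a fully masked sequence with the motif pinned in place, then fill the remaining positions one at a time in a random order. The uniform sampler below is a placeholder for the denoising network, and `scaffold_motif` is an illustrative name, not the `evodiff` package API.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"

def toy_fill(seq, pos, rng):
    """Stand-in for the denoiser: in EvoDiff's OADM, the model predicts a
    distribution over residues at `pos` conditioned on the partially
    unmasked sequence. Here we sample uniformly for illustration."""
    return rng.choice(AMINO_ACIDS)

def scaffold_motif(length, motif, motif_start, seed=0):
    """Order-agnostic generation around a fixed motif: motif positions
    stay pinned; all other positions are unmasked in random order."""
    rng = random.Random(seed)
    seq = [MASK] * length
    for i, aa in enumerate(motif):
        seq[motif_start + i] = aa
    free = [i for i, c in enumerate(seq) if c == MASK]
    rng.shuffle(free)  # order-agnostic: any unmasking order is valid
    for pos in free:
        seq[pos] = toy_fill(seq, pos, rng)
    return "".join(seq)

designed = scaffold_motif(length=30, motif="HEAAH", motif_start=12)
print(designed)
```

Inpainting is the same loop with a partially known sequence instead of a motif; either way, generated designs should be validated downstream with ESMFold or AlphaFold2, as the table notes.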
Practitioner Verdict
Use ESM-2 when you need protein embeddings for downstream tasks (structure prediction, function classification, variant effect prediction); it is the representation backbone of the field. Use ProGen2 for autoregressive protein sequence generation, especially when you want to control generation with taxonomic or functional conditioning. Use EvoDiff when you need diffusion-based generation with motif scaffolding or inpainting, capabilities that left-to-right autoregressive models cannot provide.