SSRG Web Corpus (v3)

A curated web corpus of 2.1B documents, each with provenance and a quality score. The corpus is cleaned, deduplicated, and license-annotated.

Download

corpus manifest (JSONL index of all shards)
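
Since the manifest is described as a JSONL index, it can presumably be read one JSON object per line. The following is a minimal sketch of that, not the corpus's documented schema: the helper `read_manifest` and the field names `shard` and `num_docs` are hypothetical and used only for illustration.

```python
import json
from typing import Iterable, Iterator


def read_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample records; the real manifest's fields may differ.
sample = [
    '{"shard": "shard-00000", "num_docs": 3}',
    '',
    '{"shard": "shard-00001", "num_docs": 5}',
]

entries = list(read_manifest(sample))
total_docs = sum(e["num_docs"] for e in entries)
print(len(entries), total_docs)
```

With a downloaded manifest, the same function could be given an open file handle instead of the sample list, since iterating a text file yields its lines.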

Source and related corpora