| Dataset Name | Total Tokens | Affordances | Limitations |
|---|---|---|---|
| PG (12,574 volumes) | 1,140,676,329 | large collection of open-access full text data; non-OCR | does not continue through end of twentieth-century; non-uniform distribution over time; sampling criteria not well known; no genre differentiation; sparse pre-1800 and post-1940 |
| SC (1,711 volumes) | 217,854,521 | manually curated; contains only fiction; non-OCR; continuous from 1800 to 2000 | small size to measure historical change; non-uniform distribution over time; 19C and 20C sampling criteria not uniform |
| Hathi1M (1,671,370 pages) | 587,951,218 | large historically diverse sample; uniformly distributed over time; continuous from 1800 to 2000; differentiated by instrumentality, i.e. fiction and non-fiction | some error due to OCR; only page-level derived data; labels are based on predictive models not manual curation and thus contain some error |