| Corpus | Size | Words | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| PG | 12,574 volumes | 1,140,676,329 | - large collection of open-access full-text data<br>- non-OCR | - does not continue through the end of the twentieth century<br>- non-uniform distribution over time<br>- sampling criteria not well known<br>- no genre differentiation<br>- sparse pre-1800 and post-1940 |
| SC | 1,711 volumes | 217,854,521 | - manually curated<br>- contains only fiction<br>- non-OCR<br>- continuous from 1800 to 2000 | - small size for measuring historical change<br>- non-uniform distribution over time<br>- 19C and 20C sampling criteria not uniform |
| Hathi1M | 1,671,370 pages | 587,951,218 | - large, historically diverse sample<br>- uniformly distributed over time<br>- continuous from 1800 to 2000<br>- differentiated by instrumentality, i.e. fiction and non-fiction | - some error due to OCR<br>- only page-level derived data<br>- labels are based on predictive models, not manual curation, and thus contain some error |