Dataset (Size)	Total Tokens	Affordances	Limitations
PG 12,574 volumes	1,140,676,329	- large collection of open-access full text data - non-OCR	- does not continue through the end of the twentieth century - non-uniform distribution over time - sampling criteria not well known - no genre differentiation - sparse pre-1800 and post-1940
SC 1,711 volumes	217,854,521	- manually curated - contains only fiction - non-OCR - continuous from 1800 to 2000	- small size for measuring historical change - non-uniform distribution over time - 19C and 20C sampling criteria not uniform
Hathi1M 1,671,370 pages	587,951,218	- large historically diverse sample - uniformly distributed over time - continuous from 1800 to 2000 - differentiated by instrumentality, i.e. fiction and non-fiction	- some error due to OCR - only page-level derived data - labels are based on predictive models, not manual curation, and thus contain some error