by Lawrence Evalyn
As an interdisciplinary collaborator, I have learned that nothing alarms a computer scientist like saying you intend to manually repeat a trivial task 4,000 times. I am here today to insist that many tasks are only trivial for experts, and that human expertise should often be integrated as a core part of the methodological loop. I don’t mean data cleaning, which is generally focused on standardization (and to which similar principles apply), but data creation or annotation, in which new interpretive choices are made. Relying on expert judgment can be not only more reliable but also faster than automating or outsourcing.
Let me ground my argument in a particular example: assessing a probable gender for the author names attributed to roughly 52,000 works published in late eighteenth-century England. Because most eighteenth-century works were written by John, this worked out to only a few thousand unique names. I spent several hours, spread across several weeks, determining how un-automatable this task was. Simple lists of names and genders skewed modern, misunderstanding names like Brooke Boothby. The gender package in R is more nuanced, tying names to gender probabilities using census data — but its eighteenth-century data is all from Denmark and Iceland. It could identify “Karl” (7 titles) and “Olof” (1 title) — but not “George” (783 titles) or “Charles” (656 titles).
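To make the failure mode concrete, here is a minimal sketch in Python (not the R gender package itself, and with a hypothetical lookup table) of the kind of modern name-gender list described above, and how it goes wrong on eighteenth-century names:

```python
# Hypothetical modern name-gender lookup table. Real lists code "Brooke"
# as female because of its modern usage, which misreads the male baronet
# Brooke Boothby.
MODERN_NAME_GENDERS = {
    "brooke": "female",
    "george": "male",
    "charles": "male",
}

def guess_gender(author_name):
    """Guess a gender from the first token of an author name string."""
    first = author_name.split()[0].strip(".,").lower()
    return MODERN_NAME_GENDERS.get(first, "unknown")

print(guess_gender("Brooke Boothby"))  # "female" -- wrong for this author
print(guess_gender("Olof"))            # "unknown" -- missing from a modern list
```

The sketch is deliberately naive, but the more sophisticated tools fail in the same shape: whatever population their lookup data comes from determines which names they can see at all.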
Then there were the names which are not names. No existing code will be prepared for “M.G. Lewis, Esq.” or “Gentleman Present At The Time,” let alone “Amicable and Provident Society (Dunchurch, England).” You may hear the siren song of automation: “gentleman” could be coded male, “society” could be coded for organizations… but that siren song is a mistake. “Member of the Honourable Society of Lincoln’s Inn,” for example, is not an organization, and needs to be coded as a male author: this title asserts that the author is a barrister, a profession legally barred to women until 1919. As that example shows, it is not merely “human” intervention which is required; it is subject expertise. The creation of quite simple data required historically-specific knowledge about which titles and occupations were legally barred to women — reverend, burgess, captain, MP, MD, MA, and so on — and which would merely be unusual for women. That level of historical background knowledge is hard to automate or outsource.
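The siren song above can be written down in a few lines. This is a hypothetical Python sketch of such a keyword heuristic, not anything from the actual project, and its output shows exactly where it breaks:

```python
# Hypothetical keyword heuristic of the kind the "siren song" suggests:
# "society" -> organization, "gentleman" -> male.
def classify_attribution(attribution):
    """Classify an author attribution string with naive keyword rules."""
    text = attribution.lower()
    if "society" in text:
        return "organization"
    if "gentleman" in text:
        return "male"
    return "unknown"

# The rule fires on the word "society" and mislabels a male barrister
# as an organization:
print(classify_attribution("Member of the Honourable Society of Lincoln's Inn"))
# -> "organization", though this attribution marks a man
```

No reordering of the rules rescues the heuristic; deciding that “Society of Lincoln’s Inn” means a barrister, and that a barrister must be a man, is exactly the historical knowledge the keyword cannot carry.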
But why even try to outsource? I already know who Brooke Boothby is; it’s faster for me to write down what I know than for someone else to look him up. Doing the task by hand took a few hours, but done in bursts with some nice music and bubble tea, it was sometimes downright pleasant. The process introduced me to interesting texts like Edward Rushton’s Expostulatory letter to George Washington, on his continuing to be a holder of slaves. I took notes on my judgment calls, so I could be confident that all the interpretations were internally consistent. And I spent fewer total hours making these judgment calls by hand than I had previously spent looking for “smarter” solutions.
It’s awkward to identify the source of information as “a person who knew it”: we’re more comfortable with more complex sources of epistemological validity. But all data is already from “a person who knew it”: naming the person, and the value of their knowing, is a strict improvement, even if it exposes some shaky epistemological foundations. I often joke that I don’t work with “big data,” I work with artisanal hand-crafted farmer’s market data, but that joke elides the fact that even the “big” stuff is fundamentally hand-crafted. In machine learning, for example, neural nets are people all the way down. The problem is, everyone wants to do the model work, not the data work. But the data work is where new information comes from. Moreover, the more time-consuming and tedious the data work is, the more likely it is contributing something genuinely new and necessary.
This is not a call for better, more complex tools suited to historical materials. It is a call for more patience. It is easy to pass around the relevant xkcd, and smile knowingly at the reminder that automation does not always make tasks go faster. It’s harder to embrace the idea that gaining expertise in a topic does not place one “above” the labour of sitzfleisch. So the opportunity I present to you is: what key information does no one know because all of the people capable of putting the pieces together think it would be extremely boring work to find out?