<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "https://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.2" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">1832</journal-id>
      <journal-title-group>
        <journal-title>Journal of Cultural Analytics</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2371-4549</issn>
      <publisher>
        <publisher-name>Center for Digital Humanities, Princeton University</publisher-name>
      </publisher>
      <self-uri xlink:href="https://culturalanalytics.org/">Website: Journal of Cultural Analytics</self-uri>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">11830</article-id>
      <article-id pub-id-type="doi">10.22148/001c.11830</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Commentary</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Is there a text in my data? (Part 1): on counting words</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Gavin</surname>
            <given-names>Michael</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date publication-format="electronic" date-type="pub" iso-8601-date="2020-01-25">
        <day>25</day>
        <month>1</month>
        <year>2020</year>
      </pub-date>
      <pub-date publication-format="electronic" date-type="collection" iso-8601-date="2021-05-03">
        <year>2020</year>
      </pub-date>
      <volume>5</volume>
      <issue seq="4">1</issue>
      <issue-title>Articles in 2020</issue-title>
      <elocation-id>11830</elocation-id>
      <permissions>
        <license license-type="open-access">
          <ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">http://creativecommons.org/licenses/by/4.0</ali:license_ref>
          <license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">Creative Commons Attribution License (4.0)</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
        </license>
      </permissions>
      <self-uri content-type="pdf" xlink:href="https://culturalanalytics.org/article/11830.pdf"/>
      <self-uri content-type="xml" xlink:href="https://culturalanalytics.org/article/11830.xml"/>
      <self-uri content-type="json" xlink:href="https://culturalanalytics.org/article/11830.json"/>
      <self-uri content-type="html" xlink:href="https://culturalanalytics.org/article/11830"/>
      <abstract>
        <p>This essay is the first in a two-part series. This first installment invites readers to consider a few very basic questions: What does it mean to count words in a text? What happens to the text, and to our understanding of it, when we decompose it into a series of word counts? What relation exists between the textual domain and its numerical image? Or, to restate this question with a nod to literary critic Stanley Fish, “Is there a text in my data?” Following one document through a series of typical transformations – first into a simple list of words and their frequencies, then to a vector of elements in a matrix, and from there through the processes of normalization, dimensionality reduction, and analysis – this essay argues against the commonly held notion that counting words reduces complexity, suggesting instead that semantic models embed textual objects in highly complex structures that are extremely sensitive to historical context and subtle nuances in meaning. Word frequencies aren’t static, given things that simply exist in a text. They’re produced through the act of modeling, and the mathematical structures they imply dissolve both words and texts into elaborate systems of mutual interrelation.</p>
      </abstract>
      <kwd-group>
        <kwd>vector space models</kwd>
        <kwd>philosophy of language</kwd>
        <kwd>replication</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
