Reassembling the English novel, 1789-1919

The absence of an exhaustive bibliography of novels published in the British Isles and Ireland during the 19th century blocks several lines of research in sociologically-inclined literary history and book history. Without a detailed account of novelistic production, it is difficult to characterize, for example, the population of individuals who pursued careers as novelists. This paper contributes to efforts to develop such an account by estimating yearly rates of new novel publication in the British Isles and Ireland between 1789 and 1919. This period witnessed, in aggregate, the publication of between 40,000 and 63,000 previously unpublished novels. The number of new novels published each year counts as essential information for researchers interested in understanding the development of the text industry between 1789 and 1919.


Introduction
Aspirations in the 20th century for sociologically-inclined literary history foundered due to a lack of accessible, trustworthy, and inclusive bibliographies and biographical records.
Despite sustained interest, no principled estimates of the number of novelists writing or the number of new novels published during the 19th and early 20th centuries ever materialized (Sutherland 1988). Without a detailed accounting of novelistic production, numerous questions proved impossible to answer. The following three are representative: How many writers made careers as novelists? Are there unacknowledged precursors or forgotten rivals to canonical authors? To what extent is a writer's critical or commercial success predictable from their social origins? Although material traces of every novel published in Europe and North America survive, gathering the particulars required to answer questions such as these proved too time-consuming or too resource-intensive.
The lack of credible information about the population of novelists and the population of published novels obstructs research in literary studies, cultural studies, book history, and sociology of literature. Two communities in particular stand to gain from a more detailed accounting of these two populations. The first includes those interested in studying literary form and prose style from below. A characteristic concern of this group is how the emergence and diffusion of literary morphology reveals information about broader economic, social, and cultural relationships within and across national and linguistic situations (e.g., Escarpit (1958), Moretti (1998), Casanova (1999), and Moretti (2000a)). The second group includes researchers in cultural studies and sociology of culture interested in uniting literary history with sociological concerns. This group includes those interested in the working conditions facing novelists and those studying the history of occupational gender segregation in the text industry (e.g., R. Williams (1965), Tuchman (1989)). This group also includes those interested in reassembling an understanding of literary artworks as products of networks of actors whose actions are necessary for works' existence and whose actions, in turn, shape the art objects (Becker 1995, p. xii). Library digitization and sharing of machine-readable datasets are two developments which support research agendas associated with these communities. More generally, these developments facilitate studying literary works at multiple scales and with a broader range of vocabularies.
To demonstrate the improving prospects for data-intensive, sociologically-inclined literary history (enabled by the availability of digital surrogates of surviving volumes and the sharing of machine-readable bibliographic data), this paper estimates the yearly rates of new novel publication in the British Isles and Ireland between 1789 and 1919. This period witnessed, in aggregate, the publication of between 40,000 and 63,000 previously unpublished novels ("new novels"). Although there has been considerable speculation about this time series, ours are the first principled estimates to be published. The years studied include the rise of mass literacy and one of the more important periods in the history of publishing (1830-1850), a period during which practices and institutional arrangements resembling the modern publishing industry emerge (Raven 2007, pp. 328-329).
The analysis presented here is limited to literary production on islands in the North Atlantic.
Although the prospect of comparative research was a primary motivation for this work, a lack of comprehensive bibliographical records outside the British Isles and Ireland made such research difficult. The exhaustive bibliography of novels published between 1770 and 1836 found in Raven and Forster (2000) and [...] (hereafter "RFGS"), indispensable to the work here, has no real equivalent. (For example, although Brümmer (1884) is impressive in the number of German-language titles it documents, like Block (1961), it makes no claims to have enumerated all titles published.) Bibliographic work on novels written in languages other than English is, however, ongoing, and library digitization makes the work easier. The estimates presented here also provide information about plausible trajectories of literary production elsewhere. For example, because it is hard to imagine per capita novelistic production growing considerably faster than it did in the British Isles and Ireland during the 1840s, the pace of growth during this decade may be used as an estimate of the upper bound on the pace of growth in established text industries in other regions.

Rise of the Text Industry
No comprehensive survey of new novels published in the British Isles and Ireland exists for any year after 1836: there is neither an exhaustive list of new novels published nor a principled estimate of their number. Given the pace of expansion in the publishing industry during the period and the time and resources required to complete exhaustive surveys (such as RFGS), this is understandable.[1] The absence of information about novels published after 1836 is regrettable because this period witnesses the rise of mass literacy and sees the publishing industry adopt practices and organizational structures characteristic of the modern text industry (Raven 2007, pp. 328-329). What little information we have about the population of literary works published after 1836 relies on inferences drawn from the heterogeneous population of published books (novels and non-novels, new and reissued) (Weedon 2003; Eliot 1997). Even here, however, the information is not detailed enough to allow us to estimate the number of novels (new or reprinted) published during any year or decade.
In this paper we estimate rates of novelistic production for each year between 1789 and 1919 from five existing data sources using a probabilistic model. In addition to annual publication counts, the data permit us to estimate the proportion of new titles associated with men and women authors. Although we do not directly observe the number of new novels published in any year after 1836 (or new novels by author gender after 1829), we infer credible intervals through the use of a model of several correlated time series. Our results make visible, for the first time, a period of particularly intense growth between 1840 and 1855.

Background
There are bibliographies and related resources that purport to provide information about new novels published during specific periods of the 19th century. Most are unusable. Typical are bibliographies of a period or novel subgenre which for one reason or another are not exhaustive. Block (1961) is one example (Garside, Raven, and Schöwerling 2000, p. 2). There are, however, a small number of works which are exhaustive for a period or genre and do provide information usable by those interested in an inclusive history of the novel and of novel writing. Bassett (2008), for example, enumerates all three-volume editions appearing between 1863 and 1897. RFGS, mentioned earlier, enumerates all novels published between 1770 and 1836. RFGS also helpfully makes clear how its compilers go about the essential task of distinguishing novels from non-novels (Garside, Raven, and Schöwerling 2000).
For those interested in an inventory of new novels published during the 19th century, the most useful information comes from historians of publishing. (With notable exceptions, including Escarpit (1958), Moretti (1982), Moretti (1998), and Moretti (2000b), literary historians working after 1950 have not pursued an inclusive history of the novel, one which would include all novels and novelists.) Working with a machine-readable version of the Nineteenth-Century Short Title Catalog (NSTC), Eliot (1997) creates a time series which provides information about the number of books published in London, Oxford, Cambridge, Edinburgh, and Dublin each year between 1801 and 1870.[2] Until an integrated history of the English novel and the book trade is written, this series will be invaluable. It helps us in two specific ways. First, it provides a crude upper bound on the number of new novels published each year, as the number of new novels will always be less than the number of books (novels and non-novels) appearing in a given year. Second, because the rate of book production and the rate of new novel production are correlated, the time series gives us considerable insight into how the rate of new novel production likely changed from year to year.
[1] There are many challenges associated with assembling an exhaustive list. A small number of books are published but never advertised in industry publications such as Publishers' Circular. In other cases, novels may be advertised but never published, or published under a different title. Bibliographic work is complicated further by the fact that in a (very) small number of cases, no copies of a novel survive.
The two most important resources used to estimate the rate of novelistic production are RFGS and a series derived from the Nineteenth-Century Short Title Catalog (NSTC). Three other resources used in the model, which tend to cover shorter periods, are introduced in the next section.

Method
We estimate annual rates of novelistic production from five data sources using a probabilistic model. The model assumes that changes in the pace of novelistic production are well described by exponential growth with transitory deviations. Using the model and available data we infer the pace of growth and the character of deviations. Taken together these inferences permit us to estimate the number of novels published each year between 1789 and 1919. In this section we first describe the resources used and then elaborate the model.

The English Novel, 1770-1836 ("RFGS")
The most important source of information is The English Novel, an exhaustive survey of novels appearing between 1770 and 1836 (Raven and Forster 2000; Garside, Mandal, et al. 2006). In this paper we refer to the two-volume printed bibliography, updates, and online database collectively as RFGS.[3] RFGS anchors the analysis in this paper in several respects. What RFGS records (counts of new novels and, for 1800-1829, counts by author gender) is what we wish to infer for the entire period. RFGS provides a principled, descriptive definition of the novel: printed works referred to as novels by readers at the time. The usefulness and specificity of this definition is amplified by the fact that RFGS provides examples of works which meet the definition (the bibliography itself) as well as works which do not meet the definition.
[2] Working with data from Eliot (1994), Weedon (2003) combines the work of Eliot with other sources to offer a succinct description of publishing between 1836 and 1919 (Weedon 2003, pp. 46-51).
[3] To the best of our knowledge, Garside, Mandal, et al. (2006) includes corrections and additions to Raven and Forster (2000) and Garside, Raven, and Schöwerling (2000) which have been published online from time to time (e.g., Garside, Berlanger, and Mandal (2001)).
RFGS includes detailed records for each title listed in the bibliography. For years 1800-1829, each record includes an indication of the gender of the author. RFGS codes author gender as "Male", "Female", or "Unknown". If the title indicates author gender but not author name, the title is associated with the indicated author gender. For example, although the novel The Castle of Probation (1802) does not have a named author, it is associated with a "Male" author in RFGS because the novel's full title includes the words "By a Clergyman".[4] As a practical matter, we see RFGS as providing two distinct time series: first, counts of new novels published between 1770 and 1836; and, second, counts of new novels by author gender between 1800 and 1829. We further limit our attention to records associated with 1789 and later years in order to allay concerns about the definitional strategy used. As the 18th century progresses, characteristics associated with works labeled "novels" tend to stabilize. Works published after 1789 which were referred to as novels are very likely to share morphology with works labeled novels published during later decades. This is less often the case for novels published earlier in the 18th century.
To address the concern that the definition used by RFGS may be too restrictive, that it may tend to exclude literary works which were not called novels but which are, in all other respects, treated by readers at the time as if they were novels, it is worth noting that different definitions of the novel tend to agree on particulars in more than 85% of cases. Moreover, disagreement is localized. Most disputed cases involve novel-like (didactic) juvenile fiction and novel-like religious fiction (Troy Bassett, personal communication, Nov. 9, 2015). It should, therefore, be straightforward for other researchers to adjust the estimates reported here or to modify the model source code accompanying this paper to accommodate different assumptions about what works count as novels.
Nineteenth-Century Short Title Catalog (London, Oxford, Cambridge, Edinburgh, or Dublin), 1801-1870 ("LOCED")
Eliot (1997) extracts yearly totals of entries (novels and non-novels) listed in the Nineteenth-Century Short Title Catalog (NSTC) associated with one of the following places of publication: London, Oxford, Cambridge, Edinburgh, or Dublin. We refer to this time series using Eliot's abbreviation, "LOCED".
Because RFGS provide an exhaustive survey of new novels between 1801 and 1836, we know what percentage of LOCED titles are new novels for 36 years. During these years there is an opportunity to observe how the two time series covary.
Our LOCED series differs from Eliot's in one important respect. The original LOCED series has an unusual feature: undated material is assigned to the nearest half-decade (to a year ending with a "0" or a "5") (Eliot 1997, p. 86). To deal with this idiosyncrasy, we ignore entirely publication counts from the original series which are associated with years ending in "0" or "5". Although ignoring counts in these years might appear to bias the counts associated with other years downward (as many works, were their publication years known, "belong" in adjacent years), we have a different view. The original LOCED series mixes two time series, a series recording dated material and a series recording undated material. (New novels, for example, are virtually certain to report publication years on their title pages.) By stripping out counts for years ending with "0" or "5", we ignore the time series related to undated publications. At this point the inference strategy may be growing clearer. We aim to gather several partially overlapping time series which are correlated in order to "triangulate" from observed rates to unobserved rates.
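The treatment of half-decade years can be illustrated with a minimal sketch. The function name and the example counts below are hypothetical and are not values from Eliot (1997); the sketch only shows the filtering rule, under the assumption that the series is held as a mapping from year to count.

```python
def drop_half_decade_years(counts_by_year):
    """Return a copy of the series without years ending in 0 or 5.

    A year ends in 0 or 5 exactly when it is divisible by 5.
    """
    return {year: n for year, n in counts_by_year.items() if year % 5 != 0}


# Hypothetical counts, for illustration only (not taken from the LOCED series).
example_counts = {1809: 100, 1810: 400, 1811: 110, 1815: 390, 1816: 120}
kept = drop_half_decade_years(example_counts)
print(sorted(kept))
```

Years removed in this way are treated as unobserved rather than as zero counts, which is what allows the model to infer rates for them from neighboring years and from the other series.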
The Athenaeum Reviews of Novels, 1860, 1865, ..., 1900
The fourth and fifth resources are used primarily to improve the estimates of the number of new novels published after 1850. Improving our estimates for this period is important because uncertainty grows as we move further away from the bibliographic terra firma of the early 19th century. The fourth resource appears in Casey (1996). Casey provides counts for the number of novels reviewed in The Athenaeum during nine years: 1860, 1865, 1870, 1875, 1880, 1885, 1890, 1895, and 1900. (The Athenaeum was a London literary magazine published from 1828 to 1921.) Casey also breaks down the number of novels reviewed during the nine years by author gender. We assume that every title counted as a novel in this time series meets the definition of a novel used by RFGS.
Counts are taken from Chart 2 in Casey (1996). In Casey's series, titles with multiple authors contribute an author fraction to the relevant count. As the model used here is designed to model count data, all non-integer values in Chart 2 are rounded down. As novels with multiple authors are exceedingly rare during the period, we feel that ignoring authors other than the first will not meaningfully change any results presented in our analysis.
The Athenaeum does not review all novels published, so these counts are significantly lower than the total number of new novels published. If we knew the percentage of new novels reviewed by the magazine, we could derive the number of new novels published during these nine years. We infer the percentage of novels reviewed by modeling the overlapping time series. This strategy is the same as the one used to infer the percentage of total books published which are novels. In our model, we assume that the percentage of novels reviewed, whatever it turns out to be, is fixed during the period 1860-1900. Supporting this assumption is the observation that novel reviews in The Athenaeum increased markedly between 1860 and 1900, suggesting that the periodical enjoyed flexibility in the number of titles it reviewed.
[6] These years were chosen because a preliminary model made implausible predictions for these years. The predictions were implausible in that they were near or lower than a lower bound on the number of novels published in the relevant years. Lower bounds were available for these years because the ATCL database already contains records for many thousands of novels published in the 19th century.

Elicited Distributions of New
[7] The distributions were elicited in a phone conversation between Allen Riddell and Troy Bassett on November 9th, 2015. The quartiles reported in the paper are discounted from the original quartiles (450, 550, 700). Discounting is required because ATCL uses a more inclusive definition of the novel than RFGS. (For example, RFGS excludes some religious and didactic fiction that ATCL includes.) Bassett reports that between 10% and 15% of the novels included in ATCL would not be counted as novels according to RFGS. For this reason we discount the reported quartiles by 12.5% (the midpoint between 10% and 15%). The matching of ideal distributions to the elicited distributions (implied by the quartiles) involves one additional step because we model the rate of new novel publication on the log scale. We use Gamma distributions which have quartiles as close as possible to the elicited distributions (now on the log scale). For example, the final representation of the distribution with quartiles 394, 482, and 613 is (on the log scale) a Gamma distribution with shape and rate parameters of 278 and 46.

A Model of Novelistic Production
In this section we review the most important assumptions we make in our model (exponential growth with transitory deviations) and then describe in detail how the five time series mentioned earlier appear in the full model. To simplify the presentation, we initially describe the model without considering author gender. The minor adjustments required to model author gender are presented at the end of this section.
Seen from a distance, it is obvious that the rate at which new novels appear grows exponentially. We can appreciate this by looking at the rate at which books (novels and non-novels) appear (Eliot 1997; Weedon 2003). Additional evidence, if any is needed, is available from Eliot (1998), which shows nonlinear growth in the number of titles labeled as "Literature" in the NSTC (Eliot 1998, p. 85). The standard approach to modeling this sort of trend is a log-linear model. Taking log publication rates as our estimands, we can describe the trend using a linearly increasing log rate of publication. In a log-linear model, the log rate of new novel publication in year t is described by a two-parameter expression, [...] (2007), we do not describe them in any detail here.) The backbone of our model is therefore a Gaussian Process of the log rate of new novel publication between 1800 and 1919. The log rate of new novel appearance $\lambda_t$ for year t = 1, ..., 120 is given by a linear trend in t plus a deviation drawn from $\mathrm{GP}(0, K)$, where the year t = 1 is associated with 1800, t = 2 with 1801, and so on. $\mathrm{GP}(0, K)$ is a zero-mean Gaussian Process with 120 × 120 covariance matrix K, and the element (t, t') of K is $K_{t,t'} = \sigma_\lambda^2 \exp(-|t - t'|^2 / l_\lambda^2)$.
Two examples may help make the covariance matrix K more intelligible. $K_{2,3}$ is the covariance between $\lambda_2$, the log rate for 1801, and $\lambda_3$, the log rate for 1802. Its value is $\sigma_\lambda^2 \exp(-|2 - 3|^2 / l_\lambda^2)$. $K_{2,120}$ is the covariance between the log rates for 1801 and 1919. Unless $l_\lambda$ is extremely large, $K_{2,120}$ will be near zero because it contains the term $\exp(-|2 - 120|^2 / l_\lambda^2)$. ($\exp(-c)$ will be near zero whenever c is a large number.) A near-zero covariance makes sense here because we do not anticipate an observation of the 1801 rate telling us anything about 1919 rates.
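The structure of K can be made concrete with a short sketch. The code below assumes the kernel form suggested by the $K_{2,3}$ example, $\sigma_\lambda^2 \exp(-|t - t'|^2 / l_\lambda^2)$; the function names and the values sigma = 1 and length_scale = 5 are illustrative assumptions, not estimates from the model.

```python
import math


def covariance(t1, t2, sigma, length_scale):
    """Covariance between the log rates at year indices t1 and t2 (assumed kernel form)."""
    return sigma ** 2 * math.exp(-abs(t1 - t2) ** 2 / length_scale ** 2)


def covariance_matrix(n_years, sigma, length_scale):
    """Build the n_years x n_years matrix K; row and column 0 hold year index t = 1."""
    return [
        [covariance(i, j, sigma, length_scale) for j in range(1, n_years + 1)]
        for i in range(1, n_years + 1)
    ]


# Illustrative parameter values only.
K = covariance_matrix(120, sigma=1.0, length_scale=5.0)
k_2_3 = K[1][2]      # covariance between 1801 (t = 2) and 1802 (t = 3)
k_2_120 = K[1][119]  # covariance between 1801 (t = 2) and 1919 (t = 120)
print(k_2_3, k_2_120)
```

With these values adjacent years are strongly correlated while years more than a century apart are effectively independent, which is the behavior the two worked examples describe.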
To capture the belief that deviations from the trend will tend to persist for a bounded number of years, we use an informative prior distribution on the characteristic length-scale $l_\lambda$. This distribution places 90% probability on values between 1 and 10, expressing the prior belief that deviations will tend to persist for between 1 and 10 years. Such a prior distribution is consistent with the belief that, say, a market panic might affect the rate of novel publication in the short term but would likely cease to influence publication rates in years which are more than ten years distant from the event. Here, as elsewhere, we draw on domain expertise to justify our modeling choices. Different choices will lead to different results. (Different models, say linear or quadratic rather than log-linear, will lead to radically different results.) Readers who prefer different assumptions are invited to edit the code which accompanies this article and develop models which reflect their beliefs.
The observed annual counts of new novels from RFGS (1800-1836) (the first time series) are connected to the latent log rates $\lambda_{1:37}$ via a negative binomial sampling distribution.
This sampling model allows us to connect the smoothly varying rates to observed counts of new novels. Separating the latent rate from the observed counts in the model is particularly important before 1840 because there is considerable year-to-year variation in the observed counts of new novels which is due to the arbitrary assignment of novel publications into discrete years.[8] For year t = 1, ..., 37, the observed count is modeled as a draw from a NegativeBinomial2 distribution with location $\exp(\lambda_t)$, where NegativeBinomial2 is parameterized by a location parameter and a parameter controlling overdispersion. We use a two-parameter negative binomial sampling model here rather than a simpler, single-parameter Poisson model. The former's ability to model additional variation is important given the uncertainty about the latent process being modeled.
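The extra variation that a negative binomial model permits can be seen in a small numerical sketch. It assumes the mean and overdispersion parameterization used by Stan's neg_binomial_2, in which the variance is mu + mu^2 / phi; the function names and the values mu = 100 and phi = 20 are hypothetical and chosen only for illustration.

```python
import math


def neg_binomial_2_log_pmf(y, mu, phi):
    """Log probability of count y under a negative binomial with mean mu and overdispersion phi."""
    return (
        math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
        + y * math.log(mu / (mu + phi))
        + phi * math.log(phi / (mu + phi))
    )


def numerical_moments(mu, phi, upper=5000):
    """Approximate the mean and variance by summing over counts 0..upper-1."""
    probs = [math.exp(neg_binomial_2_log_pmf(y, mu, phi)) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


mu, phi = 100.0, 20.0   # hypothetical values, not estimates from the model
nb_mean, nb_var = numerical_moments(mu, phi)
poisson_var = mu        # a Poisson model forces the variance to equal the mean
print(nb_mean, nb_var, poisson_var)
```

Under these assumed values the two models share a mean, but the negative binomial variance is several times larger, so noisy yearly counts pull less strongly on the latent rate.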
To incorporate the counts of Publishers' Circular (PC) titles (the second time series), we introduce an additional Gaussian Process to model, for each year, the proportion of PC titles which are new novels. Background knowledge and Eliot (1998) lead us to believe that the proportion will certainly be less than 50% and that it will increase modestly over the period. As we did for the rates of new novel appearance, we transform the proportions into units which are conveniently modeled using a linear trend. In this case, we express the proportions on the log odds scale, denoting the log odds as $\nu_t$ for year t. (The log odds is the logarithm of the odds, $\log(p / (1 - p))$, where p is a proportion between 0 and 1.) In contrast to our thinking about year-to-year variation in rates of new novel publication, we anticipate that the proportion of PC titles which are new novels will change comparatively slowly. Whereas an economic crisis or other kind of "shock" might affect the rate of new novel publication over a period of several years, it would likely not affect the proportion of books which are novels. In other words, we anticipate that factors influencing the economics of publishing novels as opposed to non-novels do not change as rapidly as factors influencing the rate of book publishing in general. To capture this belief, the characteristic length-scale for this second Gaussian Process is modeled with a prior distribution placing 90% probability on values between 8 and 36, expressing the belief that deviations from trend will tend to persist for between 8 and 36 years. For year t = 1, ..., 120, the log odds $\nu_t$ are modeled, like the log rates, as a linear trend in t plus a deviation drawn from a zero-mean Gaussian Process.
[8] One way of appreciating the importance of modeling new novel publication with a continuous rate parameter is to imagine a situation where the aleatory variation in new novel counts is considerably greater. Imagine modeling new novel publication via weekly counts. In such a setting observing that zero new novels appeared in a given week would not be particularly meaningful. It would certainly not imply that there was zero activity associated with novel publishing during that week.
As with the yearly novel publication counts, observations of PC title counts (1843-1919) are connected to latent rates via a negative binomial sampling distribution. The latent rate of PC title appearance in year t, the mean of the sampling distribution, is $\exp(\lambda_t) / \operatorname{logit}^{-1}(\nu_t)$, where $\operatorname{logit}^{-1}$, the inverse logit (logistic) function, is the inverse of the transformation of a proportion into log odds. For example, if the proportion of PC titles which are novels is 12% and the rate of new novel appearance is 300, then the observed PC title count will be modeled with a negative binomial distribution with mean 300 / 0.12 = 2,500.
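The worked example can be checked with a few lines of code. This is a minimal sketch; the function names are our own, and the inputs are the 12% proportion and the rate of 300 used in the example.

```python
import math


def logit(p):
    """Transform a proportion into log odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Transform log odds back into a proportion (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))


def pc_title_rate(log_novel_rate, log_odds_novel_share):
    """Mean PC title count implied by the log rate of new novels and the log odds of their share."""
    return math.exp(log_novel_rate) / inv_logit(log_odds_novel_share)


lam = math.log(300.0)  # log rate of new novel appearance
nu = logit(0.12)       # log odds that a PC title is a new novel
rate = pc_title_rate(lam, nu)
print(round(rate))     # 2500
```

Working on the log and log odds scales keeps both latent quantities unconstrained, which is what makes a linear trend with Gaussian deviations a convenient description of each.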
The yearly Nineteenth-Century Short Title Catalog (LOCED) publication counts (the third time series) record information similar to that in the PC title counts series. Both record total publications (novels and non-novels). They differ primarily in the years they cover. The PC counts tend to be lower because PC tends to report only editions for sale in London.
Because these series are very similar, we model the LOCED rate in terms of the PC rate.
We assume that the LOCED rate is a fixed multiple of the PC rate: the rate at which titles are recorded in LOCED equals the PC rate multiplied by a constant factor, $\pi_\nu$. Because LOCED counts are always greater than PC counts, this factor will be greater than one.[9] As before, a negative binomial sampling distribution connects this yearly rate to the observed LOCED counts. For reasons discussed earlier, LOCED counts from years which end in a '0' or '5' are ignored.
[9] Counts derived from the NSTC and PC supply essential quantitative information about the development of the text industry in the British Isles and Ireland. In particular, these time series provide information about the year-to-year variation in the number of editions produced by the text industry. These sources have been used in previous research and are certain to be used in the future. While a precise understanding of their relationship is a topic for another paper, we can offer some preliminary observations. We know that for any given year the PC series always reports fewer editions than LOCED. The reason for this is, we suspect, that PC tends to report only titles for sale in London. LOCED, by contrast, contains records for all editions which ended up in libraries. Since there was a legal deposit requirement and LOCED includes records from the legal deposit libraries, LOCED covers a broader range of editions. LOCED gives us a sense of all editions published in the British Isles and Ireland, not just those published or distributed in London. For example, technical works published by university presses in Oxford, Cambridge, and Edinburgh which were not distributed in London would likely appear in LOCED. These editions would tend not to appear in PC. In our model we assume that, for every year, the number of editions in LOCED is a fixed multiple of the number of editions in PC. We make this assumption because it simplifies the model and because we think it is a reasonable assumption. It is a reasonable assumption if one believes that publishing outside of London grew at the same rate as publishing in London. The reasoning behind such a belief should be familiar at this point. Technological changes in the text industry such as cheaper paper and cheaper printing shaped publishing everywhere, not just in London. The same holds for relevant institutional changes, such as lower costs of capital associated with maturing financial institutions. So the fixed multiple assumption rests on the belief that the PC series captures the number of titles for sale in London and LOCED captures the number of titles published in London as well as in publishing centers outside of London. If the rate of publishing grew at the same pace throughout the British Isles and Ireland, the ratio of LOCED titles to PC titles should be approximately constant.
Counts of new novels reviewed in The Athenaeum (the fourth time series) are incorporated into the model using a similar strategy to the one just described for LOCED title counts.
The rate at which novels are reviewed is assumed to be equal to the rate of new novel publication multiplied by a constant factor, $\pi_a$. The use of a constant factor reflects the assumption that the proportion of new novels reviewed in The Athenaeum was roughly the same during each of the nine years. As noted earlier, that The Athenaeum's reviewing expands considerably during the period (from 137 in 1860 to 473 in 1900) lends this assumption superficial plausibility. As we know in advance that The Athenaeum does not review all new novels, an informative Gamma prior distribution placing 90% probability on a value between 30% and 70% is used. As with the other count-based time series, a negative binomial sampling model is used to model the relationship between latent rates and observed counts.
We connect the three distributions elicited from Bassett (the fifth data source) directly to new novel publication log rates for the relevant years ($\lambda_{87}$, $\lambda_{92}$, and $\lambda_{94}$). This makes incorporating the distributions into the model straightforward: the three elicited distributions are used as prior distributions on the rate of new novel appearance during 1886, 1891, and 1894. Although a meticulous approach would associate the three distributions with the unobserved counts of new novel publications (this is, after all, what Bassett was asked about), such an approach would add considerable complexity to the model by requiring us to model latent discrete variables (the unobserved counts). Assuming that the Bassett estimates concern continuous latent rates rather than discrete counts has the consequence of modestly understating the variance of the elicited distributions. Given that the elicited distributions indicate a generous degree of uncertainty, we think this is a reasonable price to pay for a simpler model.

Modeling author gender
The essential structure of the model has been introduced.
The full model differs slightly from the version presented. In addition to estimating the number of new novels published each year, the full model also estimates the number of novels published by author gender. This is accomplished by adding, for each year, two parameters to the model. The first parameter, $\rho_t$, records the proportion of new novels associated with an author of unknown gender. The second parameter, $\sigma_t$, records the proportion of known-author-gender new novels associated with men authors (a proportion of a proportion). With these two parameters it is possible to calculate the proportion of new titles given each of the three author gender annotations. For example, the proportion of new novels associated with women authors in year t is given by $(1 - \rho_t)(1 - \sigma_t)$. Each sequence, $\rho_{1:120}$ and $\sigma_{1:120}$, is modeled on the log odds scale using Gaussian Processes with a linear trend.
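How the two parameters combine into three proportions can be sketched as follows. The function name and the example values of rho and sigma are hypothetical; only the expression for women authors appears in the text, and the other two follow from the definitions of the parameters.

```python
def gender_shares(rho, sigma):
    """Split new novels into the three author gender annotations.

    rho   -- proportion of new novels with an author of unknown gender
    sigma -- proportion of known-author-gender novels associated with men authors
    """
    return {
        "Unknown": rho,
        "Male": (1.0 - rho) * sigma,
        "Female": (1.0 - rho) * (1.0 - sigma),
    }


# Hypothetical values for a single year, for illustration only.
shares = gender_shares(rho=0.2, sigma=0.4)
print(shares)
```

Parameterizing the split as a proportion of a proportion guarantees that the three shares sum to one for any values of the two parameters between 0 and 1.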
Prior distributions for the characteristic length-scale parameters are the same as the prior distribution used for the length-scale parameter for the Gaussian Process model of ν 1:120 (the proportion of PC titles which are new novels). Observed counts of new titles by author gender-available in The Athenaeum series and, for 1800 to 1829, in RFGS-are modeled with negative binomial sampling distributions.
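The way the two yearly parameters split new titles into three author gender categories can be illustrated with a short sketch. It is not the model code used for the estimates. The function names and the numeric inputs are hypothetical, chosen only to show the arithmetic and the conversion from the log odds scale back to proportions.

```python
import math


def inv_logit(x):
    """Map a value on the log odds scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gender_shares(rho_t, sigma_t):
    """Split one year's new novels into three author gender shares.

    rho_t   -- proportion of new novels with unknown author gender
    sigma_t -- proportion of known-gender new novels associated with men
    """
    if not (0.0 <= rho_t <= 1.0 and 0.0 <= sigma_t <= 1.0):
        raise ValueError("rho_t and sigma_t must be proportions in [0, 1]")
    known = 1.0 - rho_t
    return {
        "unknown": rho_t,
        "men": known * sigma_t,
        "women": known * (1.0 - sigma_t),
    }


# Hypothetical log odds values for a single year (not estimates from the paper).
rho_t = inv_logit(math.log(0.25))  # odds of 1:4, i.e. a proportion of 0.2
sigma_t = inv_logit(0.0)           # even odds, i.e. a proportion of 0.5
shares = gender_shares(rho_t, sigma_t)
```

By construction the three shares sum to one, so multiplying each by a year's rate of new novel publication gives rates by author gender that add back up to the total.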

New novels by author gender, 1789-1799
We estimate the number of new novels by author gender separately for the 11 years between 1789 and 1799. Because the number of new novels published during this period appears in RFGS, we need only estimate, for each year, the proportion of novels associated with men, women, and unknown gender authors.
We accomplish this by collecting and manually annotating a random sample of 110 titles from RFGS (ten titles for each year). For each year we calculate a posterior distribution over proportions using a multinomial sampling model and an informative Dirichlet prior distribution loosely centered on observed proportions in 1800.
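Because the Dirichlet prior is conjugate to the multinomial sampling model, the posterior for a single year has a closed form: the annotated counts are added to the prior parameters. The sketch below illustrates that update and a simulation-based 90% interval. The prior parameters and the counts are hypothetical placeholders, not the values used for the estimates reported here.

```python
import random


def dirichlet_posterior(prior_alpha, counts):
    """Conjugate update: posterior parameters are prior parameters plus counts."""
    if len(prior_alpha) != len(counts):
        raise ValueError("prior_alpha and counts must have the same length")
    return [a + n for a, n in zip(prior_alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


def sample_dirichlet(alpha, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def credible_interval(samples, mass=0.90):
    """Central interval containing roughly `mass` of the sampled values."""
    ordered = sorted(samples)
    lo = int((1.0 - mass) / 2.0 * len(ordered))
    hi = int((1.0 + mass) / 2.0 * len(ordered)) - 1
    return ordered[lo], ordered[hi]


# Categories: men, women, unknown. All numbers below are hypothetical.
prior_alpha = [4.0, 4.0, 2.0]  # stand-in for a prior loosely centered on 1800
counts = [3, 5, 2]             # stand-in annotations for ten sampled titles

posterior_alpha = dirichlet_posterior(prior_alpha, counts)
posterior_mean = dirichlet_mean(posterior_alpha)

rng = random.Random(0)
men_share_draws = [sample_dirichlet(posterior_alpha, rng)[0] for _ in range(4000)]
low, high = credible_interval(men_share_draws)
```

With only ten titles per year, the prior carries a good deal of weight, which is why an informative prior centered on a nearby year is a natural choice.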
For the full model covering the period between 1800 and 1919, we estimate model parameters using Markov Chain Monte Carlo (Carpenter et al. 2017). (For a general introduction to Monte Carlo methods in Bayesian statistics see Liu (2002).) All parameters whose prior distributions are not discussed are given reasonable, weakly informative prior distributions.

New Novel Publications, 1789-1919
The model provides estimates of the rate of novel publications for each year between 1789 and 1919. Figure 1 visualizes these rates. (Figure 2 shows these rates normalized by population.) Each interval in Figure 1 shows the posterior credible interval for the rate of new novel publication, exp(λ_t), for a specific year t. Points represent the number of new novels published during 1789-1836, a period for which we have exhaustive bibliographies.
In aggregate, between 40,000 and 63,000 new novels likely appeared between the years 1789 and 1919. (All intervals mentioned are 90% credible intervals.) A summary by decade appears in Table 1. For comparison, the number of these titles which are still in print today is shown, by author gender and decade of publication, in Table 2. This "reprint canon" (borrowing the label from Bassett (2017)) serves as an approximation of the body of works currently taught in universities. The reprint canon very likely represents less than one percent of novels published during the period. It is possible that it represents as little as one half of one percent of published titles.[10] One remarkable development which is visible by inspection is the rapid growth in new novel publication between 1840 and 1855.
[Figure 3]
Estimates of men authors' share of new novel publication by year are shown in Figure 4. The estimates are consistent with the widely held belief that there was a demographic shift in the occupation of novel writing during the 19th century (Tuchman 1989, pp. 5-11). At the beginning of the 19th century a majority of novels with known author gender were associated with women novelists. By the end of the 19th century this percentage had likely declined to roughly 40%.[11] Within the expected secular decline in the proportion of novels associated with women authors there is some evidence of a cyclical trend: the proportion of titles associated with men authors declines during the 1860s and 1870s before recovering again.[12] The estimates also permit us to say that it is virtually certain that novels by men authors and novels first published in the 1860s are overrepresented among titles which are still in print today. That is, the proportion of novels associated with men authors in the reprint canon does not reflect the proportion of novels written by men during the period. It is very likely that between 40% and 58% of novels written between 1789 and 1919 were associated with men authors (Table 1). In the reprint canon, however, 71% of novels from this period are associated with men authors (Table 2). The distribution of reprint canon titles by year of first publication is also not aligned with the distribution of titles published during the period. Titles published in the 1860s, in particular, appear to be overrepresented in the reprint canon. Titles published in the 1900s appear to be underrepresented. Although it is possible that the reprint canon does not reflect literary works used in research and taught in university classrooms, the reprint canon does reflect the population of 19th century novels which continue to be sold and read.
11 Our estimates concern the characteristics of the population of new novel titles, not novelists. If one assumes that novelist gender is uncorrelated with the number of novels they publish, then the share of novelists associated with each gender should be roughly the same as the share of novels associated with each gender. Estimating the demographic characteristics of the population of professional novelists should be addressed in subsequent research. This research may need to, for example, avoid double-counting novelists who used different, or even collective, pseudonyms.
[...] in Table 3. Table 4 shows reprint canon titles by author gender and year.

Limitations and Future Work
The estimates presented here reduce uncertainty about the number of new novels published between 1789 and 1919. The reduction is significant enough that a variety of existing narratives of developments in the literary market and the text industry merit revisiting in light of the new estimates. The account offered by Tuchman (1989) of changes in the percentage of women pursuing careers as novelists is one example. The census data Tuchman uses to gauge changes between 1861 and 1919 are, by her own admission, unreliable (Tuchman 1989, p. 58). Although the estimates presented here concern the annual number [...] to an expansion in the number of novel readers or intensification of novel reading among the existing population of novel readers? The latter, at least, seems unlikely, because the gains of the industrial revolution (which might have enabled more people to purchase the luxury goods which novels and circulating library subscriptions unquestionably were) did not accrue meaningfully to the broader population until after 1840 (Allen 2009).

Conclusion
The number of new novels published each year counts as essential information for researchers interested in understanding the text industry and text culture between 1789 and 1919.
Knowing that a novel was one among 100 (rather than 500) new works published in a given year affects how a researcher understands the position of a work in the literary marketplace (Eliot 2002). Estimates of a variety of quantities which have been the subject of scholarly attention can be bounded by or estimated from the number of new novels published each year. Novels' share of all editions can be bounded from below given the number of first edition novels and the number of works published in a given period. (The changing share of prose fiction has been discussed in more than one scholarly study (Erickson 1996; Eliot 1998).) A second quantity of interest to book historians and social historians of literature is the number of individuals who pursued careers as novelists (Sutherland 1988). As the vast majority of novels are written by one person, this quantity can be bounded from above by the number of novels published during a given period. Equipped with an estimate of the average number of novels published by a novelist during the period, a serviceable estimate of the quantity itself could be calculated as well.
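The bound and the estimate described above amount to simple arithmetic, sketched here for concreteness. The interval of 40,000 to 63,000 new novels comes from the estimates reported in this paper. The average number of novels per novelist is a purely hypothetical figure chosen for illustration, and the function name is likewise an assumption.

```python
def novelist_estimate(new_novels, mean_novels_per_novelist):
    """Estimate the number of novelists from a count of new novels.

    Assumes each novel has a single author, so the mean number of novels
    per novelist cannot be below one.
    """
    if mean_novels_per_novelist < 1:
        raise ValueError("a novelist has, by definition, at least one novel")
    return new_novels / mean_novels_per_novelist


# 90% credible interval for new novels, 1789-1919 (from the paper).
total_interval = (40_000, 63_000)

# Upper bound on the number of novelists: at most one novelist per novel.
upper_bound_interval = total_interval

# Hypothetical average output per novelist, for illustration only.
assumed_mean = 4.0
estimate_interval = tuple(
    novelist_estimate(n, assumed_mean) for n in total_interval
)
```

A fuller treatment would propagate uncertainty about the average itself rather than fixing it at a single value.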
Reliable estimates of the number of new novels published each year help bibliographers assembling exhaustive lists of published novels. Such estimates allow bibliographers to gauge their progress. For example, if a model such as ours, one which draws together a range of sources, predicts that there are very likely between 78 and 160 first edition women-authored novels published in 1865, a bibliographer can consult their list of titles to see if their total aligns with the estimate. If the total in the bibliography falls conspicuously short of the estimated total, this indicates that novels by women are missing from the bibliography.
In such a scenario, the bibliographer might then expand the range of sources they are drawing on to identify novels. Absent such estimates it is difficult for a bibliographer to conveniently assess their progress towards attaining an exhaustive list. Without estimates of the total, they must follow the expensive and time-consuming approach of RFGS: make sure they have exhaustively reviewed all sources of information that could have recorded the publication of a novel. Equipped with good estimates of the total number of published titles, bibliographers can judge their progress at much lower cost.
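The check a bibliographer might run against an estimated interval can be sketched in a few lines. The interval of 78 to 160 is the illustrative figure used above for women-authored first editions in 1865. The function name, the return labels, and the listed totals are hypothetical.

```python
def assess_coverage(listed_titles, interval):
    """Compare a bibliography's title count with an estimated credible interval.

    Returns "short" when the count falls below the interval, "over" when it
    exceeds it, and "consistent" otherwise.
    """
    low, high = interval
    if low > high:
        raise ValueError("interval must be given as (low, high)")
    if listed_titles < low:
        return "short"
    if listed_titles > high:
        return "over"
    return "consistent"


# Illustrative interval from the text: women-authored first editions, 1865.
interval_1865 = (78, 160)

# Hypothetical bibliography totals.
status_sparse = assess_coverage(50, interval_1865)   # at least 28 titles missing
status_fuller = assess_coverage(120, interval_1865)
```

A count above the interval is also informative, since it may point to reprints or non-novels having been recorded as first editions.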
Credible estimates of the share of new novels published during each year by gender allow literary studies scholars and book historians to assess how well arbitrary collections of novels reflect the population. We have already mentioned a particular corpus, the "reprint canon", which includes novels widely used in university teaching and research. Our estimates allow us to compare the reprint canon to the population of published novels. Another corpus of novels which might be compared with the population is the collection of novels authored by writers who are included in the Dictionary of National Biography (DNB). If this corpus does not resemble the relevant population then it is unlikely that the individuals in the DNB resemble the population of novelists. Knowing if the DNB reflects the population of novelists would permit researchers to calibrate their trust in existing studies which assume or suggest that writers in the DNB resemble the population (e.g., Altick (1962)).
The utility of the estimates presented here pales in comparison to the usefulness of an exhaustive bibliography of the 40,000-63,000 new novels published between 1789 and 1919.
The latter would allow us to say a great deal more about the particular kinds of novels which were published and the range of writers and publishers involved in the text industry.
But an exhaustive bibliography of new novels published in the British Isles and Ireland between 1789 and 1919 does not exist and is unlikely to emerge in the next few years. In the interim, the estimates gathered here give researchers, bibliographers in particular, a series of bearings which will allow them to better assess existing accounts of the history of the novel and the history of the text industry.