Enumerations: Data and Literary Study

by Andrew Piper

For well over a century, academic disciplines have studied human behavior using quantitative information. Until recently, however, the humanities have remained largely immune to the use of data, or have vigorously resisted it. Thanks to new developments in computer science and natural language processing, literary scholars have embraced the quantitative study of literary works and have helped make Digital Humanities a rapidly growing field. But these developments raise a fundamental and as yet unanswered question: what is the meaning of literary quantity?

In Enumerations, Andrew Piper answers that question across a variety of domains fundamental to the study of literature. He focuses on the elementary particles of literature, from the role of punctuation in poetry, the matter of plot in novels, the study of topoi, and the behavior of characters, to the nature of fictional language and the shape of a poet’s career. How does quantity affect our understanding of these categories? What happens when we look at 3,388,230 punctuation marks, 1.4 billion words, or 650,000 fictional characters? Does this change how we think about poetry, the novel, fictionality, character, the commonplace, or the writer’s career? In the course of answering such questions, Piper introduces readers to the analytical building blocks of computational text analysis and brings them to bear on fundamental concerns of literary scholarship. This book will be essential reading for anyone interested in Digital Humanities and the future of literary study.

About the Author

Andrew Piper is professor in the department of languages, literatures, and cultures at McGill University. He is the author of Dreaming in Books: The Making of Bibliographic Imagination in the Romantic Age and Book Was There: Reading in Electronic Times, both published by the University of Chicago Press. He is also a founding member of the Multigraph Collective, a group of twenty-two scholars that recently published Interacting with Print: Elements of Reading in the Era of Print Saturation, also with the University of Chicago Press.

Read an Excerpt


Punctuation (Opposition)

"Like a , , , , , , , , , , this look between us."


While writing existed long before punctuation was invented, there is no more rudimentary form of inscription than the punctuation mark. The dot, the line, the curve — these are writing's elements. As marks of punctuation — as period, comma, hyphen, parenthesis, or question mark — they both interrupt and conjoin. They divide, but also mark time. Punctuation makes us feel writing. It makes the virtual real.

There is no shortage of scholarship on punctuation. We have numerous accounts of the invention, the fashionability, and the fall of certain types of punctuation marks (whither the semicolon). And since the late eighteenth century, we also have numerous prescriptive works on punctuation's rules. With the spread of literacy and the expansion of print in the nineteenth century, the manual of style would emerge as a quintessentially modern genre, books of syntactical paternalism encircling the unruly hordes of the printed masses. And then there is the seemingly endless parade of interpretive engagements with singular moments of punctuation: the disputed semicolon at the close of Faust 2; the famous dash of Kleist's "Marquise von O."; the missing period at the end of Whitman's first version of "Song of Myself"; or the parenthesis in e. e. cummings's "windows go orange in the slowly" that is both its own line and a visual index of the quarter moon about which the poem speaks — a form of onomatographia, when we use punctuation marks to look like objects in the world.

Throughout all of this, however, we have never had a history of what Georges Bataille might have called the general economy of punctuation, a study of the norms and excesses of punctuation in a given period. What is the meaning of punctuation's distributions, its luxuriant overaccumulation, as well as its rhythmic rise and fall, "the delay of language," in poet Amiri Baraka's words? To study the economy of punctuation, and not just a few singular auratic marks, is to study the way spacing and pacing make meaning on the page. It is to understand the way tactics of interruption, delay, rhythm, periodicity, and stoppage are all essential ways of communicating within literature's long history. The economy of punctuation allows us to see the social norms surrounding how we feel about the discontinuities of what we want to say.

Take for example Paul Scheerbart's Lesabéndio, a science fiction novel written in the opening decades of the twentieth century. The story concerns the main character's wish to build a tower to transcend his planetary limits. His goal, he says, is to commune with "the Larger [das Größere]." Quantity for Scheerbart is the new Babel. The planet on which Lesabéndio lives, an asteroid off Jupiter named Pallas, is populated by stretchy people with suction feet who have telescopic eyes. They don't have sex and are born from nuts. Their books are microfilms that they wear around their necks. When they die, they are absorbed by another member of the planet, who stretches extremely high and takes in the dying member through his or her pores. It was one of Walter Benjamin's favorite novels.

Whatever else it is, Lesabéndio is unique in its predilection for periods. It belongs to a select group of novels in the German canon that use an almost equal ratio of periods to commas (the average since the late eighteenth century is closer to 2.6 commas for every period). Even more telling is when the novel uses periods. There is a moment around the midpoint of the novel when the amount of periods increases significantly (fig. 1.1). Even the lowest occasions of period use after this moment are above the highest in the novel's first half. What has happened?

Over the course of the first half of the story, Lesabéndio has been building support for his tower. In the process, he has overcome one obstacle after another. But a crisis is reached when his colleague Peka, the artist, feels that his role in the project has been undermined. In the segment with more periods than any other in the novel, Peka cries: "You have destroyed me! You have taken everything from me. You have annihilated me. Your cursed tower has made a poor end of my artistic dreams." It is at this point that Peka begins crying, only to realize that his tears provide a new kind of glue that is needed to overcome what initially appeared as an insurmountable technical obstacle. Technology triumphs on the fluid surfaces of human sadness.

This moment marks a major turning point in the novel, a kind of conversional axis within the narrative. And it is the period and its accumulation that captures this conversion, the fate of art in the age of industrialization. "This is no longer an artistic story — it is something other — something incomprehensible," says Lesabéndio just after the novel's meridian. The period's rise marks a turning point toward something unknown, something greater than oneself, something potentially inhuman.

The General Economy of Punctuation

Lesabéndio's predilection for periods was not unique. Over the course of the twentieth century, both novels and poetry, at least in English, were increasing the frequencies with which they used periods, at the same time that commas were decreasing (fig. 1.2). Indeed, in the case of English-language poetry, punctuation itself has been decreasing for the past century or so. Poetry appears to be heading toward writing's unpunctuated origins as a form of continuous script.

If novels like Lesabéndio were increasingly relying on periods to mark their punctuatedness, this was true within novels as well. Like Lesabéndio, more often than not novels in English tend toward using more periods as the plot progresses, especially as we move into the twentieth century (fig. 1.3).
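The two measurements used so far, the ratio of commas to periods and the count of periods across successive stretches of a novel, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the whitespace tokenization, and the default of twenty equal segments are hypothetical choices, not the procedure behind the book's figures.

```python
def comma_period_ratio(text: str) -> float:
    """Commas per period; infinite if the text contains no periods."""
    periods = text.count(".")
    return text.count(",") / periods if periods else float("inf")


def periods_by_segment(text: str, segments: int = 20) -> list[int]:
    """Split the text into roughly equal runs of words and count periods in each.

    Assumption: words are whitespace-separated tokens, so a period stays
    attached to the word it follows.
    """
    words = text.split()
    size = max(1, -(-len(words) // segments))  # ceiling division
    return [
        sum(word.count(".") for word in words[start:start + size])
        for start in range(0, len(words), size)
    ]


sample = "a, b, c. d. e. f."
print(comma_period_ratio(sample))      # 2 commas / 4 periods
print(periods_by_segment(sample, 2))   # periods in each half
```

A rising sequence from `periods_by_segment` would correspond to the pattern described above, in which periods become more frequent as the plot progresses.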

One way we might try to understand this is to see it as a trend toward narrative resolution. The pensiveness and contradictions of commas give way to the clarity and pointedness of periods. Periods mark ends, and thus there are more of them as the narration reaches an end. This is certainly true in a work like The Sorrows of Young Werther (1774), whose commas decline precipitously toward the close of the novel as we move out of the young man's sentimental outpourings and into the colder, more clinical narrative of the editor. The utter erasure of punctuation from Molly Bloom's monologue at the close of James Joyce's Ulysses tells the same story in reverse. The absence of punctuation reminds us how much this novel resists closure.

But rather than see the period as a mark that exclusively indicates an end, whether narrative or glottal, Scheerbart's novel suggests the power of the increased quantity of periods to signal a sense of an opening. Quantity makes a difference to the dot's meaning. As Lesabéndio ascends his tower ever further into the cosmos, we do not move toward resolution, either in the visual or plotted sense; rather, we move, like Faust, outward and upward, into increasing degrees of abstraction and the "incomprehensible." The punctuatedness of periods, counterintuitively, initiates a profound sense of openness.

This chapter is an exploration of the general economy of literary punctuation. For Bataille, whose work grew out of the soil of French surrealism, the secret of life lay in its superabundance, the fundamental fact of increase that lay behind it. "General economy" did not for Bataille signal a closed system of inputs and outputs, a form of circularity or homeostasis; rather, the economic was far more a model of expenditure and excess. Life was too much. It necessitated that "glorious operation," which Bataille labeled "useless consumption." It turned the rationality of production on its head. The project of art was no longer to make us aware of the productive forces of society, that familiar Marxist credo. Instead, its aim was to draw attention to the necessity of waste and excess as conditions of a full sense of being, of feeling more than ourselves. Art made the fact of excess recognizable; this was the sense of life's "accursed share." For Bataille, this is what allows us to feel our way into something beyond ourselves.

Strange as it may seem, computation may be one of the more effective ways to study this idea of general economy that Bataille had in mind. Computation allows us to see the overall distribution of literary features and identify those spaces of either lack or excess that give shape to a given genre or period of time (or even the notion of the "time period"). Far from a handmaiden of empiricism or the "rational economy" that Bataille wanted us to move away from, computation enables us to inhabit, and more importantly see, that accursed share of writing, those spaces of aesthetic expenditure and luxury that offered keys to understanding human beings for Bataille. Reading for quantitative excess follows in the direct path of Bataille and his early twentieth-century surrealist roots.

Such thinking, it should be pointed out, deviates strongly from the norms of statistical reasoning. Outliers are traditionally seen as problems, the exceptions that help prove a rule. Bataille's interest in excess was undoubtedly influenced by the rise of statistical thinking in the nineteenth century, with its strong emphasis on norms and normal distributions. As Wilhelm Lexis, one of the pioneers of statistical thinking in Germany, wrote in 1877, "The state of a human community is on the one hand partially determined by the positive historical forms and norms of both society and state ...; but it is also determined by the common and relatively constant actions and afflictions of individuals in diverse settings, which in their discrete units cannot be comprehended but which produce characteristic mass phenomena [Massenerscheinungen] accessible to scientific observation." The Bataillean point of view, on the other hand, suggests a more dialectical relationship between norms and excess, the ways in which the act of exceeding one kind of norm can produce its own form of normative behavior, that accursed share that so fascinated Bataille.

My focus in this chapter will be on the relationship between punctuation's excess and its manifestation in twentieth-century poetry. Few narratives are more strongly ingrained in the field of poetics than the growing antipathy to punctuation in the twentieth century. From modernist sound experiments to the later depunctuated work of poets like William Carlos Williams, punctuation in the twentieth century is most often thought to bar access to the acoustic and rhythmic immediacy of poetry. As Marinetti writes in his "Supplement to the Technical Manifesto," establishing a basic motif around which subsequent work would organize itself: "Words freed of punctuation shine on each other, interweaving their diverse magnetism, following the uninterrupted dynamism of thought."

And yet, at the same time as this allergy to punctuation grows, we can also observe one particular type of punctuation, the period, become increasingly deployed. The period arguably becomes twentieth-century poetry's accursed share. Far from enacting Bataille's dream of dissolution, from clearing a space of freedom and release, the period's excess seems to capture more of a sense of irresolution and antinomy. As I will try to show, the period's abundance brings us into a language space marked not only by a sense of the elementary — more deictic and rudimentary — but also by that of opposition and conjunction, a sense of the irreconcilable. The ending, as Scheerbart's Lesabéndio intuited, is also an opening.

Using a collection of 75,000 poems written in English by 452 poets who were active during the twentieth century (POETRY_20C), this chapter explores poems that deploy periods well in excess of the norms of their age. In doing so, it asks what they might have in common in addition to their punctuatedness, how one kind of excess might establish other kinds of expressive norms. This is what I meant above about the exploratory nature of this chapter — it does not start with a clear, demarcated hypothesis that it sets out to test, but with a more general sense of a scholarly narrative about twentieth-century poetry's antipathy to punctuation that it sets out to complement. But it is also exploratory in the sense that it starts at the beginning of things, using perhaps the simplest computational technique there is, "grep" (which stands for "globally search a regular expression and print"), to extract the frequency of the simplest typographic symbol there is, the period, in order to understand something about its distribution and meaning.

As we will see, the spaces of over-periodization that computation helps bring into view cut more transversally through the traditional ways poets have been divided up in the twentieth century by either period or school. While over-periodization does become distinctly racialized as a poetic practice — a means of traversing a sense of self that can neither cleave itself off nor fully commune with a larger population — such racialization only tells part of the period's abundance. African American writers, for example, are 2.8 times more likely to be represented in the high-period group than they are in the collection at large. Nevertheless, they still comprise only 17% of all poems in the high-period group. There is a stylistic and communicative diversity to these poems, one that revolves around the more general question of too many endings ("over-ending"), of what comes after the end. How do we think and dwell in this excessively punctuated moment?
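The two figures in this paragraph can be read together. If a group makes up 17% of the high-period poems and is 2.8 times overrepresented there, its implied share of the whole collection is roughly 17 / 2.8, or about 6%. The sketch below only restates that arithmetic; the function names are hypothetical, and the 6% figure is derived here rather than reported in the text.

```python
def representation_ratio(group_share: float, collection_share: float) -> float:
    """How many times more common a category is in a subgroup than overall."""
    return group_share / collection_share


def implied_collection_share(group_share: float, ratio: float) -> float:
    """Back out the overall share from the subgroup share and the ratio."""
    return group_share / ratio


# Figures from the paragraph above: 17% of the high-period group, ratio of 2.8.
share = implied_collection_share(0.17, 2.8)
print(round(share, 3))  # implied share of the full collection
```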

The Excessive Period

What does the economy of the period look like in twentieth-century Anglophone poetry? Where is the space of its excess and what does it say? The first step in answering these questions is to extract all periods in poems and calculate their ratio relative to the number of words in a given poem. From there, we can then understand the ways in which the period's frequency is distributed across the collection of twentieth-century poems. Figure 1.4 presents a histogram that shows the distribution of periods expressed as a percentage of words per poem across the entire collection. A histogram is a useful tool to visualize how a particular value is distributed within a particular data set. Here the x-axis refers to the percentage of periods in a poem relative to the total number of words in that poem (so in a 100-word poem, 5% means that there are five periods detected in that poem). Each bar in the graph represents a single percent, starting at 0. The y-axis tells us how many poems have that value. The numbers of poems with values beyond 20% are unfortunately so few that they cannot be seen here. The graph has also been artificially cut at 50% to make it more legible though, as we will see, it extends out to 165%, the maximum amount of periods per words in our data set.
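As a rough illustration of the first step described here, the following Python sketch computes periods per hundred words for each poem and groups the results into whole-percent bins, cut off at 50% as in the figure. The tokenization rule, the function names, and the three sample texts are assumptions made for the example; the chapter does not specify how words were counted.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def period_rate(poem: str) -> float:
    """Return periods per 100 words (0.0 if the poem has no words)."""
    periods = poem.count(".")  # counts every "." character, ellipses included
    words = len(WORD.findall(poem))
    return 100.0 * periods / words if words else 0.0


def histogram(rates, cap: float = 50.0) -> dict[int, int]:
    """Count poems per whole-percent bin, dropping rates at or above `cap`."""
    bins = Counter(int(rate) for rate in rates if rate < cap)
    return dict(sorted(bins.items()))


poems = [
    "The sea is calm tonight. The tide is full.",  # 2 periods, 9 words
    "no stops here only breath",                   # 0 periods, 5 words
    "Stop. Go.",                                   # 2 periods, 2 words
]
rates = [period_rate(p) for p in poems]
print(histogram(rates))  # {0: 1, 22: 1}; the 100% poem falls beyond the cap
```

Because the rate is a ratio of periods to words, it can exceed 100% whenever a poem contains more periods than words.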

What this plot and the data behind it show us is that a majority of poems fall within a very narrow range. Fifty percent of poems have a rate of between 2.35 and 5.88 periods for every hundred words. Ninety percent of the data can be accounted for with a period rate of just under 10% of a poem's words. There is a significant cohort of about 6,700 poems, around 9% of the entire collection, that have no periods at all. Finally, there is a very small cohort of 929 poems, or just over 1.2% of all poems, that use periods to an exceptionally high degree (i.e., more than 3 standard deviations above the mean). This is punctuation's 1%. The range in this group is quite wide, from about 14%, or 1 period for every 7–8 words, to 50%, or 1 period for every 2 words, all the way up to 165%, in which there are more periods than words, as in Amiri Baraka's "American Ecstasy":


Loss of Life or Both Hands or Both Eyes The Principal Sum Loss of One Hand and One Foot The Principal Sum Loss of One Hand and One Eye or One Foot and One Eye The Principal Sum Loss of One Hand or One Foot One half The Principal Sum Loss of One Eye One fourth The Principal Sum

The histogram can help us better understand the distributional nature of punctuation in twentieth-century poetry, showing us the range of its norms and the areas of its excess. What it cannot tell us, of course, is what periods say, in either their normal or excessive states. What are the semantic associations that accompany periods in poetry? For this we need other kinds of methods that are able to think about the relationship between the distributions of periods and the distributions of language.
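The cutoffs quoted earlier (the middle fifty percent of poems, and the small group more than 3 standard deviations above the mean) could be computed along the following lines. This is a minimal sketch, assuming a population standard deviation and Python's default quantile method; the text does not say which variants were used, and the toy rates below are invented for illustration.

```python
import statistics


def interquartile_bounds(rates):
    """Return the first and third quartiles (the middle 50% lies between them)."""
    q1, _, q3 = statistics.quantiles(rates, n=4)
    return q1, q3


def high_period_threshold(rates, k: float = 3.0) -> float:
    """Mean plus k standard deviations (population form, an assumption)."""
    return statistics.mean(rates) + k * statistics.pstdev(rates)


def high_period_rates(rates, k: float = 3.0):
    """Rates lying strictly above the mean-plus-k-deviations cutoff."""
    cutoff = high_period_threshold(rates, k)
    return [rate for rate in rates if rate > cutoff]


# Invented data: 21 ordinary poems and one extreme case.
toy_rates = [3.0, 4.0, 5.0] * 7 + [165.0]
print(interquartile_bounds(toy_rates))
print(high_period_rates(toy_rates))
```

One design note: a single extreme value inflates both the mean and the standard deviation, so a cutoff of this kind is only meaningful when the collection is large relative to the number of outliers, as it is with tens of thousands of poems.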


Table of Contents

Preface
Introduction (Reading’s Refrain)
1. Punctuation (Opposition)
2. Plot (Lack)
3. Topoi (Dispersion)
4. Fictionality (Sense)
5. Characterization (Constraint)
6. Corpus (Vulnerability)
Conclusion (Implications)
Acknowledgments
Data Sets

