enculturation

A Journal of Rhetoric, Writing, and Culture

Synthetic Speech and Negotiation: AI’s Nonhuman Rhetoric

Alex Reid, University at Buffalo

(Published June 1, 2020)

Developments in artificial intelligence, natural language processing, and speech synthesis have improved the capacity of digital assistants, such as Siri, Google, and Alexa, to converse with their human users. As compared to earlier versions, newer digital assistants better understand what people say and respond in more pleasing, less robotic speech. Parallel research puts these same capacities to work in teaching AIs to be effective negotiators by recognizing a connection between natural language use and bargaining. In such scenarios, we might ask if digital assistants are becoming rhetorical agents, or, more precisely, which kinds of rhetorical capacities they are developing. To examine this question, this article brings together two research discourses on the rhetorical operation of AI speech synthesis and natural language processing. The first is familiar to our field. Scholarship in posthuman and new materialist rhetoric has opened the discipline—at least on a conceptual, theoretical basis—to the investigation of intelligent machines as rhetorical agents. The second comes from the engineers and computer scientists who develop and investigate the technological processes of these AI operations.

In conducting this analysis and bringing these conversations together, a few things become clear. First, as is probably already obvious, the ability of digital assistants to synthesize speech represents a very modest rhetorical capacity. Despite that limitation, the speech of digital assistants (I focus on Siri here) is a form of rhetorical delivery constructed to communicate clearly and effectively as well as to produce an affective response through its efforts to generate pleasing speech. As I discuss here, Siri’s rhetorical operation is more easily understood within a new materialist/posthuman rhetoric that does not presuppose agency, will, or cognition as a condition for rhetorical action. Second, the complex technical processes Siri completes to construct speech are abstractly analogous to the nonconscious, physiological processes of human bodies when speaking. As I argue, this phenomenon can be described in terms of what Manuel DeLanda would term a shared “possibility space.”[i] That is, as Siri is designed to simulate a narrow range of human speech, it does so by accessing a small portion of the rhetorical possibilities of speech delivery available to humans, though through very different processes. This provides an (albeit modest) insight into the nonhuman status of rhetoric itself as a possibility space that we, like digital assistants, access to some limited degree, even though our access is significantly broader than that of Siri today. Taken together, these investigations contribute to the broader new materialist study of digital rhetoric by demonstrating how close analyses of digital technologies can move us from abstract theoretical models of rhetorical ontologies toward an understanding of how we specifically encounter nonhuman rhetorics in our daily lives.

 

New Materialist Rhetorics of AI

Concern with rhetorical agency and artificial intelligences predates the emergence of digital assistants. Carolyn Miller’s 2007 essay “What Can Automation Tell Us about Agency?” appeared during the early days of smartphones, the same year the first iPhone was released and before Siri had been built. Miller imagines an automated tool for grading student speeches called “AutoSpeech-Easy” with the ability to “account for not only the stream of oral language but also visual data about body language and auditory data about expressiveness, and the like” (139). In her thought experiment, Miller uncovers an understandable and significant objection among writing and speech faculty to machine grading. As she describes, the “resistance to automation is rooted in a commitment to agency, or more specifically that we find it difficult (and perhaps perverse) to conceive of rhetorical action under conditions that seem to remove agency not from the rhetor so much as from the audience” (141). This seems particularly pointed for the purposes of speech “because we understand [speaking] intuitively as a performance, meaning that it is dynamic and temporal, that it requires a living presence . . . Speaking is understood as im-mediate, both in the sense that it happens in the instant and in the sense that it is not mediated but direct” (145). In short, for Miller and the faculty she consults, rhetoric is founded upon the agency not only of the rhetor who acts through speaking but also of the audience who deliberates upon what they hear. Miller examines theoretical, ideological, and practical approaches to understanding agency and ultimately argues that agency is a quality attributed to one by another and that “our attributions of agency are ultimately moral judgments, matters of human decency and respect” (153). On this basis she concludes that we are not obligated to give such attributions to machines, at least not in 2007. As Thomas Rickert observes, “Miller’s essay insightfully demonstrates that issues of nonhuman agency are rhetorical in and of themselves, but her argument uses that insight to foreclose the issue of things, because human attribution in a sense trumps all” (212). Effectively, she asserts that humans begin with the agency to make attributions of agency to others and can freely decide whether or not to grant it to machines.

Rickert, of course, is interested in investigating the issues that he sees Miller foreclosing (as am I). His concept of ambient rhetoric takes an ecological approach in which humans participate but in which we are neither necessarily the beginning nor endpoint for rhetorical action. As he argues, rhetoric “must diffuse outward to include the material environment, things (including the technological), our own embodiment, and a complex understanding of ecological relationality as participating in rhetorical practices and their theorization” (3). This ecological approach shifts agency from an inherent quality of (human) beings to an emergent and relational capacity of environments.[ii] As Rickert writes,

the change in perspective is crucial: not subjective agency in a (necessary) context but a dynamic interchange of powers and actions in complex feedback loops; a multiplication of agencies that in turn transform, to varying degrees, the agents; a distribution of varied powers and agencies . . .  There are profound implications for rhetorical theory as well, since the bulk of our conceptual framework and terminology focuses on cognitive agents wielding symbolic power via language and image with the world as backdrop, stage, or exigence. (10-11)

My investigation here twists the new materialist focus on the nonsymbolic, nonlinguistic rhetoric of nonhumans, though perhaps in a way that we should have anticipated: I am examining distributed nonhuman agents emerging from complex feedback loops in search of becoming (if only in the most modest sense) “cognitive agents wielding symbolic power via language.”

Alongside Rickert’s ambient rhetoric, multiple new materialist rhetorics have emerged in connection with a larger, divergent, theoretical conversation including actor-network theory (Latour), agential realism (Barad), vibrant matter (Bennett), new materialism (DeLanda, Braidotti) and object-oriented ontology (Harman) to name a few of the more salient concepts and scholars. While important differences arise among various posthuman and new materialist theories, much of the work done in rhetoric draws these thinkers together to investigate the rhetorical operation of things/objects. For example, Byron Hawk deploys a theory of quasi-objects to investigate a rhetoric of sound. He describes quasi-objects as “part relation and part material specificity . . . they are partially moving, emergent, composed events that are slowed down and partially stabilized by relations” (28). In taking up the quasi-object in rhetoric and composition, Hawk sees himself “expanding the field’s object of study, whether it is conceived of as composition, writing, or rhetoric, into various layers and levels of systems that compose a more complex, shifting, unbounded object . . . ultimately strengthening our disciplinary ecology” (45). In analyzing speech synthesis, as this essay does, there are potentially rhetorical stakes in calling the sounds a smartphone makes “speech.” We might think of it as analogous to recorded speech (and as I will discuss, recording human speech to create a database of phonemes is part of the process of building speech synthesis). However, the voice of digital assistants is clearly not recorded but rather procedurally generated speech (or sound). Perhaps calling the sounds Siri makes “speech” already implies some rhetorical obligation upon us as listeners. As Hawk notes, “Through sound as matter, people feel their bodies vibrate emphatically (embodiment); locate themselves in an environment via reverberation (spatial orientation); analyze speech and language via phonemes (communication); and capture and distribute sound via technological mediation (music)” (29). Smartphones and other digital assistants participate in each of these aspects of sound. In constructing a rhetorical concept of quasi-objects, which Hawk develops from the work of Latour and Michel Serres, it becomes possible to see technologies such as digital assistants not merely as static things with inherent qualities but rather as active participants in their environments who gain emergent rhetorical capacities through their relations with others.

Casey Boyle follows a similar line of thinking in his investigation of posthuman rhetorical practices. He draws on Gilbert Simondon’s analysis of technologies as ensembles of technical elements and technical individuals to note that “[a]s an ensemble’s relations become saturated, that ensemble internalizes some of its external relations and incorporates as a functioning unit” (32). As Boyle explains, “Our own experience with frenetic development of smartphone devices gives us an in-hand experience for how a device can be seen to practice its functions in ways that get better but also incorporate practices and functions once deemed outside its purview” (31). Much like Hawk and Rickert, Boyle is interested in an ecological approach to rhetorical practice and through the lens of Simondon (and others), he investigates rhetorical ecologies as ensembles of human and nonhuman practitioners. As he argues, “[R]hetoric, by attending to practice and its nonconscious and nonreflective exercises, reframes itself by also considering its actives as exercises within a more expansive body of relations than can be reduced to any individual human. When we practice, we occasion the exercise of an ecology that only later can be understood as human or nonhuman” (59). On a broad scale, the rise of digital media itself represents an incorporation of relations that once existed as external mechanical and chemical media. The smartphone further internalizes media production and display from these once separate formats into a single device. Speech synthesis emerges as a capacity from the intersection of media digitization and computational power and brings to our attention a quintessential example of the nonconscious and nonreflective exercise of embodied rhetoric: the body’s production of speech sounds. As Boyle suggests, it is only later that the ecological practice of making speech sounds can be divided into human and nonhuman actions.

Despite the interest Rickert, Hawk, Boyle, and others show in the rhetorical operation of technologies, artificial intelligence has not been a significant area of investigation within new materialist rhetoric, even though artificial intelligence is an area of interest in the earliest manifestations of new materialism. In 1991, for instance, Manuel DeLanda writes in War in the Age of Intelligent Machines,  

The task of the photoanalyst, for example, will be permanently altered when Artificial Intelligence finally endows computers with the “ability to see.” Although true “machine vision” is still in the future, computers can now “understand” the contents of a video frame as long as the kinds of object displayed belong to a limited repertoire (simple geometric shapes, for instance). Similarly, the task of policing radio communications will take a giant step forward when computers begin to “understand” natural languages, to be able to translate foreign languages automatically, for instance. (181)

Nearly thirty years later, we might remove the scare quotes from that passage as consumer digital assistants like Siri now understand and speak over 20 languages, and machine vision is employed in many fields, perhaps most notably in autonomous vehicles.

Ehren Pflugfelder examines the rhetoric of autonomous vehicles and is a notable exception to my earlier remark about the absence of the rhetorical study of AI. He proposes that “a vehicle like the Google driverless car could be assumed to have agency in the art of navigation” (116). As he notes, this is evident in the unease many Americans feel at the prospect of giving up control of driving to an autonomous vehicle. With this in mind, Pflugfelder writes, “Recognizing how users feel about the shifts in agency for various navigation-related tasks could help technical communicators convey whether users are confused or antagonistic to new autonomous features, and whether users are able to incorporate those features into their own user techne” (117). In other words, we might investigate how autonomous vehicles interact rhetorically with human passengers. How do they persuade human users to rely upon them? How can they be designed to enable humans to negotiate successfully with their cars to expand the collective agency of the automobile-driver/passenger ensemble?

Whereas such questions may be rare in new materialist rhetorics, research in computer science and engineering focuses intently on the technical operations of AI. While AI research has drawn heavily upon theories of logical argumentation, there is a growing interest in other rhetorical elements: “Computational models of argument are being developed to reflect aspects of how humans build, exchange, analyse, and use arguments in their daily life to deal with a world where the information may be controversial, incomplete, or inconsistent” (Atkinson et al. 2). In AI interactions with humans, it is understandable that such models include natural language processing and, when the interactions are being spoken, speech synthesis as well. However, there is also a line of research exploring the use of the same principles for resolving arguments among AI agents. In short, if AIs are to persuade one another (or at least if humans are to help them learn how to do so), then part of the approach may be teaching AIs how to argue in natural language. AI argumentation, natural language processing, and speech synthesis constitute a broad swath of research (and commercial investment). And as we might expect, it comes in highly specialized, esoteric genres (much like our own). There is good reason to doubt our collective ability as a scholarly community to engage productively with this discourse. Equally, there is good reason to approach such scholarship with a critical, skeptical lens as our discipline often has with scientific and technical discourses. At the very least, some understanding of the technical operation of natural language processing and speech synthesis in digital assistants is necessary if we want to understand their nonhuman rhetorical actions.

That said, even if we agree with the necessity of understanding the rhetorical operation of digital assistants, incorporating new materialism into rhetoric to do this work is not without its challenges. Bruno Latour expresses his own skepticism about this move: “I don’t think [rhetoric] is very well equipped to address nonhumans” (Walsh et al. 415). He instead views rhetoric in a more traditional/classical way. In his view, despite the effort to extend rhetorical analysis to nonhumans, “you cannot make the whole proliferation of voices fit inside anything that would have been the equipment of rhetoric at, say, the time of Cicero or of a twentieth-century Perelman, to take two extreme examples. It immediately would become limited” (416). In a sense, this is another rehearsal of arguments over big rhetoric (as Jenny Rice notes in her response to Latour in the same Forum) (431). At the same time, however, Latour does see a crucial need for something many of us would view as a rhetorical practice when he observes,

most of the time we speak badly of each other and we speak badly of the thing in which we are immersed. And we don’t know how to speak in a way that assembles [effectively]. Rhetoric is then only communication, or it’s only PR. So, if you take the highest version of rhetoric as speaking well to those who are most concerned by what you say, it would be really useful. The technical difficulty is that we still have no account of ontological pluralism with that. (423)

I will confess that from my perspective the question of whether or not we call such a project “rhetoric” is a matter of semantics. I don’t believe any field can claim intellectual ownership or primacy over a project so large that Latour considers it to be “the equivalent of civilization” (423). Media studies, science and technology studies, communication, and many other fields have stakes (and expertise) related to these concerns. In this interdisciplinary mixture, I would argue the focus of rhetoric might be “know-how” or, as Latour puts it, “how to speak in a way that assembles” (423).

In the context of these civilization-sized projects, the study of how digital assistants come to speak is a modest, minimal rhetorical act. It is simply an act of deciding how to sound words and phrases. That said, it is undoubtedly a necessary rhetorical act for any kind of speech and one undertaken regularly by humans and now by digital assistants as well. Even when the layer of negotiation is added, along with the additional decisions it requires, there is no reason to attribute to these machines anything like the human experiences of conscious thought and agency that we associate with these rhetorical acts. The trickier part comes later, after we’ve recognized that rhetorical practices can occur in the absence of conscious human thought and agency. What does that mean for how we understand the ontological conditions of rhetoric? Recognizing the rhetorical operation of nonhumans helps us to better understand how rhetoric functions not only independently of humans but in relation to us as well. As Casey Boyle puts it, the key to a posthuman rhetoric “is an acknowledgment of a kind of betweenness among what was previously considered the human and nonhuman” (43). This can only occur by placing rhetoric itself in this in-between space.

 

How AI Speech Synthesis and Negotiation Work

At the most basic level, speech synthesis gives voice to a message that is first formed in text. As such, the compositional and rhetorical challenge of speech synthesis is not in deciding what to say but rather how to say it—that is, in deciding how to pronounce the words. The trajectory of technological development in the field of speech synthesis has moved from exemplar-based systems, which store the speech corpus and then seek to link up or concatenate words and phrases, to model-based systems, which store a model of speech rather than the corpus itself. The latter is termed statistical parametric speech synthesis as the model stores the parameters of speech in a statistically distributed fashion so that there are more models of commonly occurring phonemes. Of course, no system can effectively model all the possible rare occurrences of sounds. Although each individual occurrence is in itself rare, there are enough rare occurrences that it is not uncommon for one of them to arise when Siri speaks. As anyone who spends time listening to Siri speak knows, there have been improvements toward “natural-sounding speech” (as determined by user listening tests), but imperfections remain in the system. The latest iterations of speech synthesis address the imperfections by employing neural networks that are designed to improve Siri’s selection of speech sounds. In this section, I step into the technical description of how speech synthesis works. This takes me far afield from our scholarly discourses, but any understanding of the rhetorical operation of technologies must address their actual functioning as it is only at the intersection of software, hardware, media infrastructures, and human users that any changes to that rhetorical operation can occur. That is, any effort to shift the rhetorical relations among humans and technologies will not get far without an understanding of how the devices function.
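
For readers who want the distinction between these two families of systems made concrete, the following minimal sketch contrasts them. It is a toy illustration of the general idea, not Apple’s implementation; the names and data are invented.

```python
import random

# Exemplar-based (concatenative) synthesis: store recorded units and
# stitch them together. Toy "corpus" of audio bytes per phone.
PHONE_CORPUS = {"h": [b"wav-h-1", b"wav-h-2"], "ae": [b"wav-ae-1"]}

def concatenative(phones):
    # Pick a stored recording for each phone and join them end to end.
    return b"".join(random.choice(PHONE_CORPUS[p]) for p in phones)

# Model-based (statistical parametric) synthesis: store only parameters
# describing each phone's acoustics and generate values from the model.
PHONE_MODELS = {"h": {"mean_f0": 120.0, "var_f0": 9.0},
                "ae": {"mean_f0": 180.0, "var_f0": 16.0}}

def parametric(phones):
    # Sample a pitch value per phone; a vocoder (not shown) would
    # render parameter tracks like this into audio.
    return [random.gauss(m["mean_f0"], m["var_f0"] ** 0.5)
            for m in (PHONE_MODELS[p] for p in phones)]

print(concatenative(["h", "ae"]))  # stitched stored recordings
print(parametric(["h", "ae"]))     # generated acoustic parameters
```

The design trade-off is the one named above: the exemplar approach must store the corpus itself, while the parametric approach stores only a compact statistical description of it.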

As the engineers working on Siri describe, speech synthesis is composed of three parts: a database of speech sounds (called phones or half-phones), a text that needs to be spoken, and a computational process by which the text is matched with sounds and then spoken.[iii] Although the composition of the text to be spoken is a necessary step, it is not one undertaken as part of speech synthesis, so I will set that aside for the moment. In creating a speech-sound database, part of the challenge is that Siri speaks 21 languages. To address this multilingual challenge, the engineers pursue a language-neutral approach to recording and database creation. This process begins with 10-20 hours of recordings of a professional voice talent. Because discrepancies can exist between the recording and the script, the transcription of the audio must be reconciled with the original script. With this cleaned up, the audio itself is then aligned with the corrected transcript, and a number of acoustic and symbolic features are added to the data.
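
This description suggests what a single record in such a database might contain once the audio has been aligned and annotated. The following sketch is a hypothetical illustration; the field names are my assumptions for readability, not Apple’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class HalfPhone:
    phone: str          # e.g., "ae_2" for the second half of "ae"
    audio: bytes        # the aligned slice of the studio recording
    duration_ms: float  # acoustic feature: how long the unit lasts
    f0_hz: float        # acoustic feature: fundamental frequency (pitch)
    stress: bool        # symbolic feature: lexical stress
    word_position: str  # symbolic feature: "initial", "medial", "final"

# One hypothetical entry among the many thousands a voice would require.
unit = HalfPhone("ae_2", b"...", 42.0, 182.5, True, "initial")
```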

Speech synthesis, like most applications, seeks to balance memory and processing demands with user experience, so this data is pruned to create a more efficient system while not degrading the quality of the audio (as measured by a subjective preference test). As a digitized collection of analog sounds combined with various notations and symbolic features, the database is quite different from how humans recall speech sounds. Certainly, we too have an embodied memory of speech sounds and neural pathways for selecting and activating them. Those processes are completely different in material terms, and yet, they both provide access to a possibility space of human vocalization. No two humans will have access to an identical range of human voice sounds. In fact, over time and from moment to moment any individual’s capacity to make those sounds can shift (e.g., from a sore throat or the dentist’s novocaine). However, there’s a fair amount of overlap among those possibility spaces, especially among human speakers of a common language or, even more so, dialect. There is also an overlapping possibility space with Siri. In other words, though we employ different processes, in the end Siri and people are able to make a common set of speaking sounds.

When Siri is given a text to speak, the first step is the creation of a phonetic translation. Capes et al. explain, “To supplement the natural features of the input text, the front-end can also incorporate explicit annotations placed in the text stream to provide hints about pacing, prosody, and discourse domain, many of which are known for specific types of Siri responses” (4011). As we all know from our own speaking, words are often pronounced differently, depending on their placement and particular emphasis in a sentence. Some words flow quickly into one another whereas others tend to be spoken more slowly. These are the types of issues that the phonetic translation and annotations seek to address. Although I would not suggest that such translations require that Siri understand the content of what it is saying in the way that humans tend to, I would suggest that it does represent some understanding of the text. The closest analogy might be a person reading aloud from a text in a language they speak but with content they do not understand. With the newly annotated script in memory, a preselection process identifies a set of 100 half-phones (partial speech sounds) that it might use for each sound in each word. This is a comparatively quick process designed to reduce latency and processing demands. The computationally difficult task is selecting the best choice among those 100 sounds and then stitching them together for the best rhetorical effect. It is at this point that the neural net goes to work:

The input features to the neural nets are language- and context-dependent linguistic features from the front-end. In general, these features are: quinphones, stress, part-of-speech context, tone context (for tonal languages), prominence, sentence type and initial/final positional features for syllable, word, phrase and sentence; and also a positional feature for the half-phone to indicate first or second half. The output features consist of the duration of a half-phone, 13-dimensional Mel-frequency cepstral coefficients (MFCCs) as spectral features and fundamental frequency (f0) at the beginning, middle (only for f0), and end of a half-phone, and their deltas. In total, there are 58 features at the output layer. (Capes et al. 4012)

In identifying the 58 features that define a half-phone, the neural net is calculating the waveform of the sound in relation to how it expects humans will hear it. As even this brief, albeit jargon-filled explanation demonstrates, the synthesis of pleasing speech is a technically complex task that calls upon each instantiation of Siri on each iOS device to make rhetorical decisions again and again, and yet, in terms of the ways in which we often imagine artificial intelligence, it is a modest act.
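
Setting the jargon aside, the procedure just described can be rendered in simplified form: preselect cheap candidates, treat the network’s predicted features as a target cost, and penalize audible seams between adjacent units. The sketch below is a hypothetical, greatly reduced illustration; a greedy pass stands in for the sequence-level search a real system performs, and all names are stand-ins rather than Capes et al.’s code.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    features: list    # the stored acoustic features for this half-phone
    f0_start: float   # pitch at the unit's left boundary
    f0_end: float     # pitch at the unit's right boundary

def predict_features(context):
    # Stand-in for the trained net, which maps linguistic context
    # (quinphones, stress, position, etc.) to 58 acoustic targets.
    return [0.0] * 58

def target_cost(unit, predicted):
    # Distance between a candidate's stored features and the prediction.
    return sum((u - p) ** 2 for u, p in zip(unit.features, predicted))

def join_cost(prev, unit):
    # Penalize audible pitch discontinuities at the splice point.
    return abs(prev.f0_end - unit.f0_start)

def select_units(contexts, preselect):
    chosen = []
    for context in contexts:               # one slot per half-phone
        candidates = preselect(context)    # the ~100 cheap matches
        predicted = predict_features(context)
        best = min(candidates,
                   key=lambda u: target_cost(u, predicted)
                   + (join_cost(chosen[-1], u) if chosen else 0.0))
        chosen.append(best)                # greedy choice here; the real
    return chosen                          # system searches whole sequences
```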

From the perspective of rhetorical agency, what is important here is the recognition that Siri is making complex decisions about how to say a phrase and is doing so with a rhetorical purpose. That purpose is not simply that of the designers or Apple. If Siri performed exactly as its designers would wish it to, then it would speak perfectly every time. Ironically, it is in Siri’s errors and glitches that we most easily recognize Siri’s independent actions at work, making better or worse decisions about how to say a sentence. No one will confuse Siri’s speech synthesis with the typical process of human speaking. For one thing, only rarely are we giving voice to a prepared text. However, in an abstract sense, humans face the same challenges as Siri does in figuring out how to make the sounds of speech, and there is an evolutionary process by which our species acquired the capacity to vocalize the complex range of sounds that comprise human speech. That process, which distinguished human vocalization from that of other primates, began with the shrinking of the mouth, the increased flexibility of the tongue, the lengthening of the neck, and the dropping of the larynx in the throat. Of course, speech also relied upon the development of the brain, including the capacity to control our breathing, as controlled exhalation is vitally necessary for making sounds. These physiological processes, by which we control our bodies to intone words in specific ways, are the human actions most analogous to Siri’s process of speech synthesis. These physiological actions emerge from the nonconscious, which, as Katherine Hayles describes, “comes online much faster than consciousness and processes information too dense, subtle, and noisy for consciousness to comprehend. It discerns patterns that consciousness is unable to detect and draws inferences from them; it anticipates future events based on these inferences; and it influences behavior in ways consistent with its inferences” (28). For example, when we speak with compassion or anger or sarcasm, we typically do not give conscious attention to the tone of our voice (though obviously we can, as is evident from the vocal training of singers to the practice of actors and from the rhetorical canon of delivery).

In short, Siri’s speech synthesis, though it lacks a conscious awareness of what it is saying, operates in a manner analogous to many of the nonconscious operations of human speech. By this, I do not mean to suggest Siri is as rhetorically versatile as most humans. Obviously, it is not. However, since speech synthesis is designed to be rhetorically effective in relation to humans, Siri—when it is working—exhibits rhetorical tendencies and capacities that it shares with humans. Both human and AI interact with language as an assemblage composed of material components emerging from the history of human physiological capacities to make sounds, a history of human expression, and the code of the language itself, which regulates how vocalization can be used to produce expression. Even with the task of deciding what to say still bracketed, both the human and the AI confront physical limitations in relation to glitches and latency. These are minimal rhetorical considerations, and I do not mean to suggest they are the foundation or origin of rhetoricity. Rather, they are one set of thresholds that are independently overcome by both senders and receivers. That is, both humans and AIs seek to send and receive valuable (relatively glitch-free) speech in a timely manner (with low latency).

It is reasonable to focus solely on speech synthesis as a separate piece of programming designed to perform a limited task; yet, it is important to note that these functions are simultaneously intertwined with other rhetorical operations related to natural language processing and the machine composition of texts. That is, Siri’s speech synthesis is of little utility if it is not connected to a capacity to understand human speech and compose responses. Digital assistants will soon be able to carry on simple conversations with humans to call a restaurant and make dinner reservations. In turn, it seems equally likely that the human-sounding voice answering customer-service calls will also be an AI. Indeed, it makes sense that one’s own AI will be conversing with others’ AIs at the restaurant or bank. At that point, the need for natural language might seem to be obviated and the only question then would be whose AI wins these quotidian rhetorical battles played out as a series of mathematical moves. Does your AI convince the bank’s AI to waive that overdraft fee? However, at least in terms of current engineering approaches, natural language is integral to negotiation because AIs are learning to argue by modeling their behaviors on humans. Although it is certainly possible to diagram arguments in symbolic, rational terms from which one imagines computers could make arguments, just as they devise winning chess strategies, negotiation and persuasion do not work that way. As we know, unquestionably rational arguments often fail to persuade. Our tendency is to explain the failure of rational arguments in terms of human nature, individual psychology, group behavior, or ideology. The ability of human audiences to resist even the most rational argument is a testament to human agency and, arguably, one of the primary reasons why rhetorical strategies beyond inartistic proofs must be developed. However, in the efforts to model AI negotiations, these nonrational responses to arguments in natural language are not presumed to be strictly human-cultural qualities but are instead features of the terrain of negotiation itself.

In research conducted by Facebook, two AI agents negotiate over the division of a set of resources consisting of balls, books, and hats (Lewis et al.). The research begins with collecting a large dataset of negotiations between two people. Using this dataset, AI agents are then trained to generate likely models of how a negotiation will proceed. Research into the development of these rhetorical capacities seeks to balance desirable outcomes with the ability to conduct negotiations in natural languages. This latter objective has two ostensible purposes: first to make the negotiations intelligible to humans and second to create the possibility for AIs to negotiate with humans or advise humans involved in real-time negotiations (e.g., having an AI advisor helping to negotiate the price of a car). However, the process of focusing on natural language has an unintended consequence. Lewis and his colleagues discover “that models trained to maximise the likelihood of human utterances can generate fluent language, but make comparatively poor negotiators, which are overly willing to compromise” (2443). To remedy this outcome, the researchers switched from a model focused primarily on producing human speech to one focused on achieving goals. They employed two strategies to make this switch. The first was “self play, in which pre-trained models practice negotiating with each other in order to optimise performance,” and the second was “dialogue rollouts, in which an agent simulates complete dialogues during decoding to estimate the reward of utterances” (2443-44). In short, through self-play, agents were trained not only to produce natural language but also to develop strategies that led to achieving their objectives. The dialogue rollouts then allowed the AIs to use that knowledge to evaluate utterances during a negotiation in an attempt to figure out which utterances would most likely lead to the best outcome.
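
A schematic sketch may clarify how dialogue rollouts work: before speaking, the agent simulates complete negotiations following each candidate utterance and chooses the one with the best expected outcome. This is a simplified, hypothetical rendering of the idea described by Lewis et al., not their code; the sampling, simulation, and reward functions here are stand-ins.

```python
def choose_utterance(state, sample_utterances, simulate_to_end, reward,
                     n_candidates=5, n_rollouts=10):
    # Dialogue rollouts, simplified: for each candidate utterance,
    # simulate several complete negotiations from this point, average
    # the final reward, and say whichever utterance scores best.
    best, best_score = None, float("-inf")
    for utterance in sample_utterances(state, n_candidates):
        total = 0.0
        for _ in range(n_rollouts):
            # The model plays both sides of the remaining dialogue.
            outcome = simulate_to_end(state, utterance)
            total += reward(outcome)   # value of the final deal
        score = total / n_rollouts
        if score > best_score:
            best, best_score = utterance, score
    return best
```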

Perhaps unsurprisingly, the goal-based agents regularly outperformed the likelihood agents in negotiations. They also negotiated harder with humans as evidenced by the longer dialogues that were recorded between them. One side effect of this negotiation was an increased number of instances where no deal was struck, even when not achieving agreement resulted in a negative consequence for both parties. This result is not unlike that seen in the Ultimatum Game, an economic experiment in which one party is given $20 provided that she shares it with her friend, who must accept the proposed split for either of them to receive anything. Rationally speaking, the friend should accept any amount of money since it’s free money he wouldn’t otherwise have, but in practice, people rarely accept a cut of less than 30%. Though no one would suggest that AI negotiators experience stubbornness, in these failed negotiations they often exhibit a stubborn refusal to compromise. A second side effect was that the goal-based agents learned to deceive by “initially feigning interest in a valueless item, only to later ‘compromise’ by conceding it” (2450). Unlike humans, the agents in the Facebook experiment did not have the capacities to remember prior interactions with specific individuals, share that information with a community, or ultimately act communally to establish metanorms that might, for example, punish deceitful action. However, the addition of such capacities is hardly far-fetched as they are not so different from the way that self-driving cars share experiences and learn from one another about the behaviors of human drivers and pedestrians.
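
The Ultimatum Game dynamic can be shown with a toy calculation using the numbers above: a purely rational responder accepts any nonzero offer, while a human-like responder rejects offers below roughly 30% of the $20 pot. The threshold and function names here are illustrative, not experimental parameters.

```python
POT = 20.00  # the stake from the example above

def rational_responder(offer):
    # Free money one wouldn't otherwise have: accept anything positive.
    return offer > 0

def human_like_responder(offer, threshold=0.30):
    # Empirically, offers below roughly 30% of the pot get rejected.
    return offer >= threshold * POT

for offer in (1.00, 5.00, 6.00, 10.00):
    print(offer, rational_responder(offer), human_like_responder(offer))
# Offers of $1 and $5 are accepted only by the rational responder;
# $6 (30% of $20) and above satisfy both.
```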

 

Rhetorical Possibilities

In the end, despite the technical complexities, Siri’s speech synthesis is fairly straightforward. It is a program designed to make decisions resulting in speech its human listeners will find pleasing. Nonetheless, Siri is concerned with the rhetorical qualities of delivery, and, much like any human delivering a speech written by another person, it is a rhetorical actor. Latour suggests that “an actor is what is made to act by many others . . . [An actor] is not the source of an action but the moving target of a vast array of entities swarming toward it” (Reassembling 46). In this sense, through its speech synthesis programming, Siri is made to act/speak. Similarly, AI negotiators are designed to compose sentences that they predict will lead to an agreement that meets their pre-established goals. Through their software, they are made to negotiate, and these acts of persuasion are clearly rhetorical acts. All this really means is that we now have machines that are designed to make decisions and take actions that have rhetorical effects. To what extent this means that these machines have agency remains unclear, but there are certainly strict limits to that potential agency. For example, both Siri and humans make decisions about how to speak words and generally try to do so in a recognizable and rhetorically pleasing way. However, obviously, humans do not always respond when spoken to and can speak in a range of non-pleasing ways. Siri does not have those options. That said, these machines increasingly demonstrate rhetorical capacities that lead us to interact with them as legitimate rhetorical interlocutors, even as we maintain an awareness that they are not human. Through their rhetorical acts, digital assistants invite us to respond to them as fellow interlocutors. This response-ability, to use Diane Davis’s term, describes an “inclination toward the other” that precedes being; as Davis contends, “if it were not for this irremissible obligation, this preoriginary obligation to respond, then in the face of the other I would nonchalantly file my nails. The face comes through each time as pure appeal, persuasion without a rhetorician” (14). Along with the face, I would add the voice. When Siri speaks, it incites a response.

Of course, even with this response-ability, speaking to (with? at?) Siri might still seem strange: a self-conscious, uncanny act. As Latour observes,

 . . . we often make a mistake when we see someone talking wildly to himself and making sweeping gestures—before we notice he is talking to someone through the intermediary of a portable phone. ‘Ah, so I was wrong about the setup: he isn’t crazy, he is using a device to talk to someone else!’ These are the questions we now have to raise: who is being addressed by those who assert that they are only talking to themselves, and what apparatus serves as their go-between? (Inquiry 189)

To put it differently, which sounds saner? To imagine that our engagement with the world is mediated by our relationship with objects and our environment, or that it’s all in our head? In this situation, the someone being addressed is Siri. Is a spoken exchange between Siri and a human a rhetorical encounter between two beings? Or is it an internal hallucination or act of will and imagination purely on the part of the human? No one expects Siri and its counterparts will be passing a Turing test any time soon. When we speak to a digital assistant, interact with a computer voice on a customer-service phone line, or chat with a bot assistant on a website, few of us would imagine that we are having a conversation with the kinds of AIs in science fiction films. On the other hand, we are unlikely to care, so long as our interactions end with the desired result; these assistants are simply another set of actors with whom we can and do converse.

Fundamentally, we might take from Latour the realization that we have always been in conversation with nonhumans. In some respects, this realization is built into rhetoric, as we have long studied the rhetorical operation of technology to understand how writing, books, computers, and so on have shaped rhetorical practice, and in a sense, digital assistants are part of that continuum. However, digital assistants are also a change in kind; for as much as we might idiomatically say that we are “in conversation” with a text, it never literally talks back. This technological shift and its implications for rhetorical practice cannot be ignored but neither can the ethical and political problems that might arise as a consequence of extending rhetorical capacities to nonhumans. As Latour’s example suggests, nonhuman rhetoric has implications for our understanding of human rhetoric, even if those implications are only the strange looks we get when we attempt to converse with Siri in public. Still, those challenges cannot be solved by ignoring the rhetorical operation of increasingly sophisticated machines. Instead what is required is a revised conception of the relationship among rhetoric, agency, and cognition, one that allows us to responsibly interrogate the rhetoric of machines without placing any necessary limits on human capacities. This really shouldn’t be difficult to achieve. After all, there’s no reason why an investigation of Siri as a rhetorical actor should affect the ontological possibilities of human action in an abstract sense. That said, Siri does shape the actual historical and material conditions under which specific capacities for human rhetorical practices arise, so, in this sense, understanding the rhetorical operation of digital assistants might prove important for understanding the specific rhetorical practices humans develop. With this in mind, studying Siri’s rhetoric seems more pressing from an ethical-political angle as we might want to intervene in the development of technology in a way that expands human capacities in certain ways rather than others.

One way to accomplish this investigation is through the concept of possibility spaces that I mention earlier. Manuel DeLanda describes these as “the structure of the space of possibilities that is defined by an entity’s tendencies and capacities” (Philosophy 11). That is, in any given moment, an entity will exhibit certain tendencies and capacities rather than others, but these arise from a virtual space of possibilities whose structure can be explored and described. DeLanda divides these possibilities into tendencies and capacities. In short, tendencies are a relatively delimited list of characteristics that are related to the qualities of the object; capacities, however, are characteristics that only emerge in relation to other objects. For example, water in the glass on my desk has a tendency to freeze, but it isn’t frozen now. My iPhone has a tendency to lose its battery charge, but it’s charged at the moment. In the hands of an expert, a block of ice has the capacity to become an ice sculpture. Siri has the capacity to synthesize speech, but it must be commanded to do so by a human user. Rather than speaking generally of possibilities, DeLanda seeks to describe possibility spaces that structure and contain these tendencies and capacities.

Arguably, possibility spaces lie at the foundation of rhetorical practice. That is, when Aristotle speaks of the available means of persuasion, he is speaking of what is possible. Commonplaces describe rhetorical capacities that might arise for particular occasions. Of course, those possibilities can shift with each rhetorical situation. Many rhetorical capacities are so common and engrained in our actions that we tend to overlook them, as when we continue to speak in the same language and use our bodies to give voice to our speech. Siri doesn’t have our minds or our bodies, but it accesses the same possibility space for synthesizing speech as we do for speaking sentences. That possibility space of rhetorical action isn’t inside us. It isn’t inherent to humanness. Siri accesses only a tiny portion of rhetorical possibilities, but so do we. In each rhetorical situation, we are limited in our choices in relation to the range of rhetorical acts we have undertaken before. But even if all the possibilities for human rhetorical action across time were taken to describe a single vast possibility space, there’s no reason to doubt that the possibilities for rhetoric extend far beyond what has ever been available to us. Indeed, we might sincerely hope this is the case if we want to believe that the future contains the possibility for better understanding among humans and our nonhuman fellow travelers.

With this in mind, the population of natural language users—human and nonhuman—share a possibility space of rhetorical tendencies and capacities. As Siri’s speech synthesis function is strictly related to speaking sentences, the shared possibility space it has with us is quite small. In that small space, as I have discussed, it shares our tendencies for pronouncing words in relation to one another, for pacing, prosody, and so on. Both we and Siri have to make decisions based on how the language we speak works. None of us have much say in that, and for humans, choices about how to pace and pronounce words are usually made unconsciously. Similarly, in this small possibility space of speech, we might discover shared capacities with Siri. Capacities are always more elusive than tendencies as they must be discovered or invented in our relations with others. However, to give an appropriately posthuman example, Siri shares with humans the capacity to use speech to give voice commands to other machines. In other words, Siri can command Alexa using its voice in the same way we do. To generalize this capacity, though, Siri shares with us the ability to convey information and commands to others who have the reciprocal capacity to hear and understand the language Siri is speaking. Siri’s speech might even persuade a human audience, such as when its delivery of the weather report convinces us to carry an umbrella. Of course, most adult human listeners will recognize that Siri is a machine giving voice to a text produced elsewhere, so perhaps the creators of the weather report are also due some recognition for persuading us. Again, Siri’s speech synthesis capacities are strictly limited to just that—synthesizing speech—but to the extent that the synthesis of speech is integral to a rhetorical act, Siri exhibits rhetorical capacities that it shares with the population of human speakers.

This is also where we can draw the boundaries of this shared space. Although Siri has a tendency to synthesize pleasing speech and a capacity to inform, command, and persuade, it does not share many of the other qualities we associate with rhetorical practice. For example, though words can be mispronounced and sentences misspoken by humans and machines alike, for us these failures are tied up with our desire to be understood and recognized in a conversation. They are interwoven with other agential, subjective, and cognitive functions. As such, our desires and egos are at stake. But with Siri, such stakes are not integral to rhetorical action. In an everyday encounter, a user says “Hey Siri” or perhaps picks up the phone to have her face identified by the camera. A dialogue ensues in which the user’s speech is converted to text, the text to a request carried out by distant servers, and the result converted back into speech by Siri on the phone. On one hand, with eyes squinted, we might see the playing out of Burkean identification intermixed with a procedural rhetoric. On the other hand, we might insist it is nothing of the sort. We can say with confidence that Siri is not like “us,” that we do not “identify.” Similarly, though Siri is identifying us, it is not the Burkean-Freudian notion of identification. The processes that listen for the wake-up phrases, that turn human speech into text, that identify that text as an actionable request (i.e., something Siri can do), that then carry out that request, and finally that turn the output from text into Siri’s speech are each separate from one another. There is no central point of control, no consciousness to speak of (or with or from). Of course, the same might be said for the many nonconscious operations of the human rhetor whose capacities to recognize another by face or voice, to translate sound into words into concepts, to cogitate upon those words and to form a response, and then to vocalize that response are also separable. We also know this about ourselves and our human interlocutors, but we don’t give much thought to that either. What Siri demonstrates is that although all manner of agential, subjective, and cognitive processes may be occurring in humans when they are engaged in rhetorical actions, such processes are not necessary for rhetoric itself.
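
The separateness of these processes can be rendered schematically. The sketch below is hypothetical (the stage and component names are mine, not Apple’s); its point is simply that each stage hands off to the next without any central controller or consciousness overseeing the exchange.

```python
def detect_wake_phrase(audio):
    # Stub: a real detector runs continuously on-device.
    return audio == b"hey siri"

def handle_interaction(mic, asr, intents, server, tts, speaker):
    # Each stage below is a separate process; none oversees the others.
    audio = mic.listen()
    if not detect_wake_phrase(audio):      # wake-phrase detection
        return
    text = asr.transcribe(mic.listen())    # user's speech -> text
    request = intents.match(text)          # text -> actionable request
    result = server.fulfill(request)       # carried out on distant servers
    speaker.play(tts.synthesize(result))   # result text -> Siri's speech
```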

In short, rhetoric can be described as a posthuman possibility space whose tendencies and capacities might be activated by both humans and nonhumans. Devices such as Siri demonstrate a fairly limited interaction with those possibility spaces but in doing so suggest the possibility of future machines with more access to those possibilities. We are a receptive audience for Siri. Even when we fully recognize Siri’s limitations, we also understand what it says. We recognize the sounds it makes as speech, and, importantly, we distinguish between Siri’s speech and the other speech—from videos, podcasts, audiobooks, etc.—that we might hear through our iPhones. That is, we identify Siri as the speaker, and, unlike those other voices, we recognize Siri as a being to whom we might respond and who will respond to us, with whom some kind of conversation might occur. Though Siri’s speech synthesis can be described and investigated as a discrete feature of an iPhone, it is ultimately difficult to disambiguate its rhetorical effect upon human users from the integrated operation of the smartphone. Are we persuaded by Siri’s voice to take the next left when relying upon our phones for driving directions? Not exactly, or at least not its voice alone, as that speech synthesis is tied to a mapping application to say nothing of a network of GPS technologies. Our decision to interact with Alexa is certainly more tied to our ability to ask it to perform an increasing number of functions in our ever “smarter” homes than it is to our desire to hear its voice. And yet without the voice, the driving directions wouldn’t be available to us, and we’d have no way of knowing if our vocal requests had been acknowledged. The expansion of these devices’ capacities is clearly designed to make them more desirable. Some might say addictive. We become more dependent upon them both for things we once did without them and for things we never imagined needing to do but now find necessary. These are all rhetorical effects of these devices. The farther we go with such thinking, the more difficult it becomes to identify whose thought or agency might be at work. Does the iPhone desire us to use it? Who can say? Certainly, Apple wants us to use the phones they make, and they design them to foster such desire. On the other hand, in Latourian fashion, we might ask the users themselves why they act as they do. Why do they obey Siri when it tells them to take the next exit? Why do they ask Alexa for a weather report? Somewhere in the many answers they might offer is the realization that one speaks with digital assistants because they are beings capable of such rhetorical interactions. And that realization shifts the foundations on which nonhuman rhetoric and speech can be investigated.

Even with this shift, we are still left with all the concerns for human political agency that Miller identifies, but we might now argue that those concerns cannot be productively addressed without recognizing the nonhuman rhetorical agents with whom we share our media ecologies. We are also left with Latour’s questions about the appropriate role for rhetoric as a practice and a field of study in the investigation of new materialist ontological pluralism. There are no easy answers for such concerns. We can say that we must move with care and respect so that we are attentive to that ontological pluralism, so that we can learn how to speak well of others. Knowing that we must do such things is quite different from knowing how to do them, though the two are intertwined. Rather than imagining the simple existence of nonhuman rhetorical actors as a threat to our capacities to think and act, the investigation of a non-anthropocentric rhetoric and the rhetorical alliances we form with humans and nonhumans alike might instead open avenues for addressing our collective concerns.




[i] For a detailed discussion of new materialism and possibility spaces, see Manuel DeLanda’s Philosophy and Simulation.

[ii] New materialist and posthuman rhetorical ecologies draw in various ways from a longer history of ecological approaches to rhetoric and composition from Cooper’s “Ecology of Writing” through the rise of ecocomposition (e.g., Syverson) to Edbauer’s rhetorical ecologies. They also draw from concepts of media ecology, distributed cognition, and extended mind.

[iii] The work of the Siri Team on its development of Siri for iOS 11 is reported in the Apple Machine Learning Journal. My summary here of the general process for developing Siri draws on that journal.

Works Cited

Atkinson, Katie, et al. "Towards Artificial Argumentation." AI Magazine, vol. 38, no. 3, 2017, pp. 25-36.

Barad, Karen. Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Duke UP, 2007.

Bennett, Jane. Vibrant Matter: A Political Ecology of Things. Duke UP, 2010.

Boyle, Casey. Rhetoric as a Posthuman Practice. Ohio State UP, 2018.

Braidotti, Rosi. Posthuman Knowledge. John Wiley & Sons, 2019.

Capes, Tim, et al. “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System.” Interspeech, Aug. 2017, Stockholm, Sweden.

Cooper, Marilyn. “The Ecology of Writing.” College English, vol. 48, no. 4, Apr. 1986, pp. 364–375.

Davis, D. Diane. Inessential Solidarity: Rhetoric and Foreigner Relations. U of Pittsburgh P, 2010.

DeLanda, Manuel. Philosophy and Simulation: The Emergence of Synthetic Reason. Bloomsbury, 2015.

---. War in the Age of Intelligent Machines. Zone Books, 1991.

Edbauer, Jenny. “Unframing Models of Public Distribution: From Rhetorical Situation to Rhetorical Ecologies.” Rhetoric Society Quarterly, vol. 35, no. 4, 2005, pp. 5-24.

Harman, Graham. Object-Oriented Ontology: A New Theory of Everything. Penguin UK, 2018.

Hawk, Byron. Resounding the Rhetorical: Composition as a Quasi-Object. U of Pittsburgh P, 2018.

Hayles, N. Katherine. Unthought: The Power of the Cognitive Nonconscious. U of Chicago P, 2017.

Latour, Bruno. An Inquiry into Modes of Existence: An Anthropology of the Moderns. Translated by Catherine Porter, Harvard UP, 2013.

---. Reassembling the Social: An Introduction to Actor-Network Theory. Oxford UP, 2005.

Lewis, Mike, et al. “Deal or No Deal? End-to-End Learning of Negotiation Dialogues.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Sept. 2017, Copenhagen, Denmark, Association for Computational Linguistics, pp. 2443–2453.

Miller, Carolyn R. “What Can Automation Tell Us About Agency?” Rhetoric Society Quarterly, vol. 37, no. 2, 2007, pp. 137–157.

Pflugfelder, Ehren Helmut. Communicating Mobility and Technology: A Material Rhetoric for Persuasive Transportation. Routledge, 2016.

Rickert, Thomas. Ambient Rhetoric: The Attunements of Rhetorical Being. U of Pittsburgh P, 2013.

Siri Team. “Deep Learning for Siri's Voice: On-Device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis.” Apple Machine Learning Journal, Aug. 2017, machinelearning.apple.com/2017/08/06/siri-voices.html.

Syverson, Margaret A. The Wealth of Reality: An Ecology of Composition. Southern Illinois UP, 1999.

Walsh, Lynda, et al. “Forum: Bruno Latour on Rhetoric.” Rhetoric Society Quarterly, vol. 47, no. 5, 2017, pp. 403-462.