The Shape of Stories

13/01/2020

There’s a great clip of Kurt Vonnegut giving a lecture on “the shape of stories”. He makes the case that stories can be distilled down to a two-dimensional plot. The y-axis is the valence of the story (what he calls the "G-I axis", for good fortune to ill fortune), and the x-axis is time (which he calls the "B-E axis", for - you guessed it - beginning to entropy). Not only does Vonnegut say that stories can be distilled to these shapes, but also that:

There's no reason why the simple shapes of stories can't be fed into computers.

That is, we should be able to extract the shape of a story using algorithms! And so, in this post I will attempt to do just that: I will use sentiment analysis to try to empirically recreate the curves which Vonnegut attributes to several stories.

Let's start by importing the libraries we'll need and looking into some sentiment analysis.

Sentiment Analysis

There are many approaches to sentiment analysis. The easiest is a simple bag-of-words model, in which we count up the number of positive words (“happy”, “good”, “amazeballs”) in the text, then count the number of negative words (“hangry”, “bleh”, “stanky”), and the valence of the text is the number of positive words minus the number of negative words. This approach will probably do for our purposes.
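As a rough illustration of the bag-of-words idea, here is a minimal sketch. The two word sets are toy lexicons made up from the examples above, and `bag_of_words_valence` is a hypothetical helper name, not something from a library.

```python
import re

# Toy lexicons for illustration only; a real run would use a proper sentiment lexicon.
POSITIVE_WORDS = {"happy", "good", "amazeballs"}
NEGATIVE_WORDS = {"hangry", "bleh", "stanky"}


def bag_of_words_valence(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (number of positive words) minus (number of negative words)."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    n_positive = sum(token in positive for token in tokens)
    n_negative = sum(token in negative for token in tokens)
    return n_positive - n_negative


if __name__ == "__main__":
    print(bag_of_words_valence("Happy, happy day! But I'm hangry."))
```

Word order and negation ("not good") are ignored entirely, which is the main weakness of this kind of model.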

After a little Googling, I came across this page of sentiment analysis resources. I decided to go with SentiWordNet because it's built on WordNet, which I'm already familiar with. SentiWordNet gives a positive sentiment score and a negative sentiment score to every synset (group of synonymous words) in WordNet's lexicon. Because a given word could have several associated synsets, each corresponding to a different meaning of the word, there are several possible sentiment values associated with each word. I decided to simply use the sentiment associated with the first synset of each word.
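A sentiment function along these lines might look like the sketch below. It assumes NLTK's SentiWordNet reader (`nltk.corpus.sentiwordnet`, which needs the `wordnet` and `sentiwordnet` corpora downloaded), and it assumes that the two scores are combined as positive minus negative, which is one plausible choice. The function name and the `lookup` parameter are hypothetical; `lookup` only exists so the logic can be tried without the corpora installed.

```python
def first_synset_sentiment(word, lookup=None):
    """Sentiment of a word, taken from its first SentiWordNet synset.

    Returns pos_score - neg_score of the first synset (an assumed way of
    combining the two scores), or 0.0 if the word has no synsets.
    """
    if lookup is None:
        # Imported lazily: requires nltk plus the 'wordnet' and
        # 'sentiwordnet' corpora (nltk.download(...)).
        from nltk.corpus import sentiwordnet as swn
        lookup = swn.senti_synsets

    synsets = list(lookup(word))
    if not synsets:
        return 0.0  # unknown words are treated as neutral

    first = synsets[0]
    return first.pos_score() - first.neg_score()
```

Taking only the first synset ignores part of speech and word sense, so a word that is negative in one sense and neutral in another gets whichever sense WordNet happens to list first.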

Let's make sure that this sentiment function returns sensible results:

Looks ok. Ideally terrible would be worse than bad, and misery would be worse than poverty.

Text Extraction

We'll use Beautiful Soup to extract the text of the story from the web.
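Here is a minimal sketch of what the extraction step could look like. It assumes the story lives in the page's `<p>` tags, which will not hold for every site, and both function names are hypothetical. The download uses the standard library's `urllib`, and the parsing uses Beautiful Soup's built-in `html.parser` backend.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as a string (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_paragraph_text(html):
    """Return the text of every <p> tag, one paragraph per block."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        # Collapse runs of whitespace left over from the HTML layout.
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Once upon a <b>time</b>.</p></body></html>"
    print(extract_paragraph_text(sample))
```

In practice each site needs a little inspection first, since navigation text and footers often sit in `<p>` tags too.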

Let's see if we can extract the text for Cinderella.

Hmmm... this is looking a bit Grimm, and doesn't really match the narrative arc that Vonnegut described. Here's another version I found that looks better suited to this project:

Data Inspection

We'll split this text up into 100 chunks of words, and calculate the mean sentiment for each chunk.
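The chunking step can be sketched roughly as follows. The helper names are hypothetical, and the `score` argument stands in for whatever word-level sentiment function is in use.

```python
def split_into_chunks(words, n_chunks=100):
    """Split a list of words into n_chunks consecutive, near-equal chunks."""
    n = len(words)
    # Integer boundaries spread any remainder evenly across the chunks.
    return [
        words[i * n // n_chunks:(i + 1) * n // n_chunks]
        for i in range(n_chunks)
    ]


def mean_sentiment(chunk, score):
    """Mean of score(word) over a chunk; empty chunks count as neutral."""
    if not chunk:
        return 0.0
    return sum(score(word) for word in chunk) / len(chunk)


if __name__ == "__main__":
    # Toy scoring function for illustration only.
    toy_scores = {"good": 1.0, "bad": -1.0}
    words = "the good fairy met the bad stepmother".split()
    chunks = split_into_chunks(words, n_chunks=2)
    print([mean_sentiment(c, lambda w: toy_scores.get(w, 0.0)) for c in chunks])
```

Averaging over every word in the chunk, including neutral ones, keeps the values small, which is one reason the curve needs rescaling before it can be compared with a hand-drawn one.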

Let's look at the points where the highest and lowest sentiments occur:

Some of these make sense, others not so much. On the whole, I think this will be acceptable, but it will certainly be worth trying other sentiment analysis tools in the future.

The Sentiment Plot

We now need to decide how to plot the sentiment over the course of the story. For Cinderella, the plot Vonnegut drew looked something like this:

The progress goes something like this:

Let's now compare this to the empirical sentiment over time.

Smoothing the Sentiment Plot

The empirical sentiment plot is so jagged it's hard to discern any overall pattern. Let's smooth out the curve so that we can get a better sense of some high-level trends.

Sliding Window

First, let's try running a sliding window across the sentiments. The width of the window is a hyperparameter we need to tune, so I plotted a few reasonable-sounding values to see which looks best.

Window sizes of 5%, 10%, and 25% all look reasonable. I decided to go with the middle one of these: 10%.
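For reference, a sliding-window average of this kind can be sketched in a few lines. This version assumes the window is expressed as a fraction of the series length, is roughly centred on each point, and is simply truncated at the two ends; the function name is hypothetical.

```python
def sliding_window_mean(values, window_frac=0.10):
    """Smooth a series with a centred moving average.

    window_frac is the window width as a fraction of the series length;
    the window is truncated (not padded) at the ends of the series.
    """
    n = len(values)
    width = max(1, round(n * window_frac))
    smoothed = []
    for i in range(n):
        start = i - width // 2
        end = start + width
        window = values[max(0, start):min(n, end)]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    spiky = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0]
    print(sliding_window_mean(spiky, window_frac=0.5))
```

Truncating at the edges means the first and last few points are averaged over fewer values, so they stay noisier than the middle of the curve.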

EWMA

Another approach to smoothing out the graph would be to use an exponentially weighted moving average (EWMA). This has the nice property that the contribution of words to the current sentiment value decays exponentially as you move through time. If the sentiment in the current window is given by $s_i$, then the EWMA sentiment value is given by

$$ S_i = \alpha \cdot S_{i-1} + (1-\alpha)\cdot s_i. $$

Again, we have a hyperparameter: $\alpha$, the decay constant. And again, I plotted a few reasonable-sounding values to see what looks best.
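The recurrence translates almost directly into code. In this sketch the first smoothed value is seeded with the first raw value, $S_0 = s_0$, which is an assumption, since the formula itself does not say how to start; the function name is hypothetical.

```python
def ewma(values, alpha=0.75):
    """Exponentially weighted moving average.

    Implements S_i = alpha * S_{i-1} + (1 - alpha) * s_i,
    seeded with S_0 = s_0 (an assumed initial condition).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    smoothed = []
    previous = None
    for s in values:
        current = s if previous is None else alpha * previous + (1 - alpha) * s
        smoothed.append(current)
        previous = current
    return smoothed


if __name__ == "__main__":
    print(ewma([0.0, 4.0, 4.0, 4.0], alpha=0.75))
```

With this form of the recurrence, a larger $\alpha$ puts more weight on the past and so gives a smoother, slower-moving curve.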

I think $\alpha=0.75$ is probably the best of these, but it's not as good as the sliding window with a window size of 10%, so I've used the sliding window from now on. Let's now put that on the same axes as Vonnegut's plot, and see how well it matches up. Note that the range of the sliding window sentiments is very small, so we have to normalise it.
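One simple way to do that normalisation is a min-max rescale onto a fixed range. The sketch below assumes a target range of $[-1, 1]$, which is an illustrative choice rather than anything dictated by the data, and `rescale` is a hypothetical helper name.

```python
def rescale(values, low=-1.0, high=1.0):
    """Linearly rescale a series so its min maps to low and its max to high."""
    if not values:
        return []
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A flat series has no shape to preserve; put it at the midpoint.
        return [(low + high) / 2.0 for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


if __name__ == "__main__":
    print(rescale([1.0, 2.0, 3.0]))
```

This only changes the scale and offset of the curve, not its shape, so peaks and troughs stay where they were.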

Not too bad. The main problem I see with this is that the empirical sentiment starts at a high point, whereas the Vonnegut curve starts at a low point. Looking back at the first paragraph, the opening is a little ambiguous:

A rich man's wife became sick, and when she felt that her end was drawing near, she called her only daughter to her bedside and said, "Dear child, remain pious and good, and then our dear God will always protect you, and I will look down on you from heaven and be near you." With this she closed her eyes and died.

All The Stories!

Now that we've got our sentiment plotting procedure down, we can plot all the kinds of stories mentioned in Vonnegut's talk. In addition to Cinderella, we've got

Let's see how the Vonnegut plots compare to the empirical sentiment for each of these stories.

Observations:

I was curious what the tall peak corresponded to in the Hamlet plot, so I wrote a function to look at the text for the top $n$ peaks or troughs of a given curve.
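A function of that kind could be sketched as below. It assumes a "peak" means a local maximum of the curve (and a trough a local minimum), and that `chunks` holds the list of words behind each point; both function names are hypothetical.

```python
def top_extrema(values, n=3, kind="peak"):
    """Indices of the n highest local maxima (or lowest local minima)."""
    if kind not in ("peak", "trough"):
        raise ValueError("kind must be 'peak' or 'trough'")
    sign = 1 if kind == "peak" else -1
    signed = [sign * v for v in values]  # troughs become peaks when negated
    candidates = [
        i for i in range(1, len(signed) - 1)
        if signed[i] > signed[i - 1] and signed[i] >= signed[i + 1]
    ]
    return sorted(candidates, key=lambda i: signed[i], reverse=True)[:n]


def extrema_text(values, chunks, n=3, kind="peak"):
    """Pair each extremum index with the text of the matching chunk."""
    return [(i, " ".join(chunks[i])) for i in top_extrema(values, n, kind)]


if __name__ == "__main__":
    curve = [0, 3, 1, 5, 2, -4, 0]
    print(top_extrema(curve, n=2, kind="peak"))
    print(top_extrema(curve, n=2, kind="trough"))
```

Because this looks only at interior points, an extreme value at the very start or end of the story will not be reported.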

Looks like it's the bit where Rosencrantz and Guildenstern arrive at Elsinore. Is that a high point of the story? I don't remember it standing out, but I don't know Hamlet super well.

Conclusion

So what have we learned here? We've learned that using a sliding window with a window size of 10% of the text does smooth out the sentiment curve pretty well, so that you can make out the arc of the story. We've learned that we can sometimes make out the kinds of curves that Vonnegut talks about in stories - as in the cases of Cinderella, The Hobbit, and Jane Eyre - but not always - as in the cases of Hamlet and Kafka.

I've got several ideas about how to extend this work. As mentioned earlier, it's probably worth exploring some other sentiment analysis tools, to see if I can get sentiment scores that line up better with intuition. I'd also be interested in doing Fourier analysis of these story plots, to see what kinds of cycles there are, and whether stories can be modelled by a pair of sinusoids.

I'm also interested in creating "shape of story" plots for a wider range of stories. One that I think would be really interesting is the web serial Worm by J.C. McCrae. Worm is a whopper of a book, at 6,000 pages if it were physically printed. It’s also really grim, and it somehow manages to keep getting grimmer as the book progresses. Does this show up in the plot?