Some Analysis of All Hacker News Evergreen Stories

Introduction

At Contextly, we build engagement tools that help publishers grow high-value, loyal audiences. One of the ways we provide value to a publisher is by automatically detecting older stories that are still valuable to readers and including these stories in our recommendations. We call these stories “evergreens”.

Although we can detect and surface such stories, describing their value in terms of page views leaves something to be desired.

We would like to describe the value of evergreen stories in a more compelling way. A better description would be one that moves us closer to understanding the economic value of stories, especially the economic value to publishers and readers.

The Hacker News dataset is well suited for this purpose, so we use it as a case study about the value of evergreen stories. This dataset has features that permit compelling descriptions of value because they incorporate the concept of opportunity cost. We use these descriptions to demonstrate that, on average, the value readers get from evergreen stories is greater than the value they get from non-evergreen stories. This conclusion, if broadly applicable, has implications for publishers and readers.

What is an Evergreen Story?

Conceptually, an evergreen story is any story that provides value to readers well after its publication date. The following definition was applied to the Hacker News dataset:

An evergreen story is any story where the difference between the submission date of the story and the publication date of the story is two years or more. The publication date of the story is indicated in the story’s title by using the annotation “(YYYY)”, e.g. “The WorldWideWeb application is now available as an alpha release (1991)” by Tim Berners-Lee.
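As a concrete sketch, this check can be implemented by parsing the trailing “(YYYY)” annotation out of the title. The function name is illustrative, and the year-only comparison is an assumption; the original analysis may have compared exact dates:

```python
import re
from datetime import datetime

# Matches a trailing publication-year annotation such as "(1991)".
YEAR_ANNOTATION = re.compile(r"\((\d{4})\)\s*$")

def is_evergreen(title: str, submitted_at: datetime, min_age_years: int = 2) -> bool:
    """Return True if the title carries a "(YYYY)" annotation dated at
    least min_age_years before the submission date."""
    match = YEAR_ANNOTATION.search(title)
    if not match:
        return False  # no annotation: treated as non-evergreen
    publication_year = int(match.group(1))
    return submitted_at.year - publication_year >= min_age_years
```

Titles without the annotation are treated as non-evergreen, which matches the definition above but also means unlabeled evergreens go undetected.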

The Hacker News Dataset

The Hacker News Firebase API was used to collect all Hacker News posts through November 7th, 2014. There were 8,569,254 posts of types “job”, “story”, “comment”, “poll”, or “pollopt”. Of those posts, 1,544,661 of type “story” were submitted by the Hacker News community. Of those stories, 6,826 were identified as evergreen according to the definition above. If you would like to see these evergreen stories, you can find them all here.

Data definitions and pretty print examples of the API response can be found on Github.
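For illustration, a post can be fetched from the public Firebase endpoint and the collected posts tallied by their “type” field. The helper names are assumptions, and error handling and rate limiting are omitted:

```python
import json
from collections import Counter
from urllib.request import urlopen

# Public Hacker News Firebase API endpoint for a single item.
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_item(item_id: int) -> dict:
    """Fetch a single post (story, comment, job, poll, or pollopt) by id."""
    with urlopen(HN_ITEM_URL.format(item_id)) as resp:
        return json.load(resp)

def tally_types(items) -> Counter:
    """Count posts by their "type" field, skipping missing/deleted items."""
    return Counter(item.get("type") for item in items if item)
```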

The oldest evergreen story is Equatorie of the Planetis (1393). The highest-scoring evergreen story is Forgotten Employee (2002), with a score of 746.

The data was divided into two groups: i) evergreen stories, ii) non-evergreen stories. The data was then bucketed by month.
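The bucketing step can be sketched as follows, assuming each story is a dict carrying the Unix “time” field the API returns plus a precomputed boolean “evergreen” flag (the flag name is an assumption):

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_by_month(stories):
    """Group stories into (year, month, evergreen_flag) buckets."""
    buckets = defaultdict(list)
    for story in stories:
        when = datetime.fromtimestamp(story["time"], tz=timezone.utc)
        buckets[(when.year, when.month, story["evergreen"])].append(story)
    return buckets
```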

The number of evergreen stories was counted for each month:

[Figure: evergreen stories submitted per month]

Notice there is a sharp increase in the number of evergreen stories starting around December 2012. Another interpretation is that the increase reflects broader adoption of the annotation “(YYYY)”, i.e. more evergreen stories were labeled.

Similarly, the number of non-evergreen stories was counted for each month:

[Figure: non-evergreen stories submitted per month]

Problem Setup and Assumptions

For each month, the stories were divided into two groups, evergreen stories and non-evergreen stories. We are interested in three measurements associated with each story: score, number of comments, and the sum of the lengths of comments.

For a given month, consider the collection of scores of evergreen stories (or non-evergreen stories). It is reasonable to assume that the observed collection of scores is only one of many ways the scores might have occurred, i.e., the scores could have taken values different from those we observe. Rephrased as a thought experiment: if we could repeat the story submissions for a given month many times, we would expect the scores to vary from one attempt to the next. Let’s formalize this concept.

In the analysis that follows, we treat each measurement as a realization of a random variable. For example, in a given month, let’s say there are n evergreen stories, each with a score. We view the collection of scores in a given month as being generated by a sequence, X_1, …, X_n, of independent random variables.

Three Descriptions of Value

Three descriptions of value were constructed, one for each measurement type: score, number of comments, and the sum of the lengths of comments.

First Description of Value: Score

One way to determine the value of a story is to simply ask people what they think of it and summarize the results. On Hacker News this can be done by voting, which leads to each story having a score.

The mean of the scores was calculated for evergreen stories (and non-evergreen stories) submitted in a given month:

[Figure: mean score per month for evergreen and non-evergreen stories, with 95% confidence intervals]

For each month, 95% confidence intervals for the mean were added by calculating the percentile bootstrap. This gives us three data points for evergreen stories and three data points for non-evergreen stories for each month: i) empirical mean, ii) lower bound of the confidence interval, iii) upper bound of the confidence interval. These points were then interpolated and smoothed over time using the function smooth.spline in R.
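The percentile bootstrap for the mean can be sketched as follows (the original intervals were computed in R; the resample count and helper name here are illustrative):

```python
import random
import statistics

def percentile_bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean: resample
    with replacement, compute the mean of each resample, and take the
    alpha/2 and 1 - alpha/2 quantiles of those resample means."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Unlike a normal-approximation interval, the percentile bootstrap makes no symmetry assumption, which suits heavily skewed quantities like scores and comment counts.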

Over the past two years, the 95% confidence intervals (dashed lines) are disjoint, with the actual mean for evergreens likely occurring between ~18 and ~32 and the actual mean for non-evergreens likely occurring in a narrow range centered at ~10. Notice the empirical mean for evergreens (~25) is more than double the empirical mean for non-evergreens (~10).

One is left to wonder whether the confidence intervals in the period prior to December 2012 overlap because the annotation “(YYYY)” was used less frequently than it is after December 2012. This would result in many evergreens being mislabeled as non-evergreen.

Using the score of each story to define value seems like an obvious thing to do as a first attempt, but maybe we can do better.

Second Description of Value: Number of Comments

Each comment on Hacker News requires some time from the user. A typical workflow for a user might be something like read the story, think about it, read some comments, post a comment, read responses to the comment, etc. This workflow is accompanied by a significant time commitment. That time could have been used doing other things. Those things, too, could have provided value to the user. However, all other opportunities immediately available to the user were forgone; the user chose to submit a comment on Hacker News. This notion of opportunity cost moves us closer to the economic value of a story.

The mean number of comments was calculated for evergreen stories (and non-evergreen stories) submitted in a given month:

[Figure: mean number of comments per month for evergreen and non-evergreen stories, with 95% confidence intervals]

Again, over the past two years, the 95% confidence intervals are disjoint, with the actual mean number of comments for evergreens likely occurring between ~7 and ~17 and the actual mean number of comments for non-evergreens likely occurring in a narrow range centered at ~5. The empirical mean number of comments for evergreens (~12) is more than double the empirical mean number of comments for non-evergreens (~5).

Although appealing, we can still improve on this description.

Third Description of Value: The Sum of the Lengths of Comments

In the previous description, all comments were considered to be equal. That is, the variation in the time commitment across comments was not taken into account. If one is willing to accept comment length as a proxy for the time commitment of a comment, we can account for the variation in the time commitment across comments by proxy.

The lengths of all comments were summed for each story. This gives us the total length of comments for each story. The mean of the total length of comments was calculated for evergreen stories (and non-evergreen stories) submitted in a given month:
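The per-story total can be sketched as a short sum, assuming each comment is a dict with a “text” field as returned by the API (deleted comments may lack the field entirely):

```python
def total_comment_length(comments) -> int:
    """Sum the character lengths of a story's comment texts, treating
    comments without a "text" field as contributing zero length."""
    return sum(len(c.get("text", "")) for c in comments)
```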

[Figure: mean total length of comments per month for evergreen and non-evergreen stories, with 95% confidence intervals]

Over the past two years, the 95% confidence intervals are disjoint, with the actual mean total length of comments for evergreens likely occurring between ~3000 and ~8000 and the actual mean total length of comments for non-evergreens likely occurring in a narrow range centered at ~1500. The empirical mean total length of comments for evergreens (~5000) is more than triple the empirical mean total length of comments for non-evergreens (~1500).

Conclusion

Three descriptions of value were constructed. Two of them incorporate the concept of opportunity cost. The mean value of evergreen stories was compared to the mean value of non-evergreen stories for each month. The analysis provided here supports the claim that the mean value of evergreen stories is greater than the mean value of non-evergreen stories, at least for the Hacker News community over the last two years.

How is the use case for Hacker News similar to the use case for publishers in general? One of the key characteristics of evergreen stories is that they have withstood the test of time; after years, many of these stories are still interesting and informative. This characteristic is not limited to evergreen stories on Hacker News. Many publishers have stories that readers would find interesting and informative if only they were not buried in their archives.

One might argue that the curation aspects of Hacker News act as a strong filter on quality, differentiating the case for Hacker News from the case for publishers. But publishers’ stories are also subject to curation by communities of interest in places like Twitter.

Although the descriptions of value were constructed from the readers’ point of view, it is easy to extend these notions of value to the publisher by recognizing that the publisher is the one who provides the story.

There is significant value in evergreen stories for both readers and publishers. The question becomes how to identify and then distribute these stories to readers at times when they will provide the most value. The answer likely lies in the intersection of human curation and discovery and recommendation algorithms.

I would like to thank Jay Buckingham for reviewing this post and for his helpful feedback.