Ophan: Inside the Guardian's data-driven newsroom

The Guardian newspaper has come a long way since its foundation in 1821. The U.K.-based publication was originally known as The Manchester Guardian, owing to the city where it was then headquartered, and promised to "zealously enforce the principles of civil and religious Liberty ... warmly advocate the cause of Reform and ... endeavour to assist in the diffusion of just principles of Political Economy."

Almost 200 years on, and the Guardian has been pushing to transform from a traditional print-based national title into an international digital brand. The newspaper was elevated into the world's consciousness two years ago, when NSA contractor Edward Snowden leaked details of the U.S. government's surveillance programs to Guardian reporter Glenn Greenwald.

Shortly after the first Snowden revelations came to light, the Guardian replaced its separate web domains for the U.K., U.S., and Australia with a single .com address as part of its mission to create a single, global online identity.

Of its £215 million ($335 million) in revenue, £80 million ($125 million) comes from digital -- a 20 percent rise on the previous year. Today, the Guardian claims more visitors from the U.S. than it does from its native U.K.; of its 120 million unique monthly visitors, more than half come from mobile.

As with all modern publications, the need to make sense of its traffic and the deluge of data it generates is imperative, and the Guardian uses a number of tools to do just that. One of these is Ophan, an analytics tool built entirely in-house to serve the company's own needs.

VentureBeat met up with the Guardian's Chris Moran, digital audience editor, and Graham Tackley, director of architecture, to get the lowdown on how Ophan came to be, and what role it plays in its data-driven newsroom.

Ophan: In the beginning

The origins of Ophan can be traced back three years to a hack day hosted within the Guardian's digital development team. Tackley built a prototype real-time analytics system for tracking the Guardian's articles, and Moran liked the idea so much that he pleaded with him to keep it going.

"It ran for a few months on my desktop, so I tried to do my job while this thing was running in the background," explained Tackley. "Chris was becoming increasingly reliant on it, as were others around the Guardian who'd started to use it."

But the Guardian already had analytics at its disposal. Prior to Ophan, the digital division used Omniture, a web analytics platform acquired by Adobe back in 2009 -- it actually still uses Omniture, in addition to Ophan. But Omniture is used chiefly for longer-tail tracking, while the Guardian was looking for something more immediate and real-time that offered more granular data on which stories were doing well right now on its website.

Ophan data has a shelf life of around seven days before it disappears. "Omniture is still extremely useful, especially for handling larger datasets," said Moran. "A lot of the genesis of Ophan is to make it something that Omniture is not." In other words, it's a complementary tool, not a replacement.

There were existing options the Guardian could have turned to before creating Ophan. Chartbeat, for example, has carved out a solid reputation in the web analytics space, and this was something the Guardian actually trialed. But "internal organization issues" prevented the company from investing in it -- it was looking for something that everybody could use if they wanted, one that wasn't restricted by licenses.

"Chartbeat is brilliant," said Moran. "But I bet you wouldn't find it being used by more than 50 people [in an organization]. Partly because of logins -- what happens is people buy one and share it about, but even if you're sharing a login, there's something there that's stopping people from getting to the data."

This is actually a drawback to any third-party tool. The more logins you want, the more (generally) you pay, which means only the "important" people get to use it, such as those heading up the digital teams. What the Guardian wanted was for anyone -- journalist, sub-editor, or marketing intern -- to be able to log in and see what's happening with the online traffic. All that's needed is a Guardian email address.

The Guardian has full, unbridled control over Ophan -- because it's theirs. Anyone can get into it. And that's why it claims 900 people from the Guardian use Ophan each month.

"What's critical is that rather than people coming to me as the gatekeeper of all data, and saying, 'Chris, when is the best time to launch this article,' I can say 'what's your gut instinct, do that, and look at Ophan and learn something from it'," said Moran.

Ophan data is used differently depending on who's accessing it. The Guardian's social team can analyze how it could promote a story better; for the journalist or editor, it's more about the conversation going on around an article -- errors, missed points, misleading comments, and so on.

Inside Ophan

The Guardian development team has created its own tracking JavaScript for Ophan, and it's all housed within Amazon Web Services (AWS).

Ophan has been a perennial work-in-progress, with bits tacked on and integrated as required based on the evolving needs of different departments. One of the first things that Tackley built into Ophan was "Recently Published," a move designed in part to help inform and justify what stories made it onto the Guardian's online front page.

Before Ophan, the process of deciding what stories made the Guardian's front page involved group emails sent by the broader network of editors, explaining why a specific piece of theirs deserved to be positioned front-and-center.

"One thing you can guarantee all poorly performing pieces of content have in common is that we didn't 'sell it' ourselves," said Moran. "We just let it die. People don't magically find URLs for stories, we have to give that content a chance, and we've got to push it somewhere."

And this is the role Ophan fulfills -- it's about giving an article the chance of performing well. Across the U.S., U.K., and Australia, the Guardian publishes between 400 and 500 pieces of content each day, so it can be helpful to have the data to back up some of its decisions, particularly about what goes on the front page. Recently Published displays the latest 50 articles that have been posted, and it can be useful to know where an article has already been shared (e.g. Facebook, Twitter, front page) to let editors make a call on what should happen next.

So, by glancing down the list of recently published articles, you could see a piece that's been shared to Facebook and is getting crazy traffic, but hasn't yet made it to the front page of the Guardian's website. This may catch the attention of the front page editor, who will make a call on whether it deserves to be on there.
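As a rough illustration only, that kind of glance down the list could be approximated by a simple filter. The sketch below is not Ophan's code: the `Article` fields, the `front_page_candidates` helper, the view threshold, and the sample stories are all hypothetical, invented to show the idea of surfacing busy articles that have not yet reached the front page.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical record; Ophan's real data model is not described in the article.
    title: str
    page_views: int
    shared_on: set = field(default_factory=set)  # e.g. {"facebook", "front_page"}


def front_page_candidates(articles, min_views):
    """Return busy articles that are not yet on the front page, busiest first."""
    hits = [
        a for a in articles
        if a.page_views >= min_views and "front_page" not in a.shared_on
    ]
    return sorted(hits, key=lambda a: a.page_views, reverse=True)


# Invented sample data, for illustration only.
recent = [
    Article("Story A", 12000, {"facebook"}),
    Article("Story B", 15000, {"front_page", "twitter"}),
    Article("Story C", 300, {"twitter"}),
    Article("Story D", 20000, {"facebook", "twitter"}),
]

candidates = front_page_candidates(recent, min_views=5000)
candidate_titles = [a.title for a in candidates]
```

A filter like this only produces a shortlist; whether anything on it belongs on the front page would still be an editor's call.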

And this is where the Guardian is keen to distinguish between algorithms and human editing. Just because a story is killing it on social networks doesn't necessarily make it a good fit for the Guardian's front page. They're careful not to automate the whole process -- they want data, but they want humans making the final decisions. It's this that helps prevent a cute story about a cat licking cream off a dog's nose (who knows, it could happen) being elevated above, say, a major terrorism story or a natural disaster -- one is front-page news material, the other isn't.

Conversely, if a story on the front page is doing badly there, but doing well on Twitter, they can look at why that might be the case. Perhaps a picture is positioned incorrectly or there's some other minor element that can be tweaked to improve engagement. It's about helping them optimize a post for the platform it's on, using the data provided by Ophan.

Ophan basically helps inform how content is shared and when. For example, "Comment" pieces often work well on Facebook, better than breaking news. "With breaking news, you've probably only got an hour or so where it will be likely that anyone will share it," said Moran. "But with a comment piece, that can suddenly open up."

So rather than just sharing a breaking news piece as a matter of habit, they may hold on until a comment piece related to the news story comes up and share that instead. They would still link to the news from the comment piece anyway, but it's just a more efficient and effective way of sharing the same core story.

Page views vs. attention vs. social shares

Recently Published is color-coded by traffic referral source -- for example, green is Google and aquamarine is Twitter. The main metric that's used here to establish popularity is page views, a measure that isn't without its flaws, but it gives a broad indication of whether a piece "resonates." Plus, for Guardian staff, it helps keep things simple, because page views are an easy concept to grasp. Elsewhere, Ophan provides far more granular data for those who need it. For example, it shows where a reader went after each article on the Guardian website, and, crucially, it shows how long people paid attention to an article.

Social signals, such as a Facebook like or a tweet, can indicate user engagement, but they can be misleading too.

"We used to be in a position where lots of our writers would judge popularity of a piece by the number of retweets it got," explained Moran. "And increasingly we realized that Twitter is not the Internet, it's just a weird, slightly warped version of it because of the audience."

The same thing applies to Facebook. Often the number of shares or likes really doesn't tally with the actual number of people who clicked to read it. "When we posted the death of Nelson Mandela news piece on Facebook, it got 40,000 likes in an hour, but we looked at Ophan, and it translated into 10,000 clicks," said Moran. "That tells you something -- three out of four of those likes is not engagement, it's 'I just want people to know that I know that he's dead.'"

So page views serve as a counterweight to other data -- it's not either/or.
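The arithmetic behind Moran's "three out of four" can be written out in a few lines. This is a minimal sketch using only the two figures he quotes; the function name `click_through_ratio` is made up for illustration, and the calculation assumes, as his remark does, that likes and clicks can be compared one-to-one.

```python
def click_through_ratio(likes: int, clicks: int) -> float:
    """Clicks per like; a simplification that treats every click as coming from a liker."""
    if likes <= 0:
        raise ValueError("likes must be positive")
    return clicks / likes


# Figures quoted by Moran for the Nelson Mandela story on Facebook.
likes, clicks = 40_000, 10_000

ratio = click_through_ratio(likes, clicks)  # one click for every four likes
no_click_share = 1 - ratio                  # the share of likes with no matching click
```

With these numbers the ratio is 0.25, which leaves 0.75 of the likes unaccounted for by clicks, hence three out of four.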

Median attention time is for those not happy with simple page view data or tweet-counts, and many argue that how engaged a person is with the article is a more important metric than a click. But attention time can be slippery, too.

For example, an article on China banning puns performed immensely well from a traffic and social perspective. However, the median attention time on the article was a mere seven seconds, which suggests there's something not quite right with the article -- did it have a clickbait title that didn't deliver once clicked? Or was it just plain boring?

Looking at other data in Ophan, the digital team could see that Reddit was playing a core part in the article's virality. And it could drill down even further by source, and see that the attention time for Reddit specifically was even lower than seven seconds. But looking at Facebook, it was 43 seconds, and for the Guardian homepage it was even higher. So in effect, the broader attention-time data was saying less about the quality of the article than it was about the interest Reddit readers had in the topic of the article.
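To see how one busy referrer can drag down an overall median, consider the sketch below. It is not Ophan's implementation, and the visit data is invented: the individual values were chosen only so that the result mirrors the pattern described above (an overall median of seven seconds, Reddit below that, Facebook at 43 seconds, the homepage higher still). The `median_by_referrer` helper is likewise hypothetical.

```python
from collections import defaultdict
from statistics import median

# (referrer, attention time in seconds). Invented sample, for illustration only.
visits = [
    ("reddit", 3), ("reddit", 4), ("reddit", 5), ("reddit", 5),
    ("reddit", 6), ("reddit", 7), ("reddit", 7), ("reddit", 8),
    ("facebook", 38), ("facebook", 43), ("facebook", 51),
    ("homepage", 55), ("homepage", 62),
]


def median_by_referrer(visits):
    """Group attention times by referrer and take the median of each group."""
    by_source = defaultdict(list)
    for referrer, seconds in visits:
        by_source[referrer].append(seconds)
    return {referrer: median(values) for referrer, values in by_source.items()}


overall = median(seconds for _, seconds in visits)
per_source = median_by_referrer(visits)
```

Because Reddit supplies most of the visits in this sample, the overall median sits at 7 seconds even though Facebook readers (43) and homepage readers (58.5) stayed far longer; the Reddit-only median is 5.5.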

Ophan offers a range of features geared towards specific divisions, including the top Google search terms that lead to an article from the search engine, while there's a section dedicated to what people are saying about a story on Twitter. This lets editors and the digital team compare their own tweets with tweets from the public -- is there anything they could've done better themselves? Is there a mistake somewhere in the article?

"A lot of what's in Ophan right now is very specific to the Guardian, there's a lot of stuff that makes a lot of sense to us," explained Tackley. "Lots of people have been asking about licensing it from us, but at the moment we're not inclined to do that -- there are a number of really good products on the market already."

Such products include the likes of Chartbeat, Parse.ly, and others that either weren't available or weren't as developed when work first started on Ophan.
