Big data and the study of reading

By Daniel Allington and Andrew Salway

[Cross-posted; comments are disabled here but enabled on the original.]

We’re really looking forward to running the workshop on big data and digital reading on 6 March 2014. Here is your required reading… just kidding, but we’ve selected two discussion pieces that we think could be interesting to talk about, so if you could have a look at them ahead of the workshop and post any initial thoughts below, that would be brilliant.

Both of the pieces relate to the first two words of our title.

What is ‘big data’? Or: how big is ‘big’?

To a computer scientist, ‘big data’ has a fairly precise meaning: it refers to any dataset that exceeds the limits of commonly used tools. Clearly, that’s a moving target, as the capabilities of software and hardware are constantly being expanded. Nonetheless, ‘big data’ is a meaningful concept for the scientists and programmers who must work around the limits of current technology to cope with the staggering volumes of data now being produced.

If we were to adapt the computer scientist’s definition to the tools used in the humanities and social sciences, ‘big’ data would probably be something much smaller. In English studies, for example, the most commonly used ‘tool’ is the technique of close reading, so one could arguably use the words ‘big data’ in reference to any text or collection of texts too large for an individual researcher to subject to close analysis. As has been observed in a discussion of the various species of ‘big data’ hype, ‘[s]ince humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”’ (Underwood, 2013) In sociology, by comparison, the primary quantitative research tool has been the sample survey, generally limited to a few hundred or thousand respondents – so one might perhaps be inclined to start calling sociological data ‘big’ once it includes records for hundreds of thousands of individuals.

But such numbers are not what the words ‘big data’ usually call to mind. In practice, ‘big data’ tends to mean a ‘data-driven’ rather than ‘hypothesis-driven’ approach to research: instead of formulating hypotheses and then testing them against data collected for their relevance to those hypotheses, one starts with a dataset and then identifies patterns within it, often through the use of ‘machine learning’ algorithms. This is sometimes described as ‘inductive’ rather than ‘deductive’ research, which probably sounds more radical from the point of view of a scientist than from that of a humanist – although it is necessarily quantitative, and as such quite alien to much traditional humanist research, especially in literary studies.

‘Big data’ as an approach to social scientific research

The first discussion piece we’d like to draw your attention to is a talk on ‘What Big Data Means for Social Science’ by the economist Sendhil Mullainathan (2013). This is available in both video and transcript form at the following address:

In the main part of his talk, Mullainathan argues that by collecting as much data as possible – including data that we have no specific reason to believe relevant – and then searching algorithmically for the variables, or combinations of variables, that predict the outcomes we are interested in, we can avoid being led up the garden path by our hypotheses. In the context of Mullainathan’s main (imaginary) example – a false medical hypothesis that appears to be supported by data only because the data collected was too narrow to show that something unexpected was going on – this seems highly convincing, although the talk as a whole could be regarded as something of a ‘sales pitch’ for data-driven research. By contrast, the position Mullainathan takes in response to audience questions is far more nuanced, and shows keen awareness of the problems of the approach he has been espousing.

In this context, it should be observed that the purely statistical approach to linguistics which Mullainathan offers as a real-world model for the scientific use of big data from 06:01 to 07:05 is highly controversial. It has been very successful in answering certain kinds of questions (see Halevy, Norvig, and Pereira, 2009), but because of its indifference to the meaning of the behaviour it models, linguists such as Noam Chomsky have argued that this should not be considered success in any ‘sense that science has ever been interested in’ (Pinker and Chomsky 2011, parag. 2). At 25:56, Mullainathan is challenged by the philosopher Daniel Dennett, who makes the suggestion (related to Chomsky’s argument) that the ‘big data’ approach may yield predictions, but not understanding; at 31:44, he is challenged by the psychologist Daniel Kahneman, who argues that a variable with little predictive power may still be relevant and interesting. From 29:14 to 30:40, Mullainathan argues that there are limits to the ‘big data’ approach which cannot be overcome simply through more data or more computing power, and that theory-driven hypothesis testing will always be required.
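A toy example may help to show what ‘prediction without understanding’ means here. The sketch below is our own (real statistical language models are trained on billions of words, not one invented sentence): it predicts the next word purely from bigram counts, with no representation of grammar or meaning whatsoever – and yet, on its own tiny training data, it predicts successfully.

```python
from collections import Counter, defaultdict

# A purely statistical 'model' of language: bigram counts,
# with no grammar and no meaning anywhere in the system.
training = "the cat sat on the mat and the cat slept near the cat".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(training, training[1:]):
    bigrams[prev][nxt] += 1

def predict(word):
    """Return the continuation most frequently seen after 'word' in training."""
    return bigrams[word].most_common(1)[0][0]

print(predict("the"))  # → 'cat' (seen three times, versus 'mat' once)
```

In Dennett’s terms, the model ‘works’ as prediction while telling us nothing about why ‘the cat sat’ is English and ‘cat the sat’ is not – which is precisely Chomsky’s complaint.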

‘Big data’ as an approach to humanist research

The second discussion piece is ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’ by historian Tim Hitchcock (2013a), a talk subsequently published in textual form on Hitchcock’s blog. It is available at the following address:

Hitchcock is the lead academic behind a very large humanist research project: Old Bailey Online. He argues both that the potential for close reading of historical documents has been greatly extended by the availability of linked data, and that visualisations of quantitative data can help us to see individual events in terms of long-term trends. But he also asks whether ‘our practice as humanists and historians is being driven by the technology, rather than being served by it’ (2013a, parag. 72) due to a tendency to ‘ask questions we know that computers can answer’ (parag. 71). He goes on as follows:

in choosing to move towards a ‘big data’ approach… and in adopting the forms of representation and analysis that come with big data, all of us are naturally being pushed subtly towards a kind of social science, and a kind of positivism, which has been profoundly out of favour for at least the last thirty years. (Hitchcock 2013a, parag. 70)

Note that in a comment on Brian Lennon’s (2013) discussion of this piece, Hitchcock (2013b) agrees that this critique of the digital humanist version of ‘big data’ could be taken further. It should also be recognised that similar points to Hitchcock’s have been made in critiques of humanist positivism that do not directly tie its rise to the notion of ‘big data’ (see, for example, Eyers 2013).

‘Big data’ and reading

The internet provides unprecedented opportunities for gathering very large datasets about reading – digital and otherwise. As such, there is the potential for an apparent revolution in reader study, comparable to the explosion of qualitative audience research that followed the realisation that ‘naturally occurring’ data on media consumption practices was easily accessible via the internet. But in view of Hitchcock’s observations about positivism, there is perhaps a still greater need to guard against the assumption that online data can, as one scholar astutely put it, ‘unproblematically unveil those cultural processes and mechanisms which cultural studies has been positing’ (Hills 2002, p. 175). There are many reasons for this, including the fact that, as Jen Schradie has shown, internet research leaves working-class people under-represented because ‘people with lower levels of income and education are not accessing or creating online content nearly as much as people with a college degree and a comfortable middle-class lifestyle.’ (2013, parag. 5; see also Schradie, 2011) This means that when we study a phenomenon – reading, say, or political activism – through analysis of internet ‘big data’, we may in fact be studying only an elite variant of that phenomenon. For example, Schradie’s research on 35 political organisations in the US found that ‘[t]hree of the most active… offline have virtually no online presence’, while ‘[o]ne… does not come up on Google searches’ and ‘[h]alf… are not active on Twitter.’ (2013, parag. 10) Such disparities will of course be multiplied in an international context, given disparities in internet access. And they are paralleled by disparities in the ability to carry out ‘big data’ research, which, as Ben Williamson has observed, ‘concentrate[s] data analysis and knowledge production in a few highly resourced research centres, including the R&D labs of corporate technology companies.’ (2014, parag. 10)

Indeed, long before the term ‘big data’ became a political, commercial, and academic buzzword, it was recognised that the industrial creation and analysis of vast social datasets far exceeded the capacities of scholarly research – and argued that the most urgent project for social science might therefore be one of ‘critically engaging with the extensive data sources which now exist, and not least, campaigning for access to such data where they are currently private.’ (Savage and Burrows 2007, p. 896) Online retail giant Amazon’s collection of information on the behaviour and preferences of readers can thus, for example, be seen as an opportunity for research, in that some of this information is available to members of the public, including academics – but it can also be seen as a phenomenon that in itself requires the most critical form of scholarly attention.

Ahead of the workshop…

At the workshop, we’ll be asking what ‘big data’ could bring – in every sense – to humanist and social-scientific research on digital reading. We hope that this process of discussion will start in responses to the video and essay posted above. In addition to general points about ‘big data’, it would be great to read about the themes and questions that you are interested in with regard to (digital) reading, and how these may, or may not, benefit from ‘big data’ and its associated assumptions.


Eyers, Tom (2013). ‘The perils of the “digital humanities”: new positivisms and the fate of literary theory’. Postmodern Culture 23 (2). Available online at
Halevy, Alon, Norvig, Peter, and Pereira, Fernando (2009). ‘The unreasonable effectiveness of data’. IEEE Intelligent Systems 24 (2): 8-12. Available online at
Hills, Matt (2002). Fan cultures. London and New York: Routledge.
Hitchcock, Tim (2013a). ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’. Keynote address, Reading Historical Sources in the Digital Age, 5 December, Centre virtuel de la connaissance sur l’Europe, Luxembourg. Published online, 9 December. Accessed 24 January 2014 at
Hitchcock, Tim (2013b). Comment on ‘On Digital Humanities “Surprisism”’, December. Accessed 24 January 2014 at
Lennon, Brian (2013). ‘On Digital Humanities “Surprisism”’, 11 December. Accessed 24 January 2014 at
Mullainathan, Sendhil (2013). ‘What Big Data Means for Social Science’. July, HeadCon ’13. Published online, 11 November. Accessed 24 January 2014 at
Pinker, Steven and Chomsky, Noam (2011). ‘Pinker/Chomsky Q&A from MIT150 panel’. Accessed 28 February 2014 at
Savage, Mike and Burrows, Roger (2007). ‘The coming crisis of empirical sociology’. Sociology 41 (5): 885-899.
Schradie, Jen (2011). ‘The digital production gap: the digital divide and Web 2.0 collide’. Poetics 39 (2): 145-168.
Schradie, Jen (2013). ‘Big data not big enough? How the digital divide leaves people out’. 31 July. Accessed 1 March 2014 at
Underwood, Ted (2013). ‘Against (talking about) “big data”’. 10 May. Accessed 27 January 2014 at
Williamson, Ben (2014). ‘The end of theory in digital social research?’ 20 January. Accessed 1 March 2014 at

[If you wish to comment on this piece, please go to]