So I am preparing to teach quantitative analysis of social media data using R, the open source language for statistical programming. I usually do anything code-related in Emacs, because I already know how to use Emacs and you can do everything code-related in Emacs and I don’t want to install and learn the quirks of loads of different IDEs. But that argument won’t make sense from the point of view of my students, firstly because they won’t need to do everything code-related, they’ll just need to create R notebooks, and secondly because they don’t already know how to use Emacs, and learning how to use Emacs is hard because Emacs is weird.
If you’re an Emacs user and you don’t believe me, then just imagine using Vim because that’s how weird Emacs is to someone who isn’t an Emacs user. And if you’re a Vim user and you’re feeling all superior, then try reading the preceding paragraph again after switching every mention of Emacs to a mention of Vim because the same point applies. Both Emacs and Vim are very difficult to learn because neither of them makes any sense from the point of view of someone who doesn’t already know how to use it: these days, people come to software applications with expectations formed from their use of other software applications, and neither Emacs nor Vim has an interface that works quite like any other software application’s interface. This means that there’s a steep learning curve. The payoff is that, once you’ve got the hang of Emacs or Vim, you’ll never need to learn anything else for your coding-related requirements. But not everybody needs that payoff.
Enter RStudio. RStudio is a dedicated open source IDE for R, and it has built-in support for ‘notebooks,’ which are documents that enable you to combine the code in which you do your analysis with the text in which you write up that analysis. What happens – or is supposed to happen – is that every time you save your .Rmd (R Markdown) file, RStudio first compiles (‘knits’) it into an .md (regular Markdown) file using an open source tool called knitr and then compiles the .md file into an HTML page using another open source tool called Pandoc, and the really neat thing is that it does all this for you, supposedly without your having to think about it. Also, RStudio is relatively easy to learn to use, because its interface is more like the interfaces of other contemporary software applications than the interfaces of Emacs and Vim. Hence my decision to teach my students with RStudio.
In preparation for all this neatness, I’ve switched to using RStudio for my own research – and moreover, to using it on one of the Windows PCs that my employer provides, because that’s what the students will be using. And on the face of it, RStudio is pretty darn terrific. There’s a window for your notebook (or script, or whatever), and when you run code that creates a table or a chart, it appears in the notebook itself, right below the chunk of code. There’s also a window showing the environment (i.e. all the variables and functions that you have defined) and the history (i.e. a chronological list of all the commands that have been executed), a window for help text and for display of images outside of the notebook, and a console window, which is where things actually happen: when you run the code in your notebook, what it actually does is to send that code to the console, line by line, where it runs just as if you’d typed it there. You can also try out lines of code in the console, then put them into your notebook via the history window if you like the result. It’s a lot like using Emacs Speaks Statistics, except not in Emacs.
The results of my little experiment have been mixed. I’ve got some work done that I’m relatively satisfied with, including a piece intended to teach how opinion polls work. But RStudio – at least on this ordinary Windows PC – constantly hangs. I don’t know why it does this. It seems to have nothing to do with the memory or processing requirements of what I’m using it to ask R to do – though it seems to happen more often when nothing has been sent to or typed in at the console for a while. Maybe RStudio and R just lose touch; I don’t know. Sometimes, I’ll ask it to do a calculation as trivial as 1+1, and it will hang (yes, I have tried this and it did). After a minute or three, it might start working again. Or I might get tired of waiting and click the menu option to restart R. Eventually, a pop-up window – or sometimes a whole series of pop-up windows – will appear, telling me that the connection with R has been lost. Then, a little while later, the answer to the calculation will appear, and an instant after that, R will restart. This isn’t so bad if I’m at the beginning of a notebook, but by the end, when later calculations may depend upon the results of earlier calculations, it can mean that I need to re-run the whole thing, which again means waiting because, even when it’s not hanging, RStudio often becomes painfully sluggish for no apparent reason, drip-feeding lines of code to the console in slow, slow motion.
I also end up doing that when one of two other things that tend to happen happens: either the source window stops sending code to the R terminal altogether but the R terminal keeps working (which means that I can at least test bits of code by copy-pasting them from the source window to the terminal by hand, though that becomes inconvenient quite quickly), or I tell the source window to execute some particular chunk of code and the clock icon appears to tell me that it’s scheduled to run after some other chunk of code has finished executing (but there is no other chunk of code executing – or if there is, it’s executing without telling me that it’s executing and without my having told it to execute). So what I’ve started doing is making myself go and do something else to pass the time every time it seems to be happening.
To give you an idea of how much time I’ve wasted like that, it’s how this blogpost got written.
But that’s not all. Once I eventually got my current piece of work into some sort of near-readable form, RStudio started refusing to knit my .Rmd file into an .md file, giving up with the message ‘Error creating notebook: no lines available in input’ at the top of the source window and in most cases telling me which code chunk it had given up at with the console message ‘Quitting from lines X-Y (filename.Rmd)’. Each time this happened, I checked and tested and fiddled with the code but there was never anything wrong with any of it. Sometimes, I’d run a chunk by hand and try to save again and then it would breeze past that chunk only to give up at a different one. But it didn’t always tell me which chunk it had choked on, and sometimes it did tell me but the trick of running the chunk by hand and trying again didn’t work. I tried clearing the environment, restarting R, and re-running; I tried clearing the environment, restarting RStudio, and re-running. Same problem. Did I mention that ‘re-running’ takes maybe fifteen minutes?
Eventually, I gave up on using RStudio’s point-and-click interface and called the knitr program directly from the R console with library(knitr); knit('my-Rmd-file.Rmd'), which worked – albeit rather slowly – and proved that there really was nothing wrong with my code. It also gave me an .md file that I’ll presumably be able to compile into an HTML notebook… once I’ve figured out how to get Pandoc working on this Windows PC, that is (because RStudio was doing all that behind the scenes where I didn’t think I had to think about it). I was at the point of wondering whether to email the .md file to myself so that I could convert it to HTML on my Linux machine (hey, doesn’t that sound like a great workflow?), when I decided to try clearing the environment, closing RStudio, turning the computer off and then on again, and then re-running all the code chunks. That did the trick.
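For the record, both of the steps that RStudio normally hides can be driven from the R console. What follows is a minimal sketch, not a tested recipe for this particular PC: it assumes that the knitr and rmarkdown packages are installed and that rmarkdown can find a Pandoc binary (RStudio ships with one, so running this from RStudio’s own console is the likeliest way for that to hold). The function name knit_to_html is made up for illustration, and the result is a plain standalone HTML page rather than RStudio’s own notebook format.

```r
# Sketch only: knit an .Rmd to .md with knitr, then turn the .md into HTML with Pandoc.
# knit_to_html is a hypothetical helper name, not part of knitr or rmarkdown.
knit_to_html <- function(rmd_path) {
  stopifnot(file.exists(rmd_path), grepl("\\.[Rr]md$", rmd_path))

  # Step 1: run the code chunks and write a regular Markdown file next to the .Rmd.
  md_path <- sub("\\.[Rr]md$", ".md", rmd_path)
  knitr::knit(rmd_path, output = md_path)

  # Step 2: have Pandoc convert the Markdown file to a standalone HTML page.
  # Absolute paths avoid surprises, because pandoc_convert() changes
  # the working directory while it runs.
  md_abs <- normalizePath(md_path, mustWork = TRUE)
  html_abs <- sub("\\.md$", ".html", md_abs)
  rmarkdown::pandoc_convert(md_abs, to = "html", output = html_abs,
                            options = "--standalone")
  html_abs
}

# Usage, with the file name from above:
# knit_to_html("my-Rmd-file.Rmd")
```

If the two-step detour isn’t needed, rmarkdown::render('my-Rmd-file.Rmd') does the knitting and the Pandoc conversion in one call, using whatever output format the file’s header asks for.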
I think that this latest problem might have something to do with the size that my document has reached, because I previously hit a problem where the cursor started jumping around randomly within paragraphs towards the end of the opinion polls piece once that got beyond a certain length. Or maybe it was memory use. R and notebooks both conspire to make memory management difficult. But this was after I’d moved the most memory-intensive bits of computation out of the notebook and into separate scripts whose output was loaded by the notebook code. It might also be the two together. Perhaps the solution with the long documents problem (if there is a long documents problem and not just a memory problem) is to split the notebook up into smaller files, although I can’t see a way of recombining them in RStudio, and it wouldn’t solve the memory problem (if there is a memory problem and not just a long documents problem), and anyway, neither of the notebooks I’m talking about is a particularly long document: the one I’m working on at the moment is just over 4000 words long and the opinion polls one is just under 6500 words long, whereas most journal articles I’ve published have been around 8000 words long. And if it’s a memory problem and not (or as well as) a long documents problem then maybe the solution is to do much less computation in the notebook itself and to do much more in scripts that save their output for the notebook to load in and display – though as I’ve said, I’m already doing quite a bit of that, and to be honest it kind of defeats the object of a code notebook.
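The arrangement described above, where the heavy computation lives in a separate script and the notebook only loads the saved result, might look something like the sketch below. It uses base R’s saveRDS() and readRDS(); the file name, the function run_simulation, and the made-up polling numbers are all hypothetical stand-ins, not the code behind either notebook.

```r
# --- In a separate script (hypothetical heavy_step.R) ---
# Where the result is kept; tempdir() is used here only so the sketch runs anywhere.
cache_path <- file.path(tempdir(), "poll_simulation.rds")

# Stand-in for an expensive computation: simulate 1000 polls of 1000 respondents
# each, assuming (purely for illustration) that true support is 40%.
run_simulation <- function(n_polls = 1000, sample_size = 1000, true_support = 0.4) {
  rbinom(n_polls, size = sample_size, prob = true_support) / sample_size
}

# Only recompute when no saved result exists yet.
if (!file.exists(cache_path)) {
  set.seed(1)
  saveRDS(run_simulation(), cache_path)
}

# --- In a notebook chunk ---
# Loading the saved object is cheap compared with recomputing it.
poll_shares <- readRDS(cache_path)
summary(poll_shares)
```

The trade-off is the one already noted: the more of this you do, the less of the analysis is actually in the notebook.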
This doesn’t sound amazingly appealing, does it? RStudio is supposed to make things easier by automating the boring stuff and hiding it behind a nice point-and-click interface, but a lot of the time it just doesn’t work™.
So right now, I’m torn. Do I expect my students to put up with this crud? Or do I expect them to put up with the different crud that is having to learn Emacs – as well as Pandoc and makefiles – to do the work that RStudio was supposed to do behind the scenes where you don’t have to think about it? (On second thoughts, I don’t even know whether Windows has makefiles.) Or do I give up on both RStudio and Emacs, and have them create notebooks in Jupyter? Because Jupyter has its own headaches – in particular, that you can’t inspect objects if you’re writing a script rather than a notebook, and that there’s no console if you’re on Windows, and that it’s actually really awkward to do everything inside a browser window, and that Jupyter’s reference management plugin can’t handle page numbers, and that the format in which it saves notebooks creates severe complexities for version control. (Not that I’ve managed to get RStudio’s version control integration working yet, either. The instructions suggest that a particular dialogue will appear, but it doesn’t appear.)
That doesn’t sound amazingly appealing either, does it? Nothing I’ve mentioned here would sound even remotely appealing to a reasonable person, as opposed to the sort of masochist who makes things difficult for himself or herself in order to prove how serious he or she is. (Moi?) Thinking about everything in terms of how I’m going to teach it to students (possibly across a language barrier) is making me keenly aware of what a vast chain of small and irritating obstacles I’ve had to overcome to be able to do the sort of research that I now do. Having overcome those obstacles, I keep clambering forward along the chain of new obstacles because, having come this far, I can hardly give up. But my students are somehow going to have to leap the whole lot in a single bound. Which means that I’m going to somehow have to coach them to do that.
The trouble with anything code-related and open source (which includes not only RStudio, but also Jupyter, Vim, Emacs, and R itself and all its packages) is the implicit assumption that anyone worthy of using it is a software engineer at heart. The result is that nothing that works properly is easy to use and that nothing that is easy to use works properly – and that an awful lot of things barely work at all and are monstrously difficult to interact with (in addition to being woefully under-documented to boot). But that apparently doesn’t matter, because – as a software engineer at heart – you will happily solve the problems that arise by yourself (because obviously, you have nothing better to do this week). And if not, then I suggest that you ask for your money back (ha ha, very amusing).
I’ll tolerate this crud when it’s just me and my research, but when you’re teaching a class, it’s a very different matter. If I’ve got 20 students trying to follow along with me, but every two minutes, somebody’s IDE stops working for no apparent reason, then what am I supposed to do? (Protip: saying ‘I suggest that you ask for your money back’ is a really bad idea when the person that you’re talking to is paying your employer for the privilege. Furthermore: regularly interrupting a class to sort out problems on individual students’ computers could be an equally bad idea when all the other students who are sitting around waiting for the class to resume are also paying your employer for the privilege.)
A way forward will eventually become apparent. But solving problems of this nature isn’t a good use of my time – and I’m feeling less and less inclined to avoid the conclusion that it would be better for my employer to pay for commercial software that was designed to be used by people who have other things to do besides getting it to work. MATLAB or Mathematica, for instance. But – thanks to the macho ideology of open source and its disdain for anyone who can’t (or doesn’t always have time to) deal with endless and very tedious technical problems – some people can get very sniffy about those, hence borderline-unusable software becoming standard not only in academia but in industry as well. And a large part of the point of the course I’m going to be teaching is going to be employability skills, which means that there’s no point teaching students to use something that will get them sneered at in industry. N.B. by ‘industry’, I mean the bit of industry that does social media analytics. Engineers other than software engineers have more sense than to sneer at people for using Mathematica just because it isn’t open source.
I’ve ranted about this kind of thing before with regard to open source typesetting software, but things are just as bad pretty much everywhere (see e.g. this amusing rant about web app deployment). The culture of free but utterly ramshackle software is underpinned by a profoundly counterproductive elitism. The implicit message is always ‘If you can’t – or aren’t willing to – spend hours, days, and months getting our beautiful gift to the world to work despite all the problems with it that we couldn’t be bothered to fix, then it’s high time you got back to something more suited to your abilities – like cowering under a rock and eating mould, you worthless pleb.’
I think I know the way around this particular problem. It involves prioritising what works properly over what’s easy to use, using the university’s Windows machines only in order to remotely access its high performance computing system so that we can work (albeit at one remove) under an operating system that is horribly difficult to use but that doesn’t place quite so many obstacles in our path (i.e. Linux), forgetting all about RStudio, and figuring out how to teach students who don’t all speak English particularly confidently what C-x C-s and M-0 C-k mean before I can even start teaching them how to analyse data.
Nut, meet sledgehammer.
But a part of me will still be wondering what coding might be like if only the easy stuff wasn’t quite so hard.