R Markdown centered data analysis workflow

TLDR

A summary is here, and the code is here.

I’m a workflow fanatic. OK, fanatic may be too strong a word, but that’s what I told one of my R stars when I met her in person recently. She is also passionate (in my view) about project workflow and “generally doing basic stuff really, really well”. I can’t pinpoint how or why I became so interested in establishing a good project workflow, but it’s one of those things I really wish I had mastered while I was still in school. Maybe the lack of such training during graduate school is what led me to care about research computing education and to get involved with Software/Data Carpentry, which I wish I had known about back then [1].

My pursuit of a data analysis workflow

(also something about notebooks)

In any case, coming up with a workflow that suits most of my needs, at home or at work, has been a long journey, and I’ve written a couple of posts here about some of the “milestones”.

I haven’t written about it yet, but I have also started using Python pretty heavily at work. Since then, my interest in workflow has been about how to put together code (R/Python/SQL) and by-products (findings/plots) more efficiently in R Markdown. The idea of combining code and prose is not new at all, and notebooks are increasingly popular among data scientists for doing just that. I have used notebooks in the past too (e.g., RCloud, Jupyter Notebook), but for various reasons they never stuck with me. It’s still hard to pinpoint exactly why, but if I had to guess, among the many pros and cons of notebooks, what didn’t work for me was the lack of a final or intermediate by-product (e.g., a summary document with findings, tables, and plots) that I could refer back to anytime in the future. Most likely because of how I used them (inefficiently, that is), my notebooks were always in progress, never-ending, and ever-changing, without the merit of reusable code. Yihui’s post on notebooks, IDEs, and R Markdown comes to mind.

What I’m hoping for in a data analysis workflow

R Markdown seemed a promising component of a workflow, especially as a central place where I can document what I tried, which scripts I used, and finally the findings from each code execution, so that I can understand a project’s “whereabouts” anytime I need to. And here I’m hoping my workflow to

After several rounds of trial and error, I feel like what I have at the moment is a good working version of an R Markdown centered workflow, specifically for data analysis using my text editor of choice, Vim. I understand that several components here make this workflow specific rather than general (e.g., Vim, not RStudio), but I believe the general idea can still be helpful to others.

So, what’s it look like?

Initially, I was going to walk through the workflow with a working example in this blog post, but including the example made the post a bit confusing structure-wise (I’m sure there’s a way to do it). So I decided to put the code examples and the final summary document in a GitHub repo instead. Think of this blog post as an introduction to, and motivation for, the actual work stored on GitHub.

As the subtitle of the main document in the repo says, I also talk about how to use Python code in R Markdown with the reticulate package. In fact, three types of coding examples are given:

For each type of example, three different scenarios show how to incorporate the corresponding code into the R Markdown document.

Although it is not perfect and still under development, I feel like I now have a portable and transferable workflow for data analysis that I can mix and match depending on the scope of a project. As someone said, I’m just trying to do “basic stuff really, really well”, so that it just “flows”!

Relevant/future work

Interactive use (of R and Python shells) in a data analysis project workflow and the “search path” are two topics that I think about quite a lot these days. As for the loss of reproducibility when Rmd files are rendered in the console, I must be affected by it, since in the workflow described in this post I render my home R Markdown document in the console. Either I have overlooked its impact, or I haven’t run into any such loss of reproducibility yet.


  1. It seems they were getting serious just as I was wrapping up my time in school.