From the Notebook to the Cluster
Part 0: Start with the Project
1 The long game
This series is about moving a research coding project from a local machine to high-throughput computing. The destination matters: eventually, we want to run an analysis on the UW–Madison Center for High Throughput Computing (CHTC), where work that would be slow or repetitive on a laptop can be split into many independent jobs.
But the path to the cluster does not start with the cluster.
It starts with the project.
Before writing the analysis, before creating a Quarto document, before building a container image, and before submitting a job, we need to give the project a good home. A well-structured project makes local exploration easier today and makes containerization, command-line execution, and CHTC submission easier later.
That is the purpose of Part 0.
We know what we want to do: we have a dataset, the Palmer Penguins data, and we know we want to run an analysis on it. The analysis itself will come later. For now, the task is more basic. We need to decide what kind of project can hold this work, not only for the first script we write, but for the later stages of the workflow as well.
The rest of the series follows a deliberate sequence:
Part 0 create the project home -> toolero::init_project()
Part 1 create the document/code -> toolero::create_qmd()
Part 2 containerize the project -> containr
Part 3 submit the first job -> submitr single-job workflow
Part 4 submit many jobs -> toolero::write_by_group() + submitr
Each step has a clear intent. Each step should leave the project better prepared for the next one.
2 Why project structure comes first
Most research projects begin naively: a dataset, a script, and a question. The first few decisions feel small. A file goes on the Desktop. A script gets named analysis-final.R. A plot gets exported by hand. A package gets installed without recording the version. A collaborator receives a copy of the code by email.
Nothing breaks immediately.
That is why these habits are easy to underestimate.
The cost appears later, when the project needs to be shared, rerun, reviewed, containerized, or scaled up and moved to another computing system. At that point, the project starts to depend on memory: where the raw data came from, which file is current, what package versions were installed, which script produced which result, and what changed between attempts.
A good project structure does not solve every problem. It does, however, remove a set of avoidable problems from the path. It gives the work a predictable home. It separates raw inputs from derived outputs. It records dependencies. It creates a place for scripts, documents, results, and supporting files. It also makes the project easier to explain to someone else, including your future self.
The best time to make these decisions is at the beginning, before the project has accumulated enough history to make reorganization painful.
3 Before we run anything
We are not going to create the project immediately.
First, we should decide what the project needs to support.
For this series, we know a few things in advance:
- our original input dataset will be named palmer-penguins.csv;
- the original dataset should live in data-raw/;
- later, we will split the dataset into job-sized inputs;
- those split inputs should live in data/jobs/;
- outputs should have a predictable home;
- the project should track R package dependencies;
- the project should track changes over time;
- the structure should still make sense when the work moves from a laptop to CHTC.
These decisions are small, but they are not arbitrary. They are early design choices that make later steps easier.
In other words, we are not only asking, “Where should I put this file today?” We are asking, “What structure will still make sense when this analysis needs to be rendered, scripted, containerized, and submitted as a job?”
4 What a good project home needs
For this workflow, a good research coding project needs:
- a project directory that contains the work;
- a predictable folder structure;
- a place for raw data;
- a place for derived data and job inputs;
- a place for results;
- dependency tracking with renv;
- version control with git;
- room to add Quarto documents, scripts, containers, and CHTC submission files later.
The exact folders matter less than the habit they represent: separate different kinds of files by role. Raw inputs are not the same as processed outputs. Scripts are not the same as results. Documentation is not the same as data.
The structure we create in this post is intentionally modest. It is not a rigid framework. It is a starting point that works well for many research workflows and can be customized as the project grows.
5 Installation
Before running any of the code below, you will need the toolero package. The current CRAN release can be installed in the usual way:
install.packages("toolero")
However, this post reflects features introduced in toolero 0.4.0, which is currently available only in the development version. These features are scheduled to be pushed to CRAN at the end of July. In the meantime, install the development version from GitHub:
# install.packages("pak")
pak::pak("erwinlares/toolero")
For the most current information on new features, breaking changes, and development status, see the package repository at github.com/erwinlares/toolero.
6 Design the scaffold
The toolero package provides a project scaffolding function called init_project(). It creates a standard research project structure and, by default, initializes renv and git.
Once the package is loaded:
library(toolero)
The minimal version of the command looks like this:
# Do not run this yet.
# We are first sketching the project we want to create.
init_project(
  path = "palmer-penguins-analysis"
)
This would create a project called palmer-penguins-analysis using the default scaffold.
6.1 The default folder structure
As of 0.4.0, the default folder set follows conventions established by The Carpentries and UW-Madison Libraries: data-raw/, data/, scripts/, output/figures/, output/tables/, and reports/. This replaces the previous default set, so if you have used an earlier version of toolero the folder names will look a little different here.
6.2 Customizing the scaffold
We already know something specific about this project. Later in the series, we will split the Palmer Penguins dataset into separate inputs for many HTC jobs. Those files need a home. We can create that folder now.
That is what custom_folders is for. It lets us extend the default scaffold with folders that are specific to this project. Bare names add new folders; names prefixed with a minus sign ("-") suppress a folder from the default set without affecting anything else.
# Do not run this yet.
# This adds the future home for job-sized input files.
init_project(
  path = "palmer-penguins-analysis",
  custom_folders = "data/jobs"
)
The important choice is not just the syntax. The important choice is the convention:
data-raw/palmer-penguins.csv # original input data
data/jobs/ # future split inputs, one file per job
output/tables/ # outputs from local or cluster runs
We are not splitting the data in Part 0. That happens later. But we know where the project is going, so we create a place for the future job inputs now.
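The add-and-suppress rule for custom_folders can be made concrete with a small base R sketch. The helper resolve_folders() is hypothetical and is not toolero's implementation; it only mirrors the rule as described in this section, using the default folder names listed for 0.4.0. Whether custom_folders accepts a vector that mixes both forms is an assumption here, so treat the second call as an illustration of the rule rather than of the package's behavior.

```r
# Hypothetical helper: NOT toolero's code, only an illustration of the
# "bare name adds, leading minus suppresses" rule described in this section.
resolve_folders <- function(defaults, custom = NULL) {
  if (is.null(custom)) return(defaults)
  is_minus <- startsWith(custom, "-")
  to_drop  <- sub("^-", "", custom[is_minus])  # names with the "-" removed
  to_add   <- custom[!is_minus]
  kept <- setdiff(defaults, to_drop)           # suppress from the defaults
  union(kept, to_add)                          # then append the additions
}

# Default folder set as listed for toolero 0.4.0 in this post.
defaults <- c("data-raw", "data", "scripts",
              "output/figures", "output/tables", "reports")

# Add one folder, keep every default.
print(resolve_folders(defaults, "data/jobs"))

# Add one folder and suppress one default (assumed mixed usage).
print(resolve_folders(defaults, c("data/jobs", "-reports")))
```

The point of the sketch is the ordering: suppression only ever touches the default set, and additions are appended afterwards, so a suppressed name cannot remove a folder you asked for explicitly.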
6.3 Config-driven scaffolds
For projects with a recurring or non-standard folder structure, toolero 0.4.0 also introduces a config argument that lets you drive the entire folder set from a YAML file. generate_project_config() writes a skeleton config pre-filled with the standard folders; edit it to define your own layout and store it somewhere reusable.
# Write a config skeleton to your home directory
generate_project_config("my-research-project.yml", path = "~")
Then pass it to init_project():
init_project(
  path = "palmer-penguins-analysis",
  config = "~/my-research-project.yml"
)
This is worth knowing about, but for this series we will stick with the defaults plus custom_folders, which is the right level of customization for a typical analysis project.
7 Keep the defaults that support the long game
The full function signature exposes a few useful choices:
init_project <- function(path,
                         use_renv = TRUE,
                         use_git = TRUE,
                         custom_folders = NULL,
                         config = NULL,
                         open = FALSE,
                         uw_branding = FALSE) {}
For this series, we want to keep the important defaults and add the folder we know we will need:
# Do not run this yet.
# This is the design we are choosing.
init_project(
  path = "palmer-penguins-analysis",
  use_renv = TRUE,
  use_git = TRUE,
  custom_folders = "data/jobs",
  open = FALSE,
  uw_branding = FALSE
)
The defaults are intentionally opinionated.
use_renv = TRUE matters because package versions are part of the analysis. Later, containr will use the project’s renv.lock file to help build a container image. That image will make the project’s software environment portable.
use_git = TRUE matters because the project will change as it moves through the series. Version control records those changes and gives you a history of the project: when the document was created, when the script changed, when the container was built, and when the submission workflow was added.
custom_folders = "data/jobs" matters because high-throughput computing works best when a large task can be split into many independent pieces. We are not there yet, but the project structure is already making room for that step.
open = FALSE is fine for this post because we are showing a scripted setup. In interactive work, you may prefer open = TRUE so the project opens after creation.
uw_branding = FALSE keeps the scaffold generic. If you are creating a UW–Madison-facing project and want supported UW-branded assets, you can set it to TRUE.
8 Create the project
Now that we have made the design choices explicit, create the project.
library(toolero)
init_project(
  path = "palmer-penguins-analysis",
  use_renv = TRUE,
  use_git = TRUE,
  custom_folders = "data/jobs"
)
After running this command, the project has a home and a folder structure that matches the workflow we are building toward.
A typical structure looks like this:
palmer-penguins-analysis/
├── data/
│   └── jobs/
├── data-raw/
├── output/
│   ├── figures/
│   └── tables/
├── reports/
└── scripts/
Depending on your local setup and options, you may also see files created by renv, git, or R project tooling.
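A quick way to confirm that the scaffold matches the design is to check for the expected folders from R. This is a minimal base R sketch; missing_folders() is a hypothetical helper written for this post, not a toolero function, and the folder list is simply the one chosen above.

```r
# Folders this post expects to exist after init_project() has run.
expected <- c("data-raw", "data/jobs", "scripts",
              "output/figures", "output/tables", "reports")

# Hypothetical helper: return the expected folders that are missing
# under `project`. An empty character vector means all are present.
missing_folders <- function(project, folders = expected) {
  folders[!dir.exists(file.path(project, folders))]
}

# Run from the directory that contains the project folder.
print(missing_folders("palmer-penguins-analysis"))
```

If the call prints character(0), every folder in the design exists. Any names it does print are the ones to create or investigate.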
9 Why these folders matter
The folder names are not magic. Their value comes from having a shared convention.
9.1 data-raw/
Use data-raw/ for original input data or files that are close to the original source. In this series, our starting dataset will live here:
data-raw/palmer-penguins.csv
That filename is specific on purpose. A name like data.csv is easy to type, but it becomes vague quickly. palmer-penguins.csv tells us what the file is before we open it.
9.2 data/jobs/
Use data/jobs/ for future job-sized inputs.
Later in the series, we will split the Palmer Penguins data into smaller files. Each file will be suitable for one independent job. That is the kind of structure CHTC can use well: many small, independent tasks instead of one large, monolithic run.
Creating data/jobs/ now is a small way of keeping the long game visible.
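The idea of one file per job can be previewed on a toy data frame without touching the real dataset. The sketch below uses base R only and writes to a temporary directory. The helper write_one_file_per_group() and the toy values are made up for illustration, and this is not how toolero::write_by_group() is implemented.

```r
# Toy stand-in for the real dataset; the values are made up for illustration.
toy <- data.frame(
  species = c("A", "A", "B", "C"),
  value   = c(1.0, 2.0, 3.0, 4.0)
)

# Hypothetical helper: write one CSV per group into `dir` and
# return the paths of the files that were written.
write_one_file_per_group <- function(data, group, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  pieces <- split(data, data[[group]])
  files <- file.path(dir, paste0(names(pieces), ".csv"))
  for (i in seq_along(pieces)) {
    write.csv(pieces[[i]], files[i], row.names = FALSE)
  }
  files
}

# Written under tempdir() so the real project is not modified.
job_dir <- file.path(tempdir(), "data", "jobs")
job_files <- write_one_file_per_group(toy, "species", job_dir)
print(basename(job_files))
```

Each resulting file can be handed to a separate job that knows nothing about the others, which is the property that makes a workload a good fit for high-throughput computing.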
9.3 scripts/
Use scripts/ for standalone scripts or helper programs. Early in a project, you may not need much here. As the project grows, this folder keeps executable code separate from one-off analysis documents.
9.4 output/figures/ and output/tables/
Use these folders for analysis outputs: plots, model results, CSV summaries, and archived results from a cluster job. The point is to avoid mixing inputs and outputs. A future script should be able to read from known input locations and write to known output locations without depending on files scattered across the project.
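The read-from-known-input, write-to-known-output pattern can be sketched in a few lines of base R. Everything here is illustrative: summarise_file() is a hypothetical analysis step, the toy CSV is made up, and the demonstration runs in a throwaway project under tempdir() rather than in the real one.

```r
# Hypothetical analysis step: read one known input, write one known output.
summarise_file <- function(input, output) {
  data <- read.csv(input)
  result <- data.frame(n_rows = nrow(data), n_cols = ncol(data))
  dir.create(dirname(output), recursive = TRUE, showWarnings = FALSE)
  write.csv(result, output, row.names = FALSE)
  invisible(output)
}

# Throwaway project that mirrors the folder convention of this post.
project <- file.path(tempdir(), "demo-project")
input   <- file.path(project, "data-raw", "toy.csv")
output  <- file.path(project, "output", "tables", "toy-summary.csv")

# Create a tiny toy input file (made-up values).
dir.create(dirname(input), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), input, row.names = FALSE)

summarise_file(input, output)
print(read.csv(output))
```

Because the function takes its input and output locations as arguments, the same step can later be called once per job with different paths, without editing the analysis code.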
9.5 reports/
Use reports/ for rendered documents, notes, or supporting documentation. In the next post, we will create a Quarto document that becomes the source of truth for the analysis. The important thing is that documentation is part of the project, not an afterthought stored somewhere else.
10 Put the dataset where it belongs
Now that the project exists, place the original dataset here:
palmer-penguins-analysis/data-raw/palmer-penguins.csv
This gives the original input file a clear home.
If you are creating the file from the palmerpenguins package, you could do that from inside the project:
# install.packages("palmerpenguins")
# install.packages("readr")
if (!dir.exists("data-raw")) {
  dir.create("data-raw", recursive = TRUE)
}
palmerpenguins::penguins |>
  readr::write_csv("data-raw/palmer-penguins.csv")
The important thing is not the mechanics of this particular dataset. The important thing is the habit: the original input file has a clear, stable, project-relative path.
That path will matter later. In Part 1, the Quarto document will read from data-raw/palmer-penguins.csv. In later parts, derived job inputs will be written under data/jobs/. Keeping those paths predictable makes the project easier to move from local analysis to command-line execution and eventually to CHTC.
11 Why renv matters for the long game
When use_renv = TRUE, init_project() initializes renv for the project. renv records the R packages used by the project in a lockfile called renv.lock.
That lockfile matters because package versions are part of the analysis. If a collaborator reruns the project with different package versions, the code may behave differently. If you return to the project months later, your library may no longer match the one you used originally.
For this series, renv also has a second role: it prepares the project for containerization.
In a later post, containr will use the project’s renv.lock file to help generate a container image. That image will capture the software environment needed to run the analysis somewhere other than your laptop.
So renv is not just a local reproducibility tool. It is part of the bridge from local work to portable execution.
A useful habit is:
renv::install("package-name")
renv::snapshot()
Install packages through renv, then snapshot the project after changes to keep renv.lock current.
12 Why git matters for the long game
When use_git = TRUE, init_project() initializes version control with git.
Version control is not only for software developers. It is a research workflow tool. It records how the project changes over time and gives you a way to recover earlier states. It also makes collaboration safer because changes can be reviewed, compared, and merged deliberately.
For the Notebook-to-Cluster workflow, git gives you a project history that can answer practical questions later:
- When did the analysis script change?
- Which version of the document produced these results?
- What changed between the first local run and the cluster submission?
- Which commit was used when the container image was built?
Those questions are much easier to answer when version control starts at the beginning.
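One practical way to keep the last two questions answerable is to record the current commit next to anything the project produces. The sketch below assumes the git command-line client is available on the PATH. current_commit() is a hypothetical helper, not part of toolero or submitr, and it returns NA when git is unavailable, the directory does not exist, or there is no commit to report.

```r
# Hypothetical helper: return the short hash of the current commit in
# `project`, or NA_character_ if it cannot be determined.
current_commit <- function(project = ".") {
  if (!nzchar(Sys.which("git")) || !dir.exists(project)) {
    return(NA_character_)
  }
  old <- setwd(project)
  on.exit(setwd(old), add = TRUE)  # always restore the working directory
  out <- tryCatch(
    suppressWarnings(
      system2("git", c("rev-parse", "--short", "HEAD"),
              stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) character(0)
  )
  # system2() attaches a "status" attribute only when the command failed.
  if (length(out) != 1 || !is.null(attr(out, "status"))) {
    return(NA_character_)
  }
  as.character(out)
}

print(current_commit("."))
```

Writing that value into a log line or an output filename ties a result to the exact state of the code that produced it.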
13 What we have after Part 0
At the end of Part 0, we have not analyzed anything yet. That is intentional.
What we have is a project that can grow:
palmer-penguins-analysis/
├── data-raw/
│   └── palmer-penguins.csv
├── data/
│   └── jobs/
├── output/
│   ├── figures/
│   └── tables/
├── reports/
└── scripts/
We also have the beginnings of two important records:
- renv.lock, which records the package environment;
- a git repository, which records the project history.
This may feel like preparation rather than progress, but it is progress. A project with a stable home is easier to explain, easier to rerun, easier to containerize, and easier to submit to a computing system.
14 What comes next
In Part 1, we will create the document that carries the analysis.
That document will do more than hold code. It will become the source of truth for the workflow: prose, code, outputs, and decisions in one place. We will use toolero::create_qmd() to scaffold the Quarto document and automatically set up the post-render purl() step that derives a standalone .R script from the document.
That matters for the long game. The .qmd will be the human-readable notebook. The derived .R file will be the machine-executable script. Keeping those two connected is how we avoid code drift when the project moves from local exploration to command-line execution and eventually to CHTC.
For now, the project has a home.
That is the right first step.