Introduction

My mentor has assigned me an interesting task to begin working on this week: test the current RcppDeepState GitHub Action on GitHub-hosted Rcpp-based packages. Akhila, the previous RcppDeepState maintainer, had already carried out a similar task, but she ran the analysis locally rather than in each package repository. Here’s a list of all the Rcpp-based packages where RcppDeepState found issues [1].

Now that RcppDeepState has been integrated with GitHub Actions, we can test more packages stored on GitHub; all we have to do is fork the original repositories, set up the action within them, and then make a pull request so that the action returns a comment with the analysis result. The main advantages of using RcppDeepState-action inside a repository are that it lets us:

  • dynamically check for issues inside packages using continuous integration;
  • reduce the risk of code level bugs that can compromise the entire package;
  • improve the quality of the final package by making it easier to detect subtler bugs, with quick feedback and alerts when an error is detected.

The problem

At the time of writing, CRAN lists 18493 packages, 312 of which have problems, according to Akhila’s report [1]. Given this list of packages, the question is whether it is feasible to determine if a package is hosted on GitHub. If so, can RcppDeepState be run on this repository?

To begin with, the answer to the first question is yes: my mentor, Dr. Toby Dylan Hocking, supplied me with a fantastic method for finding the GitHub repository of a package from the package’s metadata available on CRAN. This method is based on locating the https://github.com prefix inside the metadata of the package. More information on this technique is available in the corresponding blog post [2].
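To make the idea concrete, here is a minimal sketch in base R. The helper name extract_github_url and the example field values are made up for illustration, and the pattern is a simplified variant of the one used in the final script, not my mentor's exact code.

```r
# Sketch: return the first GitHub repository URL found in a single
# CRAN-style URL field, or NA when there is none.
# extract_github_url() is a hypothetical helper for one string at a time.
extract_github_url <- function(url_field) {
  # owner and repository name: any run of characters that is not a
  # slash, hash, comma, space or newline
  pattern <- "https://github.com/[^/#, \n]+/[^/#, \n]+"
  m <- regmatches(url_field, regexpr(pattern, url_field))
  if (length(m) == 0) NA_character_ else m
}

# Made-up field values, not real CRAN metadata
extract_github_url("https://example.org/pkg, https://github.com/someuser/somepkg/issues")
extract_github_url("https://example.org/pkg")
```

The first call should yield the repository URL without the trailing /issues part, while the second has no GitHub link and gives NA.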

The answer to the second question is also yes: once we have a link to the repository, we can fork it, initialize RcppDeepState-action with the RcppDeepState::ci_setup() function, and submit a pull request. RcppDeepState’s report is then displayed as a comment within the pull request. However, given the number of packages found in the previous step, performing these steps by hand for each one is impractical, so in this post I propose a method for automating the process.

Solution

The solution makes use of four libraries:

  • gh [3]: a minimal client to access the REST API of GitHub;
  • git2r [4]: an interface to the libgit2 library, which provides access to Git repositories with some basic commands;
  • data.table [5]: a library to aggregate large data and run fast operations;
  • RcppDeepState [6]: a package to fuzz test your R library’s C++ code in order to find more subtle bugs like memory leaks or even more general memory errors.

Steps

The first step is to import the required libraries and get the GitHub personal access token (PAT), which in my case is stored in an environment variable called GITHUB_PAT. This token will be used to authorize the push of local commits to a repository’s remote branch. To generate this token, you can follow the instructions provided by GitHub [7].

library("gh")
library("git2r")
library("RcppDeepState")
library("data.table")

cred <- cred_token(token = "GITHUB_PAT")
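Since cred_token receives the name of the environment variable rather than the token itself, a missing variable is easy to overlook. A small guard along these lines could be run first; check_token is a hypothetical helper of mine, not part of git2r.

```r
# Sketch: fail early when the environment variable holding the PAT is unset.
# check_token() is a hypothetical helper, not a git2r function.
check_token <- function(var = "GITHUB_PAT") {
  if (!nzchar(Sys.getenv(var))) {
    stop("Environment variable ", var, " is not set")
  }
  invisible(TRUE)
}

# Possible use before creating the credentials (not run here):
# check_token("GITHUB_PAT")
```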

Given a pkg.repos data table produced by following the steps described in Dr. Toby Dylan Hocking’s blog post [2], the automatic fork/pull-request process mentioned above is implemented as described in the next steps.

> pkg.repos
          Package                                  repo.url
  1: humaniformat https://github.com/ironholds/humaniformat
  2:       jmotif        https://github.com/jMotif/jmotif-R
  3:     olctools     https://github.com/Ironholds/olctools
  4:  RcppDynProg  https://github.com/WinVector/RcppDynProg
  5:      BWStest     https://github.com/shabbychef/BWStest
 ---                                                       
111:       tweenr       https://github.com/thomasp85/tweenr
112:         uwot        https://github.com/jlmelville/uwot
113:       vapour       https://github.com/hypertidy/vapour
114:           wk         https://github.com/paleolimbot/wk
115:      wkutils    https://github.com/paleolimbot/wkutils

We begin by removing the https://github.com/ prefix from each repository URL, which gives a new column containing strings in the format <repository owner>/<repository name>.

pkg.repos[, repo_full_name := sub("https://github.com/", "", repo.url) ]
> pkg.repos
          Package                                  repo.url         repo_full_name
  1: humaniformat https://github.com/ironholds/humaniformat ironholds/humaniformat
  2:       jmotif        https://github.com/jMotif/jmotif-R        jMotif/jmotif-R
  3:     olctools     https://github.com/Ironholds/olctools     Ironholds/olctools
  4:  RcppDynProg  https://github.com/WinVector/RcppDynProg  WinVector/RcppDynProg
  5:      BWStest     https://github.com/shabbychef/BWStest     shabbychef/BWStest
 ---                                                                              
111:       tweenr       https://github.com/thomasp85/tweenr       thomasp85/tweenr
112:         uwot        https://github.com/jlmelville/uwot        jlmelville/uwot
113:       vapour       https://github.com/hypertidy/vapour       hypertidy/vapour
114:           wk         https://github.com/paleolimbot/wk         paleolimbot/wk
115:      wkutils    https://github.com/paleolimbot/wkutils    paleolimbot/wkutils
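As a side note, the same transformation can be sketched on a plain character vector. Here I pass fixed = TRUE, which is my own variation, so that the prefix is matched literally instead of as a regular expression, and I also split the result into owner and repository name. The two URLs are taken from the table above.

```r
# Sketch: strip the prefix literally and split "<owner>/<name>" into parts.
repo_urls <- c("https://github.com/ironholds/humaniformat",
               "https://github.com/jMotif/jmotif-R")

# fixed = TRUE treats the prefix as a literal string, so the dots in
# "github.com" cannot match arbitrary characters
full_names <- sub("https://github.com/", "", repo_urls, fixed = TRUE)

parts <- strsplit(full_names, "/", fixed = TRUE)
owners <- vapply(parts, `[`, character(1), 1)      # first component
repo_names <- vapply(parts, `[`, character(1), 2)  # second component
```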

Then we can iterate over the repositories listed above, forking and cloning each one. Let us call each repository in the loop repo_full_name.

fork_endpoint <- paste0("POST /repos/", repo_full_name, "/forks")
fork_result <- gh(fork_endpoint)

repo <- clone(fork_result$clone_url, fork_result$name)
config(repo, http.followRedirects='true')
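One thing to keep in mind is that the GitHub documentation describes forking as asynchronous, so the fork may not be ready for cloning right after the request returns. A minimal retry wrapper could look like the sketch below; retry is a hypothetical helper, and the number of attempts and the waiting time are arbitrary choices.

```r
# Sketch: call f() until it succeeds or the attempts run out.
# retry() is a hypothetical helper; attempts and wait are arbitrary defaults.
retry <- function(f, attempts = 5, wait = 2) {
  stopifnot(attempts >= 1)
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < attempts) Sys.sleep(wait)  # pause before the next attempt
  }
  stop("all ", attempts, " attempts failed: ", conditionMessage(result))
}

# Possible use inside the loop (not run here):
# repo <- retry(function() clone(fork_result$clone_url, fork_result$name))
```

A failed clone attempt may leave a partially created directory behind, so a real script would probably need to clean it up before retrying.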

After successfully cloning the repository, we create a new branch, named RcppDeepState, for the RcppDeepState analysis and check it out.

test_branch_name <- "RcppDeepState"
test_branch <- branch_create(last_commit(repo), test_branch_name)
checkout(repo, test_branch_name)

The following step is to determine if the repository includes a legitimate package. This is accomplished by checking the existence of the DESCRIPTION file within the repository’s root: if this file exists, the repository includes a valid package that can be examined using RcppDeepState; otherwise, RcppDeepState cannot analyze the package.

if (!file.exists(file.path("./", fork_result$name, "DESCRIPTION"))){
  stop("The repository doesn't contain a valid package")
}
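A slightly stricter variant of this check is sketched below. It reads the DESCRIPTION file with read.dcf and looks for Rcpp among the LinkingTo and Imports fields. Requiring Rcpp there is my own assumption about what makes a package worth analyzing, and is_rcpp_package is a hypothetical helper, not part of RcppDeepState.

```r
# Sketch: TRUE when pkg_dir has a DESCRIPTION file that mentions Rcpp in
# LinkingTo or Imports. is_rcpp_package() is a hypothetical helper.
is_rcpp_package <- function(pkg_dir) {
  desc_file <- file.path(pkg_dir, "DESCRIPTION")
  if (!file.exists(desc_file)) return(FALSE)
  fields <- read.dcf(desc_file, fields = c("LinkingTo", "Imports"))
  if (nrow(fields) == 0) return(FALSE)  # empty DESCRIPTION file
  # missing fields are NA and simply do not match
  deps <- paste(fields[1, ], collapse = ",")
  grepl("Rcpp", deps, fixed = TRUE)
}

# Possible use inside the loop, skipping the repository instead of stopping:
# if (!is_rcpp_package(fork_result$name)) next
```

Using next rather than stop would let the loop continue with the remaining repositories when one of them is not a valid package.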

We can now use the existing ci_setup function to initialize the workflow file within the repository. This function accepts as input the location of the repository on the filesystem and a list of parameters corresponding to the action’s inputs. In this scenario, we’ve specified fail_ci_if_error=TRUE to cause the CI process to fail if an error is discovered, and comment=TRUE to print the report comment inside the pull request that will be produced in the following phase.

RcppDeepState::ci_setup(fork_result$name, fail_ci_if_error=TRUE, comment=TRUE)
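Before committing, it can be useful to confirm that a workflow file was actually written. The sketch below only lists YAML files under .github/workflows; workflow_files is a hypothetical helper, and I make no assumption about the name of the file that ci_setup generates.

```r
# Sketch: list the workflow files found under <pkg_dir>/.github/workflows.
# workflow_files() is a hypothetical helper; it returns character(0) when
# the directory does not exist or contains no .yml/.yaml files.
workflow_files <- function(pkg_dir) {
  list.files(file.path(pkg_dir, ".github", "workflows"),
             pattern = "\\.ya?ml$", full.names = TRUE)
}

# Possible use after ci_setup (not run here):
# if (length(workflow_files(fork_result$name)) == 0) stop("no workflow file found")
```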

The last step is to push the new changes to the forked repository and submit a pull request.

# commit and push the workflow file
add(repo, file.path("./", fork_result$name, ".github", "workflows", "*"))
commit(repo, message="RcppDeepState CI Setup")
push(repo, "origin", paste("refs", "heads", test_branch_name, sep="/"),
    credentials=cred)

# open the pull request
pulls_endpoint <- paste0("POST /repos/", fork_result$full_name, "/pulls")
pull_title <- "Analyze the package with RcppDeepState"
pull_body <- paste("### RcppDeepState Analysis\nThis pull request aims to find", 
                   "bugs in this R package using RcppDeepState-action")
gh(pulls_endpoint, title=pull_title, owner=fork_result$owner$login,
    repo=fork_result$name, body=pull_body, base=fork_result$default_branch,
    head=test_branch_name)

Final script

We get the following script by combining all of the previous steps with the solution supplied by my mentor. As you can see, a batch_size option has been introduced to allow you to choose the number of repositories to test. This option was added to prevent the creation of 115 repositories within your GitHub account.

library("gh")
library("git2r")
library("RcppDeepState")
library("data.table")

cred <- cred_token(token = "GITHUB_PAT")
batch_size <- 2 # adjust the batch size

if(!file.exists("problems.html")){
  download.file(
    "https://akhikolla.github.io./packages-folders/",
    "problems.html")
}
prob.dt <- nc::capture_all_str(
  "problems.html",
  '<li><a href="',
  Package=".*?",
  '[.]html">')

if(!file.exists("packages.rds")){
  download.file(
    "https://cloud.r-project.org/web/packages/packages.rds",
    "packages.rds")
}
meta.mat <- readRDS("packages.rds")
meta.dt <- data.table(meta.mat)
meta.prob <- meta.dt[prob.dt, on="Package"]

pkg.repos <- meta.prob[, nc::capture_all_str(
  c("",URL), # to avoid attempting to download URL.
  repo.url="https://github.com/.*?/[^#/ ,]+"),
  by=Package]

pkg.repos[, repo_full_name := sub("https://github.com/", "", repo.url) ]

for (repo_full_name in head(pkg.repos$repo_full_name, batch_size)){
  
  fork_endpoint <- paste0("POST /repos/", repo_full_name, "/forks")
  fork_result <- gh(fork_endpoint)

  repo <- clone(fork_result$clone_url, fork_result$name)
  config(repo, http.followRedirects='true')

  test_branch_name <- "RcppDeepState"
  test_branch <- branch_create(last_commit(repo), test_branch_name)
  checkout(repo, test_branch_name)

  ### check if the repository's root contains a valid package
  if (!file.exists(file.path("./", fork_result$name, "DESCRIPTION"))){
    stop("The repository doesn't contain a valid package")
  }
  
  RcppDeepState::ci_setup(fork_result$name, fail_ci_if_error=TRUE,
                          comment=TRUE)

  # commit and push the workflow file
  add(repo, file.path("./", fork_result$name, ".github", "workflows", "*"))
  commit(repo, message="RcppDeepState CI Setup")
  push(repo, "origin", paste("refs", "heads", test_branch_name, sep="/"),
      credentials=cred)

  # submit a pull request  
  pulls_endpoint <- paste0("POST /repos/", fork_result$full_name, "/pulls")
  pull_title <- "Analyze the package with RcppDeepState"
  pull_body <- paste("### RcppDeepState Analysis\nThis pull request aims to", 
                    "find bugs in this R package using RcppDeepState-action")
  gh(pulls_endpoint, title=pull_title, owner=fork_result$owner$login,
      repo=fork_result$name, body=pull_body, base=fork_result$default_branch,
      head=test_branch_name)
}
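Note that in the loop above a single failure, for example a repository without a DESCRIPTION file, stops the whole batch. One possible refinement, sketched here under the assumption that the body of the loop is moved into a function of my own naming (process_one), is to catch errors per repository and collect a status table instead.

```r
# Sketch: run process_one() on every repository and record the outcome.
# process_batch() and process_one() are hypothetical names; process_one()
# would contain the fork/clone/setup/pull-request steps for one repository.
process_batch <- function(repo_names, process_one) {
  status <- vapply(repo_names, function(repo_full_name) {
    tryCatch({
      process_one(repo_full_name)
      "ok"
    }, error = function(e) conditionMessage(e))  # keep the error message
  }, character(1), USE.NAMES = FALSE)
  data.frame(repo_full_name = repo_names, status = status,
             stringsAsFactors = FALSE)
}

# Possible use (not run here):
# process_batch(head(pkg.repos$repo_full_name, batch_size), process_one)
```

The resulting data frame has one row per repository, with either "ok" or the error message, which makes it easier to see which packages need a second look.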

Test

Before running the preceding script, I expected that a batch_size of 115 (nrow(pkg.repos)) would create a very large number of repositories within my GitHub account. As a preliminary solution, I set batch_size to 2, so that only two packages are examined. In the Future work section, I describe a possible way to keep these repositories out of my GitHub user profile.

The following repositories will be tested with a batch_size of 2:

Following the execution of the above script, two repositories will be automatically generated by forking the originals. If you go inside the repositories’ pull requests, you’ll see that a new pull request titled Analyze the package with RcppDeepState has been automatically submitted. After a few minutes, when the CI checks are completed, you will see a comment inside the pull request with the analysis result.

Results

The findings of the analysis are publicly available on GitHub, and they highlight some issues within the evaluated packages. If we compare the results to those found by Akhila [1], we can see that there are some similarities.

Here are the links to the results:

Future work

As previously mentioned, forking all 115 packages would create a large number of repositories within my GitHub account. The number itself is not the only problem: if I later need to remove them all, I will have to write a script to do so, and running such a script against my own user profile is risky, because a wrong condition could end up deleting unrelated repositories.

One possible solution is to create a GitHub organization and instruct the above script to fork the repositories into that organization rather than into my GitHub account. This can be accomplished by passing an extra argument to the GitHub REST API, the organization parameter, which must be set to the organization’s name. The rest of the code is left unchanged.

fork_endpoint <- paste0("POST /repos/", repo_full_name, "/forks")
fork_result <- gh(fork_endpoint, organization="<org name>")
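To keep both behaviours available in the same script, the request could be built by a small helper that adds the organization parameter only when one is given. This is just a sketch; fork_request is a hypothetical function of mine, and it only builds the argument list without contacting GitHub.

```r
# Sketch: build the arguments for the fork request, optionally targeting
# an organization. fork_request() is a hypothetical helper.
fork_request <- function(repo_full_name, org = NULL) {
  # first (unnamed) element is the endpoint string expected by gh()
  args <- list(paste0("POST /repos/", repo_full_name, "/forks"))
  if (!is.null(org)) args$organization <- org
  args
}

# Possible use (not run here):
# fork_result <- do.call(gh, fork_request(repo_full_name, org = "<org name>"))
```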