Monday, July 18, 2016

ReArchitecting the AutoGrader

So on Friday I followed through with my plans to get the rest of the FeatureGrader to expose errors in the students’ code to the students themselves, rather than just having it respond with “your code timed out or had an error”, and I think I was largely successful.

At least I got those few extra code changes deployed into production, and my manual tests through the edX interface showed me that my test examples would display full errors for RSpec failures, migration failures, and Rails failures. Of course I’m blogging before I’ve reviewed how things fared over the weekend, but it feels like a step in the right direction. Even if the students can’t understand the errors themselves, they can copy and paste the output and perhaps a TA has an increased chance of helping them.

I also wrapped my spike in tests like:

  Scenario: student submits a HW4 with migration_error
    Given I set up a test that requires internet connection
    Given an XQueue that has submission "hw4_migration_error.json" in queue
    And has been setup with the config file "conf.yml"
    Then I should receive a grade of "0" for my assignment
    And results should include "SQLException: duplicate column name: director: ALTER TABLE"

to check that the errors would actually be displayed even as we mutate the code. I have a selection of similar scenarios which feel like they are crying out to be DRYed out with a scenario template. Similarly, with these tests in place I wonder if I can’t improve some of the underlying grading code. Maybe we can re-throw these TestFailedError custom errors, which look like they might have been intended for communicating errors in the submitted code back up to the grader. Instead of doing further refactoring, though, I found myself spending the time reaching out to individual students on the forums and in GitHub to add some details about where the grader had been getting stuck for them, and encouraging them to re-submit, since the grader had changed and they should now be able to see more details.

I just sneaked a peek at the GitHub comment thread, and while there are some new issues that could distract me from completing this blog, at the very least I can see some students deriving value from the new level of grader feedback. So grader refactoring? I continue to feel negative about that task. The nested sandboxes of the feature grader … the fear is that refactoring could open new cans of worms and it just feels like we miss a huge chance by not having students submit their solutions via pull request.

So how would a PR-based grader work? Well, reflecting on the git-immersion grader that we developed for the AV102 Managing Distributed Teams course, we can have students submit their GitHub usernames and have the grader grab details from GitHub. We can get a list of comments from a PR, so if we had Code Climate, CI etc. set up on a repo and had students submit their solutions as pull requests, we could pull in the relevant data using a combination of the repo name and their GitHub username.
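
As a rough sketch of that lookup, assume the list of PRs for a repo has already been fetched from GitHub's REST API, where each entry carries a `number`, a `user.login` and a `head.sha`. Picking out one student's submissions could then look something like the following. The helper name `prs_for_student` and the sample data are made up for illustration; this is not code from the existing grader.

```ruby
require 'json'

# Hypothetical helper: given the parsed JSON array of pull requests for
# a repo, keep only the PRs opened by one student.
# GitHub logins are case-insensitive, so compare them downcased.
def prs_for_student(pull_requests, github_username)
  wanted = github_username.downcase
  pull_requests
    .select { |pr| pr.dig('user', 'login').to_s.downcase == wanted }
    .map { |pr| { number: pr['number'], sha: pr.dig('head', 'sha') } }
end

# Made-up sample shaped like the API response (only the fields used here).
sample = JSON.parse(<<~JSON)
  [
    {"number": 7, "user": {"login": "alice"}, "head": {"sha": "abc123"}},
    {"number": 8, "user": {"login": "Bob"},   "head": {"sha": "def456"}},
    {"number": 9, "user": {"login": "alice"}, "head": {"sha": "789fed"}}
  ]
JSON

student_prs = prs_for_student(sample, 'Alice')
puts student_prs.inspect
```

The PR numbers that come back are what the grader would then use to look up that PR's comments and CI results.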

Making pull-requests would require students to fork rather than clone repos as they were originally encouraged to do. Switching back to that should not be a big deal. I was keen to remove forking since it didn’t really contribute to the experience of the assignment and was just an additional hoop to jump through. However if submission is by PR then we want students to understand forking and pulling; and of course that’s a valuable real world skill.

This means all the solutions to the assignments exist in much larger numbers in GitHub repos, but they exist in a lot already, so not much change there. What we might have though is students submitting assignments through a process that’s worth learning, rather than an idiosyncratic one specific to edX and our custom auto graders.

With a CI system like Travis or Semaphore we can run custom scripts to achieve the necessary mutation grading and so forth, although setting that up might be a little involved. The most critical step, however, is some mechanism for checking that the students are making git commits step by step. Particularly since the solutions will be available in even greater numbers, what we need to ensure is that students are not just copying a complete solution verbatim and submitting it in a single git commit. I am less concerned about the students’ ability to master an individual problem completely independently, and more concerned with their being able to follow a git process where they write small pieces of code step by step (googling when they get stuck) and commit each to git.

So for example in the Ruby-Intro assignment I imagine a step that checks that each individual method solution was submitted in a separate commit and that that commit comes from the student in question. Pairing is a concern there, but perhaps we can get the students set up so that the pairing session involves author and committer so that both are credited.

But basically we’d be checking that the first sum(arr) method was written and submitted in one commit, and then that max_2_sum(arr) was solved in a separate commit, and that the student in question was either the committer or the author on the assignment. In addition we would check that the commits were suitably spaced out in time, and of a recent date. The nature of the assignment changes here from being mainly focused on “can you solve this programming problem?”, to “can you solve this code versioning issue?”. And having the entire architecture based around industry standard CI might allow us to reliably change out the problems more frequently; something that feels challenging with the current grader architecture. The current grader architecture is set up to allow the publication of new assignments, but the process of doing so is understood by few. Maybe better documentation is the key there, although I think if there is a set of well tested assignments, then the temptation for many instructors and maintainers is just to use the existing tried and tested problems and focus their attention on other logistical aspects of a course.
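
To make those checks a bit more concrete, here is a minimal sketch. It assumes we have already reduced each commit to a small record (who authored it, who committed it, when, and which methods it added). How we would work out which methods a commit added is left open, and the `Commit` struct, the method name and the five-minute default gap are all my own placeholders.

```ruby
require 'time'

# Hypothetical commit record; in practice these fields would be filled
# in from the GitHub API or from `git log`.
Commit = Struct.new(:sha, :author, :committer, :time, :added_methods,
                    keyword_init: true)

# Returns a list of human-readable problems. An empty list means the
# history looks like step-by-step work by the given student.
def commit_history_problems(commits, student:, required_methods:,
                            min_gap_seconds: 300, not_before: nil)
  problems = []
  ordered = commits.sort_by(&:time)

  required_methods.each do |name|
    commit = ordered.find { |c| c.added_methods.include?(name) }
    if commit.nil?
      problems << "no commit adds #{name}"
      next
    end
    if commit.added_methods.size > 1
      problems << "#{name} was committed together with other methods"
    end
    # Either role counts, so that both halves of a pair get credit.
    unless [commit.author, commit.committer].include?(student)
      problems << "#{name} was not authored or committed by #{student}"
    end
  end

  # Consecutive commits should be spaced out in time.
  ordered.each_cons(2) do |earlier, later|
    if later.time - earlier.time < min_gap_seconds
      problems << "commits #{earlier.sha} and #{later.sha} are too close together"
    end
  end

  # Optionally reject commits older than the current course run.
  if not_before && ordered.any? { |c| c.time < not_before }
    problems << 'some commits predate the start of the course run'
  end

  problems
end

# Made-up histories: one built up step by step while pairing, one pasted in.
stepwise = [
  Commit.new(sha: 'a1', author: 'alice', committer: 'alice',
             time: Time.parse('2016-07-18 10:00:00 UTC'), added_methods: ['sum']),
  Commit.new(sha: 'b2', author: 'bob', committer: 'alice',
             time: Time.parse('2016-07-18 10:40:00 UTC'), added_methods: ['max_2_sum'])
]
pasted = [
  Commit.new(sha: 'c3', author: 'alice', committer: 'alice',
             time: Time.parse('2016-07-18 11:00:00 UTC'),
             added_methods: %w[sum max_2_sum])
]

stepwise_problems = commit_history_problems(stepwise, student: 'alice',
                                            required_methods: %w[sum max_2_sum])
pasted_problems = commit_history_problems(pasted, student: 'alice',
                                          required_methods: %w[sum max_2_sum])
puts stepwise_problems.inspect
puts pasted_problems.inspect
```

Returning a list of problems rather than a pass/fail flag fits the earlier goal of showing students what went wrong instead of a bare "had an error".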

Using existing CI systems effectively removes a large portion of the existing grader architecture, i.e. the complex sandboxing and forking of processes. This removes a critical maintenance burden … since that isolation is provided reliably and for free by the many available CI services (Travis, Semaphore, CodeShip etc.). Students also start to experience industry standard tools that will help them pass interviews and land jobs. The most serious criticism of the idea is that students won’t be trying to solve the problems themselves, but google any aspect of our assignments and you will already find links like this. The danger of the arms race to keep solutions secret is that we burn all our resources on that, while preventing students from learning by reviewing different solutions to the problem.

I’m sure I’m highly biased, but it feels to me that having students submit a video of themselves pairing on the problem, along with a technical check to ensure they’ve submitted the right sorts of git commits, will reap dividends in terms of students learning the process of coding. Ultimately the big win would be checking that the tests were written before the code, which could be checked by asking students to commit the failing tests, and then commit the code that makes them pass. Not ideal practice on the master branch, but acceptable for pedagogical purposes perhaps … especially if we are checking for feature branches, and then even that those sets of commits are squashed onto master to ensure it always stays green …
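
A sketch of what that tests-first check might look like, assuming each commit has been summarised as the files it touched plus the CI result for that commit. The `:ci` values mirror the commit status states `failure` and `success`; the spec file naming convention and the method names are my own assumptions.

```ruby
# Assumption: spec files live under spec/ and end in _spec.rb.
def spec_file?(path)
  path.start_with?('spec/') && path.end_with?('_spec.rb')
end

# Finds "red then green" pairs of consecutive commits: a commit that
# touches only specs and fails CI, followed directly by a commit that
# touches non-spec code and passes CI.
# `commits` is an array of hashes, oldest first.
def red_green_pairs(commits)
  commits.each_cons(2).select do |test_commit, code_commit|
    test_commit[:files].any? &&
      test_commit[:files].all? { |f| spec_file?(f) } &&
      test_commit[:ci] == 'failure' &&
      code_commit[:files].any? { |f| !spec_file?(f) } &&
      code_commit[:ci] == 'success'
  end
end

# Made-up history for illustration.
history = [
  { sha: 't1', files: ['spec/sum_spec.rb'], ci: 'failure' },
  { sha: 'c1', files: ['lib/ruby_intro.rb'], ci: 'success' },
  { sha: 'c2', files: ['lib/ruby_intro.rb', 'spec/max_2_sum_spec.rb'], ci: 'success' }
]

pairs = red_green_pairs(history).map { |t, c| [t[:sha], c[:sha]] }
puts pairs.inspect
```

Here only the first two commits count as a red/green pair; the third mixes spec and code in one commit, so it gives no evidence that the test came first.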


I also reflect that it might be more efficient to use webhooks on the GitHub repos in question, rather than repeatedly querying the API (which is rate limited). We’d need our centralised autograder to store the data about all the student PRs so that we could ensure that each student’s submission was checked in a timely fashion.
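
As a minimal sketch of the receiving end, the autograder would want to check that a delivery really came from GitHub (the `X-Hub-Signature-256` header carries an HMAC-SHA256 of the raw body, keyed with the webhook secret) and then record the PR against the repo and student. The plain Hash standing in for the datastore, the method names, and the sample repo and secret are all placeholders of mine.

```ruby
require 'json'
require 'openssl'

# Recomputes the HMAC-SHA256 of the raw request body and compares it
# with the value sent in the X-Hub-Signature-256 header.
def valid_signature?(secret, raw_body, signature_header)
  digest = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, raw_body)
  expected = "sha256=#{digest}"
  return false unless signature_header.is_a?(String) &&
                      signature_header.bytesize == expected.bytesize

  # Compare every byte so timing does not reveal where a mismatch occurs.
  diff = 0
  expected.bytes.zip(signature_header.bytes) { |a, b| diff |= a ^ b }
  diff.zero?
end

# Stores the latest PR head for (repo, student). `store` is a plain Hash
# here; a real autograder would use a database (hypothetical).
def record_pull_request_event(store, raw_body)
  event = JSON.parse(raw_body)
  # Only actions that can change what needs grading are recorded.
  return nil unless %w[opened synchronize reopened].include?(event['action'])

  key = [event.dig('repository', 'full_name'),
         event.dig('pull_request', 'user', 'login')]
  store[key] = { number: event['number'],
                 sha: event.dig('pull_request', 'head', 'sha') }
end

# Made-up delivery containing only the payload fields used above.
secret = 's3cret'
body = JSON.generate(
  'action' => 'opened',
  'number' => 3,
  'repository' => { 'full_name' => 'example-org/ruby-intro' },
  'pull_request' => { 'user' => { 'login' => 'alice' },
                      'head' => { 'sha' => 'abc123' } }
)
header = 'sha256=' + OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret, body)

store = {}
record_pull_request_event(store, body) if valid_signature?(secret, body, header)
puts store.inspect
```

Keying the store on repo plus username matches the lookup described earlier, so the grader can answer a student's submission from stored data without spending API quota.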
