Assignment: The Artifact Evaluation Committee

For this assignment, you are asked to help with artifact evaluation for a conference.

Background: Artifact Evaluation

In recent years, the systems community has put more and more emphasis on the availability and reproducibility of research results. After a paper has been accepted for publication at a conference, the authors may opt in to submit an artifact of their work. The artifact includes the code, data, traces, scripts, and other material necessary to reproduce the results presented in the paper.

The authors of the paper write an artifact appendix that includes instructions on how to evaluate the artifact and the requirements (e.g., hardware, software, time) needed to reproduce the results. The artifact appendix is often added to the paper itself. The artifact is often placed in a public git repository (e.g., GitHub) or a public archive (e.g., Zenodo).

After submission of the artifact, the artifact evaluation committee (AEC) attempts to evaluate the artifact by following the instructions in the artifact appendix. The AEC runs the provided scripts to compile, run, and evaluate the artifact, and compares the obtained results with the results in the paper. Depending on the outcome, the AEC may award the paper a set of badges that indicate whether the artifact is available, whether it is runnable, and whether the results are reproducible.

Evaluating an artifact is not an easy task: running code on a different machine with a slightly different configuration may lead to surprising results and unhandled failures. Hence, this assignment.

Tasks Summary

Select a paper and the part of its evaluation that you will reproduce.
Write a critique of the paper.
Reproduce the results.
Discuss the results.
Submit your report (see formatting guidelines).

Selecting a Paper

Generally, you are free to select any systems paper that you like. However, it must meet all of the following criteria:

You are not an author of the paper.
The paper is an evaluation paper: it includes data (graphs, tables, numbers).
The paper has something to do with systems.
There is some form of code or an artifact available that you can run.

Once you have selected the paper, you will need to define what you are going to reproduce. This should involve running some code to re-create a graph or table of the paper. See the Reproduction section for more details on this.

Where to look for papers? It can be hard to find a paper that meets the criteria above, especially if you have just started graduate school. We included a list of paper suggestions below. Alternatively, you may look at the papers we read. Finally, you may consider visiting Systems Research Artifacts, or the proceedings of the systems conferences (e.g., OSDI, SOSP, ASPLOS, USENIX ATC, EuroSys, FAST, NSDI).

Task: As soon as you have selected your paper, please send us an email telling us which paper you have selected and what part of it you are going to reproduce. Please include a link to the paper in the email.

Paper Critique

After you've selected the paper, you need to read it carefully and write a critique of the paper before reproducing the results. Focus your attention on the research methodology more than the research idea. Your critique will probably be on the order of one to two pages, although there will be exceptions.

In your critique you must at least answer the following questions.

What is the purpose of the paper?
What is the hypothesis that the authors are testing?
What is the experimental setup?
What is good/bad about the experimental setup?
How well was the research carried out? What results are presented?
Do you believe the results? Why/Why not?
What things might you have done differently?
What lessons did you learn from reading this paper critically?

Task: Read the paper and write a critique answering the questions above.

Reproduction

After you have written your critique of the paper, you can start reproducing the results of the paper. The target is a graph or a table, and reproducing it will involve running something.

Note that the goal of this exercise is to understand systems research, writing, and reproducibility. Many papers have published artifacts that should help with the task at hand. However, this doesn't mean that you can simply run some scripts. The artifacts have been created by the authors to help reproduce the results of the paper at the same scale and with the same configuration as described in the paper.

You will need to reproduce the results of the paper on your own machine (e.g., a standard laptop or desktop). This means you will need to understand the platform differences and the benchmark well enough to adapt it to fit your platform. You will need to explain and justify how you adapted the scripts and the obtained results.
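
One way to make the platform differences explicit is to record a small snapshot of your own machine next to every run. The following is only a minimal sketch using the Python standard library; the function name `describe_platform` and the chosen fields are illustrative assumptions, not a required format, and you will likely want to add details that matter for your experiment (e.g., memory size, storage device, kernel configuration).

```python
import json
import os
import platform
import sys


def describe_platform():
    """Collect basic facts about the machine running the experiment.

    The set of fields is an illustrative assumption; extend it with
    whatever your chosen paper's setup section reports.
    """
    return {
        "os": platform.system(),            # e.g. "Linux", "Darwin", "Windows"
        "os_release": platform.release(),   # kernel / OS release string
        "machine": platform.machine(),      # CPU architecture string
        "processor": platform.processor(),  # may be empty on some systems
        "logical_cpus": os.cpu_count(),     # may be None if undetermined
        "python_version": platform.python_version(),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be stored with the raw results.
    json.dump(describe_platform(), sys.stdout, indent=2)
    print()
```

Saving this output together with your raw measurements makes it easier to compare your setup with the one described in the paper when you write the report.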

Be careful to articulate any hidden assumptions that you make. Think hard about how to interpret your results given different hardware and software configurations. You may take advantage of data and/or tools that have been made available by the authors, but you may not do so to the extent that there is no work left to the assignment.

Task: Adapt and run an experiment of the paper on your own machine (laptop or desktop).

Discussion

The last step of the assignment is to write a report that describes your efforts and results in the reproduction of the paper.

In your report, you must answer at least the following questions.

What experiment are you trying to reproduce? Identify the corresponding tables/graphs from the original paper.
What experimental setup did you use to reproduce the paper? Explain how it differs from the platform used in the paper.
What are the tools and/or traces that you have used? Give versions where applicable; compare with the ones of the original paper.
What assumptions have you made about information not stated in the paper?
How did you adapt the parameters and configurations to run the artifact on your platform? Justify your selection.
How do your results compare with the ones in the paper? Discuss your results and explain why they differ, or why they match. Include your data (graph or table).
What's your verdict? Discuss the reproducibility of the results.
Did your assessment of the paper change after trying to reproduce the results? Why? Why not?
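
When comparing your numbers with the paper's, it can help to tabulate both side by side with a signed relative difference. The sketch below is one possible way to do that, not a prescribed method; the function names and the data in the example are hypothetical placeholders, not values from any paper. Keep in mind that absolute numbers rarely match across platforms, so trends and ratios are often more informative than raw values.

```python
def relative_difference(paper_value, measured_value):
    """Return (measured - paper) / paper as a signed fraction."""
    if paper_value == 0:
        raise ValueError("paper value is zero; relative difference is undefined")
    return (measured_value - paper_value) / paper_value


def compare_results(paper, measured):
    """Pair up data points that appear in both dictionaries.

    Both arguments map a data-point label to a number. Returns a list of
    (label, paper_value, measured_value, relative_difference) tuples, in
    the order of the labels in `paper`.
    """
    rows = []
    for label, paper_value in paper.items():
        if label in measured:
            measured_value = measured[label]
            rows.append((label, paper_value, measured_value,
                         relative_difference(paper_value, measured_value)))
    return rows


if __name__ == "__main__":
    # Hypothetical placeholder data, NOT taken from any paper.
    paper = {"config A": 100.0, "config B": 200.0}
    measured = {"config A": 80.0, "config B": 210.0}
    for label, p, m, diff in compare_results(paper, measured):
        print(f"{label}: paper={p:.1f} measured={m:.1f} diff={diff:+.1%}")
```

Data points that you could not reproduce are simply omitted from the table here; in the report, list them explicitly and explain why they are missing.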

Task: Write the reproduction report answering the questions above.

Submission

Prepare the submission of your assignment.

The PDF document must be no longer than 3 pages including figures and tables, plus as many pages as needed for references (see Formatting Guidelines).
Submit the PDF following the submission instructions.

List of Papers

You may pick a paper from our reading list, from the following papers, or pick one yourself. If you pick a different one, come and talk to us before you start doing your artifact evaluation assignment.

Agache 2020: Firecracker: lightweight virtualization for serverless applications. Artifact is here.
Amit 2017: Optimizing the TLB Shootdown Algorithm with Page Access Tracking
Amit 2019: JumpSwitches: Restoring the Performance of Indirect Branches In the Era of Spectre
Balmau 2017: TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. Reproduce any figure numbered 9 or greater.
Blake 2003: High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two (appeared in the 2003 Hot Topics in Operating Systems). Reproduce the graph in Section 4.1.
Cadar 2008: KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. Download their tool (it's not on a Stanford site, it's at llvm.org) and try it on some of the workloads they used.
Curtsinger 2015: COZ: Finding Code that Counts with Causal Profiling. Reproduce any of the examples of COZ profiling. If that works seamlessly, use COZ to evaluate something that they did not evaluate in the paper and report on what you learned about it.
Cutler 2018: The benefits and costs of writing a POSIX kernel in a high-level language. The code for this project is available here. See if you can reproduce some of their measurements about kernel functionality.
Harnik 2013: To Zip or not to Zip: Effective Resource Usage for Real-Time Compression. See if you can get the same kinds of compression timings that the authors got.
Jamet 2020: Characterizing the impact of last-level cache replacement policies on big-data workloads. In theory, it should be easy to use the simulator and tracer used in this study to reproduce their results exactly. See if theory meets practice. If so, analyze a different benchmark using their tools!
Kadekodi 2018: Geriatrix: Aging what you see and what you don’t see. A file system aging approach for modern storage systems. There are so many graphs from which to choose; see if you can reproduce some runtime results on an aged file system.
Koller 2013: Write Policies for Host-side Flash Caches. Start with the analytical results from Figure 1. Then see if you can put together a system that looks something like what the authors did and see if you can run any of their benchmarks.
Kyrola 2012: GraphChi: Large-Scale Graph Computation on Just a PC. Most of the graphs from this paper are available from the SNAP repository and many of the systems against which to compare are open source.
Lawall 2022: OS scheduling with Nest: keeping tasks close together on warm cores. This has undergone artifact evaluation, so this is a type 3 project. You need to make sure you are running on a very different platform. Then you need to explain your results relative to those in the paper.
Lozi 2016: The Linux Scheduler: a Decade of Wasted Cores. This paper has a collection of different graphs illustrating several interesting behaviors of the Linux scheduler; see if the behavior described still exists.
Mao 2012: Cache Craftiness for Fast Multicore Key-Value Storage. The software described here is available here. See if you can reproduce any of Figures 9-11.
Min 2016: Understanding Manycore Scalability of File Systems. You can pretty much try to reproduce anything in these figures!
Ren 2019: An Analysis of Performance Evolution of Linux’s Core Operations
Roghanchi 2017: ffwd: delegation is (much) faster than you think. This paper explores different ways to consistently handle access to shared memory. See if you can reproduce any of the benchmarks in the first three or four figures. Code is available here.
Roy 2013: X-Stream: Edge-centric Graph Processing using Streaming Partitions. This paper has a lot of different data - not just run time. Trying to reproduce it should be, um, fun.
Sumbaly 2012: Serving Large-scale Batch Computed Data with Project Voldemort. Using the publicly available Voldemort and MySQL releases, see if you can reproduce any of the graphs in the evaluation.
Vangoor 2017: To FUSE or Not to FUSE: Performance of User-Space File Systems. See if you can reproduce a few of the results from Table 3 on any system to which you have access.
Volos 2014: Aerie: flexible file-system interfaces to storage-class memory. See if you can reproduce Figure 1.
Wu 2018: Anna: A KVS For Any Scale. The code for this system is available here. Can you reproduce any of the comparisons with Redis or Cassandra or any other widely used KV store?
Zhao 2016: Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle. Pick one workload used in the paper and see if you can reproduce it. Can you run workloads not in the paper?