


Mining Software Repositories


Many companies and open source projects record large portions of their development and change activities. Versioning systems such as CVS, Subversion and ClearCase, for example, are widely adopted to record how the code changes. Furthermore, these versioning systems are often associated with a change management system such as Bugzilla or IssueZilla. Together, these systems record the evolution of a software system.

The mining software repositories community believes that this evolution information provides valuable insight into the current software system. The evolution of an entity over time, for example, helps explain the reasoning behind that entity. This information is first extracted from the repositories so that the resulting knowledge can be used to improve future development.

This lab session serves as a first contact with this newly emerging domain. You will learn what kind of information can be extracted from a software repository. To gain this understanding, you will work with initial prototypes.


Task 1: Query the Versioning System

Many software projects use a versioning system such as CVS or SVN as the central location where developers contribute their individual changes. All changes in the entire project history are stored within this system. Hence, the information it contains can provide us with interesting development knowledge. First, we will use simple queries to identify interesting facts about the software system at hand.

Exercise 1

Study the information provided by the extracted CVS log (file: CVS.log)


  • What's the project you're looking at?
  • What valuable information can you extract?
  • (How) can you use this information when applying the following reengineering patterns:
    • Read all the code in one hour
    • Chat with the maintainers
    • Study the exceptional entities

Exercise 2

A plain log file doesn't scale well. Instead, one should represent the data in a more query-friendly format. The file project.csv was created by converting the previous CVS log file into a comma-separated file format. Read the file into a spreadsheet or database.


  • Which files were recently changed?
  • Which files are unstable?
  • Which developer can you contact for more information about org/gjt/sp/jedit/?
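
If you prefer a script over a spreadsheet, the three questions can be sketched as follows. The column names `file`, `author` and `date` (in ISO format) are assumptions, as are the sample rows and author names; check the header of project.csv and adapt accordingly. The change count is only a rough proxy for "unstable".

```python
import csv
import io
from collections import Counter

# Hypothetical rows; the real project.csv may use other columns and formats.
SAMPLE_CSV = """\
file,author,date
org/gjt/sp/jedit/Buffer.java,alice,2004-03-01
org/gjt/sp/jedit/Buffer.java,alice,2004-03-05
org/gjt/sp/jedit/View.java,bob,2004-03-03
doc/README.txt,carol,2004-02-20
org/gjt/sp/jedit/Buffer.java,bob,2004-03-07
"""


def load_rows(handle):
    """Read all rows of a CSV file-like object as dictionaries."""
    return list(csv.DictReader(handle))


def recently_changed(rows, limit=3):
    """Files ordered by their most recent change, newest first."""
    latest = {}
    for row in rows:
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        latest[row["file"]] = max(latest.get(row["file"], ""), row["date"])
    return sorted(latest, key=latest.get, reverse=True)[:limit]


def change_counts(rows):
    """Number of recorded changes per file."""
    return Counter(row["file"] for row in rows)


def authors_for(rows, prefix):
    """Number of changes per author for files under the given path prefix."""
    return Counter(row["author"] for row in rows if row["file"].startswith(prefix))


ROWS = load_rows(io.StringIO(SAMPLE_CSV))
RECENT = recently_changed(ROWS)
COUNTS = change_counts(ROWS)
JEDIT_AUTHORS = authors_for(ROWS, "org/gjt/sp/jedit/")
```

For the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened project.csv.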

Task 2: First Contact Visualizations

Writing queries to understand a software system can be quite daunting, especially during your first contact with the system. Hence, we'll introduce a simple visualization of the versioning system.

Exercise 3

Create a scatterplot-like visualization of the CVS log data, with time on one axis and the file-id on the other. A dot indicates that a file was changed at that time.
  • For each entry, select the fileid and the date (as a number)
  • Visualize the data, e.g. in GNUPlot
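
The two steps above can be sketched as follows, assuming each change is available as a (file name, ISO date) pair. File-ids are assigned in order of first appearance and dates become "days since the first change". The sample changes and function names are hypothetical.

```python
from datetime import datetime


def scatter_points(changes):
    """Map (file, 'YYYY-MM-DD') pairs to (days since first change, file id)."""
    parsed = [(name, datetime.strptime(day, "%Y-%m-%d")) for name, day in changes]
    start = min(when for _, when in parsed)
    file_ids = {}
    points = []
    for name, when in parsed:
        # A file gets the next free id the first time it is seen.
        file_id = file_ids.setdefault(name, len(file_ids))
        points.append(((when - start).days, file_id))
    return points, file_ids


def write_gnuplot_data(points, path):
    """Write one 'x y' pair per line, the plain format gnuplot reads."""
    with open(path, "w") as out:
        for day, file_id in points:
            out.write(f"{day} {file_id}\n")


# Hypothetical sample changes.
CHANGES = [
    ("Buffer.java", "2004-03-01"),
    ("View.java", "2004-03-03"),
    ("Buffer.java", "2004-03-07"),
]
POINTS, FILE_IDS = scatter_points(CHANGES)
```

After calling `write_gnuplot_data(POINTS, "changes.dat")`, the data can be shown in gnuplot with `plot "changes.dat" using 1:2 with points`. Sorting files by directory before assigning ids would keep files of the same package close together on the axis.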


  • Can you relate the visualization to the previous queries?
  • Do you identify other information?
  • Which patterns would you define?
  • What other information would you add to the visualization? Why?

Exercise 4

Create a CVSGrab visualization of a software repository (or the built-in demo repository).


  • Which metrics are shown on the horizontal and vertical axis?
  • Can you explain the visualization?
  • Which developers have most knowledge concerning a file?
  • Does developer X have insight into a certain file, subsystem or the entire system?
  • Which files have been (primarily) growing/shrinking/changing?

Task 3: Coupling, the Evolutionary View

Coupling is probably the most frequently mentioned quality attribute. In the past you have seen metrics that calculate the degree of coupling for a class. Now we're going to explore the evolutionary view of coupling, i.e. two files are coupled when they are, or have to be, adapted together. According to recent studies, these historical calculations are often more accurate than static calculations.
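
As a rough illustration of evolutionary coupling, the sketch below counts how often two files change together. CVS records revisions per file, so changes are first grouped into approximate commits: consecutive changes by the same author that lie within a short time window of each other. The three-minute window, the function names and the sample data are assumptions chosen for illustration; this is not how CCVisu computes its input.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations


def group_transactions(changes, window=timedelta(minutes=3)):
    """Group (file, author, datetime) changes into approximate commits."""
    transactions = []
    open_by_author = {}  # author -> (time of last change, set of files)
    for name, author, when in sorted(changes, key=lambda change: change[2]):
        entry = open_by_author.get(author)
        if entry and when - entry[0] <= window:
            # Same author, close in time: extend the open transaction.
            entry[1].add(name)
            open_by_author[author] = (when, entry[1])
        else:
            files = {name}
            transactions.append(files)
            open_by_author[author] = (when, files)
    return transactions


def co_change_counts(transactions):
    """Count, for each pair of files, the transactions containing both."""
    counts = Counter()
    for files in transactions:
        for pair in combinations(sorted(files), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample changes.
CHANGES = [
    ("A.java", "alice", datetime(2004, 3, 1, 10, 0, 0)),
    ("B.java", "alice", datetime(2004, 3, 1, 10, 0, 30)),
    ("C.java", "bob", datetime(2004, 3, 1, 10, 1, 0)),
    ("A.java", "alice", datetime(2004, 3, 5, 9, 0, 0)),
    ("B.java", "alice", datetime(2004, 3, 5, 9, 1, 0)),
    ("C.java", "alice", datetime(2004, 3, 5, 9, 2, 0)),
]
TRANSACTIONS = group_transactions(CHANGES)
COUPLING = co_change_counts(TRANSACTIONS)
```

Pairs with a high count are candidates for evolutionary coupling. Raw counts favour frequently changed files, so dividing by the number of changes of each file gives a fairer picture.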

Exercise 5

Visualize a software system's CVS log using Beyer's CCVisu.


  • How does CCVisu show the coupling between elements?
  • What does the visualization teach you about the coupling in the system? (TIP: use coloring to identify packages)
  • What are the disadvantages of the technique?


  • All files you need for these exercises

 Lab On REengineering - Antwerpen, last modified 13:46:23 16 March 2011