- The students know
  - why developers duplicate code,
  - the problems that can result from duplicating code,
  - that duplicating code can sometimes be justified,
  - that there exist tools to detect and visualize code cloning in software systems,
  - and the advantages and disadvantages of exact string matching and token-based comparison algorithms.
- The students are able to
  - detect and visualize code cloning in software systems using CCFinderX,
  - use the metrics that CCFinderX calculates to remove false positives,
  - select code clones that are refactoring candidates,
  - and combine code clones into one function/method to remove them.
- The students are aware of problems with the application of the tools and shortcomings that are typical for solutions in a relatively young research domain.
Task 1: Introduction
Listen to the assistant's introduction. The slides are available in the Documents section.
Task 2: Small scale detection
We start with the ultimate clone detector: the programmer.
Open the class DuplicationSuspect.java in an editor. The file is available in the Documents section.
- Can you detect duplication with your bare eyes?
- Which methods seem to be similar?
Manual clone detection does not scale very well. We will therefore use some tools to do the tedious comparisons.
Run the simpleDude.pl script on the DuplicationSuspect.java file. The script is available in the Documents section.
simpleDude.pl DuplicationSuspect.java > report.txt
Try changing the parameter $slidingWindowSize at the beginning of the simpleDude.pl script from 10 to something higher, e.g. 20 or 30, and compare the reports.
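To get a feeling for what a sliding window parameter does, here is a minimal sketch of line-based exact matching in Python. It is not the simpleDude.pl implementation. It assumes that whitespace is ignored, that empty lines are skipped, and that a clone is reported whenever two windows of `window_size` consecutive lines are identical. The names `normalize`, `find_duplicates` and `SAMPLE` are made up for this illustration.

```python
from collections import defaultdict


def normalize(line):
    """Remove all whitespace so that indentation changes do not hide a match."""
    return "".join(line.split())


def find_duplicates(lines, window_size=3):
    """Return sorted pairs (i, j), i < j, of 0-based line indices at which
    two identical windows of `window_size` non-empty lines start."""
    # Keep only non-empty lines, remembering their original positions.
    kept = [(idx, normalize(l)) for idx, l in enumerate(lines) if normalize(l)]
    seen = defaultdict(list)  # window content -> list of start positions
    for k in range(len(kept) - window_size + 1):
        window = tuple(text for _, text in kept[k:k + window_size])
        seen[window].append(kept[k][0])
    pairs = []
    for starts in seen.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Hypothetical input: lines 1-3 are repeated at lines 5-7 (with one
# indentation change, which normalization removes).
SAMPLE = """\
int a = 1;
int b = 2;
print(a + b);
return;
int a = 1;
  int b = 2;
print(a + b);
""".splitlines()

if __name__ == "__main__":
    for first, second in find_duplicates(SAMPLE, window_size=3):
        print(f"window starting at line {first + 1} repeats at line {second + 1}")
```

With `window_size=3` the sketch reports one duplicated window. With `window_size=4` it reports nothing, because no four consecutive lines are repeated. A larger window therefore suppresses short clones.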
- Did the tool detect more/less duplication than you?
- What are the problems with this way of reporting duplication?
- The detector uses exact string matching as a comparison mechanism. What are the consequences of that?
A picture is worth a thousand words, and reengineering is no exception. We use the Dotplot visualization (presented in the lecture) to get a better overview of the cloning activity in the file.
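The Dotplot idea mentioned above can be tried out on a few lines of text before looking at a real tool. The following Python sketch is only an illustration and is not how CCFinderX draws its plots. It compares every line with every other line, ignoring whitespace, and marks equal pairs with `#`. The function name `dotplot` and the sample lines are hypothetical.

```python
def dotplot(lines):
    """Return the dotplot as a list of strings: row i, column j holds '#'
    if line i and line j are equal (ignoring whitespace), otherwise '.'."""
    keys = ["".join(l.split()) for l in lines]
    return [
        "".join("#" if a and a == b else "." for b in keys)
        for a in keys
    ]


# Hypothetical input: the first two lines are copied at the end.
SAMPLE = ["x = 1", "y = 2", "z = x + y", "x = 1", "y = 2"]

if __name__ == "__main__":
    print("\n".join(dotplot(SAMPLE)))
```

Copied runs of lines show up as short diagonal segments away from the main diagonal. Compare the printed pattern with what you see later in the tool.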
Start CCFinderX by double-clicking gemx.bat in the bin directory. Note that the path of the installation directory must not contain a space character.
- (1) File -> Detect Clones
- (2) Select preprocess script of target source file -> java
- (3) Select a root directory of the target source files -> DuplicationLab
- (4) Specify detection options by this dialog -> Keep the default settings
Look at the scatter plot view of CCFinderX.
Questions about the Scatter Plot view:
- What does the middle diagonal mean?
- How many clone classes can you distinguish?
Questions about the Source Text View:
- How many clone pairs does the first clone set comprise?
- Can you determine any refactoring candidates from the source code view?
Look in the Quick guide of CCFinderX (in the doc directory below the CCFinderX root directory) for an explanation of the following detection options:
- Minimum Clone length
- Minimum TKS
- Shaper Level
- P-match Application
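These options belong to a token-based detector. To see why token-based comparison can find clones that exact string matching misses, consider the following Python sketch. It is a strong simplification and not the CCFinderX algorithm: it assumes that every identifier is replaced by the placeholder `ID` and every number by `NUM` before fragments are compared. The regular expression, the keyword subset and the name `normalize_tokens` are made up for this illustration.

```python
import re

# Small illustrative subset of keywords; a real tokenizer knows all of them.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

# An identifier/keyword, a run of digits, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_tokens(code):
    """Split `code` into tokens and hide the concrete names and numbers."""
    out = []
    for tok in TOKEN.findall(code):
        if tok in KEYWORDS:
            out.append(tok)       # keywords are kept as they are
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")      # any identifier
        elif tok[0].isdigit():
            out.append("NUM")     # any number
        else:
            out.append(tok)       # operators and punctuation
    return out


# Two hypothetical statements that differ in names, spacing and a constant.
A = "int total = price * 2;"
B = "int sum   = cost * 10;"

if __name__ == "__main__":
    print(A == B)                                      # exact string match
    print(normalize_tokens(A) == normalize_tokens(B))  # token-based match
```

The price of this tolerance is that code which only shares its structure is also reported, which is one source of false positives.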
Task 3: Large scale detection
Now, with a bit of experience, we feel ready to take on a real-world system. In the lab directory there are four systems that can be investigated: FreeMercator, MegaMek, PostgreSQL and Quake3. The source code archives are available in the Documents section. If you want to investigate a project of your own, even better.
First we do the duplication analysis of FreeMercator together. Unpack the archive.
Compute the duplication of FreeMercator. Start with the default minimum clone length of 50 tokens. If you think that too many small clones are found, you can increase this minimum.
Questions for the Visual Analysis View:
- Which groups of files that are obviously interconnected with each other can you see?
- Use zooming, the selection of specific file pairs, and the source code panel to investigate the kind of duplication found in these groups.
- Can you find a file that is (almost) a complete copy of another? Where in the dotplot do you have to look for such an occurrence?
- Once you have detected the code clones and want to find refactoring candidates, what is the natural thing to look for in the clones?
With an overwhelming number of reported clones, we need other means to help us with the analysis. CCFinderX has a Metric Analysis View, which lets us filter the clones using a number of metrics. The following metrics are offered to describe S, which is a set of code fragments that are copies of each other (S is also called a clone class, and each member is a clone):
- RAD(S) is the degree of distribution of the clones in a clone set S in the file system. If all fragments are in the same file, RAD(S)=0. If all fragments are in different files in the same directory, then RAD(S)=1.
- LEN(S) is the average length (number of tokens) of the clones in S.
- RNR(S) represents how much of the code of the clones in S is non-repeated. A low value of RNR means that a large part of the clone is repeated code.
- NIF(S) is the number of files which include at least one code fragment from S.
- POP(S) is the number of code fragments (population) in S.
- LOOP, COND, McCabe: LOOP is defined as the count of loops in a code fragment, COND is defined as the count of conditional branches, and McCabe is defined as the sum of the two. To focus attention on complex code, select code clones with high values of these metrics.
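To make some of these definitions concrete, the following Python sketch computes POP, NIF, LEN and the McCabe sum for a clone set, and filters clone sets by minimum values. It follows only the definitions given above. The data layout (`Fragment`), the function names and the file names are hypothetical, and CCFinderX may compute its values differently in detail. RAD and RNR are left out because they need more information than this toy model holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    path: str    # file that contains the fragment
    tokens: int  # length of the fragment in tokens
    loops: int   # number of loops in the fragment
    conds: int   # number of conditional branches in the fragment


def pop(clone_set):
    """POP(S): number of code fragments in S."""
    return len(clone_set)


def nif(clone_set):
    """NIF(S): number of distinct files that contain a fragment of S."""
    return len({frag.path for frag in clone_set})


def avg_len(clone_set):
    """LEN(S): average fragment length in tokens."""
    return sum(frag.tokens for frag in clone_set) / len(clone_set)


def mccabe(frag):
    """McCabe as defined above: loops plus conditional branches."""
    return frag.loops + frag.conds


def select(clone_sets, min_len=0, min_nif=1):
    """Keep the clone sets whose LEN and NIF reach the given minimums."""
    return [s for s in clone_sets
            if avg_len(s) >= min_len and nif(s) >= min_nif]


# Hypothetical clone sets; the file names are invented.
S1 = [Fragment("ui/Map.java", 60, 1, 2),
      Fragment("ui/Map.java", 64, 1, 2),
      Fragment("net/Server.java", 62, 1, 2)]
S2 = [Fragment("ui/Map.java", 50, 0, 0),
      Fragment("ui/Map.java", 52, 0, 0)]

if __name__ == "__main__":
    print(pop(S1), nif(S1), avg_len(S1), mccabe(S1[0]))
    print(len(select([S1, S2], min_len=55, min_nif=2)))
```

Raising `min_len` or `min_nif` corresponds to moving the lower bound of a metric in the Metric Analysis View: short clone sets, or clone sets confined to one file, drop out of the selection.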
Play around with the Metrics view by changing the maximum and minimum values of the different metrics. Try to isolate a few clone classes. Look at their members in the Source Code view of the Metric Analysis View.
- Which selection seems to remove false positives best?
- You are on the lookout for clones that can be refactored easily. Which selection of metric values seems to lead to these clones?
Task 4: Refactoring
To make this a reengineering lab, we will refactor some duplication.
Choose some of the clones found in the previous task and remove them, i.e. combine them into a single function/method.
Questions to guide the refactoring process:
- How much of the code belongs to the clone? Only the part that is actually copied?
- Which parts of the common code need to be abstracted to make the code work in general? How many parameters do we need to pass to the extracted functionality?
- To which class does the functionality belong? Are the original places related via inheritance relationships that we can exploit (move to superclass)? Do we need to create a new class?
- How sure are you that your refactoring did not change the behaviour of the system?
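The questions above can be walked through on a toy example. The following Python sketch is purely illustrative and is not taken from any of the lab systems. It shows two cloned functions that differ only in one constant, the extracted function with that constant turned into a parameter, and a small check that the refactored code returns the same results as the originals. All names and values are hypothetical.

```python
# Before: two clones that differ only in the factor.
def total_with_fee_standard(prices):
    total = 0.0
    for p in prices:
        total += p
    return round(total * 1.05, 2)


def total_with_fee_express(prices):
    total = 0.0
    for p in prices:
        total += p
    return round(total * 1.10, 2)


# After: the common code is extracted, the difference becomes a parameter.
def total_with_fee(prices, factor):
    total = 0.0
    for p in prices:
        total += p
    return round(total * factor, 2)


def behaviour_unchanged(samples):
    """Compare the extracted function with both originals on sample inputs."""
    return all(
        total_with_fee(s, 1.05) == total_with_fee_standard(s)
        and total_with_fee(s, 1.10) == total_with_fee_express(s)
        for s in samples
    )


if __name__ == "__main__":
    print(behaviour_unchanged([[], [1.0], [10.0, 20.0], [0.1, 0.2, 0.3]]))
```

The loop is deliberately kept unchanged in the extracted function. Replacing it with a different summation would be a second change, and it could alter floating-point results. A check like `behaviour_unchanged` only covers the inputs you try, so it raises confidence but does not prove that the behaviour is preserved.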
Task 5: Iteration
Redo tasks 3 and 4, but examine another of the given projects or one of your own.
Task 6: Final Discussion
Discuss the following questions in class.
- How many false positives (clones which you as a programmer would not name as such) has the detector found?
- Were the filters offered by the detector enough to get to the true positives?
- Did you feel that the filters also removed some true positives?
- What were the reasons you could not refactor some clones?
- Could a tool detect the characteristics that are detrimental to the refactorability of clones?
- What are the shortcomings of the tools you used?
- What feature do you miss most?
- If you looked at your own code: have you found any striking examples?
- Will you pay more attention to duplication in your own programming in the future?
- The slides of the code duplication introduction.
- The list of files/systems under examination:
- The list of scripts/tools for code duplication detection used in the exercises:
- The pdf version of the practicum.