Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code for quickly detecting code hotspots which need to addressed first. The techniques I am using are different than traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs etc). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In this process, I found a need for a good 'code duplication detector'.

There are two good code duplication detector already available.
  1. CPD (Copy Paste Detector) from PMD project.
  2. Simian from RedHill Consulting.
I am big fan of reuse and try to avoid rewrites as much as possible. Still in this case both tools were not appropriate for my need.

Out of box CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languates also. It means I have to write a lexers for any new language that I need.

Simian supports almost all common languages. but it is 'closed' source. Hence i cannot customize or adopt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and tools available with Python, it was really quick to write. In 3 days, I wrote a nice Code Duplication Detector which supports lots of languages and optimized it also.

The key parts for writing a duplication detector are
  1. Lexers (to generate tokens by parsing the source code)
  2. good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers since it meant writing a new lexer every time I want to support a new language. I decided to piggy back on excellent lexers from Pygments project. Pygments is a 'syntax highligher' written in Python. It already supports large number of programing languages and markups and it is actively developed.

For algorithms, I started by studying CPD. CPD pages just mentions that it uses Rabin Karp string matching algorithm however I could not find any design document. However there are many good refrences and code examples of Rabin Karp algorithm on internet. (e.g. Rabin Karp Wikipedia entry). After some experimentation I was able to implement it in Python.

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on Apache httpd server source. After two iterations of bug fixes and optimizations, it can process 865 files of Apache httpd source code in about 3min 30 seconds. Not bad for a 3 days work.

Check the duplicates found in source code ofApache httpd server and Apache Tomcat server
  • Duplicates found in Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicate found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicate found in Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicate found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

Now I have started working on (a) visualization of amount of duplication in various modules (b) generating 'tag clouds' for source keywords, class names.

PS> I intend to make the code CDD code available under BSD license. It will take some more time. Mean while if you are interested in trying it out, send me a mail and I will email the current source.

Update (21 July 2009) - CDD Code is now available at part of Thinking Craftsman Toolkit project on google code.