Announcement

Saturday, August 01, 2009

Visualizing Code Duplication in a project

Treemaps are an excellent way to visualize metrics about a directory tree (e.g. a source file directory tree). I have used treemaps to visualize SourceMonitor metrics for an entire project with excellent results. Unfortunately, there are very few simple, open source treemap visualization tools available. There is the JTreemap applet, which can be used to view CSV files as treemaps. Some time back an Excel plugin was available from the Microsoft Research site. However, there is no trace of it there now.

As part of the Thinking Craftsman Toolkit, I wrote a simple Tkinter/Python treemap viewer to view SourceMonitor metrics as treemaps. After writing the initial version of the Code Duplication Detector, I realized there is no good tool to visually check the 'proliferation' of duplication across files and directories. Tools like Simian or CPD just give a list of duplications. I thought treemaps could be an excellent way to visualize the duplication, so I added a '-t' flag to CDD. This flag displays a treemap view of the code duplication. You can see a screen snapshot of the treemap view here. (See the thumbnail below.)
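To make the treemap idea concrete, here is a minimal sketch of a 'slice and dice' layout, the simplest treemap algorithm. It is not the code used in CDD or SMTreemap; the function names and the sample duplication counts are hypothetical. Each file gets a rectangle whose area is proportional to its weight (here, duplicated lines), and the split direction alternates at each directory level.

```python
def total(node):
    """Sum of leaf weights under a node (a number or a nested dict)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


def slice_and_dice(tree, x, y, w, h, horizontal=True, prefix=""):
    """Yield (path, x, y, w, h) for each leaf, splitting space by weight."""
    whole = total(tree)
    offset = 0.0
    for name, child in tree.items():
        share = total(child) / whole if whole else 0.0
        if horizontal:
            rect = (x + offset * w, y, share * w, h)
        else:
            rect = (x, y + offset * h, w, share * h)
        offset += share
        path = prefix + name
        if isinstance(child, dict):
            # Directories alternate the split direction at each level.
            yield from slice_and_dice(child, *rect, not horizontal, path + "/")
        else:
            yield (path, *rect)


# Hypothetical data: duplicated line counts per file.
duplication = {
    "server": {"core.c": 30, "util.c": 10},
    "modules": {"http.c": 40, "proxy.c": 20},
}
rects = list(slice_and_dice(duplication, 0, 0, 100, 100))
for rect in rects:
    print(rect)
```

The resulting rectangles could be drawn on a Tkinter canvas, with a second metric mapped to the fill color.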


Tuesday, July 21, 2009

Thinking Craftsman Toolkit on Google code

I have created a project named 'Thinking Craftsman Toolkit (TC Toolkit)' on Google Code. Currently it includes three small tools:
  1. Code Duplication Detector (CDD)
    Code Duplication Detector is similar to Copy Paste Detector (CPD) or Simian. It uses Pygments lexers to tokenize the source files and the Rabin-Karp algorithm to detect the duplicates. Hence it supports all languages supported by Pygments.

  2. Token Tag Cloud (TTC)
    Some time back I read a blog article, 'See How Noisy Your Code Is'. TTC is a tool for creating various tag clouds based on token types (e.g. keywords, names, class names etc).

  3. Sourcemonitor Treemap Viewer (SMTreemap)
    SourceMonitor is an excellent tool for generating various metrics from source code (e.g. maximum complexity, average complexity, line count, block depth etc). However, it is difficult to quickly analyse this data for large code bases. Treemaps are excellent for visualizing hierarchical data on two dimensions (as size and color). This tool uses Tkinter to display the SourceMonitor data as a treemap. You have to export the SourceMonitor data as CSV or XML. smtreemap.py can then use this CSV or XML file as input to display the treemap.
There is no installer or setup file yet. You can get the tools by checking out the source from the SVN repository.
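The idea behind a token tag cloud can be sketched in a few lines. This is not the TTC source: TTC uses Pygments, while this sketch swaps in Python's standard `tokenize` module so that it runs without extra packages, which also means it only handles Python source. The function names and the font size range are assumptions made for illustration.

```python
import io
import keyword
import tokenize
from collections import Counter


def token_counts(source):
    """Count keywords and other names separately in Python source text."""
    counts = {"keyword": Counter(), "name": Counter()}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            kind = "keyword" if keyword.iskeyword(tok.string) else "name"
            counts[kind][tok.string] += 1
    return counts


def font_sizes(counter, smallest=10, largest=40):
    """Scale each count linearly to a font size between smallest and largest."""
    if not counter:
        return {}
    low, high = min(counter.values()), max(counter.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        word: smallest + (count - low) * (largest - smallest) // span
        for word, count in counter.items()
    }


sample = "def add(a, b):\n    return a + b\n\ndef twice(a):\n    return add(a, a)\n"
counts = token_counts(sample)
print(font_sizes(counts["keyword"]))
print(font_sizes(counts["name"]))
```

One cloud per token type keeps very frequent keywords from drowning out the names that say more about what the code does.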

As I promised in the last blog post, 'Writing a Code Duplication Detector', the source for Code Duplication Detector is now released as part of the TC Toolkit project.

Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code to quickly detect the hotspots which need to be addressed first. The techniques I am using are different from traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs etc). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In this process, I found a need for a good 'code duplication detector'.

There are two good code duplication detectors already available.
  1. CPD (Copy Paste Detector) from PMD project.
  2. Simian from RedHill Consulting.
I am a big fan of reuse and try to avoid rewrites as much as possible. Still, in this case neither tool was appropriate for my needs.

Out of the box, CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languages as well, which means I would have to write a lexer for every new language that I need.

Simian supports almost all common languages, but it is 'closed' source. Hence I cannot customize or adapt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and the tools available for it, it was really quick to write. In 3 days, I wrote and optimized a nice Code Duplication Detector which supports lots of languages.

The key parts of a duplication detector are
  1. Lexers (to generate tokens by parsing the source code)
  2. A good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers, since it meant writing a new lexer every time I wanted to support a new language. I decided to piggyback on the excellent lexers from the Pygments project. Pygments is a 'syntax highlighter' written in Python. It already supports a large number of programming languages and markup formats, and it is actively developed.
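To show what the lexer step contributes, here is a small sketch that turns source text into a stream of tokens with comments and layout removed, so that two fragments which differ only in spacing or comments compare as equal. It is not the CDD code: CDD uses Pygments lexers, while this sketch uses Python's standard `tokenize` module as a stand-in, and the function name and the choice of ignored token types are assumptions.

```python
import io
import tokenize

# Layout and comment token types that this sketch chooses to ignore.
IGNORED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def significant_tokens(source):
    """Return (line_number, text) pairs, skipping comments and layout tokens."""
    reader = io.StringIO(source).readline
    return [
        (tok.start[0], tok.string)
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in IGNORED
    ]


a = significant_tokens("x = 1  # set x\n")
b = significant_tokens("x   =   1\n")
print(a)
# The token texts match even though spacing and comments differ.
print([text for _, text in a] == [text for _, text in b])
```

Keeping the line number next to each token makes it possible to report where a duplicate starts and ends in the original file.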

For the algorithm, I started by studying CPD. The CPD pages just mention that it uses the Rabin-Karp string matching algorithm, and I could not find any design document. However, there are many good references and code examples for the Rabin-Karp algorithm on the internet (e.g. the Rabin-Karp Wikipedia entry). After some experimentation I was able to implement it in Python.
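For readers unfamiliar with Rabin-Karp, here is a minimal sketch of how a rolling hash can find repeated runs of tokens. It is an illustration under my own assumptions, not the CDD implementation: the function name, the hash constants and the fixed window size are all hypothetical choices.

```python
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus keeps collisions rare


def find_duplicates(tokens, window):
    """Return groups of start indexes whose `window` tokens are identical."""
    if window <= 0 or len(tokens) < window:
        return []
    # Map each distinct token to a small positive integer code.
    ids = {}
    codes = [ids.setdefault(tok, len(ids) + 1) for tok in tokens]

    top = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    value = 0
    for code in codes[:window]:
        value = (value * BASE + code) % MOD
    buckets = defaultdict(list)
    buckets[value].append(0)
    for start in range(1, len(codes) - window + 1):
        leaving = codes[start - 1]
        entering = codes[start + window - 1]
        # Roll the hash: drop the oldest token, shift, add the newest token.
        value = ((value - leaving * top) * BASE + entering) % MOD
        buckets[value].append(start)

    groups = []
    for starts in buckets.values():
        # Equal hashes can collide, so confirm by comparing the real tokens.
        exact = defaultdict(list)
        for s in starts:
            exact[tuple(tokens[s:s + window])].append(s)
        groups.extend(g for g in exact.values() if len(g) > 1)
    return groups


tokens = "a = b + c ; x = 1 ; d = b + c ;".split()
print(find_duplicates(tokens, 5))  # → [[1, 11]]
```

Each window's hash is derived from the previous one in constant time, which is what makes the approach practical on large code bases. A real detector would also merge overlapping windows into one longer duplicate and map token indexes back to file names and line numbers.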

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on the Apache httpd server source. After two iterations of bug fixes and optimizations, it can process 865 files of Apache httpd source code in about 3 min 30 seconds. Not bad for 3 days of work.

Check the duplicates found in the source code of Apache httpd server and Apache Tomcat server:
  • Duplicates found in Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicates found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicates found in Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicates found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

I have now started working on (a) visualizing the amount of duplication in various modules and (b) generating 'tag clouds' for source keywords and class names.

PS> I intend to make the CDD code available under the BSD license. It will take some more time. Meanwhile, if you are interested in trying it out, send me a mail and I will email the current source.

Update (21 July 2009) - CDD code is now available as part of the Thinking Craftsman Toolkit project on Google Code.