

Tuesday, April 08, 2014

Simple code analysis with TCToolkit - analyzing code duplication

There are many code duplication detection tools available (open source and commercial), such as CPD (Copy Paste Detector) and Simian. However, CDD (Code Duplication Detector) in TCToolkit has some unique advantages. In a previous blog post I explained why I wrote CDD:
  • It uses the excellent Pygments library for parsing the source code. Hence all languages supported by Pygments are supported by CDD for duplication detection.
  • It is reasonably fast. Over the last few weeks I spent some time optimizing it for speed.
    For example, on my Dell laptop it detected 164 duplicates in 1445 files of Tomcat source code in 45 seconds.
  • It can output duplications in multiple formats:
    • Simple text format.
    • HTML format with 'syntax highlighted' duplicate text fragments.
    • C++/Java style '//' comments inserted into the original source code to mark the duplicates.
  • It creates a matrix visualization of duplication to help identify duplication patterns (a rough sketch of the underlying counting appears after this list). See the example below for the Tomcat source (org/apache/coyote/http11) directory.
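
To give a rough idea of what the duplication matrix contains, here is a small hypothetical sketch (not CDD's actual code): each detected duplicate block contributes to a count for the pair of files it spans, and the matrix cell for two files holds that count. The file names and pairs below are made up.

from collections import defaultdict

def duplication_matrix(duplicate_pairs):
    """duplicate_pairs: iterable of (file1, file2) tuples, one per duplicate
    block found between the two files. Returns a nested dict of counts."""
    matrix = defaultdict(lambda: defaultdict(int))
    for f1, f2 in duplicate_pairs:
        matrix[f1][f2] += 1
        matrix[f2][f1] += 1
    return matrix

# Hypothetical duplicate pairs between three files
pairs = [("FileA.java", "FileB.java"),
         ("FileA.java", "FileB.java"),
         ("FileA.java", "FileC.java")]
for f1, row in duplication_matrix(pairs).items():
    for f2, count in row.items():
        print(f1, "<->", f2, ":", count)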


Here is the command line I used for the Tomcat code analysis:
cdd.py -l java -o javadups.htm
To see all the available options:
cdd.py --help
There are a few other simple code visualization tools in TCToolkit, like TTC (Token Tag Cloud) and CCOM (Class Co-occurrence Matrix). I will explain their usage in later posts.

Give it a try and tell me your opinion.

Saturday, March 22, 2014

TCToolkit Update (Version 0.6.x)

When I consulted with companies on improving their source code (refactoring it, improving performance, detecting design bottlenecks, detecting problematic files, etc.), I needed a way to quickly analyze a code base. However, there were not many tools available that gave me quick insight into code. Commercial tools like Coverity, Klocwork, Lattix, etc. are expensive; before I could use one, I had to convince my client to 'license' it, and that was difficult. Hence, about two years back I wrote a few Python scripts to help me quickly analyze a codebase. Later I open sourced these Python scripts as 'TCToolkit'.

Recently I have done significant refactoring and updates to these scripts, and also added some new ones. I have also moved the TCToolkit code to Bitbucket (https://bitbucket.org/nitinbhide/tctoolkit).

Important updates are listed below

  1. Improved the performance of CDD (Code Duplication Detector). On my Dell laptop, the Subversion C code base (around 450 files) can now be analyzed for duplication in about 90 seconds.
  2. The visualizations are now generated using the d3.js library. Token Tag Cloud (TTC) now uses d3.js for generating the tag cloud, and CDD uses d3.js for displaying the 'duplication matrix'.
  3. A new script, 'CCOM' (Class Co-occurrence Matrix), has been added. This script analyzes the code base, finds out which classes are used together, and displays this information in matrix form.

    For example, if class A has class B as a member variable, or a member function of class A uses class B as a parameter, then classes A and B are treated as co-occurring. Similarly, if a function takes parameters of class B and class C, then classes B and C are treated as 'co-occurring'.
    If classes are co-occurring, then chances are there is some dependency between their functionality, and hence changes in one MAY impact the other. A rough sketch of this counting idea appears just after this list.
  4. smjstreemap.py: This script generates a treemap visualization from the output of the excellent freeware code metrics tool SourceMonitor. It also uses d3.js for displaying the treemap.
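
To make the co-occurrence counting concrete, here is a minimal hypothetical sketch (not the actual CCOM code). It assumes the set of class names referenced by each function or member has already been extracted; counting how often two classes appear in the same set gives the co-occurrence matrix.

from collections import Counter
from itertools import combinations

def cooccurrence_counts(contexts):
    """contexts: iterable of sets of class names that are used together,
    e.g. the classes referenced by one member function."""
    counts = Counter()
    for classes in contexts:
        # every unordered pair of classes in the same context co-occurs once
        for a, b in combinations(sorted(classes), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical example: three functions and the classes each one touches
contexts = [
    {"A", "B"},         # class A has a member variable of type B
    {"A", "B", "C"},    # a member function of A takes B and C as parameters
    {"B", "C"},
]
for (a, b), n in sorted(cooccurrence_counts(contexts).items()):
    print(a, "-", b, ":", n)

Higher counts in the matrix suggest a stronger (possibly hidden) dependency between the two classes.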
Give it a try on your code base and see what kind of insights you get about your project.

Wednesday, October 30, 2013

Simple framework to assess potential risks in a Software Project

If you are in a software project, how do you assess its potential risks? So far I have not seen any coherent way of assessing possible risks. Usually problems are discovered really late in the project life cycle (i.e. just before release dates), and by that time it is too late to take any corrective action. So a common problem is: how do you detect possible risks as early in the project life cycle as possible?

But first, how do you define the 'success' of a project?
  1. The project is delivered to the customer.
  2. Your company got the expected profit margin from the project.
  3. The customer accepted the delivery.
  4. The customer's end users are happy with the delivery.
  5. The number of bugs reported is low, and hence your warranty costs are low.
Ideally a 'successful' project should satisfy all of the above. However, many times you achieve only a few items from this list. For example, the customer accepted the delivery and the end users are happy with the features, but there are lots of bugs reported and rework is high; the customer has requested new features, and implementing them requires lots of changes in the code. Hence your costs are high and your profit margin is now low. How do you assess these kinds of risks?

For the last few years, I have been working on various code analysis techniques (check my open source projects SVNPlot and TCToolkit). Based on my experience, I am convinced that analysis of code, design, version control history, etc. gives you a pretty good idea about the success or failure of a project.

Recently I created a simple framework to assess the possible risks.

First, we analyze the project in three ways:
  • Code Vs Testing quadrant
  • Requirement Vs Testing quadrant
  • Design Vs Code quadrant
Map where your project falls in each case. The quadrants the project maps to will tell you the possible risks for your project.




I find that, based on various project metrics, if I mentally map the project to these quadrants, I get a 'rough judgement' of the kind of problems the project will have in the future.

What do you think?

Monday, January 11, 2010

Code Analysis and Visualization Tools

For the last few years, I have been studying how to improve the effectiveness of code reviews. In my experience, every organization has its own 'code review guidelines'. However, the ROI of time spent on code reviews varies tremendously. In some project groups, it works very well, while in other groups many code review comments are about 'naming convention' issues rather than serious code problems. I developed a program called 'Effective Code Review' to teach simple but effective techniques for analyzing code to find more defects during code reviews.

For the last year I have been conducting this 'Effective Code Review' program in various organizations. As part of this program, I checked out various open source and freeware tools available to analyze code and/or visualize various aspects of it. I also wrote a few simple tools myself, which I have released as the 'Thinking Craftsman Toolkit' on Google Code.

I have prepared a list of such tools and their websites and published it on my website, with the hope that others will find this list useful.

Check Code Analysis and Visualization Tools on the Thinking Craftsman website. If you know of any good tools, please leave a comment; I will update the list based on the comments.

Saturday, August 01, 2009

Visualizing Code Duplication in a project

Treemap visualization is an excellent way to visualize information/various metrics about a directory tree (e.g. a source file directory tree). I have used treemaps for visualizing SourceMonitor metrics for an entire project with excellent results. Unfortunately there are very few simple, open source treemap visualization tools available. There is the JTreemap applet, which can be used to view CSV files as treemaps. Some time back an Excel plugin was available from the Microsoft Research site; however, there is no trace of it there now.

As part of the Thinking Craftsman Toolkit, I wrote a simple Tkinter/Python treemap viewer to view the SourceMonitor metrics as treemaps. After writing the initial version of Code Duplication Detector, I realized there is no good visualization tool to visually check the 'proliferation' of duplication across various files/directories. Tools like Simian or CPD just give a list of duplications. I thought treemaps could be an excellent way to visualize the duplication, hence I added a '-t' flag to CDD. This flag displays the treemap view of the code duplication. You can see a screen snapshot of the treemap view here. (See the thumbnail below.)
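
For readers curious how a treemap carves up the available area, here is a tiny illustrative slice-and-dice layout sketch. It is not the actual viewer code, and the tree and sizes are made up; each node gets a rectangle proportional to its size, with the split direction alternating at every level.

def node_size(node):
    """Total size of a node: its own size plus that of all its descendants."""
    _, size, children = node
    return size + sum(node_size(child) for child in children)

def layout(node, x, y, w, h, horizontal=True, rects=None):
    """Recursively assign a rectangle (x, y, w, h) to every node."""
    if rects is None:
        rects = []
    name, _, children = node
    rects.append((name, x, y, w, h))
    if not children:
        return rects
    total = sum(node_size(child) for child in children) or 1
    offset = 0.0
    for child in children:
        frac = node_size(child) / total
        if horizontal:
            # split the current rectangle along the x axis
            layout(child, x + offset * w, y, w * frac, h, False, rects)
        else:
            # split along the y axis
            layout(child, x, y + offset * h, w, h * frac, True, rects)
        offset += frac
    return rects

# Made-up directory tree; a real run would use line counts or duplication counts as sizes
tree = ("src", 0, [
    ("a.c", 120, []),
    ("b.c", 300, []),
    ("util", 0, [("c.c", 80, []), ("d.c", 40, [])]),
])
for name, x, y, w, h in layout(tree, 0, 0, 100, 100):
    print("%-6s x=%5.1f y=%5.1f w=%5.1f h=%5.1f" % (name, x, y, w, h))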


Tuesday, July 21, 2009

Thinking Craftsman Toolkit on Google code

I have created a project named 'Thinking Craftsman Toolkit (TC Toolkit)' on Google Code. Currently it includes three small tools:
  1. Code Duplication Detector (CDD)
    Code Duplication Detector is similar to Copy Paste Detector (CPD) or Simian. It uses Pygments lexers to parse the source files and the Rabin-Karp algorithm to detect the duplicates. Hence it supports all languages supported by Pygments.

  2. Token Tag Cloud (TTC)
    Some time back I read a blog article, 'See How Noisy Your Code Is'. TTC is a tool for creating various tag clouds based on token types (e.g. keywords, names, class names, etc.); a rough sketch of the idea is shown just after this list.

  3. Sourcemonitor Treemap Viewer (SMTreemap)
    SourceMonitor is an excellent tool for generating various metrics from source code (e.g. maximum complexity, average complexity, line count, block depth, etc.). However, it is difficult to quickly analyze this data for large code bases. Treemaps are excellent for visualizing hierarchical data in two dimensions (as size and color). This tool uses Tkinter to display the SourceMonitor data as a treemap. You have to export the SourceMonitor data as CSV or XML; smtreemap.py can then use this CSV or XML file as input to display the treemap.
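
As a rough illustration of the tag cloud idea mentioned above (this is not the actual TTC script), the sketch below uses a Pygments lexer to count how often each identifier appears in a source file; the most frequent names would get the largest fonts in the cloud. The file name is only an example.

from collections import Counter
from pygments.lexers import get_lexer_for_filename
from pygments.token import Token

def name_frequencies(filename):
    """Count identifier tokens (names, class names, functions) in a source file."""
    with open(filename, encoding="utf-8", errors="ignore") as srcfile:
        code = srcfile.read()
    lexer = get_lexer_for_filename(filename)
    counts = Counter()
    for tokentype, value in lexer.get_tokens(code):
        if tokentype in Token.Name:    # any Name subtype: class, function, variable...
            counts[value] += 1
    return counts

# Example usage with a hypothetical file name
for name, count in name_frequencies("Example.java").most_common(20):
    print(count, name)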
There is no installer or setup file yet. You can get the tools by checking out the source from the SVN repository.

As promised in the last blog post on 'Writing a Code Duplication Detector', the source for Code Duplication Detector is now released as part of the TC Toolkit project.

Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code to quickly detect the hotspots that need to be addressed first. The techniques I am using are different from traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs, etc.). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In the process, I found the need for a good 'code duplication detector'.

There are two good code duplication detectors already available:
  1. CPD (Copy Paste Detector) from the PMD project.
  2. Simian from RedHill Consulting.
I am a big fan of reuse and try to avoid rewrites as much as possible. Still, in this case neither tool was appropriate for my needs.

Out of the box, CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languages as well, which means I would have to write a lexer for any new language that I need.

Simian supports almost all common languages, but it is closed source. Hence I cannot customize or adapt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and the tools available with it, it was really quick to write. In three days, I wrote a nice Code Duplication Detector which supports lots of languages, and optimized it as well.

The key parts of writing a duplication detector are:
  1. Lexers (to generate tokens by parsing the source code).
  2. A good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers, since that would mean writing a new lexer every time I want to support a new language. I decided to piggyback on the excellent lexers from the Pygments project. Pygments is a 'syntax highlighter' written in Python. It already supports a large number of programming languages and markups, and it is actively developed.

For the algorithm, I started by studying CPD. The CPD pages just mention that it uses the Rabin-Karp string matching algorithm, but I could not find any design document. However, there are many good references and code examples of the Rabin-Karp algorithm on the internet (e.g. the Rabin-Karp Wikipedia entry). After some experimentation, I was able to implement it in Python.
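
The core of this approach is a rolling hash over a sliding window of tokens. Here is a small self-contained sketch in that spirit; it is not CDD's actual implementation, the window size and hash parameters are arbitrary illustrative choices, and the tokens are plain strings rather than Pygments tokens.

def find_duplicate_windows(tokens, window=20, base=257, mod=(1 << 61) - 1):
    """Return pairs of start indices whose token windows are identical."""
    if len(tokens) < window:
        return []
    values = [hash(tok) % mod for tok in tokens]
    high = pow(base, window - 1, mod)      # weight of the token leaving the window
    h = 0
    for v in values[:window]:              # hash of the first window
        h = (h * base + v) % mod
    seen = {h: [0]}
    matches = []
    for start in range(1, len(tokens) - window + 1):
        # roll the hash: drop tokens[start-1], add tokens[start+window-1]
        h = ((h - values[start - 1] * high) * base + values[start + window - 1]) % mod
        for other in seen.get(h, []):
            # compare the actual tokens to rule out hash collisions
            if tokens[other:other + window] == tokens[start:start + window]:
                matches.append((other, start))
        seen.setdefault(h, []).append(start)
    return matches

A real detector additionally has to map matching windows back to files and line numbers and merge overlapping windows into larger duplicate blocks, but the rolling hash is what makes the scan fast.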

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on the Apache httpd server source. After two iterations of bug fixes and optimizations, it can process 865 files of Apache httpd source code in about 3 minutes 30 seconds. Not bad for 3 days' work.

Check the duplicates found in the source code of the Apache httpd server and the Apache Tomcat server:
  • Duplicates found in the Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicates found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicates found in the Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicates found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

Now I have started working on (a) visualizing the amount of duplication in various modules and (b) generating 'tag clouds' for source keywords and class names.

PS: I intend to make the CDD code available under the BSD license. It will take some more time. Meanwhile, if you are interested in trying it out, send me a mail and I will email you the current source.

Update (21 July 2009) - CDD code is now available as part of the Thinking Craftsman Toolkit project on Google Code.