
Tuesday, April 08, 2014

Simple Code analysis with TCToolkit - Analyzing code duplication

There are many code duplication detection tools available (open source and commercial), like CPD (Copy Paste Detector) or Simian. However, CDD (Code Duplication Detector) in TCToolkit has some unique advantages. In a previous blog post I have explained why I wrote CDD.
  • It uses the excellent Pygments library for parsing the source code. Hence all the languages supported by Pygments are supported by CDD for duplication detection.
  • It is reasonably fast. Over the last few weeks I spent some time optimizing it for speed.
    For example, on my Dell laptop it detected 164 duplicates in 1445 files of Tomcat source code in 45 seconds.
  • It can output duplications in multiple formats:
    • In simple text format.
    • In HTML format with 'syntax highlighted' duplicate text fragments.
    • It can also add C++/Java style '// code comments' in the original source code.
  • It creates a matrix visualization of duplication to identify duplication patterns. See the example below for the Tomcat source (org/apache/coyote/http11) directory.


Here is the command line that I used for the Tomcat code analysis:
cdd.py -l java -o javadups.htm
To see all the available options:
cdd.py --help
There are a few other simple code visualization tools in TCToolkit, like TTC (Token Tag Cloud) or CCOM (Class Co-occurrence Matrix). I will explain their usage in later posts.

Give it a try and tell me your opinion.

Saturday, March 22, 2014

TCToolkit Update (Version 0.6.x)

When I consulted with companies on improving their source code (refactoring it, improving the performance, detecting the design bottlenecks, detecting problematic files etc.), I needed a way to quickly analyze a code base. However, there were not many tools available which gave me quick insight into code. Commercial tools like Coverity, Klocwork, Lattix etc. are expensive. Before I could use them, I had to convince my client to 'license' them, and that was difficult. Hence, about 2 years back, I wrote a few Python scripts to help me quickly analyze a codebase. Later I open sourced these Python scripts as 'TCToolkit'.

Recently I have done significant refactoring and updates to these scripts and have also added some new scripts. I have also moved the TCToolkit code to Bitbucket (https://bitbucket.org/nitinbhide/tctoolkit).

Important updates are listed below

  1. Improved the performance of CDD (Code Duplication Detector). On my Dell laptop, the Subversion C code base (around 450 files) can now be analyzed for duplication in about 90 seconds.
  2. I now use the d3js library for generating the visualizations. Token Tag Cloud (TTC) now uses d3js for generating the tag cloud. CDD uses d3js for displaying the 'duplication matrix'.
  3. A new script 'CCOM' (Class Co-occurrence Matrix) is added. This script analyzes the code base and finds out which classes are used together. It displays this information in matrix form.

    For example, if class A has class B as a member variable, or a member function of class A uses class B as a parameter, then classes A and B are treated as 'co-occurring'. If a function takes two parameter objects of class B and class C, then classes B and C are also treated as 'co-occurring'.
    If classes are co-occurring, then chances are there is some dependency between their functionality, and hence changes in one MAY impact the other.
  4. smjstreemap.py: This script generates a treemap visualization from the output of the excellent freeware code metrics tool SourceMonitor. It also uses d3js for displaying the treemap.
Give it a try on your code base and see what kind of insights you get about your project.

Wednesday, July 11, 2012

Django connection pooling using sqlalchemy connection pool

As you know, Django uses a new database connection for each request. This works well initially. However, as the load on the server increases, creating/destroying connections to the database starts taking a significant amount of time. You will find many questions about using some kind of connection pooling for Django on sites like StackOverflow; for example, Django persistent database connection.

At BootStrapToday we use SQLAlchemy's connection pooling mechanism with Django for pooling the database connections. We use a variation of the approach by Igor Katson described in http://dumpz.org/67550/. Igor's approach requires patching Django, which we wanted to avoid. Hence we created a small function that we import in one of __init__.py (or models.py) (i.e. some file which gets imported early in the application startup).


import logging

import sqlalchemy.pool as pool
from django.conf import settings
from django.db.utils import load_backend  # load_backend lives here in Django 1.2+

pool_initialized = False

def init_pool():
    global pool_initialized
    if not pool_initialized:
        pool_initialized = True
        try:
            backendname = settings.DATABASES['default']['ENGINE']
            backend = load_backend(backendname)

            # Replace the DBAPI module on the backend with a SQLAlchemy
            # pool proxy; Django then gets pooled connections transparently.
            backend.Database = pool.manage(backend.Database)

            # The proxy must also expose the exception classes Django expects.
            backend.DatabaseError = backend.Database.DatabaseError
            backend.IntegrityError = backend.Database.IntegrityError
            logging.info("Connection Pool initialized")
        except Exception:
            logging.exception("Connection Pool initialization error")

# Now call the init_pool function to initialize the connection pool.
# No change is required in the Django code.
init_pool()

So far this seems to be working quite well.

Thursday, January 26, 2012

Comparison of Emberjs vs Knockoutjs

Recently I have been experimenting with two JavaScript MVC frameworks, Emberjs and Knockoutjs, mainly to select the framework for the new UI design of BootStrapToday.
Yesterday I wrote a blog article about my experiences of using Emberjs and Knockoutjs with the Django web framework.

Emberjs vs Knockoutjs for Django projects

Monday, July 04, 2011

Optimizing Django database access : Part 2

Originally published on BootstrapToday Blog 

A few months back we moved to Django 1.2. Immediately we realized that our method of 'optimizing' database queries on foreign key fields doesn't work any more. The main reason for this 'break' was Django 1.2's changes to support multiple databases, especially the way the 'get' function is called to access the value of a ForeignKey field.

In Django 1.2, ForeignKey fields are accessed using a rel_mgr. See the call below.

rel_obj = rel_mgr.using(db).get(**params)

However, the manager.using() call returns a queryset. Hence if the manager has implemented a custom 'get' function, this 'custom get' function is not called. Since our database access optimization is based on a 'custom get' function, this change broke our optimization.

Check Django Ticket 16173 for the details.

Also, our original code didn't consider multi-db scenarios. Hence the query 'signature' computation has to consider the db name also.
from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching.
        '''
        # The db name is part of the key to handle multi-db setups.
        model_key = (self.db, self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # Cache miss, so we have to get the object from the database.
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key] = model
        return model

There are minor changes from the first version. This manager class now stores the 'db' name also in the 'key'. A hypothetical usage sketch is shown below.
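
The model and field names in this sketch are illustrations, not our actual code; you attach the manager to a model whose rows rarely change:

from django.db import models

class Status(models.Model):
    # A small lookup table whose rows rarely change - a good caching candidate.
    name = models.CharField(max_length=32)

    objects = YourModelManager()

# Within a single request, repeated Status.objects.get(name='open') calls
# after the first one are served from connection.model_cache.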

Along with this change, you have to make one change in the core Django code. In the file django/db/models/fields/related.py, in the __get__ function of the 'ReverseSingleRelatedObjectDescriptor' class, replace the line
         rel_obj = rel_mgr.using(db).get(**params)

with the line
         rel_obj = rel_mgr.db_manager(db).get(**params)

This will ensure that YourModelManager's get function is called while querying the foreign keys.

Saturday, February 05, 2011

Optimizing Django database access : some ideas/experiments

Originally published on BootstrapToday Blog  

As we added more features to BootStrapToday, we started facing performance issues. Page access was getting progressively slower. Recently we analyzed page performance using the excellent Django Debug Toolbar and discovered that in the worst case there were more than 500 database calls in a single page. Obviously that was making page display slow. After looking at the various calls, analyzing the code and making changes, we were able to bring it down to about 80 calls, dramatically improving the performance. Django has an excellent general purpose caching framework. However, it's better to minimize the calls first and then add caching for the remaining queries. In this article, I am going to show a simple but effective idea for improving database access performance.

In Django, a common reason for increased database calls is ForeignKey fields. When you access the variable representing a foreign key, it typically results in a database access. The usually suggested solution to this problem is the use of the 'select_related' function of the Django queryset API (a small sketch is shown below). It can substantially reduce the number of database calls. Sometimes it is not possible to use 'select_related' (e.g. you don't want to change how a third party app works, or changing the queries requires significant change in the logic, etc.).
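
For illustration, here is a minimal sketch of what 'select_related' changes; the Ticket model and its 'status' ForeignKey are hypothetical:

# Hypothetical models import; 'Ticket' and its 'status' ForeignKey are
# illustrations only.
from myapp.models import Ticket

# Without select_related, accessing ticket.status triggers one extra
# query per ticket:
for ticket in Ticket.objects.all():
    print(ticket.status.name)

# With select_related, the related rows are fetched in the same query
# using a JOIN, so the loop makes no extra database calls:
for ticket in Ticket.objects.select_related('status'):
    print(ticket.status.name)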

In our case, we have ForeignKey fields like Priority, Status etc. on the Ticket model. These fields are there because later we want to add the ability to 'customize' these fields. However, the values in these tables rarely change. Hence this usually results in multiple queries fetching the same data. Usually these are 'get' calls. If we can add simple caching to the 'get' queries for status, priority etc., then we can significantly reduce the number of database calls.

In Django, a new connection is opened to handle each new request. Hence if we attach a model instance cache to the 'connection', then query results will be cached for the duration of one request. A new request will make the database query again. With this strategy we don't need complicated logic to clear 'stale' items from the cache.



from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching.
        '''
        model_key = (self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # Cache miss, so we have to get the object from the database.
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key] = model
        return model


As a side benefit, since the same model instance is returned for the same query in subsequent calls, the number of duplicate instances is reduced and hence the memory footprint is also reduced.

Saturday, August 01, 2009

Visualizing Code Duplication in a project

Treemap visualization is an excellent way to visualize information/various metrics about a directory tree (e.g. a source files directory tree). I have used treemaps for visualizing SourceMonitor metrics for entire projects with excellent results. Unfortunately there are very few simple, open source treemap visualization tools available. There is the JTreemap applet which can be used to view CSV files as treemaps. Sometime back an Excel plugin was available from the Microsoft Research site. However, there is no trace of it now on the Microsoft Research site.

As part of the Thinking Craftsman Toolkit, I wrote a simple Tkinter/Python treemap viewer to view the SourceMonitor metrics as treemaps. After writing the initial version of Code Duplication Detector, I realized there is no good visualization tool to visually check the 'proliferation' of duplication across various files/directories. Tools like Simian or CPD just give a list of duplications. I thought 'treemaps' could be an excellent tool to visualize the duplication. Hence I added a '-t' flag to CDD. This flag displays the treemap view of the code duplication. You can see a screen snapshot of the treemap view here (see the thumbnail below).
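For example, an invocation along these lines should display the treemap view (a sketch only; the exact option set may have changed across CDD versions):
cdd.py -l java -t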


Tuesday, July 21, 2009

Thinking Craftsman Toolkit on Google code

I have created a project named 'Thinking Craftsman Toolkit (TC Toolkit)' on Google Code. Currently it includes three small tools:
  1. Code Duplication Detector (CDD)
    Code Duplication Detector is similar to Copy Paste Detector (CPD) or Simian. It uses Pygments lexers to parse the source files and uses the Rabin-Karp algorithm to detect duplicates. Hence it supports all languages supported by Pygments.

  2. Token Tag Cloud (TTC)
    Sometime back I read a blog article 'See How Noisy Your Code Is'. TTC is a tool for creating various tag clouds based on token types (e.g. keywords, names, class names etc.).

  3. Sourcemonitor Treemap Viewer (SMTreemap)
    SourceMonitor is an excellent tool to generate various metrics from source code (e.g. maximum complexity, average complexity, line count, block depth etc.). However, it is difficult to quickly analyze this data for large code bases. Treemaps are excellent for visualizing hierarchical data in two dimensions (as size and color). This tool uses Tkinter to display the SourceMonitor data as a treemap. You have to export the SourceMonitor data as CSV or XML. smtreemap.py can then use this CSV or XML file as input to display the treemap.
There is no installer or setup file yet. You can get the tools by checking out the source from the SVN repository.

As I promised in the last blog post on 'Writing a Code Duplication Detector', the source for Code Duplication Detector is now released as part of the TC Toolkit project.

Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code for quickly detecting code hotspots which need to be addressed first. The techniques I am using are different from traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs etc.). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In this process, I found the need for a good 'code duplication detector'.

There are two good code duplication detectors already available.
  1. CPD (Copy Paste Detector) from the PMD project.
  2. Simian from RedHill Consulting.
I am a big fan of reuse and try to avoid rewrites as much as possible. Still, in this case both tools were not appropriate for my needs.

Out of the box, CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languages also. It means I would have to write a lexer for any new language that I need.

Simian supports almost all common languages, but it is 'closed' source. Hence I cannot customize or adapt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and the tools available with it, it was really quick to write. In 3 days, I wrote a nice Code Duplication Detector which supports lots of languages, and optimized it as well.

The key parts for writing a duplication detector are:
  1. Lexers (to generate tokens by parsing the source code)
  2. A good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers, since it meant writing a new lexer every time I want to support a new language. I decided to piggyback on the excellent lexers from the Pygments project. Pygments is a 'syntax highlighter' written in Python. It already supports a large number of programming languages and markups, and it is actively developed.

For the algorithm, I started by studying CPD. The CPD pages just mention that it uses the Rabin-Karp string matching algorithm; however, I could not find any design document. There are many good references and code examples of the Rabin-Karp algorithm on the internet (e.g. the Rabin-Karp Wikipedia entry). After some experimentation I was able to implement it in Python. A minimal sketch of the idea is shown below.
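
This sketch illustrates the technique, not the actual CDD code; the function names, chunk size and hashing constants are all assumptions:

from pygments.lexers import get_lexer_by_name
from pygments.token import Token

CHUNK = 20       # minimum run of matching tokens that counts as a duplicate
BASE = 257       # base of the rolling polynomial hash
MOD = 2**61 - 1  # large prime modulus keeps collisions rare

def significant_tokens(code, lang):
    '''Return token values, skipping comments and whitespace.'''
    lexer = get_lexer_by_name(lang)
    return [value for ttype, value in lexer.get_tokens(code)
            if ttype not in Token.Comment and not value.isspace()]

def find_duplicates(tokens):
    '''Yield index pairs (j, i) whose CHUNK-long token windows match.'''
    if len(tokens) < CHUNK:
        return
    vals = [hash(t) % MOD for t in tokens]
    top = pow(BASE, CHUNK - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in vals[:CHUNK]:           # hash of the first window
        h = (h * BASE + v) % MOD
    seen = {h: [0]}
    for i in range(1, len(vals) - CHUNK + 1):
        # Roll the hash: drop vals[i-1], append vals[i+CHUNK-1].
        h = ((h - vals[i - 1] * top) * BASE + vals[i + CHUNK - 1]) % MOD
        for j in seen.get(h, []):
            # Verify token by token to rule out hash collisions.
            if tokens[j:j + CHUNK] == tokens[i:i + CHUNK]:
                yield (j, i)
        seen.setdefault(h, []).append(i)

A real tool would additionally merge overlapping window matches into maximal duplicate runs and map token indices back to file names and line numbers.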

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on the Apache httpd server source. After two iterations of bug fixes and optimizations, it can process 865 files of Apache httpd source code in about 3 minutes 30 seconds. Not bad for 3 days' work.

Check the duplicates found in the source code of the Apache httpd server and the Apache Tomcat server:
  • Duplicates found in Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicates found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicates found in Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicates found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

Now I have started working on (a) visualization of the amount of duplication in various modules and (b) generating 'tag clouds' for source keywords and class names.

PS> I intend to make the CDD code available under the BSD license. It will take some more time. Meanwhile, if you are interested in trying it out, send me a mail and I will email you the current source.

Update (21 July 2009) - CDD code is now available as part of the Thinking Craftsman Toolkit project on Google Code.

Wednesday, January 28, 2009

Using Social Network Analysis with Version control data

As I mentioned in the last post, I am experimenting with using social network analysis (SNA) on version control data. Now, with the SVNPlot project, I have a way of converting Subversion logs into a sqlite database. It allows me to query the data in many different ways.

I used the Rietveld repository data and did some preliminary analysis. I am not an expert on SNA, but the initial results look very interesting and promising. You can see the results on my website.



Update: Oscar Castaneda has added SNA data extraction to SVNPlot as part of his GSoC 2010 project. He used these modifications to analyze Apache repositories and reported his findings at ApacheCon. Check the details at:
  1. Life After Google Summer of Code by Oscar Castaneda
  2. Oscar's GSoC 2010 proposal 
  3. Details on how to use his contributions in SVNPlot to extract the data.

Sunday, January 18, 2009

Social Network Analysis and Version Control

Recently I came across the concept of Social Network Analysis.

Given below is a small introduction to Social Network Analysis, from the Orgnet site:
Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers, web sites, and other information/knowledge processing entities. The nodes in the network are the people and groups while the links show relationships or flows between the nodes. SNA provides both a visual and a mathematical analysis of human relationships.
The concept originated in the social sciences (sociology, anthropology) to study relationships in communities. Today it is being used in fraud ring detection, identifying leaders in organizational networks, analyzing the resilience of computer networks and various other ways. The various case studies from the Orgnet site can give you a good idea about the possibilities.

I started thinking about applying SNA to version control history, with files and authors as nodes. There is some research going on in this area in universities; the references below have a few links. A Google search for "data mining version control" will give you additional links.

With SVNPlot, I now have a way of converting Subversion logs into an SQLite database. Also, Python has some excellent libraries for network analysis. I am using NetworkX for analysis and Matplotlib for visualization (a small sketch follows the list below). I think such analysis will be useful in
  1. Identifying the key developers and their specific areas in the project.
  2. Identifying key files (files which are involved in code changes more frequently than others).
  3. Identifying clusters of related files (across directories and modules).
I think the results will be useful to software development companies as well, especially for getting advance warning of problems, and in big projects for identifying critical developers, planning technology transfer when moving people from one project to another, etc. I see many exciting possibilities.
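
As a small sketch of this kind of analysis, the snippet below builds an author-file graph with NetworkX and ranks authors by weighted degree. The table and column names in the SQL are illustrative assumptions, not necessarily SVNPlot's actual schema:

import sqlite3
import networkx as nx

def build_author_file_graph(dbpath):
    '''Build a graph linking each author to the files they changed.'''
    graph = nx.Graph()
    conn = sqlite3.connect(dbpath)
    # Assumed schema: one row per (author, changed file path) per revision.
    for author, path in conn.execute(
            "SELECT author, changedpath FROM SVNLogDetail"):
        graph.add_node(author, kind='author')
        graph.add_node(path, kind='file')
        if graph.has_edge(author, path):
            # Edge weight counts how often this author touched this file.
            graph[author][path]['weight'] += 1
        else:
            graph.add_edge(author, path, weight=1)
    conn.close()
    return graph

g = build_author_file_graph('svnlog.sqlite')
# High weighted degree hints at 'key developers' (or, for file nodes, 'key files').
authors = [n for n, d in g.nodes(data=True) if d['kind'] == 'author']
for a in sorted(authors, key=lambda a: g.degree(a, weight='weight'),
                reverse=True)[:5]:
    print(a, g.degree(a, weight='weight'))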

The initial results are interesting. I will put up the charts/analysis etc. on my site in a few days' time.

References and Interesting Articles/Links
  1. Introduction to Social Network Analysis (from orgnet.com)
  2. Case studies of Social Network Analysis (from Orgnet.com)
  3. Wikipedia page on Social Networks (Check the history of Social Network Analysis)
  4. Social Life of Routers (Computer networks as social networks)
  5. Finding Go-to People and Subject Matter Experts in Organization
  6. Predicting Defects using Network Analysis on Dependency Graphs – ICSE 2008
  7. Mining Software Archives (a special issue of IEEE magazine)

Wednesday, January 14, 2009

SVNPlot - my first opensource project

During the 1 week gap between the two jobs, I finally started an open source project. The project is called SVNPlot. It is inspired by the excellent StatSVN Subversion statistics generation package.

SVNPlot generates graphs similar to StatSVN. The difference is in how the graphs are generated. SVNPlot generates these graphs in two steps. First, it converts the Subversion logs into a 'sqlite3' database. Then it uses SQL queries to extract the data from the database, and uses the excellent Matplotlib plotting library to plot the graphs.

I believe using SQL queries to extract the necessary data results in great flexibility in data extraction. Also, since sqlite3 is quite fast, it is possible to generate these graphs on demand. A small sketch of the idea is shown below.
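
For example, here is a minimal sketch of the two-step idea: query the sqlite database with SQL and plot the result with Matplotlib. The table and column names are illustrative assumptions, not necessarily SVNPlot's actual schema:

import sqlite3
import matplotlib.pyplot as plt

conn = sqlite3.connect('svnlog.sqlite')
# Assumed schema: one row per revision with a 'commitdate' column.
rows = conn.execute("""
    SELECT strftime('%Y-%m', commitdate) AS month, count(*)
    FROM SVNLog GROUP BY month ORDER BY month""").fetchall()
conn.close()

months = [r[0] for r in rows]
counts = [r[1] for r in rows]
plt.bar(range(len(counts)), counts)
plt.xticks(range(len(months)), months, rotation=90)
plt.ylabel('commits per month')
plt.tight_layout()
plt.savefig('activity.png')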

As a tribute to Python and the author of Python, Guido van Rossum, I have generated the graphs for the Rietveld project. Check it out here.

SVNPlot is hosted on Google Code (http://code.google.com/p/svnplot/) and licensed under the New BSD license. For information on installation and usage, check the introduction page here.

I am using Python to implement SVNPlot. I am a novice at Python, hence any suggestions for improvement are welcome.