Wednesday, January 28, 2009

Using Social Network Analysis with Version control data

As I mentioned in the last post, am experimenting about using social network analysis (sna) on verision control data. Now with SVNPlot project, I have a way of converting the Subversion logs into sqlite database. It allows me to query the data in many different ways.

I used the Rietveld repository data and did some premilinary analysis. I am not an expert on SNA but Initial results look very interesting and promising. You can see the results on my website

Update : Oscar Castaneda has added SNA data extraction to SVNPlot as part of GSoC 2010 project. He has used these modifications to analyze Apache repositories and reported his findings in ApacheCon. Check the details at
  1. Life After Google Summer of Code by Oscar Castaneda
  2. Oscar's GSoC 2010 proposal 
  3. Details on how to use his contributions in SVNPlot to extract the data.

Sunday, January 18, 2009

Social Network Analysis and Version Control

Recently I came across the concept of Social Network Analysis.

Given below is small introduction of Social Network Analysis is from Orgnet site
Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers, web sites, and other information/knowledge processing entities. The nodes in the network are the people and groups while the links show relationships or flows between the nodes. SNA provides both a visual and a mathematical analysis of human relationships.
The concept is originated in 'social sciences (socialogy, anthropology)' to study the relationships on communities. Today it is being used in fraud ring detection, identifying leaders in organizational network,analyzing the relience of computer networks and various other ways. The various casestudies from Orgnet site can give you good idea about the possibilities.

I started thinking about applying SNA for version control history with files and authors as nodes. There is some research going on in this area in universities. References below have few links. Google search with "data mining version control" will give you additional links

With SVNPlot, now I have a way of converting Subversion logs into an SQLite database. Also Python have some excellent libraries for Network analysis. I am using NetworkX for analysis and Matplotlib for visualization. I think such analysis will be useful in
  1. In indentifying the key developers and their specific areas in the project.
  2. Key files (files which are involved in the code changes more frequently than others)
  3. Identify the clusters of related files (across directories and modules)
I think the results will be useful to software development companies as well especially for getting advance warning for problems and especially big projects in indentifying critical developers, planning the technology transfer during movement from people from one project to another etc. I see many exciting possibilities.

The initial results are interesting. I will put up the charts/analysis etc on my site in a few days time.

References and Interesting Articles/Links
  1. Introduction to Social Network Analysis (from
  2. Casestudies of Social Network Analysis (from
  3. Wikipedia page on Social Networks (Check the history of Social Network Analysis)
  4. Social Life of Routers (Computer networks as social networks)
  5. Finding Go-to People and Subject Matter Experts in Organization
  6. Predicting Defects using Network Analysis on Dependency Graphs – ICSE 2008
  7. Mining Software Archives (a special issue of IEEE magazine)

Wednesday, January 14, 2009

SVNPlot - my first opensource project

During the 1 week gap between the two jobs, I finally started an opensource project. The project is called in SVNPlot. It is inspired by the excellent StatSVN Subversion Statistics generation package.

SVNPlot generates graphs similar to StatSVN. The difference is in how the graphs are generated. SVNPlot generates these graphs in two steps. First it converts the Subversion logs into a 'sqlite3' database. Then it uses sql queries to extract the data from the database and then uses excellent Matplotlib plotting library to plot the graphs.

I believe using SQL queries to query the necessary data is resulting great flexibility in data extraction. Also since the sqlite3 is quite fast, it is possible to generate these graphs on demand.

As tribute to python and author of Python-Guido van Rossum, I have generated the graphs for Rietveld project. Check it out here

is hosted on Google code ( and licensed under New BSD license. For information on installation and usage, check the introduction page here

I am using python to implement SVNPlot. I am a novice to python. Hence any suggestions to improvement are welcome.