Thoughts of a Thinking Craftsman: 2011

Thursday, October 13, 2011

Dennis Ritchie

Dennis Ritchie (creator of C, key developer of Unix and co-author of the K&R C book) has passed away. October 2011 seems to be a bad month for computer science. The world has lost another revolutionary thinker.

Dennis Ritchie - RIP. We programmers are forever in your debt.

Dennis Ritchie Wikipedia page
Herb Sutter on Dennis Ritchie
Development of C Programming Language by Dennis Ritchie

Tuesday, August 30, 2011

Write small functions

On my teams there is always a rule: the maximum function size is 25 lines. Every time a new team member joined the group, he or she said that it is not possible to follow this rule every time and that longer functions would sometimes be necessary. I always told them: OK, but then you have to convince me that there is no way to break this long function into smaller logical chunks. In my 15 years of experience, no one has succeeded. I could always show them a way of logically breaking the function into smaller functions.

Many team members asked why the limit is 25 lines and what the benefit is. I had always explained the reasons verbally, so this time I decided to document the 'why' on the blog.

Entire function has to be visible in one page.
We write code once and edit it a few times, but we read it many more times. Hence the code has to be optimized for easy reading and understanding. If I have to scroll or page up and down to read a function, it breaks the flow. If the entire function is visible on one page, it is easy to read, understand and hence debug. It is also easier to spot potential problems when the whole function is visible. Since one editor page is typically about 25 to 30 lines, the maximum function size is about 25 lines.
Reduce the code complexity
Code with high complexity is difficult to understand and modify. Generally it is recommended to limit the cyclomatic complexity to 15. Once you have a limit on function size, it automatically limits the complexity to manageable proportions. After all, how complex can the code get in 25 lines?
Reduce multiple returns
Multiple returns from a function (especially more than 2-3) make the code difficult to understand. I have seen functions that are 500-1000 lines long with 25-30 return statements. Usually these functions end up having many bugs.
Also, in C++ you have to manually add any cleanup (e.g. freeing memory or closing files) before every return statement. (Of course, you can use techniques like RAII to simplify the cleanup.)
Still, a small function size automatically limits the number of 'return' statements in a function.
Easy to spot errors (especially clean up errors)
Functions often allocate memory, open files, open database connections and so on, and these resources have to be cleaned up or released. In a large function, it is difficult to spot where all the memory is allocated and files are opened, and at what point the memory needs to be freed and the files closed. With smaller functions it is easier to spot these kinds of errors.
For new code stick to this rule
For new code, stick to this rule. If you are maintaining or enhancing existing code, it is very tempting to jump into large-scale refactoring of the existing code in one go. Don't make that mistake.
If there is a large function but no bugs have been reported in or traced to that function, don't refactor it into smaller functions. Leave it as it is.
However, if you are fixing a bug in a large function, refactor the portion where you made the fix into a smaller function. Don't try to break a 500-line function into 20 functions in one go. A 500-line function may become, after refactoring, two functions of 480 lines and 20 lines. If one more bug is reported in the same function, do one more small refactoring. Very soon the number of large functions will fall and the code will be cleaner and simpler. Check the Refactoring website and follow the guidelines from Martin Fowler's Refactoring book.
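The incremental approach described above can be sketched in Python. All names and the discount rule here are hypothetical, invented only for illustration: suppose a bug was found in the discount calculation buried inside a long process_order function. Only that portion is pulled out into its own small function, and the rest of the long function is left alone.

```python
def compute_discount(subtotal, is_member):
    """Extracted during a bug fix: the only part of process_order we touched.

    Hypothetical rule for illustration: members get 10% off orders
    of 100 or more; everyone else gets no discount.
    """
    if is_member and subtotal >= 100:
        return subtotal / 10
    return 0


def process_order(items, is_member):
    """Stand-in for the long legacy function (imagine ~480 more lines)."""
    subtotal = sum(price * qty for price, qty in items)
    # ... the rest of the legacy logic stays untouched for now ...
    discount = compute_discount(subtotal, is_member)
    return subtotal - discount
```

The extracted function fits on one screen and has two returns, so it can be read and tested on its own, while the risk of touching the remaining legacy code is avoided until the next bug fix.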

This one rule of thumb has worked well for me in practice. What is your experience with writing smaller functions? Do you have any such rule that has worked for you?

Monday, July 04, 2011

Optimizing Django database access : Part 2

Originally published on BootstrapToday Blog

A few months back we moved to Django 1.2. Immediately we realized that our method of 'optimizing' database queries on foreign key fields didn't work any more. The main reason for this 'break' was the Django 1.2 changes to support multiple databases, specifically the way the 'get' function is called to access the value of a ForeignKey field.

In Django 1.2, ForeignKey fields are accessed using rel_mgr. See the call below.

rel_obj = rel_mgr.using(db).get(**params)

However, the manager.using() call returns a queryset. Hence if the manager has implemented a custom 'get' function, this custom 'get' function is not called. Since our database access optimization is based on a custom 'get' function, this change broke our optimization.

Check Django Ticket 16173 for the details.

Also, our original code didn't consider multi-db scenarios. Hence the query 'signature' computation has to consider the db name as well.

from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching for per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching
        '''
        model_key = (self.db, self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # One of the cache queries missed, so we have to get the object
        # from the database:
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key]=model
        return model

There are minor changes from the first version. This manager class now also stores the 'db' name in the key.
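One thing to watch with a key built from repr(kwargs) is that positional arguments are ignored and that the same lookup written with keyword arguments in a different order can produce a different key. A minimal sketch of a more defensive key function follows. The name make_cache_key is hypothetical and not part of Django or of the original code, and it assumes the lookup values have a stable repr.

```python
def make_cache_key(db, model, args, kwargs):
    """Build a hashable cache key for a manager-style get() call.

    Hypothetical helper, not part of Django. Sorting the keyword
    arguments by name makes the key independent of the order in which
    the caller wrote them; including db and args keeps lookups that
    differ only in those from sharing a cache entry.
    """
    ordered_kwargs = sorted(kwargs.items(), key=lambda item: item[0])
    return (db, model, repr(args), repr(ordered_kwargs))


key_a = make_cache_key("default", "Status", (), {"name": "open", "project": 7})
key_b = make_cache_key("default", "Status", (), {"project": 7, "name": "open"})
key_c = make_cache_key("replica", "Status", (), {"name": "open", "project": 7})
```

Here key_a and key_b are equal even though the keyword arguments were written in a different order, while key_c differs because it targets another database.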

Along with this change, you have to make one change to the core Django code. In the file django/db/models/fields/related.py, in the __get__ function of the 'ReverseSingleRelatedObjectDescriptor' class, replace the line

         rel_obj = rel_mgr.using(db).get(**params)

by line

         rel_obj = rel_mgr.db_manager(db).get(**params)

This will ensure that YourModelManager's get function is called while querying foreign keys.
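Why the one-line change matters can be shown with a toy model in plain Python. The classes below are simplified stand-ins written for this illustration, not Django's real classes. They only mimic the behaviour described above: using() hands back a queryset, while db_manager() hands back a copy of the manager bound to the chosen database.

```python
import copy


class ToyQuerySet:
    """Stand-in for a queryset: its get() knows nothing about the manager."""

    def __init__(self, db):
        self.db = db

    def get(self, **params):
        return ("queryset get", self.db, params)


class ToyManager:
    """Stand-in for a manager with the two ways of selecting a database."""

    def __init__(self, db="default"):
        self.db = db

    def using(self, db):
        # Returns a queryset, so a get() defined on a manager subclass is skipped.
        return ToyQuerySet(db)

    def db_manager(self, db):
        # Returns a copy of the manager itself, so subclass overrides still apply.
        clone = copy.copy(self)
        clone.db = db
        return clone

    def get(self, **params):
        return ToyQuerySet(self.db).get(**params)


class CachingManager(ToyManager):
    """Hypothetical manager with a custom get(), playing the role of YourModelManager."""

    def get(self, **params):
        return ("custom get", self.db, params)


mgr = CachingManager()
bypassed = mgr.using("replica").get(pk=1)
honoured = mgr.db_manager("replica").get(pk=1)
```

In this sketch, bypassed comes from the queryset's get and never touches the custom method, whereas honoured goes through the custom get with the requested database still attached.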

Saturday, February 12, 2011

Insights from ‘The Design of Design’ - Part I

I have had the good fortune of meeting and interacting with some great software designers while working at Geometric Ltd and during the last two years as an independent consultant. I am always intrigued by how an expert software designer thinks and how he learns. As Sir Ken Robinson says, the key skill in today's world is ‘knowing how to learn new things’. I think of myself as a ‘Thinking Craftsman’ (i.e. someone who thinks about his or her trade/craft and strives to continuously improve his or her skills). Hence ‘how an expert designer learns and becomes an expert’ is a key question for me in the quest to improve my own skills.

‘The Design of Design’ is a new book from Fred Brooks (author of The Mythical Man-Month). As expected, it has some great insights and some obvious ones, but most importantly it has great explanations of these insights. In this article, I am going to discuss the insights related to how ‘expert designers become experts’.

Insight One: Exemplars in Design
This is what Fred Brooks says about 'exemplars':

Exemplars provide safe models for new designs, implicit checklists of design tasks, warnings of potential mistakes and launching pads for radical new designs. Hence great designers have invested great efforts in studying their precedents. Bach took a six month unpaid leave to study the work and ideas of Buxtehude. Bach proved to be much greater composer but his surpassing excellence came from comprehending and using the techniques of his predecessors and not ignoring them

I argue that great technical designers need to do likewise but that the hurried pace of modern design has discouraged this practice. ... Technical design disciplines eager to produce great designs need to develop accessible bodies of exemplars and knowledgeable critiques of them (page 154-155)

‘Certainly lazy or slack designer can minimize his work by picking an exemplar and just modifying it to fit. By and large, those who just copy do not draw on ancient or remote exemplars but only on those that are most recent and fashionable’ (page 162).

Two things came to my mind after reading this. The first is ‘Design Patterns’. “In software engineering, a design pattern is a general reusable solution to a commonly occurring problem in software design. A design pattern is not a finished design that can be transformed directly into code. It is a description or template for how to solve a problem that can be used in many different situations.”

One purpose of documenting ‘design patterns in software’ is to ‘make experts' insights available to a novice’. In this sense, ‘design patterns’ fit perfectly into what Fred Brooks calls ‘accessible bodies of exemplars’. Personally, my thinking about software design changed after I studied patterns (especially the advantages and limitations of each pattern). Even today I read and reread the GoF Design Patterns book, articles by Robert Martin, articles and books by Martin Fowler, and books like Effective C++ and More Effective C++. Every time I gain some small new insights which enrich my own ‘body of knowledge’. Another good recent book of exemplars is ‘Beautiful Code: Leading Programmers Explain How They Think’. There are positive and negative reviews of this book. For me, this book is invaluable for its insights into how various developers think about a problem and how they come up with a solution.

The second thought was about ‘Googling’. In the past few years I have seen many developers just google a problem, find something and copy that code. Many times they just pick some pattern and copy the sample code for that pattern found on the internet. But since they don’t have any real understanding of the pattern, they end up with more problems. While conducting programs on ‘design patterns’ I see participants eager to get sample code rather than to understand the pattern, and eager to get the ‘PowerPoint slides’ rather than to read the books and articles. So far I have not found any solution to this ‘lazy or slack designer syndrome’. The bigger problem is that many of these lazy/slack designers are considered ‘good/great designers’ in their companies because they can rattle off the latest technology and design buzzwords.

For many years, out of habit, I have regularly read blogs and various books and studied how other good designers think. But I could not clearly articulate to a newcomer why I was doing this. Now I know how this habit has helped me and how it can help a newcomer study his craft.

Saturday, February 05, 2011

Optimizing Django database access : some ideas/experiments

Originally published on BootstrapToday Blog

As we added more features to BootStrapToday, we started facing performance issues. Page access was getting progressively slower. Recently we analyzed page performance using the excellent Django Debug Toolbar and discovered that in the worst case there were more than 500 database calls in a page. Obviously that was making page display slow. After looking at the various calls, analyzing the code and making changes, we were able to bring it down to about 80 calls, dramatically improving performance. Django has an excellent general-purpose caching framework. However, it’s better to minimize the calls first and then add caching for the remaining queries. In this article, I am going to show a simple but effective idea for improving database access performance.

In Django, a common reason for increased database calls is ForeignKey fields. When you access the attribute representing a foreign key, it typically results in a database access. The usually suggested solution to this problem is the ‘select_related’ function of the Django queryset API. It can substantially reduce the number of database calls. However, sometimes it is not possible to use ‘select_related’ (e.g. you don’t want to change how a third-party app works, or changing the queries requires a significant change in the logic).

In our case, we have ForeignKey fields like Priority, Status etc. on Ticket models. These fields are there because later we want to add the ability to ‘customize’ them. However, the values in these tables rarely change. Hence this usually results in multiple queries to get the same data. Usually these are ‘get’ calls. If we can add simple caching to the ‘get’ queries for status, priority etc., then we can significantly reduce the number of database calls.

In Django, a new connection is opened to handle each new request. Hence if we add a model instance cache to the ‘connection’, query results will be cached while one request is being handled. A new request will make the database query again. With this strategy we don’t need complicated logic to clear ‘stale’ items from the cache.

from django.db import connection, models
from django.core import signals

def install_cache(*args, **kwargs):
    setattr(connection, 'model_cache', dict())

def clear_cache(*args, **kwargs):
    delattr(connection, 'model_cache')

signals.request_started.connect(install_cache)
signals.request_finished.connect(clear_cache)

class YourModelManager(models.Manager):
    def get(self, *args, **kwargs):
        '''
        Added single row level caching for per connection. Since each request
        opens a new connection, this essentially translates to per request
        caching
        '''
        model_key = (self.model, repr(kwargs))

        model = connection.model_cache.get(model_key)
        if model is not None:
            return model

        # One of the cache queries missed, so we have to get the object from the database:
        model = super(YourModelManager, self).get(*args, **kwargs)
        if model is not None and model.pk is not None:
            connection.model_cache[model_key]=model
        return model

As a side benefit, since the same model instance is returned for the same query in subsequent calls, the number of duplicate instances is reduced and hence the memory footprint is also reduced.
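The per-request behaviour can be tried out without a Django project using a small simulation. Everything below is a stand-in written for illustration: FakeConnection plays the role of Django's connection object, fetch_from_db pretends to be a database query, and the install/clear functions are called by hand where Django would call them through the request_started and request_finished signals.

```python
class FakeConnection:
    """Stand-in for the per-request database connection object."""


connection = FakeConnection()
db_calls = []  # records every simulated database hit


def install_cache():
    connection.model_cache = {}


def clear_cache():
    # Remove the cache without failing if it was never installed.
    connection.__dict__.pop("model_cache", None)


def fetch_from_db(model_name, pk):
    """Pretend database query; each call is counted in db_calls."""
    db_calls.append((model_name, pk))
    return {"model": model_name, "pk": pk}


def cached_get(model_name, pk):
    key = (model_name, pk)
    obj = connection.model_cache.get(key)
    if obj is None:
        obj = fetch_from_db(model_name, pk)
        connection.model_cache[key] = obj
    return obj


# First "request": two lookups, one database hit, same instance both times.
install_cache()
first = cached_get("Status", 1)
second = cached_get("Status", 1)
clear_cache()

# Second "request": the cache is gone, so the database is queried again.
install_cache()
third = cached_get("Status", 1)
clear_cache()
```

Within one simulated request the second lookup returns the very same object as the first, and across requests the data is fetched afresh, which is why no explicit invalidation logic is needed.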
