Tuesday, February 24, 2009

Comparing Spds, Part 5 - Optimization

Testing out our sequence matching algorithm showed that our estimate of cubic order of complexity was accurate; unfortunately, this performance is what might be called "sub-optimal". If we have any hope of testing our 1,000 line stored procedures in a reasonable timeframe, we're going to have to optimize our algorithm.

The most obvious optimization to make is to simply reduce the number of comparisons, and the way we're going to do that is by checking the sequences ahead of the actual matching. We first check each value in the sequences to see if there exists a match in the other sequence, and then overload the MatchLineNum column to indicate match or no-match. At this point we have a series of subsequences within each sequence of possible matches. But what's important here is the length of the subsequences - the length of the longest subsequence is the maximum search size for that sequence - we need not search for longer ones in it because we already know that longer ones already contain a mismatch. Also, the longest possible matching subsequence between the two main ones will be the smaller of these two maximum lengths.

As an example, let's say that we have two sequences, S1 and S2, and we find that the length of the longest possible subsequences of S1 and S2 is n and m, resp. For the purposes of matching the subsequences between S1 and S2, the maximum subsequence to compare will be of length min(n, m).

The code to determine this maximum subsequence utilizes CTEs, one of the more important enhancements of SQL Server 2005. This was an interesting exercise. I wrote CTEs that first separately picked out the starting and ending points of the subsequences with matching values, constructed the sequences by matching starting and ending points, then calculated the maximum length of these subsequences. CTEs require a set-based approarch, rather than the more algorithmic row-by-row processing; the resulting code is more intuitive, and ultimately more powerful. Please look here for an excellent introductory article to CTEs.

In comparing this improved version against the original code, I tested both with input sizes of 100, 200, 300, 400, and 500, running each code base 5 times for each input size. Here are the results (columns 'Version1' and 'Version2' represent performance time in seconds):


We can see that the new version improved performance considerably for the smaller input sizes, but peeters out as the size increases. Analyzing the print output from the stored procedure, paying special attention to the @CmprSz values that were output, I can see that the starting compare sizes grew larger as a ratio to the input size as the input size increased. In other words, the larger the sample size the smaller the cost savings. Given that the test sequences consisted of "words" of two letters of the form [A-J][A-J], obviously the larger the sample size the more likely that random values will be shared between the sequences. In two particular runs of the larger sample sets, all values were shared between the two sequences, resulting in zero cost savings.

This is not exactly a failure of testing design - for many real-world scenarios, the likelihood of shared values should follow this trend of increasing as the sample size increases. Consider comparing the words from works of literature - comparing the text of any random two English language authors of the 20th century, the larger the works the more likely that they will share words between them. Another example is DNA - consider the base pairing of AT and GC combinations - since there are only two possibilities in the domain of values, we are certain to find 100% likelihood of shared values between any two DNA sequences (thus reducing our cost savings to zero).

You can find a complete copy of the improved stored procedure here.

No comments:

Post a Comment