Jesse's SQL Blog: February 2009

Wednesday, February 25, 2009

Comparing Stored Procedures, Part 6

In the previous entry, we found that computing the maximum search size prior to the sequence matching using SQL Server 2005's CTE feature saved a considerable amount of time, but we were still performing at an unacceptable rate. I decided it was time to rework the core logic of the matching algorithm. Impressed by the blazing performance of the CTE used to precalculate the max search length, I wanted to recode the algorithm using the set-based logic of the CTE.

Initially using an iterative approach to this problem made a lot of sense, since we need to control the order of the subsequences checked (longer before shorter). There didn't seem to be an obvious way to do this with a CTE. The real power of the CTE comes from its feature of recursion - the ability to have a set of data in the CTE refer to itself (please go here for an excellent article on this). I developed a hybrid approach that would combine iterative and recursive code. The recursive CTE in it will select off the longest matching subsequence between the two sequences it can find whose values are all unmatched, save it in a temp table, and then update the two sequences to show the matching values. It would then continue doing this until no more matching subsequences can be found. This new version can be found here.

The code is much more concise, much more elegant, than the iterative algorithm. But what is truly amazing is the performance increase. After altering the testing script to test all three versions of the code under the same conditions, in the same testing session, the improved performance speaks for itself:

The 'Improved' column compares the latest version with the first. The improved performance represents not just absolute time; notice how slowly the rate of extra time required increases for each increment of the input size. After running the new version for input sizes from 100 to 800 values, I created this graph of the performance time in seconds (graph developed using The Graph Calculator by Holt, Rinehart, and Winston):

The graph rises very slowly at first, then the performance time starts to grow very quickly towards the right end of the graph. Using this website for power regression analysis, we arrive at the approximate formula of y=3.5x^2.5/10^-6. If this is accurate, then the new version is somewhere between square and cubic order of complexity, and although considerably faster than the previous versions, will suffer at larger input sizes.

Next: we examine the results of running our new code on the original large stored procedures.

Tuesday, February 24, 2009

Comparing Spds, Part 5 - Optimization

Testing out our sequence matching algorithm showed that our estimate of cubic order of complexity was accurate; unfortunately, this performance is what might be called "sub-optimal". If we have any hope of testing our 1,000 line stored procedures in a reasonable timeframe, we're going to have to optimize our algorithm.

The most obvious optimization to make is to simply reduce the number of comparisons, and the way we're going to do that is by checking the sequences ahead of the actual matching. We first check each value in the sequences to see if there exists a match in the other sequence, and then overload the MatchLineNum column to indicate match or no-match. At this point we have a series of subsequences within each sequence of possible matches. But what's important here is the length of the subsequences - the length of the longest subsequence is the maximum search size for that sequence - we need not search for longer ones in it because we already know that longer ones already contain a mismatch. Also, the longest possible matching subsequence between the two main ones will be the smaller of these two maximum lengths.

As an example, let's say that we have two sequences, S1 and S2, and we find that the length of the longest possible subsequences of S1 and S2 is n and m, resp. For the purposes of matching the subsequences between S1 and S2, the maximum subsequence to compare will be of length min(n, m).

The code to determine this maximum subsequence utilizes CTEs, one of the more important enhancements of SQL Server 2005. This was an interesting exercise. I wrote CTEs that first separately picked out the starting and ending points of the subsequences with matching values, constructed the sequences by matching starting and ending points, then calculated the maximum length of these subsequences. CTEs require a set-based approarch, rather than the more algorithmic row-by-row processing; the resulting code is more intuitive, and ultimately more powerful. Please look here for an excellent introductory article to CTEs.

In comparing this improved version against the original code, I tested both with input sizes of 100, 200, 300, 400, and 500, running each code base 5 times for each input size. Here are the results (columns 'Version1' and 'Version2' represent performance time in seconds):

InputSize___Version1___Version2____Improvement
100_________5.064______1.222_______75.9%
200_________48.680_____13.534______72.2%
300_________92.918_____35.958______61.3%
400_________461.862____318.318_____31.1%
500_________614.412____417.818_____32.0%

We can see that the new version improved performance considerably for the smaller input sizes, but peeters out as the size increases. Analyzing the print output from the stored procedure, paying special attention to the @CmprSz values that were output, I can see that the starting compare sizes grew larger as a ratio to the input size as the input size increased. In other words, the larger the sample size the smaller the cost savings. Given that the test sequences consisted of "words" of two letters of the form [A-J][A-J], obviously the larger the sample size the more likely that random values will be shared between the sequences. In two particular runs of the larger sample sets, all values were shared between the two sequences, resulting in zero cost savings.

This is not exactly a failure of testing design - for many real-world scenarios, the likelihood of shared values should follow this trend of increasing as the sample size increases. Consider comparing the words from works of literature - comparing the text of any random two English language authors of the 20th century, the larger the works the more likely that they will share words between them. Another example is DNA - consider the base pairing of AT and GC combinations - since there are only two possibilities in the domain of values, we are certain to find 100% likelihood of shared values between any two DNA sequences (thus reducing our cost savings to zero).

You can find a complete copy of the improved stored procedure here.

Thursday, February 19, 2009

Comparing Spds, Part 4 - Performance

In a previous posting, I analyzed the comparison algorithm to determine its complexity, and found it to be of cubic order. The next step is to test the actual performance of the code and see if my estimate was correct. To do this, I wrote a testing script that tests the code several times under several different input sizes (this code can be found here). Using the table in the testing script, I ran a couple more queries to reformat the results to be more suitable for my specific code:

SELECT
InputSize,
NextInputSize = InputSize + 50.0,
NumberOfRuns = COUNT(*),
AveragePercentMatch = AVG(PcntMatch),
AverageRunTime = AVG(CONVERT(decimal(9, 2), CONVERT(varchar(max), DATEDIFF(ss, RunStart, RunDone)) + '.' + CONVERT(varchar(max), DATEDIFF(ms, RunStart, RunDone))))
INTO #AggregatedPerformanceResults
FROM #PerformanceResults
GROUP BY InputSize

SELECT
A.InputSize,
A.AverageRunTime,
ExpectedNextAverageRunTime = POWER(A.NextInputSize / A.InputSize, 3.0) * A.AverageRunTime
FROM #AggregatedPerformanceResults A
JOIN #AggregatedPerformanceResults B ON B.InputSize = A.NextInputSize

and here are the results I saw:

InputSize___AverageRunTime___ExpectedNextAvgRunTime
50__________4.618000_________36.944000_____________
100_________37.270000________125.786250____________
150_________142.642000_______338.114370____________
200_________366.866000_______716.535156____________

The 'ExpectedNextAverageRunTime' values are, as the name indicates, the expected average time of the next input size based on the results of the given input size. This is computed by computing the percentage difference between the given and next input size, cubing it, and multiplying the result by the given average run time. For InputSize = 50, the percentage difference between 100 and 50 is 100/50 = 2, 2 to the third power is 8, times the average run time of 4.618 seconds is 36.944 seconds. Notice that differs only slightly from the actual average run time for InputSize of 100, which is 37.27 seconds. Comparing subsequent expected versus actual average run times shows a similar consistency, which confirms the analysis of the previous posting, that the matching algorithm is of cubic order of complexity.

Next: optimizing the matching algorithm.

Thursday, February 12, 2009

Complicated Comments

A "Question of the Day" I submitted to SQLServerCentral.com was published yesterday, and it seemed to generate a lot of interest, due to the rather surprising answer.

The question was, "Given the code below, which numbers will be printed?"

PRINT '1' -- /* ;PRINT '2' */ ;PRINT '3' /*
PRINT '4' --*/
--/*
PRINT '5'
--*/
/*
PRINT '6'
--/*
*/
PRINT '7'
--*/
PRINT '8'

Possible answers:
A. 1, 8
B. 1, 4, 8
C. 1, 4, 5, 8
D. 1, 4, 5, 7, 8

Correct answer: C

Fewer than half of the respondents answered correctly (I certainly would have gotten the question wrong myself - this question derived from research that completely surprised me). It's not exactly a trick question, but the answer centers around a little-known feature of nested comments. It's fairly obvious that the numbers 1, 4, 5, and 8 will print. The tricky part is near the bottom of the code. Most people look at the block comment starting above the line "PRINT '6'" and think that it ends above the line "PRINT '7'", following the reasoning that the line below "PRINT '6'", "--/*" is ignored by the SQL parser. But because the inline comment of that line is already inside a block comment, the inline comment is ignored! Which means that the line "--/*" actually begins a new, nested block comment that closes on the next line, but causes the line "PRINT '7'" to remain commented-out. In other words, the block comment containing "PRINT '6'" also contains "PRINT '7'", and does not close until the line above "PRINT '8'". This confusion explains why, as of right now, 40% of the respondents chose answer "D".

When I submitted the question for publication, I wondered that many people would find it too abstruse and esoteric to be of value. A number of participants in the discussion of the question pointed out that if they saw comments similar to this in a code review, they would send the developer back to his or her desk to clean up. And point taken. Nesting comments in this fashion has very little, if any, upside and almost 100% downside in that it can lead to largely unexpected results. A couple of articles on the site help to establish best practices for commenting: "The case against using single-line comments", by Cade Bryant, and "Worst Practice - Bad Comments", by Andy Warren.

This problem originally came to mind when I started work on a tool for comparing the text from stored procedures. I wanted to allow the option of omitting all comments from the comparison, which of course would require me to identify and parse out comments from the text. In my research, I came across the question of how to treat nested comments - the "order of operation" of inline versus block comments, nested blocks, etc., and that's when I came across the scenario that formed this question of the day. I have to say that I was surprised, almost to the point of shock, that commenting functioned this way.

Tuesday, February 10, 2009

Comparing Spds Part 3 - Performance

In the previous post, "Comparing Stored Procedures, Part 2," we looked at an algorithm that would match smaller and smaller subsequences between two sequences of interest. When I coded it, I approached it as a proof-of-concept, and thus sidelined any performance concerns. When I tested it on two stored procedures of about 1000 lines a piece on an idle server, the comparison took about 6.5 hours! Some possible reasons why:

the code is RBAR-based rather than set-based (see this article for a refutation of that)
there is no indexing of the text elements
very little optimization has been done

How many comparisons are going on in this algorithm? To figure this out, let's split the accounting between how many subsequences exist, and how those are compared to the other sequence.

Let's say that we have two sequences, N and M, of lengths n and m respectively, where m >= n. Let's take subsequences out of N and compare to M. So how many are there? This count will be denoted by x. If n = 1, then there is only one subsequence, so x = 1. If n = 2, then x = 3 (with sequence 'AB', there is 'AB', 'A', and 'B'). If n = 3, then x = 6. I will omit the proof here, but this results in the simple series (taken from Wikipedia):

If n = 1000, as is the case in my comparison, then x > 500,000! So we have upwards of a half-million comparisons to do already, and we're only counting the left side! Note that this is the worst-case scenario, where none of the elements match between the sequences (any elements contained in a matching subsequence are excluded from future comparisons).

So now the question is, how many subsequences on the right side must we compare to those on the left? Let's look at a simple example comparing strings 'ABC' and 'XYZ':

compare subsequences of length 3: ABC:XYZ (1 comparison)
compare subsequences of length 2: AB:XY, AB:YZ, BC:XY, BC:YZ (4 comparisons)
compare subsequences of length 1: A:X, A:Y, A:Z, B:X, B:Y, B:Z, C:X, C:Y, C:Z (9 comparisons)

The total number of comparisons is 1 + 4 + 9 = 14. Note that the number of comparisons of length n is n^2, and that the sum is another series:

Now let's say that we were comparing 'ABCD' and 'XYZ'. How many extra comparisons does this add (extras highlighted in bold)?

compare subsequences of length 3: ABC:XYZ, BCD:XYZ (1 extra comparison)
compare subsequences of length 2: AB:XY, AB:YZ, BC:XY, BC:YZ, CD:XY, CD:YZ (2 extra comparisons)
compare subsequences of length 1: A:X, A:Y, A:Z, B:X, B:Y, B:Z, C:X, C:Y, C:Z, D:X, D:Y, D:Z (3 extra comparisons)

Does this look familiar? The extra comparisons are the first series summation we saw earlier. And although it certainly adds processing resources, what we are most interested in is the order of complexity, which in our case is

Examining our example comparison of two stored procedures of around 1000 lines each, this results in processing involving on the order of 1 billion comparisons. In an unoptimized algorithm, it's no wonder why this took almost seven hours to complete!

Next: testing our hypothesis in estimating performance.

Friday, February 6, 2009

Comparing Stored Procedures, Part 2

In the previous posting on this topic, Comparing Stored Procedures, Part 1, we found that the stored procedures of interest were more than 90% alike, when counting matching lines of code between them. This result is certainly helpful, but the approach has certain downsides. For example, if the two spds were composed of many blank lines, then we might have a false positive result. If the spds had exactly the same lines, but in different orders, then we would see a 100% match, but we couldn't really say that they were the same. Another thing I was really interested in was, For the lines of matching code, how large were the blocks of matches (matching lines where the next lines matched also)? That sort of result could lead on to something developers familiar with Visual Studio 98 had at their disposal - Windiff, a visual comparison of the spds.

In order to provide this more detailed picture of how the stored procedures matched one another, I devised an algorithm that would compare blocks of lines between them, starting with the largest possible sequence to compare (the size of this would be the number of lines of the smaller spd), and then compare smaller and smaller sequences until the comparison size was just one line (which is what was compared in Comparing Stored Procedures, Part 1). The trick to this approach is that if a match was found between to large sequences of lines, the subsequences in them would never be compared. Although this certainly helps with performance, the real reason behind this is that the fact that large blocks of code matched is very meaningful. Take this example: the string 'ABCD' matches 'ABDC' more than it matches 'DCBA'. How much "more" it matches is the subject of a later posting, where we establish our scoring algorithm.

For a complete list of the code, please go here. The code runs a comparison of the strings "AZXYZBC" and "BCDAEAXYZ" (regarding each letter as a line of code to compare). Because "AZXYZBC" is the smaller string, the size of subsequences to compare starts at its length, 7, and decrements from there. The comparison first checks to see if any instances of "AZXYZBC" are in "BCDAEAXYZ"; obviously there are none, so it next checks for matching subsequences of length 6. No matches are found until it reaches length 3, at which point it finds matches of "XYZ" in both sequences. When it finds that subsequence, it marks each row in the subsequence to indicate the starting matching position in the other sequence; this mark will prevent the letter from being compared again. This allows the comparison to show the matching subsequences in their greatest length, which also indicates the strength of the match.

This image shows the final matching between the sequences:

Notice that row 2 on the left shows no match, even though there is a 'Z' on the right in row 9 - this is because the matching 'XYZ' subsequences are much more important given their longer length.

Next: performance analysis of our algorithm.

Wednesday, February 4, 2009

Combination of Columns with Unique Values

One of the side projects that I am most proud of is the code I wrote to query a table and return a list of columns comprising a potential unique key, which can be found here. Here's a snippet of the write-up I did for it:

"As an ETL developer, I often receive files in which the natural key is either unidentified or misidentified, and I need to determine a combination of columns that uniquely identifies each row (the natural key, which is usually also the primary or unique key of the table holding this data), in order to integrate the data from that file into the database. Many times it is simply not clear what the unique key should be, and this requires a painstaking cycle of guessing and testing combinations of columns for unique values. To address this problem, I developed a stored procedure that will automatically query the table to discover a candidate for the unique key. At first, I wanted to create a tool that would report all unique combos, but due to performance reasons, I later limited that to just the first one discovered. I added a way to retest and exclude that result if the first one is not satisfactory."

You can also follow this link for the complete article:
http://www.sqlservercentral.com/scripts/T-SQL/62086/

Comparing Stored Procedures, Part 1

I recently completed a data mart requiring me to reproduce, and improve upon, a critical departmental report created in Excel from the results of three stored procedures. These spds appeared to be almost, but not quite identical - they queried similar data sources and output results in identical data structures, using slightly varying processes. I surmised that there was a single stored procedure at one time, that was copied and modified to create the other two, almost like a genetic mutation. Unfortunately all three were actively used - and independently maintained (or not maintained - the risk of keeping duplicate code in use). What would've simplified the analysis greatly is if the original developer had overloaded the original spd to output each of the three result sets, perhaps accepting a parameter to indicate the result of interest.

What I really wanted to know was how similar were the spds, and what made them different from one another? Given that each spd was about 1000 lines long, a casual comparison would not be feasible.

I started my analysis by reading each spd into a table, line by line, so that each row in the table represented a line of code from the spd, and then compared the number of lines from each spd that matched lines in the other spd. Please go here for the complete code.

This produced some interesting results:
spd1: rpt_ReportingSpd1
spd2: rpt_ReportingSpd2
Percentage match between the spds: 92.27%
Percentage of spd1 found in spd2: 94.64%
Percentage of spd2 found in spd1: 90.12%

So based on the raw numbers of lines matching between the stored procedures, they were mostly the same.

Next: an improved algorithm providing more advanced matching.

Tuesday, February 3, 2009

Welcome to My Blog

I am a database developer focusing on ETL/BI development. My technological background centers on Microsoft SQL Server technologies such as T-SQL programming, SSIS, and SSRS, but also includes Informatica, Visual Foxpro, and application development such as Visual Basic and Java. This weblog will focus on SQL Server, especially T-SQL development and programming. My aim is to present projects and problems in SQL Server and explain my thought process at delivering a solution.

My webpage is http://www.jessemclain.com/, and I can be reached via email at jesse@jessemclain.com. I have contributions on SQL Server Central that can be accessed via the directory at http://www.sqlservercentral.com/Authors/Scripts/Jesse_McLain/413474/.

Jesse's SQL Blog