Last week we discovered a set-based method of implementing our sequence-matching algorithm using a Common Table Expression (CTE), and found that it beat our iterative code by several orders of magnitude. Remember the original impetus for this effort, comparing the two stored procedures of more than 1,000 lines each, which initially took about six-and-a-half hours? Running this new version in TempDb on an idle database server took just over 30 seconds! How could this be, given that in Part 6 a comparison of 800 random 2-letter values took about 70 seconds? (Omitted from those results - an input size of 1,000 took over 2 minutes.) How could a test of the same input size (1000 values), but with vastly more complex values (a real stored procedure rather than random two-letter values), take one-quarter of the time to complete? The answer lies in the length of the matching subsequences. If we run this query:
SELECT MatchLineNum, SubSeqLen = COUNT(*)
FROM Seq1
GROUP BY MatchLineNum
ORDER BY COUNT(*) DESC
we get the results:
So on the first matching subsequence, we match 121 values, which is over 10% of the entire sequence. The next matches are composed of 84, 82, 64, and 56 values, which when combined with the first, comprise about 30% of all values. With the sequences of random values, the matching subsequences are much shorter, and therefore requre more executions of the CTE block.
Wednesday, March 4, 2009
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment