- the code is RBAR-based rather than set-based (see this article for a refutation of that)
- there is no indexing of the text elements
- very little optimization has been done
How many comparisons are going on in this algorithm? To figure this out, let's split the accounting between how many subsequences exist, and how those are compared to the other sequence.
Let's say that we have two sequences, N and M, of lengths n and m respectively, where m >= n. Let's take subsequences out of N and compare them to M. So how many subsequences are there? This count will be denoted by x. If n = 1, then there is only one subsequence, so x = 1. If n = 2, then x = 3 (for sequence 'AB', the subsequences are 'AB', 'A', and 'B'). If n = 3, then x = 6. I will omit the proof here, but this results in the simple series (taken from Wikipedia): x = 1 + 2 + 3 + ... + n = n(n+1)/2.
If n = 1000, as is the case in my comparison, then x > 500,000! So we have upwards of half a million subsequences to compare already, and we're only counting the left side! Note that this is the worst-case scenario, where none of the elements match between the sequences (any elements contained in a matching subsequence are excluded from future comparisons).
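If you want to sanity-check that count, here is a small T-SQL sketch (my own throwaway illustration, not part of the comparison procedure) that enumerates every contiguous subsequence of a short string and evaluates the closed form n(n+1)/2:

```sql
-- Throwaway sketch: enumerate every contiguous subsequence of a short string
-- and check the count against n(n+1)/2.
DECLARE @s VARCHAR(10) = 'ABC';
DECLARE @n INT = LEN(@s);

WITH Nums AS (          -- numbers 1..n
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects
)
SELECT SUBSTRING(@s, p.i, l.i) AS Subsequence   -- ABC, AB, BC, A, B, C
FROM Nums AS p          -- starting position
JOIN Nums AS l          -- subsequence length
  ON p.i + l.i - 1 <= @n
ORDER BY l.i DESC, p.i;

SELECT @n * (@n + 1) / 2   AS SubseqCount,       -- 6 for 'ABC'
       1000 * 1001 / 2     AS SubseqCountN1000;  -- 500,500 (hence "x > 500,000")
```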
So now the question is, how many subsequences on the right side must we compare to those on the left? Let's look at a simple example comparing strings 'ABC' and 'XYZ':
- compare subsequences of length 3: ABC:XYZ (1 comparison)
- compare subsequences of length 2: AB:XY, AB:YZ, BC:XY, BC:YZ (4 comparisons)
- compare subsequences of length 1: A:X, A:Y, A:Z, B:X, B:Y, B:Z, C:X, C:Y, C:Z (9 comparisons)
The total number of comparisons is 1 + 4 + 9 = 14. Note that the number of comparisons for subsequences of length k is (n - k + 1)^2, so the per-length counts are the squares 1^2, 2^2, 3^2, and the total is another series: 1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6.
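Here is another quick T-SQL sanity check (again just an illustration, not code from the comparison procedure): it sums the per-length comparison counts (n - k + 1)^2 for two equal-length sequences and evaluates the closed form n(n+1)(2n+1)/6:

```sql
-- Throwaway sketch: add up the per-length comparison counts for two
-- equal-length sequences and compare against n(n+1)(2n+1)/6.
DECLARE @n INT = 3;     -- length of both sequences ('ABC' vs 'XYZ')

WITH Nums AS (          -- subsequence lengths 1..n
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS k
    FROM sys.all_objects
)
SELECT SUM((@n - k + 1) * (@n - k + 1))     AS SummedCount,   -- 9 + 4 + 1 = 14
       @n * (@n + 1) * (2 * @n + 1) / 6     AS ClosedForm     -- 14
FROM Nums;
```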
Now let's say that we were comparing 'ABCD' and 'XYZ'. How many extra comparisons does this add (extras highlighted in bold)?
- compare subsequences of length 3: ABC:XYZ, **BCD:XYZ** (1 extra comparison)
- compare subsequences of length 2: AB:XY, AB:YZ, BC:XY, BC:YZ, **CD:XY**, **CD:YZ** (2 extra comparisons)
- compare subsequences of length 1: A:X, A:Y, A:Z, B:X, B:Y, B:Z, C:X, C:Y, C:Z, **D:X**, **D:Y**, **D:Z** (3 extra comparisons)
Does this look familiar? The extra comparisons are the first series summation we saw earlier. And although it certainly adds processing resources, what we are most interested in is the order of complexity, which in our case is on the order of n^2 * m, i.e. O(n^3) when the two sequences are of comparable length n.
For our example comparison of two stored procedures of around 1000 lines each, that works out to processing on the order of 1 billion comparisons. In an unoptimized algorithm, it's no wonder this took almost seven hours to complete!
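As a back-of-the-envelope check (my own sketch, not the logic inside spd_SequenceCompare), the worst-case number of subsequence pairings for sequences of lengths n and m is the sum of (n - k + 1)(m - k + 1) over every subsequence length k. For n = m = 1000 that evaluates to 333,833,500 pairings, and that is before counting the element-by-element work inside each pairing, which is how we end up in billion-comparison territory:

```sql
-- Back-of-the-envelope check: worst-case subsequence pairings for two
-- sequences of lengths n and m. At subsequence length k there are
-- (n - k + 1) candidates on the left and (m - k + 1) on the right.
DECLARE @n BIGINT = 1000;   -- shorter sequence
DECLARE @m BIGINT = 1000;   -- longer sequence (m >= n)

WITH Nums AS (              -- subsequence lengths 1..n
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS k
    FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b   -- plenty of rows
)
SELECT SUM((@n - k + 1) * (@m - k + 1)) AS WorstCasePairings    -- 333,833,500
FROM Nums;
```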
Jessie,
The script looks interesting and I understand the math. I am mostly an application developer but have occasion to write a stored proc or two. What I am not sure about (not being a SQL geek) is how to use the Performance Testing Script to test the performance of ANOTHER stored proc. How do I make the PTS aware of the stored procedure I am attempting to test the performance of? Sorry, it may be simple but I am missing it.
Thanks,
Little John
JMasciantoni@FHCP.com
John, thanks for your question. The blog entry following this one, 'part 4', has a link to the testing script: http://www.sqlservercentral.com/scripts/TSQL/65943/. That script has a loop to repeatedly test the stored procedure "spd_SequenceCompare". So if you change that reference from "EXEC spd_SequenceCompare" to whichever spd you wish to test, you will have what you are looking for. The only other thing is that you would probably want to change how your input data gets populated, and remove the references to "PcntMatch". Feel free to contact me with any other questions.
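To make the idea concrete, here is a bare-bones sketch of that change (the real Performance Testing Script is the one at the link above; dbo.usp_MyProcUnderTest and @Iterations below are placeholder names, not names from that script):

```sql
-- Illustrative only: swap in the procedure you actually want to time.
DECLARE @Iterations INT = 10;
DECLARE @i INT = 1;
DECLARE @Start DATETIME2 = SYSDATETIME();

WHILE @i <= @Iterations
BEGIN
    -- The published script calls: EXEC spd_SequenceCompare ... (and captures PcntMatch).
    -- Point this at your own procedure instead:
    EXEC dbo.usp_MyProcUnderTest;
    SET @i = @i + 1;
END;

SELECT DATEDIFF(MILLISECOND, @Start, SYSDATETIME()) / (1.0 * @Iterations) AS AvgDurationMs;
```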