In the previous posting on this topic, Comparing Stored Procedures, Part 1, we found that the stored procedures of interest were more than 90% alike, when counting matching lines of code between them. This result is certainly helpful, but the approach has certain downsides. For example, if the two spds were composed of many blank lines, then we might have a false positive result. If the spds had exactly the same lines, but in different orders, then we would see a 100% match, but we couldn't really say that they were the same. Another thing I was really interested in was, For the lines of matching code, how large were the blocks of matches (matching lines where the next lines matched also)? That sort of result could lead on to something developers familiar with Visual Studio 98 had at their disposal - Windiff, a visual comparison of the spds.
In order to provide this more detailed picture of how the stored procedures matched one another, I devised an algorithm that would compare blocks of lines between them, starting with the largest possible sequence to compare (the size of this would be the number of lines of the smaller spd), and then compare smaller and smaller sequences until the comparison size was just one line (which is what was compared in Comparing Stored Procedures, Part 1). The trick to this approach is that if a match was found between to large sequences of lines, the subsequences in them would never be compared. Although this certainly helps with performance, the real reason behind this is that the fact that large blocks of code matched is very meaningful. Take this example: the string 'ABCD' matches 'ABDC' more than it matches 'DCBA'. How much "more" it matches is the subject of a later posting, where we establish our scoring algorithm.
For a complete list of the code, please go here. The code runs a comparison of the strings "AZXYZBC" and "BCDAEAXYZ" (regarding each letter as a line of code to compare). Because "AZXYZBC" is the smaller string, the size of subsequences to compare starts at its length, 7, and decrements from there. The comparison first checks to see if any instances of "AZXYZBC" are in "BCDAEAXYZ"; obviously there are none, so it next checks for matching subsequences of length 6. No matches are found until it reaches length 3, at which point it finds matches of "XYZ" in both sequences. When it finds that subsequence, it marks each row in the subsequence to indicate the starting matching position in the other sequence; this mark will prevent the letter from being compared again. This allows the comparison to show the matching subsequences in their greatest length, which also indicates the strength of the match.
This image shows the final matching between the sequences:
Notice that row 2 on the left shows no match, even though there is a 'Z' on the right in row 9 - this is because the matching 'XYZ' subsequences are much more important given their longer length.
Next: performance analysis of our algorithm.