In order to provide this more detailed picture of how the stored procedures matched one another, I devised an algorithm that would compare blocks of lines between them, starting with the largest possible sequence to compare (the size of this would be the number of lines of the smaller spd), and then compare smaller and smaller sequences until the comparison size was just one line (which is what was compared in Comparing Stored Procedures, Part 1). The trick to this approach is that if a match was found between to large sequences of lines, the subsequences in them would never be compared. Although this certainly helps with performance, the real reason behind this is that the fact that large blocks of code matched is very meaningful. Take this example: the string 'ABCD' matches 'ABDC' more than it matches 'DCBA'. How much "more" it matches is the subject of a later posting, where we establish our scoring algorithm.
For a complete list of the code, please go here. The code runs a comparison of the strings "AZXYZBC" and "BCDAEAXYZ" (regarding each letter as a line of code to compare). Because "AZXYZBC" is the smaller string, the size of subsequences to compare starts at its length, 7, and decrements from there. The comparison first checks to see if any instances of "AZXYZBC" are in "BCDAEAXYZ"; obviously there are none, so it next checks for matching subsequences of length 6. No matches are found until it reaches length 3, at which point it finds matches of "XYZ" in both sequences. When it finds that subsequence, it marks each row in the subsequence to indicate the starting matching position in the other sequence; this mark will prevent the letter from being compared again. This allows the comparison to show the matching subsequences in their greatest length, which also indicates the strength of the match.
This image shows the final matching between the sequences:
Notice that row 2 on the left shows no match, even though there is a 'Z' on the right in row 9 - this is because the matching 'XYZ' subsequences are much more important given their longer length.
Next: performance analysis of our algorithm.
No comments:
Post a Comment