I recently completed a data mart requiring me to reproduce, and improve upon, a critical departmental report created in Excel from the results of three stored procedures. These spds appeared to be almost, but not quite identical - they queried similar data sources and output results in identical data structures, using slightly varying processes. I surmised that there was a single stored procedure at one time, that was copied and modified to create the other two, almost like a genetic mutation. Unfortunately all three were actively used - and independently maintained (or not maintained - the risk of keeping duplicate code in use). What would've simplified the analysis greatly is if the original developer had overloaded the original spd to output each of the three result sets, perhaps accepting a parameter to indicate the result of interest.
What I really wanted to know was how similar were the spds, and what made them different from one another? Given that each spd was about 1000 lines long, a casual comparison would not be feasible.
I started my analysis by reading each spd into a table, line by line, so that each row in the table represented a line of code from the spd, and then compared the number of lines from each spd that matched lines in the other spd. Please go here for the complete code.
This produced some interesting results:
Percentage match between the spds: 92.27%
Percentage of spd1 found in spd2: 94.64%
Percentage of spd2 found in spd1: 90.12%
So based on the raw numbers of lines matching between the stored procedures, they were mostly the same.
Next: an improved algorithm providing more advanced matching.