Wednesday, March 25, 2009

SQL Server Data Profiler

With the newest edition of SQL Server, 2008, comes a tool that serves as a good first cut at data profiling. Here are some good articles on it:

Fuzzy Matching

While doing some research on data quality, I came across an article about implementing the Jaro-Winkler distance metric in SSIS. Given two character strings, the algorithm returns a number between 0 and 1 indicating similarity (0 meaning no similarity and 1 an exact match). The Wikipedia entry uses the names MARTHA and MARHTA as an example: the plain Jaro similarity is 0.944 (about a 94% match), and the Winkler prefix adjustment raises it to 0.961.
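The metric itself is compact enough to sketch outside of SSIS. Here is a minimal Python version, written for illustration (it is not the article's SSIS code, just a straightforward reading of the definition):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters "match" if equal and within half the longer length of each other.
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Winkler's adjustment: boost the score for a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)


print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961 (plain Jaro is ~0.944)
```

The prefix boost is what makes Jaro-Winkler suit name matching: MARTHA and MARHTA share the prefix MAR, so the transposed T/H costs relatively little.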

Friday, March 6, 2009

When I ran the comparison of the original versus the most recent version of the sequence-matching code, I saved the results in the Seq1 and Seq2 tables. While running some tests on those result tables, I discovered a mistake:

The four counts should all be equal, and because they are not, we know that there is a mistake in at least one of them. Because the version 1 results are not even consistent with one another, I have to think that the problem is with the original version.

This query:

SELECT
     Seq1_Line = ISNULL(LTRIM(STR(S1.CodeLineNum)), '')
    ,ISNULL(S1.CodeLineTxt, '')
    ,ISNULL(S2.CodeLineTxt, '')
    ,Seq2_Line = ISNULL(LTRIM(STR(S2.CodeLineNum)), '')
    ,OrderBy = ISNULL(S1.CodeLineNum, S2.CodeLineNum)
FROM Seq1 S1
FULL OUTER JOIN Seq2 S2    -- the SELECT and join clauses were dropped from the post;
                           -- a full outer join shows unmatched lines on both sides
    ON S1.CodeLineNum = S2.MatchLineNum
ORDER BY OrderBy

will display the matches between the two sequences in a "Windiff" fashion, and running it on the results of the most recent matching algorithm shows that it worked correctly.

Wednesday, March 4, 2009

Comparing Stored Procedures, Part 7

Last week we discovered a set-based method of implementing our sequence-matching algorithm using a Common Table Expression (CTE), and found that it beat our iterative code by orders of magnitude. Remember the original impetus for this effort: comparing two stored procedures of more than 1,000 lines each, which initially took about six and a half hours? Running this new version in TempDb on an idle database server took just over 30 seconds! How could this be, given that in Part 6 a comparison of 800 random two-letter values took about 70 seconds? (Omitted from those results: an input size of 1,000 took over 2 minutes.) How could a test of the same input size (1,000 values), but with vastly more complex values (a real stored procedure rather than random two-letter values), complete in a quarter of the time? The answer lies in the length of the matching subsequences. If we run this query:

SELECT MatchLineNum, SubSeqLen = COUNT(*)
FROM Seq2                  -- assuming the match results were saved in Seq2;
                           -- the FROM clause was dropped from the post
WHERE MatchLineNum IS NOT NULL
GROUP BY MatchLineNum
ORDER BY SubSeqLen DESC


we get the results:

So on the first matching subsequence, we match 121 values, which is over 10% of the entire sequence. The next matches are composed of 84, 82, 64, and 56 values, which, combined with the first, comprise about 30% of all values. With the sequences of random values, the matching subsequences are much shorter, and therefore require more executions of the CTE block.
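This effect is not specific to the T-SQL implementation. Python's difflib.SequenceMatcher uses the same general strategy of finding the longest matching block and then recursing on the unmatched pieces on either side, so it makes the effect easy to see. A hedged illustration (the sample "procedures" below are made up, not the ones from this series):

```python
from difflib import SequenceMatcher

# Two tiny stored-procedure texts as lists of lines: identical except one line.
old = ["CREATE PROC p AS", "SELECT a", "FROM t", "WHERE x = 1", "ORDER BY a"]
new = ["CREATE PROC p AS", "SELECT a", "FROM t", "WHERE x = 2", "ORDER BY a"]

sm = SequenceMatcher(None, old, new)

# Each matching block is found in one pass; only the leftovers are recursed on.
blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
for b in blocks:
    print("match of", b.size, "lines at positions", (b.a, b.b))
```

Here one block of 3 lines and one block of 1 line cover every match, so only two find-longest-match passes do all the work. Random two-letter values rarely form long runs, so each pass matches little and many more passes (or CTE recursions, in the T-SQL version) are needed.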