Jesse's SQL Blog: CTE


Tuesday, September 1, 2009

TSQL Challenge #13 - Set-based solution

The set-based solution took me 3 or 4 times as long to develop as the cursor-based solution yesterday. I'm not sure if that's the nature of set-based development, my own learning curve, or both (like going from algebra to calculus). The set-based solution took a couple of interesting tricks to solve, which I will hold off publishing until after the deadline to the challenge.

Thursday, June 18, 2009

Poker DW: Stacking the Deck

Returning to the Poker Data Warehouse, I dove into the text-parsing process after setting up a sketch of an initial database design, with the idea of polishing out the design at a later date. I recently completed the parsing of a sample hand history, and man was it more work than I expected! Fortunately I love writing text-parsing routines (perhaps from early career work in merge-purge duplicate removal), so the 750 line stored procedure was more a labor of love than a tedious chore. Getting into the dirty details of the source data made me think more about the 30,000 ft overview also.

I created a representation of a 52 card deck of cards in the Poker DW, and I started thinking about how to evaluate 5 card poker hands (i.e., determining what a player had at the end of the hand). What I really want is to be able to evaluate the odds of making the best hand on the next card or the river, which would ultimately allow me to judge whether a player made the right decision. This result would be similar to the "% to win" stats that you see on TV.

After I created my deck of cards, I started playing around with representing a hand of 5 cards. How many possible 5 card hands are there? Easy - think of it like this. Take 1 card from the deck, there's 52 cards to choose from. Take another card, there's 51 to choose from. Keep picking until you have 5 cards in your hand, that leaves 52 * 51 * 50 * 49 * 48 = 311,875,200 possible 5 card hands.

The problem with this method is that I'm picking permutations of 5 card hands, rather than combinations. Let's reduce my example above to picking two cards rather than five. According to that math, there are 52 * 51 = 2,652 possible two card hands. Using the card deck created above, this query will return that count, 2652 rows:

;WITH Draw1 AS (
    SELECT Card1 = CardId
    FROM Dim_Deck
),
Draw2 AS (
    SELECT
        Card1,
        Card2 = CardId
    FROM Dim_Deck D2
    JOIN Draw1 D1
        ON D1.Card1 <> D2.CardId
)
SELECT COUNT(*) FROM Draw2

Note how the second CTE, Draw2, builds on the first, Draw1, to create the second draw (the two CTEs are chained, not recursive). So let's say that I picked the five of clubs first, and the four of hearts second. That is one of the 2,652 possible events. But the reversal of that order is also one of the possible events (picking the four of hearts first, and the five of clubs second). But I really don't care which order the two cards come in (the permutation), I only care about the set of cards that results.

Looking at an even simpler example of a deck of 5 cards, ace to five, how many ways are there to pick two? Here's a simple matrix:

The code above will pick everything except the diagonal that shows pairs:

but what we really want is this:

And in order to get it, we change the "<>" operator in the join condition to ">":

;WITH Draw1 AS (
    SELECT Card1 = CardId
    FROM Dim_Deck
),
Draw2 AS (
    SELECT
        Card1,
        Card2 = CardId
    FROM Dim_Deck D2
    JOIN Draw1 D1
        ON D1.Card1 > D2.CardId
)
SELECT COUNT(*) FROM Draw2

and we obtain the correct result, 1326 rows.
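The same ordering trick should extend to five cards. Here is a minimal sketch, not code from the Poker DW itself: it assumes a deck with one row per card and a unique integer CardId, and builds a stand-in Deck (CardId 1 to 52) with a small recursive CTE so the query runs on its own. Each join only accepts a card with a lower CardId than the previous one, so every hand is counted once.

```sql
-- Hypothetical stand-in for Dim_Deck: 52 rows, CardId 1..52.
;WITH Deck AS (
    SELECT CardId = 1
    UNION ALL
    SELECT CardId + 1 FROM Deck WHERE CardId < 52
)
-- Strictly descending CardIds keep one ordering per set of five cards.
SELECT FiveCardHands = COUNT(*)
FROM Deck C1
JOIN Deck C2 ON C2.CardId < C1.CardId
JOIN Deck C3 ON C3.CardId < C2.CardId
JOIN Deck C4 ON C4.CardId < C3.CardId
JOIN Deck C5 ON C5.CardId < C4.CardId;
```

This returns 2,598,960, which is the 311,875,200 permutations divided by the 5 * 4 * 3 * 2 * 1 = 120 orderings of each hand.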

Wednesday, February 25, 2009

Comparing Stored Procedures, Part 6

In the previous entry, we found that computing the maximum search size prior to the sequence matching using SQL Server 2005's CTE feature saved a considerable amount of time, but we were still performing at an unacceptable rate. I decided it was time to rework the core logic of the matching algorithm. Impressed by the blazing performance of the CTE used to precalculate the max search length, I wanted to recode the algorithm using the set-based logic of the CTE.

Initially, using an iterative approach to this problem made a lot of sense, since we need to control the order of the subsequences checked (longer before shorter). There didn't seem to be an obvious way to do this with a CTE. The real power of the CTE comes from its feature of recursion - the ability to have a set of data in the CTE refer to itself (please go here for an excellent article on this). I developed a hybrid approach that combines iterative and recursive code. The recursive CTE picks off the longest matching subsequence it can find between the two sequences whose values are all still unmatched; the procedure saves it in a temp table, updates the two sequences to mark the matched values, and repeats until no more matching subsequences can be found. This new version can be found here.
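To illustrate what "refer to itself" means here, this is a minimal sketch of a recursive CTE that grows matching runs one value at a time. It is not the linked procedure: the table variables @S1 and @S2, their columns, and the sample values are all made up for the example, and it leaves out the "unmatched" bookkeeping and the outer loop.

```sql
-- Hypothetical sample sequences, one row per position.
DECLARE @S1 TABLE (LineNum INT PRIMARY KEY, Val VARCHAR(10));
DECLARE @S2 TABLE (LineNum INT PRIMARY KEY, Val VARCHAR(10));
INSERT INTO @S1 (LineNum, Val)
SELECT 1, 'AB' UNION ALL SELECT 2, 'CD' UNION ALL SELECT 3, 'EF' UNION ALL SELECT 4, 'GH';
INSERT INTO @S2 (LineNum, Val)
SELECT 1, 'XX' UNION ALL SELECT 2, 'CD' UNION ALL SELECT 3, 'EF' UNION ALL SELECT 4, 'AB';

;WITH Runs AS (
    -- Anchor: every pair of positions holding the same value (a run of length 1).
    SELECT Start1 = A.LineNum, Start2 = B.LineNum, RunLen = 1
    FROM @S1 A
    JOIN @S2 B ON B.Val = A.Val
    UNION ALL
    -- Recursive step: extend a run by one when the next values also match.
    SELECT R.Start1, R.Start2, R.RunLen + 1
    FROM Runs R
    JOIN @S1 A ON A.LineNum = R.Start1 + R.RunLen
    JOIN @S2 B ON B.LineNum = R.Start2 + R.RunLen
              AND B.Val = A.Val
)
-- The longest run found is the longest common contiguous subsequence.
SELECT TOP (1) Start1, Start2, RunLen
FROM Runs
ORDER BY RunLen DESC;
```

For the sample data this returns Start1 = 2, Start2 = 2, RunLen = 2 (the shared run 'CD', 'EF'). Runs longer than 100 values would need an OPTION (MAXRECURSION n) hint, since the default recursion limit is 100.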

The code is much more concise, much more elegant, than the iterative algorithm. But what is truly amazing is the performance increase. After altering the testing script to test all three versions of the code under the same conditions, in the same testing session, the improved performance speaks for itself:

The 'Improved' column compares the latest version with the first. The improved performance represents not just absolute time; notice how slowly the rate of extra time required increases for each increment of the input size. After running the new version for input sizes from 100 to 800 values, I created this graph of the performance time in seconds (graph developed using The Graph Calculator by Holt, Rinehart, and Winston):

The graph rises very slowly at first, then the performance time starts to grow very quickly towards the right end of the graph. Using this website for power regression analysis, we arrive at the approximate formula y = 3.5 * 10^-6 * x^2.5. If this is accurate, then the new version is somewhere between quadratic and cubic order of complexity and, although considerably faster than the previous versions, will suffer at larger input sizes.
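To get a feel for what that exponent means, here is a small sketch that evaluates the fitted curve at a few input sizes. It reads the formula as y = 3.5 * 10^-6 * x^2.5, which is an assumption about the intended scaling, and the table variable and column names are made up for the illustration. Whatever the constant turns out to be, an exponent of 2.5 means that doubling the input size multiplies the predicted time by 2^2.5, or roughly 5.7.

```sql
-- Hypothetical input sizes to evaluate; each one doubles the previous.
DECLARE @Sizes TABLE (InputSize INT PRIMARY KEY);
INSERT INTO @Sizes (InputSize)
SELECT 100 UNION ALL SELECT 200 UNION ALL SELECT 400 UNION ALL SELECT 800;

-- Assumed reading of the fitted power curve: seconds = 3.5E-6 * x^2.5.
SELECT InputSize,
       PredictedSeconds = 3.5E-6 * POWER(CAST(InputSize AS FLOAT), 2.5)
FROM @Sizes
ORDER BY InputSize;
```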

Next: we examine the results of running our new code on the original large stored procedures.

Tuesday, February 24, 2009

Comparing Spds, Part 5 - Optimization

Testing out our sequence matching algorithm showed that our estimate of cubic order of complexity was accurate; unfortunately, this performance is what might be called "sub-optimal". If we have any hope of testing our 1,000 line stored procedures in a reasonable timeframe, we're going to have to optimize our algorithm.

The most obvious optimization to make is to simply reduce the number of comparisons, and the way we're going to do that is by checking the sequences ahead of the actual matching. We first check each value in the sequences to see if there exists a match in the other sequence, and then overload the MatchLineNum column to indicate match or no-match. At this point each sequence breaks down into a series of subsequences of possible matches. What's important here is the length of those subsequences: the length of the longest one is the maximum search size for that sequence. We need not search for anything longer in it, because any longer subsequence must contain a value with no match. Also, the longest possible matching subsequence between the two main sequences will be the smaller of these two maximum lengths.

As an example, let's say that we have two sequences, S1 and S2, and we find that the lengths of the longest possible subsequences of S1 and S2 are n and m, respectively. For the purposes of matching the subsequences between S1 and S2, the maximum subsequence to compare will be of length min(n, m).

The code to determine this maximum subsequence utilizes CTEs, one of the more important enhancements of SQL Server 2005. This was an interesting exercise. I wrote CTEs that first separately picked out the starting and ending points of the subsequences with matching values, constructed the sequences by matching starting and ending points, then calculated the maximum length of these subsequences. CTEs require a set-based approach, rather than the more algorithmic row-by-row processing; the resulting code is more intuitive, and ultimately more powerful. Please look here for an excellent introductory article to CTEs.
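The start/end approach described above can be sketched roughly as follows. This is not the actual stored procedure (that is linked at the end of this post): the table variable @Seq, its IsMatch flag (standing in for the overloaded MatchLineNum column), and the sample rows are assumptions made for the example.

```sql
-- Hypothetical data: one row per position in a sequence; IsMatch = 1 when
-- the value also occurs somewhere in the other sequence.
DECLARE @Seq TABLE (LineNum INT PRIMARY KEY, IsMatch BIT NOT NULL);
INSERT INTO @Seq (LineNum, IsMatch)
SELECT 1, 1 UNION ALL SELECT 2, 1 UNION ALL SELECT 3, 0 UNION ALL
SELECT 4, 1 UNION ALL SELECT 5, 1 UNION ALL SELECT 6, 1 UNION ALL
SELECT 7, 0 UNION ALL SELECT 8, 1;

;WITH Starts AS (
    -- Matched rows whose predecessor is missing or unmatched.
    SELECT S.LineNum
    FROM @Seq S
    WHERE S.IsMatch = 1
      AND NOT EXISTS (SELECT 1 FROM @Seq P
                      WHERE P.LineNum = S.LineNum - 1 AND P.IsMatch = 1)
),
Ends AS (
    -- Matched rows whose successor is missing or unmatched.
    SELECT S.LineNum
    FROM @Seq S
    WHERE S.IsMatch = 1
      AND NOT EXISTS (SELECT 1 FROM @Seq N
                      WHERE N.LineNum = S.LineNum + 1 AND N.IsMatch = 1)
),
Runs AS (
    -- Pair each start with the nearest end at or after it.
    SELECT StartLine = St.LineNum,
           EndLine   = MIN(En.LineNum)
    FROM Starts St
    JOIN Ends En ON En.LineNum >= St.LineNum
    GROUP BY St.LineNum
)
SELECT MaxSearchSize = MAX(EndLine - StartLine + 1)
FROM Runs;
```

The sample has matched runs at lines 1-2, 4-6, and 8, so the query returns 3. Running the same query against the other sequence and taking the smaller of the two results gives the min(n, m) starting compare size.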

In comparing this improved version against the original code, I tested both with input sizes of 100, 200, 300, 400, and 500, running each code base 5 times for each input size. Here are the results (columns 'Version1' and 'Version2' represent performance time in seconds):

InputSize   Version1   Version2   Improvement
100            5.064      1.222         75.9%
200           48.680     13.534         72.2%
300           92.918     35.958         61.3%
400          461.862    318.318         31.1%
500          614.412    417.818         32.0%

We can see that the new version improved performance considerably for the smaller input sizes, but the improvement peters out as the size increases. Analyzing the print output from the stored procedure, paying special attention to the @CmprSz values that were output, I can see that the starting compare sizes grew larger as a ratio to the input size as the input size increased. In other words, the larger the sample size, the smaller the cost savings. Given that the test sequences consisted of "words" of two letters of the form [A-J][A-J], the larger the sample size, the more likely it is that random values will be shared between the sequences. In two particular runs of the larger sample sets, all values were shared between the two sequences, resulting in zero cost savings.

This is not exactly a failure of testing design - for many real-world scenarios, the likelihood of shared values should follow this trend of increasing as the sample size increases. Consider comparing the words from works of literature - comparing the text of any random two English language authors of the 20th century, the larger the works the more likely that they will share words between them. Another example is DNA - consider the base pairing of AT and GC combinations - since there are only two possibilities in the domain of values, we are certain to find 100% likelihood of shared values between any two DNA sequences (thus reducing our cost savings to zero).

You can find a complete copy of the improved stored procedure here.
