Monday, August 31, 2009

TSQL Challenge #13

Looking at TSQL Challenge #13, I created a grouping by batch and invoice number:



The most straight-forward approach from here is to insert this into a temp table, open a cursor on it, run through the data and modify the "Set" column so that the batch/invoice numbers are appropriately grouped together. Then join the results of that temp table back to the original data set via batch/invoice number, so that the modified "Set" column is appended. This cursor-based solution is here (the rules require a set-based solution, so I did the cursor-based one just as a baseline to compare to the set-based solution).

Tuesday, July 7, 2009

Avoid Logging When Populating a Table

I recently ran into a brick wall while trying to populate a skinny table with just over 130 million rows, in that the log filled up way before the table did (I was down to about 40gb free on my local machine). This is a scenario where I have multiple recursive CTEs preceeding a INSERT .. SELECT FROM used to create those rows. To get around this problem, I created a stored procedure that outputs the results of those CTEs as a straightforward SELECT, then I redirect that output to a text file via a BCP batch file. I then BCP that file back into my destination table, thereby bypassing the extraneous logging that, in this case, is just a waste of space and time.

Thursday, June 18, 2009

Cool Use of CTEs to create number table

Many developers have needed to create a numbers table for matching data to ranges, etc., and this article shows various ways to do it, including multiple uses of CTEs: http://www.projectdmx.com/tsql/tblnumbers.aspx

Poker DW: Stacking the Deck Part 2

Earlier we showed how to construct a list of all possible 5 card poker hands, and verified the result in Wikipedia. What I want to create is a table of those hands, all 2,598,960 of them, with their high- and low-hand ranks also. I want to be able to analyze the odds of one hand winning over another with n cards to come, which means that I'll need to create a temp copy of the deck, pull out the known cards from it, and make a hand with every card left and evaluate those hands against my table of all possible.

Now how will I represent the cards in the table of poker hands? The most obvious way is to create a five column key consisting of the 5 CardId values that make that hand. Question: does the order of those CardIds matter? Remember what we talked about last entry, the order should not matter. But we have to put them in some sort of order - if we have five columns, say CardId1, CardId2, CardId3, CardId4, and CardId5, something is going to have to go somewhere. Let's say that we arbitrarily enter the CardIds into the columns in no particular order - how will we now query them? Let's make a trivial example of querying for two cards. Our WHERE clause of such a query would look like:

WHERE CardId1 = @CardId1 AND CardId2 = @CardId2
OR CardId1 = @CardId2 AND CardId2 = @CardId1



We have to match every permutation of variables to columns. With three cards:

WHERE
CardId1 = @CardId1 AND CardId2 = @CardId2 AND CardId3 = @CardId3
OR CardId1 = @CardId1 AND CardId2 = @CardId3 AND CardId3 = @CardId2 OR CardId1 = @CardId2 AND CardId2 = @CardId1 AND CardId3 = @CardId3 OR CardId1 = @CardId2 AND CardId2 = @CardId3 AND CardId3 = @CardId1 OR CardId1 = @CardId3 AND CardId2 = @CardId1 AND CardId3 = @CardId2 OR CardId1 = @CardId3 AND CardId2 = @CardId2 AND CardId3 = @CardId1


Going back to our research on permutations, the number of permutations of n elements is n!, which is also equal to n(n + 1)/2. With five cards we're looking at a WHERE clause that is 5(5+1)/2 = 5*6/2 = 15 lines long. The coding for that isn't so bad (try not to make a mistake - you'll be matching 5 variable/column pairs per line for 15 lines, for a total of 75 equality checks), but think of how slowly that would perform! And that's just to evaluate one hand - imagine the gears grinding away to find all possible 5 card hands with two cards to come - if you're on the flop, and you want to evaluate your chances to the river, you have "47 choose 2" =
1081 possible outcomes.

What I came up with is a solution using prime numbers that I learned while studying Gödel's incompleteness theorems. We assign every card in the deck a unique prime number; the first card gets 2, the second card 3, all the way up to the last card, which gets prime number 239. Now what happens if we want to look at a two-card hand and match it to a table of all possible two-card hands? If we multiply the prime numbers corresponding to those cards, we will get a number that is unique to those two cards (the primes of any other two cards will result in a different number when multiplied). Obviously it doesn't matter which order the primes are multiplied, so we have just found the perfect primary key for our poker hands table. When we want to evaluate a hand, we multiply the primes corresponding to the cards and match the result to the primary key.

We have an updated Dim_Deck creation script that adds a "PrimeFactor" column to it. Now I'm working on a creating the table of all possible hands.

Poker DW: Stacking the Deck

Returning to the Poker Data Warehouse, I dove into the text-parsing process after setting up a sketch of an initial database design, with the idea of polishing out the design at a later date. I recently completed the parsing of a sample hand history, and man was it more work than I expected! Fortunately I love writing text-parsing routines (perhaps from early career work in merge-purge duplicate removal), so the 750 line stored procedure was more a labor of love than a tedious chore. Getting into the dirty details of the source data made me think more about the 30,000 ft overview also.


I created a representation of a 52 card deck of cards in the Poker DW, and I started thinking about how to evaluate 5 card poker hands (i.e., determining what a player had at the end of the hand). What I really want is to be able to evaluate the odds of making the best hand on the next card or the river, which would ultimately allow me to judge whether a player made the right decision. This result would be similar to the "% to win" stats that you see on TV.


After I created my deck of cards, I started playing around with representing a hand of 5 cards. How many possible 5 card hands are there? Easy - think of it like this. Take 1 card from the deck, there's 52 cards to choose from. Take another card, there's 51 to choose from. Keep picking until you have 5 cards in your hand, that leaves 52 * 51 * 50 * 49 * 48 = 311,875,200 possible 5 card hands.


The problem with this method is that I'm picking permutations of 5 card hands, rather than combinations. Let's reduce my example above to picking two cards rather than five. According to that math, there are 52 * 51 = 2,652 possible two card hands. Using the card deck created above, this query will return that count, 2652 rows:


;WITH Draw1 AS (
SELECT Card1 = CardId
FROM Dim_Deck
),
Draw2 AS (
SELECT
Card1,
Card2 = CardId
FROM Dim_Deck D2
JOIN Draw1 D1
ON D1.Card1 <> D2.CardId
)
SELECT COUNT(*) FROM Draw2



Note the use of the recursive CTE to create the second draw, Draw2. So let's say that I picked the five of clubs first, and the four of hearts second. That is one of the 2,652 possible events. But the reversal of that order is also one of the possible events (picking the four of hearts first, and the five of clubs second). But I really don't care which order the two cards come in (the permutation), I only care about the set of cards that results.


Looking at an even simpler example of a deck of 5 cards, ace to five, how many ways are there to pick two? Here's a simple matrix:



The code above will pick everything except the diagonal that shows pairs:




but what we really want is this:



And in order to get it, we change the "<>" operators to ">":


;WITH Draw1 AS (
SELECT Card1 = CardId
FROM Dim_Deck
),
Draw2 AS (
SELECT
Card1,
Card2 = CardId
FROM Dim_Deck D2
JOIN Draw1 D1
ON D1.Card1 > D2.CardId
)
SELECT COUNT(*) FROM Draw2

and we obtain the correct result, 1326 rows.

Monday, June 15, 2009

StackOverflow

I started answering questions on StackOverflow, and came across a couple of interesting sites. The first is Vyas code page, a small library of useful SQL code. The other is an add-on toolkit for SSMS.

Friday, June 12, 2009

T-SQL to Export Table Structure to a script

Good script for scripting out SQL tables via T-SQL rather than Enterprise Manager/Dev Studio: http://www.sqlservercentral.com/scripts/Miscellaneous/30730/

Wednesday, May 27, 2009

CRM Import - Importing Into Drop-Down Combobox Targets

During the initial try to load accounts into MS-CRM 4.0, my source file failed for columns that had drop-down combobox columns as targets. A little research lead to this forum post, which advises using the GUIDs of the target system. I was a little mislead by this post, since it actually applied to lookups, not drop-down combobox values. The solution I needed involved converting the source values to the AttributeValue equivalents in the StringMap table of the MS-CRM database.

Thursday, May 21, 2009

BCP out Temp Tables

Hit a little snag today trying to output, via BCP, the results of a stored procedure that created and dropped a local temp table. The spd would run fine in Query Analyser, but when run from the DOS prompt I got the error message that the temp table didn't exist. Perhaps the problem is caused by BCP compiling and running the spd in different threads? Anyways, the results of a little googling provide the workaround of keeping the table in TempDb rather than creating as a real temp table. That works as long as the table is not created and dropped within the spd that BCP calls.

Monday, May 18, 2009

Case Study: Poker DW: Reporting Questions

After reviewing the entities and their relationships, we next want to look at some of the questions we might want to answer from the data in our DW. Some possible questions:
  • Which players will call a big checkraise on the flop with an overpair?
  • What is the actual expected outcome of reraising on the button with a suited connector and bluffing the flop?

Next: ???

Case Study: Poker DW: Entities

Continuing from the introduction, one of the first things we'll have to do in designing our Poker DW (Data Warehouse), is to identify all of the entities of interest. Taking a look at our sample hand history file (ignore the last two lines of HTML code, the hosting company stamped them when I uploaded the file), the first line starts with a sequence number, the format of the game and the datetime stamp of when a particular hand took place. The second line indicates the name of the table, whether it is real or play money, and which seat is the dealer. The next ten or so lines tell us who is in which seat, and what their stack size is (at the beginning of the hand), followed by a line for each player who posts a blind (small, big, and other). So up to now, we've seen such entities as Time, Table, Player, Seat, and Money (stack and pot size).

The next section begins with the header line "*** POCKET CARDS ***". Here we have such information as the hole cards dealt to the player, and all of the preflop action (fold, check, call, or raise). We can identify three more entities here: Betting Stage, Cards and Actions. The next section, "*** FLOP *** [10s 9d 3h]", contains the same entities, but this time we have community cards. At each step in these sections, we can calculate the pot size and stack sizes for each player. Two more sections, "Turn" and "River", provide similar info.

Special consideration should be given to the next section, "*** SHOW DOWN ***", as it will show us exactly what cards other players held during the hand, allowing us to "backfill" that info for earlier rounds of betting. This will help us answer some important questions in the hand histories. The final section, "*** SUMMARY ***", provides info such as the rake, the Hi hand (and Low if this is a hi/lo split game), and the final pot size (which we can use to verify our "running" pot size throughout the hand).

So let's summarize our entities and their relationships. Central to this is Hands. Hands occur at certain Times at a particular Table, which have Seats. Players make Actions with Money based on Cards appearing at a Betting Stage.

Friday, May 15, 2009

Case Study: Data Warehouse for Poker (Intro)

A friend of mine plays a good deal of online poker, and wants to improve his game by studying the hands he has played. I suggested creating a data warehouse from the hand histories held in the text files that the app saves, and using that data to identify winning and losing trends in his game. This entry will serve as the first in a series of the steps we will take to develop this.

Guide to Entries:

Thursday, May 14, 2009

Grouping Datetimes to Identify Sessions

Let's say that I have a record of events that are unique by a datetime "timestamp" marker, and that these events can be grouped by sessions, so that every event that occurs within a certain period of time of the preceeding and/or subsequent events to it are considered as part of the same session. For example, let's say that we are examining cars driving by a traffic counter where each car passing is recorded as an event, and we want to organize the events by sessions so that any two cars passing within 5 seconds of one another constitutes a session (the events will "chain" together so that if four cars pass within 2 seconds of one another, but the first and last cars are within 10 seconds, all four car events are a part of the same session).

Now, if I only have the events, how do I create sessions around them?

Wednesday, May 6, 2009

Another Version for Calculating Median

Joe Celko published a "history" of calculating the median in SQL, along with a final version that seems to work similarly to mine: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/.

Tuesday, May 5, 2009

Querying Sys.Columns & Sys.Types

If you want to query the structure of a table that includes column names and data types, you have to perform a join between catalog views Sys.Columns and Sys.Types. There are some caveats to this. If any of your columns are defined as nvarchar or user-defined data types, you must qualify the data coming from Sys.Types. When you add a user-defined data type, it is entered into Sys.Types with a reference to it's native data type. In the example of the AdventureWorks database, the data type "AccountNumber" is defined as nvarchar, and shows up in Sys.Types with system_type_id = 231 (which points to "nvarchar", where system_type_id = 231 and user_type_id = 231).


Running the query:


SELECT
Tb.Name,
C.Name,
Tp.Name
FROM Sys.Tables Tb
JOIN Sys.Schemas Sch
ON Sch.Schema_Id = Tb.Schema_Id
JOIN Sys.Columns C
ON C.Object_Id = Tb.Object_Id
JOIN Sys.Types Tp
ON Tp.System_Type_Id = C.System_Type_Id
WHERE Tb.Name = 'Address'
ORDER BY Tb.Name, C.Name, Tp.Name



produces these results:



Weird, huh? Why did 'AddressLine1' show up six times with six different data types? The reason is two-fold. First, 'AddressLine1' is defined as nvarchar(60), which means that it will also show up as "sysname" datatype (think of "sysname" as MicroSoft's built-in user-defined data type).


Take a look at the results of the query below. It shows that, including itself, six different data types are based on nvarchar! That's why 'AddressLine1' showed up six times in the query above.

SELECT Name FROM Sys.Types Tp
WHERE System_Type_Id = 231


Name
-------------------
nvarchar
sysname
AccountNumber
Name
OrderNumber
Phone
(6 row(s) affected)


So let's change our query to use this 'User_Type_Id' column instead:

SELECT
Tb.Name,
C.Name,
Tp.Name
FROM Sys.Tables Tb
JOIN Sys.Schemas Sch
ON Sch.Schema_Id = Tb.Schema_Id
JOIN Sys.Columns C
ON C.Object_Id = Tb.Object_Id
JOIN Sys.Types Tp
ON Tp.User_Type_Id = C.System_Type_Id
WHERE Tb.Name = 'Address'
ORDER BY Tb.Name, C.Name, Tp.Name

This produces the results we want:


Tuesday, April 21, 2009

Data Patterns and the LIKE Clause

The "LIKE" clause of the SQL SELECT statement is one of the more interesting features of SQL's character processing. Most everyone is familiar with the "%" wildcard, which allows queries such as "SELECT LastName FROM Customers WHERE LastName LIKE 'Mc%'". This returns all the customer last names beginning with "Mc" (such as yours truly). But I suspect that many developers are unaware of some of the deeper uses of data pattern expressions.

The list of other wildcard characters related to LIKE includes "_", "[", "-", "]", and "^". The first, "_", is the 'any single character' expression. The "[]" characters act as a single character wildcard, but allow us to specify which characters will match. The WHERE clause above is equivalent to "WHERE LastName LIKE '[M][c]%'". When multiple characters reside within the brackets, the filter acts like an "or" expression. So changing the filter to "WHERE LastName LIKE '[M][c][aeiou]%'" would produce last names beginning with "Mc", then followed by a vowel, then any terminating string.

If you use the "-" with the brackets, you can specify ranges of characters (ranges defined by ASCII order). For example, let's say we want to search for user names that begin with 'jmclain' and are then followed by a single digit number. We would execute "SELECT * FROM Users WHERE UserName LIKE 'jmclain[0-9]'".

Where it gets complicated is when you want to search a column for wildcard literals. For example, let's say that you have a column called 'SalesDescription', and you want to count the rows where the SalesDescription column contains the string "50% off". If you were to execute "SELECT COUNT(*) FROM Sales WHERE SalesDescription LIKE '50% off'", you would mistakenly pull in rows with SalesDescription values such as '50 cents off', since the "%" wildcard represents "any string". To correct this, you have two options. The simplest is to enclose the "%" wildcard with brackets, so that the filter changes to "WHERE SalesDescription LIKE '50[%] off'".

The second option is to make use of the ESCAPE clause of the LIKE operator. What this method lacks in simplicity, it make up in robustness (and isn't really that complicated anyways). To solve the above problem suchwise, the filter changes to "WHERE SalesDescription LIKE '50!% off' ESCAPE '!'". I prefer the first method above because 1. it is simpler, and 2. in order to use the ESCAPE clause, you must be certain that your target expression doesn't contain the escape character. So if a given SalesDescription value in the table was, unbeknowst to you, something like '50% off!!!', the results start to become unreliable. Best practices for using ESCAPE stipulate first starting with uncommon characters such as "~" or "", and then querying your column to make sure they are not present.

The best use of ESCAPE is when you want to find brackets in your target. Let's say that you wanted to find the SalesDescription value "[50% off]". After checking to ensure that the column values don't contain the tilde ("~") character, you would use the filter "WHERE SalesDescription LIKE '~[50~% off~]' ESCAPE '~'".

Friday, April 17, 2009

Converting Datetime Values to Varchar

Back when I used to code in Foxpro, I had to write custom code to convert date/time values to various formats. T-SQL provides for a great number of formats using the CONVERT function. An article on MSSQLTips lists many (but not all) of these formats. This code (which can easily be wrapped into a stored procedure), will list out all valid format codes and an example of how they will appear:


SET NOCOUNT ON
CREATE TABLE #Fmts (FmtNo tinyint, Example varchar(max))
DECLARE @fmt int; SET @fmt = 0
DECLARE @dt datetime; SET @dt = GETDATE()
WHILE @fmt < 132
BEGIN
BEGIN TRY
INSERT INTO #Fmts (FmtNo, Example)
VALUES (@fmt, CONVERT(varchar, @dt, @fmt))
END TRY
BEGIN CATCH
PRINT '@fmt = ' + LTRIM(STR(@fmt)) + ' is not valid.'
END CATCH
SET @fmt = @fmt + 1
END
SELECT FmtNo, Example = LEFT(Example, 30) FROM #Fmts
DROP TABLE #Fmts
SET NOCOUNT OFF



And sample output:





Wednesday, April 15, 2009

Question of the Day

I stumped the gurus on SQLServerCentral.com with another challenging Question of the Day on April 6th, 2009:

Given this code,

DECLARE @val int;
SET @val = -1
CREATE TABLE #empty (val int)

which statement(s) will result in @val being NULL? (select all that apply)

  1. SET @val = NULL
  2. SELECT @val = NULL FROM #empty
  3. SELECT @val = val FROM #empty
  4. SELECT @val = (SELECT val FROM #empty)
As of today, only about 30% of respondents answered correctly, and judging from the comments in the discussion section, a lot of them gained a deeper understanding of the concept of null in SQL. I try to make my questions (and answers), tricky enough so as to not be obvious, but my goal isn't to trick people with arcane technicalities - I want to make them aware of certain subtleties of the database engine. This question arose after some unexpected results made me delve into a bit of code, and tested scenarios just like those in the question.

Monday, April 13, 2009

Collation Sequences

Being a database developer rather than a DBA, I rarely deal with collation types, but I came across a situation recently where I had to dig into the issue. My objective was to produce a breakdown of how many rows contained each ASCII character. I considered two approaches: slice the values into one-character chunks, or loop through all 256 ASCII values and count the number of rows containing each character. The former approach has the advantage of not only counting the rows but the frequency of characters (e.g., "100 rows contain 250 instances of the character 'A'"), but I opted for the second approach since it intuitively seemed faster. If your database was created with case-insensitive collation (such as "SQL_Latin1_General_CP1_CI_AS"), checking for the characters 'A', would pull in values of 'a':

USE NorthWind
GO

SELECT DISTINCT City
FROM dbo.Customers
WHERE CHARINDEX('A', City) > 0




To fix, simply add the "COLLATE" clause to the query:

SELECT DISTINCT City
FROM dbo.Customers
WHERE CHARINDEX('A' COLLATE Latin1_General_BIN, City) > 0