Wednesday, May 27, 2009

CRM Import - Importing Into Drop-Down Combobox Targets

During the initial attempt to load accounts into MS-CRM 4.0, my source file failed on columns whose targets were drop-down combobox fields. A little research led to this forum post, which advises using the GUIDs of the target system. I was a little misled by that post, since it actually applies to lookups, not drop-down combobox values. The solution I needed involved converting the source values to their AttributeValue equivalents in the StringMap table of the MS-CRM database.
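
For reference, the lookup involved is roughly the query below, run against the MS-CRM database. This is only a sketch: 'industrycode' and 'Accounting' are made-up examples, and the StringMap column names may vary slightly between deployments.

-- Sketch: find the integer AttributeValue behind a picklist display label so the
-- source file can carry the integer instead of the text.
-- ('industrycode' and 'Accounting' are invented examples; names may vary by version.)
SELECT sm.AttributeValue, sm.Value
FROM dbo.StringMap sm
WHERE sm.AttributeName = 'industrycode'
  AND sm.Value = 'Accounting'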

Thursday, May 21, 2009

BCP out Temp Tables

Hit a little snag today trying to output, via BCP, the results of a stored procedure that created and dropped a local temp table. The spd would run fine in Query Analyzer, but when run from the DOS prompt I got an error saying the temp table didn't exist. Perhaps the problem is caused by BCP compiling and running the spd in separate sessions? Anyway, a little googling provides the workaround of keeping the table in TempDb as a permanent table rather than creating it as a true temp table. That works as long as the table is not created and dropped within the spd that BCP calls.
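
Here's a minimal sketch of the workaround, with made-up object names: the spd writes its results to a permanent table in TempDb, and BCP then exports that table directly instead of calling the spd.

-- Sketch of the workaround (object and server names are made up):
-- have the stored procedure write its results to a permanent table in tempdb
-- rather than to a local #temp table created and dropped inside the procedure.
USE tempdb
GO
CREATE TABLE dbo.BcpResults (Id int, SomeValue varchar(100))
GO
-- ...the spd INSERTs into tempdb.dbo.BcpResults instead of a #temp table...
-- Then, from the DOS prompt, export the table directly:
--   bcp tempdb.dbo.BcpResults out results.txt -c -T -S MyServer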

Monday, May 18, 2009

Case Study: Poker DW: Reporting Questions

After reviewing the entities and their relationships, we next want to look at some of the questions we might want to answer from the data in our DW. Some possible questions:
  • Which players will call a big checkraise on the flop with an overpair?
  • What is the actual expected outcome of reraising on the button with a suited connector and bluffing the flop?

Next: ???

Case Study: Poker DW: Entities

Continuing from the introduction, one of the first things we'll have to do in designing our Poker DW (Data Warehouse) is to identify all of the entities of interest. Taking a look at our sample hand history file (ignore the last two lines of HTML code; the hosting company stamped them when I uploaded the file), the first line starts with a sequence number, the format of the game, and the datetime stamp of when the hand took place. The second line indicates the name of the table, whether it is real or play money, and which seat is the dealer. The next ten or so lines tell us who is in which seat and what their stack size is at the beginning of the hand, followed by a line for each player who posts a blind (small, big, and other). So far, we've seen such entities as Time, Table, Player, Seat, and Money (stack and pot size).

The next section begins with the header line "*** POCKET CARDS ***". Here we have such information as the hole cards dealt to the player, and all of the preflop action (fold, check, call, or raise). We can identify three more entities here: Betting Stage, Cards and Actions. The next section, "*** FLOP *** [10s 9d 3h]", contains the same entities, but this time we have community cards. At each step in these sections, we can calculate the pot size and stack sizes for each player. Two more sections, "Turn" and "River", provide similar info.

Special consideration should be given to the next section, "*** SHOW DOWN ***", as it will show us exactly what cards other players held during the hand, allowing us to "backfill" that info for earlier rounds of betting. This will help us answer some important questions in the hand histories. The final section, "*** SUMMARY ***", provides info such as the rake, the Hi hand (and Low if this is a hi/lo split game), and the final pot size (which we can use to verify our "running" pot size throughout the hand).

So let's summarize our entities and their relationships. Central to everything is the Hand. Hands occur at certain Times at a particular Table, which has Seats. Players make Actions with Money based on the Cards appearing at each Betting Stage.
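
To make those relationships a little more concrete, here is a rough, purely hypothetical sketch of how the entities might translate into tables. All names and columns are illustrative only; the real design will be worked out in later entries.

-- Hypothetical first cut at the core entities (illustrative only)
CREATE TABLE Hand (
    HandId        int IDENTITY PRIMARY KEY
   ,TableId       int NOT NULL          -- the Table where the Hand was played
   ,HandStartTime datetime NOT NULL     -- the Time of the Hand
)

CREATE TABLE Seat (
    TableId int NOT NULL
   ,SeatNum tinyint NOT NULL
   ,PRIMARY KEY (TableId, SeatNum)
)

CREATE TABLE HandPlayer (   -- which Player sat in which Seat, with what stack (Money)
    HandId     int NOT NULL REFERENCES Hand (HandId)
   ,PlayerId   int NOT NULL
   ,SeatNum    tinyint NOT NULL
   ,StartStack money NOT NULL
)

CREATE TABLE HandAction (   -- an Action by a Player at a Betting Stage, for an amount of Money
    HandId       int NOT NULL REFERENCES Hand (HandId)
   ,ActionSeq    int NOT NULL            -- order of the action within the hand
   ,BettingStage tinyint NOT NULL        -- 1 = preflop, 2 = flop, 3 = turn, 4 = river, 5 = showdown
   ,PlayerId     int NULL
   ,ActionType   varchar(10) NOT NULL    -- fold, check, call, raise, post blind
   ,Amount       money NULL
)
-- (Cards would hang off Hand and HandPlayer in a similar way.)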

Friday, May 15, 2009

Case Study: Data Warehouse for Poker (Intro)

A friend of mine plays a good deal of online poker, and wants to improve his game by studying the hands he has played. I suggested creating a data warehouse from the hand histories held in the text files that the poker application saves, and using that data to identify winning and losing trends in his game. This entry is the first in a series covering the steps we will take to develop it.

Guide to Entries:

Thursday, May 14, 2009

Grouping Datetimes to Identify Sessions

Let's say that I have a record of events that are unique by a datetime "timestamp" marker, and that these events can be grouped into sessions, so that every event occurring within a certain period of the preceding and/or subsequent events is considered part of the same session. For example, suppose we are examining cars driving past a traffic counter, where each passing car is recorded as an event, and we want to organize the events into sessions so that any two cars passing within 5 seconds of one another belong to the same session (the events "chain" together, so that if four cars each pass within 2 seconds of the next, but the first and last cars are 10 seconds apart, all four car events are part of the same session).

Now, if I only have the events, how do I create sessions around them?
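
One set-based way to do it (just a sketch; the dbo.CarEvents table and its EventTime column are invented for this example): an event starts a new session if no earlier event falls within the gap, and each event's session number is simply the count of such session-starters at or before it.

-- Sketch only: dbo.CarEvents(EventTime datetime) is an invented table with one row per event.
-- An event "starts" a session if no earlier event falls within @gap seconds of it;
-- numbering each event by the count of session-starts at or before it chains the events together.
DECLARE @gap int
SET @gap = 5    -- maximum seconds between events in the same session

SELECT
    e.EventTime
   ,SessionNo = (SELECT COUNT(*)
                 FROM dbo.CarEvents s
                 WHERE s.EventTime <= e.EventTime
                   AND NOT EXISTS (SELECT 1
                                   FROM dbo.CarEvents p
                                   WHERE p.EventTime < s.EventTime
                                     AND DATEDIFF(second, p.EventTime, s.EventTime) <= @gap))
FROM dbo.CarEvents e
ORDER BY e.EventTime

The correlated subqueries make this roughly quadratic, so it demonstrates the logic rather than something to run against millions of rows.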

Wednesday, May 6, 2009

Another Version for Calculating Median

Joe Celko published a "history" of calculating the median in SQL, along with a final version that seems to work similarly to mine: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/.

Tuesday, May 5, 2009

Querying Sys.Columns & Sys.Types

If you want to query the structure of a table, including column names and data types, you have to join the catalog views Sys.Columns and Sys.Types. There are some caveats to this. If any of your columns are defined as nvarchar or as user-defined data types, you must qualify the data coming from Sys.Types. When you add a user-defined data type, it is entered into Sys.Types with a reference to its native data type. In the AdventureWorks database, for example, the data type "AccountNumber" is defined as nvarchar, and shows up in Sys.Types with system_type_id = 231 (which points to "nvarchar", where system_type_id = 231 and user_type_id = 231).


Running the query:


SELECT
    Tb.Name,
    C.Name,
    Tp.Name
FROM Sys.Tables Tb
JOIN Sys.Schemas Sch
    ON Sch.Schema_Id = Tb.Schema_Id
JOIN Sys.Columns C
    ON C.Object_Id = Tb.Object_Id
JOIN Sys.Types Tp
    ON Tp.System_Type_Id = C.System_Type_Id
WHERE Tb.Name = 'Address'
ORDER BY Tb.Name, C.Name, Tp.Name



produces these results:



Weird, huh? Why did 'AddressLine1' show up six times with six different data types? The reason is that the join is on System_Type_Id: 'AddressLine1' is defined as nvarchar(60), and every data type based on nvarchar, including "sysname" (think of "sysname" as Microsoft's built-in user-defined data type), shares nvarchar's system_type_id.


Take a look at the results of the query below. It shows that, including itself, six different data types are based on nvarchar! That's why 'AddressLine1' showed up six times in the query above.

SELECT Name FROM Sys.Types Tp
WHERE System_Type_Id = 231


Name
-------------------
nvarchar
sysname
AccountNumber
Name
OrderNumber
Phone
(6 row(s) affected)


So let's change our query to use this 'User_Type_Id' column instead:

SELECT
    Tb.Name,
    C.Name,
    Tp.Name
FROM Sys.Tables Tb
JOIN Sys.Schemas Sch
    ON Sch.Schema_Id = Tb.Schema_Id
JOIN Sys.Columns C
    ON C.Object_Id = Tb.Object_Id
JOIN Sys.Types Tp
    ON Tp.User_Type_Id = C.User_Type_Id
WHERE Tb.Name = 'Address'
ORDER BY Tb.Name, C.Name, Tp.Name

This produces the results we want:


Tuesday, April 21, 2009

Data Patterns and the LIKE Clause

The "LIKE" clause of the SQL SELECT statement is one of the more interesting features of SQL's character processing. Most everyone is familiar with the "%" wildcard, which allows queries such as "SELECT LastName FROM Customers WHERE LastName LIKE 'Mc%'". This returns all the customer last names beginning with "Mc" (such as yours truly). But I suspect that many developers are unaware of some of the deeper uses of data pattern expressions.

The list of other wildcard characters related to LIKE includes "_", "[", "-", "]", and "^". The first, "_", matches any single character. The bracket characters "[]" also act as a single-character wildcard, but let us specify which characters will match. The WHERE clause above is equivalent to "WHERE LastName LIKE '[M][c]%'". When multiple characters appear within the brackets, the filter acts like an "or" expression, so changing the filter to "WHERE LastName LIKE '[M][c][aeiou]%'" would produce last names beginning with "Mc" followed by a vowel and then any remaining characters.

If you use the "-" with the brackets, you can specify ranges of characters (ranges defined by ASCII order). For example, let's say we want to search for user names that begin with 'jmclain' and are then followed by a single digit number. We would execute "SELECT * FROM Users WHERE UserName LIKE 'jmclain[0-9]'".

Where it gets complicated is when you want to search a column for wildcard literals. For example, let's say that you have a column called 'SalesDescription', and you want to count the rows where the SalesDescription column contains the string "50% off". If you were to execute "SELECT COUNT(*) FROM Sales WHERE SalesDescription LIKE '50% off'", you would mistakenly pull in rows with SalesDescription values such as '50 cents off', since the "%" wildcard represents "any string". To correct this, you have two options. The simplest is to enclose the "%" wildcard with brackets, so that the filter changes to "WHERE SalesDescription LIKE '50[%] off'".

The second option is to make use of the ESCAPE clause of the LIKE operator. What this method lacks in simplicity, it makes up for in robustness (and it isn't really that complicated anyway). To solve the above problem this way, the filter changes to "WHERE SalesDescription LIKE '50!% off' ESCAPE '!'". I prefer the first method because 1. it is simpler, and 2. to use the ESCAPE clause safely, you must be certain that the string you are searching for doesn't itself contain the escape character, otherwise that character has to be escaped as well. So if a given SalesDescription value in the table were, unbeknownst to you, something like '50% off!!!', the results can start to become unreliable. Best practice for using ESCAPE is to start with an uncommon character such as "~" and query your column first to make sure it is not present.

The best use of ESCAPE is when you want to find brackets in your target. Let's say that you wanted to find the SalesDescription value "[50% off]". After checking to ensure that the column values don't contain the tilde ("~") character, you would use the filter "WHERE SalesDescription LIKE '~[50~% off~]' ESCAPE '~'".
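
Putting the pieces together, here's a small self-contained demonstration. The table and values are invented for illustration; the expected counts are in the comments.

-- Demonstration of matching wildcard characters as literals (invented sample data)
CREATE TABLE #Sales (SalesDescription varchar(50))
INSERT INTO #Sales (SalesDescription) VALUES ('50% off')
INSERT INTO #Sales (SalesDescription) VALUES ('50 cents off')
INSERT INTO #Sales (SalesDescription) VALUES ('[50% off]')

SELECT COUNT(*) FROM #Sales WHERE SalesDescription LIKE '50% off'                 -- 2: '%' matches any string
SELECT COUNT(*) FROM #Sales WHERE SalesDescription LIKE '50[%] off'               -- 1: bracketed literal '%'
SELECT COUNT(*) FROM #Sales WHERE SalesDescription LIKE '50!% off' ESCAPE '!'     -- 1: escaped literal '%'
SELECT COUNT(*) FROM #Sales WHERE SalesDescription LIKE '~[50~% off~]' ESCAPE '~' -- 1: escaped brackets and '%'

DROP TABLE #Sales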

Friday, April 17, 2009

Converting Datetime Values to Varchar

Back when I used to code in FoxPro, I had to write custom code to convert date/time values to various formats. T-SQL provides a great number of formats through the CONVERT function. An article on MSSQLTips lists many (but not all) of these formats. This code (which could easily be wrapped in a stored procedure) will list all valid format codes along with an example of how each appears:


SET NOCOUNT ON
CREATE TABLE #Fmts (FmtNo tinyint, Example varchar(max))
DECLARE @fmt int; SET @fmt = 0
DECLARE @dt datetime; SET @dt = GETDATE()
WHILE @fmt < 132
BEGIN
    BEGIN TRY
        INSERT INTO #Fmts (FmtNo, Example)
        VALUES (@fmt, CONVERT(varchar, @dt, @fmt))
    END TRY
    BEGIN CATCH
        PRINT '@fmt = ' + LTRIM(STR(@fmt)) + ' is not valid.'
    END CATCH
    SET @fmt = @fmt + 1
END
SELECT FmtNo, Example = LEFT(Example, 30) FROM #Fmts
DROP TABLE #Fmts
SET NOCOUNT OFF



And sample output:





Wednesday, April 15, 2009

Question of the Day

I stumped the gurus on SQLServerCentral.com with another challenging Question of the Day on April 6th, 2009:

Given this code,

DECLARE @val int;
SET @val = -1
CREATE TABLE #empty (val int)

which statement(s) will result in @val being NULL? (select all that apply)

  1. SET @val = NULL
  2. SELECT @val = NULL FROM #empty
  3. SELECT @val = val FROM #empty
  4. SELECT @val = (SELECT val FROM #empty)
As of today, only about 30% of respondents have answered correctly, and judging from the comments in the discussion section, a lot of them gained a deeper understanding of the concept of NULL in SQL. I try to make my questions (and answers) tricky enough not to be obvious, but my goal isn't to trick people with arcane technicalities - I want to make them aware of certain subtleties of the database engine. This question arose after some unexpected results made me delve into a bit of code and test scenarios just like those in the question.
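
If you want to verify the behavior yourself, here is a quick test harness (expected results are noted in the comments). The key subtlety is that a SELECT assignment from an empty table performs no assignment at all, while a scalar subquery against an empty table yields NULL.

DECLARE @val int
CREATE TABLE #empty (val int)

SET @val = -1; SET @val = NULL
SELECT @val                                             -- NULL: direct assignment

SET @val = -1; SELECT @val = NULL FROM #empty
SELECT @val                                             -- -1: no rows, so no assignment happens

SET @val = -1; SELECT @val = val FROM #empty
SELECT @val                                             -- -1: again, no rows means no assignment

SET @val = -1; SELECT @val = (SELECT val FROM #empty)
SELECT @val                                             -- NULL: the scalar subquery returns NULL

DROP TABLE #empty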

Monday, April 13, 2009

Collation Sequences

Being a database developer rather than a DBA, I rarely deal with collation types, but I came across a situation recently where I had to dig into the issue. My objective was to produce a breakdown of how many rows contained each ASCII character. I considered two approaches: slice the values into one-character chunks, or loop through all 256 ASCII values and count the number of rows containing each character. The former approach has the advantage of counting not only the rows but also the frequency of characters (e.g., "100 rows contain 250 instances of the character 'A'"), but I opted for the second approach since it intuitively seemed faster. If your database was created with a case-insensitive collation (such as "SQL_Latin1_General_CP1_CI_AS"), checking for the character 'A' will also pull in values containing 'a':

USE NorthWind
GO

SELECT DISTINCT City
FROM dbo.Customers
WHERE CHARINDEX('A', City) > 0




To fix, simply add the "COLLATE" clause to the query:

SELECT DISTINCT City
FROM dbo.Customers
WHERE CHARINDEX('A' COLLATE Latin1_General_BIN, City) > 0


Friday, April 3, 2009

"Average" Date

Have you ever tried to calculate the average of a datetime column, and gotten this error:

"Msg 8117, Level 16, State 1, Line 1
Operand data type datetime is invalid for avg operator."

When you first think about it, the error makes sense - are you trying to determine the average month, year, hour, or second? But shouldn't there be such a thing as an "average" date? If we have a bunch of sales orders in a given month, doesn't the "average" date of sale actually mean something?

What if I calculated the MIN value, then calc'd the DATEDIFF between the MIN and all other values? At that point I'd essentially have integer values, which of course I could average, and then use to derive the "average" date:

;WITH
CvrtToDate AS (
    SELECT
        /* "DataValue" is assumed to be varchar(max) */
        DataValue = CONVERT(datetime, DataValue)
    FROM DataSet
    WHERE ISDATE(DataValue) = 1
)
,MinAndMax AS (
    SELECT
        ValueMin = MIN(DataValue)
       ,ValueMax = MAX(DataValue)
    FROM CvrtToDate
)
,DateDiffs AS (
    SELECT
        DaysFromMin = DATEDIFF(d, MinAndMax.ValueMin, DataValue)
    FROM CvrtToDate, MinAndMax
)
,AvgDaysFromMin AS (
    SELECT DataValue = AVG(DaysFromMin)
    FROM DateDiffs
)
SELECT
    AvgDate = DATEADD(d, AvgDaysFromMin.DataValue, MinAndMax.ValueMin)
FROM MinAndMax, AvgDaysFromMin


This query yields a result that makes sense: a date between the oldest and most recent values, somewhere near the midway point.

A little Google research bears fruit for a much simpler calculation. From "Ask Ben: Averaging Date/Time Stamps In SQL": "The secret to date/time averaging is that date/time stamps can be represented as a floating point number. I have covered this a number of times on this blog so I won't go into too much detail, but the idea is that as a floating point number, the integer part represents the number of days since the beginning of time (as the SQL server defines it) and the decimal part represents the time or rather, the fraction of days. SQL does not make this conversion for you; you have to CAST the date/time stamp as a FLOAT value."

This leads to the revised calculation:

SELECT
    ValueMin = MIN(CONVERT(datetime, DataValue))
   ,ValueMax = MAX(CONVERT(datetime, DataValue))
   ,ValueAvg = CONVERT(datetime, AVG(CONVERT(float,
                   CONVERT(datetime, DataValue))))
FROM DataSet
WHERE ISDATE(DataValue) = 1


Not only is this calc far simpler, it is also slightly more precise, as it includes the time of day in the result.

Calculating Median

I found an article on calculating the median in SQL (http://www.sqlservercentral.com/scripts/Miscellaneous/31775/), and after reading the Wikipedia article on the median, I realized that it was incorrect for sample sets of even size. I left a comment with a version of the calc that accounts for this; here is a copy of the code:

;WITH
TopHalf AS (
    SELECT TOP 50 PERCENT DataValue
    FROM DataSet
    ORDER BY DataValue ASC
)
,BottomHalf AS (
    SELECT TOP 50 PERCENT DataValue
    FROM DataSet
    ORDER BY DataValue DESC
)
,BottomOfTopHalf AS (
    SELECT TOP 1 DataValue
    FROM TopHalf
    ORDER BY DataValue DESC
)
,TopOfBottomHalf AS (
    SELECT TOP 1 DataValue
    FROM BottomHalf
    ORDER BY DataValue ASC
)
SELECT
    Median = (BottomOfTopHalf.DataValue
              + TopOfBottomHalf.DataValue) / 2.0
FROM BottomOfTopHalf, TopOfBottomHalf

Wednesday, April 1, 2009

Function to Insert Commas Into Number String

While looking into some data quality issues, I considered the case of source data arriving as a string of numbers with embedded commas (for example, "1,526,734.56"). Since I didn't have any such data to test with, I decided to create some, and wrote this function in the process: fn_AddCommasToNumberString.

(Followup on 4/23/09): A forum post on SQLServerCentral.com explained a very easy way to do this by casting to money and converting to varchar with style 1:

declare @test float
set @test = 7265342.12
select @test, convert(varchar(20),cast(@test as money),1)

Wednesday, March 25, 2009

SQL Server Data Profiler

With the newest edition of SQL Server, 2008, comes a tool that serves as a good first cut at data profiling. Here are some good articles on it:

Fuzzy Matching

While doing some research on data quality, I came across an article about implementing the Jaro-Winkler distance metric in SSIS. Given two character strings, the algorithm returns a number between 0 and 1 indicating their similarity (0 meaning no similarity and 1 an exact match). The Wikipedia entry uses the names MARTHA and MARHTA as an example, producing the result 0.94 (a 94% similarity).

Friday, March 6, 2009

When I ran the comparison of the original version against the most recent version of the sequence-matching code, I saved the results of the Seq1 and Seq2 tables. While running some tests on those result tables, I discovered a mistake:



The four counts should all be equal, and because they are not, we know there is a mistake in at least one of them. Because the version 1 results are not even consistent with one another, I have to think the problem is with the original version.

This query:

SELECT
    Seq1_Line = ISNULL(LTRIM(STR(S1.CodeLineNum)), '')
   ,ISNULL(S1.CodeLineTxt, '')
   ,ISNULL(S2.CodeLineTxt, '')
   ,Seq2_Line = ISNULL(LTRIM(STR(S2.CodeLineNum)), '')
   ,OrderBy = ISNULL(S1.CodeLineNum, S2.CodeLineNum)
FROM Seq1 S1
FULL OUTER JOIN Seq2 S2
    ON S1.CodeLineNum = S2.MatchLineNum
ORDER BY OrderBy

will display the matches between the two sequences in "Windiff" fashion, and running it against the results of the most recent matching algorithm shows that that version worked correctly.

Wednesday, March 4, 2009

Comparing Stored Procedures, Part 7

Last week we discovered a set-based method of implementing our sequence-matching algorithm using a Common Table Expression (CTE), and found that it beat our iterative code by several orders of magnitude. Remember the original impetus for this effort, comparing the two stored procedures of more than 1,000 lines each, which initially took about six-and-a-half hours? Running this new version in TempDb on an idle database server took just over 30 seconds! How could this be, given that in Part 6 a comparison of 800 random 2-letter values took about 70 seconds? (Omitted from those results - an input size of 1,000 took over 2 minutes.) How could a test of the same input size (1000 values), but with vastly more complex values (a real stored procedure rather than random two-letter values), take one-quarter of the time to complete? The answer lies in the length of the matching subsequences. If we run this query:

SELECT MatchLineNum, SubSeqLen = COUNT(*)
FROM Seq1
GROUP BY MatchLineNum
ORDER BY COUNT(*) DESC


we get the results:



So the first matching subsequence covers 121 values, which is over 10% of the entire sequence. The next matches are composed of 84, 82, 64, and 56 values, which, combined with the first, comprise about 30% of all values. With the sequences of random values, the matching subsequences are much shorter and therefore require more executions of the CTE block.

Wednesday, February 25, 2009

Comparing Stored Procedures, Part 6

In the previous entry, we found that computing the maximum search size prior to the sequence matching using SQL Server 2005's CTE feature saved a considerable amount of time, but we were still performing at an unacceptable rate. I decided it was time to rework the core logic of the matching algorithm. Impressed by the blazing performance of the CTE used to precalculate the max search length, I wanted to recode the algorithm using the set-based logic of the CTE.

Initially, an iterative approach to this problem made a lot of sense, since we need to control the order in which subsequences are checked (longer before shorter), and there didn't seem to be an obvious way to do that with a CTE. The real power of the CTE comes from recursion, the ability of a set of data in the CTE to refer to itself (please go here for an excellent article on this). I developed a hybrid approach that combines iterative and recursive code. The recursive CTE in it selects the longest matching subsequence between the two sequences whose values are all still unmatched, saves it in a temp table, and then updates the two sequences to mark the matching values. It continues doing this until no more matching subsequences can be found. This new version can be found here.
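
To give a flavor of the approach, here is just a sketch, not the full script linked above. It assumes the Seq1 and Seq2 tables with the CodeLineNum, CodeLineTxt, and MatchLineNum columns used elsewhere in this series: the anchor member seeds a length-1 match at every pair of equal, unmatched lines, the recursive member extends each candidate while the following lines also match, and the outer SELECT keeps the longest one.

-- Sketch of the longest-unmatched-subsequence search (simplified, not the published version)
;WITH Matches (Seq1Start, Seq2Start, MatchLen) AS (
    -- anchor: every pair of equal, still-unmatched lines starts a candidate match of length 1
    SELECT s1.CodeLineNum, s2.CodeLineNum, 1
    FROM Seq1 s1
    JOIN Seq2 s2 ON s2.CodeLineTxt = s1.CodeLineTxt
    WHERE s1.MatchLineNum IS NULL AND s2.MatchLineNum IS NULL

    UNION ALL

    -- recursive: extend a candidate by one line while the next pair of lines also matches
    SELECT m.Seq1Start, m.Seq2Start, m.MatchLen + 1
    FROM Matches m
    JOIN Seq1 s1 ON s1.CodeLineNum = m.Seq1Start + m.MatchLen
    JOIN Seq2 s2 ON s2.CodeLineNum = m.Seq2Start + m.MatchLen
    WHERE s1.CodeLineTxt = s2.CodeLineTxt
      AND s1.MatchLineNum IS NULL AND s2.MatchLineNum IS NULL
)
SELECT TOP 1 Seq1Start, Seq2Start, MatchLen
FROM Matches
ORDER BY MatchLen DESC
OPTION (MAXRECURSION 0)   -- allow matches longer than the default 100-level limit

The iterative wrapper then records the winning subsequence in the temp table, stamps MatchLineNum on those rows in both Seq1 and Seq2, and loops until this query returns no rows.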


The code is much more concise and much more elegant than the iterative algorithm. But what is truly amazing is the performance increase. After altering the testing script to test all three versions of the code under the same conditions, in the same testing session, the improved performance speaks for itself:



The 'Improved' column compares the latest version with the first. The improvement is not just in absolute time; notice how slowly the extra time required grows with each increment of the input size. After running the new version for input sizes from 100 to 800 values, I created this graph of the performance time in seconds (graph developed using The Graph Calculator by Holt, Rinehart, and Winston):


The graph rises very slowly at first, then the performance time starts to grow very quickly towards the right end of the graph. Using this website for power regression analysis, we arrive at the approximate formula y ≈ 3.5x^2.5/10^6 (at x = 800 this predicts roughly 63 seconds, close to the observed time). If this is accurate, then the new version is somewhere between quadratic and cubic order of complexity, and although considerably faster than the previous versions, it will suffer at larger input sizes.

Next: we examine the results of running our new code on the original large stored procedures.