Monday, November 30, 2009

Building a Yahoo Finance input adapter for SQL Server StreamInsight

Here's an excellent SSC article by Johan Åhlén on building CEP (Complex Event Processing) solutions with StreamInsight in SQL Server 2008 R2. "What is StreamInsight and what is it good for? StreamInsight is a platform for developing and deploying applications that handle high-speed streaming data. It could be used for near real-time processing of data from production environments, structured data such as financial information, and unstructured data such as Facebook, Twitter and blogs. Multiple sources can be combined and refined before they are output. This technology is called CEP - complex event processing.

"Combine StreamInsight with data mining, and you can have real-time fraud detection, stock market prediction, you name it... Imagine a real-time Business Intelligence-application where the management can actually see the KPI gauges moving on the screen. I would say that monitoring real-time flow of information is one of the key success factors for tomorrow's Business Intelligence. This is also supported by Ralph Kimball, saying in an interview that Business Intelligence is moving from the strategic level towards operational level processes such as customer support, logistics and sales. At the operational level data must be continuously updated, not just once per day. I would add also that the new generation, that has grown up with Facebook and Twitter, will make it necessary to monitor new sources for successful Business Intelligence."

Monday, November 23, 2009

Data Vault Institute

While reading the comments on an SSC article about fault-tolerant ETL loading, I came across a link to the Data Vault Institute. What caught my attention was Dan Linstedt's comment that "today's data warehouse has become a system of record, due in part to the need for compliance." This struck me as an important point: many of the databases I've worked on in the past were marketing databases, where bad data was generally disposed of, but in financial, health care, and governmental data warehouses that approach would be completely unsatisfactory.

Monday, November 9, 2009

Intelligent Keys

I'm engaged in an internal debate about the use of intelligent versus surrogate keys. Typically when this issue arises, the debate centers around whether we want to use a key that is already present in the data (such as SSN in an employee table - this is an intelligent key, also known as a natural key), or if it's better to generate a new meaningless key (such as an auto-incrementing integer - this is a surrogate key).
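To make the distinction concrete, here is a minimal T-SQL sketch - the table and column definitions are my own illustration, not taken from any particular system:

    -- Intelligent (natural) key: a value that already exists in the business data
    CREATE TABLE EmployeeNatural (
        SSN        CHAR(9)     NOT NULL PRIMARY KEY,  -- business value doubles as the key
        FirstName  VARCHAR(50) NOT NULL,
        LastName   VARCHAR(50) NOT NULL
    );

    -- Surrogate key: a meaningless auto-incrementing integer generated by the database
    CREATE TABLE EmployeeSurrogate (
        EmployeeID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- no business meaning
        SSN        CHAR(9)     NOT NULL UNIQUE,  -- natural key kept as an alternate key
        FirstName  VARCHAR(50) NOT NULL,
        LastName   VARCHAR(50) NOT NULL
    );

In the surrogate version, the SSN is still enforced as unique, but nothing else in the database ever references it - all foreign keys point at EmployeeID, so the business value can change without rippling through the schema.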

Now, the internal debate isn't over that issue per se - I fall on the side that favors surrogate keys. The real debate is whether it's okay to create an intelligent surrogate key. The most typical surrogate, as mentioned above, is an auto-incrementing integer identity - every time a row is inserted into the table, the database generates a new key by incrementing a counter. These keys have zero business meaning, and that's exactly their advantage: they are decoupled from the business data.

However, there are situations where it makes sense to create this value intelligently. One example is a time dimension in a data warehouse, where the primary key is an integer in the form "YYYYMMDD". Microsoft favors this method (as I discovered in their training kit for exam 70-448). A big advantage of this approach is that if you create a clustered index on that intelligent surrogate key, all of your data will be stored in date order. (Of course, if you always insert into the table earliest date first, it will end up in that order anyway - unless you later add an earlier time period.)
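Here is a rough sketch of that pattern in T-SQL - a minimal DimDate table of my own invention, assuming a simple structure rather than anything specific from the 70-448 training kit:

    -- Date dimension keyed on an "intelligent surrogate": an integer in YYYYMMDD form
    CREATE TABLE DimDate (
        DateKey   INT      NOT NULL,  -- e.g. 20091109 for November 9, 2009
        FullDate  DATETIME NOT NULL,
        [Year]    SMALLINT NOT NULL,
        [Month]   TINYINT  NOT NULL,
        [Day]     TINYINT  NOT NULL,
        CONSTRAINT PK_DimDate PRIMARY KEY CLUSTERED (DateKey)  -- rows stored in date order
    );

    -- The key is derived from the date itself instead of an IDENTITY counter;
    -- CONVERT style 112 formats a datetime as 'YYYYMMDD'
    INSERT INTO DimDate (DateKey, FullDate, [Year], [Month], [Day])
    SELECT CONVERT(INT, CONVERT(CHAR(8), GETDATE(), 112)),
           GETDATE(),
           YEAR(GETDATE()), MONTH(GETDATE()), DAY(GETDATE());

Because the table is clustered on DateKey, a range scan over a span of dates reads contiguous pages - which is the sorting benefit described above.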

TSQL Challenge #16 Out

I missed it (it came out a week ago): http://beyondrelational.com/blogs/tc/archive/2009/11/02/tsql-challenge-16-find-intersections-in-date-ranges-and-concatenate-aggregated-labels.aspx