Monday, December 08, 2008 11:42 AM
Data Mining with Microsoft's SQL Data Services
Microsoft is making some great progress with its SQL Server 2008 Data Services initiative. The latest addition to the line-up of data services available online is data mining! You can access the beta version of the data mining tools in the cloud by visiting http://sqlserverdatamining.com/cloud/. At this website, you have 2 options for trying out the data mining service. You can download the Excel 2007 add-in which will then work with data files on your computer and access the service in the cloud to mine the data or you can work entirely in the browser.
If you use the Excel add-in, you first download the CloudTatSetup.msi file. Run the msi file to install the add-in. Once it's installed, you can open your own Excel files to analyze with these tools or visit a site like http://www.swivel.com/ to find publicly available data sets that you can use to explore the toolset with. In the example below, I use the data set found at http://www.swivel.com/data_sets/csv/1000542 to explore the survival rate of the passengers of the ill-fated voyage of the Titanic in 1909.
The Titanic data set displays information about the passengers of the voyage including their gender, age, occupation, and how many lifeboats they had access to. It also displays whether the passenger survived or not. With the data set open in Excel, you must first format the data as an Excel table. You can do this by selecting the data, then going to the Insert tab on the Ribbon and clicking on Table. Click OK on the resulting dialog box to format the table.
Now that the data is presented as a table in Excel, you can click on the "Analyze (in the Cloud)" tab in the "Table Tools" section of the Ribbon. If you do not see this tab, then simply click in the Excel table first.
The add-in displays several data mining tools which will analyze the data in your Excel table and mine the data using the data services in the cloud. You can explore each of these tools at your leisure, but we'll walk through the "Analyze Key Influencers" tool now.
Clicking on the "Analyze Key Influencers" button will prompt you to select the column to analyze for key factors. In this case, we wish to analyze the "Survived?" column to see what other columns in our data set contributed the most to whether a passenger survived or not.
After you click the Run button, a dialog box will appear asking you to agree to providing the data over the Internet to Microsoft's servers. You must check the "I Agree" checkbox and click on the OK button. This is needed because at the moment there is no encryption utilized to transfer your data to Microsoft's SQL Data Services servers since the data mining service is still in beta. I imagine that encryption will be added at a later time, but you may want to avoid sending sensitive data over the Internet while this service is still in beta.
Once you click OK, the service will process your data and then pop up a new dialog box prompting for more information for discrimination reporting. At this time, we'll keep our example simple and skip this report. Simply click Close for now.
The screenshot below is the results of the analysis. In this example, the colored bars show the strength of the influence that a data value has on the column we were analyzing. The data value of "Boat < 1" greatly influenced the Survival column to favor a value of No. In other words, if a passenger had access to less than 1 boat (aka, no lifeboats), then it was very very likely that the passenger did not survive. Also, if a passenger's gender was Male, there was also a likely change that they would not survive. At the bottom, you can see that the most influential data values on a passenger surviving the voyage were that their gender be Female and that they have access to many lifeboats. Interestingly enough, the passenger's age, occupation, and what class they belonged to on the ship had almost no influence on survival.
This is just a simple example of how to use the toolset to analyze the data. You can imagine applying such tools to analyzing your store sales data to determine what best about your customers or products influences a sale. You can also use such an analysis on publicly available stock data to see if you can out-think the stock market and see what influences stock or currency prices. This can also be useful in hospital settings to determine what influences patient recovery rates.
Check out the rest of the data mining tools available in the Excel add-in to explore the other ways that Microsoft's SQL Data Services can help you mine your data. If you play with the toolset in the browser, you'll see the same familiar interface. The only difference is that you must upload your entire data files (.CSV only at the moment) to the server before you can begin your analysis. Simply click on the "Load Data" button at the top of the browser to upload your data file.
You'll have several options at this point. You can select one of the pre-loaded sample data sets to work with, you can upload your own CSV file, or you can connect to your database if its hosted by SQL Data Services.
Once your data set is loaded, simply follow the same steps as specified earlier to analyze the data in the browser!
Although these cloud-based data mining tools are only in beta at the moment, they're already shaping up pretty well. As the tools evolve, I'll further explore them in this blog!