
InetSoft Webinar: Talking About Big Data

This is the continuation of the transcript of a Webinar hosted by InetSoft in September 2011 on the topic of "Best Practices in Data Mining." The speaker is Mark Flaherty, CMO at InetSoft.


Flaherty: And in that context, we have seen more and more of our customers coming to us to talk about big data, for example, where we have large volumes of data, different types of data, or data coming in at different speeds. So I think some of our more mature customers are also focusing on what the best practices are around sampling when it comes to big data.

When it comes to data visualization, what are some of the best methods to use? When it comes to transformations, there are questions such as how do we handle missing values? That’s part of the data preparation process, and a lot of our customers are looking into some of the best practices on that end.
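As an illustration of the missing-values question raised here, the following is a minimal sketch of one common approach, filling gaps with the mean or median of the observed values. The speaker does not name a specific technique, so the function `impute_missing`, its `strategy` parameter, and the `ages` data are all hypothetical.

```python
from statistics import mean, median


def impute_missing(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper for illustration; the webinar does not prescribe
    a particular imputation method.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Keep observed values as they are and fill only the gaps.
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [34, None, 51, 29, None, 42]
print(impute_missing(ages, strategy="median"))
print(impute_missing(ages, strategy="mean"))
```

The median is often preferred when a column has outliers, since a few extreme values can pull the mean away from what is typical.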

Moderator: When you talk about sampling, I am presuming you are talking about taking a small subset of your data and creating some algorithms using the subset. Obviously if you are trying to develop an algorithm based on a megabyte of data, it's going to run a lot faster than if you try to do that on a terabyte of data. When you do sampling, what’s a good percentage of the total? Is there a best practice there?


Flaherty: I think we have seen ranges from about 2% to 6% or 7%, but when we are trying to predict something that is rare, such as fraud, you will need to pull a percentage from the higher end of this range. Customers ask: can I model on the complete set of data? Is that even a possibility? Will I get a better model from a well-defined sample or from building the model on the complete set of data?

I think that’s where a lot of our discussions break down, and we come back to the same question: what are you trying to solve here? So in the case of a rare event, taking a larger sample is definitely one of the best practices.
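To make the point about rare events concrete, here is a rough sketch that draws simple random samples at the low and high ends of the 2% to 7% range and counts how many rare cases each one captures. The data set is synthetic: the 100,000 records, the 1% fraud rate, and the helper `simple_random_sample` are assumptions for illustration, not figures from the webinar.

```python
import random


def simple_random_sample(records, fraction, seed=0):
    """Draw a simple random sample containing `fraction` of the records."""
    rng = random.Random(seed)
    k = round(len(records) * fraction)
    return rng.sample(records, k)


# Synthetic data: 100,000 transactions, roughly 1% flagged as fraud (assumed rate).
data_rng = random.Random(42)
transactions = [
    {"id": i, "fraud": data_rng.random() < 0.01} for i in range(100_000)
]

fraud_counts = {}
for fraction in (0.02, 0.07):
    sample = simple_random_sample(transactions, fraction)
    fraud_counts[fraction] = sum(r["fraud"] for r in sample)
    print(
        f"{fraction:.0%} sample: {len(sample)} records, "
        f"{fraud_counts[fraction]} fraud cases"
    )
```

With a rare target, the smaller sample leaves the model only a handful of positive cases to learn from, which is why a percentage from the higher end of the range helps.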

Another reason not to oversample is to leave data for testing the predictive model you have created from the subset. The more test samples you have, the greater confidence you will have in the test results.
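The link between test-set size and confidence can be sketched with a holdout split and the standard error of a measured accuracy, which for a proportion is sqrt(p(1 - p) / n). The function names, the 30% holdout fraction, and the 90% accuracy figure below are illustrative assumptions, not values from the webinar.

```python
import math
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_standard_error(accuracy, n_test):
    """Standard error of an accuracy estimate measured on n_test records."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


train, test = holdout_split(range(1_000), test_fraction=0.3)
print(len(train), len(test))

# The same measured accuracy is far less certain on a small test set.
for n in (100, 10_000):
    se = accuracy_standard_error(0.90, n)
    print(f"n_test={n}: accuracy 0.90 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

Because the standard error shrinks with the square root of the test-set size, a test set 100 times larger gives an interval about 10 times narrower.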

Moderator: That’s interesting because you never know what you are missing when you use just a portion of your data.

Flaherty: Another point to make is that along with the advances in data mining software tools, there have been advances on the hardware side to handle big data, meaning larger or complete data sets. So I think even when we say big data, larger samples or bigger volumes of data, it is still a snapshot of what you are gathering, because data is constantly coming into an organization.

So I think it’s all relative, but sampling is definitely not dead in a big data world, and large volumes of sampled data are definitely useful for rare-event types of predictive modeling. Identifying the business problem, providing training and deploying the appropriate kind of analytic infrastructure are critical when it comes to data preparation.

Moderator: Training is often overlooked and yet vastly important for any kind of information system. There are a number of companies that do conferences, for example. Or you can attend webcasts. For data preparation, the use of models or building models, what are the more important areas to consider?

Flaherty: I think analytical data preparation is definitely one of the best-attended courses. We are seeing a definite trend of a large percentage of people attending analytical data preparation training. The second biggest draw is how to apply specific modeling techniques to specific real-world problems, such as risk management, fraud detection or customer lifetime value.

The third trend is to learn what to do after the predictive model is created. I have this very high yielding model, a productive model, but what do I do with it? How do I operationalize that model into my systems? That’s where a lot of folks from both the analytic camp and the IT camp are coming in and taking training classes on how to deploy that model. How do I make it productized or operationalized?


Copyright © 2012, InetSoft Technology Corp.