This is the continuation of the transcript of a Webinar hosted by InetSoft in September 2018 on the topic of "Best Practices in Data Mining." The speaker is Mark Flaherty, CMO at InetSoft.
Moderator: What advice would you give organizations to help them build flexibility into their data mining programs?
Flaherty: I think the first thing to consider when you think about flexibility is your business response time. If your business response time, the amount of time in which you can react to something, is a week or so, then your flexibility has to be tied to that kind of environment.
There is also the possibility that you might be talking about online customer service requests, which demand quick results. Your CSRs, your end users, or actual customers are interacting with your predictive models through instantaneous predictions on a Web site. In that case the models can actually change on a real-time basis without your having to bring the infrastructure down.
So you would want to have a system in place where you can seamlessly deploy new models in the backend, so that whoever interacts with those models, your customers, your CSRs, and so on, sees the changes in real time. Those systems are fairly convenient to put together with today's technology, so you can actually update your responses immediately. If you see changes in the market, and you have done the data analysis to see how those changes impact your models and the patterns the models are using, you can apply them in real time while people are interactively using your systems.
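The idea of swapping in a new model behind a live service, without bringing the infrastructure down, can be sketched in a few lines. This is a minimal illustration, not InetSoft's implementation; the class and method names are hypothetical, and a "model" here is simply any callable that scores a set of features.

```python
import threading

class ModelRegistry:
    """Serve predictions while allowing the underlying model to be
    replaced at any time, without taking the service down.
    (Illustrative sketch; names are hypothetical.)"""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def swap(self, new_model):
        # Atomically replace the live model: in-flight requests finish
        # with the old model, and subsequent requests see the new one.
        with self._lock:
            self._model = new_model

    def predict(self, features):
        with self._lock:
            model = self._model
        return model(features)

# Deploy an updated model while the service keeps answering requests.
registry = ModelRegistry(lambda x: 0.2 * x)   # initial model
print(registry.predict(10))                   # scored by the old model
registry.swap(lambda x: 0.5 * x)              # live deployment of a new model
print(registry.predict(10))                   # scored by the new model
```

In a real deployment the lock-protected swap would typically sit behind a web service endpoint, but the core pattern, an atomic reference swap rather than a restart, is the same.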
Moderator: We know things are going to happen to our input streams: there are going to be missing elements, and there are going to be outliers. There is going to be data coming in from third parties, Twitter streams, maybe data you are buying from data compilers. They may change their interfaces. They may change what the data elements mean. So it's important for us to be watching for these problems.
There are two best practices for dealing with this uncertainty. One is to document the models well enough to know what's happening, so you can diagnose a problem if you see a data element starting to drop out that should be there. Put that documentation in a repository with the rest of the code for that application. The second has to do with handling missing data and handling outliers.
Flaherty: Yes, clean the data. Fix it.
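A minimal sketch of that second practice, imputing missing values and clipping outliers, might look like the following. The policy shown (median imputation plus a MAD-based outlier threshold) is one common choice, not a prescription from the speakers; real cleaning rules are domain-specific.

```python
import statistics

def clean_column(values, k=3.0):
    """Impute missing values with the median and clip outliers using
    the median absolute deviation (MAD), which, unlike the standard
    deviation, is not itself inflated by the outliers.
    (Illustrative policy; thresholds are assumptions.)"""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    spread = 1.4826 * mad  # scales MAD to match sigma for normal data
    lo, hi = med - k * spread, med + k * spread
    return [med if v is None else min(max(v, lo), hi) for v in values]

data = [10, 12, None, 11, 500, 13]  # None = missing, 500 = likely glitch
print(clean_column(data))           # missing value imputed, 500 clipped
```

The robust MAD threshold matters here: a plain standard deviation computed over this column would be dominated by the 500 and would fail to flag it.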
Moderator: Yes, that's one of those best practices that applies everywhere. When you get down to performance issues, that's when you want to hard code things, for example, but anytime you can employ a usable and reasonable layer of abstraction around managing large data sets, that's preferred.
Flaherty: If we think about the time dimension, one of the best practices is that in order to run these models frequently and in many more kinds of situations, you don't need that complex a model. You just need to build simpler models, which provide a greater degree of flexibility versus a complex model that might prove difficult to use in a constantly changing environment. So that's one thing.
And you want to hold on to those old models and the descriptions of those old models. That's where the governance comes in: you definitely want to have a lot of transparency across the different stakeholders involved in the modeling workflow, so documentation is critical there from a compliance and governance perspective.
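Holding on to old models and their descriptions can be as simple as keeping a versioned metadata log alongside each deployment. The sketch below shows that idea only; the field names and the `register_model` helper are hypothetical, and a production system would also store the serialized model artifact itself.

```python
import json
import time

def register_model(registry, name, version, description, features):
    """Append a model version's metadata to a registry so that every
    old version stays auditable for governance and compliance reviews.
    (Hypothetical helper; field names are illustrative.)"""
    registry.setdefault(name, []).append({
        "version": version,
        "description": description,
        "features": features,
        "registered_at": time.strftime("%Y-%m-%d"),
    })

registry = {}
register_model(registry, "churn_score", 1,
               "baseline model", ["tenure", "usage"])
register_model(registry, "churn_score", 2,
               "adds support-ticket count", ["tenure", "usage", "tickets"])
print(json.dumps(registry, indent=2))  # full version history for audits
```

Because nothing is ever overwritten, any stakeholder can later see exactly which inputs and description applied to the model that was live at a given time.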
InetSoft Technology Corp.