This podcast is about how to develop a good strategy for data mining. Data mining is not likely to be fruitful unless the data you want to use meets certain criteria. Today we will talk about some of the aspects of the data and its application that you should consider.

Is the data available? This may seem like an obvious question, but be aware that although data might be available, it may not be in a form that can be used easily. You can input data from databases (via ODBC) or from files.
The data, however, might be held in some other form or on a machine that cannot be directly accessed. It will need to be downloaded or dumped in a suitable form before it can be used. It might be scattered among different databases and sources and need to be pulled together. It may not even be online. If it exists only on paper, data entry will be required before you can begin data mining.
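As a rough illustration of pulling scattered data together, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the outlet and sales fields, the sample values, and the use of an in-memory SQLite table as a stand-in for a real database that would more likely be reached via ODBC.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export of retail outlets.
OUTLETS_CSV = """outlet_id,city
1,Boston
2,Denver
3,Miami
"""


def load_outlets(csv_text):
    """Parse the CSV export into a dict keyed by outlet_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["outlet_id"]): row["city"] for row in reader}


def load_sales(conn):
    """Read (outlet_id, units) rows from the sales table."""
    query = "SELECT outlet_id, units FROM sales ORDER BY rowid"
    return conn.execute(query).fetchall()


def combine(outlets, sales):
    """Join sales to outlets; unmatched sales keep city=None so gaps stay visible."""
    return [
        {"outlet_id": oid, "city": outlets.get(oid), "units": units}
        for oid, units in sales
    ]


# Hypothetical source 2: an in-memory SQLite table standing in for a sales database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (outlet_id INTEGER, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120), (2, 85), (4, 40)])

records = combine(load_outlets(OUTLETS_CSV), load_sales(conn))
for record in records:
    print(record)
```

Keeping the unmatched sale (outlet 4) with an empty city, rather than dropping it, makes it easier to spot where the sources disagree before any mining starts.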
Does the data cover the relevant attributes? The object of data mining is to identify relevant attributes, so this may seem like an odd question. It is very useful, however, to look at what data is available and to try to identify likely relevant factors that are not recorded.
In trying to predict ice cream sales, for example, you may have a lot of information about retail outlets and sales history, but you may not have temperature and weather data, which are likely to play a significant role.
Missing attributes don’t necessarily mean that data mining will not produce useful results, but they can limit the accuracy of the resulting predictions. A quick way of assessing the situation is to perform a comprehensive audit of your data before moving on. Consider attaching a data audit node to your data source and executing it to generate a full report.

Is the data noisy? Data often contains errors, or may contain subjective and therefore variable judgments. These phenomena are collectively referred to as noise. Sometimes noise in data is normal. There may well be underlying rules, but they may not hold for 100% of the cases. Typically, the more noise there is in the data, the more difficult it is to get accurate results. However, machine learning methods are able to handle noisy data and have been used successfully on data sets containing almost 50% noise.
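Returning to the data audit mentioned earlier, the idea can be sketched in a few lines of Python. This is not the audit feature of any particular tool, only a hedged illustration of the kind of per-field report such an audit might produce. The `audit` function, the field names, and the sample records are all hypothetical.

```python
def audit(rows):
    """Summarize each field: record count, missing values, distinct values, numeric range."""
    # Collect field names in first-seen order across all records.
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)

    report = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report[name] = {
            "records": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report


# Hypothetical ice cream sales records; None marks a value that was never recorded.
rows = [
    {"outlet": "A", "units": 120, "temperature": 31.0},
    {"outlet": "B", "units": 85, "temperature": None},
    {"outlet": "A", "units": 140, "temperature": 33.5},
    {"outlet": "C", "units": None, "temperature": None},
]

report = audit(rows)
for name, stats in report.items():
    print(name, stats)
```

A report like this shows at a glance that temperature is missing for half of the records, which is exactly the kind of gap worth knowing about before modeling.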
Is there enough data for data mining? It is not necessarily the size of the data set that is important. The representativeness of the data set is far more significant, together with its coverage of possible outcomes and combinations of variables. Typically, the more attributes that are considered, the more records will be needed to give representative coverage. If the data is representative and there are general underlying rules, it may well be that a data sample of a thousand or even a few hundred records will give results as good as a million records would, and you will get the results more quickly.
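The point about representative samples can be illustrated with a small simulation. The sketch below is built entirely on assumptions: it invents a data set in which a simple rule (hot days go with high sales) holds for about 90% of records, then measures how often the rule holds in the full data set and in a random sample of 1,000 records.

```python
import random


def make_records(n, rng):
    """Generate hypothetical records where the rule holds in roughly 90% of cases."""
    records = []
    for _ in range(n):
        hot = rng.random() < 0.5
        follows_rule = rng.random() < 0.9  # about 10% of records are noise
        high_sales = hot if follows_rule else not hot
        records.append((hot, high_sales))
    return records


def rule_accuracy(records):
    """Fraction of records where 'hot day' and 'high sales' agree."""
    return sum(1 for hot, high in records if hot == high) / len(records)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
full = make_records(200_000, rng)
sample = rng.sample(full, 1_000)  # simple random sample, so it stays representative

full_acc = rule_accuracy(full)
sample_acc = rule_accuracy(sample)
print(f"full data set: {full_acc:.3f}   sample of 1,000: {sample_acc:.3f}")
```

Because the sample is drawn at random, its estimate should land close to the full-data figure. A sample that was not representative, for example one taken only from winter months, would not carry the same guarantee.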
Is expertise on the data available? In many cases you will be working on your own data and will therefore be highly familiar with its content and meaning. However, if you are working on data for another department of your organization or for a client, it is highly desirable that you have access to experts who know the data. They can guide you in identifying relevant attributes and can help to interpret the results of data mining, distinguishing the true nuggets of information from fool’s gold or anomalies in the data.
As with most business endeavors, data mining is much more effective if done in a planned, systematic way. Even with cutting-edge data mining tools, the majority of the work in data mining requires a knowledgeable business analyst to keep the process on track. To guide your planning, answer the following questions.
What substantive problem do you want to solve? What data sources are available, and what parts of the data are relevant to the current problem? What kind of preprocessing and data cleaning do you need to do before you start mining the data? What data mining techniques will you use? How will you evaluate the results of the data mining analysis? And how will you get the most out of the information you obtain from data mining?
The typical data mining process can become complicated very quickly. There is a lot to keep track of: complex business problems, multiple data sources, varying data quality across data sources, an array of data mining techniques, different ways of measuring data mining success, and so on. To stay on track, it helps to have an explicitly defined process model for data mining. The process model guides you through the critical issues outlined above and makes sure the important points are addressed. It serves as a data mining road map so that you won’t lose your way as you dig into the complexities of your data.
The data mining process model recommended for use is the Cross-Industry Standard Process for Data Mining (CRISP-DM). As you can tell from the name, this model is designed as a general model that can be applied to a wide variety of industries and business problems.
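One way to keep such a road map in front of you is to write it down as a checklist. The sketch below lists the six CRISP-DM phases and pairs each with one of the planning questions raised earlier. The pairing is one reading of how the questions line up with the phases, not something the model itself prescribes, and the `open_questions` helper and sample answer are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed pairing of the planning questions with the phases.
PLANNING_QUESTIONS = {
    Phase.BUSINESS_UNDERSTANDING: "What substantive problem do you want to solve?",
    Phase.DATA_UNDERSTANDING: "What data sources are available, and which parts are relevant?",
    Phase.DATA_PREPARATION: "What preprocessing and data cleaning is needed?",
    Phase.MODELING: "What data mining techniques will you use?",
    Phase.EVALUATION: "How will you evaluate the results?",
    Phase.DEPLOYMENT: "How will you get the most out of the information obtained?",
}


def open_questions(answers):
    """Return the phases, in order, whose planning question has no answer yet."""
    return [phase for phase in Phase if not answers.get(phase)]


# Hypothetical project state: only the first question has been answered so far.
answers = {Phase.BUSINESS_UNDERSTANDING: "Predict weekly ice cream sales per outlet."}

for phase in open_questions(answers):
    print(phase.name, "->", PLANNING_QUESTIONS[phase])
```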