Data Mining – A Brief Description

Data mining is the process of analyzing large data sets to discover useful patterns, links, relationships, and correlations within the data. The extracted patterns are summarized and used for specialized tasks. Data mining is also known as knowledge discovery in databases (KDD). Computational statistics, machine learning, and artificial intelligence algorithms form the core of a data mining process, and a variety of techniques and tools are used to mine large, unorganized data sets. Verifying the results of a data mining process is crucial to guard against unintended, or even deliberate, manipulation of the data. The primary users of data mining are organizations in sectors with a strong customer focus, such as retail, finance, communications, FMCG, healthcare, and marketing.
A Typical Data Mining Process
A data mining process will typically involve the following tasks, although the exact steps may vary depending on the data miner's requirements and the task at hand.
Data mining techniques
Certain standards in the data mining field are followed by data miners when applying a process to a task or creating predictive models. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the standard most data miners use to create models that suit their needs. Another standard used to create data mining algorithms and models is Sample, Explore, Modify, Model, and Assess (SEMMA).
When data miners create models focused on predictive analysis, they usually follow the Predictive Model Markup Language (PMML) standard. PMML is especially common in the business analytics sector, where the large and complex nature of consumption data demands predictive analyses of both high volume and high quality.
The common data mining algorithms used today fall into two major divisions: classical techniques and next-generation techniques.
1) The classical techniques predate the digital or computer age. They include statistics (counting and probability), clustering, nearest-neighbor methods, regression analysis for prediction, and histograms for summarizing data.
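As an illustration of one of these classical techniques, the sketch below implements nearest-neighbor classification from scratch in plain Python. The toy data set, labels, and the `knn_predict` helper are invented here for illustration only:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    dists = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data set: two well-separated clusters of 2-D points.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_predict(train, (2, 2)))  # "A" — nearest to the first cluster
print(knn_predict(train, (9, 9)))  # "B" — nearest to the second cluster
```

In practice a library implementation with an efficient index (e.g. a k-d tree) would be used instead of this brute-force scan, but the voting logic is the same.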
2) The next-generation techniques almost all involve a component of computer programming in building models or algorithms for data mining. They generally either discover new information within large databases or build predictive models. They are distinguished from the classical techniques in that they have mostly been developed in the past two decades, and they are the techniques the news media usually refers to when data mining is mentioned. Decision trees and neural networks are the most frequently mentioned and used next-generation techniques.
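To make the decision-tree idea concrete, the sketch below learns a single split point (a one-level tree, often called a decision stump) by minimizing Gini impurity. The income figures, labels, and function names are invented for illustration; real decision-tree learners apply this splitting step recursively across many features:

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return the threshold on one feature that minimizes weighted Gini."""
    best_score, best_t = float("inf"), None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_score, best_t = score, t
    return best_t

# Toy loan data: low incomes labelled "deny", high incomes "approve".
incomes = [20, 25, 30, 60, 70, 80]
labels = ["deny", "deny", "deny", "approve", "approve", "approve"]
print(best_split(incomes, labels))  # 30 — this split separates the classes perfectly
```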