The importance of following methodological steps cannot be overstated. The first step, defining the project scope and project requirements, is the most crucial, since everything that comes in subsequent steps is determined by it. It is important that everyone is clear about what is to be done and that there are no perception gaps or misunderstandings. The output of the first step is the input to the second step and so on, so any mistake at the beginning can have a huge cumulative effect at the end. And there is nothing worse than going through the whole analytical life-cycle iteration, producing good results, and presenting them to the business just to be told “sorry, but that is not what we wanted!” Sometimes what is formulated as the business need and the real business need are not the same. In other words, what is expressed by the business could be only one facet, or a mere effect, of a deeper and more complex business problem. An integral part of a clear definition of the business problem and project objectives is also a clear understanding of what the proposed solution should look like.
It should be clear that the analytical project methodology is not an iterative but rather a one-directional process. Each time the project moves a step back, time and money are wasted. There are many “potholes” on the road, and the purpose of following this methodology is to help navigate the team along a well-traveled and safe route from the point of establishing the business need to the point of building a business application designed to address that need.
The following steps are necessary in running analytical projects:
- Business problem definition
- Examining environment
- Gather and prepare data
- Analyze the data
- Implement the results
- Monitor and measure
Business problem definition
One of the first things to do when scoping business requirements is to research the current business environment in order to better understand business processes, business operations, and terminology, as well as the organizational structure. Only then does it make sense to engage in discussions, brainstorming sessions, or workshops to better understand the business problems, examine the current business situation and its deficiencies, and focus on scoping the project in such a way as to address these deficiencies, so that the new solution is measurably better than the old one.
The purpose of scoping an analytical project is not only to establish business requirements but also to establish a vision of the proposed solution and develop the business case. As the problem becomes clearer, it also becomes clearer what type of data is needed to address it. The next step is to identify potential data sources relevant to the business problem, which can be an existing data warehouse or data mart, operational system data, or external data. An integral part of formulating the analytical/data mining problem is to examine the structure and accessibility of the data and to see whether the data meet the minimum requirements in terms of quantity and quality. If the data do not fit the proposed business problem, for whatever reason, the only way forward is to reformulate the objectives. It is also important to establish the project’s success criteria, and that can only be done if one can measure and quantify the current business problem. This implies that problems that cannot be measured are not good candidates for analytical/data mining projects.
An estimate of the potential business value will also need to be made: the estimated cost, based on assumptions about project duration and cost of implementation, is subtracted from the projected value that the project is likely to generate. This step is completed with a signed-off project plan that clearly stipulates scope, objectives, success criteria, risks, data, exclusions, contingencies, assumptions, accepted approach and methodology, fees, and anything else that may be deemed important to the successful conclusion of the project. In a nutshell, with the project plan in place we are fairly certain that we have all the necessary ingredients of success, and we can therefore continue deeper into the project implementation.
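As a rough illustration of the value estimate described above, the following sketch subtracts an assumed project cost from a projected value. The function name, the cost model (a monthly cost multiplied by duration, plus a one-off cost), and all figures are hypothetical and only show the arithmetic.

```python
def estimate_net_value(projected_value, monthly_cost, duration_months, one_off_cost=0):
    """Return the projected value minus the estimated cost of the project.

    The cost model (monthly cost x duration + one-off cost) is an assumption
    made for this sketch, not something the methodology prescribes.
    """
    estimated_cost = monthly_cost * duration_months + one_off_cost
    return projected_value - estimated_cost


# Hypothetical figures, for illustration only.
net_value = estimate_net_value(
    projected_value=500_000,
    monthly_cost=40_000,
    duration_months=6,
    one_off_cost=60_000,
)
print(net_value)  # 500000 - (40000 * 6 + 60000) = 200000
```

A negative result under realistic assumptions would be a signal to revisit the scope before the project plan is signed off.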
Examining environment
Some of the work done in the first step of the analytical life-cycle, such as evaluating data readiness, will be repeated here, just at a deeper level and in more detail. We need to determine more closely whether the data sources are suitable to support the data mining activities. We also need to look at potential performance issues given the available hardware and data sizes, consider the availability of historical information and how far back and how deep we need to go, and examine the update frequency of the data, the difficulty of extracting it, and so on. Apart from evaluating data readiness, we also need to evaluate the top-down organizational readiness to participate. Has the organization done similar projects before and, if so, what were the obstacles and challenges? Will top management provide all the necessary support? What training and knowledge-transfer programs will be necessary for continued project support once it reaches the production phase? Furthermore, an evaluation of IT platforms and infrastructure, as well as networks and their capacity, may be needed to see whether they are suitable for hosting a data mining environment. And lastly, what software configurations and hardware requirements would be appropriate to carry the data mining load in the development, testing, and production environments? These are all important questions to answer before moving forward in project execution.
Within the first step, we emphasized the importance of not only knowing the business questions but also having a vision of the solution. That vision becomes clearer when we start planning the implementation architecture that will integrate analytical results into the production environment. Planning this early helps to make sure that the right resources are available when the analytical life-cycle reaches the implementation step. Additional considerations for the choice of implementation platform include a vision of how the scoring results will be used, where, and by whom. Performance and capacity must also be considered, as well as the anticipated shelf-life of the model and the frequency of data updates. Lastly, and most importantly, the needs of the different business units must be considered. How will the model score new data? Will it be in real time, in a batch environment, or on request, such as for marketing campaigns?
Data access, collection and preparation
Data access and collection can be pretty straightforward if the data reside in a data warehouse; however, if the data are scattered across various files, databases, and operational systems, it can be a tedious process. If the data reside on an operational system, they will have to be extracted to separate storage so that the data mining does not interfere with normal operations. The data also need to be evaluated for completeness, redundancy, missing values, and duplication, and sometimes a reduction of complexity may be necessary, such as rounding values to fewer decimal places. The next step is to prepare the data, and this phase can be as much art as it is science. It is indeed the most laborious part of the data mining process and can often take anywhere from 60 to 80% of the overall effort. However, once the data-preparation script has been produced, building the next model on the same type of data can be far quicker, since all one needs to do is run the script on the raw data, and data loading and preparation are done automatically (usually taking from a couple of minutes to several hours, or overnight).
It has been said that the objective of data preparation is to bring the data into the best possible shape for modeling, where the natural order of the data is least disturbed and most enhanced to help solve or reduce the business problem. This phase is not just about preparing the data but also about preparing the modeler, so that the “right” type of model can be built “right”. The “right” type of model means that the model has to be in line with the scope and objectives. To build the model “right” means that the modeler should follow the basic principles of knowledge induction, ensuring that the model is built on general patterns in the data and not on noise and idiosyncrasies. A model built in this way is likely to perform well when deployed on new data. It also means following specific modeling requirements, which can vary depending on which technique is used. For example, when a regression model is used, it is important that the modeler ensures that certain statistical assumptions are satisfied before or during modeling.
Data preparation can be done in multiple passes. Usually, a good starting point is to see what fields are available and then use domain expertise to do a first round of variable selection. An initial data examination can quickly sort all the fields into three categories. In the first category we group all the variables that are even remotely relevant to the business problem and that can be used in their current format. In the second category we put all the variables that are also relevant to the problem at hand and can be used, but NOT in their current format. In other words, something needs to be done with these variables to increase their information content: they need to be rolled up, derived, remapped, consolidated, binned, and so on. In the third category we put all the variables that should NOT be used in the model, either because their integrity cannot be trusted or because they are simply not relevant to the business question.
The next phase is to focus on the chosen set of variables and start inspecting data values to make sure that data quality is adequate for subsequent analysis. One also needs to check for variable redundancy, where two variables have different names but the same meaning (for example “sex” and “gender”). This usually happens when data feeds are merged from several departmental databases. The opposite can also happen, where two variables with the same name mean different things or have different construction logic behind them. Then we have the common problem of missing values. Part of the problem is to determine what caused the values to be missing. Does missing mean that one of the allowable values or categories was simply not captured, or does it mean that in this instance a value cannot exist and is therefore missing? For example, if you merge individual and corporate clients (companies), individual clients will have a value in the age field, but for a corporation the age will be missing, since companies do not have a date of birth.
The other part of the problem is to figure out how missing values were filled in, since some database administration rules are programmed to replace any missing value with a default such as “999”. If that is done on the “age” field, the average customer’s age may come out at around 400 years! These are all basic cleaning and preparation tasks that lead to more advanced tasks such as data derivations, mathematical transformations, etc. Sometimes it is important to create higher-level data aggregations (daily, weekly, monthly, and quarterly) and then create ratio and percentage variables that express temporal changes in the data. For example, when building “churn” models in the telecommunications industry, variables such as the ratio of last month’s usage to the average monthly usage for the quarter are vitally important, since they show how usage in the last month compares with average usage in a previous period, which can be indicative of looming churn.
It has been stated that data mining is a mix of art and science. Artistic flair is expressed in data preparation through innovative and creative ways of getting the “data to talk”. However, a word of caution: too much unnecessary data gymnastics can have the opposite effect, obscuring the information hidden in the data and making the extraction of general patterns more difficult.
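The two preparation tasks just described can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names and sample values are hypothetical, and the placeholder value 999 is assumed to be known in advance.

```python
from statistics import mean

SENTINEL_AGE = 999  # assumed placeholder the source system writes for "missing"


def clean_ages(ages, sentinel=SENTINEL_AGE):
    """Replace the sentinel with None so it cannot distort summary statistics."""
    return [None if age == sentinel else age for age in ages]


def usage_ratio(monthly_usage):
    """Ratio of the last month's usage to the average monthly usage of the quarter.

    `monthly_usage` holds the three months of the quarter, oldest first.
    Returns None when the quarterly average is zero (ratio undefined).
    """
    quarter_average = mean(monthly_usage)
    if quarter_average == 0:
        return None
    return monthly_usage[-1] / quarter_average


# Hypothetical ages: two of five records carry the placeholder.
raw_ages = [25, 38, 999, 41, 999]
known_ages = [age for age in clean_ages(raw_ages) if age is not None]
print(mean(raw_ages))    # inflated by the placeholder values
print(mean(known_ages))  # average over the ages that were actually captured

# Hypothetical minutes of use for one customer over a quarter.
print(usage_ratio([300, 280, 140]))  # a value well below 1 means usage is dropping
```

Whether the cleaned-out values should then be imputed, flagged, or left missing depends on why they were missing in the first place, as discussed above.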
Analyze the data
After grueling and laborious data preparation, the data should be ready for further analysis. Ideally, the number of variables is reduced to no more than a couple of dozen well-crafted variables that express basic customer characteristics such as demographics, product affinity, and behavior, as well as recent changes in behavior. The actual analysis is a relatively quick step, even though it is the most complex one, for the simple reason that all the work is done by sophisticated algorithms. This presumes that the data miner has at his disposal an analytical workbench containing all the tools that a modeler may need. It also assumes that the modeler is familiar with model-building methodology, with the different analytical disciplines, and with some analytical methods.
Analytical disciplines differ fundamentally in how they handle the basic time dimensions. There are disciplines that engage the past in order to tell you what happened. They offer a retrospective, “rear-view mirror” perspective that shows “how many products we sold last year overall, or in a particular area through a particular sales channel”. These analytical questions are answered by standard reporting, ad-hoc reporting, query drill-downs, and OLAP technologies. So we look back into the past to find out what has happened and use this knowledge implicitly to devise strategies that exploit these findings and generate future benefits.
Then we move to the present dimension by adding alerts and other types of monitoring systems that can inform us before things start going wrong, so that we can do something about it. This is often used in production systems for process and quality control, and can also be linked to specific important client actions so that we can react to them in a timely manner.
And then there is the future dimension. Here, statistical analysis helps us decide whether we should accept or reject a specific hypothesis, and whether differences between certain segments of the observed population are statistically significant. Maybe we have a hypothesis that customers with a high income are less likely to switch to a competitor, or that the male and single segment of our customer base is less likely to repay a loan, etc. The strength and plausibility of these hypotheses can be examined with basic hypothesis-based statistical analysis. As one can assume, the main limitation of this approach is that we need to have some hypothesis and are conditioned by it. In other words, we are not going to find what we are not looking for. And so we move forward toward more advanced analytics and data mining, where we no longer need to have a hypothesis. Here, after grueling data preparation, we submit the data to an algorithmic method that mathematically builds a decision surface and discovers the hypotheses hidden in the data. By focusing on potentially actionable, useful, and non-trivial data patterns related to our business problem, we may discover valuable nuggets in the data and unearth great business benefits that are directly attributable to actionable analytics.
Advanced analytics and data mining allow us to answer questions such as what will happen in the future and what the characteristics of that event are, based on similar instances that occurred in the past. For example, we can predict who will respond to our direct marketing promotion, who will submit an insurance claim and what the size of that claim will be, etc. Here we see a basic task of data mining, classification, defined as allocating objects to one or more predefined classes (responder vs. non-responder, credit default vs. non-default, etc.). However, if we are trying to predict the amount that a responder will spend during our promotion, this becomes a prediction task. Prediction can be defined as predicting an unknown numerical outcome. So, strictly speaking, predicting whether an insurance claim will be submitted is a classification task, while predicting the claim amount is a prediction task. If we want to predict numerical outcomes based on seasonality and trend using time series data, then we use a type of predictive modeling known as forecasting. If we just want to segment our customer base on demographic or perhaps behavioral characteristics, then we can use clustering or segmentation methods. And if the goal is to optimize certain decisions given specific business constraints, where there are many potential options and scenarios and only a few are optimal, then we need optimization methods, which are in the domain of Operational Research. So different types of business questions call for different analytical technologies, but they all have something in common: they are all led by a business question, and they all require historical data.
Analytical methodology process
It is important to make a clear distinction between the analytical project methodology and the process methodology, which is part of it. Process methodology outlines the steps from a purely analytical perspective, after the business objectives are known, the data collected, and at least some major data preparation tasks, such as data roll-up, completed. The main input in this phase is an analytical table with some input variables and, most likely, a target variable that formulates the business problem. So, if we are to create a mathematical model that assigns a probability of response to a specific type of marketing campaign, the target variable would simply be a binary variable with two values indicating response or non-response to a previous similar marketing campaign. The rest of the input variables would be characteristics of the customers who either responded or did not. Additional input variables could be demographic variables, purchasing behavior variables, product affinity information, call-centre information, even social media data, third-party credit information, and more.
The next step is to explore the data: generate basic statistical measures (mean, mode, standard deviation, and percentage of missing values) and examine univariate and multivariate distributions. Data exploration is important since it brings the modeler closer to the internal structure of the data, so that better models can be built more quickly. The exploration step may lead to further modification of the data, such as replacing values and reducing data cardinality, as well as mathematical transformations, which often improve model fit. Additional data derivations can be performed before we move to the step of pure data mining. At this point the data are prepared and ready for analysis, and in this step predictive or descriptive models are built. This step often involves building multiple challenger models, either from the same family of methods (the same method with different algorithmic settings) or across different methods such as regression, decision trees, and neural networks.
It is important to know these different techniques because sometimes a specific technique is just not appropriate. There are many ways these techniques can be classified, and one classification is based on the structure of the algorithms. We have rule-based methods (decision trees, association, rule-induction engines), which use Boolean if-then conditional statements to express the relationship between variables. Then we have distance-based methods (clustering, memory-based reasoning), which group similar objects and entities based on some measure of distance between them. Lastly, we have equational methods, such as regression and neural networks, which use mathematical equations to express multivariate relationships. So, if one of the objectives is that the model be transparent and easy to understand, then we can only use methods that are transparent (regression-based scorecards or decision trees, but certainly not neural networks).
After the various candidate (challenger) models are built, they need to be evaluated and assessed before the “champion” model is chosen. That can be done using a variety of statistical metrics and graphical charts. If all models fail to meet some qualitative performance criteria, the modeler has the option of doing more data preparation or spending more time on finding a champion model by changing algorithmic settings and parameters, or simply by bringing into the picture techniques that were omitted before. The key point to remember is that this is an iterative process of exploring, modifying, modeling, and assessing, in contrast to the project methodology steps, which are not iterative. If the model is already built and ready to go into production and someone realizes that some potentially very useful variables were left out, one needs to scrap the current model, include the forgotten variables, and build the model again. So any time the project moves back a step, time and resources are wasted.
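A first exploration pass of the kind described here might look like the sketch below. It uses only Python's standard `statistics` module; the table, the variable names, and the use of `None` to mark a missing value are assumptions made for illustration.

```python
from statistics import mean, mode, stdev


def summarize(values):
    """Basic profile of one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present),
        "mode": mode(present),
        "stdev": stdev(present),  # sample standard deviation
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
    }


# Hypothetical analytical table, stored as one list per input variable.
table = {
    "age": [34, 45, None, 34, 52, 29, None, 34],
    "monthly_spend": [120.0, 80.5, 95.0, None, 80.5, 60.0, 80.5, 101.0],
}

for name, column in table.items():
    print(name, summarize(column))
```

A high percentage of missing values or an implausible mean in such a summary is usually the cue to return to the cleaning tasks before any model is built.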
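The champion selection step can be sketched as follows. This is a simplified illustration under stated assumptions: accuracy on a holdout set stands in for the wider range of metrics and charts mentioned above, the two "models" are hand-written rules rather than fitted models, and all names and data are hypothetical.

```python
def accuracy(predict, holdout):
    """Share of holdout records whose predicted class matches the actual class."""
    hits = sum(1 for features, actual in holdout if predict(features) == actual)
    return hits / len(holdout)


def choose_champion(challengers, holdout, min_accuracy):
    """Return (name, accuracy) of the best challenger, or None if none qualifies."""
    scored = {name: accuracy(model, holdout) for name, model in challengers.items()}
    best = max(scored, key=scored.get)
    if scored[best] < min_accuracy:
        return None  # iterate: more data preparation or different settings
    return best, scored[best]


# Hypothetical holdout set: (features, actual response) pairs.
holdout = [
    ({"recent_purchases": 5, "age": 30}, 1),
    ({"recent_purchases": 0, "age": 62}, 0),
    ({"recent_purchases": 3, "age": 45}, 1),
    ({"recent_purchases": 1, "age": 51}, 0),
    ({"recent_purchases": 2, "age": 28}, 1),
]

# Hypothetical challengers: simple rules standing in for fitted models.
challengers = {
    "rule_purchases": lambda f: 1 if f["recent_purchases"] >= 2 else 0,
    "rule_age": lambda f: 1 if f["age"] < 40 else 0,
}

print(choose_champion(challengers, holdout, min_accuracy=0.7))
```

Returning `None` when no challenger clears the bar mirrors the iterative loop described above: the modeler goes back to exploring and modifying rather than promoting a weak model.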
Implement the model in production
This is the phase in which the business problem starts to get answered, through implementation of the model on a new set of data. If the business requirement was to produce a model that assigns a probability of response to a new list of customers, implementing the model is as simple as assigning a probability to each customer, selecting those above some cut-off probability, sending them the marketing message, and ignoring the rest. Another type of implementation is when an insurance broker captures the information of a prospect for a new policy and then presses a button that initiates a run of a mathematical model, which produces the probability that this customer will lapse the policy within some period of time. If the probability of lapse is above some threshold, the broker may politely excuse himself and look for other business with a lower chance of lapse. Or imagine a banking online transaction processing system that runs some 200 000 transactions per minute, red-flagging any transaction that looks potentially fraudulent. Or a credit lender who, on the basis of customer-given information as well as third-party information, calculates a credit rating score and then decides to grant or refuse the loan depending, again, on whether the score is below or above some specified threshold.
These are some examples of scoring, where the main output is a probability that something will or will not happen. This often requires that the model be implemented directly in an operational system or the warehouse environment. The choice of implementation platform must be made on the basis of how the business wants the specific problem or challenge to be solved, performance and capacity considerations, the anticipated use of the scoring results, the frequency of updates to the model’s data sources, and the needs of the business units. Sometimes modeling results can be implemented in real time, or on request by marketing campaigns or the sales force. Another vital consideration is the finite life span of the model. As products, services, or customer groups change, so may the usefulness of the model that has been developed. That may require the model to be retrained, potentially including additional data fields. Customers’ attributes in the data may also change, such as their credit-worthiness or financial and personal data such as marital or job status. There may also be regulatory changes, changes in the competitive environment, or a focus on new types of products and services, all of which may reduce the shelf-life of the model and even make it obsolete. Even without major changes in the dynamics of the modeled population, retraining the existing model on a refreshed set of data may be advisable once or twice a year.
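The cut-off selection in the marketing example can be sketched as below. The customer IDs, probabilities, and the cut-off of 0.5 are hypothetical; in practice the scores would come from the champion model and the cut-off from the business case.

```python
def select_targets(scores, cutoff):
    """Return the IDs whose response probability is above the cut-off."""
    return [customer_id for customer_id, p in scores.items() if p > cutoff]


# Hypothetical model output: customer ID -> probability of response.
scores = {"C001": 0.82, "C002": 0.15, "C003": 0.47, "C004": 0.66}

targets = select_targets(scores, cutoff=0.5)
print(targets)  # customers who would receive the marketing message
```

The lapse, fraud, and credit examples follow the same pattern, with the decision attached to the threshold changing from "send a message" to "decline", "red-flag", or "refuse".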
Measure the modeling performance
Once, during a guest lecture at a university, I was asked whether I had ever been involved in a failed data mining project. Without blinking an eye, my answer was “Yes, I was!” I then explained that on a few occasions my assignment of producing and testing the model was completed before the model was handed to others to implement within business processes. Some time later, I would learn that the model was never implemented and never returned any business value. Even though it was not of my making, it is hard to call such projects, without implementation, a success! So I concluded by saying that you can have the greatest model in place, but if you do not implement it or measure it, it is a failure! By measuring a model's performance we are measuring the changes that result from the model's implementation: its impact, and the improvement or reduction of the business problem. Directly measurable benefits can have an impact on the bottom line: earnings, profits, reduction of losses, saved revenue and expenses, etc. We also need to find out what intangible benefits were produced by the working, in-production model. If we can clearly prove that a specific project has delivered more value than the cost of the problem plus the cost of the project implementation, we are almost certain to get more business sponsorship for similar or new business challenges for which analytical projects can be beneficial.
Usually, when a model is in production, a variety of methods are used to see how well it is performing. Over time, modeling performance degrades, and there comes a point when the model either needs to be retrained with new, fresh data or completely re-modeled. The most common way of ascertaining modeling performance in production is to have a “pilot” population which is scored by the model but not acted upon. What does that mean in practice? If certain insurance policies are deemed to be at risk of lapse within a period of time, some of these policies are not rejected at signing but are allowed to proceed as normal, healthy business. If all the policies that the model predicted to lapse do in fact lapse, then the production model would be 100% accurate on the pilot data. Conversely, if none of them lapse in the period after the model scored them as likely to lapse, then such modeling performance would be worrisome, and most likely any such model in implementation would be stopped immediately. Similarly with response models: if the model classifies a certain number of “piloted” customers as non-responders, and we observe whether they respond to the direct marketing campaign, that tells us whether prediction accuracy is better than, worse than, or on a par with pre-production testing. Based on that, we can decide whether or not to continue with the model. We may also want to report on these “pilot” results, produce stability reports, and see whether any issues with modeling accuracy are worse in specific segments of the population. All of this assumes that the implementation has been done correctly, in the sense that the model is applied to the same population profile on which it was built.
These different types of model-monitoring reports serve to show how modeling accuracy and error are spread over time, demographics, and other population characteristics, so that adequate measures can be taken.
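A per-segment view of pilot results could be produced along the following lines. This is a minimal sketch under stated assumptions: the pilot consists only of policies the model flagged as likely to lapse, the segment labels and outcomes are hypothetical, and the "hit rate" (share of flagged policies that did lapse) stands in for whatever accuracy measure the monitoring report actually uses.

```python
from collections import defaultdict


def pilot_hit_rate(pilot):
    """Share of flagged pilot policies that did lapse, per segment.

    `pilot` is a list of (segment, lapsed) pairs for policies the model flagged
    as likely to lapse but the business let through untouched.
    """
    flagged = defaultdict(int)
    lapsed_count = defaultdict(int)
    for segment, lapsed in pilot:
        flagged[segment] += 1
        if lapsed:
            lapsed_count[segment] += 1
    return {segment: lapsed_count[segment] / flagged[segment] for segment in flagged}


# Hypothetical pilot outcomes after the observation period.
pilot = [
    ("under_30", True), ("under_30", True), ("under_30", False), ("under_30", True),
    ("30_plus", True), ("30_plus", False), ("30_plus", False), ("30_plus", False),
]

for segment, rate in pilot_hit_rate(pilot).items():
    print(segment, rate)
```

Running the same report at regular intervals, and comparing it with pre-production test results, gives the time dimension that the stability reports are meant to capture.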