How to Conduct a Data Mining Project

I have often felt that it is not clear what it takes to run a successful data mining project. Here are some guidelines I like to follow.

To begin with, I have to point out that data mining is not a product; it is about gaining knowledge. We gain knowledge through learning. Therefore, a data mining project is about establishing a learning environment. In this example, I show how I would try to solve a well-known data mining task called churn detection. I would approach it in the following way.

I would start with a five-day Proof-of-Concept (POC) mentoring project. These five days would include:
- 1.5 days of training. I would insist on the training part, because we must have a common understanding of what we are doing.
- 2 days of data preparation and overview. I would assume that the data is already cleansed, and that the customer has assigned employees to the project who know how to load the data into a test SQL Server database. I would also expect that the data already includes customers who left, and that they are flagged. We would need to get a deeper understanding of the data distribution and similar properties, and prepare many additional computed variables. In addition, we would need to prepare different data sets for different time windows (e.g., take six months of previous data, one year …). We cannot know in advance which time window gives the most accurate patterns.
- 1 day of preparing different data mining models and evaluating them. For churn detection, I would, as I said, expect to have a flag. Then we could try to learn patterns with directed models: Decision Trees, Naïve Bayes and Neural Networks. We would need to prepare models for different data sets, and models with different algorithm parameters for each data set. There is no way to tell in advance how many models we should prepare; we should let the available time limit us instead. With directed models, we should find the reasons for churn.
- If time and data permit, I would also like to check additional patterns during these 3 days. I would like to have the customers' data also prepared as a series of events, such as the occasions when we had some contact with a customer. Then I would use Association Rules to analyze which events go together with the churn event, and Sequence Clustering to see whether a specific order of events leads to churn. I need to point out that this is possible only if the employees really know the data very well and can prepare what I need in a short time.
- The last 0.5 days would be spent on presenting the results and defining a roadmap for the future.
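
The data preparation step for different time windows can be illustrated with a minimal sketch. It is only an assumption about how such data sets might be built: the event list, the churn flags, the snapshot date, the window lengths and the two computed variables (`n_events`, `n_complaints`) are all hypothetical, and in a real project this work would more likely be done in SQL against the test database.

```python
from datetime import date, timedelta

# Hypothetical contact events: (customer_id, event_date, event_type).
EVENTS = [
    ("c1", date(2011, 1, 10), "complaint"),
    ("c1", date(2011, 5, 2), "call"),
    ("c1", date(2011, 6, 20), "complaint"),
    ("c2", date(2010, 9, 15), "call"),
    ("c2", date(2011, 6, 1), "purchase"),
]

# Hypothetical churn flags; the text assumes churned customers are already flagged.
CHURNED = {"c1": True, "c2": False}


def build_dataset(events, churned, snapshot, window_days):
    """Return one row per customer with event counts inside the time window."""
    start = snapshot - timedelta(days=window_days)
    rows = {
        cid: {"customer_id": cid, "churned": flag, "n_events": 0, "n_complaints": 0}
        for cid, flag in churned.items()
    }
    for cid, when, kind in events:
        # Count only events that fall inside (start, snapshot].
        if cid in rows and start < when <= snapshot:
            rows[cid]["n_events"] += 1
            if kind == "complaint":
                rows[cid]["n_complaints"] += 1
    return [rows[cid] for cid in sorted(rows)]


# Assumed snapshot date and window lengths; both are illustrative choices.
SNAPSHOT = date(2011, 6, 30)
WINDOWS = {"6_months": 182, "12_months": 365}

# One data set per time window, each ready to be fed to a separate set of models.
datasets = {
    name: build_dataset(EVENTS, CHURNED, SNAPSHOT, days)
    for name, days in WINDOWS.items()
}
```

Each entry in `datasets` holds the same customers described over a different history length, so the same algorithms can be trained on each one and compared.
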
After the POC is accepted, we should move on to a real project. In the real project, we would need to:
- Find a way to measure when the model(s) we selected for deployment become obsolete and need to be refined. I would suggest keeping all efficiency measures in a small control data warehouse, and analyzing them with OLAP cubes or PowerPivot.
- Decide what exactly deployment means. Deployment could be through reports only. It could mean including DMX queries in the customer's OLTP application. It could mean including churn detection results in the customer's existing OLAP cubes, if they have any. It is not possible to tell how much time the deployment would take before we decide what exactly it means. Nevertheless, I would expect the customer to need our help for two to three days in any case.
- No matter how we deploy the model(s), I think no deployment should go without measuring whether the models are still efficient (the first bullet in this part). There is no rule of thumb that tells us in advance when a model will become obsolete. Data mining is constant learning, and we have to establish an environment that enables fast and efficient learning. Establishing it should take us about 2-3 days.
- Of course, for the real project, we would go through the data and model preparation again. We should invest 2-3 more days in this part.
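
Measuring whether a deployed model is still efficient can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended rule: accuracy is only one possible efficiency measure, the monthly scoring results are made up, and the `baseline` and `tolerance` values are arbitrary placeholders that would have to be agreed on for each project, since there is no rule of thumb for them.

```python
def accuracy(predicted, actual):
    """Share of customers whose churn flag was predicted correctly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def periods_needing_refresh(history, baseline, tolerance):
    """Return periods where accuracy fell more than `tolerance` below `baseline`."""
    return [period for period, acc in history if acc < baseline - tolerance]


# Hypothetical monthly scoring results: (period, predicted flags, actual flags).
scored = [
    ("2011-07", [1, 0, 0, 1], [1, 0, 0, 1]),
    ("2011-08", [1, 0, 0, 1], [1, 0, 1, 1]),
    ("2011-09", [1, 0, 0, 1], [0, 0, 1, 0]),
]

# The history is what would be stored in the small control data warehouse.
history = [(period, accuracy(pred, act)) for period, pred, act in scored]

# Assumed baseline (accuracy at deployment) and tolerated drop; both are placeholders.
stale = periods_needing_refresh(history, baseline=0.9, tolerance=0.2)
```

Storing `history` per model and per period is the kind of data that could later be analyzed with OLAP cubes or PowerPivot; `stale` simply lists the periods in which the model would be a candidate for refinement under the assumed limits.
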
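To summarize the time: we would need 5 days for the POC, and about 5 additional days for the real project (2-3 days for the measuring environment plus 2-3 days for repeating the data and model preparation), not counting the time needed for the actual deployment.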
As for the number of people involved, I would definitely like to have at least two of the customer's IT people. They should have strong SQL knowledge and be very familiar with the data.

I hope this post makes it a bit clearer how to conduct a data mining project.

Author: Anonymous, published on the SloDug.si portal (Archive)
