Data Mining

Go beyond Analytics

Nearly All Businesses Fail to Understand the Value of Data Mining

You would be surprised how often we have found simple trends that equated to multi-million profit gains. For example, in a large financial organisation we found one simple factor that equated to an immediate 20% improvement in profit, along with 80 other profitable factors. In a top legal firm, we found 4 factors that could be used to win 80% more cases. Just realising the potential of data mining, then prioritising and funding the activity, puts you in the top echelons of the business elite. Data mining and analytics work really well together, but data mining has the edge for one reason: analytics is limited to the analysis you can imagine, whereas data mining is not. Data mining will tell you every pattern that exists, whether it is useful to you or not. The secret is knowing how to exploit these pattern insights.


Suppose a delivery company discovers that accident rates and delivery times are greatly affected by the driver’s age and by whether the route is inner city or mainly motorway. Would it not adjust which routes each driver is given? If you discovered that managers and team leaders were off sick three times less often than everybody else, would you change your sickness policy or find new ways to give people more responsibility and empowerment? Chances are you would take some form of action, and you would take it in the light of this new information. These may seem like relatively obvious deductions that analytics could provide. However, we used the same principles to data mine gambling data and consistently beat the bookmaker by 37% over the course of a year of simulated bets. This would be impossible for analytics to achieve.


The other side of data mining is accuracy and measurability, with forecasting being the great example. This typically means going beyond simple averages, linear regression and moving averages to more complex weighted moving averages, exponential smoothing and time series analysis to forecast sales, staffing, stock levels and so on. These last methods will often take into account other attributes, clusters and seasonality, and have further tunable inputs, such as the smoothing constant alpha, that can lead to significantly higher accuracy. As the best method varies from case to case, each can be implemented separately and then statistically compared. This is done using measures like mean squared error, mean absolute deviation and tracking signal, so that the best-performing forecast is the one used. If you are not already using these methods, how much better would your decisions be if your estimates were more accurate? How much money would that save or make for your business?
The activity of forecasting often identifies hidden factors affecting your estimates. A great tip is to log competitor sales campaign dates and plot them against your own data. Often, competitors stick to schedules, which once known can be very valuable to your forecasting.
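The comparison of forecasting methods described above can be sketched in a few lines of Python. This is a minimal illustration rather than our production method: the sales figures are invented, the function names are hypothetical, and the window of 3 and alpha of 0.3 are arbitrary assumptions you would normally tune.

```python
from statistics import mean

def moving_average_forecast(history, window=3):
    """Forecast each period as the mean of the `window` periods before it."""
    return [mean(history[i - window:i]) for i in range(window, len(history))]

def exponential_smoothing_forecast(history, alpha=0.3, start=3):
    """Simple exponential smoothing; returns forecasts for periods start..end."""
    level = history[0]  # initial level: the first observation (a simple, common choice)
    forecasts = []
    for i in range(1, len(history)):
        if i >= start:
            forecasts.append(level)  # forecast made before period i is observed
        level = alpha * history[i] + (1 - alpha) * level
    return forecasts

def error_measures(actuals, forecasts):
    """Mean squared error, mean absolute deviation and tracking signal."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    mse = mean(e * e for e in errors)
    mad = mean(abs(e) for e in errors)
    tracking_signal = sum(errors) / mad if mad else 0.0
    return {"mse": mse, "mad": mad, "tracking_signal": tracking_signal}

# Invented monthly sales figures, for illustration only.
sales = [100, 104, 99, 110, 108, 115, 112, 120]
window = 3
candidates = {
    "moving average": moving_average_forecast(sales, window),
    "exponential smoothing": exponential_smoothing_forecast(sales, alpha=0.3, start=window),
}
# Score every method against the same actuals, then keep the lowest MSE.
scores = {name: error_measures(sales[window:], f) for name, f in candidates.items()}
best_method = min(scores, key=lambda name: scores[name]["mse"])
print(best_method, scores[best_method])
```

A tracking signal that drifts far from zero suggests the method is consistently over- or under-forecasting, which is often the first hint of a hidden factor such as a competitor campaign.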

Customer Relationship Management

In customer relationship management, it is critical to know your customer. Knowing your customer makes your marketing more effective, reduces churn, increases spend levels, and ensures you are giving the customer the service they want and expect. It also helps you exceed their expectations and evolve your offering. This is an area where data mining plays a major role and where nearly all the methods and tools play a part.

Understanding your Customers

Usually, the first step in understanding your customers is to classify and cluster them. For a retail company, a simple example would be classifying your customer base by age group, disposable income band and primary interest, and then clustering the combinations that naturally stick together. The resulting clusters are often given memorable names like the “silver surfer”. Using these clusters makes marketing to each group far easier, helps you focus on the right delivery channel strategy, and can greatly increase conversion rates by being more targeted. Whilst this activity is often done manually, it becomes progressively harder as the number of influencing attributes grows. This is when it becomes a good idea to augment your internal data with external data. However, this is not always easy if you do it yourself.
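As a rough sketch of the classification step described above, the Python below bands a handful of invented customers and counts which combinations occur together. The band boundaries, field names and customer records are all assumptions made for illustration. With many more attributes you would normally hand this over to a proper clustering algorithm such as k-means rather than counting combinations by hand.

```python
from collections import Counter

def age_band(age):
    """Classify an age into a coarse band (boundaries are illustrative)."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30-59"
    return "60+"

def income_band(disposable_income):
    """Classify yearly disposable income into a band (boundaries are illustrative)."""
    if disposable_income < 10_000:
        return "low"
    if disposable_income < 30_000:
        return "medium"
    return "high"

def segment(customer):
    """Combine the bands and primary interest into one segment key."""
    return (
        age_band(customer["age"]),
        income_band(customer["disposable_income"]),
        customer["interest"],
    )

# Invented customers, for illustration only.
customers = [
    {"age": 67, "disposable_income": 18_000, "interest": "gardening"},
    {"age": 71, "disposable_income": 22_000, "interest": "gardening"},
    {"age": 24, "disposable_income": 6_000, "interest": "gaming"},
    {"age": 45, "disposable_income": 35_000, "interest": "cycling"},
    {"age": 63, "disposable_income": 12_000, "interest": "gardening"},
]

# Count how many customers fall into each combination; the large groups
# are the candidates for a named cluster.
segment_sizes = Counter(segment(c) for c in customers)
for combination, size in segment_sizes.most_common():
    print(combination, size)
```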

This data sample from the Office for National Statistics shows the kind of data available and the effort required to make it usable. We have taken this source and thousands like it, linked them by postcode and given the fields proper descriptions, making them much easier to understand so you don't have to do the work yourself.

[Image: sample ONS data]

Clustering your Customers

The nice thing about clusters is that whilst you cannot change the customer’s gender or address, you can often influence other attributes and help move a customer to a new cluster of higher value. A great example is Amazon Prime. Once a customer joins Prime there is usually a big shift in the lifetime value, shopping frequency, basket size, churn, ability to upsell, and so on. Knowing the value difference between the two clusters helps you understand the maximum spend available per customer to move them up to a higher value cluster. However, bear in mind it is also effective to move customers down to a less profitable cluster if they look likely to leave.

How do you know they are likely to leave? The answer lies in understanding the data of every customer who has left in the past and then using data mining tools to automatically detect all the influencing factors. Once identified, these factors let you filter for the customers most likely to leave, so you can take proactive action such as getting in touch and offering them a better deal, or a complimentary bottle of wine when things go wrong.
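One very simple way to picture how influencing factors are detected is to compare the churn rate for each value of each attribute. The sketch below does this on invented data with hypothetical field names. It is a deliberately crude stand-in for real data mining tools, which also handle interactions between factors and test for statistical significance.

```python
from collections import defaultdict

def churn_rate_by_value(customers, attribute):
    """Churn rate for every distinct value of one attribute."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for customer in customers:
        value = customer[attribute]
        totals[value] += 1
        leavers[value] += customer["churned"]  # True counts as 1
    return {value: leavers[value] / totals[value] for value in totals}

def rank_factors(customers, attributes):
    """Rank attributes by the gap between their highest and lowest churn rate."""
    spreads = {}
    for attribute in attributes:
        rates = churn_rate_by_value(customers, attribute)
        spreads[attribute] = max(rates.values()) - min(rates.values())
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)

# Invented customer history, for illustration only.
customers = [
    {"complaints": "yes", "contract": "monthly", "churned": True},
    {"complaints": "yes", "contract": "annual", "churned": True},
    {"complaints": "no", "contract": "monthly", "churned": False},
    {"complaints": "no", "contract": "annual", "churned": False},
    {"complaints": "no", "contract": "monthly", "churned": True},
    {"complaints": "no", "contract": "annual", "churned": False},
]

for attribute, spread in rank_factors(customers, ["complaints", "contract"]):
    print(attribute, round(spread, 2), churn_rate_by_value(customers, attribute))
```

Attributes at the top of the ranking are the ones worth filtering your current customers on first.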

Knowing your customers should be just as important as knowing your competitors. With most businesses publishing their prices on the internet, and the public data available on web traffic, it is now easier than ever to profile your competitors and track how they are doing in comparison to your own efforts.

Data Quality

However, some answers you will never find, because you just do not have the data or it is in an unusable state. The unfortunate truth is that, by Gartner's measure, most businesses have poor data quality, with most floating around levels 1 and 2 of its 5-level quality framework. This means your business is most likely missing data and holding a significant amount of incorrect data. Whilst poor data quality dilutes your data analysis, poor data can be addressed with automated data cleansing via ETL (extract, transform and load) or excluded entirely from the models you build. Data mining models can also fill in missing data and identify outliers statistically, but the best solution is to have the right level of governance in place and a solid data strategy.
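To make the idea of statistical cleansing concrete, here is a minimal Python sketch that flags outliers with the common interquartile range rule and fills gaps with the median of the remaining values. The choice of rule, the 1.5 multiplier and the sample ages are assumptions for illustration. A real cleansing pipeline would choose its methods per field.

```python
from statistics import median, quantiles

def find_outliers(values, k=1.5):
    """Flag known values lying more than k * IQR outside the quartiles."""
    known = [v for v in values if v is not None]
    q1, _, q3 = quantiles(known, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in known if v < low or v > high]

def fill_missing(values, exclude=()):
    """Replace None with the median of the known values not in `exclude`."""
    usable = [v for v in values if v is not None and v not in exclude]
    fill = median(usable)
    return [fill if v is None else v for v in values]

# Invented customer ages with two gaps and one obvious keying error.
ages = [34, 29, None, 41, 38, 250, 33, None, 36]

outliers = find_outliers(ages)
cleaned = fill_missing(ages, exclude=outliers)
print("outliers:", outliers)
print("filled:", cleaned)
```

Note that the outlier is only flagged and kept out of the median here. Whether to correct it, drop the record or leave it is a governance decision rather than a statistical one.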

Data Availability

Often, however, you find yourself needing data you just do not have. This is where a little creativity comes into play as you start to look at what data you can get and how likely it is to influence what you are investigating. For example, a business wants to open its next store and needs to pick the best possible location. It is unlikely to already have good sales data on the area unless it sells online to that area in sufficient volume. From a data mining perspective, this is where it is useful to know as much as possible from as many sources as possible. Additional data for the location might include population demographics, disposable income, education levels, number of household vehicles, density of competing businesses, university student populations, average house prices and so on. By linking all these factors to your own data and then letting the data mining tools discover the trends, the decision on location becomes more informed and the risk of a poor investment is reduced. This is because all you now have to do is look for locations that exhibit the same trends.

Another technique is to create derived metrics from the data you have, to predict key supporting factors. Take the example of betting on a greyhound race. If you know that only 5% of dogs who are bumped finish first, or that 38% of the dogs that lead in the first corner go on to win, it soon becomes clear that it is worth knowing the likelihood of these supporting factors. This is where you would create new metrics specifically to predict their likelihood, such as the average trap the dog starts in, to see if collisions are more likely when in the wrong trap, or the ranked weight of each dog to see if the lighter dog has the advantage in the initial sprint.
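The two ideas above, measuring how much a supporting factor matters and deriving a new metric to help predict it, can be sketched as follows. The race records, field names and weights are invented for illustration and are not taken from our case study data.

```python
def win_rate_given(runs, condition):
    """Share of runs that were won among those where `condition` was true."""
    matching = [run for run in runs if run[condition]]
    if not matching:
        return None  # no evidence either way
    return sum(run["won"] for run in matching) / len(matching)

def add_ranked_weight(race):
    """Derived metric: rank each dog by weight within its race (1 = lightest)."""
    ordered = sorted(race, key=lambda dog: dog["weight_kg"])
    for rank, dog in enumerate(ordered, start=1):
        dog["weight_rank"] = rank
    return race

# Invented historic runs, for illustration only.
runs = [
    {"bumped": True, "led_first_corner": False, "won": False},
    {"bumped": False, "led_first_corner": True, "won": True},
    {"bumped": False, "led_first_corner": True, "won": False},
    {"bumped": True, "led_first_corner": False, "won": False},
    {"bumped": False, "led_first_corner": False, "won": True},
]
print("win rate when bumped:", win_rate_given(runs, "bumped"))
print("win rate when leading at first corner:", win_rate_given(runs, "led_first_corner"))

# Invented race card, for illustration only.
race = [
    {"dog": "A", "weight_kg": 32.5},
    {"dog": "B", "weight_kg": 29.8},
    {"dog": "C", "weight_kg": 31.0},
]
for dog in add_ranked_weight(race):
    print(dog["dog"], dog["weight_rank"])
```

The derived `weight_rank` field can then be tested against the first-corner leader in exactly the same way as any other attribute.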
Whilst the list of applications for data mining is inexhaustible, hopefully you get the idea. Data mining brings to the table some very sophisticated mathematical techniques, such as neural networks, as well as some quite simple but clever methods, like decision trees.
Whatever your data mining needs, we would love to help.

Data Mining Case Study

Any kind of commercial data mining case study is hard to find, for the simple reason that no business in its right mind would share its insights with the competition. That is why we produced one of our own. We picked one of the most difficult things to data mine we could think of to share with you: gambling data. However, we did not do this just to prove a point about how powerful data mining can be; we used this data to test our systems and methods so we could make them better. The work took our methods and processes to the max, and way beyond the level of sophistication any normal business would ever need. Our result was to achieve profitability over more than 150,000 simulated bets, both with betting exchanges such as Betfair and with the bookmaker. Now we are sharing the results, and how we did it, with you.

Greyhound Racing

We chose greyhound data as there is a lot of it. They race seven days a week and each meeting can have a lot of races. This adds up to a lot of dogs, tracks and races, which was perfect for our needs.

If you have never tried to improve your chances using a spreadsheet, then you should give it a go, because whilst it will not consistently give you a profit, it will stop you losing as much. Excel does a great job of importing published HTML greyhound stats data. Simply converting each previous race time and distance to metres per second and averaging the results will give you a nice boost over pure luck. This also partially gets over the variations in track length, as dogs typically race at more than one track and distance. However, the real picture of how dogs perform is far more complicated and involves several hundred factors.
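The spreadsheet approach described above amounts to very little arithmetic, shown here in Python for clarity. The dog names, distances and times are invented, and the function names are hypothetical.

```python
from statistics import mean

def speed_mps(distance_m, time_s):
    """Convert one race result to an average speed in metres per second."""
    return distance_m / time_s

def average_speed(past_runs):
    """Mean speed over a dog's previous runs, given (distance_m, time_s) pairs."""
    return mean(speed_mps(distance, time) for distance, time in past_runs)

def rank_by_average_speed(form):
    """Order dog names from fastest to slowest average speed."""
    return sorted(form, key=lambda name: average_speed(form[name]), reverse=True)

# Invented form for two dogs: (distance in metres, time in seconds).
form = {
    "Dog A": [(480, 29.6), (450, 27.9)],
    "Dog B": [(480, 30.4), (500, 31.8)],
}

for name in rank_by_average_speed(form):
    print(name, round(average_speed(form[name]), 2), "m/s")
```

Working in metres per second is what lets runs over 450, 480 and 500 metres be averaged together, although it still ignores bends, going and the other factors mentioned above.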

The results

Organisers go to great lengths to balance out the competing dogs, so the usual winners are the bookmakers unless you visit the track often and study the form of each dog, which takes time. So the real question was: could data mining beat the bookie using data alone?

In the case of greyhound racing the answer is absolutely yes, but it was not easy. Below are the results we achieved for races between June 2013 and September 2014 across 24 tracks using the same data mining model. The predictions were done for each type of bet and were then split into risk categories with Strategy A being low risk and C being higher risk and so on.

Going up against the bookmaker's starting price (SP), a profit of £21,257 was achieved from sixty thousand £1 bets, whilst using a betting exchange generated £52,882 profit from one hundred and fifty-three thousand £1 bets. The exchange did better because you can also place LAY bets, which means betting on the dog to lose and essentially makes you the bookie. Win, forecast and tricast bets, however, can be placed with both.

Our method looks at risk from low (A) to very high (Z), breaks this scale up into groups of similar-sized risk and calls them strategies. The data is then split between using a bookmaker, in green, and using an exchange, in orange. The reason for this is that exchanges typically offer better returns or odds than a bookmaker. The final column represents how many bets must be placed to ensure a profitable outcome.

Breaking down your data in this way allows you to cherry-pick the best opportunities or go for high volume. The same approach works in exactly the same way for marketing campaigns and customer selection, with the two coloured columns representing different channels. In this case, sticking to the lowest three risk levels, A, B and C, ensured a healthy 38% profit at the bookmakers and 37% profit at the exchange.

Greyhound Predictor Results


The above results are based on placing £1 bets or lays, so the profit totals can be easily multiplied by your typical bet stake. Whilst betting on low volume, high return (A) strategies is less risky, betting on the higher volume strategies can greatly reduce your betting tax on exchanges such as Betfair, which gives volume discounts of up to several percent. The exchange profit was estimated by assuming potentially 20% better odds than the bookmaker (or 20% worse in the case of laying), applying a 5% tax to the net winnings for each race event, and then applying that tax proportionally to all betting strategies.
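The exchange estimate described above can be illustrated with a small calculation. This is only one possible reading of it, under loudly stated assumptions: the 20% adjustment is applied to fractional odds (the profit per £1 staked), the 5% tax is charged only when the net result for a race event is positive, and the bets shown are invented. It is not the exact calculation used in the case study.

```python
def back_profit(stake, fractional_odds, won, uplift=0.20):
    """Profit on a backed bet, assuming exchange odds are `uplift` better than SP."""
    if won:
        return stake * fractional_odds * (1 + uplift)
    return -stake

def lay_profit(stake, fractional_odds, dog_won, uplift=0.20):
    """Profit on a lay: keep the backer's stake if the dog loses, otherwise
    pay out at odds assumed to be `uplift` higher, which is worse for the layer."""
    if dog_won:
        return -stake * fractional_odds * (1 + uplift)
    return stake

def race_net(profits, tax=0.05):
    """Net result for one race event; tax is taken from net winnings only."""
    net = sum(profits)
    return net * (1 - tax) if net > 0 else net

# Invented race event: one winning back bet at 3/1, one losing back bet,
# and one successful lay (the laid dog did not win).
profits = [
    back_profit(1.0, 3.0, won=True),
    back_profit(1.0, 5.0, won=False),
    lay_profit(1.0, 4.0, dog_won=False),
]
print(round(race_net(profits), 2))
```

The `lay_profit` function also shows why laying needs more funds: a £1 lay at 4/1 risks far more than £1 if the dog wins.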

Bet Types

1-2 Forecast bets
Betting on two dogs to finish in first and second position in a specific order has a higher profit than a reverse forecast version of the bet.

Win Bets
This is the most common bet: for a single dog to come home first. Starting prices were not used in the data mining model to predict any of the bet types, so bets A) could be placed well in advance of the odds being available and B) did not start off with any bias towards the bookmaker odds. However, a valid odds range was used in the case study to identify dogs whose odds had been lowered or raised significantly, for example because of lost fitness or some other issue known to the bookmakers that did not show in the data.

1-2-3 Tricast bets
Betting on three dogs to finish first, second and third in a specific order is much harder to predict, but comes with a much greater reward, and hence this bet has the greatest profit percentage in this model. Given the high profit margin, a reverse tricast is not a bad bet, although it was not included in the model.

Lay bets
Betting on the dog not to win is a betting-exchange-only bet and is the most consistent profit strategy over time. However, significantly more funds are needed to cover your bets compared with other bet types, as you are in effect being the bookmaker and need to cover the potential pay-out should the dog win.

Each way bets were excluded due to being unprofitable.

Gambling is not something we endorse or support. It was, however, a perfect subject area to demonstrate the effectiveness of data mining with one of the most difficult data sets possible.