By: Chris Vandersluis
This paper was originally presented at the PMI Global Congress in New Orleans, LA in October 2013.

Abstract

Applying statistical data mining techniques to project data is relatively uncommon and surprisingly so given the remarkable value that such techniques can reveal. These techniques can be complex regression analysis or can be the simplest of grouping and graphical representations of data.

When we apply such techniques to project tracking data we have a very good chance of realizing significant benefits. Project tracking data is ideal for data-mining because the data is already structured, often has an approval mechanism and is even tracked for completeness. Mining the timesheet, schedule progress and financial information can show project managers how to get more project work done more effectively.

This paper is organized into the following sections:

• What is Data Mining?

• All we need is a nugget!

• What are we mining for?

• Where are we mining for it?

• What is the mining process?

• Real-world data mining examples

• Where can you pick up your prospector’s license?

Overview

The concept of data-mining isn’t new. The statistical constructs such as regression analysis that make up modern data mining techniques go back to the early 1800s. Another cornerstone of data-mining, visual data representation, is not new to project management at all. Henry Gantt’s bar charts are an example that has already celebrated its 100th anniversary. Yet applying data mining techniques to project data is relatively uncommon, and surprisingly so given the remarkable value that such techniques can reveal.

Project Management data is a poster child for data-mining particularly when we look at project tracking data. Information such as schedule progress, timesheets, financial costing and quality deliverables have a number of elements that are critical to Data mining or Knowledge Discovery in Databases (KDD): First, such data typically has a formal sign-off or approval aspect to it so the quality of the data is very high. Not only is the data high quality but often it must pass a completeness test so we know that the data has all been collected for that period. Finally, this type of data has a formal structure to it. If the data is contained in a centralized enterprise system, all the better.

Over the coming pages we’ll cover an overview of data mining techniques, what kind of data returns the best chances for value discovery and the techniques and tools that project managers can apply to their own project data.

What is Data Mining?

Data mining is a term which refers to the practice of searching through large volumes of data to find patterns or trends that may reveal information about the subject we are researching. The practice has been popularized by modern-day web site analysis at services such as Google, Facebook and Microsoft, which have collected huge volumes of data. Data mining is also referred to as Knowledge Discovery and is sometimes associated with the term “Big Data,” which refers to large data volumes.

Data mining is often done with the use of computer based tools and statistical techniques. We’ll often see computer programs, systems and services referenced when talking about the practice. When we talk about large volumes of data, we can be talking about millions or billions of records and the use of automated information analytics is key to successfully sifting through that data looking for value.

In the context of project management, the volume of data is less significant than its quality. Project management data is ideally suited for data mining as much of it is already formalized in its structure. Whether we are looking at scheduling, estimating, tracking, timesheets, cost control, risk registers, or quality reports, the data is already date-oriented and associated with work, people, or both. That’s perfect for our purposes.

Let’s take an example. If we made a hypothesis that when project overhead is high and the schedule is late, we are more likely to encounter excessive scope changes, then we might represent that pattern like this:

IF high {OVERHEAD, DELAY} THEN high SCOPE CHANGE

Here a rule will look for correlations of high overhead and late projects and see if there is a direct correlation with scope changes.
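As a minimal sketch, such a rule can be expressed in a few lines of Python; the field names, thresholds and records below are invented for illustration, not taken from any particular tool:

```python
# Sketch: flag projects matching the pattern
# IF high {OVERHEAD, DELAY} THEN expect high SCOPE CHANGE.
# All field names, thresholds and values are illustrative assumptions.

projects = [
    {"name": "A", "overhead_pct": 22, "delay_days": 45, "scope_changes": 14},
    {"name": "B", "overhead_pct": 8,  "delay_days": 3,  "scope_changes": 2},
    {"name": "C", "overhead_pct": 25, "delay_days": 60, "scope_changes": 11},
]

def matches_rule(p, overhead_min=15, delay_min=30):
    """True when both antecedents of the rule are 'high'."""
    return p["overhead_pct"] >= overhead_min and p["delay_days"] >= delay_min

flagged = [p for p in projects if matches_rule(p)]

# Check whether the consequent holds: do flagged projects really
# show more scope changes than the rest?
avg_flagged = sum(p["scope_changes"] for p in flagged) / len(flagged)
avg_rest = sum(p["scope_changes"] for p in projects if p not in flagged) / (
    len(projects) - len(flagged))
print(avg_flagged > avg_rest)
```

A real analysis would, of course, run this over hundreds of project records and test whether the difference between the two groups is statistically meaningful.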

Data which is harder to mine is unstructured data. An example might be a Charter document or a Business Review narrative done in a word processor. Yet that data too can be organized in such a way that it can be useful for data mining.

All we need is a nugget!

In mining it is common to have to sift through a lot of raw material in order to extract the tiniest grain of gold. On a profitable mine, you might dig up 2,000 lbs (about 900 kg) of raw “paydirt” and, once you had finished processing it, you’d have between 1/16 and 1/8 of an ounce (roughly 2 to 4 grams) of gold. For a small gold mine, this could be a workable, profitable mine.

Similarly with data mining, we expect to go through a lot of data to find that tiny nugget of value. Fortunately, the modern world has a fetish for storing data. And it adds up in a hurry. A 400-page novel is about 1,400,000 characters or “bytes.” One estimate of the total volume of data stored on the Internet is approximately 1 zettabyte. What’s a zettabyte, you ask? It’s 1 sextillion bytes, or 1,000 exabytes, or 1 billion terabytes. As a number, that’s 1,000,000,000,000,000,000,000. It’s a lot. No one searches through the entire Internet, of course, and we’re looking for data within a very narrow context.

When we think of data that may be of interest to us in mining project management data, it’s worthwhile to extend our perspective from just our scheduling system to any data source that may be relevant to project management. Having access to the project management tool database is rather obvious, but what about timesheet information, financial information, client satisfaction surveys, or quality databases?

It’s a common situation in data mining that the data wasn’t originally designed to produce the analysis we’re now asking of it. When doing searches across the data of multiple systems, we have to approach our analysis with the understanding that the data wasn’t created to be looked at together.

What are we mining for?

The output of the data mining process is in great part guided by the goals or targets you are measuring for. A prospector does not go to diamond country and hunt for gold. To borrow a popular phrase, we are looking for “a sip from the fire hose.” The outcome of our mining doesn’t need to be more than a single metric to be valuable.

To make a goal most effective, think of articulating it in business terms. “How can we reduce needless overhead?” is better than “What is the variance between overhead hours and productive project hours?” This is because the end result should be something that has value to the organization. Many inexperienced data miners will get caught up in the process of mining. “If we can find it, it must have value,” they say. The result is a lot of data, but little valuable information. In such an environment, it’s not surprising to have dozens or hundreds of metrics and the goal of the data mining team quickly becomes trying to find value in the correlations they’ve discovered rather than highlighting metrics that are relevant to the business issues that are of concern to us.

When we articulate our goals in business terms, the data miner is left to their own devices on how to find analysis that is relevant to that goal. The result is often that the data miner can bring in correlations from related data or data that can be made to relate. As we’ll see in some of the real-world examples later in this article, hypotheses are not always proven or disproven in the first pass through the data.

When we articulate our goal in business terms, there are several criteria that we can consider to see if the goal is worthy:

Is it actionable?

Finding information on something that you can take no action on has no value. Ask yourself this: “If the hypothesis turns out to be true, what action might we take?” If there is no answer to that question, think of putting this goal aside.

Is it relevant?

A goal that might be actionable but that doesn’t move the organization forward is of little value. “What will change if we take actions on this element and how will that affect our organization?” is a great question to ask.

Is it significant?

There is a cost to data mining. Personnel will need to gather data sources, create tools and analyses of that data, and then present it to senior management. There is some level of disruption, however minor, in that exercise. The changes that are anticipated also have the chance of being disruptive. Changing any existing process typically comes with some loss of effectiveness in the short term. If the value of the changes you are expecting is not significant, then making them may cost more than not. “What is the potential quantifiable impact of this hypothesis and its associated changes if it turns out to be valid?” is a good question to consider.

Just because we’re looking for gold doesn’t mean we’re going to walk past diamonds on the ground. It’s ok to find correlations for one thing while you were looking for something else. In one of the examples we’ll describe later in this article, an organization looking to reduce staff turnover discovered that not only was time being lost on retraining staff, but that the organization had become so projectized that proper training and care of the staff had been forgone in favor of short-term results. Returning to investing in training paid off for the organization in a big way. Not only was total turnover reduced, but total costs of training were reduced too, as people were now happy to remain at the firm. The overall improvement in resource capacity was remarkable.

The tiny kernels of value that we are on the hunt for will come from correlations in the data; patterns that match from one source to another and, most interestingly (at least for data miners), correlations and associations of the data that no one might have ever thought to look at together.

Where are we mining for it?

Prospectors don’t wander into a random territory and start digging with the hope of finding anything. Prospecting is a scientific process. Prospecting for ore starts with thinking about what we’re looking for and where we’re likely to find it. Prospecting for data is much the same. Starting with a goal of what we are looking for will guide where we should start our search.

There’s an old joke about a woman walking down the street late at night. She comes upon a man on his hands and knees under a streetlamp. “I lost my keys,” explains the man under the lamp and, feeling sorry for the poor fellow, the passer-by gets down on her hands and knees as well and starts looking. After a few fruitless minutes of searching, the passer-by asks, “Are you sure you lost them here?” “Oh no,” says the man. “I lost them down the street, but this is where the light is.”

It’s an old joke, but the point is that where you look makes a big difference to the likelihood of finding what you are looking for.

When we think of the results of our search, it is helpful to think of the result in business terms rather than data terms. For example, “I’m looking for variance of more than 22% between expected and actual duration” is less useful than “I’m looking to improve our on-time performance.” The first example sounds more exact, more scientific even, but it is the second example that is the reason we are searching. Articulating the business result rather than the data result can open our perspective to look at data other than just the narrowly defined search.

Data mining is all about looking for trends and variances however they’re defined. You will be looking for direct correlations and indirect correlations as well as direct effect and indirect effect. The first two are statistical terms.

Direct Correlation

This statistical term refers to high values of one variable being associated with high values of another variable. As an example, we might find that high numbers of change orders are associated with high numbers of delays in the project schedule.

Inverse Correlation

This statistical term is also called negative correlation. It refers to the association of high values of one variable with low values of another variable. For example, we might find that high numbers of project quality documents are associated with low numbers of project defect recalls.
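To make these two definitions concrete, here is a small Pearson correlation sketch using only the Python standard library; the sample values are invented for illustration:

```python
# Sketch: Pearson's r with only the standard library.
# A result near +1 suggests a direct correlation, near -1 an inverse one.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: change orders vs. schedule delays (direct correlation),
# quality documents vs. defect recalls (inverse correlation).
change_orders = [1, 4, 6, 9, 12]
delays        = [2, 5, 7, 11, 15]
quality_docs  = [2, 5, 8, 11, 14]
recalls       = [9, 7, 5, 3, 1]

print(pearson(change_orders, delays))   # close to +1
print(pearson(quality_docs, recalls))   # close to -1
```

A value near zero would indicate no linear relationship at all, which is itself a useful answer when testing a hypothesis.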

Direct Effect

This is a term I’ll use to refer to elements of one domain of data that affect the same domain of data. Looking at scheduling data to correlate to a scheduling result is a Direct Effect. An example might be associating the number of changes in duration of a task with the delay in the schedule of that task. This might show the effects of excessive schedule change.

Indirect Effect

This is a term I’ll use to refer to elements of one domain of data that affect another domain of data. Looking at assignment data to correlate to schedule data is an Indirect Effect. For example, associating the length of project meeting durations and the risk of a project finishing on time is an Indirect Effect.

Project Management data is well suited to data mining because of its structure. Ideal data sources for data mining are structured into neat columns and rows, have good quality because of some kind of approval process, and are date-oriented. Data from an ERP/Finance system would typically be a good source, but so is project management data. As you think of where you can look for trends in data, think about the quality of your own project management data. It is common to find that different aspects of the data have different levels of quality. Planning data is often voluminous, but good planning doesn’t give a full indication of how things actually occurred. Tracking data is often less complete, with some projects being well followed and others not so much. It is not uncommon to find good tracking data in the early stages of a project and poorer quality tracking data near the end.

One source that is often overlooked is timesheet data. Even if the timesheet data is not perfectly aligned or integrated with the project plans, if the data is task-oriented it is still tremendously valuable. Timesheets typically have a very high degree of completeness and there is usually an approval process associated with the data. Looking at trends between timesheet data and project plans can be very revealing.

One pitfall to avoid is “Data Dredging.” This is a term that refers to the practice of scanning a data set for any relationships at all and, once one has been found, assigning an explanation to the pattern after the fact. With enough variables, some correlations will appear by chance alone, so a pattern found this way should be treated as a hypothesis to test, not a conclusion.

An additional consideration in the modern age is where data is stored. With combinations of on-premises enterprise software, in-the-cloud subscription services for project software, on-premises central data storage and personal-device decentralized storage, being able to reach the data you want may not be obvious. Just because you can see an actual vs. planned Gantt chart on your screen doesn’t mean you can easily poll the data that created it, or even that the data came from one place. Even with online services, however, getting access to your own historical data is often possible. It may just require an additional step of work to bring it to one location for analysis.

When you start with data mining it is worthwhile spending some time inventorying what data might be available. It is often a good opportunity for some lateral thinking as it is not uncommon to find that the more obvious sources of data aren’t complete, have poor quality or are being managed unevenly. Looking a little further afield, one can often find pools of good quality data where they might not be expected. For example, perhaps there is good planning data and poor tracking data in the scheduling system. However, if the timesheets have good quality, task-based data, it may be very possible to create as-built views of completed projects just from the timesheets.

What is the mining process?

We start with the Scientific Method. You likely remember the basics from high school. We start with a hypothesis, then we gather data and analyze it. Then we prove or disprove the hypothesis. Just because we can measure a thing doesn’t make it inherently valuable to do so. Similarly, hyper-focusing on just one goal and overanalyzing or over-measuring it can be just as unproductive. If the data and analysis don’t come together to give clear indicators of one hypothesis, it’s sometimes better to back off, declare that hypothesis unprovable for now, and see if there is anything else, or anything related, in which you can show trends. There are some real-world examples of this in the next section.

If you are doing research on how to create a data mining environment, you may come across some of these terms for algorithms among many others:

Decision trees

Resembling organigrams or WBS diagrams, these structures represent the branches resulting from decisions.

Nearest neighbor

Calculations based on a technique that groups each record with records similar to it in the dataset

Rule induction

Rules derived from statistically significant data sets

Artificial neural networks

Non-linear predictive models that learn through a reiterative process

Genetic algorithms

Calculation algorithms based on the same techniques used in DNA research
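To make at least one of these terms concrete, the nearest-neighbor idea can be sketched in a few lines of Python. Here, a new project is classified by its most similar historical record; the features (meeting hours, change orders) and labels are invented for illustration:

```python
# Sketch of the "nearest neighbor" idea: label a new project with the
# outcome of the most similar historical record. Data is invented.
from math import dist  # Euclidean distance, Python 3.8+

history = [
    ((10, 2), "on-time"),   # (meeting hrs/week, change orders/month)
    ((12, 3), "on-time"),
    ((30, 9), "late"),
    ((28, 8), "late"),
]

def nearest_label(point):
    """Return the label of the historical record closest to `point`."""
    return min(history, key=lambda rec: dist(rec[0], point))[1]

print(nearest_label((11, 2)))   # lands near the "on-time" group
print(nearest_label((29, 10)))  # lands near the "late" group
```

Production systems typically use several neighbors (k-nearest neighbors) and normalize the features, but the grouping intuition is the same.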

If all those terms are giving you unwanted flashbacks to your statistics classes in college, don’t despair. Sometimes the most effective tool is the Mark-I eyeball. Creating charts of information can often be enough for the human eye to pick out patterns that complicated data analysis software has a much harder time with. Often the best way to get started with data mining is to synthesize available data into a spreadsheet and chart it into different formats. Where a line graph won’t show a particular effect, sometimes a pie chart will.
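As a hint of how little tooling the eyeball approach needs, even a text-mode bar chart can be enough to make an outlier jump out; the categories and hours below are invented:

```python
# Sketch: a text bar chart is often enough for the eye to spot an outlier.
# Categories and hours are invented for illustration.
hours_by_category = {
    "Project tasks":  310,
    "Meetings":       120,
    "Travel":         180,   # the eye lands on this bar immediately
    "Administration":  60,
}

for category, hours in hours_by_category.items():
    bar = "#" * (hours // 10)  # one '#' per ten hours
    print(f"{category:<15} {bar} {hours}")
```

The same synthesis in a spreadsheet with its built-in charting would serve just as well; the point is to get the data in front of a human eye in visual form.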

It’s also worthwhile to remember that the volume of the project management data you’ll likely be sifting through is tiny in data mining terms. Just picking out a trend or an association of data can be enough to produce value and that is often accomplished without massive statistical calculations.

Examples

Our firm has encountered a number of real world examples of data mining that our clients have described to us. While we can’t share the company names, we can share some of the initiatives and results these organizations have identified.

Example 1

Objective: How much time is being lost to non-essential overhead tasks?

Data Sources: Project Tasks, Timesheet data

This organization is a US-based international manufacturing company. The division in question had approximately 1500 people in it and managed hundreds of projects every year. The main office of this organization had multiple buildings, some of which were sufficiently far apart that transportation between buildings was provided by shuttle bus. A request from management asking how much project personnel time was being spent on interoffice transport initiated an effort to determine how much time was being lost to project capacity through non-essential overhead tasks.

The organization determined that existing timesheet and project management data was not being collected at a sufficient level of detail to answer the question adequately and over a 90 day period adjusted their internal timesheet and project schedule systems to match and encouraged 100% participation of the staff.

The initial results were surprising to all. It was determined that over 16% of total resource capacity was being spent on non-essential overhead tasks, and over 10% of total resource capacity was being spent on non-essential inter-office transport alone. The organization initiated a series of policies that included co-locating project personnel to the maximum extent possible and requiring the fewest number of meeting attendees to travel to wherever the largest number of meeting attendees were located for project meetings. The result was a savings of well over 10% of total resource capacity, the equivalent of hiring 150 new employees for free.
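Once timesheet data is collected at the right level of detail, the core arithmetic of this kind of analysis is simple. A sketch, with invented task records standing in for the real timesheets:

```python
# Sketch: computing lost capacity from task-level timesheet data.
# Task names, the essential/non-essential tagging, and hours are invented.
timesheet = [
    {"task": "assembly",        "essential": True,  "hours": 1200},
    {"task": "design review",   "essential": True,  "hours": 400},
    {"task": "shuttle travel",  "essential": False, "hours": 180},
    {"task": "status meetings", "essential": False, "hours": 120},
]

total = sum(r["hours"] for r in timesheet)
overhead = sum(r["hours"] for r in timesheet if not r["essential"])
print(f"{overhead / total:.1%} of capacity on non-essential overhead")
```

The hard part in practice was not the calculation but, as the example describes, adjusting the timesheet and schedule systems so the data existed at this level of detail in the first place.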

Example 2

Objective: What types of projects end up in management review?

Data Sources: Project Scheduling Software, Timesheet data

This organization was the IT division of a multi-national insurance company. The project management process of this organization included a process to identify projects in trouble at an early stage. If a project exceeded certain thresholds the project manager would need to present the status and plans of the project in a “management review” to members of the management staff. The data mining initiative was an attempt to identify the characteristics of the projects which would end up in management review. There were numerous hypotheses. Perhaps this was a resource skill issue or perhaps a project selection issue or an issue with poor estimating.

The organization found that analysis of most of the hypotheses returned no correlation, and then turned to timesheet data to look for any patterns. A correlation was found between the amount of time spent in project meetings and the likelihood of the project ending up in management review. The pattern was of a statistical type called a double peak (a bimodal distribution). Projects with either very low project meeting time or very high project meeting time were much more likely to end up in trouble. The organization created a series of new standards for the appropriate amount of time to allocate to project meetings and used variances from those times in timesheets as an early warning system for projects that might end up in difficulty. The result was a reduction of over 25% in the number of projects requiring management review.
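The resulting early-warning rule is simple to express: flag any project whose meeting time falls in either tail of the double peak. A sketch in Python, with an invented band standing in for the organization's actual standard:

```python
# Sketch of the early-warning rule: both too little and too much meeting
# time predicted trouble. The 4-10 hrs/week band is an invented stand-in
# for the organization's real standard.
LOW, HIGH = 4.0, 10.0  # acceptable meeting hours per week

def needs_review(meeting_hours_per_week):
    """Flag projects in either tail of the double-peak distribution."""
    return not (LOW <= meeting_hours_per_week <= HIGH)

print(needs_review(1.5))   # almost no meetings: a warning sign
print(needs_review(6.0))   # within the healthy band
print(needs_review(14.0))  # excessive meetings: also a warning sign
```

Note that a single-threshold rule ("too many meetings is bad") would have missed half the pattern; it took looking at the full distribution to see that too few meetings was just as predictive.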

Example 3

Objective: Can we lower staff turnover?

Data Sources: Timesheet data, Project Schedule data

This organization was a US-based IT services firm with several hundred employees. Management had identified that organizational efficiency was being affected by a high rate of staff turnover. Management wanted to look at project data to determine if the cause of the staff turnover could be identified. There were numerous hypotheses, including the expectation of poor soft skills among some project managers and excessive overtime causing burn-out. Review of data on those two causes, however, showed no correlation.

A review of timesheet data did show an inverse correlation between the amount of training time per staff member and the retention of that employee. Staff with a low amount of training time were twice as likely to leave the firm. Interestingly, these staff had often been involved in projects identified as successful. The impact of staff leaving was to either use sub-contractor staff or hire new staff. Contractors would arrive at the project pre-trained but would also leave the organization with their skills upon completion of the project. New staff arrived with various levels of training but had the added pressure that comes with being new: blending into a project team that was now under increased pressure to perform. The plan for sending a new hire to training was typically to wait until the current project was complete and then schedule it. This was often too late.
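The finding can be reproduced in miniature by comparing attrition rates between low-training and high-training groups; the staff records and the 10-hour cutoff below are invented for illustration:

```python
# Sketch: grouping staff by training time and comparing attrition rates,
# as in Example 3. Records and the 10-hour cutoff are invented.
staff = [
    {"training_hrs": 4,  "left": True},
    {"training_hrs": 6,  "left": True},
    {"training_hrs": 5,  "left": False},
    {"training_hrs": 30, "left": False},
    {"training_hrs": 28, "left": False},
    {"training_hrs": 32, "left": True},
]

def attrition_rate(records):
    """Fraction of the group that left the firm."""
    return sum(r["left"] for r in records) / len(records)

low  = [r for r in staff if r["training_hrs"] < 10]
high = [r for r in staff if r["training_hrs"] >= 10]

# The inverse correlation from the example: low-training staff
# leave at roughly twice the rate in this invented sample.
print(attrition_rate(low), attrition_rate(high))
```

This kind of group-and-compare analysis is exactly the sort of thing that fits comfortably in a spreadsheet pivot table; no specialized mining software is required.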

The result of the analysis was that the organization’s matrix was over-projectized and, as a result, department managers were given insufficient authority to keep their personnel trained. Management intervened to return this type of authority to department managers. The initial result was a short-term decrease in efficiency, but within six months those efficiency losses had been overcome and the organization then realized an improvement in throughput of over 10%. Staff turnover was dramatically reduced, and added benefits were improved staff morale across the organization as well as decreased costs in hiring and on-boarding.

Where can you pick up your prospector’s license?

If you have decided that data mining has the potential to bring value to your organization, then here are some tips on how to get started.

First, do some internal research on what data might be available. If you have long-standing internal project management systems or even enterprise project management systems, you’re likely well on your way. Other possible targets might be purchasing records from finance, timesheet information, quality registers and production delivery records.

Next, start with baby steps. Look at some big questions that can provide a hypothesis with a good chance of being proven or disproven. Rather than spending a lot of money and time purchasing and installing data mining systems, do your first analyses in a spreadsheet system on your desktop. Use charting liberally. Just looking at a graph of data with the naked eye will often reveal trends immediately.

Where you find data that is spotty, look for alternate sources while encouraging the organization to do more in the future to get good compliance on data collection. That data can be carrying valuable nuggets of information that could be found if it were only collected.

Finally, just like a mineral prospector: document, document, and document what you’re doing. There will be others who follow you in the organization, and showing where you found valuable information and how you extracted it will ensure that others will be able to find it quickly.

As a final parting word, here is a very old picture of my own. Once upon a time in Northern Ontario, I had the opportunity to do some actual mineral prospecting myself with my uncle and cousin.

Good luck in your own search.

Figure 1- The author (left) prospecting for iron near
Sudbury, Ontario, Canada circa 1969