SpendZen

The Zen of all things Spend.

Data Cleansing is the 90%

I once had a professor in a data mining class say to me, "Cleaning your data is 90% of the work in data mining." All the wonderful algorithms that pioneers in the data mining field have developed over the years, such as association rules, k-means clustering, neural network classification, and on and on, are utterly useless until that up-front labor is done. Even basic statistical calculations such as the mean or standard deviation are unreliable unless someone first disambiguates all the "widgets" from the "widgits [sic]" or, even worse, the "iwdgets [sic]."
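
To make the point about unreliable means concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the canonical name list, the 0.8 similarity cutoff, and the sample amounts are all hypothetical, and the standard library's fuzzy matcher stands in for whatever cleansing process a real spend system would use.

```python
from collections import defaultdict
from difflib import get_close_matches
from statistics import mean

# Hypothetical catalog of canonical item names (an assumption for this sketch).
CANONICAL_NAMES = ["widgets", "gaskets"]


def normalize(name, cutoff=0.8):
    """Map a raw item name to its closest canonical name, if one is close enough."""
    cleaned = name.strip().lower()
    matches = get_close_matches(cleaned, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


def mean_spend_by_item(purchases, clean=True):
    """Average the amounts per item, optionally normalizing item names first."""
    groups = defaultdict(list)
    for name, amount in purchases:
        key = normalize(name) if clean else name
        groups[key].append(amount)
    return {item: mean(amounts) for item, amounts in groups.items()}


# Made-up sample data: three spellings of the same item plus one other item.
purchases = [
    ("widgets", 100.0),
    ("widgits", 140.0),
    ("iwdgets", 90.0),
    ("gaskets", 20.0),
]

raw_means = mean_spend_by_item(purchases, clean=False)
clean_means = mean_spend_by_item(purchases, clean=True)
print(raw_means)    # each misspelling is averaged as its own item
print(clean_means)  # all three spellings are averaged together
```

Without normalization, the three spellings are treated as three different items, each with its own "mean" based on a single purchase. After normalization they collapse into one item, and the statistic finally describes what was actually bought.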

Recently I've been working on developing automated methods of knowledge discovery in spend data. The advantages of working with cleansed data in spend analytics software are obvious; however, there are a few that are not so apparent on the surface, and I would like to share them.

  • Rapid turnaround of new analysis algorithms - In the past day, I've been able to turn out a pattern analysis tool that can pull hidden cyclical spend out of client data. Monthly bills, periodic maintenance, etc., can all be distinguished very easily from noise because all the data has been normalized up front. I didn't have to waste any time cleaning up names or correcting typos. This time savings is invaluable, especially for highly paid engineering staff.
  • Application of analysis techniques with minimal knowledge of what the data is about - Classification techniques at Spend Radar quickly let an analyst recognize that several purchases that look like completely different items are in fact purchases of similar items. A "shank nail" and a "brass escutcheon pin" are both classified under the same appropriate heading, so an analyst need not be a domain expert to do his/her job. This also allows for easy item category roll-up when performing algorithmic analysis.
  • Shorter learning curve for analysts - If an analyst is moved off a project and a new one comes on board, the first analyst takes knowledge with him/her that the second one now needs to learn from scratch. Having the data cleansed and classified lowers the cognitive burden on the new analyst because there are fewer idiosyncrasies to remember.
  • Improved accuracy of knowledge discovery - This point is rather obvious; however, I'll include it anyway. Misclassifications cause a lack of understanding in both human and machine when looking at spend data. Good quality information helps lead to quality analysis.
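
The first point above, picking cyclical spend out of noise, can be sketched in a few lines of Python once vendor names are already normalized. This is not the pattern analysis tool described above, just one simple heuristic: flag a vendor as recurring when the gaps between its invoices are nearly constant. The function name, the thresholds, and the sample vendors are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def find_recurring_spend(transactions, min_occurrences=4, max_gap_cv=0.15):
    """Return {vendor: average gap in days} for vendors billed at regular intervals.

    transactions: iterable of (vendor, date, amount) with cleansed vendor names.
    A vendor counts as recurring when the coefficient of variation of the gaps
    between consecutive invoice dates is at most max_gap_cv (an assumed threshold).
    """
    dates_by_vendor = defaultdict(list)
    for vendor, when, _amount in transactions:
        dates_by_vendor[vendor].append(when)

    recurring = {}
    for vendor, dates in dates_by_vendor.items():
        if len(dates) < min_occurrences:
            continue  # too few invoices to call it a pattern
        dates.sort()
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        avg_gap = mean(gaps)
        if avg_gap > 0 and pstdev(gaps) / avg_gap <= max_gap_cv:
            recurring[vendor] = avg_gap
    return recurring


# Made-up sample data: one monthly bill and one irregular vendor.
transactions = [
    ("Acme Power", date(2023, 1, 1), 410.00),
    ("Acme Power", date(2023, 2, 1), 395.50),
    ("Acme Power", date(2023, 3, 1), 402.25),
    ("Acme Power", date(2023, 4, 1), 388.10),
    ("Acme Power", date(2023, 5, 1), 399.99),
    ("Misc Supplies", date(2023, 1, 3), 57.20),
    ("Misc Supplies", date(2023, 1, 10), 912.00),
    ("Misc Supplies", date(2023, 3, 22), 14.75),
    ("Misc Supplies", date(2023, 5, 1), 230.40),
]

recurring = find_recurring_spend(transactions)
print(recurring)
```

The heuristic only works because the grouping key is trustworthy. If "Acme Power" also appeared as "ACME Pwr" or "Acme Powre", the monthly series would be split across several keys, and none of the fragments would have enough evenly spaced invoices to be flagged.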

Now that I've nearly completed this post, I'm going to go back and concentrate on the remaining 10% of the work, which will be the most fruitful for any organization. I think I'll write some code that uses linear regression to identify seasonal spend behavior. That should only take an hour...

Michael Jaskiewicz, Senior Software Engineer, Spend Radar
