Data Mining on a Budget

Published in January 2005

People tend to think that in order to do data mining, you have to be a huge enterprise with a dedicated staff, and lots of specialized software. This is simply not true. If I were to tell you that
to do gold mining, you had to have a lot of hydraulic equipment and a hundred-man crew, you would know better. If you have the right hole in the ground, you can do quite well with a pick and

If you watch the Food channel on cable or if you just like Memphis-style BBQ, you know the name “Corky’s”. The chain started in 1984 in Memphis by Don Pelts and has grown by franchise at a
steady rate ever since. They will never be a McDonald’s because all the meats are slow cooked for up to 22 hours over hickory wood and charcoal, and then every pork shoulder is hand-pulled. No
automation, no mass production.

They sell a small menu of 25 items by mail order via a toll-free number (1-800-9CORKYS) or from their website ( and ship the merchandise in special boxes sometimes using dry ice.
Most of the year, their staff can handle the orders. But at Christmas time, they have the problem of success.

Their packing operation consists of two lines. At the start of the line, someone pulls a box of the right size, and puts the pick list in it. As it goes down the line, packers put in the items, and
when it gets to the end of the line, it is ready for shipment. This is a standard business operation in lots of industries. Their people know what boxes to use for the standard gift packs and can
pretty accurately judge any odd sized orders.

At Christmas time, however, mail-order business is so good that they have to get outside temp help. The temporary help does not have the experience to judge the box sizes by looking at a pick list.
If a box that is too small starts down the line, it will jam up things at some point. The supervisor has to get it off the line, and re-pack the order by hand. If a box that is too large goes down
the line, it is a waste of money and creates extra shipping costs. And you do not throw dry ice into any old box, so it needs to be the right kind of box.

Mark Tutt (On The Mark Solutions, LLC) has been consulting with Corky’s for years and set up a new order system for them on Sybase. One of the goals of the new system is print the pick list and
shipping labels with all of the calculations done, including what box size the order requires.

Following the rule that you do not re-invent the wheel, Mr. Tutt went to the newsgroups to find out if anyone had a solution already. The suggestions tended to be along the lines of getting the
weights and shapes of the items and using a Tetris program to figure out the packing.

Programmers seem to love to face every new problem as if nobody has ever done it before and nobody will ever do it again. The “Code first, research later!” mentality is hard to overcome.

The answer was not in complicated 3-D math, but in the past 4 or 5 years of orders in the database. Human beings with years of experience had been packing orders and leaving a record of their work
to be mined. Obviously the standard gift packs are easy to spot. But most of the orders tend to be something that had occurred before, too. Here are the answers, if you will bother to dig them out.

First, Mr. Tutt found all of the unique configurations in the orders, how often they occurred and the boxes used to pack them. If the same configuration had two or more boxes, then you should go
with the smaller size. As it turned out, there were about 4995 unique configurations in the custom orders which covered about 99.5 percent of the cases.

Next, this table of configurations was put into a stored procedure that did a slightly modified exact relational division to obtain the box size required.

[[NOTE: Relational division is one of the eight basic operations in Codd’s relational algebra. The idea is that a divisor table is used to partition a dividend table and produce a quotient or
results table. The quotient table is made up of those values of one column for which a second column had all of the values in the divisor.

This is easier to explain with an example. Imagine that you have a shopping list and a several baskets of groceries. If a basket contains everything on the shopping, then you return that basket as
a result. If the basket contains all the items and more than are on the shopping list, then you have a relational division with a remainder. If the basket contains exactly the items on the list,
then you have an exact relational division, without a remainder. There are several ways to write the SQL, and each method will have slightly different performance based on the data. ]]

None of this is rocket science. None of this required special software. A pick, a shovel and the right hole in the ground are sufficient for a gold mine.

There is a rule in maintaining data quality that you first mop up the water, then you fix the leak. The new application will allow the warehouse to print the packing lists and shipping labels on
site. Before this, couriers took orders from the order center to the warehouse. The real trick is to add a checkbox to the user interface and have the staff flag orders that did not pack correctly.
This traps the configuration and the packing data, so that they can weed out bad configurations and replace them during the nightly process runs with the new corrected version.


submit to reddit

About Joe Celko

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.