Published in TDAN.com April 2000
INTRODUCTION
In the not-too-distant past, people wanting to perform a simple linear regression had to carry out many calculations ‘by hand’ (even with a calculator, this is tedious work). The amount of data which could be manipulated in this fashion was severely limited, and performing multiple linear regression was even more involved. In the late 1970s, it became possible for a moderately competent programmer with affordable hardware to produce a program which would carry out this exercise on the computer. In the early 1980s, VisiCalc became available, providing linear regression capabilities to people possessing much less technical skill. Ultimately, linear regression became encapsulated as a single, built-in spreadsheet function. Today, of course, not only will packaged software perform multiple linear regression, run stepwise variable selection and calculate significance tests, but it will drive even more powerful modeling and analysis methods, such as neural networks and fractal dimension estimation.
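To see how completely this particular task has been absorbed, consider the small Python sketch below (the data values and the use of the numpy library are my own illustrative choices, not anything from a particular product): the same simple regression that once demanded a page of hand arithmetic reduces to the textbook formulas, or to a single built-in call.

    # Simple linear regression, once a lengthy hand calculation, now a few lines.
    # The data values below are arbitrary illustrative numbers.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response

    # The textbook least-squares formulas for y = a + b*x.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    print(f"slope={b:.3f}, intercept={a:.3f}")

    # The same fit as a single built-in call, much as in a spreadsheet function.
    slope, intercept = np.polyfit(x, y, deg=1)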
Modern commercial data analysis software, often driven by expert systems, will perform many diagnostic tests and generate appropriate data summarizations automatically. Current data modeling tools
can select many model parameters autonomously using powerful error resampling methods, such as k-fold cross validation and bootstrapping. The trend is obvious: increasing amounts of analytical work
are being performed by computers. The question posed by this fact should be equally obvious: how much can data analysis be automated? Is it possible that all facets of data analysis can be handled
by the machine without human assistance?
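Before turning to that question, it is worth seeing how far such automation already reaches. The sketch below (the synthetic data, the candidate models and the use of Python with numpy are my own illustrative assumptions) lets 5-fold cross validation, one of the resampling methods mentioned above, choose a model's complexity with no human judgement involved.

    # 5-fold cross validation choosing model complexity automatically.
    # Data and candidate polynomial degrees are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3.0, 3.0, size=120)
    y = np.sin(x) + rng.normal(0.0, 0.3, size=120)

    idx = rng.permutation(x.size)          # shuffle the record indices once
    folds = np.array_split(idx, 5)         # five roughly equal hold-out folds

    def cv_error(degree):
        """Mean squared error of a degree-`degree` polynomial, estimated by 5-fold CV."""
        errors = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)                  # everything not held out
            coeffs = np.polyfit(x[train], y[train], degree)  # fit on the training part
            pred = np.polyval(coeffs, x[fold])               # predict the held-out part
            errors.append(np.mean((pred - y[fold]) ** 2))
        return float(np.mean(errors))

    # The machine, not the analyst, picks the complexity with the lowest estimated error.
    best_degree = min(range(1, 8), key=cv_error)
    print("degree selected by cross validation:", best_degree)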
DISCUSSION
A review of several data modeling tools which I published several years ago included the following passage:
‘While any of these products is capable of performing without preprocessing or expert guidance, none offers a true “one-button solution” for users. Knowledge of basic statistical and modeling
concepts would benefit users of such systems, as data analysis and preprocessing make the tool’s job easier, and post-processing and diagnostics ensure the quality of model output. In particular,
questions of sampling and overfitting remain issues that users must address with these new modeling tools.’
Certainly, at this point, some progress has been made toward mechanizing several of the tasks mentioned, particularly in regard to avoiding overfit. Data analysis tasks may be divided according to whether they can be performed by humans and whether they can be performed by computers. This gives us the following four categories of tasks, those which:
1. require human performance (the manual)
2. require computer performance (the necessarily automatic)
3. can be handled by computer or human (the optionally automatic)
4. cannot be handled by man or machine (the unfeasible)
Naturally, the lines which divide these categories change over time. Before the availability of computers to do this kind of work, all data analysis tasks by definition were either handled by
humans or they were unfeasible. The introduction of computing machinery and its subsequent growth in power has eaten into both of these areas, automating increasing amounts of manual work and
claiming previously unfeasible work as computer-only work. In fact, new data analysis tasks have even been invented specifically for computers. To answer the question at hand, it will be helpful to
understand why different types of tasks have fallen to the computer.
The most natural applications of computers to analytical work have been those which require mathematical manipulation of large amounts of data. Indeed, our very concept of what constitutes a
‘large’ data set has been radically transformed by the growth of the information storage and manipulation capabilities of our machines. Today, it is possible to buy several gigabytes of hard-disk
storage for a few hundred dollars and an increasing number of organizations maintain terabyte-level storage.
The core math and logic of small-scale data analysis which was formerly performed manually has largely, if not completely, been claimed by computers. On my shelf sit several books on ‘pencil and paper’ data analysis which describe various techniques and tricks for getting the most from your manual efforts. Such texts include interesting higher-level analytical and mathematical material, but the skills which they describe for low-level, detailed processing of the numbers are mostly outdated. Tasks such as building linear regressions, calculating standard deviations and finding medians have irrevocably been absorbed by computer software.
Of course, many types of work which were once unfeasible were simply extensions of work performed by humans. Calculating means or correlation coefficients is impractical for human analysts when
data sets become too large. Theoretically, it is possible for humans to do such things, but it is simply too costly and time-consuming for them to do so. Such drudgery has quickly been snapped up
by the machines. Finding the average of a million numbers is quite easy today, even with what is now considered ‘low-end’ hardware.
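By way of illustration, the short Python sketch below (the synthetic data is my own assumption) summarizes a million values in a few lines, work which would once have been unthinkable by hand.

    # Summary statistics over a million values: rote work absorbed by the machine.
    # The synthetic data is an illustrative assumption.
    import numpy as np

    values = np.random.default_rng(1).normal(loc=50.0, scale=10.0, size=1_000_000)

    print("mean              :", values.mean())
    print("standard deviation:", values.std(ddof=1))
    print("median            :", np.median(values))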
These two types of rote work, the previously manual and the previously unfeasible, may be considered as one, distinguished only by how much tedium humans will or can tolerate. Additionally,
computers have handled an escalating level of decision-making in the actual analysis process. This has been true at the lowest level, as in converting class data to a series of dummy variables or
automatically identifying statistical outliers using simple heuristics. It has also been true at a higher level: some commercial offerings embody a great deal of expertise for handling data.
Scenario from Cognos is a data modeling tool which will automatically set variable types for imported data, select variables for inclusion in the model, set model parameters and control model
complexity. Additionally, Scenario generates reports in English which summarize its findings. As another example, Forecast Pro from Business Forecast Systems will automatically choose an
appropriate forecasting method using an expert system which analyzes the data. Tools like these provide a great deal of data analysis sophistication within an automatic framework. Computers can
definitely handle a large portion of the data analysis job, even at a high level of decision-making. As with other fields within computing, when the problem-solving process can be clearly expressed
as an algorithm, the computer can be programmed to perform the task.
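As a simplified illustration of the low-level decisions described above, the Python sketch below (the data, the three-standard-deviation threshold and the use of numpy are my own illustrative choices, not those of any product mentioned) converts class data to dummy variables and flags outliers with a simple heuristic.

    # Two low-level, automatable decisions: dummy variables and a simple
    # outlier heuristic.  Data and threshold are illustrative assumptions.
    import numpy as np

    # Class (categorical) data -> one 0/1 dummy column per category.
    region = np.array(["north", "south", "south", "east", "north"])
    categories, codes = np.unique(region, return_inverse=True)
    dummies = np.eye(len(categories), dtype=int)[codes]   # one row per record

    # Heuristic: flag values lying more than 3 standard deviations from the mean.
    rng = np.random.default_rng(2)
    sales = np.append(rng.normal(100.0, 10.0, size=199), 500.0)  # one gross value
    z = (sales - sales.mean()) / sales.std()
    print("outliers flagged:", sales[np.abs(z) > 3.0])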
HUMAN ABILITIES AND PROBLEM CONTEXT
To this point, a distinction has been made only between what humans and computers can and cannot do. Clearly, though, humans possess a range of abilities in this department. In the book “The
Electronic Cottage”, Joseph Deken writes about the ability to deal with abstract data as though it were another sense. Some humans, obviously, have a more developed ‘sense’ of information than others: they possess a greater ability to interpret, understand and absorb abstract information. Generalizing this idea, some people (such as statisticians) will deal far more effectively with available data analysis systems than others.
Automation will, to some extent, help people with less of this ‘sense’ utilize information. Deken touches on the idea of the ‘computer as consultant’: ‘By using a computer, you can obtain not
only new data but valuable built-in “consultant service” to help you analyze it.’ What Deken describes is a vision of interactive exploration and understanding which is becoming a reality in today’s software tools (On-Line Analytical Processing and data visualization come to mind). (This is fairly visionary for a book published in 1981!) Deken offers a warning, though, about relying solely on computer analysis, specifically in relation to causal relationships:
‘Causal relationships are the strongest type you can hope to find, and are the backbone of engineering and science. A word of caution is in order here, as you begin to flex your newfound computer
capabilities: You will have unprecedented power to find relationships and associations. Once found, an association’s predictive value can only be established by statistical or empirical induction.
The causal nature of an association can only be investigated by experiment. All the computer calculation in the world cannot vault you over a single one of these inductive or experimental
hurdles.’
Thus, there are at least some things which cannot be handled by the computer. The statistical induction which Deken mentions can be implemented on the computer in the form of things like hold-out
testing. His comment about causal associations, though, brings up an important area which computers do not cover: the context of the data analysis problem. The difficulty is that this is a conceptual issue, not a mechanical one. A computer can detect statistically significant associations between variables, but it cannot determine whether that association has any real-world significance. For
example, a statistical relationship may be discovered between weather and sales, but whether that is important to us depends on the context of our problem. A business analyst would be better suited
to answer questions of real-world significance. For computers to answer such questions would require a great deal more intelligence than they currently possess. Without an abstract knowledge of
what drives a business, simply knowing that weather affects sales may be of no use.
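A small Python sketch (entirely synthetic data; the scipy library is assumed to be available) makes the point concrete: the machine can report that the association between weather and sales is statistically significant, but it cannot say whether the effect is large enough to matter.

    # Statistical significance versus real-world importance.
    # All numbers here are synthetic and purely illustrative.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    temperature = rng.normal(20.0, 5.0, size=5000)
    # Sales depend only very weakly on temperature, plus a great deal of noise.
    sales = 1000.0 + 0.5 * temperature + rng.normal(0.0, 30.0, size=5000)

    r, p_value = stats.pearsonr(temperature, sales)
    print(f"correlation r={r:.3f}, p-value={p_value:.2g}")
    # A tiny p-value says the association is probably real; it says nothing about
    # whether half a unit of sales per degree is worth acting on.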
As another example, consider a political model of voting patterns in Pennsylvania: whether it is of any use in Georgia is doubtful, given the different statistical universes involved. This is
precisely the sort of thing that analysis software has no knowledge of and no control over.
When the problem context is narrowly specified, greater automation is possible through customization. Vertical applications for fraud detection or credit scoring are good examples of this. The
downside of this approach is the sacrifice of generality: these systems have limited utility outside of their intended application.
INNOVATION
Another area which computers have not attacked is the creation of new data analysis techniques. Things like Fourier transforms, wavelets and morphological smoothing may be immensely useful in data
analysis and may also be automated once they have been discovered. Before their invention, however, there is no way for computers to make use of them. It doesn’t matter how powerful the computer
is: if it’s running linear regressions on non-linear data, then it can’t compete with one running non-linear regressions. Constructing new analysis methods is a creative act which thus far has
been the exclusive domain of humans. This is often a significant portion of the data analysis job, particularly in modeling. Analysis often requires the creation of new filters, measures or
summaries of data for individual projects.
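A toy Python example (the data and the competing models are my own illustrative choices) shows why the available methods matter more than raw horsepower: a linear fit simply cannot describe a curved pattern, however fast the machine running it.

    # Method choice matters more than raw computing power: a linear model
    # cannot describe a clearly non-linear pattern.  Data are illustrative.
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-3.0, 3.0, size=300)
    y = x ** 2 + rng.normal(0.0, 0.5, size=300)   # curved signal plus noise

    def fit_mse(degree):
        """In-sample mean squared error of a polynomial fit of the given degree."""
        coeffs = np.polyfit(x, y, degree)
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    print("linear fit error (MSE)   :", fit_mse(1))   # misses the curvature entirely
    print("quadratic fit error (MSE):", fit_mse(2))   # captures it easily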
Genetic programming has shown some promise for the automatic generation of new algorithms, and has demonstrated an ability to assemble relatively low-level filters. Unfortunately, it has yet to prove
itself capable of developing anything as interesting as higher-order spectral analysis. Whether it will ever be able to do this remains a matter of speculation. The ability to innovate leaves
humans ahead in the development of methods for such tasks as preprocessing of training data, postprocessing of model results, actual modeling of data and graphical display of data.
CONCLUSION
‘One-button’ data analysis eludes us and appears likely to continue doing so for the foreseeable future. Various mechanical activities necessary for data analysis may be (and many have been) automated. It
is apparent that this idea extends beyond the mechanical grinding of the data to higher level decision-making. There are, however, two areas of data analysis (at least) which seem unlikely to be
reproduced on computers any time soon: creation of new analysis methods and understanding of problem context. These would require an enormous leap in the amount of intelligence employed by
computers in their work.
REFERENCES
Software review mentioned: W. Dwinnell, “Advanced Modeling Systems,” AI Expert, June 1994.
Texts on manual data analysis: John Tukey, “Exploratory Data Analysis,” Addison-Wesley, ISBN 0-201-07616-0; Davis, “Empirical Equations and Nomography,” McGraw-Hill, 1943 (no ISBN).
Book mentioned: Joseph Deken, “The Electronic Cottage,” Morrow, ISBN 0-688-00664-7.