Data mining (sometimes called "knowledge discovery in databases")
is defined as the non-trivial extraction of implicit, previously
unknown, and potentially useful information from data [1]. Data
mining involves the identification of patterns or relationships
in data. Formally, given a set of facts F, a language L, and some
measure of certainty C, a "pattern" is a statement S in L that describes
relationships among a subset F' of F with certainty C (substantiated
by the database) such that S is simpler than the enumeration of
all facts in F'.
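As a minimal sketch of this definition (the class name, field names, and toy facts below are hypothetical and are not part of the original definition), a pattern can be modeled as a statement S together with one test that selects the subset F' and another that checks whether S holds for a fact. Certainty is assumed here to be the fraction of facts in F' for which S holds, which is only one possible choice of C.

```python
from dataclasses import dataclass
from typing import Callable

Fact = dict  # hypothetical representation of one fact in F


@dataclass
class Pattern:
    statement: str                      # S, expressed in some language L
    selects: Callable[[Fact], bool]     # picks out the subset F' of F
    holds: Callable[[Fact], bool]       # what S claims about each fact in F'

    def certainty(self, facts: list) -> float:
        """Assumed measure C: share of the facts in F' for which S holds."""
        subset = [f for f in facts if self.selects(f)]
        if not subset:
            return 0.0
        return sum(1 for f in subset if self.holds(f)) / len(subset)


# Toy facts: each one is a checkout basket (made-up data).
facts = [
    {"items": {"cereal", "milk"}},
    {"items": {"cereal", "milk", "bread"}},
    {"items": {"cereal"}},
    {"items": {"bread"}},
]

cereal_implies_milk = Pattern(
    statement="baskets containing cereal also contain milk",
    selects=lambda f: "cereal" in f["items"],
    holds=lambda f: "milk" in f["items"],
)

print(cereal_implies_milk.statement, cereal_implies_milk.certainty(facts))
```

Here two of the three baskets containing cereal also contain milk, so the certainty is 2/3. The single statement stands in for listing the individual facts in F', which is the sense in which S is simpler than the enumeration.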
Data mining is one of the hottest new database research areas.
Why? The amount of information in the world is growing exponentially,
and it is becoming impossible to effectively manage that data using
traditional database systems. Machine assistance is clearly necessary,
but the difficulty lies in designing systems that are capable of
discovering "useful" information with minimal human intervention.
Programs are just starting to be designed to exploit the vast repositories
of data that exist. The first data mining programs were used for
relational or transactional data. Recently, systems have been developed
to mine spatial or temporal data.
A Simple Data Mining Example Involving Supermarket Data
Consider an example from a supermarket that uses bar-code scanners
at the check-out counter. The underlying computer system is supposed
to identify the name and price of the product being scanned, and
update the inventory list so that the shelves can be restocked at
the right time. It appears that most of this supermarket data can
be discarded fairly quickly; however, the datasets contain lots
of valuable information that can be used for purposes other than
those for which it was originally collected. This information can
then be used to provide executive summaries of sales, to be aware
of customer preferences, to gain a competitive edge on other retailers,
to figure out which items (or combinations of items) should be put
on sale, or to simply acquire various kinds of marketing information.
In our supermarket example, the data mining system may point out
patterns such as:
- which items are frequently bought in combination (e.g., cereal
and milk; wieners, hot dog buns, mustard, and relish; hot salsa
and antacid; chips and pop; diapers and baby food)
- which items are frequently included in a $100+ grocery bill
- which items are frequently bought by families (a family may
be identified due to the purchase of certain types of products
that are typically aimed at children)
- which items are frequently purchased by people making small
purchases, with "small" perhaps being defined by a dollar amount,
or by the fact that the customer used the express check-out counter
Obvious correlations such as the relationship between purchases
of diapers and baby food are less interesting from a knowledge discovery
point of view than, say, a correlation between dairy products and
antacids. Patterns that are "interesting" normally concern relationships
that are not obvious, or are unexpected.
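The first kind of pattern listed above, items frequently bought in combination, can be sketched with support, confidence, and lift. These are standard association-rule measures that the text itself does not name, so treat them as an assumed choice. The basket data and the function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(baskets):
    """Return {(a, b): (support, confidence of a -> b, lift)} for each item pair seen."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets holding both items
        confidence = count / item_counts[a]  # share of a-baskets that also hold b
        # Lift compares the observed co-occurrence with what independence predicts.
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = (support, confidence, lift)
    return stats


# Made-up baskets, loosely following the examples in the text.
baskets = [
    ["cereal", "milk"],
    ["cereal", "milk", "chips"],
    ["milk", "chips", "pop"],
    ["chips", "pop"],
    ["hot salsa", "antacid"],
    ["diapers", "baby food"],
]

stats = pair_statistics(baskets)
for pair, (support, confidence, lift) in sorted(stats.items()):
    print(pair, round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift well above 1 only indicates that two items occur together more often than their individual popularity would predict. It does not say whether the pairing is obvious: diapers and baby food would score highly too, so deciding which patterns are "interesting" still requires domain knowledge or an explicit list of expected associations.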
To date, most of the above types of information (for whatever
application domain) have not been exploited. As mentioned earlier,
there is simply too much information to process with existing tools.
New tools and techniques need to be developed. This is the motivation
for our research.
A Complex Example Involving Pharmaceutical Data
For a more complicated example involving more "interesting" patterns,
we briefly examine data mining in a pharmaceuticals context. The
identification and quantification of the following types of information
can be extremely useful for patients, physicians, pharmacists, health
organizations, insurance companies, regulatory agencies, investors,
lawyers, pharmaceutical manufacturers, drug testing companies, etc.
- interactions among over-the-counter (OTC) medicines
- interactions between prescription and OTC medicines
- interactions among prescription medicines
- interactions between any kind of medicine and various foods,
beverages, vitamins, and mineral supplements
- common characteristics between certain drug groups and offending
foods, beverages, medicines, etc.
- distinguishing characteristics among certain drug groups (e.g.,
for some people, certain antihistamines may not produce an adverse
reaction to certain foods, and therefore may be a better choice
among the large number of antihistamines on the market)
- questionable interactions based on very limited evidence, but
which may be of great interest (e.g., a few users out of many
thousands of users report a serious, but unusual side effect resulting
from some combination of characteristics)
- determining which types of patients are likely to be at risk
when using a particular medicine
Given the size of the databases being queried (see below), there
is likely to be a trade-off in accuracy of information vs. processing
time. Sampling techniques and tests of significance may be satisfactory
to identify some of the more common relationships; however, uncommon
relationships may require substantial search time. The thoroughness
of the search depends on the importance of the query (e.g., life-threatening
vs. "curious to know"), the indexing structures used,
and the level of detail supplied in the query. Of course, the real
data mining challenge comes when the user supplies only a minimal
amount of information. For example: Find possible serious side effects
(not necessarily reported in the manufacturer's product literature)
involving food and any type or brand of antacid.
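The accuracy vs. processing time trade-off described above can be sketched by estimating a rate from a random sample rather than scanning every record. This is an assumed approach for illustration only: the record layout, the function name, and the use of a normal-approximation 95% interval are not from the text.

```python
import math
import random


def estimate_rate(records, predicate, sample_size, seed=0):
    """Estimate the share of records satisfying `predicate` from a random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)
    hits = sum(1 for r in sample if predicate(r))
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# Synthetic records: exactly 10% report a reaction (made-up data).
records = [{"reaction": i % 10 == 0} for i in range(100_000)]
estimate, margin = estimate_rate(records, lambda r: r["reaction"], sample_size=1_000)
print(f"estimated rate {estimate:.3f} +/- {margin:.3f}")
```

A sample of 1,000 records is enough to locate a common relationship such as this one, but a reaction reported by only a few users in many thousands would usually produce no hits at all in a sample of that size, and the normal approximation is unreliable for such rare events. This is consistent with the point above that uncommon relationships may require a much more thorough search.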
We begin by noting that there are literally hundreds of thousands
of OTC or prescription medicines available, and that almost
every kind of medicine (e.g., antacids, aspirin, children's cough
syrup, heart medicine) can have numerous minor or major side effects
[2,3]. We note the following facts:
- Most medicines interact with food, beverages, cigarettes, physical
activities, other medicines, etc. Some interactions are minor;
some are bothersome; some are serious; and some are lethal. Some
are common; some are uncommon; and some occur only under very
specific situations, in certain types of patients.
- Some medicines have very few reported side effects. This is
great if the medicine has been available for decades, if it has
had many users, and if all existing side effects have been reported.
(Note the "if's" in the previous sentence. Problems with the data,
such as an absence of information or contradictory information,
make data mining more difficult -- as we will see below.)
- Some medicines have hundreds of possible side effects, but
each side effect may only involve a very small percentage of users.
Some of these statistics are documented by the manufacturer and
may include a probability of occurrence.
- New side effects are constantly being reported, especially
for medicines that have only been available for a few years.
- Prolonged usage can affect patients in different ways. For
relatively new medicines, this information may be unavailable,
but is expected to be added to the database over time.
- The effectiveness of many medicines (with respect to shelf
life) deteriorates over time. Depending on storage, some medicines
lose their effectiveness quickly (e.g., weeks for nitroglycerine).
Some medicines need to be kept stored under strict conditions
(e.g., refrigerated; kept in a cool, dry place; well sealed).
Also, some medicines may cause serious damage to internal organs
if used after the expiry date (e.g., tetracycline).
- Many people take multiple medicines. When combined with dietary
habits, living habits, age, weight, etc., pattern detection using
exhaustive search techniques becomes intractable.
- People respond in different ways to different dosages.
- Half lives of medicines, and mean-time-to-peak statistics are
often known, and may be important in identifying interaction patterns.
- Sales or usage statistics are available.
- No two users are exactly the same. Users may fall into numerous
classes such as male, female, infants, children, adolescents,
adults, seniors, pregnant women, nursing mothers, vegetarians,
smokers, drinkers, diabetics, athletes, etc. Each of these classes
may have significance.
- Patients have varying diets; many medicines are affected by
- Many new medicines have had inadequate or insufficient testing,
even though the U.S. Food and Drug Administration (FDA) has approved
them. In fact, some of the best test data comes from users of
the medicine once it appears in pharmacies or on store shelves.
This means that data is constantly being updated. In fact, it
may take many years for adequate test data to be constructed
since FDA approval may only require a relatively small user test
base over a short period of time (e.g., months instead of years).
- It is impossible for pharmaceutical companies to examine all
possible interactions before releasing medicine for sale.
- Patient profiles, albeit largely incomplete or littered with
irrelevant information, are invaluable in mining pharmaceutical
data.
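The remark in the list above that exhaustive search becomes intractable when people take multiple medicines can be made concrete by counting candidate combinations. The figure of 1,000 medicines is an illustrative assumption, far smaller than the number of products actually on the market, and the function name is hypothetical.

```python
from math import comb


def candidate_combinations(n_medicines: int, max_taken_together: int) -> int:
    """Number of distinct sets of 2..max_taken_together medicines out of n_medicines."""
    return sum(comb(n_medicines, k) for k in range(2, max_taken_together + 1))


# Illustrative figure only: 1,000 medicines.
for size in (2, 3, 4, 5):
    print(size, candidate_combinations(1_000, size))
```

With 1,000 medicines there are already 499,500 pairs and more than 166 million sets of two or three. Each of those would still have to be crossed with dietary habits, living habits, age, weight, and so on, which is why some form of pruning or sampling is needed in place of exhaustive enumeration.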
A user-interface may be designed to accept all kinds of information
from the user (e.g., weight, sex, age, foods consumed, reactions
reported, dosage, length of usage). Then, based upon the information
in the databases and the relevant data entered by the user, a list
of warnings or known reactions (accompanied by probabilities) should
be reported. Note that user profiles can contain large amounts of
information, and efficient and effective data mining tools need
to be developed to probe the databases for relevant information.
Secondly, the patient's (anonymous) profile should be recorded along
with any adverse reactions reported by the patient, so that future
correlations can be reported. Over time, the databases will become
much larger, and interaction data for existing medicines will become
The amount of existing pharmaceutical information (pharmacological
properties, dosages, contraindications, warnings, etc.) is enormous;
however, this fact reflects the number of medicines on the market,
rather than an abundance of detailed information about each product.
One of the major problems with pharmaceutical data is actually
a lack of information. For example, an FDA commissioner estimated
that only about 1% of serious events are reported to the FDA. Fear
of litigation may be a contributing factor; however, most health
care providers simply don't have the time to fill out reports of
possible adverse drug reactions. Furthermore, it is expensive and
time-consuming for pharmaceutical companies to perform a thorough
job of data collection, especially when most of the information
is not required by law. Finally, we note that the FDA does not require
manufacturers to test new medicines for potential interactions.
Nevertheless, we expect a great increase in the amount of data
about pharmaceutical products in the foreseeable future, due in
large part to increased computerization and consumer/patient awareness.
Reporting (via the Internet) by health care workers can easily be
facilitated. Data collection in hospitals and extended care facilities
is not difficult, and this information is of high quality since
such institutions typically have tailored diets for their patients
and maintain accurate records of treatments, lab tests, and administration
of prescription and OTC products. Furthermore, given the popularity
of the Internet, it is relatively easy for consumers to voluntarily
fill in and submit detailed profiles themselves. In conclusion,
there are likely to be many sources of relevant information, which
together form a very large, but valuable, data repository. We emphasize
that data mining tools will be useful in extracting patterns, and
supporting various queries.
Types of Patterns that May Be Identified through Data Mining
Here are examples of interactions that may be reported by a data
mining system. These examples suggest how hard it can be to detect
interactions, given that they are not obvious, and are not likely
to be detected during testing (or for that matter, during many years
of use) [2,3]. This is where a data mining system (in conjunction
with databases of pharmacological properties, user profiles, etc.)
can be extremely valuable.
- grapefruit juice should be avoided with certain types of antihistamines
such as Seldane and Hismanal because of the possibility of irregular
heart rhythms
- licorice should be avoided with certain heart medicines and
diuretics because of the possibility of increased blood pressure
and cardiac arrest
- high-fiber foods can interfere with the absorption of certain
antidepressants and heart medicines
- pudding should be avoided with certain anticonvulsants such
as Dilantin because of the possibility of seriously weakening
the drug's effect
- broccoli, brussels sprouts, and cabbage should be avoided with
certain anticoagulants such as Coumadin (warfarin) because of
the possibility of blood clots
- Query: I am a diabetic, female, 60 years of age, social drinker,
chain smoker, and currently take Inderal for high blood pressure.
What are the best laxatives that I can take, and what kinds of
side effects should I be aware of?
- Query: I am male, 15 years of age, a competitive basketball
player, ..., and my doctor recently gave me some new wonder drug
called ABC. What are the side effects that people similar to me
have reported (either with ABC, or with medicines having similar
pharmacological properties to ABC), and what is the likelihood
of me having those side effects?
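The second query above ("what side effects have people similar to me reported, and how likely are they?") can be sketched as a filter over patient profiles followed by a frequency count. Everything here is an assumption made for illustration: the field names, the similarity test, and the profiles themselves, which are made up and carry no medical meaning. "ABC" is the placeholder drug name from the query, and "XYZ" is a second hypothetical one.

```python
from collections import Counter


def side_effect_rates(profiles, medicine, is_similar):
    """Relative frequency of each reported reaction among similar users of `medicine`.

    Returns (number of similar users, {reaction: share of those users reporting it}).
    """
    similar = [p for p in profiles if medicine in p["medicines"] and is_similar(p)]
    if not similar:
        return 0, {}
    counts = Counter(r for p in similar for r in p["reactions"])
    return len(similar), {r: c / len(similar) for r, c in counts.items()}


# Made-up anonymous profiles; field names are hypothetical.
profiles = [
    {"sex": "M", "age": 15, "medicines": {"ABC"}, "reactions": {"dizziness"}},
    {"sex": "M", "age": 16, "medicines": {"ABC"}, "reactions": set()},
    {"sex": "M", "age": 17, "medicines": {"ABC", "XYZ"}, "reactions": {"dizziness", "nausea"}},
    {"sex": "F", "age": 60, "medicines": {"ABC"}, "reactions": {"rash"}},
    {"sex": "M", "age": 14, "medicines": {"XYZ"}, "reactions": {"nausea"}},
]


def teenage_male(p):
    return p["sex"] == "M" and 13 <= p["age"] <= 19


n_similar, rates = side_effect_rates(profiles, "ABC", teenage_male)
print(n_similar, rates)
```

The number of similar users is returned alongside the rates because a share computed from a handful of profiles says very little. A real system would also need to handle medicines with similar pharmacological properties, as the query asks, which this sketch does not attempt.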
Problems with the Data
Our pharmaceutical example illustrates some common problems with
data used for data mining. These problems are summarized below:
- Incomplete data
Some data may be missing (e.g., some fields may be left blank
in a user profile, or perhaps the manufacturer has only very
limited test data to report). The question is what to do about
such situations. Sometimes the fact that data is missing is
itself a valuable piece of information (e.g., surgical information
for a patient who has never had surgery, disease information
for a patient who has never been sick). At other times, the
missing data constitutes a genuine problem (e.g., missing diagnostic
information after a test has been performed).
- Noisy data
The fields may contain incorrectly entered information. How
does this affect the certainty factor or confidence level of
- Temporal data
Since databases grow rapidly, how can data be incrementally
added to our results? Is current data "worth more" than data
from, say, a year ago? Data is also subject to change. What
effect should this have in the knowledge discovery process?
Can results be "undone", or must the entire knowledge discovery
process start from scratch to pick up changes?
- An extremely large amount of data
Some datasets can grow significantly over time. How should
such datasets be processed? One option is to perform parallel
processing, whereby n processors each process approximately
1/n of the data in approximately 1/n of the time. Another
option is to avoid processing the entire dataset, and simply
sample the data. Even though this may result in a loss of information
or in a reduced confidence level, perhaps the accuracy vs. efficiency
trade-off warrants such an approach.
- Non-textual data
There are many types of data that need to be manipulated,
including image data, multimedia data (video, sound), spatial
data in Geographic Information Systems, and user-defined data
types.
- Controversial data
There are privacy issues to be considered. Probing databases
for personal information (especially medical information) may
violate privacy laws. For example, using data mining techniques
to create mailing lists of potential customers is controversial.
Even probing government databases for instances of fraud or
criminal intent has privacy implications. Similarly, probing
medical databases "in the interest of science" while trying
to isolate common characteristics among affected individuals
(for a cure to a disease) can be controversial.
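Two of the questions raised in the list above, whether new data can be added incrementally and whether a large dataset can be split across processors, have a simple answer when the mined quantities are plain counts, because counts from separate batches or partitions can be added together. The sketch below assumes co-occurrence counting over made-up records. The function names are hypothetical, and the partitions are processed sequentially here rather than on separate processors.

```python
from collections import Counter
from itertools import combinations


def count_pairs(records):
    """Count how often each pair of items appears together in a record."""
    counts = Counter()
    for record in records:
        counts.update(combinations(sorted(set(record)), 2))
    return counts


def merge(*partials):
    """Combine counts computed independently on separate batches or partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up records: items reported together in one patient profile.
last_year = [["antacid", "hot salsa"], ["antacid", "dairy"], ["antacid", "dairy", "pop"]]
this_month = [["antacid", "dairy"], ["dairy", "pop"]]

# Incremental update: only the new batch has to be counted.
totals = merge(count_pairs(last_year), count_pairs(this_month))

# "Undoing" a batch: subtract its counts instead of starting from scratch.
without_this_month = totals - count_pairs(this_month)

print(totals[("antacid", "dairy")], without_this_month[("antacid", "dairy")])
```

This only works directly for additive statistics. Patterns derived from the counts, such as which pairs exceed a threshold, have to be re-derived from the merged totals, and the ideal 1/n speed-up for n processors ignores the cost of distributing the data and merging the partial results. Weighting recent data more heavily would require keeping counts per time period rather than a single running total.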
[1] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus.
"Knowledge Discovery in Databases: An Overview", Knowledge Discovery
in Databases, Piatetsky-Shapiro and Frawley (eds.), AAAI/MIT
Press, 1991, pp. 1-27.
[2] Joe Graedon and Teresa Graedon. The People's Guide to
Deadly Drug Interactions. New York: St. Martin's Press, 1995.
[3] Joe Graedon and Teresa Graedon. The People's Pharmacy.
New York: St. Martin's Griffin, 1996.