Data Mining

Introduction

Data mining (sometimes called "knowledge discovery in databases") is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data [1]. Data mining involves the identification of patterns or relationships in data. Formally, given a set of facts F, a language L, and some measure of certainty C, a "pattern" is a statement S in L that describes relationships among a subset F' of F with certainty C (substantiated by the database) such that S is simpler than the enumeration of all facts in F'.

Data mining is one of the hottest new database research areas. Why? The amount of information in the world is growing exponentially, and it is becoming impossible to effectively manage that data using traditional database systems. Machine assistance is clearly necessary, but the difficulty lies in designing systems that are capable of discovering "useful" information with minimal human intervention. Programs are just starting to be designed to exploit the vast repositories of data that exist. The first data mining programs were used for relational or transactional data. Recently, systems have been developed to mine spatial or temporal data.

Some Data Mining Resource Sites

Our Database Group's Publications Page (Various papers on data mining and other database-related topics.)
An Introduction to Distance-Based Outliers
An Introduction to Spatial Data Mining
An Introduction to Temporal Data Mining
ACM Special Interest Group on Knowledge Discovery and Data Mining
Data Mining and Knowledge Discovery Journal
Database Research Groups
IBM Almaden's QUEST Data Mining Project
IBM's Advanced Scout
KD Nuggets
On-Line Software for Clustering and Multivariate Analysis
Simon Fraser University's Database Research Group (our friends across town)
The Data Mine (links)

A Simple Data Mining Example Involving Supermarket Data

Consider an example from a supermarket that uses bar-code scanners at the check-out counter. The underlying computer system is supposed to identify the name and price of the product being scanned, and update the inventory list so that the shelves can be restocked at the right time. It may appear that most of this supermarket data can be discarded fairly quickly; however, the datasets contain a great deal of valuable information that can be used for purposes other than those for which it was originally collected. This information can be used to provide executive summaries of sales, to track customer preferences, to gain a competitive edge over other retailers, to figure out which items (or combinations of items) should be put on sale, or simply to acquire various kinds of marketing information.

In our supermarket example, the data mining system may point out patterns such as:

which items are frequently bought in combination (e.g., cereal and milk; wieners, hot dog buns, mustard, and relish; hot salsa and antacid; chips and pop; diapers and baby food)
which items are frequently included in a $100+ grocery bill
which items are frequently bought by families (a family may be identified due to the purchase of certain types of products that are typically aimed at children)
which items are frequently purchased by people making small purchases, with "small" perhaps being defined by a dollar amount, or by the fact that the customer used the express check-out counter
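
The first kind of pattern in the list above (items frequently bought in combination) can be illustrated with a minimal sketch. It assumes each check-out transaction is available as a list of item names, and it counts how often each pair of items appears in the same basket. The function name pair_support, the sample baskets, and the 0.5 threshold are all hypothetical and chosen only for illustration; a real system would need far more scalable techniques.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    if not baskets:
        return {}
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has exactly one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical check-out data: one list of items per scanned basket.
baskets = [
    ["cereal", "milk", "chips"],
    ["cereal", "milk"],
    ["chips", "pop"],
    ["cereal", "milk", "pop"],
]

support = pair_support(baskets)
# Keep only pairs that occur in at least half of the baskets (assumed threshold).
frequent = {pair: s for pair, s in support.items() if s >= 0.5}
print(frequent)
```

Counting every pair in every basket is an exhaustive approach, so its cost grows quickly with basket size and with the number of distinct items. That is one reason why more specialized tools are needed for large datasets.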

Obvious correlations such as the relationship between purchases of diapers and baby food are less interesting from a knowledge discovery point of view than, say, a correlation between dairy products and antacids. Patterns that are "interesting" normally concern relationships that are not obvious, or are unexpected.

To date, most of the above types of information (for whatever application domain) have not been exploited. As mentioned earlier, there is simply too much information to process with existing tools. New tools and techniques need to be developed. This is the motivation for our research.

A Complex Example Involving Pharmaceutical Data

For a more complicated example involving more "interesting" patterns, we briefly examine data mining in a pharmaceuticals context. The identification and quantification of the following types of information can be extremely useful for patients, physicians, pharmacists, health organizations, insurance companies, regulatory agencies, investors, lawyers, pharmaceutical manufacturers, drug testing companies, etc.

interactions among over-the-counter (OTC) medicines
interactions between prescription and OTC medicines
interactions among prescription medicines
interactions between any kind of medicine and various foods, beverages, vitamins, and mineral supplements
common characteristics between certain drug groups and offending foods, beverages, medicines, etc.
distinguishing characteristics among certain drug groups (e.g., for some people, certain antihistamines may not produce an adverse reaction to certain foods, and therefore may be a better choice among the large number of antihistamines on the market)
questionable interactions based on very limited evidence, but which may be of great interest (e.g., a few users out of many thousands of users report a serious, but unusual side effect resulting from some combination of characteristics)
determining which types of patients are likely to be at risk when using a particular medicine

Given the size of the databases being queried (see below), there is likely to be a trade-off in accuracy of information vs. processing time. Sampling techniques and tests of significance may be satisfactory to identify some of the more common relationships; however, uncommon relationships may require substantial search time. The thoroughness of the search depends on the importance of the query (e.g., life threatening vs. "curious to know"), the indexing structures used, and the level of detail supplied in the query. Of course, the real data mining challenge comes when the user supplies only a minimal amount of information. For example: Find possible serious side effects (not necessarily reported in the manufacturer's product literature) involving food and any type or brand of antacid.
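
The trade-off between sampling and thoroughness can be made concrete with a small sketch. It assumes a simple random sample of records has already been drawn and estimates how often a relationship occurs, together with a rough 95% interval based on the normal approximation. The function name proportion_interval and the sample figures are hypothetical, and the normal approximation is itself unreliable for very small counts.

```python
import math


def proportion_interval(hits, n, z=1.96):
    """Estimate a rate from a sample of size n, with a normal-approximation interval.

    Returns (estimate, lower bound, upper bound), clamped to [0, 1].
    """
    p = hits / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical figures: 50 of 1,000 sampled records show the relationship.
common = proportion_interval(50, 1_000)
# A much rarer relationship: only 2 of 1,000 sampled records show it.
rare = proportion_interval(2, 1_000)

print(common)
print(rare)
```

For the common relationship the interval is narrow relative to the estimate. For the rare one the lower bound reaches zero, so a sample of this size cannot distinguish the relationship from chance. This is why uncommon relationships may require a much larger sample or a full search.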

We begin by noting that there are literally hundreds of thousands of OTC or prescription medicines available [3], and that almost every kind of medicine (e.g., antacids, aspirin, children's cough syrup, heart medicine) can have numerous minor or major side effects [2,3]. We note the following facts:

Most medicines interact with food, beverages, cigarettes, physical activities, other medicines, etc. Some interactions are minor; some are bothersome; some are serious; and some are lethal. Some are common; some are uncommon; and some occur only under very specific situations, in certain types of patients.
Some medicines have very few reported side effects. This is great if the medicine has been available for decades, if it has had many users, and if all existing side effects have been reported. (Note the "if's" in the previous sentence. Problems with the data, such as an absence of information or contradictory information, make data mining more difficult -- as we will see below.)
Some medicines have hundreds of possible side effects, but each side effect may only involve a very small percentage of users. Some of these statistics are documented by the manufacturer and may include a probability of occurrence.
New side effects are constantly being reported, especially for medicines that have only been available for a few years.
Prolonged usage can affect patients in different ways. For relatively new medicines, this information may be unavailable, but is expected to be added to the database over time.
The effectiveness of many medicines (with respect to shelf life) deteriorates over time. Depending on storage, some medicines lose their effectiveness quickly (e.g., weeks for nitroglycerine). Some medicines need to be kept stored under strict conditions (e.g., refrigerated; kept in a cool, dry place; well sealed). Also, some medicines may cause serious damage to internal organs if used after the expiry date (e.g., tetracycline).
Many people take multiple medicines. When combined with dietary habits, living habits, age, weight, etc., pattern detection using exhaustive search techniques becomes intractable.
People respond in different ways to different dosages.
Half lives of medicines, and mean-time-to-peak statistics are often known, and may be important in identifying interaction patterns.
Sales or usage statistics are available.
No two users are exactly the same. Users may fall into numerous classes such as male, female, infants, children, adolescents, adults, seniors, pregnant women, nursing mothers, vegetarians, smokers, drinkers, diabetics, athletes, etc. Each of these classes may have significance.
Patients have varying diets; many medicines are affected by diet.
Many new medicines have had inadequate or insufficient testing, even though the U.S. Food and Drug Administration (FDA) has approved them. In fact, some of the best test data comes from users of the medicine once it appears in pharmacies or on store shelves. This means that data is constantly being updated. It may take many years for adequate test data to be constructed, since FDA approval may only require a relatively small user test base over a short period of time (e.g., months instead of years).
It is impossible for pharmaceutical companies to examine all possible interactions before releasing medicine for sale.
Patient profiles, albeit largely incomplete or littered with irrelevant information, are invaluable in mining pharmaceutical information.

A user interface may be designed to accept all kinds of information from the user (e.g., weight, sex, age, foods consumed, reactions reported, dosage, length of usage). Then, based upon the information in the databases and the relevant data entered by the user, a list of warnings or known reactions (accompanied by probabilities) should be reported. Note that user profiles can contain large amounts of information, and efficient and effective data mining tools need to be developed to probe the databases for relevant information. In addition, the patient's (anonymous) profile should be recorded along with any adverse reactions reported by the patient, so that future correlations can be reported. Over time, the databases will become much larger, and interaction data for existing medicines will become more complete.
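
As a rough illustration of how such a report might be produced, the sketch below assumes that anonymous profiles and reported reactions are stored as simple records. It then computes, for users matching a given profile, the share of reports that mention each reaction. The record layout, the function name reaction_rates, the medicine name "ABC", and the sample reports are all hypothetical.

```python
from collections import Counter


def reaction_rates(reports, profile, medicine):
    """Share of matching reports that mention each reaction.

    A report matches if it concerns `medicine` and agrees with every
    attribute given in `profile`; attributes not given are ignored.
    """
    matching = [
        r for r in reports
        if r["medicine"] == medicine
        and all(r["profile"].get(k) == v for k, v in profile.items())
    ]
    if not matching:
        return {}
    # Count each reaction at most once per report.
    counts = Counter(x for r in matching for x in set(r["reactions"]))
    return {reaction: n / len(matching) for reaction, n in counts.items()}


# Hypothetical anonymous reports; values are invented for illustration only.
reports = [
    {"medicine": "ABC", "profile": {"sex": "male", "smoker": False}, "reactions": ["nausea"]},
    {"medicine": "ABC", "profile": {"sex": "male", "smoker": False}, "reactions": []},
    {"medicine": "ABC", "profile": {"sex": "male", "smoker": True}, "reactions": ["nausea", "rash"]},
    {"medicine": "ABC", "profile": {"sex": "female", "smoker": False}, "reactions": ["rash"]},
]

rates = reaction_rates(reports, {"sex": "male"}, "ABC")
print(rates)
```

The values returned are rates among the reports on file, not true probabilities. They are only as reliable as the reporting behind them, and exact attribute matching would find few or no records for very detailed profiles.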

The amount of existing pharmaceutical information (pharmacological properties, dosages, contraindications, warnings, etc.) is enormous; however, this fact reflects the number of medicines on the market, rather than an abundance of detailed information about each product.

One of the major problems with pharmaceutical data is actually a lack of information. For example, an FDA commissioner estimated that only about 1% of serious events are reported to the FDA. Fear of litigation may be a contributing factor; however, most health care providers simply don't have the time to fill out reports of possible adverse drug reactions. Furthermore, it is expensive and time-consuming for pharmaceutical companies to perform a thorough job of data collection, especially when most of the information is not required by law. Finally, we note that the FDA does not require manufacturers to test new medicines for potential interactions.

Nevertheless, we expect a great increase in the amount of data about pharmaceutical products in the foreseeable future, due in large part to increased computerization and consumer/patient awareness. Reporting (via the Internet) by health care workers can easily be facilitated. Data collection in hospitals and extended care facilities is not difficult, and this information is of high quality since such institutions typically have tailored diets for their patients and maintain accurate records of treatments, lab tests, and administration of prescription and OTC products. Furthermore, given the popularity of the Internet, it is relatively easy for consumers to voluntarily fill in and submit detailed profiles themselves. In conclusion, there are likely to be many sources of relevant information, which together form a very large, but valuable, data repository. We emphasize that data mining tools will be useful in extracting patterns and supporting various queries.

Types of Patterns that May Be Identified through Data Mining

Here are examples of interactions that may be reported by a data mining system. These examples suggest how hard it can be to detect interactions, given that they are not obvious, and are not likely to be detected during testing (or for that matter, during many years of use) [2,3]. This is where a data mining system (in conjunction with databases of pharmacological properties, user profiles, etc.) can be extremely valuable.

grapefruit juice should be avoided with certain types of antihistamines such as Seldane and Hismanal because of the possibility of irregular heart rhythms
licorice should be avoided with certain heart medicines and diuretics because of the possibility of increased blood pressure and cardiac arrest
high-fiber foods can interfere with the absorption of certain antidepressants and heart medicines
pudding should be avoided with certain anticonvulsants such as Dilantin because of the possibility of seriously weakening the drug's effect
broccoli, Brussels sprouts, and cabbage should be avoided with certain anticoagulants such as Coumadin (warfarin) because of the possibility of blood clots
Query: I am a diabetic, female, 60 years of age, social drinker, chain smoker, and currently take Inderal for high blood pressure. What are the best laxatives that I can take, and what kinds of side effects should I be aware of?
Query: I am male, 15 years of age, a competitive basketball player, ..., and my doctor recently gave me some new wonder drug called ABC. What are the side effects that people similar to me have reported (either with ABC, or with medicines having similar pharmacological properties to ABC), and what is the likelihood of me having those side effects?

Problems with the Data

Our pharmaceutical example illustrates some common problems with data used for data mining. These problems are summarized below:

Incomplete data
Some data may be missing (e.g., some fields may be left blank in a user profile, or perhaps the manufacturer has only very limited test data to report). The question is what to do about such situations. Sometimes the fact that data is missing is itself a valuable piece of information (e.g., surgical information for a patient who has never had surgery, disease information for a patient who has never been sick). At other times, the missing data constitutes a genuine problem (e.g., missing diagnostic information after a test has been performed).
Noisy data
The fields may contain incorrectly entered information. How does this affect the certainty factor or confidence level of the results?
Temporal data
Since databases grow rapidly, how can data be incrementally added to our results? Is current data "worth more" than data from, say, a year ago? Data is also subject to change. What effect should this have in the knowledge discovery process? Can results be "undone", or must the entire knowledge discovery process start from scratch to pick up changes?
An extremely large amount of data
Some datasets can grow significantly over time. How should such datasets be processed? One option is to perform parallel processing, whereby n processors each process approximately 1/n of the data in approximately 1/n of the time. Another option is to avoid processing the entire dataset, and simply sample the data. Even though this may result in a loss of information or in a reduced confidence level, perhaps the accuracy vs. efficiency trade-off warrants such an approach.
Non-textual data
There are many types of data that need to be manipulated, including image data, multimedia data (video, sound), spatial data in Geographic Information Systems, and user-defined data types.
Controversial data
There are privacy issues to be considered. Probing databases for personal information (especially medical information) may violate privacy laws. For example, using data mining techniques to create mailing lists of potential customers is controversial. Even probing government databases for instances of fraud or criminal intent has privacy implications. Similarly, probing medical databases "in the interest of science" while trying to isolate common characteristics among affected individuals (for a cure to a disease) can be controversial.
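
The parallel-processing option mentioned under "An extremely large amount of data" can be sketched as a split, count, and merge procedure. The sketch assumes the quantity of interest is a simple count of items that can be computed independently on each chunk. The function names and sample records are hypothetical. Threads are used here only to show the structure; in CPython, a process pool or a cluster would be needed for real speed-up on CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n):
    """Split records into n chunks of roughly equal size."""
    return [records[i::n] for i in range(n)]


def count_items(chunk):
    """Count item occurrences within one chunk of records."""
    return Counter(item for record in chunk for item in record)


def parallel_count(records, n):
    """Count items by processing n chunks independently, then merging."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(count_items, split(records, n)))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merging partial counts is a simple addition
    return total


# Hypothetical records, each a list of items.
records = [["cereal", "milk"], ["chips", "pop"], ["cereal", "pop"], ["milk"], ["cereal"]]
totals = parallel_count(records, 2)
print(totals)
```

The same merge step suggests one answer to the incremental-update question raised under "Temporal data": counts for newly arrived records can be added to existing totals without reprocessing old data. This works only for results that can be combined in this way; many kinds of patterns cannot be merged so simply.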

References

[1] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. "Knowledge Discovery in Databases: An Overview", Knowledge Discovery in Databases, Piatetsky-Shapiro and Frawley (eds.), AAAI/MIT Press, 1991, pp. 1-27.
[2] Joe Graedon and Teresa Graedon. The People's Guide to Deadly Drug Interactions. New York: St. Martin's Press, 1995.
[3] Joe Graedon and Teresa Graedon. The People's Pharmacy. New York: St. Martin's Griffin, 1996.