Data Mining

Introduction

Data mining (sometimes called "knowledge discovery in databases") is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data [1]. Data mining involves the identification of patterns or relationships in data. Formally, given a set of facts F, a language L, and some measure of certainty C, a "pattern" is a statement S in L that describes relationships among a subset F' of F with certainty C (substantiated by the database) such that S is simpler than the enumeration of all facts in F'.

Data mining is one of the hottest new database research areas. Why? The amount of information in the world is growing exponentially, and it is becoming impossible to effectively manage that data using traditional database systems. Machine assistance is clearly necessary, but the difficulty lies in designing systems that are capable of discovering "useful" information with minimal human intervention. Programs are just starting to be designed to exploit the vast repositories of data that exist. The first data mining programs were used for relational or transactional data. Recently, systems have been developed to mine spatial or temporal data.

A Simple Data Mining Example Involving Supermarket Data

Consider an example from a supermarket that uses bar-code scanners at the check-out counter. The underlying computer system is designed to identify the name and price of the product being scanned, and to update the inventory list so that the shelves can be restocked at the right time. It may appear that most of this supermarket data can be discarded fairly quickly; however, the datasets contain a great deal of valuable information that can be used for purposes other than those for which it was originally collected. This information can be used to provide executive summaries of sales, to track customer preferences, to gain a competitive edge over other retailers, to decide which items (or combinations of items) should be put on sale, or simply to acquire various kinds of marketing information.

In our supermarket example, the data mining system may point out patterns such as:

Obvious correlations such as the relationship between purchases of diapers and baby food are less interesting from a knowledge discovery point of view than, say, a correlation between dairy products and antacids. Patterns that are "interesting" normally concern relationships that are not obvious, or are unexpected.
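One common way to quantify such correlations is with the support, confidence, and lift of a rule "customers who buy A also buy B". The following is a minimal sketch of that idea, not a method prescribed by this article. The baskets, item names, and function names are hypothetical and chosen only for illustration; confidence plays the role of the certainty measure C from the introduction.

```python
# Hypothetical market baskets; item names and contents are illustrative only.
TRANSACTIONS = [
    {"diapers", "baby food", "milk"},
    {"milk", "antacid"},
    {"cheese", "antacid", "bread"},
    {"diapers", "baby food"},
    {"milk", "cheese", "antacid"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline frequency of the consequent.

    A lift near 1 suggests the items are bought independently;
    a lift well above 1 suggests a positive association.
    """
    baseline = support(consequent, transactions)
    if baseline == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / baseline


cheese_antacid_conf = confidence({"cheese"}, {"antacid"}, TRANSACTIONS)
cheese_antacid_lift = lift({"cheese"}, {"antacid"}, TRANSACTIONS)
```

Note that these measures capture only the strength of a correlation. Whether a rule is "interesting" in the sense described above (not obvious, or unexpected) still requires domain knowledge or a model of what the analyst already expects.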

To date, most of the above types of information (for whatever application domain) have not been exploited. As mentioned earlier, there is simply too much information to process with existing tools. New tools and techniques need to be developed. This is the motivation for our research.

A Complex Example Involving Pharmaceutical Data

For a more complicated example involving more "interesting" patterns, we briefly examine data mining in a pharmaceuticals context. The identification and quantification of the following types of information can be extremely useful for patients, physicians, pharmacists, health organizations, insurance companies, regulatory agencies, investors, lawyers, pharmaceutical manufacturers, drug testing companies, etc.

Given the size of the databases being queried (see below), there is likely to be a trade-off in accuracy of information vs. processing time. Sampling techniques and tests of significance may be satisfactory to identify some of the more common relationships; however, uncommon relationships may require substantial search time. The thoroughness of the search depends on the importance of the query (e.g., life threatening vs. "curious to know"), the indexing structures used, and the level of detail supplied in the query. Of course, the real data mining challenge comes when the user supplies only a minimal amount of information. For example: Find possible serious side effects (not necessarily reported in the manufacturer's product literature) involving food and any type or brand of antacid.
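As one illustration of the "tests of significance" mentioned above, the sketch below computes a Pearson chi-square statistic for a 2x2 table of counts, such as one built from a sample of user profiles. This is an assumed choice of test, not one named in this article, and the counts are hypothetical. The constant 3.841 is the standard critical value for one degree of freedom at the 5% level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows might be "took the antacid with the food" / "did not";
    columns might be "reaction reported" / "no reaction reported".
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_95_DF1 = 3.841

# Hypothetical sampled counts, for illustration only.
statistic = chi_square_2x2(30, 70, 10, 90)
is_significant = statistic > CRITICAL_95_DF1
```

A statistic above the critical value suggests that the association is unlikely to be due to chance in the sample. For rare relationships the expected cell counts become small and this approximation is unreliable, which is consistent with the point above that uncommon relationships may need a more thorough search.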

We begin by noting that there are literally hundreds of thousands of OTC or prescription medicines available [3], and that almost every kind of medicine (e.g., antacids, aspirin, children's cough syrup, heart medicine) can have numerous minor or major side effects [2,3]. We note the following facts:

A user interface may be designed to accept all kinds of information from the user (e.g., weight, sex, age, foods consumed, reactions reported, dosage, length of usage). Then, based upon the information in the databases and the relevant data entered by the user, a list of warnings or known reactions (accompanied by probabilities) should be reported. Note that user profiles can contain large amounts of information, and efficient and effective data mining tools need to be developed to probe the databases for relevant information. In addition, the patient's (anonymous) profile should be recorded along with any adverse reactions reported by the patient, so that future correlations can be reported. Over time, the databases will become much larger, and interaction data for existing medicines will become more complete.
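The lookup step described above can be sketched as follows. Everything in this example is hypothetical: the record layout, the placeholder names (medicine_A, food_X, reaction_1), the counts, and the simple "reported cases divided by exposed profiles" estimate are assumptions made for illustration, not real pharmaceutical data or an established scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRecord:
    """One hypothetical row of an interaction database."""
    medicine: str
    food: str
    reaction: str
    reported_cases: int    # profiles that reported this reaction
    exposed_profiles: int  # profiles that combined this medicine and food

    @property
    def estimated_probability(self):
        if self.exposed_profiles == 0:
            return 0.0
        return self.reported_cases / self.exposed_profiles


# Placeholder records; names and counts are invented for the sketch.
RECORDS = [
    InteractionRecord("medicine_A", "food_X", "reaction_1", 12, 400),
    InteractionRecord("medicine_A", "food_Y", "reaction_2", 3, 600),
    InteractionRecord("medicine_B", "food_X", "reaction_3", 50, 500),
]


def warnings_for(profile_medicines, profile_foods, records):
    """Return records matching the user's profile, most likely reaction first."""
    matches = [
        record for record in records
        if record.medicine in profile_medicines and record.food in profile_foods
    ]
    return sorted(matches, key=lambda r: r.estimated_probability, reverse=True)


report = warnings_for({"medicine_A"}, {"food_X", "food_Y"}, RECORDS)
for record in report:
    print(f"{record.medicine} + {record.food}: {record.reaction} "
          f"(estimated probability {record.estimated_probability:.3f})")
```

A real system would also need to condition on the other profile fields (weight, age, dosage, length of usage) and to add each new anonymous profile back into the counts, as the paragraph above describes.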

The amount of existing pharmaceutical information (pharmacological properties, dosages, contraindications, warnings, etc.) is enormous; however, this fact reflects the number of medicines on the market, rather than an abundance of detailed information about each product.

One of the major problems with pharmaceutical data is actually a lack of information. For example, an FDA commissioner estimated that only about 1% of serious events are reported to the FDA. Fear of litigation may be a contributing factor; however, most health care providers simply don't have the time to fill out reports of possible adverse drug reactions. Furthermore, it is expensive and time-consuming for pharmaceutical companies to perform a thorough job of data collection, especially when most of the information is not required by law. Finally, we note that the FDA does not require manufacturers to test new medicines for potential interactions.

Nevertheless, we expect a great increase in the amount of data about pharmaceutical products in the foreseeable future, due in large part to increased computerization and consumer/patient awareness. Reporting (via the Internet) by health care workers can easily be facilitated. Data collection in hospitals and extended care facilities is not difficult, and this information is of high quality since such institutions typically have tailored diets for their patients and maintain accurate records of treatments, lab tests, and administration of prescription and OTC products. Furthermore, given the popularity of the Internet, it is relatively easy for consumers to voluntarily fill in and submit detailed profiles themselves. In conclusion, there are likely to be many sources of relevant information, which together form a very large, but valuable, data repository. We emphasize that data mining tools will be useful in extracting patterns and in supporting various queries.

Types of Patterns that May Be Identified through Data Mining

Here are examples of interactions that may be reported by a data mining system. These examples suggest how hard it can be to detect interactions, given that they are not obvious, and are not likely to be detected during testing (or for that matter, during many years of use) [2,3]. This is where a data mining system (in conjunction with databases of pharmacological properties, user profiles, etc.) can be extremely valuable.

Problems with the Data

Our pharmaceutical example illustrates some common problems with data used for data mining. These problems are summarized below:

  1. Incomplete data

    Some data may be missing (e.g., some fields may be left blank in a user profile, or perhaps the manufacturer has only very limited test data to report). The question is what to do about such situations. Sometimes the fact that data is missing is itself a valuable piece of information (e.g., surgical information for a patient who has never had surgery, disease information for a patient who has never been sick). At other times, the missing data constitutes a genuine problem (e.g., missing diagnostic information after a test has been performed).

  2. Noisy data

    The fields may contain incorrectly entered information. How does this affect the certainty factor or confidence level of the results?

  3. Temporal data

    Since databases grow rapidly, how can data be incrementally added to our results? Is current data "worth more" than data from, say, a year ago? Data is also subject to change. What effect should this have in the knowledge discovery process? Can results be "undone", or must the entire knowledge discovery process start from scratch to pick up changes?

  4. An extremely large amount of data

    Some datasets can grow significantly over time. How should such datasets be processed? One option is parallel processing, whereby each of n processors handles approximately 1/n of the data, ideally in approximately 1/n of the time. Another option is to avoid processing the entire dataset and simply sample the data. Even though this may result in a loss of information or in a reduced confidence level, the accuracy vs. efficiency trade-off may warrant such an approach.

  5. Non-textual data

    There are many types of data that need to be manipulated, including image data, multimedia data (video, sound), spatial data in Geographic Information Systems, and user-defined data types.

  6. Controversial data

    There are privacy issues to be considered. Probing databases for personal information (especially medical information) may violate privacy laws. For example, using data mining techniques to create mailing lists of potential customers is controversial. Even probing government databases for instances of fraud or criminal intent has privacy implications. Similarly, probing medical databases "in the interest of science" while trying to isolate common characteristics among affected individuals (for a cure to a disease) can be controversial.
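The sampling option from item 4 can be made concrete with a short sketch. It estimates how often some condition holds in a large dataset from a random sample and reports an approximate 95% confidence interval, which is one way to express the "reduced confidence level" mentioned there. The function name, the synthetic dataset, and the use of the normal-approximation (Wald) interval are assumptions made for illustration; the finite-population correction is ignored.

```python
import math
import random


def estimate_rate(records, predicate, sample_size, seed=0):
    """Estimate the fraction of records satisfying predicate from a sample.

    Returns (estimate, lower bound, upper bound) using a normal-approximation
    95% interval. The seed is fixed only to make the sketch reproducible.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    n = len(sample)
    p = sum(1 for record in sample if predicate(record)) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Synthetic dataset: exactly 10% of the records are flagged True.
population = [i % 10 == 0 for i in range(100_000)]

rate, low, high = estimate_rate(population, lambda flag: flag, 2_000)
print(f"estimated rate {rate:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Here only 2% of the records are examined. The width of the interval shrinks roughly with the square root of the sample size, so halving the uncertainty requires about four times as many sampled records.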

References

[1] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. "Knowledge Discovery in Databases: An Overview", Knowledge Discovery in Databases, Piatetsky-Shapiro and Frawley (eds.), AAAI/MIT Press, 1991, pp. 1-27.

[2] Joe Graedon and Teresa Graedon. The People's Guide to Deadly Drug Interactions. New York: St. Martin's Press, 1995.

[3] Joe Graedon and Teresa Graedon. The People's Pharmacy. New York: St. Martin's Griffin, 1996.