Tools Selection for Teaching Data Warehousing and Data Mining

ANDY C. H. LAW

Kwantlen University College

ABSTRACT

The application of data warehousing (DW) and data mining (DM) concepts in data analysis and decision making has attracted a great deal of interest in the information technology (IT) industry. In a DW and DM course, students learn how a data warehouse is developed through a methodological approach. They learn different techniques in each development phase according to the methodology, and use different tools to perform the techniques. Since DW and DM cover various technologies such as database technology, statistics, visualization, information science, and machine learning, software tools selection during the course development becomes a major concern. In this paper, I describe two different approaches of tools selection. The first approach uses an integrated tool by a single vendor, and the second uses different tools by various vendors. In each approach, I measure the quality of learning by making observations and interviews. I find that the students will be more confidence in solving DW and DM problems using the second approach. 

Keywords: Data Warehousing, Data Mining, Software Tools Selection

1. Introduction

The emergence of DW was initially a consequence of the observation by W. Inmon in the early 1990s that online transaction processing (OLTP) and online analytical processing (OLAP) cannot coexist efficiently in the same database environment, mostly due to their very different transaction characteristics [1]. Recently, the application of DW and DM concepts in data analysis and decision making has attracted a great deal of interest in the IT industry. Due to the strong demand of trained data warehouse developers, Kwantlen University College (KUC), B.C., Canada, has been offering a fourth year level course in DW and DM in its Bachelor of Technology in IT (BTECH-IT) program since year 2000. Similar to the development of most information systems, students are required to learn the methodology, the techniques and the tools. Students in the course learn how an enterprise data warehouse is developed through a methodological approach, followed by how specialized types of data warehouse such as data marts and OLAP systems are built from the enterprise model. In addition, they will learn how DM concepts can be applied to mine the warehouse data. During each development phase according to the methodology, students learn different techniques and use different tools to perform the techniques. In section two, I explore the relationship between tools and learning activities using activity theory. In section three, I describe the DW and DM infrastructure, and explain why tools selection is a major concern. In section four, I introduce two approaches of tools selection. The first approach uses an integrated tool by a single software vendor, and the second uses different tools by various vendors. In section five, I evaluate both approaches based on students’ performance using observation and interviews.

2. Tools and Learning

In this section, I explore the relationship between tools and learning activities and justify why tools selection is important in high quality learning. In fact, a DW and DM course can be viewed as an activity system. Activity systems, which are developed based on activity theory [2], are the basic units of analysis to alleviate problems that afflict the approaches to learning. In figure 1, the activity system has six components. Essential to the system are the relations between subjects (instructor and students), the community (DW and DM developers) of which they are members, and the object of their activities (DW and DM development). Such relations are mediated by a number of factors, including the tools (books, software tools) used by the subjects, the rules (course objectives, norms in techniques) that link subjects to their communities, and the roles (instructor educates students) adopted by the community.

Some instructors may allow students to select any tools to perform the techniques in DW and DM development. In fact, the activity system suggests that any direct relationships between tools and other components should be taken into consideration in tools selection.  Thus the concerns of the students, the instructor, and the community are equally important. If one of these components is ignored, tensions may appear in the system. In real life, developers in the community usually prefer an integrated tool by a single vendor. However, a college graduate who possesses knowledge of using only one tool is definitely not a competitive applicant for jobs in the community. Tensions in an activity have been explained in the activity theory by contradictions. These contradictions are the major factors of poor quality learning. Thus, it is justifiable to educate students with different tools by various vendors in order to minimize the number of tensions in learning. Besides, an integrated tool always comes with a dialog driven wizard to help developers to step through the whole development process. However, such a computer-driven and activity-oriented approach should be minimized when learning the techniques. In accordance with the constructivist learning environments design, instructors should educate students to perform tasks instead of activities [3]. With such a task-oriented approach, students are better able to deal with DW and DM problems, and better apply their knowledge to a novel situation. In designing the DW and DM course at KUC, I have identified eleven techniques in accordance with a generic methodology, and have designed a separate task for each technique. Since year 2001, I have been selecting a number of tools. Some tools may semi- or fully automate the tasks, while some are just programming tools that require students to write computer programs to complete the tasks.

3.    Tools for Teaching Data Warehousing and Data Mining

The DW and DM infrastructure, which is depicted in Fig. 2, exhibits various layers of data. The operational data stored in the enterprise database systems formed the bottom layer. Data in this layer are usually heterogeneous and stored in different database schema. The middle layer contains reconciled data, which are the result of the integration of various operational data and database schema [4]. Such a layer represents the data warehouse for the enterprise. Data transformation tools, data capturing tools, data scrubbing tools are used during the integration process. Loading tools and indexing tools provided by most database management systems (DBMSs) are used to populate the warehouse with data. According to Inmon [1], a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process. The next layer, which contains highly aggregated data derived from the reconciled data, supports user activities such as query processing and trend analysis. Dart marts and OLAP systems are built on the top of the aggregated layers [5]. Data marts, which are subsets of the data warehouse, are created with querying tools provided by most DBMSs. OLAP systems provide users with an analytical engine to view data in a multi-dimensional ways, and are add-on analysis tools to some commercial DBMSs [6,7]. DM used to be a subset of the activities associated with the data warehouse but has gained independence today. DM supports inductive information discovery, which is the process of searching the data using various investigatory models, methods, and techniques, and then forming its own hypothesis during a decision-making process [8,9]. Tools selection for teaching DW and DM concepts is difficult because the concepts are confluences of multiple disciplines including database technology, statistics, visualization, information science, machine learning and more.

Tools selection becomes more difficult because there is not a standard set of DW and DM tools in the industry. In mid 1990s, many companies used separate tools to create their data warehouse because there were not integrated data warehouse development tools offered by vendors such as IBM, Oracle and Microsoft. Today, each of these vendors claims that she can provide developers an integrated tool. A developer in real-life definitely prefers an integrated tool by a single vendor in order to avoid the high cost involved in maintaining multiple repositories created during the data transformation, capture, scrubbing, and loading processes. However, many issues are raised if an integrated tool is adopted for teaching DW and DM. Since there is not a standard set of tools in the field, integrated software tools by different vendors are having very different user interfaces and using very different terminology. As a result, students who have already adapted to one particular integrated tool tend to refuse to use another.

4.    Methodologies, techniques, and tools

The development of a data warehouse is similar to that of any information system [10]. First, a methodology is chosen. Second, techniques are suggested in each phase by the methodology. Third, tools are selected to perform the techniques. There is quite a lot of discussion about the methodology for the design of a data warehouse [11,12]. Each approach has its pros and cons. Table 1 depicts a generic top-down methodology, the techniques involved, and tools adopted by KUC’s DW and DM course. Students in the course are required to complete twelve tasks. Each task requires students to perform a technique.

Task

Methodology – techniques

An integrated tool was used in year 2000

Different tools were used in year 2001

Different tools were used in year 2002

1

Requirement Analysis

 

Oracle Designer’s (CASE tool) Process Modeler, Dataflow Diagrammer,

ER Diagrammer

ER/Win (CASE tool)

Visio (Drawing tool),

ER/Win (CASE tool)

2

Data Transformation

Oracle Server’s Import Utility

Microsoft Excel’s Import Utility

Microsoft Excel’s Import Utility

3

Data Capture

No tools are used

No tools are used

Custom Programs in VB/Java/C++

4

Data Scrubbing

No tools are used

No tools are used

Custom Programs in VB/Java/C++

5

Logical Design

Oracle Designer’s Design Transformer

ER/Win’s Database Schema Generation

ER/Win’s Database Schema Generation

6

Physical Design

Oracle Designer’s Design Transformer

ER/Win’s Database Schema Generation

ER/Win’s Database Schema Generation

7

Implementation of data warehouse

Oracle Server and Client

Oracle Server and Client

Oracle Server and Client

8

Load data warehouse with data

Oracle’s SQL Loader

Oracle’s SQL Loader

Oracle’s SQL Loader

9

Implementation of data marts

Oracle Server and its SQL engine

Oracle Server and its SQL engine

Oracle Server and its SQL engine

10

Implementation of OLAP systems

Oracle’s Express Server

Microsoft Excel’s pivot tables

Microsoft SQL Server’s Analysis Services – OLAP engine

11

Implementation of DM solutions

No tools are used

No tools are used

Microsoft SQL Server’s Analysis Services – DM Component

12

Maintenance and Model Evolution

Oracle Designer’s Capture Design

ER/Win’s Reverse Engineering

ER/Win’s Reverse Engineering

In year 2000, I selected an integrated tool by Oracle [13,14] to support my teaching. Students were required to perform most of the tasks using Oracle’s Computer Aided Software Engineering (CASE) tool. Since year 2001, I have been using another approach of selecting different tools by various vendors. With the new approach, students use a CASE tool only during requirement analysis, logical design, physical design, and maintenance. They use other independent tools or even write computer programs to perform data transformation, capture, and scrubbing. They implement the data warehouse in Oracle Server, but use Microsoft SQL Server’s Analysis Services [15] to perform both OLAP and DM.

5.    Evaluation

Evaluation on tools selection was carried out in the past three years with an attempt to collect, analyze and interpret information through observations and interviews in order to make informed decisions that enhance quality of learning. Two approaches of tools selection are evaluated. The first uses an integrated tool by a single vendor, and the second uses different tools by various vendors. Both observation and interview were done on ten students selected randomly from each class. The results are summarized in table 2 and table 3.

Observations

An integrated tool was used in teaching during year 2000

Different tools were used in teaching during year 2001/02

Students were asked to develop a data warehouse data model. I wanted to find out how they approached the problem with the tool.

Students were provided with only the Oracle tool. Most students spent lots of time to figure out how the tool was used instead of how the model was created. 

Students were given a choice of using either a drawing tool like Visio, or a CASE tool. Most students started drawing the models on paper manually before deciding which tool to be used.

Students were asked to develop a data warehouse data model. I wanted to find out how they approached the problem without the tool.

Most students tried to recall all steps as suggested by the tool. They drew the model in a way similar to the one generated by the tool. 

Most students tried to recall all steps as suggested by the methodology. They drew the model with additional comments such as assumption, and database constraints.

Students were asked to draw a data warehouse data model by hand. I wanted to find out what kinds of notation they would use.

Most students tended to use the exact notation adopted by the Oracle’s tool. They thought that the notation was the only notation for data warehouse modeling.

Most students used IE or IDEFIX notation that they learned before in their database management classes. They understood that any data modeling notation could be used to document the data warehouse model.

Students were asked to perform logical and physical data warehouse designs. I wanted to find out how they approached both tasks with the assigned tools.

All students used Oracle Designer’s Design Transformer to perform both tasks.  They simply accepted the default settings suggested by the tool. Thus, they completed both tasks in minutes. 

All students used ER/Win to perform both tasks. However, ER/Win isn’t a product of Oracle and it actually supports many different target database platforms. Thus, students needed to specify the target database platform and the detail of the database schema. Students were spending hours or more in both tasks.

Students were asked to perform data analysis using an OLAP system. I wanted to find out how they approached the task with the assigned tools.

All students used Oracle Express Server as the OLAP system to perform data analysis. Since the tool came with a dialog driven wizard, it drove students through the whole process. Students completed the task in minutes.

All students used Microsoft Excel’s pivot table or SQL Server’s Analysis Services as OLAP systems to perform the task. Since data was stored in Oracle’s DBMS, students needed toexport the data into a format that could be accepted by the OLAP systems. Later, students found a smarter approach to import data into the OLAP system directly from Oracle’s DBMS using a middleware such as the ODBC APIs. Students were spending days in the task.

Interviews

An integrated tool was used in teaching during year 2000

Different tools were used in teaching during year 2001/02

Students were asked if they have the confidence to apply the DW and DM concepts in their work in the future using the same tools as given by the college.

All students had the confidence.  

All students had the confidence.

Students were asked if they have the confidence to apply the DW and DM concepts in their work in the future using different tools other than ones given by the college.

Very few students had the confidence. Most students claimed that they were not exposed to other commercial tools and thus were lack of the confidence.  

Most students had the confidence.

Students were asked how they would select a tool in their future.  If the tool could not handle one or more tasks, how would they handle the situation?

All students said they preferred the same tool that they learned in the course. They would probably get help from the vendor if the tool could not handle some tasks. They believed the vendor would have some suggestions.

Some students preferred an integrated tool, but some had no preference. If the tool was not able to handle a task, all students believed that they could come up some solutions. The solutions could be customized computer programs written in Visual Basic, Java or C++ language.

The integrated tool approach was used in year 2000. From the results, it is found that students became less proactive and had not much confidence in problem solving, and their activities were mainly driven by the integrated tool. Being educated with an integrated tool approach, students were concerned if they would be using the same tool in real-life. Obviously, students are lack of confidence to apply the learned techniques if the same tool is not given. Such a result has violated the course objectives [16,17]. Since year 2001, I have been using another approach with an attempt to improve the quality of learning. The approach of using different tools by various vendors has brought students both challenges and incentives in learning. Students find challenging to complete all tasks with different tools, and prepare themselves as highly competitive applicants for jobs in the industry. Most of the challenges come from moving data from one environment to another. Students are required to put in extra effort to identify some appropriate middleware for data migration. As a result, it is the students, not the tools, who drive all tasks. Being educated with different tools by various vendors, students are able to perform required techniques regardless what the tool is. Some students could even write programs in Visual Basic, Java or C++ to performed tasks, including data capture and scrubbing, because these tasks may not be automated by the selected tools. From the results, I find that students are more confidence in solving DW and DM problems with the approach of using different tools by various vendors. Such a learning outcome is what I, as an instructor, would like to see.

6.    Conclusion

Due to the strong demand of DW and DM developers in the IT industry, KUC has been offering in its BTECH-IT program a DW and DM course since year 2000. I have described in this paper the methodology, techniques and tools adopted by the course. Since DW and DM are complex topics that involve many different technologies, I have presented two approaches of tools selection for teaching and developing the course. The first uses an integrated tool by a single vendor, and the second uses different tools by various vendors. I have evaluated both approaches through observation and interviews. I have found that the second approach outperforms the first one in such a way that students become more proactive and confidence when applying DW and DM concepts and techniques in problem solving.   

References

  1. W.H. Inmon. “Building the data warehouse” John Wiley & Sons, 2002.
  2. J. Lave. “The Practice of Learning in Understanding Practice: Perspectives on Activity and Context” Cambridge University Press, 1993.
  3. D.H. Jonassen. “Designing Constructivist Learning Environments” Ch. 10 in Instructional-Design Theories and Models: A New Paradigm of Instructional Theory, vol. II. Lawrence Erlbaum Associates, 1999.
  4. Y. Arens, C. Y. Chee, C. Hsu, C. A. Knoblock. “Retrieving and integrating data from multiple information sources” Journal of Intelligent and Cooperative Information Systems, 1993.
  5. S. Chaudhuri, U. Dayal. “An overview of data warehousing and OLAP technology” SIGMOD, 1997.
  6. R. Agrawal, A. Gupta, S. Sarawagi. “Modeling multidimensional databases” IBM Almaden Research Center, 1995.
  7. L. Cabibbo, R. Torlone. “A logical approach to multidimensional databases” In proceedings of the 6th International Conference on Extending Database Technology, Springer-Verlag, 1997.
  8. G. Linoff, M. B Michael, and J. A. Berry. “Data Mining Techniques” John Wiley & Sons, 1997.
  9. Olivia Parr Rud. “Data Mining Cookbook” John Wiley & Sons, 2001.
  10. C. Adamson, M. Venerable. “Data warehouse design solutions” John Wiley & Sons, 1998.
  11. R. Kimball. “The data warehouse toolkit” John Wiley & Sons, 1996.
  12. R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. “The Data warehouse lifecycle toolkit” John Wiley & Sons, 1998.
  13. Oracle Corporation. “Oracle server utilities user guide” http://www.oracle.com/ , 2000.
  14. Oracle Corporation. “Oracle Express Server user guide” http://www.oracle.com/ , 2000.
  15. Microsoft Corporation. “Microsoft SQL Server user’s guide” http://www.microsoft.com/ , 2000.
  16. Kwantlen University College. “INFO4330 course outline: Data Warehousing and Data Mining”, 1999.
  17. Kwantlen University College. “INFO4330 course outline (revised): Data Warehousing and Data Mining”, 2002.

Andy C. H. Law
Department of Computing Sciences and
Information Systems
Kwantlen University College

12666
-72nd Ave., Surrey, B.C., Canada V3W 2M8
Andy.Law@kwantlen.ca