dbTalks: Fall 2003

Contact: Wendy Hui Wang and Jessica Zhao Zheng, dbTalks Organizers


  1. Friday, Sept 19, 2003 at 3:00pm in MCML160

    TITLE: DB Lab Orientation Session
    SPEAKER: Faculty and graduate students in the DB lab

  2. Thursday, Sept 25, 2003 at 4:00pm in CICSR104

    TITLE: Quotient Cube and QC-Tree: Efficient Summarizations for Semantic OLAP
    SPEAKER: Maryann Yan Zhao (Dept of CS, UBC)

    Most data cube compression approaches have focused on reducing cube size but have overlooked preserving the semantics of the data cube. The quotient cube was proposed as a summary structure for a data cube that retains its semantics. It can be constructed very efficiently and it leads to a significant reduction in cube size. However, many issues regarding the application of the quotient cube remain unaddressed. For instance, no efficient storage and index structures were provided, no specific algorithm for query answering was given, and incremental maintenance against updates was not discussed. In this thesis work, we aim to develop proper solutions to the above open problems. First, we introduce a sorted list to index a direct tabular representation of the quotient cube and give the associated query answering algorithms. Since a tabular representation with an additional index structure is neither as compact as it could be nor efficient to maintain, we next propose a more efficient data structure, called the QC-tree, to store and index the quotient cube. We present a depth-first-search-based algorithm for constructing the QC-tree directly from the base table. Third, we devise efficient algorithms for answering various queries and for incrementally maintaining the QC-tree against base table updates. An extensive experimental study illustrates the space and time savings achieved by our algorithms. Finally, we implement a quotient-cube-based data warehouse system to demonstrate our research accomplishments.
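    The core idea can be pictured with a small sketch (not from the talk; the toy base table, dimension values, and helper names below are invented for illustration): cube cells that aggregate exactly the same set of base tuples form one equivalence class, so the cube can store a single aggregate per class instead of one per cell.

    ```python
    from itertools import product

    # Toy base table: (store, product) -> sales. '*' in a cube cell means "all".
    base = {("S1", "P1"): 6, ("S1", "P2"): 12, ("S2", "P1"): 9}

    def cover(cell):
        """Set of base tuples that a cube cell aggregates over."""
        return frozenset(t for t in base
                         if all(c == "*" or c == v for c, v in zip(cell, t)))

    # Enumerate every cube cell over the two dimensions (including '*').
    stores, products = {"S1", "S2", "*"}, {"P1", "P2", "*"}
    classes = {}  # cover -> cells sharing it (one quotient class)
    for cell in product(stores, products):
        cov = cover(cell)
        if cov:  # skip empty cells
            classes.setdefault(cov, []).append(cell)

    # One stored aggregate per class serves all of its cells.
    summary = {tuple(sorted(cells)): sum(base[t] for t in cov)
               for cov, cells in classes.items()}
    ```

    On this toy table, the 8 non-empty cube cells collapse into 6 classes; for example, the cells (S2, *) and (S2, P1) cover the same single tuple and share one aggregate.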

  3. Thursday, Oct 2, 2003 at 4:00pm in CICSR104

    TITLE: Searching the Web
    SPEAKER: Zhimin Chen (Dept of CS, UBC)

    We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. We draw for this presentation from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and on the results of several performance analyses we conducted to compare different designs.

    REFERENCE: Searching the Web - Arasu, Arvind; Cho, Junghoo; Garcia-Molina, Hector; Paepcke, Andreas; Raghavan, Sriram

  4. Thursday, Oct 9, 2003 at 4:00pm in CICSR104

    TITLE: Mining Newsgroups Using Networks Arising From Social Behavior
    SPEAKER: Xiaodong Zhou (Dept of CS, UBC)

    Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy information than text. We investigate the feasibility of applying link-based methods in new application domains. The specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. A typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author. This social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. An interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text.

    REFERENCE: Mining Newsgroups Using Networks Arising From Social Behavior - Rakesh Agrawal, Sridhar Rajagopalan, Ramakrishnan Srikant, Yirong Xu
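    The camp-partitioning step can be sketched on a toy example (the authors' actual formulation is more sophisticated; the graph and the simple local-search procedure below are invented for illustration): since a reply edge usually signals disagreement, a two-way split that cuts as many edges as possible approximates the two camps.

    ```python
    # Toy "responded-to" graph: an edge means one author replied to the other,
    # which in newsgroups usually signals disagreement.
    edges = [("ann", "bob"), ("ann", "carl"), ("dave", "bob"),
             ("dave", "carl"), ("ann", "eve"), ("dave", "eve")]
    authors = {a for e in edges for a in e}

    def cut_size(camp_a):
        """Number of edges crossing between camp_a and the rest."""
        return sum((u in camp_a) != (v in camp_a) for u, v in edges)

    # Local-search max cut: flip any author whose switch enlarges the cut.
    camp_a = set(list(authors)[: len(authors) // 2])
    improved = True
    while improved:
        improved = False
        for a in authors:
            flipped = camp_a ^ {a}
            if cut_size(flipped) > cut_size(camp_a):
                camp_a, improved = flipped, True
    ```

    On this graph the search always separates {ann, dave} from {bob, carl, eve}, cutting all six reply edges, which matches the intuition that each camp argues with the other rather than with itself.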

  5. Thursday, Oct 16, 2003 at 4:00pm in CICSR104

    TITLE: Concise Descriptions of Subsets of Structured Sets
    SPEAKER: Shaofeng Bu (Dept of CS, UBC)

    We study the problem of economical representation of subsets of structured sets, that is, sets equipped with a set cover. Given a structured set U, and a language L whose expressions define subsets of U, the problem of Minimum Description Length in L (L-MDL) is: given a subset V of U, find a shortest string in L that defines V. We show that the simple set cover is enough to model a number of realistic database structures. We focus on two important families: hierarchical and multidimensional organizations. The former is found in the context of semistructured data such as XML, the latter in the context of statistical and OLAP databases. In the case of general OLAP databases, data organization is a mixture of multidimensionality and hierarchy, which can also be viewed naturally as a structured set. We study the complexity of the L-MDL problem in several settings, and provide an efficient algorithm for the hierarchical case. Finally, we illustrate the application of the theory to summarization of large result sets, (multi) query optimization for ROLAP queries, and XML queries.

    REFERENCE: Concise Descriptions of Subsets of Structured Sets - A. Mendelzon, Ken Q. Pu, Univ of Toronto

  6. Thursday, Oct 30, 2003 at 4:00pm in CICSR104

    TITLE: Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
    SPEAKER: Yuhan Cai (Dept of CS, UBC)

    Similarity search in large time series databases has attracted much research interest recently. It is a difficult problem because of the typically high dimensionality of the data. The most promising solutions involve performing dimensionality reduction on the data, then indexing the reduced data with a multidimensional index structure. Many dimensionality reduction techniques have been proposed, including Singular Value Decomposition (SVD), the Discrete Fourier Transform (DFT), and the Discrete Wavelet Transform (DWT). In this work we introduce a new dimensionality reduction technique which we call Adaptive Piecewise Constant Approximation (APCA). While previous techniques (e.g., SVD, DFT and DWT) choose a common representation for all the items in the database that minimizes the global reconstruction error, APCA approximates each time series by a set of constant-value segments of varying lengths such that their individual reconstruction errors are minimal. We show how APCA can be indexed using a multidimensional index structure. We propose two distance measures in the indexed space that exploit the high fidelity of APCA for fast searching: a lower-bounding Euclidean distance approximation, and a non-lower-bounding but very tight Euclidean distance approximation, and show how they can support fast exact searching, and even faster approximate searching, on the same index structure. We theoretically and empirically compare APCA to all the other techniques and demonstrate its superiority.

    REFERENCE: Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases - Kaushik Chakrabarti, Eamonn Keogh, Sharad Mehrotra, Michael Pazzani
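    The flavor of APCA, approximating a series by constant segments of varying lengths, can be illustrated with a simple bottom-up merge (this is not the paper's algorithm, which fits segments to minimize individual reconstruction errors more cleverly; the sketch below is an invented approximation):

    ```python
    # Each segment is (start, end, mean). Start with one segment per point and
    # repeatedly merge the adjacent pair whose merge adds the least squared
    # error, until only k segments remain.
    def apca(series, k):
        segs = [(i, i + 1, float(v)) for i, v in enumerate(series)]

        def merged(a, b):
            s, e = a[0], b[1]
            chunk = series[s:e]
            return (s, e, sum(chunk) / len(chunk))

        def err(seg):
            s, e, m = seg
            return sum((v - m) ** 2 for v in series[s:e])

        while len(segs) > k:
            i = min(range(len(segs) - 1),
                    key=lambda j: err(merged(segs[j], segs[j + 1])))
            segs[i:i + 2] = [merged(segs[i], segs[i + 1])]
        return segs
    ```

    For example, `apca([1, 1, 1, 5, 5, 9], 3)` produces three segments of lengths 3, 2, and 1, adapting segment boundaries to where the series actually changes, which is exactly the property that distinguishes APCA from fixed-length piecewise approximations.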

  7. Thursday, Nov 6, 2003 at 4:00pm in CICSR104

    TITLE: A Parallel Programming Infrastructure to Support Interactive Parallel Data Mining
    SPEAKER: Peiqun Yu (Dept of CS, UBC)

    There are two main projects that are currently active. The first is the development of a parallel programming infrastructure to support interactive parallel data mining. This work is based on the premise that data mining is more effective when the user can participate in the discovery process, and that the user must be part of the data mining loop in order to guide it. However, data mining is both data and compute intensive, and parallel computing, in particular cluster computing, is a useful tool for achieving the response times necessary to make this approach feasible. We have focused on three data mining activities: calculation of correlation matrices, datacube querying, and cluster analysis. There are two other important aspects to this project: a) it is an open system that allows the addition of new tools and the ability for these tools to interact; b) it supports the interactive mining of large datasets. A consequence of this goal is that the algorithms must at any time be able to respond with their "current best" result, even after reading only a portion of the data. The user can adjust the response time of the algorithms to give partial results on which to base decisions about future explorations. We currently have a prototype of the tool running that supports the three data mining activities mentioned above. The tool has a Java frontend and can connect via CORBA to an arbitrary compute cluster running MPI/LAM.

  8. Thursday, Nov 13, 2003 at 3:00pm in CICSR104

    TITLE: Updating XML
    SPEAKER: Pingdong Ai (Dept of CS, UBC)

    As XML has developed over the past few years, its role has expanded beyond its original domain as a semantics-preserving markup language for online documents, and it is now also the de facto format for interchanging data between heterogeneous systems. Data sources export XML "views" over their data, and other systems can directly import or query these views. As a result, there has been great interest in languages and systems for expressing queries over XML data, whether the XML is stored in a repository or generated as a view over some other data storage format.
    Clearly, in order to fully evolve XML into a universal data representation and sharing format, we must allow users to specify updates to XML documents and must develop techniques to process them efficiently. Update capabilities are important not only for modifying XML documents, but also for propagating changes through XML views and for expressing and transmitting changes to documents. This paper begins by proposing a set of basic update operations for both ordered and unordered XML data. We next describe extensions to the proposed standard XML query language, XQuery, to incorporate the update operations. We then consider alternative methods for implementing update operations when the XML data is mapped into a relational database. Finally, we describe an experimental evaluation of the alternative techniques for implementing our extensions.
    REFERENCE: Updating XML - Igor Tatarinov, Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld - SIGMOD Conference

  9. Thursday, Nov 20, 2003 at 4:00pm in CICSR104

    TITLE: Predicting source code changes by mining revision history
    SPEAKER: Annie Ying (Dept of CS, UBC)

    Software developers are often faced with modification tasks that involve source code spread across a code base. Some dependencies between source code, such as the dependencies between platform-dependent fragments, cannot be determined by existing static and dynamic analyses. To help developers identify relevant source code during a modification task, we have developed an approach that applies data mining techniques to determine change patterns---files that were changed together frequently in the past---from the revision history of the code base. Our hypothesis is that the change patterns can be used to recommend potentially relevant source code to a developer performing a modification task. We show that this approach can reveal valuable dependencies by applying the approach to the Eclipse and Mozilla open source projects, and by evaluating the predictability and interestingness of the recommendations produced for actual modification tasks on these systems.
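    The co-change idea can be sketched in a few lines (the revision history, file names, support threshold, and `recommend` helper below are invented for illustration; the talk's approach mines richer patterns): count how often pairs of files are committed together, keep frequent pairs, and recommend a file's frequent partners.

    ```python
    from itertools import combinations
    from collections import Counter

    # Hypothetical revision history: each entry is the set of files
    # committed together in one transaction.
    history = [
        {"parser.c", "parser.h", "lexer.c"},
        {"parser.c", "parser.h"},
        {"lexer.c", "main.c"},
        {"parser.c", "parser.h", "main.c"},
    ]

    # Count how often each pair of files changes together.
    pair_count = Counter()
    for commit in history:
        for pair in combinations(sorted(commit), 2):
            pair_count[pair] += 1

    min_support = 3  # keep only pairs co-changed at least this often
    patterns = {p for p, n in pair_count.items() if n >= min_support}

    def recommend(changed_file):
        """Files historically changed together with `changed_file`."""
        return {f for p in patterns for f in p
                if changed_file in p} - {changed_file}
    ```

    Here `recommend("parser.c")` suggests `parser.h`, the one file that cleared the support threshold alongside it.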

  10. Thursday, Nov 27, 2003 at 4:00pm in CICSR104

    TITLE: Subset-conjunctive Rules for Breast Cancer Diagnosis
    SPEAKER: Prof. Ramesh Krishnamurthy (Dept of CS, SFU)

    The objective of this study was to distinguish, within a population, patients with and without breast cancer. The study was based on the University of Wisconsin's dataset of 569 patients, of whom 212 were subsequently found to have breast cancer. A subset-conjunctive model is described to distinguish between the two groups of patients. We formulate the problem of inferring subset-conjunctive rules as a 0-1 integer program, show that it is NP-hard, and prove that it admits no polynomial-time constant approximation algorithm. We examine the performance of a randomized algorithm and of randomization using LP rounding. In both cases, the expected performance ratio is arbitrarily bad. We use a deterministic greedy algorithm to identify a Pareto-efficient set of subset-conjunctive rules; we describe how the rules change with a re-weighting of the type-I and type-II errors, how the best rule changes with the subset size, and how much of a tradeoff between the two types of error is required when selecting a more stringent or more lax classification rule. An important aspect of the analysis is that we find a sequence of closely related efficient rules, which can be readily used in a clinical setting because they are so simple and have the same structure as the rules currently used in clinical diagnosis. (Joint work with Rajeev Kohli and Kamel Jedidi at Columbia University)
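    A minimal sketch of evaluating one such rule (the feature names, toy patient records, and chosen thresholds below are all invented; the talk's model selects the subset and threshold optimally): a subset-conjunctive rule classifies a patient as positive when at least k conditions from a chosen subset hold.

    ```python
    # Toy records: (observed binary features, has_cancer). All values invented.
    patients = [
        ({"radius_high", "texture_high"}, True),
        ({"radius_high"}, False),
        ({"texture_high", "smooth_low"}, True),
        (set(), False),
    ]

    def rule(features, subset, k):
        """Positive iff at least k conditions from `subset` hold."""
        return len(features & subset) >= k

    # One candidate rule: any 2 of these 3 conditions.
    subset, k = {"radius_high", "texture_high", "smooth_low"}, 2
    errors = sum(rule(f, subset, k) != y for f, y in patients)
    ```

    Classification errors of this form (split into type-I and type-II counts) are what the greedy algorithm trades off when it sweeps out the Pareto-efficient set of rules.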


Last Update: Oct. 15, 2003  
