CS 534B

CS 534B Topics in Data Management -- Web Data Integration and Management

Fall 2005


Instructor: Laks V S Lakshmanan



Student Talks

Here are some tips that you may find helpful in organizing your talk. You should plan for a talk lasting 30 minutes, including questions. Anyone can ask questions as long as they are not disruptive. Grading will be based on how well the talk addresses the criteria mentioned here, as well as on how well questions are answered.

Here is the schedule of talks.



Important Note: Arguably, the most valuable intellectual, informational, and strategic resource available to humankind today is the World Wide Web. This course is not about teaching Web design, XML syntax, or web service languages (or anything fluffy for that matter). It's more about the intellectual challenges raised by the advent of data on the web, and of XML, and about developing the foundations and techniques for solving information integration problems in diverse applications, including data warehousing, data cleaning, data exchange, privacy and security, integrating data management with information retrieval, and scientific data interoperability.

Brief Course Description

Much information of interest to human society is online today, with experts predicting that most, if not all, of the information in the world will be available online in the future. While traditional database technology has been extremely successful in providing efficient and effective solutions for storing, retrieving, and managing information that is well structured, the proportion of online information that can be attributed solely to databases is relatively small! Think about how much information is locked in non-database information repositories and applications: file systems, spreadsheets, plain text, LDAP-style network directories, HTML pages, etc. Most of the questions that were asked (and conclusively answered) concerning the management of traditional structured data can (and should) be profitably asked of information that is partly or poorly structured and is locked away in applications and tools such as those above.

How can we harness this information? How can we interact across applications? How can we maintain the consistency of information stored this way? The advent of semistructured data models in the research arena and of XML in the technological arena holds much promise for addressing these questions and for developing technologies for integrating information across diverse data stores and applications.

The theme of this year's offering of 534B is thus Web Data Integration and Management. The relational data model, invented for traditional business data processing applications, surprisingly can still serve us in our "hour of need" by offering useful abstractions. Therefore, we begin this course with a brief review of the relational model. We then work our way through issues arising in interoperability across heterogeneous database systems, as a natural transition point to the study of semistructured data and XML, and then of unstructured data, including plain text.

Privacy and security of data play an increasingly important role today. Peer-to-peer data management systems are finding increasing applications and popularity. While traditionally we take the accuracy and correctness of the information in a database for granted, increasingly we are having to cope with inherent uncertainty in the data. These are but some of the many challenges in harnessing the underlying information in the huge amounts of data contained in diverse data stores.

Marking Scheme

  • Assignments 40%
  • Project 55%
  • Class Participation 5%
  • Projects: Check out the project suggestions and related reference/background material.

    The first meeting will be on Monday, September 11, 2005, 9:30-11:00 am, in CICSR/ICICS 104. Regular schedule: MW 9:30-11:00 am, ICICS 104.

    Here is a tentative course outline.

    GOFDB:
  • review of First Order Logic, Relational Query Languages (Relational Algebra, Relational Calculus, and Datalog), and Integrity Constraints.
  • conjunctive query containment and tableau techniques.
  • extensions (recent research) to negation and aggregation.
  • Global Information Systems
  • Integration models -- Global As View and Local As View
  • query answering using views (an application)
  • Dealing with heterogeneity:
  • SchemaLog and SchemaSQL.
  • Schema Integration & Matching.
  • Dropping (rigid) structure:
  • Intro. to Semistructured Data and XML -- data model, DTD, XML Schema
  • XML query languages:
  • Tree Pattern Queries & XPath
  • XQuery
  • The TAX Algebra for XML.
  • "Current Trends":
    We will introduce a series of key recent advances in topics such as DB + information retrieval, security and privacy in data management, P2P DBMS, searching the WWW, and managing uncertain and unclean data. These will set the stage for paper presentations with critiques, and for projects.

    Here is a link to course notes.

    Course Resources:

    There is no single text that adequately covers the desired material. The material will instead be drawn extensively from recent research literature. Here are some books that cover the basics.

  • S. Abiteboul, R. Hull, and V. Vianu: Foundations of Databases, 1995.
  • J.D. Ullman: Principles of Database and Knowledge-base Systems, vol. I & II, 1988.
  • S. Abiteboul, P. Buneman, and D. Suciu: Data on the Web: From Relations to Semistructured Data and XML, 2000.
  • Other important resources include:
  • The DBLP Computer Science Bibliography.
  • The Citeseer Computer Science Bibliography.
  • A growing reading list, constantly under construction.
  • Attention MSS students:

    This might be a new experience for you: this course emphasizes research, innovation, and creativity much more than traditional courses that you may be used to. Make sure you understand the material discussed in class. Do participate in classroom discussions. For projects, I recommend that you include at least one MCS (or CS-PhD) student in your team and work with them closely.