Project Information

One of the major components of this class is the project. The point of this project is to delve further into some aspect that we have been studying. You may do your project either alone or in groups of two to three. The amount of work expected from the project is commensurate with the number of people working on it (i.e., you personally are expected to put in the same amount of work on a project regardless of whether you're working alone or in a group). Keep in mind that I do not require that this project be an implementation. A literature survey is a perfectly fine project. This project should not eat your life.

Schedule

January 27: 1-page proposal due. This should include:
- What problem(s) you want to solve
- What is going to be new and challenging about it
- How you will try to solve the problem(s)
- What problems you don't consider to be part of the project (i.e., non-goals)
- What resources you need that you don't already have
- Who is on your team, if you are working in a team. Teams are strongly encouraged.
Week of February 2: feedback on proposal returned to you
February 26: 4-page midterm status report due; this should describe what you have done, what you have left to do, roadblocks you've encountered, interesting or unexpected questions or issues that you uncovered, etc. Included in this report should be 1-2 pages of literature search for related work; this should include both a written component comparing your project to related work and a bibliography. Note that this checkpoint is largely a chance for you to get the feedback that you need. While it is not graded, students who have not made a good effort on this checkpoint often wind up not making a good effort on the project overall and thus not doing well.
Week of March 2: feedback on status report returned to you
March 31 - April 9: Project presentations. Precise schedule TBD (it's first come, first served via requests e-mailed to me; requests are only accepted after proposals are turned in), but everyone should be prepared for March 31. Your presentation should be ~15 minutes long, with at least 3 minutes of that time reserved for questions and answers. Here is what I expect out of the presentation (not necessarily in this order):
- A good description of what the project is (inputs, outputs, etc.)
- The motivation for why this project is interesting - why did you choose to do it, and why should we care about the problem
- A discussion of what makes this project non-trivial (especially for the more research-oriented projects)
- A description of how the project fits into the context of the class
- A presentation of the results thus far
- A discussion of what results you expect to get by the project deadline
- A discussion of the difficulties or surprises that you had when working on the project
- If your project is in a group, everyone must speak.
Tuesday, April 14, 5:00pm: Final report due, along with a group evaluation for those working in groups (see below). The final report must be a full-length conference-style paper discussing your project. (It should be roughly equivalent to 10-14 single-column pages. Note that this is a rough guideline. It is okay to go a bit over this, particularly if you're working in a large group - this is just meant to help you decide if you're in the right ballpark.) You should model your paper on some of the papers we've read this term. Either a PDF by e-mail or a hard copy in my box is fine. If you give a hard copy, I'd appreciate an e-mail letting me know that you've turned it in. Note that I don't care about the format; I only specify the length in single-column pages because otherwise people ask if I mean single- or double-column pages. The goal of saying "conference-style paper" is that I want you to do things like:
- Motivate the problem that you're working on
- Provide an example of a scenario where you'd use your solution
- Tell me about the solution that you've created; this includes telling me about what makes the problem interesting and hard. If you'd like, you can interpret this as telling me what problems you ran into.
- Relate it to related work
- Tell me about potential future work - even if you have no intention of ever doing it. Just like in a real conference paper, the goal is for you to show that you know what some of the flaws are with your system, even if you have no intention of solving them. ;)
Note that some of you won't actually create a solution, but just explore the literature, which is fine. In this case, your job is to explore the strengths and weaknesses of the approaches, and, if you feel like there's an obvious choice, say what you would do if you were going to implement a solution to the problem. Note that I do not care about the layout.
In addition to the report, I want each person who is working on a project in a group to SEPARATELY turn in a report on how they felt that all of the group members (yourself included) contributed to the project. Useful information is: what parts of the project you did (e.g., if you divided the work by sections, who did what section), how many hours you estimate that you worked, and how well you feel that you and the other people in the project did.

Project Ideas

Here are some ideas that would be appropriate for the course project. The best project ideas are likely to come from you; however, here are some that you can use as is or use to think of new ones. The projects can run the gamut from all theory to having a heavy implementation component. I'll add more project ideas as I come up with them.

Most database research topics that you would like to pursue. Keep in mind that I do mean research topics; implementing a database application does not qualify. Feel free to send me mail or come by to talk about what qualifies as a good project.
Helping users to create an ontology or schema is a well understood process. Explore the best methodologies for doing so, especially focusing on open source software.
I'm beginning to work with Siobhan McElduff on 18th and 19th century book catalogs. Right now the raw data has been scanned in, but it's in no condition to actually be used. We need to do some pre-processing using database cleaning and processing techniques. Some things that would make good projects are:
- Entity reconciliation - how can we tell which book entries are referring to the same books? This is especially problematic because the same book may, in some ways, be both one entity and many entities at the same time. It will usually be in multiple sizes, multiple bindings within sizes, and then in multiple volume ranges (3, 4, 6, 2 in 1). So a certain edition of Shakespeare of a specific date and publisher could easily be in 20 or more forms in the same catalogs. On top of that, some of the items are secondhand, and that adds to the range of possible versions of the same book.
- We'd like to have a hierarchy of the genres of the different types of books. This will be difficult because the hierarchy is going to be different across different catalogs and time. Additionally, the creator of the catalogs was a marketing innovator who kept experimenting with sections, so the same item might move from being in the history category to the travel one or even classical authors. Then each section title is repeated for each book size... and so forth. On top of that, he sold the shop, and each subsequent owner put his own spin on things.
- Data cleaning - there will be many typos
- Reconciling and comparing prices. This could be tricky both because prices have to be tracked over time and because of the old English monetary system.
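As a rough illustration of the price item above, here is a minimal sketch of one way to make prices comparable: convert each one to a single unit (pence). It relies on the pre-decimal English system of 12 pence to the shilling and 20 shillings to the pound. The input format ("1l. 11s. 6d."), the function name to_pence, and the regular expression are all assumptions made for illustration; the actual catalog data may be written quite differently, and other denominations (e.g., guineas) are not handled.

```python
import re

PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20

# Assumed (hypothetical) notation: optional pounds "l", shillings "s", pence "d",
# e.g. "1l. 11s. 6d." or "10s. 6d.". Real catalog entries may not look like this.
_PRICE = re.compile(
    r"^\s*(?:(\d+)\s*l\.?)?\s*(?:(\d+)\s*s\.?)?\s*(?:(\d+)\s*d\.?)?\s*$"
)


def to_pence(price: str) -> int:
    """Convert a pounds/shillings/pence string to a total number of pence."""
    match = _PRICE.match(price)
    # The pattern also matches an empty string, so require at least one amount.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized price: {price!r}")
    pounds, shillings, pence = (int(g) if g else 0 for g in match.groups())
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence


if __name__ == "__main__":
    # Once everything is in pence, prices can be sorted or compared directly.
    entries = ["1l. 11s. 6d.", "10s. 6d.", "3d."]
    for entry in sorted(entries, key=to_pence):
        print(entry, "=", to_pence(entry), "pence")
```

Converting to one integer unit only addresses the notation; comparing prices from different years would still need some way of accounting for changes in value over time.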
A student of mine did his thesis on a system for providing a suite of tools to help people better manage their experience with multiple data sources (e.g., keep track of changes made to data downloaded from a database into an Excel spreadsheet). His thesis was to define the overall shape of the system. Pick one of the components and implement it or otherwise improve the details of it. (e-mail me if you want to see the thesis)
Some civil engineers are designing a web-based decision support system for sustainable asset management that uses the knowledge hidden in both structured and unstructured data to enhance decision making. Knowledge could be trapped in emails, blogs, historical reports, scanned documents stored on computers or external hard drives, or in live data coming in from sensors and internet data sources. The more useful information we can extract from the big data coming in, the more informed the decisions we make. Text mining would be a useful tool to mine the knowledge from text-based documents, and statistical analysis would be a useful tool to mine the knowledge from structured data. Some possible projects in this space:
1. Design a web application that allows the user to interact with R (statistical software) without knowing how to program in R. The user should be able to perform basic statistical analysis on data stored in a database on the local machine or in an online database, write the results back to another database, and present the results in tables and graphs.
2. Design a web application that allows the user to connect to his/her email account (e.g., Gmail), perform text mining on the data, and store the result of such text mining analysis in an online database. Some text mining analysis processes to consider: information retrieval; linguistic analysis; pattern recognition; co-reference; relationship extraction; sentiment analysis; quantitative text analysis; etc.
3. Design a web application that allows the user to connect to a computer, perform text mining on the data stored on that computer, and then store the result of such text mining analysis in an online database. Some text mining analysis processes to consider: information retrieval; linguistic analysis; pattern recognition; co-reference; relationship extraction; sentiment analysis; quantitative text analysis; etc.
The Global Legal Entity Identifier (GLEI for short) is a dataset collected from various sources for the Global LEI watch project. "LEIs are designed to be a single, universal standard identifier for any organisation or firm involved in a financial transaction internationally". The dataset here contains data on organisations/companies in different locations and their corresponding identifiers. We are interested in integrating the GLEI data with other sources. More details are here.
Throughout this course, we'll talk about how the concepts that we study relate to your data. Choose some part of your data that is difficult to manage using current data management techniques/software. Describe what would need to change in order for your data to be managed effectively. Relate to readings both in class and out of class.

A word on plagiarism

Your project, as with all of your work, is to be your own work. If you take ideas from anywhere else, you have to cite them, and if you take words from somewhere else, they have to be quoted and cited (taking names of things is okay without quotes as long as they are well cited, but if you're taking more than that, you need to have it in quotes). Copying other people's text or figures and claiming them as your own is not okay; it is plagiarism.

What does this mean precisely? Let's say that this webpage is your source [1]. If you were writing something about the first paragraph, it might look something like the following:

504 includes a class project which can be done either individually or in groups [1]. Overall, it shouldn't be too bad; in particular, it "should not eat your life" [1].

Note that the first sentence is paraphrased, so it has just been cited. The second sentence contains a direct quote, so it has been put in quotation marks along with having a citation.

To make sure that you don't plagiarize, always add in citations where appropriate as you are working on your paper. Never cut and paste text and put it in your work without putting it in quotation marks. Do not rely on the fact that you will come back later and change the wording.

If you find yourself thinking "there's no point in my writing this differently, the source that I'm looking at has written it better than I could", I offer you the following words of wisdom: (1) I don't care if they wrote it better; you can't plagiarize. (2) In each case where I have detected plagiarism, the plagiarized sections were the WORST part of the paper, since they were generally just cut and pasted from other sources without regard to the context that the project is supposed to be about. So do us both a favour, save us both a lot of grief, and don't do it. You'll learn more and turn in a better result.

Resources

If you are looking for relevant papers, here are some suggestions:

DBLP is a fantastic bibliography, with links to papers, for databases and logic programming.
Google Scholar also has a search engine that can be quite helpful, since it indexes more than just the metadata about the paper.

For any source, you want to make sure that you're reading the best papers. One way that will often, though not always, lead you in the right direction, is to look at the highly rated venues. In data management, some of those are:

Conferences:

SIGMOD
PODS (theory)
VLDB
EDBT
ICDE

Journals:

TODS
VLDB Journal
TKDE

Rachel Pottinger
E-mail Address: rap [at] cs [dot] ubc [dot] ca

Office Location: ICCS 345
Phone: (604)822-0436
Fax:(604)822-5485
Postal/Courier address:
The Department of Computer Science
University of British Columbia
201-2366 Main Mall
Vancouver, B.C. V6T 1Z4
Canada