CPSC 504: Project Suggestions and Information

Laks V.S. Lakshmanan, September 2011.

Here are some ideas for possible course projects. Keep in mind that we are not interested in merely understanding a specific paper, implementing the presented algorithms and demonstrating that you understand their contributions well enough to reproduce it yourself. The watchwords are innovation, creativity, and novelty. As always, depth will always bring its rewards (just as it does in paper presentations, in discussions led, and in the subsequent summaries prepared).

In addition to these suggestions, I am happy to listen to other ideas for projects, if you have any of your own. I am also happy to work with you on creating other project ideas. No matter which project you choose, I will be available to discuss with you any technical issues related to your project.

Each project can either take the shape of a paper or an implementation. It’s your choice. Each team may contain up to two people. Regardless of whether you do a project alone or as a team of two, my expectation will be that you do the same amount of work. Teams must be formed by Wednesday, October 3. Project choices should be made by Monday, October 8. Here, then are the project ideas. (More details on the project schedule and milestones will be posted soon.)

Enhancing DB+IR: Quite a few papers have been written up and several prototypes have been developed around the world about the integration of DB and IR using keyword search as a paradigm. Keyword search substantially enhances usability of a database by not requiring the end user to learn complex query languages or even the intricacies of navigating complex database schemas. However, keyword search “misses out” on some of the powerful things you are able to do with a query language. E.g., “finding a pair of employee and boss such that the employee earns more than his/her boss,” one of the famous queries taught in an undergrad course cannot be readily expressed using vanilla keyword search paradigm. Other examples of queries beyond conventional keyword search include

· “find those products whose sale exceeds $100,000 in all the outlets”

· “find the average sales of products grouped by city and quarter”

· “find the top-10 products with the highest revenue”

You can either write up a term paper or develop a small prototype system making use of available commodity software as components. For either option, think about the following questions:

· What are the challenges? There are three major challenges I can tell you right off the bat.

o Ambiguity: keyword queries are inherently ambiguous.

o How do you use keyword search paradigm for expressing the above queries – a largely syntax issue, but a non-trivial one.

o Assuming you tackle the second challenge above, the first challenge implies for a given query, there might be a number of possible answers, owing to ambiguity. How would you score/rank the results?

· What has been done in prior art that you should be aware of? (I will help you find relevant literature and help you in answering any questions you may have from the literature)

· How would you implement these queries such that it leverages available technologies (e.g., Lucene keyword search engine, SQL engine for query evaluation) and still offers reasonable efficiency?

Social Network Computing: Social nets such as facebook and del.ico.us have facilitated massive data, resource and opinion sharing, and in general online discourse. The resources can range from online resources like videos, photos, and blogs to material objects like lawnmowers and bikes. (If you are wondering about what material objects are doing in a social net, talk to me!) The opinions range from politics to impressions about the job done by the recently hired handyman. Additionally, the blogosphere is awash with information. On the one hand, the information in social nets is mostly free text and is thus largely unstructured. Thus, traditional IR techniques as well as techniques based on centrality notions such as pagerank can be effectively used to facilitate search over social nets. On the other hand, though, social nets have their own structure, bringing the notion of a seeker’s social neighborhood to the context of search. Just focusing on search alone, there are interesting questions that need to be studied:

· What is the right granularity of search? One could be searching for blogs, photos, resources, etc., or simply for opinions. Or one could be analyzing opinions.

· Suppose we call the granules above “infons”. What techniques can we employ for effectively searching such a collection of infons? What kind of analyses are enabled by the combination of text data and the social net structure, together chronology of opinions?

· How can we carry out these tasks efficiently?

Privacy Preserving Data Publishing: In the course, you have read or will read papers that deal with the problems of publishing data in a way that protects the privacy of individuals, while still retaining some utility of the data for purposes of mining. One model of privacy is k-anonymity, which basically says in the published data, values of certain attributes are generalized so as to make each tuple (think individual) is indistinguishable from at least k-1 other tuples). This sort of privacy protects individuals from being identified based on their quasi-identifiers (QIDs) on which there may be public external data. You have also learned (or will learn) about l-diversity and other models which go beyond k-anonymity. All of these models and works are set in the context of a single relational table. In this project, you will consider three different data models:

· Multiple tables.

· XML data.

· Social Network Data, where nodes represent users and links correspond to social relationships. Nodes may have properties associated with them. They may also have resources. Think of properties as modeled using attribute-value pairs and resources as a combination of a (possibly empty) set of attribute-value pairs and textual description.

For each data model, you will identify the appropriate privacy model. The privacy model should spell out what it is that is to be protected. You should draw comparisons with known models such as K-anonymity and l-diversity. You will develop appropriate techniques and algorithms for achieving the relevant privacy. There is some flexibility in this project for doing this work for fewer than three data models in exchange for more depth in the work. It is also possible to substitute other models in place of one or more of the above. You should talk to me about the precise nature of this exchange and the kind of depth expected.