Here are some ideas for possible course projects. Keep in mind that we are not interested in your merely understanding a specific paper, implementing its algorithms, and demonstrating that you understand its contributions well enough to reproduce them yourself. The watchwords are innovation, creativity, and novelty. As always, depth brings its rewards (just as it does in paper presentations, in discussions led, and in the subsequent summaries prepared).
In addition to these suggestions, I am happy to hear project ideas of your own, and to work with you on developing them. No matter which project you choose, I will be available to discuss any technical issues related to it.
Each project can take the shape of either a paper or an implementation. It’s your choice. Each team may contain up to two people. Regardless of whether you do a project alone or as a team of two, my expectation will be that you do the same amount of work. Teams must be formed by Wednesday, October 3. Project choices should be made by Monday, October 8. Here, then, are the project ideas. (More details on the project schedule and milestones will be posted soon.)
· “find those products whose sales exceed $100,000 in all the outlets”
· “find the average sales of products grouped by city and quarter”
· “find the top-10 products with the highest revenue”
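To make the three queries concrete, here is a sketch of what they might look like in SQL, over a hypothetical sales(product, outlet, city, quarter, amount) table. The schema and data are invented purely for illustration; your own design may differ.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration:
# one row per sale of a product at an outlet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, outlet TEXT, city TEXT,
                    quarter TEXT, amount REAL);
INSERT INTO sales VALUES
  ('widget', 'o1', 'Austin', 'Q1', 120000),
  ('widget', 'o2', 'Dallas', 'Q1', 150000),
  ('gadget', 'o1', 'Austin', 'Q1',  90000),
  ('gadget', 'o2', 'Dallas', 'Q2', 200000);
""")

# Query 1: products whose sales exceed $100,000 in ALL outlets.
# This is a relational-division query: every outlet must qualify.
q1 = """
SELECT product
FROM (SELECT product, outlet, SUM(amount) AS total
      FROM sales GROUP BY product, outlet)
WHERE total > 100000
GROUP BY product
HAVING COUNT(DISTINCT outlet) =
       (SELECT COUNT(DISTINCT outlet) FROM sales)
"""

# Query 2: average sales of products grouped by city and quarter.
q2 = """
SELECT city, quarter, AVG(amount)
FROM sales GROUP BY city, quarter
"""

# Query 3: top-10 products with the highest revenue.
q3 = """
SELECT product, SUM(amount) AS revenue
FROM sales GROUP BY product
ORDER BY revenue DESC LIMIT 10
"""

for q in (q1, q2, q3):
    print(conn.execute(q).fetchall())
```

Note how different the three queries are structurally (division, grouping, ranking) even though a user might express each of them with only a handful of keywords.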
You can either write up a term paper or develop a small prototype system making use of available commodity software as components. For either option, think about the following questions:
· What are the challenges? There are three major challenges I can tell you right off the bat.
o Ambiguity: keyword queries are inherently ambiguous.
o How do you use the keyword search paradigm to express the above queries? This is largely a syntax issue, but a non-trivial one.
o Assuming you tackle the second challenge above, the first challenge implies that, for a given query, there might be a number of possible answers, owing to ambiguity. How would you score and rank the results?
· What has been done in prior art that you should be aware of? (I will help you find relevant literature and answer any questions it raises.)
· How would you implement these queries so that the implementation leverages available technologies (e.g., the Lucene keyword search engine, a SQL engine for query evaluation) while still offering reasonable efficiency?
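One minimal way to think about the second and third challenges together: enumerate candidate structured interpretations of the keyword query, then rank them by how much of the query each one covers. The sketch below assumes the keyword query has already been segmented into terms; the candidate interpretations and the coverage heuristic are invented for illustration, and a real system would score against schema metadata and data statistics.

```python
def score(query_keywords, interpretation_terms):
    """Fraction of query keywords covered by the interpretation.

    A deliberately crude heuristic: real rankers would also weigh
    term importance and how keywords bind to schema elements.
    """
    q = set(query_keywords)
    return len(q & set(interpretation_terms)) / len(q)

def rank(query_keywords, candidates):
    """Sort (description, covered-terms) candidates, best first."""
    return sorted(candidates,
                  key=lambda c: -score(query_keywords, c[1]))

# A hypothetical keyword query and two candidate interpretations.
keywords = ["average", "sales", "city", "quarter"]
candidates = [
    ("GROUP BY city, quarter with AVG(amount)",
     ["average", "sales", "city", "quarter"]),
    ("GROUP BY product with AVG(amount)",
     ["average", "sales", "product"]),
]
best = rank(keywords, candidates)[0][0]
print(best)
```

The interesting research questions begin exactly where this toy scorer ends: how to generate the candidate interpretations in the first place, and how to break ties among equally covering ones.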
· What is the right granularity of search? One could be searching for blogs, photos, resources, etc., or simply for opinions. Or one could be analyzing opinions.
· Suppose we call the granules above “infons”. What techniques can we employ for effectively searching such a collection of infons? What kinds of analyses are enabled by combining the text data with the social network structure, together with the chronology of opinions?
· How can we carry out these tasks efficiently?
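To make these questions concrete, here is one hypothetical representation of an infon, together with a toy search that combines the three ingredients above: text match, the social-network structure, and chronology. Every field and name here is an assumption for illustration, not a prescription.

```python
from dataclasses import dataclass

# One hypothetical shape for an "infon": a granule that carries
# text, an author in the social network, and a timestamp.
@dataclass
class Infon:
    author: str
    text: str
    timestamp: int          # e.g., seconds since some epoch
    kind: str = "opinion"   # blog, photo, resource, opinion, ...

# follows[u] = the set of users u follows (invented structure).
follows = {"alice": {"bob"}, "bob": set(), "carol": {"alice"}}

infons = [
    Infon("bob", "the new phone has great battery life", 100),
    Infon("alice", "battery drains fast on the new phone", 200),
    Infon("carol", "loved the camera", 300),
]

def search(keyword, reader, infons, follows):
    """Text match + social filter + chronology: return matching
    infons authored by people `reader` follows, newest first."""
    hits = [i for i in infons
            if keyword in i.text
            and i.author in follows.get(reader, set())]
    return sorted(hits, key=lambda i: -i.timestamp)

results = search("battery", "alice", infons, follows)
```

Here "alice" sees only "bob"'s matching opinion, since she follows only him. Whether the social filter should restrict, re-rank, or merely annotate results is itself one of the design questions this project asks you to explore.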
· Multiple tables.
· XML data.
· Social network data, where nodes represent users and links correspond to social relationships. Nodes may have properties associated with them. They may also have resources. Think of properties as modeled using attribute-value pairs and resources as a combination of a (possibly empty) set of attribute-value pairs and a textual description.
For each data model, you will identify an appropriate privacy model. The privacy model should spell out precisely what is to be protected. You should draw comparisons with known models such as k-anonymity and l-diversity. You will develop appropriate techniques and algorithms for achieving the relevant privacy. There is some flexibility in this project to cover fewer than three data models in exchange for more depth in the work. It is also possible to substitute other data models for one or more of the above. You should talk to me about the precise nature of this trade-off and the kind of depth expected.
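As a reference point for the comparison with known models, here is a minimal sketch of the standard k-anonymity condition: every combination of quasi-identifier values must appear in at least k records. The table and column names are invented for illustration; generalizing values (like the starred ZIP codes below) is the usual way to make a table satisfy the condition.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every quasi-identifier tuple occurs in >= k rows."""
    counts = Counter(tuple(row[q] for q in quasi_ids)
                     for row in rows)
    return all(c >= k for c in counts.values())

# A hypothetical released table: zip and age are quasi-identifiers,
# disease is the sensitive attribute.
rows = [
    {"zip": "787**", "age": "20-29", "disease": "flu"},
    {"zip": "787**", "age": "20-29", "disease": "cold"},
    {"zip": "787**", "age": "30-39", "disease": "flu"},
]
ok = is_k_anonymous(rows, ["zip", "age"], 2)
print(ok)  # the ("787**", "30-39") group has only one row
```

Note that k-anonymity alone says nothing about the sensitive values within a group (the problem l-diversity addresses), and neither model transfers directly to XML or social-network data, which is exactly where this project's work lies.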