Here are some ideas for possible course projects. Keep in mind that we are not interested in your merely understanding a specific paper, implementing its algorithms, and demonstrating that you understand its contributions well enough to reproduce them yourself. The watchwords are innovation, creativity, and novelty. As always, depth brings its rewards (just as it does in paper presentations, in discussions led, and in the subsequent summaries prepared).
In addition to these suggestions, I am happy to hear project ideas of your own, and to work with you on developing them. No matter which project you choose, I will be available to discuss any technical issues related to it.
Each project can take the shape of either a paper or an implementation. It’s your choice. Each team may contain up to two people. Regardless of whether you do a project alone or as a team of two, my expectation will be that you do the same amount of work. Teams must be formed by Wednesday, October 3. Project choices should be made by Monday, October 8. Here, then, are the project ideas. (More details on the project schedule and milestones will be posted soon.)
· “find those products whose sales exceed $100,000 in all the outlets”
· “find the average sales of products grouped by city and quarter”
· “find the top-10 products with the highest revenue”
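To make the three queries concrete, here is a sketch of what they might look like in SQL, over a hypothetical sales(product, outlet, city, quarter, amount) table. The schema and data are invented purely for illustration; your own design may differ.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration:
# one row per sale of a product at an outlet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, outlet TEXT, city TEXT,
                    quarter TEXT, amount REAL);
INSERT INTO sales VALUES
  ('widget', 'o1', 'Austin', 'Q1', 120000),
  ('widget', 'o2', 'Dallas', 'Q1', 150000),
  ('gadget', 'o1', 'Austin', 'Q1',  90000),
  ('gadget', 'o2', 'Dallas', 'Q2', 200000);
""")

# Query 1: products whose sales exceed $100,000 in ALL outlets.
# This is a relational-division query: every outlet must qualify.
q1 = """
SELECT product
FROM (SELECT product, outlet, SUM(amount) AS total
      FROM sales GROUP BY product, outlet)
WHERE total > 100000
GROUP BY product
HAVING COUNT(DISTINCT outlet) =
       (SELECT COUNT(DISTINCT outlet) FROM sales)
"""

# Query 2: average sales of products grouped by city and quarter.
q2 = """
SELECT city, quarter, AVG(amount)
FROM sales GROUP BY city, quarter
"""

# Query 3: top-10 products with the highest revenue.
q3 = """
SELECT product, SUM(amount) AS revenue
FROM sales GROUP BY product
ORDER BY revenue DESC LIMIT 10
"""

for q in (q1, q2, q3):
    print(conn.execute(q).fetchall())
```

Note how different the three queries are structurally (division, grouping, ranking) even though a user might express each of them with only a handful of keywords.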
You can either write up a term paper or develop a small prototype system making use of available commodity software as components. For either option, think about the following questions:
· What are the challenges? There are three major challenges I can tell you right off the bat.
o Ambiguity: keyword queries are inherently ambiguous.
o How do you use the keyword search paradigm to express the above queries? This is largely a syntax issue, but a non-trivial one.
o Assuming you tackle the second challenge above, the first challenge implies that, for a given query, there might be a number of possible answers, owing to ambiguity. How would you score and rank the results?
· What has been done in prior art that you should be aware of? (I will help you find relevant literature and answer any questions it raises.)
· How would you implement these queries so that the implementation leverages available technologies (e.g., the Lucene keyword search engine, a SQL engine for query evaluation) while still offering reasonable efficiency?
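One minimal way to think about the second and third challenges together: enumerate candidate structured interpretations of the keyword query, then rank them by how much of the query each one covers. The sketch below assumes the keyword query has already been segmented into terms; the candidate interpretations and the coverage heuristic are invented for illustration, and a real system would score against schema metadata and data statistics.

```python
def score(query_keywords, interpretation_terms):
    """Fraction of query keywords covered by the interpretation.

    A deliberately crude heuristic: real rankers would also weigh
    term importance and how keywords bind to schema elements.
    """
    q = set(query_keywords)
    return len(q & set(interpretation_terms)) / len(q)

def rank(query_keywords, candidates):
    """Sort (description, covered-terms) candidates, best first."""
    return sorted(candidates,
                  key=lambda c: -score(query_keywords, c[1]))

# A hypothetical keyword query and two candidate interpretations.
keywords = ["average", "sales", "city", "quarter"]
candidates = [
    ("GROUP BY city, quarter with AVG(amount)",
     ["average", "sales", "city", "quarter"]),
    ("GROUP BY product with AVG(amount)",
     ["average", "sales", "product"]),
]
best = rank(keywords, candidates)[0][0]
print(best)
```

The interesting research questions begin exactly where this toy scorer ends: how to generate the candidate interpretations in the first place, and how to break ties among equally covering ones.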
· What is the right granularity of search? One could be searching for blogs, photos, resources, etc., or simply for opinions. Or one could be analyzing opinions.
· Suppose we call the granules above “infons”. What techniques can we employ for effectively searching such a collection of infons? What kinds of analyses are enabled by combining the text data with the social network structure, together with the chronology of opinions?
· How can we carry out these tasks efficiently?
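To make these questions concrete, here is one hypothetical representation of an infon, together with a toy search that combines the three ingredients above: text match, the social-network structure, and chronology. Every field and name here is an assumption for illustration, not a prescription.

```python
from dataclasses import dataclass

# One hypothetical shape for an "infon": a granule that carries
# text, an author in the social network, and a timestamp.
@dataclass
class Infon:
    author: str
    text: str
    timestamp: int          # e.g., seconds since some epoch
    kind: str = "opinion"   # blog, photo, resource, opinion, ...

# follows[u] = the set of users u follows (invented structure).
follows = {"alice": {"bob"}, "bob": set(), "carol": {"alice"}}

infons = [
    Infon("bob", "the new phone has great battery life", 100),
    Infon("alice", "battery drains fast on the new phone", 200),
    Infon("carol", "loved the camera", 300),
]

def search(keyword, reader, infons, follows):
    """Text match + social filter + chronology: return matching
    infons authored by people `reader` follows, newest first."""
    hits = [i for i in infons
            if keyword in i.text
            and i.author in follows.get(reader, set())]
    return sorted(hits, key=lambda i: -i.timestamp)

results = search("battery", "alice", infons, follows)
```

Here "alice" sees only "bob"'s matching opinion, since she follows only him. Whether the social filter should restrict, re-rank, or merely annotate results is itself one of the design questions this project asks you to explore.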
· Multiple tables.
· XML data.
· Social network data, where nodes represent users and links correspond to social relationships. Nodes may have properties associated with them. They may also have resources. Think of properties as modeled using attribute-value pairs and resources as a combination of a (possibly empty) set of attribute-value pairs and a textual description.
For each data model, you will identify an appropriate privacy model. The privacy model should spell out precisely what is to be protected. You should draw comparisons with known models such as k-anonymity and l-diversity. You will develop appropriate techniques and algorithms for achieving the relevant privacy. There is some flexibility in this project to cover fewer than three data models in exchange for more depth in the work. It is also possible to substitute other data models for one or more of the above. You should talk to me about the precise nature of this trade-off and the kind of depth expected.
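As a reference point for the comparison with known models, here is a minimal sketch of the standard k-anonymity condition: every combination of quasi-identifier values must appear in at least k records. The table and column names are invented for illustration; generalizing values (like the starred ZIP codes below) is the usual way to make a table satisfy the condition.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every quasi-identifier tuple occurs in >= k rows."""
    counts = Counter(tuple(row[q] for q in quasi_ids)
                     for row in rows)
    return all(c >= k for c in counts.values())

# A hypothetical released table: zip and age are quasi-identifiers,
# disease is the sensitive attribute.
rows = [
    {"zip": "787**", "age": "20-29", "disease": "flu"},
    {"zip": "787**", "age": "20-29", "disease": "cold"},
    {"zip": "787**", "age": "30-39", "disease": "flu"},
]
ok = is_k_anonymous(rows, ["zip", "age"], 2)
print(ok)  # the ("787**", "30-39") group has only one row
```

Note that k-anonymity alone says nothing about the sensitive values within a group (the problem l-diversity addresses), and neither model transfers directly to XML or social-network data, which is exactly where this project's work lies.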