Some of Our Research:
This project aims to help communities better evaluate different design options against the most important performance measures (e.g., cost, energy consumption, and quality of life).
The sharing of heterogeneous data among organizations is a common need in information management. In many applications, multiple heterogeneous data sources may wish to coordinate their data so that changes made to one data source are also reflected in related data sources. In this way, the manager of a data source can ensure it is up to date and consistent with the latest data provided by other related data sources of interest.
Most data warehouses are built bottom-up: at design time, a data warehouse schema is constructed on the basis of all local schemas of the data sources to be integrated, and at run time the warehousing process starts with the identification of sources. Our goal, instead, is to produce a new methodology for top-down integration driven by a conceptual model. This conceptual model is a semantic layer that a business user employs to specify what information is needed and in what form; the system must then satisfy the user's request in a (semi-)automatic fashion.
Databases often contain dirty or inconsistent information in the form of erroneous stored values. Some of these inconsistencies can be captured by a set of known facts about the data, expressed as integrity constraints such as key and foreign key constraints, (conditional) functional dependencies, and aggregate constraints. These constraints can be exploited for two purposes: (i) enhancing the quality of the data by cleaning it, and (ii) obtaining consistent answers to queries posed over the data, without necessarily cleaning it. Our work focuses on the foundational aspects of these goals, in particular the complexity and algorithmic aspects of the problems involved. Many problems related to data cleaning and consistent query answering are intractable, and much of the previous work resorts to heuristics in the face of this intractability. We instead focus on understanding the hardness of these problems and on devising efficient approximation algorithms whenever a problem is intractable.
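As a toy illustration of the kind of constraint mentioned above (not the project's actual algorithms), the sketch below detects violations of a functional dependency X -> Y, a typical first step in data cleaning; the table and attribute names are made up.

```python
def fd_violations(rows, lhs, rhs):
    """Return the groups of rows that agree on `lhs` but disagree on `rhs`."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups.setdefault(key, []).append(row)
    # keep only groups whose rhs values are not all identical
    return {
        key: grp for key, grp in groups.items()
        if len({tuple(r[a] for a in rhs) for r in grp}) > 1
    }

employees = [
    {"emp_id": 1, "dept": "Sales", "dept_city": "Boston"},
    {"emp_id": 2, "dept": "Sales", "dept_city": "Chicago"},  # conflicts with row 1
    {"emp_id": 3, "dept": "HR", "dept_city": "Boston"},
]
# The dependency dept -> dept_city is violated by the two Sales rows.
violations = fd_violations(employees, ["dept"], ["dept_city"])
```

Consistent query answering would then, for example, still return the Sales rows for queries that do not touch dept_city, without deciding which city value is correct.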
Building Taxonomies from Documents and Social Media
Social bookmarking systems have recently gained interest among researchers in the area of data mining because they provide a huge amount of annotations and reflect the interests of millions of people. The question to address is how we can exploit this huge volume of tagged data to build a structure that reflects the underlying organization of user tags.
A great deal of communication and information management is needed during the construction of civil infrastructure. The complexity of projects and the distribution of work across physical locations create a need for information management. One specific problem in this area is how to integrate unstructured text documents (such as meeting minutes) with structured data sources (such as computer-aided design models). The goal of this project is to develop algorithms for extracting relevant taxonomies and graphs from two types of data: social media, such as collaborative tagging sites, and text documents. In the former case, we will build the relevant graph by analyzing the tags users assign to resources and applying association rule mining algorithms. In the latter case, we will develop algorithms for extracting taxonomies from a collection of civil engineering documents.
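A minimal sketch of the first step, building a graph from tagging data: tags that frequently co-occur on the same resource become connected, in the spirit of support-based association rule mining. The annotation triples and support threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def tag_graph(annotations, min_support=2):
    """Edges between tags that co-occur on the same resource often enough."""
    tags_per_resource = {}
    for user, resource, tag in annotations:
        tags_per_resource.setdefault(resource, set()).add(tag)
    pair_counts = Counter()
    for tags in tags_per_resource.values():
        for a, b in combinations(sorted(tags), 2):
            pair_counts[(a, b)] += 1
    # keep only pairs with enough supporting resources
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

annotations = [  # (user, resource, tag)
    ("u1", "r1", "python"), ("u1", "r1", "data-mining"),
    ("u2", "r2", "python"), ("u2", "r2", "data-mining"),
    ("u3", "r3", "python"), ("u3", "r3", "web"),
]
edges = tag_graph(annotations)
```

The resulting weighted graph is the raw material from which a taxonomy-like structure over tags can then be derived.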
I am working on mining interesting patterns in social networks. Seeing their friends' actions, users are sometimes tempted to perform those actions themselves. We study the propagation of such influence and, on this basis, identify which users are leaders when it comes to setting the trend of performing various actions. We developed a pattern-mining-based framework to discover leaders from community actions (CIKM paper). We also built a visualization tool, Gurumine (Gurumine paper), on top of it to visualize the propagation of influence in very large social networks. Currently, we are working on extracting leadership qualities, i.e., what qualities make a user a leader, for what kinds of followers, and for what kinds of actions. In addition, we are building a machine-learning-based framework to capture influence propagation and predict users' future actions in social networks. We also study how this framework can help solve viral marketing problems.
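A crude illustration of the underlying signal (not the framework from the CIKM paper): from a timestamped action log, count how often a user's action is later repeated by a friend within a time window. The log format, friendship graph, and window are all made up.

```python
friends = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
log = [  # (time, user, action), sorted by time
    (1, "a", "buy_phone"),
    (2, "b", "buy_phone"),
    (3, "c", "buy_phone"),
    (4, "b", "join_group"),
]

def influence_counts(log, friends, window=5):
    """For each user, count friend repetitions of their actions within `window`."""
    counts = {}
    for i, (t1, u1, act1) in enumerate(log):
        for t2, u2, act2 in log[i + 1:]:
            if (act2 == act1 and u2 in friends.get(u1, set())
                    and 0 < t2 - t1 <= window):
                counts[u1] = counts.get(u1, 0) + 1
    return counts

leaders = influence_counts(log, friends)
```

High counts flag candidate leaders; the real problem is harder because genuine influence must be separated from coincidence and external factors.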
Online Analytical Processing (OLAP) software allows for the real-time analysis of data stored in a database. The OLAP server is normally a separate component that contains specialized algorithms and indexing tools to efficiently process data mining tasks with minimal impact on database performance. Data is stored in a highly structured database called a data warehouse. OLAP tools provide elaborate query languages that allow users to group and aggregate data in various ways in order to explore interesting trends and patterns. Working with these tools requires sophisticated database users, who in practice form a different group from the managers and analysts who could benefit from OLAP. Devising easy-to-use techniques that efficiently provide business insights is therefore critical.
Keyword search over a collection of text documents is one of the most popular services people use every day. Given the simplicity and popularity of this search method, we are interested in extending existing data warehouses with keyword search capabilities. The process involves finding all of the possible ways a potentially ambiguous keyword query could be translated into a structured query in a language such as MDX or SQL, i.e., the interpretations of the query. Since there can be many interpretations, ranking them and finding the top-k is of great importance. We are specifically interested in a ranking process individualized for each user by leveraging his or her past interactions with the system. From another perspective, individualized ranking of interpretations can be seen as a recommender system that deals with data in a structured space such as a data warehouse and must provide flexible recommendations in response to keyword queries.
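The enumerate-then-rank idea can be sketched as follows, with a toy schema and made-up names: each keyword matches several warehouse bindings, every combination of bindings is one interpretation, and a stand-in personalization signal (past click counts) ranks them.

```python
from itertools import product

schema = {  # keyword -> candidate (dimension, member) bindings
    "jordan": [("customer.name", "Jordan Lee"), ("geo.country", "Jordan")],
    "sales": [("measure", "sales_amount")],
}
clicks = {("geo.country", "Jordan"): 5, ("customer.name", "Jordan Lee"): 1}

def top_interpretations(keywords, k=2):
    """Enumerate one binding per keyword and rank combinations by past clicks."""
    options = [schema[w] for w in keywords if w in schema]
    combos = list(product(*options))
    combos.sort(key=lambda c: -sum(clicks.get(b, 0) for b in c))
    return combos[:k]

top = top_interpretations(["jordan", "sales"])
```

Each top interpretation would then be compiled into an actual MDX or SQL query; the hard parts are the combinatorial explosion of interpretations and learning a per-user ranking, which this sketch ignores.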
Although the project targets data cubes in data warehouses, it is applicable to other problem domains where recommendations need to be made in a dynamic and flexible setting over structured or semi-structured collections of data.
We are working on a recommender system that recommends by example: a user submits a "dream item" and the system recommends similar items. Existing recommender systems do the similarity computation in a black box and then present the results in a ranked list. Such lists are difficult to explore because, no matter which ranking function the system uses, the items that interest the user end up dispersed throughout the list. Since the similarity computation was done in a black box, the user has no idea where to look for the desired item and has no choice but to examine many items before finding it.
We aim to propose a technique that is much more exploration-friendly. We dynamically identify the set of categories that are interesting to a user and that describe the items similar to the given item, partition the results into these categories, select the top k categories to show the user, and rank each category internally. This way the user sees a description of each category (e.g., actor = Johnny Depp & genre = drama for movies) and can choose to explore a category only if it genuinely interests him or her, which is not possible in existing systems. There are many technical challenges involved. The major one is to efficiently identify, for a given user, the set of interesting categories that describe the items similar to the example item. Given a set of similar items, how do we efficiently categorize them? How do we measure the meaningfulness of a category to a given user? We develop a notion of "social significance" that tries to capture how well-liked a category is within a user's network.
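A toy version of the categorization step: label each candidate category by an attribute = value pair and rank categories by how many similar items they cover; in the real system, coverage would be replaced by the "social significance" score. The movie data is made up.

```python
from collections import defaultdict

movies = [  # items already judged similar to the user's example item
    {"title": "A", "actor": "Johnny Depp", "genre": "drama"},
    {"title": "B", "actor": "Johnny Depp", "genre": "comedy"},
    {"title": "C", "actor": "Kate Winslet", "genre": "drama"},
]

def categorize(items, attrs, k=2):
    """Group items under (attribute, value) labels; return the k largest groups."""
    cats = defaultdict(list)
    for item in items:
        for attr in attrs:
            cats[(attr, item[attr])].append(item["title"])
    return sorted(cats.items(), key=lambda kv: -len(kv[1]))[:k]

top_cats = categorize(movies, ["actor", "genre"])
```

The user then sees a handful of labeled groups (e.g., actor = Johnny Depp) instead of one long undifferentiated ranked list.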
Social networks, which consist of people interacting with each other, are usually modeled as graphs. One can easily apply standard graph algorithms to analyze their structure (e.g., to find communities), but for very large networks (which is usually the case nowadays) the cost of computation becomes prohibitive. Applying approximation methods to the analysis of social network structure has therefore recently attracted interest. In this project, we use linear-algebraic approaches along with heuristic methods to calculate proximity measures on very large social networks. Proximity measures indicate how close different nodes in a network are to each other, and they can be used in many problems such as link prediction, community detection, and collaborative filtering.
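One simple proximity measure of this family, shown here only to fix ideas: decayed random-walk mass spread from a source node for a few steps (a truncated power iteration over the adjacency structure). The graph and decay factor are illustrative; the project's methods target graphs far too large for exact computation.

```python
def walk_proximity(adj, source, steps=3, decay=0.5):
    """Decayed probability mass reaching each node from `source` in <= steps hops."""
    scores = {source: 1.0}
    frontier = {source: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, mass in frontier.items():
            neighbors = adj.get(node, [])
            if not neighbors:
                continue
            share = decay * mass / len(neighbors)  # split mass, apply decay
            for nb in neighbors:
                nxt[nb] = nxt.get(nb, 0.0) + share
        for node, mass in nxt.items():
            scores[node] = scores.get(node, 0.0) + mass
        frontier = nxt
    return scores

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["c"]}
prox = walk_proximity(adj, "a")
```

Nodes never reached by the walk simply get no score; at web scale, the same quantity is approximated with low-rank and sampling techniques rather than computed exactly.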
To avoid the need for in-house data mining expertise or computational facilities, a data owner, or a data custodian with the responsibility to protect the data (such as a hospital), may wish to outsource data mining tasks to a service provider, either a mining service provider or a computational service provider. We study the problem of preserving the privacy of the data while providing a no-outcome-change guarantee, i.e., preserving the mined patterns. We have developed transformation approaches for decision tree classification and SVM classification; our methods provide strong protection of the original data while preserving the patterns.
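To give the flavor of a pattern-preserving transformation (not necessarily the one used in our papers): a rotation of numeric records hides the raw values while leaving all pairwise distances, and hence the results of distance-based mining, unchanged. The data and rotation angle are made up.

```python
import math

def rotate(point, theta):
    """Rotate a 2-D record by angle theta (radians)."""
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

data = [(1.0, 2.0), (3.0, 4.0), (0.0, -1.0)]
hidden = [rotate(p, theta=1.234) for p in data]  # what the service provider sees
```

Because rotations are isometries, a classifier trained on the transformed records separates them exactly as it would the originals, which is the essence of a no-outcome-change guarantee.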
With the advent of the Internet, an ever-increasing volume of user-generated data is stored in text format. Such data (and the queries posed over them) are rarely perfect: typographical errors mean that exact matching often fails to retrieve the desired information. To deal with this problem, approximate query processing has recently received a great deal of attention. We study the result size estimation problem for several kinds of approximate queries over text databases, which is crucial for optimizing such queries. We have developed techniques for estimating the selectivity of string and substring predicates with an edit distance threshold, and for join size estimation for set similarity queries.
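A baseline that makes the problem concrete (our estimators are more sophisticated): evaluate the edit-distance predicate on a random sample of the strings and scale the hit count up. The string collection, query, and sampling rate are illustrative.

```python
import random

def edit_distance(s, t):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def estimate_size(strings, query, threshold, sample_frac=0.5, seed=0):
    """Sampling-based estimate of |{s : edit_distance(s, query) <= threshold}|."""
    rng = random.Random(seed)
    sample = [s for s in strings if rng.random() < sample_frac]
    hits = sum(edit_distance(s, query) <= threshold for s in sample)
    return hits / sample_frac

db = ["color", "colour", "colander", "cooler", "dollar"]
est = estimate_size(db, "color", threshold=1)
```

Sampling gives unbiased but high-variance estimates for selective predicates; synopsis-based estimators (e.g., over q-gram statistics) aim at the same number with far better accuracy guarantees.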