Data Management and Mining Lab

Image Databases

There is considerable research interest in efficiently indexing, querying, and retrieving images stored in databases.

Images can be queried in a number of ways:

using identifiers (e.g., the keyword "Rembrandt" when searching for paintings by Rembrandt)
using "features" that appear in the image, such as:
- colours (and perhaps the relative locations of those colours in the image)
- objects (e.g., houses, faces)
- shapes (e.g., circularity, eccentricity, large/small area)
- texture (e.g., rough vs. smooth -- green tree vs. green car)

Queries in relational databases rely on exact matches, whereas queries in image databases rely on approximate matches; hence, we need some notion of similarity. This can be achieved through the use of histograms. A k-bin colour histogram defines k colour ranges and stores a count of the number of pixels in the image falling into each colour range. A k-bin histogram can be represented by a k-dimensional vector, and a similarity measure such as an L-1 norm (Manhattan distance) or an L-2 norm (Euclidean distance) can be used.

We can view a k-bin histogram as a point in a k-dimensional space. A "fixed radius" or "nearest neighbour" search can then be used to identify the top-n candidates. A naive approach would be to compare the histograms of all images in the database to the target (query) histogram. If the radius is too large, we may get too many hits; if the radius is too small, we may dismiss some good candidates. With respect to correctness, we do not want false dismissals, but are prepared to deal with false positives.

A natural question to the above approach concerns the number of bins to use. This also introduces a trade-off between accuracy and efficiency. For example, expensive, fine-grained computations may utilize 256 bins; however, it would be good to use a suitable abstraction (e.g., 3 bins), and then use a filtering technique. Ideally, the filtered candidates should be a small subset of the database, but also a superset of the answer set. Fine-grained computations can then be limited to the filtered list, thus realizing potentially large savings since the time to perform the filtering plus the time to perform the fine-grained computations would typically be far less than a naive approach.

Other methods besides histograms can be used, such as Principal Component Analysis. PCA uses eigenvalues and is related to the Singular Value Decomposition. There are several motivations to this work. Firstly, some dimension of the search space may be dependent on others. Secondly, we want to index on properties other than just colour. With PCA, we can use a new coordinate system to capture the most varied dimension as our first axis, the second most varied dimension as our second axis, and so on. Whereas our previous approach may require a 256-dimensional space, PCA results in a compressed space with typically far fewer dimensions.

At this point, we should point out another area of concern with respect to indexing, namely that of high dimensionality. Studies have shown that when the number of dimensions exceeds approximately 20, then indexing is not very useful.

Variables to Consider

When performing image database queries, some or all of the following variables come into play:

M = number of images in the database
N = number of bins in the full histogram (e.g., 256)
time to compute L-x norm for 2 N-bin histograms
time to compute L-x norm for 2 k-bin histograms (k << N)
page size
number of N-bin histograms in 1 page
number of k-bin histograms in 1 page
number of page accesses
amount of buffering
value r for a fixed radius search
value n for desired number of nearest neighbours
fraction of images that survive a given level of filtering
number of levels of filtering
type of index structure (e.g., R-tree, k-d tree, TV-tree)
density of the image search space. Some databases contain many images of the same kind; others contain images from a single domain (e.g., flowers, cars).

On-going or Future Work

query language issues (constructs for composing queries; expressiveness; user-friendliness; human-computer interaction; etc.)
query processing and optimization issues (e.g., range queries, incremental ("like this") processing)
applications to video: content-based querying and retrieval