Attribute-based Access of Heterogeneous Digital Data

Amit Sheth * Vipul Kashyap *# William LeBlanc *
* LSDIS Lab
Department of Computer Science, University of Georgia
415 GSRC, GA 30602-7404
# Department of Computer Science, Rutgers University
New Brunswick NJ 08903

Introduction

To date, most of the work on access to legacy data has been devoted to structured data managed by traditional database management systems. Gateways, data exchange and database transfer mechanisms (and products) have been developed to support this approach. However, a significant percentage of existing data is unstructured or semi-structured. The primary mechanisms used to access unstructured data have been information retrieval strategies that provide keyword-based access, or format-specific tools associated with semi-structured data (e.g., SGML documents).

In this position paper, we propose that attribute-based access provides a powerful complementary or alternative mechanism to traditional content-based search and access. Furthermore, we suggest that this technique deserves serious consideration when dealing with heterogeneous data. While attribute-based access can provide better "precision", it can be more complex because it requires that appropriate attributes be identified and the corresponding metadata instantiated before the data is accessed. The work discussed here has been implemented as an extension of the InfoHarness system [shk94,shk95].

A brief overview of metadata

The metadata classification presented in [kash95] recognizes three basic kinds of metadata: content-dependent, content-descriptive and content-independent.

Content-dependent metadata is derived directly from the actual contents of the data. Most textual indexing technologies rely primarily on this kind of metadata, which is based on the text of the documents [shk95]. There are generally two flavors of content-dependent metadata used in textual indexing:
  1. An inverted index, which keeps track of the documents in which a keyword occurs and its frequency within each document. This is used in WAIS [kahl91].
  2. Vectors associated with documents, which characterize their position in a multi-dimensional space. An example is LSI [deer90], in which both documents and keywords are mapped to the same vector space.
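
The first flavor can be illustrated with a minimal sketch. It assumes a simple alphabetic tokenizer and two invented sample documents; the function names are hypothetical, and it is not the indexing scheme actually used by WAIS.

```python
from collections import Counter, defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic keywords."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(documents):
    """Map each keyword to {document id: frequency within that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for keyword, count in Counter(tokenize(text)).items():
            index[keyword][doc_id] = count
    return index

def search(index, keyword):
    """Return (document id, frequency) pairs, most frequent first."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))

# Invented sample documents, for illustration only.
documents = {
    "doc1": "Metadata describes data. Metadata can be extracted from data.",
    "doc2": "Keyword access relies on an index of the data.",
}
index = build_inverted_index(documents)
print(search(index, "data"))      # [('doc1', 2), ('doc2', 1)]
print(search(index, "metadata"))  # [('doc1', 2)]
```

Ranking by raw frequency is only the simplest possible similarity estimate; real systems typically weight and normalize these counts.
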
Metadata that is based on the content of the data, but cannot be extracted by content analysis of the type done by textual indexing techniques, is called content-descriptive metadata. It is further classified into domain-dependent and domain-independent content-descriptive metadata. Attribute-based access might use this type of metadata.

The time of last modification and the location of a file on a UNIX file system are examples of metadata that are completely independent of the content of the data, i.e., content-independent metadata. Attributes of this type can be quite useful. For example, file name and time of last modification are attributes of a file that could be quite important to a programmer searching for a specific version of a C file.
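
As a rough illustration, content-independent attributes of this kind can be read from the file system without opening the file. The sketch below uses Python's standard `os.stat`; the function name, the attribute names and the file name `parser.c` are hypothetical choices made for the example.

```python
import os
import tempfile
from datetime import datetime, timezone

def content_independent_metadata(path):
    """Collect attributes that do not depend on what the file contains."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

# Demonstration on a throwaway file (the name "parser.c" is hypothetical).
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "parser.c")
    with open(path, "w") as handle:
        handle.write("int main(void) { return 0; }\n")
    metadata = content_independent_metadata(path)

print(metadata["file_name"])      # parser.c
print(metadata["last_modified"])  # depends on when the file was written
```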

Content-based Access versus Attribute-based Access

We use "content-based access" to refer to textual indexing technologies that primarily use metadata based on the text or content of the documents, i.e., content-dependent metadata [shk95]. This type of metadata is geared towards answering keyword-based queries. The similarity between a document and the query is estimated from the frequency of occurrence of keywords or from some proximity measure between the query and the document. Content-based access methods suffer from two significant weaknesses:

The attribute-based access approach alleviates, to a significant extent, the weaknesses of keyword retrieval systems discussed above. In this approach, media-specific extractors scan the documents and extract attributes as metadata for each document. Using these attributes the user can:

This "power" of attribute-based access comes at the cost of some limitations and added complexity. In particular, the attributes available for querying have to be identified beforehand. Another disadvantage of attribute-based access is that metadata corresponding to different sets of attributes (even for the same type of data) has to be "extracted" using appropriate (i.e., different) programs (called "extractors" in InfoHarness).

It is important to note that although attribute-based access offers a powerful retrieval mechanism, it may not be suitable for every type of query. A user who is interested in retrieving articles that mention President Clinton, regardless of whether the reference is in the title, body, or byline, has no need to attach the semantics "author" to the keyword "Clinton".
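
To make the extractor idea concrete, here is a minimal sketch under stated assumptions. It assumes articles that begin with `Name: value` header lines followed by a blank line; that format, the sample articles and the function names are all hypothetical and do not describe the actual InfoHarness extractors.

```python
def extract_attributes(text):
    """Hypothetical extractor: read 'Name: value' header lines up to the first blank line."""
    attributes = {}
    for line in text.splitlines():
        if not line.strip():
            break  # headers end at the first blank line
        name, separator, value = line.partition(":")
        if separator:
            attributes[name.strip().lower()] = value.strip()
    return attributes

def attribute_query(metadata, **criteria):
    """Return ids of documents whose attributes contain every requested value."""
    return sorted(
        doc_id
        for doc_id, attributes in metadata.items()
        if all(
            wanted.lower() in attributes.get(name, "").lower()
            for name, wanted in criteria.items()
        )
    )

# Invented sample articles, for illustration only.
articles = {
    "a1": "Title: Budget Talks Resume\nAuthor: J. Clinton\n\nTalks resumed on Monday.",
    "a2": "Title: Clinton Visits Georgia\nAuthor: M. Smith\n\nThe President spoke in Athens.",
}

# The metadata is instantiated once, before any query is issued.
metadata = {doc_id: extract_attributes(text) for doc_id, text in articles.items()}

print(attribute_query(metadata, author="Clinton"))  # ['a1']
print(attribute_query(metadata, title="Clinton"))   # ['a2']
```

A plain keyword search for "Clinton" would return both articles; attaching the keyword to an attribute is what separates them. A different set of attributes, or a different document format, would need a different extractor.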

Even when a user does specify "Clinton" as a value for all of the attributes in the query form, articles relevant to the user's search may not be retrieved. An article in the collection may contain the word "Clinton", but the extractor may not have selected this word as a value for any of the attributes in the metadata for that article. This translates into a possible loss of recall for this type of query.
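
The loss of recall can be seen in a small sketch. The two-article collection and its extracted attributes are invented, and for the purposes of the example every article containing the keyword is treated as relevant.

```python
# Hypothetical collection: full text plus the attributes an extractor produced.
collection = {
    "a1": {
        "text": "Senate Passes Budget. By A. Jones. Clinton is expected to sign the bill.",
        "attributes": {"title": "Senate Passes Budget", "author": "A. Jones"},
    },
    "a2": {
        "text": "Clinton Visits Georgia. By M. Smith. The visit lasted two days.",
        "attributes": {"title": "Clinton Visits Georgia", "author": "M. Smith"},
    },
}

def keyword_search(collection, keyword):
    """Content-based access: match the keyword anywhere in the text."""
    return sorted(
        doc_id
        for doc_id, doc in collection.items()
        if keyword.lower() in doc["text"].lower()
    )

def any_attribute_search(collection, value):
    """Attribute-based access with the same value entered for every attribute."""
    return sorted(
        doc_id
        for doc_id, doc in collection.items()
        if any(value.lower() in v.lower() for v in doc["attributes"].values())
    )

# Assumption for this example: every article mentioning the keyword is relevant.
relevant = keyword_search(collection, "Clinton")
retrieved = any_attribute_search(collection, "Clinton")
recall = len(set(retrieved) & set(relevant)) / len(relevant)

print(relevant)   # ['a1', 'a2']
print(retrieved)  # ['a2']
print(recall)     # 0.5
```

Article a1 mentions "Clinton" only in its body, so no extracted attribute carries the word and the attribute-based query misses it.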

Each of the indexing schemes discussed brings its own features. We do not propose the attribute-based approach as a replacement for the other methods; it is offered as a tool that can enhance the power of information retrieval in certain cases.

An Implementation and Demo

The attribute-based access discussed above has been implemented. A demo and a discussion of the implementation may be seen at the InfoHarness home page at the LSDIS Lab, UGA.

References

[deer90]
S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman. "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science, 41(6), 1990.
[kahl91]
B. Kahle and A. Medlar. "An Information System for Corporate Users: Wide Area Information Servers". Connexions - The Interoperability Report, 5(11), November 1991.
[kash95]
V. Kashyap, K. Shah and A. Sheth. "Metadata for building the MultiMedia Patch Quilt". In S. Jajodia and V. Subrahmanian, editors, MultiMedia Database Systems: Issues and Research Directions, Springer Verlag, 1995.
[shk94]
L. Shklar, S. Thatte, H. Marcus and A. Sheth. "The InfoHarness Integration Platform". Proceedings of the Second International WWW Conference, October 1994.
[shk95]
L. Shklar, A. Sheth, V. Kashyap and K. Shah. "InfoHarness: Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information". Proceedings of CAiSE '95, June 1995.