Attribute-based Access of Heterogeneous Digital Data
Amit Sheth*, Vipul Kashyap*#, William LeBlanc*
* LSDIS Lab
Department of Computer Science, University of Georgia
415 GSRC, GA 30602-7404
# Department of Computer Science, Rutgers University
New Brunswick NJ 08903
Introduction
To date, most of the work done on access to legacy data has been devoted to
structured data managed by traditional database management systems. Gateways,
data exchange and database transfer mechanisms (and products) have been developed
to support this approach. However, a significant percentage of existing data is
unstructured or semi-structured.
The primary mechanisms used to access
unstructured data have been information retrieval strategies that
provide keyword based access, or data format specific tools
associated with semi-structured data (e.g., SGML documents).
In this position paper, we propose that attribute-based access provides a
powerful complementary or alternative mechanism to traditional content-based search and
access. Furthermore, we suggest that when
dealing with heterogeneous data, this technique deserves serious
consideration. While attribute-based access can provide better "precision",
it can be more complex because it requires that appropriate attributes be
identified and the corresponding metadata instantiated before the data is accessed.
The work discussed here has been implemented as an extension of the InfoHarness
system [shk94,shk95].
A brief overview of metadata
A metadata classification presented by [kash95] recognizes three basic kinds of metadata:
- Content-dependent metadata,
- Content-descriptive metadata, and
- Content-independent metadata.
Content-dependent metadata is metadata that is derived directly from the actual
contents of the data. Most textual indexing technologies rely primarily on this kind of
metadata which is based on the text of these documents [shk95].
There are generally two flavors of content-dependent
metadata used in textual indexing:
- An inverted index which keeps track of the documents in which a keyword
occurs and its frequency within each document. This is used in WAIS [kahl91].
- Vectors associated with documents which characterize their position in a
multi-dimensional space. An example is LSI [deer90] in
which both documents and keywords are mapped to the same vector space.
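The first flavor can be illustrated with a minimal sketch. It is not the WAIS implementation; the function names, the tokenization rule and the two sample documents are assumptions made for illustration only. It builds an inverted index that records, for each keyword, the documents it occurs in and its frequency within each.

```python
import re
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: frequency of the keyword in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase runs of letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for word, freq in counts.items():
            index[word][doc_id] = freq
    return index

def keyword_search(index, keyword):
    """Return (doc_id, frequency) pairs, most frequent first."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample collection.
docs = {
    "a1": "Metadata describes data. Metadata can be extracted from data.",
    "a2": "Keyword access relies on an index of the text.",
}
index = build_inverted_index(docs)
print(keyword_search(index, "metadata"))  # [('a1', 2)]
```

Everything in such an index is derived from the text itself, which is why this kind of metadata is content-dependent.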
Metadata that is based on the content of the data, but cannot be extracted by
the kind of content analysis that textual indexing techniques perform, is called
content-descriptive metadata. It is further classified into domain-dependent
and domain-independent content-descriptive metadata. Attribute-based access
might use this type of metadata.
The time of last modification and the location of a file on a UNIX file system are
examples of metadata that are completely independent of the content of the data,
i.e., content-independent metadata. Attributes of this type can be quite useful.
For example, file name and time of last modification are attributes of a file that could be
quite important to a programmer searching for a specific version of a C file.
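As a rough sketch of how such attributes might be collected, the following uses only the Python standard library. The function name and the attribute names are assumptions chosen for illustration; the point is that none of the values require reading the file's contents.

```python
import os
import tempfile
from datetime import datetime, timezone

def content_independent_metadata(path):
    """Collect attributes that can be read without opening the file's contents."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

# Demonstrate on a throwaway C file created in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "parser.c")
    with open(path, "w") as handle:
        handle.write("int main(void) { return 0; }\n")
    meta = content_independent_metadata(path)
    print(meta["file_name"], meta["last_modified"].isoformat())
```

A programmer looking for a specific version of a C file could then filter on `file_name` and compare `last_modified` against a cutoff.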
Content-based Access versus Attribute-based Access
We use "content-based access" to refer to textual indexing technologies that
primarily use metadata based on the text or content of the documents, i.e.,
content-dependent metadata [shk95]. This type of metadata is
geared towards answering keyword-based queries. The similarity between a
document and the query is estimated based on the frequency of occurrence of
keywords or on some proximity measure between the query and the document.
Content-based access methods suffer from two significant weaknesses:
- A user cannot explicitly specify semantics along with his keywords.
A user interested in obtaining documents by a specific author from an LSI or
WAIS collection may get extraneous hits from documents that reference the author
in question resulting in loss of precision. In the case of NNTP news articles
this is particularly true as it is very common for references to authors and
their comments to be included as a part of a submission.
- Technologies that use text-based access, including both LSI and WAIS,
cannot provide non-text based matches.
Order and range queries (e.g. ones based on the dates of posting) cannot
be answered using a keyword-based approach. The best the user can do is to look
for text matches for the day, month or year he is looking for. There are two
reasons for this:
- the indexing technology does not recognize types other than text and therefore cannot
capture type-specific information (metadata) about the date of posting.
- the query language (keywords in this case) is unable to express range and
order conditions.
The attribute-based access approach alleviates to a significant extent the
weaknesses of keyword retrieval systems discussed above. In this approach,
media-specific extractors scan the documents and extract attributes as metadata
for each document. Using these attributes the user can:
- Enhance the semantics of the keywords he provides. For example,
when a user presents a keyword (e.g. "Kilpatrick") as the value of an attribute
(e.g. author), the keyword is more tightly constrained than when
it appears by itself, which improves the precision of the query.
- Pose type-specific queries, because attributes can have associated types. For example, the attribute
Submission date could have the type date, depending on the specification
of the extractor of the metadata. This makes it possible to support type-specific
queries involving types like time, date, currency, etc. Simple comparison
operators (<, >, =, <=, >=) are supported in the current implementation for
specifying constraints on attributes of date and numeric types.
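The two capabilities above can be sketched as follows. This is not the InfoHarness extractor or its query interface; `extract_attributes`, `query`, the attribute names and the two sample news articles are hypothetical. The sketch assumes that articles carry standard `From:` and `Date:` header lines, and it reads typed attributes from those headers only.

```python
import operator
from datetime import date
from email import message_from_string
from email.utils import parsedate_to_datetime

# The comparison operators named in the text, mapped to Python functions.
OPERATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq,
             "<=": operator.le, ">=": operator.ge}

def extract_attributes(article_id, raw_article):
    """Hypothetical extractor: read typed attributes from the header lines only."""
    message = message_from_string(raw_article)
    return {
        "id": article_id,
        "author": message["From"],
        "submission_date": parsedate_to_datetime(message["Date"]).date(),
    }

def query(metadata, constraints):
    """Return ids of records satisfying every (attribute, operator, value) constraint."""
    return [
        record["id"]
        for record in metadata
        if all(OPERATORS[op](record[attr], value) for attr, op, value in constraints)
    ]

# Hypothetical articles; n2 mentions Kilpatrick only in a quoted reference.
ARTICLES = {
    "n1": "From: Kilpatrick\nDate: Mon, 06 Mar 1995 10:00:00 +0000\n\nA note on extractors.",
    "n2": "From: Jones\nDate: Tue, 20 Jun 1995 09:30:00 +0000\n\nKilpatrick wrote: extractors work well.",
}
metadata = [extract_attributes(key, raw) for key, raw in ARTICLES.items()]

print(query(metadata, [("author", "=", "Kilpatrick")]))               # ['n1']
print(query(metadata, [("submission_date", ">=", date(1995, 6, 1))]))  # ['n2']
```

The author query does not return n2, although its body contains the keyword, and the date constraint is evaluated on a date value rather than on matching text.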
This "power" of attribute-based access comes at the cost of
some limitations and added complexity. In particular, the
attributes available for querying have to be identified beforehand. Another
disadvantage of attribute-based access is that metadata corresponding to
different sets of attributes (even for the same type of data) has to be "extracted"
using appropriate (i.e., different) programs (called "extractors" in
InfoHarness).
It is important to note that although attribute-based access offers a
powerful retrieval mechanism, it may not be suitable for every type of query. A
user who is interested in retrieving articles that mention President Clinton,
regardless of whether the reference is in the title, body, or byline, has no
need to attach the semantics
"author" to the keyword "Clinton".
It is also possible
that, even when a user specifies "Clinton" as a value for
all of the attributes in the query form, articles relevant to the
user's search will not be retrieved. An article in the collection may
contain the word "Clinton", but the extractor may not
have selected this word as a value for any of the attributes in the
metadata for that article. This translates into a possible loss of
recall for this type of query.
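The loss of recall can be made concrete with a small sketch. The two articles, the metadata and the function names are hypothetical, and the sketch assumes an extractor that produced only title and author attributes.

```python
def attribute_search(metadata, value):
    """Match the value against every extracted attribute of every article."""
    needle = value.lower()
    return sorted(
        article_id
        for article_id, attributes in metadata.items()
        if any(needle in str(field).lower() for field in attributes.values())
    )

def keyword_search(articles, keyword):
    """Match the keyword anywhere in the full text of every article."""
    needle = keyword.lower()
    return sorted(
        article_id for article_id, text in articles.items() if needle in text.lower()
    )

# Hypothetical full text of two articles.
ARTICLES = {
    "n1": "Clinton outlines budget plan. The President spoke on Tuesday.",
    "n2": "Senate debates budget. Senators responded to remarks by Clinton.",
}
# Hypothetical extractor output: title and author only; body text is not an attribute.
METADATA = {
    "n1": {"title": "Clinton outlines budget plan", "author": "Staff writer"},
    "n2": {"title": "Senate debates budget", "author": "Staff writer"},
}

print(attribute_search(METADATA, "Clinton"))  # ['n1']
print(keyword_search(ARTICLES, "Clinton"))    # ['n1', 'n2']
```

Article n2 is relevant to the user but is missed by the attribute query, because the word appears only in text that the assumed extractor did not turn into an attribute.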
Each of the indexing schemes discussed brings with it its own
features. It is not proposed that the attribute-based approach will
be a replacement for the other methods. Attribute-based access is
offered as a tool that can enhance the power of information
retrieval in certain cases.
An Implementation and Demo
The attribute-based access approach discussed above has been implemented; a demo and
a discussion of the implementation may be seen at the InfoHarness
home page at the LSDIS Lab, UGA.
References
- [deer90] S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman. "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science, 41(6), 1990.
- [kahl91] B. Kahle and A. Medlar. "An Information System for Corporate Users: Wide Area Information Servers". Connexions - The Interoperability Report, 5(11), November 1991.
- [kash95] V. Kashyap, K. Shah and A. Sheth. "Metadata for building the MultiMedia Patch Quilt". In S. Jajodia and V. Subrahmanian, editors, MultiMedia Database Systems: Issues and Research Directions, Springer Verlag, 1995.
- [shk94] L. Shklar, S. Thatte, H. Marcus and A. Sheth. "The InfoHarness Integration Platform". Proceedings of the Second International WWW Conference, October 1994.
- [shk95] L. Shklar, A. Sheth, V. Kashyap and K. Shah. "InfoHarness: Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information". Proceedings of CAiSE '95, June 1995.