Organizational information sources exist throughout the enterprise, and the vast majority of these sources are web-enabled. Integration with disparate web pages has traditionally been handled by search engines that utilize the metadata or the embedded text within the content itself. Unfortunately, more and more information sources are being built as dynamic applications rather than static web pages. Dynamic applications create a problem for traditional search engines because the engines cannot spider the data, and this requires new forms of integration technology. One such technology is the “MetaCard,” which is built upon the library science card catalog model. The MetaCard is based on the Dublin Core metadata standard and allows traditional search technology to work without additional layers of integration effort. This article reviews an implementation of the MetaCard technology in a Fortune 500 company to resolve the problem of universal asset integration.
Content Delivery
One of the basic issues with an overall enterprise strategy is that not every asset representation is available in Hypertext Markup Language (HTML) format, and some assets may not work with the basic architecture in place. The requirement to be resolved is that any asset must be locatable by the corporate search engine so that it can be found by as many users as possible. The assumption is that all repositories are implemented in a federated style and not as a centralized collection. The majority of repository solutions are database driven and require a merging of web and data technologies. Static web pages publish content only once, while data-driven sites derive their content dynamically from user requests.
Simple and Elegant Solution
The key to providing semantic knowledge is an agreement on the standards of documentation. The Dublin Core Metadata Initiative (DCMI) is an organization that is creating a standard set of descriptive tags through an open, community-driven process. The standard summarizes the updated definitions for the Dublin Core metadata elements as originally defined by the DCMI. These definitions are officially known as Version 1.1. The definitions utilize a formal standard for the description of metadata elements. This formalization helps to improve consistency with other metadata communities and enhances the clarity, scope, and internal consistency of the Dublin Core metadata element definitions. Each of these elements can provide vital information pertaining to the usage, purpose, content, and structure of the web page or any other web-based object. Some of these elements are broken down into further qualifications, such as the “Date” element, whose qualifiers include valid date, issued date, modified date, available date, and created date. These qualifiers provide additional semantics that enable a more precise definition of the meaning of the object.
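For illustration, one common convention embeds the Dublin Core elements and their qualifiers as meta tags in the head of an HTML page; the element choices and values below are hypothetical:

    <meta name="DC.title" content="Universal Asset Integration with the MetaCard">
    <meta name="DC.creator" content="Metadata Services Group">
    <meta name="DC.date.created" content="2004-03-15">
    <meta name="DC.date.modified" content="2004-06-01">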
Functionally, the process of building a search index is fairly straightforward with the use of a web crawler. A crawler is a program that downloads and stores Web pages, often for a Web search engine. Roughly, a crawler starts off by placing an initial set of Uniform Resource Locators (URLs) in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop.
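A minimal sketch of this crawl loop in Python follows; the seed address is hypothetical, and a production crawler would add politeness rules, content parsing, and prioritization:

    # Minimal sketch of the crawl loop described above. The seed URL is
    # hypothetical; real crawlers add politeness and prioritization logic.
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    queue = deque(["http://intranet.example.com/metacards/"])  # initial URL set
    seen = set(queue)
    pages = {}  # collected pages, keyed by URL, for later indexing

    while queue:
        url = queue.popleft()                                  # get a URL (in some order)
        html = urlopen(url).read().decode("utf-8", "replace")  # download the page
        pages[url] = html                                      # store the page
        for link in re.findall(r'href="([^"]+)"', html):       # extract any URLs
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)                             # put new URLs in the queue
                queue.append(absolute)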
Collected pages are later used for other applications, such as a Web search engine or a Web cache. The key requirement of the crawler framework is that content be available in a standard, crawlable format such as Hypertext Markup Language (HTML) or Microsoft Office document types. The issue with dynamic content is that these standard formats are not available, and reproducing the content in multiple formats is not economically efficient. While some search engines can open certain document types to catalog the content, this capability is not universal, and new document types are being created as technology continues to advance. The solution is to create a “MetaCard” that abstracts only the core metadata required by the search engine, without the overhead of the object or content itself. The following sections review the basic structure of the MetaCard, the population of its content, and its integration into the corporate search engine.
The MetaCard is simply an HTML page that contains the key metadata embedded in a table, similar to the layout shown in Figure 1.
The page contains the standard corporate header as presented within the metadata repository environment. This header allows the user to navigate within the repository environment as well as apply departmental branding. However, due to the JavaScript redirect, the page will not actually be seen by the end user unless scripting is disabled. For users who disable scripting, a descriptive header is added at the top of the page body that instructs the user to click the link that will take them to the asset being described. The current deployment of the MetaCard is a table with six components: the title, card name, description, keywords, cross-reference keys (soft key references), and direct URL.
Title
The title is a name or short phrase that describes the content within the repository or information technology asset. Figure 1 utilizes the name of the article as the asset title.
Card Name
The card name is the name given to the card by the administrator or by an automated process. The card name should follow standard naming conventions and reflect the origin of the metadata itself.
Description
The description is a detailed account of the content itself. The administrator should ensure that as many searchable terms as possible are included in the description.
Keywords
The keywords allow the administrator to associate specific search-based keywords with the asset being described. The administrator may also enter abbreviations, acronyms, alternate spellings, and synonyms. Some implementations of metadata may automate this process and not require the administrator to understand the enterprise taxonomy.
Soft Key References
Soft key references are similar to keywords, with the exception that they are known only to the administrator. In many cases, the administrator may want to create a search set in which only a predefined group of assets appears in the results. For example, a soft key of “xref-metadata” may be used to pull only the assets with this keyword into the result set. The key is that a user would not normally search for this specific term, which can therefore be used in a predefined soft-link search.
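For example, a portal page could expose such a predefined search as an ordinary link; the search engine address below is hypothetical:

    <a href="http://search.example.com/results?query=xref-metadata">All metadata assets</a>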
Uniform Resource Locator (URL)
The URL provides several functions that may not be obvious. First, it is the target of the redirect script: it specifies the asset to which the browser is sent. Second, the end user may have JavaScript turned off at the browser level, which prevents the MetaCard from redirecting to the actual asset. While the normal user will never actually see the MetaCard, users who turn off the redirect script will see the card, and they should be instructed to click on the link provided in order to proceed to the specific asset requested.
Other elements are possible that may help integrate the MetaCard into higher-level classification tools such as taxonomies or ontologies. Author, context, and location metadata may be added to the description field or deployed as separate fields. At the core of the MetaCard is the JavaScript redirect script. The script is placed at the very beginning of the HTML file and is the first code executed. The purpose of the code is to redirect the browser to the asset specified by the URL.
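Putting these pieces together, a MetaCard file might look like the following sketch; the field values and addresses are illustrative, not the exact markup of Figure 1:

    <html>
    <head>
      <!-- redirect script: the first code executed, sending the browser to the asset -->
      <script type="text/javascript">
        window.location.replace("http://repository.example.com/assets/1234");
      </script>
      <title>Universal Asset Integration with the MetaCard</title>
      <meta name="description" content="A MetaCard describing a repository asset.">
      <meta name="keywords" content="metadata, asset integration, MetaCard">
    </head>
    <body>
      <!-- corporate header and departmental branding would appear here -->
      <!-- the body is seen only when scripting is disabled -->
      <p>Scripting is disabled. Please click the link below to proceed to the asset.</p>
      <table>
        <tr><td>Title</td><td>Universal Asset Integration with the MetaCard</td></tr>
        <tr><td>Card Name</td><td>card-repository-1234</td></tr>
        <tr><td>Description</td><td>A MetaCard describing a repository asset.</td></tr>
        <tr><td>Keywords</td><td>metadata, asset integration, MetaCard</td></tr>
        <tr><td>Soft Keys</td><td>xref-metadata</td></tr>
        <tr><td>URL</td><td><a href="http://repository.example.com/assets/1234">
          http://repository.example.com/assets/1234</a></td></tr>
      </table>
    </body>
    </html>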
Population of Content
The previous section described the technical side of the model, while this section focuses on how data actually gets into the HTML file. Currently, two methods are used for loading data into the standard HTML format. First, the MetaCard application can accept an XML feed and apply an Extensible Stylesheet Language Transformations (XSLT) style sheet to generate the HTML code. This process works well for external organizations that can interact through a message-style interface. Another option is to store the metadata in a database. By storing the data, we can provide additional services such as impact analysis and data quality assurance. Once the data is collected, the HTML file can be generated by an application program.
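As a sketch of the XML-feed path in Python (assuming the lxml library is available, a hypothetical metacard.xsl style sheet that emits the card layout described above, and a feed whose assets carry a cardname element):

    # Hypothetical sketch: transform an incoming XML feed of asset metadata
    # into one MetaCard HTML file per asset. Assumes the lxml library and a
    # metacard.xsl style sheet that produces the card layout described above.
    from lxml import etree

    transform = etree.XSLT(etree.parse("metacard.xsl"))  # assumed style sheet
    feed = etree.parse("asset_feed.xml")                 # assumed XML feed

    for asset in feed.findall(".//asset"):               # one card per asset
        card = transform(etree.ElementTree(asset))       # apply the XSLT
        filename = asset.findtext("cardname") + ".html"  # assumed <cardname> element
        with open(filename, "wb") as out:
            out.write(etree.tostring(card, method="html"))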
Search Engine Integration
Indexing and search directories are the dominant form of retrieval technology. They make use of information retrieval (IR) systems based on the statistical relationships between words. Engines with directory-based search employ human editors and robots to discover new documents and catalog Web pages in an index database. These robots can be pointed at the MetaCard directory in order to index the complete inventory of cards. Ideally, the MetaCard entry in the search engine result set should be presented so that it is no different from the representation of the original asset, with the only exception being the URL. For example, a search result built from a MetaCard presents the metadata seamlessly within the system, with only the address modified.
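One simple way to point the robots at the card inventory is a plain index page in the MetaCard directory that links every card so the crawler can discover them; the file names and titles below are hypothetical:

    <!-- metacards/index.html: a crawl seed that links the full card inventory -->
    <html><body>
      <a href="card-repository-1234.html">Universal Asset Integration with the MetaCard</a>
      <a href="card-repository-1235.html">Enterprise Taxonomy Reference</a>
    </body></html>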
Benefits of the MetaCard System
The benefits of the MetaCard system lie in the simplicity of construction and the small footprint required by the HTML file. The original application to build the MetaCards utilized a database application and Visual Basic code. The enterprise application that was built later employed XML and XSLT technologies and replaced the core application. The cards themselves are less than 10 KB in size, which includes the base HTML and style components that allow the card to fit into the current intranet environment. Additionally, the system provides both the administrator and the end user with the ability to create cards, which allows additional subject matter expertise to enter the system. In the end, this joint effort expands both the coverage and the value of the search engine.
One of the most important dimensions of metadata is ensuring the quality of the metadata itself. Since search results, hierarchical classifications, and even ontologies will be built off the metadata constructs, ensuring that quality is an imperative. By utilizing a centralized group and application portfolio, additional quality control steps can be implemented, including controlled vocabularies, glossaries, shared sets of keywords, and soft-key constructs. Improving the quality of the metadata on the front end and at the object level will enable a richer set of services downstream.
Most search engines have a limited set of object types that can be indexed. The MetaCard technology allows any asset that can be accessed with a browser to be added to the search engine and to enterprise classification systems. Evolution of value is critical, and expanding the breadth of assets in the search scope will enable shared knowledge throughout the corporation.