Building blocks for semantic search engines:
Ranking and compact indexing in entity-relation graphs
Soumen Chakrabarti
We see an evolutionary path to supporting semantic search over text facilitated
by 1. extractors and annotators for ever-growing collections of entity and
relation types and 2. search systems that exploit a smooth continuum between
structured entities and relations on one hand and uninterpreted text on the
other. The extractors and annotators will be imperfect and incomplete. Unlike
in traditional data warehousing, the unstructured source cannot be forgotten
once structured data is curated. The source text lives on, with annotations
as probabilistic connections to one or more ontologies. Queries involve ontology
elements as well as uninterpreted strings. Searchers know only bits and pieces
of schema. The query language must enable schema-free searches but reward schema
knowledge. In such a system, ranking of results cannot be made all explicit
as in SQL. In the first part of the talk I visit two issues related to scoring
and ranking results. In one scenario we wish to rank mentions of instances of
a given type (e.g. distance) based on their textual proximity to query keywords
(e.g. Paris Helsinki). Such queries are crude but effective filters for simple
questions like "what is the distance between Paris and Helsinki".
In the other scenario I generalize the setting to learning ranking functions
in arbitrary E-R graphs, given partial order preferences. In the second part
of the talk I discuss the difficulties faced in indexing and query processing.
I describe new techniques to estimate query-processing cost and cost-driven
index compaction based on query logs. The work described is embodied in a system
we are building that we plan to release in the public domain.