Wednesday 27 July 2011

Spell check with Autonomy IDOL.

Spell-Check:
Autonomy IDOL uses a term distancing algorithm to find correct spellings and suggest them. In term distancing, the IDOL server counts the number of edits (each edit being an insertion, deletion or replacement of a single character) needed to turn the query term into a candidate term, and uses that count to find the nearest matching terms.
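
To make the idea of counting edits concrete, here is a minimal sketch of the classic Levenshtein edit distance in Python. It only illustrates the general technique described above; it is not IDOL's actual implementation, and the function names, the vocabulary list and the max_edits threshold are my own assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def suggest(term: str, vocabulary: list[str], max_edits: int = 2) -> list[str]:
    """Hypothetical helper: vocabulary terms within max_edits of term,
    nearest first."""
    scored = sorted((edit_distance(term, word), word) for word in vocabulary)
    return [word for distance, word in scored if distance <= max_edits]


print(suggest("serch", ["search", "server", "ranking"]))
```

With the toy vocabulary above, only "search" is within two edits of "serch", so it is the single suggestion.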

The following is the minimum set of configuration parameters required to activate spell check in Autonomy IDOL.

Index side
The following configuration parameters have to be included in the [server] section of the IDOL server configuration file:
  • SpellCheckMaxCheckTerms: The maximum size of a query (in number of terms) up to which the query is considered eligible for spell check. E.g. SpellCheckMaxCheckTerms = 200
  • SpellCheckIncorrectMaxDocOccs: The maximum number of documents a term can appear in and still be considered a misspelling. E.g. SpellCheckIncorrectMaxDocOccs = 1
  • SpellCheckCorrectMinDocOccs: The minimum number of documents a term must appear in to be offered as a spell check suggestion (or to be matched by a wildcard term).
We can also use the config parameter UnstemmedMinDocOccs for this purpose. It represents the minimum number of documents a term must appear in to be offered as a spell check suggestion or to be matched by a wildcard term.
E.g.  SpellCheckCorrectMinDocOccs = 1
        UnstemmedMinDocOccs = 1
There are a few other config parameters related to spell check. These are:
SpellCheckAlphaNumeric: Omits input terms containing numbers from being spellchecked. It is either true or false.
E.g. SpellCheckAlphaNumeric = true

SpellCheckCacheMaxSize: The maximum number of spelling corrections that the IDOL server can store. The spelling corrections are stored in the IDOL>content>main>prx.db file.
E.g. SpellCheckCacheMaxSize = 6666

Query Side

Include spellcheck=true in the queries in order to instruct the IDOL server to check the spelling of the query terms and provide suggestions for any misspelled term.
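
As a rough illustration, the query string could be assembled like this in Python. The host, the port 9000, the action=Query layout and the text parameter are assumptions about a typical IDOL setup rather than something taken from this post, and build_spellcheck_query is a hypothetical helper name; check your own server's address and action syntax.

```python
from urllib.parse import urlencode


def build_spellcheck_query(host: str, port: int, text: str) -> str:
    """Build a query URL that asks the server to spell check the terms.

    Assumption: the server accepts requests of the form
    http://host:port/action=Query&text=...&spellcheck=true
    """
    params = urlencode({"text": text, "spellcheck": "true"})
    return f"http://{host}:{port}/action=Query&{params}"


# Assumed host and port, for illustration only.
print(build_spellcheck_query("localhost", 9000, "enterprize serch"))
```

The function only builds the URL; sending the request and reading the suggestions out of the response is left out on purpose, since the response format is not covered here.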

Monday 25 July 2011

Ranking

Ranking determines the quality of a match between a query and a candidate document.
Search products consider the following parameters to determine the appropriate rank value:
  1. Freshness - The age of the document at the point in time the query is issued.
  2. Authority - The importance of a document as determined by links from other documents.
  3. Quality - The assigned importance of a document.
  4. Proximity - The distance between, and location of, query terms in the document. When a query contains multiple terms that are not detected as known phrases, the ranking process takes the relative position of the terms into account and determines the most relevant results based on how close the matching terms are to each other in the document.
  5. Context - Different document fields, for example title, body, description, price or type, may be assigned different relevance weights. This allows you to specify, for example, that a match in the title field of a document contributes more to the document's ranking value than a match in the body field.
The relevancy of the document is represented by its ranking value.
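
To show how two of these parameters (context and proximity) can feed into a single rank value, here is a toy sketch in Python. The field weights, the scoring formulas and the function names are all assumptions made up for illustration; real search products use far more elaborate, product-specific ranking models.

```python
# Assumed relevance weights per field: a title match counts three times
# as much as a body match.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def context_score(query_terms, doc):
    """Weighted count of query term occurrences across document fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        score += weight * sum(words.count(term) for term in query_terms)
    return score


def proximity_score(query_terms, text):
    """1 / (smallest gap in words between two different query terms),
    or 0.0 when fewer than two different query terms occur in text."""
    words = text.lower().split()
    hits = [(i, w) for i, w in enumerate(words) if w in query_terms]
    gaps = [j - i for i, wi in hits for j, wj in hits if j > i and wi != wj]
    return 1.0 / min(gaps) if gaps else 0.0


def rank_value(query, doc):
    """Toy rank value: context score plus proximity score on the body."""
    terms = query.lower().split()
    return context_score(terms, doc) + proximity_score(terms, doc.get("body", ""))


doc_a = {"title": "enterprise search", "body": "notes on enterprise search products"}
doc_b = {"title": "notes", "body": "enterprise tools and some search"}
print(rank_value("enterprise search", doc_a), rank_value("enterprise search", doc_b))
```

Document A ranks higher because both terms appear in its title and sit next to each other in its body, while in document B they only appear in the body, four words apart.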

Search Relevancy

In search, relevancy is the measure of how well the returned result set addresses the intent of the user's query.

Search products take the following concepts into consideration for effective relevancy:
  1. Linguistics.
  2. Ranking.
  3. Navigation.
  4. Sorting.
In future posts I will address the above concepts and their use for effective relevancy.

Saturday 23 July 2011

Federated Search

Problem statement
An employee in any organization looks for relevant information. This information could be present on one or more internal search engines and on public portals like Google, MSN, Yahoo etc. To find it, the employee has to search separately through the different internal and external search portals, which is very inconvenient.

Solution
Federated search provides a solution to this problem. It gives the user a single search form in which to enter the search query. The query is then submitted simultaneously to all the search engines, and the various result sets are combined back into a unified result set.
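
The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each engine is a plain callable returning hits with a "url" and a "score", the two stub engines stand in for real search back ends, and the merge rule (keep the best-scoring copy of each URL, then sort by score) is just one possible choice. In practice, scores from different engines are usually not directly comparable and need normalizing first.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, engines):
    """Send query to every engine at once and merge the result sets.

    engines maps an engine name to a callable(query) that returns a list
    of hits shaped like {"url": ..., "score": ...} (assumed shape).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(search, query) for name, search in engines.items()}
        result_sets = {name: future.result() for name, future in futures.items()}

    merged = {}
    for name, hits in result_sets.items():
        for hit in hits:
            url = hit["url"]
            # Duplicate removal: keep only the best-scoring copy of each URL.
            if url not in merged or hit["score"] > merged[url]["score"]:
                merged[url] = {**hit, "source": name}
    return sorted(merged.values(), key=lambda hit: hit["score"], reverse=True)


# Stub engines standing in for an internal and an external search engine.
def intranet_engine(query):
    return [{"url": "http://intranet/a", "score": 0.9},
            {"url": "http://shared/doc", "score": 0.4}]


def web_engine(query):
    return [{"url": "http://shared/doc", "score": 0.7},
            {"url": "http://web/b", "score": 0.5}]


results = federated_search("quarterly report",
                           {"intranet": intranet_engine, "web": web_engine})
for hit in results:
    print(hit["score"], hit["url"], hit["source"])
```

The document returned by both engines appears once in the unified result set, tagged with the engine that scored it highest.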




Key issues to be taken care of while implementing federated search

  1. Organization rules for combining the result sets into a unified result set.
  2. Security - Mapping security across all the internal enterprise search engines.
  3. Duplicate detection and removal
  4. Managing Facets
  5. Taxonomy

Friday 22 July 2011

Enterprise Search

70%-80% of an organization's data is available in the form of unstructured data, e.g. Word documents, spreadsheets, email and web pages, to name a few. This content may be located on file servers, content management systems or websites. The remainder sits in structured data sources such as databases.

Enterprise search aggregates data from all the unstructured and structured sources and facilitates the findability of relevant information within an organization in a unified manner.

The enterprise search product implemented for one of our clients features the following:
  1. Integrates information from Web, internal, and external sources – providing a 360° view of market conditions.
  2. Monitors events and information – alerting analysts and decision-makers to the latest, actionable intelligence.
  3. Performs rapid search, discovery and advanced content analysis.
  4. Enables Web 2.0 style collaboration with enterprise-level features and security.
General issues to be addressed while implementing enterprise search are:
  1. Appropriate visualizations powered by facets
  2. Organization Taxonomy
  3. Security   
  4. Multiple systems
  5. Timezones
  6. Content sources
Hence, appropriate data analysis of all the content sources is required to facilitate useful information retrieval.
Search products available in the market include Microsoft FAST ESP, Autonomy IDOL, Endeca, Attivio, Google Search Appliance, Solr etc.

Of the enterprise search products listed above, I have worked with and consulted on Autonomy IDOL, Microsoft FAST ESP, Google Search Appliance, Attivio and Solr.