Saturday 13 August 2011

Access control list (ACL)

Access Control List:
A data set that grants permissions, or access rights, to each user or group for a specific system object, such as a directory or file.

FAST ESP, Autonomy IDOL or any other leading search product can use ACL information from the content repositories so that the same permissions apply to search results. This means that a user sees only the query results that he or she is entitled to view, based on his or her permissions on the source content repository.
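
The idea can be sketched in a few lines of Python. This is only an illustration of filtering results by ACL, not how FAST ESP or Autonomy IDOL implement it; the Document class, its fields and the sample users and groups are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical search hit carrying the ACL copied from its source repository.
    title: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def filter_by_acl(results, user, groups):
    """Keep only the hits whose ACL names the user or one of the user's groups."""
    user_groups = set(groups)
    return [
        doc for doc in results
        if user in doc.allowed_users or user_groups & doc.allowed_groups
    ]


hits = [
    Document("Payroll 2011", allowed_groups={"hr"}),
    Document("Team wiki", allowed_groups={"hr", "engineering"}),
    Document("Board minutes", allowed_users={"alice"}),
]

visible = [doc.title for doc in filter_by_acl(hits, "bob", ["engineering"])]
print(visible)
```

Here bob, who is only in the engineering group, gets back just the team wiki, while a user with more rights would see more of the same result set.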

Friday 12 August 2011

Entity Extraction

Entity Extraction:

Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This adds more structure to the data and enables navigation or relevancy enhancements based on specific entities.

FAST ESP ships with predefined entity extractors. In Autonomy IDOL, entity extraction is implemented using a grammar file and processed through the Eduction module via index tasks.
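
To make "detecting, extracting, and normalizing" concrete, here is a small dictionary-based sketch in Python. It is not the FAST ESP extractor or an IDOL Eduction grammar; the alias table and the function name are assumptions chosen only for illustration.

```python
import re

# Hypothetical alias table: every surface form maps to one normalized entity name.
COMPANY_ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "hp": "Hewlett-Packard",
}


def extract_companies(text, aliases=COMPANY_ALIASES):
    """Return normalized company names in the order they appear in the text."""
    found = []
    for alias, normalized in aliases.items():
        pattern = r"\b" + re.escape(alias) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), normalized))
    return [name for _, name in sorted(found)]


entities = extract_companies("HP and International Business Machines announced a deal.")
print(entities)
```

Both spellings of the same company end up under one normalized name, which is what makes navigation or relevancy tuning on the entity possible.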

Offensive Content Filter

Offensive Content Filter:
The Offensive Content Filter is a document analysis tool to filter content regarded
as offensive.

The offensive content filter is implemented as a separate document processor that can be added to an ESP
pipeline. In Autonomy IDOL, it can be implemented using the Eduction module.

How it works:
Document content is generally run through filters and compared to a predefined dictionary. Terms can be added, replaced or removed, or the entire document can be rejected.

The output of the filter is an overall score that indicates the likelihood that a document is offensive.
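
A rough Python sketch of the dictionary comparison and the overall score follows. The term weights, the scoring formula (sum of weights divided by token count) and the masking rule are my own assumptions for illustration; neither product documents its scoring this way.

```python
import re

# Hypothetical dictionary: term -> weight. A real filter would ship a much larger list.
OFFENSIVE_TERMS = {"darn": 1.0, "heck": 0.5}


def offensive_score(text, dictionary=OFFENSIVE_TERMS):
    """Average dictionary weight per token, 0.0 when nothing matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(dictionary.get(token, 0.0) for token in tokens) / len(tokens)


def mask_terms(text, dictionary=OFFENSIVE_TERMS):
    """Replace every dictionary term with asterisks of the same length."""
    def replace(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in dictionary else word
    return re.sub(r"[A-Za-z']+", replace, text)


print(offensive_score("What the heck!"))
print(mask_terms("What the heck!"))
```

A pipeline could then reject a document whose score crosses a chosen threshold and mask terms in the ones it keeps.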

Lemmatization

Lemmatization:

The purpose of lemmatization is to enable a query with one word form to match documents that contain a
different form of the word.

In English, lemmatization can occur for:
  1. singular or plural forms for nouns.
  2. positive, comparative, or superlative forms for adjectives.
  3. tense and person for verbs.
For other languages, lemmatization also allows search across case and gender forms and other form
paradigms, depending on the grammatical features for the word forms.

Lemmatization allows a user to search for a term like car and get both documents that
contain the word car and documents that contain the word cars.

Lemmatization, stemming and wildcard search:
Lemmatization differs from stemming or wildcard search by being more precise. Different word forms are
mapped to each other by using a language specific dictionary, not by applying simple suffix chopping rules
(stemming) or partial string matches (wildcard search).
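
The difference is easier to see side by side. The Python sketch below uses a tiny hand-made lemma dictionary and a deliberately naive suffix chopper; both are assumptions for illustration and much cruder than what a real search engine ships.

```python
from fnmatch import fnmatchcase

# Hypothetical language-specific dictionary: word form -> lemma.
LEMMAS = {"cars": "car", "mice": "mouse", "better": "good", "went": "go"}


def lemmatize(word):
    """Dictionary lookup: only known forms are mapped."""
    return LEMMAS.get(word, word)


def chop_suffix(word):
    """Naive stemming: strip a common suffix without consulting a dictionary."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def wildcard_match(pattern, word):
    """Partial string match, as in a wildcard query such as car*."""
    return fnmatchcase(word, pattern)


print(lemmatize("mice"), chop_suffix("mice"))      # irregular plural
print(lemmatize("caring"), chop_suffix("caring"))  # suffix chopping over-matches
print(wildcard_match("car*", "carpet"))            # wildcard over-matches too
```

The dictionary maps mice to mouse, which suffix chopping cannot do, while the chopper turns caring into car and the wildcard car* happily matches carpet.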

Friday 5 August 2011

Synonym with Autonomy IDOL

Synonym
A synonym-based search returns results that are conceptually similar to the query terms.

Solution Approaches:

  1. Enable synonym search in Autonomy IDOL
  2. Create a synonym database
Enable synonym search in Autonomy IDOL: Autonomy recommends this method if synonym matching is required for approximately a few hundred terms.

It is a three-step process:
  1. Set up a synonym file.
  2. Configure the IDOL server to use the synonym file.
  3. Execute the synonym query.

1. Set up a synonym file:
1. Create a text file and save it in the IDOL server's IDOL/content directory, using the custom file name (chosen by the user) that is specified in the [SynonymType] section of the IDOL server configuration file.
2. Create a section for each language type defined in the IDOL server configuration file.
For example:
[EnglishASCII]
[GermanUTF8]
3. In each section, create a line for each word for which you want to list synonyms (using the encoding of the associated language type).
Example:
[EnglishASCII]
cat
dog

[GermanUTF8]
Katze
Hund

4. List synonym strings next to each word and save the file. Separate the word and each string with commas (there must be no space before or after a comma). The individual terms can contain spaces but must not contain any punctuation.
For example:

[EnglishASCII]
cat,feline,grimalkin,moggy,mouser,puss,pussy,tabby
dog,bitch,cur,hound,mans best friend,mongrel,mutt,pooch,puppy

[GermanUTF8]
Katze,Mietze,Mietzekatze,Mietzekater,Kater,Mulle,Kätzchen
Hund,Wau Wau,Hündin,Töle,Kläffer,Hündchen,Welpe
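
Before handing a file like this to IDOL, it can help to check it with a small script. The Python sketch below reads the layout described in the steps above (a [Section] per language type, one comma-separated entry per line). The function name and the sample text are my own, and this is only a sanity check, not how IDOL itself reads the file.

```python
def parse_synonym_file(text):
    """Return {section: {headword: [synonyms, ...]}} for a synonym file."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            raise ValueError(f"entry outside a language section: {line!r}")
        # No spaces are allowed around commas, so a plain split is enough.
        headword, *synonyms = line.split(",")
        current[headword] = synonyms
    return sections


SAMPLE = """\
[EnglishASCII]
cat,feline,moggy,mouser
dog,hound,mans best friend

[GermanUTF8]
Katze,Kater,Kätzchen
"""

parsed = parse_synonym_file(SAMPLE)
print(parsed["EnglishASCII"]["dog"])
```

Terms that contain spaces, such as "mans best friend", survive as a single synonym because only commas separate the entries.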

To configure IDOL server to use a synonym file

1. Open the IDOL server configuration file in a text editor.
2. In the IDOL server configuration file's [FieldProcessing] section, set up a synonym process. This process allows IDOL server to determine when it must apply synonym settings.
For example:

[FieldProcessing]
0=SynonymMatch

3. Create a section for the listed synonym field process to create a property for the process (synonym properties always point to a defined synonym job). Identify the required fields to associate with the process.

For example:
[SynonymMatch]
Property=ApplySynonymMatch
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT

In this example, IDOL server returns documents for synonym queries only if their DRETITLE or DRECONTENT field values match the query.
(When identifying the fields, use the format /FieldName to match root-level fields, */FieldName to match all fields except root-level, or /Path/FieldName to match fields that the specified path points to.)

Note: - This should be implemented in [FieldProcessing] section of the IDOL config.

4. Create a section for the property to set the SynonymType parameter to the name of the synonym job that specifies which settings IDOL server must apply to synonym queries.

[ApplySynonymMatch]
SynonymType=Synonym_job

Note: - This should be implemented in [Properties] section of the IDOL config.

5. In the [Synonym] section of the IDOL server configuration file, list the synonym job whose settings must apply when a synonym query is sent to IDOL server. Multiple jobs can be set up in the [Synonym] section, but normally only one is required.
For example:

[Synonym]
0=Synonym_job

6. Define a section for the synonym job to specify the settings that must apply to synonym queries. The section must have the same name as the synonym job.
For example:

[Synonym_job]
File=animals.txt
MaxExpandLevel=1

Note: Information on "MaxExpandLevel":

Description
How many levels (0-3) of synonyms to display. This setting specifies how many levels of the synonym tree to show in the links field for query results. Enter 0 to display only direct synonyms, 1 to display direct synonyms and synonyms of the direct synonyms, and so on.

Example
The synonym file contains:
girl, young woman, lass, gal, schoolgirl, young lady, maiden, damsel
maiden, budding, fresh, pristine, new, raw, undeveloped, virgin
pristine, disinfected, germ-free, immaculate, pasteurized, purified, spotless, sterilized
Depending on the MaxExpandLevel level setting, a synonym query for the word "girl" is processed as follows:
MaxExpandLevel=0
Only directly related synonyms are added to a synonym query. If a synonym query, for example, contains the word "girl", the words "young woman", "lass", "gal", "schoolgirl", "young lady", "damsel" and "maiden" are added to it.
MaxExpandLevel=1
If a synonym query contains the word girl, direct synonyms for "girl" are added to the query ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden") as well as synonyms of these direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine").
MaxExpandLevel=2
If a synonym query contains the word girl, direct synonyms for "girl" ("young woman", "lass", "gal", "schoolgirl", "young lady", "damsel", "maiden"), synonyms of the direct synonyms ("budding", "fresh", "new", "raw", "undeveloped", "virgin", "pristine") and synonyms of these synonyms are added to the query ("disinfected", "germ-free", "immaculate", "pasteurized", "purified", "spotless", "sterilized").
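
The level-by-level expansion in this example can be reproduced with a short breadth-first walk in Python. This is only my reading of the description above, not IDOL's actual implementation; the expand function and the one-directional lookup (headword to synonyms only) are assumptions.

```python
# The three entries from the example synonym file, keyed by headword.
SYNONYMS = {
    "girl": ["young woman", "lass", "gal", "schoolgirl", "young lady", "maiden", "damsel"],
    "maiden": ["budding", "fresh", "pristine", "new", "raw", "undeveloped", "virgin"],
    "pristine": ["disinfected", "germ-free", "immaculate", "pasteurized",
                 "purified", "spotless", "sterilized"],
}


def expand(term, max_expand_level, synonyms=SYNONYMS):
    """Return the terms added to a query for the given MaxExpandLevel."""
    seen = {term}
    added = []
    frontier = [term]
    # Level 0 is one hop (direct synonyms); each extra level is one more hop.
    for _ in range(max_expand_level + 1):
        next_frontier = []
        for word in frontier:
            for synonym in synonyms.get(word, []):
                if synonym not in seen:
                    seen.add(synonym)
                    added.append(synonym)
                    next_frontier.append(synonym)
        frontier = next_frontier
    return added


for level in (0, 1, 2):
    print(level, len(expand("girl", level)))
```

Each extra level pulls in the synonyms of the terms added at the previous level, so the query grows quickly, which is probably why the setting is capped at a small number.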

7. Save the configuration file and restart IDOL server.

Execute Synonym Searches
After you create a synonym file and configure IDOL server to use it, you can turn any Query action that you send to IDOL server into a synonym query by adding &Synonym=true to it.

For example:
http://localhost:5552/action=Query&Text=Felix is a great mouser&Synonym=true

This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms listed as synonyms for the term mouser in the synonym file.   
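
If the query is built from code rather than typed into a browser, the text needs to be URL-encoded. Here is a small Python sketch that builds the same action string; the function name is my own, the host and port are the ones from the example, and percent-encoding the spaces is my addition (the example above shows them raw).

```python
from urllib.parse import quote, urlencode


def build_synonym_query(host, port, text):
    """Build a Query action URL with Synonym=true, percent-encoding the text."""
    params = urlencode({"Text": text, "Synonym": "true"}, quote_via=quote)
    return f"http://{host}:{port}/action=Query&{params}"


url = build_synonym_query("localhost", 5552, "Felix is a great mouser")
print(url)
```

Sending the request is left out on purpose, since that part depends on the client library and the server setup.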

Implementation of Approach 2: Set up an Additional Synonym IDOL Server

Key process to set up an additional IDOL server:

  1. Install the Synonym IDOL server.
  2. Create a synonym file and index it.
  3. Execute a synonym query.

Process to Install the Synonym IDOL Server
Install the IDOL server component following the installation instructions. If you install the Synonym IDOL server on the same machine as your existing IDOL server, ensure that the servers use different ports.

Create and Index a Synonym File
You can obtain the synonym file to store in your Synonym IDOL server by spidering a thesaurus site (using HTTP Connector) or by creating the file manually. A synonym file must be a text file that contains these fields, for example:

#DREREFERENCE Syn1.txt
#DRECONTENT cat feline grimalkin moggy mouser tabby siamese kitten
#DREENDDOC

#DREREFERENCE Syn2.txt
#DRECONTENT dog cur hound mongrel mutt pooch puppy
#DREENDDOC

Note: If HTTP Connector is used to create the synonym file, the connector can also be used to index the file. A manually created file can be indexed using a DREADD index action.

Execute Synonym Searches
To execute synonym searches:
1. Send a query to the Synonym IDOL server.
For example: http://synonymServerHost:synonymServerPort/action=Query&Text=mouser
2. When the Synonym IDOL server returns the synonym results, add the results to the query string and send the newly formed query to the Content IDOL server (normally a front end is set up to do this).
For example: http://IDOLhost:port/action=Query&Text=mouser+(cat feline grimalkin moggy mouser tabby siamese kitten)
This query returns documents that conceptually match the term mouser, as well as documents that conceptually match any of the terms that the Synonym IDOL server lists as synonyms for the term mouser.
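
The front-end step that glues the two queries together is simple string handling. A minimal Python sketch follows, assuming the synonym terms have already been pulled out of the Synonym IDOL server's response (how that response is parsed is not covered here); the function name is hypothetical.

```python
def combine_query(original, synonym_terms):
    """Append the synonym terms to the original text as term+(syn1 syn2 ...)."""
    # dict.fromkeys drops duplicates while keeping the original order.
    extra = " ".join(dict.fromkeys(synonym_terms))
    return f"{original}+({extra})" if extra else original


terms = ["cat", "feline", "grimalkin", "moggy", "mouser", "tabby", "siamese", "kitten"]
print(combine_query("mouser", terms))
```

When the synonym server returns nothing, the original text is sent unchanged, so the content query still works for terms that are not in the synonym index.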


Thursday 4 August 2011

Implementing Stemming with Autonomy IDOL

Feature Description:
The purpose of lemmatization/stemming is to enable a query with one word form to match documents that contain a different form of the word.
In most languages, some words share a common morphological root. Autonomy provides stemming algorithms that reduce words to this form. This process allows you to match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping and helped can all be stripped to their stem help without significant loss of meaning.
Autonomy provides, as standard, a set of stemming algorithms for the most commonly used languages. IDOL applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (IDOL removes stop words and stems query text before matching).

Solution approach:

There are two approaches to implementing stemming with Autonomy:
  1. Use the default stemming rules provided by Autonomy.
  2. Create a custom stem file for a language: you can override the default stemming rules for certain words in a given language by creating a language-specific stemming file.
Steps:
      a) Create the file. This file is a list of words and their stems. For example:
         [UTF8]
         mice mouse
         mouse mouse
         children child

      b) Open the IDOL server configuration file. In the [MyLanguage] section for the
         stemming file language, set the StemmingFile configuration parameter to
         the name of your stemming file. For example:
         [english]
         Encodings=ASCII:englishASCII,UTF8:englishUTF8
         Stoplist=english.dat
         Stemming = true
         StemmingFile=english_stem.dat
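
The effect of a custom stem file can be pictured as a lookup that runs before the default rules. The Python sketch below only illustrates that idea; the default stemmer is a crude stand-in that strips a trailing "s", not Autonomy's algorithm, and the function names are my own.

```python
def load_stem_overrides(text):
    """Parse 'word stem' lines, skipping blank lines and [Encoding] headers."""
    overrides = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("["):
            continue
        word, stem = line.split()
        overrides[word] = stem
    return overrides


def naive_default_stemmer(word):
    # Stand-in for the built-in rules: strip a plural "s" and nothing else.
    return word[:-1] if word.endswith("s") else word


def stem(word, overrides, default_stemmer=naive_default_stemmer):
    """Use the custom stem when one is listed, otherwise fall back to the default."""
    return overrides.get(word, default_stemmer(word))


STEM_FILE = """\
[UTF8]
mice mouse
mouse mouse
children child
"""

overrides = load_stem_overrides(STEM_FILE)
print([stem(w, overrides) for w in ("mice", "children", "cars")])
```

Irregular forms such as mice and children are the typical reason to add overrides, since suffix rules alone cannot reach mouse or child.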

Who Moved My Cheese???


Who Moved My Cheese? was recently suggested to me as a fantastic read. Here are my thoughts after reading the book.

It is indeed a fantastic read. In a simple, realistic and effective manner, the author explains how change is an integral part of our lives. The best part is that he explains how to deal with change, and each one of us will definitely relate to one of the four central characters: Sniff, Scurry, Hem and Haw!

I recommend it as a must-read for everybody!