Categorisation Engine

As part of the Taxonomy and Categorisation Manager, the Categorisation Engine aims to provide an end-to-end solution for applying taxonomy metadata to content and documents. This includes converting documents such as PDF or DOC files to text, extracting keywords and a summary, and automatically identifying suitable categories from a selected taxonomy. The last of these is also known as the auto-categorisation process.

 

The process works as follows:

 

  • The document is converted to text, or its text is extracted.
  • The most frequent keywords and phrases are identified.
  • A keyword list is created from single keywords and two- or three-word phrases, sorted by frequency.
  • Using the keywords and their rankings, the engine tries to match them against the category names and synonyms in the taxonomy.
  • If a category meets the minimum number of keyword matches, it is promoted to the results category list.
  • A summary of the text is then identified: every sentence is checked against the keyword list to find the sentences that contain the highest number of keywords.
  • Finally, the keyword and category lists are returned to the client application together with the summary.
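
The steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the engine's actual code: the function names, the default thresholds, the regular expressions used to split words and sentences, and the choice to count every occurrence of a category name or synonym as one match are all assumptions made for the example.

```python
import re
from collections import Counter


def keyword_frequencies(text, max_phrase_len=3):
    """Count single words and two/three-word phrases (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_phrase_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def match_categories(counts, taxonomy, min_matches=2):
    """Promote categories whose name or synonyms occur often enough.

    taxonomy maps a category name to a list of synonyms. Treating the
    "minimum matching count" as a total number of occurrences is an assumption.
    """
    results = []
    for name, synonyms in taxonomy.items():
        terms = {name.lower(), *(s.lower() for s in synonyms)}
        matches = sum(counts[term] for term in terms)  # missing terms count as 0
        if matches >= min_matches:
            results.append((name, matches))
    return sorted(results, key=lambda item: item[1], reverse=True)


def summarise(text, keywords, max_sentences=1):
    """Return the sentences that contain the most distinct keywords."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        lowered = sentence.lower()
        return sum(
            1 for k in keywords
            if re.search(r"\b" + re.escape(k.lower()) + r"\b", lowered)
        )

    # sorted() is stable, so equally scored sentences keep document order.
    return sorted(sentences, key=score, reverse=True)[:max_sentences]


def categorise(text, taxonomy, top_n=10, min_matches=2):
    """Run the whole pipeline on already extracted text."""
    counts = keyword_frequencies(text)
    keywords = [phrase for phrase, _ in counts.most_common(top_n)]
    return {
        "keywords": keywords,
        "categories": match_categories(counts, taxonomy, min_matches),
        "summary": summarise(text, keywords),
    }
```

Calling `categorise(text, {"Renewable energy": ["solar panels", "wind turbines"]})` on extracted text returns the three lists described in the final step. The sketch does no noise-word filtering, so on real text the most frequent entries will include very common words.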

 

Additional information

 

The Taxonomy Manager and Categorisation Engine can be used as separate services, but their primary purpose is to help categorise and manage content and documents in the Immediacy system. Additional dialogs are available within the core Content Management System and Web Asset Manager to manage text and keyword extraction and the categorisation of web page text and documents.

 

  • The engine is exposed through a web service, which also makes it available to external applications.

 

  • The statistical analysis at the core of the system makes it language independent.

 

  • The content filtering system extracts text from various types of documents. The solution is based on IFilter and is therefore extensible: the more IFilters you have, the more document types you can process. Support for PDF has been extended with three different solutions for extracting text.
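
The extensibility idea behind the content filtering system can be pictured as a registry that maps file types to text extractors. The Python sketch below is a conceptual illustration only: it does not use or reproduce the Windows IFilter interface, and `register`, `extract_text` and the plain-text extractor are hypothetical names invented for the example.

```python
from pathlib import Path

# Hypothetical registry: file extension -> function turning bytes into text.
EXTRACTORS = {}


def register(*extensions):
    """Decorator that registers an extractor for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data: bytes) -> str:
    # Plain text only needs decoding; undecodable bytes are replaced.
    return data.decode("utf-8", errors="replace")


def extract_text(filename: str, data: bytes) -> str:
    """Pick the extractor by file extension, or fail clearly if none exists."""
    ext = Path(filename).suffix.lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"No text extractor registered for {ext!r}") from None
    return extractor(data)
```

Supporting a new document type then means registering one more extractor, without changing the code that calls `extract_text`.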

 

AutoCat Admin

 

Configuration of the Taxonomy and Categorisation Manager in the Immediacy CMS

 

 

Enrich dialog

 

Integration with the Immediacy CMS is through the 'enrich' dialog interface.