Categorisation Engine
As part of the Taxonomy and Categorisation Manager, the
categorisation engine aims to provide an end-to-end solution for
applying taxonomy metadata to content or documents. This includes
converting documents such as PDF or DOC to text, extracting keywords
and a summary, and automatically identifying suitable categories
from a selected taxonomy. The last of these is also known as the
auto-categorisation process.
The process works as follows:
- The document is converted to text and/or its text is
extracted.
- The most frequent keywords or phrases are identified.
- A keyword list is created containing single keywords and two- or
three-word phrases, sorted by frequency.
- Based on the keywords and their rankings, the engine tries to
match these keywords to the taxonomy category names and
synonyms.
- If a category meets the minimum matching count of keywords, it
is promoted to the results category list.
- A summary of the text is then produced: all sentences are listed
and checked against the keyword list to find the sentences that
contain the highest number of keywords.
- Finally, the keyword and category lists, together with the
summary, are returned to the client application.
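The steps above can be sketched in a few lines of Python. This is
only an illustrative sketch, not the engine's actual implementation:
the function names, the minimum word length filter, the exact-match
rule for category names and synonyms, and the default thresholds are
all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a crude, language-neutral filter that drops very short words.
MIN_WORD_LENGTH = 4


def content_words(text):
    """Lower-case the text and return its word tokens, dropping short words."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) >= MIN_WORD_LENGTH]


def extract_keywords(text, top_n=10):
    """Count single words and two/three-word phrases, most frequent first."""
    words = content_words(text)
    counts = Counter(words)
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top_n)


def match_categories(keywords, taxonomy, min_matches=2):
    """Keep categories whose name or synonyms match enough keywords.

    `taxonomy` maps a category name to a list of synonyms (hypothetical shape).
    """
    keyword_set = {kw for kw, _ in keywords}
    results = []
    for name, synonyms in taxonomy.items():
        terms = {name.lower(), *(s.lower() for s in synonyms)}
        if len(terms & keyword_set) >= min_matches:
            results.append(name)
    return results


def summarise(text, keywords, max_sentences=2):
    """Return the sentences containing the most keywords, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keyword_list = [kw for kw, _ in keywords]

    def score(sentence):
        # Pad with spaces so that only whole words and phrases are matched.
        padded = " " + " ".join(content_words(sentence)) + " "
        return sum(1 for kw in keyword_list if f" {kw} " in padded)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


def categorise(text, taxonomy, min_matches=2):
    """Run the whole pipeline and return what a client application would receive."""
    keywords = extract_keywords(text)
    return {
        "keywords": keywords,
        "categories": match_categories(keywords, taxonomy, min_matches),
        "summary": summarise(text, keywords),
    }
```

Calling categorise(text, taxonomy) with already-extracted text and a
taxonomy such as {"Sport": ["football", "players"]} returns the
keyword list, the matching categories and the summary together.
Document conversion (the first step) is left out of the sketch.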
Additional information
The Taxonomy Manager and Categorisation Engine can be used as
separate services, but their primary purpose is to help categorise
and manage content and documents in the Immediacy
system. Additional dialogs are available within the core
Content Management System and Web Asset Manager to manage text
and keyword extraction and the categorisation of web page text
and/or documents.
- The engine is exposed through a web service, which makes it
available to external applications as well.
- The statistical analysis that is core to the system makes
it language-independent.
- The content filtering system is where text is extracted from
various types of documents. The solution is based on IFilter and
is therefore extensible: the more IFilters you have, the more
document types you can process. We have extended support for PDF
with three different solutions for extracting text.

Configuration of the Taxonomy and Categorisation Manager in the Immediacy CMS

Integration with the Immediacy CMS is via the
'enrich' dialog interface.