Full-text search (POC)

This optional module is part of release 24.4 and is still in the proof-of-concept (POC) stage.

Overview

This document provides technical support information for the new module "Full-text search (POC)", which includes the plugin "OCR (POC)". During ingest, the module runs Optical Character Recognition (OCR) on objects and stores the extracted text in the record metadata, so that files can be searched by their text content.

Activation

This feature is currently activated automatically on mh-uat for the organisations mh-uatb and mh-uatc. To activate it for a customer's organisation:

  1. Create a field definition Dynamic.PocOcr of type TextField

    • Also mark the field definition as global and indexed, so that end users get full-text search through the global search

  2. Link the module FULL_TEXT_SEARCH_POC to the customer’s organisation. See the REST documentation for information on how to do that.

Once activated, OCR will be carried out for all files that are newly ingested via the ingest 1.5 and 2.0 flows for that organisation. The extracted text will be saved to the Dynamic.PocOcr field.
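The behaviour described above can be pictured with a minimal Python sketch. It is an illustration only, not the module's actual implementation: the function `ingest_with_ocr`, its parameters, and the `extract_text` callback are hypothetical names introduced here. Only the module name FULL_TEXT_SEARCH_POC and the field name Dynamic.PocOcr come from this page.

```python
from typing import Callable, Dict, Set

# Names taken from this page.
OCR_MODULE = "FULL_TEXT_SEARCH_POC"
OCR_FIELD = "Dynamic.PocOcr"


def ingest_with_ocr(
    metadata: Dict[str, str],
    file_path: str,
    linked_modules: Set[str],
    extract_text: Callable[[str], str],
) -> Dict[str, str]:
    """Hypothetical ingest step: return a copy of the record metadata,
    with the extracted text added when the OCR module is linked."""
    updated = dict(metadata)  # leave the caller's metadata untouched
    if OCR_MODULE in linked_modules:
        # extract_text stands in for the real OCR/text extraction step (assumption).
        updated[OCR_FIELD] = extract_text(file_path)
    return updated
```

In this sketch, an organisation without the module linked gets its metadata back unchanged, which mirrors the statement that OCR only runs for organisations where the module is activated.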

Supported file types

The following table contains file types that are confirmed to work.

In theory, every format supported by Apache Tika should work: https://tika.apache.org/3.0.0-BETA/formats.html. However, formats not listed in the table below are not guaranteed to work.

Format              Supported file extensions

PDF files           pdf
Emails              msg, eml
Microsoft Office    doc, docx, ppt, pptx, xls, xlsx
Web pages           htm, html, asp, php
Plain text          txt, rtf, log
Images              jpg, png, bmp
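
When troubleshooting a record whose text was not extracted, a first check is whether the file's extension is among the confirmed ones. The following Python sketch encodes the table above as a lookup. The helper name `confirmed_format` is hypothetical and not part of the module, and the entry "rtf" assumes that the Rich Text Format extension is meant in the plain-text row.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the table of confirmed formats on this page.
CONFIRMED_EXTENSIONS = {
    "PDF files": {"pdf"},
    "Emails": {"msg", "eml"},
    "Microsoft Office": {"doc", "docx", "ppt", "pptx", "xls", "xlsx"},
    "Web pages": {"htm", "html", "asp", "php"},
    "Plain text": {"txt", "rtf", "log"},  # "rtf": assumed Rich Text Format
    "Images": {"jpg", "png", "bmp"},
}


def confirmed_format(filename: str) -> Optional[str]:
    """Return the table's format name for a confirmed extension, else None."""
    # Only the last suffix counts, compared case-insensitively.
    extension = Path(filename).suffix.lower().lstrip(".")
    for format_name, extensions in CONFIRMED_EXTENSIONS.items():
        if extension in extensions:
            return format_name
    return None
```

A result of None does not mean extraction will fail, only that the format is outside the confirmed list and falls under the "not guaranteed" case.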
