How to use the Aperture pipelet

eccenca Documentation

Introduction: What is Aperture?

Aperture is an open-source project that makes it possible to

  • determine the MIME type of a file such as an office document
  • extract its content
  • extract its metadata (e.g. author) into OpenRDF.

In eccenca, Aperture refers to a BPEL engine pipelet that wraps the Aperture libraries. Because the OpenRDF output is XML, it cannot be indexed directly; use the XPathExtractorPipelet to get the specific metadata items.

eccenca CE takes no part in the development of Aperture itself. It only provides the integration and therefore has no influence on the quality of this free collection of converter tools.

Scope

This guide shows

  • how Aperture is installed
  • which configuration steps need to be taken to use it, namely
    • crawling config
    • processing config

The configurations in this guide apply to the existing FileIndex and its Crawling and Processing Configuration named file, which ships with eccenca CE by default.

Steps

Step 1 - Installing Aperture

Follow the steps in "How to install a new component" to install the Aperture pipelet.

Step 2 - Update Crawler Configuration

This step includes file types other than the default text and HTML files in processing. Make sure that the search folder is correct and contains such files.

  1. Open the Crawling and Processing Configuration, e.g. file
  2. Select the "Crawling Configuration" tab
  3. Adjust the search location c:\data, or create that folder in your file system.
  4. Enable the commented-out file types within the <process> element at the bottom of the XML, or replace the whole XML with the Crawling Configuration
  5. Press Save
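
To check that the search location exists and actually contains files beyond text and HTML, a small script can help. The following is a minimal sketch in Python and not part of eccenca CE. The function names and the set of "default" extensions (.txt, .htm, .html) are assumptions made for this example.

```python
from collections import Counter
from pathlib import Path

# Assumption: extensions treated as the default text and HTML types.
DEFAULT_TYPES = {".txt", ".htm", ".html"}


def count_extensions(folder):
    """Count all files below `folder` by lower-cased file extension."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"search location does not exist: {root}")
    return Counter(
        p.suffix.lower() or "(none)"  # files without an extension
        for p in root.rglob("*")
        if p.is_file()
    )


def other_types(counts):
    """Return only the extensions that are not default text/HTML types."""
    return {ext: n for ext, n in counts.items() if ext not in DEFAULT_TYPES}
```

Calling other_types(count_extensions(r"c:\data")) returns a mapping such as extension to file count. An empty result suggests that the folder holds no additional file types for Aperture to process.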

Step 3 - Update Processing Configuration I

In this step we add the Aperture pipelet to the BPEL workflow.

  1. Open the Crawling and Processing Configuration, e.g. file
  2. Select the "Processing Configuration" tab
  3. Replace the default workflow with the Aperture pipelet invocation, placed at the location of the comment below:
     <sequence>
       <receive createInstance="yes" name="start" operation="process"
           partnerLink="Pipeline" portType="proc:ProcessorPortType" variable="request" />

       <!-- aperture pipelet invocation goes here -->

       <reply name="end" operation="process" partnerLink="Pipeline"
           portType="proc:ProcessorPortType" variable="request" />
       <exit />
     </sequence>
    
  4. Press Save

Step 4 - Update Processing Configuration II

As mentioned above, Aperture extracts the metadata into RDF. The specific metadata items then need to be extracted from the RDF with the XPathExtractorPipelet, which is shipped with SMILA.
This extraction must come after the Aperture invocation in the workflow.

  1. Open the Crawling and Processing Configuration, e.g. file
  2. Select the "Processing Configuration" tab
  3. Insert the title extraction invocation after the Aperture invocation from the previous step.
  4. Press Save

Your complete Processing Configuration should now look like this: Aperture Pipelet - Processing Config - Final XML
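
To illustrate what extracting a single item from RDF with an XPath expression means, here is a minimal sketch in Python. It does not use SMILA or the XPathExtractorPipelet. The sample RDF, the Dublin Core vocabulary, and the function name extract_first are assumptions for this example; the RDF that Aperture actually produces may use different element names and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real Aperture output may use other vocabularies.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="file:///c:/data/report.doc">
    <dc:title>Quarterly Report</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""

# Prefix-to-URI mapping used to resolve prefixes in the XPath expressions.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def extract_first(rdf_xml, xpath):
    """Return the text of the first element matching `xpath`, or None."""
    root = ET.fromstring(rdf_xml.strip())
    node = root.find(xpath, NAMESPACES)
    return node.text if node is not None else None


title = extract_first(SAMPLE_RDF, ".//dc:title")
```

The point of the sketch is that the RDF document as a whole is not a useful index value, while a single value selected from it, such as the title, is.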

Conclusion

You have seen how the Aperture pipelet is incorporated into a crawl process and how titles are extracted from documents.

This HowTo does not explain how to create a new index field that receives the title and makes it available in search. Please refer to "5 Minutes to success - RegExpTransformer pipelet" for an example.
