How to use RegExpTransformer pipelet

What is RegExpTransformer?

It is a conversion pipelet that transforms a string using regular expressions.

How to use RegExpTransformer in Eccenca.CE

In this "5 Minutes to success" page we will:

  • Install the required pipelet from the update site
  • Add a new field to the FileIndex collection to store the file directory extracted from its path
  • Invoke the directory calculation from the conversion pipeline
  • Crawl, search and check the results

Workflow

Install Basic Pipelet feature from Eccenca.CE update site

After you press "Apply changes", the pipelet is installed.

Modify collection by adding new field for storing file dir

Remove physical index
  • Navigate to the "Collections" tab
  • Remove the index for the FileIndex collection
Add new text field to IndexStructure
  • Navigate to the FileIndex collection
  • Navigate to the IndexStructure tab of the collection and add the following field to the configuration XML. It becomes field number 6.
<IndexField FieldNo="6" IndexValue="true" Name="Dir" StoreText="true" Tokenize="true" Type="Text"/>
The complete IndexStructure then looks like this:
<IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="FileIndex">
  <Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <IndexField FieldNo="6" IndexValue="true" Name="Dir" StoreText="true" Tokenize="true" Type="Text"/>
  <IndexField FieldNo="5" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
  <IndexField FieldNo="4" IndexValue="true" Name="Date" StoreText="true" Tokenize="false" Type="Date"/>
  <IndexField FieldNo="3" IndexValue="true" Name="Size" StoreText="true" Tokenize="false" Type="Number"/>
  <IndexField FieldNo="2" IndexValue="true" Name="Extension" StoreText="true" Tokenize="false" Type="Text"/>
  <IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
  <IndexField FieldNo="0" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
</IndexStructure>
Change search result and default search configuration
  • Navigate to the Configuration tab of the collection and add field 6 to both DefaultConfig and Result
<Configuration xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration
../xml/DataDictionaryConfiguration.xsd"
xmlns="http://www.anyfinder.de/DataDictionary/Configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DefaultConfig>
    <Field FieldNo="6">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
        <Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="5">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
        <Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="4">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTDate">
        <Parameter xmlns="http://www.anyfinder.de/Search/DateField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="3">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTNumber">
        <Parameter xmlns="http://www.anyfinder.de/Search/NumberField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="2">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
        <Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="1">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
        <Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
      </FieldConfig>
    </Field>
    <Field FieldNo="0">
      <FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
        <Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
      </FieldConfig>
    </Field>
  </DefaultConfig>
  <Result Name="">
    <ResultField FieldNo="6" Name="Dir" />
    <ResultField FieldNo="4" Name="Date" />
    <ResultField FieldNo="3" Name="Size" />
    <ResultField FieldNo="2" Name="Extension" />
    <ResultField FieldNo="1" Name="Filename" />
    <ResultField FieldNo="0" Name="Path" />
  </Result>
  <HighlightingResult Name="">
    <HighlightingResultField FieldNo="5" Name="Content" xsi:type="HLTextField">
      <HighlightingTransformer Name="urn:Sentence">
        <ParameterSet xmlns="http://www.brox.de/ParameterSet">
          <Parameter Name="MaxLength" xsi:type="Integer">
            <Value>500</Value>
          </Parameter>
          <Parameter Name="MaxHLElements" xsi:type="Integer">
            <Value>999</Value>
          </Parameter>
          <Parameter Name="MaxSucceedingCharacters" xsi:type="Integer">
            <Value>50</Value>
          </Parameter>
          <Parameter Name="SucceedingCharacters" xsi:type="String">
            <Value>...</Value>
          </Parameter>
          <Parameter Name="SortAlgorithm" xsi:type="String">
            <Value>Occurrence</Value>
          </Parameter>
          <Parameter Name="TextHandling" xsi:type="String">
            <Value>ReturnSnipplet</Value>
          </Parameter>
        </ParameterSet>
      </HighlightingTransformer>
      <HighlightingParameter xmlns="http://www.anyfinder.de/DataDictionary/Configuration/TextHighlighting" />
    </HighlightingResultField>
  </HighlightingResult>
</Configuration>
Update mappings from record into index
  • Navigate to the Mapping tab of the collection and add a mapping from the attribute "Dir" to field 6.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Mapping xmlns="http://www.eccenca.com/eccenca/lucene" indexName="FileIndex">
  <Attributes>
    <Attribute fieldNo="0" name="Path" />
    <Attribute fieldNo="1" name="Filename" />
    <Attribute fieldNo="2" name="Extension" />
    <Attribute fieldNo="3" name="Size" />
    <Attribute fieldNo="4" name="LastModifiedDate" />
    <Attribute fieldNo="6" name="Dir" />
  </Attributes>
  <Attachments>
    <Attachment fieldNo="5" name="Text" />
  </Attachments>
</Mapping>
Save collection and create physical index
  • Press Save. The message "Collection 'FileIndex' was successfully updated." appears.
  • Press the create index icon to create the index.
Update index order conversion pipeline
  • Navigate to the Index Orders of the collection and click the "file" index order to edit it. Only the processing pipeline has to change, to add the Dir calculation.
  • Navigate to the "Processing pipeline" tab and add the RegExpTransformer pipelet invocation.
<process name="Convert_FileIndex_file" targetNamespace="http://www.eclipse.org/smila/processor"
xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable" xmlns:id="http://www.eclipse.org/smila/id"
xmlns:proc="http://www.eclipse.org/smila/processor" xmlns:rec="http://www.eclipse.org/smila/record"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <import importType="http://schemas.xmlsoap.org/wsdl/" location="../processor.wsdl"
namespace="http://www.eclipse.org/smila/processor" />
  <partnerLinks>
    <partnerLink myRole="service" name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" />
  </partnerLinks>
  <extensions>
    <extension mustUnderstand="no" namespace="http://www.eclipse.org/smila/processor" />
  </extensions>
  <variables>
    <variable messageType="proc:ProcessorMessage" name="request" />
  </variables>
  <sequence>
    <receive createInstance="yes" name="start" operation="process" partnerLink="Pipeline"
portType="proc:ProcessorPortType" variable="request" />
    <!-- MIME type identification -->
    <extensionActivity name="invokeSimpleMimeTypeIdentification">
      <proc:invokeService>
        <proc:service name="MimeTypeIdentifyService" />
        <proc:variables input="request" output="request" />
      </proc:invokeService>
    </extensionActivity>
    <!-- Conversion -->
    <if name="conditionIsText">
      <condition>($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/plain")
        or ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/plain")</condition>
      <extensionActivity name="invokeCopyText">
        <proc:invokePipelet>
          <proc:pipelet class="com.brox.anyfinder.processing.utils.CopyAttachmentPipelet" />
          <proc:variables input="request" output="request" />
          <proc:PipeletConfiguration>
            <proc:Property name="source">
              <proc:Value>Content</proc:Value>
            </proc:Property>
            <proc:Property name="target">
              <proc:Value>Text</proc:Value>
            </proc:Property>
          </proc:PipeletConfiguration>
        </proc:invokePipelet>
      </extensionActivity>
    </if>
    <if name="conditionIsHtml">
      <condition>($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/html")
        or ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/xml")</condition>
      <extensionActivity name="invokeHtml2Txt">
        <proc:invokePipelet>
          <proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
          <proc:variables input="request" output="request" />
          <proc:PipeletConfiguration>
            <proc:Property name="inputType">
              <proc:Value>ATTACHMENT</proc:Value>
            </proc:Property>
            <proc:Property name="outputType">
              <proc:Value>ATTACHMENT</proc:Value>
            </proc:Property>
            <proc:Property name="inputName">
              <proc:Value>Content</proc:Value>
            </proc:Property>
            <proc:Property name="outputName">
              <proc:Value>Text</proc:Value>
            </proc:Property>
            <proc:Property name="meta:title">
              <proc:Value>Title</proc:Value>
            </proc:Property>
          </proc:PipeletConfiguration>
        </proc:invokePipelet>
      </extensionActivity>
    </if>
    <!-- RegExpTransformer invocation -->
    <extensionActivity name="invokeRegExpTransformer">
      <proc:invokePipelet>
        <proc:pipelet class="com.eccenca.pipelets.basic.RegExpTransformer" />
        <proc:variables input="request" />
        <proc:PipeletConfiguration>
          <proc:Property name="Source">
            <proc:Value>Path</proc:Value>
          </proc:Property>
          <proc:Property name="Target">
            <proc:Value>Dir</proc:Value>
          </proc:Property>
          <proc:Property name="SourceType">
            <proc:Value>ATTRIBUTE</proc:Value>
          </proc:Property>
          <proc:Property name="TargetType">
            <proc:Value>ATTRIBUTE</proc:Value>
          </proc:Property>
          <proc:Property name="Value">
            <proc:Value>^(.+)[\\/]+[^\\/]+$</proc:Value>
          </proc:Property>
          <proc:Property name="ValueIsPattern" type="java.lang.Boolean">
            <proc:Value>true</proc:Value>
          </proc:Property>
          <proc:Property name="Translation">
            <proc:Value>$1</proc:Value>
          </proc:Property>
          <proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
            <proc:Value>true</proc:Value>
          </proc:Property>
          <proc:Property name="IgnoreCase" type="java.lang.Boolean">
            <proc:Value>true</proc:Value>
          </proc:Property>
        </proc:PipeletConfiguration>
      </proc:invokePipelet>
    </extensionActivity>
    <reply name="end" operation="process" partnerLink="Pipeline" portType="proc:ProcessorPortType" variable="request" />
    <exit />
  </sequence>
</process>

Press the save button. The confirmation message "Index Order 'file' was updated successfully" appears.

Create the index, run the index order, then perform a search and check the file directories in the results.
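
To know which Dir values to expect before crawling, the regular expression can be tried outside the pipeline. The following is a minimal Java sketch, assuming the pipelet applies standard java.util.regex semantics to the pattern ^(.+)[\\/]+[^\\/]+$ with the translation $1 and IgnoreCase enabled. The class and method names are hypothetical and not part of the pipelet.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Eccenca.CE: mimics the Value/Translation
// pair of the pipeline configuration with plain java.util.regex.
class DirExtractionSketch {

    // Same pattern as the "Value" property. Backslashes are doubled once more
    // because this is a Java string literal.
    private static final Pattern DIR_PATTERN =
            Pattern.compile("^(.+)[\\\\/]+[^\\\\/]+$", Pattern.CASE_INSENSITIVE);

    // Replaces the whole path with capture group 1, that is everything before
    // the last path separator. Input that does not match is returned unchanged.
    static String extractDir(String path) {
        return DIR_PATTERN.matcher(path).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extractDir("C:\\data\\docs\\report.txt")); // C:\data\docs
        System.out.println(extractDir("/home/user/notes.txt"));       // /home/user
        System.out.println(extractDir("notes.txt"));                  // notes.txt (no separator, no match)
    }
}
```

In this sketch a path without any separator stays as it is. How the pipelet itself treats a non-matching source value is not covered here.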

More configuration examples

Appending a configured string

This configuration appends a "World" string to the source string, making "Hello World" out of source "Hello ":

  <proc:Property name="Value">
    <proc:Value>(^(?:.|\n)*$)</proc:Value>
  </proc:Property>
  <proc:Property name="ValueIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
  <proc:Property name="Translation">
    <proc:Value>$1World</proc:Value>
  </proc:Property>
  <proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
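
The effect of this Value/Translation pair can be reproduced with plain Java. This is only a sketch, assuming the pipelet behaves like java.util.regex replaceAll; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
class AppendSketch {

    // "Value" property: one capture group spanning the whole input,
    // including line breaks (the ".|\n" alternation).
    private static final Pattern WHOLE_INPUT = Pattern.compile("(^(?:.|\\n)*$)");

    // "Translation" property: group 1 followed by the literal text "World".
    static String appendWorld(String source) {
        return WHOLE_INPUT.matcher(source).replaceAll("$1World");
    }

    public static void main(String[] args) {
        System.out.println(appendWorld("Hello ")); // Hello World
    }
}
```

Because the group covers the complete input, multi-line source strings are kept intact and the suffix is added once, at the very end.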

Prepending a prefix

This configuration prepends a "Hello" string to the source string, making "Hello World" out of source " World":

  <proc:Property name="Value">
    <proc:Value>(^(?:.|\n)*$)</proc:Value>
  </proc:Property>
  <proc:Property name="ValueIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
  <proc:Property name="Translation">
    <proc:Value>Hello$1</proc:Value>
  </proc:Property>
  <proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
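
To see what $1 stands for in the translation, the replacement can be written out by hand. The Java sketch below assumes java.util.regex semantics and uses hypothetical names. It shows that the translation Hello$1 amounts to concatenating "Hello" with the text captured by group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
class PrependSketch {

    // "Value" property: a single group that captures the whole input.
    private static final Pattern WHOLE_INPUT = Pattern.compile("(^(?:.|\\n)*$)");

    // Pattern-style translation, as in the configuration.
    static String prependWithPattern(String source) {
        return WHOLE_INPUT.matcher(source).replaceAll("Hello$1");
    }

    // The same result spelled out: read group 1 and concatenate manually.
    static String prependByHand(String source) {
        Matcher m = WHOLE_INPUT.matcher(source);
        return m.matches() ? "Hello" + m.group(1) : source;
    }

    public static void main(String[] args) {
        System.out.println(prependWithPattern(" World")); // Hello World
        System.out.println(prependByHand(" World"));      // Hello World
    }
}
```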

Replacing source string with a configured one

This configuration replaces the whole source string with the fixed text "MyReplace":

  <proc:Property name="Value">
    <proc:Value>^(?:.|\n)*$</proc:Value>
  </proc:Property>
  <proc:Property name="ValueIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
  <proc:Property name="Translation">
    <proc:Value>MyReplace</proc:Value>
  </proc:Property>
  <proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
    <proc:Value>false</proc:Value>
  </proc:Property>
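
Here TranslationIsPattern is false, which presumably means the translation is taken literally rather than parsed for group references. A rough Java equivalent, under that assumption and with hypothetical names, quotes the replacement before use:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
class LiteralReplaceSketch {

    // "Value" property: matches the whole input, no capture group needed.
    private static final Pattern WHOLE_INPUT = Pattern.compile("^(?:.|\\n)*$");

    // Assumed reading of TranslationIsPattern=false: "$" and "\" in the
    // translation have no special meaning, so the text is quoted first.
    static String replaceWith(String source, String literal) {
        return WHOLE_INPUT.matcher(source).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(replaceWith("anything at all", "MyReplace")); // MyReplace
        System.out.println(replaceWith("old\ntext", "costs $1"));        // costs $1
    }
}
```

For a plain word such as MyReplace the quoting makes no difference. It matters once the replacement text contains a dollar sign or a backslash.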

Stripping unwanted characters

This configuration removes all characters from the source string except A-Z, a-z, 0-9 and whitespace:

  <proc:Property name="Value">
    <proc:Value>([^A-Za-z0-9\s])</proc:Value>
  </proc:Property>
  <proc:Property name="ValueIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
  <proc:Property name="Translation">
    <proc:Value></proc:Value>
  </proc:Property>
  <proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
    <proc:Value>true</proc:Value>
  </proc:Property>
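
The stripping behaviour of the pattern ([^A-Za-z0-9\s]) with an empty translation can be checked with a few lines of Java. This is a sketch that assumes java.util.regex semantics; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
class StripSketch {

    // "Value" property: any single character that is not an ASCII letter,
    // a digit or whitespace.
    private static final Pattern UNWANTED = Pattern.compile("([^A-Za-z0-9\\s])");

    // Empty "Translation": every matched character is dropped.
    static String strip(String source) {
        return UNWANTED.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Hello, World! (v2.0)")); // Hello World v20
    }
}
```

Note that the character class lists only ASCII letters, so accented or other non-ASCII letters are removed as well.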
