h2. What is RegExpTransformer?

RegExpTransformer is a conversion pipelet that transforms a string using regular expressions.

h2. How to use RegExpTransformer in Eccenca.CE

In this "5 Minutes to success" page we will:

* Install the required pipelet from the update site
* Add a new field to the FileIndex collection to store the file's directory, extracted from its path
* Invoke the directory calculation from the conversion pipeline
* Crawl, search and check the results

h2. Workflow

h3. Install Basic Pipelet feature from Eccenca.CE update site

!basic_install_1.PNG!

!basic_install_2.PNG!

After you press "Apply changes", the pipelet is installed.

h3. Modify collection by adding new field for storing file dir

h5. Remove physical index

* Navigate to the "Collections" tab
* Remove index for FileIndex collection.

h5. Add new text field to IndexStructure

* Navigate to FileIndex collection

* Navigate to the IndexStructure tab of the collection and add the new field to the configuration XML. It becomes field number 6.

{code:xml}<IndexField FieldNo="6" IndexValue="true" Name="Dir" StoreText="true" Tokenize="true" Type="Text"/>
{code}

The complete IndexStructure configuration then looks like this:

{code:xml}<IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="FileIndex">
<Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
<IndexField FieldNo="6" IndexValue="true" Name="Dir" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="5" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="4" IndexValue="true" Name="Date" StoreText="true" Tokenize="false" Type="Date"/>
<IndexField FieldNo="3" IndexValue="true" Name="Size" StoreText="true" Tokenize="false" Type="Number"/>
<IndexField FieldNo="2" IndexValue="true" Name="Extension" StoreText="true" Tokenize="false" Type="Text"/>
<IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="0" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
</IndexStructure>
{code}
h5. Change search result and default search configuration

* Navigate to the Configuration tab of the collection and add the new field to {{Result}} and {{DefaultConfig}}.

{code:xml}<Configuration xsi:schemaLocation="http://www.anyfinder.de/DataDictionary/Configuration
../xml/DataDictionaryConfiguration.xsd"
xmlns="http://www.anyfinder.de/DataDictionary/Configuration" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DefaultConfig>
<Field FieldNo="6">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
</FieldConfig>
</Field>
<Field FieldNo="5">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
</FieldConfig>
</Field>
<Field FieldNo="4">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTDate">
<Parameter xmlns="http://www.anyfinder.de/Search/DateField" />
</FieldConfig>
</Field>
<Field FieldNo="3">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTNumber">
<Parameter xmlns="http://www.anyfinder.de/Search/NumberField" />
</FieldConfig>
</Field>
<Field FieldNo="2">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
</FieldConfig>
</Field>
<Field FieldNo="1">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
</FieldConfig>
</Field>
<Field FieldNo="0">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http://www.anyfinder.de/Search/TextField" />
</FieldConfig>
</Field>
</DefaultConfig>
<Result Name="">
<ResultField FieldNo="6" Name="Dir" />
<ResultField FieldNo="4" Name="Date" />
<ResultField FieldNo="3" Name="Size" />
<ResultField FieldNo="2" Name="Extension" />
<ResultField FieldNo="1" Name="Filename" />
<ResultField FieldNo="0" Name="Path" />
</Result>
<HighlightingResult Name="">
<HighlightingResultField FieldNo="5" Name="Content" xsi:type="HLTextField">
<HighlightingTransformer Name="urn:Sentence">
<ParameterSet xmlns="http://www.brox.de/ParameterSet">
<Parameter Name="MaxLength" xsi:type="Integer">
<Value>500</Value>
</Parameter>
<Parameter Name="MaxHLElements" xsi:type="Integer">
<Value>999</Value>
</Parameter>
<Parameter Name="MaxSucceedingCharacters" xsi:type="Integer">
<Value>50</Value>
</Parameter>
<Parameter Name="SucceedingCharacters" xsi:type="String">
<Value>...</Value>
</Parameter>
<Parameter Name="SortAlgorithm" xsi:type="String">
<Value>Occurrence</Value>
</Parameter>
<Parameter Name="TextHandling" xsi:type="String">
<Value>ReturnSnipplet</Value>
</Parameter>
</ParameterSet>
</HighlightingTransformer>
<HighlightingParameter xmlns="http://www.anyfinder.de/DataDictionary/Configuration/TextHighlighting" />
</HighlightingResultField>
</HighlightingResult>
</Configuration>
{code}
h5. Update mappings from record into index

* Navigate to the Mapping tab of the collection and add a mapping from the attribute "Dir" to field 6.

{code:xml}<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Mapping xmlns="http://www.eccenca.com/eccenca/lucene" indexName="FileIndex">
<Attributes>
<Attribute fieldNo="0" name="Path" />
<Attribute fieldNo="1" name="Filename" />
<Attribute fieldNo="2" name="Extension" />
<Attribute fieldNo="3" name="Size" />
<Attribute fieldNo="4" name="LastModifiedDate" />
<Attribute fieldNo="6" name="Dir" />
</Attributes>
<Attachments>
<Attachment fieldNo="5" name="Text" />
</Attachments>
</Mapping>
{code}
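
The field numbers in the IndexStructure and in the mapping have to line up, so a quick cross-check before saving can catch a forgotten entry. The following is only a sketch in Python, not part of Eccenca.CE: the helper names are hypothetical, and the XML strings are trimmed copies of the configurations above that keep only the attributes the check needs.

```python
import xml.etree.ElementTree as ET

INDEX_NS = "{http://www.anyfinder.de/IndexStructure}"
MAPPING_NS = "{http://www.eccenca.com/eccenca/lucene}"

# Trimmed copies of the configurations above (only FieldNo/fieldNo and names kept).
INDEX_STRUCTURE = """<IndexStructure xmlns="http://www.anyfinder.de/IndexStructure" Name="FileIndex">
<IndexField FieldNo="6" Name="Dir"/>
<IndexField FieldNo="5" Name="Content"/>
<IndexField FieldNo="4" Name="Date"/>
<IndexField FieldNo="3" Name="Size"/>
<IndexField FieldNo="2" Name="Extension"/>
<IndexField FieldNo="1" Name="Filename"/>
<IndexField FieldNo="0" Name="Path"/>
</IndexStructure>"""

MAPPING = """<Mapping xmlns="http://www.eccenca.com/eccenca/lucene" indexName="FileIndex">
<Attributes>
<Attribute fieldNo="0" name="Path" />
<Attribute fieldNo="1" name="Filename" />
<Attribute fieldNo="2" name="Extension" />
<Attribute fieldNo="3" name="Size" />
<Attribute fieldNo="4" name="LastModifiedDate" />
<Attribute fieldNo="6" name="Dir" />
</Attributes>
<Attachments>
<Attachment fieldNo="5" name="Text" />
</Attachments>
</Mapping>"""


def index_field_numbers(index_structure_xml):
    """Return the set of FieldNo values declared in an IndexStructure."""
    root = ET.fromstring(index_structure_xml)
    return {int(f.get("FieldNo")) for f in root.iter(INDEX_NS + "IndexField")}


def mapped_field_numbers(mapping_xml):
    """Return the set of fieldNo values used by attributes and attachments."""
    root = ET.fromstring(mapping_xml)
    numbers = set()
    for tag in ("Attribute", "Attachment"):
        for element in root.iter(MAPPING_NS + tag):
            numbers.add(int(element.get("fieldNo")))
    return numbers


def compare_field_numbers(index_structure_xml, mapping_xml):
    """Return (index fields without a mapping, mapped numbers missing from the index)."""
    index_fields = index_field_numbers(index_structure_xml)
    mapped = mapped_field_numbers(mapping_xml)
    return sorted(index_fields - mapped), sorted(mapped - index_fields)
```

Two empty lists mean every index field has a mapping and every mapping points at an existing field. Only the numbers are compared, because the names may legitimately differ (field 4 is called "Date" in the index but "LastModifiedDate" in the mapping).
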
h5. Save collection and create physical index

* Press Save. You will see the message "Collection 'FileIndex' was successfully updated."
* Press the create index icon to create the index.

h5. Update index order conversion pipeline

* Navigate to the Index Orders of the collection and click the "file" index order to edit it. Only the processing pipeline has to change, to add the Dir calculation.
* Navigate to the "Processing pipeline" tab and add the RegExpTransformer pipelet invocation.

{code:xml}<process name="Convert_FileIndex_file" targetNamespace="http://www.eclipse.org/smila/processor"
xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable" xmlns:id="http://www.eclipse.org/smila/id"
xmlns:proc="http://www.eclipse.org/smila/processor" xmlns:rec="http://www.eclipse.org/smila/record"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<import importType="http://schemas.xmlsoap.org/wsdl/" location="../processor.wsdl"
namespace="http://www.eclipse.org/smila/processor" />
<partnerLinks>
<partnerLink myRole="service" name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" />
</partnerLinks>
<extensions>
<extension mustUnderstand="no" namespace="http://www.eclipse.org/smila/processor" />
</extensions>
<variables>
<variable messageType="proc:ProcessorMessage" name="request" />
</variables>
<sequence>
<receive createInstance="yes" name="start" operation="process" partnerLink="Pipeline"
portType="proc:ProcessorPortType" variable="request" />
<!-- MIME type identification -->
<extensionActivity name="invokeSimpleMimeTypeIdentification">
<proc:invokeService>
<proc:service name="MimeTypeIdentifyService" />
<proc:variables input="request" output="request" />
</proc:invokeService>
</extensionActivity>
<!-- Conversion -->
<if name="conditionIsText">
<condition>$request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/plain"</condition>
<extensionActivity name="invokeCopyText">
<proc:invokePipelet>
<proc:pipelet class="com.brox.anyfinder.processing.utils.CopyAttachmentPipelet" />
<proc:variables input="request" output="request" />
<proc:PipeletConfiguration>
<proc:Property name="source">
<proc:Value>Content</proc:Value>
</proc:Property>
<proc:Property name="target">
<proc:Value>Text</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
</if>
<if name="conditionIsHtml">
<condition>($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/html")
or ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/xml")</condition>
<extensionActivity name="invokeHtml2Txt">
<proc:invokePipelet>
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" />
<proc:variables input="request" output="request" />
<proc:PipeletConfiguration>
<proc:Property name="inputType">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<proc:Property name="outputType">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<proc:Property name="inputName">
<proc:Value>Content</proc:Value>
</proc:Property>
<proc:Property name="outputName">
<proc:Value>Text</proc:Value>
</proc:Property>
<proc:Property name="meta:title">
<proc:Value>Title</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
</if>
<!-- RegExpTransformer invocation -->
<extensionActivity name="invokeRegExpTransformer">
<proc:invokePipelet>
<proc:pipelet class="com.eccenca.pipelets.basic.RegExpTransformer" />
<proc:variables input="request" />
<proc:PipeletConfiguration>
<proc:Property name="Source">
<proc:Value>Path</proc:Value>
</proc:Property>
<proc:Property name="Target">
<proc:Value>Dir</proc:Value>
</proc:Property>
<proc:Property name="SourceType">
<proc:Value>ATTRIBUTE</proc:Value>
</proc:Property>
<proc:Property name="TargetType">
<proc:Value>ATTRIBUTE</proc:Value>
</proc:Property>
<proc:Property name="Value">
<proc:Value>^(.+)[\\/]+[^\\/]+$</proc:Value>
</proc:Property>
<proc:Property name="ValueIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="Translation">
<proc:Value>$1</proc:Value>
</proc:Property>
<proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="IgnoreCase" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
<reply name="end" operation="process" partnerLink="Pipeline" portType="proc:ProcessorPortType" variable="request" />
<exit />
</sequence>
</process>
{code}
Press the save button. You will see the confirmation message "Index Order 'file' was updated successfully".

Create the index, run the index order, then perform a search and check the file directories:

!regexp_install_3.PNG!
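
To preview what the {{Value}} and {{Translation}} settings of the pipeline extract without running a crawl, the pattern can be tried out on a few sample paths. The sketch below uses Python's {{re}} module purely for illustration. It assumes the pipelet behaves like a "replace all matches" call on a Java regular expression, where the group reference is written {{$1}} (Python writes it as {{\1}}); the function name and sample paths are made up for this example.

```python
import re

# Same pattern as the "Value" property: everything before the last run of
# path separators (slash or backslash) is captured as group 1.
DIR_PATTERN = re.compile(r"^(.+)[\\/]+[^\\/]+$", re.IGNORECASE)


def extract_dir(path):
    """Replace the whole path with group 1; paths that do not match stay unchanged."""
    # \1 is Python's spelling of the "$1" used in the "Translation" property.
    return DIR_PATTERN.sub(r"\1", path)


if __name__ == "__main__":
    for sample in (r"C:\data\docs\readme.txt", "/home/user/notes.txt", "readme.txt"):
        print(sample, "->", extract_dir(sample))
```

A path without any separator, such as "readme.txt", does not match and is returned as is under this assumption. {{IgnoreCase}} makes no difference for this particular pattern because it contains no letters.
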

h2. More configuration examples

h3. Appending a configured string

This configuration appends the string "World" to the source string, turning the source "Hello " into "Hello World":

{code:xml} <proc:Property name="Value">
<proc:Value>(^(?:.|\n)*$)</proc:Value>
</proc:Property>
<proc:Property name="ValueIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="Translation">
<proc:Value>$1World</proc:Value>
</proc:Property>
<proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
{code}
h3. Prepending a prefix

This configuration prepends the string "Hello" to the source string, turning the source " World" into "Hello World":

{code:xml} <proc:Property name="Value">
<proc:Value>(^(?:.|\n)*$)</proc:Value>
</proc:Property>
<proc:Property name="ValueIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="Translation">
<proc:Value>Hello$1</proc:Value>
</proc:Property>
<proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
{code}
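
The append and prepend configurations differ only in where {{$1}} sits in the translation. The following Python sketch illustrates this with the same {{Value}} and {{Translation}} strings as above. It is an approximation under stated assumptions: the pipelet is assumed to expand Java-style {{$n}} references, and the helper {{java_style_sub}} is a hypothetical stand-in written for this page, not the pipelet's implementation.

```python
import re

# Captures the whole source string, including line breaks, as group 1.
WHOLE_STRING = r"(^(?:.|\n)*$)"


def java_style_sub(pattern, translation, source):
    """Apply pattern to source, expanding $n (single digit) in translation."""

    def expand(match):
        # Swap each $n for the text of group n of the current match.
        return re.sub(
            r"\$(\d)",
            lambda ref: match.group(int(ref.group(1))),
            translation,
        )

    return re.sub(pattern, expand, source)


if __name__ == "__main__":
    print(repr(java_style_sub(WHOLE_STRING, "$1World", "Hello ")))
    print(repr(java_style_sub(WHOLE_STRING, "Hello$1", " World")))
```

Because the pattern is anchored at both ends and {{(?:.|\n)*}} also consumes line breaks, a multi-line source is treated as one match and the configured string is added only once.
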
h3. Replacing source string with a configured one

This configuration replaces the whole source string with the fixed string "MyReplace" ({{TranslationIsPattern}} is set to false):

{code:xml} <proc:Property name="Value">
<proc:Value>^(?:.|\n)*$</proc:Value>
</proc:Property>
<proc:Property name="ValueIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="Translation">
<proc:Value>MyReplace</proc:Value>
</proc:Property>
<proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
<proc:Value>false</proc:Value>
</proc:Property>
{code}
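
With a non-pattern translation, the replacement text is presumably used exactly as written. The Python sketch below shows one way to get that behaviour: the replacement is passed as a function so that no group references are expanded. This is an illustration of the idea only. How the pipelet handles {{TranslationIsPattern}} internally is an assumption here, and {{replace_whole}} is a hypothetical name.

```python
import re

# Matches the whole source string, including line breaks.
WHOLE_STRING = r"^(?:.|\n)*$"


def replace_whole(source, translation):
    """Replace the entire source with translation, taken literally."""
    # A function replacement is inserted as-is, so characters such as
    # "$" or "\" in the translation are never read as group references.
    return re.sub(WHOLE_STRING, lambda _match: translation, source)


if __name__ == "__main__":
    print(replace_whole("any text\nover two lines", "MyReplace"))
```
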
h3. Stripping unwanted characters

This configuration removes every character from the source string except "A-Z", "a-z", "0-9" and whitespace:

{code:xml} <proc:Property name="Value">
<proc:Value>([^A-Za-z0-9\s])</proc:Value>
</proc:Property>
<proc:Property name="ValueIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
<proc:Property name="Translation">
<proc:Value></proc:Value>
</proc:Property>
<proc:Property name="TranslationIsPattern" type="java.lang.Boolean">
<proc:Value>true</proc:Value>
</proc:Property>
{code}
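
The effect of the negated character class with an empty translation can be checked on a sample string. This is again a Python sketch for illustration, assuming the pipelet replaces every match. The {{re.ASCII}} flag is used so that {{\s}} covers only ASCII whitespace, which is the default in Java regular expressions; {{strip_unwanted}} is a hypothetical helper name.

```python
import re

# Any single character that is not an ASCII letter, digit or whitespace.
UNWANTED = re.compile(r"([^A-Za-z0-9\s])", re.ASCII)


def strip_unwanted(source):
    """Delete every unwanted character by replacing it with an empty string."""
    return UNWANTED.sub("", source)


if __name__ == "__main__":
    print(strip_unwanted("Hello, World! 2024?"))
```

Note that accented and other non-ASCII letters fall outside "A-Z" and "a-z", so they are removed as well.
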