What is Text Fingerprint?
It's a simple heuristic for fast recognition of structural duplicates.
The text is split into words using a locale (which one is not very important) and encoded into a fingerprint of fixed length. Two fingerprints are similar if their source texts are similar, so a text search engine such as Lucene will usually order texts by their similarity when the fingerprints are searched. Alternatively, the similarity method gives a similarity measure that is more independent of the underlying search engine.
Note that this is a statistical estimate that may be incorrect with low probability: you should check the real similarity of the texts rather than basing any serious decision (such as deleting a duplicate) on the fingerprint alone. The method also underestimates the similarity of small texts (less than 0.5 kB). This effect is usually wanted in real-world applications, where small differences in small texts can have a huge impact on meaning (consider putting 'not' somewhere into the last paragraph).
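The idea of a fixed-length fingerprint that keeps similar texts close together can be illustrated with a small Python sketch. This is not the algorithm used by TextFingerprintPipelet, which this page does not describe. The function names fingerprint and similarity, the three-word shingles and the MD5 hashing are all assumptions chosen for the example; only the parameter names loosely mirror the key-length and max-words-considered settings.

```python
import hashlib
import re


def fingerprint(text, key_length=20, max_words=20000):
    """Return up to key_length short hex tokens summarising the text.

    Illustrative only: a bottom-k sketch over word shingles, not the
    pipelet's real algorithm.
    """
    words = re.findall(r"\w+", text.lower())[:max_words]
    # Overlapping three-word shingles capture local word order.
    shingles = {" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 0))}
    hashes = sorted(hashlib.md5(s.encode("utf-8")).hexdigest()[:8] for s in shingles)
    # Similar texts share most shingles, so they tend to keep the same smallest hashes.
    return hashes[:key_length]


def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints, between 0.0 and 1.0."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Because such a fingerprint is a list of tokens, it could be joined with spaces and stored in a tokenized text field, so that a search for one fingerprint ranks documents by the number of shared tokens.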
How to use Text Fingerprint in Eccenca.CE
In this HowTo we will:
- install the required pipelet from the update site,
- add a new field to the FileIndex collection to store the fingerprint of a document,
- invoke the fingerprint calculation from the conversion pipeline,
- crawl, search by fingerprint and check the results.
TextFingerprint pipelet invocation configuration, with inline comments for the parameters:
<extensionActivity name="invokeTextFingerprintPipelet">
<proc:invokePipelet>
<proc:pipelet class="com.eccenca.pipelets.basic.TextFingerprintPipelet" />
<proc:variables input="request" />
<proc:PipeletConfiguration>
<!-- text source attribute path or attachment name-->
<proc:Property name="source">
<proc:Value>Text</proc:Value>
</proc:Property>
<!-- type of source location - ATTRIBUTE or ATTACHMENT -->
<proc:Property name="source-type">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<!-- result fingerprint attribute path or attachment name-->
<proc:Property name="target">
<proc:Value>Fingerprint</proc:Value>
</proc:Property>
<!-- type of result fingerprint location - ATTRIBUTE or ATTACHMENT -->
<proc:Property name="target-type">
<proc:Value>ATTRIBUTE</proc:Value>
</proc:Property>
<!-- length of fingerprint -->
<proc:Property name="key-length" type="java.lang.Integer">
<proc:Value>20</proc:Value>
</proc:Property>
<!-- maximum words from source text used to calculate fingerprint-->
<proc:Property name="max-words-considered" type="java.lang.Integer">
<proc:Value>20000</proc:Value>
</proc:Property>
<!-- locale used to split words-->
<proc:Property name="splitting-locale">
<proc:Value>US</proc:Value>
</proc:Property>
<!-- behavior if source text was empty - THROW_EXCEPTION or RETURN_EMPTY or DO_NOTHING -->
<proc:Property name="empty-string-handling">
<proc:Value>RETURN_EMPTY</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
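Before pasting such a configuration into a pipeline, it can help to check the property values against the options named in the comments. The following Python sketch does that with the standard library. The namespace URI bound to the proc prefix is an assumption taken from the import element of the pipeline shown later on this page, and the helper names read_properties and find_problems are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the "proc" prefix (taken from the pipeline's import element).
PROC_NS = "http://www.eclipse.org/smila/processor"

# Allowed values as listed in the configuration comments.
ALLOWED = {
    "source-type": {"ATTRIBUTE", "ATTACHMENT"},
    "target-type": {"ATTRIBUTE", "ATTACHMENT"},
    "empty-string-handling": {"THROW_EXCEPTION", "RETURN_EMPTY", "DO_NOTHING"},
}


def read_properties(xml_text):
    """Return {property name: value text} for every proc:Property element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("{%s}Property" % PROC_NS):
        value = prop.find("{%s}Value" % PROC_NS)
        props[prop.get("name")] = (value.text or "").strip() if value is not None else ""
    return props


def find_problems(props):
    """List human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name in props and props[name] not in allowed:
            problems.append("%s: unexpected value %r" % (name, props[name]))
    for name in ("key-length", "max-words-considered"):
        if name in props and not props[name].isdigit():
            problems.append("%s: not a non-negative integer" % name)
    return problems


SAMPLE = """
<proc:PipeletConfiguration xmlns:proc="http://www.eclipse.org/smila/processor">
  <proc:Property name="source-type"><proc:Value>ATTACHMENT</proc:Value></proc:Property>
  <proc:Property name="key-length" type="java.lang.Integer"><proc:Value>20</proc:Value></proc:Property>
  <proc:Property name="empty-string-handling"><proc:Value>RETURN_EMPTY</proc:Value></proc:Property>
</proc:PipeletConfiguration>
"""

if __name__ == "__main__":
    print(find_problems(read_properties(SAMPLE)))
```

The check only covers the values documented in the comments; it says nothing about whether the pipelet accepts other values.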
Workflow
Install the Basic pipelet feature from the Eccenca.CE update site


After pressing "Apply changes" the pipelet will be installed.
Modify the collection by adding a new field for storing the fingerprint
Remove the physical index
Navigate to the "Collections" tab.
We will add a new field to an existing index, so the first step is to remove the index of the FileIndex collection.
Navigate to the FileIndex collection.
Add a new text field to IndexStructure
Navigate to the IndexStructure tab of the collection and add the field to the configuration XML. It will be the field with number 6.
<IndexField FieldNo="6" IndexValue="true" Name="Fingerprint" StoreText="true" Tokenize="true" Type="Text"/>
<IndexStructure xmlns="http: Name="FileIndex">
<Analyzer ClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
<IndexField FieldNo="6" IndexValue="true" Name="Fingerprint" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="5" IndexValue="true" Name="Content" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="4" IndexValue="true" Name="Date" StoreText="true" Tokenize="false" Type="Date"/>
<IndexField FieldNo="3" IndexValue="true" Name="Size" StoreText="true" Tokenize="false" Type="Number"/>
<IndexField FieldNo="2" IndexValue="true" Name="Extension" StoreText="true" Tokenize="false" Type="Text"/>
<IndexField FieldNo="1" IndexValue="true" Name="Filename" StoreText="true" Tokenize="true" Type="Text"/>
<IndexField FieldNo="0" IndexValue="true" Name="Path" StoreText="true" Tokenize="true" Type="Text"/>
</IndexStructure>
Change the search result and default search configuration
Navigate to the Configuration tab of the collection and add the field to Result and DefaultConfig.
<Configuration xsi:schemaLocation="http: xmlns="http://www.anyfinder.de/DataDictionary/Configuration"
xmlns:xsi="http:>
<DefaultConfig>
<Field FieldNo="6">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="5">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="4">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTDate">
<Parameter xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="3">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTNumber">
<Parameter xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="2">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="1">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http: />
</FieldConfig>
</Field>
<Field FieldNo="0">
<FieldConfig Constraint="optional" Weight="1" xsi:type="FTText">
<Parameter Operator="OR" Tolerance="exact" xmlns="http: />
</FieldConfig>
</Field>
</DefaultConfig>
<Result Name="">
<ResultField FieldNo="6" Name="Fingerprint" />
<ResultField FieldNo="4" Name="Date" />
<ResultField FieldNo="3" Name="Size" />
<ResultField FieldNo="2" Name="Extension" />
<ResultField FieldNo="1" Name="Filename" />
<ResultField FieldNo="0" Name="Path" />
</Result>
<HighlightingResult Name="">
<HighlightingResultField FieldNo="5" Name="Content" xsi:type="HLTextField">
<HighlightingTransformer Name="urn:Sentence">
<ParameterSet xmlns="http:>
<Parameter Name="MaxLength" xsi:type="Integer">
<Value>500</Value>
</Parameter>
<Parameter Name="MaxHLElements" xsi:type="Integer">
<Value>999</Value>
</Parameter>
<Parameter Name="MaxSucceedingCharacters" xsi:type="Integer">
<Value>50</Value>
</Parameter>
<Parameter Name="SucceedingCharacters" xsi:type="String">
<Value>...</Value>
</Parameter>
<Parameter Name="SortAlgorithm" xsi:type="String">
<Value>Occurrence</Value>
</Parameter>
<Parameter Name="TextHandling" xsi:type="String">
<Value>ReturnSnipplet</Value>
</Parameter>
</ParameterSet>
</HighlightingTransformer>
<HighlightingParameter xmlns="http: />
</HighlightingResultField>
</HighlightingResult>
</Configuration>
Update the mapping from record to index
Navigate to the Mapping tab of the collection and add a mapping from the attribute "Fingerprint" to field 6.
<Mapping indexName="FileIndex" xmlns="http:>
<Attributes>
<Attribute fieldNo="0" name="Path" />
<Attribute fieldNo="1" name="Filename" />
<Attribute fieldNo="2" name="Extension" />
<Attribute fieldNo="3" name="Size" />
<Attribute fieldNo="4" name="LastModifiedDate" />
<Attribute fieldNo="6" name="Fingerprint" />
</Attributes>
<Attachments>
<Attachment fieldNo="5" name="Text" />
</Attachments>
</Mapping>
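The index structure and the mapping have to agree on the field numbers, and it is easy to forget one of them when adding field 6. The following Python sketch cross-checks the two documents. The XML strings are shortened, namespace-free copies of the listings above, and the helper names index_fields, mapped_fields and unmapped are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free versions of the listings on this page.
INDEX_STRUCTURE = """
<IndexStructure Name="FileIndex">
  <IndexField FieldNo="6" Name="Fingerprint" Type="Text"/>
  <IndexField FieldNo="5" Name="Content" Type="Text"/>
  <IndexField FieldNo="0" Name="Path" Type="Text"/>
</IndexStructure>
"""

MAPPING = """
<Mapping indexName="FileIndex">
  <Attributes>
    <Attribute fieldNo="0" name="Path"/>
    <Attribute fieldNo="6" name="Fingerprint"/>
  </Attributes>
  <Attachments>
    <Attachment fieldNo="5" name="Text"/>
  </Attachments>
</Mapping>
"""


def local(tag):
    """Strip a '{namespace}' prefix so the check also works on namespaced files."""
    return tag.rsplit("}", 1)[-1]


def index_fields(xml_text):
    """Set of field numbers declared as IndexField elements."""
    return {int(e.get("FieldNo")) for e in ET.fromstring(xml_text).iter()
            if local(e.tag) == "IndexField"}


def mapped_fields(xml_text):
    """Map of field number to attribute or attachment name from the mapping."""
    return {int(e.get("fieldNo")): e.get("name") for e in ET.fromstring(xml_text).iter()
            if local(e.tag) in ("Attribute", "Attachment")}


def unmapped(index_xml, mapping_xml):
    """Field numbers defined in the index but missing from the mapping."""
    return sorted(index_fields(index_xml) - set(mapped_fields(mapping_xml)))


if __name__ == "__main__":
    print(unmapped(INDEX_STRUCTURE, MAPPING))
```

Note that the attribute is spelled FieldNo in the index structure and fieldNo in the mapping, as in the listings above.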
Save the collection and create the physical index
Press Save and you will see the message "Collection 'FileIndex' was successfully updated."
Press the create index icon and the index will be created.
Update the conversion pipeline of the index order
Navigate to Index Orders for the collection and click the "file" index order to edit it.
Only the processing pipeline has to be changed, to add the fingerprint extraction.
Navigate to the "Processing pipeline" tab and add the fingerprint pipelet invocation.
<process name="Convert_FileIndex_file" targetNamespace="http: xmlns="http://docs.oasis-open.org/wsbpel/2.0/process/executable" xmlns:id="http://www.eclipse.org/smila/id"
xmlns:proc="http: xmlns:rec="http://www.eclipse.org/smila/record" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<import importType="http: location="../processor.wsdl" namespace="http://www.eclipse.org/smila/processor" />
<partnerLinks>
<partnerLink myRole="service" name="Pipeline" partnerLinkType="proc:ProcessorPartnerLinkType" />
</partnerLinks>
<extensions>
<extension mustUnderstand="no" namespace="http: />
</extensions>
<variables>
<variable messageType="proc:ProcessorMessage" name="request" />
</variables>
<sequence>
<receive createInstance="yes" name="start" operation="process" partnerLink="Pipeline" portType="proc:ProcessorPortType" variable="request" />
<!-- MIME type identification -->
<extensionActivity name="invokeSimpleMimeTypeIdentification">
<proc:invokeService>
<proc:service name="MimeTypeIdentifyService" />
<proc:variables input="request" output="request" />
</proc:invokeService>
</extensionActivity>
<!-- Conversion -->
<if name="conditionIsText">
<condition>($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/plain")
or ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/plain")</condition>
<extensionActivity name="invokeCopyText">
<proc:invokePipelet>
<proc:pipelet class="com.brox.anyfinder.processing.utils.CopyAttachmentPipelet" />
<proc:variables input="request" output="request" />
<proc:PipeletConfiguration>
<proc:Property name="source">
<proc:Value>Content</proc:Value>
</proc:Property>
<proc:Property name="target">
<proc:Value>Text</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
</if>
<if name="conditionIsHtml">
<condition>($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/html")
or ($request.records/rec:Record[1]/rec:A[@n="MimeType"]/rec:L/rec:V = "text/xml")</condition>
<extensionActivity name="invokeHtml2Txt">
<proc:invokePipelet>
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet"/>
<proc:variables input="request" output="request"/>
<proc:PipeletConfiguration>
<proc:Property name="inputType">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<proc:Property name="outputType">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<proc:Property name="inputName">
<proc:Value>Content</proc:Value>
</proc:Property>
<proc:Property name="outputName">
<proc:Value>Text</proc:Value>
</proc:Property>
<proc:Property name="meta:title">
<proc:Value>Title</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
</if>
<!-- Fingerprint invocation -->
<extensionActivity name="invokeTextFingerprintPipelet">
<proc:invokePipelet>
<proc:pipelet class="com.eccenca.smila.pipelets.basic.TextFingerprintPipelet" />
<proc:variables input="request" />
<proc:PipeletConfiguration>
<!-- text source attribute path or attachment name-->
<proc:Property name="source">
<proc:Value>Text</proc:Value>
</proc:Property>
<!-- type of source location - ATTRIBUTE or ATTACHMENT -->
<proc:Property name="source-type">
<proc:Value>ATTACHMENT</proc:Value>
</proc:Property>
<!-- result fingerprint attribute path or attachment name-->
<proc:Property name="target">
<proc:Value>Fingerprint</proc:Value>
</proc:Property>
<!-- type of result fingerprint location - ATTRIBUTE or ATTACHMENT -->
<proc:Property name="target-type">
<proc:Value>ATTRIBUTE</proc:Value>
</proc:Property>
<!-- length of fingerprint -->
<proc:Property name="key-length" type="java.lang.Integer">
<proc:Value>20</proc:Value>
</proc:Property>
<!-- maximum words from source text used to calculate fingerprint-->
<proc:Property name="max-words-considered" type="java.lang.Integer">
<proc:Value>20000</proc:Value>
</proc:Property>
<!-- locale used to split words-->
<proc:Property name="splitting-locale">
<proc:Value>US</proc:Value>
</proc:Property>
<!-- behavior if source text was empty - THROW_EXCEPTION or RETURN_EMPTY or DO_NOTHING -->
<proc:Property name="empty-string-handling">
<proc:Value>RETURN_EMPTY</proc:Value>
</proc:Property>
</proc:PipeletConfiguration>
</proc:invokePipelet>
</extensionActivity>
<reply name="end" operation="process" partnerLink="Pipeline" portType="proc:ProcessorPortType" variable="request" />
<exit />
</sequence>
</process>
Press the save button and you will see the confirmation message "Index Order 'file' was updated successfully".
Run the index order and check the results
To test this, I added similar documents to the folder c:\data.
They are original files from the JDK Javadoc, plus a copy of one file with a couple of lines removed.
Search for the first document by name, copy the returned fingerprint into the search form and search again.

As you can see, the Lucene indexing engine found the same document with a 100% score and the similar document with an 89.6% score.
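Since the fingerprint score is only an estimate, a candidate pair found this way should be confirmed by comparing the real texts before anything is deleted. The following Python sketch does this line by line with difflib from the standard library. The function names and the threshold of 0.85 are assumptions for the example, not values used by Eccenca.CE.

```python
import difflib


def line_similarity(text_a, text_b):
    """Ratio in [0.0, 1.0] of matching lines between two texts."""
    a, b = text_a.splitlines(), text_b.splitlines()
    return difflib.SequenceMatcher(None, a, b).ratio()


def is_confirmed_duplicate(text_a, text_b, threshold=0.85):
    """True if the direct comparison also rates the texts as near-duplicates.

    The threshold is a hypothetical value; choose one that fits your data.
    """
    return line_similarity(text_a, text_b) >= threshold


if __name__ == "__main__":
    original = "\n".join("line %d" % i for i in range(10))
    # Copy of the original with two lines removed, as in the test above.
    shortened = "\n".join("line %d" % i for i in range(10) if i not in (3, 7))
    print(line_similarity(original, shortened))
```

A direct comparison like this is far slower than a fingerprint search, which is why it is applied only to the few candidates that the search returns.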