| Name | XML Crawler |
|---|---|
| Vendor | brox IT-Solutions GmbH |
| Authors | |
| Homepage | http://www.brox.de |
| Issue Management | http://support.eccenca.com |
| Continuous Integration | n/a |
| Categories | Crawler |
| Most Recent Version | 0.5.0 |
| Availability | eccenca / SMILA |
| State | |
| Support | |
| License | |
| Price | Free |
| Release Docs | |
| Java API Docs | n/a |
| Download Source | |
| Download JAR | |
Overview
The XML Crawler recursively fetches all files from a given directory and crawls their XML data via XPath. Besides the selected XML data, it can also provide any of the following file metadata:
- full path
- file name only
- file size
- last modified date
- file extension
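As an illustration, each of these metadata values can be mapped to a record attribute through the Content element. The fragment below is a minimal sketch rather than a verified configuration: the Content values are the ones named in the configuration explanation on this page, while the attribute names and the Integer type for the file size are assumptions.

```xml
<Attributes>
  <!-- Full path of the crawled file, used here as part of the record key -->
  <Attribute Name="Path" Type="String" KeyAttribute="true">
    <Content>FullPath</Content>
  </Attribute>
  <!-- File name without its directory -->
  <Attribute Name="File name" Type="String">
    <Content>FileName</Content>
  </Attribute>
  <!-- Assumption: the file size is exposed as an Integer -->
  <Attribute Name="Size" Type="Integer">
    <Content>Size</Content>
  </Attribute>
  <!-- Last modified date, included in the hash so that changes can be detected -->
  <Attribute Name="Last modified" Type="Date" HashAttribute="true">
    <Content>Date</Content>
  </Attribute>
  <!-- File extension -->
  <Attribute Name="Extension" Type="String">
    <Content>Extension</Content>
  </Attribute>
</Attributes>
```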
Crawling Configuration
Defining Schema: org.eccenca.connectivity.framework.crawler.xml/schemas/XmlDataSourceConnectionConfigSchema.xsd
Sample configuration
<DataSourceConnectionConfig
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.xml/schemas/XmlDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>xml</DataSourceID>
  <SchemaID>org.eccenca.connectivity.framework.crawler.xml</SchemaID>
  <DataConnectionID>
    <Crawler>XmlCrawler</Crawler>
  </DataConnectionID>
  <DeltaIndexing>disabled</DeltaIndexing>
  <Attributes>
    <Attribute Name="Path" Type="String" KeyAttribute="true">
      <Content>FullPath</Content>
    </Attribute>
    <Attribute Name="Date" Type="Date" HashAttribute="true">
      <Content>Date</Content>
    </Attribute>
    <Attribute Name="First name" Type="String" KeyAttribute="true">
      <XPath>ns1:Field[@FieldNo=0]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
    <Attribute Name="Last name" Type="String" KeyAttribute="true">
      <XPath>ns1:Field[@FieldNo=1]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
    <Attribute Name="Street" Type="String">
      <XPath>ns1:Field[@FieldNo=2]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
    <Attribute Name="Town" Type="String">
      <XPath>ns1:Field[@FieldNo=3]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
    <Attribute Name="Zip code" Type="String">
      <XPath>ns1:Field[@FieldNo=4]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
    <Attribute Name="Phone" Type="String">
      <XPath>ns1:Field[@FieldNo=5]/ns1:Value/text()</XPath>
      <NamespaceDef Namespace="http://www.anyfinder.de/Adressen" NamespacePrefix="ns1"/>
    </Attribute>
  </Attributes>
  <Process>
    <Selections>
      <Path BaseDir="C:/xml/data" CaseSensitive="true" Recursive="true">
        <!--<Include Name="**" DateFrom="2007-10-01T01:00:00" DateTo="2008-01-01T00:00:00"/>-->
        <Include Name="*.xml"/>
      </Path>
      <XPath Parameter="//@No/.."/>
      <!--<XsltDoc Parameter="sample.xslt"/>-->
    </Selections>
  </Process>
</DataSourceConnectionConfig>
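To make the XPath expressions in this configuration easier to follow, here is a sketch of an input file they would match. The element names Adressen and Record and all field values are hypothetical; only the namespace URI, the Field and Value elements, and the No and FieldNo attributes are taken from the configuration. The record selection //@No/.. picks every element that carries a No attribute, and each attribute XPath is then evaluated relative to that element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input file; the default namespace corresponds to the ns1 prefix in the configuration -->
<Adressen xmlns="http://www.anyfinder.de/Adressen">
  <!-- Selected as one data record because it carries a No attribute -->
  <Record No="1">
    <Field FieldNo="0"><Value>Erika</Value></Field>           <!-- First name -->
    <Field FieldNo="1"><Value>Mustermann</Value></Field>      <!-- Last name -->
    <Field FieldNo="2"><Value>Musterstrasse 1</Value></Field> <!-- Street -->
    <Field FieldNo="3"><Value>Musterstadt</Value></Field>     <!-- Town -->
    <Field FieldNo="4"><Value>12345</Value></Field>           <!-- Zip code -->
    <Field FieldNo="5"><Value>0123 456789</Value></Field>     <!-- Phone -->
  </Record>
</Adressen>
```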
Crawling configuration explanation
The root element of the crawling configuration is DataSourceConnectionConfig. It contains the following sub-elements:
- DataSourceID - the identifier of the data source.
- SchemaID - specifies the schema for a crawler job.
- DataConnectionID - describes which agent or crawler should be used.
  - Crawler - implementation class of a Crawler.
  - Agent - implementation class of an Agent.
- CompoundHandling - specifies whether packed data (such as a zip archive containing files) should be unpacked and the files within crawled (YES or NO).
- Attributes - lists all attributes.
  - Attribute
    - Type (required) - the data type (String, Integer or Date).
    - Name (required) - the attribute's name.
    - HashAttribute - specifies whether a hash should be created (true or false).
    - KeyAttribute - creates a key for this object, for example for the record id (true or false).
    - Attachment - specifies whether the attribute returns its data as an attachment of the record.
    - Content - the name of the file attribute (Path, FullPath, FileName, Extension, Size, Date).
    - XPath - an XPath expression that addresses the content of an element or an attribute.
    - Separator - an optional separator, used if the XPath expression addresses several contents.
    - NamespaceDef - defines a namespace used in the XPath expression.
      - Namespace - specifies the namespace.
      - NamespacePrefix - specifies the namespace prefix.
- Process
  - Selections - which data is to be selected (and how).
    - Path - the paths to the data sources to be indexed are defined here.
      - BaseDir - contains the path to the data source to be indexed. The path should be absolute.
      - Recursive - true or false.
      - CaseSensitive - true or false.
      - Include - files to crawl.
        - Name - defines the files or paths to be considered. The wildcard character "*" stands for an arbitrary number of characters and "?" for exactly one arbitrary character.
        - TimeFrom and TimeTo - all files whose last modification date lies in this range are selected. If "TimeTo" is not set, it defaults to today. "TimeFrom" and "TimeTo" cannot be combined with the attribute "Period" (see below). Format: "YYYY-MM-DDThh-mm-ssZ"
        - Period - selects files modified within a certain time period. Format: d {1,5} [Y|M|D|h|m|s]. Example: "14D".
      - Exclude - files to leave out while crawling.
        - Name - defines the files or paths to be excluded. The wildcard character "*" stands for an arbitrary number of characters and "?" for exactly one arbitrary character.
    - XPath - defines which nodes within an XML file identify a data record for the index.
      - Parameter - contains the appropriate XPath expression.
      - NamespaceDef - defines a namespace used in the XPath expression.
        - Namespace - specifies the namespace.
        - NamespacePrefix - specifies the namespace prefix.
    - XsltDoc - defines a stylesheet that is applied first to transform the XML documents. The "XPath" expression then refers to the transformed XML document.
      - Parameter - contains the path to the XSLT file.
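The selection options described above could be combined as in the following fragment. It is a sketch under stated assumptions, not a tested configuration: the file patterns are hypothetical, and writing Period as an attribute of Include is inferred from the list above and from how the sample configuration writes Name.

```xml
<Process>
  <Selections>
    <!-- Crawl C:/xml/data and all sub-directories, ignoring case in file names -->
    <Path BaseDir="C:/xml/data" CaseSensitive="false" Recursive="true">
      <!-- Assumption: Period is given as an attribute of Include; "14D" is the example value from the list -->
      <Include Name="*.xml" Period="14D"/>
      <!-- Hypothetical pattern: leave out draft files -->
      <Exclude Name="*_draft.xml"/>
    </Path>
    <!-- Every element carrying a No attribute becomes one data record -->
    <XPath Parameter="//@No/.."/>
  </Selections>
</Process>
```

Because Period is used here, no time range is given on the same Include, in line with the rule that the two cannot be combined.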
Link to support.eccenca.com
Frequently Asked Questions about this extension.
