| Name | CSV Crawler |
|---|---|
| Vendor | brox IT-Solutions GmbH |
| Authors | |
| Homepage | http://www.brox.de |
| Issue Management | http://support.eccenca.com |
| Continuous Integration | n/a |
| Categories | Crawler |
| Most Recent Version (see older versions) | 0.5.0 |
| Availability (see older versions) | eccenca / SMILA |
| State | |
| Support | |
| License | |
| Price | Free |
| Release Docs | |
| Java API Docs | n/a |
| Download Source | |
| Download JAR | |
Overview
The CSV Crawler fetches all files from a given directory (recursing into subdirectories if configured) and extracts CSV values by column number or, if the file has a header row, by column name. It can also collect any of the following file metadata:
- full path
- file name only
- file size
- last modified date
- file extension
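As a rough sketch, the metadata listed above could be mapped to record attributes through Content elements, using the Content values named in the configuration explanation (FullPath, FileName, Size, Date, Extension). The attribute names and the Integer type chosen for the file size are illustrative assumptions and are not prescribed by the crawler.

```xml
<!-- Sketch only: attribute names are illustrative, Content values come from the configuration explanation -->
<Attributes>
  <Attribute Name="Path" Type="String" KeyAttribute="true">
    <Content>FullPath</Content>
  </Attribute>
  <Attribute Name="FileName" Type="String">
    <Content>FileName</Content>
  </Attribute>
  <!-- Assumption: the file size is exposed as an Integer -->
  <Attribute Name="Size" Type="Integer">
    <Content>Size</Content>
  </Attribute>
  <Attribute Name="LastModified" Type="Date" HashAttribute="true">
    <Content>Date</Content>
  </Attribute>
  <Attribute Name="Extension" Type="String">
    <Content>Extension</Content>
  </Attribute>
</Attributes>
```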
Crawling Configuration
Defining Schema: org.eccenca.connectivity.framework.crawler.csv/schemas/CsvDataSourceConnectionConfigSchema.xsd
Sample configuration
<DataSourceConnectionConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../org.eclipse.smila.connectivity.framework.crawler.xml/schemas/XmlDataSourceConnectionConfigSchema.xsd">
  <DataSourceID>csv</DataSourceID>
  <SchemaID>org.eccenca.connectivity.framework.crawler.csv</SchemaID>
  <DataConnectionID>
    <Crawler>CsvCrawler</Crawler>
  </DataConnectionID>
  <DeltaIndexing>disabled</DeltaIndexing>
  <Attributes>
    <Attribute Name="Path" Type="String" KeyAttribute="true">
      <Content>FullPath</Content>
    </Attribute>
    <Attribute Name="Date" Type="Date" HashAttribute="true">
      <Content>Date</Content>
    </Attribute>
    <Attribute Name="First name" Type="String" KeyAttribute="true">
      <ColumnName>First name</ColumnName>
    </Attribute>
    <Attribute Name="Last name" Type="String" KeyAttribute="true">
      <ColumnName>Last name</ColumnName>
    </Attribute>
    <Attribute Name="Street" Type="String">
      <ColumnName>Street</ColumnName>
    </Attribute>
    <Attribute Name="Town" Type="String">
      <ColumnName>Town</ColumnName>
    </Attribute>
    <Attribute Name="Zip code" Type="String">
      <ColumnName>Zip code</ColumnName>
    </Attribute>
    <Attribute Name="Phone" Type="String">
      <ColumnName>Phone</ColumnName>
    </Attribute>
  </Attributes>
  <Process>
    <Selections>
      <Path BaseDir="c:\data" Recursive="true" CaseSensitive="false">
        <Include Name="*.csv"/>
      </Path>
    </Selections>
    <CsvFileHasHeader Value="true" />
  </Process>
</DataSourceConnectionConfig>
Crawling configuration explanation
The root element of the crawling configuration is DataSourceConnectionConfig. It contains the following sub-elements:
- DataSourceID - the identification of a data source.
- SchemaID - specifies the schema for a crawler job.
- DataConnectionID - describes which agent or crawler should be used.
  - Crawler - implementation class of a Crawler.
  - Agent - implementation class of an Agent.
- CompoundHandling - specifies whether packed data (such as a zip file containing files) should be unpacked and the files within it crawled (YES or NO).
- Attributes - lists all attributes.
  - Attribute
    - Type (required) - the data type (String, Integer or Date).
    - Name (required) - the attribute's name.
    - HashAttribute - specifies whether a hash should be created (true or false).
    - KeyAttribute - creates a key for this object, for example for the record id (true or false).
    - Attachment - specifies whether the attribute returns its data as an attachment of the record.
    - Content - the name of the file attribute (Path, FullPath, FileName, Extension, Size, Date).
    - ColumnName - the name of the column to take the value from. Usable only if the file to be crawled has a header (the first row lists the column names).
    - ColumnNo - the number of the column to take the value from.
    - LineNumber - indicates that the line number is used as the attribute value.
- Process
  - Selections - which data is to be selected (and how).
    - Path - the paths to the indexing data sources are defined here.
      - BaseDir - contains the path to the indexing data source. The path declaration should be absolute.
      - Recursive - true or false.
      - CaseSensitive - true or false.
      - Include - files to crawl.
        - Name - defines the files or paths to be included. The wildcard character "*" matches any number of characters and "?" matches exactly one arbitrary character.
        - TimeFrom and TimeTo - select all files whose last modification date falls within this interval. If TimeTo is not set, it defaults to today. TimeFrom and TimeTo cannot be combined with Period (see below). Format: "YYYY-MM-DDThh-mm-ssZ".
        - Period - selects files modified within a certain time period. Format: one to five digits followed by a unit (Y, M, D, h, m or s). Example: "14D".
      - Exclude - files to leave out while crawling.
        - Name - defines the files or paths to be excluded. The wildcard character "*" matches any number of characters and "?" matches exactly one arbitrary character.
  - CsvFileHasHeader - defines whether the first line of the file contains column names and is therefore skipped (true or false).
  - QuoteCharacter - specifies the quote character.
  - Separator - specifies the value separator.
  - Rows - limits the rows to be fetched through the numeric attributes From and To.
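To illustrate the options that the sample configuration does not use, the following is a minimal sketch of a configuration for a semicolon-separated file without a header row. Several details are assumptions rather than documented syntax: that Separator and QuoteCharacter take a Value attribute in the same way as CsvFileHasHeader, that LineNumber is an empty element, that ColumnNo is written as element content like ColumnName, and that Period is an attribute of Include. The data source ID, the exclude pattern and the row boundaries are made up for illustration. The source also does not say whether column numbering starts at 0 or 1, so check the numbers against the defining schema.

```xml
<!-- Sketch under stated assumptions; verify the element syntax against CsvDataSourceConnectionConfigSchema.xsd -->
<DataSourceConnectionConfig>
  <DataSourceID>csv-noheader</DataSourceID>
  <SchemaID>org.eccenca.connectivity.framework.crawler.csv</SchemaID>
  <DataConnectionID>
    <Crawler>CsvCrawler</Crawler>
  </DataConnectionID>
  <DeltaIndexing>disabled</DeltaIndexing>
  <Attributes>
    <Attribute Name="Path" Type="String" KeyAttribute="true">
      <Content>FullPath</Content>
    </Attribute>
    <!-- Assumption: LineNumber is an empty element; together with the path it identifies a record -->
    <Attribute Name="Line" Type="Integer" KeyAttribute="true">
      <LineNumber/>
    </Attribute>
    <!-- No header row, so columns are addressed by number instead of by name -->
    <Attribute Name="First name" Type="String">
      <ColumnNo>1</ColumnNo>
    </Attribute>
    <Attribute Name="Last name" Type="String">
      <ColumnNo>2</ColumnNo>
    </Attribute>
  </Attributes>
  <Process>
    <Selections>
      <Path BaseDir="c:\data" Recursive="true" CaseSensitive="false">
        <!-- Period cannot be combined with TimeFrom/TimeTo, so only Period is set here -->
        <Include Name="*.csv" Period="14D"/>
        <!-- Hypothetical pattern for files to skip -->
        <Exclude Name="*_backup.csv"/>
      </Path>
    </Selections>
    <CsvFileHasHeader Value="false"/>
    <!-- Assumption: Value attribute, by analogy with CsvFileHasHeader -->
    <Separator Value=";"/>
    <QuoteCharacter Value="&quot;"/>
    <Rows From="1" To="100"/>
  </Process>
</DataSourceConnectionConfig>
```

With a header row present, the ColumnNo elements would normally be replaced by ColumnName and CsvFileHasHeader set to true, as in the sample configuration.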
Link to support.eccenca.com
Frequently Asked Questions about this extension.
