Eml200DataSource
org.ecoinformatics.seek.datasource.eml.eml2.Eml200DataSource

The Eml200DataSource is used to gain access to a wide variety of data packages that have been described using Ecological Metadata Language (EML). Each data package contains an EML metadata description and one or more data entities (data tables, spatial raster images, spatial vector images). The data packages can be accessed from the local filesystem or through any EcoGrid server that provides access to its collection of data objects.

The metadata provided by the EML description of the data allows the data to be easily ingested into Kepler and exposed for use in downstream components. The Eml200DataSource handles all of the mechanical issues associated with parsing the metadata, downloading the data from remote servers if applicable, understanding the logical structure of the data, and emitting the data for downstream use when required. The supported data transfer protocols include http, ftp, file, ecogrid and srb.

After parsing the EML metadata, the actor automatically reconfigures its exposed ports to provide one port for each attribute in the first entity that is described in the EML description. For example, if the first entity is a data table with four columns, the ports might be "Site", "Date", "Plot", and "Rainfall" if that's what the data set contained. These details are obtained from the EML document.
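As a rough illustration of how port names can be derived from the metadata, the following Python sketch reads the attribute names of the first data table from a simplified EML fragment. It is not the actor's actual code: the fragment omits namespaces and most required EML elements, the sketch only looks at dataTable entities, and the function name port_names_for_first_entity is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical EML fragment. Real EML documents carry
# namespaces and much more metadata than is shown here.
EML_SNIPPET = """
<eml>
  <dataset>
    <dataTable>
      <entityName>rainfall.csv</entityName>
      <attributeList>
        <attribute><attributeName>Site</attributeName></attribute>
        <attribute><attributeName>Date</attributeName></attribute>
        <attribute><attributeName>Plot</attributeName></attribute>
        <attribute><attributeName>Rainfall</attributeName></attribute>
      </attributeList>
    </dataTable>
    <dataTable>
      <entityName>sites.csv</entityName>
      <attributeList>
        <attribute><attributeName>Site</attributeName></attribute>
        <attribute><attributeName>Elevation</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml>
"""


def port_names_for_first_entity(eml_text):
    """Return one port name per attribute of the first dataTable entity."""
    root = ET.fromstring(eml_text)
    # find() returns the first match in document order, i.e. the first entity.
    first_entity = root.find("./dataset/dataTable")
    if first_entity is None:
        return []
    names = first_entity.findall("./attributeList/attribute/attributeName")
    return [name.text.strip() for name in names]


print(port_names_for_first_entity(EML_SNIPPET))
```

The second table in the fragment is ignored, which mirrors the default behavior of using only the first entity.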

By default, the ports created by the Eml200DataSource represent fields in the data entities, and one tuple of data is emitted on these ports during each fire cycle. Alternatively, the actor can be configured so that each port instead emits an array of all values for a field ("As Column Vector"), or so that the entire table of data is emitted at once ("As Table") in comma-separated-value (CSV) format.
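The difference between these three configurations can be pictured with a small Python sketch. It only models the shape of the emitted data, not the actor itself: the sample table, the function names, and the choice to omit a header row from the CSV text are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample table with the four columns used in the example above.
COLUMNS = ["Site", "Date", "Plot", "Rainfall"]
ROWS = [
    ["A", "2004-01-01", 1, 12.5],
    ["A", "2004-01-02", 1, 0.0],
    ["B", "2004-01-01", 2, 7.3],
]


def as_field(columns, rows):
    """Default mode: yield one mapping of port name to single value per fire cycle."""
    for row in rows:
        yield dict(zip(columns, row))


def as_column_vector(columns, rows):
    """Each port carries an array holding every value of its field."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def as_table(rows, delimiter=","):
    """The whole table as one delimited string (no header row in this sketch)."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


for fired in as_field(COLUMNS, ROWS):
    print(fired)
print(as_column_vector(COLUMNS, ROWS))
print(as_table(ROWS))
```

In the default mode a downstream actor sees three firings with one value per port; in the other two modes it sees the complete data in a single firing.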

If more than one data entity is described in the EML metadata, then the output of the actor defaults to the first entity listed in the EML. To select the other entities, one must provide a query statement that describes the filter and join that should be used to produce the data to be output. This is accomplished by selecting 'Open actor', which opens the Query configuration dialog. The dialog can be used to select the columns to be output and any filtering constraints to be applied.
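To give a feel for what such a query statement expresses, the sketch below runs a join with a filter over two small tables. It uses Python's built-in sqlite3 module purely as a stand-in; the table names, column names, and values are hypothetical, and the exact SQL dialect accepted by the actor may differ.

```python
import sqlite3

# In-memory database standing in for two entities of one data package.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rainfall (Site TEXT, Date TEXT, Rainfall REAL);
CREATE TABLE sites (Site TEXT, Elevation REAL);
INSERT INTO rainfall VALUES
    ('A', '2004-01-01', 12.5),
    ('A', '2004-01-02', 0.0),
    ('B', '2004-01-01', 7.3);
INSERT INTO sites VALUES ('A', 120.0), ('B', 850.0);
""")

# Select columns from both entities, join on Site, and filter on Rainfall.
QUERY = """
SELECT r.Site, r.Date, r.Rainfall, s.Elevation
FROM rainfall AS r
JOIN sites AS s ON s.Site = r.Site
WHERE r.Rainfall > 5
ORDER BY r.Site, r.Date
"""

cursor = conn.execute(QUERY)
selected_columns = [description[0] for description in cursor.description]
result = cursor.fetchall()

print(selected_columns)  # the columns that would become output ports
for row in result:
    print(row)
```

Only the columns named in the SELECT clause appear in the result, which corresponds to the selected columns being the ones exposed for output.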

Author(s): Matt Jones, Jing Tao, Chad Berkley
Version:
Pt.Proposed Rating:Red (jones)
Pt.Accepted Rating:Red (jones)




emlFilePath
The path to an EML file on the local filesystem.

dataOutputFormat
The format of the output to be produced for the data entity. This parameter controls which ports are created for the actor and what data is emitted on those ports during each fire cycle. For example, this field can be configured to produce one port for each column in a data table, or one port that emits the entire data table at once in CSV format. Specifically, the output format choices are:

As Field: This is the default. One output port is created for each field (aka column/attribute/variable) that is described in the EML metadata for the data entity. If the SQL statement has been used to subset the data, then only those fields selected in the SQL statement will be configured as ports.

As Table: The selected entity will be sent out as a string that contains the entire entity data. It has three output ports: DataTable - the data itself, Delimiter - the delimiter used to separate fields, and NumColumns - the number of fields in the table.

As Row: In this output format, one tuple of selected data is formatted as an array and sent out. It has only one output port (DataRow), and the data type is a record containing each of the individual fields.

As Byte Array: Selected data will be sent out as an array of bytes read from the data file. This is the raw data, sent in binary format. It has two output ports: BinaryData - the data itself, and EndOfStream - a flag indicating whether the end of the data stream has been reached.

As UnCompressed File Name: This format is only used when the entity is a compressed file (zip, tar, etc.). The compressed archive file is uncompressed after it is downloaded. It has only one output port, which contains an array of the filenames of all of the uncompressed files from the archive. If the parameter "Target File Extension in Compressed File" is provided, the array that is returned will instead contain only the files with the file extension provided.

As Cache File Name: Kepler stores data files downloaded from remote sites in its cache system. This output format will send the local cache file path for the entity so that workflow designers can directly access the cache files. It has two output ports: CacheLocalFileName - the local file path, and CacheResourceName - the data link in the EML for this entity.

As Column Vector: This output format is similar to "As Field". The difference is that instead of sending out a single value on each port, it sends out an array of all of the data for that field. The type of each port is an array of the base type for the field.

As ColumnBased Record: This output format will send all data on one port using a Record structure that encapsulates the entire data object. The Record will contain one array for each of the fields in the data, and the type of that array will be determined by the type of the field it represents.
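The single-port formats above can be compared with a short Python sketch. It models only the shape of what each port carries: the sample data, the function names, and the fixed chunk size used for the byte stream are assumptions for illustration and are not taken from the actor's implementation.

```python
# Hypothetical sample data for a two-column entity.
COLUMNS = ["Site", "Rainfall"]
ROWS = [["A", 12.5], ["A", 0.0], ["B", 7.3]]


def as_row(columns, rows):
    """'As Row': one record per fire cycle on the single DataRow port."""
    for row in rows:
        yield {"DataRow": dict(zip(columns, row))}


def as_column_based_record(columns, rows):
    """'As ColumnBased Record': one record holding one array per field."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def as_byte_array(raw, chunk_size):
    """'As Byte Array': yield (BinaryData, EndOfStream) pairs.

    Splitting into fixed-size chunks is an assumption of this sketch.
    """
    for start in range(0, len(raw), chunk_size):
        chunk = raw[start:start + chunk_size]
        end_of_stream = start + chunk_size >= len(raw)
        yield chunk, end_of_stream


print(list(as_row(COLUMNS, ROWS)))
print(as_column_based_record(COLUMNS, ROWS))
print(list(as_byte_array(b"abcdefgh", 3)))
```

"As Row" and "As ColumnBased Record" both use one port, but the first delivers the data one tuple at a time while the second delivers all of it in a single record.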

fileExtensionFilter
This parameter specifies a file extension that is used to limit the array of filenames returned by the data source actor when "As UnCompressed File Name" is selected as the output type. For more information, see "As UnCompressed File Name" in the description of the dataOutputFormat parameter.
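The effect of this parameter can be sketched in a few lines of Python. The function name, the sample paths, and the decision to compare extensions case-insensitively and to accept a leading dot are assumptions of this sketch, not documented behavior of the actor.

```python
import os


def filter_by_extension(file_names, extension=None):
    """Return all names, or only those with the given extension if one is set."""
    if not extension:
        return list(file_names)
    wanted = extension.lower().lstrip(".")
    return [
        name
        for name in file_names
        if os.path.splitext(name)[1].lower().lstrip(".") == wanted
    ]


# Hypothetical contents of an uncompressed archive.
uncompressed = [
    "/cache/pkg/plots.shp",
    "/cache/pkg/plots.dbf",
    "/cache/pkg/README.txt",
]

print(filter_by_extension(uncompressed))
print(filter_by_extension(uncompressed, "shp"))
```

Without a filter the full array of filenames is returned; with a filter only the matching entries remain.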

selectedEntity
If this EML package has multiple entities, this parameter specifies which entity should be used for output. By default, when this parameter is unset, data from the first entity described in the EML package is output. This parameter is only used if the SQL parameter is not used, or if the SQL parameter is used and the output format is one of "As Table", "As Byte Array", "As UnCompressed File Name", or "As Cache File Name".
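The rule for when selectedEntity takes effect can be written out as a small decision function. This Python sketch only restates the rule described above; the function name, its arguments, and the use of None to mean "the query result determines the output" are hypothetical.

```python
# Output formats for which selectedEntity still applies when a SQL query is set.
WHOLE_ENTITY_FORMATS = {
    "As Table",
    "As Byte Array",
    "As UnCompressed File Name",
    "As Cache File Name",
}


def entity_to_output(entity_names, selected_entity=None, sql_query=None,
                     output_format="As Field"):
    """Return the entity name to output, or None when the SQL query decides."""
    if sql_query and output_format not in WHOLE_ENTITY_FORMATS:
        # selectedEntity is ignored; the query result drives the output.
        return None
    if selected_entity:
        return selected_entity
    # Default: the first entity described in the EML package.
    return entity_names[0]


entities = ["rainfall.csv", "sites.csv"]  # hypothetical entity names
print(entity_to_output(entities))
print(entity_to_output(entities, selected_entity="sites.csv"))
print(entity_to_output(entities, selected_entity="sites.csv",
                       sql_query="SELECT Site FROM rainfall"))
print(entity_to_output(entities, selected_entity="sites.csv",
                       sql_query="SELECT Site FROM rainfall",
                       output_format="As Table"))
```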