Dan Higgins (higgins@nceas.ucsb.edu) and
Matthew B. Jones (jones@nceas.ucsb.edu)
National Center for Ecological Analysis and Synthesis, University of California Santa Barbara
Scientists in a variety of disciplines (e.g., biology, ecology, astronomy) need access to scientific data and flexible means for executing complex analyses on those data. Such analyses can often be described as a series of distinct operations, with the results of one operation being passed to the next. This overall process, describing data flow from one operation to the next, can be called a 'scientific workflow'. A formal description of the workflow allows for efficient execution and repetition of such analyses, and also documents exactly how data were analyzed. Kepler is designed to aid scientists in the design, construction, execution, and communication of such scientific workflows. For example, Kepler uses structured metadata such as Ecological Metadata Language (EML) metadata to make it easy for scientists to locate, analyze, and visualize unfamiliar data from data repositories around the world. Kepler includes a tool for creating graphical displays of workflows in the form of 'boxes', which represent operations or analytic steps, connected by 'arrows', which indicate the flow of information between the workflow steps. Kepler also includes flexible mechanisms for controlling the data flow or sequencing the operations in these workflows, and for executing, saving, and re-creating such workflow descriptions.
Kepler is currently under development by the Kepler Project, a collaboration of various projects to develop open source tools for scientific workflows. These contributing projects include SEEK (see "Building SEEK: the Science Environment for Ecological Knowledge" by William Michener in an earlier DataBits issue), SDM/SPA, Ptolemy II, GEON, and ROADNet (see the Kepler web site for further information). It is important to note that Kepler is based on the Ptolemy II project. Ptolemy is a modeling, simulation, and design effort that has been going on for more than ten years at the Department of EECS at UC Berkeley. The baseline Ptolemy II software was first created more than five years ago, and is thus tested, stable, and well documented. Building Kepler on this existing base avoids having to reimplement a large amount of software and allows the effort to concentrate on new features.
In Ptolemy II nomenclature, the 'boxes' in a graphical workflow are called "actors". Each actor may have "input ports" and "output ports" which are connected by 'arrows' to other ports. The ports and their connections represent paths through which data moves in the workflow. A Ptolemy II model also has a "director" which coordinates the actions of the "actors". Perhaps the simplest type of director just tells actors to wait until data appears at an input port to start processing the data and then transfer it to an output port. When that data appears at the output port, it flows to whatever inputs it is connected to. Although this simple dataflow based on availability is often appropriate, one of the great features of Ptolemy II is that the director can be changed to allow for other types of models. For example, actors might be directed to 'fire' at prescribed, simulated times, whether or not input data is available.
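The data-availability scheduling described above can be illustrated with a minimal Python sketch. This is not Ptolemy II or Kepler code: the `Actor` and `DataDrivenDirector` classes, their methods, and the sample values are hypothetical names invented for illustration, and a real director handles many cases (types, multiple outputs, termination) that this sketch ignores.

```python
from collections import deque


class Actor:
    """Hypothetical actor: a function, named input ports, one output port."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        # One FIFO queue of pending data items per input port.
        self.inputs = {port: deque() for port in inputs}
        # (target actor, target port name) pairs fed by this actor's output.
        self.connections = []

    def connect(self, target, port):
        self.connections.append((target, port))

    def ready(self):
        # An actor with inputs is ready when every input port holds data.
        return bool(self.inputs) and all(self.inputs.values())

    def fire(self):
        # Consume one item per input port, compute, and pass the result on.
        args = {port: queue.popleft() for port, queue in self.inputs.items()}
        result = self.func(**args)
        for target, port in self.connections:
            target.inputs[port].append(result)
        return result


class DataDrivenDirector:
    """Fires source actors once, then any actor whose inputs have arrived."""

    def __init__(self, actors):
        self.actors = actors

    def run(self):
        fired = []
        for actor in self.actors:
            if not actor.inputs:  # source actor: needs no input data
                actor.fire()
                fired.append(actor.name)
        progress = True
        while progress:
            progress = False
            for actor in self.actors:
                if actor.ready():
                    actor.fire()
                    fired.append(actor.name)
                    progress = True
        return fired


# A three-step workflow: source -> count -> display (made-up sample data).
displayed = []
source = Actor("source", lambda: [3, 1, 4, 1, 5])
count = Actor("count", lambda data: len(data), inputs=("data",))
display = Actor("display", lambda value: displayed.append(value), inputs=("value",))
source.connect(count, "data")
count.connect(display, "value")

# The actors are listed in a scrambled order on purpose.
firing_order = DataDrivenDirector([display, count, source]).run()
```

Because firing depends only on which ports hold data, the order in which the actors are handed to the director does not matter here; swapping in a different director class would change when actors fire without changing the actors themselves.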
The "actor" in Kepler can be thought of as a software component that processes the data that appears at its input ports. In addition to the ports, actors can also have "parameters", which the workflow designer sets to control just what the actor does. Actors can be very simple; for example, one may take an array of numbers as an input and simply count the number of items in the array. Or actors can carry out very complex operations, such as a genetic algorithm predicting species abundances based on a variety of environmental factors. Composite actors are also allowed, where a complex workflow is 'hidden' within a single actor 'box'. This allows for hierarchical workflows in which certain complexities are hidden to aid understanding. There are also 'source' actors, which provide data and do not require inputs, and 'sink' actors, which usually just display data and have no outputs.
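As a rough illustration of parameters and composite actors, the following Python sketch chains two simple actors and hides them inside a single composite. The class names (`Threshold`, `Count`, `Composite`) and the `fire` method are hypothetical and are not part of the Kepler or Ptolemy II API.

```python
class Threshold:
    """Hypothetical actor with one parameter set by the workflow designer."""

    def __init__(self, minimum):
        self.minimum = minimum  # the actor's parameter

    def fire(self, values):
        # Keep only the values at or above the configured minimum.
        return [v for v in values if v >= self.minimum]


class Count:
    """Very simple actor: counts the items in its input array."""

    def fire(self, values):
        return len(values)


class Composite:
    """Hides a chain of actors behind the interface of a single actor."""

    def __init__(self, *steps):
        self.steps = steps

    def fire(self, values):
        # Pass the data through each inner actor in turn.
        for step in self.steps:
            values = step.fire(values)
        return values


# From the outside, the composite looks like one 'box' (made-up sample data).
count_large = Composite(Threshold(minimum=2), Count())
result = count_large.fire([3, 1, 4, 1, 5])
```

A composite can itself be a step in another composite, which is what makes hierarchical workflows possible.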
Actors that are available for building a Kepler workflow appear on the left of the graphical tool display, as indicated in Figure 1. Ptolemy provides about 100 actors and, so far, the Kepler project has added roughly 100 more. Workflows are built by dragging actors from the left onto the panel on the right and then connecting them to represent the workflow being created. Users can also build their own actors, either by programming in Java, configuring scriptable actors (currently by using Matlab, R, or Python scripts), or by building composite actors from lower-level actors.
Figure 1 - A screenshot of Kepler showing the "Actor" tab and a
Predator/Prey Workflow
One of the efforts in Kepler is to provide a variety of specialized actors and help scientists locate and use computing services that have been created elsewhere. As an example, actors for accessing web services have been created, as have actors for accessing Ecogrid data (see the "SEEK Ecogrid" article in the Spring 2003 issue of DataBits). The web services actors can be used, for example, to carry out bioinformatics analyses on a web server in Japan or a geospatial image processing service in California and then automatically pass the results to a local actor for further processing. A data streaming example has also been created which displays almost real-time images from a remote location (Figure 2).
Figure 2 - A Kepler screenshot showing the deep-sea floor from a
real-time data source from a submersible. Various signal processing and
image processing utilities are available to analyze as well as display
temporal data streams such as this video stream.
A final relevant example is illustrated in Figure 3. On the left of this screen, the "Data" tab has been selected and a search has been carried out for data sets on the Ecogrid. One of the resulting data packages has been 'dragged' onto the graphics display on the right. This results in the EMLDataSource actor shown in the figure. The EML metadata is automatically used to determine the number of columns and the datatypes of the data described by the EML document. In this case, there are 11 columns in the data table, and an output port (one of the black, right-pointing triangles) is created for each column. Two of these ports are connected to a plotting actor (labeled "HumidityPressurePlot"), and when the workflow is executed, the actual data is retrieved from the Ecogrid and plotted in the window shown in Figure 3. This example shows how Kepler can help run analyses on remote data by leveraging EML and the Ecogrid, making it an excellent platform for exploratory analysis and visualization of unfamiliar data.
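The idea of deriving one typed output port per column from metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `columns` list is a simplified, hypothetical stand-in for the attribute descriptions in an EML document (it is not EML syntax), `make_output_ports` is an invented helper rather than the EMLDataSource implementation, and the data values are made up.

```python
import csv
import io

# Hypothetical, simplified column descriptions standing in for the
# attribute metadata of an EML document (not real EML syntax).
columns = [
    {"name": "date", "type": str},
    {"name": "humidity", "type": float},
    {"name": "pressure", "type": float},
]


def make_output_ports(columns, text):
    """Return one 'port' (a typed list of values) per described column."""
    ports = {col["name"]: [] for col in columns}
    for row in csv.reader(io.StringIO(text)):
        # Convert each raw field using the type given in the metadata.
        for col, raw in zip(columns, row):
            ports[col["name"]].append(col["type"](raw))
    return ports


# Made-up sample rows; in Kepler the data would come from the Ecogrid.
data = "2004-06-01,61.5,1012.3\n2004-06-02,58.0,1009.8\n"
ports = make_output_ports(columns, data)

# Connecting two ports to a plotting step amounts to pairing their values.
pairs = list(zip(ports["humidity"], ports["pressure"]))
```

Because the column names and types come from the metadata rather than from the data file, a downstream step can be wired to `humidity` and `pressure` before any data has been retrieved.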
Figure 3 - A Kepler screenshot showing Ecogrid data search and EMLDataSource
results. Users can search for data that is available on the
EcoGrid and use it directly in workflows as if it resided locally on their
computer.
An article of this length can provide only a glimpse of Kepler and its possibilities. For example, there is an effort to add semantic information in the form of ontologies to data searches and workflows in order to facilitate more automatic processing, error checking, and data discovery. The Kepler project is also ongoing, so the software is continually changing and new capabilities are added regularly. The interested reader should visit the Kepler website for more information and a view of the current state. Comments and suggestions are always appreciated, especially from potential users, and we welcome new contributors to participate in the design and development of the Kepler system.
This material is based upon work supported by the National Science Foundation under awards 0225676 for SEEK, 0225673 (AWSFL008-DS3) for GEON, and OCE-0121726 for ROADNet; by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM; by DARPA under Contract No. F33615-00-C-1703 for Ptolemy; and by the Office of Naval Research under Contract No. N00014-98-1-0772 for ROADNet. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).