Apache Oozie Tutorial
Apache Oozie is a Java web application used to schedule Apache Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
There are two basic types of Oozie jobs:
1) Oozie Workflow: An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes define job chronology: they set the rules for beginning and ending a workflow and control the workflow execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks.
Workflow nodes are classified into control flow nodes and action nodes:
- Control flow nodes: nodes that control the start and end of the workflow and workflow job execution path.
- Action nodes: nodes that trigger the execution of a computation/processing task.
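As a rough illustration of the two node classes, the sketch below parses a small workflow definition and sorts its nodes into control flow nodes and action nodes. The workflow itself (its name, the node names and the `fs` actions) is a made-up example, and `classify_nodes` is a hypothetical helper written for this tutorial, not part of Oozie.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow: a fork runs two fs actions in parallel, then a join.
WORKFLOW_XML = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="split"/>
    <fork name="split">
        <path start="load-a"/>
        <path start="load-b"/>
    </fork>
    <action name="load-a">
        <fs><mkdir path="${nameNode}/tmp/demo/a"/></fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>
    <action name="load-b">
        <fs><mkdir path="${nameNode}/tmp/demo/b"/></fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>
    <join name="merge" to="end"/>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

# Element names that act as control flow nodes in a workflow definition.
CONTROL_TAGS = {"start", "end", "kill", "decision", "fork", "join"}


def classify_nodes(xml_text):
    """Return (control_node_names, action_node_names) in document order."""
    root = ET.fromstring(xml_text.strip())
    control, action = [], []
    for child in root:
        tag = child.tag.split("}", 1)[-1]  # drop the XML namespace prefix
        name = child.get("name", tag)      # <start> has no name attribute
        if tag in CONTROL_TAGS:
            control.append(name)
        elif tag == "action":
            action.append(name)
    return control, action


control_nodes, action_nodes = classify_nodes(WORKFLOW_XML)
print("control:", control_nodes)
print("action:", action_nodes)
```

Every action node declares both an `ok` and an `error` transition, which is how the control flow is wired between nodes.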
Workflow definitions can be parameterized. The parameterization is done using JSP Expression Language (EL) syntax, which supports not only variables as parameters but also functions and complex expressions.
EL expressions can be used in the configuration values of action and decision nodes. They can appear in XML element values and in XML attribute values.
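To make the idea of parameterization concrete, here is a minimal sketch of variable substitution in a configuration value. It only handles plain `${name}` variables and deliberately leaves function calls such as `${wf:id()}` untouched; Oozie's real EL evaluator does much more. The function `resolve_params` and the parameter values are assumptions made for illustration.

```python
import re

# Matches ${name} where name is a plain identifier (dots allowed).
# Function calls such as ${wf:id()} do not match and are left as-is.
_VARIABLE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def resolve_params(text, params):
    """Replace ${name} with params[name]; unknown names are left unchanged."""
    def substitute(match):
        key = match.group(1)
        return str(params[key]) if key in params else match.group(0)
    return _VARIABLE.sub(substitute, text)


# Hypothetical job properties, as they might appear in a properties file.
job_properties = {"nameNode": "hdfs://namenode:8020", "userName": "alice"}

config_value = "${nameNode}/user/${userName}/output/${wf:id()}"
resolved = resolve_params(config_value, job_properties)
print(resolved)
```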
2) Oozie Coordinator: Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. Oozie Coordinator can also manage multiple workflows in which each workflow depends on the outcome of the preceding one: the output of one workflow becomes the input to the next. This chain is called a “data application pipeline”.
Oozie processes coordinator jobs in a fixed timezone with no DST (typically UTC). This timezone is referred to as the ‘Oozie processing timezone’.
The Oozie processing timezone is used to resolve coordinator job start/end times, job pause times and the initial-instance of datasets. Also, all coordinator dataset instance URI templates are resolved to a datetime in the Oozie processing timezone.
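The effect of resolving URI templates in the processing timezone can be sketched as follows, assuming UTC as the processing timezone. The helper `resolve_uri_template`, the HDFS path and the sample timestamp are all hypothetical; only the `${YEAR}`, `${MONTH}`, `${DAY}`, `${HOUR}` and `${MINUTE}` placeholders follow the coordinator URI template convention.

```python
from datetime import datetime, timedelta, timezone


def resolve_uri_template(template, instant):
    """Fill date placeholders using the instant converted to UTC."""
    utc = instant.astimezone(timezone.utc)
    fields = {
        "YEAR": f"{utc.year:04d}",
        "MONTH": f"{utc.month:02d}",
        "DAY": f"{utc.day:02d}",
        "HOUR": f"{utc.hour:02d}",
        "MINUTE": f"{utc.minute:02d}",
    }
    for name, value in fields.items():
        template = template.replace("${" + name + "}", value)
    return template


# 20:30 on March 10 in a UTC-8 zone is 04:30 on March 11 in UTC.
local_time = datetime(2024, 3, 10, 20, 30, tzinfo=timezone(timedelta(hours=-8)))
uri = resolve_uri_template(
    "hdfs://namenode:8020/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}", local_time
)
print(uri)
```

Note how the resolved directory falls on the next calendar day: the local wall-clock time plays no part in the result.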
A coordinator application is a program that triggers actions (commonly workflow jobs) when a set of conditions is met. Conditions can be a time frequency, the availability of new dataset instances or other external events.
Types of coordinator applications:
- Synchronous: Its coordinator actions are created at specified time intervals.
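A synchronous coordinator can be pictured as a generator of evenly spaced nominal times. The sketch below is a simplification under stated assumptions: the frequency is a fixed number of minutes, the end time is treated as exclusive, and `nominal_times` is a hypothetical helper, not an Oozie API.

```python
from datetime import datetime, timedelta, timezone


def nominal_times(start, end, frequency_minutes):
    """List the times at which a coordinator action would be created.

    Assumption of this sketch: `end` is exclusive.
    """
    step = timedelta(minutes=frequency_minutes)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
times = nominal_times(start, end, frequency_minutes=60)
for t in times:
    print(t.isoformat())
```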
The usage of Oozie Coordinator can be categorized into three segments:
- Small: consisting of a single coordinator application with embedded dataset definitions
- Medium: consisting of a single shared dataset definition and a few coordinator applications
- Large: consisting of one or more shared dataset definitions and several coordinator applications
Oozie Bundle is a higher-level Oozie abstraction that batches a set of coordinator applications. The user can start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.
More specifically, the Oozie Bundle system allows the user to define and execute a set of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependency of coordinator applications to create an implicit data application pipeline.
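One way to picture an implicit pipeline is to derive an ordering of coordinators purely from which datasets each one consumes and produces. The sketch below does this with a topological sort. The coordinator and dataset names are invented, and the ordering is only a model of the data dependencies: Oozie itself does not compute such an order for a bundle, since each coordinator simply waits for its own input data.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical bundle: each coordinator lists its input and output datasets.
coordinators = {
    "ingest": {"inputs": set(), "outputs": {"raw-logs"}},
    "clean": {"inputs": {"raw-logs"}, "outputs": {"clean-logs"}},
    "report": {"inputs": {"clean-logs"}, "outputs": {"daily-report"}},
}


def implicit_order(coords):
    """Order coordinators so that producers come before their consumers."""
    producers = {
        dataset: name
        for name, spec in coords.items()
        for dataset in spec["outputs"]
    }
    # Map each coordinator to the coordinators that produce its inputs.
    graph = {
        name: {producers[d] for d in spec["inputs"] if d in producers}
        for name, spec in coords.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = implicit_order(coordinators)
print(order)
```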
Oozie executes workflows based on:
- Time Dependency (Frequency)
- Data Dependency
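The two dependencies combine into a simple readiness rule: an action can run once its scheduled time has arrived and all of its input data is present. The following sketch models that rule with a plain set of available URIs standing in for the file system; `is_ready` and the sample paths are assumptions for illustration, not Oozie's implementation.

```python
from datetime import datetime, timezone


def is_ready(nominal_time, now, required_inputs, available_inputs):
    """True when both the time dependency and the data dependency hold."""
    time_ok = now >= nominal_time
    data_ok = all(uri in available_inputs for uri in required_inputs)
    return time_ok and data_ok


nominal = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 1, 5, tzinfo=timezone.utc)
required = ["hdfs://namenode:8020/logs/2024/01/01/00"]

# Time has passed but the input is missing: not ready yet.
waiting = is_ready(nominal, now, required, available_inputs=set())

# Once the input instance shows up, the action can run.
ready = is_ready(nominal, now, required, available_inputs=set(required))
print(waiting, ready)
```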