

Throughout the book, when referring to a MapReduce, Pig, Hive, or any other type of job that runs one or more MapReduce jobs on a Hadoop cluster, we refer to it as a Hadoop job. We mention the job type explicitly only when there is a need to refer to a particular type of job.

At Yahoo!, as developers started doing more complex processing with Hadoop, multistage Hadoop jobs became common. This led to several ad hoc solutions to manage the execution and interdependency of these multiple Hadoop jobs. Some developers wrote simple shell scripts to start one Hadoop job after the other. Others used Hadoop's JobControl class, which executes multiple MapReduce jobs using topological sorting. One development team resorted to Ant with a custom Ant task to specify their MapReduce and Pig jobs as dependencies of each other, also a topological sorting mechanism. Another approach was a server-based solution that ran multiple Hadoop jobs using one thread to execute each job.

As these solutions started to be widely used, several issues emerged. It was hard to track errors and it was difficult to recover from failures. These tools also complicated the life of administrators, who not only had to monitor the health of the cluster but also of different systems running multistage jobs from client machines. Developers moved from one project to another and had to learn the specifics of the custom framework used by each project. Different organizations within Yahoo! were using significant resources to develop and support multiple frameworks for essentially the same purpose.

It was clear that there was a need for a general-purpose system to run multistage Hadoop jobs with the following requirements:

- It should use an adequate and well-understood programming model to facilitate its adoption and to reduce developer ramp-up time.
- It should be easy to troubleshoot and recover jobs when something goes wrong.
- It should be extensible to support new types of jobs.
- It should scale to support several thousand concurrent jobs.
- Jobs should run in a server to increase reliability.
- It should be a multitenant service to reduce the cost of operation.

Toward the end of 2008, Alejandro Abdelnur and a few engineers from Yahoo! Bangalore took over a conference room with the goal of implementing such a system. Within a month, the first functional version of Oozie was running, able to run multistage jobs consisting of MapReduce, Pig, and SSH jobs. The team successfully leveraged the experience gained from developing PacMan, which was one of the ad hoc systems developed earlier for running multistage jobs.
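To make the shell-script style of ad hoc orchestration concrete, here is a minimal sketch of what such a script might have looked like: each stage runs only after the previous one succeeds, and the pipeline aborts on the first failure. The jar names, classes, and paths in the comments are hypothetical placeholders, not anything from the original systems.

```shell
#!/bin/sh
# Sketch of an ad hoc multistage pipeline: run one Hadoop job after the
# other, stopping the whole pipeline as soon as any stage fails.

run_stage() {
  name=$1
  shift
  echo "starting stage: $name"
  if ! "$@"; then
    echo "stage '$name' failed, aborting pipeline" >&2
    exit 1
  fi
}

# In a real pipeline these would be `hadoop jar ...` invocations, e.g.:
#   run_stage extract   hadoop jar extract.jar   com.example.Extract   /raw /staged
#   run_stage aggregate hadoop jar aggregate.jar com.example.Aggregate /staged /out
run_stage extract   true
run_stage aggregate true
echo "pipeline complete"
```

A script like this handles simple sequencing, but it illustrates exactly the gaps described above: there is no error tracking beyond exit codes, no central monitoring, and no way to recover a half-finished pipeline without rerunning it from the start.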
