Hadoop Q&A with Pramod Abichandani, PhD

September 29, 2015

Pramod Abichandani teaches the Hadoop and MapReduce graduate course to MS Business Analytics and Engineering students at Drexel University. The course introduces modules covering the state of the art in Distributed File Systems and functional Map and Reduce operations on multi-node storage systems.

What is Apache Hadoop®?

Apache Hadoop is an open-source software framework for storing, processing, and analyzing datasets that are terabytes in size or larger. Written in Java, Hadoop runs on inexpensive, industry-standard servers (also called commodity hardware) that both store and process data, and it scales out by adding more servers to the cluster. This is a disruptive shift from traditional data storage and processing paradigms, in which systems relied on expensive, proprietary hardware and on separate systems for storing and processing data.

Hadoop’s power lies in two key elements: the Hadoop Distributed File System (HDFS) and MapReduce. Much like a computer’s file system, HDFS stores the data in Hadoop; the difference is that the data are spread across the servers in the cluster rather than held on a single machine. MapReduce processes the data stored in HDFS. It divides a large input dataset into smaller blocks spread across different computers in the cluster, processes those blocks in parallel (the map step), and then combines the distributed results into a single coherent result (the reduce step).
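To make the map-and-reduce idea concrete, here is a minimal sketch in plain Python that counts words across several blocks of text. It runs on a single machine and does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample blocks are assumptions chosen for illustration, standing in for work that a real cluster would spread across many servers.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    """Run map, shuffle, and reduce over a list of text blocks."""
    mapped = []
    for block in blocks:  # on a cluster, each block would be mapped on a different node
        mapped.extend(map_phase(block))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: two small "blocks" standing in for data stored in HDFS.
    blocks = ["hadoop stores data", "hadoop processes data"]
    print(word_count(blocks))
```

Each block is mapped independently of the others, which is what allows the work to be distributed; only the grouping and reducing steps need to see results from more than one block.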

To date, hundreds of companies have used Hadoop and MapReduce to analyze large datasets. Typical applications include Data Processing/ETL offload, Data Warehouse offload, Telemetry, 360-degree customer view, and Enterprise Data Hub.

What type of training do LeBow students who study analytics receive?

At LeBow, we have developed a cutting-edge Hadoop and MapReduce course that immerses students in the technical challenges of Distributed File Systems and functional Map and Reduce operations on multi-node storage systems. Students work on assignments and a final project throughout the term. They start by collecting relevant data about the topics at hand, store this data in a distributed file system, and ultimately demonstrate their proficiency at running distributed processing (MapReduce) jobs on that file system. Students share their project findings in a 5-minute presentation at the end of the term.

This class sets LeBow students up for success in the emerging field of Hadoop-based business analytics application development.

What’s Cloudera, and how does it relate to Hadoop?

Cloudera is a data management software company that was started by four stalwarts of the software industry: Mike Olson (from Oracle), Amr Awadallah (from Yahoo!), Jeff Hammerbacher (from Facebook), and Christophe Bisciglia (from Google). The company’s enterprise data hub (EDH) software platform lets organizations store, process, and analyze all enterprise data, of whatever type and in any volume, creating cost efficiencies as well as enabling business transformation. More interestingly, Cloudera is the largest contributor to the open source Apache Hadoop® ecosystem.

Here at LeBow, we use Cloudera’s open source distribution of Hadoop and related projects. This distribution integrates all the key Hadoop ecosystem projects and is a complete, tested, and widely deployed distribution of Hadoop.

What is the job market for students with Hadoop skills?

Hadoop jobs are among the most sought-after positions in the analytics and data industry. Not only are these professionals well paid, but the profession as a whole is also growing rapidly. To quote one industry source (WantedAnalytics.com), “during March 2014, there were more than 17,000 jobs advertised online that require candidates with Hadoop knowledge or experience. Since last year, hiring demand for hadoop is up 34 percent.”

The median salary for professionals with big data expertise is $103,000 a year. Sample jobs in this category include Big Data Solution Architect; Linux Systems and Big Data Engineer; Big Data Platform Engineer; and Lead Software Engineer, Big Data (Java, Hadoop, SQL), among others.

Programming

A key skill that employers seek is computer programming. Proficiency in the Java and/or Python programming languages gives students an opportunity to get the most out of their Hadoop experience.