Hadoop Just the Basics for Big Data Rookies

If you like this presentation – show it...

Slide 0

Hadoop Just the Basics for Big Data Rookies Adam Shook ashook@gopivotal.com

Slide 1

Agenda Hadoop Overview HDFS Architecture Hadoop MapReduce Hadoop Ecosystem MapReduce Primer Buckle up!

Slide 2

Hadoop Overview

Slide 3

Hadoop Core Open-source Apache project out of Yahoo! in 2006 Distributed fault-tolerant data storage and batch processing Provides linear scalability on commodity hardware Adopted by many: Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more

Slide 4

Why? Bottom line: Flexible Scalable Inexpensive

Slide 5

Overview Great at Reliable storage for multi-petabyte data sets Batch queries and analytics Complex hierarchical data structures with changing schemas, unstructured and structured data Not so great at Changes to files (can’t do it…) Low-latency responses Analyst usability This is less of a concern now due to higher-level languages

Slide 6

Data Structure Bytes! No more ETL necessary Store data now, process later Structure on read Built-in support for common data types and formats Extendable Flexible

Slide 7

Versioning Version 0.20.x, 0.21.x, 0.22.x, 1.x.x Two main MR packages: org.apache.hadoop.mapred (deprecated) org.apache.hadoop.mapreduce (new hotness) Version 2.x.x, alpha’d in May 2012 NameNode HA YARN – Next Gen MapReduce

Slide 8

HDFS Architecture

Slide 9

HDFS Overview Hierarchical UNIX-like file system for data storage sort of Splitting of large files into blocks Distribution and replication of blocks to nodes Two key services Master NameNode Many DataNodes Checkpoint Node (Secondary NameNode)

Slide 10

NameNode Single master service for HDFS Single point of failure (HDFS 1.x) Stores file to block to location mappings in the namespace All transactions are logged to disk NameNode startup reads namespace image and logs

Slide 11

Checkpoint Node (Secondary NN) Performs checkpoints of the NameNode’s namespace and logs Not a hot backup! Loads up namespace Reads log transactions to modify namespace Saves namespace as a checkpoint

Slide 12

DataNode Stores blocks on local disk Sends frequent heartbeats to NameNode Sends block reports to NameNode Clients connect to DataNode for I/O

Slide 13

How HDFS Works - Writes Client contacts NameNode to write data NameNode says write it to these nodes Client sequentially writes blocks to DataNode

Slide 14

How HDFS Works - Writes DataNodes replicate data blocks, orchestrated by the NameNode

Slide 15

How HDFS Works - Reads Client contacts NameNode to read data NameNode says you can find it here Client sequentially reads blocks from DataNode

Slide 16

Client connects to another node serving that block How HDFS Works - Failure

Slide 17

Block Replication Default of three replicas Rack-aware system One block on same rack One block on same rack, different host One block on another rack Automatic re-copy by NameNode, as needed

Slide 18

HDFS 2.0 Features NameNode High-Availability (HA) Two redundant NameNodes in active/passive configuration Manual or automated failover NameNode Federation Multiple independent NameNodes using the same collection of DataNodes

Slide 19

Hadoop MapReduce

Slide 20

Hadoop MapReduce 1.x Moves the code to the data JobTracker Master service to monitor jobs TaskTracker Multiple services to run tasks Same physical machine as a DataNode A job contains many tasks A task contains one or more task attempts

Slide 21

JobTracker Monitors job and task progress Issues task attempts to TaskTrackers Re-tries failed task attempts Four failed attempts = one failed job Schedules jobs in FIFO order Fair Scheduler Single point of failure for MapReduce

Slide 22

TaskTrackers Runs on same node as DataNode service Sends heartbeats and task reports to JobTracker Configurable number of map and reduce slots Runs map and reduce task attempts Separate JVM!

Slide 23

Exploiting Data Locality JobTracker will schedule task on a TaskTracker that is local to the block 3 options! If TaskTracker is busy, selects TaskTracker on same rack Many options! If still busy, chooses an available TaskTracker at random Rare!

Slide 24

How MapReduce Works Client submits job to JobTracker JobTracker submits tasks to TaskTrackers Job output is written to DataNodes w/replication JobTracker reports metrics

Slide 25

How MapReduce Works - Failure JobTracker assigns task to different node

Slide 26

YARN Abstract framework for distributed application development Split functionality of JobTracker into two components ResourceManager ApplicationMaster TaskTracker becomes NodeManager Containers instead of map and reduce slots Configurable amount of memory per NodeManager

Slide 27

MapReduce 2.x on YARN MapReduce API has not changed Rebuild required to upgrade from 1.x to 2.x Application Master launches and monitors job via YARN MapReduce History Server to store… history

Slide 28

Hadoop Ecosystem

Slide 29

Hadoop Ecosystem Core Technologies Hadoop Distributed File System Hadoop MapReduce Many other tools… Which I will be describing… now

Slide 30

Moving Data Sqoop Moving data between RDBMS and HDFS Say, migrating MySQL tables to HDFS Flume Streams event data from sources to sinks Say, weblogs from multiple servers into HDFS

Slide 31

Flume Architecture

Slide 32

Higher Level APIs Pig Data-flow language – aptly named PigLatin -- to generate one or more MapReduce jobs against data stored locally or in HDFS Hive Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS

Slide 33

Pig Word Count A = LOAD '$input'; B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group AS word, COUNT(B); STORE D INTO '$output';

Slide 34

Key/Value Stores HBase Accumulo Implementations of Google’s Big Table for HDFS Provides random, real-time access to big data Supports updates and deletes of key/value pairs

Slide 35

HBase Architecture Master ZooKeeper Client HDFS

Slide 36

Data Structure Avro Data serialization system designed for the Hadoop ecosystem Expressed as JSON Parquet Compressed, efficient columnar storage for Hadoop and other systems

Slide 37

Scalable Machine Learning Mahout Library for scalable machine learning written in Java Very robust examples! Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more

Slide 38

Workflow Management Oozie Scheduling system for Hadoop Jobs Support for: Java MapReduce Streaming MapReduce Pig, Hive, Sqoop, Distcp Any ol’ Java or shell script program

Slide 39

Real-time Stream Processing Storm Open-source project which runs a streaming of data, called a spout, to a series of execution agents called bolts Scalable and fault-tolerant, with guaranteed processing of data Benchmarks of over a million tuples processed per second per node

Slide 40

Distributed Application Coordination ZooKeeper An effort to develop and maintain an open-source server which enables highly reliable distributed coordination Designed to be simple, replicated, ordered, and fast Provides configuration management, distributed synchronization, and group services for applications

Slide 41

ZooKeeper Architecture

Slide 42

Hadoop Streaming Write MapReduce mappers and reducers using stdin and stdout Execute on command line using Hadoop Streaming JAR // TODO verify hadoop jar hadoop-streaming.jar -input input -output outputdir -mapper org.apache.hadoop.mapreduce.Mapper -reduce /bin/wc

Slide 43

SQL on Hadoop Apache Drill Cloudera Impala Hive Stinger Pivotal HAWQ MPP execution of SQL queries against HDFS data

Slide 44

HAWQ Architecture

Slide 45

That’s a lot of projects I am likely missing several (Sorry, guys!) Each cropped up to solve a limitation of Hadoop Core Know your ecosystem Pick the right tool for the right job

Slide 46

Sample Architecture HDFS Flume Agent Flume Agent Flume Agent MapReduce Pig HBase Storm Oozie Webserver Sales Call Center SQL SQL

Slide 47

MapReduce Primer

Slide 48

MapReduce Paradigm Data processing system with two key phases Map Perform a map function on input key/value pairs to generate intermediate key/value pairs Reduce Perform a reduce function on intermediate key/value groups to generate output key/value pairs Groups created by sorting map output

Slide 49

Map Input Map Output Reducer Input Groups Reducer Output

Slide 50

Hadoop MapReduce Components Map Phase Input Format Record Reader Mapper Combiner Partitioner Reduce Phase Shuffle Sort Reducer Output Format Record Writer

Slide 51

Writable Interfaces public interface Writable { void write(DataOutput out); void readFields(DataInput in); } public interface WritableComparable<T> extends Writable, Comparable<T> { } BooleanWritable BytesWritable ByteWritable DoubleWritable FloatWritable IntWritable LongWritable NullWritable Text

Slide 52

InputFormat public abstract class InputFormat<K, V> { public abstract List<InputSplit> getSplits(JobContext context); public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context); }

Slide 53

RecordReader public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable { public abstract void initialize(InputSplit split, TaskAttemptContext context); public abstract boolean nextKeyValue(); public abstract KEYIN getCurrentKey(); public abstract VALUEIN getCurrentValue(); public abstract float getProgress(); public abstract void close(); }

Slide 54

Mapper public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { protected void setup(Context context) { /* NOTHING */ } protected void cleanup(Context context) { /* NOTHING */ } protected void map(KEYIN key, VALUEIN value, Context context) { context.write((KEYOUT) key, (VALUEOUT) value); } public void run(Context context) { setup(context); while (context.nextKeyValue()) map(context.getCurrentKey(), context.getCurrentValue(), context); cleanup(context); } }

Slide 55

Partitioner public abstract class Partitioner<KEY, VALUE> { public abstract int getPartition(KEY key, VALUE value, int numPartitions); } Default HashPartitioner uses key’s hashCode() % numPartitions

Slide 56

Reducer public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { protected void setup(Context context) { /* NOTHING */ } protected void cleanup(Context context) { /* NOTHING */ } protected void reduce(KEYIN key, Iterable<VALUEIN> value, Context context) { for (VALUEIN value : values) context.write((KEYOUT) key, (VALUEOUT) value); } public void run(Context context) { setup(context); while (context.nextKey()) reduce(context.getCurrentKey(), context.getValues(), context); cleanup(context); } }

Slide 57

OutputFormat public abstract class OutputFormat<K, V> { public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context); public abstract void checkOutputSpecs(JobContext context); public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context); }

Slide 58

RecordWriter public abstract class RecordWriter<K, V> { public abstract void write(K key, V value); public abstract void close(TaskAttemptContext context); }

Slide 59

Word Count Example

Slide 60

Problem Count the number of times each word is used in a body of text Uses TextInputFormat and TextOutputFormat map(byte_offset, line) foreach word in line emit(word, 1) reduce(word, counts) sum = 0 foreach count in counts sum += count emit(word, sum)

Slide 61

Mapper Code public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{ private final static IntWritable ONE = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, ONE); } } }

Slide 62

Shuffle and Sort Mapper outputs to a single logically partitioned file Reducers copy their parts Reducer merges partitions, sorting by key

Slide 63

Reducer Code public class IntSumReducer extends Reducer<Text, LongWritable, Text, IntWritable> { private IntWritable outvalue = new IntWritable(); private int sum = 0; public void reduce(Text key, Iterable<IntWritable> values, Context context) { sum = 0; for (IntWritable val : values) { sum += val.get(); } outvalue.set(sum); context.write(key, outvalue); } }

Slide 64

So what’s so hard about it?

Slide 65

So what’s so hard about it? MapReduce is a limitation Entirely different way of thinking Simple processing operations such as joins are not so easy when expressed in MapReduce Proper implementation is not so easy Lots of configuration and implementation details for optimal performance Number of reduce tasks, data skew, JVM size, garbage collection

Slide 66

So what does this mean for you? Hadoop is written primarily in Java Components are extendable and configurable Custom I/O through Input and Output Formats Parse custom data formats Read and write using external systems Higher-level tools enable rapid development of big data analysis

Slide 67

Resources, Wrap-up, etc. http://hadoop.apache.org Very supportive community Strata + Hadoop World Oct. 28th – 30th in Manhattan Plenty of resources available to learn more Blogs Email lists Books Shameless Plug -- MapReduce Design Patterns

Slide 68

Getting Started Pivotal HD Single-Node VM and Community Edition http://gopivotal.com/pivotal-products/data/pivotal-hd For the brave and bold -- Roll-your-own! http://hadoop.apache.org/docs/current

Slide 69

Acknowledgements Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation Cloudera Impala is a trademark of Cloudera Parquet is copyright Twitter, Cloudera, and other contributors Storm is licensed under the Eclipse Public License

Slide 70

Learn More. Stay Connected. Talk to us on Twitter: @springcentral Find Session replays on YouTube: spring.io/video