Michael E. Byczek, Technical Consultant

Apache Hadoop Framework

Apache Hadoop is an open source framework for distributed processing of large data sets, both structured and unstructured. Files are split into blocks and distributed across the nodes of a cluster that can scale from one to thousands of computers.

Most of an average company's data is unstructured (e.g., emails and social media posts) and does not fit neatly into rows and columns. Hadoop handles data in any file format.

The framework consists of:
  • Common utilities
  • HDFS (Hadoop Distributed File System)
  • YARN - job scheduling and cluster resource management
  • MapReduce - YARN-based system for parallel processing of large data sets
The Hadoop ecosystem also includes a larger collection of related services and tools, such as:
  • Spark - Programming model for ETL, machine learning, stream processing, and graph computation
  • Pig - Data flow language and execution framework for parallel computation
  • Mahout - Machine learning and data mining library
  • Hive - Data warehouse infrastructure for data summarization and ad hoc querying
MapReduce
  • The input data set is split into chunks that "map" tasks process in parallel into key-value pairs
  • The map output becomes the input of the "reduce" tasks
  • The "map" job converts a set of data into a different set of data
  • The "reduce" job combines the tuples (key-value pairs) into a smaller set of tuples
Example: multiple files contain high temperatures for the same cities on different days. A separate "map" task finds the highest temperature for each city within each file. Those intermediate results are fed into the "reduce" phase, which combines them to arrive at a single highest temperature for each city.
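
A minimal sketch of this example using Hadoop Streaming, which lets the map and reduce steps be written as ordinary scripts that read stdin and write stdout (native MapReduce jobs are typically written in Java; the "city,temperature" input layout is an assumption):

  # mapper.py -- emit a (city, temperature) pair for every input line
  import sys

  for line in sys.stdin:
      city, temp = line.strip().split(",")
      print(f"{city}\t{temp}")

  # reducer.py -- Hadoop sorts the map output by key, so all records for
  # one city arrive together; keep the maximum temperature per city
  import sys

  current_city, high = None, None
  for line in sys.stdin:
      city, temp = line.strip().split("\t")
      temp = int(temp)
      if city == current_city:
          high = max(high, temp)
      else:
          if current_city is not None:
              print(f"{current_city}\t{high}")
          current_city, high = city, temp
  if current_city is not None:
      print(f"{current_city}\t{high}")

The two scripts would be submitted to the cluster with the hadoop-streaming JAR, using the -mapper, -reducer, -input, and -output arguments to point at the scripts and the HDFS paths.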

Spark
  • General-purpose cluster computing system
  • Write applications in Python, Java, R, or Scala
  • Combine libraries for machine learning, SQL, streaming, and analysis into a single application
  • Run SQL queries from within a Spark program
  • Uniform access to different data sources, such as JSON, JDBC, and Hive
  • Based on the resilient distributed dataset (RDD), a collection of elements partitioned across the cluster
  • An RDD typically starts with a file in Hadoop and is built up through transformations of that data
  • Actions return the results of those transformations to the driver program
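
A minimal PySpark sketch of the RDD workflow just described, reusing the temperature example (the HDFS path and "city,temperature" layout are assumptions):

  from pyspark import SparkContext

  sc = SparkContext(appName="CityHighs")

  # An RDD starts with a file in Hadoop
  lines = sc.textFile("hdfs:///data/temps.csv")

  # Transformations: parse each line into a (city, temp) pair,
  # then keep the maximum temperature per city
  highs = (lines
           .map(lambda line: line.split(","))
           .map(lambda fields: (fields[0], int(fields[1])))
           .reduceByKey(max))

  # An action returns the transformed data to the driver program
  for city, temp in highs.collect():
      print(city, temp)

  sc.stop()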
Pig
  • Platform for analyzing large data sets that reduces the time needed to write low-level map/reduce programs
  • Handles any type of data
  • Applications are written in the Pig Latin language and executed in a runtime environment
  • Data is loaded from Hadoop HDFS
  • Results of the data transformation logic are stored back to a file
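
A sketch of the same temperature query as a Pig Latin script, driven from Python (assumes the pig client is installed and on PATH; all paths are hypothetical):

  import subprocess

  PIG_SCRIPT = """
  -- Load raw records from HDFS, group by city, keep the maximum
  records = LOAD '/data/temps.csv' USING PigStorage(',')
            AS (city:chararray, temp:int);
  by_city = GROUP records BY city;
  highs = FOREACH by_city GENERATE group AS city, MAX(records.temp) AS high;
  STORE highs INTO '/data/city_highs';
  """

  with open("city_highs.pig", "w") as f:
      f.write(PIG_SCRIPT)

  # Execute the data flow in the Pig runtime environment
  subprocess.run(["pig", "city_highs.pig"], check=True)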
Hive
  • Facilitates querying and managing large datasets
  • Queries are written in HiveQL, an SQL-like language
  • Also supports custom map/reduce code
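
A HiveQL sketch of the same query, run through the Hive command-line client from Python (the table name and HDFS path are assumptions):

  import subprocess

  # HiveQL reads much like standard SQL
  HIVEQL = """
  CREATE TABLE IF NOT EXISTS temps (city STRING, temp INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA INPATH '/data/temps.csv' INTO TABLE temps;
  SELECT city, MAX(temp) AS high FROM temps GROUP BY city;
  """

  # Assumes the hive client is installed and on PATH
  subprocess.run(["hive", "-e", HIVEQL], check=True)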
Mahout
  • Environment for creating machine learning applications
  • The math environment (Samsara) provides off-the-shelf algorithm implementations
  • Core of linear algebra and statistical operations
  • Matrix decomposition algorithms
  • Naive Bayes classifier and collaborative filtering
  • Cooccurrence recommenders that use the entire user click stream and context to generate recommendations
  • An interactive shell runs distributed operations on a Spark cluster
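
Mahout's Samsara environment itself is a Scala DSL, but the cooccurrence idea behind its recommenders can be illustrated in plain Python; this sketch is not Mahout's API, and the click-stream data is invented for illustration:

  from collections import defaultdict
  from itertools import combinations

  # Hypothetical click streams: each list is one user's clicked items
  clicks = {
      "alice": ["tv", "radio", "phone"],
      "bob": ["tv", "phone"],
      "carol": ["radio", "phone", "laptop"],
  }

  # Count how often two items appear in the same user's click stream
  cooccur = defaultdict(int)
  for items in clicks.values():
      for a, b in combinations(sorted(set(items)), 2):
          cooccur[(a, b)] += 1
          cooccur[(b, a)] += 1

  def recommend(user, top_n=3):
      # Score unseen items by how often they cooccur with items
      # the user has already clicked
      seen = set(clicks[user])
      scores = defaultdict(int)
      for (a, b), count in cooccur.items():
          if a in seen and b not in seen:
              scores[b] += count
      return sorted(scores, key=scores.get, reverse=True)[:top_n]

  print(recommend("bob"))  # -> ['radio', 'laptop']

Mahout performs the same kind of computation at scale, expressed as distributed linear algebra (an item-item cooccurrence matrix) on a Spark cluster.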


Copyright © 2016. Michael E. Byczek. All Rights Reserved.