Apache Hadoop is an open source framework for distributed processing of large data sets (structured and unstructured). Files are split into blocks and distributed across the nodes of a cluster, which can scale from one to thousands of machines.
Most of an average company's data is unstructured (e.g., emails and social media posts) and doesn't fit neatly into rows and columns. Hadoop handles data in any file format.
The framework consists of:
Hadoop Common - shared utilities that support the other modules
HDFS (Hadoop Distributed File System) - distributed storage for application data
YARN - job scheduling and cluster resource management
MapReduce - YARN-based system for parallel processing of large data sets
Beyond the core, the Hadoop ecosystem includes a larger collection of tools and services, such as:
Spark - Programming model for ETL, machine learning, stream processing, and graph computation
Pig - Data flow language and execution framework for parallel computation
Mahout - Machine learning and data mining library
Hive - Data warehouse infrastructure for data summarization and ad hoc querying
MapReduce
Input data set is split into independent chunks that "map" tasks process in parallel as key-value pairs
Output becomes the input of "reduce" tasks
The "map" job converts a set of data into a different set of data
The "reduce" job combines the tuples (key-value pairs) into a smaller set of tuples
Example: multiple files contain daily high temperatures for the same cities. A "map" task finds the highest temperature in each file, with the files processed in parallel. Those per-file results feed into the "reduce" phase, which combines them into a single highest temperature for each city (a sketch of this flow follows).
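Below is a minimal sketch of that flow as a pair of Hadoop Streaming-style Python scripts. The one-record-per-line "city,temperature" input format is an assumption for illustration, not something fixed by Hadoop.

    #!/usr/bin/env python3
    # mapper.py - emit one "city<TAB>temperature" pair per input line.
    # Assumes each line looks like "city,temperature" (hypothetical format).
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        city, temp = line.split(",")
        print(f"{city}\t{temp}")

    #!/usr/bin/env python3
    # reducer.py - Streaming delivers map output sorted by key, so all
    # temperatures for one city arrive together; keep the running maximum.
    import sys

    current_city, current_max = None, None
    for line in sys.stdin:
        city, temp = line.strip().split("\t")
        temp = int(temp)
        if city == current_city:
            current_max = max(current_max, temp)
        else:
            if current_city is not None:
                print(f"{current_city}\t{current_max}")
            current_city, current_max = city, temp
    if current_city is not None:
        print(f"{current_city}\t{current_max}")

These would run under Hadoop Streaming with something like hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/temps -output /data/max_temps (the jar location and the paths vary by installation).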
Spark
General-purpose cluster computing system
Write applications in Python, Java, R, or Scala
Combine libraries for machine learning, SQL, streaming, and analysis into a single application
SQL queries from within the Spark program
Uniform access to different data sources, such as JSON, JDBC, and Hive
Based on resilient distributed dataset (RDD), a collection of elements partitioned across the cluster
An RDD typically starts from a file in HDFS and is built up through transformations of that data
Transformations are lazy; actions on the dataset return results to the driver program (see the sketch below)
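A minimal PySpark sketch of that RDD life cycle, reusing the temperature example; the HDFS path and the "city,temperature" line format are hypothetical.

    # Build an RDD from a file in HDFS, transform it lazily, then use an
    # action (collect) to return results to the driver program.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/temps.txt")  # hypothetical path
    pairs = (lines.map(lambda line: line.split(","))
                  .map(lambda p: (p[0], int(p[1]))))
    max_temps = pairs.reduceByKey(max)  # transformation: still lazy

    for city, temp in max_temps.collect():  # action: results reach the driver
        print(city, temp)

    # "SQL queries from within the Spark program": register and query a view.
    df = max_temps.toDF(["city", "max_temp"])
    df.createOrReplaceTempView("max_temps")
    spark.sql("SELECT city FROM max_temps WHERE max_temp > 30").show()

    spark.stop()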
Pig
Platform for analyzing large data sets that reduces the time spent writing map/reduce programs
Handles any type of data
Applications are written in the Pig Latin language and executed in Pig's runtime environment
Data is loaded from HDFS
Results of the transformation logic are stored to a file (a sketch of this flow follows)
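One hedged sketch of that load-transform-store flow: Pig can embed Pig Latin in a Python (Jython) script through its scripting API, if I recall that API correctly; the paths and column layout here are hypothetical.

    # Run with Pig's embedded-Python (Jython) support, e.g. "pig script.py".
    from org.apache.pig.scripting import Pig

    # The Pig Latin data flow: LOAD from HDFS, transform, STORE to a file.
    script = Pig.compile("""
        temps    = LOAD '/data/temps' USING PigStorage(',')
                   AS (city:chararray, temp:int);
        by_city  = GROUP temps BY city;
        max_temp = FOREACH by_city GENERATE group AS city, MAX(temps.temp);
        STORE max_temp INTO '/data/max_temps';
    """)

    stats = script.bind().runSingle()
    if stats.isSuccessful():
        print("results stored to /data/max_temps")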
Hive
Facilitates querying and managing large datasets
Queries are written in HiveQL, an SQL-like language (see the sketch below)
Also supports custom map/reduce code
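Hive can be queried from many clients; one way to issue a HiveQL-style query from Python is through Spark's Hive integration, sketched below. The temps table and its columns are hypothetical.

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read Hive tables and run HiveQL-style SQL.
    spark = (SparkSession.builder
             .appName("hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # HiveQL looks like SQL: summarize the highest temperature per city.
    spark.sql("""
        SELECT city, MAX(temp) AS max_temp
        FROM temps
        GROUP BY city
    """).show()

    spark.stop()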
Mahout
Environment for building machine learning applications
The math environment (Samsara) provides off-the-shelf algorithm implementations
A core of linear algebra and statistical operations
Matrix decomposition algorithms
Naive Bayes classifier and collaborative filtering
Cooccurrence recommenders that use the entire user click stream and context to generate recommendations (see the sketch after this list)
Interactive shell for running distributed operations on a Spark cluster
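The cooccurrence idea behind those recommenders can be illustrated in plain Python. This is not the Mahout/Samsara API, just the underlying technique, run on a made-up click stream: items clicked together by the same users recommend each other.

    from collections import Counter
    from itertools import combinations

    clicks = {  # hypothetical user -> clicked-items data
        "alice": {"hammer", "nails", "saw"},
        "bob":   {"hammer", "nails", "drill"},
        "carol": {"saw", "drill"},
    }

    # Count how often each unordered item pair cooccurs across users.
    cooccur = Counter()
    for items in clicks.values():
        for a, b in combinations(sorted(items), 2):
            cooccur[(a, b)] += 1

    def recommend(user, top_n=3):
        """Score unclicked items by cooccurrence with the user's clicks."""
        seen = clicks[user]
        scores = Counter()
        for (a, b), count in cooccur.items():
            if a in seen and b not in seen:
                scores[b] += count
            elif b in seen and a not in seen:
                scores[a] += count
        return scores.most_common(top_n)

    print(recommend("alice"))  # -> [('drill', 3)]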