Big Data, huh?

Published on Nov 21, 2015

A quick overview of various big data tools for analyzing technical data.

PRESENTATION OUTLINE

Big Data, huh?

Clarifying a Subject

Big Data:

  • A Lot of Tools
  • A Lot of Requirements
  • A Lot of Options

Big Data creates meaning. But how?

Know your Data

Logfile Analysis

_That's_ your logfile:

^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\d+\.\d+\.\d+\.\d+) ([^\s]+) ([^\s]+) ([^\s]+) (\d+) ([^\s]+) (\d+\.\d+\.\d+\.\d+) ([^\s]+) (\d+) (\d+) (\d+) (\d+)$
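To make the pattern above concrete, here is a minimal Python sketch that applies it to one line. The group names are assumptions: the pattern itself does not say what each field means, so the names below simply follow a layout that resembles a W3C extended web server log. The sample line is made up for illustration.

```python
import re

# Group names are hypothetical labels, not taken from the slides.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<server_ip>\d+\.\d+\.\d+\.\d+) "
    r"(?P<method>[^\s]+) "
    r"(?P<uri_stem>[^\s]+) "
    r"(?P<uri_query>[^\s]+) "
    r"(?P<port>\d+) "
    r"(?P<username>[^\s]+) "
    r"(?P<client_ip>\d+\.\d+\.\d+\.\d+) "
    r"(?P<user_agent>[^\s]+) "
    r"(?P<status>\d+) "
    r"(?P<substatus>\d+) "
    r"(?P<win32_status>\d+) "
    r"(?P<time_taken>\d+)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Made-up sample line with 14 space-separated fields.
SAMPLE = (
    "2015-11-21 10:15:42 10.0.0.5 GET /index.html - 80 - "
    "192.168.1.20 Mozilla/5.0 200 0 0 123"
)

record = parse_line(SAMPLE)
print(record["date"], record["client_ip"], record["status"])
```

Lines that do not fit the pattern come back as `None`, which makes it easy to count or set aside malformed entries before any further analysis.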

Know Your Tools

It's more than just 'Hadoop'

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
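The programming model behind that description can be sketched in plain Python. This is not the Hadoop API, only a single-process illustration of the map, shuffle, and reduce steps; the function names, the sample lines, and the assumption that the status code is the fourth field from the end are all hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status, 1) for one log line.

    Assumption: the status code is the fourth field from the end.
    """
    fields = line.split()
    if len(fields) >= 4:
        yield fields[-4], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up input lines.
LINES = [
    "2015-11-21 10:15:42 10.0.0.5 GET /index.html - 80 - 192.168.1.20 Mozilla/5.0 200 0 0 123",
    "2015-11-21 10:15:43 10.0.0.5 GET /missing - 80 - 192.168.1.21 Mozilla/5.0 404 0 2 15",
    "2015-11-21 10:15:44 10.0.0.5 GET /index.html - 80 - 192.168.1.22 Mozilla/5.0 200 0 0 98",
]

counts = run_job(LINES)
print(counts)
```

On a real cluster the map and reduce functions run on many machines at once and the framework handles the grouping step; the logic per record stays this small.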

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
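The "commit log" idea can be illustrated with a toy in-memory model. This is not the Kafka client API and it leaves out partitions, brokers, and replication; the class and method names are hypothetical. It only shows the core idea that records are appended once and each consumer tracks its own read position.

```python
class CommitLog:
    """Toy append-only log: reading a record does not remove it."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for line in ["line-1", "line-2", "line-3"]:
    log.publish(line)

first = log.poll("indexer", max_records=2)   # indexer reads two records
second = log.poll("indexer")                 # indexer continues where it stopped
other = log.poll("archiver")                 # archiver starts from the beginning
print(first, second, other)
```

Because each consumer keeps its own offset, a search indexer and an archiver can read the same log data independently and at different speeds.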

Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Morphlines is an open source framework that reduces the time and efforts necessary to build and change Hadoop ETL stream processing applications that extract, transform and load data into Apache Solr, HBase, HDFS, Enterprise Data Warehouses, or Analytic Online Dashboards.
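The extract, transform, load idea can be sketched as a chain of small commands that each take a record and pass it on. This is plain Python, not Morphlines configuration or its API; the command names, the record layout, and the list used as a sink are assumptions for illustration.

```python
def split_fields(record):
    """Extract: split the raw line into whitespace-separated fields."""
    record["fields"] = record["line"].split()
    return record


def drop_short(record):
    """Filter: returning None drops the record from the pipeline."""
    return record if len(record["fields"]) >= 2 else None


def to_document(record):
    """Transform: keep only the fields the target store needs."""
    return {"date": record["fields"][0], "time": record["fields"][1]}


def run_pipeline(lines, commands, sink):
    """Pass each line through the commands in order and load survivors into sink."""
    for line in lines:
        record = {"line": line}
        for command in commands:
            record = command(record)
            if record is None:
                break
        else:
            sink.append(record)  # Load: stands in for Solr, HBase, HDFS, ...


loaded = []
run_pipeline(
    ["2015-11-21 10:15:42 rest-of-line", "broken"],
    [split_fields, drop_short, to_document],
    loaded,
)
print(loaded)
```

Changing the pipeline means editing the list of commands rather than rewriting the processing loop, which is the kind of saving the description above refers to.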

HUE

Hue is a Web interface for analyzing data with Apache Hadoop.

no need to read all this, but...

That's Big Data

(In their own words)

TARGET Architecture

Analysis Chain

our WEAPONS

Tools of the trade

thank you.