Spark Benefits

Published on Nov 18, 2015

An overview of why Spark is the next big thing in data science

PRESENTATION OUTLINE

spark

Igniting a Big Data Revolution
Photo by xavi talleda

The Future of MapReduce

Photo by Zach Dischner

Software Advantages

  • 10x-100x faster than MapReduce
  • More contributors than any other engine
  • Single codebase for ML, Streaming, and Batch
  • Built on solid foundation (Scalding, Cascalog)
  • Highly composable and compact
Spark has more contributors than MapReduce itself, and it allows a regression to be coded in 20 lines (compared to 15,000 in pure MapReduce).
Photo by Joel Abroad
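
To give a feel for why a regression can fit in roughly 20 lines, here is a minimal sketch of linear regression by gradient descent, written in the map-then-reduce shape a Spark job would typically use. This is not Spark code: Python's built-in `map` and `functools.reduce` stand in for the distributed operations of the same names, and the data, the function names (`gradient`, `add`, `fit`), the step count, and the learning rate are all assumptions made for illustration.

```python
from functools import reduce

# Toy data (made up for illustration): points on the line y = 2x + 1.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]

def gradient(w, b, point):
    """Gradient of the squared error for a single (x, y) point."""
    x, y = point
    err = (w * x + b) - y
    return (err * x, err)

def add(g1, g2):
    """Combine two partial gradients; associative, so it could run in parallel."""
    return (g1[0] + g2[0], g1[1] + g2[1])

def fit(points, steps=5000, lr=0.02):
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # "map" computes one gradient per point, "reduce" sums them up.
        gw, gb = reduce(add, map(lambda p: gradient(w, b, p), points))
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

w, b = fit(points)
print(f"w = {w:.4f}, b = {b:.4f}")
```

In an actual cluster job the per-point `map` and the associative `reduce` are the parts that would be distributed across machines; the loop and the weight update stay on the driver.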

What People Are Saying

  • "Leading candidate for a successor to MapReduce"
  • "Spark-powered applications are operating on more real-time data"
  • "Spark has surpassed MapReduce as an execution framework"
  • "Spark is becoming the most powerful platform for data scientists"
  • "More general and powerful alternative to Hadoop's MapReduce"
"Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis." -- InformationWeek

"Leading candidate for a successor to MapReduce" -- Cloudera

"Spark-powered applications are operating on more real-time data, which ultimately enables faster fraud detection, better personalization of media, higher quality from manufacturing processes and other operational analytic use cases." -- MapR

"We use HDFS as the underlying cheap storage, and will continue to do so, and some of our legacy customers still use MapReduce and Hive – both of which are still available within xPatterns. However, for new customers & deployments we consider MapReduce a legacy technology and recommend all new code to be written in Spark as the lowest-level execution framework, given the substantial speed advantages and simpler programming model." -- Atigeo

"Spark is becoming the most powerful platform for data scientists because it unifies everything into a single platform whose foundation is Spark" -- InfoQ

"More general and powerful alternative to Hadoop's MapReduce." -- Databricks
Photo by deep_schismic

Strong Industry Support

Photo by Raoul Pop

Untitled Slide

  • Determined MapReduce to be too slow
  • Found it easier to develop with Spark than MR
  • Has four employees dedicated to improving Spark
  • Migrating all services from MapReduce to Spark

Untitled Slide

  • Presented as key part of EMR data pipeline
  • Participated in Spark summit
  • Prototyping internal analytics system with Spark

Untitled Slide

  • Used for Real-Time Analytical Processing
  • Collaborates with community to improve Spark
  • Partnering with Alibaba, Baidu iQiyi, Youku

Untitled Slide

  • Bringing Spark into ecosystem
  • Chief Scientist started integration movement
  • Trying to move to a Big Data architecture

Deep Roots

Photo by Aaron Escobar

Untitled Slide

  • Based on Google's Dremel and Facebook's Presto
  • Fast moving with growing adoption
  • Spark summit had high attendance by major players
  • Spotify and Netflix are investigating using it
Since December, the project has added Streaming support, YARN support, and Shark.

Low Startup Cost

Photo by staxnet

Untitled Slide

  • Can be tested on single m1.large instance
  • Prototyping is super fast
  • Compatible with other solutions (Hive, HBase)