Image credit © Michael Tompert 2012 / from The Human Face of Big Data

The cover image shows routes taken by Domino's pizza delivery bikes in Manhattan

Contact me on Twitter as @edd or look me up on Google Plus.

Big Data SLA PHT 2013

Published on Nov 18, 2015

PRESENTATION OUTLINE

BIG DATA

EDD DUMBILL @edd

WHAT IS BIG DATA?

Hard to have avoided the news about big data in the last year.

A lot of different definitions, depending on which lens you look through: from social media, through big brother, to the IT world.

My definition: "smart data".
Photo by BBVAtech

STRATA SMITH

WILLIAM 'STRATA' SMITH 1769-1839
William "Strata" Smith was an English geologist who lived over 150 years ago.

He noticed by examining outcrops of rock that the fossils always appeared in the same order.

He was then able to deduce, from the rocks at the surface, what lay underneath.

(Image credit: Wikipedia)

SMART DATA

Smith produced a geological map of England, Wales and part of Scotland, showing the location of minerals, which made a lot of people very rich.

This is a great example of smart use of data: using observations to create inferences and hence new value.

(Image credit: Wikipedia)

WHERE DOES BIG DATA COME FROM?

Until a few years ago, the role of IT in an organization was mostly automating processes that had previously been done with paper.

Other phases of the product loop (customer interaction, sales, manufacture) all happened very much within the physical world.

Companies today are becoming increasingly digital as ever more comes under algorithmic control. Not just customer interaction, but even manufacture and delivery.

Big data is in essence the signal pulsing through this new digital nervous system.

Untitled Slide

Quite naturally, big data originates with those organizations that generate a lot of data. It's unsurprising that web companies have a big role in its genesis.

Google pioneered the use of cheap commodity computers to crunch data. Using a technique they called "MapReduce", they were able to spread large computations out over many computers at once. In this way they were able to compile the index that powers their search engine.
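
A minimal single-machine sketch of the map, group and reduce pattern, using word counting as the example. The function names and the sample documents are illustrative assumptions, not Google's code; a real framework would run the map calls on many machines and move the grouped data between them.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each map call is independent, so a cluster could run them on different machines.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())

# Hypothetical input documents.
counts = word_count(["big data", "smart data beats big talk"])
```

Because no map call depends on another, adding machines lets the same job cover more documents in the same time.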

Facebook perform many computations per user to deliver their personalized recommendations and adverts.

Untitled Slide

In some IT circles, Hadoop is synonymous with big data.

Symbolized by the toy elephant belonging to its creator's son, Hadoop emerged from Yahoo! as an open source implementation of Google's MapReduce paper.

Making this power available to more or less anybody at low cost has driven much of the revolution in big data we are experiencing today.

It happens against the backdrop of another important computing change, the cloud: with just a credit card, anybody can get started crunching big data sets today.

SEEING EVERYTHING

VOLUME, VELOCITY, VARIETY
The IT industry often explains big data in terms of three "Vs".

Volume: a larger amount of data than conventional systems can process.
Velocity: faster changing and arriving data.
Variety: unstructured data and multiple data sources.

But the supreme change big data has brought is the ability to examine every single data point in a population.

High level example: mail order catalogs vs Amazon recommendations.

No sampling needed. Sampling is fraught with trouble: you're deciding ahead of time what counts as representative data, you can't zoom in, and it restricts the axes of exploration.
Photo by Truthout.org

STATISTICAL

NOT THE SEMANTIC WEB
Statistics and sampling aren't the only practices that big data disrupts: semantics comes in for a challenge too.

Consider the original battle of the web: Yahoo! as a curated guide vs Google's statistical approach. The data got too large and complicated for Yahoo!'s directory.
Ontologies are problematic (Halevy, Norvig and Pereira in "The Unreasonable Effectiveness of Data"):

* Ontology writing: hard, and the easy ones are done.
* Difficulty of implementation: hard to get people to encode metadata.
* Competition: people want to promote their own ontology for competitive reasons.
* Inaccuracy and deception: trust problems at large.

So, reverse the model. Instead of classifying each instance, find the natural categories and label them.
Photo by yoz

inmaps.linkedinlabs.com

Great example of the statistical approach.
Clustering algorithm applied to your contacts and their interrelations.

Very capable of finding communities of which it had no explicit knowledge.
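
A rough sketch of how communities can fall out of connection data alone. It uses a simple stand-in technique, not the algorithm behind InMaps: drop any link whose two ends share no other contact, then treat each connected group that remains as a community. The names and links are hypothetical.

```python
def find_communities(edges):
    """Group people into communities using only who-knows-whom links."""
    # Build an adjacency map: person -> set of direct contacts.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Keep only "embedded" ties: links whose two ends share at least one contact.
    strong = {person: set() for person in neighbours}
    for a, b in edges:
        if neighbours[a] & neighbours[b]:
            strong[a].add(b)
            strong[b].add(a)

    # Each connected component of the remaining graph is one community.
    communities, seen = [], set()
    for start in sorted(strong):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            person = stack.pop()
            if person in group:
                continue
            group.add(person)
            stack.extend(strong[person] - group)
        seen |= group
        communities.append(sorted(group))
    return communities

# Hypothetical contact list: two tight groups joined by a single acquaintance link.
contacts = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),
    ("cat", "dan"),
]
groups = find_communities(contacts)
```

Nothing in the input says which group anyone belongs to; the two groups emerge from the structure, and labelling them is left to a person.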

The graph shows my professional network: dark blue is the XML community, for example, and red is O'Reilly Media.

http://inmaps.linkedinlabs.com/

DATA SCIENCE

The arrival of big data and the importance of data as a product heralded what's been touted as the hottest job of the decade, "data scientist".

Combination of statistician, programmer, entrepreneur and storyteller. DJ Patil in "Building Data Science Teams" (http://radar.oreilly.com/2011/09/building-data-science-teams.html) characterizes their role in companies:

* Decision science and business intelligence
* Product and marketing analytics
* Fraud, abuse, risk & security
* Data services and operations
* Data engineering and infrastructure

Patil characterizes a data scientist as having:

* Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
* Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
* Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
* Cleverness: the ability to look at a problem in different, creative ways.

In essence, the change is this: you can't put analysis in a box. A large BI vendor told me that their biggest problem tended to be the organizational structure of their customers: analysts were isolated in their own department, throwing reports over the wall.

Google, Facebook et al not only have better tools, but they are organized around the importance of data to their organizations. Reaping benefit from data may well be as simple as sitting your analyst next to the business team.

MORE DATA

BEATS BETTER ALGORITHMS (USUALLY)
I mentioned earlier that the key change in big data is being able to analyze all the data points. When you do that, you find that astonishingly simple algorithms can often suffice.

Also, adding in external data sources can radically improve your analysis. Here's one example:

Anand Rajaraman of Walmart Labs
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

"Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?"

The best algorithm using samples is beaten by techniques that work at the level of N=all.

ComScore vs Efficient Frontier (2008)
http://anand.typepad.com/datawocky/2008/04/more-data-beats.html

In marketing, one particularly "magic" data set is location. Consider sales figures from retail stores, combined with the average drive time to a competitor store.

Adding in these multiple data sets is hard work and today's tools are only starting to make that easier.
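
As a small illustration of combining an internal data set with an external one, the sketch below joins store sales to drive times on a shared store identifier. The store names, figures and function name are made up for the example.

```python
# Made-up illustrative figures; none of these numbers come from the talk.
weekly_sales = {"store_a": 120_000, "store_b": 95_000, "store_c": 143_000}
drive_minutes_to_competitor = {"store_a": 4, "store_b": 12, "store_c": 25}

def combine(sales, drive_times):
    """Join the two data sets on store id, keeping stores present in both."""
    return [
        {"store": store, "sales": sales[store], "drive_minutes": drive_times[store]}
        for store in sorted(sales.keys() & drive_times.keys())
    ]

rows = combine(weekly_sales, drive_minutes_to_competitor)

# Rank stores by how isolated they are from competition, to compare against sales.
by_isolation = sorted(rows, key=lambda row: row["drive_minutes"], reverse=True)
```

In practice the hard part is usually the matching itself: real data sets rarely share a clean key the way these two dictionaries do.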
Photo by JD Hancock

INSTRUMENT

ALL THE THINGS
When people say "big data is the new oil", this is what they're talking about: the sometimes surprising propensity for large humdrum data sets to contain immense business value when processed.

In a digital world, our every action leaves traces. For example, financial traders try and obscure their searches on information systems so as not to leak their intent.

Google uses its search logs to understand users' behavior and to find signals that can be used to improve the product. This is how Steven Levy describes it:

'But the information Google began gathering was far more voluminous, and the company received it for free. Google came to see that instant feedback as the basis of an artificial intelligence learning mechanism. ... On the most basic level, Google could see how satisfied users were. To paraphrase Tolstoy, happy users were all the same. The best sign of their happiness was the “long click”— this occurred when someone went to a search result, ideally the top one, and did not return. That meant Google had successfully fulfilled the query. But unhappy users were unhappy in their own ways. Most telling were the “short clicks” where a user followed a link and immediately returned to try again.'

Levy, Steven (2011-04-12). In The Plex (p. 47). Simon & Schuster, Inc.
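
A hedged sketch of how the long click and short click signals might be derived from a log of return times. The thresholds, the log format and the function names are assumptions made for illustration; the quoted passage gives no numbers, and this is not Google's method.

```python
# Assumed thresholds, chosen only for illustration; the quoted passage gives none.
SHORT_CLICK_SECONDS = 10
LONG_CLICK_SECONDS = 60

def classify_click(seconds_until_return):
    """Label one result click by how long the user stayed away from the results page.

    seconds_until_return is None when the user never came back.
    """
    if seconds_until_return is None or seconds_until_return >= LONG_CLICK_SECONDS:
        return "long"
    if seconds_until_return <= SHORT_CLICK_SECONDS:
        return "short"
    return "unclear"

def satisfaction_rate(click_log):
    """Share of clicks that look like satisfied users (long clicks)."""
    labels = [classify_click(seconds) for seconds in click_log]
    return labels.count("long") / len(labels) if labels else 0.0

# Hypothetical log: seconds before each user returned to the results page.
log = [None, 3, 240, 5, 30]
rate = satisfaction_rate(log)
```

The point is that the signal costs nothing extra to collect: it is a by-product of users doing what they were going to do anyway.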

The best place to look for improvements is in observed user behavior. This is where observation beats theory again.

It's a benign feedback loop that offers a competitive advantage. The more people use your product, the better it gets.
Photo by vissago

HOARD

YOU DON'T KNOW WHAT YOU DON'T KNOW
When data storage is limited, storing every piece of information is troublesome.

This is how many of today's data warehousing solutions work. Data is loaded into the warehouse, transformed and cleaned into a schema most appropriate for purpose. Anything you throw away is lost. That might be invalid data, or detail that's not needed for current business purposes.

With big data, the approach is the opposite. The idea is to store all raw incoming data. Because large amounts of data can be processed rapidly in parallel, there's no need to clean it up and transform it in advance; you can do it "just in time".

Because no information is lost, you can go back and look at data through a different lens, and find value you didn't know you needed the first time around.
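
A minimal sketch of the store-raw, interpret-later idea, assuming events arrive as JSON text lines. The records and field names are hypothetical. Parsing happens when a question is asked, so a second question can be answered later from the same untouched data.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical records),
# including a malformed line that a load-time cleaning step would have discarded.
raw_log = [
    '{"user": "u1", "action": "view", "item": "book", "ms": 120}',
    '{"user": "u2", "action": "buy", "item": "book", "ms": 340}',
    'not json at all',
    '{"user": "u1", "action": "buy", "item": "lamp", "ms": 95}',
]

def read_events(lines):
    """Parse at query time, skipping (but not deleting) lines that don't parse."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

# Lens 1: the question we knew about up front.
purchases = sum(1 for event in read_events(raw_log) if event["action"] == "buy")

# Lens 2: questions that came up later, answered from the same raw data.
slowest_ms = max(event["ms"] for event in read_events(raw_log))
malformed = len(raw_log) - sum(1 for _ in read_events(raw_log))
```

Had the log been cleaned into a purchases-only table on arrival, the timing field and the malformed line would no longer be there to ask about.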

This is set to be a sea change in how business handles data. Instead of three-month turnarounds for small changes to reports, expect a self-serve infrastructure that encourages exploration and questioning.
Photo by ross.grady

A BUSINESS PROBLEM

DATA ISN'T SOMETHING TO HIDE AWAY IN I.T.
"Business Problem" is jargon that means everybody should care about data and what it says. We don't ignore what we smell or see, and in the new digital era, data is feeding us those senses.

Blind organizations can walk on unawares for a while, but then their mental maps start to fail them.

Realistically, what does this mean?

* Buy in for using data must exist at the top level
* Don't separate people who understand the domain from those who understand math: combine them and mix them up!
* Be prepared for organizational change to happen if you really want to use data well

Big data is developing against a background of change in IT. No longer is savvy IT about central planning, but about enablement.

* Data should be made available to all who want to use it. A successful large-scale example is the opening of government data, and the resulting ecosystems of private companies providing value through that data.
Photo by marcp_dmoz

THE MASTER PLAN

We're mostly at phase 1: "get the data"

This turns out to be spectacularly difficult, and it's where most of the current industry effort is focused. Data marketplaces were a big deal a few years ago, but didn't take off. Why? There was nowhere for the data to go.

Inside an organization, getting access to data can often run up against politics and silos.

We're starting to push into phase 2, and understand some of the patterns of change. Training up data scientists, building out data platforms inside organizations, creating easier services and tools.

There's not going to be a one-size-fits-all Phase 2. Instead, the thing to do is to figure out your target problem, and then work with the data to answer it.

TODAY'S TOOLS

GIVE IT A GO: LEARN R OR PYTHON!
Firstly, to get into "big data" you don't have to work on big data. Smart use of data is the most important thing to get familiar with.

You can work on small data sets with R, Python, Gephi. If you don't know relational databases, learn them, as SQL isn't going away any time soon. Learn all the things you need to liberate data. Google Fusion Tables. Look at the Guardian's data blog and learn from what they do.
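
For a first taste of SQL without installing anything, Python's built-in sqlite3 module is enough. The table and rows below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE visits (city TEXT, visitors INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Leeds", 40), ("York", 25), ("Leeds", 60), ("York", 5)],  # made-up rows
)

# A GROUP BY query: total visitors per city, largest first.
totals = conn.execute(
    "SELECT city, SUM(visitors) AS total FROM visits "
    "GROUP BY city ORDER BY total DESC"
).fetchall()
conn.close()
```

The same SELECT, GROUP BY and ORDER BY ideas carry over to much larger relational systems.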

Hadoop is, in effect, an operating system for big data. You can get started working with it using reasonably familiar tools and Amazon's cloud services. The state of the tooling is still immature, so you need to be a programmer.
Photo by bgolub

IN THE FUTURE

DATA NEEDS SMART MANAGEMENT
As we're busy reinventing our data infrastructure at scale, there are plenty of concepts from the "small data" world that don't yet exist for big data, and that will become correspondingly more vital.

* Provenance: where did this data come from?
* Indexing and location: where can I find the data?
* Dependencies: which systems depend on this data for their function?
* Auditing: who has touched this data, how was it transformed?

I don't expect the solutions to these problems to be exactly the same as today's for smaller data. A change in scale can bring a change in nature too.
Photo by peasap

ONTOLOGIES SUPERCHARGE BIG DATA

All is by no means lost for the ontology.

Big data actually gives the semantic web something to do. One of the biggest problems in data management is classifying it: big data allows us to do a lot of that automatically. We can now start to deploy systems in a smart way based on ontologies.

Even Google have realized this, and they're making a huge push with their "knowledge graph".

The stage is set for applying artificial intelligence to enable computers to help people radically more than they ever could before.

ASK GOOD QUESTIONS

As more of the world can be measured, your organization becomes more like a laboratory.

The things that make a good scientist and a good leader in real life apply in data too.

Above all, ask good questions.

HUMANFACEOFBIGDATA.COM

Excellent photographic storytelling about how data and analytics touch our everyday lives.

"BIG DATA"

MAYER-SCHOENBERGER & CUKIER BIG-DATA-BOOK.COM
http://big-data-book.com/

by Viktor Mayer-Schönberger and Kenneth Cukier; Cukier wrote the influential Economist special report on big data in 2010.

A wonderfully written trove of stories that illustrate well the principles of big data.

BIG DATA

I'm the Editor-in-Chief.

Our aim is to be integrative across disciplines and to foster discussion of interest across the data-using community.

STRATA CONFERENCE

I'm the founding program chair for this conference, which now happens four times a year. http://strataconf.com/

Our tagline is "making data work", which sums up my approach to this area.

EDD.ME/SLAPHT-BIG

FIND ME ON TWITTER @EDD AND ON GOOGLE PLUS
I also run the "Data Data Data" community on Google Plus, which is a good place to learn of new tools and join in high-signal discussions around big data and data science.

Closing thoughts: