Real-Time Analytics. Techniques to Analyze and Visualize Streaming Data

  • ID: 2827018
  • Book
  • 432 Pages
  • John Wiley and Sons Ltd

A COMPLETE SOLUTION FOR DYNAMIC ANALYSIS OF STREAMING DATA

Real-Time Analytics provides a complete end-to-end solution for cost-effective analysis and visualization of streaming data. Beginning with a description of the required analytics ecosystem, the book builds upon that foundation with practical guidance toward the tools and techniques that get targeted results. Outlining best practices for each specific application throughout the discovery life cycle, the approach provides easy-to-follow instructions for implementing the presented tools and techniques. Examples taken from real-world applications highlight the usage of various aspects of data processing from tabulation to visualization and forecasting. Readers will:

  • Understand the components of streaming data systems, including their full capabilities and characteristics
  • Learn the relevant architecture and best practices for analysis and storage of streaming data
  • Develop a system for data aggregation, delivery, and warehousing using open source and commercial tools
  • Learn the implementation and application of advanced algorithms and data structures to streaming applications

Decreasing data acquisition costs and increasing connectivity are enabling ever more efficient methods of continuous collection, so why do analysis platforms remain largely batch-based? Tools do exist that handle streaming data analysis and visualization efficiently, at a feasible cost in time, maintenance, and hardware. This book guides readers through the construction of a robust, cost-efficient system with clear, expert instruction.

Note: Product cover images may vary from those shown

Introduction xv

Chapter 1 Introduction to Streaming Data 1

Sources of Streaming Data 2

Operational Monitoring 3

Web Analytics 3

Online Advertising 4

Social Media 5

Mobile Data and the Internet of Things 5

Why Streaming Data Is Different 7

Always On, Always Flowing 7

Loosely Structured 8

High-Cardinality Storage 9

Infrastructures and Algorithms 10

Conclusion 10

Part I Streaming Analytics Architecture 13

Chapter 2 Designing Real-Time Streaming Architectures 15

Real-Time Architecture Components 16

Collection 16

Data Flow 17

Processing 19

Storage 20

Delivery 22

Features of a Real-Time Architecture 24

High Availability 24

Low Latency 25

Horizontal Scalability 26

Languages for Real-Time Programming 27

Java 27

Scala and Clojure 28

JavaScript 29

The Go Language 30

A Real-Time Architecture Checklist 30

Collection 31

Data Flow 31

Processing 32

Storage 32

Delivery 33

Conclusion 34

Chapter 3 Service Configuration and Coordination 35

Motivation for Configuration and Coordination Systems 36

Maintaining Distributed State 36

Unreliable Network Connections 36

Clock Synchronization 37

Consensus in an Unreliable World 38

Apache ZooKeeper 39

The znode 39

Watches and Notifications 41

Maintaining Consistency 41

Creating a ZooKeeper Cluster 42

ZooKeeper's Native Java Client 47

The Curator Client 56

Curator Recipes 63

Conclusion 70

Chapter 4 Data-Flow Management in Streaming Analysis 71

Distributed Data Flows 72

At Least Once Delivery 72

The n+1 Problem 73

Apache Kafka: High-Throughput Distributed Messaging 74

Design and Implementation 74

Configuring a Kafka Environment 80

Interacting with Kafka Brokers 89

Apache Flume: Distributed Log Collection 92

The Flume Agent 92

Configuring the Agent 94

The Flume Data Model 95

Channel Selectors 95

Flume Sources 98

Flume Sinks 107

Sink Processors 110

Flume Channels 110

Flume Interceptors 112

Integrating Custom Flume Components 114

Running Flume Agents 114

Conclusion 115

Chapter 5 Processing Streaming Data 117

Distributed Streaming Data Processing 118

Coordination 118

Partitions and Merges 119

Transactions 119

Processing Data with Storm 119

Components of a Storm Cluster 120

Configuring a Storm Cluster 122

Distributed Clusters 123

Local Clusters 126

Storm Topologies 127

Implementing Bolts 130

Implementing and Using Spouts 136

Distributed Remote Procedure Calls 142

Trident: The Storm DSL 144

Processing Data with Samza 151

Apache YARN 151

Getting Started with YARN and Samza 153

Integrating Samza into the Data Flow 157

Samza Jobs 157

Conclusion 166

Chapter 6 Storing Streaming Data 167

Consistent Hashing 168

NoSQL Storage Systems 169

Redis 170

MongoDB 180

Cassandra 203

Other Storage Technologies 215

Relational Databases 215

Distributed In-Memory Data Grids 215

Choosing a Technology 215

Key-Value Stores 216

Document Stores 216

Distributed Hash Table Stores 216

In-Memory Grids 217

Relational Databases 217

Warehousing 217

Hadoop as ETL and Warehouse 218

Lambda Architectures 223

Conclusion 224

Part II Analysis and Visualization 225

Chapter 7 Delivering Streaming Metrics 227

Streaming Web Applications 228

Working with Node 229

Managing a Node Project with NPM 231

Developing Node Web Applications 235

A Basic Streaming Dashboard 238

Adding Streaming to Web Applications 242

Visualizing Data 254

HTML5 Canvas and Inline SVG 254

Data-Driven Documents: D3.js 262

High-Level Tools 272

Mobile Streaming Applications 277

Conclusion 279

Chapter 8 Exact Aggregation and Delivery 281

Timed Counting and Summation 285

Counting in Bolts 286

Counting with Trident 288

Counting in Samza 289

Multi-Resolution Time-Series Aggregation 290

Quantization Framework 290

Stochastic Optimization 296

Delivering Time-Series Data 297

Strip Charts with D3.js 298

High-Speed Canvas Charts 299

Horizon Charts 301

Conclusion 303

Chapter 9 Statistical Approximation of Streaming Data 305

Numerical Libraries 306

Probabilities and Distributions 307

Expectation and Variance 309

Statistical Distributions 310

Discrete Distributions 310

Continuous Distributions 312

Joint Distributions 315

Working with Distributions 316

Inferring Parameters 316

The Delta Method 317

Distribution Inequalities 319

Random Number Generation 319

Generating Specific Distributions 321

Sampling Procedures 324

Sampling from a Fixed Population 325

Sampling from a Streaming Population 326

Biased Streaming Sampling 327

Conclusion 329

Chapter 10 Approximating Streaming Data with Sketching 331

Registers and Hash Functions 332

Registers 332

Hash Functions 332

Working with Sets 336

The Bloom Filter 338

The Algorithm 338

Choosing a Filter Size 340

Unions and Intersections 341

Cardinality Estimation 342

Interesting Variations 344

Distinct Value Sketches 347

The Min-Count Algorithm 348

The HyperLogLog Algorithm 351

The Count-Min Sketch 356

Point Queries 356

Count-Min Sketch Implementation 357

Top-K and Heavy Hitters 358

Range and Quantile Queries 360

Other Applications 364

Conclusion 364

Chapter 11 Beyond Aggregation 367

Models for Real-Time Data 368

Simple Time-Series Models 369

Linear Models 373

Logistic Regression 378

Neural Network Models 380

Forecasting with Models 389

Exponential Smoothing Methods 390

Regression Methods 393

Neural Network Methods 394

Monitoring 396

Outlier Detection 397

Change Detection 399

Real-Time Optimization 400

Conclusion 402

Index 403

BYRON ELLIS is CTO of Spongecell, where he heads research and development. Previously the Chief Data Scientist for LivePerson and CTO at AdBrite, Ellis holds a Ph.D. in Statistics from Harvard University and a B.S. in Cybernetics from UCLA. He presents sessions on real-time analytics at Strata and other major conferences.
