Cover page
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Chapter 1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Hadoop 2 – what's the big deal?
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Getting started
Running the examples
Data processing with Hadoop
Summary
Chapter 2. Storage
The inner workings of HDFS
Command-line access to the HDFS filesystem
Protecting the filesystem metadata
Apache ZooKeeper – a different type of filesystem
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Managing and serializing data
Storing data
Summary
Chapter 3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
Writing MapReduce programs
Walking through a run of a MapReduce job
YARN
YARN in the real world – Computation beyond MapReduce
Summary
Chapter 4. Real-time Computation with Samza
Stream processing with Samza
Summary
Chapter 5. Iterative Computation with Spark
Apache Spark
The Spark ecosystem
Processing data with Apache Spark
Comparing Samza and Spark Streaming
Summary
Chapter 6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Fundamentals of Apache Pig
Programming Pig
Extending Pig (UDFs)
Analyzing the Twitter stream
Summary
Chapter 7. Hadoop and SQL
Why SQL on Hadoop
Prerequisites
Hive architecture
Hive and Amazon Web Services
Extending HiveQL
Programmatic interfaces
Stinger initiative
Impala
Summary
Chapter 8. Data Lifecycle Management
What data lifecycle management is
Building a tweet analysis capability
Challenges of external data
Collecting additional data
Pulling it all together
Summary
Chapter 9. Making Development Easier
Choosing a framework
Hadoop streaming
Kite Data
Apache Crunch
Summary
Chapter 10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Cloudera Manager
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Building a cluster on EMR
Cluster tuning
Security
Monitoring
Troubleshooting
Summary
Chapter 11. Where to Go Next
Alternative distributions
Other computational frameworks
Other interesting projects
Other programming abstractions
AWS resources
Sources of information
Summary
Index
Last updated: 2021-07-23 20:57:57