Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale. 4th Edition Tom White

Author: Tom White
Publisher: O'Reilly Media
Pages: 756
Available formats: ePub, Mobi

Ebook: 186.15 zł ~~219.00 zł~~ (-15%)

Lowest price in the last 30 days: 131.40 zł

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You'll learn about recent changes to Hadoop, and explore new case studies on Hadoop's role in healthcare systems and genomics data processing.

- Learn fundamental components such as MapReduce, HDFS, and YARN
- Explore MapReduce in depth, including steps for developing applications with it
- Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
- Learn two data formats: Avro for data serialization and Parquet for nested data
- Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
- Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
- Learn the HBase distributed database and the ZooKeeper distributed configuration service

You can read the ebook "Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale. 4th Edition" on:

Inkbook, Kindle, Pocketbook, Onyx Boox, and other e-readers
Windows, macOS, and other systems

Windows, Android, iOS, and HarmonyOS systems
any devices and applications that support the PDF, EPub, and Mobi formats

Book details

Ebook ISBN: 978-14-919-0170-0, 9781491901700
Ebook release date: 2015-03-25. The ebook release date is often the day the title went on sale and may not match the publication date of the print edition.
Publication language: English
ePub file size: 8.1 MB
Mobi file size: 16.7 MB

Categories

Computer Science » Databases

Product accessibility

This product has not yet been assessed for accessibility features, or no accessibility information was provided, or the information provided is insufficient. The publisher or supplier has probably not yet enabled validation of the product or has not supplied the relevant accessibility information.

Table of contents

Hadoop: The Definitive Guide
Dedication
Foreword
Preface
- Administrative Notes
- What's New in the Fourth Edition?
- What's New in the Third Edition?
- What's New in the Second Edition?
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
I. Hadoop Fundamentals
- 1. Meet Hadoop
  - Data!
  - Data Storage and Analysis
  - Querying All Your Data
  - Beyond Batch
  - Comparison with Other Systems
    - Relational Database Management Systems
    - Grid Computing
    - Volunteer Computing
  - A Brief History of Apache Hadoop
  - What's in This Book?
- 2. MapReduce
  - A Weather Dataset
    - Data Format
  - Analyzing the Data with Unix Tools
  - Analyzing the Data with Hadoop
    - Map and Reduce
    - Java MapReduce
      - A test run
  - Scaling Out
    - Data Flow
    - Combiner Functions
      - Specifying a combiner function
    - Running a Distributed MapReduce Job
  - Hadoop Streaming
    - Ruby
    - Python
- 3. The Hadoop Distributed Filesystem
  - The Design of HDFS
  - HDFS Concepts
    - Blocks
    - Namenodes and Datanodes
    - Block Caching
    - HDFS Federation
    - HDFS High Availability
      - Failover and fencing
  - The Command-Line Interface
    - Basic Filesystem Operations
  - Hadoop Filesystems
    - Interfaces
      - HTTP
      - C
      - NFS
      - FUSE
  - The Java Interface
    - Reading Data from a Hadoop URL
    - Reading Data Using the FileSystem API
      - FSDataInputStream
    - Writing Data
      - FSDataOutputStream
    - Directories
    - Querying the Filesystem
      - File metadata: FileStatus
      - Listing files
      - File patterns
      - PathFilter
    - Deleting Data
  - Data Flow
    - Anatomy of a File Read
    - Anatomy of a File Write
    - Coherency Model
      - Consequences for application design
  - Parallel Copying with distcp
    - Keeping an HDFS Cluster Balanced
- 4. YARN
  - Anatomy of a YARN Application Run
    - Resource Requests
    - Application Lifespan
    - Building YARN Applications
  - YARN Compared to MapReduce 1
  - Scheduling in YARN
    - Scheduler Options
    - Capacity Scheduler Configuration
      - Queue placement
    - Fair Scheduler Configuration
      - Enabling the Fair Scheduler
      - Queue configuration
      - Queue placement
      - Preemption
    - Delay Scheduling
    - Dominant Resource Fairness
  - Further Reading
- 5. Hadoop I/O
  - Data Integrity
    - Data Integrity in HDFS
    - LocalFileSystem
    - ChecksumFileSystem
  - Compression
    - Codecs
      - Compressing and decompressing streams with CompressionCodec
      - Inferring CompressionCodecs using CompressionCodecFactory
      - Native libraries
        - CodecPool
    - Compression and Input Splits
    - Using Compression in MapReduce
      - Compressing map output
  - Serialization
    - The Writable Interface
      - WritableComparable and comparators
    - Writable Classes
      - Writable wrappers for Java primitives
      - Text
        - Indexing
        - Unicode
        - Iteration
        - Mutability
        - Resorting to String
      - BytesWritable
      - NullWritable
      - ObjectWritable and GenericWritable
      - Writable collections
    - Implementing a Custom Writable
      - Implementing a RawComparator for speed
      - Custom comparators
    - Serialization Frameworks
      - Serialization IDL
  - File-Based Data Structures
    - SequenceFile
      - Writing a SequenceFile
      - Reading a SequenceFile
      - Displaying a SequenceFile with the command-line interface
      - Sorting and merging SequenceFiles
      - The SequenceFile format
    - MapFile
      - MapFile variants
    - Other File Formats and Column-Oriented Formats
II. MapReduce
- 6. Developing a MapReduce Application
  - The Configuration API
    - Combining Resources
    - Variable Expansion
  - Setting Up the Development Environment
    - Managing Configuration
    - GenericOptionsParser, Tool, and ToolRunner
  - Writing a Unit Test with MRUnit
    - Mapper
    - Reducer
  - Running Locally on Test Data
    - Running a Job in a Local Job Runner
    - Testing the Driver
  - Running on a Cluster
    - Packaging a Job
      - The client classpath
      - The task classpath
      - Packaging dependencies
      - Task classpath precedence
    - Launching a Job
    - The MapReduce Web UI
      - The resource manager page
      - The MapReduce job page
    - Retrieving the Results
    - Debugging a Job
      - The tasks and task attempts pages
      - Handling malformed data
    - Hadoop Logs
    - Remote Debugging
  - Tuning a Job
    - Profiling Tasks
      - The HPROF profiler
  - MapReduce Workflows
    - Decomposing a Problem into MapReduce Jobs
    - JobControl
    - Apache Oozie
      - Defining an Oozie workflow
      - Packaging and deploying an Oozie workflow application
      - Running an Oozie workflow job
- 7. How MapReduce Works
  - Anatomy of a MapReduce Job Run
    - Job Submission
    - Job Initialization
    - Task Assignment
    - Task Execution
      - Streaming
    - Progress and Status Updates
    - Job Completion
  - Failures
    - Task Failure
    - Application Master Failure
    - Node Manager Failure
    - Resource Manager Failure
  - Shuffle and Sort
    - The Map Side
    - The Reduce Side
    - Configuration Tuning
  - Task Execution
    - The Task Execution Environment
      - Streaming environment variables
    - Speculative Execution
    - Output Committers
      - Task side-effect files
- 8. MapReduce Types and Formats
  - MapReduce Types
    - The Default MapReduce Job
      - The default Streaming job
      - Keys and values in Streaming
  - Input Formats
    - Input Splits and Records
      - FileInputFormat
      - FileInputFormat input paths
      - FileInputFormat input splits
      - Small files and CombineFileInputFormat
      - Preventing splitting
      - File information in the mapper
      - Processing a whole file as a record
    - Text Input
      - TextInputFormat
        - Controlling the maximum line length
      - KeyValueTextInputFormat
      - NLineInputFormat
      - XML
    - Binary Input
      - SequenceFileInputFormat
      - SequenceFileAsTextInputFormat
      - SequenceFileAsBinaryInputFormat
      - FixedLengthInputFormat
    - Multiple Inputs
    - Database Input (and Output)
  - Output Formats
    - Text Output
    - Binary Output
      - SequenceFileOutputFormat
      - SequenceFileAsBinaryOutputFormat
      - MapFileOutputFormat
    - Multiple Outputs
      - An example: Partitioning data
      - MultipleOutputs
    - Lazy Output
    - Database Output
- 9. MapReduce Features
  - Counters
    - Built-in Counters
      - Task counters
      - Job counters
    - User-Defined Java Counters
      - Dynamic counters
      - Retrieving counters
    - User-Defined Streaming Counters
  - Sorting
    - Preparation
    - Partial Sort
    - Total Sort
    - Secondary Sort
      - Java code
      - Streaming
  - Joins
    - Map-Side Joins
    - Reduce-Side Joins
  - Side Data Distribution
    - Using the Job Configuration
    - Distributed Cache
      - Usage
      - How it works
      - The distributed cache API
  - MapReduce Library Classes
III. Hadoop Operations
- 10. Setting Up a Hadoop Cluster
  - Cluster Specification
    - Cluster Sizing
      - Master node scenarios
    - Network Topology
      - Rack awareness
  - Cluster Setup and Installation
    - Installing Java
    - Creating Unix User Accounts
    - Installing Hadoop
    - Configuring SSH
    - Configuring Hadoop
    - Formatting the HDFS Filesystem
    - Starting and Stopping the Daemons
    - Creating User Directories
  - Hadoop Configuration
    - Configuration Management
    - Environment Settings
      - Java
      - Memory heap size
      - System logfiles
      - SSH settings
    - Important Hadoop Daemon Properties
      - HDFS
      - YARN
      - Memory settings in YARN and MapReduce
      - CPU settings in YARN and MapReduce
    - Hadoop Daemon Addresses and Ports
    - Other Hadoop Properties
      - Cluster membership
      - Buffer size
      - HDFS block size
      - Reserved storage space
      - Trash
      - Job scheduler
      - Reduce slow start
      - Short-circuit local reads
  - Security
    - Kerberos and Hadoop
      - An example
    - Delegation Tokens
    - Other Security Enhancements
  - Benchmarking a Hadoop Cluster
    - Hadoop Benchmarks
      - Benchmarking MapReduce with TeraSort
      - Other benchmarks
    - User Jobs
- 11. Administering Hadoop
  - HDFS
    - Persistent Data Structures
      - Namenode directory structure
      - The filesystem image and edit log
      - Secondary namenode directory structure
      - Datanode directory structure
    - Safe Mode
      - Entering and leaving safe mode
    - Audit Logging
    - Tools
      - dfsadmin
      - Filesystem check (fsck)
        - Finding the blocks for a file
      - Datanode block scanner
      - Balancer
  - Monitoring
    - Logging
      - Setting log levels
      - Getting stack traces
    - Metrics and JMX
  - Maintenance
    - Routine Administration Procedures
      - Metadata backups
      - Data backups
      - Filesystem check (fsck)
      - Filesystem balancer
    - Commissioning and Decommissioning Nodes
      - Commissioning new nodes
      - Decommissioning old nodes
    - Upgrades
      - HDFS data and metadata upgrades
        - Start the upgrade
        - Wait until the upgrade is complete
        - Check the upgrade
        - Roll back the upgrade (optional)
        - Finalize the upgrade (optional)
IV. Related Projects
- 12. Avro
  - Avro Data Types and Schemas
  - In-Memory Serialization and Deserialization
    - The Specific API
  - Avro Datafiles
  - Interoperability
    - Python API
    - Avro Tools
  - Schema Resolution
  - Sort Order
  - Avro MapReduce
  - Sorting Using Avro MapReduce
  - Avro in Other Languages
- 13. Parquet
  - Data Model
    - Nested Encoding
  - Parquet File Format
  - Parquet Configuration
  - Writing and Reading Parquet Files
    - Avro, Protocol Buffers, and Thrift
      - Projection and read schemas
  - Parquet MapReduce
- 14. Flume
  - Installing Flume
  - An Example
  - Transactions and Reliability
    - Batching
  - The HDFS Sink
    - Partitioning and Interceptors
    - File Formats
  - Fan Out
    - Delivery Guarantees
    - Replicating and Multiplexing Selectors
  - Distribution: Agent Tiers
    - Delivery Guarantees
  - Sink Groups
  - Integrating Flume with Applications
  - Component Catalog
  - Further Reading
- 15. Sqoop
  - Getting Sqoop
  - Sqoop Connectors
  - A Sample Import
    - Text and Binary File Formats
  - Generated Code
    - Additional Serialization Systems
  - Imports: A Deeper Look
    - Controlling the Import
    - Imports and Consistency
    - Incremental Imports
    - Direct-Mode Imports
  - Working with Imported Data
    - Imported Data and Hive
  - Importing Large Objects
  - Performing an Export
  - Exports: A Deeper Look
    - Exports and Transactionality
    - Exports and SequenceFiles
  - Further Reading
- 16. Pig
  - Installing and Running Pig
    - Execution Types
      - Local mode
      - MapReduce mode
    - Running Pig Programs
    - Grunt
    - Pig Latin Editors
  - An Example
    - Generating Examples
  - Comparison with Databases
  - Pig Latin
    - Structure
    - Statements
    - Expressions
    - Types
    - Schemas
      - Using Hive tables with HCatalog
      - Validation and nulls
      - Schema merging
    - Functions
      - Other libraries
    - Macros
  - User-Defined Functions
    - A Filter UDF
      - Leveraging types
    - An Eval UDF
      - Dynamic invokers
    - A Load UDF
      - Using a schema
  - Data Processing Operators
    - Loading and Storing Data
    - Filtering Data
      - FOREACH...GENERATE
      - STREAM
    - Grouping and Joining Data
      - JOIN
      - COGROUP
      - CROSS
      - GROUP
    - Sorting Data
    - Combining and Splitting Data
  - Pig in Practice
    - Parallelism
    - Anonymous Relations
    - Parameter Substitution
      - Dynamic parameters
      - Parameter substitution processing
  - Further Reading
- 17. Hive
  - Installing Hive
    - The Hive Shell
  - An Example
  - Running Hive
    - Configuring Hive
      - Execution engines
      - Logging
    - Hive Services
      - Hive clients
    - The Metastore
  - Comparison with Traditional Databases
    - Schema on Read Versus Schema on Write
    - Updates, Transactions, and Indexes
    - SQL-on-Hadoop Alternatives
  - HiveQL
    - Data Types
      - Primitive types
      - Complex types
    - Operators and Functions
      - Conversions
  - Tables
    - Managed Tables and External Tables
    - Partitions and Buckets
      - Partitions
      - Buckets
    - Storage Formats
      - The default storage format: Delimited text
      - Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
      - Using a custom SerDe: RegexSerDe
      - Storage handlers
    - Importing Data
      - Inserts
      - Multitable insert
      - CREATE TABLE...AS SELECT
    - Altering Tables
    - Dropping Tables
  - Querying Data
    - Sorting and Aggregating
    - MapReduce Scripts
    - Joins
      - Inner joins
      - Outer joins
      - Semi joins
      - Map joins
    - Subqueries
    - Views
  - User-Defined Functions
    - Writing a UDF
    - Writing a UDAF
      - A more complex UDAF
  - Further Reading
- 18. Crunch
  - An Example
  - The Core Crunch API
    - Primitive Operations
      - union()
      - parallelDo()
      - groupByKey()
      - combineValues()
    - Types
      - Records and tuples
    - Sources and Targets
      - Reading from a source
      - Writing to a target
      - Existing outputs
      - Combined sources and targets
    - Functions
      - Serialization of functions
      - Object reuse
    - Materialization
      - PObject
  - Pipeline Execution
    - Running a Pipeline
      - Asynchronous execution
      - Debugging
    - Stopping a Pipeline
    - Inspecting a Crunch Plan
    - Iterative Algorithms
    - Checkpointing a Pipeline
  - Crunch Libraries
  - Further Reading
- 19. Spark
  - Installing Spark
  - An Example
    - Spark Applications, Jobs, Stages, and Tasks
    - A Scala Standalone Application
    - A Java Example
    - A Python Example
  - Resilient Distributed Datasets
    - Creation
    - Transformations and Actions
      - Aggregation transformations
    - Persistence
      - Persistence levels
    - Serialization
      - Data
      - Functions
  - Shared Variables
    - Broadcast Variables
    - Accumulators
  - Anatomy of a Spark Job Run
    - Job Submission
    - DAG Construction
    - Task Scheduling
    - Task Execution
  - Executors and Cluster Managers
    - Spark on YARN
      - YARN client mode
      - YARN cluster mode
  - Further Reading
- 20. HBase
  - HBasics
    - Backdrop
  - Concepts
    - Whirlwind Tour of the Data Model
      - Regions
      - Locking
    - Implementation
      - HBase in operation
  - Installation
    - Test Drive
  - Clients
    - Java
    - MapReduce
    - REST and Thrift
  - Building an Online Query Application
    - Schema Design
    - Loading Data
      - Load distribution
      - Bulk load
    - Online Queries
      - Station queries
      - Observation queries
  - HBase Versus RDBMS
    - Successful Service
    - HBase
  - Praxis
    - HDFS
    - UI
    - Metrics
    - Counters
  - Further Reading
- 21. ZooKeeper
  - Installing and Running ZooKeeper
  - An Example
    - Group Membership in ZooKeeper
    - Creating the Group
    - Joining a Group
    - Listing Members in a Group
      - ZooKeeper command-line tools
    - Deleting a Group
  - The ZooKeeper Service
    - Data Model
      - Ephemeral znodes
      - Sequence numbers
      - Watches
    - Operations
      - Multiupdate
      - APIs
      - Watch triggers
      - ACLs
    - Implementation
    - Consistency
    - Sessions
      - Time
    - States
  - Building Applications with ZooKeeper
    - A Configuration Service
    - The Resilient ZooKeeper Application
      - InterruptedException
      - KeeperException
        - State exceptions
        - Recoverable exceptions
        - Unrecoverable exceptions
      - A reliable configuration service
    - A Lock Service
      - The herd effect
      - Recoverable exceptions
      - Unrecoverable exceptions
      - Implementation
    - More Distributed Data Structures and Protocols
      - BookKeeper and Hedwig
  - ZooKeeper in Production
    - Resilience and Performance
    - Configuration
  - Further Reading
V. Case Studies
- 22. Composable Data at Cerner
  - From CPUs to Semantic Integration
  - Enter Apache Crunch
  - Building a Complete Picture
  - Integrating Healthcare Data
  - Composability over Frameworks
  - Moving Forward
- 23. Biological Data Science: Saving Lives with Software
  - The Structure of DNA
  - The Genetic Code: Turning DNA Letters into Proteins
  - Thinking of DNA as Source Code
  - The Human Genome Project and Reference Genomes
  - Sequencing and Aligning DNA
  - ADAM, A Scalable Genome Analysis Platform
    - Literate programming with the Avro interface description language (IDL)
    - Column-oriented access with Parquet
    - A simple example: k-mer counting using Spark and ADAM
  - From Personalized Ads to Personalized Medicine
  - Join In
- 24. Cascading
  - Fields, Tuples, and Pipes
  - Operations
  - Taps, Schemes, and Flows
  - Cascading in Practice
  - Flexibility
  - Hadoop and Cascading at ShareThis
  - Summary
A. Installing Apache Hadoop
- Prerequisites
- Installation
- Configuration
  - Standalone Mode
  - Pseudodistributed Mode
    - Configuring SSH
    - Formatting the HDFS filesystem
    - Starting and stopping the daemons
    - Creating a user directory
  - Fully Distributed Mode
B. Cloudera's Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
D. The Old and New Java MapReduce APIs
Index
Colophon
Copyright