Hadoop Application Architectures Mark Grover, Ted Malaska, Jonathan Seidman


Authors: Mark Grover, Ted Malaska, Jonathan Seidman
Publisher: O'Reilly Media
Pages: 400
Available formats: ePub, Mobi


Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.

To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.

This book covers:

- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing



Book details

Ebook ISBN: 978-14-919-0005-5, 9781491900055
Ebook release date: 2015-06-30 (this is often the day the title went on sale and may differ from the publication date of the print edition)
Publication language: English
ePub file size: 6 MB
Mobi file size: 6 MB


Category: Computer science » Databases


Table of contents

Foreword
Preface
- A Note About the Code Examples
- Who Should Read This Book
- Why We Wrote This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
  - Mark Grover's Acknowledgements
  - Ted Malaska's Acknowledgements
  - Jonathan Seidman's Acknowledgements
  - Gwen Shapira's Acknowledgements
I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
- Data Storage Options
  - Standard File Formats
    - Text data
    - Structured text data
    - Binary data
  - Hadoop File Types
    - File-based data structures
  - Serialization Formats
    - Thrift
    - Protocol Buffers
    - Avro
  - Columnar Formats
    - RCFile
    - ORC
    - Parquet
      - Avro and Parquet
  - Compression
    - Snappy
    - LZO
    - Gzip
    - bzip2
    - Compression recommendations
- HDFS Schema Design
  - Location of HDFS Files
  - Advanced HDFS Schema Design
    - Partitioning
    - Bucketing
    - Denormalizing
  - HDFS Schema Design Summary
- HBase Schema Design
  - Row Key
    - Record retrieval
    - Distribution
    - Block cache
    - Ability to scan
    - Size
    - Readability
    - Uniqueness
  - Timestamp
  - Hops
  - Tables and Regions
    - Put performance
    - Compaction time
  - Using Columns
  - Using Column Families
  - Time-to-Live
- Managing Metadata
  - What Is Metadata?
  - Why Care About Metadata?
  - Where to Store Metadata?
  - Examples of Managing Metadata
  - Limitations of the Hive Metastore and HCatalog
  - Other Ways of Storing Metadata
    - Embedding metadata in file paths and names
    - Storing the metadata in HDFS
- Conclusion
2. Data Movement
- Data Ingestion Considerations
  - Timeliness of Data Ingestion
  - Incremental Updates
  - Access Patterns
  - Original Source System and Data Structure
    - Read speed of the devices on source systems
    - Original file type
    - Compression
    - Relational database management systems
    - Streaming data
    - Logfiles
  - Transformations
    - Interceptors
    - Selectors
  - Network Bottlenecks
  - Network Security
  - Push or Pull
    - Sqoop
    - Flume
  - Failure Handling
  - Level of Complexity
- Data Ingestion Options
  - File Transfers
    - HDFS client commands
    - Mountable HDFS
  - Considerations for File Transfers versus Other Ingest Methods
  - Sqoop: Batch Transfer Between Hadoop and Relational Databases
    - Choosing a split-by column
    - Using database-specific connectors whenever available
    - Using the Goldilocks method of Sqoop performance tuning
    - Loading many tables in parallel with fair scheduler throttling
    - Diagnosing bottlenecks
    - Keeping Hadoop updated
  - Flume: Event-Based Data Collection and Processing
    - Flume architecture
    - Flume patterns
    - File formats
    - Recommendations
      - Flume sources
      - Flume sinks
      - Flume interceptors
      - Flume memory channels
      - Flume file channels
      - Sizing Channels
    - Finding Flume bottlenecks
  - Kafka
    - Kafka fault tolerance
    - Kafka and Hadoop
- Data Extraction
- Conclusion
3. Processing Data in Hadoop
- MapReduce
  - MapReduce Overview
    - Map phase
      - InputFormat
      - RecordReader
      - Mapper.setup()
      - Mapper.map
      - Partitioner
      - Mapper.cleanup()
      - Combiner
    - Reducer
      - Shuffle
      - Reducer.setup()
      - Reducer.reduce()
      - Reducer.cleanup()
      - OutputFormat
  - Example for MapReduce
  - When to Use MapReduce
- Spark
  - Spark Overview
    - DAG Model
  - Overview of Spark Components
  - Basic Spark Concepts
    - Resilient Distributed Datasets
    - Shared variables
    - SparkContext
    - Transformations
    - Action
  - Benefits of Using Spark
    - Simplicity
    - Versatility
    - Reduced disk I/O
    - Storage
    - Multilanguage
    - Resource manager independence
    - Interactive shell (REPL)
  - Spark Example
  - When to Use Spark
- Abstractions
  - Pig
  - Pig Example
  - When to Use Pig
- Crunch
  - Crunch Example
  - When to Use Crunch
- Cascading
  - Cascading Example
  - When to Use Cascading
- Hive
  - Hive Overview
  - Example of Hive Code
  - When to Use Hive
- Impala
  - Impala Overview
  - Speed-Oriented Design
    - Efficient use of memory
    - Long running daemons
    - Efficient execution engine
    - Use of LLVM
  - Impala Example
  - When to Use Impala
- Conclusion
4. Common Hadoop Processing Patterns
- Pattern: Removing Duplicate Records by Primary Key
  - Data Generation for Deduplication Example
  - Code Example: Spark Deduplication in Scala
  - Code Example: Deduplication in SQL
- Pattern: Windowing Analysis
  - Data Generation for Windowing Analysis Example
  - Code Example: Peaks and Valleys in Spark
  - Code Example: Peaks and Valleys in SQL
- Pattern: Time Series Modifications
  - Use HBase and Versioning
  - Use HBase with a RowKey of RecordKey and StartTime
  - Use HDFS and Rewrite the Whole Table
  - Use Partitions on HDFS for Current and Historical Records
  - Data Generation for Time Series Example
  - Code Example: Time Series in Spark
  - Code Example: Time Series in SQL
- Conclusion
5. Graph Processing on Hadoop
- What Is a Graph?
- What Is Graph Processing?
- How Do You Process a Graph in a Distributed System?
  - The Bulk Synchronous Parallel Model
  - BSP by Example
- Giraph
  - Read and Partition the Data
  - Batch Process the Graph with BSP
  - Write the Graph Back to Disk
  - Putting It All Together
  - When Should You Use Giraph?
- GraphX
  - Just Another RDD
  - GraphX Pregel Interface
  - vprog()
  - sendMessage()
  - mergeMessage()
- Which Tool to Use?
- Conclusion
6. Orchestration
- Why We Need Workflow Orchestration
- The Limits of Scripting
- The Enterprise Job Scheduler and Hadoop
- Orchestration Frameworks in the Hadoop Ecosystem
- Oozie Terminology
- Oozie Overview
- Oozie Workflow
- Workflow Patterns
  - Point-to-Point Workflow
  - Fan-Out Workflow
  - Capture-and-Decide Workflow
- Parameterizing Workflows
- Classpath Definition
- Scheduling Patterns
  - Frequency Scheduling
  - Time and Data Triggers
- Executing Workflows
- Conclusion
7. Near-Real-Time Processing with Hadoop
- Stream Processing
- Apache Storm
  - Storm High-Level Architecture
  - Storm Topologies
  - Tuples and Streams
  - Spouts and Bolts
  - Stream Groupings
  - Reliability of Storm Applications
  - Exactly-Once Processing
  - Fault Tolerance
  - Integrating Storm with HDFS
  - Integrating Storm with HBase
  - Storm Example: Simple Moving Average
  - Evaluating Storm
    - Support for aggregation and windowing
    - Enrichment and alerting
    - Lambda Architecture
- Trident
  - Trident Example: Simple Moving Average
  - Evaluating Trident
    - Support for counting and windowing
    - Enrichment and alerting
    - Lambda Architecture
- Spark Streaming
  - Overview of Spark Streaming
  - Spark Streaming Example: Simple Count
  - Spark Streaming Example: Multiple Inputs
  - Spark Streaming Example: Maintaining State
  - Spark Streaming Example: Windowing
  - Spark Streaming Example: Streaming versus ETL Code
  - Evaluating Spark Streaming
    - Support for counting and windowing
    - Enrichment and alerting
    - Lambda Architecture
- Flume Interceptors
- Which Tool to Use?
  - Low-Latency Enrichment, Validation, Alerting, and Ingestion
    - Solution One: Flume
    - Solution Two: Kafka and Storm
  - NRT Counting, Rolling Averages, and Iterative Processing
  - Complex Data Pipelines
- Conclusion
II. Case Studies
8. Clickstream Analysis
- Defining the Use Case
- Using Hadoop for Clickstream Analysis
- Design Overview
- Storage
- Ingestion
  - The Client Tier
  - The Collector Tier
- Processing
  - Data Deduplication
    - Deduplication in Hive
    - Deduplication in Pig
  - Sessionization
    - Sessionization in Spark
    - Sessionization in MapReduce
    - Sessionization in Pig
    - Sessionization in Hive
- Analyzing
- Orchestration
- Conclusion
9. Fraud Detection
- Continuous Improvement
- Taking Action
- Architectural Requirements of Fraud Detection Systems
- Introducing Our Use Case
- High-Level Design
- Client Architecture
- Profile Storage and Retrieval
  - Caching
    - Distributed memory caching
    - HBase with BlockCache
  - HBase Data Definition
    - Columns (combined or atomic)
    - Event counting using HBase increment or put
    - Event history using HBase put
  - Delivering Transaction Status: Approved or Denied?
- Ingest
  - Path Between the Client and Flume
    - Client push
    - Logfile pull
    - Message queue or Kafka in the middle
- Near-Real-Time and Exploratory Analytics
- Near-Real-Time Processing
- Exploratory Analytics
- What About Other Architectures?
  - Flume Interceptors
  - Kafka to Storm or Spark Streaming
  - External Business Rules Engine
- Conclusion
10. Data Warehouse
- Using Hadoop for Data Warehousing
- Defining the Use Case
- OLTP Schema
- Data Warehouse: Introduction and Terminology
- Data Warehousing with Hadoop
- High-Level Design
  - Data Modeling and Storage
    - Choosing a storage engine
    - Denormalizing
    - Tracking updates in Hadoop
    - Selecting storage format and compression
    - Partitioning
  - Ingestion
  - Data Processing and Access
    - Partitioning
    - Merge/update
  - Aggregations
  - Data Export
  - Orchestration
- Conclusion
A. Joins in Impala
- Broadcast Joins
- Partitioned Hash Join
Index