Hadoop Application Architectures
- Authors:
- Mark Grover, Ted Malaska, Jonathan Seidman
- Pages:
- 400
- Available formats:
- ePub, Mobi
Ebook description: Hadoop Application Architectures
Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through the architectural considerations necessary to tie those components together into a complete, tailored application based on your particular use case.
To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing
Ebook details
- Ebook ISBN:
- 978-14-919-0005-5, 9781491900055
- Ebook release date:
- 2015-06-30. The ebook release date is often the day the title went on sale and may differ from the release date of the print book. Additional information can be found in the free sample. If in doubt, contact us at sklep@ebookpoint.pl.
- Publication language:
- English
- ePub file size:
- 6.0 MB
- Mobi file size:
- 6.0 MB
Ebook table of contents
- Foreword
- Preface
- A Note About the Code Examples
- Who Should Read This Book
- Why We Wrote This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgments
- Mark Grover’s Acknowledgements
- Ted Malaska’s Acknowledgements
- Jonathan Seidman’s Acknowledgements
- Gwen Shapira’s Acknowledgements
- I. Architectural Considerations for Hadoop Applications
- 1. Data Modeling in Hadoop
- Data Storage Options
- Standard File Formats
- Text data
- Structured text data
- Binary data
- Hadoop File Types
- File-based data structures
- Serialization Formats
- Thrift
- Protocol Buffers
- Avro
- Columnar Formats
- RCFile
- ORC
- Parquet
- Avro and Parquet
- Compression
- Snappy
- LZO
- Gzip
- bzip2
- Compression recommendations
- HDFS Schema Design
- Location of HDFS Files
- Advanced HDFS Schema Design
- Partitioning
- Bucketing
- Denormalizing
- HDFS Schema Design Summary
- HBase Schema Design
- Row Key
- Record retrieval
- Distribution
- Block cache
- Ability to scan
- Size
- Readability
- Uniqueness
- Timestamp
- Hops
- Tables and Regions
- Put performance
- Compaction time
- Using Columns
- Using Column Families
- Time-to-Live
- Managing Metadata
- What Is Metadata?
- Why Care About Metadata?
- Where to Store Metadata?
- Examples of Managing Metadata
- Limitations of the Hive Metastore and HCatalog
- Other Ways of Storing Metadata
- Embedding metadata in file paths and names
- Storing the metadata in HDFS
- Conclusion
- 2. Data Movement
- Data Ingestion Considerations
- Timeliness of Data Ingestion
- Incremental Updates
- Access Patterns
- Original Source System and Data Structure
- Read speed of the devices on source systems
- Original file type
- Compression
- Relational database management systems
- Streaming data
- Logfiles
- Transformations
- Interceptors
- Selectors
- Network Bottlenecks
- Network Security
- Push or Pull
- Sqoop
- Flume
- Failure Handling
- Level of Complexity
- Data Ingestion Options
- File Transfers
- HDFS client commands
- Mountable HDFS
- Considerations for File Transfers versus Other Ingest Methods
- Sqoop: Batch Transfer Between Hadoop and Relational Databases
- Choosing a split-by column
- Using database-specific connectors whenever available
- Using the Goldilocks method of Sqoop performance tuning
- Loading many tables in parallel with fair scheduler throttling
- Diagnosing bottlenecks
- Keeping Hadoop updated
- Flume: Event-Based Data Collection and Processing
- Flume architecture
- Flume patterns
- File formats
- Recommendations
- Flume sources
- Flume sinks
- Flume interceptors
- Flume memory channels
- Flume file channels
- Sizing Channels
- Finding Flume bottlenecks
- Kafka
- Kafka fault tolerance
- Kafka and Hadoop
- Data Extraction
- Conclusion
- 3. Processing Data in Hadoop
- MapReduce
- MapReduce Overview
- Map phase
- InputFormat
- RecordReader
- Mapper.setup()
- Mapper.map
- Partitioner
- Mapper.cleanup()
- Combiner
- Reducer
- Shuffle
- Reducer.setup()
- Reducer.reduce()
- Reducer.cleanup()
- OutputFormat
- Example for MapReduce
- When to Use MapReduce
- Spark
- Spark Overview
- DAG Model
- Overview of Spark Components
- Basic Spark Concepts
- Resilient Distributed Datasets
- Shared variables
- SparkContext
- Transformations
- Action
- Benefits of Using Spark
- Simplicity
- Versatility
- Reduced disk I/O
- Storage
- Multilanguage
- Resource manager independence
- Interactive shell (REPL)
- Spark Example
- When to Use Spark
- Abstractions
- Pig
- Pig Example
- When to Use Pig
- Crunch
- Crunch Example
- When to Use Crunch
- Cascading
- Cascading Example
- When to Use Cascading
- Hive
- Hive Overview
- Example of Hive Code
- When to Use Hive
- Impala
- Impala Overview
- Speed-Oriented Design
- Efficient use of memory
- Long running daemons
- Efficient execution engine
- Use of LLVM
- Impala Example
- When to Use Impala
- Conclusion
- 4. Common Hadoop Processing Patterns
- Pattern: Removing Duplicate Records by Primary Key
- Data Generation for Deduplication Example
- Code Example: Spark Deduplication in Scala
- Code Example: Deduplication in SQL
- Pattern: Windowing Analysis
- Data Generation for Windowing Analysis Example
- Code Example: Peaks and Valleys in Spark
- Code Example: Peaks and Valleys in SQL
- Pattern: Time Series Modifications
- Use HBase and Versioning
- Use HBase with a RowKey of RecordKey and StartTime
- Use HDFS and Rewrite the Whole Table
- Use Partitions on HDFS for Current and Historical Records
- Data Generation for Time Series Example
- Code Example: Time Series in Spark
- Code Example: Time Series in SQL
- Conclusion
- 5. Graph Processing on Hadoop
- What Is a Graph?
- What Is Graph Processing?
- How Do You Process a Graph in a Distributed System?
- The Bulk Synchronous Parallel Model
- BSP by Example
- Giraph
- Read and Partition the Data
- Batch Process the Graph with BSP
- Write the Graph Back to Disk
- Putting It All Together
- When Should You Use Giraph?
- GraphX
- Just Another RDD
- GraphX Pregel Interface
- vprog()
- sendMessage()
- mergeMessage()
- Which Tool to Use?
- Conclusion
- 6. Orchestration
- Why We Need Workflow Orchestration
- The Limits of Scripting
- The Enterprise Job Scheduler and Hadoop
- Orchestration Frameworks in the Hadoop Ecosystem
- Oozie Terminology
- Oozie Overview
- Oozie Workflow
- Workflow Patterns
- Point-to-Point Workflow
- Fan-Out Workflow
- Capture-and-Decide Workflow
- Parameterizing Workflows
- Classpath Definition
- Scheduling Patterns
- Frequency Scheduling
- Time and Data Triggers
- Executing Workflows
- Conclusion
- 7. Near-Real-Time Processing with Hadoop
- Stream Processing
- Apache Storm
- Storm High-Level Architecture
- Storm Topologies
- Tuples and Streams
- Spouts and Bolts
- Stream Groupings
- Reliability of Storm Applications
- Exactly-Once Processing
- Fault Tolerance
- Integrating Storm with HDFS
- Integrating Storm with HBase
- Storm Example: Simple Moving Average
- Evaluating Storm
- Support for aggregation and windowing
- Enrichment and alerting
- Lambda Architecture
- Trident
- Trident Example: Simple Moving Average
- Evaluating Trident
- Support for counting and windowing
- Enrichment and alerting
- Lambda Architecture
- Spark Streaming
- Overview of Spark Streaming
- Spark Streaming Example: Simple Count
- Spark Streaming Example: Multiple Inputs
- Spark Streaming Example: Maintaining State
- Spark Streaming Example: Windowing
- Spark Streaming Example: Streaming versus ETL Code
- Evaluating Spark Streaming
- Support for counting and windowing
- Enrichment and alerting
- Lambda Architecture
- Flume Interceptors
- Which Tool to Use?
- Low-Latency Enrichment, Validation, Alerting, and Ingestion
- Solution One: Flume
- Solution Two: Kafka and Storm
- NRT Counting, Rolling Averages, and Iterative Processing
- Complex Data Pipelines
- Conclusion
- II. Case Studies
- 8. Clickstream Analysis
- Defining the Use Case
- Using Hadoop for Clickstream Analysis
- Design Overview
- Storage
- Ingestion
- The Client Tier
- The Collector Tier
- Processing
- Data Deduplication
- Deduplication in Hive
- Deduplication in Pig
- Sessionization
- Sessionization in Spark
- Sessionization in MapReduce
- Sessionization in Pig
- Sessionization in Hive
- Analyzing
- Orchestration
- Conclusion
- 9. Fraud Detection
- Continuous Improvement
- Taking Action
- Architectural Requirements of Fraud Detection Systems
- Introducing Our Use Case
- High-Level Design
- Client Architecture
- Profile Storage and Retrieval
- Caching
- Distributed memory caching
- HBase with BlockCache
- HBase Data Definition
- Columns (combined or atomic)
- Event counting using HBase increment or put
- Event history using HBase put
- Delivering Transaction Status: Approved or Denied?
- Ingest
- Path Between the Client and Flume
- Client push
- Logfile pull
- Message queue or Kafka in the middle
- Near-Real-Time and Exploratory Analytics
- Near-Real-Time Processing
- Exploratory Analytics
- What About Other Architectures?
- Flume Interceptors
- Kafka to Storm or Spark Streaming
- External Business Rules Engine
- Conclusion
- 10. Data Warehouse
- Using Hadoop for Data Warehousing
- Defining the Use Case
- OLTP Schema
- Data Warehouse: Introduction and Terminology
- Data Warehousing with Hadoop
- High-Level Design
- Data Modeling and Storage
- Choosing a storage engine
- Denormalizing
- Tracking updates in Hadoop
- Selecting storage format and compression
- Partitioning
- Ingestion
- Data Processing and Access
- Partitioning
- Merge/update
- Aggregations
- Data Export
- Orchestration
- Conclusion
- A. Joins in Impala
- Broadcast Joins
- Partitioned Hash Join
- Index