High Performance Spark. Best Practices for Scaling and Optimizing Apache Spark
- Authors:
- Holden Karau, Rachel Warren
- Pages:
- 358
- Available formats:
- ePub, Mobi
Ebook description: High Performance Spark. Best Practices for Scaling and Optimizing Apache Spark
Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages
Ebook details
- Ebook ISBN:
- 978-14-919-4315-1, 9781491943151
- Ebook release date:
- 2017-05-25 (The ebook release date is often the day the title went on sale and may differ from the release date of the print edition.)
- Publication language:
- English
- ePub file size:
- 4.1MB
- Mobi file size:
- 10.0MB
Table of contents
- Preface
- First Edition Notes
- Supporting Books and Materials
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact the Authors
- How to Contact Us
- Acknowledgments
- 1. Introduction to High Performance Spark
- What Is Spark and Why Performance Matters
- What You Can Expect to Get from This Book
- Spark Versions
- Why Scala?
- To Be a Spark Expert You Have to Learn a Little Scala Anyway
- The Spark Scala API Is Easier to Use Than the Java API
- Scala Is More Performant Than Python
- Why Not Scala?
- Learning Scala
- Conclusion
- 2. How Spark Works
- How Spark Fits into the Big Data Ecosystem
- Spark Components
- Spark Model of Parallel Computing: RDDs
- Lazy Evaluation
- Performance and usability advantages of lazy evaluation
- Lazy evaluation and fault tolerance
- Lazy evaluation and debugging
- In-Memory Persistence and Memory Management
- Immutability and the RDD Interface
- Types of RDDs
- Functions on RDDs: Transformations Versus Actions
- Wide Versus Narrow Dependencies
- Spark Job Scheduling
- Resource Allocation Across Applications
- The Spark Application
- Default Spark Scheduler
- The Anatomy of a Spark Job
- The DAG
- Jobs
- Stages
- Tasks
- Conclusion
- 3. DataFrames, Datasets, and Spark SQL
- Getting Started with the SparkSession (or HiveContext or SQLContext)
- Spark SQL Dependencies
- Managing Spark Dependencies
- Avoiding Hive JARs
- Basics of Schemas
- DataFrame API
- Transformations
- Simple DataFrame transformations and SQL expressions
- Specialized DataFrame transformations for missing and noisy data
- Beyond row-by-row transformations
- Aggregates and groupBy
- Windowing
- Sorting
- Multi-DataFrame Transformations
- Set-like operations
- Plain Old SQL Queries and Interacting with Hive Data
- Data Representation in DataFrames and Datasets
- Tungsten
- Data Loading and Saving Functions
- DataFrameWriter and DataFrameReader
- Formats
- JSON
- JDBC
- Parquet
- Hive tables
- RDDs
- Local collections
- Additional formats
- Save Modes
- Partitions (Discovery and Writing)
- Datasets
- Interoperability with RDDs, DataFrames, and Local Collections
- Compile-Time Strong Typing
- Easier Functional (RDD like) Transformations
- Relational Transformations
- Multi-Dataset Relational Transformations
- Grouped Operations on Datasets
- Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- Query Optimizer
- Logical and Physical Plans
- Code Generation
- Large Query Plans and Iterative Algorithms
- Debugging Spark SQL Queries
- JDBC/ODBC Server
- Conclusion
- 4. Joins (SQL and Core)
- Core Spark Joins
- Choosing a Join Type
- Choosing an Execution Plan
- Speeding up joins by assigning a known partitioner
- Speeding up joins using a broadcast hash join
- Partial manual broadcast hash join
- Spark SQL Joins
- DataFrame Joins
- Self joins
- Broadcast hash joins
- Dataset Joins
- Conclusion
- 5. Effective Transformations
- Narrow Versus Wide Transformations
- Implications for Performance
- Implications for Fault Tolerance
- The Special Case of coalesce
- What Type of RDD Does Your Transformation Return?
- Minimizing Object Creation
- Reusing Existing Objects
- Using Smaller Data Structures
- Iterator-to-Iterator Transformations with mapPartitions
- What Is an Iterator-to-Iterator Transformation?
- Space and Time Advantages
- An Example
- Set Operations
- Reducing Setup Overhead
- Shared Variables
- Broadcast Variables
- Accumulators
- Reusing RDDs
- Cases for Reuse
- Iterative computations
- Multiple actions on the same RDD
- If the cost to compute each partition is very high
- Deciding if Recompute Is Inexpensive Enough
- Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
- Persist and cache
- Checkpointing
- Checkpointing example
- Alluxio (nee Tachyon)
- LRU Caching
- Shuffle files
- Noisy Cluster Considerations
- Interaction with Accumulators
- Conclusion
- 6. Working with Key/Value Data
- The Goldilocks Example
- Goldilocks Version 0: Iterative Solution
- How to Use PairRDDFunctions and OrderedRDDFunctions
- Actions on Key/Value Pairs
- What's So Dangerous About the groupByKey Function
- Goldilocks Version 1: groupByKey Solution
- Why GroupByKey fails
- Choosing an Aggregation Operation
- Dictionary of Aggregation Operations with Performance Considerations
- Preventing out-of-memory errors with aggregation operations
- Multiple RDD Operations
- Co-Grouping
- Partitioners and Key/Value Data
- Using the Spark Partitioner Object
- Hash Partitioning
- Range Partitioning
- Custom Partitioning
- Preserving Partitioning Information Across Transformations
- Using narrow transformations that preserve partitioning
- Leveraging Co-Located and Co-Partitioned RDDs
- Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Dictionary of OrderedRDDOperations
- Sorting by Two Keys with SortByKey
- Secondary Sort and repartitionAndSortWithinPartitions
- Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
- How Not to Sort by Two Orderings
- Goldilocks Version 2: Secondary Sort
- Defining the custom partitioner
- Filtering on each partition
- Combine the elements associated with one key
- Performance
- A Different Approach to Goldilocks
- Map to (cell value, column index) pairs
- Sort and count values on each partition
- Determine location of rank statistics on each partition
- Filter for rank statistics
- Goldilocks Version 3: Sort on Cell Values
- Straggler Detection and Unbalanced Data
- Back to Goldilocks (Again)
- Goldilocks Version 4: Reduce to Distinct on Each Partition
- Aggregate to ((cell value, column index), count) on each partition
- Sort and find rank statistics
- Goldilocks postmortem
- Conclusion
- 7. Going Beyond Scala
- Beyond Scala within the JVM
- Beyond Scala, and Beyond the JVM
- How PySpark Works
- PySpark RDDs
- PySpark DataFrames and Datasets
- Accessing the backing Java objects and mixing Scala code
- PySpark dependency management
- Installing PySpark
- How SparkR Works
- Spark.jl (Julia Spark)
- How Eclair JS Works
- Spark on the Common Language Runtime (CLR): C# and Friends
- Calling Other Languages from Spark
- Using Pipe and Friends
- JNI
- Java Native Access (JNA)
- Underneath Everything Is FORTRAN
- Getting to the GPU
- The Future
- Conclusion
- 8. Testing and Validation
- Unit Testing
- General Spark Unit Testing
- Factoring your code for testability
- Regular Spark jobs (testing with RDDs)
- Streaming
- Mocking RDDs
- Testing DataFrames
- Getting Test Data
- Generating Large Datasets
- Sampling
- Property Checking with ScalaCheck
- Computing RDD Difference
- Integration Testing
- Choosing Your Integration Testing Environment
- Local mode
- Docker-based
- Yarn MiniCluster
- Verifying Performance
- Spark Counters for Verifying Performance
- Projects for Verifying Performance
- Job Validation
- Conclusion
- 9. Spark MLlib and ML
- Choosing Between Spark MLlib and Spark ML
- Working with MLlib
- Getting Started with MLlib (Organization and Imports)
- MLlib Feature Encoding and Data Preparation
- Working with Spark vectors
- Preparing textual data
- Preparing data for supervised learning
- Feature Scaling and Selection
- MLlib Model Training
- Predicting
- Serving and Persistence
- Saveable (internal format)
- PMML
- Custom
- Model Evaluation
- Working with Spark ML
- Spark ML Organization and Imports
- Pipeline Stages
- Explain Params
- Data Encoding
- Data Cleaning
- Spark ML Models
- Putting It All Together in a Pipeline
- Training a Pipeline
- Accessing Individual Stages
- Data Persistence and Spark ML
- Automated model selection (parameter search)
- Extending Spark ML Pipelines with Your Own Algorithms
- Custom transformers
- Custom estimators
- Model and Pipeline Persistence and Serving with Spark ML
- General Serving Considerations
- Conclusion
- 10. Spark Components and Packages
- Stream Processing with Spark
- Sources and Sinks
- Receivers
- Repartitioning
- Batch Intervals
- Data Checkpoint Intervals
- Considerations for DStreams
- Output operations
- Considerations for Structured Streaming
- Data sources
- Output operations
- Custom sinks
- Machine learning with Structured Streaming
- Stream status and debugging
- High Availability Mode (or Handling Driver Failure or Checkpointing)
- GraphX
- Using Community Packages and Libraries
- Creating a Spark Package
- Conclusion
- A. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
- Spark Tuning and Cluster Sizing
- How to Adjust Spark Settings
- How to Determine the Relevant Information About Your Cluster
- Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
- Calculating Executor and Driver Memory Overhead
- How Large to Make the Spark Driver
- A Few Large Executors or Many Small Executors?
- Many small executors
- Many large executors
- Allocating Cluster Resources and Dynamic Allocation
- Restrictions on dynamic allocation
- Dividing the Space Within One Executor
- Number and Size of Partitions
- Serialization Options
- Kryo
- Spark settings conclusion
- Some Additional Debugging Techniques
- Out of Disk Space Errors
- Logging
- Configuring logging
- Accessing logs
- Attaching debuggers
- Debugging in notebooks
- Python debugging
- Debugging conclusion
- Index