Learning Spark. 2nd Edition
- Authors:
- Jules S. Damji, Brooke Wenig, Tathagata Das
- Pages:
- 400
- Available formats:
- ePub, Mobi
Ebook description: Learning Spark. 2nd Edition
Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.
Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:
- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and SQL Engine
- Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow
About the authors
Jules S. Damji is a software engineer who has worked for many leading companies, such as Netscape, Sun Microsystems, Verisign, and ProQuest. He works on distributed systems.
Brooke Wenig leads a team that develops machine learning pipelines. She also teaches courses on distributed machine learning.
Tathagata Das is a member of the Apache Spark Project Management Committee. He works on Structured Streaming and Delta Lake.
Polish edition:
Spark. Błyskawiczna analiza danych. Wydanie II
- Authors:
- Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
- Price: 44.50 zł (list price 89.00 zł; lowest price in the last 30 days: 44.50 zł)
You can read the ebook "Learning Spark. 2nd Edition" on:
- e-readers: Inkbook, Kindle, Pocketbook, Onyx Boox, and others
- Windows, macOS, and other systems
- Windows, Android, iOS, and HarmonyOS systems
- any device or application that supports the PDF, EPub, or Mobi formats
Ebook details
- Ebook ISBN:
- 978-14-920-4999-9, 9781492049999
- Ebook release date:
- 2020-07-16. The ebook release date is often the day the title went on sale and may differ from the publication date of the print book. Additional information can be found in the free sample. If in doubt, contact us at sklep@ebookpoint.pl.
- Publication language:
- English
- ePub file size:
- 13.7 MB
- Mobi file size:
- 35.2 MB
Table of contents
- Foreword
- Preface
- Who This Book Is For
- How the Book Is Organized
- How to Use the Code Examples
- Software and Configuration Used
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introduction to Apache Spark: A Unified Analytics Engine
- The Genesis of Spark
- Big Data and Distributed Computing at Google
- Hadoop at Yahoo!
- Spark's Early Years at AMPLab
- What Is Apache Spark?
- Speed
- Ease of Use
- Modularity
- Extensibility
- Unified Analytics
- Apache Spark Components as a Unified Stack
- Spark SQL
- Spark MLlib
- Spark Structured Streaming
- GraphX
- Apache Spark's Distributed Execution
- Spark driver
- SparkSession
- Cluster manager
- Spark executor
- Deployment modes
- Distributed data and partitions
- The Developer's Experience
- Who Uses Spark, and for What?
- Data science tasks
- Data engineering tasks
- Popular Spark use cases
- Community Adoption and Expansion
- 2. Downloading Apache Spark and Getting Started
- Step 1: Downloading Apache Spark
- Spark's Directories and Files
- Step 2: Using the Scala or PySpark Shell
- Using the Local Machine
- Step 3: Understanding Spark Application Concepts
- Spark Application and SparkSession
- Spark Jobs
- Spark Stages
- Spark Tasks
- Transformations, Actions, and Lazy Evaluation
- Narrow and Wide Transformations
- The Spark UI
- Your First Standalone Application
- Counting M&Ms for the Cookie Monster
- Building Standalone Applications in Scala
- Summary
- 3. Apache Spark's Structured APIs
- Spark: What's Underneath an RDD?
- Structuring Spark
- Key Merits and Benefits
- The DataFrame API
- Spark's Basic Data Types
- Spark's Structured and Complex Data Types
- Schemas and Creating DataFrames
- Two ways to define a schema
- Columns and Expressions
- Rows
- Common DataFrame Operations
- Using DataFrameReader and DataFrameWriter
- Saving a DataFrame as a Parquet file or SQL table
- Transformations and actions
- Projections and filters
- Renaming, adding, and dropping columns
- Aggregations
- Other common DataFrame operations
- End-to-End DataFrame Example
- The Dataset API
- Typed Objects, Untyped Objects, and Generic Rows
- Creating Datasets
- Scala: Case classes
- Dataset Operations
- End-to-End Dataset Example
- DataFrames Versus Datasets
- When to Use RDDs
- Spark SQL and the Underlying Engine
- The Catalyst Optimizer
- Phase 1: Analysis
- Phase 2: Logical optimization
- Phase 3: Physical planning
- Phase 4: Code generation
- Summary
- 4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
- Using Spark SQL in Spark Applications
- Basic Query Examples
- SQL Tables and Views
- Managed Versus Unmanaged Tables
- Creating SQL Databases and Tables
- Creating a managed table
- Creating an unmanaged table
- Creating Views
- Temporary views versus global temporary views
- Viewing the Metadata
- Caching SQL Tables
- Reading Tables into DataFrames
- Data Sources for DataFrames and SQL Tables
- DataFrameReader
- DataFrameWriter
- Parquet
- Reading Parquet files into a DataFrame
- Reading Parquet files into a Spark SQL table
- Writing DataFrames to Parquet files
- Writing DataFrames to Spark SQL tables
- JSON
- Reading a JSON file into a DataFrame
- Reading a JSON file into a Spark SQL table
- Writing DataFrames to JSON files
- JSON data source options
- CSV
- Reading a CSV file into a DataFrame
- Reading a CSV file into a Spark SQL table
- Writing DataFrames to CSV files
- CSV data source options
- Avro
- Reading an Avro file into a DataFrame
- Reading an Avro file into a Spark SQL table
- Writing DataFrames to Avro files
- Avro data source options
- ORC
- Reading an ORC file into a DataFrame
- Reading an ORC file into a Spark SQL table
- Writing DataFrames to ORC files
- Images
- Reading an image file into a DataFrame
- Binary Files
- Reading a binary file into a DataFrame
- Summary
- 5. Spark SQL and DataFrames: Interacting with External Data Sources
- Spark SQL and Apache Hive
- User-Defined Functions
- Spark SQL UDFs
- Evaluation order and null checking in Spark SQL
- Speeding up and distributing PySpark UDFs with Pandas UDFs
- Querying with the Spark SQL Shell, Beeline, and Tableau
- Using the Spark SQL Shell
- Create a table
- Insert data into the table
- Running a Spark SQL query
- Working with Beeline
- Start the Thrift server
- Connect to the Thrift server via Beeline
- Execute a Spark SQL query with Beeline
- Stop the Thrift server
- Working with Tableau
- Start the Thrift server
- Start Tableau
- Stop the Thrift server
- External Data Sources
- JDBC and SQL Databases
- The importance of partitioning
- PostgreSQL
- MySQL
- Azure Cosmos DB
- MS SQL Server
- Other External Sources
- Higher-Order Functions in DataFrames and Spark SQL
- Option 1: Explode and Collect
- Option 2: User-Defined Function
- Built-in Functions for Complex Data Types
- Higher-Order Functions
- transform()
- filter()
- exists()
- reduce()
- Common DataFrames and Spark SQL Operations
- Unions
- Joins
- Windowing
- Modifications
- Adding new columns
- Dropping columns
- Renaming columns
- Pivoting
- Summary
- 6. Spark SQL and Datasets
- Single API for Java and Scala
- Scala Case Classes and JavaBeans for Datasets
- Working with Datasets
- Creating Sample Data
- Transforming Sample Data
- Higher-order functions and functional programming
- Converting DataFrames to Datasets
- Memory Management for Datasets and DataFrames
- Dataset Encoders
- Spark's Internal Format Versus Java Object Format
- Serialization and Deserialization (SerDe)
- Costs of Using Datasets
- Strategies to Mitigate Costs
- Summary
- 7. Optimizing and Tuning Spark Applications
- Optimizing and Tuning Spark for Efficiency
- Viewing and Setting Apache Spark Configurations
- Scaling Spark for Large Workloads
- Static versus dynamic resource allocation
- Configuring Spark executors' memory and the shuffle service
- Maximizing Spark parallelism
- How partitions are created
- Caching and Persistence of Data
- DataFrame.cache()
- DataFrame.persist()
- When to Cache and Persist
- When Not to Cache and Persist
- A Family of Spark Joins
- Broadcast Hash Join
- When to use a broadcast hash join
- Shuffle Sort Merge Join
- Optimizing the shuffle sort merge join
- When to use a shuffle sort merge join
- Inspecting the Spark UI
- Journey Through the Spark UI Tabs
- Jobs and Stages
- Executors
- Storage
- SQL
- Environment
- Debugging Spark applications
- Summary
- 8. Structured Streaming
- Evolution of the Apache Spark Stream Processing Engine
- The Advent of Micro-Batch Stream Processing
- Lessons Learned from Spark Streaming (DStreams)
- The Philosophy of Structured Streaming
- The Programming Model of Structured Streaming
- The Fundamentals of a Structured Streaming Query
- Five Steps to Define a Streaming Query
- Step 1: Define input sources
- Step 2: Transform data
- Step 3: Define output sink and output mode
- Step 4: Specify processing details
- Step 5: Start the query
- Putting it all together
- Under the Hood of an Active Streaming Query
- Recovering from Failures with Exactly-Once Guarantees
- Monitoring an Active Query
- Querying current status using StreamingQuery
- Get current metrics using StreamingQuery
- Get current status using StreamingQuery.status()
- Publishing metrics using Dropwizard Metrics
- Publishing metrics using custom StreamingQueryListeners
- Streaming Data Sources and Sinks
- Files
- Reading from files
- Writing to files
- Apache Kafka
- Reading from Kafka
- Writing to Kafka
- Custom Streaming Sources and Sinks
- Writing to any storage system
- Using foreachBatch()
- Using foreach()
- Reading from any storage system
- Data Transformations
- Incremental Execution and Streaming State
- Stateless Transformations
- Stateful Transformations
- Distributed and fault-tolerant state management
- Types of stateful operations
- Stateful Streaming Aggregations
- Aggregations Not Based on Time
- Aggregations with Event-Time Windows
- Handling late data with watermarks
- Semantic guarantees with watermarks
- Supported output modes
- Streaming Joins
- Stream-Static Joins
- Stream-Stream Joins
- Inner joins with optional watermarking
- Outer joins with watermarking
- Arbitrary Stateful Computations
- Modeling Arbitrary Stateful Operations with mapGroupsWithState()
- Using Timeouts to Manage Inactive Groups
- Processing-time timeouts
- Event-time timeouts
- Generalization with flatMapGroupsWithState()
- Performance Tuning
- Summary
- 9. Building Reliable Data Lakes with Apache Spark
- The Importance of an Optimal Storage Solution
- Databases
- A Brief Introduction to Databases
- Reading from and Writing to Databases Using Apache Spark
- Limitations of Databases
- Data Lakes
- A Brief Introduction to Data Lakes
- Reading from and Writing to Data Lakes using Apache Spark
- Limitations of Data Lakes
- Lakehouses: The Next Step in the Evolution of Storage Solutions
- Apache Hudi
- Apache Iceberg
- Delta Lake
- Building Lakehouses with Apache Spark and Delta Lake
- Configuring Apache Spark with Delta Lake
- Loading Data into a Delta Lake Table
- Loading Data Streams into a Delta Lake Table
- Enforcing Schema on Write to Prevent Data Corruption
- Evolving Schemas to Accommodate Changing Data
- Transforming Existing Data
- Updating data to fix errors
- Deleting user-related data
- Upserting change data to a table using merge()
- Deduplicating data while inserting using insert-only merge
- Auditing Data Changes with Operation History
- Querying Previous Snapshots of a Table with Time Travel
- Summary
- 10. Machine Learning with MLlib
- What Is Machine Learning?
- Supervised Learning
- Unsupervised Learning
- Why Spark for Machine Learning?
- Designing Machine Learning Pipelines
- Data Ingestion and Exploration
- Creating Training and Test Data Sets
- Preparing Features with Transformers
- Understanding Linear Regression
- Using Estimators to Build Models
- Creating a Pipeline
- One-hot encoding
- Evaluating Models
- RMSE
- Interpreting the value of RMSE
- R²
- Saving and Loading Models
- Hyperparameter Tuning
- Tree-Based Models
- Decision trees
- Random forests
- k-Fold Cross-Validation
- Optimizing Pipelines
- Summary
- 11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
- Model Management
- MLflow
- Tracking
- Model Deployment Options with MLlib
- Batch
- Streaming
- Model Export Patterns for Real-Time Inference
- Leveraging Spark for Non-MLlib Models
- Pandas UDFs
- Spark for Distributed Hyperparameter Tuning
- Joblib
- Hyperopt
- Summary
- 12. Epilogue: Apache Spark 3.0
- Spark Core and Spark SQL
- Dynamic Partition Pruning
- Adaptive Query Execution
- The AQE framework
- SQL Join Hints
- Shuffle sort merge join (SMJ)
- Broadcast hash join (BHJ)
- Shuffle hash join (SHJ)
- Shuffle-and-replicate nested loop join (SNLJ)
- Catalog Plugin API and DataSourceV2
- Accelerator-Aware Scheduler
- Structured Streaming
- PySpark, Pandas UDFs, and Pandas Function APIs
- Redesigned Pandas UDFs with Python Type Hints
- Iterator Support in Pandas UDFs
- New Pandas Function APIs
- Changed Functionality
- Languages Supported and Deprecated
- Changes to the DataFrame and Dataset APIs
- DataFrame and SQL Explain Commands
- Summary
- Index