Learning Spark. 2nd Edition Jules S. Damji, Brooke Wenig, Tathagata Das

Authors: Jules S. Damji, Brooke Wenig, Tathagata Das
Publisher: O'Reilly Media
Pages: 400
Available formats: ePub, Mobi (ebook)



Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and SQL Engine
- Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow

About the authors

Jules S. Damji is a software engineer who has worked for leading companies such as Netscape, Sun Microsystems, Verisign, and ProQuest. He works on distributed systems.

Brooke Wenig leads a team that develops machine learning pipelines. She also teaches courses on distributed machine learning.

Tathagata Das is a member of the Apache Spark Project Management Committee. He works on Structured Streaming and Delta Lake.

Book details

Ebook ISBN: 978-14-920-4999-9, 9781492049999
Ebook publication date: 2020-07-16 (this is often the date the title went on sale and may differ from the publication date of the print edition)
Language: English
ePub file size: 13.7MB
Mobi file size: 35.2MB

Table of contents

Foreword
Preface
- Who This Book Is For
- How the Book Is Organized
- How to Use the Code Examples
- Software and Configuration Used
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
- The Genesis of Spark
  - Big Data and Distributed Computing at Google
  - Hadoop at Yahoo!
  - Spark's Early Years at AMPLab
- What Is Apache Spark?
  - Speed
  - Ease of Use
  - Modularity
  - Extensibility
- Unified Analytics
  - Apache Spark Components as a Unified Stack
    - Spark SQL
    - Spark MLlib
    - Spark Structured Streaming
    - GraphX
  - Apache Spark's Distributed Execution
    - Spark driver
    - SparkSession
    - Cluster manager
    - Spark executor
    - Deployment modes
    - Distributed data and partitions
- The Developer's Experience
  - Who Uses Spark, and for What?
    - Data science tasks
    - Data engineering tasks
    - Popular Spark use cases
  - Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
- Step 1: Downloading Apache Spark
  - Spark's Directories and Files
- Step 2: Using the Scala or PySpark Shell
  - Using the Local Machine
- Step 3: Understanding Spark Application Concepts
  - Spark Application and SparkSession
  - Spark Jobs
  - Spark Stages
  - Spark Tasks
- Transformations, Actions, and Lazy Evaluation
  - Narrow and Wide Transformations
- The Spark UI
- Your First Standalone Application
  - Counting M&Ms for the Cookie Monster
  - Building Standalone Applications in Scala
- Summary
3. Apache Spark's Structured APIs
- Spark: What's Underneath an RDD?
- Structuring Spark
  - Key Merits and Benefits
- The DataFrame API
  - Spark's Basic Data Types
  - Spark's Structured and Complex Data Types
  - Schemas and Creating DataFrames
    - Two ways to define a schema
  - Columns and Expressions
  - Rows
  - Common DataFrame Operations
    - Using DataFrameReader and DataFrameWriter
      - Saving a DataFrame as a Parquet file or SQL table
    - Transformations and actions
      - Projections and filters
      - Renaming, adding, and dropping columns
      - Aggregations
      - Other common DataFrame operations
  - End-to-End DataFrame Example
- The Dataset API
  - Typed Objects, Untyped Objects, and Generic Rows
  - Creating Datasets
    - Scala: Case classes
  - Dataset Operations
  - End-to-End Dataset Example
- DataFrames Versus Datasets
  - When to Use RDDs
- Spark SQL and the Underlying Engine
  - The Catalyst Optimizer
    - Phase 1: Analysis
    - Phase 2: Logical optimization
    - Phase 3: Physical planning
    - Phase 4: Code generation
- Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
- Using Spark SQL in Spark Applications
  - Basic Query Examples
- SQL Tables and Views
  - Managed Versus Unmanaged Tables
  - Creating SQL Databases and Tables
    - Creating a managed table
    - Creating an unmanaged table
  - Creating Views
    - Temporary views versus global temporary views
  - Viewing the Metadata
  - Caching SQL Tables
  - Reading Tables into DataFrames
- Data Sources for DataFrames and SQL Tables
  - DataFrameReader
  - DataFrameWriter
  - Parquet
    - Reading Parquet files into a DataFrame
    - Reading Parquet files into a Spark SQL table
    - Writing DataFrames to Parquet files
    - Writing DataFrames to Spark SQL tables
  - JSON
    - Reading a JSON file into a DataFrame
    - Reading a JSON file into a Spark SQL table
    - Writing DataFrames to JSON files
    - JSON data source options
  - CSV
    - Reading a CSV file into a DataFrame
    - Reading a CSV file into a Spark SQL table
    - Writing DataFrames to CSV files
    - CSV data source options
  - Avro
    - Reading an Avro file into a DataFrame
    - Reading an Avro file into a Spark SQL table
    - Writing DataFrames to Avro files
    - Avro data source options
  - ORC
    - Reading an ORC file into a DataFrame
    - Reading an ORC file into a Spark SQL table
    - Writing DataFrames to ORC files
  - Images
    - Reading an image file into a DataFrame
  - Binary Files
    - Reading a binary file into a DataFrame
- Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
- Spark SQL and Apache Hive
  - User-Defined Functions
    - Spark SQL UDFs
    - Evaluation order and null checking in Spark SQL
    - Speeding up and distributing PySpark UDFs with Pandas UDFs
- Querying with the Spark SQL Shell, Beeline, and Tableau
  - Using the Spark SQL Shell
    - Create a table
    - Insert data into the table
    - Running a Spark SQL query
  - Working with Beeline
    - Start the Thrift server
    - Connect to the Thrift server via Beeline
    - Execute a Spark SQL query with Beeline
    - Stop the Thrift server
  - Working with Tableau
    - Start the Thrift server
    - Start Tableau
    - Stop the Thrift server
- External Data Sources
  - JDBC and SQL Databases
    - The importance of partitioning
  - PostgreSQL
  - MySQL
  - Azure Cosmos DB
  - MS SQL Server
  - Other External Sources
- Higher-Order Functions in DataFrames and Spark SQL
  - Option 1: Explode and Collect
  - Option 2: User-Defined Function
  - Built-in Functions for Complex Data Types
  - Higher-Order Functions
    - transform()
    - filter()
    - exists()
    - reduce()
- Common DataFrames and Spark SQL Operations
  - Unions
  - Joins
  - Windowing
  - Modifications
    - Adding new columns
    - Dropping columns
    - Renaming columns
    - Pivoting
- Summary
6. Spark SQL and Datasets
- Single API for Java and Scala
  - Scala Case Classes and JavaBeans for Datasets
- Working with Datasets
  - Creating Sample Data
  - Transforming Sample Data
    - Higher-order functions and functional programming
    - Converting DataFrames to Datasets
- Memory Management for Datasets and DataFrames
- Dataset Encoders
  - Spark's Internal Format Versus Java Object Format
  - Serialization and Deserialization (SerDe)
- Costs of Using Datasets
  - Strategies to Mitigate Costs
- Summary
7. Optimizing and Tuning Spark Applications
- Optimizing and Tuning Spark for Efficiency
  - Viewing and Setting Apache Spark Configurations
  - Scaling Spark for Large Workloads
    - Static versus dynamic resource allocation
    - Configuring Spark executors' memory and the shuffle service
    - Maximizing Spark parallelism
      - How partitions are created
- Caching and Persistence of Data
  - DataFrame.cache()
  - DataFrame.persist()
  - When to Cache and Persist
  - When Not to Cache and Persist
- A Family of Spark Joins
  - Broadcast Hash Join
    - When to use a broadcast hash join
  - Shuffle Sort Merge Join
    - Optimizing the shuffle sort merge join
    - When to use a shuffle sort merge join
- Inspecting the Spark UI
  - Journey Through the Spark UI Tabs
    - Jobs and Stages
    - Executors
    - Storage
    - SQL
    - Environment
    - Debugging Spark applications
- Summary
8. Structured Streaming
- Evolution of the Apache Spark Stream Processing Engine
  - The Advent of Micro-Batch Stream Processing
  - Lessons Learned from Spark Streaming (DStreams)
  - The Philosophy of Structured Streaming
- The Programming Model of Structured Streaming
- The Fundamentals of a Structured Streaming Query
  - Five Steps to Define a Streaming Query
    - Step 1: Define input sources
    - Step 2: Transform data
    - Step 3: Define output sink and output mode
    - Step 4: Specify processing details
    - Step 5: Start the query
    - Putting it all together
  - Under the Hood of an Active Streaming Query
  - Recovering from Failures with Exactly-Once Guarantees
  - Monitoring an Active Query
    - Querying current status using StreamingQuery
      - Get current metrics using StreamingQuery
      - Get current status using StreamingQuery.status()
    - Publishing metrics using Dropwizard Metrics
    - Publishing metrics using custom StreamingQueryListeners
- Streaming Data Sources and Sinks
  - Files
    - Reading from files
    - Writing to files
  - Apache Kafka
    - Reading from Kafka
    - Writing to Kafka
  - Custom Streaming Sources and Sinks
    - Writing to any storage system
      - Using foreachBatch()
      - Using foreach()
    - Reading from any storage system
- Data Transformations
  - Incremental Execution and Streaming State
  - Stateless Transformations
  - Stateful Transformations
    - Distributed and fault-tolerant state management
    - Types of stateful operations
- Stateful Streaming Aggregations
  - Aggregations Not Based on Time
  - Aggregations with Event-Time Windows
    - Handling late data with watermarks
      - Semantic guarantees with watermarks
    - Supported output modes
- Streaming Joins
  - Stream-Static Joins
  - Stream-Stream Joins
    - Inner joins with optional watermarking
    - Outer joins with watermarking
- Arbitrary Stateful Computations
  - Modeling Arbitrary Stateful Operations with mapGroupsWithState()
  - Using Timeouts to Manage Inactive Groups
    - Processing-time timeouts
    - Event-time timeouts
  - Generalization with flatMapGroupsWithState()
- Performance Tuning
- Summary
9. Building Reliable Data Lakes with Apache Spark
- The Importance of an Optimal Storage Solution
- Databases
  - A Brief Introduction to Databases
  - Reading from and Writing to Databases Using Apache Spark
  - Limitations of Databases
- Data Lakes
  - A Brief Introduction to Data Lakes
  - Reading from and Writing to Data Lakes using Apache Spark
  - Limitations of Data Lakes
- Lakehouses: The Next Step in the Evolution of Storage Solutions
  - Apache Hudi
  - Apache Iceberg
  - Delta Lake
- Building Lakehouses with Apache Spark and Delta Lake
  - Configuring Apache Spark with Delta Lake
  - Loading Data into a Delta Lake Table
  - Loading Data Streams into a Delta Lake Table
  - Enforcing Schema on Write to Prevent Data Corruption
  - Evolving Schemas to Accommodate Changing Data
  - Transforming Existing Data
    - Updating data to fix errors
    - Deleting user-related data
    - Upserting change data to a table using merge()
    - Deduplicating data while inserting using insert-only merge
  - Auditing Data Changes with Operation History
  - Querying Previous Snapshots of a Table with Time Travel
- Summary
10. Machine Learning with MLlib
- What Is Machine Learning?
  - Supervised Learning
  - Unsupervised Learning
  - Why Spark for Machine Learning?
- Designing Machine Learning Pipelines
  - Data Ingestion and Exploration
  - Creating Training and Test Data Sets
  - Preparing Features with Transformers
  - Understanding Linear Regression
  - Using Estimators to Build Models
  - Creating a Pipeline
    - One-hot encoding
  - Evaluating Models
    - RMSE
      - Interpreting the value of RMSE
      - R²
  - Saving and Loading Models
- Hyperparameter Tuning
  - Tree-Based Models
    - Decision trees
    - Random forests
  - k-Fold Cross-Validation
  - Optimizing Pipelines
- Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
- Model Management
  - MLflow
    - Tracking
- Model Deployment Options with MLlib
  - Batch
  - Streaming
  - Model Export Patterns for Real-Time Inference
- Leveraging Spark for Non-MLlib Models
  - Pandas UDFs
  - Spark for Distributed Hyperparameter Tuning
    - Joblib
    - Hyperopt
- Summary
12. Epilogue: Apache Spark 3.0
- Spark Core and Spark SQL
  - Dynamic Partition Pruning
  - Adaptive Query Execution
    - The AQE framework
  - SQL Join Hints
    - Shuffle sort merge join (SMJ)
    - Broadcast hash join (BHJ)
    - Shuffle hash join (SHJ)
    - Shuffle-and-replicate nested loop join (SNLJ)
  - Catalog Plugin API and DataSourceV2
  - Accelerator-Aware Scheduler
- Structured Streaming
- PySpark, Pandas UDFs, and Pandas Function APIs
  - Redesigned Pandas UDFs with Python Type Hints
  - Iterator Support in Pandas UDFs
  - New Pandas Function APIs
- Changed Functionality
  - Languages Supported and Deprecated
  - Changes to the DataFrame and Dataset APIs
  - DataFrame and SQL Explain Commands
- Summary
Index