Data Algorithms with Spark Mahmoud Parsian

Author: Mahmoud Parsian
Publisher: O'Reilly Media
Pages: 438
Available formats: ePub, Mobi

203.15 zł ~~239.00 zł~~ (-15%)

143.40 zł lowest price in the last 30 days

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.

With this book, you will:

Learn how to select Spark transformations for optimized solutions
Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
Understand data partitioning for optimized queries
Build and apply a model using PySpark design patterns
Apply motif-finding algorithms to graph data
Analyze graph data by using the GraphFrames API
Apply PySpark algorithms to clinical and genomics data
Learn how to use and apply feature engineering in ML algorithms
Understand and use practical and pragmatic data design patterns
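
To give a feel for the per-key reductions named in the list above, here is a minimal plain-Python sketch of the semantics behind reduceByKey() and combineByKey(). It is an illustration only, not code from the book: it runs without Spark, the helper names `reduce_by_key` and `combine_by_key` are hypothetical, and the sample ratings are made up. Nested lists stand in for partitions.

```python
# Plain-Python sketch (no Spark needed) of per-key reduction semantics.
# Assumption: helper names and sample data are illustrative, not from the book.

def reduce_by_key(pairs, func):
    """Merge all values of each key with an associative binary function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Build a combiner per key inside each partition, then merge across partitions."""
    result = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        # Simulates the shuffle step: only one combiner per key leaves a partition.
        for key, combiner in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


# Two simulated partitions of (movie_id, rating) pairs.
partitions = [
    [("m1", 4.0), ("m2", 3.0), ("m1", 2.0)],
    [("m2", 5.0), ("m1", 3.0)],
]

# Sum of ratings per movie, in the style of reduceByKey().
all_pairs = [pair for partition in partitions for pair in partition]
totals = reduce_by_key(all_pairs, lambda a, b: a + b)

# (sum, count) per movie, in the style of combineByKey(), then the average.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                          # create a combiner from the first value
    lambda c, v: (c[0] + v, c[1] + 1),         # fold a value into a combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge combiners across partitions
)
averages = {key: s / n for key, (s, n) in sum_count.items()}

print(totals)    # {'m1': 9.0, 'm2': 8.0}
print(averages)  # {'m1': 3.0, 'm2': 4.0}
```

Carrying a (sum, count) pair rather than a running mean keeps the merge step associative, so partial results from different partitions can be combined in any order.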

You can read the ebook "Data Algorithms with Spark" on:

Inkbook, Kindle, Pocketbook, Onyx Boox, and other e-readers
Windows, macOS, and other systems

Windows, Android, iOS, and HarmonyOS systems
any devices and applications that support the PDF, ePub, and Mobi formats

Book details

Ebook ISBN: 978-14-920-8233-0, 9781492082330
Ebook release date: 2022-04-08. The ebook release date is often the day the title went on sale and may differ from the publication date of the print edition. Additional information can be found in the free sample. If in doubt, contact us at sklep@ebookpoint.pl.
Publication language: English
ePub file size: 5.4 MB
Mobi file size: 14.6 MB

Categories

Computer Science » Programming » Cloud Programming
Computer Science » IT Business » Big Data » Data Analysis

Product accessibility

This product has not yet been assessed for accessibility features, or the accessibility information is missing or insufficient. The publisher or supplier has probably not yet enabled validation of the product or has not provided the relevant accessibility information.

Table of contents

Foreword
Preface
- Why I Wrote This Book
- Who This Book Is For
- How This Book Is Organized
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
I. Fundamentals
1. Introduction to Spark and PySpark
- Why Spark for Data Analytics
  - The Spark Ecosystem
  - Spark Architecture
    - Key Terms
    - Spark architecture in a nutshell
- The Power of PySpark
  - PySpark Architecture
- Spark Data Abstractions
  - RDD Examples
  - Spark RDD Operations
    - Transformations
    - Actions
  - DataFrame Examples
- Using the PySpark Shell
  - Launching the PySpark Shell
  - Creating an RDD from a Collection
  - Aggregating and Merging Values of Keys
  - Filtering an RDD's Elements
  - Grouping Similar Keys
  - Aggregating Values for Similar Keys
- ETL Example with DataFrames
  - Extraction
  - Transformation
  - Loading
- Summary
2. Transformations in Action
- The DNA Base Count Example
  - The DNA Base Count Problem
  - FASTA Format
  - Sample Data
- DNA Base Count Solution 1
  - Step 1: Create an RDD[String] from the Input
  - Step 2: Define a Mapper Function
  - Step 3: Find the Frequencies of DNA Letters
  - Pros and Cons of Solution 1
- DNA Base Count Solution 2
  - Step 1: Create an RDD[String] from the Input
  - Step 2: Define a Mapper Function
  - Step 3: Find the Frequencies of DNA Letters
  - Pros and Cons of Solution 2
- DNA Base Count Solution 3
  - The mapPartitions() Transformation
  - Step 1: Create an RDD[String] from the Input
  - Step 2: Define a Function to Handle a Partition
  - Step 3: Apply the Custom Function to Each Partition
  - Pros and Cons of Solution 3
- Summary
3. Mapper Transformations
- Data Abstractions and Mappers
- What Are Transformations?
  - Lazy Transformations
  - The map() Transformation
    - RDD mapper
    - Custom mapper functions
  - DataFrame Mapper
    - Mapper to single DataFrame column
    - Mapper to multiple DataFrame columns
- The flatMap() Transformation
  - map() Versus flatMap()
  - Apply flatMap() to a DataFrame
- The mapValues() Transformation
- The flatMapValues() Transformation
- The mapPartitions() Transformation
  - Handling Empty Partitions
  - Benefits and Drawbacks
  - DataFrames and mapPartitions() Transformation
- Summary
4. Reductions in Spark
- Creating Pair RDDs
- Reduction Transformations
- Spark's Reductions
- Simple Warmup Example
  - Solving with reduceByKey()
  - Solving with groupByKey()
  - Solving with aggregateByKey()
  - Solving with combineByKey()
- What Is a Monoid?
  - Monoid and Non-Monoid Examples
- The Movie Problem
  - Input Dataset to Analyze
  - The aggregateByKey() Transformation
  - First Solution Using aggregateByKey()
  - Second Solution Using aggregateByKey()
  - Complete PySpark Solution Using groupByKey()
  - Complete PySpark Solution Using reduceByKey()
    - Step 1: Read data and create pairs
    - Step 2: Use reduceByKey() to sum up ratings
    - Step 3: Find average rating
  - Complete PySpark Solution Using combineByKey()
    - Step 1: Read data and create pairs
    - Step 2: Use combineByKey() to sum up ratings
    - Step 3: Find average rating
- The Shuffle Step in Reductions
  - Shuffle Step for groupByKey()
  - Shuffle Step for reduceByKey()
- Summary
II. Working with Data
5. Partitioning Data
- Introduction to Partitions
  - Partitions in Spark
- Managing Partitions
  - Default Partitioning
  - Explicit Partitioning
- Physical Partitioning for SQL Queries
- Physical Partitioning of Data in Spark
  - Partition as Text Format
  - Partition as Parquet Format
- How to Query Partitioned Data
  - Amazon Athena Example
- Summary
6. Graph Algorithms
- Introduction to Graphs
- The GraphFrames API
  - How to Use GraphFrames
  - GraphFrames Functions and Attributes
- GraphFrames Algorithms
  - Finding Triangles
    - Step 1: Build a graph
    - Step 2: Count triangles
  - Motif Finding
    - Triangle counting with motifs
      - Trial 1
      - Trial 2
      - Trial 3
    - Finding unique triangles with motifs
      - Input
      - Output
      - Algorithm
    - Other motif finding examples
      - Finding bidirectional vertices
      - Finding subgraphs
      - Friend recommendation
      - Product recommendations
- Real-World Applications
  - Gene Analysis
    - Motif finding for genes
  - Social Recommendations
  - Facebook Circles
    - Input
    - Building the graph
    - Motif finding
  - Connected Components
    - Connected components in Spark
  - Analyzing Flight Data
    - Input
      - Vertices
      - Edges
    - Building the graph
    - Flight analysis
- Summary
7. Interacting with External Data Sources
- Relational Databases
  - Reading from a Database
    - Step 1: Create a database table
    - Step 2: Read the database table into a DataFrame
    - Step 3: Query the DataFrame
  - Writing a DataFrame to a Database
- Reading Text Files
- Reading and Writing CSV Files
  - Reading CSV Files
  - Writing CSV Files
- Reading and Writing JSON Files
  - Reading JSON Files
  - Writing JSON Files
- Reading from and Writing to Amazon S3
  - Reading from Amazon S3
  - Writing to Amazon S3
- Reading and Writing Hadoop Files
  - Reading Hadoop Text Files
  - Writing Hadoop Text Files
  - Reading and Writing HDFS SequenceFiles
    - Reading HDFS SequenceFiles
    - Writing HDFS SequenceFiles
- Reading and Writing Parquet Files
  - Writing Parquet Files
  - Reading Parquet Files
- Reading and Writing Avro Files
  - Reading Avro Files
  - Writing Avro Files
- Reading from and Writing to MS SQL Server
  - Writing to MS SQL Server
  - Reading from MS SQL Server
- Reading Image Files
  - Creating a DataFrame from Images
- Summary
8. Ranking Algorithms
- Rank Product
  - Calculation of the Rank Product
  - Formalizing Rank Product
  - Rank Product Example
  - PySpark Solution
    - Input data format
    - Output data format
    - Rank product solution using combineByKey()
      - Step 1: Compute the mean per gene per study
      - Step 2: Compute the rank of each gene per study
      - Step 3: Calculate the rank product for each gene
    - Rank product solution using groupByKey()
- PageRank
  - PageRank's Iterative Computation
  - Custom PageRank in PySpark Using RDDs
    - Input data format
    - Output data format
    - PySpark Solution
    - Sample output
  - Custom PageRank in PySpark Using an Adjacency Matrix
    - Input data format
    - Output data format
    - PySpark solution
  - PageRank with GraphFrames
    - Tolerance
    - Maximum iterations
- Summary
III. Data Design Patterns
9. Classic Data Design Patterns
- Input-Map-Output
  - RDD Solution
  - DataFrame Solution
  - Flat Mapper functionality
- Input-Filter-Output
  - RDD Solution
  - DataFrame Solution
  - DataFrame Filter
- Input-Map-Reduce-Output
  - RDD Solution
  - DataFrame Solution
- Input-Multiple-Maps-Reduce-Output
  - RDD Solution
  - DataFrame Solution
- Input-Map-Combiner-Reduce-Output
- Input-MapPartitions-Reduce-Output
- Inverted Index
  - Problem Statement
  - Input
  - Output
  - PySpark Solution
- Summary
10. Practical Data Design Patterns
- In-Mapper Combining
  - Basic MapReduce Algorithm
  - In-Mapper Combining per Record
  - In-Mapper Combining per Partition
- Top-10
  - Top-N Formalized
  - PySpark Solution
  - Finding the Bottom 10
- MinMax
  - Solution 1: Classic MapReduce
  - Solution 2: Sorting
  - Solution 3: Spark's mapPartitions()
- The Composite Pattern and Monoids
  - Monoids
  - Monoidal and Non-Monoidal Examples
    - Maximum over a set of integers
    - Subtraction over a set of integers
    - Addition over a set of integers
    - Union and intersection over integers
    - Multiplication over a set of integers
    - Mean over a set of integers
    - Median over a set of integers
    - Concatenation over lists
    - Matrix example
  - Non-Monoid MapReduce Example
  - Monoid MapReduce Example
  - PySpark Implementation of Monoidal Mean
  - Functors and Monoids
  - Conclusion on Using Monoids
- Binning
- Sorting
- Summary
11. Join Design Patterns
- Introduction to the Join Operation
- Join in MapReduce
  - Map Phase
  - Reducer Phase
  - Implementation in PySpark
- Map-Side Join Using RDDs
- Map-Side Join Using DataFrames
  - Step 1: Create Cache for Airports
  - Step 2: Create Cache for Airlines
  - Step 3: Create Facts Table
  - Step 4: Apply Map-Side Join
- Efficient Joins Using Bloom Filters
  - Introduction to Bloom Filters
  - A Simple Bloom Filter Example
  - Bloom Filters in Python
  - Using Bloom Filters in PySpark
- Summary
12. Feature Engineering in PySpark
- Introduction to Feature Engineering
- Adding New Features
- Applying UDFs
- Creating Pipelines
- Binarizing Data
- Imputation
- Tokenization
  - Tokenizer
  - RegexTokenizer
  - Tokenization with a Pipeline
- Standardization
- Normalization
  - Scaling a Column Using a Pipeline
  - Using MinMaxScaler on Multiple Columns
  - Normalization Using Normalizer
- String Indexing
  - Applying StringIndexer to a Single Column
  - Applying StringIndexer to Several Columns
- Vector Assembly
- Bucketing
  - Bucketizer
  - QuantileDiscretizer
- Logarithm Transformation
- One-Hot Encoding
- TF-IDF
- FeatureHasher
- SQLTransformer
- Summary
Index