Spark: The Definitive Guide. Big Data Processing Made Simple
- Authors:
- Bill Chambers, Matei Zaharia
- Pages:
- 606
- Available formats:
- ePub, Mobi
Ebook description: Spark: The Definitive Guide. Big Data Processing Made Simple
Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.
You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.
- Get a gentle overview of big data and Spark
- Learn about DataFrames, SQL, and Datasets—Spark’s core APIs—through worked examples
- Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
- Understand how Spark runs on a cluster
- Debug, monitor, and tune Spark clusters and applications
- Learn the power of Structured Streaming, Spark’s stream-processing engine
- Learn how you can apply MLlib to a variety of problems, including classification or recommendation
You can read the ebook "Spark: The Definitive Guide. Big Data Processing Made Simple" on:
- e-readers: Inkbook, Kindle, Pocketbook, Onyx Boox, and others
- Windows, MacOS, and other systems
- Windows, Android, iOS, and HarmonyOS systems
- any device or application that supports the PDF, EPub, or Mobi formats
Ebook details
- Ebook ISBN:
- 978-14-919-1229-4, 9781491912294
- Ebook release date:
- 2018-02-08 (The ebook release date is often the day the title went on sale and may differ from the publication date of the printed book. Additional information can be found in the free sample. If in doubt, contact sklep@ebookpoint.pl.)
- Publication language:
- English
- ePub file size:
- 7.5MB
- Mobi file size:
- 17.1MB
Ebook table of contents
- Preface
- About the Authors
- Who This Book Is For
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- I. Gentle Overview of Big Data and Spark
- 1. What Is Apache Spark?
- Apache Spark's Philosophy
- Context: The Big Data Problem
- History of Spark
- The Present and Future of Spark
- Running Spark
- Downloading Spark Locally
- Downloading Spark for a Hadoop cluster
- Building Spark from source
- Launching Spark's Interactive Consoles
- Launching the Python console
- Launching the Scala console
- Launching the SQL console
- Running Spark in the Cloud
- Data Used in This Book
- 2. A Gentle Introduction to Spark
- Spark's Basic Architecture
- Spark Applications
- Spark's Language APIs
- Spark's APIs
- Starting Spark
- The SparkSession
- DataFrames
- Partitions
- Transformations
- Lazy Evaluation
- Actions
- Spark UI
- An End-to-End Example
- DataFrames and SQL
- Conclusion
- 3. A Tour of Spark's Toolset
- Running Production Applications
- Datasets: Type-Safe Structured APIs
- Structured Streaming
- Machine Learning and Advanced Analytics
- Lower-Level APIs
- SparkR
- Spark's Ecosystem and Packages
- Conclusion
- II. Structured APIs: DataFrames, SQL, and Datasets
- 4. Structured API Overview
- DataFrames and Datasets
- Schemas
- Overview of Structured Spark Types
- DataFrames Versus Datasets
- Columns
- Rows
- Spark Types
- Overview of Structured API Execution
- Logical Planning
- Physical Planning
- Execution
- Conclusion
- 5. Basic Structured Operations
- Schemas
- Columns and Expressions
- Columns
- Explicit column references
- Expressions
- Columns as expressions
- Accessing a DataFrame's columns
- Records and Rows
- Creating Rows
- DataFrame Transformations
- Creating DataFrames
- select and selectExpr
- Converting to Spark Types (Literals)
- Adding Columns
- Renaming Columns
- Reserved Characters and Keywords
- Case Sensitivity
- Removing Columns
- Changing a Column's Type (cast)
- Filtering Rows
- Getting Unique Rows
- Random Samples
- Random Splits
- Concatenating and Appending Rows (Union)
- Sorting Rows
- Limit
- Repartition and Coalesce
- Collecting Rows to the Driver
- Conclusion
- 6. Working with Different Types of Data
- Where to Look for APIs
- Converting to Spark Types
- Working with Booleans
- Working with Numbers
- Working with Strings
- Regular Expressions
- Working with Dates and Timestamps
- Working with Nulls in Data
- Coalesce
- ifnull, nullIf, nvl, and nvl2
- drop
- fill
- replace
- Ordering
- Working with Complex Types
- Structs
- Arrays
- split
- Array Length
- array_contains
- explode
- Maps
- Working with JSON
- User-Defined Functions
- Conclusion
- 7. Aggregations
- Aggregation Functions
- count
- countDistinct
- approx_count_distinct
- first and last
- min and max
- sum
- sumDistinct
- avg
- Variance and Standard Deviation
- skewness and kurtosis
- Covariance and Correlation
- Aggregating to Complex Types
- Grouping
- Grouping with Expressions
- Grouping with Maps
- Window Functions
- Grouping Sets
- Rollups
- Cube
- Grouping Metadata
- Pivot
- User-Defined Aggregation Functions
- Conclusion
- 8. Joins
- Join Expressions
- Join Types
- Inner Joins
- Outer Joins
- Left Outer Joins
- Right Outer Joins
- Left Semi Joins
- Left Anti Joins
- Natural Joins
- Cross (Cartesian) Joins
- Challenges When Using Joins
- Joins on Complex Types
- Handling Duplicate Column Names
- Approach 1: Different join expression
- Approach 2: Dropping the column after the join
- Approach 3: Renaming a column before the join
- How Spark Performs Joins
- Communication Strategies
- Big table-to-big table
- Big table-to-small table
- Little table-to-little table
- Conclusion
- 9. Data Sources
- The Structure of the Data Sources API
- Read API Structure
- Basics of Reading Data
- Read modes
- Write API Structure
- Basics of Writing Data
- Save modes
- CSV Files
- CSV Options
- Reading CSV Files
- Writing CSV Files
- JSON Files
- JSON Options
- Reading JSON Files
- Writing JSON Files
- Parquet Files
- Reading Parquet Files
- Parquet options
- Writing Parquet Files
- ORC Files
- Reading Orc Files
- Writing Orc Files
- SQL Databases
- Reading from SQL Databases
- Query Pushdown
- Reading from databases in parallel
- Partitioning based on a sliding window
- Writing to SQL Databases
- Text Files
- Reading Text Files
- Writing Text Files
- Advanced I/O Concepts
- Splittable File Types and Compression
- Reading Data in Parallel
- Writing Data in Parallel
- Partitioning
- Bucketing
- Writing Complex Types
- Managing File Size
- Conclusion
- 10. Spark SQL
- What Is SQL?
- Big Data and SQL: Apache Hive
- Big Data and SQL: Spark SQL
- Spark's Relationship to Hive
- The Hive metastore
- How to Run Spark SQL Queries
- Spark SQL CLI
- Spark's Programmatic SQL Interface
- SparkSQL Thrift JDBC/ODBC Server
- Catalog
- Tables
- Spark-Managed Tables
- Creating Tables
- Creating External Tables
- Inserting into Tables
- Describing Table Metadata
- Refreshing Table Metadata
- Dropping Tables
- Dropping unmanaged tables
- Caching Tables
- Views
- Creating Views
- Dropping Views
- Databases
- Creating Databases
- Setting the Database
- Dropping Databases
- Select Statements
- case...when...then Statements
- Advanced Topics
- Complex Types
- Structs
- Lists
- Functions
- User-defined functions
- Subqueries
- Uncorrelated predicate subqueries
- Correlated predicate subqueries
- Uncorrelated scalar queries
- Miscellaneous Features
- Configurations
- Setting Configuration Values in SQL
- Conclusion
- 11. Datasets
- When to Use Datasets
- Creating Datasets
- In Java: Encoders
- In Scala: Case Classes
- Actions
- Transformations
- Filtering
- Mapping
- Joins
- Grouping and Aggregations
- Conclusion
- III. Low-Level APIs
- 12. Resilient Distributed Datasets (RDDs)
- What Are the Low-Level APIs?
- When to Use the Low-Level APIs?
- How to Use the Low-Level APIs?
- About RDDs
- Types of RDDs
- When to Use RDDs?
- Datasets and RDDs of Case Classes
- Creating RDDs
- Interoperating Between DataFrames, Datasets, and RDDs
- From a Local Collection
- From Data Sources
- Manipulating RDDs
- Transformations
- distinct
- filter
- map
- flatMap
- sort
- Random Splits
- Actions
- reduce
- count
- countApprox
- countApproxDistinct
- countByValue
- countByValueApprox
- first
- max and min
- take
- Saving Files
- saveAsTextFile
- SequenceFiles
- Hadoop Files
- Caching
- Checkpointing
- Pipe RDDs to System Commands
- mapPartitions
- foreachPartition
- glom
- Conclusion
- 13. Advanced RDDs
- Key-Value Basics (Key-Value RDDs)
- keyBy
- Mapping over Values
- Extracting Keys and Values
- lookup
- sampleByKey
- Aggregations
- countByKey
- Understanding Aggregation Implementations
- groupByKey
- reduceByKey
- Other Aggregation Methods
- aggregate
- aggregateByKey
- combineByKey
- foldByKey
- CoGroups
- Joins
- Inner Join
- zips
- Controlling Partitions
- coalesce
- repartition
- repartitionAndSortWithinPartitions
- Custom Partitioning
- Custom Serialization
- Conclusion
- 14. Distributed Shared Variables
- Broadcast Variables
- Accumulators
- Basic Example
- Custom Accumulators
- Conclusion
- IV. Production Applications
- 15. How Spark Runs on a Cluster
- The Architecture of a Spark Application
- Execution Modes
- Cluster mode
- Client mode
- Local mode
- The Life Cycle of a Spark Application (Outside Spark)
- Client Request
- Launch
- Execution
- Completion
- The Life Cycle of a Spark Application (Inside Spark)
- The SparkSession
- The SparkContext
- Logical Instructions
- Logical instructions to physical execution
- A Spark Job
- Stages
- Tasks
- Execution Details
- Pipelining
- Shuffle Persistence
- Conclusion
- 16. Developing Spark Applications
- Writing Spark Applications
- A Simple Scala-Based App
- Running the application
- Writing Python Applications
- Running the application
- Writing Java Applications
- Running the application
- Testing Spark Applications
- Strategic Principles
- Input data resilience
- Business logic resilience and evolution
- Resilience in output and atomicity
- Tactical Takeaways
- Managing SparkSessions
- Which Spark API to Use?
- Connecting to Unit Testing Frameworks
- Connecting to Data Sources
- The Development Process
- Launching Applications
- Application Launch Examples
- Configuring Applications
- The SparkConf
- Application Properties
- Runtime Properties
- Execution Properties
- Configuring Memory Management
- Configuring Shuffle Behavior
- Environmental Variables
- Job Scheduling Within an Application
- Conclusion
- 17. Deploying Spark
- Where to Deploy Your Cluster to Run Spark Applications
- On-Premises Cluster Deployments
- Spark in the Cloud
- Cluster Managers
- Standalone Mode
- Starting a standalone cluster
- Cluster launch scripts
- Standalone cluster configurations
- Submitting applications
- Spark on YARN
- Submitting applications
- Configuring Spark on YARN Applications
- Hadoop configurations
- Application properties for YARN
- Spark on Mesos
- Submitting applications
- Configuring Mesos
- Secure Deployment Configurations
- Cluster Networking Configurations
- Application Scheduling
- Dynamic allocation
- Miscellaneous Considerations
- Conclusion
- 18. Monitoring and Debugging
- The Monitoring Landscape
- What to Monitor
- Driver and Executor Processes
- Queries, Jobs, Stages, and Tasks
- Spark Logs
- The Spark UI
- Other Spark UI tabs
- Configuring the Spark user interface
- Spark REST API
- Spark UI History Server
- Debugging and Spark First Aid
- Spark Jobs Not Starting
- Signs and symptoms
- Potential treatments
- Errors Before Execution
- Signs and symptoms
- Potential treatments
- Errors During Execution
- Signs and symptoms
- Potential treatments
- Slow Tasks or Stragglers
- Signs and symptoms
- Potential treatments
- Slow Aggregations
- Signs and symptoms
- Potential treatments
- Slow Joins
- Signs and symptoms
- Potential treatments
- Slow Reads and Writes
- Signs and symptoms
- Potential treatments
- Driver OutOfMemoryError or Driver Unresponsive
- Signs and symptoms
- Potential treatments
- Executor OutOfMemoryError or Executor Unresponsive
- Signs and symptoms
- Potential treatments
- Unexpected Nulls in Results
- Signs and symptoms
- Potential treatments
- No Space Left on Disk Errors
- Signs and symptoms
- Potential treatments
- Serialization Errors
- Signs and symptoms
- Potential treatments
- Conclusion
- 19. Performance Tuning
- Indirect Performance Enhancements
- Design Choices
- Scala versus Java versus Python versus R
- DataFrames versus SQL versus Datasets versus RDDs
- Object Serialization in RDDs
- Cluster Configurations
- Cluster/application sizing and sharing
- Dynamic allocation
- Scheduling
- Data at Rest
- File-based long-term data storage
- Splittable file types and compression
- Table partitioning
- Bucketing
- The number of files
- Data locality
- Statistics collection
- Shuffle Configurations
- Memory Pressure and Garbage Collection
- Measuring the impact of garbage collection
- Garbage collection tuning
- Direct Performance Enhancements
- Parallelism
- Improved Filtering
- Repartitioning and Coalescing
- Custom partitioning
- User-Defined Functions (UDFs)
- Temporary Data Storage (Caching)
- Joins
- Aggregations
- Broadcast Variables
- Conclusion
- V. Streaming
- 20. Stream Processing Fundamentals
- What Is Stream Processing?
- Stream Processing Use Cases
- Notifications and alerting
- Real-time reporting
- Incremental ETL
- Update data to serve in real time
- Real-time decision making
- Online machine learning
- Advantages of Stream Processing
- Challenges of Stream Processing
- Stream Processing Design Points
- Record-at-a-Time Versus Declarative APIs
- Event Time Versus Processing Time
- Continuous Versus Micro-Batch Execution
- Spark's Streaming APIs
- The DStream API
- Structured Streaming
- Conclusion
- 21. Structured Streaming Basics
- Structured Streaming Basics
- Core Concepts
- Transformations and Actions
- Input Sources
- Sinks
- Output Modes
- Triggers
- Event-Time Processing
- Event-time data
- Watermarks
- Structured Streaming in Action
- Transformations on Streams
- Selections and Filtering
- Aggregations
- Joins
- Input and Output
- Where Data Is Read and Written (Sources and Sinks)
- File source and sink
- Kafka source and sink
- Reading from the Kafka Source
- Writing to the Kafka Sink
- Foreach sink
- Sources and sinks for testing
- How Data Is Output (Output Modes)
- Append mode
- Complete mode
- Update mode
- When can you use each mode?
- When Data Is Output (Triggers)
- Processing time trigger
- Once trigger
- Streaming Dataset API
- Conclusion
- 22. Event-Time and Stateful Processing
- Event Time
- Stateful Processing
- Arbitrary Stateful Processing
- Event-Time Basics
- Windows on Event Time
- Tumbling Windows
- Sliding windows
- Handling Late Data with Watermarks
- Dropping Duplicates in a Stream
- Arbitrary Stateful Processing
- Time-Outs
- Output Modes
- mapGroupsWithState
- flatMapGroupsWithState
- Conclusion
- 23. Structured Streaming in Production
- Fault Tolerance and Checkpointing
- Updating Your Application
- Updating Your Streaming Application Code
- Updating Your Spark Version
- Sizing and Rescaling Your Application
- Metrics and Monitoring
- Query Status
- Recent Progress
- Input rate and processing rate
- Batch duration
- Spark UI
- Alerting
- Advanced Monitoring with the Streaming Listener
- Conclusion
- VI. Advanced Analytics and Machine Learning
- 24. Advanced Analytics and Machine Learning Overview
- A Short Primer on Advanced Analytics
- Supervised Learning
- Classification
- Regression
- Recommendation
- Unsupervised Learning
- Graph Analytics
- The Advanced Analytics Process
- Data collection
- Data cleaning
- Feature engineering
- Training models
- Model tuning and evaluation
- Leveraging the model and/or insights
- Spark's Advanced Analytics Toolkit
- What Is MLlib?
- When and why should you use MLlib (versus scikit-learn, TensorFlow, or foo package)
- High-Level MLlib Concepts
- Low-level data types
- MLlib in Action
- Feature Engineering with Transformers
- Estimators
- Pipelining Our Workflow
- Training and Evaluation
- Persisting and Applying Models
- Deployment Patterns
- Conclusion
- 25. Preprocessing and Feature Engineering
- Formatting Models According to Your Use Case
- Transformers
- Estimators for Preprocessing
- Transformer Properties
- High-Level Transformers
- RFormula
- SQL Transformers
- VectorAssembler
- Working with Continuous Features
- Bucketing
- Advanced bucketing techniques
- Scaling and Normalization
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- ElementwiseProduct
- Normalizer
- Working with Categorical Features
- StringIndexer
- Converting Indexed Values Back to Text
- Indexing in Vectors
- One-Hot Encoding
- Text Data Transformers
- Tokenizing Text
- Removing Common Words
- Creating Word Combinations
- Converting Words into Numerical Representations
- Term frequency-inverse document frequency
- Word2Vec
- Feature Manipulation
- PCA
- Interaction
- Polynomial Expansion
- Feature Selection
- ChiSqSelector
- Advanced Topics
- Persisting Transformers
- Writing a Custom Transformer
- Conclusion
- 26. Classification
- Use Cases
- Types of Classification
- Binary Classification
- Multiclass Classification
- Multilabel Classification
- Classification Models in MLlib
- Model Scalability
- Logistic Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Model Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Random Forest and Gradient-Boosted Trees
- Model Hyperparameters
- Random forest only
- Gradient-boosted trees (GBT) only
- Training Parameters
- Prediction Parameters
- Naive Bayes
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Evaluators for Classification and Automating Model Tuning
- Detailed Evaluation Metrics
- One-vs-Rest Classifier
- Multilayer Perceptron
- Conclusion
- 27. Regression
- Use Cases
- Regression Models in MLlib
- Model Scalability
- Linear Regression
- Model Hyperparameters
- Training Parameters
- Example
- Training Summary
- Generalized Linear Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Training Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Example
- Random Forests and Gradient-Boosted Trees
- Model Hyperparameters
- Training Parameters
- Example
- Advanced Methods
- Survival Regression (Accelerated Failure Time)
- Isotonic Regression
- Evaluators and Automating Model Tuning
- Metrics
- Conclusion
- 28. Recommendation
- Use Cases
- Collaborative Filtering with Alternating Least Squares
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Evaluators for Recommendation
- Metrics
- Regression Metrics
- Ranking Metrics
- Frequent Pattern Mining
- Conclusion
- 29. Unsupervised Learning
- Use Cases
- Model Scalability
- k-means
- Model Hyperparameters
- Training Parameters
- Example
- k-means Metrics Summary
- Bisecting k-means
- Model Hyperparameters
- Training Parameters
- Example
- Bisecting k-means Summary
- Gaussian Mixture Models
- Model Hyperparameters
- Training Parameters
- Example
- Gaussian Mixture Model Summary
- Latent Dirichlet Allocation
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Conclusion
- 30. Graph Analytics
- Building a Graph
- Querying the Graph
- Subgraphs
- Motif Finding
- Graph Algorithms
- PageRank
- In-Degree and Out-Degree Metrics
- Breadth-First Search
- Connected Components
- Strongly Connected Components
- Advanced Tasks
- Conclusion
- 31. Deep Learning
- What Is Deep Learning?
- Ways of Using Deep Learning in Spark
- Deep Learning Libraries
- MLlib Neural Network Support
- TensorFrames
- BigDL
- TensorFlowOnSpark
- DeepLearning4J
- Deep Learning Pipelines
- A Simple Example with Deep Learning Pipelines
- Setup
- Images and DataFrames
- Transfer Learning
- Applying deep learning models at scale
- Applying Popular Models
- Applying custom Keras models
- Applying TensorFlow models
- Deploying models as SQL functions
- Conclusion
- VII. Ecosystem
- 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
- PySpark
- Fundamental PySpark Differences
- Pandas Integration
- R on Spark
- SparkR
- Pros and cons of using SparkR instead of other languages
- Setup
- Key Concepts
- Function masking
- SparkR functions only apply to SparkDataFrames
- Data manipulation
- Data sources
- Machine learning
- User-defined functions
- sparklyr
- Key concepts
- No DataFrames
- Data manipulation
- Executing SQL
- Data sources
- Machine learning
- Conclusion
- 33. Ecosystem and Community
- Spark Packages
- An Abridged List of Popular Packages
- Using Spark Packages
- In Scala
- In Python
- At runtime
- External Packages
- Community
- Spark Summit
- Local Meetups
- Conclusion
- Index