Spark: The Definitive Guide. Big Data Processing Made Simple
- Authors:
- Bill Chambers, Matei Zaharia
- Pages:
- 606
- Available formats:
- ePub, Mobi
Ebook description: Spark: The Definitive Guide. Big Data Processing Made Simple
Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.
You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.
- Get a gentle overview of big data and Spark
- Learn about DataFrames, SQL, and Datasets—Spark’s core APIs—through worked examples
- Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
- Understand how Spark runs on a cluster
- Debug, monitor, and tune Spark clusters and applications
- Learn the power of Structured Streaming, Spark’s stream-processing engine
- Learn how you can apply MLlib to a variety of problems, including classification or recommendation
Ebook details
- Ebook ISBN:
- 978-14-919-1229-4, 9781491912294
- Ebook release date:
- 2018-02-08 (the ebook release date is often the day the title went on sale and may differ from the release date of the print edition)
- Publication language:
- English
- ePub file size:
- 7.5MB
- Mobi file size:
- 17.1MB
Table of contents
- Preface
- About the Authors
- Who This Book Is For
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- I. Gentle Overview of Big Data and Spark
- 1. What Is Apache Spark?
- Apache Spark's Philosophy
- Context: The Big Data Problem
- History of Spark
- The Present and Future of Spark
- Running Spark
- Downloading Spark Locally
- Downloading Spark for a Hadoop cluster
- Building Spark from source
- Launching Spark's Interactive Consoles
- Launching the Python console
- Launching the Scala console
- Launching the SQL console
- Running Spark in the Cloud
- Data Used in This Book
- 2. A Gentle Introduction to Spark
- Spark's Basic Architecture
- Spark Applications
- Spark's Language APIs
- Spark's APIs
- Starting Spark
- The SparkSession
- DataFrames
- Partitions
- Transformations
- Lazy Evaluation
- Actions
- Spark UI
- An End-to-End Example
- DataFrames and SQL
- Conclusion
- 3. A Tour of Spark's Toolset
- Running Production Applications
- Datasets: Type-Safe Structured APIs
- Structured Streaming
- Machine Learning and Advanced Analytics
- Lower-Level APIs
- SparkR
- Spark's Ecosystem and Packages
- Conclusion
- II. Structured APIs: DataFrames, SQL, and Datasets
- 4. Structured API Overview
- DataFrames and Datasets
- Schemas
- Overview of Structured Spark Types
- DataFrames Versus Datasets
- Columns
- Rows
- Spark Types
- Overview of Structured API Execution
- Logical Planning
- Physical Planning
- Execution
- Conclusion
- 5. Basic Structured Operations
- Schemas
- Columns and Expressions
- Columns
- Explicit column references
- Expressions
- Columns as expressions
- Accessing a DataFrame's columns
- Records and Rows
- Creating Rows
- DataFrame Transformations
- Creating DataFrames
- select and selectExpr
- Converting to Spark Types (Literals)
- Adding Columns
- Renaming Columns
- Reserved Characters and Keywords
- Case Sensitivity
- Removing Columns
- Changing a Column's Type (cast)
- Filtering Rows
- Getting Unique Rows
- Random Samples
- Random Splits
- Concatenating and Appending Rows (Union)
- Sorting Rows
- Limit
- Repartition and Coalesce
- Collecting Rows to the Driver
- Conclusion
- 6. Working with Different Types of Data
- Where to Look for APIs
- Converting to Spark Types
- Working with Booleans
- Working with Numbers
- Working with Strings
- Regular Expressions
- Working with Dates and Timestamps
- Working with Nulls in Data
- Coalesce
- ifnull, nullIf, nvl, and nvl2
- drop
- fill
- replace
- Ordering
- Working with Complex Types
- Structs
- Arrays
- split
- Array Length
- array_contains
- explode
- Maps
- Working with JSON
- User-Defined Functions
- Conclusion
- 7. Aggregations
- Aggregation Functions
- count
- countDistinct
- approx_count_distinct
- first and last
- min and max
- sum
- sumDistinct
- avg
- Variance and Standard Deviation
- skewness and kurtosis
- Covariance and Correlation
- Aggregating to Complex Types
- Grouping
- Grouping with Expressions
- Grouping with Maps
- Window Functions
- Grouping Sets
- Rollups
- Cube
- Grouping Metadata
- Pivot
- User-Defined Aggregation Functions
- Conclusion
- 8. Joins
- Join Expressions
- Join Types
- Inner Joins
- Outer Joins
- Left Outer Joins
- Right Outer Joins
- Left Semi Joins
- Left Anti Joins
- Natural Joins
- Cross (Cartesian) Joins
- Challenges When Using Joins
- Joins on Complex Types
- Handling Duplicate Column Names
- Approach 1: Different join expression
- Approach 2: Dropping the column after the join
- Approach 3: Renaming a column before the join
- How Spark Performs Joins
- Communication Strategies
- Big table-to-big table
- Big table-to-small table
- Little table-to-little table
- Conclusion
- 9. Data Sources
- The Structure of the Data Sources API
- Read API Structure
- Basics of Reading Data
- Read modes
- Write API Structure
- Basics of Writing Data
- Save modes
- CSV Files
- CSV Options
- Reading CSV Files
- Writing CSV Files
- JSON Files
- JSON Options
- Reading JSON Files
- Writing JSON Files
- Parquet Files
- Reading Parquet Files
- Parquet options
- Writing Parquet Files
- ORC Files
- Reading Orc Files
- Writing Orc Files
- SQL Databases
- Reading from SQL Databases
- Query Pushdown
- Reading from databases in parallel
- Partitioning based on a sliding window
- Writing to SQL Databases
- Text Files
- Reading Text Files
- Writing Text Files
- Advanced I/O Concepts
- Splittable File Types and Compression
- Reading Data in Parallel
- Writing Data in Parallel
- Partitioning
- Bucketing
- Writing Complex Types
- Managing File Size
- Conclusion
- 10. Spark SQL
- What Is SQL?
- Big Data and SQL: Apache Hive
- Big Data and SQL: Spark SQL
- Spark's Relationship to Hive
- The Hive metastore
- How to Run Spark SQL Queries
- Spark SQL CLI
- Spark's Programmatic SQL Interface
- SparkSQL Thrift JDBC/ODBC Server
- Catalog
- Tables
- Spark-Managed Tables
- Creating Tables
- Creating External Tables
- Inserting into Tables
- Describing Table Metadata
- Refreshing Table Metadata
- Dropping Tables
- Dropping unmanaged tables
- Caching Tables
- Views
- Creating Views
- Dropping Views
- Databases
- Creating Databases
- Setting the Database
- Dropping Databases
- Select Statements
- case...when...then Statements
- Advanced Topics
- Complex Types
- Structs
- Lists
- Functions
- User-defined functions
- Subqueries
- Uncorrelated predicate subqueries
- Correlated predicate subqueries
- Uncorrelated scalar queries
- Miscellaneous Features
- Configurations
- Setting Configuration Values in SQL
- Conclusion
- 11. Datasets
- When to Use Datasets
- Creating Datasets
- In Java: Encoders
- In Scala: Case Classes
- Actions
- Transformations
- Filtering
- Mapping
- Joins
- Grouping and Aggregations
- Conclusion
- III. Low-Level APIs
- 12. Resilient Distributed Datasets (RDDs)
- What Are the Low-Level APIs?
- When to Use the Low-Level APIs?
- How to Use the Low-Level APIs?
- About RDDs
- Types of RDDs
- When to Use RDDs?
- Datasets and RDDs of Case Classes
- Creating RDDs
- Interoperating Between DataFrames, Datasets, and RDDs
- From a Local Collection
- From Data Sources
- Manipulating RDDs
- Transformations
- distinct
- filter
- map
- flatMap
- sort
- Random Splits
- Actions
- reduce
- count
- countApprox
- countApproxDistinct
- countByValue
- countByValueApprox
- first
- max and min
- take
- Saving Files
- saveAsTextFile
- SequenceFiles
- Hadoop Files
- Caching
- Checkpointing
- Pipe RDDs to System Commands
- mapPartitions
- foreachPartition
- glom
- Conclusion
- 13. Advanced RDDs
- Key-Value Basics (Key-Value RDDs)
- keyBy
- Mapping over Values
- Extracting Keys and Values
- lookup
- sampleByKey
- Aggregations
- countByKey
- Understanding Aggregation Implementations
- groupByKey
- reduceByKey
- Other Aggregation Methods
- aggregate
- aggregateByKey
- combineByKey
- foldByKey
- CoGroups
- Joins
- Inner Join
- zips
- Controlling Partitions
- coalesce
- repartition
- repartitionAndSortWithinPartitions
- Custom Partitioning
- Custom Serialization
- Conclusion
- 14. Distributed Shared Variables
- Broadcast Variables
- Accumulators
- Basic Example
- Custom Accumulators
- Conclusion
- IV. Production Applications
- 15. How Spark Runs on a Cluster
- The Architecture of a Spark Application
- Execution Modes
- Cluster mode
- Client mode
- Local mode
- The Life Cycle of a Spark Application (Outside Spark)
- Client Request
- Launch
- Execution
- Completion
- The Life Cycle of a Spark Application (Inside Spark)
- The SparkSession
- The SparkContext
- Logical Instructions
- Logical instructions to physical execution
- A Spark Job
- Stages
- Tasks
- Execution Details
- Pipelining
- Shuffle Persistence
- Conclusion
- 16. Developing Spark Applications
- Writing Spark Applications
- A Simple Scala-Based App
- Running the application
- Writing Python Applications
- Running the application
- Writing Java Applications
- Running the application
- Testing Spark Applications
- Strategic Principles
- Input data resilience
- Business logic resilience and evolution
- Resilience in output and atomicity
- Tactical Takeaways
- Managing SparkSessions
- Which Spark API to Use?
- Connecting to Unit Testing Frameworks
- Connecting to Data Sources
- The Development Process
- Launching Applications
- Application Launch Examples
- Configuring Applications
- The SparkConf
- Application Properties
- Runtime Properties
- Execution Properties
- Configuring Memory Management
- Configuring Shuffle Behavior
- Environmental Variables
- Job Scheduling Within an Application
- Conclusion
- 17. Deploying Spark
- Where to Deploy Your Cluster to Run Spark Applications
- On-Premises Cluster Deployments
- Spark in the Cloud
- Cluster Managers
- Standalone Mode
- Starting a standalone cluster
- Cluster launch scripts
- Standalone cluster configurations
- Submitting applications
- Spark on YARN
- Submitting applications
- Configuring Spark on YARN Applications
- Hadoop configurations
- Application properties for YARN
- Spark on Mesos
- Submitting applications
- Configuring Mesos
- Secure Deployment Configurations
- Cluster Networking Configurations
- Application Scheduling
- Dynamic allocation
- Miscellaneous Considerations
- Conclusion
- 18. Monitoring and Debugging
- The Monitoring Landscape
- What to Monitor
- Driver and Executor Processes
- Queries, Jobs, Stages, and Tasks
- Spark Logs
- The Spark UI
- Other Spark UI tabs
- Configuring the Spark user interface
- Spark REST API
- Spark UI History Server
- Debugging and Spark First Aid
- Spark Jobs Not Starting
- Signs and symptoms
- Potential treatments
- Errors Before Execution
- Signs and symptoms
- Potential treatments
- Errors During Execution
- Signs and symptoms
- Potential treatments
- Slow Tasks or Stragglers
- Signs and symptoms
- Potential treatments
- Slow Aggregations
- Signs and symptoms
- Potential treatments
- Slow Joins
- Signs and symptoms
- Potential treatments
- Slow Reads and Writes
- Signs and symptoms
- Potential treatments
- Driver OutOfMemoryError or Driver Unresponsive
- Signs and symptoms
- Potential treatments
- Executor OutOfMemoryError or Executor Unresponsive
- Signs and symptoms
- Potential treatments
- Unexpected Nulls in Results
- Signs and symptoms
- Potential treatments
- No Space Left on Disk Errors
- Signs and symptoms
- Potential treatments
- Serialization Errors
- Signs and symptoms
- Potential treatments
- Conclusion
- 19. Performance Tuning
- Indirect Performance Enhancements
- Design Choices
- Scala versus Java versus Python versus R
- DataFrames versus SQL versus Datasets versus RDDs
- Object Serialization in RDDs
- Cluster Configurations
- Cluster/application sizing and sharing
- Dynamic allocation
- Scheduling
- Data at Rest
- File-based long-term data storage
- Splittable file types and compression
- Table partitioning
- Bucketing
- The number of files
- Data locality
- Statistics collection
- Shuffle Configurations
- Memory Pressure and Garbage Collection
- Measuring the impact of garbage collection
- Garbage collection tuning
- Direct Performance Enhancements
- Parallelism
- Improved Filtering
- Repartitioning and Coalescing
- Custom partitioning
- User-Defined Functions (UDFs)
- Temporary Data Storage (Caching)
- Joins
- Aggregations
- Broadcast Variables
- Conclusion
- V. Streaming
- 20. Stream Processing Fundamentals
- What Is Stream Processing?
- Stream Processing Use Cases
- Notifications and alerting
- Real-time reporting
- Incremental ETL
- Update data to serve in real time
- Real-time decision making
- Online machine learning
- Advantages of Stream Processing
- Challenges of Stream Processing
- Stream Processing Design Points
- Record-at-a-Time Versus Declarative APIs
- Event Time Versus Processing Time
- Continuous Versus Micro-Batch Execution
- Spark's Streaming APIs
- The DStream API
- Structured Streaming
- Conclusion
- 21. Structured Streaming Basics
- Structured Streaming Basics
- Core Concepts
- Transformations and Actions
- Input Sources
- Sinks
- Output Modes
- Triggers
- Event-Time Processing
- Event-time data
- Watermarks
- Structured Streaming in Action
- Transformations on Streams
- Selections and Filtering
- Aggregations
- Joins
- Input and Output
- Where Data Is Read and Written (Sources and Sinks)
- File source and sink
- Kafka source and sink
- Reading from the Kafka Source
- Writing to the Kafka Sink
- Foreach sink
- Sources and sinks for testing
- How Data Is Output (Output Modes)
- Append mode
- Complete mode
- Update mode
- When can you use each mode?
- When Data Is Output (Triggers)
- Processing time trigger
- Once trigger
- Streaming Dataset API
- Conclusion
- 22. Event-Time and Stateful Processing
- Event Time
- Stateful Processing
- Arbitrary Stateful Processing
- Event-Time Basics
- Windows on Event Time
- Tumbling Windows
- Sliding windows
- Handling Late Data with Watermarks
- Dropping Duplicates in a Stream
- Arbitrary Stateful Processing
- Time-Outs
- Output Modes
- mapGroupsWithState
- flatMapGroupsWithState
- Conclusion
- 23. Structured Streaming in Production
- Fault Tolerance and Checkpointing
- Updating Your Application
- Updating Your Streaming Application Code
- Updating Your Spark Version
- Sizing and Rescaling Your Application
- Metrics and Monitoring
- Query Status
- Recent Progress
- Input rate and processing rate
- Batch duration
- Spark UI
- Alerting
- Advanced Monitoring with the Streaming Listener
- Conclusion
- VI. Advanced Analytics and Machine Learning
- 24. Advanced Analytics and Machine Learning Overview
- A Short Primer on Advanced Analytics
- Supervised Learning
- Classification
- Regression
- Recommendation
- Unsupervised Learning
- Graph Analytics
- The Advanced Analytics Process
- Data collection
- Data cleaning
- Feature engineering
- Training models
- Model tuning and evaluation
- Leveraging the model and/or insights
- Spark's Advanced Analytics Toolkit
- What Is MLlib?
- When and why should you use MLlib (versus scikit-learn, TensorFlow, or foo package)
- High-Level MLlib Concepts
- Low-level data types
- MLlib in Action
- Feature Engineering with Transformers
- Estimators
- Pipelining Our Workflow
- Training and Evaluation
- Persisting and Applying Models
- Deployment Patterns
- Conclusion
- 25. Preprocessing and Feature Engineering
- Formatting Models According to Your Use Case
- Transformers
- Estimators for Preprocessing
- Transformer Properties
- High-Level Transformers
- RFormula
- SQL Transformers
- VectorAssembler
- Working with Continuous Features
- Bucketing
- Advanced bucketing techniques
- Scaling and Normalization
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- ElementwiseProduct
- Normalizer
- Working with Categorical Features
- StringIndexer
- Converting Indexed Values Back to Text
- Indexing in Vectors
- One-Hot Encoding
- Text Data Transformers
- Tokenizing Text
- Removing Common Words
- Creating Word Combinations
- Converting Words into Numerical Representations
- Term frequency-inverse document frequency
- Word2Vec
- Feature Manipulation
- PCA
- Interaction
- Polynomial Expansion
- Feature Selection
- ChiSqSelector
- Advanced Topics
- Persisting Transformers
- Writing a Custom Transformer
- Conclusion
- 26. Classification
- Use Cases
- Types of Classification
- Binary Classification
- Multiclass Classification
- Multilabel Classification
- Classification Models in MLlib
- Model Scalability
- Logistic Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Model Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Random Forest and Gradient-Boosted Trees
- Model Hyperparameters
- Random forest only
- Gradient-boosted trees (GBT) only
- Training Parameters
- Prediction Parameters
- Naive Bayes
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Evaluators for Classification and Automating Model Tuning
- Detailed Evaluation Metrics
- One-vs-Rest Classifier
- Multilayer Perceptron
- Conclusion
- 27. Regression
- Use Cases
- Regression Models in MLlib
- Model Scalability
- Linear Regression
- Model Hyperparameters
- Training Parameters
- Example
- Training Summary
- Generalized Linear Regression
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Training Summary
- Decision Trees
- Model Hyperparameters
- Training Parameters
- Example
- Random Forests and Gradient-Boosted Trees
- Model Hyperparameters
- Training Parameters
- Example
- Advanced Methods
- Survival Regression (Accelerated Failure Time)
- Isotonic Regression
- Evaluators and Automating Model Tuning
- Metrics
- Conclusion
- 28. Recommendation
- Use Cases
- Collaborative Filtering with Alternating Least Squares
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Evaluators for Recommendation
- Metrics
- Regression Metrics
- Ranking Metrics
- Frequent Pattern Mining
- Conclusion
- 29. Unsupervised Learning
- Use Cases
- Model Scalability
- k-means
- Model Hyperparameters
- Training Parameters
- Example
- k-means Metrics Summary
- Bisecting k-means
- Model Hyperparameters
- Training Parameters
- Example
- Bisecting k-means Summary
- Gaussian Mixture Models
- Model Hyperparameters
- Training Parameters
- Example
- Gaussian Mixture Model Summary
- Latent Dirichlet Allocation
- Model Hyperparameters
- Training Parameters
- Prediction Parameters
- Example
- Conclusion
- 30. Graph Analytics
- Building a Graph
- Querying the Graph
- Subgraphs
- Motif Finding
- Graph Algorithms
- PageRank
- In-Degree and Out-Degree Metrics
- Breadth-First Search
- Connected Components
- Strongly Connected Components
- Advanced Tasks
- Conclusion
- 31. Deep Learning
- What Is Deep Learning?
- Ways of Using Deep Learning in Spark
- Deep Learning Libraries
- MLlib Neural Network Support
- TensorFrames
- BigDL
- TensorFlowOnSpark
- DeepLearning4J
- Deep Learning Pipelines
- A Simple Example with Deep Learning Pipelines
- Setup
- Images and DataFrames
- Transfer Learning
- Applying deep learning models at scale
- Applying Popular Models
- Applying custom Keras models
- Applying TensorFlow models
- Deploying models as SQL functions
- Conclusion
- VII. Ecosystem
- 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
- PySpark
- Fundamental PySpark Differences
- Pandas Integration
- R on Spark
- SparkR
- Pros and cons of using SparkR instead of other languages
- Setup
- Key Concepts
- Function masking
- SparkR functions only apply to SparkDataFrames
- Data manipulation
- Data sources
- Machine learning
- User-defined functions
- sparklyr
- Key concepts
- No DataFrames
- Data manipulation
- Executing SQL
- Data sources
- Machine learning
- Conclusion
- 33. Ecosystem and Community
- Spark Packages
- An Abridged List of Popular Packages
- Using Spark Packages
- In Scala
- In Python
- At runtime
- External Packages
- Community
- Spark Summit
- Local Meetups
- Conclusion
- Index