Big Data Ecosystem | Big Data Processing with Apache Spark & Cassandra

Spark Training Series | Big Data / Data Science / Hadoop / Machine Learning / Deep Learning / AI

TTDS7777

Intermediate

5 Days

Course Overview

Big Data Processing with Spark & Cassandra explores processing large data sets with Spark, Cassandra, and related tools in the Hadoop ecosystem. Students will focus on specific techniques and tools for ingesting, transforming, and exporting data to and from these Big Data technologies.

Course Objectives

Working within an engaging, hands-on learning environment, attendees will learn to:

 

Explore Apache Spark version 2:

  • Understand the need for Spark in data processing
  • Compare Spark to Apache Impala
  • Understand the Spark architecture and how it distributes computations to cluster nodes
  • Be familiar with basic installation / setup / layout of Spark
  • Use the Spark shell for interactive and ad-hoc operations
  • Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
  • Understand and use RDD operations such as map(), filter(), and others
  • Understand and use Spark SQL and the DataFrame/DataSet API
  • Understand DataSet/DataFrame capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
  • Be familiar with performance issues, and use the DataSet/DataFrame API and Spark SQL for efficient computations
  • Understand Spark's data caching and use it for efficient data reuse
  • Write/run standalone Spark programs with the Spark API
  • Use Spark Streaming / Structured Streaming to process streaming (real-time) data
  • Ingest streaming data from Kafka, and process via Spark Structured Streaming
  • Understand performance implications and optimizations when using Spark

 

Explore Apache Cassandra:

  • Understand the needs which Cassandra (hereafter “C*”) addresses
  • Be familiar with the operation and structure of C*
  • Be able to install and set up a C* database
  • Use the C* tools, including cqlsh, nodetool, and ccm (Cassandra Cluster Manager)
  • Be familiar with the C* architecture, and how a C* cluster is structured
  • Understand how data is distributed and replicated in a C* cluster
  • Understand core C* data modeling concepts, and use them to create well-structured data models
  • Use data replication and eventual consistency intelligently
  • Understand and use CQL to create tables and query for data
  • Know and use the CQL data types (numerical, textual, uuid, etc.)
  • Understand the various kinds of primary keys available (simple, compound, and composite primary keys)
  • Use more advanced capabilities like collections, counters, secondary indexes, CAS (Compare and Set), static columns, and batches
  • Be familiar with the Java client API
  • Use the Java client API to write client programs that work with C*
  • Build and use dynamic queries with QueryBuilder
  • Understand and use asynchronous queries with the Java API

Explore Other Technologies:

  • The Big Data Ecosystem (day one general overview of the various technology options)

Course Prerequisites

This is an intermediate-level course geared for experienced developers or architects (with development experience) seeking to be proficient in developing, maintaining, or deploying scalable, distributed applications that analyze Big Data. Attendees should be experienced developers who are comfortable with Java, Scala, or Python programming. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi / nano) for editing code.

Course Agenda

Session : The Motivation for Big Data (Overview)

 

  • Problems with Traditional Large-Scale Systems
  • Introducing the Hadoop Ecosystem, as well as Spark and Apache Cassandra
  • Storage options: HDFS
    • Avro, Parquet, ORC, CSV, etc.
  • Legacy Hadoop Map/Reduce
    • Hive, Impala, Pig quick overview
  • Apache Spark
  • NoSQL technologies
    • Couchbase and MongoDB
    • Apache Cassandra and Apache HBase
  • Data Lakes

 

Session 1: Cassandra Overview

 

  • Why We Need Cassandra
  • High level Cassandra Overview
  • Cassandra Features
  • Basic Cassandra Installation and Configuration

 

Session 2: Cassandra Architecture and CQL Overview

 

  • Cassandra Architecture Overview
  • Cassandra Clusters and Rings
  • Data Replication in Cassandra
  • Cassandra Consistency / Eventual Consistency
  • Introduction to CQL
  • Defining Tables with a Single Primary Key
  • Using cqlsh for Interactive Querying
  • Selecting and Inserting/Upserting Data with CQL
  • Data Replication and Distribution
  • Basic Data Types (including uuid, timeuuid)

 

Session 3: Data Modeling and CQL Core Concepts

 

  • Defining a Compound Primary Key
    • CQL for Compound Primary Keys
    • Partition Keys and Data Distribution
    • Clustering Columns
    • Overview of Internal Data Organization
  • Additional Querying Capabilities
    • Result Ordering - ORDER BY and CLUSTERING ORDER BY
    • UPDATE and DELETE Queries
    • Result Filtering, ALLOW FILTERING
    • Batch Queries
  • Data Modeling Guidelines
    • Denormalization
    • Data Modeling Workflow
    • Data Modeling Principles
    • Primary Key Considerations
  • Composite Partition Keys
    • Defining with CQL
    • Data Distribution with Composite Partition Key
    • Overview of Internal Data Organization

 

Session 4: Additional CQL Capabilities

 

  • Indexing
    • Primary/Partition Keys and Pagination with token()
    • Secondary Indexes and Usage Guidelines
  • Cassandra Counters
    • Counter Structure and Definition
    • Using Counters
    • Counter Limitations
  • Cassandra Collections
    • Collection Structure and Uses
    • Defining Collections (set, list, and map)
    • Querying Collections (Including Insert, Update, Delete)
    • Limitations
    • Overview of Internal Storage Organization
  • Static Column: Overview and Usage
  • Static Column Guidelines
  • Materialized View: Overview and Usage
  • Materialized View Guidelines

 

Session 5: Data Consistency In Cassandra

  • Overview of Consistency in Cassandra
  • CAP Theorem
  • Eventual (Tunable) Consistency in C* - ONE, QUORUM, ALL
  • Choosing CL ONE
  • Choosing CL QUORUM
  • Achieving Immediate Consistency
  • Using other Consistency Levels
  • Internal Repair Mechanisms (Read Repair, Hinted Handoff)

Session 6: Lightweight Transactions (LWT) / Compare and Set (CAS)

  • Overview of Lightweight Transactions
  • Using LWT, the [applied] Column
  • IF EXISTS, IF NOT EXISTS, Other IF conditions
  • Basic CAS Internals
  • Overhead and Guidelines

Session 7: Practical Considerations

  • Dealing with Write Failure
    • Unavailable Nodes and Node Failure
    • Requirements for Write Operations
  • Key and Row Caches
    • Cache Overview
    • Usage Guidelines
  • Multi-Data Center Support
    • Overview
    • Replication Factor Configuration
    • Additional Consistency Levels - LOCAL/EACH QUORUM
  • Deletes
    • CQL for Deletion
    • Tombstones
    • Usage Guidelines

Session 7: The Java Client API

  • API Overview
    • Introduction
    • Architecture and Features
  • Connecting to a Cluster
    • Cluster and Cluster.Builder
    • Contact Points, Connecting to a Cluster
    • Session Overview and API
    • Working with Sessions
  • The Query API
    • Overview
    • Dynamic Queries, Statement, SimpleStatement
    • Processing Query Results, ResultSet, Row
    • PreparedStatement, BoundStatement
    • Binding Values and Querying with PreparedStatements
    • CQL to Java Type Mapping
    • Working with UUIDs
    • Working with Time/Date Values
    • Working with Batches of SimpleStatement and PreparedStatement
  • Dynamic Queries and QueryBuilder
    • QueryBuilder Overview and API
    • Building SELECT, DELETE, INSERT, and UPDATE Queries
    • Creating WHERE Clauses
    • Other Query Examples
  • Configuring Query Behavior
    • Setting LIMIT and TTL
    • Working with Consistency
    • Using LWT
    • Working with Driver Policies
    • Load Balancing Policies - RoundRobinPolicy, DCAwareRoundRobinPolicy
    • Retry Policies - DefaultRetryPolicy, DowngradingConsistencyRetryPolicy, Other Policies
    • Reconnection Policies
  • Asynchronous Querying Overview
    • Synchronous vs. Asynchronous Querying
    • Executing Asynchronous Queries

Session 8: Introduction to Apache Spark

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell, SparkContext
  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs - Creating and Transforming (map, filter, etc.)

Session 9: Spark SQL, DataFrames, and DataSets

  • Overview
  • SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text ...)
  • Introducing DataFrames and DataSets (Creation and Schema Inference)
  • Supported Data Formats (JSON, Text, CSV, Parquet)
  • Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
  • SQL-based Queries
  • Working with the DataSet (typed) API
  • Mapping and Splitting (flatMap(), explode(), and split())
  • DataSets vs. DataFrames vs. RDDs

Session 10: Shuffling Transformations and Performance

     

  • Grouping, Reducing, Joining
  • Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
  • Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
  • The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)

Session 11: Performance Tuning

  • Caching - Concepts, Storage Type, Guidelines
  • Minimizing Shuffling for Increased Performance
  • Using Broadcast Variables and Accumulators
  • General Performance Guidelines

Session 12: Creating Standalone Applications

     

  • Core API, SparkSession.Builder
  • Configuring and Creating a SparkSession
  • Building and Running Applications - sbt/build.sbt and spark-submit
  • Application Lifecycle (Driver, Executors, and Tasks)
  • Cluster Managers (Standalone, YARN, Mesos)
  • Logging and Debugging

Session 13: Spark Streaming

     

  • Introduction and Streaming Basics
  • Spark Streaming (Spark 1.0+)
    • DStreams, Receivers, Batching
    • Stateless Transformation
    • Windowed Transformation
    • Stateful Transformation
  • Structured Streaming (Spark 2+)
    • Continuous Applications
    • Table Paradigm, Result Table
    • Steps for Structured Streaming
    • Sources and Sinks
  • Consuming Kafka Data
    • Kafka Overview
    • Structured Streaming - "kafka" format
    • Processing the Stream

Course Materials

Each student will receive a course Student Guide, complete with course notes, code samples, software tutorials, diagrams, and related reference materials and links. Our courses also include our Student Workbook, with step-by-step hands-on lab instructions, project files (as necessary), and solutions, clearly illustrated so that students can complete hands-on work in class and revisit it to review or refresh skills at any time. Students will also receive the course setup files, project files (or code, if applicable), and solutions required for the hands-on work.
