Distributed programming on the cloud

Brought by: N/A

Overview

  • Module 1: Carnegie Mellon University's cloud developer course. Learn about distributed programming and why it's useful for the cloud, including programming models, types of parallelism, and symmetrical vs. asymmetrical architecture.
  • In this module, you will:

    • Classify programs as sequential, concurrent, parallel, and distributed
    • Indicate why programmers usually parallelize sequential programs
    • Explain why cloud programs are important for solving complex computing problems
    • Define distributed systems, and indicate the relationship between distributed systems and clouds
    • Define distributed programming models
    • Indicate why synchronization is needed in shared-memory systems
    • Describe how tasks can communicate by using the message-passing programming model
    • Outline the difference between synchronous and asynchronous programs
    • Explain the bulk synchronous parallel (BSP) model
    • Outline the difference between data parallelism and graph parallelism
    • Distinguish between these distributed programs: single program, multiple data (SPMD); and multiple program, multiple data (MPMD)
    • Discuss the two main techniques that distributed programs can use to address the communication bottleneck in the cloud
    • Define heterogeneous and homogeneous clouds, and identify the main reasons for heterogeneity in the cloud
    • State when and why synchronization is required in the cloud
    • Identify the main technique that can be used to tolerate faults in clouds
    • Outline the difference between task scheduling and job scheduling

    In partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 2: Carnegie Mellon University's cloud developer course. MapReduce was a breakthrough in big data processing that has become mainstream and been improved upon significantly. Learn about how MapReduce works.
  • In this module, you will:

    • Identify the underlying distributed programming model of MapReduce
    • Explain how MapReduce can exploit data parallelism
    • Identify the input and output of map and reduce tasks
    • Define task elasticity, and indicate its importance for effective job scheduling
    • Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
    • List the elements of the YARN architecture, and identify the role of each element
    • Summarize the lifecycle of a MapReduce job in YARN
    • Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
    • Indicate how job and task scheduling differ in YARN as opposed to the previous Hadoop MapReduce

    In partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 3: Carnegie Mellon University's cloud developer course. GraphLab is a big data tool developed by Carnegie Mellon University to help with data mining. Learn about how GraphLab works and why it's useful.
  • In this module, you will:

    • Describe the unique features in GraphLab and the application types that it targets
    • Recall the features of a graph-parallel distributed programming framework
    • Recall the three main parts in the GraphLab engine
    • Describe the steps that are involved in the GraphLab execution engine
    • Discuss the architectural model of GraphLab
    • Recall the scheduling strategy of GraphLab
    • Describe the programming model of GraphLab
    • List and explain the consistency levels in GraphLab
    • Describe the in-memory data placement strategy in GraphLab and its performance implications for certain types of graphs
    • Discuss the computational model of GraphLab
    • Discuss the fault-tolerance mechanisms in GraphLab
    • Identify the steps that are involved in the execution of a GraphLab program
    • Compare and contrast MapReduce, Spark, and GraphLab in terms of their programming, computation, parallelism, architectural, and scheduling models
    • Identify a suitable analytics engine given an application's characteristics

    In partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 4: Carnegie Mellon University's cloud developer course. Spark is an open-source cluster-computing framework whose strengths differ from those of MapReduce. Learn about how Spark works.
  • In this module, you will:

    • Recall the features of an iterative programming framework
    • Describe the architecture and job flow in Spark
    • Recall the role of resilient distributed datasets (RDDs) in Spark
    • Describe the properties of RDDs in Spark
    • Compare and contrast RDDs with distributed shared-memory systems
    • Describe fault-tolerance mechanics in Spark
    • Describe the role of lineage in RDDs for fault tolerance and recovery
    • Understand the different types of dependencies between RDDs
    • Understand the basic operations on Spark RDDs
    • Step through a simple iterative Spark program
    • Recall the various Spark libraries and their functions

    In partnership with Dr. Majd Sakr and Carnegie Mellon University.

  • Module 5: Carnegie Mellon University's cloud developer course. The growth in available data has produced continuous streams of real-time data that must be processed. Learn about different systems and techniques for consuming and processing real-time data streams.
  • In this module, you will:

    • Define a message queue and recall a basic architecture
    • Recall the characteristics, and present the advantages and disadvantages, of a message queue
    • Explain the basic architecture of Apache Kafka
    • Discuss the roles of topics and partitions, as well as how scalability and fault tolerance are achieved
    • Discuss general requirements of stream processing systems
    • Recall the evolution of stream processing
    • Explain the basic components of Apache Samza
    • Discuss how Apache Samza achieves stateful stream processing
    • Discuss the differences between the Lambda and Kappa architectures
    • Discuss the motivation for the adoption of message queues and stream processing in the LinkedIn use case

    In partnership with Dr. Majd Sakr and Carnegie Mellon University.

Syllabus

  • Module 1: What is distributed programming?
    • Introduction
    • Categories of computer programs
    • Why use distributed programming?
    • Distributed programming on the cloud
    • Programming models for clouds
    • Synchronous vs. asynchronous computation
    • Types of parallelism
    • Symmetrical vs. asymmetrical architecture
    • Cloud challenges: Scalability
    • Cloud challenges: Communication
    • Cloud challenges: Heterogeneity
    • Cloud challenges: Synchronization
    • Cloud challenges: Fault tolerance
    • Cloud challenges: Scheduling
    • Summary
  • Module 2: Distributed computing on the cloud: MapReduce
    • Introduction
    • Programming model
    • Data structure
    • Example MapReduce programs
    • Computation and architectural models
    • Job and task scheduling
    • Fault tolerance
    • YARN
    • Summary
  • Module 3: Distributed computing on the cloud: GraphLab
    • Introduction
    • Data structure and graph flow
    • Architectural model
    • Programming model
    • Computational model
    • Fault tolerance
    • An example application in GraphLab
    • Comparison of distributed analytics engines
    • Summary
  • Module 4: Distributed computing on the cloud: Spark
    • Introduction
    • Spark overview
    • Resilient distributed datasets
    • Lineage, fault tolerance, and recovery
    • Programming in Spark
    • The Spark ecosystem
    • Summary
  • Module 5: Message queues and stream processing
    • Introduction
    • Message queues
    • Message queues: Case study
    • Stream processing systems
    • Streaming architectures: Case study
    • Big data processing architectures
    • Real-time architectures in practice
    • Summary

  • Free
  • English
  • Certificate Not Available
  • Available at any time
  • Beginner