- Module 1: Carnegie Mellon University's Cloud Developer course. Learn about distributed programming and why it's useful for the cloud, including programming models, types of parallelism, and symmetrical vs. asymmetrical architecture.
- Classify programs as sequential, concurrent, parallel, or distributed
- Indicate why programmers usually parallelize sequential programs
- Explain why cloud programs are important for solving complex computing problems
- Define distributed systems, and indicate the relationship between distributed systems and clouds
- Define distributed programming models
- Indicate why synchronization is needed in shared-memory systems
- Describe how tasks can communicate by using the message-passing programming model
- Outline the difference between synchronous and asynchronous programs
- Explain the bulk synchronous parallel (BSP) model
- Outline the difference between data parallelism and graph parallelism
- Distinguish between these distributed programs: single program, multiple data (SPMD); and multiple program, multiple data (MPMD)
- Discuss the two main techniques that distributed programs can incorporate to address the communication bottleneck in the cloud
- Define heterogeneous and homogeneous clouds, and identify the main reasons for heterogeneity in the cloud
- State when and why synchronization is required in the cloud
- Identify the main technique that can be used to tolerate faults in clouds
- Outline the difference between task scheduling and job scheduling
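The objectives above cover several distributed programming models. As a toy illustration (not taken from the course materials), the message-passing model with data parallelism can be sketched in plain Python: a coordinator splits the input, sends each chunk to a worker task as an explicit message, and receives partial results back. Real message-passing systems exchange messages across machines; here queues between threads stand in for the network.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive one chunk of data as a message, send back a partial sum."""
    chunk = inbox.get()          # blocking receive
    outbox.put(sum(chunk))      # send result to the coordinator

# Coordinator: split the data (data parallelism) across four worker tasks.
data = list(range(100))
inboxes = [queue.Queue() for _ in range(4)]
results: queue.Queue = queue.Queue()

threads = [threading.Thread(target=worker, args=(ib, results)) for ib in inboxes]
for t in threads:
    t.start()
for ib, start in zip(inboxes, range(0, 100, 25)):
    ib.put(data[start:start + 25])   # explicit message send
for t in threads:
    t.join()

# Aggregate the partial results received from all workers.
total = sum(results.get() for _ in range(4))
```

Because every worker runs the same code on a different slice of the data, this is also a minimal example of the SPMD (single program, multiple data) style mentioned above.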
- Module 2: Carnegie Mellon University's Cloud Developer course. MapReduce was a breakthrough in big data processing that has since become mainstream and been improved upon significantly. Learn how MapReduce works.
- Identify the underlying distributed programming model of MapReduce
- Explain how MapReduce can exploit data parallelism
- Identify the input and output of map and reduce tasks
- Define task elasticity, and indicate its importance for effective job scheduling
- Explain the map and reduce task-scheduling strategies in Hadoop MapReduce
- List the elements of the YARN architecture, and identify the role of each element
- Summarize the lifecycle of a MapReduce job in YARN
- Compare and contrast the architectures and the resource allocators of YARN and the previous Hadoop MapReduce
- Indicate how job and task scheduling differ in YARN as opposed to the previous Hadoop MapReduce
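To make the map and reduce phases above concrete, here is a minimal word-count sketch (an assumption of ours, not the course's code): map tasks emit (key, value) pairs from input splits, a shuffle step groups pairs by key, and reduce tasks aggregate each group. A real framework such as Hadoop runs these phases on many machines; this version runs them in one process.

```python
from collections import defaultdict
from itertools import chain

def map_task(document: str):
    # Map phase: emit a (word, 1) pair for every word in one input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle phase: group intermediate pairs by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce phase: aggregate all values observed for one key.
    return (key, sum(values))

splits = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = chain.from_iterable(map_task(s) for s in splits)
counts = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
```

Note how each map task reads only its own split, which is what lets the framework exploit data parallelism by scheduling map tasks independently.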
- Module 3: Carnegie Mellon University's Cloud Developer course. GraphLab is a big data tool developed at Carnegie Mellon University to help with data mining. Learn how GraphLab works and why it's useful.
- Describe the unique features in GraphLab and the application types that it targets
- Recall the features of a graph-parallel distributed programming framework
- Recall the three main parts in the GraphLab engine
- Describe the steps that are involved in the GraphLab execution engine
- Discuss the architectural model of GraphLab
- Recall the scheduling strategy of GraphLab
- Describe the programming model of GraphLab
- List and explain the consistency levels in GraphLab
- Describe the in-memory data placement strategy in GraphLab and its performance implications for certain types of graphs
- Discuss the computational model of GraphLab
- Discuss the fault-tolerance mechanisms in GraphLab
- Identify the steps that are involved in the execution of a GraphLab program
- Compare and contrast MapReduce, Spark, and GraphLab in terms of their programming, computation, parallelism, architectural, and scheduling models
- Identify a suitable analytics engine given an application's characteristics
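In the graph-parallel style that GraphLab targets, computation is expressed as an update function applied to a vertex and its neighborhood rather than to rows of data. The sketch below is a toy, single-machine approximation of that idea (our own illustration, not GraphLab's API), using a PageRank-style update swept over all vertices by a simple scheduler loop:

```python
# Toy graph: adjacency list mapping each vertex to its in-neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
rank = {v: 1.0 for v in graph}
# Out-degree of u = number of vertices that list u as an in-neighbor.
out_degree = {u: sum(u in nbrs for nbrs in graph.values()) for u in graph}

def update(vertex: str) -> None:
    # Update function: read the vertex's neighborhood (its "scope"),
    # then write a new value for the vertex itself.
    incoming = sum(rank[u] / out_degree[u] for u in graph[vertex])
    rank[vertex] = 0.15 + 0.85 * incoming

for _ in range(20):          # scheduler: sweep every vertex each round
    for v in graph:
        update(v)
```

Because each update touches only a vertex's neighborhood, a distributed engine can run many such updates in parallel, which is why consistency levels (how much of the scope is locked during an update) matter in GraphLab.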
- Module 4: Carnegie Mellon University's Cloud Developer course. Spark is an open-source cluster-computing framework with different strengths from MapReduce. Learn how Spark works.
- Recall the features of an iterative programming framework
- Describe the architecture and job flow in Spark
- Recall the role of resilient distributed datasets (RDDs) in Spark
- Describe the properties of RDDs in Spark
- Compare and contrast RDDs with distributed shared-memory systems
- Describe fault-tolerance mechanics in Spark
- Describe the role of lineage in RDDs for fault tolerance and recovery
- Distinguish the different types of dependencies between RDDs
- Recall the basic operations on Spark RDDs
- Step through a simple iterative Spark program
- Recall the various Spark libraries and their functions
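The RDD concepts above (lazy transformations, actions, and lineage-based recovery) can be sketched in a few lines of plain Python. This toy class is our own illustration, not Spark's implementation: each transformation records its parent and function instead of computing data, so an action can replay the lineage chain from the source, which is also how a lost partition would be recomputed.

```python
class ToyRDD:
    """A toy, single-machine stand-in for a Spark RDD with lineage."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # only source RDDs hold data
        self._parent = parent    # lineage: the parent RDD
        self._fn = fn            # lineage: the transformation to reapply

    def map(self, f):
        # Transformation (narrow dependency): recorded lazily, not computed.
        return ToyRDD(parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Action: triggers evaluation by replaying the lineage from the source.
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

even_squares = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
```

Calling `even_squares.collect()` walks the lineage (source → map → filter) and materializes the result; until then, nothing is computed, mirroring Spark's lazy evaluation.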
- Module 5: Carnegie Mellon University's Cloud Developer course. The growth of available data has given rise to continuous streams of real-time data that must be processed as they arrive. Learn about different systems and techniques for consuming and processing real-time data streams.
- Define a message queue and recall a basic architecture
- Recall the characteristics, and present the advantages and disadvantages, of a message queue
- Explain the basic architecture of Apache Kafka
- Discuss the roles of topics and partitions, as well as how scalability and fault tolerance are achieved
- Discuss general requirements of stream processing systems
- Recall the evolution of stream processing
- Explain the basic components of Apache Samza
- Discuss how Apache Samza achieves stateful stream processing
- Discuss the differences between the Lambda and Kappa architectures
- Discuss the motivation for the adoption of message queues and stream processing in the LinkedIn use case
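The topic-and-partition ideas above can be sketched with a toy log-structured topic (our own simplification, not Kafka's API): producers append messages to a partition chosen by key, the log is immutable, and consumers pull from an explicit offset, so per-key ordering is preserved within a partition.

```python
class Topic:
    """A toy in-memory topic split into append-only partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> None:
        # The same key always maps to the same partition, so all messages
        # for one key stay in order within that partition's log.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)

    def consume(self, partition: int, offset: int):
        # Consumers pull from an explicit offset; the log itself never changes,
        # so the same messages can be re-read for recovery or reprocessing.
        return self.partitions[partition][offset:]

clicks = Topic(num_partitions=2)
for i in range(5):
    clicks.produce(key="user-1", value=f"click-{i}")

p = hash("user-1") % 2
events = clicks.consume(partition=p, offset=2)
```

Adding partitions is what gives this design its scalability: independent partitions can live on different brokers and be consumed in parallel, while replication of each partition (not modeled here) provides fault tolerance.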
In partnership with Dr. Majd Sakr and Carnegie Mellon University.