Course Outline:
1. Introduction to the Parallel Framework Architecture
- Describe the parallel processing architecture
- Describe pipeline and partition parallelism
- Describe the role of the configuration file
2. Compiling and Executing Jobs
- Describe the main parts of the configuration file
- Describe the compile process and the OSH that the compilation process generates
- Describe the role and the main parts of the Score
3. Partitioning and Collecting Data
- Understand how partitioning works in the parallel framework
- View partitioners in the Score
- Selecting partitioning algorithms
4. Sorting Data
- Sort data in the parallel framework
- Find inserted sorts in the Score
- Reduce the number of inserted sorts
- Optimize Fork-Join jobs
- Use Sort stages to determine the last row in a group
5. Buffering in Parallel Jobs
- Describe how buffering works in parallel jobs
- Tune buffers in parallel jobs
- Avoid buffer contention
6. Parallel Framework Data Types
- Describe virtual data sets
- Describe schemas
- Describe data type mappings and conversions
- Describe how external data is processed
- Handle nulls
7. Reusable Components
- Create a schema file
- Read a sequential file using a schema
- Describe Runtime Column Propagation (RCP)
- Enable and disable RCP
- Create and use shared containers
8. Balanced Optimization
- Enable Balanced Optimization functionality in Designer
- List the different Balanced Optimization options
- Push stage processing to a data source
- Push stage processing to a data target
- Optimize a job that accesses the Hadoop Distributed File System (HDFS)
- Understand the limitations of Balanced Optimization