MCA Microsoft Certified Associate Azure Data Engineer Study Guide. Exam DP-203. Edition No. 1. Sybex Study Guide

  • Book

  • 1008 Pages
  • September 2023
  • John Wiley and Sons Ltd
  • ID: 5836805

Prepare for the Azure Data Engineering certification - and an exciting new career in analytics - with this must-have study aid

In the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203, accomplished data engineer and tech educator Benjamin Perkins delivers a hands-on, practical guide to preparing for the challenging Azure Data Engineer certification and for a new career in an exciting and growing field of tech.

In the book, you’ll explore all the objectives covered on the DP-203 exam while learning the job roles and responsibilities of a newly minted Azure data engineer. You’ll learn to integrate, transform, and consolidate data from various structured and unstructured data systems into a structure suitable for building analytics solutions, and you’ll get up to speed quickly and efficiently with Sybex’s easy-to-use study aids and tools.

This Study Guide also offers:

  • Career-ready advice for anyone hoping to ace their first data engineering job interview and excel on their first day in the field
  • Indispensable tips and tricks to familiarize yourself with the DP-203 exam structure and help reduce test anxiety
  • Complimentary access to Sybex’s expansive online study tools, accessible across multiple devices, and offering access to hundreds of bonus practice questions, electronic flashcards, and a searchable, digital glossary of key terms

A one-of-a-kind study aid designed to help you get straight to the crucial material you need to succeed on the exam and on the job, the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203 belongs on the bookshelves of anyone hoping to increase their data analytics skills, advance their data engineering career with an in-demand certification, or make a career change into a popular new area of tech.

Table of Contents

Introduction xxvii

Part I Azure Data Engineer Certification and Azure Products 1

Chapter 1 Gaining the Azure Data Engineer Associate Certification 3

The Journey to Certification 7

How to Pass Exam DP-203 8

Understanding the Exam Expectations and Requirements 9

Use Azure Daily 17

Read Azure Articles to Stay Current 17

Have an Understanding of All Azure Products 20

Azure Product Name Recognition 21

Azure Data Analytics 23

Azure Synapse Analytics 23

Azure Databricks 26

Azure HDInsight 28

Azure Analysis Services 30

Azure Data Factory 31

Azure Event Hubs 33

Azure Stream Analytics 34

Other Products 35

Azure Storage Products 36

Azure Data Lake Storage 37

Azure Storage 40

Other Products 42

Azure Databases 43

Azure Cosmos DB 43

Azure SQL Server Products 46

Additional Azure Databases 46

Other Products 47

Azure Security 48

Azure Active Directory 48

Role-Based Access Control 51

Attribute-Based Access Control 53

Azure Key Vault 53

Other Products 55

Azure Networking 56

Virtual Networks 56

Other Products 59

Azure Compute 59

Azure Virtual Machines 59

Azure Virtual Machine Scale Sets 60

Azure App Service Web Apps 60

Azure Functions 60

Azure Batch 60

Azure Management and Governance 60

Azure Monitor 61

Azure Purview 61

Azure Policy 62

Azure Blueprints (Preview) 62

Azure Lighthouse 62

Azure Cost Management and Billing 62

Other Products 63

Summary 64

Exam Essentials 64

Review Questions 66

Chapter 2 CREATE DATABASE dbName; GO 69

The Brainjammer 70

A Historical Look at Data 71

Variety 73

Velocity 74

Volume 74

Data Locations 74

Data File Formats 75

Data Structures, Types, and Concepts 83

Data Structures 83

Data Types and Management 92

Data Concepts 95

Data Programming and Querying for Data Engineers 125

Data Programming 126

Querying Data 143

Understanding Big Data Processing 169

Big Data Stages 169

ETL, ELT, ELTL 174

Analytics Types 175

Big Data Layers 176

Summary 177

Exam Essentials 177

Review Questions 179

Part II Design and Implement Data Storage 181

Chapter 3 Data Sources and Ingestion 183

Where Does Data Come From? 185

Design a Data Storage Structure 189

Design an Azure Data Lake Solution 190

Recommended File Types for Storage 198

Recommended File Types for Analytical Queries 199

Design for Efficient Querying 200

Design for Data Pruning 203

Design a Folder Structure That Represents the Levels of Data Transformation 203

Design a Distribution Strategy 205

Design a Data Archiving Solution 206

Design a Partition Strategy 207

Design a Partition Strategy for Files 209

Design a Partition Strategy for Analytical Workloads 210

Design a Partition Strategy for Efficiency and Performance 211

Design a Partition Strategy for Azure Synapse Analytics 211

Identify When Partitioning Is Needed in Azure Data Lake Storage Gen2 212

Design the Serving/Data Exploration Layer 213

Design Star Schemas 214

Design Slowly Changing Dimensions 215

Design a Dimensional Hierarchy 219

Design a Solution for Temporal Data 220

Design for Incremental Loading 222

Design Analytical Stores 223

Design Metastores in Azure Synapse Analytics and Azure Databricks 224

The Ingestion of Data into a Pipeline 228

Azure Synapse Analytics 228

Azure Data Factory 268

Azure Databricks 275

Event Hubs and IoT Hub 301

Azure Stream Analytics 303

Apache Kafka for HDInsight 314

Migrating and Moving Data 316

Summary 317

Exam Essentials 317

Review Questions 319

Chapter 4 The Storage of Data 321

Implement Physical Data Storage Structures 322

Implement Compression 322

Implement Partitioning 325

Implement Sharding 328

Implement Different Table Geometries with Azure Synapse Analytics Pools 329

Implement Data Redundancy 331

Implement Distributions 341

Implement Data Archiving 342

Azure Synapse Analytics Develop Hub 346

Implement Logical Data Structures 360

Build a Temporal Data Solution 361

Build a Slowly Changing Dimension 365

Build a Logical Folder Structure 368

Build External Tables 369

Implement File and Folder Structures for Efficient Querying and Data Pruning 372

Implement a Partition Strategy 375

Implement a Partition Strategy for Files 376

Implement a Partition Strategy for Analytical Workloads 377

Implement a Partition Strategy for Streaming Workloads 378

Implement a Partition Strategy for Azure Synapse Analytics 378

Design and Implement the Data Exploration Layer 379

Deliver Data in a Relational Star Schema 379

Deliver Data in Parquet Files 385

Maintain Metadata 386

Implement a Dimensional Hierarchy 386

Create and Execute Queries by Using a Compute Solution That Leverages SQL Serverless and Spark Cluster 388

Recommend Azure Synapse Analytics Database Templates 389

Implement Azure Synapse Analytics Database Templates 389

Additional Data Storage Topics 390

Storing Raw Data in Azure Databricks for Transformation 390

Storing Data Using Azure HDInsight 392

Storing Prepared, Trained, and Modeled Data 393

Summary 394

Exam Essentials 395

Review Questions 396

Part III Develop Data Processing 399

Chapter 5 Transform, Manage, and Prepare Data 401

Ingest and Transform Data 402

Transform Data Using Azure Synapse Pipelines 404

Transform Data Using Azure Data Factory 410

Transform Data Using Apache Spark 414

Transform Data Using Transact-SQL 429

Transform Data Using Stream Analytics 431

Cleanse Data 433

Split Data 435

Shred JSON 439

Encode and Decode Data 445

Configure Error Handling for the Transformation 450

Normalize and Denormalize Values 451

Transform Data by Using Scala 461

Perform Exploratory Data Analysis 463

Transformation and Data Management Concepts 473

Transformation 473

Data Management 480

Azure Databricks 481

Data Modeling and Usage 485

Data Modeling with Machine Learning 486

Usage 494

Summary 500

Exam Essentials 500

Review Questions 502

Chapter 6 Create and Manage Batch Processing and Pipelines 505

Design and Develop a Batch Processing Solution 507

Design a Batch Processing Solution 510

Develop Batch Processing Solutions 512

Create Data Pipelines 538

Handle Duplicate Data 560

Handle Missing Data 569

Handle Late-Arriving Data 571

Upsert Data 572

Configure the Batch Size 578

Configure Batch Retention 581

Design and Develop Slowly Changing Dimensions 582

Design and Implement Incremental Data Loads 583

Integrate Jupyter/IPython Notebooks into a Data Pipeline 590

Revert Data to a Previous State 591

Handle Security and Compliance Requirements 592

Design and Create Tests for Data Pipelines 593

Scale Resources 593

Design and Configure Exception Handling 593

Debug Spark Jobs Using the Spark UI 594

Implement Azure Synapse Link and Query the Replicated Data 594

Use PolyBase to Load Data to a SQL Pool 595

Read from and Write to a Delta Table 595

Manage Batches and Pipelines 596

Trigger Batches 597

Schedule Data Pipelines 597

Validate Batch Loads 598

Implement Version Control for Pipeline Artifacts 604

Manage Data Pipelines 607

Manage Spark Jobs in a Pipeline 609

Handle Failed Batch Loads 610

Summary 610

Exam Essentials 611

Review Questions 612

Chapter 7 Design and Implement a Data Stream Processing Solution 615

Develop a Stream Processing Solution 617

Design a Stream Processing Solution 618

Create a Stream Processing Solution 630

Process Time Series Data 657

Design and Create Windowed Aggregates 658

Process Data Within One Partition 661

Process Data Across Partitions 663

Upsert Data 665

Handle Schema Drift 674

Configure Checkpoints/Watermarking During Processing 680

Replay Archived Stream Data 685

Design and Create Tests for Data Pipelines 688

Monitor for Performance and Functional Regressions 689

Optimize Pipelines for Analytical or Transactional Purposes 689

Scale Resources 690

Design and Configure Exception Handling 691

Handle Interruptions 694

Ingest and Transform Data 694

Transform Data Using Azure Stream Analytics 694

Monitor Data Storage and Data Processing 695

Monitor Stream Processing 695

Summary 695

Exam Essentials 696

Review Questions 697

Part IV Secure, Monitor, and Optimize Data Storage and Data Processing 699

Chapter 8 Keeping Data Safe and Secure 701

Design Security for Data Policies and Standards 702

Design a Data Auditing Strategy 711

Design a Data Retention Policy 716

Design for Data Privacy 717

Design to Purge Data Based on Business Requirements 719

Design Data Encryption for Data at Rest and in Transit 719

Design Row-Level and Column-Level Security 722

Design a Data Masking Strategy 723

Design Access Control for Azure Data Lake Storage Gen2 724

Implement Data Security 730

Implement a Data Auditing Strategy 731

Manage Sensitive Information 739

Implement a Data Retention Policy 745

Encrypt Data at Rest and in Motion 748

Implement Row-Level and Column-Level Security 749

Implement Data Masking 753

Manage Identities, Keys, and Secrets Across Different Data Platform Technologies 755

Implement Access Control for Azure Data Lake Storage Gen2 765

Implement Secure Endpoints (Private and Public) 772

Implement Resource Tokens in Azure Databricks 778

Load a DataFrame with Sensitive Information 779

Write Encrypted Data to Tables or Parquet Files 780

Develop a Batch Processing Solution 781

Handle Security and Compliance Requirements 782

Design and Implement the Data Exploration Layer 784

Browse and Search Metadata in Microsoft Purview Data Catalog 784

Push New or Updated Data Lineage to Microsoft Purview 785

Summary 786

Exam Essentials 787

Review Questions 789

Chapter 9 Monitoring Azure Data Storage and Processing 791

Monitoring Data Storage and Data Processing 793

Implement Logging Used by Azure Monitor 793

Configure Monitoring Services 799

Understand Custom Logging Options 821

Measure Query Performance 822

Monitor Data Pipeline Performance 823

Monitor Cluster Performance 824

Measure Performance of Data Movement 824

Interpret Azure Monitor Metrics and Logs 825

Monitor and Update Statistics about Data Across a System 828

Schedule and Monitor Pipeline Tests 830

Interpret a Spark Directed Acyclic Graph 830

Monitor Stream Processing 832

Implement a Pipeline Alert Strategy 832

Develop a Batch Processing Solution 832

Design and Create Tests for Data Pipelines 832

Develop a Stream Processing Solution 837

Monitor for Performance and Functional Regressions 837

Design and Create Tests for Data Pipelines 838

Azure Monitoring Overview 841

Azure Batch 841

Azure Key Vault 842

Azure SQL 843

Summary 844

Exam Essentials 844

Review Questions 846

Chapter 10 Troubleshoot Data Storage Processing 849

Optimize and Troubleshoot Data Storage and Data Processing 851

Optimize Resource Management 854

Compact Small Files 857

Handle Skew in Data 859

Handle Data Spill 860

Find Shuffling in a Pipeline 862

Tune Shuffle Partitions 864

Tune Queries by Using Indexers 869

Tune Queries by Using Cache 876

Optimize Pipelines for Analytical or Transactional Purposes 877

Optimize Pipeline for Descriptive versus Analytical Workloads 886

Troubleshoot a Failed Spark Job 888

Troubleshoot a Failed Pipeline Run 890

Rewrite User-Defined Functions 899

Design and Develop a Batch Processing Solution 901

Design and Configure Exception Handling 902

Debug Spark Jobs by Using the Spark UI 902

Scale Resources 902

Monitor Batches and Pipelines 904

Handle Failed Batch Loads 904

Design and Develop a Stream Processing Solution 905

Optimize Pipelines for Analytical or Transactional Purposes 905

Handle Interruptions 906

Scale Resources 908

Summary 909

Exam Essentials 910

Review Questions 912

Appendix Answers to Review Questions 915

Chapter 1: Gaining the Azure Data Engineer Associate Certification 916

Chapter 2: CREATE DATABASE dbName; GO 916

Chapter 3: Data Sources and Ingestion 917

Chapter 4: The Storage of Data 918

Chapter 5: Transform, Manage, and Prepare Data 918

Chapter 6: Create and Manage Batch Processing and Pipelines 919

Chapter 7: Design and Implement a Data Stream Processing Solution 920

Chapter 8: Keeping Data Safe and Secure 921

Chapter 9: Monitoring Azure Data Storage and Processing 921

Chapter 10: Troubleshoot Data Storage Processing 922

Index 925

Authors

Benjamin Perkins