Data Engineering on Azure

Build a data platform to the industry-leading standards set by Microsoft’s own infrastructure.

Summary
In Data Engineering on Azure you will learn how to:

    Pick the right Azure services for different data scenarios
    Manage data inventory
    Implement production-quality data modeling, analytics, and machine learning workloads
    Handle data governance
    Use DevOps to increase reliability
    Ingest, store, and distribute data
    Apply best practices for compliance and access control

Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft’s own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify.

About the book
In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide includes clear, practical guidance for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.

What's inside

    Data inventory and data governance
    Assure data quality, compliance, and distribution
    Build automated pipelines to increase reliability
    Ingest, store, and distribute data
    Production-quality data modeling, analytics, and machine learning

About the reader
For data engineers familiar with cloud computing and DevOps.

About the author
Vlad Riscutia is a software architect at Microsoft.

Table of Contents

1 Introduction
PART 1 INFRASTRUCTURE
2 Storage
3 DevOps
4 Orchestration
PART 2 WORKLOADS
5 Processing
6 Analytics
7 Machine learning
PART 3 GOVERNANCE
8 Metadata
9 Data quality
10 Compliance
11 Distributing data

eBook

$36.99

Product Details

ISBN-13:	9781638356912
Publisher:	Manning
Publication date:	09/21/2021
Sold by:	SIMON & SCHUSTER
Format:	eBook
Pages:	336
File size:	8 MB

Preface
Acknowledgments
About this book
About the author
About the cover illustration

1 Introduction
    1.1 What is data engineering?
    1.2 Who this book is for
    1.3 What is a data platform?
        Anatomy of a data platform
        Infrastructure as code, codeless infrastructure
    1.4 Building in the cloud
        IaaS, PaaS, SaaS
        Network, storage, compute
        Getting started with Azure
        Interacting with Azure
    1.5 Implementing an Azure data platform

Part 1 Infrastructure
2 Storage
    2.1 Storing data in a data platform
        Storing data across multiple data fabrics
        Having a single source of truth
    2.2 Introducing Azure Data Explorer
        Deploying an Azure Data Explorer cluster
        Using Azure Data Explorer
        Working around query limits
    2.3 Introducing Azure Data Lake Storage
        Creating an Azure Data Lake Storage account
        Using Azure Data Lake Storage
        Integrating with Azure Data Explorer
    2.4 Ingesting data
        Ingestion frequency
        Load type
        Restatements and reloads

3 DevOps
    3.1 What is DevOps?
        DevOps in data engineering
    3.2 Introducing Azure DevOps
        Using the az azure-devops extension
    3.3 Deploying infrastructure
        Exporting an Azure Resource Manager template
        Creating Azure DevOps service connections
        Deploying Azure Resource Manager templates
        Understanding Azure Pipelines
    3.4 Deploying analytics
        Using Azure DevOps marketplace extensions
        Storing everything in Git; deploying everything automatically

4 Orchestration
    4.1 Ingesting the Bing COVID-19 open dataset
    4.2 Introducing Azure Data Factory
        Setting up the data source
        Setting up the data sink
        Setting up the pipeline
        Setting up a trigger
        Orchestrating with Azure Data Factory
    4.3 DevOps for Azure Data Factory
        Deploying Azure Data Factory from Git
        Setting up access control
        Deploying the production data factory
        DevOps for the Azure Data Factory recap
    4.4 Monitoring with Azure Monitor

Part 2 Workloads
5 Processing
    5.1 Data modeling techniques
        Normalization and denormalization
        Data warehousing
        Semistructured data
        Data modeling recap
    5.2 Identity keyrings
        Building an identity keyring
        Understanding keyrings
    5.3 Timelines
        Building a timeline view
        Using timelines
    5.4 Continuous data processing
        Tracking processing functions in Git
        Keyring building in Azure Data Factory
        Scaling out

6 Analytics
    6.1 Structuring storage
        Providing development data
        Replicating production data
        Providing read-only access to the production data
        Storage structure recap
    6.2 Analytics workflow
        Prototyping
        Development and user acceptance testing
        Production
        Analytics workflow recap
    6.3 Self-serve data movement
        Support model
        Data contracts
        Pipeline validation
        Postmortems
        Self-serve data movement recap

7 Machine learning
    7.1 Training a machine learning model
        Training a model using scikit-learn
        High spender model implementation
    7.2 Introducing Azure Machine Learning
        Creating a workspace
        Creating an Azure Machine Learning compute target
        Setting up Azure Machine Learning storage
        Running ML in the cloud
        Azure Machine Learning recap
    7.3 MLOps
        Deploying from Git
        Storing pipeline IDs
        DevOps for Azure Machine Learning recap
    7.4 Orchestrating machine learning
        Connecting Azure Data Factory with Azure Machine Learning
        Machine learning orchestration
        Orchestrating recap

Part 3 Governance
8 Metadata
    8.1 Making sense of the data
    8.2 Introducing Azure Purview
    8.3 Maintaining a data inventory
        Setting up a scan
        Browsing the data dictionary
        Data dictionary recap
    8.4 Managing a data glossary
        Adding a new glossary term
        Curating terms
        Custom templates and bulk import
        Data glossary recap
    8.5 Understanding Azure Purview's advanced features
        Tracking lineage
        Classification rules
        REST API
        Advanced features recap

9 Data quality
    9.1 Testing data
        Availability tests
        Correctness tests
        Completeness tests
        Detecting anomalies
        Testing data recap
    9.2 Running data quality checks
        Testing using Azure Data Factory
        Executing tests
        Creating and using a template
        Running data quality checks recap
    9.3 Scaling out data testing
        Supporting multiple data fabrics
        Testing at rest and during movement
        Authoring tests
        Storing tests and results

10 Compliance
    10.1 Data classification
        Feature data
        Telemetry
        User data
        User-owned data
        Business data
        Data classification recap
    10.2 Changing classification through processing
        Aggregation
        Anonymization
        Pseudonymization
        Masking
        Processing classification changes recap
    10.3 Implementing an access model
        Security groups
        Securing Azure Data Explorer
        Access model recap
    10.4 Complying with GDPR and other considerations
        Data handling
        Data subject requests
        Other considerations

11 Distributing data
    11.1 Data distribution overview
    11.2 Building a data API
        Introducing Azure Cosmos DB
        Populating the Cosmos DB collection
        Retrieving data
        Data API recap
    11.3 Serving machine learning
    11.4 Sharing data for bulk copy
        Separating compute resources
        Introducing Azure Data Share
        Sharing data for bulk copy recap
    11.5 Data sharing best practices

Appendix A Azure services
Appendix B KQL quick reference
Appendix C Running code samples
Index
