22-25 April 2026

Azure Databricks: Engineering Vs Data Science

2019

TL;DR

Azure Databricks can be used for both data engineering and data science. This session is led by two Microsoft MVPs facing off: Engineer vs Scientist. The session is half how to build data pipelines and half how to do machine learning at scale.

Session Details

Have you looked at Azure Databricks yet? Not yet? Then you need to. Why, you ask? There are many reasons, but number one is this: knowing how to use Apache Spark will earn you more money. It is that simple. Data Engineers and Data Scientists who know Apache Spark are in demand! This workshop is designed to introduce you to the skills required to do both.

In the morning we will introduce Azure Databricks, then discuss how to develop in-memory, elastically scalable data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python (or possibly some other stuff we are not allowed to tell you about yet). The engineering element will be delivered by UK MVP Simon Whiteley. Simon has been deploying engineering projects with Azure Databricks since it was announced and has real-world experience in multiple environments.

Then we will shift gears: we will take the data we moved and cleansed and apply distributed machine learning at scale. We will train a model, productionise it, and then enrich our data with our newly predicted values. The data science element will be led by UK MVP Terry McCann. Terry holds an MSc in Data Science and has been working with Apache Spark for the last five years. He is dedicated to applying engineering practices to data science to make model development, training and scoring as easy and as automated as possible.

By the end of the day, you will understand how Azure Databricks supports both data engineering and data science, leveraging Apache Spark to deliver blisteringly fast data pipelines and distributed machine learning models. Bring your laptop, as this will be hands-on.

Pre-requisites
An understanding of data processing, either ETL or ELT, either on-premises or in a big data environment. A basic level of machine learning knowledge would also be beneficial, but is not critical.
Laptop Required: Yes

  • Software: In the session we will be using Azure Databricks. We will have labs and demos that you can follow if you want to. If you do, you will need the following:
      - An Azure subscription
      - Credit on the Azure subscription
      - Enough access on the subscription to create service principals
      - Azure Storage Explorer
      - PowerShell
  • Subscriptions: Azure

Speakers

Simon Whiteley

advancinganalytics.co.uk/blog

Simon Whiteley's previous sessions

Behind the Hype - Architecture Trends in Data
Seasoned Data Engineer and YouTube grumbler Simon Whiteley takes us on a journey through the current industry trends and buzzwords, carving through the hype to get at the underlying ideals. Which is going to last and which is a sales gimmick? Which bandwagon might actually take you in the right strategic direction?
 
Nose-Dive Narratives: Slide Karaoke 2024
Get ready to wrap up a serious day of learning with a dash of humor, spontaneity, and friendly competition! SQLBits presents "Slide Karaoke" where SQLBits speakers reveal their hidden talents while vying for bragging rights. This session promises to be a one-of-a-kind experience that will leave you in stitches and awe, and the speakers scrambling for their non-existent notes!
 
Behind the Hype - Architecture Trends in Data
In this session, seasoned data engineer and YouTube grumbler Simon Whiteley takes us on a journey through the current industry trends and buzzwords, carving through the hype to get at the underlying ideals.
 
Building a Lakehouse on the Microsoft Intelligent Data Platform
This session aims to give you that context. We'll look at how Spark-based engines work and how we can use them within Synapse Analytics. We'll dig into Delta, the underlying file format that enables the Lakehouse, and take a tour of how the Synapse compute engines interact with it. Finally, we'll draw out our whole Lakehouse architecture.
 
Bringing Data Lakes to your Purview
A short, fast dive into the specific elements of Azure Purview that work well with data lakes, and how you can implement them yourself.
 
Value-Driven Analytics Development
Ever spent an age releasing a data model, only to find no-one uses it? There's a better way of working, driven by both technology and agile working practices; let me tell you about Value-Driven Development and DataOps.
 
Databricks, Delta Lake and You
Databricks, Lakes & Parquet are a match made in heaven, but explode with extra power when using Delta Lake. This session will dive into the details of how Databricks Delta works and how to make the most of it.
 
The Azure Spark Showdown - Databricks VS Synapse Analytics
Azure now has two slick, platform-as-a-service Spark offerings, but which one should you choose? A separate specialist tool or a one-size-fits-all solution? Join Simon as he compares and contrasts the Spark offerings.
 
Azure SQL DataWarehouse: 0-100 (DWUs)
Azure SQLDW - WHAT, WHERE, WHEN and HOW to use it.