How to Build Efficient Data Pipelines for Machine Learning

Building Data Pipelines for ML

Presented By Suraj Bang
Solutions Architect at Qubole

Suraj Bang is a Solutions Architect at Qubole, where he brings over 13 years of experience in data analytics and engineering to help customers on their big data journey. He has subject matter expertise in building big data applications with Apache Spark and other open-source technologies such as Airflow and Apache Zeppelin. Prior to Qubole, Suraj worked as a Data Engineering Lead, building big data applications for financial, retail, and insurance organizations. He enjoys being outdoors and loves biking; when indoors, he enjoys building Alexa apps.

Presentation Description

Companies now need to apply machine learning (ML) techniques to their data in order to remain relevant. Among the new challenges faced by data scientists is the need to build pipelines that provide access to large data sets, so that trained models can scale to run with production data.

Aside from handling larger data volumes, these pipelines need to be flexible enough to accommodate the variety of data and the high processing velocity required by new ML applications. Apache Airflow and Apache Spark address these challenges, providing scalable orchestration and processing on autoscaling big data engines.

In this presentation we will cover:

- Some of the typical challenges faced by data scientists when building pipelines for machine learning.

- Typical uses of the various big data engines to address these challenges.

- A real-world example using Apache Spark and Airflow to operationalize a recommendation engine.
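To make the last agenda item concrete, here is a minimal sketch of the stages such a pipeline might chain together. All names and the toy co-occurrence "model" are hypothetical stand-ins for illustration only; in the talk's setting, each function would be an Airflow task submitting a Spark job rather than plain Python.

```python
# Illustrative sketch of pipeline stages an Airflow DAG could orchestrate
# for a recommendation engine. The task names and the co-occurrence
# scoring are hypothetical, not the presenters' actual implementation.
from collections import defaultdict

def extract(raw_events):
    """Parse raw "user,item" event strings into (user, item) pairs."""
    return [tuple(line.split(",")) for line in raw_events]

def train(pairs):
    """Toy item-to-item model: two items are related when the same
    user interacted with both (a stand-in for e.g. Spark MLlib ALS)."""
    by_user = defaultdict(set)
    for user, item in pairs:
        by_user[user].add(item)
    related = defaultdict(set)
    for items in by_user.values():
        for a in items:
            related[a] |= items - {a}
    return related

def publish(model):
    """Materialize sorted recommendation lists for serving."""
    return {item: sorted(others) for item, others in model.items()}

# Chaining the stages mirrors an extract >> train >> publish DAG.
events = ["u1,apples", "u1,bread", "u2,bread", "u2,milk"]
recs = publish(train(extract(events)))
# recs["bread"] == ["apples", "milk"]
```

In a production Airflow deployment, the dependency chain would be declared in a DAG file and each stage could run as a Spark job sized to the data, which is where autoscaling big data engines come into play.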

Presentation Curriculum

Building Data Pipelines for ML