Speaker "Alex Sergeev" Details
Name
Alex Sergeev
Company
Uber Technologies, Inc.
Designation
Staff Software Engineer
Topic
Distributed Deep Learning with Horovod
Abstract
Learn how to scale distributed training of TensorFlow, PyTorch, and Apache MXNet models with Horovod, a library designed to make distributed training fast and easy to use. Although frameworks like TensorFlow, PyTorch, and Apache MXNet simplify the design and training of deep learning models, difficulties usually arise when scaling models to multiple GPUs in a server or multiple servers in a cluster. We'll explain the role of Horovod in taking a model designed on a single GPU and training it on a cluster of GPU servers.
Who is this presentation for?
Deep learning engineers, ML infrastructure engineers, technical decision makers
Prerequisite knowledge
TensorFlow, PyTorch, or Apache MXNet
What will you learn?
Approaches for scaling deep learning training, and how to improve the separation of concerns between deep learning engineers and ML infrastructure teams
Profile
Alex Sergeev is a staff engineer at Uber working on scalable deep learning. Previously, he was a senior engineer at Microsoft working on big data mining. He received his master's degree in computer science from National Research Nuclear University's Moscow Engineering Physics Institute.