
Speaker "Osman Sarood" Details

Name: Osman Sarood
Company: Mist Systems
Designation: Member Technical Staff
Topic: How to Cost Effectively and Reliably Build Infrastructure for Machine Learning
Abstract
Mist Systems consumes several terabytes of telemetry data every day from its wireless access points (APs) deployed all over the world. A significant portion of this telemetry is consumed by our machine learning algorithms, which are essential for the smooth operation of some of the world’s largest WiFi deployments. At Mist, we apply machine learning to incoming telemetry data to detect and attribute anomalies, a non-trivial problem that requires exploring multiple dimensions. Although our infrastructure is small compared to that of the tech giants, it is growing very rapidly. Last year, our infrastructure grew 10X, taking our annual AWS cost over $1 million. In this talk, we present how we kept our annual cost to $1 million rather than $3 million (a reduction of roughly 66%) by using AWS spot instances while keeping our infrastructure reliable.

Attendees will learn:
1. How to select the right EC2 instance types, i.e., compute-intensive versus memory-intensive
2. How much over-provisioning (extra capacity) is needed to ensure reliability
3. The impact of different types of applications, i.e., stateless and stateful, on 1 and 2 above
4. Key aspects of building real-time applications that run reliably on top of spot instances
5. How to monitor real-time applications in the presence of a high number of server faults due to spot instance terminations

The talk includes a demonstration of terminating random production hosts, showing:
- How we detect when a machine is terminated
- How applications running on terminated hosts can recover seamlessly
- A visualization of the impact on all the applications running on terminated hosts