We feature speakers at the Global Data Science Conference (March 27–29, 2017) to catch up and find out what they are working on now and what's coming next. This week we're talking to Amit Sharma, Postdoctoral Researcher, Microsoft.
1. Tell us about yourself and your background.
I am a postdoctoral researcher at Microsoft, where I work on methods for generating robust insights from data. A critical challenge in my work is to separate correlation from causation. I got interested in these questions while working on recommender systems during graduate school. More often than not, I found that the metrics computed from observed data did not match the performance of a recommender system in practice. So I started to ask "why" and went down the rabbit hole of causal inference.
2. What have you been working on recently?
I have been working on developing a framework for doing causal inference from large-scale data. Supervised machine learning can model patterns to make accurate predictions about new data. However, often the most interesting questions involve some decision-making based on a model. This goes a step beyond prediction and asks what is the right decision to make given the available data. So we have to estimate the effect of hypothetical decisions, either by conducting randomized experiments or utilizing past data. I am developing scalable methods to do such inference.
For instance, consider a subscription service like Xbox or Spotify. We might find that the most satisfied users log in more frequently and have more friends than others, so any one of these features can be predictive of their future activity. In practice though, we would want to know which one we should focus on to increase user satisfaction: increasing individual activity or improving the social platform? Many other problems share this goal, such as deciding which algorithm to use, which strategy to follow, which medical treatments to administer or which social policy to choose. The fundamental problem, of course, is that we do not have any data on the effects of any of these decisions, so we need to uncover causal processes from available data to make such inferences.
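The gap between a predictive feature and a causal lever can be sketched with a toy simulation. In this made-up example (the numbers and the "engaged user" trait are invented for illustration, not taken from any real product), a hidden trait drives both login frequency and satisfaction, so a naive comparison overstates the effect of logins, while stratifying on the hidden trait recovers the small true effect:

```python
import random

random.seed(0)

# Toy simulation: a hidden trait ("engaged") drives both login frequency
# and satisfaction, confounding the naive comparison.
n = 100_000
rows = []
for _ in range(n):
    engaged = random.random() < 0.5
    logins_high = random.random() < (0.8 if engaged else 0.2)
    # The true causal effect of frequent logins on satisfaction is only +0.05.
    p_sat = (0.7 if engaged else 0.3) + (0.05 if logins_high else 0.0)
    rows.append((engaged, logins_high, random.random() < p_sat))

def sat_rate(rs):
    return sum(1 for r in rs if r[2]) / len(rs)

# Naive comparison: inflated by the confounder.
naive = (sat_rate([r for r in rows if r[1]])
         - sat_rate([r for r in rows if not r[1]]))

# Adjusted comparison: stratify by the confounder, then average the strata.
adjusted = 0.0
for e in (True, False):
    stratum = [r for r in rows if r[0] == e]
    diff = (sat_rate([r for r in stratum if r[1]])
            - sat_rate([r for r in stratum if not r[1]]))
    adjusted += (len(stratum) / n) * diff

print(f"naive effect:    {naive:.2f}")     # inflated by confounding
print(f"adjusted effect: {adjusted:.2f}")  # recovers roughly +0.05
```

Here we could adjust because the confounder was simulated and hence observed; the hard part of causal inference in practice is knowing which variables to adjust for, and whether they are measured at all.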
3. Tell me about the approach you take for solving problems with data.
I think one of the under-recognized aspects of data science is that it is very much like research. In research, often the most important part is finding the right problem to solve; solving it then follows from that. As John Dewey once said, "A problem well put is half solved". So my approach, even before I look at the data, is to ask: what is the goal? What is the right problem to solve for that goal? A great learning experience for me was when I was working on customer churn for an online service. A natural problem to solve is whether we can predict in advance which customers are likely to churn. But when I did that successfully and presented the model, I was asked how the model could be used to prevent churn. It turned out that my predictive model couldn't answer that. So I had to change the question to match the goal of the online service and consider alternative approaches.
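The churn story above can be made concrete with a small hypothetical simulation (the "support tickets" feature and all probabilities here are invented for illustration): a feature that is merely a symptom of a hidden cause predicts churn very well, yet intervening on it changes nothing.

```python
import random

random.seed(1)

# Hypothetical sketch: support tickets are a symptom of a hidden cause
# (dissatisfaction), not a cause of churn themselves.
n = 50_000

def simulate(suppress_tickets=False):
    rows = []
    for _ in range(n):
        unhappy = random.random() < 0.3                       # hidden cause
        raw_tickets = random.random() < (0.8 if unhappy else 0.1)
        tickets = raw_tickets and not suppress_tickets        # intervention point
        churn = random.random() < (0.6 if unhappy else 0.05)  # driven by the cause
        rows.append((tickets, churn))
    return rows

def churn_rate(rows):
    return sum(c for _, c in rows) / len(rows)

data = simulate()

# Tickets are strongly predictive of churn...
p_churn_tickets = churn_rate([r for r in data if r[0]])
p_churn_no_tickets = churn_rate([r for r in data if not r[0]])
print(f"P(churn | tickets)    = {p_churn_tickets:.2f}")
print(f"P(churn | no tickets) = {p_churn_no_tickets:.2f}")

# ...but an intervention that only removes tickets leaves churn unchanged,
# because the predictive model found a symptom, not a lever.
base = churn_rate(data)
intervened = churn_rate(simulate(suppress_tickets=True))
print(f"churn, baseline:           {base:.3f}")
print(f"churn, tickets suppressed: {intervened:.3f}")
```

This is why a model that predicts churn accurately can still be useless for preventing it: prediction and intervention answer different questions.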
4. Where are we now today in terms of the state of Data Science, and where do you think we’ll go over the next five years?
I think we are still in the early stages of data science. If you look at how data science has grown, it is almost the reverse of how scientific knowledge has historically been created: the hypothesis or "science" came first and then data was collected to test the hypothesis. Today we have increasing amounts of data, but the "science" part of how to make sense of this found data is still catching up. So I think we will see more focus on knowledge discovery in the next five years. I am especially excited about two directions that are gaining momentum. The first is the need for interpretable machine learning models, so that we better understand the decisions made by them. As data-powered tools start impacting societally critical domains such as education, health and governance, understanding how they come up with decisions becomes important. The second is in enabling the creation of new knowledge from data: how to use data to both generate and test new hypotheses, so that we are creating generalizable knowledge that would not have been possible otherwise.
5. What are some of the best takeaways that the attendees can have from your "Causal inference in data science" talk?
The key message of my talk is the importance of causal thinking when working with data. I will present examples from recommender systems, search engines and social networking platforms to show that results from machine learning can be counter-productive, giving us a false sense of confidence when the true answers are entirely different. One of the best examples of this is Simpson's paradox, where merely conditioning on a variable leads to a different conclusion than the unconditioned case. The answer to dealing with such questions is simple: start asking the "what-if" question while tackling any data science problem. I will show how, just by asking this counterfactual question, we can derive the right formulation for a problem and solve it. For more details, you can check out an extended version of the tutorial at http://blog.amitsharma.in/
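To see how Simpson's paradox plays out numerically, here is a small made-up example (the click counts and segment names are invented, not from the talk): algorithm A looks far better overall, yet algorithm B is better within every user segment, purely because A was shown mostly to loyal users, who click more regardless of algorithm.

```python
# Made-up click data illustrating Simpson's paradox.
#                       (clicks, impressions)
data = {
    "A": {"new": (10, 100),  "loyal": (810, 900)},
    "B": {"new": (120, 900), "loyal": (95, 100)},
}

def ctr(clicks, impressions):
    """Click-through rate for a (clicks, impressions) pair."""
    return clicks / impressions

for alg, segments in data.items():
    overall = ctr(sum(c for c, _ in segments.values()),
                  sum(i for _, i in segments.values()))
    per_seg = {s: round(ctr(*ci), 3) for s, ci in segments.items()}
    print(f"{alg}: overall CTR = {overall:.3f}, by segment = {per_seg}")

# A's overall CTR (0.820) beats B's (0.215), yet B beats A in both
# segments: "new" (0.133 vs 0.100) and "loyal" (0.950 vs 0.900).
```

Which number is the "right" one depends on the causal question being asked: if segment membership influences which algorithm a user sees, the segment-level comparison is the meaningful one, and the overall comparison is an artifact of the traffic mix.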