Architecting a predictive, petabyte-scale, self-learning fraud detection system
Fraud detection is a classic adversarial analytics challenge: as soon as an automated system learns to stop one scheme, fraudsters move on to attack another way. Each scheme requires looking for different signals (i.e., features), is relatively rare (one in millions in finance or e-commerce), and may take months to investigate per case (in healthcare or tax, for example), all of which makes quality training data scarce.
This talk covers key lessons learned while building such real-world software systems over the past few years. We’ll look for fraud signals in public email datasets, using popular Python-based open-source data science libraries to generate graph-based, rule-based, language-based, and time-series-based features, tied together with ensemble learning algorithms.
Apache Spark is used to run these models at scale – in batch mode for model training and with Spark Streaming for production use. We’ll discuss the data model, computation, and feedback workflows, as well as some tools and libraries built on top of the open-source components to enable faster experimentation, optimization and productization.
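To make the four feature families mentioned above more concrete, here is a minimal sketch in plain Python. It is not the system presented in the talk: the toy emails, field names, term list, and thresholds are all hypothetical, and a simple unweighted vote over hand-written signals stands in for the learned ensemble algorithms the abstract refers to.

```python
from datetime import datetime

# Hypothetical toy emails; every field name and value is made up for illustration.
EMAILS = [
    {"sender": "alice@example.com", "recipients": ["bob@example.com"],
     "body": "Lunch at noon?", "sent": datetime(2024, 1, 8, 12, 5)},
    {"sender": "mallory@example.com",
     "recipients": ["bob@example.com", "carol@example.com", "dave@example.com"],
     "body": "Urgent: wire the funds and confirm password",
     "sent": datetime(2024, 1, 9, 3, 12)},
]

# Assumed word list, not taken from any real fraud lexicon.
SUSPICIOUS_TERMS = {"urgent", "wire", "password", "confidential"}


def build_out_degree(emails):
    """Graph-based helper: distinct recipients contacted by each sender."""
    contacts = {}
    for e in emails:
        contacts.setdefault(e["sender"], set()).update(e["recipients"])
    return {sender: len(rcpts) for sender, rcpts in contacts.items()}


def graph_feature(email, out_degree):
    """Graph-based: out-degree of the sender in the sender->recipient graph."""
    return out_degree[email["sender"]]


def rule_feature(email):
    """Rule-based: 1 if a single message fans out to three or more recipients."""
    return int(len(email["recipients"]) >= 3)


def language_feature(email):
    """Language-based: fraction of tokens that appear in SUSPICIOUS_TERMS."""
    tokens = [t.strip(".,:!?").lower() for t in email["body"].split()]
    return sum(t in SUSPICIOUS_TERMS for t in tokens) / max(len(tokens), 1)


def time_feature(email):
    """Time-based: 1 if the message was sent outside 08:00-18:00."""
    return int(not 8 <= email["sent"].hour < 18)


def ensemble_score(email, out_degree):
    """Unweighted vote: fraction of signals that flag the email (0.0 to 1.0)."""
    votes = [
        graph_feature(email, out_degree) >= 3,  # assumed threshold
        rule_feature(email) == 1,
        language_feature(email) > 0.2,          # assumed threshold
        time_feature(email) == 1,
    ]
    return sum(votes) / len(votes)
```

A short usage example, scoring each toy email:

```python
out_degree = build_out_degree(EMAILS)
for e in EMAILS:
    print(e["sender"], ensemble_score(e, out_degree))
```

In a real pipeline the thresholds and the way signals are combined would be learned from labeled cases rather than fixed by hand, and the per-email functions would be applied at scale rather than in a Python loop.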