Speaker "Dan Steinberg" Details
-
Name
Dan Steinberg
-
Company
Salford Systems
-
Designation
CEO
Topic
Random Forests: How a Chance Driven Learning Machine Does So Spectacularly Well on Marketing Datasets
Abstract
RF is a next-generation learning machine based on partially or totally randomly generated decision trees. Each individual tree is a poor predictor of the target, but the ensemble of a large number of these trees has proven good enough to win Kaggle competitions and solve many real-world problems. This talk presents an overview of the core ideas behind the Random Forest, illustrates its predictive power on a competition data set, explains why the technology works, and discusses its strengths and weaknesses and the types of problems for which RF is best suited.

1. What is a Random Forest? How randomness is incorporated into the learning process by repeatedly training on different rows of data and by considering different subsets of features at each decision node in a tree.
• Bootstrap samples
• OOB (“Out of Bag”): records included in and records excluded from the training of a specific tree
• Selecting predictive features at random, and how we vary the degree of randomness from none to total
2. What kinds of problems can RF be used to solve? Classification, regression, clustering, outlier and anomaly detection.
3. How to set up an RF model: what controls really matter. Number of features considered at each decision node, size of the training sample extracted from the master database, tree size limits.
• Adapting RF to Big Data via very small sample extracts
4. Binary classification example: predicting credit risk (who repays their loan and who does not). What we learn from RF that we would not learn from other learning machines.
5. Why RF works. The wisdom of crowds applied to trees. Trees, unlike mathematically formulated models, are nothing more than selective descriptions of data. Each tree offers a differently cast description of the data and incorporates a form of nearest neighbor classifier. With sufficient RF trees the average predictive accuracy can become better than that typically delivered by a professional statistician.
6. Parallel RF. Easy ways to get RF models computed rapidly.
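
The two sources of randomness named in item 1 of the outline can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names and the `mtry` parameter (the number of features a node may consider) are hypothetical labels chosen for illustration, and the data sizes are arbitrary.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement for one tree.

    Returns (in_bag, out_of_bag): the rows the tree trains on, and the
    rows it never saw (the "Out of Bag" records for that tree).
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag

def candidate_features(n_features, mtry, rng):
    """Pick the features a single decision node is allowed to consider.

    mtry == n_features removes feature randomness entirely;
    mtry == 1 makes the choice of split feature totally random.
    """
    return sorted(rng.sample(range(n_features), mtry))

rng = random.Random(0)           # fixed seed so the sketch is repeatable
n_rows, n_features = 1000, 20    # illustrative sizes, not from the talk

in_bag, oob = bootstrap_split(n_rows, rng)
oob_fraction = len(oob) / n_rows
node_features = candidate_features(n_features, mtry=4, rng=rng)

print(f"OOB fraction for this tree: {oob_fraction:.3f}")
print("features considered at one node:", node_features)
```

With a bootstrap sample the same size as the data, the expected share of records left out of a given tree is roughly (1 - 1/n)^n, which approaches about 37% as n grows; those records serve as a built-in test set for that tree.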