
Speaker "Dan Steinberg" Details


Topic

Random Forests: How a Chance Driven Learning Machine Does So Spectacularly Well on Marketing Datasets

Abstract

RF is a next-generation learning machine based on partially or totally randomly generated decision trees. Each individual tree is a poor predictor of the target, but an ensemble of a large number of these trees has proven good enough to win Kaggle competitions and solve many real-world problems. This talk presents an overview of the core ideas behind the Random Forest, illustrates its predictive power on a competition data set, explains why the technology works, and discusses its strengths and weaknesses and the types of problems for which RF is best suited.

1. What is a Random Forest? How randomness is incorporated into the learning process by repeatedly training on different rows of data and by considering different subsets of features at each decision node in a tree.
• Bootstrap samples
• OOB ("Out of Bag"): records included in, and records excluded from, the training of a specific tree
• Selecting predictive features at random, and how we vary the degree of randomness from none to total

2. What kinds of problems can RF be used to solve? Classification, regression, clustering, outlier and anomaly detection.

3. How to set up an RF model: which controls really matter. Number of features considered at each decision node, size of the training sample extracted from the master database, tree size limits.
• Adapting RF to Big Data via very small sample extracts

4. Binary classification example: predicting credit risk, that is, who repays their loan and who does not. What we learn from RF that we would not learn from other learning machines.

5. Why RF works: the wisdom of crowds applied to trees. Trees, unlike mathematically formulated models, are nothing more than selective descriptions of data. Each tree offers a differently cast description of the data and incorporates a form of nearest-neighbor classifier. With sufficiently many RF trees, the average predictive accuracy can become better than that typically delivered by a professional statistician.

6. Parallel RF. Easy ways to get RF models computed rapidly.
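The ingredients listed above can be illustrated in a short sketch. This is not the speaker's code, only a minimal example using scikit-learn's RandomForestClassifier on a synthetic dataset: `max_features` is the per-node random feature subset, `max_samples` is the bootstrap extract drawn for each tree, `oob_score=True` evaluates each tree on its out-of-bag records, and `n_jobs=-1` trains the independent trees in parallel.

```python
# Sketch of the Random Forest controls discussed in the abstract,
# using scikit-learn on a synthetic binary-classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a marketing / credit-risk dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,     # number of randomly grown trees in the ensemble
    max_features="sqrt",  # features considered at each decision node
    bootstrap=True,
    max_samples=0.5,      # small bootstrap extract per tree (Big Data tactic)
    oob_score=True,       # score each tree on its out-of-bag records
    n_jobs=-1,            # trees are independent, so training parallelizes
    random_state=0,
)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Each individual tree here is a weak, high-variance predictor, but averaging 200 differently randomized trees yields the stable ensemble accuracy the OOB estimate reports, without needing a separate holdout sample.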

Profile

Dan Steinberg, Ph.D., CEO and Founder, Salford Systems

Dan Steinberg is CEO and founder of Salford Systems, the developer of the CART® decision tree, MARS® spline regression, TreeNet® gradient boosting, Breiman's RandomForests®, and other influential data mining technology. After earning a PhD in Econometrics at Harvard, Dan began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. His consulting experience at Salford Systems has included complex modeling projects for Fortune 100 clients including Citibank, Chase, American Express, Credit Suisse, and Johnson & Johnson, as well as projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan, and Brazil. Dan led the modeling teams that won first-place awards in the KDDCup 2000 and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. Dan has published papers in statistics, economics, econometrics, computer science, and marketing journals, and he has been a featured speaker on data mining issues for the American Marketing Association, the American Statistical Association, the Direct Marketing Association, and the Casualty Actuarial Society. Dan contributes actively to the ongoing research and development at Salford.