Thursday, June 16, 2011

Review of An Introduction to Machine Learning with Web Data by Hilary Mason

An Introduction to Machine Learning with Web Data by Hilary Mason; O'Reilly Media.

The video itself is presented in five sections: (1) Introduction, (2) Classifying Web Documents - The Theory, (3) Classifying Web Documents - The Code, (4) Clustering, Recommendations, and Probability, and (5) Conclusion. In short, Hilary uses web-based data to show the audience how to solve problems they may have by applying basic machine learning techniques. The video is particularly directed at programmers who do not have statistical training.
The viewer sits in with a small group of students and gets the feel of an intimate classroom setting. For me, a video in which you can re-watch segments, stop to look up a suggested resource, or pause to experiment with a variant of the code is both helpful and handy. For example, in the introduction, Hilary references a link (http://bit.ly/9RYQEF) that explains the concept of "data science." At that point, I paused the video, browsed to the link, and found it quite informative. In the second section, Classifying Web Documents - The Theory, the audience is gently introduced to statistical techniques such as naive Bayes and shown, step by step, how the math is applied. In the third section, Classifying Web Documents - The Code, the participant uses Python code and the New York Times API to classify words from the New York Times web site. In the Clustering, Recommendations, and Probability section, the viewer is taken through code that demonstrates how to take data about which little is known and learn from the clustering results. Finally, the conclusion section deals briefly with the concepts of probability and then reviews the entire session's content.
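To give a flavor of the naive Bayes technique mentioned above, here is a minimal sketch in Python. It is not the code from the video or from Hilary's repository: the function names `train` and `classify`, the add-one smoothing, and the tiny made-up training set are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label; labeled_docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-posterior under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior: the share of training docs with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch.
docs = [
    ("stocks fell as markets closed", "business"),
    ("the bank raised interest rates", "business"),
    ("the team won the final game", "sports"),
    ("the striker scored a late goal", "sports"),
]
word_counts, label_counts = train(docs)
print(classify("markets and interest rates", word_counts, label_counts))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer documents.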
While being able to navigate around in Python is beneficial, and following along by running the code helps one learn and retain more, the participant can also simply watch the video, since both the code and the concepts are displayed and explained. What is nice is that Hilary provides the code used in the video in her Git repository at https://github.com/hmason/ml_class. Viewers who want to follow along will need to make sure that they have the proper Python modules installed.
In conclusion, a software developer who has little more than the required college statistics class would do well to purchase this video. Seeing code actually applied to the basic statistical algorithms is extremely informative, and the techniques are applicable in various problem domains.