Week 17

Author: malm

Elements of Machine Learning II: Automated classifier optimisation

One of the core features of the Python sklearn machine learning framework is its provision of standard ‘estimator’ and ‘model’ interfaces for working with classification algorithms, or ‘classifiers’ as they are typically termed. Examples of sklearn classifiers include Support Vector Machines, Logistic Regression and Decision Trees. These can be treated as essentially interchangeable within a classification pipeline that might additionally include a ‘transformer’ interface for data modification. Without some form of automated optimisation, finding the best-performing classifier for a particular dataset and context can take a lot of trial and error.
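To make the idea of interchangeable classifiers concrete, here is a minimal pure-Python sketch of the fit/predict/transform convention. The classes MajorityClassifier, ThresholdClassifier and Scaler and the toy data are hypothetical stand-ins invented for illustration; they are not sklearn's implementations.

```python
# Toy illustration of the fit/predict/transform convention.
# All classes and data here are hypothetical, not part of sklearn.

class MajorityClassifier:
    """Always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier:
    """Predicts 1 when the first feature exceeds its training mean."""

    def fit(self, X, y):
        self.cut_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.cut_ else 0 for row in X]


class Scaler:
    """Toy 'transformer': divides every feature by the largest training value."""

    def fit(self, X, y=None):
        self.max_ = max(abs(v) for row in X for v in row) or 1.0
        return self

    def transform(self, X):
        return [[v / self.max_ for v in row] for row in X]


def accuracy(model, X, y):
    """Fraction of predictions that match the true labels."""
    predictions = model.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


X_train = [[1.0], [2.0], [3.0], [8.0], [9.0]]
y_train = [0, 0, 0, 1, 1]
X_test = [[0.5], [3.0], [7.0], [10.0]]
y_test = [0, 0, 1, 1]

# The same loop works for any object that offers fit() and predict().
scaler = Scaler().fit(X_train)
scores = {}
for clf in (MajorityClassifier(), ThresholdClassifier()):
    clf.fit(scaler.transform(X_train), y_train)
    scores[type(clf).__name__] = accuracy(clf, scaler.transform(X_test), y_test)
print(scores)
```

Because the loop only relies on fit() and predict(), swapping one classifier for another needs no other code changes, which is the property that makes automated comparison feasible.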

Enter TPOT, which is billed as “a python tool for automating data science”.  It’s a powerful utility built on top of Python sklearn functionality which handles the grunt work of determining the optimal pipeline for producing the best results from a dataset split into training and test data.  You can think of it as a smarter, assisted version of GridSearchCV:

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
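For contrast, the exhaustive style of search that a grid search performs can be sketched in a few lines of plain Python. This is only an illustration of the idea: the search space, the classifier names and the mock_score function are all assumptions made up for the example, and no real model is trained.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
grid = {
    "classifier": ["logistic_regression", "decision_tree", "svm"],
    "C": [0.1, 1.0, 10.0, 100.0],
}


def mock_score(classifier, C):
    """Stand-in for a cross-validated accuracy; NOT a real model evaluation."""
    base = {"logistic_regression": 0.90, "decision_tree": 0.80, "svm": 0.85}[classifier]
    return base + 0.01 * min(C, 10.0) / 10.0


# Enumerate every combination in the grid and keep the best-scoring one.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=lambda params: mock_score(**params))
print(len(candidates), best)
```

The cost of this approach grows multiplicatively with every extra parameter, which is why a guided search becomes attractive once whole pipelines are in play.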

TPOT has a dependency on the xgboost library which is fairly tricky to build on Windows so if you’re going to try this out, best to do so in a Linux environment where all you need to do as groundwork on top of a standard Python sklearn setup is: pip install tpot.  You are then good to run this script, slightly adapted from the one presented in the source article:

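import time
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split
# Later TPOT releases renamed the TPOT class to TPOTClassifier.
from tpot import TPOTClassifier as TPOT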

output = 'exported_pipeline.py'

t0 = time.time()
fname = 'https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz'
print("1. Reading '%s'" % fname)
data = pd.read_csv(fname, sep='\t', compression='gzip')

print("2. Generating X,y and training + test data")
X = data.drop('class', axis=1).values
y = data.loc[:, 'class'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

print("3. teapot brewing - will take a few mins")
teapot = TPOT(generations=10)
teapot.fit(X_train, y_train)

print("4. Score and export pipeline")
print(teapot.score(X_test, y_test))
teapot.export(output) 
t1 = time.time()

print("Finished in %d seconds! Optimal pipeline in '%s'" % (t1-t0, output))

After some 20 minutes or so in an Ubuntu VM, this should result in the following output:

1. Reading 'https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz'
2. Generating X,y and training + test data
3. teapot brewing - will take a few mins
4. Score and export pipeline
0.960592769803
Finished in 1344 seconds! Optimal pipeline in 'exported_pipeline.py'

where the generated exported_pipeline.py reveals the optimal supervised learning classifier to be a logistic regression with inverse regularisation strength C ≈ 299.55.   Here’s a cleaned-up version for direct comparison which takes a few seconds to run:

import pandas as pd

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

fname = 'Hill_Valley_with_noise.csv.gz'

# NOTE: Make sure that the class is labeled 'class' in the data file
data = pd.read_csv(fname, sep='\t', compression='gzip')
X = data.drop('class', axis=1).values
y = data.loc[:, 'class'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

# Perform classification with a logistic regression classifier
lrc1 = LogisticRegression(C=299.545454545)
lrc1.fit(X_train, y_train)
print(lrc1.score(X_test, y_test))
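
A note on the C value used above: in sklearn's LogisticRegression, C is the inverse of the regularisation strength, so a value near 300 means the weights are only lightly penalised. As a sketch, assuming the default L2 penalty and labels coded as -1 and +1, the objective being minimised can be written as follows, where the symbols w (weights), b (intercept), x_i (feature vector) and y_i (label) are notation chosen here for illustration:

```latex
\min_{w,\,b} \; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```

The larger C is, the more the data-fit term dominates the penalty term, so the model is freer to fit the training data closely.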

What TPOT does is automate a large chunk of the process of developing a machine learning classifier pipeline. Instead of using intuition and know-how to explore the search space, you hand the job to an automated search that evolves candidate pipelines over a number of generations (the generations=10 argument above), trading compute time for manual effort.  If you’re someone who uses sklearn as your go-to machine learning prototyping tool, you probably want to get familiar with tpot as it looks to be an essential tool:

An example machine learning pipeline
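
The kind of generation-based search that the generations parameter implies can be sketched with a toy evolutionary loop. This is a simplified illustration, not TPOT's actual algorithm: the candidate representation, the mutation scheme and the mock_fitness function (which peaks at C = 300 purely for the sake of the example) are all assumptions.

```python
import random

CLASSIFIERS = ["logistic_regression", "decision_tree", "svm"]


def mock_fitness(candidate):
    """Hypothetical stand-in for the cross-validated accuracy of a pipeline."""
    base = {"logistic_regression": 0.90, "decision_tree": 0.80, "svm": 0.85}
    # Peak at C = 300 purely for illustration.
    penalty = abs(candidate["C"] - 300.0) / 3000.0
    return base[candidate["classifier"]] - penalty


def random_candidate(rng):
    return {"classifier": rng.choice(CLASSIFIERS), "C": rng.uniform(0.1, 1000.0)}


def mutate(candidate, rng):
    """Copy a candidate, occasionally swap its classifier and perturb C."""
    child = dict(candidate)
    if rng.random() < 0.3:
        child["classifier"] = rng.choice(CLASSIFIERS)
    child["C"] = min(1000.0, max(0.1, child["C"] * rng.uniform(0.5, 1.5)))
    return child


def evolve(generations=10, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Rank candidates, record the best score, keep the top half unchanged.
        population.sort(key=mock_fitness, reverse=True)
        history.append(mock_fitness(population[0]))
        survivors = population[: population_size // 2]
        # Refill the population with mutated copies of random survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    best = max(population, key=mock_fitness)
    return best, history


best, history = evolve()
print(best, history[-1])
```

Because the top half of each generation survives unchanged, the best score recorded per generation can never go down, and only a small fraction of the search space is ever evaluated.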

The Uber for X Landscape

This Guardian long read on how Uber went from a single hire in Feb 2012 to total dominance of the London taxi transportation network is a textbook lesson in the power of disruption.  Four years ago, the London cab scene was effectively a complex closed shop, with black cab drivers resting on the laurels of their three-year Knowledge acquisition and a few private firms like Addison Lee sewing up the rest of the market.  The entry of Uber into the market appeared of little obvious concern, and most of the competition continued to follow a well-worn playbook and failed to understand, let alone compete with, what was “only an app”, which is why this isn’t really all that surprising:

“We were worried that Addison Lee [4,500 cars and revenues of £90m a year] would get smart, spend £1m – which isn’t a lot of money for them – and make a really nice, seamless app that copied Uber’s. But they never did.”

Today, Uber has transformed the taxi dynamic in London and pushed out to other UK cities.  Its spectacular success has fuelled a global appetite to build parallel “Uber for X” propositions over the last few years. There are signs these ventures are now experiencing something of a backlash in large part because they often rely on what is arguably an unsustainable zero-hours, low-paid precariat workforce.  Nowhere more so than Uber itself where the main service innovation in the last couple of years has been in building even lower cost variants, UberX and UberPool, which further exacerbate tensions between the company and its drivers.  Other entrepreneurs have taken heed and looked at doing something very different that elevates employee welfare concerns:

As tensions build, a smaller wave of on-demand entrepreneurs is finding an opportunity inside an opportunity – to be the “anti-Uber for X”.

Nevertheless, on-demand services continue to be developed at a furious pace with the latest trend to interface with users through a conversational UI paradigm:

big companies like Facebook, Apple, and Microsoft are now eager to host our interactions with various services, and offer tools for developers to make those services available. Chatbots easily fit into their larger business models of advertising, e-commerce, online services, and device sales.

As explained in a recent post on this blog, such chatbots today frequently serve as an interface onto a largely human-powered backend, particularly if they are able to handle the servicing of multiple different verticals.  Pure Artificial Intelligence-based chatbot propositions remain the holy grail but there are few outfits outside of the GAFA quartet that have the competence to do a comparable job to Google Now or Siri. One possible contender is Viv, which has been built by the ex-Siri team.  WashPo’s absorbing profile of Viv is essential reading for anyone interested in the space.

More Machine Learning

  • Neural networks are “impressively good at compression“.  This article helps explain why. It essentially outlines how neural nets operate as autoencoders.

  • Algorithmic air traffic control:

“In recent year, the Gulf States have stopped using children as jockeys – and instead switched to robot ones after criticism from human rights organizations, who reported child were being killed on the camel racing-circuits.”

Wearables, Devices and Manufacturers

  • This story of how Apple Music extends its long reach right into your laptop to delete local content and upload it to their cloud offers a cautionary tale for our times.  The author suggests that in serving as the ultimate gateway for your content, cloud media services stand at the cusp of ushering in an Orwellian future in which they control the media users can access:

When I signed up for Apple Music, iTunes evaluated my massive collection of Mp3s and WAV files, scanned Apple’s database for what it considered matches, then removed the original files from my internal hard drive. REMOVED them. Deleted. If Apple Music saw a file it didn’t recognize—which came up often, since I’m a freelance composer and have many music files that I created myself—it would then download it to Apple’s database, delete it from my hard drive, and serve it back to me when I wanted to listen, just like it would with my other music files it had deleted.

The pace at which China is building its walled-off world, filled with hardware and software created by Chinese companies, with only selectively allowed non-Chinese content, is accelerating.

  • The new £8 million Dyson Centre for Engineering Design at Cambridge University opened for business this week. Together with the £12 million Dyson School of Design Engineering at Imperial College and other generous bursaries, James Dyson is arguably the most important single influence in elite engineering in the UK.   The man himself is on characteristically pugnacious form in this Telegraph article:

“It’s economic suicide that we’re not creating more engineers.  They create the technologies that Britain can export, generating wealth for the nation. But we’re facing a chronic shortage of them.”

Sandra Goldmark

Cloud and Digital Innovation

“We have 2.6 million developers registered so far. I want it to be 100 million. If we don’t look like other platforms, it’s because I think we’re playing a different game,”

  • How to troubleshoot troublemakers comes with this amusing cut out and keep guide to recognising the major types in a company near you:
    • The Hermit: Regardless of the team’s established process, workflow or deadlines, he finds ways to hide his work until he deems it suitable to share.
    • The Nostalgia Junkie: “Too bad you weren’t here a few years ago. It was great.”
    • The Trend Chaser: “I’ve read about this on Twitter and I’m installing it”
    • The Smartest in the Room: this troublemaker feeds off her environment and the reactions of others

some of the biggest qualitative differences between high- and low-performing companies, according to respondents, relate to the leadership and organization of analytics activities. High-performer executives most often rank senior-management involvement as the factor that has contributed the most to their analytics success; the low-performer executives say their biggest challenge is designing the right organizational structure to support analytics

  • This Medium post on Agile, IT and devops suggests the new mindset and way of working is beginning to percolate downwards perhaps as a result of dawning awareness of the existential dangers of ignoring this mind map of the new territory:

most companies today will not live beyond their fiftieth birthday. John Chambers, ex-CEO of Cisco, believes that 40% of today’s leading companies will not survive another decade.  … As it becomes increasingly challenging to envision the future, conventional diagnoses and remedies that have led to past achievements just don’t apply any more. Previous success can lead to “a collective cognitive myopia,” explains London Business School’s Gary Dushnitsky. “It’s hard to see beyond something that you’ve excelled at for the last 15 or 20 years.”

Software Engineering

I’ve found that a big difference between new coders and experienced coders is faith: faith that things are going wrong for a logical and discoverable reason, faith that problems are fixable, faith that there is a way to accomplish the goal. The path from “not working” to “working” might not be obvious, but with patience you can usually find it.

KnightOS is a free and open source firmware for calculators, namely Texas Instruments calculators. It’s written in assembly, and projects that support it are written in C, Python, JavaScript, LaTeX, etc. It is the first mature, open source operating system for calculators and it takes that seriously.

Society and Culture

  • Good post on TheConversation outlining how computing technology aimed at making the lives of Western consumers easier and more productive is also increasingly being used to support global terrorism and crime:

Having researched cybercrime and technology use among criminal populations for more than a decade, I have seen firsthand that throwaway phones are just one piece of the ever-widening technological arsenal of extremists and terror groups of all kinds. Computers, smartphones and tablets also draw people into a movement, indoctrinate them and coordinate various parts of an attack, making technology a fundamental component of modern terrorism.

a growing body of evidence suggests that seeing ourselves as self-made—rather than as talented, hardworking, and lucky—leads us to be less generous and public-spirited. It may even make the lucky less likely to support the conditions (such as high-quality public infrastructure and education) that made their own success possible.

I suspect that as we move further and further into the field of digital or artificial life we will find more and more unexpected properties begin to emerge out of what we see happening and that this is a precise parallel to the entities we create around ourselves to inform and shape our lives and enable us to work and live together.

Leicester City

  • Leicester City’s incredible triumph in the English Premiership has already been labelled by some as “the greatest sporting story of all time“.  It’s certainly brought much-needed optimism and romance to a sport that had lost some of its lustre over the years, with the influx of money seemingly determining who won.   Their coronation late on Monday May 2nd came when Chelsea scored two goals in the second half of an intense and violent game against Spurs that made for dramatic viewing:

  • On a personal level, I was born and grew up in Leicester on a diet that included Roy of the Rovers and the improbable exploits of Roy Race which had nothing on this story.

  • Years later, lifelong fan Julian Barnes wrote about Leicester City winning the FA Cup in A History of the World in 10½ Chapters. In a chapter called The Dream about an imagined future, their win was positioned in the paper next to news of a cure for cancer.  It helps situate how utterly inconceivable it is that Leicester City have won the Premier League and the “generalised delirium” of the week-long party that’s been going on in a small East Midlands town.  Nessun Dorma indeed:

  • The last word should go to the maestro himself:

Trump

On his first day in office, he said, he would meet with Homeland Security officials, generals, and others — he did not mention diplomats — to take steps to seal the southern border and assign more security agents along it. He would also call the heads of companies like Pfizer, the Carrier Corporation, Ford and Nabisco and warn them that their products face 35 percent tariffs because they are moving jobs out of the country.

If Mr Trump’s diagnosis of what ails America is bad, his prescriptions for fixing it are catastrophic. His signature promise is to wall off Mexico and make it pay for the bricks. Even ignoring the fact that America is seeing a net outflow of Mexicans across its southern border this is nonsense. Mexico has already refused to pay.

Trumped

For Gore, the recent violent attack by Trump supporters, which left her with severe facial bruising, is indicative of the election — and how the Republican is inciting violence while exposing deep divides in American society.
