Four Shocking Conspiracies (Plotted by the Auto Industry)

Alex Jones was in the news recently for his absurdist trial, and a tweet I stumbled upon gets at something I find particularly interesting about conspiracy theories:

This tweet summarizes the only theory I actually subscribe to: a meta-conspiracy in which reality is in fact far more sinister and manipulative than the fantasy scenarios dreamt up by conspiracy theorists. The truth tends to be as upsetting or worse than the parallel false stories invented by Jones and his like – but we rarely characterize it as such.

Here, I will describe four real-world conspiracies, plots hatched behind closed doors by business executives and politicians. With these stories, ask yourself – are they really any more outlandish than a faking of the moon landing? Any less nefarious than a Kennedy assassination plot?

And they all concern one of my favorite subjects: the auto industry.

Lead in the Gasoline: How One Guy Killed Millions

It shocked me, recently, to learn just how we ended up with the “unleaded” label at gas stations. I’ve of course seen that label ever since high school, when I had to start buying gas. Though one could obviously deduce that there must be such a thing as “leaded” gas, I could hardly have imagined the explanation for why lead was ever in gasoline in the first place, nor the dark story of its promotion and subsequent impact on society – a truly evil conspiracy.

Early in the history of the automobile, gasoline did not interact perfectly with the engine; there was a “knocking” problem involving imprecise explosions in the combustion chamber. The full story is beyond the scope of this article, but, in short, one man discovered a solution in the 1920s: chemical engineer Thomas Midgley Jr.

This dude was bad news.

Midgley had been experimenting with different gasoline additives, trying to identify the one that would most effectively eliminate the knocking effect. Ethanol actually turns out to be highly effective at fixing the problem, but for profitability he settled on tetraethyl lead (TEL). His boss at General Motors, Charles Kettering, was thrilled that such a valuable additive had been discovered, but unfortunately lead has the nasty feature of being a highly toxic substance.

Everyone knows that passing a guy on the highway only to be stuck at the same light two minutes later is a fair trade for lead poisoning

Undeterred, GM aggressively marketed the additive as making a superior gasoline, advertising its performance and knowingly downplaying the poisonous side effects. Both Midgley and Kettering were well aware of how dangerous lead was – in fact, Midgley himself repeatedly suffered severe lead poisoning and received warnings from other scientists about it being “a creeping and malicious poison”. During the manufacturing process, many employees of GM (and partner company Du Pont) died from its effects.

Despite all this, the company insisted that TEL was in fact safe – but they required that advertising not include the word “lead” in the text. In an especially audacious instance, at a press conference Midgley confronted workers outraged about lead exposure at a GM facility and, to demonstrate the additive’s safety, rubbed TEL all over his hands. He had spent the weeks prior recovering from lead poisoning at a beach.

Of course, this was in the heyday of GM and the product was wildly successful. From its invention in the 1920s until the mid-1970s, leaded gasoline spread around the world and GM reaped the benefits of its ingenuity. However, research was catching up, and by the 1980s the dangers of leaded gasoline had been established; we have since learned that it kills, reduces IQ, and might even contribute to crime waves. TEL had become an internationally dominant product, but from around 1995 to 2005 it was phased out and banned across the world.

It is hard to say how many died from this product, but estimates are in the millions per year. Lead remains in the soil across the world, poisoning children from California to London. Midgley went on to invent Freon, which would go on to devastate the ozone layer. Real cool guy, this Midgley fellow.

There isn’t really a happy ending to this story (unless you’re a GM shareholder from the 1930s, I guess), but needless to say it was an active effort from a corporation to spread poison across the planet for profit – and might only be remembered today if you take note of the “unleaded” label at your gas station.

Segregation by Design: Racist Bridges and Urban Renewal

In the past five or six years, there has been some reporting about the cruel and racist practice of “redlining” in the real estate industry. In short, lenders and the federal government graded Black neighborhoods as “hazardous” and refused to finance homes there, steering white families into desirable neighborhoods while confining Black families to disinvested ones.

This on its own is of course a horrendous conspiracy, extending across the nation and in support of a segregated system, but this article is on the subject of automobile conspiracies. So how does that tie in here?

To put it bluntly, entire neighborhoods were demolished for highways, typically black neighborhoods, and often successful ones:

Such a prevalent practice that an entire Twitter account exists to document it

This of course did a phenomenal job of dividing communities, which works hand-in-hand with the redlining practice. To this day, cities across the country are living with the consequences of these projects. With highway construction came an abundance of parking lots, and once-thriving areas in major American cities were razed to the ground. Not through bombs, as with Europe and Japan in World War Two – but by our own leaders in government under the guise of “urban renewal”.

The only thing good to come from urban renewal was this Tower of Power record

Of course, the decision to build these highways was just that – a decision. Were the people in charge really this racist? Did they intentionally build this infrastructure in such a destructive manner?

One such leader was Robert Moses, the “power broker”, who built the highway system in New York City and influenced city planners nationwide. As it turns out, yes, Robert Moses was insanely racist and absolutely did design his highway system in a way that led to suburban sprawl, car dependence, intensified segregation, and frequent demolition of thriving parts of New York City.

He was the most racist human being I had ever really encountered. The quote is somewhere in there, but he says, “They expect me to build playgrounds for that scum floating up from Puerto Rico.” I couldn’t believe it. –Robert Caro on Moses

The example which most typifies this attitude is his notorious (purported) approach to bridges. According to the authoritative biography on Moses by Robert Caro, he intentionally built bridges on the route to Long Island beach towns at such a low level that buses could not pass underneath them – presumably, buses serving the poorer minority communities of New York. In other words, he discouraged anyone who could not drive in a car from visiting the beach.

The veracity of this specific allegation is debatable, but one thing is not – Moses intentionally pursued a car-centric policy as a planner, and all the infrastructure he developed reflects it. As Caro put it:

Moses was a real genius … He engineered the footings of the LIE to be too light for anything but cars, so you can’t ever put a light rail there. He condemned Long Island to be this car-centered place.

No matter what the specific project was, “urban renewal” and the work of Robert Moses had at their core a car-centric vision of the world, one which came at the expense of the communities that were most vulnerable. Moses himself put it best:

I raise my stein to the builder who can remove ghettos without moving people as I hail the chef who can make omelets without breaking eggs

The Invention of Jay Walking

Marketing is often the auto industry’s most insidious tool, used to push a subtle shift in the thinking of the masses. Nowadays it’s easy to imagine the sort of advertisements the companies use to sell pickup trucks for $50K. But in this example, the industry went the extra mile: a completely new crime was invented simply to alienate people who aren’t driving.

We all know what jaywalking is: it’s when a pedestrian crosses the street without having permission to do so. However, this is a very new concept in the grand scheme of history. If you look at photos from a hundred years ago, often you will see scenes of folks crossing the street whenever they want, sharing the road at will with stage coaches, bicycles, street vendors, and other pedestrians.

Example from Manhattan in 1914

However, automobiles are fast. And heavy. And, most importantly, dangerous. So when the city of Cincinnati nearly forced cars to mechanically limit their speed to 25 MPH in 1923 (itself another entire discussion), the car companies realized they could face resistance from the population and mobilized to portray the pedestrian as the one responsible for safety.

There were no laws on the books at the time governing how a pedestrian could cross a street, but the auto industry sought to change that. The term “jay” meant something like “country bumpkin”. By effectively calling people idiots if they crossed the street on their own, responsibility was successfully shifted from car drivers to pedestrians, and the idea of jaywalking was invented through aggressive marketing. As historian Peter Norton notes, “The newspaper coverage quite suddenly changes, so that in 1923 they’re all blaming the drivers, and by late 1924 they’re all blaming jaywalking”.

From there, with the psychology successfully established, it was simply a matter of turning the offense into an official crime. Local municipalities took up this effort with gusto, and by the 1930s it was a cultural norm. Today it feels as if police are more likely to enforce jaywalking rules than speeding. As one author notes, this is a common tactic by corporations – shifting responsibility onto the consumer, as with recycling practices, for example – so perhaps the prevalence of this sort of conspiracy shouldn’t come as a surprise.

Death of the American Trolley

If you go far enough back in time, Los Angeles had a functioning public transit system in the form of a streetcar/trolley network. Sadly, it is no more. The demise of this transit system is particularly interesting in the context of conspiracy, as there is a spurious myth of why it collapsed alongside the mundane, true story.

The exciting, more nefarious story even I myself was convinced of was that the car companies formed a cartel and bought up the streetcars, decommissioning them as a way to instigate more car purchases. It’s even a plot point in that strange cartoon movie from my childhood, Who Framed Roger Rabbit. Based on the previous stories described in this article, this one seems perfectly plausible – these companies really will engage in insanely unethical business practices. However, the truth here is a bit more nuanced, and in a way even more upsetting when you consider the implications.

In brief, the privately operated trolley systems were out-competed by cars and, in a way, by buses. The important detail in this story is that streetcars needed to use public roads to get around, and therefore ended up sharing the street with increasingly popular automobiles; congestion meant that travel times on the streetcar routes increased substantially, making it more appealing to buy one’s own car, which increased congestion further, and so on – a vicious cycle. Additionally, the streetcar companies were on the hook to maintain the roads they utilized, so they ended up subsidizing infrastructure for their competition, the automobile.

So technically there isn’t a grand conspiracy from the car companies to destroy public transit – but the reality isn’t far off. Market mechanics that a libertarian would adore resulted in less choice available in cities like Los Angeles; owning a car in that city is not a choice, it’s a precondition, an offer you can’t refuse. And consider the fact that congestion – the very thing that killed the streetcars – can be priced correctly, or that dedicated lanes can improve bus/trolley performance. These are necessary considerations in the overall transportation system, but instead we experience what is effectively a monopoly.

The fact of the matter is that by the 1950s cars had won. Public transit was poor and deteriorating, and America had become a culture of the automobile. The banality of no alternative option is the conspiracy, where even walking can be a crime. Bicycles aren’t allowed on the sidewalk, and are hated in the street. Across the nation and the world, corporations relegated walking, streetcars, and bicycling to second-class status. That’s the real conspiracy here.

I’ll make him an offer he can’t refuse – a used Buick

Decisions, Decisions, Decisions: The Decentralized Responsibility of Driving

Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac? – George Carlin

I love playing chess. Specifically, I love playing blitz chess, where each player gets five to ten minutes on their clock. The game is under the gun. It’s an intense, rapid competition and can hinge on a single move, a single catastrophic blunder, or simply an aggressive time management strategy where one player dies by the clock. A poor position can feel suffocating with time ticking away, and even being in control of the game can be just as stressful when you have a stubborn opponent who won’t resign.

Importantly, every move involves the evaluation of the position. You have to calculate as many possibilities as you can, think about all the directions the game can take. Does your attack leave important pieces vulnerable? Does castling improve your defense or compromise it, entombing your king in a coffin? Is it a dangerous trap to take the undefended pawn? These are questions that must be considered in rapid fire. And any single miscalculation can be fatal.

I love playing chess, but I also hate playing chess – it’s very much a love/hate relationship. It can be frustrating, and it is mentally exhausting. As Garry Kasparov put it, “chess is mental torture”, and calculation can be grueling, even for an amateur like myself.

But this article is not about chess – it’s about driving my stupid car from point A to point B. As a convert to the gospel of urbanism, I have plenty of gripes with cars, be it their deadliness, environmental impact, financial cost, or just how goddamn loud they are. Here, though, the subject is more about the unexamined, ordinary responsibilities of driving a car.

So why start with my ode to chess?

Simply put, we are all playing chess in our private vehicles. We’re forced to choose when to run the yellow light, forced to choose which route is best, when to change lanes, when to turn, whether it’s safe to look away from the road, whether or not to speed, how close to follow the car ahead of you, forced to check for pedestrians, bicyclists, turning cars, cops, deer, fire hydrants, forced to find a parking spot, forced to do the stupid parallel parking challenge (judged as an idiot if you can’t parallel park on your first try), must evaluate road signs, must remain attentive in stop-and-go traffic, forced to avoid reckless drivers, forced to pass timid drivers. And that’s only while you’re actually driving! Even if magical “self driving cars” come along and somehow address those concerns to a sufficient degree, you’re still forced to choose what car to get, what color to have, what price range you’re comfortable with, whether it’s new or used, what features to get, whether it’s cool and sexy, forced to manage the insurance, fuel, repairs, maintenance, financing, registration, license, must pass a driving test, must store it (tho the state will probably provide that for free at the expense of people who don’t drive cars).

When you step on the brakes, your life is in your foot’s hands – George Carlin

Conservatives will sometimes argue that cars are great for Americans because the car is the ultimate “freedom” machine, but for my entire life these decisions have struck me as a burden. I hate having to figure out my route while flying down the highway at 60 miles an hour. Missed your exit? Tough luck, sorry you don’t love freedom. Forgot to renew your tabs? Surprise! That will be fifty bucks. All these decisions are the responsibility of you, the individual, rather than someone who is paid to do this weirdly technical exercise correctly.

And so, the way I see it, it’s comparable to having a legion of chess players out on the road, playing blitz as their mental toll payment to get where they are going. And a lot of people suck at chess! This is simply not the case with walking, bicycling, or riding the bus. The decisions, where they exist, are much lower stakes (nobody is going to die walking wrong), and in the case of transit are entrusted to a professional. Keeping with the chess analogy, on a bus or train we basically let grandmasters do the chess playing for us – and, as a result, all those decisions I listed evaporate; instead, you can read a book.

Paul McCartney on a train, reading, to dispel the idea that transit is for poor people. Must hate freedom, I guess.

This dynamic of “individual responsibility” is a much, much broader concept than this specific issue of transportation. It appears in practically every aspect of American culture – from work life to private life. For example, rather than holding corporations responsible for recycling glass bottles, responsibility is outsourced to the consumer to maximize profits. But driving is such a day-to-day activity that it particularly grinds my gears. It’s all just to get from point A to point B.

I just looked up from my laptop and saw congestion at the intersection outside the window of this coffee shop; a woman gave double middle fingers to another driver because she couldn’t safely turn. And that’s a totally normal thing when it comes to driving a car, I myself have given the finger to a few maniac drivers in the past.

Road rage itself feels an awful lot like a common frustration in blitz chess – tilt. It’s a concept from poker, and even earlier from pinball. Tilt is the absolute fury that occurs when you go on a losing streak. You lose your cool, get angry, and make bad decisions. It’s a common phenomenon in competitive activities – tennis, poker, chess, boxing, and so on.

Whether or not road rage actually is related to tilt, I don’t get why we’ve built our transportation system in a way that frequently feels like a competition – specifically, a competition stemming from rapid-fire decision making. It’s bad enough that in the time I’ve spent writing this article, I’ve personally witnessed this rage, a middle finger that would essentially never occur in any other context.

We need to get away from this system. It’s unhealthy, literally, and I don’t like having to think about where my exit is. It’s not out of some bizarre, pro-bus fetish, I don’t “identify” as a transit user. It’s just extremely weird to me that a quite technical exercise is expected of the entire population. Old people shouldn’t drive, children shouldn’t drive, bad drivers shouldn’t drive – it’s too dangerous and technical. And self driving won’t rescue us, it’s too expensive; Tesla will bill you $12K for experimental self driving – for that price you could simply hire a part time professional driver.

Solutions exist now. What’s needed: bike lanes, investment in transit, safety on transit, and community buy-in to a new culture of getting around – one that shouldn’t feel like a frustrating game of chess. I’ll end with a tweet from the forward-thinking mayor of Bothell, someone who certainly has the right idea when it comes to these issues.

Ferry Ticket Prices: A Subsidy to Car Owners?

I have lived in Seattle my entire life and have grown up riding the ferry. It’s a transportation system central to my existence. Along the way, I have often thought about ticket prices.

As a small child, I got to ride the boat for free. When I grew into a teen, I paid a reduced fare and distinctly remember turning 18 but pretending to be 17 in order to avoid the full adult price. I remember the walk-on ticket price being raised to $8 and being outraged, even as a high schooler (I could instead buy a sandwich with that money!). Most germane to this article, I recall having the realization that the price for a drive-on car is far greater than that of a walk-on, so therefore it would be best to try and walk on whenever possible.

This boat means more to me than most actual human beings

In the past year or so, though, I have become practically obsessed with the ideology of “urbanism”: prioritizing dense cities, encouraging walkability and cycling, expanding housing supply – and, crucially, reducing car dependency. This ideology is typified by projects like the “15 minute city” in Paris, the bicycle culture of the Netherlands, or the ST3 project here in Seattle.

One common sentiment in urbanist circles is that car dependency is a cancer on American society. There are a million examples I could cite, but the basic premise is that cars are “dangerous, smelly, loud, take up too much space, [and] are racist.” (I would also add that they’re insanely expensive).

The most emblematic example in Seattle is the controversy surrounding car policy in Pike Place Market – and oh boy it has been a hell of a couple weeks with that one. We, as a society, cannot seem to do the bare minimum to discourage cars in the one place in the city and state where cars obviously should be banned.

The quintessential anti-car meme

So what does this have to do with the ferry? Recall that the discrepancy in ticket prices between foot passengers and vehicles is quite large. On my most recent trip, I drove onto the boat. During the crossing, I considered exactly what those tickets are paying for, and came to a radical conclusion using some extremely trivial arithmetic: walk-on passengers pay roughly five times as much per pound to ride the boat as the drivers in the car deck.

What exactly are you buying?

I suppose I should lay out the numbers for the most popular route, Seattle to Bainbridge: a walk-on costs $9.25 and driving my 2008 Toyota Camry costs $33.60, using round trip figures. This means that for a trip to Kitsap, you save nearly $25 by walking instead of driving (three sandwiches!).

So it seems simple that walking is the cheaper option. But we can further quantify this, and it’s pretty obvious what other factor to consider – weight.

When I walk on, I merely bring my body of pure lean muscle, bone, organs, and mustache, possibly with a sandwich, possibly with a backpack; when I drive, I bring much more: a full living room’s worth of furniture, a radio, a literal ton of metal and tires, and whatever I’m hauling with me in the trunk. This adds up. This wears out the infrastructure. This costs more to maintain.

It’s really the work of transportation that the prices are supposed to compensate for, and weight is a pretty precise measure of the “work” (in the physics sense) being done to that end. It’s much, much easier to transport hundreds of pounds than thousands. That is what the ticket pays for.

How much we pay

For simplicity’s sake, I will round up my own weight to 200 pounds. The weight of a 2008 Camry can easily be googled and found to be around 3,300 pounds, or 3,500 with me in the car.

So how much, then, is the price per pound of each option? (I will use price per 10 pounds, same thing.) When I walk on, I pay $9.25 for 200 pounds – about 46 cents for every 10 pounds. With the car, though? The $33.60 fare for 3,500 pounds of material means that the driver only pays 9.6 cents for every 10 pounds. In other words, the per-pound rate is 4.8 times greater for the walk-on passenger. This seems like a hell of a deal for the car driver, particularly considering that a Camry is relatively light these days; the per-pound rate for a Ford F-150 works out to about 7.4 cents per 10 pounds, meaning the walk-on pays over 6 times what the truck driver pays.
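
If you want to check the arithmetic, here is a minimal sketch in Python, using only the round-trip fares and weights quoted above:

# Round-trip fares (dollars) and weights (pounds) from the text above
walk_fare, walk_weight = 9.25, 200       # me, possibly with a sandwich
car_fare, car_weight = 33.60, 3500       # the Camry with me in it

walk_rate = walk_fare / walk_weight * 10   # ~46 cents per 10 pounds
car_rate = car_fare / car_weight * 10      # ~9.6 cents per 10 pounds
print(walk_rate / car_rate)                # ~4.8x higher rate for the walk-on

# What the fares would be if each paid the other's per-pound rate
print(walk_weight * car_rate / 10)         # ~$1.92 round trip for a walk-on
print(car_weight * walk_rate / 10)         # ~$162 round trip for the car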

Implications

This barely even qualifies as data analysis, truly a back-of-envelope calculation, but the implication is pretty clear: light walk-on passengers pay more to help transport heavy drive-on passengers.

This should be an outrage to walk-on passengers. My high school self was right to be mad about the $8 price tag for a walk-on trip. My car alone could account for 10 entire walk-on passengers, so having the audacity to charge nearly $10 apiece for those people when they happen not to be in a car is absurd.

Were we to adjust prices down for the walk-ons, the price would be about $2 round trip! And if we adjusted the price up for cars, a round trip would cost a whopping $160.

Ticket prices adjusted down. $31.68 savings for walk-on, 4 sandwiches!

Obviously this adjustment is not going to happen, and I myself think an $80 one-way ticket to Kitsap would be extreme. And, of course, one can argue that using weight as a proxy for value is an incorrect measure. But no matter how you look at it, it seems like ridiculously unfair pricing. This doesn’t even get into the space efficiency (go look at that photo at the top again, practically 50% of the boat is car bay), but this is clear discrimination against walk-on customers.

And what exactly does that discrimination “buy” us? More maintenance costs? More carbon emissions? This is a policy that punishes environmentally friendly behavior and should be criticized as such. The Washington State Department of Transportation should be grilled on why this is the policy.

And cars are only getting bigger and heavier (including the magic electric ones). How big does this discrepancy need to grow before people notice and eschew walking on entirely as a clearly bad deal?

Genetic Algorithm for Simple Linear Regression

The basic idea of a genetic algorithm (GA) is to simulate the natural process of evolution and utilize it as a means of estimating an optimal solution. There are many applications of this technique, one of which being a fascinating YouTube video of a genetic algorithm that plays Mario. Yesterday I was wondering to myself if I could implement one for the (much) simpler task of estimating coefficients in a linear regression, and by midnight I had successfully written the Python code to accomplish this task.

All code can be found here on my Github.

General Idea

I’ve looked into GAs in the past, so I have a basic idea of the steps involved:

  1. Initialize
  2. Selection
  3. Crossover
  4. Mutation
  5. Repeat

As I mentioned, the idea is to resemble actual evolution, which is where these steps come from. First, you initialize your population: Adam and Eve (but in our case this will be ten thousand standard normal observations). Of that population, you evaluate “fitness” using some measure, and then cull the herd to those which are most fit – survival of the fittest, literally. Using the reduced population, you then breed them and add some genetic variation. This gives you the next “generation” of observations for which you repeat the selection and crossover process until you achieve your optimal solution.

Problem Statement

Because a GA is an optimization technique, there are many practical applications of the method (for example, the knapsack problem). However, I wanted to keep it simple – estimate the coefficients in a linear regression. As we all know, the general formula for a linear regression is:

\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\epsilon}

The coefficients are picked to minimize the residual sum of squares (RSS). Estimates of the unknown coefficients \boldsymbol{\beta} are calculated like so:

\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

I arbitrarily picked a specific true equation for simulation:

\mathbf{y}=3\mathbf{x}_1-4\mathbf{x}_2+\boldsymbol{\epsilon}

Here is the code to generate the data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

np.random.seed(8675309)
x1 = np.random.normal(loc=2, scale=3, size=1000)
x2 = np.random.normal(loc=-1, scale=2, size=1000)
x = np.column_stack((x1, x2))
y = 3 * x1 + -4 * x2 + np.random.normal(size=1000)

Using linear regression from scikit-learn, we can easily see the coefficients are estimated as anticipated:

linear_regression = LinearRegression()
linear_regression.fit(x, y)
print(linear_regression.coef_)
[ 2.99810571 -3.98150108]
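
As a quick sanity check, we could also compute \hat{\boldsymbol{\beta}} directly from the normal equation above – a minimal sketch using the simulated x and y, with a column of ones prepended so the intercept is handled the same way LinearRegression handles it by default:

# Closed-form OLS estimate via the normal equation, as a cross-check
X = np.column_stack((np.ones(len(y)), x))
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # intercept near zero, slopes near 3 and -4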

We can take a quick look at the mean squared error (MSE) and mean absolute error (MAE) of the fitted regression model:

y_hat = linear_regression.predict(x)

mean_squared_error(y, y_hat)
1.013983786908985

mean_absolute_error(y, y_hat)
0.8056630814774554

Now, we want to see what happens when we produce the same estimate using a genetic algorithm implementation.

Genetic Algo Technique

The first step is initialization. What this means in this context is an initial random guess of n coefficient vectors, where n is the population size of a generation (ten thousand, in this exercise) and each vector contains the m coefficients to be estimated (two). Additionally, I define the fitness function to be optimized, which in this case is MSE.

def fitness(c):
    y_hat = np.dot(x, c)
    return mean_squared_error(y, y_hat)

n_epochs = 8
pop = 10000
m = 2  # number of coefficients to estimate

# Here are the initialized guesses:
coefs = np.column_stack([np.random.normal(size=pop) for i in range(m)])

With these coefficients, we can produce a scatter plot of each pair’s fitness using a color gradient (greener has lower MSE). The “X” marks the true coefficient pair:

As we can see, there is a clear direction for the population to go in that will move it toward the target. Also, notice that the typical MSE is somewhere between 100 and 1000, which is wildly greater (worse) than the optimal MSE of 1. Clearly there is a lot of room for improvement over randomly guessing the coefficients.

We have initialized the population and can therefore execute the selection phase. We simply do away with the less fit members, keeping only the top 25% of observations:

These are, in fact, the same points as in the previous plot. Look closely and you will see they are identical, but the colors have changed.

Now comes crossover. With 25% of the population remaining, we pair the observations and produce “children” of the pairs. This means that for each pair of “parents”, eight children are required to get the population back to the original size of ten thousand. To guarantee the children do not all have the same midpoint value, I add some noise with numpy, which represents the “genetic variation” (mutation) component of this analysis.

Once crossover is complete and the population returns to its original ten thousand, we repeat selection, crossover, and mutation until the optimal solution is achieved. This is the code which runs the process:

for epoch in range(n_epochs):
    culling = pop // 4
    f_hat = np.apply_along_axis(fitness, 1, coefs)
    df = np.append(coefs, f_hat.reshape(-1, 1), axis=1)
    df_sorted = df[np.argsort(df[:, m])]

    # Selection: survival of the fittest
    keep = df_sorted[0:culling]
    best = keep[0]
    print(f"Epoch: {epoch + 1}, Average fitness: {np.mean(keep[0, m]):.{4}f}, "
          f"Best X1: {best[0]:.{4}f}, Best X2: {best[1]:.{4}f}")

    new_coefs = []
    for i in range(0, culling, 2):
        mids = []
        for k in range(m):

            # Crossover: midpoint of coefficient pair

            mids.append((keep[i, k] + keep[i + 1, k]) / 2)
        for j in range(8):

            # Mutation: Add some noise to the result

            mutate_mids = [mid + np.random.normal() for mid in mids]
            new_coefs.append(mutate_mids)
    coefs = new_coefs

Here is the output from that code:

Epoch: 1, Best fitness: 13.2631, Best X1: 3.0064, Best X2: -2.3699
Epoch: 2, Best fitness: 1.0731, Best X1: 2.9791, Best X2: -4.1012
Epoch: 3, Best fitness: 1.0348, Best X1: 2.9761, Best X2: -3.9739
Epoch: 4, Best fitness: 1.0269, Best X1: 2.9816, Best X2: -4.0109
Epoch: 5, Best fitness: 1.0177, Best X1: 3.0099, Best X2: -3.9948
Epoch: 6, Best fitness: 1.0179, Best X1: 3.0039, Best X2: -3.9924
Epoch: 7, Best fitness: 1.0209, Best X1: 3.0190, Best X2: -4.0087
Epoch: 8, Best fitness: 1.0185, Best X1: 3.0136, Best X2: -3.9797

The best MSE drops substantially after just one generation and quickly settles near the minimal value determined previously by OLS linear regression. By the third generation, the best estimate is very near the true coefficients. As before, we can visualize this process at each step:

Clearly the population gravitates toward the correct answer. Subsequent generations are identical to generation five, and the optimal solution is identified.

Conclusion

This is certainly a toy example and not particularly useful, but the process of using a GA for optimization is nonetheless interesting to explore. Just from this example, there are several obvious directions for further investigation: analysis of population size and culling percentage (hyperparameter tuning), trying the GA on a more complex regression model, deciding when to stop the algorithm, etc. The closed-form OLS estimate is much faster to calculate than this GA trick, so no performance is gained in this toy example over good, old-fashioned linear regression.

Finally, I’ll also note that on a basic level my motivation for looking into this is that evolution in the real world is amazing and can produce optimal results that are mind blowing. For example, hummingbirds are so effective at flight that they resemble drones to me, some of the most cutting edge flight technology in the modern world. Also interesting is that there are many suboptimal outcomes, such as the existence of the seemingly useless appendix organ. It’s all quite strange. The fact that I can implement a similar process using Python and, in seconds, run my own quasi-evolutionary simulation is very cool to me.

Fantasy Football with Scrapy and scikit-learn (Part 1)

The code for this project is available here: https://github.com/athompson1991/football. With this being my debut blog post, I’ve decided to break the analysis into two parts, as there are many aspects to explain and the article was getting to be a bit lengthy.

Intro

Football season has kicked off (pun intended), and likewise so has fantasy football. I distinctly recall the last time I had a fantasy draft and how I essentially went in blind, with predictably mediocre results. This season, I wanted to Moneyball it using some machine learning techniques I acquired through classes at UW. Nothing particularly fancy, but enough to provide a semblance of judgment in picking my team.

As often happens with coding projects like this, I began with some scripts to get the data, followed by some scripts to do basic analysis, but the whole thing quickly metastasized into a larger endeavor. Ultimately the scraping work was consolidated into a module containing Scrapy spiders and pipelines/items/settings; the analysis section morphed into a more sophisticated object-oriented approach to using scikit-learn; and the end goal – a player ranking for the 2018 season – was wrapped into an easily executable script.

Problem Scope

Before jumping into the actual project, I want to explain exactly how I approached this problem. There are important aspects of the Fantasy rules, the scraping, and the analysis that need to be considered.

My league is on Yahoo Fantasy Sports, and I had the following image as my guide on what the team would look like and how points would be gained:

A quick glance at this and a general understanding of football suggest how to launch the analysis: things like offensive fumble return TD or two point conversions can be ignored (very rare) and I can focus on yardage and touchdowns for passing/receiving/rushing (I also decide to predict total receptions for wide receivers).

Also note the makeup of positions on the roster. While I could attempt to predict kickers, tight ends, and defense, I decide to simplify my analysis and focus exclusively on predicting quarterback, wide receiver, and running back performance.

So, in summary, there will be seven response variables to predict (passing/receiving/rushing for TD and yardage, plus total receptions). That leaves the question of what the features will be. To answer this, we take a look at the Pro Football Reference website. Just as an example, take a look at the passing stats:

Many of these stats can be features, such as completion percentage, touchdown counts, and interception counts. Without looking at the stats for other positions (running back and wide receiver), we can kind of guess what will be useful: rushing yards, reception percentages, touchdown counts, etc. Anything numeric, really.

The catch is that we want to use last year’s stats to predict this year’s stats. This will involve a minor trick in Python, but it is important to keep in mind. Obviously there is greater correlation between touchdowns and yardage in a given year than between touchdowns last year and yardage this year.

Scraping

To get the data, I use the Python package Scrapy. Though this is not a tutorial specifically on how to use Scrapy, I will demonstrate the basic approach I took to go from scraping only the passing data, to using a more generalized means of scraping all player data.

Passing

Whenever I do a scraping project, I have to learn the inner workings of the HTML on the target website. For this, I simply looked up the passing stats for a given season, used the Scrapy shell to ping the website, then figured out exactly which cells/td elements to extract and how to do so.

As it turns out, the main table for the passing page is conveniently accessed using the CSS selector table#passing. You can see this by using the inspector in Chrome/Firefox:

Furthermore, all the data in the table is td elements (table data) nested in tr elements (table row). For my purposes, this means that my Scrapy spider will have to zero in on a row, then parse each data element cell by cell in that row. Instead of explaining all the minutia of how to do this, here is my first iteration of the spider to crawl the passing pages:

import scrapy
import bs4
from ..items import PassingItem

FOOTBALL_REFERENCE_URL = 'https://www.pro-football-reference.com'

class PassingSpider(scrapy.Spider):

    name = 'passing'
    allowed_domains = ['pro-football-reference.com']

    def __init__(self):
        super().__init__()
        self.years = list(range(1990, 2019))
        self.urls = [FOOTBALL_REFERENCE_URL + "/years/" + str(year) + "/passing.htm" for year in self.years]

    def parse_row(self, row):
        soup = bs4.BeautifulSoup(row.extract(), 'html.parser')
        tds = soup.find_all('td')
        if(len(tds) > 0):
            link = tds[0].find('a', href=True)['href']
            player_code = link.split('/')[-1]
            player_code = player_code[0:len(player_code) - 4]
            stats = {td["data-stat"]: td.text for td in tds}
            stats['player_code'] = player_code
            stats['player'] = stats['player'].replace('*', '')
            stats['player'] = stats['player'].replace('+', '')
            stats['pos'] = stats['pos'].upper()
            return stats
        else:
            return {}

    def parse(self, response):
        page = response.url
        self.log(page)
        passing = response.css("table#passing")
        passing_rows = passing.css('tr')
        for row in passing_rows[1:]:
            parsed_row = self.parse_row(row)
            if len(parsed_row) != 0:
                parsed_row['season'] = page.split('/')[-2]
                yield PassingItem(parsed_row)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

The important lesson here is that there is the primary parse method as well as a parse_row helper method. A couple quick things to note:

  • The years I am pulling from are 1990 through 2018 (range(1990, 2019))
  • BeautifulSoup is used to parse the HTML
  • Messy characters are removed
  • The neat trick to get all the data is the dictionary comprehension in parse_row (stats = {td["data-stat"]: td.text for td in tds}), which pulls every statistic in one fell swoop

I will not get into the Scrapy item or pipeline situation, but that is all available on my Github, and the Scrapy documentation is available for reference.

Generalizing for All Player Stats

Once I had written the passing spider, I moved on to rushing statistics. However, I found myself writing an essentially identical spider. There were only two differences: the last part of the URL was “rushing” instead of “passing”, and the CSS selector was table#rushing instead of table#passing. This seemed like something which could easily be addressed and would save me a headache when I moved on to receiving as well.

My solution was inheritance. I wrapped the bulk of the code into a parent class PlayerSpider, then had the detailed particulars of each target page cooked into the inherited classes: PassingSpider, ReceivingSpider, RushingSpider, etc.

I wrote an inherited spider for (almost) all the pages listed under “Player Stats”

Without cluttering everything up with a giant Python class, the idea was to take the super class and base everything around the sub class name like so:

class PlayerSpider(scrapy.Spider):

    allowed_domains = ['pro-football-reference.com']

    def __init__(self, name):
        super().__init__()
        self.years = list(YEARS)
        self.urls = [FOOTBALL_REFERENCE_URL + "/years/" + str(year) + "/" + name + ".htm" for year in self.years]

...

    def parse(self, response, target, item_class):
        self.log("parsing row...")
        page = response.url
        table = response.css("table#" + target)
        table_rows = table.css('tr')
        for row in table_rows[1:]:
            parsed_row = self.parse_row(row)
            if len(parsed_row) != 0:
                parsed_row['season'] = page.split('/')[-2]
                yield item_class(parsed_row)

class PassingSpider(PlayerSpider):
    name = 'passing'

    def __init__(self):
        super().__init__(PassingSpider.name)

    def parse(self, response):
        return super().parse(response, target=PassingSpider.name, item_class=PassingItem)

The trick is in using the __init__ methods as a way to establish which page we are looking at, as well as what the table will be named (in other words, exactly the problem described above regarding passing versus rushing). The parsing methods on the parent class need to be modified slightly to account for more string manipulation issues, and the Scrapy item needs to be modified as well to adjust to different column headers, but otherwise the process is very similar for every statistic type (receiving, rushing, passing).

With some quick, additional Scrapy items and a CSV pipeline that spells out the columns to expect and where to save the data, I can easily pull all data of interest from 1990 to 2019: passing, rushing, receiving, defense, kicking, and fantasy.

With the “database” successfully established, we can now move on to the actual data analysis.

How I feel when I successfully scrape a bunch of data

Exploratory Data Analysis

With exploratory data analysis, I sought to do two things – manipulate the data into something that could be fed into a regression model, and get a cursory understanding of exactly what kind of relationships could be valuable.

Manipulating the Data

The idea of the predictions is to use present data to predict the next season’s performance. This is on an individual player level, and the type of performance (rushing, passing, receiving) is the basis of analysis.

To accomplish this, the underlying data has to be joined with itself – think of it as a SQL join where the column you are joining on is itself plus/minus one. The Python code is defined in the function make_main_df:

import pandas as pd

def make_main_df(filename):
    raw_data = pd.read_csv(filename)
    prev_season = raw_data.copy()
    prev_season['lookup'] = prev_season['season'] + 1
    main = pd.merge(
        raw_data,
        prev_season,
        left_on=['player_code', 'season'],
        right_on=['player_code', 'lookup'],
        suffixes=('_now', '_prev')
    )
    return main

Notice that the season field is joined on itself plus one and that the player_code is also part of the merge. The columns are renamed with suffixes that are self explanatory. If we use the Analyzer class I wrote for this project (more on that in the sequel to this blog post), we can see what kinds of columns this gives us for our analysis of passing data.

from football.analysis.analyzer import Analyzer

analyzer = Analyzer("script/analysis_config.json")
analyzer.set_analysis("passing")
analyzer.create_main_df()

analyzer.main.columns
 Index(['season_now', 'player_now', 'player_code', 'team_now', 'age_now',
        'pos_now', 'g_now', 'gs_now', 'qb_rec_now', 'pass_cmp_now',
        'pass_att_now', 'pass_cmp_perc_now', 'pass_yds_now', 'pass_td_now',
        'pass_td_perc_now', 'pass_int_now', 'pass_int_perc_now',
        'pass_long_now', 'pass_yds_per_att_now', 'pass_adj_yds_per_att_now',
        'pass_yds_per_cmp_now', 'pass_yds_per_g_now', 'pass_rating_now',
        'qbr_now', 'pass_sacked_now', 'pass_sacked_yds_now',
        'pass_net_yds_per_att_now', 'pass_adj_net_yds_per_att_now',
        'pass_sacked_perc_now', 'comebacks_now', 'gwd_now', 'season_prev',
        'player_prev', 'team_prev', 'age_prev', 'pos_prev', 'g_prev', 'gs_prev',
        'qb_rec_prev', 'pass_cmp_prev', 'pass_att_prev', 'pass_cmp_perc_prev',
        'pass_yds_prev', 'pass_td_prev', 'pass_td_perc_prev', 'pass_int_prev',
        'pass_int_perc_prev', 'pass_long_prev', 'pass_yds_per_att_prev',
        'pass_adj_yds_per_att_prev', 'pass_yds_per_cmp_prev',
        'pass_yds_per_g_prev', 'pass_rating_prev', 'qbr_prev',
        'pass_sacked_prev', 'pass_sacked_yds_prev', 'pass_net_yds_per_att_prev',
        'pass_adj_net_yds_per_att_prev', 'pass_sacked_perc_prev',
        'comebacks_prev', 'gwd_prev', 'lookup'],
       dtype='object')

Analyzing the Data

One thing that might be interesting to look at is the relationship between touchdowns and yardage. Is there any predictive power there?

Here is a plot of touchdown count versus yardage, per quarterback (so each dot is the quarterback, but any given quarterback could have many seasons plotted)

There is clearly a strong relationship between these two variables (because obviously there would be). But do touchdowns this season provide any help in predicting next season’s yardage? Here is that picture:

The relationship becomes much noisier. However, there does seem to be a nice, upward trend once the low touchdown/yardage observations are removed. If we filter results to “cleaner” quarterback observations, we get this plot:

This doesn’t look terrible! How does the current touchdown count versus next season’s touchdown count look?

Again, this is a filtered dataset and there does seem to be a degree of correlation present. If we plot the correlation matrix as a heatmap, we can get a better idea of exactly which of the variables available have predictive power and which do not.
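
The heatmap itself only takes a few lines – a sketch, assuming the merged passing DataFrame from the Analyzer snippet above (the figure in this post shows just the lower triangle; this sketch plots the full matrix):

import matplotlib.pyplot as plt

# Correlations between the numeric columns of the joined passing data
corr = analyzer.main.select_dtypes('number').corr()
fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns, fontsize=6)
fig.colorbar(im)
plt.show()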

There is a clear break between the “now” data and the “previous” data, as expected. However the large block in the middle of the triangle is of interest – this is what we will use to develop our models, and at first glance there does seem to be correlation.

I am going to leave the EDA at that; I could repeat this exercise for all the other positions, but at the end of the day I am simply trying to use whatever is available to make an informed decision. The important conclusion from this analysis, though, is that each position has quite a few numerical fields to work with and that, at first glance, there is reason to suspect something of value can be developed.

Conclusion

That’s all for now. I will explain the inner workings of my preprocessing, regressions, hyperparameter tuning, and predictions in the next post. I hope you found this overview of the web scraping and exploratory data analysis work useful and interesting!

Fantasy Football with Scrapy and scikit-learn (Part 2)

Welcome to part 2 of this tutorial! In the first part I went over how to get the data and do simple analysis, and in this section I will explain how I fit a number of different machine learning models. All of the code is available on Github.

Preprocessing and Pipelines

Now that the data has been acquired and determined to have predictive capabilities, we can turn our attention to building regression models. Before doing this, though, we want to make sure the data is clean, as well as in a proper format for model fitting.

For the purposes of this blog post, I’ll break the preprocessing into three components: filtering out dirty observations, standardizing the data, and introducing polynomial features.

Filtering the Data

This is the simplest step in cleaning up the data, but it is certainly an important one. For example, with the raw passing data we only want to consider quarterbacks, but there are a number of different positions represented in the data set. Additionally, we want to filter down to those observations with a meaningful number of passes. The code ultimately looks like this:

print('prior shape: ' + str(main.shape[0]))
main = main[main['pass_att_prev'] > 100]
main = main[main['pass_att_now'] > 100]
main = main[main['pos_prev'] == 'QB']
print('post shape: ' + str(main.shape[0]))

The print commands are helpful in seeing just how much data is dropped, something always worth keeping in mind when doing statistical analysis.

I won’t repeat this for all the other positions, but a similar weeding out of bad data has to occur, obviously.

Standardization

Some regression/classification models (such as support vector machines) are sensitive to the scale of the data and need to have standardized observations in order to work properly. There are a few different ways to accomplish this, but here I will use the most common, which is to calculate the Z score (\frac{x-\mu}{\sigma}) of each observation.

This can be done using the StandardScaler class in scikit-learn. Though I skip directly to wrapping this scaler in a Pipeline (explained in the next section), the basic usage is something along the lines of:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_pass_att = scaler.fit_transform(main[['pass_att_now']])  # a 2D selection, as the scaler expects

Polynomial Features

One thing to consider in modeling the data is the interaction between various features – perhaps player age and player interceptions aren’t significant features on their own, but age times interceptions is. From the scikit-learn documentation, if we have input features \mathbf{x}=[a, b] then the generated polynomial features (of degree 2) are \mathbf{x}=[1, a, b, a^2, ab, b^2]

This operation is accomplished simply using the PolynomialFeatures class in scikit-learn.
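
For instance, a single observation [a, b] = [2, 3] expands like so:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(np.array([[2.0, 3.0]])))
# [[1. 2. 3. 4. 6. 9.]]  ->  [1, a, b, a^2, ab, b^2]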

Pipelines

A helpful tool in the scikit-learn library is the Pipeline class. While each preprocessing step and model specification can be done one step at a time (by manually using the .fit_transform method), there is an alternative approach using pipelines.

Here, for example, is the first model I developed to predict passing yardage. I standardize the data, generate polynomial features, then fit a support vector regression (SVR) estimator:

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

target = main['pass_yds_now']
features = main[[
    'age_prev',
    'pass_cmp_prev',
    'pass_att_prev',
    'pass_cmp_perc_prev',
    'pass_yds_prev',
    'pass_yds_per_att_prev',
    'pass_adj_yds_per_att_prev',
    'pass_yds_per_cmp_prev',
    'pass_yds_per_g_prev',
    'pass_net_yds_per_att_prev',
    'pass_adj_net_yds_per_att_prev',
    'pass_td_prev',
    'pass_td_perc_prev',
    'pass_int_prev',
    'pass_int_perc_prev',
    'pass_sacked_yds_prev',
    'pass_sacked_prev'
]]


features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)


svr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('polyfeatures', PolynomialFeatures()),
    ('svr', SVR())
])

svr_pipeline.fit(features_train, target_train)

This is a useful way to avoid redundant code. Pipelines can also be extended to address many other aspects of preprocessing and modeling, such as one-hot encoding categorical columns or imputing missing values.
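
For example, a fuller (purely illustrative) pipeline might look something like this – the column choices below are pulled from the passing columns listed earlier, and this is not the pipeline actually used in the project:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Impute and scale numeric features, one-hot encode the team, then fit the SVR
numeric_features = ['age_prev', 'pass_att_prev', 'pass_yds_prev']
categorical_features = ['team_prev']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('svr', SVR()),
])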

Fitting the Regression Models

I will attempt to very briefly explain the math behind these models and then demonstrate the code.

Regression Models

I decided to use three different regression techniques to predict the various target variables: support vector regression, random forest regression, and ridge regression. After fitting, then tuning the models, I effectively use a voting estimator and take an average of the three models to produce my player rankings.
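
scikit-learn’s VotingRegressor does exactly this kind of averaging; here is a minimal sketch (not necessarily how the project’s Analyzer class wires it up), reusing the features_train/target_train split from the pipeline section above and placeholder hyperparameters:

from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Average the predictions of the three estimators
voter = VotingRegressor([
    ('svr', SVR()),
    ('forest', RandomForestRegressor(n_estimators=200, random_state=42)),
    ('ridge', Ridge(alpha=1.0)),
])
voter.fit(features_train, target_train)
predicted_yards = voter.predict(features_test)

In practice, each of those entries would be the tuned pipeline version of the estimator rather than the bare default.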

Support Vector Machine

SVMs are often first explained in terms of the classifier. It is simply an optimization problem. The idea in classification is to draw a line (decision boundary) through the classes that maximizes the distance between them (the gap between the classes is sometimes referred to as the “street”); soft-margin classification relaxes the line-drawing so that the model is more flexible. With regression, the idea is flipped: the width of the street becomes a known distance/hyperparameter, \epsilon, and the model minimizes regression error while fitting as many observations as possible inside the street, penalizing those that fall outside. Without getting too deep into it, the optimization problem for soft-margin support vector regression is

    \[\begin{array}{l r} \text{min} & \frac{1}{2}\mathbf{w}^T\mathbf{w}+C\sum_{i=1}^m\left(\zeta_i+\zeta_i^*\right)\\ \\ \text{s.t.} &  y_i - \mathbf{w}^T\mathbf{x}_i-b \leq \epsilon + \zeta_i\\ & \mathbf{w}^T\mathbf{x}_i + b - y_i \leq \epsilon + \zeta_i^* \\ & \zeta_i, \zeta_i^* \geq 0 \end{array}\]

where \mathbf{w} is the weight vector per feature, y_i is the known target variable, C is a regularization tuning variable, and \zeta_i,\zeta_i^* is a distance to the points outside the street.

There is also the kernel trick, which is a fancy way of transforming the training data without actually going through the trouble of doing the transform. For example, the kernel trick can be used to effectively recreate the polynomial features described previously:

    \[\phi(\mathbf{a})^T\phi(\mathbf{b})=\left(\mathbf{a}^T\mathbf{b}\right)^2\]

Random Forest Regression

Similar to the support vector machine, random forest models are typically explained as classification estimators. Random forest estimators are an extension of decision tree estimators. With decision trees, the idea is to minimize the Gini impurity in the training data:

    \[G_i=1-\sum_{k=1}^{N}p_{i,k}^2\]

where p_{i,k} is the fraction of class-k observations among the training instances in node i. With random forests, many trees are generated and the training data is bootstrapped for each one. We can easily go from classification to regression by predicting a numeric value (such as the average target in a leaf) rather than a class, as sketched below.
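
A minimal sketch, reusing features_train/target_train from the pipeline section above (the hyperparameters shown are illustrative, not tuned values):

from sklearn.ensemble import RandomForestRegressor

# Trees are insensitive to feature scale, so no scaler is needed here
forest = RandomForestRegressor(n_estimators=200, max_depth=6, random_state=42)
forest.fit(features_train, target_train)
forest_predictions = forest.predict(features_test)
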
Ridge Regression

While OLS linear regression is obviously the most common approach to regression, it is a model that is forced to utilize every feature when solving for the coefficients. Ridge regression is an interesting alternative which can discern between important features and less important ones. The typical coefficient equation for OLS is

    \[\pmb{\hat{\beta}}=\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}\]

Assuming \mathbf{X} is scaled, ridge regression incorporates a new hyperparameter, k, to “shrink” unimportant features and accentuate the more valuable data. The model becomes

    \[\pmb{\hat{\beta}}=\left(\mathbf{X}^T\mathbf{X} + k\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{Y}\]
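
A quick sketch of what that looks like in code, assuming an already-scaled feature matrix X and target y (k corresponds to scikit-learn’s alpha parameter):

import numpy as np
from sklearn.linear_model import Ridge

def ridge_coefs(X, y, k):
    # Closed-form ridge solution on scaled features, with no intercept term
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

# Ridge(alpha=k, fit_intercept=False) solves the same problem
sk_ridge = Ridge(alpha=10.0, fit_intercept=False)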

Tuning the Models

The scikit-learn library allows us not to just fit the models, but also tune the various hyperparameters each may require. For example, ridge regression requires the variable k but it isn’t clear exactly what that value should be. To find this out, there are search classes available in scikit-learn that cross validate a grid of parameters and identify the best combination.

How to do it

To tune an estimator, you have a few options in scikit-learn; the two I know are GridSearchCV and RandomizedSearchCV. While the grid search is just that – a full search over the entire grid of parameters provided – the randomized search samples a fixed number of parameter combinations rather than trying every one.

Though I ultimately ended up using RandomizedSearchCV for the analysis, one of my old commits on GitHub utilized the GridSearchCV class like so (the train/test split is omitted):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

svr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('polyfeatures', PolynomialFeatures()),
    ('svr', SVR())
])


param_grid = [{
    'polyfeatures__degree': [1, 2, 3],
    'svr__epsilon': np.linspace(1, 1000, 10),
    'svr__kernel': ['linear', 'poly', 'sigmoid', 'rbf'],
    'svr__C': [0.1, 1, 10],
}]

grid_search = GridSearchCV(
    svr_pipeline,
    param_grid,
    cv=10,
    scoring='neg_mean_squared_error'
)

grid_search.fit(features_train, target_train)

There are a couple of things to note. First, the syntax of the parameter grid: it is a dictionary of parameter options inside a list that gets passed into the GridSearchCV class. Also note that the regression itself is built into a pipeline. What this means for the hyperparameter tuning is that the name of the step in the pipeline needs to be prefixed onto the corresponding key of the param grid; notice that there is svr__epsilon in the param grid and that the corresponding tuple in the pipeline is named svr. Finally, it’s worth pointing out that the cv argument in the grid search class means there will be 10-fold cross validation, and that the scoring argument is negative mean squared error (a typical scoring metric for regression problems).

Once the grid search works out the best model, you can easily retrieve it for further analysis:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

best_model = grid_search.best_estimator_
predictions = cross_val_predict(
    best_model,
    features_train,
    target_train,
    cv=10
)
score = np.sqrt(mean_squared_error(predictions, target_train))
print('Grid Search SVR RMSE: ' + str(score))
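
For reference, swapping in RandomizedSearchCV is mostly a drop-in change. A minimal sketch reusing the svr_pipeline and param_grid from above (the n_iter and random_state values here are just illustrative choices) might look like:

from sklearn.model_selection import RandomizedSearchCV

# samples n_iter parameter combinations instead of exhaustively trying the grid
random_search = RandomizedSearchCV(
    svr_pipeline,
    param_distributions=param_grid[0],
    n_iter=20,
    cv=10,
    scoring='neg_mean_squared_error',
    random_state=42
)
random_search.fit(features_train, target_train)
best_model = random_search.best_estimator_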

Example: Tuning the Ridge Regression to Address Overfitting

As noted earlier, the k parameter of the ridge regression can be tuned. In the case of this project, the first model I fit has the passing touchdowns this season as the target variable and uses 17 features from the previous season as input.

Let’s take a look at the learning curve for the raw, untuned model.

The orange line is the out-of-sample error and the blue line is the in-sample error; so, with 400 observations the model is off by about eight touchdowns on average in-sample, and by about nine out-of-sample. The large gap between these two lines suggests the model is overfitting. Can this be fixed by tuning the k value? Here is what the learning curve looks like for the model tuned with the randomized search.

Clearly the model is improved and the gap between the two lines is significantly reduced (though, honestly, an error of about eight touchdowns is still rather large, and the more serious problem here is underfitting).

Bringing it All Together

So now we have all the steps of the analysis outlined:

  1. Manipulate the data into the format we want
  2. Filter down the observations into a clean dataset
  3. Specify and fit the models
  4. Tune the hyperparameters for every model
  5. Make predictions about this season

And this has to be done seven times for the various metrics. Instead of writing a script line by line, I have built out a class, Analyzer, to execute all of these steps. All of the machine learning functionality is simply wrapped in that class, and (for now) it is assumed that the code will be executed in order.

Configuration

Since the process is quite similar for every analysis, I thought it would be prudent to put the model specifications into a JSON file. This way, if I need to change, say, the target variable, I can simply edit the configuration file rather than root through a large script or Jupyter notebook.

The JSON analysis specs look like this:

{
  "home_dir": "./",

  "passing_analysis": {
    "main_df": "data/passing.csv",
    "target": "pass_td_now",
    "features": [
      "age_prev",
      "pass_cmp_prev",
      "pass_att_prev",
      "pass_cmp_perc_prev",
      "pass_yds_prev",
      "pass_yds_per_att_prev",
      "pass_adj_yds_per_att_prev",
      "pass_yds_per_cmp_prev",
      "pass_yds_per_g_prev",
      "pass_net_yds_per_att_prev",
      "pass_adj_net_yds_per_att_prev",
      "pass_td_prev",
      "pass_td_perc_prev",
      "pass_int_prev",
      "pass_int_perc_prev",
      "pass_sacked_yds_prev",
      "pass_sacked_prev"
    ],
    "filters": [
      ["pass_att_prev", ">",  100],
      ["pass_att_now", ">", 100],
      ["pos_prev", "==", "'QB'"]
    ],
    "hypertune_params": {
      "search_class": "RandomizedSearchCV",
      "cv": 10,
      "scoring": "neg_mean_squared_error"
    },
    "models": {
      "support_vector_machine": {
        "pipeline": {
          "scaler": "StandardScaler",
          "poly_features": "PolynomialFeatures",
          "svr": "SVR"
        },
        "search_params": {
          "poly_features__degree": [1, 2, 3],
          "svr__kernel": ["linear", "rbf", "sigmoid"],
          "svr__epsilon": [0.5, 1, 1.5]
        }
      },
      "random_forest_regressor": {
        "pipeline": {
          "scaler": "StandardScaler",
          "poly_features": "PolynomialFeatures",
          "random_forest": "RandomForestRegressor"
        },
        "search_params": {
          "poly_features__degree": [1, 2],
          "random_forest__max_depth": [2, 3, 5, 1000],
          "random_forest__min_samples_leaf": [1, 10, 20, 40, 60, 80, 100]
        }
      },
      "ridge": {
        "pipeline": {
          "scaler": "StandardScaler",
          "poly_features": "PolynomialFeatures",
          "ridge_regression": "Ridge"
        },
        "search_params": {
          "poly_features__degree": [1, 2, 3],
          "ridge_regression__alpha": [0.1, 0.2, 0.3, 0.4, 0.5]
        }
      }
    }
  }
}

This spells out the whole process! The CSV we want? It’s located in data/passing.csv. The filters? They’re specified under “filters”. Add or drop features? Easy. The models are a little trickier in how they are configured, but it is exactly the same process as before: any modeling has to be done through a pipeline, and then the parameter grid for tuning can be changed quite easily.

Additional analysis specifications can be inserted as well, as long as their keys carry the same _analysis suffix.
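
To give a sense of what has to happen under the hood, here is a rough, simplified sketch (not the actual Analyzer internals) of reading one of these model blocks and assembling a scikit-learn pipeline from it; the ESTIMATORS lookup table is an illustrative stand-in for the manual imports mentioned later:

import json
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import SVR

# map the strings in the config to estimator classes
ESTIMATORS = {
    'StandardScaler': StandardScaler,
    'PolynomialFeatures': PolynomialFeatures,
    'SVR': SVR,
}

with open('config.json') as f:
    config = json.load(f)

spec = config['passing_analysis']['models']['support_vector_machine']
pipeline = Pipeline([
    (name, ESTIMATORS[class_name]())
    for name, class_name in spec['pipeline'].items()
])
param_grid = spec['search_params']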

Execution in Python

It is simple to take this configuration and actually execute the scikit-learn code.

analyzer = Analyzer("config.json")
analyzer.set_analysis("passing")
analyzer.create_main_df()
analyzer.filter_main()
analyzer.split_data()
analyzer.create_models()
analyzer.run_models()
analyzer.tune_models()

This will run all the steps listed in the config file. I want to build out unit tests for this class to guarantee it handles everything as expected, but for the purposes of this analysis the code does work as intended.

Prediction

Now to finally get the power rankings.

Assume that the config file spells out all seven target variables we want and how to run the regressions on them. Assume that the Analyzer class has done all the hard work of preprocessing, model fitting, and model tuning.

Now, we want to take the original dataset, append the suffix _prev to the 2018 season data, and then use that as the input to the Analyzer instance to get a prediction for each metric (passing/receiving/rushing touchdowns and yardage, plus total receptions).

This seemingly tricky operation can be done using the following function:

def predict_from_raw(main, analyzer):
    df = main.copy()
    df.columns = [col + "_prev" for col in df.columns]
    features = df[analyzer.features_names]
    predictions = analyzer.predict_tuned_models(features)
    names = df["player_prev"]
    names.index = range(len(names))
    predictions['name'] = names
    return predictions

Notice the method .predict_tuned_models. This is what will run a prediction based on the input data for each model from the configuration file.

Once all the predictions have been made, we want to take each predicted statistic and calculate the corresponding fantasy points:

models = ['support_vector_machine', 'random_forest_regressor', 'ridge']
passing_fantasy_yds = passing_yds_predictions[models].div(25)
receiving_fantasy_yds = receiving_yds_predictions[models].div(10)
rushing_fantasy_yds = rushing_yds_predictions[models].div(10)

passing_fantasy_td = passing_td_predictions[models].mul(4)
receiving_fantasy_td = receiving_td_predictions[models].mul(6)
rushing_fantasy_td = rushing_td_predictions[models].mul(6)

receiving_fantasy_rec = receiving_rec_predictions[models]

Now that all the predictions have been made and converted into fantasy points, the last step is to average across the models for an overall ranking (this function is admittedly pretty hacky in how it shoehorns the receptions in, but it gets the job done for now).

def get_ranking(yds, td, main_df, rec=None):
    vote_yds = yds.apply(np.mean, axis=1)
    vote_yds.index = main_df['player_code']
    vote_td = td.apply(np.mean, axis=1)
    vote_td.index = main_df['player_code']
    if rec is not None:
        vote_rec = rec.apply(np.mean, axis=1)
        vote_rec.index = main_df['player_code']
        vote = vote_yds + vote_td + vote_rec
    else:
        vote = vote_yds + vote_td
    return vote
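
For example, the QB rankings come from a call along these lines; passing_main_df here is just a stand-in name for whatever dataframe holds the player codes for the passing analysis:

# average the three models' fantasy-point predictions and rank the QBs
qb_ranking = get_ranking(
    passing_fantasy_yds,
    passing_fantasy_td,
    passing_main_df
)
print(qb_ranking.sort_values(ascending=False).head(10))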

Results

And here are the results!

QB

player_code    points       position  player
MahoPa00       313.14238    QB        Patrick Mahomes
LuckAn00       295.370346   QB        Andrew Luck
RoetBe00       289.55698    QB        Ben Roethlisberger
RyanMa00       273.549325   QB        Matt Ryan
GoffJa00       267.91081    QB        Jared Goff
CousKi00       263.739173   QB        Kirk Cousins
BreeDr00       262.422544   QB        Drew Brees
BradTo00       249.018828   QB        Tom Brady
MayfBa00       246.809009   QB        Baker Mayfield
RivePh00       246.742789   QB        Philip Rivers

WR

player_code    points        position  player
HopkDe00       265.0243634   WR        DeAndre Hopkins
JoneJu02       263.8836521   WR        Julio Jones
AdamDa01       257.2140158   WR        Davante Adams
BrowAn04       252.9623443   WR        Antonio Brown
ThomMi05       250.3592538   WR        Michael Thomas
SmitJu00       244.594075    WR        JuJu Smith-Schuster
ThieAd00       241.9040111   WR        Adam Thielen
HillTy00       240.7230349   WR        Tyreek Hill
EvanMi00       233.97766     WR        Mike Evans
HiltT.00       215.3827413   WR        T.Y. Hilton

RB

player_code    points       position  player
GurlTo01       176.535942   RB        Todd Gurley
ElliEz00       156.799343   RB        Ezekiel Elliott
BarkSa00       150.807151   RB        Saquon Barkley
MixoJo00       145.69819    RB        Joe Mixon
ConnJa00       143.190105   RB        James Conner
CarsCh00       137.216166   RB        Chris Carson
GordMe00       136.926972   RB        Melvin Gordon
MackMa00       135.825863   RB        Marlon Mack
HuntKa00       127.118146   RB        Kareem Hunt
McCaCh01       125.465541   RB        Christian McCaffrey

A quick comment on trying to recreate these results: several pieces of this pipeline employ randomness (the bootstrapping in the random forests, the randomized hyperparameter search, etc.), so unless you carefully set a seed you may (read: will) see slightly different rankings.
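
If you do want reproducible numbers, the usual fix is to pass random_state to anything that accepts it; a small illustrative sketch (the values are arbitrary):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# fixing random_state on both the estimator and the search makes reruns repeatable
forest = RandomForestRegressor(n_estimators=100, random_state=42)
search = RandomizedSearchCV(
    forest,
    param_distributions={'max_depth': [2, 3, 5, 10, None]},
    n_iter=3,
    cv=10,
    random_state=42
)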

Conclusions

The Good

I like that my number one QB ranking was Mahomes and that number two was Andrew Luck! This suggests that at least something was right with my models! The little I did know going into this season was that Mahomes was highly regarded (first round pick) and that Andrew Luck was also considered quite valuable (until, uh, recently). On the whole, I think the analysis gave me what I wanted: a generally more coherent approach to picking a fantasy team than randomly guessing.

I also like that the configuration file is in JSON and seems pretty extensible. I would like to build out more plotting functionality, and I think it would even be cool to add a front end that runs the Python on the fly in the background, perhaps accessed through a REST service. I also think the Analyzer class could be pointed at other CSV files, or perhaps a database connection, and there is potential for building out more functionality in general.

The Bad

I do not like that my number three QB ranking was Ben Roethlisberger! That totally torpedoed my team last week! Though this analysis provides better-than-totally-random picking ability, it does not give a super precise prediction; recall that the ridge model was underfitting. There is obviously a whole universe of analysis dedicated to figuring this stuff out, but needless to say more can be done than what I worked out here.

Another notable absence is any way to go about drafting the players based on this data. Though I can see the top predicted performers, there isn’t any optimization that tells me who to pick and in which round of the draft; Mahomes was taken immediately, and wide receivers were the next fastest to go. How to account for this? My guess is some kind of constrained linear optimization problem, which could perhaps be another project.

The Ugly

I do not like that the Analyzer class is untested. This isn’t a crisis for a first pass, but it is something I would like to clean up. Additionally, the regression estimator classes from scikit-learn (Ridge, SVR, and RandomForestRegressor) have been imported manually into the Python module that defines the Analyzer class, which is not ideal. I would like a way to look up the regressors dynamically, so that other regressors from scikit-learn (such as Lasso or BaggingRegressor) can be incorporated.

Final Note

That’s it! I hope you found this analysis interesting! I hope that somebody reads this!

This is my first blog post ever on anything related to data science or coding, so I am certainly open to suggestions on how to better present my work. Let me know in the comments, or by email.

I do intend to develop this further, so some of the conclusions in this post may change going forward, but otherwise this should be a pretty good overview of the entire project.

Have a nice day!