A technical breakdown of our data pipeline, modeling choices, and why we chose Gradient Boosting over Neural Networks.
Fantasy Premier League (FPL) is notoriously noisy. A defender can concede a goal in the 95th minute and lose their clean sheet (-4 points relative to expectation) due to a random deflection. Capturing the "true talent" signal amidst this variance is the core challenge.
Early iterations of this project attempted to predict exact raw points for a single gameweek. This failed miserably due to variance. We pivoted to predicting Expected Points (xP), a robust long-term metric that minimizes error over time rather than trying to guess a specific hat-trick.
Data quality is paramount. Our pipeline runs every six hours and processes data from three primary layers.
One of our key differentiators is how we handle player consistency. We engineered a proprietary feature called the Efficiency Score, which rewards steady returns and penalizes boom-and-bust scoring.
This metric heavily penalizes volatility. A player who scores 12, 2, 12, 2 (high variance) will have a lower Efficiency Score than a player scoring 7, 6, 7, 8 (low variance). This aligns with long-term probability theory: "Haulers" are often statistically lucky, while "Grinders" are statistically sustainable.
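The exact formula isn't reproduced in this excerpt, so here is a minimal sketch of one consistency-rewarding definition, mean points divided by one plus the points volatility. This specific ratio is an assumption for illustration, not the production formula:

```python
from statistics import mean, pstdev

def efficiency_score(points_history: list[float]) -> float:
    """Hypothetical Efficiency Score: reward average output, penalize
    gameweek-to-gameweek volatility. Illustrative only; the production
    formula may differ."""
    mu = mean(points_history)
    sigma = pstdev(points_history)  # population std dev of gameweek points
    return mu / (1.0 + sigma)

hauler = efficiency_score([12, 2, 12, 2])   # high variance
grinder = efficiency_score([7, 6, 7, 8])    # low variance
```

Both players average exactly 7 points per gameweek, but under this definition the low-variance "grinder" scores several times higher than the boom-and-bust "hauler".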
The "Clinicality" Myth: Notably, our model explicitly excludes raw `goals_scored` from its training features. We rely entirely on xG. If a player scores 5 goals from 0.5 xG, the model treats this as a negative signal (unsustainable overperformance) rather than a positive skill.
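In feature-engineering terms, that exclusion might look like the sketch below: raw goal counts never reach the model, while an overperformance ratio does. The feature names and the shape of the ratio are illustrative assumptions:

```python
def finishing_features(goals: int, xg: float) -> dict[str, float]:
    """Hypothetical feature builder: raw goals are deliberately excluded
    from the output; only xG and a goals/xG overperformance ratio are
    emitted. Names and clipping behavior are assumptions."""
    # goals / xG >> 1 implies unsustainable overperformance (regression risk),
    # which the model can learn as a negative signal.
    overperformance = goals / xg if xg > 0 else 0.0
    return {
        "xg": xg,
        "xg_overperformance": overperformance,
    }

feats = finishing_features(goals=5, xg=0.5)
# feats["xg_overperformance"] == 10.0 -> heavy regression to the mean expected
```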
We use a Gradient Boosting Regressor (GBR) ensemble, chosen for its ability to handle non-linear interactions between features, such as the Teammate Cannibalization effect (when a premium asset like Salah returns, teammates' projections drop due to xG redistribution).
We use a Hybrid Ensemble of two GBRs: one trained on full multi-season history (long-term stability) and one weighted toward recent gameweeks (current form).
The final prediction is a weighted average of these two models, giving you the stability of history with the responsiveness of form.
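The blend can be sketched with scikit-learn on synthetic data. The real feature set is proprietary, and the recent-form split and the 0.6/0.4 weighting here are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for player feature matrices (real features are proprietary).
X_hist = rng.normal(size=(400, 6))                      # multi-season history
y_hist = X_hist @ rng.normal(size=6) + rng.normal(scale=2.0, size=400)
X_form, y_form = X_hist[-80:], y_hist[-80:]             # recent-form window (assumed)

# Model A: stability -- trained on the full history.
stable = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X_hist, y_hist)
# Model B: responsiveness -- trained only on recent gameweeks.
responsive = GradientBoostingRegressor(n_estimators=100, max_depth=2).fit(X_form, y_form)

def predict_xp(X, w_stable: float = 0.6) -> np.ndarray:
    """Weighted average of the two models; the 0.6/0.4 split is illustrative."""
    return w_stable * stable.predict(X) + (1 - w_stable) * responsive.predict(X)
```

In practice the blend weight itself can be tuned on a validation window, trading off how quickly projections react to a hot streak.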
Specialized Goalkeeper Model: Goalkeepers are fundamentally different from outfield players. They rely on clean sheets and saves, which are rarer events. For GKPs, we use a separate model architecture with a Log-Transformed Target to handle the high variance and skew in their points distribution.
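The transform itself is simple: train in log space, invert at prediction time. The clamp at zero for negative-point gameweeks is an assumption of this sketch:

```python
import math

def to_log_target(points: float) -> float:
    """log1p compresses the right tail of GK scores (rare big save hauls),
    reducing the skew the model has to fit."""
    return math.log1p(max(points, 0.0))  # clamp: log1p is undefined below -1

def from_log_target(pred: float) -> float:
    """Invert the transform after the model predicts in log space."""
    return math.expm1(pred)

# Round-trip: a 15-point goalkeeper haul survives the transform intact.
roundtrip = from_log_target(to_log_target(15.0))
```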
Predicting points is only half the battle. The other half is the Knapsack Problem: fitting the best players into a budget constraint.
Our "Transfer Strategizer" is a Mixed-Integer Linear Programming (MILP) solver. It searches the space of feasible transfer combinations for the next N weeks, pruning via branch-and-bound rather than brute-force enumeration.
This allows us to suggest "optimal" pathways that a human might miss, such as selling a premium asset this week to fund two mid-priced upgrades next week.
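A toy single-gameweek version of the budget knapsack illustrates the core trade-off (the production Strategizer is a multi-week MILP with position quotas and the three-per-club rule; player names and prices here are hypothetical):

```python
def best_squad(players, budget, size):
    """Toy exact knapsack: maximize predicted xP subject to a budget cap
    and a squad-size cap, via dynamic programming over (spent, slots).
    players: (name, price in 0.1m units, predicted xP). Ignores position
    quotas and the 3-per-club rule for brevity."""
    dp = {(0, 0): (0.0, [])}  # (budget spent, slots used) -> (best xP, picks)
    for name, price, xp in players:
        for (spent, picked), (total, chosen) in list(dp.items()):
            if picked < size and spent + price <= budget:
                key = (spent + price, picked + 1)
                cand = (total + xp, chosen + [name])
                if key not in dp or cand[0] > dp[key][0]:
                    dp[key] = cand
    return max((v for (s, p), v in dp.items() if p == size), default=(0.0, []))

# Hypothetical pool: (name, price in 0.1m units, predicted xP).
players = [("PremiumFwd", 125, 8.0), ("MidPriceA", 75, 5.5),
           ("MidPriceB", 75, 5.2), ("BudgetDef", 40, 3.0)]
xp, picks = best_squad(players, budget=200, size=2)
```

Here the optimizer spends the full budget on the premium forward plus one mid-priced player rather than two cheaper assets, because the combined xP is higher; the real solver makes the same kind of comparison across every week in the planning horizon.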
"All models are wrong, but some are useful." – George Box
View Model Accuracy