Detecting Marathon Cheaters: Using Python to Find Race Anomalies

ZhongTr0n
7 min read · Sep 7, 2024


Image source: pexels.com

Introduction

Being a runner myself, I was recommended a YouTube video about two models who got caught cheating in a half marathon. You can see all the details in the video, but in short, someone found out that their split times don’t add up. In non-running terms, this means that if you split up the race into segments, they achieved unrealistically fast times for specific segments. If a runner were to take the bus or a shortcut, the total time might not show it, but the segments would.

This got me thinking: it’s pretty easy to automate this, right? Driven by curiosity, I started looking for a race to see if I could find some suspicious activity myself…

Approach & Ethical Considerations

The idea is simple: get all the results from a race (including the splits) and see if there are anomalies. The two main anomalies would be:

  • running faster than humanly possible
  • running some segments significantly faster than your overall level

However, there are some ethical considerations. Even though I dislike cheaters, marathon cheating is a pretty harmless ‘crime’, if you could even call it that. As long as no prize money or other reward is involved, the cheater is only fooling themselves. I asked some friends for their opinions on marathon cheating, and they varied from “unacceptable” to “who cares”.

Additionally, one can never be completely sure. Even though some missing splits and high paces can seem suspicious, they are not definitive proof that the participant cheated. There might be another explanation that has been overlooked.

Image source: tenor.com

For these reasons, I chose to censor all the personal information in this article.

Data Collection

After visiting the results page of the race I wanted to analyze, I looked at the network traffic and quickly found the API the page uses to talk to the database. As the robots.txt file on the website does not mention any limitations, I was free to scrape the data I needed.

I wrote a Python script that makes requests to the API to get the race results for each participant.

Image source: tenor.com

In order not to overload their servers, I built in a sleep timer that waits about 10 seconds between requests. This makes sure I am not interfering with any other traffic on their site.
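The collection loop boiled down to something like the sketch below. The endpoint, the “bib” parameter name, and the range of race numbers are placeholders, since I won’t share the actual race; the structure is what matters.

import time
import requests

BASE_URL = "https://example-race.com/api/results"  # placeholder endpoint

results = []
for bib in range(1, 5001):  # placeholder range of race numbers
    response = requests.get(BASE_URL, params={"bib": bib}, timeout=30)
    if response.ok:
        results.append(response.json())
    time.sleep(10)  # wait ~10 seconds between requests to keep the load negligible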

Slowly but surely, the script found all the data and wrote it to a table containing information like:

  • race number
  • average pace
  • total time
  • time split n
  • pace split n

This was all I needed to check for simple anomalies.
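Flattening the raw responses into such a table can look roughly like this. The JSON field names are hypothetical, but the resulting columns match the ones used in the analysis below; Pickle is only used to cache the scrape locally.

import pandas as pd

rows = []
for result in results:
    row = {
        "race_number": result.get("bib"),        # hypothetical field names
        "average_pace": result.get("avg_pace"),
        "total_time": result.get("total_time"),
    }
    # One speed column per split, e.g. split1_speed ... split10_speed
    for i, split in enumerate(result.get("splits", []), start=1):
        row[f"split{i}_speed"] = split.get("speed")
    rows.append(row)

df = pd.DataFrame(rows)
df.to_pickle("race_results.pkl")  # cache locally so the scrape only has to run once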

Methodology

Using Pandas (the go-to Python data analysis library), I focused on the two anomalies described earlier in the approach.

Method 1: Superhuman Pace

Running the script below, I filtered for all participants who had a split above 25 km/h (~16 mph), which is faster than marathon world record pace.

import pandas as pd

# Speed columns (km/h) for each of the ten splits
columns_to_check = [
    'split1_speed', 'split2_speed', 'split3_speed', 'split4_speed',
    'split5_speed', 'split6_speed', 'split7_speed', 'split8_speed',
    'split9_speed', 'split10_speed'
]

# Keep only participants with at least one split faster than 25 km/h
filtered_df = df[df[columns_to_check].gt(25).any(axis=1)]

And sure enough, I had one result: one participant managed to run (?) a segment at around 60 km/h (37 mph).

Interesting result to say the least, but before diving deeper into the individual cases, it’s time to run another test: looking for variance in pace within the same participant.

Method 2: Supersplits

This method gets a bit more technical. The z-score is a statistical measure of how far a point is from the mean of a normally distributed dataset. It tells you how many standard deviations a point lies from the mean, and a z-table tells you what percentage of points fall within that distance.

Image source: https://z-table.com/

A z-score is calculated with the following formula:

z = (x − μ) / σ

where x is the individual split pace, μ is the runner’s mean pace, and σ is the standard deviation of their split paces.

Example: let’s say a runner has a mean pace of 10 km/h with a standard deviation of 1.7 (meaning a typical split is about 1.7 km/h faster or slower than the mean). A pace of 13.4 km/h would then have a z-score of 2, because two times the standard deviation (1.7) equals 3.4, and the mean (10) plus 3.4 equals 13.4.
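In code, that example boils down to a one-line calculation:

# Numbers from the example above
mean_pace = 10.0   # km/h, the runner's mean pace
std_pace = 1.7     # km/h, standard deviation of their split paces
split_pace = 13.4  # km/h, the split being checked

z = (split_pace - mean_pace) / std_pace
print(z)  # 2.0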

Even though this method can give a good indication, caution should be applied, as there are not that many data points per participant and a normal distribution cannot always be assumed.

Iteration 1
I set my cut-off at a z-score of +3, the typical threshold used to find anomalies. Theoretically, I should also look for values below −3, but of course there is not much point in cheating by going extremely slowly.
Result: 0 participants.

Iteration 2:
OK, a z-score of 3 did not return anything, so let’s move a bit closer to the mean by using 2.75.
Result: 61 participants.
However, when looking at the individual results, I noticed they all seemed to have an extra fast split on the final stretch. This makes sense of course: in the last segment, you empty the tank and put in your final sprint.

Iteration 3:
Keep the threshold at 2.75 but ignore the final split.
Result: 1
And it probably won’t surprise you that it’s the same participant who recorded the superhuman pace in the other test.
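Put together, the final iteration looks roughly like the sketch below. It reuses the split-speed columns from the earlier snippet and scores each split against the runner’s own splits, ignoring the final one.

import pandas as pd

# Split-speed columns, excluding the final split (a finishing sprint is expected)
split_cols = [f'split{i}_speed' for i in range(1, 10)]
splits = df[split_cols]

# Each runner's mean and standard deviation across their own splits
means = splits.mean(axis=1)
stds = splits.std(axis=1)

# Z-score of every split relative to the runner's own distribution
z_scores = splits.sub(means, axis=0).div(stds, axis=0)

# Flag anyone with a split more than 2.75 standard deviations above their own mean
suspicious = df[z_scores.gt(2.75).any(axis=1)]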

Results

So the analysis shows only one participant recorded suspicious splits. Time to look at their performance in more detail.

Image by author

As you can see, the participant missed the 35k split. After missing that split, they showed up again at the next split, bridging the 30k to 40k distance in less than 10 minutes. That works out to an average speed of around 68 km/h (42 mph); for reference, covering those 10 km in exactly 10 minutes would already mean 60 km/h.

It gets even more interesting if you look at the race route.

Image by author

The split they missed (35k) is right in the middle of a loop. It is definitely plausible that this participant skipped the loop and only rejoined the race at a later point on the course.

Conclusion

Using the power of data analysis, I was able to quickly detect anomalies in race data. One participant recorded a superhuman speed, which makes their result questionable to say the least. However, a technical glitch or some other issue might also explain the anomaly.

While I am happy I was able to find a result that easily, I was also a bit disappointed. Part of me wished I had had to dig a lot deeper and come up with some advanced analysis. However, this is a good example of how things don’t always have to be complicated.

This was of course the analysis of one single race in one single year. Keep reading if you want to try this on your own local race.

Join The #RunDataChallenge

As there are thousands and thousands of races every year all across the globe, there are countless datasets to analyze. I noticed that many people in the data community are looking for personal project ideas, and I believe this is a good one. Why?

  • it’s a unique project (your local race)
  • there are many ways to go about this
  • it requires data collection, data analysis, visualization and storytelling

If you do decide to analyze your local race, feel free to add the hashtag #RunDataChallenge so others can easily find your results. But remember, I do not advocate for naming & shaming. It’s essential to approach this challenge as a data storytelling exercise, not a witch hunt. Publicly accusing someone without definitive proof can lead to unwanted consequences for all involved.

Furthermore, the methods and tools described in this article are provided as a guideline for educational purposes only. I am not responsible for how you choose to implement these methods or for the results you obtain and publish. Please exercise caution and use common sense when collecting data and sharing your findings. It is your responsibility to ensure that your actions are ethical, legal, and respectful of others’ privacy.

Tech Stack

For this analysis I used Python with the following libraries:

  • Pandas
  • Numpy
  • Requests
  • Pickle

The author’s graphics were created with Google Draw and Notability.
