BEYOND PACE: PREDICTING 1500M FREESTYLE TIMES WITH MULTI-FEATURE RANDOM FORESTS

Author(s): RUSSOMANNO, T., Institution: UNB UNIVERSIDADE DE BRASILIA/ TUM, Country: GERMANY, Abstract-ID: 1404

INTRODUCTION:
Advances in technology have produced large volumes of data in many sports, making data-driven models increasingly popular for performance analysis (Silva et al., 2007). Different models have been applied to swimming to predict individual event performance from various datasets (Wu et al., 2021). In long-distance swimming, the pace strategy (PS) follows a U-shaped curve (Lara and Del Coso, 2021): professional swimmers start and finish fast and hold a relatively consistent speed, with minor fluctuations, in between. This study investigates the use of a Random Forest model to predict final race times from athletes' heat times and pace strategy.
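As a minimal sketch of the U-shaped profile described above, the snippet below compares the mean 50 m split time at the start, middle, and end of a race. The function name `pacing_profile`, the `edge_laps` parameter, and the split values are illustrative assumptions, not part of the study.

```python
from statistics import mean


def pacing_profile(splits_s, edge_laps=2):
    """Summarize a race from its per-50 m split times (seconds).

    edge_laps is a hypothetical choice: how many laps count as the
    'start' and 'finish' segments of the race.
    """
    start = mean(splits_s[:edge_laps])
    finish = mean(splits_s[-edge_laps:])
    middle = mean(splits_s[edge_laps:-edge_laps])
    return {
        "start": start,
        "middle": middle,
        "finish": finish,
        # U-shaped speed curve: both ends are faster (lower split time)
        # than the middle of the race.
        "u_shaped": start < middle and finish < middle,
    }


# Made-up splits for a 1500 m race (30 laps of 50 m), not real data.
example_splits = [27.5, 29.0] + [29.8] * 26 + [29.2, 28.0]
profile = pacing_profile(example_splits)
print(profile["u_shaped"])
```

A perfectly even-paced race (all splits equal) would not be flagged as U-shaped by this check.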
METHODS:
Race data from five Olympic Games (Sydney, Athens, Beijing, Rio, and Tokyo) were analyzed, covering both heats and finals. Data were obtained from the FINA website (https://www.fina.org), which provides split times for every 50 m and the final time of each race. A total of 174 races were analyzed. The dataset was divided into two parts: heat data for training the model and final data for evaluation. Features such as mean time, speed at different distances, and total time were selected for model training. A Random Forest model was trained with the hyperparameters identified as optimal: n_estimators=200, max_depth=8, max_features=sqrt, min_samples_leaf=1, random_state=42.
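The setup above could be sketched roughly as follows with scikit-learn, whose `RandomForestRegressor` accepts the hyperparameter names listed. The abstract does not give the exact feature set or how heats and finals are paired, so the `split_features` helper, the five features it returns, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_LAPS = 30  # 1500 m swum as 30 laps of 50 m


def split_features(splits_s):
    """Turn 30 per-50 m split times (s) into a small feature vector.

    The choice of features is hypothetical; the abstract only names
    mean time, speed at different distances, and total time.
    """
    splits = np.asarray(splits_s, dtype=float)
    speeds = 50.0 / splits                 # m/s for each lap
    return np.array([
        splits.sum(),                      # total time
        splits.mean(),                     # mean 50 m time
        speeds[:2].mean(),                 # speed over the first 100 m
        speeds[2:-2].mean(),               # mid-race speed
        speeds[-2:].mean(),                # speed over the last 100 m
    ])


# Synthetic stand-in data (NOT the Olympic dataset): heat splits per
# swimmer and an assumed final-race time as the regression target.
rng = np.random.default_rng(0)
n_swimmers = 60
base = rng.uniform(29.0, 32.0, size=(n_swimmers, 1))
heat_splits = base + rng.normal(0.0, 0.2, size=(n_swimmers, N_LAPS))
final_times = heat_splits.sum(axis=1) + rng.normal(-2.0, 3.0, size=n_swimmers)

X = np.vstack([split_features(s) for s in heat_splits])
y = final_times

# Hyperparameters as reported in the abstract.
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=8,
    max_features="sqrt",
    min_samples_leaf=1,
    random_state=42,
)
model.fit(X, y)
predicted = model.predict(X[:3])  # predicted times in seconds
```

Fixing `random_state` makes the fitted forest reproducible, which matters when comparing metrics across runs.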
To evaluate the model's performance, the following metrics were used: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared, and Explained Variance Score.
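For reference, the five metrics can be written out from their standard definitions. The sketch below uses plain Python so each formula is visible; the helper name `regression_metrics` and the sample times are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, R-squared and explained variance."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    mean_err = sum(errors) / n
    # Population variances of the targets and of the residuals.
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        # R-squared penalizes every residual, including a constant bias.
        "r2": 1.0 - mse / var_true,
        # Explained variance ignores a constant bias in the predictions.
        "explained_variance": 1.0 - var_err / var_true,
    }


# Made-up race times in seconds, not results from the study.
y_true = [900.0, 910.0, 920.0, 930.0]
y_pred = [902.0, 908.0, 923.0, 931.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

R-squared and explained variance coincide when the residuals have zero mean; a gap between them points to a systematic over- or under-prediction.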

RESULTS:
On the training data, the model achieved an MSE of 5.09 s², an RMSE of 2.25 s, an MAE of 1.71 s, an R-squared of 0.953, and an Explained Variance Score of 0.968. On the finals data, it achieved an MSE of 32.53 s², an RMSE of 5.70 s, an MAE of 4.40 s, an R-squared of 0.94, and an Explained Variance Score of 0.94. R-squared therefore decreased only slightly on the new data, while the typical prediction error (RMSE) rose to around 5.7 seconds. In a race lasting approximately 14 minutes (840 s), this corresponds to an error of roughly 0.7% of the total time.
CONCLUSION:
The analysis consistently revealed a U-shaped pace strategy profile employed by all athletes across all races, whether they were competing in heats or finals. This finding highlights the consistency of this approach in 1500m swimming. The Random Forest model performed strongly on the training data, explaining over 95% of the variance in total times, which indicates that it learned the underlying patterns and relationships within the dataset. When applied to the evaluation data (final times), the model's performance decreased slightly, with an average prediction error (RMSE) of approximately 5.7 seconds. This study showcases the promising potential of Random Forest regression for predicting swimming times using features derived from both heat and final performances.