Time Series Project: Bike Sales
Modeling & Forecasting
INTRODUCTION
Type: Time Series course group project
Role: Group leader
Duration: 2 months
This project centers on time series modeling. Our team used the "Bike Sales in Europe" dataset from Kaggle to conduct time series analysis and forecasting. The raw dataset contains 18 columns, including date, customer age, customer gender, country, state/province, quantity ordered, unit cost, and others. Because the original dataset is large and contains many missing values, our study focuses on the state of California, with training data covering 2011-01-01 to 2014-06-30.
We aim to identify sales trends, detect stable intra-year seasonal patterns, compare different models on monthly aggregated sales, select the best model using in-sample fit and residual diagnostics, and finally produce forecasts with the selected model.
Step 1. Data Preprocessing
The preprocessing pipeline includes two key steps to prepare the data for modeling:

Moving Average Smoothing (27-day window)
The original daily order quantity data exhibits high volatility with frequent spikes and fluctuations.
To identify underlying trends while reducing daily noise, we applied a moving average using rollmean() with k=27.
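A minimal sketch of this step, assuming the California daily order quantities live in a data frame named ca_daily with columns date and qty (hypothetical names) and that the window is centered:

library(zoo)

# 27-day moving average of daily order quantity (centered window assumed)
ca_daily$qty_ma27 <- rollmean(ca_daily$qty, k = 27, fill = NA, align = "center")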


While the original data (light purple) displays sharp daily variations, the smoothed series (dark purple) successfully captures the broader trends:
a period of relative stability in 2011-2012, a notable decline in early 2013, followed by a recovery and upward trend from mid-2013 to 2014.
Missing Value Imputation and Additional Smoothing
Many dates in the time series have missing order quantities. We filled these gaps using linear interpolation. This ensures data continuity essential for time series analysis without introducing artificial discontinuities.
Then we applied a 7-day right-aligned moving average to further reduce short-term fluctuations, particularly weekly patterns that might obscure monthly or seasonal trends (both operations are sketched below).
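A minimal sketch of the imputation and second smoothing pass, again on the hypothetical ca_daily data frame; zoo::na.approx performs the linear interpolation:

library(zoo)

# fill missing daily order quantities by linear interpolation over the date index
ca_daily$qty_filled <- na.approx(ca_daily$qty, x = ca_daily$date, na.rm = FALSE)

# 7-day right-aligned moving average to damp weekly fluctuations
ca_daily$qty_smooth <- rollmean(ca_daily$qty_filled, k = 7, fill = NA, align = "right")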

The filled and smoothed series (purple line in the second plot) maintains the overall trend structure while eliminating noise.
Step 2. Data Decomposition

Data: monthly aggregated bike sales

Decomposition Method: STL
Components:
Trend, seasonality, and remainder
Results (the decomposition call is sketched below):
1. Significant overall growth in underlying sales volume after a temporary downturn
2. A highly significant seasonal influence with a clear, regular annual pattern
3. A relatively small remainder component
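A minimal sketch of the STL decomposition, assuming the smoothed daily series has been aggregated into a monthly ts object named sales_monthly (hypothetical name) with frequency 12:

# STL decomposition into trend, seasonal, and remainder components
fit_stl <- stl(sales_monthly, s.window = "periodic")
plot(fit_stl)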
Step 3. Stationarity Analysis
ADF Stationarity Test
ADF stationarity test on the training set
Lag order = 3
Test statistic = -1.6499
P-value = 0.7116 > 0.05.
Result: Fail to reject the null hypothesis of a unit root, indicating the data is non-stationary.
ADF stationarity test on the differenced series
Lag order = 3
Test statistic = -3.9248
P-value = 0.02245 < 0.05.
Result: Reject the null hypothesis of a unit root, indicating the data is stationary.
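A minimal sketch of both tests using tseries::adf.test on the hypothetical sales_monthly training series:

library(tseries)

# ADF test on the training set (null hypothesis: the series has a unit root)
adf.test(sales_monthly)

# ADF test after first-order differencing
adf.test(diff(sales_monthly))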
Seasonal Differencing Analysis
Method: twice differencing (one non-seasonal and one seasonal difference); both operations are sketched after this list.
First-order differencing d=1:
Removes the trend component by computing differences between consecutive observations. This makes the mean of the series stationary.
Seasonal differencing D=1:
Removes the seasonal component by computing differences between observations at the same seasonal position (e.g., January 2012 vs January 2011). This makes the seasonal pattern stationary.
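A minimal sketch of the two differencing operations on the hypothetical sales_monthly series:

# non-seasonal (first-order) difference, d = 1
sales_d1 <- diff(sales_monthly, differences = 1)

# seasonal difference at lag 12 (D = 1), applied on top of the first difference
sales_d1D1 <- diff(sales_d1, lag = 12)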

Step 4. Model Fitting
ARIMA(p, d, q)(P, D, Q)m, where (p, d, q) is the non-seasonal part of the model, (P, D, Q) is the seasonal part, and m is the seasonal period (m = 12 for this monthly data).

Model Selection Strategy: ACF and PACF
Model candidates (fitting sketch below):
1. ARIMA(0,1,2)(0,1,1)12
2. ARIMA(2,1,0)(0,1,1)12
3. AUTO (automatic order selection)
SARIMA Model Evaluation:
Best Model: ARIMA(2,1,0)(0,1,1)12
AICc Value: 74.7
Ljung-Box Test p-value: 0.842 > 0.05 (no significant residual autocorrelation)
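A minimal sketch of fitting and comparing the candidates with the forecast package, again on the hypothetical sales_monthly training series:

library(forecast)

# candidate SARIMA models suggested by the ACF/PACF plots
fit1 <- Arima(sales_monthly, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = 12))
fit2 <- Arima(sales_monthly, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12))
fit_auto <- auto.arima(sales_monthly)

# compare in-sample fit (AICc) and residual diagnostics
c(fit1$aicc, fit2$aicc, fit_auto$aicc)
checkresiduals(fit2)   # includes the Ljung-Box test on the residuals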
Step 5. Making Forecasts

SARIMA Results: The forecast follows the direction of the actual series, but there is a systematic bias in the overall level. We suspect this suboptimal forecast performance is caused by the discontinuity created by the gap in the data.
Possible Improvement: We only used data from January 2011 to June 2014 because records for the second half of 2014 are missing, and we took that span directly as the training set, with data from January 2015 onward as the test set. The resulting gap is likely the main driver of the level difference between the SARIMA forecast and the ground truth. To handle this gap and improve model performance, we applied a Trend-Adjusted Seasonal Imputation to fill the missing months, as sketched below.
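The exact imputation formula is not spelled out above, so the following is only a plausible sketch of a trend-adjusted seasonal fill: extrapolate the STL trend linearly over the six missing months and add the corresponding monthly seasonal component (reusing the fit_stl object from the decomposition sketch):

trend <- as.numeric(fit_stl$time.series[, "trend"])
seas  <- as.numeric(fit_stl$time.series[, "seasonal"])

# linear extrapolation of the trend over the gap (July-December 2014)
t_obs <- seq_along(trend)
trend_fit <- lm(trend ~ t_obs)
t_gap <- seq(max(t_obs) + 1, max(t_obs) + 6)
gap_trend <- predict(trend_fit, newdata = data.frame(t_obs = t_gap))

# add the matching seasonal values (series starts in January, so positions 7-12 are Jul-Dec)
gap_fill <- gap_trend + seas[7:12]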

To further improve forecast performance, we also tried an LSTM model for the same forecasting task (a sketch of the network follows the metrics). Test-set errors improve from the SARIMA baseline to the LSTM:
RMSE: 2.612 → 1.0409
MAE: 2.203 → 0.7959
MAPE: 64.54% → 14.89%, a drop of roughly 50 percentage points
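A minimal sketch of the kind of LSTM used, via the keras R interface; the window length, layer sizes, and training settings here are illustrative assumptions rather than the exact configuration, and the scaled monthly series is assumed to be a numeric vector named sales_scaled (hypothetical name):

library(keras)

lookback <- 12   # predict the next month from the previous 12 months (assumed window)

# build supervised (X, y) windows from the scaled series
make_windows <- function(x, lookback) {
  n <- length(x) - lookback
  X <- t(sapply(seq_len(n), function(i) x[i:(i + lookback - 1)]))
  list(X = array(X, dim = c(n, lookback, 1)), y = x[(lookback + 1):length(x)])
}
train <- make_windows(sales_scaled, lookback)

# single LSTM layer followed by a dense output for one-step-ahead forecasting
model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(lookback, 1)) %>%
  layer_dense(units = 1)
model %>% compile(optimizer = "adam", loss = "mse")
model %>% fit(train$X, train$y, epochs = 100, batch_size = 8, verbose = 0)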
Figure: Actual vs. LSTM-forecast monthly sales.