Road traffic accidents involving personal injury – Methodological description
The National Police Headquarters (ORFK) publishes a flash report on its website between the 10th and 20th day following the reference month, presenting data on road traffic accidents for that month. The ORFK reports systematically underestimate the figures that the Hungarian Central Statistical Office (HCSO) publishes on the 50th–55th day after the reference month. One of the main reasons is that the HCSO, in line with European standards, counts a person as having died in an accident if they pass away within 30 days as a result of the accident. When the flash report is compiled, data referring to the 30th day after the accident are therefore not yet available. To bridge this gap, the HCSO has developed ARIMA-type models that take into account trends in the time series, estimated separately for Budapest and for the other counties combined, and now publishes the sum of these separately estimated figures.
Methodology of the flash estimate:
The essence of the method is to forecast the final accident data from the flash reports prepared by the ORFK. The procedure rests on the empirical observation that the flash reports and the final accident data are strongly related. Since both datasets aim to measure the same types of events, a high correlation is to be expected. To eliminate the systematic underestimation of the flash reports, a SARIMAX model (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) is applied. SARIMAX is a classical linear time series model that explains the value of yₜ, here the final accident figure for month t, based on the following components:
(1) AR – autoregressive term: previous values of y
(2) I – integrated (differencing): ensuring stationarity through differencing
(3) MA – moving average error term: previous forecast errors
(4) S – seasonal components: AR, I, MA at the seasonal level (s = 12, yearly)
(5) X – exogenous variables: external explanatory regressors (in this case, the figures from the ORFK flash reports)
Model specification used in the procedure:
Non-seasonal order (p, d, q) = (1, 0, 1):
- p = 1 → AR(1): one lag, φ₁·yₜ₋₁
- d = 0 → no non-seasonal differencing (stationary time series are used)
- q = 1 → MA(1): one error lag, θ₁·εₜ₋₁
Seasonal order (P, D, Q, s) = (1, 1, 1, 12):
- P = 1 → SAR(1): Φ₁·yₜ₋₁₂
- D = 1 → one seasonal difference: Δ₁₂yₜ = yₜ − yₜ₋₁₂
- Q = 1 → SMA(1): Θ₁·εₜ₋₁₂
- s = 12 → monthly data, annual seasonality
The Python statsmodels package applies the following maximum likelihood estimation procedure. Assuming normally distributed errors, the log-likelihood is:
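ℓ(ψ) = −(T/2)·ln(2π) − (T/2)·ln(σ²) − (1/(2σ²))·Σₜ εₜ(ψ)²
where T is the number of observations, ψ = (φ₁, Φ₁, θ₁, Θ₁, β, σ²) is the parameter vector, β is the coefficient of the exogenous regressor (the ORFK flash figure), σ² is the error variance, and the residuals εₜ(ψ) are computed recursively via the Kalman filter.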
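As a small check of the formula, the sketch below evaluates the log-likelihood for a given residual vector and confirms that it equals the sum of the individual normal log-densities. The residual values and the variance are made up for illustration, and the function names are hypothetical.

```python
import math

def gaussian_loglik(residuals, sigma2):
    """Log-likelihood of independent N(0, sigma2) errors, as in the formula above."""
    T = len(residuals)
    sse = sum(e * e for e in residuals)
    return (-0.5 * T * math.log(2 * math.pi)
            - 0.5 * T * math.log(sigma2)
            - sse / (2 * sigma2))

def normal_logpdf(e, sigma2):
    """Log-density of a single N(0, sigma2) observation."""
    return -0.5 * math.log(2 * math.pi * sigma2) - e * e / (2 * sigma2)

residuals = [0.5, -1.2, 0.3, 0.9, -0.4]  # made-up residuals
sigma2 = 0.8                             # made-up error variance

total = gaussian_loglik(residuals, sigma2)
check = sum(normal_logpdf(e, sigma2) for e in residuals)
print(abs(total - check) < 1e-12)
```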
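First-order conditions:
For the coefficients ψⱼ ∈ {φ₁, Φ₁, θ₁, Θ₁, β}: ∂ℓ/∂ψⱼ = −(1/σ²)·Σₜ εₜ · (∂εₜ/∂ψⱼ) = 0
For the error variance: ∂ℓ/∂σ² = −T/(2σ²) + (1/(2σ⁴))·Σₜ εₜ² = 0, which gives σ² = (1/T)·Σₜ εₜ²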
Due to the MA terms, εₜ is a nonlinear function of θ₁ and Θ₁, so there is no closed-form solution. Therefore, statsmodels maximises ℓ numerically (by default using the L-BFGS-B algorithm).
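The nonlinearity can be seen in a pure MA(1) model, where each residual depends on the previous one and therefore on increasing powers of θ₁. The sketch below is a simplified stand-in for the procedure: it starts the recursion from ε₀ = 0 (a conditional likelihood, assumed here instead of the Kalman filter), uses simulated data, and calls SciPy's L-BFGS-B routine directly. All names and values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def ma1_residuals(y, theta):
    """Recursion eps_t = y_t - theta * eps_{t-1}, started from eps_0 = 0."""
    eps = np.zeros(len(y))
    prev = 0.0
    for t, y_t in enumerate(y):
        eps[t] = y_t - theta * prev
        prev = eps[t]
    return eps

def neg_loglik(params, y):
    """Negative Gaussian log-likelihood; sigma2 is parametrised on the log scale."""
    theta, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)
    eps = ma1_residuals(y, theta)
    T = len(y)
    return (0.5 * T * np.log(2 * np.pi)
            + 0.5 * T * log_sigma2
            + eps @ eps / (2 * sigma2))

# Simulated MA(1) data with theta = 0.5 and unit error variance (assumed values).
rng = np.random.default_rng(1)
true_theta = 0.5
shocks = rng.normal(0.0, 1.0, 501)
y = shocks[1:] + true_theta * shocks[:-1]

# No closed form exists for theta, so the likelihood is maximised numerically.
fit = minimize(neg_loglik, x0=[0.0, 0.0], args=(y,), method="L-BFGS-B",
               bounds=[(-0.99, 0.99), (None, None)])
theta_hat = float(fit.x[0])
sigma2_hat = float(np.exp(fit.x[1]))
print(round(theta_hat, 3), round(sigma2_hat, 3))
```

At the optimum the estimated variance should be close to the mean squared residual, which is the first-order condition for σ².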
Standard errors – inverse of the Fisher information matrix:
Var(ψ̂) = [ −∂²ℓ/∂ψ∂ψᵀ ]⁻¹, evaluated at the maximum likelihood estimate ψ̂. The standard errors are the square roots of the diagonal elements of this matrix.
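The idea can be illustrated on a deliberately simple model for which the answer is known analytically: an independent Gaussian sample with ψ = (μ, σ²). This is an assumption made for the example, not the SARIMAX model itself. The sketch approximates the Hessian of the log-likelihood by central finite differences, inverts its negative, and compares the result with the textbook values σ̂²/T and 2σ̂⁴/T. The helper names and the simulated data are hypothetical.

```python
import numpy as np

def loglik(psi, y):
    """Gaussian log-likelihood of an i.i.d. sample with psi = (mu, sigma2)."""
    mu, sigma2 = psi
    eps = y - mu
    T = len(y)
    return (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * T * np.log(sigma2)
            - eps @ eps / (2 * sigma2))

def numerical_hessian(f, psi, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at psi."""
    psi = np.asarray(psi, dtype=float)
    k = len(psi)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k)
            e_j = np.zeros(k)
            e_i[i] = h
            e_j[j] = h
            H[i, j] = (f(psi + e_i + e_j) - f(psi + e_i - e_j)
                       - f(psi - e_i + e_j) + f(psi - e_i - e_j)) / (4 * h * h)
    return H

# Simulated sample (assumed values): mean 10, standard deviation 2.
rng = np.random.default_rng(2)
y = rng.normal(10.0, 2.0, 200)
T = len(y)

# Closed-form maximum likelihood estimates for this simple model.
mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat) ** 2)
psi_hat = np.array([mu_hat, sigma2_hat])

H = numerical_hessian(lambda p: loglik(p, y), psi_hat)
cov = np.linalg.inv(-H)           # inverse of the observed information
std_err = np.sqrt(np.diag(cov))   # standard errors of (mu_hat, sigma2_hat)

analytic = np.array([np.sqrt(sigma2_hat / T), np.sqrt(2 * sigma2_hat ** 2 / T)])
print(np.allclose(std_err, analytic, rtol=1e-3))
```

For a fitted statsmodels model the standard errors are available from the results object (for example its `bse` attribute), and the `cov_type` argument of `fit` selects how the covariance matrix is estimated.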