SARIMA
SARIMA -
SARIMA stands for Seasonal Auto Regressive Integrated Moving Average. This model is useful for forecasting time series data. In this tutorial we will learn how we can use it for forecasting with the help of example.
The classic ARIMA accepts the parameters (p,d,q), while SARIMA or Seasonal ARIMA accepts an additional set of parameters (P,D,Q)m that specifically describe the seasonal components of the model.
Here P, D, and Q terms represent the seasonal regression, differencing, and moving average coefficients, and m represents the number of data points that is rows in each seasonal cycle. So if you have monthly data with a yearly seasonal cycle, m would be equal to 12.
When we actually import it within statsmodels, the implementation of seasonal ARIMA is called SARIMAX. The "X" added to the name represents that the function also supports exogenous regressor variables. Importing SARIMAX means we are just performing classic ARIMA with a seasonal ARIMA model.
Example -
Let' understand more about SARIMA practically with an example.
Open Jupyter Notebook. We will import pandas, numpy, set matplotlib inline, and also will set warnings to ignore. Next, we will load couple of forecasting tools. First one will be SARIMAX. Also we will create some Error-Trend-Seasonality (ETS) plots so for that second we will import seasonal_decompose and then finally from pmdarima import auto_arima. We will use a classic dataset called Mauna Loa, Hawaii dataset which is basically the monthly average of CO2 levels in parts per million over a particular part in Hawaii. You can download it from internet. So we will use pandas in order to configure the dataset for statsmodels. So inside the Data folder, open up the co2_mm_mlo.csv file. We will take a look at the head of the dataset using head() function.
Below are the code snippet for above steps in jupyter notebook:
We can see in dataset that the year and the month is separated into columns like this:
So we want to figure out how we can use these two columns to create a date time index. The first thing to do is to add a date column. Pandas is designed to handle this sort of thing, so using this dataset's year and month, we will create a new column called date using to_datetime() function and then we will pass in a dictionary call of what the year should be and what the month should be and then what the day should be. So in the dictionary we will have year that is going to be equal to the df year's column, then we have month that's the value from the df month column, and then we are going to provide day and we are assuming that the day starts at the very first of the month. You could switch day up to the last of the month like 30, but since we are dealing with monthly timestamp data so we know that months can change from 30 to 31, that's why we set the day is equal to 1, since we only know the year and month. You could also construct dictionary in an alternative way which i have comment out in below snippet just by passing in dictionary and passing in year, month, day as parameters into the dictionary function call. So we will go ahead and run that and now if we check out the head of our dataframe (df), we will notice that we have the date and it looks like it's now a timestamp object.
Below are the code snippet for above steps -
We can confirm that by using df.info() and as we can notice that this date column is infact a datetime object.
We want this date column to be the index so we'll use set_index() and to make sure this is permanent we'll reassign the current data frame(df). And now if we check the head of the data frame, we have date index, and the previous columns.
In order to use it with statsmodels we need to set frquency, so we will use index.frequency and set it to monthly start(MS). Now we can check the head.
We'll notice that the average column has missing values(NaN) like in 1958 they don't have a recorded value for June. So what they did instead is they just interpolated between the previous points and some of the future points to fill in that value. That's why we will use interpolated column so in that way we are not missing any points. Let's go ahead and plot this data, in order to plot out the data, we will use plot().
When we run it, we will get like this:
We can see here that there is seasonality as well as some general upward trend. To confirm that there are some seasonality, we can run as ETS decomposition, for that we will use seasonal_decompose() with interpolated column and then you can use either an additive model or a multiplicative model, but the key thing to note here is that we'll definitely see a clear seasonal component. So when we plot out this result, we can see here the observed values, the general trend, as well as the residual and the seasonal. So we can see that there is a clear amount of seasonal behavior. Below is the code snippet -
We want to take into account that definitely by this scale it's going to be large enough, that's why we are using a seasonal Arima (SARIMA) model. Let's go ahead and run the auto arima in order to obtain the recommended orders. So typically SARIMA models are going to take longer to run than at least as far as running the different models in order to figure out which is the best one. So just keep that in mind in case this doesn't run fast as other auto.arima orders. Now we will pass interpolated in auto_arima and then we want to make sure that we specify seasonal equals to true, then as seasonal is happening every year so we will say that m is equal to 12, since we have monthly data and there's 12 months per year. Before running auto_arima we can call summary() of this as well so that we can see the result. Below is the code snippet for this step -
Note - This will take a little longer and you may get some sort of warnings here about convergence not being met. That's totally okay. But eventually it's going to verify the arrival order as well as the seasonal order. So you may see it as not converge on a couple of the arima models that it's trying to run. Basically it takes time for running depending on how powerful your computer is, the auto.arima library may end a little sooner or longer.
Now we finished running all those models we will get back the reported model - SARIMAX. We can see that our model performs quite well and we have a pretty good AIC value as well in output. AIC in general, depends on what model should be reported back to you, it should be around something like 420 for AIC value.
From this statespace model results we got sarima order as we can notice that it is SARIMAX(0,1,1)x(1,0,1,12). Now we will do a train test split on the data and test out our model, and will see how it performs on the test set, and then in last we will forecast into the future. For that we will check the length of data frame. So we have 729 rows. Let's set one year for testing by using iloc function, so one year means 12 months, that means we need to set length-12 = 729-12 = 717. So that means our training(train) will be from the beginning all the way to 717. And then our test data is going to be from 717 to all the way into the future. Now we will create the model, SARIMAX pass in the interpolated column from the training data and then we will specify two parameters, one is the first order for the arima - (0,1,1) and second is seasonal_order - (1,0,1,12). Then we'll fit the model and get those results using model.fit(). Checking the result summary and this is basically the same results that was just reported by auto arima.
Below is the code snippet for this step -
Now let's go ahead and get predicted values for our test set range by setting starting point to the length of training data, and then end point is equal to the length of training data plus length of test data minus one. Now let's create predictions by simply calling results.predict with start, end, and type as levels, to make sure we don't have any issues with any differencing components and then we'll rename it as SARIMA Predictions. Now we will have our predictions, so we will plot them out against the test results. So we'll take our interpolated values from our test set then we'll plot those with legend equal to true and with a figsize equal to 12 by 8 and in the same cell we'll plot our predictions with legend=true so that we can see the name popup. Below is the code snippet -
After running this, we can see the real results in blue interpolated line and the SARIMA Predictions in orange. So now we will evaluate the model, for that we'll import Root Mean Squared Error (rmse). Error is equal to rmse of test interpolated compared to predictions. When we checked rmse for our model, we got around 0.34 that's the root mean square error. To understand whether it's good or bad, we should take the test values, check out the range or mean of those values using mean() function. And so the root mean squared error being off by 0.34 is pretty good given the range of values we are dealing with which is 408. From these, we can understand that sarima predictions are close to the real values which is reflected in the plot. Below is the code snippet for this -
We have done an evaluation on the test data where we actually knew the correct results. But when we are forecasting into the unknown future, there is no way to understand how good these data points are because it's just the future and we don't know about it yet. Now we will re-train the model on the full data and forecast into the future. This is going to be our true forecast into the unknown future.
To do that we are going to repeat the same steps. First we'll set the model equal to call SARIMAX function, inside that call df interpolated which is the whole original dataframe, and will use the same orders as last time. Then we will get results by calling model.fit(). So once fitting the model is done, we simply grab our forecasted values, and to do that we'll use predict() function. So we will predict one year into the future, starting from the length of the dataframe + 11 months that means 12 months or 1 year. Set type as 'levels' and rename it to SARIMA FORECAST. Then we will plot out forecasted values with real values that is from interpolated column as we did in previous step for sarima prediction. Below is the code snippet -
When we run it, we got this:
We can see here at the very end what our forecast is in the orange line. In general, this is reasonable assumption or forescast because it keep this sort of seasonality and upward trend going.
In this tutorial we built out a forecast to what we believe those future values should be with the help of SARIMA model and we have seen that how we can use the auto.arima to get the seasonal ARIMA values.
Comments
Post a Comment