Some Basic Concepts About Time Series (1)
Time series is an important area of machine learning. As the name suggests, a time series model involves a time component, and it is this component that makes time series problems hard to tackle: the observations are ordered sequentially in time. By convention, the current time is denoted t, and the observation at the current time is denoted obs(t). The observations made at prior times, called lag observations or lags, are of most interest because they are used to predict the current observation. Times in the past are indexed negatively relative to the current time: if the current time is t, the prior times are t-1, t-2, …, t-n, and the corresponding observations are obs(t-1), obs(t-2), …, obs(t-n).
Time series analysis is applied to understand the dataset, which mainly means identifying its trend, seasonality, and noise. A time series model uses this information to forecast future values of the series. Time series analysis models aim to understand the underlying causes and to answer the "why" question behind a time series dataset.
Time series forecasting makes predictions about the future; in traditional statistics this is called extrapolation. The skill of a time series forecasting model is determined by its performance at predicting the future. Ideally this also involves explaining why a specific prediction was made, providing confidence intervals, and, even better, understanding the underlying causes of the problem.
Components of time series:
- Level: the baseline value for the series.
- Trend: the increasing or decreasing behavior of the series over time.
- Seasonality: the repeating patterns or cycles of behavior over time.
- Noise: the variability that cannot be explained by the model.
Examples of time series problems include: forecasting the closing price of a stock each day; forecasting the units sold each day for a store; forecasting unemployment for a state each quarter; and so forth.
Time series forecasting can be framed as a supervised learning problem. The use of prior time steps to predict the next time step is called the sliding window method, or the lag method. The number of previous time steps used is called the window width, or the size of the lag. The sliding window is the foundation for turning a time series dataset into a supervised learning problem.
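The sliding window reframing can be sketched in a few lines of pandas. This is a minimal illustration on a made-up five-value series (the column names `obs(t-2)`, `obs(t-1)`, `obs(t)` follow the notation above); any univariate series works the same way.

```python
import pandas as pd

# A small illustrative series; substitute any real univariate series.
series = pd.Series([10, 20, 30, 40, 50], name="obs")

# Sliding window of width 2: use obs(t-2) and obs(t-1) to predict obs(t).
df = pd.DataFrame({
    "obs(t-2)": series.shift(2),
    "obs(t-1)": series.shift(1),
    "obs(t)": series,
})

# The first rows have no complete window of prior values, so drop them.
df = df.dropna()
print(df)
```

The first two rows are discarded because a window of width 2 needs two prior observations; the remaining rows form ordinary (input, output) pairs for a supervised learner.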
Univariate time series: series in which only one variable is observed at each time step, such as the closing price of a stock each day.
Multivariate time series: series in which two or more variables are observed at each time step. Multivariate time series analysis models multiple time series simultaneously.
Most work focuses on univariate time series problems. Multivariate datasets are harder to model, and classical methods do not perform well on them. The good news is that some machine learning models, such as neural networks, can handle multivariate time series problems; they can also produce multi-step forecasts, where several future time steps are predicted at once. The LSTM (long short-term memory) network is one approach to these difficult problems.
Load the data: load the data as a Series instead of a DataFrame. You can download the dataset from Kaggle.
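A sketch of loading a single-column CSV as a pandas Series rather than a DataFrame. The file written here is a tiny stand-in, since the original text does not name the Kaggle file; in practice you would point `read_csv` at the downloaded CSV.

```python
import pandas as pd

# Write a tiny example CSV so this snippet is self-contained; in practice,
# replace the path with the file you downloaded from Kaggle.
with open("example_series.csv", "w") as f:
    f.write("Date,Temp\n1981-01-01,20.7\n1981-01-02,17.9\n1981-01-03,18.8\n")

# With the date column used as the index, one data column remains;
# squeeze("columns") collapses that single column into a Series.
series = pd.read_csv(
    "example_series.csv", header=0, index_col=0, parse_dates=True
).squeeze("columns")

print(type(series))
```

Working with a Series keeps the lag and window operations below one-dimensional, which is what the feature-engineering examples expect.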
Date-time features: the components of the time step itself for each observation.
Lag features: values at prior time steps.
Window features: a summary of values over a fixed window of prior time steps.
Feature engineering is critical in machine learning because it surfaces important but simple relationships between the features and the target. It can not only reduce the complexity of the model but also improve its performance. In a raw time series there is no concept of input or output, so we need to create input features to transform the time series into a supervised learning dataset. As discussed above, we can use a sliding window to engineer the inputs. The main methods are developing date-time and lag-based features, and developing sliding- and expanding-window summary statistic features.
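All three feature types (date-time, lag, and window) can be built with a few pandas calls. The sketch below uses a synthetic daily series with values 0–9 as a stand-in for a real dataset; each input column is shifted so it only uses information available before the target time step.

```python
import pandas as pd

# Synthetic daily series standing in for a real dataset.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(range(10), index=idx, name="obs")

features = pd.DataFrame(index=idx)
# Date-time features: components of the time step itself.
features["month"] = idx.month
features["dayofweek"] = idx.dayofweek
# Lag feature: the value at the prior time step.
features["lag1"] = series.shift(1)
# Sliding-window statistic: mean of the previous 3 observations.
features["roll_mean3"] = series.shift(1).rolling(window=3).mean()
# Expanding-window statistic: mean of all prior observations.
features["expand_mean"] = series.shift(1).expanding().mean()
features["target"] = series

print(features.dropna())
```

Dropping the rows with incomplete windows leaves a table whose columns are the inputs and whose last column is the supervised learning target.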
Understanding this feature engineering is critical to understanding time series machine learning algorithms such as the ARIMA (autoregressive integrated moving average) model.
Visualizing Time Series Dataset
Visualizing a time series dataset is important and useful for exploring and understanding it. The most commonly used plots are line plots, lag plots, histogram and density plots, heat maps, and autocorrelation plots.
For example, a line plot:
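A minimal line plot sketch, assuming pandas and matplotlib are installed. The data here is a synthetic sinusoid standing in for the temperature series the article refers to; the file name is illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# One year of synthetic daily "temperature" data.
idx = pd.date_range("1981-01-01", periods=365, freq="D")
values = 15 + 10 * np.sin(np.arange(365) * 2 * np.pi / 365)
series = pd.Series(values, index=idx, name="Temp")

series.plot()  # pandas draws a line plot with the dates on the x-axis
plt.savefig("line_plot.png")
```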
The line plot can also be disaggregated by year using the pandas Grouper class. We can set the freq argument to 'A', which means annual (newer pandas versions use 'YE').
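A sketch of the yearly disaggregation, again on synthetic data. The `try/except` hedges across pandas versions, since the 'A' alias has been replaced by 'YE' in newer releases.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Two years of synthetic daily data.
idx = pd.date_range("1981-01-01", periods=730, freq="D")
series = pd.Series(15 + 10 * np.sin(np.arange(730) * 2 * np.pi / 365),
                   index=idx, name="Temp")

# Group the observations by year: 'A' = annual ('YE' in newer pandas).
try:
    groups = series.groupby(pd.Grouper(freq="A"))
except ValueError:
    groups = series.groupby(pd.Grouper(freq="YE"))

# One column per year, aligned on day-of-year position.
years = pd.DataFrame({g.year: s.reset_index(drop=True) for g, s in groups})

years.plot(subplots=True, legend=True)  # one line plot per year
plt.savefig("yearly_line_plots.png")
```

Stacking the years as subplots makes it easy to compare the seasonal shape from one year to the next.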
Histogram and density plots help explore the distribution of the series dataset.
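A histogram sketch on synthetic data, assuming pandas and matplotlib. A density plot is the same call with `kind="kde"`, which additionally requires SciPy, so it is left as a comment here.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic values standing in for a real series.
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(loc=15, scale=4, size=1000), name="Temp")

series.plot(kind="hist", bins=30)  # distribution of the observations
plt.savefig("histogram.png")
# series.plot(kind="kde") would draw the density estimate (requires SciPy).
```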
Box and whisker plots also show descriptive statistics and the quartiles.
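A sketch of a box-and-whisker view, one box per month, on the same kind of synthetic daily series; the grouping column name `month` is just illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

idx = pd.date_range("1981-01-01", periods=365, freq="D")
series = pd.Series(15 + 10 * np.sin(np.arange(365) * 2 * np.pi / 365),
                   index=idx, name="Temp")

# One box per month: the box spans the quartiles, the whiskers the spread.
df = pd.DataFrame({"Temp": series, "month": series.index.month})
df.boxplot(column="Temp", by="month")
plt.savefig("boxplot_by_month.png")
```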
Last, we can visualize autocorrelations. Autocorrelation refers to how an observation is correlated with its lag observations. The resulting plot indicates that the temperature dataset has a strong sign of seasonality.
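An autocorrelation sketch using pandas' built-in `autocorrelation_plot`, again on a synthetic seasonal series. On data like this, peaks at lags near multiples of 365 are the seasonality signal the article describes.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Two years of synthetic seasonal data.
idx = pd.date_range("1981-01-01", periods=730, freq="D")
series = pd.Series(15 + 10 * np.sin(np.arange(730) * 2 * np.pi / 365),
                   index=idx, name="Temp")

# Correlation of the series with itself at every lag.
autocorrelation_plot(series)
plt.savefig("autocorrelation.png")

# A single autocorrelation can also be computed directly:
print(series.autocorr(lag=1))
```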
These are some basic concepts of time series. Hopefully they help you understand time series better. Thanks for reading.