Datasets and Features
When creating a dataset to train an ML model, data is retrieved from your data sources, placed into data sinks, and used to create the ML features for the training dataset.
When serving an ML model trained on a dataset, at the time of prediction, some features may be provided by your application in the prediction REST request, and some features may need to be retrieved from your data sources via data sinks. This distinction defines basis features and contextual features.
Basis features are those features that will be provided by your application in the prediction REST request at the time of prediction.
Contextual features are those features that must be retrieved by the Aizen platform during prediction. Contextual features are not provided by your application in the prediction REST request but are retrieved from your data sources and appended to your basis features, and the complete set of features is passed as input to your ML model for prediction. Contextual features are typically time-window aggregated computations of your data or join columns from other data sources. For example, if your dataset contains user ratings of a product, then a contextual feature could be the average user rating of the product over the last three months or the max user rating of the product over the last seven days. If all your ML features are provided by your application in the prediction REST request, then all your features are basis features, and you do not have any contextual features.
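To make the ratings example concrete, here is a minimal pandas sketch of that kind of time-window aggregation. It is illustrative only, not Aizen's feature-definition syntax, and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical rating events, as an events data sink might hold them;
# ts is the event timestamp column.
ratings = pd.DataFrame({
    "product_id": ["A", "A", "B", "A"],
    "rating": [4, 5, 2, 3],
    "ts": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-02-20", "2021-03-01"]),
}).sort_values("ts").set_index("ts")

# Average rating per product over a trailing 90-day (roughly 3-month)
# window: the kind of computation a contextual feature definition expresses.
avg_90d = ratings.groupby("product_id")["rating"].rolling("90D").mean()
print(avg_90d)
```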
Traditional ML deployments require all features to be basis features. This puts a burden on the prediction application because the application must provide all the features during the prediction REST request. Aizen recommends that you identify most of your features as contextual features when you configure the training dataset. Contextual feature definitions provide these benefits:
Complex time-window aggregations and point-in-time joins are made simple with feature definitions. Prediction applications do not need to compute complex features in real-time.
A feature definition generates consistent feature data for training and prediction. You do not need to worry about whether your prediction application is generating features in a manner consistent with when the features were generated for the training dataset.
Feature definitions make it simple and fast to deploy ML models, lowering your time to production.
A data sink is a table in Aizen storage that corresponds to a data source. Aizen classifies data sinks into two types:
An events data sink. This type of data sink is tied to a data source that is event driven. The data source must have a timestamp column, which is the event timestamp. Some examples are IoT sensor data, product shipping and receiving events, weather events, traffic events, and so on. You may specify primary key columns for the data sink. An events data sink supports time-window aggregation functions on its columns, with the group-by columns being a subset of its primary key columns.
A static data sink. This type of data sink is not event driven. It is a static table of records. You must specify the primary key columns for the data sink. A static data sink supports lookups or joins using its primary key columns.
When defining features for a dataset, Aizen classifies features into four types:
A basis feature from a data sink. This is a column from a data sink. It is the simplest definition of a feature. The column data from the data sink directly maps into the dataset. The column name and data sink name are specified when defining this feature. A dataset may have at most one basis data sink, from which it can draw any number of basis features.
A contextual feature from an events data sink. This is a time-window aggregation function performed on a column from an events data sink. The data sink column name, the join keys from the basis features, the aggregation function, and the time window of aggregation are specified when defining this feature. The time window of aggregation is based on the timestamp column of the events data sink.
A contextual feature from a static data sink. This is a lookup or join of a column from a static data sink. The join keys from the basis features, the static data sink name, and its column name are specified when defining this feature.
A contextual feature that is a compound expression containing one or more features from the same dataset.
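The fourth type is simply an expression over features already defined in the same dataset. As a hypothetical illustration (the feature names are invented), a compound feature could be the ratio of a short window aggregation to a longer one:

```python
import pandas as pd

# Hypothetical dataset rows with two already-defined contextual features.
df = pd.DataFrame({"last_7d_qty": [70.0, 35.0], "last_28d_qty": [200.0, 150.0]})

# A compound-expression feature: one expression over features from the
# same dataset (here, what share of the last 28 days fell in the last 7).
df["share_7d_of_28d"] = df["last_7d_qty"] / df["last_28d_qty"]
print(df)
```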
When the model is served, at the time of prediction, the basis features are expected in the prediction REST request. Aizen will augment the basis features with contextual features from their data sinks. If you have a contextual feature from an events data sink, then that events data sink must be connected to a real-time data source prior to model serving. This is because the events data sink will require fresh real-time event data at the time of prediction requests.
At the time of training dataset creation, labels are also considered features, specifically output features. Labels, or output features, are classified into the four types described above, in the same manner as input features. At the time of prediction, output features are not expected in prediction requests; they are returned in prediction responses.
Table joins in the Aizen platform are performed via contextual feature definitions. When creating a dataset, you first define the basis features from a data sink. The basis features are then joined with other columns from events data sinks or static data sinks, via join keys, to create contextual features.
Example 1: Let's say your data source consists of store sales information, and you want to predict the sales for a given store and product for the next three days. Let's say your basis features are the Date, the Store ID, and the Product ID. Let's say you want an additional input feature that is the last seven days of sales for that store and product, and the output feature (label) is the next three days of sales for that store and product. These additional features are contextual features drawn from an events data sink that contains the sales quantity per day for each store and product. In this case, the label is an output contextual feature.
The Last 7d sales Qty and Next 3d sales Qty are aggregation functions on the events data sink, where the Sales Qty column is aggregated by the Store ID and Product ID. The join keys from the basis features are the Store ID and the Product ID. The final training dataset contains the basis features, augmented with the contextual features from the events data sink.
When the model is deployed, at the time of prediction, the basis features Date, Store ID, and Product ID are expected in the prediction REST request. Aizen will augment those with the Last 7d sales Qty from the events data sink before passing the data to the model to predict the Next 3d sales Qty. The events data sink must have a real-time data source providing the sales quantity for stores and products at the time of the prediction request, that is, corresponding to the Date supplied in the prediction request.
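A prediction request for this model would then carry only the three basis features. The endpoint URL and payload schema below are hypothetical, shown only to illustrate the division of labor:

```python
import requests

# Hypothetical endpoint and payload schema for the deployed model.
# The request carries only the basis features; Aizen fetches the
# contextual feature (Last 7d sales Qty) from the events data sink
# before invoking the model.
resp = requests.post(
    "https://aizen.example.com/models/store-sales/predict",  # hypothetical URL
    json={"Date": "2021-01-08", "Store ID": "PQR", "Product ID": "123"},
)
print(resp.json())  # e.g. {"Next 3d sales Qty": 31.0}, illustrative only
```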
Example 2: Let's say you have another data source that has the store address, the sales tax for that location, and the type of location, which can be rural, urban, suburban, and so on. Let's say your basis features are the Date, the Store ID, and the Product ID drawn from the first data source, and you want two more input features: the Sales Tax and the Location Type. These additional features are contextual features drawn from a static data sink that is configured for the second data source.
The Sales Tax and Location Type are lookup values from the static data sink, where the join key from the basis features is the Store ID. The final training dataset contains the basis features augmented with the contextual features from the events and static data sinks.
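In relational terms, these two contextual features are a key lookup: a left join of the basis rows onto the static data sink by Store ID. A minimal pandas sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical static data sink keyed by Store ID.
stores = pd.DataFrame({
    "store_id": ["PQR", "XYZ"],
    "sales_tax": [0.08, 0.06],
    "location_type": ["urban", "rural"],
})

# Basis rows (Date, Store ID, Product ID) from the first data source.
basis = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-08", "2021-01-08"]),
    "store_id": ["PQR", "XYZ"],
    "product_id": ["123", "456"],
})

# Sales Tax and Location Type arrive via a left join on the join key.
dataset = basis.merge(stores, on="store_id", how="left")
print(dataset)
```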
When the model is deployed, at the time of prediction, the basis features Date, Store ID, and Product ID are expected in the prediction REST request. Aizen will augment those with the Last 7d sales Qty from the events data sink and the Sales Tax and Location Type from the static data sink before passing the data to the model to predict the Next 3d sales Qty.
Basis features are defined from a data sink, which is configured for a data source. Typically, the basis data is materialized from its data source when the training dataset is created. This is essential if the training output features (labels) are present in the basis data source.
However, there are cases where materializing basis data from its data source is suboptimal for model training. Consider the previous Example 1. When materialized from the data source, the final training dataset looks as follows:
Let's say on 01/01/2021 there were no sales for Product ID "123" at Store ID "PQR". That means the basis data will not contain the row {01/01/2021, "PQR", "123"}, and neither will the training dataset. However, it is perfectly valid to ask what the Next 3d sales Qty is for Store ID "PQR" and Product ID "123" on 01/01/2021, if Store ID "PQR" sold Product ID "123" on other dates. Moreover, the contextual features Last 7d sales Qty and Next 3d sales Qty for Store ID "PQR", Product ID "123", and Date 01/01/2021 can easily be generated. They were not generated simply because the dated combination {01/01/2021, "PQR", "123"} was not present in the basis data.
It is more beneficial for model training if the row {01/01/2021, "PQR", "123"} is present in the basis data and the training dataset, provided Store ID "PQR" sold Product ID "123" on other dates. The most complete basis data would be a table of all possible dates, fully cross joined with a table of the distinct Store ID, Product ID combinations from the data source, as shown below:
This is irrespective of whether dated combinations such as {01/01/2021, “PQR”, “123”} were present in the data source or not. The above basis data table is the most complete basis data for Example 1. These scenarios occur when the output features (labels) are contextual features drawn from an events data sink.
The configure dataset command allows you to synthesize basis data instead of materializing it from its data source. You must provide the datetime range and stride interval, for example, 01/01/2021 to 01/01/2022 with a 1-day stride. The start dataset command will generate all the datetimes in that range and fully cross join them with the distinct Store ID, Product ID combinations from the data source to synthesize the basis data.
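The effect of synthesizing basis data can be pictured with a small pandas sketch (the source frame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical raw sales source with a few store/product combinations.
source = pd.DataFrame({
    "store_id": ["PQR", "PQR", "XYZ"],
    "product_id": ["123", "456", "123"],
})

# Every datetime in the configured range, at a 1-day stride.
dates = pd.DataFrame({"date": pd.date_range("2021-01-01", "2022-01-01", freq="D")})

# Synthesized basis data: the full cross join with the distinct
# Store ID, Product ID combinations, so rows such as
# {01/01/2021, "PQR", "123"} exist even on dates with no sales events.
combos = source[["store_id", "product_id"]].drop_duplicates()
basis = dates.merge(combos, how="cross")
print(len(basis))  # 366 dates x 3 combinations = 1098 rows
```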
If this leads to a large training dataset, either because the datetime range is large (or the stride interval is small) or because the number of distinct basis feature combinations is large, then you may want to compromise on the size of the training dataset to shorten the model training time. The configure dataset command allows you to specify an explode row count. Each datetime in the range is then cross joined with that number of randomly selected distinct basis feature combinations. For the previous example, if the explode row count were 1000, then for each date there would be 1000 random but distinct Store ID, Product ID combinations selected from the data source, resulting in 1000 rows per day.
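Continuing the sketch, the explode row count replaces the full cross join with a fixed-size random sample of distinct combinations per datetime (2 here stands in for a value like 1000):

```python
import pandas as pd

# Hypothetical distinct basis-feature combinations from the data source.
combos = pd.DataFrame({
    "store_id": ["PQR", "PQR", "XYZ"],
    "product_id": ["123", "456", "123"],
})
dates = pd.date_range("2021-01-01", "2021-01-05", freq="D")

EXPLODE_ROW_COUNT = 2  # stands in for a realistic value such as 1000

# Per datetime, sample EXPLODE_ROW_COUNT random but distinct combinations
# instead of cross joining with all of them.
basis = pd.concat(
    [combos.sample(n=EXPLODE_ROW_COUNT, random_state=i).assign(date=d)
     for i, d in enumerate(dates)],
    ignore_index=True,
)
print(len(basis))  # 5 dates x 2 sampled combinations = 10 rows
```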