Clickhouse and Timeseries

Hi,
How do you pass the mandatory timeseries parameters to the predictor from a ClickHouse INSERT query?
Also, how do you predict multiple timesteps after that (e.g. the next 5 hours for an hourly prediction)?

Thanks


Hi,

I’m afraid we don’t support timeseries parameters through the database interface for now. If this is something you want, feel free to open an issue describing your specific use case, since we prefer basing new implementations on real-world cases to some extent (or even a PR, but that might be a bit too difficult).

That being said, there’s a way to get timeseries-like behavior, at least in part, by adding arrays to your data. For example, say you want to predict a column price based on the demand column during the last 3 days. In mindsdb you would write something like: learn(..., order_by=['demand'], window_size=3)

But this could be replaced by formatting a set of data like:

price, demand
10, 20
12, 30
14, 15
10, 22
11, 35

Into

price, demand_ts
10, [20]
12, [20,30]
14, [20,30,15]
10, [30,15,22]
11, [15,22,35]

Depending on your type of data this might work. group_by would be harder to implement, but there are some ways to hack around it depending on your data (e.g. if it’s very low granularity you can just train one predictor per “group”).
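The array formatting above can be sketched in plain Python (a hypothetical illustration of the rolling-window construction, not mindsdb code):

```python
def to_rolling_windows(values, window_size):
    """Turn a flat column into rolling-window arrays:
    row i gets the last `window_size` values up to and including row i."""
    windows = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        windows.append(values[start:i + 1])
    return windows

# the demand column from the example above
demand = [20, 30, 15, 22, 35]
demand_ts = to_rolling_windows(demand, 3)
# early rows have shorter windows until enough history accumulates
```

This reproduces the demand_ts column shown above: [20], [20,30], [20,30,15], [30,15,22], [15,22,35].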

Regarding predictions for the next x hours, it really depends on your use case. The easiest way is to format your dataset such that it has x different columns to predict: target_1, target_2, …, target_x (in your case x=5). Then whenever you run the model it would generate 5 predictions.
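The multi-target formatting could be sketched like this (pure Python; the column names target_1…target_x are just the hypothetical ones from above):

```python
def add_future_targets(series, x):
    """For each row i, build target_1..target_x as the next x values.
    Trailing rows without x future values are dropped (no look-ahead there)."""
    rows = []
    for i in range(len(series) - x):
        rows.append({f"target_{k}": series[i + k] for k in range(1, x + 1)})
    return rows

values = [10, 12, 14, 10, 11, 13, 9]
rows = add_future_targets(values, 5)
```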

Alternatively, if you can “infer” some of the data for the next hour (e.g. increase the value of the hour column and assume all other columns stay the same) you can just get 5 different predictions.

You can also try predicting a timeseries column, but that would be fairly experimental and I doubt it would show results.

To give a better answer it would be ideal to get the schema of your data, some sample data, and the actual use case you have. Given that, I could also help you with the ClickHouse queries needed to format the data, if you’re unsure how to write those.


Hi,
Thanks for the prompt answer

The use case is a supervision system.
We need to monitor the number of transactions per location and get predictions for the coming few hours, to take preventive action.

The source table is

create table if not exists TRANSACTION_CountryA (
Location String,
transaction_time DateTime,
nb_transactions_total UInt32)
engine = MergeTree
order by (transaction_time)
partition by toYYYYMM(transaction_time);

The data is already aggregated per 15 minutes (in transaction_time).
We have 2 years of history.
There are between 100 and 1000 locations depending on the country (here it’s a table for one country).
It’s a timeseries per location, with a loose relation between locations (some can be closer than others). In profiling, the time series have a few different shapes depending on the location (around 10 different ones). There are multiple seasonalities: within the day, the week, and the year.

We want short-term predictions: the values for each location, per 15 minutes, for the next 3 hours (so the next 12 values).
And we run the prediction every 15 minutes on each data refresh.

We tried random forest, xgboost, and RNNs (LSTM). We enrich the features by adding date- and time-related fields like day of the week, month, hour of the day… (we need to build loops, pivots with lags…)
It’s time-consuming to try all models and parameters with complex code in Python.

I wanted to try in-database machine learning to do it all in SQL, without having to code Python and chain additional batch jobs.
If it’s not possible in SQL we can also switch back to Python… but what drew me to MindsDB was the in-database machine learning, so we don’t have to move the data (even if in this case it’s not really doing it in-database, but sending the data to the Python code behind it).

Stephane.

Actually, I misspoke, from clickhouse we should already support timeseries formatting so there’s no need to do any changes to the data from inside the database :slight_smile:

You can simply run the query as:

INSERT INTO mindsdb.predictors spike_predictor WHERE select_data_query='SELECT transaction_time, location, nb_transactions_total FROM TRANSACTION_Belgium' AND to_predict='nb_transactions_total' AND training_options='{"order_by":"transaction_time","window_size":5}'

That being said, in your case it seems like the ideal algorithm would be one that takes into account both transaction_time and the previous value[s] of nb_transactions_total. That would be doable by adding a second column for it, but potentially hard to query, since ClickHouse doesn’t have very good lag/window functions yet.

Without that, it might be better to just stick to giving a single value of transaction_time and not using the timeseries-based approach at all.

Some additional points I can think of:

The data size: by your estimate it’s ~100 million to 1 billion rows per table, so in your place I’d start prototyping on a set of one or a few locations to save time and see if it performs well. If mindsdb works on 10 locations it would presumably work on 1000 as well. It might also be worthwhile to try with a few months of data at first and then extend to the whole 2-year period.

In this case I don’t think doing a group by location would help the problem, so that specific parameter shouldn’t be needed.

We want short-term predictions: the values for each location, per 15 minutes, for the next 3 hours (so the next 12 values).
And we run the prediction every 15 minutes on each data refresh.

Since in this case it sounds like you aren’t using the previous values of nb_transactions_total for the problem, this part is easy: you’d just have to run a single predict call, giving it the location and the following 12 values for the transaction_time column.
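Generating those 12 future transaction_time values at 15-minute intervals is straightforward; a small Python sketch (not tied to any mindsdb API):

```python
from datetime import datetime, timedelta

def next_timestamps(last_time, steps=12, interval_minutes=15):
    """Return the next `steps` timestamps after `last_time`,
    spaced `interval_minutes` apart."""
    return [last_time + timedelta(minutes=interval_minutes * (k + 1))
            for k in range(steps)]

future = next_timestamps(datetime(2018, 7, 1, 8, 0), steps=12)
# future[0] is 08:15, future[-1] is 11:00 -> covers the next 3 hours
```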

We tried random forest, xgboost, and RNNs (LSTM). We enrich the features by adding date- and time-related fields like day of the week, month, hour of the day… (we need to build loops, pivots with lags…)
It’s time-consuming to try all models and parameters with complex code in Python.

Regarding this bit, I don’t think it will be necessary, as this kind of feature enrichment is already handled by mindsdb internally.

I wanted to try in-database machine learning to do it all in SQL, without having to code Python and chain additional batch jobs.
If it’s not possible in SQL we can also switch back to Python… but what drew me to MindsDB was the in-database machine learning, so we don’t have to move the data (even if in this case it’s not really doing it in-database, but sending the data to the Python code behind it).

I think in this case the problem itself can easily be done from within the database (see above; I had just forgotten we already implemented the timeseries feature), but depending on which values you want to use for training it might require some more advanced queries to select the data in the proper format.

At any rate, do let me know how it goes, since this is the kind of timeseries problem we want to start optimizing for (where previous values of the target column matter a lot for the prediction); it’s just that we didn’t really get many users with this use case before (and when we did, alternatives worked better).

One final note:

The following approach might be better for your problem

  1. Create a table:
create table if not exists TRANSACTION_CountryA_with_history (
Location String,
transaction_time DateTime,
nb_transactions_total UInt32,
nb_transactions_total_previous UInt32)
engine = MergeTree
order by (transaction_time)
partition by toYYYYMM(transaction_time);
  2. Run the query:

INSERT INTO mindsdb.predictors spike_predictor WHERE select_data_query='SELECT transaction_time, location, nb_transactions_total, nb_transactions_total_previous FROM TRANSACTION_CountryA_with_history' AND to_predict='nb_transactions_total' AND training_options='{"order_by":"nb_transactions_total_previous","window_size":5}'

Because, as mentioned before, having an actual “time series” of the dates doesn’t matter; what will probably help is the number of transactions in the previous timestep.

Obviously the problem here is creating the TRANSACTION_CountryA_with_history table (I think it can be done with some weird queries, but I haven’t worked with ClickHouse in a while, so I can’t recommend anything off the top of my head).

In this case, predicting the following x values of nb_transactions_total would consist of actually running one query per value of nb_transactions_total_previous, appending the result to nb_transactions_total_previous, and so on until you’ve gotten x predictions.

You’d start with a table with a number of rows equal to window_size and insert each prediction into it; these tables are pretty small, so you could just create a new one every time you want to predict. That being said, this querying would probably have to be done programmatically in a production setting.
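The iterative scheme above could be sketched as follows; predict_one here is a hypothetical stand-in for whatever call actually runs the mindsdb prediction query:

```python
def rolling_forecast(history, predict_one, steps=12, window_size=5):
    """Feed the last `window_size` known values to the model, append each
    prediction to the history, and repeat `steps` times (one query per step)."""
    values = list(history)
    predictions = []
    for _ in range(steps):
        window = values[-window_size:]
        pred = predict_one(window)   # in practice: one mindsdb query per step
        predictions.append(pred)
        values.append(pred)          # the predicted value feeds the next step
    return predictions

# toy stand-in model: predict the mean of the window
preds = rolling_forecast([400, 410, 420, 430, 440],
                         predict_one=lambda w: sum(w) / len(w), steps=3)
```

Each step consumes its predecessor’s output, which is exactly why the querying is hard to do in pure SQL and is easier to drive from a small script.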

Still, I’d focus on the training first and the predicting problem second.

Thank you for the detailed explanation.

I will try what you recommend.

Regarding the LAG, ClickHouse has a new function, neighbor, that does just that:

select Location, transaction_time, nb_transactions_total,
if(neighbor(Location, -1, 'ZZZ') != Location, 0, neighbor(nb_transactions_total, -1, 0)) as nb_transactions_total_previous
from (
select * from TRANSACTION_Belgium
order by Location, transaction_time
);

You just have to:

  • manage ordering, using a subquery with an ORDER BY
  • handle breaks in the sequence, like here per Location, with an IF.

And yes, for a 12-step prediction we can run it 12 times, using the previously predicted value as input for the next one.

Thanks again, that gives me a lot to test.


Hi,

Also, regarding the INSERT, the format that works is:

INSERT INTO mindsdb.predictors(name, predict, select_data_query, training_options)
VALUES ('PredictionWith_LAG0_GRP1_12'
, 'nb_transactions_total'
, 'SELECT transaction_time, Location, nb_transactions_total FROM TRANSACTION_Belgium'
, '{"order_by":"transaction_time","group_by":"Location","window_size":12}');

I’m not sure about the training_options format, but it was not rejected.
Now running…

Hello @Stephane, the INSERT format should follow:

  • name ( string ) - the name of the predictor.
  • predict ( string ) - the feature you want to predict, in your example it will be nb_transactions_total.
  • select_data_query ( string ) - the SELECT query
  • training_options (dictionary) - optional value that contains additional training parameters. For a full list of the parameters, check the mindsdb docs. Please share the info once you have successfully trained and queried the model.

Huh, I didn’t know ClickHouse introduced a lag-like function; I haven’t worked much with it for the last year and a bit. That will come in handy :slight_smile:

Anyway, the approach you described of using data from the query:

select Location, transaction_time, nb_transactions_total,
if(neighbor(Location, -1, 'ZZZ') != Location, 0, neighbor(nb_transactions_total, -1, 0)) as nb_transactions_total_previous
from (
select * from TRANSACTION_Belgium
order by Location, transaction_time
);

seems like the correct one here. You could also add more such columns (e.g. 5, 10, 20… of them) holding the second-, third-, … previous values, to use more of the transaction history for each sample; that seems like the best approach here to me.

The second approach you described in that INSERT query might also work but, again, mindsdb doesn’t use the “target” column in a timeseries, so in this case you’re not using any of the information from the nb_transactions_total column in previous rows, and thus I’m not sure the predictions will be very good.


If I understand properly: if I use the lag function in the database to create the lag columns, then I should not use the timeseries parameters (order_by, group_by, window_size), since, for example, window_size tells mindsdb to create lags, and we don’t need that as we create them with the SQL?

Also, I’ve downscaled the data to create a first quick model. It works. I can now get a prediction using:

SELECT Location, transaction_time,
nb_transactions_total AS predicted,
nb_transactions_total_min,
nb_transactions_total_max,
nb_transactions_total_confidence AS confidence,
nb_transactions_total_explain AS info
FROM mindsdb.PredictionWith_LAG1
WHERE Location='XXX' AND transaction_time = '2018-07-01 08:15:00' AND nb_transactions_total_previous = 426;

But I can’t find a way to predict multiple values (locations) at the same time.

Location in ('XXX','YYY')

does not work.
And joining doesn’t work either. Any solution for that?

Thanks

To predict multiple values you need to predict from a table:

SELECT nb_transactions_total AS predicted,
nb_transactions_total_min,
nb_transactions_total_max,
nb_transactions_total_confidence AS confidence,
nb_transactions_total_explain AS info
FROM mindsdb.PredictionWith_LAG1
WHERE select_data_query='query to select from table with multiple rows to predict from'

Where select_data_query is just a magic variable in which you can add any valid clickhouse select query.

If i understand properly, If I use the Lag function in database to create the lag columns, then I should not use the timeseries parameters (order_by, group_by, window_size) as for exemple the window_size is to tel mindsdb to create lag so we don’t need that as we create them with the SQL ?

Yes, precisely; the timeseries parameters just tell mindsdb how to convert certain columns to arrays containing values from the previous rows (in order to have them processed with the timeseries encoder).

But in this case the timeseries you would have created would have been of the “time” column (as in, values from that column in an array), whereas you needed the “transaction_total” column. This is a case mindsdb can’t handle via the timeseries args as of yet (using the previous values of the target column to predict its new value), so it’s better to just have this workaround in the database.

Thanks, the select_data_query worked fine.

SELECT station, transaction_time,
nb_transactions_total AS predicted,
nb_transactions_total_original,
nb_transactions_total_min,
nb_transactions_total_max,
nb_transactions_total_confidence AS confidence,
nb_transactions_total_explain AS info
FROM mindsdb.EntriesPredictionWith_LAG12_v3
WHERE select_data_query=
'SELECT transaction_time, Location, nb_transactions_total_prev01, nb_transactions_total_prev02, nb_transactions_total_prev03, nb_transactions_total_prev04, nb_transactions_total_prev05, nb_transactions_total_prev06, nb_transactions_total_prev07, nb_transactions_total_prev08, nb_transactions_total_prev09, nb_transactions_total_prev10, nb_transactions_total_prev11, nb_transactions_total_prev12 FROM 12LAG_1MTH_TGT where transaction_time = ''2018-07-01 08:00:00''';

For single quotes inside the query, they just have to be doubled to work.

Perfect. Once you’ve played around with a few models, do let us know how this worked compared to the other frameworks you tried, both in terms of accuracy and your opinion of it in general: intuitiveness, ease of use, features, etc.

MindsDB’s ClickHouse integration is pretty recent, and we genuinely want to hear people’s opinions and complaints about it, because that’s the best path for us to improve.