Multiproc / parallelism / multi CPU?

Dear all,

I started my first training, but it looks like all the activity is done by 1 processor.
It has been stuck on the following for 2 hours:

lightwood.py:173 - Reshaping data into timeseries format, this may take a while !

EDIT: then after that I get

INFO:root:We will train an encoder for this text, on a CPU it will take about 112660885.80576308 minutes

I have 48 cores on the Python server and MindsDB is using 1 core at 100%.
Meanwhile the ClickHouse server is idle.

Is there any config.json or training parameter that will allow me to set the number of cores to use?

Thanks.

Hmh, it’s rather odd that one of your columns is getting classified as “Text”; it should be classified as “Category” (I assume this is the location column you are talking about, based on your previous post).

You can force categorical encoding for the column with: "advanced_args": {"force_categorical_encoding": ["Location"]} to skip the text encoding.
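
For reference, a rough sketch of where that argument would go, assuming you are calling the Python Predictor API (the predictor name, data file and target column below are just placeholders, "Location" is your column):

import mindsdb

predictor = mindsdb.Predictor(name='my_predictor')  # placeholder name
predictor.learn(
    from_data='my_data.csv',   # placeholder data source
    to_predict='my_target',    # placeholder target column
    advanced_args={
        # force "Location" to be encoded as a category instead of text
        'force_categorical_encoding': ['Location']
    }
)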

Alternatively, if one of your columns contains rich text (actual sentences) and the training takes too long, I might be able to help you with a quick hack to bypass that (with slightly decreased accuracy).

Regarding the single CPU usage: the machine learning bit uses PyTorch (Intel MKL) and should parallelize across as many CPUs as you have, but the statistical analysis & data modeling is currently single-core, which poses issues on large datasets where we must run costly remodeling (e.g. a lot of rows + timeseries arguments).
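
If it helps in the meantime, you can check or raise how many CPU threads the PyTorch side is allowed to use; these are generic PyTorch/OpenMP knobs rather than MindsDB options, so treat this as a sketch:

# Generic PyTorch settings, not MindsDB parameters.
# OMP_NUM_THREADS should be set in the shell before Python starts, e.g.:
#   OMP_NUM_THREADS=48 python your_training_script.py
import torch

torch.set_num_threads(48)       # intra-op CPU parallelism for PyTorch
print(torch.get_num_threads())  # confirm how many threads PyTorch will use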

I will make an issue regarding this, since it’s certainly something we ought to try and fix, but it’s a non-trivial issue that we haven’t encountered that much, so it likely won’t be fixed anytime soon (e.g. not in the next 2 weeks).
