Multiprocessing

pyseer supports the use of multiple CPUs through the --cpu option. Each core is sent a batch of processed variants and fits the chosen model to every variant in that batch.
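For example, an association analysis could be run on 8 cores as follows (a sketch only; the phenotype, k-mer and distance matrix file names are placeholders for your own data):

pyseer --phenotypes phenotypes.tsv --kmers fsm_kmers.txt --distances structure.tsv --cpu 8 > kmer_results.txt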

The --block-size option controls the number of variants sent to each core. The higher this is set, the more efficiently the CPUs will be used (up to a limit set by the time spent reading the variant input), at the expense of a roughly linear increase in memory usage. The default is 1000; with this setting, 8 cores required around 1.5GB of memory for a 1.4x speedup with the mixed model. Increasing it to 30000 while using 4 cores gave a similar (1.5x) speedup, but needed 12GB of memory.
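To trade memory for throughput, the block size can be raised on the same command line (again a sketch; file names are placeholders):

pyseer --phenotypes phenotypes.tsv --kmers fsm_kmers.txt --distances structure.tsv --cpu 4 --block-size 30000 > kmer_results.txt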

Depending on your computing architecture, you may instead wish to split the input and run a separate job on each chunk. This is more efficient, but less convenient. The input can be split using GNU split:

split -d -n l/8 fsm_kmers.txt fsm_out

This would split the input k-mers into 8 separate files.
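Each chunk can then be analysed as an independent single-core pyseer job, for example through your cluster's job scheduler or with a simple shell loop (a sketch; file names and options are placeholders for your own data):

for chunk in fsm_out*; do
    pyseer --phenotypes phenotypes.tsv --kmers "$chunk" --distances structure.tsv > "results_${chunk}.txt" &
done
wait

The per-chunk results can then be concatenated, keeping only one copy of the header line.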

Prediction

The --wg enet mode also supports multiple CPUs, but can be very memory-hungry (memory use scales linearly with the number of cores). For large datasets, if you are running out of memory, you may wish to try running on a single core.
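For example, an elastic net fit that exceeds available memory could be retried on a single core (a sketch; file names are placeholders):

pyseer --wg enet --phenotypes phenotypes.tsv --kmers fsm_kmers.txt --cpu 1 > selected_variants.txt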