Can vowpal wabbit use all my CPU cores? - vowpalwabbit

I tried to train a Vowpal Wabbit --oaa classifier on a 10M+ training set and found that it uses only one core. Is there any way to make it use all 12 cores?

VW uses two threads: one for loading and parsing the input data and one for the machine learning itself.
VW comes with a spanning_tree tool for parallel execution (AllReduce) of several VW instances on a cluster (e.g. Hadoop) or on a single machine (--span_server localhost).
That said, I think 12 cores are not enough for AllReduce to pay off. For the best results, you need to do hyper-parameter search anyway, so you can do it in parallel using the 12 cores.
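As a rough illustration of the parallel hyper-parameter search suggested above, here is a minimal Python sketch that builds one vw command per parameter combination and runs several of them at once. The flags used (-d, --oaa, -l, --l2, -f) are standard VW options, but the grid values, the file name train.vw, the model naming scheme, and the helper names build_vw_commands and run_all are all assumptions made for the example. The default of 6 workers is also an assumption, based on each VW instance using about two threads.

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_vw_commands(train_path, num_classes, learning_rates, l2_values):
    """Return one vw command (as an argv list) per hyper-parameter combination."""
    commands = []
    for lr, l2 in itertools.product(learning_rates, l2_values):
        # Hypothetical naming scheme so each run writes its own model file.
        model_path = f"model_lr{lr}_l2{l2}.vw"
        commands.append([
            "vw", "-d", train_path,
            "--oaa", str(num_classes),
            "-l", str(lr),
            "--l2", str(l2),
            "-f", model_path,
        ])
    return commands


def run_all(commands, max_workers=6, runner=subprocess.run):
    """Run the commands concurrently, at most max_workers at a time.

    Threads are enough here because each one only waits on an external
    vw process; the actual work happens in those processes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

Calling run_all(build_vw_commands("train.vw", 10, [0.1, 0.5], [0.0, 1e-6])) would launch four independent VW runs. Comparing the resulting models on held-out data is left out of the sketch.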


h2o autoML network usage

First I would like to thank the H2O team for a great product and rapid development / iteration.
I was testing H2O AutoML on a 4-machine cluster (40 cores, 256 GB of RAM, gigabit bandwidth).
For a 20MB dataset I am noticing that the cluster is using up a lot of network and hardly touching the CPU. I was wondering if it makes sense for H2O to train 1 model per computer instead of trying to train every model on the entire cluster.
AutoML is training H2O models in a sequence, so this advice applies to H2O models in general, not just AutoML -- if your dataset is small enough, adding machines to your H2O cluster will only slow down the training process.
> For a 20MB dataset I am noticing that the cluster is using up a lot of network and hardly touching the CPU.
If you have a 20MB dataset, it's always going to be better to run H2O on a single machine. The overhead of using multiple machines is only worth it when your training frame won't fit into RAM on a single machine.
There is a longer explanation in another Stack Overflow answer I wrote here.
> I was wondering if it makes sense for h2o to train 1 model per computer instead of trying to train every model on the entire cluster.
It does make sense for small data, but H2O was designed to scale to big data (with millions or hundreds of millions of rows), so training several models in parallel is not the design pattern that was used. To speed up the training process, you can use a single machine with more cores.

Spark: Inconsistent performance number in scaling number of cores

I am doing a simple scaling test on Spark using a sort benchmark, from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories in each case are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as follows:
$$S_{\text{latency}}(s) = \frac{1}{(1 - p) + \frac{p}{s}}$$
where:
s is the speedup of the parallel part.
p is the fraction of the program that can be parallelized.
In practice, the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low, at most 1 / (1 - p) = 20:
[Figure: Amdahl's law speedup curves. This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. Attribution: Daniels220 at English Wikipedia]
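To make the bound concrete, here is a small Python sketch of Amdahl's law, speedup = 1 / ((1 - p) + p / s), with p and s as defined above. The function names are made up for the example.

```python
def amdahl_speedup(p, s):
    """Theoretical overall speedup when a fraction p of the work is sped up s times."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p):
    """Upper bound on the speedup as s grows without limit."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Even with 95% of the work parallelizable, 8 cores give well under 8x.
    print(round(amdahl_speedup(0.95, 8), 2))  # about 5.93
    print(round(amdahl_limit(0.95), 2))       # about 20.0
```

With a more modest p of 0.5, going from 1 to 8 cores gives less than a 2x speedup, before any framework overhead is counted.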
Effectively, this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because:
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at the datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
which are fundamental problems for large scale, data intensive systems.
Parallel processing is more a side effect of the particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amount of data by scaling out, not speeding up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but that doesn't necessarily help in data-intensive jobs due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvements and typically will decrease performance due to overhead of the components.
In this context:
Assuming that the class and jar are meaningful and it is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to execute the whole Spark sorting machinery with shuffle files and data exchange.

Run Map-Reduce application on multiple core on the same machine

I want to run map reduce tasks on a single machine and I want to use all the cores of my machine. Which is the best approach? If I install Hadoop in pseudo-distributed mode, is it possible to use all the cores?
You can make use of the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase the number of Mappers/Reducers spawned simultaneously on a TaskTracker as per your hardware specs. By default, each is set to 2, hence a maximum of 2 maps and 2 reduces will run at any given time. One thing to keep in mind, though, is that if your input is very small, the framework will decide it's not worth parallelizing the execution. In such a case you need to handle it by tweaking the default split size through mapred.max.split.size.
Having said that, in my personal experience MR jobs are normally I/O bound (or sometimes memory bound), so the CPU does not really become a bottleneck under normal circumstances. As a result, you might find it difficult to fully utilize all the cores on one machine at a time for a single job.
I would suggest devising some strategy for choosing the number of Mappers/Reducers so that you are properly utilizing the CPU, since Mappers/Reducers take up slots on each node. One approach could be to take the number of cores, multiply it by 0.75, and then split the resulting slots between Mappers and Reducers as per your needs. For example, if you have 12 physical cores or 24 virtual cores, then you could have 24 * 0.75 = 18 slots. Based on your needs, you can then decide whether to use 9 Mappers + 9 Reducers, 12 Mappers + 6 Reducers, or something else.
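The rule of thumb above can be sketched in a few lines of Python. The function name plan_slots and the mapper_share parameter are made up for the example, and the 0.75 factor is the heuristic from this answer rather than a Hadoop setting.

```python
def plan_slots(virtual_cores, utilization=0.75, mapper_share=0.5):
    """Split a core budget into mapper and reducer slots.

    utilization is the 0.75 rule of thumb from the answer; mapper_share is a
    hypothetical knob for how many of the slots go to mappers.
    """
    slots = int(virtual_cores * utilization)
    mappers = round(slots * mapper_share)
    reducers = slots - mappers
    return slots, mappers, reducers


if __name__ == "__main__":
    print(plan_slots(24))                      # (18, 9, 9)
    print(plan_slots(24, mapper_share=2 / 3))  # (18, 12, 6)
```

The two results correspond to the 9 + 9 and 12 + 6 splits mentioned above. The resulting numbers would then go into the two TaskTracker maximum properties.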
I'm reposting my answer from this question: Hadoop and map-reduce on multicore machines
For Apache Hadoop 2.7.3, my experience has been that enabling YARN will also enable multi-core support. Here is a simple guide for enabling YARN on a single node:
The default configuration seems to work pretty well. If you want to tune your core usage, then perhaps look into setting 'yarn.scheduler.minimum-allocation-vcores' and 'yarn.scheduler.maximum-allocation-vcores' within yarn-site.xml.
Also, see here for instructions on how to configure a simple Hadoop sandbox with multicore support:

Estimating Hadoop Scalability Performance on pseudo-distributed nodes?

Are there any tools, packages, or methodologies available to estimate / simulate the scalability performance of Hadoop using only a single machine using a pseudo-distributed architecture? Such a system would need to make accurate estimations based on jobs that do not interfere with each other in the simulation (e.g., with blocked I/O).
In my mind, how this would work is that I'd run all my map / reduce jobs sequentially, and use some metric to estimate how well the system is scaling (e.g., take the longest running map job and estimate that the run time will be bottlenecked by it).
Additionally, I have multiple map/reduce jobs which are being chained together to form the output.
I think it largely depends on the nature of your job. Let us take a few examples:
1. Your job has heavy input formatting and mapper processing, with minimal data passed to the reducer. In this case I would estimate that a pseudo-distributed cluster will realistically reflect real cluster performance (per slot), and you can assume that a 5-node cluster will have about 5x the performance. I would suggest using enough data that the job runs for at least 5-10 times the job start-up time. This estimate will be better if you have enough splits to ensure data locality during processing.
If you plan to have a lot of relatively small files, put enough of them in your test to simulate the per-task overhead.
2. You are relying heavily on Hadoop's distributed sort capability (shuffling). Its performance on one node and on a real cluster can be quite different, and the factor is hard to estimate.
To summarize: the throughput of the mapper and, to some extent, the reducer, in MB/sec per slot, can be estimated from the above. A real cluster will probably not have better performance per slot.
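The estimation idea from the question (measure task times on one machine, then project onto more slots) can be sketched as a simple greedy schedule in Python. This is only a back-of-the-envelope model under strong assumptions: it treats tasks as independent, uses hypothetical per-task durations, and ignores shuffle, network transfer, and data locality, which are exactly the parts described above as hard to estimate.

```python
import heapq


def estimate_makespan(task_seconds, slots):
    """Estimate wall-clock time for running tasks on a given number of slots.

    Each task is handed, in order, to the slot that becomes free first.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    for duration in task_seconds:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


if __name__ == "__main__":
    measured = [4, 3, 2, 1]  # hypothetical map task durations in seconds
    for slots in (1, 2, 10):
        print(slots, estimate_makespan(measured, slots))
```

With enough slots the estimate flattens out at the duration of the longest task, which matches the intuition in the question that the longest-running map task becomes the bottleneck.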

KMeans clustering for more than 5 million vectors

I have hit a real problem. I need to do some Kmeans clustering for 5 million vectors, each containing about 32 cols.
I tried out Mahout, which requires Linux, but I am on Windows and I am restricted from using a Linux OS or any sort of simulator.
Can anyone suggest a KMeans clustering algorithm that scales up to 5M vectors and converges quickly?
I have tested a few, but they won't scale: they are slow and take forever to complete.
OK, so for whoever wants clustering for large-scale datasets: the only way I found of doing so is to use Mahout. It requires a Linux platform, so I had to use VirtualBox, install Ubuntu on it, and then use Mahout. Setting up Mahout is a lengthy procedure, but the two links that I used are as follows.
