I have code similar to what follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

val fileContent = sc.textFile("file:///myfile")
val dataset = fileContent.map(row => {
  val explodedRow = row.split(",").map(s => s.toDouble)
  new LabeledPoint(explodedRow(13), Vectors.dense(
    Array(explodedRow(10), explodedRow(11), explodedRow(12))
  ))
})
val algo = new LassoWithSGD().setIntercept(true)
val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)
val model = algo.run(dataset)
I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).
I would have expected this to get increasingly faster as I increase the spark.master=local[*] setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of which setting I use (though I notice from the Spark UI that my executor's maximum number of Active Tasks at any given time equals the expected amount, i.e. ~8 for local[8], ~40 for local[40], etc. -- so it seems that the parallelization works).
By default, the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success -- in fact it slows the Lasso algorithm down even more...
Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?
Is my expectation of the scaling process incorrect?
Well, kind of. I hope you don't mind if I use a little bit of Python to prove my point.
Let's be generous and say a few million rows is actually ten million. With 50 000 000 values (intercept + 3 features + label per row) that gives around 380 MB of data (a Java Double is a double-precision 64-bit IEEE 754 floating point, i.e. 8 bytes per value). Let's create some dummy data:
import numpy as np
n = 10 * 1000**2
X = np.random.uniform(size=(n, 4)) # Features
y = np.random.uniform(size=(n, 1)) # Labels
theta = np.random.uniform(size=(4, 1)) # Estimated parameters
Each step of gradient descent (since the default miniBatchFraction for LassoWithSGD is 1.0, it is not really stochastic), ignoring regularization, requires an operation like this:
def step(X, y, theta):
    return ((X.dot(theta) - y) * X).sum(0)
So let's see how long it takes locally on our data:
%timeit -n 15 step(X, y, theta)
## 15 loops, best of 3: 743 ms per loop
Less than a second per step, without any additional optimizations. Intuitively this is pretty fast and it won't be easy to match. Just for fun, let's see how long it takes to get a closed-form solution for data like this:
%timeit -n 15 np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
## 15 loops, best of 3: 1.33 s per loop
Now let's go back to Spark. Residuals for a single point can be computed in parallel, so this is the part which scales linearly as you increase the number of partitions that are processed in parallel.
The problem is that you have to aggregate the data locally, serialize it, transfer it to the driver, deserialize it and reduce it locally to get the final result after each step. Then you have to compute the new theta, serialize it, send it back, and so on.
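To make that round trip concrete, here is a minimal PySpark sketch of a single distributed gradient step. This is not the actual MLlib implementation; points_rdd (an RDD of (features, label) NumPy pairs) and the step size are assumptions made purely for illustration.
import numpy as np

def distributed_step(points_rdd, theta, step_size=0.01):
    # One gradient step over an RDD of (features, label) NumPy pairs.
    # The map runs on the executors; the reduce ships partial gradients to
    # the driver, where the new theta is computed sequentially and then
    # shipped back to the executors (via the closure) on the next call.
    grad, n = points_rdd.map(
        lambda xy: ((xy[0].dot(theta) - xy[1]) * xy[0], 1)
    ).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return theta - step_size * grad / n

# e.g. points_rdd = sc.parallelize(
#     [(np.random.rand(4), np.random.rand()) for _ in range(1000)], 8)
That per-iteration driver round trip is exactly the sequential part that limits scaling.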
All of that can be improved by proper use of mini-batches and some further optimizations, but at the end of the day you are limited by the latency of the whole system. It is worth noting that when you increase parallelism on the worker side, you also increase the amount of work that has to be performed sequentially on the driver, and the other way round. One way or another, Amdahl's law will bite you.
Also all of the above ignores actual implementation.
Now let's perform another experiment. First some dummy data:
nCores = 8 # Number of cores on local machine I use for tests
rdd = sc.parallelize([], nCores)
and benchmark:
%timeit -n 40 rdd.mapPartitions(lambda x: x).count()
## 40 loops, best of 3: 82.3 ms per loop
It means that with 8 cores, without any real processing or network traffic, we already get to the point where we cannot do much better by increasing parallelism in Spark (743 ms / 8 = 92.875 ms per partition, assuming linear scalability of the parallelized part).
Just to summarize the above:
if the data can be easily processed locally with a closed-form solution, using gradient descent is just a waste of time. If you want to increase parallelism / reduce latency, you can use good linear algebra libraries
Spark is designed to handle large amounts of data, not to reduce latency. If your data fits in the memory of a few-years-old smartphone, it is a good sign that it is not the right tool
if computations are cheap, then constant costs become the limiting factor
Side notes:
a relatively large number of cores per machine is, generally speaking, not the best choice unless you can match it with IO throughput
Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
I am wondering how this is implemented in embedded systems, where resources are scarce. Is there a clever data structure to minimize the data stored?
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.
From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If however you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want to have a much higher sample rate (every 5 seconds) than the time you wish to calculate the delta over (e.g. 30 mins). You can reduce your storage needs to a single data point if you make your sample rate equal to your delta period. The single previous value could be stored in a small battery-retained memory (e.g. the backup registers on an STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
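As an illustration only (the period, threshold and hook functions below are assumptions, not values from the question), the one-stored-value approach might be sketched as:
PERIOD_S = 30            # sample period equals the delta period
DELTA_THRESHOLD_K = 15.0 # e.g. 30 K/min sustained over one 30 s period
previous_sample = None   # in practice kept in battery-retained memory, not main RAM

def on_wakeup(read_temperature, trigger_alarm):
    # Called once every PERIOD_S seconds; read_temperature() and
    # trigger_alarm() are hypothetical hooks.
    global previous_sample
    current = read_temperature()
    if previous_sample is not None and current - previous_sample >= DELTA_THRESHOLD_K:
        trigger_alarm()
    previous_sample = current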
You can also set several thresholds of K/sec, and then allocate counters to count how many consecutive times each threshold has been exceeded. This requires only one extra integer per threshold.
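A minimal sketch of that counter scheme, using the rates from the question and an assumed 5-second sample period (illustrative, not production code):
# One (rate threshold in K/s, consecutive samples to trip) pair per alarm
# condition, assuming one rate measurement every 5 seconds.
THRESHOLDS = [(1.0 / 60, 360),   # 1 K/min for 30 min
              (5.0 / 60, 60),    # 5 K/min for 5 min
              (30.0 / 60, 6)]    # 30 K/min for 30 s
counters = [0] * len(THRESHOLDS)

def check_sample(rate_k_per_s):
    # Returns True as soon as any counter reaches its trip count.
    for i, (threshold, trip_count) in enumerate(THRESHOLDS):
        counters[i] = counters[i] + 1 if rate_k_per_s >= threshold else 0
        if counters[i] >= trip_count:
            return True
    return False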
In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply an IIR (infinite impulse response) filter, which uses feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory -- only a few samples' worth -- but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.
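For instance, a hedged sketch of the simplest IIR variant: an exponential moving average as the low-pass, followed by a first difference as the derivative. alpha, the sample period and the threshold are illustrative values, not tuned for a real detector.
def make_rate_detector(alpha=0.05, sample_period_s=5.0, rate_threshold_k_per_s=0.5):
    # Keeps only one value of state (the current EMA output).
    state = {"ema": None}
    def update(sample_k):
        if state["ema"] is None:
            state["ema"] = sample_k
            return False
        previous = state["ema"]
        state["ema"] = alpha * sample_k + (1 - alpha) * previous  # IIR low-pass
        rate = (state["ema"] - previous) / sample_period_s        # derivative
        return rate >= rate_threshold_k_per_s
    return update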
We have time series data (timestamp in us since 1970 and integer data value):
# load data and cache it
df_cache = readInData() # read data from several files (partitioned by hour)
df_cache.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
df_cache.agg({"data": "max"}).collect()
# now data is cached
df_cache.show()
+--------------------+---------+
| time| data|
+--------------------+---------+
|1.448409599861109E15|1551.7468|
|1.448409599871109E15|1551.7463|
|1.448409599881109E15|1551.7468|
Now we want to calculate some non-trivial things on top of 10 Minute time windows using an external python library. In order to do so, we need to load the data of each time frame in memory, apply the external function and store the result. Therefore a User Defined Aggregate Function (UDAF) is not possible.
Now the problem is, when we apply the GroupBy to the RDD, it is very slow.
(df_cache.rdd
    .groupBy(lambda x: int(x.time / 600e6))  # create 10 minute groups
    .map(lambda x: 1)                        # do some calculations, e.g. external library
    .collect())                              # get results
For 120 million samples (100 Hz data) this operation takes around 14 minutes on two nodes with 6 GB RAM. Spark details for the groupBy stage:
Total Time Across All Tasks: 1.2 h
Locality Level Summary: Process local: 8
Input Size / Records: 1835.0 MB / 12097
Shuffle Write: 1677.6 MB / 379
Shuffle Spill (Memory): 79.4 GB
Shuffle Spill (Disk): 1930.6 MB
If I use a simple python script and let it iterate over the input files, it takes way less time to finish.
How can this job be optimized in spark?
The groupBy is your bottleneck here: it needs to shuffle the data across all partitions, which is time-consuming and takes a hefty amount of memory, as you can see from your metrics.
The way to go here is to use the reduceByKey operation and chain it as follows:
df_cache.rdd.map(lambda x: (int(x.time / 600e6), (x.time, x.data))).reduceByKey(lambda x, y: 1).collect()
The key takeaway here is that groupBy needs to shuffle all of your data across all partitions, whereas reduceByKey will first reduce on each partition and only then across partitions, drastically reducing the size of the global shuffle. Notice how I organized the input into a key to take advantage of the reduceByKey operation.
As I mentioned in the comments, you might also want to try your program using Spark SQL's DataFrame abstraction, which can potentially give you an extra boost thanks to its optimizer.
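For what it's worth, a hedged sketch of that DataFrame route, assuming a Spark version that ships pyspark.sql.functions.window (2.0 or later); the avg() aggregate is only a placeholder for the real per-window computation:
from pyspark.sql import functions as F

# time is in microseconds since 1970, as in the sample above; casting
# seconds-since-epoch to a timestamp lets the built-in window() do the
# 10-minute bucketing.
windowed = (df_cache
    .withColumn("ts", (F.col("time") / 1e6).cast("timestamp"))
    .groupBy(F.window("ts", "10 minutes"))
    .agg(F.avg("data").alias("result")))  # placeholder aggregate
windowed.show()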
Teradata is built for parallelism.
I believe that with the query below we can measure the parallel efficiency of a user's queries:
SELECT
USERNAME,
NumOfActiveAMPs,
((sum(AMPCPUTime))/1024) / ((sum(MaxAmpCPUTime) * NumOfActiveAMPs)/1024) * 100 as Parallel_Efficiency,
count(1)
FROM dbc.qrylog
WHERE MaxAmpCPUTime > 0
group by 1,2
In an ideal situation, I believe PE can be 100%.
But for various reasons, I see that most PE (rolled up) is usually less than 50%.
What, in your opinion, is a good Parallel Efficiency % that we should try to achieve?
I was told that trying to achieve a high PE (like 60% or more) is also not good for the state of the system; I'm not sure of the reason though. Is this true? Your thoughts?
Thanks for sharing your thoughts!
Parallel efficiency for a given query can be calculated as AMPCPUTime / (MaxAMPCPUTime * (HASHAMP() + 1)), where (MaxAMPCPUTime * (HASHAMP() + 1)) is the ImpactCPU measure: the highest CPU consumed by a participating AMP in the query multiplied by the number of AMPs in the configuration. You may find individual workloads are all over the board on their parallel efficiency.
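With made-up numbers (not taken from any real DBQL data), the calculation looks like this:
amp_cpu_time = 400.0     # total AMPCPUTime for the query (CPU seconds)
max_amp_cpu_time = 10.0  # MaxAMPCPUTime: the busiest participating AMP
num_amps = 100           # HASHAMP() + 1: AMPs in the configuration

impact_cpu = max_amp_cpu_time * num_amps               # 1000.0
parallel_efficiency = amp_cpu_time / impact_cpu * 100  # 40.0 (%)
print(parallel_efficiency)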
I sometimes wonder if PE for an individual query would be more accurate if you replaced the number of AMPs in the system with the number of AMPs used by the query. This metric is available in DBQL and may help balance queries that use PI or USI access paths and are therefore not all-AMP operations.
Parallel efficiency for your overall system can be obtained using ResUsage metrics by dividing the average node utilization by the maximum node utilization. This helps you understand how evenly the system is processing a given workload but does not consider how "heavy" that workload might be. Here you are looking to see the overall efficiency to be greater than 60%, the closer to 100% the better the nodes are working together.
I know your inquiry was about individual queries, but I thought sharing details about the PE of your environment would be beneficial as well.
I need to store a huge amount of data in a Map for an inverted index, and I see that as the Map gets bigger and bigger, it becomes slower and slower. We are talking about a Map container with a very sparse index that covers 1 to billions.
In one iteration of my program, some numbers are calculated, producing many key values (possibly thousands) to be stored; this means the Map grows by a few thousand entries in every iteration. I see that the first few iterations take 20 seconds or so, but by the 70th iteration or so, each one takes around 100 seconds. I have about 5000 sets of data, i.e. I need 5000 iterations to process them all. With the ever-increasing time per iteration, this will take days to compute, and that is unacceptable.
So is there anything I can do in this case?
You could try using the Java HashMap implementation instead. There is a small overhead every time MATLAB accesses Java routines, but the Java routines usually provide more flexibility. For example:
% Create
map = java.util.HashMap(5e6); % Initialize with room for 5 million entries
% Add data
map.put('key1', 'value1');
map.put(2, 20);
% Get data
out = map.get('key1');  % Get a value
map.containsKey(2);     % Check for existence of a key
This will work. But ... it's not clear if it will be faster or not. Only a test will tell.
Also, you will probably get an occasional error when you are developing this way.
Java exception occurred:
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.<init>(Unknown Source)
at java.util.HashMap.<init>(Unknown Source)
When this happens, you can use clear java to purge any Java resident information, or allocate less space to the initial HashMap.
I parallelized a simulation engine into 12 threads to run it on a cluster of 12 nodes (each node running one thread). Since the chances of having 12 systems available are generally low, I also tweaked it for 6 threads (to run on 6 nodes), 4 threads (to run on 4 nodes), 3 threads (to run on 3 nodes), and 2 threads (to run on 2 nodes). I have noticed that the more nodes/threads I use, the greater the speedup. But obviously, the more nodes I use, the more expensive (in terms of cost and power) the execution becomes.
I want to publish these results in a journal so I want to know if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?
Thanks,
Akshey
How have you parallelised your program, and what is inside each of your nodes?
For instance, on one of my clusters I have several hundred nodes, each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job, since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, since I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As #Blank has already pointed out, the most efficient way to run a program, if by efficiency one means 'minimising total CPU-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.
"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There are no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different numbers of nodes, knowing how much computational work has to be done, how much communication has to be done, and how long each takes. (The communication times can be estimated from the amount of communication and from typical latency/bandwidth numbers for your nodes' type of interconnect.) This can guide you towards good choices.
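As an illustration only, with placeholder constants rather than measured values, such a model might be sketched as:
# Toy model of time-to-solution on P nodes: perfectly parallel compute plus
# a communication term that grows with P (e.g. gathering results to one
# node every step). All constants are placeholders, not measurements.
def modelled_time(p_nodes, compute_work_s=1000.0, steps=100,
                  latency_s=1e-4, bytes_per_msg=1e6, bandwidth_bytes_s=1.25e8):
    compute = compute_work_s / p_nodes
    per_message = latency_s + bytes_per_msg / bandwidth_bytes_s
    communication = steps * p_nodes * per_message
    return compute + communication

for p in (1, 2, 4, 8, 16, 32):
    print(p, round(modelled_time(p), 2))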
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run your code on for a given problem size, there is really no substitute for running a scaling test: running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
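Here is a throwaway sketch of how those three numbers fall out of measured timings (the timings below are invented, not from any real run):
# Invented timings from a hypothetical scaling test: {P: wall time in seconds}.
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0, 16: 100.0, 32: 80.0}
t1 = timings[1]
for p in sorted(timings):
    speedup = t1 / timings[p]   # S(P) = T(1) / T(P)
    efficiency = speedup / p    # E(P) = S(P) / P
    print(f"P={p:3d}  T={timings[p]:7.1f} s  S={speedup:5.2f}  E={efficiency:4.2f}")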
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that; and you only got (say) a 20% decrease in run time over running at 16 processors -- that is, for 2x the amount of computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run at fewer processors -- particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, in this case you could get them done in 1.25 time units instead of 2 time units by running the two simulations each on 16 processors simultaneously, rather than running them one at a time on 32 processors.)
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of any parallel programming tutorials; here's our example, https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance
Increasing the number of nodes leads to diminishing returns. Two nodes are not twice as fast as one node, and four nodes gain even less relative to two. As such, the optimal number of nodes is always one: it is with a single node that you get the most work done per node.