Is there any way to increase the efficiency of PySpark outputs?

Is there any way to increase the efficiency of PySpark outputs? - performance

I am trying to test the ability of PySpark to iterate over some very large (10s of GBs to 1s of TBs) data. For most scripts I find PySpark to have about the same efficiency as Scala code. In other cases (like the code below) I get serious speed problems ranging from 10 to 12 times slower.
path = "path/to/file"
spark = SparkSession.builder.appName("siteLinkStructureByDate").getOrCreate()
sc = spark.sparkContext
df = RecordLoader.loadSomethingAsDF(path, sc, spark)
fdf = df.select(df['aDate'], df['aSourceUrl'], df['contentTextWithUrls'])
rdd = fdf.rdd
rddx = rdd.map (lambda r: (r.aDate, CreateAVertexFromSourceUrlAndContent(r.aSourceUrl, r.contentTextWithUrls)))\
.flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
.filter(lambda r: r[-1] != None)\
.countByValue()
print([((x[0], x[1], x[2]), y) for x, y in rddx.items()])
We think we have isolated the problem to the .countByValue() (which returns a defaultdict), but applying countItems() or reduceByKey() produces pretty much the same results. We are also 99% sure the problem is not with the ExtractDomain or CreateAVertexFromSourceUrlAndContent (not the real names of the functions, just pseudocode to make it understandable).
So my question is first
is there anything in this code that I can do to reduce the time?
Is PySpark fundamentally that much slower than its Scala
counterpart?
Is there a way to replicate the flatmap
using PySpark dataframes instead (understanding that dataframes are
generally faster than RDD in Pyspark)?

The biggest problem here might be communication - Spark SQL (columnar format) -> plain Scala object -> pickle (Pyrolite) -> socket -> unpickle -> plain Python object. That's a lot of copying, converting and moving things.
there a way to replicate the flatmap using PySpark dataframes instead
Yes. It is called explode - but to be fair it is slowish as well.
understanding that dataframes are generally faster than RDD in Pyspark
That's usually true (Scala and Python both), but you might need udf to implement ExtractDomain or CreateAVertexFromSourceUrlAndContent - it is another slow thing. Just from the names you might be able to use parse_url_tuple.
Is PySpark fundamentally that much slower than its Scala counterpart?
It is somewhat slower. Normally not that slower on well tuned code. But implementation details are different - the same set of operation in both Scala and Python can be materialized in a different way.
is there anything in this code that I can do to reduce the time?
I'd recommend profiling first. Once you determine which part is responsible (conversions, merging) you can try to target it.

Related

Is there a way to use torch.autograd.gradient in parallel in Pytorch?

I am trying to train some network where the loss is not only a function of the output but also the derivative of the output w.r.t. the input. The problem is that while computing the batch output can be done in parallel with the modules with Pytorh, I can't find a way to do the derivative in parallel. Here's the best I can do in serial:
import torch
x=torch.rand(300,1)
dydx=torch.zeros_like(x)
fc=torch.nn.Linear(1,1)
x.requires_grad=True
for ii in range(x.size(0)):
xi=x[ii,0:]
yi=torch.tanh(fc(xi))
dydx[ii]=torch.autograd.grad(yi,xi,create_graph=True)[0]
dydxsum=(dydx**2).sum()
dydxsum.backward()
In the code above, x is split to save memory and time. However, when the size of x becomes large, parallelization (in CUDA) is still necessary. If it has to be implemented by tinkering Pytorch, a hint to where to start will be appreciated.

How can I use basic tricks for improving my Julia code?

I'm relatively new with Julia and I'm currently using version 1.0. I have a code that is intendend to produce a sequence of integers, based on an input matrix. The code takes 3 hours to run on my machine (i5, dual core, 16GB ram), using 16% of CPU and 3% of memory. Is there any basic tips I can learn and apply to optimize my code in Julia to improve its performance? Does indentation have an effect on performance? Is there a package that can track my code and suggest improvements? I provide my code below. The code includes a R code that generates data to which the Julia code is applicable. If an error occurs during the R code, it's just a lack of achievement during simulations and it must be run again until simulation is complete.
using Distances
using RCall
using Distributions
using BSON: #save, #load
using StatsBase
using LinearAlgebra
R"simul<-function(m){
comb<-expand.grid(c(0.01,0.2,0.4),
c(sample(2:7,1),sample(8:12,1),sample(13:20,1)),
c(sample(2:5,1),sample(6:10,1),sample(11:20,1)),
c(150,500,1500))
gener<-function(i){
maxoverlap<-comb[i,1]
nbvar<-comb[i,2]
nbclass<-comb[i,3]
propmix<-runif(1,0.001,1/nbclass)
Q<-MixSim(MaxOmega = maxoverlap, K = nbclass, p = nbvar,PiLow = propmix,resN = 1000)
A <- simdataset(n = comb[i,4], Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
results<-list(Q,A)
return(results)
}
donnees<-sapply(1:nrow(comb),gener)
}
library(MixSim)
donneesimul=simul(1)"
#rget donneesimul
function pointsdpp(t)
datasim=donneesimul[2,t][:X]
Eucldist=pairwise(Euclidean(),transpose(datasim))
D=maximum(Eucldist.^2)
sigma2hat=mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)),-1)])
L=exp.(-(Eucldist.^2/D)/(2*sigma2hat))
eigenv=eigvals(L)
prob=eigenv./(eigenv.+1)
eigenvectors=eigvecs(L)
function sampledpp(m)
u=rand(size(L,1))
V=eigenvectors[:,findall(u.<=prob)]
k=size(V,2)
Y=zeros(Int64,k)
for i=k:-1:1
P=sum(V.^2,dims=2)
Pri=P / sum(P)
Cumpri=cumsum(Pri,dims=1)
u=rand()
Y[i]=findfirst(u.<=Cumpri)[1]
if i==1 break end
j=findfirst(V[Y[i],:].!=0)
Vj=V[:,j]
V=V[:,deleteat!(collect(1:1:size(V,2)),j)]
V=V-repeat(Vj,1,size(V,2)).*repeat(transpose(V[Y[i],:]/Vj[Y[i]]),size(V,1))
for a = 1:i-1
for b = 1:a-1
V[:,a] = V[:,a] - transpose(V[:,a])*V[:,b]*V[:,b]
end
V[:,a] = V[:,a] / norm(V[:,a])
end
end
Y=sort(Y)
return(Y)
end
m=collect(1:1000)
sampleY_repet=map(sampledpp,m)
end
w=collect(1:1:81)
echantdpp=map(pointsdpp,w)
#save "echantdppdatasim1.bson" echantdpp

There are many issues to be considered when evaluating Julia performance. While the code you provided is far beyond MWE (minimal working example) and is not reproducible neither.
However, here are some general guidelines:
Take some time to read carefully the Julia performance tips and apply them
Since you process some arrays your code will likely benefit from the #simd macro. Using array views is also very often a low-hanging-fruit for codes such as yours.
You use 16% of CPU power (likely you have 8 cores and your program uses just one). Consider using either multi-threading or multiprocessing - your program will run many times faster
For some scenario you might consider using GPU computing with Flux.jl
Consider moving your multi-core computation to the cloud (Julia scaling on AWS EC2 instances works fantastic)
Since each of those topics is a big area on its own work step-by-step on your code and ask questions to get help.

Safety of sharing a read-only scipy sparse matrix between multiple processes

I have a computation I must do which is somewhat expensive and I want to spawn multiple processes to complete it. The gist is more or less this:
1) I have a big scipy.sparse.csc_matrix (could use other sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.
2) I must do lots of embarrassingly parallel calculations and return values.
So I did something like this:
import numpy as np
from multiprocessing import Process, Manager
def f(instance, big_matrix):
"""
This is the actual thing I want to calculate. This reads lots of
data from big_matrix but never writes anything to it.
"""
return stuff_calculated
def do_some_work(big_matrix, instances, outputs):
"""
This do a few chunked calculations for a few instances and
saves the result in `outputs`, which is a memory shared dictionary.
"""
for instance in instances:
x = f(instance, big_matrix)
outputs[instance] = x
def split_work(big_matrix, instances_to_calculate):
"""
Split do_some_work into many processes by chunking instances_to_calculate,
creating a shared dictionary and spawning and joining the processes.
"""
# break instance list into 4 chunks to pass each process
instance_sets = np.array_split(instances_to_calculate, 4)
manager = Manager()
outputs = manager.dict()
processes = [
Process(target=do_some_work, args=(big_matrix, instance_sets, outputs))
for instances in instance_sets
]
for p in processes:
p.start()
for p in processes:
p.join()
return user_sets, outputs
My question is: is this safe? My function f never writes anything, but I'm not taking any precaution to share the big_array between processes, just passing it as it is. It seems to be working but I'm concerned if I can corrupt anything just by passing a value between multiple processes even if I never write to it.
I tried to use the sharemem package to share the matrix between multiple processes but it seems to be unable to hold scipy sparse matrices, only normal numpy arrays.
If this isn't safe, how can I share (read only) big sparse matrices between processes without problems?
I've saw here that I can make another csc_matrix pointing to the same memory with:
other_matrix = csc_matrix(
(bit_matrix.data, bit_matrix.indices, bit_matrix.indptr),
shape=bit_matrix.shape,
copy=False
)
Will this make it safer or would it be the same just as safe as passing the original object?
Thanks.

As explained here it seems your first option creates one copy of the sparse matrix per process. This is safe, but isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be signficant.
I suspect a cleaner option using the multiprocessing lib would be to create three lists (depending on the matrix format you use) and populate these with the values, row_ind and col_ptr of your CSC matrix. The documentation for multiprocessing shows how this can be done using an Array or using the Manager and one of the supported types.
Afterwards I don't see how you could run into trouble using read-only operations and it may be more efficient.

Non-linear performance of Java function in parallel MATLAB

Recently, I implemented parallelisation in my MATLAB program, much to the suggestions offered in Slow xlsread in MATLAB. However, implementing the parallelism has cropped up another problem - non-linearly increasing processing time with increasing scale.
The culprit seems to be the java.util.concurrent.LinkedBlockingQueue method as can be seen from the attached images of profiler and the corresponding condensed graphs.
Problem: How do I remove this non-linearity as my work involves processing more than 1000 sheets in single run - which would take an insanely long time?
Note: The parallelised part of the program involves just reading all the .xls files and storing them in matrices, after which I start the remainder of my program. dlmwrite is used towards the end of the program and optimization on its time is not really required, although could also be suggested.
Culprit:
Code being parallelised:
parfor i = 1:runs
sin = 'Sheet';
sno = num2str(i);
sna = strcat(sin, sno);
data(i, :, :) = xlsread('Processes.xls', sna, '' , 'basic');
end

Doing parallel IO operation is likely to be a problem (could be slower in fact) unless maybe if you keep everything on an SSD. If you are always reading the same file and it's not enormous, you may want to try reading it prior to your loop and just doing your data manipulation in parallel.

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first:Array[Float] ...
var second:Array[Float] ...
var sum=0f;
for (ix<-0 until first.length)
sum += first(ix) * second(ix);
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?

The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still do boxing/unboxing and java.lang.Float creation at iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness if being "lost" pretty fast (ie, one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (ie, avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives much better result that the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combined zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. On my tests, though Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. A similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.

I did some variations of this with Scala 2.8. The loop version is as you write but the
functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Double's instead of primitive Doubles to measure
the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).
loopArray 461 437 436 437 435
reduceArray 6573 6544 6718 6828 6554
loopVector 5877 5773 5775 5791 5657
reduceVector 5064 4880 4844 4828 4926
loopArrayBoxed 2627 2551 2569 2537 2546
reduceArrayBoxed 4809 4434 4496 4434 4365
loopVectorBoxed 7577 7450 7456 7463 7432
reduceVectorBoxed 5116 4903 5006 4957 5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.

This is a microbenchmark, and it depends on how the compiler optimizes you code. You have 3 loops composed here,
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiences
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it

Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but the Scala compiler is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.

The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
def fastMapOnto(f: Float => Float) = {
var i = 0
while (i < a.length) { a(i) = f(a(i)); i += 1 }
this
}
def fastMapWith(b: Array[Float])(f: (Float,Float) => Float) = {
val len = a.length min b.length
val c = new Array[Float](len)
var i = 0
while (i < len) { c(i) = f(a(i),b(i)); i += 1 }
c
}
def fastReduce(f: (Float,Float) => Float) = {
if (a.length==0) Float.NaN
else {
var r = a(0)
var i = 1
while (i < a.length) { r = f(r,a(i)); i += 1 }
r
}
}
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.

I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this. This can be tail call optimized, so performance should be OK.
def multiply_and_sum(l1:List[Int], l2:List[Int], sum:Int):Int = {
if (l1 != Nil && l2 != Nil) {
multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
}
else {
sum
}
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130

To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But, if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look a the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design. Cleaner design leads to performance gains in the overall operation of the application. Even if some low-level methods can be slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow, language. Maven is written in Java. A build in BuildR is twice as fast as Maven. This is due mostly to the design of BuildR which is lightweight compared with that of Maven.

Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) -> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.

Here is dbyrnes solution with arrays (assuming Arrays are to be used) and just iterating over the index:
def multiplyAndSum (l1: Array[Int], l2: Array[Int]) : Int =
{
def productSum (idx: Int, sum: Int) : Int =
if (idx < l1.length)
productSum (idx + 1, sum + (l1(idx) * l2(idx))) else
sum
if (l2.length == l1.length)
productSum (0, 0) else
error ("lengths don't fit " + l1.length + " != " + l2.length)
}
val first = (1 to 500).map (_ * 1.1) toArray
val second = (11 to 510).map (_ * 1.2) toArray
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))
That needs about 1/40 of the time of the list-approach. I don't have 2.8 installed, so you have to test #tailrec yourself. :)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio