I'm using dask/distributed to submit 100+ evaluations of a function to a multi-node cluster. Each evaluation is quite costly, about 90 sec of CPU time. I've noticed, though, that there seems to be a memory leak: all workers grow in size over time, even though the function I'm evaluating is not pure (I call client.map with pure=False).
Here's sample code to reproduce this behavior:
import numpy as np
from dask.distributed import Client

class Foo:
    def __init__(self):
        self.a = np.random.rand(2000, 2000)  # dummy data, not really used

    @staticmethod
    def myfun1(k):
        return np.random.rand(10000 + k, 100)

    def myfun2(self, k):
        return np.random.rand(10000 + k, 100)

client = Client('XXX-YYY:8786')
f = Foo()
tasks = client.map(f.myfun2, range(100), pure=False)
results = client.gather(tasks)
tasks = []
If client.map() is called to execute f.myfun1() (which is just a static method), the workers don't grow in size. However, if one calls f.myfun2(), the workers' memory grows considerably (e.g. 50 MB -> 400 MB) after just one client.map() call as above. Also, client.close() does nothing to reduce the workers' size.
Is this a memory leak, or am I not using dask.distributed correctly? I definitely don't care about the results of my calculations being available afterwards or shared on the cluster. FWIW, tested with distributed v1.19.1 and Python 3.5.4.
Nice example.
Your myfun2 method is attached to your f = Foo() object, which carries around a decently large attribute (f.a). This f.myfun2 method is therefore quite expensive to move around, and you're creating 100 of them. If you can, it's best to avoid using methods of large objects in a distributed setting. Instead, consider using plain functions.
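For illustration, a minimal sketch of the function-based variant (make_sample is a hypothetical stand-in for your real computation; the scheduler address is the same placeholder as above):

import numpy as np
from dask.distributed import Client

def make_sample(k):
    # a plain module-level function: only the function reference and the small
    # integer k are shipped to the workers, not a large Foo instance
    return np.random.rand(10000 + k, 100)

client = Client('XXX-YYY:8786')
tasks = client.map(make_sample, range(100), pure=False)
results = client.gather(tasks)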
I am new to Java and Apache Storm and I want to know how I can make things go faster!
I set up a Storm cluster with 2 physical machines with 8 cores each. The cluster is working perfectly fine. I set up the following test topology in order to measure performance:
builder.setSpout("spout", new RandomNumberSpoutSingle(sizeOfArray), 10);
builder.setBolt("null", new NullBolt(), 4).allGrouping("spout");
RandomNumberSpoutSingle creates an Array like so:
ArrayList<Integer> array = new ArrayList<Integer>();
I fill it with sizeOfArray integers. This array, combined with an ID, builds my tuple.
Now I measure how many tuples per second arrive at the bolt with allGrouping (I look at the Storm GUI's "transferred" value).
If I put sizeOfArray = 1024, about 173000 tuples/s get pushed. Since 1 tuple should be about 4*1024 bytes, around 675 MB/second get moved.
Am I correct so far?
Now my question is: is Storm/Kryo capable of moving more? How can I tune this? Are there settings I have ignored?
I want to serialize more tuples per second! If I use local shuffling, the values skyrocket because nothing has to be serialized, but I need the tuples on all workers.
Neither CPU, memory nor network is fully occupied.
I think you got the math about right. I am not sure, though, whether the Java overhead for the non-primitive Integer type is accounted for in serialization, which would add some more bytes to the equation. I am also not sure whether this is the best way of analyzing Storm performance, which is usually measured in tuples per second rather than in bandwidth.
Storm has built-in serialization for primitive types, strings, byte arrays, ArrayList, HashMap, and HashSet (source). When I program Java for maximum performance I try to stick with primitive types as much as possible. Would it be feasible to use int[] instead of ArrayList<Integer>? I would expect to gain some performance from that, if it is possible in your setup.
Considering the above types which Storm is able to serialize out of the box, I would most likely shy away from trying to improve serialization performance. I assume Kryo is pretty optimized and that it will be very hard to achieve anything faster here. I am also not sure whether serialization is the real bottleneck, or rather something in your topology setup (see below).
I would look at other tunables which are related to intra- and inter-worker communication. A good overview can be found here. In one topology for which performance is critical, I am using the following setup code to adjust these kinds of parameters. What works best in your case needs to be found out via testing.
int topology_executor_receive_buffer_size = 32768; // intra-worker messaging, default: 32768
int topology_transfer_buffer_size = 2048; // inter-worker messaging, default: 1000
int topology_producer_batch_size = 10; // intra-worker batch, default: 1
int topology_transfer_batch_size = 20; // inter-worker batch, default: 1
int topology_batch_flush_interval_millis = 10; // flush tuple creation ms, default: 1
double topology_stats_sample_rate = 0.001; // calculate metrics every 1000 messages, default: 0.05
conf.put("topology.executor.receive.buffer.size", topology_executor_receive_buffer_size);
conf.put("topology.transfer.buffer.size", topology_transfer_buffer_size);
conf.put("topology.producer.batch.size", topology_producer_batch_size);
conf.put("topology.transfer.batch.size", topology_transfer_batch_size);
conf.put("topology.batch.flush.interval.millis", topology_batch_flush_interval_millis);
conf.put("topology.stats.sample.rate", topology_stats_sample_rate);
As you have noticed, performance greatly increases when Storm is able to use intra-worker processing, so I would always suggest making use of that if possible. Are you sure you need allGrouping? If not, I would suggest using shuffleGrouping, which will actually use local communication if Storm thinks it is appropriate, unless topology.disable.loadaware.messaging is set to false. I am not sure whether allGrouping will use local communication for those components which are on the same worker.
Another thing I wonder about is the configuration of your topology: you have 10 spouts and 4 consumer bolts. Unless the bolts consume incoming tuples much faster than they are created, it might be advisable to use an equal number for both components. From how you describe your process, it seems you use acking and failing, because you wrote that you assign an ID to your tuples. If guaranteed processing of individual tuples is not an absolute requirement, performance can probably be gained by switching to unanchored tuples. Acking and failing does produce some overhead, so I would assume a higher tuple throughput with it turned off.
And lastly, you can also experiment with the value for the maximum number of pending tuples (configured via the .setMaxSpoutPending method of the spouts). I am not sure what Storm uses as the default; however, in my experience, setting it a little higher than what the bolts can ingest downstream delivers higher throughput. Look at the capacity metric and the number of transferred tuples in the Storm UI.
I trained a doc2vec model using Python gensim on a corpus of 40,000,000 documents. This model is used for inferring docvecs on millions of documents every day. To ensure stability, I set alpha to a small value and use a large number of steps instead of setting a constant random seed:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load('doc2vec_dm.model')
doc_demo = ['a','b']
# model.random.seed(0)
model.infer_vector(doc_demo, alpha=0.1, min_alpha=0.0001, steps=100)
doc2vec.infer_vector() accepts only one document at a time, and it takes almost 0.1 second to infer each docvec. Is there any API that can handle a batch of documents in each inference call?
Currently, there's no gensim API which does large batches of inference at once, which could help by using multiple threads. It is a wishlist item, among other improvements: https://github.com/RaRe-Technologies/gensim/issues/515
You might get some speedup, up to the number of cores in your CPU, by spreading your own inference jobs over multiple threads.
To eliminate all multithreaded contention due to the Python GIL, you could spread your inference over separate Python processes. If each process loads the model using some of the tricks described in another answer (see below), the OS will help them share the large model backing arrays (only paying the cost in RAM once), while each of them can independently run one fully unblocked thread of inference.
(Specifically, Doc2Vec.load() can also use the mmap='r' mode to load an existing on-disk model with memory-mapping of the backing files. Inference alone, with no most_similar()-like operations, will only read the shared raw backing arrays, so no fussing with the _norm variants should be necessary if you're launching single-purpose processes that just do inference, save their results, and exit.)
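As a hedged sketch of that process-based setup (the pool size and the way documents are fed in are made up for illustration; it reuses the doc2vec_dm.model file and the infer_vector() parameters from the question):

from multiprocessing import Pool
from gensim.models.doc2vec import Doc2Vec

MODEL_PATH = 'doc2vec_dm.model'
_model = None

def _init_worker():
    # each worker process loads the model once; mmap='r' lets the OS share the
    # large backing arrays between processes instead of copying them
    global _model
    _model = Doc2Vec.load(MODEL_PATH, mmap='r')

def _infer(tokens):
    return _model.infer_vector(tokens, alpha=0.1, min_alpha=0.0001, steps=100)

if __name__ == '__main__':
    docs = [['a', 'b'], ['c', 'd']]  # your pre-tokenized documents
    with Pool(processes=4, initializer=_init_worker) as pool:
        vectors = pool.map(_infer, docs)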
I replaced the CIFAR-10 preprocessing pipeline in the project with the Dataset API approach, and it resulted in a performance decrease of about 10-20%.
Preprocessing is rather standard:
- read image from disk
- make random crop and flip
- shuffle, batch
- feed to the model
Overall I see that batch processing is now 15% faster, but every once in a while (or, more precisely, whenever I reinitialize the dataset or expect reshuffling) the batch is blocked for a long time (up to 30 sec), which adds up to slower epoch-over-epoch processing.
This behaviour seems to have something to do with internal hashing. If I reduce N in ds.shuffle(buffer_size=N), the delays are shorter but proportionally more frequent. Removing shuffle altogether results in delays as if buffer_size were set to the dataset size.
Can somebody explain the internal logic of the Dataset API when it comes to reading/caching? Is there any reason at all to expect the Dataset API to work faster than manually created queues?
I am using TF 1.3.
If you implement the same pipeline using the tf.data.Dataset API and using queues, the performance of the Dataset version should be better than the queue-based version.
However, there are a few performance best practices to observe in order to get the best performance. We have collected these in a performance guide for tf.data. Here are the main issues:
Prefetching is important: the queue-based pipelines prefetch by default and the Dataset pipelines do not. Adding dataset.prefetch(1) to the end of your pipeline will give you most of the benefit of prefetching, but you might need to tune this further.
The shuffle operator has a delay at the beginning, while it fills its buffer. The queue-based pipelines shuffle a concatenation of all epochs, which means that the buffer is only filled once. In a Dataset pipeline, this would be equivalent to dataset.repeat(NUM_EPOCHS).shuffle(N). By contrast, you can also write dataset.shuffle(N).repeat(NUM_EPOCHS), but this needs to restart the shuffling in each epoch. The latter approach is slightly preferable (and truer to the definition of SGD, for example), but the difference might not be noticeable if your dataset is large.
We are adding a fused version of shuffle-and-repeat that doesn't incur the delay, and a nightly build of TensorFlow will include the custom tf.contrib.data.shuffle_and_repeat() transformation that is equivalent to dataset.shuffle(N).repeat(NUM_EPOCHS) but doesn't suffer the delay at the start of each epoch.
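As a hedged sketch of those last two points (it assumes a TF build that already ships tf.contrib.data.shuffle_and_repeat; dataset, N and NUM_EPOCHS stand for your own pipeline, buffer size and epoch count):

import tensorflow as tf

# fused shuffle-and-repeat: the shuffle buffer is filled only once, not per epoch
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=N, count=NUM_EPOCHS))
# explicit prefetching at the end of the pipeline, as recommended above
dataset = dataset.prefetch(1)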
Having said this, if you have a pipeline that is significantly slower when using tf.data than the queues, please file a GitHub issue with the details, and we'll take a look!
The suggested things didn't solve my problem back in the day, but I would like to add a couple of recommendations for those who don't want to learn about queues and still want to get the most out of the TF data pipeline:
Convert your input data into TFRecord (as cumbersome as it might be)
Use recommended input pipeline format
files = tf.data.Dataset.list_files(data_dir)
ds = tf.data.TFRecordDataset(files, num_parallel_reads=32)
ds = (ds.shuffle(10000)
      .repeat(EPOCHS)
      .map(parser_fn, num_parallel_calls=64)
      .batch(batch_size)
      )
ds = ds.prefetch(2)
Where you have to pay attention to 3 main components:
num_parallel_reads=32 to parallelize disk IO operations
num_parallel_calls=64 to parallelize calls to the parser function
prefetch(2) to overlap preprocessing with downstream consumption
Spark now offers predefined functions that can be used on DataFrames, and it seems they are highly optimized. My original question was going to be about which is faster, but I did some testing myself and found the Spark functions to be about 10 times faster, at least in one instance. Does anyone know why this is so, and when a UDF would be faster (only for instances where an identical Spark function exists)?
Here is my testing code (ran on Databricks community ed):
# UDF vs Spark function
from faker import Factory
from pyspark.sql.functions import lit, concat, udf
fake = Factory.create()
fake.seed(4321)
# Each entry consists of last_name, first_name, ssn, job, and age (at least 1)
from pyspark.sql import Row
def fake_entry():
    name = fake.name().split()
    return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)
# Create a helper function to call a function repeatedly
def repeat(times, func, *args, **kwargs):
    for _ in xrange(times):
        yield func(*args, **kwargs)
data = list(repeat(500000, fake_entry))
print len(data)
data[0]
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))
dataDF.cache()
UDF function:
concat_s = udf(lambda s: s+ 's')
udfData = dataDF.select(concat_s(dataDF.first_name).alias('name'))
udfData.count()
Spark Function:
spfData = dataDF.select(concat(dataDF.first_name, lit('s')).alias('name'))
spfData.count()
Ran both multiple times, the udf usually took about 1.1 - 1.4 s, and the Spark concat function always took under 0.15 s.
when would a udf be faster
If you ask about Python UDFs, the answer is probably never*. Since SQL functions are relatively simple and are not designed for complex tasks, it is pretty much impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM.
Does anyone know why this is so
The main reasons are already enumerated above and can be reduced to a simple fact: a Spark DataFrame is natively a JVM structure, and the standard access methods are implemented by simple calls to the Java API. UDFs, on the other hand, are implemented in Python and require moving data back and forth.
While PySpark in general requires data movement between the JVM and Python, in the case of the low-level RDD API it typically doesn't require expensive serde activity. Spark SQL adds the additional cost of serialization and deserialization, as well as the cost of moving data from and to the unsafe representation on the JVM. The latter is specific to all UDFs (Python, Scala and Java), but the former is specific to non-native languages.
Unlike UDFs, Spark SQL functions operate directly on the JVM and typically are well integrated with both Catalyst and Tungsten. This means they can be optimized in the execution plan and most of the time can benefit from codegen and other Tungsten optimizations. Moreover, they can operate on data in its "native" representation.
So in a sense the problem here is that a Python UDF has to bring the data to the code, while SQL expressions go the other way around.
* According to rough estimates, a PySpark window UDF can beat a Scala window function.
Years later, with more Spark knowledge, I took a second look at the question and realized what #alfredox really wanted to ask. So I revised the answer again and divided it into two parts:
To answer Why native DF function (native Spark-SQL function) is faster:
Basically, why is a native Spark function ALWAYS faster than a Spark UDF, regardless of whether your UDF is implemented in Python or Scala?
Firstly, we need to understand what Tungsten is, which was first introduced in Spark 1.4.
It is a backend that focuses on:
Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,
Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,
Whole-Stage Code Generation (aka CodeGen).
One of the biggest Spark performance killers is GC. GC pauses every thread in the JVM until it finishes. This is exactly why off-heap memory management was introduced.
When executing Spark-SQL native functions, the data stays in the Tungsten backend. However, in the Spark UDF scenario, the data is moved out of Tungsten into the JVM (Scala scenario), or into the JVM and a Python process (Python scenario), to do the actual processing, and then moved back into Tungsten. As a result of that, there is inevitably an overhead / penalty for:
Deserializing the input from Tungsten.
Serializing the output back into Tungsten.
Even when using Scala, the first-class citizen in Spark, this increases the memory footprint within the JVM, which likely involves more GC within the JVM.
This issue is exactly what Tungsten's "Off-Heap Memory Management" feature tries to address.
To answer whether Python is necessarily slower than Scala:
On 30 October 2017, Spark introduced vectorized UDFs for PySpark:
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
The reason that the Python UDF is slow is probably that the PySpark UDF is not implemented in the most optimized way.
According to the paragraph from the link:
Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead.
However, the newly vectorized UDFs seem to improve performance a lot:
ranging from 3x to over 100x.
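As a hedged sketch of what a vectorized version of the question's concat_s UDF could look like (assuming Spark >= 2.3 with PyArrow installed; dataDF and first_name come from the question above):

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('string', PandasUDFType.SCALAR)
def concat_s_vec(s):
    # s is a pandas Series holding a whole Arrow batch, not a single row
    return s + 's'

vecData = dataDF.select(concat_s_vec(dataDF.first_name).alias('name'))
vecData.count()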
Use the higher-level standard column-based functions with Dataset operators whenever possible before reverting to your own custom UDF functions, since UDFs are a black box for Spark, and so it does not even try to optimize them.
What actually happens behind the scenes is that Catalyst can't process and optimize UDFs at all, so it treats them as a black box, which results in losing many optimizations like predicate pushdown, constant folding and many others.
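A quick way to see this black-box effect (a hedged illustration reusing dataDF, concat_s and the native concat expression from the question): compare the physical plans.

# native function: the expression is visible to Catalyst and folded into the plan
dataDF.select(concat(dataDF.first_name, lit('s')).alias('name')).explain()
# Python UDF: shows up as an opaque BatchEvalPython / PythonUDF step
dataDF.select(concat_s(dataDF.first_name).alias('name')).explain()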
I have code similar to what follows:
val fileContent = sc.textFile("file:///myfile")
val dataset = fileContent.map(row => {
  val explodedRow = row.split(",").map(s => s.toDouble)
  new LabeledPoint(explodedRow(13), Vectors.dense(
    Array(explodedRow(10), explodedRow(11), explodedRow(12))
  ))
})
val algo = new LassoWithSGD().setIntercept(true)
val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)
val model = algo.run(dataset)
I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).
I would have expected this to get increasingly faster as I increase the spark.master=local[*] setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of what setting I use (but I notice from the Spark UI that my executor has a maximum number of Active Tasks at any given time equal to the expected amount, i.e. ~8 for local[8], ~40 for local[40], etc. -- so it seems that the parallelization works).
By default, the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success -- in fact it slows the Lasso algorithm down even more...
Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?
Is my expectation of the scaling process incorrect?
Well, kind of. I hope you don't mind if I use a little bit of Python to prove my point.
Let's be generous and say a few million rows is actually ten million. With 40 000 000 values (intercept + 3 features + label per row) it gives around 380 MB of data (a Java Double is a double-precision 64-bit IEEE 754 floating point). Let's create some dummy data:
import numpy as np
n = 10 * 1000**2
X = np.random.uniform(size=(n, 4)) # Features
y = np.random.uniform(size=(n, 1)) # Labels
theta = np.random.uniform(size=(4, 1)) # Estimated parameters
Each step of gradient descent (since the default miniBatchFraction for LassoWithSGD is 1.0, it is not really stochastic), ignoring regularization, requires an operation like this:
def step(X, y, theta):
    return ((X.dot(theta) - y) * X).sum(0)
So let's see how long it takes locally on our data:
%timeit -n 15 step(X, y, theta)
## 15 loops, best of 3: 743 ms per loop
Less than a second per step, without any additional optimizations. Intuitively, this is pretty fast and it won't be easy to match. Just for fun, let's see how long it takes to get the closed-form solution for data like this:
%timeit -n 15 np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
## 15 loops, best of 3: 1.33 s per loop
Now let's go back to Spark. Residuals for a single point can be computed in parallel, so this is a part which scales linearly when you increase the number of partitions which are processed in parallel.
The problem is that you have to aggregate the data locally, serialize it, transfer it to the driver, deserialize it and reduce locally to get a final result after each step. Then you have to compute the new theta, serialize it, send it back, and so on.
All of that can be improved by proper use of mini-batches and some further optimizations, but at the end of the day you are limited by the latency of the whole system. It is worth noting that when you increase parallelism on the worker side you also increase the amount of work that has to be performed sequentially on the driver, and the other way round. One way or another, Amdahl's law will bite you.
Also all of the above ignores actual implementation.
Now let's perform another experiment. First some dummy data:
nCores = 8 # Number of cores on local machine I use for tests
rdd = sc.parallelize([], nCores)
and benchmark:
%timeit -n 40 rdd.mapPartitions(lambda x: x).count()
## 40 loops, best of 3: 82.3 ms per loop
This means that with 8 cores, without any real processing or network traffic, we get to the point where we cannot do much better by increasing parallelism in Spark (743 ms / 8 = 92.875 ms per partition, assuming linear scalability of the parallelized part).
Just to summarize above:
if the data can easily be processed locally with a closed-form solution, using gradient descent is just a waste of time. If you want to increase parallelism / reduce latency, you can use good linear algebra libraries (see the sketch after this list)
Spark is designed to handle large amounts of data, not to reduce latency. If your data fits in the memory of a few-years-old smartphone, it is a good sign that it is not the right tool
if computations are cheap, then constant costs become a limiting factor
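A minimal sketch of that closed-form route with a standard linear algebra routine (reusing the dummy X and y arrays created earlier in this answer; np.linalg.lstsq is just one reasonable choice):

import numpy as np

# least-squares solution of X * theta ~= y, computed locally in one call;
# more numerically stable than forming the normal equations explicitly
theta_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)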
Side notes:
a relatively large number of cores per machine is, generally speaking, not the best choice unless you can match it with IO throughput