I am new to Java and Apache Storm, and I want to know how I can make things go faster!
I set up a Storm cluster with two physical machines with 8 cores each. The cluster is working perfectly fine. I set up the following test topology in order to measure performance:
builder.setSpout("spout", new RandomNumberSpoutSingle(sizeOfArray), 10);
builder.setBolt("null", new NullBolt(), 4).allGrouping("spout");
RandomNumberSpoutSingle creates an ArrayList like so:
ArrayList<Integer> array = new ArrayList<Integer>();
I fill it with sizeOfArray integers. This array, combined with an ID, forms my tuple.
Now I measure how many tuples per second arrive at the bolt with allGrouping (I look at the Storm GUI's "transferred" value).
If I put sizeOfArray = 1024, about 173,000 tuples/s get pushed. Since one tuple should be about 4*1024 bytes, around 675 MB/s get moved.
Am I correct so far?
Now my question is: is Storm/Kryo capable of moving more? How can I tune this? Are there settings I ignored?
I want to serialize more tuples per second! If I use local shuffling, the values skyrocket because nothing has to be serialized, but I need the tuples on all workers.
Neither CPU, memory, nor network is fully occupied.
I think you got the math about right. I am not sure, though, whether the Java overhead for the non-primitive Integer type is accounted for in serialization, which would add some more bytes to the equation. I am also not sure if this is the best way of analyzing Storm performance, which is usually measured in tuples per second rather than in bandwidth.
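Just to make the arithmetic explicit (payload bytes only, ignoring any per-tuple and serialization overhead), a quick sketch:
// rough sanity check of the numbers from the question (payload only)
long tuplesPerSecond = 173_000L;
long bytesPerTuple   = 4L * 1024;                        // 1024 Integers, 4 payload bytes each
double mibPerSecond  = tuplesPerSecond * bytesPerTuple / (1024.0 * 1024.0);
System.out.printf("~%.0f MiB/s%n", mibPerSecond);        // ~676 MiB/s, matching the ~675 MB/s above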
Storm has built-in serialization for primitive types, strings, byte arrays, ArrayList, HashMap, and HashSet (source). When I program Java for maximum performance, I try to stick with primitive types as much as possible. Would it be feasible to use int[] instead of ArrayList<Integer>? I would expect to gain some performance from that, if it is possible in your setup.
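A minimal sketch of what that could look like in the spout's nextTuple (the variable names here are assumptions, not your actual code):
int[] payload = new int[sizeOfArray];
for (int i = 0; i < payload.length; i++) {
    payload[i] = random.nextInt();        // fill with random integers, as before
}
collector.emit(new Values(id, payload));  // primitive array instead of ArrayList<Integer>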
Considering the above types which Storm is able to serialize out of the box, I would most likely shy away from trying to improve serialization performance itself. I assume Kryo is pretty optimized and that it will be very hard to achieve anything faster here. I am also not sure whether serialization is the real bottleneck, or rather something in your topology setup (see below).
I would look at other tunables related to intra- and inter-worker communication. A good overview can be found here. In one topology for which performance is critical, I am using the following setup code to adjust these kinds of parameters. What works best in your case needs to be found out via testing.
Config conf = new Config(); // org.apache.storm.Config

int topology_executor_receive_buffer_size = 32768; // intra-worker messaging, default: 32768
int topology_transfer_buffer_size = 2048;          // inter-worker messaging, default: 1000
int topology_producer_batch_size = 10;             // intra-worker batch, default: 1
int topology_transfer_batch_size = 20;             // inter-worker batch, default: 1
int topology_batch_flush_interval_millis = 10;     // flush tuple creation ms, default: 1
double topology_stats_sample_rate = 0.001;         // calculate metrics every 1000 messages, default: 0.05

conf.put("topology.executor.receive.buffer.size", topology_executor_receive_buffer_size);
conf.put("topology.transfer.buffer.size", topology_transfer_buffer_size);
conf.put("topology.producer.batch.size", topology_producer_batch_size);
conf.put("topology.transfer.batch.size", topology_transfer_batch_size);
conf.put("topology.batch.flush.interval.millis", topology_batch_flush_interval_millis);
conf.put("topology.stats.sample.rate", topology_stats_sample_rate);
As you have noticed, performance greatly increases when Storm is able to use intra-worker processing, so I would always suggest making use of that if possible. Are you sure you need allGrouping? If not, I would suggest using shuffleGrouping, which will actually use local communication if Storm thinks it is appropriate (as long as load-aware messaging is not disabled via topology.disable.loadaware.messaging). I am not sure whether allGrouping will use local communication for those components which are on the same worker.
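Based on the topology from your question, and only if the duplication from allGrouping is really not needed, the change would simply be:
builder.setBolt("null", new NullBolt(), 4).shuffleGrouping("spout");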
Another thing I wonder about is the configuration of your topology: you have 10 spouts and 4 consumer bolts. Unless the bolts consume incoming tuples much faster than they are created, it might be advisable to use an equal number for both components. From how you describe your process, it seems you use acking and failing, because you wrote that you assign an ID to your tuples. If guaranteed processing of individual tuples is not an absolute requirement, performance can probably be gained by switching to unanchored tuples. Acking and failing produce some overhead, so I would expect a higher tuple throughput with them turned off.
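As a sketch of the difference on the spout side (the variable names are assumptions):
// tracked emit: the message ID enables acking/failing and its bookkeeping overhead
collector.emit(new Values(id, array), id);

// untracked emit: no message ID, so Storm does not track the tuple tree at all
collector.emit(new Values(id, array));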
And lastly, you can also experiment with the maximum number of pending tuples (configured via the .setMaxSpoutPending method of the spouts). I am not sure what Storm uses as a default; however, in my experience, setting it a little higher than what the bolts downstream can ingest delivers higher throughput. Look at the capacity metric and the number of transferred tuples in the Storm UI.
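For example (the value 5000 is purely illustrative and needs to be found by testing):
// per component:
builder.setSpout("spout", new RandomNumberSpoutSingle(sizeOfArray), 10)
       .setMaxSpoutPending(5000);

// or topology-wide:
conf.setMaxSpoutPending(5000);
// note: this only has an effect when tuples are emitted with message IDs (acking enabled)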
Related
I am currently working on a Storm Crawler based project. We have a fixed and limited amount of bandwidth for fetching pages from the web. We have 8 workers with a large value for the parallelism hint on the different bolts in the topology (i.e. 50). So lots of threads are created for fetching the pages. Is there any relation between the increase in fetch_error and the increased parallelism_hint in the project? How can I determine a good value for parallelism_hint in Storm Crawler?
The parallelism hint is not something that should be applied to all bolts indiscriminately.
Ideally, you need one instance of FetcherBolt per worker, so in your case 8. As you've probably read in the WIKI or seen in the conf, the FetcherBolt handles internal threads for fetching. This is determined by the config fetcher.threads.number which is set to 50 in the archetypes' configurations (assuming this is what you used as a starting point).
Using too many FetcherBolt instances is counterproductive. It is better to change the value of fetcher.threads.number instead. If you have 50 Fetcher instances with a default number of threads of 50, that would give you 2500 fetching threads which might be too much for your available bandwidth.
As I mentioned before, you want 1 FetcherBolt per worker; the number of internal fetching threads per bolt depends on your bandwidth. There is no hard rule for this, it depends on your situation.
One constant I have observed, however, is the ratio of parsing bolts to Fetcher bolts; usually, 4 parsers per fetcher works fine. Run Storm in deployed mode and check the capacity value for the parser bolts in the UI. If the value is 1 or above, try using more instances and see if it affects the capacity.
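As a rough sketch of what that could look like when building the topology (class names, stream wiring, and numbers are assumptions based on the storm-crawler archetype, not your actual project):
// one FetcherBolt instance per worker: 8 workers -> parallelism hint 8
builder.setBolt("fetch", new FetcherBolt(), 8)
       .fieldsGrouping("partitioner", new Fields("key"));

// roughly 4 parser instances per fetcher instance
builder.setBolt("parse", new JSoupParserBolt(), 32)
       .localOrShuffleGrouping("fetch");

// fetching threads *inside each* FetcherBolt instance; total = 8 x 50 = 400 threads
conf.put("fetcher.threads.number", 50);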
In any case, not all bolts need the same level of parallelism.
I replaced the CIFAR-10 preprocessing pipeline in the project with the Dataset API approach, and it resulted in a performance decrease of about 10-20%.
Preprocessing is rather standard:
- read image from disk
- make random crop and flip
- shuffle, batch
- feed to the model
Overall I see that batch processing is now 15% faster, but every once in a while (or, more precisely, whenever I reinitialize the dataframe or expect reshuffling) the batch is blocked for a long time (up to 30 seconds), which adds up to slower epoch-per-epoch processing.
This behaviour seems to have something to do with internal hashing. If I reduce N in ds.shuffle(buffer_size=N), the delays are shorter but proportionally more frequent. Removing shuffle altogether results in delays as if buffer_size were set to the dataset size.
Can somebody explain the internal logic of the Dataset API when it comes to reading/caching? Is there any reason at all to expect the Dataset API to work faster than manually created queues?
I am using TF 1.3.
If you implement the same pipeline using the tf.data.Dataset API and using queues, the performance of the Dataset version should be better than the queue-based version.
However, there are a few performance best practices to observe in order to get the best performance. We have collected these in a performance guide for tf.data. Here are the main issues:
Prefetching is important: the queue-based pipelines prefetch by default and the Dataset pipelines do not. Adding dataset.prefetch(1) to the end of your pipeline will give you most of the benefit of prefetching, but you might need to tune this further.
The shuffle operator has a delay at the beginning, while it fills its buffer. The queue-based pipelines shuffle a concatenation of all epochs, which means that the buffer is only filled once. In a Dataset pipeline, this would be equivalent to dataset.repeat(NUM_EPOCHS).shuffle(N). By contrast, you can also write dataset.shuffle(N).repeat(NUM_EPOCHS), but this needs to restart the shuffling in each epoch. The latter approach is slightly preferable (and truer to the definition of SGD, for example), but the difference might not be noticeable if your dataset is large.
We are adding a fused version of shuffle-and-repeat that doesn't incur the delay, and a nightly build of TensorFlow will include the custom tf.contrib.data.shuffle_and_repeat() transformation that is equivalent to dataset.shuffle(N).repeat(NUM_EPOCHS) but doesn't suffer the delay at the start of each epoch.
Having said this, if you have a pipeline that is significantly slower when using tf.data than the queues, please file a GitHub issue with the details, and we'll take a look!
The suggested things didn't solve my problem back in the day, but I would like to add a couple of recommendations for those who don't want to learn about queues and still want to get the most out of the TF data pipeline:
Convert your input data into TFRecord (as cumbersome as it might be)
Use the recommended input pipeline format:
files = tf.data.Dataset.list_files(data_dir)
ds = tf.data.TFRecordDataset(files, num_parallel_reads=32)
ds = (ds.shuffle(10000)
        .repeat(EPOCHS)
        .map(parser_fn, num_parallel_calls=64)
        .batch(batch_size))
ds = ds.prefetch(2)
Where you have to pay attention to 3 main components:
num_parallel_reads=32 to parallelize disk IO operations
num_parallel_calls=64 to parallelize calls to parser function
prefetch(2) to overlap the input pipeline with model execution
I have started using Apache Storm recently. Right now I am focusing on some performance testing and tuning for one of my applications (it pulls data from a NoSQL database, formats it, and publishes it to a JMS queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology in terms of changing the number of bolts, MAX_SPOUT_PENDING, etc., and to throttle data flow within the topology using a ticking approach.
I wanted to know what happens when we define more parallelism than the number of cores we have. In my case I have a single-node, single-worker topology and the machine has 32 cores, but the total number of executors (for all the spouts and bolts) is 60. So my questions are:
Does this high number really help with processing requests, or does it actually degrade performance, since I believe there will be more context switching between bolt tasks to utilize the cores?
If I define 20 (just a random selection) executors for a bolt and my code flow never needs to utilize that bolt, will this impact performance? How does Storm handle this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor utilizes a core completely, having more executors cannot give you any throughput improvements. If there is any impact, it might be negative (also with regard to contention on the internally used queues which all executors need to read from and write to for tuple transfer).
If your load is "small" and does not fully utilize your CPUs, it won't matter much either -- you would not gain or lose anything, as your cores are not fully utilized and you have some headroom left anyway.
Furthermore, consider that Storm spawns some more threads within each worker. Thus, if your executors fully utilize your hardware, those threads will also be impacted.
Overall, you should not run your topologies to utilize the cores completely anyway, but leave some headroom for small "spikes" etc. In operation, maybe 80% CPU utilization might be a good value. As a rule of thumb, one executor per core should be ok.
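Purely as an illustration (the component classes and numbers are made up), on a 32-core single-worker node that could look like:
Config conf = new Config();
conf.setNumWorkers(1);                            // single node, single worker

// keep the total executor count in the ballpark of the core count,
// leaving some headroom for Storm's internal threads
builder.setSpout("spout", new MySpout(), 6);
builder.setBolt("bolt", new MyBolt(), 20).shuffleGrouping("spout");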
Does it make sense to use smaller blocks for jobs that perform more intense tasks?
For example, in the Mapper I'm calculating the distance between two signals, which may take some time depending on the signal length, but on the other hand my dataset size is currently not that big. That tempts me to specify a smaller block size (like 16MB) and to increase the number of nodes in the cluster. Does that make sense?
How should I proceed? And if it is OK to use smaller blocks, how do I do it? I haven't done it before...
Whether that makes sense to do for your job can only really be known by testing the performance. There is an overhead cost associated with launching additional JVM instances, and it's a question of whether the additional parallelization is given enough load to offset that cost and still make it a win.
You can change this setting for a particular job rather than across the entire cluster. You'll have to decide what's a normal use case when deciding whether to make this a global change or not. If you want to make this change globally, you'll put the property in the XML config file or Cloudera Manager. If you only want to do it for particular jobs, put it in the job's configuration.
Either way, the smaller the value in mapreduce.input.fileinputformat.split.maxsize, the more mappers you'll get (it defaults to Integer.MAX_VALUE). That will work for any InputFormat that uses block size to determine its splits (most do, since most extend FileInputFormat).
So to max out your utilization, you might do something like this:
long bytesPerReducer = inputSizeInBytes / numberOfReduceTasksYouWant;
long splitSize = (CLUSTER_BLOCK_SIZE > bytesPerReducer) ? CLUSTER_BLOCK_SIZE : bytesPerReducer;
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);
Note that you can also increase the value of mapreduce.input.fileinputformat.split.minsize to reduce the number of mappers (it defaults to 1).
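For instance (a sketch; the 256 MB value is arbitrary):
// a larger minimum split size means fewer, larger map tasks
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize",
                               256L * 1024 * 1024);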
I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (the data structure maps over trivially, but this still appears too RAM-centric for me to feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
Ultimately, it all boils down to how you define the efficiency needed on the part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
Given the current memory problem, some form of persistent storage seems a reasonable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (the data structure maps over trivially, but this still appears too RAM-centric for me to feel confident that the memory-hog problem will actually go away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each hold a part of the overall data. This makes sense only if keeping all data in memory is needed due to low response time requirements. Yet, given that your jobs take tens of seconds to complete, response times with respect to the workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
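To illustrate how trivially the map-of-queues structure maps onto Redis, a sketch using the Java Jedis client (the key naming is my assumption, and an equivalent client exists for Python):
Jedis redis = new Jedis("controller-host", 6379);   // redis.clients.jedis.Jedis

// enqueue a job string under its category (one Redis list per map key)
redis.rpush("jobs:" + category, jobString);

// hand out the next job of a category; returns null if that queue is empty
String nextJob = redis.lpop("jobs:" + category);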
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
Requirements:
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/