Using data structures and algorithms, other than queue and FIFO, for handling and prioritizing requests - algorithm

Using queue and first-in-first-out algorithm seems like a standard way of dealing with requests to servers/databases/services. But could any other data structures and algorithms be used for dealing with large quantities of requests?

There is an article at http://queue.acm.org/detail.cfm?id=2991130 "Scaling Synchronization in Multicore Programs" which goes into the detail of reducing the locking costs when multiple cores are trying to update the same FIFO queue.
If different users or different requests have different priorities then a FIFO queue won't do, and you need something that takes account of the priority. I have often seen ordered maps used for this, such as TreeMap and std::map or std::multimap. It's not the use case these are designed for, but people know how to use them, and they are typically highly optimised, so they can be faster than most people's attempt at implementing a more specialised data structure.

Related

What is the difference and how to choose between distributed queue and distributed computing platform?

there are many files need to process with two computers real-timely,I want to distribute them to the two computers and these tasks need to be completed as soon as possibile(means real-time processing),I am thinking about the below plan:
(1) distributed queue like Gearman
(2)distributed computing platform like hadoop/spark/storm/s4 and so on
I have two questions
(1)what is the advantage and disadvantage between (1) and (2)?
(2) How to choose in (2),hadoop?spark?storm?s4?or other?
thanks!
Maybe I have not described the question clearly. In most case,there are 1000-3000 files with the same format , these files are independent,you do not need to care their order,the size of one file maybe tens to hundreds of KB and in the future, the number of files and size of single file will rise. I have wrote a program , it can process the file and pick up the data and then store the data in mongodb. Now there are only two computers, I just want a solution that can process these files with the program quickly(as soon as possibile) and is easy to extend and maintain
distributed queue is easy to use in my case bur maybe hard to extend and maintain , hadoop/spark is to "big" in the two computers but easy to extend and maintain, which is better, i am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other or you need some form of aggregation? i.e: do you need some pieces of data to go together? Say, all transactions from a single user account.
Is you processing CPU bound? Memory bound? FileSystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then run batch computations with Spark (should be generally better than plain hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how you choose?" is "by carefully looking at the constraints of your particular problem, estimate the load demands to your system and picking the solution that better matches those". All of these solutions and frameworks dominate the others, that's why they are all alive and kicking. The choice is all in the tradeoffs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book http://www.manning.com/marz/ which says a lot about lambda archtecture.
The two ways you mentioned serves completly different purposes and are connected to the definition of word "task". For example, Spark will take a whole job you got for him and divide it into "tasks", but the outcome of one task is useless for you, you still need to wait for whole job to finish. You can create small jobs working on the same dataset and use spark's caching to speed it up. But then you won't get much advantage from distribution (if they have to be run one after another).
Are the files big? Are there connected somehow to each other? If yes, I'd go with Spark. If no, distributed queue.

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 234 – 237 bytes of storage.
Ultimately, it all boils down on how you define efficiency needed on part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory hog requirement, some form of persistent storage seems a reaonsable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in-memory is needed due to low response times requirements. Yet, given the nature of your jobs, taking tens of seconds to complete, response times, respective to workers, hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
*Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requestes/per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more more data per job?)
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

What distributed message queues support millions of queues?

I'm looking for a distributed message queue that will support millions of queues, with each queue handling tens of messages per second.
The messages will be small (tens of bytes), and I don't expect the queues to get very long--on the order of tens of messages per queue at maximum, but when the system is humming along, the queues should stay fairly empty.
I'm not sure how many nodes to expect in the cluster--probably depends on the specific solution, but if I had to guess, I would say ten nodes. I would prefer that queues were relatively resilient to individual node failures within the cluster, but a few lost messages here and there won't make me lose sleep.
Does such a message queue exist? Seems like most of the field is optimized toward handling hundreds of queues with high throughput. But what is SQS built on? Surely not magic.
Update:
By request, it may indeed help to shed light on my problem domain. (I'd left details out before so as not to muddy the waters.) I'm experimenting with distributed cellular automata, with an initial target of a million cells in simulation. In some CA models, it's useful to add an event model, so that a cell can send events to its neighbors. Hence, a million queues, each with one consumer and 8 or so producers.
Costs are a concern for now, as I'm funding the experiments myself. (Thus Amazon's SQS is probably out of reach.)
From your description, it looks like OMG's Data Distribution Service could be a good fit. It is related to message queueing technologies, but I would rather call it a distributed data management infrastructure. It is completely distributed and supports advanced features that give you a lot of control over how the data is distributed, by means of a rich set of Quality of Service settings.
Not knowing much about your problem, I could guess what an approach might be. DDS is about distributing the state of strongly-typed data-items, as structures with typed attributes. You could create a data-type describing the state of an automaton. One of its attributes could be an ID uniquely identifying the automaton in the system. If possible, that would be assigned according to a scheme such that every automaton knows what the ID's of its neighbors are (if they are present). Each automaton would publish its state as needed, resulting in a distributed data-space containing the current state of all automatons. DDS supports so-called partitioning of that data-space. If you took advantage of that, then each of the nodes in your machine would be responsible for a well-defined subset of all automatons. Communication over the wire would only happen for those automatons neighboring a different partition. Since automatons know the ID's of their neighbors, they would be able to query the data-space for the states of the automatons it's interested in.
It is a bit hard to explain without a white board, but the end-result would be a single instance (which are a sort of very light-weight message queues) for most automatons, and two or three instances for those automatons at the border of a partition. If you had ten nodes and one million automata, then each node would have to be able to hold administration for approximately hundred thousand automata. I have seen systems being built with DDS of that scale, and larger, with tens of updates per second for each instance. The nice thing is that this technology scales well with the number of nodes, so you could bring down the resource load per node by adding more nodes.
If this is a research project, then you might even be able to use a commercial product without charge. Just google on dds research license.

How to dedupe parallel event streams

Is there a standard approach for deduping parallel event streams ? Before I attempt to reinvent the wheel, I want to know if this problem has some known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay vs the cost of dropped events. You can maintain two priority queues, time-ordered, where incoming events go. You'd do a merge-and on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
I think that the optimization might be OS-specific. From the task as you described it I think about two threads consuming incoming data and appending it to the common stream having access based on mutexes. Both Linux and Win32 have mutex-like procedures, but they may have slow performance if you have data rate is really great. In this case I'd operate by blocks of data, that will allow to use mutexes not so often. Sure there's a main thread that consumes the data and it also access it with a mutex.

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.
Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?
Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer, that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you, I'll downvote this answer right after I post it.
Athough, this question has been answered, I would like to add/reiterate one very important point.
MPI is best suited for problems that require a lot of interprocess communication.
When Data becomes large (petabytes, anyone?), and there is little interprocess communication, MPI becomes a pain. This is so because the processes will spend all the time sending data to each other (bandwidth becomes a limiting factor) and your CPUs will remain idle. Perhaps an even bigger problem is reading all that data.
This is the fundamental reason behind having something like Hadoop. The Data also has to be distributed - Hadoop Distributed File System!
To say all this in short, MPI is good for task parallelism and Hadoop is good for Data Parallelism.
The best answer that I could come up with is that MPI is better than MapReduce in two cases:
For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query response system where machines send messages to each other to route the query and generate the answer.
For jobs nodes need to communicate more than what iterated MapReduce jobs support, but not too much so that the communication overheads make the computation impractical. I am not sure how often such cases occur in practice, though.
I expect that MPI beats MapReduce easily when the task is iterating over a data set whose size is comparable with the processor cache, and when communication with other tasks is frequently required. Lots of scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce requires sequential processing and communication, or ending of processes, then the computational performance benefit from dealing with a cache-sized problem is lost.
When the computation and data that you are using have irregular behaviors that mostly translates to many message-passings between objects, or when you need low level hardware level accesses e.g. RDMA then MPI is better. In some answers that you see in here the latency of tasks or memory consistency model gets mentioned, frameworks like Spark or Actor Models like AKKA have shown that they can compete with MPI. Finally one should consider that MPI has benefit of being for years the main base for development of libraries needed for scientific computations (This are the most important missing parts missing from new frameworks using DAG/MapReduce Models).
All in all, I think the benefits that MapReduce/DAG models are bringing to the table like dynamic resource managers, and fault tolerance computation will make make them feasible for scientific computing groups.

Resources