I am creating a server to send data to many persistent sockets. I have chosen the Reactor design pattern, which suggests having multiple threads to send data along the sockets.
I cannot work out which is better:
- To have one thread to send all of the data to sockets
- Or have a couple of threads to send data across the sockets.
The way I see it, I have 2 cores, so I can only do two things at once. Would that mean I have 1 worker thread and 1 thread to send data?
Why would it be better to have multiple threads to send data when you suffer context switching between the threads?
See the documentation on thttpd as to why single-threaded non-blocking IO is good. Indeed, it makes good sense for static files.
If you are doing CGI, however, you may have a script that runs for a long time. It's nicer not to hold up all the quicker, simpler traffic, especially if the script has an infinite-loop bug in it and will eventually have to be killed anyway! With threads, the average response time experienced by users will be better if some of the requests are very time-consuming.
If the files being served come from disk and are not in main memory already, a similar argument can be used.
I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14GB). However, if I emit a message to all connected users, the user connected first on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried changing from the redis adapter to the mongo adapter, with no difference.
I'm using this code to get sticky sessions across multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: the clients just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested a single core without any cluster/adapter logic with 50k clients and got the same result.
I published the server, single-core server, benchmark and HTML client in one package: https://github.com/MickL/socket-io-benchmark-kit
OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes 0.1ms of CPU time, that would take 50,000 * 0.1ms = 5 seconds to send them all.
If you see CPU utilization go to 100% during this, then the bottleneck is probably the CPU and you may need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters or the redis process. So, one thing to determine immediately is whether your end-to-end time scales inversely with the number of clusters/CPUs you have. If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes to both, that's good news, because it means you are probably only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient, by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
The optimal arrangement would be for every CPU to do all its housekeeping to gather exactly what it needs to send to its 50,000 users, and then have each CPU run a tight loop sending those 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether that is what happens.
A much worse case would be where some process gets all 200,000 socket IDs and then loops over each socket ID, looking up in redis which server holds that connection and then sending that server a message telling it to send to that socket. That would be a ton less efficient than instructing each server to just send a message to all its own connected users.
It would be worth trying to figure out (by studying the code) where on this spectrum the socket.io + redis combination falls.
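To make that spectrum concrete, here is a purely illustrative sketch in Python-style pseudocode. It is not socket.io or redis-adapter code; every function and name here is hypothetical and only stands in for the work each approach has to do.

```python
# Purely illustrative; none of these functions exist in socket.io or the
# redis adapter -- they just name the work each approach implies.

def broadcast_local(local_sockets, message):
    # Best case: each server process already holds its ~50,000 connections
    # and emits to them in one tight loop, with no cross-process lookups.
    for sock in local_sockets:
        sock.send(message)                     # hypothetical per-socket send

def broadcast_via_central_lookup(all_socket_ids, message, registry, servers):
    # Worst case: one process walks all 200,000 socket IDs and, for each one,
    # asks a central registry (e.g. redis) which server owns the connection,
    # then forwards a per-socket instruction to that server.
    for sid in all_socket_ids:
        owner = registry.lookup(sid)           # one round trip per socket
        servers[owner].send_to(sid, message)   # one inter-server message each
```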
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).
I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work, and I'm struggling to find a good ZeroMQ pattern.
1) REQ/REP
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers
Problem:
How can the producer be notified of responses and still be able to send new work to idle workers, without dedicating a thread in the producer to each worker?
2) PUSH/PULL
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Problem:
Has no concept of worker idleness, i.e. work gets stuck behind long units of work
3) PUB/SUB
Non-starter, since there is no way to make sure work doesn't get lost
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Problem:
The producer has to block on a request for work until there is work to hand out (since each recv has to be paired with a send). This prevents workers from reporting work completion.
This could be fixed with a separate completion channel, but the producer would still need some polling mechanism to detect new work while staying on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Problem:
Same problem as REQ/REP: it requires a thread per worker.
As much as ZeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many, many dedicated threads or polling spin-loops in fewer. Is this just not a good use case for ZeroMQ?
Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
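As a rough illustration, here is a minimal sketch of that pattern using pyzmq. The endpoint, the READY convention and the message framing are my assumptions rather than anything from the question: the producer's ROUTER socket only ever hands work to a worker it has just heard from, so only idle workers receive jobs.

```python
# Minimal Load Balancing Pattern sketch (assumes pyzmq; endpoint and framing
# are illustrative). Run producer() and several worker() calls in separate
# processes; jobs is a list of byte strings.
import zmq

def producer(jobs):
    ctx = zmq.Context.instance()
    backend = ctx.socket(zmq.ROUTER)
    backend.bind("tcp://*:5556")
    while jobs:
        # A message from a worker means it is idle: either its initial
        # READY announcement or the result of its previous job.
        worker_id, _, reply = backend.recv_multipart()
        if reply != b"READY":
            print("result:", reply.decode())
        # Hand the next job to that specific, provably idle worker.
        backend.send_multipart([worker_id, b"", jobs.pop(0)])

def worker():
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://localhost:5556")
    sock.send(b"READY")                 # announce idleness
    while True:
        job = sock.recv()
        sock.send(b"done:" + job)       # sending the result marks us idle again
```

In a real producer you would wrap the ROUTER, plus whatever feeds you new work and any control sockets, in a zmq.Poller; that keeps everything on a single thread rather than needing one thread per worker.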
Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred, it is removed from the queue. The queue is limited in length, but the limit can be changed by setting the socket's high water mark.
There are one or more background threads managing all this on your behalf, and your calls to the ZeroMQ API simply issue instructions to those threads. The threads at either end of a socket connection collaborate to marshal the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull set-up. Suppose one of your pull workers is falling behind. It won't be accepting messages, so messages being sent to it start piling up until the high water mark is reached, and ZeroMQ will no longer send messages to that pull worker. In fact, AFAIK, a pull worker whose queue is fuller than those of its peers will receive fewer messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if a message can be sent at all, that means some pull worker somewhere is able to receive it, precisely because it has kept up with the workload. That worker will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
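A hedged sketch of what that looks like with pyzmq; the endpoint, the HWM value and the 0.5s threshold are all illustrative, not prescribed:

```python
# Push side with a deliberately low high water mark; timing the send is the
# only visible signal that every pull worker has fallen behind.
import time
import zmq

ctx = zmq.Context.instance()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.SNDHWM, 2)        # keep very little work buffered here
push.bind("tcp://*:5557")

def dispatch(job: bytes) -> None:
    start = time.monotonic()
    push.send(job)                    # blocks once all workers hit their HWM
    if time.monotonic() - start > 0.5:
        print("send stalled: all pull workers appear to be saturated")

# Each pull worker (in its own process) is just:
#   pull = zmq.Context.instance().socket(zmq.PULL)
#   pull.setsockopt(zmq.RCVHWM, 2)    # keep little buffered at the worker too
#   pull.connect("tcp://localhost:5557")
#   while True:
#       handle(pull.recv())           # handle() is whatever the real work is
```

(In pyzmq you can also pass zmq.NOBLOCK to send(), which raises zmq.Again immediately instead of blocking, giving the same information without timing the call.)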
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will transfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Latency
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ; one of the same guys is involved. There are many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also, the patterns are essentially a pluggable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion of the differences here.
Philosophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to build in the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
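A contrived pyzmq sketch of that failure mode, with tiny high water marks so it shows up immediately (the endpoint, names and numbers are illustrative):

```python
# Both peers send in a loop and only plan to receive "later". With the queues
# capped at one message each, every send soon blocks waiting for the other
# side to read, so neither peer ever reaches its recv: the prints stop after a
# handful of messages and the program hangs (kill it with Ctrl-C).
import threading
import time
import zmq

ctx = zmq.Context.instance()

def peer(name: str, bind: bool) -> None:
    sock = ctx.socket(zmq.PAIR)
    sock.setsockopt(zmq.SNDHWM, 1)    # tiny buffers make the deadlock obvious
    sock.setsockopt(zmq.RCVHWM, 1)
    (sock.bind if bind else sock.connect)("inproc://chain")
    for i in range(1000):
        sock.send(b"message")         # blocks once the buffers are full...
        print(name, "sent", i)
    sock.recv()                       # ...so this recv is never reached

threading.Thread(target=peer, args=("A", True)).start()
time.sleep(0.1)                       # let A bind before B connects (sketch only)
threading.Thread(target=peer, args=("B", False)).start()
```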
CSP
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
Curiously, this is hinted at in the documentation of Microsoft's Task Parallel Library, where they advocate setting buffer lengths to zero in the interests of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, except that in zmq_setsockopt() a high water mark of 0 means 'no limit', not nought. The default is non-zero...
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus to implement CSP in the real world, one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system needs it, it's definitely the way to go.
Personally, I'd love to see 0MQ and Nanomsg adopt it as a behavioural option.
I have reached a point where it is taking too long for a queue to finish, because new jobs keep being added to it.
What are the best options to overcome this problem?
I already use 50 worker processes, but I noticed that if I start more, it takes even longer for jobs to finish.
My setup:
nginx,
unicorn,
ruby-on-rails 4,
postgresql
Thank you
You need to measure where you are constrained by resources.
If you're seeing things slow down as you add more workers, you're likely blocked by your database server. Have you upgraded your Redis server to handle this amount of load? Where are you storing the scraped data? Can that system handle the increased write load?
If you were simply bound by CPU or I/O in the workers, you would expect throughput to scale roughly linearly as you add more workers. Since you're seeing things slow down as you scale out, you should measure where the problem actually is. I'd recommend instrumenting your worker processes with New Relic and measuring where the time is being spent.
My guess would be that your Redis instance can't handle the load to manage the work queue with 50 worker processes.
EDIT
Based on your comment, it sounds like you're entirely I/O Bound doing web scraping. In that case, you should be increasing the concurrency option for each Sidekiq worker using the -c option to spawn more threads. Having more threads will allow you to continue processing scraping jobs even when scrapers are blocked on network I/O.
I developed a server for a custom protocol on top of the TCP/IP stack with Netty. Writing it was a pleasure.
Right now I am testing performance. I wrote a test application on Netty that simply connects lots (20,000+) of "clients" to the server (a for-loop with Thread.wait(1) after each bootstrap connect). As soon as a client channel is connected, it sends a login request to the server, which checks the account and sends a login response.
The overall performance seems to be quite OK. All clients are logged in within 60s. But what's not so good is the spread of waiting times per connection. I have extremely fast logins and extremely slow logins, varying from 9ms to 40,000ms over the whole test run. Is it somehow possible to share waiting time among the requesting channels (FIFO)?
I measured a lot of significant timestamps and found a strange phenomenon. For a lot of connections, the server's "channel-connected" timestamp is way after the client's (by up to 19 seconds). I also have the "normal" case, where they match and just the time between client send and server reception is several seconds, and there are cases of everything in between. How can the client's and server's "channel-connected" timestamps be so far apart?
What is certain is that the client receives the server's login response immediately after it has been sent.
Tuning:
I think I have read most of the performance articles around here. I am using the OrderedMemoryAwareThreadPoolExecutor with 200 threads on a 4-core hyper-threading i7 for the incoming connections, and I also start the server application with the known aggressive options. I have also completely tweaked my Win7 TCP stack.
The server runs very smoothly on my machine. CPU usage and memory consumption are at roughly 50% of what is available.
Too much information:
I also started two of my test apps from two separate machines, "attacking" the server in parallel with 15,000 connections each. There I had about 800 connections that got a timeout from the server. Any comments here?
Best regards and cheers to Netty,
Martin
Netty has a dedicated boss thread that accepts incoming connections. When the boss thread accepts a new connection, it forwards it to a worker thread. Because of this, the latency between acceptance and the actual socket read might be larger than expected under load. Although we are looking into different ways to improve the situation, in the meantime you might want to increase the number of worker threads so that each worker thread handles fewer connections.
If you think it's performing much worse than a non-Netty application, please feel free to file an issue with a reproducing test case. We will try to reproduce and fix the problem.
I'm trying to work out in my head the best way to structure a Cocoa app that's essentially a concurrent download manager. There's a server the app talks to, the user makes a big list of things to pull down, and the app processes that list. (It's not using HTTP or FTP, so I can't use the URL-loading system; I'll be talking across socket connections.)
This is basically the classic producer-consumer pattern. The trick is that the number of consumers is fixed, and they're persistent. The server sets a strict limit on the number of simultaneous connections that can be open (though usually at least two), and opening new connections is expensive, so in an ideal world, the same N connections are open for the lifetime of the app.
One way to approach this might be to create N threads, each of which would "own" a connection, and wait on the request queue, blocking if it's empty. Since the number of connections will never be huge, this is not unreasonable in terms of actual system overhead. But conceptually, it seems like Cocoa must offer a more elegant solution.
It seems like I could use an NSOperationQueue, and call setMaxConcurrentOperationCount: with the number of connections. Then I just toss the download requests into that queue. But I'm not sure, in that case, how to manage the connections themselves. (Just put them on a stack, and rely on the queue to ensure I don't over/under-run? Throw in a dispatch semaphore along with the stack?)
Now that we're in the brave new world of Grand Central Dispatch, does that open up any other ways of tackling this? At first blush it doesn't seem like it, since GCD's flagship ability to dynamically scale concurrency (mentioned in Apple's recommendations on Changing Producer-Consumer Implementations) doesn't actually help me. But I've only just scratched the surface of reading about it.
EDIT:
In case it matters: yes, I am planning on using the asynchronous/non-blocking socket APIs to do the actual communication with the server. So the I/O itself does not have to be on its own thread(s). I'm just concerned with the mechanics of queuing up the work, and (safely) doling it out to the connections, as they become available.
If you're using CFSocket's non-blocking calls for I/O, I agree, that should all happen on the main thread, letting the OS handle the concurrency issues, since you're just copying data and not really doing any computation.
Beyond that, it sounds like the only other work your app needs to do is maintain a queue of items to be downloaded. When any one of the transfers is complete, the CFSocket callback can initiate the transfer of the next item on the queue. (If the queue is empty, decrement your connection count, and if something is added to an empty queue, start a new transfer.) I don't see why you need multiple threads for that.
Maybe you've left out something important, but based on your description the app is I/O bound, not CPU bound, so all of the concurrency stuff is just going to make more complicated code with minimal impact on performance.
Do it all on the main thread.
For posterity's sake, after some discussion elsewhere, the solution I think I'd adopt for this is basically:
Have a queue of pending download operations, initially empty.
Have a set containing all open connections, initially empty.
Have a mutable array (queue, really) of idle open connections, initially empty.
When the user adds a download request:
If the array of idle connections is not empty, remove one and assign the download to it.
If there are no idle connections, but the number of total connections has not reached its limit, open a new connection, add it to the set, and assign the download to it.
Otherwise, enqueue the download for later.
When a download completes: if there are queued requests, dequeue one and give it to the connection; otherwise, add the connection to the idle list.
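Purely for illustration, here is that dispatch logic as a small sketch. It is written in Python rather than Cocoa just to keep it language-neutral, and all the names, including MAX_CONNECTIONS, open_connection() and assign(), are made-up stand-ins.

```python
# Sketch of the bookkeeping described above. open_connection() and assign()
# stand in for whatever actually opens a socket and starts a transfer.
from collections import deque

MAX_CONNECTIONS = 4                   # the server-imposed limit (assumed value)

pending = deque()                     # download requests waiting for a connection
connections = set()                   # every open connection
idle = deque()                        # open connections with nothing to do

def add_download(request, open_connection, assign):
    # Called when the user adds a request.
    if idle:
        assign(idle.popleft(), request)
    elif len(connections) < MAX_CONNECTIONS:
        conn = open_connection()      # in practice: enqueue until fully established
        connections.add(conn)
        assign(conn, request)
    else:
        pending.append(request)

def download_finished(conn, assign):
    # Called from each transfer's completion callback.
    if pending:
        assign(conn, pending.popleft())
    else:
        idle.append(conn)
```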
All of that work would take place on the main thread. The work of decoding the results of each download would be offloaded to GCD, so it can handle throttling the concurrency, and it doesn't clog the main thread.
Opening a new connection might take a while, so the process of creating one might be a tad more complicated in actual practice (say, enqueue the download, initiate the connection process, and then dequeue the download once the connection is fully established). But I still think my earlier worries about race conditions were overstated.