I'm working on creating a version of Pastry natively in Go. From the design [PDF]:
It is assumed that the application
provides a function that allows each Pastry node to determine the “distance” of a node
with a given IP address to itself. A node with a lower distance value is assumed to be
more desirable. An application is expected to implement this function depending on its
choice of a proximity metric, using network services like traceroute or Internet subnet
maps, and appropriate caching and approximation techniques to minimize overhead.
I'm trying to figure out what the best way to determine the "proximity" (i.e., network latency) between two EC2 instances programmatically from Go. Unfortunately, I'm not familiar enough with low-level networking to be able to differentiate between the different types of requests I could use. Googling did not turn up any suggestions for measuring latency from Go, and general latency techniques always seem to be Linux binaries, which I'm hoping to avoid in the name of fewer dependencies. Any help?
Also, I note that the latency should be on the scale of 1ms between two EC2 instances. While I plan to use the implementation on EC2, it could hypothetically be used anywhere. Is latency generally so bad that I should expend the effort to ensure the network proximity of two nodes? Keep in mind that most Pastry requests can be served in log base 16 of the number of servers in the cluster (so for 10,000 servers, it would take approximately 3 requests, on average, to find the key being searched for). Is the latency from, for example, EC2's Asia-Pacific region to EC2's US-East region enough to justify the increased complexity and the overhead introduced by the latency checks when adding nodes?
A common distance metric in networking is to count the number of hops (node-hops in-between) a packet needs to reach its destination. This metric was also mentioned in the text you quoted. This could give you adequate distance values even for the low-latency environment you mentioned (EC2 “local”).
For the Go logic itself, the net package is what you are looking for. And indeed, for latency tests (ICMP ping) you can use it to create an IP connection:
conn, err := net.Dial("ip4:icmp", "127.0.0.1")
create your ICMP packet structure and data, and send it. (See the Wikipedia page on ICMP; IPv6 needs a different format.) Unfortunately you can't create an ICMP connection directly the way you can with TCP and UDP, so you will have to handle the packet structure yourself.
As conn, of type Conn, is a Writer, you can simply pass it the ICMP data you defined.
In the ICMP Type field you can specify the message type. Values 8, 0 and 30 are the ones you are looking for: 8 for your echo request, the reply will be of type 0 (echo reply), and maybe 30 (traceroute) gives you some more information.
Unfortunately, for counting the network hops, you will need the IP packet header fields. This means you will have to construct your own IP packets, which net does not seem to allow.
Checking the source of Dial(), it uses internetSocket, which is not exported/public. I'm not really sure if I'm missing something, but it seems there is no simple way to construct your own IP packets with customizable header values. You'd have to check further how DialIP sends packets with internetSocket and duplicate and adapt that code/concept. Alternatively, you could use cgo and a system library to construct your own packets (though this would add yet more complexity).
If you are planning on using IPv6, you will (also) have to look into ICMPv6. Both packet formats differ from their v4 counterparts.
So, I'd suggest using simple latency (a timed ping) as the simpler implementation, and adding node-hop counting later if you need it. If you have both in place, maybe you'll also want to combine the two (fewer hops does not automatically mean better; think of long overseas cables, etc.).
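Putting the pieces above together, here is a minimal sketch of a hand-built ICMP echo in pure Go. The packet layout and checksum follow RFC 792/RFC 1071; the ping helper is illustrative only, since sending raw ICMP needs elevated privileges (root or CAP_NET_RAW on Linux):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"time"
)

// icmpChecksum computes the RFC 1071 internet checksum over b.
func icmpChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i : i+2]))
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8 // pad trailing byte
	}
	for sum>>16 != 0 {
		sum = (sum >> 16) + (sum & 0xffff) // fold carries
	}
	return ^uint16(sum)
}

// echoRequest builds an ICMP echo request: type 8, code 0, then
// checksum, identifier, sequence number, and payload.
func echoRequest(id, seq uint16, payload []byte) []byte {
	msg := make([]byte, 8+len(payload))
	msg[0] = 8 // type: echo request
	msg[1] = 0 // code
	binary.BigEndian.PutUint16(msg[4:6], id)
	binary.BigEndian.PutUint16(msg[6:8], seq)
	copy(msg[8:], payload)
	binary.BigEndian.PutUint16(msg[2:4], icmpChecksum(msg))
	return msg
}

// ping sends one echo request and times the first reply; requires
// raw-socket privileges, so treat it as a sketch.
func ping(addr string) (time.Duration, error) {
	conn, err := net.Dial("ip4:icmp", addr)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	start := time.Now()
	if _, err := conn.Write(echoRequest(0xbeef, 1, []byte("ping"))); err != nil {
		return 0, err
	}
	_ = conn.SetReadDeadline(start.Add(2 * time.Second))
	buf := make([]byte, 1500)
	if _, err := conn.Read(buf); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	fmt.Printf("%x\n", echoRequest(0xbeef, 1, []byte("ping")))
}
```

A handy property for sanity-checking: running icmpChecksum over a complete packet (checksum field included) must yield 0.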
I am planning to use nifi to ingest data from more than 10,000 sensors. There are 50-100 types of sensors which will send a specific metric to nifi.
I am pondering whether I should assign one port number to listen to all the sensors, or one port per type of sensor, to facilitate my data pipeline. Which is the better option?
Is there an upper limit to the number of ports which I can "listen" on using NiFi?
NiFi is such a powerful tool. You can do either of your ideas, but I would recommend doing what is easier for you. If you have data source sensors that need different data flows, use different ports. However, if you can fire everything at a single port, I would do that: it is easier to implement, consistent, easier to support later, and easier to scale.
In a large-scale, highly available NiFi deployment, you may want a load balancer to handle the inbound data. This would push the sensor data toward a single host:port on the LB appliance, which then directs it to a NiFi cluster of 3, 5, 10, or more nodes.
I agree with the other answer that once scaling comes into play, an external load balancer in front of NiFi would be helpful.
In regards to the flow design, I would suggest using a single exposed port to ingest all the data, and then use RouteOnAttribute or RouteOnContent processors to direct specific sensor inputs into different flow segments.
One of the strengths of NiFi is the generic nature of flows given sufficient parameterization, so taking advantage of flowfile attributes to handle different data types dynamically scales and performs better than duplicating a lot of flow segments to statically handle slightly differing data.
Running multiple ingestion ports carries substantial performance overhead compared to a single port with routed flowfiles, so consolidating will give you a large performance improvement. You can also organize your flow segments into hierarchically nested groups using the Process Group features, to keep different flow segments cleanly organized and to enforce access controls as well.
2020-06-02 Edit to answer questions in comments
Yes, you would have a lot of relationships coming out of the initial RouteOnAttribute processor at the ingestion port. However, you can segment these (route all flowfiles with X attribute in "family" X here, Y here, etc.) and send each to a different process group which encapsulates more specific logic.
Think of it like a physical network: at a large organization, you don't buy 1000 external network connections and hook each individual user's machine directly to the internet. Instead, you obtain one (plus redundancy/backup) large connection to the internet and use a router internally to direct the traffic to the appropriate endpoint. This has management benefits as well as cost, scalability, etc.
The overhead of multiple ingestion ports is that you have additional network requirements (S2S is very efficient when communicating, but there is overhead on a connection basis), multiple ports to be opened and monitored, and CPU to schedule & run each port's ingestion logic.
I've observed this pattern in practice at scale in multinational commercial and government organizations, and the performance improvement was significant when switching to a "single port; route flowfiles" pattern vs. "input port per flow" design. It is possible to accomplish what you want with either design, but I think this will be much more performant and easier to build & maintain.
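As a language-agnostic sketch of the "single port; route flowfiles" pattern described above (all names and types here are made up for illustration, this is not NiFi's API): everything arrives at one ingest point and is fanned out into per-family buckets, with a catch-all playing the role of RouteOnAttribute's unmatched relationship.

```go
package main

import "fmt"

// SensorReading is a hypothetical record arriving on the single listen port.
type SensorReading struct {
	SensorType string // the routing attribute, e.g. "temperature"
	Value      float64
}

// routeByType plays the role of RouteOnAttribute: it maps each reading's
// type to a flow-segment "family"; unknown types land in "unmatched".
func routeByType(in []SensorReading, families map[string]string) map[string][]SensorReading {
	out := make(map[string][]SensorReading)
	for _, r := range in {
		family, ok := families[r.SensorType]
		if !ok {
			family = "unmatched"
		}
		out[family] = append(out[family], r)
	}
	return out
}

func main() {
	// 50-100 sensor types collapse into a handful of families,
	// each of which would feed its own process group downstream.
	families := map[string]string{
		"temperature": "environment",
		"humidity":    "environment",
		"rpm":         "machinery",
	}
	in := []SensorReading{{"temperature", 21.5}, {"rpm", 1200}, {"gps", 0}}
	for family, readings := range routeByType(in, families) {
		fmt.Println(family, len(readings))
	}
}
```

The point of the sketch is the shape: one ingestion path, one routing step, many downstream segments, rather than one listener per sensor type.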
I plan to make a system for distributing VM images among several stations using the BitTorrent protocol. The current system looks as follows:
                                            |--[room with 20 PCs]
[srv_with_images]-->--[1Gbps bottleneck]-->-|
                                            |--[2nd room with 20 PCs]
All the PCs download images at once through the 1Gbps bottleneck every night, and it takes a lot of time. We plan to use BitTorrent to speed up the distribution of images using peer-to-peer exchange between all the PCs. However, there is a problem: when an image appears on the origin server, it starts to act as a single seed from which all peers download the file simultaneously. So we again fall into the trap of the bottleneck. To speed up the distribution, we need to implement (or at least we think we need) an abstract high-level algorithm that:
Ensures that in the beginning, when a new image arrives, only a small portion of the stations download it from the origin,
Once that small portion starts seeding, the rest (or the next, bigger portion) of the PCs start downloading, preferably only from the PCs in the same room rather than from the origin,
Doesn't rely on a "static" list of initial peers, as some computers may be offline during the day. We can't assume that any of the computers will always be up and running; a peer may also be turned off at any time.
Are there any specific algorithms that can help us design this? The most naive way would be to keep an active-servers list somewhere and have a daemon choose the initial peers for each torrent. But maybe there are more elegant ways to do that kind of thing?
Another option would be to ensure that only some peers can download from the origin, while the rest of the peers download from each other (but not from the origin). Is that possible in the BitTorrent protocol?
If you are using BitTorrent, no special coordination is necessary.
Peers behind the bottleneck can talk to each other directly and share the bandwidth. Using the rarest-first piece-picking algorithm will mostly ensure that they download different pieces from the server and then share them with each other.
LSD (Local Service Discovery) may help to speed up LAN-local peer discovery, but it should work with a normal tracker too if there are no NAT shenanigans in play.
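For illustration, a minimal sketch of rarest-first piece selection in Go. Real clients also randomize among equally rare pieces and handle endgame mode; this sketch only shows why peers naturally end up fetching different pieces from the seed:

```go
package main

import "fmt"

// rarestFirst picks the next piece to request from a given peer: among
// pieces we still need and the peer has, choose the one with the lowest
// availability across the swarm. Ties go to the lowest index here, while
// real clients break ties randomly.
func rarestFirst(need, peerHas []bool, availability []int) int {
	best := -1
	for i := range availability {
		if !need[i] || !peerHas[i] {
			continue
		}
		if best == -1 || availability[i] < availability[best] {
			best = i
		}
	}
	return best // -1 means there is nothing to request from this peer
}

func main() {
	need := []bool{true, true, true, false}
	peerHas := []bool{true, true, true, true}
	avail := []int{5, 1, 3, 2} // how many swarm peers have each piece
	fmt.Println(rarestFirst(need, peerHas, avail)) // piece 1 is the rarest we need
}
```

Because each peer sees different availability counts as the swarm fills in, they diverge onto different pieces of the image early, which is exactly what relieves the seed behind the 1Gbps link.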
I need to study load balancers, such as Network Load Balancing, Linux Virtual Server, HAProxy, etc. There are some under-the-hood details I need to know:
What algorithms/technologies are used in these load-balancers? Which is the most popular? most effective?
I expect that these algorithms/technologies will not be too complicated. Are there some resources written about them?
Load balancing in Apache, for example, is taken care of by the module called mod_proxy_balancer. This module supports 3 load balancing algorithms:
Request counting
Weighted traffic counting
Pending request counting
For more details, take a look here: mod_proxy_balancer
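For reference, a minimal configuration sketch (hostnames and paths are placeholders); lbmethod selects among the three algorithms above: byrequests for request counting, bytraffic for weighted traffic counting, and bybusyness for pending request counting.

```apache
<Proxy "balancer://mycluster">
    BalancerMember "http://app1.example.com:8080" loadfactor=2
    BalancerMember "http://app2.example.com:8080" loadfactor=1
    ProxySet lbmethod=byrequests
</Proxy>
ProxyPass        "/app" "balancer://mycluster"
ProxyPassReverse "/app" "balancer://mycluster"
```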
Not sure if this belongs on serverfault or not, but some load balancing techniques are:
Round Robin
Least Connections
I used least connections. It just made the most sense to send the person to the machine which had the least amount of load.
In general, load balancing is all about sending new client requests to the servers that are least busy. Based on the application running, assign a 'busy factor' to each server: a number reflecting one or several points of interest for your load balancing algorithm (connected clients, CPU/memory usage, etc.), and then, at runtime, choose the server with the lowest such score. Basically ANY load balancing technique is based on something like this:
Round robin does not compute a 'busy score' per se, but assigns each consecutive request to the next server in a circular queue.
Least connections has its score = number_of_open_connections to the server. Obviously, a server with fewer connections is a better choice.
Random assignment is a special case: you make an uninformed decision about the server's load, but assume that random choice yields a statistically even distribution.
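A toy sketch of these selection rules in Go (a real balancer tracks scores concurrently and handles server health; this only shows the selection logic, with made-up server names):

```go
package main

import "fmt"

// Server carries a 'busy factor'; here it is simply open connections.
type Server struct {
	Name  string
	Conns int
}

// leastConnections returns the index of the server with the lowest score.
func leastConnections(servers []Server) int {
	best := 0
	for i, s := range servers {
		if s.Conns < servers[best].Conns {
			best = i
		}
	}
	return best
}

// roundRobin hands out consecutive indices in a circular queue;
// it ignores load entirely.
func roundRobin(next *int, n int) int {
	i := *next % n
	*next++
	return i
}

func main() {
	servers := []Server{{"a", 3}, {"b", 1}, {"c", 2}}
	fmt.Println(servers[leastConnections(servers)].Name) // b
	var next int
	for i := 0; i < 4; i++ {
		fmt.Print(roundRobin(&next, 3), " ") // 0 1 2 0
	}
}
```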
In addition to those already mentioned, a simple random assignment can be a good enough algorithm for load balancing, especially with a large number of servers.
Here's one link from Oracle: http://download-llnw.oracle.com/docs/cd/E11035_01/wls100/cluster/load_balancing.html
DNS Round Robin (DRR) permits cheap load balancing (distribution is a better term). It has the advantage of permitting infinite horizontal scaling. The drawback is that if one of the web servers goes down, some clients continue to use the broken IP for minutes (min TTL 300s) or more, even if the DNS implements fail-over.
A Hardware Load Balancer (HLB) handles such web server failures transparently, but it cannot scale its bandwidth indefinitely. A hot spare is also needed.
A good solution seems to be DRR in front of a group of HLB pairs. Each HLB pair never goes down, and therefore DRR never keeps clients down. Plus, when bandwidth isn't enough, you can add a new HLB pair to the group.
Problem: DRR moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
I could just avoid using session stickiness, but it makes better use of caches, so it is something I want to preserve.
Question: does an HLB implementation exist where an instance can share its (session id, web server) mapping with the other instances?
If this is possible, then a client would be routed to the same web server regardless of which HLB routed the request.
Thanks in advance.
Modern load balancers have very high throughput capabilities (gigabit). So unless you're running a huuuuuuuuuuge site (e.g. google), adding bandwidth is not why you'll need a new pair of load balancers, especially since most large sites offload much of their bandwidth to CDNs (Content Delivery Networks) like Akamai. If you're pumping a gigabit of un-CDN-able data through your site and don't already have a global load-balancing strategy, you've got bigger problems than cache affinity. :-)
Instead of bandwidth limits, sites tend to add additional LB pairs for geo-distribution of servers at separate data centers to ensure users spread across the world can talk to a server closest to them.
For that latter scenario, load balancer companies offer geo-location solutions, which (at least as of a few years ago, when I was last following this space) were based on custom DNS implementations that looked at client IPs and resolved to the load balancer pair's Virtual IP address which is "closest" (in network topology or performance) to the client. These days, CDNs like Akamai also offer global load balancing services (e.g. http://www.akamai.com/html/technology/products/gtm.html). Amazon's EC2 hosting also supports this kind of feature for sites hosted there (see http://aws.amazon.com/elasticloadbalancing/).
Since users tend not to move across continents in the course of a single session, you automatically get affinity (aka "stickiness") with geographic load balancing, assuming your pairs are located in separate data centers.
Keep in mind that geo-location is really hard since you also have to geo-locate your data to ensure your back-end cross-data-center network doesn't get swamped.
I suspect that F5 and other vendors also offer single-datacenter solutions which achieve the same ends, if you're really concerned about the single point of failure of network infrastructure (routers, etc.) inside your datacenter. But router and switch vendors have high-availability solutions which may be more appropriate to address that issue.
Net-net, if I were you I wouldn't worry about multiple pairs of load balancers. Get one pair and, unless you have a lot of money and engineering time to burn, partner with a hoster who's good at keeping their data center network up and running.
That said, if cache affinity is such a big deal for your app that you're thinking about shelling out big $$$ for multiple pairs of load balancers, it may be worth considering some app architecture changes (like using an external caching cluster). Solutions like memcached (for linux) are designed for this scenario. Microsoft also has one coming called "Velocity".
Anyway, hope this is useful info-- it's admittedly been a while since I've been deeply involved in this space (I was part of the team which designed an application load balancing product for a large software vendor) so you might want to double-check my assumptions above with facts you can pull off the web from F5 and other LB vendors.
OK, this is an ancient question which I just found through a Google search. But for any future visitors, here are some additional clarifications:
Problem: [DNS Round Robin] moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
This premise is, as best I can tell, not accurate. It seems nobody really knows what old browsers might do, but presumably each browser window will stay on the same IP address as long as it's open. Newer operating systems probably obey the "longest matching prefix" rule. Thus there shouldn't be much 'flapping', i.e. randomly switching from one load balancer IP to another.
However, if you're still worried about users getting randomly reassigned to a new load balancer pair, then a small modification of the classic L3/4 & L7 load balancing setup can help:
Publish DNS Round Robin records that go to Virtual high-availability IPs that are handled by L4 load balancers.
Have the L4 load balancers forward to pairs of L7 load balancers based on the origin IP address, i.e. use consistent hashing based on the end user's IP to always route end users to the same L7 load balancer.
Have your L7 load balancers use "sticky sessions" as you want them to.
Essentially this is just a small modification to what Willy Tarreau (the creator of HAProxy) wrote years ago.
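Step 2 above (consistent hashing on the client IP) can be sketched as a tiny hash ring in Go. The balancer names are placeholders, and a real L4 device implements this in silicon or kernel code; the sketch only demonstrates why the same client IP always lands on the same L7 pair:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: each L7 balancer gets several
// virtual points; a client IP maps to the first point clockwise from its hash.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashStr(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(balancers []string, replicas int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, b := range balancers {
		for i := 0; i < replicas; i++ {
			h := hashStr(fmt.Sprintf("%s#%d", b, i))
			r.points = append(r.points, h)
			r.owner[h] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick routes a client IP to a balancer; the same IP always lands on the
// same balancer, which is what preserves stickiness across the L4 tier.
func (r *ring) pick(clientIP string) string {
	h := hashStr(clientIP)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"l7-pair-a", "l7-pair-b"}, 64)
	fmt.Println(r.pick("203.0.113.7") == r.pick("203.0.113.7")) // deterministic
}
```

The virtual replicas keep the split roughly even, and adding or removing an L7 pair only remaps the clients that hashed near its points, rather than reshuffling everyone.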
Thanks for putting things in the right perspective.
I agree with you.
I did some reading and found:
Flickr: http://highscalability.com/flickr-architecture
4 billion queries per day --> about 50000 queries/s
Youtube: http://highscalability.com/youtube-architecture
100 million video views/day --> about 1200 video views/second
PlentyOfFish: http://highscalability.com/plentyoffish-architecture
600 pages/second
200 Mbps used
CDN used
Twitter: http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster
300 tweets/second
600 req/s
A very top-end LB like this can scale up to:
200,000 SSL handshakes per second
1 million TCP connections per second
3.2 million HTTP requests per second
36 Gbps of TCP or HTTP throughput
Therefore, you are right: an LB could hardly become a bottleneck.
Anyway, I found this (old) article, http://www.tenereillo.com/GSLBPageOfShame.htm,
which explains that geo-aware DNS could create availability issues.
Could someone comment on that article?
Thanks,
Valentino
So why not keep it simple and have the DNS server give out a certain IP address (or addresses) based on the origin IP address, i.e. use consistent hashing based on the end user's IP to always give end users the same IP address(es)?
I'm aware that this only provides a simple and cheap load distribution mechanism.
I have been looking for this, but haven't found a DNS server that implements it (although BIND has some possibilities with views).
I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...), so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have a much higher rate and a higher number of listeners (say, an "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, one to each of the gateways. The number of connections on the gateways would be a lot higher: about 10,100, i.e. 10,000 clients plus connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now.)
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any day-to-day experience with these technologies, or implementations thereof, that you can share?
@MSalters
Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
Re UDP:
I am not sure if UDP would be useful here. If you turn off Nagling, then RabbitMQ single-message round-trip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher): http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.
I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (smaller number of total queues, and consumers on those queues are getting messages that they don't have an interest in).
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim. daemon that I built using libev and is entirely single-threaded (and is showing very good scaling -- it should be able to handle 50,000 connections on one box without any problems -- our sim. tick rate is quite low though, and there are no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.
Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
Please do take a look. We are here to help. Feel free to email us at info@rabbitmq.com or hit me on twitter (@monadic).
My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
Also, for similar reasons, don't assume that a message queue architecture is the best for all communication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery: the information changes all the time, so if a message is lost, you don't want to spend time recovering it, since that would only deliver the stale location of the affected entity; you'd rather just send its current location. Technology-wise, this means you want UDP, not TCP, plus a custom loss-recovery mechanism.
FWIW, for cases where intermediate results are not important (like positioning info), Qpid has a "last-value queue" that delivers only the most recent value to a subscriber.
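To illustrate those semantics with a toy model (this is not Qpid's implementation, just the behavior): a last-value queue keeps only the newest message per key, so a slow consumer drains the latest position of each entity rather than a backlog of stale updates.

```go
package main

import "fmt"

// lastValueQueue retains only the most recent value per key; publishing
// overwrites rather than enqueues, which is the point of the pattern.
type lastValueQueue struct {
	latest map[string]string
	order  []string // keys in first-seen order, for stable draining
}

func newLVQ() *lastValueQueue {
	return &lastValueQueue{latest: make(map[string]string)}
}

func (q *lastValueQueue) Publish(key, value string) {
	if _, seen := q.latest[key]; !seen {
		q.order = append(q.order, key)
	}
	q.latest[key] = value // stale values are silently replaced
}

// Drain delivers the newest value per key and empties the queue.
func (q *lastValueQueue) Drain() []string {
	var out []string
	for _, k := range q.order {
		out = append(out, q.latest[k])
	}
	q.latest = make(map[string]string)
	q.order = nil
	return out
}

func main() {
	q := newLVQ()
	q.Publish("entity42", "pos=1,1")
	q.Publish("entity42", "pos=2,3") // overwrites, does not enqueue
	q.Publish("entity7", "pos=0,9")
	fmt.Println(q.Drain()) // [pos=2,3 pos=0,9]
}
```

This gives the "lost updates don't matter" property from the answer above without leaving TCP: an old position never has to be recovered because a newer one has already replaced it.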