ELB Balancing Stateful Servers

Let's say I have an HTTP/2 service that keeps a list of users and each user's hair color, both in memory and in a database.
Now I want to scale this up to multiple nodes, but I do not want the same user to be held in two different servers' memory: each server should handle its own specific set of users. This means I need to inform the load balancer where each user is being handled. When scaling down, I need to signal that a user is no longer pinned anywhere and can be routed to any server, or by a given rule, e.g. to the server using the least memory.
Does anyone know whether the ALB load balancer supports that? One path I was considering is query-string-parameter-based routing, so I could carry the destination in the request itself, something like destination_node = (int)user_id % 4 if I had 4 nodes, for instance. This worked well in a proof of concept, but it leads to a few issues:
The service itself would need to know how many instances there are to balance.
I could not guarantee even balancing; it is essentially luck-based.
What would be the preferred approach, or what is a common way of solving this problem? Does AWS ELB support this out of the box? I was trying to avoid writing my own balancer: a middleware that keeps track of which servers are handling which users and whose responsibility would be distributing requests among them.

In the AWS Application Load Balancer (ALB) it is possible to write routing rules on:
Host Header
HTTP Header
HTTP Request Method
Path Pattern
Query String
Source IP
But at the moment there is no way to route under dynamic conditions.
If it is possible to group your data, I would prefer a path pattern like
/users/blond/123
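
If you do go the query-string route described in the question, the rules can be created programmatically. Below is a minimal sketch using Python's boto3 (the ARNs are placeholders; one such rule would be needed per node, which is also why the service has to know the node count):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs - substitute your own listener and target group.
LISTENER_ARN = "arn:aws:elasticloadbalancing:<region>:<account>:listener/app/my-alb/..."
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/node-2/..."

# Forward requests whose query string contains destination_node=2
# to the target group fronting node 2.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=20,  # must be unique among the listener's rules
    Conditions=[{
        "Field": "query-string",
        "QueryStringConfig": {
            "Values": [{"Key": "destination_node", "Value": "2"}],
        },
    }],
    Actions=[{"Type": "forward", "TargetGroupArn": TARGET_GROUP_ARN}],
)
```

Note that this only reproduces the static proof of concept: scaling in or out still means rewriting the rule set, which is exactly the dynamic part ALB does not do for you.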

Related

Grafana/Prometheus visualizing multiple IPs as query

I want to have a graph where all recent IPs that requested my webserver are shown with their total request count. Is something like this doable? Can I add a query and remove it afterwards via Prometheus?
Technically, yes. You will need to:
Expose some metric (probably a counter) in your server - say, requests_count, with a label - say, ip (a minimal sketch follows this list)
Whenever you receive a request, increment the metric with the label set to the requester's IP
In Grafana, graph the metric, likely summing it by the IP address to handle the case where you have several horizontally scaled servers handling requests: sum(your_prometheus_namespace_requests_count) by (ip)
Set the Legend of the graph in Grafana to {{ ip }} to 'name' each line after the IP address it represents
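A minimal sketch of the first two steps, using Python's prometheus_client with Flask (the metric and label names come from the steps above; the web framework and everything else is an assumption):

```python
from flask import Flask, Response, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = Flask(__name__)

# Step 1: a counter named requests_count with an "ip" label.
REQUESTS = Counter("requests_count", "HTTP requests by client IP", ["ip"])

@app.route("/")
def index():
    # Step 2: increment with the label set to the requester's IP.
    REQUESTS.labels(ip=request.remote_addr).inc()
    return "ok"

@app.route("/metrics")
def metrics():
    # Scrape endpoint for Prometheus.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
```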
However, every distinct label value a metric has causes a whole new time series to exist in the Prometheus time-series database; you can think of a metric like requests_count{ip="192.168.0.1"}=1 as somewhat similar to requests_count_ip_192_168_0_1{}=1 in terms of how it consumes memory. Each metric instance currently held in the Prometheus TSDB head takes something on the order of 3kB to exist. What that means is that if you're handling millions of requests, you're going to swamp Prometheus' memory with gigabytes of data from this one metric alone. A more detailed explanation of this issue exists in this other answer: https://stackoverflow.com/a/69167162/511258
With that in mind, this approach makes sense if you know for a fact that only a small set of IP addresses will connect (maybe on an internal intranet, or a client you distribute to a small number of known customers), but if you are planning to deploy to the web, this would be a very easy way for people to (most likely unknowingly) crash your monitoring systems.
You may want to investigate an alternative -- for example, Grafana is capable of ingesting data from some common log aggregation platforms, so perhaps you can do some structured (e.g. JSON) logging, hold that in e.g. Elasticsearch, and then create a graph from the data held within that.
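As a sketch of that alternative (the field names are arbitrary; a shipper such as Filebeat would then forward the lines to Elasticsearch):

```python
import json
import logging

logger = logging.getLogger("access")
logging.basicConfig(level=logging.INFO)

def log_request(ip: str, path: str, status: int) -> None:
    # One JSON object per line ("JSON Lines") parses cleanly into
    # fields in most log aggregation platforms.
    logger.info(json.dumps({"ip": ip, "path": path, "status": status}))

log_request("192.168.0.1", "/", 200)
```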

Distributed crawling and rate limiting / flow control

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch engine. The system continuously re-crawls the found URLs at an interval of X milliseconds.
This has served me well, but with some new large clients coming up the crawler is going to hit its limits. I need to redesign the system into a distributed crawler to speed up the crawling. The problem is the combination of the specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
variable rate limit per client. I need to be very sure the system doesn't crawl client X more than once every X milliseconds.
What I have tried:
I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers, and I receive all sorts of deadlocks.
I tried putting the URLs into a Redis engine. I got this kind of working, but only with a Lua script that checks and sets an expiring key for every client that is being served (see the sketch after this list). This all feels way too hackish.
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue in 'real time'.
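For reference, here is roughly what that Redis approach looks like. In fact, a plain SET with the NX and PX options is atomic on its own, so no Lua script is strictly needed (redis-py sketch; the key naming is illustrative):

```python
import redis

r = redis.Redis()

def try_acquire(client_id: str, interval_ms: int) -> bool:
    """Return True if client_id may be crawled now.

    SET ... NX PX succeeds only if the key does not already exist,
    i.e. the client has not been crawled within the last interval_ms;
    the key then expires on its own, reopening the window.
    """
    return r.set(f"ratelimit:{client_id}", 1, nx=True, px=interval_ms) is True
```

A worker would pop a candidate URL, call try_acquire for its client, and requeue the URL for later if the call returns False.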
Can anybody explain to me how the big boys do this? How can we have multiple processes query a big/massive list of URLs based on a few criteria (like rate-limiting the client) and make sure we hand out each URL to only one worker?
Ideally we wouldn't need another database besides Elasticsearch holding all the available/found URLs, but I don't think that's possible?
Have a look at StormCrawler; it is a distributed web crawler which has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and, by default, using a single thread per host or domain.

Drop all but one node from Service Discovery

We use the Consul Service Discovery mechanism to fetch a list of proxies through which we scrape certain targets. There are multiple proxies for redundancy but ultimately they all provide the exact same information.
Now we'd like to have the relabeling always drop all but one (random) node returned from SD. It must not be hardcoded, as the names and number of proxies can and will change.
After looking at the relabeling implementation I don't think this is possible, but maybe there is some clever hack to achieve this.
Question: Is it possible to drop all but one (random) node from Prometheus Service Discovery?
This is not possible. I'd suggest putting a load balancer of some form in front of the proxies.

S3 Ruby Client - when to specify regional endpoint

I have buckets in 2 AWS regions. I'm able to perform puts and gets against both buckets without specifying the regional endpoint (the Ruby client defaults to us-east-1).
I haven't found much relevant info on how requests on a bucket reach the proper regional endpoint when the region is not specified. From what I've found (https://github.com/aws/aws-cli/issues/223#issuecomment-22872906), it appears that requests are routed to the bucket's proper region via DNS.
Does specifying the region have any advantages when performing puts and gets against existing buckets? I'm trying to decide whether I need to specify the appropriate region for operations against a bucket or if I can just rely on it working.
Note that the buckets are long lived so the DNS propagation delays mentioned in the linked github issue are not an issue.
SDK docs for region:
http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/Core/Configuration.html#region-instance_method
I do not think there is any performance benefit to putting/getting data if you specify the region. All bucket names are supposed to be unique across all regions, and I don't think there's a lot of overhead in that lookup compared to data throughput.
I welcome comments to the contrary.
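For reference, pinning a client to a bucket's home region is a one-liner; sketched here in Python/boto3 (the Ruby SDK exposes the same idea through the region configuration option linked above; the bucket and key names are placeholders):

```python
import boto3

# Client pinned to the bucket's home region; requests go straight to
# the regional endpoint instead of relying on DNS-based redirection
# from the default (us-east-1) endpoint.
s3 = boto3.client("s3", region_name="eu-west-1")
s3.get_object(Bucket="my-eu-bucket", Key="some/key")
```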

Data-aware load balancing with embedded and distributed caches/datagrids

Sorry, I'm a beginner in load balancing.
In distributed environments we tend more and more to send the computation (map/reduce) to the data, so that the result gets computed locally and then aggregated.
What I'd like to do applies to partitioned/distributed data, not replicated data.
Following the same kind of principle, I'd like to be able to send a user request to the server where that user's data is cached.
When using an embedded cache or datagrid to get low response times with a large dataset, we tend to avoid replication and use distributed/partitioned caches.
The partitioning algorithms are generally hash-based and allow replicas, to handle server failures.
So in the end, a user's data is generally hosted on something like 3 servers (1 primary copy and 2 replicas).
On a local cache miss, the caches are generally able to search for the entry on other cache peers.
This works fine but requires a network call.
I'd like to have a load balancing strategy that avoids this unnecessary network call.
What I'd like to know: is it possible to have a load balancer that is aware of the partitioning mechanism of the cache, so that it always forwards to one of the webservers holding a local copy of the data we need?
For example, I have a request www.mywebsite.com/user=387
The load balancer would check the userId 387 and know that this user is stored on servers 1, 6 and 12, and could thus round-robin to one of them (or apply some other strategy).
If there's no generic solution, are there open-source or commercial, software or hardware load balancers that permit defining custom routing strategies?
How much will extracting data from a request slow down the load balancer? What's the cost of extracting a URL parameter (as in my example with user=387) and following some rules to reach the right webserver, compared to a round-robin strategy, for example?
Is there an abstraction library on top of cache vendors so that we can easily retrieve the partitioning data and make it available to the load balancer?
Thanks!
Interesting question. I don't think there is a readily available solution for your requirements, but it would be pretty easy to build if your hashing criterion is relatively simple and depends only on the request (a URL parameter, as in your example).
If I were building this, I would use Varnish (http://varnish-cache.org), but you could do the same in other reverse proxies.
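To make that concrete, here is a hypothetical Python sketch of the routing function such a balancer would embody. The modulo-plus-successors placement is an assumption standing in for the datagrid's real partitioning scheme, which the balancer would have to mirror exactly:

```python
import random

SERVERS = [f"web{i}" for i in range(1, 13)]  # 12 webservers, hypothetical
REPLICAS = 2  # 1 primary copy + 2 replicas, as in the question

def servers_for(user_id: int) -> list[str]:
    # Assumed placement: primary chosen by modulo, replicas on the
    # next servers in the ring. Must match the cache's own scheme.
    primary = user_id % len(SERVERS)
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(REPLICAS + 1)]

def route(user_id: int) -> str:
    # Pick any data-local server; random here, round-robin works too.
    return random.choice(servers_for(user_id))

print(servers_for(387))  # ['web4', 'web5', 'web6'] under this scheme
```

In Varnish this logic would live in VCL: parse the user parameter from the URL, compute the partition, and pick a data-local backend from a director.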
