S3 Ruby Client - when to specify regional endpoint

I have buckets in two AWS regions. I'm able to perform puts and gets against both buckets without specifying the regional endpoint (the Ruby client defaults to us-east-1).
I haven't found much relevant info on how requests against a bucket reach the proper regional endpoint when the region is not specified. From what I've found (https://github.com/aws/aws-cli/issues/223#issuecomment-22872906), it appears that requests are routed to the bucket's proper region via DNS.
Does specifying the region have any advantage when performing puts and gets against existing buckets? I'm trying to decide whether I need to specify the appropriate region for operations against a bucket, or whether I can just rely on it working.
Note that the buckets are long-lived, so the DNS propagation delays mentioned in the linked GitHub issue are not a concern.
SDK docs for region:
http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/Core/Configuration.html#region-instance_method

I do not think there is any performance benefit to putting/getting data if you specify the region. Bucket names are globally unique across all regions, and I don't think there's much overhead in that lookup compared to data throughput.
I welcome comments to the contrary.
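If you do decide to pin the region, it is a one-line change at client construction time. A minimal sketch, assuming the v1 Ruby SDK from the linked docs; the bucket name and key are placeholders:

    require 'aws-sdk' # v1 SDK, matching the linked Configuration docs

    # Set a global default region for all clients...
    AWS.config(region: 'us-west-2')

    # ...or override it per client.
    s3  = AWS::S3.new(region: 'us-west-2')
    obj = s3.buckets['my-uswest2-bucket'].objects['some/key'] # placeholder names

    obj.write('hello')
    puts obj.read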

Related

Grafana/Prometheus: visualizing multiple IPs in a query

I want a graph where all recent IPs that requested my webserver are shown with their total request count. Is something like this doable? Can I add a query and remove it afterwards via Prometheus?
Technically, yes. You will need to (a minimal sketch follows these steps):
Expose some metric (probably a counter) in your server, say requests_count, with a label, say ip.
Whenever you receive a request, increment the metric with the label set to the requester's IP.
In Grafana, graph the metric, likely summing it by IP address to handle the case where several horizontally scaled servers handle requests: sum(your_prometheus_namespace_requests_count) by (ip)
Set the legend of the graph in Grafana to {{ ip }} to name each line after the IP address it represents.
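A minimal instrumentation sketch, assuming the prometheus-client Ruby gem (its 2.x API) and a plain Rack app; the metric and label names match the steps above:

    # config.ru -- run with `rackup`; Prometheus scrapes /metrics
    require 'prometheus/client'
    require 'prometheus/middleware/exporter'

    REQUESTS = Prometheus::Client::Counter.new(
      :requests_count,
      docstring: 'Requests received, labelled by client IP',
      labels: [:ip]
    )
    Prometheus::Client.registry.register(REQUESTS)

    use Prometheus::Middleware::Exporter # serves GET /metrics

    run lambda { |env|
      REQUESTS.increment(labels: { ip: env['REMOTE_ADDR'] })
      [200, { 'Content-Type' => 'text/plain' }, ['ok']]
    }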
However, every different label value a metric has causes a whole new metric to exist in the Prometheus time-series database; you can think of a metric like requests_count{ip="192.168.0.1"}=1 to be somewhat similar to requests_count_ip_192_168_0_1{}=1 in terms of how it consumes memory. Each metric instance currently being held in the Prometheus TSDB head takes something on the order of 3kB to exist. What that means is that if you're handling millions of requests, you're going to be swamping Prometheus' memory with gigabytes of data just from this one metric alone. A more detailed explanation about this issue exists in this other answer: https://stackoverflow.com/a/69167162/511258
With that in mind, this approach would make sense if you know for a fact you expect a small volume of IP addresses to connect (maybe on an internal intranet, or a client you distribute to a small number of known clients), but if you are planning to deploy to the web this would allow a very easy way for people to (unknowingly, most likely) crash your monitoring systems.
You may want to investigate an alternative -- for example, Grafana is capable of ingesting data from some common log aggregation platforms, so perhaps you can do some structured (e.g. JSON) logging, hold that in e.g. Elasticsearch, and then create a graph from the data held within that.

Best method to persist data from an AWS Lambda invocation?

I use AWS Simple Email Service (SES) for email. I've configured SES to save incoming email to an S3 bucket, which triggers an AWS Lambda function. This function reads the new object and forwards its contents to an alternate email address.
I'd like to log some basic info from my AWS Lambda function during invocation -- who the email is from, to whom it was sent, whether it contained any links, etc.
Ideally I'd save this info to a database, but since AWS Lambda invocations are costly (relative to other AWS operations), I'd like to do this as efficiently as possible.
I was thinking I could issue an HTTPS GET request to a private endpoint with a query string containing the info I want logged. Since I could fire the request asynchronously at the outset and continue processing, I thought this might be a cheap and efficient approach.
Is this a good method? Are there any alternatives?
My Lambda function fires irregularly, so despite execution environments being kept warm for ten minutes or so after an invocation, it seems a database connection is likely to be slow and costly, since AWS charges per 100 ms of usage.
Since I could conceivably get thousands of emails per month, keeping my Lambda function efficient is paramount to cost. I maintain hundreds of domain names, so my numbers aren't exaggerated. Thanks in advance.
I do not think that thousands of emails per month should be a problem; these cloud services have been developed with scalability in mind and can go way beyond the numbers you are suggesting.
In terms of persisting, it is hard to say (without logs or metrics) why your DB connection would be slow. From the moment you are inside AWS, traffic stays on its internal infrastructure, so speeds will be high and not something you should be worrying about.
I am not an expert on billing, but from what you are describing, Lambda + S3 + DynamoDB seems well suited to your use case.
From the type of data you are describing (email data), it doesn't seem that you would hit either a memory limit (Lambda memory constraints can be a pain) or an I/O bottleneck. It would help if you could share more details on the memory used during invocation, the time taken, and how much data you store on each invocation.
I think you could store JSON-serialized strings of your email data in DynamoDB easily; it should be pretty seamless and not that costly (a sketch follows below).
I have not used SES, but you could put a trigger on the DynamoDB table so that whenever you store a record, another Lambda can follow up.
You could also combine S3 + DynamoDB: when you store a record, upload a file containing the record to a new S3 key and update the row in DynamoDB with a pointer to the new S3 object.
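A minimal sketch of the DynamoDB write in Ruby (Lambda offers a Ruby runtime); the table name and item fields are hypothetical, and the client lives outside the handler so warm invocations reuse the connection:

    require 'time'
    require 'aws-sdk-dynamodb'

    # Created once per container, not per invocation.
    DDB   = Aws::DynamoDB::Client.new
    TABLE = 'email_log' # hypothetical table, partition key: message_id

    def handler(event:, context:)
      DDB.put_item(
        table_name: TABLE,
        item: {
          'message_id'  => context.aws_request_id,
          'from'        => event['from'],      # assumed to be extracted
          'to'          => event['to'],        # from the parsed email
          'has_links'   => event['has_links'],
          'received_at' => Time.now.utc.iso8601
        }
      )
    end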
You can now persist data using AWS EFS.

ELB Balancing Stateful Servers

Let's say I have an HTTP/2 service that keeps a list of users and each user's hair color, both in memory and in a database.
Now I want to scale this out to multiple nodes; however, I do not want the same user to be held in two different servers' memory, so each server should handle its own specific set of users. This means I need to inform the load balancer where each user is being handled. When scaling down, I need to signal that a user is no longer assigned anywhere and can be routed to any server, or by a given rule (e.g. the server with the least memory in use).
Does anyone know whether the ALB load balancer supports this? One approach I considered was query-string parameter-based routing, encoding something like destination_node = (int)user_id % 4 in the request itself (for four nodes, say). This worked well in a proof of concept, but it leads to a few issues:
The service itself would need to know how many instances there are to balance across.
I could not guarantee even balancing; it is essentially luck-based.
What would be the preferred approach here, or what is a common way of solving this problem? Does AWS ELB support it out of the box? I was trying to avoid writing my own balancer: a middleware that keeps track of which servers are handling which users and distributes requests among them.
In the AWS Application Load Balancer (ALB) it is possible to write routing rules on:
Host Header
HTTP Header
HTTP Request Method
Path Pattern
Query String
Source IP
But at the moment there is no way to route on dynamic conditions.
If it is possible to group your data, I would prefer a path pattern like
/users/blond/123
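For completeness, creating such a rule is straightforward with the Ruby SDK. A sketch using aws-sdk-elasticloadbalancingv2, with placeholder ARNs:

    require 'aws-sdk-elasticloadbalancingv2'

    elbv2 = Aws::ElasticLoadBalancingV2::Client.new

    # Forward /users/blond/* to the target group that owns that segment.
    elbv2.create_rule(
      listener_arn: 'arn:aws:elasticloadbalancing:...:listener/app/...', # placeholder
      priority: 10,
      conditions: [
        { field: 'path-pattern', values: ['/users/blond/*'] }
      ],
      actions: [
        { type: 'forward', target_group_arn: 'arn:aws:elasticloadbalancing:...:targetgroup/...' } # placeholder
      ]
    )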

How to add storage-level caching between DynamoDB and Titan?

I am using the Titan/DynamoDB library to use AWS DynamoDB as a backend for my Titan DB graphs. My app is very read-heavy and I noticed Titan is mostly executing query requests against DynamoDB. I am using transaction- and instance-local caches and indexes to reduce my DynamoDB read units and the overall latency. I would like to introduce a cache layer that is consistent for all my EC2 instances: A read/write-through cache between DynamoDB and my application to store query results, vertices, and edges.
I see two solutions to this:
Implicit caching done directly by the Titan/DynamoDB library. Classes like the ParallelScanner could be changed to read from AWS ElastiCache first. The change would have to be applied to read & write operations to ensure consistency.
Explicit caching done by the application before even invoking the Titan/Gremlin API.
The first option seems to be the more fine-grained, cross-cutting, and generic.
Does something like this already exist? Maybe for other storage backends?
Is there a reason why this does not exist already? Graph DB applications seem to be very read-intensive, so cross-instance caching seems like a pretty significant feature to speed up queries.
First, ParallelScanner is not the only thing you would need to change. Most importantly, all the changes you need to make are in DynamoDBDelegate (that is the only class that makes low-level DynamoDB API calls).
Regarding implicit caching, you could add a caching layer on top of DynamoDB. For example, you could implement a cache using API Gateway on top of DynamoDB, or you could use ElastiCache. Either way, you need to figure out a way to invalidate Query/Scan pages. Inserting/deleting items will cause page boundaries to change, so it requires some thought.
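As an illustration of the implicit option, here is a read-through item cache in Ruby, assuming an ElastiCache (Redis) endpoint and the redis gem; invalidating Query/Scan pages, as noted, is the hard part and is not shown:

    require 'json'
    require 'redis'
    require 'aws-sdk-dynamodb'

    REDIS = Redis.new(url: ENV['ELASTICACHE_URL']) # hypothetical endpoint
    DDB   = Aws::DynamoDB::Client.new
    TTL   = 60 # seconds; a short TTL bounds staleness

    # Serve single items from the cache when possible; fall back to
    # DynamoDB and populate the cache on a miss.
    def get_item_cached(table, key)
      cache_key = "#{table}:#{JSON.generate(key)}"
      if (hit = REDIS.get(cache_key))
        JSON.parse(hit)
      else
        item = DDB.get_item(table_name: table, key: key).item
        REDIS.setex(cache_key, TTL, JSON.generate(item)) if item
        item
      end
    end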
Explicit caching may be easier to do than implicit caching. The level of abstraction is higher, so based on your incoming writes it may be easier for you to decide at the application level whether a traversal that is cached needs to be invalidated. If you treat your graph application as another service, you could cache the results at the service level.
Something in between may also be possible (but requires some work). You could continue to use your vertex/database caches as provided by Titan, and use a low value for TTL that is consistent with how frequently you write columns. Or, you could take your caching approach a step further and do the following.
Enable a DynamoDB Stream on the edgestore table.
Use a Lambda function to forward the edgestore updates to a Kinesis Stream (sketched below).
Consume the Kinesis Stream of edgestore updates in the same JVM as the Gremlin Server on each of your Gremlin Server instances. You would need to instrument the database-level cache in Titan to consume the Kinesis stream and invalidate cached columns as appropriate, in each Titan instance.
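A sketch of the Lambda step in Ruby; the stream name is a placeholder, and using the item keys as the partition key keeps updates to the same column ordered:

    require 'json'
    require 'aws-sdk-kinesis'

    KINESIS = Aws::Kinesis::Client.new
    STREAM  = 'edgestore-updates' # placeholder Kinesis stream

    # Lambda handler for a DynamoDB Stream event source: forward each
    # change so every Gremlin Server JVM can consume it and invalidate
    # its local Titan cache.
    def handler(event:, context:)
      records = event['Records'].map do |r|
        keys = r.dig('dynamodb', 'Keys')
        {
          data: JSON.generate(event: r['eventName'], keys: keys),
          partition_key: JSON.generate(keys)[0, 256] # partition keys max out at 256 chars
        }
      end
      KINESIS.put_records(stream_name: STREAM, records: records)
    end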

Data-aware load balancing with embedded and distributed caches/datagrids

Sorry, I'm a beginner in load balancing.
In distributed environments we tend more and more to send the computation (map/reduce) to the data, so that the result gets computed locally and then aggregated.
What I'd like to do applies to partitioned/distributed data, not replicated data.
Following the same principle, I'd like to be able to send a user's request to the server where that user's data is cached.
When using an embedded cache or datagrid to get low response times with a large dataset, we tend to avoid replication and use distributed/partitioned caches.
The partitioning algorithms are generally hash-based and allow replicas, to handle server failures.
So in the end, a user's data is generally hosted on something like three servers (one primary copy and two replicas).
On a local cache miss, the caches are generally able to search for the entry on other cache peers.
This works fine but requires a network call.
I'd like a load balancing strategy that avoids this unnecessary network call.
What I'd like to know: is it possible to have a load balancer that is aware of the partitioning mechanism of the cache, so that it always forwards to one of the web servers holding a local copy of the data we need?
For example, say I have a request www.mywebsite.com/user=387
The load balancer would check the user id 387, know that this user is stored on servers 1, 6, and 12, and thus round-robin to one of them (or apply some other strategy).
If there's no generic solution, are there open-source or commercial, software or hardware load balancers that permit defining custom routing strategies?
How much does extracting data from a request slow down the load balancer? What's the cost of extracting a URL parameter (as in my example with user=387) and applying some rules to pick the right web server, compared to, say, a round-robin strategy?
Is there an abstraction library on top of cache vendors so that we can easily retrieve the partitioning data and make it available to the load balancer?
Thanks!
Interesting question. I don't think there is a readily available solution for your requirements, but it would be pretty easy to build if your hashing criterion is relatively simple and depends only on the request (a URL parameter, as in your example).
If I were building this, I would use Varnish (http://varnish-cache.org), but you could do the same in other reverse proxies.
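To make the idea concrete, here is the routing logic such a proxy would implement, sketched in Ruby. The backend list, replica count, and hash function are stand-ins; in practice the partition map must match the datagrid's own hashing, so you would export it from the grid rather than recompute it:

    require 'digest'

    BACKENDS = (1..12).map { |i| "server#{i}" } # stand-in for real hosts
    REPLICAS = 3 # one primary + two replicas, as described above

    # Hash the routing key to a partition, then list the servers that
    # hold a copy of that partition's data.
    def owners_for(user_id)
      partition = Digest::MD5.hexdigest(user_id.to_s).to_i(16) % BACKENDS.size
      Array.new(REPLICAS) { |i| BACKENDS[(partition + i) % BACKENDS.size] }
    end

    # Round-robin (here: random) among the owning servers only.
    def route(user_id)
      owners_for(user_id).sample
    end

    puts owners_for(387).inspect # servers holding user 387's data
    puts route(387)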
