DynamoDB vs MySQL with Amazon Lambda - aws-lambda

Is it better to use DynamoDB with Lambda functions? (And if so, why? Is the connection faster for some reason? Maybe Lambda is designed to be more compatible with DynamoDB than with MySQL or other databases?)
I ask because the official AWS Lambda main page refers to "DynamoDB" but not to "MySQL" or any other DB system:
https://aws.amazon.com/lambda/?nc1=h_ls
But I've found some tutorials on connecting to MySQL from Lambda functions:
https://aws.amazon.com/blogs/database/query-your-aws-database-from-your-serverless-application/
NOTE: I'm not asking for the difference between DynamoDB and MySQL, relational vs non-relational DBs, etc. This is about "Lambda and databases". I need my Lambda function to read from and write to some DB. I was thinking of MySQL first, but because the only reference on the main page is to DynamoDB, I'm a bit confused about whether I should choose that for some reason (performance, connection speed, or some other limitation).

First of all, AWS Lambda can connect to almost any database system.
There are several benefits to using Lambda with DynamoDB:
IAM policies can be used for fine-grained access control.
Both are serverless offerings from AWS.
Provisioning is simple with AWS SAM.
DynamoDB supports streams with Lambda for data-driven workflows.
On the other hand, using MySQL or any other database will require placing the database in a private subnet inside a VPC to follow security best practices. The Lambda functions must then also be placed within the VPC, which has historically hurt performance: an Elastic Network Interface (ENI) had to be attached when a function was provisioned, which increased Lambda's cold start time. (AWS reworked Lambda's VPC networking in 2019 so that network interfaces are created when the function is configured and are shared across execution environments, which removes most of this penalty.)
It's also challenging to manage stateful connections to other databases, since Lambda is stateless, which also impacts performance.
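One common way to soften the connection problem is to open the connection once per execution environment, outside the handler body, so that warm invocations reuse it. Below is a minimal sketch of that pattern. It uses the standard library's `sqlite3` purely as a stand-in so the example runs anywhere; with MySQL you would call your client library's connect function at the marked line. The `visits` table and the `handler` payload are hypothetical.

```python
import sqlite3

# Created once per execution environment and reused by warm invocations.
_connection = None


def get_connection():
    """Return the cached connection, opening it on first use."""
    global _connection
    if _connection is None:
        # Stand-in (assumption): replace with your MySQL client's connect call.
        _connection = sqlite3.connect(":memory:")
        _connection.execute(
            "CREATE TABLE IF NOT EXISTS visits (id INTEGER PRIMARY KEY, path TEXT)"
        )
    return _connection


def handler(event, context):
    """Hypothetical Lambda handler: one write and one read per invocation."""
    conn = get_connection()
    conn.execute("INSERT INTO visits (path) VALUES (?)", (event.get("path", "/"),))
    conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM visits").fetchone()
    return {"visits": count}
```

Each concurrent execution environment still holds its own connection, so a burst of traffic can open many connections at once; this sketch only avoids reconnecting on every invocation.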

DynamoDB is a fully managed service with no minimum fee: you pay only for the resources you provision. AWS takes care of delivering millisecond latency at any scale.
RDS for MySQL lets you set up, operate and scale a database on AWS, so you need to choose a Multi-AZ or Single-AZ deployment, a DB instance class (micro, small, large, xlarge), storage, etc.
DynamoDB is a distributed NoSQL solution designed for very large data stores and extremely high-throughput NoSQL applications, while RDS shines as a smaller-scale traditional RDBMS that offers far more query and design flexibility.
For a simple application and a small data set you can go with DynamoDB. For a large and complex application, go for DynamoDB if you need high throughput, or choose RDS if you are looking for a cheaper option.

Related

Databricks or AWS Lambda for low throughput event driven architecture

I am looking to set up an event-driven architecture to process messages from SQS and load them into AWS S3. The events will be low volume, and I was looking at using either Databricks or AWS Lambda to process these messages, as these are the two tools we have already procured.
I wanted to understand which one would be best to use. I'm struggling to differentiate them for this task, as the throughput is only up to 1000 messages per day and unlikely to go higher at the moment, so both are capable.
I just wanted to see what other people would consider to be the differentiators between these two products, so I can make sure this is future-proofed as best I can.
We have used Lambda more where I work, and it may help to keep things consistent as we have more AWS skills in house, but we are looking to build out our Databricks capability and I personally find it easier to use.
If it was big data then I would have made the decision easier.
Thanks
AWS Lambda seems to be a much better choice in this case. Here are some benefits you will get with Lambda compared to Databricks.
Pros:
Free of cost: AWS Lambda is free for 1 million requests per month and 400,000 GB-seconds of compute time per month, so your request rate of 1000/day (roughly 30,000 per month) will easily be covered.
Very simple setup: The Lambda function implementation will be very straightforward. Connect the SQS queue to your Lambda function using the AWS Console or the AWS CLI. The Lambda function code will be just a couple of lines: it receives the message from the SQS queue and writes it to S3.
Logging and monitoring: You won't need any separate setup to track performance metrics such as how many messages were processed by Lambda, how many were successful, and how much time it took. All these metrics are automatically generated by AWS CloudWatch. You also get a built-in retry mechanism: just specify the retry policy and AWS Lambda will take care of the rest.
Cons:
One drawback of this approach is that each invocation of Lambda will write to a separate file in S3, because S3 doesn't provide APIs to append to existing files. So you will get 1000 files in S3 per day. Maybe you are fine with this (it depends on what you want to do with this data in S3). If not, you will either need a separate job to join all the files periodically, or you will need to download the existing file from S3, append to it and upload it back, which makes your Lambda a bit more complex.
Databricks, on the other hand, is built for different kinds of use cases: loading large datasets from Amazon S3 and performing analytics, SQL-like queries, building ML models, etc. It won't be suitable for this use case.
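To make the "couple of lines" claim concrete, here is a minimal sketch of such a handler. It assumes the standard SQS event shape that Lambda delivers (a `Records` list whose entries carry `messageId` and `body`) and writes one object per message, which matches the drawback described above. The bucket name `example-bucket`, the `messages/` key prefix and the optional `s3` parameter (there only so the function can be exercised without AWS) are assumptions for illustration.

```python
def object_key(record):
    """Build one S3 key per SQS message, using the message ID for uniqueness."""
    return f"messages/{record['messageId']}.json"


def handler(event, context, s3=None, bucket="example-bucket"):
    """Write each SQS message body in the batch to its own S3 object."""
    if s3 is None:
        # Imported lazily so the sketch can be tested without boto3 installed.
        import boto3
        s3 = boto3.client("s3")
    keys = []
    for record in event["Records"]:
        key = object_key(record)
        s3.put_object(Bucket=bucket, Key=key, Body=record["body"].encode("utf-8"))
        keys.append(key)
    return {"written": keys}
```

Lambda itself only passes `event` and `context`, so in production the defaults apply and a real S3 client is created.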

AWS Lambda vs EC2 REST API

I am not an expert in AWS but have some experience. We have a situation where an Angular UI (hosted on EC2) has to talk to an RDS DB instance. Everything in the stack is set so far except the API (middleware). We are thinking of using Lambda, as our traffic is unknown at this time. Here again we have a lot of choices to make on the programming side, such as C#, Python or Node. (We are leaning towards C# or Python based on some research and our skills; Python has good cold start times, and C# on .NET Core is stable in terms of performance.)
Since we are going with Lambda, of course we should go the API Gateway route. That is all set, but can all the business logic of our application reside in Lambda? If so, wouldn't the Lambda become huge and take a performance hit (more memory, more computational resources, and thus higher costs)? We then thought of having Lambda handle lightweight processing, with the heavy lifting moved to a .NET API hosted on EC2.
We are not sure whether there are any issues with this approach. I should also mention that Lambda has to call RDS for CRUD operations, so should I think much about concurrency issues, as this might fall into the stateful category?
The advantage of AWS Lambda here is scaling. As you already know, Lambda is fully managed, so you can take advantage of that in this case.
If you host the API on EC2, you don't have the "scaling" part in place. Of course, you can start using ECS, an Auto Scaling group and so on, but that takes you down another route.
Are you building this application for learning or for production?

Redis vs dynamoDb geolocation tracking

I am currently a bit confused about which database to use for geolocation tracking. What I want to do is update the location of a group of people every 30 seconds. The data is sent to the server using WebSockets. Each user has an ID in the database, and I would like to update the location of that user every 30 seconds. After doing so, I would like to query these locations and show them in real time to another group of users. My question is: what are the advantages and disadvantages of DynamoDB and Redis? Which one is faster and can scale more easily? I am expecting almost 2 million QPS.
Both can scale fairly well, but this depends heavily on your use case and architecture.
DynamoDB is a cloud based NoSQL storage system, and Redis is an in memory data structure store. This means that queries to DynamoDB would involve making a roundtrip to Amazon's servers, while queries to Redis would be over RAM (so, much, much lower latency).
As a consequence of the above, the amount of data you can store in Redis would be limited by the RAM available on your hardware. That said, in the event of Redis or your hardware crashing for some reason, you would have to be content with some level of data loss. You can mitigate this somewhat by configuring Redis persistence so that Redis writes to disk regularly (either every N seconds or by manually triggering a write in your code) and mitigate further by then copying those writes to S3 or elsewhere. This trades performance (depending on your scale) for data safety somewhat due to I/O latency. See the documentation for Redis persistence and this blog post by the GitHub engineering team mentioning their decision to remove Redis persistence for performance reasons.
Meanwhile all of the issues above are abstracted away for you by DynamoDB since AWS handles availability for you behind the scenes. You are really only limited by how much you can afford and usage (read/write per second) limits.
DynamoDB does not have native support for querying and inserting geospatial data (there is a library for it, but it seems to be unmaintained), whereas Redis does. With DynamoDB, you could write your own code for this.
DynamoDB does not have support for namespacing, or rather, DynamoDB is namespaced by your AWS account meaning you would not be able to maintain a separate DynamoDB instance with the same table names (say for production vs dev data) on the same AWS account. Redis doesn't either, but you can trivially spin up a separate Redis instance for this.
See also Redis MEMORY USAGE command and Redis memory optimization docs.
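As a rough illustration of what "writing your own code" for geospatial queries could look like when the store has no native support, here is a minimal sketch that filters users by great-circle (haversine) distance in application code. The `locations` dictionary stands in for whatever the database returns, and the function names are hypothetical. A linear scan like this would not hold up at the query rates mentioned in the question; it only shows the distance logic that a real solution would pair with some form of spatial indexing.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def users_within(locations, lat, lon, radius_km):
    """Return sorted IDs of users whose last position is within radius_km."""
    return sorted(
        user_id
        for user_id, (ulat, ulon) in locations.items()
        if haversine_km(lat, lon, ulat, ulon) <= radius_km
    )
```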

Can Aurora PostgreSQL be used with AWS AppSync?

I've seen examples of DynamoDB as the data source for AWS AppSync but I'm wondering if Aurora (specifically PostgreSQL) can be used? If yes, what would the resolvers look like for a basic example? Are there any resources that demonstrate doing this for Aurora PostgreSQL or even MySQL?
It cannot. You can use Aurora Serverless as the data source, which is driven by the Data API (still in beta); this allows you to configure resolvers as database queries. That being said, the Data API is still very slow, and Aurora Serverless has a cold start of 30 seconds or so, as it needs to run from a VPC. I would recommend avoiding it in production, but it is worth playing around with.
You are much better off using Lambdas as resolvers or making RESTful HTTP calls from within the resolvers.
Ignore the comments provided in the answer. No disrespect, but the comments are coming from people who have never managed production at scale. Having a fully managed GraphQL service at scale, with a high security posture, will save you months of maintenance nightmares when your product(s) reach anything close to 1 million in revenue.
You can use the AWS Lambda resolver available in AWS AppSync to access Aurora Postgres. The code is similar to how you would access a relational database using any language. For example, you could use node-postgres with NodeJS to implement the Lambda function.
Yes, this can be done.
Do take a look at this open-source repo that does exactly that: https://github.com/wednesday-solutions/appsync-rds-todo
As of time of writing, yes but only if it is a Serverless Aurora RDS cluster set to Postgres compatibility. The reason for this is it's the only RDS instance type that supports the Data API. Other RDS instances would have to be configured as a different data source type, most commonly Lambda.
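For the Lambda-resolver route mentioned in the answers, a resolver function might look roughly like the sketch below. It is written in Python rather than the NodeJS/node-postgres combination one answer suggests, and it assumes a direct Lambda resolver, where AppSync passes the GraphQL field name under `event["info"]["fieldName"]` and the field arguments under `event["arguments"]`. The `todos` table, the `getTodo`/`listTodos` fields and the `run_query` helper are hypothetical; `run_query` is a placeholder you would implement with your PostgreSQL driver of choice.

```python
def run_query(sql, params):
    """Hypothetical helper: execute SQL against Aurora PostgreSQL, return rows as dicts."""
    raise NotImplementedError("wire this up to your PostgreSQL driver")


def handler(event, context, query=run_query):
    """Dispatch on the GraphQL field name and return data shaped for the schema."""
    field = event["info"]["fieldName"]
    args = event.get("arguments") or {}
    if field == "getTodo":
        rows = query("SELECT id, title FROM todos WHERE id = %s", (args["id"],))
        return rows[0] if rows else None
    if field == "listTodos":
        return query("SELECT id, title FROM todos ORDER BY id", ())
    raise ValueError(f"Unsupported field: {field}")
```

Passing the query function as a parameter is only a convenience for testing the dispatch logic without a database.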

How do you distribute your app across multiple servers using EC2?

For the first time I am developing an app that requires quite a bit of scaling; I have never had an application that needed to run on multiple instances before.
How is this normally achieved? Do I cluster SQL servers, then mirror the code across all servers and use load balancing?
Or do I separate out the functionality, running some on one server and some on another?
Also, how do I push out code to all my EC2 Windows instances?
This will depend on the requirements you have. But as a general guideline (I am assuming a website), I would separate the DB, web server, caching server, etc. onto different instance(s) and use S3 (plus CloudFront) for static assets. I would also make sure that proper rate limiting is in place so that only legitimate load reaches the infrastructure.
For the RDBMS server I might set up a master-slave DB setup (RDS makes this easier), use DB sharding, etc. DB cluster solutions also exist, which are more complex to set up but simplify database access for the application programmer. I would also check all the DB queries and then tune the DB and SQL queries accordingly. In some cases pure NoSQL databases might be better than an RDBMS, or a mix of both where the application switches between them depending on the data required.
For the web server I would set up a load balancer and then use auto scaling on the web server instance(s) behind the load balancer. Something similar applies to the app server, if any. I would also tune the web server settings.
The caching server will also be separated into its own cluster of instance(s). ElastiCache seems like a nice service. Redis has comparable performance to Memcached but has more features (like lists, sets, etc.), which might come in handy when scaling.
Disclaimer - I'm not going to mention any Windows specifics because I have always worked on Unix machines. These guidelines are fairly generic.
This is a subjective question and everyone would tailor their own system in their own style. Here are a few guidelines I follow.
If it's a web application, separate the presentation (front-end), middleware (APIs) and database layers. A layered architecture scales better than a monolithic application.
Database - Amazon provides excellent and highly available services (unless you are on us-east availability zone) for SQL and NoSQL data stores. You might want to check out RDS for relational databases and DynamoDB for NoSQL. Both scale well, and you need not worry about managing, sharding or clustering your data stores once you launch them.
Middleware APIs - This is a crucial part. It is important to have a set of APIs (preferably REST, but you could pretty much use anything here) which expose your back-end functionality as a service. A service oriented architecture can be scaled very easily to cater multiple front-facing clients such as web, mobile, desktop, third-party widgets, etc. Middleware APIs should typically NOT be where your business logic is processed, most of it (or all of it) should be translated to database lookups/queries for higher performance. These services could be load balanced for high availability. Amazon's Elastic Load Balancers (ELB) are good for starters. If you want to get into some more customization like blocking traffic for certain set of IP addresses, performing Blue/Green deployments, then maybe you should consider HAProxy load balancers deployed to separate instances.
Front-end - This is where your presentation layer should reside. It should avoid any direct database queries except for the ones which are limited to the scope of the front-end, e.g. a simple Redis call to get the latest cache keys for front-end fragments. Here is where you can perform a lot of caching, from the service calls right through to the front-end fragments. You could use AWS CloudFront for static asset delivery and AWS ElastiCache for your cache store. ElastiCache is essentially a managed Memcached or Redis cluster. You should also consider load balancing the front-end nodes behind an ELB.
All this can be bundled and deployed with Auto Scaling using AWS Elastic Beanstalk. It currently supports ASP.NET, PHP, Python, Java and Ruby containers. AWS Elastic Beanstalk still has its own limitations, but it is a very cool way to manage your infrastructure with the least hassle for monitoring, scaling and load balancing.
Tip: Identifying the read and write intensive areas of your application helps a lot. You could then go ahead and slice your infrastructure accordingly and perform required optimizations with a read or write focus at a time.
To sum it all up, AWS has pretty much everything you could possibly use to craft your server topology. It's up to you to choose the components.
Hope this helps!
The way I would do it is to have one server as the DB server with MySQL running on it, and all my data in memcached, which can span multiple servers, with my clients following a simple "if not in memcached, read from the DB, put it in memcached and return".
Memcached is very easy to scale compared to a DB. Scaling a DB takes a lot of administrative effort, and it's a pain to get it right and working. So I choose memcached. In fact, I keep extra memcached servers up just to cover downtime (if any) of my memcached servers.
My data is mostly read, with few writes. And when writes happen, I push the data to memcached too. All in all this works better for me in terms of code, administration, fallback, failover and load balancing. All win. You just need to code a "little" bit better.
Clustering MySQL is more tempting, as it seems easier to code, deploy, maintain and keep up and performing. Remember that MySQL is hard-disk based and memcached is memory based, so by nature it's much faster (at least 10 times). And since it takes over all the read load from the DB, your DB config can be REALLY simple.
I really hope someone points to a contrary argument here, I would love to hear it.
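The read and write paths described in this answer ("if not in memcached, read from the DB, put it in memcached and return", plus pushing writes to the cache too) can be sketched as below. Plain dictionaries stand in for both memcached and MySQL so the example is runnable on its own; the class name and the `db_reads` counter are hypothetical and only there to make the behaviour visible.

```python
class CachedStore:
    """Cache-aside reads with write-through updates, using dict stand-ins."""

    def __init__(self, cache, db):
        self.cache = cache  # stand-in for a memcached client
        self.db = db        # stand-in for the MySQL database
        self.db_reads = 0   # counts how often reads fall through to the DB

    def get(self, key):
        value = self.cache.get(key)
        if value is None:
            # Cache miss: read from the DB, then populate the cache.
            self.db_reads += 1
            value = self.db.get(key)
            if value is not None:
                self.cache[key] = value
        return value

    def put(self, key, value):
        # Write to the DB first, then push the same value to the cache.
        self.db[key] = value
        self.cache[key] = value
```

After the first read of a key, repeated reads are served from the cache and `db_reads` stops growing, which is why the database configuration can stay simple for a read-heavy workload.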

Resources