How to see the total space used and available in an AWS Aurora schema - amazon-aurora

Does anyone know how to see the total space used and available in an AWS Aurora schema, as well as the memory of the cluster, from within the database itself using queries? We are using AWS Aurora PostgreSQL and don't have console access, so we want to do this with queries, as we would for Oracle.

This is most likely not doable with a SQL query in Aurora, specifically the volume size. As an alternative, if you do have AWS CLI access, you should be able to query CloudWatch metrics under the "AWS/RDS" namespace to get these values in a more reliable and accurate manner. The ones you are interested in are "VolumeBytesUsed" and "CPUUtilization", if I'm not mistaken.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Monitoring.html
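For example, here is a rough boto3 sketch of pulling those metrics; the cluster identifier, region, and time window are placeholder assumptions, and the same call can be made with `aws cloudwatch get-metric-statistics` from the CLI:

```python
# Sketch: pull Aurora cluster-level metrics from CloudWatch with boto3.
# "my-aurora-cluster" and the region are placeholders for your own values.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="VolumeBytesUsed",  # storage used by the cluster volume
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-aurora-cluster"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"] / 1024 ** 3, "GiB used")
```

The same call with MetricName="CPUUtilization" (and the DBInstanceIdentifier dimension) covers the instance-level side.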

Related

AWS RDS allocated_storage size

I'm creating an AWS RDS instance to migrate an existing on-prem Oracle database to RDS.
The existing on-prem Oracle database is approximately 700 GB.
I have two questions:
What allocated storage size should I use for the RDS Oracle database instance?
Should it be equal to or greater than the on-prem Oracle database?
Which instance type is suitable for a database of this size?
If you are using the S3 import method with Data Pump, you will need enough space to download the dump file to the RDS Oracle instance and then restore it, so a little more than double the source size (~1.5 TB) is a good starting point. If you are using another method that doesn't require the dump file on the RDS instance, you can try provisioning around 800 GB initially.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Oracle.Procedural.Importing.html#Oracle.Procedural.Importing.DataPump
The instance type depends on your workload; look at your current Oracle AWR reports to estimate suitable CPU and RAM requirements. Nevertheless, you can start with something you expect to suffice and scale up or down as needed.
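Purely as an illustration, here is a hedged boto3 sketch of provisioning such an instance; the identifier, credentials, instance class, and storage figure are placeholders to be replaced by the outcome of your AWR analysis:

```python
# Sketch: provision an RDS Oracle instance sized for a ~700 GB source database.
# All names and the instance class are placeholders, not recommendations.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="oracle-migration-target",  # placeholder
    Engine="oracle-ee",
    DBInstanceClass="db.r5.2xlarge",                 # adjust after reviewing AWR reports
    AllocatedStorage=1500,                           # ~1.5 TB leaves room for the Data Pump dump
    StorageType="gp2",
    MasterUsername="admin",                          # placeholder
    MasterUserPassword="CHANGE_ME",                  # placeholder
    LicenseModel="bring-your-own-license",
)
```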

ETL process in AWS using EC2 instances and EFS

I am a data engineer with experience in designing and creating data integration and ELT processes. Below is my use case; I need to move my process to AWS and would like your opinion.
The files to be processed are in S3, and I need to process them using Hadoop. I have existing logic written in Hive and just need to migrate it to AWS. Is the approach below correct/feasible?
1. Spin up a fleet of EC2 instances, initially say 5, with autoscaling enabled.
2. Create an EFS file system and mount it on the EC2 instances.
3. Copy the files from S3 to EFS as Hadoop tables.
4. Run Hive queries on top of the data in EFS and create new tables.
5. Once the process is complete, move/export the final reports table from EFS back to S3 (somehow). I'm not sure whether this is possible; if it isn't, the entire solution is not feasible.
6. Terminate EFS and the EC2 instances.
If the above method is correct, how does the Hadoop orchestration happen using EFS?
Thanks,
KR
Spin up a fleet of EC2 instances, initially say 5, with autoscaling enabled.
I'm not sure you need the autoscaling. Why? Let's say you start a "big" query which takes a lot of time and CPU. Autoscaling will start more instances, but how would it run a fraction of the query on the new machines? All machines need to be ready before you run the query; keep that in mind. In other words, only the machines that are available now will handle the query.
Copy the files from S3 to EFS as Hadoop tables.
There isn't any problem with this idea; just keep in mind that you can keep the data in EFS. If EFS is too pricey for you, check the options for provisioning EBS magnetic volumes with RAID 0; you will get good speeds at minimal cost.
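Here is a minimal sketch of steps 3 and 5, assuming the EFS file system is already mounted at /mnt/efs/warehouse on the instances and using placeholder bucket and prefix names:

```python
# Sketch: copy input files from S3 onto the EFS mount (step 3) and upload the
# final reports back to S3 (step 5). Bucket, prefixes and mount point are placeholders.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"        # placeholder
INPUT_PREFIX = "raw/"            # placeholder
OUTPUT_PREFIX = "reports/"       # placeholder
EFS_ROOT = "/mnt/efs/warehouse"  # EFS mount point on the EC2 instances

# Step 3: pull the raw files down onto EFS so Hive/Hadoop can see them.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=INPUT_PREFIX):
    for obj in page.get("Contents", []):
        local_path = os.path.join(EFS_ROOT, obj["Key"])
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], local_path)

# Step 5: push the final reports table (files under a local directory) back to S3.
reports_dir = os.path.join(EFS_ROOT, "final_reports")
for root, _, files in os.walk(reports_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = OUTPUT_PREFIX + os.path.relpath(local_path, reports_dir)
        s3.upload_file(local_path, BUCKET, key)
```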
The rest is okay, and this is one of the ways to do "on demand" interactive analytics.
Also take a look at AWS Athena. It's a service that lets you run queries directly on S3 objects. You can use JSON and even Parquet (which is much more efficient!). That service may be enough for your needs.
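For instance, here is a hedged sketch of running an Athena query over the S3 data with boto3; the database, query, and results location are placeholders:

```python
# Sketch: run a query with Athena directly against data in S3 instead of
# copying it to EFS. Database, table and output location are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "my_database"},                  # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```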
Good luck!

Restrict Lambda to use only some data centres?

I'm researching my options for publishing microservices on AWS Lambda. As we will be billed for execution time and memory used, the performance of our Lambda functions is pretty important.
My DB is Cassandra or another NoSQL database distributed over several AWS nodes across several data centres.
How can I configure Lambda to make sure that my Lambda function, including DB access, uses a DB node in the same data centre?
I know that Cassandra drivers intelligently route queries to the nearest DB node, but what I need is to restrict Lambda execution to just the data centres where I have a DB node.
You can create a Lambda execution role using IAM that has access to only specific AWS resources (you can be as restrictive or open on access control as you like by specifying that in the policies).
Then assign that role to your Lambda function and run the function in a VPC (or the default VPC); the subnets you attach it to determine which Availability Zones it can run in.
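As an illustrative boto3 sketch (the function name, role ARN, subnet IDs, and security group ID are placeholders), attaching the function only to subnets in the Availability Zones that host your DB nodes:

```python
# Sketch: create a Lambda function whose VPC config only includes subnets in
# the Availability Zones where the Cassandra nodes live. All IDs are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-1")

lambda_client.create_function(
    FunctionName="my-microservice",                        # placeholder
    Runtime="python3.9",
    Role="arn:aws:iam::123456789012:role/my-lambda-role",  # the IAM role mentioned above (placeholder)
    Handler="app.handler",
    Code={"ZipFile": open("function.zip", "rb").read()},
    VpcConfig={
        # Only subnets in the AZs ("data centres") that contain DB nodes.
        "SubnetIds": ["subnet-0aaa1111", "subnet-0bbb2222"],  # placeholders
        "SecurityGroupIds": ["sg-0ccc3333"],                  # placeholder
    },
)
```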

What to use to read/write from DynamoDB from Spark?

I'd like to know what's best to use to read/write DynamoDB from Spark.
I've tried the official DynamoDB API, the EMR connector (with Hadoop and also with Hive), and others.
But I've found (among other problems) that performing a query requires a full scan, which isn't viable for big tables.
Any suggestions please?
The approach you tried with emr-dynamodb-connector is generally the way most people use it.
However, there is a library you could use to connect to DynamoDB.
Generally, accessing DynamoDB from Spark is difficult because you have now tied your Spark executors to DynamoDB throttling. One alternative you could try is HBase or Cassandra, which I have found better supported for Spark usage (predicate pushdown, etc.).
Generally, the way I use DynamoDB data on a cluster with Spark is via DynamoDB Streams: collect the stream data in S3 and apply batch processing to that data.
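For reference, here is a hedged PySpark sketch of the emr-dynamodb-connector read path mentioned above; it assumes the connector JAR is on the classpath (it ships with EMR), and the table name, region, and read ratio are placeholders:

```python
# Sketch: read a DynamoDB table through emr-dynamodb-connector's Hadoop InputFormat.
# Assumes the emr-dynamodb-hadoop JAR is available (pre-installed on EMR clusters).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamodb-read-sketch").getOrCreate()
sc = spark.sparkContext

conf = {
    "dynamodb.input.tableName": "my_table",    # placeholder table name
    "dynamodb.regionid": "us-east-1",          # placeholder region
    "dynamodb.throughput.read.percent": "0.5"  # stay well below the table's read capacity
}

items = sc.hadoopRDD(
    "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)

# Each value is a DynamoDBItemWritable; here we just count the items.
print(items.count())
```

Note that this is still a scan under the hood, which is exactly the throttling concern described above.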

Elasticsearch with Google BigQuery

I have event logs loaded into Elasticsearch and I visualise them using Kibana. The event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive; then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I'm wondering whether I should upload the JSON files to a Kubernetes cluster and then establish the connection between the cluster and Elasticsearch.
Any help in this regard would be appreciated.
Although this solution may be a little complex, I suggest using the Google Cloud Storage connector together with ES-Hadoop. Both are very mature and used in production by many large companies.
Logstash across a lot of pods on Kubernetes will be very expensive and, I think, not a very resilient or scalable approach.
Apache Beam has connectors for BigQuery and Elasticsearch. I would definitely do this with Dataflow so you don't need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (have a look at "BigQueryIO Read vs fromQuery" if performance is important) and load it into Elasticsearch using ElasticsearchIO.write().
Refer to this example of reading data from BigQuery in Dataflow:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
Elasticsearch indexing:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Earlier this year the BigQuery Storage API was released, which improves the parallelism of extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation:
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
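As a rough Python-SDK counterpart of that read path (the Java pipeline above would use BigQueryIO with Method.DIRECT_READ; the project, table, and bucket names here are placeholders, and the Elasticsearch write would still be a separate transform):

```python
# Sketch: read from BigQuery via the Storage API with the Beam Python SDK.
# Project, table and temp bucket are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                     # placeholder
    region="us-central1",
    temp_location="gs://my-temp-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    rows = p | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
        table="my-project:my_dataset.event_logs",            # placeholder table
        method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,  # BigQuery Storage API
    )
    # Each element is a dict; index it into Elasticsearch in a later transform
    # (e.g. a custom DoFn using the elasticsearch client, or ElasticsearchIO in Java).
    rows | "Preview" >> beam.Map(print)
```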
I have recently worked on a similar pipeline. The workflow I would suggest either uses the previously mentioned Google Cloud Storage connector, or other methods, to read your JSON files into a Spark job. You should be able to transform your data quickly and easily, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
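A minimal PySpark sketch of that approach, assuming the GCS connector and the elasticsearch-hadoop (elasticsearch-spark) package are available on the cluster, with placeholder bucket, index, and host names:

```python
# Sketch: read the exported JSON from GCS and index it into Elasticsearch with
# the elasticsearch-spark connector. Bucket, index and ES host are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bq-export-to-es")
    # On Dataproc the GCS connector is pre-installed; elasticsearch-hadoop can be
    # added with --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>
    .getOrCreate()
)

events = spark.read.json("gs://my-export-bucket/event-logs/*.json")  # placeholder path

# Optional transforms here, then write to Elasticsearch.
(
    events.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "my-es-host")     # placeholder Elasticsearch host
    .option("es.port", "9200")
    .option("es.resource", "event-logs")  # target index (placeholder)
    .mode("append")
    .save()
)
```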
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and Elasticsearch.
More information is in a blog post by elastic.co.
Further documentation and a step-by-step process are available from Google.
