For one of my Hadoop deployments I'm using previous-generation instances (m1.xlarge, m1.large, etc.). An m1.xlarge instance comes with 4 x 420 GiB of instance store. Is the instance store safe enough to store data, or do I need to go for EBS?
thanks
It really depends on how much persistence you want for your data (or how you would define safer). The instance store will be lost if the instance is terminated or stopped. You also have the risk of AWS "losing" your instance if you plan to run them for a long time (in my case, I've seen instances fail after approximately one year, however we've had instances run flawlessly for over 3 years).
So if you need persistence, go with EBS and, if needed, compensate for performance differences by using EBS with provisioned IOPS or a RAID array of EBS volumes. If your use case is just importing data, crunching it in Hadoop and exporting it somewhere else, you can safely choose the instance store (we're doing this using EMR, for example).
I need to use SQL in multiple different locations. The best option would be to place some databases (or even some records, like tagging in Mongo) in different locations. Is this possible with Google Cloud SQL?
There may be two scenarios:
One single Cloud SQL instance in multiple locations
Different Cloud SQL instances in multiple locations
When you create a Cloud SQL instance, you choose a region where the instance and its data are stored. To reduce latency and increase availability, choose the same region for your data and your Compute Engine instances, standard environment apps, and other services.
There are mainly two location types: a regional location, i.e. a specific geographic place, and a multi-regional location, which contains at least two geographic places. Multi-regional locations are only used for backup operations in Cloud SQL.
You choose a location when you first create the instance. The location can't be changed after the instance is created.
One single region consists of many small data centers called zones. While creating a Cloud SQL instance you can specify the instance to be available in a single zone or in two different zones within the selected region. Selecting the Cloud SQL instance to be in two different zones is called High Availability (HA) configuration.
The purpose of an HA configuration is to reduce downtime when a zone or instance becomes unavailable which might happen during a zonal outage, or when an instance becomes corrupted. With HA, your data continues to be available to client applications.
The HA configuration provides data redundancy. A Cloud SQL instance configured for HA is also called a regional instance and is located in a primary and secondary zone within the configured region. Within a regional instance, the configuration is made up of a primary instance and a standby instance.
So, for the first scenario: a single Cloud SQL instance can be located in multiple locations if you consider different zones to be different locations (which is reasonable, as two zones are physically separate data centers within a single GCP region). However, it can only be located in two zones, and for that you have to configure High Availability (HA) for the instance.
For the second scenario, you can always create different Cloud SQL instances in different regions.
You can go through instance locations in Cloud SQL and overview of HA configuration to have a brief understanding of the above.
There is another option in Cloud SQL called read replicas.
You use a read replica to offload work from a Cloud SQL instance. The read replica is an exact copy of the primary instance. Data and other changes on the primary instance are updated in almost real time on the read replica.
Read replicas are read-only; you cannot write to them. The read replica processes queries, read requests, and analytics traffic, thus reducing the load on the primary instance.
If you want the data to be available in multiple locations you may consider using cross-region read replicas.
Cross-region replication lets you create a read replica in a different region from the primary instance.
Cross-region read replicas have several advantages:
Improve read performance by making replicas available closer to your application's region.
Provide additional disaster recovery capability to guard against a regional failure.
Let you migrate data from one region to another.
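To illustrate the first advantage, here is a minimal sketch of how an application might route queries once cross-region read replicas exist. It is not part of any Cloud SQL client library: the SqlEndpoint class, the pick_endpoint function, and the hosts and regions are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlEndpoint:
    """One Cloud SQL instance as seen by the application (hypothetical model)."""
    host: str
    region: str
    is_primary: bool


def pick_endpoint(endpoints, client_region, for_write):
    """Return the endpoint a query should go to.

    Writes always go to the primary, because read replicas are read-only.
    Reads prefer a replica in the client's own region and fall back to the
    primary when no such replica exists.
    """
    primary = next(e for e in endpoints if e.is_primary)
    if for_write:
        return primary
    for endpoint in endpoints:
        if not endpoint.is_primary and endpoint.region == client_region:
            return endpoint
    return primary


# Hypothetical hosts and regions, for illustration only.
ENDPOINTS = [
    SqlEndpoint("10.0.0.2", "us-central1", True),
    SqlEndpoint("10.1.0.2", "europe-west1", False),
    SqlEndpoint("10.2.0.2", "asia-east1", False),
]
```

With these endpoints, a read from europe-west1 goes to the replica at 10.1.0.2, while any write goes to the primary at 10.0.0.2. Keep in mind that replicas are updated in almost real time, so a read issued right after a write may still see slightly older data.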
I have a cluster with two Redis docker instances (v3.2.5) I use for caching responses from Spring boot microservices.
I've disabled all persistence and the number of keys is stable over time, all of them expiring between 5 minutes and 1 day.
Despite this, I can see the memory usage creeping up. It looks like once a day (around midnight) it uses a lot of memory and then releases some of it.
Does anyone have any idea what this process may be, if there's any way to configure Redis to avoid using that much memory?
The number of keys I have doesn't justify this amount of memory
UPDATE
After taking a snapshot of the database and loading the data on a fresh new Redis instance (same version, same config) the used_memory_human value is 10 times lower than the original one.
Is it possible that key expiration doesn't really delete keys from memory?
I need Mongo cluster doing 2 operations:
1) get/update a single document - Mongo is great for realtime changes, excellent speed.
2) export all documents into JSON files (one file per category, there are about 15 categories) - this is very slow when I use a regular query. Maybe I do not know which command or options to use ... or I would need to fit it all into RAM, which is expensive. Even replication to a new Mongo instance is much faster (takes hours) than a query and writing the data to disk (takes days).
I have about 10 million documents. The Mongo data takes 250 GB on disk. There are about 15 categories for which I need separate files (at the moment all documents are in one collection regardless of category).
Which command should I use to export all data into files in a couple of hours?
How large should the AWS instances be to speed it up without paying too much for RAM? Would that help? Operation 2) must not cause a performance hit for operation 1) -- I cannot stop Mongo and use mongoexport.
I am not sure what kind of servers you are using but this may provide some further insights regarding the export/file creation performance and not shutting off mongo. One presumes you are working with a sharded and replicated cluster.
In my case I am on Azure VMs running Windows server in a replicated and sharded cluster. So I would take a copy of the Azure blobs associated with the data disks on a secondary in each RS. You should stop your balancer and lock the db on the secondary to do this. This should take a couple of minutes at most to copy only 250gb. Then I would restore the blobs to disks on a new VM.
Then you could query data out of this VM without affecting your cluster's performance. You may additionally add indexes for this export process since you are on a separate instance now.
Personally I use PowerShell to do this in Azure. Golang may be a better choice to write your queries in due to its parallel capabilities if JavaScript via the mongo shell fails you. I've had JS work faster than python code but it also depends on what you know.
This is just one way but it does address some of the criteria you posted.
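Once the data is queryable on a separate instance, the per-category export itself could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name export_by_category and the field name "category" are made up, the output is JSON Lines (one document per line) rather than a single JSON array, and the category values are assumed to be safe to use as file names. The input is any iterable of dicts, so a database cursor can be streamed through it without loading everything into RAM.

```python
import json
import os


def export_by_category(documents, out_dir, category_field="category"):
    """Stream documents into one JSON Lines file per category.

    `documents` can be any iterable of dicts, for example a database cursor,
    so the whole data set never has to fit in RAM. Returns a dict mapping
    each category to the number of documents written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    counts = {}
    try:
        for doc in documents:
            category = str(doc.get(category_field, "uncategorized"))
            if category not in handles:
                # One file per category, opened the first time it is seen.
                path = os.path.join(out_dir, category + ".jsonl")
                handles[category] = open(path, "w", encoding="utf-8")
                counts[category] = 0
            # default=str keeps non-JSON types (dates, ids) from raising.
            handles[category].write(json.dumps(doc, default=str) + "\n")
            counts[category] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

A single pass over the collection writes all category files at once, which avoids running one query per category over the full data set.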
I'm confused by this situation and have been trying to fix it for a couple of days now. I'm running 3 shards on top of three 3-member replica sets (rs0, rs1 and rs2). All is working so far. Data is distributed over the 3 shards as well as cloned within the replica sets.
BUT: importing data into one of the replica sets works fine at a constant 40k docs/s, but enabling sharding slows the entire process down to just 1.5k docs/s.
I've populated the data via different methods:
generated some random data in the mongo shell (running in my mongos)
JSON data import via mongoimport
MongoDB dump restore from another server via mongorestore
All of them result in just 1.5k doc/s which is disappointing. The mongod's are physical Xeon boxes with 32GB each, the 3 config servers are virtual servers (40 GB HDD, 2 GB RAM, if that matters), the mongos is running on my app server. By the way, the value of 1.5k inserts/s doesn't depend on the shard key, same behaviour for a dedicated shard key (single field key as well as compound key) as well as hashed shard key on _id field.
I tried a lot, even reinstalled the entire cluster twice. The question is: what is the bottleneck in this setup:
config servers running on virtual server? -> shouldn't be problematic due to the low resource consumption of config servers
mongos? -> running multiple Mongos on a dedicated box behind HAproxy might be an alternative, haven't tested that yet
Let's do the math first: how big are your documents? Keep in mind that they have to be transferred over the net multiple times depending on your write concern.
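As a rough sketch of that math, the helper below estimates the network throughput an insert rate requires. The function name is made up for this example, and the 1 KiB average document size and the count of 4 transfers per document used in the note below are pure assumptions, since the question gives neither.

```python
def required_bandwidth_mbit(docs_per_second, avg_doc_bytes, transfers_per_doc):
    """Network throughput in Mbit/s needed to sustain an insert rate.

    transfers_per_doc counts every hop a document makes, for example
    client -> mongos, mongos -> primary, and primary -> each secondary.
    """
    bytes_per_second = docs_per_second * avg_doc_bytes * transfers_per_doc
    return bytes_per_second * 8 / 1_000_000
```

Under those assumptions, required_bandwidth_mbit(40000, 1024, 4) comes out at about 1311 Mbit/s, which is more than a single gigabit link can carry, while required_bandwidth_mbit(1500, 1024, 4) is only about 49 Mbit/s. Plugging in your real document size shows quickly whether the network can be the limit at all.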
Maybe you are experiencing this because of the indices which have to be built.
Please try this:
Disable all indices except the one for _id (disabling that one is not possible anyway, iirc)
Load your data
Reenable indices.
Enable sharding and balancing if not done already
This is the suggested way for importing data into a sharded cluster anyway and should speed up your import considerably. Some (cautious!) fiddling with storage.syncPeriodSecs and storage.journal.commitIntervalMs might help, too.
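The first three import steps above could be sketched as follows. This is an illustration, not a drop-in tool: the function name is made up, and the collection argument is only assumed to behave like a pymongo Collection, using its index_information, drop_index, insert_many and create_index methods. Index options such as unique or sparse are not preserved by this sketch, and it assumes the collection is not sharded yet, since the index backing a shard key cannot be dropped.

```python
def load_without_secondary_indexes(collection, batches):
    """Drop secondary indexes, bulk insert, then rebuild the indexes.

    `collection` is expected to behave like a pymongo Collection
    (index_information, drop_index, insert_many, create_index).
    `batches` is an iterable of lists of documents.
    Returns the number of inserted documents.
    """
    # Remember every index except the mandatory one on _id.
    saved = {
        name: info["key"]
        for name, info in collection.index_information().items()
        if name != "_id_"
    }
    for name in saved:
        collection.drop_index(name)

    inserted = 0
    try:
        for batch in batches:
            # Unordered inserts let the server keep going past single failures.
            collection.insert_many(batch, ordered=False)
            inserted += len(batch)
    finally:
        # Rebuild the indexes even if the load fails part-way.
        for name, keys in saved.items():
            collection.create_index(keys, name=name)
    return inserted
```

Sharding and balancing would then be enabled afterwards, as in the last step of the list.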
The delay can occur even when storing the data on the primary shard. Depending on the size of your indices, they may slow down bulk operations considerably. You might also want to have a look at the replication.secondaryIndexPrefetch config option.
Another thing might be that your oplog simply gets filled faster than the replication can take place. Problem here: once it is created, you cannot increase its size. I am not sure whether it is safe to delete and recreate it in standalone mode and then reshare the replica set, but I doubt it. So the safe option would be to have the instance actually leave the replica set, reinstall it with a more appropriate oplog size and add the instance to the replica set as if it were the first time. If you don't care for the data, simply shut the replica set down, adjust the oplog size in the config file, delete the data dir and restart and reinitialize the replica set. Thinking of your problem twice, this sounds like the best bet to me, since the oplog isn't involved in standalone mode, iirc.
If you still have the same performance issues, my bet is on problems with disk or network IO.
You have a fairly standard setup, your mongos instance is running on a different machine than your mongod (be it a standalone or the primary of a replica set). You might want to check a few things:
Name resolution latency for resolving the name of your primary and secondary shards from the machine running your mongos instance. I can not count the times installing nscd improved performance for various operations.
Network latency from your mongos instance to your primary shard. Assuming you have a firewall between your AppServer and your cluster, you might want to talk to the respective administrator.
In case you are using external authentication, try to measure how long it takes.
When using some sort of tunneling (e.g. stunnel or encryption like SSL/TLS), make sure you disable name resolution. Please keep in mind that encrypting and decrypting may take a relatively long time.
Measure random disk IO on the mongod instances
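For the first item in the list above, name resolution latency can be measured from the machine running mongos with a few lines of Python. This is a minimal sketch: the function name and the host name in the usage note are hypothetical, and port 27017 is simply the default MongoDB port. The resolver parameter exists only so the function can be tried without a network; by default it calls socket.getaddrinfo.

```python
import socket
import statistics
import time


def resolution_latency_ms(host, port=27017, attempts=5, resolver=socket.getaddrinfo):
    """Return the median time in milliseconds to resolve `host`.

    `resolver` is injectable so the function can be exercised without a
    network; by default it calls socket.getaddrinfo.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        resolver(host, port)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Calling resolution_latency_ms("shard0.example.internal") on the mongos host gives a first impression. If the median stays high across repeated lookups, nothing is caching the answers, which is the situation where a caching daemon such as nscd tends to help.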
I was facing a similar performance issue. What helped was making the mongod instance that runs on the same host as the mongos the primary shard, using the following commands:
mongos> use admin
mongos> db.runCommand( { movePrimary: "mydb", to: "shard0003" } )
After making this change (without touching the load balancer or tweaking anything else), I was able to load a relatively large dataset (25 million rows) using a loader I had written, and the entire procedure took about 15 minutes instead of hours/days.
What are the advantages and disadvantages of having a single instance compared to multiple instances when multiple databases are intended to be created?
You may want to browse the Oracle concept guide, especially if you're more familiar with other DBMS.
A database is a set of files, located on disk, that store data. These files can exist independently of a database instance.
An instance is a set of memory structures that manage database files. The instance consists of a shared memory area, called the system global area (SGA), and a set of background processes. An instance can exist independently of database files.
A single instance (set of processes) can mount at most one database (set of files). If you need to access multiple databases, you will need multiple instances. More on the difference between instances and databases on askTom.
Ideally, you only want one instance per server (the server may be a logical server -- i.e. a virtual server). This will allow Oracle to know exactly what is going on. This implies one database per server.
If your databases are really independent, going with multiple instances/databases would make sense, since you have greater control over DB version, administration, etc.
If however your databases are not really independent (you frequently share data across them, you need some common data accessible to all of them), it may be more efficient (and simpler) to go with a single consolidated database. Each original database would have its own set of schemas. In this case cross-schema referential integrity would be easy, you wouldn't need to duplicate the data that needs to be shared.