Maximum number of databases in redis - caching

In Redis, SELECT <number> gives access to the specific database at that index. In my Redis config it is set to 16 (why?). We require high scaling of our application, so what is the max limit for that?

The default number of Redis databases is 16, but it can be configured to more. You probably have 16 in your config because of that default (see Storing Data with Redis).
Databases (in Redis) are a way to partition data logically (think "namespace", "key-space" or, in RDBMS terms, a schema). Redis databases have nothing to do with scalability, so your "max limit" question is out of context.
To scale you would want to do as Sergio suggests in his comment: create separate Redis instances/clusters for separate applications.
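For reference, you can check the configured number of databases and see how SELECT behaves at the boundary straight from redis-cli (the output below assumes the default of 16):
redis-cli
127.0.0.1:6379> CONFIG GET databases
1) "databases"
2) "16"
127.0.0.1:6379> SELECT 15
OK
127.0.0.1:6379[15]> SELECT 16
(error) ERR DB index is out of range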

Answer: 1 Million (Maybe) or 100K on Linux (Maybe)
Official Info
The official documentation indicates that the default setting is 16 and that it may be changed in redis.conf. It does not indicate the allowed range.
Original Research
Through experimentation on my local Windows 10 WSL Debian install, I found that I could set the conf value to anything and the server would start up fine.
However, when I then attempted to select a database via the command line, my computer would freeze. I tried several values: the system worked perfectly and quickly at 1,000,000 (one million) and froze at 10,000,000 (ten million). That number seems rather arbitrary in the computing world, so either it is a memory limitation (seems unlikely, but I don't know WSL's memory handling) or an arbitrary limit set by the developers.
I ran some similar tests on my CentOS 7 box, and Redis refused to start at 1 million but started fine at 100 thousand. I have no idea why it differs from my Windows system, or why it refused to start outright instead of starting and then failing when a database was selected, as on my WSL install.
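If you want to reproduce this kind of test yourself, a rough sketch (the port and values here are arbitrary; as described above, whether the server starts or the client hangs at larger values seems to vary by system):
redis-server --port 6400 --databases 100000
redis-cli -p 6400 -n 99999 ping
PONG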
Disclaimer
As already stated by kit in his previous answer, databases are not designed for "scaling" but rather for "namespacing". For example, a SaaS may run one code base for hundreds of clients, each client with its own "namespace" or Redis database. This lets you flush a client without affecting the others and minimizes administrative overhead. But running dozens of unique WordPress instances would be better suited to a separate Redis install for each.

The default number of databases in Redis is 16, indexed 0 through 15. You can edit your redis.conf file to adjust this number:
Steps:
1) Edit the config file:
vi /etc/redis.conf
The default config path is /etc/redis.conf on CentOS when installed with yum.
2) Find the keyword "databases":
# Set the number of databases. The default database is DB 0, you can select
# a different one on a per-connection basis using SELECT <dbid> where
# dbid is a number between 0 and 'databases'-1
databases 16
(16 is the default on new installations)
3) Update the number of databases:
databases 30
4) Save and quit.
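After restarting Redis so the new redis.conf is picked up, you can verify the setting from the command line:
redis-cli CONFIG GET databases
1) "databases"
2) "30"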

The default number of databases is limited to 16. You can change it by editing the /etc/redis/redis.conf file:
sudo vim /etc/redis/redis.conf
Change databases 16 to
databases 150
(you can use any number instead of 150).
Also, change supervised no to supervised systemd in the redis.conf file.
Restart redis:
sudo systemctl restart redis.service
Now try selecting a database index higher than 16:
redis-cli
select 34
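If the new limit is in effect, the session looks roughly like this (any index from 0 up to databases-1 works; anything above returns an error):
127.0.0.1:6379> select 34
OK
127.0.0.1:6379[34]> select 200
(error) ERR DB index is out of range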

Answer is "unlimited"...
Here is the question and its answer from redis faqs page:
How many Redis databases can I create and manage?
The number of Redis databases is unlimited. The limiting factor is the
available memory in the cluster, and the number of shards in the
subscription.
Note the impact of the specific database configuration on the number
of shards it consumes. For example:
- Enabling database replication, without enabling database clustering, creates two shards: a master shard and a replica shard.
- Enabling database clustering creates as many database shards as you configure.
- Enabling both database replication and database clustering creates double the number of database shards you configure.
For more details:
https://redis.com/faqs/#:~:text=The%20number%20of%20Redis%20databases,of%20shards%20in%20the%20subscription.

Related

Is it still best to store images and files on a file system or non-RDBMS

I am working on a system which will store users' pictures and, in the future, some soft documents as well.
Number of users: 4000+
Transcripts and other documents per user: 10 MB
Total system requirement in first year: 40 GB
Additional Increment Each year: 10%
Reduction due to archiving Each year: 10%
Saving locally on an Ubuntu Linux system without any fancy RAID.
Using MySQL community edition for application.
Simultaneous Users: 10 to 20
Documents are for historical purposes and will not be accessed frequently.
I always thought it was cumbersome to store these in an RDBMS due to the multiple layers of access, etc. However, since non-RDBMS databases use key/value pairs, is it still better to store the documents in the file system or in the DB? Thanks for any pointers.
A similar question was asked about 7 years ago (storing uploaded photos and documents - filesystem vs database blob). I hope the technology has changed with all the NoSQL databases in the mix, hence I am asking this again.
Please correct me if I should be doing something else instead of raising a fresh question.
It really depends (notably on the DBMS considered, the file system, whether the data is remote or local, the total size of the data (petabytes are not the same as gigabytes), the number of users/documents, etc.).
If the data is remote on 1 Gb/s Ethernet, the network is the bottleneck, so using a DBMS won't add significant additional overhead. See the answers section of this interesting webpage, or STFW for Approximate timing for various operations on a typical PC...
If the data is local, things matter much more (but few computers have one petabyte of SATA disks). Most filesystems on Linux use some minimal block size (e.g. 1Kbytes, 4Kbytes, ...) per file.
A possible approach might be to have some threshold (typically 4 or 8 kilobytes, or even perhaps 64 kilobytes, i.e. several pages; YMMV). Data smaller than that could go directly into a database field; data bigger than that could go into a file, with the database storing the file path. Read about BLOBs in databases.
Consider not only RDBMS like PostGreSQL, but also noSQL solutions à la MongoDB, and key-value stores à la REDIS, etc.
For a local data approach, consider not only plain files, but also sqlite & GDBM, etc. If you use a file system, avoid very wide directories: instead of widedir/000001.jpg .... widedir/999999.jpg, organise it as dir/subdir000/001.jpg ... dir/subdir999/999.jpg and keep no more than about a thousand entries per directory.
If you locally use a MySQL database and don't expect a lot of data (e.g. less than a terabyte), you might store any raw data smaller than e.g. 64 Kbytes directly in the database, and store bigger data in individual files (whose path goes into the database); but you should still avoid very wide directories for them.
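As a small illustration of that fan-out idea (the paths and file name below are made up), the subdirectory can be derived from a hash of the document name, e.g. in a shell script:
# spread files over /srv/docs/xx/yy/ subdirectories based on the SHA-1 of the name
id=transcript_123.pdf
h=$(printf '%s' "$id" | sha1sum | cut -c1-4)
mkdir -p "/srv/docs/${h:0:2}/${h:2:2}"
cp "$id" "/srv/docs/${h:0:2}/${h:2:2}/$id"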
Of course, don't forget to define and apply (human decided) backup procedures.

CouchDB - Mobile application architecture - Replication performance

I built a mobile application based on CouchDB.
For security reasons, I have to make sure that a document can be read only by the users allowed to do so. As I cannot manage access rights at the document level, I create one CouchDB database per user, and I replicate documents from my main CouchDB database into each user database with a filtered replication.
This model works very well, but today I faced huge performance issues.
I tried to have all my replications continuous, filtered and bi-directional, but after 80 users (so 81 databases and 160 simultaneous continuous replications) there were too many replications and my CouchDB service started to slow down and even crashed sometimes. Note that all the databases are on the same server (and I cannot have more than one server).
I tried to put in place "manual" replications, but even this way, when I need to replicate a document from my main database to all my 80 user databases, each filtered replication from the main database to a user database takes around 30 seconds.
Maybe I have an issue with my replication filter: for each document I store a list of the users allowed to see it. As each user has their own database, I replicate into it only the documents that user is allowed to see. Here is my filter function:
function (doc, req) {
  // replicate only documents whose userList contains the requesting user
  if (doc.userList) {
    if (doc.userList.indexOf(req.query.username) > -1) {
      return true;
    }
  }
  return false;
}
The goal of my application is to reach around 1000 users, which is totally impossible with the current architecture/performance.
I have three questions:
1. Even if I think it's not possible: is it possible to have about 1000 databases in continuous replication on the same server?
2. Is there anything wrong with my replication filter? Is there any way to improve it to get fast database replication?
3. If the current architecture is not good at all, what kind of architecture would you advise in my case?
Thank you very much!
We finally changed our global project architecture.
The main server cannot handle more than 100 replicated databases even if the configuration limits are raised; after 80 synchronized databases the CouchDB logs start to explode. I may be wrong, but I think this kind of architecture is not possible on a single server.
Here is the solution we put in place.
We removed all the user databases, plugged all our mobile applications directly into the main database, and do a filtered replication against that main database (http://pouchdb.com/api.html#replication), using this solution: Example 3: filter function inside of a design document.
This new model is now working; we did some stress tests and we didn't hit any issue up to 1000 simultaneous users.
Just be aware that PouchDB, to replicate a database, asks CouchDB for all the modifications applied to the main database since the last synchronisation (even for filtered replication). So when you create a new PouchDB database and synchronise it, if your main CouchDB is old and has a long history (check the CouchDB _changes API), it can take a very (very) long time!
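For reference, a minimal sketch of what such a filtered replication looks like on the PouchDB side (the database names, the app/by_user design-document filter and the username parameter are placeholders):
// pull only this user's documents from the main database, using a filter
// function named "by_user" stored in the _design/app document on the server
var db = new PouchDB('localdb');
db.replicate.from('https://couch.example.com/maindb', {
  live: true,
  retry: true,
  filter: 'app/by_user',
  query_params: { username: 'alice' }
});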
Step 0 is always to identify the bottleneck. My first guess, based on the scenario you outlined, would be to look at I/O performance. Check out
GET /_stats/couchdb
and
GET /_active_tasks
Each database gets its own read and write file descriptors, so as the number of open databases on the server increases, so do the I/O resources required. Hope this helps.
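For example, against a default local install (these are the CouchDB 1.x paths matching the endpoints above; on CouchDB 2.x+ the statistics moved under /_node/_local/_stats):
curl http://localhost:5984/_active_tasks
curl http://localhost:5984/_stats/couchdb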

Mongo - quick exports of all documents without performance hit on other queries?

I need a Mongo cluster doing 2 operations:
get/update a single document - Mongo is great for realtime changes, excellent speed.
export all documents into JSON files (one file per category; there are roughly 15 categories) - this is very slow when I use a regular query. Maybe I don't know which command or options to use... or I would need to fit the whole thing into RAM, which is expensive. Even replication to a new Mongo instance is much faster (takes hours) than a query plus writing the data to disk (takes days).
I have about 10M documents. The Mongo data takes 250 GB on disk. There are roughly 15 categories for which I need separate files (at the moment all documents are in one collection regardless of category).
Which command should I use to export all data into files in a couple of hours?
How large an AWS instance should I use to speed it up without paying too much for RAM? Would it help? Operation 2) must not cause a performance hit for operation 1) -- I cannot stop Mongo and use mongoexport.
I am not sure what kind of servers you are using, but this may provide some further insight into the export/file-creation performance without shutting Mongo off. One presumes you are working with a sharded and replicated cluster.
In my case I am on Azure VMs running Windows Server in a replicated and sharded cluster, so I would take a copy of the Azure blobs associated with the data disks on a secondary in each replica set. You should stop your balancer and lock the db on the secondary to do this. It should take a couple of minutes at most to copy only 250 GB. Then I would restore the blobs to disks on a new VM.
Then you could query data out of this VM without affecting your cluster's performance. You may additionally add indexes for this export process, since you are now on a separate instance.
Personally I use PowerShell to do this in Azure. Golang may be a better choice to write your queries in due to its parallel capabilities if JavaScript via the mongo shell fails you. I've had JS work faster than python code but it also depends on what you know.
This is just one way but it does address some of the criteria you posted.
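For the export step itself, once you have a secondary or restored copy you can afford to load, a per-category mongoexport is one option. A sketch, where the host, database, collection and category field names are assumptions:
mongoexport --host secondary1.example.com --db mydb --collection docs \
  --query '{"category": "books"}' --out books.json
Repeat (or loop) once per category to get the roughly 15 separate files.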

How to replicate, for DR purposes, a Greenplum DB to another data centre?

We are planning for a large Greenplum DB (growing from 10 to 100TB over the first 18 months). Traditional backup and restore tools aren't going to help as we have 24hr RPO/RTOs to deal with.
Is there a way to replicate the DB across to our DR site without resorting to block replication (i.e. place a segment on SAN and mirror)?
You've got a number of options to choose from:
Dual ETL. Replicate input data and run the same ETL on two sites. Synchronize them with backup-restore every week or so
Backup-restore. A simple backup-restore may not be that efficient. But if you use DataDomain it can perform deduplication at the block level and store only changed blocks. It can offload the deduplication task to run on the Greenplum cluster (DDBoost). Also, in case of replication to a remote site, it would replicate only changed blocks, which greatly reduces replication time. In my experience, if a clean backup on DD takes 12 hours, a subsequent DDBoost backup will take 4 hours + 4 hours to replicate the data
Custom solution. I know of a case where the data replication to the remote site is done as part of the ETL process. For each ETL job you know which tables have changed; they are added to the replication queue and moved to the remote site using external tables. Analysts work in a special sandbox and their sandbox is replicated with backup-restore daily
At the moment Greenplum does not have built-in WAN replication solution so this is almost all the options to choose from.
I did some investigation on this; here are my results.
I. Using EMC Symmetrix VMAX SAN (Storage Area Network) mirroring and SRDF (Symmetrix Remote Data Facility) remote replication software
Please refer to h12079-vnx-replication-technologies-overview-wp.pdf for details
Preconditions
1. Having EMC Symmetrix VMAX SAN installed
2. Having the SRDF software
Advantages of the 3 different modes
1. Symmetrix Remote Data Facility / Synchronous (SRDF/S)
Provides a no-data-loss solution (zero RPO). No server resource contention for the remote mirroring operation. Can perform restoration of the primary site with minimal impact to application performance on the remote site. Enterprise disaster recovery solution. Supports replicating over IP and Fibre Channel protocols.
2. Symmetrix Remote Data Facility / Asynchronous (SRDF/A)
Extended-distance data replication that supports longer distances than SRDF/S. SRDF/A does not affect host performance, because host activity is decoupled from the remote copy process. Efficient link utilization that results in lower link-bandwidth requirements. Facilities to invoke failover and restore operations. Supports replicating over IP and Fibre Channel protocols.
3. Symmetrix Remote Data Facility / Data Mobility (SRDF/DM)
II. Using Backup Tools
Please refer to http://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html for details
Parallel Backup
The parallel backup utility is gpcrondump.
Non-parallel Backup
Not recommended; it is used for migrating PostgreSQL databases to Greenplum.
Parallel Restore
Supports target systems with the same configuration as the source Greenplum database as well as with a different configuration.
Non-parallel Restore
pg_restore requires modifying the CREATE TABLE statements to add the DISTRIBUTED BY clause.
 
Disadvantages
1. The backup process locks tables: it puts an EXCLUSIVE lock on the pg_class table, which means only read access is allowed during this period.
2. After releasing the EXCLUSIVE lock on pg_class, it puts an ACCESS SHARE lock on all the tables, which also only allows read access for the duration of the lock.
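For reference, a minimal gpcrondump invocation for the parallel backup route mentioned above might look like this (the database name and backup directory are placeholders; see the Pivotal documentation linked above for the full option list):
gpcrondump -x mydb -u /backups/greenplum -a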
III. Replay DDL statements
In PostgreSQL, there is a parameter to log all SQL statements to a file:
in data/postgresql.conf, set log_statement to 'all'.
Write an application to extract the DML and DDL statements and run them on the DR servers.
Advantages
1. Easy to configure and maintain
2. No decrease in performance
Disadvantages
1. Needs additional storage for the statement logging
IV. Parse the PostgreSQL WAL
Parse the WAL to extract the DDL statements and then run the generated statements on the DR Greenplum.
Advantages
1. Doesn't impact the source Greenplum database
Disadvantages
1. Requires writing code to parse the WAL
2. The WAL is not easy to parse; there is not much documentation about its format.
3. It is unclear whether this is feasible for Greenplum, as it is a solution for PostgreSQL.

MongoDB sharded cluster 25x slower than standalone node

I'm confused by the situation and have been trying to fix this for a couple of days now. I'm running 3 shards on top of three 3-member replica sets (rs0, rs1 and rs2). All is working so far. Data is distributed over the 3 shards as well as cloned within the replica sets.
BUT: importing data into one of the replica sets works fine at a constant 40k docs/s, while enabling sharding slows the entire process down to just 1.5k docs/s.
I've populated the data via different methods:
generated some random data in the mongo shell (running in my mongos)
JSON data import via mongoimport
MongoDB dump restore from another server via mongorestore
All of them result in just 1.5k docs/s, which is disappointing. The mongods are physical Xeon boxes with 32 GB each, the 3 config servers are virtual servers (40 GB HDD, 2 GB RAM, if that matters), and the mongos is running on my app server. By the way, the value of 1.5k inserts/s doesn't depend on the shard key: the behaviour is the same for a dedicated shard key (a single-field key as well as a compound key) and for a hashed shard key on the _id field.
I tried a lot, even reinstalled the entire cluster twice. The question is: what is the bottleneck in this setup:
config servers running on virtual server? -> shouldn't be problematic due to the low resource consumption of config servers
mongos? -> running multiple Mongos on a dedicated box behind HAproxy might be an alternative, haven't tested that yet
Let's do the math first: how big are your documents? Keep in mind that they have to be transferred over the net multiple times depending on your write concern.
Maybe you are experiencing this because of the indices which have to be built.
Please try this:
Disable all indices except the one on _id (which cannot be disabled anyway, iirc)
Load your data
Reenable indices.
Enable sharding and balancing if not done already
This is the suggested way for importing data into a sharded cluster anyway and should speed up your import considerably. Some (cautious!) fiddling with storage.syncPeriodSecs and storage.journal.commitIntervalMs might help, too.
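A rough mongo shell sketch of that sequence (the database, collection, index field and shard key below are placeholders):
use mydb
// 1) drop all secondary indexes (the _id index cannot be dropped)
db.mycoll.dropIndexes()
// 2) load the data with mongoimport / mongorestore / your own loader
// 3) recreate the secondary indexes
db.mycoll.createIndex({ category: 1 })
// 4) enable sharding and shard the collection
sh.enableSharding("mydb")
sh.shardCollection("mydb.mycoll", { _id: "hashed" })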
The delay can occur even when storing the data on the primary shard. Depending on the size of your indices, they may slow down bulk operations considerably. You might also want to have a look at the replication.secondaryIndexPrefetch config option.
Another thing might be that your oplog simply gets filled faster than the replication can take place. The problem here: once it is created, you cannot increase its size. I am not sure whether it is safe to delete and recreate it in standalone mode and then rejoin the replica set, but I doubt it. So the safe option would be to have the instance actually leave the replica set, reinstall it with a more appropriate oplog size, and add the instance to the replica set as if it were the first time. If you don't care about the data, simply shut the replica set down, adjust the oplog size in the config file, delete the data dir, then restart and reinitialize the replica set. Thinking about your problem twice, this sounds like the best bet to me, since the oplog isn't involved in standalone mode, iirc.
If you still have the same performance issues, my bet is on problems with disk or network IO.
You have a fairly standard setup, your mongos instance is running on a different machine than your mongod (be it a standalone or the primary of a replica set). You might want to check a few things:
Name resolution latency for resolving the names of your primary and secondary shards from the machine running your mongos instance. I cannot count the times installing nscd improved performance for various operations.
network latency from your mongos instance to your primary shard. Assuming you have a firewall between your AppServer and your cluster, you might want to talk to the respective administrator.
In case you are using external authentication, try to measure how long it takes.
When using some sort of tunneling (e.g. stunnel or encryption like SSL/TLS), make sure you disable name resolution. Please keep in mind that encrypting and decrypting may take a relatively long time.
Measure random disk IO on the mongod instances
I was facing a similar performance issue. What helped solve it was setting the mongod instance that was running on the same host as the mongos as the primary shard,
using the following command:
mongos> use admin
mongos> db.runCommand( { movePrimary: "mydb", to: "shard0003" } )
After making this change (without touching the load balancer or tweaking anything else), I was able to load a relatively large dataset (25 million rows) using a loader I had written, and the entire procedure took about 15 minutes instead of hours/days.
