Splitting a Redis RDB file

Currently I'm using Redis on an EC2 machine with 60 GB of RAM and no slaves, but as my data grows I will need more memory.
I was thinking of migrating to 2 x 60 GB machines and splitting the existing data between the two.
Is there any tool for splitting the RDB file? I haven't found anything specifically designed for this.

If you want to split your data, you will need a way to shard your keys so that some keys are written to and read from server A and the others from server B.
There is no way to split an RDB file directly, but there is something you can do to achieve what you want.
First, start a Redis instance on your second server and make it a slave of your current server, but set the parameter slave-read-only to false. This will cause the slave to synchronize and replicate all of your Redis data from the master. So far you only have a slave with all the data, but now we get to the interesting bit.
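For example, on the second server it could look like this (host names are placeholders and the default port 6379 is assumed; a sketch, not a full configuration):
redis-cli -h serverB CONFIG SET slave-read-only no
redis-cli -h serverB SLAVEOF serverA 6379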
Then you need to decide on a sharding strategy. Some Redis clients do this for you; for example, the official Ruby client knows how to handle it if you configure it accordingly. You will need to configure your client so keys are sharded between A and B (or, alternatively, use twemproxy so the clients won't know about the different servers and twemproxy will take care of it).
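If you go the twemproxy route, a minimal pool definition could look roughly like this (host names, ports, and the hash/distribution choices are placeholders; check the twemproxy README for the full option list):
alpha:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama
  redis: true
  servers:
   - serverA:6379:1
   - serverB:6379:1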
Once you have the clients configured, deploy them to production and immediately reconfigure the slave so it is no longer a slave. You can do this directly with the CONFIG command on the slave server (don't forget to persist the change with CONFIG REWRITE), or you can edit the slave's config file and restart, whichever is more convenient. Since the slave has slave-read-only set to false, it accepts writes even in slave mode. This means you can go from a slave to a sharded stand-alone Redis straight from redis-cli, without restarting, which I think is quite cool.
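On the slave, that cut-over amounts to something like this (a sketch; the host name is a placeholder):
redis-cli -h serverB SLAVEOF NO ONE
redis-cli -h serverB CONFIG REWRITE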
Be aware that once you shard, you will have to be careful with MULTI commands and Lua scripts. If you are using twemproxy you won't be able to use those commands at all, but if you are sharding on the client side you can still use MULTI and Lua. Just be careful to use a sharding mechanism that keeps all related keys on the same server.

Step 1: Install https://github.com/leonchen83/redis-rdb-cli/
Step 2: Create a config file that defines the splitting condition. Content of nodes.conf:
34b6e1dfb871ad30398ef5edd6b9a954617e6ec1 127.0.0.1:10003@20003 master - 0 1531044047088 3 connected 8193-16383
89d020a7e727e81f003836207902ae26fe05fd51 127.0.0.1:10001@20001 myself,master - 0 1531044047000 1 connected 0-8192
vars currentEpoch 6 lastVoteEpoch 0
Step 3: Run rdt -s your-dump.rdb -c nodes.conf -o /path/to
After step 3, two RDB files will be generated in the /path/to directory: 34b6e1dfb871ad30398ef5edd6b9a954617e6ec1.rdb and 89d020a7e727e81f003836207902ae26fe05fd51.rdb.


Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for fetching data from HDFS to a local system. StreamSets was suggested to us, but no one on the team knows anything about it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example on the local IP xxx.xx.x.xx:18630, and it works fine on one machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications, like Shiny Server, work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question: Data Collector listens on all addresses by default. There is an http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses Data Collector listens on, but it is commented out by default.
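For reference, the relevant line in sdc.properties looks something like the following; leaving it commented out keeps the listener bound to all interfaces:
# uncomment to restrict Data Collector to a single address
#http.bindHost=127.0.0.1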
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, * in front of the 18630 in the output means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
I believe that by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an HDFS NFS Gateway. Or you can probably use StreamSets to write to FTP or NFS.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, and Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is a minimalist way to get data from HDFS to a local machine. However, Hadoop typically stores many terabytes of data in the ideal case; if you're working with anything smaller, dumping those results into a database is usually a better option than moving flat files around.
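For example, a one-liner along these lines (both paths are hypothetical) concatenates every file under an HDFS directory into a single local file:
hdfs dfs -getmerge /user/me/output /tmp/output.csv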

Cache a redis cluster locally

I have a scenario where we want to use redis, but I am not sure how to go about setting it up. Here is what we want to achieve eventually:
A redundant central Redis cluster, where all writes will occur, with servers in two AWS regions.
Local Redis caches on our servers, each holding a replica of the complete central cluster.
The reason for this is that we have many servers which need read-only access, and we want them to keep working even during an outage (when a server cannot reach the main cluster).
I know there might be a "stale data" issue within the caches, but we can tolerate that as long as we get eventual consistency.
What is the correct way to achieve something like that using redis?
Thanks!
You need the Redis Replication (Master-Slave) Architecture.
Redis Replication :
Redis replication is a very simple to use and configure master-slave replication that allows slave Redis servers to be exact copies of master servers. The following are some very important facts about Redis replication:
Redis uses asynchronous replication. Starting with Redis 2.8, however, slaves periodically acknowledge the amount of data processed from the replication stream.
A master can have multiple slaves.
Slaves are able to accept connections from other slaves. Aside from connecting a number of slaves to the same master, slaves can also be connected to other slaves in a cascading-like structure.
Redis replication is non-blocking on the master side. This means that the master will continue to handle queries when one or more slaves perform the initial synchronization.
Replication is also non-blocking on the slave side. While the slave is performing the initial synchronization, it can handle queries using the old version of the dataset, assuming you configured Redis to do so in redis.conf. Otherwise, you can configure Redis slaves to return an error to clients if the replication stream is down. However, after the initial sync, the old dataset must be deleted and the new one must be loaded. The slave will block incoming connections during this brief window (that can be as long as many seconds for very large datasets).
Replication can be used both for scalability, in order to have multiple slaves for read-only queries (for example, slow O(N) operations can be offloaded to slaves), or simply for data redundancy.
It is possible to use replication to avoid the cost of having the master write the full dataset to disk: a typical technique involves configuring your master's redis.conf to avoid persisting to disk at all, then connecting a slave configured to save from time to time, or with AOF enabled. However, this setup must be handled with care, since a restarting master will start with an empty dataset: if the slave tries to synchronize with it, the slave will be emptied as well.
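As a sketch, the redis.conf fragments for that technique might look like this (the master IP is a placeholder):
# master: persistence disabled entirely
save ""
appendonly no
# slave: replicate from the master and persist to disk
slaveof 10.0.0.1 6379
appendonly yes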
Go through the steps in How to Configure Redis Replication.
So I decided to go with redis-sentinel.
Using Redis Sentinel, I can set the slave-priority on the cache servers to 0, which will prevent them from ever becoming masters.
I will have one master set up, and a few "backup masters", which are actually slaves with slave-priority set to a non-zero value, allowing them to take over once the master goes down.
The sentinel will monitor the master, and once the master goes down it will promote one of the "backup masters" to be the new master.
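As a sketch, the relevant configuration fragments could be (the master IP, the quorum of 2, and the timeout are placeholders):
# redis.conf on each local cache server, so it is never promoted
slave-priority 0
# sentinel.conf on each sentinel
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000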
More info can be found in the Redis Sentinel documentation.

Solutions for a secure distributed cache

Problem: I want to cache user information such that all my applications can read the data quickly, but I want only one specific application to be able to write to this cache.
I am on AWS, so one solution that occurred to me was a version of memcached with two ports: one port that accepts read commands only and one that accepts reads and writes. I could then use security groups to control access.
Since I'm on AWS, if there are solutions that use out-of-the box memcached or redis, that'd be great.
I suggest you use ElastiCache with one open port at 11211 (Memcached), then create an EC2 instance and set your security group so that only this server can access your ElastiCache cluster. Use this server to filter your applications so only one specific application can write to the cache. You control the access with security groups, scripts, or iptables. If you are not using a VPC, you can use a cache security group instead.
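As a sketch with the AWS CLI (both group IDs are hypothetical), the ingress rule would allow only the writer instance's security group to reach the Memcached port:
aws ec2 authorize-security-group-ingress --group-id sg-cache1234 --protocol tcp --port 11211 --source-group sg-writer5678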
I believe you can accomplish this using Redis (instead of Memcached), which is also available via ElastiCache. Once the instance has been created, you will want to create a replication group and associate it with the cache cluster you already launched.
You can then add instances to the replication group. Instances within the replication group are simply replicated from the master cache cluster (a single Redis instance) and so are read-only by default.
So, in this setup, you have a master node (single endpoint) that you can write to and as many read nodes (multiple endpoints) as you would like.
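With the AWS CLI, creating that setup looks roughly like this (the group ID, description, and node type are hypothetical); this requests one primary plus two read replicas:
aws elasticache create-replication-group \
  --replication-group-id user-cache \
  --replication-group-description "one writer, two readers" \
  --engine redis \
  --cache-node-type cache.t2.small \
  --num-cache-clusters 3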
You can take security a step further and assign different routing rules to the replication group (via the VPC) so the applications reading data do not have access to the master node (the only one that can write data).

H2 Database Cluster Recovery

I have a SpringMVC application which runs on Apache Tomcat and uses an H2 database.
The infrastructure contains two application servers (let's name them A & B), each running its own Tomcat server. I also have H2 database clustering in place.
On one system (A) I ran the following command
java org.h2.tools.Server -tcp -tcpPort 9101 -tcpAllowOthers -baseDir server1
On the other (B) I ran
java org.h2.tools.Server -tcp -tcpPort 9101 -tcpAllowOthers -baseDir server2
Then I created the cluster from machine A:
java org.h2.tools.CreateCluster
-urlSource jdbc:h2:tcp://IpAddrOfA:9101/~/test
-urlTarget jdbc:h2:tcp://IpAddrOfB:9101/~/test
-user sa
-serverList IpAddrOfA:9101,IpAddrOfB:9101
When any one of the servers goes down, it has been mentioned that one has to delete the failed database, restart that server, and rerun CreateCluster.
I have the following questions:
1. If both servers are down, how can I ascertain which database to delete so that I can restart that server and rerun CreateCluster?
2. CreateCluster takes a urlSource and a urlTarget. Do I need to give them the same values as before, or can I interchange them without any side effects?
3. Do I need to run the CreateCluster command from both machines? If so, do I need to interchange the urlSource and urlTarget?
4. Is there a way to know whether both, one, or none of the servers are running? I want both IP addresses returned if both are up, one IP address if only one is up, and none if all are down.
If both servers are down, how can I ascertain which database to delete?
The idea of the cluster is that the second database adds redundancy to the system. Let's assume a server fails once every 100 days (hard disk failure, power failure, and so on); that is 99% availability. This might not be good enough for you, which is why you may want to use a cluster of two servers. Even if each server fails once every 100 days, the chance of both failing at the same time is very, very low. Ideally, the risks of failure are completely independent, which would mean the risk of both failing at the exact same moment is 1 in 10,000 (100 times 100), giving you 99.99% availability. Both servers being down at once is exactly what the cluster feature is meant to prevent.
CreateCluster takes a urlSource and urlTarget. Do I need to give them the same values as before?
It depends which one you want to use as the source and which as the target: the database at urlSource is copied to the database at urlTarget, so the source should be the copy whose contents you want to keep.
Do I need to run the CreateCluster command from both the machines?
No.
Is there a way to know whether both, one or none of the servers are running ?
You could try to open a TCP/IP connection to each of them to check whether the listener is running. What I usually do is run telnet <server> <port> on the command line.
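A small shell sketch along those lines (host names as used above; nc ships with most Linux distributions):
for host in IpAddrOfA IpAddrOfB; do
  nc -z -w 2 "$host" 9101 && echo "$host is up" || echo "$host is down"
done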

EC2 database server failover strategy

I am planning to deploy my web app to EC2. I have several webserver instances. I have 1 primary database instance. I have 1 failover database instance. I need a strategy to redirect the webservers to the failover database instance IP when the primary database instance fails.
I was hoping I could use an Elastic IP in my connection strings, but the webservers are not able to access or ping the Elastic IP. I have several brute-force ideas for solving the problem, but I am trying to find the most elegant solution possible.
I am using all .Net and SQL Server. My connection strings are encrypted.
Does anybody have a strategy for failing over a database instance in EC2 using some form of automation or DNS configuration?
Please let me know.
http://alestic.com/2009/06/ec2-elastic-ip-internal tells you how to use the Elastic IP's public DNS name.
I haven't used EC2, but surely you need to either:
(a) put your front end into some custom maintenance mode that you define while you switch the IP over, and have the front end perform the steps required to manage potential data integrity and data loss issues (related to the previous server going down and the new server coming up) as it enters and leaves your custom maintenance mode;
OR, for a zero-downtime system:
(b) design the system at the object/relational and transaction levels, from the ground up, to support zero-downtime failover. It's not something you can bolt on quickly to just any application.
(c) use database support for automatic failover. I am unaware whether SQL Server failover support that suits your application exists or is appropriate here. I suggest adding a "sql-server" tag to the question to reach the right audience.
If Elastic IPs don't work (which sounds odd, to say the least - shouldn't you talk to EC2 about that?), you may have to be able to instruct your front end which new database IP to use at the same time as telling it to go from maintenance mode back to normal mode.
If you're willing to shell out a bit of extra money, take a look at RightScale's tools; they've built custom server images and supporting tools that handle database failover (among many other things). This link explains how to do it with MySQL, so it will hopefully show you some principles even though it doesn't use SQL Server.
I always thought this was possible in the connection string.
The following is taken (but not yet tested) from How to add Failover Partner to a connection string in VB.NET:
If you connect with ADO.NET or the SQL Native Client to a database that is being mirrored, your application can take advantage of the driver's ability to automatically redirect connections when a database mirroring failover occurs. You must specify the initial principal server and database in the connection string, as well as the failover partner server.
Data Source=myServerAddress;Failover Partner=myMirrorServerAddress;
Initial Catalog=myDataBase;Integrated Security=True;
There are of course many other ways to write the connection string using database mirroring; this is just one example pointing out the failover functionality. You can combine this with the other connection string options available.
To broaden gareth's answer, cloud management software usually solves this type of problem. RightScale is one option, but you can also try enStratus or Scalr (disclaimer: I work at Scalr). These tools provide failover solutions such as:
Backups: you can schedule automated snapshots of the EBS volume containing the data.
Fault-tolerant database: in the event of failure, a slave is promoted to master, and the mounted storage is switched over if the failed master and the new master are in the same AZ; otherwise a snapshot of the volume is taken.
If you want to build your own solution, you could replicate the process detailed below that we use at Scalr:
1. Is there a slave in the same AZ? If so, promote it, switch the EBS volumes (which are limited to a single AZ), switch any Elastic IP you might have, and reconfigure replication of the remaining slaves.
2. If not, is there a slave fully replicated in another AZ? If so, promote it, then do the above.
3. If there is no slave in the same AZ and no slave fully replicated in another AZ, create a snapshot from the master's volume, use that snapshot to create a new volume in an AZ where a slave is running, and then do the above.
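As a rough sketch of step 1 with the AWS CLI (all IDs and the device name are hypothetical):
# move the data volume from the failed master to the slave being promoted
aws ec2 detach-volume --volume-id vol-0abc123
aws ec2 attach-volume --volume-id vol-0abc123 --instance-id i-0slave456 --device /dev/sdf
# repoint the Elastic IP at the new master
aws ec2 associate-address --instance-id i-0slave456 --public-ip 203.0.113.10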
