Distributed key-value storage for total data size of 80TB - key-value-store

TL;DR:
I'd like to have recommendations for a distributed key-value storage, for avg. entry size of up to 50KB, to be installed on a Linux environment (dedicated servers).
A file-system solution would do.
I found a few solutions: Ceph, Cassandra, Riak, and a few more.
Details
I'm looking for a storage solution for one of our components. It should be a key-value store with a flat namespace.
Scenario
The read/write patterns are very simple:
Once a key-value is written, there are a few reads within the next hours.
After that, nothing touches the given key-value. We'd like to keep the data for future purposes, "Storage mode".
Other usage aspects
OS: Linux
Python client/connector
Total size: up to 80TB (this value also represents future needs).
Avg Entry Size (for a single value in a k-v pair): 10 to 50 KB, uncompressed, mostly textual data
Compression: either built-in or external.
Encryption: not needed
Network bandwidth: 1Gb, single LAN
Servers: dedicated (not in the cloud)
Most important requirements
The "base" requirements are:
OS: Linux
Python client/connector OR RESTful API via HTTP
Can easily store up to 80TB (this value also represents future needs).
Max read latency: a few seconds for first reads, 30 seconds for "storage mode" (see above for explanation)
Built in replication (so that data is stored on more than a single node)
Nice to have
RESTful gateway
Background data backup to another store (for data recovery in case of a disaster).
Easy to configure
What I've found so far
Ceph
HDFS
HBase on top of HDFS
Lustre
GlusterFS
Mongo's GridFS - but can I trust Mongo's infrastructure?
Cassandra - not an option, since the merge process consumes double disk size
Riak - looks like it has the same issue as Cassandra, needs more research
Swift + OpenStack (actual storage can be on Amazon S3)
Voldemort
There are dozens of additional tools, but I won't list them here, since some of them have proprietary licenses and others seem immature.
I'd appreciate any recommendation on any of the tools I mentioned above (with total capacity of more than 50TB), or on a tool you think is sufficient.

Just use Ceph (I mean direct librados usage). Don't use GlusterFS -- it's hangy.
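A minimal sketch of what direct librados usage could look like from Python, assuming the `rados` bindings that ship with Ceph are installed. The pool name `kvstore`, the config path, and the helper names `put_value` / `get_value` are assumptions made for illustration; they do not come from the question. Values are compressed client-side with zlib because the data is described as mostly textual.

```python
import zlib


def put_value(ioctx, key, text):
    """Compress a text value and store it as one RADOS object named `key`."""
    payload = zlib.compress(text.encode("utf-8"))
    ioctx.write_full(key, payload)  # replaces the whole object
    return len(payload)


def get_value(ioctx, key):
    """Read the whole object back and decompress it."""
    size, _mtime = ioctx.stat(key)   # stat() returns (size, mtime)
    payload = ioctx.read(key, size)  # read() defaults to 8192 bytes, so pass the size
    return zlib.decompress(payload).decode("utf-8")


def demo():
    """Round trip against a live cluster (not called automatically)."""
    import rados  # Python bindings shipped with Ceph (assumed installed)

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed path
    cluster.connect()
    ioctx = cluster.open_ioctx("kvstore")  # hypothetical pool name
    try:
        put_value(ioctx, "doc-0001", "some mostly textual value")
        return get_value(ioctx, "doc-0001")
    finally:
        ioctx.close()
        cluster.shutdown()
```

Replication is then a property of the pool rather than of the client code, so the helpers stay the same whatever replica count the pool is configured with.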

Related

How to increase Aerospark read performance?

I am using the latest Aerospark connector to work with Aerospike and Spark ML. After inserting around 60M records into Aerospike, read operations take far too long. For example, fetching around 500K records from a set that contains 60M records takes Aerospark ~30 minutes. In the htop output, Aerospike uses only 7% of the CPU.
Each record contains around 1 KB of data. Aerospike and Spark are hosted on the same node. The data is filtered by a secondary index.
How can I speed up read operations? Aerospark seems to work with only one thread. How can I parallelize this job? Any suggestions?
Aerospike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
    file /vol/rmla.data
    filesize 900G
}
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed from your storage media, which you declared to be a file. If you're storing the data on disk, you can either use file or device in the storage-engine device config block.
There is a big difference in the read and write latency between a file on an HDD versus raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drive. Declaring multiple devices for the namespace will give you a linear performance boost (two drives will have double the read and write throughput of one of the same kind).
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc. They only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to the number of cores. If you don't know how many cores you have, do cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
Upgrade Version
As of November 28 2016 aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset which doesn't fit in the 8 GB of RAM of my work computer, but the machine has a 1 TB local hard disk.
The link below, from the official documentation, mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages.
For me computational time is not at all a priority but fitting the data into a single computer's RAM/hard disk for processing is more important due to lack of alternate options.
Note:
I am looking for a solution which doesn't include the below items
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical solutions from people who have successfully used Spark to operate on data that doesn't fit in RAM, in standalone/local mode on a single computer. Has anyone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, which lets it use both RAM and the local hard disk for model building etc. Can Spark be made to work in the same way when the data is larger than RAM?
SAS persists the complete dataset to hard disk in ".sas7bdat" format. Can Spark persist data to hard disk in a similar way?
If this is possible, how to install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence models (storage levels) as per your need. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores data in serialized form.
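As a rough sketch of the suggestion above in PySpark (assuming `pyspark` is installed): the helper names, the app name, and the scratch directory `/data/spark-tmp` are made up for illustration. `spark.local.dir` is the Spark property for scratch space, and `local[*]` runs everything on the one machine.

```python
def local_disk_settings(scratch_dir):
    """Spark properties for a single-machine run that spills to local disk."""
    return {
        "spark.master": "local[*]",      # all local cores, no cluster
        "spark.local.dir": scratch_dir,  # scratch space for spilled blocks and shuffle files
    }


def count_lines(path, scratch_dir="/data/spark-tmp"):
    """Load a text file, persist it with MEMORY_AND_DISK and count its lines."""
    from pyspark import SparkConf, SparkContext, StorageLevel  # needs pyspark installed

    conf = SparkConf().setAppName("out-of-core-demo")  # hypothetical app name
    for name, value in local_disk_settings(scratch_dir).items():
        conf.set(name, value)

    sc = SparkContext(conf=conf)
    try:
        rdd = sc.textFile(path)
        # Partitions that do not fit in RAM are written to disk instead of being recomputed.
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        return rdd.count()
    finally:
        sc.stop()
```

Calling `count_lines("/path/to/data.txt")` would start a local context and run the job. In the Python API, stored objects are always pickled, so the choice between the plain and `_SER` levels mainly matters for Scala and Java jobs.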

doubts regarding migration to big data

I have a few doubts regarding Hadoop.
In one of the videos published by Cloudera, an instructor said that Hadoop has HDFS. Every file is stored as a set of chunks, or blocks. Each block is replicated three times on different machines to minimize the points of failure. Each mapper processes a single HDFS block.
From this I gathered the following. Suppose I have a server holding some 100 petabytes of logs, stored in a traditional file system rather than in HDFS.
Main doubt 1: If I want to analyse this huge amount of data efficiently using the MapReduce technique, do I have to transfer the data to a new server running HDFS and having three times the storage of the old server?
In another video, also published by Cloudera, the instructor clearly mentioned that we don't need to migrate the traditional system to a new system; we can use Hadoop and MapReduce on top of it. This somewhat contradicts the statement in the first point.
Main doubt 2: Let's assume that the statement in point 2 is true. How is this possible? I mean, how can we apply Hadoop and MapReduce on a traditional file system where there is no replication of blocks and no name node or daemon on each machine?
My main task is to facilitate fast analysis of a huge amount of logs which are currently not stored in HDFS. Will I need a new server for this or not?
P.S.: I need some good tutorials, books, or articles that could give me in-depth knowledge of big data so that I can start working on it.
So recommendations are most welcome.
Hadoop is just an infrastructure for running a MapReduce style workload (for "big data" or "analytics") atop a cluster of servers.
You can use HDFS for data sharing across the nodes, then use Hadoop's built in workload management to distribute work to nodes where the data is stored. This is sometimes called "function shipping."
But it's also possible to not use HDFS. You can use another network file sharing / distribution mechanism. FTP (file copies), S3 (access from the Amazon Web Services cloud), and a variety of other clustered/distributed file systems are supported by various vendors/platforms. Some of these move the data to the system on which workload is being done ("data shipping").
Which storage strategy is appropriate, efficient, and performant is a big question, and depends greatly on your infrastructure and your MapReduce app's data access patterns. In general, however, analytics jobs are resource hungry, so only small analytics apps tend to run on servers doing other work (the "original systems"). So processing "big data" does tend to suggest new servers--if not ones you buy, ones you rent temporarily from a cloud service like AWS, RackSpace, etc.--and data streaming from replicas/clones of data captured in production ("secondary storage") rather than data still resident on "primary storage."
If you're just starting out with small or modest apps, you might be able to access data in-place, directly from existing systems. But if you've got 100 PB of logs, you're going to want that processed on systems devoted to the task.
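For the case where the logs are copied into HDFS, the storage cost raised in the question is simple arithmetic. The sketch below uses the replication factor of three from the question; the 128 MiB block size is an assumption (it is configurable per cluster), and the function name is invented for illustration.

```python
def hdfs_footprint(data_bytes, replication=3, block_bytes=128 * 1024**2):
    """Return (raw bytes needed across the cluster, number of HDFS blocks)."""
    raw_bytes = data_bytes * replication
    blocks = -(-data_bytes // block_bytes)  # integer ceiling division
    return raw_bytes, blocks


PB = 10**15
raw, blocks = hdfs_footprint(100 * PB)
print(f"raw storage: {raw / PB:.0f} PB, blocks (one map task each): {blocks:,}")
```

With one mapper per block, the block count is also a rough estimate of how many map tasks a full pass over the data would schedule.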

ScaleOut Software In Memory DataGrid Using Hadoop

I have been doing some reading on real time processing using hadoop and stumbled upon this http://www.scaleoutsoftware.com/hserver/
From what the documentation says, it looks like they implemented an in-memory data grid using the Hadoop worker/slave nodes. I have a couple of questions here.
From my understanding, if I have 100 GB of data, I would need at least 100 GB of RAM across all nodes in my cluster just for the data, plus additional RAM for the task tracker and data node daemons, plus additional RAM for the hServer service that would run on all these nodes. Is my understanding correct?
The software claims it can do real-time data processing by improving the latency issues in Hadoop. Is that because it allows us to write data to the in-memory grid instead of HDFS?
I am new to Big Data technologies. Apologies if some of the questions are naive.
[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]
In-memory data grids create a replica for every object to ensure high availability in case of failures. The aggregate amount of memory required is the memory used to store the objects plus the memory used to store the object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
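The sizing rule above can be written down as a small calculation. This is only a sketch of the arithmetic in this answer; the function name and parameters are invented for illustration, and it ignores any per-object overhead.

```python
def grid_memory_gb(data_gb, replicas=1, servers=1):
    """Return (total grid memory, memory per server) in GB.

    Each object is stored once plus `replicas` extra copies, and the
    total is spread evenly over `servers` machines.
    """
    total = data_gb * (1 + replicas)
    return total, total / servers


total, per_server = grid_memory_gb(100, replicas=1, servers=4)
print(total, per_server)  # 200 50.0
```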
With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Caching in RAM using HDFS

I need to process some big files (~2 TBs) with a small cluster (~10 servers), in order to produce a relatively small report (some GBs).
I only care about the final report, not intermediate results, and the machines have a large amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and using the disk only when necessary.
Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this function. The Spark website talks about a memory_and_disk option, but I'd prefer not to ask the company to deploy new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir to /dev/shm/ in hdfs-default.xml, to trick it into using volatile memory instead of the filesystem to store data. Even in this case it would behave badly, I assume, when the RAM gets full and it starts using swap.
Is there a trick to make Hadoop store datablocks as much as possible on RAM and write on disk only when necessary?
Since the release of Hadoop 2.3 you can use HDFS in-memory caching.
You can toy around with mapred.job.reduce.input.buffer.percent (defaults to 0, try something closer to 1.0, see for example this blog post) and also setting the value of mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation.
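As a rough illustration of the HDFS in-memory caching mentioned above, the sketch below builds the two `hdfs cacheadmin` calls that create a cache pool and pin a path into it. The pool name `report-pool` and the helper names are hypothetical. It assumes the `hdfs` CLI is on the PATH and that the DataNodes allow locked memory (`dfs.datanode.max.locked.memory` greater than 0).

```python
def cache_commands(path, pool="report-pool"):
    """Return the two `hdfs cacheadmin` invocations as argv lists."""
    return [
        ["hdfs", "cacheadmin", "-addPool", pool],
        ["hdfs", "cacheadmin", "-addDirective", "-path", path, "-pool", pool],
    ]


def apply_cache(path, pool="report-pool"):
    """Run the commands on a machine with a configured Hadoop client."""
    import subprocess

    for argv in cache_commands(path, pool):
        subprocess.run(argv, check=True)
```

Only `cache_commands` is side-effect free; `apply_cache("/data/input")` would actually change cluster state, so it is worth printing the commands and reviewing them first.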

Resources