Storing data larger than machine memory in a database - memory-management

How, or in what ways, would I store an amount of data that is larger than the machine's memory in a database (e.g. the data is 20 GB, RAM is 16 GB)? Assume I am using one machine, and all the data is document-oriented (NoSQL).

The easiest solution is to use WiredTiger. It's an embedded key-value store with ACID guarantees. Unlike Redis, it is not limited to data that fits in RAM: it keeps a configurable cache in memory and leaves the rest of the data on disk.
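To make that concrete, here is a minimal sketch of the embedded key-value store pattern. It uses RocksDB's Java bindings as a stand-in (WiredTiger's own API differs, but the access pattern is the same); the path and document contents are placeholders. The store keeps only a bounded cache in RAM and leaves the bulk of the data on disk, which is what lets a 20 GB dataset live on a 16 GB machine.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class EmbeddedStoreDemo {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             // All data lives on disk; only a bounded block cache stays in RAM.
             RocksDB db = RocksDB.open(options, "/tmp/docstore")) {
            // Store a document under a key; values could be serialized JSON/BSON.
            db.put("doc:42".getBytes(), "{\"title\":\"example\"}".getBytes());
            byte[] doc = db.get("doc:42".getBytes());
            System.out.println(new String(doc));
        }
    }
}
```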

Related

Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

I have a use case where the Apache Flink process must integrate near real-time data streams (events) from multiple sources, but due to a lack of uniform keys in the different systems I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options:
Local map
If the surrogate key mapping never changes, you could just load it in RichMapFunction#open and perform the lookup there (a sketch follows at the end of this subsection). That of course means that you will have to adjust the memory settings such that Flink doesn't try to take all memory for its own operations.
Some quick math: assume both keys are strings of length 10. They will each need 40 bytes of chars in memory. With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager are not loading the same map twice. Also, I'd choose a format that is suited to loading the data as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
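A minimal sketch of the local-map approach, simplified to a String-to-String mapping (real events would be richer types, and the bulk-loading logic is elided):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.util.HashMap;
import java.util.Map;

// Maps a natural key to its surrogate key via an in-memory table.
public class SurrogateKeyMapper extends RichMapFunction<String, String> {

    private transient Map<String, String> surrogateKeys;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Runs once per parallel subtask, so every task holds its own copy.
        // Presized so ~50M entries fit without rehashing (load factor 0.75).
        surrogateKeys = new HashMap<>(70_000_000);
        // ... bulk-load the mapping here from a fast-to-parse format (e.g. Avro) ...
    }

    @Override
    public String map(String naturalKey) {
        // Pure in-memory lookup; no database round trip per event.
        return surrogateKeys.get(naturalKey);
    }
}
```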
State-based
If memory is an issue or the data is changing, you can also model the lookup data as map state. I'd add a second input for that lookup data and use a KeyedCoProcessFunction, then feed whatever comes from the second input into the state. The state should use the RocksDB backend, so that the data effectively resides on disk.
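A sketch of one possible shape of this, assuming both streams are keyed by the shared natural key so the lookup map becomes per-key state; with the RocksDB state backend configured on the environment, that state effectively lives on disk. The Tuple2 lookup input and String payloads are simplifications:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Input 1: events carrying a natural key. Input 2: (naturalKey, surrogateKey) pairs.
// Both inputs must be keyed by the natural key before being connected.
public class SkEnricher
        extends KeyedCoProcessFunction<String, String, Tuple2<String, String>, String> {

    private transient ValueState<String> surrogateKey;

    @Override
    public void open(Configuration parameters) {
        surrogateKey = getRuntimeContext().getState(
                new ValueStateDescriptor<>("surrogate-key", String.class));
    }

    @Override
    public void processElement1(String event, Context ctx,
            Collector<String> out) throws Exception {
        // Main stream: enrich with the surrogate key stored for this natural key.
        out.collect(event + "," + surrogateKey.value());
    }

    @Override
    public void processElement2(Tuple2<String, String> mapping, Context ctx,
            Collector<String> out) throws Exception {
        // Lookup stream: remember (or overwrite) the surrogate key for this key.
        surrogateKey.update(mapping.f1);
    }
}
```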
Joining data
A lookup can also be modeled as a join. If you are already using the Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix the DataStream API with Tables.
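For illustration, a sketch of what the temporal-table variant might look like in Flink SQL; the table and column names are made up, and the lookup table would need to be registered as a versioned table with a primary key:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TemporalJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Assumes 'events' and 'sk_mapping' have been registered beforehand
        // (e.g. via CREATE TABLE DDL); all names here are hypothetical.
        tableEnv.executeSql(
                "SELECT e.payload, sk.surrogate_key "
              + "FROM events AS e "
              + "JOIN sk_mapping FOR SYSTEM_TIME AS OF e.proc_time AS sk "
              + "ON e.natural_key = sk.natural_key");
    }
}
```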

How to set up Apache Spark to use the local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset which doesn't fit in the 8 GB RAM of my work computer, but that computer has a 1 TB local hard disk.
The link below from the official documentation mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
For me, computation time is not a priority at all; fitting the data into a single computer's RAM/hard disk for processing is more important, due to the lack of alternative options.
Note:
I am looking for a solution which doesn't include the below items
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical solutions where people successfully used Spark to operate on data that doesn't fit in RAM in standalone/local mode on a single computer. Has anyone done this successfully without major limitations?
Questions
SAS has a similar capability of out-of-core processing, using which it can use both RAM and the local hard disk for model building etc. Can Spark be made to work in the same way when the data is larger than the RAM size?
SAS persists the complete dataset to the hard disk in ".sas7bdat" format; can Spark do a similar persist to the hard disk?
If this is possible, how to install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence models as per your need. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores the data in serialized fashion.
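As an illustration, a minimal local-mode sketch in Java (the paths are placeholders): partitions that don't fit in RAM are spilled, in serialized form, to the local disk, and spark.local.dir controls where the spill files go.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class OutOfCoreDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("out-of-core")
                .setMaster("local[*]")                        // single machine
                .set("spark.local.dir", "/path/on/1tb-disk"); // spill location
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> data = sc.textFile("/path/to/50gb-dataset");
        // Keep what fits in RAM (serialized); write the rest to local disk.
        data.persist(StorageLevel.MEMORY_AND_DISK_SER());
        System.out.println(data.count());

        sc.stop();
    }
}
```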

Mongodb - make inmemory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I have a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64 GB of RAM.
From the mongodb docs it states:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean that as long as my data is smaller than the available RAM, it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data were quite dynamic (ranging from 50 GB to 75 GB every few hours), would it theoretically perform better to design MongoDB in a way which allows it to manage itself with its cache (the default setup of Mongo), or to put MongoDB into memory initially and, if the data grows over the size of RAM, use swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. It provides an efficient way to access the data, while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean that as long as my data is smaller than the available RAM, it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.

ScaleOut Software In Memory DataGrid Using Hadoop

I have been doing some reading on real-time processing using Hadoop and stumbled upon this: http://www.scaleoutsoftware.com/hserver/
From what the documentation says, it looks like they implemented an in-memory data grid using the Hadoop worker/slave nodes. I have a couple of questions here:
From my understanding, if I have data of size 100 GB, I would need at least 100 GB of RAM across all nodes of my cluster just for the data, plus additional RAM for the task tracker and data node daemons, plus additional RAM for the hServer service that would run on all these nodes. Is my understanding correct?
The software claims it can do real-time data processing by improving the latency issues in Hadoop. Is that because it allows us to write data to the in-memory grid instead of HDFS?
I am new to Big Data technologies. Apologies if some of the questions are naive.
[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]
In-memory data grids create a replica of every object to ensure high availability in case of failures. The aggregate amount of memory required is the memory used to store the objects plus the memory used to store the object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100 GB of data?

I have come to know that Cassandra uses Bloom filters for performance, and that it stores the filter data in physical memory.
1) Where does Cassandra store these filters? (In heap memory?)
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
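As a rough cross-check on sizes like those above, the standard Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 gives the number of bits m needed for n keys at false-positive rate p. A small sketch with made-up numbers (not Cassandra's actual defaults):

```java
public class BloomFilterSizing {

    // Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
    static double bits(long keys, double falsePositiveRate) {
        return -keys * Math.log(falsePositiveRate) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        long keys = 10_000_000L; // hypothetical: 10M row keys in one SSTable
        double p = 0.01;         // 1% false-positive rate
        double mb = bits(keys, p) / 8.0 / (1024 * 1024);
        // Prints roughly 11.4 MB - a tiny fraction of the data the keys index.
        System.out.printf("~%.1f MB of filter for %,d keys%n", mb, keys);
    }
}
```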
