Apache Ignite cache size limitation - caching

I am starting a project where I want to load a lot of data into the Apache Ignite cache to perform certain computations. My original data load will be about 40 GB, and that may grow 4- or 5-fold at certain times. I looked through the Ignite documentation and didn't find anything regarding cache size limitations. So, would it be fair to assume that as long as I have enough resources (CPUs and RAM) I can add as many nodes as necessary without compromising performance, which in my case means the speed of computations?

Yes, Ignite scales horizontally. There's no explicit limit on the capacity of a cache or table: data is partitioned across the server nodes, so adding nodes adds capacity.
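As a minimal sketch (Ignite 2.x Java API; the cache name and value type are made up for illustration), a partitioned cache spreads its entries across every server node that joins the cluster:
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class IgniteCapacitySketch {
        public static void main(String[] args) {
            // Each JVM that runs this joins the same cluster and adds its RAM to the total capacity.
            Ignite ignite = Ignition.start();

            // PARTITIONED mode splits the data set across all server nodes;
            // one backup copy per entry survives a single node failure.
            CacheConfiguration<Long, byte[]> cfg = new CacheConfiguration<>("computeData");
            cfg.setCacheMode(CacheMode.PARTITIONED);
            cfg.setBackups(1);

            IgniteCache<Long, byte[]> cache = ignite.getOrCreateCache(cfg);
            cache.put(1L, new byte[1024]);   // entries are routed to a node by key hash
        }
    }
Starting the same code (or ignite.sh with the same configuration) on another machine adds that machine's RAM to the cache, and Ignite rebalances the partitions automatically.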

Related

What are the critical performance factors in NiFi?

I have found many factors affecting performance, such as the amount of memory dedicated to the JVM heap, the repository directories, and many parameters in nifi.properties.
I have no idea how to set them. Any help is appreciated!
NiFi 1.10.0, cluster mode.
The VM configuration was attached as a picture.
@KONG There isn't a clear question asked above.
NiFi Performance involves the following critical factors in order:
Number of Nodes
Number of Cores Per Node
Memory for NiFi Min/Max
Memory Available
Garbage Collection
Min/Max Thread Configuration
Disk Configuration (per the documentation, separate disks for the NiFi repositories and high-speed disks are required)
Size of Flow / Number of Processors & Controller Services
Amount of Data Flowing
Tuning of flow (concurrency)
My advice is to start with the default settings that come with NiFi out of the box. Then investigate any performance issues and try to determine the correct next steps:
Add more Nodes
Increase Memory (an illustrative heap/repository configuration follows this list)
Tune Garbage Collection
Adjust Min/Max Threads based on Total # of Cores & Nodes
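As a rough illustration only (the values and paths are placeholders, not recommendations), the heap is set in conf/bootstrap.conf and the repository locations in conf/nifi.properties:
    # conf/bootstrap.conf - JVM heap min/max (illustrative values)
    java.arg.2=-Xms4g
    java.arg.3=-Xmx4g
    # conf/nifi.properties - ideally each repository on its own disk
    nifi.flowfile.repository.directory=/disk1/flowfile_repository
    nifi.content.repository.directory.default=/disk2/content_repository
    nifi.provenance.repository.directory.default=/disk3/provenance_repository
The maximum timer-driven thread count is adjusted in the NiFi UI under Controller Settings rather than in these files.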

Amount of data storage: HDFS vs NoSQL

In several sources on the internet, it is explained that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general, when we go beyond 1 TB we must start thinking about Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1 TB/node for spinning disks and 3 TB/node when using SSDs.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore, there are a few JIRA tickets aimed at improving server density in the future.
There is currently a limit on server density in Cassandra because of:
repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved by the incremental repair introduced in Cassandra 2.1.
compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold (a sketch of that table option follows this list). A few people are using DateTiered compaction in production with densities up to 10 TB/node.
node rebuild. Imagine one node crashes and is completely lost; you'll need to rebuild it by streaming data from the other replicas. The higher the node density, the longer it takes to rebuild the node.
load distribution. The more data you have on a node, the higher the load average (high disk I/O and high CPU usage). This will greatly impact node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch scenario that takes 10 h to complete, it is critical for a real-time database/application subject to a tight SLA.
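As an illustration of the compaction knob mentioned above (the keyspace, table, and threshold are hypothetical), switching a time-series table to DateTieredCompactionStrategy so that SSTables older than a cutoff are no longer compacted might look like:
    -- Hypothetical table; 'max_sstable_age_days' stops compacting SSTables
    -- whose data is older than the threshold (the value 10 is illustrative).
    ALTER TABLE metrics.sensor_readings
      WITH compaction = {
        'class': 'DateTieredCompactionStrategy',
        'max_sstable_age_days': '10'
      };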

What makes Spark fast if data size exceeds available memory?

Everywhere I try to understand Spark, it says it is fast because it keeps data in memory, as opposed to MapReduce. Let's take this example:
I have a 5-node Spark cluster with 100 GB of RAM each. Let's say I have 500 TB of data to run a Spark job against. The total data Spark can keep in memory is 100 * 5 = 500 GB. If it can keep at most 500 GB of data in memory at any point in time, what makes it lightning fast?
Spark isn't magical and can't change fundamental principles of computing. Spark uses memory as a progressive enhancement and will fall back to disk I/O for huge datasets that cannot be kept in memory. In a scenario where tables must be scanned from disk, Spark performance should be comparable to other parallel solutions that involve scanning tables from disk.
Suppose only 0.1% of the 500 TB is "interesting". For instance, in a marketing funnel there are a lot of ad impressions, fewer clicks, even fewer sales, and fewer repeat sales. A program can filter through a huge dataset and tell Spark to cache in memory the smaller, filtered and corrected dataset needed for further processing. Caching that smaller filtered dataset is obviously much faster than repeatedly scanning the larger raw data from disk and reprocessing it.
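A minimal sketch of that pattern with the Spark Java API (the input path, column name, and filter are hypothetical):
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class FunnelFilter {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("funnel-filter")
                    .getOrCreate();

            // Scans the full raw data set from disk (hypothetical path and schema).
            Dataset<Row> events = spark.read().parquet("hdfs:///events/impressions");

            // Keep only the "interesting" 0.1%: events that resulted in a sale.
            Dataset<Row> sales = events.filter(col("event_type").equalTo("sale"));

            // Pin the small filtered set in memory; later stages reuse it
            // instead of rescanning the 500 TB of raw events.
            sales.cache();
            System.out.println("sales rows: " + sales.count());
        }
    }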

Having performance issues with Datastax cassandra

I have installed DataStax Cassandra on 2 independent machines (one with 16 GB of RAM and the other with 32 GB) and am going with mostly default configuration.
I have created a table with some 700 columns. When I try to insert records using Java, it is able to insert 1000 records per 30 seconds, which seems very low to me; as per the DataStax benchmarks it should be around 18,000+. To my surprise, performance is the same on both the 32 GB and 16 GB machines.
I am new to Cassandra; can anyone help me in this regard? I feel I am doing something wrong with the cassandra.yaml configuration.
I did a benchmarking and tuning exercise on Cassandra some time ago and found some useful settings, which are mentioned below.
In Cassandra, data distribution is based on policies. The default is a combination of round-robin and token-aware policies, which works best in almost all cases. If you want to customize data distribution, it is possible to write a new distribution strategy in Cassandra, i.e. distribute the data based on a location, based on an attribute, etc., whichever best fits the specific requirement.
Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. We used a Bloom filter false-positive chance of 0.1 to maintain a balance between efficiency and overhead.
Consistency level is a key parameter in NoSQL databases. Try QUORUM or ONE.
Other JVM tuning options, like heap size and survivor ratio, should be optimal to achieve maximum performance.
If a lot of memory is available, the memtable size can be increased so that it fits into memory, which will improve performance. The interval for flushing memtables to disk should be high enough that no unnecessary I/O operations are performed.
Concurrency settings in Cassandra are important for scaling. Based on our tests and observations, we found that Cassandra performs better when concurrency is set to the number of cores * 5 and native_transport_max_threads is set to 256 (see the cassandra.yaml excerpt after this list).
Follow the additional tuning settings recommended by Cassandra, such as disabling swap, ulimit settings, and compaction settings.
The replication factor in Cassandra should be equal to the number of nodes in the cluster to achieve maximum throughput of the system.
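As a concrete illustration of the concurrency point above (the numbers are placeholders for an 8-core node following the cores * 5 rule of thumb from this answer, not official recommendations), the relevant cassandra.yaml lines would look like:
    # cassandra.yaml excerpt - illustrative values only (8-core node, cores * 5)
    concurrent_reads: 40
    concurrent_writes: 40
    native_transport_max_threads: 256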
These settings are mostly for insertion, with a small impact on reads.
I hope this will help you :)
Are you using async writes?
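If not, here is a minimal sketch using the DataStax Java driver 3.x (the keyspace, table, and columns are made up for illustration) that issues inserts asynchronously instead of waiting for each row:
    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class AsyncInsertSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo_ks");          // hypothetical keyspace

            PreparedStatement ps = session.prepare(
                    "INSERT INTO wide_table (id, col1) VALUES (?, ?)"); // hypothetical table

            List<ResultSetFuture> futures = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                BoundStatement bound = ps.bind((long) i, "value-" + i);
                futures.add(session.executeAsync(bound));           // does not block per row
            }
            // Wait for all in-flight writes before shutting down.
            for (ResultSetFuture f : futures) {
                f.getUninterruptibly();
            }
            cluster.close();
        }
    }
In practice you would also cap the number of in-flight futures (for example with a semaphore) so the client does not flood the cluster.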
Try running cassandra-stress; that way you can isolate client-side issues.
Another option is Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
Since you are writing in Java, use Brian's code as a best practice example.

Mongodb - make inmemory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I had a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64 GB of RAM.
From the mongodb docs it states:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean that as long as my data is smaller than the available RAM it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory:
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data were quite dynamic (it can range from 50 GB to 75 GB every few hours), would it theoretically perform better to let MongoDB manage itself with its cache (the default setup), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, use swap space (on SSD)?
MongoDB's default storage engine memory-maps the data files. This provides an efficient way to access the data while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available RAM it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
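As a rough sanity check for that first point (a sketch only; it uses the current MongoDB Java driver and a made-up database name), you can compare the dataSize + indexSize reported by dbStats with the RAM on a node:
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class WorkingSetCheck {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoDatabase db = client.getDatabase("mydb");        // hypothetical database

            // dbStats reports sizes in bytes when no scale is given.
            Document stats = db.runCommand(new Document("dbStats", 1));
            long dataSize = ((Number) stats.get("dataSize")).longValue();
            long indexSize = ((Number) stats.get("indexSize")).longValue();

            long ramBytes = 64L * 1024 * 1024 * 1024;             // the 64 GB nodes from the question
            boolean fitsInRam = dataSize + indexSize < ramBytes;
            System.out.println("data + indexes fit in RAM: " + fitsInRam);
            client.close();
        }
    }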
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be just as fast once the data has been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.
