I am trying to put data into an Ignite cache, but it takes too long and slows my process down considerably.
I have an Ignite cache of type IgniteCache<Integer, List<Integer>>, and adding data to it takes too much time. I am using 4 nodes in the cluster.
How can I reduce the load time and speed up processing?
I am starting a project where I want to load a lot of data into an Apache Ignite cache to perform certain computations. My original data load will be about 40 GB, and that may grow 4- or 5-fold at certain times. I looked through the Ignite documentation and didn't find anything regarding limits on cache size. So, would it be fair to assume that as long as I have enough resources (CPUs and RAM), I can add as many nodes as necessary without compromising performance, which in my case means speed of computation?
Yes, Ignite scales horizontally. There's no explicit limit on the capacity of a table.
For example, suppose I cache a number of RDDs in memory.
Then I leave the application running for a few days or more.
And then I try to access the cached RDDs.
Will they still be in memory?
Or will Spark clean up unused cached RDDs after some period of time?
Please help!
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
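To make the LRU behaviour in the quoted documentation concrete, here is a minimal pure-Python sketch of a size-bounded cache that evicts the least-recently-used block. It is a toy model, not Spark code: the class name `LRUBlockCache`, the block ids, and the capacity of 2 are all hypothetical. Note that in this model nothing is evicted merely because time passes; a block is dropped only when a newer block needs the space.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy model of a size-bounded cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of blocks kept
        self.blocks = OrderedDict()   # least recently used first, newest last

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                     # evicted or never cached
        self.blocks.move_to_end(block_id)   # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used


cache = LRUBlockCache(capacity=2)
cache.put("rdd_1_part_0", [1, 2, 3])
cache.put("rdd_2_part_0", [4, 5, 6])
cache.get("rdd_1_part_0")              # touch rdd_1, so rdd_2 becomes the oldest
cache.put("rdd_3_part_0", [7, 8, 9])   # over capacity: rdd_2 is evicted
print(sorted(cache.blocks))            # ['rdd_1_part_0', 'rdd_3_part_0']
```

In real Spark, a cached partition that has been evicted is recomputed from its lineage when it is next needed, so access still works but is slower.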
I have a large dataset (around 5 TB) that I am trying to process with Apache Spark. I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after around 500 GB of data has been processed, that map transformation slows down, and some of the tasks take several minutes or even hours to complete.
I am using 10 machines with 122 GB of RAM and 16 CPUs, and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage completes locally on some nodes faster than on others. Based on that observation, here is what I would try:
1. Cache the RDD that you process. Do not forget to unpersist it when you no longer need it.
   Understanding caching, persisting in Spark.
2. Check whether the partitions are balanced, which does not seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail in distributed computing, isn't it? :)
   How to balance my data across the partitions?
3. Reduce the communication costs, i.e. use fewer workers than you do now, and see what happens. Of course, that depends heavily on your application. Sometimes communication costs become so large that they dominate, so using fewer machines, for example, speeds up the job. However, I would do that only if steps 1 and 2 do not suffice.
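As a rough illustration of the partition-balance check suggested above, the following pure-Python sketch counts how many records land in each partition under simple hash partitioning and reports a skew ratio. Everything here is an assumption made for illustration: the helper names `partition_sizes` and `skew_ratio`, the key sets, and the use of `zlib.crc32` as a stand-in hash (chosen only because it is deterministic across runs). It is not how Spark partitions data internally.

```python
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records fall into each partition under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # crc32 is a stand-in hash, used because it is stable between runs
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean size (1.0 means perfectly balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical workload: one "hot" key dominates the dataset.
skewed_keys = ["hot"] * 9000 + [f"user_{i}" for i in range(1000)]
uniform_keys = [f"user_{i}" for i in range(10000)]

skewed = partition_sizes(skewed_keys, 8)
uniform = partition_sizes(uniform_keys, 8)
print("skewed :", skewed, round(skew_ratio(skewed), 2))
print("uniform:", uniform, round(skew_ratio(uniform), 2))
```

A ratio far above 1.0 means one task receives much more data than the others, which would match the symptom of a few tasks running for hours while the rest finish quickly.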
Without more information, it seems that at some point in the computation your data gets spilled to disk because there is no more space in memory.
That is only a guess; you should check your Spark UI.
I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I have a question about which design would give better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64 GB of RAM.
The MongoDB docs state:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory:
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data were quite dynamic (ranging from 50 GB to 75 GB every few hours), would it theoretically perform better to let MongoDB manage its own cache (the default setup), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, fall back on swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. This provides an efficient way to access the data while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be just as fast once the data has been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.
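The memory-mapping idea this answer relies on can be seen in miniature with Python's standard `mmap` module. The sketch below only illustrates the operating-system mechanism (file pages are loaded into the page cache on first access and then served from RAM); it says nothing about MongoDB's actual file format. The temporary file, the 4096-byte offset, and the record contents are all hypothetical.

```python
import mmap
import os
import tempfile

# Write a small stand-in "data file" (hypothetical; not a real MongoDB file):
# one 11-byte record at offset 0 and another at offset 4096.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"record-0001" + b"\x00" * 4085 + b"record-0002")
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only. The OS loads pages lazily on first access
    # and keeps them in its page cache, so repeated reads come from RAM.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = mapped[0:11]          # slicing reads straight from the mapping
        second = mapped[4096:4107]

os.remove(path)
print(first, second)   # b'record-0001' b'record-0002'
```

Because the mapped pages live in the OS page cache, the same memory serves both the file system and the application, which is the "no double caching" point made above.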
How does Spark handle concurrent queries? I have read a bit about Spark and the underlying RDDs, but I am unable to understand how concurrent queries would be handled.
For example, if I run a query that loads data into memory and consumes all the available memory, and at the same time someone else runs a query involving another set of data, how would Spark allocate memory to both queries? Also, what would the impact be if priorities are taken into account?
Also, can running lots of parallel queries cause the machines to hang?
First, Spark does not take more memory (RAM) than its configured threshold.
Spark tries to allocate the default amount of memory to every job.
If there is insufficient memory for a new job, it tries to spill the in-memory contents of the least-recently-used (LRU) RDD to disk and then allocates the freed memory to the new job.
Optionally, you can also specify the storage level of an RDD, such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, etc.
Scenario: consider a machine with little memory and a huge number of jobs. Under the approach above, most of the RDDs will end up on disk only.
So the jobs will continue to run, but they will not get the benefit of Spark's in-memory processing.
Spark handles memory allocation quite intelligently.
If Spark is used on top of YARN, then the ResourceManager also takes part in resource allocation.
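The interaction between a fixed memory budget and the storage levels mentioned above can be sketched with a small pure-Python model. This is a simplification for illustration, not Spark's memory manager: the class `BlockStore`, the block names, the sizes, and the budget of 100 units are hypothetical, and insertion order stands in for recency of use. The one behaviour taken from Spark's documented storage levels is that an evicted MEMORY_AND_DISK block is kept on disk, while an evicted MEMORY_ONLY block is dropped and has to be recomputed.

```python
from collections import OrderedDict


class BlockStore:
    """Toy model: fixed memory budget, oldest-first eviction, optional spill."""

    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory = OrderedDict()   # block_id -> (size, level), oldest first
        self.disk = {}                # block_id -> size
        self.dropped = []             # blocks that would need recomputation

    def used(self):
        return sum(size for size, _ in self.memory.values())

    def cache(self, block_id, size, level="MEMORY_AND_DISK"):
        if level == "DISK_ONLY":
            self.disk[block_id] = size
            return
        # Evict the oldest blocks until the new block fits in the budget.
        while self.memory and self.used() + size > self.memory_budget:
            old_id, (old_size, old_level) = self.memory.popitem(last=False)
            if old_level == "MEMORY_AND_DISK":
                self.disk[old_id] = old_size    # spilled, still readable
            else:
                self.dropped.append(old_id)     # MEMORY_ONLY: must be recomputed
        if self.used() + size <= self.memory_budget:
            self.memory[block_id] = (size, level)
        elif level == "MEMORY_AND_DISK":
            self.disk[block_id] = size          # larger than the whole budget
        else:
            self.dropped.append(block_id)


store = BlockStore(memory_budget=100)
store.cache("job_a_block", 60, "MEMORY_AND_DISK")
store.cache("job_b_block", 30, "MEMORY_ONLY")
store.cache("job_c_block", 50, "MEMORY_AND_DISK")  # evicts job_a_block to disk
store.cache("job_d_block", 40, "MEMORY_ONLY")      # evicts job_b_block (dropped)
print(sorted(store.memory), sorted(store.disk), store.dropped)
```

In this model a second job never fails for lack of cache space; it pushes older blocks out of memory, which matches the answer's point that jobs keep running but lose the in-memory speed-up.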