I find many factors affecting performance, such as the amount of memory dedicated to JVM heap space, the repository directories, and many parameters in nifi.properties.
I have no idea how to set them. Any help is appreciated!
NiFi 1.10.0 / cluster mode
The VM configuration is shown in the attached picture.
#KONG There isn't a clear question asked above.
NiFi performance involves the following critical factors, in order:
Number of Nodes
Number of Cores Per Node
Memory for NiFi Min/Max
Memory Available
Garbage Collection
Min/Max Thread Configuration
Disk Configuration (per the documentation, separate disks for the NiFi repositories and high-speed disks are required)
Size of Flow / Number of Processors & Controller Services
Amount of Data Flowing
Tuning of flow (concurrency)
My advice is to start with the default settings that come with NiFi out of the box. Then investigate any performance issues and try to determine the correct next steps:
Add more Nodes
Increase Memory
Tune Garbage Collection
Adjust Min/Max Threads based on Total # Cores & Nodes
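Several of the knobs above live in bootstrap.conf (heap) and nifi.properties (repository locations). A minimal sketch; the values and paths below are illustrative placeholders, not recommendations:

```properties
# bootstrap.conf -- JVM heap (setting min == max avoids heap-resizing pauses)
java.arg.2=-Xms8g
java.arg.3=-Xmx8g

# nifi.properties -- put each repository on its own fast disk where possible
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
nifi.provenance.repository.directory.default=/disk3/provenance_repository
```

The max timer-driven thread count is adjusted in the UI under Controller Settings rather than in nifi.properties; a common starting point is 2-4x the core count per node, tuned from there.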
I am starting a project where I want to load a lot of data into an Apache Ignite cache to perform certain computations. My original data load will be about 40 GB, and that may grow 4- or 5-fold at certain times. I looked through the Ignite documentation and didn't find anything regarding cache size limitations. So, would it be fair to assume that as long as I have enough resources (CPUs and RAM), I can add as many nodes as necessary without compromising performance, which in my case is the speed of computations?
Yes, Ignite scales horizontally. There is no explicit limit on the capacity of a table.
I am curious to know how memory is managed in H2O.
Is it completely in-memory, or does it allow swapping in case memory consumption goes beyond the available physical memory? Can I set the -mapperXmx parameter to 350 GB if I have a total of 384 GB of RAM on a node? I do realise that the cluster won't be able to handle anything other than the H2O cluster in this case.
Any pointers are much appreciated, Thanks.
H2O-3 stores data completely in memory, in a distributed, column-compressed key-value store.
No swapping to disk is supported.
Since you are alluding to mapperXmx, I assume you are talking about running H2O in a YARN environment. In that case, the total YARN container size allocated per node is:
mapreduce.map.memory.mb = mapperXmx * (1 + extramempercent/100)
extramempercent is another (rarely used) command-line parameter to h2odriver.jar. Note the default extramempercent is 10 (percent).
mapperXmx is the size of the Java heap, and the extra memory referred to above is for additional overhead of the JVM implementation itself (e.g. the C/C++ heap).
YARN is extremely picky about this, and if your container tries to use even one byte over its allocation (mapreduce.map.memory.mb), YARN will immediately terminate the container. (And for H2O-3, since it's an in-memory processing engine, the loss of one container terminates the entire job.)
You can set mapperXmx and extramempercent to as large a value as YARN has space to start containers.
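Plugging the question's numbers into the formula above makes the problem concrete (a quick sanity check using integer GB math; values taken from the question):

```java
public class YarnContainerCheck {
    public static void main(String[] args) {
        int mapperXmxGb = 350;     // the -mapperXmx value asked about
        int extramempercent = 10;  // h2odriver.jar default

        // mapreduce.map.memory.mb = mapperXmx * (1 + extramempercent/100)
        int containerGb = mapperXmxGb * (100 + extramempercent) / 100;

        System.out.println(containerGb + " GB");  // prints "385 GB"
    }
}
```

Since 385 GB exceeds the 384 GB of physical RAM on the node, a 350 GB -mapperXmx would not fit; the heap would need to be lowered so the whole container (heap plus the extramempercent overhead) stays under what YARN can allocate.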
I will have 200 million files in my HDFS cluster. We know each file will occupy about 150 bytes of NameNode memory, plus 150 bytes for each of its 3 blocks, so roughly 600 bytes per file in the NameNode.
So I plan to give my NameNode 250 GB of memory to handle 200 million files. My question is: with such a big heap of 250 GB, will it cause too much GC pressure? Is a 250 GB heap for the NameNode feasible?
The ideal NameNode memory size is roughly: total space used by the metadata + OS + size of the daemons, plus 20-30% headroom for processing-related data.
You should also consider the rate at which data comes into your cluster. If you have data coming in at 1 TB/day, then you must plan for more memory, or you would soon run out.
It's always advised to keep at least 20% of memory free at any point in time. This helps avoid the NameNode going into a full garbage collection.
As Marco specified earlier, you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale for GC configuration.
In your case 256 GB looks good if you aren't going to ingest a lot of data and aren't going to do lots of operations on the existing data.
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
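As a rough sanity check of the sizing in the question (the ~150 bytes per namespace object figure is a rule of thumb, not an exact number):

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long files = 200_000_000L;   // file count from the question
        long objectsPerFile = 1 + 3; // one file object plus three blocks
        long bytesPerObject = 150;   // rule-of-thumb bytes per namespace object

        long totalBytes = files * objectsPerFile * bytesPerObject;

        System.out.println(totalBytes / 1_000_000_000L + " GB");  // prints "120 GB"
    }
}
```

So the metadata itself needs on the order of 120 GB, and a 250 GB heap leaves comfortable headroom for the OS, daemons, and processing, consistent with the 20-30% guidance above.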
You can have 256 GB of physical memory in your NameNode host. If your data grows to huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyperthreading) in the NameNode host. I believe the link below addresses your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
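The linked article covers the details; in practice, the NameNode heap and GC flags go into HADOOP_NAMENODE_OPTS in hadoop-env.sh. A hedged sketch with illustrative values (tune per the article; CMS was the typical collector for NameNodes of this era):

```shell
# hadoop-env.sh -- example NameNode JVM/GC settings (illustrative values only)
export HADOOP_NAMENODE_OPTS="-Xms250g -Xmx250g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  ${HADOOP_NAMENODE_OPTS}"
```

Setting -Xms equal to -Xmx avoids heap resizing, and GC logging makes it possible to verify whether full collections are actually occurring.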
I have installed DataStax Cassandra on 2 independent machines (one with 16 GB RAM and the other with 32 GB RAM) and am going with mostly default configuration.
I have created a table with some 700 columns. When I try to insert records using Java, it is able to insert only 1000 records per 30 seconds, which seems very low to me; per the DataStax benchmark it should be around 18,000+. To my surprise, performance is the same on both the 32 GB and 16 GB RAM machines.
I am new to Cassandra; can anyone help me in this regard? I feel I am doing something wrong with the cassandra.yaml configuration.
I did a benchmarking and tuning exercise on Cassandra some time ago and found some useful settings, mentioned below:
In Cassandra, data distribution is based on policies. The default is a combination of round-robin and token-aware policy, which works best in almost all cases. If you want to customize data distribution, it is possible to write a new distribution strategy for Cassandra, i.e. distribute the data based on a location, based on an attribute, etc., whichever best fits the customized requirement.
Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. We used a Bloom filter false-positive chance of 0.1 to maintain a balance between efficiency and overhead.
Consistency level is a key parameter in NoSQL databases. Try with QUORUM or ONE.
Other JVM tuning options, such as heap memory size and survivor ratio, should be optimal to achieve maximum performance.
If large memory is available, the memtable size can be increased so that it fits in memory, which will improve performance. The interval for flushing memtables to disk should be high enough that unnecessary IO operations are avoided.
Concurrency settings in Cassandra are important for scaling. Based on our tests and observations, we found that Cassandra performs better when concurrency is set to number of cores * 5 and native_transport_max_threads is set to 256.
Follow the additional tuning settings recommended by Cassandra, such as disabling swap, ulimit settings, and compaction settings.
The replication factor in Cassandra should be equal to the number of nodes in the cluster to achieve maximum throughput of the system.
These are mostly for insertion, with a little bit of impact on reads.
I hope this will help you :)
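Several of the settings mentioned above map to cassandra.yaml entries; a sketch with illustrative values, assuming an 8-core node per the cores * 5 observation above:

```yaml
# cassandra.yaml -- illustrative values, tune for your hardware
concurrent_writes: 40                # ~ number of cores * 5
concurrent_reads: 40
native_transport_max_threads: 256
# larger memtables mean fewer flushes (requires more heap)
memtable_heap_space_in_mb: 4096
```

Note that the Bloom filter false-positive chance (bloom_filter_fp_chance) and the replication factor are set per table and per keyspace in CQL, not in cassandra.yaml.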
Are you using async writes?
Try running cassandra-stress; that way you can isolate client issues.
Another option is Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
Since you are writing in Java, use Brian's code as a best-practice example.
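A typical cassandra-stress invocation to measure server-side throughput independently of your client code (the node address and row counts below are placeholders):

```shell
# write 1M rows with 50 client threads against one node
cassandra-stress write n=1000000 -rate threads=50 -node 192.168.1.10

# then measure reads over the same data
cassandra-stress read n=1000000 -rate threads=50 -node 192.168.1.10
```

If cassandra-stress achieves throughput far above your 1000 rows / 30 s, the bottleneck is in the client (e.g. synchronous, one-at-a-time inserts) rather than in cassandra.yaml.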
I'm having trouble figuring out the best way to configure my Hadoop cluster (CDH4), running MapReduce v1. I'm in a situation where I need to run mappers that require such a large amount of Java heap space that I couldn't possibly run more than 1 mapper per node, but at the same time I want to be able to run jobs that can benefit from many mappers per node.
I'm configuring the cluster through the Cloudera management UI, and the Max Map Tasks and mapred.map.child.java.opts appear to be quite static settings.
What I would like to have is something like a heap space pool with X GB available, that would accommodate both kinds of jobs without having to reconfigure the MapReduce service each time. If I run 1 mapper, it should assign X GB heap - if I run 8 mappers, it should assign X/8 GB heap.
I have considered both the Maximum Virtual Memory and the Cgroup Memory Soft/Hard limits, but neither will get me exactly what I want. Maximum Virtual Memory is not effective, since it is still a per-task setting. The Cgroup setting is problematic because it does not seem to actually restrict the individual tasks to a lower amount of heap if there are more of them; rather, it will allow a task to use too much memory and then kill the process when it does.
Can the behavior I want to achieve be configured?
(PS you should use the newer name of this property with Hadoop 2 / CDH4: mapreduce.map.java.opts. But both should still be recognized.)
The value you configure in your cluster is merely a default. It can be overridden on a per-job basis. You should leave the default value from CDH, or configure it to something reasonable for normal mappers.
For your high-memory job only, in your client code, set mapreduce.map.java.opts in your Configuration object for the Job before you submit it.
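For example, with the standard Tool/GenericOptionsParser support, the override can be passed on the command line when submitting just the high-memory job, leaving the cluster default untouched (the jar, class name, and heap size below are placeholders):

```shell
# per-job heap override; only this job gets the large mapper heap
hadoop jar myjob.jar com.example.HighMemJob \
  -D mapreduce.map.java.opts=-Xmx24g \
  /input/path /output/path
```

This only works if the job's main class uses ToolRunner (or otherwise applies GenericOptionsParser); alternatively, call conf.set("mapreduce.map.java.opts", "-Xmx24g") directly in the client code before submitting.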
The answer gets more complex if you are running MR2/YARN since it no longer schedules by 'slots' but by container memory. So memory enters the picture in a new, different way with new, different properties. (It confuses me, and I'm even at Cloudera.)
In a way it would be better, because you express your resource requirement in terms of memory, which is good here. You would set mapreduce.map.memory.mb as well to a size about 30% larger than your JVM heap size since this is the memory allowed to the whole process. It would be set higher by you for high-memory jobs in the same way. Then Hadoop can decide how many mappers to run, and decide where to put the workers for you, and use as much of the cluster as possible per your configuration. No fussing with your own imaginary resource pool.
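Under MR2/YARN the pair of settings would look something like this (illustrative numbers, keeping the container roughly 30% above the heap as described):

```properties
# per-job configuration (e.g. set on the Job's Configuration object)
mapreduce.map.java.opts=-Xmx8g
mapreduce.map.memory.mb=10650    # ~30% headroom above the 8 GB heap
```

The container size covers the whole process (JVM heap plus native overhead), which is why it must be larger than -Xmx; YARN then packs containers onto nodes based on these requests.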
In MR1, this is harder to get right. Conceptually you want to set the maximum number of mappers per worker to 1 via mapreduce.tasktracker.map.tasks.maximum, along with your heap setting, but just for the high-memory job. I don't know if the client can request or set this though on a per-job basis. I doubt it as it wouldn't quite make sense. You can't really approach this by controlling the number of mappers just because you have to hack around to even find out, let alone control, the number of mappers it will run.
I don't think OS-level settings will help. In a way these resemble more how MR2 / YARN thinks about resource scheduling. Your best bet may be to (move to MR2 and) use MR2's resource controls and let it figure the rest out.