Nifi memory continues to expand - apache-nifi

I am running a three-node NiFi cluster, version 1.16.3. Each node has 8 cores, 32 GB of memory, and a 2 TB high-speed SSD. The OS is CentOS 7.9 on ARM64 hardware.
The initial JVM configuration of NiFi is Xms12g and Xmx12g (bootstrap.conf).
It is a native installation; Docker is not used, only NiFi is installed on all of those machines, and it uses the embedded ZooKeeper.
20 workflows run every day from 00:00 to 03:00, with a total data size of 1.2 GB. They collect CSV documents into a Greenplum database.
My problem is that NiFi's memory usage grows every day, about 0.2 GB per day, on all three nodes. The memory slowly fills up and then the machine dies. This takes about a month (with the heap set to 12 GB).
That is to say, I need to restart the cluster every month. I use only native (built-in) processors in my workflows.
I can't locate the problem. Who can help me?
If I have missed any details, please feel free to let me know, thanks.
I have made the following attempts:
I set the heap to 18 GB and to 6 GB; the workflow processing speed did not change. The only difference is that with 18 GB the freeze lasts a shorter time.
I used OpenJDK 1.8 and tried upgrading to 11, but it did not help.
I added the following configuration, which was also useless:
java.arg.7=-XX:ReservedCodeCacheSize=256m
java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.9=-XX:+UseCodeCacheFlushing
The daily scheduled tasks consume few resources. Even with the heap set to 6 GB and 20 tasks running at the same time, memory consumption is about 30%, and the run finishes in about half an hour.
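If more diagnostics would help, I can also enable JVM native memory tracking on one node to see whether the growth is on the heap or off it. A minimal sketch of what I would add to bootstrap.conf (java.arg.10 is just an assumed free slot):
java.arg.10=-XX:NativeMemoryTracking=summary
and then, while NiFi is running, check with: jcmd <nifi-pid> VM.native_memory summary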

Related

Why is hadoop slow for a simple hello world job

I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1:47min to complete. When I turn off the network (wifi), it finishes in approx 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter when running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor: it will always be faster to walk it over, because of the inherent overhead of operating a distributed system. When you run local mode, you are only performing the Map-Reduce function. When you run pseudo-distributed, it uses all the Hadoop servers (NameNode and DataNodes for data; ResourceManager and NodeManagers for compute), and what you are seeing are the latencies involved in that.
When you submit your job, the ResourceManager has to schedule it. As your cluster is not busy, it will ask for resources from the NodeManager. The NodeManager will give it a container which will run your ApplicationMaster. Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the ResourceManager for resources for its Map and Reduce tasks. This takes another 10 seconds. Also, when you submit your job, there is around a 3 second wait before it is actually submitted to the ResourceManager. So far that's 23 seconds and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is allocating memory. On smaller systems (under 32 GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB) you would probably see a run time closer to 25-30 seconds.
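If you just want to time the Map-Reduce logic itself without the YARN scheduling overhead, you can push a single run back into local mode. A sketch, reusing the command from the question (only the -Dmapreduce.framework.name=local property is added):
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep -Dmapreduce.framework.name=local input output 'dfs[a-z.]+'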

How to work with a group of people using Zeppelin?

I am trying to work with Zeppelin on my Hadoop cluster:
1 edge node
1 name node
1 secondary node
16 data nodes.
Node specification:
CPU: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz, 8 cores
Memory: 32 GB DDR2
I have some issues with this tool when more than 20 people want to use it at the same time.
This is mainly when I am using pyspark - either 1.6 or 2.0.
Even if I set zeppelin.execution.memory = 512 mb and spark.executor.memory = 512 mb, it is still the same. I have tried a few interpreter options (for pyspark), like per-user in scoped/isolated mode and others, and it is still the same. It is a little better with the globally shared option, but after a while I still cannot do anything there. I was looking at the Edge Node and saw that its memory usage goes up very fast. I want to use the Edge Node only as an access point.
If your deploy mode is yarn client, then your driver will always be the access point server (the Edge Node in your case).
Every notebook (per note mode) or every user (per user mode) instantiates a spark context allocating memory on the driver and on the executors. Reducing spark.executor.memory will alleviate the cluster but not the driver. Try reducing spark.driver.memory instead.
The Spark interpreter can be instantiated globally, per note, or per user. I don't think sharing the same interpreter (globally) is a solution in your case, since you can only run one job at a time: users would end up waiting for everyone else's cells to finish before being able to run their own.
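As a minimal sketch, these are the relevant properties in the Zeppelin Spark interpreter settings (or conf/spark-defaults.conf); the values are illustrative, not recommendations:
spark.driver.memory 512m
spark.executor.memory 512m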

MemSQL performance issues

I have a single node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores, 16Gb RAM, and MemSQL columnstore data is ~7Gb (coming from 21Gb CSV). When running queries on the data, memory usage caps at ~2150Mb (11Gb sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves 0-8% CPU.
This does not seem like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If during query execution your machine is at 100% cpu (on all cores), it doesn't really matter which MemSQL node it is, your workload throughput is still bottlenecked on cpu. However for most queries you wouldn't expect most of the cpu use to be on the aggregator, so you may want to take a look at EXPLAIN or PROFILE of your queries.
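For example, a sketch of profiling a single query, assuming a MemSQL version that supports PROFILE / SHOW PROFILE (the table and predicate are made up):
PROFILE SELECT COUNT(*) FROM my_table WHERE some_col = 42;
SHOW PROFILE;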
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
It turned out my database was coming from somewhere other than the current MemSQL install (perhaps an older cluster configuration), despite there being only a single MemSQL cluster on the machine. The Databases section in the Web UI displayed no databases/tables, yet my queries succeeded with the expected answers.
Dropping the database and reloading from CSV remedied the situation. All cores are now used during queries.

How to increase AeroSpark read performance?

I am using the latest AeroSpark connector to work with Aerospike and Spark ML. But after inserting around 60M records into Aerospike, read operations take a very long time. For example, fetching around 500K records from a set containing 60M records takes AeroSpark ~30 minutes. Looking at the htop output, Aerospike uses only 7% of the CPU.
Each record contains around 1 KB of data. Aerospike and Spark are hosted on the same node. The data is filtered by a secondary index.
How can I speed up read performance? It seems AeroSpark is working with only one thread; how can I parallelize this job? Any suggestions?
AeroSpike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
    file /vol/rmla.data
    filesize 900G
}
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed from your storage media, which you declared to be a file. If you're storing the data on disk, you can either use file or device in the storage-engine device config block.
There is a big difference in read and write latency between a file on an HDD and raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drives. Declaring multiple devices for the namespace will give you a linear performance boost (two drives will have double the read and write throughput of one of the same kind).
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc. They only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
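For illustration, a sketch of a namespace storage stanza using raw devices instead of a file (the device paths are assumptions; adjust them to your hardware):
storage-engine device {
    device /dev/nvme0n1
    device /dev/nvme1n1
}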
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to the number of cores. If you don't know how many cores you have, do cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
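A sketch of what that could look like in aerospike.conf, assuming an 8-core machine (verify the parameter name against your server version):
service {
    scan-threads 8
}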
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
Upgrade Version
As of November 28 2016 aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

Tuning Hadoop job execution on YARN

A bit of intro - I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and NameNode.
Now I'm testing my algorithm with increasing amounts of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How to determine min/max memory for container, reserved memory for container, Sort Allocation Memory, map memory and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, because of the following reason:
I run a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive) - execution time 30 minutes
I run the same job with the same dataset multiplied 10 times - 300MB (same block size, 8MB) - execution time 2 hours
Now the same amount of data - 300MB, but with a 128MB block size - same execution time, maybe even a bit more than 2 hours
The HDFS block size is 128MB, so I thought this would cause a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, and hence it cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN capacity scheduler configuration to allocate 33% of the maximum YARN memory per job; this can be adjusted based on your requirements. Change it from its default of
yarn.scheduler.capacity.root.default.user-limit-factor=1
to
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/
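For reference, a sketch of the same setting in capacity-scheduler.xml, if you edit it directly rather than through the Ambari UI:
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>0.33</value>
</property>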
