When our Cascading jobs encounter an error in the data, they throw various exceptions… These end up in the logs, and if the logs fill up, the cluster stops working. Do we have any config file that can be edited/configured to avoid such scenarios?
We are using MapR 3.1.0, and we are looking for a way to limit the log use (syslogs/userlogs) without using centralized logging and without adjusting the logging level. We are not too bothered about whether it keeps the first N bytes or the last N bytes of the logs and discards the remaining part.
We don't really care about the logs, and we only need the first (or last) few megs to figure out what went wrong. We don't want to use centralized logging, because we don't really want to keep the logs / don't want to pay the performance overhead of replicating them. Also, correct me if I'm wrong: user_log.retain-size has issues when JVM re-use is enabled.
Any clue/answer will be greatly appreciated!
Thanks,
Srinivas
This should probably be on a different Stack Exchange site, as it's more of a DevOps question than a programming question.
Anyway, what you need is for your DevOps team to set up logrotate and configure it according to your needs, or to edit the log4j XML files on the cluster to change the way logging is done.
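If you go the log4j route, a capped RollingFileAppender is the usual pattern. Here is a minimal sketch; the appender name (RFA) and file variables follow the defaults that stock Hadoop log4j configs tend to use, but your MapR install may name things differently, and it may ship a .properties file rather than XML:
# Cap each log file at 10 MB and keep at most two rolled copies
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.RFA.MaxFileSize=10MB
log4j.appender.RFA.MaxBackupIndex=2
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
Whichever appender your loggers actually write to, capping it this way makes the files roll instead of growing without bound.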
We work with a lot of data and have a high throughput of files going through our NiFi instances. We have recently been losing provenance records and don't understand what the cause is.
Below are some details, if relevant:
We have our provenance repository on its own drive in the cloud, and are not seeing any high IO usage or resource contention.
We have allocated additional threads to it, and have raised the limit to 999k file handles.
If it means anything, provenance data is kept for 2 weeks in our configuration.
We are on NiFi version 1.15.3, but are planning an upgrade in the near future.
Any ideas on what the cause may be and how to remediate this? Thanks!
I installed Elasticsearch, Logstash, and Kibana on an Ubuntu server. Before starting these services, CPU utilization was less than 5%; within a minute of starting them, it crossed 85%. I don't know why this is happening. Can anyone help me with this issue?
Thanks in advance.
There is not enough information in your question to give you a specific answer, but I will point out a few possible scenarios and how to deal with them.
Did you wait long enough? Sometimes there is a warm-up phase that consumes higher CPU until all the services are registered and finish booting. If you have a fairly small machine, it may consume more CPU and take longer to finish.
Folder write permissions. If any of the ELK components fails due to restricted access to directories it needs (for logging, for creating sub-folders for sinceDB files, and so on), it can go into an endless retry loop, trying again and again while consuming high CPU.
Connection issues. ES should be the first component to start; if it fails, Kibana and Logstash will try to connect to ES again and again until they succeed, which can cause high CPU.
Bad Logstash configuration. If Logstash fails to read the files referenced in its configuration, or if your parsing is inefficient (for example, if the first "match" in the filter section is the least common pattern, most events fall through it before matching anything), it can consume high CPU.
For further investigation:
I suggest you do not start all of them together. Start ES first; if everything goes well, start Kibana, and lastly start Logstash (see the commands sketched below).
Check the logs of all the ELK components for error messages, failures, etc.
For a better answer I will need the YAML configs of all 3 components (ES, Kibana, Logstash).
I will also need the Logstash pipeline configuration file.
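A minimal sketch of that staged start-up and log check, assuming the standard package installs with the systemd service names elasticsearch, kibana, and logstash and the default log locations:
sudo systemctl start elasticsearch
sudo journalctl -u elasticsearch -f                       # watch for errors until the node reports it has started
curl -s "http://localhost:9200/_cluster/health?pretty"    # confirm ES answers before moving on
sudo systemctl start kibana
sudo systemctl start logstash
sudo tail -f /var/log/logstash/logstash-plain.log         # default Logstash log file for package installs
If any step shows repeated connection or permission errors, that component is the one to dig into.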
I would recommend analysing the CPU cycles consumed by each of the Elasticsearch, Logstash, and Kibana processes.
Check specifically which of these processes is consuming the most memory/CPU, for example via the top command (as sketched below).
Start only ES first, and allow it to settle and the node to start completely before starting Kibana, and possibly Logstash after that.
Share the logs for each and I can assist if there are any errors.
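For the per-process check, something like the following works on most Linux boxes; the pgrep patterns are illustrative, so match whatever your process command lines actually look like:
top -H -p "$(pgrep -d, -f org.elasticsearch.bootstrap.Elasticsearch)"   # per-thread view of the ES JVM
top -H -p "$(pgrep -d, -f logstash)"                                    # same for Logstash
The -H flag shows individual threads, which makes it obvious whether one hot thread is pinning a single core.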
I'm running a PySpark application in:
emr-5.8.0
Hadoop distribution:Amazon 2.7.3
Spark 2.2.0
I'm running on a very large cluster. The application reads a few input files from s3. One of these is loaded into memory and broadcast to all the nodes. The other is distributed to the disks of each node in the cluster using the SparkFiles functionality. The application works but performance is slower than expected for larger jobs. Looking at the log files I see the following warning repeated almost constantly:
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
It tends to happen after a message about accessing the file that was loaded into memory and broadcast. Is this warning something to worry about? How can I avoid it?
Google searching brings up several people dealing with this warning in native Hadoop applications, but I've found nothing about it in Spark or PySpark and can't figure out how those solutions would apply for me.
Thanks!
Ignore it.
The more recent versions of the AWS SDK always tell you off when you call abort() on the input stream, even when it's what you need to do when moving around a many-GB file. For small files, yes, reading to the EOF is the right thing to do, but with big files, no.
See: SDK repeatedly complaining "Not all bytes were read from the S3ObjectInputStream"
If you see this a lot and you are working with columnar data formats such as ORC and Parquet, switch the input streams over to random IO instead of sequential by setting the property fs.s3a.experimental.fadvise to random. This stops it from ever trying to read the whole file, and instead reads only small blocks. Very bad for full-file reads (including .gz files), but it transforms columnar IO.
Note, there's a small fix in S3A for Hadoop 3.x on the final close (HADOOP-14596). It's up to the EMR team whether to backport it or not.
I'll add some text to the S3A troubleshooting docs. The ASF have never shipped a Hadoop release with this problem, but if people are playing mix-and-match with the AWS SDK (very brittle), then this may surface.
Note: This only applies to non-EMR installations as AWS doesn't offer s3a.
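If you are on a non-EMR cluster using the s3a connector, the fs.s3a.experimental.fadvise setting mentioned above can be passed through Spark's spark.hadoop.* configuration prefix. A minimal PySpark sketch (the app name is just an illustration):
from pyspark.sql import SparkSession

# spark.hadoop.* options are copied into the Hadoop configuration used by s3a
spark = (SparkSession.builder
         .appName("s3a-random-io-example")
         .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
         .getOrCreate())
The same option can also be set on the command line with spark-submit --conf spark.hadoop.fs.s3a.experimental.fadvise=random.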
Before choosing to ignore the warnings or altering your input streams via settings per Steve Loughran's answer, make absolutely sure you're not using s3://bucket/path notation.
Starting with Spark 2, you should leverage the s3a protocol via s3a://bucket/path, which will likely address the warnings you're seeing (it did for us) and substantially boost the speed of S3 interactions. See this answer for a breakdown of the differences.
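In PySpark this is just the URL scheme you pass to the reader, provided the hadoop-aws and matching AWS SDK jars are on the classpath; a short sketch with a hypothetical bucket and prefix:
# Read through the s3a connector rather than the legacy s3:// scheme
df = spark.read.parquet("s3a://my-bucket/some/prefix/")
df.show(5)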
I implemented a web application. When the Tomcat service starts, it works very quickly, but after a few hours, and as more users come in (up to approx. 15 users), it gets slow.
Checking usage statistics: RAM (20%), CPU (25%).
Server Features:
RAM 8GB
Processor i7
Windows Server 2008 64bit
Tomcat 7
MySql 5.0
Struts2
-Xms1024m
-Xmx1024m
PermGen = 1024
MaxPermGen = 1024
I do not use a separate web server; we publish directly on Tomcat.
Even past midnight the slowness persists (with only 1 user online).
The only solution I have is to restart the Tomcat service, after which response time is excellent again.
Is there anyone who has experienced this issue? Any clue would be appreciated.
Not enough details provided. Need more information :(
Use htop or top to find memory and CPU usage per process & per thread.
CPU
A constant 25% CPU usage on a 4-core system can indicate that a single-threaded application/thread is running at 100% CPU on the only core it is able to use.
Which application is eating the CPU?
Memory
20% memory is ~1.6GB. That is a bit more than I would expect for an idle server running only Tomcat + MySQL. The -Xms1024m tells Tomcat to preallocate 1GB of memory, so that explains it.
Change the Tomcat settings to -Xms512m and -Xmx2048m (see the sketch below). Watch Tomcat's memory usage while you throw some users at it. If it keeps growing until it reaches 2GB... then freezes, that can indicate a memory leak.
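On a Windows install the usual place for that is a setenv.bat in the Tomcat bin directory, which catalina.bat picks up automatically if it exists. A minimal sketch, assuming the default Tomcat 7 layout (adjust the PermGen value to whatever you actually want):
rem %CATALINA_BASE%\bin\setenv.bat -- JVM sizing picked up by catalina.bat at startup
set CATALINA_OPTS=-Xms512m -Xmx2048m -XX:MaxPermSize=1024m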
Disk
Use df -h to check disk usage. A full partition can cause the issues you are experiencing.
Filesystem Size Used Avail Usage% Mounted on
/cygdrive/c 149G 149G 414M 100% /
(If you just noticed from this example that my laptop is running out of space, you're doing it right :D)
Logs
Logs are awesome. Yet they have a bad habit of filling up the disk. Check the logs' disk usage. Are logs being written/erased/rotated properly when new users connect? Does erasing the logs fix the issue? (Copy them somewhere for future analysis before you erase them.)
If not, logs are STILL awesome. They have the good habit of helping you track bugs. Check the Tomcat logs. You may want to set the logging level to debug. What happens last before the website dies? Any useful error messages? Are user connections still being received and accepted by Tomcat?
Application
I suppose that the 25% CPU goes to Tomcat (and not MySQL). Tomcat doesn't fail by itself; the application running on it must be failing. Try removing the application from Tomcat (you can put a hello world in its place). Can Tomcat keep working overnight without your application? It probably can, in which case the fault is in the application.
Enable full debug logging in your application and try to track the issue. Run it straight from Eclipse in debug mode and throw users at it. Does it fail consistently in the same way?
If yes, hit "pause" in the Eclipse debugger and check what the application is doing. Look at the piece of code each thread is currently running plus its call stack. Repeat that a few times. If there is a deadlock, an infinite loop, or similar, you can find it this way.
You will have found the issue by now if you are lucky. If not, you're unfortunate and it's a nasty bug that might be deep inside the application. That can get tricky to trace. Determination will lead to success. Good luck =)
For performance-related issues, we can follow these guidelines:
You can set -Xms and -Xmx to the same size so the heap never has to be resized.
-Xms2048m
-Xmx2048m
You can also enable PermGen to be garbage collected.
-XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
If the pages change too frequently for static caching to make sense, try temporarily caching the dynamic content so that it doesn't need to be regenerated over and over again. Any technique you can use to cache work that has already been done, instead of doing it again, should be used; this is the key to achieving the best Tomcat performance.
If there is any database-related issue, you can follow SQL query performance tuning.
Another option is rotating the catalina.out log file without restarting Tomcat.
In detail, there are two ways.
The first, which is more direct, is that you can rotate Catalina.out by adding a simple pipe to the log rotation tool of your choice in Catalina's startup shell script. This will look something like:
"$CATALINA_BASE"/logs/catalina.out WeaponOfChoice 2>&1 &
Simply replace "WeaponOfChoice" with your favorite log rotation tool.
The second way is less direct, but ultimately better. The best way to handle the rotation of Catalina.out is to make sure it never needs to rotate. Simply set the "swallowOutput" property to true for all Contexts in "server.xml".
This will route System.err and System.out to whatever logging implementation you have configured, or to JULI if you haven't configured one.
See more at: Tomcat Catalina Out
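A minimal sketch of that second approach: setting swallowOutput on the default Context in conf/context.xml applies it to every webapp (the WatchedResource line is just what the stock file already contains):
<!-- conf/context.xml : applies to all web applications -->
<Context swallowOutput="true">
    <WatchedResource>WEB-INF/web.xml</WatchedResource>
</Context>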
I experienced a very slow stock Tomcat dashboard on a clean CentOS 7 install and found the following cause and solution:
Slow start up times for Tomcat are often related to Java's
SecureRandom implementation. By default, it uses /dev/random as an
entropy source. This can be slow as it uses system events to gather
entropy (e.g. disk reads, key presses, etc). As the urandom manpage
states:
When the entropy pool is empty, reads from /dev/random will block until additional environmental noise is gathered.
Source: https://www.digitalocean.com/community/questions/tomcat-8-5-9-restart-is-really-slow-on-my-centos-7-2-droplet
Fix it by adding the following configuration option to your tomcat.conf or (preferred) a custom file into /tomcat/conf/conf.d/:
JAVA_OPTS="-Djava.security.egd=file:/dev/./urandom"
We encountered a similar problem; the cause was catalina.out. It is the standard destination log file for System.out and System.err. Its size kept increasing, slowing things down, and ultimately Tomcat crashed. The problem was solved by rotating catalina.out. We were using Red Hat, so we made a shell script to rotate catalina.out; a logrotate sketch that does the same job is shown below.
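A minimal logrotate sketch for this; the path and retention are illustrative, and copytruncate is what lets you rotate without restarting Tomcat:
# /etc/logrotate.d/tomcat -- point the path at your actual catalina.out
/opt/tomcat/logs/catalina.out {
    # truncate in place so Tomcat keeps writing to the same open file handle
    copytruncate
    daily
    # keep a week of compressed history
    rotate 7
    compress
    missingok
    notifempty
}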
Here are some links:
Mulesoft article on catalina (also contains two methods of rotating):
Tomcat Catalina Introduction
If "catalina.out" is not the problem then try this instead:-
Mulesoft article on optimizing tomcat:
Tuning Tomcat Performance For Optimum Speed
We had a problem which looks similar to yours. Tomcat was slow to respond, but the access log showed just milliseconds per response. The problem was streaming responses. One of our services returned real-time data that users could subscribe to, and the EPOLL queues were becoming bloated. Network requests couldn't get through to Tomcat. What's more interesting, the CPU was mostly idle (since no one could ask the server to do anything) and the acceptor/poller threads were sitting in WAIT, not RUNNING or IN_NATIVE.
At the time we just limited the number of such requests and everything became normal.
I'm thinking of learning Hadoop but am not sure if it'll solve my problem. Basically, I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the result (if successful) or sends it back to the queue for further processing. My problem is scalable but is limited by the network bandwidth (EC2), which will never keep up with multiple CPUs crunching the data. I thought maybe I could run my jobs in Java on a Hadoop cluster and have Hadoop distribute the work via a queue. Would this be a better approach? Am I correct in assuming Hadoop can manage a queue and try to run jobs as locally as possible to minimize bandwidth usage and maximize CPU usage? My program is very CPU-bound, but most of my recent performance problems are related to passing work over the network (I want to keep the work as local as possible). The difference between the Hadoop tutorials I've seen and my problem is that in the tutorials all the work is known in advance, while my program is constantly generating new work for itself (until it's finally done). Would this work, and would it help me reduce the impact of passing messages over the network?
Sorry, I'm new to Hadoop and wanted to know if it could solve my problem.
Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you synchronization to build a queue if you need it.
There are plenty of tools for the kind of problem it looks like you are trying to solve. I suggest taking a look at RabbitMQ. If you use Python, Celery is quite fantastic.
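To make that concrete, here is a minimal Celery sketch of the "do a bit of work, then save or re-enqueue" pattern you describe. The broker URL assumes a local RabbitMQ, and the countdown logic is a hypothetical stand-in for your real work:
from celery import Celery

# assumes a RabbitMQ broker running locally
app = Celery("tasks", broker="amqp://guest@localhost//")

@app.task
def process(item):
    """Do a small unit of work, then save the result or re-enqueue it."""
    remaining = item - 1              # hypothetical work step: treat item as a countdown
    if remaining <= 0:
        print("done:", item)          # stand-in for saving the result
    else:
        process.delay(remaining)      # send it back to the queue for further processing
Start a worker with celery -A tasks worker (assuming the file is tasks.py) and seed the queue with process.delay(10). Workers pull from the queue wherever you run them, which keeps the CPUs busy without Hadoop's batch model; data locality, though, is something you would still have to arrange yourself.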