How to collect Hadoop userlogs? - hadoop

I am running M/R jobs and logging errors when they occur, rather than making the job fail. There are only a few errors, but the job is run on a hadoop cluster with hundreds of nodes. How to search in task logs without having to manually open each task log in the web ui (jobtaskhistory)? In other words, how to automatically search in M/R task logs that are spread all over the cluster, stored in each node locally?

Side Note First: 2.0.0 is oldy moldy (that's the "beta" version of 2.0), you should consider upgrading to a newer stack (e.g. 2.4, 2.5 2.6).
Starting with 2.0, Hadoop implemented what's called "log aggregation" (though it's not what you would think. The logs are just stored on HDFS). There are bunch of command line tools that you can use to get the logs and analyze them without having to go through the UI. This is, in fact, much faster than the UI.
Check out this blog post for more information.
Unfortunately, even with the command line tool, there's not way for you to get all task logs at the same time and pipe it to something like grep. You'll have to get each task log as a separate command. However, this is at least scriptable.
The Hadoop community is working on a more robust log analysis tool that will not only store the job logs on HDFS, but will also give you the ability to perform search and other analyses on these logs. However, this is tool is still a ways out.

This is how we did it (large internet company): we made sure that only v critical messages were logged: but for those messages we actually did use System.err.println. Please keep the aggregate messages per tracker/reducer to only a few KB.
The majority of messages should still use the standard log4j mechanism (which goes to the System logs area)

Go to to your http://sandbox-hdp.hortonworks.com:8088/cluster/apps
There look for the instantiation of the execution you are interested in, and for that entry click the History link (in the Tracking UI column),
then look for the Logs link (in the Logs column), and click on it

yarn logs -applicationId <myAppId> | grep ...

Related

Magic committer not improving performance in a Spark3+Yarn3+S3 setup

What I am trying to achieve?
I am trying to enable the S3A magic committer for my Spark3.3.0 application running on a Yarn (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21GBs of data with 30 tasks in the corresponding Spark stage (see below image).
My setup
I have a server which has the Spark client. The Spark client submits the application on Yarn cluster via the client-mode with PySpark.
What I tried
I am using the following config (setting via PySpark Spark-conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar to the jars/ directory of the Spark-Home on the Nodemanagers and my Spark-client servers.
Changes that I see after applying the aforementioned configs:
I see PRE __magic/ directory if I run aws s3 ls <write-path> when the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims 57-77% time reduction). Hence, I expect to see significant reduction (from 39s) in the "duration" column of my "paruqet" stage, when I use the above shared configs.
Some other point that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3gaurd, as S3 now provides strong consistency.
correct. you don't need s3guard
the com.hortonworks binding was for the wip committer work. the binding classes for wiring up spark/parquet are all in spark-hadoop-cloud and have org.spark prefixes. you seem to be ok there
the simple test for what committer is live is to print the JSON _SUCCESS file. If that is a 0 byte file, you are still using the old committer. it does sound like you are.
grab the latest spark+hadoop build you can get, there's always ongoing improvements, with hadoop 3.3.5 doing a big enhancement there.
you should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). it is also correct, which the v1 algorithm doesn't offer on s3 (and which v2 doesn't offer anywhere

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a totaly beginner at the topic hadoop - so sorry if this is a stupid question.
My fictional scenario is, that I have several webserver (IIS) with several log locations. I want to centralize this log files and based on the data I want to analyze the health of the applications and the webservers.
Since the eco system of hadoop overs a variety of tools I am not sure if my solution is a valid one.
So I thought that I move the log files to hdfs, create an external table on the directory and an internal table and copy the data via hive (insert into ...select from) from the external table to internal table (with some filtering because of the comment lines beginning with #)
When the data is stored within the internal table I delete the previous moved files from hdfs.
Technical it works, I tried it already - but is this is reasonable aproach?
And if yes - how would I automatize this steps since now I did all the stuff manually via Ambari.
THanks for your input
BW
Yes, this is perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's the left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collections agents (not Hadoop related)
Note, if it's only log file collection that you care about, I would probably have used Elasticsearch instead of Hadoop to store data, Filebeat to continuously watch log files, Logstash to apply per-message level filtering, and Kibana to do visualizations. If combining Elasticsearch for fast indexing/searching and Hadoop for archival, you can insert Kafka between the log message ingestion and message writers/consumers

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using sql profiler in sql server. However, I miss such a tool for hadoop and so I am assuming that I can get some information from logs. I thing all logs are stored at /var/logs/hadoop/ but I am confused with what file I need to look at and how to set that file to capture detailed level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN applicaiton yet with its own log.
Jobs consists of taks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher level components (eg. Hive, Pig, Kafka) have their own logs, asside from the logs resulted from the jobs they submit (which are logging as any job does).
The good news is that vendor specific distribution (Cloudera, Hortonworks etc) will provide some specific UI to expose the most common logs for ease access. Usually they expose the JobHistory service collected logs from the UI that shows job status and job history.
I cannot point you to anything SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor specific distributions being involved. I recommend to start by reading about and learning how the Job History server runs and how it can be accessed.

hadoop logging facility?

If I am to use zookeeper as a work queue and connect to it individual consumers/workers. What would you recommend as a good distributed setup for logging these workers' activities?
Assume the following:
1) At anytime we could be down to 1 single computer housing the hadoop cluster. The system will autoscale up and down as needed but has alot of down time where only 1 single computer is needed.
2) I just need the ability to access all of the workers logs without accessing the individual machine that worker is located at. Bare in mind, that by the time I get to read one of these logs that machine might very well be terminated and long gone.
3) We'll need easy access to the logs i.e being able to cat/grep and tail or alternatively in a more SQLish manner - we'll need real time ability to both query as well as monitor output for short periods of time in real time. (i.e tail -f /var/log/mylog.1)
I appreciate your expert ideas here!
Thanks.
Have you looked at using Flume, chukwa or scribe - ensure that your flume etc process has access to the log files that you are trying to aggregate onto a centralized server.
flume reference:
http://archive.cloudera.com/cdh/3/flume/Cookbook/
chukwa:
http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html
scribe:
https://github.com/facebook/scribe/wiki/_pages
hope it helps.
Fluentd log collector just released its WebHDFS plugin, which allows the users to instantly stream data into HDFS. It's really easy to install with ease of management.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example to post logs against Fluentd. Fluentd's Java library is clever enough to buffer locally when Fluentd daemon is down. This lessens the possibility of the data loss.
Fluentd: Data Import from Java Applications
High availability configuration is also available, which basically enables you to have centralized log aggregation system.
Fluentd: High Availability Configuration

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a big number of servers in computing grid. Also it should be distributed because I have gigs of logs everyday and no single machine will be able to store logs.
I'm particularly interested in something that will work with Ruby and will work on Windows and latest Solaris (yeah, I got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
Log crawler and distributed search engine are out of questions - logs will be parsed by Ruby script and ElasticSearch will be used to index log messages. Front end is also very easy to choose - Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great but it's just a disk space hog and it requires running autobalance everyday to spread the load between Cassandra nodes.
HDFS looked promising but FileSystem API is Java only and JRuby was a pain.
HBase looked like a best solution around but deploying it and monitoring is just a disaster - in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it also, and then start REST service and also check it.
So I'm stuck. Something tells me HDFS or HBase are the best thing to use as a log storage, but HDFS only works smoothly with Java and HBase is just a deploying/monitoring nightmare.
Can anyone share its thoughts or experience building similar systems using components I described above or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regards to Java and HDFS - using a tool like BeanShell, you can interact with the HDFS store via Javascript.

Resources