Skipping bad input files in Hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.
The log files are uploaded daily from servers using S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.
Is there any way to have Hadoop skip over the bad files?

There's a whole bunch of record-skipping configuration properties you can use to do this. See the properties prefixed with mapred.skip. on http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
There's also a nice blog post about this subject and these config properties:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
That said, if your file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.

Chris White's comment suggesting writing your own RecordReader and InputFormat is exactly right. I recently faced this issue and was able to solve it by catching the file exceptions in those classes, logging them, and then moving on to the next file.
I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
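
To illustrate the "catch the exception, log it, move on to the next file" pattern described above, here is a minimal sketch in plain Java, outside Hadoop. It is not the code from the linked post and it does not use the Hadoop API: the class SkipCorruptFiles and the method readAll are hypothetical names, and gzip input is an assumption made only so that a corrupt file is easy to detect. In a real job the same try/catch would sit inside the custom RecordReader, around the calls that open and read the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch: read many gzip files, skipping the ones that cannot be read. */
public class SkipCorruptFiles {

    /**
     * Returns the lines of every readable file, in input order.
     * Files that fail with an IOException are logged and added to {@code skipped}.
     */
    public static List<String> readAll(List<Path> files, List<Path> skipped) {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            List<String> fileLines = new ArrayList<>();
            try (InputStream raw = Files.newInputStream(file);
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    fileLines.add(line);
                }
                // Only keep the lines once the whole file was read successfully.
                lines.addAll(fileLines);
            } catch (IOException e) {
                // Corrupt or truncated file: log it and carry on with the next one.
                System.err.println("Skipping " + file + ": " + e);
                skipped.add(file);
            }
        }
        return lines;
    }
}
```

Note that this sketch drops everything from a file that breaks halfway through. Keeping the records read before the failure is an equally valid choice, depending on what the job does with them.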

Related

Hadoop - Managing multiple input/output files

I'm facing problems managing multiple input files.
I have many files in the .../input/ folder and a MapReduce job which I want to run once for each of my input files, so that every input file has its own output (in .../output/).
I tried searching the Net, but many pages are very old and I can't find a working method. Are there any methods/classes I can use to make this work?
Thanks in advance.

Google Cloud Logs Export Names

Is there a way to configure the names of the files exported from Logging?
Currently the names of the exported files include colons. These are invalid characters in a path element in Hadoop, so PySpark, for instance, cannot read these files. Obviously the easy solution is to rename the files, but this interferes with syncing.
Is there a way to configure the names or change them to not include colons? Any other solutions are appreciated. Thanks!
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
At this time, there is no way to change the naming convention when exporting log files as this process is automated on the backend.
If you would like to request this feature in GCP, I would suggest filing a request in the Public Issue Tracker (PIT), where you can report bugs and request new features for GCP.

Logstash close file descriptors?

BACKGROUND:
We have rsyslog creating log file directories like: /var/log/rsyslog/SERVER-NAME/LOG-DATE/LOG-FILE-NAME
So multiple servers are writing their logs for different dates to a central location.
To read these logs and store them in Elasticsearch for analysis, my Logstash config file looks something like this:
file {
  path => "/var/log/rsyslog/**/*.log"
}
ISSUE:
As the number of log files in the directory increases, Logstash opens file descriptors (FDs) for new files and does not release the FDs for log files it has already read.
Since log files are generated per date, a file is of no further use once it has been read, because it will not be updated after that date.
I have increased the open file limit to 65K in /etc/security/limits.conf
Can we make Logstash close a handle after some time, so that the number of open file handles does not grow too large?
I think you may have hit this bug: http://github.com/elastic/logstash/issues/1604. Do you have the same symptoms? Exceptions in logs after some time? If you run sudo lsof | grep java | wc -l do you see the descriptors steadily increasing over time? (some of them might close, but some will stay open and their number will increase)
I've been tracking this issue for some time, and I don't know that it's properly solved.
We were in a similar boat, perhaps bigger: Logstash couldn't open handles for hundreds of thousands of log files on a box, even though very few of them were being written to actively. LOGSTASH-271 captured this issue, and there were some attempts to patch Logstash, including PR #1260.
It seems a fix may have made its way into Logstash 1.5 with PR #1545, but I've never tested this personally. We ended up forking the underlying library Logstash uses to implement the file input, called FileWatch, into FFileWatch, which adds an "eviction mechanism".
The basic idea behind this approach is to only keep files open while they're being written. Normally, Logstash will open a handle on the file and keep it open forever, but FFileWatch adds an option to close the handle if the file has not changed recently (eviction_interval). I then created a custom build of Logstash using the forked gem.
Obviously this is less than ideal, but it worked for us. Eventually we dropped Logstash entirely for picking up log files, although we still use it further down the log processing pipeline. We implemented our own lightweight log shipper (Franz), which does not suffer from this issue.
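
For readers who want to see the eviction idea in code, here is a minimal Java sketch of the bookkeeping involved: remember when each tracked file last changed, and close any handle that has been idle for at least the eviction interval. This is not FFileWatch's or Logstash's actual code. The class IdleHandleEvictor and all of its method names are hypothetical, and timestamps are passed in by the caller as an assumption to keep the sketch simple.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical sketch: closes handles for files that have not changed recently. */
public class IdleHandleEvictor {

    private static final class Tracked {
        final Closeable handle;
        long lastChangeMillis;

        Tracked(Closeable handle, long lastChangeMillis) {
            this.handle = handle;
            this.lastChangeMillis = lastChangeMillis;
        }
    }

    private final Map<String, Tracked> open = new HashMap<>();
    private final long evictionIntervalMillis;

    public IdleHandleEvictor(long evictionIntervalMillis) {
        this.evictionIntervalMillis = evictionIntervalMillis;
    }

    /** Starts tracking a newly opened handle; callers should check isOpen first. */
    public void opened(String path, Closeable handle, long nowMillis) {
        open.put(path, new Tracked(handle, nowMillis));
    }

    /** Records that the file changed, so its handle stays open for now. */
    public void changed(String path, long nowMillis) {
        Tracked tracked = open.get(path);
        if (tracked != null) {
            tracked.lastChangeMillis = nowMillis;
        }
    }

    /** Closes every handle idle for at least the eviction interval; returns how many. */
    public int evictIdle(long nowMillis) throws IOException {
        int closed = 0;
        Iterator<Map.Entry<String, Tracked>> it = open.entrySet().iterator();
        while (it.hasNext()) {
            Tracked tracked = it.next().getValue();
            if (nowMillis - tracked.lastChangeMillis >= evictionIntervalMillis) {
                tracked.handle.close();
                it.remove();
                closed++;
            }
        }
        return closed;
    }

    public boolean isOpen(String path) {
        return open.containsKey(path);
    }

    public int openCount() {
        return open.size();
    }
}
```

A watcher loop would call evictIdle periodically and reopen a file (remembering the last read offset) if it starts changing again; that part is left out here.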

Hadoop avro correct jar files issue

I'm writing my first Avro job that is meant to take an avro file and output text. I tried to reverse engineer it from this example:
https://gist.github.com/chriswhite199/6755242
I am getting the error below though.
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
I looked around and found it was likely an issue with what jar files are being used. I'm running CDH4 with MR1 and am using the jar files below:
avro-tools-1.7.5.jar
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I can't post code for security reasons but it shouldn't need anything not used in the example code. I don't have maven set up yet either so I can't follow those routes. Is there something else I can try to get around these issues?
Try using Avro 1.7.3.
See the AVRO-1170 bug.

Need help to find out why Websphere Application Server has many .lck files

The file names seem to point to our WAS data sources. However, we're not sure what is creating them or why there are so many. The servers didn't seem to crash. Why is WAS 6.1.0.23 creating these, and why aren't they being cleaned up?
There are many files like these, with some going up to xxx.43.lck
DWSqlLog0.0.lck
DWSqlLog0.0
TritonSqlLog0.0.lck
TritonSqlLog0.0
JTSqlLog0.0
JTSqlLog0.1
JTSqlLog0.3
JTSqlLog0.2
JTSqlLog0.4.lck
JTSqlLog0.4
JTSqlLog0.3.lck
JTSqlLog0.2.lck
JTSqlLog0.1.lck
JTSqlLog0.0.lck
WAS uses JDK logging, and the JDK logger creates files with suffixes such as .0, .1 and so on, along with a .lck file for each, so that the WAS runtime holds a lock on the files it writes to.
Cheers
Manglu
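
The .lck behaviour can be reproduced outside WAS with the JDK's own java.util.logging.FileHandler. The sketch below is only an illustration, not WAS code: the class LockFileDemo, the file name prefix DemoSqlLog and the pattern DemoSqlLog%u.%g are made up for the example. It opens two handlers on the same pattern in one JVM, lists the directory while both are open, then closes them and lists it again.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.logging.FileHandler;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical demo of the lock files created by java.util.logging.FileHandler. */
public class LockFileDemo {

    /** Returns the sorted file names currently present in the directory. */
    static List<String> listing(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    /** Returns two listings: while both handlers are open, and after both are closed. */
    public static List<List<String>> run(Path dir) throws IOException {
        // %u is a unique number used to resolve conflicts, %g is the generation number.
        String pattern = dir.toString().replace('\\', '/') + "/DemoSqlLog%u.%g";

        FileHandler first = new FileHandler(pattern, 0, 1);  // no size limit, one file
        FileHandler second = new FileHandler(pattern, 0, 1); // same pattern, still open
        List<String> whileOpen = listing(dir);

        first.close();   // closing a handler releases and deletes its lock file
        second.close();
        List<String> afterClose = listing(dir);

        return Arrays.asList(whileOpen, afterClose);
    }

    public static void main(String[] args) throws IOException {
        List<List<String>> result = run(Files.createTempDirectory("lck-demo"));
        System.out.println("while open:  " + result.get(0));
        System.out.println("after close: " + result.get(1));
    }
}
```

While the handlers are open, each one has its own .lck file next to its log file, and the second handler picks a different unique number because the first name is already locked. After close() the lock files should be gone, which suggests that leftover .lck files come from handlers that were never closed cleanly.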
