COBOL Internal SORT Error

I am facing an issue while sorting a sequential file using an internal sort (with USING and GIVING). The input file has a corrupt record, so the output file generated after the sort is not complete.
The sort stops after encountering the corrupt record, yet it does not fail in the process. When I check SORT-STATUS afterwards, it comes back as zero. This causes the incomplete file to be processed further, which I do not want.
Is there a way to identify that there was a corrupt record and that the processing is incomplete, so that I can end the program execution at that point and correct the file before running it again?
Below is the code snippet:
SORT WRK-FILE
    ON ASCENDING KEY KEY1
                     KEY2
    USING INPUT-FILE
    GIVING OUTPUT-FILE.

IF SORT-STATUS = ZEROS
    SET PROCESS-OK TO TRUE
ELSE
    SET PROCESS-NOT-OK TO TRUE
END-IF.

Very long lines of input in Ruby

I have a Ruby script that performs some substitutions on the output of mysqldump.
The input can have very long lines (hundreds of MB), because a single line can represent a multi-row INSERT statement for all the data in a table. The mysqldump utility can be coerced to produce one INSERT statement per row, but I don't have control of every client.
My script naively expects IO#each_line to control memory usage:
$stdin.each_line do |line|
  next if options[:entity_excludes].any? { |entity| line =~ /^(DROP TABLE IF EXISTS|INSERT INTO) `custom_#{entity}(s|_meta)`/ }
  line.gsub!(/^CREATE TABLE `/, "CREATE TABLE IF NOT EXISTS `")
  line.gsub!('{{__OPF_SITEURL__}}', siteurl) if siteurl
  $stdout.write(line)
end
I've already seen input with maximum line length over 400MB, and this translates directly into process resident memory.
Are there libraries for Ruby that allow text transforms on an input stream using buffers instead of relying on line-delimited input?
This was marked as a duplicate of a simpler question. But there's quite a bit more to this. You need to keep track of multiple buffers and test for application of transforms even when they apply across a buffer boundary. It's easy to get wrong, which is why I'm hoping a library already exists.
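For what it's worth, the carry-over trick such a library has to get right can be sketched in a few lines. The following Java sketch is only an illustration of the idea (the class name, buffer size, and example replacement URL are made up, and it handles only the literal-token substitution, not the line-anchored regexes): it keeps the last token.length() - 1 characters of each chunk and prepends them to the next one, so a token that straddles a buffer boundary is still replaced.

import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamReplace {
    // Replace one literal token in a character stream using fixed-size buffers.
    public static void replace(Reader in, Writer out, String token, String replacement)
            throws IOException {
        char[] buf = new char[64 * 1024];
        String carry = "";                       // unflushed tail from the previous chunk
        int n;
        while ((n = in.read(buf)) != -1) {
            String chunk = carry + new String(buf, 0, n);
            chunk = chunk.replace(token, replacement);
            // Hold back a tail that could be the start of a token split across chunks.
            int keep = Math.min(token.length() - 1, chunk.length());
            out.write(chunk, 0, chunk.length() - keep);
            carry = chunk.substring(chunk.length() - keep);
        }
        out.write(carry);
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        Reader in = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        replace(in, out, "{{__OPF_SITEURL__}}", "https://example.com");
    }
}

Memory stays bounded by the buffer size plus the carry, regardless of how long the input "lines" are.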

Each run of the same Hadoop SequenceFile creation routine creates a file with a different CRC. Is that OK?

I have a simple piece of code which creates a Hadoop SequenceFile. Each time the code is run, it leaves two files in the working directory:
mySequenceFile.txt
.mySequenceFile.txt.crc
After each run the sizes of both files remain the same, but the contents of the .crc file are different!
Is this a bug or expected behaviour?
This is confusing, but it is expected behaviour.
According to the SequenceFile format, each SequenceFile has a sync block whose length is 16 bytes. The sync block repeats after each record in block-compressed SequenceFiles, and after some records or one very long record in uncompressed or record-compressed SequenceFiles.
The thing is, the sync block is essentially a random value. It is written in the header, which is how the reader recognizes it. It stays the same within one SequenceFile, but it can be (and actually is) different from one SequenceFile to another.
So the files are logically the same but binary-different. The CRC is a checksum over the bytes, so it differs between the two files as well.
I haven't found any way to set this sync block manually. If someone finds one, please write it here.
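To see the effect for yourself, a minimal sketch along these lines (the file names, key/value types, and the older createWriter overload are my assumptions, not code from the question) writes the same record twice and checksums the raw bytes of each data file. It computes its own CRC32 rather than reading Hadoop's .crc sidecar, but it makes the same point: the checksums normally differ because each file gets its own randomly generated sync marker.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncMarkerDemo {

    // Write a SequenceFile with fixed content and return a CRC32 over its bytes.
    static long writeAndChecksum(String name) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, new Path(name),
                                      IntWritable.class, Text.class);
        try {
            writer.append(new IntWritable(1), new Text("same payload every run"));
        } finally {
            writer.close();
        }
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(Paths.get(name)));
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        // Identical records, yet the two checksums usually differ, because each
        // file carries its own randomly generated sync marker in the header.
        System.out.println(writeAndChecksum("run1.seq"));
        System.out.println(writeAndChecksum("run2.seq"));
    }
}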

Analyze total error entry occurrence in a time frame from log files with a Hadoop MapReduce job

I have a huge number of logfiles stored in HDFS which look like the following:
2012-10-20 00:05:00; BEGIN
...
SQL ERROR -678: Error message
...
2012-10-20 00:47:20; END
I'd like to know how often certain SQL error codes occurred during a time frame, e.g.:
How many 678 SQL ERRORs occurred from 20 OCT 2012 0:00 am until 20 OCT 2012 1:00 am?
Since the files are typically split into several blocks, they could be distributed across all the data nodes.
Is such a query possible? I'd like to use the Hadoop MapReduce Java API or Apache Pig, but I don't know how to apply the time frame condition.
HDFS doesn't take newlines into consideration when splitting a file into blocks, so a single line might be split across two blocks. MapReduce does, however, so a line in the input file will always be processed by a single mapper.
2012-10-20 00:05:00; BEGIN
...
SQL ERROR -678: Error message
...
2012-10-20 00:47:20; END
If the file is bigger than the block size, there is a good chance that the above lines will land in two blocks and be processed by different mappers. FileInputFormat.isSplitable() can be overridden to make sure that a single log file is processed by one mapper only and not spread across multiple mappers, as in the sketch below.
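A minimal sketch of that override (the class name is mine, assuming the newer org.apache.hadoop.mapreduce API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each log file becomes exactly one input split, so the
        // BEGIN and END lines of a run are guaranteed to reach the same mapper.
        return false;
    }
}

The job would then register it with job.setInputFormatClass(WholeFileTextInputFormat.class).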
Hadoop will invoke the user-defined map function with KV pairs, where the key is the file offset and the value is a line of the input file. An instance variable would be required to store the BEGIN time so that it can be checked against the END time in a later call to the map function, along the lines of the sketch below.
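A rough sketch of such a mapper (the class name, output key format, and the exact parsing of the log lines are my assumptions):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SqlErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private String beginTimestamp;   // instance state carried across map() calls

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.endsWith("BEGIN")) {
            // e.g. "2012-10-20 00:05:00; BEGIN" -> remember "2012-10-20 00:05:00"
            beginTimestamp = line.substring(0, 19);
        } else if (line.contains("SQL ERROR") && beginTimestamp != null) {
            // Key by the hour of the surrounding BEGIN plus the error code,
            // e.g. "2012-10-20 00|-678", so a reducer can sum counts per time frame.
            String code = line.replaceAll(".*SQL ERROR (-?\\d+).*", "$1");
            context.write(new Text(beginTimestamp.substring(0, 13) + "|" + code), ONE);
        } else if (line.endsWith("END")) {
            beginTimestamp = null;
        }
    }
}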
This is not an efficient approach, since each log file is processed by a single mapper, so the work within one file is not distributed.
Another approach is to pre-process the log files by combining the relevant lines into a single line. That way, the relevant lines of a log file will be processed by a single mapper only.
FYI, a more complex approach that does not rely on FileInputFormat.isSplitable() is also possible, but it would have to be worked out.
The pros and cons of each approach have to be evaluated and the right one picked.

How to see the input records of a particular Hadoop task?

I am running a Hadoop job. All but 4 tasks are done. I am wondering why it is taking so much longer to process those chunks. My guess is that those input records are "hard" for my job to process. To test locally, I would like to retrieve those input records. How can I do this?
The status column for the task says
hdfs://10.4.94.75:8020/user/someuser/myfilename:154260+3
But what does it mean?
The last part of the status gives you information about the split. More specifically:
hdfs://10.4.94.75:8020/user/someuser/myfilename:154260+3
tells you that the task with this status processed the split of "myfilename" that starts at byte offset 154260 and has length 3.
Given this piece of information, you can recover the records assigned to this task by seeking to byte 154260 in the file and reading 3 bytes, for example as sketched below.
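A small sketch of that (the path and numbers are taken from the status line above; the class name is mine, and note that a record which starts inside the split may extend past the split boundary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitDumper {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("hdfs://10.4.94.75:8020/user/someuser/myfilename");
        long offset = 154260L;   // split start taken from the task status
        int length = 3;          // split length taken from the task status

        FileSystem fs = FileSystem.get(file.toUri(), conf);
        FSDataInputStream in = fs.open(file);
        try {
            byte[] buf = new byte[length];
            in.readFully(offset, buf);   // positional read of the split's bytes
            System.out.write(buf);
            System.out.println();
        } finally {
            in.close();
        }
    }
}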

What can lead to failures in appending data to a file?

I maintain a program that is responsible for collecting data from a data acquisition system and appending that data to a very large (size > 4GB) binary file. Before appending data, the program must validate the header of this file in order to ensure that the meta-data in the file matches that which has been collected. In order to do this, I open the file as follows:
data_file = fopen(file_name, "rb+");
I then seek to the beginning of the file in order to validate the header. When this is done, I seek to the end of the file as follows:
_fseeki64(data_file, _filelengthi64(data_file), SEEK_SET);
At this point, I write the data that has been collected using fwrite(). I am careful to check the return values from all I/O functions.
One of the computers (Windows 7, 64-bit) on which we have been testing this program intermittently shows a condition where the data appears to have been written to the file, yet neither the file's last-changed time nor its size changes. If any of the calls to fopen(), fseek(), or fwrite() fail, my program throws an exception, which aborts the data collection process and logs the error. On this machine, none of these failures seem to be occurring. Something that makes the matter even more mysterious is that, if a restore point is set on the host file system, the problem goes away only to re-appear intermittently at some future time.
We have tried to reproduce this problem on other machines (a Vista 32-bit operating system) but have had no success in replicating the issue (this doesn't necessarily mean anything, since the problem is so intermittent in the first place).
Has anyone else encountered anything similar to this? Is there a potential remedy?
Further Information
I have now found that the failure occurs when fflush() is called on the file, and that the Win32 error returned by GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). Searching Google for this error turns up a bunch of reports related to "extents" for SQL Server files. I suspect that the file system is exhausting some sort of journaling resource because we are growing a large file by repeatedly opening it, appending a chunk of data, and closing it. I am now looking for a better understanding of this particular error in the hope of coming up with a valid remedy.
The file append is failing because of a file system fragmentation limit. The question was answered in What factors can lead to Win32 error 665 (file system limitation)?
