How to read Hadoop Map intermediate file file.out

I have set the property keep.task.files.pattern to ".*" in mapred-site.xml, restarted the cluster, and executed my test MapReduce program.
I see two files, file.out and file.out.index, in the folder
/opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output/
When I attempt to read file.out using the code below, I get a "not a SequenceFile" error message.
I know for sure it's a binary file: when I try to open file.out with less, it prompts that it's a binary file.
I'm running Hadoop 1.2.1. What is the default map output format?
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output/file.out");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
IntWritable key = new IntWritable();
IntWritable value = new IntWritable();
while (reader.next(key, value)) {
    System.out.println(key.get() + " | " + value.get());
}
reader.close();
Error Message:
Exception in thread "main" java.io.IOException: /opt/hadoopws/tmp/mapred/local/taskTracker/hduser/jobcache/job_201403260903_0001/attempt_201403260903_0001_m_000000_0/output/file.out not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1517)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at HDPConfigRun.run(HDPConfigRun.java:31)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at HDPConfigRun.main(HDPConfigRun.java:45)

The file is similar to a sequence file in that it contains key/value pairs persisted in their serialized form, but it is designed to be scanned quickly for sort and partitioning purposes and uses an internal-only format (the IFile format in Hadoop 1.x), which is why SequenceFile.Reader rejects it.
If you really need to see the set of key/value pairs produced by the mapper, then look into configuring your reducer to use the MultipleOutputs class: one output would contain every key/value pair read in by the reducer, and another output would contain the "real" output of the reducer.
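A minimal sketch of that MultipleOutputs arrangement (old mapred API, which matches Hadoop 1.2.1; the named output "raw" and the Text/Text types are assumptions, not from the question):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public static class DebugReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            // Side output: a copy of every pair the reducer reads in.
            mos.getCollector("raw", reporter).collect(key, value);
            // ... real reduce logic here, writing via output.collect(...)
        }
    }

    @Override
    public void close() throws IOException {
        mos.close(); // flushes and closes all named outputs
    }
}

The driver would declare the side output with MultipleOutputs.addNamedOutput(conf, "raw", TextOutputFormat.class, Text.class, Text.class).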

Related

How to append to an existing file in a Hadoop user program?

I have a Hadoop program in which, when the mapping and reducing phases are done, I need to append to an existing file (which is already on HDFS). How can I do that?
Appending to a file on HDFS is already supported after Hadoop 0.20.2; more information is available here and here.
An append example I found may help you:
// fs, path, conf and blocksize are assumed to be defined elsewhere;
// make(1000) is the example's own helper that builds 1000 bytes of test data.
FSDataOutputStream stm = fs.create(path, true,
        conf.getInt("io.file.buffer.size", 4096),
        (short) 3, blocksize);
String a = make(1000);
stm.write(a.getBytes());
stm.sync(); // flush the written data out to the datanodes
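One caveat worth noting: on Hadoop 1.x clusters the append() call also has to be enabled explicitly, otherwise it throws. A minimal hdfs-site.xml snippet (assuming a 1.x cluster) would be:
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>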
You can use the append method of HDFS:
check whether the file exists or not; if it exists, append the new content to the same file.
For example:
// hdfs, file and data are assumed to be initialized elsewhere.
FileSystem hdfs;
FSDataOutputStream writeInFile;
Path file;
if (hdfs.exists(file)) {
    System.out.println("file exists");
    writeInFile = hdfs.append(file);
    // note: writeBytes drops the high byte of each char;
    // write(data.getBytes("UTF-8")) is safer for non-ASCII data
    writeInFile.writeBytes(data);
} else {
    System.out.println("new file");
    writeInFile = hdfs.create(file, true);
    writeInFile.writeBytes(data);
}
writeInFile.close(); // don't forget to close the stream

Using Amazon S3 as input, output, and intermediate storage in an EMR MapReduce job

I am trying to use Amazon S3 storage with EMR. However, when I run my code I currently get multiple errors like:
java.lang.IllegalArgumentException: This file system object (hdfs://10.254.37.109:9000) does not support access to the request path 's3n://energydata/input/centers_200_10k_norm.csv' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at edu.stanford.cs246.hw2.KMeans$CentroidMapper.setup(KMeans.java:112)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In main I set my input and output paths like this, and I put s3n://energydata/input/centers_200_10k_norm.csv in the configuration property CFILE, which I retrieve in the mapper and reducer:
FileSystem fs = FileSystem.get(conf);
conf.set(CFILE, inPath); //inPath in this case is s3n://energydata/input/centers_200_10k_norm.csv
FileInputFormat.addInputPath(job, new Path(inputDir));
FileOutputFormat.setOutputPath(job, new Path(outputDir));
The error above occurs in my mapper and reducer when I try to access CFILE (s3n://energydata/input/centers_200_10k_norm.csv). This is how I try to get the path:
FileSystem fs = FileSystem.get(context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile)); // <---- error occurs here
s3n://energydata/input/centers_200_10k_norm.csv is one of the input arguments to the program, and when I launched my EMR job I specified my input and output directories as s3n://energydata/input and s3n://energydata/output.
I tried doing what was suggested in file path in hdfs, but I'm still getting the error. Any help would be appreciated.
Thanks!
try instead:
Path cFile = new Path(context.getConfiguration().get(CFILE));
FileSystem fs = cFile.getFileSystem(context.getConfiguration());
DataInputStream d = new DataInputStream(fs.open(cFile));
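For context, the difference between the two lookups (a two-line contrast, reusing cFile from above):

Configuration conf = context.getConfiguration();
FileSystem defaultFs = FileSystem.get(conf);   // always the default FS from fs.default.name (hdfs:// here)
FileSystem s3Fs = cFile.getFileSystem(conf);   // resolved from the path's own scheme (s3n://)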
Thanks. I actually fixed it by using the following code:
String uriStr = "s3n://energydata/centroid/";
URI uri = URI.create(uriStr);
FileSystem fs = FileSystem.get(uri, context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile));

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
The foo files are in a key=IntWritable, value=IntWritable format (using SequenceFileOutputFormat).
Now, I want to start another map-reduce job. The mapper is fine, but each reducer is required to read the entire set of "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in the "public void configure(JobConf conf)" method:
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
    } finally {
        IOUtils.closeStream(reader); // the original never closed the reader
    }
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() does not have a scheme defined, so the file system used depends on the configuration parameter fs.defaultFS (fs.default.name on older releases). If none is set, the default, i.e. the local file system, is used.
Your program writes to the local file system under workingDir/out/foo. It should work in the cluster as well, but it will look at the local file system.
With the above said, I'm not sure why you need the entire set of files from the foo directory; you may want to consider other designs. If needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer; needless to say, close the files you open in the overridden cleanup method. While files can be read in reducers, map/reduce programs are not designed for this kind of functionality.
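One such alternative design, if the foo files are reasonably small: ship them to every reducer through the DistributedCache. A sketch (old mapred API to match the question; the part-00000 name is illustrative):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;

// Driver side: register each output file of the first job with the cache.
// DistributedCache.addCacheFile(new Path("out/foo/part-00000").toUri(), conf);

// Reducer side:
public void configure(JobConf conf) {
    try {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        FileSystem localFs = FileSystem.getLocal(conf); // cached copies are on local disk
        for (Path p : cached) {
            SequenceFile.Reader reader = new SequenceFile.Reader(localFs, p, conf);
            try {
                IntWritable key = new IntWritable();
                IntWritable value = new IntWritable();
                while (reader.next(key, value)) {
                    // use the pair
                }
            } finally {
                reader.close();
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}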

Naming a MapReduce job's part-0000 file after the input file in Hadoop

I have developed code that runs a MapReduce job to read files from an FTP server and write them into HDFS. It writes each file from FTP into the specified output directory, naming it part-0000. If I have multiple files on the FTP server, they all get written to that one part-0000 file in HDFS.
To avoid this, I plan to pass the name of the file as the key, along with the data as the value. The reducer then writes the data to an output file named after the key.
I understand that I have to use an output format that extends MultipleTextOutputFormat. I have written it as follows:
static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        System.out.println("key is: " + key.toString());
        System.out.println("value is: " + value.toString());
        System.out.println("name is: " + name);
        return key.toString();
    }
}
But I fail to get the name of the input file being processed. How do I get the name of the input file?
map.input.file
and
FileSystem fs = file.getFileSystem(conf);
String fileName = fs.getName();
do not return the name of the input file.
Any pointers?
You can get the input file path through context.
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
This will give the full path. If you want just the filename you can do this :
String inputFileName = fileSplit.getPath().getName();
HTH
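To wire that into the job, a mapper along these lines could emit the source file name as the key (a sketch in the new mapreduce API; names are illustrative — with the old mapred API that MultipleTextOutputFormat belongs to, read conf.get("map.input.file") in configure() instead):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text fileName = new Text();

    @Override
    protected void setup(Context context) {
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName.set(split.getPath().getName()); // just the file name, no directories
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(fileName, line); // key = source file name, value = record
    }
}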
I used a FileStatus object in the following code, as my customised input format would not split the input file. It worked fine for me:
FileSystem fs = file.getFileSystem(conf);
FileStatus status = fs.getFileStatus(file);
String fileName = status.getPath().toString();

Migrating Data from HBase to FileSystem (Writing Reducer output to the Local or Hadoop filesystem)

My purpose is to migrate data from HBase tables to flat (say, CSV formatted) files.
I used
TableMapReduceUtil.initTableMapperJob(tableName, scan,
        GetCustomerAccountsMapper.class, Text.class, Result.class,
        job);
for scanning through the HBase table, and TableMapper for the Mapper.
My challenge is forcing the reducer to dump the row values (normalized into a flattened format) to the local (or HDFS) file system.
My problem is that I can neither see the reducer's logs nor find any files at the path I mentioned in the reducer.
It's my 2nd or 3rd MR job, and the first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could show me the right direction.
Here is my reducer code -
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    // note: this creates the file on the local disk of whichever node runs the reducer
    FileSystem fs = LocalFileSystem.getLocal(new Configuration());
    Path dir = new Path("/data/HBaseDataMigration/" + tableName + "_Reducer/" + key.toString());
    FSDataOutputStream fsOut = fs.create(dir, true);
    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(
                    Bytes.toString(key.getBytes()), row, tableName);
            fsOut.writeBytes(normRow);
            //context.write(new Text(key.toString()), new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
    fsOut.flush();
    fsOut.close();
}
My configuration for the reducer output:
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once finished, use hadoop fs to export the file?
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
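Along those lines, a sketch of a reducer that writes through the framework instead of opening files by hand (the flattening step below is a stand-in for the asker's NormalizeHBaserow helper):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ExportReducer extends Reducer<Text, Result, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Result> rows, Context context)
            throws IOException, InterruptedException {
        for (Result row : rows) {
            // Stand-in for NormalizeHBaserow(...): flatten the row to one CSV line.
            String normRow = key.toString() + "," + row.toString();
            context.write(new Text(normRow), NullWritable.get());
        }
    }
}

With FileOutputFormat.setOutputPath(job, out) as in the question, the CSV lines land in HDFS as part-r-* files, which hadoop fs -get (or -getmerge, to collapse them into one local file) can then export.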
