Reading from a sequence file placed in the DistributedCache (Hadoop)

How can I read sequence files from the distributed cache?
I have tried a few things, but I always get a FileNotFoundException.
I'm adding the file to the distributed cache like this:
DistributedCache.addCacheFile(new URI(currentMedoids), conf);
And reading from it in the mapper's setup method:
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
Path[] paths = DistributedCache.getLocalCacheFiles(conf);
List<Element> sketch = new ArrayList<Element>();
SequenceFile.Reader medoidsReader = new SequenceFile.Reader(fs, paths[0], conf);
Writable medoidKey = (Writable) medoidsReader.getKeyClass().newInstance();
Writable medoidValue = (Writable) medoidsReader.getValueClass().newInstance();
while(medoidsReader.next(medoidKey, medoidValue)){
ElementWritable medoidWritable = (ElementWritable)medoidValue;
sketch.add(medoidWritable.getElement());
}

It seems that I should have used getCacheFiles(), which returns URI[] instead of getLocalCacheFiles(), which returns Path[].
Now it works after making that change.
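For completeness, a minimal sketch of the setup method after the fix, assuming the cached file lives on HDFS and reusing the question's own ElementWritable/Element classes:
Configuration conf = context.getConfiguration();
// getCacheFiles() returns the original URIs that were added to the cache
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path medoidsPath = new Path(cacheFiles[0].toString());
FileSystem fs = medoidsPath.getFileSystem(conf);
List<Element> sketch = new ArrayList<Element>();
SequenceFile.Reader medoidsReader = new SequenceFile.Reader(fs, medoidsPath, conf);
try {
    Writable medoidKey = (Writable) ReflectionUtils.newInstance(medoidsReader.getKeyClass(), conf);
    Writable medoidValue = (Writable) ReflectionUtils.newInstance(medoidsReader.getValueClass(), conf);
    while (medoidsReader.next(medoidKey, medoidValue)) {
        sketch.add(((ElementWritable) medoidValue).getElement());
    }
} finally {
    medoidsReader.close();
}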

Related

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so the HDFS is backed by Blob Storage.
I also tried the following code:
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify a Content-Type when uploading a blob, it is set to “application/octet-stream” by default, which gives Tika nothing useful to work with when detecting the document type and encoding.
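One possible workaround, assuming the files really are plain text, is to give Tika explicit hints instead of relying on the blob's Content-Type. A sketch reusing the question's Configs.BLOBSTORAGEPREFIX, fileAdd and spaceContentBuffer (the constants come from org.apache.tika.metadata.Metadata):
FileSystem fs = FileSystem.get(new Configuration());
InputStream stream = fs.open(new Path(Configs.BLOBSTORAGEPREFIX + fileAdd));
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileAdd); // lets type detection use the .txt extension
metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8"); // override the octet-stream default
BodyContentHandler handler = new BodyContentHandler(-1); // -1 lifts the default write limit
new AutoDetectParser().parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());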

No data being written to S3 using Hadoop FileSystem and BouncyCastle

I'm using the following code to write encrypted data to Amazon S3:
byte[] bytes = compressFile(instr, CompressionAlgorithmTags.ZIP);
PGPEncryptedDataGenerator encGen = new PGPEncryptedDataGenerator(new JcePGPDataEncryptorBuilder(PGPEncryptedData.CAST5).setWithIntegrityPacket(withIntegrityCheck).setSecureRandom(new SecureRandom()).setProvider("BC"));
encGen.addMethod(new JcePublicKeyKeyEncryptionMethodGenerator(pubKey).setProvider("BC"));
OutputStream cOut = encGen.open(out, bytes.length);
cOut.write(bytes);
cOut.close();
If I set "out" to:
final OutputStream fsOutStr = new FileOutputStream(new File("/home/hadoop/encrypted.gpg"));
It writes the file just fine.
However, when I attempt to write it to S3, I get no errors and it appears to work, but there is no data on S3 when I check for it:
final FileSystem fileSys = FileSystem.get(new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir)); // outputPath on S3
Any idea why it writes the data perfectly fine to the local disk but does not write the file to S3?
Closing fsOutStr solved the problem.
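In other words, something along these lines (a sketch reusing the question's variables; the native S3 filesystem typically buffers locally and only uploads the object once the output stream is closed):
final FileSystem fileSys = FileSystem.get(new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir)); // outputPath on S3
try {
    OutputStream cOut = encGen.open(fsOutStr, bytes.length);
    cOut.write(bytes);
    cOut.close(); // finishes the PGP stream
} finally {
    fsOutStr.close(); // without this, the object never shows up on S3
}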

Hadoop 1.0.3 SequenceFile.Writer overwrites instead of appending images into a sequence file

I am using Hadoop 1.0.3 (I can't really upgrade right now; that's for later).
I have around 100 images in HDFS and I am trying to combine them into a single sequence file (defaults, no compression, etc.).
Here's my code:
FSDataInputStream in = null;
BytesWritable value = new BytesWritable();
Text key = new Text();
Path inpath = new Path(fs.getHomeDirectory(),"/user/hduser/input");
Path seq_path = new Path(fs.getHomeDirectory(),"/user/hduser/output/file.seq");
FileStatus[] files = fs.listStatus(inpath);
SequenceFile.Writer writer = null;
for( FileStatus fileStatus : files){
inpath = fileStatus.getPath();
try {
in = fs.open(inpath);
byte bufffer[] = new byte[in.available()];
in.read(bufffer);
writer = SequenceFile.createWriter(fs,conf,seq_path,key.getClass(),value.getClass());
writer.append(new Text(inpath.getName()), new BytesWritable(bufffer));
}catch (Exception e) {
System.out.println("Exception MESSAGES = "+e.getMessage());
e.printStackTrace();
}}
This just goes through all the files in input/ and appends them one by one.
However, this just overwrites my sequence file instead of appending to it; I see only the last image in the sequence file.
Note that I am not closing the writer before the for loop ends. Can anyone help me with this, please?
How can I append the images?
Your main issue is with the following line:
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
which sits inside the for loop and creates a new writer on each pass. Each call replaces the previous file at seq_path, so only the last image survives.
Pull the writer creation out of the loop, and the problem should vanish.
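A sketch of the corrected structure, reusing the question's variables (inpath here is the original input directory): create the writer once, append inside the loop, and close it when all images have been written.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seq_path, Text.class, BytesWritable.class);
try {
    for (FileStatus fileStatus : fs.listStatus(inpath)) {
        FSDataInputStream in = fs.open(fileStatus.getPath());
        byte[] buffer = new byte[(int) fileStatus.getLen()];
        in.readFully(buffer); // read the whole image, not just what available() happens to report
        in.close();
        writer.append(new Text(fileStatus.getPath().getName()), new BytesWritable(buffer));
    }
} finally {
    writer.close();
}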

Using Amazon S3 as input, output, and intermediate storage in an EMR MapReduce job

I am trying to use Amazon S3 storage with EMR. However, when I run my code I get multiple errors like:
java.lang.IllegalArgumentException: This file system object (hdfs://10.254.37.109:9000) does not support access to the request path 's3n://energydata/input/centers_200_10k_norm.csv' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at edu.stanford.cs246.hw2.KMeans$CentroidMapper.setup(KMeans.java:112)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In main I set my input and output paths like this, and I put s3n://energydata/input/centers_200_10k_norm.csv into the configuration property CFILE, which I retrieve in the mapper and reducer:
FileSystem fs = FileSystem.get(conf);
conf.set(CFILE, inPath); //inPath in this case is s3n://energydata/input/centers_200_10k_norm.csv
FileInputFormat.addInputPath(job, new Path(inputDir));
FileOutputFormat.setOutputPath(job, new Path(outputDir));
The specific place where the error above occurs is in my mapper and reducer, where I try to access CFILE (s3n://energydata/input/centers_200_10k_norm.csv). This is how I try to get the path:
FileSystem fs = FileSystem.get(context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile)); ---->Error
s3n://energydata/input/centers_200_10k_norm.csv is one of the input arguments to the program, and when I launched my EMR job I specified my input and output directories to be s3n://energydata/input and s3n://energydata/output.
I tried doing what was suggested in "file path in hdfs" but I'm still getting the error. Any help would be appreciated.
Thanks!
Try this instead:
Path cFile = new Path(context.getConfiguration().get(CFILE));
FileSystem fs = cFile.getFileSystem(context.getConfiguration());
DataInputStream d = new DataInputStream(fs.open(cFile));
Thanks. I actually fixed it by using the following code:
String uriStr = "s3n://energydata/centroid/";
URI uri = URI.create(uriStr);
FileSystem fs = FileSystem.get(uri, context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile));

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs files to a directory "foo".
The foo files use key=IntWritable, value=IntWritable (written with a SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer needs to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in public void configure(JobConf conf):
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
The code runs well on a single machine, but I'm not sure it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() has no scheme defined, so the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, is used.
Your program therefore writes to the local file system under workingDir/out/foo. It will run on the cluster as well, but it will be looking at each node's local file system.
That said, I'm not sure why you need every reducer to read the entire foo directory; you may want to consider other designs. If you do need it, the files should be copied to HDFS first and read from the overridden setup method of your reducer; needless to say, close them in the overridden cleanup method. While files can be read in reducers this way, map/reduce programs are not really designed for this kind of functionality.
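To illustrate the resolution rule (the namenode address below is hypothetical):
Configuration conf = new Configuration();
// "out/foo" has no scheme, so it resolves against fs.defaultFS
// (the local file system if nothing else is configured):
FileSystem resolved = FileSystem.get(URI.create("out/foo"), conf);
// an explicit scheme always wins, regardless of fs.defaultFS:
FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/user/arik/out/foo"), conf);
System.out.println(resolved.getUri() + " vs " + hdfs.getUri());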
