No data being written to S3 using Hadoop FileSystem and BouncyCastle - hadoop

I'm using the following code to write encrypted data to Amazon S3:
byte[] bytes = compressFile(instr, CompressionAlgorithmTags.ZIP);
PGPEncryptedDataGenerator encGen = new PGPEncryptedDataGenerator(new JcePGPDataEncryptorBuilder(PGPEncryptedData.CAST5).setWithIntegrityPacket(withIntegrityCheck).setSecureRandom(new SecureRandom()).setProvider("BC"));
encGen.addMethod(new JcePublicKeyKeyEncryptionMethodGenerator(pubKey).setProvider("BC"));
OutputStream cOut = encGen.open(out, bytes.length);
cOut.write(bytes);
cOut.close();
If I set "out" to:
final OutputStream fsOutStr = new FileOutputStream(new File("/home/hadoop/encrypted.gpg"));
It writes the file just fine.
However, when I attempt to write it to S3, it gives no errors and appears to work, but there is no data on S3 when I check for it:
final FileSystem fileSys = FileSystem.get(new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir)); // outputPath on S3
Any idea why it writes the data perfectly fine to the local disk but does not write the file to S3?

Closing fsOutStr solved the problem; the Hadoop S3 output stream does not complete the upload until it is closed.
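A minimal sketch of the fixed write path (variable names follow the snippets above):
OutputStream cOut = encGen.open(fsOutStr, bytes.length);
cOut.write(bytes);
cOut.close();     // finishes the PGP encrypted data packet
fsOutStr.close(); // completes the upload to S3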

Related

How to decompress Hadoop snappy compressed file in Java

We are compressing Flink job S3 output using Hadoop Parquet+Snappy compression.
AvroParquetWriter.<T>builder(out)
.withSchema(schema)
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withDataModel(dataModel)
.build();
We then tried to use Hadoop's SnappyDecompressor and the snappy-java dependency in an EC2 Java service to decompress this file, but both return the following exception:
"java.lang.UnsatisfiedLinkError: 'int org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect()'"
Please let us know the correct way to decompress these files in a Java EC2 service.
// attempt 1: Hadoop's SnappyDecompressor
final SnappyDecompressor decompressor = new SnappyDecompressor();
final byte[] data = IOUtils.toByteArray(s3ObjectInputStream);
decompressor.setInput(data, 0, data.length);
final byte[] uncompressed = new byte[10 * 1024 * 1024];
decompressor.decompress(uncompressed, 0, data.length);
// attempt 2: snappy-java's SnappyInputStream
final SnappyInputStream snappyInputStream = new SnappyInputStream(s3ObjectInputStream);
final List<String> lines = IOUtils.readLines(snappyInputStream, StandardCharsets.UTF_8);
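As a side note (not part of the original post), files written with AvroParquetWriter are Parquet containers whose pages are Snappy-compressed internally, not raw Snappy streams, so a sketch of reading one back would normally go through AvroParquetReader and let Parquet handle the decompression; the path and record type below are placeholders:
ParquetReader<GenericRecord> reader = AvroParquetReader
.<GenericRecord>builder(new Path("/tmp/file.parquet")) // placeholder path
.withConf(new Configuration())
.build();
GenericRecord record;
while ((record = reader.read()) != null) {
System.out.println(record); // each record is one row of the Parquet file
}
reader.close();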

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so the HDFS is backed by Blob Storage.
I also tried the following code
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify a Content-Type when uploading a blob, Azure sets it to "application/octet-stream" by default.
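One workaround on the reading side (a sketch, not from the original answer, assuming the files really are plain UTF-8 text) is to give Tika an explicit content-type hint through the Metadata object so the charset detector is not left guessing:
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8"); // assumed charset
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(stream, handler, metadata);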

Reading from a sequence file placed in DistributedCache Hadoop

How can I read sequence files from the distributed cache?
I have tried several things, but I always get a FileNotFoundException.
I'm adding the file to the distributed cache like this:
DistributedCache.addCacheFile(new URI(currentMedoids), conf);
And reading from it in the mapper's setup method:
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
Path[] paths = DistributedCache.getLocalCacheFiles(conf);
List<Element> sketch = new ArrayList<Element>();
SequenceFile.Reader medoidsReader = new SequenceFile.Reader(fs, paths[0], conf);
Writable medoidKey = (Writable) medoidsReader.getKeyClass().newInstance();
Writable medoidValue = (Writable) medoidsReader.getValueClass().newInstance();
while(medoidsReader.next(medoidKey, medoidValue)){
ElementWritable medoidWritable = (ElementWritable)medoidValue;
sketch.add(medoidWritable.getElement());
}
It seems that I should have used getCacheFiles(), which returns URI[], instead of getLocalCacheFiles(), which returns Path[].
Now it works after making that change.
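For completeness, a sketch of the reader under that change (names follow the snippet above): resolve the cached URI to a Path and open the SequenceFile.Reader against the file system that owns it:
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path medoidsPath = new Path(cacheFiles[0].toString());
FileSystem cacheFs = medoidsPath.getFileSystem(conf);
SequenceFile.Reader medoidsReader = new SequenceFile.Reader(cacheFs, medoidsPath, conf);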

hadoop-1.0.3 sequenceFile.Writer overwrites instead of appending images into a sequencefile

I am using Hadoop 1.0.3 (I can't really upgrade right now; that's for later).
I have around 100 images in my HDFS and I am trying to combine them into a single sequence file (default settings, no compression, etc.).
Here's my code:
FSDataInputStream in = null;
BytesWritable value = new BytesWritable();
Text key = new Text();
Path inpath = new Path(fs.getHomeDirectory(),"/user/hduser/input");
Path seq_path = new Path(fs.getHomeDirectory(),"/user/hduser/output/file.seq");
FileStatus[] files = fs.listStatus(inpath);
SequenceFile.Writer writer = null;
for( FileStatus fileStatus : files){
inpath = fileStatus.getPath();
try {
in = fs.open(inpath);
byte[] buffer = new byte[in.available()];
in.read(buffer);
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
writer.append(new Text(inpath.getName()), new BytesWritable(buffer));
} catch (Exception e) {
System.out.println("Exception MESSAGES = " + e.getMessage());
e.printStackTrace();
}
}
This just goes through all the files in input/ and appends them one by one.
HOWEVER, this just overwrites my sequence file instead of appending to it; I see only the last image in the sequence file.
NOTE: I am not closing the writer before the for loop ends. Can anyone help me with this, please?
I am not sure how I can append the images.
Your main issue is with the following line:
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
which is inside the for loop and creates a new writer on each pass. It replaces the previous file at the path seq_path, which is why only the last image is available.
Pull it out of the loop, and the problem should vanish.
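A sketch of that change applied to the code above: create the writer once before the loop and close it after the loop, so each image is appended to the same sequence file:
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
for (FileStatus fileStatus : files) {
Path imagePath = fileStatus.getPath();
FSDataInputStream in = fs.open(imagePath);
byte[] buffer = new byte[in.available()]; // mirrors the original snippet; in.available() is not guaranteed to be the full file size
in.read(buffer);
writer.append(new Text(imagePath.getName()), new BytesWritable(buffer));
in.close();
}
writer.close();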

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs files to a directory "foo".
The foo files are in key=IntWritable, value=IntWritable format (I used SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
The code runs well on a single machine, but I'm not sure whether it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() has no scheme defined, so the file system used depends on the configuration parameter fs.defaultFS; if none is set, the default, i.e. the local file system, is used.
Your program therefore resolves out/foo against the local file system (under the working directory). It should run on the cluster as well, but it will still look at the local file system.
That said, I'm not sure why you need the entire contents of the foo directory in every reducer; you may want to consider other designs. If needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer; needless to say, close them in the overridden cleanup method. While the files can be read in reducers this way, map/reduce programs are not designed for this kind of functionality.
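For instance (a sketch; the namenode host, port and path are placeholders), qualifying the URI with an explicit hdfs:// scheme removes the ambiguity:
String uri = "hdfs://namenode:8020/user/hadoop/out/foo"; // placeholder address and path
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));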
