Tika text extraction not working on HDFS - hadoop

I'm trying to use Tika to extract text from a bunch of simple .txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so the HDFS layer is backed by Azure Blob Storage.
I also tried the following code
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document

If the user does not specify a Content-Type when uploading a blob, it is set to "application/octet-stream" by default, so Tika gets no hint about the document type or character encoding.
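A minimal sketch of working around this, assuming the blobs really are plain text: pass the file name and content type to Tika through the Metadata object so AutoDetectParser does not have to rely on the blob's default Content-Type. The UTF-8 charset and the helper class name are assumptions, not part of the original question.

import java.io.InputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper: extract text from a file on HDFS/Blob Storage,
// giving Tika explicit hints so encoding detection does not fail.
public class TikaHdfsExtractor {
    public static String extract(FileSystem fs, Path pt) throws Exception {
        try (InputStream stream = fs.open(pt)) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, pt.getName());                 // file-name hint
            metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8");       // assumed charset
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }
}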

Related

Unable to use CompressionCodecFactory to derive compression type

I have compressed the data using gzip compression in HDFS. While reading it back from the same URI, I am using CompressionCodecFactory, which is supposed to dynamically find out which compressor was used and decompress the data. Somehow the CompressionCodecFactory does not recognize the gzip compression and returns null.
Configuration conf = new Configuration();
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec cdc = factory.getCodec(new Path(path));
According to the documentation it should dynamically detect the type of compressor used, but somehow it keeps returning null for me.
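For what it's worth, getCodec() resolves the codec from the file-name suffix (".gz", ".bz2", ...), so a path without the expected suffix yields null. A sketch of a possible fallback, assuming the data is known to be gzip-compressed; the path below is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

Configuration conf = new Configuration();
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
Path path = new Path("/data/part-00000.gz");       // placeholder path
CompressionCodec cdc = factory.getCodec(path);     // matched on the ".gz" suffix
if (cdc == null) {
    // Fallback: we know the files were written with gzip.
    cdc = ReflectionUtils.newInstance(GzipCodec.class, conf);
}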

No data being written to S3 using Hadoop FileSystem and BouncyCastle

I'm using the following code to write encrypted data to Amazon S3:
byte[] bytes = compressFile(instr, CompressionAlgorithmTags.ZIP);
PGPEncryptedDataGenerator encGen = new PGPEncryptedDataGenerator(new JcePGPDataEncryptorBuilder(PGPEncryptedData.CAST5).setWithIntegrityPacket(withIntegrityCheck).setSecureRandom(new SecureRandom()).setProvider("BC"));
encGen.addMethod(new JcePublicKeyKeyEncryptionMethodGenerator(pubKey).setProvider("BC"));
OutputStream cOut = encGen.open(out, bytes.length);
cOut.write(bytes);
cOut.close();
If I set "out" to:
final OutputStream fsOutStr = new FileOutputStream(new File("/home/hadoop/encrypted.gpg"));
It writes the file just fine.
However, when I attempt to write it to S3, it does not give me any errors and appears to work, but there is no data on S3 when I check for it:
final FileSystem fileSys = FileSystem.get(new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir)); // outputPath on S3
Any idea why it writes the data perfectly fine to the local disk but does not write the file to S3?
Closing fsOutStr solved the problem.
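A minimal sketch of the fix, reusing the names from the question: the Hadoop S3 output stream buffers the data and only uploads it when the stream is closed, so fsOutStr has to be closed (for example with try-with-resources).

import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Assumes s3OutputDir, encGen and bytes are set up as in the question.
FileSystem fileSys = FileSystem.get(new URI(s3OutputDir), new Configuration());
try (OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir))) {
    OutputStream cOut = encGen.open(fsOutStr, bytes.length);
    cOut.write(bytes);
    cOut.close();   // finish the PGP encrypted-data stream first
}                   // closing fsOutStr flushes the buffered upload to S3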

hadoop-1.0.3 sequenceFile.Writer overwrites instead of appending images into a sequencefile

I am using Hadoop 1.0.3 (I can't really upgrade right now; that's for later).
I have around 100 images in HDFS and I am trying to combine them into a single SequenceFile (defaults, no compression, etc.).
Here's my code:
FSDataInputStream in = null;
BytesWritable value = new BytesWritable();
Text key = new Text();
Path inpath = new Path(fs.getHomeDirectory(),"/user/hduser/input");
Path seq_path = new Path(fs.getHomeDirectory(),"/user/hduser/output/file.seq");
FileStatus[] files = fs.listStatus(inpath);
SequenceFile.Writer writer = null;
for (FileStatus fileStatus : files) {
    inpath = fileStatus.getPath();
    try {
        in = fs.open(inpath);
        byte bufffer[] = new byte[in.available()];
        in.read(bufffer);
        writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
        writer.append(new Text(inpath.getName()), new BytesWritable(bufffer));
    } catch (Exception e) {
        System.out.println("Exception MESSAGES = " + e.getMessage());
        e.printStackTrace();
    }
}
This just goes through all the files in input/ and appends them one by one.
HOWEVER, this just overwrites my sequence file instead of appending to it; I see only the last image in the sequence file.
NOTE: I am not closing the writer before the for loop ends. Can anyone help me with this, please?
I am not sure how I can append the images.
Your main issue is with the following line:
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
which is inside the for loop, creating a new writer on each pass. Each call replaces the previous file at the path seq_path, so only the last image is available.
Pull it out of the loop, and the problem should vanish.
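A rough sketch of the corrected structure, reusing the variables from the question: the writer is created once before the loop and closed after it. fileStatus.getLen() is used to size the buffer because in.available() is not a reliable way to get a file's length; that substitution is mine, not part of the original answer.

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Assumes fs, conf, seq_path and files are set up as in the question.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seq_path, Text.class, BytesWritable.class);
try {
    for (FileStatus fileStatus : files) {
        Path imgPath = fileStatus.getPath();
        byte[] buffer = new byte[(int) fileStatus.getLen()];
        FSDataInputStream in = fs.open(imgPath);
        IOUtils.readFully(in, buffer, 0, buffer.length);
        in.close();
        writer.append(new Text(imgPath.getName()), new BytesWritable(buffer));
    }
} finally {
    writer.close();   // close once, after all images have been appended
}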

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
The files in foo have key=IntWritable, value=IntWritable format (I used a SequenceFileOutputFormat).
Now I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in "public void configure(JobConf conf)":
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The code runs well on a single machine, but I'm not sure whether it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() does not have a scheme defined, so the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, is used.
Your program works against the local file system under workingDir/out/foo. It should run on the cluster as well, but it will still be looking at the local file system.
That said, I'm not sure why you need to read the entire contents of the foo directory; you may want to consider other designs. If it is really needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer (and, needless to say, closed in the overridden cleanup method). While the files can be read in reducers, map/reduce programs are not designed for this kind of functionality.
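A minimal sketch of reading the "foo" output with an explicit HDFS scheme, so the lookup does not fall back to fs.defaultFS and the local file system. The "hdfs://namenode:8020" authority and the /user/arik path are placeholders, not values from the question.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);   // explicit scheme
for (FileStatus status : fs.listStatus(new Path("/user/arik/out/foo"))) {
    if (status.getPath().getName().startsWith("_")) {
        continue;   // skip _SUCCESS and _logs entries
    }
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
    try {
        IntWritable key = new IntWritable();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            // handle each (key, value) pair
        }
    } finally {
        reader.close();
    }
}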

Error Appending to IsolatedStorageFile

I am having some problems with isolated storage. I am trying to append to a file, but when I use the code below, I get an error about invalid arguments on this line:
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileMode.OpenOrCreate, myStore))
I think it has something to do with FileMode.Append. I am trying to append to the file rather than create a new one.
// Obtain the virtual store for the application.
IsolatedStorageFile myStore = IsolatedStorageFile.GetUserStoreForApplication();
// Create a new folder and call it "MyFolder".
myStore.CreateDirectory("Folder");
// Specify the file path and options.
using (var isoFileStream = new IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append, FileMode.OpenOrCreate, myStore))
{
//Write the data
using (var isoFileWriter = new StreamWriter(isoFileStream))
{
isoFileWriter.WriteLine(textBox1.Text);
isoFileWriter.WriteLine(textBox2.Text);
isoFileWriter.WriteLine(textBox3.Text);
}
}
There is no overload that takes two FileModes. It should be
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileAccess.Write, myStore));
An important thing to note about FileMode.Append is:
[FileMode.Append] Opens the file if it exists and seeks to the end of the file, or
creates a new file. Append can only be used in conjunction with Write.
Attempting to seek to a position before the end of the file will throw
an IOException, and any attempt to read fails and throws a
NotSupportedException.
which is why FileAccess.Write is used.
It looks like you have FileMode.Append, FileMode.OpenOrCreate; that is two FileModes. The first parameter should be a FileMode and the second should be a FileAccess.
That should fix your problem.
