Using pig or hadoop streaming, has anyone loaded and uncompressed a zipped file? The original csv file was compressed using pkzip.
Not sure if this helps because its mainly focused on using MapReduce in Java, but there is a ZipFileInputFormat available in hadoop. Its use via the Java API is described here:
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
The main part of this is the ZipFileRecordReader which uses Javas ZipInputStream to process each ZipEntry. The Hadoop reader is probably not going to work for you out of the box because it passes the file path of each ZipEntry as the key and the ZipEntry contents as the value.
Related
I have a spark code that runs on a yarn cluster and converts csv to parquet using databricks library.
It works fine when the csv source is hdfs. But when the csv source is non-hdfs, which is usually the case, I come across this exception.
It should not happen as the same code works for hdfs csv source.
Complete link to the issue :
https://issues.apache.org/jira/browse/SPARK-19344
As discussed in the comments.
When the files are on the driver node, but not access-able by the nodes, the read will fail.
When using reading input file (e.g. spark.read in spark 2.0), the files should be be access by all executors nodes (e.g. when the files are on HDFS, cassandra, etc)
I have created a real time application in which I am writing data streams to hdfs from weblogs using flume, and then processing that data using spark stream. But while flume is writing and creating new files in hdfs spark stream is unable to process those files. If I am putting the files to hdfs directory using put command spark stream is able to read and process the files. Any help regarding the same will be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and can not be read by any other process. On the contrary, as you have experienced, if you put a batch of data (that's yur file, a batch, not a stream), once it is uploaded it is ready for being read.
Anyway, and not being an expert on Spark streaming, it seems from the Spark Streaming Programming Guide, Overview section, that you are not performing the right deployment. I mean, from the picture shown there, it seems the streaming (in this case generated by Flume) must be directly sent to Spark Streaming engine; then the results will be put in HDFS.
Nevertheless, if you want to maintain your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporal HDFS folders, and once the mini-batches are ready, store new data in a second minibatch, passing the first batch to Spark for analysis.
HTH
In addition to frb's answer: which is correct - SparkStreaming with Flume acts as an Avro RPC Server - you'll need to configure an AvroSink which points to your SparkStreaming instance.
with spark2, now you can connect directly your spark streaming to flume, see official docs, and then write once on HDFS at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
I'm trying to run a streaming job where the input files are csv inside zip files.
I tried using this, however it doesn't seem for work with CDH4 (I get the error class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat)
Anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi file reader (that can be given the top level directory).
I ended up writing zipstream.
Note that is process only the first file in the zip, I'll probably add support for multiple files later.
There are two hadoop api's for input formats. mapred.InputFormat, and mapreduce.InputFormat.
mapreduce is the newer API and the one you should be using if you can.
I would check to see which InputFormat the ZipInputFormat actually implements. If it implements the mapreduce version you'll need to move your job over to this second API.
For a bit of background: In an earlier Hadoop version 'mapred' was depreciated in favor of 'mapreduce', a newer, faster, and cleaner implementation. Unfortunately this new API didn't include all the features of the old one, so in more recent versions of Hadoop 'mapred' was reinstated, and now there are two APIs that basically do the same thing.
I have large mbox files and I am using third party API like mstor to parse messages from mbox file using hadoop. I have uploaded those files in hdfs. But the problem is that this API uses only local file system path , similar to shown below
MessageStoreApi store = new MessageStoreApi(“file location in locl file system”);
I could not find a constructor in this API that would initialize from stream . So I cannot read hdfs stream and initialize it.
Now my question is, should I copy my files from hdfs to local file system and initialize it from local temporary folder? As thats what I have been doing for now:
Currently My Map function receives path of the mbox files.
Map(key=path_of_mbox_file in_hdfs, value=null){
String local_temp_file = CopyToLocalFile(path in hdfs);
MessageStoreApi store = new MessageStoreApi(“local_temp_file”);
//process file
}
Or Is there some other solution? I am expecting some solution like what If I increase the block-size so that single file fits in one block and somehow if I can get the location of those blocks in my map function, as mostly map functions will execute on the same node where those blocks are stored then I may not have to always download to local file system? But I am not sure if that will always work :)
Suggestions , comments are welcome!
For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
The NFS feature is more maintained upstream than the FUSE option, currently.
I am using MAPI tools (Its microsoft lib and in .NET) and then apache TIKA libraries to process and extract the pst from exchange server, which is not scalable.
How can I process/extracts pst using MR way ... Is there any tool, library available in java which I can use in my MR jobs. Any help would be great-full .
Jpst Lib internally uses: PstFile pstFile = new PstFile(java.io.File)
And the problem is for Hadoop API's we don't have anything close to java.io.File.
Following option is always there but not efficient:
File tempFile = File.createTempFile("myfile", ".tmp");
fs.moveToLocalFile(new Path (<HDFS pst path>) , new Path(tempFile.getAbsolutePath()) );
PstFile pstFile = new PstFile(tempFile);
Take a look at Behemoth (http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html). It combines Tika and Hadoop.
I've also written by own Hadoop + Tika jobs. The pattern is:
Wrap all the pst files into sequencence or avro files.
Write a map only job that reads the pst files form the avro files and writes it to the local disk.
Run tika across the files.
Write the output of tika back into a sequence file
Hope that help.s
Its not possible to process PST file in mapper. after long analysis and debug it was found out that the API is not exposed properly and those API needs localfile system to store extracted pst contents. It directly cant store on HDFS. thats bottle-neck. And all those API's(libs that extract and process) are not free.
what we can do is extract outside hdfs and then we can process in MR jobs