How to open/stream .zip files through Spark? - hadoop

I have zip files that I would like to open 'through' Spark. I can open .gzip files without a problem thanks to Hadoop's native codec support, but I am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.

There was no solution with Python code here, and I recently had to read zips in PySpark. While searching for how to do that I came across this question, so hopefully this will help others.
import zipfile
import io

def zip_extract(x):
    # x is a (path, bytes) pair from binaryFiles; wrap the raw bytes so ZipFile can read them in memory
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the code above I return a dictionary with each filename in the zip as a key and the text data of each file as the value. You can change it however you want to suit your purposes.

@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to open and read the zip file perfectly.
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as the key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to get something running first.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
        // The value holds the whole file content; read it line by line
        InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            // Key each line by its first tab-separated field
            Tuple2 newTuple = new Tuple2(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});

Please try the code below, using the API:
sparkContext.newAPIHadoopRDD(
    hadoopConf,
    InputFormat.class,
    ImmutableBytesWritable.class,
    Result.class)

I've had a similar issue and I solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    Stream.continually(zipInputStream.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { zipEntry => ??? }
  }

This answer only collects the previous knowledge, plus my own experience.
ZipFileInputFormat
I tried following @Tinku's and @JeffLL's answers and used the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. I also do not know how I would put the com-cotdp-hadoop lib on my production cluster, since I am not responsible for the setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer and I struggled for quite some time to actually get the decompressed output.
Before I was able to do so, I had to cover all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream
import java.io.BufferedReader
import java.io.InputStreamReader

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }

Using the API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class), the file name should be passed in via the conf:
conf = new Job().getConfiguration()
conf.set(PROPERTY_NAME from your input formatter, "Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Please find the PROPERTY_NAME for setting the path in your input formatter.
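For reference, here is a minimal sketch of how that call could look end to end. It assumes the com-cotdp-hadoop ZipFileInputFormat mentioned above, a JavaSparkContext named sparkContext, a made-up HDFS path, and that Hadoop 2's standard FileInputFormat input-path property is the one your formatter reads; treat it as a sketch, not a drop-in solution.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;

// Sketch only: ZipFileInputFormat comes from the com-cotdp-hadoop library,
// and the HDFS path below is a hypothetical example.
Job job = Job.getInstance();
Configuration conf = job.getConfiguration();
// FileInputFormat-based formats read their input path from this standard property.
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///Testing/target_file.zip");

JavaPairRDD<Text, Text> zipRdd = sparkContext.newAPIHadoopRDD(
        conf,
        ZipFileInputFormat.class, // zip-aware input format
        Text.class,               // key: name of the file inside the zip
        Text.class);              // value: contents of that file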

Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")

Related

Apache Commons CSV parser: Not able to read the values

I am using the Apache Commons CSV parser to convert a CSV to a map. In the map I am not able to read some values through the IntelliJ debugger. If I manually type map.get("key") the value is null; however, if I copy-paste the key from the map, I get the data. I couldn't understand what is going wrong. Any pointers would help. Thanks
Here is my CSV parser code:
private CSVParser parseCSV(InputStream inputStream) {
    System.out.println("What is the encoding " + new InputStreamReader(inputStream).getEncoding());
    try {
        return new CSVParser(new InputStreamReader(inputStream), CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreHeaderCase()
                .withSkipHeaderRecord()
                .withTrim());
    } catch (IOException e) {
        throw new IPRSException(e);
    }
}
There was a weird character in the strings (reference: Reading UTF-8 - BOM marker). The syntax below helped to resolve the issue:
header = header.replace("\uFEFF", "");
In Java, use UnicodeReader:
String path = "demo.csv";
CSVFormat.Builder builder = CSVFormat.RFC4180.builder();
CSVFormat format = builder.setQuote(null).setHeader().build();
InputStream in = new FileInputStream(new File(path));
CSVParser parser = new CSVParser(new BufferedReader(new UnicodeReader(in)), format);
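Another option, as an untested sketch assuming Commons IO is on the classpath and inputStream is the stream from the question's parseCSV method, is to strip the BOM before the parser ever sees it by wrapping the stream in BOMInputStream:
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.io.input.BOMInputStream;

// BOMInputStream skips a leading UTF-8 BOM, so "\uFEFF" never ends up in the first header name.
CSVParser parser = new CSVParser(
        new InputStreamReader(new BOMInputStream(inputStream), StandardCharsets.UTF_8),
        CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());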

Need to extract attributes directly from Avro using NiFi

I have found no way in NiFi to extract attributes directly from Avro, so I am using ConvertAvroToJson -> EvaluateJsonPath -> ConvertJsonToAvro as a workaround.
But I would like to write a script to extract the attributes from the Avro flow file for use in an ExecuteScript processor to determine if it is a better approach.
Does anyone have a script to do this? Otherwise, I may end up using the original approach.
Thanks,
Kevin
Here's a Groovy script (which needs the Avro JAR in its Module Directory property) where I let the user specify dynamic properties with JSONPath expressions to be evaluated against the Avro file. Ironically it does a GenericData.toString() which converts the record to JSON anyway, but perhaps there is some code in here you could reuse:
import org.apache.avro.*
import org.apache.avro.generic.*
import org.apache.avro.file.*
import groovy.json.*
import org.apache.commons.io.IOUtils
import java.nio.charset.*

flowFile = session.get()
if (!flowFile) return

final GenericData genericData = GenericData.get();
slurper = new JsonSlurper().setType(JsonParserType.INDEX_OVERLAY)
// Dynamic properties whose names start with "avro.path" hold the user-supplied path expressions
pathAttrs = this.binding?.variables?.findAll { attr -> attr.key.startsWith('avro.path') }
newAttrs = [:]
try {
    session.read(flowFile, { inputStream ->
        def reader = new DataFileStream<>(inputStream, new GenericDatumReader<GenericRecord>())
        GenericRecord currRecord = null;
        if (reader.hasNext()) {
            currRecord = reader.next();
            log.info(genericData.toString(currRecord))
            record = slurper.parseText(genericData.toString(currRecord))
            pathAttrs?.each { k, v ->
                object = record
                v.value.tokenize('.').each {
                    object = object[it]
                }
                newAttrs[k - "avro.path."] = String.valueOf(object)
            }
            reader.close()
        }
    } as InputStreamCallback)
    newAttrs.each { k, v ->
        flowFile = session.putAttribute(flowFile, k, v)
    }
    session.transfer(flowFile, REL_SUCCESS)
} catch (e) {
    log.error("Error during Avro Path: {}", [e.message] as Object[], e)
    session.transfer(flowFile, REL_FAILURE)
}
If you meant extracting Avro metadata rather than fields (not totally sure what you meant by "attributes"), also check MergeContent's AvroMerge, as there is some code in there to pull Avro metadata.
If you are extracting simple patterns from a single Avro record per flowfile, ExtractText may be sufficient for you. If you want to take advantage of the new record processing available in Apache NiFi 1.3.0, AvroReader is where you should start, and there are a series of blogs describing this process in detail. You can also extract Avro metadata with ExtractAvroMetadata.

How to use CombineFileInputFormat on gzip files?

What is the best way to use CombineFileInputFormat on gzip files?
This article will help you build your own InputFormat with the help of CombineFileInputFormat to read and process gzip files. The parts below give you an idea of what needs to be done.
Custom InputFormat:
Build your own custom InputFormat, extending CombineFileInputFormat. The key has to be your own Writable class holding the filename and offset, and the value will be the actual file content. Set isSplitable to false (we don't want to split the files) and set maxSplitSize to a value that suits your requirements; based on that value, the CombineFileRecordReader decides the number of splits and creates a record reader instance for each split. A rough sketch follows.
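As an illustrative sketch of what such an InputFormat could look like (the key class FileLineWritable and the reader class GzipFileRecordReader are placeholders for your own classes, not code from the article):
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Combines many small gzip files into larger splits; one record reader instance per file.
public class CombineGzipFileInputFormat extends CombineFileInputFormat<FileLineWritable, Text> {

    public CombineGzipFileInputFormat() {
        super();
        // Cap the size of a combined split; tune this to your requirements.
        setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split an individual (gzip) file.
        return false;
    }

    @Override
    public RecordReader<FileLineWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader creates one GzipFileRecordReader per file in the combined split.
        return new CombineFileRecordReader<FileLineWritable, Text>(
                (CombineFileSplit) split, context, GzipFileRecordReader.class);
    }
}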
You also have to build your own custom RecordReader, adding your decompression logic to it.
Custom RecordReader:
The custom RecordReader uses a LineReader and sets the key to the filename and offset, and the value to the actual file content. If the file is compressed, it decompresses it and then reads it. Here is the extract for that:
private void codecWiseDecompress(Configuration conf) throws IOException {
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);
    if (codec == null) {
        System.err.println("No Codec Found For " + path);
        System.exit(1);
    }
    String outputUri =
        CompressionCodecFactory.removeSuffix(path.toString(), codec.getDefaultExtension());
    dPath = new Path(outputUri);
    InputStream in = null;
    OutputStream out = null;
    fs = this.path.getFileSystem(conf);
    try {
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
        rlength = fs.getFileStatus(dPath).getLen();
    }
}
Custom Writable Class:
A pair holding the filename and the offset; a minimal sketch follows.
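For illustration, a minimal sketch of such a key class (the name FileLineWritable is a placeholder matching the sketch above, not code from the article):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Key pairing the source file name with the offset of the record within that file.
public class FileLineWritable implements WritableComparable<FileLineWritable> {

    public String fileName;
    public long offset;

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, fileName);
        out.writeLong(offset);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fileName = Text.readString(in);
        offset = in.readLong();
    }

    @Override
    public int compareTo(FileLineWritable that) {
        int cmp = this.fileName.compareTo(that.fileName);
        return cmp != 0 ? cmp : Long.compare(this.offset, that.offset);
    }
}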

Access hdfs file from udf

I'd like to access a file from my UDF call. This is my script:
files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};
The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems that the file is not discovered, maybe because I'm trying to access the file on HDFS.
How can I access a file in hdfs, from my UDF java call?
Inside an EvalFunc you can get a file from the HDFS via:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
....
You might also consider putting the files into the distributed cache, in that case you have to override getCacheFiles() in your EvalFunc class.
E.g:
@Override
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(2);
    list.add("/cache/pig/wordlist1.txt#w1");
    list.add("/cache/pig/wordlist2.txt#w2");
    return list;
}
Then you can just pass the symlinks of the files (w1 and w2) in order to get them from the local file system of each of the worker nodes:
BufferedReader br = new BufferedReader(new FileReader(fileName));

IsolatedFileStorage XML Reading Crash

OK so, basically my problem is with reading an XML file from IsolatedFileStorage. I'll go through the process that leads to my error and then I'll list the relevant code and XML file.
On the first execution it recognises that the file does not exist - it therefore creates the file in IsolatedFileStorage
On the second execution it can now see that the file does exist and so it loads the XML file
On the third execution it can see that it exists - but it throws an XML error
I cannot for the life of me find a solution to it (link to other discussion on MSDN here)
So the code for reading/creating the XML file in IsolatedFileStorage is as follows:
try
{
    /***********************
     * CHECK THE SETTINGS
     ********************/
    if (store.FileExists("AppSettings.xml"))
    {
        streamSettings = new IsolatedStorageFileStream("AppSettings.xml", System.IO.FileMode.Open, store);
        DebugHelp.Text = "AppSettings.xml exists... Loading!";
        streamSettings.Seek(0, System.IO.SeekOrigin.Begin);
        xmlDoc = XDocument.Load(streamSettings, LoadOptions.None);
    }
    else
    {
        streamSettings = new IsolatedStorageFileStream("AppSettings.xml", System.IO.FileMode.Create, store);
        DebugHelp.Text = "AppSettings.xml does not exist... Creating!";
        xmlDoc = XDocument.Load("AppSettings.xml", LoadOptions.None);
    }
    if (xmlDoc != null)
        xmlDoc.Save(streamSettings);
}
catch (Exception e)
{
    DebugHelp.Text = e.ToString();
}
finally
{
    streamSettings.Close();
}
And the related XML file is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<Settings>
</Settings>
Extremely advanced, I know; however, it throws the following error (here), and you can find the full error text at the bottom of the Social.MSDN page.
Please help - I have been looking for a solution (as the one on the social.msdn site didn't work) for about 2 weeks now.
Why don't you try to read the file using a simple StreamReader? Below is part of a method I created to read a file from the store. Give it a try, check your content, and then try loading the XML from a String (XDocument.Parse, etc.).
String fileContent = String.Empty;
using (_store = IsolatedStorageFile.GetUserStoreForApplication())
{
    if (_store.FileExists(file))
    {
        _storeStream = new IsolatedStorageFileStream(file, FileMode.Open, _store);
        using (StreamReader sr = new StreamReader(_storeStream))
        {
            fileContent = sr.ReadToEnd();
        }
        _storeStream.Close();
        return fileContent;
    }
    else
    {
        return null;
    }
}
It looks to me like the problem is in your save method: you may be appending the settings each time you close. To overwrite your existing settings, you need to ensure that you delete your existing file and create a new one.
To help debug this, try using http://wp7explorer.codeplex.com/ - this might help you see the raw file "on disk"
As an aside, for settings in general, do check out the AppSettings that IsolatedStorage provides by default - unless you have complicated needs, then these may suffice on their own.
Your code sample isn't complete, so it's hard to say for sure, but rather than just seeking to the start of the file you may find it easier to just delete it if it already exists. You can do this with FileMode.Create. In turn this means you can do away with the need to check for the existing file.
I suspect that the problem is that you are writing a smaller amount of text to the file on subsequent attempts and so leaving part of the original/previous text behind. In turn this creates a file which contains invalid XML.
