How to use CombineFileInputFormat on gzip files? - hadoop

What is the best way to use CombineFileInputFormat on gzip files?

This article will help you build your own InputFormat on top of CombineFileInputFormat to read and process gzip files. The parts below outline what needs to be done.
Custom InputFormat:
Build a custom input format that closely follows CombineFileInputFormat. The key has to be your own Writable class holding the file name and offset; the value is the actual file content. Override isSplitable to return false (we don't want to split a gzip file), and set the maximum split size to whatever suits your requirements. Based on that value, the input format decides how many splits to create, and CombineFileRecordReader creates a record reader instance for each file in a split.
You also have to build your own custom RecordReader and add your decompression logic to it.
Custom RecordReader:
The custom RecordReader uses LineReader and sets the key to the file name and offset and the value to the actual file content. If the file is compressed, it decompresses it first and then reads it. Here is the relevant extract:
private void codecWiseDecompress(Configuration conf) throws IOException {
    // Pick the codec from the file extension (e.g. .gz).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);
    if (codec == null) {
        // Fail this task with an exception instead of calling System.exit(),
        // which would terminate the whole JVM.
        throw new IOException("No codec found for " + path);
    }
    // The decompressed copy gets the original name minus the codec suffix.
    String outputUri =
        CompressionCodecFactory.removeSuffix(path.toString(),
            codec.getDefaultExtension());
    dPath = new Path(outputUri);
    InputStream in = null;
    OutputStream out = null;
    fs = this.path.getFileSystem(conf);
    try {
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
    }
    // Length of the decompressed file, read only after the copy has succeeded.
    rlength = fs.getFileStatus(dPath).getLen();
}
Custom Writable Class:
A pair holding the file name and the offset.
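As a rough, framework-free illustration of the record shape described above (key = file name plus offset, value = line content), the following Python sketch decompresses one gzip stream as a whole and emits such records. It does not use the Hadoop API: FileOffsetKey and read_gzip_records are hypothetical names, and the offset is assumed to be the byte position within the decompressed data.

```python
import gzip
import io
from dataclasses import dataclass
from typing import BinaryIO, Iterator, Tuple


@dataclass(frozen=True)
class FileOffsetKey:
    """Hypothetical stand-in for the custom Writable key: (file name, offset)."""
    file_name: str
    offset: int


def read_gzip_records(file_name: str, raw: BinaryIO) -> Iterator[Tuple[FileOffsetKey, str]]:
    """Decompress one whole gzip stream and yield (key, line) records.

    The stream is never split: gzip has to be read from the start, which is
    why the input format above must report the file as not splittable.
    """
    offset = 0  # byte position in the decompressed data (an assumption of this sketch)
    with gzip.GzipFile(fileobj=raw, mode="rb") as stream:
        for line in stream:
            yield FileOffsetKey(file_name, offset), line.rstrip(b"\r\n").decode("utf-8")
            offset += len(line)


# Build a small gzip payload in memory and read it back as records.
sample = io.BytesIO(gzip.compress(b"alpha\nbeta\ngamma\n"))
records = list(read_gzip_records("part-0001.gz", sample))
```

Each record carries enough information to trace a line back to its source file, which is the reason for using a composite key when many small files are combined into one split.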

Related

How to extract and manipulate data within a Nifi processor

I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final ComponentLog logger = getLogger();
    final AtomicBoolean error = new AtomicBoolean();
    final AtomicReference<String> result = new AtomicReference<>(null);
    // This uses a lambda function in place of a callback for InputStreamCallback#process()
    session.read(flowFile, in -> {
        long start = System.nanoTime();
        // Read the flowfile content into a String
        // TODO: May need to buffer this if the content is large
        try {
            final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
            result.set(new MyMathOperationService().performSomeOperation(contents));
            long stop = System.nanoTime();
            if (logger.isDebugEnabled()) {
                final long durationNanos = stop - start;
                DecimalFormat df = new DecimalFormat("#.###");
                logger.debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
            }
        } catch (Exception e) {
            error.set(true);
            logger.error(e.getMessage() + " Routing to failure.", e);
        }
    });
    if (error.get()) {
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // Again, a lambda takes the place of the StreamCallback#process()
        FlowFile updatedFlowFile = session.write(flowFile, (in, out) -> {
            final String resultString = result.get();
            final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
            // TODO: This can use a while loop for performance
            out.write(resultBytes, 0, resultBytes.length);
            out.flush();
        });
        session.transfer(updatedFlowFile, REL_SUCCESS);
    }
}
Daggett is right that the ExecuteScript processor is a good place to start because it will shorten the development lifecycle (no building NARs, deploying, and restarting NiFi to use it) and when you have the correct behavior, you can easily copy/paste into the generated skeleton and deploy it once.

How to read flat file header and body separately in Spring Batch

I'm doing a simple batch job with Spring Batch and Spring Boot.
I need to read a flat file, separate the header data (first line) from the body data (the remaining lines) for individual business logic processing, and then write everything into a single file.
As you can see, the header has 5 params that have to be mapped to one class, and the body has 12 which have to be mapped to a different one.
I first thought of using FlatFileItemReader and skipping the header, then using the skippedLinesCallback to handle that line, but I couldn't figure out how to do it.
I'm new to Spring Batch and Java Config. If someone can help me write a solution for my problem, I would really appreciate it!
I leave here the input file:
01.01.2017|SUBDCOBR|12:21:23|01/12/2016|31/12/2016
01.01.2017|12345678231234|0002342434|BORGIA RUBEN|27-32548987-9|FA|A|2062-00010443/444/445|142,12|30/08/2017|142,01
01.01.2017|12345673201234|2342434|ALVAREZ ESTHER|27-32533987-9|FA|A|2062-00010443/444/445|142,12|30/08/2017|142,02
01.01.2017|12345673201234|0002342434|LOPEZ LUCRECIA|27-32553387-9|FA|A|2062-00010443/444/445|142,12|30/08/2017|142,12
01.01.2017|12345672301234|0002342434|SILVA JESUS|27-32558657-9|NC|A|2062-00010443|142,12|30/08/2017|142,12
Cheers!
EDIT 1:
This is my first attempt. My "body" POJO is called DetalleFacturacion and my "header" POJO is CabeceraFacturacion. I planned to build the reader around the DetalleFacturacion POJO, so I can skip the header and treat it later. However, I'm not sure how to assign the header's data to CabeceraFacturacion.
public FlatFileItemReader<DetalleFacturacion> readerDetalleFacturacion() {
    FlatFileItemReader<DetalleFacturacion> reader = new FlatFileItemReader<>();
    reader.setLinesToSkip(1);
    reader.setResource(new ClassPathResource("/inputFiles/GLEO-MN170100-PROCESO01-SUBDFACT-000001.txt"));
    DefaultLineMapper<DetalleFacturacion> detalleLineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer tokenizerDet = new DelimitedLineTokenizer("|");
    tokenizerDet.setNames(new String[] {"fechaEmision", "tipoDocumento", "letra", "nroComprobante",
        "nroCliente", "razonSocial", "cuit", "montoNetoGP", "montoNetoG3",
        "montoExento", "impuestos", "montoTotal"});
    LineCallbackHandler skippedLineCallback = new LineCallbackHandler() {
        @Override
        public void handleLine(String line) {
            // String.split takes a regex, so the pipe has to be escaped.
            String[] headerSeparado = line.split("\\|");
            String printDate = headerSeparado[0];
            String reportIdentifier = headerSeparado[1];
            String tituloReporte = headerSeparado[2];
            String fechaDesde = headerSeparado[3];
            String fechaHasta = headerSeparado[4];
            CabeceraFacturacion cabeceraFacturacion = new CabeceraFacturacion();
            cabeceraFacturacion.setPrintDate(printDate);
            cabeceraFacturacion.setReportIdentifier(reportIdentifier);
            cabeceraFacturacion.setTituloReporte(tituloReporte);
            cabeceraFacturacion.setFechaDesde(fechaDesde);
            cabeceraFacturacion.setFechaHasta(fechaHasta);
        }
    };
    reader.setSkippedLinesCallback(skippedLineCallback);
    detalleLineMapper.setLineTokenizer(tokenizerDet);
    detalleLineMapper.setFieldSetMapper(new DetalleFieldSetMapper());
    detalleLineMapper.afterPropertiesSet();
    reader.setLineMapper(detalleLineMapper);
    // Test to check if it is saving correctly data in CabeceraFacturacion
    CabeceraFacturacion cabeceraFacturacion = new CabeceraFacturacion();
    System.out.println("Print Date:" + cabeceraFacturacion.getPrintDate());
    System.out.println("Report Identif:" + cabeceraFacturacion.getReportIdentifier());
    return reader;
}
You are correct. You need to use skippedLinesCallback to handle the skipped lines.
You need to implement the LineCallbackHandler interface and add your processing in the handleLine method.
The LineCallbackHandler interface is passed the raw content of each line in the file that is skipped. If linesToSkip is set to 2, the handler is called twice.
This is how you can define a reader for this.
Java Config - Spring Batch 4
@Bean
public FlatFileItemReader<POJO> myReader() {
    return new FlatFileItemReaderBuilder<POJO>()
        .name("myReader")
        .resource(new FileSystemResource("resources/players.csv"))
        .linesToSkip(1) // the callback is only invoked for skipped lines
        .skippedLinesCallback(skippedLinesCallback)
        .delimited()
        .delimiter(",")
        .names(new String[] {"pro1", "pro2", "pro3"})
        .targetType(POJO.class)
        .build();
}
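The remaining difficulty in the question is getting the header parsed by the callback to a place where later code can see it; creating a fresh object afterwards will always be empty. The sketch below illustrates that pattern in plain Python rather than Spring Batch: a holder object owns the callback and keeps the parsed header. HeaderHolder, read_items, and the Header field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Header:
    print_date: str
    report_identifier: str
    report_title: str
    date_from: str
    date_to: str


class HeaderHolder:
    """Hypothetical shared holder: the callback writes to it, later steps read from it."""

    def __init__(self) -> None:
        self.header: Optional[Header] = None

    def handle_line(self, line: str) -> None:
        # Plays the role of LineCallbackHandler.handleLine: receives the raw skipped line.
        self.header = Header(*line.split("|"))


def read_items(
    lines: Iterable[str],
    lines_to_skip: int,
    skipped_lines_callback: Callable[[str], None],
) -> Iterator[List[str]]:
    """Pass the first lines_to_skip lines to the callback and tokenize the rest."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index < lines_to_skip:
            skipped_lines_callback(line)
        else:
            yield line.split("|")


sample = [
    "01.01.2017|SUBDCOBR|12:21:23|01/12/2016|31/12/2016\n",
    "01.01.2017|12345678231234|0002342434|BORGIA RUBEN|27-32548987-9|FA|A\n",
]
holder = HeaderHolder()
body = list(read_items(sample, 1, holder.handle_line))
```

In a Spring configuration the same idea would mean making the holder a shared bean (or storing the header in the step's execution context), so that the writer reads the instance the callback populated instead of creating a new one.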

How to open/stream .zip files through Spark?

I have zip files that I would like to open 'through' Spark. I can open .gzip files with no problem because of Hadoop's native codec support, but am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.
There was no solution with Python code, and I recently had to read zips in PySpark. While searching for how to do that I came across this question, so hopefully this will help others.
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the code above I return a dictionary with each file name in the zip as the key and that file's raw content (bytes) as the value. You can change it however you want to suit your purposes.
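Because the extraction function only depends on a (path, bytes) pair, it can be checked locally without a cluster. The following sketch assumes a small in-memory archive; make_zip and the sample path are hypothetical helpers for the test, and zip_extract is restated here so the block runs on its own.

```python
import io
import zipfile


def zip_extract(pair):
    """Same logic as the answer: (path, bytes) -> {entry name: entry bytes}."""
    _path, content = pair
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def make_zip(entries):
    """Hypothetical test helper: build a zip archive in memory from a dict."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Mimic one element of sc.binaryFiles(...) after the content has been read.
fake_record = ("hdfs:/Testing/sample.zip", make_zip({"a.txt": b"one\n", "b.txt": b"two\n"}))
extracted = zip_extract(fake_record)
```

The values are bytes; call .decode("utf-8") on them (or whatever encoding the files use) if text is needed downstream.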
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipFileRecordReader class, I was able to get Spark to open and read the zip file without problems.
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as the key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
        // Text.getBytes() returns the backing array, which can be longer than the
        // valid data, so limit the stream to getLength() bytes.
        Text value = textTextTuple2._2;
        InputStream is = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            // Key each line by its first tab-separated field.
            Tuple2<String, String> newTuple = new Tuple2<String, String>(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});
Please try the code below:
using API sparkContext.newAPIHadoopRDD(
hadoopConf,
InputFormat.class,
ImmutableBytesWritable.class, Result.class)
I've had a similar issue and solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    Stream.continually(zipInputStream.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { zipEntry => ??? }
  }
This answer only collects the previous knowledge and shares my experience.
ZipFileInputFormat
I tried following the answers from @Tinku and @JeffLL, using the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. I also do not know how I would get the com-cotdp-hadoop library onto my production cluster, since I am not responsible for its setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer, and I struggled for quite some time to actually get the decompressed output.
By the time I was able to do so, I had worked through all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream
import java.io.BufferedReader
import java.io.InputStreamReader
import org.apache.spark.input.PortableDataStream

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        // The stream is positioned at the current entry, so read its lines here.
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
using API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class)
The file name should be passed via the configuration:
conf=( new Job().getConfiguration())
conf.set(PROPERTY_NAME from your input formatter,"Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Look up PROPERTY_NAME in your input format; it is the property that format reads the path from.
Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")

Access hdfs file from udf

I'd like to access a file from my UDF call. This is my script:
files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};
The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems that the file is not found, maybe because I'm trying to access the file on HDFS.
How can I access a file in HDFS from my Java UDF call?
Inside an EvalFunc you can get a file from the HDFS via:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
....
You might also consider putting the files into the distributed cache, in that case you have to override getCacheFiles() in your EvalFunc class.
E.g:
@Override
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(2);
    list.add("/cache/pig/wordlist1.txt#w1");
    list.add("/cache/pig/wordlist2.txt#w2");
    return list;
}
then you can just pass the symlinks of the files (w1 and w2) in order to get them from
the local file system of each of the worker nodes:
BufferedReader br = new BufferedReader(new FileReader(fileName));

How to design my mapper?

I have to write a MapReduce job, but I don't know how to go about it.
I have a jar, MARD.jar, through which I can instantiate MARD objects.
On such an object I call the normaliseFile method, i.e. mard.normaliseFile(bunch of arguments).
This in turn creates certain output files.
For the normalise method to run, it needs a folder called myMard in the working directory.
So I thought that I would give the myMard folder as the input path to the Hadoop job, but I'm not sure that would help: mard.normaliseFile(bunch of arguments) will search for the myMard folder in the working directory and will not find it, because (this is what I think) the Mapper can only access the content of files through the "values" obtained from the FileSplit and has no direct access to the files in the myMard folder.
In short, I have to execute the following code through MapReduce:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
MARD mard = new MARD(setupFolder);
Text valuz = new Text();
IntWritable intval = new IntWritable();
File original = new File("Vca1652.txt");
File mardedxml = new File("Vca1652-mardedxml.txt");
File marded = new File("Vca1652-marded.txt");
mardedxml.createNewFile();
marded.createNewFile();
NormalisationStats stats;
try {
stats = mard.normaliseFile(original,mardedxml,marded,50.0);
// This method requires access to the myMard folder
System.out.println(stats);
} catch (MARDException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Please help
