Spring Batch: Best Architecture to Read XML

What is the best-performing architecture for reading XML in Spring Batch? Each XML file is approximately 300 KB in size and we are processing 1 million of them.
Our current approach:
30 partitions and a grid size of 30; each slave gets 166 XMLs
Commit chunk of 100
Application start memory is 8 GB
Using JAXB in the reader (default bean scope)
@Bean // assumed: this is a bean definition method; only the annotations below appeared in the original post
@StepScope
@Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
        @Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
    String readingFile = "File Not Found";
    logger.info("----StaxEventItemReader----fileName--->" + fileNameList.toString());
    List<BaseDTO> fileList = new ArrayList<BaseDTO>();
    for (String filePath : fileNameList) {
        try {
            readingFile = filePath.trim();
            Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(filePath));
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
            unifiedDTO.setFileName(filePath);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        } catch (Exception e) {
            // On unmarshalling failure, still emit an item carrying the file name and the error
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
            unifiedDTO.setFileName(readingFile);
            unifiedDTO.setErrorMessage(e);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        }
    }
    return new IteratorItemReader<>(fileList);
}
Our questions:
Is this architecture correct?
Is there any performance or architecture advantage to using StaxEventItemReader and XStreamMarshaller over JAXB?
How do we manage memory properly to avoid slowdowns?

I would create a job per XML file, using the file name as a job parameter. This approach has many benefits:
Restartability: if a job fails, you only restart the failed file (from where it left off)
Scalability: this approach allows you to run multiple jobs in parallel; if a single machine is not enough, you can distribute the load over multiple machines
Logging: logs are separate by design, so you don't need to use an MDC or any other technique to separate logs
We are receiving the XML file paths in a *.txt file
You can create a script that iterates over these lines and launches a job per line (i.e. per file). GNU Parallel (or a similar tool) is a good option to launch jobs in parallel.
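For illustration, here is a minimal in-JVM alternative to such a script: one Spring Batch job execution per file listed in the *.txt file, with the file name as the identifying job parameter. This is a hedged sketch only; the jobLauncher and invoiceJob beans and the path to files.txt are assumptions, not part of the original question.
import java.nio.file.Files;
import java.nio.file.Paths;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

// Hypothetical launcher: one restartable job instance per XML file
public void launchJobPerFile(JobLauncher jobLauncher, Job invoiceJob) throws Exception {
    for (String path : Files.readAllLines(Paths.get("/data/input/files.txt"))) {
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", path.trim()) // the file name is the identifying job parameter
                .toJobParameters();
        jobLauncher.run(invoiceJob, params);
    }
}
If one of the files fails, only that job instance needs to be restarted.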

Related

Files are overwriting instead of appending in BeanShell PostProcessor in JMeter while running 1000 threads

I have written the code below in a BeanShell PostProcessor. But when I run 1000 threads, the files overwrite existing content instead of appending. It works for 1-5 threads. Can anyone help me with this?
import org.apache.commons.io.FileUtils;
import java.util.ArrayList;
import java.util.Collections;

File fExceptionLog = new File("${logPath}/ExceptionLog.txt");
String extExceptionData = FileUtils.readFileToString(fExceptionLog);
id = vars.get("id");
try {
    String cDatestamp = "${__time(yyyyMMddHHmmssSSS)}";
    String cResponce = prev.getResponseDataAsString();
    String cRequest = prev.getQueryString();
    String cResponceCode = prev.getResponseCode();
    cTransactionName = prev.getSampleLabel();
    cResponseTime = prev.getTime();
    cSize = prev.getBytesAsLong();
    cIsSuccessful = prev.isSuccessful();
    File fRequestLog = new File("${logPath}/RequestLog.txt");
    File fHitLog = new File("${logPath}/HitLog.txt");
    File fResponceLog = new File("${logPath}/ResponceLog.txt");
    File fErrorLog = new File("${logPath}/ErrorLog.txt");
    String extHitData = FileUtils.readFileToString(fHitLog);
    String extRequestData = FileUtils.readFileToString(fRequestLog);
    String extResponceData = FileUtils.readFileToString(fResponceLog);
    String extErrorData = FileUtils.readFileToString(fErrorLog);
    log.info("cResponceCode" + cResponceCode);
    FileUtils.writeStringToFile(fHitLog, extHitData + id + "~" + cDatestamp + "~" + cTransactionName + "~" + cResponceCode + "~" + cResponseTime + "~" + cSize + "~" + cIsSuccessful + "\n");
    if (cResponceCode.equals("200")) {
        FileUtils.writeStringToFile(fRequestLog, extRequestData + id + "~" + cDatestamp + "~" + cTransactionName + "~" + cResponce + "\n");
        FileUtils.writeStringToFile(fResponceLog, extResponceData + id + "~" + cDatestamp + "~" + cResponceCode + "~" + cResponce + "\n");
    } else {
        FileUtils.writeStringToFile(fErrorLog, extErrorData + id + "~" + cDatestamp + "~" + cTransactionName + "~" + cResponce + "\n" + id + "~" + cDatestamp + "~" + cResponceCode + "~" + cResponce + "\n");
    }
} catch (Exception e) {
    FileUtils.writeStringToFile(fExceptionLog, extExceptionData + id + "~" + cDatestamp + "~" + cTransactionName + "~" + e + "\n");
}
You're violating at least 3 JMeter Best Practices:
You're referring to JMeter variables like ${logPath} when you should be using the vars shorthand instead, like vars.get("logPath")
You're using Beanshell, while starting from JMeter 3.1 you should be using JSR223 test elements and Groovy
And last but not least, you introduced a race condition yourself: when several threads write to the same file concurrently, data will be lost. You could put this Beanshell test element (along with the parent sampler(s)) under a Critical Section Controller, but that reduces concurrency of the parent sampler(s) to one at a time.
If you need to write metrics to a custom file in your own format, I would rather recommend migrating to the Flexible File Writer, which is extremely "flexible" with regard to which values to store; it accumulates multiple entries in memory and flushes them periodically in batches, so all data is stored without collisions.
You can install the Flexible File Writer using the JMeter Plugins Manager.
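If you still prefer a scripted approach, a minimal JSR223 PostProcessor sketch (Groovy accepts plain Java syntax) could look like the following; it assumes the JMeter variables logPath and id exist, opens the file in append mode instead of read-modify-write, and serializes writes across threads with a JVM-wide lock:
// Hedged sketch for a JSR223 PostProcessor; "logPath", "id" and the file name are assumptions
String logPath = vars.get("logPath");
String id = vars.get("id");
String line = id + "~" + prev.getSampleLabel() + "~" + prev.getResponseCode() + "~" + prev.getTime() + "~" + prev.isSuccessful() + "\n";
// Appending (second FileWriter argument = true) under a shared lock avoids the lost-update race
synchronized (org.apache.jmeter.util.JMeterUtils.class) {
    java.io.FileWriter fw = new java.io.FileWriter(new java.io.File(logPath, "HitLog.txt"), true);
    fw.write(line);
    fw.close();
}
Even so, the Flexible File Writer remains the simpler and faster option under high concurrency.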

Bulk loading with LoadIncrementalHFiles and subdirectories

I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are split into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:
/user/myuser/map_data/hfiles_0
... /hfiles_1
... /hfiles_2
... /hfiles_3
...
There are about 500 of these in the map_data directory, so I'm looking for a way to call the LoadIncrementalHFiles function automatically and process these subdirectories in iterations as well.
The corresponding command would be this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable
I need to change this into an iterative command, as this command does not work with subdirectories (when I call it with the /user/myuser/map_data directory)!
I tried to use a Java Process instance to execute the command above automatically, but it doesn't seem to do anything (no output to the console and no new rows in my HBase table).
Using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class from my code also doesn't work; it's not responding either!
Does anybody have a working example for me? Or is there a parameter that allows running the above hbase command on the parent directory? I'm working with HBase 1.1.2 on a Hortonworks Data Platform 2.5 cluster.
EDIT: I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I'm getting an exception related to snappy compression, see Run LoadIncrementalHFiles from Java client.
The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into its individual parts (one array element per argument), see this Java code snippet:
TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);
for (String hFileDir : subDirs) {
    try {
        String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
        String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles", "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
        ProcessBuilder pb = new ProcessBuilder(execCode);
        pb.redirectErrorStream(true);
        final Process p = pb.start();
        // Write the output of the process to the console
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null;
                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
        // Wait for the end of the execution
        p.waitFor();
        ...
}
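The getHFileDirectories helper is not shown in the original snippet; a plausible sketch, assuming it simply collects the names of the immediate subdirectories under the given HDFS path, could look like this:
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper (not from the original post)
private static TreeSet<String> getHFileDirectories(Path parent, Configuration conf) throws IOException {
    TreeSet<String> dirs = new TreeSet<String>();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(parent)) {
        if (status.isDirectory()) {
            dirs.add(status.getPath().getName()); // e.g. "hfiles_0", "hfiles_1", ...
        }
    }
    return dirs;
}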

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution has started in Spark Streaming, it executes only one batch and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with multiple executors, cores, back pressure and other settings, but nothing has worked so far. There are a lot of messages queued, only one micro batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism, so that no micro batch is queued, as we have enough resources available. So how can we reduce time through maximum utilization of resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
                sparkServiceConf.getKafkaConsumeParams()));

ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());

JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});

// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
                 Writer writer = new StringWriter()) {
                ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                byte[] buffer = new byte[1024];
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});

// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});

getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technologies that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics from our different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize processing so that no batches are queued.
I was facing the same issue, tried a few things to resolve it, and came to the following findings:
First of all, intuition says that one batch should be processed per executor, but on the contrary only one batch is processed at a time, while the jobs and tasks within it are processed in parallel.
Multiple batch processing can be achieved with spark.streaming.concurrentJobs, but it's not documented and still needs a few fixes. One of the problems is with saving Kafka offsets: suppose we set this parameter to 4 and 4 batches are processed in parallel; if the 3rd batch finishes before the 4th one, which Kafka offsets should be committed? This parameter is quite useful if batches are independent (see the hedged configuration sketch below this answer).
spark.default.parallelism, because of its name, is sometimes assumed to make things parallel. But its real benefit is in distributed shuffle operations. Try different numbers and find an optimum for your jobs; you can see a considerable difference in processing time. It depends on the shuffle operations in your jobs, and setting it too high decreases performance, as is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, as foreachPartitionAsync would queue up the jobs, so batches appear to be processed while their jobs are still queued or in processing. Maybe I didn't get its usage right, but it behaved the same in my 3 services.
FAIR spark.scheduler.mode should be used for jobs with lots of tasks, as its round-robin assignment of tasks to jobs gives smaller tasks the opportunity to start receiving resources while bigger ones are still processing.
Try to tune your batch duration and input size and always keep the processing time below the batch duration; otherwise you are going to see a long backlog of batches.
These are my findings and suggestions; however, there are so many configurations and methods for streaming that often one set of settings doesn't work for others. Spark Streaming is all about learning and putting your experience and anticipation together to get to an optimum configuration.
Hope it helps. It would be a great relief if someone could tell us specifically how to legitimately process batches in parallel.
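For illustration only, the settings discussed above could be wired together roughly like this. The property names are real Spark settings, but the values and the batch interval are assumptions you would have to tune, and spark.streaming.concurrentJobs remains undocumented:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Hedged sketch: concurrent batch jobs, FAIR scheduling and a tuned shuffle parallelism
SparkConf conf = new SparkConf()
        .setAppName("StreamingParallelismSketch")
        .set("spark.streaming.concurrentJobs", "4")   // undocumented; only reasonable if batches are independent
        .set("spark.scheduler.mode", "FAIR")          // round-robin resource assignment across jobs
        .set("spark.default.parallelism", "12");      // tune to your shuffle workload
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10)); // batch interval is an assumption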
We want to achieve maximum parallelism, so that no micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will queue up. Also, don't expect processing of one record to suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36 s to process 2 records and 70 s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, those two points give roughly 31.5 s of fixed overhead plus about 2.27 s per record, so it would take only about 258 s (4:18) to process all 100 records in a single mini-batch, thus beating your record holder.
Since your code is not complete, it's hard to tell what exactly takes so much time. The transformations in the code look fine, but probably the action (or subsequent transformations) is the real bottleneck. Also, what about that producer.flush(), which isn't mentioned anywhere else in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
// Requires: import scala.concurrent.Future, scala.concurrent.ExecutionContext.Implicits.global, scala.util.{Success, Failure}
messages.foreachRDD { rdd =>
  val f = Future {
    // sleep(100)
    val newRDD = rdd.map { message =>
      val req_message = message.value()
      (message.value())
    }
    println("Request messages: " + newRDD.count())
    var resultrows = newRDD.collect() //.collectAsList()
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1
  }
  f.onComplete {
    case Success(messages) => println("yay!")
    case Failure(exception) => println("On no!")
  }
}
It's hard to tell without having all the details, but general advice for tackling issues like this: start with a very simple, "Hello world"-style application. Just read from the input stream and print the data to a log file. Once this works, you have proven that the problem is in your application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know the problem is in the configuration or the Spark cluster itself. Hope this helps.

How to safely write to one file from many verticle instances in vert.x 3.2?

Instead of using a logger or a database server, I'd like to append information to one file from possibly many verticle instances.
There are versions of the file-writing methods that work asynchronously.
Can I assume that Vert.x handles the synchronisation between the writes, so that they don't interfere, when using those method versions marked as "async"?
There seems to be a rule that one can rely on Vert.x providing all isolation between concurrent processing out of the box. But is that true in the case of file write access?
Could you please include a code snippet in the answer that shows how to open and write to one file from many verticle instances with the finest possible granularity, e.g. for logging requests?
I wouldn't recommend writing to a single file from many different "writers". For concurrent logging I would stick to the single-writer principle.
Create a Verticle which subscribes to the Event Bus and listens for messages to be logged. Let's call this Verticle Logger; it listens on the address system.logger:
EventBus eb = vertx.eventBus();
eb.consumer("system.logger", message -> {
    // write to file
});
Verticles that want to log something send a message to the Logger Verticle:
eventBus.send("system.logger", "foobar");
Appending to an existing file works something like this (untested):
vertx.fileSystem().open("file.log", new OpenOptions(), result -> {
    if (result.succeeded()) {
        Buffer buff = Buffer.buffer(message); // message from the consumer
        AsyncFile file = result.result();
        // NOTE: 'i' is not defined in this snippet; it stands for the write position,
        // which the single-writer verticle has to track itself
        file.write(buff, buff.length() * i, ar -> {
            if (ar.succeeded()) {
                System.out.println("done");
            } else {
                System.err.println("write failed: " + ar.cause());
            }
        });
    } else {
        System.err.println("open file failed " + result.cause());
    }
});
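Putting the pieces together, a minimal single-writer Logger verticle might look roughly like this (an untested sketch; the class name, log file name and offset handling are assumptions, not from the original answer):
import io.vertx.core.AbstractVerticle;
import io.vertx.core.buffer.Buffer;
import io.vertx.core.file.AsyncFile;
import io.vertx.core.file.OpenOptions;

public class LoggerVerticle extends AbstractVerticle {

    // Next write position; for a pre-existing file you would initialize this with its current size
    // (e.g. via vertx.fileSystem().props())
    private long offset = 0;

    @Override
    public void start() {
        vertx.fileSystem().open("file.log", new OpenOptions().setCreate(true).setWrite(true), result -> {
            if (result.succeeded()) {
                AsyncFile file = result.result();
                // All writes happen on this verticle's event loop, so the offset needs no extra locking
                vertx.eventBus().<String>consumer("system.logger", message -> {
                    Buffer buff = Buffer.buffer(message.body() + "\n");
                    long pos = offset;
                    offset += buff.length();
                    file.write(buff, pos, ar -> {
                        if (ar.failed()) {
                            System.err.println("write failed: " + ar.cause());
                        }
                    });
                });
            } else {
                System.err.println("open file failed " + result.cause());
            }
        });
    }
}
Any other verticle instance then just sends its log line to system.logger over the event bus, as shown above.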

Apache Spark 1.2.1 standalone cluster giving java heap space error

I need information about how to figure out how much heap space (memory) would be needed to operate on x MB (say x is 600 MB) in a Spark standalone cluster.
Scenario:
I have a standalone cluster with 14 GB memory and 8 cores. I want to operate on 600 MB of data (reading data from files and writing it to Cassandra).
For this task I have the following SparkConf settings:
.set("spark.cassandra.output.throughput_mb_per_sec","800")
.set("spark.storage.memoryFraction", "0.3")
And --executor-memory=5g --total-executor-cores 6 --driver-memory 6g at the time of submitting the task.
In spite of the above configuration, I am getting a java heap space error while writing data to Cassandra.
Below is the java code:
public static void main(String[] args) throws Exception {
    String fileName = args[0];
    Long now = new Date().getTime();
    SparkConf conf = new SparkConf(true)
            .setAppName("JavaSparkSQL_" + now)
            .set("spark.cassandra.connection.host", "192.168.1.65")
            .set("spark.cassandra.connection.native.port", "9042")
            .set("spark.cassandra.connection.rpc.port", "9160")
            .set("spark.cassandra.output.throughput_mb_per_sec", "800")
            .set("spark.storage.memoryFraction", "0.3");
    JavaSparkContext ctx = new JavaSparkContext(conf);
    JavaRDD<String> input = ctx.textFile
            ("hdfs://abc.xyz.net:9000/figmd/resources/" + fileName, 12);
    JavaRDD<PlanOfCare> result = input.mapPartitions(new ParseJson()).filter(new PickInputData());
    System.out.print("Count --> " + result.count());
    System.out.println(StringUtils.join(result.collect(), ","));
    javaFunctions(result).writerBuilder("ks", "pt_planofcarelarge",
            mapToRow(PlanOfCare.class)).saveToCassandra();
}
What configuration am I supposed to use? Am I missing anything?
Thanks in advance.
The JavaRDD collect method returns a collection that contains all of the elements of the RDD in the driver's memory.
So in your case, it creates a collection with 340000 elements, which results in a Java heap space error. You may want to take a small sample of your data and collect that, or save the RDD directly to disk instead.
For more information about JavaRDD, you can always refer to the official documentation.
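As a hedged illustration of that suggestion, the two driver-side lines from the question could be adjusted like this (the sample size and output path are assumptions):
System.out.print("Count --> " + result.count());                   // fine: only a number comes back to the driver
System.out.println(StringUtils.join(result.take(10), ","));        // bring back only a small sample instead of collect()
result.saveAsTextFile("hdfs://abc.xyz.net:9000/figmd/output/" + fileName); // or persist the full result to HDFS rather than to driver memory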
