Saving intermediate result in Spark - hadoop

I am building a processing pipeline with Spark SQL 1.6.0. The pipeline consists of steps/transformations, and the output of one step is forwarded to the next one. After the last step, the resulting DataFrame is saved to HDFS. I also need to save the result at some intermediate steps. The code that does this is:
saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, previousDataFrame, sqlContext)
previousDataFrame
Here, previousDataFrame is the result of the previous step, and saveDataFrame simply saves the DataFrame at the given location; previousDataFrame is then used by the next steps/transformations, and after the last step it is saved to HDFS. The code for saveDataFrame is:
implicit def saveDataFrame(path: String, format: String, isCoalesce: Boolean, saveMode: SaveMode, dataFrame: DataFrame, sqlContext: SQLContext): Unit = {
  val source = if (isCoalesce) dataFrame.coalesce(1) else dataFrame
  if (format.equalsIgnoreCase("csv")) {
    source
      .write
      .mode(saveMode)
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .save(path)
  }
  else if (format.equalsIgnoreCase("parquet") || format.equalsIgnoreCase("json")) {
    source
      .write
      .mode(SaveMode.Overwrite)
      .format(format)
      .save(path)
  }
  else {
    throw new Exception("%s input format is not supported".format(format))
  }
}
This works, but the Spark application takes far longer than usual. Without saving the intermediate output the application runs in 20 minutes; with this code it takes 1 hour. The jobs and tasks complete in 20 minutes according to the Spark UI, yet the spark-submit process keeps running until the 1-hour mark.
Please help me figure out the cause. I have also tried the following two possible approaches:
Using a Future to call saveDataFrame on a separate thread.
Caching previousDataFrame before saving it and reusing it in the next step (see the sketch just below).
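For reference, a minimal sketch of the caching variant, assuming the step has already produced previousDataFrame and the flush* values come from the pipeline configuration (nextStep is a hypothetical placeholder for the following transformation):
import org.apache.spark.storage.StorageLevel

// cache once so the intermediate save and the next step reuse the same computed result
val cached = previousDataFrame.persist(StorageLevel.MEMORY_AND_DISK)
saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, cached, sqlContext)

// feed the cached DataFrame into the next transformation instead of recomputing its lineage
val nextDataFrame = nextStep(cached)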

The issue was the AWS S3 output path, which was causing the delay in execution. Once I started saving the output to HDFS instead, the execution time dropped back to normal.
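In other words, only the target URI passed to saveDataFrame had to change; a hypothetical example (both paths are placeholders):
// slow: intermediate flush written to S3
saveDataFrame("s3a://my-bucket/pipeline/step1", "parquet", isCoalesce, flushMode, previousDataFrame, sqlContext)

// fast: intermediate flush written to HDFS
saveDataFrame("hdfs:///tmp/pipeline/step1", "parquet", isCoalesce, flushMode, previousDataFrame, sqlContext)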

Related

Flink Hadoop Bucketing Sink performances with many parallel buckets

I'm investigating the performance of a Flink job that transports data from Kafka to an S3 sink.
We are using a BucketingSink to write Parquet files. The bucketing logic puts messages into a folder per type of data, tenant (customer), date-time, extraction ID, and so on, so each file ends up in a folder structure that is 9-10 levels deep (s3_bucket:/1/2/3/4/5/6/7/8/9/myFile...).
If the data arrives as bursts of messages per tenant and type, we see good write performance, but when the data is closer to white noise spread over thousands of tenants, dozens of data types and multiple extraction IDs, we see an enormous loss of performance (on the order of 300x).
Attaching a debugger, the issue seems to be connected to the number of handles open at the same time on S3 for writing data.
Researching the Hadoop libraries used to write to S3, I found some settings that might help:
fs.s3a.connection.maximum
fs.s3a.threads.max
fs.s3a.threads.core
fs.s3a.max.total.tasks
But none of these made a big difference in throughput.
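These properties are normally set in core-site.xml; purely as an illustration (with made-up values), the equivalent programmatic tuning on a Hadoop Configuration would look roughly like this:
import org.apache.hadoop.conf.Configuration

// illustrative values only; the right numbers depend on the cluster and workload
val s3Conf = new Configuration()
s3Conf.setInt("fs.s3a.connection.maximum", 200)
s3Conf.setInt("fs.s3a.threads.max", 64)
s3Conf.setInt("fs.s3a.threads.core", 32)
s3Conf.setInt("fs.s3a.max.total.tasks", 128)
// such a configuration can be handed to the BucketingSink via setFSConfig(s3Conf)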
I also tried flattening the folder structure and writing to a single key (e.g. 1_2_3_...), but this didn't bring any improvement either.
Note: The tests have been done on Flink 1.8 with the Hadoop FileSystem (BucketingSink), writing to S3 using the hadoop fs libraries 2.6.x (as we use Cloudera CDH 5.x for savepoints), so we can't switch to StreamingFileSink.
Following the suggestion from Kostas in https://lists.apache.org/thread.html/50ef4d26a1af408df8d9abb70589699cb6b26b2600ab6f4464e86ea4%40%3Cdev.flink.apache.org%3E
the culprit of the slow-down turned out to be this piece of code:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L543-L551
This alone takes around 4-5 secs, with a total of 6 secs to open the file. Logs from an instrumented call:
2020-02-07 08:51:05,825 INFO BucketingSink - openNewPartFile FS verification
2020-02-07 08:51:09,906 INFO BucketingSink - openNewPartFile FS verification - done
2020-02-07 08:51:11,181 INFO BucketingSink - openNewPartFile FS - completed partPath = s3a://....
Together with the bucketing sink's default setup of a 60-second inactivity rollover
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L195
this means that with more than 10 parallel buckets on a slot, by the time we finish creating the last bucket the first one has already become stale and needs to be rotated, creating a blocking situation (at ~6 seconds per file open, 10 buckets already eat up the whole 60-second window).
We solved this by patching BucketingSink.java and deleting the FS check mentioned above:
LOG.debug("Opening new part file FS verification");
if (!fs.exists(bucketPath)) {
    try {
        if (fs.mkdirs(bucketPath)) {
            LOG.debug("Created new bucket directory: {}", bucketPath);
        }
    }
    catch (IOException e) {
        throw new RuntimeException("Could not create new bucket path.", e);
    }
}
LOG.debug("Opening new part file FS verification - done");
since we see that the sink works fine without it; opening a file now takes ~1.2 seconds.
Moreover, we raised the default inactivity threshold to 5 minutes. With these changes we can easily handle more than 200 buckets per slot (once the job gets up to speed it ingests on all slots, which effectively postpones the inactivity timeout).
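A rough sketch of the relaxed rollover configuration (assuming a String stream and a placeholder base path; times are in milliseconds):
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink

val sink = new BucketingSink[String]("s3a://my-bucket/output")
  .setInactiveBucketCheckInterval(60 * 1000L)   // how often inactive buckets are scanned
  .setInactiveBucketThreshold(5 * 60 * 1000L)   // roll a bucket only after 5 minutes of inactivity

// stream.addSink(sink)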

How to get the flowfile count in a NiFi queue?

I have a standalone NiFi flow like:
executestreamprocessor(hive script) -> executestreamprocessor(hadoop script).
For each incoming flowfile, the Hive script runs an INSERT..INTO..SELECT..FROM command, and the Hadoop script deletes the corresponding files from the storage area.
Sometimes the Hive script fails when the Hadoop script runs its command at the same time.
I receive at most 4 files per hour, so I planned to put a ControlRate processor between the Hive and Hadoop processors and run the Hadoop script only once the queue holds 4 flowfiles. However, ControlRate only has a property for a maximum rate; there is no minimum rate.
Is there any way to achieve this, or is there another solution?
You should be able to use ExecuteScript for this, try this Groovy script:
def flowFiles = session.get(4)
if(!flowFiles || flowFiles.size() < 4) {
    session.rollback()
} else {
    session.transfer(flowFiles, REL_SUCCESS)
}
If you only want to trigger the downstream flow once, then you can create a child flow file from the parents (and report a JOIN provenance event):
def flowFiles = session.get(4)
if(!flowFiles || flowFiles.size() < 4) {
    session.rollback()
} else {
    def flowFile = session.create(flowFiles)
    session.provenanceReporter.join(flowFiles, flowFile)
    session.remove(flowFiles)
    session.transfer(flowFile, REL_SUCCESS)
}
Having said that, if you don't care about the flow file contents (i.e. you are using a flow file as a trigger), you could use MergeContent with a Minimum and Maximum Number Of Entries = 4.

Filestreams limitations in Spark Streaming

I need to develop a streaming application which reads session logs from several sources.
The batch interval would be around 5 minutes.
The problem is that the files I get in each batch vary enormously in size: in one batch I may get a file of about 10 MB, and in another batch files of around 20 GB.
Is there any approach to handle this? Is there any limit on the size of the RDDs a file stream can generate for each batch?
Can I make Spark Streaming read only a fixed amount of data into the RDD in each batch?
As far as I know there is no direct way to limit that. Which files are picked up is controlled by the private isNewFile function in FileInputDStream. Based on that code I can think of one workaround:
Use the filter function to limit the number of files to be read: return false for any file beyond the first 10, and touch those files so their timestamp is updated and they are considered in the next window. For example:
import org.apache.hadoop.fs.Path

var globalCounter = 10

val filterF: Path => Boolean = (file: Path) => {
  globalCounter -= 1
  if (globalCounter >= 0) {
    true  // consider only the first 10 files of the batch
  } else {
    // touch the remaining files so their timestamp is updated and they are
    // picked up in the next window
    false
  }
}
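A hypothetical way to wire this filter into the stream (sc is an existing SparkContext; the directory, the 5-minute batch interval and the text input types are placeholders):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(300))
// only the files accepted by filterF are turned into the batch's RDD
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming/session-logs", filterF, newFilesOnly = true)
  .map(_._2.toString)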

Create Snapshot of FS from Spark Job

I would like to create a snapshot of the underlying HDFS while running a Spark job. The particular step involves deleting the contents of some Parquet files. I want to create a snapshot, perform the delete operation, verify the results, and then proceed with the next steps.
However, I am unable to find a good way to access the HDFS API from my Spark job. The directory I want to snapshot is tagged/marked as snapshottable in HDFS. Creating the snapshot from the command line works, but I need to do it programmatically.
I am running Spark 1.5 on CDH 5.5.
Any hints or clues as to how I can perform this operation?
Thanks
Ramdev
I have not verified this, but at least I do not get compile errors, and in theory this solution should work. This is Scala code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

val sc = new SparkContext()
val fs = FileSystem.get(sc.hadoopConfiguration)
val snapshotDir = new Path("/path/to/create/snapshot/of")  // must be marked snapshottable
// createSnapshot returns the path of the snapshot that was created
val snapshotPath = fs.createSnapshot(snapshotDir, "snapshotName")
// ... perform the delete and verification steps here ...
if (conditionSatisfied) {  // conditionSatisfied comes from the verification step above
  // note: deleteSnapshot takes the snapshottable directory, not the returned snapshot path
  fs.deleteSnapshot(snapshotDir, "snapshotName")
}
I assume this will work in theory.

Why does Hadoop take so long to start with a large number of S3 paths?

I have a Hadoop job that has ~60k S3 input paths. This job takes about 45 minutes to start. The same job, with only ~3k S3 input paths starts almost instantly.
Why does having a large number of input paths cause the job to take so long to start?
The answer has to do with how FileInputFormat.addInputPath(...) is implemented. If you look at the source below, you'll see that it's actually doing a string concatenation to append every path to a single configuration entry. Calling addInputPaths(...) just calls addInputPath, so there are no savings there. I ended up calling FileInputFormat.setInputPaths(Job, Path[]), which skips the 60k+ string concatenations by building that part of the configuration once.
As climbage mentioned, there still need to be 60k+ calls to S3 to build the splits. It turns out that the S3 calls were taking less time than the string concatenation: my jobs went from taking 45 minutes to start down to less than 20.
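As a rough Scala sketch of the single-call variant (inputPathStrings is a placeholder for however the ~60k S3 URIs are collected):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()
// inputPathStrings: a Seq[String] of the S3 input URIs (placeholder)
val paths = inputPathStrings.map(new Path(_)).toArray
// one configuration write instead of ~60k incremental string concatenations
FileInputFormat.setInputPaths(job, paths: _*)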
For those who don't want to go combing through the source, here's the implementation of FileInputFormat.addInputPath() in Hadoop 2.5.1:
public static void addInputPath(Job job,
                                Path path) throws IOException {
  Configuration conf = job.getConfiguration();
  path = path.getFileSystem(conf).makeQualified(path);
  String dirStr = StringUtils.escapeString(path.toString());
  String dirs = conf.get(INPUT_DIR);
  conf.set(INPUT_DIR, dirs == null ? dirStr : dirs + "," + dirStr);
}
and FileInputFormat.setInputPaths() in Hadoop 2.5.1:
public static void setInputPaths(Job job,
                                 Path... inputPaths) throws IOException {
  Configuration conf = job.getConfiguration();
  Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
  StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
  for (int i = 1; i < inputPaths.length; i++) {
    str.append(StringUtils.COMMA_STR);
    path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
    str.append(StringUtils.escapeString(path.toString()));
  }
  conf.set(INPUT_DIR, str.toString());
}
One of the first things that FileInputFormat does during MapReduce initialization is determine the input splits. It does this by building a list of every input file and its metadata (such as file size). I imagine that 60k API calls to S3 for file information are not fast. Still, 45 minutes seems extraordinarily slow; there may be some rate limiting going on as well.
Sorry for reopening an old question, but I recently came across a similar issue.
The core of it is that, in your case, Hadoop will make 60K calls to AWS.
To work around this you can use wildcards:
FileInputFormat.addInputPath(job, new Path("path_to_a_folder/prefix*"))
This generates only one AWS call to list the directory path_to_a_folder; the results are then filtered by the prefix.
I hope this helps whoever finds this question.

Resources