File stream limitations in Spark Streaming - spark-streaming

I need to develop a streaming application that reads session logs from several sources.
The batch interval would be around 5 minutes.
The problem is that the files I get in each batch vary enormously in size. In one batch I may get a file of 10 MB, and in another batch I may get files of around 20 GB.
I want to know if there is any approach to handle this. Is there any limit on the size of the RDDs a file stream can generate for each batch?
Can I limit Spark Streaming to read just a fixed amount of data into the RDD in each batch?

As far as I know, there is no direct way to limit that. Which files are considered is controlled by the private isNewFile function in the file stream implementation (FileInputDStream). Based on the code, I can think of one workaround.
Use the filter function to limit the number of files that are read. For any file beyond the first 10, return false and use the touch command to update the file's timestamp so that it is considered in the next window.
import org.apache.hadoop.fs.Path

var globalCounter = 10 // assumed budget; must be reset to 10 before each batch
val filterF = new Function[Path, Boolean] {
  def apply(file: Path): Boolean = {
    globalCounter -= 1
    if (globalCounter >= 0) {
      true // consider only 10 files
    } else {
      // touch the file here so that its timestamp is updated
      false
    }
  }
}
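The counting logic of this workaround can also be tried out without Spark. The following is only a sketch in plain Java: FileBudgetFilter and startNextBatch are hypothetical names, paths are plain strings rather than Hadoop Path objects, and how often Spark actually invokes the filter per file is not modelled. It accepts a fixed number of files per batch and remembers the rest so that the caller can touch them afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: accepts at most maxFilesPerBatch paths per batch
// and records the rejected ones so they can be touched and retried later.
final class FileBudgetFilter implements Predicate<String> {
    private final int maxFilesPerBatch;
    private final List<String> deferred = new ArrayList<>();
    private int accepted = 0;

    FileBudgetFilter(int maxFilesPerBatch) {
        this.maxFilesPerBatch = maxFilesPerBatch;
    }

    @Override
    public synchronized boolean test(String path) {
        if (accepted < maxFilesPerBatch) {
            accepted++;
            return true; // still within the budget for this batch
        }
        deferred.add(path); // over budget: remember it for the next window
        return false;
    }

    // Resets the budget and returns the paths rejected in the previous batch.
    // The caller would update their timestamps (the "touch" step).
    synchronized List<String> startNextBatch() {
        List<String> carriedOver = new ArrayList<>(deferred);
        deferred.clear();
        accepted = 0;
        return carriedOver;
    }

    public static void main(String[] args) {
        FileBudgetFilter filter = new FileBudgetFilter(2);
        for (String p : new String[] {"a.log", "b.log", "c.log"}) {
            System.out.println(p + " accepted: " + filter.test(p));
        }
        System.out.println("to touch: " + filter.startNextBatch());
    }
}
```

Keeping the rejected paths in the filter itself means the touch step can run once per batch instead of inside the filter call, which is a design choice of this sketch rather than something the answer prescribes.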


NiFi Group Content by Given Attributes

I am trying to run a script or a custom processor to group data by given attributes every hour. Queue size is up to 30-40k on a single run and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: Consume all flow files, group them by attributes, create a new flow file per group, and push the new ones. Not ideal, but I gave it a try.
I ran this while 33k flow files were waiting in the queue.
The returned number is always 10k, even though I increased the queue threshold numbers on the output flows.
Solution 2: A better approach is to consume one flow file and then filter the flow files that match the provided attributes.
final List<FlowFile> flowFiles = session.get(file -> {
    if (correlationId.equals(":"))))
        return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
    return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k waiting in the queue I was expecting around 200 new grouped flow files, but 320 were created. It looks like the same issue as above: the filter does not scan all of the waiting flow files.
Is there a parameter to change so that getObjectCount can go up to 300k?
Is there a way to filter all waiting flow files, either by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k, but it didn't help.
There is a parameter that affects batching behavior.
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with script below
3. LogAttribute (disabled) - just to have a queue after groovy
groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, 4 files are generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains why, of the 50K incoming files, 20K are kept in memory and the rest are swapped out in batches of 10K.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it looks good with multiple small files; the first batch increased to 100K as a result.
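The arithmetic behind those four batches can be written down as a small model. This is only a sketch of the behavior observed above, not NiFi's implementation: SwapBatchSketch and batchSizes are hypothetical names, and the model simply assumes that one session.get call returns either the in-memory part of the queue or a single swap file of 10,000 entries.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the observed batch sizes; not NiFi code.
final class SwapBatchSketch {
    // Swap files hold 10,000 FlowFiles according to the quoted documentation.
    static final int SWAP_FILE_SIZE = 10_000;

    // queued: FlowFiles waiting in the connection
    // swapThreshold: value of nifi.queue.swap.threshold
    // maxPerGet: argument passed to session.get(...)
    static List<Integer> batchSizes(int queued, int swapThreshold, int maxPerGet) {
        List<Integer> sizes = new ArrayList<>();
        int inMemory = Math.min(queued, swapThreshold);
        int swapped = queued - inMemory;

        // The in-memory part is handed out first, capped by maxPerGet.
        while (inMemory > 0) {
            int n = Math.min(inMemory, maxPerGet);
            sizes.add(n);
            inMemory -= n;
        }
        // Afterwards each call sees at most one swap file's worth of entries.
        while (swapped > 0) {
            int n = Math.min(Math.min(swapped, SWAP_FILE_SIZE), maxPerGet);
            sizes.add(n);
            swapped -= n;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 50K queued, threshold 20K, session.get(100000) as in the test flow.
        System.out.println(batchSizes(50_000, 20_000, 100_000));
        // Same queue with the threshold raised to 100K.
        System.out.println(batchSizes(50_000, 100_000, 100_000));
    }
}
```

With the values from the test flow the model reproduces the 20000, 10000, 10000, 10000 split, and with a threshold at or above the queue size it returns a single batch.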

How can write performance be improved for RecordWriter?

Can anyone help me find the correct API to improve write performance?
We use the MultipleOutputs<ImmutableBytesWritable, Result> class to write data that we read from a table, and we use the newly created file as a backup. We face a performance issue when writing with MultipleOutputs: it takes nearly 5 seconds for every 10000 records we write.
This is the code we use:
Result[] results = // result from another table
// context is the task's TaskInputOutputContext; MultipleOutputs has no no-arg constructor
MultipleOutputs<ImmutableBytesWritable, Result> mos = new MultipleOutputs<ImmutableBytesWritable, Result>(context);
for (Result res : results) {
    mos.write(new ImmutableBytesWritable(res.getRow()), res, baseoutputpath);
}
We get a batch of 10000 rows and write them in a loop, with baseoutputpath changing depending on the Result content.
We see a performance dip when writing to MultipleOutputs, and we suspect it is due to writing in a loop.
Is there any other API in MapR-DB or HBase that pushes data to the database using fewer RPC calls by buffering up to a certain limit?
We write data as records, so no file system write class would work for us.
Please note that we use a MapReduce job to do all of the above.

How to get the flowfile count in a NiFi queue?

I have a NiFi flow (standalone) like this:
executestreamprocessor(hive script) -> executestreamprocessor(hadoop script).
For each incoming flowfile, the hive script runs an INSERT..INTO..SELECT..FROM command and the hadoop script deletes the particular files from the storage area.
Sometimes the hive script fails when the hadoop script runs its command at the same time.
I get a maximum of 4 files each hour, so I planned to use a ControlRate processor between the hive and hadoop processors. I want the hadoop script to be executed only when the queue count reaches 4 flowfiles. But ControlRate only has a property for the maximum rate; it does not have a minimum rate.
Is there any way to achieve this, or any other solution?
You should be able to use ExecuteScript for this. Try this Groovy script:
def flowFiles = session.get(4)
if(!flowFiles || flowFiles.size() < 4) {
  session.rollback() // fewer than 4 waiting: put them back on the queue
} else {
  session.transfer(flowFiles, REL_SUCCESS)
}
If you only want to trigger the downstream flow once, then you can create a child flow file from the parents (and report a JOIN provenance event):
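def flowFiles = session.get(4)
if(!flowFiles || flowFiles.size() < 4) {
  session.rollback() // fewer than 4 waiting: put them back on the queue
} else {
  def flowFile = session.create(flowFiles)
  session.provenanceReporter.join(flowFiles, flowFile)
  session.remove(flowFiles) // the parents are replaced by the single child
  session.transfer(flowFile, REL_SUCCESS)
}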
Having said that, if you don't care about the flow file contents (i.e. you are using a flow file as a trigger), you could use MergeContent with a Minimum and Maximum Number Of Entries = 4.

S3 Flume HDFS SINK Compression

I am trying to write Flume events to Amazon S3. The events written to S3 are in compressed format. My Flume configuration is given below. I am facing data loss: with the configuration given below, if I publish 20000 events, I receive only 1000 events and all other data is lost. But when I disable the rollCount, rollSize and rollInterval configurations, all the events are received, but 2000 small files are created. Is there anything wrong in my configuration settings? Should I add any other configurations?
injector.sinks.s3_3store.type = hdfs
injector.sinks.s3_3store.channel = disk_backed4
injector.sinks.s3_3store.hdfs.fileType = CompressedStream
injector.sinks.s3_3store.hdfs.codeC = gzip
injector.sinks.s3_3store.hdfs.serializer = TEXT
injector.sinks.s3_3store.hdfs.path = s3n://CID:SecretKey#bucketName/dth=%Y-%m-%d-%H
injector.sinks.s3_1store.hdfs.filePrefix = events-%{receiver}
# Roll when files reach 256M or after 10m, whichever comes first
injector.sinks.s3_3store.hdfs.rollCount = 0
injector.sinks.s3_3store.hdfs.idleTimeout = 600
injector.sinks.s3_3store.hdfs.rollSize = 268435456
#injector.sinks.s3_3store.hdfs.rollInterval = 3600
# Flush data to buckets every 1k events
injector.sinks.s3_3store.hdfs.batchSize = 10000
For starters: if you disable your settings for rollCount, rollSize and so on, Flume will revert to its defaults. The small files you receive are the result of those defaults.
The relevant aspect is this:
injector.sinks.s3_3store.hdfs.batchSize = 10000
It basically tells your sink to collect 10,000 events before flushing. If you reduce that amount, you'll get smaller files too, because S3, in contrast to regular HDFS, doesn't support file appends. Once you flush, the file is closed and a new file is created.
Try to determine how many events your sink will receive within a short time frame of a couple of minutes or so, and set that value as your batch size.

Why does Hadoop take so long to start with a large number of S3 paths?

I have a Hadoop job that has ~60k S3 input paths. This job takes about 45 minutes to start. The same job, with only ~3k S3 input paths starts almost instantly.
Why does having a large number of input paths cause the job to take so long to start?
The answer has to do with how FileInputFormat.addInputPath(...) is implemented. If you take a look at the source, you'll see that it's actually doing a string concatenation to save all of these paths to a file. Calling addInputPaths(...) just calls addInputPath, so there's no savings there. I ended up calling FileInputFormat.setInputPaths(Job, Path[]). This skips the 60k+ string concatenations by building that part of the settings file once.
As climbage mentioned, there will need to be 60k+ calls to S3 to build the splits. It turns out that the S3 calls were taking less time than the string concatenation. My jobs went from taking 45 minutes to start down to less than 20.
For those who don't want to go combing through the source, here's the implementation of FileInputFormat.addInputPath() in Hadoop 2.5.1:
public static void addInputPath(Job job,
                                Path path) throws IOException {
  Configuration conf = job.getConfiguration();
  path = path.getFileSystem(conf).makeQualified(path);
  String dirStr = StringUtils.escapeString(path.toString());
  String dirs = conf.get(INPUT_DIR);
  conf.set(INPUT_DIR, dirs == null ? dirStr : dirs + "," + dirStr);
}
and FileInputFormat.setInputPaths() in Hadoop 2.5.1:
public static void setInputPaths(Job job,
                                 Path... inputPaths) throws IOException {
  Configuration conf = job.getConfiguration();
  Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
  StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
  for(int i = 1; i < inputPaths.length;i++) {
    str.append(StringUtils.COMMA_STR);
    path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
    str.append(StringUtils.escapeString(path.toString()));
  }
  conf.set(INPUT_DIR, str.toString());
}
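The difference between the two approaches can be illustrated without Hadoop. The sketch below uses only the JDK; InputPathJoinSketch and its methods are hypothetical names, the S3 paths are made up, and the extra work Hadoop does per call (qualifying and escaping the path, reading and writing the Configuration) is not modelled. It only counts how many characters end up in newly built strings when the list is grown one path at a time versus built once.

```java
import java.util.ArrayList;
import java.util.List;

final class InputPathJoinSketch {

    // Mimics one addInputPath call per path: take the stored value,
    // append one entry, and store a brand-new string each time.
    static String joinIncrementally(List<String> paths) {
        String dirs = null;
        for (String p : paths) {
            dirs = (dirs == null) ? p : dirs + "," + p;
        }
        return dirs;
    }

    // Mimics setInputPaths: append into one buffer, produce the value once.
    static String joinOnce(List<String> paths) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paths.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(paths.get(i));
        }
        return sb.toString();
    }

    // Total length of every string created by the concatenations in
    // joinIncrementally (the first path is stored without concatenation).
    static long charsBuiltIncrementally(List<String> paths) {
        long total = 0;
        long current = 0;
        for (int i = 0; i < paths.size(); i++) {
            if (i == 0) {
                current = paths.get(i).length();
            } else {
                current += 1 + paths.get(i).length(); // comma plus next path
                total += current;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 60_000; i++) {
            paths.add("s3://example-bucket/input/part-" + i); // hypothetical paths
        }
        System.out.println("chars built incrementally: " + charsBuiltIncrementally(paths));
        System.out.println("chars built once:          " + joinOnce(paths).length());

        // Both approaches yield the same value; checked on a small sample
        // so the quadratic version stays cheap to run.
        List<String> sample = paths.subList(0, 100);
        System.out.println("same result: " + joinIncrementally(sample).equals(joinOnce(sample)));
    }
}
```

The incremental total grows roughly with the square of the number of paths, while the single join grows linearly, which is consistent with the slowdown only becoming noticeable at tens of thousands of paths.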
One of the first things that FileInputFormat does during MapReduce initialization is determine the input splits. This is done by creating a list of every input file and its information (such as file size). I imagine that 60k API calls to S3 for file information isn't fast. 45 minutes seems extraordinarily slow - there may be some rate limiting going on as well?
Sorry for reopening an old question, but I recently came across a similar issue.
The core of it is that in your case Hadoop will make 60K calls to AWS.
To work around this, one can use wildcards.
This generates only 1 AWS call to list the directory path_to_a_folder and then filters by the prefix.
I hope this helps whoever finds this question.
