Structured streaming doesn't get individual file names with input_file_name() - spark-streaming

I have a structured streaming job that reads a set of json.gz files under the following directory structure and writes to a Delta table:
headFolder
|- 00
|-- file1.json.gz
|- 01
|-- file2.json.gz
...
|- 23
|-- file24.json.gz
The structured streaming query I'm running is as follows:
from pyspark.sql.functions import input_file_name

(spark.readStream
    .format("cloudFiles")
    .options(**{"cloudFiles.format": "json", "cloudFiles.schemaEvolutionMode": "rescue"})
    .schema(schema_predefined)
    .load("./headFolder/")
    .withColumn("input_file_path", input_file_name())
    .writeStream
    .format("delta")
    .outputMode("append")
    .options(**{"checkpointLocation": checkpoint_path, "path": output_path})
    .trigger(once=True)
    .queryName("query_name")
    .start())
I omitted some details in the query above; please take all undeclared parameters as pre-defined. After I ran the job, all 24 files were processed and I could validate that the data was correct. However, the input_file_name() function didn't work as I was expecting.
When I check the input_file_path column, I was expecting 24 distinct values, since the file names are all different. However, I see only around 5 distinct file names, and the exact number varies with file size. After looking into the documentation, I found that it indeed returns the file name of the task rather than of each individual file; since I'm reading from the top-level folder, Spark automatically divides the 24 hourly files into several tasks and picks one name from the files each task reads.
My question is: is there still a way to accurately record the file name for each file processed under the current framework? I don't want to change the file path or force it to run one task per file, for runtime reasons.
Thank you!
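(For reference, a minimal sketch of pulling the per-row file path from the hidden _metadata column instead of input_file_name(); this assumes a runtime where _metadata is available, e.g. Spark 3.2+ or a recent Databricks runtime, and reuses the placeholders from above.)
from pyspark.sql.functions import col

(spark.readStream
    .format("cloudFiles")
    .options(**{"cloudFiles.format": "json", "cloudFiles.schemaEvolutionMode": "rescue"})
    .schema(schema_predefined)
    .load("./headFolder/")
    # hidden per-file metadata column; file_path is the full path of the
    # source file each row was read from
    .withColumn("input_file_path", col("_metadata.file_path")))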

Related

mlcp performs differently for different input directory paths

I am using mlcp v9.0.4 to load data into MarkLogic v9.0.9 and I am trying to figure out the following:
1. If the CSV file has no data rows and contains only the column names, the file never gets loaded. How can I overcome this and load the empty files?
2. mlcp behaves differently when input_file_path is a directory containing CSVs vs. a directory containing another directory.
E.g., if the structure is /dir/dir1/*.csv, then input_file_path=/dir/dir1/ loads faster than input_file_path=/dir/ (with the other options set to defaults).
What is the logic that mlcp is applying to do the load here?
Should I change any options so that both ways give me the same result?
For point 1:
I could add an empty row to the CSV and load it, but I don't want to take this approach.
I tried using a transform module but that is slowing down the load.
For point 2: I have been trying different combinations of the mlcp options (batch_size, split_size, max_split_size, thread_count, thread_count_per_split) as given in the MarkLogic docs. However, I wonder if I am just beating around the bush.
I wanted to understand how mlcp treats the inputs under the hood.
For point 2:
For a 128 GB RAM server, here are the details of what I tried.
File/directory structure:
/dir/dir1/1.csv - 4 MB
/dir/dir1/2.csv - 10 MB
/dir/dir1/3.csv - 400 MB
/dir/dir1/4.csv - 3000 MB
Database configuration:
forest policy - bucket
locking - off
journaling - fast
options file for mlcp:
-generate_uri
true
-fast_load
true
-thread_count
32
-split_size
true
-max_split_size
94371840
-thread_count_per_split
1
-batch_size
100
-transaction_size
20
For point 1: what would you expect as the result of loading a file with no data rows? Considering that the data model is such that 1 CSV 'row' == 1 MarkLogic document, 0 CSV data 'rows' == how many documents? Are you expecting a number != 0?
For point 2: could you share the performance difference you are seeing? What is "loads faster", and what does the final result set look like?

Apache NiFi MergeContent output data inconsistent?

Fairly new to using NiFi. Need help with the design.
I am trying to create a simple flow with dummy CSV files (for now) in an HDFS directory and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(this may be a poor way to achieve what I am trying to get, but I saw somewhere that uuid is best bet when it comes to generating unique session id. So thought of extracting each line from incoming data to flowfile and generating uuid)
But somehow, as you can see, the order of the data gets messed up. The first 3 rows are not the same in the output; in the test data I am using (50,000 entries), that data shows up on some other line. Multiple tests show the data order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the Defragment merge strategy, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can keep the data in the same order, with a unique session_id and timestamp for each record. Is there some parameter I need to change or modify to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. It could be some parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split the data and join it back to its corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an issue with I/O in the cluster, since that would mean each record = one flowfile = one unique UUID? What is a comfortable number of flowfiles that NiFi can handle? (I know it depends on the cluster config and will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?
In this case you would need to define a schema for your CSV which includes all the columns in your data, plus the session id and timestamp. Then, using an UpdateRecord processor, you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line, update each record, and write it back out, keeping it all as one flowfile.
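To illustrate the idea outside NiFi, here is a rough Python sketch of the same record-at-a-time processing; the file names are made up, and in NiFi the record reader/writer and UpdateRecord do this for you while keeping everything as a single flowfile:
import csv
import uuid
from datetime import datetime

# Stream one CSV, prepend a session id and timestamp to every record,
# and write a single output file -- no splitting into per-record pieces.
# "dummy1.csv" / "dummy1_out.csv" are placeholder names.
with open("dummy1.csv", newline="") as src, \
        open("dummy1_out.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for record in reader:
        writer.writerow([str(uuid.uuid4()), str(datetime.now())] + record)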
This looks promising. Can you share a simple template that pulls files from HDFS > processes them > writes files back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions. But let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)

How to append current date to property file value every day in Unix?

I've got a property file which is read several times per day by an external application in order to process some files. One of the properties tells the app where to store the processed files. Application runs on Linux.
success_path=/u02/oapp/success
The problem is that every day several files are thrown into that path, and after several months I would have thousands of files in this plain folder.
Question: How can I append the current date to this property file so it would look like:
success_path=/u02/oapp/success/dd-MMM-yyyy
This would be updated every day at 12:00 AM, so for example today it would be:
success_path=/u02/oapp/success/28-JAN-2017
The file is /u02/oapp/configuration/oapp.properties
Thanks in advance
Instead of appending the current date to the property, add additional logic to the code that stores the processed files so that:
it takes the base directory from the property file (success_path in your case)
it creates a year/month/day directory to store the files
Something like:
/u02/oapp/success/year/month/day (as in `/u02/oapp/success/2017/01/01`)
or
/u02/oapp/success/yearmonth/day (as in `/u02/oapp/success/201701/01`)
or
/u02/oapp/success/yearmonthday (as in `/u02/oapp/success/20170101`)
If you don't have the capability to change the app's behavior, you might need to write a cron job, external to the app, that periodically moves the files.
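If the cron route is the only option, a rough sketch of the nightly job could look like this (paths, layout, and schedule are assumptions, not something the app provides):
#!/usr/bin/env python3
# Sketch for a cron entry such as:
#   0 0 * * * /usr/bin/python3 /u02/oapp/scripts/archive_success.py
# Moves the loose files sitting in the success folder into a dated
# year/month/day subfolder.
import datetime
import os
import shutil

BASE = "/u02/oapp/success"
dated = os.path.join(BASE, datetime.date.today().strftime("%Y/%m/%d"))
os.makedirs(dated, exist_ok=True)

for name in os.listdir(BASE):
    path = os.path.join(BASE, name)
    if os.path.isfile(path):  # skip the dated subfolders themselves
        shutil.move(path, os.path.join(dated, name))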
jq -Rr 'select(startswith("success_path="))="success_path=/u02/oapp/success/"+(now|strflocaltime("%d-%b-%Y")|ascii_upcase)' /u02/oapp/configuration/oapp.properties | sponge /u02/oapp/configuration/oapp.properties

What are SUCCESS and part-r-00000 files in Hadoop?

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the SUCCESS file? Why does the output file have the name part-r-00000? Is there any significance or nomenclature to it, or is it just randomly defined?
See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947)
This would typically be used by job scheduling systems (such as OOZIE), to denote that follow-on processing on the contents of this directory can commence as all the data has been output.
Update (in response to comment)
The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether the job was a map-only job or had a reduce phase
yyyyy is the mapper or reducer task number (zero based)
So a job which has 32 reducers will have files named part-r-00000 to part-r-00031, one for each reducer task.
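As a small illustration (the output path is hypothetical), a downstream script would typically gate on the _SUCCESS marker and then pick up the part files:
import os

output_dir = "/data/job-output"  # hypothetical job output directory

# A 32-reducer job would produce part-r-00000 .. part-r-00031:
expected = [f"part-r-{i:05d}" for i in range(32)]

if os.path.exists(os.path.join(output_dir, "_SUCCESS")):
    parts = sorted(f for f in os.listdir(output_dir) if f.startswith("part-"))
    print(f"job complete, {len(parts)} output files:", parts)
else:
    print("job output not complete yet")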

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output file are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to make the input and output file names of the job the same. Get the input file name (the map.input.file property) that the mapper is processing and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
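Since this is a streaming job, the mapper sees the job configuration as environment variables with the dots replaced by underscores, so reading the current input file name looks roughly like this (sketched in Python for brevity; the same environment lookup works from C++):
#!/usr/bin/env python3
# Streaming-mapper sketch: map.input.file is exposed to the process as
# map_input_file (mapreduce_map_input_file on newer Hadoop versions).
import os
import sys

input_file = (os.environ.get("mapreduce_map_input_file")
              or os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    # Tag every record with the file it came from so the input/output
    # relationship can be recovered from the part files.
    sys.stdout.write(f"{input_file}\t{line}")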
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
