Shell script for hourly run to pull data if exists - shell

I am trying to optimize our batch process to pull and insert data into a database. Currently, we have a data source that we pull our data from, create a text file, and load into our reporting database. We have that on a time schedule in Autosys, since most of the time, data is available by a certain time. However, lately, the data source has been late and we are not able to pull the data during the scheduled time and we have to manually run the shell script.
I want to have a shell script that runs the queries every hour and if the data exists, spools to a file to be loaded into the DB. If the data isn't there, then try again next hour so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having trouble setting up the shell script so that it only pulls the data once it is available and does not repeat the next hour if the file has already been spooled. I am new to UNIX so I am not sure how to proceed. Any help would be great.

You haven't stated your priority clearly. The priorities could be:
load the data as soon as it is available
load the data at least once every x minutes or hours
eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
increase the frequency of the Autosys job (instead of hourly, maybe run it every 15 or 30 minutes)
change the script so that:
it attempts to load only if it has been x minutes since last successful load, otherwise does nothing and ends in success
stores the last successful load timestamp in a file (which would be touched only upon a successful load)
if data doesn't arrive even after x + some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.
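The script logic described above could be sketched roughly as follows. This is a minimal illustration in Python rather than shell, and every name in it (`run_once`, the stamp file, the `data_available` and `load` callbacks, the interval and deadline values) is hypothetical; the real availability check would be whatever query tells you the source data has arrived.

```python
import os
import time
from pathlib import Path
from typing import Callable, Optional


def run_once(stamp: Path,
             min_interval_s: float,
             deadline_s: float,
             data_available: Callable[[], bool],
             load: Callable[[], None],
             now: Optional[float] = None) -> int:
    """Return an exit code: 0 = loaded or nothing to do, 1 = data is overdue."""
    now = time.time() if now is None else now
    last_success = stamp.stat().st_mtime if stamp.exists() else None

    # Too soon since the last successful load: do nothing and end in success.
    if last_success is not None and now - last_success < min_interval_s:
        return 0

    if data_available():
        load()
        # The stamp file is touched only after a successful load.
        stamp.touch()
        os.utime(stamp, (now, now))
        return 0

    # Data is missing. Fail only once it is later than x + buffer, so the
    # scheduler flags the job. With no previous load there is no baseline,
    # so this sketch simply waits for the next run.
    if last_success is not None and now - last_success > deadline_s:
        return 1
    return 0
```

The scheduler would call this every 15 or 30 minutes and use the return value as the job's exit status, so only the overdue case shows up as a failure.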


Nifi Job to execute a spark submit command not giving correct results

I have Spark code that appends data from a Hive table to parquet files partitioned by date. The code runs correctly when executed from the spark shell, and the parquet files show exactly the same number of rows as the Hive table for the corresponding date.
However, when the same code is packaged in a jar file that is called by a spark submit command, and the spark submit command is scheduled to execute daily at 9 AM via Nifi, the parquet partition files contain fewer rows than expected. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data on the source hive table gets updated by 4 AM approx.
• Initially our Nifi job was scheduled to start running at 4:45 AM but the number of records did not match. On doing a manual update from the spark shell after 6 AM, the data was an exact match.
• Hence, we scheduled the job to run at 7 AM. After this change, the data got updated correctly via the Nifi job when the number of records was low (approx. 20000 on weekends) compared to weekdays (in the range of 150000 to >200000 records). Again a manual run was done to backfill the missing data.
• We then postponed the job to 9 AM. After that, there were 2 days when the number of records matched (between 160000 and 200000); however, since Jul-31, the data hasn't matched at all, irrespective of the number of records on any given day, and we have to do a manual backfill every day.
We are unable to figure out why the code runs correctly from the spark shell at any time but gives incorrect results from Nifi, when Nifi is just scheduled to execute the spark submit command that runs the jar file containing the same Spark code.
Please help me with understanding why this would be happening and how I can fix this.
P.S.: I have checked the Nifi log files, and could not find any of the scheduled jobs giving an error.

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache Nifi. I made a data flow that pulls data from Hive and stores it in SQL Server. There are no errors in my data flow; the only problem is that it's pulling data repeatedly.
My data flow consists of the following:
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. The SelectHiveQL processor pulled the data repeatedly.
What do I need to do so that it pulls only 20 rows, or just the exact number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to run only once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt the current execution; it just prevents it from being scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)

Is there a way to set a TTL for certain directories in HDFS?

I have the following requirements. I am adding date-wise data to a specific directory in HDFS, and I need to keep a backup of the last 3 sets, and remove the rest. Is there a way to set a TTL for the directory so that the data perishes automatically after a certain number of days?
If not, is there a way to achieve similar results?
This feature is not yet available on HDFS.
There was a JIRA ticket created to support this feature:
But, the fix is not yet available.
You need to handle it using a cron job. You can create a job (this could be a simple Shell, Perl or Python script), which periodically deletes the data older than a certain pre-configured period.
This job could:
Run periodically (e.g. once an hour or once a day)
Take the list of folders or files which need to be checked, along with their TTL as input
Delete any file or folder, which is older than the specified TTL.
This can be achieved easily, using scripting.
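As a rough illustration of such a job, the sketch below picks out expired paths from the text printed by `hdfs dfs -ls`. It assumes the usual eight-column listing (permissions, replication, owner, group, size, date, time, path) and that the listed timestamps are in the same time zone as `now`. The function names, the sample paths and the TTL value are hypothetical.

```python
import subprocess
from datetime import datetime, timedelta
from typing import List


def expired_paths(ls_output: str, ttl: timedelta, now: datetime) -> List[str]:
    """Return the paths whose modification time is older than the TTL."""
    expired = []
    for line in ls_output.splitlines():
        parts = line.split(None, 7)
        if len(parts) < 8:
            # Skips the "Found N items" header and blank lines.
            continue
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        if now - modified > ttl:
            expired.append(parts[7])
    return expired


def delete_path(path: str) -> None:
    """Delete one path recursively; needs the hdfs client on the PATH."""
    subprocess.run(["hdfs", "dfs", "-rm", "-r", path], check=True)
```

A cron entry could run a small wrapper that lists each configured directory, calls `expired_paths` with that directory's TTL, and passes the results to `delete_path`. Logging the selected paths before deleting anything is a sensible precaution.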

How to keep a state in Hadoop jobs?

I'm working on a Hadoop program which is scheduled to run once a day. It takes a bunch of JSON documents, and each document has a time-stamp which shows when the document was added. My program should only process those documents that were added since its last run. So, I need to keep a state, which is a time-stamp showing the last time my Hadoop job ran. I was thinking of storing this state in a SQL Server and querying it in the driver program of my job. Is this a good solution, or is there a better one?
P.S. My Hadoop job is running on HDInsight. Given that, is it still possible to query the SQL Server from my driver program?
We solved this problem for our workflows running in AWS (Amazon Web Services), for data stored in S3.
Our setup:
Data store: AWS S3
Data ingestion mechanism: Flume
Workflow management: Oozie
Storage for file status: MySQL
We were ingesting data into Amazon S3 using Flume. All the ingested data was in the same folder (S3 is a key/value store and has no concept of a folder; here, folder means that all the data had the same key prefix. E.g. for /tmp/1.txt, /tmp/2.txt etc., /tmp/ is the key prefix).
We had an ETL workflow, which was scheduled to run once an hour. But, since all the data was ingested into the same folder, we had to distinguish between the Processed and Un-Processed files.
For e.g. for the 1st hour, the data ingested is "1.txt" and "2.txt".
When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.
If for the second hour, the data ingested is "3.txt", "4.txt" and "5.txt",
then the total data in the folder after 2 hours will be "1.txt" through "5.txt".
Since, "1.txt" and "2.txt" were already processed and marked as Processed, during the second run, the job should just process "3.txt", "4.txt" and "5.txt".
We developed a library (let's call it FileManager) for managing the list of processed files. We plugged this library into the Oozie workflow as a Java action. This was the first step in the workflow.
This library also took care of ignoring the files that were still being written by Flume. While Flume was writing data into a file, the file had a "_current" suffix. So, those files were ignored for processing until they were completely written.
The ingested files were generated with a timestamp as a suffix, e.g. "hourly_feed.1234567". So, the file names were in ascending order of their creation.
For getting the list of unprocessed files, we used S3's feature of querying using markers (For e.g. if you have 10,000 files in a folder, if you specify marker as the name of the 5,000th file, then S3 will return you files from 5001 to 10,000).
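The marker behaviour can be imitated locally to see what it does. The sketch below is not the S3 API; it only mimics the semantics on a sorted list of keys, and `list_after` is a hypothetical helper. In the real service the equivalent request parameters are `Marker` (ListObjects) and `StartAfter` (ListObjectsV2).

```python
from bisect import bisect_right
from typing import List


def list_after(sorted_keys: List[str], marker: str) -> List[str]:
    """Return the keys that sort strictly after the marker."""
    return sorted_keys[bisect_right(sorted_keys, marker):]
```

Keys are compared as strings, so this only matches creation order when the timestamp suffixes have a fixed width; "hourly_feed.999" would sort after "hourly_feed.1000".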
We had following 3 states for each of the files:
SUCCESS - Files which were successfully processed
ERROR - Files which were picked up for processing, but there was an error in processing these files. Hence, these files need to be picked up again for processing
IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job
For each file, we stored following details in the MySQL DB:
File Name
Last Modified Time - We used this to handle some corner cases
Status of the file (IN_PROGRESS, SUCCESS, ERROR)
The FileManager exposed following interfaces:
GetLatestFiles: Return the list of latest Un-Processed files
UpdateFileStatus: After processing the files, update the status of the files
The following steps were used to identify the files which were not yet processed:
Query the database (MySQL) for the last file which had a status of SUCCESS (query: order by created desc).
If the first step returns a file, then query S3 with the file marker set to the last successfully processed file. This will return all the files, ingested after the last successfully processed file.
Also query the DB to check if there are any files in ERROR status. These files need to be re-processed, because previous workflow did not process them successfully.
Return the list of files obtained from Steps 2 and 3 (Before returning them, mark their status as IN_PROGRESS).
After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing the files, update the status of all the files to ERROR (so that they can be picked up for processing next time).
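The steps above can be sketched as a small class. This is not the original FileManager (which was a Java library backed by MySQL and S3); it is an illustrative stand-in that uses an in-memory SQLite table for the status store and a plain list of key names in place of an S3 listing. The table layout and method names are assumptions, and it takes the "last" successful file to be the greatest file name, relying on names sorting in creation order.

```python
import sqlite3
from typing import Iterable, List


class FileManager:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files "
            "(name TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def get_latest_files(self, all_keys: List[str]) -> List[str]:
        # Step 1: find the last file that was processed successfully.
        row = self.conn.execute(
            "SELECT MAX(name) FROM files WHERE status = 'SUCCESS'"
        ).fetchone()
        marker = row[0] or ""
        # Step 2: everything after the marker, skipping files still being written.
        new = [k for k in all_keys if k > marker and not k.endswith("_current")]
        # Step 3: files that failed earlier and must be retried.
        failed = [r[0] for r in self.conn.execute(
            "SELECT name FROM files WHERE status = 'ERROR'")]
        # Step 4: mark the batch IN_PROGRESS before handing it out.
        batch = sorted(set(new) | set(failed))
        self.update_file_status(batch, "IN_PROGRESS")
        return batch

    def update_file_status(self, names: Iterable[str], status: str) -> None:
        # Step 5: record SUCCESS or ERROR once the job has finished.
        self.conn.executemany(
            "INSERT OR REPLACE INTO files (name, status) VALUES (?, ?)",
            [(name, status) for name in names],
        )
        self.conn.commit()
```

A run would call `get_latest_files`, process the returned batch, then call `update_file_status` with SUCCESS or ERROR. A file left in IN_PROGRESS by a crashed job is not handled here and would need its own recovery rule.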
We used Oozie for workflow management. Oozie workflow had following steps:
Step 1: Fetch next set of files to be processed, mark each of their state as IN_PROGRESS and pass them to the next stage
Step 2: Process the files
Step 3: Update the status of the processing (SUCCESS or ERROR)
When you implement such a library, there is a possibility of duplication of records (in some corner cases, same file may be picked up twice for processing). We had implemented a de-duplication logic to remove duplicate records.
You can name the result documents using a date-time, so that your program can decide which documents to process based on their names.
Having the driver program check the last run time stamp is a good approach, but for storing the last run time stamp, you can use a temporary file on HDFS.

Spring XD Batch Job Incremental

I created a batch job to extract data from a csv file into a database over jdbc using the filejdbc module. It worked properly, but when I scheduled the batch to run every 5 minutes, it did not do an incremental load; it loaded all the data again. Is there any feature to schedule the batch with incremental load?
Is the solution to run the batch once, and then create a stream to do the incremental load? Will the stream load all the data again, or will it just continue from a certain point?
Please explain how I can achieve incremental load using Spring XD.
I suppose what is missing is the concept of 'state'...the filejdbc module does not seem to know where the last import stopped. I do something similar but I use a custom batch job and I use a meta store to keep track of where the last load stopped - that is where the next incremental will pick up from, etc.
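To make the idea of 'state' concrete, here is a minimal sketch of an incremental load that remembers how many rows it has already loaded. It is plain Python, not Spring XD or the filejdbc module, and it assumes the csv file is append-only. The function name, the JSON state file and the `sink` callback (standing in for the database insert) are all hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Callable, List


def load_increment(csv_path: Path,
                   state_path: Path,
                   sink: Callable[[List[str]], None]) -> int:
    """Send only the rows added since the last run to sink; return their count."""
    rows_done = 0
    if state_path.exists():
        rows_done = json.loads(state_path.read_text())["rows_loaded"]

    with csv_path.open(newline="") as handle:
        rows = list(csv.reader(handle))

    new_rows = rows[rows_done:]
    for row in new_rows:
        sink(row)

    # Persist the new position only after every row has been handed over.
    state_path.write_text(json.dumps({"rows_loaded": len(rows)}))
    return len(new_rows)
```

The state file plays the role of the meta store: each scheduled run picks up where the previous one stopped instead of reloading everything.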
Since you're using a module that came with spring-xd itself, you may not have this flexibility, but you may have two options:
a- your destination table can define unique fields that will prevent duplicates. That way, even if it's trying to load ALL the data again, only new rows will get inserted. This assumes that the module is using 'insert ignore' (or something similar) and not just a basic insert (which will throw an error/exception). This, I must say, will become non-optimal pretty quickly.
b- If it's an option, write a module that can delete the file after it's uploaded into the db. You can construct a complex stream that will first do your data load and then the file delete.
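Option (a) can be demonstrated with a small example. It uses SQLite's `INSERT OR IGNORE` as a stand-in for MySQL's `INSERT IGNORE`, and it says nothing about what SQL the filejdbc module actually issues. The table and column names are made up for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def make_table(conn: sqlite3.Connection) -> None:
    # The primary key is the unique field that rejects duplicate rows.
    conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value TEXT)")


def load_rows(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> int:
    """Insert rows, silently skipping existing ids; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO readings (id, value) VALUES (?, ?)", rows
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    return after - before
```

Reloading the whole file stays correct but every run still sends every row to the database, which is why this approach degrades as the file grows.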
