How to keep a state in Hadoop jobs? - hadoop

I'm working on a hadoop program which is scheduled to run once a day. It takes a bunch of json documents and each document has a time-stamp which shows when the document has been added. My program should only process those documents that are added since its last run. So, I need to keep a state which is a time-stamp showing the last time my hadoop job has run. I was thinking of storing this state in a SQL Server and query that in the driver program of my job. Is it a good solution or might be a better solution ?
p.s. my hadoop job is running on HDInsight. Having said that it is still possible to query the SQL server from my driver program?

We had solved this problem for our workflows running in AWS (Amazon Web Services), for the data stored in S3.
Our setup:
Data store: AWS S3
Data ingestion mechanism: Flume
Workflow management: Oozie
Storage for file status: MySQL
Problem:
We were ingesting data into Amazon S3, using Flume. All the ingested data was in same folder (S3 is a key/value store and has no concept of folder. Here folder means, all the data had same prefix. For e.g. /tmp/1.txt, /tmp/2.txt etc. Here /tmp/ is the key prefix).
We had a ETL workflow, which was scheduled to run once in an hour. But, since all the data was ingested into same folder, we had to distinguish between the Processed and Un-Processed files.
For e.g. for the 1st hour data ingested is:
/tmp/1.txt
/tmp/2.txt
When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.
If for the second hour, the data ingested is:
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Then, the total data in the folder after 2 hours will be:
/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Since, "1.txt" and "2.txt" were already processed and marked as Processed, during the second run, the job should just process "3.txt", "4.txt" and "5.txt".
Solution:
We developed a library (let's call it as FileManager), for managing the list of processed files. We plugged in this library into the Oozie work flow, as a Java action. This was the first step in the workflow.
This library also took care of ignoring the files, which are currently being written into by Flume. When Flume is writing data into a file, those files had "_current" suffix. So, those files were ignored for processing, till they are completely written into.
The ingested files were generated with timestamp as a suffix. For e.g. "hourly_feed.1234567". So, the file names were in ascending order of their creation.
For getting the list of unprocessed files, we used S3's feature of querying using markers (For e.g. if you have 10,000 files in a folder, if you specify marker as the name of the 5,000th file, then S3 will return you files from 5001 to 10,000).
We had following 3 states for each of the files:
SUCCESS - Files which were successfully processed
ERROR - Files which were picked up for processing, but there was an error in processing these files. Hence, these files need to be picked up again for processing
IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job
For each file, we stored following details in the MySQL DB:
File Name
Last Modified Time - We used this to handle some corner cases
Status of the file (IN_PROGRESS, SUCCESS, ERROR)
The FileManager exposed following interfaces:
GetLatestFiles: Return the list of latest Un-Processed files
UpdateFileStatus: After processing the files, update the status of the files
Following are the steps followed to identify the files, which were not yet processed:
Query the database (MySql), to check the last file which had status of SUCCESS (query: order by created desc).
If the first step returns a file, then query S3 with the file marker set to the last successfully processed file. This will return all the files, ingested after the last successfully processed file.
Also query the DB to check if there are any files in ERROR status. These files need to be re-processed, because previous workflow did not process them successfully.
Return the list of files obtained from Steps 2 and 3 (Before returning them, mark their status as IN_PROGRESS).
After the job is completed successfully update the state of all the processed file as SUCCESS. If there was an error in processing the files, then update the status of all the files as ERROR (so that they can be picked up for processing next time)
We used Oozie for workflow management. Oozie workflow had following steps:
Step 1: Fetch next set of files to be processed, mark each of their state as IN_PROGRESS and pass them to the next stage
Step 2: Process the files
Step 3: Update the status of the processing (SUCCESS or ERROR)
De-duplication:
When you implement such a library, there is a possibility of duplication of records (in some corner cases, same file may be picked up twice for processing). We had implemented a de-duplication logic to remove duplicate records.

you can rename the result document by using date-time,then your program can process the document according to the name of document.

Driver program checking the last run time stamp is good approach, but for storing last run time stamp, you can use a temporary file from HDFS.

Related

Incremental ETL using Glue

Need help in processing incremental files.
Scenario: Source team is creating file in every 1hr in s3 (hrly partitioned). I would like to process in every 4hr. The Glue etl will read the s3 files (partitioned hrly) and process to store in different s3 folders.
Note : Glue ETL is called from airflow.
Question How can I make sure that I only process the incremental files ( let’s say 4 files in each execution)?
Sounds like a use case for Bookmarks
For example, your ETL job might read new partitions in an Amazon S3
file. AWS Glue tracks which partitions the job has processed
successfully to prevent duplicate processing and duplicate data in the
job's target data store.

How to skip the target write in case zero record count in Informatica Cloud (IICS)?

I have one use case to create incremental Data ingestion pipeline from one Database to AWS S3. I have created a pipeline and it is working fine except for the one scenario where no incremental data was found.
In case of zero record count, it is writing the file with a header-only (parquet file). I want to skip the target write when there is no incremental record.
How I can implement this in IICS?
I have already tried to implement the router transformation where I have put the condition if record count > 0 then only write to target but still it is not working.
First of all: the target file gets created even before any data is read from source. This is to ensure the process has write access to target location. So even if there will be no data to store, an empty file will get created.
The possible ways out here will be to:
Have a command task check the number of lines in output file and delete it if there is just a header. This would require the file to be created locally, verified, and uploaded to S3 afterwards e.g. using Mass Ingestion task - all invoked sequentially via taskflow
Have a session that will first check if there is any data available, and only then run the data extraction.

Create SPRING BATCH JOBS in RUN TIME and Execute them

I have a requirement where i will take input CSV files from one folder, process them (DB lookup and validation) one after one and generate new output file for each input file. I need to choose the input files at RUN TIME based on the DB query on user object which will tell what are the qualified files (like out of 400 files in folder - 350 may be qualified to process and I need to generate 350 output files). I want to use SpringBatch and want to create one JOB for one File. Any reference or sample code to create JOBs at RUN TIME and Execute them?

Spark EMR S3 Processing Large No of Files

I have around 15000 files (ORC) present in S3 where each file contain few minutes worth of data and size of each file varies between 300-700MB.
Since recursively looping through a directory present in YYYY/MM/DD/HH24/MIN format is expensive, I am creating a file which contain list of all S3 files for a given day (objects_list.txt) and passing this file as input to spark read API
val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
if(line.length > 0 && line.contains("part"))
paths.add(line.trim)
}
val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown","true").load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")
The Size of the cluster is 10 r3.4xlarge machines (workers)(Where Each Node: 120GB RAM and 16 cores) and master is of m3.2xlarge config (
The problem which am facing is, spark read was running endlessly and I see only driver working and rest all Nodes aren't doing anything and am not sure why driver is opening each S3 file for reading, because AFAIK spark works lazily so till an action is called reading shouldn't happen, I think it's listing each file and collecting some metadata associated with it.
But why only Driver is working and rest all Nodes aren't doing anything and how can I make this operation to run in parallel on all worker nodes ?
I have come across these articles https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but here the entire file contents are being read as an RDD, but my use case is depending on the columns being referred only those blocks/columns of data should be fetched from S3 (columnar access given ORC is my storage) . Files in S3 have around 130 columns but only 20 fields are being referred and processed using dataframe API's
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
You can see below that only One Executor is running that to driver program on one of the task Nodes(Cluster Mode) and CPU is 0% on rest of the other Nodes(i.e Workers) and even after 3-4 hours of processing, the situation is same given huge number of files have to be processed
Any Pointers on how can I avoid this issue, i.e speed up the load and process ?
There is a solution that can help you based in AWS Glue.
You have a lot of files partitioned in your S3. But you have partitions based in timestamp. So using glue you can use your objects in S3 like "hive tables" in your EMR.
First you need to create a EMR with version 5.8+ and you will be able to see this:
You can set up this checking both options. This will allow to access the AWS Glue Data Catalog.
After this you need to add the your root folder to the AWS Glue Catalog. The fast way to do that is using the Glue Crawler. This tool will crawl your data and will create the catalog as you need.
I will suggest you to take a look here.
After the crawler runs, this will have the metadata of your table in the catalog that you can see at AWS Athena.
In Athena you can check if your data was properly identified by the crawler.
This solution will make your spark works close to a real HDFS. Due to the metadata will be properly in the Data Catalog. And the time you app is taking to find the "indexing" will allow to run the jobs faster.
Working with this here I was able to improve the queries, and working with partitions was much better with glue. So, have a try this probably can help in the performance.

Shell script for hourly run to pull data if exists

I am trying to optimize our batch process to pull and insert data into a database. Currently, we have a data source that we pull our data from, create a text file, and load into our reporting database. We have that on a time schedule in Autosys, since most of the time, data is available by a certain time. However, lately, the data source has been late and we are not able to pull the data during the scheduled time and we have to manually run the shell script.
I want to have a shell script that runs the queries every hour and if the data exists, spools to a file to be loaded into the DB. If the data isn't there, then try again next hour so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having issues setting up the shell script only pull the data once it is available and not repeat the next hour if the file has already been spooled. I am new to UNIX so I am not sure how to proceed. Any help would be great.
You haven't stated your priority clearly. The priorities could be:
load the data as soon as it is available
load the data at least once every x minutes or hours
eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
increase the frequency of the Autosys job (instead of hourly, may be make it once in 30 or 15 minutes)
change the script so that:
it attempts to load only if it has been x minutes since last successful load, otherwise does nothing and ends in success
stores the last successful load timestamp in a file (which would be touched only upon a successful load)
if data doesn't arrive even after x + some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.

Resources