Spark DataFrame: JOIN two datasets and de-dup the records by a key and latest timestamp of a record - hadoop

I need some help with an efficient way to JOIN two datasets and de-duplicate the records by a key and the latest timestamp of a record.
Use case: run a daily incremental refresh for each table and provide a snapshot of the extract every day.
For each table we get a daily incremental file of about 150 million records, which has to be de-duplicated against a full-volume history file of about 3 billion records. The de-dup process needs to run on a composite primary key and keep the latest record by timestamp; every record contains the key and a timestamp. The files are available in ORC and Parquet format, and we are using Spark.
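One common approach, as a minimal sketch: union the history and the incremental data, then keep only the newest row per composite key with a window function. The paths and the column names key1, key2 and ts below are hypothetical stand-ins for the real schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val spark = SparkSession.builder().appName("daily-dedup").getOrCreate()

    // hypothetical paths; both inputs must share the same schema for union()
    val history     = spark.read.orc("/data/history")            // ~3 billion rows
    val incremental = spark.read.orc("/data/incremental/today")  // ~150 million rows

    // rank records within each composite key, newest timestamp first
    val w = Window.partitionBy(col("key1"), col("key2")).orderBy(col("ts").desc)

    val snapshot = history.union(incremental)
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .drop("rn")

    snapshot.write.mode("overwrite").orc("/data/snapshot/today")

The written snapshot can then serve as the history input for the next day's run. Whether this or a join-based approach is cheaper depends on the key distribution and how the 3-billion-row history is laid out on disk.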

Related

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads that data into Hadoop.
How do Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently knows the difference between time A and time B, updating only the rows that changed and adding the missing rows?
Yes, NiFi has the QueryDatabaseTable processor, which can store state and incrementally fetch the records that have been updated.
If your table has a date column that is updated whenever a record is updated, you can use that column in the Maximum-value Columns property; the processor will then pull only the changes made since the last state value (a sketch of the relevant properties follows the link below).
Here is a good article on the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html
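As a minimal configuration sketch, assuming a hypothetical table named orders with a last_updated column that is set on every insert and update (the property names are the processor's own):

    Database Connection Pooling Service : <your DBCPConnectionPool>
    Database Type                       : Generic
    Table Name                          : orders
    Maximum-value Columns               : last_updated

The processor keeps the largest last_updated value it has seen in its state, and on the next run it only fetches rows with a larger value.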

How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement:
Huge data is partitioned and inserted into Hive. To bind this data I am using DF.coalesce(10). Now I want to bind this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do it?
From what I understand, you are trying to ensure that there are fewer files per partition. By using coalesce(10) you will get at most 10 files per partition. I would suggest using repartition($"COL") instead, where COL is the column used to partition the data. This ensures that your "huge" data is split based on the partition column used in Hive: df.repartition($"COL")
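A minimal sketch of that suggestion, assuming a hypothetical table partitioned by a dt column; because repartition(col("dt")) puts all rows with the same dt into the same task, each Hive partition directory ends up with a single output file:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("compact-partitions").getOrCreate()

    val df = spark.read.orc("/data/huge_input")   // hypothetical input path

    df.repartition(col("dt"))          // co-locate each dt value in one task
      .write
      .mode("overwrite")
      .partitionBy("dt")               // directory layout: .../dt=2018-01-01/
      .orc("/warehouse/my_table")      // hypothetical output path

coalesce(1) would instead funnel the whole dataset through a single task, which is where the performance hit comes from.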

Apache Nifi ExecuteSQL Processor

I am trying to fetch data from an Oracle database using the ExecuteSQL processor. Suppose there are 15 records in my Oracle database. When I run the ExecuteSQL processor, it runs continuously as a streaming process, stores all the records as a single file in HDFS, and then repeatedly does the same. As a result, many files end up in the HDFS location, each one re-fetching records that were already fetched from the Oracle DB, so the files all contain the same data. How can I make this processor fetch all the data from the Oracle DB once and store it as a single file, and then, whenever new records are inserted into the DB, ingest only those into the HDFS location?
Take a look at the QueryDatabaseTable processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
You will need to tell this processor one or more columns to use to track new records; this is the Maximum Value Columns property. If your table has a one-up id column you can use that: every time the processor runs it tracks the last id that was seen and starts from there on the next execution. A sketch of such a flow is shown below.
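One possible flow, as a sketch (the processors are standard NiFi processors; the table name, id column, and HDFS directory are hypothetical):

    QueryDatabaseTable (Table Name = MY_TABLE, Maximum-value Columns = ID)
        -> ConvertAvroToJSON   (optional; QueryDatabaseTable emits Avro)
        -> PutHDFS             (Directory = /data/my_table)

On the first run everything is pulled; after that only rows with an ID greater than the stored state value are emitted, so each new file written to HDFS contains only new records.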

Apache Sqoop Incremental import

I understand that Sqoop offers a couple of methods to handle incremental imports:
Append mode
lastmodified mode
Questions on Append mode:
Is append mode supported only when the check column is of integer data type? What if I want to use a date or a timestamp column but still only want to append to the data already in HDFS?
Does this mode mean that the new data is appended to the existing HDFS file, or that it picks up only the new data from the source DB, or both?
Let's say the check column is an id column in the source table. There already exists a row in the table where the id column is 100. The Sqoop import is run in append mode with last-value 50, so it imports all rows where id > 50. It is run again with last-value 150, but this time the row that had id 100 has been updated to 200. Would this row also be pulled?
Example: let's say there is a table called customers with one of the records as follows. The first column is the id.
100 abc xyz 5000
When the Sqoop job is run in append mode with last-value 50 for the id column, it would pull the above record.
Now the same record is changed and the id also gets changed (a hypothetical example) as follows:
200 abc xyz 6000
The question was: if you run the Sqoop command again, would it pull the above record as well?
Questions on lastmodified mode:
It looks like running Sqoop in this mode merges the existing data with the new data using 2 MR jobs internally. Which column does Sqoop use to compare the old and the new data for the merge process?
Can the user specify the column for the merge process?
Can more than one column be provided for the merge process?
Should the target-dir exist for the merge process to happen, so that Sqoop treats the existing target dir as the old dataset? Otherwise, how would Sqoop know what the old dataset to be merged is?
Answers for append mode (an example command follows this list):
Yes, it needs to be an integer.
Both.
The question is not clear.
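As a hedged sketch of an append-mode import (the connection string, credentials, table, and paths are hypothetical; --incremental, --check-column, and --last-value are standard Sqoop options):

    # hypothetical connection string, table, and target directory
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott -P \
      --table CUSTOMERS \
      --target-dir /data/customers \
      --incremental append \
      --check-column id \
      --last-value 50

Only rows with id greater than 50 are imported; when this is run as a saved Sqoop job, the last seen id is stored and reused automatically on the next run.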
Answers for lastmodified mode (example commands follow this list):
Incremental load with lastmodified does not merge the data; it is primarily used to pull updated and inserted data based on a timestamp.
The merge process is completely different. Once you have both the old data and the new data, you can merge the new data onto the old data into a different directory. You can see a detailed explanation here.
The merge process works with only one field.
The target-dir should not exist. The video covers the complete merge process.
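A hedged sketch of the two steps, with hypothetical connection details, paths, and codegen class/jar names; the flags themselves (--incremental lastmodified, --check-column, and the sqoop merge options --new-data, --onto, --target-dir, --jar-file, --class-name, --merge-key) are standard Sqoop options:

    # 1) pull rows inserted/updated since the last run (hypothetical values)
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott -P \
      --table CUSTOMERS \
      --target-dir /data/customers_delta \
      --incremental lastmodified \
      --check-column last_updated \
      --last-value "2018-01-01 00:00:00"

    # 2) merge the delta onto the previous full extract, keyed on the primary key
    #    CUSTOMERS.jar / CUSTOMERS are the record jar and class produced by sqoop codegen
    sqoop merge \
      --new-data /data/customers_delta \
      --onto /data/customers_full \
      --target-dir /data/customers_merged \
      --jar-file CUSTOMERS.jar \
      --class-name CUSTOMERS \
      --merge-key id

The merge key is a single column, which matches the answer above that the merge process works with only one field.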

INSERT INTO results in a new file

I am working with Hive on an external table in text format. I populate this table every hour, but I partition the table by month (the dataset is relatively small). Each hour I want to insert new data into some partitions.
The INSERT INTO clause results in the creation of a new file in the existing partition, alongside the files that contain the old data. This way, by the end of the month I will have around 700 small files in each partition.
Is there a way for Hive to append the data to the old file in the partition (without using UNION ALL on the old data)?
Unfortunately, this is impossible at the current time. Hopefully, with the file append patch gaining more traction these days, appending to existing files will eventually become a feature.
I see this as one of the major downsides of Hive, especially when you start dealing with much smaller inserts.
