Apache Sqoop Incremental import - hadoop

I understand that Sqoop offers a couple of methods to handle incremental imports:
Append mode
lastmodified mode
Questions on Append mode:
Is append mode supported only when the check column is of integer data type? What if I want to use a date or a timestamp column but still want to only append to the data already in HDFS?
Does this mode mean that the new data is appended to the existing HDFS files, or that it picks only the new data from the source DB, or both?
Let's say the check-column is an id column in the source table, and there already exists a row where the id is 100. The Sqoop import is run in append mode with last-value 50, so it imports all rows where id > 50. Before the next run (with last-value 150), the row whose id was 100 is updated so that its id becomes 200. Would this row also be pulled?
Example: let's say there is a table called customers with one of the records as follows (the first column is the id):
100 abc xyz 5000
When the Sqoop job is run in append mode with last-value 50 for the id column, it would pull the above record.
Now the same record is changed and the id also gets changed (a hypothetical example, though) as follows:
200 abc xyz 6000
The question is: if you run the Sqoop command again, would it pull the above record as well?
Questions on lastmodified mode:
It looks like running Sqoop in this mode would merge the existing data with the new data using 2 MR jobs internally. What column does Sqoop use to compare the old and the new data for the merge process?
Can the user specify the column for the merge process?
Can more than one column be provided for the merge process?
Should the target-dir exist for the merge process to happen, so that Sqoop treats the existing target dir as the old dataset? Otherwise, how would Sqoop know which dataset is the old one to be merged?

Answers for append mode:
Yes, it needs to be an integer column.
Both: it picks only the new rows from the source DB and appends them to the data already in HDFS (see the example command after these answers).
Question is not clear.
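A hedged sketch of what such an append-mode import might look like (the JDBC URL, credentials, table name, and paths are placeholders, not taken from the original question):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --table customers \
  --target-dir /data/customers \
  --incremental append \
  --check-column id \
  --last-value 50

Rows with id greater than 50 are pulled from the source table and written as new files under the existing target directory; at the end of the run Sqoop prints the new last-value to use for the next run.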
Answers for lastmodified mode:
Incremental load with lastmodified does not merge data; it is primarily for pulling updated and inserted data using a timestamp column.
The merge process is completely different. Once you have both the old data and the new data, you can merge the new data onto the old data into a different directory (a sketch of the sqoop merge command follows these answers). You can see a detailed explanation here.
The merge process works with only one field.
The target-dir should not exist. The video covers the complete merge process.
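A hedged sketch of that standalone sqoop merge step (all paths, the jar, and the class name are placeholders; the jar and class come from a previous sqoop import or sqoop codegen run):

sqoop merge \
  --new-data /data/customers_increment \
  --onto /data/customers_base \
  --target-dir /data/customers_merged \
  --merge-key id \
  --jar-file customers.jar \
  --class-name customers

Consistent with the note above, --target-dir is the output of the merge and should not already exist; for each value of the merge key, the record from --new-data wins over the one from --onto.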

Related

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads that data into Hadoop.
How does Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently knows the difference between time A and time B, so that it only updates the rows that changed and adds the missing rows?
Yes, NiFi has a QueryDatabaseTable processor which can store state and incrementally fetch the records that were updated.
If your table has a date column that is updated whenever a record changes, you can use that column in the Maximum-value Columns property; the processor will then pull only the changes made since the last stored state value.
Here is a good article on the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html

Hive partitioned column doesn't appear in rdd via sc.textFile

The Hive partition column is not part of the underlying saved data. I need to know how it can be pulled via the sc.textFile(filePath) syntax and loaded into an RDD.
I know the other way of creating a HiveContext and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax and use it.
When you partition the data by a column while saving, that column's values are stored in the file structure (the directory names) and not in the actual files. Since sc.textFile(filePath) is made for reading single files, I do not believe it supports reading partitioned data.
I would recommend reading the data as a dataframe, for example:
val df = hiveContext.read.format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used; then you get a tuple of (file path, file data), and from that it should be possible to parse out the partition column value and add it back to each row, for example:
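A rough sketch of that approach, assuming the usual colA=value directory layout that Spark/Hive produce (the column name and paths are illustrative):

import org.apache.spark.SparkContext

// sc is an existing SparkContext.
// wholeTextFiles returns (filePath, fileContents) pairs, so the partition
// value can be parsed out of the directory name, e.g. ".../colA=foo/part-00000".
val partitioned = sc.wholeTextFiles("path/to/table/*/*")
  .flatMap { case (path, contents) =>
    // Extract "foo" from the "colA=foo" directory component of the path.
    val colA = path.split("/")
      .find(_.startsWith("colA="))
      .map(_.stripPrefix("colA="))
      .getOrElse("")
    // Re-attach the partition value to every line of the file.
    contents.split("\n").map(line => (colA, line))
  }

Note that wholeTextFiles loads each file fully into memory on a single task, so this is only practical when the individual part files are reasonably small.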
If storage size is not a problem, an alternative solution is to store the information of the partitioned column twice: once in the file structure (done by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving it to file. Say the column in question is named colA:
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.

Spark data frame, JOIN two datasets and De-dup the records by a key and latest timestamp of a record

I need some help with an efficient way to JOIN two datasets and de-dup the records by a key and the latest timestamp of a record.
Use case: need to run a daily incremental refresh for each table and provide a snapshot of the extract every day.
For each table I get a daily incremental file (150 million records) that needs to go through a de-dup process against a full-volume history file (3 billion records). The de-dup process needs to run on a composite primary key and keep the latest record by timestamp; every record contains the key and a timestamp. Files are available in ORC and Parquet format, using Spark.
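A minimal sketch of one common approach, assuming Spark 2.x, hypothetical columns key1/key2 (the composite key) and ts (the timestamp), and illustrative paths: union the history and the daily increment, then keep the newest row per key with a window function.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("daily-dedup").getOrCreate()

// Illustrative paths; both inputs must have the same schema for union().
val history   = spark.read.orc("path/to/history")          // full-volume history
val increment = spark.read.orc("path/to/daily_increment")  // daily delta

// Newest record wins: partition by the composite key, order by timestamp descending.
val byKeyLatestFirst = Window.partitionBy(col("key1"), col("key2")).orderBy(col("ts").desc)

val snapshot = history.union(increment)
  .withColumn("rn", row_number().over(byKeyLatestFirst))
  .filter(col("rn") === 1)
  .drop("rn")

snapshot.write.mode("overwrite").orc("path/to/snapshot")

row_number keeps exactly one row per key; if ties on ts are possible, add a tiebreaker column to the orderBy.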

Overwriting HBase id

What happens when I add a duplicate entry to an HBase table? I happened to see an updated timestamp on the column. Is there any property in HBase with options to avoid/allow overwriting while adding to the table?
The HBase client uses PUT to perform both insert and update of a row. Based on the key supplied, if the row key doesn't exist it inserts, and if it does exist it updates. An HBase update means adding another version of the row with the latest data and timestamp. A read (get) returns the data with the latest timestamp by default, unless a timestamp is specified (PUT is an idempotent method). So I don't think there is any property to avoid overwriting. You could probably use a prePut coprocessor to customize the behavior; check out the HBase API documentation for more on coprocessors (package org.apache.hadoop.hbase.coprocessor):
https://hbase.apache.org/apidocs/index.html
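A small client-side sketch of that versioning behavior (table, column family, and column names are illustrative; HBase 1.x+ client API assumed):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("customers"))

// First write: row key "row-100" gets value "5000".
val put1 = new Put(Bytes.toBytes("row-100"))
put1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("balance"), Bytes.toBytes("5000"))
table.put(put1)

// "Duplicate" write to the same row key: HBase simply stores a newer
// version of the cell with a later timestamp; nothing is rejected.
val put2 = new Put(Bytes.toBytes("row-100"))
put2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("balance"), Bytes.toBytes("6000"))
table.put(put2)

// A plain Get returns the latest version, i.e. "6000".
val result = table.get(new Get(Bytes.toBytes("row-100")))
println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("balance"))))

table.close()
connection.close()

How many old versions are retained per cell is governed by the column family's VERSIONS setting, not by any "no overwrite" switch.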

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that you are referring to column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job with the input path and the field positions. Consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data, as in the sketch below.
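A minimal mapper sketch along those lines (the class name and the chosen field positions are assumptions, not from the original answer):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Illustrative only: extracts the 2nd and 4th comma-separated fields of each
// line of the exported CSV and emits them as key/value pairs.
class SelectedFieldsMapper extends Mapper[LongWritable, Text, Text, Text] {

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    val fields = value.toString.split(",", -1)
    if (fields.length >= 4) {
      // fields(1) and fields(3) are assumed column positions; adjust to your schema.
      context.write(new Text(fields(1)), new Text(fields(3)))
    }
  }
}

The driver would pair this mapper with TextInputFormat (the default) and set the input path to the table's CSV file, e.g. /user/userName/dbName/tables/table1.csv.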
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use a BigTable approach, which means converting your database into one big table.
