Databricks - Auto-increment identity - How to insert only the changes from the latest CDF version

I have an autoloader table processing a mount point with CSV files.
After each run, I would like to insert some of the records into another table where I have an AutoIncrement Identity column set up.
I can rerun the entire insert and this works, but I am trying to only insert the newest records.
I have CDF enabled, so I should be able to determine the latest version or keep track of the versions already processed, but it seems like I am missing some built-in feature of Databricks.
Any suggestions or sample to look at?

Note - Delta change data feed is available in Databricks Runtime 8.4 and above.
You can read the change events in batch queries using SQL and DataFrame APIs (that is, df.read), and in streaming queries using DataFrame APIs (that is, df.readStream).
Enable CDF
%sql
ALTER TABLE silverTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
Any suggestions or sample to look at?
You can find a sample notebook here
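As a rough sketch of that pattern (the target table name and the key/value column names below are assumptions, not from the question): read only the changes since the last version you processed with the table_changes function and insert them, leaving the identity column out so it keeps auto-generating.
%sql
-- Assumes version 5 was the last CDF version already processed; persist this
-- watermark yourself (e.g. in a small control table) after each run.
INSERT INTO targetTable (key, value)        -- identity column omitted so it auto-increments
SELECT key, value
FROM table_changes('silverTable', 6)        -- changes from version 6 onward
WHERE _change_type IN ('insert', 'update_postimage');
A common approach is to store max(_commit_version) from each run and pass that value plus one as the starting version the next time.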

Related

Nifi Fetching Data From Oracle Issue

I have a requirement to fetch data from Oracle and upload it into Google Cloud Storage.
I am using the ExecuteSQL processor, but it is failing for large tables, and even for a table with about 1 million records (roughly 45 MB) it takes 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which hands them to ExecuteSQL. I can't use QueryDatabaseTable because the set of tables is dynamic, and the calls that start the fetch are also triggered dynamically from a UI through the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about the capability to produce smaller flow files and possibly send them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836), and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file), and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you to handle multiple tables as you described). GenerateTableFetch generates the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
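To make the paging concrete, the statements GenerateTableFetch emits look roughly like the following (illustrative only; the exact syntax depends on the Database Type adapter you configure, and MY_TABLE, ID, and the partition size of 10000 are assumptions):
-- One flow file per page; each statement can be run independently by ExecuteSQL
-- (Oracle 12+ style paging shown here).
SELECT * FROM MY_TABLE ORDER BY ID OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY
SELECT * FROM MY_TABLE ORDER BY ID OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY
-- ...and so on until the whole table has been covered.
Because each page arrives as its own flow file, downstream processors (such as the upload to Google Cloud Storage) can start working before the whole table has been read.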

Overwriting HBase id

What happens when I add a duplicate entry to an HBase table? I happened to see an updated timestamp on the column. Is there any property in HBase that allows or prevents overwriting when adding to the table?
The HBase client uses PUT to perform both inserts and updates of a row. Based on the key supplied, if the row key doesn't exist it inserts; if it does exist, it updates. An HBase update means adding another version of the row with the latest data and timestamp. A read (GET) returns the data with the latest timestamp by default, unless a timestamp is specified (PUT is an idempotent method). So I don't think there is any property to avoid overwriting. You could probably use a prePut co-processor to customize the behavior; check the HBase API documentation for more on co-processors (package org.apache.hadoop.hbase.coprocessor).
https://hbase.apache.org/apidocs/index.html

Need to export data from cassandra 1.2 for a demo

I have to transfer some data from an older Cassandra 1.2 instance to a demo instance that has personal information anonymized.
I discovered the COPY command, which seems to work, but I see no option to specify a limit. I'd like to take only something like one year's worth of data, but there seems to be no way to specify that.
What I have now is working, but it's dumping the entire contents of the tables, which is way more than I need.
export data
COPY my_keyspace.ThingEventLog( key, column1 , value ) to 'ThingEventLog.csv';
import data
COPY my_keyspace.ThingEventLog( key, column1 , value ) from 'ThingEventLog.csv';
Thanks for any other ideas
Unfortunately, MAXOUTPUTSIZE is not supported as a COPY option until Cassandra 2.0 and later. The only data limitation Cassandra 1.2 lets you specify is by column. While it's more data than you need, at least it reads and writes data incredibly quickly.
http://www.datastax.com/dev/blog/simple-data-importing-and-exporting-with-cassandra
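For reference, on newer versions of cqlsh the option mentioned above is spelled like this (the 1,000,000-line value is just illustrative; it caps the number of lines per output file, splitting the export into segments beyond that):
COPY my_keyspace.ThingEventLog (key, column1, value) TO 'ThingEventLog.csv' WITH MAXOUTPUTSIZE = 1000000;
On 1.2, restricting the column list in the COPY statement, as you are already doing, remains the only built-in way to shrink the dump.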

Bulk Update of Particular fields In Hbase

I have a scenario I ran into while working on HBase. Initially I had to bulk-upload a CSV file to an HBase table, which I did successfully using HBase bulk loading.
Now I want to update a particular field in the HBase table by comparing it against a newly provided CSV, and if the value has changed, maintain a flag that says the row key was updated. Any hints on how I can do this easily?
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key, you have a handle on the row, and you can simply use Put to add the updated column. Internally HBase maintains the versions, so you also have access to the history of the updated values.
However, you also need to do the comparison, as I can see. So after bulk loading, the fastest way to do it is a MapReduce job with HBase as both source and sink. Look at section 7.2.2 here.
The idea is to have MapReduce perform the scan, do the comparison in the map phase, and write the new updated Put as output. It is a basic fetch, modify, and update sequence, but it uses MapReduce's parallelism because we are dealing with a large amount of data.

Hive/Impala select and average all rowkey versions

I am wondering if there is a way to get previous versions of a particular rowkey in HBase without having to write a MapReduce program and average the values out. I was curious whether this was possible using Hive or Impala (or another similar program) and how you would do this.
My table looks like this:
Composite keys Values
(md5 + date + id) | (value)
I'd like to average all the values for the particular date and a sub string of the id ("411") for all versions.
Thanks ahead of time.
Impala uses the Hive metastore to map its logical notion of a table onto data physically stored in HDFS or HBase (for more details, see the Cloudera documentation).
To learn more about how to tell the Hive metastore about data stored in HBase, see the Hive documentation.
Unfortunately, as noted in the Hive documentation linked above:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp
There was some work done to add this feature against an older version of Hive in HIVE-2828, though unfortunately that work has not yet been merged into trunk.
So for your application you'll have to redesign your HBase schema to include a "version" column, tell the Hive metastore about this new column, and make your application aware of this column.
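One way to carry out that redesign is to fold the version into the row key itself, so every version lands in its own HBase row. Assuming that layout (the HBase table name, column family cf, and column names below are all assumptions), a Hive-side sketch could look like this:
-- Expose the redesigned HBase table to Hive (and thus to Impala via the shared metastore).
CREATE EXTERNAL TABLE thing_values (
  rowkey STRING,   -- md5 + date + id + version
  value  DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value')
TBLPROPERTIES ('hbase.table.name' = 'thing_values');

-- Average every stored version for one date and ids containing "411".
SELECT avg(value)
FROM thing_values
WHERE rowkey LIKE '%2015-06-01%411%';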
