How to read HBase current and previous versions of data from Hive - hadoop

I want to read all the data from an HBase table using Hive.
I should be able to read all the current and previous versions of the data from HBase.

You can specify the number of versions to retrieve for a Scan or a Get, and HBase will return them:
HTable cTable = new HTable(config, tableName);
Get get = new Get(Bytes.toBytes(key));
// set the number of versions that you want to fetch
get.setMaxVersions(verNo);
Result fetchRow = cTable.get(get);
// family -> qualifier -> timestamp -> value, covering every version returned
NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> allVersions = fetchRow.getMap();
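A Scan can be configured the same way for a full-table read. Here is a minimal sketch, reusing cTable and verNo from the snippet above (by default a Scan returns only the newest version of each cell):
Scan scan = new Scan();
// fetch up to verNo versions per cell instead of only the latest
scan.setMaxVersions(verNo);
ResultScanner scanner = cTable.getScanner(scan);
for (Result row : scanner) {
  // family -> qualifier -> timestamp -> value for every version returned
  NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> versions = row.getMap();
}
scanner.close();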
Note: versioning is disabled by default when a table is created, so you need to enable it.
create 'employee',{NAME=>"myname",VERSIONS=>2},'office' // versioning is enabled for column family "myname" (2 versions); column family "office" keeps the default
describe 'employee' // shows the versioning information
alter 'employee',NAME=>'office',VERSIONS=>4 // alter an existing column family
// Put to and scan the table - the scan will show both the new and the old values
put 'employee','1','myname:name','Jigyasa1'
put 'employee','1','myname:name','Jigyasa2'
put 'employee','1','office:name','Add1'
put 'employee','1','office:name','Add2'
scan 'employee',{VERSIONS=>10}
For HBase-Hive integration, follow the reference link:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Related

Update one column and set all values to 1 using Entity Framework

How does one update one column to set the value to 1 for all rows? My table has three columns: customerno, name and doneflag. I want to set Doneflag to 1 for all rows; currently I have 8 rows in my table and all of them have a Doneflag of 0. I want to do a single update that sets the doneflag for all rows to 1 using Entity Framework. This is simple in MySQL, where it would be:
update myDB.Customer
set doneflag = 1;
I tried this, but it does not work:
context.Customer.Add(x => x.Doneflag = 1);
You need to load the existing records, modify them and then save the changes, e.g.
var notYetDoneCustomers = await context.Customers.Where(c => c.DoneFlag == 0).ToListAsync();
foreach(var cust in notYetDoneCustomers) {
cust.DoneFlag = 1;
}
await context.SaveChangesAsync();
The standard Entity Framework flow is:
Load data
Modify it
Save changes
For bulk updates, you have several options:
Use the standard flow
Use an add-on library, see: Entity Framework Core(7) bulk update
Use SQL: https://learn.microsoft.com/en-us/ef/core/querying/raw-sql
This topic is covered in the docs here: https://learn.microsoft.com/en-us/ef/core/performance/efficient-updating
Unfortunately, EF doesn't currently provide APIs for performing bulk updates. Until these are introduced, you can use raw SQL to perform the operation where performance is sensitive:
Using a library like Entity Framework Plus has the benefit that you stay in a type-safe world, which you would lose if you performed raw SQL.
Another alternative:
Install linq2db.EntityFrameworkCore (disclaimer: I'm one of the creators)
context.Customers
    .Where(c => c.DoneFlag == 0)
    .Set(c => c.DoneFlag, 1)
    .Update();
In EF Core 6 you can use await myDB.Database.ExecuteSqlRawAsync or myDB.Database.ExecuteSqlRaw with any valid UPDATE, INSERT or DELETE SQL command, for example:
var commandText = "UPDATE Customer SET doneflag = 0";
await _context.Database.ExecuteSqlRawAsync(commandText);
This will update all records in Customer to doneflag = 0. Remember to parameterize user input to prevent SQL injection.

Perform Incremental Load On Qlik Sense

New to Qlik Sense.
I would like to perform incremental insert, update and delete. Through research I managed to write this script:
//This fetches deleted records
SELECT `sale_detail_auto_id`
FROM `iprocure_ods.deleted_records` as dr
INNER JOIN `iprocure_ods.saledetail` sd ON sd.sale_detail_auto_id = dr.identifier AND dr.type = 2
WHERE dr.delete_date > TIMESTAMP('$(vSaleTransactionsRunTime)');
//This fetches new and updated records
[sale_transactions]:
SELECT *
FROM `iprocure_edw.sale_transactions`
WHERE `server_update_date` > TIMESTAMP('$(vSaleTransactionsRunTime)');
Concatenate([sale_transactions])
LOAD *
FROM [lib://qlikPath/saletransactions.qvd] (qvd) Where Not Exists(`sale_detail_auto_id`);
//This part updates runtime dates
MaxUpdateDate:
LOAD Timestamp(MAX(`server_update_date`), '$(TimestampFormat)') As maxServerUpdateDate
FROM [lib://qlikPath/saletransactions.qvd] (qvd);
Let vSaleTransactionsRunTime = peek('maxServerUpdateDate', 0, MaxUpdateDate);
DROP Table MaxUpdateDate;
New and updated records work fine. The problem is that the deleted records come back as rows that are empty except for the sale_detail_auto_id column.
How can I fetch the data from saletransactions.qvd that is not in the deleted records?
In the first SELECT you load the sale_detail_auto_id field, which also exists under the same field name in the new and updated records, so the deleted ids end up in the same field as the new ones. You need to rename that column to avoid the conflict.
Use AS, for example:
`sale_detail_auto_id` AS `deleted_sale_detail_auto_id`
and then use that field in Exists:
Where Not Exists(deleted_sale_detail_auto_id, sale_detail_auto_id);
UPDATED:
Additionally, it does not make sense to keep the deleted ids in the data model, so you can label that table:
[TEMP_deleted_ids]:
SELECT `sale_detail_auto_id` AS `deleted_sale_detail_auto_id`
and then remove it at the end of the script:
DROP Table [TEMP_deleted_ids];

UPSERT in Memsql from another table

I am trying this query to insert some records from one table into another when the records do not already exist in the target table, but I am getting the following error. What is the best query to UPSERT in MemSQL from another table?
Query:
INSERT INTO ema.device_set
(segment_0, segment_1, segment_2, segment_3, segment_4, last_updated)
SELECT tmp.segment_0, tmp.segment_1, tmp.segment_2, tmp.segment_3, tmp.segment_4, tmp.last_updated
FROM ema.tmp_device_set tmp
WHERE NOT EXISTS (
SELECT *
FROM ema.device_set tab
WHERE tmp.segment_0 = tab.segment_0 and tmp.segment_1 = tab.segment_1 and tmp.segment_2 = tab.segment_2 and tmp.segment_3 = tab.segment_3 and tmp.segment_4 = tab.segment_4
);
error:
Partition has no master instance or Leaf Error: The database will be available to query in 2 seconds after recovery from disk is finished.
That error message means your nodes are down or recovering from disk. It has nothing to do with the specific UPSERT you are trying to do.
Check to make sure your query does not violate any of the MemSQL INSERT...SELECT rules shown at the following link:
https://docs.memsql.com/docs/insert

Spark not able to retrieve all HBase data in a specific column

My HBase table has 30 million records; each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging from a few KB to 50 MB. When I run the following Spark code, it can only get 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd:RDD[Map[String,Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the id list first and then iterating over that list to fetch the column raw:sample with the plain HBase Java client inside a Spark foreach.
Any idea why I cannot get all of the column raw:sample through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one of my datanodes went down, but I fixed it quickly since the replication factor is 3. Could this be the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark) configured to retrieve the same column from HBase, and see if the error repeats:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from scan result
for (Result result = scanner.next(); result != null; result = scanner.next())
System.out.println("Found row : " + result);
//closing the scanner
scanner.close();
In addition, by default, TableInputFormat is configured to request a very small chunk of data from the HBase server (which is bad and causes a large overhead). Set the following to increase the chunk size:
scan.setCacheBlocks(false); // keep the full scan out of the block cache
scan.setCaching(2000);      // fetch more rows per RPC
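Since the Spark snippet above never builds a Scan object directly, the same two settings can also be passed through the configuration handed to newAPIHadoopRDD. A minimal sketch, assuming your HBase version exposes the SCAN_CACHEDROWS and SCAN_CACHEBLOCKS constants on org.apache.hadoop.hbase.mapreduce.TableInputFormat:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181");
conf.set(TableInputFormat.INPUT_TABLE, "sampleData");
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample");
// equivalent of scan.setCaching(2000): rows fetched per RPC
conf.set(TableInputFormat.SCAN_CACHEDROWS, "2000");
// equivalent of scan.setCacheBlocks(false): keep the full scan out of the block cache
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");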
For a high throughput like yours, Apache Kafka is the best solution for integrating the data flow and keeping the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some use cases of Kafka.
One more:
http://sites.computer.org/debull/A12june/pipeline.pdf

How to set autoflush=false in HBase table

I have this code that saves to an HBase HTable. The expected behavior is that the table pushes the commits, or "flushes" the puts, to HBase for each partition.
NOTE: This is the updated code
rdd.foreachPartition(p => {
  val table = connection.getTable(TableName.valueOf(HTABLE))
  val mutator = connection.getBufferedMutator(TableName.valueOf(HTABLE))
  p.foreach(row => {
    val hRow = new Put(rowkey)
    hRow.addColumn....
    // use table.exists instead of table.checkAndPut (in favor of BufferedMutator's flushCommits)
    val exists = table.exists(new Get(rowkey))
    if (!exists) {
      hRow.addColumn...
    }
    mutator.mutate(hRow)
  })
  table.close()
  mutator.flush()
  mutator.close()
})
In HBase 1.1, HTable is deprecated and there is no flushCommits() available in org.apache.hadoop.hbase.client.Table.
Replacing puts with BufferedMutator.mutate(put) is OK for normal puts, but BufferedMutator does not have anything like Table's checkAndPut.
In the new API, BufferedMutator is used.
You could change Table t = connection.getTable(TableName.valueOf("foo")) to BufferedMutator t = connection.getBufferedMutator(TableName.valueOf("foo")), and then change t.put(p); to t.mutate(p);
It works for me!
There was little information about this when I was searching, even in the official documentation. I hope my answer is helpful, and that someone can update the documentation.
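Here is a minimal sketch of that pattern with the HBase 1.x client API; the table name "foo", column family "cf", qualifier "q" and values are placeholders, not the poster's actual schema:
Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf("foo"));
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
mutator.mutate(put); // buffered on the client, not sent yet
mutator.flush();     // pushes everything buffered so far to the region servers
mutator.close();     // close() also flushes any remaining buffered mutations
connection.close();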
You need to set autoFlush to false; see section 11.7.4
in http://hbase.apache.org/0.94/book/perf.writing.html
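For reference, a minimal sketch of the older 0.94-era HTable API that this answer refers to (deprecated and removed in later client versions; table and column names are placeholders):
HTable table = new HTable(config, "foo");
table.setAutoFlush(false); // buffer puts on the client instead of sending one RPC per put
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
table.put(put);       // goes into the client-side write buffer
table.flushCommits(); // explicitly push the buffered puts to the region servers
table.close();        // close() also flushes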
You don't need to do anything, since you DON'T want to buffer puts on the client side. By default, the HBase client does not buffer puts on the client side.
Explicit calls to flushCommits() are only required when the client controls when data is sent to the HBase RegionServers.
