I have created an external Hive table that points to an HBase table. I understand that HBase stores multiple versions of a column.
My understanding is that a Hive query over HBase will fetch only the latest version of a column.
Is there a way I can specify which version of a column should be retrieved (something like the HBase VERSIONS or TIMERANGE clause)?
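For reference, my mapping looks roughly like the sketch below; the table and column names are just placeholders, not my real schema.

-- hypothetical external Hive table over an existing HBase table
CREATE EXTERNAL TABLE hbase_events (
  rowkey STRING,
  val    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "events");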
From the Hive HBase integration documentation:
There is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
So no, sorry, it doesn't look like there is currently an easy way to do this. It looks like you might have to write your own custom InputFormat and/or SerDe to accomplish it.
I want to read current and previous versions of HBase data from either Hive or Impala. In my initial research, I found that only the current version can be accessed from Hive. So, is there currently any way to retrieve the older versions from either Hive or Impala?
In the case of Hive:
Please see this
It seems it is not possible to get different versions of the same cell in Hive (even though HBase keeps multiple versions of the same cell); Hive always returns the cell with the latest timestamp. I believe we can handle this in a tactical way: append the version to the HBase row key, or store the previous version as a separate cell (name, value), as sketched below.
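For example, a rough sketch of the separate-cell approach, assuming the whole column family is exposed to Hive as a MAP (table, column family, and qualifier names here are hypothetical):

-- expose column family cf as a Hive MAP so every "version" cell is visible in one row
CREATE EXTERNAL TABLE hbase_versions (
  rowkey STRING,
  vals   MAP<STRING, STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:")
TBLPROPERTIES ("hbase.table.name" = "my_table");

-- each logical version is then just another map key, e.g. value_v1, value_v2
SELECT rowkey, vals['value_v1'], vals['value_v2'] FROM hbase_versions;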
In the case of Impala:
Please see the limitations section.
Hi everyone,
I am new to the Hadoop world and I have a problem with an HBase join.
I have two clusters: cluster A's HBase has an employee table, and cluster B's HBase has a department table.
So, how do I join employee and department?
Do I need to install Hive?
If the tables are in two separate clusters, you'll need to get one of the HBase tables from one cluster to the other. This can be done via Sqoop.
From there, you could, in theory, use Phoenix as suggested by Vignesh I in the comments; however, there are some limitations. You would need to create a Phoenix view of each of those HBase tables. Phoenix views over native HBase tables currently do not automatically update if the underlying tables are updated outside of Phoenix, which most native HBase tables would be. This effectively makes Phoenix views of native HBase tables snapshots rather than views; you will need to rebuild any indexes on a regular basis (and potentially stats as well) in order to capture any updates to the underlying HBase tables.
There is a JIRA open to enhance this behavior so that it would auto update, but the ETA of such a feature is unknown at this time.
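For what it's worth, mapping an existing HBase table into Phoenix as a view looks roughly like the sketch below; the table, column family, and column names are assumptions for illustration only.

-- read-only Phoenix view over an existing native HBase table "employee" (column family cf)
CREATE VIEW "employee" (
  pk             VARCHAR PRIMARY KEY,
  "cf"."name"    VARCHAR,
  "cf"."dept_id" VARCHAR
);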
What I would recommend, unless you have very specific real-time needs (in which case Phoenix, if you could live with the view limitations, may be the better choice), is to use Pig.
Within the Pig script, you can join the two HBase tables and then perform various transformations.
Hive would be another option, but in that case, you would need to sqoop both tables from HBase into Hive, and then proceed from there within Hive.
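Once both tables have been landed in Hive, the join itself is plain HiveQL; a minimal sketch, assuming hypothetical table and column names:

-- join the two tables after they are both available in Hive
SELECT e.emp_id, e.emp_name, d.dept_name
FROM   employee   e
JOIN   department d
  ON   e.dept_id = d.dept_id;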
I was trying to use Hive to query the tables I saved with saveAsTable() on a Spark DataFrame. Everything works well when I query through hiveContext.sql(). However, when I switch to Hive and describe the table, the schema shows up as something like col, array, and the table is no longer queryable.
Any ideas how to work through this? Is there a reliable way to make Hive understand the metadata defined in Spark, instead of explicitly defining the schema?
Sometimes I use Spark to infer the schema from the raw data, or to read it from file formats like Parquet, so I don't want to hand-define tables whose schema could be inferred automatically.
Thanks a lot for any advice!
From my understanding, HBase is the Hadoop database and Hive is the data warehouse.
Hive allows you to create tables and store data in them; you can also map your existing HBase tables to Hive and operate on them.
Why should we use HBase if Hive does all that? Can we use Hive by itself?
I'm confused :(
So, in simple terms: with Hive you can fire SQL-like queries (with some exceptions) on your tables, and it is used for batch operations, while HBase lets you do real-time querying and is based on key-value pairs.
"Why should we use HBase if Hive does all that? Can we use Hive by itself?" Because Hive doesn't support updating your data set. So if you have a large analytical processing application, use Hive; if you have real-time get/set/update request processing, use HBase.
I am wondering if there is a way to get previous versions of a particular rowkey in HBase without having to write a MapReduce program and average the values out. I was curious whether this was possible using Hive or Impala (or another similar program) and how you would do this.
My table looks like this:
Composite key       | Value
(md5 + date + id)   | (value)
I'd like to average all the values for a particular date and a substring of the id ("411"), across all versions.
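Conceptually, I want to express something like the sketch below; the column names are hypothetical, assuming the parts of the composite key were available as separate columns (or could be parsed out of the row key):

-- what I'd like to compute, ignoring for a moment that older HBase versions aren't visible to Hive
SELECT dt, AVG(value) AS avg_value
FROM   my_table
WHERE  dt = '2015-01-01'
  AND  id LIKE '%411%'      -- or substr(id, ...) = '411', depending on where "411" sits
GROUP BY dt;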
Thanks ahead of time.
Impala uses the Hive metastore to map its logical notion of a table onto data physically stored in HDFS or HBase (for more details, see the Cloudera documentation).
To learn more about how to tell the Hive metastore about data stored in HBase, see the Hive documentation.
Unfortunately, as noted in the Hive documentation linked above:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp
There was some work done to add this feature against an older version of Hive in HIVE-2828, though unfortunately that work has not yet been merged into trunk.
So for your application you'll have to redesign your HBase schema to include a "version" column, tell the Hive metastore about this new column, and make your application aware of this column.
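As a rough sketch of what that redesign might look like from the Hive side (table, column family, and column names below are assumptions, and in practice the version would probably also need to be baked into the HBase row key so that each version lands in its own row):

-- Hive mapping of the redesigned HBase table with an explicit version column
CREATE EXTERNAL TABLE readings (
  rowkey  STRING,   -- md5 + date + id (+ version) composite key
  version INT,
  value   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:version,cf:value")
TBLPROPERTIES ("hbase.table.name" = "readings");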