Real-time database in the Hadoop ecosystem

Pardon me if this is a silly question.
I have Cloudera Manager installed on a single node.
I am trying to use HBase and Hadoop for logging requests and responses in my web application.
I am also trying to list the latest user activity using the log.
Rows are added using the table structure below:
1 column family, a row ID, and 11 columns. I store every value as a string. Fairly simple and similar to a MySQL table.
RowId
entry:addedTime
entry:value
entry:ip
entry:accessToken
entry:identifier
entry:userId
entry:productId
entry:object
entry:requestHeader
entry:completeDate
entry:tag
Now, in order to get rows from HBase, I use
SingleColumnValueFilter("entry", "userId", "=", binary:"25", true, true)
Now I am struggling to order the results by entry:completeDate descending and to limit them to 25 rows for pagination or infinite scroll.
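For reference, this is roughly what my scan looks like with the HBase Java client (the table name user_activity and the connection details are just placeholders); as far as I understand, a PageFilter only caps the row count and rows still come back in row-key order, not sorted by entry:completeDate:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

public class LatestUserActivity {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_activity"))) { // hypothetical table name

            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            // Same predicate as the shell filter above: entry:userId = "25"
            SingleColumnValueFilter byUser = new SingleColumnValueFilter(
                    Bytes.toBytes("entry"), Bytes.toBytes("userId"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("25"));
            byUser.setFilterIfMissing(true);
            filters.addFilter(byUser);
            // Stop after roughly 25 matching rows (applied per region server), for pagination
            filters.addFilter(new PageFilter(25));

            Scan scan = new Scan();
            scan.setFilter(filters);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // Rows arrive in row-key order, not by entry:completeDate
                    System.out.println(Bytes.toString(r.getRow()) + " -> " +
                            Bytes.toString(r.getValue(Bytes.toBytes("entry"), Bytes.toBytes("completeDate"))));
                }
            }
        }
    }
}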
My questions:
Is HBase the only real-time query database available in the Hadoop ecosystem?
Am I using HBase for the wrong reasons? Is my table structure correct?
I work at a startup and these are our baby steps toward moving to big data. Big data has created a lot of hype, but Hadoop is poorly supported on the latest Linux distributions and looks overly complicated.
Any help or suggestions would be appreciated.
Many thanks,
Karthik

Related

Cognos reporting on a Hive data source is very slow

I am new to Cognos and trying to create reports on top of Hadoop using the Hive JDBC driver. I'm able to connect to Hive through JDBC and can generate reports, but the reports run very slowly. I did the same job connecting to DB2, with the same data as in Hadoop, and those reports ran very quickly compared to the ones on top of Hive. I'm using the same data sets in both Hadoop and DB2, but I can't figure out why the reports on top of Hadoop are so slow. I installed Hadoop in pseudo-distributed mode and connected through JDBC.
These are the versions of the software I used:
IBM Cognos 10.2.1 with fix pack 11,
Apache Hadoop 2.7.2,
Apache Hive 0.12.
They are installed on different systems: Cognos on Windows 7 and Hadoop on Red Hat.
Can anyone tell me where I might have gone wrong in setting up Cognos or Hadoop? Is there any way to speed up report run times in Cognos on top of Hadoop?
When you say you installed Hadoop in pseudo-distributed mode, are you saying you are only running it on a single server? If so, it's never going to be as fast as DB2. Hadoop and Hive are designed to run on a cluster and scale. Get 3 or 4 servers running in a cluster and you should find that you start to see some impressive query speeds over large datasets.
Check that you have allowed the Cognos Query Service to access more than the default amount of memory for its Java heap (http://www-01.ibm.com/support/docview.wss?uid=swg21587457). I currently run an initial size of 8 GB and a max of 12 GB, but still manage to blow through this occasionally.
The next issue you will run into is that Cognos doesn't know Hive SQL specifics (or Impala, which is what I am using). This means that any non-basic query is going to be converted to a SELECT ... FROM and maybe a GROUP BY. The big missing piece will be the WHERE clause, which means Cognos will try to pull all the data out of the Hive table and then do the filtering in Cognos, rather than push that work down to Hive where it belongs. Cognos knows how to write DB2 SQL and all its specifics, so it can pass that workload through.
More complex queries and platform-specific functions (date functions, analytic functions, etc.) will generally not get passed to Hive, so try to structure your data and queries so that these are not required in filters.
Use the Hive query logs to monitor the queries that Cognos is running. Also try things such as adding a field to the query and then dragging that field into the filter, rather than dragging it directly from the model into the filter. I have found this can help get Cognos to include the filter in a WHERE clause.
The other option is to use pass-through SQL queries in Report Studio and just write it all in Hive's SQL. I have just done this for a set of dashboards that required a stack of top 5s from a fact table with 5 million rows. For 5 rows, Cognos was extracting all 5 million rows and then ranking them within Cognos. Do this a number of times and all of a sudden Cognos is going to struggle. With a pass-through query I could use the Impala RANK() function and get back only 5 rows, which is much, much faster, and faster than what DB2 would manage, given that I am running on a proper (but small) cluster.
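To illustrate, a pass-through query along these lines (the fact table, columns, and JDBC details are made up; Impala is reached over the HiveServer2 protocol) lets Impala do the ranking and return only the top rows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopFiveViaImpala {
    public static void main(String[] args) throws Exception {
        // Hypothetical fact table sales_fact(branch, product, amount); only the top 5 per branch come back.
        String sql =
            "SELECT branch, product, total FROM (" +
            "  SELECT branch, product, SUM(amount) AS total," +
            "         RANK() OVER (PARTITION BY branch ORDER BY SUM(amount) DESC) AS rnk" +
            "  FROM sales_fact GROUP BY branch, product" +
            ") ranked WHERE rnk <= 5";

        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, and auth settings are placeholders for an unsecured Impala daemon.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/default;auth=noSasl", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s %s %.2f%n", rs.getString(1), rs.getString(2), rs.getDouble(3));
            }
        }
    }
}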
Another consideration with Hive is whether you are using Hive on MapReduce or Hive on Tez. From what a colleague has found, Hive on Tez is much faster at the type of queries Cognos runs than Hive on MapReduce.
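If you want to experiment with Tez without changing cluster-wide defaults, the engine can be switched per session. A rough sketch over JDBC (the host and table names are placeholders, and Tez must already be installed on the cluster):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOnTezSession {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Run this session's queries on Tez instead of MapReduce
            stmt.execute("SET hive.execution.engine=tez");
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM checkout_fact")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }
}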

How to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to report on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
Hadoop works in append-only mode, so we cannot run Hive UPDATE/DELETE queries. We can perform INSERT OVERWRITE on dimension tables and add delta values to the fact tables, but introducing thousands of delta rows daily does not seem like a great solution.
Are there any better standard ways to update modified data in Hadoop?
Thanks
HDFS might be append-only, but Hive does support updates from 0.14 onward.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
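A minimal sketch of what that looks like over JDBC (the table and columns are made up; Hive ACID updates need a bucketed ORC table flagged as transactional, plus transaction support enabled on the server side):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidUpdate {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // ACID tables must be bucketed, stored as ORC, and flagged transactional (Hive 0.14+)
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS customer_dim (" +
                "  customer_id BIGINT, name STRING, status STRING) " +
                "CLUSTERED BY (customer_id) INTO 8 BUCKETS " +
                "STORED AS ORC TBLPROPERTIES ('transactional'='true')");
            // In-place update instead of rewriting the whole partition
            stmt.execute("UPDATE customer_dim SET status = 'closed' WHERE customer_id = 42");
        }
    }
}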
A design pattern is to take all your previous and current data and insert it into a new table every time.
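If ACID tables are not an option, that rebuild pattern can be sketched as a join between the current snapshot and the daily delta (all table and column names here are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RebuildSnapshot {
    public static void main(String[] args) throws Exception {
        // Delta rows win; rows untouched by today's delta are carried over unchanged.
        String rebuild =
            "CREATE TABLE customer_snapshot_new AS " +
            "SELECT * FROM (" +
            "  SELECT customer_id, name, status FROM customer_delta " +
            "  UNION ALL " +
            "  SELECT s.customer_id, s.name, s.status FROM customer_snapshot s " +
            "  LEFT JOIN customer_delta d ON s.customer_id = d.customer_id " +
            "  WHERE d.customer_id IS NULL" +
            ") merged";

        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(rebuild);
            // Swap the new snapshot in; the old one can be dropped afterwards
            stmt.execute("ALTER TABLE customer_snapshot RENAME TO customer_snapshot_old");
            stmt.execute("ALTER TABLE customer_snapshot_new RENAME TO customer_snapshot");
        }
    }
}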
Depending on your use case, have a look at Apache Impala/HBase/... or even Drill.

Processing a very large dataset in real time in Hadoop

I'm trying to understand how to architect a big data solution. I have 400 TB of historical data, and every hour 1 GB of data is inserted.
Since the data is confidential, I'm describing a sample scenario. The data contains information on all activities in a bank branch. Every hour, when new data is inserted (no updates) into HDFS, I need to find how many loans were closed, loans created, accounts expired, etc. (around 1,000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but I have been advised to use HBase. Reading through all the documentation, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600 TB?
1. MR for analytics and Impala/Hive for queries
2. Spark for analytics and queries
3. HBase + MR for analytics and queries
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS; it uses HDFS to store its data.
Basically, HBase allows you to update records, keep versions, and delete single records. HDFS does not support file updates, so HBase introduces what you can think of as "virtual" operations, merging data from multiple sources (original files, delete markers) when you ask it for data. Also, as a key-value store, HBase creates indices to support selecting by key.
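A small sketch of those single-record operations with the HBase Java client (the table and columns are made up, and reading several versions assumes the column family keeps more than one, i.e. VERSIONS > 1):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRecordOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("accounts"))) { // hypothetical table

            byte[] row = Bytes.toBytes("account-42");
            byte[] cf = Bytes.toBytes("d");

            // "Update" = write a new version of the cell; old versions stay around until compaction/TTL
            table.put(new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("open")));
            table.put(new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("closed")));

            // Read back several versions of the same cell
            Get get = new Get(row).setMaxVersions(3);
            Result result = table.get(get);
            for (Cell cell : result.getColumnCells(cf, Bytes.toBytes("status"))) {
                System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
            }

            // Delete a single record: HBase writes a tombstone marker, the HDFS files are not modified
            table.delete(new Delete(row));
        }
    }
}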
Your problem:
When choosing a technology in such situations, you should look at what you are going to do with the data: a single query on Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark), while Spark will be faster for batch jobs when caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I did not try Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark produced the reports for the pre-defined queries (after which the data was stored in object files, not human-readable but very fast to read back), while Impala handled custom queries.
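For illustration only, a rough sketch of the Spark side of that pattern (the paths, field positions, and metric are invented); the pre-computed result is written as an object file that the serving side can load quickly:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class HourlyLoanReport {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HourlyLoanReport");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical CSV layout: branchId,eventType,...
            JavaRDD<String> events = sc.textFile("hdfs:///data/bank/activity/*");

            JavaPairRDD<String, Long> closedLoansPerBranch = events
                    .filter(line -> line.split(",")[1].equals("LOAN_CLOSED"))
                    .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1L))
                    .reduceByKey(Long::sum);

            // Pre-computed report stored as a (binary) object file for fast reads later
            closedLoansPerBranch.saveAsObjectFile("hdfs:///reports/loans_closed_per_branch");
        }
    }
}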
Hope it helps at least a little.

Hive/Impala select and average all rowkey versions

I am wondering if there is a way to get previous versions of a particular row key in HBase without having to write a MapReduce program, and then average the values. I was curious whether this is possible using Hive or Impala (or another similar tool) and how you would do it.
My table looks like this:
Composite key       | Value
(md5 + date + id)   | (value)
I'd like to average all the values for a particular date and a substring of the id ("411"), across all versions.
Thanks ahead of time.
Impala uses the Hive metastore to map its logical notion of a table onto data physically stored in HDFS or HBase (for more details, see the Cloudera documentation).
To learn more about how to tell the Hive metastore about data stored in HBase, see the Hive documentation.
Unfortunately, as noted in the Hive documentation linked above:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp
There was some work done to add this feature against an older version of Hive in HIVE-2828, though unfortunately that work has not yet been merged into trunk.
So for your application you'll have to redesign your HBase schema to include a "version" column, tell the Hive metastore about this new column, and make your application aware of this column.
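To illustrate the end state, here is a sketch that assumes you redesign the HBase table so every version is written as its own row (with the version also exposed as a column), map it into Hive with the HBase storage handler, and then average over a date and an id substring. All names, key offsets, and column mappings are assumptions about your schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AverageAcrossVersions {
    public static void main(String[] args) throws Exception {
        // Maps a hypothetical HBase table "measurements" (family cf, columns value/version) into Hive
        String ddl =
            "CREATE EXTERNAL TABLE IF NOT EXISTS hbase_measurements (" +
            "  rowkey STRING, value DOUBLE, version STRING) " +
            "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
            "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value,cf:version') " +
            "TBLPROPERTIES ('hbase.table.name' = 'measurements')";

        // Row key assumed to be md5 (32 chars) + date (8 chars, yyyyMMdd) + id, with the version keeping rows distinct
        String query =
            "SELECT AVG(value) FROM hbase_measurements " +
            "WHERE substr(rowkey, 33, 8) = '20160101' " +
            "AND substr(rowkey, 41) LIKE '%411%'";

        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    System.out.println("average = " + rs.getDouble(1));
                }
            }
        }
    }
}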

Basic questions about Hadoop and Hive

I have started working with Hadoop recently. There is a table named Checkout that I access through Hive. Below are the HDFS paths the data is loaded to, along with some other info. What information can I get from reading the three lines below?
Path                                                Size      Record Count     Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00   1.13 TB   9,294,245,800    2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00   1.13 TB   9,290,477,963    2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00   1.12 TB   9,286,199,847    2012-07-03 07:08
So my questions are:
1) Firstly, we load the data into HDFS and then query it through Hive to get results back, right?
2) Secondly, looking at the paths above, the thing I am confused about is: when I query using Hive, will I get data from all three paths above, or only from the most recent one at the top?
As I am new to this, I am having a lot of trouble. Can anyone explain where Hive gets its data from? Do we store all the data in HDFS and then use Hive or Pig to get it back from HDFS? It would also be great if someone could give a high-level overview of Hadoop and Hive.
I think you need to understand the difference between Hive's native (managed) tables and Hive's external tables.
A Hive native table means that you load data into Hive and it takes care of how the data is stored in HDFS. We usually do not care about the directory structure in this case.
A Hive external table means that we put data in some directory (setting partitioning aside for the moment) and tell Hive: this is the table's data, please treat it as such. Hive then lets us query it and join it with other external or regular tables, and it is our responsibility to add data, delete it, and so on.
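A minimal sketch of the two flavours over JDBC (the database, paths, and columns are made up): for the native table, Hive owns the warehouse directory and LOAD DATA moves the file in; for the external table, the data stays where you put it and Hive just points at the directory.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NativeVsExternal {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Native (managed) table: Hive decides where the data lives under its warehouse directory;
            // DROP TABLE deletes the data too.
            stmt.execute("CREATE TABLE checkout_managed (id BIGINT, amount DOUBLE) " +
                         "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            stmt.execute("LOAD DATA INPATH '/staging/checkout/2012-07-04.csv' INTO TABLE checkout_managed");

            // External table: the data already sits in a directory we manage ourselves;
            // DROP TABLE only removes the metadata, the files stay.
            stmt.execute("CREATE EXTERNAL TABLE checkout_external (id BIGINT, amount DOUBLE) " +
                         "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
                         "LOCATION '/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00'");
        }
    }
}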
