I am trying to update our HDP architecture so that data residing in Hive tables can be accessed through REST APIs. What are the best approaches for exposing data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables and I want to expose some of that information through a REST API, so I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What are the options that I have here?
Thank you
You can very well use WebHDFS, which is basically a REST API over HDFS.
Please see documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
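As a rough sketch, reading a file over WebHDFS from Go looks something like this. The NameNode host, port (HDP 2.x typically serves WebHDFS on 50070), HDFS path and user name below are placeholders, so adjust them for your cluster:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholders: NameNode host/port, HDFS path and user name.
	// OPEN replies with a redirect to a DataNode; Go's http.Get follows it automatically.
	url := "http://namenode.example.com:50070/webhdfs/v1/apps/hive/warehouse/mytable/part-00000?op=OPEN&user.name=hive"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Stream the file contents to stdout.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
}
```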
The REST API gateway for the Apache Hadoop ecosystem is called Knox.
I would check it out before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and is manageable via Ambari.
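For illustration, the same kind of WebHDFS call routed through a Knox gateway might look like this in Go; the gateway host, the "default" topology name and the credentials are assumptions, and Knox normally listens on 8443 over TLS:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholders: Knox gateway host, topology name ("default") and credentials.
	url := "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp/data.csv?op=OPEN"

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	req.SetBasicAuth("myuser", "mypassword") // Knox authenticates at the gateway

	// Demo clusters often use self-signed certificates; do NOT skip verification in production.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```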
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through JDBC/ODBC drivers without requiring Hadoop configuration and binaries (other than those required for the drivers) on the client machines.
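As a sketch of what a non-Java client could look like, a community Go driver such as github.com/beltran/gohive can talk to HiveServer2 in HTTP transport mode. The host, port (10001 is a common HTTP-mode default), httpPath and the exact configuration fields are assumptions, so check the driver's documentation:

```go
package main

import (
	"context"
	"log"

	"github.com/beltran/gohive"
)

func main() {
	ctx := context.Background()

	// Assumed configuration for HiveServer2 in HTTP transport mode;
	// field names and defaults may differ between driver versions.
	conf := gohive.NewConnectConfiguration()
	conf.TransportMode = "http"
	conf.HTTPPath = "cliservice" // default hive.server2.thrift.http.path

	conn, err := gohive.Connect("hs2.example.com", 10001, "NONE", conf)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	cursor := conn.Cursor()
	defer cursor.Close()

	cursor.Exec(ctx, "SELECT name FROM mydb.mytable LIMIT 10")
	if cursor.Err != nil {
		log.Fatal(cursor.Err)
	}

	var name string
	for cursor.HasMore(ctx) {
		cursor.FetchOne(ctx, &name)
		if cursor.Err != nil {
			log.Fatal(cursor.Err)
		}
		log.Println(name)
	}
}
```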
I have a task which requires me to create a Go program to read from an HBase table.
HBase is installed on a MapR cluster.
Every other application (Java) uses a MapR client to connect to the MapR cluster so as to retrieve the data.
However, I am unable to find a way to connect to HBase from a Go application.
I have found an HBase package for Go, but it does not support integration with MapR.
It would be great if anyone could guide me in this situation.
I have also seen that MapR 6 and above have Go support through OJAI, but sadly, upgrading MapR is not an option.
Can someone advise me on how to proceed in this situation?
If you are actually running Apache HBase on MapR, then the Go package for HBase should work (assuming the versions match and so on).
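For example, with github.com/tsuna/gohbase (the most common native Go HBase client), a single-row read looks roughly like this; the ZooKeeper quorum, table and row key are placeholders, and whether this works against your MapR setup depends on the point above:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tsuna/gohbase"
	"github.com/tsuna/gohbase/hrpc"
)

func main() {
	// Placeholder ZooKeeper quorum; gohbase locates region servers through ZK.
	client := gohbase.NewClient("zk1.example.com")
	defer client.Close()

	// Fetch a single row from a table (table and key are placeholders).
	getReq, err := hrpc.NewGetStr(context.Background(), "products", "row-0001")
	if err != nil {
		log.Fatal(err)
	}

	res, err := client.Get(getReq)
	if err != nil {
		log.Fatal(err)
	}

	for _, cell := range res.Cells {
		fmt.Printf("%s:%s = %s\n", cell.Family, cell.Qualifier, cell.Value)
	}
}
```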
If you are actually using MapR DB Binary tables (which are roughly HBase compatible), the likely best approach would be to use the Thrift API or REST.
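For the REST route, the call pattern of the standard HBase REST gateway (often called Stargate) is just HTTP plus JSON, which is easy from Go. The gateway host/port and the table/row names below are placeholders; the port of the REST service on a MapR install depends on how it was set up, and cell values come back base64 encoded:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder gateway address, table and row key.
	url := "http://rest-gw.example.com:8080/products/row-0001"

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The JSON contains the row's cells with base64-encoded qualifiers and values.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```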
The OJAI lightweight client should work well in Go since it uses gRPC to talk to the underlying table (and thus gains a lot of portability). The problem in your case isn't so much that you would need to upgrade the platform as that the lightweight client only works with MapR DB JSON (the document-oriented version of MapR DB).
Ping me directly if you would like more information.
The Resource Manager REST API provides the status of all applications.
I'm curious to know where this information is actually stored.
Is it possible to get this information from HBase/Hive?
No, you cannot get this information from HBase or Hive, because the Resource Manager REST APIs return live data from in-memory data structures in the RM. The application logs are stored locally on the NodeManagers and in HDFS, and ZooKeeper maintains some state information that could be extracted independently of the RM, but that's all.
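If the goal is just to consume that live data somewhere else, you can poll the RM REST API yourself. A minimal sketch in Go (the ResourceManager host is a placeholder and 8088 is the usual default web port):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Only a few of the fields the RM returns per application.
type appsResponse struct {
	Apps struct {
		App []struct {
			ID          string  `json:"id"`
			Name        string  `json:"name"`
			State       string  `json:"state"`
			FinalStatus string  `json:"finalStatus"`
			Progress    float64 `json:"progress"`
		} `json:"app"`
	} `json:"apps"`
}

func main() {
	// Placeholder ResourceManager address.
	resp, err := http.Get("http://rm.example.com:8088/ws/v1/cluster/apps?states=RUNNING")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var payload appsResponse
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		log.Fatal(err)
	}

	for _, app := range payload.Apps.App {
		fmt.Printf("%s  %-20s  %s/%s  %.1f%%\n", app.ID, app.Name, app.State, app.FinalStatus, app.Progress)
	}
}
```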
Have you looked at Timeline Server v2? ATSv2 can store all application metrics, and it uses HBase as its storage backend. (Link: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
Check that ATSv2 is supported in your version of Hadoop.
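If ATSv2 is available, its timeline reader exposes a REST API as well. A hedged sketch follows: the default reader port (8198), the path layout and the cluster/application ids are assumptions based on the linked documentation, so verify them for your version:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// All placeholders: timeline reader host, assumed default port 8198,
	// cluster id and application id. Check the ATSv2 docs for your Hadoop version.
	url := "http://timeline-reader.example.com:8198/ws/v2/timeline/clusters/yarn-cluster/apps/application_1500000000000_0001"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // JSON entity describing the application
}
```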
I have some general questions about fiware-cosmos; apologies if they are basic, but I'm trying to understand the architecture and use of Cosmos.
I saw that you are planning to integrate Apache Spark into Cosmos. Do you have a roadmap or date for that to happen? What happens if I want to use Spark now?
Which Hadoop services can be used as sources? I think I read that Cosmos supports Cloudera CDH services and raw Hadoop services. What about Hortonworks or MapR?
I know that non-standard file systems, for instance MapR-FS, can be used with Hadoop; are options like this possible with Cosmos?
I also read that Cosmos "sits" on top of FIWARE, so Hadoop as a Service (HaaS) can be used and Hadoop clusters can be generated using OpenStack. However, I saw that people are referring to a shared FIWARE cloud. Does FIWARE run as a remote cloud? Can a local cloud be used on a customer site?
Is Cosmos the only Apache Hadoop/Spark solution on fiware.org?
Finally, if Cloudera CDH can be used with Cosmos, how does the Cloudera cluster manager fit into the mix? Can it still be used?
Sorry for all of the questions :)
Cosmos is the name of the Global Instance of the Big Data GE in FIWARE Lab. It is a shared Hadoop instance already deployed in the cloud, ready to be used by FIWARE users.
In fact, there are two instances: the "old" one, which serves a fairly old version of the Hadoop stack and whose entry point is cosmos.lab.fiware.org, and the "new" one, which is a pair of Hadoop clusters, one for data storage and another for data analysis; the entry points are storing.cosmos.lab.fiware.org and computing.cosmos.lab.fiware.org.
Of course, you can deploy any other Hadoop (or even Spark) instance on your own in the FIWARE cloud (or any other cloud, such as Amazon's).
Regarding Spark: it was initially in our plans to deploy it in FIWARE Lab (that's why it appears in the roadmap), but nowadays it is not clear whether it is going to be deployed.
A big question about using Hadoop or related technologies in a real web application.
I just want to find out how a web app can use HBase as its database. I mean, is that what big data apps actually do, or do they use normal databases and only use these sorts of technologies for analysis?
Is it OK to have an online store with an HBase database or something like this?
Yes, it is perfectly fine to have HBase as your backend.
Here is what I am doing to get this done (I have an online community and forum running on my website):
1. Writing C# code to access HBase using Thrift; it is very easy and simple to get this done. (Thrift is a cross-language binding platform; for HBase, only Java is a first-class citizen!)
2. Managing the HBase cluster (I have it on Amazon) using the Amazon EMI.
3. Using Ganglia to monitor HBase.
Some extra tips:
You can organize the web application like this:
You can set up your web servers on Amazon Web Services or IBM WebSphere.
You can set up your own HBase cluster using Cloudera, or again use Amazon EC2 here.
Communication between the web server and the HBase master node happens via the Thrift client.
You can generate Thrift code in your desired programming language.
Here are some links that helped me:
A) Thrift client
B) Filtering options
Along with this, I refer to the HBase Administration Cookbook by Yifeng Jiang and Lars George's HBase: The Definitive Guide when I don't find answers on the web.
The filtering options provided by HBase are fast and accurate. Let's say you use HBase to store your product details: you can have sub-stores, keep a column in your Product table that tells which store a product belongs to, and use filters to get the products for a specific store.
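As one illustration of server-side filtering (not necessarily the exact column layout described above): if product row keys are prefixed with the store id, a prefix-filtered scan returns only that store's products. The sketch below uses github.com/tsuna/gohbase; the quorum, table and key format are placeholders, and the scan API may differ between driver versions:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"github.com/tsuna/gohbase"
	"github.com/tsuna/gohbase/filter"
	"github.com/tsuna/gohbase/hrpc"
)

func main() {
	// Placeholder ZooKeeper quorum and table; row keys are assumed to look like "store42#product123".
	client := gohbase.NewClient("zk1.example.com")
	defer client.Close()

	// Only rows whose key starts with the given store prefix are returned,
	// so the filtering happens server side.
	f := filter.NewPrefixFilter([]byte("store42#"))
	scanReq, err := hrpc.NewScanStr(context.Background(), "products", hrpc.Filters(f))
	if err != nil {
		log.Fatal(err)
	}

	scanner := client.Scan(scanReq)
	for {
		res, err := scanner.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		for _, cell := range res.Cells {
			fmt.Printf("%s %s:%s = %s\n", cell.Row, cell.Family, cell.Qualifier, cell.Value)
		}
	}
}
```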
I think you should read the article below:
"Apache HBase Do’s and Don’ts"
http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/
I was reading about the integration below, which uses Hive to query data in DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
But as per that link, Hive needs to be set up on top of EMR. I wanted to know if I can use this integration with the standalone Hadoop cluster I already have, instead of using EMR. Has anyone done this? Will there be sync issues between the data in DynamoDB and HDFS compared to using EMR?
To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB (it probably involves a custom SerDe as well).
It does not seem to be available at the moment, at least not on the AWS website.
What you can do is use the JDBC interface provided by Amazon to issue the queries from your cluster, but they would still be executed on top of EMR.