I'm just beginning with Hadoop-based systems and am currently working in Cloudera 5.2. I am trying to get metadata out of HDFS/Hive and into some other software. By metadata I mean things like:
- for Hive: database schema and table schema
- for HDFS: the directory structure in HDFS, creation and modification times, owner and access controls
Does anyone know how to export the table schemas from Hive into a table or CSV file?
It seems that the Hive EXPORT function doesn't support exporting only the schema. I found the Pig DESCRIBE function, but I'm not sure how to get its output into a table-like structure; it seems to be available only on screen.
Thanks
Cloudera Navigator can be used to manage/export metadata from HDFS and Hive. The Navigator Metadata Server periodically collects the cluster's metadata information and provides a REST API for retrieving metadata information. More details at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-navigator/v2-latest/Cloudera-Navigator-Installation-and-User-Guide/cnui_metadata_arch.html.
I'm not familiar with Hive, but you can also extract HDFS metadata by:
Fetching the HDFS fsimage: "hdfs dfsadmin -fetchImage ./fsimage"
Processing the fsimage with the OfflineImageViewer, selecting the XML processor with -p: "hdfs oiv -p XML -i ./fsimage -o ./fsimage.out"
More information about the HDFS OIV at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html.
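To get the OIV output into a table-like form, the XML can be flattened into CSV. The following is only a minimal sketch: it assumes the XML contains an INodeSection whose inode elements have child elements named id, type, name, mtime, atime and permission (with permission in the form owner:group:mode), which should be checked against the output of your Hadoop version. The function name inodes_to_csv is made up for this example. It writes one row per inode and does not rebuild full paths, which would additionally require the parent/child information in the directory section.

```python
import csv
import xml.etree.ElementTree as ET

# Assumed child element names of each <inode>; verify against your own fsimage.out.
FIELDS = ["id", "type", "name", "mtime", "atime", "permission"]


def inodes_to_csv(xml_source, csv_target):
    """Write one CSV row per <inode> found inside <INodeSection>.

    xml_source: a filename or a binary file object with the OIV XML output.
    csv_target: a writable text file object.
    Returns the number of inode rows written.
    """
    writer = csv.writer(csv_target)
    writer.writerow(FIELDS)
    in_inode_section = False
    count = 0
    # iterparse streams the document, so a large image is not loaded at once.
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if elem.tag == "INodeSection":
            # Other sections may reuse the <inode> tag for child references,
            # so only elements inside INodeSection are treated as inode records.
            in_inode_section = event == "start"
        elif event == "end" and elem.tag == "inode" and in_inode_section:
            writer.writerow([elem.findtext(f, default="") for f in FIELDS])
            count += 1
            elem.clear()  # release the children that were already written
    return count


# Possible usage, with the file produced by the "hdfs oiv" command above:
# with open("fsimage.csv", "w", newline="") as out:
#     inodes_to_csv("./fsimage.out", out)
```

The mtime and atime values are written as found in the XML; missing fields are left empty rather than guessed.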
Related
I have read that it is recommended to store files larger than 10 MB in HDFS and to store the path of each file in HBase. Is there a recommended approach for doing this? Are there any specific configurations or tools, like Apache Phoenix, that can help us achieve this?
Or must all of it (saving the data in HDFS, saving the location in HBase, reading the path back from HBase, and then reading the data from HDFS at that location) be done manually from the client?
I am new to Hadoop and Hive. I am trying to use:
hadoop distcp -overwrite hdfs://source_cluster/apps/hive/warehouse/test.db hdfs://destination_cluster/apps/hive/warehouse/test.db
This command runs without any error, but I still can't see test.db on the target HDFS cluster.
You've copied files, but haven't modified the Hive metastore that actually registers table information.
If you want to copy between clusters, I suggest looking into a tool called Circus Train. Otherwise, use SparkSQL to interact with the HiveServer of both clusters rather than using HDFS-only tooling.
After copying files and directories, you need to recreate the tables (run the DDL) on the destination so that information about those tables appears in its metastore.
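One way to recreate the tables is to take the DDL that SHOW CREATE TABLE prints on the source cluster, point its LOCATION at the destination, and run it there. The snippet below is only a sketch of that rewriting step: retarget_ddl is a hypothetical helper, and it assumes the DDL contains a LOCATION clause followed by a single-quoted URI. Partitioned tables would likely also need their partitions registered afterwards, for example with MSCK REPAIR TABLE.

```python
import re


def retarget_ddl(ddl, source_prefix, destination_prefix):
    """Return ddl with the LOCATION URI moved from source_prefix to destination_prefix.

    Only URIs that directly follow a LOCATION keyword and start with
    source_prefix are changed; everything else is returned untouched.
    """
    pattern = re.compile(r"(LOCATION\s+')" + re.escape(source_prefix), re.IGNORECASE)
    # A function is used as the replacement so destination_prefix is inserted literally.
    return pattern.sub(lambda m: m.group(1) + destination_prefix, ddl)


# Example with the cluster names from the distcp command; the table is made up.
source_ddl = (
    "CREATE TABLE `events`(`id` bigint)\n"
    "LOCATION\n"
    "  'hdfs://source_cluster/apps/hive/warehouse/test.db/events'"
)
print(retarget_ddl(source_ddl, "hdfs://source_cluster", "hdfs://destination_cluster"))
```

The rewritten statement can then be executed against the destination cluster's Hive so that its metastore learns about the copied files.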
I have seen some conflicting posts across the web about whether or not Hive uses HCatalog to access the metastore, and I was hoping someone could help me out here.
Does Hive use the actual HCatalog APIs to access the metastore, or does it have its own mechanism for retrieving metadata, with HCatalog only used by non-Hive tools to access the metadata?
No, Hive doesn't use the HCatalog API to access the metastore.
HCatalog opens up the Hive metadata to other MapReduce tools. Every MapReduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). With a table-based abstraction, tools that support HCatalog do not need to care about where the data is stored, in which format, or in which storage system (HBase or HDFS).
What is the actual difference, and when should one be used rather than the other when data needs to be stored?
Please read this post for a good explanation. In general, HBase runs on top of HDFS. HDFS is a distributed file system, comparable to any other file system (Unix/Windows), and HBase is a database that reads from and writes to that file system, just like any other database (MySQL, MSSQL).
I am looking into using Hive on our Hadoop cluster and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about some things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop (in terms of hard disk space and so on)?
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive and just on Hadoop directly?
Thanks in advance for answering my questions :)
First things first: files are stored in Hadoop Distributed File System (HDFS). Is that what you call Data manager?
Actually, Hive can use both: "regular" files in HDFS, or tables, which are once again "regular" files (kept by default under the Hive warehouse directory in HDFS) with additional metadata stored in a separate datastore, the metastore.
As for Presto, it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more info about Hive connector configuration here and about connector plugins here.