How to save large files in HDFS and links in HBase? - hadoop

I have read that it is recommended to store files larger than 10 MB in HDFS and keep the path to each file in HBase. Is there a recommended approach for doing this? Are there any specific configurations or tools, like Apache Phoenix, that can help us achieve this?
Or must it all be done manually from the client: saving the data in HDFS, saving the location in HBase, then reading the path from HBase and finally reading the data from HDFS at that location?
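As far as I know there is no tool that wires the two systems together automatically; the usual pattern is exactly the manual flow described above: write the blob to HDFS, then record its path (plus any metadata) in an HBase cell. Below is a minimal sketch of that flow using the stock CLIs; the table name 'filemeta', column family 'cf', and the /data/blobs layout are assumptions, and in practice the same steps are usually done from client code (e.g. the HDFS FileSystem and HBase Table Java APIs).

    # upload the large file to HDFS (directory layout is an assumption)
    hdfs dfs -mkdir -p /data/blobs
    hdfs dfs -put ./report-2017-01.pdf /data/blobs/report-2017-01.pdf

    # record only the HDFS path in HBase ('filemeta' table and 'cf' family are assumptions)
    echo "create 'filemeta', 'cf'" | hbase shell
    echo "put 'filemeta', 'report-2017-01', 'cf:hdfs_path', '/data/blobs/report-2017-01.pdf'" | hbase shell

    # later: look up the path in HBase, then fetch the file itself from HDFS
    echo "get 'filemeta', 'report-2017-01', 'cf:hdfs_path'" | hbase shell
    hdfs dfs -get /data/blobs/report-2017-01.pdf ./report-2017-01.pdf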

Related

Is it possible to configure ClickHouse data storage to be HDFS?

Currently, ClickHouse stores data under the /var/lib/clickhouse path, and I've read that it doesn't have support for deep storage.
By the way, does it have any configuration options for setting up HDFS in the config.xml file?
Storing the ClickHouse data directory on HDFS is a really bad idea ;)
Because HDFS is not a POSIX-compatible file system, ClickHouse will be extremely slow with that deployment variant.
You can use https://github.com/jaykelin/clickhouse-hdfs-loader to load data from HDFS into ClickHouse, and in the near future (see https://clickhouse.yandex/docs/en/roadmap/) ClickHouse may support the Parquet format for loading data.
ClickHouse has its own solutions for high availability and clustering; please read
https://clickhouse.yandex/docs/en/operations/table_engines/replication/ and https://clickhouse.yandex/docs/en/operations/table_engines/distributed/
@MajidHajibaba: ClickHouse was initially designed for data locality; it assumes you have local disks and reads data from local disk as fast as possible.
Three years later, S3 and HDFS as remote data storage with local caching are a well-implemented approach.
See https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-s3 for details (look at the cache_enabled and cache_path options),
and https://clickhouse.com/docs/en/operations/storing-data/#configuring-hdfs
The HDFS engine provides integration with the Apache Hadoop ecosystem by allowing you to manage data on HDFS via ClickHouse. This engine is similar to the File and URL engines, but provides Hadoop-specific features.
https://clickhouse.yandex/docs/ru/operations/table_engines/hdfs/
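For reference, a minimal sketch of the HDFS table engine mentioned above, run through clickhouse-client; the namenode address, path, table name, and TSV format are assumptions, and whether the engine is available depends on your ClickHouse version and build:

    # create a table whose data lives on HDFS rather than on the local disk
    clickhouse-client --query "
      CREATE TABLE hdfs_events (name String, value UInt32)
      ENGINE = HDFS('hdfs://namenode:9000/clickhouse/events.tsv', 'TSV')"

    # write to and read from it like any other table
    clickhouse-client --query "INSERT INTO hdfs_events VALUES ('a', 1)"
    clickhouse-client --query "SELECT * FROM hdfs_events"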

What is the difference between HBASE and HDFS in Hadoop?

What is the actual difference, and when should we use one or the other when data needs to be stored?
Please read this post for a good explanation. But in general, HBase runs on top of HDFS. HDFS is a distributed file system just like any other file system (Unix/Windows), and HBase is like a database that reads from and writes to that file system, just like any other database (MySQL, MSSQL).
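A rough illustration of the difference in access patterns (the table and file names are made up): HDFS gives you file-level operations, while HBase gives you random, key-based reads and writes on top of files it keeps in HDFS.

    # HDFS: whole-file / streaming operations, no in-place updates or key lookups
    hdfs dfs -put users.csv /data/users.csv
    hdfs dfs -cat /data/users.csv

    # HBase: random reads and writes addressed by row key
    echo "put 'users', 'user42', 'cf:email', 'someone@example.com'" | hbase shell
    echo "get 'users', 'user42'" | hbase shell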

Ingesting data files into HDFS

I have terabytes of CSV files that I need to ingest into HDFS. The files reside on an application server; I can FTP the data to an edge node and use any of the methods below.
HDFS CLI (-put)
Mounting HDFS
Using ETL tools
I was wondering which method would be best in terms of performance.
Please suggest.
I remember facing a similar situation in one of my previous projects. We followed the approach of mounting HDFS, which allows users to transfer files easily from the local system. I found the links below, which might help you (see the sketch of both options after them).
Mounting HDFS - Stack Overflow
HDFS NFS Gateway
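For concreteness, a sketch of the two simplest options once the files are on the edge node; the paths, gateway host, and mount options are assumptions, and the NFS Gateway must already be running for the second variant:

    # option 1: plain HDFS CLI from the edge node
    hdfs dfs -mkdir -p /landing/csv
    hdfs dfs -put /data/incoming/*.csv /landing/csv/

    # option 2: mount HDFS through the NFS Gateway and use ordinary file tools
    sudo mount -t nfs -o vers=3,proto=tcp,nolock edge-node:/ /mnt/hdfs
    cp /data/incoming/*.csv /mnt/hdfs/landing/csv/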

Flume and sqoop limitation

I have a terabyte of data files on different machines that I want to collect on a centralized machine for some processing. Is it advisable to use Flume?
The same amount of data is in an RDBMS, which I would like to put into HDFS. Is it advisable to use Sqoop to transfer a terabyte of data? If not, what would be the alternative?
Using Sqoop to transfer a few terabytes from an RDBMS to HDFS is a great idea, highly recommended. This is Sqoop's intended use case, and it does so reliably.
Flume is mostly intended for streaming data, so if the files all contain events and you get new files frequently, then Flume with a Spooling Directory source can work.
Otherwise, "hdfs dfs -put" is a good way to copy files to HDFS.
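A sketch of what the two tools look like in practice; the JDBC URL, credentials, table names, directories, and parallelism (-m) are assumptions to adapt to your environment:

    # Sqoop: bulk transfer of a table from the RDBMS into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl -P \
      --table orders \
      --target-dir /landing/orders \
      -m 8

Flume's Spooling Directory source is configured in the agent's properties file; a hypothetical spool-agent.conf might look like:

    # watch a local directory and ship completed files into HDFS
    a1.sources = src1
    a1.channels = ch1
    a1.sinks = sink1
    a1.sources.src1.type = spooldir
    a1.sources.src1.spoolDir = /data/incoming
    a1.sources.src1.channels = ch1
    a1.channels.ch1.type = file
    a1.sinks.sink1.type = hdfs
    a1.sinks.sink1.hdfs.path = /landing/events
    a1.sinks.sink1.channel = ch1

and is started with:

    flume-ng agent --name a1 --conf-file spool-agent.conf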

How to export metadata from Cloudera

I'm just beginning with Hadoop-based systems and I am working in Cloudera 5.2 at the moment. I am trying to get metadata out of HDFS/Hive and into some other software. When I say metadata I mean stuff like:
- for Hive: database schema and table schema
- for HDFS: the directory structure in HDFS, creation and modification times, owner and access controls
Does anyone know how to export the table schemas from Hive into a table or CSV file?
It seems that the Hive EXPORT function doesn't support exporting only the schema. I found the Pig DESCRIBE function, but I'm not sure how to get the output into a table-like structure; it seems to be available only on screen.
Thanks
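One low-tech way to get the Hive schemas out is to script the Hive CLI. A sketch is below; the database name analytics is an assumption, and the output is one plain-text description per table (DESCRIBE FORMATTED also prints the location, owner, and creation time):

    # dump the schema of every table in one database to local files
    for t in $(hive -S -e 'USE analytics; SHOW TABLES;'); do
      hive -S -e "USE analytics; DESCRIBE FORMATTED $t;" > "schema_$t.txt"
    done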
Cloudera Navigator can be used to manage/export metadata from HDFS and Hive. The Navigator Metadata Server periodically collects the cluster's metadata information and provides a REST API for retrieving metadata information. More details at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-navigator/v2-latest/Cloudera-Navigator-Installation-and-User-Guide/cnui_metadata_arch.html.
I'm not familiar with Hive, but you can also extract HDFS metadata by:
Fetching the HDFS fsimage: "hdfs dfsadmin -fetchImage ." (this downloads the most recent fsimage from the NameNode into the given local directory).
Processing the fetched fsimage using the OfflineImageViewer: "hdfs oiv -p XML -i ./fsimage_N -o ./fsimage.xml" (where fsimage_N is the file fetched in the previous step).
More information about the HDFS OIV at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html.
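As a follow-up, newer Hadoop releases also ship a Delimited output processor for the OfflineImageViewer, which produces exactly the kind of table-like listing (path, modification time, permissions, owner, and so on) asked for above; whether it is available in the Hadoop version bundled with CDH 5.2 is something to check. A sketch, with the fsimage filename as a placeholder:

    # fetch the current fsimage into the working directory (requires HDFS superuser)
    hdfs dfsadmin -fetchImage .

    # render it as a delimited, per-path listing that can be loaded into a spreadsheet or Hive
    hdfs oiv -p Delimited -delimiter ',' -i ./fsimage_0000000000000000001 -o ./hdfs_metadata.csv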
