Is it possible to configure ClickHouse data storage to be HDFS (Hadoop)?

Currently, ClickHouse stores data under the
/var/lib/clickhouse
path, and I've read that it doesn't have support for deep storage.
By the way, does it have any configuration for an HDFS setup in the config.xml file?

Storing the ClickHouse data directory on HDFS is a really BAD idea ;)
Because HDFS is not a POSIX-compatible file system, ClickHouse will be extremely slow in this deployment variant.
You can use https://github.com/jaykelin/clickhouse-hdfs-loader to load data from HDFS into ClickHouse, and in the near future (see https://clickhouse.yandex/docs/en/roadmap/) ClickHouse may support the Parquet format for loading data.
ClickHouse has its own solutions for high availability and clustering;
please read
https://clickhouse.yandex/docs/en/operations/table_engines/replication/ and https://clickhouse.yandex/docs/en/operations/table_engines/distributed/
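As a rough sketch of that approach (the cluster, table, and column names below are placeholders, not something from those docs), a replicated-plus-distributed setup looks like this:

    -- one replicated table per shard; {shard} and {replica} come from macros in config.xml
    CREATE TABLE events_local ON CLUSTER my_cluster
    (
        event_date Date,
        user_id    UInt64,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id);

    -- a Distributed table that fans reads and writes out over all shards of the cluster
    CREATE TABLE events AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, rand());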

#MajidHajibaba
ClickHouse was initially designed for data locality; it assumes you have local disks and that data will be read from the local disk as fast as possible.
Three years later, S3 and HDFS as remote data storage with local caching is a well-implemented approach.
See https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-s3 for details,
look at the cache_enabled and cache_path options,
and https://clickhouse.com/docs/en/operations/storing-data/#configuring-hdfs
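A rough sketch of what the "Configuring HDFS" page above describes in config.xml (the endpoint, disk name, and policy name are placeholders):

    <clickhouse>
      <storage_configuration>
        <disks>
          <hdfs_disk>
            <type>hdfs</type>
            <endpoint>hdfs://namenode:9000/clickhouse/</endpoint>
            <!-- the cache_enabled / cache_path options mentioned above are also set per disk,
                 depending on your ClickHouse version -->
          </hdfs_disk>
        </disks>
        <policies>
          <hdfs_only>
            <volumes>
              <main>
                <disk>hdfs_disk</disk>
              </main>
            </volumes>
          </hdfs_only>
        </policies>
      </storage_configuration>
    </clickhouse>

A MergeTree table is then placed on that disk with SETTINGS storage_policy = 'hdfs_only'.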

The HDFS engine provides integration with the Apache Hadoop ecosystem by allowing data on HDFS to be managed via ClickHouse. This engine is similar to the File and URL engines, but provides Hadoop-specific features.
https://clickhouse.yandex/docs/ru/operations/table_engines/hdfs/
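A minimal sketch of that engine (the URI, file pattern, format, and columns are placeholders):

    -- data stays in HDFS; ClickHouse reads it through the HDFS table engine
    CREATE TABLE hdfs_logs
    (
        ts      DateTime,
        message String
    )
    ENGINE = HDFS('hdfs://namenode:9000/logs/*.tsv', 'TSV');

    SELECT count() FROM hdfs_logs;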

Related

GPFS to HDFS Migration

I have an IBM BigInsights cluster with ~5k Hive tables, HBase data, and some Big SQL tables. All the data files are in different formats, i.e. text, Avro, bz2, etc.
Now, to migrate from BigInsights to HDP (Hortonworks Data Platform), I need to understand how we can move the data from GPFS to HDFS.
Can you please explain the architectural differences between GPFS and HDFS? Does the NameNode work similarly for both? What are the changes in the namespace? Copying the namespace won't work.
How do I access GPFS from another Hadoop cluster - will simply running distcp work?
What challenges can we face at the time of migration?
I have some options:
NFS gateway
distcp
httpfs
WebHDFS REST API
SCP - Secure copy
My only concern is which of these options will work for both GPFS and HDFS. If they are not tested for my scenario, what alternative options should I opt for?
Please suggest a solution and anything else I need to take care of.
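The kind of distcp invocation I have in mind looks roughly like this (hostnames and paths are placeholders, and I am not sure what the correct URI scheme for the GPFS side would be):

    # sketch only: the source URI depends on how GPFS is exposed to Hadoop on the BigInsights side
    hadoop distcp -p -update \
      hdfs://biginsights-namenode:8020/apps/hive/warehouse \
      hdfs://hdp-namenode:8020/apps/hive/warehouse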
Thanks.
Regards,
Pardeep Sharma.

Different ways to import files into HDFS

I want to know the different ways through which I can bring data into HDFS.
I am a newbie to Hadoop and was a Java web developer until now. I want to know, if I have a web application that is creating log files, how can I import the log files into HDFS?
There are lots of ways to ingest data into HDFS; let me try to illustrate them here:
hdfs dfs -put - a simple way to copy files from the local file system to HDFS
HDFS Java API
Sqoop - for bringing data to/from databases
Flume - for streaming files and logs
Kafka - a distributed queue, mostly for near-real-time stream processing
Nifi - an incubating project at Apache for moving data into HDFS without making lots of changes
The best solution for bringing web application logs to HDFS is Flume.
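For example, a minimal Flume agent with a spooling-directory source and an HDFS sink could look roughly like this (the agent name, directories, and NameNode address are placeholders):

    # webapp-hdfs.conf - sketch of a single Flume agent
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # pick up completed (rotated) log files from a spooling directory
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /var/log/webapp/completed
    agent1.sources.src1.channels = ch1

    # durable file channel between source and sink
    agent1.channels.ch1.type          = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs      = /var/flume/data

    # write events into date-partitioned HDFS directories
    agent1.sinks.sink1.type                   = hdfs
    agent1.sinks.sink1.channel                = ch1
    agent1.sinks.sink1.hdfs.path              = hdfs://namenode:8020/logs/webapp/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType          = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

It would be started with something like: flume-ng agent --conf conf --conf-file webapp-hdfs.conf --name agent1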
We have three different kinds of data - structured (schema-based systems like Oracle/MySQL, etc.), unstructured (images, web logs, etc.), and semi-structured data (XML, etc.).
Structured data can be stored in a SQL database, in tables with rows and columns.
Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (e.g. XML).
Unstructured data often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages, and many other kinds of business documents.
Depending on the type of your data, you will choose the tools to import it into HDFS.
Your company may use CRM or ERP tools, but we don't know exactly how the data is organized and structured.
If we leave aside simple HDFS commands like put, copyFromLocal, etc. for loading data into an HDFS-compatible format, below are the main tools to load data into HDFS:
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Data from MySQL, SQL Server & Oracle tables can be loaded into HDFS with this tool.
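For example, a basic Sqoop import might look roughly like this (the connection string, credentials, and table name are placeholders):

    # sketch: pull one table from MySQL into HDFS with 4 parallel mappers
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4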
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
Other tools include Chukwa, Storm, and Kafka.
But another important technology, which is becoming very popular, is Spark. It is both a friend and a foe of Hadoop.
Spark is emerging as a good alternative to Hadoop for real-time data processing, and it may or may not use HDFS as a data source.

Ingesting data files into HDFS

I have terabytes of CSV files that I need to ingest into HDFS. The files reside on an application server; I can FTP the data to an edge node and use any of the methods below.
HDFS CLI (-put)
Mounting HDFS
Using ETL tools
I was wondering which method would be best to use in terms of performance.
Please suggest.
I remember facing a similar situation in one of my previous projects. We followed the approach of mounting HDFS, which allows users to transfer files easily from the local system. I have found the links below, which might help you:
mounting hdfs - stackoverflow
HDFS NFS Gateway
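As a rough sketch (the gateway host and mount point are placeholders), mounting HDFS through the NFS Gateway linked above looks like this:

    # mount options follow the Hadoop NFS Gateway documentation; the gateway must already be running
    sudo mkdir -p /hdfs_nfs
    sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync nfs-gateway-host:/ /hdfs_nfs

    # after that, ordinary file tools work against HDFS
    cp /staging/*.csv /hdfs_nfs/data/incoming/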

Flume and Sqoop limitations

I have a terabyte of data files on different machines. I want to collect them on a centralized machine for some processing; is it advisable to use Flume?
The same amount of data is in an RDBMS, which I would like to put into HDFS. Is it advisable to use Sqoop to transfer a terabyte of data? If not, what would the alternative be?
Using Sqoop to transfer a few terabytes from an RDBMS to HDFS is a great idea and highly recommended. This is Sqoop's intended use case, and it does it reliably.
Flume is mostly intended for streaming data, so if the files all contain events and you get new files frequently, then Flume with a Spooling Directory source can work.
Otherwise, "hdfs dfs -put" is a good way to copy files to HDFS.
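For example (paths are placeholders):

    # copy local CSV exports into an HDFS directory
    hdfs dfs -mkdir -p /data/incoming
    hdfs dfs -put /local/export/*.csv /data/incoming/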

How to convert Cassandra to an HDFS file system for Shark/Hive queries

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on it?
If yes, kindly provide some links on transforming a Cassandra DB into HDFS.
You can write identity MapReduce code that takes its input from CFS (the Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries.
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
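As a rough sketch only: the exact storage-handler class and table properties differ between the projects mentioned in this thread, so treat every name below as a placeholder:

    -- map a Hive external table onto an existing Cassandra column family via a storage handler
    CREATE EXTERNAL TABLE users_from_cassandra (user_id string, name string, email string)
    STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
    TBLPROPERTIES (
      "cassandra.host"    = "127.0.0.1",
      "cassandra.ks.name" = "my_keyspace",
      "cassandra.cf.name" = "users"
    );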
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS; you need a Hive storage handler for Cassandra.
For this you can use Tuplejump's project, CASH; the README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them like you do from HDFS, you will need a file system that runs on Cassandra, like DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project Early Access repo).
Disclaimer: I work for Tuplejump, Inc.
You can use the Tuplejump Calliope project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra - for Cassandra 2.0 and Hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary - reads directly from SSTables
