GPFS to HDFS Migration - hadoop

I have an IBM BigInsights cluster with ~5k Hive tables, some HBase data, and some Big SQL tables. The data files are in different formats, e.g. text, Avro, bz2.
To migrate from BigInsights to HDP (Hortonworks Data Platform), I need to understand how to move data from GPFS to HDFS.
Can you please explain the architectural differences between GPFS and HDFS? Does the NameNode work similarly for both? What changes in the namespace? Copying the namespace won't work.
How can GPFS be accessed from another Hadoop cluster? Will a simple distcp work?
What challenges can we face during the migration?
I have some options:
nfs gateway
SCP - Secure copy
My only concern is which of these options works with both GPFS and HDFS. If they have not been tested for my scenario, what alternatives should I consider?
Please suggest a solution, and anything else I need to take care of.
Pardeep Sharma.


Is it possible to configure clickhouse data storage to be hdfs

Currently, clickhouse stores data on
path, and I've read that it doesn't support deep storage.
By the way, does it have any settings for an HDFS setup in the config.xml file?
Storing the ClickHouse data directory in HDFS is a really BAD idea ;)
Because HDFS is not a POSIX-compatible file system, ClickHouse will be extremely slow in this kind of deployment.
you can use to load data from HDFS into ClickHouse, and in the near future ClickHouse may support the Parquet format for loading data
ClickHouse has its own solution for high availability and clustering
please read and
ClickHouse was initially designed for data locality: you have a local disk, and data is read from that local disk as fast as possible.
3 years later, S3 and HDFS as remote data storage with local caching is a well-implemented approach.
Look for details
Look at the cache_enabled and cache_path options.
The HDFS engine provides integration with the Apache Hadoop ecosystem by allowing data on HDFS to be managed via ClickHouse. This engine is similar to the File and URL engines, but provides Hadoop-specific features.
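As a rough illustration of the HDFS table engine described above, the sketch below builds the CREATE TABLE statement in Python. The engine takes an HDFS URI and a format name; the table name, columns, and NameNode address used here are hypothetical placeholders, not values from this thread.

```python
def hdfs_engine_ddl(table, columns, uri, fmt):
    """Build a CREATE TABLE statement for ClickHouse's HDFS table engine.

    columns is a list of (name, type) pairs; uri and fmt are the two
    arguments of ENGINE = HDFS(URI, format).
    """
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE {table} ({cols}) ENGINE = HDFS('{uri}', '{fmt}')"


# Hypothetical table and NameNode address, for illustration only.
ddl = hdfs_engine_ddl(
    "hdfs_events",
    [("name", "String"), ("value", "UInt32")],
    "hdfs://namenode:9000/clickhouse/events",
    "TSV",
)
print(ddl)
```

A table defined this way reads and writes files on HDFS directly, so it suits data exchange with Hadoop rather than serving as ClickHouse's primary data directory.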

Ingesting data files into HDFS

I have a terabyte of CSV files that I need to ingest into HDFS. The files reside on an application server. I can FTP the data to an edge node and use any of the methods below.
HDFS CLI (-put)
Mounting HDFS
Using ETL tools
I was wondering which method would be best in terms of performance.
Please suggest.
I remember facing a similar situation in one of my previous projects. We took the approach of mounting HDFS, which lets users transfer files easily from the local system. I found the links below, which might help you.
mounting hdfs - stackoverflow
HDFS NFS Gateway

Flume and sqoop limitation

I have a terabyte of data files on different machines, and I want to collect them on a centralized machine for some processing. Is it advisable to use Flume?
The same amount of data is in an RDBMS, and I would like to put it in HDFS. Is it advisable to use Sqoop to transfer a terabyte of data? If not, what would be the alternative?
Using Sqoop to transfer a few terabytes from an RDBMS to HDFS is a great idea, highly recommended. This is Sqoop's intended use case, and it does it reliably.
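To make the Sqoop suggestion concrete, here is a minimal sketch that assembles a `sqoop import` command line in Python. The JDBC URL, table name, user, password file, and target directory are all hypothetical; only the flags themselves are standard Sqoop import options.

```python
import shlex


def sqoop_import_cmd(jdbc_url, table, target_dir, username, password_file, mappers=4):
    """Return the argv for importing one RDBMS table into an HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # avoids a password on the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_import_cmd(
    "jdbc:mysql://dbhost:3306/sales",
    "orders",
    "/data/sales/orders",
    "etl_user",
    "/user/etl_user/.db_password",
    mappers=8,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

For a terabyte-scale table, the mapper count is the main knob: Sqoop splits the import across mappers (by default on the table's primary key), so more mappers mean more parallel connections to the database.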
Flume is mostly intended for streaming data, so if the files all contain events and you get new files frequently, then Flume with the Spooling Directory source can work.
Otherwise, "hdfs dfs -put" is a good way to copy files to HDFS.
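A minimal sketch of the `-put` route, assuming the files have already been staged on a machine with the HDFS client installed. The file names and target directory are hypothetical. The helper prints the command by default and only executes it when `dry_run` is switched off.

```python
import shlex
import subprocess


def put_cmd(local_files, hdfs_dir):
    """Return the argv for copying local files into an HDFS directory."""
    return ["hdfs", "dfs", "-put", *local_files, hdfs_dir]


def run_put(local_files, hdfs_dir, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = put_cmd(local_files, hdfs_dir)
    print(" ".join(shlex.quote(arg) for arg in cmd))
    if not dry_run:
        # Requires the hdfs client on PATH, e.g. on an edge node.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical file names and target directory.
cmd = run_put(["/staging/part1.csv", "/staging/part2.csv"], "/data/incoming")
```

Since `-put` accepts several source files in one call, batches of files can be uploaded per invocation, and separate batches can be run in parallel if a single copy stream does not saturate the network.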

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and very interested in Hadoop administration. I installed Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode and ran some of the example jar files successfully. Now I am trying to learn about data backup and recovery. Can anyone tell me how to back up and recover data in Hadoop 2.2.0? Please also suggest any good books on Hadoop administration and steps for learning it.
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you could build one yourself. You would need some mechanism for accessing the data (NFS gateway, WebHDFS, etc.), and could then use tape libraries, VTLs, etc. to create backups.
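The distcp-based replication mentioned in this answer can be sketched as follows. This is only an illustration: the NameNode addresses and the directory are hypothetical, and a real job would be scheduled and monitored by whatever tooling the cluster already uses.

```python
def distcp_cmd(src_namenode, dst_namenode, path, update=True):
    """Return the argv for replicating one HDFS directory to another cluster."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ on the target.
        cmd.append("-update")
    cmd += [f"hdfs://{src_namenode}{path}", f"hdfs://{dst_namenode}{path}"]
    return cmd


# Hypothetical clusters and directory, for illustration only.
cmd = distcp_cmd("nn-prod:8020", "nn-backup:8020", "/data/warehouse")
print(" ".join(cmd))
```

Running such a command on a schedule gives an incremental copy of the directory on the second cluster, which is the kind of "backup" most operators rely on.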

How to convert cassandra to HDFS file system for shark/hive query

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on HDFS?
If yes, kindly provide some links on transforming a Cassandra DB into HDFS.
You can write an identity MapReduce job that takes its input from CFS (Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries.
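Mapping a Hive table onto data that already sits in HDFS usually means an external table. The sketch below generates such a statement in Python, assuming the export was written as delimited text files. The table name, columns, and HDFS location are hypothetical.

```python
def external_table_ddl(table, columns, hdfs_location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement over delimited text files."""
    cols = ",\n  ".join(f"{name} {htype}" for name, htype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_location}'"
    )


# Hypothetical schema and export directory, for illustration only.
ddl = external_table_ddl(
    "users_export",
    [("user_id", "BIGINT"), ("name", "STRING"), ("age", "INT")],
    "/data/cassandra_export/users",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the exported files in place.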
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, you don't need to access it as HDFS; you need a Hive storage handler for Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them as you would from HDFS, you will need a FileSystem that runs on Cassandra, like DataStax's CFS present in DSE, or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo).
Disclaimer: I work for Tuplejump, Inc.
You can use the Tuplejump Calliope Project.
Configure an external Cassandra table in Shark (as in Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three cassandra hive storage for 2.0 and hadoop 2 directly from sstable
