We are currently running spark jobs, store the data in HDFS and create Hive tables on top of it. Now, we are planning to migrate from HDFS to GFS to store the Hadoop data. Is it possible to use GFS as our Hadoop File System in our case, either by mounting or CIFS share or using GFS directly?
The following link talks about remote mounting of devices, but it mentions that it is not specific to any of the details.
https://www.cloudera.com/documentation/other/reference-architecture/topics/ra_storage_device_accepta...
Please refer me to any document that explains in detail about remote mounting of devices.
Related
I have used google-cloud-storage-connector for Hadoop and able to run mapreduce job that takes input from my local HDFS (Hadoop running in my local machine) and places the result in Google Cloud Storage bucket.
Now I want to run a mapreduce job using google-datastore-connector for Hadoop that takes input from local HDFS (Hadoop running in my local machine) and places the result in a Cloud Datastore kind (kind is synonymous to database table).
Please help me what are all the configurations that I need to give and what steps I have to follow.
I have terabyte of CSV files which I need to ingest into HDFS, files are residing on application server I can FTP data on edge node and use any of below methods .
HDFS CLI (-put)
Mounting HDFS
Using ETL tools
I was wondering which method will be good to use in terms of performance
Please suggest
I can remember that I have faced with similar situation in one of my previous projects. We have followed the approach of mounting HDFS. This will allow the users to transfer files easily from the local system. I have found the below links which might help you.
mounting hdfs - stackoverflow
HDFS NFS Gateway
I am new to Hadoop and much interested in Hadoop Administration,so i tried to install Hadoop 2.2.0 in Ubuntu 12.04 as pseudo distributed mode and installed successfully and run some example jar files also ,now i am trying learn further ,trying to learn data back up and recovery part now,can anyone tell ways to take data back back up and recovery it in hadoop 2.2.0 ,and also please suggest any good books for Hadoop Adminstration and steps to learn Hadoop Adminstration.
Thanks in Advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economic to backup to disk, rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you can create one manually yourself. You would need some mechanism of accessing the data (NFS gateway, webFS, etc), and could then use tape libraries, VTLs, etc. to create backups.
Is there any way to expose cassandra data as HDFS and then perfom shark/Hive query on HDFS ??
If yes, kindly provide some links to transform cassandra db into HDFS.
You can write identity MapReduce Code which take input from CFS (cassandra filesystem) and dump data to HDFS.
Once you have data in HDFS , you can map a hive table and run queries.
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
Details see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case then you don't need to access it as HDFS but you need a hive-handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH The Readme provides the instruction on how to build and use it. If you want to put your "big files" in Cassandra and query on them, like you do from HDFS, you will need a FileSystem that runs on Cassandra like DataStax's CFS present in DSE, or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo)
Disclaimer: I work for Tuplejump, Inc.
You can use Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure external Cassandra Table in Shark(like Hive) using Storage Handler provided in TumpleJump code.
All the best!
Three cassandra hive storage
https://github.com/2013Commons/hive-cassandra for 2.0 and hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary directly from sstable
Hadoop writes the intermediate results to the local disk and the results of the reducer to the HDFS. what does HDFS mean. What does it physically translate to
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video of how HDFS and the Map/Reduce framework works and will serve you as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/