How to convert Cassandra to an HDFS file system for Shark/Hive queries - hadoop

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on that HDFS data?
If yes, kindly provide some links on transforming a Cassandra database into HDFS.

You can write an identity MapReduce job that takes its input from CFS (the Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries.
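If the data already sits as files in CFS, even a plain file copy can stand in for the identity job. A minimal sketch, assuming the DSE/CFS Hadoop jars are on the classpath so the cfs:// scheme resolves; hosts and paths are placeholders:

```scala
// Hedged sketch: copy files from CFS to HDFS with the Hadoop FileSystem API.
// Assumes the DSE/CFS Hadoop jars are on the classpath so "cfs://" resolves;
// host names and paths are placeholders.
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object CfsToHdfsCopy {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Source (Cassandra File System) and destination (HDFS) file systems.
    val srcFs = FileSystem.get(new URI("cfs://cassandra-host/"), conf)
    val dstFs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf)

    val src = new Path("cfs://cassandra-host/data/events")
    val dst = new Path("hdfs://namenode:8020/user/hive/warehouse/events")

    // Copy recursively without deleting the source; returns true on success.
    val ok = FileUtil.copy(srcFs, src, dstFs, dst, false, conf)
    println(s"copy succeeded: $ok")
  }
}
```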

The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have a reason to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
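For illustration, an external Hive table mapped through such a storage handler typically looks something like the sketch below, issued here through a Spark HiveContext. The handler class name and SERDEPROPERTIES keys are assumptions based on the handler projects mentioned in this thread; check the handler you actually build for its exact names.

```scala
// Hedged sketch: an external Hive table backed by a Cassandra storage handler,
// created through a Spark HiveContext. The handler class and property keys are
// assumptions based on the handler projects linked in this thread.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CassandraHiveTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cassandra-hive-table"))
    val hive = new HiveContext(sc)

    // Map an existing CQL3 table (my_keyspace.users, placeholder names) into Hive.
    hive.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS users (user_id string, name string, email string)
      STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.host"    = "127.0.0.1",
        "cassandra.ks.name" = "my_keyspace",
        "cassandra.cf.name" = "users"
      )
    """)

    // Queries against the external table are pushed through the handler to Cassandra.
    hive.sql("SELECT count(*) FROM users").collect().foreach(println)
  }
}
```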

I think you are trying to run Hive/Shark against data that is already in Cassandra. If that is the case, you don't need to access it as HDFS; you need a Hive handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them the way you do from HDFS, you will need a file system that runs on Cassandra, like DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project early-access repo).
Disclaimer: I work for Tuplejump, Inc.

You can use Tuplejump's Calliope project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!

Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra for Cassandra 2.0 and Hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary reads directly from SSTables

Related

Is it possible to configure ClickHouse data storage to be HDFS

Currently, ClickHouse stores data under the
/var/lib/clickhouse
path, and I've read that it doesn't have support for deep storage.
By the way, does it have any config for an HDFS setup in the config.xml file?
Storing the ClickHouse data directory in HDFS is a really BAD idea ;)
because HDFS is not a POSIX-compatible file system, ClickHouse will be extremely slow in this deployment variant.
You can use https://github.com/jaykelin/clickhouse-hdfs-loader to load data from HDFS into ClickHouse, and in the near future (https://clickhouse.yandex/docs/en/roadmap/) ClickHouse may support the Parquet format for loading data.
ClickHouse has its own solutions for high availability and clustering.
Please read
https://clickhouse.yandex/docs/en/operations/table_engines/replication/ and https://clickhouse.yandex/docs/en/operations/table_engines/distributed/
#MajidHajibaba
ClickHouse was initially designed for data locality: you have a local disk, and data is read from that local disk as fast as possible.
Three years later, S3 and HDFS as remote data storage with local caching are a well-implemented approach.
See https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-s3 for details,
look at the cache_enabled and cache_path options,
and see https://clickhouse.com/docs/en/operations/storing-data/#configuring-hdfs
The HDFS engine provides integration with the Apache Hadoop ecosystem by allowing you to manage data on HDFS via ClickHouse. This engine is similar to the File and URL engines, but provides Hadoop-specific features.
https://clickhouse.yandex/docs/ru/operations/table_engines/hdfs/
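To make the HDFS engine concrete, here is a hedged sketch driving the DDL from Scala over the ClickHouse JDBC driver. The DDL follows the documentation linked above; the host names, HDFS path, schema, and the ru.yandex.clickhouse driver class are assumptions about the setup.

```scala
// Hedged sketch: the ClickHouse HDFS table engine driven over JDBC from Scala.
// Host names, the HDFS path, the schema, and the ru.yandex.clickhouse JDBC driver
// are assumptions about the deployment.
import java.sql.DriverManager

object ClickHouseHdfsEngine {
  def main(args: Array[String]): Unit = {
    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
    val conn = DriverManager.getConnection("jdbc:clickhouse://clickhouse-host:8123/default")
    try {
      val stmt = conn.createStatement()

      // Table whose data lives in HDFS and is read/written via the HDFS engine,
      // not stored on the local MergeTree disks.
      stmt.execute(
        """CREATE TABLE IF NOT EXISTS hdfs_events (name String, value UInt32)
          |ENGINE = HDFS('hdfs://namenode:9000/clickhouse/events.tsv', 'TSV')""".stripMargin)

      stmt.execute("INSERT INTO hdfs_events VALUES ('one', 1), ('two', 2)")

      val rs = stmt.executeQuery("SELECT name, value FROM hdfs_events")
      while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getInt(2)}")
    } finally conn.close()
  }
}
```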

Query GitHub data using Hadoop

I am trying to query GitHub data provided by the GHTorrent API using Hadoop. How can I inject this much data (4-5 TB) into HDFS? Also, their databases are real-time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?
Go through this presentation. It describes how you can connect to their MySQL or MongoDB instance and fetch data. Basically, you have to share your public key; they will add that key to their repository, and then you can SSH in. As an alternative, you can download their periodic dumps from this link.
Important links:
query MongoDB programmatically
connect to a MySQL instance
For processing real-time data, you cannot do that using Pig or Hive; those are batch-processing tools. Consider using Apache Spark.
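As a rough illustration of the Spark suggestion, here is a minimal Spark Streaming sketch; the socket source, batch interval, and HDFS path are placeholders, not part of GHTorrent.

```scala
// Hedged sketch: processing a real-time feed with Spark Streaming micro-batches
// instead of Pig/Hive batch jobs. The socket source, 10-second batch interval,
// and HDFS path are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GithubEventStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("github-event-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source; in practice this could be Kafka, Flume, or files landing in HDFS.
    val events = ssc.socketTextStream("localhost", 9999)

    // Example micro-batch computation: count events per batch and persist the counts to HDFS.
    events.count().saveAsTextFiles("hdfs://namenode:8020/streams/github-event-counts")

    ssc.start()
    ssc.awaitTermination()
  }
}
```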

HDFS into Cassandra

Is it possible to migrate/replicate/copy/move processed files (created using Pig) from local HDFS (let's say 192.168.0.10) to a Cassandra instance (192.168.0.20)?
What I have in mind is literally creating a Java application to parse the files and re-insert them into Cassandra.
Is there any other way of doing so?
Thanks a lot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually overkill, and it gets worse if you have to perform the same migration periodically.
Instead, we can use a very useful feature of Hive that integrates Hive tables with external data sources: Hive's storage handler API, which integrates with external data sources like Cassandra, Oracle, MySQL, etc.
There is already a Hive-Cassandra storage handler implementation available that we can reuse; you can find it at the URL below.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create a Hive external table that is configured with storage handler specs describing the remote Cassandra host and table details.
Any write or read performed against this external table is handled by Hive through MapReduce jobs that talk to Cassandra.
I hope this is the ideal way to integrate Hive and Cassandra; it takes very little effort on our part and is quite efficient too.
Hope this helps.
There are several ways to move data from Hadoop to Cassandra:
Using the Java HDFS API and the Cassandra API (inefficient).
Using a Java MapReduce program (parallel loading).
Using Pig (parallel loading).
Using Hive (parallel loading).
Using Spark (parallel loading; see the sketch after this list).
Of all of these, Pig is the easiest way to load data from HDFS into Cassandra.
Pig has a storage type called CassandraStorage, which allows us to load data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig
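For completeness, here is a hedged sketch of the Spark option from the list above, using the DataStax Spark Cassandra Connector. The keyspace, table, columns, file layout, and ports are placeholders, and the target Cassandra table must already exist.

```scala
// Hedged sketch: loading Pig output from HDFS into Cassandra in parallel with Spark
// and the DataStax Spark Cassandra Connector. Keyspace, table, columns, file layout,
// and ports are placeholders.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object HdfsToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hdfs-to-cassandra")
      .set("spark.cassandra.connection.host", "192.168.0.20") // Cassandra node from the question

    val sc = new SparkContext(conf)

    // Read the files Pig produced (assumed CSV-like: id,name,score), parse, and write in parallel.
    sc.textFile("hdfs://192.168.0.10:8020/user/pig/output/part-*")
      .map(_.split(","))
      .map(f => (f(0), f(1), f(2).toDouble))
      .saveToCassandra("my_keyspace", "processed_files", SomeColumns("id", "name", "score"))

    sc.stop()
  }
}
```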

What is the relationship between Spark, Hadoop and Cassandra

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming this is a brand-new project with no existing installation?
Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of being an alternative to Hadoop, Spark can be much faster than Hadoop at certain operations. For example, a multi-pass MapReduce operation can be dramatically faster in Spark than with Hadoop MapReduce, since most of Hadoop's disk I/O is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
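As a small illustration of that load / transform-in-memory / write-back cycle (a sketch only; the paths and the log layout are made up):

```scala
// Sketch of the cycle described above: load from durable storage, keep the data set
// in memory across several operations, then write the result back. Paths and the
// log layout are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object LogSummary {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-summary"))

    // Load from durable storage (HDFS here) and cache in memory for reuse.
    val lines = sc.textFile("hdfs://namenode:8020/logs/access.log").cache()

    // Several passes over the same in-memory data set without re-reading from disk.
    val errors   = lines.filter(_.contains("ERROR")).count()
    val warnings = lines.filter(_.contains("WARN")).count()
    val byCode = lines.map(_.split(" ")(8))          // assumed: status code is the 9th field
      .map(code => (code, 1L))
      .reduceByKey(_ + _)

    // Write the final result back to durable storage.
    byCode.saveAsTextFile("hdfs://namenode:8020/reports/status-code-counts")
    println(s"errors=$errors warnings=$warnings")
    sc.stop()
  }
}
```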
I'm writing a paper about Hadoop for university and stumbled over your question. Spark only uses Hadoop for persistence, and only if you want it to; it's possible to use it with other persistence tiers, like Amazon EC2.
On the other hand, Spark runs in memory, and it's not primarily built for MapReduce use cases the way Hadoop was/is.
I can recommend this article if you would like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.

Questions about Hadoop And Hive And Presto

I am looking into using Hive on our Hadoop cluster and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about some things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do both Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop? (in terms of hard disk space and so on)
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive and just on Hadoop directly?
Thanks in advance for answering my questions :)
First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call the data manager?
Actually, Hive can use both: "regular" files in HDFS, or tables, which are once again "regular" files with additional metadata stored in a special datastore (it is called the warehouse).
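To make that distinction concrete, here is a hedged sketch issued through a Spark HiveContext (paths, columns, and delimiters are placeholders): an external table is only metadata over files that already sit in HDFS, while a managed table keeps its data under the warehouse directory.

```scala
// Hedged sketch of external vs. managed Hive tables, issued through a Spark HiveContext.
// Paths, columns, and delimiters are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ExternalVsManaged {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("external-vs-managed"))
    val hive = new HiveContext(sc)

    // External table: Hive only records the schema; the files under /data/clicks stay where
    // they are, and dropping the table does not delete them.
    hive.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS clicks (user_id STRING, url STRING, ts BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 'hdfs://namenode:8020/data/clicks'
    """)

    // Managed table: data loaded into it is copied under Hive's warehouse directory,
    // and dropping the table removes that data as well.
    hive.sql("CREATE TABLE IF NOT EXISTS clicks_managed (user_id STRING, url STRING, ts BIGINT)")
  }
}
```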
Concerning Presto: it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more about Hive connector configuration here and about connector plugins here.
