Our MySQL and Hadoop platforms are deployed in different networks: Hadoop cannot reach MySQL, but MySQL can connect to Hadoop.
So is there a tool that can push bulk data from RDS to HDFS, the way Sqoop pulls it? Or is there any other method that can address this issue?
Any help is appreciated.
By the way, the Hadoop cluster already consumes a Kafka topic in our network, by deploying Flume on both sides and connecting them together. But how can we achieve the same for bulk RDS data?
For moving bulk data from MySQL to Hadoop, you have another option: a Spark JDBC connection to the MySQL database, which brings the data into the Hadoop cluster.
Using Sqoop for data movement is the older technique for importing data from an RDBMS into a Hadoop cluster. The Spark JDBC solution is based on Spark's in-memory execution engine, which has a well-proven performance record and is reliable to use.
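As a rough sketch of that approach (a minimal Spark job; the host name, table name, and credentials are placeholders, and the MySQL JDBC driver is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

object MySqlToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-to-hdfs").getOrCreate()

    // Read the MySQL table over JDBC. Host, database, table, and
    // credentials below are placeholders for your own environment.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "orders")
      .option("user", "etl_user")
      .option("password", "secret")
      .load()

    // Land the data in HDFS as Parquet files.
    df.write.parquet("hdfs:///data/orders")

    spark.stop()
  }
}
```

Note that, like Sqoop, this still requires the machines running the Spark executors to be able to reach MySQL over the network.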
I am trying to query GitHub data provided by the GHTorrent API using Hadoop. How can I ingest this much data (4-5 TB) into HDFS? Also, their databases are real-time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?
Go through this presentation. It describes how you can connect to their MySQL or MongoDB instance and fetch data. Basically, you have to share your public key; they will add that key to their repository, and then you can SSH in. As an alternative, you can download their periodic dumps from this link.
Important links:
- query MongoDB programmatically
- connect to MySQL instance
For processing real-time data, you cannot use Pig or Hive; those are batch-processing tools. Consider using Apache Spark.
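To make the Spark suggestion concrete, here is a minimal Structured Streaming sketch, assuming the spark-sql-kafka package is on the classpath and that events arrive on a hypothetical Kafka topic (the broker address and topic name are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("realtime-events").getOrCreate()

    // Read a continuous stream of events from a Kafka topic;
    // broker and topic names are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "github-events")
      .load()

    // Print the raw records to the console as they arrive.
    val query = events
      .selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```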
We are having problems with table locking. We need simultaneous access to tables from Hive and from Spark (with Thrift Server). Our problem is that running Spark with Thrift Server results in a table lock.
We're running on an Amazon AWS EMR cluster with Hive, Spark, and Thrift Server 2.
We'd like to use Hive to update data on S3 storage and periodically load this aggregated data into Spark in the background. Spark, meanwhile, is always on with Thrift Server loaded and has the same data loaded from S3, in order to do real-time aggregations on it. Spark does not need write access to this data.
The problem is that running the periodic data-loading tasks in Hive causes the job to freeze.
We think the metastore may be locked by Spark / Thrift Server, blocking Hive from updating and reloading the data into Spark. (But we are not sure about this.)
Is it possible to start Spark and Thrift Server in a read-only, non-blocking mode?
What might cause the problem? Has anyone experienced similar problems?
How is your metastore configured? Does it use Derby?
With the default configuration it uses Derby, which does not support multiple concurrent users.
If so, you should change it to use something like MySQL, which does support multiple users.
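For illustration, a minimal sketch of pointing Spark's Hive support at a MySQL-backed metastore instead of the embedded Derby one (the JDBC URL and credentials are placeholders; the same javax.jdo properties are normally set once in hive-site.xml so that Hive and Spark share the same metastore):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: configure the Hive metastore connection to use
// MySQL, which supports multiple concurrent users, unlike Derby.
val spark = SparkSession.builder()
  .appName("shared-metastore")
  .config("javax.jdo.option.ConnectionURL",
          "jdbc:mysql://metastore-host:3306/hive_metastore")
  .config("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
  .config("javax.jdo.option.ConnectionUserName", "hive")
  .config("javax.jdo.option.ConnectionPassword", "secret")
  .enableHiveSupport()
  .getOrCreate()
```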
Currently I have two machines. One of them is the Hortonworks sandbox; I have configured it as the NameNode and decommissioned the DataNode from it. The other machine I set up myself as a DataNode, and I have installed Hive Server on it.
I also assigned the slave role to it, and I used Ambari to do all of this.
My question, as this is my first time ever using Hadoop: my plan is to transfer data from a SQL database into Hadoop, so does this mean I have to install MySQL on the DataNode while I will be using Sqoop? And what will the NameNode do? Shall I query it so that it passes the queries on to the DataNode? I am really very confused and under huge pressure to finish, so forgive me, as I am a newbie. The installations on both machines are all defaults; I chose DataNode for the first machine and NodeManager for the second one, with no special configurations. I would appreciate a simple example from which I can understand this.
Thanks a lot, fellows.
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
For example: you have some data in MySQL on another machine and you need to transfer it into your Hadoop HDFS. That is the situation where Sqoop is used.
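Sqoop itself is driven from the command line; as a hedged Spark analogue of the export direction described above (paths, URL, and credentials are placeholders, and the MySQL JDBC driver is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-to-mysql").getOrCreate()

// Load previously imported data from HDFS...
val df = spark.read.parquet("hdfs:///data/orders")

// ...and write it back into a relational table over JDBC,
// which is what a Sqoop export does from the command line.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")
  .option("dbtable", "orders_export")
  .option("user", "etl_user")
  .option("password", "secret")
  .mode("append")
  .save()
```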
The NameNode stores metadata about the data held in the DataNodes (the number of blocks, which rack and which DataNode the data is stored on, and other details), whereas the DataNodes store the actual data.
Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on it?
If yes, kindly provide some links on how to transform a Cassandra DB into HDFS.
You can write identity MapReduce code which takes its input from CFS (the Cassandra file system) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table over it and run queries.
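A hedged sketch of the same idea using Spark instead of a hand-written identity MapReduce job; the cfs:// path assumes DataStax's CFS is available, and the paths and table name are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cfs-to-hdfs")
  .enableHiveSupport()
  .getOrCreate()

// Identity copy: read the raw files from CFS and write them to HDFS unchanged.
spark.read.text("cfs:///exports/users")
  .write.text("hdfs:///warehouse/users_raw")

// Map a Hive external table over the copied files and query it.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users_raw (line STRING)
    |LOCATION 'hdfs:///warehouse/users_raw'""".stripMargin)
spark.sql("SELECT count(*) FROM users_raw").show()
```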
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS; you need a Hive storage handler to use it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them as you would from HDFS, you will need a file system that runs on Cassandra, like DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project early-access repo).
Disclaimer: I work for Tuplejump, Inc.
You can use the Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
- https://github.com/2013Commons/hive-cassandra (for Cassandra 2.0 and Hadoop 2)
- https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
- https://github.com/richardalow/cassowary (reads directly from SSTables)
Both Flume and Sqoop are meant for data movement, so what is the difference between them? Under what conditions should I use Flume or Sqoop?
From http://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume helps to collect data from a variety of sources, like log files, JMS, a spooling directory, etc. Multiple Flume agents can be configured to collect high volumes of data.
It scales horizontally.
From http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop helps to move data between Hadoop and other databases, and it can transfer data in parallel for performance.
Both Sqoop and Flume pull data from the source and push it to the sink. The main difference is that Flume is event-driven, while Sqoop is not.
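To make the parallel-transfer point concrete, here is a sketch of the idea behind Sqoop's --split-by/-m options, expressed as a partitioned Spark JDBC read (connection details and the id bounds are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-import").getOrCreate()

// Split the read into 8 parallel JDBC partitions on the numeric id column,
// analogous to Sqoop's --split-by id -m 8.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")
  .option("dbtable", "transactions")
  .option("user", "etl_user")
  .option("password", "secret")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

df.write.parquet("hdfs:///data/transactions")
```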
Flume:
Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
You can see the full post.
Flume:
A very common use case is collecting log data from one system, such as a bank of web servers, and aggregating it in HDFS for later analysis.
Sqoop:
Sqoop, on the other hand, is designed for performing bulk imports of data into HDFS from structured data stores. A simple use case would be an organization that runs a nightly Sqoop import to load the day's data from a production DB into a Hive data warehouse for analysis.
-- From Hadoop: The Definitive Guide.
1. Apache Sqoop and Apache Flume work with different kinds of data sources. Flume works well with streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop is designed to work well with any kind of relational database system that has JDBC connectivity.
2. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra, and it allows direct data transfer to Hive or HDFS. When transferring data to Hive with Apache Sqoop, a table is created for you, with the schema taken from the source database itself.
3. In Apache Flume, data loading is event-driven, whereas in Apache Sqoop the data load is not driven by events.
4. Flume is a better choice when moving bulk streaming data from various sources like JMS or a spooling directory, whereas Sqoop is an ideal fit if the data is sitting in databases like Teradata, Oracle, MySQL Server, Postgres, or any other JDBC-compatible database.
5. In Apache Flume, data flows to HDFS through multiple channels, whereas in Apache Sqoop HDFS is simply the destination for imported data.
6. Apache Flume has an agent-based architecture: the code written in Flume is known as an agent, and it is responsible for fetching the data. Apache Sqoop's architecture, in contrast, is based on connectors, which know how to connect to the various data sources and fetch the data accordingly.
Lastly, Sqoop and Flume cannot be used to achieve the same tasks, as they are developed specifically to serve different purposes. Apache Flume agents are designed to fetch streaming data, like tweets from Twitter or log files from a web server, whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them. Apache Sqoop is mainly used for parallel data transfers and data imports, as it copies data quickly, whereas Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and its highly available backup routes.
Sqoop and Flume are both meant to fulfill data-ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity.
Sqoop is really meant for bulk data transfers between Hadoop and any other structured data store. Flume collects log data from many sources, aggregates it, and writes it to HDFS.
I came across this interesting infographic that explains the differences between the two Apache projects, Sqoop and Flume:
Difference between Sqoop and Flume
Sqoop
- Sqoop can perform imports and exports between an RDBMS and HDFS/Hive/HBase.
- Sqoop only imports/exports structured data, not unstructured or semi-structured data.
Flume
- Flume imports streaming data from multiple sources, mostly semi-structured or unstructured in nature. Nowadays, Kafka is a better alternative to Flume.