import table from HDFS into Spark

Is there a way to import a table from HDFS directly into Spark and store it as an RDD, or does it need to be made into a text file first?
PS: I get the table onto HDFS from my local system using Sqoop (if that matters), and when I do so it comes in the form of 4 files.

While I haven't used Sqoop myself, you can use it to create Hive tables, which you can then query with Spark SQL; that will give you back SchemaRDDs :)
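A minimal sketch of that route, assuming Spark 1.x built with Hive support; the table name and HDFS path below are hypothetical, not from the question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveIntoSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-into-spark"))
    // HiveContext requires a Spark build with Hive support
    val hiveContext = new HiveContext(sc)

    // Query the Hive table created by the Sqoop import; the result is a
    // SchemaRDD on Spark < 1.3 and a DataFrame on Spark >= 1.3
    val rows = hiveContext.sql("SELECT * FROM my_sqoop_table LIMIT 10")
    rows.collect().foreach(println)

    // Alternatively, read the delimited part files Sqoop wrote to HDFS
    // directly as an RDD of lines (hypothetical path)
    val raw = sc.textFile("hdfs:///user/hadoop/my_sqoop_table")
    val fields = raw.map(_.split(","))
    println(fields.count())
  }
}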

You can use read.jdbc() on your sqlContext to import a table from an external DB directly into a Spark DataFrame.
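For example, a minimal sketch assuming Spark 1.4+ with the MySQL JDBC driver on the classpath; the URL, table name, and credentials are hypothetical:

import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JdbcIntoSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-into-spark"))
    val sqlContext = new SQLContext(sc)

    val props = new Properties()
    props.setProperty("user", "dbuser")      // hypothetical credentials
    props.setProperty("password", "dbpass")

    // Pull the table over JDBC straight into a DataFrame,
    // skipping the Sqoop-to-HDFS step entirely
    val df = sqlContext.read.jdbc(
      "jdbc:mysql://dbhost:3306/mydb",       // hypothetical connection URL
      "my_table",
      props)
    df.show()
  }
}

Note this reads from the source database at query time, so it is an alternative to the Sqoop import rather than a way to read files already on HDFS.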

Related

Can sqoop write data to hive and hbase together

Can we write Sqoop data to Hive and HBase together in Hadoop?
I want to write Sqoop data to Hive (RDBMS) and HBase (NoSQL) together.
No, it cannot. If you want the data to show up in both Hive and HBase, you will have to import it into two separate locations: create a Hive table over one for use in Hive, and over the second location create an external Hive table with HBase SerDe properties, as sketched below.
Integrating Hive and HBase: this link gives you the required steps.
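As a sketch, the HBase-backed side of that setup might look like the following; the table, column family, and column names are hypothetical:

-- Hive table mapped onto an existing HBase table via the HBase storage handler
CREATE EXTERNAL TABLE hbase_customers (
  rowkey STRING,
  name   STRING,
  email  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:name,cf:email"
)
TBLPROPERTIES ("hbase.table.name" = "customers");

The first location is just a plain Hive table over the Sqoop output; this second one maps Hive columns onto an existing HBase table so the same rows are queryable from both sides.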

Questions about Hive

I have this environment:
A Hadoop cluster (1 master, 4 slaves) with several applications: Ambari, Hue, Hive, Sqoop, HDFS ... and a production server (separate from Hadoop) with a MySQL database.
My goal is:
To optimize the queries made on this MySQL server that are slow to execute today.
What did I do:
I imported the MySQL data to HDFS using Sqoop.
My doubts:
Can't I run selects directly on HDFS using Hive?
Do I have to load the data into Hive to make the queries?
If new data is entered into the MySQL database, what is the best way to get this data and insert it into HDFS, and then into Hive again? (Maybe in real time.)
Thank you in advance.
Can't I run selects directly on HDFS using Hive?
You can. Create an external table in Hive specifying your HDFS location; then you can perform any HQL over it.
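A minimal sketch, assuming the Sqoop import wrote comma-delimited text files; the directory and columns are hypothetical:

-- External table over the files Sqoop already put in HDFS
CREATE EXTERNAL TABLE orders (
  id       INT,
  customer STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/orders';

-- Query it like any other Hive table
SELECT customer, SUM(amount) FROM orders GROUP BY customer;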
Do I have to load the data into Hive to make the queries?
With an external table, you don't need to load the data into Hive; the data stays in the same HDFS directory.
If new data is entered into the MySQL database, what is the best way to get this data?
You can use Sqoop incremental import for this. It fetches only newly added or updated rows (depending on the incremental mode). You can create a Sqoop job and schedule it as per your need, as sketched below.
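A sketch of such a job; the connection string, table, target directory, and check column are hypothetical, and --incremental lastmodified with a timestamp column would be used instead if existing rows are updated in place:

sqoop job --create daily_orders_import -- import \
  --connect jdbc:mysql://dbhost/mydb \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental append \
  --check-column id \
  --last-value 0

# each run fetches only rows whose id exceeds the last-value
# remembered in the Sqoop metastore
sqoop job --exec daily_orders_import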
You can try Impala, which is much faster than Hive for SQL queries. You need to define tables, most probably specifying a delimiter, the storage format, and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.
I have no experience with real-time data ingestion from relational databases, but you can try scheduling Sqoop jobs with cron.
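For instance, a crontab entry running the hypothetical job above every night at 2 AM:

0 2 * * * /usr/bin/sqoop job --exec daily_orders_import >> /var/log/sqoop_import.log 2>&1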

Differences between Apache Sqoop and Hive. Can we use both together?

What is the difference between Apache Sqoop and Hive? I know that Sqoop is used to import/export data between an RDBMS and HDFS, and that Hive is a SQL abstraction layer on top of Hadoop. Can I use Sqoop for importing data into HDFS and then use Hive for querying?
Yes, you can. In fact, many people use Sqoop and Hive for exactly what you have described.
In my project, I had to load historical data from my RDBMS, which was Oracle, and move it to HDFS. I had Hive external tables defined for this path, which allowed me to run Hive queries on the data to do transformations. We also wrote MapReduce programs on top of this data for various kinds of analysis.
Sqoop transfers data between HDFS and relational databases. You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into HDFS and use MapReduce on the transferred data. Sqoop can export this transformed data back into an RDBMS as well. More info http://sqoop.apache.org/docs/1.4.3/index.html
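As an illustration of the export direction, a minimal sketch with hypothetical connection details and paths:

# the target table must already exist in the database
sqoop export \
  --connect jdbc:mysql://dbhost/reporting \
  --table daily_summary \
  --export-dir /user/hive/warehouse/daily_summary \
  --input-fields-terminated-by ','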
Hive is a data warehouse software that facilitates querying and managing large datasets residing in HDFS. Hive provides schema on read (as opposed to schema on write for RDBMS) onto the data and the ability to query the data using a SQL-like language called HiveQL. More info https://hive.apache.org/
Yes you can. As a matter of fact, that's exactly how it is meant to be used.
Sqoop:
With this tool we can integrate HDFS with any external data source, i.e. SQL, NoSQL, and data warehouses; it works in both directions, so we can export as well as import.
For example, Sqoop can move data from a relational database into HBase.
Hive: 1. As per my understanding, we can import data from SQL databases into Hive, but not from NoSQL databases.
2. Hive itself cannot export the data from HDFS into SQL databases; that is what Sqoop export is for.
We can use both together, using the two steps below:
sqoop create-hive-table --connect jdbc:mysql://<hostname>/<dbname> --table <table name> --fields-terminated-by ','
The above command generates the Hive table; it will have the same name and schema as the external source table.
Load the data
hive> LOAD DATA INPATH '<filename>' INTO TABLE <table name>;
This can be shortened to one step if you know that you want to import straight from a database directly into Hive:
sqoop import --connect jdbc:mysql://<hostname>/<dbname> --table <table name> -m 1 --hive-import

can you sqoop from hive db to another hive db?

We have 2 different Hadoop clusters, and I was wondering if you can Sqoop data between Hive dbs/tables.
I've been looking for this for a while, but I can't find it.
Cluster1: db: metrics, table: disk
Cluster2: db: metrics, table: disk
Desired output on Cluster2:
db: metrics, table: disk, where disk = Cluster1.disk UNION Cluster2.disk
Really, I can add the logic easily; I just wanted to know if you can use Sqoop to import data from Hive to Hive.
Thanks in advance.
Sqoop doesn't do that. Try Apache Falcon; version 0.4 adds support for Hive replication.

Export HBase Data to RDBMS

I am using HBase to store data, but to suit my requirements I later want to export the data from HBase to an RDBMS like MySQL or Postgres. I know we have Sqoop as an option, but it imports from MySQL into HBase and exports data saved in HDFS to an RDBMS; it cannot read data directly from HBase.
Is there any tool to export data from HBase tables to RDBMS tables?
Not sure if this is a better approach, but HBase data can be exported into a flat file and then loaded into the RDBMS, for example:
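One way to sketch that flow, with all names hypothetical: dump an HBase-backed Hive table (defined as shown earlier in this document) to delimited files, then push those files out with sqoop export. The ROW FORMAT clause on INSERT OVERWRITE DIRECTORY assumes Hive 0.11+.

-- In Hive: write the HBase-backed table out as comma-delimited text
INSERT OVERWRITE DIRECTORY '/tmp/hbase_customers_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT rowkey, name, email FROM hbase_customers;

# Then push the flat files into MySQL with Sqoop
sqoop export \
  --connect jdbc:mysql://dbhost/crm \
  --table customers \
  --export-dir /tmp/hbase_customers_export \
  --input-fields-terminated-by ','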
