since "load local" command can load data from local file system to hive table, I'm unsure why most of people would like to putHFDS + replaceText + HiveQL. isn't better to use "replaceText + HiveQL" only instead of adding 1 more processor:putHDFS in workflow?
Many times NiFi will be running on a server outside the Hadoop cluster where the Hadoop client doesn't exist, so PutHDFS is transferring the data from that server to HDFS, and then ReplaceText + PutHiveQL is a way to create a Hive external table on top of the data that just landed in HDFS.
Related
on a Ubuntu server I set up Divolte Collector to gather clickstream data from websites. The data is being stored in Hadoop HDFS (Avro files).
(http://divolte.io/)
Then I would like to visualize the data with Airbnb Superset which has several connectors to common databases (thanks to SqlAlchemy) but not to HDFS.
Superset has in particular a connector to SparkSQL thanks to JDBC Hive (http://airbnb.io/superset/installation.html#database-dependencies)
So is it possible to use it to retrieve HDFS clickstream data? Thanks
In order to read HDFS data in SparkSQL there are two major ways depening on your setup:
Read the table as it was defined in Hive (reading from a remote metastore) (probably not your case)
SparkSQL by default (if not configured otherwise) creates a embedded metastore for Hive which allows you to issue DDL and DML statements using Hive syntax.
You need an external package for that to work com.databricks:spark-avro.
CREATE TEMPORARY TABLE divolte_data
USING com.databricks.spark.avro
OPTIONS (path "path/to/divolte/avro");
Now data should be available inside the table divolte_data
I want to save and access a table like data structure in HDFS with MapReduce programming. Part of this DS is shown in the following picture. This DS have tens of thousands of columns and hundreds of rows and All nodes should have access to it.
My Question is: How can I save this DS in HDFS and access it with MapReduce programming. Should I use arrays? (Or Hive tables ? Or Hbase?)
Thank you.
HDFS is distributed file System which stores your big files in distributed servers.
You can copy your files from local system to HDFS using command
hadoop fs -copyFromLocal /source/local/path destincation/hdfs/path
Once copy completed an External hive table can be formed on destincation/hdfs/path.
This table can be queried using hive shell.
Do consider Hive for this scenario. If you want to do table type of processing like SAS dataset or R dataframe/dataTable or python pandas; almost always an equivalent thing is possible in SQL. Hive provides powerful SQL abstraction through MapReduce and Tez engines. If you want to graduate to Spark sometime then you can read Hive tables in dataframes. As #sumit pointed you just need to transfer your data from local to HDFS (using HDFS copyFromLocal or put command) and define an external Hive table on that.
If in case you want to write some custom map-reduce on this data then access the background hive table data (more likely at /user/hive/warehouse). After reading the data from stdin, parse it in mapper (separator could be find using describe extended <hive_table>) and emit in key-value pair format.
Hi everybody
I'm quite new with bigdata, I have installed a HDFS + Hbase test database and I use Talend Big Data (an ETL) to make my test.
I would like to know : if I put a file directly in the HDFS, without going via hbase, I could never request these data ? I mean, I have to read the entire file if I want to filter data I want to chose, is that right ?
Thanks a lot for any help !
HDFS is just a distributed file system, you cannot query your files without passing by an intermidiate component.
Hbase is a nosql database that persist your data on the HDFS, use it when you need a random access to your data.
If you want to store your files on the HDFS as they are and query them, you can create an external table upon them using Hive.
The best option is to use hive on the top of the files which are on the HDFS. You can use bucketing and partitioning in the hive for performance improvement.
I am in the process of learning Hadoop and stuck with few concepts on moving data from Relational database to Hadoop and vice versa.
I have transferred files from MySQL to HDFS using SQOOP import queries. The files I transferred were structured datasets and not any server log data. I recently read that we usually use flume for moving log files into Hadoop,
My question is:
1. Can we use SQOOP as well for moving log files?
2. If yes, which of SQOOP or FLUME is more preferred for log files and why?
1) Sqoop can be used to transfer data between any rdbms and hdfs. To use scoop the data has to be structured usually specified by schema of database from where data is being imported or exported.Log files are not always structured,depending on source and type of log so sqoop is not used for moving log files.
2)Flume can collect, aggregate data from many different kinds of customizable data sources. It gives more flexibility in controlling what specific events to capture and use in user defined work flow before storing in say hdfs.
I hope it clarified difference between sqoop and flume.
SQOOP is designed to transfer data from RDMS to HDFS whereas FLUME is for moving large amounts of log data.
Both are different and specialized for different purposes.
Like
You can use SQOOP to import data via JDBC ( which you can not do in FLUME ),
and
You can use FLUME to say something like "I want to tail 200 lines of log file from this server".
Read more about FLUME here
http://flume.apache.org/
SQOOP not only transfers data from RDBMS but also from NOSql databases like MongoDB. You can directly transfer data to HDFS or Hive.
Transferring data to Hive you need not have to create table beforehand.. It takes the scheme from database itself.
Flume is used to fetch log data or streaming data
We are setting up Hadoop and Hive in our organization.
Also we will be having the sample data created by data generator tool. The data will be around 1 TB.
My question is - i have to load that data into Hive and Hadoop. What is the process i need to follow for this?
Also we will be having HBase installed with Hadoop.
We need to create the same database design which is right now there in SQL Server..But using Hive. Cz after this data loaded into hive we want to use the Business Objects 4.1 as a front end to create the Reports.
The challage is to load the sample data into the Hive..
Please help me as we want to do all the things asap.
First ingest your data in HDFS
Use Hive external tables, pointing to the location where you ingested the data i.e. your hdfs directory.
You are all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put data in hdfs.
Transport your data file(s) to a client node (app node)
put your files en distribute file system (hdfs dfs -put ... )
create an external Table pointing the hdfs directory in which you uploaded those files. Your data have been structure of some way. For instance delimited by semicolon symbol.
Now you can operate over the data with sql queries.
For the second case you can create another hive table (using HBaseStorageHandler , https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load from the first table with Insert statement.
I hope this can help you.