Flume and Sqoop limitations - performance

I have a terabyte of data files on different machines that I want to collect on a centralized machine for some processing. Is it advisable to use Flume?
The same amount of data is in an RDBMS, which I would like to put into HDFS. Is it advisable to use Sqoop to transfer a terabyte of data? If not, what is the alternative?

Using Sqoop to transfer a few terabytes from an RDBMS to HDFS is a great idea, highly recommended. This is Sqoop's intended use case, and it does it reliably.
Flume is mostly intended for streaming data, so if the files contain events and you get new files frequently, Flume with a Spooling Directory source can work.
Otherwise, hdfs dfs -put is a good way to copy files to HDFS.

Related

How to save large files in HDFS and links in HBase?

I have read that it is recommended to save files larger than 10 MB in HDFS and store the path of each file in HBase. Is there a recommended approach for doing this? Are there specific configurations, or tools like Apache Phoenix, that can help achieve this?
Or must the whole flow (saving the data in HDFS, saving the location in HBase, reading the path back from HBase, then reading the data from HDFS at that location) be done manually from the client?
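For context, the manual flow described above boils down to something like this sketch (the table, column family, and paths are invented for illustration):

    # 1. Store the large file in HDFS
    hdfs dfs -put report.pdf /files/report-001.pdf
    # 2. Record its location in HBase
    echo "put 'file_index', 'report-001', 'meta:path', '/files/report-001.pdf'" | hbase shell
    # 3. Later: look up the path, then fetch the file from HDFS
    echo "get 'file_index', 'report-001', 'meta:path'" | hbase shell
    hdfs dfs -get /files/report-001.pdf .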

GPFS to HDFS Migration

I have an IBM BigInsights cluster with ~5k Hive tables, HBase data, and some Big SQL tables. The data files are in different formats: text, Avro, bz2, etc.
To migrate from BigInsights to HDP (Hortonworks Data Platform), I need to understand how to move data from GPFS to HDFS.
Can you please explain the architectural differences between GPFS and HDFS? Does the NameNode work similarly for both? What changes in the namespace? Simply copying the namespace won't work.
How can GPFS be accessed from another Hadoop cluster? Will a simple distcp work?
What challenges might we face during the migration?
I have some options:
NFS gateway
distcp
HttpFS
WebHDFS REST API
SCP - secure copy
My main concern is which of these options will work for both GPFS and HDFS. If none of them is proven for my scenario, what alternatives should I consider?
Please suggest a solution and anything else I need to take care of.
Thanks.
Regards,
Pardeep Sharma.
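For reference, the generic shape of two of the options listed above, between two plain Hadoop clusters, looks like this (hostnames, ports, and paths are placeholders; whether the BigInsights side exposes a Hadoop-compatible URI for GPFS is exactly the open question):

    # distcp: a MapReduce job that copies data between clusters in parallel
    hadoop distcp hdfs://source-nn:8020/apps/hive/warehouse \
                  hdfs://target-nn:8020/apps/hive/warehouse

    # WebHDFS REST API: file system operations over HTTP, e.g. listing a directory
    curl -i "http://target-nn:50070/webhdfs/v1/apps/hive/warehouse?op=LISTSTATUS"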

Different ways to import files into HDFS

I want to know the different ways in which I can bring data into HDFS.
I am a newbie to Hadoop and have been a Java web developer until now. If I have a web application that creates log files, how can I import those log files into HDFS?
There are lots of ways to ingest data into HDFS; let me try to list them here:
hdfs dfs -put - a simple way to copy files from the local file system into HDFS (see the example after this list)
HDFS Java API
Sqoop - for moving data to/from databases
Flume - for streaming files and logs
Kafka - a distributed queue, mostly for near-real-time stream processing
NiFi - an Apache project (incubating at the time of writing) for moving data into HDFS without writing lots of custom code
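The simplest of these is the command-line copy, for example:

    # Copy a local log file into an HDFS directory
    hdfs dfs -mkdir -p /logs
    hdfs dfs -put /var/log/myapp/app.log /logs/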
The best solution for bringing web application logs into HDFS is Flume.
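As a minimal, illustrative setup (the agent name, directories, and HDFS URL are placeholders, not a tested production config), a Flume agent that watches a spool directory and writes into HDFS could be defined and started like this:

    # webagent.conf: spooling-directory source -> memory channel -> HDFS sink
    cat > webagent.conf <<'EOF'
    webagent.sources  = src1
    webagent.channels = ch1
    webagent.sinks    = sink1

    webagent.sources.src1.type     = spooldir
    webagent.sources.src1.spoolDir = /var/log/myapp/spool
    webagent.sources.src1.channels = ch1

    webagent.channels.ch1.type = memory

    webagent.sinks.sink1.type                   = hdfs
    webagent.sinks.sink1.hdfs.path              = hdfs://namenode:8020/logs/myapp/%Y-%m-%d
    webagent.sinks.sink1.hdfs.fileType          = DataStream
    webagent.sinks.sink1.hdfs.useLocalTimeStamp = true
    webagent.sinks.sink1.channel                = ch1
    EOF

    flume-ng agent --name webagent --conf-file webagent.conf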
We have three different kinds of data - structured (schema-based systems like Oracle/MySQL etc.), unstructured (images, web logs etc.) and semi-structured (XML etc.).
Structured data can be stored in a SQL database, in tables with rows and columns.
Semi-structured data is information that doesn't reside in a relational database but has some organizational properties that make it easier to analyze; with some processing you can store it in a relational database (e.g. XML).
Unstructured data often includes text and multimedia content. Examples include e-mail messages, word-processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents.
Depending on the type of your data, you will choose the tools to import it into HDFS.
Your company may use CRM or ERP tools, but we don't know exactly how that data is organized and structured.
Leaving aside simple HDFS commands like put and copyFromLocal, the main tools to load data into HDFS are:
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Data from MySQL, SQL Server & Oracle tables can be loaded into HDFS with this tool.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
Other tools include Chukwa, Storm and Kafka.
Another important technology that is becoming very popular is Spark, a friend and foe of Hadoop.
Spark is emerging as a good alternative to Hadoop for real-time data processing, and it may or may not use HDFS as a data source.

What is the difference between HBase and HDFS in Hadoop?

What is the actual difference, and when should one be used over the other when data needs to be stored?
Please read this post for a good explanation. But in general, HBase runs on top of HDFS. HDFS is a distributed file system just like any other file system (Unix/Windows), and HBase is like a database that reads from and writes to that file system, just like any other database (MySQL, MSSQL).
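To make the distinction concrete (the file, table, and column names are made up):

    # HDFS: whole-file operations; no notion of rows, keys, or lookups
    hdfs dfs -put access.log /logs/access.log
    hdfs dfs -cat /logs/access.log

    # HBase: keyed, row-level reads and writes on tables that themselves live in HDFS
    echo "create 'users', 'info'
    put 'users', 'row1', 'info:name', 'alice'
    get 'users', 'row1'" | hbase shell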

Copy Unstructured data into HDFS?

How do I copy unstructured data directly from a web server to HDFS using Sqoop in Hadoop (without copying the data into the local file system)?
From a web server to HDFS you need to use Flume or another appropriate tool; Sqoop is used to import/export from an RDBMS.
Since you have said the source is a web server and the data is unstructured, Flume is what you should look for!
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
http://flume.apache.org/
If the data source is an RDBMS and the data is structured, then Sqoop will fit the bill.
Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
http://sqoop.apache.org/

Resources