Different ways to import files into HDFS - hadoop

I want to know the different ways I can bring data into HDFS.
I am a newbie to Hadoop and have been a Java web developer until now. If I have a web application that is creating log files, how can I import those log files into HDFS?

There are lots of ways to ingest data into HDFS. Let me try to list them here:
hdfs dfs -put - a simple way to copy files from the local file system to HDFS
HDFS Java API
Sqoop - for moving data to and from databases
Flume - for streaming files and logs
Kafka - a distributed queue, mostly for near-real-time stream processing
NiFi - an incubating project at Apache for moving data into HDFS without making lots of changes
The best solution for bringing web application logs into HDFS is Flume.
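For the simplest option in the list, hdfs dfs -put, here is a minimal Python sketch that only builds the command line. The paths are hypothetical, and the helper name build_put_command is made up for illustration; it is not part of Hadoop.

```python
import shlex


def build_put_command(local_path, hdfs_dir, overwrite=False):
    """Return the argv list for copying one local file into an HDFS directory."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        cmd.append("-f")  # -f overwrites the destination if it already exists
    cmd += [local_path, hdfs_dir]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_put_command("/var/log/webapp/access.log", "/data/logs/webapp/", overwrite=True)
print(shlex.join(cmd))
```

To actually run it you would pass the list to subprocess.run(cmd, check=True) on a machine that has the Hadoop client installed and configured.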

We have three different kinds of data: structured (schema-based systems like Oracle, MySQL etc.), unstructured (images, weblogs etc.) and semi-structured (XML etc.).
Structured data can be stored in a SQL database, in tables with rows and columns.
Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (e.g. XML).
Unstructured data often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
Depending on the type of your data, you will choose the tools to import it into HDFS.
Your company may use CRM or ERP tools, but we don't know exactly how the data is organized and structured.
Leaving aside simple HDFS commands such as put and copyFromLocal, the main tools for loading data into HDFS are listed below.
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Data from MySQL, SQL Server & Oracle tables can be loaded into HDFS with this tool.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
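To make the Flume description more concrete, here is a rough Python sketch that renders a minimal agent configuration: a spooling-directory source, a memory channel and an HDFS sink. The component names r1, c1 and k1, the agent name and both paths are assumptions for illustration. The property keys follow the Flume user guide, but check them against the version you run.

```python
def flume_agent_config(agent, spool_dir, hdfs_path):
    """Render a minimal Flume agent properties file as a string."""
    lines = [
        # Declare one source, one channel and one sink for this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        "",
        # Source: pick up completed log files dropped into a directory.
        f"{agent}.sources.r1.type = spooldir",
        f"{agent}.sources.r1.spoolDir = {spool_dir}",
        f"{agent}.sources.r1.channels = c1",
        "",
        # Channel: in-memory buffer (fast, but events are lost if the agent dies).
        f"{agent}.channels.c1.type = memory",
        "",
        # Sink: write events to HDFS as plain text rather than SequenceFiles.
        f"{agent}.sinks.k1.type = hdfs",
        f"{agent}.sinks.k1.hdfs.path = {hdfs_path}",
        f"{agent}.sinks.k1.hdfs.fileType = DataStream",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical agent name and paths.
print(flume_agent_config("a1", "/var/log/webapp/spool", "hdfs://namenode/flume/webapp"))
```

Saved to a file such as webapp.conf, this would be started with something like flume-ng agent --conf conf --conf-file webapp.conf --name a1. The memory channel keeps the sketch short; a file channel is the usual choice when events must survive an agent restart.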
Other tools include Chukwa, Storm and Kafka.
Another important technology that is becoming very popular is Spark. It is both friend and foe to Hadoop.
Spark is emerging as a good alternative to Hadoop for real-time data processing, and it may or may not use HDFS as its data source.

Related

Pentaho and Hadoop

I am sorry if this question seems naive, but I am new to the data engineering field and am teaching myself for now. My question is: what is the difference between ETL products like Pentaho and Hadoop?
When should I use one instead of the other? Or can I use them together, and if so, how?
Thank you,
An ETL is a tool to Extract data, Transform (join, enrich, filter,...) it and Load the result into another data store. Good ETL tools are visual, data store agnostic and easy to automate.
Hadoop is a data store distributed across a cluster of networked machines, plus software to handle the distributed data. Its data transformations are specialized in a few elementary operations that can be optimized for these usually massive amounts of data, such as (but not only) Map-Reduce.
Pentaho Data Integration (PDI) has connectors to Hadoop systems which are easy to set up and tune. So the best strategy is to set up a Hadoop cluster as the data store and manipulate it through PDI.
Pentaho PDI is a tool for creating, managing, running and monitoring ETL workflows. It can work with Hadoop, RDBMS, queues, files, etc. Hadoop is a platform for distributed computation (Map-Reduce framework, HDFS, etc). Many tools can run on Hadoop, or connect to Hadoop to use its data and run processes.
Pentaho PDI can connect to Hadoop using its own connectors and read or write data. You can start a Hadoop job from PDI, and PDI can also process data by itself inside a transformation flow and store or send the results to HDFS, an RDBMS, a queue, email, etc. Of course you can invent your own tool for ETL workflows or simply use bash+Hive, etc, but PDI allows ETL processing in a unified way that does not depend on data sources and targets. Pentaho also has great visualization.

What is the relationship between Spark, Hadoop and Cassandra

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have sql style interface. However, Spark has its own sql. Why would one use Cassandra/Hive instead of Spark's native sql? Assuming that this is a brand new project with no existing installation?
Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, Hbase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of it being an alternative to Hadoop, it can be much faster than Hadoop at certain operations. For example a multi-pass map reduce operation can be dramatically faster in Spark than with Hadoop map reduce since most of the disk I/O of Hadoop is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
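The micro-batch idea can be illustrated in plain Python. This is not Spark's API, just a sketch of the concept under the assumption that each event carries an integer timestamp in seconds. Events are grouped into fixed-width time windows, and each window is then processed as one small batch.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into fixed-width time windows.

    Returns a list of (window_start, [values]) pairs ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // interval_seconds) * interval_seconds
        windows[window_start].append(value)
    return sorted(windows.items())


# Made-up events: (seconds since start, request line).
events = [(0, "GET /"), (1, "GET /login"), (5, "POST /login"), (11, "GET /home")]
for window_start, batch in micro_batches(events, interval_seconds=5):
    # A real job would transform the batch here and write the result to durable storage.
    print(window_start, len(batch))
```

A streaming engine does the same grouping continuously on a live stream instead of on a finished list, which is why latency is bounded by the batch interval.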
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
I'm writing a paper about Hadoop for university and stumbled over your question. Spark only uses Hadoop for persistence, and only if you want it to. It's possible to use it with other persistence tiers like Amazon S3.
On the other hand, Spark runs in memory and is not primarily built for map-reduce use cases the way Hadoop was and is.
I can recommend this article if you would like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.

Copy Unstructured data into HDFS?

How can I copy unstructured data directly from a web server to HDFS using Sqoop in Hadoop, without first copying the data to the local file system?
To go from a web server to HDFS you need to use Flume or another appropriate tool. Sqoop is used to import from and export to an RDBMS.
Since you have said the source is a web server and the data is unstructured, Flume is what you should look at.
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data
http://flume.apache.org/
If the data source is an RDBMS and the data is structured, then Sqoop will fit the bill.
Sqoop is designed for efficiently transferring bulk data between
Apache Hadoop and structured datastores such as relational databases.
http://sqoop.apache.org/
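For the structured RDBMS case, a basic Sqoop import might look like the command built below. This is a sketch only: the JDBC URL, user, table and target directory are hypothetical, and build_sqoop_import is just a helper written for this example.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, target_dir, mappers=4):
    """Return the argv list for importing one database table into an HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                           # prompt for the password on the console
        "--table", table,
        "--target-dir", target_dir,     # HDFS directory that receives the files
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


# Hypothetical connection details.
cmd = build_sqoop_import("jdbc:mysql://db.example.com/shop", "etl", "orders", "/data/shop/orders")
print(shlex.join(cmd))
```

The matching JDBC driver has to be available to Sqoop for the connection to work.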

What's the difference between Flume and Sqoop?

Both Flume and Sqoop are meant for data movement, so what is the difference between them? Under what conditions should I use Flume or Sqoop?
From http://flume.apache.org/
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data.
Flume helps to collect data from a variety of sources, such as logs, JMS and spooling directories. Multiple Flume agents can be configured to collect high volumes of data.
It scales horizontally.
From http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk
data between Apache Hadoop and structured datastores such as
relational databases.
Sqoop helps to move data between Hadoop and other databases, and it can transfer data in parallel for performance.
Both Sqoop and Flume pull data from a source and push it to a sink. The main difference is that Flume is event driven, while Sqoop is not.
Flume:
Flume is a framework for populating Hadoop with data. Agents are populated
throughout one's IT infrastructure – inside web servers, application servers
and mobile devices, for example – to collect data and integrate it into Hadoop.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such
as relational databases and data warehouses – into Hadoop. It allows users to
specify the target location inside of Hadoop and instruct Sqoop to move data
from Oracle,Teradata or other relational databases to the target.
You can see the full Post
Flume:
A very common use case is collecting log data from one system (a bank of web servers, say) and aggregating it in HDFS for later analysis.
Sqoop:
Sqoop, on the other hand, is designed for performing bulk imports of data into HDFS from structured data stores. A simple use case is an organization that runs a nightly Sqoop import to load the day's data from a production database into a Hive data warehouse for analysis.
--From the definitive guide.
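The nightly import into Hive mentioned above could be scripted roughly like this. It is a sketch under assumptions: the created_at column, the table names, the connection URL and the password file path are all hypothetical, and the helper exists only for this example.

```python
from datetime import date, timedelta


def nightly_hive_import(jdbc_url, username, password_file, table, hive_table, day,
                        date_column="created_at"):
    """Return the argv list that imports one day's rows of a table into Hive."""
    next_day = day + timedelta(days=1)
    # Half-open range [day, next_day) so rows at midnight are not imported twice.
    where = (f"{date_column} >= '{day.isoformat()}' "
             f"AND {date_column} < '{next_day.isoformat()}'")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # no interactive prompt in a scheduled job
        "--table", table,
        "--where", where,
        "--hive-import",
        "--hive-table", hive_table,
    ]


# Hypothetical values; a scheduler such as cron would pass yesterday's date.
cmd = nightly_hive_import("jdbc:mysql://db.example.com/shop", "etl",
                          "/user/etl/.db-password", "orders", "orders",
                          date(2024, 1, 31))
print(cmd)
```

Restricting each run to one day with --where keeps the nightly job from importing the whole table again.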
1. Apache Sqoop and Apache Flume work with various kinds of data sources. Flume works well with streaming data sources that generate data continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop is designed to work well with any relational database system that has JDBC connectivity.
2. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra, and it allows direct data transfer to Hive or HDFS. To transfer data to Hive with Sqoop, a table has to be created whose schema is taken from the database itself.
3. In Apache Flume data loading is event driven, whereas in Apache Sqoop data loading is not driven by events.
4. Flume is a better choice when moving bulk streaming data from sources like JMS or a spooling directory, whereas Sqoop is the ideal fit if the data is sitting in databases like Teradata, Oracle, MySQL, Postgres or any other JDBC-compatible database.
5. In Apache Flume, data flows to HDFS through multiple channels, whereas in Apache Sqoop HDFS is the destination for importing data.
6. Apache Flume has an agent-based architecture, i.e. the code written in Flume is known as an agent, which is responsible for fetching data. In Apache Sqoop the architecture is based on connectors, which know how to connect to the various data sources and fetch data accordingly.
Lastly, Sqoop and Flume cannot be used to achieve the same tasks, as they were developed to serve different purposes. Apache Flume agents are designed to fetch streaming data, like tweets from Twitter or log files from a web server, whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Sqoop is mainly used for parallel data transfers, for data imports as it copies data quickly where Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.
Sqoop and Flume are both meant to fulfill data ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that generate data continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity.
Sqoop is actually meant for bulk data transfers between Hadoop and other structured data stores. Flume collects log data from many sources, aggregates it, and writes it to HDFS.
I came across this interesting infographic that explains the differences between the two apache projects Sqoop and Flume -
Difference between Sqoop and Flume
Sqoop
Sqoop can import and export between an RDBMS and HDFS/Hive/HBase.
Sqoop only imports and exports structured data, not unstructured or
semi-structured data.
Flume
Flume imports streaming data from multiple sources, mostly semi-structured and
unstructured in nature. Kafka is now a better alternative to Flume.

Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question: ( Hadoop and MySQL Integration ) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way.....
Enlighten me.
At Foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using Thrift directly is very easy and allows you to use any programming language.
If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
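For getting output back out of HDFS in batch, Sqoop's export mode is the counterpart to import. A rough sketch of the command follows, assuming comma-separated output files; the connection URL, table name and directory are hypothetical.

```python
import shlex


def build_sqoop_export(jdbc_url, username, table, export_dir, field_sep=","):
    """Return the argv list for exporting files in an HDFS directory to a database table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                          # prompt for the password on the console
        "--table", table,              # target table, which must already exist
        "--export-dir", export_dir,    # HDFS directory holding the job output
        "--input-fields-terminated-by", field_sep,
    ]


# Hypothetical connection details and paths.
cmd = build_sqoop_export("jdbc:mysql://dw.example.com/reports", "etl",
                         "nightly_summary", "/user/hive/output/nightly_summary")
print(shlex.join(cmd))
```

This replaces the copy-to-local-folder step: the rows go from HDFS to the warehouse table directly over JDBC.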
Hope that helps.

Resources