Different tools available for creating data pipelines - hadoop

I need to create data pipelines in Hadoop. I already have data import, export and data-cleaning scripts set up, and now need to combine them into a pipeline.
I have been using Oozie to schedule data import and export, but now need to integrate R scripts for the data-cleaning process as well.
I see that Falcon is used for this.
How do I install Falcon on Cloudera?
What other tools are available to create data pipelines in Hadoop?

2) I'm tempted to answer NiFi from Hortonworks: since this post on LinkedIn it has grown a lot and is very close to replacing Oozie. At the time of writing, the main difference between Oozie and NiFi is where they run: NiFi on an external cluster and Oozie inside Hadoop.

Related

How to run analytics on Parquet files in a non-Hadoop environment

We are generating Parquet files using Apache NiFi in a non-Hadoop environment, and we need to run analytics on them.
Apart from Apache frameworks like Hive and Spark, is there any open source BI or reporting tool that can read Parquet files, or is there any other workaround? Our environment has the Jasper reporting tool.
Any suggestion is appreciated. Thanks.
You can easily process Parquet files in Python:
To read/write Parquet files, you can use pyarrow or fastparquet.
To analyze the data, you can use Pandas (which can itself read/write Parquet, using one of the implementations mentioned in the previous item behind the scenes).
To get a nice interactive data exploration environment, you can use Jupyter Notebook.
All of these work in a non-Hadoop environment.

How to transfer data from production cluster to a datalab cluster for real time data analysis?

We are using MapR and want to deploy a new (datalab) cluster. What is the best way to transfer data from our production cluster to the datalab cluster?
We tried mirroring between the two clusters, but with this option the data in our datalab is read-only. How can we transfer data in real time?
You can use the options below:
DistCp. Note that it supports only certain protocols. Refer here.
If you are using HBase, you can use the snapshot feature. Refer here.
Or you can use the database's own utility to create a dump. For example, if you are using MySQL, run mysqldump -u [username] -p[pass] [dbname] | gzip > file.sql.gz and then move it to the other server with scp username@<ip>:/<source>/file.sql.gz <destination>/
Or you can use Apache Falcon, which uses an Oozie workflow to replicate the data between clusters. You can set up a one-time workflow and execute it.
If you just want an FS.a ==> FS.b "real-time" pipe, the best options I know of are Apache NiFi or StreamSets, because no coding is required.
Flume could be another option because it's already available in most Hadoop vendor environments.
You can use Spark or Flink if you are more development oriented.
DistCp on an Oozie schedule is the fail-safe solution.
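As a rough sketch of the DistCp option, the copy could be wrapped in a small Python helper that a scheduler (an Oozie shell action, cron, etc.) calls. The helper names, NameNode addresses and paths below are placeholders for illustration, not something from the answers above; -update is the DistCp flag that copies only files missing or different at the target.

```python
import subprocess


def build_distcp_command(source_uri, target_uri, update=True):
    """Return the argv list for a `hadoop distcp` copy between two clusters."""
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target.
        command.append("-update")
    command += [source_uri, target_uri]
    return command


def run_distcp(source_uri, target_uri):
    """Run the copy and raise CalledProcessError if DistCp exits non-zero."""
    subprocess.run(build_distcp_command(source_uri, target_uri), check=True)


if __name__ == "__main__":
    # Hypothetical NameNode addresses and paths; replace with your own.
    print(" ".join(build_distcp_command(
        "hdfs://prod-nn:8020/data/events",
        "hdfs://lab-nn:8020/data/events",
    )))
```

Building the command as a list (rather than a shell string) avoids quoting problems with unusual paths. This is still a batch copy, so how "real time" it feels depends entirely on how often it is scheduled.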

No passwd entry for user 'hdfs'

I am trying to set up a Hive environment on my Google Compute Engine Hadoop cluster, which was deployed with the one-click deployment.
When I try to switch to the hdfs user (su hdfs), I get the error message below.
No passwd entry for user 'hdfs'
The "one-click deployment" is an older sample which perhaps showcases installation from shell scripts and tarballs, but isn't intended for use as a supported Hadoop service, and doesn't set up typical Hadoop installation configurations like an hdfs user or adding commands to /usr/bin.
If you want a more Hadoop (and Pig+Hive+Spark) specialized service, you may want to consider using Google Cloud Dataproc, which is Google's managed Hadoop solution. You can create clusters from the cloud console UI in Dataproc just like click-to-deploy, and you'll get a more fully installed Hadoop/Hive environment, including a per-cluster persistent MySQL-based Hive metastore which is shared with SparkSQL to make it easy to play with Spark without modifying your Hive environment if you so choose.

Can apache spark run without hadoop?

Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN).
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
Distributed processing:
You can run Spark in three different modes: Standalone, YARN and Mesos.
Have a look at the below SE question for a detailed explanation about both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to set it up properly here).
(Edit) Note: since version 2.3.0 Spark also added native support for Kubernetes
By default, Spark does not have a storage mechanism.
To store data, it needs a fast and scalable file system. You can use S3, HDFS or any other file system. Hadoop is an economical option due to its low cost.
Additionally, if you use Tachyon, it will boost performance with Hadoop. Hadoop is highly recommended for Apache Spark processing.
As per Spark documentation, Spark can run without Hadoop.
You can run it in Standalone mode without any resource manager.
But if you want to run a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3 etc.
Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS.
Yes, you can install Spark without Hadoop.
That would be a little tricky.
You can refer to Arnon's link on using Parquet with S3 as data storage.
http://arnon.me/2015/08/spark-parquet-s3/
Spark only does processing, and it uses dynamic memory to perform the task, but to store the data you need some data storage system. This is where Hadoop comes in: it provides the storage for Spark.
One more reason for using Hadoop with Spark is that both are open source and integrate with each other easily compared to other data storage systems. Other storage such as S3 is trickier to configure, as mentioned in the link above.
But Hadoop also has its own processing unit, called MapReduce.
Want to know the difference between the two?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand what to use, when to use it and how to use it.
Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with the MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source such as a traditional database (JDBC), Kafka or even local disk.
Yes, Spark can run with or without a Hadoop installation. For more details you can visit https://spark.apache.org/docs/latest/
Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop. But the Spark distribution comes with pre-built Hadoop libraries, which are used when installing on your local machine.
You can run Spark without Hadoop, but Spark has a dependency on Hadoop's winutils, so some features may not work. Also, if you want to read Hive tables from Spark then you need Hadoop.
My English is not good, forgive me!
TL;DR
Use local (single node) or standalone (cluster) mode to run Spark without Hadoop, but it still needs Hadoop dependencies for logging and some file processing.
Running Spark on Windows is strongly NOT recommended!
Local mode
Spark has many running modes; one of them, called local, runs without Hadoop.
So here is the first question: how do we tell Spark that we want to run in local mode?
After reading this official doc, I gave it a try on my Linux OS:
1. Install Java and Scala. This is not the core content, so I skip it.
2. Download the Spark package.
There are two types of package: "without hadoop" and "hadoop integrated".
The most important thing is that "without hadoop" does NOT mean running without Hadoop; it just means Hadoop is not bundled, so you can bundle Spark with your custom Hadoop!
Spark can run without Hadoop (HDFS and YARN) but needs Hadoop dependency jars, such as the Parquet/Avro SerDe classes, so I strongly recommend the "integrated" package. If you choose the "without hadoop" package you will find some logging dependencies such as log4j and slf4j and other common utility classes missing, all of which are bundled with the Hadoop-integrated package.
3. Run in local mode.
The simplest way is to just run the shell, and you will see the welcome log:
# same as ./bin/spark-shell --master local[*]
./bin/spark-shell
Standalone mode
Same as above, except for step 3.
# Start up the cluster
# to run in the foreground instead of as a daemon:
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on every worker
./sbin/start-worker.sh spark://VMS110109:7077
# Submit a job or just start the shell
./bin/spark-shell --master spark://VMS110109:7077
On windows?
I know many people run Spark on Windows just for study, but things are quite different on Windows and I really strongly do NOT recommend using it.
The most important thing is to download winutils.exe from here and set the system variable HADOOP_HOME to point to where winutils is located.
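A quick way to check that setup before launching Spark is a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the usual layout where winutils.exe sits in the bin subdirectory of HADOOP_HOME.

```python
import os
from pathlib import Path


def find_winutils(env=None):
    """Return the path to winutils.exe under HADOOP_HOME/bin, or None if missing."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return None
    # Assumed layout: %HADOOP_HOME%\bin\winutils.exe
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_winutils()
    print(found if found else "HADOOP_HOME is unset or bin/winutils.exe is missing")
```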
At the moment 3.2.1 is the latest release of Spark, but it has a bug: you will get an exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when you run ./bin/spark-shell.cmd. As a temporary fix, start a standalone cluster first and then use ./bin/spark-shell.cmd, or use a lower version.
For more details and the solution you can refer here.
No. It requires a full-blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944

How to know that new data has been added to HDFS?

I am implementing a notification system based on the publish-subscribe model to notify about the availability of data as it arrives in / is loaded to HDFS. I didn't find where to look for this. Is there any HDFS API that can be used for this, or what method should I use to get information about new data written to HDFS? I am using Hadoop v2.0.2 and I don't want to use HCatalog; I want to implement my own tool to do this.
What you are looking for is Oozie Coordinator.
HDFS is a file system, so something must be built on top of HDFS to check for file availability. HBase has coprocessors, which are triggered procedures, but they are only available for HBase tables, so they cannot be used for detecting data availability in HDFS.
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it:
Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system
specific jobs (such as Java programs and shell scripts).
So you can use the file availability trigger for your notification system too.
If you use HDFS you might want to check out HBase, as it has the functionality you want. In HBase, you can create a pre-put (or post-put) coprocessor that acts essentially like a MySQL trigger, running a bit of code every time data is written to a table.
If HBase doesn't suit your use case and you must use HDFS, AFAIK there aren't similar triggers. You can try wrapping the HDFS API with your own code to perform the notification whenever data is written to your file system under the appropriate circumstances. Alternatively, you can poll HDFS for changes (which sounds like an ugly alternative)...
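The polling alternative can be sketched in a few lines of Python. This is only an illustration under assumptions: list_paths and notify are hypothetical callables you would supply yourself (for example a wrapper around hdfs dfs -ls or a WebHDFS listing call, and your publish step), not part of any HDFS API.

```python
import time


def new_paths(previous, current):
    """Return paths present in `current` but not in `previous`, sorted."""
    return sorted(set(current) - set(previous))


def poll_for_new_files(list_paths, notify, interval_seconds=60, max_polls=None):
    """Call `notify(path)` for each path that appears between two listings.

    `list_paths` is any zero-argument callable returning the current paths;
    it is a placeholder for however you list the watched HDFS directory.
    """
    seen = set(list_paths())  # baseline: existing files are not reported
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval_seconds)
        current = set(list_paths())
        for path in new_paths(seen, current):
            notify(path)
        seen = current
        polls += 1
```

Note that a file shows up in a listing as soon as it is created, so a poller like this may report files that are still being written. A common way around that is for writers to use a temporary name and rename the file once it is complete.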
Hope that helps
