Do customers usually allow third-party applications in their on-premise Hadoop cluster - hadoop

We are building a solution to validate data migrated from a traditional RDBMS to an on-premise Hadoop cluster, and also to perform validation of the data after the migration. The scripts performing the validation will compare the data in Hadoop with the data present on-premise. I would like to know whether customers would allow our application scripts inside the cluster, or whether we have to execute the scripts from a remote server where our application will be hosted?

data migrated from traditional RDBMS to on-premise hadoop
Using what tools? Sqoop? Spark? Kafka? NiFi?
Each of those tools is installed alongside your Hadoop cluster, so in that sense, yes, they are "installable". Whether they are "allowed" is up to your Hadoop administrators / architects.
You won't get "24/7 vendor support" if you use a tool you aren't paying for, though.
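For example, if Sqoop were the chosen tool, a minimal import into Hive might look like the sketch below; the JDBC URL, credentials, table, and Hive table name are placeholders, not values from the question:

    # Illustrative only: pull one RDBMS table into a Hive staging table with Sqoop.
    # All connection details and names here are hypothetical.
    sqoop import \
        --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
        --username etl_user -P \
        --table CUSTOMERS \
        --hive-import --hive-table staging.customers \
        --num-mappers 4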

Related

How to transfer data from production cluster to a datalab cluster for real time data analysis?

We are using MapR and we want to deploy a new (datalab) cluster, and I'm asking about the best way to transfer data from our production cluster to the datalab cluster.
We used mirroring between the two clusters, but with this option we have read-only data in our datalab, so how could we transfer data in real time?
You can use the options below:
1. DistCp, though only certain protocols are supported; refer here (see the sketch after this list).
2. If you are using HBase, you can use the snapshot feature; refer here.
3. Or you can use the database's own dump utility. For example, if you are using MySQL, use mysqldump -u [username] -p[pass] [dbname] | gzip > file.sql.gz and then move the file to the other server with scp username@<ip>:/<source>/file.sql.gz <destination>/
4. Or you can use Apache Falcon, which uses an Oozie workflow to replicate the data between clusters. You can set up a one-time workflow and execute it.
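A minimal sketch of the DistCp option above; the NameNode hosts, port, and paths are placeholders rather than values from the question:

    # Copy /data from the production cluster to the datalab cluster.
    # -update only copies files that are missing or changed on the target,
    # so the command can be re-run (e.g. from cron or Oozie) to keep it current.
    hadoop distcp -update \
        hdfs://prod-namenode:8020/data \
        hdfs://datalab-namenode:8020/data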
If you want just a FS.a ==> FS.b "real-time" pipe, the best options I know of are either Apache NiFi or StreamSets because there is no coding required.
Flume could potentially be another option because it's already available in most Hadoop vendor environments.
You can use Spark or Flink if you are more development oriented.
DistCp on an Oozie schedule is the fail-safe solution.

How to integrate Cassandra with Hadoop to take advantage of Hive

For almost 3 days now (in 2015) I have been looking for a solution to integrate Cassandra with Hadoop; lots of resources on the net are outdated or have vanished, and DataStax Enterprise offers no free-of-charge solution for such an integration.
What are the options for doing this? I want to use the Hive query language to get data from Cassandra, and I think the first step is to integrate Cassandra with Hadoop.
The easiest (but also paid) option is to use the DataStax Enterprise packaging of C* with Hadoop + Hive. This provides automatic connection and registration of Hive tables with C*, and it includes and sets up a Hadoop execution platform if you need one.
http://www.datastax.com/products/datastax-enterprise
The second easiest way is to utilize Spark instead. The Spark Cassandra Connector is open source and allows HiveQL to be used to access C* tables. This runs on Spark as an execution platform instead of Hadoop, but has similar (if not better) performance.
With this solution I would stand up a standalone Spark cluster (since you don't have existing Hadoop infrastructure) and then use the Spark SQL Thrift server to run queries against C* tables.
https://github.com/datastax/spark-cassandra-connector
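As a rough sketch of that setup (the connector version, Scala build, and Cassandra host below are assumptions, not values from the answer), the Spark SQL Thrift server can be started with the connector on its classpath:

    # Start the Spark SQL Thrift server with the open-source Cassandra connector.
    # Package coordinates and the Cassandra host are hypothetical; match the
    # connector version to your Spark and Scala versions.
    $SPARK_HOME/sbin/start-thriftserver.sh \
        --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 \
        --conf spark.cassandra.connection.host=cassandra-host
    # A JDBC client such as beeline can then connect and run HiveQL-style queries
    # against the C* tables exposed through the connector.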
There are other options, but these are the ones I am most familiar with (and, conflict of interest notice, also develop :D).

No passwd entry for user 'hdfs'

I am trying to set up a Hive environment on my Google Compute Engine Hadoop cluster, which was deployed from the one-click deployment.
When I try to switch to the hdfs user (su hdfs), I get the error message below.
No passwd entry for user 'hdfs'
The "one-click deployment" is an older sample which perhaps showcases installation from shell scripts and tarballs, but isn't intended for use as a supported Hadoop service, and doesn't set up typical Hadoop installation configurations like an hdfs user or adding commands to /usr/bin.
If you want a more Hadoop (and Pig+Hive+Spark) specialized service, you may want to consider using Google Cloud Dataproc, which is Google's managed Hadoop solution. You can create clusters from the cloud console UI in Dataproc just like click-to-deploy, and you'll get a more fully installed Hadoop/Hive environment, including a per-cluster persistent MySQL-based Hive metastore which is shared with SparkSQL to make it easy to play with Spark without modifying your Hive environment if you so choose.
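For reference, a Dataproc cluster can also be created from the command line; a minimal sketch, where the cluster name and region are placeholders:

    # Create a managed Dataproc cluster with Hadoop, Hive, Pig, and Spark preinstalled.
    # 'my-cluster' and the region are hypothetical; pick your own values.
    gcloud dataproc clusters create my-cluster --region us-central1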

Sqoop vs Informatica Big Data edition for Data sourcing

I have the option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasoning behind it.
Note:
My current utility is able to pull data into HDFS using Sqoop, and to create a Hive staging table and an archive external table.
Informatica is the ETL tool used in the organization.
Regards
Sanjeeb
Sqoop
Sqoop is capable of performing full and incremental loading from Oracle/Teradata (see the sketch after this list).
Sqoop does parallel copy of data from source systems.
Sqoop scripts can be custom-generated and scheduled by Oozie.
Open source solution for any size cluster. No license cost.
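A minimal sketch of the incremental loading mentioned above; the connection string, table, check column, and last value are hypothetical:

    # Illustrative incremental Sqoop load: append only rows whose ORDER_ID is
    # greater than the last value captured in the previous run.
    sqoop import \
        --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
        --username etl_user -P \
        --table ORDERS \
        --target-dir /data/staging/orders \
        --incremental append \
        --check-column ORDER_ID \
        --last-value 1000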
Informatica
Best Interface in ETL Industry to manage mappings.
Does not provide parallel copy options, but provides a Hive mode for parallel processing: it basically converts transformations into Hive queries for execution. Also supports push-down to generate MapReduce code.
Licensing cost is per node. If you plan on 500 Hadoop nodes for future data storage, you pay 10 times as much as for a 50-node cluster when you scale out.
Informatica BDE is a relatively new product in the market. INFA Developer will be useful for working on big data. There are challenges in supporting all the latest Hadoop platform features in Informatica, as well as traditional RDBMS features like sequence generation, stateful mappings, sessions, and Lookup transformations in Informatica BDE.
Informatica MDM does not support Hadoop.
If price is a criterion for decision making, go for Sqoop. If you want the flexibility of switching Hadoop platform tools, use Sqoop (the Sqoop project is also considering moving over to Spark).
If you are tied to Informatica for some reason, go for Informatica. But most Informatica developers want to move to Hadoop technologies.
Although this was asked a year ago, sharing some new features in Informatica:
Informatica BDM version 10.1 supports Sqoop connectivity, i.e. you can use Sqoop to read data from an RDBMS and load it into Hadoop/Hive.
Also, there are many new features in BDM version 10.2, especially the parameterization support in the developer tool and dynamic mappings.
Tool versus hand-coding has always been a debate.
The Informatica tool gives an enterprise-level solution which is easier to maintain.
BDM 10.1.1 supports Sqoop with the Spark engine. Spark 2.0.1 is supported in this version, so performance is pretty good.
BDM 10.2 has just been released with new features like stateful variable support, which was missing in earlier versions.
Sqoop must be used for the data exchange. You have a lot of options with which you can get optimal performance. Also, if you are trying to exchange data between RDBMS (Teradata/Oracle) <-> Informatica <-> Hadoop cluster, the data would first need to be brought to the Informatica server, which may involve additional I/O.
If the data processing must be done within Hive, Informatica BDE must be used.

HBase as a database in a web application

A big question about using Hadoop or related technologies in a real web application.
I just want to find out how a web app can use HBase as its database. I mean, is that something only big data apps do, or do they use normal databases and just use these sorts of technologies for analysis?
Is it OK to have an online store with an HBase database, or something like this?
Yes, it is perfectly fine to have HBase as your backend.
Here is what I am doing to get this done (I have an online community and forum running on my website):
1. Writing C# code to access HBase using Thrift; it is very easy and simple to get this done. (Thrift is a cross-language binding platform; for HBase, only Java is a first-class citizen!)
2. Managing the HBase cluster (I have it on Amazon) using the Amazon EMI
3. Using Ganglia to monitor HBase
Some Extra tips:
So you can organize the web application like this:
You can set up your web servers on Amazon Web Services or IBM WebSphere.
You can set up your own HBase cluster using Cloudera, or use Amazon EC2 again here.
Communication between the web server and the HBase master node happens via the Thrift client.
You can generate Thrift code in your desired programming language.
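For the Thrift piece, the gateway that non-Java clients connect to runs on the cluster side; a minimal sketch (port 9090 is just the common default, not a value from the answer):

    # Start the HBase Thrift gateway that the C# (or any non-Java) client talks to.
    hbase thrift start -p 9090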
Here are some links that helped me:
A) Thrift Client
B) Filtering options
Along with this, I refer to the HBase Administration Cookbook by Yifeng Jiang and the HBase reference guide by Lars George in case I don't get answers on the web.
The filtering options provided by HBase are fast and accurate. Say you use HBase to store your product details: you can have sub-stores, add a column in your Product table that tells which store a product belongs to, and use filters to get the products for a specific store.
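A rough illustration of that idea from the HBase shell; the table name, column family, qualifier, and store value are all hypothetical:

    # Scan a hypothetical 'Product' table for rows whose info:store column equals
    # 'store1', using a SingleColumnValueFilter; the command is piped into hbase shell.
    echo "scan 'Product', {FILTER => \"SingleColumnValueFilter('info', 'store', =, 'binary:store1')\"}" | hbase shell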
I think you should read the article below:
"Apache HBase Do’s and Don’ts"
http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/
