Set up the MarkLogic Connector for Hadoop on a Windows machine?

I need to set up the MarkLogic Connector for Hadoop to send MarkLogic files to HDFS storage and retrieve them.
I went through one of the MarkLogic documents, where the required-software section mentions that Linux is required:
The MarkLogic Connector for Hadoop is a Java-only API and is only available on Linux. You can use the connector with any of the Hadoop distributions listed below. Though the Hadoop MapReduce Connector is only supported on the Hadoop distributions listed below, it may work with other distributions, such as an equivalent version of Apache Hadoop.
So, does this mean I can't achieve this on a Windows machine?
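For context, the kind of job driver I'm hoping to run looks roughly like the sketch below, modeled on the HelloWorld-style sample in the connector documentation. The class names (DocumentInputFormat, DocumentURI), the mapreduce.marklogic.* property keys, and the host/port/credentials are all assumptions to verify against your connector version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;

// Sketch only: pulls documents out of MarkLogic and writes them to HDFS.
// All mapreduce.marklogic.* keys and the Text value type are assumptions
// based on the connector's sample code; check them against your version.
public class MarkLogicToHdfs {

    // Emits each document's URI and its content as plain text.
    public static class ReadMapper extends Mapper<DocumentURI, Text, Text, Text> {
        @Override
        protected void map(DocumentURI key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new Text(key.toString()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder connection details for the source MarkLogic instance.
        conf.set("mapreduce.marklogic.input.host", "ml-host.example.com");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "admin");
        conf.set("mapreduce.marklogic.input.password", "admin");

        Job job = Job.getInstance(conf, "marklogic-to-hdfs");
        job.setJarByClass(MarkLogicToHdfs.class);
        job.setInputFormatClass(DocumentInputFormat.class);
        job.setMapperClass(ReadMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // A plain TextOutputFormat writes the results into HDFS.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}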

Related

Hadoop distribution

I was using IBM BigInsights via VNC software (remote access) provided by the university where I study, but I can't access the Internet from that desktop. To use some data samples available on the Internet, I decided to install Hadoop on my laptop (single-node cluster), but I found that there are many distributions. So what's the best free Hadoop distribution for training as a beginner?
1) Amazon Elastic MapReduce
2) Cloudera CDH Hadoop Distribution
3) Hortonworks Data Platform (HDP)
4) MapR Hadoop Distribution
5) IBM Open Platform
6) Microsoft Azure's HDInsight - Cloud-based Hadoop Distribution
7) Pivotal Big Data Suite
8) Datameer Professional
9) Datastax Enterprise Analytics
10) Dell-Cloudera Apache Hadoop Solution
CDH and Hortonworks are the easiest for getting a single-node cluster up and running, and both are very widely used, so you can find a lot of troubleshooting resources.
If you just want to write application code and run arbitrary MapReduce jobs rather than learn Hadoop's systems architecture, then Amazon EMR is more suitable.
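For reference, the "arbitrary MapReduce job" in question looks the same on any of these distributions; here is the classic word-count driver as a minimal sketch (the input and output paths come from the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}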

Can Tableau connect to Apache Hadoop, or only to the major Hadoop distributions?

We need help with a reporting tool. Basically, we are looking for the best reporting tool that can connect to Hive and pull reports, so we thought of using Tableau. We are using our own Hadoop distribution (not from Hortonworks, Cloudera, MapR, etc.). Will Tableau also connect to the Apache distribution of Hadoop? If not, please suggest a good reporting tool; freeware is highly preferred.
Thank you.
Yes, Tableau will connect to your free Apache Hadoop distribution.
You will have to put all the necessary JAR files, such as the Hadoop core and Hadoop common JARs, into your Tableau lib directory; likewise, you have to put the correct version of the Tableau driver into your Hadoop lib directory.
Then, with the help of HiveServer2 (also known as the Hive Thrift server), you can supply your driver name and connection string (see the JDBC sketch after the links below).
For more details:
http://kb.tableau.com/articles/knowledgebase/connecting-to-hive-server-2-in-secure-mode
http://kb.tableau.com/articles/knowledgebase/administering-hadoop-hive
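If you want to verify the HiveServer2 connection before wiring up Tableau, a bare JDBC client is the quickest check. A minimal sketch, assuming HiveServer2 on the default port 10000 and the hive-jdbc standalone JAR on the classpath; the host, database, and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Check {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, from the hive-jdbc standalone JAR.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host/port/database; adjust for your cluster.
        String url = "jdbc:hive2://hive-host.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            // If this prints your tables, Tableau should be able to connect
            // through the same host, port, and credentials.
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}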

hcatUtil not found when Configuring HP Vertica for HCatalog

I am trying to configure HP Vertica for HCatalog:
Configuring HP Vertica for HCatalog
But I cannot find hcatUtil on my Vertica cluster.
Where can I get this utility?
As this answer said, it's in /opt/vertica/packages/hcat/tools starting with version 7.1.1. But you probably need some further information:
You need to run hcatUtil on a node in your Hadoop cluster; the utility gathers up Hadoop libraries that Vertica also needs to access, so you need to have those libraries available. Assuming you're not co-locating Vertica nodes on your Hadoop nodes, the easiest way to do this is probably to copy the script to a Hadoop node, run it with output to a temporary directory, and then copy the contents of the temporary directory back to the Vertica node. (Put them in /opt/vertica/packages/hcat/lib.) Then proceed with installing the HCatalog connector.
See this section in the Vertica documentation for more details. (Link is to 7.2.x, but the process has been the same since the tool was introduced.)
The hcatUtil utility was introduced in Vertica 7.1.1 and is located in /opt/vertica/packages/hcat/tools. If you do not have it there, you are most likely using an older Vertica version.

How to install Apache Spark on HortonWorks HDP 2.2 (built using Ambari)

I successfully built a 5-node cluster of Hortonworks HDP 2.2 using Ambari.
However, I don't see Apache Spark in the installed services list.
I did some research and found that Ambari does not install certain components, like Hue (Spark was not in that list, but I guess it's not installed either).
How do I manually install Apache Spark on my 5-node HDP 2.2 cluster?
Or should I delete my cluster and perform a fresh install without using Ambari?
Hortonworks support for Spark is arriving but not fully complete (details and blog).
Instructions for how to integrate Spark with HDP can be found here.
You could build your own Ambari stack for Spark. I recently did just that, but I cannot share that code :(
What I can do is share a tutorial I wrote on how to build any stack for Ambari, including Spark. There are many interesting issues with Spark that need to be addressed and are not covered in the tutorial. Anyway, hope it helps: http://bit.ly/1HDBgS6
There is also a guide from the Ambari people here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38571133.
1) Ambari 1.7.x does not install the Accumulo, Hue, Ranger, or Solr services for the HDP 2.2 stack. To install the Accumulo, Hue, Knox, Ranger, and Solr services, install HDP manually.
2) Apache Spark 1.2.0 on YARN with HDP 2.2: here.
3) Spark and Hadoop: Working Together:
Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Hadoop YARN deployment: Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark into their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark. (A minimal sketch of such a job follows this list.)
Spark In MapReduce : For the Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it! This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
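As a concrete illustration of the YARN option, here is a minimal Spark job in Java that just counts the lines of a file in HDFS; the path is a placeholder:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsLineCount {
    public static void main(String[] args) {
        // The master is supplied by spark-submit, so none is hard-coded here.
        SparkConf conf = new SparkConf().setAppName("hdfs-line-count");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Placeholder HDFS path; point this at your own data.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/hdp/sample.txt");
        System.out.println("line count = " + lines.count());
        sc.stop();
    }
}

Build it into a JAR and submit with something like spark-submit --class HdfsLineCount --master yarn-cluster app.jar (on newer Spark versions, --master yarn --deploy-mode cluster).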

Moving files to Hadoop HDFS using SFTP

I have a VPC subnet with multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS file system using SFTP.
Does Hadoop have any APIs that can achieve this goal?
PS: I've installed Hadoop using the Cloudera CDH4 distribution.
This is a requirement that is much easier to implement on the FTP/SFTP server side than in HDFS.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it; Oozie comes with the Cloudera distribution. Other orchestration tools could be Talend or PDI Kettle.
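On the "does Hadoop have APIs" part: yes, Hadoop's FileSystem API can write a stream straight into HDFS, so you can pair it with any SFTP client library and skip an intermediate local copy. A minimal sketch, assuming the third-party JSch library for the SFTP side; all hosts, paths, and credentials are placeholders:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class SftpToHdfs {
    public static void main(String[] args) throws Exception {
        // SFTP side (JSch): placeholder host, user, and password.
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "source-host.example.com", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();
        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        // HDFS side: fs.defaultFS should point at your NameNode
        // (set it here or pick it up from core-site.xml on the classpath).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Stream the remote file straight into HDFS without a local copy.
        try (InputStream in = sftp.get("/data/input/file.txt");
             FSDataOutputStream out = fs.create(new Path("/user/hdfs/file.txt"))) {
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            sftp.disconnect();
            session.disconnect();
        }
    }
}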
