We have a big Hadoop cluster and recently installed Elasticsearch for evaluation.
Now we want to bring data from HDFS into Elasticsearch.
Elasticsearch is installed in a different cluster, and so far we could run a Beeline or HDFS script to extract data from Hadoop into a file and then bulk load that local file into Elasticsearch.
I am wondering if there is a direct connection from HDFS to Elasticsearch.
I started reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (we do not configure or manage the Hadoop cluster) and can only access Hadoop via Kerberos/username/password, I am wondering whether this can be configured (and how) without involving the whole DevOps team that manages the Hadoop cluster to install and set up all these libraries before we can connect directly.
How can we do it from the client side?
Thanks.
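As an illustration of what a purely client-side path could look like (this is not the es-hadoop library linked above), here is a minimal sketch that streams a file from HDFS over WebHDFS and bulk-indexes its lines into Elasticsearch. It assumes WebHDFS is enabled and reachable from the client, the file is newline-delimited JSON, and the host names, user and index name are placeholders; on a Kerberized cluster the WebHDFS call would additionally need SPNEGO authentication.

```python
# Sketch: stream newline-delimited JSON from HDFS (via WebHDFS) into Elasticsearch.
# Assumptions: WebHDFS is enabled on the namenode, the file contains one JSON
# document per line, and "namenode", "es-host" and "my-index" are placeholders.
import json
import requests
from elasticsearch import Elasticsearch, helpers

WEBHDFS_URL = "http://namenode:50070/webhdfs/v1/data/export/part-00000"
ES_HOSTS = ["http://es-host:9200"]
INDEX = "my-index"

def hdfs_lines():
    # OPEN is redirected by the namenode to a datanode; requests follows it.
    resp = requests.get(WEBHDFS_URL,
                        params={"op": "OPEN", "user.name": "myuser"},
                        stream=True)
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            yield {"_index": INDEX, "_source": json.loads(line)}

es = Elasticsearch(ES_HOSTS)
helpers.bulk(es, hdfs_lines())
```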
Related
We have a Kerberized Hadoop cluster where HBase is running. Apache Drill is running in distributed mode in another cluster. Now we need to query the Kerberos-enabled HBase from Apache Drill, using the web UI. The Hadoop cluster is actually running in AWS, and HBase uses S3 as storage. Please help me with the steps to achieve successful queries against HBase.
Apache Drill version: 1.16.0, Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually and then open the HBase shell on the Hadoop cluster. We want to use Drill to query in a SQL fashion, for easier and more readable queries.
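For the "query in a SQL fashion" part, once the HBase storage plugin and the Kerberos pieces are configured, queries can also be submitted programmatically to the same REST endpoint that backs Drill's web UI. A hedged sketch, assuming the default Drill web port 8047, no web authentication on Drill, and `hbase` as the storage plugin name:

```python
# Sketch: submit a SQL query to Drill's REST endpoint (the service behind the
# web UI) after the HBase storage plugin and Kerberos setup are in place.
# "drillbit-host", the plugin name "hbase" and the table name are placeholders;
# if web authentication is enabled on Drill, a login step would be needed first.
import requests

DRILL_URL = "http://drillbit-host:8047/query.json"

payload = {
    "queryType": "SQL",
    "query": "SELECT * FROM hbase.`my_table` LIMIT 10",
}

resp = requests.post(DRILL_URL, json=payload)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```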
I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands in the cluster. I should be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive? Is that possible?
2) s3-dist-cp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, a Hive query that uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth though: you cannot use distcp in an un-Kerberized cluster to read data from a Kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems; distcp is best when it schedules the uploads on the hosts that actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.
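To make the first suggestion concrete, here is a hedged sketch of driving distcp from a machine in EC2 (or any box with the Hadoop client installed and network access to both the cluster and S3). The principal, keytab path, namenode address, bucket name and S3A credentials are placeholders; on a Kerberized source cluster you must obtain a ticket before distcp can read the data.

```python
# Sketch: run distcp from an EC2/edge machine against a Kerberized source cluster.
# Assumes the Hadoop client and a krb5.conf for the cluster's realm are installed
# locally; principal, keytab, namenode and bucket names are placeholders.
import subprocess

PRINCIPAL = "myuser@EXAMPLE.COM"
KEYTAB = "/home/myuser/myuser.keytab"

# Obtain a Kerberos ticket so distcp can read from the secured cluster.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# Copy a Hive table's warehouse directory from HDFS to S3 via the s3a connector.
subprocess.run([
    "hadoop", "distcp",
    "-D", "fs.s3a.access.key=AKIA...",   # or rely on an instance profile instead
    "-D", "fs.s3a.secret.key=...",
    "hdfs://namenode:8020/user/hive/warehouse/mydb.db/mytable",
    "s3a://my-bucket/mytable/",
], check=True)
```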
I integrated MarkLogic 9 with Hadoop using the MarkLogic Connector for Hadoop. I want to load data from my local machine into MarkLogic using Hadoop. The documentation mentions that there are two ways to load data using Hadoop:
Importing data from HDFS into MarkLogic using MLCP
Exporting data from MarkLogic to HDFS using MLCP
What I want to know is whether there is any way to send data from MLCP to MarkLogic directly through my Hadoop cluster, because I want to use the MapReduce power of Hadoop by setting options such as input_split, -max_split_size, etc. I know MLCP is built on MapReduce, and since my Hadoop cluster has a lot of processing power, I want to use it.
Thanks
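For illustration, a hedged sketch of launching an MLCP import in distributed mode so the copy runs as a MapReduce job on the Hadoop cluster. This assumes an MLCP 9-era client where distributed mode is supported; the host, port, credentials, HDFS path and split size are placeholders, so check the options against your MLCP version's documentation.

```python
# Sketch: launch an MLCP import in distributed mode from Python so the load runs
# as a MapReduce job on the Hadoop cluster. Host, port, credentials, paths and
# split size are placeholders; verify option names against your MLCP version.
import subprocess

subprocess.run([
    "mlcp.sh", "import",
    "-mode", "distributed",                  # run as a MapReduce job on the cluster
    "-hadoop_conf_dir", "/etc/hadoop/conf",  # cluster configuration visible to the client
    "-host", "marklogic-host",
    "-port", "8000",
    "-username", "admin",
    "-password", "admin-password",
    "-input_file_path", "hdfs://namenode:8020/data/to/load",
    "-max_split_size", "33554432",           # 32 MB splits, adjust as needed
], check=True)
```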
As the title says, is it possible to write to a remote HDFS?
For example, I have installed an HDFS cluster on AWS EC2, and I want to write a file from my local computer to that HDFS cluster.
There are two ways you could write to a remote HDFS:
1) Use the WebHDFS API. It allows systems running outside the Hadoop cluster to access and manipulate HDFS contents, and it doesn't require the client systems to have Hadoop binaries installed (see the sketch below).
2) Configure the client system as a Hadoop edge node that interacts with the Hadoop cluster/HDFS.
Please refer to:
https://hadoop.apache.org/docs/r1.2.1/webhdfs.html
http://www.dummies.com/how-to/content/edge-nodes-in-hadoop-clusters.html
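To illustrate option 1, a minimal sketch of a remote write using WebHDFS's two-step CREATE: the namenode first answers with a redirect to a datanode, which then receives the actual bytes. The namenode host, port, path and user name are placeholders, and this assumes WebHDFS is enabled with simple authentication (a Kerberized cluster would need SPNEGO instead of user.name).

```python
# Sketch: write a local file to a remote HDFS cluster over WebHDFS.
# Step 1: ask the namenode to CREATE the file; it redirects to a datanode.
# Step 2: PUT the file contents to the datanode address from the redirect.
# Host, port, path and user name are placeholders.
import requests

NAMENODE = "http://ec2-namenode:50070"
PATH = "/user/myuser/uploads/sample.txt"

resp = requests.put(
    f"{NAMENODE}/webhdfs/v1{PATH}",
    params={"op": "CREATE", "user.name": "myuser", "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]   # where the data should actually go

with open("sample.txt", "rb") as f:
    resp = requests.put(datanode_url, data=f)
resp.raise_for_status()   # 201 Created on success
```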
I want to efficiently develop a Hadoop job using Cassandra as input and output.
As far as I know, MapReduce jobs in Hadoop use HDFS to store intermediate results.
Is it possible to make Hadoop store intermediate results in the Cassandra File System? If yes, how can that be achieved?
I wonder if it is possible to completely disable HDFS if I am using Hadoop with only Cassandra as the underlying data storage system.
I am using Cassandra 2.0.11 and Hadoop 1.0.4 (if the above is possible only in Hadoop 2.x, I would also appreciate that information).