Marklogic 9 with Hadoop? - hadoop

I integrated ML 9 with hadoop using marklogic connector. I want to load data from my local machine to marklogic using hadoop . In the Document they had mentioned there are two ways to load data using hadoop
Importing data from HDFS to ML using MLCP
Exporting data from ML to HDFS using MLCP
What i want to know is there any way to send data from MLCP to ML directly though my hadoop because i want to use the mapreduce power of hadoop by giving input_split , -max_split_size etc . I know MLCP is build in mapreduce since my hadoop cluster has much processing power i want to use it.
Thanks

Related

How is Hadoop different from database?

I was doing a case study on Spotify. I found out that Spotify uses Cassandra as a DB and also Hadoop. My question is, how is Hadoop different from a database. What type of files does Hadoop datanode stores? Why every corporation has DB and Hadoop as well. I know Hadoop is not a DB but what is it used for if there is DB cluster to save data?
Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as distributed filesystem (HDFS), MapReduce framework libraries, YARN resource manager.
Other tools like Hive, Spark, Pig, Giraph, sqoop, etc, etc can use Hadoop or it's components. For example Hive is a database. It uses HDFS for storing it's data and MapReduce framework primitives for building query execution graph.

How to import data from HDFS (Hadoop) into ElasticSearch?

We have a big Hadoop cluster and recently installed Elastic Search for evaluation.
Now we want to bring data from HDFS to ElasticSearch.
ElasticSearch is installed in a different cluster and so far - we could run a Beeling or HDFS script to extract data from Hadoop into some file and then from a local file bulk load it to ElasticSearch.
Wondering if there is a direct connection from HDFS to ElasticSearch.
I start reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (does not configure nor manage Hadoop cluster) and can only access Hadoop via Kerberos/user/pass - wondering if this is possible to configure (and how) without involving whole DevOps team that manages Hadoop cluster to install/setup all these libraries before direct connect?
How to do it from a Client side?
Thanks.

HDFS into Cassandra

is it possible to migrate/replicate/copy/move processed files (using PIG) from local HDFS (lets say 192.168.0.10) to a cassandra (192.168.0.20)?
What I have in mind is that I literally create a java application to parse the file and re-insert them into cassandra.
Is there any other way in doing so?
thanks alot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually a overkill. It would become more worse if you happen to perform the same periodically.
Instead , we can utilize a very useful feature of Hive which helps us to integrate Hive tables with external data sources. Its Storage Handler Api of hive, which integrates with external data sources like Cassandra/Oracle/Mysql etc.
There is already an Hive-Cassandra Storage Handler API implementation available , which we can very well reuse, kindly find the same in below url.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create Hive external table which is configured with storage handler specs about the remote Cassandra host/table details.
Any write/read performed to this external table , will be handled by Hive through mapreduce jobs which talks with the Cassandra.
I hope this is the ideal way to integrate Hive and Cassandra which takes very less efforts from us and very efficient too.
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Out of all Pig is easier way to load the data from HDFS to Cassandra.
Pig has a storage type called CassandraStorage. It allows us to load the data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig

Hadoop and Cassandra integration - substituting HDFS with Cassandra

I want to efficiently develop a Hadoop job using Cassandra as input and output.
As I know MapReduce jobs in Hadoop uses HDFS to store intermediate results.
Is it possible to make Hadoop store intermediate results in Cassandra File System? If yes then how to achieve that?
I wonder if it possible to completely disable HDFS if I am using Hadoop only Cassandra as underlying data storage system.
I am using Cassandra 2.0.11 and Hadoop 1.0.4 (If the above is possible only in Hadoop 2.x I would also apreciate that information)

How does es-hadoop (ElasticSearch-Hadoop) do Hadoop?

How does es-hadoop enable Hadoop analytics if it is merely a Hadoop connector to HDFS?
I am assuming you are referring to this project. In which case, the ES Hadoop project has two sides. An ES plugin for HDFS which is used for creating index snapshots. But It also has various utilities that can be used within Mapreduce, Hive, Pig, Spack, ect for interacting with Elasticsearch.
For example, it is possible to bulk load ES documents from HBase using Mapreduce via the ESOutputFileFormat format. It is also possible to use Mapreduce to read from ES by a similar mechanism.
ES is an also analytical tool with a query capacity and visualization tool, kibana. So, with the integration, elasticsearch asserts that you can analyze the data at hdfs using elasticsearch.

Resources