I have to process some data persisted in Amazon DynamoDB using Hadoop MapReduce.
I searched the internet for a Hadoop InputFormat for DynamoDB and couldn't find one. I'm not familiar with DynamoDB, so I'm guessing there is some trick related to DynamoDB and Hadoop? If an implementation of this InputFormat exists anywhere, could you please share it?
After a lot of searching I found DynamoDBInputFormat and DynamoDBOutputFormat in one of Amazon's libraries.
On Amazon Elastic MapReduce there is a library called hive-bigbird-handler which contains input and output formats for DynamoDB.
The full class names are org.apache.hadoop.hive.dynamodb.write.DynamoDBOutputFormat and org.apache.hadoop.hive.dynamodb.read.DynamoDBInputFormat.
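For reference, here is a minimal driver sketch showing how these formats might be wired into a plain MapReduce job. It uses the old mapred-style API because Hive storage handler formats are typically built against org.apache.hadoop.mapred, and the configuration property names for the table and endpoint are assumptions, since the library is not publicly documented:

```java
import org.apache.hadoop.hive.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.hive.dynamodb.write.DynamoDBOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DynamoDbJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DynamoDbJobDriver.class);
        conf.setJobName("dynamodb-mapreduce");

        // Assumed property names: check the library's constants for the real keys
        // that identify the table, region/endpoint, and credentials.
        conf.set("dynamodb.table.name", "MyTable");
        conf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com");

        // The input/output formats from hive-bigbird-handler mentioned above.
        conf.setInputFormat(DynamoDBInputFormat.class);
        conf.setOutputFormat(DynamoDBOutputFormat.class);

        // conf.setMapperClass(...); conf.setReducerClass(...);
        // The mapper/reducer key-value types must match the writables these formats use.

        JobClient.runJob(conf);
    }
}
```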
I hope these classes will be useful to the community.
I couldn't find an InputFormat that you could use directly in MapReduce. But here is an article, AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB (Guest Post), on running MapReduce jobs using Hive.
I'd like to know the best way to read from and write to DynamoDB from Spark.
I've tried the official DynamoDB API, the EMR connector (with Hadoop and also with Hive), and others.
But I've found (among other problems) that performing a query requires a full scan, which isn't viable with big tables.
Any suggestions please?
The approach you tried with the emr-dynamodb-connector is generally how most people use it.
However, there is also a library you could use to connect to DynamoDB.
Generally, accessing DynamoDB from Spark is difficult because you have now tied your Spark executors to DynamoDB's throughput throttling. One alternative is to use HBase or Cassandra, which I found better supported for Spark usage and which provide predicate pushdown, etc.
The way I generally use DynamoDB data on a cluster with Spark is by utilizing DynamoDB Streams: collect the stream data in S3 and apply batch processing to that data, as sketched below.
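A minimal sketch of that batch step, assuming the stream records have already been delivered to S3 as JSON (the bucket, prefix, and field name here are hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DynamoStreamBatchJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dynamodb-stream-batch")
                .getOrCreate();

        // Hypothetical S3 location where the DynamoDB stream records were collected
        // (for example via a Lambda or Firehose delivery) as JSON lines.
        Dataset<Row> events = spark.read().json("s3://my-bucket/dynamodb-stream/2017/");

        // Normal batch processing from here on; nothing hits the live DynamoDB table,
        // so the job cannot exhaust the table's provisioned throughput.
        events.groupBy("eventName").count().show();

        spark.stop();
    }
}
```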
I am trying to query GitHub data provided by the GHTorrent API using Hadoop. How can I ingest this much data (4-5 TB) into HDFS? Also, their databases are real-time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?
Go through this presentation. It describes how you can connect to their MySQL or MongoDB instance and fetch data. Basically, you have to share your public key; they will add that key to their repository and then you can SSH in. As an alternative, you can download their periodic dumps from this link.
Important links:
query MongoDB programmatically
connect to MySQL instance
For processing real-time data, you can't do that using Pig or Hive; those are batch-processing tools. Consider using Apache Spark, for example Spark Streaming as sketched below.
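A minimal Spark Streaming sketch, assuming the incoming GitHub events land as new text files in an HDFS directory (the path is hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class GithubEventStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("github-event-stream");
        // Micro-batches every 60 seconds.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(60));

        // Hypothetical HDFS directory where new event files keep arriving.
        JavaDStream<String> lines = ssc.textFileStream("hdfs:///data/ghtorrent/incoming");

        // Count records per micro-batch; replace with real parsing/aggregation.
        lines.count().print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```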
Is it possible to migrate/replicate/copy/move files processed with Pig from a local HDFS (let's say 192.168.0.10) to Cassandra (192.168.0.20)?
What I have in mind is literally creating a Java application to parse the files and re-insert them into Cassandra.
Is there any other way of doing this?
Thanks a lot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually overkill, and it gets even worse if you have to perform the same migration periodically.
Instead, we can use a very useful feature of Hive that integrates Hive tables with external data sources: the Hive storage handler API, which integrates with external stores such as Cassandra, Oracle, MySQL, etc.
There is already a Hive-Cassandra storage handler implementation available, which we can very well reuse; you can find it at the URL below.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create a Hive external table configured with the storage handler's details about the remote Cassandra host and table.
Any read or write against this external table will be handled by Hive through MapReduce jobs that talk to Cassandra.
I believe this is the ideal way to integrate Hive and Cassandra; it takes very little effort from us and is quite efficient too.
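As an illustration, here is a rough driver that creates and loads such an external table over Hive JDBC. The storage handler class and SERDEPROPERTIES names below are assumptions in the style of the Brisk-era handler that the linked project derives from, and the hosts, keyspace, and columns are made up, so verify them against the handler jar you actually build:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCassandraExternalTable {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 on the Hadoop side (host, port, and user are assumptions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://192.168.0.10:10000/default", "hive", "");
        try (Statement stmt = conn.createStatement()) {
            // External table backed by the Cassandra storage handler; the handler class
            // and property names are assumptions to be checked against the built jar.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS processed_data (id string, payload string) " +
                "STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' " +
                "WITH SERDEPROPERTIES ('cassandra.host' = '192.168.0.20', " +
                "'cassandra.ks.name' = 'mykeyspace', " +
                "'cassandra.cf.name' = 'processed_data')");
            // Writing into the external table pushes the rows to Cassandra via MapReduce;
            // hdfs_staging is a hypothetical Hive table over the processed HDFS files.
            stmt.execute(
                "INSERT OVERWRITE TABLE processed_data SELECT id, payload FROM hdfs_staging");
        } finally {
            conn.close();
        }
    }
}
```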
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Of all of these, Pig is the easiest way to load data from HDFS into Cassandra.
Pig has a storage type called CassandraStorage, which allows us to load the data into Cassandra in parallel; a sketch follows the link below.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig
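Here is a rough sketch of that Pig route, driven from Java via PigServer. The HDFS path, schema, keyspace, and column family are hypothetical, and CassandraStorage in older Cassandra releases expects each row shaped as a key plus a bag of (name, value) columns, so check the docs for the version you run; connection details such as the Cassandra host are usually supplied through environment variables or job properties:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class HdfsToCassandraWithPig {
    public static void main(String[] args) throws Exception {
        // Run the Pig statements on the existing Hadoop cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.setBatchOn();

        // Load the processed files from HDFS (hypothetical path and schema).
        pig.registerQuery(
            "raw = LOAD '/output/processed' USING PigStorage(',') " +
            "AS (id:chararray, payload:chararray);");

        // Reshape each record into (key, {(column_name, value)}) for CassandraStorage.
        pig.registerQuery(
            "rows = FOREACH raw GENERATE id, TOBAG(TOTUPLE('payload', payload));");

        // Hypothetical keyspace/column family on the remote Cassandra node.
        pig.registerQuery(
            "STORE rows INTO 'cassandra://mykeyspace/processed_data' " +
            "USING org.apache.cassandra.hadoop.pig.CassandraStorage();");

        pig.executeBatch();
    }
}
```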
Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on that HDFS?
If yes, kindly provide some links on transforming a Cassandra DB into HDFS.
You can write an identity MapReduce job that takes its input from CFS (the Cassandra File System) and dumps the data into HDFS.
Once you have the data in HDFS, you can map a Hive table over it and run queries.
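A minimal identity-job sketch of that copy, assuming the cluster's configuration already registers the cfs:// filesystem (both paths are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CfsToHdfsCopy {
    // Identity mapper: pass every input line through unchanged.
    public static class LineCopyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cfs-to-hdfs-copy");
        job.setJarByClass(CfsToHdfsCopy.class);
        job.setMapperClass(LineCopyMapper.class);
        job.setNumReduceTasks(0); // map-only: just copy the records
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("cfs:///exported/mytable"));      // hypothetical CFS path
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///warehouse/mytable"));  // hypothetical HDFS target
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```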
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS; you need a Hive storage handler for Cassandra.
For this you can use Tuplejump's project, CASH; the README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them the way you do from HDFS, you will need a filesystem that runs on Cassandra, like DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo).
Disclaimer: I work for Tuplejump, Inc.
You can use the Tuplejump Calliope project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra (for Cassandra 2.0 and Hadoop 2)
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary (reads directly from SSTables)
I was reading about the integration below for using Hive to query data in DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
As per that link, Hive needs to be set up on top of EMR, but I wanted to know whether I can use this integration with the standalone Hadoop cluster I already have instead of using EMR. Has anyone done this? Will there be sync issues between the data in DynamoDB and HDFS compared to using EMR?
To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB (it probably involves a custom SerDe as well).
That does not seem to be available at the moment, at least not on the AWS website.
What you can do is use the JDBC interface provided by Amazon to issue the queries from your cluster, but they would still be executed on top of EMR, as sketched below.
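A minimal sketch of that JDBC route, assuming HiveServer2 is reachable on the EMR master node. The host, table, columns, and dynamodb.column.mapping are made-up examples; the storage handler class and the dynamodb.* table properties follow the AWS article linked above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryDynamoDbViaEmrHive {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver; the URL scheme and port depend on the Hive
        // version running on the EMR master (shown here for HiveServer2).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:10000/default", "hadoop", "");
        try (Statement stmt = conn.createStatement()) {
            // External table mapped onto the DynamoDB table, as in the AWS article;
            // the Hive columns and dynamodb.column.mapping here are assumptions.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS orders_ddb (order_id string, total double) " +
                "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' " +
                "TBLPROPERTIES ('dynamodb.table.name' = 'Orders', " +
                "'dynamodb.column.mapping' = 'order_id:OrderId,total:Total')");

            // The query runs as Hive/MapReduce on the EMR cluster, not on the local cluster.
            try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM orders_ddb")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        } finally {
            conn.close();
        }
    }
}
```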