What to use to read/write from dynamodb from Spark? - hadoop

I'd like to know what's the best to use to read/write from dynamodb from Spark.
I've tried with the official API from dynamodb, also with the emr connector(hadoop and also with hive) and others.
But i've found (among other problems) that to perform a query a full scan is needed, and that's not something valid with big tables.
Any suggestions please?

The process you tried using emr-dynamodb-connector is generally the way most of the people use it.
However there is a library which you could use to connect to DynamoDb.
Generally accessing DynamoDb from spark is difficult because now you have tied spark executors with the DynamoDb throttle. One alternative you could try is to use Hbase or cassandra which I found better supported with spark usage, provides predicate pushdown etc.
Generally the way I use DynamoDB data on cluster with spark is by utilizing the DynamoDb stream. Collect the stream data in S3 and apply batch processing on that data.

Related

What is the difference between AWS Elastic MapReduce and AWS Redshift

I see that AWS Elastic MapReduce and AWS Redshift both use a cluster structure and can be used for data analysis. What are the different use cases for them?
Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.
Amazon Elastic MapReduce (Amazon EMR) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
You are correct that both Amazon EMR and Amazon Redshift are clustered systems that can scale-out to offer more computing power. However, there are some very distinct differences between the two services.
Amazon EMR provides Apache Hadoop and applications that run on Hadoop. It is a very flexible system that can read and process unstructured data and is typically used for processing Big Data. However, learning Hadoop and related technologies can be quite difficult. ("With great power comes great responsibility!")
Amazon Redshift is a petabyte-scale data warehouse that is accessed via SQL. Data must be loaded into Redshift before being queried, which often requires some for of transformation ("ETL").
So which one to choose?
If you want to use SQL and you have structured data (eg CSV files), then Redshift is the simplest solution.
If you want to process unstructured data (eg in strange formats rather than structured CSV files), Amazon EMR can provide a Hadoop system that is very capable.
Sometimes people use both -- use Hadoop to transform data, then use Redshift for querying the data.
If Amazon Redshift can fit your needs, then use it rather than Hadoop. Redshift is simpler to use because it presents itself as a standard SQL database that you can get going in a few minutes. All the cluster stuff is behind-the-scenes and you don't have to know much to use it.
If you need more flexible capabilities and you don't mind getting low-level and technical, then Hadoop on Amazon EMR will offer you more capabilities.

HDFS into Cassandra

is it possible to migrate/replicate/copy/move processed files (using PIG) from local HDFS (lets say 192.168.0.10) to a cassandra (192.168.0.20)?
What I have in mind is that I literally create a java application to parse the file and re-insert them into cassandra.
Is there any other way in doing so?
thanks alot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually a overkill. It would become more worse if you happen to perform the same periodically.
Instead , we can utilize a very useful feature of Hive which helps us to integrate Hive tables with external data sources. Its Storage Handler Api of hive, which integrates with external data sources like Cassandra/Oracle/Mysql etc.
There is already an Hive-Cassandra Storage Handler API implementation available , which we can very well reuse, kindly find the same in below url.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create Hive external table which is configured with storage handler specs about the remote Cassandra host/table details.
Any write/read performed to this external table , will be handled by Hive through mapreduce jobs which talks with the Cassandra.
I hope this is the ideal way to integrate Hive and Cassandra which takes very less efforts from us and very efficient too.
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Out of all Pig is easier way to load the data from HDFS to Cassandra.
Pig has a storage type called CassandraStorage. It allows us to load the data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig

How to Stream Data To an EMR Cluster

I appreciate ideas on how to stream data from an On-Premise Windows server to a persistent EMR cluster?
Some Background
I would like to run a persistent cluster running a MR job much like the WordCount examples that are available. I would like to stream text from a local Windows Server up to the cluster and have it processed by the running job.
All of the streaming WordCount examples I have reviewed always start with a static text file in S3 and don't cover how to implement anything to generate the stream.
Does this need to be treated in two parts?
Get the data first into S3
Stream it into the EMR cluster?
I have seen tools like Logstash which tend to run agents on the local server which tail the end of a weblog and transfer it.
As you can probably tell, I'm a Windows guy, stretching into EMR and by association Linux. Feel free to let me know if there is some way cool command line tool that already does this.
Thanks in advance.
Currently EMR as-is only supports MR, Hive, Pig, HBase and Impala. MR/Hive/Pig process the data in a batch oriented fashion and data can't be streamed to them. While HBase is a NoSQL DB and Impala is used for interactive ad-hoc queries.
For processing streaming data there are a lot of other options like Storm, Samza, S4. From AWS there is Kinesis which has been moved into GA recently.
Yes a static file would go into S3 and then be the input into your EMR cluster job.
But I believe that fact you want a persistent cluster implies you are streaming in continuation from your Windows server. Is that the case ?
If so you need to create a AWS Kinesis Stream, configure your producers which put data into the stream's shards by calling the Putrecord.
Start by reading "Developing Record Consumer Applications"
I think you could use apache Flume (https://flume.apache.org/)
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

DynamoDB InputFormat for Hadoop

I have to process some data which is persisted in Amazon Dynamo DB using Hadoop map reduce.
I was searching over internet for Hadoop InputFormat for Dynamo DB and couldn't find it. I'm not familiar with Dynamo DB so I'm guessing there is some trick related to DynamoDB and Hadoop? If there is anywhere implementation of this Input Format could you please share it?
After a lot of searching I found DynamoDBInputFormat and DynamoDBOutputFormat in one of Amazon's libraries.
On amazon elastic map reduce there is library called hive-bigbird-handler which contains input and output format for dynamoDB.
Full class names are: org.apache.hadoop.hive.dynamodb.write.DynamoDBOutputFormat and org.apache.hadoop.hive.dynamodb.read.DynamoDBInputFormat
I hope these classes will be useful to community.
Couldn't find an InputFormat which you could use directly in MapReduce. But, here is an article AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB (Guest Post) to run MarReduce jobs using Hive.

Anyone using DynamoDB and Hive without using EMR?

I was reading the below integration of using Hive for querying data on DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
But as per that link, Hive needs to be setup on top of EMR. But I wanted to know if I can use this integration with the standalone Hadoop cluster I already have instead of using EMR. Has anyone done this? Will there be sync issues between data in DynamoDB and HDFS happen compared to using EMR?
To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB(it probably involves a custom SerDe as well).
It seems to be no available at the moment, at least not at AWS website.
What you can do is use the JDBC interface, provided by Amazon, to produce the queries from your cluster, but it would still be executed on top of EMR.

Resources