Anyone using DynamoDB and Hive without using EMR? - hadoop

I was reading the integration described below for using Hive to query data in DynamoDB.
http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html
But as per that link, Hive needs to be set up on top of EMR. I wanted to know if I can use this integration with the standalone Hadoop cluster I already have instead of using EMR. Has anyone done this? Will there be sync issues between the data in DynamoDB and HDFS compared to using EMR?

To be able to use it on your own cluster, you would need the custom StorageHandler for DynamoDB (it probably involves a custom SerDe as well).
It does not seem to be available at the moment, at least not on the AWS website.
What you can do is use the JDBC interface provided by Amazon to issue queries from your cluster, but they would still be executed on top of EMR.
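For that last option, a minimal sketch of issuing a query remotely is shown below. It assumes HiveServer is reachable on the EMR master node and uses PyHive's Thrift interface in place of the JDBC driver; the hostname, port, and the ddb_orders table (an external table defined on EMR with the DynamoDB storage handler) are placeholders.

    # Sketch: query a Hive table that EMR maps onto DynamoDB, from an external
    # machine. Uses PyHive's Thrift interface rather than the JDBC driver; the
    # host, port, and table name are placeholders. The query itself still runs
    # on the EMR cluster.
    from pyhive import hive

    conn = hive.connect(
        host="ec2-xx-xx-xx-xx.compute-1.amazonaws.com",  # EMR master node (assumed)
        port=10000,                                      # default HiveServer2 port
        username="hadoop",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT customer_id, total FROM ddb_orders LIMIT 10")
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    conn.close()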

Related

What to use to read/write from dynamodb from Spark?

I'd like to know what's the best way to read/write DynamoDB from Spark.
I've tried the official DynamoDB API, the EMR connector (with Hadoop and also with Hive), and others.
But I've found (among other problems) that performing a query requires a full scan, which isn't workable with big tables.
Any suggestions, please?
The process you tried using emr-dynamodb-connector is generally the way most people use it.
However, there is a library you could use to connect to DynamoDB.
Generally, accessing DynamoDB from Spark is difficult because you have now tied your Spark executors to DynamoDB's throughput throttling. One alternative you could try is HBase or Cassandra, which I found better supported with Spark; they provide predicate pushdown, etc.
Generally the way I use DynamoDB data on a cluster with Spark is by utilizing DynamoDB Streams: collect the stream data in S3 and apply batch processing on that data.
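A rough sketch of that Streams-to-S3 collection step is below, using boto3 with a simple polling pass; in practice the stream is more often wired to a Lambda function or Kinesis Firehose. The table, bucket, and region names are placeholders.

    # Collect DynamoDB Stream records into S3 for later batch processing.
    # Table, bucket, and region are placeholders; assumes streams are enabled
    # on the table. A production setup would typically use Lambda or Firehose
    # instead of this one-shot polling pass.
    import json
    import boto3

    streams = boto3.client("dynamodbstreams", region_name="us-east-1")
    s3 = boto3.client("s3", region_name="us-east-1")

    stream_arn = streams.list_streams(TableName="orders")["Streams"][0]["StreamArn"]
    shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]

    for shard in shards:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        records = streams.get_records(ShardIterator=iterator)["Records"]
        if records:
            # One object per shard batch; Spark can read these files from S3 later.
            s3.put_object(
                Bucket="my-ddb-stream-dump",
                Key=f"orders/{shard['ShardId']}.json",
                Body=json.dumps(records, default=str),
            )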

IIS Logs Streaming to Hadoop in real time

I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream the logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka the correct choice, or Apache Flume?
2) After streaming, is it better to use Apache Storm and ingest the data into Hive?
Please help with any suggestions and also any information on this kind of problem statement.
Thanks
You can use either Kafka or Flume, or even combine both, to get data into HDFS, but you will need to write code for this. There are open-source data flow management tools available where you don't need to write code, e.g. NiFi and StreamSets.
You don't need to use any separate ingestion tools; you can use those data flow tools to put data directly into a Hive table. Once the table is created in Hive, you can do your analytics by running queries.
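If you do go the "write code" route with Kafka, the producer side can be as small as the sketch below, which tails an IIS log and pushes each line to a topic (kafka-python; the broker address, topic name, and log path are placeholders). The topic would then be landed in HDFS by Flume, Kafka Connect, or one of the data flow tools above.

    # Minimal producer sketch: tail an IIS log file and push each new line to a
    # Kafka topic. Broker address, topic name, and log path are placeholders;
    # uses the kafka-python package.
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")

    with open(r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex240101.log", "r") as log:
        log.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            producer.send("iis-logs", value=line.strip().encode("utf-8"))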
Let me know if you need anything else on this.

What is the difference between AWS Elastic MapReduce and AWS Redshift

I see that AWS Elastic MapReduce and AWS Redshift both use a cluster structure and can be used for data analysis. What are the different use cases for them?
Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.
Amazon Elastic MapReduce (Amazon EMR) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
You are correct that both Amazon EMR and Amazon Redshift are clustered systems that can scale-out to offer more computing power. However, there are some very distinct differences between the two services.
Amazon EMR provides Apache Hadoop and applications that run on Hadoop. It is a very flexible system that can read and process unstructured data and is typically used for processing Big Data. However, learning Hadoop and related technologies can be quite difficult. ("With great power comes great responsibility!")
Amazon Redshift is a petabyte-scale data warehouse that is accessed via SQL. Data must be loaded into Redshift before being queried, which often requires some form of transformation ("ETL").
So which one to choose?
If you want to use SQL and you have structured data (e.g. CSV files), then Redshift is the simplest solution.
If you want to process unstructured data (e.g. in strange formats rather than structured CSV files), Amazon EMR can provide a Hadoop system that is very capable.
Sometimes people use both -- use Hadoop to transform data, then use Redshift for querying the data.
If Amazon Redshift can fit your needs, then use it rather than Hadoop. Redshift is simpler to use because it presents itself as a standard SQL database that you can get going in a few minutes. All the cluster stuff is behind-the-scenes and you don't have to know much to use it.
If you need more flexible capabilities and you don't mind getting low-level and technical, then Hadoop on Amazon EMR will offer you more capabilities.
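To make the "use both" pattern above concrete: once EMR has written its transformed output to S3, loading it into Redshift is essentially a single COPY statement. A minimal sketch with psycopg2 follows; the cluster endpoint, credentials, table, bucket, and IAM role are placeholders.

    # Load EMR's transformed CSV output from S3 into a Redshift table.
    # Endpoint, credentials, table, bucket, and IAM role are all placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="admin",
        password="********",
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY page_views
            FROM 's3://my-emr-output/page_views/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV
        """)
    conn.close()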

HDFS into Cassandra

Is it possible to migrate/replicate/copy/move files processed with Pig from a local HDFS (let's say 192.168.0.10) to Cassandra (192.168.0.20)?
What I have in mind is to literally create a Java application to parse the files and re-insert them into Cassandra.
Is there any other way of doing so?
Thanks a lot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually overkill. It would be even worse if you happen to perform the same task periodically.
Instead, we can utilize a very useful feature of Hive that integrates Hive tables with external data sources: Hive's Storage Handler API, which works with external stores like Cassandra, Oracle, MySQL, etc.
There is already a Hive-Cassandra storage handler implementation available, which we can reuse; please find it at the URL below.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create a Hive external table that is configured via the storage handler with the remote Cassandra host/table details.
Any read or write performed against this external table will be handled by Hive through MapReduce jobs that talk to Cassandra.
I think this is the ideal way to integrate Hive and Cassandra; it takes very little effort from us and is quite efficient, too.
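As a rough illustration of what that external table looks like, see the sketch below. The handler class name and the SERDEPROPERTIES keys vary between builds, so check the handler's README; the host, keyspace, table, and column names are placeholders, and the DDL is submitted through PyHive only for convenience (the Hive CLI or beeline works just as well).

    # Illustrative DDL for a Hive external table backed by Cassandra through a
    # storage handler. The handler class and property keys differ between
    # builds (check the cassandra-handler README); host, keyspace, table, and
    # column names are placeholders.
    from pyhive import hive

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS pig_output (
        user_id string,
        score   int
    )
    STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
    WITH SERDEPROPERTIES (
        'cassandra.host' = '192.168.0.20',
        'cassandra.ks.name' = 'analytics',
        'cassandra.cf.name' = 'pig_output'
    )
    """

    conn = hive.connect(host="hive-server", port=10000, username="hadoop")
    cur = conn.cursor()
    cur.execute(ddl)
    # Reads and writes against pig_output are now routed to Cassandra by the
    # handler, e.g. copying rows from an existing HDFS-backed table:
    cur.execute("INSERT INTO TABLE pig_output SELECT user_id, score FROM hdfs_results")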
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Out of all of these, Pig is the easiest way to load data from HDFS into Cassandra.
Pig has a storage type called CassandraStorage. It allows us to load the data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig
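For completeness, option 5 above (Spark) looks roughly like the following with the DataStax spark-cassandra-connector. The HDFS path, keyspace, and table names are placeholders, and the connector package has to be on the classpath (e.g. passed to spark-submit via --packages).

    # Sketch of the Spark route: read Pig/HDFS output and write it to Cassandra
    # with the DataStax spark-cassandra-connector. HDFS path, keyspace, and
    # table are placeholders; the connector jar must be on the classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hdfs-to-cassandra")
        .config("spark.cassandra.connection.host", "192.168.0.20")
        .getOrCreate()
    )

    df = spark.read.csv("hdfs://192.168.0.10:8020/user/pig/output", header=True)

    (
        df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="pig_output")
        .mode("append")
        .save()
    )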

How to convert cassandra to HDFS file system for shark/hive query

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on that HDFS data?
If yes, kindly provide some links on transforming a Cassandra DB into HDFS.
You can write identity MapReduce code that takes its input from CFS (the Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries.
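One lightweight way to express that identity job is Hadoop Streaming with cat as the mapper and zero reducers; the sketch below just shells out to it. The streaming jar path and the cfs:// and hdfs:// URIs are placeholders, and the cfs:// scheme assumes CFS's Hadoop integration is configured on the cluster.

    # Identity, map-only job via Hadoop Streaming: copy records from CFS into
    # HDFS unchanged. The jar path and the input/output URIs are placeholders.
    import subprocess

    subprocess.run(
        [
            "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
            "-D", "mapreduce.job.reduces=0",  # map-only, i.e. identity
            "-input", "cfs:///data/events/",
            "-output", "hdfs:///user/hive/warehouse/events/",
            "-mapper", "/bin/cat",
        ],
        check=True,
    )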
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS, but you do need a Hive handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them, like you do from HDFS, you will need a filesystem that runs on Cassandra, such as DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project early access repo).
Disclaimer: I work for Tuplejump, Inc.
You can use Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra (for Cassandra 2.0 and Hadoop 2)
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary (reads directly from SSTables)
