Cassandra/Hadoop/Pig design for loading and processing data

I have a setup of Hadoop, Cassandra, Pig and MySQL.
My goal is to periodically read one month of data from Cassandra, process it, and write the result to MySQL.
What is the best practice for doing this?
Do I need to load all the data and filter it in Pig for one month, or should I filter while loading from Cassandra using Pig/CQL (using CqlStorage)?
The problem is:
If I filter while loading from Cassandra, Pig has a bug with WHERE clauses on CQL (https://issues.apache.org/jira/browse/CASSANDRA-6151).
or
The problem with the other solution, loading all the data and filtering in Pig, is that the data is very large, nearly 200 million records. Is loading all the data the better solution? If so, what about the performance and the time the Pig script takes to run?

Related

How to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to do reporting on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
Hadoop works in append-only mode, so we cannot run Hive update/delete queries. We can do an INSERT OVERWRITE on dimension tables and add delta values to the fact tables. Introducing thousands of delta rows daily does not seem like a very good solution.
Are there any other, better standard ways to update modified data in Hadoop?
Thanks
HDFS might be append-only, but Hive does support updates from version 0.14 on (for tables that support ACID transactions).
See here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A common design pattern is to take all your previous and current data and insert it into a new table every time.
Depending on your use case, have a look at Apache Impala, HBase, or even Drill.
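The "rebuild into a new table" pattern can be sketched in plain Python to show the logic only. In Hive this would typically be an INSERT OVERWRITE into a fresh table. The key column `id`, the sample rows, and the rule that a changed row replaces the older version are assumptions made for illustration, not something stated in the answer.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def rebuild_snapshot(previous: Iterable[Row], changes: Iterable[Row], key: str = "id") -> List[Row]:
    """Return a full new snapshot: previous rows overlaid with the changed rows."""
    merged: Dict[object, Row] = {}
    for row in previous:
        merged[row[key]] = row
    for row in changes:
        # Assumption: a changed row fully replaces the older version of that key.
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


# Hypothetical data: yesterday's table plus today's modified/new rows.
previous = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
new_table = rebuild_snapshot(previous, changes)
```

The old table is never modified in place, which is why this fits append-only storage: each run writes a complete new copy.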

Loading data into HIVE to support front end application

We have a data warehousing application which we are planning to convert to Hadoop.
Currently, we receive 20 feeds on a daily basis and load this data into a MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As the first step, we are planning to load the data into Hive on a daily basis instead of MySQL.
Questions:
1. Can I use Hadoop like a DWH application to process files on a daily basis?
2. When I load the data on the master node, will it be synced automatically?
It really depends on the size of your data. The question is a bit complex, but in general you will have to design your own pipeline.
If you are analyzing raw logs, HDFS will be a good place to start. You can use Java, Python or Scala to schedule the Hive jobs on a daily basis, and use Sqoop if you still need some MySQL data.
In Hive you will have to create a partitioned table so that the data is synced and available at query time. Partition creation can also be scheduled.
I would suggest going with Impala instead of Hive, as it is more tunable, fault tolerant and easier to use.
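As a rough sketch of scheduled partition creation, a daily job could generate the Hive DDL for that day's partition. The table name `feeds`, the partition column `load_date`, and the directory layout are hypothetical; only the `ALTER TABLE ... ADD IF NOT EXISTS PARTITION ... LOCATION` form is standard Hive syntax.

```python
from datetime import date


def add_partition_statement(table: str, day: date, base_path: str) -> str:
    """Build the Hive DDL that registers one daily partition."""
    part = day.strftime("%Y-%m-%d")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (load_date='{part}') "
        f"LOCATION '{base_path}/load_date={part}'"
    )


# Hypothetical table and path; a scheduler would pass the current date instead.
stmt = add_partition_statement("feeds", date(2013, 10, 17), "/data/feeds")
```

The generated statement can then be handed to whatever runs the Hive jobs. `IF NOT EXISTS` keeps the job safe to rerun.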

Interactively search Parquet-stored data using Apache Spark Streaming and Dataframes

I have a significant amount of data stored on my Hadoop HDFS as Parquet files.
I am using Spark Streaming to interactively receive queries from a web server and transform the received queries into SQL to run on my data using Spark SQL.
In this process I need to run several SQL queries and then return an aggregate result by merging or subtracting the results of the individual queries.
Is there any way I could optimize and speed up the process, for example by running queries on already retrieved DataFrames rather than on the whole database?
Is there a better way to interactively query the Parquet-stored data and return results?
Thank you!
If you are running multiple queries on the same RDD, you will get a performance increase by caching the RDD with .cache() before querying it.
Also, are you sure that Apache Spark is the right tool for the job here? For the interactive queries you are describing, maybe Impala or Presto is more suitable.

Processing very large dataset in real time in hadoop

I'm trying to understand how to architect a big data solution. I have 400 TB of historic data, and 1 GB of new data is inserted every hour.
Since the data is confidential, I'm describing a sample scenario: the data contains information about all activities in a bank branch. Every hour, when new data is inserted into HDFS (there are no updates), I need to find how many loans were closed, how many loans were created, how many accounts expired, etc. (around 1000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but it has been suggested that I use HBase. After reading through all the documentation, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600 TB?
1. MR for analytics and impala/hive for query
2. Spark for analytics and query
3. HBase + MR for analytics and query
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS. HBase uses HDFS to store data.
Basically, HBase will allow you to update records, keep versions, and delete single records. HDFS does not support file updates, so HBase introduces what you can consider "virtual" operations and merges data from multiple sources (original files, delete markers) when you ask it for data. Also, as a key-value store, HBase creates indices to support selecting by key.
Your problem:
When choosing the technology in such situations, look at what you are going to do with the data: a single query on Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark). Spark will be faster for batch jobs when caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell, the third option you mentioned (HBase and MR only) won't be good. I have not tried Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark produced reports for predefined queries (the results were then stored as objectFiles, not human-readable but very fast), and Impala handled custom queries.
Hope it helps at least a little.

How to create a data pipeline from hive table to relational database

Background :
I have a Hive table "log" which contains log information. This table is loaded with new log data every hour. I want to do some quick analytics on the logs for the past 2 days, so I want to extract the last 48 hours of data into my relational database.
To solve this, I have created a staging Hive table which is loaded by a Hive SQL query. After loading the new data into the staging table, I load the new logs into the relational database using a Sqoop query.
The problem is that Sqoop loads data into the relational database in batches, so at any particular time I have only partial logs for a particular hour.
This leads to erroneous analytics output.
Questions:
1) How can I make this Sqoop data load transactional, i.e. either all records are exported or none are exported?
2) What is the best way to build this data pipeline, covering the whole process of Hive table -> staging table -> relational table?
Technical Details:
Hadoop version 1.0.4
Hive- 0.9.0
Sqoop - 1.4.2
You should be able to do this with Sqoop by using the --staging-table option. It names an auxiliary table that is used to stage the exported data. The staged data is finally moved to the destination table in a single transaction, so you shouldn't have consistency issues with partial data.
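A minimal sketch of how such an export command might be assembled from a scheduling script. The JDBC URL, table names and export directory below are hypothetical placeholders; `--staging-table`, `--clear-staging-table`, `--connect`, `--table` and `--export-dir` are existing `sqoop export` options. The staging table is assumed to already exist with the same structure as the target table.

```python
from typing import List


def sqoop_export_command(jdbc_url: str, table: str, staging_table: str, export_dir: str) -> List[str]:
    """Build the argument list for a staged (all-or-nothing) sqoop export."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        # Rows are written here first, then moved to the target in one transaction.
        "--staging-table", staging_table,
        # Empty the staging table before the export starts.
        "--clear-staging-table",
        "--export-dir", export_dir,
    ]


# Hypothetical connection string, tables and HDFS path.
cmd = sqoop_export_command(
    "jdbc:mysql://dbhost/analytics",
    "log_recent",
    "log_recent_stage",
    "/user/hive/warehouse/log_staging",
)
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed; credentials are left out of this sketch.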
(source: Sqoop documentation)
Hive and Hadoop are great technologies that let your analytics run inside MapReduce tasks, which makes them very fast by using multiple nodes.
Use that to your benefit. First of all, partition your Hive table.
I guess that you store all logs in a single Hive table. So when you run your queries and you have a
SQL .... WHERE LOG_DATA > '17/10/2013 00:00:00'
then you effectively query all the data that you have collected so far.
If you use partitions instead, let's say one per day, you can define in your query
WHERE p_date=20131017 OR p_date=20131016
The table is partitioned, so Hive now knows to read only those two partitions.
So let's say you get 10 GB of logs per day: then a Hive query should finish in a few seconds on a decent Hadoop cluster.
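The partition filter above can be generated rather than typed by hand. This is a small sketch assuming daily partitions named `p_date` in `YYYYMMDD` form, as in the example query; the function name is made up for illustration.

```python
from datetime import date, timedelta


def partition_predicate(today: date, days: int = 2, column: str = "p_date") -> str:
    """Build an OR-ed filter over the most recent daily partitions."""
    values = [(today - timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]
    return " OR ".join(f"{column}={v}" for v in values)


where = partition_predicate(date(2013, 10, 17))
# where == "p_date=20131017 OR p_date=20131016"
```

Note that a rolling 48-hour window can touch three calendar days, so `days=3` may be needed, combined with a timestamp filter inside those partitions.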
