Apache Spark: Which data storage and data format to choose - hadoop

I'm going to write a sales analytics application with Spark. Every night I get a delta dataset with new sales data (the sales of the previous day). Later I want to run analytics such as association rules or product popularity.
The sales data contains information about:
store-id
article-group
timestamp of cash-point
article GTIN
amount
price
So far I have used the simple .textFile method and RDDs in my applications. I have heard about DataFrames and Parquet, which is a table-like, columnar data format, right? And what about storing the data once in a database (I have HBase installed in a Hadoop cluster) and reading it from there later?
Can someone give a short overview of the different types of save-/load-possibilities in Spark? And give a recommendation what to use for this data?
The data volume is currently about 6 GB, which represents data for 3 stores over about 1 year. Later I will work with data from ~500 stores over a period of ~5 years.

You can use Spark to process that data without any problem. You can read from a CSV file as well (there's a library from Databricks that supports CSV). You can manipulate it; from an RDD you're one step closer to turning it into a DataFrame. And you can write the final DataFrame directly into HBase.
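For illustration, a minimal PySpark sketch of the CSV-to-DataFrame step (assuming Spark 2+, where the CSV reader is built in; on older versions you need the Databricks spark-csv package mentioned above; the path and column names are placeholders, and the HBase write is left out because it needs a separate connector):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesIngest").getOrCreate()

# Load the nightly delta file as a DataFrame instead of a plain-text RDD.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("hdfs:///sales/delta/current.csv"))

# The DataFrame API replaces manual string splitting on the RDD.
sales.groupBy("store_id", "article_gtin").sum("amount").show()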
You can find all the documentation you need here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
https://www.mapr.com/blog/spark-streaming-hbase
Cheers,
Alex

Related

Best way to validate ingested data

I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc.
I store the created CSV file in HDFS, create a stage table from it, and then append it to a historical table in Hadoop.
Can you share some best practices for validating the new data against the historical data? For example, comparing the row count of the current load with the average of the last 10 days, or something like that. Is there any ready-made solution for this in Spark?
Thanks for any advice.
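As an illustration, a minimal PySpark sketch of the row-count check described above (the table names "staging" and "history" and the column "ingest_date" are invented placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("IngestValidation").enableHiveSupport().getOrCreate()

new_count = spark.table("staging").count()

# Average daily row count over the last 10 ingested days in the historical table.
daily = (spark.table("history")
         .groupBy("ingest_date")
         .count()
         .orderBy(F.col("ingest_date").desc())
         .limit(10))
avg_count = daily.agg(F.avg("count")).first()[0]

# Flag the load if today's volume deviates by more than, say, 30%.
if avg_count and abs(new_count - avg_count) / avg_count > 0.3:
    raise ValueError("Row count %d deviates from 10-day average %.0f" % (new_count, avg_count))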

Data format and database choices with Spark/Hadoop

I am working with structured data (one value per field, the same fields for each row) that I have to put into a NoSQL environment with Spark (as the analysis tool) and Hadoop. However, I am wondering what format to use. I was thinking about JSON or CSV, but I'm not sure. What do you think, and why? I don't have enough experience in this field to decide properly.
Second question: I have to analyse this data (stored in HDFS). As far as I know, I have two possibilities to query it (before the analysis):
Direct reading and filtering. I mean that it can be done with Spark, for example:
data = sqlCtxt.read.json(path_data)
Use HBase/Hive to run a proper query first and then process the data.
So, I don't know what the standard way of doing all this is and, above all, which will be the fastest.
Thank you in advance!
Use Parquet. I'm not sure about CSV, but definitely don't use JSON. In my personal experience, JSON with Spark was extremely slow to read from storage; after switching to Parquet my read times were much faster (e.g. some small files that took minutes to load as compressed JSON now take less than a second as compressed Parquet).
On top of improving read speeds, compressed Parquet can be split by Spark when reading, whereas compressed JSON cannot. What this means is that Parquet can be loaded onto multiple cluster workers, whereas the JSON will just be read onto a single node with one partition. That isn't a good idea if your files are large: you'll get OutOfMemory exceptions, and it also won't parallelise your computations, so you'll be executing on one node. This isn't the 'Sparky' way of doing things.
Final point: you can use SparkSQL to execute queries on the stored Parquet files directly, without having to read them into DataFrames first. Very handy.
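For example, a minimal PySpark sketch (the paths and the column name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# One-time conversion: read the existing JSON and rewrite it as Parquet.
spark.read.json("hdfs:///data/raw_json") \
    .write.mode("overwrite").parquet("hdfs:///data/parquet")

# Query the Parquet files in place, without registering a table first.
result = spark.sql(
    "SELECT some_field, COUNT(*) AS cnt "
    "FROM parquet.`hdfs:///data/parquet` GROUP BY some_field")
result.show()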
Hope this helps :)

Using HBase for small dataset and big data analysis at the same time?

I am building an application which requires a lot of data processing and analytics (processing tons of files at the same time).
I am planning to use Hadoop (MapReduce, HBase on HDFS) for this.
At the same time I have small datasets like user settings, the application's user listing, payment information and others, which can easily be managed in any RDBMS such as MySQL, or in Mongo.
Sometimes there may also be some aggregated and analysis data computed by Hadoop, but that data is not that big either.
My question is whether I should pick two databases, e.g. MySQL/Mongo for storing the small datasets and HBase for the big dataset?
Or can HBase do both jobs efficiently?
In my opinion you can't compare apples with bananas.
HBase is schema-less, and in terms of the CAP theorem, CP is the main focus for HBase,
whereas CA is the focus for an RDBMS. Please see my answer.
An RDBMS has these properties: it has a schema, is centralized, supports joins, supports ACID, and supports referential integrity.
HBase, on the other hand, is schema-less, distributed, doesn't support joins, and has no built-in support for ACID.
Now you can decide which one is for what based on your requirements.
Hope this helps!

Processing a very large dataset in real time in Hadoop

I'm trying to understand how to architect a big data solution. I have 400 TB of historic data, and every hour 1 GB of new data is inserted.
Since the data is confidential, I'm describing a sample scenario: the data contains information about all activities in a bank branch. Every hour, when new data is inserted (no updates) into HDFS, I need to find how many loans were closed, loans created, accounts expired, etc. (around 1000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but I'm being advised to use HBase. Reading through all the documentation, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600 TB?
1. MR for analytics and Impala/Hive for queries
2. Spark for analytics and queries
3. HBase + MR for analytics and queries
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS; it uses HDFS to store its data.
Basically, HBase allows you to update records and gives you versioning and deletion of single records. HDFS does not support file updates, so HBase introduces what you can consider "virtual" operations, and merges data from multiple sources (original files, delete markers) when you ask it for data. Also, HBase as a key-value store creates indices to support selecting by key.
Your problem:
When choosing the technology in such situations, you should look at what you are going to do with the data: a single query on Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark), while Spark will be faster in batch jobs where caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I did not try Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark was doing reports for pre-defined queries (after that, the data was stored in objectFiles: not human-readable, but very fast), and Impala for custom queries.
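As an illustration only, a rough PySpark sketch of that "precompute reports, store the result in a fast binary format" pattern; the objectFile API mentioned above is the Scala/Java one, so this uses PySpark's pickle-file equivalent, and the table, columns and paths are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrecomputedReports").getOrCreate()
sc = spark.sparkContext

# Pre-defined report computed in a batch job over the raw data.
report = spark.sql(
    "SELECT branch, COUNT(*) AS closed_loans "
    "FROM parquet.`hdfs:///bank/events` "
    "WHERE event_type = 'loan_closed' GROUP BY branch")

# Persist the (small) result in a binary format for fast reloads.
report.rdd.saveAsPickleFile("hdfs:///bank/reports/closed_loans")

# Later: reload the precomputed rows without rescanning the full dataset.
cached = sc.pickleFile("hdfs:///bank/reports/closed_loans").collect()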
Hope it helps at least a little.

How to do complex query on big data?

Hi everyone,
I have about 6 GB of data in HDFS that has been exported from MySQL, and I have written MapReduce jobs that pre-process the data and fill in some key fields so that the data can be queried easily.
Because the business requires different aggregations grouped by day, hour, hospital, area, etc.,
I have to write many Hive SQL queries that export data to local disk, and then I write Python scripts that parse the files on local disk to get the data I need.
Is there a good technique on Hadoop that solves this? I am still considering my options.
Can you help me, please?
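One common approach is to run the group-by aggregations directly in Spark SQL and write the results back to HDFS (or a Hive table) instead of exporting to local disk and parsing with Python. A rough sketch with an invented table name and invented column names:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("Aggregations")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table holding the pre-processed records.
events = spark.table("hospital_events")

# Aggregate by day, hour, hospital and area in one pass.
agg = (events
       .withColumn("day", F.to_date("visit_time"))
       .withColumn("hour", F.hour("visit_time"))
       .groupBy("day", "hour", "hospital", "area")
       .count())

# Keep the result in HDFS (Parquet) so downstream tools can query it directly.
agg.write.mode("overwrite").parquet("hdfs:///reports/visits_by_dimension")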
