Apache Spark comparing files with SQL data - hadoop

I am going to use Apache Spark to process big text files, and one step of the processing cycle compares parts of the text with data from a big SQL table.
The task is:
1) Process files and break text into pieces
2) Compare pieces with database ones
The bottleneck will definitely be the SQL database. I'm completely new to Apache Spark, and while I'm sure that Subtask #1 is a good fit for it, I'm not sure that Subtask #2 can be handled by Spark efficiently.
The question is: how does Spark deal with repeated selects from a big SQL table (maybe by caching as much as it can?) in a parallel, distributed environment?

Posting as an answer per request:
If you need to repeatedly process data from a SQL data source, I usually find it worth using Sqoop to pull the data into HDFS so that my processing can run more easily. This is particularly useful while I'm developing my data flow, since I'll often run the same job on a sample of data several times in a short time period, and if it has been sqooped I don't have to hit the database server every time.
If your job is periodic/batch style (a daily data cleanup or report or something), this may be a sufficient implementation, and having a collection of historic data in HDFS ends up being useful for other purposes many times.
If you need live, up-to-the-minute data, then you'll want to use JdbcRDD, as described in this other answer, which lets you treat a SQL data source as an RDD in your Spark data flow.
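To make the JdbcRDD idea a bit more concrete: it takes a query with two `?` placeholders plus a lower bound, an upper bound and a number of partitions, and each partition reads only its own key range. The plain-Python sketch below mimics that splitting against an in-memory SQLite table so that it runs without a cluster. The `pieces` table, its columns and the helper `split_range` are made up for illustration, and the exact way Spark divides the range may differ.

```python
import sqlite3

def split_range(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous sub-ranges."""
    total = upper - lower + 1
    ranges = []
    for i in range(num_partitions):
        start = lower + (i * total) // num_partitions
        end = lower + ((i + 1) * total) // num_partitions - 1
        if start <= end:  # skip empty ranges when there are more partitions than keys
            ranges.append((start, end))
    return ranges

# Hypothetical table standing in for the "big SQL table".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pieces (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO pieces VALUES (?, ?)",
    [(i, f"piece-{i}") for i in range(1, 11)],
)

# Same shape as a JdbcRDD query: two placeholders for the partition bounds.
query = "SELECT id, text FROM pieces WHERE id >= ? AND id <= ?"
partitions = [
    conn.execute(query, bounds).fetchall() for bounds in split_range(1, 10, 3)
]

# Subtask #2 in miniature: compare pieces cut from the files with the database rows.
file_pieces = {"piece-2", "piece-9", "unknown"}
matches = sorted(
    text for part in partitions for _, text in part if text in file_pieces
)
```

In a real job each range would be read by a different worker, so the database sees several smaller range queries instead of one huge select.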
Good luck.


Data format and database choices Spark/hadoop

I am working with structured data (one value per field, the same fields for each row) that I have to put into a NoSQL environment with Spark (as the analysis tool) and Hadoop. However, I am wondering which format to use. I was thinking about JSON or CSV, but I'm not sure. What do you think, and why? I don't have enough experience in this field to decide properly.
Second question: I have to analyse this data (stored in HDFS). As far as I know, I have two ways to query it before the analysis:
1) Read and filter it directly with Spark, for example:
data = sqlCtxt.read.json(path_data)
2) Use HBase/Hive to run a proper query and then process the data.
I don't know which is the standard way of doing this and, above all, which will be the fastest.
Thank you in advance!
Use Parquet. I'm not sure about CSV, but definitely don't use JSON. In my experience, JSON was extremely slow for Spark to read from storage; after switching to Parquet my read times were much faster (e.g. some small files took minutes to load as compressed JSON, and now take less than a second to load as compressed Parquet).
On top of improving read speeds, compressed Parquet can be partitioned by Spark when reading, whereas JSON compressed with a non-splittable codec such as gzip cannot. This means that Parquet can be loaded onto multiple cluster workers, whereas the compressed JSON is read onto a single node as one partition. That is a bad idea if your files are large: you are likely to get out-of-memory exceptions, and your computations won't be parallelised, so everything executes on one node. This isn't the 'Sparky' way of doing things.
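The splitting argument can be seen without Spark at all. The sketch below assumes gzip as the compression codec (the answer does not say which one was used) and compares reading from the middle of a plain newline-delimited JSON file with reading from the middle of the same data gzipped. The helper names are made up for illustration.

```python
import gzip
import json
import zlib

records = [{"id": i, "value": "x" * 20} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

def read_split_plain(data, start):
    """Read newline-delimited JSON from an arbitrary offset.

    The partial first line is skipped, which is roughly what a reader
    assigned to the second half of a file would do.
    """
    first_newline = data.index(b"\n", start)
    return [json.loads(line) for line in data[first_newline + 1:].split(b"\n")]

def read_split_gzip(data, start):
    """Try to decompress from the middle of a gzip stream; None on failure."""
    try:
        return gzip.decompress(data[start:])
    except (OSError, EOFError, zlib.error):
        return None

plain_tail = read_split_plain(raw, len(raw) // 2)
gzip_tail = read_split_gzip(compressed, len(compressed) // 2)
```

Here `plain_tail` holds complete records from the second half of the file, while `gzip_tail` is `None`: a gzip stream has to be decoded from its beginning, so a single reader ends up with the whole file.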
Final point: you can use SparkSQL to execute queries on stored parquet files, without having to read them into dataframes first. Very handy.
Hope this helps :)

Big Data - Lambda Architecture and Storing Raw Data

Currently I am using Cassandra to store data for my functional use cases (displaying time series and consolidated data to users). Cassandra is very good at this if you design your data model correctly (query-driven).
Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.
The Lambda architecture is just a design pattern for big-data architectures and is technology independent; the layers can be combined:
Cassandra is a database that can be used as both serving layer and batch layer: I'm using it for analytics with Spark too (because the data is already well formatted, as time series, in Cassandra).
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need to do this in order to recover from any problem, including human error (an algorithm problem, a DROP TABLE in PROD, things like that can happen), for future use, and mainly for batch aggregation.
And here I'm facing a choice:
Currently I'm storing the raw data in Cassandra, but I'm considering moving it to HDFS for several reasons: raw data is "dead", yet it uses Cassandra tokens and resources (mainly disk space) in the Cassandra cluster.
Can someone help me with that choice?
HDFS makes perfect sense. Some considerations:
Serialization of data - Use ORC/Parquet, or Avro if the format is variable
Compression of data - Always compress
HDFS does not like too many small files - In case of streaming, have a job that aggregates the data and writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
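The last two points can be sketched in a few lines. The example below uses the local filesystem as a stand-in for HDFS, a date-based `year=/month=/day=` directory layout (one common choice, not the only one) and a naive byte-level concatenation as the "aggregation job". The function names, file names and the layout are assumptions for illustration. Plain concatenation only suits line-oriented formats; Parquet or ORC files would need a proper rewrite.

```python
import os
import tempfile
from datetime import datetime

def partition_dir(root, event_time):
    """Build a date-based partition directory such as root/year=2016/month=03/day=07."""
    return os.path.join(
        root,
        f"year={event_time:%Y}",
        f"month={event_time:%m}",
        f"day={event_time:%d}",
    )

def compact(small_files, target_path):
    """Concatenate many small files into one larger file; return its size in bytes."""
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    with open(target_path, "wb") as out:
        for path in sorted(small_files):
            with open(path, "rb") as src:
                out.write(src.read())
    return os.path.getsize(target_path)

# Simulate a streaming job that dropped five tiny files.
root = tempfile.mkdtemp()
incoming = os.path.join(root, "incoming")
os.makedirs(incoming)
small_files = []
for i in range(5):
    path = os.path.join(incoming, f"batch-{i}.log")
    with open(path, "wb") as f:
        f.write(f"event {i}\n".encode("utf-8"))
    small_files.append(path)

# Periodic compaction into the partition for that day.
target_dir = partition_dir(os.path.join(root, "raw"), datetime(2016, 3, 7))
size = compact(small_files, os.path.join(target_dir, "part-00000.log"))
```

With this layout, a job that needs one day of raw data only has to list one directory instead of scanning everything.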
HDFS is a better idea for binary files. Cassandra is OK for storing the locations of the files and so on, but storing the files themselves requires a really well-designed data model, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are some examples like:
that might help you get started.
Also, the question is more suited to Quora or even http://www.mail-archive.com/user#cassandra.apache.org/ where this question has been asked many times.

Suggestions for noSQL selection for mass data export

We have billions of records in a relational format (e.g. transaction id, user name, user id and some other fields). My requirement is to create a system where a user can request a data export from this data store, providing filters such as user id, date and so on. Depending on the selected filters, an exported file will typically contain anywhere from thousands to millions of records (the output file will be CSV or a similar format).
Besides the raw data, I am also looking for dynamic aggregation on a few of the fields during the export.
The time between the user submitting a request and the exported file being available should typically be 2-3 minutes (4-5 minutes at most).
I am seeking suggestions on NoSQL backends for this use case. I've used Hadoop MapReduce so far, but in my opinion a typical Hadoop batch job running MapReduce over HDFS data might not meet the expected SLA.
Another option is Spark, which I've never used, but it should be way faster than a typical Hadoop MapReduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it clearly is not the right option given the size of the data we are exporting and the dynamic aggregation.
Any suggestions on using Spark here? Or on any other, better NoSQL option?
In summary, the SLA, dynamic aggregation and raw data volume (millions of records) are the requirements to consider here.
If the system only needs to export data after some ETL (aggregations, filtering and transformations), then the answer is very straightforward: Apache Spark is the best fit. You would have to fine-tune the system and decide whether to keep data in memory only, in memory plus disk, serialized, and so on. However, most of the time one needs to think about other aspects too, so I am considering those as well.
This is a wide topic and it involves many aspects, such as the aggregations involved, search-related queries (if any) and development time. From the description, it seems to be an interactive or near-real-time interactive system. Other aspects are whether any analysis is involved and, importantly, the type of system (OLTP/OLAP, reporting only, etc.).
I see there are two questions involved -
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We are using it for the same purpose; along with the filtering, we also have XML transformations to perform, which are also done in Spark. It is very fast compared to Hadoop MapReduce. Spark can run standalone and it can also run on top of Hadoop.
- Storage -
There are many NoSQL solutions available. The selection depends on many factors, such as volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, with HDFS as the storage system. It has many benefits, as you get the entire Hadoop ecosystem. If you have analysts or data scientists who need to get insights from the data or play with it, then this would be a better choice, as you get tools such as Hive and Impala. Resource management would also be easy. But for some applications it can be too much.
Cassandra - Cassandra is a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example when performing complex aggregations. By the way, we are using it. For visualization (viewing data for analysis), the options include Apache Zeppelin and Tableau (there are a lot of options).
Elasticsearch - Elasticsearch is also a suitable option if your storage is in the range of a few TB up to 10 TB. It comes with Kibana (UI), which provides limited analytics capabilities, including aggregations. Development time is minimal; it's very quick to implement.
So, depending on your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider other technologies for storage and data visualization.
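To pin down what "filtering plus dynamic aggregation plus export" means in practice, here is a tiny plain-Python stand-in for the pipeline (not Spark code). The field names follow the question, while the sample records, the `export` function and the choice of a per-month sum as the aggregation are made up for illustration. In Spark the same three steps would be a filter, a grouped aggregation and a write.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["transaction_id", "user_id", "date", "amount"]

# Hypothetical sample of the transaction records.
records = [
    {"transaction_id": 1, "user_id": 7, "date": "2016-01-05", "amount": 10.0},
    {"transaction_id": 2, "user_id": 7, "date": "2016-02-11", "amount": 5.5},
    {"transaction_id": 3, "user_id": 9, "date": "2016-02-12", "amount": 3.0},
    {"transaction_id": 4, "user_id": 7, "date": "2016-03-01", "amount": 4.5},
]

def export(records, user_id, date_from, date_to):
    """Filter by user and date range, aggregate per month, and render CSV text."""
    selected = [
        r for r in records
        if r["user_id"] == user_id and date_from <= r["date"] <= date_to
    ]
    # Dynamic aggregation: total amount per month (YYYY-MM) of the selected rows.
    totals = defaultdict(float)
    for r in selected:
        totals[r["date"][:7]] += r["amount"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(selected)
    return out.getvalue(), dict(totals)

csv_text, monthly_totals = export(records, 7, "2016-01-01", "2016-02-28")
```

The storage choice then mostly decides how cheaply the filter step can skip irrelevant data, which is what matters for the 2-3 minute SLA.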

Processing very large dataset in real time in hadoop

I'm trying to understand how to architect a big data solution. I have 400TB of historic data, and 1GB of new data is inserted every hour.
Since the data is confidential, I'm describing a sample scenario: the data contains information on all activities in a bank branch. Every hour, when new data is inserted into HDFS (inserts only, no updates), I need to find how many loans were closed, how many loans were created, how many accounts expired, etc. (around 1000 analytics to be performed). The analytics involve processing the entire 400TB of data.
My plan was to use Hadoop + Spark, but HBase has been suggested to me. Reading through all the documentation, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600TB?
1. MR for analytics and impala/hive for query
2. Spark for analytics and query
3. HBase + MR for analytics and query
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS. HBase uses HDFS to store its data.
Basically, HBase allows you to update records, keep versions, and delete single records. HDFS does not support file updates, so HBase introduces what you can consider "virtual" operations and merges data from multiple sources (original files, delete markers) when you ask it for data. Also, as a key-value store, HBase creates indices to support selecting by key.
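The "virtual operations merged at read time" idea can be shown with a toy append-only store. This is only a sketch of the concept, not how HBase is implemented: real HBase has column families, several kinds of delete markers and background compactions, and the class and method names below are made up.

```python
class ToyVersionedStore:
    """Append-only store: puts and deletes are new entries, merged when reading."""

    def __init__(self):
        # Each entry is (key, timestamp, value); value None marks a delete.
        self._log = []

    def put(self, key, timestamp, value):
        self._log.append((key, timestamp, value))

    def delete(self, key, timestamp):
        # Nothing is removed from the log; a delete marker is appended instead.
        self._log.append((key, timestamp, None))

    def get(self, key, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        entries = sorted(
            ((ts, value) for k, ts, value in self._log if k == key),
            key=lambda entry: entry[0],
            reverse=True,
        )
        versions = []
        for ts, value in entries:
            if value is None:  # a delete marker hides everything older than it
                break
            versions.append((ts, value))
            if len(versions) == max_versions:
                break
        return versions

store = ToyVersionedStore()
store.put("acct-1", 1, "open")
store.put("acct-1", 2, "frozen")      # an "update" is just a newer version
store.put("acct-2", 1, "open")
store.delete("acct-2", 3)             # hides the version written at timestamp 1
store.put("acct-2", 4, "reopened")
store.put("acct-3", 1, "open")
store.delete("acct-3", 2)

latest = store.get("acct-1")
history = store.get("acct-1", max_versions=5)
after_delete = store.get("acct-2", max_versions=5)
deleted = store.get("acct-3")
```

The underlying log only ever grows, which is why this model fits on top of a filesystem that does not support in-place updates.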
Your problem:
When choosing the technology in such situations, you should look at what you are going to do with the data: a single query on Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark). Spark will be faster in batch jobs, when there is caching involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I have not tried Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark produced reports for pre-defined queries (after that, the data was stored in objectFiles, which are not human-readable but very fast), and Impala handled custom queries.
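For append-only data like the hourly inserts in the question, the core of the Lambda idea can be sketched as a precomputed batch view plus a small hourly delta. The sketch below is plain Python with made-up event names and helper functions, and it only covers analytics that can be expressed as mergeable aggregates such as counts; anything that genuinely needs the full history on every run would not fit this shape.

```python
from collections import Counter

def count_events(records):
    """Count events by type, e.g. loans created or accounts expired."""
    return Counter(r["event"] for r in records)

def apply_hourly_delta(view, new_records):
    """Merge the counts of newly inserted records into an existing view."""
    merged = Counter(view)
    merged.update(count_events(new_records))
    return merged

# Batch layer: computed once over the historic data (tiny sample here).
historic = [
    {"event": "loan_created"},
    {"event": "loan_closed"},
    {"event": "loan_created"},
]
batch_view = count_events(historic)

# Every hour only the new records are processed and merged in.
hour_1 = [{"event": "loan_closed"}, {"event": "account_expired"}]
current_view = apply_hourly_delta(batch_view, hour_1)
```

The hourly job then touches about 1GB instead of rescanning 400TB, and the batch view can be recomputed from the raw files whenever the logic changes.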
Hope it helps at least a little.

Hadoop Ecosystem - What technological tool combination to use in my scenario? (Details Inside)

This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked to see -> what they bought
Task: Make a recommendation engine (like Amazon's), which shows the user: customers who bought this also bought this -> if you liked this, then you might like this -> (also) some data mining to predict future buying habits. And so on and so forth; basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the ultimate solution. Now the question is, what combination of technological tools should I use? I.e.,
HDFS: Underlying file system
Mahout: For running some algorithms, which I assume uses Map-Reduce (genetic, cluster, data mining etc.)
- What am I missing? What about loading RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, I get a list of results(reco's), or there exists a way to query it directly and report it to the front-end I build in .NET??
I think the answer to this question, just might be a good discussion for many people like me in the future who want to kick start their hadoop experimentation.
For loading data from RDBMS, I'd recommend looking into BCP (to export from SQL to flat file) then Hadoop command line for loading into HDFS. Sqoop is good for ongoing data but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. However, you cannot load data into HDFS in an arbitrary format and then query it in HBase, unless you use the HBase file format (HFile).
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan). Since this extracts specific user info rather than a massive data set, it is more suitable than using MR.
