We have billions of records in a relational format (e.g. transaction ID, user name, user ID, and a few other fields). My requirement is to build a system where a user can request a data export from this store by providing filters such as user ID, date, and so on. Depending on the selected filters, the exported file will typically contain anywhere from thousands to hundreds of thousands to millions of records (the output file will be CSV or a similar format).
Besides the raw data, I also need some dynamic aggregation over a few of the fields during the export.
The typical time between the user submitting a request and the exported file being available should be 2-3 minutes (4-5 minutes at most).
I am looking for suggestions on backend NoSQL stores for this use case. I've used Hadoop MapReduce so far, but in my opinion a typical Hadoop batch job over HDFS data might not meet the expected SLA.
Another option is Spark, which I've never used, but it should be far faster than a typical Hadoop MapReduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it is clearly not the right option given the size of the data we are exporting and the dynamic aggregation.
Any suggestions on using Spark here, or a better-suited NoSQL store?
In summary, the SLA, dynamic aggregation, and raw data export (millions of records) are the key requirements here.
If the system only needs to export data after some ETL (aggregations, filtering, and transformations), the answer is very straightforward: Apache Spark is the best fit. You will have to tune the system and decide whether to use memory only, memory plus disk, serialization, etc. However, most of the time you also need to think about other aspects; I am considering them as well.
This is a broad topic and involves many aspects, such as the aggregations involved, search-related queries (if any), and development time. From the description, it seems to be an interactive/near-real-time system. Other aspects are whether any analysis is involved, and, importantly, what type of system it is (OLTP/OLAP, reporting only, etc.).
I see two questions here:
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We use it for the same purpose; along with filtering, we also have XML transformations to perform, which are also done in Spark. It is very fast compared to Hadoop MapReduce. Spark can run standalone and can also run on top of Hadoop.
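To make this concrete, here is a minimal PySpark sketch of the export-plus-aggregation job, assuming the records already sit in Parquet on HDFS; the paths and column names (user_id, tx_date, tx_type, amount) are hypothetical.

```python
# Minimal sketch (not a production job): filter records by the user-supplied
# criteria, export the raw rows as CSV, and compute a dynamic aggregation.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("export-job").getOrCreate()

records = spark.read.parquet("hdfs:///data/transactions")  # assumed source

filtered = records.filter(
    (F.col("user_id") == "u-12345") &
    (F.col("tx_date").between("2023-01-01", "2023-03-31"))
)

# Raw export: coalesce keeps the output to a handful of CSV part files.
(filtered.coalesce(4)
         .write.mode("overwrite")
         .option("header", True)
         .csv("hdfs:///exports/u-12345"))

# Dynamic aggregation alongside the raw export.
(filtered.groupBy("tx_type")
         .agg(F.count("*").alias("tx_count"),
              F.sum("amount").alias("total_amount"))
         .write.mode("overwrite")
         .option("header", True)
         .csv("hdfs:///exports/u-12345_summary"))
```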
- Storage -
There are many NoSQL solutions available. The choice depends on many factors such as volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, using HDFS as the storage system. It has many benefits because you get the entire Hadoop ecosystem. If you have analysts/data scientists who need to get insights from the data or play with it, this is a good choice because you get tools such as Hive and Impala. Resource management is also easy. But for some applications it can be overkill.
Cassandra - Cassandra is a storage engine that solves the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example when performing complex aggregations; by the way, we use that combination ourselves. For visualization (to explore data for analysis), options include Apache Zeppelin and Tableau (there are lots of options).
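As a rough illustration of the Spark + Cassandra combination, the sketch below reads a Cassandra table into Spark and aggregates it. It assumes the spark-cassandra-connector is on the classpath; the keyspace, table, and column names are hypothetical.

```python
# Sketch of reading a Cassandra table into Spark for a complex aggregation.
# Assumes the spark-cassandra-connector is on the classpath (for example via
# --packages com.datastax.spark:spark-cassandra-connector_2.12:<version>).
# Keyspace, table, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("cassandra-agg")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="analytics", table="events")
          .load())

daily = (events.groupBy("event_date")
               .agg(F.count("*").alias("event_count"),
                    F.countDistinct("user_id").alias("unique_users")))
daily.show()
```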
Elasticsearch - Elasticsearch is also a suitable option if your data volume is in the range of a few TB up to roughly 10 TB. It comes with Kibana (a UI) which provides limited analytics capabilities, including aggregations. Development time is minimal; it is very quick to implement.
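For comparison, this is roughly the kind of aggregation Elasticsearch runs (the same thing Kibana issues under the hood), sketched with the official Python client using 8.x-style keyword arguments; the index and field names are hypothetical.

```python
# Rough sketch of an Elasticsearch terms + sum aggregation.
# Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="transactions",
    size=0,  # aggregation only, no raw hits
    aggs={
        "by_type": {
            "terms": {"field": "tx_type"},
            "aggs": {"total_amount": {"sum": {"field": "amount"}}},
        }
    },
)

for bucket in resp["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["total_amount"]["value"])
```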
So, depending on your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider another technology for storage and data visualization.
We are trying to introduce Elasticsearch into our big data environment. Currently we run Apache Hadoop 2.7, including Hive and Spark. Data is stored in Parquet format in Hadoop.
When we add the ELK stack to our environment, can we keep storing data only in Hadoop HDFS? Or must we extract data from Hadoop and import it into Elasticsearch to create the indexes, leaving us with a duplicated dataset in the system (Hadoop HDFS and Elasticsearch)?
Thank you.
Sorry if this is a bit long, but I hope it helps!
What is Elasticsearch?
Elasticsearch is a search engine. Period. A search solution. Or rather, a type of database or datastore that organizes your data in a way that helps you perform activities like data discovery, or build search applications for your organisation.
Although you can also run a lot of analytical queries and build analytical solutions around it, there are certain limitations.
The nature of Elasticsearch and the data structures it uses are so different that you need to push (ingest) the data into it in order to perform any search/data-discovery/analytical activities. It has its own index files and data structures that manage/store the data specifically for efficient searching.
So yes, there will be duplication of data.
What Elasticsearch is not
It is not meant to be used as an analytical solution; although it does come with a lot of aggregation queries, it is not as expressive as a processing engine like Apache Spark or data virtualisation tools like Denodo or Presto.
It is not a storage solution like HDFS or S3 to be used as the organisation's data lake.
It is not to be used as a transactional database or as a replacement for RDBMS solutions.
Most of the time, organisations ingest data into ES from various sources such as OLAP systems, RDBMSs, NoSQL databases, CMSs, and messaging queues so that they can search the data more efficiently.
In a way, most of the time, ES is not the primary datasource.
How do organisations use it?
Build search solutions: for example, if you are building an e-commerce solution, you can have its search implementation backed by Elasticsearch.
Enterprise search solutions (internal and external) that make IT staff more productive and let customers find the documentation they need: knowledge bases, downloadable PDFs, and other text for their products, e.g. installation docs, configuration docs, release docs, new product docs. All this content is assembled from various sources across the company and pushed into ES so that it becomes searchable.
Ingest data, e.g. logs from application servers and messaging queues, in order to perform logging and monitoring activities (alerts, fraud analysis).
So the two most common uses of ES are search and logging/monitoring, i.e. basically real-time activities.
How is it different from Hadoop?
Organisations are increasingly leveraging Hadoop for its file system, HDFS, as a data store, while using Spark or Hive for data processing, mostly to build the heavy analytical solutions for which ES has limitations.
Hadoop can store any file format (of course you should use Parquet or similar formats for efficient storage), whereas Elasticsearch only works with JSON documents. This makes Hadoop, along with S3 and other file systems, a de facto industry standard for data-lake or data-archival storage.
If you are storing data in Hadoop, you will probably rely on other frameworks such as Spark, Giraph, or Hive to transform the data and do the complex analytical processing for which ES has limitations. ES at its core is a full-text retrieval/search engine.
Hadoop for search
You would need to run MapReduce or Spark jobs and write lots of pattern-matching logic to find the documents or folders containing whatever keyword you want to search for. Every search would result in one such job, which would not be practical.
Even if you transform and organise the data so you can leverage Hive, it still would not be as efficient as Elasticsearch for text search.
Here is a link that can help you understand the core data structure used in Elasticsearch and why its text search is faster/different/more efficient.
How can we make use of Hadoop and Elasticsearch?
Perhaps the diagram mentioned in this link could be useful.
Basically, you can set up ingestion pipelines that process the raw data from Hadoop, transform it, and index it in Elasticsearch so that you can make use of its search capabilities.
Take a look at this link to understand how you can use Spark with Elasticsearch and achieve two-way communication between them.
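As a hedged sketch of such a pipeline, the snippet below reads Parquet from HDFS with Spark, indexes it into Elasticsearch via the elasticsearch-hadoop connector, and reads it back; the connector must be on the classpath, and the host, paths, and index name are hypothetical.

```python
# Hedged sketch of a Hadoop -> Elasticsearch pipeline. The elasticsearch-hadoop
# connector jar must be on the classpath (e.g.
# --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>);
# host, paths, and index name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-es").getOrCreate()

docs = spark.read.parquet("hdfs:///data/parquet/events")  # raw data on HDFS

# Index into Elasticsearch (this is where the duplication happens).
(docs.write
     .format("org.elasticsearch.spark.sql")
     .option("es.nodes", "es-host:9200")
     .mode("append")
     .save("events-index"))

# Reading back from ES into Spark works the same way, giving two-way flow.
from_es = (spark.read
           .format("org.elasticsearch.spark.sql")
           .option("es.nodes", "es-host:9200")
           .load("events-index"))
from_es.printSchema()
```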
Hope that helps!
Currently I am using Cassandra to store data for my functional use cases (displaying time-series and consolidated data to users). Cassandra is very good at this if you design your data model correctly (query-driven).
Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.
The Lambda architecture is just a design pattern for big-data architectures and is technology independent; the layers can be combined:
Cassandra is a database that can serve as both the serving layer and the batch layer: I'm using it for my analytics purposes with Spark too (because the data, like time-series, is already well formatted in Cassandra).
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need this to recover from any problem, human-caused (an algorithm bug, a DROP TABLE in PROD, things like that happen...), for future use, and mainly for batch aggregation.
And here I'm facing a choice:
Currently I'm storing it in Cassandra, but I'm considering switching to storing the raw data in HDFS for several reasons: the raw data is "dead", it occupies Cassandra tokens, and it consumes resources (mainly disk space) in the Cassandra cluster.
Can someone help me with that choice?
HDFS makes perfect sense. Some considerations (there is a sketch after this list):
Serialization of data - use ORC/Parquet, or Avro if the format is variable
Compression of data - always compress
HDFS does not like too many small files - if you are streaming, have a job that aggregates and writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
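Putting those considerations together, here is a hedged PySpark sketch: write compressed, partitioned Parquet to HDFS, and periodically compact a partition into a few large files. The paths, partition column, and file counts are hypothetical.

```python
# Hedged sketch: compressed, partitioned Parquet on HDFS plus a periodic
# compaction job so HDFS is not littered with small files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-archive").getOrCreate()

raw = spark.read.json("hdfs:///staging/raw-events")  # assumed landing zone

(raw.write
    .mode("append")
    .partitionBy("event_date")             # partitioning scheme
    .option("compression", "snappy")       # always compress
    .parquet("hdfs:///archive/raw-events"))

# Periodic compaction: rewrite one day's partition into a few large files.
day = spark.read.parquet("hdfs:///archive/raw-events/event_date=2023-01-01")
(day.coalesce(8)
    .write.mode("overwrite")
    .parquet("hdfs:///archive/raw-events-compacted/event_date=2023-01-01"))
```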
HDFS is the better idea for binary files. Cassandra is OK for storing metadata such as where the files live, but pure files need to be modelled really, really well, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are some examples, like:
https://academy.datastax.com/resources/datastax-reference-application-killrvideo
which might help you get started.
Also, this question is more suited to Quora or even http://www.mail-archive.com/user#cassandra.apache.org/ ; it has been asked there many times.
I am building an application which requires a lot of data processing and analytics (processing tons of files at the same time).
I am planning to use Hadoop (MapReduce, HBase on HDFS) for this.
At the same time I have small datasets like user settings, application user listings, payment information, and others which can easily be managed in an RDBMS such as MySQL, or in Mongo.
Sometimes there may be some aggregated and analysis data computed by Hadoop, but that data is also not that big.
My question is whether I should pick two databases, e.g. MySQL/Mongo for the small datasets and HBase for the big dataset?
Or can HBase do both jobs efficiently?
In my opinion you can't compare apples with bananas.
HBase is schemaless, and in terms of the CAP theorem, CP is the main focus for HBase.
Whereas CA is the focus for an RDBMS; please see my answer.
An RDBMS has these properties: it has a schema, is centralized, supports joins, supports ACID, and supports referential integrity.
Whereas HBase is schemaless, distributed, doesn't support joins, and has no built-in support for ACID.
Now you can decide which one to use for what, based on your requirements.
Hope this helps!
I'm trying to understand how to architect a big data solution. I have 400 TB of historic data, and every hour 1 GB of data is inserted.
Since the data is confidential, I'm describing a sample scenario: the data contains information about all activities in a bank branch. Every hour, when new data is inserted (no updates) into HDFS, I need to find how many loans were closed, loans created, accounts expired, etc. (around 1000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but it has been suggested that I use HBase. Reading through all the documentation, I'm not able to find a clear advantage.
What is the best way to go for data that will grow to 600 TB?
1. MR for analytics and Impala/Hive for queries
2. Spark for analytics and queries
3. HBase + MR for analytics and queries
Thanks in advance
About HBase:
HBase is a database built on top of HDFS; it uses HDFS to store its data.
Basically, HBase allows you to update records and gives you versioning and deletion of single records. HDFS does not support file updates, so HBase introduces what you can consider "virtual" operations, merging data from multiple sources (original files, delete markers) when you ask it for data. Also, HBase as a key-value store creates indices to support selecting by key.
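To illustrate those record-level operations, here is a small sketch using the happybase client (which talks to HBase over Thrift); the table name, column family, and row-key scheme are hypothetical, and an HBase Thrift server is assumed to be running.

```python
# Illustrative sketch of the record-level operations HBase adds on top of
# HDFS: put/get/delete by row key. Names are hypothetical.
import happybase

conn = happybase.Connection("hbase-thrift-host")
table = conn.table("bank_activity")

# Insert or update a single record by row key.
table.put(b"branch42#2023-01-01#tx-001", {b"cf:status": b"loan_closed"})

# Point lookup by key - served via HBase's key index, not a full file scan.
print(table.row(b"branch42#2023-01-01#tx-001"))

# Delete a single record; HBase writes a delete marker (the underlying HDFS
# files are immutable) and merges it in on read.
table.delete(b"branch42#2023-01-01#tx-001")
```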
Your problem:
When choosing the technology in such situations, you should look at what you are going to do with the data: a single query in Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark). Spark will be faster for batch jobs, especially when caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I haven't tried Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark produced the reports for pre-defined queries (after that, the data was stored in objectFiles - not human-readable, but very fast), and Impala handled the custom queries.
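The setup above stored the pre-computed reports as Scala objectFiles; a comparable pattern, swapped to Parquet so Impala can query the same output, looks roughly like the PySpark sketch below, with hypothetical paths and column names.

```python
# Rough sketch of pre-computing one report with Spark and persisting it as
# Parquet for Impala/Hive to query. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-analytics").getOrCreate()

activity = spark.read.parquet("hdfs:///warehouse/bank_activity")

# One of the ~1000 pre-defined analytics, recomputed each batch.
loans_closed = (activity
                .filter(F.col("event_type") == "loan_closed")
                .groupBy("branch_id", "event_date")
                .agg(F.count("*").alias("loans_closed")))

# Impala (or Hive) can be pointed at this directory for custom queries.
loans_closed.write.mode("overwrite").parquet("hdfs:///reports/loans_closed")
```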
Hope it helps at least a little.
This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked on -> what they bought.
Task: Build a recommendation engine (like Amazon's) which shows the user "customers who bought this also bought this" -> "if you liked this, you might like this" -> and also some data mining to predict future buying habits. So on and so forth; basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the ultimate solution. Now the question is what combination of tools to use, i.e.:
HDFS: underlying file system
HBase/Hive/Pig: ?
Mahout: for running algorithms (genetic, clustering, data mining, etc.), which I assume use MapReduce
- What am I missing? What about loading the RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I get a list of results (recos), or is there a way to query it directly and report it to the front end I build in .NET?
I think the answer to this question might be a good discussion for many people like me who want to kick-start their Hadoop experimentation in the future.
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to flat files), then the Hadoop command line for loading into HDFS. Sqoop is good for ongoing loads, but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless, you cannot load data into HDFS in an arbitrary format and query it through HBase unless you use the HBase file format (HFile).
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan); since this extracts specific user info rather than a massive dataset, it is more suitable than using MR.
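For example, extracting one user's rows with a get or a prefix scan over Thrift might look like the happybase sketch below; the table name, column family, and row-key layout (user_id#timestamp) are hypothetical.

```python
# Hedged example of per-user lookups over HBase Thrift with happybase.
import happybase

conn = happybase.Connection("hbase-thrift-host")
views = conn.table("user_activity")

# Point get for one row.
print(views.row(b"user123#2012-06-01T10:00:00"))

# Prefix scan: all activity rows for one user, without touching other users.
for key, data in views.scan(row_prefix=b"user123#"):
    print(key, data)
```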