Lambda Architecture - Why batch layer - hadoop

I am going through the lambda architecture and understanding how it can be used to build fault tolerant big data systems.
I am wondering how the batch layer is useful when everything could be stored in the realtime view and the results generated from it. Is it because realtime storage can't be used to store all of the data? Then it won't be realtime, as the time taken to retrieve the data depends on how much space the stored data takes up.

Why batch layer
To save Time and Money!
It basically has two functions:
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
Everything can be stored in realtime view and generate the results out of it - NOT TRUE
The above is certainly possible, but not feasible, as the data could run to hundreds or thousands of petabytes and generating results could take a lot of time.
The key here is to achieve low-latency queries over a large dataset. The batch layer creates batch views (which serve queries with low latency) and the realtime layer handles recent/updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the realtime views instead of computing over the whole master dataset.
Also, think of the same query running again and again over a huge dataset: a loss of time and money!
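To make the merge step concrete, here is a minimal sketch of a query answered from a precomputed batch view plus a small realtime view. The page-view events, the `build_batch_view` function and the `RealtimeView` class are all hypothetical names made up for illustration, and plain in-memory counters stand in for whatever stores a real deployment would use.

```python
from collections import Counter

# Hypothetical master dataset: immutable (day, page) page-view events.
master_dataset = [
    ("2024-01-01", "/home"),
    ("2024-01-01", "/cart"),
    ("2024-01-02", "/home"),
]


def build_batch_view(events):
    """Batch layer: precompute page-view counts over the whole master dataset."""
    return Counter(page for _day, page in events)


class RealtimeView:
    """Speed layer: counts only events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def add(self, page):
        self.counts[page] += 1


def query(page, batch_view, realtime_view):
    """Serving step: merge the precomputed count with the recent delta."""
    return batch_view[page] + realtime_view.counts[page]


batch_view = build_batch_view(master_dataset)
realtime = RealtimeView()
realtime.add("/home")  # an event not yet absorbed by a batch run

home_views = query("/home", batch_view, realtime)
```

The query never scans the master dataset: it reads two small lookups and adds them, which is where the low latency comes from.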

Further to the answer provided by #karthik manchala, data processing can be handled in three ways: batch, interactive and real-time/streaming.
I believe your reference to real-time has more to do with interactive response than with streaming, as not all use cases are streaming related.
Interactive responses are where the response can be expected anywhere from sub-second to few seconds to minutes, depending on the use case. Key here is to understand that processing is done on data at rest i.e. already stored on a storage medium. User interacts with the system while processing and hence waits for the response. All the efforts of Hive on Tez, Impala, Spark core etc are to address this issue and make the responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time (for example Twitter feeds, click streams etc.) and processing needs to be done as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy lifting needs to be done on a huge dataset beforehand, so that the user is led to believe that the responses he sees are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing would run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index would get the response with sub-second latency. As you can see, indexing cannot be achieved in real time as there may be huge amounts of data. The same is the case with Google search, where indexing would be done in batch mode and the results are presented in interactive mode.
All three modes of data processing are likely to be involved in any organisation grappling with data challenges. Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data processing requirements.

You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. You can use Kafka, in the right configuration, as master dataset storage and save computed data in a database as your view.
If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace your old view.
It is possible to use only the realtime view as the main storage for ad-hoc queries but, as already mentioned in other answers, if you have a lot of data it is faster to keep batch processing and stream processing separate instead of running batch jobs as stream jobs. It depends on the size of your data.
Also, it is cheaper to use storage like HDFS instead of a database for batch computing.
Finally, in many cases you have different algorithms for batch and stream processing, so you need to run them separately. But basically it is possible to use only the "realtime view" as both your batch and stream layer, even without using Kafka as the master dataset. It depends on your use case.
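The recompute-and-replace step can be sketched as follows. This is only an illustration under simplifying assumptions: a Python list stands in for a Kafka topic with long retention, a dict stands in for the view database, and `event_log`, `recompute_view` and `serving` are hypothetical names.

```python
# Hypothetical append-only log standing in for a Kafka topic with long retention.
event_log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]


def recompute_view(log, transform):
    """Replay the whole log from the first event and build a fresh view."""
    view = {}
    for event in log:
        key, value = transform(event)
        view[key] = view.get(key, 0) + value
    return view


# Version 1 of the job: total amount per user.
totals_view = recompute_view(event_log, lambda e: (e["user"], e["amount"]))
serving = {"view": totals_view}

# Version 2 of the job: count events per user instead.
# It is built next to the old view, and the pointer is swapped once it is complete.
counts_view = recompute_view(event_log, lambda e: (e["user"], 1))
serving["view"] = counts_view
```

Readers keep using the old view until the swap, so the replay can take as long as it needs.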

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses elasticsearch for some things (like text search, logging... etc).
Now, I'd like to start storing all kinds of information about user activity in ES, for instance every page view (a user entering a product page, a category page, etc.).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternatives, like for instance having some sort of a buffer layer where I store all of my immediate inserts in memory, and then every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data into another cluster which is different from your production cluster, so as to not impact the health of the production cluster too much.
There are a lot of variables involved in arriving at the correct decision, and we don't have enough information here, but it's definitely a valid approach.
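For reference, the buffer layer described in the question could look roughly like the sketch below. It is not tied to any Elasticsearch client: `BulkBuffer` and `flush_fn` are hypothetical names, and `flush_fn` is where a real bulk request would go. The thresholds of 500 documents and 60 seconds are arbitrary assumptions.

```python
import time


class BulkBuffer:
    """Collects documents in memory and hands them to `flush_fn` in batches."""

    def __init__(self, flush_fn, max_docs=500, max_age_seconds=60.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.docs = []
        self.first_added_at = None

    def add(self, doc):
        if not self.docs:
            self.first_added_at = self.clock()
        self.docs.append(doc)
        # The age check only runs when a document arrives; a real service
        # would also flush from a timer so a quiet period does not delay data.
        too_many = len(self.docs) >= self.max_docs
        too_old = self.clock() - self.first_added_at >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.docs:
            batch, self.docs = self.docs, []
            self.flush_fn(batch)


sent_batches = []  # stand-in for a real bulk request to the search cluster
buffer = BulkBuffer(sent_batches.append, max_docs=3)
for i in range(7):
    buffer.add({"event": "page_view", "page_id": i})
buffer.flush()  # push whatever is left, e.g. at shutdown
```

The trade-off is that anything still sitting in the buffer is lost if the process dies before a flush, which may or may not be acceptable for page-view data.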

Using ElasticSearch as a permanent storage

Recently I have been working on a project which produces a huge amount of data every day. The project has two functions: one is storing data in HBase for future analysis, and the other is pushing data into Elasticsearch for monitoring.
As the data is huge, we have to store it on two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is only a search engine. But it depends on whether your data is critical or whether you can accept losing some of it, as with non-critical logs.
If you don't want to use an additional database for such a large volume of data, you can probably store it in files in something like HDFS.
You should also check Phoenix (https://phoenix.apache.org/), which may provide the monitoring features that you are looking for.

Oracle Materialized View for sensory data transfer

In an application we have to send a sensory data stream from multiple clients to a central server over the internet. One obvious solution is to use MOM (Message Oriented Middleware) such as Kafka, but I recently learned that we can do this with database synchronization tools such as Oracle Materialized Views.
The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of our application), but what are its pros and cons in our application? Which one is better for sending a sensory data stream from multiple (~100) clients to a server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail consider an application in which many (about 100) clients have to send streaming data (1MB data per minute) to a central server over internet. The data are needed in server for the sake of online monitoring, analysis and some computation such as machine learning and data mining tasks.
My question is about the difference between a db-to-db connection and streaming solutions such as Kafka for transferring data from clients to the server.
Prologue
I'm going to try and break your question down in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partially stopped by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over your question before posting a complex question may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is a description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now we've decided that you have a single data pipeline your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive and you're pushing 140GB/day into yours (that's 50TB/year by the way, quite a bit). I don't know if you're actually storing all the raw data but at those volumes it's less likely that you are - you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
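The volume figures above follow from the numbers in the question (about 100 clients, 1MB per client per minute). A quick check, assuming decimal units (1GB = 1000MB) and a 365-day year:

```python
clients = 100
mb_per_client_per_minute = 1

# 100 clients x 1 MB/min x 60 min x 24 h
mb_per_day = clients * mb_per_client_per_minute * 60 * 24
gb_per_day = mb_per_day / 1000
tb_per_year = gb_per_day * 365 / 1000

print(f"{gb_per_day:.0f} GB/day, {tb_per_year:.1f} TB/year")
```

That works out to about 144GB per day and roughly 52.6TB per year, in line with the rounded 140GB/day and 50TB/year quoted above.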
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database?
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this; we've defined a data pipeline. A single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients, no need to do anything. Want to change the technology behind part of the system, as long as the interface remains the same there's no issue. Want to send data elsewhere, no problem, it's all in the raw data persistence layer.

Performance issue in getting millions of records from a database and processing them in an ERP with Mule ESB

We are trying to fetch millions of records per day from a database and process them in an ERP system, and we are facing performance issues. Is there any solution for this in the community?
What is the best way to process the records in Mule? Should we use batch, or is there an alternative to it? And if we use batch or any other solution, how can we use it so as not to run into performance issues?
Since we don't have details on your specific situation, here are some general ideas. You will definitely need to do performance testing when dealing with large data sets to make sure your flow design is performing well.
Just to clarify, I'm giving options below that show streaming, which are slightly less performant, but will allow you to process large datasets. If you can handle the dataset in memory and you want faster processing, then turn off streaming.
Test your db queries outside of mule to make sure they are performant and tables are properly indexed.
Use streaming db connection. Tweak chunk size for performance testing. (Using this with batch scope is a good combo)
If using on-premise runtime, do performance tuning.
Use batch scope (enterprise edition)
Batch sounds like what you want to do. For each batch step Mule creates a batch job instance, and each instance contains a persistent queue with the batched records. However, it does a deep copy of the MuleEvent containing the flow variables, flow construct, message, processing time, session and exchange pattern, so make sure you keep a light footprint before going into your batch job. If you have to set a payload with millions of records into flow variables to do some manipulation, make sure you delete them before you start executing the batch. Mule will load these batch steps in memory and execute them concurrently, so the amount of memory you will need is the size of the batch job instance (in particular the MuleEvent) multiplied by the number of batch steps.
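The streaming-with-chunk-size idea from the list above is not specific to Mule. As a language-neutral illustration (this is not Mule configuration), here is a minimal Python sketch that reads a result set in fixed-size chunks with the standard `sqlite3` module, so only one chunk is held in memory at a time. The `orders` table, the `process_in_chunks` helper and the chunk size of 4 are assumptions made up for the example.

```python
import sqlite3


def process_in_chunks(conn, query, chunk_size, handle_chunk):
    """Run `query` and pass the rows to `handle_chunk` one chunk at a time."""
    cur = conn.cursor()
    cur.execute(query)
    total = 0
    while True:
        rows = cur.fetchmany(chunk_size)  # at most `chunk_size` rows in memory
        if not rows:
            break
        handle_chunk(rows)
        total += len(rows)
    return total


# Hypothetical table with a handful of rows standing in for millions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(float(i),) for i in range(10)])

chunk_sizes = []
processed = process_in_chunks(
    conn,
    "SELECT id, amount FROM orders ORDER BY id",
    4,
    lambda rows: chunk_sizes.append(len(rows)),
)
```

Chunk size is the knob to tune during performance testing: larger chunks mean fewer round trips but more memory per step.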

Storing and processing timeseries with Hadoop

I would like to store a large amount of timeseries from devices. Also these timeseries have to be validated, can be modified by an operator and have to be exported to other systems. Holes in the timeseries must be found. Timeseries must be shown in the UI filtered by serialnumber and date range.
We have thought about using hadoop, hbase, opentsdb and spark for this scenario.
What do you think about it? Can Spark connect to opentsdb easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is underpinned by HBase, which means that it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector that would allow you to fetch data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
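Whatever store you pick, the "holes must be found" requirement from the question comes down to comparing consecutive timestamps against the expected sampling interval. A minimal plain-Python sketch, assuming a fixed one-minute interval and a hypothetical `find_holes` helper (this does not use spark-ts or OpenTSDB):

```python
from datetime import datetime, timedelta


def find_holes(timestamps, expected_interval):
    """Return (last_seen, next_seen) pairs whose gap exceeds the expected interval."""
    ordered = sorted(timestamps)
    holes = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval:
            holes.append((previous, current))
    return holes


start = datetime(2024, 1, 1, 0, 0)
# Samples at minutes 0, 1, 2, 5, 6 and 10: minutes 3-4 and 7-9 are missing.
samples = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6, 10)]

holes = find_holes(samples, timedelta(minutes=1))
```

Each returned pair brackets one hole, which is enough to drive either an operator alert or an imputation step.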
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using the Network API, typically once data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with REST API or SQL.
Disclaimer: I work for Axibase.

Resources