I'm brand new to Azure Data Factory. Previously I worked with SSIS and Pentaho. Recently I started using this tool to create some ETL, and I've noticed some differences between the time values reported at the end of the process. So I wonder what they mean (Duration - Processing Time - Time), and especially why there is such a big difference between Duration and Processing Time. Is this difference a standard preparation time for the tool, or something like that?
Regards.
The "Duration" time at the top of your screenshot is the end-to-end time for the pipeline activity. It takes into account all factors: marshaling your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time.
The bottom section of your screenshot shows the amount of time Spark spent in that stage of your transformation logic, all of which runs on in-memory data frames.
The write time is shown in the data flow execution plan in the Sink transformation and the cluster acquisition time is shown at the top.
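As a rough mental model of how those numbers relate (the component names and values below are illustrative assumptions, not official ADF fields; the real figures come from the monitoring UI):

```python
# Illustrative breakdown of where a data flow activity's "Duration" goes.
cluster_acquisition_s = 240  # acquiring/spinning up the Spark cluster
processing_s = 45            # Spark transformation work ("Processing Time")
sink_write_s = 30            # I/O write time, shown on the Sink transformation
orchestration_s = 10         # ADF marshaling the script to the cluster, etc.

duration_s = (cluster_acquisition_s + processing_s
              + sink_write_s + orchestration_s)
print(f"Duration: {duration_s}s, of which Spark processing: {processing_s}s")
```

This is why Duration is typically much larger than Processing Time: most of the gap is cluster acquisition, not your transformation logic.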
Hadoop MapReduce and its ecosystem (like Hive) are usually used for batch processing. But I would like to know whether there is any way we can use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let's try to implement a real-time app using Hadoop. To understand the scenario, consider a temperature sensor. Assuming the sensor keeps working, we will keep getting new readings, so the data will never stop.
We should not wait for the data to finish, as that will never happen. Perhaps we should instead run the analysis periodically (e.g. every hour): run Spark every hour and process the last hour's data.
What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Or we could calculate the hourly aggregates, store them, and combine them into the 24-hour result. That will work, but we have to write code to do it.
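For illustration, here is a minimal sketch of that "store hourly aggregates, combine into a daily result" idea in plain Python (the readings and the in-memory store are made-up assumptions; in practice the hourly summaries would live in durable storage):

```python
from collections import deque

# Keep the last 24 hourly summaries; each entry is (count, total, max).
hourly_aggs = deque(maxlen=24)

def aggregate_hour(readings):
    """Reduce one hour of raw temperature readings to a small summary."""
    return (len(readings), sum(readings), max(readings))

def last_24h_summary():
    """Combine stored hourly summaries instead of reprocessing raw data."""
    count = sum(c for c, _, _ in hourly_aggs)
    total = sum(t for _, t, _ in hourly_aggs)
    peak = max(m for _, _, m in hourly_aggs)
    return {"avg": total / count, "max": peak}

# Every hour: summarize the new hour and roll the 24-hour window forward.
hourly_aggs.append(aggregate_hour([20.1, 20.4, 21.0]))
print(last_24h_summary())
```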
Our problems have just begun. Let us list a few requirements that complicate the problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after an hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for the data to arrive in storage? Now you cannot start the job exactly at the boundary; you need to watch the disk and trigger the job once the data for the hour boundary has arrived.
Well, you can run Hadoop fast. But will the job finish within 1 second? Can we write the data to disk, read it back, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver on an Allen screw.
Stream Processing
The right tool for this kind of problem is called "stream processing". Here "stream" refers to the data stream: the sequence of data that keeps coming. Stream processing can watch the data as they come in, process them, and respond to them within milliseconds.
Following are reasons why we would want to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing (a concrete sketch follows the list).
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process that batch. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (an update); with batch, this often takes minutes.
Stream processing naturally fits time-series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (an example of detecting a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing handles this easily. And if you take a step back, most continuous data series are time-series data; for example, almost all IoT data is time-series data. Hence, it makes sense to use a programming model that fits it naturally.
Batch lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. As a result, stream processing can get by with a lot less hardware than batch processing.
Sometimes the data is so huge that it is not even possible to store it. Stream processing lets you handle large, fire-hose-style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
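To make this concrete, here is a minimal sketch of the temperature-sensor scenario using Spark Structured Streaming, one popular stream processor (the socket source, message format, and window sizes are illustrative assumptions, not a production setup):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, max as max_, window

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

# Assumed input: lines like "2021-01-01T10:00:00,sensor-1,93.5" on a socket.
raw = (spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load())

readings = raw.selectExpr(
    "cast(split(value, ',')[0] as timestamp) as ts",
    "split(value, ',')[1] as sensor_id",
    "cast(split(value, ',')[2] as double) as temp_c",
)

# A one-hour window that is updated continuously as events arrive,
# instead of a batch job launched at each hour boundary.
hourly = (readings
          .withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "1 hour"), col("sensor_id"))
          .agg(avg("temp_c").alias("avg_temp"),
               max_("temp_c").alias("max_temp")))

query = (hourly.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```

An alert rule (e.g. "warn within a second if max_temp exceeds a threshold") would hang off the same continuously updated aggregate rather than waiting for an hourly job.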
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data and batch processing on them, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here's a good reference link for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are the links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid
I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). As the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from easy dashboard creation to full data-mining processes and OLAP cubes.
I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near-real-time, or fully real-time, DWH. I have read about push and pull strategies, but my conclusions are the following:
The choice of DBMS is not important for creating a real-time DWH. I mean that it is possible with MySQL, SQL Server, Oracle, or any other. As I am doing this as a personal project, I chose MySQL.
The key factor is the frequency of job scheduling, and this is the scheduler's task. Is this assumption correct? I mean, is the key to creating a real-time DWH simply to schedule jobs every second for every ETL process?
If I am wrong, can you help me understand this? And then, what is the way to create a real-time DWH? Is there any open-source scheduler that allows this? And any non-open-source scheduler?
I am very confused because some references say this is impossible, while others say it is possible.
Definition
Very interesting question. First of all, you should define how "real-time" your realtime needs to be. Realtime means very low latency for incoming data, but it requires good architecture in the sending systems, maybe an event bus or messaging queue, and good infrastructure on the receiving end. This usually involves some kind of listener and pushing from the delivering systems.
Near-realtime would be the next "lower" level. If we say near-realtime means at most about 5 minutes of delay, your approach could work as well. For example, you could pull the data every minute or so. But keep in mind that you need some kind of high-performance check for whether new data is available and which data to get. If this check plus the pull takes longer than a minute, it becomes hard to keep up with the data. It really depends on the volume.
Realtime
As I said before, realtime analytics requires at best a messaging queue or a service bus that some of your jobs can connect to and "listen" on for new data. When a new data package is pushed into the pipeline, it will probably be very small and can be processed very fast.
If there is no infrastructure for listeners, you need to go near-realtime.
Near-realtime
This is the part where you have to develop more. You have to make sure to get relatively small data packages, which will usually be some kind of delta. This could be done with triggers if you have access to the database. Otherwise you have to pull every once in a while, where your "once" will probably be very frequent.
This could be done on Linux, for example, with a simple cron job, or on Windows with the Task Scheduler. Just keep in mind that your loading and processing time shouldn't exceed the time window you have until the next job starts.
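As a rough illustration of such a frequent delta pull against MySQL (the table and column names are made up; the high-water-mark-on-a-timestamp-column pattern shown here is one common approach, triggers are another):

```python
import time
import mysql.connector  # assumes the mysql-connector-python package

last_seen = "1970-01-01 00:00:00"  # high-water mark of already-loaded rows

def load_into_dwh(rows):
    """Stand-in for loading the delta into the DWH staging area."""
    print(f"loaded {len(rows)} changed rows")

def pull_delta(conn):
    """Fetch only rows changed since the last pull (the 'delta')."""
    global last_seen
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT id, payload, updated_at FROM source_table "
        "WHERE updated_at > %s ORDER BY updated_at",
        (last_seen,),
    )
    rows = cur.fetchall()
    if rows:
        last_seen = rows[-1]["updated_at"]
        load_into_dwh(rows)

conn = mysql.connector.connect(host="localhost", user="etl",
                               password="...", database="source_db")
while True:
    pull_delta(conn)
    time.sleep(60)  # near-realtime: the cycle must finish within the interval
```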
Database
In the end, once you have defined what you want to achieve and have a general idea of how to implement delta loading or listeners, you are right: you could use a relational database. If you are interested in performance and are modelling this part as a star schema, you could also look into column-based engines or column-based databases like Apache Cassandra.
Scheduling
For job scheduling, you could also start with the standard Linux or Windows planning tools. If you code in Java, you could later use something like Quartz. But this would only be the case for near-realtime; realtime requires a different architecture, as I explained above.
I want to know the difference between Hadoop batch analytics and Hadoop real-time analytics.
E.g. Hadoop real-time analytics can be done using Apache Spark, while Hadoop batch analytics can be done using MapReduce programming.
Also, if real-time analytics is the preferred one, then what is batch analytics required for?
Thanks.
Batch means you process all the data you have collected so far. Real-time means you process data as it enters the system. Neither one is "preferred".
Let me explain use cases for batch processing and real-time processing.
Batch processing:
In a stock market application, you need to provide the following summary data on a daily basis:
For each stock, total number of buy orders and sum of all buy orders
For each stock, total number of sell orders and sum of all sell orders
For each stock, total number of successful orders & failed orders
etc.
Here you need 24 hours of stock market data to generate these reports.
Weather application:
Save the weather reports of all places in the world, for all countries. For a given place like New York or a country like America, find the hottest and coldest day since 1900. This query requires huge input data sets, which require processing on thousands of nodes.
You can use a Hadoop MapReduce job to produce the above summaries. You may have to process petabytes of data stored on 4000+ servers in a Hadoop cluster.
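For illustration, here is a toy sketch of the daily stock summary as a batch job in PySpark (the input path and column names are assumptions; the same logic could be written as a classic MapReduce job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-order-summary").getOrCreate()

# Assumed schema: stock, side ("BUY"/"SELL"), status ("OK"/"FAILED"), amount
orders = spark.read.csv("hdfs:///orders/2021-01-01/*.csv",
                        header=True, inferSchema=True)

summary = orders.groupBy("stock").agg(
    F.count(F.when(F.col("side") == "BUY", 1)).alias("buy_orders"),
    F.sum(F.when(F.col("side") == "BUY", F.col("amount"))).alias("buy_total"),
    F.count(F.when(F.col("side") == "SELL", 1)).alias("sell_orders"),
    F.sum(F.when(F.col("side") == "SELL", F.col("amount"))).alias("sell_total"),
    F.count(F.when(F.col("status") == "OK", 1)).alias("successful"),
    F.count(F.when(F.col("status") != "OK", 1)).alias("failed"),
)
summary.write.mode("overwrite").parquet("hdfs:///reports/2021-01-01/")
```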
Real time analytics:
Another use case: you are logged into a social networking site like Facebook or Twitter. Your friend posts a message on your wall or tweets on Twitter, and you have to get these notifications in real time.
When you visit a site like Booking.com to book a hotel, you get real-time notifications like "X users are currently viewing this hotel". These notifications are generated in real time.
In the above use cases, the system should process streams of data and generate real-time notifications to users instead of waiting for one day's data. Spark Streaming provides excellent support for handling these types of scenarios.
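As a rough sketch of the notification pattern, here is a consumer that reacts to each event as it arrives (the Kafka topic, message format, and the kafka-python client are illustrative assumptions; Spark Streaming would subscribe to a similar feed):

```python
import json
from kafka import KafkaConsumer  # assumes the kafka-python package

def notify(user, message):
    """Stand-in for a real push-notification call."""
    print(f"notify {user}: {message}")

# Each event is assumed to be JSON like
# {"user": "alice", "author": "bob", "text": "hi"}.
consumer = KafkaConsumer(
    "wall-posts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Events are handled as they arrive instead of waiting for a daily batch.
for event in consumer:
    post = event.value
    notify(post["user"], f'{post["author"]} posted on your wall')
```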
Spark uses in-memory processing for faster query execution, but it's not always possible to keep petabytes of data in memory. Spark can process terabytes of data, while Hadoop can process petabytes.
Hadoop batch analytics and real-time analytics are totally different; which one you want depends on your use case. For example, if you have a large volume of raw data and want to extract only a little information from it (based on some calculation, trend detection, etc.), this can be done with batch processing, like finding the minimum temperature over the last 50 years.
Real-time analytics, by contrast, means you need the expected output ASAP, like receiving your friend's tweet on Twitter as soon as it is tweeted.
Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time. Data is collected, entered, and processed, and then the batch results are produced (Hadoop is focused on batch data processing). Batch processing requires separate programs for input, processing, and output. Examples are payroll and billing systems.
In contrast, real-time data processing involves continual input, processing, and output of data. Data must be processed within a small time period (in or near real time). Radar systems, customer service systems, and bank ATMs are examples.
Apache Kylin looks like a great tool that fills the needs of a lot of data scientists. It's also a very complex system. We are developing an in-house solution with exactly the same goal in mind: a multidimensional OLAP cube with low query latency.
Among the many issues, the one I'm most concerned about right now is fault tolerance.
With large volumes of incoming transactional data, the cube must be updated incrementally, and some of the cuboids are updated over long periods of time, such as those with a time dimension at the scale of a year. Over such a long period, some piece of the complex system is guaranteed to fail, so how does the system ensure that all the raw transactional records are aggregated into the cuboids exactly once, no more, no less? Even if each of the pieces has its own fault-tolerance mechanism, that doesn't mean they will play together automatically.
For simplicity, we can assume all the input data is saved in HDFS by another process and can be "played back" in any way you want, to recover from any interruption, voluntary or forced. What are Kylin's fault-tolerance considerations, or is this not really an issue?
There are data faults and system faults.
Data fault tolerance: Kylin partitions a cube into segments and allows rebuilding an individual segment without impacting the whole cube. For example, assume a new daily segment is built every day and merged into a weekly segment at the weekend; weekly segments merge into a monthly segment, and so on. When there is a data error (or any other change) within the current week, you need to rebuild only one day's segment. Data changes further back will require rebuilding a weekly or monthly segment.
The segment strategy is fully customizable, so you can balance data-error tolerance against query performance. More segments means more tolerance for data changes, but also more scans to execute per query. Kylin provides a RESTful API; an external scheduling system can invoke the API to trigger segment builds and merges, as sketched below.
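For illustration, a minimal sketch of triggering a segment rebuild through that REST API (the host, cube name, time range, and default credentials are placeholder assumptions; check the Kylin REST documentation for the exact endpoint of your version):

```python
import requests  # assumes the requests package

KYLIN = "http://kylin-host:7070/kylin/api"  # placeholder host
AUTH = ("ADMIN", "KYLIN")                   # default credentials; change them

def rebuild_segment(cube, start_ms, end_ms):
    """Trigger a build for one time segment of a cube."""
    resp = requests.put(
        f"{KYLIN}/cubes/{cube}/rebuild",
        auth=AUTH,
        json={"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"},
    )
    resp.raise_for_status()
    return resp.json()  # contains the job ID, which can be polled for status

# Rebuild just the segment covering 2021-01-01 (UTC, epoch milliseconds).
rebuild_segment("sales_cube", 1609459200000, 1609545600000)
```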
A cube is still online and can serve queries while some of its segments are under rebuild.
System fault tolerance: Kylin relies on Hadoop and HBase for most system redundancy and fault tolerance. In addition, every build step in Kylin is idempotent, meaning you can safely retry a failed step without any side effects. This ensures the final correctness, no matter how many failures and retries the build process has been through.
(I'm also an Apache Kylin co-creator and committer. :-)
Note: I'm an Apache Kylin co-creator and committer.
The fault tolerance point is a really good one, and we have actually been asked about it in some cases where users have extremely large datasets. Recalculating from the beginning would require huge computing resources, network traffic, and time.
But from a product perspective, the question is: which is more important, a precise result or resources? For transactional data, I believe the exact number is more important, but for behavioral data an approximation should be fine; for example, the distinct count value in Kylin is currently an approximate result. It depends on what kind of case you will leverage Kylin for to serve your business needs.
We will put this idea into our backlog and will post an update to the Kylin dev mailing list when we have a clearer plan for this.
Thanks.
Is there a way I can measure the time taken by a particular node in a TIBCO workflow process?
e.g. How much time did the JMS/Database node take to complete its operation?
The following goes for TIBCO BusinessWorks:
a) In TIBCO Administrator, you can see the time elapsed for each individual activity under Service Instances > BW Process > Process Definitions. Select each process after running it once, and you will get an execution count, elapsed time, and CPU time for each activity that ran.
b) If you are only interested in a single activity, you can add two Mapper activities to the flow, one before and one after the node you want to measure, and assign each of them a value of tib:timestamp(). Their difference gives you the elapsed time in milliseconds.
You can enable statistics in TIBCO Administrator for the deployed engine (Engine Control tab -> Start Statistic Collection). This will produce a CSV file on the local disk (the path is also displayed there) with the elapsed times of all activities of the processes executed by your engine.
You can then use this data for detailed analysis.
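For example, here is a quick sketch that summarizes such a statistics CSV per activity (the file name and column names are assumptions; check the header of the CSV your engine actually produces):

```python
import csv
from collections import defaultdict

# Assumed columns: "ProcessName", "ActivityName", "ElapsedTime" (milliseconds).
totals = defaultdict(lambda: [0, 0])  # activity -> [total_ms, call_count]

with open("bwengine-stats.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = f'{row["ProcessName"]}/{row["ActivityName"]}'
        totals[key][0] += int(row["ElapsedTime"])
        totals[key][1] += 1

# Print the slowest activities first.
for activity, (ms, calls) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
    print(f"{activity}: {ms} ms over {calls} calls ({ms / calls:.1f} ms avg)")
```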