Fiware-Cosmos MapReduce - hadoop

I have a question regarding the MapReduce example explained here:
http://forge.fiware.org/plugins/mediawiki/wiki/fiware/index.php/BigData_Analysis_-_Quick_Start_for_Programmers
It is indeed the most common example of Hadoop MapReduce, the WordCount.
I am able to execute it with no problems on the global instance of Cosmos, but even when I give it a small input (a file with 2 or 3 lines) it takes quite a while to run (half a minute, more or less). I assume this is its normal behavior, but my question is: why does it take so long even for such a small input?
I guess this method increases its effectiveness with bigger datasets, where this minimal delay is negligible.

First of all, you have to take into account that the current instance of Cosmos at FIWARE LAB is a shared instance of Hadoop, thus many other users may be executing MapReduce jobs at the same time, resulting in a "competition" for the computation resources.
That being said, MapReduce is designed for large datasets and large data files. It adds a lot of overhead that is not necessary when processing a couple of lines (because for a couple of lines of analysis you don't need MapReduce! :)) but which helps a lot when those lines are thousands, even millions. In those cases the processing time is proportional to the data size, of course, but not in a, let's say, 1:1 proportion.
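Just to give a sense of the overhead: for two or three lines you can count the words locally in well under a second with a plain script, with none of Hadoop's job submission, scheduling and task start-up costs. A minimal sketch in Python (the input file name is just a placeholder):

    # Local word count for a tiny input: no cluster, no job scheduling overhead.
    # "small_input.txt" is a hypothetical file with a few lines of text.
    from collections import Counter

    with open("small_input.txt") as f:
        counts = Counter(word for line in f for word in line.split())

    for word, count in counts.most_common():
        print(word, count)

The seconds of fixed cost you are paying on Cosmos (spawning task JVMs, scheduling map and reduce tasks, writing intermediate output) are roughly constant regardless of input size, which is why they only pay off on large datasets.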

Related

How long does it take to process the file if I have only one worker node?

Let's say I have a dataset with 25 blocks and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? And what about 15 nodes? Will the time change if we change the replication factor to 3?
I really need some help.
First of all, I would advise reading some scientific papers regarding the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time has a very strong relation to the amount of data you want to process (which makes sense). On our cluster, it takes on average around 7-8 seconds for a mapper to read a block of 128 MB. There are several factors you need to consider in order to predict the overall execution time:
How much data the mappers produce, which more or less determines the time Hadoop needs for the shuffle phase.
What the reducer is doing. Does it do some iterative processing? (That might be slow!)
What the resource configuration is (how many mappers and reducers are allowed to run on the same machine).
Finally, whether there are other jobs running simultaneously (this can slow down your job significantly, since your reducer slots can be occupied waiting for data instead of doing useful work).
So already for one machine you can see the complexity of predicting job execution time. During my study I was able to conclude that on average one machine is capable of processing 20-50 MB/second (the rate is calculated with the following formula: total input size / total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster, for example). The processing rate is different for different use cases and is greatly influenced by the input size and, more importantly, by the amount of data produced by the mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments you will, on average, see improved performance, but once again from my study I could conclude that the scaling is not linear: you would need to fit, for your own infrastructure, a model with the respective variables that approximates job execution time.
Just to give you an idea, I will share part of my results. The rate when executing a particular use case on 1 node was ~46 MB/second, for 2 nodes it was ~73 MB/second, and for 3 nodes it was ~85 MB/second (in my case the replication factor was equal to the number of nodes).
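As a very rough illustration of this kind of model, here is how one might estimate just the map phase for the scenario in the question above (25 blocks, ~5 minutes per block). The number of concurrent map slots per node is an assumption, and the sketch deliberately ignores shuffle, reduce and scheduling overhead, which as discussed above can dominate:

    import math

    def estimate_map_phase_minutes(blocks, minutes_per_block, nodes, map_slots_per_node=2):
        """Naive estimate: blocks are processed in waves of (nodes * map_slots_per_node) mappers."""
        concurrent_mappers = nodes * map_slots_per_node
        waves = math.ceil(blocks / concurrent_mappers)
        return waves * minutes_per_block

    # 25 blocks, 5 minutes per block, 2 map slots per node (assumed):
    print(estimate_map_phase_minutes(25, 5, nodes=1))   # 13 waves -> ~65 minutes
    print(estimate_map_phase_minutes(25, 5, nodes=15))  # 1 wave  -> ~5 minutes

Note that the replication factor does not appear in this naive model at all; in practice a higher replication factor mostly improves the odds of data-local tasks rather than changing the number of waves.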
The problem is complex and requires time, patience and some analytical skills to solve. Have fun!

Tez. Slow reducers

I am seeing strange behavior with a Tez MapReduce job.
I'm trying to read log data from Hive, split it into chunks by id, date and some other parameters, and then write it to other Hive tables.
The map phase works fast enough and takes about 20 minutes; then the reducers start to work, and 453 of the 458 reducers process all their data within the next 20 minutes. But the last 5 reducers work for about 1 hour.
It happens because my input data includes some huge entries, and processing these entries takes a lot of time.
What is the best practice for such cases? Should I do some Hadoop/Tez/Hive tuning to allow some kind of parallel processing for the last reducers, or would it be smarter to split the input data by other parameters to avoid huge entries?
Thanks for any advice.
The magic word behind that not-so-strange behavior is skew. And it's a veeeery common issue. Usually people prefer ignoring the problem... until they really feel the pain (just like you do now).
With Tez, since HIVE-7158 (Use Tez auto-parallelism in Hive), you can try to tinker with some specific properties:
hive.tez.auto.reducer.parallelism
hive.tez.max.partition.factor
hive.tez.min.partition.factor
But that "auto-parallelism" feature seems to apply when you have several abnormally small reduce datasets that can be merged, while your problem is the exact opposite (one abnormally large reduce dataset). So you should try also to tinker with
hive.exec.reducers.bytes.per.reducer
hive.exec.reducers.max
...to change the scale and make "large" the new "normal" (hence "normal" becoming "small"). But then, maybe all you will get will be 3 reducers all taking 1 hour to complete. Hard to say.
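To see what "changing the scale" means in practice, here is the rough arithmetic behind the reducer count when you do not set it explicitly: roughly one reducer per hive.exec.reducers.bytes.per.reducer of input, capped at hive.exec.reducers.max (a simplified sketch; Tez auto-parallelism can then adjust the count at runtime, and the example numbers below are assumptions, not your defaults):

    import math

    def estimated_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
        """Roughly one reducer per bytes_per_reducer of input, capped at max_reducers."""
        return min(math.ceil(total_input_bytes / bytes_per_reducer), max_reducers)

    GB = 1024 ** 3

    # 500 GB of input with 256 MB per reducer asks for ~2000 reducers (capped here at 1009):
    print(estimated_reducers(500 * GB, 256 * 1024 ** 2, max_reducers=1009))

    # Raising bytes.per.reducer to 4 GB shrinks that to ~125 reducers, each handling more data:
    print(estimated_reducers(500 * GB, 4 * GB, max_reducers=1009))

The point is that the skewed partition stays the same size either way; what you change is how big a "normal" partition is allowed to be relative to it.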
Good luck. This kind of performance tuning is more Art than Science.
References:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.tez.auto.reducer.parallelism
https://www.mail-archive.com/user@tez.apache.org/msg00641.html
http://fr.slideshare.net/t3rmin4t0r/hivetez-a-performance-deep-dive/25
http://fr.slideshare.net/hortonworks/discoverhdp22faster-sql-queries-with-hive/28
~~~~~~
PS: of course, if you could remove the source of skewness by changing the way you organize your input dataset...

Spark "ExecutorLostFailure" - how to solve?

I've checked out some of the other answers on "ExecutorLostFailure" and most of them either:
1. Don't have an answer
2. Insist on increasing the executor memory and the number of cores
Here are some of the ones that I'm referring to: here here here
Is there any other solution to this? I've tried both, but it's unclear to me how to correctly gauge how much to allocate for each (memory and cores) in my SparkContext.
The error occurs within a saveAsTextFile action. Thanks.
From my experience, increasing the executor memory can help. But I'd suggest that this is a naive fix, and usually the underlying issue will remain.
The reason I say this is that one of Spark's most important features is that it allows you to perform computations on datasets that are too big to fit in memory. In theory, you could perform most calculations on a 1TB dataset with a single executor with 2GB memory.
In every case that I've encountered an OOM, it has been one of the following two reasons:
1. Insufficient executor memory overhead
This only applies if you are using a resource manager like Mesos or YARN. Check the Spark docs for guidance on this.
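As a minimal sketch of this first point, assuming Spark on YARN (the exact property names vary between Spark versions, so treat these as placeholders and check the docs for yours):

    # Raise the off-heap overhead YARN reserves per executor, on top of the JVM heap.
    # Older releases use "spark.yarn.executor.memoryOverhead" (value in MB);
    # newer ones use "spark.executor.memoryOverhead".
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("overhead-example")
            .set("spark.executor.memory", "4g")                    # executor JVM heap
            .set("spark.yarn.executor.memoryOverhead", "1024"))    # overhead in MB
    sc = SparkContext(conf=conf)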
2. Something you are doing in your transformations is causing your RDD to become massively "horizontal".
Recall that I said Spark can handle datasets that are too big to fit in memory. The caveat to this is that the datasets must be vertically parallelizable - think a text file with 10^8 rows, where each row contains a relatively small data point (e.g. list of floats, JSON string, a single sentence etc.). Spark will then partition your dataset and send an appropriate number of rows to each executor.
The problem arises when a single row is very large. This is unlikely to occur through normal map-like operations (unless you are doing something quite weird), but is very easy to do through aggregation-type operations like groupByKey or reduceByKey. Consider the following example:
Dataset (name, age):
John 30
Kelly 36
Steve 48
Jane 36
If I then do a groupByKey with the age as key, I will get data in the form:
36 [Kelly, Jane]
30 [John]
48 [Steve]
If the number of rows in the initial dataset is very large, the rows in the resulting dataset could be very long. If they are long enough, they may be too large to fit into executor memory.
The solution?
It depends on your application. In some cases, it may indeed be unavoidable, and you may just have to increase executor memory. But usually it's possible to restructure your algorithm to avoid the issue, e.g. by replacing a groupByKey with a countByKey, or throwing away data points with a very high incidence rate (in one case I observed, it was a bot generating millions of requests that was responsible for the issue. These could be safely discarded without affecting the analysis).
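To make the difference concrete, here is a minimal PySpark sketch built on the toy names/ages data from above (the data is tiny here, but the shape of the problem is the same at scale):

    from pyspark import SparkContext

    sc = SparkContext(appName="skew-example")
    people = sc.parallelize([("John", 30), ("Kelly", 36), ("Steve", 48), ("Jane", 36)])

    # Key by age, as in the example above.
    by_age = people.map(lambda kv: (kv[1], kv[0]))

    # groupByKey materialises the full list of names for each age on one executor:
    grouped = by_age.groupByKey().mapValues(list)          # 36 -> [Kelly, Jane], ...

    # If all you need is a count per age, countByKey never builds those lists:
    counts = by_age.countByKey()                           # {36: 2, 30: 1, 48: 1}

    # reduceByKey aggregates within each partition before shuffling, so no single
    # row ever holds every value for a key:
    counts_rdd = by_age.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)

One caveat: countByKey returns a plain dict to the driver, so it is only appropriate when the number of distinct keys is small; reduceByKey keeps the aggregation distributed.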

What is the difference and how to choose between distributed queue and distributed computing platform?

There are many files that need to be processed by two computers in real time. I want to distribute them across the two computers, and these tasks need to be completed as soon as possible (meaning real-time processing). I am thinking about the plans below:
(1) a distributed queue like Gearman
(2) a distributed computing platform like Hadoop/Spark/Storm/S4 and so on
I have two questions:
(1) What are the advantages and disadvantages of (1) versus (2)?
(2) How do I choose within (2): Hadoop? Spark? Storm? S4? Something else?
Thanks!
Maybe I have not described the question clearly. In most cases there are 1000-3000 files with the same format. These files are independent, and you do not need to care about their order; the size of one file is maybe tens to hundreds of KB, and in the future the number of files and the size of a single file will grow. I have written a program that can process a file, pick up the data, and then store the data in MongoDB. Right now there are only two computers; I just want a solution that can process these files with the program quickly (as soon as possible) and is easy to extend and maintain.
A distributed queue is easy to use in my case but maybe hard to extend and maintain; Hadoop/Spark is too "big" for the two computers but easy to extend and maintain. Which is better? I am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other, or do you need some form of aggregation? i.e., do you need some pieces of data to go together? Say, all transactions from a single user account.
Is your processing CPU bound? Memory bound? Filesystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then run batch computations with Spark (should be generally better than plain hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
etc. So the answer to "how do you choose?" is "by carefully looking at the constraints of your particular problem, estimating the load demands on your system, and picking the solution that best matches those". None of these solutions and frameworks dominates the others; that's why they are all alive and kicking. The choice is all in the tradeoffs you are willing/able to make.
Hope it helps.
First of all, dannyhow is right - this is not what real-time processing is about. There is a great book, http://www.manning.com/marz/, which says a lot about the lambda architecture.
The two ways you mentioned serve completely different purposes and are connected to the definition of the word "task". For example, Spark will take a whole job you give it and divide it into "tasks", but the outcome of one task is useless on its own; you still need to wait for the whole job to finish. You can create small jobs working on the same dataset and use Spark's caching to speed them up (see the sketch below). But then you won't get much advantage from distribution (if they have to be run one after another).
Are the files big? Are they somehow connected to each other? If yes, I'd go with Spark. If not, a distributed queue.
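For what it's worth, the caching idea mentioned above looks roughly like this in PySpark (a sketch; the path and the parsing step are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-example")

    # Parse the files once and keep the result in memory across several small jobs.
    records = sc.textFile("hdfs:///logs/*.txt").map(lambda line: line.split("\t")).cache()

    job_1 = records.count()                                    # first action fills the cache
    job_2 = records.filter(lambda r: r[0] == "ERROR").count()  # later actions reuse cached data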

Why do real-world MapReduce jobs tend to have very small dataset sizes?

All the papers I have read suggest that real-world MapReduce jobs tend to operate on relatively small dataset sizes (mostly map-only, and the vast majority of jobs operate on KB-16GB). If anyone working in the production world could talk about how and why smaller datasets tend to be the case, I would understand better. For small datasets (<128MB), do the files tend to be fragmented or contiguous, since this has implications for the splits and the number of map tasks spawned? And does Hadoop let MapReduce operate on only a section of a file?
Any pointers are much appreciated.
Typically small data is used to quickly check if the logic / code is good enough. The evaluations have to be done again and again until a good solution is obtained.
I work in production, and we use small data for unit testing (on the order of MBs); we have sample datasets of 10-30 GB which we use for integration testing on the dev side. But this is way too small compared to the actual data dealt with on prod servers (which is on the order of terabytes). The dev environment has low capacity compared to the prod environment, so we cannot expect terabytes of data to run smoothly over it... plus it's time consuming, as it has to be executed for every release.
Moving to technical papers: authors want real data, and specifically data that is inclined towards the particular use cases they attempt to solve. It's difficult to obtain huge datasets (10-100 GB) focused on their problem. I have seen a few papers where they used huge datasets, but those researchers belonged to big companies and could easily get that data.
