I'm trying to understand lazy evaluation in Apache Spark.
My understanding is as follows.
Let's say I have a text file on my hard drive.
Steps:
1) First I create RDD1, which is nothing but a data definition at this point (no data is loaded into memory yet).
2) I apply some transformation logic to RDD1 and create RDD2. RDD2 is still only a data definition (still no data loaded into memory).
3) Then I apply a filter to RDD2 and create RDD3 (still no data loaded into memory; RDD3 is also just a data definition).
4) I perform an action so that I get RDD3's output in a text file. The moment I perform this action, where I expect actual output, Spark loads the data into memory, creates RDD1, RDD2 and RDD3, and produces the output.
So the laziness of RDDs in Spark means: keep drawing up the roadmap (the RDDs), and only materialize it once an action gives the go-ahead.
Is my understanding correct up to here?
My second question: it is said that lazy evaluation is one of the reasons Spark is more powerful than Hadoop. May I know how, since I'm not very familiar with Hadoop? What happens in Hadoop in this scenario?
Thanks :)
Yes, your understanding is fine. A graph of operations (a DAG) is built up by the transformations, and they are computed all at once when an action is called. This is what is meant by lazy execution.
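To make the four steps from the question concrete, here is a minimal sketch in plain Python. It is not Spark's API or implementation; `LazyDataset` and `read_lines` are hypothetical names made up for this illustration. Transformations only record a step, and the source is read only when the action (`collect`) runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not data."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that yields records
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new definition, touches no data.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new definition, touches no data.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the source read and the recorded steps applied.
        records = self._source()
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


reads = []  # records every time the "file" is actually read


def read_lines():
    reads.append("read")
    return iter(["spark is lazy", "hadoop", "lazy evaluation"])


rdd1 = LazyDataset(read_lines)                     # step 1: definition only
rdd2 = rdd1.map(str.upper)                         # step 2: still a definition
rdd3 = rdd2.filter(lambda line: "LAZY" in line)    # step 3: still a definition
reads_before_action = list(reads)                  # nothing has been read yet
result = rdd3.collect()                            # step 4: the action runs everything
```

After the last line, `reads_before_action` is empty, `reads` holds a single entry, and `result` contains the two upper-cased lines that mention "LAZY".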
Hadoop only provides a filesystem (HDFS), a resource manager (YARN), and the libraries that allow you to run MapReduce. Spark only concerns itself with being more efficient than the last of these, given enough memory.
Apache Pig is another framework in the Hadoop ecosystem that allows for lazy evaluation, but it has its own scripting language, whereas Spark can be programmed in a wide range of languages. Pig can run its computations on MapReduce, Tez, or Spark. Spark only runs and optimizes its own code.
In actual MapReduce code, you have to procedurally write out each stage of a job to disk or memory yourself in order to accomplish relatively large tasks.
Spark is not a replacement for "Hadoop"; it's a complement.
I have a use case where I have millions of small files in S3 that need to be processed by Spark. I have two options to reduce the number of tasks:
1. Use Coalesce
2. Extend CombineFileInputFormat
But I'm not clear on the performance implications of both, or when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, so I'm not sure how to pass a configurable maxSplitSize.
Another good option to consider for such scenarios is SparkContext.wholeTextFiles(), which produces one record per file, with the file name as the key and the content as the value; see the documentation.
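For intuition about what maxSplitSize controls, here is a rough sketch in plain Python of packing many small files into fewer splits. This is a simplified greedy illustration, not Hadoop's actual algorithm (which, among other things, also takes node and rack locality into account); `combine_into_splits` and the sample file names are made up for the example.

```python
def combine_into_splits(file_sizes, max_split_size):
    """Greedily pack (name, size) pairs into groups of at most max_split_size bytes.

    A single file larger than the limit still gets a group of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current group if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical file listing: (name, size in bytes).
files = [("a", 40), ("b", 50), ("c", 30), ("d", 120), ("e", 10)]
splits = combine_into_splits(files, max_split_size=100)
```

With these sample sizes, five files end up in four splits: `a` and `b` share one, and the rest each get their own. The practical difference from coalesce, as far as I understand it, is that this grouping happens before any partitions exist, while coalesce merges partitions that have already been planned per file.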
I am just a beginner with the Hadoop framework. I would like to understand a few concepts; I have browsed many links, but I would like to get clear answers.
1) Why does MapReduce work only with key-value pairs? I also read that I can create a MapReduce job without actually using reduce.
2) The key for the input of the mapping phase is the file offset. Can I use an explicit key value, or a custom input?
Good, you are digging into Hadoop concepts.
1) Can I use an explicit key value, or a custom input?: Yes, write your own RecordReader (overriding the default one) to do so.
2) Why does MapReduce work only with key-value pairs?:
As the name suggests, a MapReduce program maps (filters) the required data out of the data set fed to it and hands it to reduce (which combines it based on unique keys).
Now, why key-value pairs?: Since you are processing unstructured data, you would not want to get the same unstructured data back as output; some manipulation of the data is required. Think of using a Map in Java: the key uniquely identifies the pair. Hadoop does the same with the help of Sort & Shuffle.
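As a rough illustration of why keys matter, here is a word count sketched in plain Python rather than in Hadoop's Java API. The function names are made up for this example. The mapper receives (offset, line) pairs, similar to the default text input, and emits (word, 1) pairs; sorting by key stands in for Sort & Shuffle; the reducer then sees all values for one key together.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(offset, line):
    # The input key (the offset) is ignored here; the output key is the word.
    for word in line.split():
        yield word.lower(), 1


def reduce_phase(word, counts):
    # All counts for the same key arrive together.
    return word, sum(counts)


def run_job(lines):
    # Build (offset, line) input records; assumes one byte per character plus a newline.
    records, offset = [], 0
    for line in lines:
        records.append((offset, line))
        offset += len(line) + 1

    mapped = [pair for off, line in records for pair in map_phase(off, line)]
    mapped.sort(key=itemgetter(0))  # stand-in for Sort & Shuffle: group by key
    return [
        reduce_phase(key, (value for _, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


word_counts = run_job(["the quick fox", "the lazy dog", "The fox"])
```

Without a key there would be nothing to sort and group on, so the framework could not decide which mapper outputs belong to the same reduce call.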
Can I create a MapReduce job without actually using reduce?:
Of course. It depends on the use case, but it is recommended only for small operations and for scenarios where the mapper outputs do not need to be combined to get the expected output.
Reason: This is where the distributed concept and commodity hardware are given priority. For example: I have a large data set to process. When processing the data set with a plain Java program (just Java, not Hadoop), we store the required data in Collection objects (as simple as using RAM). Hadoop was introduced to do the same job in a different fashion: store the required data in the context. In the mapper, Context refers to the intermediate data (local FS); in the reducer, it refers to the output (HDFS). Of course, in both cases the Context is stored on hard disk.
In other words, Hadoop keeps the results of its calculations on hard disk instead of in RAM.
I suggest reading Hadoop: The Definitive Guide and the Data Algorithms book for a better understanding.
Hadoop's Distributed Cache lets the developer add small files to the MR context which can be used to obtain additional information during Map or Reduce phases. However, I did not find a way to access this cache in a Partitioner. I need the contents of a small file (the output of an earlier MR job) in a custom Partitioner to determine how the keys are sent to the reducers.
Unfortunately, I cannot find any useful documentation on this, and my only idea is currently a somewhat "hackish" approach, which involves serializing the contents of the file to a Base64 string and putting it into the Configuration. Configurations can be used in a partitioner by letting it implement Configurable. While the file is small enough for this approach (around 50KB) I suppose the distributed cache is better suited for this.
EDIT:
I found another approach which I consider slightly better. Since the file I need to access in the partitioner is on HDFS, I put its fully-qualified URI into the Configuration. In my Partitioner's setConf method I can then re-create the Path via new Path(new URI(conf.get("some.file.key"))) and read it with the help of the Configuration. Still hackish though...
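The pattern from the EDIT can be sketched in plain Python as follows. This is only an illustration, not Hadoop code: a dict stands in for the Configuration, `set_conf` and `get_partition` mirror setConf and getPartition, and I am assuming (the post does not say) that the file holds one sorted split point per line.

```python
import bisect
import os
import tempfile


class FileBackedPartitioner:
    """Hypothetical partitioner that loads sorted split points from a file."""

    def __init__(self):
        self._split_points = []

    def set_conf(self, conf):
        # Called once with the job configuration, before any key is partitioned.
        with open(conf["some.file.key"], encoding="utf-8") as handle:
            self._split_points = sorted(
                line.strip() for line in handle if line.strip()
            )

    def get_partition(self, key, num_partitions):
        # Keys below the first split point go to partition 0, and so on.
        index = bisect.bisect_right(self._split_points, key)
        return min(index, num_partitions - 1)


def demo():
    # Stand-in for the small output file of the earlier job.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("g\nn\nt\n")
    try:
        partitioner = FileBackedPartitioner()
        partitioner.set_conf({"some.file.key": tmp.name})
        keys = ("apple", "kiwi", "pear", "zebra")
        return [partitioner.get_partition(key, 4) for key in keys]
    finally:
        os.remove(tmp.name)
```

The point of reading the file in `set_conf` is that it happens once per partitioner instance, so `get_partition` stays a cheap in-memory lookup for every key.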
I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
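The rotate-and-batch idea can be sketched in plain Python like this. It is not Hadoop code, and `micro_batches`, `count_words`, and the sample stream are made up for the illustration. An endless stream is cut into bounded batches, and each batch is processed as a small finite job whose results are merged into running totals.

```python
from collections import Counter


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) iterable into lists of at most batch_size records."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final partial batch when a finite stream ends.
        yield batch


def count_words(batch):
    # One "job" over a finite batch: it can finish because the input has an end.
    return Counter(word for line in batch for word in line.split())


# Hypothetical stream; in practice this would be lines read from the HTTP source.
stream = iter(["a b", "b c", "c d", "d a", "a a"])

totals = Counter()
for batch in micro_batches(stream, batch_size=2):
    totals.update(count_words(batch))  # merge per-batch results into running totals
```

Here `batch_size` plays the role of "inputs large enough to warrant the overhead": too small and the per-job startup cost dominates, too large and results lag further behind the stream.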
HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about http://s4.io/? It's made for processing streaming data.
Update
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look over Esper CEP ( http://esper.codehaus.org/ ).
Yahoo S4 http://s4.io/
It provides real-time stream computing, similar to MapReduce.
Twitter's Storm is what you need; give it a try!
Multiple options here.
I suggest the combination of Kafka and Storm + (Hadoop or NoSQL) as the solution.
We have already built our big data platform using those open-source tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, there are two main issues with using Hadoop for stream mining. First, it uses HDFS, which is disk-based, and disk operations add latency that results in missing data from the stream. Second, the pipeline is not parallel: MapReduce generally operates on batches of data, not on individual instances as stream processing does.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, they use incremental learners, which no longer run in batch. It is worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
Hadoop Online is another prototype that attempts to solve the same issues as M3 does: Hadoop Online
Apache Storm is the key solution to the issue, but it is not enough on its own. You still need some equivalent of MapReduce, which is why you need a library called SAMOA; it has great algorithms for online learning, something Mahout largely lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.