lazy evaluation in Apache Spark - hadoop

I'm trying to understand lazy evaluation in Apache spark.
My understanding says:
Lets say am having Text file in hardrive.
Steps:
1) First I'll create RDD1, that is nothing but a data definition right now.(No data loaded into memory right now)
2) I apply some transformation logic on RDD1 and creates RDD2, still here RDD2 is data definition (Still no data loaded into memory)
3) Then I apply filter on RDD2 and creates RDD3 (Still no data loaded into memory and RDD3 is also an data definition)
4) I perform an action so that I could get RDD3 output in text file. So the moment I perform this action where am expecting output something from memory, then spark loads data into memory creates RDD1, 2 and 3 and produce output.
So laziness of RDDs in spark says just keep making the roadmap(RDDs) until they dont get the approval to make it or produce it live.
Is my understanding correct upto here...?
My second question here is, its said that its(Lazy Evaluation) one of the reason that the spark is powerful than Hadoop, May I know please how because am not much aware of Hadoop ? What happens in hadoop in this scenario ?
Thanks :)

Yes, your understanding is fine. A graph of actions (a DAG) is built via transformations, and they computed all at once upon an action. This is what is meant by lazy execution.
Hadoop only provides a filesystem (HDFS), a resource manager (YARN), and the libraries which allow you to run MapReduce. Spark only concerns itself with being more optimal than the latter, given enough memory
Apache Pig is another framework in the Hadoop ecosystem that allows for lazy evaluation, but it has its own scripting language compared to the wide programmability of Spark in the languages it supports. Pig supports running MapReduce, Tez, or Spark actions for computations. Spark only runs and optimizes its own code.
What happens in actual MapReduce code is that you need to procedurally write out each stage of an action to disk or memory in order to accomplish relatively large tasks
Spark is not a replacement for "Hadoop" it's a compliment.

Related

Uploading data to HDFS cluster from custom format

I have several machines with TBs of log data in a custom format which can be read with a c++ library. I want to upload all data to hadoop cluster (HDFS) while converting it to parquet files.
This is an on going process (meaning every day I will get more data) and not a one time effort.
What is best alternative to do it performance wise (doing it efficiently)?
Is the parquet C++ library as good as the Java one? (updates, bugs, etc.)
The solution should handle tens of TBs per day or even more in the future.
Log data arrives on going and should be available immediately on HDFS cluster.
Performance-wise, your best approach will be to gather the data in batches and then write out a new Parquet file per batch. If your data is received in single lines and you want to persist them immediately on HDFS, then you could also write them out to a row-based format (that supports single line appends), e.g. AVRO and run regulary a job that compacts them into a single Parquet file.
Library-wise, parquet-cpp is much more in active development at the moment then parquet-mr (the Java library). This is mainly due to the fact that active parquet-cpp development (re-)started about 1.5 years ago (winter/spring 2016). So updates to the C++ library will happen very quickly at the moment while the Java library is very mature as it has a huge userbase since quite some years. There are some features like predicate pushdown that are not yet implemented in parquet-cpp but these all on the read path, so for write they don't matter.
We now at a point with parquet-cpp, that it already runs very stable in different productive environments, so in the end, your choice of using the C++ or Java library should mainly depend on our system environment. If all your code is currently running in the JVM, than use parquet-mr, otherwise, if you're a C++/Python/Ruby user, use parquet-cpp.
Disclaimer: I'm one of the parquet-cpp developers.

What is the purpose of cache an RDD in Apache Spark?

I am new for Apache Spark and I have couple of basic questions in spark which I could not understand while reading the spark material. Every materials have their own style of explanation. I am using PySpark Jupyter notebook on Ubuntu to practice.
As per my understanding, When I run the below command, the data in the testfile.csv is partitioned and stored in memory of the respective nodes.( actually I know its a lazy evaluation and it will not process until it sees action command ), but still the concept is
rdd1 = sc.textFile("testfile.csv")
My question is when I run the below transformation and action command, where does the rdd2 data will store.
1.Does it stores in memory?
rdd2 = rdd1.map( lambda x: x.split(",") )
rdd2.count()
I know the data in rdd2 will available till I close the jupyter notebook.Then what is the need of cache(), anyhow rdd2 is available to do all transformation. I heard after all the transformation the data in memory is cleared, what is that about?
Is there any difference between keeping RDD in memory and cache()
rdd2.cache()
Does it stores in memory?
When you run a spark transformation via an action (count, print, foreach), then, and only then is your graph being materialized and in your case the file is being consumed. RDD.cache purpose it to make sure that the result of sc.textFile("testfile.csv") is available in memory and isn't needed to be read over again.
Don't confuse the variable with the actual operations that are being done behind the scenes. Caching allows you to re-iterate the data, making sure it is in memory (if there is sufficient memory to store it in it's entirety) if you want to re-iterate the said RDD, and as long as you've set the right storage level (which defaults to StorageLevel.MEMORY). From the documentation (Thanks #RockieYang):
In addition, each persisted RDD can be stored using a different
storage level, allowing you, for example, to persist the dataset on
disk, persist it in memory but as serialized Java objects (to save
space), replicate it across nodes, or store it off-heap in Tachyon.
These levels are set by passing a StorageLevel object (Scala, Java,
Python) to persist(). The cache() method is a shorthand for using the
default storage level, which is StorageLevel.MEMORY_ONLY (store
deserialized objects in memory).
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.
Is there any difference between keeping RDD in memory and cache()
As stated above, you keep it in memory via cache, as long as you've provided the right storage level. Otherwise, it won't necessarily be kept in memory at the time you want to re-use it.

Hadoop use-case scenario

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well.
We are creating a product which would be used to analyse data from social media. So the input data would be a large volume of tweets, facebook posts, user profiles, YouTube data and data from blogs etc. On top of this I would be having a web application to help me view and analyse this data. As the requirement makes it clear, I would be needing a sort of real time system. So if I have a tweet coming in, I would like to have it available to my web app readily for processing. Batch data processing may not be a suitable choice for my application.
My questions are:
Is a Hadoop engine a good choice for me?
What are the parameter I should base my decision on?
Is it also a good option to use a Multi Cluster MySQL engine as opposed to Hadoop?
Is there any benchmarking in terms of Size and velocity of data in which Hadoop becomes a good choice?
Hadoop is not appropriate for near real time / interactive analysis. Hadoop was designed to do big batch processing of say a few hours of data plus. I used to use Hadoop to process any dataset that was around 10 GB or more (which is still a bit overkill), once it get's to 100 GB then you defo want something like Hadoop.
Now my recommendation would be for Spark as this is much more modern, much faster, more flexible, more powerful, and has a SparkStreaming module for achieving closer to real time analysis. Read all about it! https://spark.apache.org/
In this case I prefer the Lambda Architecture.
With Lambda Architecture you have two routes: A fast route with a noSQL database for the current informations, and a batch route with hadoop-hdfs for the archive data, and with a merge component you can merge the two datasources in one query, so you receive a whole amount of data, which is near real time.
http://lambda-architecture.net/
Image about lambda architecture: http://i.stack.imgur.com/eofRW.png
We created a PoC Project with Lambda Architecture (also for Twitter analysis), and its working fine.
Spark will be the best solution for your problem.You can also look other in-memory databases.

Extending Hive: writing a UDF that does both Map and Reduce operations

I am working on a project to extend Hive to support some image processing functions.
To do this, we need to read in an image, break it up into multiple files, pass each into a separate Map task that does some processing on it and then reduce them back into one image to be returned to the user.
To do this, we had planned to implement a UDF that would call a MapReduce task in Hadoop. However, from what we understand the UDF would only operate either on the Map side OR the Reduce side of the HQL query, while we need it to ideally 'bridge the gap' between the Map and the Reduce side.
The Hive documentation isn't the most helpful, and I was looking for some pointers on where to start looking for more information about this.
Please feel free to ask more questions if I haven't been clear enough in the question.
Looking into HIPI (Hadoop Image Processing Inteface) might give you a start.
Particularly, the example on computing the Principal Components of a bunch of images might be of interest.
Use UDAF (User Defined Aggragate Function). Which has sort of map and reduce phase.

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
Hbase is a BigTable clone in the hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about http://s4.io/. It's made for processing streaming data.
Update
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look over Esper CEP ( http://esper.codehaus.org/ ).
Yahoo S4 http://s4.io/
It provide real time stream computing, like map reduce
Twitter's Storm is what you need, you can have a try!
Multiple options here.
I suggest the combination of Kafka and Storm + (Hadoop or NoSql) as the solution.
We already build our big data platform using those opensource tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know the main issues with Hadoop for usage in stream mining are the fact that first, it uses HFDS which is a disk and disk operations bring latency that will result in missing data in stream. second, is that the pipeline is not parallel. Map-reduce generally operates on batches of data and not instances as it is with stream data.
I recently read an article about M3 which tackles the first issue apparently by bypassing HDFS and perform in-memory computations in objects database. And for the second issue, they are using incremental learners which are not anymore performed in batch. Worth checking it out M3
: Stream Processing on
Main-Memory MapReduce. I could not find the source code or API of this M3 anywhere, if somebody found it please share the link here.
Also, Hadoop Online is also another prototype that attemps to solve the same issues as M3 does: Hadoop Online
However, Apache Storm is the key solution to the issue, however it is not enough. You need some euqivalent of map-reduce right, here is why you need a library called SAMOA which actually has great algorithms for online learning that mahout kinda lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. Besides the article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.

Resources