Storm Trident 'merge' function that preserves time order - apache-storm

Say I have two streams:
Stream 1: [1,3],[2,4]
Stream 2: [2,5],[3,2]
A regular merge would produce a Stream 3, like this:
[1,3],[2,4],[2,5],[3,2]
I would like to merge the stream whilst preserving the order in which the
tuple was emitted, so if [2,5] was emitted at time 1, [1,3] was emitted at
time 2, [3,2] at time 3 and [2,4] at time 4, the resulting stream would
be:
[2,5],[1,3],[3,2],[2,4]
Is there any way to do this and, if so, how? Some sample code would be
appreciated, as I'm a complete Trident rookie who has recently been thrust
into a Trident-based project.
Thanks in advance for your help,
Eli

You have to use external data storage via Trident's persistence support. A Redis sorted set should serve your purpose, I guess.
MORE INFO
If you go through the Trident tutorial, https://github.com/nathanmarz/storm/wiki/Trident-tutorial, you can see how memcached is used as the store for a word count.
Similarly, you can back the stream with Redis (if you are not familiar with Redis, see
http://redis.io/commands#sorted_set). I think a Redis sorted set, with the emission time as the score, will serve your case, as sketched below.
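For illustration, here is a minimal sketch with the Jedis client that sidesteps Trident-specific state factories and just shows the sorted-set idea: the key name, the "[a,b]" tuple encoding, and the hard-coded emission times from your example are assumptions for the demo, not part of any Trident API.

import redis.clients.jedis.Jedis;

public class TimeOrderedMerge {
    public static void main(String[] args) {
        // Assumes a Redis server on localhost:6379 and the Jedis client on the classpath.
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String key = "merged-stream"; // hypothetical key name

            // Store each merged tuple with its emission time as the sorted-set score.
            redis.zadd(key, 1, "[2,5]");
            redis.zadd(key, 2, "[1,3]");
            redis.zadd(key, 3, "[3,2]");
            redis.zadd(key, 4, "[2,4]");

            // ZRANGE returns members ordered by score, i.e. by emission time:
            // [2,5], [1,3], [3,2], [2,4]
            for (String tuple : redis.zrange(key, 0, -1)) {
                System.out.println(tuple);
            }
        }
    }
}

In a real topology, each stream would do the ZADD as tuples are processed, and a downstream consumer would read the set back in score order.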
If you want persistent storage for your data, you can also consider another NoSQL solution such as MongoDB and index your final data on the emission time; that will easily give you the sort you want. What's more, someone has already written a MongoDB Trident state: https://github.com/sjoerdmulder/trident-mongodb.
Let me know if anything is still unclear.

Related

Why does a Beam IO need beam.AddFixedKey + beam.GroupByKey to work properly?

I'm working on a Beam IO for Elasticsearch in Golang. At the moment I have a working draft version, but I only managed to make it work by doing something whose purpose isn't clear to me.
Basically I looked at existing IO's and found that writes only work if I add the following:
x := beam.AddFixedKey(s, pColl)
y := beam.GroupByKey(s, x)
A full example is in the existing BigQuery IO.
Basically, I would like to understand why I need AddFixedKey followed by GroupByKey to make it work. I also checked issue BEAM-3860, but it doesn't have much more detail about this.
Those two transforms essentially function as a way to group all elements in a PCollection into one list. For example, its usage in the BigQuery example you posted allows grouping the entire input PCollection into a list that gets iterated over in the ProcessElement method.
Whether to use this approach depends how you are implementing the IO. The BigQuery example you posted performs its writes as a batch once all elements are available, but that may not be the best approach for your use case. You might prefer to write elements one at a time as they come in, especially if you can parallelize writes among different workers. In that case you would want to avoid grouping the input PCollection together.
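Your Go snippet already shows the pattern; for comparison, here is a minimal sketch of the same fixed-key-then-GroupByKey batching idiom in the Beam Java SDK. The element values are made up and System.out.println stands in for the real bulk write.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class FixedKeyBatchWrite {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> docs = p.apply(Create.of("doc-1", "doc-2", "doc-3"));

        // Step 1: pair every element with the same constant key
        // (the Java analogue of the Go SDK's AddFixedKey).
        PCollection<KV<Integer, String>> keyed =
            docs.apply("AddFixedKey", ParDo.of(new DoFn<String, KV<Integer, String>>() {
                @ProcessElement
                public void processElement(@Element String doc, OutputReceiver<KV<Integer, String>> out) {
                    out.output(KV.of(0, doc));
                }
            }));

        // Step 2: GroupByKey collapses the whole collection into one iterable,
        // so a single ProcessElement call can write everything as one bulk request.
        keyed.apply(GroupByKey.<Integer, String>create())
            .apply("BulkWrite", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
                @ProcessElement
                public void processElement(@Element KV<Integer, Iterable<String>> batch) {
                    for (String doc : batch.getValue()) {
                        System.out.println("writing " + doc); // stand-in for the actual sink client
                    }
                }
            }));

        p.run().waitUntilFinish();
    }
}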

Spark: save files in a distributed way

According to the Spark documentation,
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
I am currently working on a large dataset that, once processed, outputs an even bigger amount of data, which needs to be stored in text files, as done with the command saveAsTextFile(path).
So far I have been using this method; however, since it is an action (as stated above) and not a transformation, Spark needs to send data from every partition to the driver node, thus slowing down the process of saving quite a bit.
I was wondering if any distributed file saving method (similar to saveAsTextFile()) exists on Spark, enabling each executor to store its own partition by itself.
I think you're misinterpreting what it means to send a result to the driver. saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
What that documentation is referring to is sending the result of saveAsTextFile (or any other "action") back to the driver. If you call collect() then it will indeed send the data to the driver, but saveAsTextFile only sends a succeed/failed message back to the driver once complete. The save itself is still done on many nodes in the cluster, which is why you'll end up with many files - one per partition.
IO is always expensive. But sometimes it can seem as if saveAsTextFile is even more expensive precisely because of the lazy behavior described in that excerpt. Essentially, when saveAsTextFile is called, Spark may perform many or all of the prior operations on its way to being saved. That is what is meant by laziness.
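To make that concrete, here is a minimal sketch (the data and the output path are made up) showing that the map stays lazy and the final save writes one part file per partition on the executors, with only a success/failure signal returning to the driver:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistributedSaveDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("distributed-save-demo"));

        // 4 partitions: the save below produces part-00000 .. part-00003,
        // each written by the executor that owns that partition.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // Transformation: lazy, nothing is computed yet.
        JavaRDD<String> processed = numbers.map(n -> "value=" + n * n);

        // Action: triggers the whole lineage; the data is written by the executors,
        // and the driver only learns whether the save succeeded.
        processed.saveAsTextFile("hdfs:///tmp/distributed-save-demo"); // hypothetical output path

        sc.stop();
    }
}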
If you have the Spark UI set up it may give you better insight into what is happening to the data on its way to a save (if you haven't already done that).

Write-only stream

I'm using the joliver/EventStore library and trying to find a way to get a stream without reading any events from it.
The reason is that I just want to write some events to the store for a specific stream without loading all 10k messages from it.
The way you're expected to use the store is that you always do a GetById first. Even if you new up an Aggregate and Save it, you'll see in the CommonDomain EventStoreRepository that it will first correlate it with the existing data.
The key reason why a read is needed first is that the infrastructure needs to work out how many events have gone before to compute the new commit sequence number.
Regarding the example threshold you cite as the reason for wanting to optimize this away: if you're really going to have that many events, you'll already be in snapshotting territory, as you'll need an appropriately efficient way of doing things beyond blind writes too.
Even if you're not intending to lean on snapshotting, half the benefit of using EventStore is that the facility is built in for when you need it.

A Persistent Store for Increment/Decrementing Integers Easily and Quickly

Is there some sort of persistent key-value store that allows for quick and easy incrementing, decrementing, and retrieval of integers (and nothing else)? I know that I could implement something with a SQL database, but I see two drawbacks to that:
It's heavyweight for the task at hand. All I need is the ability to say "server[key].inc()" or "server[key].dec()"
I need the ability to handle potentially thousands of writes to a single key simultaneously. I don't want to deal with excessive resource contention. Change the value and get out - that's all I need.
I know memcached supports inc/dec, but it's not persistent. My strategy at this point is to use a SQL server behind a queueing system of some sort, so that there's only one process updating the database. It just seems... harder than it should be.
Is there something someone can recommend?
Redis is a key-value store that supports several data types. Integers are supported, along with the INCR and DECR commands, and Redis can persist data to disk.
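For example, a minimal sketch with the Jedis client (the key name is made up; persistence across restarts assumes RDB or AOF is enabled on the server):

import redis.clients.jedis.Jedis;

public class CounterDemo {
    public static void main(String[] args) {
        // Assumes a Redis server on localhost:6379 and the Jedis client on the classpath.
        try (Jedis redis = new Jedis("localhost", 6379)) {
            long afterInc = redis.incr("page:views");        // atomic increment, returns the new value
            long afterDec = redis.decr("page:views");        // atomic decrement
            long afterAdd = redis.incrBy("page:views", 10);  // increment by an arbitrary amount

            System.out.printf("incr=%d decr=%d incrBy=%d%n", afterInc, afterDec, afterAdd);
        }
    }
}

Because these commands are atomic on the server, thousands of concurrent writers can hit the same key without any locking on your side.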

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about S4 (http://s4.io/)? It's made for processing streaming data.
Update
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look at Esper CEP (http://esper.codehaus.org/).
Yahoo S4 (http://s4.io/) provides real-time stream computing, like MapReduce.
Twitter's Storm is what you need; give it a try!
There are multiple options here.
I suggest the combination of Kafka and Storm plus Hadoop or a NoSQL store as the solution; a rough sketch of wiring Kafka into Storm follows below.
We already built our big data platform using those open-source tools, and it works very well.
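For illustration only, here is a minimal sketch using Storm's storm-kafka-client module: the broker address, topic name, and logging bolt are assumptions, and a real topology would replace the bolt with processing that writes to Hadoop or a NoSQL store.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class KafkaStormSketch {
    // Placeholder bolt: just logs each Kafka record's value.
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("got: " + tuple.getStringByField("value"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream declared
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical broker and topic.
        KafkaSpoutConfig<String, String> spoutConfig =
            KafkaSpoutConfig.builder("localhost:9092", "events").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout<>(spoutConfig), 1);
        builder.setBolt("log", new LogBolt(), 2).shuffleGrouping("kafka");

        // Local mode for experimentation; a production topology would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("kafka-storm-sketch", new Config(), builder.createTopology());
        Thread.sleep(60_000);
        cluster.shutdown();
    }
}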
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, the main issues with using Hadoop for stream mining are, first, that it uses HDFS, and disk operations introduce latency that can result in missed data in the stream; and second, that the pipeline is batch-oriented: MapReduce generally operates on batches of data, not on individual instances as with stream data.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, they use incremental learners that no longer run in batch mode. It's worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online.
Apache Storm is the key solution to the issue; however, it is not enough on its own. You need some equivalent of MapReduce, which is why you need a library called SAMOA, which has great algorithms for online learning that Mahout somewhat lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this approach is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.
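For instance, a minimal Spark Streaming sketch that consumes data from an open socket (the host, port, and 5-second batch interval are made up for the example):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketStreamDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("socket-stream-demo").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Read lines from a TCP socket in 5-second micro-batches.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Any map/reduce-style transformations run distributed across the executors.
        lines.map(String::toUpperCase).print();

        ssc.start();
        ssc.awaitTermination();
    }
}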
