Lamina vs Storm - events

I am designing a prototype realtime monitor for processing fairly large amounts (>30G/day) of streaming numeric data. I would like to write this in Clojure, as the language seems to be well suited to the kind of "Observer + state machine" system that this will probably end up as.
The two main candidates I have found for a framework are Lamina and Storm. There are also Riemann and Pulse, but the former seems to be more of a full solution than a framework, and I'd rather not commit to a final design yet; Pulse's repo also looks a little unmaintained.
What I would like to know is: what kinds of data flow and workflow are these two projects optimised for? Storm seems to be more mature, but Lamina seems more composable and more idiomatically Clojure (my background is Python, so I tend to rate this highly).
What I've found from reading online:
Storm seems to be Big Data (stream) focussed; the core is straight Java with a Clojure DSL. It appears to have pre-built handlers for a number of existing data sources.
Lamina is more a lightweight, reusable component that does the Clojure thing of coding to abstractions, meaning it can be reused as a base for other eventing systems. The data sources need to be handled in code.
Both have a useful set of aggregation/splitting/computation library functions out of the box. Lamina's graphviz integration is a nice touch.

Storm probably isn't a bad choice, but "over 30GB per day" of numeric data isn't big data, it is tiny data: it averages out to roughly 350KB per second. Any semi-modern computer can handle that much data easily on one node with Lamina. You might want to go with Storm anyway, so that once you do get into a realm where you need more servers you can scale easily, but I imagine there's some initial friction in getting Storm set up (and some ongoing friction in maintaining the cluster), which will be wasted if you never have to scale up.

Storm incorporates cluster management and handling of failed nodes into the flow, because it was designed to be sort of "like Hadoop, but for streaming"; from what I understand of your requirements, that seems closer to your use case.

Lamina seems like an okay choice, but it appears to be totally lacking the killer feature of Storm: cluster computing management. A Storm cluster will take care of most of the dirty work of distributing your computation across a cluster of nodes, leaving you to focus on your business logic, so long as you fit it within the Storm framework. Lamina, from what I can see, provides a nice way to organize your computation, but then you'll have to take care of all the details of scaling that out if that's something you need.
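To make "fitting it within the Storm framework" concrete, here is a minimal sketch of what the questioner's numeric-monitoring topology might look like in Storm's Java API. SensorSpout, the field names, and the running-mean logic are all hypothetical, and the old pre-Apache backtype.storm packages are assumed:

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class MonitorTopology {

        // Hypothetical bolt: tracks a running mean per sensor id.
        public static class RunningMeanBolt extends BaseBasicBolt {
            private final Map<String, double[]> stats = new HashMap<String, double[]>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String id = tuple.getStringByField("sensor");
                double value = tuple.getDoubleByField("value");
                double[] s = stats.get(id);
                if (s == null) stats.put(id, s = new double[2]);
                s[0] += 1;                     // count
                s[1] += (value - s[1]) / s[0]; // incremental mean
                collector.emit(new Values(id, s[1]));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sensor", "mean"));
            }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // SensorSpout is a hypothetical spout emitting ("sensor", "value") tuples.
            builder.setSpout("readings", new SensorSpout(), 1);
            builder.setBolt("means", new RunningMeanBolt(), 4)
                   .fieldsGrouping("readings", new Fields("sensor")); // same sensor, same task
            new LocalCluster().submitTopology("monitor", new Config(), builder.createTopology());
        }
    }

The fieldsGrouping call is where Storm earns its keep: it routes tuples to parallel bolt instances across the cluster without any routing code of your own, which is exactly the bookkeeping Lamina leaves to you.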

Related

Apache Beam and ETL processes

Given following processes:
manually transforming huge .csv files via rules (using MS Excel or Excel-like software) & sharing them via FTP
scripts (usually written in Perl or Python) which basically transform data, preparing it for other processes.
APIs batch-reading from files or other origin sources & updating their corresponding data model.
Spring Boot deployments used (or abused) to regularly collect & aggregate data from files or other sources.
And given these problems/areas of improvement:
Standardization: I'd like (as far as it makes sense) to propose a unified, powerful tool that natively deals with these types of (kind of big) data transformation workflows.
Raising the abstraction level of the processes (related to the point above): Many of the "tasks/jobs" I mentioned above are seen by the teams using them in a very technical, low-level, task-like way. I believe a higher-level view of these processes/flows, highlighting their business meaning, would help self-document them better, and would also help establish a ubiquitous language that different stakeholders can refer to and think unambiguously about.
IO bottlenecks and resource utilization (technical): Some of those processes fail more often than would be desirable (or take a very long time to finish) due to memory or network bottlenecks. Though it is clear that hardware has limits, resource utilization doesn't seem to have been a priority in many of these data transformation scripts.
Do the Dataflow model, and specifically the Apache Beam implementation paired with either Flink or Google Cloud Dataflow as a backend runner, offer a proven solution to those "mundane" topics? The material on the internet mainly focuses on discussing the unified streaming/batch model and typically covers more advanced features like streaming/event windowing/watermarks/late events/etc., which do look very elegant and promising indeed, but I have some concerns regarding tool maturity and long-term community support.
It's hard to give a concrete answer to such a broad question, but I would say that, yes, Beam/Dataflow is a tool that can handle this kind of thing. Even though the documentation focuses on "advanced" features like windowing and streaming, lots of people are using it for more "mundane" ETL. For questions about tool maturity and community you could consider sources like Forrester reports, which often speak of Dataflow.
You may also want to consider pairing it with other technologies like Airflow/Composer.
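As a rough idea of the shape such a "mundane" pipeline takes, here is a minimal sketch using the Beam Java SDK. The file paths and the uppercase transform are placeholders for whatever your rules actually do; the runner (DirectRunner, Flink, Dataflow) is chosen with the --runner flag at launch:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class CsvEtl {
        public static void main(String[] args) {
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
            Pipeline p = Pipeline.create(options);

            p.apply("ReadCsv", TextIO.read().from("input/*.csv"))          // placeholder path
             .apply("DropEmpty", Filter.by((String line) -> !line.trim().isEmpty()))
             .apply("ApplyRules", MapElements.into(TypeDescriptors.strings())
                                             .via((String line) -> line.toUpperCase())) // stand-in rule
             .apply("WriteOut", TextIO.write().to("output/result"));

            p.run().waitUntilFinish();
        }
    }

The same code runs as a local batch job and, unchanged, on a Flink or Dataflow cluster, which is what makes it attractive for standardizing the one-off Perl/Python scripts.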

Cluster Computing in Go

Is there a framework for cluster computing in Go? (I wish to bring together multiple PCs for custom parallel computation, and wonder whether Go might be a suitable language to use.)
I don't know the level of connectedness you plan to have in your cluster, but Go's RPC package makes communication among nodes trivial. It will likely serve as the backbone of your work, and you can build abstractions on top of it (for instance, if you need to multicast requests to different nodes). The examples given in the doc assume your nodes will communicate over HTTP, but that bit is abstracted out in net/rpc to allow different transports.
http://golang.org/pkg/net/rpc/
You can use Hadoop Streaming with Go. See a (somewhat dated) example here.
You should have a look at Go Circuit.
Quoting from the introduction:
The circuit reduces the human development and sustenance costs of complex massively-scaled systems nearly to the level of their single-process counterparts. ...
... and:
For instance, we have been able to write large real-world cloud applications — e.g. streaming multi-stage MapReduce pipelines — in as many as 200 lines of code from the ground up.
Also, for some simpler use cases, you might want to check out Golem.
You can try https://github.com/bketelsen/skynet . This is a service-oriented framework based on Doozer.

does anyone find Cascading for Hadoop Map Reduce useful?

I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach for writing jobs.
MapReduce jobs give me more freedom, and Cascading seems to be putting a lot of obstacles in the way.
It might do a good job of making simple things simple, but complex things... I find them extremely hard.
Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?
In what scenarios should I choose Cascading over the classic approach? Anyone using it and happy?
Keeping in mind I'm the author of Cascading...
My suggestion is to use Pig or Hive if they make sense for your problem, Pig especially.
But if you are in the business of data, and not just poking around your data for insights, you will find the Cascading approach makes much more sense for most problems than raw MapReduce.
Your first obstacle with raw MapReduce will be thinking in MapReduce. Trivial problems are simple in MapReduce, but it's much easier to develop complex applications if you can work with a model that more easily maps to your problem domain (filter this, parse that, sort those, join the rest, etc).
Next you will realize that a normal unit of work in Hadoop consists of multiple MapReduce jobs. Chaining jobs together is a solvable problem, but it should not leak into your application domain-level code; it should be hidden and transparent.
Further, you will find refactoring and creating re-usable code much harder if you have to continually move functions between mappers and reducers, or from mappers to the previous reducer to get an optimization. Which leads to the issue of brittleness.
Cascading believes in failing as fast as possible. The planner attempts to resolve and satisfy dependencies between all those field names before the Hadoop cluster is even engaged in work. This means 90%+ of all issues will be found before waiting hours for your job to find them during execution.
You can alleviate this in raw MapReduce code by creating domain objects like Person or Document, but many applications don't need all the fields downstream. Consider if you needed the average age of all males. You do not want to pay the IO penalty of passing a whole Person around the network when all you need is a binary gender and a numeric age.
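To illustrate both points (field-level planning and trimming unused fields), here is roughly what that average-age job looks like; a minimal sketch assuming Cascading 2.x in Hadoop mode, with a made-up input file:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.pipe.assembly.AverageBy;
    import cascading.pipe.assembly.Retain;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class AverageAge {
        public static void main(String[] args) {
            // Hypothetical tab-separated people file: name, gender, age.
            Tap source = new Hfs(new TextDelimited(new Fields("name", "gender", "age"), "\t"),
                                 "people.tsv");
            Tap sink = new Hfs(new TextDelimited(new Fields("gender", "avg_age"), "\t"),
                               "out/avg-age", SinkMode.REPLACE);

            Pipe pipe = new Pipe("people");
            // Drop unused fields early: only gender and age cross the network.
            pipe = new Retain(pipe, new Fields("gender", "age"));
            // Group by gender and average the age field.
            pipe = new AverageBy(pipe, new Fields("gender"),
                                 new Fields("age"), new Fields("avg_age"));

            // The planner resolves every field name at connect() time, so a
            // misspelled field fails now, not hours into a cluster run.
            Flow flow = new HadoopFlowConnector(new Properties()).connect(source, sink, pipe);
            flow.complete();
        }
    }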
With fail-fast semantics and lazy binding of sinks and sources, it becomes very easy to build frameworks on Cascading that themselves create Cascading flows (which become many Hadoop MapReduce jobs). A project I'm currently involved with ends up with hundreds of MapReduce jobs per run, many created on the fly mid-run based on feedback from the data being processed. Search for Cascalog to see an example of a Clojure-based framework for simply creating complex processes. Or Bixo for a web mining toolkit and framework that's far easier to customize than Nutch.
Finally, Hadoop is never used alone; your data is always pulled from some external source and pushed to another after processing. The dirty secret about Hadoop is that it is a very effective ETL framework (so it's silly to hear ETL vendors talk about using their tools to push/pull data onto/from Hadoop). Cascading eases this pain somewhat by allowing you to write your operations, applications, and unit tests independent of the integration end-points. Cascading is used in production to load systems like Membase, Memcached, Aster Data, Elastic Search, HBase, Hypertable, Cassandra, etc. (Unfortunately not all the adapters have been released by their authors.)
If you will, please send me a list of the issues you are experiencing with the interface. I am constantly looking for better ways to improve the API and documentation, and the user community is always around to help.
I've been using Cascading for a couple of years now. I find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs as compared to plain Java code. Here are a few reasons why:
A lot of the boilerplate code used to start a job is already written for you.
Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates (see the sketch below).
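A minimal sketch of that swap, assuming Cascading 2.x (the paths are made up; note that a local FileTap runs under the LocalFlowConnector while an Hfs tap runs under the HadoopFlowConnector):

    import cascading.scheme.hadoop.SequenceFile;
    import cascading.scheme.local.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class Sinks {
        // Development: plain text to a local file, easy to inspect with cat/less.
        static Tap devSink() {
            return new FileTap(new TextLine(), "debug-output.txt", SinkMode.REPLACE);
        }

        // Batch: the same pipe assembly, now writing SequenceFiles to HDFS.
        static Tap batchSink() {
            return new Hfs(new SequenceFile(Fields.ALL), "out/batch", SinkMode.REPLACE);
        }
    }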
Another great advantage of writing Cascading jobs is that you're really writing a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (i.e. the results of one job control what subsequent jobs you create and run). Or, in another case, I needed to create a job for each combination of 6 binary variables. This is 64 jobs which are all very similar. This would be a hassle with plain Hadoop MapReduce classes.
While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap that. This allows you to have the benefits of Cascading, but very custom operations can be written as straight Java functions (implementing a Cascading interface).
I used Cascading with Bixo to write the complete anti-spam link classification pipeline for a large social network.
The Cascading pipeline resulted in 27 MR jobs, which would have been very difficult to maintain in plain MR. I have written MR jobs before, but using something like Cascading feels like switching from Assembler to Java (insert_fav_language_here).
One of the big advantages over Hive or Pig IMHO is that Cascading is a single jar, which you bundle with your job. Pig and Hive have more dependencies (e.g. MySQL) or are not as easy to embed.
Disclaimer: While I know Chris Wensel personally, I really think Cascading is kick a**. Considering its complexity it is extremely impressive that I haven't found a single bug using it.
I teach the Hadoop Boot Camp course for Scale Unlimited, and also make extensive use of Cascading in Bixo and for building web mining apps at Bixo Labs - so I think I've got a good appreciation for both approaches.
The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.
The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense. Plus the error messages can remind one of the output from lex/yacc ("conflict in shift/reduce") :)
-- Ken
I think that the place that Cascading's advantages begin to show are instances where you have a pile of simple functions that should all be kept separate in source code, but which can all be collected into a composition in your mapper or reducer. Putting them together makes your basic map-reduce code hard to read, separating them makes the program really slow. Cascading's optimizer can put them together even though you write them separately. Pig and to some extent Hive can do this as well, but for large programs, I think Cascading has a maintainability advantage.
In a few months Plume may be an expressivity competitor, but if you have real programs to write and run in a production setting, then Cascading is probably your best bet.
Cascading allows you to use simple field names and tuples in place of the primitive types offered by Hadoop, which "... tend to be at the wrong level of granularity for creating sophisticated, highly composable code that can be shared among different developers" (Tom White, Hadoop: The Definitive Guide). Cascading was designed to solve those problems. Keep in mind, some of the applications like Cascading, Hive, Pig, etc. were developed in parallel and sometimes do the same thing. If you don't like Cascading or find it confusing, maybe you would be better off using something else?
I'm sure you already have this, but here is the user guide: http://www.cascading.org/1.1/userguide/pdf/userguide.pdf. It provides a decent walk through of the flow of data in a typical Cascading application.
I worked with Cascading for a couple of years, and below are the things I found useful in it:
1. code testability
2. easy integration with other tools
3. easy extensibility
4. you focus only on business logic, not on keys and values
5. proven in production, and used even by Twitter.
I recommend that people use Cascading most of the time.
Cascading is a wrapper around Hadoop that provides Taps and Sinks to and from Hadoop.
Writing mappers and reducers for all your tasks is going to be tedious. Try writing one Cascading job and then you're all set to avoid writing any mappers and reducers.
You also want to look at Cascading Taps and Schemes (this is how you get data into your Cascading processing job).
With these two, i.e. the ability to avoid writing ad-hoc Hadoop mappers and reducers, and the ability to consume a wide variety of data sources, you can solve a lot of your data processing problems quickly and effectively.
Cascading is more than just a simple wrapper around Hadoop; I am trying to keep the answer simple. For example, I've ported a huge MySQL database containing terabytes of data to log files using the Cascading JDBC tap.

Getting started with massive data

I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so Java can be the language of choice. But it is possible to use other languages with Hadoop as well ("pipes" and "streams"). I know that Python is often used, for example.
You can avoid having your data in databases if you like. Originally, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed databases for Hadoop available.
Have you ever had a look at Mahout? I think that would be a hit for you ;-) Much of the work you need may already have been done!
Read the Quick Start and set up your own (pseudo-distributed?) cluster and run the word-count example.
Let me know if you have any questions :-) A comment will remind me of this question.
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64-bit JVM you can access as much RAM as you have: for example, the command-line parameter -Xmx8192M will give you access to 8GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
More important are the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working-set approach yourself, where you swap data in and out to disk and only work on a portion of the data at a time. These are sometimes referred to as chunking, batch, or incremental algorithms, depending on the context.
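As a trivial illustration of the incremental style, here is a single-pass, constant-memory mean and variance (Welford's online algorithm) over a file of numbers in Java; the file name is made up, and at least two values are assumed:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Computes mean and variance of a file of numbers (one per line)
    // without ever loading the dataset into memory.
    public class StreamingStats {
        public static void main(String[] args) throws IOException {
            long n = 0;
            double mean = 0, m2 = 0; // running mean and sum of squared deviations

            try (BufferedReader in = Files.newBufferedReader(Paths.get("values.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    double x = Double.parseDouble(line.trim());
                    n++;
                    double delta = x - mean;
                    mean += delta / n;
                    m2 += delta * (x - mean);
                }
            }
            System.out.printf("n=%d mean=%f variance=%f%n", n, mean, m2 / (n - 1));
        }
    }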
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java, so it's unlikely to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy, but it seems most math calculations are very parallelizable, with little need for threads to share memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to a shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in an RDBMS, so if that's going to happen a lot you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer, as opposed to keeping data around that you will periodically ask questions about. You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.

Relation between language and scalability

I came across the following statement in Trapexit, an Erlang community website:
Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability.
Also I recall reading somewhere that Twitter switched from Ruby to Scala to address scalability problem.
Hence, I wonder what the relation is between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling, etc. Is it because of the way a language is implemented, the libraries, or some other reason?
Hope for enlightenment. Thanks.
Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would of course be possible in other languages too. In C++ you can reload DLLs on the fly and load plugins. In Python you can reload modules. In C# you can load code on the fly, use reflection, and so on.
It's just that that functionality is built into Erlang, which means that:
it's more standard: any Erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, some are native compiled, so the performance, and the availability of type information and so on at runtime differs.
Python has a global interpreter lock around its runtime library, so it cannot make use of SMP.
Erlang only recently had changes added to take advantage of SMP.
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say, I feel architecture and design are pretty fundamental to scalability, and choosing one language over another will not, I feel, magically give awesome scalability...
Erlang comes from another culture in thinking about reliability and how to achieve it. Understanding the culture is important, since Erlang code does not become fault-tolerant by magic just because it's Erlang.
A fundamental idea is that high uptime does not only come from a very long mean time between failures; it also comes from a very short mean time to recovery, if a failure does happen.
One then realizes that one needs automatic restarts when a failure is detected, and that at the first detection of something not being quite right one should "crash" to cause a restart. The recovery needs to be optimized, and the possible information losses need to be minimal.
This strategy is followed by much successful software, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to only consider the mean time between failures, send messages to the system log about error indications, and then try to keep on running until it is not possible anymore, typically requiring a human to monitor the system and reboot it manually.
Most of these strategies are in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first is a bi-directional contract that says "if you crash, then I get your crash message, which if not trapped will crash me", and the second is "if you crash, I get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang have many names: "fail fast", "crash-only systems", "recovery-oriented programming", "expose errors", "micro-restarts", "replication", ...
Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy, as it is a language designed specifically for that purpose.
In short, the "language" primarily affects the vertical axii of scaling but not all aspects as you already eluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we would be discussing this point ad vitam aeternam. Using my proposed definition, it becomes easier to compare system implementations: a system is scalable (read: profitable) to the extent that S grows.
2) A system can be made to scale based on 2 primary axis:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation, i.e. a bigger server, more RAM, etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involved, since it requires dealing with real-world properties such as the speed of light (latency), tolerance to partitions, failures of many kinds, etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times folks confuse language with libraries & implementation. These are all different things. What makes a language a good fit for a particular system often has more to do with the support around said language: libraries, development tools, efficiency of the implementation (i.e. memory footprint, performance of built-in functions, etc.)
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.
First you have to distinguish between languages and their implementations. For instance, the Ruby language supports threads, but in the official implementation threads will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithreading) AND when it exhibits a good speedup as the number of CPUs goes up (see Amdahl's Law).
Some languages like Erlang, Scala, Oz, etc. also have syntax (or nice libraries) which helps in writing clear and nice parallel code.
In addition to the points made here about Erlang (which I was not aware of), there is a sense in which some languages are more suited for scripting and smaller tasks.
Languages like Ruby and Python have some features which are great for prototyping and creativity but terrible for large-scale projects. Arguably, their best feature is their lack of "formality", which hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like Java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where, as a consumer of a class, you can't easily tell what kind of data it's holding (see the contrast sketched below). This kind of thing gets compounded as systems get larger. You can also do things that are really difficult to trace, like dynamically adding bits to classes at runtime (which can be fun but is a nightmare if you're trying to figure out where a piece of data comes from) or calling methods that raise exceptions without being forced by the compiler to declare them. Not that you couldn't solve these kinds of things with good design and disciplined programming; it's just harder to do.
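A tiny (and deliberately contrived) Java contrast between the two styles; the User class and the field names are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    public class TypingContrast {
        // The typed version documents itself and is checked at compile time.
        static class User {
            final String name;
            final int age;
            User(String name, int age) { this.name = name; this.age = age; }
        }

        public static void main(String[] args) {
            // "Bag of stuff" style: a consumer cannot tell what this map holds
            // without reading every producer in the codebase.
            Map<String, Object> user = new HashMap<String, Object>();
            user.put("name", "Ada");
            user.put("age", 36);
            int age = (Integer) user.get("age"); // unchecked; a wrong key or type fails only at runtime

            User typed = new User("Ada", 36);
            int typedAge = typed.age;            // misuse is a compile error, visible in any IDE
        }
    }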
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the messiness, lack of typing, and global variables by being very strict and careful with coding and naming conventions (in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.
Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.
Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provide the structure to make it manageable, both technically and in your head. For instance: without the isolation provided by Erlang processes, you will shoot yourself in the foot. With the link/monitor mechanism you can react to failures sooner.