Apache Beam and ETL processes

Given the following processes:
manually transforming huge .csv files via rules (using MS Excel or Excel-like software) & sharing them via FTP
scripts (usually written in Perl or Python) which basically transform data, preparing it for other processes.
APIs batch-reading from files or other origin sources & updating their corresponding data model.
Spring Boot deployments used (or abused) in part to regularly collect & aggregate data from files or other sources.
And given these problems/ areas of improvement:
Standardization: I'd like (as far as it makes sense) to propose a unified, powerful tool that natively deals with these types of (kind of big) data transformation workflows.
Raising the abstraction level of the processes (related to the point above): Many of the "tasks/jobs" I mentioned above are seen by the teams using them in a very technical, low-level, task-like way. I believe a higher-level view of these processes/flows, highlighting their business meaning, would help self-document them better, and would also help establish a ubiquitous language different stakeholders can refer to and reason about unambiguously.
IO bottlenecks and resource utilization (technical): Some of those processes fail more often than would be desirable (or take a very long time to finish) due to some memory or network bottleneck. Though it is clear that hardware has limits, resource utilization doesn't seem to have been a priority in many of these data transformation scripts.
Do the Dataflow model, and specifically the Apache Beam implementation paired with either Flink or Google Cloud Dataflow as a backend runner, offer a proven solution to those "mundane" topics? The material on the internet mainly focuses on discussing the unified streaming/batch model and typically covers more advanced features like streaming, event windowing, watermarks, late events, etc., which do look very elegant and promising indeed, but I have some concerns regarding tool maturity and long-term community support.

It's hard to give a concrete answer to such a broad question, but I would say that, yes, Beam/Dataflow is a tool that handles this kind of thing. Even though the documentation focuses on "advanced" features like windowing and streaming, lots of people are using it for more "mundane" ETL. For questions about tool maturity and community support you could consider sources like Forrester reports, which often cover Dataflow.
You may also want to consider pairing it with other technologies like Airflow/Composer.
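To give a flavour of what such a "mundane" batch job can look like, here is a minimal sketch using the Beam Python SDK; the file names, column layout and filter rule are invented for illustration, and the same pipeline runs locally on the DirectRunner or on Flink/Dataflow purely by changing pipeline options, not the transforms.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_row(line):
        # assumed layout: id,amount,country
        record_id, amount, country = line.split(",")
        return {"id": record_id, "amount": float(amount), "country": country}

    def run():
        # DirectRunner by default; point at Flink or Dataflow via options only
        with beam.Pipeline(options=PipelineOptions()) as p:
            (
                p
                | "Read" >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
                | "Parse" >> beam.Map(parse_row)
                | "KeepLarge" >> beam.Filter(lambda r: r["amount"] > 100.0)
                | "Format" >> beam.Map(lambda r: "{id},{amount},{country}".format(**r))
                | "Write" >> beam.io.WriteToText("cleaned", file_name_suffix=".csv")
            )

    if __name__ == "__main__":
        run()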

Related

Interview assignment - design a system like S3

In my phone interview at one of the financial firms, for a software architect position, I was asked to "design a cloud storage system like AWS S3".
Here is what I answered. Would you please help with your critiques and comments on my approach? I would like to improve based on your feedback.
First, I listed requirements:
- CRUD Microservices on objects
- Caching layer to improve performance
- Deployment on PaaS
- resiliency with failover
- AAA support (authorization, auditing, accounting/billing)
- Administration microservices (user, project, lifecycle of object, SLA dashboard)
- Metrics collection (Ops, Dev)
- Security for service endpoints for admin UI
Second, I defined basic APIs:
https://api.service.com/services/get Arguments: object id, metadata. Returns: binary object
https://api.service.com/services/upload Arguments: object. Returns: object id
https://api.service.com/services/delete Arguments: object id. Returns: success/error
http://api.service.com/service/update-meta Arguments: object id, metadata. Returns: success/error
Third, I drew the picture on the board with the architecture and some COTS components I can use; below is the picture.
The interviewer did not ask me many questions, and hence I am a bit worried about whether I am on the right track with my process. Please provide your feedback.
Thanks in advance.
There are a couple of areas of feedback that might be helpful:
1. Comparison with S3's API
The S3 API is a RESTful API these days (it used to support SOAP) and it represents each 'file' (really a blob of data indexed by a key) as an HTTP resource, where the key is the path in the resource's URI. Your API is more RPC, in that each HTTP resource represents an operation to be carried out and the key to the blob is one of the parameters.
Whether this is a good or bad thing depends on what you're trying to achieve and what architectural style you want to adopt (although I am a fan of REST, that doesn't mean you have to adopt it for all applications). However, since you were asked to design a system like S3, your answer would have benefited from a clear argument as to why you chose NOT to use REST as S3 does.
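To make the distinction concrete, here is a rough sketch (Flask is used purely for brevity; the routes, parameter names and in-memory storage are hypothetical, not part of your design): the first handler is RPC-style, where the URI names an operation, while the second is resource-oriented in the way S3 is, where the URI names the object and the HTTP verb carries the operation.

    from flask import Flask, request

    app = Flask(__name__)
    BLOBS = {}  # in-memory stand-in for the storage backend

    # RPC style (as in the question): the URI names an operation,
    # and the object key is just a query parameter.
    @app.route("/services/get")
    def rpc_get():
        return BLOBS.get(request.args["object_id"], ("not found", 404))

    # REST/S3 style: the URI names the resource (bucket/key),
    # and the HTTP verb says what to do with it.
    @app.route("/<bucket>/<path:key>", methods=["GET", "PUT", "DELETE"])
    def rest_object(bucket, key):
        full_key = bucket + "/" + key
        if request.method == "GET":
            return BLOBS.get(full_key, ("not found", 404))
        if request.method == "PUT":
            BLOBS[full_key] = request.data
            return ("", 201)
        BLOBS.pop(full_key, None)
        return ("", 204)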
2. Lines connecting things
Architecture diagrams often tend to be very high level - which is appropriate - but there is sometimes a tendency to just draw lines between boxes without being clear about what those lines mean. Does it mean there is a network connection between the infrastructure hosting those software components? Does it mean there is an information or data flow between those components?
When you draw a line like the one in your diagram, with multiple boxes all joining together on it, the implication is that there is some relationship between the boxes. When you add arrows, there is the further implication that the relationship follows the direction of the arrows. But there is no clarity about what that relationship is, or why the directionality is important.
One could infer from your diagram that the Memcache Cluster and the File Storage cluster are both sending data to the Metrics/SLA portal, but that they are not sending data to each other. Or that the ELB is not connected to the microservices. Clearly that is not the case.
3. Mixing Physical, Logical, Network & Software Architecture
General Type of Architecture
Logical Architecture - tends to be more focussed on information flows between areas of functional responsibility
Physical Architecture - tends to be more focussed on deployable components, such as servers, VMs, containers, but I also group installable software packages here, as a running executable process may host multiple elements from the logical architecture
Specific Types of Architecture
Network Architecture - focuses on network connectivity between machines and devices - may reference VLANs, IP ranges, switches, routers etc.
Software Architecture - focuses on the internal structures of a software program design - may talk about classes, modules, packages etc.
Your diagram includes a Load Balancer (more physical) and also a separate box per microservice (could be physical or logical or software), where each microservice is responsible for a different type of operation. It is not clear if each microservice has its own load balancer, or if the load balancer is a layer 7 balancer that can map paths to different front ends.
4. Missing Context
While architectures often focus on the internal structure of a system, it is also important to consider the system context - i.e. what are the important elements outside the system that the system needs to interact with? e.g. what are the expected clients and their methods of connectivity?
5. Actual Architectural Design
While the above feedback focussed on the method of communicating your design, this is more about the actual design itself.
COTS products - did you talk about alternatives and why you selected the one you chose? Or is it just the only one you know? Awareness of the options and the ability to select the appropriate option for a given purpose is valuable.
Caching - you have caching in front of the file storage, but nothing in front of the microservices (edge cache, or front end reverse proxy) - assuming the microservices are adding some value to the process, caching their results might also be useful
Redundancy and durability of data - while you talk about resiliency to failover, data redundancy and durability of the data storage is a key requirement in something like this and some explicit reference to how that would be achieved would be useful. Note this is slightly different to availability of services.
Performance - you talk about introducing a caching layer to improve performance, but don't qualify the actual performance requirements - 100's of objects stored or retrieved per second, 1000's or millions? You need to know that in order to know what to build in.
Global Access - S3 is a multi-region/multi-datacentre solution - your architecture does not reference any aspect of multi-datacentre such as replication of the stored objects and metadata
Security - you reference requirements around AAA but your proposed solution doesn't define which component is responsible for security, and at which layer or at what point in the request path a request is verified and accepted or rejected
6. The Good
Lest this critique be thought too negative, it's worth saying that there is a lot to like in your approach - your assessment of the likely requirements is thorough, and it's great to see security, operational monitoring and SLAs considered up front.
However, reviewing this, I'd wonder what kind of job it actually was - it looks more like the application for a cloud architect role, rather than a software architect role, for which I'd expect to see more discussion of packages, modules, assemblies, libraries and software components.
All of the above notwithstanding, it's also worth considering - what is an interviewer looking for if they ask this in an interview? Nobody expects you to propose an architecture in 15 minutes that can do what has taken a team of Amazon engineers and architects many years to build and refine! They are looking for clarity of thought and expression, thoroughness of examination, logical conclusions from clearly stated assumptions, and knowledge and awareness of industry standards and practices.
Hope this is helpful, and best of luck on the job hunt!

Lamina vs Storm

I am designing a prototype realtime monitor for processing fairly large amounts (>30G/day) of streaming numeric data. I would like to write this in Clojure, as the language seems to be well suited to the kind of "Observer + state machine" system that this will probably end up as.
The two main candidates I have found for a framework are Lamina and Storm. There are also Riemann and Pulse, but the former seems to be more of a full solution than a framework, and I'd rather not commit to a final design yet; Pulse's repo looks a little unmaintained.
What I would like to know is: what kinds of data and work flows are these two projects optimised for? Storm seems to be more mature, but Lamina seems more composable and more "Clojure-like" (my background is Python, so I tend to rate this highly).
What I've found from reading online:
Storm seems to be Big Data (stream) focussed; the core is straight Java with a Clojure DSL. It appears to have pre-built handlers for a number of existing data sources.
Lamina is more a lightweight, reusable component that does the Clojure thing of coding to abstractions, meaning it can be reused as a base for other eventing systems. The data sources need to be handled in code.
Both have a useful set of aggregation/splitting/computation library functions out of the box. Lamina's graphviz integration is a nice touch.
Storm probably isn't a bad choice, but "over 30GB per day" of numeric data isn't big data, it is tiny data. Any semi-modern computer can handle that much data easily on one node with lamina. You might want to go with Storm anyway, so that once you do get into a realm where you need more servers you can scale easily, but I imagine there's some initial friction to getting Storm set up (and some ongoing friction in maintaining the cluster), which will be wasted if you never have to scale up.
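A quick back-of-envelope calculation makes the point (assuming the load were spread evenly over the day; real traffic will of course be burstier):

    gb_per_day = 30
    bytes_per_day = gb_per_day * 1024**3
    avg_bytes_per_sec = bytes_per_day / (24 * 60 * 60)
    print("%.0f KiB/s on average" % (avg_bytes_per_sec / 1024))  # roughly 364 KiB/s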
Storm incorporates cluster management and handling of failed nodes in the flow because it was designed to be sort of "like Hadoop but for streaming", which from what I understand of your requirements seems to be closer to your use case.
Lamina seems like an okay choice, but it appears to be totally lacking the killer feature of Storm--cluster computing management. A Storm cluster will take care of most of the dirty work of distributing your computation across a cluster of nodes, leaving you to just focus on your business logic so long as you fit it within the Storm framework. Lamina, from what I can see, provides a nice way to organize your computation, but then you'll have to take care of all the details of scaling that out if that's something you need.

Metrics for comparing event-based and thread-based programming models

I have been asked to compare the programming models used by two different OSs for wireless sensor networks, TinyOS (which uses an event-based model) and Contiki (which uses events internally, but offers a protothread model for application programmers). I have developed the same application in both systems, and I can present a qualitative analysis of the pros and cons of both models, and give my subjective impression.
However, I have been asked to put forward metrics for comparing them. Apart from the time spent to write the programs (which is roughly equal), I'm not sure what other metrics are applicable. Can you suggest some?
Time to understand these programs? Number of questions asked on the net about deadlocks (normalized by user base)?
I ended up using lines of code and cyclomatic complexity to show how different models impact code organization. I also estimated the difficulty of understanding the two programs by asking another programmer to read them.
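For what it's worth, this kind of measurement can be scripted. Here is a rough sketch using the third-party lizard package (pip install lizard); the file names are hypothetical, and nesC sources may need to be treated as plain C for the tool to parse them.

    import lizard

    for path in ["contiki_app.c", "tinyos_app.c"]:
        info = lizard.analyze_file(path)
        nloc = sum(f.nloc for f in info.function_list)
        print(path, "functions:", len(info.function_list), "NLOC:", nloc)
        for func in info.function_list:
            print("  ", func.name, "cyclomatic complexity:", func.cyclomatic_complexity)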

Choosing a strategy for BI module

The company I work for produces a content management system (CMS) with various add-ons for publishing, e-commerce, online printing, etc. We are now in the process of adding a "reporting module" and I need to investigate which strategy should be followed. The "reporting module" is otherwise known as Business Intelligence, or BI.
The module is supposed to be able to track item downloads and executed searches and produce various reports out of them. Actually, it is not that important what kind of data is being churned, as in the long term we might want to be able to push in whatever we think is needed and get a report out of it.
Roughly speaking, we have two options.
Option 1 is to write a solution based on Apache Solr (specifically, using https://issues.apache.org/jira/browse/SOLR-236). Pros of this approach:
free / open source / good quality
we use Solr/Lucene elsewhere so we know the domain quite well
total flexibility over what is being indexed as we could take incoming data (in XML format), push it through XSLT and feed it to Solr
total flexibility of how to show search results. Similar to step above, we could have custom XSLT search template and show results back in any format we think is necessary
our frontend developers are proficient in XSLT so fitting this mechanism for a different customer should be relatively easy
Solr offers realtime / full text / faceted search, which are absolutely necessary for us. A quick prototype (based on Solr, 1M records) was able to deliver search results in 55ms. Our estimated maximum is about 1bn rows (this isn't a lot for a typical BI app) and if worse comes to worst, we can always look at SolrCloud, etc.
there are companies doing very similar things using Solr (Honeycomb Lexicon, for example)
Cons of this approach:
SOLR-236 might or might not be stable, moreover, it's not yet clear when/if it will be released as a part of official release
there would possibly be some stuff we'd have to write to get some BI-specific features working. This sounds a bit like reinventing the wheel
the biggest problem is that we don't know what we might need in the future (such as integration with some piece of BI software, export to Excel, etc.)
Option 2 is to do an integration with some free or commercial piece of BI software. So far I have looked at Wabit and will have a look at QlikView, possibly others. Pros of this approach:
no need to reinvent the wheel, software is (hopefully) tried and tested
would save us time we could spend solving problems we specialize in
Cons:
as we are a Java shop and our solution is cross-platform, we'd have to eliminate a lot of options which are in the market
I am not sure how flexible BI software can be. It would take time to go through some BI offerings to see if they can do flexible indexing, real time / full text search, fully customizable results, etc.
I was told that open-source BI offerings are not mature enough, whereas commercial BIs (SAP, others) cost a fortune; their licenses start from tens of thousands of pounds/dollars. While I am not against a commercial choice per se, it will add to the overall price, which can easily become just too big
not sure how well BI is made to work with schema-less data
I am definitely not the best candidate to find the most appropriate integration option on the market (mainly because of my lack of knowledge in the BI area), however a decision needs to be made fast.
Has anybody been in a similar situation and could advise on which route to take, or even better - advise on possible pros/cons of the option #2? The biggest problem here is that I don't know what I don't know ;)
I have spent some time playing with both QlikView and Wabit, and, I have to say, I am quite disappointed.
I had an expectation that the whole BI industry actually has some science under it, but from what I found this is just a mere buzzword. This MSDN article was actually an eye-opener. The whole business of BI consists of taking data from well-normalized schemas (they call it OLTP), putting it into less-normalized schemas (OLAP, snowflake- or star-type) and creating indices for every aspect you want (the industry jargon for this is a data cube). The rest is just some scripting to get the pretty graphs.
OK, I know I am oversimplifying things here. I know I might have missed many different aspects (nice reports? export to Excel? predictions?), but from a computer science point of view I simply cannot see anything beyond a database index here.
I was told that some BI tools support compression. Lucene supports that, too. I was told that some BI tools are capable of keeping the whole index in memory. For that, there is the Lucene cache.
Speaking of the two candidates (Wabit and QlikView) - the first is simply immature (I got dozens of exceptions when trying to step outside of what was suggested in their demo) whereas the other only works under Windows (not very nice, but I could live with that) and the integration would likely require me to write some VBScript (yuck!). I had to spend a couple of hours on the QlikView forums just to get a simple date range control working, and failed because the Personal Edition I had did not support the downloadable demo projects available on their site. Don't get me wrong, they're both good tools for what they have been built for, but I simply don't see the point of doing an integration with them as I wouldn't gain much.
To address the (arguable) immaturity of Solr, I will define an abstract API so I can move all the data to a database which supports full-text queries if anything goes wrong. And if worse comes to worst, I can always write stuff on top of Solr/Lucene if I need to.
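As an illustration of the "push events in, then facet over them" usage meant above, here is a minimal sketch with the pysolr client against a local core; the core name "events" and the field names are made up, and in practice the abstract API and schema would sit in front of this.

    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/events", always_commit=True)

    # index a couple of tracked events (downloads, searches, ...)
    solr.add([
        {"id": "1", "type": "download", "item": "catalogue.pdf", "country": "UK"},
        {"id": "2", "type": "search", "query": "winter shoes", "country": "DE"},
    ])

    # full-text / faceted query: counts per event type
    results = solr.search("*:*", **{"facet": "on", "facet.field": "type"})
    print(results.hits, "hits")
    print(results.facets["facet_fields"]["type"])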
If you're truly in a scenario where you're not sure what you don't know, I think it's best to explore an open-source tool and evaluate its usefulness before diving into your own implementation. It could very well be that using the open-source solution will help you further crystallise your own understanding and required features.
I have previously worked with an open-source solution called Pentaho. I seriously felt that I understood a whole lot more by learning to use Pentaho's features for my own ends. Of course, as is the case with most open-source solutions, Pentaho seemed a bit intimidating at first, but I managed to get a good grip on it in a month's time. We also worked with the Kettle ETL tool and Mondrian cubes - which I think most of the serious BI tools these days build on top of.
Earlier, all these components were independent, but of late I believe Pentaho has taken ownership of all these projects.
But once you're confident with what you need and what you don't, I'd suggest building some basic reporting tool of your own on top of a Mondrian implementation. Customising a sophisticated open-source tool can indeed be a big issue. Besides, there are licenses to be wary of. I believe Pentaho is GPL, though you might want to check on that.
First you should make clear what your reports should show. Which reporting features do you need? Which output formats do you want? Do you want to show them in the browser (HTML), as PDF, or with an interactive viewer (Java/Flash)? Where is the data (database, Java, etc.)? Do you need ad-hoc reporting or only some hard-coded reports? These are only some of the questions.
Without answers to these questions it is difficult to give a real recommendation, but my general recommendation would be i-net Clear Reports (used to be called i-net Crystal-Clear). It is a Java tool. It is a commercial tool, but the cost is lower than SAP and co.

Relation between language and scalability

I came across the following statement in Trapexit, an Erlang community website:
Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability.
Also I recall reading somewhere that Twitter switched from Ruby to Scala to address scalability problem.
Hence, I wonder: what is the relation between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling etc. Is it because of the way a language is implemented, the libraries, or some other reasons?
Hope for enlightenment. Thanks.
Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would of course be possible in other languages too. In C++, you can reload DLLs on the fly and load plugins. In Python you can reload modules. In C#, you can load code on the fly, use reflection and so on.
It's just that that functionality is built in to Erlang, which means that:
it's more standard: any Erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, some are native compiled, so the performance, and the availability of type information and so on at runtime differs.
Python has a global interpreter lock around its runtime, so a single process cannot make use of SMP for CPU-bound Python code.
Erlang only recently had changes added to take advantage of SMP.
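A small experiment makes the GIL point visible (CPython only; this is an illustration, not a benchmark): CPU-bound work in four threads takes roughly as long as doing it serially, while four processes can actually use four cores.

    import multiprocessing, threading, time

    def burn(n=10_000_000):
        # pure CPU-bound busy work
        while n:
            n -= 1

    def timed(label, workers):
        start = time.time()
        for w in workers: w.start()
        for w in workers: w.join()
        print(label, round(time.time() - start, 2), "s")

    if __name__ == "__main__":
        timed("threads  ", [threading.Thread(target=burn) for _ in range(4)])
        timed("processes", [multiprocessing.Process(target=burn) for _ in range(4)])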
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say I feel architecture and design are pretty fundamental to scalability and choosing one language over another will not I feel magically give awesome scalability...
Erlang comes from another culture in thinking about reliability and how to achieve it. Understanding the culture is important, since Erlang code does not become fault-tolerant by magic just because it's Erlang.
A fundamental idea is that high uptime does not come only from a very long mean time between failures; it also comes from a very short mean time to recovery when a failure does happen.
One then realizes that one needs automatic restarts when a failure is detected, and that at the first detection of something not being quite right one should "crash" to cause a restart. The recovery needs to be optimized, and the possible information loss needs to be minimal.
This strategy is followed by much successful software, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to consider only the mean time between failures, send messages about error indications to the system log, and then try to keep on running until it is not possible anymore, typically requiring a human to monitor the system and reboot it manually.
Most of these strategies come in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first is a bi-directional contract that says "if you crash, then I get your crash message, which if not trapped will crash me", and the second says "if you crash, I get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
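As a toy illustration of the bare restart idea (written in Python for familiarity; it is nothing like a real OTP supervisor, which adds restart strategies, intensity limits and escalation through the tree):

    import multiprocessing, time

    def worker():
        time.sleep(1)
        # fail fast at the first sign that something is not quite right
        raise RuntimeError("crash and let the supervisor restart me")

    def supervise(max_restarts=3):
        for attempt in range(max_restarts):
            p = multiprocessing.Process(target=worker)
            p.start()
            p.join()
            if p.exitcode == 0:
                return  # worker finished normally
            print("worker died (exit %s), restart %d" % (p.exitcode, attempt + 1))
        # give up and escalate, as a supervisor higher up the tree would
        raise RuntimeError("restart limit reached, escalating")

    if __name__ == "__main__":
        supervise()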
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang go by many names: "fail fast", "crash-only systems", "recovery-oriented programming", "expose errors", "micro-restarts", "replication", ...
Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy as it is a language designed specially for that purpose.
In short, the "language" primarily affects the vertical axii of scaling but not all aspects as you already eluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we will be discussing this point ad vitam aeternam. Using my proposed definition, it becomes easier to compare system implementations. For a system to be scalable (read: profitable), then:
Scalability grows with S
2) A system can be made to scale based on 2 primary axis:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation i.e. bigger server, more RAM etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involved, since it requires dealing with real-world properties such as the speed of light (latency), tolerance to partition, failures of many kinds, etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times folks confuse a language with its libraries and implementation. These are all different things. What makes a language a good fit for a particular system often has more to do with the support around said language: libraries, development tools, efficiency of the implementation (i.e. memory footprint, performance of built-in functions, etc.).
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.
First you have to distinguish between languages and their implementations. For instance, the Ruby language supports threads, but in the official implementation the threads will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithreading) AND when it exhibits a good speedup as the number of CPUs goes up (see Amdahl's Law).
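For reference, Amdahl's Law says that if a fraction p of the work can be parallelised, the speedup on n CPUs is 1 / ((1 - p) + p / n), so the serial fraction caps the benefit no matter how many CPUs you add:

    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    # even with 95% of the work parallelisable, the speedup is capped at 20x
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))   # 1.9, 5.93, 15.42, 19.64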
Some languages like Erlang, Scala, Oz etc. also have syntax (or nice libraries) which help in writing clear and nice parallel code.
In addition to the points made here about Erlang (Which I was not aware of) there is a sense in which some languages are more suited for scripting and smaller tasks.
Languages like Ruby and Python have some features which are great for prototyping and creativity but terrible for large-scale projects. Arguably their best feature is their lack of "formality", which hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like Java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where as a consumer of a class you can't easily tell what kind of data it's holding. This kind of thing gets compounded as systems get larger. You can also do things that are really difficult to trace, like dynamically adding bits to classes at runtime (which can be fun, but is a nightmare if you're trying to figure out where a piece of data comes from) or calling methods that raise exceptions without being forced by the compiler to declare them. Not that you couldn't solve these kinds of things with good design and disciplined programming - it's just harder to do.
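A toy example of the "dynamically add bits to classes at runtime" point (the Order class is hypothetical, purely for illustration):

    class Order:
        def __init__(self, total):
            self.total = total

    # somewhere far from the class definition, possibly in another module:
    def apply_discount(self, pct):
        self.total *= (1 - pct)

    Order.apply_discount = apply_discount
    Order.rush = True           # a new class attribute appears out of nowhere

    o = Order(100.0)
    o.apply_discount(0.1)
    print(o.total, Order.rush)  # 90.0 True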
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the issues of the messiness, lack of typing and global variables by being very strict and careful with coding and naming conventions ( in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.
Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.
Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provide the structure to make it manageable - both technically and in your head. For instance: without the isolation provided by Erlang processes, you will shoot yourself in the foot; with the link/monitor mechanism you can react to failures sooner.

Resources