In a phone interview at a financial firm for a software architect position, I was asked to "design a cloud storage system like AWS S3".
Here is what I answered. Would you please share your critiques and comments on my approach? I would like to improve based on your feedback.
First, I listed requirements:
- CRUD Microservices on objects
- Caching layer to improve performance
- Deployment on PaaS
- Resiliency with failover
- AAA support (authorization, auditing, accounting/billing)
- Administration microservices (user, project, lifecycle of object, SLA dashboard)
- Metrics collection (Ops, Dev)
- Security for service endpoints for admin UI
Second, I defined basic APIs:
https://api.service.com/services/get - arguments: object id, metadata; returns: binary object
https://api.service.com/services/upload - arguments: object; returns: object id
https://api.service.com/services/delete - arguments: object id; returns: success/error
https://api.service.com/services/update-meta - arguments: object id, metadata; returns: success/error
Third, I drew the architecture on the whiteboard, including some COTS components I could use. Below is the picture.
The interviewer did not ask me many questions, so I am a bit worried about whether I am on the right track with my process. Please provide your feedback.
Thanks in advance.
There are a few areas of feedback that might be helpful:
1. Comparison with S3's API
The S3 API is a RESTful API these days (it used to support SOAP) and it represents each 'file' (really a blob of data indexed by a key) as an HTTP resource, where the key is the path in the resource's URI. Your API is more RPC, in that each HTTP resource represents an operation to be carried out and the key to the blob is one of the parameters.
Whether this is a good or a bad thing depends on what you're trying to achieve and what architectural style you want to adopt (I am a fan of REST, but that doesn't mean you have to adopt it for all applications). However, since you were asked to design a system like S3, your answer would have benefited from a clear argument as to why you chose NOT to use REST as S3 does.
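To make the distinction concrete, here is a minimal sketch in Python contrasting the two styles. The hostnames and object keys are purely illustrative, not the real S3 endpoints or your actual services:

```python
import requests

# RPC style (as in your API): the operation lives in the URL path and the
# object key is just a parameter.
requests.get("https://api.service.com/services/get",
             params={"object_id": "reports/2020/q1.pdf"})

# Resource style (roughly what S3 does): the key *is* the path, and the
# HTTP verb carries the operation.
base = "https://my-bucket.storage.example.com"
requests.put(f"{base}/reports/2020/q1.pdf", data=b"...object bytes...")
requests.get(f"{base}/reports/2020/q1.pdf")
requests.delete(f"{base}/reports/2020/q1.pdf")
```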
2. Lines connecting things
Architecture diagrams often tend to be very high level - which is appropriate - but there is sometimes a tendency to just draw lines between boxes without being clear about what those lines mean. Does a line mean there is a network connection between the infrastructure hosting those software components? Does it mean there is an information or data flow between those components?
When you draw a line like the one in your diagram, with multiple boxes all joining onto it, the implication is that there is some relationship between the boxes. When you add arrows, there is the further implication that the relationship follows the direction of the arrows. But there is no clarity about what that relationship is, or why the directionality is important.
One could infer from your diagram that the Memcache Cluster and the File Storage cluster are both sending data to the Metrics/SLA portal, but that they are not sending data to each other. Or that the ELB is not connected to the microservices. Clearly that is not the case.
3. Mixing Physical, Logical, Network & Software Architecture
General Types of Architecture
Logical Architecture - tends to be more focussed on information flows between areas of functional responsibility
Physical Architecture - tends to be more focussed on deployable components, such as servers, VMs, containers, but I also group installable software packages here, as a running executable process may host multiple elements from the logical architecture
Specific Types of Architecture
Network Architecture - focuses on network connectivity between machines and devices - may reference VLANs, IP ranges, switches, routers etc.
Software Architecture - focuses on the internal structures of a software program design - may talk about classes, modules, packages etc.
Your diagram includes a Load Balancer (more physical) and also a separate box per microservice (could be physical, logical, or software), where each microservice is responsible for a different type of operation. It is not clear if each microservice has its own load balancer, or if the load balancer is a layer 7 balancer that can map paths to different front ends.
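For illustration, a single layer 7 balancer could hold a route table mapping path prefixes to the individual microservices rather than one balancer per service; here is a toy Python sketch, where the route paths and internal hostnames are made up:

```python
# Hypothetical route table for a single layer 7 balancer: one entry per
# microservice, keyed by URL path prefix.
ROUTES = {
    "/objects/get":    "http://get-service.internal",
    "/objects/upload": "http://upload-service.internal",
    "/objects/delete": "http://delete-service.internal",
    "/objects/meta":   "http://metadata-service.internal",
}

def pick_backend(request_path: str) -> str:
    # Forward the request to whichever backend owns the longest matching prefix.
    for prefix, backend in sorted(ROUTES.items(), key=lambda kv: -len(kv[0])):
        if request_path.startswith(prefix):
            return backend
    raise LookupError(f"no backend registered for {request_path}")
```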
4. Missing Context
While architectures often focus on the internal structure of a system, it is also important to consider the system context - i.e. what are the important elements outside the system that the system needs to interact with? E.g. what are the expected clients and their methods of connectivity?
5. Actual Architectural Design
While the above feedback focussed on the method of communicating your design, this point is more about the actual design.
COTS products - did you talk about alternatives and why you selected the one you chose? Or is it just the only one you know? Awareness of the options, and the ability to select the appropriate one for a given purpose, is valuable.
Caching - you have caching in front of the file storage, but nothing in front of the microservices (edge cache, or front end reverse proxy) - assuming the microservices are adding some value to the process, caching their results might also be useful
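As a rough illustration only (a real deployment would more likely use an edge cache or a front-end reverse proxy), caching a microservice's responses can be as simple as a TTL cache around the call. The function and TTL below are invented for the sketch:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=30):
    """Tiny in-process response cache, purely to illustrate the idea."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(key):
            hit = store.get(key)
            if hit is not None and time.time() - hit[1] < ttl_seconds:
                return hit[0]                    # serve the cached result
            value = fn(key)                      # fall through to the real service
            store[key] = (value, time.time())
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=30)
def get_object_metadata(object_id):
    # stand-in for the real call to the metadata microservice
    return {"id": object_id, "size": 1024}
```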
Redundancy and durability of data - while you talk about resiliency to failover, data redundancy and durability of the data storage is a key requirement in something like this and some explicit reference to how that would be achieved would be useful. Note this is slightly different to availability of services.
Performance - you talk about introducing a caching layer to improve performance, but don't qualify the actual performance requirements - hundreds of objects stored or retrieved per second, thousands, or millions? You need to know that in order to know what to build.
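A quick back-of-envelope calculation shows why; every number below is an assumption, just to show how much the target changes the design:

```python
requests_per_second = 100_000      # assumed peak PUT/GET rate
avg_object_size_mb = 1             # assumed average object size
node_throughput_mb_s = 1_000       # assumed sustained throughput per storage node

ingest_mb_per_second = requests_per_second * avg_object_size_mb
nodes_for_bandwidth = ingest_mb_per_second / node_throughput_mb_s
print(f"{ingest_mb_per_second} MB/s ingest -> ~{nodes_for_bandwidth:.0f} nodes just for bandwidth")
```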
Global Access - S3 is a multi-region/multi-datacentre solution - your architecture does not reference any aspect of multi-datacentre such as replication of the stored objects and metadata
Security - you reference requirements around AAA but your proposed solution doesn't define which component is responsible for security, and at which layer or at what point in the request path a request is verified and accepted or rejected
6. The Good
Lest this critique be thought too negative, it's worth saying that there is a lot to like in your approach - your assessment of the likely requirements is thorough, and it's great to see security, operational monitoring and SLAs considered up front.
However, reviewing this, I'd wonder what kind of job it actually was - it looks more like the application for a cloud architect role, rather than a software architect role, for which I'd expect to see more discussion of packages, modules, assemblies, libraries and software components.
All of the above notwithstanding, it's also worth considering - what is an interviewer looking for if they ask this in an interview? Nobody expects you to propose an architecture in 15 minutes that can do what has taken a team of Amazon engineers and architects many years to build and refine! They are looking for clarity of thought and expression, thoroughness of examination, logical conclusions from clearly stated assumptions, and knowledge and awareness of industry standards and practices.
Hope this is helpful, and best of luck on the job hunt!
Related
Given the following processes:
- Manually transforming huge .csv files via rules (using MS Excel or Excel-like software) and sharing them via FTP
- Scripts (usually written in Perl or Python) which basically transform data, preparing it for other processes
- APIs batch-reading from files or other origin sources and updating their corresponding data model
- Spring Boot deployments used (or abused) in part to regularly collect and aggregate data from files or other sources
And given these problems / areas for improvement:
Standardization: I'd like (as far as it makes sense) to propose a unified, powerful tool that natively deals with these types of (kind-of-big) data transformation workflows.
Raising the abstraction level of the processes (related to the point above): many of the "tasks/jobs" I mentioned above are seen by the teams using them in a very technical, low-level, task-like way. I believe a higher-level view of these processes/flows, highlighting their business meaning, would help self-document them better and would also help establish a ubiquitous language that different stakeholders can refer to and reason about unambiguously.
IO bottlenecks and resource utilization (technical): some of those processes fail more often than would be desirable (or take a very long time to finish) due to memory or network bottlenecks. Though it is clear that hardware has limits, resource utilization doesn't seem to have been a priority in many of these data transformation scripts.
Do the Dataflow model, and specifically the Apache Beam implementation paired with either Flink or Google Cloud Dataflow as a backend runner, offer a proven solution to those "mundane" topics? The material on the internet mainly focuses on the unified streaming/batch model and typically covers more advanced features like streaming, event windowing, watermarks, late events, etc., which do look very elegant and promising indeed, but I have some concerns about tool maturity and long-term community support.
It's hard to give a concrete answer to such a broad question, but I would say that, yes, Beam/Dataflow is a tool that handles this kind of thing. Even though the documentation focuses on "advanced" features like windowing and streaming, lots of people are using it for more "mundane" ETL. For questions about tool maturity and community, you could consider sources like Forrester reports, which often cover Dataflow.
You may also want to consider pairing it with other technologies like Airflow/Composer.
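As a rough illustration of the "mundane" batch case, a Beam pipeline that reads a CSV, transforms rows and writes the result is only a few lines. The file layout and column names here are assumptions for the sketch:

```python
import apache_beam as beam

def parse_row(line):
    # assumes a simple "id,amount" layout, purely for illustration
    row_id, amount = line.split(",")
    return {"id": row_id, "amount": float(amount)}

# The runner (DirectRunner, Flink, Google Cloud Dataflow, ...) is chosen via
# pipeline options, so the same code runs locally or on a cluster.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read"      >> beam.io.ReadFromText("input.csv", skip_header_lines=1)
        | "Parse"     >> beam.Map(parse_row)
        | "KeepLarge" >> beam.Filter(lambda row: row["amount"] > 100)
        | "Format"    >> beam.Map(lambda row: f"{row['id']},{row['amount']}")
        | "Write"     >> beam.io.WriteToText("output")
    )
```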
I am using the SIU^S12 message and I need to indicate the financial entity (IN1 segment). But IN1 is not allowed in the SIU^S12 message. Has anyone ever experienced this?
You're right that financial information is not included in the SIU^S12 message, nor is it in the entire HL7v2 scheduling domain. I can't really attest to the "why" behind this, but only share my experience in the US domain.
From a very high level, in the United States, scheduling HL7v2 interfaces are used almost exclusively before the patient arrives, and ADT (HL7v2 Chapter 3) is used almost exclusively once the patient arrives. This is not ideal, and sometimes bears extra licensing costs in terms of getting two HL7v2 interfaces.
From a design perspective, I think it makes sense to have a separation of concerns - SIU^S12 is chiefly concerned with scheduling resources, even treating patients as resources. SIU^S12 can have multiple patients in its schema, whereas ADT^A01 must have exactly one. While it would be possible to attach GT1 and IN1/2/3 to each potential patient in SIU, it's easier to wrap your head around when only one patient is in play.
From a workflow perspective, insurance/payment information is typically verified with the patient in person once they arrive, so the majority of use cases won't need (or, more importantly, trust) insurance information were it to be sent by a scheduling system.
I've read a lot about using GraphQL as an API gateway for the front end in front of microservices.
But I wonder whether all the GraphQL advantages over REST aren't relevant to communication between the microservices as well.
Any inputs, pros/cons and successful usage examples will be appreciated.
Key notes to consider:
GraphQL isn't a magic bullet, nor is it "better" than REST. It is just different.
You can definitely use both at the same time, so it is not either/or.
Per specific use, GraphQL (or REST) can be anywhere on the scale of great to horrible.
GraphQL and REST aren't exact substitutes:
GraphQL is a query language, specification, and collection of tools, designed to operate over a single endpoint via HTTP, optimizing for performance and flexibility.
REST is an Architectural Style / approach for general communication utilizing the uniform interface of the protocols it exists in.
Some reasons for avoiding a common use of GraphQL between microservices:
GraphQL is mainly useful when the client needs a flexible response it can control without requiring changes to the server's code (see the sketch after this list).
When you grant the client service control over the data that comes in, it can lead to exposing too much data, hence compromising encapsulation on the serving service. This is a long-term risk to system maintainability and the ability to change.
Between microservices, latency is far less of an issue than between client and server, and the same goes for the aggregation capabilities.
A uniform interface is really useful when you have many services - but GraphQL may be counter-productive for that cause.
The flexible queries GraphQL allows can be more challenging to optimize for performance.
Updating a hierarchy of objects at once (GraphQL's natural structure) may add complexity around atomicity, idempotency, error reporting, etc.
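A small sketch of the first point above, with made-up internal endpoints and fields: with GraphQL the calling service decides which fields come back, whereas with REST the serving service fixes the response shape.

```python
import requests

# GraphQL: the caller picks exactly the fields it wants.
query = """
{
  user(id: "42") {
    name
    orders { total }
  }
}
"""
requests.post("https://users-service.internal/graphql", json={"query": query})

# REST: the serving service decides what a "user" representation contains.
requests.get("https://users-service.internal/users/42")
```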
To recap:
GraphQL can be really great for server-to-server communication, but most likely it will be a good fit in only a small percentage of use cases.
Do you have a use-case for an API Gateway between services? Maybe this is the question you should ask yourself. GraphQL is just a (popular) tool.
Like always, it is best to match a problem to a tool.
I don't have experience with using GraphQL in a microservices environment, but I'm inclined to think it's not the greatest fit for microservices.
To add a little more color to Lior Bar-On's answer, GraphQL is more of a query language and is more dynamic in nature. It is often used to aggregate data sets as the result of a single request, which in turn will potentially require many requests to many services in a microservices environment. At the same time, it also adds the complexity of having to translate the gathering of information from the respective sources of that information (other microservices). Of course, how complex this gets depends on how micro your services are and what queries you intend to support.
On the other hand, I think a monolith that uses an MVC architecture may actually have the upper hand, because it owns a larger body of data that it can query.
I am wondering whether it is possible to apply an agent/actor library (Akka, Orbit, Quasar, JADE, Reactors.io) in a Function-as-a-Service environment (OpenWhisk, AWS Lambda).
Does it make sense?
If yes, what is a minimal example that presents the added value (which is missing when we use only FaaS or only an actor/agent library)?
If no, can we construct a decision graph that helps us decide whether, for a given problem, we should use an actor/agent library or FaaS (or something else)?
This is more of an opinion-based question, but I think that in their current shape there's no sense in putting actors into FaaS - the opposite actually works quite well: OpenWhisk is implemented on top of Akka.
There are several reasons:
FaaS in its current form is inherently stateless, which greatly simplifies things like request routing. Actors are stateful by nature (see the sketch after this list).
In my experience FaaS functions are usually disjointed - of course you need some external resources, but this is the mental model: generic resources and capabilities. In actor models we tend to think in terms of particular entities represented as actors, i.e. the user Max rather than a table of users. I'm not covering here the case of using actors solely as a unit of concurrency.
FaaS applications have a very short lifespan - this is one of their founding principles. Since creation, placement and state recovery for more complex actors may take a while, and you usually need a lot of them to perform a single task, you may end up at a point where restoring the state of the system takes more time than actually performing the task that state is needed for.
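A minimal sketch of the stateless/stateful contrast from the first point, not tied to any specific FaaS provider or actor framework (all names below are made up):

```python
USERS = {"42": {"name": "Max"}}  # stand-in for an external data store

# FaaS-style: a stateless handler. Everything it needs comes from the event
# and external resources, so any instance can serve any invocation.
def handle_event(event):
    user = USERS[event["user_id"]]
    return {"greeting": f"Hello, {user['name']}"}

# Actor-style: a long-lived entity that keeps state between messages, so the
# runtime has to route every message for this entity to this same instance.
class UserActor:
    def __init__(self, user_id):
        self.user_id = user_id
        self.messages_seen = 0   # state that survives across messages

    def receive(self, message):
        self.messages_seen += 1
        return f"{message} (call #{self.messages_seen} for user {self.user_id})"
```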
That being said, it's possible that those two approaches will eventually converge, but that needs to be accompanied by changes in both the mental and the infrastructural model (i.e. actors live in a runtime, which the FaaS platform must be aware of). IMO, setting up existing actor frameworks on top of existing FaaS providers is not feasible at this point.
Basically, after hours of research I still don't get what the Unifying Logic layer within the Semantic Web Stack model is, and whose problem it is to take care of it.
I think this depends on what your conceptualisation of the semantic web is. Suppose the ultimate expression of the semantic web is to make heterogeneous information sources available via web-like publishing mechanisms to allow programs - agents - to consume them in order to satisfy some high-level user goal in an autonomous fashion. This is close to Berners-Lee et al's original conceptualisation of the purpose of the semantic web. In this case, the agents need to know that the information they get from RDF triple stores, SPARQL end-points, rule bases, etc, is reliable, accurate and trustworthy. The semantic web stack postulates that a necessary step to getting to that end-point is to have a logic, or collection of logics, that the agent can use when reasoning about the knowledge it has acquired. It's rather a strong AI view, or well towards that end of the spectrum.
However, there's an alternative conceptualisation (and, in fact, there are probably many) in which the top layers of the semantic web stack, including unifying logic, are not needed, because that's not what we're asking agents to do. In this view, the semantic web is a way of publishing disaggregated, meaningful information for consumption by programs but not autonomously. It's the developers and/or the users who choose, for example, what information to treat as trustworthy. This is the linked data perspective, and it follows that the current stack of standards and technologies is perfectly adequate for building useful applications. Indeed, some argue that even well-established standards like OWL are not necessary for building linked-data applications, though personally I find it essential.
As to whose responsibility it is, if you take the former view it's something the software agent community is already working on, and if you take the latter view it doesn't matter whether something ever gets standardised because we can proceed to build useful functionality without it.