Two questions about terminology in computer clusters [closed] - cluster-computing

I am a layman in computer science, and I have some questions about computer cluster terminology.
A cluster has 300 nodes.
Does it mean the cluster has 300 computers?
The CPU cores use hyperthreading, so there can be a total of 16 threads running simultaneously.
What are hyperthreading and threads here? A stream of data or of logic?

Hyperthreading is an Intel technology for managing simultaneous multithreading. It doesn't necessarily have anything to do with cluster computing.
A thread is a single running task that can be executed asynchronously from everything else. So, depending on what kind of processor your computer has, it may be able to execute a different number of tasks concurrently.
Meanwhile, nodes in clusters are usually separate computers. However, they can be virtual machines as well. In some contexts I have seen a single processor core regarded as a node too (in that case one machine with 2 cores would count as 2 nodes). This depends on the framework and on what exactly you're doing.
But usually with cluster computing you have a given task that can be scaled out, and every entity that performs it is called a node.
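To make the distinction concrete, here is a minimal sketch (Python is used only for illustration) that reports how many hardware threads the operating system exposes and then runs a few software threads concurrently:

    import os
    import threading

    # Number of logical CPUs visible to the OS; with hyperthreading this is
    # typically twice the number of physical cores (e.g. 8 cores -> 16 threads).
    print("logical CPUs:", os.cpu_count())

    def task(name):
        # A thread is just an independently scheduled task; each one here
        # does a trivial piece of work.
        total = sum(i * i for i in range(100_000))
        print(f"thread {name} finished, result={total}")

    threads = [threading.Thread(target=task, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()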

Related

What are the scalability benefits of splitting a monolithic codebase into microservices? [closed]

Let's say that Facebook runs as a single binary. Publishing N copies of this service on N servers splits the workload evenly, fine. Now, if I split the Facebook codebase in half, how exactly is it more scalable? (in the sense of y-axis scaling from this article)
If I allocate 1 server for the first half and 2 servers for the second half, it will certainly be faster than one monolithic server, because now we have 3. That's exactly like x-axis scaling, only now with uneven load balancing.
But consider servers that are 25% of the original size. Right from startup, these servers have a higher percentage of used RAM, because splitting the code in half doesn't imply halving the RAM footprint. Each server will waste more RAM on duplicated library code, etc.
I wonder whether there is any benefit to using microservices from this performance/computing-resource perspective.
A microservices architecture allows developers to create separate components of an application by building the application from a combination of small services...
Now, if we have a monolithic application and identify that one specific feature receives more traffic/requests than the others, we need to scale the whole application, including the parts with little or zero traffic; and if that part of the application crashes, the whole application crashes! In a microservices architecture every small feature can be deployed as a small microservice, so we can scale only the part of the system that we need, and even if that service crashes, all the other parts of the system stay alive... This scenario shows us two benefits: isolation and scalability.
When you think about RAM consumption, you need to take the scenario above into consideration: we can spawn a service only when we need it, and when we don't need it we can reduce it to a lower number of instances, which yields gains from a performance/computing-resource perspective... Looking at storage and still thinking about footprint, a monolithic application may use some libraries for one specific group of features while another group of features uses other libraries; here the problem remains: in a monolithic application we may end up scaling a lot of libraries and resources that we don't actually need to scale.
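To make that resource argument concrete, here is a back-of-the-envelope sketch; every number in it is an illustrative assumption, not a measurement:

    # Illustrative assumptions: RAM needed per instance, in GB.
    LIB_OVERHEAD_GB = 0.5    # duplicated libraries/runtime carried by every instance
    MONOLITH_CODE_GB = 3.5   # application code + data for the full monolith
    HOT_FEATURE_GB = 1.0     # the one feature that receives most of the traffic
    OTHER_FEATURES_GB = 2.5   # everything else

    hot_instances = 10       # replicas needed to handle the hot feature's traffic

    # Monolith: every replica carries the whole application.
    monolith_total = hot_instances * (MONOLITH_CODE_GB + LIB_OVERHEAD_GB)

    # Microservices: only the hot service is replicated; the rest runs once.
    micro_total = (hot_instances * (HOT_FEATURE_GB + LIB_OVERHEAD_GB)
                   + (OTHER_FEATURES_GB + LIB_OVERHEAD_GB))

    print(f"monolith:      {monolith_total:.1f} GB")   # 40.0 GB
    print(f"microservices: {micro_total:.1f} GB")      # 18.0 GB

Even though the split services duplicate some library overhead, scaling only the hot service uses far less total RAM than replicating the whole monolith, which is the y-axis-scaling benefit in resource terms.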
If we think about the life cycle of an application, developing new features can be achieved by developing or updating small services, so another benefit of the microservice architecture is that each service is easy to understand compared to an entire monolithic app. The microservice approach also lets developers choose the right tools for the right task: they can build each service using whatever language or framework they need without affecting communication with the other microservices. This scenario shows us two more benefits: productivity and flexibility.

Dealing with tons of queries and avoiding duplicates [closed]

My project involves concurrency and database management, meaning that I have to edit a database simultaneously from all threads. To be more specific, I am reading a row from the database and then inserting a row to mark that I grabbed it. This could work with transactions, but since I will be running this program on multiple machines, I will have a different database connection on each one. Is there a better way for me to accomplish this task?
Applying optimistic concurrency using transactions and a version field/column (a timestamp, a timestamp plus an actual version number that just increases, or some other versioning mechanism) is a must here.
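As a minimal sketch of what claiming a row with a version check can look like (SQLite is used only for illustration, and the jobs/id/version/claimed names are hypothetical):

    import sqlite3

    def try_claim(conn, row_id, seen_version):
        # Attempt to claim a row using optimistic concurrency.
        # The UPDATE only succeeds if nobody else has bumped the version
        # (or claimed the row) since we read it.
        cur = conn.execute(
            "UPDATE jobs SET claimed = 1, version = version + 1 "
            "WHERE id = ? AND version = ? AND claimed = 0",
            (row_id, seen_version),
        )
        conn.commit()
        # rowcount == 0 means another machine updated the row first,
        # so the caller should skip this row or retry with a fresh read.
        return cur.rowcount == 1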
But since you are doing this on different machines, it's possible that a substantial number of repeated failed transactions will occur.
To prevent this, you could use a queuing mechanism: a dispatcher program reads the unprocessed records from the database and dispatches them to workers, using a queue or a job dispatcher. Each worker then takes an id from that queue and processes it in a transaction.
This way:
if a transaction fails, the dispatcher can queue it again
if a worker goes down, the other workers continue (noticing that it went down is a matter of monitoring)
workers can easily scale out, and new workers can be added at any time (as long as your database is not your bottleneck)
A request/reply scheme would work best in this case to prevent queue congestion. I've used NATS successfully (and happily) for similar cases. Of course you could use another tool of your choice, but remember that you have to take care of the request/reply part. Just throwing things at queues does not solve every problem, and the amount of queued work should be controlled!
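As a rough, in-process illustration of the dispatcher/worker shape (a real deployment would replace the in-memory queue with NATS, a job dispatcher, or similar, and the processing step here is only a placeholder):

    import queue
    import threading

    work_queue = queue.Queue(maxsize=100)  # bounded, so the amount of queued work stays controlled

    def dispatcher(unprocessed_ids):
        # Reads unprocessed record ids (here just an iterable) and hands them to workers.
        for row_id in unprocessed_ids:
            work_queue.put(row_id)

    def worker(name):
        while True:
            row_id = work_queue.get()
            try:
                # A real worker would wrap the read + insert in one database
                # transaction (e.g. the optimistic-concurrency claim above).
                print(f"worker {name} processing row {row_id}")
            finally:
                work_queue.task_done()

    for i in range(3):
        threading.Thread(target=worker, args=(i,), daemon=True).start()

    dispatcher(range(10))
    work_queue.join()  # wait until every queued id has been handled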

Which are the cons of a purely stream-based architecture against a Lambda architecture? [closed]

Disclaimer: I'm not a real-time architectures expert, I'd like only to throw a couple of personal considerations and evaluate what others would suggest or point out.
Let's imagine we'd like to design a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we would need a batch processing layer (e.g. Hadoop), continuously recomputing views from a dataset of all the data, and a so-called speed layer (e.g. Storm) that constantly processes a subset of the views (made from the events coming in after the last full recomputation of the batch layer). You query your system by merging the results of the two together.
The rationale behind this choice makes perfect sense to me, and it's a combination of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable timestamped facts makes the system resilient to human errors in computing the views (if you make an error, you just fix it and recompute the views in the batch layer) and enables the system to answer virtually any query that might come up in the future. Also, such a datastore would only need to support random reads and batch inserts, whereas the datastore for the speed/real-time part would need to support efficient random reads and random writes, increasing its complexity.
My objection/trigger for a discussion here is that, in certain scenarios, this approach might be overkill. For the sake of discussion, assume we make a couple of simplifications:
Let's assume that in our analytics system we can define beforehand an immutable set of use-cases/queries that our system needs to be able to serve, and that they won't change in the future.
Let's assume that we have a limited amount of resources (engineering power, infrastructure, etc.) to implement it. Storing the whole set of elementary events coming into our system, instead of only precomputed views/aggregates, may just be too expensive.
Let's assume that we successfully minimize the impact of human mistakes (...).
The system would still need to be scalable and handle ever-increasing traffic and data.
Given these observations, I'd like to know what would stop us from designing a fully stream-oriented architecture. What I imagine is an architecture where the events (e.g. page views) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of such streams directly update the needed views through random writes/updates to a NoSQL database (e.g. MongoDB).
At a first approximation, it looks to me like such an architecture could scale horizontally. Storm can be clustered, and Kinesis's expected QoS can also be reserved upfront. More incoming events would mean more stream consumers, and since they are totally independent, nothing stops us from adding more. Regarding the database, sharding it with a proper policy would let us distribute the increasing number of writes across an increasing number of shards. To keep reads from being affected, each shard could have one or more read replicas.
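As a sketch of what one such consumer could look like (the connection details, collection name, and event format are all hypothetical, and pymongo is assumed to be available), each page-view event is folded directly into a precomputed per-page counter:

    from pymongo import MongoClient

    # Hypothetical connection details and collection name.
    views = MongoClient("mongodb://localhost:27017").analytics.page_views

    def consume(event_stream):
        # event_stream stands in for a Kinesis shard iterator or a
        # Storm/RabbitMQ feed; each event is assumed to look like {"page": "/home"}.
        for event in event_stream:
            # Random write/update: increment the counter for this page,
            # creating the document if it does not exist yet.
            views.update_one({"page": event["page"]},
                             {"$inc": {"count": 1}},
                             upsert=True)

    # Tiny in-memory "stream" for illustration:
    consume([{"page": "/home"}, {"page": "/about"}, {"page": "/home"}])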
In terms of reliability, Kinesis promises to reliably store your messages for up to 24 hours, and a distributed RabbitMQ (or whatever queue system of your choice) with proper use of acknowledgement mechanisms could probably satisfy the same requirement.
Amazon's documentation on Kinesis deliberately (I believe) avoids locking you into a specific architectural solution, but my overall impression is that they would like to push developers to simplify the Lambda architecture and arrive at a fully stream-based solution similar to the one I've described.
To be slightly more compliant with the Lambda architecture requirements, nothing stops us from having, in parallel with the consumers constantly updating our views, a set of consumers that process the incoming events and store them as atomic immutable units in a different datastore that could be used in the future to produce new views (via Hadoop, for instance) or recompute faulty data.
What's your opinion on this reasoning? I'd like to know in which scenarios a purely stream-based architecture would fail to scale, and if you have any other observations, pros/cons of a Lambda architecture vs a stream-based architecture.

What is distributed caching? [closed]

Can I define "Distributed Caching" like one of the possible configurations resulting of the combination between caching and clustering ?
Yes you can, here the term "distributed" comes from Distributed Computing where as "caching" is the technique used to exploit the locality(both temporal and spacial) of data when entertaining memory requests. Traditional idea of memory caches is combined with distributed computing architecture in order to provide high performance, availability and scalability to the applications that use caches.
Once the caching mechanism is distributed we can refer it as "distributed caching". This distribution is a more general term that the cache can be distributed over a distant network or within a different nodes of a single computing base.
The most common application is to set up a separate cache cluster in order to distribute the load of cache queries among different nodes.
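As a toy illustration of the "distributed" part (pure Python, with hypothetical node names; a real system would use dedicated cache servers and consistent hashing), keys are partitioned across several cache nodes by hashing, so each node holds and serves only its own slice of the data:

    import hashlib

    class DistributedCache:
        # Toy distributed cache: each node is just a dict standing in for a
        # separate cache server, and keys are partitioned by hash.
        def __init__(self, node_names):
            self.nodes = {name: {} for name in node_names}
            self.ring = sorted(self.nodes)

        def _node_for(self, key):
            # Stable hash -> node; a real deployment would use consistent
            # hashing so that adding/removing nodes moves fewer keys.
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return self.ring[h % len(self.ring)]

        def set(self, key, value):
            self.nodes[self._node_for(key)][key] = value

        def get(self, key):
            return self.nodes[self._node_for(key)].get(key)

    cache = DistributedCache(["node-a", "node-b", "node-c"])
    cache.set("user:42:profile", {"name": "Alice"})
    print(cache.get("user:42:profile"))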
Reference:
Cache Topologies
Distributed Caching On The Path To Scalability
Caching in the Distributed Environment

Do partial instance-hours appear frequently with EC2 on-demand instances? [closed]

Pricing is per instance-hour consumed for each instance, from the time an instance is launched until it is terminated. Each partial instance-hour consumed will be billed as a full hour.
Here is my question:
Do partial instance-hours occur frequently or rarely?
In what kind of context do partial instance-hours occur frequently?
Does anyone have experience with this?
Partial hours happen most frequently when using systems that scale often. For example, in my system I launch 10-20 extra servers each Saturday and Sunday to handle the extra traffic. When these servers are stopped I get charged a partial hour. Amazon has a new feature for auto scaling groups that tells it to terminate (if it has to) the servers closest to the hour marker in order to save money.
Another possible use is for services like MapReduce, where a large number of instances are started and then terminated when the job is complete.
My experience, though, is that the actual cost of partial hours is insignificant for me. Maybe if you're using larger servers it costs a lot, but I'm using c1.medium instances and I barely notice the $5 I get charged on a weekend for partial hours.
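As a rough illustration of how the round-up adds up for fleets that scale often (every number here is a made-up assumption, not a real price):

    import math

    # Hypothetical weekend scale-out: 20 extra instances, each terminated
    # after 6 hours and 25 minutes of use.
    instances = 20
    run_minutes = 6 * 60 + 25
    hourly_rate = 0.13  # assumed on-demand price per instance-hour, in USD

    billed_hours = instances * math.ceil(run_minutes / 60)  # partial hours round up
    actual_hours = instances * run_minutes / 60

    print(f"billed: {billed_hours} instance-hours -> ${billed_hours * hourly_rate:.2f}")
    print(f"actual: {actual_hours:.1f} instance-hours -> ${actual_hours * hourly_rate:.2f}")
    # The gap (about 11.7 instance-hours here) is the cost of the partial hours.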
