Serverless Internet-Wide Chatting Client?

Does anyone know of the logic behind a server-less chat client that would be able to interconnect not LAN-wide but internet-wide? It doesn't need to be able to detect other users world wide, it just needs to obtain some kind of unique identification to be able to connect to a user, like an IP or a unique ID of some sort.

To start, you need some information from somewhere. You can't just turn it on and start chatting without knowing where everyone is. You might have one node that is online all the time and knows a few other nodes. The other nodes would know other nodes and those would know more, etc. It is debatable whether you would call that static node a "server". It could just be your friend's node, or a publicly available IP. Once you are up and running, you won't need the start node anymore.
In this type of system, you would need to query your neighbors if you want some sort of identification besides IP address. An IP address has its own drawbacks as well, because you might have two people behind the same router on a home DSL connection, sharing one public address. Unique IDs would require a recursive query across the whole mesh to find out if your ID is unique.
In this type of system, you would only need to know a limited subset of people in order to chat with anyone, as you can query everyone around you (and the query happens recursively) for the location of that person. An artificial limit on the number of people stored on the local node might be implemented with a Least Recently Used algorithm, kind of like a CPU cache.
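A minimal sketch of that idea, assuming an in-memory simulation where a direct method call stands in for a network request. The names `Node`, `remember`, and `find`, the `ttl` hop limit, and the default capacity are all made up for illustration; a real client would also need NAT traversal, timeouts, and so on.

```python
from collections import OrderedDict


class Node:
    """One chat client that knows a bounded set of peers (LRU-evicted)."""

    def __init__(self, node_id, address, capacity=8):
        self.node_id = node_id
        self.address = address
        self.capacity = capacity
        self.peers = OrderedDict()  # node_id -> Node, least recently used first

    def remember(self, peer):
        # Insert or refresh a peer, then evict the stalest one if over capacity.
        if peer.node_id == self.node_id:
            return
        self.peers[peer.node_id] = peer
        self.peers.move_to_end(peer.node_id)
        if len(self.peers) > self.capacity:
            self.peers.popitem(last=False)

    def find(self, target_id, ttl=4, seen=None):
        # Ask this node, then (recursively) its neighbors, for target's address.
        # `seen` prevents loops; `ttl` limits how many hops the query travels.
        seen = set() if seen is None else seen
        seen.add(self.node_id)
        if target_id in self.peers:
            self.peers.move_to_end(target_id)  # a hit counts as a "use"
            return self.peers[target_id].address
        if ttl == 0:
            return None
        for peer in list(self.peers.values()):
            if peer.node_id in seen:
                continue
            address = peer.find(target_id, ttl - 1, seen)
            if address is not None:
                return address
        return None
```

For example, if node a only knows b and b only knows c, then `a.find("c")` returns c's address by way of b, and a lookup for an unknown ID returns `None` once every reachable neighbor has been asked.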

Related

consistent hashing where you want a key mapped to multiple servers

I'm wondering if I'm missing a concept here somewhere, and wondering if someone can explain how this might work.
My understanding of consistent hashing makes perfect sense where I want to map a particular key to one server. I can just map the key to a single server or virtual node clockwise or counterclockwise from the key.
How does consistent hashing work if I want to specify that each key should be mapped to some quorum of servers I define? For example, what if I have 5 servers and want each key mapped on at least two servers? Would I just choose the two unique servers clockwise on the ring, or is there some other strategy you need? Could you equivalently choose one server clockwise and one counterclockwise? What if you want a key mapped to an arbitrary number of servers?
My issue may be also that I just don't know the right terminology to search for. My particular use case is that I want to have a cluster of Prometheus collectors, say 7, and say I have a pool of 150 exporters. I want each exporter to be collected by at least 3 of the collectors. What's a good way to think about this problem? Appreciate thoughts, thanks!
It turns out that consistent hashing is actually a special case of rendezvous hashing, which is what I was looking for.
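For the collector/exporter case, a rendezvous (highest random weight) assignment could look roughly like the sketch below. The function names, the `"|"` separator, and the choice of SHA-256 are assumptions for illustration, not anything Prometheus provides.

```python
import hashlib


def score(collector, exporter):
    # Deterministic pseudo-random weight for a (collector, exporter) pair.
    digest = hashlib.sha256(f"{collector}|{exporter}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(exporter, collectors, replicas=3):
    # Rank every collector by its weight for this exporter and keep the top N.
    ranked = sorted(collectors, key=lambda c: score(c, exporter), reverse=True)
    return ranked[:replicas]


collectors = [f"collector-{i}" for i in range(7)]
exporters = [f"exporter-{i}" for i in range(150)]
assignment = {e: assign(e, collectors) for e in exporters}
```

Every node that knows the collector list computes the same answer without coordination. If a collector disappears, only the exporters that were assigned to it pick up a replacement; their other two collectors stay the same.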

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server (or servers, if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
It is obviously quite easy to develop the naive algorithm for this, but I am trying to learn to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times
How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to offer either high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean:
1. "everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
2. Or "everything transacted more than X seconds ago", which leads to partial serialization, with multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. On the other hand, you know immediately when to roll back a transaction because of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.
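As a rough sketch of the "server keeps raw data, clients keep their own copy" arrangement, assuming a simple append-only change log with a version counter. The class and method names are hypothetical, and the notification push is left out; here the client just pulls whatever it has not seen yet.

```python
class GraphStore:
    """Server side: holds the raw change log, does no graph computation."""

    def __init__(self):
        self.version = 0
        self.log = []  # entries are (version, kind, payload)

    def add_node(self, node_id, **attrs):
        return self._append("node", (node_id, attrs))

    def add_edge(self, src, dst):
        return self._append("edge", (src, dst))

    def _append(self, kind, payload):
        self.version += 1
        self.log.append((self.version, kind, payload))
        return self.version

    def changes_since(self, version):
        # Versions are 1..n with no gaps, so entry k sits at index k - 1.
        return self.log[version:]


class GraphClient:
    """Client side: keeps a local copy and applies only the new records."""

    def __init__(self):
        self.version = 0
        self.nodes = {}
        self.edges = set()

    def sync(self, store):
        for version, kind, payload in store.changes_since(self.version):
            if kind == "node":
                node_id, attrs = payload
                self.nodes[node_id] = attrs
            else:
                self.edges.add(payload)
            self.version = version
```

The client can then run shortest-path or coloring against its local copy, and a sync after one new node transfers one record rather than the whole graph.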

Cache Geospatial calculations or calculate on the fly?

I'm a developer on a service vehicle dispatching web app. It's written in .NET 4+, MVC4, using SQL Server.
There are 2000+ locations stored in the database as geography data-types. Assuming we send resources from location A to location B, the drive time / distance etc. needs to be displayed at some point. If I calculate the distance with SQL Server's STDistance it will only give me the "as the crow flies" distance. So the system will need to hit a geospatial service like Bing, Google, or ESRI and get the actual drive time or suggested routes. The problem is that this is a core function and will happen a lot.
Should I pre-populate a lookup table with pre-calculated distances or average drive times? The downside is that even without adding more locations, that's 4 million records (2000 x 2000 location pairs) to search every time the information is needed.
On top of this, most times the destination is not one of our stored geospatial coordinates and can instead be an address or long/lat point anywhere on the continent which makes pre-calculating impossible.
I'm trying to avoid the performance issues of having to hit some geoservices endpoint constantly.
Any suggestions on how best to approach this?
-thanks!
Having looked at these problems before, I can say you are unlikely to be able to store them all.
Caching the results is usually against the TOS of almost all of the routing providers. You can sometimes negotiate this ability, but it costs a lot.
Given that there is not a fixed set of points you are searching against, doing one calculation gives you little information for the next calculation.
I would say maybe you can store the route for a pair once it has been selected, so you can show that route again if needed. Once the transaction is done, I would remove the route from your DB.
If you really want to cache all this or have more control over it, you can use pgRouting (with PostgreSQL) and then obtain street data, though I doubt it is worth your effort.
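The "store the route only for the life of the transaction" suggestion might be sketched like this. `RouteStore` and `fetch_route` are made-up names; `fetch_route` stands for whatever call your routing provider actually offers, and whether even this short-lived storage is allowed depends on that provider's terms.

```python
class RouteStore:
    """Keeps routes per dispatch transaction and forgets them on close."""

    def __init__(self, fetch_route):
        # fetch_route(origin, destination) is a hypothetical callable that
        # asks the external routing service for a route.
        self._fetch_route = fetch_route
        self._routes = {}  # transaction_id -> {(origin, destination): route}

    def route_for(self, transaction_id, origin, destination):
        # Hit the provider only the first time a pair is requested
        # within a given transaction.
        routes = self._routes.setdefault(transaction_id, {})
        key = (origin, destination)
        if key not in routes:
            routes[key] = self._fetch_route(origin, destination)
        return routes[key]

    def close(self, transaction_id):
        # Transaction finished: drop everything stored for it.
        self._routes.pop(transaction_id, None)
```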

What is the design & architecture behind facebook's status update mechanism?

I'm planning on creating a social network and I don't think I quite understand how the status update module of facebook is designed. Hoping I can find some help here. At algorithmic and datastructure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends requests to the leaf servers for all of your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. Then they put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.
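To make the leaf/aggregator split concrete, here is a toy in-memory sketch. It assumes integer user IDs, picks the leaf with a plain modulo, and "ranks" by recency only; none of that is claimed to be how Facebook actually does it.

```python
import heapq


class Leaf:
    """Holds all feed items for the users that map to this server."""

    def __init__(self):
        self.items = {}  # user_id -> list of (timestamp, text), newest first

    def post(self, user_id, timestamp, text):
        feed = self.items.setdefault(user_id, [])
        feed.append((timestamp, text))
        feed.sort(reverse=True)

    def recent(self, user_id, limit):
        return self.items.get(user_id, [])[:limit]


class Aggregator:
    """Fans a feed request out to the leaves of each friend and merges."""

    def __init__(self, leaves):
        self.leaves = leaves

    def leaf_for(self, user_id):
        # Assumption: the leaf is derived from the user ID alone.
        return self.leaves[user_id % len(self.leaves)]

    def post(self, user_id, timestamp, text):
        self.leaf_for(user_id).post(user_id, timestamp, text)

    def feed(self, friend_ids, limit=10):
        candidates = []
        for friend in friend_ids:
            for timestamp, text in self.leaf_for(friend).recent(friend, limit):
                candidates.append((timestamp, friend, text))
        # Newest items first; a real system would apply a richer ranking.
        return heapq.nlargest(limit, candidates)
```

Note that there is no scan over all updates: the aggregator touches only the leaves that hold the requested friends, and each leaf returns at most `limit` items per friend.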

How to ensure correctness of data gathered via crowdsourcing?

I have a site where users are entering data of some products they buy.
How do I ensure correctness of data entered via crowdsourcing (enabling users to vote/edit products) minimizing amount of work that needs to be done by administrator? I'm looking for some how-tos, best practices, etc.
What sort of data are you collecting?
You're talking about crowd-sourcing, and thus (I assume) aggregating data across this crowd. As they're talking about products they buy, I suspect you're going to be gathering product attributes and prices.
Some possible approaches: if your users are entering non-numerical data (e.g. colours), just record the mode (the most commonly entered value).
If they're entering numeric data, discard outliers. i.e. bin the lowest and highest results, and average the rest (you could do this for prices, say. This is the approach that electronic exchanges use for resolving closing prices out of many trades).
Depending on your application, you may want to have a historical bias towards the most recent entries.
But this all depends on your application, and how much storage and crunching of data you're prepared to do.
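A small sketch of both aggregations, assuming the entries for one product attribute have already been collected into a list. The helper names and the lowercase/strip normalization are illustrative choices.

```python
from collections import Counter


def most_common_entry(values):
    # Mode of non-numeric entries, ignoring case and surrounding whitespace.
    # On a tie, the entry that was seen first wins.
    normalized = [v.strip().lower() for v in values]
    return Counter(normalized).most_common(1)[0][0]


def trimmed_mean(values, trim=1):
    # Drop the `trim` lowest and `trim` highest values, average the rest.
    ordered = sorted(values)
    if len(ordered) <= 2 * trim:
        raise ValueError("not enough values to trim")
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)
```

With prices `[1, 10, 11, 12, 100]`, the trimmed mean discards 1 and 100 and averages the remaining three to 11.0, whereas the plain mean would be 26.8.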
Make sure you keep a log of IP addresses with every action made, because malicious users or bots can tamper with session data or cookies. Doing this makes it harder for a single entity to skew results or do anything drastic by appearing to be multiple users.
At a high level, data can be gathered from the 'crowd' with an associated correctness value. Looking at SO, an answer or response from someone with 1000+ rep has more weight than one from a casual user. Look for validation and triangulation: if it's a single voice in the crowd that you're listening to, then it's probably not worth that much. If other voices join, then you know you're onto something; again, in SO terms, we all get a chance to upvote questions.
I've recently seen some really good iPhone apps which rely on crowdsourcing for their data, and then validate it by asking other users if it's correct.
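One way the reputation-plus-triangulation idea could be expressed in code is sketched below. The logarithmic `weight` function and the `min_voters` threshold are arbitrary assumptions chosen only to show the shape of it.

```python
import math
from collections import defaultdict


def weight(reputation):
    # Hypothetical weighting: everyone counts, high reputation counts more.
    return 1.0 + math.log10(1 + reputation)


def weighted_winner(votes, min_voters=2):
    """votes is a list of (user, reputation, value) tuples.

    Returns the value with the highest total weight among values backed by
    at least `min_voters` distinct users, or None if no value qualifies.
    """
    totals = defaultdict(float)
    supporters = defaultdict(set)
    for user, reputation, value in votes:
        if user in supporters[value]:
            continue  # one vote per user per value
        supporters[value].add(user)
        totals[value] += weight(reputation)
    confirmed = {
        value: total
        for value, total in totals.items()
        if len(supporters[value]) >= min_voters
    }
    if not confirmed:
        return None
    return max(confirmed, key=confirmed.get)
```

A single voice, however reputable, yields `None` here until someone else confirms it.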

Resources