I built an app where people connect to rooms through websocket based on the room's id in the URL, such as app.com/9l4CvjXFxn . The thing is I want to run multiple node instances in order to secure an always up instance, in case others crash, and also because I heard loadbalancing is good. I also have the static UI content to serve. I am using socketio only, no REST api.
My plan now is to use a load-balancing and serving proxy such as nginx or haproxy. I have never used any of them. I have also thought about using PM2 to easily run many node instances. The app will probably be deployed on AWS.
Websockets happen through an HTTP1 upgrade to a certain path in socketio, I have set it to root /. So in the initial upgrade request I can't change the path, but I can put things into the url query or cookie headers.
So the main requirement for the loadbalancer is to direct websocket upgrades of a certain room to one specific node instance consistently, hence I thought of hash loadbalancing, but I have no idea how to do this and if this is the correct approach at all.
Could you help?
Thanks!
Couldn't solve it and it might be that its a "wrong" problem to have. This is because I don't have persistence and I keep my objects in node. If my node instance was stateless and use a redis I could spin up many nodes without having this problem. This is the next step though because, still an early app. Checkiout though: quarantime.io :)
Related
I have been playing around with creating a webapp that uses elasticsearch to perform queries. Currently, everything is in production, thus on the localhost, let's say elasticsearch runs at 123.123.123.123:9200. All fun and games, but once the webapplication (react) is finished, the webapp should be able to send the queries to the current local elastic search db.
I have been reading around on how to get this done in a proper and most of all secure way. Summary of this all is currently:
"First off, exposing an Elasticsearch node directly to the internet without protections in front of it is usually bad, bad news." (see here: Accessing elasticsearch from a public domain name or IP).
Another interesting blog I found: https://code972.com/blog/2017/01/dont-be-ransacked-securing-your-elasticsearch-cluster-properly-107.
The problem with the above-mentioned sources is that they are a bit older, and thus I am not sure whether they are up to date.
Therefore the following questions:
Is nginx sufficient to act as a secure middleman, passing the queries from the end-users to elastic?
What is the difference at that point with writing a backend into the react application (e.g. using node and express)?
What is the added value taking into account the built-in security from elasticsearch (usernames, password, apikey, certificates, https,...)?
I am reading a lot about using a VPN or tunneling. I have the impression that these solutions are more geared towards a corporate-collaborative approach. Let's say I am running my front-end on a live server, I can use tunneling to show my work to colleagues, my employer. VPN would be more realistic for allowing employees -wish I had them, just a cs student here- to access e.g. the database within my private network (let's say an employee needs to access kibana to adapt something, let's say an API-key - just making something up here), he/she could use a VPN connection for that.
Thank you so much for helping me clarify the above-mentioned points!
TLS, authorisation and access control are free for the Elastic Stack, and have been for a while. I'd start by looking at the docs, as it's an easy way to natively secure your cluster
for nginx, it can be useful for rate limiting, or blocking specific queries for eg. however it's another thing to configure and maintain
from a client POV it would really only matter if you are using the official Elasticsearch clients, and you use nginx and make changes to the way the API would respond to the client (eg path rewrites, rate limiting)
it's free, it's native, it's easy to manage via Kibana
I'd follow the docs to secure Elasticsearch and then see if you need this at some point in the future. this would be handled outside Elasticsearch anyway, and you'd still want to secure Elasticsearch
The point in exposing Elasticsearch nodes directly to the internet is a higher vulnerability in principle. You should follow the rule of the least "surface" of your system on the internet.
A good practice is to hide from the internet whatever doesn't need to be there, although well protected. It takes ~20mins to get cyber attacks on any exposed service (see a showcase).
So I suggest you install a private network, such as a traditional VPN or an SDP product such as Shieldoo Mesh.
We have a number of EC2 instances running Solr in EC2, which we've used in the past through another application. We would like to move towards allowing users (via web browser) to directly access Solr.
Without something "in front" of Solr this results in a security risk, so we have opted to try to use ELB (specifically the Application Load Balancer) as a simple and maintenance free way of preventing certain requests from hitting SOLR (i.e. preventing the public from DELETING or otherwise modifying the documents in Solr).
This worked great, but we realize that we need to deal with the CORS issue. In other words, we need to add the appropriate headers to requests that come in from a browser. I have not yet seen a way of doing this with Application Load Balancer but am wondering if it is possible to do someway. If it is not possible, I would love as an additional recomendation the easier and least complicated way of adding these headers. We really really really hate to add something like nginx in front of Solr because then we've got additional redundancy to deal with, more servers, etc.
Thank you!
There is not much I can find on CORS for ALB either and I remember when I used Beanstalk with ELB I had to add CORS support in my java application directly.
Having said that, I can find a lot of articles on how to set up CORS for Solr.
Can it be an option for you?
I dont have a tonne of experience with heroku, and even less with phoenix, so this may be a stupid question... but want to make sure I am making a good choice on hosting :)
From what I understand, the way you scale phoenix is add another server, launch another node, and connect them, then let BEAM / OTP work its magic to handle work load balancing. On heroku, dynos can't really talk together over a local network, which from what I understand is something that BEAM requires to cluster. So adding dynos will result in a more "traditional" scaling model, where you have an external load balancer balancing connections between unconnected nodes, with the db being shared state.
My question here is how big of an impact will this have? Is it more only an issue when you are hitting serious levels of load / scale, or will it mean spending a lot more money on infrastructure then is necessary?
You'll get the best performance on a host that supports clustering, but Phoenix has a PubSub adapter system exactly for deployments like heroku:
https://github.com/phoenixframework/phoenix_pubsub
One line config change and mix.exs deps entry and you'll have multinode channels on heroku via our Redis adapter.
This is very open question, so I am sure my answer won't be comprehensive.
In your situation the most important question is: will I Phoenix use channels?
If you use plain old HTTP, it can be mostly stateless. There are lots of methods to simulate stateful connection like storing sessions in cookies. At the end of the day, it doesn't matter if your backend servers are connected with each other, because each of them is doing independent computations. Your load balancer can randomly select any server and it will always work. This cool feature of http enables this protocol to scale so well. You can definitely use Heroku in that scenario and it will work great.
If you use Phoenix channels, things get complicated. You still want to be able to connect to any of the servers, but you will probably send messages to other users real time and they can be connected to other servers. Phoenix solves this problem for you by clustering using BEAM and this will be hard on Heroku. Or even impossible.
To sum up: it is not a question of small scale/big scale. It is a question of features. Scaling channels will require clustering, scaling plain old HTTP will not.
The current setup is compose by 3 elastic search servers, of witch one is the master and the other 2 are slave, at least they define themselves as that.
It might happen that the master goes down, for any kind of problem, this means that elastic search is going to find a new elegible master and switch to this new one.
Currently the problem is that all my application on the frontend servers is totally not aware of this so it will keel making queries to the same backend, of course killing all my website because it will not answer. I had a look around but I was not able to find anything related to backend switch on the fly even related to the new Haystack 2.x.
Any suggestion?
Many thanks in advance
It doesn't seem to be necessary to me to leave this to your application layer. Most probably you access ES through HTTP-REST requests, which means you can put any HTTP load balancer like Nginx in front of your ES servers which should also switch to another node if one times out.
Most solutions I've read here for supporting subdomain-per-user at the DNS level are to point everything to one IP using *.domain.com.
It is an easy and simple solution, but what if I want to point first 1000 registered users to serverA, and next 1000 registered users to serverB? This is the preferred solution for us to keep our cost down in software and hardware for clustering.
alt text http://learn.iis.net/file.axd?i=1101
(diagram quoted from MS IIS site)
The most logical solution seems to have 1 x A-record per subdomain in Zone Datafiles. BIND doesn't seem to have any size limit on the Zone Datafiles, only restricted to memory available.
However, my team is worried about the latency of getting the new subdoamin up and ready, since creating a new subdomain consist of inserting a new A-record & restarting DNS server.
Is performance of restarting DNS server something we should worry about?
Thank you in advance.
UPDATE:
Seems like most of you suggest me to use a reverse proxy setup instead:
alt text http://learn.iis.net/file.axd?i=1102
(ARR is IIS7's reverse proxy solution)
However, here are the CONS I can see:
single point of failure
cannot strategically setup servers in different locations based on IP geolocation.
Use the wildcard DNS entry, then use load balancing to distribute the load between servers, regardless of what client they are.
While you're at it, skip the URL rewriting step and have your application determine which account it is based on the URL as entered (you can just as easily determine what X is in X.domain.com as in domain.com?user=X).
EDIT:
Based on your additional info, you may want to develop a "broker" that stores which clients are to access which servers. Make that public facing then draw from the resources associated with the client stored with the broker. Your front-end can be load balanced, then you can grab from the file/db servers based on who they are.
The front-end proxy with a wild-card DNS entry really is the way to go with this. It's how big sites like LiveJournal work.
Note that this is not just a TCP layer load-balancer - there are plenty of solutions that'll examine the host part of the URL to figure out which back-end server to forward the query too. You can easily do it with Apache running on a low-spec server with suitable configuration.
The proxy ensures that each user's session always goes to the right back-end server and most any session handling methods will just keep on working.
Also the proxy needn't be a single point of failure. It's perfectly possible and pretty easy to run two or more front-end proxies in a redundant configuration (to avoid failure) or even to have them share the load (to avoid stress).
I'd also second John Sheehan's suggestion that the application just look at the left-hand part of the URL to determine which user's content to display.
If using Apache for the back-end, see this post too for info about how to configure it.
If you use tinydns, you don't need to restart the nameserver if you modify its database and it should not be a bottleneck because it is generally very fast. I don't know whether it performs well with 10000+ entries though (it would surprise me if not).
http://cr.yp.to/djbdns.html