Load Balanced File Access without the Cloud - microservices

I have an application made up of several microservices so that independent pieces can be scaled up/down as needed (running as containers in Docker Swarm, and maybe eventually Kubernetes). I'm looking for suggestions for controlling access to a shared file system from multiple containers, as I suspect there is some system or piece of software out there that I just don't know how to search for.
Data in the form of physical files will come into a single location. I need one microservice to read each file that comes in and essentially upload it to a database.
What solutions exist, if any, that would prevent two or more instances of the same microservice from fighting over each file?
This needs to work in a 100% offline environment, so cloud storage is not an option. Ideally it would behave much like a message queue, in that the next available consumer picks up the next available file from this shared file system.
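To make the queue-like behaviour concrete, here is a rough sketch of the kind of pattern I have in mind, where each worker claims a file by atomically renaming it before processing (Python, with all directory names invented; I'm not sure how safe the rename is on every shared/network filesystem, which is part of why I'm asking):

```python
"""Sketch only: workers claim files by atomic rename before processing.
Assumes /shared/incoming and /shared/claimed are on the same (shared)
filesystem so the rename is atomic; some network filesystems may behave
differently. Paths and the processing step are placeholders."""
import os
import socket
from typing import Optional

INCOMING = "/shared/incoming"   # files arrive here
CLAIMED = "/shared/claimed"     # a worker moves a file here to claim it

def try_claim(filename: str) -> Optional[str]:
    """Return the claimed path, or None if another worker won the race."""
    src = os.path.join(INCOMING, filename)
    dst = os.path.join(CLAIMED, f"{socket.gethostname()}-{os.getpid()}-{filename}")
    try:
        os.rename(src, dst)     # succeeds for exactly one worker
        return dst
    except (FileNotFoundError, PermissionError):
        return None             # someone else already claimed (or removed) it

def process_and_upload(path: str) -> None:
    """Placeholder for the real 'read file, write to database' step."""
    ...

def run_once() -> None:
    for name in sorted(os.listdir(INCOMING)):
        claimed = try_claim(name)
        if claimed is not None:
            process_and_upload(claimed)
            os.remove(claimed)  # or move it to an archive directory
```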

Related

How to detect Windows file closures locally and on network drives

I'm working on a Win32-based document management system that employs an automatic check-in/check-out model. The model it currently uses for tracking documents in use (monitoring the processes of the applications that open the documents) is not particularly robust, so I'm researching alternatives.
Check-outs are easy, as the DocMgt application is responsible for launching the other application (Word, Adobe, Notepad etc.) and passing it the document.
It's the automatic check-in requirement that is more difficult. When the user closes the document in Word/Adobe/Notepad, ideally the DocMgt system would be notified automatically so it can perform an automatic check-in of the updated document.
To complicate things further the document is likely to be stored on a network drive not a local drive.
Anyone got any tips on API calls, techniques or architectures to support this sort of functionality?
I'm not expecting a magic three-line solution; the research I've done so far leads me to believe that this is far from a trivial problem and will require significant work to implement. I'm interested in all suggestions, whether they're for a full or partial solution.
What you describe is a common task. It is perfectly doable, though not without its share of hassle. Here I assume that the files are closed on a computer where your code can run (even if the files themselves are stored on a mounted network share).
There are two approaches to controlling files while they are in use: a filter and a virtual filesystem.
The filter sits in the middle, between the process and the filesystem (any filesystem: local, network or fully virtual), and intercepts file requests that go to that filesystem. This requires that the filter code runs on the computer the requests pass through (a requirement that seems to be met in your scenario).
The virtual filesystem is an endpoint for the requests that come from the applications. When you implement the virtual filesystem, you handle all requests, so you always fully control the lifetime of the files. As the filesystem is virtual, you are free to keep the files anywhere including the real disk (local or network) or even in the cloud.
The benefit of the filter approach is that you can control individual files that reside on real disks, while the virtual filesystem can only be mounted to a new drive letter or into an empty directory on an NTFS drive, which is not always feasible. At the same time, sitting in the middle, the filter is to some extent more restricted in what it can do, and the files can be altered while the filter is not running. Finally, filters are more complicated and potentially more error-prone, as they sit in the middle and must play nicely with other filters and with endpoints.
I don't have specific recommendations, but if a separate drive letter is an option, I would recommend the virtual filesystem.
Our company developed (and continues to maintain for the new owner) two products, CBFS Filter and CBFS Connect, which let you create a filter and a virtual filesystem respectively, all in user mode. Those products are used in many software titles, including some document management systems (which is close to what you do). You will find both products on their website.
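If you only need a rough heuristic rather than a full filter or virtual filesystem, another option is to poll the document and treat a sharing violation as "still open in Word/Adobe/etc.". This is not the approach described above and is noticeably less robust, especially on network drives where locking behaviour depends on the client and server, but as a sketch (Python via ctypes, calling the Win32 CreateFileW API):

```python
"""Rough polling check: is a file still open by another process?
Sketch only: open the file with no share flags and treat
ERROR_SHARING_VIOLATION as "still in use". Far less robust than a
filter or virtual filesystem, particularly over network shares."""
import ctypes
from ctypes import wintypes

GENERIC_READ = 0x80000000
OPEN_EXISTING = 3
ERROR_SHARING_VIOLATION = 32
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = [
    wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
    wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

def is_in_use(path: str) -> bool:
    """True if another process still has the file open without sharing."""
    handle = kernel32.CreateFileW(
        path, GENERIC_READ,
        0,                      # dwShareMode = 0: demand exclusive access
        None, OPEN_EXISTING, 0, None)
    if handle == INVALID_HANDLE_VALUE:
        return ctypes.get_last_error() == ERROR_SHARING_VIOLATION
    kernel32.CloseHandle(handle)
    return False                # we could open it exclusively: not in use
```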

What technology to use to avoid too many VMs

I have a small web and mobile application partly running on a web server, written in PHP (Symfony). I have a few clients using the application, and I am slowly expanding to more clients.
My back-end architecture looks like this at the moment:
The database is Cloud SQL running on GCP (every client has its own database instance)
Files are stored on Cloud Storage (GCP) or S3 (AWS), depending on the client (every client has its own bucket)
The PHP application runs in a Compute Engine VM on GCP (every client has its own VM)
Now the thing is, in the PHP code the only client-specific thing is a settings file with the database credentials and the Storage/S3 keys in it. All the other code is exactly the same for every client. And the different VMs mostly sit idle all day, seeing only a few hours of usage per client.
I'm trying to find a way to avoid having to create and maintain a VM for every customer. How could I rearchitect my back end so I can keep separate databases and storage buckets per client, but only scale my VMs up when capacity is needed?
I'm hearing a lot about Docker, have been thinking about keeping DB credentials and keys in a Redis DB or Cloud Datastore, and have been looking at Heroku, App Engine, Elastic Beanstalk, ...
This is my ideal scenario as I see it now:
An incoming request arrives and hits a load balancer
From the request, determine which client the request is for
Find the correct settings file, or credentials from a DB
Inject the settings file into an unused "container"
Handle the request
Make the container idle again
And somewhere in there, determine, based on the amount of incoming requests or traffic, whether I need to spin containers up or down to handle the extra or reduced (temporary) load.
All this information overload has me stuck. I have no idea which direction to choose, and I fail to see how implementing any of the above technologies will actually fix my problem.
There are several ways to do it with minimal effort:
Rewrite the loading of the config file so it depends on the customer
Host several back-end web sites on one VM (the best choice, I think)
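For the first option, the idea is a single deployment that resolves the client per request and loads that client's settings. Your app is PHP/Symfony, so the following is only an illustration of the shape of it, written as a minimal Python WSGI app with invented hostnames and placeholder credentials:

```python
"""Minimal multi-tenant sketch: one app instance, per-client settings
chosen from the request's Host header. All names and credentials are
placeholders; in the real app this lookup would live in the PHP code,
and the table could equally be a Redis hash, a Datastore entity, or a
directory of per-client config files."""
import json
from wsgiref.simple_server import make_server

TENANTS = {
    "clienta.example.com": {"db_dsn": "pgsql://db-a/clienta", "bucket": "clienta-files"},
    "clientb.example.com": {"db_dsn": "pgsql://db-b/clientb", "bucket": "clientb-files"},
}

def application(environ, start_response):
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    settings = TENANTS.get(host)
    if settings is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown tenant"]
    # From here on, every DB/storage call for this request uses `settings`.
    body = json.dumps({"tenant": host, "bucket": settings["bucket"]}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, application).serve_forever()
```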

Storing files in a webserver

I have a project using the MEAN stack that uploads image files to a server and the names of the images to a database. The images are then shown to users of the application, kind of like an image gallery.
I have been trying to figure out an efficient way of storing the image files. At the moment I'm storing them under the Angular application in a folder, /var/www/app/files.
What are the usual ways of storing them on a cloud server like DigitalOcean, Heroku and many others?
I'm a bit thrown off by the fact that they offer many options for data storage.
Let's say that hundreds of thousands of images were uploaded by the application to the server.
Saving all of them inside your front-end app in a subfolder might not be the best solution? Or am I wrong about this?
I am very new to these cloud web servers and how they actually operate.
Can someone clarify what would be the optimal solution?
Thanks!
Saving all of them inside your front-end app in a subfolder might not be the best solution?
You're very right about this. Over time this will get cluttered and, unless you use some very convoluted logic, it will slow down your server.
If you're using Angular and this is in the public folder sent to every user, this is even worse.
The best solution to this is using something like an AWS S3 Bucket (DigitalOcean has Block Storage and I believe Heroku has something a bit different). These services offer storage of files, and essentially act as logic-less servers. You can set some privacy policies and other settings, but there's no runtime like NodeJS that can do logic for you.
Your Node server (or any other server setup) interfaces with this storage server, and handles most of the fetching and storing of files. You can optionally limit these storage services so they can only communicate with your Node server, so any file traffic would be done through your Node server.
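On a MEAN stack the interface would normally be the AWS SDK for Node (or an S3-compatible client for DigitalOcean Spaces); purely to illustrate the flow of "upload to the bucket, store only the key in the DB, hand the browser a short-lived URL", here is a sketch using Python and boto3, with the bucket name and key scheme invented:

```python
"""Sketch of the upload flow: the app server receives the file, pushes it
to object storage, stores only the key in the database, and gives the
browser a short-lived URL so the bytes never pass through the app server.
Bucket and key names are placeholders."""
import uuid
import boto3

s3 = boto3.client("s3")          # credentials come from the environment
BUCKET = "my-image-bucket"       # hypothetical bucket name

def store_image(fileobj, content_type: str) -> str:
    key = f"images/{uuid.uuid4()}"
    s3.upload_fileobj(fileobj, BUCKET, key,
                      ExtraArgs={"ContentType": content_type})
    return key                   # save this key in your DB instead of a path

def image_url(key: str, expires: int = 3600) -> str:
    # Presigned URL: the browser fetches directly from storage.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=expires)
```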

How to install more than one instance of DSpace in the same Tomcat server?

I want to create another instance of DSpace to work on different projects. However, I do not know how to do that, or whether it will conflict with the instance that is already running.
While it's technically possible, I would advise against this for the following 3 key reasons:
Quite a few configuration aspects of DSpace still count on a Tomcat restart to take effect. If you have two instances running in the same Tomcat, it means you have to bring both of them down when you want to update one of them.
Performance related issues are already far from trivial to debug in DSpace, even if you have only one instance running in one Tomcat. If you run two instances, it is very likely that you will only make this more difficult.
This kind of setup is non-standard. As with all non-standard setups, you will find it much harder to get community support, as very few other people will be in the same boat.
So ... either run two VMs, or just two Tomcat processes on one VM.
If, after these warnings, you still want to do it, the basics would be to run all of the webapps you want twice in Tomcat, under different context paths (or on different ports). The minimum you would need is 2x XMLUI or JSPUI and 2x Solr. It might be possible to run one Solr webapp and keep two sets of search, statistics, authority and OAI indexes in that one Solr webapp, but I don't know what the side effects could be.
1) Obviously each instance should be installed to a different set of directories.
2a) Create a separate Context for each instance. That will give them different paths: http://legion.example.com/one/, http://legion.example.com/two/.... I do this on my development workstation all the time.
2b) You can also create separate domains and IP addresses, bind them to multiple Host objects in a single Tomcat configuration: http://one.example.com/, http://two.example.com/.... I have four low-volume production DSpace instances running in one Tomcat instance on a midsized host.
Each DSpace instance needs its own database, but PostgreSQL can host dozens. You should consider creating separate database user accounts for each.
You'll also need separate Handle resolvers for each DSpace, just the same as if each instance was on its own host. When configured for DSpace, the Handle resolver uses the DSpace database instead of its own, so it's specific to a single instance.
Solr ought to be able to serve several sets of cores for several DSpaces, but you'll have to do a fair amount of configuration to keep them distinct and ensure that each DSpace is using its own set. You'll learn a lot more about Solr than you need to know for the captive Solr instance that gets installed with a single DSpace.
But then, you'll also learn a lot more about Tomcat than you need to know for a single DSpace....
If you declare your Contexts in external files ([Tomcat]/conf/Catalina/localhost/one.xml etc.) and you have automatic deployment set up correctly, you can just 'touch' one of the Context files to restart that webapp without restarting the whole Tomcat. Otherwise you can use the Tomcat Manager webapp to do this. Consider carefully whether you want to have the Manager running, though, because it is quite powerful and it is exposed on the network. I run such applications on yet another, non-routable address so they can't be reached from outside.
DSpace is not small, so you will need to ensure that you have enough memory to run several instances and that Tomcat's memory limits are adjusted accordingly. I would suggest also installing a resource monitor such as Psi Probe and glancing at it regularly. The above comments on performance are spot-on.
Learning to make all this work was loads of fun, and took quite some time. On the other hand, for development you may prefer something like https://github.com/DSpace/vagrant-dspace, a packaged virtual machine with DSpace and friends inside.

Images in load balanced environment

I have a load-balanced environment with over 10 web servers running IIS. All websites access a single file storage system that hosts all the pictures. We currently have 200 GB of pictures, which we store in directories of 1,000 images per directory. Right now all the images are on a single storage device (RAID 10) connected to a single server that acts as the file server. All web servers are connected to the file server on the same LAN.
I am looking to improve the architecture so that we would have no single point of failure.
I am considering two alternatives:
Replicate the file storage to all of the web servers so that they all access the data locally
Replicate the file storage to another storage device, so that if something happens to the current storage we can switch to it.
Obviously the main operations done on the file storage are read, but there are also a lot of write operations. What do you think is the preferred method? Any other idea?
I am currently ruling out using a CDN, as it would require an architecture change to the application which we cannot make right now.
Certain things I would normally consider before going for an architecture change are:
What are the issues with the current architecture?
What am I doing wrong with the current architecture? (If it has been working for a while, minor tweaks will normally solve a lot of issues.)
Will it allow me to grow easily? (There will always be an upper limit.) Based on the past growth of data, you can plan this effectively.
Reliability
Ease of maintenance / monitoring / troubleshooting
Cost
200 GB is not a lot of data; you can go for some home-grown solution, or use something like a NAS, which will allow you to expand later on, and keep a hot-swappable replica of it.
Replicating to the storage of all the web servers is a very expensive setup, and since, as you said, there are a lot of write operations, it will have a large overhead in replicating to all the servers (which will only increase with the number of servers and growing data). There is also the issue of stale data being served by one of the other nodes. Apart from that, troubleshooting replication issues will be a mess with 10 (and growing) nodes.
Unless the lookup / read / write of files is very time-critical, replicating to all the web servers is not a good idea. Users (on the web) will hardly notice a difference of 100-200 ms in load time.
There are some enterprise solutions for this sort of thing, but I don't doubt that they are expensive. A NAS doesn't scale well, and you still have a single point of failure, which is not good.
There are some ways you can write code to help with this. You could cache the images on the web servers the first time they are requested; this will reduce the load on the image server.
You could set up a master-slave arrangement, so that you have one main image server plus other servers that copy from it. You could load-balance these and put some logic in your code so that if a slave doesn't have a copy of an image, you check on the master. You could also assign these in priority order, so that if the master is not available the first slave becomes the master.
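As a rough sketch of that cache-then-fall-back-to-the-master idea (host names and paths are invented, and on IIS this would more likely live in an HTTP handler or module than in Python):

```python
"""Sketch of the cache-then-fall-back lookup described above.
Each web server keeps a local cache directory; on a miss it pulls the
image from the master image server and stores it for next time.
Host names and paths are invented for illustration."""
import os
import shutil
import urllib.request

CACHE_DIR = "/var/cache/images"                    # local to each web server
MASTER_URL = "http://imagemaster.internal/images"  # hypothetical master server

def get_image(name: str) -> str:
    """Return a local path for the image, fetching from the master on a miss."""
    local_path = os.path.join(CACHE_DIR, name)
    if os.path.exists(local_path):
        return local_path                          # cache hit: no load on master
    os.makedirs(CACHE_DIR, exist_ok=True)
    tmp_path = local_path + ".part"
    with urllib.request.urlopen(f"{MASTER_URL}/{name}") as resp, \
            open(tmp_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    os.replace(tmp_path, local_path)               # atomic publish into the cache
    return local_path
```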
Since you have so little data in your storage, it makes sense to buy several big HDDs, or use the free space on your web servers, to keep copies. That will take the strain off your back-end storage system, and when it fails you can still deliver content to your users. Even better, if you need to scale (more downloads), you can simply add a new server and the stress on your back end won't change much.
If I had to do this, I'd use rsync or unison to copy the image files to the exact same path on the web servers as they have on the storage device (this way, you can swap the copy out for a network file system mount at any time).
Run rsync every now and then (for example after any upload, or once a night; you'll know best which schedule fits you).
A more versatile solution would be to use a P2P protocol like BitTorrent. That way, you could publish all the changes on the storage back end to the web servers and they'd optimize the updates automatically.
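As a small sketch of the rsync suggestion (host names and paths are invented; it assumes rsync is installed and the file server has SSH access to each web server):

```python
"""Hypothetical sync script: mirror the image store to every web server
at the same path it has on the file server. Hosts and paths are made up;
run it from cron each night or trigger it after uploads."""
import subprocess

WEB_SERVERS = ["web01", "web02", "web03"]   # hypothetical host names
SOURCE = "/mnt/images/"                     # path on the file server
DEST = "/mnt/images/"                       # same path on every web server

def sync_all() -> None:
    for host in WEB_SERVERS:
        # -a preserves permissions/timestamps; --delete removes files that
        # were removed on the source, so the copies stay identical.
        subprocess.run(
            ["rsync", "-a", "--delete", SOURCE, f"{host}:{DEST}"],
            check=True,
        )

if __name__ == "__main__":
    sync_all()
```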
