Please give me some hints on my issue.
I'm building a queue data structure that:
is backed up to hard disk in real time,
can restore itself from that backup,
and can handle massive volumes of enqueue/dequeue requests.
Thank you!
Is this an exercise you're doing? If not, you should probably look at one of the production message-queueing technologies (e.g. MSMQ for Windows), which support persisting queues to disk rather than just holding them in memory.
In terms of your requirements:
1. Is backed up to hard disk in real time
Yes, MSMQ can do that.
2. Can restore itself from that backup
And that.
3. Can handle massive volumes of enqueue/dequeue requests
And this...
If you can avoid it, don't roll your own. For Java, try ActiveMQ.
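To give a sense of what a broker buys you, here is a minimal sketch against ActiveMQ's standard JMS API; the broker URL and queue name are assumptions, and the ActiveMQ client jars are assumed to be on the classpath.

```java
// Minimal enqueue/dequeue against an ActiveMQ broker via JMS.
// Broker URL and queue name are illustrative assumptions.
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueDemo {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("demo.queue");

        // Enqueue: PERSISTENT delivery tells the broker to write the message to disk,
        // which covers the "backup in real time" and "restore" requirements.
        MessageProducer producer = session.createProducer(queue);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);
        producer.send(session.createTextMessage("hello"));

        // Dequeue: the message survives a broker restart because it was persisted.
        MessageConsumer consumer = session.createConsumer(queue);
        TextMessage msg = (TextMessage) consumer.receive(1000);
        System.out.println(msg.getText());

        connection.close();
    }
}
```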
You're probably looking at something more complex than just a simple library.
Since this is an exercise (probably meant to get you thinking about the underlying data structures), you might start simply by queueing to a MySQL database. Your performance will suck compared to dedicated queueing software, but you can at least get the rest of the infrastructure working.
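As an illustrative sketch of that approach, a database-backed queue can be as small as one table; the table and column names here are assumptions, not a prescribed schema.

```java
// Sketch of a MySQL-backed queue. Assumes a table like:
//   CREATE TABLE queue (id BIGINT AUTO_INCREMENT PRIMARY KEY, payload TEXT);
// The database's own durability provides the on-disk backup and restore.
import java.sql.*;

public class DbQueue {
    private final Connection conn;

    public DbQueue(Connection conn) { this.conn = conn; }

    // Enqueue is just an INSERT.
    public void enqueue(String payload) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO queue (payload) VALUES (?)")) {
            ps.setString(1, payload);
            ps.executeUpdate();
        }
    }

    // Dequeue locks the oldest row, deletes it, and returns it in one transaction,
    // so concurrent consumers never hand out the same item twice.
    public String dequeue() throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement sel = conn.prepareStatement(
                 "SELECT id, payload FROM queue ORDER BY id LIMIT 1 FOR UPDATE");
             ResultSet rs = sel.executeQuery()) {
            if (!rs.next()) { conn.commit(); return null; }
            long id = rs.getLong("id");
            String payload = rs.getString("payload");
            try (PreparedStatement del =
                     conn.prepareStatement("DELETE FROM queue WHERE id = ?")) {
                del.setLong(1, id);
                del.executeUpdate();
            }
            conn.commit();
            return payload;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```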
After that, you'll probably be looking at some form of custom file format and a multi-threaded server on top of it. You might be able to pull it off using something like BDB or SQLite for the I/O layer, thus saving you the trouble of writing the actual disk routines.
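If you do go the custom-file-format route, the usual core idea is an append-only journal that is replayed on startup. Here's a minimal single-node sketch, assuming items contain no newlines; the record format and names are made up for illustration.

```java
// An in-memory queue backed by an append-only journal.
// "E <item>" records an enqueue, "D" records a dequeue; replaying the
// journal in order rebuilds the queue after a crash or restart.
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayDeque;
import java.util.Deque;

public class FileBackedQueue {
    private final Deque<String> memory = new ArrayDeque<>();
    private final BufferedWriter journal;

    public FileBackedQueue(Path journalPath) throws IOException {
        if (Files.exists(journalPath)) {
            for (String line : Files.readAllLines(journalPath)) {
                if (line.startsWith("E ")) memory.addLast(line.substring(2));
                else if (line.equals("D")) memory.pollFirst();
            }
        }
        journal = Files.newBufferedWriter(journalPath,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public synchronized void enqueue(String item) throws IOException {
        journal.write("E " + item);
        journal.newLine();
        journal.flush(); // flush before acknowledging, so the on-disk copy is current
        memory.addLast(item);
    }

    public synchronized String dequeue() throws IOException {
        String item = memory.pollFirst();
        if (item != null) {
            journal.write("D");
            journal.newLine();
            journal.flush();
        }
        return item;
    }
}
```

A real implementation would also compact the journal periodically (rewrite it from the live contents) so it doesn't grow without bound.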
Related
I'm currently working on a traditional monolith application, but I am in the process of breaking it up into Spring microservices managed by Kubernetes. The application allows the uploading/downloading of large files, and these files are normally stored on the host filesystem. I'm wondering what the most viable method of persisting these files in a microservice architecture would be.
You have a bunch of different options; Googling your question will turn up many answers for any budget and taste. Basically, you'd want high-availability storage like AWS S3. You could set up your own dedicated server to store these files if you wanted to cut costs, but then you'd have to worry about backups and availability yourself. If you need low-latency access to these files, you'd want to put them behind a CDN as well.
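As a rough sketch, storing and fetching an uploaded file with the AWS SDK for Java (v1) looks something like this; the bucket name is an assumption, and credentials are taken from the environment.

```java
// Sketch: file persistence via S3 instead of the host filesystem.
// Assumes aws-java-sdk-s3 on the classpath and credentials configured
// via the usual environment/instance mechanisms. Bucket name is illustrative.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import java.io.File;

public class FileStore {
    private static final String BUCKET = "my-app-uploads"; // assumption
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void upload(String key, File file) {
        s3.putObject(BUCKET, key, file);
    }

    public File download(String key, File target) {
        s3.getObject(new GetObjectRequest(BUCKET, key), target);
        return target;
    }
}
```

Because any service instance can reach the bucket, file access no longer ties a request to the particular pod that originally received the upload.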
We are mostly on-prem and ended up using NFS. It's the path of least resistance, but it's probably not the most performant option, and making it highly available is tough. If you have the chance, I agree with Denis Pshenov that an S3-like system, for example MinIO, might be a better alternative.
Maybe you should have a look at the Rook project (https://rook.io/). It's easy to set up and provides different kinds of storage and persistence technologies to your cloud-native applications.
There are many places to store your data. It depends on the budget you are able to spend (holding duplicate data also means more storage, which costs money) and mostly on your business requirements:
Is all data needed at all time?
Are there geo/region-related cases?
How fast does a read/write operation need to be?
Do things need to be cached?
Stateful or stateless?
Are there operational requirements? How should this be maintained?
...
Apart from this, your microservices should not know where the data is actually stored. In Kubernetes you can use Persistent Volumes (https://kubernetes.io/docs/concepts/storage/persistent-volumes/), which can link to storage from your cloud provider or elsewhere. The microservice should just mount the volume and be able to treat it like a local file.
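As a rough sketch, the manifest side of that is a PersistentVolumeClaim plus a volume mount; all names, sizes, and the image below are illustrative assumptions.

```yaml
# Sketch: the service sees durable storage as a plain local directory.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: file-service
spec:
  containers:
    - name: app
      image: example/file-service:latest   # illustrative image
      volumeMounts:
        - name: uploads
          mountPath: /data/uploads         # the app reads/writes here like a local path
  volumes:
    - name: uploads
      persistentVolumeClaim:
        claimName: uploads-pvc
```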
Note that the cloud providers' storage offerings already include solutions for scaling, concurrency, etc., so I would probably use a single blob storage under the hood.
However, it has to be said that there is a trend toward understanding a microservice as a package of data and logic coupled together, and toward accepting duplicated data, which leads to better scalability.
For more information, see:
http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/
https://github.com/katopz/best-practices/blob/master/best-practices-for-building-a-microservice-architecture.md#stateless-service-instances
https://12factor.net/backing-services
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html
I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide fairly instantaneous full-text search of those logs at each site, for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras:
The log data is (as usual) highly compressible. Any solution that compresses data in transit from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that Elasticsearch would probably be really good for this, though from what I understand it doesn't support multi-site deployments by itself.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations for achieving multi-site with the options that don't support it by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.
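To give a feel for the application side, writing a log document with the Couchbase Java SDK (3.x) looks roughly like this; XDCR itself is configured on the cluster rather than in application code, and the hostname, bucket, and document shape below are assumptions.

```java
// Sketch: each site writes logs to its local cluster; XDCR (configured
// separately on the cluster) replicates documents between sites.
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;

public class LogWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("couchbase://site-a.example", "user", "password");
        Collection logs = cluster.bucket("logs").defaultCollection();

        // Upsert a log entry as a JSON document; the key encodes site and sequence.
        logs.upsert("log::site-a::0001", JsonObject.create()
                .put("site", "site-a")
                .put("level", "INFO")
                .put("message", "test run started"));

        cluster.disconnect();
    }
}
```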
I've just started out with Ehcache, and it seems pretty good so far. I'm using it in a simplistic fashion to speed up reads against a database, but I wonder whether I can also use it to keep the application up if the database is unavailable for short periods. (Update: my context is an application with high-availability modules that only read from the database.)
It seems like I could do that by disabling expiry in the event of a database read problem, and re-enabling it when a read works again.
What do you think? Is that a reasonable approach or have I missed something? If it's a fair approach, any tips for how best to implement appreciated.
Update: Ehcache supports a dynamically configurable option to set/unset a cache as 'eternal'. This seems to do what I need.
Interesting question - usually, the answer would be "it depends".
Firstly, if you have database reliability problems, I'd invest time and energy in fixing them, rather than applying a bandaid solution.
Secondly, most applications need both reading and writing to work - it doesn't seem to make sense to keep your app up for reads only.
However, if your app has a genuine "read only" function, and there's a known and controlled reason for database down time (e.g. backups), then yes, you can use your cache to keep the application up and running while the database is down. I would do this by extending the cache periods, rather than trying to code specific edge cases. For instance, you might have a background process which checks whether the database is available and swaps in a different configuration file when there's trouble.
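A minimal sketch of that idea with the Ehcache 2.x API; the cache name and the health-check wiring are assumptions, and the cache itself is expected to be defined in ehcache.xml.

```java
// Sketch: stop evicting cache entries while the database is down.
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;

public class CacheFailover {
    private final Cache cache = CacheManager.getInstance().getCache("readCache");

    // Invoked periodically by a background health-check thread (assumption).
    public void onHealthCheck(boolean databaseUp) {
        // eternal=true suspends expiry entirely; flipping it back restores
        // the normal TTL behaviour once the database recovers.
        cache.getCacheConfiguration().setEternal(!databaseUp);
    }
}
```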
I am currently building a high-traffic GIS system that uses Python on the web front end. The system is 99% read-only. In the interest of performance, I am considering using an externally generated cache of pre-generated, read-optimised GIS information, stored in an SQLite database on each individual web server. In short, it's going to be used as a distributed read-only cache that doesn't have to hop over the network. The back-end OLTP store will be PostgreSQL, but that will handle less than 1% of the requests.
I have considered using Redis, but the dataset is quite large, so it would push up the administrative and memory cost on the virtual machines this is hosted on. Memcached is not suitable, as it cannot do range queries.
Am I going to hit read-concurrency problems with SQLite doing this?
Is this a sensible approach?
OK, after much research and performance testing: SQLite is suitable for this. It has good request concurrency on static data. SQLite only becomes an issue if you are doing writes as well as heavy reads.
More information here:
http://www.sqlite.org/lockingv3.html
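For illustration, a read path like the following can be called from many threads at once, since SQLite only takes exclusive locks for writes. This sketch uses Java with the xerial sqlite-jdbc driver; the file path and table are assumptions, and the same pattern applies with Python's sqlite3 module.

```java
// Sketch: per-server read-only SQLite cache lookup.
import java.sql.*;

public class LocalGisCache {
    // Local file on each web server, so lookups never cross the network.
    private static final String URL = "jdbc:sqlite:/var/cache/gis.db"; // assumption

    public static String lookup(String tileId) throws SQLException {
        // One short-lived connection per lookup keeps the example simple;
        // a pool would be used in practice.
        try (Connection conn = DriverManager.getConnection(URL);
             PreparedStatement ps =
                 conn.prepareStatement("SELECT data FROM tiles WHERE id = ?")) {
            ps.setString(1, tileId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("data") : null;
            }
        }
    }
}
```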
If the use case is just a cache, why don't you use something like http://memcached.org/? You can find memcached bindings for Python in the PyPI repository.
Another option is to use materialized views in Postgres; this way you keep things simple and have everything in one place.
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
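A minimal sketch of that approach from Java via JDBC; the view, table, and connection details are illustrative assumptions.

```java
// Sketch: precompute read-optimised data inside Postgres itself.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefreshView {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/gis", "app", "secret");
             Statement st = conn.createStatement()) {
            // One-time setup: materialize the expensive read-optimised query.
            st.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS tile_cache AS "
                     + "SELECT id, data FROM tiles");
            // Re-run whenever the underlying OLTP data changes.
            st.execute("REFRESH MATERIALIZED VIEW tile_cache");
        }
    }
}
```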
I've found no clear answer so far, but maybe I've been searching the wrong way.
My question is: can Core Data be used as persistent storage for a server project? Where are Core Data's limits, and how much data can be handled with Core Data and SQLite? SQLite should handle a lot of data very well, according to its website. I know of a proprietary Java persistence manager with an Oracle DB as storage that handles millions of entries and 3,000 clients without problems. For my own project, I wonder whether I can use Core Data on the server side for user management and internal microblogging/texting with up to 5,000 clients. Will it handle such big amounts of data, or do I have to manage something like that myself? Does anyone happen to have experience with huge amounts of data and Core Data?
Thank you
twickl
I wouldn't advise using Core Data for a server-side project. Core Data was designed to handle the data of individual, object-oriented applications; therefore, it lacks many of the common features of dedicated server software, such as easily handling multiple simultaneous accesses.
Really, the only circumstance where I would advise using it is when the server-side logic is very complex and the number of users small. For example, if you wanted to write an in-house web app with almost all the logic on the server, then Core Data might serve well.
Apple used to have WebObjects which was a package to manage servers using an object-oriented DB much like Core Data. (Core Data was inspired by a component of WebObjects called Enterprise Objects.) However, IIRC Apple no longer supports WebObjects for external use.
You're better off using one of the many dedicated server packages out there than trying to roll your own.
I have no experience using Core Data in the manner you describe, but my understanding of the architecture leads me to believe that it could be used, depending on how you plan to query and manipulate the data.
Core Data is very good at maintaining an object graph and using faults to bring parts into memory as needed. In that manner, it could be good on a server for reducing memory requirements even with a large data set.
Core Data is not very good at manipulating collections of objects without loading them into memory, making a change, and writing them back out to disk. Brent Simmons wrote a blog post about this, where he decided to stop using Core Data for some of his RSS reader's model objects because an operation like "mark all as read" didn't scale. While you would like to be able to say something like UPDATE articles SET status = 'read', Core Data must load each article, set its status property, then write it back to disk.
This isn't because Apple engineers are stupid, but because the query layer can't make assumptions about the storage layer (you could be using XML instead of SQLite) and it also must take into account cascading changes and the fact that some article objects may already be loaded into memory and will need to be updated there.
Note that you can also write your own storage providers for Core Data; see Aaron Hillegass's BNRPersistence project. So if Core Data were "mostly good", you might be able to improve on it for your application.
So, a possible answer to your question is that Core Data may be appropriate for your application, as long as you do not need to rely on batch updates to large numbers of objects. In general, no algorithm or data structure is appropriate for every scenario. Engineering is about wisely choosing between trade-offs; you won't find anything that works well for many clients in every case. It always matters what you are doing.