With tens of thousands of upstream users and frequent changes, what are the problems with the current implementation of Apache APISIX How to solve it? - apache-apisix

If I have tens of thousands of upstream now and it's changing a lot, how can I use Apisix to better solve this, are there any best practices?

Related

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide full text search of those logs at each site fairly instantaneous for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras
The log data is (as usual) highly compressible. Any solution that compresses data transacting from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that ElasticSearch would probably be really good for this, though from what I understand that doesn't support multi-site.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.

Monitoring solution for EC2 based deployment

We have some 20 or so servers in EC2, most are dynamically spawned (scaling groups).
We're looking for a solution to monitor the uptime of our application.
As an added bonus this solution could also extend to actually monitoring the servers involved so its easy to go back in time and see what happened just before a downtime or whatnot.
We're looking for a hosted solution ideally, and it should be easy to scale with it (it needs to somehow dynamically deal with servers being added/removed with no interaction from us).
Anyways, hoping for some recommendations from you guys.
A bit of background ...
We're currently using a custom Nagios setup, its been reduced to basically doing a simple http check now that the servers have become fully dynamic. We've already been using PagerDuty to deliver the pages. It does ok, but for the maintenance cost we could well be using a http check # Server Density of Pingdom.
I've looked briefly at ServerDensity, and it does look promising, I especially like their install mechanism of just dumping their files into your AMI and it takes care of the rest.
I'd like to know what options there are tho before diving deeper into any particular solution.
We use a combination of Server Density for monitoring and PagerDuty for alerting. The two work quite well together.

CDN vs Homegrown Caching

My understanding of a CDN (like Akamai or Limelight) is that they are heavy-duty caching services.
I also understand that they are very expensive. So I'm wondering why I can't just create my own cluster of replicated caching servers (using, say, EhCache or Memcached) for all my web app's caching needs (images, URL hits/responses, Javascripts, etc) and basically get the same thing?
In essence, from a developer's perspective, what are the benefits (technical or otherwise) of paying a CDN vs. just using your own caching solution? Or, if I have completely misunderstood what CDNs are, please correct my understanding! Thanks in advance!
The key to this is the N in CDN, Content Delivery Network.
The advantage of a CDN (a good one, at least) is that it's geographically distributed. What the big CDN providers do (the ones you mentioned along with others) is have hundreds or thousands of servers in facilities all over the world. This lets what ever resource the CDN is serving sit as geographically close as possible to the end user no matter where they are in the world.
You can certainly replicate the functionality. At a very basic level most CDN's are simply an object/value store which plenty of software can do - what you can't do, at the scale these guys do it at any rate, is have servers all over the world to serve and replicate these objects.
If all you're interested in is an efficient way to store and serve static files then an object store coupled with something like nginx would do just fine. A CDN is used to make sites load as fast as they possibly can, but they come at a price.
For the record, Amazon's Cloud Front, while not as distributed as Limelight or Akamai is a lot cheaper and a good middle ground compromise on cost vs service.
The advantage of big CDNs is that they own distributed resources all over the world, allowing them to serve the users from a servers near them. Besides caching, this is the major point that makes them fast.
To build your own CDN, you would have to install servers on multiple continents, arrange good Internet connectivity for them, setup caching and make sure you have the servers all synchronized. It's not impossible to do that, but it will be very cost-intensive and might not be profitable.
With all due respect, I disagree with the other two answers given thus far to this question. Just to be clear, I'm not saying they are wrong, I'm just offering a different perspective.
While it is certainly true that the key to CDN performance is the "N," as Bulk eloquently explained, that doesn't mean you can't build your own CDN. The question is whether or not it's worth the time (and by definition, the money).
We live in a world of cheap servers and even cheaper virtual machines. Sure the big CDN networks have thousands/millions of servers all over the world, but that's because they have thousands/millions of sites to serve. Depending on the size of your site/app, all you really need in terms of resources is as much as your site(s) need. If you're small, the minimum might be a VPS on each US coasts, one in Europe, maybe two in Asia, and one in Australia. Sure, the hardware costs are too high for your typical homepage, but they are certainly not extreme, and if you're looking at CDNs in the first place, they are probably within your budget.
To me, commercial CDN services just provide PaaS convenience, but there is nothing preventing you from getting IaaS and building up your own platform.
One more thing on this topic:
I once read a comment either by David Heinemeier Hansson (the creator of Ruby on Rails) or by someone referring to him that went something along the lines of: The owner(s) of 37 Signals were concerned DHH was using Ruby to build their application. At that point Ruby was still very obscure. Almost all web hosts were offering PHP, Perl, and Microsoft technologies. When asked about the fact that there were only a handful of Ruby hosts in the world, DHH asked, "well how many do you need?"
To me the point is you need to look at what's best for your needs and the needs of your application, and not necessarily what the guys with millions of servers think.

Strategies For AutoCompletion Web Services in .Net. Non UI focused

I'm getting a little tired of all the UI demos of auto completion in ASP.Net. I believe the UI portion of autocompletion has been solved multiple times over again.
My question is how do you best handle the queries hitting your webservices? I'm currently implementing an autocompletion service for a musician database. The database is fairly small with only 20,000 rows, but autocompletion is extremely speed sensitive. It needs to be fairly instant to be of any use.
I'm currently using NHibernate for my DAL, but I'm wondering if this is a place where I may want to bypass NHibernate. Perhaps projections on named queries would be the best strategy? Where do I cache? NHibernate's 2nd level cache? Let the web service cache?
I've already thought of a lot of naive methods to develop this, but I would like to soak in any tips that people already have in the wild. Also, what if you have many different types of entities you want autocompletion on? Do you spread those implementations around in their different repositories or do you design/implement a completely separate autocompletion service?
This depends on how large your sites traffic is. I generally suggest using a product such as MemCached or MemCached Win32 depending on your environment availability (MemCached for cheap linux boxes if you can is best...all that is needed is a ton of memory!). You might also look to something like Velocity (MS's new cache cloud offering). This would then allow you to cache a key (what ever the query is) with the results efficiently! Keep your cache times down based on however frequently you are updating your dataset. If you don't update often then the cache time can be longer. If you find that your cache cloud is growing like crazy you might want to only cache what is most frequently asked for (though your cache implementation should handle this by removing what is not accessed frequently!).

Has anyone tried using ZooKeeper?

I was currently looking into memcached as way to coordinate a group of server, but came across Apache's ZooKeeper along the way. It looks interesting, and Yahoo uses it, so it shouldn't be bad, but I'd never heard of it before, so I'm kind of skeptical. Has anyone else given it a try? Any comments or ideas?
ZooKeeper and Memcached have different purposes. You can use memcached to do server coordination, but you'll have to do most of this work yourself. Memcached only allows coordination in that it caches common data lookups to be used by multiple clients. From reading ZooKeeper's documentation, it has a much broader focus than this. ZooKeeper seems to provide support for server clustering, which isn't the same as the cache clustering memcached provides.
Have a look at Brad Fitzpatrick's Linux Journal article on memcached to get a better idea what I mean.
To get an overview of what Zookeper is capable of, watch the following presentation by it's creators. It's capable of so much more (creating queue's, electing master processes amongst a group of peers, distributed high performance run time configurations, rendezvous points for dis-joined processes, determining if processes are still running, etc).
http://zookeeper.sourceforge.net/index.sf.shtml
To answer your question, if "coordination" is what you are looking for Zookeeper is much better targeted at that than memcached.
Zookeeper is great for coordinating data across servers. It does a good job of ordering every transaction and making guarantees that transactions happen in order. However when first breaking into it the documentation sucks; it's very 'high-level' without enough concrete examples or explanations as how to properly handle certain events. One of the included examples (as of version 3.3.3) had its own bugs in it.
Your code will also need to be cognizant of event driven interactions, and polling interactions. With massively distributed architecture, when acting upon 'events' you can inadvertently create a stampede that could not be desirable for your environment (herding effect).

Resources