Have any good links/articles on job queuing & db sharding? - sharding

Does anyone have good links on how/when/why to use job queuing to scale web apps? Also, articles on db sharding would be useful too :)

This fantastic presentation covers lots of scalability issues, including job queuing and db sharding. There are also quite a few Stack Overflow questions about this:
How would I learn more about sharding userdata for a website?
When people talk about scaling a website with 'shards', what do they mean?
Resources for Database Sharding and Partitioning
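For readers wondering what "sharding user data" looks like in practice, here is a minimal sketch of hash-based shard routing. It is not taken from the linked presentation or questions; the host names and the `shard_for`/`host_for` helpers are hypothetical, and real systems often use consistent hashing or a lookup table so that shards can be added without remapping most keys.

```python
import hashlib

# Hypothetical shard hosts, used only for illustration.
SHARDS = [
    "db0.example.internal",
    "db1.example.internal",
    "db2.example.internal",
]


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index using a stable hash.

    A cryptographic hash is used instead of Python's built-in hash()
    because hash() of a str changes between interpreter runs.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def host_for(user_id: str) -> str:
    """Return the database host that stores this user's rows."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, "->", host_for(uid))
```

The point is that every query for a given user goes to exactly one database, so each server only holds a fraction of the data. The trade-off with plain modulo routing is that changing the number of shards moves most users to a different shard.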

Related

application insights vs elastic (ELK)

Either I am really bad at searching, or there is no detailed comparison between App Insights and the ELK stack?
The monitoring is going to be used for a simple Web API. There are going to be tons of endpoints, but user traffic should not be too high.
So my question: are there any general points/differences to consider when choosing between ELK and App Insights? Personally, I have never had a chance to set up either of them, but before setting up a test environment it would be nice to know in advance what to expect/look for.
I'm from the App Insights team. I think the link provided by #rickvdbosch in a comment gives quite a good perspective. It is 1+ years old at this point, so some items regarding App Insights have evolved since then.
I think App Insights and ELK are quite different offerings. The former is a managed offering (you can set it up within a couple of minutes), focused on a very broad range of out-of-the-box experiences (collecting incoming/outgoing requests, exceptions, smart alerts, availability monitoring, analytics, live metrics, application map, end-to-end transactions across apps).
My understanding of ELK is that it has very powerful UI visualization and powerful dashboards (though there are adapters for Kibana to work with Azure Monitor). For scenarios where there is a need to store a lot of data (highly loaded apps with adaptive sampling still store a limited amount of data), an ELK solution might be cheaper to run.
The final decision was to use ELK, as the servers already have all the configuration because another team uses it, and mainly because logging will need a lot of customization.

understanding about stackoverflow underlying software infrastructure

I wonder which databases/combination of databases Stack Overflow uses underneath to manage extensive user profile information across various verticals.
In the case of social networking sites like Twitter and Facebook, Big Data management is done over Hadoop. Does Stack Overflow also handle such high volumes of data?
How about indexing the information? Is Redis part of Stack Overflow's solution?
It would be really interesting to understand the solution deployed at the world's most popular technical forum.
This article provides a glimpse at what stackoverflow's architecture looks like circa March 2011: http://highscalability.com/blog/2011/3/3/stack-overflow-architecture-update-now-at-95-million-page-vi.html
At a high level, it's a .NET application which uses MS SQL Server for the database, Redis for caching, HAProxy for load balancing, and a whole host of other tools, hosted on both Windows servers and Linux servers (Ubuntu + CentOS).
It doesn't look like they had any Hadoop usage at the time of that article, but that could have changed. They might also be doing something different/custom for map/reduce type jobs, or might not need anything like that at all yet. With care, SQL servers can be scaled pretty far without needing to lean on "big data" toys. This is especially true if you can serve most of your reads out of your caching layer.
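To illustrate the point about serving most reads from the caching layer, here is a minimal cache-aside sketch. It is not Stack Overflow's code: the `CacheAside` class is hypothetical, a plain dict stands in for Redis, and `load_from_db` stands in for a SQL query.

```python
from typing import Callable, Dict


class CacheAside:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._load_from_db = load_from_db   # stand-in for a SQL query
        self._cache: Dict[str, str] = {}    # stand-in for Redis
        self.db_hits = 0                    # counts how often the database is touched

    def get(self, key: str) -> str:
        if key in self._cache:
            return self._cache[key]         # cache hit: no database work
        self.db_hits += 1
        value = self._load_from_db(key)     # cache miss: query the database
        self._cache[key] = value            # populate the cache for later reads
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after the underlying row changes."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    store = CacheAside(lambda key: f"profile for {key}")
    store.get("user:1")
    store.get("user:1")
    print("database hits:", store.db_hits)
```

Repeated reads of the same key only reach the database once, which is why a well-populated cache lets a single SQL server go a long way. A real deployment would also need expiry and invalidation on writes.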

hosted BigQuery instance

Is there any way I can host big query software on my company server?
The company does not want the data to be anywhere other than its own data center.
What are BigQuery alternatives? (cloud as well as hosted)
Is there any way I can host big query software on my company server?
Google BigQuery is an implementation of the Google Dremel paper, but it is offered as a service and is not available as software to be installed on-premises.
What are big query alternatives? (cloud as well as hosted)
Apache Drill is an implementation of the above-mentioned Dremel, but it has just started and might take some time to materialize.
Cloudera has recently announced Impala for real-time queries on Hadoop. Check the blog for more details.
I would be interested to know some other alternatives for real-time queries on Big Data.
Edit: Here is an interesting article from InfoWorld on the same.
Hive and Pig are two common solutions to making a queryable system, but since you mentioned Google's Big Query, I assume you mean real-time queries.
In addition to the real-time solutions mentioned by Praveen, there are some workarounds for making other column-oriented solutions faster by writing redundant stores in a denormalized fashion. Think of it this way: you can 'pre-join' the data in a column family, as long as you understand that you're trading fast access against excess volume and slower insertion speed.
-t.
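The 'pre-join' idea can be sketched as follows. This is an illustration only: the `users` and `orders` records and the `prejoin` helper are made up, and plain Python structures stand in for a column family.

```python
# Normalized source data (hypothetical): orders reference users by id.
users = {
    1: {"name": "Ann", "region": "EU"},
    2: {"name": "Bob", "region": "US"},
}
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 2, "total": 40.0},
    {"order_id": 12, "user_id": 1, "total": 15.5},
]


def prejoin(order_rows, user_rows):
    """Write-time join: copy the user fields into every order row.

    Reads no longer need a join, at the cost of storing the user
    fields redundantly and doing extra work on every insert.
    """
    wide_rows = []
    for order in order_rows:
        user = user_rows[order["user_id"]]
        wide_rows.append({
            **order,
            "user_name": user["name"],
            "user_region": user["region"],
        })
    return wide_rows


if __name__ == "__main__":
    for row in prejoin(orders, users):
        print(row)
```

Each wide row can answer a query such as "totals per region" without touching the users table. The cost is the one named above: more stored bytes and slower writes, plus the need to rewrite rows if a user's name or region changes.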

Scalable web project architecture

Where do you get info about 'how to build a scalable, high-performance web app'? I mean architecture, best practices, etc., regardless of platform and language: .NET, PHP, Java ...
Did you have your own 'epic fails' in a project and then refactor your system in a few nights, or did you get the info from the internet?
Are there any communities where I could share my own experience and get some response?
Yeah, I know that every project is individual.
You can read the High Scalability blog. If you have questions about architecture and scalability, you can always use Stack Overflow unless the question is subjective.
It is not easy to answer this question. Language and platform take second place when thinking about scalability.
"Scalability is actually a property of a system, not an individual layer of that system, infrastructure. Even with the best, sexiest, most automatic scaling layer, you can easily write code that just doesn't scale. - glyph"
However, you can immerse yourself in a very good collection of resources on this issue at
http://www.royans.net/arch/library/
Just focusing on the web part of scalable web architecture, you might want to take a look at these 7 reasons why you should be using XMPP instead of AJAX especially if your web app needs to scale with lots of real-time social features.

implementing a Distributed System/database

We are implementing a system at the company I work for, whereby we will need to install the system at various sites of the same client (warehouses). The users at all sites should see the same information. The system should be able to keep working at each site when the network is down. What architecture would be most suitable?
I suggest you consider CouchDB. Its robust replication feature is designed specifically for this sort of use case. It supports both continuous replication, which could keep the data in the various warehouses in sync in near-real-time during normal operation, and occasional replication, which could be used to sync data after a network outage.
There's a really good free O'Reilly book: CouchDB: The Definitive Guide, which has a chapter on replication.
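As a rough sketch of what triggering replication might look like, the snippet below builds (but does not send) a POST to CouchDB's `/_replicate` endpoint with `source`, `target` and `continuous` fields. The host names, the `inventory` database name and the `build_replication_request` helper are assumptions for illustration; check the CouchDB documentation for authentication and for the persistent `_replicator` database before relying on this.

```python
import json
import urllib.request


def build_replication_request(local_couch: str, source_db: str,
                              target_db: str, continuous: bool) -> urllib.request.Request:
    """Build a POST request for CouchDB's /_replicate endpoint.

    continuous=True asks CouchDB to keep the target in sync as changes
    arrive; continuous=False runs a one-off sync, for example after a
    network outage.
    """
    body = {"source": source_db, "target": target_db, "continuous": continuous}
    return urllib.request.Request(
        url=f"{local_couch}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical warehouse node pushing its local database to head office.
    req = build_replication_request(
        local_couch="http://localhost:5984",
        source_db="inventory",
        target_db="http://hq.example.com:5984/inventory",
        continuous=True,
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually start replication: urllib.request.urlopen(req)
```

Each warehouse would run its own CouchDB node and keep writing locally while offline; replication in both directions then brings the sites back in sync once the link returns. Conflicting edits to the same document still need an application-level resolution policy.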

Resources