Best way to prepare for Design and Architecture questions related to big data [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Recently, I attended an onsite interview at a company and was asked design questions related to big data, e.g.: get me the list of users who accessed a website (say Google) between times t1 and t2. What data structures would you use, how would you handle concurrency and stale data, how many servers are needed to store the data, and what are the requirements (software, hardware) of each server?
Please point me to some books/web references to increase my knowledge in this area. Also, please give me some insight into how to answer this type of design question.

This book (free download; on Amazon as Mining of Massive Datasets) was just posted to HN (that thread also has some useful comments). From a first skim it looks really good; you could read that.
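
Beyond reading, it also helps to practice the concrete example from the question. One common first answer is to keep access events sorted by timestamp, so a [t1, t2] query becomes two binary searches. A minimal single-node sketch (all names here are hypothetical, and it assumes events arrive in time order and fit in memory):

    import bisect

    # Append-only access log, assumed to arrive in timestamp order so the
    # lists stay sorted (a real system would sort or merge shards).
    timestamps = []   # sorted event times
    user_ids = []     # user_ids[i] accessed the site at timestamps[i]

    def record(ts, user_id):
        timestamps.append(ts)
        user_ids.append(user_id)

    def users_between(t1, t2):
        """Distinct users with at least one access in [t1, t2]."""
        lo = bisect.bisect_left(timestamps, t1)
        hi = bisect.bisect_right(timestamps, t2)
        return set(user_ids[lo:hi])

At real scale you would partition this by time range across many servers and merge the per-shard results; interviewers are mostly probing whether you can get from the in-memory version to that sharded design.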

Related

Text-Based Databases for Log Search? [closed]

Closed 10 years ago.
I am working with large amounts of multidimensional log data at my company. I have to save and retrieve data from my text database really fast, because there is a lot of data, and when I run a search query (they are not simple queries, e.g. filtering between dates), it takes a significant amount of time.
Here are my points:
We use Lucene, but it doesn't fit our requirements.
We don't use SQL-based databases because they are overkill for storing and querying this volume of log data.
We don't want to use NoSQL databases for log search because of our needs; we need a text-based database.
We are considering PyTables; however, I want to know whether there are other systems for storing logs and searching them quickly.
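
If you do go with PyTables, date-range queries map naturally onto its in-kernel where conditions. A minimal sketch (the file, table, and column names are hypothetical, assuming the tables package is installed):

    import tables

    class LogEntry(tables.IsDescription):
        timestamp = tables.Int64Col()    # e.g. epoch seconds
        level = tables.StringCol(8)
        message = tables.StringCol(256)

    with tables.open_file("logs.h5", mode="w") as h5:
        table = h5.create_table("/", "logs", LogEntry)

        row = table.row
        for ts, level, msg in [(1, b"INFO", b"started"), (2, b"WARN", b"slow")]:
            row["timestamp"], row["level"], row["message"] = ts, level, msg
            row.append()
        table.flush()

        # In-kernel query: evaluated without materializing the whole table.
        hits = table.read_where("(timestamp >= t1) & (timestamp <= t2)",
                                condvars={"t1": 1, "t2": 2})

Creating a completely sorted index on the timestamp column (table.cols.timestamp.create_csindex()) should speed such range queries up further.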

Node.js memory, GC and performance [closed]

Closed 10 years ago.
There are rumors that current Node.js (or, more precisely, the V8 GC) performs badly when there are lots of JS objects and lots of memory in use.
Can you please explain what exactly the problem is: lots of objects, or lots of properties on one object (or array)?
Maybe there are some benchmarks; it would be interesting to see actual code and numbers.
As far as I know, the main problem is lots of properties on one object, not lots of objects themselves (although I'm not sure).
If so, would an in-memory graph database (a couple of hundred properties per node at most) be a good use case?
Also, I heard that the latest versions of V8 have an improved GC that solves parts of this problem. Is this true, and when will it be available in Node.js?

How to design a Cassandra (or another NoSQL) schema? [closed]

Closed 11 years ago.
We are about to move a project on Apache Cassandra from test to pilot, and as an RDBMS team, we were probably missing something.
Basic rules (or lessons learned):
be sure you have either big data or almost no data (nothing in between)
do not believe in extremely cheap storage (cheap, or at least not expensive, might be better)
think of your primary key as if it were a reverse index
think of time (or another data-creation order) as if it were a row/clustering key (see the sketch after this question)
forget about 100% foreign-key integrity whenever you can
sample if you can
do not worry about duplicates
JSON and asynchronous time aggregation on the client can take load off your CPUs
ETL:
sample history if you can (or sample it just for reporting, on a separate reporting cluster)
single-threaded data streams spread over a couple of servers will come in handy
if you can afford asynchronous processing, you can profit from knowledge of your data patterns
throw scrap data away (horizontally and vertically), or it will mislead the BI people, or even board members in the worst case
do not worry about duplicates
The question is: am I still missing something?
Are there other ways to achieve even better performance?
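
To make the "time as clustering key" rule above concrete, here is a minimal sketch using the Python cassandra-driver (the keyspace, table, and column names are all hypothetical, and the metrics keyspace is assumed to already exist):

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # Partition by source, cluster by event time: a time-window scan
    # within one partition is a sequential read on one replica.
    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics.events (
            source_id  text,
            created_at timestamp,
            payload    text,
            PRIMARY KEY (source_id, created_at)
        ) WITH CLUSTERING ORDER BY (created_at DESC)
    """)

    rows = session.execute(
        "SELECT * FROM metrics.events "
        "WHERE source_id = %s AND created_at >= %s AND created_at < %s",
        ("sensor-1", datetime(2014, 1, 1), datetime(2014, 2, 1)))

Because rows are clustered by created_at inside each partition, this is exactly the ordered, append-mostly access pattern Cassandra rewards.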

Which data structure should be used for storing a large amount of data, but not in any RDBMS? [closed]

Closed 11 years ago.
This question was asked in an interview. First, I came up with a B-tree. He asked me to be more specific and to describe how I would store the data so that it would be easier to retrieve.
Can you please shed some light on this? Thanks in advance.
Your question isn't really clear.
"Good" ways to store the data depend on what you want to do with it.
If you want to access parts of your data, a list of offsets suffices. If you want to search in text, an additional inverted index in combination with a docId -> offset table is great. If you have frequent updates to your data and reading is rare, none of those make sense. So it really depends.
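
As a minimal illustration of that inverted-index-plus-offsets combination (hypothetical helper names; a real index would also compress its postings lists):

    from collections import defaultdict

    index = defaultdict(list)   # term -> sorted list of docIds (postings)
    offsets = {}                # docId -> byte offset of the record on disk

    def add_document(doc_id, offset, text):
        offsets[doc_id] = offset
        for term in set(text.lower().split()):
            index[term].append(doc_id)   # docIds arrive in increasing order

    def lookup(term):
        """Return (docId, offset) pairs for documents containing the term."""
        return [(d, offsets[d]) for d in index.get(term.lower(), [])]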
Sounds like an open question, so you can demonstrate your vast experience of ... well, http://en.wikipedia.org/wiki/NoSQL would be my guess, but you could argue that http://en.wikipedia.org/wiki/Dbm answers the question.

What's the best way to determine availability or uptime of my systems? [closed]

Closed 11 years ago.
By this I mean: what's the best way to show the uptime of systems? Ideally I'd like to show some sort of percentage figure, like what the web hosts do, i.e. 99.5% uptime.
Is there a standard way to determine this?
We use Pingdom to monitor our servers, and they generate the sort of numbers you're looking for (we just use the free account). They also seem to have an API which will let you get your info programmatically; no guarantees that'll work with a free account, though.
Hope this helps!
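
Whatever tool does the monitoring, the reported figure is just arithmetic over the observation window; a minimal sketch of the usual calculation:

    def uptime_percentage(window_seconds, downtime_seconds):
        """Percentage of the window during which the system was up."""
        return 100.0 * (window_seconds - downtime_seconds) / window_seconds

    # e.g. about 3h39m of downtime in a 30-day month is roughly 99.5% uptime
    month = 30 * 24 * 3600
    print(round(uptime_percentage(month, 13140), 2))  # -> 99.49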
