We are about to move an Apache Cassandra project from test to pilot, and as an RDBMS team we were probably missing something.
Basic rules (or lessons learned):
be sure you have either big data or almost no data (nothing in between)
do not believe in extremely cheap storage (cheap, or at least not expensive, is a more realistic expectation)
think of your primary key as if it were a reverse index
think of time (or another data-creation order) as if it were a row/clustering key (see the sketch after this list)
forget about 100% foreign-key integrity whenever you can
sample if you can
do not worry about duplicates
JSON and asynchronous time aggregation on the client can take load off the server CPUs
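A minimal sketch of the "time as clustering key" idea, using the DataStax Python driver; the contact point, keyspace, table, and column names are all hypothetical:

```python
from datetime import datetime
from cassandra.cluster import Cluster  # DataStax Python driver

# Hypothetical contact point and keyspace; adjust for your environment.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")

# The partition key (sensor_id) behaves like a reverse index: it maps a
# value straight to the node and partition that hold the data. The
# clustering key (event_time) orders rows inside the partition by
# creation time, so time-range reads are contiguous.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id  text,
        event_time timestamp,
        payload    text,        -- e.g. a JSON blob aggregated on the client
        PRIMARY KEY ((sensor_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# The query this layout is built for: one partition, one ordered slice.
rows = session.execute(
    "SELECT event_time, payload FROM readings "
    "WHERE sensor_id = %s AND event_time >= %s AND event_time < %s",
    ("sensor-42", datetime(2013, 1, 1), datetime(2013, 2, 1)),
)
```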
ETL:
sample history if you can (or sample it just for reporting use on a separate reporting cluster); a sampling sketch follows this list
single-threaded data streams spread over a couple of servers will come in handy
if you can afford asynchronous processing, you can profit from knowledge of your data patterns
throw scrap data away (horizontally and vertically), or it will mislead the BI people, or even board members in the worst case
do not worry about duplicates
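Sampling history for a separate reporting cluster can be as simple as reservoir sampling over the incoming stream. This sketch is generic Python, not tied to Cassandra, and the names in it are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item equally likely to end up in the sample.
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# e.g. keep 10,000 representative rows out of millions for reporting:
# report_rows = reservoir_sample(read_history(), 10_000)  # read_history is hypothetical
```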
The question is: am I still missing something? Are there other ways to achieve even better performance?
I am working with large amounts of multidimensional log data at my company. I have to save and retrieve data from my text database really fast, because the volume is large and even a search query that is not especially simple (e.g. between some dates) takes a long time.
Here are my points:
We use Lucene, but it doesn't fit our requirements.
We don't use SQL-based databases, because they are overkill for storing and querying this amount of log data.
We don't want to use NoSQL databases for log search, given our needs; we need a text-based database.
We want to use PyTables; my question is whether there are any other systems that can store logs and search them fast. (A minimal PyTables sketch of what we have in mind follows.)
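For context, a minimal PyTables sketch of the date-range query described above; the file name, table layout, and column names are hypothetical:

```python
import time
import tables as tb

class LogRecord(tb.IsDescription):
    timestamp = tb.Int64Col()    # epoch seconds
    level     = tb.StringCol(8)
    message   = tb.StringCol(256)

# Write some sample records and index the timestamp column.
now = int(time.time())
with tb.open_file("logs.h5", mode="w") as h5:
    table = h5.create_table("/", "logs", LogRecord)
    row = table.row
    for i in range(1000):
        row["timestamp"] = now - i
        row["level"] = b"INFO"
        row["message"] = b"sample log line"
        row.append()
    table.flush()
    table.cols.timestamp.create_index()   # speeds up range queries

# "Between some dates" as an in-kernel query (no Python-level loop).
with tb.open_file("logs.h5", mode="r") as h5:
    table = h5.root.logs
    t1, t2 = now - 500, now
    hits = table.read_where("(timestamp >= t1) & (timestamp < t2)")
    print(len(hits))
```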
We have an application with 65 GB of data in MS SQL Server, spread across around 250 tables and 1,000 stored procedures and functions.
The application is completely DB-specific, with almost all of the logic coded in procedures and functions. Some of the stored procs take over 4-5 minutes to execute, and we have been given the task of optimizing/re-engineering these slow-running stored procs.
We don't have much information about the project/schema/design, but we do have access to the schema and data, and fortunately we only have to deal with the one module that is slow (although it involves many SPs and functions, each running over 1,000 lines and encompassing application logic).
My question is how to get started with such a project. We have been set an unrealistic deadline of coming up with fixes in 2-3 days, and I have already spent a day setting things up!
What should be the approach:
1. Suggest an increase in hardware infrastructure.
2. Re-engineer the app (push some of the computation to the app side) to make it less DB-centric?
3. Ask for more time (how much?) to optimize this? The funny thing is that we are not the original coders and have very little idea of what the app does, i.e. what is coded in the SPs and functions.
Thanks
You'll need to know the problem areas before you can attempt any fixes.
You say you are just looking at one module to begin with; I'd suggest using tools like SQL Profiler to determine how frequently statements are executed and how long they take, and using that data as a starting point to see if the logic can be optimised. (A lightweight alternative using the dynamic management views is sketched below.)
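As an aside, the same frequency-and-duration data can be pulled from SQL Server's dynamic management views without running Profiler. This is a sketch; the connection string is a placeholder:

```python
import pyodbc

# Placeholder connection string; substitute your server and credentials.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=AppDb;Trusted_Connection=yes;"
)

# sys.dm_exec_query_stats aggregates runtime stats for cached plans;
# sys.dm_exec_sql_text recovers the statement text from the sql_handle.
cursor = conn.cursor()
cursor.execute("""
    SELECT TOP 20
        qs.execution_count,
        qs.total_elapsed_time / 1000 AS total_elapsed_ms,
        qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
        SUBSTRING(st.text, 1, 200) AS statement_start
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC
""")
for row in cursor.fetchall():
    print(row.execution_count, row.avg_elapsed_ms, row.statement_start)
```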
Look for any operations that use cursors and could possibly benefit from a more set-based approach.
As for your three options, I'd say you HAVE to go for (3), because you've stated you don't have a thorough understanding of the app, so you'll need to gain some further exposure in order to establish where to focus your efforts. I don't think (1) is a long-term solution, although it would obviously provide some benefit (how much depends on the current and proposed specs). You'll only know whether (2) is a valid option once you've had a chance to establish the problem areas.
Best of luck.
I'm looking for a framework, a combination of frameworks, best-practices, or a tutorial about visualizing large data sets with Hadoop.
I am not looking for a framework to visualize the mechanics of running Hadoop jobs or managing disk space on Hadoop. I am looking for an approach or a guideline for visualizing the data contained within HDFS using graphs and charts, etc.
For example, let's say I have a set of data points stored in multiple files in HDFS, and I would like to show a histogram of the data. Is my only option to write a custom map/reduce job that works out which points fall into which bucket, writes the totals to a file, and then uses a plotting library to visualize that? (A sketch of that brute-force approach is below.)
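To make the question concrete, the brute-force route might look roughly like this with Hadoop Streaming; the bucket width and all names/paths are made up:

```python
#!/usr/bin/env python
# mapper.py -- emit "bucket<TAB>1" for every numeric data point on stdin
import sys

BUCKET_WIDTH = 10.0  # hypothetical bucket size

for line in sys.stdin:
    try:
        value = float(line.strip())
    except ValueError:
        continue  # skip malformed lines
    print(f"{int(value // BUCKET_WIDTH)}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum counts per bucket (Hadoop sorts mapper output by key)
import sys

current, count = None, 0
for line in sys.stdin:
    bucket, n = line.rstrip("\n").split("\t")
    if bucket != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = bucket, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Run the pair through the hadoop-streaming jar over the HDFS input, pull the small part-* output files locally, and feed them to any plotting library (matplotlib, gnuplot, Excel) for the actual histogram.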
Do I need to roll out a custom solution, or is anyone else out there doing this sort of thing? I've tried looking online, but I haven't been able to find anything that directly relates to this.
Thank you for your help
We do something like this at Datameer. The files would take a few more processing steps to get to our visualizations, but we run natively on Hadoop so the files would not be far away.
Recently I attended an onsite interview at a company and was asked design questions related to big data, e.g.: get me the list of users who accessed a website (say Google) between times t1 and t2. What data structures to use, how to handle concurrency and stale data, how many servers are needed to store the data, the (software and hardware) requirements of each server, and so on.
Please point me to some books/web references to increase my knowledge in this new area, and give me some insight into how to answer these kinds of design questions. (A toy sketch of one possible answer follows.)
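As a toy illustration of the shape of answer such questions look for (entirely hypothetical, and nowhere near a full distributed design): keep access events ordered by timestamp so a time-range query becomes two binary searches, then shard that structure by time range across servers.

```python
import bisect
from collections import namedtuple

Access = namedtuple("Access", ["ts", "user"])

class AccessLog:
    """Toy in-memory index; a real system would shard by time range
    across many servers and merge the per-shard results."""

    def __init__(self):
        self._ts = []      # sorted timestamps
        self._events = []  # events in the same order

    def record(self, ts, user):
        # This sketch assumes events arrive in time order; out-of-order
        # arrivals would need bisect.insort or periodic re-sorting.
        self._ts.append(ts)
        self._events.append(Access(ts, user))

    def users_between(self, t1, t2):
        lo = bisect.bisect_left(self._ts, t1)
        hi = bisect.bisect_right(self._ts, t2)
        return {e.user for e in self._events[lo:hi]}

log = AccessLog()
log.record(100, "alice"); log.record(150, "bob"); log.record(220, "alice")
print(log.users_between(100, 200))  # {'alice', 'bob'}
```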
This book (free download; on Amazon: Mining of Massive Datasets) was just posted to HN (that thread also has some useful comments). From a first skim it looks really good; you could read that.
I have a team lead who seems to think that business logic is very subjective, to the point that if my stored procedure has a WHERE ID = @ID clause, he would call this "business logic".
What approach should I take to define “business logic” in a very objective way without offending my team lead?
I really think you just need to agree on a clear definition of what you mean when you say "business logic". If you need to be "politically sensitive", you could even craft the definition around your team lead's understanding, then come up with another term ("domain rules"?) that defines what you want to talk about.
Words and terms are relatively subjective -- of course, once you leave that company you will need to 're-learn' industry standards, so it's always better to stick with them if you can, but the main goal is to communicate clearly and get work done.
One way to differentiate is that "business logic" is something the customer would care about and that could be explained to a customer without referring to computer-specific words.
You could try to argue your point with a timed example: run a SQL SELECT against an indexed table, and then run a loop in code to find exactly the same item in the same set. The code will be much slower. (A sketch of such a comparison is below.)
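A minimal version of that timed experiment, using Python's built-in sqlite3 so it runs anywhere; the table name and sizes are arbitrary:

```python
import sqlite3
import time

# In-memory database with an indexed (primary key) column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items (id, payload) VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(500_000)),
)

target = 499_999

# 1) Let the database use its index.
start = time.perf_counter()
conn.execute("SELECT payload FROM items WHERE id = ?", (target,)).fetchone()
db_time = time.perf_counter() - start

# 2) Pull every row out and scan in application code.
start = time.perf_counter()
next(p for (i, p) in conn.execute("SELECT id, payload FROM items") if i == target)
loop_time = time.perf_counter() - start

print(f"indexed lookup: {db_time:.6f}s  app-side scan: {loop_time:.6f}s")
```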
Let the database do what it was designed to do: select sets and subsets of data. :) Realistically though, I think all you can do is get your team together to build a set of standards that you will all code to. Democracy rules!