Out of LevelDB, RocksDB, HyperLevelDB, LMDB, and any others that fit in this category, what are the expected performance properties and tradeoffs between them?
I've found various benchmarks and articles making claims about parts of what I'm looking for, but none that are very recent, so it's unclear how much of the information is still accurate or relevant. For example, I'm seeing that the latter three may be better than LevelDB in terms of things like concurrency and read performance, but I'm not sure how much of that is true, or how they would compare to each other either way.
I'm working on a project for university. Do you guys know what kind of algorithms I could implement that would help with the proper design and general performance of a database? So far I have come up with an algorithm that helps the user pick candidate keys, and an algorithm for normalization up to 3NF. Do you have any other ideas or suggestions? Thanks.
This is like asking how to make a car more efficient. It's such a broad question that it's essentially unanswerable. There are so many moving parts in a car, and each one has its own problems. You really need to understand what each component is doing. In the case of databases, you need to understand the data before you try to fix anything. And if you want a good answer, you have to ask the right questions.
A good question should include context on what you are working with, and what you are trying to do. And when it comes to data manipulation, the details are extremely important. How is your data represented? What kind of infrastructure are you working with? What purpose does the data serve, and what processes use this data? If you are working with floating point numbers, are your processes tolerant of small rounding errors? Would your organization even let you make changes to how the data is stored?
In general, adding algorithms to improve data performance is probably largely unnecessary. Databases are designed out of the box to be simple and efficient. If there were a known method to increase efficiency in general without any drawbacks, there's no reason why the designers of the system wouldn't have implemented it already.
I am posting this as an answer because I have no way to say it all in the comment section. You need to understand a basic principle of database design and data model construction: what is your database for? That is the main question, and believe it or not, sometimes even people with experience make this mistake.
As you were saying, 3NF can be good for OLTP systems, but it would be horrendous for a data warehouse or reporting database, where the queries are huge and work on big batch operations. In those systems denormalization almost always gives better results.
Once you know what your database is for, you can start to apply some "best practices", but even here there is a lot of room for interpretation, and worse, the same principle can be good in one place and very bad in another. Let me give you an example from my own experience.
Eight years ago I started a project where we had to design a database for a financial application. After some analysis, we decided to use a star model (a dimension-fact model). We decided to create indexes (including bitmap indexes) for some tables, even though we had to rebuild them during the batch loads to avoid performance degradation.
The funny thing is that after some months I realised the indexes were useless: the users were running queries that accessed the whole data set, mostly analytics and aggregation. Consequence: I dropped all the indexes.
Is it a good thing to do? No, it is not, but in my scenario it was the best thing, and performance increased a lot, both in the batch loads and in the user experience.
In summary, as an old friend of mine who worked in Oracle Support used to tell me: "Performance is an art, my friend, not a science."
There are too many database algorithms to list, but below is a structured way of thinking about classes of algorithms that affect database performance.
Algorithm analysis is a helpful way of categorizing and thinking about many database performance problems. While most performance problems are solved with best practices and trial and error, we'll never truly understand why one solution is better than another without understanding the algorithms behind them. Below is a list of functions that describe the algorithmic complexity of different database operations, ordered from fastest to slowest (a small sketch after the list makes two of these classes concrete).
O(1/N) - Batching to reduce overhead for bulk collect, sequences, fetching rows
O(1) - Hashing for hash partitioning, hash clusters, hash joins
O(LOG(N)) - Index access for b-trees
1/((1-P)+P/N) - Amdahl's Law and its implications for parallelizing large data warehouse workloads
O(N) - Full table scans, hash joins (in theory)
O(N*LOG(N)) - Full table scan versus repeated index reads, sorting, global versus local indexes, gathering statistics (distinct approximations and partition birthday problems)
O(N^2) - Cross joins, nested loops, parsing
O(N!) - Join order
O(∞) - The optimizer (satisficing and avoiding the halting problem)
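To make two of these classes concrete, here is a small, hypothetical Python sketch (not tied to any particular database) comparing an O(N) scan with an O(LOG(N)) lookup over sorted keys - essentially the trade-off between a full table scan and a b-tree index access:

```python
import random

# Hypothetical "table": one million sorted integer keys.
keys = sorted(random.sample(range(10_000_000), 1_000_000))
target = keys[700_000]

# O(N): look at every key until we find the target (full table scan).
def full_scan(keys, target):
    steps = 0
    for k in keys:
        steps += 1
        if k == target:
            return steps
    return steps

# O(LOG(N)): binary search over the sorted keys (roughly what a b-tree does).
def index_lookup(keys, target):
    steps, lo, hi = 0, 0, len(keys)
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps

print("scan comparisons: ", full_scan(keys, target))    # ~700,001
print("index comparisons:", index_lookup(keys, target)) # ~20
```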
One suggestion - based on the way you phrased your question and comments, you're thinking of a database as merely a place to store data. But the most interesting parts of a database show up when you think of it as a joining machine. There's not much to optimize about data sitting around; the real work happens when data is combined.
The above list is based on Chapter 16 of my book, Pro Oracle SQL Development. You can read an early version of the entire chapter for free here. While the chapter mostly stands alone, it requires an advanced understanding of Oracle. But each of the topics could be the basis for a lifetime of academic study, so you only need to pick one.
I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data.
Each user has a really huge number of items, but the number of users is small, like thousands or tens of thousands.
I have searched on Google and found there are some open-source recommendation engines based on Hadoop, like Mahout; I guess they may be able to handle such big data, but I'm not sure.
I also found some engines written in C++, Python, even PHP, but I don't think scripting languages can deal with such big data, since memory can't hold the whole dataset.
Or am I wrong? Could someone give me a recommendation?
Your question title is:
"Which opensource recommendation system should I choose to deal with big dataset?"
and in the first line you say:
"I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data."
And you are asking for a recommendation as an answer.
To answer your second question first: in my experience of building recommender systems, I would advise you not to "build" a recommender system from the ground up if you can avoid it. Recommender systems are complex and can use a wide range of techniques to provide a user with a recommendation. So my recommendation is: unless you are really committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering, look to implement an existing recommender system rather than building your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer it by breaking it down into three areas:
1. Consider the open source license, its restrictions, and your requirements.
2. Consider which algorithm you want to use to make recommendations.
3. Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side, as it will be the determining factor in which tool you can use, or whether you will need to roll your own. Start reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a very brief insight into the different approaches that recommender systems use. In summary, the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
Graph-based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood Collaborative Filtering is quite intuitive to understand and it can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way.
There is no requirement to build a model for training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability. Something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have less users than you do items. In a user-based nearest neighbourhood a predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system it will be faster to compute user-based collaborative filtering compared with item-based collaborative filtering.
Within user-based collaborative filtering you need to consider which rating normalisation (mean-centering vs z-score) you want to use, the similarity weight computation method (e.g. cosine vs Pearson correlation vs other similarity measures), the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider dimensionality reduction).
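As a minimal sketch of the user-based approach described above (hypothetical data, mean-centering plus cosine similarity; a production system such as Mahout does far more, but the shape of the computation is roughly this):

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i4": 5},
    "carol": {"i2": 5, "i3": 2, "i4": 1},
}

def mean(r):
    return sum(r.values()) / len(r)

def cosine(u, v):
    # Similarity over the items both users rated, after mean-centering each user.
    common = set(u) & set(v)
    if not common:
        return 0.0
    mu, mv = mean(u), mean(v)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = (math.sqrt(sum((u[i] - mu) ** 2 for i in common))
           * math.sqrt(sum((v[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, k=2):
    # Predicted rating = the user's mean rating plus a similarity-weighted
    # average of the neighbours' mean-centered ratings of that item.
    neighbours = sorted(
        ((cosine(ratings[user], r), name) for name, r in ratings.items()
         if name != user and item in r),
        reverse=True)[:k]
    num = sum(sim * (ratings[name][item] - mean(ratings[name])) for sim, name in neighbours)
    den = sum(abs(sim) for sim, _ in neighbours)
    return mean(ratings[user]) + (num / den if den else 0.0)

print(predict("alice", "i4"))  # alice has not rated i4 yet
```

Because the similarity matrix is over users, and you have far fewer users than items, precomputing neighbours "offline" stays manageable, which is the point made above.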
So really, instead of looking for an open source tool that will be able to process your data set, you should consider your algorithm choice first, then look for a tool that has an implementation of this algorithm, and then assess whether it can handle the volume involved in your dataset.
In saying all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note that the advice really is: consider the algorithm choice. "Good" recommender systems are much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations; remember, if the recommendations you are churning out are rubbish and turning your users off, then there is no point in having a recommender system!
It is such a big area with lots to think about; there is probably no single tool that is going to help you with everything, so be prepared to do a lot of reading and research, as well as implementing lots of different open source tools to help you.
In saying that, start by looking at Apache Mahout. Going back to the break-down of the three areas I said you should think about:
1. It has a commercial-friendly open-source license,
2. it has really great implementations of the algorithms you are likely going to need to use, and
3. it can work in distributed environments (read: scalable).
Hope that helps, and good luck.
I have a problem with making search output more practically useful for the end users. The problem is related to the algorithm and approach rather than to the exact technology or framework to use.
At the moment we have a database of products that can be described with the following schema:
From the search perspective we've done the pretty standard things: third-party text search with a token analyzer, handling of mistypes and synonyms (this is not the full list, but as I said, it is rather out of scope). But still we need to perform extra work to make the search results closer to real-life user needs, probably in a somewhat similar way to how Google ranks indexed pages by relevancy. Ideas that we've already considered as potentially applicable to solving the problem:
Analyze the most popular search requests in widespread search engines (it is still a question how to get them) and increase the rank of those entries in the index which correspond to (can be found with) the popular requests;
Increase the rank of the newest (hot) entries;
Increase the rank of the biggest group of entries which correspond to a popular request and have something in common (that's why it is a group).
I'd appreciate any help, or advice on a direction in which to dig.
You may try pLSA; there are many references on the web, and there should be libraries and source code.
EDIT:
Well, I took a closer look at Lucene recently, and it seems to give a much better answer to what the question actually asked (it does not use pLSA). As for the integration with the DB, you may use Hibernate Search (although it does not seem to be as powerful as using Lucene directly).
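To make the boosting ideas from the question a bit more concrete, here is a small, hypothetical re-scoring sketch in Python (field names and weights are invented for illustration). Whatever engine produces the base text-relevance score - Lucene, pLSA, or anything else - the ideas above mostly amount to combining that score with extra boost terms, for example a recency boost for "hot" entries and a popularity boost for entries matching popular requests:

```python
import math
import time

# Hypothetical search hits: base relevance score from the text engine,
# plus metadata we can use for boosting.
hits = [
    {"id": 1, "text_score": 2.1, "created": time.time() - 86400 * 2,   "popularity": 530},
    {"id": 2, "text_score": 2.4, "created": time.time() - 86400 * 400, "popularity": 3},
    {"id": 3, "text_score": 1.9, "created": time.time() - 3600,        "popularity": 120},
]

def rescore(hit, now=None, half_life_days=30.0):
    now = now or time.time()
    # Recency boost: exponential decay with a configurable half-life ("hot" entries).
    age_days = (now - hit["created"]) / 86400.0
    recency = 0.5 ** (age_days / half_life_days)
    # Popularity boost: log-scaled so huge counts do not completely drown out relevance.
    popularity = math.log1p(hit["popularity"])
    # Weights are invented; in practice you would tune them against real user behaviour.
    return hit["text_score"] * (1.0 + 0.5 * recency + 0.2 * popularity)

for hit in sorted(hits, key=rescore, reverse=True):
    print(hit["id"], round(rescore(hit), 2))
```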
I have some tables that are connected in a primary key-foreign key, one-to-many relation. For example, tables books and authors: a book can have one author and every author can have many books. If I have the book object and I need to use its author's properties, is it better for performance to drill down every time in the code, or should I pull the author object into a variable and use its properties?
I hope my question is clear; it is supposed to be easy, I assume, but I don't have anybody to ask.
Just use the code that most clearly and concisely expresses what you want to achieve. The performance will depend on many factors but will probably already be satisfactory.
Later, in case an actual performance problem appears, you can reason more clearly about performance because you will know the actual numbers involved (number of books and authors, modify and read rates for each, available RAM, CPU load, database server load, network load).
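As a trivial sketch of the point (hypothetical classes, no particular ORM assumed), both styles below express the same thing, so pick whichever reads best:

```python
from dataclasses import dataclass

@dataclass
class Author:
    name: str
    country: str

@dataclass
class Book:
    title: str
    author: Author

book = Book("Dune", Author("Frank Herbert", "USA"))

# Style 1: drill down each time you need an author property.
label = f"{book.title} by {book.author.name} ({book.author.country})"

# Style 2: pull the author into a local variable first, then use it.
author = book.author
label = f"{book.title} by {author.name} ({author.country})"
print(label)
```

With a real ORM the interesting cost is usually whether accessing book.author triggers a lazy load (an extra query); most ORMs cache the loaded object within a session, so repeated access inside one unit of work is typically cheap, but that is worth checking for your specific tool.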
Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact on different statements of adding indexes? What's the impact of indexing a 256-byte varchar? Will reporting queries even use your indexes? And so on.
Now that you've got the right approach, and generally understand how 90% of your model will perform, you're often done forecasting performance for most small to medium-sized databases. If you've got a huge one, there's a ton at stake, you need more precision (you might have to order more hardware), or there are a few wacky spots in the design - then focus your tests on just those. Generate reasonable test data and (important) statistics that you'd see in production, and look at what the database will do with that data. Unless you've got real data and real production-sized servers you'll still have to extrapolate, but you should at least be able to get reasonably close.
Running performance tests against various putative implementations of a conceptual model is not naive so much as heroically forward-thinking. Alas, I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
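For instance, here is a minimal sketch (table and column names invented) of how uniformly random test data differs from skewed data; an index that looks selective under the uniform distribution may be nearly useless for the "big" values in the skewed one:

```python
import random
from collections import Counter

N = 100_000
customers = list(range(1000))

# Uniform random data: every customer appears roughly equally often.
uniform = [random.choice(customers) for _ in range(N)]

# Skewed data: a handful of "big" customers dominate, as is common in real systems.
weights = [1.0 / (rank + 1) for rank in range(len(customers))]  # Zipf-like
skewed = random.choices(customers, weights=weights, k=N)

print("uniform top 5:", Counter(uniform).most_common(5))
print("skewed  top 5:", Counter(skewed).most_common(5))
# An index on customer_id looks selective under the uniform data, but under the
# skewed data a predicate on a "big" customer may hit a large fraction of the table.
```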
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual query like SELECT * FROM my_table WHERE blah, but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a collection of Animals will perform much more slowly than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
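As a rough analogue in Python rather than Hibernate (class, table, and column names are hypothetical, using SQLAlchemy's joined-table inheritance), replicating the hierarchy in tables looks something like this; loading Dogs touches only the animals and dogs tables, while loading Animals polymorphically with all their sub-type attributes means joining in every sub-type table:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Animal(Base):
    __tablename__ = "animals"           # base table shared by every sub-type
    id = Column(Integer, primary_key=True)
    kind = Column(String(20))           # discriminator column
    name = Column(String(50))
    __mapper_args__ = {"polymorphic_identity": "animal", "polymorphic_on": kind}

class Dog(Animal):
    __tablename__ = "dogs"              # one extra table per sub-type
    id = Column(Integer, ForeignKey("animals.id"), primary_key=True)
    breed = Column(String(50))
    __mapper_args__ = {"polymorphic_identity": "dog"}

class Cat(Animal):
    __tablename__ = "cats"
    id = Column(Integer, ForeignKey("animals.id"), primary_key=True)
    indoor = Column(Integer)
    __mapper_args__ = {"polymorphic_identity": "cat"}

# Querying Dog reads animals JOIN dogs only; loading Animal instances with all
# their sub-type attributes (e.g. via with_polymorphic) pulls in every sub-type table.
```

The alternative, tables only at the concrete level (DOGS, CATS with no shared ANIMALS table), avoids the joins but gives up the single place to hang a foreign key for things that apply to all Mammals, which is exactly the trade-off described above.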
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good, sound database design practices in order to keep your design simple and robust. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct, and relational (provided you're building a relational database), before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!