Rails + Apache Solr taxonomy integration - Ruby

I am a little stuck on the design approach for my application. My plan is to create a Product model and add a filter that uses Country, Region and City.
Country -> Region -> City are one-to-many relations, respectively.
How well does Apache Solr handle tree structures like these, and how fast is it? Is it better to use something like the ancestry gem (https://github.com/stefankroes/ancestry) with Solr, or is it better to use separate models for City, Region and Country?
The system will need to handle a lot of requests, and this tree can become very large.
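For context, here is roughly what the two options look like in ActiveRecord (just a sketch; the model names and associations are assumptions on my part):

# Option 1: separate normalized models
class Country < ActiveRecord::Base
  has_many :regions
end

class Region < ActiveRecord::Base
  belongs_to :country
  has_many :cities
end

class City < ActiveRecord::Base
  belongs_to :region
  has_many :products
end

# Option 2: a single Location model using the ancestry gem,
# which stores the whole tree as a materialized path in one column
class Location < ActiveRecord::Base
  has_ancestry
end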
Thanks for any advice on this.

Have you considered Neo4j, a graph database?
It supports Rails.
It supports Lucene query syntax.
Friendly advice:
I had huge performance problems some 5 years back while trying to fit a large tree into a relational DB. After replacing it with Neo4j, the computation time of one stored procedure went from 4 hours to zero (the need for it simply vanished, since I could suddenly do all relationship-based algebra at run time).
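For illustration, a minimal sketch of how the location tree could be modeled in Rails with the neo4j gem (ActiveNode); the relationship types and names here are assumptions, not a prescription:

# Gemfile: gem 'neo4j'
class Country
  include Neo4j::ActiveNode
  property :name, type: String
  has_many :out, :regions, type: :HAS_REGION, model_class: :Region
end

class Region
  include Neo4j::ActiveNode
  property :name, type: String
  has_one  :in,  :country, type: :HAS_REGION, model_class: :Country
  has_many :out, :cities,  type: :HAS_CITY,   model_class: :City
end

class City
  include Neo4j::ActiveNode
  property :name, type: String
  has_one :in, :region, type: :HAS_CITY, model_class: :Region
end

# e.g. all cities reachable from a country via association chaining:
# Country.find_by(name: 'France').regions.cities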

Related

What are the pitfalls for using ElasticSearch as a nosql db for a social application vs a graph database?

Our company has several products and several teams. One team is in charge of search, and is standardizing on Elasticsearch as a NoSQL DB to store all their data, with plans to use Neo4j later to complement their searches with relationship data.
My team is responsible for the product side of a social app (people have friends, work for companies, and will be colleagues with everyone working at their companies, etc.). We're looking at graph DBs as a solution (after abandoning the burning ship that is n^2 relationships in an RDBMS), specifically Neo4j (the Cypher query language is a beautiful thing).
A subset of our data is similar to the data used by the search team, and we will need to make sure search can search over their data and our data simultaneously. The search team is pushing us to standardize on ElasticSearch for our db instead of Neo4j or any graph db. I believe this is for the sake of standardization and consistency.
We're obviously coming from very different places here, search concerns vs product concerns. He asserts that ElasticSearch can cover all our use cases, including graph-like queries to find suggestions. While that's probably true, I'm really looking to stick with Neo4j, and use an ElasticSearch plugin to integrate with their search.
In this situation, are there any major gotchas to choosing ElasticSearch over Neo4j for a product db (or vice versa)? Any guidelines or anecdotes from those who have been in similar situations?
We are heavy users of both technologies, and in our experience you would do better to use each for what it is good at.
Elasticsearch is a very good piece of software when it comes to search functionality, log management and facets.
Despite its graph plugin, if you want to model a lot of social-network-style relationships in Elasticsearch indices, you will have two problems:
You will have to update documents every time a relationship changes, which can amount to a lot of writes when a single entity changes. For example, say you have organizations whose users make contributions on GitHub, and you want to search for organizations with the top contributors in a certain language: every time a user makes a contribution, you have to reindex the whole organization document, recompute the percentage of contributions per language across all its users, and so on. And this is a simple example (see the sketch after the documentation quote below).
If you intend to use nested fields and parent/child mappings, you will lose search performance. For reference, here is a quote from the "Tune for search speed" documentation (https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_document_modeling):
Documents should be modeled so that search-time operations are as cheap as possible. In particular, joins should be avoided. nested can make queries several times slower and parent-child relations can make queries hundreds of times slower. So if the same questions can be answered without joins by denormalizing documents, significant speedups can be expected.
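To make the first problem concrete, here is roughly what reindexing a denormalized organization document looks like with the elasticsearch-ruby client (a sketch; the index, type and fields are invented for the example):

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# The whole organization, with per-user and per-language stats baked in,
# lives in one document so that it can be searched and faceted quickly.
org_doc = {
  name:      'acme',
  languages: { 'ruby' => 62.5, 'go' => 37.5 },   # % of contributions per language
  users: [
    { login: 'alice', contributions: { 'ruby' => 500 } },
    { login: 'bob',   contributions: { 'ruby' => 125, 'go' => 375 } }
  ]
}

# One new commit by one user means recomputing the stats above
# and reindexing the whole document.
client.index index: 'organizations', type: 'organization', id: 'acme', body: org_doc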
Relationships are handled very well in a graph database like Neo4j. Neo4j, on the other hand, lacks the search features Elasticsearch provides: full-text search is possible, but it is not as performant and introduces some burden in your application.
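As a rough illustration, the same "top contributors in a certain language" question can be expressed as a single Cypher query over the relationships, with no reindexing when a contribution is added (run here through the neography gem; the labels, relationship types and properties are invented for the example):

require 'neography'

neo = Neography::Rest.new('http://localhost:7474')

cypher = <<-CYPHER
  MATCH (org:Organization)<-[:MEMBER_OF]-(:User)-[c:CONTRIBUTED]->(:Repository { language: 'Ruby' })
  RETURN org.name AS organization, sum(c.commits) AS commits
  ORDER BY commits DESC
  LIMIT 10
CYPHER

neo.execute_query(cypher)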
A side note: when you talk about a "store", Elasticsearch is a search engine, not a database (although it is often used as one), while Neo4j is a fully transactional database.
However, combining both is the winning approach. We have written an article describing this process, which we call Graph-Aided Search, along with a set of open-source plugins for both Elasticsearch and Neo4j that provide a powerful two-way integration out of the box.
You can read more about it here: http://graphaware.com/neo4j/2016/04/20/graph-aided-search-the-rise-of-personalised-content.html

How does CouchDB 1.6 inherently take advantage of MapReduce when it is a single-server database?

I am new to CouchDB. While going through the documentation for CouchDB 1.6, I learned that it is a single-server DB, so I was wondering how MapReduce inherently takes advantage of that.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware the way HDFS does?
I learned that CouchDB 2.0 plans to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need, not on how much work it takes to find it. So you might have unexpected performance problems that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents should be found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So you have to write a map function:
function (doc) {
  // index each employee document under its supervisor's identifier
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally, I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs, at least of queries (not so much of replications…).
So don't think of CouchDB views so much as "MapReduce" (in the stylized sense), but rather as providing efficiently accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total data set can be bigger than fits in memory at once. Because the output of the "reduce" function must stay small, it further encourages efficient processing of a large data set into an efficiently accessed index.
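As a loose analogy (Ruby here rather than CouchDB's JavaScript, and glossing over how the results are persisted in a B-tree), a view index is conceptually the precomputed result of a map step grouped by key, with an optional reduce folded over each group:

# docs is the full set of documents; in CouchDB this never has to fit in
# memory at once, because "map" sees one document at a time.
rows = docs.flat_map do |doc|
  doc[:isEmployeeRecord] ? [[doc[:supervisor][:identifier], 1]] : []
end

# Optional "reduce": aggregate the emitted values per key
# (here: number of employees per supervisor).
by_supervisor = rows.group_by { |key, _value| key }
                    .transform_values { |pairs| pairs.sum { |_key, value| value } }

# A view query like ?key=SOME_UUID is then just an efficient lookup:
by_supervisor['SOME_UUID']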
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear, solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So if you have designed an app to work with CouchDB 1.x, it may scale in the newer versions without further intervention on your part.

Elasticsearch - tips on how to organize my data

I'm trying out Elasticsearch by pulling in some data from Facebook and Twitter.
The question is: how should I organize this data into indexes?
/objects/posts
/objects/twits
or
/posts/post
/twits/twit
I'm running queries such as: get posts by author_id = X.
You need to think about the long term when deciding how to structure your data in Elasticsearch. How much data are you planning on capturing? Are search requests going to look into both Facebook and Twitter data? What volume of requests, what types of queries, and so on.
Personally I would start off with the first approach, localhost:9200/social/twitter,facebook/, as this avoids creating another index when it isn't necessarily required. You can search across both of the types easily, which has less overhead than searching across two indexes. There is quite an interesting article here about how to grow with intelligence.
Elasticsearch has many configuration options; essentially it's about finding a balance that fits your data.
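As a sketch of what the single-index approach looks like from Ruby with the elasticsearch gem (index and type names follow the example above; everything else is assumed):

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Query only the Facebook posts...
client.search index: 'social', type: 'facebook',
              body: { query: { term: { author_id: 'X' } } }

# ...or both types in the one index with a single request.
client.search index: 'social', type: ['facebook', 'twitter'],
              body: { query: { term: { author_id: 'X' } } }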
The first one is the better approach, because creating two indices will create two Lucene instances, which will affect response time.

What is the best solution for large-scale caching in Ruby?

I'm building a REST API in Ruby with JRuby + Sinatra running on top of the Trinidad web server.
One of the functions of the API will be fetching very large datasets from a database and storing them in an intermediate caching/non-relational DB layer. This is for performing filtering/sorting/other actions on top of that dataset without having to rebuild it from the database.
We're looking into a good/the best solution for implementing this middle layer.
My thoughts:
Using a non-relational database like Riak to store the datasets and having a caching layer (like Cache Money) on top.
Notes:
Our datasets can be fairly large
Since you asked for an opinion, I'll give you mine... I think MongoDB would be a good match for your needs:
http://www.mongodb.org/
I've used it to store large, historical datasets for a couple of years now that just keep getting bigger and bigger, and it remains up to the task. I haven't even needed to delve into "sharding" or some of the advanced features.
The reasons I think it would be appropriate for the application you describe are:
It is an indexed, schemaless document store, which means it can be very "dynamic", with fields being added or removed.
I've benchmarked its performance against some SQL databases for large "flat" data, and it performs orders of magnitude better in some cases.
https://github.com/guyboertje/jmongo will let you access MongoDB from JRuby
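A small sketch of the kind of usage I mean, shown with the official mongo Ruby driver for clarity (jmongo's API on JRuby differs slightly; database, collection and field names here are placeholders):

require 'mongo'

client   = Mongo::Client.new('mongodb://127.0.0.1:27017/cache_layer')
datasets = client[:datasets]

# Index the fields you filter and sort on, then load the large result set.
datasets.indexes.create_one(report_id: 1, amount: -1)
datasets.insert_many(rows)   # rows = array of hashes pulled from the main DB

# Filtering and sorting then run against the cached copy, not the main database.
datasets.find(report_id: 42, amount: { '$gt' => 100 })
        .sort(amount: -1)
        .limit(50)
        .to_a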

Normalize or denormalize in high-traffic websites

What are the best practices for database design and normalization for high-traffic websites like Stack Overflow?
Should one use a normalized database for record keeping, a denormalized approach, or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the DB to reduce the number of joins needed for intensive queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the DB, and disk is cheap.
That said, for ridiculous amounts of traffic, good relational DB performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
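For example, with the dalli gem a heavy query can be wrapped in a cache lookup along these lines (a sketch; the key, TTL and query are placeholders):

require 'dalli'

cache = Dalli::Client.new('localhost:11211')

# fetch returns the cached value if present; otherwise it runs the block,
# stores the result for 10 minutes and returns it.
top_questions = cache.fetch('home:top_questions', 600) do
  Question.order(votes: :desc).limit(30).to_a   # the expensive, join-heavy query
end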
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: define for yourself what high-traffic means:
50,000 page views per day?
500,000 page views per day?
5,000,000 page views per day?
More?
Then calculate this down to probable peak page views per minute and per second.
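A quick back-of-the-envelope version of that calculation (the peak factor is an assumption; adjust it to your own traffic shape):

page_views_per_day = 5_000_000
average_per_second = page_views_per_day / (24 * 60 * 60.0)   # roughly 58 requests/s
peak_factor        = 10                                       # traffic is never spread evenly
peak_per_second    = average_per_second * peak_factor         # roughly 580 requests/s to plan for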
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
A relational database can be amazingly fast at joining tables, if fully optimized!
A relational database could then be hit only seldom when used as a back end, to populate a cache or fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search: look into e.g. Lucene or something similar if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone down the denormalized-table route, as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find it useful to read http://highscalability.com/ for examples of how the "big sites" cope with the volume; Stack Overflow was recently featured.
Neither matters if you aren't caching properly.
