Ruby efficient queries - ruby

So far in my new workplace I've been dealing with querying databases and finding out the most efficient ways of getting the desired data.
I've found out about using pluck to fetch only the desired attributes instead of loading the whole result into memory, and other tricks, such as using inject (reduce), map, reject and such, which made my life a whole lot easier.
However, I haven't exactly found any theoretical explanation of why inject/map/reject should be used in order to gain higher efficiency, only some empirical conclusions from my own attempts. For example, why should I use map instead of iterating over an array with "each"?
Could someone please shed some light?
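To make the comparison concrete, here is a minimal sketch in plain Ruby (the prices array is made up for illustration). It only shows that each with a hand-built accumulator and map/reject/inject produce the same results; whether one is faster in a given case is something to measure rather than assume.

```ruby
prices = [5, 12, 8, 20] # hypothetical sample data

# each: the caller creates and mutates the result array by hand
doubled_with_each = []
prices.each { |price| doubled_with_each << price * 2 }

# map: builds and returns the new array itself
doubled_with_map = prices.map { |price| price * 2 }

# reject: keeps the elements for which the block is false
cheap = prices.reject { |price| price > 10 }

# inject (reduce): folds the array into a single value
total = prices.inject(0) { |sum, price| sum + price }
```

The map, reject and inject versions state what is being computed and leave no temporary variable behind, which is usually the main reason they are preferred.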

Related

Efficient Data Structure for Query Sync

I have a giant list of query searches with cached image results for a few different servers and I want to sync the queries efficiently. I know that one way would be to do it in two steps: first comparing the queries, and second, only syncing non-identical results. Instead, though, I'd like it to be faster and more efficient by only exchanging a small fixed amount of data and then syncing non-identical results based on that data (it's fine if it happens to sync a small amount of identical results).
What kind of data structure for these queries would be recommended to accomplish this? I've been looking at https://en.wikipedia.org/wiki/List_of_data_structures to try to get a better idea, but I don't have a lot of experience in algorithms so I could really use some direction. I'm planning to do this in C++ if that needs to be taken into consideration. All suggestions appreciated, thanks.

Efficient way to represent locations, and query based on proximity?

I'm pondering over how to efficiently represent locations in a database, such that given an arbitrary new location, I can efficiently query the database for candidate locations that are within an acceptable proximity threshold to the subject.
Similar things have been asked before, but I haven't found a discussion based on my criteria for the problem domain.
Things to bear in mind:
Starting from scratch, I can represent data in any way (eg. long&lat, etc)
Any result set is time-sensitive, in that it loses validity within a short window of time (~5-15mins) so I can't cache indefinitely
I can tolerate some reasonable margin of error in results, for example if a location is slightly outside of the threshold, or if a row in the result set has very recently expired
A language agnostic discussion is perfect, but in case it helps I'm using C# MVC 3 and SQL Server 2012
A couple of first thoughts:
Use an external API like Google, however this will generate thousands of requests and the latency will be poor
Use the Haversine function, however this looks expensive and so should be performed on a minimal number of candidates (possibly as a Stored Procedure even!)
Build a graph of postcodes/zipcodes, such that from any node I can find postcodes/zipcodes that border it, however this could involve a lot of data to store
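For reference, the Haversine idea mentioned above can be sketched in a few lines of Ruby. The method name, the sample coordinates and the 6371 km mean Earth radius are assumptions for illustration; the same arithmetic could live in a stored procedure instead.

```ruby
EARTH_RADIUS_KM = 6371.0 # assumed mean Earth radius

# Great-circle distance in kilometres between two points given in degrees.
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180.0
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  # Clamp to 1.0 so rounding error cannot push asin out of its domain.
  2 * EARTH_RADIUS_KM * Math.asin([1.0, Math.sqrt(a)].min)
end

# One degree of longitude along the equator, roughly 111 km.
one_degree = haversine_km(0, 0, 0, 1)
```

It is a handful of trigonometric calls per candidate, so the cost mostly depends on how many candidates survive a cheaper first filter.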
Some optimization ideas to reduce possible candidates quickly:
Cache result sets for searches, and when we do subsequent searches, see if the subject is within an acceptable range to a candidate we already have a cached result set for. If so, use the cached result set (but remember, the results expire quickly)
I'm hoping the answer isn't just raw CPU power, and that there are some approaches I haven't thought of that could help me out?
Thank you
ps. Apologies if I've missed previously asked questions with helpful answers, please let me know below.
What about using GeoHash? (refer to http://en.wikipedia.org/wiki/Geohash)
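A minimal sketch of Geohash encoding in Ruby, following the scheme described on the linked Wikipedia page (alternating longitude/latitude bisection, five bits per base-32 character). The method name and default precision are my own choices, not from any library.

```ruby
GEOHASH_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz".freeze

# Encode a latitude/longitude pair (degrees) into a geohash string.
def geohash_encode(lat, lon, precision = 8)
  lat_range = [-90.0, 90.0]
  lon_range = [-180.0, 180.0]
  hash = String.new
  bits = 0
  bit_count = 0
  use_lon = true # bits alternate, starting with longitude

  while hash.length < precision
    range, value = use_lon ? [lon_range, lon] : [lat_range, lat]
    mid = (range[0] + range[1]) / 2.0
    if value >= mid
      bits = (bits << 1) | 1
      range[0] = mid # keep the upper half
    else
      bits <<= 1
      range[1] = mid # keep the lower half
    end
    use_lon = !use_lon
    bit_count += 1

    if bit_count == 5 # five bits make one base-32 character
      hash << GEOHASH_BASE32[bits]
      bits = 0
      bit_count = 0
    end
  end
  hash
end
```

Nearby locations usually share a prefix, so storing the hash in an indexed column and querying by prefix gives a cheap candidate set. Points close to a cell boundary can have different prefixes, so neighbouring cells need checking too, which fits the stated tolerance for some error.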

Algorithm to organize table into many tables to have less cells?

I'm not really trying to compress a database. This is more of a logical problem. Is there any algorithm that will take a data table with lots of columns and repeated data and find a way to organize it into many tables with IDs in such a way that in total there are as few cells as possible, and that these tables can then be joined with a query to replicate the original one?
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you will post code, I like C# and SQL but you can use any.
I don't know of any automated algorithms, but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking these off into their own tables wherever it makes sense.
The problem with trying to do this in a computer program is that it isn't always clear if your current set of stored data represents all possible problem cases. You can't only look at numbers of values either. It makes little sense to break off booleans into their own table because they have only two values, for example, and this is only the tip of the iceberg.
I think that at this point, nothing is going to beat good ol' patient, hand-crafted normalization. This is something to do by hand. Any possible computer algorithm will either make a total mess of things or make you define the relationships such that you might as well do it all yourself.
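As a small illustration of why the stored data alone can mislead, here is a sketch in Ruby that checks whether one column appears to determine another in a sample. The rows and column names are invented for the example.

```ruby
# True if, in these rows, every value of `from` maps to exactly one value of `to`.
def determines?(rows, from, to)
  rows.group_by { |row| row[from] }
      .all? { |_key, group| group.map { |row| row[to] }.uniq.size == 1 }
end

# Hypothetical sample table
ROWS = [
  { order: 1, customer: "Ann", city: "Oslo" },
  { order: 2, customer: "Bob", city: "Rome" },
  { order: 3, customer: "Ann", city: "Oslo" }
].freeze

customer_fixes_city = determines?(ROWS, :customer, :city)
city_fixes_order    = determines?(ROWS, :city, :order)
```

In this sample customer seems to determine city, but that only holds until a customer appears with a second city. The check can rule a dependency out; it cannot prove one, which is why the decision still has to be made by someone who knows the data.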

Faster to call kind_of? or iterate through an array with one value?

In short, is the cost (in time and cpu) higher to call kind_of? twice or to create a new array with one value, then iterate through it? The 'backstory' below simply details why I need to know this, but is not a necessary read to answer the question.
Backstory:
I have a bunch of location data. Latitude/longitude pairs and the name of the place they represent. I need to sort these lat/lon values by distance from another lat/lon pair provided by a user. I have to calculate the distances on the fly, and they aren't known before.
I was thinking it would be easy to do this by adding the distance => placename map to a hash, then get a keyset and sort that, then read out the values in that order. However, there is the potential for two distances being equal, making two keys equal to each other.
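I have come up with two solutions to this. Either I map
# hash, distance and placename are assumed to be defined already
if hash.has_key?(distance)
  if hash[distance].kind_of?(Array)
    # already several places at this distance: append
    hash[distance] << placename
  else
    # second place at this distance: wrap both in an array
    hash[distance] = [hash[distance], placename]
  end
else
  hash[distance] = placename
end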
then when reading the values I check
hash[distance].kind_of?(Array) ? iterate through the array and grab all placenames : grab the single placename
each time. Or I could make each value an array from the start even if it has only one placename.
You've probably spent more time thinking about the issue than you will ever save in CPU time. Developer brain time (both yours and others who will maintain the code when you're gone) is often much more precious than CPU cycles. Focus on code clarity.
If you get indications that your code is a bottleneck, it may be a good idea to benchmark it, but don't forget to benchmark both before and after any changes you make, to make sure that you are actually improving the code. It is surprising how often "optimizations" aren't improving the code at all, just making it harder to read.
To be honest, this sounds like a very negligible performance issue, so I'd say just go with whatever feels better to you.
If you really believe that this has a real world performance impact (and frankly, there are other areas of Ruby you should worry more about speed-wise), reduce your problem to the simplest form that still resembles your problem and use the Benchmark module:
http://www.ruby-doc.org/stdlib/libdoc/benchmark/rdoc/index.html
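A minimal sketch of such a benchmark, assuming made-up data (10,000 distance/placename pairs spread over 100 distinct distances) and two hypothetical method names for the two approaches from the question:

```ruby
require "benchmark"

# Hypothetical sample data: [distance, placename] pairs
PAIRS = Array.new(10_000) { |i| [i % 100, "place#{i}"] }

# Approach 1: store a bare value first, convert to an array on collision
def mixed_values(pairs)
  hash = {}
  pairs.each do |distance, placename|
    if hash.has_key?(distance)
      if hash[distance].kind_of?(Array)
        hash[distance] << placename
      else
        hash[distance] = [hash[distance], placename]
      end
    else
      hash[distance] = placename
    end
  end
  hash
end

# Approach 2: every value is an array from the start
def always_arrays(pairs)
  hash = Hash.new { |h, key| h[key] = [] }
  pairs.each { |distance, placename| hash[distance] << placename }
  hash
end

Benchmark.bm(14) do |x|
  x.report("mixed values:")  { 50.times { mixed_values(PAIRS) } }
  x.report("always arrays:") { 50.times { always_arrays(PAIRS) } }
end
```

The numbers depend on the Ruby version and the shape of the data, so run it with input that resembles yours before drawing conclusions.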
I would bet that you'll achieve both higher performance and better legibility using the built-in Enumerable#group_by method.
As others have said, it's likely that this isn't a bottleneck, that gains will be negligible in any case and that you should focus on other things!
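A sketch of the group_by approach, assuming the distances have already been calculated and each place is a [name, distance] pair (the method name and sample data are hypothetical):

```ruby
# Returns place names ordered by distance; places at equal distance
# keep their original relative order.
def names_by_distance(places)
  grouped = places.group_by { |_name, distance| distance }
  grouped.keys.sort.flat_map { |distance| grouped[distance].map(&:first) }
end

places = [["Cafe", 1.2], ["Library", 0.4], ["Museum", 1.2]]
ordered = names_by_distance(places)
```

Every value in the grouped hash is an array, so there is no kind_of? check at all and equal distances stop being a special case.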

Why is pagination so resource-expensive?

It's one of those things that seems to have an odd curve where the more I think about it, the more it makes sense. To a certain extent, of course. And then it doesn't make sense to me at all.
Care to enlighten me?
Because in most cases you've got to sort your results first. For example, when you search on Google, you can view only up to 100 pages of results. They don't bother sorting by page-rank beyond 1000 websites for a given keyword (or combination of keywords).
Pagination is fast. Sorting is slow.
Lubos is right, the problem is not the fact that you are paging (which takes a HUGE amount of data off the wire), but that you need to figure out what is actually going on the page.
The fact that you need to page implies there is a lot of data. A lot of data takes a long time to sort :)
This is a really vague question. We'd need a concrete example to get a better idea of the problem.
This question seems pretty well covered, but I'll add a little something MySQL specific as it catches out a lot of people:
Avoid using SQL_CALC_FOUND_ROWS. Unless the dataset is trivial, counting matches and retrieving x amount of matches in two separate queries is going to be a lot quicker. (If it is trivial, you'll barely notice a difference either way.)
I thought you meant pagination of the printed page - that's where I cut my teeth. I was going to enter a great monologue about collecting all the content for the page, positioning (a vast number of rules here, constraint engines are quite helpful) and justification... but apparently you were talking about the process of organizing information on webpages.
For that, I'd guess database hits. Disk access is slow. Once you've got it in memory, sorting is cheap.
Of course sorting on a random query takes some time, but if you're having problems with the same paginated query being used regularly, there's either something wrong with the database setup (improper indexing or none at all, too little memory etc. I'm not a db-manager) or you're doing pagination seriously wrong:
Terribly wrong: e.g. doing select * from hugetable where somecondition; into an array, getting the page count from array.length, picking the relevant indexes and discarding the array, then repeating this for each page... That's what I call seriously wrong.
The better solution is two queries: one getting just the count, then another getting the results using limit and offset. (Some proprietary, nonstandard-sql server might have a one query option, I dunno)
The bad solution might actually work quite okay on small tables (in fact it's not unthinkable that it's faster on very small tables, because the overhead of making two queries is bigger than getting all rows in one query. I'm not saying it is so...) but as soon as the database begins to grow the problems become obvious.
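The two-query approach comes down to a little arithmetic. A sketch in Ruby that only builds the SQL strings, reusing hugetable and somecondition from above; the id column in ORDER BY and the method names are assumptions, and a real application would run these through its database driver:

```ruby
# Build the count query and the page query for a 1-based page number.
def page_queries(page, per_page)
  limit  = Integer(per_page)
  offset = (Integer(page) - 1) * limit
  {
    count: "SELECT COUNT(*) FROM hugetable WHERE somecondition",
    # ORDER BY keeps pages stable between requests; `id` is an assumed column
    rows:  "SELECT * FROM hugetable WHERE somecondition " \
           "ORDER BY id LIMIT #{limit} OFFSET #{offset}"
  }
end

# Number of pages needed for the row count returned by the count query.
def total_pages(row_count, per_page)
  (row_count.to_f / per_page).ceil
end

queries = page_queries(3, 20)
```

Only one page of rows ever leaves the database, and the count query can often be answered from an index.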
