Efficient way to implement search operation in JSON file - ruby

I have a huge JSON file which is an array of objects containing city crime information. The number of crimes per city is listed as a key/value. I'm parsing it to a hash using yajl/json_gem.
What is an efficient way to find the top 10 cities with the most crimes / the fewest crimes?

Generally, an efficient way of traversing through a list to find the k min or max elements is with a min or max heap. A heap is a tree-like data structure that always has the smallest or largest element at the top of the tree, and inserting a new element or deleting an element is O(log n).
Let's say you have n elements in your table and want to keep track of the k max elements (the process is identical for min; you just use the opposite kind of heap). Per this StackOverflow post, storing the data in a min-heap bounded to size k (and dropping values that are smaller than the minimum value in the heap, since they cannot be among the top k) is an efficient solution to this problem.
The space complexity is O(k) (the heap never holds more than k elements), and the time complexity is O(n log k) (because in the worst case you insert all n elements, and each insertion takes O(log k)).
Now, on to the implementation: Ruby doesn't have a built-in heap data structure, but the algorithms gem provides one implemented in C.
I don't want to write the code for you, but I think that from this theory, you should be able to implement an efficient solution.
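For orientation only, a rough sketch of this bounded min-heap idea might look like the following. It assumes the algorithms gem exposes Containers::MinHeap with push(key, value), pop, size and empty? (check the gem's documentation), and that each parsed city is a hash with a hypothetical :crimes count:
require 'algorithms'

# Sketch: keep only the k cities with the highest crime counts.
def top_k(cities, k = 10)
  heap = Containers::MinHeap.new      # the smallest kept count sits on top
  cities.each do |city|
    heap.push(city[:crimes], city)    # keyed by crime count
    heap.pop if heap.size > k         # evict the current minimum so only k remain
  end
  result = []
  result << heap.pop until heap.empty?
  result.reverse                      # highest crime count first
end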

I do not expect this to be a complete answer, as the question is not clear, but this may provide the beginnings of a solution.
Suppose
h = { "info":[
{"name": "Paris", "crime_rate": "750"},
{"name": "Rome", "crime_rate": "800"},
{"name": "London", "crime_rate": "600"},
{"name": "Berlin", "crime_rate": "400"},
{"name": "Amsterdam", "crime_rate": "700"}
]
}
and the cities with the top two and bottom two crime rates are desired.
def top_so_many(h, meth, nbr)
  h[:info].public_send(meth, nbr) { |g| g[:crime_rate] }.map { |g| g[:name] }
end
top_so_many(h, :max_by, 2)
#=> ["Rome", "Paris"]
top_so_many(h, :min_by, 2)
#=> ["Berlin", "London"]

I would try something like this:
Store your JSON in a variable:
json = {"info":[ {"name": "xyz", "crime_rate": 750}, {"name":"ABC", "crime_rate", "900"}......]}
Parse the JSON:
h = JSON.parse(json)
Use select to pick the entries matching the required condition, sort, and take the first 10 objects:
h["info"].select { |el| el["crime_rate"] > 500 }.sort_by { |el| -el["crime_rate"] }.first(10) # or any other condition

Related

How to order a list according to an arbitrary order

I searched for a relevant question but couldn't find one. So my question is: how do I sort an array based on an arbitrary order? For example, let's say the ordering is:
order_of_elements = ['cc', 'zz', '4b', '13']
and my list to be sorted:
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']
so the result needs to be:
ordered_list = ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
Please note that the reference list (order_of_elements) describes the ordering; I'm not asking about sorting according to the alphabetically sorted indices of the reference list.
You can assume that the order_of_elements array includes all the possible elements.
Any pseudocode is welcome.
A simple and Pythonic way to accomplish this would be to compute an index lookup table for the order_of_elements array, and use the indices as the sorting key:
order_index_table = { item: idx for idx, item in enumerate(order_of_elements) }
ordered_list = sorted(list_to_be_sorted, key=lambda x: order_index_table[x])
The table reduces order lookup to O(1) (amortized) and thus does not change the time complexity of the sort.
(Of course it does assume that all elements in list_to_be_sorted are present in order_of_elements; if this is not necessarily the case then you would need a default return value in the key lambda.)
Since you have a limited number of possible elements, and if these elements are hashable, you can use a kind of counting sort.
Put all the elements of order_of_elements in a hashmap as keys, with counters as values. Traverse your list_to_be_sorted, incrementing the counter corresponding to the current element. To build ordered_list, go through order_of_elements and append each element the number of times indicated by its counter.
hashmap hm;
for e in order_of_elements {
    hm.add(e, 0);
}
for e in list_to_be_sorted {
    hm[e]++;
}
list ordered_list;
for e in order_of_elements {
    ordered_list.append(e, hm[e]); // append hm[e] copies of element e
}
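For the example above, a concrete Ruby version of this counting approach (illustrative only) might be:
order_of_elements = ['cc', 'zz', '4b', '13']
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']

counts = Hash.new(0)                                  # element => occurrence count
list_to_be_sorted.each { |e| counts[e] += 1 }
ordered_list = order_of_elements.flat_map { |e| [e] * counts[e] }
# => ["cc", "cc", "zz", "zz", "4b", "4b", "13"]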
Approach:
1. Create an auxiliary array holding, for each element of the main array, its index in order_of_elements.
2. Sort the auxiliary array.
   2.1 Re-arrange the values in the main array while sorting the auxiliary array.
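In Ruby, this index-based idea can also be expressed compactly by sorting on the reference index directly (a sketch):
index = order_of_elements.each_with_index.to_h        # element => position in the reference order
ordered_list = list_to_be_sorted.sort_by { |e| index[e] }
# => ["cc", "cc", "zz", "zz", "4b", "4b", "13"]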

Find the Most Frequent Ordered Word Pair In a Document

This is a problem from S. Skiena's "Algorithm Design Manual"; the problem statement is:
Give an algorithm for finding an ordered word pair(e.g."New York")
occurring with the greatest frequency in a given webpage.
Which data structure would you use? Optimize both time and space.
One obvious solution is inserting each ordered pair into a hash map and then iterating over all of them to find the most frequent one. However, there should definitely be a better way; can anyone suggest anything?
In a text with n words, we have exactly n - 1 ordered word pairs (not distinct of course). One solution is to use a max priority queue; we simply insert each pair in the max PQ with frequency 1 if not already present. If present, we increment the key. However, if we use a Trie, we don't need to represent all n - 1 pairs separately. Take for example the following text:
A new puppy in New York is happy with it's New York life.
In the resulting trie, if we store the number of occurrences of a pair in the leaf nodes, we can easily compute the maximum occurrence in linear time. Since we need to look at each word anyway, that's the best we can do, time-wise.
Working Scala code below. The official site has a solution in Python.
import scala.annotation.tailrec
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {
  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1
    child
  }

  def letter(node: TrieNode): Char = {
    node.parent
      .flatMap(_.children.find(_._2 eq node))
      .map(_._1)
      .getOrElse('\u0000')
  }

  override def toString: String = {
    Iterator
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>
          node
            .filter(_.parent.isDefined)
            .map(letter)
            .foreach(buffer.prepend(_))
          (buffer, node.flatMap(_.parent))
      }
      .dropWhile(_._2.isDefined)
      .take(1)
      .map(_._1.mkString)
      .next()
  }
}

def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()

  @tailrec
  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)
        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)
    }
  }

  loop(text, new TrieNode(), -1, root)
}
Inspired by the question here.
I think the first point to note is that finding the most frequent ordered word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words made up of the letters a..z + A..Z separated by punctuation or spaces, you are looking for word pairs made up of the letters a..z + A..Z + exactly one space, similarly separated by punctuation or spaces.
If your web page has n words then there are only n - 1 word pairs. So hashing each word pair and then iterating over the hash table will be O(n) in both time and memory. This should be pretty quick even if n is ~10^6 (i.e. the length of an average novel). I can't imagine anything more efficient unless n is fairly small, in which case the memory savings from constructing an ordered list of word pairs (instead of a hash table) might outweigh the cost of increasing the time complexity to O(n log n).
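For comparison, a minimal Ruby sketch of this hash-counting approach (illustrative only; the tokenizer is deliberately crude):
def most_frequent_pair(text)
  words = text.downcase.scan(/[a-z]+/)                # letters only, punctuation dropped
  counts = Hash.new(0)
  words.each_cons(2) { |a, b| counts[[a, b]] += 1 }   # count each ordered pair
  counts.max_by { |_pair, n| n }                      # => [pair, count]
end

most_frequent_pair("A new puppy in New York is happy with it's New York life.")
# => [["new", "york"], 2]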
Why not keep all the ordered pairs in an AVL tree, with a 10-element array to track the top 10 ordered pairs? In the AVL tree we keep all the ordered pairs with their occurrence counts, and the top 10 are kept in the array. This way, searching for any ordered pair is O(log N) and traversing all pairs is O(N).
I think we cannot do better than O(n) in terms of time, as one has to look at each element at least once, so the time complexity cannot be optimised further.
But we can use a trie to optimise the space used. In a page, words are often repeated, so this can lead to a significant reduction in space usage. The leaf nodes in the trie could store the frequency of the ordered pair, and two pointers can be used to iterate through the text, one pointing at the current word and the other at the previous word.

Efficient algorithms for merging hashes into a sparse matrix

I have time data with irregular intervals and I need to convert it to a sparse matrix for use with a graphing library.
The data is currently in the following format:
{
  :series1 => [entry, entry, entry, entry, ...],
  :series2 => [entry, entry, entry, entry, ...]
}
where entry is an object with two properties: timestamp (a Unix timestamp) and value (an integer).
I need to put it in this format in as close to O(n) time as possible.
{
  timestamp1 => [value, value, nil],
  timestamp2 => [value, nil, value],
  timestamp3 => [value, value, value],
  ...
}
Here each row represents a point in time for which I have an entry. Each column represents a series (a line on a line graph). That's why it is very important to represent missing values with nil.
I have some pretty slow implementations but this seems like a problem that has been solved before so I'm hoping that there is a more efficient way to do this.
I'm slightly confused by your asking for O(n), so feel free to correct me, but as far as I can tell, O(n) is easily possible.
First find the length of your starting hash (the number of series in the data). This should be O(1), but no worse than O(S) (where S is the number of series), and S <= n (assuming no series is empty), so it is still O(n).
Store this length somewhere (call it S), and then set up your hash for the sparse matrix to automatically initialise any row to an empty array of this size:
matrix = Hash.new { |hsh, k| hsh[k] = Array.new(S) }
Then simply go through each series by index, and for each entry set the appropriate cell in the array to the right value.
For each entry, this is O(1) (average) for the lookup of the timestamp in the hash, then O(1) for setting the cell in the array. This happens n times, giving you O(n) there.
There will also be the creation of an Array for each row in the matrix. As far as I am aware this is O(1) for one Array, so O(T) (where T is number of timestamps) overall. As we are not creating empty rows where there are no entries with that timestamp, T must be <= n, so this is O(n) as well.
So overall we have O(n) + O(n) + O(n) = O(n). There are probably ways to speed this up in Ruby, but as far as I am aware this is not only close to, but actually O(n).
How about something like this:
num = series.count
timestamps = {}
series.each_with_index do |(k, entries), i|
  entries.each do |entry|
    timestamps[entry.timestamp] ||= Array.new(num)
    timestamps[entry.timestamp][i] = entry.value
  end
end
Not sure though about the initial ordering of your series; I guess your real situation is a bit more complex than presented in the question.
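For illustration, with a hypothetical Entry struct standing in for the entry objects, the snippet above produces the requested layout:
Entry = Struct.new(:timestamp, :value)   # hypothetical stand-in for the entry objects
series = {
  series1: [Entry.new(1, 10), Entry.new(2, 20)],
  series2: [Entry.new(1, 5),  Entry.new(3, 7)]
}
# running the loop above yields:
# timestamps #=> { 1 => [10, 5], 2 => [20, nil], 3 => [nil, 7] }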

Find ranges in array

I've been trying to find the optimal solution to the following (interesting?) problem that came up at work. Eventually I settled for a good-enough solution, but I'd like to know if there's a better one.
Let a1...an be an array of strings.
Let s1...sk be an unordered list of strings, all of them also members of the array.
The task is to find the minimum set of index ranges that the elements of s cover in a.
So for example if a = [ "x", "y", "a", "f", "c" ] and s = { "c","y","f" }, the answer would be (1;1), (3;4), assuming that the array is indexed from zero.
a is typically fairly large (hundreds of thousands of elements), while s is relatively small, typically length(s) < log(length(a)).
So the question is: can you find a time-efficient algorithm for this problem? (Space efficiency is not a concern within reasonable limits.)
Just a quick but important update: I need to perform this operation a lot, with different s values but the same a. So precomputing stuff based on a is allowed; indeed, it is the only way.
Build a hash table H(a) to map from element to index (a[x] -> x) in O(n) time and space. Then look up each s_y in H(a) (O(1) time on average, for a total of O(k) over s) and keep track of the ranges. For the ranges you can use an array of (min_index, max_index) pairs sorted by min_index, and do a binary search to either locate the containing range or find where to insert a new one-element range.
So overall, the solution above would take O(n + k + k * log(nb_ranges)) time and O(n + nb_ranges) space.
This is what you want, written in Python:
def flattened(indexes):
    s, rest = indexes[0], indexes[1:]
    result = (s, s)
    for e in rest:
        if e == result[1] + 1:
            result = (result[0], e)
        else:
            yield result
            result = (e, e)
    yield result

a = ["x", "y", "a", "f", "c"]
s = ["c", "y", "f"]

# Create lookup table of ai to index in a
src_indexes = dict((key, i) for i, key in enumerate(a))
# Create sorted list of all indexes into a
raw_dst_indexes = sorted(src_indexes[key] for key in s)
# Convert sorted list of indexes into an array of ranges
dst_indexes = [r for r in flattened(raw_dst_indexes)]
print(dst_indexes)  # => [(1, 1), (3, 4)]
I think you can throw the elements of s into a set or hash table, anything with near-O(1) membership checks. Then just do a linear scan over a, with a flag to track whether you are currently covering elements of s, and the start position of that cover. Should be O(n + k).
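A quick Ruby sketch of this linear-scan idea (illustrative only):
require 'set'

def covered_ranges(a, s)
  members = s.to_set                    # near-O(1) membership checks
  ranges = []
  start = nil
  a.each_with_index do |el, i|
    if members.include?(el)
      start ||= i                       # open a range if we aren't already in one
    elsif start
      ranges << [start, i - 1]          # close the current range
      start = nil
    end
  end
  ranges << [start, a.size - 1] if start
  ranges
end

covered_ranges(["x", "y", "a", "f", "c"], ["c", "y", "f"])
# => [[1, 1], [3, 4]]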

What's the best data structure for storing 2-tuples (a, b) that supports adding and deleting tuples, and comparing (on either a or b)?

So here is my problem. I want to store 2-tuple (key, val) and want to perform following operations:
keys are strings and values are Integers
multiple keys can have same value
adding new tuples
updating any key with a new value (any new or updated value is greater than the previous one, like timestamps)
fetching all the keys with values less than or greater than a given value
deleting tuples.
A hash seems to be the obvious choice for updating a key's value, but then lookups via values are going to take longer (O(n)). The other option is a balanced binary search tree with key and value switched, so that lookups via values are fast (O(log n)), but then updating a key takes O(n). So is there any data structure which addresses both of these issues?
Thanks.
I'd use 2 data structures: a hash table from keys to values, and a search tree ordered by values and then by keys. When inserting, insert the pair into both structures; when deleting by key, look up the value from the hash and then remove the pair from the tree. Updating is basically delete + insert. Insert, delete and update are O(log n). For fetching all the keys less than a value, look up the value in the search tree and iterate backwards; this is O(log n + k).
The choices for good hash table and search tree implementations depend a lot on your particular distribution of data and operations. That said, a good general purpose implementation of both should be sufficient.
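A rough Ruby sketch of this two-structure idea (the class and method names are made up for illustration, and a sorted Array with binary search stands in for the balanced search tree, so inserts here are O(n) rather than the O(log n) a real tree would give):
class TupleStore
  def initialize
    @by_key = {}   # key => value (the hash table)
    @sorted = []   # [value, key] pairs kept in ascending order (the tree stand-in)
  end

  # Insert a new tuple, or update an existing key with a new value.
  def upsert(key, value)
    delete(key) if @by_key.key?(key)
    @by_key[key] = value
    i = @sorted.bsearch_index { |p| (p <=> [value, key]) >= 0 } || @sorted.size
    @sorted.insert(i, [value, key])
  end

  def delete(key)
    value = @by_key.delete(key)
    return unless value
    i = @sorted.bsearch_index { |p| (p <=> [value, key]) >= 0 }
    @sorted.delete_at(i)
  end

  # All keys whose value is less than the given limit.
  def keys_below(limit)
    @sorted.take_while { |(v, _k)| v < limit }.map { |(_v, k)| k }
  end
end

store = TupleStore.new
store.upsert("a", 1)
store.upsert("b", 5)
store.upsert("a", 7)   # updating a key with a newer, larger value
store.keys_below(6)    # => ["b"]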
For a binary search tree, insert is an O(log N) operation on average and O(n) in the worst case, and the same holds for lookup. So this should be your choice, I believe.
Dictionary or Map types tend to be based on one of two structures.
Balanced tree (guarantees O(log n) lookup).
Hash based (best case is O(1), but a poor hash function for the data could result in O(n) lookups).
Any book on algorithms should cover both in lots of detail.
To provide operations both on keys and values, there are also multi-index based collections (with all the extra complexity) which maintain multiple structures (much like an RDBMS table can have multiple indexes). Unless you have a lot of lookups over a large collection the extra overhead might be a higher cost than a few linear lookups.
You can create a custom data structure which holds two dictionaries, i.e. a hash table from keys to values and another hash table from values to lists of keys.
class Foo:
    def __init__(self):
        self.keys = {}    # (KEY=key, VALUE=value)
        self.values = {}  # (KEY=value, VALUE=list of keys)

    def add_tuple(self, kd, vd):
        self.keys[kd] = vd
        if vd in self.values:
            self.values[vd].append(kd)
        else:
            self.values[vd] = [kd]

f = Foo()
f.add_tuple('a', 1)
f.add_tuple('b', 2)
f.add_tuple('c', 3)
f.add_tuple('d', 3)
print(f.keys)
print(f.values)
print(f.keys['a'])
print(f.values[3])
print([f.values[v] for v in f.values.keys() if v > 1])
OUTPUT:
{'a': 1, 'b': 2, 'c': 3, 'd': 3}
{1: ['a'], 2: ['b'], 3: ['c', 'd']}
1
['c', 'd']
[['b'], ['c', 'd']]
