The title may not be clear, because I cannot put my problem into a single sentence (if you have a better title for this question, please suggest it). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline   | Free Baggage |
|--------|-------------|-----------|--------------|
| NYC    | London      | American  | 20KG         |
| NYC    | *           | Southwest | 30KG         |
| *      | *           | Southwest | 25KG         |
| *      | LA          | *         | 20KG         |
| *      | *           | *         | 15KG         |
and so on ...
This table describes the free baggage allowance that airlines provide on different routes. You can see that some rows have a * value, meaning that they match all possible values (those values are not necessarily known in advance).
So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination, and airline are known), and we intend to find the baggage allowance for each flight in the most efficient way (iterating over the list is obviously not efficient, as it costs O(N) per lookup). More than one rule may match a flight; in that case, assume that either the first match or the most specific one is preferred (whichever is simpler for you to work with).
If there were no * signs in the table, the problem would be easy: we could use a hashmap or dictionary with a tuple of values as the key. But in the presence of those * (let's say match-all) keys, it is not so straightforward to provide a general solution.
Please note that the above is just an example; I need a solution that works for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup whose time complexity is equal or close to the O(1) of a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments: the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...
A trivial, quite easy solution would be something like an in-memory SQLite database. Lookups would probably be O(log n) rather than O(1), but the main advantage is that it's easy to set up, and IF performance is good enough, it could be the final solution.
Here, the key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
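To make that concrete, here is a minimal sketch of the idea using Python's sqlite3. The schema, the names, and the "most specific rule wins" ordering are my assumptions, and I probe each column with an exact-or-'*' IN condition rather than LIKE/JOIN:

import sqlite3

# In-memory rules table where '*' means match-all; each column is probed
# with an exact-or-wildcard condition, and ties are broken by specificity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (origin TEXT, dest TEXT, airline TEXT, baggage TEXT)")
conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
])
conn.execute("CREATE INDEX rules_idx ON rules (origin, dest, airline)")

def baggage_for(origin, dest, airline):
    # Count non-wildcard columns to rank more specific rules first.
    row = conn.execute("""
        SELECT baggage FROM rules
        WHERE origin IN (?, '*') AND dest IN (?, '*') AND airline IN (?, '*')
        ORDER BY (origin != '*') + (dest != '*') + (airline != '*') DESC
        LIMIT 1""", (origin, dest, airline)).fetchone()
    return row[0] if row else None

print(baggage_for("NYC", "Dallas", "Southwest"))  # 30KG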
From scratch, I can't think of any solution that, for N rows and M columns, isn't at least O(M)... But usually you'll have far fewer columns than rows. Quickly - I may have skipped a detail, I'm writing this on the fly - I can propose this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think of this as a primary key in a database, and we'll call it PK. Knowing the PK gives you the required data instantly, in O(1). You'll have N rows grand total.
For each row NOT containing any *, insert into a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact matches. Lookups in it are O(1), BUT the tuple you search for may not be in it... Obviously, you must maintain consistency between MAINHASH and VECDATA with whatever is needed (mutexes, locks; it doesn't matter as long as both stay consistent).
This hash contains at most N entries. Without any wildcards, it acts almost like a standard hashmap, apart from the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index dedicated to that column.
Each index has at most N entries. It will be a standard hashmap, but it MUST allow multiple values per key. That's quite a common container (a multimap), so it shouldn't be an issue.
For each row, the index entry will be (<column value>, PK). The containers are stored in a vector of indexes, INDEX[i] (with 0 <= i < M).
As with MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be maintained when an entry is inserted into VECDATA, and saved to disk across sessions if needed - you don't want to rebuild all of this each time the application starts...
Searching a row
So, the user searches for a given tuple.
Search for it in MAINHASH. If found, return it; the search is done.
Upgrade (see below): also search in CACHE before going to step #2.
For each tuple element tuple[i] (0 <= i < M), search INDEX[i] both for tuple[i] (returning a vector of PKs, EXACT[i]) AND for * (returning another vector of PKs, FUZZY[i]).
With these two vectors, build a (temporary) hash TMPHASH associating (PK, integer COUNT). It's quite simple: COUNT is initialized to 1 if the entry comes from EXACT, and 0 if it comes from FUZZY.
For the next column, build EXACT and FUZZY again (see #2). But instead of creating a new TMPHASH, MERGE the results into the existing one.
The method is: if TMPHASH doesn't already have this PK, drop the entry: it can't match at all. Otherwise, read its COUNT value, add 1 or 0 to it according to where the new entry comes from, and reinject it into TMPHASH.
Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, you don't have any suitable answer; return that to the user. If it contains only one entry, same: return it to the user directly.
For more than one element in TMPHASH:
Scan the whole TMPHASH container, looking for the maximum COUNT. Keep in memory the PK associated with the current maximum.
Developer's choice: if multiple entries share the maximum COUNT, you can return them all, only the first one, or only the last one.
COUNT is obviously always strictly lower than M - otherwise you would have found the tuple in MAINHASH. Compared to M, this value can serve as a confidence mark for your result (= 100*COUNT/M % confidence).
You can now also store the original searched tuple and the corresponding PK in another hashmap called CACHE.
Since it would be far too complicated to update CACHE properly when adding/modifying something in VECDATA, simply purge CACHE whenever that happens. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by allowing operator overloading and providing all the basic containers, but it should work.
Exact matches / cached matches are O(1). The fuzzy search is O(n·M), n being the number of partially matching rows (with 0 <= n < N, of course).
Without further research, I can't see anything better than that. It will consume an obscene amount of memory, but you said that won't be an issue.
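To make the bookkeeping concrete, here is a minimal Python sketch of this container. The names VECDATA, MAINHASH, INDEX, and CACHE mirror the description above; the method names, the '*' wildcard marker, and the tie-breaking are my assumptions:

from collections import defaultdict

class WildcardTable:
    def __init__(self, m):
        self.m = m
        self.VECDATA = []        # PK -> (key tuple, payload)
        self.MAINHASH = {}       # exact tuple -> PK
        self.INDEX = [defaultdict(set) for _ in range(m)]  # per-column multimaps
        self.CACHE = {}          # previously searched tuple -> PK

    def insert(self, key, payload):
        pk = len(self.VECDATA)
        self.VECDATA.append((key, payload))
        if '*' not in key:
            self.MAINHASH[key] = pk
        for i, v in enumerate(key):
            self.INDEX[i][v].add(pk)
        self.CACHE.clear()       # purge the cache on any write
        return pk

    def search(self, key):
        if key in self.MAINHASH:         # exact match, O(1)
            return self.VECDATA[self.MAINHASH[key]][1]
        if key in self.CACHE:            # cached fuzzy match, O(1)
            return self.VECDATA[self.CACHE[key]][1]
        tmphash = None                   # PK -> COUNT of exactly matched columns
        for i, v in enumerate(key):
            exact = self.INDEX[i].get(v, set())
            fuzzy = self.INDEX[i].get('*', set())
            scores = {pk: (1 if pk in exact else 0) for pk in exact | fuzzy}
            if tmphash is None:
                tmphash = scores
            else:                        # merge: drop PKs that miss this column
                tmphash = {pk: c + scores[pk]
                           for pk, c in tmphash.items() if pk in scores}
        if not tmphash:
            return None                  # no suitable answer
        best = max(tmphash, key=tmphash.get)  # highest COUNT = most specific
        self.CACHE[key] = best
        return self.VECDATA[best][1]

t = WildcardTable(3)
t.insert(("NYC", "London", "American"), "20KG")
t.insert(("NYC", "*", "Southwest"), "30KG")
t.insert(("*", "*", "Southwest"), "25KG")
t.insert(("*", "LA", "*"), "20KG")
t.insert(("*", "*", "*"), "15KG")
print(t.search(("NYC", "Dallas", "Southwest")))  # 30KG (two exact columns beat one)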
I would recommend doing this with tries decorated with a little extra data. For routes, you want to know the lowest route ID, so we can match each flight to the first available route. For flights, you want to track how many flights are left to match.
What this allows you to do, for instance, is realize ONLY ONCE, partway through the match, that flights from city1 to city2 might match routes starting with (city1, city2), (city1, *), (*, city2), or (*, *), without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = number of unmatched flights (in the flight trie)
        # value = lowest route id (in the route trie)
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight(self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i+1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route(self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i+1)
        else:
            self.item = route

def match_flight_baggage(route_search, flight_search):
    # Construct a heap of one search to do.
    tmp_id = 0
    todo = [((0, tmp_id), route_search, flight_search)]
    # This will hold baggage by flight number.
    matched = {}
    while 0 < len(todo):
        priority, route_search, flight_search = heapq.heappop(todo)
        if 0 == flight_search.value:
            # Already matched all flights below this node.
            pass
        elif flight_search.item is not None:
            # We found a match!
            matched[flight_search.item.flight_no] = route_search.item.baggage
            flight_search.remove_flight()
        else:
            for key, r_search in route_search.trie.items():
                if key == '*':  # Found a wildcard.
                    for a_search in flight_search.trie.values():
                        if 0 < a_search.value:
                            heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                            tmp_id += 1
                elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                    heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                    tmp_id += 1
    return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)
route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)

print(match_flight_baggage(route_search, flight_search))
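Running this should print each flight number mapped to the baggage of its first (lowest-id) matching route; with the sample data above that should be something like {0: '20KG', 1: '30KG', 2: '25KG', 3: '20KG', 4: '15KG'} (the dict ordering may vary).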
As Wisblade notes in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M), which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap whose keys are strings of concatenated column values, separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g., suppose you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 8 possibilities to check in the hashmap -- not bad! And you may well find an answer much earlier.
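For an arbitrary number of columns, the probing loop generalizes naturally. Here is a small sketch in Python; the function name and the choice to try the most specific patterns (fewest wildcards) first are my own:

from itertools import combinations

def wildcard_get(table, values):
    # Try all 2^M masks of wildcarded column positions,
    # fewest wildcards first, and return the first hit.
    m = len(values)
    for k in range(m + 1):
        for stars in combinations(range(m), k):
            key = "/".join('*' if i in stars else v
                           for i, v in enumerate(values))
            if key in table:
                return table[key]
    return None

table = {
    "NYC/London/American": "20KG",
    "NYC/*/Southwest": "30KG",
    "*/*/Southwest": "25KG",
    "*/LA/*": "20KG",
    "*/*/*": "15KG",
}
print(wildcard_get(table, ["NYC", "LA", "Southwest"]))  # 30KG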
I did some tests on the performance of selects from ETS tables and noticed some weird behaviour. For example, we have a simple ETS table (without any specific options) which stores key/value pairs - a random string and a number:
:ets.new(:table, [:named_table])

for _i <- 1..2000 do
  :ets.insert(:table, {:crypto.strong_rand_bytes(10)
                       |> Base.url_encode64
                       |> binary_part(0, 10), 100})
end
and one entry with a known key:
:ets.insert(:table, {"test_string", 200})
Now there is a simple, naive benchmark function which selects test_string from the ETS table many times and measures the time of each selection:
test_fn = fn() ->
  Enum.map(Enum.to_list(1..10_000), fn(_x) ->
    :timer.tc(fn() ->
      :ets.select(:table, [{{:'$1', :'$2'},
                            [{:'==', :'$1', "test_string"}],
                            [:'$_']}])
    end)
  end) |> Enum.unzip
end
Now, if I look at the maximum time with Enum.max(timings), it is approximately 10x greater than almost all of the other selections. For example:
iex(1)> {timings, _result} = test_fn.()
....
....
....
iex(2)> Enum.max(timings)
896
iex(3)> Enum.sum(timings) / length(timings)
96.8845
We can see here that the maximum value is almost 10x greater than the average value.
What's happening here? Is it somehow related to GC, time for memory allocation, or something like that? Do you have any ideas why selecting from an ETS table sometimes shows such slowdowns, or how to profile this?
UPD: here is the graph of the timings distribution:
The match_spec, the second argument of select/2, is what makes it slower.
According to an answer to the question Erlang: ets select and match performance:
In trivial non-specific use-cases, select is just a lot of work around match.
In non-trivial more common use-cases, select will give you what you really want a lot quicker.
Also, if you are working with tables of type set or ordered_set, use lookup/2 instead to get a value based on a key, as it is a lot faster.
On my PC, the following code
def lookup() do
  {timings, _} = Enum.map(Enum.to_list(1..10_000), fn(_x) ->
    :timer.tc(fn() ->
      :ets.lookup(:table, "test_string")
    end)
  end) |> Enum.unzip
  IO.puts Enum.max(timings)
  IO.puts Enum.sum(timings) / length(timings)
end
printed
0
0.0
While yours printed
16000
157.9
In case you are interested, here you can find the C implementation of ets:select:
https://github.com/erlang/otp/blob/9d1b3bb0db87cf95cb821af01189f6d6be072f79/erts/emulator/beam/erl_db.c
In Elixir, I would like to be able to filter an ETS table using a function.
I currently have a simple ets table example in the iex shell...
iex> :ets.new(:nums, [:named_table])
:nums
iex> :ets.insert :nums, [{1}, {2}, {3}, {4}, {5}]
true
iex> fun = :ets.fun2ms(fn {n} when n < 4 -> n end)
[{{:"$1"}, [{:<, :"$1", 4}], [:"$1"]}]
iex> :ets.select(:nums, fun)
[1, 3, 2]
This all works as you would expect. My question relates to the function being used to query the ets table. Currently it uses a guard clause to filter for results less than 4.
I would like to know if there is a way to put the guard-clause logic into the function body. For example...
iex> fun2 = :ets.fun2ms(fn {n} -> if n < 4, do: n end)
but if I do this then I get the following error...
Error: the language element case (in body) cannot be translated into match_spec
{:error, :transform_error}
Is something like this possible?
It turns out this is the only way to go.
From the Erlang documentation:
The fun is very restricted, it can take only a single parameter (the object to match): a sole variable or a tuple. It must use the is_ guard tests. Language constructs that have no representation in a match specification (if, case, receive, and so on) are not allowed.
More info about Match Specifications in Erlang
The matrix-qr function in Racket's math library returns two values. I know about call-with-values to pass both returned values to the next function.
However, how can I take each individual output and define some constant with that value? The QR function outputs a Q matrix and an R matrix. I need something like:
(define Q ...)
(define R ...)
Also, how could I just use one of the outputs from a function that outputs two values?
The usual way to create definitions for multiple values is to use define-values, which pretty much works like you’d expect.
(define-values (Q R) ; Q and R are defined
  (matrix-qr (matrix [[12 -51   4]
                      [ 6 167 -68]
                      [-4  24 -41]])))
There is also a let equivalent for multiple values, called let-values (as well as let*-values and letrec-values).
Ignoring values is harder. There is no function like (first-value ...), for example, because ordinary function application does not produce a continuation that can accept multiple values. However, you can use something like match-define-values along with the _ “hole marker” to ignore values and simply not bind them.
(match-define-values (Q _) ; only Q is defined
  (matrix-qr (matrix [[12 -51   4]
                      [ 6 167 -68]
                      [-4  24 -41]])))
It is theoretically possible to create a macro that could either convert multiple values to a list or simply only use a particular value, but in general this is avoided. Returning multiple values should not be done lightly, which is why, for almost all functions that return them, it wouldn’t usually make much sense to use one of the values but ignore the other.
I'm a student who is really new to functional programming. I'm working on a banking application where the data has already been defined as:
type Accountno = Int
data Accounttype = Saving | Current | FixedDeposit deriving (Show, Read)
type Accountamount = Int
type Name = String
type Account = (Accountno, Name, Accounttype, Accountamount)

exampleBase :: [Account]
exampleBase = [ (1, "Jennifer", Saving, 1000),
                (5, "Melissa", Current, 3000),
                (2, "Alex", Saving, 1500) ]
I'm trying to sort the list by account number using the following code:
sortByID :: (Ord a) => [a] -> [a]
sortByID [] = []
sortByID (l:ls) =
  let
    smallerSorted = sortByID [x | x <- ls, x <= l]
    biggerSorted  = sortByID [x | x <- ls, x > l]
  in
    smallerSorted ++ [l] ++ biggerSorted

viewSortedDetails :: IO ()
viewSortedDetails =
  do
    putStrLn "Account Details Sorted By Account ID"
    let records = sortByID exampleBase
    let viewRecord = map show records
    mapM_ putStrLn viewRecord
But I do not get the expected result, as it gives me an error: "Instance of Ord Accounttype required for definition of viewSortedDetails". Please can someone help me overcome this problem?
Thanks a lot!
Well, the problem is that you're using ordering comparisons, such as <=, on two Account values, which requires Account to be an instance of Ord. Now, Account is a synonym for a four-element tuple, and tuples are instances of Ord when all the types they contain are. Accountno, Name, and Accountamount are all synonyms for types with Ord instances, but Accounttype is not.
You could make it possible to sort Account values directly by making Accounttype an instance of Ord, which you can do by simply adding it to the deriving clause.
However, if you want to specifically sort only by the account number, not the other elements of the tuple, you'll need to do something differently. One option would be to make Account a data type with a custom Ord instance:
data Account = Account Accountno Name Accounttype Accountamount deriving (Eq, Show, Read)
instance Ord Account where
(...)
Then you can define the ordering however you like.
Alternatively, you can leave it as is and instead only compare the element you want instead of the entire Account value, using something like this:
accountNo :: Account -> Accountno
accountNo (n,_,_,_) = n
...and then doing the comparison with something like smallerSorted = sortByID [x | x <- ls, accountNo x <= accountNo l]. The standard libraries also include a function on for this purpose, but it would be awkward to use in this case.
A few other remarks, which are less relevant to your question, on the general subject of Haskell code:
Defining Account as a data type, probably using the record syntax, would be nicer than using a type synonym here. Large tuples can be awkward to work with.
Accountno and Accountamount should probably be different types as well, to avoid mixing them with other Ints: the first because doing arithmetic on account numbers makes little sense, the latter in part because (I'm guessing) you're implicitly using fixed point arithmetic, such that 100 actually means 1.00, and in general just to avoid confusion.
In fact, Int is probably a bad choice for Accountamount anyway: why not something from Data.Fixed, Ratio Integer, or a base-10-safe floating point type (although there isn't one in the standard libraries, unfortunately)?
The standard libraries of course include sorting functions already--I'm assuming the reimplementation is for learning purposes, but in practice it could all be replaced by something like sortBy (compare `on` accountNo).