The title is not so clear, because I cannot put my problem in a sentence (If you have a better title for this question, please suggest). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline   | Free Baggage |
===================================================
| NYC    | London      | American  | 20KG         |
---------------------------------------------------
| NYC    | *           | Southwest | 30KG         |
---------------------------------------------------
| *      | *           | Southwest | 25KG         |
---------------------------------------------------
| *      | LA          | *         | 20KG         |
---------------------------------------------------
| *      | *           | *         | 15KG         |
---------------------------------------------------
and so on ...
This table describes the free baggage amount that the airlines provide on different routes. You can see that some rows have a * value, meaning that they match all possible values (and those values are not necessarily known in advance).
So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination and airline are known), and we want to find the baggage amount for each flight in the most efficient way (iterating over the rule list is obviously not efficient, as each lookup costs O(N)). More than one rule may match a flight; in that case assume that either the first matching rule or the most specific one is preferred (whichever is simpler for you to work with).
If there were no * entries in the table, the problem would be easy: we could use a Hashmap or Dictionary with a tuple of values as the key. But with those * (let's say match-all) keys present, a general solution is not so straightforward.
Please note that the above example was just an example, and I need a solution that can be used for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method having time complexity equal or close to O(1) like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments: the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...
A trivial, quite easy solution could be something like an in-memory SQLite database. But it would probably be something in O(log2(n)), and not O(1). The main advantage is that it's easy to set up, and IF performance is good enough, it could be the final solution.
Here, key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
From scratch, I can't think of any solution that, with N rows and M columns, isn't at least in O(M)... But usually, you'll have far fewer columns than rows. Quickly (I'm writing this on the fly, so I may have skipped a detail), I can propose this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think about this as a primary key in databases, and we'll call it PK. Knowing PK gives you instantly, in O(1), the required data. You'll have N rows grand total.
For each row NOT containing any *, you'll insert into a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact results. Lookups in it are O(1), BUT the tuple you are searching for may not be in it... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks, it doesn't matter as long as both stay consistent).
This hash contains at most N entries. Without any wildcard, it acts almost like a standard hashmap, apart from the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index, dedicated to this column.
The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
For each row, the index entry will be: ( <VECDATA value>, PK ). The container is stored in a vector of indexes, INDEX[i] (with 0<=i<M).
Same as MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...
Searching a row
So, the user searches for a given tuple.
1. Search for it in MAINHASH. If found, return it, search done.
   Upgrade (see below): also search in CACHE before going to step #2.
2. For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returns a vector of PK, EXACT[i]) AND for * (returns another vector of PK, FUZZY[i]).
3. With these two vectors, build another (temporary) hash TMPHASH, associating ( PK, integer COUNT ). It's quite simple: COUNT is initialized to 1 if the entry comes from EXACT, and 0 if it comes from FUZZY.
4. For the next column, build EXACT and FUZZY (see #2). But instead of creating a new TMPHASH, MERGE the results into the existing one.
   The method is: if TMPHASH doesn't have this PK entry, discard the entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, and store it back in TMPHASH. Likewise, a PK already in TMPHASH that appears in neither EXACT nor FUZZY for this column must be removed, since it fails to match this column.
5. Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, then you don't have any suitable answer. Return that to user. If it contains only one entry, same: return to user directly.
For more than one element in TMPHASH:
Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated to the current maximum for COUNT.
Developer's choice: if several entries share the same maximum COUNT, you can return them all, return the first one, or return the last one.
COUNT is obviously always strictly lower than M; otherwise, you would have found the tuple in MAINHASH. This value, compared to M, can give a confidence mark to your result (=100*COUNT/M% of confidence).
You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
Since it would be way too complicated to update properly CACHE when adding/modifying something in VECDATA, simply purge CACHE when it occurs. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by allowing to redefine operators and having all base containers available, but it should work.
Exact matches / cached matches are in O(1). Fuzzy search is in O(n.M), n being the number of matching rows (and 0<=n<N, of course).
Without further research, I can't see anything better than that. It will consume an obscene amount of memory, but you said that it won't be an issue.
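To make the outline above more concrete, here is a minimal Python sketch of it, under some assumptions: the class name WildcardTable and its add / lookup methods are hypothetical, CACHE and persistence are left out, ties on COUNT go to the lowest PK, and queries are assumed to be tuples with one element per column that never contain * themselves.

```python
from collections import defaultdict

WILDCARD = "*"


class WildcardTable:
    """Hypothetical container: VECDATA + MAINHASH + one multimap per column."""

    def __init__(self, num_columns):
        self.vecdata = []     # VECDATA: PK -> (key tuple, payload)
        self.mainhash = {}    # MAINHASH: wildcard-free key tuple -> PK
        # INDEX[i]: column value -> list of PKs having that value in column i
        self.index = [defaultdict(list) for _ in range(num_columns)]

    def add(self, key, payload):
        pk = len(self.vecdata)
        self.vecdata.append((key, payload))
        if WILDCARD not in key:
            self.mainhash[key] = pk
        for i, value in enumerate(key):
            self.index[i][value].append(pk)

    def lookup(self, query):
        pk = self.mainhash.get(query)          # step 1: exact match
        if pk is not None:
            return self.vecdata[pk][1]
        tmphash = None                         # TMPHASH: PK -> COUNT
        for i, value in enumerate(query):
            # FUZZY (wildcard) rows score 0 for this column, EXACT rows score 1.
            column = {p: 0 for p in self.index[i].get(WILDCARD, ())}
            column.update({p: 1 for p in self.index[i].get(value, ())})
            if tmphash is None:
                tmphash = column
            else:
                # Keep only PKs that still match, accumulating COUNT.
                tmphash = {p: count + column[p]
                           for p, count in tmphash.items() if p in column}
        if not tmphash:
            return None
        # Most specific rule wins; ties go to the lowest PK (an arbitrary choice).
        best = max(tmphash, key=lambda p: (tmphash[p], -p))
        return self.vecdata[best][1]


table = WildcardTable(3)
table.add(("NYC", "London", "American"), "20KG")
table.add(("NYC", "*", "Southwest"), "30KG")
table.add(("*", "*", "Southwest"), "25KG")
table.add(("*", "LA", "*"), "20KG")
table.add(("*", "*", "*"), "15KG")
print(table.lookup(("NYC", "Dallas", "Southwest")))
```

The intersection is rebuilt as a new dictionary for each column, which keeps the sketch short; an in-place merge as described in the steps would avoid the extra allocations.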
I would recommend doing this with Tries that have a little data decorated. For routes, you want to know the lowest route ID so we can match to the first available route. For flights you want to track how many flights there are left to match.
What this will allow you to do, for instance, is partway through the match ONLY ONCE realize that flights from city1 to city2 might be matching routes that start off city1, city2, or city1, * or *, city2, or *, * without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = # unmatched flights for flights
        # value = lowest route id for routes.
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight(self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i+1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route(self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i+1)
        else:
            self.item = route

    def match_flight_baggage(route_search, flight_search):
        # Construct a heap of one search to do.
        tmp_id = 0
        todo = [((0, tmp_id), route_search, flight_search)]
        # This will hold by flight number, baggage.
        matched = {}
        while 0 < len(todo):
            priority, route_search, flight_search = heapq.heappop(todo)
            if 0 == flight_search.value: # There are no flights left to match
                # Already matched all flights.
                pass
            elif flight_search.item is not None:
                # We found a match!
                matched[flight_search.item.flight_no] = route_search.item.baggage
                flight_search.remove_flight()
            else:
                for key, r_search in route_search.trie.items():
                    if key == '*': # Found wildcard.
                        for a_search in flight_search.trie.values():
                            if 0 < a_search.value:
                                heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                                tmp_id += 1
                    elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                        heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                        tmp_id += 1
        return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)

route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)

print(route_search.match_flight_baggage(flight_search))
As Wisblade notices in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M) which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 8 possibilities to check in the hashmap -- not bad! and possibly you find an answer much earlier.
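The probing order shown above can be generated mechanically for any M. The following Python sketch is one way to do it, assuming a plain dict in place of the map, a hypothetical helper named lookup, and column values that never contain the "/" separator:

```python
from itertools import product

WILDCARD = "*"


def lookup(rules, query):
    """Try every exact/wildcard combination of `query`, leftmost columns last to go wild."""
    # Each mask says, per column, whether to probe with the wildcard (True)
    # or with the actual value (False). product() yields 2**M masks, starting
    # with "all actual values" and ending with "all wildcards".
    for mask in product([False, True], repeat=len(query)):
        key = "/".join(WILDCARD if wild else value
                       for value, wild in zip(query, mask))
        if key in rules:
            return rules[key]
    return None


rules = {
    "NYC/London/American": "20KG",
    "NYC/*/Southwest": "30KG",
    "*/*/Southwest": "25KG",
    "*/LA/*": "20KG",
    "*/*/*": "15KG",
}
print(lookup(rules, ("NYC", "LA", "Southwest")))
```

Returning at the first hit gives the "first match in probing order" behaviour; collecting all hits instead would let the caller apply a different preference.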
I have quite a specific data set that I need to store in the most compact way as a byte array. It is a live stream of integers that are constantly increasing, often by one, but not always by one. Each integer value has a tag that is a byte value. There may be pairs with the same value and tag, but I need to store only distinct ones. The only supported operations are adding new elements, removal, and checking whether an element exists - I keep this data set to check if some pair has been 'seen' recently.
Some sample data:
# | value | tag |
1 | 1000 | 0 |
2 | 1000 | 1 |
3 | 1000 | 2 |
4 | 1001 | 0 |
5 | 1002 | 2 |
6 | 1004 | 1 |
7 | 1004 | 2 |
8 | 1005 | 0 |
As I said this is a live stream, but I can tolerate storing only last few thousands. The goal is to make it as memory efficient as possible in the storage (and in RAM), operations can cost much.
If I had no tags, I could store ranges or values, (1000-1002), (1002-1005) etc, there are usually about 5-6 values in a row without gaps. But the tags mess all this.
My current approach is to encode each value + tag pair in a few bytes - one byte for tag and 1 or more bytes for 'delta' from previous value.
This way I need to store the first value, 1000 in the above case, and then I store deltas - 0 for #1, #2, 1 for #4, 1 for #5, 2 for #6 etc.
Most deltas are small, 1-10, so I can store them in one byte only - the first bit flags whether the value fits in 7 bits; if it does not, the next 7 bits store how many bytes the delta occupies.
Maybe there is a better, more compact, approach?
Since you have only 127 different tag values, you could maintain 127 different tables, one for each tag, thus saving yourself from having to store the tags. In each table you could still use your nifty trick with deltas.
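As a rough illustration, the per-tag tables combined with the delta encoding from the question might look like this in Python. It is only a sketch under assumptions: the names TagTables, encode_delta and decode_deltas are made up, values for a given tag arrive in non-decreasing order, and removal is not handled.

```python
def encode_delta(delta):
    """One byte if delta < 128, else a length byte (high bit set) plus the delta bytes."""
    if delta < 0x80:
        return bytes([delta])
    length = (delta.bit_length() + 7) // 8
    return bytes([0x80 | length]) + delta.to_bytes(length, "big")


def decode_deltas(buf):
    """Yield the deltas packed in `buf` by encode_delta."""
    pos = 0
    while pos < len(buf):
        head = buf[pos]
        pos += 1
        if head < 0x80:
            yield head
        else:
            length = head & 0x7F
            yield int.from_bytes(buf[pos:pos + length], "big")
            pos += length


class TagTables:
    """One delta-encoded table per tag, so the tag itself is never stored per item."""

    def __init__(self):
        self.tables = {}   # tag -> [first value, last value, bytearray of deltas]

    def add(self, value, tag):
        table = self.tables.get(tag)
        if table is None:
            self.tables[tag] = [value, value, bytearray()]
        elif value > table[1]:
            table[2] += encode_delta(value - table[1])
            table[1] = value
        # value <= last value: assumed to be a duplicate, so it is ignored.

    def values(self, tag):
        table = self.tables.get(tag)
        if table is None:
            return []
        out = [table[0]]
        for delta in decode_deltas(table[2]):
            out.append(out[-1] + delta)
        return out

    def contains(self, value, tag):
        # Linear scan: cheap storage is traded for slower lookups.
        return value in self.values(tag)
```

With the sample data from the question, tag 0 would hold 1000 followed by two one-byte deltas (1 and 4).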
Let the pair (value, tag) where value is a uint32 and tag is a uint8 be a typical item stored in your data structure.
Use an associative array data structure that maps uint32 to an array list of uint16. In C++ terms, the data structure is the following.
std::map<std::uint32_t, std::vector<std::uint16_t>>
Each array list stays sorted with distinct values and never exceeds a size of 2^16.
Let D be an instance of this data structure. We store (value, tag) in the array list D[value >> 8] as (static_cast<std::uint16_t>(value) << 8) + tag.
The idea is basically that the data is paged. The most-significant 3 bytes of value determine the page, and then the least-significant byte of value and the single byte of tag are stored in the page.
This should exploit the structure of your data very efficiently because, assuming each page is holding many values, you're using 2 bytes per item.
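For comparison with the C++ type above, here is a small Python sketch of the same paging idea. The class name PagedSet is hypothetical, a dict stands in for std::map (so pages are not kept in key order), and array('H') is used as the compact list of 16-bit entries:

```python
from array import array
from bisect import bisect_left


class PagedSet:
    """Pages keyed by value >> 8; each entry packs (low byte of value, tag) in 16 bits."""

    def __init__(self):
        self.pages = {}   # value >> 8 -> sorted array('H') of packed entries

    @staticmethod
    def _pack(value, tag):
        return ((value & 0xFF) << 8) | tag

    def add(self, value, tag):
        page = self.pages.setdefault(value >> 8, array("H"))
        packed = self._pack(value, tag)
        pos = bisect_left(page, packed)
        if pos == len(page) or page[pos] != packed:
            page.insert(pos, packed)      # keep the page sorted and distinct

    def contains(self, value, tag):
        page = self.pages.get(value >> 8)
        if page is None:
            return False
        packed = self._pack(value, tag)
        pos = bisect_left(page, packed)
        return pos < len(page) and page[pos] == packed

    def remove(self, value, tag):
        page = self.pages.get(value >> 8)
        if page is None:
            return
        packed = self._pack(value, tag)
        pos = bisect_left(page, packed)
        if pos < len(page) and page[pos] == packed:
            del page[pos]
            if not page:
                del self.pages[value >> 8]   # drop empty pages
```

Because each page is sorted, membership checks are a binary search within at most 2^16 entries.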
I am a newbie in Powershell, but this is driving me a bit crazy. I have looked at various questions here, but could not find an answer so here I go. Apologies if this has been covered already.
I have two text files containing columns of numbers. I would like to create an array containing those 2 columns and sort it by column 1 or 2.
If we had
$a=@(1,5,10,15,25)
$b=@(100,99,98,99,10)
we create
$c=$a,$b
My initial thought was to try something like this:
$c | sort { [int]$_[0] }
But it does not work. I have tried many different things so any advice would be appreciated.
I am editing this as my question was not so clear. Ultimately, if I sort $c by ascending column 2, I expect something like:
25,10
10,98
5,99
15,99
1,100
Any idea how to achieve this ?
I am not sure how you have declared your two-dimensional array; it looks like you want it to be declared like this, or something similar:
$c = @(@(1,100),@(5,99),@(10,98),@(15,99),@(25,10))
If it was in that state then sorting is a breeze
$c | Sort-Object @{Expression={$_[1]}; Ascending=$True} | %{
    "$($_[0]),$($_[1])"
}
Sort-Object works well with one-dimensional arrays. When multiple properties are involved, you need to specify which property to sort on to get the expected output. Since there are no named properties here, we use a calculated expression to sort on the second "column".
Sample Output
25,10
10,98
5,99
15,99
1,100
If you really want to work with your arrays like that we need an intermediate step to convert what you have to how it can be sorted the way you expect.
$a=@(1,5,10,15,25)
$b=@(100,99,98,99,10)
$c = @()
for($i = 0;$i -lt $a.Count; $i++){
    $c += ,@($a[$i],$b[$i])
}
After running this code $c will work just like it does with my sorting.
Welcome to the PowerShell world. The syntax is slightly different from classical programming languages: cmdlets usually take their input from the current pipeline. In this case the command you are talking about is Sort-Object, and you can pipe the array content straight into it:
$c = ($a | Sort-Object), ($b | Sort-Object)
Ruby noob here!
I have an array of structs that look like this
Token = Struct.new(:token, :ordinal)
So an array of these would look like this, in tabular form:
Token | Ordinal
---------------
C | 2
CC | 3
C | 5
And I want to group by the "token" (i.e. the left hand column) of the struct and get a count, but also preserve the "ordinal" element. So the above would look like this
Token | Merged Ordinal | Count
------------------------------
C | 2, 5 | 2
CC | 3 | 1
Notice that the last column is a count of the grouped tokens and the middle column merges the "ordinal". The first column ("Token") can contain a variable number of characters, and I want to group on these.
I have tried various methods, using group_by (I can get the count, but not the middle column), inject, iterating (does not seem very functional) but I just can't get it right, partly because I don't have a good grasp of Ruby and the available operations / functions.
I have also had a good look around SO, but I am not getting very far.
Any help, pointers would be much appreciated!
Use Enumerable#group_by to do the grouping for you and use the resulting hash to get what you want with map or similar.
structs.group_by(&:token).map do |token, with_same_token|
[token, with_same_token.map(&:ordinal), with_same_token.size]
end
I'm working on a multiplayer flash game. The server informs each client what other players are near the player. To do this the server has to check which clients are near each other continuously. The following is what I am using at this moment, as a temporary solution:
private function checkVisibilities()
{
    foreach ($this->socketClients as $socketClient1)
    { //loop every socket client
        if (($socketClient1->loggedIn()) && ($socketClient1->inWorld()))
        { //if this client is logged in and in the world
            foreach ($this->socketClients as $cid2 => $socketClient2)
            { //loop every client for this client to see if they are near
                if ($socketClient1 != $socketClient2)
                { //if it is not the same client
                    if (($socketClient2->loggedIn()) && ($socketClient2->inWorld()))
                    { //if this client is also logged in and also in the world
                        if ((abs($socketClient1->getCharX() - $socketClient2->getCharX()) + abs($socketClient1->getCharY() - $socketClient2->getCharY())) < Settings::$visibilities_range)
                        { //the clients are near each other
                            if (!$socketClient1->isVisible($cid2))
                            { //not yet visible -> add
                                $socketClient1->addVisible($cid2);
                            }
                        }
                        else
                        { //the clients are not near each other
                            if ($socketClient1->isVisible($cid2))
                            { //still visible -> remove
                                $socketClient1->removeVisible($cid2);
                            }
                        }
                    }
                    else
                    { //the client is not logged in
                        if ($socketClient1->isVisible($cid2))
                        { //still visible -> remove
                            $socketClient1->removeVisible($cid2);
                        }
                    }
                }
            }
        }
    }
}
It works fine. However, so far I've only been playing with 2 players at a time. This function loops over every client for every client. So, with 100 players that would be 100 * 100 = 10,000 iterations every time the function is run. This doesn't seem the best or most efficient way to do it.
Now I wonder what you folks think about my current setup and if you have any suggestions on a better way of handling these visibilities.
Update: I forgot to mention that the world is infinite. It is actually "the universe". There are no maps. Also, it is a two dimensional (2D) game.
Thanks in advance.
The first thing I would say is that your code looks inside-out. Why do you have a high level game logic function that has to do the grunt-work of checking which clients are logged in and in the world? All that networking stuff should be removed from the game logic so that it's done on a higher level and the in-game logic only has to handle the players who are currently playing and in the world. This leaves you with a simple question: are these 2 players near enough to each other? A simple distance check suffices here, as you already have.
The next thing is to reduce the amount of looping you do. Distance is symmetric, so you don't need to check the distance between A and B as well as between B and A. To exploit this, whereas your first loop goes through all the clients, the second loop only needs to iterate over the clients that come after the current one, updating the visibility of both clients in each pair. This halves the number of iterations you need to do.
You also don't have to do this continuously, as you state. You just have to do it often enough to ensure that the game runs smoothly. If movement speed is not all that high then you might only have to do this every few seconds for it to be good enough.
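A language-neutral way to picture the halved loop is to enumerate unordered pairs. The Python sketch below is only an illustration: the function name visible_pairs and the dict of positions are assumptions, and it uses the same Manhattan-distance test as the code in the question.

```python
from itertools import combinations


def visible_pairs(positions, visibility_range):
    """positions: client id -> (x, y). Returns the set of (id_a, id_b) pairs in range."""
    near = set()
    # combinations() yields each unordered pair exactly once: N*(N-1)/2 checks.
    for (id_a, (ax, ay)), (id_b, (bx, by)) in combinations(positions.items(), 2):
        if abs(ax - bx) + abs(ay - by) < visibility_range:
            near.add((id_a, id_b))
    return near


positions = {1: (0, 0), 2: (3, 4), 3: (100, 100)}
print(visible_pairs(positions, 10))
```

For 100 clients this performs 4,950 distance checks instead of 10,000; each resulting pair then updates the visibility lists of both clients.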
If this is still not good enough for you then some sort of spatial hashing system as described by ianh is a good way of reducing the number of queries you do. A grid is easiest but some sort of tree structure (ideally self-balancing) is another option.
The most straightforward solution is to partition the world into a uniform grid, like so:
_|____|____|____|_
 |    |    |    |
_|____|____|____|_
 |    |    |    |
_|____|____|____|_
 |    |    |    |
_|____|____|____|_
 |    |    |    |
Then insert your objects into any grid tile that they intersect:
_|____|____|____|_
 | #  |    |    |
_|____|____|____|_
 |    |d d |    |
_|____|____|____|_
 |    | d  | d  |
_|____|____|____|_
 |    |    |    |
Now to do a query for nearby objects, you only need to look at nearby cells. For example, to see who is within one tile of the player (#), you only need to check 9 tiles, not the whole map:
/|////|////|____|_
/|/#//|////|    |
/|////|////|____|_
/|////|d/d/|    |
/|////|////|____|_
 |    | d  | d  |
_|____|____|____|_
 |    |    |    |
Depending on your world, however, this technique can be quite wasteful: there could be a lot of empty cells. If this becomes a problem, you may want to implement a more complex spatial index.
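One way to keep the uniform grid while avoiding the cost of empty cells is to store only occupied cells in a hash map keyed by cell coordinates, which also copes with an unbounded world. The following Python sketch assumes point-like players, a hypothetical SpatialGrid class, and the Manhattan-distance test from the question:

```python
import math
from collections import defaultdict


class SpatialGrid:
    """Sparse uniform grid: only occupied cells exist in the dictionary."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # (cell x, cell y) -> client ids
        self.where = {}                 # client id -> (x, y)

    def _cell(self, x, y):
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def move(self, client_id, x, y):
        """Insert a client or update its position."""
        old = self.where.get(client_id)
        if old is not None:
            old_cell = self._cell(*old)
            self.cells[old_cell].discard(client_id)
            if not self.cells[old_cell]:
                del self.cells[old_cell]
        self.where[client_id] = (x, y)
        self.cells[self._cell(x, y)].add(client_id)

    def nearby(self, client_id, visibility_range):
        """Ids of other clients within visibility_range (Manhattan distance)."""
        x, y = self.where[client_id]
        cx, cy = self._cell(x, y)
        reach = math.ceil(visibility_range / self.cell_size)
        found = set()
        for gx in range(cx - reach, cx + reach + 1):
            for gy in range(cy - reach, cy + reach + 1):
                for other in self.cells.get((gx, gy), ()):
                    if other == client_id:
                        continue
                    ox, oy = self.where[other]
                    if abs(x - ox) + abs(y - oy) < visibility_range:
                        found.add(other)
        return found


grid = SpatialGrid(cell_size=10)
grid.move(1, 0, 0)
grid.move(2, 3, 4)
grid.move(3, 100, 100)
grid.move(4, -2, -3)
print(grid.nearby(1, 10))
```

Choosing a cell size close to the visibility range keeps each query to roughly a 3 x 3 block of cells.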
Try using a quad tree to represent the players' locations.
The wiki article for this is here.
It keeps the objects you give it (the users) in a tree that partitions the space (the plane) as finely as needed.
As for the infinity problem - nothing in programming is really infinite, so define a border which cannot be passed by the users (go even for a very large number for a coordinate, something that will take a user 100 years or so to get to).
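A bare-bones point quadtree over such a bounded square could look like the Python sketch below. It is a sketch under assumptions rather than a production structure: the QuadTree class, its capacity parameter and the minimum cell size of 1 are hypothetical choices, the world bounds are a power of two so that halving stays exact, and moving a player (remove plus re-insert) is not shown.

```python
class QuadTree:
    """Point quadtree covering the square [x, x + size) x [y, y + size)."""

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []        # (px, py, payload), only used while this is a leaf
        self.children = None    # four sub-squares once the node has been split

    def _contains(self, px, py):
        return (self.x <= px < self.x + self.size and
                self.y <= py < self.y + self.size)

    def insert(self, px, py, payload):
        if not self._contains(px, py):
            return False        # outside the world border
        if self.children is None:
            # Stop splitting at size 1 so identical points cannot recurse forever.
            if len(self.points) < self.capacity or self.size <= 1:
                self.points.append((px, py, payload))
                return True
            self._split()
        return any(child.insert(px, py, payload) for child in self.children)

    def _split(self):
        half = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, half, self.capacity)
                         for dx in (0, half) for dy in (0, half)]
        for point in self.points:
            any(child.insert(*point) for child in self.children)
        self.points = []

    def query(self, qx, qy, radius):
        """Payloads of all points within `radius` (Manhattan distance) of (qx, qy)."""
        # Skip this square if the search area cannot overlap it.
        if (qx + radius < self.x or qx - radius >= self.x + self.size or
                qy + radius < self.y or qy - radius >= self.y + self.size):
            return []
        if self.children is None:
            return [payload for px, py, payload in self.points
                    if abs(px - qx) + abs(py - qy) < radius]
        found = []
        for child in self.children:
            found.extend(child.query(qx, qy, radius))
        return found


tree = QuadTree(-1024, -1024, 2048)
for player, (px, py) in {1: (0, 0), 2: (3, 4), 3: (100, 100),
                         4: (-2, -3), 5: (500, -500), 6: (1, 1)}.items():
    tree.insert(px, py, player)
print(sorted(tree.query(0, 0, 10)))
```

Unlike the uniform grid, the tree only subdivides where players cluster, so sparse regions of the world cost almost nothing.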