Ruby and pointers

I'm programming a dungeon generator for a little game.
Dungeons are made of rooms. A room has connections to other rooms.
room.connections = [room_a, room_b]
and
room.number = 1 # unique id
Now I need to pick a room by its number.
I first did this with a recursive_scan method, which did not work because rooms can form cycles, which causes a SystemStackError (stack level too deep). So I added an array called already_scanned, holding the numbers of the rooms that had already been visited, to the method's arguments. Then it didn't scan all rooms - and honestly I have no idea why; by my understanding it should have worked.
Then I tried to also put all rooms in an array and iterate over that array for the wanted room - but here I get the problem that every room is basically connected to every other room, at least with some other rooms between them; so the array grows as big as dungeon_size * array_of_rooms.length.
What I need now is an explicit pointer - I know almost every variable in Ruby is a reference, except Fixnums and Floats (and maybe some others). Even so, the array gets too big, so I need a real pointer.
(I also tried to set up an array of object_ids and load them via ObjectSpace, but sadly - because I often have to load the rooms - the rooms with the wanted object_id have already been recycled by then, as an error message explains.)
This is my recursive scan method:
def room(number)
  recursive_scan(@map, number, []) # @map is the entrance room
end

private

def recursive_scan(room, number, scanned)
  scanned << room.room_number
  if room.room_number == number
    room
  else
    r = nil
    room.connections.each do |next_room|
      if !scanned.include?(next_room.room_number)
        r = recursive_scan(next_room, number, scanned)
      end
    end
    r
  end
end

Everything in Ruby is already a reference.
Why not just maintain a room index?
rooms[room.number] = room
Then you can get anything with rooms[i]. I would keep the index up to date incrementally by simply modifying the initialize method of Room.
def initialize
  rooms[self.number] = self
  # ...
end
This won't take up much space because the array is just an index; it doesn't actually hold copies of the rooms. Each reference obtained from the array is essentially the same thing as a reference obtained via any other mechanism in your program, and the only real difference between the reference and a classic pointer is a bit of overhead for garbage collection.
If rooms are ever deleted (other than just before exit), you will want to set rooms[x] = nil on deletion.
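A minimal sketch of that incremental index (the class-level array and the find/delete helpers are illustrative, not code from the question):

class Room
  @@rooms = [] # the index: room number -> room

  attr_reader :number, :connections

  def initialize(number)
    @number = number
    @connections = []
    @@rooms[number] = self # register each room as it is created
  end

  def self.find(number)
    @@rooms[number] # O(1) lookup, no graph walk needed
  end

  def self.delete(room)
    @@rooms[room.number] = nil # clear the slot so the object can be collected
  end
end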
I don't see why you need to create the data structure first and then index the rooms, but FWIW you should be able to do that recursive enumeration and use the room's presence in the room index array as the been-here flag. I'm not sure why it didn't work before, but it really has to if written carefully.
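FWIW, here is a sketch of that careful version; my reading is that the culprit in the question's code is that r gets overwritten by later iterations of the each loop, so a found room can be replaced by nil:

def recursive_scan(room, number, scanned)
  scanned << room.room_number
  return room if room.room_number == number
  room.connections.each do |next_room|
    next if scanned.include?(next_room.room_number)
    found = recursive_scan(next_room, number, scanned)
    return found if found # stop as soon as the room has been found
  end
  nil
end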

It is a classical graph problem. Using a graph library will handle most of these issues. Try the rgl gem.
Define each room as a vertex in the graph. The connections will be the edges.
require "rgl/adjacency"
map = RGL::AdjacencyGraph.new
rooms.each {|room| map.add_vertex room}
rooms.connections.each {|room1, room2| map.add_edge room1, room2}
Now you can test whether two rooms are directly connected in O(1):
map.has_edge? room1, room2
Or get the list of all rooms:
map.vertices
You can also get the list of all adjacent rooms:
map.adjacent_vertices(room)

A really hackish way to get all Rooms in memory would be:
all_rooms = ObjectSpace.each_object(Room).to_a

You may want to look at the NArray gem, which will speed up working with arrays of numbers. But you may be trying to force a square peg into a round hole with this approach.

Related

Resource Allocation Algorithm (Weights in Containers)

I am currently trying to work through this problem, but I cannot seem to find a solution.
So here is the premise: there are k containers, each with a capacity associated with it. You are to place weights in these containers. The weights can have arbitrary values, but the total weight in a container cannot exceed its capacity, or else the container will break. There could be a situation where a new weight does not fit in any of the containers; then you can rearrange the existing weights to accommodate the new one.
Example:
Container 1: [10, 4], Capacity = 20
Container 2: [7, 6], Capacity = 20
Container 3: [10, 6], Capacity = 20
Now lets say we have to add new weight with value 8.
One possible solution is to move the 6 from Container 2 to Container 1. And place the new weight in Container 2.
Container 1: [10, 4, 6], Capacity = 20
Container 2: [7, 8], Capacity = 20
Container 3: [10, 6], Capacity = 20
I would like to do this reallocation in as few moves as possible.
Let me know if this does not make sense. I am sure there is an algorithm out there but I just cannot seem to find it.
Thanks.
I thought the "Distribution of Cookies" problem would help, but that requires too many moves.
As I noted in the comments, the problem of finding whether ANY solution exists is called Bin Packing and is NP-complete. Therefore any solution is either going to sometimes fail to find answers, or will be possibly exponentially slow.
The stated preference is for sometimes failing to find an answer. So I'll make reasonable decisions that result in that.
Note that this would take me a couple of days to implement. Take a shot yourself, but if you want you can email btilly#gmail.com and we can discuss a contract. (I already spent too long on it.)
Next, the request for the shortest path means a breadth-first search. So we'll take a breadth-first search through "the reasonableness of the path". Basically we'll try greedy strategies first, and cut the search off if it takes too long. So we may find the wrong answer (if greedy was wrong), or give up (if it takes too long). But we'll generally do reasonably well.
So what is a reasonable path? Well, a good greedy heuristic for bin packing is to always place the heaviest thing first, and to place it in the fullest bin where it fits. That's great for placing a bunch of objects at once, but it won't help you directly with moving objects.
And therefore we'll prioritize moves that create large holes first. And so our rules for the first things to try become:
Always place the heaviest thing we have first.
If possible, place it where we leave the container as full as possible.
Try moving things to create large spaces before small ones.
Deduplicate early.
Figuring this out is going to involve a lot of "pick the closest-to-full bin where I fit" and "pick the smallest thing in this bin which lets me fit". And you'd like to do this while looking at a lot of "we did X, Y and Z..." and then "...or maybe X, Y and W...".
Luckily I happen to have a perfect data structure for this. https://stackoverflow.com/a/75453554/585411 shows how to have a balanced binary tree, kept in sorted order, which is easy to clone so you can try something out without touching the original tree. There I did it so you can iterate over the old tree, but you can also use it to create a clone and try something out that you may later abandon.
I didn't make that a multi-set (able to hold an element multiple times) or add a next_biggest method. A multi-set is doable by adding a count to each node; contains can then return a count (possibly 0) instead of a boolean. And next_biggest is fairly easy to add.
We need to add a hash function to this for deduplication purposes. We can define this recursively with:
node.hash = some_hash(some_hash(node.value) + some_hash(node.left.hash) + some_hash(node.right.hash))
(insert appropriate default hashes if node.left or node.right is None)
If we store this in the node at creation, then looking it up for deduplication is very fast.
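In Ruby terms (to match the rest of this page), that cached structural hash might look like the following sketch; the Node fields and the build helper are my own illustration, not the linked answer's code:

# Each node caches a hash combining its value with its children's hashes,
# so comparing whole subtrees for deduplication is an O(1) hash comparison.
Node = Struct.new(:value, :left, :right, :cached_hash) do
  def self.build(value, left = nil, right = nil)
    h = [value, left && left.cached_hash, right && right.cached_hash].hash
    new(value, left, right, h)
  end
end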
With this, if you have many bins with many objects each, you can have the objects stored in sorted order of size, and the bins stored sorted by free space, then by bin.hash. And now the idea is to add a new object to a bin as follows:
new_bin = old_bin.add(object)
new_bins = old_bins.remove(old_bin).add(new_bin)
And remove similarly with:
new_bin = old_bin.remove(object)
new_bins = old_bins.remove(old_bin).add(new_bin)
And with n objects across m bins this constructs each new state using only O(log(n) + log(m)) new data. And we can easily see if we've been here before.
And now we create partial solution objects consisting of:
prev_solution (the solution we came from, may be None)
current_state (our data for bins and objects in bins)
creation_id (ascending id for partial solutions)
last_move (object, from_bin, to_bin)
future_move_bins (list of bins in order of largest movable object)
future_bins_idx (which one we last looked at)
priority (what order to look at these in)
moves (how many moves we've actually used)
move_priority (at what priority we started emptying the from_bin)
Partial solutions should compare based on priority and then creation_id. They should hash based on (solution.state.hash, solution.last_move.to_bin.hash, future_bins_idx).
There will need to be a method called next_solutions. It will return the next group of future solutions to consider. (Thanks to the persistent tree structure, these can share most of their data with the solution they came from.)
The first partial solution will have prev_solution = None, creation_id = 1, last_move = None, and priority = moves = move_priority = 0. The future_move_bins will be a list of bins sorted by biggest movable element, descending. And future_bins_idx will be 0.
When we create a new partial solution, we will have to:
clone old solution into self
self.prev_solution = old solution
self.creation_id = next_creation_id
next_creation_id += 1
set self.last_move
remove object from self.state.from_bin
add object to self.state.to_bin
(fixing future_move_bins left to caller)
self.moves += 1
if the new from_bin matches the previous:
    self.priority = max(self.moves, self.move_priority)
else:
    self.priority += 1
    self.move_priority = self.priority
OK, this is a lot of setup. We're ALMOST there. (Except for the key future_moves business.)
The next thing we need is the idea of a Priority Queue, which in Python can be realized with heapq.
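Since the rest of this page is Ruby: Ruby's standard library has no heap, but a minimal binary min-heap is short. A sketch (elements only need to support <=>):

class MinHeap
  def initialize
    @a = []
  end

  def push(x)
    @a << x
    i = @a.size - 1
    while i > 0 && (@a[(i - 1) / 2] <=> @a[i]) > 0 # sift the new element up
      @a[(i - 1) / 2], @a[i] = @a[i], @a[(i - 1) / 2]
      i = (i - 1) / 2
    end
  end

  def pop
    top = @a[0]
    last = @a.pop
    return top if @a.empty?
    @a[0] = last # move the last element to the root, then sift it down
    i = 0
    loop do
      c = 2 * i + 1
      c += 1 if c + 1 < @a.size && (@a[c + 1] <=> @a[c]) < 0 # smaller child
      break if c >= @a.size || (@a[i] <=> @a[c]) <= 0
      @a[i], @a[c] = @a[c], @a[i]
      i = c
    end
    top
  end
end

Pushing [priority, creation_id, solution] triples would make pop return the lowest priority (then lowest creation_id) first, matching the comparison order described above; since creation_ids are unique, the solutions themselves are never compared.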
And NOW here is the logic for the search:
best_solution_hash = {}
best_space_by_moves = {}
construct initial_solution
queue = []
add initial_solution.next_solutions() to queue
while len(queue) and not_time_to_stop(): # cut off to avoid endless searches
    solution = heapq.heappop(queue)
    # ANSWER HERE?
    if can add target object to solution.state:
        walk prev_solution backwards to get the moves we want
        return reverse of the moves we found
    if solution.hash() not in best_solution_hash:
        # We have never seen this solution hash
        best_solution_hash[solution.hash()] = solution
    elif solution.moves < best_solution_hash[solution.hash()].moves:
        # This is a better way of reaching a state we previously got to!
        # We want to redo that work with higher priority!
        solution.priority = min(solution.priority, best_solution_hash[solution.hash()].priority - 0.01)
        best_solution_hash[solution.hash()] = solution
    if best_solution_hash[solution.hash()] == solution:
        for next_solution in solution.next_solutions():
            # Is this solution particularly promising?
            if solution.moves not in best_space_by_moves or \
               best_space_by_moves[solution.moves] <= space left in solution.last_move.from_bin:
                # Promising, maybe best solution? Let's prioritize it!
                best_space_by_moves[solution.moves] = space left in solution.last_move.from_bin
                solution.priority = solution.move_priority = solution.moves
            add next_solution to queue
return None # because no solution was found
So the idea is that we take the best-looking current solution, consider just a few related solutions, and add them back to the queue, generally with a higher priority. So if something fairly greedy works, we'll try that fairly quickly. In time we'll get to unpromising moves. If one of those surprises us on the upside, we'll set its priority to moves (thereby making us focus on it), and explore that path more intensely.
So what does next_solutions do? Something like this:
def next_solutions(solution):
    if solution.last_move is None:
        if future_move_bins is not empty:
            yield result of moving largest movable object in future_move_bins[0]
                to the first bin it can go into (ie with enough space)
    else:
        if can do this from solution:
            yield result of moving largest movable object...
                in future_move_bins[future_bins_idx]...
                to the smallest bin it can go in...
                ...at least as big as last_move.to_bin
        if can move smaller object from same bin in prev_solution:
            yield that with priority solution.priority + 2
        if can move same object to later to_bin in prev_solution:
            yield that with priority solution.priority + 2
        if can move object from next future_bins_idx in prev_solution:
            yield result of moving that with priority solution.priority + 1
Note that trying to move small objects first, or moving objects to an emptier bin than needed, are possible moves, but unlikely to be good ideas. So I penalized them more severely to make the priority queue focus on better ideas. This results in a branching factor of about 2.7.
So if an obvious greedy approach succeeds in fewer than 7 steps, the queue will likely get to a size of 1000 or so before you find it. And you are likely to find it even after a couple of suboptimal choices.
Even if a couple of unusual choices need to be made, you'll still get an answer quickly. You might not find the best one, but you'll generally find pretty good ones.
Solutions of a dozen moves with a lot of data will require the queue to grow to around 100,000 items, and that should take on the order of 50-500 MB of memory. That's probably where this approach maxes out.
This may all be faster (by a lot) if the bins are full enough that there aren't many moves to make.

Godot : How to instantiate a scene from a list of possible scenes

I am trying to create a game whose procedural generation will be like The Binding of Isaac's: successive rooms selected from a list. I think I will be able to link them together and all, but I have a problem: how do I choose a room from a list?
My first thought is to create folders containing scenes, something like
zone_1
    basic_rooms
        room_1.tscn
        room_2.tscn
    special_rooms
        ...
zone_2
    ...
and to select a random scene from the folder I need, for example a random basic room from the first zone would be a random scene from zone_1/basic_rooms.
The problem is that I have no idea if this is a good solution, as it will create lots of scenes, and I don't know how to do it properly. Do I simply use a string containing the folder path, or are there better ways? Then I suppose I get all the files in the folder, choose one randomly, load it and instantiate it, but again, I'm not sure.
I think I got a little lost in my explanations, but to summarize: I am looking for a way to select a room layout from a list, and don't know how to do it.
What you suggest would work.
You can instance a scene with this pattern:
var room_scene = load("res://zone/room_type/room_1.tscn")
var room_instance = room_scene.instance()
parent.add_child(room_instance)
I'll also remind you to give a position to the room_instance.
So, as you said, you can build the string you pass to load.
I suggest putting that logic in an autoload and calling it where you need it.
However, the above code will stop the game while it is loading the scene. Instead do Background Loading with ResourceLoader.
First you need to call load_interactive which will give you a ResourceInteractiveLoader object:
loader = ResourceLoader.load_interactive(path)
Then you need to call poll on the loader until it returns ERR_FILE_EOF, at which point you can get the scene with get_resource:
if loader.poll() == ERR_FILE_EOF:
    scene = loader.get_resource()
Otherwise, it means that the call to poll wasn't enough to finish loading.
The idea is to spread the calls to poll across multiple frames (e.g. by calling it from _process).
You can call get_stage_count to get the number of times you need to call poll, and get_stage will tell you how many times you have called it so far.
Thus, you can use them to compute the progress:
var progress = float(loader.get_stage()) / loader.get_stage_count()
That gives you a value from 0 to 1, where 0 is not loaded at all and 1 is done. Multiply by 100 to get a percentage to display. You may also use it for a progress bar.
The problem is that I have no idea if this a good solution as it will create lots of scenes
This is not a problem.
Do I simply use a string containing the folder path
Yes.
Then I suppose I get all the files in the folder, choose one randomly
Not necessarily.
You can make sure that all the scenes in the folder have the same name except for a number; then you only need to know how many scenes are in the folder, and pick a number.
However, you may not want full randomness. Depending on your approach to generate the rooms, you may want to:
Pick the room based on the connections it has. To make sure it connects to adjacent rooms.
Have weights for how common or rare a room should be.
Thus, it would be useful to have a file with that information (e.g. a JSON or a CSV file). Then your autoload code responsible for loading scenes would load that file into a data structure (e.g. a dictionary or an array), from which it can pick what scene to load, considering any weights or constraints specified there.
I will assume that your rooms exist on a grid, and can have doors for NORTH, SOUTH, EAST, WEST. I will also assume that the player can backtrack, so the layout must be persistent.
I don't know how far ahead you will generate. You can choose to generate the whole map at once, generate rooms as the player attempts to enter them, or generate a few rooms ahead.
If you are going to generate as the player attempts to enter, you will want a room transition animation in which you can hide the scene loading (with the Background Loading approach).
However, you should not generate a room that has already been generated. Thus, keep a literal grid (an array) where you store whether a room has been generated. You would first check the grid (the array); if the room has been generated, there is nothing to do. But if it hasn't, then you need to pick a room at random.
But wait! If you are entering - for example - from the south, the room you pick must have a south door to go back. If you organize the rooms by the doors they have, then you can pick from the rooms that have south doors - in this example.
In fact, you need to consider the doors of any neighbor rooms you have already generated. Thus, store in the grid (the array) what doors each generated room has, so you can later read from the array to see what doors the new room needs. If there is no neighboring room, decide at random whether you want a door there. Then pick a room at random from the sets that have those doors.
Your sets of rooms would be the combinations of NORTH, SOUTH, EAST, WEST. A way to generate the list is to give each direction a power of two. For example:
NORTH = 1
SOUTH = 2
EAST = 4
WEST = 8
Then, to figure out the sets, you can count, and the binary representation gives the doors. For example, 10 = 8 + 2 -> WEST and SOUTH.
Those are your sets of rooms. To reiterate, look at the already generated neighbors for doors going into the room you are going to generate. If there is no room, decide at random if you want a door there. That should tell you from what set of rooms you need to pick to generate.
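To illustrate decoding such a set id (shown in Ruby purely for brevity; the same bit arithmetic works in GDScript):

NORTH, SOUTH, EAST, WEST = 1, 2, 4, 8

# Every integer from 0 to 15 names one set of doors.
(0..15).each do |set_id|
  doors = []
  doors << "NORTH" if set_id & NORTH != 0
  doors << "SOUTH" if set_id & SOUTH != 0
  doors << "EAST"  if set_id & EAST != 0
  doors << "WEST"  if set_id & WEST != 0
  puts "#{set_id}: #{doors.join(' + ')}" # e.g. "10: SOUTH + WEST"
end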
This is similar to the approach auto-tile solutions use. You may want to read up on how that works.
Now assuming the rooms in the set have weights (so some rooms are more common and others are rarer), and you need to pick at random.
This is the general algorithm:
Sum the weights.
Normalize the weights (Divide the weights by the sum, so they add up to 1).
Accumulate the normalized weights.
Generate a random number from 0 to 1, and find the first accumulated normalized weight that is greater than the random number we got; that entry is the pick.
Since, presumably, you will be picking rooms from the same set multiple times, you can calculate and store the accumulated normalized weights (let us call them final weights), so you don't recompute them every time.
You can compute them like this:
var total_weight:float = 0.0
for option in options:
    total_weight = total_weight + option.weight

var final_weight:float = 0.0
var final_weights:Array = []
for option in options:
    var normalized_weight = option.weight / total_weight
    final_weight = final_weight + normalized_weight
    final_weights.append(final_weight)
Then you can pick like this:
var randomic:float = randf()
for index in final_weights.size():
    if final_weights[index] > randomic:
        return options[index]
return options[options.size() - 1] # guard against floating-point rounding
Once you have picked which room to generate, you can load it (e.g. with the Background Loading approach), instance it, and add it to the scene tree. Remember to give it a position in the world.
Also remember to update the grid (the array) information. You picked a room from a set that has certain doors; you want to store that, to take into account when generating the adjacent rooms.
And, by the way, if you need large-scale pathfinding (for something going from one room to another), you can use that grid too.

Need a Ruby way to determine the elements of a matrix "touching" another element

I think I need a method called “touching” (as in contiguous, not emotional).
I need to identify those elements of a matrix that are next to an individual element or set of elements. At least that’s the way I’ve thought of to solve the problem at hand.
The matrix State in the program below represents, let's say, some underwater topography. As I lower the water, eventually the highest point will stick out and become an “island”. When the “water level” is at 34, the element State[2,3] is the single point of the island. The array atlantis holds the coordinates of that single point.
As we lower the water level further, additional points will be “above water”. Additional contiguous points will become part of the island, and their coordinates would be added to the array atlantis. (For example, the next piece of land to become part of atlantis would be State[3,4] at 31.)
My thought about how to do this is to identify all the matrix elements that touch/are next to the elements in atlantis, find the one with the highest elevation, and then add it to the array. Looking for the elements next to a single element is a challenge in itself, but we could write some code to examine the set [i,j-1], [i,j+1], [i-1,j-1], [i-1,j], [i-1,j+1], [i+1,j-1], [i+1,j], [i+1,j+1]. (I think I got that right.)
But as we add additional points, the task of determining which points surround the points in atlantis becomes increasingly difficult. So that's my question: can anyone think of a mechanism to do this? Any kind of simplified algorithm using capabilities of Ruby of which I am unaware (which include all but the most basic)? If such a method could be written, then I could call atlantis.touching and get back an array containing, for example, the coordinates of all the points presently contiguous to atlantis.
At least that’s how I’m thinking this could be done. Any other ideas would be welcome. And if anyone knows any kind of partnering site where I could seek others who might be interested in working with me on this, that would be great.
# create State database using matrix
require 'matrix'

State = Matrix[ [3,  1,  4,  4,  6,  2,  8, 12,  8,  2],
                [6,  2,  4, 13, 25, 21, 11, 22,  9,  3],
                [6, 20, 27, 34, 22, 14, 12, 11,  2,  5],
                [6, 28, 17, 23, 31, 18, 11,  9, 18, 12],
                [9, 18, 11, 13,  8,  9, 10, 14, 24, 11],
                [3,  9,  7, 16,  9, 12, 28, 24, 29, 21],
                [5,  8,  4,  7, 17, 14, 19, 30, 33,  4],
                [7, 17, 23,  9,  5,  9, 22, 21, 12, 21],
                [7, 14, 25, 22, 16, 10, 19, 15, 12, 11],
                [5, 16,  7,  3,  6,  3,  9,  8,  1,  5] ]

# find State elements contiguous to the island
atlantis = [[2, 3]]
# find all State[i,j] "touching" atlantis
Only checking the points around the currently exposed area doesn't sound like it could cover every case - what if the next point to be exposed was the beginning of a new island?
I'd go about it like this: have another array - let's call it sorted - which contains your points sorted by height. Every time you lower the water level, pop all the elements higher than the new water level off sorted and onto atlantis.
In fact, there's no need for separate sorted and atlantis arrays if you do it this way. Just store the index of the highest point not above water, and you've essentially got two arrays in one - everything above water on one side, and everything below water on the other.
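A sketch of that single-array idea, using the State matrix from the question (the variable names here are mine):

require 'matrix'

# Flatten the matrix into [height, [row, col]] pairs, sorted by height descending.
points = State.each_with_index
              .map { |height, i, j| [height, [i, j]] }
              .sort_by { |height, _| -height }

water_level = 34
# Everything at or above the water level is land; the rest is still submerged.
exposed = points.take_while { |height, _| height >= water_level }
atlantis = exposed.map { |_, coords| coords } # => [[2, 3]]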
Hope that helps!

Ruby : The best way to manage a large 3d array

I would like to know the best way to manage a large 3d array, with something like:
x = 1000
y = 1000
z = 100
=> 100000000 objects
And each cell is an object with some amount of data.
Simple methods are very loooooong even if all the data is collapsed (I first tried an array of arrays of arrays of objects):
class Test
  def initialize
    @name = "Test"
  end
end

qtt = 1000 * 1000 * 100
Array.new(qtt) { Test.new } # build qtt Test instances
I read somewhere that a DB could be a good thing for such cases.
What do you think about this?
What am I trying to do ?
This "matrix" represents a world. And each element is a 1mx1mx2m block who could be a different kind (water, mud, stone, ...) Some block could be empty too.
But the user should be able to remove blocks everywhere and change everything around (if they where water behind, it will flow through the hole for exemple.
In fact what I wish to do is not Minecraft be a really small clone of DwarfFortress (http://www.bay12games.com/dwarves/)
Other interesting things
In my model the ground is at level 10. It means that layers [0,10] are empty sky in most cases.
Only hills and parts of mountains would be present on those layers.
Underground is basically unknown and not dug, so we should not have to add instances for unused blocks.
What we should add to the model from the beginning: gems, gold, and water, which could be stored without having to store the adjacent stone/mud/earth blocks.
At the beginning of the game, 80% of the cube doesn't need to be loaded in memory.
Each time we dig, we create new blocks: the empty block we dug and the blocks around it.
The only things we should index are:
underground rivers
underground lakes
lava rivers
Holding that many objects in memory is never a good thing. A flat-file or database-centric approach would be a lot more efficient and easier to maintain.
What I would do - The object-oriented approach
Store the parameters of the blocks as simple data and construct the objects dynamically.
Create a Block class to represent a block in the game, and give it variables to hold the parameters of that particular block:
class Block
  # location of the Block
  attr_accessor :x, :y, :z
  # an individual id for the Block
  attr_accessor :id
  # the block type (rock, water, etc.)
  attr_accessor :block_type
  # ... and any other attributes of a Block
end
I'd then create a few methods that would enable me to serialise/de-serialise the data to a file or database.
As you've stated it works on a board, you'd also need a Board class to represent it; it would maintain the state of the game as well as perform actions on the Block objects. Using the x, y, z attributes from each Block you can determine its location within the game. Using this information you can then write a method in the Block class that locates the blocks adjacent to the current one. This would enable you to perform the "cascading" effects you talk about, where one Block is affected by actions on another.
Accessing the data efficiently
This will rely entirely on how you choose to serialise the Block objects. I would probably choose a binary format to reduce unnecessary data reads, store the objects by their id parameter, and then use something like memory-mapped I/O to do fast random-access reads/writes on a large data file in an Array-like manner. This will allow you to access the data quickly and efficiently, without the memory overhead. How you read the data will relate to your adjacent-blocks method above.
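A sketch of that fixed-width record idea (the field layout - 16-bit coordinates plus a one-byte type - and the helper names are my assumptions):

RECORD_FORMAT = "s<s<s<C" # x, y, z as little-endian 16-bit ints, block_type as one byte
RECORD_SIZE = 7           # bytes per record under this layout

def write_block(file, block)
  file.seek(block.id * RECORD_SIZE) # O(1) jump to the record for this id
  file.write([block.x, block.y, block.z, block.block_type].pack(RECORD_FORMAT))
end

def read_block(file, id)
  file.seek(id * RECORD_SIZE)
  x, y, z, block_type = file.read(RECORD_SIZE).unpack(RECORD_FORMAT)
  block = Block.new
  block.id, block.x, block.y, block.z, block.block_type = id, x, y, z, block_type
  block
end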
You can of course also choose the DB storage route, which will allow you to isolate the Blocks and do lookups on particular blocks in a higher-level manner; however, that might give you a bit of extra overhead.
It sounds like an interesting project, I hope this helps a bit! :)
P.S. With regards to the comment above by @Linuxious about choosing a different language: yes, this might be true in some cases, but a skilled programmer never blames his tools. A program is only as efficient as the programmer makes it... unless you're writing it in Java ;)

Hashing - What Does It Do?

So I've been reading up on hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start; and if anyone knows any helpful ways to understand it, that would be great too.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hash codes for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number land in the same bucket, I can choose the bucket by hash code.
Hashing is a heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code - though usually they don't - but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
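A sketch of that cheap-check-first pattern (cheap_hash and expensive_equal? are hypothetical helpers, not anything standard):

def probably_equivalent?(a, b)
  # Different hash codes mean the values cannot be equivalent;
  # only on a match do we pay for the expensive comparison.
  return false unless cheap_hash(a) == cheap_hash(b)
  expensive_equal?(a, b)
end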
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary - one that doesn't preserve all the available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum-and-mod (to keep the sum under 2 billion, for example) you tend to keep a lot of the right-most bits and lose all the left-most bits.
So a good hash is fair - it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions with other sequences of numbers that happen to have the same sum.
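A quick illustration of such a collision (the lambda is mine):

sum_hash = ->(numbers) { numbers.sum } # our simplistic "sum hash"
sum_hash.call([2, 3, 4, 5, 6]) # => 20
sum_hash.call([1, 19])         # => 20 - same summary, different data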
First, we should describe the problem that hashing solves.
Suppose you have some data (maybe an array, or a tree, or database entries). You want to find a concrete element in this data store (for example in the array) as fast as possible. How do you do it?
When you build this data store, you can calculate a special value for every item you put in (it is named the HashValue). The way to calculate this value may differ, but ideally all methods should satisfy a special condition: the calculated value should be unique for every item.
So, now you have an array of items, and for every item you have its HashValue. How do you use it? Consider an array of N elements. Let's put your items into this array according to their HashValues.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it, you can simply find the HashValue for "it1" (let's call it f("it1")) and look at the array at position f("it1"). If the element at this position is not null (and equals our "it1" item), the answer is true. Otherwise the answer is false.
There is also the collision problem: how do we find an ideal function which will give unique HashValues for all different elements? Actually, such a function doesn't exist in general. But there are a lot of good functions which can give you good values.
An example for better understanding:
Suppose you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}. And you have to answer the question: does this array contain String S?
First, we have to choose a function for calculating HashValues. Let's take the function f which, for a given string, returns the length of that string (actually, it's a very bad function, but I take it for easy understanding).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
Now we calculate HashValues for every element in array A: f("aaa") = 3, f("eccc") = 4, ...
Let's take an array for holding these items (it is also named a HashTable) - call it H (an array of strings). Now we put our elements into this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc", ...
And finally, how do we find a given String in this array?
Suppose we are given the String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer will be true; otherwise it will be false.
But how do we avoid situations where two elements have the same HashValue? There are a lot of ways to handle it. One of them: each element in the HashTable contains a list of items. So, H[4] will contain all items whose HashValue equals 4. And how do we find a concrete element? Very easily: calculate the HashValue for this item and look at the list of items in HashTable[HashValue]. If one of those items equals the element we are searching for, the answer is true; otherwise the answer is false.
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
A hash function applied to some data generates some new data.
It is always the same for the same data.
That's about it.
Another constraint that is often put on it, which I think is not really part of the general definition, is that you cannot reconstruct the original data from the hash.
For me that is its own category, called cryptographic or one-way hashing.
There are a lot of demands on certain kinds of hash functions -
for example, that the hash is always the same length,
or that hashes are distributed randomly for any given sequence of input data.
The only essential point is that it is deterministic (always the same hash for the same data).
So you can use it, for example, to verify data integrity, validate passwords, etc.
Read all about it here:
http://en.wikipedia.org/wiki/Hash_function
You should read the Wikipedia article first, then come back with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually shorter) value from it (chop), but the obtained value should change even if only a small part of the original value changes (mix).
Let's take x % 9 as an example hash function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we take x % 10, we get:
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... up to the193371ststring
), in many cases a "perfect" function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search the indicated slot, then the next one, then the next, etc., until the item is found or you hit an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more will contain only two, etc., so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
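To make the chain-bucket idea concrete, here is a toy sketch in Ruby (fixed bucket count; a real table would also grow and rehash):

class ChainHash
  def initialize(buckets = 16)
    @slots = Array.new(buckets) { [] } # each slot holds a list of [key, value] pairs
  end

  def []=(key, value)
    slot = @slots[key.hash % @slots.size]
    pair = slot.find { |k, _| k == key }
    pair ? pair[1] = value : slot << [key, value]
  end

  def [](key)
    slot = @slots[key.hash % @slots.size] # the hash says where to start looking
    pair = slot.find { |k, _| k == key }  # the list search handles collisions
    pair && pair[1]
  end
end

table = ChainHash.new
table["if"] = :keyword
table["if"] # => :keyword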
