Google Go Lang Assignment Order - go

Let's look at the following Go code:
package main

import "fmt"

type Vertex struct {
    Lat, Long float64
}

var m map[string]Vertex

func main() {
    m = make(map[string]Vertex)
    m["Bell Labs"] = Vertex{
        40.68433, 74.39967,
    }
    m["test"] = Vertex{
        12.0, 100,
    }
    fmt.Println(m["Bell Labs"])
    fmt.Println(m)
}
It outputs this:
{40.68433 74.39967}
map[Bell Labs:{40.68433 74.39967} test:{12 100}]
However, if I change one minor part of the test vertex declaration, by moving the closing "}" 4 spaces to the right, like so:
m["test"] = Vertex{
12.0, 100,
}
.. then the output changes to this:
{40.68433 74.39967}
map[test:{12 100} Bell Labs:{40.68433 74.39967}]
Why the heck does that little modification affect the order of my map?

Map "order" depends on the hash function used. The hash function is randomized to prevent denial of service attacks that use hash collisions. See the issue tracker for details:
http://code.google.com/p/go/issues/detail?id=2630
Map order is not guaranteed according to the specification. Although not done in current Go implementations, a future implementation could do some compacting during GC or another operation that changes the order of a map without the map being modified by your code. It is unwise to assume a property not defined in the specification.
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type.
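If you need the output in a stable order (for example, in tests), the usual workaround is to collect and sort the keys yourself before iterating. A minimal sketch using the same Vertex type:

package main

import (
    "fmt"
    "sort"
)

type Vertex struct {
    Lat, Long float64
}

func main() {
    m := map[string]Vertex{
        "Bell Labs": {40.68433, 74.39967},
        "test":      {12.0, 100},
    }
    // Collect the keys, sort them, and print the entries in that order.
    keys := make([]string, 0, len(m))
    for k := range m {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        fmt.Println(k, m[k])
    }
}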

A map won't necessarily print its key-element pairs in any fixed order:
See "Go: what determines the iteration order for map keys?"
However, in the newest Go weekly release (and in Go 1, which is expected to be released this month), the iteration order is randomized (it starts at a pseudo-randomly chosen key, and the hash code computation is seeded with a pseudo-random number).
If you compile your program with the weekly release (and with Go1), the iteration order will be different each time you run your program.
It isn't exactly spelled out like that in the spec though (Ref Map Type):
A map is an unordered group of elements of one type, called the element type, indexed by a set of unique keys of another type, called the key type.
Actually, the spec does spell it out, in the For statements section:
The iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next.
If map entries that have not yet been reached are deleted during iteration, the corresponding iteration values will not be produced.
If map entries are inserted during iteration, the behavior is implementation-dependent, but the iteration values for each entry will be produced at most once.
If the map is nil, the number of iterations is 0.
This has been introduced by code review 5285042 in October 2011:
runtime: random offset for map iteration
The go-nuts thread points out:
The reason that "it's there to stop people doing bad stuff" seemed particularly weak.
Avoiding malicious hash collisions makes a lot more sense.
Further the pointer to the code makes it possible in development to revert that behaviour in the case that there is an intermittent bug that's difficult to work through.
To which Patrick Mylund Nielsen replies:
Dan's note was actually the main argument why Python devs were reluctant to adopt hash IV randomization--it broke their unit tests! PHP ultimately chose not to do it at all, and instead limited the size of the http.Request header, and Oracle and others didn't think it was a language problem at all.
Perl saw the problem and applied a fix similar to Go's which was included in Perl 5.8.1 in 2003.
I might be wrong, but I think they were the only ones who actually cared back then, when this paper was presented: "Denial of Service via Algorithmic Complexity Attacks", attacking hash tables.
(Worst-case hash table collisions)
For others, a good motivator was this, which became very popular about a year ago: "28c3: Effective Denial of Service attacks against web application platforms" (YouTube video, December 2011), which shows how a common flaw in the implementation of most of the popular web programming languages and platforms (including PHP, ASP.NET, Java, etc.) can be (ab)used to force web application servers to use 99% of CPU for several minutes to hours for a single HTTP request. This attack is mostly independent of the underlying web application and relies only on a common fact of how web application servers typically work.

Related

How do I store module-level write-once state?

I have some module-level state I want to write once and then never modify.
Specifically, I have a set of strings I want to use for lookups later. What is an efficient and ordinary way of doing this?
I could make a function that always returns the same set:
my_set() -> sets:from_list(["a", "b", "c"]).
Would the VM optimize this, or would the code for constructing the set be re-run every time? I suspect the set would just get GCd.
In that case, should I cache the set in the process dictionary, keyed on something unique like the module md5?
Key = proplists:get_value(md5, module_info()), put(Key, my_set())
Another solution would be to make the caller call an init function to get back an opaque chunk of state, then pass that state into each function in the module.
A compile-time constant, like your example list ["a", "b", "c"], will be stored in a constant pool on the side when the module is loaded, and not rebuilt each time you run the expression. (In the old days, the list would have been reconstructed from its elements for each new call.) This goes for all constants no matter how complicated (like lists of lists of tuples). But when you call out to a function like sets:from_list/1, the compiler cannot assume anything about the representation used by the sets module, and the set will be constructed dynamically from that constant list.
While an ETS table would work, it is less efficient for larger constants (like, say, a set or map containing many entries), because an ETS table has the same memory model as a process - data is written and read by copying, as if by sending messages. If the constants are small, the difference between copying them and recreating them locally would be negligible, and if the constants are large, you waste time copying them.
What you want instead is a fairly new feature called Persistent Term Storage: https://erlang.org/doc/man/persistent_term.html (since Erlang/OTP 21). It is similar to the way compile time constants are handled, so there is no copying when looking up a value. (The key could be the name of your module.) Persistent Term is intended as pretty much a write-once-read-many storage - you can update the stored entry, but that's a more expensive operation which may trigger a global GC.

several questions about multi-paxos?

I have several questions about multi-paxos
Will each instance have its own proposal number, accepted ballot, and accepted value? Or do all the instances share the same proposal number, so that after one instance is finished, another one starts?
If all the instances share the same proposal number, consider the following situation: server A sends a proposal, and the acceptor returns an accepted instanceId which might be greater or less than the proposal's instanceId. What should the proposer do then? Use that instanceId and its value for the accept phase, then increase its own instanceId, wait for the next round, and re-propose with its own value? If so, when is the previously accepted value removed? Because if it's not removed, the acceptor will return this instanceId and value again, and it seems to become a loop.
Multi-Paxos is described only vaguely, so two people may build two different systems based on it; in the context of one system the answer is "no," and in the context of another it's "yes."
Approach #1 - "No"
Mindset: Paxos is a two-phase protocol for building write-once registers. Multi-Paxos is a technique for building a log on top of them.
One of the possible ways to build a log is:
Create an array of completely independent write-once registers and initialize the first one with an initial value.
To append a new record we should:
A) Guess an index (X) of a vacant register and try to write a dummy record there (if it's already used, pick a register with a higher index and retry).
B) Start writing dummy records to every register with an index smaller than X until we find a register filled with a non-dummy record.
C) Calculate a new record based on it (e.g., a record may have an ordinal, and we can use it to calculate the ordinal of the new record; since some registers are filled with dummy records, the ordinals aren't equal to the indices) and write it to the X+1 register. In case of a conflict, we should restart the procedure from step A).
To read the log we should start writing dummy values from the first register, and on each conflict we should increment the index and retry, until a write succeeds, which indicates that the log's end is reached.
Of course, there is a lot of overhead in this approach, so please treat it just as a top-level overview of what Multi-Paxos is; a sketch of the append procedure follows below.
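Here is a rough Go sketch of that append procedure, assuming a hypothetical proposeOnce helper that runs one single-decree Paxos round on register i and reports the value that was actually chosen (all names are illustrative and not taken from any real implementation):

package paxoslog

// Record is one log entry; Dummy marks a register that was burned during a
// scan rather than holding a real record.
type Record struct {
    Dummy   bool
    Ordinal int
    Payload string
}

// proposeOnce is a hypothetical single-decree Paxos round on register i:
// it proposes v and reports the value that ends up chosen there and whether
// our proposal was the one that won. A real implementation would run the two
// Paxos phases against a quorum of acceptors.
func proposeOnce(i int, v Record) (chosen Record, ours bool) {
    // ... phase 1 (prepare/promise) and phase 2 (accept/accepted) go here ...
    return v, true // placeholder so the sketch compiles
}

// appendRecord follows steps A-C from the answer above.
func appendRecord(payload string, guess int) {
    for {
        // A) Find a vacant register at or above the guess by writing a dummy into it.
        x := guess
        for {
            if _, ours := proposeOnce(x, Record{Dummy: true}); ours {
                break
            }
            x++ // already used, try a higher index
        }
        // B) Walk down from X until we hit the last non-dummy record.
        var last Record
        for i := x - 1; i >= 0; i-- {
            if chosen, _ := proposeOnce(i, Record{Dummy: true}); !chosen.Dummy {
                last = chosen
                break
            }
        }
        // C) Derive the new record from the last real one and try to write it at X+1.
        next := Record{Ordinal: last.Ordinal + 1, Payload: payload}
        if _, ours := proposeOnce(x+1, next); ours {
            return
        }
        guess = x + 2 // conflict: someone else won that register, restart from A)
    }
}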
The log is a powerful concept, and we can use it as a recipe for building distributed state machines - just think of each record as an update command. Unfortunately, in some cases there is also a lot of overhead. For example, if you want to build a key/value storage and you care only about the current value, then you don't need history, and you probably need to implement garbage collection to remove past versions from the log to optimize storage costs.
Approach #2 - "Yes"
Mindset: rewritable register as a heavily optimized version of Multi-Paxos.
If you start with the described approach, apply it to the creation of key/value storage, and then iterate in order to get rid of the overhead, e.g., by doing garbage collection on the fly, then eventually you may arrive at the idea of making the write-once register rewritable.
In that case, each instance uses the same ballot numbers just because all the instances are collapsed into one rewritable instance.
I described this approach in the How Paxos Works post and implemented it in the Gryadka project in 500 lines of JavaScript. Also, the idea behind it was independently checked with TLA+ by Greg Rogers and Tobias Schottdorf.

TDD strategy when implementing a multi-stage process?

At the moment I'm developing a piece of code which first gathers sentences from a set of documents, then tokenises these, then uses the results to analyse recurring frequencies of token sequences, including case variations (upper case/lower case/leading cap/other), then prints out the results.
Now I want to introduce two more stages before printing out the results:
1. firstly, removing "stop words" (i.e. words or short sequences the frequency of which can never be of interest, such as, in English, "the", "of the", "of which", etc.) - these stop words/"stop sequences" to be taken from a database table
2. secondly, bringing up a dialog enabling the user to identify sequences of new stop words, which would then remove the token sequences involved and also add the sequence in question to the database table.
The thing is, this is a multi-stage process, and I'm just wondering what TDD experts do faced with a situation like this: do I create a new test method for each individual stage...? The problem being that each individual stage requires the use of "live memory data" from the previous stage: another possibility could be to somehow serialise this data and then deserialise it when testing for the next stage... but then this would involve the app code doing things which were of benefit only for the testing code, i.e. it would mean tweaking ("distorting"?) the app code for the benefit of the testing code, which seems wrong in principle...
Also, if anyone can point me in the direction of a book or site which helps TDD newbs like myself go to "the next level" I would be very grateful.
later
To the person who marked this as "favorite": I've now got hold of a book called "Growing Object-Oriented Software, Guided by Tests", which is well-reviewed and appears to be for someone wanting to move from beginner to intermediate. First impressions good.
Any views on this book by experts also welcome, of course...
On the face of it, you seem to be building a pipeline. From what I can tell, you're currently implementing all of it within a single class, which stores both the data that's being worked on and implements the methods that do the processing. One approach that you could take would be to break down the problem into smaller chunks. Rather than having a single class, you have a class for each stage of the pipeline and another class for orchestrating the process which is responsible for plugging the stages together in the correct order.
So, scanning through what you've described, you appear to have the following processors:
DocumentReader (reads documents from somewhere into in memory document)
SentenceExtractor (document/list of documents in, list of sentences out)
1 or more SentenceAnalysers (sentences in, statistics out); you might want to break this down depending on the type of analysis and how complex it is.
StopWordExtractor (StopWordProvider and sentences in, sentences out)
There are additional supporting classes that would be needed, to support writing of new stop words to the database and, depending on how the StopWordProvider is implemented, keeping it in sync as the user selects new ones.
Essentially, what I'm saying is that you appear to be doing too much in a single location. If you're really happy that the code as you've described it is a single unit, then there is nothing wrong with you testing it all in one place, but then your inputs will be your starting documents/sentences and your outputs will be the end of the process. If you agree with me that really, there are several distinct components involved in the process that could change independently, then I would suggest breaking the process down into smaller classes and testing that those perform as expected for given sets of inputs/outputs...
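As a rough illustration of that decomposition (in Go, since the question doesn't name a language; all type and method names are assumptions, not from the original code), each stage becomes a small interface that can be unit-tested in isolation, and the orchestrator only wires them together:

package pipeline

// Document is the in-memory form produced by the reader stage.
type Document struct{ Text string }

type DocumentReader interface {
    Read(path string) (Document, error)
}

type SentenceExtractor interface {
    Extract(doc Document) []string
}

type StopWordProvider interface {
    StopWords() []string
}

type StopWordFilter interface {
    Filter(sentences []string, stops StopWordProvider) []string
}

type SentenceAnalyser interface {
    Analyse(sentences []string) map[string]int // token sequence -> frequency
}

// Pipeline only plugs the stages together in order; each stage gets its own tests.
type Pipeline struct {
    Reader    DocumentReader
    Extractor SentenceExtractor
    Stops     StopWordProvider
    Filter    StopWordFilter
    Analyser  SentenceAnalyser
}

func (p Pipeline) Run(path string) (map[string]int, error) {
    doc, err := p.Reader.Read(path)
    if err != nil {
        return nil, err
    }
    sentences := p.Extractor.Extract(doc)
    filtered := p.Filter.Filter(sentences, p.Stops)
    return p.Analyser.Analyse(filtered), nil
}

With this shape, each stage's tests feed it small in-memory inputs directly, so no serialisation of intermediate data is needed purely for the benefit of the tests.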

What are alternatives for GUIDs for key generation when central server is not possible?

I am looking for an alternative to GUIDs for key generation in a distributed app. For example, suppose I have Bob, James, and Jack all running a bug tracking application on their desktops, where they can do things like create bug tickets à la JIRA, Bugzilla, etc. When a ticket is created it is assigned a number such as T-1, T-2, T-3, T-4, etc. Tickets need to have a stable ID and should be creatable without having to consult a central server.
I understand that this is what GUIDs are really good for, but in my case displaying a GUID in a UI is ugly: people can't just copy and paste it or discuss it on a phone call. I really want integers or some sort of short string that is easy to talk about and read in one glance, etc.
Is there a way to use the bitcoin block chain as some sort of counter?
You may evaluate the approach taken by git. It uses the SHA-1 hash of the commit information, and abbreviated IDs are allowed, which are much shorter and easier to read and transfer manually.
Given that the number of bugs in your tracker is not going to reach millions, that should be sufficient. Once it does, you'll just need a longer abbreviation.
There seems to be plenty of info around on how git calculates hash IDs and abbreviates them.
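For illustration, a minimal Go sketch of that idea - hash the ticket's content and show an abbreviated prefix (the field names and the 7-character abbreviation, git's default display length, are just assumptions):

package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
)

// ticketID hashes the ticket's content (git-style) and returns a short,
// human-friendly abbreviation. Lengthen the prefix if you ever see collisions.
func ticketID(title, body, author, createdAt string) string {
    h := sha1.Sum([]byte(title + "\n" + body + "\n" + author + "\n" + createdAt))
    return hex.EncodeToString(h[:])[:7]
}

func main() {
    fmt.Println(ticketID("Crash on save", "Steps to reproduce...", "Bob", "2015-06-01T10:00:00Z"))
}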
If I recall correctly how UUIDv1 works - it's "just" putting together the MAC address and a very exact timestamp, plus maybe some additional integer. As your MAC address should be unique (unless you've fiddled with it) and there are only so many UUIDs one computer can generate within a nanosecond, the resulting ID will be unique.
This is a very general and uninformed way to create IDs. If you'd implement a version of it yourself for your specific use case you could get much smaller IDs.
Assuming you can identify each node with a bug tracking system with a simple and unique string - for instance "Bob", "James", "Jack" - and you can create unique continuous integers within each node, you could combine those two and have IDs like "Bob-1", "James-12", ...
As you can see, there actually has to be one central point again, which assigns the unique strings; however, depending on the number of nodes and how long they stay within the system, this could just as well be done by a human being.
The additional disadvantage (or advantage, depending on how you look at it) of this approach (as well as of UUIDv1) would be that you'd know where the ticket was created, as well as the order of the tickets within one system.
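A minimal sketch of that node-prefixed scheme in Go (purely illustrative; a real application would persist each node's counter so it survives restarts):

package main

import "fmt"

// idGen hands out IDs like "Bob-1", "Bob-2", ... Each node keeps its own
// counter, so no central server is needed once the node names are unique.
type idGen struct {
    node string
    next int
}

func (g *idGen) newID() string {
    g.next++
    return fmt.Sprintf("%s-%d", g.node, g.next)
}

func main() {
    bob := &idGen{node: "Bob"}
    fmt.Println(bob.newID()) // Bob-1
    fmt.Println(bob.newID()) // Bob-2
}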

Why do we need a canonical format for the GUID?

One hard-working day I noticed that the GUIDs I'd been generating with the usual .NET Guid.NewGuid() method had the same number 4 at the beginning of the third block:
efeafa5f-fe21-4ab4-ba82-b9eefd5fa225
480b64d0-6762-4afe-8496-ac7cf3292898
397579c2-a4f4-4611-9fda-16e9c1e52d6a
...
There were ten of them appearing on the screen once a second or so. I've kept my eye on this pattern right after the fifth GUID. Finally, the last one had the same four bits inside and I've decided that I'm a lucky guy. I went home and felt that the whole world is opened for such an exceptional person as me. Next week I found a new work, cleaned my room and made a call to my parents.
But today I've faced the same pattern again. Thousand times. And I don't feel the Chosen One anymore.
I've googled it and now I know about UUID and a canonical format with 4 reserved bits for version and 2 for variant.
Here's a snippet to experiment with:
static void Main(string[] args)
{
    while (true)
    {
        var g = Guid.NewGuid();
        Console.WriteLine(BitConverter.ToString(g.ToByteArray()));
        Console.WriteLine(g.ToString());
        Console.ReadLine();
    }
}
But still there is one thing I don't understand (except how to go on living). Why do we need these reserved bits? I see how it can do harm - exposing internal implementation details, more collisions (still nothing to worry about, but one day...), more suicides - but I don't see any benefit. Can you help me find any?
It is so that if you update the algorithm you can change that number. Otherwise 2 different algorithms could produce the exact same UUID for different reasons, leading to a collision. It is a version identifier.
For example, consider a contrived simplistic UUID format:
00000000-00000000
time - ip
now suppose we change that format for some reason to:
00000000-00000000
ip - time
This could generate a collision when a machine with IP 12.34.56.78 generates a UUID using the first method at time 01234567, and later a second machine with IP 01.23.45.67 generates a UUID at time 12345678 using the newer method - both would produce 01234567-12345678. But if we reserve some bits for a version identifier, this cannot possibly cause a collision.
The value 4 specifically refers to a randomly generated UUID (therefore it relies on the minuscule chance of collisions given so many random bits) rather than other methods, which could use combinations of the time, MAC address, PID, or other sorts of time & space identifiers to guarantee uniqueness.
See here for the relevant spec: https://www.rfc-editor.org/rfc/rfc4122#section-4.1.3
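As a small illustration (in Go, just for brevity), the version is simply the first hex digit of the third group in the canonical string form, which is why every Guid.NewGuid() value shows a 4 there:

package main

import "fmt"

// uuidVersion returns the version digit of a canonical UUID string:
// it is the first character of the third group (xxxxxxxx-xxxx-Vxxx-...).
func uuidVersion(u string) byte {
    return u[14]
}

func main() {
    fmt.Printf("version: %c\n", uuidVersion("efeafa5f-fe21-4ab4-ba82-b9eefd5fa225")) // prints 4
}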

Resources