Go: Best way to search for an element in structs [duplicate]

This question already has answers here:
How to search for an element in a golang slice
(8 answers)
Closed last month.
I'm working with Go structs. I have a struct that contains other structs, and I need to find the value of an Id in a slice, when all I have is the Name of an element in the innermost structure.
The way I do it now is to read each element of a slice called Genes until I find my string Name.
Is there a better practice for finding my string Name?
type GenresResponse struct {
	Count        int    `xml:"count,attr"`
	PageIndex    int    `xml:"page_index,attr"`
	PageSize     int    `xml:"page_size,attr"`
	NumOfResults int    `xml:"num_of_results,attr"`
	TotalPages   int    `xml:"total_pages,attr"`
	Genes        []Gene `xml:"gene"`
}

type Gene struct {
	Category     string       `xml:"category,attr"`
	Id           string       `xml:"id,attr"`
	Translations Translations `xml:"translations"`
}

type Translations struct {
	Translation Translation `xml:"translation"`
}

type Translation struct {
	Lang string `xml:"lang,attr"`
	Name string `xml:"name"`
}
And this is how I'm reading it:
idToFind := "0"
for _, genreItem := range responseStruct.Genes {
	if strings.ToLower(genreItem.Translations.Translation.Name) == strings.ToLower(myNameValue) {
		idToFind = genreItem.Id
		break
	}
}

Your code seems to be working fine, and to my knowledge there isn't any "better" way to do a linear search.
But if you're dealing with a lot of data (especially with a high volume of searches), you might want to use a scheme where the Gene slice is kept sorted (by name, in this case). Then faster search algorithms (like binary search) can be applied, which lowers the complexity of a search from O(n) to O(log n). This can make a big difference when searching large amounts of data.
More on the binary search algorithm can be found on Wikipedia: http://en.wikipedia.org/wiki/Binary_search_algorithm
Go's standard library also includes a package that handles sorting and binary search; the examples in particular could be quite useful: http://golang.org/pkg/sort/
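For illustration, here is a minimal sketch of that approach using the sort package; it reuses responseStruct and myNameValue from the question and needs the "sort" and "strings" imports:

// Sort the Genes slice by lower-cased name once up front.
sort.Slice(responseStruct.Genes, func(a, b int) bool {
	return strings.ToLower(responseStruct.Genes[a].Translations.Translation.Name) <
		strings.ToLower(responseStruct.Genes[b].Translations.Translation.Name)
})

// Binary-search for the first entry whose name is >= the target.
target := strings.ToLower(myNameValue)
i := sort.Search(len(responseStruct.Genes), func(n int) bool {
	return strings.ToLower(responseStruct.Genes[n].Translations.Translation.Name) >= target
})

idToFind := "0"
if i < len(responseStruct.Genes) &&
	strings.ToLower(responseStruct.Genes[i].Translations.Translation.Name) == target {
	idToFind = responseStruct.Genes[i].Id
}

The sort costs O(n log n), so this only pays off when the slice is searched many times between changes.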

Go also works well with JSON. It is of course not optimal in terms of memory and CPU, but as an option you can marshal the entire structure and search through its text...
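For what it's worth, a minimal sketch of what this answer seems to describe, reusing the question's responseStruct and myNameValue (imports: "encoding/json", "strings"):

// Marshal the whole structure and scan its JSON text. Note this only
// tells you whether the name occurs somewhere, not which Id it belongs to.
data, err := json.Marshal(responseStruct)
if err != nil {
	// handle the error
}
found := strings.Contains(strings.ToLower(string(data)), strings.ToLower(myNameValue))
_ = found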

Related

Why does golang implement different behavior on the `[]` operator between slice and map? [duplicate]

This question already has answers here:
Why are map values not addressable?
(2 answers)
Closed 4 years ago.
type S struct {
	e int
}

func main() {
	a := []S{{1}}
	a[0].e = 2

	b := map[int]S{0: {1}}
	b[0].e = 2 // compile error: cannot assign to struct field b[0].e in map
}
a[0] is addressable, but b[0] is not.
I know the first 0 is an index and the second 0 is a key.
Why does Go implement it like this? Is there any further consideration?
I've read the source code of map in github.com/golang/go/src/runtime, and the map structure already supports indirectkey and indirectvalue when maxKeySize and maxValueSize are small enough.
type maptype struct {
	...
	keysize       uint8 // size of key slot
	indirectkey   bool  // store ptr to key instead of key itself
	valuesize     uint8 // size of value slot
	indirectvalue bool  // store ptr to value instead of value itself
	...
}
I think that if the Go designers wanted this syntax, it would be easy to support now.
Of course, indirectkey/indirectvalue may cost more resources, and the GC would also need to do more work.
So is performance the only reason for not supporting this?
Or is there some other consideration?
In my opinion, supporting syntax like this would be valuable.
As far as I know, that's because a[0] can be replaced with the address of the array.
Similarly, a[1] can be replaced with a[0] + (elemSize * 1).
But in the case of a map you cannot do that: the hash layout changes from time to time, depending on the key/value pairs and how many of them there are.
Entries are also rearranged from time to time.
A specific computation is needed in order to get the address of a value.
Arrays and slices are easily addressable, but in the case of a map it takes multiple function calls or structure look-ups...
If the compiler were to inline whatever computation is needed, the binary size would grow by orders of magnitude, and moreover the hash algorithm can keep changing from time to time.
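The usual workarounds, as a quick sketch (assuming the S type from the question): copy the value out, modify it, and store it back, or store pointers in the map.

// Workaround 1: copy the value out, modify it, store it back.
b := map[int]S{0: {1}}
v := b[0]
v.e = 2
b[0] = v

// Workaround 2: store pointers; b2[0] is a *S, so the field is
// reachable through the pointer and can be assigned directly.
b2 := map[int]*S{0: {1}}
b2[0].e = 2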

Why does the golang place the type specifier "after" the variable name?

Just out of curiosity, why does Go place the type specifier after the variable name, like below? Does it have to be that way, or did it just happen that way?
type person struct {
	name string
	age  int
}
Why not just like this? It's more natural IMHO, and it saves the type keyword.
struct person {
	string name
	int age
}
I think the Go programming language follows these principles:
declarations start with a keyword, so that the parser can be implemented with a single token look-ahead (like in Pascal)
the rest of the declaration follows the English grammar, with every redundant word left out (also like in Pascal, but with fewer keywords)
Examples:
The type Frequency is a map indexed by string, mapping to int
type Frequency map[string]int
The type Point is a struct with two fields, x and y of type int
type Point struct { x, y int }
The above sentences focus more on the names than on the types, which makes sense since the names convey more meaning.
If I had to explain to novice programmers how to write declarations in Go, I would let them first describe it in plain English and then remove every word that might even seem redundant.
So far, I haven't found any contradictions to these rules.
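A couple more declarations read the same way, as a sketch (counts and scale are made-up names; Point is the type declared above):

// "The variable counts is a map indexed by string, mapping to int."
var counts map[string]int

// "The function scale takes a Point p and a factor of type int, and returns a Point."
func scale(p Point, factor int) Point {
	return Point{p.x * factor, p.y * factor}
}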

Data structure for occurrence counting in long tail distribution

I have a big list of elements (tens of millions).
I am trying to count the number of occurrences of several subsets of these elements.
The occurrence distribution is long-tailed.
The data structure currently looks like this (in an OCaml-ish flavor):
type element_key
type element_aggr_key
type raw_data = element_key list

type element_stat = {
  occurrence : (element_key, int) Hashtbl.t;
}

type stat = {
  element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t;
}
element_stat currently uses a hashtable where the key is each element and the value is an integer. However, this is inefficient, because when many elements have a single occurrence, the occurrence hashtable is resized many times.
I cannot avoid resizing the occurrence hashtable by setting a big initial size, because there are in fact many element_stat instances (the hashtable in stat is big).
I would like to know if there is a more efficient (memory-wise and/or insertion-wise) data structure for this use case. I found a lot of existing data structures like tries, radix trees, and Judy arrays, but I have trouble understanding their differences and whether they fit my problem.
What you have here is a table mapping element_aggr_key to tables that in turn map element_key to int. For all practical purposes, this is equivalent to a single table that maps element_aggr_key * element_key to int, so you could do:
type stat = (element_aggr_key * element_key, int) Hashtbl.t
Then you have a single hash table, and you can give it a huge initial size.
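Although the question is in OCaml, the flattening translates directly to other languages; here is a sketch in Go (all names are made up), where a comparable struct serves as the combined map key:

// One flat table keyed by the (aggregate key, element key) pair,
// instead of a hashtable of hashtables.
type statKey struct {
	aggr string // stands in for element_aggr_key
	elem string // stands in for element_key
}

// A single map can be given one big initial size hint.
stat := make(map[statKey]int, 1<<20)

// Counting an occurrence is a single update:
stat[statKey{"some-aggregate", "some-element"}]++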

Data Structure for phone book [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
storing 1 million phone numbers
How do you design a data structure for a phone address book with 3 fields:
name, phone number, address?
One must be able to search this phone book on any of the 3 fields.
A hash table wouldn't work, because all three fields would have to hash to the same value, which I think is impossible. I thought about tries and other data structures too, but couldn't come up with a proper answer.
You should use a trie for implementing the phone book. A trie is an ordered tree data structure that uses strings as keys. Unlike in a binary tree, no node in a trie stores its whole key; the node's position in the tree determines the key it is associated with.
Good example
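As an illustration, a minimal trie sketch in Go, with just insert and exact lookup over byte strings (all names are made up; the records themselves would live in a separate store indexed by the stored int):

// trieNode is a minimal trie over bytes; a node optionally carries the
// index of the record whose key ends at this node.
type trieNode struct {
	children map[byte]*trieNode
	record   int // index into a record store; -1 if no key ends here
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[byte]*trieNode{}, record: -1}
}

func (t *trieNode) insert(key string, record int) {
	n := t
	for i := 0; i < len(key); i++ {
		c := key[i]
		if n.children[c] == nil {
			n.children[c] = newTrieNode()
		}
		n = n.children[c]
	}
	n.record = record
}

func (t *trieNode) lookup(key string) (int, bool) {
	n := t
	for i := 0; i < len(key); i++ {
		if n = n.children[key[i]]; n == nil {
			return 0, false
		}
	}
	return n.record, n.record != -1
}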
You could accomplish this with a single hash table or other type of associative array (if you wanted to). For each person, just have three keys in the table (name, address, phone) all pointing to the same record.
I think a combination of a trie (each phone book entry is one leaf) and two skip lists (one each for name and address) could turn out to be effective.
Just assign each node one set of pointers to move along the name axis and one set of pointers to move along the address axis (that is, to traverse the skip lists).
You can't exactly sort something in three ways at the same time. Nor can you feasibly build a single hash table which allows lookup with only a third of the key.
What you probably want to do is basically what databases do:
Store one (possibly unsorted) master list of all your records.
For each column you want to be able to search on, build some kind of lookup structure which returns a pointer/index into the master list.
So, for example, you build a flat array of {name, phone, address} structs in whatever order you want, and then for each row, put a (phone -> row#) mapping into a hash table. Non-unique columns could hash to a list of row numbers, or you could put them in a binary tree where duplicate keys aren't an issue.
As far as space requirements, you basically end up storing every element twice, so your space requirement will at least double. On top of this you've got the overhead from the data structures themselves; keeping three hash tables loaded at ~70% capacity, your storage requirements increase by at least 2.4 times.
You can do away with one of these auxiliary lookup structures by keeping your main table sorted on one of the columns, so you can search on it directly in O(logN). However, this makes inserting/deleting rows very expensive (O(N)), but if your data is fairly static, this isn't much of an issue. And if this is the case, sorted arrays would be the most space-efficient choice for your auxiliary lookups as well.
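A sketch of that layout in Go (the type and field names are made up); the unique phone column maps to a single row number, while non-unique columns map to lists of row numbers, as suggested above:

type Entry struct {
	Name, Phone, Address string
}

type PhoneBook struct {
	rows      []Entry          // the master list, in insertion order
	byName    map[string][]int // non-unique: name -> row numbers
	byPhone   map[string]int   // unique: phone -> row number
	byAddress map[string][]int // possibly shared: address -> row numbers
}

func NewPhoneBook() *PhoneBook {
	return &PhoneBook{
		byName:    map[string][]int{},
		byPhone:   map[string]int{},
		byAddress: map[string][]int{},
	}
}

func (pb *PhoneBook) Add(e Entry) {
	row := len(pb.rows)
	pb.rows = append(pb.rows, e)
	pb.byName[e.Name] = append(pb.byName[e.Name], row)
	pb.byPhone[e.Phone] = row
	pb.byAddress[e.Address] = append(pb.byAddress[e.Address], row)
}

func (pb *PhoneBook) FindByName(name string) []Entry {
	var out []Entry
	for _, row := range pb.byName[name] {
		out = append(out, pb.rows[row])
	}
	return out
}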
In a phone book, the telephone number should be unique and the address is unique, but names can be duplicated.
So perhaps you can use a hash table combined with a linked list to approach this.
You can use any one (or a combination) of name, address, and phone number as the hash key; if you simply use name as the hash key, then a linked list is needed to store the duplicated entries.
In this approach, a search based on the hash key is O(1), but a search based on either of the other two fields will be O(n).
C or C++ or C#?
Use a list of classes
public class PhoneBook
{
    public string name;
    public string phoneNumber;
    public string address;
}
Place this in a list and you have a phone book.
In C, I think a struct is the best option.
typedef struct _Contact Contact;

struct _Contact
{
    char* name;
    char* number;
    char* address;
};

Contact* add_new_contact( char* name, char* number, char* address )
{
    Contact* c = (Contact*) malloc( sizeof( Contact ) );
    c->name = name;
    c->number = number;
    c->address = address;
    return c;
}
Contact* phone_book [ 20 ]; /* An array of Contacts */
Use the standard string functions ( <string.h>, or <cstring> if using a C++ compiler ) or something like GLib for searching the names, numbers, etc.
Here's a simple example:
Contact* search_for_number( Contact* phone_book[], size_t count, const char* number )
{
    size_t i;
    /* sizeof( phone_book ) would only yield the size of a pointer here,
       so the number of entries is passed in explicitly. */
    for( i = 0; i < count; i++ )
    {
        if ( strcmp( phone_book[i]->number, number ) == 0 ) return phone_book[i];
    }
    return NULL;
}
There is also a good, helpful code example over here.
Alternatively, you may be able to use linked lists, but since neither C nor the C standard library provides a linked list, you either need to implement it yourself or use a third-party library.
I suggest using GList from GLib.

Store and update huge (and sparse?) multi-dimensional array efficiently to count conditional probabilities

Just for fun I would like to count the conditional probability that a word (from a natural language) appears in a text, depending on the last and the next-to-last word. I.e., I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and each n(jk) appears (where j, k, i are successive words).
The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to positions in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess this array would be only sparsely filled, most entries being zero, so I would waste lots of memory. So, no 3-D array.
What data structure would be better suited for such a use case while still being efficient for lots of small updates, like the ones I do when counting the appearances of words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++, I guess.
C++ code:
#include <map>
using std::map;

struct bigram_key {
    int i, j; // words - indexes of the words in a dictionary

    // a constructor, to be easily constructible
    bigram_key(int a_i, int a_j) : i(a_i), j(a_j) {}

    // keys must be ordered to be usable in a map container
    bool operator<(bigram_key const &other) const {
        return i < other.i || (i == other.i && j < other.j);
    }
};

struct bigram_data {
    int count;                    // n(ij)
    map<int, int> trigram_counts; // n(k|ij) = trigram_counts[k]
};

map<bigram_key, bigram_data> trigrams;
The dictionary could be a vector of all the words found, like:
vector<string> dictionary;
but for a better word->index lookup it could be a map:
map<string, int> dictionary;
When you read a new word, you add it to the dictionary and get its index k. You already have the i and j indexes of the previous two words, so then you just do:
trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;
For better performance you may look the bigram up only once:
bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
Is it understandable? Do you need more details?
