Efficient way to implement full-text search in Go

I am trying to implement a simple full-text search in Go, but all my implementations turn out to be too slow to meet the required time thresholds.
The task is as follows:
Documents are non-empty strings of lowercase words separated by spaces
Each document has an implicit identifier equal to its index in the input array
New() constructs the index
Search() accepts a query, which is also a string of lowercase words separated by spaces, and returns a sorted array of unique identifiers of the documents that contain all words from the query, regardless of their order
Example:
index := New([]string{
	"this is the house that jack built", // 0
	"this is the rat that ate the malt", // 1
})
index.Search("")                              // -> []
index.Search("in the house that jack built")  // -> []
index.Search("malt rat")                      // -> [1]
index.Search("is this the")                   // -> [0, 1]
I have already tried to implement:
a binary search tree, both one per document and one for all documents together
a trie (prefix tree), both one per document and one for all documents together
an inverted index
binary search tree (for all documents):
type Tree struct {
	m     map[int]bool
	word  string
	left  *Tree
	right *Tree
}

type Index struct {
	tree *Tree
}
binary search tree (a tree for each document):
type Tree struct {
	word  string
	left  *Tree
	right *Tree
}

type Index struct {
	tree  *Tree
	index int
	next  *Index
}
trie (for all documents):
type Trie struct {
	m        map[uint8]*Trie
	end_node map[int]bool
}

type Index struct {
	trie *Trie
}
trie (for each document):
type Trie struct {
	m        map[uint8]*Trie
	end_node bool
}

type Index struct {
	trie  *Trie
	index int
	next  *Index
}
inverted index:
type Index struct {
	m map[string]map[int]bool
}
New and Search implementation for inverted index:
// New creates a fulltext search index for the given documents
func New(docs []string) *Index {
	m := make(map[string]map[int]bool)
	for i := 0; i < len(docs); i++ {
		words := strings.Fields(docs[i])
		for j := 0; j < len(words); j++ {
			if m[words[j]] == nil {
				m[words[j]] = make(map[int]bool)
			}
			m[words[j]][i+1] = true
		}
	}
	return &Index{m}
}
// Search returns a slice of unique ids of documents that contain all words from the query.
func (idx *Index) Search(query string) []int {
	if query == "" {
		return []int{}
	}
	ret := make(map[int]bool)
	arr := strings.Fields(query)
	fl := 0
	for i := range arr {
		if idx.m[arr[i]] == nil {
			return []int{}
		}
		if fl == 0 {
			for value := range idx.m[arr[i]] {
				ret[value] = true
			}
			fl = 1
		} else {
			tmp := make(map[int]bool)
			for value := range ret {
				if idx.m[arr[i]][value] {
					tmp[value] = true
				}
			}
			ret = tmp
		}
	}
	ret_arr := []int{}
	for value := range ret {
		ret_arr = append(ret_arr, value-1)
	}
	sort.Ints(ret_arr)
	return ret_arr
}
Am I doing something wrong, or is there a better algorithm for this kind of search in Go?
Any help is appreciated.

I can't really help you with the language-specific part, but if it's of any help, here is pseudocode that describes a trie implementation along with a function to solve your problem in a decently efficient manner.
struct TrieNode {
	map[char] children  // maps character to children
	set[int]  contains  // set of all ids of documents that contain the word
}
// classic search function in trie, except it returns a set of document ids instead of a simple boolean
function get_doc_ids(TrieNode node, string w, int depth) {
	if (depth == length(w)) {
		return node.contains
	} else {
		if (node.hasChild(w[depth])) {
			return get_doc_ids(node.getChild(w[depth]), w, depth+1)
		} else {
			return empty_set()
		}
	}
}
// the query-answering function, as straightforward as it can be
function answer_query(TrieNode root, list_of_words L) {
	n = length(L)
	result = get_doc_ids(root, L[0], 0)
	for i from 1 to n-1 do {
		result = intersection(result, get_doc_ids(root, L[i], 0)) // set intersection
		if (result.is_empty()) {
			break // no document contains them all, no need to check further
		}
	}
	return result
}
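If it helps, here is a rough Go sketch of that pseudocode (the names trieNode, docSet, getDocIDs and answerQuery are my own, sets are modeled as map[int]struct{}, and the trie walk is iterative instead of recursive):

type docSet map[int]struct{}

type trieNode struct {
	children map[byte]*trieNode // maps character to child nodes
	contains docSet             // ids of documents that contain the word ending at this node
}

// getDocIDs walks the trie along w and returns the document set stored at
// the final node, or an empty set if w is not in the trie.
func getDocIDs(node *trieNode, w string) docSet {
	for i := 0; i < len(w); i++ {
		child, ok := node.children[w[i]]
		if !ok {
			return docSet{}
		}
		node = child
	}
	return node.contains
}

// answerQuery intersects the document sets of all query words.
func answerQuery(root *trieNode, words []string) docSet {
	if len(words) == 0 {
		return docSet{}
	}
	result := getDocIDs(root, words[0])
	for _, w := range words[1:] {
		next := docSet{}
		for id := range getDocIDs(root, w) {
			if _, ok := result[id]; ok {
				next[id] = struct{}{}
			}
		}
		result = next
		if len(result) == 0 {
			break // no document contains them all, no need to check further
		}
	}
	return result
}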

Related

Reading from a slice of unknown length in Golang

I'm trying to replicate this algorithm for finding duplicates in an array in Golang. Here's the JavaScript version:
function hasDuplicateValue(array) {
	let existingNumbers = [];
	for (let i = 0; i < array.length; i++) {
		if (existingNumbers[array[i]] === 1) {
			return true;
		} else {
			existingNumbers[array[i]] = 1;
		}
	}
	return false;
}
On line 2, the algorithm creates an empty array of unknown length, and then sets a 1 at the index corresponding to each number that it finds (e.g. if it finds the number 3 in the array, it will set a 1 at index 3 of existingNumbers).
I'm wondering: how do I replicate this in Golang, since we need to have slots allocated in the slice before reading from it? Would I first need to find the max value in the array and then declare the existingNumbers slice to be of that same size?
Or is there a more efficient way of doing this, instead of searching through the array and finding the max value before constructing the slice?
Thanks!
Edit:
I realized that I can't do this with a slice because I can't read from an empty value. However, as @icza suggested, it will work with a map:
func findDuplicates(list []int) bool {
	temp := make(map[int]int)
	for _, elem := range list {
		if temp[elem] == 1 {
			return true
		} else {
			temp[elem] = 1
		}
	}
	return false
}
As suggested in the comments, I would also use a map to keep track of duplicates, but we can use map[int]struct{}, because empty structs consume no memory in Go.
I have also simplified the code a bit; it is as follows.
func findDuplicates(list []int) bool {
	temp := make(map[int]struct{})
	for _, elem := range list {
		if _, ok := temp[elem]; ok {
			return true
		}
		temp[elem] = struct{}{}
	}
	return false
}
Full code can be executed here

Sort 2D array of structs Golang

I want to create a consistent ordering for a 2D slice of structs. I am creating the 2D slice from a map, so the order is always different.
My structs look like
// Hit contains the data for a hit.
type Hit struct {
	Key  string  `json:"key"`
	Data []Field `json:"data"`
}

// Hits stores a list of hits.
type Hits [][]Hit
I want to provide a consistent order for the contents of my Hits type.
I have tried:
func (c Hits) Len() int { return len(c) }
func (c Hits) Swap(i, j int) { c[i], c[j] = c[j], c[i] }
func (c Hits) Less(i, j int) bool { return strings.Compare(c[i][0].Key, c[j][0].Key) == -1 }
But the results still seem to come back in random order.
I was thinking of possibly hashing each item in the slice, but thought there might be an easier option.
The order of iteration over a map is indeterminate; Go deliberately randomizes it, so you cannot rely on getting the same order twice, even for two maps built by inserting the same keys in the same sequence.
Assuming that your map is a map[string]Hit, to iterate over it in a determinate order, I would enumerate the set of keys in the map, sort that, and use that sorted set to enumerate the map.
Something like this:
package main

import (
	"fmt"
	"sort"
)

type Hit struct {
	Key  string  `json:"key"`
	Data []Field `json:"data"`
}

type Field struct {
	Value string `json:"value"`
}

func main() {
	var mapOfHits = getSomeHits()
	var sortedHits = sortHits(mapOfHits)
	for _, h := range sortedHits {
		fmt.Println(h.Key)
	}
}

func getSomeHits() map[string]Hit {
	return make(map[string]Hit, 0)
}

func sortHits(m map[string]Hit) []Hit {
	keys := make([]string, 0, len(m))
	sorted := make([]Hit, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		sorted = append(sorted, m[k])
	}
	return sorted
}

Generating combinatorial string from map

I have a map as such:
// map[int]: position in the string
// map[rune]bool: characters possible at said position
func generateString(in map[int]map[rune]bool) []string {
	// example: {0: {'A': true, 'C': true}, 1: {'E': true}, 2: {'I': true, 'X': true}}
	result := []string{"AEI", "AEX", "CEI", "CEX"} // should generate these
	return result
}
The difference from generating all possible permutations is that we specify which characters are possible at each position, and I think that's the real head-scratcher here.
First, we need to convert the map[int]map[rune]bool to a []map[rune]bool, since map iteration isn't guaranteed to be sorted by key.
After that, here is a recursive approach:
var res []string

func dfs(curString string, index int, in []map[rune]bool) {
	if index == len(in) {
		res = append(res, curString)
		return
	}
	for ch, ok := range in[index] {
		if !ok { // I assume the booleans can be false; skip runes that are not allowed
			continue
		}
		dfs(curString+string(ch), index+1, in)
	}
}
We can call it with dfs("", 0, arr), where arr is the given map converted to a slice; the answer will be in the res variable. A sketch of that conversion follows.
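A minimal sketch of the conversion and a wrapper around dfs, building on the res and dfs above and assuming the map's keys are exactly 0..len(in)-1 (the toSlice helper is my own name):

// toSlice converts the position-keyed map into a slice indexed by position,
// so that we visit the positions in order.
func toSlice(in map[int]map[rune]bool) []map[rune]bool {
	arr := make([]map[rune]bool, len(in))
	for pos, runes := range in {
		arr[pos] = runes
	}
	return arr
}

func generateString(in map[int]map[rune]bool) []string {
	res = nil // reset the package-level result slice
	dfs("", 0, toSlice(in))
	return res
}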

How can one do case-insensitive sorting using sort.Strings() in Golang?

Is there any way to pass a custom function to sort.Strings() to do case-insensitive sorting on a list of strings?
data := []string{"A", "b", "D", "c"}
The output should be: A, b, c, D
The equivalent of the above requirement in Python is:
li = sorted(data, key=lambda s: s.lower())
Do we have something like that in Go?
The translation of the Python code to Go is:
sort.Slice(data, func(i, j int) bool { return strings.ToLower(data[i]) < strings.ToLower(data[j]) })
Run it on the Go Playground.
This approach, like the Python code in the question, can allocate two strings for each comparison. The allocations are probably OK for the example in the question, but can be a problem in other scenarios.
To avoid allocations, compare the strings rune by rune:
func lessLower(sa, sb string) bool {
	for {
		rb, nb := utf8.DecodeRuneInString(sb)
		if nb == 0 {
			// The number of runes in sa is greater than or
			// equal to the number of runes in sb. It follows
			// that sa is not less than sb.
			return false
		}

		ra, na := utf8.DecodeRuneInString(sa)
		if na == 0 {
			// The number of runes in sa is less than the
			// number of runes in sb. It follows that sa
			// is less than sb.
			return true
		}

		rb = unicode.ToLower(rb)
		ra = unicode.ToLower(ra)

		if ra != rb {
			return ra < rb
		}

		// Trim rune from the beginning of each string.
		sa = sa[na:]
		sb = sb[nb:]
	}
}
⋮
sort.Slice(data, func(i, j int) bool { return lessLower(data[i], data[j]) })
Run it on the Go Playground.
Take a look at the collate package if you need to sort by language or culture specific sort orders.
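For reference, a minimal sketch of what that could look like with golang.org/x/text/collate and its IgnoreCase option (treat this as an assumption and check the package documentation before relying on it):

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	data := []string{"A", "b", "D", "c"}

	// Build a collator for English that ignores case differences.
	c := collate.New(language.English, collate.IgnoreCase)
	c.SortStrings(data)

	fmt.Println(data) // expected: [A b c D]
}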
The solution below is more verbose but more performant. The main difference is that calling strings.ToLower at each comparison, as in the simple sort.Slice approach above, allocates memory, while the code below compares runes without creating any new strings.
// lessCaseInsensitive compares s, t without allocating
func lessCaseInsensitive(s, t string) bool {
	for {
		if len(t) == 0 {
			return false
		}
		if len(s) == 0 {
			return true
		}

		c, sizec := utf8.DecodeRuneInString(s)
		d, sized := utf8.DecodeRuneInString(t)

		lowerc := unicode.ToLower(c)
		lowerd := unicode.ToLower(d)

		if lowerc < lowerd {
			return true
		}
		if lowerc > lowerd {
			return false
		}

		s = s[sizec:]
		t = t[sized:]
	}
}
sort.Slice(data, func(i, j int) bool { return lessCaseInsensitive(data[i], data[j]) })
You can see in this benchmark for example that avoiding allocs makes the case-insensitive sorting 5x faster.
You need a type that implements sort.Interface.
https://play.golang.org/p/JTm0AjuxCRV
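That could look something like this (a minimal sketch; the caseInsensitive type name is mine, not necessarily what the linked playground uses):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// caseInsensitive implements sort.Interface for a []string,
// comparing elements without regard to case.
type caseInsensitive []string

func (c caseInsensitive) Len() int      { return len(c) }
func (c caseInsensitive) Swap(i, j int) { c[i], c[j] = c[j], c[i] }
func (c caseInsensitive) Less(i, j int) bool {
	return strings.ToLower(c[i]) < strings.ToLower(c[j])
}

func main() {
	data := []string{"A", "b", "D", "c"}
	sort.Sort(caseInsensitive(data))
	fmt.Println(data) // [A b c D]
}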

Golang: find first character in a String that doesn't repeat

I'm trying to write a function that finds the first character in a string that doesn't repeat. So far I have this:
package main

import (
	"fmt"
	"strings"
)

func check(s string) string {
	ss := strings.Split(s, "")
	smap := map[string]int{}
	for i := 0; i < len(ss); i++ {
		smap[ss[i]]++
	}
	for k, v := range smap {
		if v == 1 {
			return k
		}
	}
	return ""
}

func main() {
	fmt.Println(check("nebuchadnezzer"))
}
Unfortunately, in Go there is no guarantee of order when you iterate over a map, so every time I run the code I get a different value. Any pointers?
Using a map and 2 loops:
play
func check(s string) string {
	m := make(map[rune]uint, len(s)) // preallocate the map size
	for _, r := range s {
		m[r]++
	}
	for _, r := range s {
		if m[r] == 1 {
			return string(r)
		}
	}
	return ""
}
The benefit of this is using just 2 loops, versus multiple loops if you use strings.ContainsRune or strings.IndexRune (each of those functions has its own inner loop).
An efficient (in time and memory) algorithm for grabbing all, or just the first, unique byte: http://play.golang.org/p/ZGFepvEXFT
func FirstUniqueByte(s string) (b byte, ok bool) {
	occur := [256]byte{}
	order := make([]byte, 0, 256)
	for i := 0; i < len(s); i++ {
		b = s[i]
		switch occur[b] {
		case 0:
			occur[b] = 1
			order = append(order, b)
		case 1:
			occur[b] = 2
		}
	}
	for _, b = range order {
		if occur[b] == 1 {
			return b, true
		}
	}
	return 0, false
}
As a bonus, the above function should never generate any garbage. Note that I changed your function signature to be a more idiomatic way to express what you're describing. If you need a func(string) string signature anyway, then the point is moot.
That can certainly be optimized, but one solution (which doesn't use a map) would be:
(playground example)
func check(s string) string {
	unique := ""
	for pos, c := range s {
		if strings.ContainsRune(unique, c) {
			unique = strings.Replace(unique, string(c), "", -1)
		} else if strings.IndexRune(s, c) == pos {
			unique = unique + string(c)
		}
	}
	fmt.Println("All unique characters found: ", unique)
	if len(unique) > 0 {
		_, size := utf8.DecodeRuneInString(unique)
		return unique[:size]
	}
	return ""
}
This is based on the question "Find the first un-repeated character in a string".
krait suggested below that the function should return a string containing the first full rune, not just the first byte of the UTF-8 encoding of the first rune.
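To illustrate why that matters, here is a small sketch (the input string is just an example, not from the question):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	unique := "éh" // suppose the first non-repeating character is multi-byte

	fmt.Println(unique[:1]) // slices a single byte: prints a broken character, not "é"

	_, size := utf8.DecodeRuneInString(unique)
	fmt.Println(unique[:size]) // slices the full rune: prints "é"
}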
