AutoComplete implementation - data structures

(From an Interview Question)
Say you have a DB table with two columns: SearchPhrase (String) | Popularity (Int).
You need to initialize a DS so that you could use it to implement an autocomplete
feature (like Google Suggest) comfortably. The requirement: once the data from the DB
is processed into the data structure, typing a letter gets you the 10 most popular search phrases from the DB starting with that letter; typing the next letter gets you the 10 most popular phrases starting with those two letters, and so on.
The question only concerns planning the DS and pseudocoding Insert, Search, etc.
Note: YOU CANNOT USE A TRIE.
Any ideas?

A trie would be the best fit but since you can't use it, what about a DAWG?
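If a DAWG also feels too trie-like, one simple non-trie structure does the job: keep all phrases in a slice sorted lexicographically, binary-search for the prefix range on each keystroke, and take the 10 most popular phrases in that range. A minimal sketch in Go (names illustrative; the per-query sort could be replaced by a size-10 heap, or by precomputing the top 10 per prefix):

package main

import (
    "fmt"
    "sort"
    "strings"
)

type Phrase struct {
    Text       string
    Popularity int
}

type AutoComplete struct {
    phrases []Phrase // kept sorted by Text
}

// NewAutoComplete builds the structure once from the DB rows: O(n log n).
func NewAutoComplete(rows []Phrase) *AutoComplete {
    ps := append([]Phrase(nil), rows...)
    sort.Slice(ps, func(i, j int) bool { return ps[i].Text < ps[j].Text })
    return &AutoComplete{phrases: ps}
}

// Top10 returns the ten most popular phrases starting with prefix.
func (ac *AutoComplete) Top10(prefix string) []Phrase {
    // All phrases sharing a prefix are contiguous in the sorted slice,
    // so binary-search the first candidate and scan forward.
    lo := sort.Search(len(ac.phrases), func(i int) bool { return ac.phrases[i].Text >= prefix })
    var matches []Phrase
    for i := lo; i < len(ac.phrases) && strings.HasPrefix(ac.phrases[i].Text, prefix); i++ {
        matches = append(matches, ac.phrases[i])
    }
    sort.Slice(matches, func(i, j int) bool { return matches[i].Popularity > matches[j].Popularity })
    if len(matches) > 10 {
        matches = matches[:10]
    }
    return matches
}

func main() {
    ac := NewAutoComplete([]Phrase{{"google", 100}, {"golang", 90}, {"gopher", 40}})
    fmt.Println(ac.Top10("go")) // google, golang, gopher
}

Search costs O(log n) to find the range plus O(k log k) to rank the k matches; if k gets large for short prefixes, precomputing the top-10 list per prefix trades memory for query time.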
Have you seen similar questions on here? For example:
autocomplete algorithms, papers, strategies, etc

How can I use the golang apache arrow library to read a repeated field from parquet?

I am using the apache arrow golang library to read parquet. Non-repeated columns seem straightforward, but how can I read a repeated field?
For reading repeated fields in Parquet there are really two answers: a complex way and an easy way.
The easy way is to use the pqarrow package and just read directly into an Arrow list array of some kind, letting the complexity be handled for you (https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/pqarrow).
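A minimal sketch of the easy way (untested; the file name is illustrative, and the calls follow the v10 docs linked above):

package main

import (
    "context"
    "fmt"

    "github.com/apache/arrow/go/v10/arrow/memory"
    "github.com/apache/arrow/go/v10/parquet/file"
    "github.com/apache/arrow/go/v10/parquet/pqarrow"
)

func main() {
    // Open the parquet file (path is illustrative).
    f, err := file.OpenParquetFile("data.parquet", false)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Wrap it in an Arrow-aware reader; repeated fields come back as
    // Arrow list arrays with no manual level decoding needed.
    rdr, err := pqarrow.NewFileReader(f, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
    if err != nil {
        panic(err)
    }

    tbl, err := rdr.ReadTable(context.Background())
    if err != nil {
        panic(err)
    }
    defer tbl.Release()

    fmt.Println(tbl.Schema()) // repeated columns appear as list<...>
}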
To read them the complex way, you have to understand repetition and definition levels and how Parquet uses them. Instead of trying to explain them here, I'm going to point you to the excellent write-up on the Apache Arrow blog here: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ which explains how to decode definition and repetition levels (yes it's in the context of the Rust implementation of Parquet, but the basic concepts are the same for the Go implementation).
All of the ColumnChunkReader types allow you to retrieve those definition and repetition levels in their ReadBatch methods. For an example, have a look at https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/file#Float32ColumnChunkReader.ReadBatch
When you call ReadBatch you can pass an []int16 for the definition levels and the repetition levels to be filled in alongside the data, and then you can use those to decode the repeated field accordingly. Personally, I prefer to use the pqarrow package which does it for you, but sometimes you do need the granular access.
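For the complex way, a hedged sketch for a float32 column (assumes row group 0, column 0 is the repeated field; slice sizes are arbitrary and the code is untested):

package main

import (
    "fmt"

    "github.com/apache/arrow/go/v10/parquet/file"
)

func main() {
    f, err := file.OpenParquetFile("data.parquet", false)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Assumption: column 0 of row group 0 is a repeated float32 field.
    col, err := f.RowGroup(0).Column(0)
    if err != nil {
        panic(err)
    }
    rdr := col.(*file.Float32ColumnChunkReader)

    values := make([]float32, 1024)
    defLvls := make([]int16, 1024)
    repLvls := make([]int16, 1024)

    // total counts levels (including nulls/empty lists); read counts the
    // actual values written into the values slice.
    total, read, err := rdr.ReadBatch(1024, values, defLvls, repLvls)
    if err != nil {
        panic(err)
    }
    fmt.Println(total, read)
    // A repetition level of 0 marks the start of a new row; higher levels
    // continue the current (nested) list; see the blog post above.
}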

How can I query kendo trees and MVVM models?

According to the docs I should be able to do this ...
$("#tree").data("kendoTreeView").expand(".k-item");
Great if I want to expand everything, but what if I only want to expand nodes where the property "expanded" in my model items is set to true?
Is there a way I can query the tree based on something in the model and then perform an action on all results?
The real answer here is quite long; the short version being, as with everything Kendo: spend hours with support to be given half the solution and be told to write the rest yourself.
I got round this problem by using another library (jslinq) to query the model data.
This is yet another frustrating issue with Kendo that really should, at the very least, be offered as a core part of the hierarchy data source at some basic level (as it stands, it's essentially an incomplete implementation).

Elasticsearch subset filter

I have a dataset about books, each of which can be in one or more languages. Every user is registered as having one or more languages.
When a user searches for books, I'd like to return only those books where they understand all of the book's languages.
For example, the following two books are in the system:
Book A: English, French, German
Book B: English, Greek
If John is registered as knowing English, German, French, and Italian, then his query results should never include Book B.
My system is currently written using Apache Solr, where I ended up writing a plugin to perform a subset operation (where a record matches if the languages of the record are a subset of the languages of the user, where the user's languages are declared in the query).
However, I'd like to transition to an Elasticsearch backend. This particular subsetting behavior, however, doesn't seem to be part of the core filter package. Am I missing something, or should I look at writing a similar plugin / custom filter?
This can be done using a script filter: you can pass it a comma-separated list of strings as a param and use a for loop to ensure each of the document's languages is contained in it; if even one is not, break and return false. If all are present, the loop exits and it returns true.
I'm not sure how efficient this is, but theoretically this can be done in Elasticsearch. Ideally, apply an optimized filter first to narrow down the set of books and then run the script on that subset. Have a look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets and the docs on post_filter; the efficiency should ideally be tested over a bunch of queries, as this filter will perform better once its result begins to be cached.
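To make that concrete, here is a hypothetical request body printed from Go. The languages field, the index name, the params list, and the script syntax (Groovy then, Painless now) are all assumptions, not something from the original answer:

package main

import "fmt"

// scriptFilter is a hypothetical query body for the approach above: the
// script returns false as soon as it sees a document language the user
// does not know, and true otherwise.
const scriptFilter = `{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "for (l in doc['languages']) { if (!params.known.contains(l)) return false } return true",
            "params": { "known": ["english", "german", "french", "italian"] }
          }
        }
      }
    }
  }
}`

func main() {
    fmt.Println(scriptFilter) // POST to <host>/books/_search (index name hypothetical)
}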
Another possible answer is to turn the problem on its head. This data has certain characteristics. Assuming sufficient scale and real-world practicalities, the basic idea is that the cardinality of the language field is extremely low relative to books, users, and authors (you could further improve this by indexing language roots as a field at index time, e.g. Latin- as the root for its descendant languages; see http://en.wikipedia.org/wiki/List_of_proto-languages). Users frequently tend to know languages from the same family, so you can exploit this fact to your benefit.
The user query would then essentially be the difference between the set of all languages present and the set the user knows. These can easily be modeled as a bunch of filters using the execution:bool flag (extremely optimized bitsets internally) to cache and combine them. Make sure you are wise about the execution order of the filters; have a look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets
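A sketch of that inverted approach in Go: compute the set difference and exclude any book containing a language outside the user's set. The field name and the bool/must_not form are illustrative (the execution:bool flag mentioned above belongs to older Elasticsearch versions):

package main

import (
    "encoding/json"
    "fmt"
)

// subsetQuery builds a filter that excludes any book containing a
// language the user does not know; what remains is exactly the books
// whose languages are a subset of the user's.
func subsetQuery(allLangs, knownLangs []string) ([]byte, error) {
    known := make(map[string]bool)
    for _, l := range knownLangs {
        known[l] = true
    }
    var unknown []string
    for _, l := range allLangs {
        if !known[l] {
            unknown = append(unknown, l)
        }
    }
    // If unknown is empty, the must_not clause can simply be dropped.
    q := map[string]interface{}{
        "query": map[string]interface{}{
            "bool": map[string]interface{}{
                "must_not": map[string]interface{}{
                    "terms": map[string]interface{}{"languages": unknown},
                },
            },
        },
    }
    return json.MarshalIndent(q, "", "  ")
}

func main() {
    body, _ := subsetQuery(
        []string{"english", "french", "german", "greek", "italian"},
        []string{"english", "german", "french", "italian"})
    fmt.Println(string(body)) // must_not terms: ["greek"], so Book B is excluded
}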

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks whether you want to include X in a mail when you type Y's name and you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' "related-ness" in an object or database... thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then, when you press the "publish" button, analysing the new post and proposing all the tags matched to its characteristics.
This can be done in several ways, from a "totally hard-coded" association to some sort of "learning AI"... and everything in between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length if you tag posts "short" or "long", the presence of photos or videos for "multimedia-content", etc.). The most obvious, however, is to focus on which words are used in the post. For example you could build a mapping like this:
tag_hint_words = {'code-development': ['programming', 'language', 'python',
                                       'function', 'object', 'method'],
                  'family': ['Theresa', 'kids', 'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the lists (the code between [ and ]) and propose the tag (the word before the :) as a possible candidate.
A common approach is to give "scores", or in other words to assign a number that indicates the probability that a given tag is the right one. For example: if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming", the program should indicate family as the most likely tag to use, as there are many more words hinting at it.
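To make the scoring idea concrete, here is a small sketch in Go mirroring the mapping above (naive case-insensitive substring matching; a real version would tokenize the post):

package main

import (
    "fmt"
    "strings"
)

// tagHintWords mirrors the mapping above (hint words are illustrative).
var tagHintWords = map[string][]string{
    "code-development": {"programming", "language", "python", "function", "object", "method"},
    "family":           {"Theresa", "kids", "uncle Ben", "holidays"},
}

// scoreTags counts how many of each tag's hint words occur in the post.
func scoreTags(post string) map[string]int {
    lower := strings.ToLower(post)
    scores := make(map[string]int)
    for tag, hints := range tagHintWords {
        for _, w := range hints {
            if strings.Contains(lower, strings.ToLower(w)) {
                scores[tag]++
            }
        }
    }
    return scores
}

func main() {
    post := "After months of programming, we finally left for the summer " +
        "holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!"
    fmt.Println(scoreTags(post)) // family: 4, code-development: 1
}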
Learning AI's
One of the obvious limitations of the above method is that, say you one day pick up Java besides Python, you would probably need to change your code to include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they do indeed... learn!). Some algorithms require initial training (many spam filters and voice-recognition programs need this initial "primer"); some don't.
I am absolutely no expert on the subject, but two common AIs are the Naive Bayes classifier and some flavour of neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject, "Easy AI with Python". I found it very informative and even somewhat inspiring! :)
HTH!
You should have a look at this post:
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags, it will help.
Relevancy searches where multiple agents play a part are usually done using collaborative filtering. You might want to give that a look.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

Searching algorithmics: Parsing and processing a request OOP style

Say you were to create a search engine that accepts a query statement in the form of a String. The statement can be used to retrieve different types of objects with a given set of characteristics, possibly linked to other objects. In plain English or pseudo-code using an OOP approach, how would you go about parsing and processing statements such as the following to get the series of desired objects?
get fruit with colour green
get variety of apples, pears from Andy
get strawberry with colour "deep red" and origin not Spain
get total of sales of melons between 2010-10-10 and 2010-12-30
get last deliverydate of bananas from "Pete" and state not sold
Hope the question is clear. If not I'll be more than happy to reformulate.
P.S: This isn't homework ;)
Your problem is well suited to a document-oriented store such as Lucene. For example, you can design a schema such as:
Type
Variety
Color
Origin
DateSold
etc.
Then you can write a Lucene query such as Type:Fruit AND Color:Green. You can also build nested queries such as (Type:Strawberry AND Color:"Deep Red") AND NOT Origin:Spain.
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built using the Lucene library and is easily integrable into your platform of choice because it has a RESTful API.
BTW, Solr has something called faceting, which lets the user filter results using each of the criteria above. So the user types fruit into the search box and then gets back results like:
Type:
- Fruit (109)
- Nut (99)
Origin:
- Spain (32)
- France (39)
Color:
- Red (22)
- Deep Red (45)
Clicking on each of the facets filters the results to the intersection. So if you want a more user-friendly interaction model, faceting/filtering is much easier than getting users to type extensive Lucene queries.
Update: You might still need to do some lexical parsing if you wish to let users type natural-language queries and break them down, but given how tremendously difficult that challenge is, my suggestion would be to use the simple and powerful faceting approach.
Hope that helps.
It sounds like you're developing a mini-language, since you're concerned with syntax and parsing. So check out the many tools used to generate lexers and parsers. You can start here: http://en.wikipedia.org/wiki/Lexical_analysis
I agree with John.
a) Start with lexical analysis
b) Take statistics of searches and use them to index
c) Find relationships by analysing possibly related searches
This is just a wild guess, though; I've never tried it before.
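As a hedged starting point for the lexical-analysis step both answers mention, here is a small tokenizer sketch in Go for the example statements. The keyword set and token kinds are assumptions about the mini-language, not a full grammar:

package main

import (
    "fmt"
    "strings"
)

// Kind distinguishes structural keywords from everything else.
type Kind int

const (
    KEYWORD Kind = iota // get, with, and, not, of, from, between
    WORD                // nouns and values
)

type Token struct {
    Kind Kind
    Text string
}

var keywords = map[string]bool{
    "get": true, "with": true, "and": true, "not": true,
    "of": true, "from": true, "between": true,
}

// lex splits a statement into tokens, keeping quoted strings together.
func lex(stmt string) []Token {
    var tokens []Token
    for _, f := range splitQuoted(stmt) {
        if keywords[strings.ToLower(f)] {
            tokens = append(tokens, Token{KEYWORD, strings.ToLower(f)})
        } else {
            tokens = append(tokens, Token{WORD, f})
        }
    }
    return tokens
}

// splitQuoted splits on spaces but treats "deep red" as one field.
func splitQuoted(s string) []string {
    var out []string
    var cur strings.Builder
    inQuote := false
    for _, r := range s {
        switch {
        case r == '"':
            inQuote = !inQuote
        case r == ' ' && !inQuote:
            if cur.Len() > 0 {
                out = append(out, cur.String())
                cur.Reset()
            }
        default:
            cur.WriteRune(r)
        }
    }
    if cur.Len() > 0 {
        out = append(out, cur.String())
    }
    return out
}

func main() {
    fmt.Println(lex(`get strawberry with colour "deep red" and origin not Spain`))
}

A parser can then consume this token stream: the first WORD after get is the object type, with introduces attribute/value pairs, and not negates the following value.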
