Which ER-model notation does Elasticsearch default to? - elasticsearch

As asked in the title I'd like to know which ER-model notation Elasticsearch defaults to.
For context, I've received a database schema that I cannot openly share, but the following screenshot shows one of the obscure-looking relations.
After a few quick Google searches I was surprised at the fact that I couldn't find any official statement about Elasticsearch's default notation. While searching I came across this post where the OP showed a similar screenshot, but without the obscure double-dashed-line notation: https://discuss.elastic.co/t/convert-relational-schema-to-elasticsearch-mapping/72291
Is this Crow's Foot Notation, or something else?

There's no official model notation for Elasticsearch, so it comes down to whatever the tool that built this diagram uses.

Related

Matching users with objects based on keywords and activity in Ruby

I have users that have authenticated with a social media site. Now based on their last X (let's say 200) posts, I want to map how much that content matches up with a finite list of keywords.
What would be the best way to do this to capture associated words/concepts (maybe that's too difficult) or just get a score of how much, say, my tweet history maps to 'Walrus' or 'banana'?
Would a naive Bayes work here to separate into 'matches' and 'no match'?
In Python I would say NLTK can easily do it. In Ruby, maybe a gem called lda-ruby will help you. The whole LDA concept is well explained here - look at Sarah Palin's email for example. There's even an example of an app (not entirely in Ruby, but still) which did that -> github.com/echen/sarah-palin-lda
Or maybe I just say stupid things and that can't help you at all. I'm not an expert ;)
A simple naive Bayes classifier would work in this case. It is widely used to detect whether emails are spam or not, so it should work pretty well for simple keyword matching.
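As a rough illustration of the naive Bayes suggestion, here is a minimal Python sketch (the thread is about Ruby, but the idea carries over). The function names, the tiny training set, and the 'match' / 'no match' labels are all made up for the example, and it uses plain whitespace tokenization with add-one smoothing rather than anything tuned.

```python
import math
from collections import Counter

def train(labeled_posts):
    """Count words per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of posts with that label
    vocab = set()
    for text, label in labeled_posts:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # log prior of the label
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # add-one (Laplace) smoothing so unseen words do not zero the score
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled posts for the keyword 'walrus'.
training = [
    ("the walrus swam near the ice", "match"),
    ("walrus tusks are huge", "match"),
    ("i had coffee this morning", "no match"),
    ("traffic was terrible today", "no match"),
]
model = train(training)
print(classify(model, "saw a walrus on the ice"))
```

In practice you would need far more labeled posts per keyword than this, and one such classifier (or one label) per keyword in the list.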
For this problem you could also apply a recommendation system where you look for the top recommended keyword for a user (or for a post).
There are a ton of ways of doing this. I would recommend reading Programming Collective Intelligence. It is explained using Python, but since you know Ruby there should be no problem understanding the code.

How do I see/debug the way SOLR finds its results?

Let's say I search for "ABLS" and the SOLR returns a result that to me does not make any sense.
How can I debug why SOLR picked this record to be returned?
debugQuery=true would help you get the detailed score calculation and the explanation for each score.
An overview of the scoring is available at link
For a detailed explanation of the debug information you can refer to Link
You could add debugQuery=true&indent=true to the URL and examine the results. You could also use the analysis tool in Solr. Go to the admin and click analysis. You would need to read the wiki to understand either of these in more depth.
debugQuery will tell you why your scoring looks the way it does (and how every field is relevant).
I would take some results that you do not understand and play with them in Solr's analysis tool.
You should find it under:
/admin/analysis.jsp?highlight=on
Alternatively, turn on highlighting over your results to see what is actually matching in your results.
Solr queries are full of short parameters that are hard to read and modify, especially when there are many of them.
Afterwards it is even harder to debug and understand why one document is more or less relevant than another. The debug explain output is usually a tree too big to fit on one page.
I found this Google Chrome extension useful for viewing the Solr query explain and debug output in a clear manner.
For those who still use a very old version of Solr (3.x), "debugQuery=true" will not output the debug information; you should specify "debugQuery=on" instead.
There are two ways of doing that. First is the query level, which means adding the debugQuery=on to your query. That will include a few things:
parsed query
debug timing information
detailed scoring information, which helps you analyze why a given document received its score.
In addition to that, you can use the [explain] transformer and add it to your fl parameter. For example ...&fl=*,[explain], which will result in your documents having the scoring information as another field.
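To make the parameters above concrete, here is a small Python sketch that only builds the request URL. The host, port, and core name (localhost:8983, mycore) are assumptions for illustration; substitute whatever your Solr installation uses. The query parameters themselves (debugQuery, indent, and fl with the [explain] transformer) are the ones mentioned in the answers.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

def build_debug_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr for debug and explain output."""
    params = {
        "q": query,
        "debugQuery": "on",    # parsed query, timing, and scoring details
        "indent": "true",      # pretty-print the response
        "fl": "*,[explain]",   # add the score explanation as a field per document
    }
    # urlencode percent-encodes the special characters in the fl value.
    return base_url + "?" + urlencode(params)

url = build_debug_url("ABLS")
print(url)
```

Pasting the printed URL into a browser is usually the quickest way to read the explain output for a single query.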
The scoring information can be quite extensive and will include calculations done by the similarity algorithm. If you would like to learn more about the similarities and the scoring algorithm in Solr, have a look at the talk my colleague Radu from Sematext and I gave at the Activate conference: https://www.youtube.com/watch?v=kKocQdYGVJM

How does spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (maybe by mistake, or maybe when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is:
count the number of instances of each character and then scan the dictionary to find a word with the same number of instances of each character (allowing a difference of +-1). But this will also return anagrams.
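To see why the character-count idea also returns anagrams, here is a minimal Python sketch. It assumes one reading of the "+-1" rule, namely that the character counts of the two words may differ by at most one in total; the function name and the small dictionary are made up for the example.

```python
from collections import Counter

def counts_close(word, candidate, tolerance=1):
    """True if the character counts of the two words differ by at most
    `tolerance` in total (one assumed reading of the +-1 rule)."""
    a, b = Counter(word), Counter(candidate)
    difference = sum(abs(a[ch] - b[ch]) for ch in set(a) | set(b))
    return difference <= tolerance

# Hypothetical dictionary; "genre" is an anagram of "green".
dictionary = ["green", "genre", "great", "blue"]
print([w for w in dictionary if counts_close("gren", w)])
```

Both "green" and "genre" pass the check for the input "gren", because character counts ignore the order of the letters.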
Is some kind of probabilistic model, such as a Markov model, of any use here? I don't understand Markov models well enough to throw the term around; it's just a very wild guess.
Any insights?
You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When your search is close to other, earlier searches that returned more results, Google shows suggestions based on them.
It is not spell checking as such; instead it shows what other people searched for with related keywords.

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
"the "ask a question" search is exclusively on title and will not match anything in the body. It is a mystery to me why people think it's better."
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with the site:stackoverflow.com prefix than to rely on SO to provide the relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. Lucene open source search can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the headlines of the questions, as an approximation.
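The index-and-match idea described above can be sketched in a few lines of Python. This is only an illustration of keyword matching against titles, not how Stack Overflow or Lucene actually implement it; the function names and sample titles are made up, and there is no stemming, stop-word removal, or relevance weighting.

```python
from collections import Counter, defaultdict

def build_index(titles):
    """Map each lowercase word to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in title.lower().split():
            index[word].add(doc_id)
    return index

def similar_titles(index, titles, query, limit=3):
    """Rank titles by how many distinct query words they share."""
    hits = Counter()
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    return [titles[doc_id] for doc_id, _ in hits.most_common(limit)]

# Hypothetical question titles.
titles = [
    "Lightweight fuzzy search library",
    "How do I debug Solr results",
    "Searching a datastore for related topics by keyword",
]
index = build_index(titles)
print(similar_titles(index, titles, "fuzzy search library for typos"))
```

Note that the third title is returned only because it shares the word "for" with the query, which is the kind of inaccuracy the answer alludes to; a real engine would down-weight such common words.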
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering is: keywords are extracted from the question, then tags associated (in the history) with the keywords are found. Questions which have those tags are returned. Well, experiments are needed to see whether it works well at all.

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use full-text search engines like Lucene, but I think it's overkill.
Edit:
To make the question clearer, here is the main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS' intellisense), but it should be possible to filter the list by a string that is not present in it but is close enough to some string that is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
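As an illustration of the n-gram idea without Lucene, here is a minimal Python sketch that compares padded character trigrams with Jaccard similarity. This is a swapped-in, standalone version of the technique, not the answerer's Lucene setup; the function names and the 0.3 threshold are assumptions chosen for the example.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces
    so that the start and end of the word also form n-grams."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of a and b."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def fuzzy_filter(term, candidates, threshold=0.3):
    """Return candidates whose similarity to term reaches the threshold,
    best match first."""
    scored = [(similarity(term, c), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]

colors = ["Red", "Green", "Blue"]
print(fuzzy_filter("Gren", colors))
print(fuzzy_filter("Geen", colors))
```

For a large list you would precompute the n-gram sets once (or index them) instead of recomputing them for every keystroke.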
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
I'm not sure how well Lucene is suited for fuzzy searching; a custom library would be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for that task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding - Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it is LAMP, it's far lighter than Lucene on Java, and can easily be extended for other file types, provided you can find a conversion library or command-line converter - there are lots of OSS solutions around to do this.
Try Walnutil - based on the Lucene API - integrated with SQL Server and Oracle DBs. You can create any type of index and then use it. For simple searches you can use some methods from walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with Walnutil Tools. You can also see some code examples written in Java and C# which you can use for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
@aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
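For reference, Levenshtein distance is short enough to write by hand. The following Python sketch uses the standard two-row dynamic-programming formulation; the closest helper and its max_distance default of 1 are assumptions added to show how it could filter the color list from the question.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def closest(term, candidates, max_distance=1):
    """Return candidates within max_distance edits of term, nearest first.
    Hypothetical helper; comparison is case-insensitive."""
    scored = [(levenshtein(term.lower(), c.lower()), c) for c in candidates]
    return [c for dist, c in sorted(scored) if dist <= max_distance]

colors = ["Red", "Green", "Blue"]
print(closest("Gren", colors))
print(closest("Geen", colors))
```

Scanning every string like this is linear in the list size, which is fine for a few thousand entries; beyond that, an index (n-grams, a BK-tree, or a search engine) becomes worthwhile.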
A powerful, lightweight solution is sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast, battle-tested, has libraries for every environment, and it's used by large companies, like craigslist.org
