Find item by arbitrary query

Problem: I have an item in the database called "AABGng-LS 4х4 0.66kV". AABG is the vendor, ng-LS is the type, 4*4 is the cable cross-section, and 0.66 kV is the voltage. Users must be able to find this item with any of these queries:
AABG ng LS 4х4 660 V
AABGng-LS-660 4х4
AABG ng-LS 0.66 4*4
How can this be solved (what algorithm)? I prefer Ruby, but an algorithm in any language is welcome.

The problem you are describing is that of a search index. Building one yourself involves many steps, such as normalizing, stemming, and matching.
I would advise you to have a look at Lucene-based search engines such as Elasticsearch or Solr.
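To give a feel for what the normalizing and matching steps involve, here is a minimal sketch in Python (not Ruby, and not how Elasticsearch or Solr work internally). It assumes that stripping punctuation, mapping the Cyrillic "х" and "*" to a Latin "x", and comparing character trigrams is good enough. The ITEMS catalogue and all function names are made up for illustration, and no unit conversion is attempted, so "660 V" and "0.66 kV" only match partially.

```python
import re

# Hypothetical catalogue; only the first entry comes from the question.
ITEMS = [
    "AABGng-LS 4х4 0.66kV",
    "AABGng-LS 3х2.5 0.66kV",
    "VVG 4х4 1kV",
]

def normalize(text):
    """Lowercase, unify the 'by' sign, and drop everything but a-z and 0-9."""
    text = text.lower()
    text = text.replace("\u0445", "x").replace("*", "x")  # Cyrillic 'х' and '*'
    return re.sub(r"[^a-z0-9]", "", text)

def trigrams(text):
    """Set of 3-character substrings of the normalized text."""
    s = normalize(text)
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def search(query, items):
    """Return the catalogue item most similar to the query."""
    return max(items, key=lambda item: similarity(query, item))

for query in ["AABG ng LS 4х4 660 V", "AABGng-LS-660 4х4", "AABG ng-LS 0.66 4*4"]:
    print(query, "->", search(query, ITEMS))
```

A real search index adds analyzers, synonyms and ranking on top of this, which is why an existing engine is usually the better choice.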

Related

How to ignore "stop words" while sorting in MarkLogic?

Is there any way to ignore "stop words" while sorting?
For example:
I have words like
dixit
singla
the marklogic
On sorting in descending order the result should be
singla, the marklogic, dixit
As in the above example, "the" is ignored.
Any way to achieve this?
Update:
A stop word can occur anywhere in the text.
For example:
the MarkLogic
MarkLogic is the best
the MarkLogic is awesome
Sorting should not consider any stop word in the text.
The above is just a small example to describe the problem.
In practice, I am using the search:search API.
For sorting, I am using sort-order search options.
The element on which I have to perform sorting is dynamic. There are approx 30-35 elements.
Is there any way to customize the collation at this level, for example by configuring some words (stop words) to be ignored while sorting?

There is no standard collation URI that is going to do this for you (at least none that I've ever seen). You can do it dynamically, of course, by sorting on the result of a function invocation, but if you want it done efficiently at scale (and available to search:search), then you need to materialize the sortable string into your document. I've often done this as an attribute on the element:
<title sortable="Great Gatsby, The">The Great Gatsby</title>
Then you put a range index on the title/@sortable attribute.
You can also use the "envelope pattern" where materialized metadata like this is maintained in its own section of the document with the original kept in its own section. For things like this, I think it's a bit more elegant to decorate the elements directly, to keep the context.
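As a rough illustration of how such a sortable string could be computed before it is written into the document, here is a small Python sketch. The article and stop-word lists and both function names are assumptions for the example, not anything MarkLogic provides.

```python
# Assumed word lists; adjust them to your own content.
LEADING_ARTICLES = {"the", "a", "an"}
STOP_WORDS = {"the", "a", "an", "is"}

def sortable_title(title):
    """Move a leading article to the end: 'The Great Gatsby' -> 'Great Gatsby, The'."""
    words = title.split()
    if len(words) > 1 and words[0].lower() in LEADING_ARTICLES:
        return " ".join(words[1:]) + ", " + words[0]
    return title

def stop_word_free_key(text):
    """Lowercased text with every stop word removed, wherever it occurs."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

print(sortable_title("The Great Gatsby"))
print(sorted(["dixit", "singla", "the marklogic"], key=stop_word_free_key, reverse=True))
```

Either result can be stored in the sortable attribute at load time, so the range index does the sorting and nothing has to be computed per query.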

If I understand your question correctly you're trying to get rid of the definite article when sorting your result-set.
To do this you need to use some additional functions and create a 'sort' criterion. My solution would look like this (I'm also including some sample documents so that you can test this just by copy-pasting):
(:
xdmp:document-insert("/peter.xml", <person><firstName>Peter</firstName><lastName>O'Toole</lastName><age>60</age></person>);
xdmp:document-insert("/john.xml", <person><firstName>John</firstName><lastName>Adams</lastName><age>18</age></person>);
xdmp:document-insert("/simon.xml", <person><firstName>Simon</firstName><lastName>Petrov</lastName><age>22</age></person>);
xdmp:document-insert("/mark.xml", <person><firstName>Mark</firstName><lastName>the Lord</lastName><age>25</age></person>);
:)
for $person in /person
let $sort := fn:reverse(fn:tokenize($person/lastName, ' '))[1]
order by $sort
(: return $person :)
return $person/lastName/text()
Notice that now the sort order is going to be
- Adams
- the Lord
- O'Toole
- Petrov
I hope this will help.

Solr query conundrum

I've recently swapped from using Lucene for Sitecore to Solr.
For the most part it has been smooth, but the way I was writing some queries (using the Sitecore.ContentSearch.Linq abstraction) no longer seems to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (ID: 000)
  X
  Y
  Z
  Regions (ID: 111)
    Region 1 (ID: 221)
      A
      B
    Region 2 (ID: 222)
      D
My code worked on Lucene, but now doesn't on Solr. It should find all "global" content and a single region's content, excluding all other regions' content. So as an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field for each item in the index called "_path" which is a multivalued string field of IDs, so as an example, Region 1's _path field value would be [000, 111, 221 ].
When I write this using the Linq abstraction it comes out as below which doesn't return results.
-_path:(111) OR _path:(221)
But _path:(111) does return results. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant) it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is that I can't add these extra brackets: I'm working in the Linq abstraction, so the brackets will be "optimized" out.
Any advice would be awesome! Cheers.

The problem here is that Lucene's negative queries don't work the way you think they do. They only remove results from what has been found. -_path:111 doesn't find all documents which aren't in 111; it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which from your hierarchy, I assume, is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: Don't use Lucene's AND/OR/NOT syntax, if you can help it. Use +/-. +/- syntax actually expresses how the query operates, AND/OR/NOT doesn't. It attempts to shoehorn it into a different, SQL-like retrieval model and leads to some unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it gets some results, but only the same results as _path:221, the reason is that -_path:111 gets no results, so your query is, in practice, something like (nothing) OR (_path:221), which is equivalent to _path:221.
If it really does get the results you expect (I'm guessing it probably does), something is translating your query into something like (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure it applies in this case. Essentially, it puts a match-all in front of any lonely negative queries it finds, allowing them to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. Lucene is an inverted index, so it is good at finding the matches for a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much have to do a full scan of the index.
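The behaviour described above can be imitated with plain Python sets. This is only a toy model of a boolean query with optional and prohibited clauses, not Lucene or Solr code; the DOCS table mirrors the hierarchy from the question, and the function names are invented for the sketch.

```python
# Toy index: content item -> values of its _path field (from the question).
DOCS = {
    "X": ["000"], "Y": ["000"], "Z": ["000"],
    "A": ["000", "111", "221"], "B": ["000", "111", "221"],
    "D": ["000", "111", "222"],
}

def term(path_id):
    """Documents whose _path contains the given ID."""
    return {name for name, path in DOCS.items() if path_id in path}

def match_all():
    """Stand-in for *:*."""
    return set(DOCS)

def boolean_query(should=(), must_not=()):
    """Union of the positive clauses, minus everything the negative clauses hit.

    With no positive clause there is nothing to subtract from, so the
    result is empty. That is the surprise in the question.
    """
    matched = set().union(*should)
    for excluded in must_not:
        matched -= excluded
    return matched

# -_path:(111) OR _path:(221)  behaves like  -_path:111 _path:221
broken = boolean_query(should=[term("221")], must_not=[term("111")])

# (*:* -_path:111) (_path:221)
not_in_regions = boolean_query(should=[match_all()], must_not=[term("111")])
fixed = boolean_query(should=[not_in_regions, term("221")])

print(sorted(broken))
print(sorted(fixed))
```

The second form only works because the match-all clause gives the negative clause something to subtract from.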

Sorting by counting the intersection of two lists in MongoDB

We have a post-analysis requirement: for a specific post, we need to return a list of the posts most related to it, where relatedness is the number of tags the posts have in common. For example:
postA = {"author":"abc",
"title":"blah blah",
"tags":["japan","japanese style","england"],
}
There may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
So basically, postB scores 2 and postC scores 1 when compared with the tags of postA. postD scores 0 and will not be included in the result.
My current understanding is that map/reduce should produce the result. I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way, like a custom sorting function? I'm currently using PyMongo, as I'm a Python developer.

You should create an index on tags:
db.posts.create_index([('tags', 1)])  # ensure_index was removed in PyMongo 4
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({'_id': {'$ne': postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by the size of the intersection in Python:
key = lambda post: len(set(post['tags']) & set(postA['tags']))
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately, a plain find() query has no way to sort and limit by the size of the intersection.
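The aggregation framework in newer MongoDB releases can compute the intersection size on the server instead. A hedged sketch, assuming MongoDB 3.4 or later (for $addFields; $setIntersection and $size are older): the helper name related_posts_pipeline and the common_tags field are made up for illustration, and the function only builds the pipeline, which you would then pass to db.posts.aggregate(...).

```python
def related_posts_pipeline(post, limit=10):
    """Build an aggregation pipeline that ranks other posts by shared tags."""
    return [
        # Keep only other posts that share at least one tag (can use the tags index).
        {"$match": {"_id": {"$ne": post["_id"]}, "tags": {"$in": post["tags"]}}},
        # common_tags (hypothetical field name) = size of the tag intersection.
        {"$addFields": {
            "common_tags": {"$size": {"$setIntersection": ["$tags", post["tags"]]}}
        }},
        {"$sort": {"common_tags": -1}},
        {"$limit": limit},
    ]

postA = {"_id": 1, "tags": ["japan", "japanese style", "england"]}
pipeline = related_posts_pipeline(postA, limit=5)
# related = list(db.posts.aggregate(pipeline))  # needs a live connection
```

Only the top results cross the network this way, which addresses the data-transfer concern when many posts share a tag.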

R: Which heatmap/image to get row-sorted plot without any dendrogram?

Which package is best for a heatmap/image sorted on rows only, without showing any dendrogram or other visual clutter (just a 2D colored grid with automatic named labels on both axes)? I don't need fancy clustering beyond basic numeric sorting. The data is a 39x10 table of numerics in the range (0,0.21) which I want to visualize.
I searched SO (see this) and the R sites, and tried a few out. Check out R Graphical Manual to see an excellent searchable list of screenshots and corresponding packages.
The range of packages is confusing - which one is the preferred heatmap (like ggplot2 is for most other plotting)? Here is what I found out so far:
base::image - bad, no name labels on axes, no sorting/clustering
base::heatmap - options are far less intelligible than the following:
pheatmap::pheatmap - fantastic but can't seem to turn off the dendrograms? (any hacks?)
ggplot2 people use geom_tile, as Andrie points out
gplots::heatmap.2, ref - seems to be favored by biotech people, but way overkill for my purposes. (no relation to ggplot* or Prof Wickham)
plotrix::color2D.matplot also exists
base::heatmap is annoying: even with args heatmap(..., Colv=NA, keep.dendro=FALSE) it still plots the unwanted dendrogram on rows.
For now I'm going with pheatmap(..., cluster_cols=FALSE, cluster_rows=FALSE) and manually presorting my table, like this guy: Order of rows in heatmap?
Addendum: to display the value inside each cell, see: display a matrix, including the values, as a heatmap . I didn't need that but it's nice-to-have.

With pheatmap you can use options treeheight_row and treeheight_col and set these to 0.

Just another option you have not mentioned: the bipartite package, which is as simple as you ask for.
library(bipartite)
mat<-matrix(c(1,2,3,1,2,3,1,2,3),byrow=TRUE,nrow=3)
rownames(mat)<-c("a","b","c")
colnames(mat)<-c("a","b","c")
visweb(mat,type="nested")

Where can I learn more about the Google search "did you mean" algorithm? [duplicate]

Possible Duplicate:
How do you implement a “Did you mean”?
I am writing an application where I require functionality similar to Google's "did you mean?" feature used by their search engine.
Is there source code available for such a thing or where can I find articles that would help me to build my own?

You should check out Peter Norvig's article about implementing a spelling corrector in a few lines of Python:
How to Write a Spelling Corrector. It also has links to implementations in other languages (e.g. C#).
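For a flavour of the approach, here is a much-reduced sketch in the spirit of that article, not Norvig's actual code. It only looks at candidates one edit away, and the tiny WORDS frequency table is invented; a real corrector would count words from a large text corpus.

```python
from collections import Counter

# Made-up word frequencies standing in for counts from a real corpus.
WORDS = Counter({"spelling": 10, "corrector": 5, "search": 20, "speaking": 3})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word

print(correct("speling"))
print(correct("serach"))
```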

I attended a seminar by a Google engineer a year and a half ago, where they talked about their approach to this. The presenter was saying that (at least part of) their algorithm has little intelligence at all; but rather, utilises the huge amounts of data they have access to. They determined that if someone searches for "Brittany Speares", clicks on nothing, and then does another search for "Britney Spears", and clicks on something, we can have a fair guess about what they were searching for, and can suggest that in future.
Disclaimer: This may have just been part of their algorithm
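The idea can be sketched in a few lines of Python. This is only an illustration of the log-mining principle as described in the seminar, with an invented session format (a list of (query, clicked) pairs per user session) and invented function names; it says nothing about how Google actually implements it.

```python
from collections import Counter, defaultdict

def learn_corrections(sessions):
    """Count 'query with no click' -> 'next query with a click' rewrites."""
    rewrites = defaultdict(Counter)
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if not clicked1 and clicked2 and q1 != q2:
                rewrites[q1][q2] += 1
    return rewrites

def suggest(rewrites, query):
    """Most frequent follow-up query seen for this query, or None."""
    if query not in rewrites:
        return None
    return rewrites[query].most_common(1)[0][0]

# Hypothetical log data.
sessions = [
    [("brittany speares", False), ("britney spears", True)],
    [("brittany speares", False), ("britney spears", True)],
    [("brittany speares", False), ("brittany spaniel", True)],
]
rewrites = learn_corrections(sessions)
print(suggest(rewrites, "brittany speares"))
```

The quality of the suggestions depends entirely on how much log data is available, which matches the presenter's point.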

Python has a module called difflib. It provides a function called get_close_matches. From the Python documentation:
get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
Could this library help you?
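Wrapped up for a "did you mean" prompt, it might look like the sketch below. The closest_match helper and the vocabulary list are only examples; in practice the vocabulary would come from your own indexed terms.

```python
from difflib import get_close_matches

def closest_match(query, vocabulary, cutoff=0.6):
    """Return the single closest vocabulary entry, or None if nothing is close enough."""
    matches = get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

vocabulary = ["ape", "apple", "peach", "puppy"]  # example terms only
suggestion = closest_match("appel", vocabulary)
if suggestion is not None:
    print("Did you mean:", suggestion)
```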

You can use http://developer.yahoo.com/search/web/V1/spellingSuggestion.html which would give a similar functionality.

You can check out the source code for Xapian which provides this functionality, as do a lot of other search libraries. http://xapian.org/

I am not sure if it serves your purpose, but a string edit distance algorithm with a dictionary might suffice for a small application.
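A minimal Python sketch of that idea, using the classic dynamic-programming Levenshtein distance. The dictionary contents, the closest helper and the max_distance threshold are assumptions for the example.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def closest(word, dictionary, max_distance=2):
    """Dictionary entry with the smallest edit distance, or None if too far away."""
    best = min(dictionary, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_distance else None

print(closest("britny", ["britney", "brittany", "spears"]))
```

This scans the whole dictionary for every query, which is fine for a small application but does not scale to large vocabularies.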

I'd take a look at this article on Google bombing. It shows that Google just suggests answers based on previously entered queries.

AFAIK the "did you mean?" feature doesn't check the spelling. It only gives you another query based on the content parsed by Google.

A great chapter on this topic can be found in the openly available Introduction to Information Retrieval.

You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
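import ngram
# Index a few candidate strings by their character n-grams
G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])
print("String", "\t", "Similarity")
# search() yields (string, similarity) pairs scoring above the threshold
for item, score in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(item, "\t", score)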
You get:
String Similarity
"iis7 configure ftp 7.5" 0.76
"mac configure ftp" 0.24
"ubunto configre 8.5" 0.19

Take a look at Levenshtein automata.
