Elastic Greeklish to Greek conversion - elasticsearch

I am new to a elastic and I am trying to find a way to convert greeklish character to greek when the search executes.
e.g word "papoutsia" to be searched as "παπουτσια" (shoes)
Due to my search I found the following plugins:
elasticsearch-analysis-greeklish
elasticsearch-skroutz-greekstemmer
Applied the filters to my index as the example but my queries still hit nothing.
Do I have to apply the filter some way in every query or do a special one?
Sorry I this question has a very large/broad answer to be given.
I trying to figure how the whole filtering thing works for a couple of days to understand if I am even in the correct direction or have to find an other way for this solution.

Unfortunately, the intention of the greeklish plugin / char filter is the inverse of what you want to achieve:
Using this filter, you can retrieve greek text from a document, using a query that is written in latin characters ("greeklish").
So, for your example, you can add a document with the text παπούτσια and retrieve it using the terms papoutsia, papoutsi, etc.
We have prepared a detailed text pipeline example in the repo's wiki for future reference.

Related

Kibana (elasticsearch visualization) - how add plot based on sub-string of field?

I have a field in my logs called json_path containing data like /nfs/abc/123/subdir/blah.json and I want to create count plot on part of the string abc here, so the third chunk using the token /. I have tried all sorts of online answers, but they're all partial answers (nothing I can easily understand how to use or integrate). I've tried running POST/GET queries in the Console, which all failed due to syntax errors I couldn't manage to debug (they were complaining about newline control chars, when there were none that I could obviously see or see in a text editor explicitly showing control-characters). I also tried Management -> Index Patterns -> Scripted Field but after adding my code there, basically the whole Kibana crashed (stopped working temporarily) until I removed that Scripted Field.
All this elasticsearch and kibana stuff is annoyingly difficult, all the docs expect you to be an expert in their tool, rather than just an engineer needing to visualize some data.
I don't really want to add a new data field in my log-generation code, because then all my old logs will be unsupported (which have the relevant data, it just needs that bit of string processing before data viz). I know I could probably back-annotate the old logs, but the whole Kibana/elasticsearch experience is just frustrating and I don't use it enough to justify learning such detailed procedures (I actually learned a bunch of this stuff a year ago, and then promptly forgot it due to lack of use).
You cannot plot on a sub string of a field unless you extract that sub string into a new field. I can understand the frustration in learning a new product but to be able to achieve what you want you need to have that sub string value in a new field. Scripted fields are generally used to modify a field. To be able to extract sub string from a field I’d recommend using Ingest Node processor like grok processor. This will add a new field which you can use to plot in Kibana visualizations..

autocomplete and search in Elasticsearch

Is there any possibility to make a search on two non-complete words in the same field using Elasticsearch in Rails? I mean the situation when I could successfully search for example "victorian buildings" phrase by inserting into search input for example "vict bui" phrase (only beginnings of words, also with fuzziness).
Partial match (word_start, text_start etc. available in Searchkick) doesn't work in this project. I've also tried using wildcard queries, but it also failed. Maybe writing some custom mappings/settings would be a good idea?
Can I ask you for any suggestions on what to search/read to do this task?
Try this example
"%#{params[:place]}%"
Since % is a wildcard, doing a like on '%%' matches everything,
and you get all the records in the result.

CouchDB, all_docs and filter design documents with endkey

First, this question - filter design documents from all_docs - already seemed to be solved like described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and worked in first place. However, suddenly in a different setup (actually just different deploy), the query only returns an empty collection []. It seems like the ordering changed, without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot achieve to filter the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter on the other hand will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore and is there a more convenient, still working way?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc id starts with lowercase letters (default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of: "ABC" You will see it show up before the design doc and your trick to filter design docs would work in this case.
However, I recommend you stop using the `_all_docs view altogether. Instead use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function(doc){
emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
Also, please look at Unicode Collation order, this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

ElasticSearch search for partial alphanumeric values

I have a string field with values like PA2456U or PA23U-RB and I would like to do a partial match, so that I can search for PA24 and I would get the first result, or search PA23U-RB and find the second result (so that would be a full match.
I tried using ngram, but it ignores the numeric values, so, if I enter pa111 it returns anything that starts with pa
See this gist for an example.
This may be a separate question, or related, but searching for 12345001 should also match 12345-001
Thanks
Update
The final analyzer I used is here: https://gist.github.com/3803180
Making ngrams looks like a good choice based on your requirements, but I think edge_ngrams should be enough. This way your index would grow a little bit slower since you'd be indexing less terms. Anyway the problem is that you don't need to apply the same analyzer to the query too, otherwise querying for pa111 would mean querying for all the ngrams that you can make out of it, which would lead you to a lot more matches that you'd expect.
You just need to change your search_analyzer to an analyzer which doesn't make ngrams. You can use the same you already have and remove the ngram token filter (only for the search_analyzer, the index_analyzer is fine).
Regarding the dash question, have a look at the Word delimiter token filter. You need to configure it to make it work as you expect. I guess the generate_number_parts=false, generate_word_parts=false and split_on_numerics=false options should make it work as you want. That way the dash won't be indexed. You need to apply the token filter at both index time and query time.

to_tsquery() validation

I'm currently developing a website that allows a search on a PostgreSQL
database, the search works with to_tsquery() and I'm trying to find a way to validate the input before it's being sent as a query.
Other than that I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS that will find you articles that have all 3 words,
regardless where they might appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the ts_query parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably will involve table access) each time it validates, you could use the input in a smaller query, just in pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
try:
execute("SELECT ts_query($1)", input);
return True
except DatabaseError:
return False
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in ts_vectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.

Resources