ElasticSearch: What analyzer to use for searching code?

I'm writing a search tool for searching code, but I'm having a hard time finding the right analyzer to use. I've tried a whitespace analyzer, but you end up with issues where you might have dbo.My_Procedure and searching "my_procedure" should work, as well as searching ".My_Procedure". My idea is to split on special characters but also store them as their own tokens. But then if you type my_procedure as a search, it will just look for my, _ and procedure anywhere in the file unless you wrap it in quotes (even though to the user it looks like just one word). What approach have people taken for analyzing code?

If your code is in Java, then according to Java naming conventions your methods and classes should be camel-case, so you should not run into names like my_search but rather mySearch.
If that is the case, you can use the (default) standard analyzer, which splits on word boundaries.
That said, if there's no way around it and you have to handle names like my_search at the tokenizing stage, you can implement your own custom analyzer.
This answer shows an example of setting up a custom analyzer.
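For example, a custom analyzer along these lines keeps the original token while also splitting on underscores, dots and case changes. This is only a minimal sketch: the host, the index name and the exact filter options are assumptions, not anything from the question.

import requests

# Custom analyzer sketch: the word_delimiter filter splits dbo.My_Procedure into
# sub-tokens (dbo, my, procedure) while preserve_original keeps the full token too.
# "code-index" and localhost:9200 are placeholders.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "code_words": {
                    "type": "word_delimiter",
                    "preserve_original": True,
                    "split_on_case_change": True,
                }
            },
            "analyzer": {
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["code_words", "lowercase"],
                }
            },
        }
    }
}

requests.put("http://localhost:9200/code-index", json=settings)

# Inspect the tokens the analyzer produces for a sample identifier.
resp = requests.post(
    "http://localhost:9200/code-index/_analyze",
    json={"analyzer": "code_analyzer", "text": "dbo.My_Procedure"},
)
print([t["token"] for t in resp.json()["tokens"]])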

Related

Can't figure out how to search LOINC using FHIR for a specific test by name?

Can anyone provide some insight on the required syntax to use to search LOINC using FHIR for a specific string in the lab's descriptive text portion of an Observation resource?
Is this even possible?
The documentation is all over the place and I can't find an example for this generic kind of search.
I found similar examples here: https://www.hl7.org/fhir/2015Sep/valueset-operations.html
Such as: GET "[base]/ValueSet/23/$validate-code?system=http://loinc.org&code=1963-8&display=test"
But none of them are providing a general enough case to do a global search of the LOINC system for a specific string in an Observation resource.
None of my attempts to use the FHIR UI here, http://polaris.i3l.gatech.edu:8080/gt-fhir-webapp/search?serverId=gatechreadonly&resource=Observation, have been successful. I keep getting a 500 Internal Server Error because I don't know the correct syntax to use for the value part of the search, and I can't find anything in all the copious documentation online that explains this very simple concept.
Can anyone provide some insight?
Totally frustrated at this point.
Observation?code=12345-6
or
Observation?code=http://loinc.org|12345-6
where 12345-6 is whatever LOINC code you want to look for (e.g. 39802-4)
The second ensures you'll only match on LOINC codes as opposed to codes from other systems, though given the relatively unique format of LOINC codes, you're mostly safe without including that.
If you want to search for a set of codes, then you can separate the codes or the tuples with commas, e.g.:
Observation?code=12345-6,12345-7
or
Observation?code=http://loinc.org|12345-6,http://loinc.org|12345-7
If you expect to search by a really long list of codes frequently, you can define a value set that includes all the desired codes and then filter by value set:
Observation?code:in=http://somwhere.org/whatever/ValueSet/123
Note: for readability, I haven't escaped the URL contents, but you'll need to escape the URL values appropriately.
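As a small illustration of that escaping, here is a sketch in Python; the FHIR base URL is a placeholder, not a real server.

from urllib.parse import urlencode

base = "https://fhir.example.org/baseDstu2"  # placeholder FHIR base URL
params = {"code": "http://loinc.org|39802-4"}

# urlencode percent-escapes the ':', '/' and '|' inside the token value.
url = f"{base}/Observation?{urlencode(params)}"
print(url)
# https://fhir.example.org/baseDstu2/Observation?code=http%3A%2F%2Floinc.org%7C39802-4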

Ruby Text/Sentiment Analysis

I have two strings -
"I like running around the track.
I like swimming in the pool, but only in the morning.
I need to pull out what people "like" from the above two comments (running around the track and swimming in the pool.
Does anyone have a recommendation for a text analytics gem or other method of pulling in that kind of information? I don't necessarily need word counts or n-grams, I just want to know what words are seen in relation to the word "like".
For a quick-and-dirty fix, you could use a regex to search for all the forms of "like" and pull out all the text between there and the next punctuation mark or newline character.
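Sketching that idea in Python rather than Ruby (Ruby's Regexp accepts the same pattern), using the two example comments from the question:

import re

comments = [
    "I like running around the track.",
    "I like swimming in the pool, but only in the morning.",
]

# Capture everything between a form of "like" and the next punctuation mark or newline.
pattern = re.compile(r"\blik(?:e|es|ed|ing)\b\s+([^.,;!?\n]+)", re.IGNORECASE)

for comment in comments:
    for phrase in pattern.findall(comment):
        print(phrase.strip())
# running around the track
# swimming in the pool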
You could use a dependency parser such as the Stanford Parser
to parse your text and find the key words in your sentiment dictionary, and probably put some constraints on the types of dependencies for disambiguation. For example, the dependency needs to be of type "dobj" (direct object). Then follow the dependency structure to the end of the phrase or sentence, depending on your needs.
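A rough sketch of that dependency-based approach, using spaCy here as a stand-in for the Stanford Parser (the dependency labels are comparable); the model name assumes you have an English model installed.

import spacy

nlp = spacy.load("en_core_web_sm")  # any English model will do

text = ("I like running around the track. "
        "I like swimming in the pool, but only in the morning.")

for token in nlp(text):
    # Keep words that attach to "like" as its object or clausal complement,
    # then expand to the token's whole subtree to get the full phrase.
    if token.head.lemma_ == "like" and token.dep_ in ("dobj", "xcomp", "ccomp"):
        print(" ".join(t.text for t in token.subtree))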

How to configure standard tokenizer in elasticsearch

I have a multi-language data set and a standard analyzer that takes care of the tokenizing for this data set very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way that I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently combinations of analyzers do not work in a chain like token filters do; they work independently, which is not useful for me.
Also, I looked into the char mapping filter in order to process the data before tokenizing it, but it does not work like the word delimiter token filter, where we can specify a "type_table" to convert a special character into an ALPHANUM. It just maps one word to another word, so as a result I won't be able to search on the special characters.
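(For reference, the word_delimiter type_table mentioned above looks roughly like this; which characters to reclassify is up to you.)

word_delimiter_filter = {
    "type": "word_delimiter",
    "type_table": ["# => ALPHANUM", "@ => ALPHANUM", ": => ALPHANUM"],
}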
Also, I have looked into the pattern analyzers, which would work for the special characters, but they are not recommended for a multi-language data set.
Can anybody point me in the right direction in order to solve this problem?
Thanks in advance!

How to define aspell word delimiters?

Aspell treats words with underscores or dashes as two words, e.g. cloud-based is spell-checked as "cloud" and "based". Is there any way to specify the word delimiters so as to exclude dash and underscore?
If I understand the question correctly, Aspell cannot do exactly what you want (to my knowledge). This has to do with conditional compound word treatment, which is on Aspell's TODO list.
On the same list it is mentioned that Hunspell does a better job with compound words, so it might be a viable alternative if you're not bound to Aspell.
OpenOffice uses Hunspell for spellchecking, so it is easy to find out whether it fits your requirements. It does, at least, work for the "cloud-based" example, and it does NOT treat all hyphenated words as unconditional compounds, i.e. "based-cloud" would still be considered a spelling error.
Aspell is unable to do what you want at this point. The interface it uses for handling words with symbols in them is not sophisticated enough to handle such a case at this time. More information on this is listed here.
Sorry that this cannot be solved at this point, unless you want to implement your own interface. I would recommend using Hunspell, as Mikhail suggested.

to_tsquery() validation

I'm currently developing a website that allows searching a PostgreSQL
database. The search works with to_tsquery(), and I'm trying to find a way to validate the input before it is sent as a query.
Other than that, I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find articles that have all 3 words,
regardless of where they appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the to_tsquery parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably involves table access) each time it validates, you could use the input in a smaller query. In pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
    try:
        execute("SELECT to_tsquery($1)", input)
        return True
    except DatabaseError:
        return False
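A runnable version of that sketch, assuming psycopg2 and an existing connection (the names are illustrative):

import psycopg2

def is_valid_query(conn, text):
    # Ask the server to parse the input with to_tsquery(); any parse error
    # surfaces as a DatabaseError, which we treat as "invalid input".
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT to_tsquery(%s)", (text,))
        return True
    except psycopg2.DatabaseError:
        conn.rollback()  # clear the aborted transaction state
        return False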
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
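A sketch of that two-step approach in Python; the table name, column name and connection string are made up for illustration.

import re
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

# Step 1: loose full-text match on the server, which can use the index.
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, body FROM documents "
        "WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)",
        ("HELLO | (I & LIKE & CATS)",),
    )
    candidates = cur.fetchall()

# Step 2: filter the candidates client-side for the exact phrase.
phrase = re.compile(r"HELLO|I\s+LIKE\s+CATS", re.IGNORECASE)
hits = [(doc_id, body) for doc_id, body in candidates if phrase.search(body)]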
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in ts_vectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.
