Ignoring apostrophes in sphinx indexes - ruby

In my sphinx config file, I have the following:
ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"
(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)
The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling Sphinx to exclude ' (single quote/apostrophe) from the index (ab'cd -> abcd). However, in practice, this is not happening.

I believe adding it to ignore_chars has the opposite of the desired effect. This tells Sphinx not to split on that character; instead it collapses the word around the ignored characters. So kyle's will become kyles instead of kyle and s.
The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (might need 's in there also, can't remember). Sphinx seems to split kyle's up into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stopwords seems to have the desired effect.
It seems like normal stemming should take care of this, however, so maybe we're both doing something wrong...
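For reference, a minimal sketch of how both pieces can sit together in the Thinking Sphinx YAML config; the stopwords file path and its contents are my assumptions, not taken from the posts above:

development:
  ignore_chars: "U+0027"             # collapse apostrophes at index time: kyle's -> kyles
  stopwords: "config/stopwords.txt"  # plain-text file, one stopword per line (e.g. s and 's)

After changing either setting, the index has to be rebuilt for the new tokenization to take effect.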

Related

How do I escape the word "And" in Elasticsearch if I want to search by the literal "And"?

I'm trying to search over an index that includes constellation code names, and the code name for the Andromeda constellation is And.
Unfortunately, if I search using And, all results are returned. This is the only one that doesn't work, across dozens of constellation code names, and I assume it's because it's interpreted as the logical operator AND.
(constellation:(And)) returns my entire result set, regardless of the value of constellation.
Is there a way to fix this without doing tricks like indexing with an underscore in front?
Thanks!
I went for a bit of a hack, indexing the constellation as __Foo__ and then changing my search query accordingly by adding the __ prefix and suffix to the selected constellation.
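A sketch of that workaround (the double-underscore wrapper is arbitrary; anything the analyzer keeps intact and that cannot collide with a real code name would do):

# document indexed with the wrapped code name
{ "constellation": "__And__" }

# query built by wrapping the user's selection the same way
constellation:(__And__)

Since __And__ is no longer the bare word And, the query parser has no reason to treat it as an operator.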

Matching two words as a single word

Consider that I have a document which has a field with the following content: 5W30 QUARTZ INEO MC 3 5L
A user wants to be able to search for MC3 (no space) and get the document; however, searching for MC 3 (with a space) should also work. Moreover, there can be documents that have the content without spaces, and those should be found when querying with a space.
I tried indexing without spaces (e.g. 5W30QUARTZINEOMC35L), but that does not really work: with a wildcard search I would match too much, e.g. MC35 would also match, and I only want to match two exact words concatenated together (as well as an exact single word).
So far I'm thinking of additionally indexing all combinations of two words, e.g. 5W30QUARTZ, QUARTZINEO, INEOMC, MC3, 35L. However, does Elasticsearch have a native solution for this?
I'm pretty sure what you want can be done with the shingle token filter. Depending on your mapping, I would imagine you'd need to add a filter looking something like this to your content field to get your tokens indexed in pairs:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":2,
"min_shingle_size":2,
"output_unigrams":"true"
}
Note that this is also already the default configuration; I just added it for clarity.
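One detail worth checking: by default the shingle filter joins adjacent tokens with a space, so MC 3 would be indexed as the pair mc 3 rather than mc3. If the concatenated form needs to match, token_separator can be set to an empty string. A sketch of a complete analysis block under that assumption (the analyzer_shingle name is made up here):

"analysis": {
    "filter": {
        "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 2,
            "min_shingle_size": 2,
            "output_unigrams": "true",
            "token_separator": ""
        }
    },
    "analyzer": {
        "analyzer_shingle": {
            "tokenizer": "standard",
            "filter": ["lowercase", "filter_shingle"]
        }
    }
}

Applied at both index and search time, MC 3 and MC3 should then normalize to the same mc3 token.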

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches strings with the word "deprecated" without the # sign.
I tried to use escaping for # according to the Lucene documentation, i.e. message:"\\#deprecated" - with no change in results.
How can I query for an exact match of the #deprecated text only?
Why is this happening?
Your problem isn't an issue with query syntax, which is what escaping is for; it's with analysis. Your analyzer removes punctuation, because it's parsing the text as full text. It removes #, in much the same way that it will remove periods and commas.
So, after analysis (assuming standard analysis) of something like "Class is #deprecated", the token stream generated will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed forms of "#deprecated" and "deprecated" are identical, so it is impossible for a query to differentiate between them as currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. If you use WhitespaceAnalyzer, you will have to contend with other punctuation as well: a search for "sentence" would not find "match at the end of this sentence.", because of the trailing period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries.
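For illustration, a mapping sketch along those lines (the message field name comes from the question; the rest is my assumption):

"mappings": {
    "properties": {
        "message": {
            "type": "text",
            "analyzer": "whitespace"
        }
    }
}

With the whitespace analyzer, #deprecated survives as a single token, so message:"\\#deprecated" can match it, at the cost of the punctuation trade-offs described above.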

How to handle two dashes in ReST

I'm using Sphinx to document a command line utility written in Python. I want to be able to document a command line option, such as --region like this:
**--region** <region_name>
in ReST and then use Sphinx to generate my HTML and man pages for me.
This works great when generating man pages, but in the generated HTML the -- gets turned into -, which is incorrect. I have found that if I change my source ReST document to look like this:
**---region** <region_name>
The HTML generates correctly but now my man pages have --- instead of --. Also incorrect.
I've tried escaping the dashes with a backslash character (e.g. \-\-) but that had no effect.
Any help would be much appreciated.
This is a configuration option in Sphinx that is on by default: the html_use_smartypants option (http://sphinx-doc.org/config.html?highlight=dash#confval-html_use_smartypants).
If you turn off the option, then you will have to use the Unicode character '–' if you want an en-dash.
With
**-\\-region** <region_name>
it should work.
In Sphinx 1.6 html_use_smartypants has been deprecated, and it is no longer necessary to set html_use_smartypants = False in your conf.py or as an argument to sphinx-build. Instead you should use smart_quotes = False.
If you want to keep the transformations formerly provided by html_use_smartypants, it is recommended to use smart_quotes instead, e.g., smart_quotes = True.
Note that at the time of this writing Read the Docs pins sphinx==1.5.3, which does not support the smart_quotes option. Until then, you'll need to continue using html_use_smartypants.
EDIT: It appears that Sphinx now uses smartquotes instead of docutils' smart_quotes. h/t @bad_coder.
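On a current Sphinx, the setting is therefore a one-liner in conf.py:

# conf.py (Sphinx 1.6.6+; older versions used html_use_smartypants = False)
smartquotes = False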
To add two dashes, add the following:
.. include:: <isotech.txt>
|minus|\ |minus|\ region
Note the backward-slash and the space. This avoids having a space between the minus signs and the name of the parameter.
You only need to include isotech.txt once per page.
With this solution, you can keep the smartypants extension and write two dashes anywhere in the text you need, not just in option lists or literals.
As commented by @mzjn, the best way to address the original submitter's need is to use Option Lists.
The format is simple: a sequence of lines that start with -, --, + or /, followed by the actual option, (at least) two spaces and then the option's description:
-l long listing
-r reversed sorting
-t sort by time
--all do not ignore entries starting with .
The number of spaces between option and description may vary by line; it just needs to be at least two, which allows for a clear presentation (as above) in the source as well as in the generated document.
Option Lists have syntax for an option argument as well (just put an additional word or several words enclosed in <> before the two spaces); see the linked page for details.
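For the original --region case, the entry would look something like this (the description text is invented):

--region <region_name>  The region the command should operate in.

The two dashes should come through verbatim in both the HTML and the man page output, since smart-quote transforms skip option lists.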
The other answers on this page targeted the original submitter's question; this one addresses their actual need.

Sphinx char_set resource?

I need to have Sphinx index some characters that are apparently excluded from the internal charset_table, notably "," and "&".
As far as I understand there is no way (?!) to simply add these as indexable chars; I need to:
1. Make a charset_table entry in my index
2. Not only include the "," and "&" as indexable characters but also manually re-add the charset Sphinx uses internally
This is a little frustrating, as it seems one should be able to add chars back in without manually recreating the charset used internally. However, if that is the situation, that is the situation; yet in the documentation
http://sphinxsearch.com/docs/current/conf-charset-table.html
I don't see or understand how I would manually specify a charset_table and get the two or three extra characters I want indexed.
You do need to duplicate the 'stock' charset_table before you can add to it.
At less than 100 characters, even typing it out rather than copy/pasting would still be less effort than typing this whole question!
(even less if you use the english shorthand)
Just copy what's there and add your new chars to the end. If you don't know the unicode mapping, something like asciitable.com is useful, at least for the chars in ASCII.
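A sketch in sphinx.conf terms (my_index is a placeholder, and only the English portion of the stock table is shown; U+002C is the comma, U+0026 the ampersand):

index my_index
{
    # stock English table plus comma (U+002C) and ampersand (U+0026)
    charset_table = 0..9, A..Z->a..z, _, a..z, U+002C, U+0026
}

After editing charset_table, re-run the indexer so the new characters actually make it into the index.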
