Cannot get wildcard search working with Sphinx Real Time indexes

There are several questions about this already here on SO, as well as on Google. However, repeated attempts and plenty of Googling have not netted me an answer so far. This doesn't seem difficult, but clearly I'm missing something.
I've added combinations of the following:
enable_star = 1
dict = keywords
min_infix_len = 3
min_prefix_len = 3
Note: I did not do prefix and infix at the same time.
I have blown away and re-created my indexes, re-started searchd and still no luck.
If I insert a value such as "wildcardtest", the following query returns a hit:
select * from rtindex where match('wildcardtest');
but anything else such as
select * from rtindex where match('wildcardt*');
returns 0 results.
I was using 2.1.4 but upgraded to 2.1.9 with no change.

I upgraded to 2.2.7 and tweaked the config a bit and this is working.
The essential config option needed is dict = keywords.
min_prefix_len/min_infix_len also work, but they do seem to change the behavior compared to dict=keywords on its own. Searching for the same pattern with the various config options yields slightly different results.
I did have to re-build my disk-based indexes and then attach (after truncating) to the RT indexes to get the historical content searchable the way I wanted.

I haven't used the RT indexes, but on a regular index I would pass it in like so: '"wildcard*"'. Wrapping the term like this gave me the results I was looking for. Your conf file should also have enable_star = 1.

indexing not reliable on WindowCollection

I have upgraded to a recent Watir (7.0.0.beta5), and when I execute the following line
browser.windows.last.use
I am getting this error
"indexing not reliable on WindowCollection"
This was working fine in my previous Watir version (6.19.1). What is the issue?
Another question: it looks like there are plenty of changes to the capabilities. Do I have to set page_load_timeout and read_timeout separately now? I have also read about open_timeout, but I don't know what it is for. Can someone help me understand it?
Drivers are not guaranteed to always return windows in the same order, so it has been decided to deprecate accessing windows based on their index (i.e. their position in the Array of windows). You have a couple of options.
Working with 2 windows
If you are only working with 2 windows (probably most cases), the recommendation is to use 2 new methods:
# Switch to the second window that was opened
browser.switch_window
# Return to the first window
browser.original_window
Working with 3+ windows
The best approach would be to locate the window you want based on known properties:
# By url
browser.window(url: /closeable\.html/).use
# By title
browser.window(title: 'closeable window').use
# By an element in the window (new)
browser.window(element: browser.a(id: 'close')).use
(Not Recommended) By index
This is not recommended, but if you insist on using an index, you can cheat by forcing the WindowCollection to an Array:
browser.windows.each.to_a.last

CouchDB, all_docs and filter design documents with endkey

First, this question (filtering design documents from _all_docs) already seemed to be solved as described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and it worked at first. However, in a different setup (actually just a different deploy), the query suddenly returns only an empty collection []. It seems the ordering changed: without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot manage to filter out the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but I don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter, on the other hand, will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore, and is there a more convenient way that still works?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc id starts with lowercase letters (default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of "ABC". You will see it show up before the design doc, and your trick to filter design docs would work in this case.
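The ordering described above can be checked without a CouchDB server. This is only a minimal sketch: it assumes Python's code-point string comparison stands in for the ASCII collation of _all_docs, and all ids other than the "_design/" prefix are made up for illustration.

```python
# Hypothetical document ids; only the "_design/" prefix is CouchDB convention.
doc_ids = ["abc", "_design/app", "ABC", "zebra", "Xyz"]

# Python compares str by code point, which equals ASCII order for these ids.
ordered = sorted(doc_ids)
print(ordered)

# Emulate endkey="_": keep only the ids that sort at or before "_".
up_to_underscore = [doc_id for doc_id in ordered if doc_id <= "_"]
print(up_to_underscore)
```

In this toy data only the uppercase ids fall at or before "_", while the lowercase ids sort after the design document, which would be consistent with the empty result seen when every id starts with a lowercase letter.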
However, I recommend you stop using the _all_docs view altogether. Instead use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function(doc){
emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
Also, please look at Unicode Collation order, this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

Leftover / Unused Fields from Solr Default 'example'

I started configuring Solr based off of the original example and so there were many random fields that they included for the example.
I might use one or two of them, but for the most part they are all just sitting there unused (nothing in my data config mentions them or stores data to them).
Does this have any impact on performance (even a small amount), and should I just go ahead and delete them, or does it not matter?
Thanks guys
I recommend deleting the fields and associated unused types. Possibly no performance impact, but there are some dynamic fields and copyFields which may cause confusion down the line. Or just build a new config directly instead (you only need two files to start: schema.xml and solrconfig.xml).
If you are doing it by trimming unused fields, make sure to keep or change whatever your uniqueKey points at, whatever your 'df' default field is, and _version_.
The last one is required for real-time get and updateLog, which are enabled in the example's solrconfig.xml as well. It is easier to keep _version_ than to try removing all those things.
(Update Jan 2017: There is now a presentation video, specifically working through examples and how to clean them)

Lucene.NET MatchAllDocsQuery doesn't honor document boost?

I have a Lucene index of documents, all nearly identical (test 1, test 2, etc.) except that some have a higher boost than others. When using a default query (MatchAllDocsQuery or .Parse("*:*") on the query parser), the documents come back in the order they went in every time. By adding a search term ("test" in this case), the document boost becomes apparent and the documents are sorted according to the boost. I can change the boost levels around and the new order is reflected in the results. All my code is pretty standard fare; I'm using a default Sort() in both cases.
I found that this same bug was reported and fixed in Lucene back in 2005-2006, and I checked my MatchAllDocsQuery.cs file (Lucene .NET 2.9.2) and it seems to have this change present, but the behavior is as described in the ticket above.
Any ideas what I might be doing wrong? Perhaps someone running the Java version has experienced this (or not)? Thanks.
Uh, don't I feel silly now. This is as-designed behavior. I guess. According to Lucene in Action, MatchAllDocsQuery uses a constant for the boost.

LDAP Syntax/Semantics: Filter vs. Base DN?

This is probably pretty stupid, but I'm still green to LDAP. So I hope someone can lend me a hand.
I am using Apache Directory Studio to do my searches and I am confused about when I should be using a filter or when I should be breaking up my filter into two, using one part as the filter and the other as my search base.
Here's an example where I'm trying to filter out a group.
Filter: CN=JohnTestGroup,OU=TECH,DC=lab,DC=ing
Base: DC=lab,DC=ing
This yielded zero results. I realized that perhaps I am being redundant as part of the base is in the filter, so I got rid of that part in the filter.
Filter: CN=JohnTestGroup,OU=TECH
Base: DC=lab,DC=ing
This still did not yield anything. So I tried this:
Filter: CN=JohnTestGroup
Base: OU=TECH,DC=lab,DC=ing
I moved the OU parameter into the Base. This worked, but I don't understand why the first or second attempts didn't. Someone care to drop some knowledge on me?
This is probably a matter of syntax/semantics, so if anyone could point me to a resource, I'd be more than willing to read more about it.
Read about scopes here: http://www.idevelopment.info/data/LDAP/LDAP_Resources/SEARCH_Setting_the_SCOPE_Parameter.shtml
If you set your search scope to SUBTREE, variants 2 and 3 (and possibly variant 1) should start working, but searching a subtree is slower.
I think you are misunderstanding how the filter works. It is meant to be key=value pairings.
So (objectClass=iNetOrgPerson) as an example.
If you wish a filter to find a DN, then you pick an identifying characteristic like CN, and filter on (CN=JohnTestGroup) or perhaps (mail=John#mail.net).
The base tells the LDAP server where to start looking. As seriyPS notes in his/her answer, the SCOPE is the next question: how deep should the server search, since deeper searches add overhead and can cause performance issues. Subtree is the simplest conceptually: just keep looking from here down, until you run out of tree to look through.
That is why your last one works.
Now, if you want to find a specific object and you know its DN, you do an ENTRY scope query for the base of the specific DN.
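How base, scope and filter interact can be sketched with a toy in-memory directory. This is not a real LDAP client: the `DIRECTORY` dict and the `depth_below` and `search` helpers are hypothetical, the filter is reduced to a single attribute=value test, and DNs are split naively on commas (escaped commas are not handled).

```python
# Toy directory using the DNs from the question; attributes are illustrative.
DIRECTORY = {
    "DC=lab,DC=ing": {"dc": "lab"},
    "OU=TECH,DC=lab,DC=ing": {"ou": "TECH"},
    "CN=JohnTestGroup,OU=TECH,DC=lab,DC=ing": {"cn": "JohnTestGroup"},
}


def depth_below(dn, base):
    """Return how many RDNs dn sits below base, or None if it is not under base."""
    dn_parts = [part.strip().lower() for part in dn.split(",")]
    base_parts = [part.strip().lower() for part in base.split(",")]
    if len(dn_parts) < len(base_parts):
        return None
    if dn_parts[len(dn_parts) - len(base_parts):] != base_parts:
        return None
    return len(dn_parts) - len(base_parts)


def search(base, scope, attr, value):
    """Return DNs under base, within scope, whose attr equals value."""
    in_scope = {
        "base": lambda depth: depth == 0,  # only the base entry itself
        "one": lambda depth: depth == 1,   # direct children of the base
        "sub": lambda depth: depth >= 0,   # the base and everything below it
    }[scope]
    hits = []
    for dn, attrs in DIRECTORY.items():
        depth = depth_below(dn, base)
        if depth is None or not in_scope(depth):
            continue
        if attrs.get(attr, "").lower() == value.lower():
            hits.append(dn)
    return hits


group_dn = "CN=JohnTestGroup,OU=TECH,DC=lab,DC=ing"
print(search("DC=lab,DC=ing", "one", "cn", "JohnTestGroup"))
print(search("OU=TECH,DC=lab,DC=ing", "one", "cn", "JohnTestGroup"))
print(search("DC=lab,DC=ing", "sub", "cn", "JohnTestGroup"))
print(search(group_dn, "base", "cn", "JohnTestGroup"))
```

In this model, a one-level search from DC=lab,DC=ing misses the group because it sits two levels down, while moving the base to OU=TECH, widening the scope to subtree, or doing a base-scope lookup on the full DN all find it.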
