Solr equivalent of SQL BETWEEN for numeric fields - performance

Hoping for some pointers to speed up some (very) slow Solr queries in version 3.4.0.
I have an index of about 6 million documents. Each document is quite small and contains two solr.TrieDoubleField fields: "start" and "end".
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
----
<field name="start" type="double" indexed="true" stored="false" />
<field name="end" type="double" indexed="true" stored="false" />
When querying I need to perform the SQL equivalent of:
WHERE #input BETWEEN Start AND End
To do this I'm writing my query as:
start:[* TO #input] AND end:[#input TO *]
The query succeeds, returning the correct document, but with a QTime of ~4,500; most other queries are well below 100.
What can be modified to improve performance?

I believe that you should try the Function Range (frange) query parser in Solr. Please see the introductory blog post, Ranges Over Functions in Solr 1.4, for more details on how to use the Function Range query parser.
(Note that this should apply to all versions of Solr later than 1.4 as well.)
So I think something like the following should work...
`?q=*:*&fq={!frange l=0 u=<#input>}start
&fq={!frange l=<#input> u=<number larger than greatest value>}end`
I am not sure if you can use wildcards for the upper and lower bounds, probably not...
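For illustration, the two filter queries could be assembled programmatically. This is only a minimal Python sketch, not Solr client code: the helper name `between_filter_params` and its arguments are hypothetical, and it assumes each filter is passed as its own `fq` parameter with explicit lower and upper bounds.

```python
from urllib.parse import urlencode


def between_filter_params(value, lower_floor, upper_ceiling):
    """Build request parameters for 'start <= value <= end' using frange.

    lower_floor and upper_ceiling are assumed sentinel bounds: numbers known
    to lie below every "start" and above every "end" in the index.
    """
    return [
        ("q", "*:*"),
        # start must fall in [lower_floor, value]
        ("fq", f"{{!frange l={lower_floor} u={value}}}start"),
        # end must fall in [value, upper_ceiling]
        ("fq", f"{{!frange l={value} u={upper_ceiling}}}end"),
    ]


params = between_filter_params(42.5, 0, 1000000)
# urlencode percent-escapes the braces, spaces and other reserved characters
query_string = urlencode(params)
```

The resulting `query_string` can be appended to the select URL after a `?`.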


XSD Verify Element has at least one of the attributes specified

Using XSD schema validation 1.0 I want to verify an element has at least one attribute specified.
For example, a simple element like this:
<foo a="1" b="2" c="3" />
I want to verify that at least attribute b or c is specified. But note that both can also be specified--they're not mutually exclusive.
I tried using a key along the lines of:
<xs:key name="AttributeSpecified">
<xs:selector xpath="." />
<xs:field xpath="@b|@c" />
</xs:key>
but it fails when both attributes are specified (because multiple results are returned).
Can it be done?
This is not possible in XSD 1.0. It might be possible in XSD 1.1.
I am a fan of XML Schema, but I would not choose it for this type of validation. You might be able to make it work using XSD1.1 but if your requirements became just a little more complex you could end up with some horrible-looking constraints.
On the other hand, an XPath expression can elegantly express any constraint you can think of, and you would not need to bend the language to make it work.
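As a rough illustration of the constraint itself, the XPath test on the foo element would be something like `@b or @c`. The sketch below checks the same condition in Python. It reads the attributes directly because the standard library's ElementTree does not evaluate boolean XPath expressions, and the function name `has_b_or_c` is made up for the example.

```python
import xml.etree.ElementTree as ET


def has_b_or_c(xml_text):
    """Return True if the root element carries attribute b, c, or both."""
    elem = ET.fromstring(xml_text)
    # Same meaning as the XPath boolean expression "@b or @c"
    return "b" in elem.attrib or "c" in elem.attrib


both = has_b_or_c('<foo a="1" b="2" c="3" />')
only_c = has_b_or_c('<foo a="1" c="3" />')
neither = has_b_or_c('<foo a="1" />')
```

Both attributes being present is accepted, which is the case the xs:key attempt could not handle.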

Lucene search on contents of several tags (eXist-db)

I have the following lucene indices in collection.xconf
<lucene>
<text qname="tei:text" />
<text qname="tei:summary"/>
<text qname="tei:placeName"/>
</lucene>
My xquery code then makes the following query:
let $query-results := ($documents[.//tei:text[ft:query(., $q)]],
$documents[.//tei:summary[ft:query(., $q)]],
$documents[.//tei:placeName[ft:query(., $q)]])
Now I want to, say, search on:
-pommern erik
Now all the documents in query-results have the word "erik" but not the
word "pommern" in the contents of the tei:text tag.
But some of the results have both "erik" and "pommern" in the
tei:summary tag.
How do I make an xpath so that the user can search across all three tags
with lucene syntax (so that the contents of the three tags appears as
one text)?
At the moment you are doing a union of the results; you need an intersection instead.
To do this you really need to query against the parent and chain the predicates together, something like:
$documents[ft:query(.//tei:text, $q)][ft:query(.//tei:summary, $q)][ft:query(.//tei:placeName, $q)]

Trouble omitting punctuation from Solr Alphabetic Range query search string

I am trying to create an alphabetic browse of names (personal and institutional) using range queries that will sort without regard to punctuation or capitalization. Even though the analysis tool in Solr suggests that punctuation in queries should be stripped out correctly, the presence of punctuation in the query still negatively affects the results.
from schema.xml:
<fieldType name="sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
<field name="authorSort" type="sort" indexed="true" stored="true" multiValued="false" required="true"/>
from solrconfig.xml:
<requestHandler name="/authors" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">lucene</str>
<str name="echoParams">explicit</str>
<str name="fl">*</str>
<str name="df">authorSort</str>
<str name="sort">authorSort asc</str>
<str name="rows">20</str>
<str name="wt">ruby</str>
<str name="indent">true</str>
</lst>
</requestHandler>
My actual queries look like this:
http://myserver/solr/testCore/authors?q=["Search String" TO *]
When I search for q=["ACA" TO *], my top result is "ACA (Academy of Certified Archivists)", which is good. If I vary the capitalization used in "ACA" my results don't change, which is also good. If I search for the acronym with periods (q=["A.C.A." TO *]) I don't get the appropriate results at all, and my top hit is "A3 (Musical group)". In this case I suspect that it's sorting on the period rather than dropping it.
According to the analysis tool in Solr, both "ACA" and "A.C.A." should be rendered down to "aca" using the analyzer I've configured. I'm at a loss to explain why these two searches aren't effectively equivalent.
(If it makes any difference, the index-time analysis is effectively useless as my code is doing the same conversion before submitting the data to be indexed. There are reasons for that. So it's only the query-time analysis that is giving me grief.)
Edit: Here's a screenshot of how my analysis of "A.C.A." as a query should be working (according to the Solr analysis tool).
Added about four months later:
Since posting the question and not finding a resolution, I have
switched to using a custom filter factory for the analysis. This gave
me control over the analysis that would have been difficult or
impossible given the provided filters. My first attempt had the same
problem - the analysis worked in regular search but wasn't applied in
range queries. This problem was resolved by adding
implements MultiTermAwareComponent to my filter factory and overriding
getMultiTermComponent(). I have no idea what this does for a field
which is using the KeywordTokenizer and therefore never has multiple
terms in a field value... but it did fix the problem. This was for
Solr 4.2.
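Another option, sketched here only as an assumption and not something tested against Solr, is to normalize the search string on the client before building the range query, mirroring the three charFilters and the lowercase and trim filters from the schema above. The ICU folding step is not reproduced, and the names `normalize_sort_key` and `range_query` are hypothetical.

```python
import re
import string

# ASCII punctuation (the same set Java's \p{Punct} covers) plus the extra
# characters listed in the schema's second charFilter.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "¿¡「」]")


def normalize_sort_key(text):
    """Approximate the 'sort' field analysis chain on the client side."""
    text = text.replace("-", " ")      # charFilter 1: hyphen becomes a space
    text = _PUNCT.sub("", text)        # charFilter 2: drop punctuation
    text = re.sub(r"\s+", " ", text)   # charFilter 3: collapse whitespace
    return text.lower().strip()        # LowerCaseFilter + TrimFilter


def range_query(search_string):
    """Build the open-ended range query from an already-normalized key."""
    return '["{}" TO *]'.format(normalize_sort_key(search_string))


plain = range_query("ACA")
dotted = range_query("A.C.A.")
```

With this, "ACA" and "A.C.A." produce the same query string, so the range no longer depends on query-time analysis being applied.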

XPath select one type of nodes only in direct child nodes

Maybe someone can help me find a solution to my problem.
I need to perform an XPath query in the xml below that pulls only the "Field" nodes that are direct child nodes.
In the below example, the query should pull fields E1F1, E1F2 and E1F3.
So far I am running the query: //Field, but I get all fields (including the ones that belong to E1_1 which I don't want).
<Entity id="E1">
<Field id="E1F1"></Field>
<Field id="E1F2"></Field>
<Field id="E1F3"></Field>
<Entity id="E1_1">
<Field id="E1_1F1"></Field>
<Field id="E1_1F2"></Field>
<Field id="E1_1F3"></Field>
</Entity>
</Entity>
Thank you!!
Use an absolute XPath:
/Entity/Field
// matches at any depth. If you use a single forward slash, only direct children of the preceding step are matched.
In my case, the wanted node is far from the root element (/html), so the accepted answer was not what I needed. After some searching, I found that the child axis can be used instead of the descendant axis. I hope this helps someone who is using scrapy to get some info from HTML.
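The difference between the child and descendant steps can be seen with a small sketch using Python's standard ElementTree module, which supports a subset of XPath relative to an element. The document here is the example from the question with the outer Entity closed. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<Entity id="E1">
  <Field id="E1F1"></Field>
  <Field id="E1F2"></Field>
  <Field id="E1F3"></Field>
  <Entity id="E1_1">
    <Field id="E1_1F1"></Field>
    <Field id="E1_1F2"></Field>
    <Field id="E1_1F3"></Field>
  </Entity>
</Entity>
"""

root = ET.fromstring(DOC)  # root is the outer <Entity id="E1">

# "Field" is a child step: only Field elements directly under root
direct_ids = [f.get("id") for f in root.findall("Field")]

# ".//Field" is a descendant step: Field elements at any depth under root
all_ids = [f.get("id") for f in root.findall(".//Field")]
```

Because the child step is relative to whichever element it is called on, it also works when the wanted node sits deep inside a larger document.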

Solr TrieFloat and SortableFloatField, which is best for float number sorting

I have a Solr schema in which a field is declared as TrieFloatField:
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
(...)
<field
name="someField"
type="tfloat"
indexed="true"
stored="false"
multiValued="false" />
If I use it to sort the results like so:
solrQuery.addSortField("someField", ORDER.asc);
solrQuery.addSortField("score", ORDER.desc);
The float numbers are not returned in correct numerical order; I get results such as:
0.31 0.67 0.80 15.13 0.09 15.13 0.04
What's even stranger is that when I use this field to sort my results, some sorting DOES happen (they're in a different order than if, let's say, I don't use any sort field at all). Also, even if I change the sort order from asc to desc, the results are in the same order.
I thought that the TrieFloat type would work well for this. However I now see in the docs that they only mention that it's "Floating point field accessible Lucene TrieRange processing":
http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/schema/TrieFloatField.html
and I honestly don't really know what that means. I also see that there's a SortableFloatField:
http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/3.5.0/solr-core-3.5.0-javadoc.jar!/org/apache/solr/schema/SortableFloatField.html
but the docs don't really say anything about how it behaves when being used as a sort criterion.
My question simply is: which one of those two types (or what other type) is good for storing float numbers such that they can be used for proper (natural) ascending and descending sorting in a Solr query?
Both classes should work, but TrieFloatField will require much less memory than SortableFloatField (the former uses a float field cache while the latter uses a String field cache). Note that if you don't need to perform range queries, you should set precisionStep=0.
However the bug you hit is very strange...
