Solr Query on Unique Integer Field

I have a field defined in schema.xml as:
<field name="id" type="integer" indexed="true" stored="true" required="true" />
It is also the uniqueKey for the schema.
I cannot perform a query on this field with the query url:
/select?q=4525&qf=id&fl=id,name%2Cscore
This returns no results. However, if I search on a different field (such as a text field) with a different query, I get many results, which include the stored id. Solr is working great for text fields, but I cannot query for items based on the unique id.
What am I missing? Are there other steps that need to be performed for indexing?

It looks like you're using the qf parameter the wrong way. It only applies to dismax/edismax queries, where it lists the fields to search and their optional boosts; the default query parser ignores it.
Use id:4525 instead, as in:
/select?q=id:4525&fl=id,name,score
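As a rough illustration of the fix above, here is a minimal Python sketch that builds the same kind of request URL. The base URL `http://localhost:8983/solr/mycore` and the helper name `build_select_url` are assumptions made for the example, not something from the question.

```python
from urllib.parse import urlencode


def build_select_url(base, field, value, fields):
    """Build a /select URL that queries a single field using field:value syntax."""
    params = {
        "q": f"{field}:{value}",  # e.g. id:4525, understood by the default parser
        "fl": ",".join(fields),   # fields to return in each document
    }
    # urlencode percent-encodes ':' and ',' so the URL is safe to send as-is
    return f"{base}/select?{urlencode(params)}"


# Hypothetical core location; adjust to your own installation
url = build_select_url(
    "http://localhost:8983/solr/mycore", "id", 4525, ["id", "name", "score"]
)
print(url)
```

The field name goes into q itself rather than into qf, which is the whole point of the answer.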

Related

Sorting Truncated Date in Solr

I am currently using Solr 7.1.0. I have indexed a few documents, each of which has a date associated with it.
The managed-schema configuration for that field is:
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<field name="Published_Date" type="pdate" multiValued="false" indexed="true" stored="true"/>
A few example values are:
"Published_Date":"2019-10-25T00:00:00Z",
"Published_Date":"2019-10-21T10:00:00Z"
How can I achieve the following?
I want to sort the documents by Published_Date, but only on the day (ignoring the time and timezone),
that is, sorting on the basis of
"Published_Date":'2019-10-25'

Constant boost multivalued fields in Solr

I have a multivalued field storing strings which I need to perform queries on. It stores IDs as strings. So, this is the field:
<field name="id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
And the query would look like:
q: (id:'23' OR id:'24')^2
This matches documents where the field contains 23 or 24. The documents which have both of those IDs are at the top, and the documents which have only one of those IDs are below.
What I want is a constant boost of 2. If at least one ID matches, give it a boost of 2. How do I achieve something like that?
One possible option is to convert this query to a ConstantScoreQuery by replacing ^ with ^=:
q: (id:'23' OR id:'24')^=2
In this case, whether your document has both terms 23 and 24 or just one of them, it will get the same score of 2.0.
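If the query string is assembled in client code, the ^= form can be generated with a small helper. This is only a sketch: the function name `constant_boost_any` is made up for illustration, and quoting the values with double quotes is an assumption about how the IDs should be escaped.

```python
def constant_boost_any(field, values, boost):
    """Return a query giving every document that matches any value the same score."""
    # One clause per value, joined with OR: id:"23" OR id:"24"
    clauses = " OR ".join(f'{field}:"{value}"' for value in values)
    # ^= (instead of ^) requests a constant score for the whole group
    return f"({clauses})^={boost}"


query = constant_boost_any("id", ["23", "24"], 2)
print(query)
```

The resulting string can be passed as the q parameter unchanged.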

Solr: different sort result between 2 core

I have two Solr cores, one called catalogs and the other special_page. They contain data from the same database; the difference is that catalogs contains more fields than special_page (catalogs runs on Solr 5 and special_page on Solr 4, so yes, the Solr versions differ).
The problem appears when I sort this data, e.g. for the special_page whose id is 1, sorted by product_scoring desc: the two cores return the results in different orders.
catalogs schema for product_scoring: <field name="product_scoring" type="text_general" indexed="true" stored="true" multiValued="false" default=""/>
special_page schema for product_scoring: <field name="product_scoring" type="text_general" indexed="true" stored="true"/>
Can anyone suggest what would make the two cores produce the same order of results? Thanks
If you're actually indexing float values, don't index them as text. Text will split the content into separate tokens based on multiple separators, such as "." and whitespace. Depending on how exact you need the values to be, using a double or float is a possibility (but remember, doubles and floats are not exact).
Secondly, since the values in the field are identical, the ordering between those documents is undefined (or it will default to the order they were added in, but that may change and may not be the same across both cores). Use a secondary, stable field (such as the name, id, or date added) to get identical sorting of the same data across cores (this is also why a cursorMark requires the sort to include the unique key).
The problem is that Solr does not order documents with the same sort value in any consistent way. If you run the same query again and again, the order among documents with the same value can change. I think this is the problem.
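To make the tie-breaker idea concrete, here is a small Python sketch that imitates a sort such as `sort=product_scoring desc,id asc` on the client side. The sample documents are invented, and it assumes product_scoring has been indexed as a number, as the first answer suggests.

```python
# Same three documents, added in a different order to each (hypothetical) core
docs_core_a = [
    {"id": 3, "product_scoring": 1.5},
    {"id": 1, "product_scoring": 1.5},
    {"id": 2, "product_scoring": 2.0},
]
docs_core_b = [
    {"id": 1, "product_scoring": 1.5},
    {"id": 2, "product_scoring": 2.0},
    {"id": 3, "product_scoring": 1.5},
]


def sort_with_tiebreak(docs):
    """Sort by product_scoring descending, then by id ascending to break ties."""
    return sorted(docs, key=lambda d: (-d["product_scoring"], d["id"]))


ids_a = [d["id"] for d in sort_with_tiebreak(docs_core_a)]
ids_b = [d["id"] for d in sort_with_tiebreak(docs_core_b)]
print(ids_a, ids_b)
```

Because the unique id decides between documents with equal scores, both lists come out in the same order regardless of insertion order.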

Display fields as defined in data config file in result query in SOLR

I am still new to Solr and have managed to install it and index 1000 documents from the database. When I submit a query, the results are returned correctly, but the fields are not displayed in the order defined in the data config file.
Example of data config file:
<field column="id" name="event_id" />
<field column="event_desc_current" name="event_desc" />
<field column="event_cost" name="event_cost" />
<field column="event_sponsors" name="event_sponsors" />
...
Example of results returned:
<result name="response" numFound="7" start="0">
<doc>
<str name="event_desc">Church Fund Raising</str>
<arr name="event_sponsors">
<str/>
</arr>
<str name="event_id">2</str>
<int name="event_cost">428</int>
...
<long name="_version_">1472652516366745600</long></doc>
How can I output the fields in the order defined in the data config file, like this:
event_id
event_desc
event_cost
event_sponsors
...
Typically, the order of the fields should not matter, as you would deserialize the response in a client, and the logic for displaying the search results lives in the client.
However, if you do want to dictate the order of fields, you could use the fl parameter in your Solr query to get results in the order you prefer.
You can also use fl to choose which fields are included in the results.
Personally, I would recommend not worrying about the order of fields, and having a client that can consume them in any order. The reason is that if you add a new field in the middle of your schema, you could break the client's logic!
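A client that imposes its own display order might look roughly like the Python sketch below. The `FIELD_ORDER` list and the `order_fields` helper are illustrative names, and the sample document simply mirrors the result shown in the question.

```python
# Preferred display order, taken from the data config in the question
FIELD_ORDER = ["event_id", "event_desc", "event_cost", "event_sponsors"]


def order_fields(doc, field_order):
    """Return a copy of doc with known fields first, in the preferred order."""
    ordered = {name: doc[name] for name in field_order if name in doc}
    # Fields not listed in field_order (e.g. _version_) are appended at the end
    ordered.update({k: v for k, v in doc.items() if k not in ordered})
    return ordered


doc = {
    "event_desc": "Church Fund Raising",
    "event_sponsors": [""],
    "event_id": "2",
    "event_cost": 428,
    "_version_": 1472652516366745600,
}
ordered = order_fields(doc, FIELD_ORDER)
print(list(ordered))
```

Unknown fields are kept rather than dropped, so adding a field to the schema later does not break the client.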

Index the Raw HTML content using solr/lucene

I have some HTML pages that I scraped off the web from the same site at different points in time, and the raw data looks like this:
timestamp, htmlcontent(500KB)
..
I have written a parser to extract a few interesting fields from the HTML, and I am trying to build a search engine based on the fields that I parsed out: not just based on the raw text of the HTML, but on the raw, complete HTML content.
Now my data looks like:
timestamp, htmlcontent, parsedfield1, parsedfield2
I want the user to search by timestamp, parsedfield1, or parsedfield2, and my search engine to return the raw HTML matching the user's query and render it in the browser, so it feels like a search engine time machine :)
In this case, I am wondering how I should design the index: which fields should I store and which not? I am following the book "Lucene in Action" and wondering whether anyone can help me approach this problem.
Based on my understanding of the index, there are a few attributes in schema.xml: indexed or not? stored or not? I assume that whatever you want to include in the query result should be stored. In that case, I have to store the column which contains the raw HTML.
Since that column is so big (one record is usually hundreds of KB), even with only hundreds of rows you can easily get a dataset of almost 1GB, which won't work in Solr, and when I try to index those columns using Lucene it runs into a heap size problem.
Here is another idea:
Maybe I should store parsedfield1, parsedfield2, and a pointer, where the pointer column is the absolute path of the raw HTML file. Of course, in this case, I need to store each HTML page in a separate file locally or on HDFS. Then when a user searches for parsedfield1, the search returns the absolute path and I go and retrieve those files.
I think I have described the problem as clearly as I can, and I am wondering whether anyone can spend a minute giving me some directional guidance.
Much appreciated!
Some Guidelines
1. You need your data in XML, CSV, or JSON format. I will give you an example in XML.
E.g., your data in XML format:
<add>
<doc>
<field name="id">01</field>
<field name="timestamp">somevalue</field>
<field name="parsedfield1">your data 1</field>
<field name="parsedfield2">Java data </field>
<field name="htmlcontent">link to that html file</field>
</doc>
</add>
2. You need to modify schema.xml:
-- each document should have one unique id
-- as per your need, you only need to store the path for htmlcontent
-- the other fields are indexed only, for searching
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="timestamp" type="text_general" indexed="true" stored="false" />
<field name="parsedfield1" type="text_general" indexed="true" stored="false"/>
<field name="parsedfield2" type="text_general" indexed="true" stored="false" />
<field name="htmlcontent" type="text_general" indexed="true" stored="true" />
3. You can use post.jar to post all the XML files to Solr, or you can use the SolrJ APIs if you need to do it programmatically.
Fields to be stored or not
Fields on which you only want to search do not need to be stored, unless you want to display them in the results.
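If the parsed rows already live in a script, the XML from step 1 could be generated rather than written by hand. The following Python sketch uses only the standard library; the `build_add_xml` helper and the file path `/data/html/page-01.html` are hypothetical, and htmlcontent holds a path to the HTML file rather than the HTML itself, as suggested above.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes special characters
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml([
    {
        "id": "01",
        "timestamp": "somevalue",
        "parsedfield1": "your data 1",
        "parsedfield2": "Java data",
        "htmlcontent": "/data/html/page-01.html",  # hypothetical path to the raw HTML
    }
])
print(xml_payload)
```

The output can be saved to a file and posted to Solr with post.jar as described in step 3.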
