Spring Data Solr - multilingual support

I have to implement multilingual search, and I have been reading through the Spring Data Solr documentation, but I couldn't find much detail on how to implement multilingual queries with it.
Suppose there is a Solr collection with dynamic fields, and we index documents based on locale. If I use Spring Data Solr, I will have to create an entity with fields matching every locale.
E.g. fields defined in the Solr schema.xml:
...
...
<dynamicField name="name_en_us" type="text_en_us" indexed="true" stored="true" required="false" docValues="false" multiValued="false"/>
<dynamicField name="name_fr_fr" type="text_fr_fr" indexed="true" stored="true" required="false" docValues="false" multiValued="false"/>
...
...
The 'name' field would then be added to the entity as:
name_en_us,
name_fr_fr,
name_en_uk,
...
...
Is there any way to do this dynamically? I mean having only one 'name' field in the entity and fetching the documents from Solr for a given locale using Spring Data Solr.
Please suggest.
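One way to keep a single logical field name is to derive the locale-specific Solr field name at query time. The following is a minimal Python sketch of that idea only, not a Spring Data Solr API: `localized_field` and `build_query` are hypothetical helper names, and the `<field>_<locale>` naming convention is assumed from the schema excerpt above.

```python
def localized_field(base, locale):
    """Map a logical field name and a locale such as 'en_US' or 'fr-FR'
    to a locale-specific Solr field name, e.g. 'name_en_us'.
    Assumes the '<field>_<locale>' convention from the schema excerpt."""
    suffix = locale.replace("-", "_").lower()
    return f"{base}_{suffix}"


def build_query(base, locale, term):
    """Build a 'field:"term"' phrase query for the given locale.
    Backslashes and double quotes in the term are escaped."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{localized_field(base, locale)}:"{escaped}"'
```

With this, the entity or repository layer would only need to know the logical name `name`, and the locale would be supplied per request.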

Related

Sorting Truncated Date in Solr

I am currently using Solr 7.1.0. I have indexed a few documents, each of which has a date associated with it.
The managed schema configuration for that field is:
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<field name="Published_Date" type="pdate" multiValued="false" indexed="true" stored="true"/>
A few example values are:
"Published_Date":"2019-10-25T00:00:00Z",
"Published_Date":"2019-10-21T10:00:00Z"
Please help me work out how to achieve the following:
I want to sort the documents by Published_Date, but only on the day (ignoring the time and time zone).
That is, sorting on the basis of
"Published_Date":'2019-10-25'

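One possible approach to the truncated-date sort, sketched here under assumptions, is to truncate each timestamp to midnight UTC at index time and store it in a separate field used only for sorting. The field name `Published_Day` and the helpers `truncate_to_day` and `with_sort_field` are hypothetical and do not appear in the schema above.

```python
from datetime import datetime

# Timestamp layout used by the example values, e.g. 2019-10-21T10:00:00Z
SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_day(solr_date):
    """Drop the time part of a UTC timestamp, keeping the same string layout."""
    parsed = datetime.strptime(solr_date, SOLR_DATE_FORMAT)
    midnight = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight.strftime(SOLR_DATE_FORMAT)


def with_sort_field(doc):
    """Return a copy of the document with an extra day-level field.
    'Published_Day' is an assumed field name, not part of the original schema."""
    enriched = dict(doc)
    enriched["Published_Day"] = truncate_to_day(doc["Published_Date"])
    return enriched
```

Documents enriched this way could then be sorted on the day-level field (for example `sort=Published_Day desc`), with any tie-breaking field added after it.
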
Display fields as defined in data config file in result query in SOLR

I am still new to Solr and I've managed to install it and index 1000 documents from the database. When I submit a query, the results are returned correctly, but the fields are not displayed in the order defined in the data config file.
Example of the data config file:
<field column="id" name="event_id" />
<field column="event_desc_current" name="event_desc" />
<field column="event_cost" name="event_cost" />
<field column="event_sponsors" name="event_sponsors" />
...
Example of results returned:
<result name="response" numFound="7" start="0">
<doc>
<str name="event_desc">Church Fund Raising</str>
<arr name="event_sponsors">
<str/>
</arr>
<str name="event_id">2</str>
<int name="event_cost">428</int>
...
<long name="_version_">1472652516366745600</long></doc>
How can I output the order of the fields as defined in the data config file like this:
event_id
event_desc
event_cost
event_sponsors
...
Typically, the order of the fields should not matter, because you would deserialize the response in a client, and the logic for displaying the search results lives in the client.
However, if you do want to dictate the order of fields, you could use the fl parameter in your Solr query to get results in the order you prefer.
The fl parameter also lets you choose which fields are included in the results.
Personally, I would recommend not worrying about the order of fields and writing a client that can consume them in any order. The reason is that if you later add a new field in the middle of your schema, you could break the client's logic!
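As a rough illustration of both suggestions, the Python sketch below builds a select URL with an explicit fl list and also reorders a returned document on the client side. The helper names `select_url` and `reorder` are hypothetical, and the field list is taken from the question; reordering in the client does not depend on how Solr orders fields in its response.

```python
from urllib.parse import urlencode

# Preferred display order, taken from the data config file in the question
FIELD_ORDER = ["event_id", "event_desc", "event_cost", "event_sponsors"]


def select_url(query, fields):
    """Build a /select URL that requests only the given fields via fl."""
    params = {"q": query, "fl": ",".join(fields)}
    return "/select?" + urlencode(params)


def reorder(doc, fields):
    """Return a dict with the listed fields first, in that order,
    followed by any remaining fields in their original order."""
    ordered = {name: doc[name] for name in fields if name in doc}
    ordered.update({k: v for k, v in doc.items() if k not in ordered})
    return ordered
```

Calling `reorder(doc, FIELD_ORDER)` on each deserialized document keeps the display order stable even if new fields are added to the schema later.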

Index the Raw HTML content using solr/lucene

I have some HTML pages that I scraped from the same site at different points in time, and the raw data looks like this:
timestamp, htmlcontent(500KB)
..
I have written a parser to extract a few interesting fields from the HTML, and I am trying to build a search engine based on those parsed fields. The results should return not just the raw text of the HTML but the complete raw HTML content.
Now my data looks like:
timestamp, htmlcontent, parsedfield1, parsedfield2
I want the user to search on timestamp, parsedfield1 or parsedfield2 and have my search engine return the raw HTML matching the query and render it in the browser, so it feels like a search engine time machine :)
In this case, how should I design the index? Which fields should I store and which not? I am following the book "Lucene in Action" and wondering whether anyone can help me approach this problem.
Based on my understanding of indexing, each field in schema.xml has a few attributes: indexed or not, stored or not. I assume that whatever you want to include in the query result should be stored. In that case, I have to store the column which contains the raw HTML.
That column is big: one record is usually hundreds of KB, so with only hundreds of rows you can easily get a dataset of almost 1GB, which won't work in Solr. I am trying to index those columns using Lucene, and it runs into a heap size problem.
Here is another idea:
Maybe I should store parsedfield1, parsedfield2 and a pointer, where the pointer column is the absolute path of the raw HTML file. Of course, in this case, I need to store each HTML page in a separate file, locally or on HDFS. When the user searches on parsedfield1, the engine returns the absolute path and I go and retrieve the file.
I think I have described the problem as clearly as I can, and I wonder whether anyone can spend a minute giving me some directional guidance.
Much appreciated!
Some guidelines:
1. You need your data in XML, CSV or JSON format. I will give you an example in XML.
E.g. your data in XML format:
<add>
<doc>
<field name="id">01</field>
<field name="timestamp">somevalue</field>
<field name="parsedfield1">your data 1</field>
<field name="parsedfield2">Java data </field>
<field name="htmlcontent">link to that html file</field>
</doc>
</add>
2. You need to modify schema.xml:
-- each document should have one unique id
-- as per your requirement, store only the path in htmlcontent
-- index the other fields for searching only
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="timestamp" type="text_general" indexed="true" stored="false" />
<field name="parsedfield1" type="text_general" indexed="true" stored="false"/>
<field name="parsedfield2" type="text_general" indexed="true" stored="false" />
<field name="htmlcontent" type="text_general" indexed="true" stored="true" />
3. You can use post.jar to post all the XML files to Solr, or you can use the SolrJ APIs if you need to do it programmatically.
Fields to be stored or not:
Fields that you only want to search on do not need to be stored, unless you also want to display them in the results.
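To illustrate step 1 of the guidelines above, here is a minimal Python sketch that generates the `<add><doc>` XML for one record, storing only the path to the HTML file in `htmlcontent`. The function name `build_add_xml` and the sample values are hypothetical; the field names follow the example document above.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, timestamp, parsedfield1, parsedfield2, html_path):
    """Build a Solr XML update message for a single document.
    htmlcontent holds the path to the raw HTML file, not the HTML itself."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    values = {
        "id": doc_id,
        "timestamp": timestamp,
        "parsedfield1": parsedfield1,
        "parsedfield2": parsedfield2,
        "htmlcontent": html_path,
    }
    for name, value in values.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")
```

The resulting string can be written to a file and posted with post.jar, as described in step 3.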

Obtaining Image File Metadata and Indexing to Solr using TikaEntityProcessor

Can someone suggest how to obtain the metadata of an image file (e.g. .jpg, .png, .gif) and index that data into Apache Solr?
Currently, I'm using Apache Solr 4.2. In the DataImport configuration file (I named it "db-import-config.xml"), I tried using TikaEntityProcessor with ImageMetadataExtractor.
<!-- dataSource "binary" is a BinURLDataSource -->
<entity name="tika-test"
dataSource="binary"
processor="TikaEntityProcessor"
onError="skip"
rootEntity="false"
url="${dbmw_image.url}"
format="none"
parser="org.apache.tika.parser.image.ImageMetadataExtractor">
<field column="contributor" name="authors" meta="true"/>
<field column="creator" name="authors" meta="true"/>
<field column="data" name="creationDate" meta="true"/>
<field column="modified" name="lastModifiedDate" meta="true"/>
</entity>
The "column" values of the fields all come from the Dublin Core metadata list. When I ran the dataimport on Solr, none of these fields were picked up. I need answers to the following questions:
What are the available metadata field names for image files? (i.e. the values that I can put in the "column" attribute of "field" in the Tika entity above)
How do I obtain those metadata values (through Tika?) and index them into Solr? (e.g. which parser do I need, how should I set the Tika entity attributes, etc.)
Any suggestions are appreciated.
Thanks,
Did you look at the TikaEntityProcessor documentation?
Specifically, the part on finding field names?

Solr Query on Unique Integer Field

I have a field defined in schema.xml as:
<field name="id" type="integer" indexed="true" stored="true" required="true" />
It is also the uniqueKey for the schema.
I cannot perform a query on this field with the query url:
/select?q=4525&qf=id&fl=id,name%2Cscore
This returns no results. However, if I search on a different field (such as a text field) with a different query, I get many results, which include the stored id. Solr is working great for text fields, but I cannot query for items based on the unique id.
What am I missing? Are there other steps that need to be performed for indexing?
It looks like you're using the qf parameter the wrong way. It is only used by the dismax and edismax query parsers, where it lists the fields to search and their boosts; the default query parser ignores it.
Use id:4525 instead, as in:
/select?q=id:4525&fl=id,name,score
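For completeness, here is a small Python sketch that builds both variants of the request: the fielded query recommended above, and a form that keeps qf by explicitly selecting the edismax parser through defType. The helper names are hypothetical, and the edismax variant is an illustrative assumption rather than part of the original answer.

```python
from urllib.parse import urlencode


def id_lookup_url(doc_id):
    """Fielded query on the unique key; works with the default query parser."""
    params = {"q": f"id:{doc_id}", "fl": "id,name,score"}
    # Keep ':' and ',' unescaped so the URL matches the form shown above
    return "/select?" + urlencode(params, safe=":,")


def edismax_lookup_url(doc_id):
    """Alternative sketch: keep qf, but select the edismax parser so qf is honored."""
    params = {"q": str(doc_id), "defType": "edismax", "qf": "id", "fl": "id,name,score"}
    return "/select?" + urlencode(params, safe=":,")
```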
