Index the Raw HTML content using solr/lucene

I have some HTML pages that I scraped from the same site at different points in time. The raw data looks like this:
timestamp, htmlcontent(500KB)
..
I have written a parser that extracts a few interesting fields from the HTML, and I am trying to build a search engine on top of those parsed fields. The results should not be based on just the raw text of the HTML but on the complete raw HTML content.
Now my data looks like:
timestamp, htmlcontent, parsedfield1, parsedfield2
I want the user to search by timestamp, parsedfield1, or parsedfield2, and have the search engine return the raw HTML matching the query and render it in the browser, so it feels like a search engine time machine :)
In this case, how should I design the index? Which fields should I store, and which not? I am following the book "Lucene in Action" and would appreciate help on how to approach this problem.
As I understand it, each field in schema.xml has a few attributes: indexed or not, stored or not. I assume that whatever you want to include in the query result has to be stored. In that case, I have to store the column that contains the raw HTML.
That column is big: one record is usually hundreds of KB, so with only hundreds of rows you can easily get a dataset of almost 1 GB, which won't work in Solr. I tried to index those columns using Lucene directly and ran into a heap size problem.
Here is another idea:
Maybe I should store parsedfield1, parsedfield2, and a pointer, where the pointer column is the absolute path of the raw HTML file. Of course, in that case I need to save each HTML page as a separate file, locally or on HDFS. When the user searches for parsedfield1, the index returns the absolute path and I go and retrieve the file.
I have tried to describe the problem as clearly as I can. Could anyone spend a minute giving me some directional guidance?
Much appreciated!

Some Guidelines
1. You need your data in XML, CSV, or JSON format. I will give an example in XML.
E.g., your data in XML format:
<add>
<doc>
<field name="id">01</field>
<field name="timestamp">somevalue</field>
<field name="parsedfield1">your data 1</field>
<field name="parsedfield2">Java data </field>
<field name="htmlcontent">link to that html file</field>
</doc>
</add>
2. You need to modify schema.xml:
-- each document should have one unique id
-- for your use case, store only the path in htmlcontent
-- index the other fields for searching only (no need to store them)
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="timestamp" type="text_general" indexed="true" stored="false" />
<field name="parsedfield1" type="text_general" indexed="true" stored="false"/>
<field name="parsedfield2" type="text_general" indexed="true" stored="false" />
<field name="htmlcontent" type="text_general" indexed="true" stored="true" />
3. You can use post.jar to post all the XML files to Solr, or use the SolrJ API if you need to do it programmatically.
Fields to be stored or not
Fields that you only want to search on do not need to be stored, unless you want to display them in the results.
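As a rough illustration of step 1 combined with the "store only the path" idea, here is a minimal Python sketch that writes records into the `<add>` XML format shown above. The helper name `build_add_xml` and the example path are hypothetical; it assumes each scraped page has already been saved as its own file and that the record carries that file's absolute path in htmlcontent.

```python
import xml.etree.ElementTree as ET

# Field names taken from the example document above.
FIELDS = ("id", "timestamp", "parsedfield1", "parsedfield2", "htmlcontent")

def build_add_xml(records):
    """Return a Solr <add> XML string for a list of record dicts.

    Assumption: rec["htmlcontent"] holds the path of the saved HTML file,
    not the HTML itself, so the index stays small.
    """
    add = ET.Element("add")
    for rec in records:
        doc = ET.SubElement(add, "doc")
        for name in FIELDS:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(rec[name])  # ElementTree escapes &, <, > for us
    return ET.tostring(add, encoding="unicode")

if __name__ == "__main__":
    print(build_add_xml([{
        "id": "01",
        "timestamp": "somevalue",
        "parsedfield1": "your data 1",
        "parsedfield2": "Java data",
        "htmlcontent": "/data/html/01.html",  # hypothetical path
    }]))
```

The resulting file can then be posted with post.jar as in step 3; at query time the application reads the stored path from the hit and loads the raw HTML from disk or HDFS itself.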

Related

Sorting Truncated Date in Solr

I am currently using Solr 7.1.0. I have indexed a few documents, each of which has a date associated with it.
The managed-schema configuration for that field is:
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<field name="Published_Date" type="pdate" multiValued="false" indexed="true" stored="true"/>
A few example values:
"Published_Date":"2019-10-25T00:00:00Z",
"Published_Date":"2019-10-21T10:00:00Z"
Please help me work out how to achieve the following:
I want to sort the documents by Published_Date, but only on the day (ignoring the time of day and time zone).
That is, sorting on the basis of
"Published_Date":'2019-10-25'

Spring data solr - multilingual support

I have to implement multilingual search, and I was reading through the Spring Data Solr documentation but couldn't find much detail on how to implement multilingual queries with it.
Consider a Solr collection with dynamic fields, where we index documents based on locale. If I use Spring Data Solr, I'll have to create an entity with fields matching all locales.
E.g., fields defined in the Solr schema.xml:
...
...
<dynamicField name="name_en_us" type="text_en_us" indexed="true" stored="true" required="false" docValues="false" multiValued="false"/>
<dynamicField name="name_fr_fr" type="text_fr_fr" indexed="true" stored="true" required="false" docValues="false" multiValued="false"/>
...
...
The 'name' field would then be added to the entity as:
name_en_us,
name_fr_fr,
name_en_uk,
...
...
Is there any way to do this dynamically? I mean having only one name field in the entity and fetching the documents from Solr based on locale using Spring Data?
Please suggest.

Display fields as defined in data config file in result query in SOLR

I am still new to Solr and I've managed to install it and index 1000 documents from the database. When I submit a query, the results are returned correctly, but the fields are not displayed in the order in which they are defined in the data config file.
Example of the data config file:
<field column="id" name="event_id" />
<field column="event_desc_current" name="event_desc" />
<field column="event_cost" name="event_cost" />
<field column="event_sponsors" name="event_sponsors" />
...
Example of results returned:
<result name="response" numFound="7" start="0">
<doc>
<str name="event_desc">Church Fund Raising</str>
<arr name="event_sponsors">
<str/>
</arr>
<str name="event_id">2</str>
<int name="event_cost">428</int>
...
<long name="_version_">1472652516366745600</long></doc>
How can I output the fields in the order defined in the data config file, like this:
event_id
event_desc
event_cost
event_sponsors
...

Typically, the order of the fields should not matter, since you would deserialize the response in a client, and the logic for displaying the search results lives in the client.
However, if you do want to dictate the order of the fields, you could use the fl parameter in your Solr query to get results in the order you prefer.
The fl parameter also lets you choose which fields to include in the results.
Personally, I would recommend not worrying about the order of the fields and having a client that can consume them in any order. The reason is that if you later add a new field in the middle of your schema, you could break the client's logic!
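To make the client-side approach concrete, here is a small Python sketch. It builds the fl parameter from the desired field list and, independently of whatever order Solr returns, reorders each document in the client. The function names are made up for illustration, and it assumes the response was requested as JSON (wt=json) and already parsed into dicts.

```python
from urllib.parse import urlencode

# Display order wanted by the asker (field names from the data config above).
FIELD_ORDER = ["event_id", "event_desc", "event_cost", "event_sponsors"]

def build_params(query, fields=FIELD_ORDER):
    """Query string that asks Solr for exactly the listed fields."""
    return urlencode({"q": query, "fl": ",".join(fields), "wt": "json"})

def order_doc(doc, fields=FIELD_ORDER):
    """Return a copy of doc with the known fields first, in display order.

    Fields not in the list (for example _version_) are appended at the end,
    so adding a new schema field later does not break the client.
    """
    ordered = {name: doc[name] for name in fields if name in doc}
    ordered.update((k, v) for k, v in doc.items() if k not in ordered)
    return ordered

if __name__ == "__main__":
    doc = {"event_desc": "Church Fund Raising", "event_sponsors": [""],
           "event_id": "2", "event_cost": 428}
    print(build_params("*:*"))
    print(list(order_doc(doc)))
```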

SolR - Search for room availability and sort by result

I'm trying to implement a kind of hotel/hostel search using Solr and PHP. For every available room I store a new document in my index, containing relevant information about the accommodation plus multivalued attributes holding an availableFrom and an availableTill date. Running a query against Solr to get all rooms within a certain timespan shouldn't be that hard, but my brain gives up when it comes to sorting...
My goal is to show not only the available accommodations, but all of those matching a general filter query on the destination (country/city/district), and to sort these results so that all available rooms come first in the list.
So for a search for rooms in Munich from 1st December '12 till 5th December, I would like to get results like these:
Room A (available)
Room B (available)
Room C (not completely available in the given period => nice to have)
Room D (not available at all)
Currently I'm running Solr 3.6 but could switch to the new 4.0 if necessary.
Does any Solr guru out there have suggestions for me?
Any help appreciated :-)
-edit-
I think Samuele pushed me in the right direction. So the question now is how to create a function query that lets me sort by availability. Maybe there is a better way to store my documents, i.e. by changing my schema.xml?
Here is a little excerpt from it:
<field name="recordId" type="string" indexed="true" stored="true" />
<field name="language" type="int" indexed="true" stored="true" />
<field name="name" type="string" indexed="true" stored="false" />
<field name="maxPersons" type="int" indexed="true" stored="false" />
<field name="avgPrice" type="tdouble" indexed="true" stored="false" />
<field name="city" type="freetext" indexed="true" stored="false" />
<field name="district" type="freetext" indexed="true" stored="false" />
<field name="country" type="freetext" indexed="true" stored="false" />
<field name="availableFrom" type="date" indexed="true" stored="true" multiValued="true" />
<field name="availableTill" type="date" indexed="true" stored="true" multiValued="true" />
Cheers - Sven

Well, you have to boost your query based on the field "rooms" (or availability, it depends on you) and give different scores based on the value.
Quick example:
Let's give an available room a boost of 20, a partially available one a boost of 10, and an unavailable one a boost of 1 (just to be sure).
Your query (URL-wise, I don't know the PHP interface to Solr) would need something like
<query>&bq=rooms:avail^20.0&bq=rooms:part-avail^10.0...
Suggestion: if you're using the dismax query handler, the boost is additive. This means you'll have to use a bigger boost than that (2000 instead of 20, for example), since the boost value is added to the query score.
Also, you should check this link from the Solr wiki, which is better than any explanation.
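The boost-query URL above can be assembled programmatically. The sketch below is in Python rather than PHP and only builds the query string. The `rooms` field with values `avail` and `part-avail` is the hypothetical field from this answer (it is not in the asker's schema excerpt), and searching the `city` field through qf is likewise an assumption for illustration.

```python
from urllib.parse import urlencode

def availability_boost_params(text, boosts):
    """Build a dismax query string with one bq parameter per boost.

    boosts is a list of (value, boost) pairs for the hypothetical
    'rooms' field, e.g. [("avail", 20.0), ("part-avail", 10.0)].
    """
    params = [("defType", "dismax"), ("q", text), ("qf", "city")]
    # A list of tuples lets the bq key repeat, which a dict could not.
    params += [("bq", f"rooms:{value}^{boost}") for value, boost in boosts]
    return urlencode(params)

if __name__ == "__main__":
    print(availability_boost_params(
        "Munich", [("avail", 20.0), ("part-avail", 10.0)]))
```

Because dismax adds the bq score to the main score, the boost values may need to be much larger than 20 and 10 to dominate the ordering, as noted above.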

Well, I did some research and testing on the whole thing here... The current and possibly best solution for my problem is to perform multiple queries against Solr. As suggested by Samuele, I query Solr for all accommodations matching the given criteria and timespan in two steps.
1: Get all matching rooms that are available (this includes partially available rooms)
2: Get all unavailable rooms
The second query is obviously only performed when we need to show more results because of the pagination.
After that, all results from step 1 are post-processed to figure out whether they are available for the whole requested timespan.
A further "improvement" would be to introduce a new field in the schema: availableDay. For each bookable day there would be an entry for that date. This would split the first query into two separate ones, which is then only a matter of additional filters for Solr.
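The post-processing step could look roughly like the following Python sketch (the original application is PHP, so this is only an illustration of the logic). It assumes the multivalued availableFrom and availableTill values have already been paired up into (from, till) windows per room, and it treats a room as fully available only if a single window covers the whole request; adjacent windows are not merged.

```python
from datetime import date

def classify(windows, start, end):
    """Classify one room against the requested [start, end] timespan."""
    if any(frm <= start and till >= end for frm, till in windows):
        return "available"      # one window covers the whole request
    if any(frm <= end and till >= start for frm, till in windows):
        return "partial"        # some window overlaps the request
    return "unavailable"

def sort_by_availability(rooms, start, end):
    """Stable sort: fully available first, then partial, then the rest."""
    rank = {"available": 0, "partial": 1, "unavailable": 2}
    return sorted(rooms, key=lambda r: rank[classify(r["windows"], start, end)])

if __name__ == "__main__":
    start, end = date(2012, 12, 1), date(2012, 12, 5)
    rooms = [
        {"name": "Room D", "windows": [(date(2012, 12, 10), date(2012, 12, 20))]},
        {"name": "Room C", "windows": [(date(2012, 12, 3), date(2012, 12, 8))]},
        {"name": "Room A", "windows": [(date(2012, 11, 28), date(2012, 12, 6))]},
    ]
    for room in sort_by_availability(rooms, start, end):
        print(room["name"], classify(room["windows"], start, end))
```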
Thanks again for pointing me in the right direction!

Solr Query on Unique Integer Field

I have a field defined in schema.xml as:
<field name="id" type="integer" indexed="true" stored="true" required="true" />
It is also the uniqueKey for the schema.
I cannot perform a query on this field with the query url:
/select?q=4525&qf=id&fl=id,name%2Cscore
This returns no results. However, if I search on a different field (such as a text field) with a different query, I get many results, which include the stored id. Solr is working great for text fields, but I cannot query for items based on the unique id.
What am I missing? Are there other steps that need to be performed for indexing?

Looks like you're using the qf parameter the wrong way... it is only honored by the dismax/edismax query parsers, where it lists the fields to search and their boosts. The standard query parser ignores it.
Use id:4525 instead, as in:
/select?q=id:4525&fl=id,name,score
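If the URL is built in code, the same fielded query can be produced like this. This is a minimal Python sketch; the helper name is made up, and the `safe` argument is only there to keep `:` and `,` readable in the output.

```python
from urllib.parse import urlencode

def id_lookup_query(doc_id, fields=("id", "name", "score")):
    """Build a /select URL that queries the id field directly."""
    params = {"q": f"id:{doc_id}", "fl": ",".join(fields)}
    return "/select?" + urlencode(params, safe=":,")

if __name__ == "__main__":
    print(id_lookup_query(4525))
```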
