Can Solr join tables in-memory? - performance

There is a table of n products, and a table of features of these products. Each product has many features. Given a Solr DataImportHandler configuration:
<document name="products">
<entity name="item" query="select id, name from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<entity name="feature"
query="select feature_name, description from feature where item_id='${item.ID}'">
<field name="feature_name" column="feature_name" />
<field name="description" column="description" />
</entity>
</entity>
</document>
Solr will run n + 1 queries to fetch this data: one for the main query and n more to fetch the features of each item. This is inefficient for a large number of items. Is it possible to configure Solr so that it runs the two queries separately and joins the results in memory instead? All rows from both tables will be fetched.

This can be done using CachedSqlEntityProcessor:
<document name="products">
<entity name="item" query="select id, name from item">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<entity name="feature"
query="select item_id, feature_name, description from feature"
cacheKey="item_id"
cacheLookup="item.ID"
processor="CachedSqlEntityProcessor">
<field name="feature_name" column="feature_name" />
<field name="description" column="description" />
</entity>
</entity>
</document>
Since Solr's index is 'flat', feature_name and description are not connected in any way; each product will have multi-valued fields for each of these.
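As a rough illustration of what 'flat' means here, an indexed product might come back from a query looking something like the sketch below. All values (Widget, color, size and so on) are made up, and the schema is assumed to declare both fields as multi-valued strings.

```xml
<!-- Hypothetical query result for one product; every value is invented. -->
<doc>
  <str name="id">1</str>
  <str name="name">Widget</str>
  <!-- One entry per feature row. Nothing in the index links the n-th
       feature_name to the n-th description other than its position. -->
  <arr name="feature_name">
    <str>color</str>
    <str>size</str>
  </arr>
  <arr name="description">
    <str>Available in red and blue</str>
    <str>Fits in a coat pocket</str>
  </arr>
</doc>
```

If the pairing matters at query time, one option is to concatenate the two columns into a single value in SQL before indexing.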

I am not sure if Solr can do this, but the database can. Assuming that you are using MySQL, use JOIN and GROUP_CONCAT to collapse this into a single query. The query should look something like this:
SELECT item.id, item.name, GROUP_CONCAT(feature.description) AS descriptions FROM item INNER JOIN feature ON (feature.item_id = item.id) GROUP BY item.id, item.name
Don't forget to use the RegexTransformer on descriptions to split out the multiple values. Note that an inner join drops items that have no features; use a LEFT JOIN if those items should still be indexed.
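A minimal sketch of how that single-query approach might look in the DataImportHandler configuration. The alias descriptions, the '|' separator, and the assumption that no description contains a '|' are illustrative choices rather than part of the answer. splitBy takes a regular expression, so the pipe is escaped.

```xml
<document name="products">
  <!-- RegexTransformer must be named on the entity for splitBy to take effect. -->
  <entity name="item" transformer="RegexTransformer"
          query="SELECT item.id, item.name,
                        GROUP_CONCAT(feature.description SEPARATOR '|') AS descriptions
                 FROM item INNER JOIN feature ON feature.item_id = item.id
                 GROUP BY item.id, item.name">
    <field column="id" name="id" />
    <field column="name" name="name" />
    <!-- Splits the concatenated string back into one value per feature. -->
    <field column="descriptions" name="description" splitBy="\|" />
  </entity>
</document>
```

MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 bytes by default), so items with many features may need that limit raised.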

Related

Solr dataimport jdbc multiple columns into one field

I am trying to implement a Solr search for a project. Everything was fine so far, and a first simple version worked. Now I am trying to import from a Postgres database where multiple columns should end up in the same field. My config:
<entity name="address" query="SELECT objectid, ags2, ags3, ags5, ags8, ags11, ags20, ags22, pt, stn, hnr_min, hnr_max, plz, ort, ortz, ot1, ot2 FROM variablen2018.ags22_tmp_solr LIMIT 10000;">
<field column="objectid" name="id" />
<field column="plz" name="plz" />
<field column="ort" name="ort" />
<field column="ortz" name="ort" />
<field column="ot1" name="ort" />
<field column="ot2" name="ort" />
<field column="ort" name="ort_res" />
<field column="stn" name="stn" />
<field column="stn" name="stn_res" />
<field column="ags2" name="ags2" />
<field column="ags3" name="ags3" />
<field column="ags5" name="ags5" />
<field column="ags8" name="ags8" />
<field column="ags11" name="ags11" />
<field column="ags20" name="ags20" />
<field column="ags22" name="ags22" />
<field column="pt" name="coord" />
<field column="hnr_min" name="hnr_min" />
<field column="hnr_max" name="hnr_max" />
</entity>
As you can see, there are 4 columns from the DB (ort, ortz, ot1, ot2) going into one field (ort). Most of the time only one of the columns is populated, in which case the document is indexed normally. But when more than one column is populated, indexing of the document fails. The field is defined this way:
<field name="ort" type="text_de" uninvertible="true" indexed="true" required="true" stored="true"/>
DataImportHandler maps the result view of the query to a schema view, so I don't think you will be able to map multiple columns to one field. Instead, you can assign each column to its own Solr field and then copy them into a combined field in your schema.
For example:
<field name="ort" type="string" />
<field name="ortz" type="string" />
<field name="ot1" type="string" />
<field name="ot2" type="string" />
<field name="ortCombined" type="string" multiValued="true"/>
<copyField source="ort" dest="ortCombined" />
<copyField source="ortz" dest="ortCombined" />
<copyField source="ot1" dest="ortCombined" />
<copyField source="ot2" dest="ortCombined" />
Hope this helps!
You can do it this way: concatenate all values into a single value in the SELECT:
select ..., ort||','||ortz||','||ot1||','||ot2 AS ort_all FROM variablen2018.ags22_tmp_solr
and then split it into individual values when indexing into Solr (this is done with RegexTransformer and splitBy):
<entity name="address" transformer="RegexTransformer"
...
<field column="ort_all" name="ort" splitBy=","/>
Things to watch out for:
handle possible nulls (in Postgres, || returns NULL if any operand is NULL; look at concat_ws etc.)
handle a possible "," inside the ort values (use another separator, or replace the commas, etc.)
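One way to address both caveats on PostgreSQL, sketched under a few assumptions: concat_ws skips NULL arguments, '|' is assumed never to occur in the address columns, and the ort field is assumed to be declared multiValued="true" in the schema (the definition shown in the question is single-valued). The remaining columns from the original query are left out for brevity.

```xml
<entity name="address" transformer="RegexTransformer"
        query="SELECT objectid, plz,
                      concat_ws('|', ort, ortz, ot1, ot2) AS ort_all
               FROM variablen2018.ags22_tmp_solr">
  <field column="objectid" name="id" />
  <field column="plz" name="plz" />
  <!-- splitBy is a regular expression, so the pipe is escaped. -->
  <field column="ort_all" name="ort" splitBy="\|" />
</entity>
```

Empty strings are not NULL, so a column holding '' would still produce an empty value after the split.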

Cassandra Solr integration with DataImportHandler using the CQL driver: getting "Frame size larger than max length (16384000)!"

I am trying to use DataImportHandler to integrate Cassandra and Solr using org.apache.cassandra.cql.jdbc.CassandraDriver.
I am able to fetch 20000 rows, but when I try to fetch all rows it fails with "Caused by: org.apache.thrift.transport.TTransportException: Frame size (16402604) larger than max length (16384000)!"
My data-config file :
<dataConfig>
<dataSource autoCommit="true" driver="org.apache.cassandra.cql.jdbc.CassandraDriver" url="jdbc:cassandra://127.0.0.1/test_new" />
<document name="products">
<entity name="testproducts" query="select * from products LIMIT 20015">
<field name="id" column="product_id"/>
<field name="productId" column="product_id"/>
<field name="productPrice" column="sale_price" />
<field name="productSource" column="source"/>
<field name="productMrpPrice" column="mrp_price"/>
<entity name="productrating" query="select * from product_reviews where product_id='${testproducts.product_id}'">
<field name="productRating" column="rating" />
<field name="productReview" column="review" />
<field name="customerId" column="customer_id" />
<field name="customerName" column="customer_name" />
</entity>
</entity>
</document>
</dataConfig>
How can I increase the maximum frame size in the CQL JDBC driver?
How can I import all rows using the CQL JDBC driver?

Solr DataImportHandler Cache Support for Multiple Values

I'm trying to use cache for some entities in my data import handler configuration. Somehow if I use cache, I only get the first value of my multivalued field. My configuration looks like this:
<entity name="product" query="SELECT product_id FROM Product WHERE 1">
<entity name="strength" query="SELECT *
FROM Strength WHERE product_id = '${product.product_id}'">
<entity name="form" query="SELECT CONCAT(parent_route,'|',form_name) AS form_name, LOWER(CONCAT_WS('\n',form_name,parent_route)) AS form_name_s,
CAST(form_id AS CHAR(10)) AS form_id_string FROM Form WHERE form_id = '${strength.form_id}'"
transformer="RegexTransformer"
cacheImpl="SortedMapBackedCache" cacheLookup="strength.form_id" cacheKey="form_id_string">
<field column="form_name" name="form_name" />
<field column="form_name_s" splitBy="\n" />
</entity>
</entity>
</entity>
There should be two rows returned for the entity "form" but only the first one is visible if cache is enabled. Does Solr not have the ability to cache multiple rows or am I doing something wrong? My Solr version is 4.1.
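The problem is fixed when the WHERE clause is removed from the cached query. With the WHERE clause in place, the cached query presumably only ever loads the rows matching the first form_id it is run for. I'm not sure the following configuration is ideal, but as I understand it, the aim of the cache is to reduce the number of queries.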
<entity name="product" query="SELECT product_id FROM Product WHERE 1">
<entity name="strength" query="SELECT *
FROM Strength WHERE product_id = '${product.product_id}'">
<entity name="form" query="SELECT CONCAT(parent_route,'|',form_name) AS form_name, LOWER(CONCAT_WS('\n',form_name,parent_route)) AS form_name_s,
CAST(form_id AS CHAR(10)) AS form_id_string FROM Form"
transformer="RegexTransformer"
cacheImpl="SortedMapBackedCache" cacheLookup="strength.form_id" cacheKey="form_id_string">
<field column="form_name" name="form_name" />
<field column="form_name_s" splitBy="\n" />
</entity>
</entity>
</entity>

Solr ClobTransformer

I have been stuck with ClobTransformer in Solr for the past 3 days. I want to convert an Oracle CLOB field to a text field in Solr. I am using multiple cores and I started my config and schema files from scratch.
This is my config file:
<lib dir="../../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
These are the columns in my schema file for a core:
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="mandp" type="text_en_splitting" indexed="true" stored="true" multiValued="false" />
This is my data-config.xml for the core:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="oracle.jdbc.driver.OracleDriver"
url="jdbc:oracle:thin:@***"
user="***"
password="****"/>
<document>
<entity name="wiki" transformer="ClobTransformer"
query="Select t.id as id, t.mandp From table1 t">
<field column="mandp" name="mandp" clob="true" />
</entity>
</document>
</dataConfig>
When I start Solr, I can see in the console that the dataimporthandler*.jar files have loaded successfully. When I run my data import from http://localhost:8983/solr/wiki/dataimport?command=full-import&clean=false, I don't see any errors in the console, nor do I see anything related to the transformer or the CLOB. Even if I type garbage into the transformer attribute (transformer="bla bla bla"), no errors appear in the console, which could mean that my transformer attribute is completely ignored or that full logging is turned off.
When I query Solr, I see oracle.sql.CLOB@375c929a in the mandp field. Nothing changes, of course, if I use the HTMLStripTransformer class too. I want to use both on this field.
Any ideas are appreciated!
It looks like the ClobTransformer is not being fired. I would personally alias the mandp column explicitly inside the query, like this:
Select t.id as id, t.mandp as mandp From table1 t
Please add transformer="ClobTransformer, RegexTransformer" to the entity in your data-config.xml file.
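For the stated goal of applying both transformers to the same field, a sketch of how the chaining might look. Transformers run in the order listed, so the CLOB is converted to a string before the HTML is stripped. Whether the columns need to be written in upper case is an assumption to check: Oracle reports unquoted column names in upper case, and a case mismatch between the result set and the column attribute is one possible reason a transformer appears to do nothing.

```xml
<entity name="wiki" transformer="ClobTransformer,HTMLStripTransformer"
        query="Select t.id as id, t.mandp From table1 t">
  <!-- Upper-case column names are an assumption based on Oracle's default
       behaviour; keep the lower-case names if that is what the driver returns. -->
  <field column="ID" name="id" />
  <!-- clob="true" is read by ClobTransformer, stripHTML="true" by HTMLStripTransformer. -->
  <field column="MANDP" name="mandp" clob="true" stripHTML="true" />
</entity>
```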

Accessing ancestor values in xpath with Solr DataImportHandler

If my xml is structured like so:
<fruit>
<apple appleId="apple_1">
<core coreId="core_1">
<seed>1</seed>
<seed>2</seed>
</core>
</apple>
<apple appleId="apple_2">
<core coreId="core_1">
<seed>1</seed>
</core>
</apple>
</fruit>
and I want the seeds to be the documents in my solr schema, how can I access the appleId and coreId?
Here's the pertinent entity definition from my data-config.xml:
<entity name="apples"
processor="XPathEntityProcessor"
stream="true"
forEach="/fruit/apple/core/seed"
url="fruit.xml"
transformer="script:create_id"
>
<field column="seed_s" xpath="/fruit/apple/core/seed" />
<field column="apple_id_s" xpath="/fruit/apple/@appleId" />
</entity>
script:create_id creates a unique id for each seed.
In this example, apple_id_s is coming back as null.
I found the problem. I need to use commonField="true" and make sure the forEach loops through each apple and core as well. I also need to set pk="seed_s", which triggers Solr to store the document.
Here's my new entity definition:
<entity name="apples"
processor="XPathEntityProcessor"
stream="true"
pk="seed_s"
forEach="/fruit/apple/core/seed | /fruit/apple | /fruit/apple/core"
url="fruit.xml"
transformer="script:create_id"
>
<field column="seed_s" xpath="/fruit/apple/core/seed" />
<field column="apple_id_s" xpath="/fruit/apple/@appleId" commonField="true"/>
<field column="core_id_s" xpath="/fruit/apple/core/@coreId" commonField="true"/>
</entity>

Resources