Updating Solr Index when product data has changed - performance

We are working on implementing Solr on e-commerce site. The site is continuously updated with a new data, either by updates made in existing product information or add new product altogether.
We are using it on asp.net mvc3 application with solrnet.
We are facing issue with indexing. We are currently doing commit using following:
private static ISolrOperations<ProductSolr> solrWorker;
public void ProductIndex()
{
//Check connection instance invoked or not
if (solrWorker == null)
{
Startup.Init<ProductSolr>("http://localhost:8983/solr/");
solrWorker = ServiceLocator.Current.GetInstance<ISolrOperations<ProductSolr>>();
}
var products = GetProductIdandName();
solrWorker.Add(products);
solrWorker.Commit();
}
Although this is just a simple test application where we have inserted just product name and id into the solr index. Every time it runs, the new products gets updated all at once, and available when we search it. I think this create the new data index into solr everytime it runs? Correct me if I'm wrong.
My Question is:
Does this recreate Solr Index Data in whole? Or just update the data that is changed/new? How? Even if it only updates changed/new data, how it knows what data is changed? With large data set, this must have some issues.
What is the alternative way to track what has changed since last commit, and is there any way to add those product into Solr index that has changed.
What happens when we update existing record into solr? Does it delete old data and insert new and recreate whole index? Is this resource intensive?
How big e-commerce retailer does this with millions of products.
What is the best strategy to solve this problem?

When you do an update only that record is delete and inserted. Solr does not update the records. The other records are untouched. When you commit the data new segments would be created with this new data. On optimize the data is optimized into a single segment.
You can use Incremental build technique to add/update records after the last build. DIH provides it out of the box, If you are handling it manually through jobs you can maintain the timestamp and run builds.
Solr does not have an update operation. It will perform a delete and add. So you have to use the complete data again and not just the updated fields. Its not resource intensive. Usually only Commit and Optimize are.
Solr can handle any amount of data. You can use Sharding if your data grows beyond the handling capacity of a single machine.

Related

Importing a large amount of data into Elasticsearch every time by dropping existing data

Currently, there's a denormalized table inside a MySQL database that contains hundreds of columns and millions of records.
The original source of the data does not have any way to track the changes so the entire table is dropped and rebuilt every day by a CRON job.
Now, I would like to import this data into Elaticsearch. What is the best way to approach this? Should I use logstash to connect directly to the table and import it or is there a better way?
Exporting the data into JSON or similar is an expensive process since we're talking about gigabytes of data every time.
Also, should I drop the index in elastic as well or is there a way to make it recognize the changes?
In any case - I'd recommend using index templates to simplify index creation.
Now for the ingestion strategy, I see two possible options:
Rework your ETL process to do a merge instead of dropping and recreating the entire table. This would definitely be slower but would allow shipping only deltas to ES or any other data source.
As you've imagined yourself - you should be probably fine with Logstash using daily jobs. Create a daily index and drop the old one during the daily migration.
You could introduce buffers, such as Kafka to your infrastructure, but I feel that might be an overkill for your current use case.

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with. By using 3 value status column.
0 = Not indexed
1 = Updated
2 = Indexed
There will be 2 jobs...
Job 1 will select top X records where status = 0 and pop them into a queue like RabitMQ.
Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
For updates, since we have control of our data... The SQL stored proc that updates that particular record will set it's status to 2. Job2 will select top x records where status = 2 and pop them on RabitMQ. Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
Of course we may need an intermediate status for "queued" so none of the jobs pick up the same record again but the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none. Since updates only happen at end of day usually the next day.
So I know there's rivers (but being deprecated and probably not flexible like ETL)
I would like to bulk insert records from my SQL server to Elasticsearch.
Write a scheduled batch job of some sort either ETL or any other tool doesn't matter.
select from table where id > lastIdInsertedToElasticSearch this will allow to load the latest records into Elasticsearch at scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records in ES? I know ES has document versions when putting the same Id. But can't seem to be able to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better that the same code base updates documents in Elasticsearch, may be not directly but by notifying some other service or with the help of queues to not waste time in responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where there are a few types of data that we perform search on comes from some 3rd party service which exposes the data over their HTTP API. The data creation is in our control but we don't have an automated mechanism of updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have managed to tune the cron's schedule because we also have a limited number of API queries quota. But in this case, our data is not really updated so much per day. So this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river. We get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

Does postgresql index update on inserting new row?

Sorry if this is a dumb question but do i need to reindex my table every time i insert rows, or does the new row get indexed when added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think when you insert rows, the index does get updated. It maintains the sort on the index table as you insert data. Hence there are performance issues or downtimes on a table, if you try adding large number of rows at once.
On top of the other answers: PostgreSQL is a top notch Relational Database. I'm not aware of any Relational Database system where indices are not updated automatically.
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.

Solr deployment strategies for 100% Up Time while creating whole index

I'm working on a Solr 3.6 with ASP.net MVC3 e-commerce project.
I've an index of appx. 1 lac products in Solr. There is some changes in requirements, and we need to rebuild the whole index. Whole indexing is taking almost 1 & half hour during which site needs to be down.
How can I rebuild the index and also keep the site live which serving contents from older index. What is best practices to reduce down time while rebuilding the whole index. I wish I can do it with 100% uptime.
Edit
I'm storing a several URLs into Solr data as stored field and hence, which are dynamically generated while adding data into Solr. If I deploy application on different sub domain like test.example.com then it takes wrong URL, wherein it will only work with example.com. So hosting an another application is not an option for me.
You can leverage the concept of multiple cores in Solr and thereby have a live core that users are currently searching against and a standby core where you can make schema changes, re-index, etc. Then using the SWAP command you can switch the live and standby cores without any user downtime. The swap will be handled internally by Solr and your users will never notice a difference.
As I see, there are several ways to correctly solve this problem:
Do not rebuild the whole index - just update necessary records on-the-fly when they changes, Solr can do it pretty simple
Create 2 Solr instances on different ports and use them one after one. When first is rebuilding, on the second you can use old index. And when first is rebuilt, you can use it until the index on second instance is rebuilt.
Add boolean field to your index named, for example, "old_index". And whem reindexing starts, update all currrent records and set old_index=1, then write somewhere in configuration that you looking for records with old_index==1. Than start reindexing, than delete old records. It can be done with Solr`s deleteByQuery and either atomatic update in Solr 4.x or manual update.

Incremental data import in Solr from MSSQL normalized table with complex joins

Working on Solr incremental data import from an existing normalized mssql database. I'm unable to decide on the strategy I need to implement, or not knowing whether there are existing tools to do the same, so I don't need to reinvent the wheel.
I need to import a document into Solr 3.6 to build a Solr Data, which is saved in MSSQL in heavily normalized fashion. To retrieve the data for single document, there are many joins required which is killing performance. I have appx. 1 million such document in db. So full import into Solr is not an option for me.
While deciding the approach I have two issues to consider:
Incremental data import, so that SQL server doesn't have heavy load while fetching data from db.
Updating data that has been changed in SQL Server into Solr data once a day
I am looking after your help in deciding the strategy and tool for incremental data import into Solr. I think, I have following options:
Custom develop application to fetch data from MSSQL and pass it to Solr. I need to keep track of data as what all records are inserted into Solr and what are pending. Again, 2% data records in MSSQL keeps updated on daily basis, so need to track what data has changed since then, and then update them again at some point of time into Solr.
Use any existing tool or utility in Solr to do the same, like DIH. I'm not sure how this will address both of the issue of incremental data retrieval and how it will track what data has change in SQL server? Again, not sure how DIH will handle complex joins requires to fetch data from db.
Or use something like Lusql with DIH, bust still not sure about how it will address both the issues. Although Lusql will give ability to do complex joins in db, so I hope this might fit my purpose.
I'm in favor of using LuSQL with DIH in Solr, if it can fit the purpose, but still not sure how it keep track of what data has change? Or for this part I have to manage manually by maintaining the document id where the change is made, and then supplying it to LuSQL to fetch data from SQL and import into Solr.
I am also looking forward for your suggestions beyond this to handle this kind of situations.
I will share with you the way i do this.
Mainly I have the same requierements and until this week I used solr dataimport with delta imports. I have a program that regularly updates a status for the new items from 0 to 1 and then calls solr data import to get all the documents with status 1. Solrdataimport uses a stored procedure to join and get the documents with status 1 from db. If the import finish successfully I then update the status to 2 and I know that this documents are in solr. If a documents get changed I simply change from status 2 to status 0 and then the import process updates the document in solr.
Everything works fine for me using this process. I always get the new documents in solr without having to fetch all the data form the database.
Now my requirements have changed because we decide to keep the date archived in the database, as we only need it in solr. So I need to have a program that deserialize the data and then have it sent to solr.
My approach now is to add all the new/updated documents via update handler and after I added all the documents to commit them, and if the commit is successful then I update the status in the database. With this approach I have no experience yet so I don't know if it will work or not but I will just try and see what happens.
I researched in the past a better way to do this but I couldn't find anything so if you find a better solution please share it with me.
Good luck :)
We had to index from a heavily normalized schema with 25+ tables, half of them contain over 5M records each. Largest ~20M.
We use informatica to load these records from oracle to solr. ETL tools like informatica provide ways to join tables/results of a query outside the relational database. It has a sorter transformation to sort out side database. An aggregate transformation to group by records outside db. There is also a lookup transformation..
Essentially, our data is de-normalized in stages and loading/indexing process is distributed.
There are open source ETL tools of course. There is a Microsoft ETL tool..
Indexing to solr happens through update handler.. Delta indexing is very similar to full indexing with additional logic for change data capture. ETL activity is scheduled.

Resources