I'm using Algolia, via Laravel
I'm frequently updating some entries in my database (every 5 mins), e.g. $object->update(['field'=>$new_value])
I would NOT like these changes to trigger updates / "operations" in Algolia, as this field is not relevant for Algolia's search indexing, and it would otherwise generate too many operations (which I'm then billed for).
Other changes should trigger updates in Algolia.
How can I do this?
if you using Laravel scout. you could do something like
$modal->unsearchable();
//update the changes
$modal->save()
for more info have a look at this https://laravel.com/docs/5.7/scout
cheers!
Related
I took a data set of 10 collections from api xyz, then updated it to firebase for the first time.
On the second update, api xyz has 9 collections available, How to make the 2nd update with only 9 collections (the existing ones being overwritten and the extra collections from the first update should be deleted)?
I used to delete all the collections and then update but that method is not appropriate.
For you to perform bulk updates on your collections, you need to use
Transactions and Batched Writes - as I imagine that you are already using to perform the update in the first time.
If you just want to overwrite the old date, you need to configure a set() method just to update the collection, without needing to check the existence of any document on it. This way, it should overwrite the data already existed. You can get more information on this question here, where a customer describes the function that he is using that is overwriting data.
Let me know if the information helped you!
Here is what we came up with. By using 3 value status column.
0 = Not indexed
1 = Updated
2 = Indexed
There will be 2 jobs...
Job 1 will select top X records where status = 0 and pop them into a queue like RabitMQ.
Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
For updates, since we have control of our data... The SQL stored proc that updates that particular record will set it's status to 2. Job2 will select top x records where status = 2 and pop them on RabitMQ. Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
Of course we may need an intermediate status for "queued" so none of the jobs pick up the same record again but the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none. Since updates only happen at end of day usually the next day.
So I know there's rivers (but being deprecated and probably not flexible like ETL)
I would like to bulk insert records from my SQL server to Elasticsearch.
Write a scheduled batch job of some sort either ETL or any other tool doesn't matter.
select from table where id > lastIdInsertedToElasticSearch this will allow to load the latest records into Elasticsearch at scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records in ES? I know ES has document versions when putting the same Id. But can't seem to be able to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better that the same code base updates documents in Elasticsearch, may be not directly but by notifying some other service or with the help of queues to not waste time in responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where there are a few types of data that we perform search on comes from some 3rd party service which exposes the data over their HTTP API. The data creation is in our control but we don't have an automated mechanism of updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have managed to tune the cron's schedule because we also have a limited number of API queries quota. But in this case, our data is not really updated so much per day. So this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river. We get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.
I'm not sure whether I need a way to lock records returned by the context or simply need a new approach.
Here's the story. We currently have a small number of apps that integrate with our CRM. Some of them open a XrmServiceContext and return a few thousand record to perform updates. These scripts are calling SaveChanges along the way but there will still be accounts near the end that will be saved a couple of minutes after the context return them. If a user updates the record during this time, their changes are overwritten by the script.
Is there a way of locking the records until the context has saved the update back or is there a better approach I should be taking?
Kit
In my opinion, this type of database transaction issue is what CRM is currently lacking the most. There is no way to ensure that someone else doesn't monkey with your data, it's always a last-one-in-wins world in CRM.
With that being said, my suggestion would be to only update the attributes you care about. If you're returning all columns for an entity, when you update that entity, you're possibly going to update all the attributes of the entity, even if you only updated one of them.
If you're dealing with a system were you can't tolerate the last-one-in-wins mentality, then you're probably better off not using CRM.
Update 1
CRM 2015 SP1 and above supports Optimistic Updates. Which allows the use of a version number to ensure that no one has updated the record since you retrieved it.
You have a several options here, it just depends on what you want to do. First of all though, if you can move some of these automated processes to off-time hours, then that's the best option.
Another option would be to retrieve each record 1 by 1 instead of by 1000+.
If you are only updating a percentage of the records retrieved, then you would be better off to check before saving if an update occurred (comparing the modified date). If the modified date changed, then you need to do a single retrieve and then save.
At first thought, I would create a field or status that indicates a pending operation and then use JScript in the form OnLoad event to warn/lock the form. When you process completes, it could clear the flag.
I am working on a WP7 app that contains
CategoryGroups
Categories
Products
The rows for each of these entities are populated on first run of the application.
The issues is that when the app gets published, the rows in each of the entities will change (added, deleted, modified). I would like some suggestions on how I should handle this? Any pointers to existing code samples will be great?
I am using an object oriented database to store my entities. The app also allows the user to add their own entities (which get added to the database as personalized (flagged) entities). One solution I was thinking was to read an xml file from the server and then loop through the database entries and make the necessary modifications in the database. So, on the first run, all the entities will just get inserted. On subsequent runs, if the version number attribute in xml is different, then the system populated data is reloaded from xml but the user data is preserved.
Also, maybe only check for the new xml file on the server when internet connection is available and only periodically (like every 2 weeks).
Any other suggestions are welcome. If there is a simpler, cleaner way - please share.
Pratik
I think it's fair to say that this question has nothing to do with WP7 and everything to do with finding an efficient way to to compute and deliver update deltas.
Timestamp your items. When requesting an update, specify the time of last update. You server can trivially query for items newer than this and return a delta. At the client (ie in the phone) it is not necessary to store a last update time because you can simply add one second to the most recent timestamp in the items present on the phone.
Here is the issue.
On a site I've recently taken over it tracks "miles" you ran in a day. So a user can log into the site, add that they ran 5 miles. This is then added to the database.
At the end of the day, around 1am, a service runs which calculates all the miles, all the users ran in the day and outputs a text file to App_Data. That text file is then displayed in flash on the home page.
I think this is kind of ridiculous. I was told they had to do this due to massive performance issues. They won't tell me exactly how they were doing it before or what the major performance issue was.
So what approach would you guys take? The first thing that popped into my mind was a web service which gets the data via an AJAX call. Perhaps every time a new "mile" entry is added, a trigger is fired and updates the "GlobalMiles" table.
I'd appreciate any info or tips on this.
Thanks so much!
Answering this question is a bit difficult since there we don't know all of your requirements and something didn't work before. So here are some different ideas.
First, revisit your assumptions. Generating a static report once a day is a perfectly valid solution if all you need is daily reports. Why hit the database multiple times throghout the day if all that's needed is a snapshot (for instance, lots of blog software used to write html files when a blog was posted rather than serving up the entry from the database each time -- many still do as an optimization). Is the "real-time" feature something you are adding?
I wouldn't jump to AJAX right away. Use the same input method, just move the report from static to dynamic. Doing too much at once is a good way to get yourself buried. When changing existing code I try to find areas that I can change in isolation wih the least amount of impact to the rest of the application. Then once you have the dynamic report then you can add AJAX (and please use progressive enhancement).
As for the dynamic report itself you have a few options.
Of course you can just SELECT SUM(), but it sounds like that would cause the performance problems if each user has a large number of entries.
If your database supports it, I would look at using an indexed view (sometimes called a materialized view). It should support allows fast updates to the real-time sum data:
CREATE VIEW vw_Miles WITH SCHEMABINDING AS
SELECT SUM([Count]) AS TotalMiles,
COUNT_BIG(*) AS [EntryCount],
UserId
FROM Miles
GROUP BY UserID
GO
CREATE UNIQUE CLUSTERED INDEX ix_Miles ON vw_Miles(UserId)
If the overhead of that is too much, #jn29098's solution is a good once. Roll it up using a scheduled task. If there are a lot of entries for each user, you could only add the delta from the last time the task was run.
UPDATE GlobalMiles SET [TotalMiles] = [TotalMiles] +
(SELECT SUM([Count])
FROM Miles
WHERE UserId = #id
AND EntryDate > #lastTaskRun
GROUP BY UserId)
WHERE UserId = #id
If you don't care about storing the individual entries but only the total you can update the count on the fly:
UPDATE Miles SET [Count] = [Count] + #newCount WHERE UserId = #id
You could use this method in conjunction with the SPROC that adds the entry and have both worlds.
Finally, your trigger method would work as well. It's an alternative to the indexed view where you do the update yourself on a table instad of SQL doing it automatically. It's also similar to the previous option where you move the global update out of the sproc and into a trigger.
The last three options make it more difficult to handle the situation when an entry is removed, although if that's not a feature of your application then you may not need to worry about that.
Now that you've got materialized, real-time data in your database now you can dynamically generate your report. Then you can add fancy with AJAX.
If they are truely having performance issues due to to many hits on the database then I suggest that you take all the input and cram it into a message queue (MSMQ). Then you can have a service on the other end that picks up the messages and does a bulk insert of the data. This way you have fewer db hits. Then you can output to the text file on the update too.
I would create a summary table that's rolled up once/hour or nightly which calculates total miles run. For individual requests you could pull from the nightly summary table plus any additional logged miles for the period between the last rollup calculation and when the user views the page to get the total for that user.
How many users are you talking about and how many log records per day?