Solr capacity to handle delta import frequency - performance

I wanted to set up a system where a new item gets indexed in Solr as soon as it is created in the database, to avoid the few minutes of delay from time-based delta polling. So I tweaked the delta import a little and made it work based on a query parameter. In my C# code, when a new item is saved, I construct a delta-import URL, pass the newsid to be indexed, and invoke it with HttpWebRequest. Solr then uses the delta query to fetch the details from the database and index them.
http://localhost:89983/solr/mycore/dataimport?command=deltaimport&clean=false&newsid=1234
This works as expected. The issue comes when the flow of news gets higher, say 5 items at a time. The URL is hit by the code for each item in a loop, but the calls come so fast that only the first item (or sometimes two) gets indexed. The rest are missed.
So I believe Solr can't handle multiple delta-import hits at nearly the same time. How can I overcome this?
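For reference, the pattern boils down to something like the following sketch (written in TypeScript rather than the original C#; the host, port and news IDs are placeholders):

const solrBase = "http://localhost:8983/solr/mycore/dataimport"; // host/port assumed (8983 is Solr's default)

async function indexNewsItem(newsId: number): Promise<void> {
  // Same request the C# code builds with HttpWebRequest
  const url = `${solrBase}?command=deltaimport&clean=false&newsid=${newsId}`;
  await fetch(url);
}

// Saving five items in quick succession fires five nearly simultaneous delta-imports;
// the DataImportHandler is still busy with the first one when the next ones arrive.
for (const id of [1234, 1235, 1236, 1237, 1238]) {
  void indexNewsItem(id); // fire-and-forget, mirroring the loop described above
}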

Related

Apollo Client v3 Delete cache entries after given time period

I am wondering if there is a way to expire cached items after a certain time period, e.g., 24 hours.
I know that Apollo Client v3 provides methods such as cache.evict and cache.gc, which are a good start and which I am already using; however, I want a way to delete cache items after a given time period.
What I am doing at the minute is adding a TimeToLive field to every object in my Apollo schema, and when the backend returns an object, the field is populated with the current time + 24 hours (i.e. the time 24 hours from now). Then when I query the data in the front end, I check to see if the TimeToLive field of the returned data is in the future (if not, that means the data was definitely retrieved from the cache, in which case I call the refetch function, which forces the query to fetch the data from the server). However, this doesn't seem like the best way to do things, mainly because I have to iterate over every result in the returned data and check whether any of the returned objects are expired; and if so, everything is refetched.
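For illustration, the check described above ends up looking roughly like this (a sketch only; the recipes query, the Recipe shape and the timeToLive field name are assumptions):

import { useEffect } from "react";
import { gql, useQuery } from "@apollo/client";

// Hypothetical query; the timeToLive field mirrors the TimeToLive approach described above.
const GET_RECIPES = gql`
  query GetRecipes {
    recipes {
      id
      title
      timeToLive
    }
  }
`;

export function useFreshRecipes() {
  const { data, refetch } = useQuery(GET_RECIPES);

  // If any object's TTL is in the past, it must have been served from the cache after
  // expiring, so force a round trip to the server.
  const expired = data?.recipes?.some(
    (recipe: { timeToLive: number }) => recipe.timeToLive < Date.now()
  );

  useEffect(() => {
    if (expired) {
      void refetch();
    }
  }, [expired, refetch]);

  return data;
}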
Another solution I thought of was to use something like React Native Queue and have a background task that periodically checks the cache and deletes items that have expired. But again, I am not totally sold on this solution.
For a little bit of context here: I am building a cooking / recipes app, and recipes / posts are cached on the device. My concern is that a user could delete a post, but everyone else who has that post cached would still be able to see it; by expiring the cached item, at least they would only be able to see it for a number of hours before it is removed. However, there might be a better way to do this altogether, i.e. have the server contact clients that have the item cached (though I couldn't think of any low-lift solutions at the time of writing this).
apollo-invalidation-policies replaces the Apollo Client InMemoryCache with InvalidationPolicyCache, and within the type policies you can specify a timeToLive field. If an object is accessed beyond its TTL, it is evicted and no data is returned.
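A rough sketch of what that setup can look like (the import path and the exact option shape are assumptions to check against the library's documentation; the Recipe type and the 24-hour value are placeholders):

import { ApolloClient } from "@apollo/client";
// Import path is an assumption; the library has been published as
// "@nerdwallet/apollo-cache-policies" (formerly apollo-invalidation-policies).
import { InvalidationPolicyCache } from "@nerdwallet/apollo-cache-policies";

const cache = new InvalidationPolicyCache({
  invalidationPolicies: {
    timeToLive: 24 * 60 * 60 * 1000, // default TTL for all types: 24 hours, in milliseconds
    types: {
      Recipe: { timeToLive: 24 * 60 * 60 * 1000 }, // per-type override; "Recipe" is a hypothetical type
    },
  },
});

const client = new ApolloClient({
  uri: "https://example.com/graphql", // placeholder endpoint
  cache,
});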

How to increase MDX query speed in Pentaho CDE and how to clear the Mondrian schema cache

I have a problem with MDX queries. I developed a dashboard that has 23 MDX queries; when we run this dashboard it takes 2 minutes to load. How can I solve this?
Another issue:
I modify some data in the database, but when we run the dashboard the modified data isn't shown; it shows the previous data only. How can I solve this?
1) 23 queries on first load may be a bit too much. Can't you simplify that? Also, are the queries all as fast as possible but there are just too many of them? Or are there slower queries that need to be improved? Check also the priority of components. You may have components rendering more than once. Example: you have a Country selector and a City selector. Because the City selector was put in before the Country selector, if they have the same priority (default=5), it'll run first, retrieving the full list of cities; then the Country selector runs and picks the first value as its parameter value. As your City selector most likely listens to the Country parameter, it'll fire again because the Country was fireChange'd.
2) Cache. You're changing the data but either Mondrian or CDA (or both) are getting data from their cache. Two options here:
- Clear the Mondrian cache and the CDA cache after the data is updated (suitable for large updates that affect most of the database; see the sketch after this list);
- Disable the cache on the query definition and the cube cache on the Mondrian schema.
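As a sketch of the first option: both caches can be flushed over HTTP from whatever process loads the data. The endpoints below are assumptions that vary by Pentaho/CDA version, so verify them against your server's documentation.

// Assumed endpoints; verify against your Pentaho/CDA version before relying on them.
// Authentication is omitted for brevity.
const pentahoBase = "http://localhost:8080/pentaho"; // placeholder server address

async function clearCaches(): Promise<void> {
  // CDA query cache (assumed endpoint; older servers exposed /content/cda/clearCache instead)
  await fetch(`${pentahoBase}/plugin/cda/api/clearCache`);
  // Mondrian schema cache (assumed endpoint)
  await fetch(`${pentahoBase}/api/system/refresh/mondrianSchemaCache`);
}

// Call clearCaches() from whatever job updates the database, right after the load finishes.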

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database set up and use socket.io for real-time updates, which will either query the db again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information regarding what is being queried and what the real-time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I'm trying to figure out which is the fastest way for the user to filter.
Also, if that differs from the most cost-effective way, I'd like to know the most cost-effective option too (which I am assuming means fewer db queries)...
It's hard to say because we don't see exact conditions of the filter, but in general:
Mongo generally uses only one index per query. Thus whatever fields are covered by that index can be used for efficient filtering; otherwise it may do a full collection scan, which is slow. If you are using an index then you are probably doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side because Mongo can't do what you want or it would take too many queries.
The least efficient option is to store results somewhere else, simply because I/O is slow. That would only benefit you if you use the stored results as a cache and do not recalculate.
Also consider the overhead and latency of networking. If you have to send lots of data back to the client it will be slower. In general, Mongo will do a better job filtering than you would on the client.
From what you describe, filtering addresses by city and time period could use an index that cuts down lots of documents. You most likely need a compound index covering multiple fields, as sketched below.
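A minimal sketch of that approach with the Node.js MongoDB driver (database, collection and field names are assumptions based on the fields listed in the question):

import { MongoClient } from "mongodb";

async function findFiltered() {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // placeholder URI
  const addresses = client.db("mydb").collection("addresses"); // assumed names

  // Compound index covering the most common filters (Method 1: let the db do the filtering).
  await addresses.createIndex({ city: 1, time: 1, price: 1 });

  // Example filter: addresses in one city within a time window, under a price cap.
  const results = await addresses
    .find({
      city: "Toronto", // placeholder value
      time: { $gte: new Date("2024-01-01"), $lte: new Date("2024-01-31") },
      price: { $lte: 100 },
    })
    .toArray();

  await client.close();
  return results;
}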

CouchDB view is extremely slow

I have a CouchDB (v0.10.0) database that is 8.2 GB in size and contains 3890000 documents.
Now, I have the following as the map function of the view:
function(doc) { emit([doc.Status], doc); }
And it takes forever to load (4 hours and still no result).
Here's some extra information that might help describing the situation:
The view is not a temp view. The view is defined before the 3890000 documents are inserted.
There isn't anything else on the server. It is an Ubuntu box with nothing but the defaults installed.
I see that my CPU is moving and working hard (sometimes shoots to 100%). The memory is moving as well but not increasing.
So my question is:
What is actually happening in the background?
Is this a "one time" thing where I have to wait once and it will somehow work later?
Don't emit the whole doc. It's unnecessary. You can instead run your query with include_docs=true, which will let you access the document via each row's doc attribute.
When you emit the whole doc you make the index as large as or larger than your entire database. :)
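A sketch of that combination (database, design-document and view names are assumed; the map emits a plain key rather than the array key used above, but the idea is the same):

// Assumed database, design-doc and view names, and an example "Active" status key.
const base = "http://localhost:5984/mydb";

// Map function that emits only the key; with no doc in the value the index stays small.
const designDoc = {
  _id: "_design/status",
  views: {
    by_status: {
      map: "function (doc) { emit(doc.Status, null); }",
    },
  },
};

async function createView(): Promise<void> {
  await fetch(`${base}/_design/status`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(designDoc),
  });
}

async function findByStatus(status: string) {
  // include_docs=true tells CouchDB to join the full document into each row's doc field.
  const res = await fetch(
    `${base}/_design/status/_view/by_status?key=${encodeURIComponent(JSON.stringify(status))}&include_docs=true`
  );
  const body = await res.json();
  return body.rows.map((row: { doc: unknown }) => row.doc);
}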
Views are only updated the next time they are read. Upon reading, CouchDB processes all the documents that have been updated (created, updated, deleted) since the last time the view was read.
So even if your view was defined before inserting the 3890000 documents, CouchDB will still be processing all 3890000 documents for the view the first time you read it.
From http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
Note that by default views are not created and updated when a document is saved, but rather, when they are accessed. As a result, the first access might take some time depending on the size of your data while CouchDB creates the view. If preferable the views can also be updated when a document is saved using an external script that calls the views when updates have been made. An example can be found here: RegeneratingViewsOnUpdate
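A sketch of that idea, reusing the names from the previous snippet: have the script or service that performs the inserts touch the view after each batch, so the index is rebuilt outside the user-facing request path.

// Assumed names; the point is simply to read the view so CouchDB refreshes its index.
const viewUrl = "http://localhost:5984/mydb/_design/status/_view/by_status?limit=1";

async function warmView(): Promise<void> {
  // Reading a view makes CouchDB process every document changed since the last read,
  // so calling this after each bulk insert keeps the first user-facing read fast.
  await fetch(viewUrl);
}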
Also just came across this tip, which might be useful if you're running on Ubuntu:
http://nosql.mypopescu.com/post/1299848121/couchdb-and-ubuntu-configuration-trick-for

(ASP.NET) How would you go about creating a real-time counter which tracks database changes?

Here is the issue.
On a site I've recently taken over, it tracks "miles" you ran in a day. A user can log into the site and add that they ran 5 miles; this is then added to the database.
At the end of the day, around 1am, a service runs which calculates all the miles all the users ran in the day and outputs a text file to App_Data. That text file is then displayed in Flash on the home page.
I think this is kind of ridiculous. I was told they had to do this due to massive performance issues. They won't tell me exactly how they were doing it before or what the major performance issue was.
So what approach would you guys take? The first thing that popped into my mind was a web service which gets the data via an AJAX call. Perhaps every time a new "mile" entry is added, a trigger is fired and updates the "GlobalMiles" table.
I'd appreciate any info or tips on this.
Thanks so much!
Answering this question is a bit difficult since we don't know all of your requirements and something clearly didn't work before. So here are some different ideas.
First, revisit your assumptions. Generating a static report once a day is a perfectly valid solution if all you need is daily reports. Why hit the database multiple times throughout the day if all that's needed is a snapshot (for instance, lots of blog software used to write html files when a blog was posted rather than serving up the entry from the database each time -- many still do as an optimization). Is the "real-time" feature something you are adding?
I wouldn't jump to AJAX right away. Use the same input method, just move the report from static to dynamic. Doing too much at once is a good way to get yourself buried. When changing existing code I try to find areas that I can change in isolation with the least amount of impact to the rest of the application. Then once you have the dynamic report you can add AJAX (and please use progressive enhancement).
As for the dynamic report itself you have a few options.
Of course you can just SELECT SUM(), but it sounds like that would cause performance problems if each user has a large number of entries.
If your database supports it, I would look at using an indexed view (sometimes called a materialized view). It allows fast updates to the real-time sum data:
CREATE VIEW vw_Miles WITH SCHEMABINDING AS
SELECT SUM([Count]) AS [TotalMiles],
       COUNT_BIG(*) AS [EntryCount],
       UserId
FROM dbo.Miles
GROUP BY UserId
GO
CREATE UNIQUE CLUSTERED INDEX ix_Miles ON vw_Miles (UserId)
If the overhead of that is too much, @jn29098's solution is a good one: roll it up using a scheduled task. If there are a lot of entries for each user, you could add only the delta from the last time the task was run.
UPDATE GlobalMiles
SET [TotalMiles] = [TotalMiles] +
    (SELECT SUM([Count])
     FROM Miles
     WHERE UserId = @id
       AND EntryDate > @lastTaskRun
     GROUP BY UserId)
WHERE UserId = @id
If you don't care about storing the individual entries but only the total you can update the count on the fly:
UPDATE Miles SET [Count] = [Count] + @newCount WHERE UserId = @id
You could use this method in conjunction with the SPROC that adds the entry and have the best of both worlds.
Finally, your trigger method would work as well. It's an alternative to the indexed view where you do the update yourself on a table instead of SQL Server doing it automatically. It's also similar to the previous option where you move the global update out of the sproc and into a trigger.
The last three options make it more difficult to handle the situation when an entry is removed, although if that's not a feature of your application then you may not need to worry about that.
Now that you've got materialized, real-time data in your database, you can dynamically generate your report. Then you can add the fancy AJAX on top.
If they are truly having performance issues due to too many hits on the database then I suggest that you take all the input and cram it into a message queue (MSMQ). Then you can have a service on the other end that picks up the messages and does a bulk insert of the data. This way you have fewer db hits. Then you can output to the text file on the update too.
I would create a summary table that's rolled up once an hour or nightly and calculates total miles run. For individual requests you could pull from the nightly summary table plus any additional miles logged between the last rollup calculation and when the user views the page, to get the total for that user.
How many users are you talking about and how many log records per day?
