Removal of metadata added by feed - google-search-appliance

I have a GSA that fulfils a number of roles within my organisation. Honestly it's a bit of a frankenmess but it's what I have to work with.
One of the things we have it doing is indexing a number of sites based on a feed we pass it. Each of the items we pass in the feed gets tagged with a metadata value that lets me set up a frontend that only queries those items. This works fine for the most part, except that now I want to remove some of that metadata from items that are already in the index (thereby stopping them from appearing in that particular frontend) and I can't figure out how.
I use a metadata-and-url type feed to push in the URLs I want the system to be aware of, but it also finds a number of them through standard crawl patterns.
Here's the issue: the items in the index that were found as part of the standard crawl are the ones I can't remove. I just need the GSA to forget that I ever attached metadata to them.
Is this possible?

You can push a new feed that either updates the metadata or deletes the individual records that you want to remove from your frontend.
You can also block specific results from appearing in a specific frontend as a temporary measure while you work it out; see the relevant documentation.
It sounds like you would be better off using collections to group the subsets of the index that you want to present in a specific frontend.
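As a rough sketch, re-pushing a metadata-and-url feed to re-declare a record with the metadata you want (or to delete the record outright) might look like this. The hostname, port, datasource and meta names are placeholders; check the GSA feeds documentation for your appliance before relying on the exact form.

```python
import requests

GSA_HOST = "http://your-gsa-hostname:19900"   # assumption: default feedergate port
DATASOURCE = "my_feed"                        # assumption: the datasource used in your original feed

# Re-declares one URL with the metadata you want it to carry from now on.
# To drop the record from the index entirely, use action="delete" on the <record> instead.
feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <datasource>my_feed</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
    <record url="http://example.org/some/page" action="add" mimetype="text/html">
      <metadata>
        <meta name="frontend-tag" content="none"/>
      </metadata>
    </record>
  </group>
</gsafeed>
"""

resp = requests.post(
    f"{GSA_HOST}/xmlfeed",
    files={
        "feedtype": (None, "metadata-and-url"),
        "datasource": (None, DATASOURCE),
        "data": ("feed.xml", feed_xml, "text/xml"),
    },
)
resp.raise_for_status()
print(resp.text)  # the appliance replies with a success message when the feed is accepted
```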

Related

Query by ACL in cloud function?

I have set up my class to control access using ACLs. Only the creator of the object can view or edit the object. There are many thousands of these objects in use in a production environment.
I have a new requirement to, under a certain circumstance, allow a user to remove another user's objects. The simplest analogy is to imagine an admin feature that lets moderators remove all comments by a certain abusive user.
Since the client cannot do this, I am defining a cloud function to handle it, which will be able to use the master key. I pass the user ID from the client to the cloud function, and it should remove all comments by this user.
However, the cloud function is not able to find the comments since they are tied to the user only by ACL. As far as I can tell, it is not possible to query by ACL. Is this accurate?
What is the correct approach here? Do I need an additional column besides the ACL to identify the commenter a second time, simply so I can query by it? This seems duplicative. I will also need to update the many existing records, copying the user specified in the ACL into the new column. Is this even possible?
Or is there some way to build an ACL for use by the cloud function and use it instead of the master key so that the query searches as though it were the user in question?
My final (last resort) idea is to fetch all of the objects and then iterate over them checking the ACLs. This is obviously a pretty poor solution for scale and performance, since I would need to fetch potentially hundreds of thousands of items to check them all.
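For reference, the backfill I have in mind would look roughly like this via the REST API with the master key. The server URL, class name and the new "owner" field are placeholders, and it assumes each comment's ACL contains exactly one user entry; for a very large class you would page on createdAt or objectId rather than skip.

```python
import requests

SERVER = "https://your-parse-server.example.com/parse"   # placeholder
HEADERS = {
    "X-Parse-Application-Id": "APP_ID",      # placeholder
    "X-Parse-Master-Key": "MASTER_KEY",      # placeholder: the master key bypasses ACLs
    "Content-Type": "application/json",
}

def backfill_owner(batch_size=100):
    """Copy the user ID stored in each Comment's ACL into an explicit 'owner' pointer."""
    skip = 0
    while True:
        resp = requests.get(
            f"{SERVER}/classes/Comment",
            headers=HEADERS,
            params={"limit": batch_size, "skip": skip, "keys": "ACL"},
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            break
        for comment in results:
            acl = comment.get("ACL", {})
            # Assumes the only non-public, non-role key in the ACL is the owner's objectId.
            user_ids = [k for k in acl if k != "*" and not k.startswith("role:")]
            if len(user_ids) != 1:
                continue
            owner_pointer = {"__type": "Pointer", "className": "_User", "objectId": user_ids[0]}
            requests.put(
                f"{SERVER}/classes/Comment/{comment['objectId']}",
                headers=HEADERS,
                json={"owner": owner_pointer},
            ).raise_for_status()
        skip += batch_size

backfill_owner()
```

Once the column exists, the cloud function could simply query on "owner" under the master key instead of trying to query by ACL.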

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties: ID, name, taxonomy, and their relationship to the content. They're indexed as nested objects so that we can aggregate on them, etc.
This is where it gets interesting... tags used to be immutable, but we have recently changed metadata systems and they may now change: names will be updated, IDs may shift as tags move between taxonomies, etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, and a few others with tens of thousands of documents each, are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a better mapping we could use for our documents, knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question covering your use case of the nested datatype.
Here is the link to that answer on maintaining parent-child relational data in ES using the nested datatype.
Give that a try, and let me know if it helps solve your problem.
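On the second question (whether the update by query API and a script could handle it): it can do in-place tag renames. Below is a minimal sketch using a 7.x-style Python client; the index name, field names, mapping (tags.id as a keyword) and tag values are assumptions, and it is worth testing on a copy first since a scripted update reindexes every matching document.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumption: local cluster

resp = es.update_by_query(
    index="content",              # assumption: your content index name
    conflicts="proceed",          # tolerate documents touched by concurrent reingests
    body={
        "query": {
            "nested": {
                "path": "tags",
                "query": {"term": {"tags.id": "old-tag-id"}},
            }
        },
        "script": {
            "lang": "painless",
            "source": (
                "for (def t : ctx._source.tags) {"
                "  if (t.id == params.old_id) {"
                "    t.id = params.new_id;"
                "    t.name = params.new_name;"
                "  }"
                "}"
            ),
            "params": {
                "old_id": "old-tag-id",
                "new_id": "new-tag-id",
                "new_name": "New tag name",
            },
        },
    },
)
print(resp["updated"], "documents updated")
```

For the 180,000-document tag this still rewrites every affected document, but it avoids refetching and re-resolving content from your APIs, so it should be far quicker than a full reingest.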

Create subsets for certain Resources to better fit existing data model?

We are trying to implement a FHIR REST server for our application. In our current data model (and thus live data), several FHIR resources are represented by multiple tables; e.g. what would all be Observations are stored in separate tables for vital values, laboratory values and diagnoses. Each table has an independent, auto-incrementing primary ID, so there are entries with the same ID in different tables. But for GET or DELETE calls to the FHIR server a unique ID is needed. What would be the most sensible way to handle this?
Searching didn't reveal an inherent way of doing this, so I'm considering these two options:
Add a prefix to all (or just the problematic) table IDs, e.g. lab-123 and vit-123
Add a UUID to every table and use that as the logical identifier
Both have drawbacks: an ID parser is necessary for the first one and the second requires multiple database calls to identify the correct record.
Is there a FHIR way that allows splitting a resource into several sub-resources, even in the REST URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
Server systems will have all sorts of different divisions of data in terms of how data is stored internally. What FHIR does is provide an interface that tries to hide those variations. So Observation/laboratory/123 would be going against what we're trying to do - because every system would have different divisions and it would be very difficult to get interoperability happening.
Either of the options you've proposed could work. I have a slight leaning towards the first option because it doesn't involve changing your persistence layer, and it's a relatively straightforward transformation to convert between the external/FHIR IDs and the internal ones.
Is there a FHIR way that allows to split a resource into several
sub-resources, even in the Rest URL? Ideally I'd get something like
GET server:port/Observation/laboratory/123
What would this mean for search? What would /Observation?code=xxx search through? Would that search labs, vitals, etc. combined, or would you just allow access on /Observation/laboratory?
If these are truly "silos", maybe you could use http://servername/lab/Observation (so swap the last two path parts), which suggests your server has multiple "endpoints" for the different observations. I think more clients will be able to handle that url than the url you suggested.
Still, I think the best approach is one of the two options you proposed yourself, of which the first is indeed the easiest to implement.
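If you do go with the prefixed IDs, the "ID parser" can stay a very thin mapping layer at the REST boundary. A minimal sketch, with the prefixes and internal table names invented for illustration:

```python
# Maps a prefixed FHIR logical ID like "lab-123" to (internal table, numeric ID) and back.
PREFIX_TO_TABLE = {
    "lab": "laboratory_values",   # assumed internal table names
    "vit": "vital_values",
    "dia": "diagnoses",
}
TABLE_TO_PREFIX = {table: prefix for prefix, table in PREFIX_TO_TABLE.items()}

def parse_fhir_id(fhir_id: str) -> tuple[str, int]:
    """Split a prefixed logical ID ("lab-123") into (table, primary key)."""
    prefix, _, raw_id = fhir_id.partition("-")
    if prefix not in PREFIX_TO_TABLE or not raw_id.isdigit():
        raise ValueError(f"Not a valid Observation id: {fhir_id!r}")
    return PREFIX_TO_TABLE[prefix], int(raw_id)

def build_fhir_id(table: str, primary_key: int) -> str:
    """Build the external logical ID from an internal table/primary-key pair."""
    return f"{TABLE_TO_PREFIX[table]}-{primary_key}"

# GET /Observation/lab-123  ->  ("laboratory_values", 123)
print(parse_fhir_id("lab-123"))
print(build_fhir_id("vital_values", 123))   # -> "vit-123"
```

The same two functions cover search results and references, so the prefix never leaks into your persistence layer.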

I want to crawl all messages of every group on Yammer (including All Company Group)

We are trying to crawl all messages of every group on Yammer (including the All Company group) using https://www.yammer.com/api/v1/messages.json?group_id=<>&access_token=<>, but it's giving me duplicates and I am also not getting the complete set of messages. Is there any way to do this?
Is there any way to get users who joined Yammer after a specific date?
Any sort of help is appreciated.
The best way to get this information is to use the Data Export API. This API is available to paid networks and outputs a ZIP file of CSV files containing all messages and a list of users. You can pass a parameter called "since" to this API and it will only provide data from that point in time onwards. The users.csv file also includes a joined-at date.
If you attempt to iterate over the messages API you will hit some limits. These limits are technical in nature, and you would need to fall back to the search API to find much older messages. Unfortunately you will have to put up with these limitations if you are on the free version of Yammer, as the data export is only available with the paid version.
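A rough sketch of pulling the export and reading the join dates follows; the exact endpoint, parameters and CSV column names should be verified against the data export documentation for your network, and the token must belong to a verified admin.

```python
import csv
import io
import zipfile
import requests

ACCESS_TOKEN = "YOUR_VERIFIED_ADMIN_TOKEN"          # placeholder
EXPORT_URL = "https://www.yammer.com/api/v1/export" # assumption: export endpoint for your network

# Fetch everything created since the given date as a ZIP of CSV files.
resp = requests.get(
    EXPORT_URL,
    params={"since": "2016-01-01T00:00:00+00:00"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()

archive = zipfile.ZipFile(io.BytesIO(resp.content))
with archive.open("users.csv") as f:
    for row in csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")):
        # The users CSV carries a joined-at column; the exact header name
        # ("joined_at" here) should be checked against your export.
        print(row.get("id"), row.get("joined_at"))
```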
I achieved this a different way. I used the export API to get a list of all of the groups.
https://export.yammer.com/api/v1/export?model=Group&access_token=
Then I looped through the list of groups and pulled all of the message data for each group and combined them into one *.json
https://www.yammer.com/api/v1/messages/in_group/###.json
Where ### is the group ID extracted from the groups export data.
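In Python, that loop might look roughly like the sketch below. The group IDs and the paging via older_than are illustrative, and you will still need to respect the rate limits mentioned above.

```python
import json
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"                 # placeholder
BASE = "https://www.yammer.com/api/v1"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def fetch_group_messages(group_id):
    """Page backwards through a group's messages using older_than."""
    messages, older_than = [], None
    while True:
        params = {"older_than": older_than} if older_than else {}
        resp = requests.get(f"{BASE}/messages/in_group/{group_id}.json",
                            headers=HEADERS, params=params)
        resp.raise_for_status()
        batch = resp.json().get("messages", [])
        if not batch:
            break
        messages.extend(batch)
        older_than = batch[-1]["id"]               # oldest message in this batch
    return messages

# group_ids would come from the groups data in the export described above.
group_ids = [12345, 67890]                         # placeholders
all_messages = {gid: fetch_group_messages(gid) for gid in group_ids}
with open("yammer_messages.json", "w") as f:
    json.dump(all_messages, f)
```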

Redis multiple requests

I am writing a very simple social networking app that uses Redis.
Each user has a sorted set that contains ids of items in their feed. If I want to display their feed, I do the following steps:
use ZREVRANGE to get ids of items in their feed
use HMGET to get the feed (each feed item is a string)
But now I also want to know whether the user has liked a feed item or not. So I have a set associated with each feed item that contains the IDs of users who have liked it.
If I get 15 feed items, I now have to execute an additional 15 requests to Redis to find out, for each feed item, whether the current user has liked it (by checking if their ID exists in that item's set).
So that will take 15+1 requests.
Is this type of querying considered 'normal' when using Redis? Are there better ways I can structure the data to avoid this many requests?
I am using redis-rb gem.
You can easily refactor your code to collapse those 15 requests into one by using pipelining (which redis-rb supports).
You get the IDs from the sorted set with the first request, then use them to fetch the many keys you need in a single pipelined batch based on those results.
With this approach you end up with 2 requests in total instead of 16, and your code stays quite simple.
As an alternative you can use a Lua script and fetch everything in one request.
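The question uses redis-rb, which exposes the same idea through its pipelined block; here is a sketch of the shape of it using Python's redis-py, with invented key names:

```python
import redis

r = redis.Redis()
user_id = "42"                                    # the viewing user (placeholder)

# Round trip 1: IDs of the 15 most recent feed items (key names are made up).
item_ids = r.zrevrange(f"feed:{user_id}", 0, 14)

# Round trip 2: the item bodies plus one membership check per item, pipelined.
pipe = r.pipeline()
pipe.hmget("feed_items", item_ids)                # all 15 bodies in one HMGET
for item_id in item_ids:
    pipe.sismember(f"item:{item_id.decode()}:likers", user_id)
results = pipe.execute()

bodies, liked_flags = results[0], results[1:]
feed = [
    {"item": body, "liked_by_me": bool(flag)}
    for body, flag in zip(bodies, liked_flags)
]
```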
With this kind of (non-relational) database, you have to make a trade-off between issuing multiple requests and accepting some data redundancy.
You should analyze each case separately and consider aspects such as:
How frequently will this data be accessed?
How much space will the redundancy consume?
How many requests would I have to make to get all the data without the redundancy?
Is performance an issue?
In your case, I would suggest keeping a set/hash, or just JSON-encoded data, for each user with a history of all their recent interactions, such as comments, likes, etc. Every time the user accesses their feed you just have to read the feed and that history; only two requests.
One thing to keep in mind: on every user interaction you must update all of the redundant data as well.
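A sketch of that redundancy, with invented key names: write every like to both the item's set and a per-user set, then read just the feed plus that one set when rendering.

```python
import redis

r = redis.Redis()

def like(user_id, item_id):
    """Write the like to both sides so either lookup stays a single request."""
    pipe = r.pipeline()
    pipe.sadd(f"item:{item_id}:likers", user_id)   # who liked this item
    pipe.sadd(f"user:{user_id}:liked", item_id)    # what this user liked (the redundancy)
    pipe.execute()

def render_feed(user_id):
    item_ids = r.zrevrange(f"feed:{user_id}", 0, 14)
    liked = r.smembers(f"user:{user_id}:liked")    # one extra request, regardless of feed size
    return [
        {"id": item_id, "liked_by_me": item_id in liked}
        for item_id in item_ids
    ]
```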
