Forcing filter execution in ElasticSearch - elasticsearch

Is there a way to force a (query) filter to be executed for every search request, whether or not the client includes it? In my case, I have a native search script that filters documents based on a dynamically changing list maintained outside of the Elasticsearch instance. Since I do not control all the clients that query the server, I can't guarantee that they will filter properly or add a reference to the script in their requests, so I would like to enforce the filter within the ES server itself. Is this (easily) achievable? (I am using ES 1.7.0/2.0.)
TIA

If users can submit arbitrary requests to the cluster, then there is absolutely nothing that you can do to stop them from doing whatever they want to do.
You really only have two options here:
If users can select arbitrary queries/filters, but you control the index or indices that they go to, then you can use filtered aliases to limit what they can see.
Use Shield (not free) to prevent arbitrary access to limit what indices/aliases any given request can access (with aliases using filters to hide data).

Aliases are definitely the way to go. Create an alias per client if you need a different filter per client, and ask each client to talk to that alias.
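As a sketch, a filtered alias can be set up with the _aliases endpoint. The index name, alias name, and client_id field below are made-up placeholders for illustration; the point is that the alias only exposes documents matching the filter, so the server enforces it on every query against the alias:

```python
import json

# Hypothetical index/alias/field names for illustration.
index_name = "documents"
alias_name = "client-a"

# Body for POST /_aliases: the alias only exposes documents
# whose client_id field matches this client.
aliases_body = {
    "actions": [
        {
            "add": {
                "index": index_name,
                "alias": alias_name,
                "filter": {"term": {"client_id": "client-a"}},
            }
        }
    ]
}

# Applying it requires a running cluster, e.g.:
#   import requests
#   requests.post("http://localhost:9200/_aliases", json=aliases_body)
print(json.dumps(aliases_body))
```

Any search sent to client-a is then transparently intersected with the term filter, regardless of what query the client submits.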

Related

Stormcrawler - how does the es.status.filterQuery work?

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses - DISCOVERED, FETCHED, ERROR, etc.
I was wondering if I could tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have es-conf.yaml set as follows:
es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"
Is that correct? How does SC make use of es.status.filterQuery? Does it run a search and apply the value as a filter, so that only the applicable documents are retrieved for fetching?
See the code of the AggregationSpout.
how does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
yes, it filters the queries sent to the ES shards. This is useful for instance to process a subset of a crawl.
It is a positive filter, i.e. the documents must match the query in order to be retrieved; you'd need to remove the leading - for it to do what you described.
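With the leading - removed, a positive form that matches only https URLs with status DISCOVERED would look like this (a sketch against the same es-conf.yaml key shown above):

```yaml
# es-conf.yaml: positive filter, no leading "-",
# so only matching status documents are selected for fetching.
es.status.filterQuery: "(url:https* AND status:DISCOVERED)"
```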

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Considering a system where companies (with multiple employees) can register and administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by Elasticsearch itself? I.e., is there some mapping between the companies in my system and a related user in Elasticsearch?
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support authorization and authentication; you need to add this via plugins, of which there are two that I know of. Shield is the official solution, which is part of the X-Pack, and you need to pay Elastic if you want to use it. Search Guard is an open-source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read/write that index. Or if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results returned for their company. Depending on how many companies you expect to service this might make more sense in order to not have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
Without these plugins you would need to resort to something at the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs, but I'd strongly advise against this; lots of pain lies that way!
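Separately, the backend-only approach from the question (a single trusted backend that scopes every query) can be sketched as follows. The company_id field name is an assumption; the idea is that the backend wraps whatever the user asked for in a bool query with a mandatory term filter before it ever reaches Elasticsearch:

```python
def scope_query(user_query: dict, company_id: str) -> dict:
    """Wrap an arbitrary user query so it can only match
    documents belonging to the given company."""
    return {
        "query": {
            "bool": {
                # The user's original query still scores/filters as usual...
                "must": [user_query],
                # ...but results are always restricted to this company.
                "filter": [{"term": {"company_id": company_id}}],
            }
        }
    }

# Example: a user of company "acme" searches for "invoice".
body = scope_query({"match": {"text": "invoice"}}, "acme")
```

This only works if the backend is the sole client of the cluster; if users can reach Elasticsearch directly, nothing stops them from bypassing the wrapper.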

Using ElasticSearch Bulk to update and create documents dynamically?

I'm currently using Elasticsearch and running a cron job every 10 minutes that finds newly created/updated data in my DB and syncs it with Elasticsearch. However, I want to use bulk to sync instead of making an arbitrary number of requests to update/create documents in an index. I'm using the elasticsearch.js library created by Elasticsearch.
I face 2 challenges that I'm uncertain about how to handle:
How to use bulk to update a document if it exists and create it if it doesn't, without knowing whether it exists in the index.
How to format a large amount of JSON to run through bulk, because the bulk API expects the body to be formatted in a certain way.
The best option when trying to stream data in from an SQL database is to use Logstash's JDBC input (see the documentation). It can hopefully just do it all for you.
Not all SQL schemas make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create it if it doesn't, without knowing whether it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index, is the most commonly used option. It means that you want to index (the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id, then it will replace it. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual request that it's associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail if it does exist".
How to format a large amount of JSON to run through bulk, because the bulk API expects the body to be formatted in a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.
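For reference, the bulk body is newline-delimited JSON: one action line, followed (for index, create, and update) by one payload line, with a trailing newline at the end. A minimal sketch of building it by hand, with invented index/type/field names (the clients mentioned above do this for you):

```python
import json

# Invented sample documents with their target _id values.
docs = [
    {"_id": "1", "title": "first"},
    {"_id": "2", "title": "second"},
]

lines = []
for doc in docs:
    doc = dict(doc)
    _id = doc.pop("_id")
    # Action line: "index" creates the document or replaces an
    # existing one with the same _id. The _type is required in
    # the 1.x/2.x era and is a placeholder here.
    lines.append(json.dumps({"index": {"_index": "myindex", "_type": "doc", "_id": _id}}))
    # Source line: the document body itself.
    lines.append(json.dumps(doc))

# The bulk API requires the body to end with a newline.
bulk_body = "\n".join(lines) + "\n"
```

If you need partial updates rather than full replacement, the update action with "doc_as_upsert": true behaves similarly: it patches an existing document or creates it from the patch when it is missing.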

(Oracle DSEE) LDAP browsing index with parameters in the vlvFilter

I'm running into some problems creating a browsing index for VLV searches.
The Oracle docs (https://docs.oracle.com/cd/E19693-01/819-0995/bcatq/index.html) state that
The vlvFilter is the same LDAP filter that is used in the client search operations.
The filter I am using for the VLV searches however is parameterised, e.g.
(&(objectclass=MySpecialObjectClass)(modificationTimestamp>=$someDynamicValue))
So any ideas what should be put in the vlvFilter attribute for this browsing index?
Thanks!
You cannot use a VLV index for filters that are constantly changing. VLV indexes are meant to provide consistent scrolling lists of results.
A VLV index must also come with a sort order.
It looks to me like you may just need to index modifyTimestamp (for ordering) and use the Simple Paged Results control to avoid getting all entries at once.

How to create an index from search results, all on the server?

I will be getting documents from a filtered query (quite a lot of documents), and will then immediately create an index from them (in Python, using requests to query the REST API directly), without any modification.
Is it possible to make this operation directly on the server, without the round-trip of data to the script and back?
Another question was similar in intent, and the only answer was to go via Logstash (equivalent to using my own code, though possibly more efficient).
refer http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
In short, what you need to do is:
1.) ensure you have _source set to true
2.) use the scan and scroll API, passing your filtered query with search type scan
3.) fetch documents using the scroll id
4.) bulk index the results using the _source field, which returns the JSON that was used to index the data
refer:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
guide/en/elasticsearch/guide/current/bulk.html
guide/en/elasticsearch/guide/current/reindex.html
ES 2.3 has an experimental feature that allows reindexing from a query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
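With the _reindex API the whole operation stays server-side, which is exactly the round-trip avoidance asked about. A sketch of the request body (the index names and the query are placeholders, not from the original question):

```python
import json

# Body for POST /_reindex: copy only the documents matching
# the filtered query from "source-index" into "dest-index",
# entirely on the server.
reindex_body = {
    "source": {
        "index": "source-index",
        "query": {"term": {"status": "published"}},
    },
    "dest": {"index": "dest-index"},
}

# Sending it requires a running cluster on ES >= 2.3, e.g.:
#   import requests
#   requests.post("http://localhost:9200/_reindex", json=reindex_body)
print(json.dumps(reindex_body))
```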
