I want to find the most active users in a workspace that are interested in a certain subject eg : Jira.
I thought of either search who is most active in a specific channel or type certain words in chats.
Should I use the api, or is there already a tool for this or just a clever query that I have missed?
I have looked into : The search message api method. But I am not sure if it is overkill to write rest client for this use-case.
I guess I am looking for a aggregation method like SQL count by , group by having etc in the Slack Search query language .
I don't think you can use the search API because you need to search for something first to use it, when you need to basically index all the content yourself.
Probably the most direct way would be to search a channel your interested in, collecting messages from (and messages to) a person, counting the whole time by their unique id.
Related
What I have : Elastic search database for full text search purposes.
What my requirement is : In a given elasticsearch index, I need to detect some sensitive data like iban no, credit card no, passport no, social security no, address etc. and report them to the client. There will be checkboxes as input parameters. For instance, the client can select credit card no and passport no, then clicks detect button. After that, the system will start scanning index, and reports documents which include credit card no and passport no. It is aimed to have more than 200 sensitive data types, and clients will be able to make multiple selections over these types.
What I have done : I have created a C# application and used Nest library for ES queries. In order to detect each sensitive data type, I have created regular expressions and some special validation rules in my C# app which works well for manually given input string.
In my C# app, I have created a match all query with scroll api. When the user clicks detect button, my app is iterating all the source records which returns from scroll api,and for each record, the app is executing sensitive data finder codes based on client's selection.
The problem here is searching all source records in the ES index, extracting sensitive datas and preparing report as fast as possible with great amount of documents. I know ES is designed for full text search, not for scanning whole system and bringing data. However all data are in elasticsearch right now and I need to use this db to make detecting operation.
I am wondering if I can do this in a different and efficient way. Can this problem be solved with writing an elastic search plugin without a C# app? Or is there a better solution to scan the whole source data in ES index?
Thanks for suggestions.
Passport number, other sensitive information detection algorithm should run once, during indexing time, or maybe asynchronously as a separate job that will update documents with flags representing the presence of sensitive information. Based on the flag the relevant documents can be searched.
Search time analysis in this case will be very costly and should be avoided.
I want to use NLP with elasticsearch. I have been able to achieve one level by using Open NLP plugin as mentioned in comments of this question. I am getting entities like person, organization, location etc indexed while inserting documents.
I have a doubt while searching the same information.Since, I need to process the terms entered by the user during query time. Following is what I have thought of:
Process the query entered by user using apache NLP as specified here.
Extract Person, location and organisation Names from the previous and then run a query against the entities stored in index.
I am also thinking of using Google Knowledge Graph Search Api to fetch related information about the extracted entities in the previous steps and then include them in search query as well. (Reason to do this is because we want to show results of Delhi in case some one searches for Capital Of India). We are not going with Synonyms Search approach in this case as we want the information to be dynamically available.
My question is that-
Is there something better we can do to achieve the same, because lot of processing at query time is going to increase the response time?
I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Considering a system where companies (with multiple employees) can register and administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by elasticsearch itself? I.e. there is some mapping between the companies in my system and a related user for elasticsearch.
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support Authorization and Authentication, you need to add this via plugins, of which there are two that I know of. Shield is the official solution, which is part of the X-Pack and you need to pay Elastic if you want to use it. SearchGuard is an open source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read/write that index. Or if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results returned for their company. Depending on how many companies you expect to service this might make more sense in order to not have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
Without these plugins you would need to resort to something on the http-layer, for example an nginx reverse proxy that filters requests based on the index names contained in the urls or something, but I'd severely advise against this, lots of pain lies that way!
Is there a way to force a (query) filter to be executed for every query irrespective of whether or not it is present in the search query request? In my case, I have a native search script which is used to filter documents based on a dynamically changing list which is maintained outside of the elasticsearch instance. Since I do not control all the clients which query the server, I can't guarantee that they will do the filtering properly or add a reference to the script in the request and would therefore like to force the filter execution within the ES server itself. Is this (easily) achievable? (I am using ES 1.7.0/2.0)
TIA
If users can submit arbitrary requests to the cluster, then there is absolutely nothing that you can do to stop them from doing whatever they want to do.
You really only have two options here:
If users can select arbitrary queries/filters, but you control the index or indices that they go too, then you can use filtered aliases to limit what they can see.
Use Shield (not free) to prevent arbitrary access to limit what indices/aliases any given request can access (with aliases using filters to hide data).
Aliases are definitely the way to go. Create an alias per client if you need a different filter per client and ask him to talk to that alias.
What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when the indexing of the document will likely be expensive given the number of indexed field and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like vote/views should only be periodically updated in ES, while more critical fields like answers/questions be pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. of course, that method will require more memory/disk space, and it will take up more cpu/time when you query for totals. An alternative would be to have the parent store searchable fields, and let the child hold the metadata (where the child's fields are not analyzed). this will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing to index.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
I think best way to handle the change is to split the document (you can use Parent child relationship, or just have parent id), and make document as small as possible (moving changeable part to new types) .
This can be a way to accomplish your requirement say SO,
You can use multiple types for this, consider This post (Views and Vote count).
Create a type for post, view and vote.
For a post , index a document to post type (index post id, title description tag), and for every view of that post you can index a document to view type (with id of post), and if voted you can index vote with (no of votes , id of post and other info you need [like positive or negative flag] ) to vote type.
So, to get views for post, use filter of post id, and get document counts in views type
To get no of votes, use stat aggregation for no of votes , or terms aggregation followed by stat aggregation for getting positive and negative votes.
This is way I think is best, and there can be other opinion too.
Thanks
What I do is that I use a database like mongo or mysql for storing properties that get updated frequently and use elastic search to store documents for text searching.
Example: I want to keep data about a book and its contents and I also want to keep the total number of views, updating and reindexing the document each time a user views it is a total overkill.