Map multiple values to a unique column in Elasticsearch

I want to use Elasticsearch to process some WhatsApp chats, so I am planning the initial data load.
The problem is that the data exported from WhatsApp doesn't contain a real unique id per user; it only contains the user's name as taken from the contact directory of the device where the chat was exported (i.e., a user can change their number, or have two numbers in the same group).
Because of that, I need to create a custom explicit mapping table between user names and a self-generated unique id, which then gets populated into an additional column.
So my question is: "How can I implement this kind of explicit mapping in Elasticsearch to generate an additional unique column?". Alternatively, a valid answer could be a totally different approach to the problem.
P.S. As I write this, I suspect the solution lies in the ingestion process, e.g. in a Python script, but I still want to post the question to understand whether this is something Elasticsearch can do by itself.

Yes, do it during the indexing process.
If you store the data that maps names to ids in a separate index, you can do this with an enrich processor: when you index the data, an ingest pipeline adds whichever value you want to each document.
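A minimal sketch of that approach with the Python client (all index, field, policy, and pipeline names here are invented for the example):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Lookup index: one document per user name -> self-generated id
es.index(index="user-mapping", document={"user_name": "Alice", "user_id": "u-0001"})
es.indices.refresh(index="user-mapping")

# Enrich policy that matches on user_name and copies user_id
es.enrich.put_policy(
    name="user-id-policy",
    match={
        "indices": "user-mapping",
        "match_field": "user_name",
        "enrich_fields": ["user_id"],
    },
)
es.enrich.execute_policy(name="user-id-policy")

# Ingest pipeline with an enrich processor
es.ingest.put_pipeline(
    id="add-user-id",
    processors=[{
        "enrich": {
            "policy_name": "user-id-policy",
            "field": "user_name",
            "target_field": "user",
        }
    }],
)

# Index chat messages through the pipeline; each document gains user.user_id
es.index(
    index="chats",
    pipeline="add-user-id",
    document={"user_name": "Alice", "message": "hi"},
)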
Also note that Elasticsearch doesn't have columns, only fields.

Related

Using an input field in FileMaker that is not related to any table?

I need to enter a few data points in the UI of a FileMaker app that are used either for search or for computation, but that have no relation to any field in a database (and don't need to be saved). So I want to add an input field without having it tied to a table field, and it seems that's something FileMaker just doesn't do.
Two use cases:
a) I want a custom search/filter interface instead of using the FileMaker one. My users should see two calendars, pick two dates, and have the data filtered to the range between them, as well as by additional criteria that don't directly translate to field searches. I know I can use "startdate ... enddate", but I'd like a more user-friendly interface.
b) Users enter a few data points into separate fields, which are then computed and combined into one database field by a script. This is technical data that is entered by copy-and-paste and needs a bit of parsing before I put it into the database. Again, I'd like a field that isn't related to the database, with a script trigger on it, so that when data is entered there, it is parsed and put into the actual DB fields.
Is it possible at all to have input fields not related to a database in FileMaker?
If not, what's the best practice? I thought about setting up a dummy table with various fields I can use, but maybe there's a better way?
You should read up on global fields. They can be in any table and are accessible from all tables. They do not retain their value after the session is closed if the file is hosted. Use a script to perform a search based on what the user types in the global field.
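As a rough sketch of the search case, assuming global fields gStartDate and gEndDate in a Globals table and a date field in the Clients table (all names invented), the script could look something like this:

Enter Find Mode [Pause: Off]
Set Field [Clients::OrderDate; Globals::gStartDate & "..." & Globals::gEndDate]
Perform Find []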

Use one QuickSight dashboard (created from one analysis) for different data sets

I have a multi-user website, and each user has their own data, which I can store on S3.
I want to integrate (embed) QuickSight into my website in such a way that each user is able to see only their own data.
I also want to have a single analysis, so that I can modify it for all users at once.
Are there any recommendations on how to achieve this?
Firstly, you will need to add the user's identifier (email, name, generated ID, whatever) to each row that belongs to them in the S3 data. I'm kind of assuming that you are storing the data in a tabular format (e.g. CSV) but let me know if I'm wrong. So let's assume you added this user identifier as a new column called userId.
Secondly, you will need to generate a manifest file that points to all of your users' S3 files.
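A minimal manifest could look something like this (bucket and prefix invented):

{
  "fileLocations": [
    { "URIPrefixes": [ "s3://your-bucket/user-data/" ] }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "containsHeader": "true"
  }
}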
Then, create a new data set, pointing to that manifest.
Then, you will need to create another new data set that ties a QuickSight UserName to the new userId column you have added. You will need to maintain this data set somehow, but fortunately the QuickSight UserName has a pattern to it (something like embed_role\user_name).
An example of this new data set might look like
UserName,userId
your_embed_role\user3479125,user3479125
Once you have this data set, you can attach it to the S3 data set created earlier as row-level security (RLS). You can think of QuickSight as performing an inner join on userId between the RLS data set and the actual visual data set, thus limiting the data to the given UserName.

DynamoDB Throughput vs Search time

I've just discovered a big mistake I made when creating my DynamoDB structure.
I've created 11 tables, where one of them is the table most frequently referred to and the others are complementary tables.
For example, I have a table called "Names" where I hold names (together with other info), and another table called "NamesMappings" holding all the names added to the "Names" table. Each time a user wants to add a name to the "Names" table, he first tries to put the name into "NamesMappings", and only if that succeeds (meaning the name doesn't already exist) does he add the name to the "Names" table. This procedure helps because the name is not unique and is not the primary key in the "Names" table; with this technique I don't have to search inside the "Names" table to see whether the name exists. Instead, I try to add it to the "NamesMappings" table, and only if that succeeds do I know it is a unique name.
First of all, I would like to ask: is this a common approach, or is there a better one?
Next, I realized that with this design I soon ended up with 11 tables, each with 5 provisioned read and write capacity units, which adds up to 55 provisioned reads and writes against the free tier. Then I understood why I keep getting these charges each month: as the number of tables grows, and I leave the provisioned capacity at its default (5 for both read and write), I accumulate more and more provisioned capacity.
So, what should my conclusion be? Should I try to reduce the number of tables, even if it takes more effort to perform scans and queries inside a table? Or should I keep splitting tables as I do now, but reduce the capacity of the mapping tables that are used only to indicate whether an item exists in another table?
If I understand your problem correctly, you're missing the whole concept of NoSQL databases.
Your Names table should have a Hash key (which is similar to a primary key) holding a uniformly generated identifier (a UUID is a great candidate). This automatically makes the table queryable by this unique identifier. You said, however, that you don't know the id, only the name. This leads me to think you could create a Global Secondary Index (GSI) on the name attribute of the Names table so you can also query by name. Up to this point, your table structure should look like this:
id | name
Both of them are independently queryable, which gives you a lot of flexibility already.
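For illustration, this is roughly how that table and GSI could be defined with boto3 (index and attribute names are just examples):

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="Names",
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "name", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[{
        "IndexName": "name-index",
        "KeySchema": [{"AttributeName": "name", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)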
Now, let's say you want to add the NameMapping attribute (whose shape I don't know): you can simply add it inside the Names table, getting rid of the NamesMappings table and greatly reducing the number of WCUs and RCUs across your account. Your table structure should now look like this:
id | name | mappings
where mappings is, let's say, a JSON object.
Since you can only query on top-level attributes in DynamoDB, you can now perform a query against the name attribute, which has a GSI configured. If the query returns nothing, then name is unique. If you still need some data from inside the mappings object, you can query by name and, in your code, apply a map/filter/reduce operation on the mappings attribute and decide what to do next.
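A sketch of that uniqueness check with boto3, reusing the hypothetical name-index GSI from above:

import boto3
from boto3.dynamodb.conditions import Key

names = boto3.resource("dynamodb").Table("Names")

resp = names.query(
    IndexName="name-index",
    KeyConditionExpression=Key("name").eq("Chuck Norris"),
)
if not resp["Items"]:
    # No item with this name yet, so it is unique.
    # Note: query-then-put is not atomic; two concurrent writers could race.
    names.put_item(Item={"id": "a-generated-uuid", "name": "Chuck Norris"})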
Remember that duplication is just fine in a NoSQL world. This may look scary if you come from a purely SQL background, but in NoSQL databases data should be stored in such a way that you can fetch all the needed information in one go, thereby avoiding "joins" (joins are still possible in a NoSQL database, but since there are no strong relationships between entities, you have to perform them manually at the code level). To give you some real context: imagine you have an Orders table where you keep track of the ordered Products and the Store that the Order belongs to. You'd save both the Products and the Store objects (and not just their ids, as you would the SQL way) inside the Order object, so if you want to query for a given OrderId in the future, you won't need to make extra calls (aka "joins") to the Product/Store tables to fetch the information, since everything is already stored inside the Order object.
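To make the denormalization concrete, a hypothetical Order item could be written like this (table and attribute names invented):

import boto3

orders = boto3.resource("dynamodb").Table("Orders")

orders.put_item(Item={
    "orderId": "o-1001",
    # Full objects are embedded instead of foreign keys:
    "store": {"storeId": "s-17", "name": "Downtown"},
    "products": [
        {"productId": "p-3", "name": "Coffee"},
        {"productId": "p-9", "name": "Mug"},
    ],
})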

ElasticSearch index per user?

I need to build a system using ElasticSearch.
Each user has their own documents, and those documents are visible only within that user's scope. No user's documents are accessible to any other user of the system.
The question is: what's the best approach, creating an index per user, or creating a single index containing the documents of all users?
Each user might have custom meta-information fields on their documents that other users don't have.
I know that in general a single index with user aliases is the proposed approach; however, I don't understand how to add this custom per-user document meta-information to that big index.
For example, imagine userA has two documents indexed and userB has three. My system has pre-defined meta-information such as filename and description, but it also allows each user to define their own custom meta-information: for example, userA might have a color meta-information field on their documents, and userB might have a size meta-information field on each document.
I understand one possibility would be to add a new field to the single index for each custom attribute; however, the number of fields could grow out of bounds.
What would be the best approach?
Thanks to all.
One index per user sounds like you'd run into trouble at some point: there is a per-index overhead that would become significant once you have a lot of users (say 10,000 or so).
I don't think you need this, though. You could allow custom attributes on a per-user basis by using nested fields: each nested object would have name and value properties (possibly multiple value properties), so you can have arbitrary searchable metadata on your documents without needing to change the mapping each time.
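A minimal sketch of such a mapping and query with the Python client (index and field names invented):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fixed system fields plus one nested field for arbitrary per-user attributes
es.indices.create(
    index="documents",
    mappings={
        "properties": {
            "owner": {"type": "keyword"},
            "filename": {"type": "keyword"},
            "description": {"type": "text"},
            "attributes": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            },
        }
    },
)

# userA's document with a custom "color" attribute
es.index(index="documents", document={
    "owner": "userA",
    "filename": "a.pdf",
    "attributes": [{"name": "color", "value": "red"}],
})

# Find userA's documents where color == red
es.search(index="documents", query={
    "bool": {
        "filter": [
            {"term": {"owner": "userA"}},
            {"nested": {
                "path": "attributes",
                "query": {"bool": {"filter": [
                    {"term": {"attributes.name": "color"}},
                    {"term": {"attributes.value": "red"}},
                ]}},
            }},
        ]
    }
})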

XPages: Can I filter a view to show only entries that belong to a group?

I have a view in an XPage with some entries (let's say clients). I have an ACL group of persons (clients) that contains some of the clients in the view. Now I want to use the view's search attribute to show only entries that belong to the group.
I already use the search attribute to select users by name, e.g.:
FIELD Name Contains "Chuck Norris"
Is there any similar query? (Maybe using @IsMember on the field...?)
UPDATE: I will also have the group entries (client names) in a text list in a document. So can I filter the "name" field of the view based on the values of a text list?
Perhaps using a reader field is a good idea. You're talking about restricting document access to a group of Domino users - that's exactly what reader fields are for.
For example, make your text list field containing client names into a reader field like this:
// get the item that holds the client names from the document
var item = document1.getFirstItem("myfield");
// mark the item as a Readers field: only the names it lists can see the document
item.setReaders(true);
document1.save();
myfield needs to contain canonical names (CN=firstname lastname/O=organisation).
Using reader fields, you don't need to do any view filtering at all; it happens automatically. If you have a really large number of documents (say, half a million or so), it could slow things down; otherwise, it's a nice approach.
If you only want to restrict which documents are displayed in one particular view, though, reader fields are not the solution. In that case, you need to do the view filtering yourself, as you tried.
If you want to filter for only ONE certain client, then using a categorized view is the way to go. You can then give the view panel the name of one client as a category filter.
If you want to filter for multiple clients, you need to do it with a fulltext search, just as you already tried. In that case, make sure you're working with Domino 9: previous Domino versions don't apply the view's sorting order to a fulltext search result, which means you would have to sort the result manually, using custom JavaScript or the like, which is complicated.
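For example, a multi-client query built from the names in your text list (names invented) could look like this:

FIELD Name Contains "Chuck Norris" OR FIELD Name Contains "Bruce Lee"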
Or, as Frantisek suggested, write a scheduled agent that puts documents into folders depending on their clients. Depending on the number of clients you want to filter the view for, this may lead to many folders, which may cause other problems; furthermore, you need to make sure to remove folders when they are no longer needed, and there is a lag before new documents appear in a folder.
So, in a nutshell: if you want an application-wide restriction based on client names, use reader fields.
If you want to restrict to one client name at a time, use categories.
Otherwise, use fulltext search with Domino 9.
