Is there an efficient indexed persistent data structure with multiple indexes? - data-structures

I am looking for an efficient indexed persistent data structure. I typically work in .NET and am aware of F#'s Map; however, that implementation and most others I am aware of only provide a single 'index', the left side of the mapping.
Basically here is the scenario
public class MyObject
{
    public int Id { get; }
    public int GroupId { get; }
    public string Name { get; }
}
Where the Id of an object will be globally unique across the set of items added. GroupId may have duplicate values, and I want to be able to query for all values with a matching GroupId. Within a GroupId, names will be unique but may be duplicated across different GroupIds. This is not a situation where I can simply create a composite key of the 3 fields, as I need independent access to groups of the items based on particular field values.
I can do this, and have in the past, using dictionaries of dictionaries, which has been recommended in other posts here on Stack Overflow... however, I also want the data structure to be:
1) fully persistent, with everything that implies
2) efficient in memory - versions need to share as many nodes as possible
3) efficient in modifications - I would like it to be fast
I realize that I am asking for quite a bit here but I wanted to ask to avoid even trying to re-invent the wheel if it has already been done.
Thanks

I am not sure why elsewhere, and in existing replies to your question, people recommend imbricating (nesting) existing structures. Imbricating structures (maps of maps, maps of lists, dictionaries of dictionaries, ...) only works for two indexes if one is looser than the other (two values having the same index for Index1 implies these two values have the same index for Index2), which is an unnecessary constraint.
I would use a record of maps, with as many maps as you want distinct indexes, and I would maintain the invariant that every value present in one map is present in all the others in the same record. Adding a value obviously requires adding it to all the maps in the record; similarly for removal. The invariant can be made impossible to transgress from the outside through encapsulation.
If you worry that the values stored in your data structure would be duplicated, don't. Each map would only contain a pointer. They would all point to the same single representation of the value. Sharing will be as good as it already is with simple single-indexed maps.
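A minimal sketch of this record-of-maps idea in F#, assuming MyObject is an immutable record with the three fields from the question. The IndexedStore, add and findGroup names are made up for illustration, and here the group index is itself a small map from Name to value, since names are unique within a group.

type MyObject = { Id: int; GroupId: int; Name: string }

type IndexedStore =
    { ById: Map<int, MyObject>                    // unique index on Id
      ByGroup: Map<int, Map<string, MyObject>> }  // GroupId -> (Name -> value)

module IndexedStore =
    let empty = { ById = Map.empty; ByGroup = Map.empty }

    // A value added to one index is added to all of them, so the
    // "present in one map => present in every map" invariant holds.
    let add (o: MyObject) (s: IndexedStore) =
        let group =
            s.ByGroup
            |> Map.tryFind o.GroupId
            |> Option.defaultValue Map.empty
            |> Map.add o.Name o
        { ById = Map.add o.Id o s.ById
          ByGroup = Map.add o.GroupId group s.ByGroup }

    let tryFindById id (s: IndexedStore) = Map.tryFind id s.ById

    let findGroup groupId (s: IndexedStore) =
        Map.tryFind groupId s.ByGroup |> Option.defaultValue Map.empty

Because F#'s Map is persistent, each add returns a new IndexedStore that shares almost all nodes with the previous version, and both indexes reference the same MyObject value rather than copies.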

Just as you could use a Dictionary of Dictionaries, I expect that an F# Map of Maps may be what you want, e.g.
Map<int, Map<string, MyObject>>  // int is groupid, string is name
maybe? I am unclear if you also need fast access by integer id.
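For example, a hedged sketch of using such a nested map directly (the byGroup name and the sample values are made up):

let byGroup : Map<int, Map<string, string>> =   // GroupId -> (Name -> payload)
    Map.ofList
        [ 1, Map.ofList [ "alpha", "A1"; "beta", "B1" ]
          2, Map.ofList [ "alpha", "A2" ] ]

// All names in group 1 (empty map if the group is unknown):
let group1 = byGroup |> Map.tryFind 1 |> Option.defaultValue Map.empty

// One specific value, if present:
let alphaIn2 = byGroup |> Map.tryFind 2 |> Option.bind (Map.tryFind "alpha")

This gives fast access by GroupId and then Name, but not by Id on its own, so a second map keyed by Id would still be needed for that.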
You might also check out Clojure's library; I don't know much about Clojure, but a range of efficient persistent data structures seems to be one of Clojure's strengths.

It seems that you are trying to apply OOP principles to your FP application.
If you think in terms of functions, what is it you are trying to do?
If you use a List, for example, you can simply filter it for all the objects that have a certain group value.
If you need fast access by group you could have a Map of Lists so you can pull up all the objects in a group.
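A hedged sketch of that Map-of-lists idea, assuming the same MyObject shape as in the question (the byGroup and inGroup names are illustrative):

type MyObject = { Id: int; GroupId: int; Name: string }

// Build a persistent map from GroupId to all objects in that group.
let byGroup (items: MyObject list) : Map<int, MyObject list> =
    items
    |> List.groupBy (fun o -> o.GroupId)
    |> Map.ofList

// All objects in a given group (empty list if the group is unknown):
let inGroup groupId items =
    byGroup items |> Map.tryFind groupId |> Option.defaultValue []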
There are different data structures and many functions that work on each, but you should first think about your problem from a functional, not object-oriented, POV.

Related

Are IDs guaranteed to be unique across indices in Elasticsearch 6+?

With mapping types being removed in Elasticsearch 6.0, I wonder if IDs of documents are guaranteed to be unique across indices?
Say I have three indices, all with a "parent" field that contains an ID. Do I need to include which index the ID belongs to or can I just search through all three indices when looking for a document with the given ID?
IDs are not unique across indices.
If you want to refer to a document you need to know both the index name and the ID.
Explicit IDs
If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.
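A hedged illustration of that point (not part of the original answer): indexing two documents with the same explicit ID into two different indices through the Elasticsearch REST API, shown as a small F# script using HttpClient; the index names and the localhost URL are assumptions.

open System.Net.Http
open System.Text

let client = new HttpClient()

// PUT http://localhost:9200/{index}/_doc/{id} with a JSON body.
let putDoc (index: string) (id: int) (json: string) =
    let url = sprintf "http://localhost:9200/%s/_doc/%d" index id
    use content = new StringContent(json, Encoding.UTF8, "application/json")
    client.PutAsync(url, content).Result

// Both calls succeed: an ID only has to be unique within its own index.
putDoc "articles" 1 """{ "title": "first" }""" |> ignore
putDoc "users"    1 """{ "name": "Jeff" }"""   |> ignore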
Autogenerated IDs
If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.
To generate the same ID, a specific random number has to be picked when the JVM starts, and the document ID must be generated at a specific moment with sub-millisecond precision. So while the chance exists, it's so small that I wouldn't care about it (just like I wouldn't care about collisions when using a hash function to check file integrity).
Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
Article {
  Id: 1,
  Category: "History",
  Title: "War stories",
  // Comments could be pretty long and also be changed frequently
  Comments: "Nice narration, Reminds me of the difficult Times, Tough Decisions",
  Tags: "truth, reality, history",  // Might change frequently
  UserSpecifiedNotes: [
    // The array may contain different users for different articles
    {
      userid: 20,
      note: "Good for work"
    },
    {
      userid: 22,
      note: "Homework is due for work"
    }
  ]
}
After having gone through different articles, denormalization of the data is one of the ways to handle it. But since the common fields could be pretty long and may even change frequently, I would rather not repeat them. What other, better ways are there to represent and search this data? Parent-child? Inner objects?
Currently, I would be dealing with a lot of inserts, updates and few searches. But whenever search is to be done, it has to be very fast. I am using NEST (.net client) for using elastic search. The search query to be used is expected to work as follows:
Input: searchString and a userID
Behavior: the articles containing searchString in either the Title, Comments, Tags, or the note for the given userID, sorted in order of relevance
In a normal scenario the main contents of the article will be changed very rarely, whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for a parent-child relation.
With an inner object you still have to reindex all of the "main article" and "UserSpecifiedNotes"/comments every time a new note comes in. With a parent-child relation you will just be adding a new note.
With the details you have specified, you can take the approach of 4 indices:
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that, what needs to be kept in mind is your actual requirement. A parent-child relation will need more memory and may slow down search performance a tiny bit, but indexing will be faster.
On the other hand, a nested object will increase your indexing time significantly, as you need to collect all the data related to an article before indexing. You can of course store everything and just add new notes as updates. For simpler maintenance and ease of implementation, I would suggest using parent-child.

Is there a way to efficiently find all unique values for a field in MongoDB?

Consider a collection of Users:
{ name: 'Jeff' }
{ name: 'Joel' }
Is there a way to efficiently get all the unique values for name?
User.pluck(:name).uniq
To return
[ 'Jeff', 'Joel' ]
I think this would get the whole collection, so it would be inefficient.
However, if there is an index on name, is there a way to get all the unique values without getting all the documents?
Or is there another way to efficiently get the unique names?
As indicated in the comments, you can efficiently get the unique values of a field over all docs in a collection using distinct.
The documentation specifically mentions that indexes are used when possible, and that they can cover the distinct query. This means that only the supporting index needs to be loaded into memory to get the results.
When possible, db.collection.distinct() operations can use indexes. Indexes can also cover db.collection.distinct() operations. See Covered Query for more information on queries covered by indexes.
In Ruby, you would perform your distinct query as:
User.distinct(:name)

ElasticSearch: Performance Implications of Multiple Types in the same Index

We are storing a handful of polymorphic document subtypes in a single index (e.g. let's say we store vehicles with subtypes of car, van, motorcycle, and Batmobile).
At the moment, there is >80% commonality in fields across these subtypes (e.g. manufacturer, number of wheels, ranking of awesomeness as a mode of transport).
The standard case is to search across all types, but sometimes users will want to filter the results to a subset of the subtypes (e.g. find only cars with...).
How much overhead (if any) is incurred at search/index time from modelling these subtypes as distinct Elasticsearch types vs. modelling them as a single type, using some application-specific field to distinguish between subtypes?
I've looked through several related answers already, but can't find the answer to my exact question.
Thanks very much!
There shouldn't be any noticeable overhead.
If you keep everything under the same type, you can filter results by a subtype by adding a "class" field on your objects and adding a condition on this field in your search.
A good reason to model your different classes into different ES types is if there can be a conflict between type of fields with the same name.
That is, assume your "car" class has a "color" field that holds an integer, while your "van" class also has a "color" field, but that one is a string. (Stupid example, I know; I didn't have a better idea.)
Elasticsearch holds the mapping (the data "schema") for a type. So if you index both "car" and "van" under the same type, you will have a field type conflict: a field in a type can have only one specific type, and if you set the field as integer and then try to index a string into it, it will fail.
This is one of the main guidelines on how to use Elasticsearch types - treat the type as a specific data schema that can't have conflicts.

VB.NET Dictionary.Add method index

When I call mydictionary.Add(key, object), can I guarantee that the object is added to the end of the dictionary in the sense that mydictionary.ElementAt(mydictionary.Count - 1) returns that object? I'm asking because I'm used to Java where HashMap doesn't have any order at all.
I'm hoping to use the ordering given by ElementAt as a way of knowing the order in which objects were added to the dictionary without using a separate data structure.
Update: Looks like ElementAt isn't going to be of any use. Is the best way to do this to use a separate data structure to store the ordering that I need?
Thanks
There is no order to a dictionary. The ElementAt method is a LINQ extension method that iterates over the dictionary using IEnumerable until it reaches the requested position; there is no relation to the order things were added.
There is a SortedDictionary, which will sort things by key, but will not keep them in the order they were added in.
If the order is really important you could always have two data structures: a list that you add the object to and a dictionary that stores the key-to-list-index mapping. Or put a field inside your object that is set from a counter as you add it to the dictionary.
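A rough sketch of that two-structure suggestion, written in F# for consistency with the rest of this page (translate to VB.NET as needed); the OrderedMap name is made up:

open System.Collections.Generic

type OrderedMap<'K, 'V when 'K : equality>() =
    let items = ResizeArray<'K * 'V>()   // entries in insertion order
    let index = Dictionary<'K, int>()    // key -> position in items

    member this.Add(key: 'K, value: 'V) =
        index.Add(key, items.Count)      // throws if the key already exists
        items.Add((key, value))

    member this.Count = items.Count

    // i-th entry in the order it was added
    member this.ElementAt(i: int) = items.[i]

    // lookup by key, as with a plain dictionary
    member this.Item with get (key: 'K) = snd items.[index.[key]]

ElementAt here reflects insertion order by construction; removals would need extra bookkeeping, which is why the counter-field alternative is also worth considering.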
