Cardinality "set" property in gremlin is ordered or unordered? - set

Cardinality "set" property in apache tinkerpop gremlin is ordered or unordered?
List is not supported in AWS Neptune, but I need to maintain insertion order (or) save & retrieve the items in specific order.

TinkerPop doesn't define an insertion order for properties with set cardinality; it is entirely up to the server implementation.
If you want the properties returned in a particular order, I would recommend using the order() step explicitly. That way, irrespective of which server implementation you run your query against, the result will always be the same.
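For example, a minimal sketch, assuming a vertex property named "skill" stored with set cardinality (the property name and vertex id are illustrative):
// return the values of the set-cardinality property in a deterministic, sorted order
g.V(vertexId).values('skill').order()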

Related

Does Elasticsearch have a Default Sort Order for Filter Queries?

Does Elasticsearch have a defined default sort order for filter queries if none is specified? Or is it more like an RDBMS without an order by - i.e. nothing is guaranteed?
From my experiments I appear to be getting my documents back in order of their id - which is exactly what I want - I am just wondering if this can be relied on?
When you only have filters (i.e. no scoring) and no explicit sort clause, then the documents are returned in index order, i.e. implicitly sorted by the special field named _doc.
Index order simply means the sequential order in which the documents have been indexed.
If your id is sequential and you've indexed your documents in the same order as your id, then what you observe is correct, but it might not always be the case.
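If you want that ordering to be explicit rather than implicit, you can sort on _doc yourself. A minimal sketch, with illustrative index and field names:
POST /my_index/_search
{
  "query": { "bool": { "filter": { "term": { "status": "active" } } } },
  "sort": [ "_doc" ]
}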
No, the order cannot be relied on (in ES 7.12.1 at least)!
I've tested this in a production environment, where we have a cluster with multiple shards and replicas, and even running the simplest query like this returns results in a different order every few requests:
POST /my_index/_search
One way to ensure a consistent order is to add a sort on _id, which seems to bring a small performance hit with it.
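A minimal sketch of such a request, following that suggestion (the index name is illustrative):
POST /my_index/_search
{
  "query": { "match_all": {} },
  "sort": [ { "_id": "asc" } ]
}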
Also, I know it's not directly related to this question, but keep in mind that if you do have scoring in your query and you still get inconsistent results, even after adding a sort on _id, the problem is that the scores can differ across shards in a cluster environment. This problem can be solved by adding a parameter to your query:
POST /my_index/_search?search_type=dfs_query_then_fetch
More info and possible solutions can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/consistent-scoring.html

Elastic Search: Update Index or do two Requests

Assume the following:
I have an entity "Producer" which has an "id", an "alias" and other attributes. The alias is changeable.
Then I have an entity "items" which has some attributes and is produced by a producer (in a relational database it would have a foreign key on "producer.id").
Let's now assume I want to let the user search for items, and the producer's alias is also an option to search for within the items search.
I can also search for producers.
So I have an index "producer" and an index "items".
What is the best approach to search for the alias within the items-search?
Do I add "alias" as field within the "items"-index (which means I have to reindex, when an alias is changed- which doesn't happen that often)
or
Do I search for the producers with the alias first and take the ids to search for the items in a second request? (I would then add the id-field to the items-index)
Parent-child relationships are costly from a performance and implementation point of view. A distributed system's main capability and power come from denormalizing the data, so that all the data you need is available together.
The first option is a very common and generally preferred approach for scalable applications, and in your case it's even less costly since, as you mentioned, the alias is not updated very frequently. So I would advise you to go with the first option, but you can always do the performance and implementation comparison yourself and choose what fits best for your use case.
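As a rough illustration of the first option, the items index could simply carry the producer's alias as its own field so it is searchable without a second request. A minimal sketch, with illustrative field names, assuming a recent Elasticsearch version without mapping types:
PUT /items
{
  "mappings": {
    "properties": {
      "name":           { "type": "text" },
      "producer_id":    { "type": "keyword" },
      "producer_alias": { "type": "text" }
    }
  }
}
When an alias changes, an update-by-query on the items matching that producer_id brings the denormalized field back in sync.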

Forcing filter execution in ElasticSearch

Is there a way to force a (query) filter to be executed for every query irrespective of whether or not it is present in the search query request? In my case, I have a native search script which is used to filter documents based on a dynamically changing list which is maintained outside of the elasticsearch instance. Since I do not control all the clients which query the server, I can't guarantee that they will do the filtering properly or add a reference to the script in the request and would therefore like to force the filter execution within the ES server itself. Is this (easily) achievable? (I am using ES 1.7.0/2.0)
TIA
If users can submit arbitrary requests to the cluster, then there is absolutely nothing that you can do to stop them from doing whatever they want to do.
You really only have two options here:
If users can select arbitrary queries/filters, but you control the index or indices that they go to, then you can use filtered aliases to limit what they can see.
Use Shield (not free) to restrict which indices/aliases any given request can access (again with aliases using filters to hide data).
Aliases are definitely the way to go. Create an alias per client if you need a different filter per client, and ask them to talk to that alias.
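A minimal sketch of such a filtered alias, with illustrative index, alias, and field names (this syntax is valid in ES 1.x/2.x):
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "my_index",
        "alias": "client_a",
        "filter": { "term": { "client_id": "client_a" } }
      }
    }
  ]
}
The client then searches /client_a/_search, and the filter is applied server-side on every request.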

Need clarification about usage of Mahout with Hadoop

I currently have an implementation of a recommender in Mahout using the in-memory recommendation APIs. However, I would like to move to a distributed solution using Hadoop in order to calculate offline recommendations. This is my first time using Hadoop and I'm looking for clarification on a few concepts and API usages.
Currently, my understanding of Hadoop is minimal and I think that the correct approach is the following:
use something like Apache Drill in order to populate HDFS with the user and item data.
use the recommendation job in Mahout to train on the data from HDFS.
transform the resulting data in HDFS into index shards to be used by Solr.
use Solr to provide the recommendations to the user base.
However, I am looking for clarifications on a couple aspects of this design:
How would I utilize a rescorer in the manner that it is used in the in-memory live recommendations?
What is the best manner in which to invoke the recommendations job?
I have other questions besides these two but the answers to these would be a huge help.
You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple of different ways.
The basic recommender can be put together in two ways:
After getting data into HDFS in the form of (user id, item id, preference weight), run the ItemSimilarityJob on the data (use LLR similarity, which is usually best). It will create what is called an indicator matrix: an item-id-by-item-id sparse matrix of values indicating the similarity magnitude between any two items. You must then convert this into values that Solr can index. That means translating the internal Mahout integer IDs into some unique string representation, which is probably what they were at the very beginning. As a CSV this will look like (item123,item223 item643 item293 item445...), so two Solr fields: the first is an item id, the second is a list of similar items. All ids must be text tokens. The query for recommendations is then a Solr query made up of item ids that a particular user has shown a preference for, e.g. query = "item223 item344 item445...". Make the query against the field that holds the indicator matrix values. You will get back an ordered list of item IDs.
A much easier way that may work for you is to use a tool in the /examples folder of Mahout 1.0-SNAPSHOT or here: https://github.com/pferrel/solr-recommender. It takes in raw log files with unique strings for user and item ids and does all the work on Hadoop to output CSVs that can be indexed by Solr directly or loaded into a DB as described above.
The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB and attaching the similar-item list to my collection of items. So item123 got "item223 item643 item293 item445..." in its indicator field. After you index the collection, the query is then "item223 item344 item445..." -- the user's preferred items.
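To make the data shape concrete, here is a minimal sketch of that CSV, reusing the illustrative item ids from the answer (the ids are placeholders, not real data); each row is an item id followed by its space-separated list of similar items:
item123,item223 item643 item293 item445
item223,item123 item445 item293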
Here are three ways to do rescoring:
Mix in metadata with the query. So you could do query = "item223 item344 item445..." against the indicator field AND "SciFi" against the "genre" field. This gives you blended collaborative filtering and metadata in your query and as you can imagine, the recs are based on the user's prefs but skewed towards "SciFi". There are lots of other interesting things you can do once you get item+indicators+metadata into an index.
Filter recs by metadata. You can get recs that are not skewed but filtered, if you want, using the Solr query = "item223 item344 item445..." against the indicator field AND "SciFi" as a filter against the "genre" field. In this case you get nothing but "SciFi", whereas with #1 you would get mostly "SciFi".
Get your ordered list of recs back and rescore them in any way you'd like based on other things you know about the user, context, or items. Often these can be encoded into a Solr query and done with one query, but reordering and filtering can also be done after the recs are returned. You would have to write that code; it is not built in.
The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.
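Putting those rescoring options together, a sketch of what the Solr queries might look like, assuming the similar-item list is indexed in a field called "indicators" and the genre metadata in a field called "genre" (the field names and boost value are illustrative):
Blended (recs skewed towards SciFi, assuming the default OR operator):
q=indicators:(item223 item344 item445) genre:SciFi^2
Filtered (SciFi results only):
q=indicators:(item223 item344 item445)
fq=genre:SciFi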
Note: You don't have to worry about Solr shards necessarily. Solr will index most DBs and HDFS directly but only the index is sharded. You shard an index if you have a very big one, you replicate it if you have lots of queries/second (or for failover). Solr queries are generally very fast so I'd worry about that after you have a functioning system since it's a config thing and shouldn't be affected by the rest of your workflow.

How to specify use of the STC algorithm in the Solr admin console?

I have a test Solr environment using Carrot2 on Ubuntu. With the Carrot2 workbench I can alternate between the three defined algorithms (Lingo, STC, kmeans). How do I do the same thing in the Solr admin query tool? Is it an argument passed with the clustering parameter?
All 3 algorithms are defined in the solrconfig.xml, which is by and large a copy of the example from collection1. I'm inferring it is using Lingo by default but I'm not sure where to switch it to STC if I'd like.
You can use the clustering.engine parameter to select the clustering engine at query time. To cluster the results with STC, add clustering.engine=stc to your request. In Solr Admin, you can pass the extra parameter in the Raw Query Parameters field of the query screen. You may also need to change the default Request-Handler to the one that has results clustering enabled.
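For example, a request along these lines clusters the results with STC (the core name and handler path are assumptions based on the stock example configuration, which defines a clustering-enabled request handler):
http://localhost:8983/solr/collection1/clustering?q=*:*&rows=100&clustering=true&clustering.engine=stc
In the admin query screen, that amounts to selecting the clustering-enabled request handler and putting clustering.engine=stc in the Raw Query Parameters field.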
