Sorting Across a Distributed Sphinx Index via SphinxSE

We currently have a Sphinx 2.2.3-beta install on one server and Sphinx 2.0.4 on another. Each server has its own build of two local indexes, plus a distributed index over them (i.e. each server has 'index1' and 'index2', and an 'index_dist' that is a distributed index over 'index1' and 'index2').
When we query the distributed index via SphinxSE and sort on a given attribute, the 2.2.3-beta server appears to return the sorted results from the first local index followed by the sorted results from the second, rather than a single merged ordering.
When we run the same query against the distributed index on the 2.0.4 server, the results are fully sorted (i.e. the results from both local indexes are merged and then sorted as a whole).
This is not an issue when running the query via SphinxQL, but it is a problem when we run it via either the PHP Sphinx API or SphinxSE.
Does anyone have any thoughts / hints / comments around this please?

I reported this issue at http://sphinxsearch.com/bugs/view.php?id=2023 . It has since been resolved, but in the meantime you can either:
download the latest code, rebuild SphinxSE and restart mysqld; or
add select=* to your SphinxSE query
The second option seemed to work for us.
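For reference, the second workaround just means appending select=* to the SphinxSE query string. A minimal sketch (the table name sphinx_se and the attribute my_attr are made up for illustration):

# the trailing ';select=*' is what works around the bug
$ mysql -e "SELECT * FROM sphinx_se WHERE query='keyword;sort=attr_asc:my_attr;select=*';"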

Related

Setting up a daily partitioned index

I'm looking to set up my index so that it is partitioned into daily sub-indices whose individual settings I can adjust depending on the age of that index, e.g. indices >= 30 days old should be moved to slower hardware. I am aware I can do this with a lifecycle policy.
What I'm unable to join the dots on is how to set up the original index to be partitioned by day. When adding data or querying, do I need to specify the individual daily indices, or is there something in Elasticsearch that will do this for me? If the latter, how does it work for adding vs. querying (assuming they are different)? How does it determine which partitions are relevant for a query, or which partition a document should be added to? (I'm assuming there is a timestamp field, but I can't see from the docs how it's all linked together.)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
There's no such thing as sub-indices or partitions in Elasticsearch. If you want to use ILM, which you should, then you are using aliases and multiple indices.
You will need to upgrade from 7.7 (which is EOL) and use the default distribution to get access to ILM as well.
Getting back to your conceptual questions: https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. But to your questions:
The major assumption of ILM is that the data being ingested is current, so roughly speaking, data from today will end up in an index created today.
If you are indexing historic data, you may want to put it into "traditional" index names, e.g. logs-2021.08.09, and then attach those indices to the ILM policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
When querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives; it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html
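To make the move-to-slower-hardware-after-30-days idea concrete, here is a minimal sketch of such a policy (the policy name, the 1d rollover age, and the data: warm node attribute are illustrative assumptions, not from the thread):

# hot phase rolls over to a new backing index roughly daily;
# warm phase relocates indices older than 30 days to nodes tagged data: warm
$ curl -XPUT 'http://localhost:9200/_ilm/policy/daily-logs' -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot":  { "actions": { "rollover": { "max_age": "1d" } } },
      "warm": {
        "min_age": "30d",
        "actions": { "allocate": { "require": { "data": "warm" } } }
      }
    }
  }
}'

With this in place you write to a single alias (or data stream); rollover cuts the new daily indices for you, and ILM handles the relocation as each one ages.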

Elasticsearch indexes but does not store documents

I'm having trouble storing documents in a 3-node Elasticsearch cluster that previously was able to store documents. I use the Java API to send bulks of documents to Elasticsearch, which are accepted (no failure in the BulkResponse object) AND Elasticsearch shows heavy indexing activity. However, the document count does not increase, and I assume that none of them are stored.
I've looked into Elasticsearch logs (of all three nodes) but I see no errors or warnings.
Note: I've had to restart two nodes previously, but search/query is working perfectly. (The count in the image starts at ~17:00 as I installed the Marvel plugin at that time.)
What can I do to solve or debug the problem?
Sorry for this point-blank code blindness on my part! I forgot to advance (skip) the cursor when reading from MongoDB and therefore re-inserted the same 1,000 documents into Elasticsearch thousands of times!
Learning: if this problem occurs, check that you are selecting the correct documents in your database and that these documents are not already stored in ES.
Sidenote on Marvel: it would be great if this could be indicated in some way, e.g. with a chart of "updated documents" (I rechecked and could not find one).
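For completeness, a minimal sketch of the paging pattern that was missing, written against the mongo shell (the collection name docs and the batch size are made up):

$ mongo --quiet --eval '
  var batch = 1000;
  for (var offset = 0; ; offset += batch) {
    /* skip past documents already read instead of re-reading the first batch */
    var page = db.docs.find().skip(offset).limit(batch).toArray();
    if (page.length === 0) break;
    /* ... hand "page" to the Elasticsearch bulk indexer here ... */
  }'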

Is SolrCloud applicable for a use case where indexing is offline?

SolrCloud seems to be the suggested method for scaling Solr going forward. I understand that legacy scaling methods (like master-slave replication) still exist. My use case with Solr does not have to be near-real-time (NRT). It is fine if newly indexed data becomes visible to searchers after about a day.
With master-slave (legacy scaling), I could replicate once a day. Do I have an option like this in SolrCloud?
Also, I don't want indexing to impact searcher performance at index time. Is there a way to isolate the indexer from the searcher shards in SolrCloud?
You could skip SolrCloud and just index into a dedicated, separate collection.
Then you bring the new content to each machine individually and do a Core Swap.
Or do a similar thing using aliases to point to the newest core/collection, which also allows you to segment old and new content into different collections and search them together.
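A minimal sketch of the swap step via the CoreAdmin API (the core names live and staging are made up for illustration):

# SWAP atomically exchanges the two core names, so searchers
# immediately hit the freshly built index under the old name
$ curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging'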
I have also used collection aliases in such cases. You can build your index once a day, and when it is ready you simply change the alias. I'll give an example:
At the very beginning you create an index called index_2014_12_01 and point an alias (say, index) at it. The next day you build index_2014_12_02 and change the alias to point to index_2014_12_02 instead of index_2014_12_01.
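A sketch of that switch using the Collections API (the alias and collection names are just the ones from the example above):

# CREATEALIAS (re)points the alias atomically; queries against
# 'index' start hitting the new collection on their next request
$ curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=index&collections=index_2014_12_02'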

How exactly does Elasticsearch versioning work?

My understanding was that Elasticsearch would store the latest copy of the document and just update the version number. But I was playing around with a few thousand documents and needed to index them repeatedly without changing any data in them. My thinking was that the index size would remain the same, but that wasn't the case: the index size seemed to increase.
This confused me a little, so I just wanted to seek clarification on the internal mechanics of versioning within Elasticsearch.
An update is a delete + insert Lucene operation behind the scenes.
But you should know that Lucene does not really delete the document; it only marks it as deleted.
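You can watch this happen with the era-appropriate APIs (the twitter index matches the optimize example below; the document itself is made up):

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{"msg":"hello"}'  # returns _version: 1
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{"msg":"hello"}'  # returns _version: 2; the old copy is only marked deleted
$ curl 'http://localhost:9200/twitter/_stats/docs?pretty'                  # docs.deleted counts those tombstones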
To remove deleted docs, you have to optimize your Lucene segments.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
See the Optimize API. Also have a look at the merge options; merging of segments happens behind the scenes over time.
For a general overview of versioning support in Elasticsearch, please refer to the Elasticsearch Versioning Support.

solr.RandomSortField on multiple Solr server instances

Got a Solr question here. I have multiple Solr server instances that all have the same data and schema; the schema contains a dynamic field of type solr.RandomSortField. If I run sort=rand_1234%20desc on different Solr servers, am I supposed to get the same result?
According to the source code of RandomSortField, the seed includes the version number of the index. This means that if you issue a search with the same random parameter (e.g. "sort=random_1234") on different servers, the same result is returned as long as the indexes are equal (same content) and have the same version id (e.g. via replication).
You can check the version of the indexes in the /admin/ UI of every server.
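If you prefer to script the check, a sketch using the CoreAdmin API (the core name collection1 is an assumption):

# the "index" section of the response includes the version field that feeds the random seed
$ curl 'http://localhost:8983/solr/admin/cores?action=STATUS&core=collection1&wt=json'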
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/3.5.0/org/apache/solr/schema/RandomSortField.java
