solr.RandomSortField on multiple solr server instances - sorting

Got a Solr question here: I have multiple Solr server instances that all have the same data and schema. The schema contains a dynamic field of type solr.RandomSortField, so I am wondering: if I run sort=rand_1234%20desc on different Solr servers, am I supposed to get the same result?

According to the source code of RandomSortField, the seed includes the version number of the index. This means that if you issue a search with the same random sort parameter (e.g. sort=rand_1234 desc) on different servers, the same result is returned as long as the indexes are equal (same content) and have the same version id (e.g. via replication).
You can check the version of the indexes in the /admin/ UI of every server.
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/3.5.0/org/apache/solr/schema/RandomSortField.java
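As a quick sanity check, here is a minimal sketch that runs the same random-sorted query against two servers and compares the returned orderings. The hostnames and collection name are placeholders for your own setup:

```python
# Sketch: compare a random-sorted query across two Solr servers.
# Hostnames, collection name, and the rand_1234 field are placeholders.
import requests

SERVERS = ["http://solr-a:8983", "http://solr-b:8983"]

def random_sorted_ids(base_url):
    # Same seed field name on every server; the effective seed also mixes
    # in the index version, so the orderings only match if the replicas
    # share the same index version.
    resp = requests.get(
        f"{base_url}/solr/mycollection/select",
        params={"q": "*:*", "sort": "rand_1234 desc", "fl": "id", "wt": "json"},
    )
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

ids_a, ids_b = (random_sorted_ids(s) for s in SERVERS)
print("orderings match:", ids_a == ids_b)
```

If the orderings differ, compare the index versions in each server's /admin/ UI first.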

Related

Setting up a daily partitioned index

I'm looking to set up my index so that it is partitioned into daily sub-indices whose individual settings I can adjust depending on the age of that index, i.e. indices >= 30 days old should be moved to slower hardware, etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join the dots on is how to set up the original index to be partitioned by day. When adding data or querying, do I need to specify the individual daily indices, or is there something in Elasticsearch that will do this for me? If the latter, how does it work for adding versus querying (assuming they are different)? How does it determine which partitions are relevant for a query, or which partition to add a document to? (I'm assuming there is a timestamp field, but I can't see from the docs how it's all linked together.)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
there's no such thing as sub-indices or partitions in Elasticsearch. if you want to use ILM, which you should, then you are using aliases and multiple indices
you will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ILM as well
getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. but to your questions;
the major assumption of using ILM is that the data being ingested is current, so at a rough level, data from today will end up in an index from today
if you are indexing historic data then you may want to put that into "traditional" index names, e.g. logs-2021.08.09, and then attach them to the ILM policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
when querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html
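to make the alias/rollover mechanics concrete, here is a rough sketch of the setup the linked ILM chapters describe: a policy, a template that applies it to matching indices, and a bootstrap index behind a write alias. the endpoint, names, and thresholds are example values, and composable index templates need 7.8+ (hence the upgrade mentioned above):

```python
# Sketch: ILM policy + rollover, roughly following the linked ILM docs.
# Endpoint, policy name, alias, and thresholds are example values.
import requests

ES = "http://localhost:9200"

# 1) Policy: roll over daily, move to "warm" nodes at 30 days, delete at 90.
#    The warm phase assumes nodes tagged with a data:warm node attribute.
requests.put(f"{ES}/_ilm/policy/logs-policy", json={
    "policy": {"phases": {
        "hot":    {"actions": {"rollover": {"max_age": "1d"}}},
        "warm":   {"min_age": "30d",
                   "actions": {"allocate": {"require": {"data": "warm"}}}},
        "delete": {"min_age": "90d", "actions": {"delete": {}}},
    }}
}).raise_for_status()

# 2) Template: every index matching logs-* picks up the policy and alias.
requests.put(f"{ES}/_index_template/logs-template", json={
    "index_patterns": ["logs-*"],
    "template": {"settings": {
        "index.lifecycle.name": "logs-policy",
        "index.lifecycle.rollover_alias": "logs",
    }}
}).raise_for_status()

# 3) Bootstrap the first backing index; write and query via the "logs" alias.
requests.put(f"{ES}/logs-000001", json={
    "aliases": {"logs": {"is_write_index": True}}
}).raise_for_status()
```

from then on you index into and query via the logs alias; ILM creates logs-000002 etc. on rollover and ages the older backing indices through the warm and delete phases for you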

How can I send StormCrawler content to multiple Elasticsearch indices, based on host?

I currently have a successful StormCrawler instance crawling about 20 sites, and indexing the content to one Elasticsearch index. Is it possible, either in ES or via StormCrawler, to send each host's content to its own unique content index?
Out of curiosity: why do you need to do that? Having one index per host seems rather wasteful. You can filter the results based on a field like host if you want to provide results for a particular host.
To answer your question, there is no direct way of doing it currently, as the IndexerBolt is connected to one index only. You could declare one IndexerBolt per index you need and add a custom bolt to fan out based on the value of the host metadata, but this is not dynamic and rather heavy-handed. There could be a way of doing it using pipelines in ES (see the sketch below), not sure.
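For what it's worth, here is an untested sketch of that ES-pipeline idea: an ingest pipeline that rewrites _index per document based on a host field. It assumes the indexed documents actually carry such a field (e.g. via StormCrawler's metadata-to-field mapping); all names are placeholders:

```python
# Sketch: route documents to per-host indices with an ES ingest pipeline.
# Assumes each document carries a "host" field; names are placeholders.
import requests

ES = "http://localhost:9200"

# The set processor supports mustache templates, so the target index
# can be derived from the document itself.
requests.put(f"{ES}/_ingest/pipeline/route-by-host", json={
    "description": "send each host's pages to its own index",
    "processors": [
        {"set": {"field": "_index", "value": "content-{{host}}"}}
    ]
}).raise_for_status()

# Attach it as the default pipeline of the index the IndexerBolt writes to
# (here assumed to be called "content"); documents are then re-routed to
# content-<host> at ingest time.
requests.put(f"{ES}/content/_settings", json={
    "index.default_pipeline": "route-by-host"
}).raise_for_status()
```

Note that the dynamically created content-<host> indices would need a matching index template for their mappings, and the caveat above about one index per host being wasteful still applies.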

Is it possible for an Elasticsearch index to have a primary key comprised of multiple fields?

I have a multi-tenant system, whereby each tenant gets their own Mongo database within a MongoDB deployment.
However for elastic search indexing, this all goes into one elastic instance via Mongoosastic, tagged with a TenantDB to keep data separated when searching.
Currently we have some of the same _ids reused across the multiple databases in test data for various config collections (different document content, same _id). However, this is causing a problem when syncing to Elasticsearch: although they're in separate databases, when they come into Elasticsearch with the same type and ID, one of them gets dropped.
Is it possible to specify both the ID and TenantDB as the primary key?
Solution 1: You can search across multiple indices in Elasticsearch, but if you cannot separate your indices per database, you can follow a method like the one below. While syncing your data to Elasticsearch, use a pattern to build the Elasticsearch document _id. For example, from mongoDb1 use mdb1_{mongo_id}, from mongoDb2 use mdb2_{mongo_id}, etc. This keeps your _ids unique, as long as you don't reuse an id within the same Mongo database.
Solution 2: Separate your index.
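A minimal sketch of Solution 1, using the tenant database as an _id prefix (the index name, field names, and separator are examples):

```python
# Sketch: build the Elasticsearch _id from tenant database + Mongo _id
# so documents from different tenants can never collide.
import requests

ES = "http://localhost:9200"

def index_doc(tenant_db, mongo_id, body):
    # e.g. tenant_db="mdb1", mongo_id="64b0..." -> _id "mdb1_64b0..."
    doc_id = f"{tenant_db}_{mongo_id}"
    # Keep the tenant as a regular field too, so searches stay filterable.
    body = dict(body, tenantDb=tenant_db)
    r = requests.put(f"{ES}/config/_doc/{doc_id}", json=body)
    r.raise_for_status()
    return doc_id

# Same Mongo _id in two tenant databases no longer clashes:
index_doc("mdb1", "64b0c1e2a9f", {"setting": "theme", "value": "dark"})
index_doc("mdb2", "64b0c1e2a9f", {"setting": "theme", "value": "light"})
```

Keeping the TenantDB as a normal field as well means the existing per-tenant search filtering keeps working unchanged.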

Are there conventions for naming/organizing Elasticsearch indexes which store log data?

I'm in the process of setting up Elasticsearch and Kibana as a centralized logging platform in our office.
We have a number of custom utilities and plug-ins which I would like to track the usage of and if users are encountering any errors. Not to mention there are servers, and scheduled jobs I would like to keep track of as well.
So if I have a number of different sources of log data all going to the same Elasticsearch cluster, what are the conventions or best practices for how this is organized into indexes and document types?
The default index value used by Logstash is "logstash-%{+YYYY.MM.dd}". So it seems like it's best to suffix any index names with the current date, as this makes it easy to purge old data.
However, Kibana allows for adding multiple "index patterns" that can be selected from in the UI. Yet all the tutorials I've read only mention creating a single pattern like logstash-*.
How are multiple index patterns used in practice? Would I just give names for all the sources for my data? Such as:
BackupUtility-%{+YYYY.MM.dd}
UserTracker-%{+YYYY.MM.dd}
ApacheServer-%{+YYYY.MM.dd}
I'm using NLog in a number of my tools, and it has an Elasticsearch target. The convention for NLog and other similar logging frameworks is to have a "logger" for each class in the source code. Should these loggers translate to indexes in Elasticsearch?
MyCompany.CustomTool.FooClass-%{+YYYY.MM.dd}
MyCompany.CustomTool.BarClass-%{+YYYY.MM.dd}
MyCompany.OtherTool.BazClass-%{+YYYY.MM.dd}
Or is this too granular for Elasticsearch index names, and would it be better to stick to just a single dated index per application?
CustomTool-%{+YYYY.MM.dd}
In my environment we're working through a similar question. We have a mix of system logs, metric alerts from Prometheus, and application logs from both client and server applications. In addition, we have some shared variables between the client and server apps that let us correlate the two (e.g., we know what server logs match some operation on the client that made requests to said server). We're experimenting with the following scheme to help Kibana answer questions for us:
logs-system-{date}
logs-iis-{date}
logs-prometheus-{date}
logs-app-{applicationName}-{date}
Where:
{applicationName} is the unique name of some application we wrote (these could be client or server side)
{date} is whatever date-based scheme you use for indexes
This way we can set up Kibana searches against logs-app-* and quickly search for logs among any of our applications. This is still new for us, but we started without this type of scheme and are already regretting it. It makes searching for correlated logs across applications much harder than it should be.
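For illustration, a small sketch of how the per-application index name above might be derived at ship time (the prefix and date format are just this scheme's choices):

```python
# Sketch: derive the logs-app-{applicationName}-{date} index name.
from datetime import datetime, timezone

def app_log_index(application_name, when=None):
    when = when or datetime.now(timezone.utc)
    # Lowercase because Elasticsearch index names must be lowercase.
    return f"logs-app-{application_name.lower()}-{when:%Y.%m.%d}"

print(app_log_index("BackupUtility"))  # e.g. logs-app-backuputility-2021.08.09
# A Kibana index pattern of "logs-app-*" then matches every application
# at once, while "logs-app-backuputility-*" narrows to a single tool.
```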
In my company we have worked a lot on this topic. We agreed on the following convention:
Customer
-- Product
--- Application
---- Date
In any case, it is necessary to review both how the data is organized and how the data is queried inside the organization.
Kind Regards
Dario Rodriguez
I am not aware of such conventions, but in my environment we used to create two different types of indices, logstash-* and logstash-shortlived-*, depending on the severity level. In my case, I create the index pattern logstash-*, as it matches both kinds of indices.
As these indices will be stored in Elasticsearch and Kibana will read them, I guess it should give you the option of creating index patterns for the different naming schemes.
Give it a try on your local machine. Why don't you try logstash-XYZ if you want more granularity? Otherwise you can always create indices with your own custom names.

Sorting Across a Distributed Sphinx Index via SphinxSE

We currently have a Sphinx 2.2.3-beta install on one server and Sphinx 2.0.4 on another. Both have their own builds of two indexes on them, each with a distributed index over those two local indexes (i.e. each server has 'index1' and 'index2', and each has 'index_dist' being a distributed index over 'index1' and 'index2').
When using SphinxSE to query against the distributed index and sorting against a given attribute, we are finding that the results from the 2.2.3beta seem to be given as the sorted results from the first local index followed by the sorted results from the second index.
When performing the same query against the distributed index on the 2.0.4 server, the results are completely sorted (as in, the results from the first local index, combined with the results from the second local index, and then sorted).
This is not an issue when performing the query via SphinxQL, but it is a problem if we make the query via either the PHP Sphinx API or via SphinxSE.
Does anyone have any thoughts / hints / comments around this please?
I reported this issue at http://sphinxsearch.com/bugs/view.php?id=2023 . It has since been resolved, but in the meantime you can either:
download the latest code, rebuild SphinxSE and restart mysqld; or
add select=* to your SphinxSE query.
The second option seemed to work for us.
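For reference, a rough sketch of what the second option looks like when querying through SphinxSE from Python. Connection details and the index/attribute names are placeholders, and it assumes a SphinxSE table is already defined against the distributed index:

```python
# Sketch: SphinxSE passes search options through the "query" string,
# so the select clause rides along there. Table/attribute names and
# connection details are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="pw", database="test")
with conn.cursor() as cur:
    # Sort by an attribute on the Sphinx side; appending select=* is the
    # workaround that made distributed results come back fully merged.
    cur.execute(
        "SELECT id, weight, query FROM sphinx_dist "
        "WHERE query = 'some keywords;sort=attr_desc:created_at;select=*'"
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```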
