Is there a way to have the `metadata.depth` value also added to a field in the doc index? - elasticsearch

StormCrawler, while crawling, adds a field called metadata.depth to the status index. I am not sure where that value is generated, but would it be possible to send it to the content index (the one named "index") in Elasticsearch as well as to the status index? That value would help a bit with ranking when searching against that index.

If it's in the metadata, it can be indexed using indexer.md.mapping, like anything else.
It is generated by MetadataTransfer when an outlink is discovered.

Related

Finding the set "max_result_window" for Elastic Search index?

So when querying Elasticsearch, I know you can constrain the number of results with the "size" parameter. By default, the maximum is 10,000. I was wondering how to find out what the maximum is (if it has been changed from 10,000).
I have tried "/index/_settings" in hopes of finding max_result_window, but couldn't find anything. I'm not sure whether that's because there is no limit at all, or because I am doing something wrong.
So to rephrase my question: I basically want to know how to find the maximum allowed value when sending "size: xx" in a query to an Elasticsearch server. If the maximum is the default of 10,000, then I want to know where I can find this number.
Any tips or guidance?
If the value isn't specified on the index itself (in _settings where you were looking), then it is 10000. You can change this setting only on the index itself as far as I know. To automatically apply it to new indices you can use an index template.
To me this looks like an oversight by the devs. If you use rolling indices by date, for example, there is no single index you can query to see whether the value has been modified (sure, you could guess one). I think you just have to make sure the assumptions in your query code match your index template. In my opinion there should be a way to ask for the maximum possible number of results without needing to know that value beforehand.
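The fallback rule described above (use the value from `_settings` if it is there, otherwise assume 10000) can be sketched in a few lines. This is only an illustration: it assumes the nested, non-flat JSON body returned by `GET /<index>/_settings`, and the function name and sample index names are made up for the example.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # default when the index does not override it


def effective_max_result_window(settings_response: dict, index_name: str) -> int:
    """Return index.max_result_window for index_name, or the default if unset."""
    index_settings = (
        settings_response.get(index_name, {})
        .get("settings", {})
        .get("index", {})
    )
    # Setting values come back as strings, so convert explicitly.
    raw = index_settings.get("max_result_window", DEFAULT_MAX_RESULT_WINDOW)
    return int(raw)


# Hypothetical response bodies, shaped like GET /<index>/_settings output.
customised = {"logs-2019.09.01": {"settings": {"index": {"max_result_window": "50000"}}}}
untouched = {"logs-2019.09.02": {"settings": {"index": {"number_of_shards": "1"}}}}

print(effective_max_result_window(customised, "logs-2019.09.01"))
print(effective_max_result_window(untouched, "logs-2019.09.02"))
```

With rolling indices you would still have to pick which index to ask, which is the limitation noted above.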
You are correct that Elasticsearch's default maximum result window is 10000. The way to get more is to use the "scroll" API:
https://www.elastic.co/guide/en/elasticsearch/reference/7.3/search-request-body.html#request-body-search-scroll
This essentially uses pagination to split your result into user-defined segments and allows you to "scroll" to the next one using the "_scroll_id" that's returned from the initial query.
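The scrolling loop itself is small. Below is a minimal sketch that keeps the HTTP calls out of the picture: `start_search` and `next_page` are hypothetical callables standing in for the initial `POST /<index>/_search?scroll=1m` request and the follow-up `POST /_search/scroll` request, and the canned pages only imitate the shape of the real responses.

```python
from typing import Callable, Iterator


def iterate_scroll(
    start_search: Callable[[], dict],
    next_page: Callable[[str], dict],
) -> Iterator[dict]:
    """Yield every hit, following _scroll_id until a page comes back empty."""
    page = start_search()
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = next_page(page["_scroll_id"])


# Stand-in for a real cluster: canned pages keyed by the scroll id that fetches them.
_first = {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}
_pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    "s2": {"_scroll_id": "s2", "hits": {"hits": []}},
}

ids = [hit["_id"] for hit in iterate_scroll(lambda: _first, lambda sid: _pages[sid])]
print(ids)
```

Each response carries its own scroll id, so the loop always passes on the most recent one rather than reusing the first.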

Lucene: Filter query by doc ID

I want the search response to contain only documents with specified doc ids. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, the approach there is to add an additional field to the document and then search by that field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file, and they are added as children of a parent document. I just post the items to the index, without specifying an id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target parent and add a new document only if there is no equal child.
I wondered if there is a way to let Elasticsearch handle duplicates.
Duplicates need to be handled through the document ID itself.
Choose a key that is unique for a document and use it as the _id. If the key is too large or is made up of multiple fields, create a SHA checksum of it and use that as the _id.
If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.
You can read more about this approach here.
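As a rough illustration of the checksum idea, the sketch below derives a stable _id from the fields that identify an item. The field names (`title`, `url`, `views`) and the helper name are invented for the example; SHA-1 is used only as one possible checksum.

```python
import hashlib
import json


def document_id(item: dict, key_fields: list) -> str:
    """Build a stable _id from the fields that make an item unique."""
    # Serialise only the key fields, in the order given, so the same
    # item always produces the same bytes and therefore the same hash.
    key = json.dumps([item.get(field) for field in key_fields])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# The same item parsed twice (a non-key field changed) and a genuinely new item.
first_parse = {"title": "Chapter 1", "url": "http://example.com/1", "views": 10}
second_parse = {"title": "Chapter 1", "url": "http://example.com/1", "views": 25}
new_item = {"title": "Chapter 3", "url": "http://example.com/3", "views": 0}

keys = ["title", "url"]
print(document_id(first_parse, keys) == document_id(second_parse, keys))
print(document_id(first_parse, keys) == document_id(new_item, keys))
```

Indexing each item under this _id means a re-parsed item lands on its existing document, while a new item gets a new one.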
When you index a document, Elasticsearch first checks whether a document with the same ID already exists. If one does, it is overwritten rather than duplicated (the version field is incremented at the same time to track the number of updates that have occurred). You will therefore need to keep track of your document IDs somehow and keep the same ID for the same document across parses to eliminate the possibility of duplicates.

Only index certain fields from Wikipedia River

I'm trying to use the Wikipedia River.
Is there a way to customize the mapping so that Elasticsearch indexes only the title field (I'd still like to be able to access the whole text)?
The mapping is useful more to decide how you index data rather than what you index, unless you set it to dynamic: false which means that elasticsearch effectively accepts only the fields that are explicitly declared in the mapping.
The problem is that the wikipedia river always sends a set of fields for every document and this behaviour is not currently configurable, thus there's no way to index only a subset of those fields (e.g. only title and _source). What you could do is modify your search request so that you get back only the fields that you are interested in, but the content of the index will stay the same.
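For the search-side workaround, a request body along these lines would return only the title of each hit. It is a sketch under assumptions: it uses `_source` filtering, which is available from Elasticsearch 1.0 on (older releases used a `fields` parameter instead), and it assumes the river stores the page title in a field named `title`. The helper name is made up.

```python
import json


def title_only_search(query_text: str) -> dict:
    """Search request body that matches on title and returns only the title."""
    return {
        "query": {"match": {"title": query_text}},
        # Source filtering trims the response; the index content is unchanged.
        "_source": ["title"],
    }


print(json.dumps(title_only_search("elasticsearch"), indent=2))
```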

What indexes are created when indexing a document in elasticsearch

If I create the first document of its type, or put a mapping, is an index created for each field?
Obviously, if I set "index" to "analyzed" or "not_analyzed", the field is indexed.
Is there a way to store a field so that it can be retrieved but never searched by? I imagine this would save a lot of space. If I set "index" to "no", will this save space?
Will I still be able to search by this field, only more slowly, or will it be totally unsearchable?
Is there a way to make a field indexed after some documents have been inserted, if I change my mind?
For example, I might have a mapping:
{
  "book": {
    "properties": {
      "title": {"type": "string", "index": "not_analyzed"},
      "shelf": {"type": "long", "index": "no"}
    }
  }
}
So I want to be able to search by title, but also retrieve the shelf the book is on.
index:no will indeed not create an index for that field, so that saves some space. Once you've done that you can't search for that particular field anymore.
Perhaps also useful in this context is to know about the _source field, which is returned by default and includes all fields you've stored. http://www.elasticsearch.org/guide/reference/mapping/source-field/
As to your second question:
You can't change your mind halfway. If you want to index a particular field later on, you have to reindex the documents.
That's why you may want to reconsider setting index:no, etc. In fact, a good strategy to begin with is not to define a schema for fields at all, unless you're 100% sure you need, for instance, a non-default analyzer for a particular field. Otherwise ES will use sensible general-purpose defaults.
