Lucene: Filter query by doc ID - elasticsearch

I want the search response to contain only documents with the specified doc IDs. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, that approach adds an extra field to the document and then searches on that field. Is there another way to deal with it?

Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.
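A minimal sketch of that advice, assuming an Elasticsearch-style query body. The field name `doc_key` and both helper names are made up for illustration, and the field is assumed to be indexed as an exact (non-analysed) value. Each document gets its own key at index time, and a search is then restricted to a set of keys with a `terms` filter on that field:

```python
import uuid

def with_key(doc):
    """Return a copy of doc that carries an application-generated key."""
    keyed = dict(doc)
    # Keep an existing key so re-indexing the same document does not change it.
    keyed.setdefault("doc_key", uuid.uuid4().hex)
    return keyed

def filter_by_keys(query, keys):
    """Wrap query so that only documents whose doc_key is in keys can match."""
    return {
        "query": {
            "bool": {
                "must": query,
                "filter": {"terms": {"doc_key": list(keys)}},
            }
        }
    }

docs = [with_key({"title": "first"}), with_key({"title": "second"})]
body = filter_by_keys({"match": {"title": "first"}}, [docs[0]["doc_key"]])
```

Unlike the internal docId, this key survives updates, reindexing and segment merges because it is ordinary document content.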

Related

Dealing with Empty Fields

I am new to StormCrawler and Elasticsearch in general. I am currently using StormCrawler 2.0 to index website data (including non-HTML items such as PDFs and Word documents) into Elasticsearch. In some cases, the metadata of a PDF or Word document does not contain a title, so the field is stored blank/null in Elasticsearch. Unfortunately, this causes issues in the web app I am using to display search results (search-ui). Is there a way to have StormCrawler insert a default value of "Untitled" into the title field if none exists in the metadata?
I understand that elasticsearch has a null_value field parameter, but if I understand correctly that parameter cannot be used for text fields and only helps with searching.
Thanks!
One option would be to write a custom ParseFilter that assigns a default value to any key that is missing or has an empty value. The StormCrawler code has quite a few examples of ParseFilters; see also the wiki.
The same could be done as a custom Bolt placed between the parser and the indexer; grab the metadata and normalise to your heart's content.
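The normalisation step itself is small. The sketch below shows only the logic, in Python for brevity. It is not StormCrawler code (a real ParseFilter or Bolt would be a Java class), and modelling the metadata as a dict of key to list of values, as well as the key name `parse.title`, are assumptions made for illustration:

```python
def fill_defaults(metadata, defaults):
    """Return a copy of metadata where missing or blank keys get a default value."""
    result = {key: list(values) for key, values in metadata.items()}
    for key, default in defaults.items():
        # Drop empty or whitespace-only values before deciding whether the key is set.
        values = [v for v in result.get(key, []) if v and v.strip()]
        result[key] = values if values else [default]
    return result

pdf_md = {"parse.title": [""], "content-type": ["application/pdf"]}
normalised = fill_defaults(pdf_md, {"parse.title": "Untitled"})
```

Keys that already hold a non-blank value are left alone, so pages with a real title are unaffected.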

Is there a way to get the `metadata.depth` value added to a field in the doc index as well?

Stormcrawler, through the process of crawling, is adding a field to the status index called metadata.depth. I am not sure where that is generated from, but would it be possible to somehow send that depth value to the, er, index index in elasticsearch as well as the status index? That value would help a bit with ranking for searching against the index index.
If it's in the metadata, it can be indexed using indexer.md.mapping, like anything else.
It is generated by MetadataTransfer when an outlink is discovered.

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 most common words.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full-text field as a keyword field, which lets you run a terms aggregation on it. Bear in mind that a keyword field indexes the whole value as a single term, so a terms aggregation on it counts whole field values; counting individual words needs an analysed text field with fielddata enabled. You could also consider a significant terms aggregation, to avoid stopwords and other common words. In ES 6.x you could also use the significant text aggregation, without creating the keyword field, but I have never tried it, so I don't know how well it works. If instead you need the word frequencies for each individual document, use the term vectors API.
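For reference, a terms aggregation request for the top 10 terms might look like the sketch below. The field name `content` and the aggregation name `top_words` are placeholders, and the field is assumed to be one that Elasticsearch allows aggregations on:

```python
import json

def top_terms_request(field, size=10):
    """Search body that returns no hits, only the most frequent terms of a field."""
    return {
        "size": 0,  # skip the hits, only the aggregation result is of interest
        "aggs": {
            "top_words": {
                "terms": {"field": field, "size": size}
            }
        },
    }

body = top_terms_request("content")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request. Swapping `terms` for `significant_terms` gives the significant terms variant mentioned above.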

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
2. Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question concerns a field that needs doc_values but that, given your search requirements, cannot simply be indexed with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you do not currently search on this field.
So I guess the search field does not have to have the same name: you can copy the field value into another field that is used only for search ( "store": false ). For this new field you can use the pattern analyzer or pattern tokenizer to fit your use case.
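A sketch of what that could look like, written as Python dicts that serialise to the JSON request bodies. The names `slash_tokenizer`, `url_parts` and the sub-field `parts` are invented for illustration, the search copy is expressed here as a multi-field rather than a separately copied field, and the mapping syntax assumes the older `string` / `not_analyzed` style used in the question:

```python
# Index settings: a custom analyzer whose tokenizer splits only on "/".
settings = {
    "analysis": {
        "tokenizer": {"slash_tokenizer": {"type": "pattern", "pattern": "/"}},
        "analyzer": {"url_parts": {"type": "custom", "tokenizer": "slash_tokenizer"}},
    }
}

# Mapping: keep the exact field with doc_values, add an analysed sub-field for search.
url_mapping = {
    "url": {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": True,
        "fields": {"parts": {"type": "string", "analyzer": "url_parts"}},
    }
}

def approximate_tokens(path):
    """Rough local stand-in for the tokenizer: split on "/" and drop empty pieces."""
    return [piece for piece in path.split("/") if piece]

tokens = approximate_tokens("/part1/user#site/part2/part3.ext")
# tokens == ["part1", "user#site", "part2", "part3.ext"]
```

Because the split happens only on "/", user#site stays a single term. A multi-segment search such as part2/part3.ext would probably need a match_phrase query against url.parts so that the segments have to be adjacent.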
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
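The values for the list field could be generated along these lines. This is a sketch that assumes "permutations" means every contiguous run of path segments, which is what the example searches in the question require; the function name and the field name `url_parts` are made up:

```python
def url_part_runs(path):
    """Every contiguous run of path segments, joined back together with "/"."""
    segments = [s for s in path.split("/") if s]
    runs = []
    for start in range(len(segments)):
        for end in range(start + 1, len(segments) + 1):
            runs.append("/".join(segments[start:end]))
    return runs

parts = url_part_runs("/part1/user#site/part2/part3.ext")

# An exact-match lookup against the enriched field then needs no wildcard.
query = {"query": {"term": {"url_parts": "part2/part3.ext"}}}
```

A path with n segments produces n(n+1)/2 values (10 for the four-segment example), which is consistent with the extra disk usage noted above.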
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms means massive field data that will eventually no longer fit in the memory you can allocate to it.

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON, which will be added as children of a parent document. I just post the items to the index, without taking care about the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target parent and add a new document only if there is no equal child.
I wondered if there is a way to let Elasticsearch handle duplicates.
Duplicates need to be handled through the document ID itself.
Choose a key that is unique for a document and use it as the _id. If the key is too large, or it is made up of multiple keys, create a SHA checksum out of it and use that as the _id.
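A minimal sketch of the checksum idea. The helper name and the example fields are invented, and SHA-1 is just one choice; any stable hash would do:

```python
import hashlib
import json

def doc_id(doc, key_fields):
    """Derive a deterministic _id from the fields that make a document unique."""
    # Serialise the key values in a fixed order so the same input always hashes the same.
    key = json.dumps([doc.get(f) for f in key_fields], sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

first_parse = {"parent": "feed-1", "title": "Item A", "seen": "week 1"}
second_parse = {"parent": "feed-1", "title": "Item A", "seen": "week 3"}

same_id = doc_id(first_parse, ["parent", "title"]) == doc_id(second_parse, ["parent", "title"])
# same_id is True: fields outside key_fields do not affect the _id
```

Indexing both parses with this _id means the second one overwrites the first instead of creating a duplicate.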
If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.
You can read more about this approach here.
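As a rough illustration, such a request body could be built like this. The aggregation names and the example field `title` are placeholders, and the field is assumed to hold the value that should be unique:

```python
def duplicate_report_request(field, max_groups=100, docs_per_group=10):
    """Search body listing values of a field that occur in more than one document."""
    return {
        "size": 0,
        "aggs": {
            "duplicates": {
                # min_doc_count=2 keeps only values shared by at least two documents.
                "terms": {"field": field, "min_doc_count": 2, "size": max_groups},
                # top_hits returns the documents behind each duplicated value.
                "aggs": {"docs": {"top_hits": {"size": docs_per_group}}},
            }
        },
    }

body = duplicate_report_request("title")
```

Each bucket in the response then corresponds to one duplicated value, with the offending documents listed under `docs`.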
When a document is added to Elasticsearch with an explicit ID, Elasticsearch first checks whether a document with that ID already exists. If it does, the existing document is updated instead of a duplicate being added (the version field is incremented at the same time to track the number of updates that have occurred). You will therefore need to keep track of your document IDs somehow and use the same ID for matching documents to eliminate the possibility of duplicates.
