It is a good solution for convenience, if I put some data into elasticsearch index, which fields only using in _source parameter, not for search, or sort? Theese datas are stored in SQL too, but it's easier to access from elasticsearch, not needed plus SQL call. Is this a good direction?
Yes, You can access the field value from _source if you want to avoid the call to SQL, and better performance. And yes, IMO it's a good direction if you are thinking to call SQL just for retrieving these field values. fetching values from ES will be more efficient.
You can and should disable the index option on the field, where you don't need to search.
Also, refer to source filtering for more information.
Related
The documentation and recommendation for using stored_fields feature in ElasticSearch has been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Where as in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance in using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?
source filtering is the recommended way to fetch the fields and you are getting confused due to the blog, but you seem to miss the very important concept and use-case where it is applicable. Please read the below statement carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, elasticsearch returns only 10 hits/search results which can be changed based on the size parameter and if in your search results, you want to fetch few fields value than using source_filter makes perfect sense as it's done on the final result set(not all the documents matching search results),
While if you use the script, and using source value try to read field-value and filter the search result, this will cause queries to scan all the index which is the second part of the above-mentioned statement(not when processing millions of documents.)
Apart from the above, as all the field values are already stored as part of _source field which is enabled by default, you need not allocate extra space if you explicitly mark few fields as stored(disabled by default to save the index size) to retrieve field-values.
In Lucene i want to store the full document as well which would be just stored and not analysed. What i want to do is something like _source in Elastic Search.
But I'm confused as what would be the best data type in Lucene to store such data.
What field type should I use in Lucene to store such data. Should it be a StringField or something else?
I think elasticsearch stores _source as hex data. Not sure though.
Which data type would take less space and still be fast enough to retrieve?
As per this this part of the doc, it seems that Lucene treats each and every data type as:
opaque bytes
which could ideally convey that it doesn't really matter what type of field you're having as long as it's relative to your requirement, where Lucene would anyways convert them.
So deciding on which data type the field should be, totally depends on how do you want your fields to be in the index and also how're you gonna use them to visualize graphs in Kibana. Hope it helps!
I have a doubt about how indexed properties works in Alfresco 4.1.6 with SOLR 1.4.
I use something like this for my queries:
SearchParameters sp = new SearchParameters();
sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
sp.setQuery(query);
ResultSet results = getSearchService().query(sp);
where query variable is something like this:
PATH:" /app:company_home/app:user_homes/cm:_x0030_123//*" AND
((#cm\:title:food) OR (#cm\:name:abcde) OR (TEXT:valles) OR
(#doc\:custom_property:"report") OR (#doc\:custom_property2:"report")
AND (#doc\:custom_property3:"report") AND TYPE:"{my.model}voc_document"
On my model.xml I specify what custom properties are indexed
<index enabled="true">
My question is... How works SOLR 1.4 with the indexes if I put on the search query two or more indexed properties? Like Oracle? Oracle try the best index and use only this. Or maybe SOLR combine all the indexed properties and uses all the index on the query?
I need this answer to determine how many indexes put on my model.xml. Maybe put a lot of indexes don't give me the best and efficient result and is better index only a few properties.
And finally, one question. I use LANGUAGE_FTS_ALFRESCO, but I can see that exists a LANGUAGE_SOLR_FTS_ALFRESCO. Is the same? I need to use the second if I use SOLR?
Thanks a lot!
Best regards
There is only one "index". Every field you mark as indexable (which is enabled by default) ends up in your solr index. Alfresco takes your query and sends it to SOLR for processing.
If you don't have a lot of documents, you can go ahead and index every field. By far the biggest impact on indexing and search is the full text index of the content field, which is enabled by default also.
LANGUAGE_FTS_ALFRESCO will use whatever index subsystem you have enabled. In later versions it may use SOLR or the database depending on your configuration. If you try to LANGUAGE_SOLR_FTS_ALFRESCO, it's forcing SOLR, so if you don't have solr enabled, you would have an error.
Regards!
Just starting to use elasticsearch with haystack in django using postgres, and I'm pretty happy with it so far.
I'm wondering if the search queries (filters) through ES will submit a query to the DB or do they use data gathered during indexing?
Given that I can delete the data in the DB and still search, the answer seems to be yes, the queries do not touch the DB but only touch the index.
Also, I found this documentation on the matter:
http://django-haystack.readthedocs.org/en/latest/best_practices.html#avoid-hitting-the-database
Further, this is also from the docs:
For example, one great way to leverage this is to pre-rendering an
object’s search result template DURING indexing. You define an
additional field, render a template with it and it follows the main
indexed record into the index. Then, when that record is pulled when
it matches a query, you can simply display the contents of that field,
which avoids the database hit.:
The difference between the two, who hold all of the fields, eludes me.
If my document has:
{"mydoc":
{"properties":
{"name":{"type":"string","store":"true"}},
{"number":{"type":"long","store":"false"}},
{"title":{"type":"string","include_in_all":"false","store":"true"}}
}
}
I understand that _source is a field that has all the fields. But so does _all?
Does this mean that "name" is saved several times (twice? in _source and in _all), increasing the disk space the document takes?
Is "name" stored once for the field, once for _source, and once for _all?
what about "number", is it stored in _all, even though not in _source?
When should I use _source in my query, and when _all?
What is the use case where I can disable _all, and what functionality would then be denied?
It's pretty much the same as the difference between indexed fields and stored fields in lucene.
You use indexed fields when you want to search on them, while you store fields that you want to return as search results.
The _source field is meant to store the whole source document that was originally sent to elasticsearch. It's use as search result, to be retrieved. You can't search on it. In fact it is a stored field in lucene and not indexed.
The _all field is meant to index all the content that come from all the fields that your documents are composed of. You can search on it but never return it, since it's indexed but not stored in lucene.
There's no redundancy, the two fields are meant for a different usecase and stored in different places, within the lucene index. The _all field becomes part of what we call the inverted index, use to index text and be able to execute full-text search against it, while the _source field is just stored as part of the lucene documents.
You would never use the _source field in your queries, only when you get back results since that's what elasticsearch returns by default. There are a few features that depend on the _source field, that you lose if you disable it. One of them is the update API. Also, if you disable it you need to remember to configure as store:yes in your mapping all the fields that you want to return as search results. I would rather say don't disable it unless it bothers you, since it's really helpful in a lot of cases. One other common usecase would be when you need to reindex your data; you can just retrieve all your documents from elasticsearch itself and just resend them to another index.
On the other hand, the _all field is just a default catch all field, that you can use when you just want to search on all fields available and you don't want to specify them all in your queries. It's handy but I wouldn't rely on it too much on production, where it's better to run more complex queries on different fields, with different weights each. You might want to disable it if you don't use it, this will have a smaller impact than disabling the _source in my opinion.