I am currently learning how to use the development environment called Intrexx and I noticed that there are data fields of the "Text" type. What is the difference between string and text data fields?
I tried to look it up but I didn't really find any results. Can someone explain the difference to me?
The major difference between the two field types is how many characters you can store in them: a string field has a limit of 255 characters, whereas a text field has a limit of 30,000 characters (found here).
My log POCO has several fixed properties, like user ID and timestamp, plus a flexible data bag property, which is a JSON representation of any kind of extra information I'd like to add to the log. This means the property names within this data bag could be anything, which brings up two questions:
How can I configure the mapping so that the data bag property, which is of type string, would be mapped to a JSON object during the indexing, instead of being treated as a normal string?
Since the data bag object has arbitrary property names, the overall document type could end up with a huge number of fields inside. Would this hurt search performance?
For the data translation from string to JSON you can use an ingest pipeline with the JSON processor:
https://www.elastic.co/guide/en/elasticsearch/reference/master/json-processor.html
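For example, a minimal pipeline sketch (the field name data_bag and the pipeline name here are assumptions, not taken from your mapping):

PUT _ingest/pipeline/parse-data-bag
{
  "description": "Parse the stringified data bag into a structured object",
  "processors": [
    {
      "json": {
        "field": "data_bag",
        "target_field": "data_bag_parsed"
      }
    }
  ]
}

Indexing with ?pipeline=parse-data-bag would then store data_bag_parsed as a real JSON object instead of a plain string.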
It depends on your queries. If you use free-text search, then yes, a huge number of fields will slow the query down. If you use queries like "field": "value", then no, the number of fields is not a problem for searches. Additional information about query optimization can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/tune-for-search-speed.html#search-as-few-fields-as-possible
And the question is: what do you mean by "huge number"? 1,000? 10,000? 100,000? As part of the optimization I recommend using dynamic templates with the following definition: each string field is automatically indexed as keyword only, not text + keyword. This setting cuts the number of fields in half.
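A sketch of such a dynamic template (the index name logs is just a placeholder):

PUT logs
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    ]
  }
}

With this template every dynamically mapped string becomes a single keyword field instead of the default text field with a keyword subfield.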
when use doc["abc"],it turns out no field "abc" exception, only to find params._source["abc"] get everything correct.
I checked the status of doc["abc"].value ,it shows null , also doc["abc"].empty is true.
1.elasticsearch version:5.x
2.use painless inline sort script
can anybody figureit out what. happened?
Depending on where a script is used, it will have access to certain special variables and document fields. I don't know your mapping, but I think this link will answer your questions - Accessing document fields and special variables
To quote further from the above link:
Doc values and text fields
The doc['field'] syntax can also be used for analyzed text fields if
fielddata is enabled, but BEWARE: enabling fielddata on a text field
requires loading all of the terms into the JVM heap, which can be very
expensive both in terms of memory and CPU. It seldom makes sense to
access text fields from scripts.
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
The _source provides access to the original document body that was
indexed (including the ability to distinguish null values from empty
fields, single-value arrays from plain scalars, etc).
If your field abc is an analyzed text field or an object, doc won't work.
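A possible workaround, assuming abc was dynamically mapped as text with a keyword subfield, is to sort on the doc values of that subfield (in 5.x the script body goes under inline):

GET my_index/_search
{
  "sort": {
    "_script": {
      "type": "string",
      "script": {
        "lang": "painless",
        "inline": "doc['abc.keyword'].value"
      }
    }
  }
}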
If I add a document with several fields to an Elasticsearch index, when I view it in Kibana, I get the same field twice each time. One of them will be called
some_field
and the other one will be called
some_field.keyword
Where does this behaviour come from and what is the difference between both of them?
PS: one of them is aggregatable (not sure what that means) and the other (without keyword) is not.
Update: A short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored, which allows for free-text searches on one or more words in the field. The .keyword field takes the same input but keeps it as one single string, meaning it can be aggregated on and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resembles a SQL GROUP BY if you are familiar with that. In Kibana you would probably use the .keyword field with aggregations to count distinct values etc.
Please take a look at this article about text vs. keyword.
Briefly: since Elasticsearch 5.0 the string type has been replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, for a simple document with a string field:
{
  "some_field": "string value"
}
the dynamic mapping below will be created:
{
  "some_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, it will be possible both to perform full-text searches on some_field and to run keyword searches and aggregations using the some_field.keyword field.
I hope this answers your question.
Have a look at this issue; it contains some explanation of your question. Roughly speaking, some_field is analyzed and can be used for full-text search. On the other hand, some_field.keyword is not analyzed and can be used in term queries or in aggregations.
I will try to answer your questions one by one.
Where does this behavior come from?
It was introduced in Elastic 5.0.
What is the difference between the two?
some_field is used for full text search and some_field.keyword is used for keyword searching.
Full-text searching is used when we want individual tokens of a field's value to be included in the search. For instance, you might search for all the hotel names that have "farm" in them, such as Hay Farm House, Windy Harbour Farm House, etc.
Keyword searching is used when we want to match the whole value of the field, not individual tokens from it. For example, suppose you are aggregating documents on a city field. Aggregating on the analyzed field would give separate counts for "new" and "york" instead of a single count for "new york", which is usually the expected behavior.
From Elastic 5.0 onwards, strings are mapped both as keyword and text by default.
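To illustrate with made-up index and field names:

GET hotels/_search
{
  "query": {
    "match": { "name": "farm" }
  }
}

matches any hotel name containing the token "farm", while

GET hotels/_search
{
  "size": 0,
  "aggs": {
    "by_city": {
      "terms": { "field": "city.keyword" }
    }
  }
}

aggregates on the untokenized city values, so "new york" is counted as a single bucket.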
I am wondering what advantages, apart from type validation, the integer field type has over the string type. As far as I know, in the Lucene index those fields are stored in a common byte format anyway.
The reason I am asking is that I have a field whose value can be both a string and an integer. I am wondering whether I should create different types inside a mapping, i.e. localhost:9200/index/string_type and localhost:9200/index/integer_type, or whether I can safely (in terms of performance and other aspects) use the string type for both variants.
I am using elastic 2.4.
You could actually go with the string type for both. I don't personally see any advantage of having an integer_type over the string here. But then make sure that you map the string as not_analyzed, so the value of the field will not be analyzed or tokenized and you can simply use the field for aggregations. Maybe you should have a look at this one, which elaborates more. Having both field types at once would not make any difference from doing the above.
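A sketch of such a mapping for Elasticsearch 2.4 (index, type, and field names are placeholders):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

With "index": "not_analyzed" the whole value is stored as a single term, so values like "42" and "forty-two" can both be filtered and aggregated on as-is.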
I'm using the Contains function to search for strings in BLOB fields containing PDFs or Word documents. Recently I did the following search:
SELECT doc_id
FROM table_of_documents
WHERE CONTAINS (BLOB_FIELD, 'SDS.IF.00005') > 0
Most of the records returned were correct, but a few had PDFs in them that did not have "SDS.IF.00005" in them but did have "SDS.EL.00005" in them.
When I say the PDFs did not have the search term, I mean I opened them in Adobe reader and searched them using the search function and my own eyeballs, and also people extremely familiar with the documents insist that the term is not there and should not be there.
I tried escaping the dots: SDS\\.IF\\.00005 and {SDS.IF.00005}. However, I am still getting the same results.
I also tried setting CONTAINS (BLOB_FIELD, 'SDS.IF.00005') = 100, but I'm still getting documents with SDS.EL.00005 in them and not SDS.IF.00005.
Do the dots in the search term mean something like SDS.%.00005 to Oracle? Or should I be researching how to find deep hidden text in Adobe documents that's not visible to the naked eye or to the Adobe text search function?
Thanks for your help.
As far as I know, CONTAINS is an Oracle Text function that performs full-text search, so Oracle is tokenizing your string, probably according to its BASIC_LEXER. This lexer uses . as a word separator, so Oracle understands your query as "return anything that matches at least one of the words 'SDS', 'IF', or '00005'". As your PDF will probably have been indexed using that same lexer, from the Oracle Text point of view your PDF contains the words 'SDS', 'EL', and '00005'; it matches two of the three words, and so Oracle returns that row.
Actually, 'IF' is included in the Oracle Text default stopword list (words that are ignored because they are so common that they mostly introduce "noise"), so your query actually is "return anything that matches at least one of 'SDS' or '00005'". Therefore I am not surprised that a PDF containing the literal text "SDS.EL.00005" gives you CONTAINS(BLOB_FIELD, 'SDS.IF.00005') = 100 (a "perfect" match), as you wrote.
If you want to search for a verbatim string, I think you should rather not use Oracle Text and just implement a solution using plain old DBMS_LOB.INSTR. If that is not viable, then you will have to find a way to make Oracle Text index those strings without tokenizing them.
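A sketch of both approaches (the preference name is a placeholder, and note that a raw byte search will only find the string if the text is stored uncompressed inside the document, which is often not the case for PDFs):

-- Verbatim byte search, bypassing Oracle Text entirely
SELECT doc_id
  FROM table_of_documents
 WHERE DBMS_LOB.INSTR(blob_field,
                      UTL_RAW.CAST_TO_RAW('SDS.IF.00005')) > 0;

-- Alternatively, tell the Oracle Text lexer to keep dots inside tokens,
-- then rebuild the index with PARAMETERS ('LEXER my_lexer') so that
-- 'SDS.IF.00005' is indexed as a single token.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'PRINTJOINS', '.');
END;
/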