Bulk Insert object in Elasticsearch

I am trying to create an index and then do a bulk insert using RestHighLevelClient into my ES (the code is in Kotlin).
The bulk insert code is:
private fun insertEntity(entityList: List<Person>, indexName: String) {
    var count = 0
    val bulkRequest = BulkRequest()
    entityList.forEach {
        bulkRequest.add(IndexRequest(indexName).source(it, XContentType.JSON))
        count++
        if (count == batchSize) {
            performBulkInsert(bulkRequest)
        }
    }
}
When executing this, I am getting an exception saying: Limit of 1000 fields is crossed.
On analysing my code, I feel the implementation is wrong, because:
bulkRequest.add(IndexRequest(indexName).source(it, XContentType.JSON))
source takes a String, but I am passing the Person object (it) itself. So I believe that is causing some issue related to the 1000-field limit, based on my mapping or something.
Not sure if my assumption is correct. If yes, how can I achieve the bulk insert?
EDIT
Index creation:
private fun createIndex(indexName: String) {
    val request = CreateIndexRequest(indexName)
    val settings = FileUtils.readFileToString(
        ResourceUtils.getFile(
            ResourceUtils.CLASSPATH_URL_PREFIX + "settings/settings.json"), "UTF-8")
    val mappings = FileUtils.readFileToString(
        ResourceUtils.getFile(
            ResourceUtils.CLASSPATH_URL_PREFIX + "mappings/personMapping.json"), "UTF-8")
    request.settings(Settings
        .builder()
        .loadFromSource(settings, XContentType.JSON))
        .source(mappings, XContentType.JSON)
    restHighLevelClient.indices().create(request, RequestOptions.DEFAULT)
}
Mapping.json
Please note the original has 16 fields.
{
  "properties": {
    "accessible": {
      "type": "boolean"
    },
    "person_id": {
      "type": "long"
    },
    "person_name": {
      "type": "string",
      "analyzer": "lower_keyword"
    }
  }
}
Thanks.

Looks like you are using dynamic mapping, and due to some mistake, when you index a document it ends up creating new fields in your index, which crossed the 1000-field limit.
Please see if you can use static mapping, or debug the code which prepares the document and compare it with your mapping to see if it is creating new fields.
Please refer to this SO answer to increase the limit if it is legitimate, or use static mapping, or debug the code to figure out why you are adding new fields to the Elasticsearch index.
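One likely mechanism, for what it's worth: there is no source(Object, XContentType) overload, so source(it, XContentType.JSON) resolves to the varargs source(Object...) overload, which treats its arguments as alternating field names and values; every Person then mints a brand-new field name via its toString(). Serializing the entity to a JSON string first forces the source(String, XContentType) overload. A minimal sketch, assuming Jackson's ObjectMapper for serialization (the serializer choice is an assumption); it also resets and flushes the batch, which the original loop does not do:

private val objectMapper = ObjectMapper() // com.fasterxml.jackson.databind.ObjectMapper

private fun insertEntity(entityList: List<Person>, indexName: String) {
    var bulkRequest = BulkRequest()
    entityList.forEachIndexed { i, person ->
        // Serialize to a JSON string so the source(String, XContentType)
        // overload is chosen instead of the varargs source(Object...).
        val json = objectMapper.writeValueAsString(person)
        bulkRequest.add(IndexRequest(indexName).source(json, XContentType.JSON))
        if ((i + 1) % batchSize == 0) {
            performBulkInsert(bulkRequest)
            bulkRequest = BulkRequest() // start a fresh batch
        }
    }
    if (bulkRequest.numberOfActions() > 0) {
        performBulkInsert(bulkRequest) // flush the remainder
    }
}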

Related

Elasticsearch "data": { "type": "float" } query returns incorrect results

I have a query like the one below. When the date_partition field is "type" => "float", queries for 20220109, 20220108 and 20220107 all return the document.
When the field is "type" => "long", only the 20220109 query returns it, which is what I want.
For each of the queries below, the result is returned as if the query 20220109 had been sent.
--> 20220109, 20220108, 20220107
PUT date
{
  "mappings": {
    "properties": {
      "date_partition_float": {
        "type": "float"
      },
      "date_partition_long": {
        "type": "long"
      }
    }
  }
}
POST date/_doc
{
  "date_partition_float": "20220109",
  "date_partition_long": "20220109"
}
# this returns the document
GET date/_search
{
  "query": {
    "match": {
      "date_partition_float": "20220108"
    }
  }
}
# nothing returned
GET date/_search
{
  "query": {
    "match": {
      "date_partition_long": "20220108"
    }
  }
}
Is this a bug, or is this how the float type works?
I have 2 years of data loaded into Elasticsearch (day-1, day-2, ...; 20 GB primary shard size per day; 15 TB in total). What is the best way to change the type of just this field?
I have 5 float fields in my mapping; what is the fastest way to change all of them?
Note: I have the solutions below in mind, but I'm afraid they are slow:
update by query API
reindex API
runtime search request (especially this one)
Thank you!
That date_partition field should have the date type with format=yyyyMMdd; that's the only sensible type to use, not long, and even less so float.
PUT date
{
  "mappings": {
    "properties": {
      "date_partition": {
        "type": "date",
        "format": "yyyyMMdd"
      }
    }
  }
}
It's not logical to query for 20220108 and have the 20220109 document returned in the results. This is how floats work rather than a bug: a 32-bit float carries only 24 bits of mantissa precision, so above 2^24 = 16,777,216 not every integer is representable, and 20220108 and 20220109 collapse onto the same stored value.
Using the date type would also allow you to use proper time-based range queries and create date_histogram aggregations on your data.
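For instance, with that mapping, a time-based range query becomes straightforward (a minimal sketch):

GET date/_search
{
  "query": {
    "range": {
      "date_partition": {
        "gte": "20220101",
        "lte": "20220109"
      }
    }
  }
}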
You can either recreate the index with the adequate type and reindex your data, or add a new field to your existing index and update it by query. Both options are valid.
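As a rough illustration of the first option, assuming a new index date_v2 has already been created with the date mapping shown above (the index names are placeholders):

POST _reindex
{
  "source": {
    "index": "date"
  },
  "dest": {
    "index": "date_v2"
  }
}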
This may be the answer to my question => https://discuss.elastic.co/t/elasticsearch-data-type-float-returns-incorrect-results/300335

Update restrictions on Elasticsearch Object type field

I have to store documents in which a single field contains a single JSON object. This object has a variable depth and a variable schema.
I configured a mapping like this:
"mappings": {
"properties": {
"#timestamp": {
"type": "date"
},
"message": {
"type": "object"
}
}
}
It works fine, and Elasticsearch creates and updates the mapping with the documents it receives.
The problem is that after some mapping updates, it rejects new documents and does not update the mapping anymore. When that happens I switch to a new index, and mapping updates occur for that index. I'm looking forward to knowing the right solution.
for example, the first document is:
{
  "personalInfo": {
    "firstName": "tom"
  },
  "moviesStatistics": {
    "count": 100
  }
}
and the second document, which will update the Elasticsearch mapping, is:
{
  "personalInfo": {
    "firstName": "tom",
    "lastName": "hanks"
  },
  "moviesStatistics": {
    "count": 100
  },
  "education": {
    "title": "a title..."
  }
}
Elasticsearch creates the mapping with doc1 and updates it with doc2, doc3, ... until a certain number of documents has been received. After that it starts to reject every document that does not match the latest mapping fields.
After all, I found the solution in the Elasticsearch reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.13//dynamic-field-mapping.html
We can use dynamic mapping and simply use this mapping:
"mappings": {
"dynamic": "true"
}
You should also change some of the default restrictions mentioned here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.13//mapping-settings-limit.html
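The limits most relevant to a free-form object field are index.mapping.total_fields.limit and index.mapping.depth.limit, both of which can be raised on a live index; a sketch (the index name and the values are placeholders):

PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 2000,
  "index.mapping.depth.limit": 50
}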

how to create a join relation using elasticsearch python client

I am looking for any examples that implement a parent-child relationship using the Python interface.
I can define a mapping such as:
es.indices.create(
    index="docpage",
    body={
        "mappings": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {
                        "my_document": "my_page"
                    }
                }
            }
        }
    }
)
I am then indexing a document using
res = es.index(index="docpage", doc_type="_doc", id=1, body=jsonDict)
where jsonDict is a dict structure of the document's text,
jsonDict['my_join_field'] = 'my_document', and other relevant info.
Reference example.
I tried adding a pageDict, where page is a string containing the text of one page in a document:
pageDict['content'] = page
pageDict['my_join_field'] = {}
pageDict['my_join_field']['parent'] = "1"
pageDict['my_join_field']['name'] = "page"
res = es.index(index="docpage", doc_type="_doc", body=pageDict)
but I get a parser error:
RequestError(400, 'mapper_parsing_exception', 'failed to parse')
Any ideas?
This worked for me (note that the join field name must match the mapping, my_join_field with underscores, and the child must be routed to its parent's shard):
res = es.index(index="docpage", doc_type="_doc", routing=1,
               body={"content": page,
                     "my_join_field": {
                         "name": "my_page",
                         "parent": "1"}})
The initial syntax can work if the parent is also passed as the routing parameter of the call (and the child relation name in pageDict matches the mapping, i.e. my_page rather than page):
res = es.index(index="docpage", doc_type="_doc", body=pageDict, routing=1)
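Putting it together, a minimal end-to-end sketch (names follow the mapping above; doc_type is omitted, which is fine on recent client versions, and the final query is just one way to verify the link):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Parent document: the join field only needs the relation name.
es.index(index="docpage", id=1,
         body={"text": "whole-document text",
               "my_join_field": "my_document"})

# Child document: name the child relation and its parent id, and route
# the child to the parent's shard via the routing parameter.
es.index(index="docpage", id=2, routing=1,
         body={"content": "text of one page",
               "my_join_field": {"name": "my_page", "parent": "1"}})

# Fetch parents that have at least one child page.
res = es.search(index="docpage",
                body={"query": {"has_child": {"type": "my_page",
                                              "query": {"match_all": {}}}}})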

azure logic app with table storage get last rowKey

How can I use the "Get Entity for Azure table storage" connector in a Logic App to return the last rowKey.
This would be used in situation where the rowkey is say an integer incremented each time a new entity is added. I recognize the flaw in design of this but this question is about how some sort of where clause or last condition could be used in the Logic app.
Currently the Logic App code view snippet looks like this:
"actions": {
"Get_entity": {
"inputs": {
"host": {
"connection": {
"name": "#parameters('$connections')['azuretables']['connectionId']"
}
},
"method": "get",
"path": "/Tables/#{encodeURIComponent('contactInfo')}/entities(PartitionKey='#{encodeURIComponent('a')}',RowKey='#{encodeURIComponent('b')}')"
},
"runAfter": {},
"type": "ApiConnection"
}
where I have hard-coded:
RowKey='@{encodeURIComponent('b')}'
This is fine if I always want this RowKey. What I want, though, is the last RowKey, so something sort of like:
RowKey = last(RowKey)
Any idea how this can be achieved?
This is fine if I always want this RowKey. What I want, though, is the last RowKey, so something sort of like: RowKey = last(RowKey)
AFAIK, there are no built-in functions for you to achieve this. I assume you could use the Azure Functions connector to retrieve the new RowKey value. Here are the detailed steps; you can refer to them:
For a test, I created a C# HTTP-triggered function, added an Azure Table Storage input binding, retrieved all the items under the specific PartitionKey, ordered them by RowKey, and calculated the new RowKey.
function.json:
{
  "bindings": [
    {
      "authLevel": "function",
      "name": "req",
      "type": "httpTrigger",
      "direction": "in"
    },
    {
      "name": "$return",
      "type": "http",
      "direction": "out"
    },
    {
      "type": "table",
      "name": "inputTable",
      "tableName": "SampleTable",
      "take": 50,
      "connection": "AzureWebJobsDashboard",
      "direction": "in"
    }
  ],
  "disabled": false
}
run.csx:
#r "Microsoft.WindowsAzure.Storage"
using Microsoft.WindowsAzure.Storage.Table;
using System.Net;
public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, IQueryable<SampleTable> inputTable,TraceWriter log)
{
log.Info("C# HTTP trigger function processed a request.");
// parse query parameter
string pk = req.GetQueryNameValuePairs()
.FirstOrDefault(q => string.Compare(q.Key, "pk", true) == 0)
.Value;
// Get request body
dynamic data = await req.Content.ReadAsAsync<object>();
// Set name to query string or body data
pk = pk ?? data?.pk;
if(pk==null)
return req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a pk on the query string or in the request body");
else
{
var latestItem=inputTable.Where(p => p.PartitionKey == pk).ToList().OrderByDescending(i=>Convert.ToInt32(i.RowKey)).FirstOrDefault();
if(latestItem==null)
return req.CreateResponse(HttpStatusCode.OK,new{newRowKey=1});
else
return req.CreateResponse(HttpStatusCode.OK,new{newRowKey=int.Parse(latestItem.RowKey)+1});
}
}
public class SampleTable : TableEntity
{
public long P1 { get; set; }
public long P2 { get; set; }
}
For more details about Azure Functions Storage table bindings, you could refer to here.
Azure Table Storage entities are sorted lexicographically, so choose a row key that actually decrements every time you add a new entity; i.e., if your row key is an integer that gets incremented when a new entity is created, then choose your row key as Int.Max - entity.RowKey. The latest entity for that partition key will always be at the top, since it will have the lowest row key, so all you then need to do to retrieve it is query with the partition key only and Take(1). This is called the Log Tail pattern, if you want to read more about it.
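As a rough sketch of that idea (names are illustrative; the zero-padding is the important detail, since the ordering is lexicographic, not numeric):

// Derive a lexicographically descending RowKey from an incrementing id.
// Zero-padding keeps string order aligned with numeric order.
static string InvertedRowKey(int id) => (int.MaxValue - id).ToString("D10");

// id = 1 -> "2147483646"
// id = 2 -> "2147483645" (sorts first, so the newest entity is on top)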

Elasticsearch. Can not find custom analyzer

I have a model like this:
@Getter
@Setter
@Document(indexName = "indexName", type = "typeName")
@Setting(settingPath = "/elastic/elastic-setting.json")
public class Model extends BaseModel {

    @Field(type = FieldType.String, index = FieldIndex.analyzed, analyzer = "customAnalyzer")
    private String name;
}
And I have elastic-setting.json inside ../resources/elastic/elastic-setting.json:
{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "analyzer": {
        "customAnalyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}
I cleaned my Elastic DB, and when I start my application I get this exception:
MapperParsingException[analyzer [customAnalyzer] not found for field [name]]
What's wrong with my code?
Help me, please!
EDIT
Val, I thought @Setting was an addition to @Document, but it looks like they are used interchangeably.
In my case I also have another model, with:
@Document(indexName = "indexName", type = "anotherTypeName")
So first it creates the index named "indexName" for anotherModel; next, when Elastic prepares Model, it sees that an index named "indexName" has already been created, and it does not use @Setting.
Now I have another question:
How do I add a custom analyzer to an already created index in Java code, for example in an InitializingBean? Something like: is my analyzer created? No - create it. Yes - do nothing.
Modify your elastic-setting.json file like this:
{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  },
  "analysis": {
    "analyzer": {
      "customAnalyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email"
      }
    }
  }
}
Note that you need to delete your index first and recreate it.
UPDATE
You can certainly add a custom analyzer via Java code; however, you won't be able to change your existing mapping in order to use that analyzer, so you're really better off wiping your index and recreating it from scratch with a proper elastic-setting.json file.
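For completeness, if you did want to add an analyzer to an already created index (say, from the InitializingBean mentioned above), the index has to be closed while its analysis settings are updated; a console-level sketch of the three calls involved (the index name follows the question):

POST indexName/_close

PUT indexName/_settings
{
  "analysis": {
    "analyzer": {
      "customAnalyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email"
      }
    }
  }
}

POST indexName/_open

Even then, fields already mapped without the analyzer cannot be switched over without a reindex, which is why recreating the index is the cleaner option.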
For Val:
Yeah, I use something like this.
Previously, I had added @Setting to one of my entity classes, but when I started the app an index with the same name had already been created before Spring Data analysed the entity with @Setting, so the index was not modified.
Now I add the annotation @Setting(path = "elastic-setting.json") on the abstract BaseModel; the class highest in the hierarchy is scanned first, and the analyzer is created as well.
