_update_by_query fails to update all documents in Elasticsearch

I have over 30 million documents in Elasticsearch (version 6.3.3). I am trying to add a new field to all existing documents and set its value to 0.
For example: I want to add a start field, which did not previously exist in the Twitter document, and set its initial value to 0, in all 30 million documents.
In my case only about 4 million were updated. If I check the submitted task with the Tasks API (http://localhost:9200/_tasks/{taskId}), the result says something like:
{
  "completed": false,
  "task": {
    "node": "Jsecb8kBSdKLC47Q28O6Pg",
    "id": 5968304,
    "type": "transport",
    "action": "indices:data/write/update/byquery",
    "status": {
      "total": 34002005,
      "updated": 3618000,
      "created": 0,
      "deleted": 0,
      "batches": 3619,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1.0,
      "throttled_until_millis": 0
    },
    "description": "update-by-query [Twitter][tweet] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.Twitter.start = 0;', options={}, params={}}",
    "start_time_in_millis": 1574677050104,
    "running_time_in_nanos": 466805438290,
    "cancellable": true,
    "headers": {}
  }
}
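For reference, the Tasks API addresses a task by the node id and task id joined with a colon, so the status above can be polled like this (a sketch using the ids from the output):

curl "http://localhost:9200/_tasks/Jsecb8kBSdKLC47Q28O6Pg:5968304"

Note that completed is still false here, so the counts only reflect progress at the moment of polling.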
The query I am executing against ES is something like:
curl -XPOST "http://localhost:9200/_update_by_query?wait_for_completion=false&conflicts=proceed" -H 'Content-Type: application/json' -d'
{
  "script": {
    "source": "ctx._source.Twitter.start = 0;"
  },
  "query": {
    "exists": {
      "field": "Twitter"
    }
  }
}'
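For reference, a request over 30+ million documents can also be split into parallel slices; this is a sketch of the same call, assuming slices=auto is available (on 6.x it picks one slice per shard, as far as I know):

curl -XPOST "http://localhost:9200/_update_by_query?wait_for_completion=false&conflicts=proceed&slices=auto" -H 'Content-Type: application/json' -d'
{
  "script": {
    "source": "ctx._source.Twitter.start = 0;"
  },
  "query": {
    "exists": {
      "field": "Twitter"
    }
  }
}'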
Any suggestions would be great, thanks

Related

Seaweedfs Delete file succeeds but existing filer still holds it

We use seaweedfs 1.78.
When I delete a file via a filer:
curl -X DELETE http://filer1:9889/dataset/qiantao/1.txt
it returns success.
But I have 10 filers, and after the delete the file is still listed:
curl -H "Accept: application/json" "http://filer2:9889/dataset/qiantao/?pretty=y" | grep qiantao | grep txt
"FullPath": "/dataset/qiantao/1.txt",
If I start a new filer, it cannot get /dataset/qiantao/1.txt at all, which looks perfect.
But the existing filers still return the file info below:
curl -H "Accept: application/json" "http://filer1:9889/dataset/qiantao/?pretty=y&limit=1"
{
  "Path": "/dataset/qiantao",
  "Entries": [
    {
      "FullPath": "/dataset/qiantao/1.txt",
      "Mtime": "2020-12-07T11:15:59+08:00",
      "Crtime": "2020-12-07T11:15:59+08:00",
      "Mode": 432,
      "Uid": 0,
      "Gid": 0,
      "Mime": "text/plain",
      "Replication": "010",
      "Collection": "",
      "TtlSec": 0,
      "UserName": "",
      "GroupNames": null,
      "SymlinkTarget": "",
      "Md5": null,
      "Extended": null,
      "chunks": [
        {
          "file_id": "4328,587fb084df9f9dbf",
          "size": 2,
          "mtime": 1607310959158810676,
          "e_tag": "c7c83966",
          "fid": {
            "volume_id": 4328,
            "file_key": 1484763268,
            "cookie": 3751779775
          }
        }
      ]
    }
  ],
  "Limit": 1,
  "LastFileName": "1.txt",
  "ShouldDisplayLoadMore": true
}
The volume info is below:
{
  "Id": 4328,
  "Size": 31492542356,
  "ReplicaPlacement": {
    "SameRackCount": 0,
    "DiffRackCount": 1,
    "DiffDataCenterCount": 0
  },
  "Ttl": {
    "Count": 0,
    "Unit": 0
  },
  "Collection": "",
  "Version": 3,
  "FileCount": 111030,
  "DeleteCount": 709,
  "DeletedByteCount": 1628822733,
  "ReadOnly": false,
  "CompactRevision": 0,
  "ModifiedAtSecond": 0,
  "RemoteStorageName": "",
  "RemoteStorageKey": ""
}
So I downloaded 4328.idx from the volume server and looked it up with see_idx:
./see_idx -dir /Users/qiantao/Documents/seaweedfs -volumeId=4328 -v=4 | grep 587fb084
key:587fb084 offset:2802901546 size:57
key:587fb084 offset:3937021600 size:4294967295
It looks like key 587fb084 has been superseded by a newer entry?
So how can I fix this problem so that everything appears normal?
4294967295 is a tombstone, marking that the entry has been deleted.
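That size is the maximum unsigned 32-bit value, used as a sentinel; a quick check in a shell:

# 4294967295 in hex is ffffffff, i.e. all 32 bits set
printf '%x\n' 4294967295

So the volume's .idx file is consistent with the delete having succeeded, and the stale listing comes from the existing filers' metadata, which matches the observation that only a freshly started filer stops showing the file.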

Error while remote indexing with elasticsearch

I'm trying to move from one ES cluster to another, in order to plan an upgrade. Both are the same version (6.4). To achieve this, I'm using this command:
curl -XPOST -H "Content-Type: application/json" http://new_cluster/_reindex -d @reindex.json
And reindex.json looks like this:
{
  "source": {
    "remote": {
      "host": "http://old_cluster:9199"
    },
    "index": "megabase.33.2",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "megabase.33.2"
  }
}
I whitelisted the old cluster on the new cluster, and it works, but the data migration can't run to completion because I get this error, and I don't understand what it means here:
{
  "took": 1762,
  "timed_out": false,
  "total": 8263428,
  "updated": 5998,
  "created": 5001,
  "deleted": 0,
  "batches": 11,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "megabase.33.2",
      "type": "persona",
      "id": "noYOA3IBTWbNbLJUqk6T",
      "cause": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [adr_inse]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "For input string: \"2A004\""
        }
      },
      "status": 400
    }
  ]
}
The record in the original cluster looks like this:
{
  "took": 993,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": [
      {
        "_index": "megabase.33.2",
        "_type": "persona",
        "_id": "noYOA3IBTWbNbLJUqk6T",
        "_score": 0,
        "_source": {
          "address": "Obfucated",
          "adr_inse": "2A004",
          "age": 10,
          "base": "Obfucated",
          "city": "Obfucated",
          "cp": 20167,
          "email_md5": "Obfucated",
          "fraicheur": "2020-01-12T19:39:04+01:00",
          "group": 1,
          "latlon": "Obfucated",
          "partner": "Obfucated",
          "partnerbase": 2,
          "sex": 2,
          "sms_md5": "Obfucated"
        }
      }
    ]
  }
}
Any clue on what I'm doing wrong?
Thanks a lot.
Found out: the mapping is not created correctly when using only the reindex method; the destination index is created on the fly with a dynamically guessed mapping, which presumably inferred adr_inse as numeric and so could not parse "2A004". So I dropped the new index and recreated the mapping using elasticdump:
elasticdump --input=http://oldcluster/megabase.33.2 --output=http://newcluster/megabase.33.2 --type=mapping
Then I ran the previous script and everything worked flawlessly (and was rather quick).
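For reference, the same mapping copy can be done with curl alone; this is a sketch, assuming jq is available to unwrap the index name from the _mapping response:

# fetch the mapping from the old cluster and rewrap it as an index-creation body
curl -s "http://old_cluster:9199/megabase.33.2/_mapping" | jq '{mappings: .["megabase.33.2"].mappings}' > mapping.json
# create the destination index with the explicit mapping before running _reindex
curl -XPUT "http://new_cluster/megabase.33.2" -H 'Content-Type: application/json' -d @mapping.json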

Laravel - Change pagination root url

I am building an API Gateway and I am having a small problem returning the pagination URLs from a microservice to the API Gateway.
This is the current structure of my API Gateway:
When I call the microservice, I can easily pass the paging parameters using the request data:
HTTP::get('http://api.billing.microservice.test/v2/invoices', $request->all());
However, when I make a request to a microservice, it returns the requested data, but with the URL of the microservice:
{
  "data": [
    # data returned from the billing microservice with the billing API URL
  ],
  "links": {
    "first": "http://api.billing.microservice.test/v2/invoices?page=1",
    "last": "http://api.billing.microservice.test/v2/invoices?page=10",
    "prev": null,
    "next": "http://api.billing.microservice.test/v2/invoices?page=2"
  },
  "meta": {
    "current_page": 1,
    "from": 1,
    "last_page": 10,
    "path": "http://api.billing.microservice.test/v2/invoices",
    "per_page": 30,
    "to": 30,
    "total": 300
  }
}
However, I need the response to use the main API URL:
{
  "data": [
    # data returned from the billing microservice with the main API URL
  ],
  "links": {
    "first": "http://api.main.test/v2/invoices?page=1",
    "last": "http://api.main.test/v2/invoices?page=10",
    "prev": null,
    "next": "http://api.main.test/v2/invoices?page=2"
  },
  "meta": {
    "current_page": 1,
    "from": 1,
    "last_page": 10,
    "path": "http://api.main.test/v2/invoices",
    "per_page": 30,
    "to": 30,
    "total": 300
  }
}
Has anyone had to do something similar? What is the best way to achieve the desired result? Do a replace using some kind of regex? Is there anything I can do inside the microservice?
I was able to solve my problem with a simple method that already exists:
Model::paginate()->setPath('http://api.main.microservice.test/v2/invoices');
The result is:
{
  "data": [
    # data
  ],
  "links": {
    "first": "http://api.main.microservice.test/v2/invoices?page=1",
    "last": "http://api.main.microservice.test/v2/invoices?page=1",
    "prev": null,
    "next": null
  },
  "meta": {
    "current_page": 1,
    "from": 1,
    "last_page": 1,
    "path": "http://api.main.microservice.test/v2/invoices",
    "per_page": 30,
    "to": 3,
    "total": 3
  }
}

Elasticsearch groovy script not working as expected

Here is the partial mapping of my index listing (Elasticsearch 2.5; I know I have to upgrade to a newer version and start using Painless, but let's keep that aside for this question):
"name": { "type": "string" },
"los": {
"type": "nested",
"dynamic": "strict",
"properties": {
"start": { "type": "date", "format": "yyyy-MM" },
"max": { "type": "integer" },
"min": { "type": "integer" }
}
}
I have only one document in my storage, and it is as follows:
{
  "name": "foobar",
  "los": [
    { "max": 12, "start": "2018-02", "min": 1 },
    { "max": 8, "start": "2018-03", "min": 3 },
    { "max": 10, "start": "2018-04", "min": 2 },
    { "max": 12, "start": "2018-05", "min": 1 }
  ]
}
I have a Groovy script in my Elasticsearch query as follows:
los_map = [doc['los.start'], doc['los.max'], doc['los.min']].transpose()
return los_map.size()
This Groovy script ALWAYS returns 0, which should not be possible: I have the one document mentioned above (and even if I add multiple documents, it still returns 0), and the los field is guaranteed to be present in every doc, with multiple objects in it. So it seems the transpose I am doing is not working correctly?
I also tried changing the line los_map = [doc['los.start'], doc['los.max'], doc['los.min']].transpose() to los_map = [doc['los'].start, doc['los'].max, doc['los'].min].transpose(), but then I get the error "No field found for [los] in mapping with types [listing]".
Does anyone have any idea how to get the transpose to work?
By the way, if you are curious, my complete script is as follows:
losMinMap = [:]
losMaxMap = [:]
los_map = [doc['los.start'], doc['los.max'], doc['los.min']].transpose()
los_map.each {st, mx, mn ->
losMinMap[st] = mn
losMaxMap[st] = mx
}
return los_map['2018-05']
Thank you in advance.

elasticsearch spends all its time in build_scorer

When we upgraded our ES from 1.4 to 5.2, we hit a performance problem with this type of query:
{
  "_source": false,
  "from": 0,
  "size": 50,
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        {
          "ids": {
            "values": [<list of 400 ids>],
            "boost": 1
          }
        }
      ],
      "should": [
        {
          "terms": {
            "views": [<list of 20 ints>]
          }
        }
      ],
      "minimum_should_match": "0",
      "boost": 1
    }
  }
}
When profiling, we found that the problem is in build_scorer, which is called for each segment:
1 shard;
20 segments;
took: 55
{
  "type": "BooleanQuery",
  "description": "views:[9875227 TO 9875227] views:[6991599 TO 6991599] views:[6682953 TO 6682953] views:[6568587 TO 6568587] views:[10080097 TO 10080097] views:[9200174 TO 9200174] views:[9200174 TO 9200174] views:[10080097 TO 10080097] views:[9966870 TO 9966870] views:[6568587 TO 6568587] views:[6568587 TO 6568587] views:[8538669 TO 8538669] views:[8835038 TO 8835038] views:[9200174 TO 9200174] views:[7539089 TO 7539089] views:[6991599 TO 6991599] views:[8222303 TO 8222303] views:[9342166 TO 9342166] views:[7828288 TO 7828288] views:[9699294 TO 9699294] views:[9108691 TO 9108691] views:[9431297 TO 9431297] views:[7539089 TO 7539089] views:[6032694 TO 6032694] views:[9491741 TO 9491741] views:[9498225 TO 9498225] views:[8051047 TO 8051047] views:[9866955 TO 9866955] views:[8222303 TO 8222303] views:[9622214 TO 9622214]",
  "time": "39.70427700ms",
  "breakdown": {
    "score": 99757,
    "build_scorer_count": 20,
    "match_count": 0,
    "create_weight": 37150,
    "next_doc": 0,
    "match": 0,
    "create_weight_count": 1,
    "next_doc_count": 0,
    "score_count": 110,
    "build_scorer": 38648674,
    "advance": 918274,
    "advance_count": 291
  }
}
So 38 ms of the total 55 ms were spent in build_scorer, which seems weird.
On ES 1.5 we have about the same number of segments, but the query runs 10x faster.
Unfortunately, ES 1.x doesn't have a profiler to check how many times build_scorer executes there.
So the question is: why is build_scorer_count equal to the number of segments, and how can we tackle this performance issue?
