Kibana Fatal Error (Server is not ready yet) - elasticsearch

retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[]]", "cluster.uuid": "5FrxkY3GRbGzR2nSuEaxow", "node.id": "kHGHvefFTViG8CdPdF5pxw" }
kibana | {"type":"log","#timestamp":"2021-05-28T09:15:07+00:00","tags":["info","savedobjects-service"],"pid":7,"message":"[.kibana_task_manager] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> DONE"}
kibana | {"type":"log","#timestamp":"2021-05-28T09:15:07+00:00","tags":["info","savedobjects-service"],"pid":7,"message":"[.kibana_task_manager] Migration completed after 8648ms"}
kibana | {"type":"log","#timestamp":"2021-05-28T09:15:28+00:00","tags":["warning","plugins","licensing"],"pid":7,"message":"License information could not be obtained from Elasticsearch due to Error: Cluster client cannot be used after it has been closed. error"}
kibana | {"type":"log","#timestamp":"2021-05-28T09:15:37+00:00","tags":["warning","plugins-system"],"pid":7,"message":"\"eventLog\" plugin didn't stop in 30sec., move on to the next."}
kibana |
kibana | FATAL Error: Unable to complete saved object migrations for the [.kibana] index. RequestAbortedError: The content length (823769731) is bigger than the maximum allowed string (536870888)
kibana |
kibana exited with code 1
I got this error after creating a number of visualizations with Vega-Lite. How can I solve it?
If needed, I can post all the information.

After reading various answers on the web, I managed to solve it.
Step 1:
`http://localhost:9200/.kibana/_search?q=type:dashboard&size=100`
Step 2:
`curl -X DELETE "localhost:5601/api/saved_objects/kibanaSavedObjectMeta.searchSourceJSON.index/index-pattern/e68f45b0-ab73-11eb-a01c-8590ef1580f4/" -H 'kbn-xsrf: true'
# (respectively <name>/<type>/<id>)
`
Before doing this, I would follow the instructions in this link to be sure everything works correctly.
In practice, the problem comes down to having too many saved objects for the space available. There are different ways to solve it: in my case I deleted them all, but you can also increase the available space. The other suggestion is to avoid partitions.
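For reference, a minimal sketch of those two steps (the saved-object type and id below are placeholders; use whatever the first query returns):

# list saved objects of a given type stored in the .kibana index
curl -s 'http://localhost:9200/.kibana/_search?q=type:visualization&size=100&filter_path=hits.hits._id'

# delete an oversized saved object through the Kibana saved objects API (<type>/<id> are placeholders)
curl -X DELETE 'http://localhost:5601/api/saved_objects/visualization/<id>' -H 'kbn-xsrf: true'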

Related

Wazuh - Filebeat - Elasticsearch non-zero metrics

Could you please help me solve this Filebeat error?
It's a Wazuh manager server. Everything is working: I can connect to the Kibana web UI, open the Wazuh app, and see my three Wazuh agents connected and active.
I want FIM monitoring, and if I change a file on an agent server, an alert is created and I can see it in alert.log on the manager server. The issue is that Filebeat won't send this alert to Elasticsearch, so I can't see the alert in the Kibana web UI.
Wazuh manager:
Wazuh 4.2.5
Filebeat 7.14.2
Elasticsearch 7.14.2
Kibana 7.14.2
Wazuh alert log - /var/ossec/logs/alerts/2022/Feb/ and /var/ossec/logs/alerts
systemctl status filebeat shows it as active, but I can see lines like this:
WARN [elasticsearch] elasticsearch/client.go:405 Cannot>
This is the error from running filebeat -e:
2022-02-03T12:46:20.386+0100 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cgroup":{"memory":{"id":"session-248447.scope","mem":{"limit":{"bytes":9223372036854771712},"usage":{"bytes":622415872}}}},"cpu":{"system":{"ticks":70,"time":{"ms":72}},"total":{"ticks":300,"time":{"ms":311},"value":300},"user":{"ticks":230,"time":{"ms":239}}},"handles":{"limit":{"hard":262144,"soft":1024},"open":9},"info":{"ephemeral_id":"641d7fdd-47a0-4b10-bda9-36f29c29fdef","uptime":{"ms":98413},"version":"7.14.2"},"memstats":{"gc_next":18917616,"memory_alloc":14197072,"memory_sys":75383816,"memory_total":71337840,"rss":115638272},"runtime":{"goroutines":11}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":2,"starts":2},"reloads":1,"scans":1},"output":{"events":{"active":0},"type":"elasticsearch"},"
And here is the error found in /var/log/messages:
Feb 3 10:27:54 filebeat[2531915]: 2022-02-03T10:27:54.707+0100#011WARN#011[elasticsearch]#011elasticsearch/client.go:405#011Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc07705e669760167, ext:958857091513, loc:(*time.Location)(0x5620964fb2a0)}, Meta:{"pipeline":"filebeat-7.14.0-wazuh-alerts-pipeline"}, Fields:{"agent":{"ephemeral_id":"33cb9baa-af71-4b44-99a6-1379c747722f","hostname":"xlc","id":"03fb57ca-9940-4886-9e6e-a3b3e635cd35","name":"xlc","type":"filebeat","version":"7.14.0"},"ecs":{"version":"1.10.0"},"event":{"dataset":"wazuh.alerts","module":"wazuh"},"fields":{"index_prefix":"wazuh-monitoring-"},"fileset":{"name":"alerts"},"host":{"name":"xlc"},"input":{"type":"log"},"log":{"file":{"path":"/var/ossec/logs/alerts/alerts.json"},"offset":122695554},"message":"{\"timestamp\":\"2022-02-03T10:27:52.438+0100\",\"rule\":{\"level\":5,\"description\":\"Registry Value Integrity Checksum Changed\",\"id\":\"750\",\"mitre\":{\"id\":[\"T1492\"],\"tactic\":[\"Impact\"],\"technique\":[\"Stored Data Manipulation\"]},\"firedtimes\":7,\"mail\":false,\"groups\":[\"ossec\",\"syscheck\",\"syscheck_entry_modified\",\"syscheck_registry\"],\"pci_dss\":[\"11.5\"],\"gpg13\":[\"4.13\"],\"gdpr\":[\"II_5.1.f\"],\"hipaa\":[\"164.312.c.1\",\"164.312.c.2\"],\"nist_800_53\":[\"SI.7\"],\"tsc\":[\"PI1.4\",\"PI1.5\",\"CC6.1\",\"CC6.8\",\"CC7.2\",\"CC7.3\"]},\"agent\":{\"id\":\"006\",\"name\":\"CPP\",\"ip\":\"10.74.37.3\"},\"manager\":{\"name\":\"xlc\"},\"id\":\"1643880472.68132386\",\"full_log\":\"Registry Value '[x32] HKEY_LOCAL_MACHINE\\\\System\\\\CurrentControlSet\\\\Services\\\\W32Time\\\\Config\\\\LastKnownGoodTime' modified\\nMode: scheduled\\nChanged attributes: md5,sha1,sha256\\nOld md5sum was: '5df5b1598b729d98734105148103abf2'\\nNew md5sum is : '361334bf60bdd83e30894c4f313d16ec'\\nOld sha1sum was: 'c233c8ccb56fbd363c44b51a9d51c7fa32512474'\\nNew sha1sum is : '7163cffa48f1a7c0bcb4a3ddff6278ae9a4895a6'\\nOld sha256sum was: '3aad3da22f2d53e8ac33c46c73f40c3e8f5db07188d166e24957d8a20b62b5f1'\\nNew sha256sum is : 'bee8072335d870a1624a541cb13ca5085ba85646a8417d4d894deff71c3f4a92'\\n\",\"syscheck\":{\"path\":\"HKEY_LOCAL_MACHINE\\\\System\\\\CurrentControlSet\\\\Services\\\\W32Time\\\\Config\",\"mode\":\"scheduled\",\"arch\":\"[x32]\",\"value_name\":\"LastKnownGoodTime\",\"size_after\":\"8\",\"md5_before\":\"5df5b1598b729d98734105148103abf2\",\"md5_after\":\"361334bf60bdd83e30894c4f313d16ec\",\"sha1_before\":\"c233c8ccb56fbd363c44b51a9d51c7fa32512474\",\"sha1_after\":\"7163cffa48f1a7c0bcb4a3ddff6278ae9a4895a6\",\"sha256_before\":\"3aad3da22f2d53e8ac33c46c73f40c3e8f5db07188d166e24957d8a20b62b5f1\",\"sha256_after\":\"bee8072335d870a1624a541cb13ca5085ba85646a8417d4d894deff71c3f4a92\",\"changed_attributes\":[\"md5\",\"sha1\",\"sha256\"],\"event\":\"modified\"},\"decoder\":{\"name\":\"syscheck_registry_value_modified\"},\"location\":\"syscheck\"}","service":{"type":"wazuh"}}, Private:file.State{Id:"native::1049-64776", PrevId:"", Finished:false, Fileinfo:(*os.fileStat)(0xc000fc9380), Source:"/var/ossec/logs/alerts/alerts.json", Offset:122697450, Timestamp:time.Time{wall:0xc07704f6d4cb3764, ext:510354422, loc:(*time.Location)(0x5620964fb2a0)}, TTL:-1, Type:"log", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x419, Device:0xfd08}, IdentifierName:"native"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"illegal_argument_exception","reason":"data_stream [<wazuh-monitoring-{2022.02.03||/d{yyyy.MM.dd|UTC}}>] must not 
contain the following characters [ , \", *, \\, <, |, ,, >, /, ?]"}
Could you please help with this? I tried Google, but with no success. Thank you.
Filebeat reads from alerts.json; you can check this file to see whether the alerts are being generated. Judging from the log you provided, it looks like Filebeat cannot send some logs to Elasticsearch (Cannot index event publisher.Event), but we would need more details about the complete error and the source logs causing it. The output of the command journalctl -f -u filebeat would be useful here to provide further assistance.
Based on previous experience, the problem could be that you have reached the maximum number of open shards, which is set to 1000 by default. If this is the case, you will see an error like the following: {"type":"validation_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}
If that's the case, you can either reduce the number of shards or increase the limit. I'd recommend the first approach if you only have one Elasticsearch node; having 1000 shards is not healthy for the environment in that case.
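To check how close you are to the limit, you can compare the cluster's active shard count against the configured maximum (a sketch assuming a local node on port 9200):

# total shards currently allocated (primaries + replicas)
curl -s 'localhost:9200/_cluster/health?filter_path=active_shards'
# configured limit (1000 per node by default)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true' | grep max_shards_per_node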
To reduce the number of shards, check this information, change the setting in /etc/filebeat/wazuh-template.json to "1", and then restart Filebeat. This only affects indices created from that point on; for existing indices, this guide can help you with cases like this one.
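For example, the relevant fragment of /etc/filebeat/wazuh-template.json would end up looking roughly like this (a sketch; the exact nesting can differ between Wazuh versions):

"settings": {
  "index": {
    "number_of_shards": "1"
  }
}

followed by a restart so the template is reloaded:

systemctl restart filebeat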
Also, you can try to remove old indices. I would first check which indices you have stored. I suppose some of them are related to statistics or other auxiliary data, so I would try to remove those before actual alert data (wazuh-alerts-).
You can use:
GET /_cat/indices
Since indices are stored per day by default, you can remove, for instance, indices older than one month and keep only the last month of data.
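For example (index names are illustrative; double-check the output of GET /_cat/indices before deleting anything, and note that wildcard deletes require action.destructive_requires_name to be false, which is the default on 7.x):

# list Wazuh indices sorted by name
curl -s 'localhost:9200/_cat/indices/wazuh-*?v&s=index'
# delete, for instance, monitoring/statistics indices from an old month
curl -X DELETE 'localhost:9200/wazuh-monitoring-2021.12.*'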
To prevent this from happening in the future, you may try implementing an Index Management Policy after you solve the issue at hand.
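With Elasticsearch that would be an ILM policy; a minimal sketch that deletes indices 30 days after creation (the policy name and retention period are just examples):

curl -X PUT 'localhost:9200/_ilm/policy/wazuh-retention' -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'

The policy then still has to be referenced from the index template (index.lifecycle.name) so that newly created indices pick it up.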

CoGroupByKey always fails on big data (Python SDK)

I have about 4,000 input files (avg ~7 MB each).
My pipeline always fails at the CoGroupByKey step when the data size reaches about 4 GB.
I tried limiting the input to only 300 files, and then it ran just fine.
In case of failure, the logs on GCP Dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I dug through all the logs in Logs Explorer. Nothing else indicates an error other than the above, not even my logging.info and try...except code.
I think this is related to the memory of the worker instances, but I didn't dig in that direction, because that is exactly the kind of thing I don't want to have to worry about when using GCP services.
Thanks.
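If worker memory is in fact the bottleneck, one thing worth trying is running the job on workers with more memory per vCPU. A rough sketch of the launch command under that assumption (the script name, bucket, and machine type are illustrative; --machine_type, --disk_size_gb, and --max_num_workers are standard worker options in the Beam Python SDK):

python my_pipeline.py \
  --runner=DataflowRunner \
  --project=<project-id> \
  --region=<region> \
  --temp_location=gs://<bucket>/tmp \
  --machine_type=n1-highmem-4 \
  --disk_size_gb=100 \
  --max_num_workers=20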

How to get the number of currently open shards in an Elasticsearch cluster?

I can't find where to get the number of currently open shards.
I want to set up monitoring to avoid cases like this:
this cluster currently has [999]/[1000] maximum shards open
I can get the maximum limit, max_shards_per_node:
$ curl -X GET "${ELK_HOST}/_cluster/settings?include_defaults=true&flat_settings=true&pretty" 2>/dev/null | grep cluster.max_shards_per_node
"cluster.max_shards_per_node" : "1000",
$
But I can't find out how to get the number of currently open shards (999).
A very simple way to get this information is to call the _cat/shards API and count the number of lines using the wc shell command:
curl -s -XGET ${ELK_HOST}/_cat/shards | wc -l
That will yield a single number that represents the number of shards in your cluster.
Another option is to retrieve the shard list in JSON format, pipe the results into jq, and then grab whatever you want, e.g. below I'm counting all STARTED shards:
curl -s -XGET ${ELK_HOST}/_cat/shards?format=json | jq ".[].state" | grep "STARTED" | wc -l
Yet another option is to query the _cluster/stats API:
curl -s -XGET ${ELK_HOST}/_cluster/stats?filter_path=indices.shards.total
That will return a JSON response with the shard count:
{
  "indices" : {
    "shards" : {
      "total" : 302
    }
  }
}
To my knowledge, there is no API that spits out this value as a single number. To be sure of that, let's look at the source code.
The error is thrown from IndicesService.java
To see how currentOpenShards is computed, we can then go to Metadata.java.
As you can see, the code iterates over the index metadata retrieved from the cluster state, which is pretty much like running the following command and counting the number of shards, but only for indices whose "state" is "open":
GET _cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state
From that evidence, we can be pretty sure that the single number you're looking for is nowhere to be found and needs to be computed by one of the methods shown above. You're free to open a feature request if needed.
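If you do want to compute that number yourself from the cluster state output above, here is a sketch with jq (assuming jq is installed; it sums primaries plus replicas of open indices):

curl -s "${ELK_HOST}/_cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state" \
  | jq '[.metadata.indices[]
         | select(.state == "open")
         | (.settings.index.number_of_shards | tonumber) * (1 + (.settings.index.number_of_replicas | tonumber))]
        | add'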
The problem: it seems that your Elasticsearch cluster is hitting the limit on the number of shards per node.
Solution:
Check the number of shards per node in your configuration and increase it using the Elasticsearch API.
To get the number of shards, use the _cluster/stats API:
curl -s -XGET 'localhost:9200/_cluster/stats?filter_path=indices.shards.total'
From the Elastic docs:
The Cluster Stats API allows to retrieve statistics from a cluster wide perspective. The API returns basic index metrics (shard numbers, store size, memory usage) and information about the current nodes that form the cluster (number, roles, os, jvm versions, memory usage, cpu and installed plugins).
To update the number of shards allowed per node (increase or decrease), use the _cluster/settings API. For example:
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : 5000}}'
From the Elastic docs:
With specifications in the request body, this API call can update cluster settings. Updates to settings can be persistent, meaning they apply across restarts, or transient, where they don't survive a full cluster restart.
You can reset persistent or transient settings by assigning a null value. If a transient setting is reset, the first one of these values that is defined is applied:
1. the persistent setting
2. the setting in the configuration file
3. the default value.
The order of precedence for cluster settings is:
1. transient cluster settings
2. persistent cluster settings
3. settings in the elasticsearch.yml configuration file.
It's best to set all cluster-wide settings with the settings API and use the elasticsearch.yml file only for local configurations. This way you can be sure that the setting is the same on all nodes. If, on the other hand, you define different settings on different nodes by accident using the configuration file, it is very difficult to notice these discrepancies.
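For example, to undo the change later and fall back to the default, assign null to the same setting:

curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : null}}'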
You can also compute it from _cat/indices, summing primary and replica shards of open indices:
curl -s '127.1:9200/_cat/indices' | awk '{ if ($2 == "open") C+=$5*($6+1)} END {print C}'
This works:
GET /_stats?level=shards&filter_path=_shards.total
Reference:
https://stackoverflow.com/a/38108448/4271117

percolate returns empty matches under heavy load during elasticsearch cluster resizing

We have an Elasticsearch cluster that dynamically resizes based on the percolate message count in a RabbitMQ queue.
We have a single shard and ~18K queries in our index, and we use auto_expand_replicas: "0-all" in the index settings to copy the single shard to all nodes whenever a node becomes available.
But during heavy load and cluster resizing, some requests produce unexpected empty matches.
We send ~1M percolate requests daily and were losing ~1K of them. We added a cluster status check to our code: if the cluster status is not green before and after a percolate request, we wait for green status and re-send the percolate request. This reduced the lost content count from ~1K to ~100. We do not experience this problem in a cluster with a fixed number of nodes.
Unfortunately, any loss is unacceptable in our scenario, and we don't want to give up auto-scaling, so we need to find a workaround for this problem.
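The status check mentioned above boils down to something like the following before and after each percolate request (host and timeout are illustrative):

curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'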
To reproduce the problem, you can use the following bash script:
https://gist.github.com/ekesken/de41598a1e7e54c6f33c
This script will download and install Elasticsearch 1.5.2 in your current directory, create a 10-node cluster on your local machine, create the index and the percolation queries, and start testing.
Normally we expect the following output for a single percolate request:
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
  "doc" : {
    "message" : "A new bonsai tree in the office"
  }
}'
{"took":95,"_shards":{"total":1,"successful":1,"failed":0},"total":1,"matches":[{"_index":"my-index","_id":"tree"}]}
After running the script, if you see that all shards on all nodes are started in the http://localhost:9200/_cat/shards response and the test script is still running, that means you could not reproduce the problem; try increasing the node count, which is 10 by default:
./repeat_percolation_loss.sh 15 test-only
When you reproduce the problem, the script will exit with the following output:
{"took":209,"_shards":{"total":1,"successful":1,"failed":0},"total":0,"matches":[]}
Problem repeated! Congratulations.
You can shut down all servers and clean up all directories and files created by the script with the command:
./repeat_percolation_loss.sh 15 clean
Change the node count above to the latest node count you tried.

Can I ignore these query exceptions in my ElasticSearch log?

I have a large number of indices in my ES instance, and I have noticed that the log files are growing rather large. The ElasticSearch Chef cookbook sets the log level to DEBUG by default, and this has resulted in millions of error messages being written to the log. Here is one as an example:
[2015-02-20 18:42:28,858][DEBUG][action.search.type ] [SEARCHNODE] [child_index][4], node[xxxx], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest#1a9a62ad] lastShard [true]
org.elasticsearch.search.SearchParseException: [ichild_index][4]: from[0],size[105]: Parse Failure [Failed to parse source [{"from":0,"size":105,"sort":{"lastmodified":{"order":"desc","missing":"_last"}},"query":{"indices":{"indices":["main_index"],"query":{"filtered":{"query":{"bool":{"must":[{"match_all":{}}]}},"filter":{"and":{"filters":[{"term":{"isclosed":false}},{"or":[{"and":[{"type":{"value":"type_name"}}]}]},{"term":{"planid":1454}},{"bool":{"should":[{"terms":{"roles":[173,935,934,937,930,938,936]}},{"missing":{"field":"roles"}}]}}]}}}},"no_match_query":"none"}},"fields":"[]"}]]
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:660)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:516)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:488)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:257)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.search.SearchParseException: [child_index][4]: from[0],size[105]: Parse Failure [No mapping found for [lastmodified] in order to sort on]
at org.elasticsearch.search.sort.SortParseElement.addSortField(SortParseElement.java:198)
at org.elasticsearch.search.sort.SortParseElement.addCompoundSortField(SortParseElement.java:172)
at org.elasticsearch.search.sort.SortParseElement.parse(SortParseElement.java:90)
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:644)
The query in the error message contains this fragment:
... {"indices":{"indices":["main_index"] ...
However, the error actually originates from child_index. I'm not sure why my instance would even consider child_index for executing the query, since we clearly don't want to include that index, per the query.
The query above actually executes successfully. Results are returned correctly, and we don't log anything on the web application that indicates a problem. Presumably the query is at some point run against main_index as well, and the results are correctly returned to the web app.
My instance is under a moderate workload, and this file can comfortably grow to 5 GB in a given 12-hour period. I know that the solution to that problem is simple: decrease the log level to WARN and the errors will go away. However, I'm worried that we might have a hitherto undiagnosed problem with the instance that could bite us later.
Of all the errors to ignore, org.elasticsearch.search.SearchParseException is probably the one you should never ignore. It means that ES was unable to parse your search JSON the way it expects to (as far as I can tell).
I took a look at your JSON, and although it lints, it appears your "fields" array is actually "fields": "[]", which could be what's causing the issue. Can you try it without the quotes and see what happens?
This is just a theory, but it's possible ES fails to parse that section and simply ignores it (which, in this case, would produce the same result as if it were parsed).
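In other words, if that is the problem, the end of the request body would change roughly like this (only the fields parameter differs):

before: ... ,"no_match_query":"none"}},"fields":"[]"}
after:  ... ,"no_match_query":"none"}},"fields":[]}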
