Graylog2/Passenger 3.0.21; writev() "/tmp/passenger-standalone.3012/proxy_temp/3/00/0000000003" has written only 4096 of 8192 while reading upstream - passenger

I have a Graylog2 install (0.11.0), served with Passenger running standalone (3.0.21). It's backed by multiple ElasticSearch servers plus MongoDB.
About a week ago, it was running Passenger 3.0.18 and this error started to show up in the Graylog server logs when trying to load messages:
2013/09/13 13:47:32 [crit] 27720#0: *1451 writev() "/tmp/passenger-standalone.27619/proxy_temp/6/00/0000000006" failed (28: No space left on device) while reading upstream
Checked /tmp/, and it was at 8% utilization. Meanwhile, on the front end, when you tried to load the Messages page in Graylog, the page would load fine except for the actual messages. I tried upgrading Passenger to 3.0.21; the behavior stayed the same, but the error changed:
2013/09/17 10:16:53 [crit] 3113#0: *10 writev() "/tmp/passenger-standalone.3012/proxy_temp/3/00/0000000003" has written only 4096 of 8192 while reading upstream
Next I checked out the ES machines. They were running with high CPU load, so I reduced the maximum number of indexes they were keeping for Graylog, and that brought them right back down...but still no change in behavior.
My best guess is that this error is some sort of timeout, but I can't find any other thread where anyone's gotten this error, and I don't see why a timeout should be happening now that the ES machines are back in a normal range. All other Graylog web pages work fine, as do Streams.

I ended up doing a few things to resolve this issue (a rough sketch of the changes follows below).
Change Graylog's processor_wait_strategy to 'blocking'. This greatly reduced the amount of CPU the graylog-server app was using.
Cut the amount of data ElasticSearch was storing by reducing the elasticsearch_max_number_of_indices for Graylog.
And the thing that helped the most: stop the Graylog server and delete the graylog2_recent ElasticSearch index, then restart the Graylog server and it will re-create it. Once I did this, the CPU load on the ElasticSearch servers dropped drastically, and Messages and searches began to work again. Once the index re-filled, it continued to work correctly.
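A rough sketch of those changes (the config path, service name, and ElasticSearch address are assumptions; adjust for your install):

# In graylog2.conf:
#   processor_wait_strategy = blocking
#   elasticsearch_max_number_of_indices = <a lower value than before>
# Then stop the server, drop the 'graylog2_recent' index, and start it again:
sudo service graylog-server stop
curl -XDELETE 'http://localhost:9200/graylog2_recent'
sudo service graylog-server start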
Hopefully this helps some other poor individual Googling this error.

Related

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from an event hub with PySpark Structured Streaming, calculates some values based on the data, and streams the results back to another event hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried to restart the job many times, also with clean input data and a clean checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which are sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage might be a better place for checkpointing anyway, since the I/O is cheaper there.
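A hedged sketch of that change with the Azure CLI (the account, resource group, and container names are placeholders, and the URI scheme depends on how your storage is accessed):

az storage account create --name ckptpremium001 --resource-group my-rg \
  --kind BlockBlobStorage --sku Premium_LRS
az storage container create --name checkpoints --account-name ckptpremium001
# In the streaming notebook, point the writer's checkpoint at the new account, e.g.:
#   .option("checkpointLocation", "wasbs://checkpoints@ckptpremium001.blob.core.windows.net/my-job")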

dncp_block_verification log file increases size in HDFS

We are using Cloudera CDH 5.3. I am facing a problem wherein the size of "/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp_block_verification.log.curr" and "dncp_block_verification.log.prev" keeps growing to TBs within hours. I have read in some blogs that this is an HDFS bug. A temporary workaround is to stop the DataNode service and delete these files. But we have observed that the log file grows again on one datanode or another (even on the same node after deleting it), so it requires continuous monitoring.
Does anyone have a permanent solution to this problem?
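For reference, the temporary workaround described above looks roughly like this on a package-based install (the block-pool directory name is the placeholder from above; on a Cloudera Manager managed cluster, stop and start the DataNode role from CM instead):

service hadoop-hdfs-datanode stop
rm /dfs/dn/current/Bp-12345-IpAddress-123456789/dncp_block_verification.log.*
service hadoop-hdfs-datanode start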
One solution, although slightly drastic, is to disable the block scanner entirely by setting the HDFS DataNode configuration key dfs.datanode.scan.period.hours to 0 (the default is 504 hours). The negative effect of this is that your DNs may not auto-detect corrupted block files (and would have to wait for a future block-reading client to detect them instead); this isn't a big deal if your average replication factor is around 3, but you can treat the change as a short-term one until you upgrade to a release that fixes the issue.
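A minimal sketch of that change, assuming you edit hdfs-site.xml directly (on CDH this would typically go into the DataNode's configuration snippet / safety valve in Cloudera Manager), followed by a DataNode restart:

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>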
Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher releases, which include the HDFS-7430 rewrite and its associated bug fixes. Those changes do away with the use of such a local file, thereby removing the problem.

SonarQube 5.1 too busy due to ElasticSearch

I have recently migrated from SonarQube 3.7.2 to SonarQube 5.1. The update was successful and I was able to run analyses.
However, now I cannot reach the server, and from the log it seems ElasticSearch is slowly eating away my disk space.
I tried to restart the server and to delete the data/es directory, but nothing helped.
sonar.log is full of these lines:
...
2015.05.18 00:00:13 WARN es[o.e.c.r.a.decider] [sonar-1431686361188] high disk watermark [10%] exceeded on [Jbz_O0pFRKecav4NT3DWzQ][sonar-1431686361188] free: 5.6gb[3.8%], shards will be relocated away from this node
2015.05.18 00:00:13 INFO es[o.e.c.r.a.decider] [sonar-1431686361188] high disk watermark exceeded on one or more nodes, rerouting shards
...
There are just a few Java projects, but two of them are around a couple of million lines of code (LOC).
Your server does not have enough available disk space to feed its internal Elasticsearch indices.
Note that an external volume can be used by setting the property sonar.path.data (see conf/sonar.properties).
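For example, a minimal sketch of that setting in conf/sonar.properties (the mount point below is an assumption; pick any volume with enough free space), followed by a SonarQube restart:

sonar.path.data=/mnt/data-volume/sonarqube/data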

Novell eDirectory: Error while adding replica on new server

I want to add a replica of our whole eDirectory tree to a new server (OES 11.2 on SLES 11.3).
I wanted to do this via iManager (Partitions and Replicas / Replica View / Add Replica).
Everything looks normal. I can see our other servers with their replicas and, of course, the server holding the master replica.
For additional information: I have done this many times without problems until now.
When I want to add a replica to the new server, I get the following error: (Error -636) The server is unreachable.
I checked the /etc/hosts file and the network settings on both servers.
Ndsrepair looks normal too. All servers are in sync and there are no connection errors. The replica depth of the new server is -1; I understand that, because there is no replica on it yet.
But if I can connect from one server to another and there are no error messages, why does adding the replica not work?
I also tried to make a LAN trace, but didn't get any information that would help me out here. In the trace the communication seems normal!
Am I forgetting something here?
Every server in our environment runs OES 11.2, except the master server, which runs OES 11.1.
Thanks for your help!
Daniel
Nothing is wrong.
Error -636 means that the replica is not yet available on the new server. Once synchronization completes, the replica will be ready and available. Depending on the size of the tree and the communication channel, this can take up to several hours.
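If you want to keep an eye on the synchronization while you wait, something like the following on the new server is a reasonable sketch (ndsrepair output and options can vary slightly between eDirectory versions):

ndsrepair -T     # check time synchronization across the servers in the tree
ndsrepair -E     # report replica synchronization errors; repeat until the new replica reports none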

Apache Solr requires regular restart

I have Solr installed and set up on my Drupal 7 site. Most of the time it works as expected. However, every so often, at least every other day, the search will suddenly stop working, and according to the Drupal error log I get:
"0" Status: Request failed: Connection refused.
The Type column says Apache Solr. To fix this, I just restart the Solr service. Is there something I can do to prevent this issue from occurring again? I suspect it's some sort of configuration in Solr that needs adjusting.
I'm kind of new to Solr, so any tips would be appreciated.
Thanks
How busy is the Solr server? If it is not very busy, check whether you have a firewall between your Drupal and Solr servers. Some firewalls kill connections when no traffic is going through them.
One way to test would be to access the Solr admin interface. If you can, the server itself is fine and only Drupal's connection died.
I am assuming that the Solr client library in Drupal tries to maintain a persistent connection. If that's not the case, the above does not apply.
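A quick, hedged way to check that when the error appears (the host, port, and core path are assumptions and depend on your Solr version and setup):

curl -s 'http://your-solr-host:8983/solr/admin/ping'
# for a multi-core setup, include the core name:
# curl -s 'http://your-solr-host:8983/solr/your-core/admin/ping'

If the ping succeeds while Drupal still reports "Connection refused", the problem lies between Drupal and Solr rather than in Solr itself.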
I ended up reducing the number of documents to be indexed during cron from 200 to 50. That seemed to resolve the issue, as I have not had any Solr outages over the last couple of weeks.
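For what it's worth, with the Drupal 7 apachesolr module that setting can also be changed from the command line; the variable name below is an assumption based on that module, so verify it on your module's settings page first:

drush vset apachesolr_cron_limit 50    # number of items indexed per cron run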
