Azure DevOps build pipeline self-hosted agent "No space left on device" - bash

I am running a build pipeline on Azure that runs on a private build server (Red Hat Enterprise Linux) with a self-hosted agent. The pipeline has only 1 job and 2 tasks. The 1st task SSH's into a repo server we have (a different server that just holds big files), generates an ISO image on that repo server, then uses curl to put the ISO back on the build server where the Azure Pipelines agent is running, in the usual $(Build.ArtifactStagingDirectory) that Azure uses for artifacts.
This 1st task succeeds, and the ISO is generated and copied over to the build server, but the "Publish Artifact" task keeps failing. It tries to publish the path $(Build.ArtifactStagingDirectory) but produces the error message below (the full log tail is at the end of this question):
No space left on device
I already went in and cleared all the directories and files larger than 1 GB in the working directory `/home/azure/vsts/_work`.
I'm not an expert with Linux. When I run df -h and look at the filesystems, there are a bunch in the list. Is there a way to know which partition the Azure Pipelines agent is actually using for the /home/azure/vsts/_work directory?
My df -h list looks like:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root 19G 19G 28K 100% /
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 8.0K 3.9G 1% /dev/shm
tmpfs 3.9G 138M 3.8G 4% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sdc 100G 152M 100G 1% /glusterfs
/dev/sda1 488M 119M 334M 27% /boot
/dev/mapper/vg_root-lv_var 997M 106M 891M 11% /var
/dev/mapper/vg_docker-lv_docker 50G 3.1G 44G 7% /var/lib/docker/overlay2
/dev/mapper/vg_root-lv_log 997M 46M 952M 5% /var/log
/dev/mapper/vg_root-lv_crash 997M 33M 965M 4% /var/crash
/dev/mapper/vg_root-lv_root_logins 29M 1.8M 27M 6% /var/log/root_logins
/dev/mapper/vg_root-lv_core 125M 6.6M 119M 6% /var/core
/dev/mapper/vg_root-lv_repo 997M 83M 915M 9% /var/cache/yum
/dev/mapper/vg_root-lv_home 997M 33M 965M 4% /export/home
/dev/mapper/vg_root-lv_logins 93M 5.0M 88M 6% /var/log/logins
/dev/mapper/vg_root-lv_audit 725M 71M 655M 10% /var/log/audit
tmpfs 799M 0 799M 0% /run/user/0
walkie1-ap2.nextgen.com:/hdd-volume0 200G 2.3G 198G 2% /gluster-hdd
If anyone could provide some insight I'd greatly appreciate it.
End of error log:
[2020-05-06 05:49:09Z ERR JobRunner] Caught exception from job steps StepsRunner: System.IO.IOException: No space left on device
at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
at System.IO.FileStream..ctor(String path, FileMode mode)
at Microsoft.VisualStudio.Services.Agent.PagingLogger.NewPage()
at Microsoft.VisualStudio.Services.Agent.PagingLogger.Write(String message)
at Microsoft.VisualStudio.Services.Agent.Worker.ExecutionContext.Write(String tag, String message)
at Microsoft.VisualStudio.Services.Agent.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
at Microsoft.VisualStudio.Services.Agent.Worker.StepsRunner.RunAsync(IExecutionContext jobContext, IList`1 steps)
at Microsoft.VisualStudio.Services.Agent.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)

I can't reproduce the same issue on my side, but you can check this article for troubleshooting.
As far as I know, this task itself takes extra space while it executes. You can try a bash command that copies the content under $(Build.ArtifactStagingDirectory), doubling its size, and see whether that copy throws the same "No space left on device" error.
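For example, something along these lines from a Bash task on the agent (a sketch: BUILD_ARTIFACTSTAGINGDIRECTORY is the environment-variable form of the pipeline variable, and the literal fallback path is just an assumption based on the default agent layout from the question):
# Which filesystem backs the staging directory, and how full is it?
STAGING="${BUILD_ARTIFACTSTAGINGDIRECTORY:-/home/azure/vsts/_work/1/a}"
df -h "$STAGING"
# Duplicate the staged content; if this fails the same way, the disk really is full.
cp -r "$STAGING" "${STAGING}.copytest" && echo "enough space for a second copy" || echo "copy failed - out of space"
rm -rf "${STAGING}.copytest"   # remove the test copy again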
And in the classic build pipeline there is a Clean option that cleans the workspace before the job runs; enable it to check whether it helps.
If it's a YAML pipeline, try something like:
workspace:
  clean: outputs | resources | all # what to clean up before the job runs
and
steps:
- checkout: self | none | repository name # self represents the repo where the initial Pipelines YAML file was found
  clean: boolean # if true, run `git clean -ffdx && git reset --hard HEAD` before fetching
See Yaml schema.

So I was a n00b here, and the solution was simply to clean out space from the main directory we used to store our large ISO files:
/dev/mapper/vg_root-lv_root 19G 19G 28K 100% /
This is a custom VM we use to run builds on Azure, and I wasn't accustomed to the error message. But yes, if anyone sees this message and is using a custom build agent, it's definitely a space issue.
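For anyone who lands here later: df -h accepts a path, so you can ask it directly which filesystem backs the agent's work directory, and then use du to see what is filling that filesystem. A quick sketch (paths taken from the question; adjust for your agent install):
df -h /home/azure/vsts/_work                        # shows only the filesystem holding the work directory (here: / at 100%)
sudo du -xh --max-depth=2 / | sort -rh | head -20   # the 20 biggest directories on that filesystem (-x stays on one filesystem)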

Related

Sonar-scanner hangs after 'Load active rules (done)' is shown in the logs

The tail of logging shows the following:
22:09:11.016 DEBUG: GET 200 http://someserversomewhere:9000/api/rules/search.protobuf?f=repo,name,severity,lang,internalKey,templateKey,params,actives,createdAt,updatedAt&activation=true&qprofile=AXaXXXXXXXXXXXXXXXw0&ps=500&p=1 | time=427ms
22:09:11.038 INFO: Load active rules (done) | time=12755ms
I have mounted the running container to see if the scanner process is pegged/running/etc and it shows the following:
Mem: 2960944K used, 106248K free, 67380K shrd, 5032K buff, 209352K cached
CPU: 0% usr 0% sys 0% nic 99% idle 0% io 0% irq 0% sirq
Load average: 5.01 5.03 4.83 1/752 46
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 3811m 127% 1 0% /opt/java/openjdk/bin/java -Djava.awt.headless=true -classpath /opt/sonar-scanner/lib/sonar-scann
40 0 root S 2424 0% 0 0% bash
46 40 root R 1584 0% 0 0% top
I was unable to find any logging in the sonar-scanner-cli container to help indicate the state. It appears to just be hung and waiting for something to happen.
I am running SonarQube locally from Docker at the LTS version 7.9.5.
I am also running the Docker container sonarsource/sonar-scanner-cli, which currently uses the following version in its Dockerfile:
SONAR_SCANNER_VERSION=4.5.0.2216
I am triggering the scan via the following command:
docker run --rm \
-e SONAR_HOST_URL="http://someserversomewhere:9000" \
-e SONAR_LOGIN="nottherealusername" \
-e SONAR_PASSWORD="not12345likeinspaceballs" \
-v "$DOCKER_TEST_DIRECTORY:/usr/src" \
--link "myDockerContainerNameForSonarQube" \
sonarsource/sonar-scanner-cli -X -Dsonar.password=not12345likeinspaceballs -Dsonar.verbose=true \
-Dsonar.sources=app -Dsonar.tests=test -Dsonar.branch=master \
-Dsonar.projectKey="${PROJECT_KEY}" -Dsonar.log.level=TRACE \
-Dsonar.projectBaseDir=/usr/src/$PROJECT_NAME -Dsonar.working.directory=/usr/src/$PROJECT_NAME/$SCANNER_WORK_DIR
I have done a lot of digging to try to find anyone with similar issues and found the following older issue which seems to be similar but it is unclear how to determine if I am experiencing something related. Why does sonar-maven-plugin hang at loading global settings or active rules?
I am stuck and not sure what to do next any help or hints would be appreciated.
An additional note: this process does work for the 8.4.2-developer version of SonarQube that I am planning to migrate to. The purpose of verifying 7.9.5 is to follow SonarQube's recommended upgrade path, which is to first bring your current version to the latest LTS and run the data migration before jumping to the next major version.

hdfs + namenode + edit files increasing with huge size and how to limit the size of edit files

We have an HDP cluster with 7 datanode machines.
Under /hadoop/hdfs/namenode/current/
we can see more than 1500 edit files,
each around 7M to 20M, like the following:
7.8M /hadoop/hdfs/namenode/current/edits_0000000002331008695-0000000002331071883
7.0M /hadoop/hdfs/namenode/current/edits_0000000002331071884-0000000002331128452
7.8M /hadoop/hdfs/namenode/current/edits_0000000002331128453-0000000002331189702
7.1M /hadoop/hdfs/namenode/current/edits_0000000002331189703-0000000002331246584
11M /hadoop/hdfs/namenode/current/edits_0000000002331246585-0000000002331323246
8.0M /hadoop/hdfs/namenode/current/edits_0000000002331323247-0000000002331385595
7.7M /hadoop/hdfs/namenode/current/edits_0000000002331385596-0000000002331445237
7.9M /hadoop/hdfs/namenode/current/edits_0000000002331445238-0000000002331506718
9.1M /hadoop/hdfs/namenode/current/edits_0000000002331506719-0000000002331573154
9.0M /hadoop/hdfs/namenode/current/edits_0000000002331573155-0000000002331638086
7.8M /hadoop/hdfs/namenode/current/edits_0000000002331638087-0000000002331697435
7.8M /hadoop/hdfs/namenode/current/edits_0000000002331697436-0000000002331755881
8.0M /hadoop/hdfs/namenode/current/edits_0000000002331755882-0000000002331814933
9.8M /hadoop/hdfs/namenode/current/edits_0000000002331814934-0000000002331884369
11M /hadoop/hdfs/namenode/current/edits_0000000002331884370-0000000002331955341
8.7M /hadoop/hdfs/namenode/current/edits_0000000002331955342-0000000002332019335
7.8M /hadoop/hdfs/namenode/current/edits_0000000002332019336-0000000002332074498
Is it possible to minimize the file size (or the number of edit files) with some HDFS configuration?
We have small disks and the disk is now 100% full:
/dev/sdb 100G 100G 0 100% /hadoop/hdfs
You can configure the dfs.namenode.num.checkpoints.retained and
dfs.namenode.num.extra.edits.retained properties to control the size
of the directory that holds the NameNode edits directory.
dfs.namenode.num.checkpoints.retained: The number of image checkpoint
files that are retained in storage directories. All edit logs
necessary to recover an up-to-date namespace from the oldest retained
checkpoint are also retained.
dfs.namenode.num.extra.edits.retained: The number of extra transactions that should be retained beyond what is minimally
necessary for a NameNode restart. This can be useful for audit
purposes, or for an HA setup where a remote Standby Node may have been
offline for some time and require a longer backlog of retained edits
in order to start again.
Resource: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/properties_to_set_the_size_of_the_namenode_edits_directory.html
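To see what the cluster is currently using for these values (and for the checkpoint interval, since old edits are only purged after a checkpoint), you can query the effective configuration; a sketch, with the usual Hadoop defaults noted in the comments. The properties themselves go into hdfs-site.xml (or Ambari on HDP), followed by a NameNode restart:
hdfs getconf -confKey dfs.namenode.num.checkpoints.retained   # usually defaults to 2
hdfs getconf -confKey dfs.namenode.num.extra.edits.retained   # usually defaults to 1000000 transactions
hdfs getconf -confKey dfs.namenode.checkpoint.period          # seconds between checkpoints, usually 3600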

Kibana: Cannot create Dashboard after a new install

I recently installed the ELK stack (Elasticsearch, Logstash & Kibana) on a VM. Everything works great, as in, the messages go through Logstash into Elasticsearch & show up as expected in Kibana. The only problem is that I am not able to create any dashboards in Kibana. I keep getting this error:
Error on saving 'My dashboard'. Forbidden
I Googled, used other answers on Stack Overflow, and tried several recommended suggestions such as:
Setting 'xpack.security.enabled' false in elasticsearch.yml
cluster.routing.allocation.disk.threshold_enabled: false
index.blocks.read_only_allow_delete: null
Nothing is helping. I have made sure that disk is not more than 95% full. Here's what I see when I run the 'df' command:
Filesystem 1K-blocks Used Available Use% Mounted on
udev 16438908 0 16438908 0% /dev
tmpfs 3294068 1188 3292880 1% /run
/dev/sda2 51340768 43199000 5504100 89% /
tmpfs 16470332 0 16470332 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 16470332 0 16470332 0% /sys/fs/cgroup
/dev/loop3 96128 96128 0 100% /snap/core/8935
/dev/loop2 123264 123264 0 100% /snap/docker/423
/dev/loop0 125312 125312 0 100% /snap/docker/418
tmpfs 3294064 0 3294064 0% /run/user/1001
/dev/loop4 96256 96256 0 100% /snap/core/9066
Any ideas? Note: All versions are latest.
Turns out it was indeed a space issue. When I switched to a new VM that has 300 GB of space, the 'Forbidden' message was gone. A bit of a confusing error message, but now I know.
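If switching to a bigger VM isn't an option, the third setting in the question is the relevant one once disk space has been freed: Elasticsearch marks indices read-only (which Kibana surfaces as 'Forbidden') when the flood-stage disk watermark, 95% by default, is reached, and depending on the version that block may need to be cleared manually. A sketch of clearing it, assuming Elasticsearch is reachable on localhost:9200:
curl -XPUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'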

Spring DataFlow Yarn - Container is running beyond physical memory

I'm running Spring Cloud Tasks on YARN. Simple tasks work fine, but when I run bigger tasks that require more resources I get a "Container is running beyond physical memory" error:
onContainerCompleted:ContainerStatus: [ContainerId:
container_1485796744143_0030_01_000002, State: COMPLETE, Diagnostics: Container [pid=27456,containerID=container_1485796744143_0030_01_000002] is running beyond physical memory limits. Current usage: 652.5 MB of 256 MB physical memory used; 5.6 GB of 1.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1485796744143_0030_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 27461 27456 27456 27456 (java) 1215 126 5858455552 166335 /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName=org.h2.Driver --spring.datasource.username=sa --spring.cloud.task.name=city2 --spring.datasource.url=jdbc:h2:tcp://localhost:19092/mem:dataflow
|- 27456 27454 27456 27456 (bash) 0 0 115806208 705 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Dserver.port=0 -Dspring.jmx.enabled=false -Dspring.config.location=servers.yml -jar cities-job-0.0.1.jar --spring.datasource.driverClassName='org.h2.Driver' --spring.datasource.username='sa' --spring.cloud.task.name='city2' --spring.datasource.url='jdbc:h2:tcp://localhost:19092/mem:dataflow' 1>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stdout 2>/var/log/hadoop-yarn/containers/application_1485796744143_0030/container_1485796744143_0030_01_000002/Container.stderr
I tried tuning options in DataFlow's server.yml settings:
spring:
  deployer:
    yarn:
      app:
        baseDir: /dataflow
        taskappmaster:
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms512m -Xmx512m"
        taskcontainer:
          priority: 1
          memory: 512m
          virtualCores: 1
          javaOpts: "-Xms256m -Xmx512m"
I found out that the taskappmaster memory changes are applied (the AM container in YARN is set to this value), but the taskcontainer memory options are not: every container created for a Cloud Task gets only 256 MB, which is the default for the YARN deployer.
For this server.yml the expected result is the allocation of two containers with 512 MB each, for the Application Master and the application container. But YARN allocates 512 MB for the Application Master and only 256 MB for the application container.
I don't think this problem comes from wrong YARN options, because Spark applications work correctly and take GBs of memory.
Some of my YARN settings:
mapreduce.reduce.java.opts -Xmx2304m
mapreduce.reduce.memory.mb 2880
mapreduce.map.java.opts -Xmx3277m
mapreduce.map.memory.mb 4096
yarn.nodemanager.vmem-pmem-ratio 5
yarn.nodemanager.vmem-check-enabled false
yarn.scheduler.minimum-allocation-mb 32
yarn.nodemanager.resource.memory-mb 11520
My Hadoop runtime is EMR 4.4.0; I also had to change the default Java to 1.8.
Cleaning up the /dataflow directory in HDFS resolved the problem; after this directory is deleted, Spring Data Flow uploads all the needed files again. The other way is to remove the stale file yourself and upload a new one.
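A minimal sketch of that cleanup, assuming the baseDir: /dataflow from the server.yml above; the deployer re-uploads the application jars on the next task launch:
hdfs dfs -ls -R /dataflow              # see which cached artifacts are there
hdfs dfs -rm -r -skipTrash /dataflow   # remove them so fresh copies get uploaded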

Container is running beyond physical memory. Hadoop Streaming python MR

I am running a Python Script which needs a file (genome.fa) as a dependency(reference) to execute. When I run this command :
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./methratio.py -file '../Test_BSMAP/genome.fa' -mapper './methratio.py -r -g ' -input /TextLab/sravisha_test/SamFiles/test_sam -output ./outfile
I am getting this Error:
15/01/30 10:48:38 INFO mapreduce.Job: map 0% reduce 0%
15/01/30 10:52:01 INFO mapreduce.Job: Task Id : attempt_1422600586708_0001_m_000009_0, Status : FAILED
Container [pid=22533,containerID=container_1422600586708_0001_01_000017] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
I am using Cloudera Manager (Free Edition). These are my configs:
yarn.app.mapreduce.am.resource.cpu-vcores = 1
ApplicationMaster Java Maximum Heap Size = 825955249 B
mapreduce.map.memory.mb = 1GB
mapreduce.reduce.memory.mb = 1 GB
mapreduce.map.java.opts = -Djava.net.preferIPv4Stack=true
mapreduce.map.java.opts.max.heap = 825955249 B
yarn.app.mapreduce.am.resource.mb = 1GB
Java Heap Size of JobHistory Server in Bytes = 397 MB
Can someone tell me why I am getting this error?
I think your python script is consuming a lot of memory during the reading of your large input file (clue: genome.fa).
Here is my reason (Ref: http://courses.coreservlets.com/Course-Materials/pdf/hadoop/04-MapRed-6-JobExecutionOnYarn.pdf, Container is running beyond memory limits, http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/)
Container’s Memory Usage = JVM Heap Size + JVM Perm Gen + Native Libraries + Memory used by spawned processes
The last variable 'Memory used by spawned processes' (the Python code) might be the culprit.
Try increasing the memory size of these two parameters: mapreduce.map.java.opts
and mapreduce.reduce.java.opts (a combined sketch follows the next suggestion).
You can also try increasing the number of maps spawned at execution time: you can increase the number of mappers by decreasing the split size (mapred.max.split.size).
It will add some overhead but will mitigate the problem.
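A sketch combining both suggestions on the original streaming command: the -D generic options must come before the streaming-specific options, the memory sizes are illustrative rather than tuned values (heap kept at roughly 80% of the container size), and mapred.max.split.size is in bytes (64 MB here):
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.map.java.opts=-Xmx1638m \
  -D mapreduce.reduce.memory.mb=2048 \
  -D mapreduce.reduce.java.opts=-Xmx1638m \
  -D mapred.max.split.size=67108864 \
  -file ./methratio.py -file '../Test_BSMAP/genome.fa' \
  -mapper './methratio.py -r -g ' \
  -input /TextLab/sravisha_test/SamFiles/test_sam -output ./outfile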
