Disk.indices is higher than disk.used in cat allocation - elasticsearch

When I run the following command:
GET _cat/allocation?v&s=disk.indices&h=shards,disk.indices,disk.used,disk.available,disk.total,disk.percent
it shows the following output:
shards disk.indices disk.used disk.total disk.percent
160 1.4tb 1.4tb 1.7tb 86
160 1.4tb 1.4tb 1.7tb 87
160 1.5tb 1.5tb 1.7tb 89
160 1.5tb 1.5tb 1.7tb 90
480 7.7tb 3.7tb 20tb 18
480 7.7tb 3.9tb 20tb 19
Can anyone help me understand how the disk.indices in the last two rows can exceed disk.used?
Ideally, disk.used should be greater than or equal to disk.indices.
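For reference, here is one way to compare the two numbers outside Elasticsearch, as a minimal sketch assuming the default data path /var/lib/elasticsearch (adjust for your install):
# df shows the filesystem-level usage that disk.used/disk.total report;
# du shows the on-disk size of the index data that disk.indices reports.
df -h /var/lib/elasticsearch
du -sh /var/lib/elasticsearch/nodes/*/indices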

Related

Can't generate any alignments in MCScanX

I'm trying to find collinearity between a group of genes from two different species using MCScanX, but I can't figure out what I could possibly be doing wrong. I've checked both input files countless times (.gff and .blast), and they seem to be in line with what the manual says.
For the first species, I downloaded the gff file from figshare. I already had the fasta file containing only the proteins of interest (also from figshare), so the gene ids matched. Then I downloaded both the gff and the protein fasta file from the coffee genome hub. I used the coffee protein fasta file as the reference in rBLAST to align the first species' genes against it. After blasting (and keeping only the five best alignments with e-values greater than 1e-10), I filtered both gff files so they only contained genes that matched those in the blast file, and then concatenated them. So the final files look like this:
View (test.blast) #just imagine they're tab separated values
sp1.id1 sp2.id1 44.186 43 20 1 369 411 206 244 0.013 37.4
sp1.id1 sp2.id2 25.203 123 80 4 301 413 542 662 0.00029 43.5
sp1.id1 sp2.id3 27.843 255 130 15 97 333 458 676 1.75e-05 47.8
sp1.id1 sp2.id4 26.667 105 65 3 301 396 329 430 0.004 39.7
sp1.id1 sp2.id5 27.103 107 71 3 301 402 356 460 0.000217 43.5
sp1.id2 sp2.id6 27.368 95 58 2 40 132 54 139 0.41 32
sp1.id2 sp2.id7 27.5 120 82 3 23 138 770 888 0.042 35
sp1.id2 sp2.id8 38.596 57 35 0 21 77 126 182 0.000217 42
sp1.id2 sp2.id9 36.17 94 56 2 39 129 633 725 1.01e-05 46.6
sp1.id2 sp2.id10 37.288 59 34 2 75 133 345 400 0.000105 43.1
sp1.id3 sp2.id11 33.846 65 42 1 449 512 360 424 0.038 37.4
sp1.id3 sp2.id12 40 50 16 2 676 725 672 707 6.7 30
sp1.id3 sp2.id13 31.707 41 25 1 370 410 113 150 2.3 30.4
sp1.id3 sp2.id14 31.081 74 45 1 483 550 1 74 3.3 30
sp1.id3 sp2.id15 35.938 64 39 1 377 438 150 213 0.000185 43.5
View (test.gff) #just imagine they're tab separated values
ex0 sp2.id1 78543527 78548673
ex0 sp2.id2 97152108 97154783
ex1 sp2.id3 16555894 16557150
ex2 sp2.id4 3166320 3168862
ex3 sp2.id5 7206652 7209129
ex4 sp2.id6 5079355 5084496
ex5 sp2.id7 27162800 27167939
ex6 sp2.id8 5584698 5589330
ex6 sp2.id9 7085405 7087405
ex7 sp2.id10 1105021 1109131
ex8 sp2.id11 24426286 24430072
ex9 sp2.id12 2734060 2737246
ex9 sp2.id13 179361 183499
ex10 sp2.id14 893983 899296
ex11 sp2.id15 23731978 23733073
ts1 sp1.id1 5444897 5448367
ts2 sp1.id2 28930274 28935578
ts3 sp1.id3 10716894 10721909
So I moved both files to the test folder inside MCScanX directory and ran MCScan (using Ubuntu 20.04.5 LTS, the WSL feature) with:
../MCScanX ./test
I've also tried
../MCScanX -b 2 ./test
(since "-b 2" is the parameter for inter-species patterns of syntenic blocks)
but all I ever get is
255 matches imported (17 discarded)
85 pairwise comparisons
0 alignments generated
What am I missing?
I should be getting a test.synteny file that, as per the manual's example, looks like this:
## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
## Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus
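One sanity check worth running first: MCScanX requires real tab characters as separators, which a sketch like this can verify (using the file names above; no output means every line has the expected field count):
# print any test.blast line that does not have exactly 12 tab-separated fields
awk -F'\t' 'NF != 12' test.blast
# print any test.gff line that does not have exactly 4 tab-separated fields
awk -F'\t' 'NF != 4' test.gff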

Elasticsearch one specific shard keep initializing in different data nodes

I am getting an ElasticsearchStatusWarning saying that the cluster state is yellow. Upon running the health check API, I see the following:
curl -X GET http://localhost:9200/_cluster/health/
{"cluster_name":"my-elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":8,"number_of_data_nodes":3,"active_primary_shards":220,"active_shards":438,"relocating_shards":0,"initializing_shards":2,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":99.54545454545455}
initializing_shards is 2, so I further ran the call below:
curl -X GET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep INIT
graph_vertex_24_18549 0 r INITIALIZING ALLOCATION_FAILED
curl -X GET http://localhost:9200/_cat/shards/graph_vertex_24_18549
graph_vertex_24_18549 0 p STARTED 8373375 8.4gb IP1 elasticsearch-data-1
graph_vertex_24_18549 0 r INITIALIZING IP2 elasticsearch-data-2
Rerunning the same command a few minutes later shows that it is now being initialized on elasticsearch-data-0. See below:
graph_vertex_24_18549 0 p STARTED 8373375 8.4gb IP1 elasticsearch-data-1
graph_vertex_24_18549 0 r INITIALIZING IP0 elasticsearch-data-0
If I rerun it again a few minutes later, I can see it being initialized on elasticsearch-data-2 again, but it never reaches STARTED.
curl -X GET http://localhost:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
147 162.2gb 183.8gb 308.1gb 492gb 37 IP1 IP1 elasticsearch-data-2
146 217.3gb 234.2gb 257.7gb 492gb 47 IP2 IP2 elasticsearch-data-1
147 216.6gb 231.2gb 260.7gb 492gb 47 IP3 IP3 elasticsearch-data-0
curl -X GET http://localhost:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
IP1 7 77 20 4.17 4.57 4.88 mi - elasticsearch-master-2
IP2 72 59 7 2.59 2.38 2.19 i - elasticsearch-5f4bd5b88f-4lvxz
IP3 57 49 3 0.75 1.13 1.09 di - elasticsearch-data-2
IP4 63 57 21 2.69 3.58 4.11 di - elasticsearch-data-0
IP5 5 59 7 2.59 2.38 2.19 mi - elasticsearch-master-0
IP6 69 53 13 4.67 4.60 4.66 di - elasticsearch-data-1
IP7 8 70 14 2.86 3.20 3.09 mi * elasticsearch-master-1
IP8 30 77 20 4.17 4.57 4.88 i - elasticsearch-5f4bd5b88f-wnrl4
curl -s -XGET http://localhost:9200/_cluster/allocation/explain -d '{ "index": "graph_vertex_24_18549", "shard": 0, "primary": false }' -H 'Content-type: application/json'
{"index":"graph_vertex_24_18549","shard":0,"primary":false,"current_state":"initializing","unassigned_info":{"reason":"ALLOCATION_FAILED","at":"2020-11-04T08:21:45.756Z","failed_allocation_attempts":1,"details":"failed shard on node [1XEXS92jTK-wwanNgQrxsA]: failed to perform indices:data/write/bulk[s] on replica [graph_vertex_24_18549][0], node[1XEXS92jTK-wwanNgQrxsA], [R], s[STARTED], a[id=RnTOlfQuQkOumVuw_NeuTw], failure RemoteTransportException[[elasticsearch-data-2][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4322682690/4gb], which is larger than the limit of [4005632409/3.7gb], real usage: [3646987112/3.3gb], new bytes reserved: [675695578/644.3mb]]; ","last_allocation_status":"no_attempt"},"current_node":{"id":"o_9jyrmOSca9T12J4bY0Nw","name":"elasticsearch-data-0","transport_address":"IP:9300"},"explanation":"the shard is in the process of initializing on node [elasticsearch-data-0], wait until initialization has completed"}
The thing is, I was earlier getting alerted for unassigned shards due to the same exception as above: "CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4322682690/4gb], which is larger than the limit of [4005632409/3.7gb]".
But back then the heap was only 2G; I increased it to 4G. Now I am seeing the same error, but this time for initializing shards instead of unassigned shards.
How can I remediate this?
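For reference, once an allocation attempt has failed it can be retried manually via the reroute API; this is only a sketch and does not address the underlying circuit-breaker pressure:
# ask the master node to retry shard allocations that previously failed
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'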

ubiattach failed with too many bad blocks

I was trying to read the firmware from a NAND chip and extract its program and data for analysis.
From what I read online, you must create a UBI device, write your image file to it, and then you can mount it on your system.
Description
First I read a bin file from the flash chip. Running binwalk on it, I get this:
$ binwalk -Me Flash_data.bin
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
771180 0xBC46C device tree image (dtb)
772444 0xBC95C device tree image (dtb)
823236 0xC8FC4 CRC32 polynomial table, little endian
2703360 0x294000 uImage header, header size: 64 bytes, header CRC: 0xF092DEF5, created: 2016-10-04 21:32:58, image size: 2773040 bytes, Data Address: 0x80008000, Entry Point: 0x80008000, data CRC: 0x365DF8B1, OS: Linux, CPU: ARM, image type: OS Kernel Image, compression type: none, image name: "Linux-3.2.0"
2703424 0x294040 Linux kernel ARM boot executable zImage (little-endian)
2722452 0x298A94 gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00 (null date)
8110080 0x7BC000 UBI erase count header, version: 1, EC: 0x2, VID header offset: 0x800, data offset: 0x1000
Among its output files, I found this UBI image file.
$ binwalk 7BC000.ubi
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 UBI erase count header, version: 1, EC: 0x2, VID header offset: 0x800, data offset: 0x1000
$ file 7BC000.ubi
7BC000.ubi: UBI image, version 1
Some information about NAND chip:
PageSize : 2048
SpareSize : 64
PagesPerBlock : 64
Blocks Size : 128KB + 4KB
Total Block : 2048
Device Size : 256MB + 8192KB
Bus Width : 8
Then I tried to mount it, like this:
$ sudo modprobe mtdblock
$ sudo modprobe nandsim first_id_byte=0x20 second_id_byte=0xac third_id_byte=0x00 fourth_id_byte=0x15
$ mtdinfo /dev/mtd0
mtd0
Name: NAND simulator partition 0
Type: nand
Eraseblock size: 131072 bytes, 128.0 KiB
Amount of eraseblocks: 4096 (536870912 bytes, 512.0 MiB)
Minimum input/output unit size: 2048 bytes
Sub-page size: 512 bytes
OOB size: 64 bytes
Character device major/minor: 90:0
Bad blocks are allowed: true
Device is writable: true
$ sudo flash_erase /dev/mtd0 0 0
$ cp 7BC000.ubi test_infile
$ sudo ubiformat /dev/mtd0 -O 2048 -f test_infile
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 4096 eraseblocks are supposedly empty
ubiformat: error!: file "test_infile" (size 268713984 bytes) is not multiple of eraseblock size (131072 bytes)
error 0 (Success)
The size of "test_file" is 0x10044000, so I just remove the last 0x4000 bytes, and tried to ubiformat again.
$ dd if=test_infile of=test_infile_dd bs=268697600 count=1
$ sudo ubiformat /dev/mtd0 -O 2048 -f test_infile_dd
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 4096 eraseblocks are supposedly empty
ubiformat: flashing eraseblock 1 -- 0 % complete ubiformat: error!: bad UBI magic 0xffffffff, should be 0x55424923
ubiformat: error!: bad EC header at eraseblock 1 of "test_infile_dd"
I did some research and found out that this UBI image consists of many blocks, and every block contains data plus OOB.
The last command fails because it looks for the magic 0x55424923 at the wrong position (0x20000); because of the OOB, 0x55424923 is actually at 0x21000. So I thought that deleting all the OOB data from "test_infile_dd" might work. The bash script and a test follow.
#!/bin/bash
# ./dump.sh
# data per block: 0x20000
# OOB per block:  0x01000
# block 0
dd if=test_infile of=test_infile_dd_nooob bs=$((0x20000)) count=1
declare -i i=1
# remaining blocks
while ((i<2048))
do
    dd if=test_infile of=out bs=$((0x21000)) count=1 skip=$i
    dd if=out of=outfile bs=$((0x20000)) count=1
    cat outfile >> test_infile_dd_nooob
    rm out outfile
    let i++
done
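For what it's worth, the same OOB stripping can be written without temporary files; this is a sketch under the same assumed geometry (raw blocks of 0x21000 bytes, of which the first 0x20000 bytes are data):
#!/bin/bash
# strip the trailing 0x1000 bytes of OOB from each raw 0x21000-byte block
for ((i = 0; i < 2048; i++)); do
    dd if=test_infile bs=$((0x21000)) skip=$i count=1 2>/dev/null | head -c $((0x20000))
done > test_infile_dd_nooob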
After removing all the OOB, I compared the two files and confirmed that the OOB had been removed.
$ xxd test_infile_dd | grep "5542 4923"
00000000: 5542 4923 0100 0000 0000 0000 0000 0002 UBI#............
00021000: 5542 4923 0100 0000 0000 0000 0000 0002 UBI#............
00042000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
00063000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
00084000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
$ xxd test_infile_dd_nooob | grep "5542 4923"
00000000: 5542 4923 0100 0000 0000 0000 0000 0002 UBI#............
00020000: 5542 4923 0100 0000 0000 0000 0000 0002 UBI#............
00040000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
00060000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
00080000: 5542 4923 0100 0000 0000 0000 0000 0001 UBI#............
Then I ran ubiformat again and got another error about bad UBI magic.
$ sudo ubiformat /dev/mtd0 -O 2048 -f test_infile_dd_nooob
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 1 eraseblocks have valid erase counter, mean value is 0
ubiformat: 4095 eraseblocks are supposedly empty
ubiformat: warning!: only 1 of 4096 eraseblocks have valid erase counter
ubiformat: erase counter 0 will be used for all eraseblocks
ubiformat: note, arbitrary erase counter value may be specified using -e option
ubiformat: continue? (y/N) y
ubiformat: use erase counter 0 for all eraseblocks
ubiformat: flashing eraseblock 1074 -- 54 % complete ubiformat: error!: bad UBI magic 00000000, should be 0x55424923
ubiformat: error!: bad EC header at eraseblock 1074 of "test_infile_dd_nooob"
I used ghex to fix the wrong EC header in eraseblock 1074 and ran ubiformat again; now the same block's CRC is not right.
sudo ubiformat /dev/mtd0 -O 2048 -f test_infile_dd_nooob -e 10
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 1074 eraseblocks have valid erase counter, mean value is 10
ubiformat: 3022 eraseblocks are supposedly empty
ubiformat: use erase counter 10 for all eraseblocks
ubiformat: flashing eraseblock 1074 -- 54 % complete ubiformat: error!: bad CRC 0x7d72af58, should be 00000000
ubiformat: error!: bad EC header at eraseblock 1074 of "test_infile_dd_nooob"
I fixed the CRC and ran ubiformat again; this time it completed, so I loaded the ubi module and tried to attach mtd0, but another error occurred.
sudo ubiformat /dev/mtd0 -O 2048 -f test_infile_dd_nooob -e 10
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 1074 eraseblocks have valid erase counter, mean value is 10
ubiformat: 3022 eraseblocks are supposedly empty
ubiformat: use erase counter 10 for all eraseblocks
ubiformat: flashing eraseblock 1987 -- 100 % complete
ubiformat: formatting eraseblock 4095 -- 100 % complete
$ sudo modprobe ubi
$ sudo modprobe ubi mtd=0
$ sudo ubiattach -m 0 -O 2048
ubiattach: error!: cannot attach mtd0
error 22 (Invalid argument)
Running dmesg, I found this message:
$ sudo dmesg
[ 6974.021149] 0001efa0: 00 00 00 00 00 00 00 00 10 0a 00 00 01 00 00 00 00 0a d9 d5 05 f9 20 a1 63 d7 00 00 00 02 fb d2 ...................... .c.......
[ 6974.021150] 0001efc0: ce 15 00 00 00 0d 00 00 02 00 00 00 04 00 20 00 c7 00 00 00 0d 0d 0d 00 00 00 0b 01 b8 00 03 db .............. .................
[ 6974.021151] 0001efe0: 03 9d 03 5e 03 20 02 fd 02 d8 02 93 02 4e 02 1b 01 f6 01 b8 20 00 7a 14 08 00 32 3b 81 0e 04 17 ...^. .......N...... .z...2;....
[ 6974.023703] ubi0 error: validate_ec_hdr [ubi]: node with incompatible UBI version found: this UBI version is 1, image version is 0
[ 6974.023707] ubi0 error: validate_ec_hdr [ubi]: bad EC header
[ 6974.023707] Erase counter header dump:
[ 6974.023708] magic 0x55424923
[ 6974.023708] version 0
[ 6974.023709] ec 10
[ 6974.023709] vid_hdr_offset 2048
[ 6974.023710] data_offset 4096
[ 6974.023710] image_seq 144665903
[ 6974.023711] hdr_crc 0xb574c34c
[ 6974.023711] erase counter header hexdump:
[ 6974.023713] 00000000: 55 42 49 23 00 00 00 00 00 00 00 00 00 00 00 0a 00 00 08 00 00 00 10 00 08 9f 6d 2f 00 00 00 00 UBI#......................m/....
[ 6974.023713] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b5 74 c3 4c .............................t.L
[ 6974.023715] CPU: 4 PID: 14955 Comm: ubiattach Tainted: G W E 5.7.0-kali1-amd64 #1 Debian 5.7.6-1kali2
[ 6974.023716] Hardware name: Dell Inc. Inspiron 7472/0GHVRJ, BIOS 1.1.6 06/14/2018
[ 6974.023716] Call Trace:
[ 6974.023722] dump_stack+0x66/0x90
[ 6974.023725] validate_ec_hdr+0x8a/0xe0 [ubi]
[ 6974.023729] ubi_io_read_ec_hdr+0x1e9/0x280 [ubi]
[ 6974.023732] ubi_attach+0x1d3/0x14c0 [ubi]
[ 6974.023736] ubi_attach_mtd_dev+0x5b3/0xd30 [ubi]
[ 6974.023741] ? __get_mtd_device+0x2c/0xa0 [mtd]
[ 6974.023743] ? _cond_resched+0x15/0x30
[ 6974.023746] ctrl_cdev_ioctl+0xda/0x1c0 [ubi]
[ 6974.023748] ksys_ioctl+0x87/0xc0
[ 6974.023749] __x64_sys_ioctl+0x16/0x20
[ 6974.023751] do_syscall_64+0x52/0x180
[ 6974.023753] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6974.023754] RIP: 0033:0x7f3f55902c87
[ 6974.023756] Code: 00 00 00 48 8b 05 09 92 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d9 91 0c 00 f7 d8 64 89 01 48
[ 6974.023756] RSP: 002b:00007ffd4fec6f88 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 6974.023757] RAX: ffffffffffffffda RBX: 00007ffd4fec7020 RCX: 00007f3f55902c87
[ 6974.023758] RDX: 00007ffd4fec6fb0 RSI: 0000000040186f40 RDI: 0000000000000003
[ 6974.023758] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[ 6974.023759] R10: fffffffffffff48e R11: 0000000000000206 R12: 000055c2a393c052
[ 6974.023759] R13: 00007ffd4fec6fb0 R14: 0000000000000000 R15: 0000000000000000
[ 6974.023763] ubi0 error: ubi_io_read_ec_hdr [ubi]: validation failed for PEB 1074
[ 6974.061006] ubi0 error: ubi_attach_mtd_dev [ubi]: failed to attach mtd0, error -22
But I didn't know how to solve this, so I just removed block 1074 from the file.
$ dd if=test_infile_dd_nooob of=test_infile_dd_nooob_no1074_1 bs=131072 count=1074
$ dd if=test_infile_dd_nooob of=test_infile_dd_nooob_no1074_2 bs=131072 skip=1075
$ cat test_infile_dd_nooob_no1074_1 test_infile_dd_nooob_no1074_2 > test_infile_dd_nooob_no1074
Then I ran ubiformat and ubiattach again, but there was another error.
$ sudo ubiformat /dev/mtd0 -O 2048 -f test_infile_dd_nooob_no1074 -e 10
ubiformat: mtd0 (nand), size 536870912 bytes (512.0 MiB), 4096 eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
libscan: scanning eraseblock 4095 -- 100 % complete
ubiformat: 4096 eraseblocks have valid erase counter, mean value is 10
ubiformat: use erase counter 10 for all eraseblocks
ubiformat: flashing eraseblock 1986 -- 100 % complete
ubiformat: formatting eraseblock 4095 -- 100 % complete
$ sudo ubiattach -m 0 -O 2048
ubiattach: error!: cannot attach mtd0
error 22 (Invalid argument)
Checking dmesg, I found the following. This is where I am stuck: I don't know how I ended up with so many bad blocks.
$ sudo dmesg
ubi0: scanning is finished
[ 7392.005554] ubi0 error: ubi_attach [ubi]: 1205 PEBs are corrupted and preserved
[ 7392.005554] Corrupted PEBs are: 1805 1802 1793 1678 1674 1670 1666 1662 1654 1653 1652 1649 1640 1639 1626 1625 1621 1605 1587 1586 1581 1563 1553 1540 1534 1533 1532 1531 1530 1529 1528 1527 1526 1525 1524 1523 1522 1521 1520 1519 1518 1517 1516 1515 1514 1512 1511 1510 1509 1508 1507 1506 1505 1504 1503 1502 1501 1500 1499 1498 1496 1495 1494 1493 1492 1491 1490 1489 1488 1487 1486 1485 1484 1483 1482 1481 1471 1449 1448 1447 1446 1445 1444 1441 1439 1438 1437 1436 1435 1434 1433 1432 1431 1430 1429 1428 1425 1424 1423 1422 1421 1420 1419 1418 1417 1416 1415 1414 1413 1412 1411 1410 1409 1408 1407 1406 1405 1404 1403 1402 1401 1400 1399 1398 1397 1396 1395 1394 1393 1391 1390 1389 1388 1387 1386 1385 1384 1383 1382 1381 1380 1379 1378 1377 1376 1375 1374 1373 1372 1371 1370 1369 1368 1367 1366 1365 1364 1363 1362 1361 1360 1359 1358 1357 1356 1355 1354 1353 1352 1351 1350 1349 1348 1347 1346 1345 1344 1343 1342 1341 1340 1339 1338 1337 1335 1334 1333 1332 1331 1330 1329 1328 1327 1326
[ 7392.005578] 1325 1324 1323 1322 1321 1294 1275 1274 1273 1264 1230 1223 1221 1219 1214 1211 1210 1207 1204 1203 1202 1201 1200 1199 1198 1197 1196 1195 1194 1193 1192 1191 1190 1189 1188 1187 1186 1185 1184 1183 1182 1181 1180 1179 1178 1177 1176 1175 1174 1173 1172 1165 1164 1163 1157 1147 1144 1143 1142 1141 1140 1139 1138 1137 1136 1134 1133 1132 1131 1130 1129 1128 1127 1126 1125 1124 1123 1122 1121 1120 1119 1118 1117 1116 1115 1114 1112 1111 1110 1109 1108 1106 1105 1098 1085 1053 1052 1044 1016 1003 1002 977 973 972 963 939 938 937 936 935 934 933 932 931 930 929 928 927 926 925 924 923 922 921 920 919 918 917 916 915 914 913 912 911 910 909 908 907 906 905 904 903 902 900 899 898 897 896 895 894 893 892 891 890 889 888 887 886 885 884 883 882 881 880 879 878 877 876 875 874 873 872 871 870 869 868 867 866 865 864 863 862 861 860 859 858 857 856 855 854 853 852 851 850 849 848 847 846 845 844 841 840 839 838 837 836 835 834 833 832 831 830 829 828 827 826 825 824 823 822 821 820
[ 7392.005606] 819 818 817 816 815 814 813 812 811 810 809 808 807 806 805 804 803 802 801 800 799 798 797 796 795 794 793 792 791 790 789 788 787 785 784 782 781 780 779 778 777 776 775 774 773 772 771 770 769 768 767 766 765 764 763 762 761 760 759 758 757 756 755 754 753 752 751 750 749 748 747 746 745 744 743 742 741 740 739 738 737 736 735 734 733 732 731 730 729 727 726 725 724 723 722 721 720 719 718 716 715 714 713 712 711 710 709 708 707 706 705 704 703 702 701 700 699 698 697 696 695 694 693 692 691 690 689 688 687 686 685 684 683 682 681 680 679 678 677 676 675 674 673 672 671 670 668 667 666 665 664 663 662 661 660 659 657 656 655 654 653 652 651 650 649 648 647 646 645 644 643 642 641 640 639 638 637 636 635 634 633 632 631 630 629 628 627 626 625 624 623 622 621 620 619 618 617 616 615 614 613 611 610 609 608 607 606 605 604 602 601 600 599 598 597 596 595 594 593 592 591 589 588 587 586 585 584 583 582 581 580 579 578 577 576 575 574 573 572 571 570 569 568 567 566 565 564 563
[ 7392.005634] 562 561 560 559 558 557 556 554 553 552 551 550 549 548 547 546 545 544 543 541 540 539 538 537 536 535 534 533 532 531 530 529 528 527 526 525 524 523 522 521 520 519 518 517 516 515 514 513 512 511 510 509 508 507 506 505 504 503 502 501 500 498 497 496 495 494 493 492 491 489 488 487 486 485 484 483 482 481 480 479 478 477 476 475 474 473 472 471 470 469 468 467 466 465 464 463 461 460 459 458 457 456 455 454 453 452 451 450 449 448 447 446 445 444 443 441 440 438 437 436 435 434 433 432 431 430 429 428 427 426 425 424 423 422 421 420 419 418 417 416 415 414 413 412 411 410 409 408 407 406 405 404 403 402 401 400 399 398 397 396 395 394 393 392 391 390 389 388 387 386 384 383 382 381 380 379 378 377 376 375 374 373 372 371 370 369 368 367 366 365 364 363 362 361 360 359 358 357 356 355 354 353 352 351 350 349 348 347 346 345 344 343 342 341 340 339 338 337 336 335 333 332 331 330 329 328 327 326 325 324 323 322 321 320 319 318 317 316 315 314 313 312 311 310 309 308 307 306
[ 7392.005661] 305 304 303 302 301 300 295 294 293 292 290 289 288 287 286 285 284 283 282 281 280 279 278 277 276 275 274 273 271 270 269 268 267 266 265 264 263 262 261 260 259 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237 236 235 233 231 230 229 228 227 226 225 224 223 222 221 220 219 218 217 216 215 214 213 212 211 210 209 208 207 206 205 204 203 202 201 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 181 180 179 177 176 175 174 172 171 170 169 168 167 166 165 164 163 162 161 160 159 158 157 156 155 154 153 152 151 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132 131 130 129 128 127 126 125 124 123 122 121 120 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30
[ 7392.005690] 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
[ 7392.005698] ubi0 error: ubi_attach.cold [ubi]: too many corrupted PEBs, refusing
[ 7392.028819] ubi0 error: ubi_attach_mtd_dev [ubi]: failed to attach mtd0, error -22
[ 7393.182325] systemd-journald[331]: /dev/kmsg buffer overrun, some messages lost.
I read the official documentation; it says a block will be marked as bad in only two scenarios. One is when a write operation to an eraseblock fails: UBI moves the data from the bad EB to a good EB and runs some tests to confirm whether the bad EB is really bad. The other is when an erase operation fails with an EIO error, in which case the EB is marked as bad immediately. I am not sure which of these caused so many bad blocks.
My questions
Did I do anything wrong in this process? If not, how can I repair this UBI image so I can read its programs and data?
Are there other ways to extract the programs and data from this UBI image file?
Tools and versions
Kali 2020.3
mtd-utils 2.1.1

MPICH output not printing

Problem
I'm running a cp2k executable installed on an HPC cluster using mpich-3.2. The output from the executable is written to an out file. The problem is that nothing more is written to the out file after the first few steps are printed, yet when I check the status of my job on the cluster, it is still running. In short: my job keeps running, but its output is not being printed.
Script
I'm using the following job script:
#!/bin/bash
#PBS -N test
#PBS -o test.log
#PBS -j oe
#PBS -l nodes=2:ppn=20
#PBS -q mini
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR
echo Master process running on `hostname`
echo Directory is `pwd`
echo PBS has allocated the following nodes:
echo `cat $PBS_NODEFILE`
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS processes
export I_MPI_FABRICS=shm:dapl
export I_MPI_PROVIDER=psm2
export I_MPI_FALLBACK=0
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=1
export I_MPI_IFACE=ib0
echo Starting execution at `date`
EXEC="/home/arshil/software/cp2k-5.1.0/exe/local/cp2k.popt"
cp $EXEC ./cp2k
mpiexec -np $NPROCS --machinefile $PBS_NODEFILE ./cp2k -i test.inp >& out
rm cp2k
echo Finished at `date`
Error
The output in the out file:
SCF WAVEFUNCTION OPTIMIZATION
----------------------------------- OT ---------------------------------------
Minimizer : DIIS : direct inversion
in the iterative subspace
using 7 DIIS vectors
safer DIIS on
Preconditioner : FULL_SINGLE_INVERSE : inversion of
H + eS - 2*(Sc)(c^T*H*c+const)(Sc)^T
Precond_solver : DEFAULT
stepsize : 0.08000000 energy_gap : 0.08000000
eps_taylor : 0.10000E-15 max_taylor : 4
----------------------------------- OT ---------------------------------------
Step Update method Time Convergence Total energy Change
------------------------------------------------------------------------------
1 OT DIIS 0.80E-01 21.3 0.00002878 -8797.2068024142 -8.80E+03
2 OT DIIS 0.80E-01 10.9 0.00007114 -8797.2061897209 6.13E-04
3 OT DIIS 0.80E-01 10.8 0.00001688 -8797.2073257531 -1.14E-03
As can be seen, there is no printing after step 3 in the output file, but the job is still running in the background. Even after the walltime is over, the output file remains the same as above. Where is the output going?
The executable cp2k performs quantum chemical calculations and was installed on the cluster along with mpich-3.2. All CP2K needs is an input file with the extension .inp; in my case, test.inp is the input file.
&FORCE_EVAL
METHOD Quickstep
&DFT
BASIS_SET_FILE_NAME GTH_BASIS_SETS
POTENTIAL_FILE_NAME GTH_POTENTIALS
&MGRID
NGRIDS 4
CUTOFF 380
REL_CUTOFF 60
&END MGRID
&QS
METHOD GPW
MAP_CONSISTENT
EXTRAPOLATION ASPC
EXTRAPOLATION_ORDER 3
&END QS
&SCF
MAX_SCF 1000
EPS_SCF 1.0E-5
SCF_GUESS ATOMIC
&OT
PRECONDITIONER FULL_SINGLE_INVERSE
MINIMIZER DIIS
N_DIIS 7
&END OT
&PRINT
&RESTART OFF
&END RESTART
&END PRINT
&END SCF
&XC
&XC_FUNCTIONAL PBE
&END XC_FUNCTIONAL
&vdW_POTENTIAL
DISPERSION_FUNCTIONAL PAIR_POTENTIAL
&PAIR_POTENTIAL
PARAMETER_FILE_NAME dftd3.dat
TYPE DFTD3
REFERENCE_FUNCTIONAL PBE
R_CUTOFF [angstrom] 12.3
&END PAIR_POTENTIAL
&END vdW_POTENTIAL
&END XC
&END DFT
&SUBSYS
&CELL
ABC 24.6904 24.6904 24.6904
PERIODIC XYZ
&END CELL
&KIND C
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q4
&END KIND
&KIND P
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q5
&END KIND
&KIND H
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q1
&END KIND
&KIND O
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q6
&END KIND
&KIND N
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q5
&END KIND
&KIND Mg
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q10
&END KIND
&COLVAR
&COORDINATION
ATOMS_FROM 41
ATOMS_TO 38
R_0 [bohr] 4.5
NN 6
ND 12
&END COORDINATION
&END COLVAR
&COLVAR
&COORDINATION
ATOMS_FROM 41
ATOMS_TO 42 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101 104 107 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152 155 158 161 164 167 170 173 176 179 182 185 188 191 194 197 200 203 206 209 212 215 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269 272 275 278 281 284 287 290 293 296 299 302 305 308 311 314 317 320 323 326 329 332 335 338 341 344 347 350 353 356 359 362 365 368 371 374 377 380 383 386 389 392 395 398 401 404 407 410 413 416 419 422 425 428 431 434 437 440 443 446 449 452 455 458 461 464 467 470 473 476 479 482 485 488 491 494 497 500 503 506 509 512 515 518 521 524 527 530 533 536 539 542 545 548 551 554 557 560 563 566 569 572 575 578 581 584 587 590 593 596 599 602 605 608 611 614 617 620 623 626 629 632 635 638 641 644 647 650 653 656 659 662 665 668 671 674 677 680 683 686 689 692 695 698 701 704 707 710 713 716 719 722 725 728 731 734 737 740 743 746 749 752 755 758 761 764 767 770 773 776 779 782 785 788 791 794 797 800 803 806 809 812 815 818 821 824 827 830 833 836 839 842 845 848 851 854 857 860 863 866 869 872 875 878 881 884 887 890 893 896 899 902 905 908 911 914 917 920 923 926 929 932 935 938 941 944 947 950 953 956 959 962 965 968 971 974 977 980 983 986 989 992 995 998 1001 1004 1007 1010 1013 1016 1019 1022 1025 1028 1031 1034 1037 1040 1043 1046 1049 1052 1055 1058 1061 1064 1067 1070 1073 1076 1079 1082 1085 1088 1091 1094 1097 1100 1103 1106 1109 1112 1115 1118 1121 1124 1127 1130 1133 1136 1139 1142 1145 1148 1151 1154 1157 1160 1163 1166 1169 1172 1175 1178 1181 1184 1187 1190 1193 1196 1199 1202 1205 1208 1211 1214 1217 1220 1223 1226 1229 1232 1235 1238 1241 1244 1247 1250 1253 1256 1259 1262 1265 1268 1271 1274 1277 1280 1283 1286 1289 1292 1295 1298 1301 1304 1307 1310 1313 1316 1319 1322 1325 1328 1331 1334 1337 1340 1343 1346 1349 1352 1355 1358 1361 1364 1367 1370 1373 1376 1379 1382 1385 1388 1391 1394 1397 1400 1403 1406 1409 1412 1415 1418 1421 1424 1427 1430 1433 1436 1439 1442 1445 1448 1451 1454 1457
ATOMS_TO 1460 1463 1466 1469 1472 1475 1478 1481 1484 1487 1490 1493 1496 1499 1502 1505
R_0 [bohr] 4.5
NN 6
ND 12
&END COORDINATION
&END COLVAR
&END SUBSYS
&END FORCE_EVAL
&GLOBAL
PROJECT test
RUN_TYPE MD
PRINT_LEVEL LOW
&END GLOBAL
&MOTION
&MD
ENSEMBLE NVT
STEPS 100000
TIMESTEP 0.5
TEMPERATURE 310
TEMP_TOL 100
&THERMOSTAT
&NOSE
LENGTH 3
YOSHIDA 3
TIMECON 100.0
MTS 2
&END NOSE
&END
&PRINT
&ENERGY
&EACH
MD 10
&END
&END
&PROGRAM_RUN_INFO
&EACH
MD 100
&END
&END
FORCE_LAST
&END PRINT
&END MD
&FREE_ENERGY
&METADYN
DO_HILLS
LAGRANGE .TRUE.
NT_HILLS 40
WW [kcalmol] 1
TEMPERATURE 310
TEMP_TOL 10
&METAVAR
SCALE 0.05
COLVAR 1
MASS 50
LAMBDA 2
&WALL
POSITION 0.0
TYPE QUARTIC
&QUARTIC
DIRECTION WALL_MINUS
K 10.0
&END
&END
&END METAVAR
&METAVAR
SCALE 0.05
COLVAR 2
MASS 50
LAMBDA 2
&WALL
POSITION 0.0
TYPE QUARTIC
&QUARTIC
DIRECTION WALL_MINUS
K 10.0
&END
&END
&END METAVAR
&PRINT
&COLVAR
COMMON_ITERATION_LEVELS 3
&EACH
MD 1
&END
&END
&HILLS
COMMON_ITERATION_LEVELS 3
&EACH
MD 1
&END
&END
&END
&END METADYN
&END
&PRINT
&TRAJECTORY
&EACH
MD 1
&END
&END
&VELOCITIES OFF
&END
&RESTART
&EACH
MD 20
&END
ADD_LAST NUMERIC
&END
&RESTART_HISTORY
&EACH
MD 2000
&END
&END
&END
&END MOTION
&EXT_RESTART
RESTART_FILE_NAME NVT-1.restart
RESTART_COUNTERS .FALSE.
&END
In my opinion the problem is not with the input file; it has something to do with mpich-3.2. I would really appreciate some help.
Something similar may be going on here, with solutions that could apply: Python "print" not working when embedded into MPI program. It is not a perfect match since you are not using Python, but it may help.
At a basic level, MPI launches many processes, but only the command that launches them has access to stdio etc. The redirect at the end of the line starting with mpiexec sends the stdout of mpiexec to a file. The output from your program is buffered by mpiexec until the processes end (either because they complete or because they are stopped).
Where the output is going is a good question and may require changes in test.inp or some other way of shutting down (you mention you ran out of walltime). I'm looking to solve the same problem and will update this if I find an answer.
Also, the output from the different processes started by MPI can arrive in random order. I don't care about this, but if you do, you may need to pass the messages back to some common thread that sorts them into order.
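As a sketch of one thing to try: line-buffer the program's stdout with stdbuf from GNU coreutils (assuming it is available on the compute nodes; Fortran runtimes like cp2k's also buffer internally, so this may not be sufficient):
# force line-buffered stdout for the launched processes
mpiexec -np $NPROCS --machinefile $PBS_NODEFILE stdbuf -oL ./cp2k -i test.inp >& out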

Vowpal Wabbit model works badly on multiclass classification of images using pixel RGB values

I am using Vowpal Wabbit to classify multi-class images. My data set is similar to http://www.cs.toronto.edu/~kriz/cifar.html, consisting of 3000 training samples and 500 testing samples. The features are the RGB values of 32*32 images. I trained the model with the Vowpal Wabbit logistic loss function for 100 passes. During training the average loss is below 0.02 (I assume this number is pretty good, right?). Then I predicted the labels of the training set with the output model and found that the predictions are very bad: nearly all of them are of category six. I really don't know what happened, because during the training process the predictions look mostly correct, but after I predict with the saved model they suddenly all become 6.
Here is a sample line of feature.
1 | 211 174 171 165 161 161 162 163 163 163 163 163 163 163 163 163
162 161 162 163 163 163 163 164 165 167 168 167 168 163 160 187 153
102 96 90 89 90 91 92 92 92 92 92 92 92 92 92 92 92 91 90 90 90 90 91
92 94 95 96 99 97 98 127 111 71 71 64 66 68 69 69 69 69 69 69 70 70 69
69 70 71 71 69 68 68 68 68 70 72 73 75 78 78 81 96 111 69 68 61 64 67
67 67 67 67 67 67 68 67 67 66 67 68 69 68 68 67 66 66 67 69 69 69 71
70 77 89 116 74 76 71 72 74 74 72 73 74 74 74 74 74 74 74 72 72 74 76
76 75 74 74 74 73 73 72 73 74 85 92 123 83 86 83 82 83 83 82 83 83 82
82 82 82 82 82 81 80 82 85 85 84 83 83 83 85 85 85 85 86 94 95 127 92
96 93 93 92 91 91 91 91 91 90 89 89 86 86 86 86 87 89 89 88 88 88 92
92 93 98 100 96 98 96 132 99 101 98 98 97 95 93 93 94 93 93 95 96 97
95 96 96 96 96 95 94 100 103 98 93 95 100 105 103 103 96 139 106 108
105 102 100 98 98 98 99 99 100 100 95 98 93 81 78 79 77 76 76 79 98
107 102 97 98 103 107 108 99 145 115 118 115 115 115 113 ......
Here is my training script:
./vw train.vw --oaa 6 --passes 100 --loss_function logistic -c --holdout_off -f image_classification.model
Here is my predicting script (on the training data set):
./vw -i image_classification.model -t train.vw -p train.predict --quiet
Here is the statistics during training:
final_regressor = image_classification.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = train.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since    example  example  current  current  current
loss     last     counter  weight   label    predict  features
0.000000 0.000000 1 1.0 1 1 3073
0.000000 0.000000 2 2.0 1 1 3073
0.000000 0.000000 4 4.0 1 1 3073
0.000000 0.000000 8 8.0 1 1 3073
0.000000 0.000000 16 16.0 1 1 3073
0.000000 0.000000 32 32.0 1 1 3073
0.000000 0.000000 64 64.0 1 1 3073
0.000000 0.000000 128 128.0 1 1 3073
0.000000 0.000000 256 256.0 1 1 3073
0.001953 0.003906 512 512.0 2 2 3073
0.002930 0.003906 1024 1024.0 3 3 3073
0.002930 0.002930 2048 2048.0 5 5 3073
0.006836 0.010742 4096 4096.0 3 3 3073
0.012573 0.018311 8192 8192.0 5 5 3073
0.014465 0.016357 16384 16384.0 3 3 3073
0.017029 0.019592 32768 32768.0 6 6 3073
0.017731 0.018433 65536 65536.0 6 6 3073
0.017891 0.018051 131072 131072.0 5 5 3073
0.017975 0.018059 262144 262144.0 3 3 3073
finished run
number of examples per pass = 3000
passes used = 100
weighted example sum = 300000.000000
weighted label sum = 0.000000
average loss = 0.017887
total feature number = 921900000
It seems to me that it predicts perfectly during training, but after I use the output model, suddenly everything becomes category 6. I really have no idea what has gone wrong.
There are several problems in your approach.
1) I guess the training set contains all images with label 1 first, then all examples with label 2, and so on, with label 6 last. You need to shuffle such training data if you want to use online learning (which is the default learning algorithm in VW).
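For example (shuf is from GNU coreutils):
# randomize the example order so online learning does not see the labels in contiguous blocks
shuf train.vw > train.shuffled.vw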
2) VW uses sparse feature format. The order of features on one line is not important (unless you use --ngram). So if feature number 1 (red channel of the top left pixel) has value 211 and feature number 2 (red channel of the second pixel) has value 174, you need to use:
1 | 1:211 2:174 ...
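A minimal sketch of this conversion, assuming one example per line in the form "label | v1 v2 ..." (train_dense.vw is a placeholder name):
# prepend 1-based feature indices to each dense pixel value
awk '{ printf "%s |", $1; for (i = 3; i <= NF; i++) printf " %d:%s", i - 2, $i; print "" }' train_dense.vw > train.vw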
3) To get good results in image recognition you need something better than a linear model on the raw pixel values. Unfortunately, VW has no deep learning (multi-layer neural nets) and no convolutional nets. You can try --nn X to get a neural net with one hidden layer of X units (and a tanh activation function), but this is just a poor substitute for the state-of-the-art approaches to CIFAR etc. You can also try the other non-linear reductions available in VW (-q, --cubic, --lrq, --ksvm, --stage_poly). In general, I think VW is not suitable for such tasks (image recognition), unless you apply some preprocessing that generates (a lot of) features (e.g. SIFT).
4) You are overfitting.
average loss is below 0.02 (I assume this number is pretty good, right?)
No. You used --holdout_off, so the reported loss is the train loss. It is easy to get an almost zero train loss by simply memorizing all the examples, i.e. overfitting. What you want is a low test (or holdout) loss.
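For instance, simply dropping --holdout_off from the training command makes VW hold out every 10th example by default when running multiple passes, so the reported average loss then reflects held-out data (marked with an h in the progress table):
./vw train.vw --oaa 6 --passes 100 --loss_function logistic -c -f image_classification.model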
