How does the ClickHouse primary index work?

As mentioned in the title, I am very confused about the ClickHouse primary index.
The ClickHouse primary index uses these files: primary.idx, [primaryField].mrk, [primaryField].bin.
Where is the MarkRange stored, and how does it work?
How do these files relate to each other?

primary.idx -- contains the primary key columns' values, one entry per granule.
Checked with ClickHouse 20.13.1.5273:
create table X(A Int64, S String)
Engine=MergeTree order by A settings index_granularity=4096, min_bytes_for_wide_part=0;
insert into X select number, toString(number) from numbers(32768);
cd /var/lib/clickhouse/data/default/X/all_1_1_0/
The table is ordered by A (Int64), so primary.idx stores 8-byte A values one after another, one per granule of index_granularity=4096 rows.
primary.idx === 0, 4096, 8192, ...
Check:
od -l -j 0 -N 8 primary.idx   # skip 0 bytes, read 8
0000000 0
od -l -j 8 -N 8 primary.idx   # skip 8 bytes, read 8
0000010 4096
od -l -j 16 -N 8 primary.idx  # skip 16 bytes, read 8
0000020 8192 0
The .mrk files map each primary-index mark to offsets in the column's .bin file. Each S.mrk2 entry is 24 bytes: offset in the compressed file, offset in the decompressed block, and the number of rows in the granule.
od -l -j 0 -N 24 S.mrk2
0000000 0 0
0000020 4096
0 -- offset in compressed file (S.bin)
0 -- offset in decompressed block
4096 -- number of rows in granule
od -l -j 48 -N 24 S.mrk2
0000060 0 39850
0000100 4096
0 -- offset in compressed file (S.bin)
39850 -- offset in decompressed block
4096 -- number of rows in granule
od -l -j 72 -N 24 S.mrk2
0000110 0 62618
0000130 4096
0 -- offset in compressed file (S.bin)
62618 -- offset in decompressed block
4096 -- number of rows in granule
4096+4096+4096 = 12288 -- the granule at mark 3 of column S must start with the string '12288'.
Check:
dd status=none bs=1 skip=0 if=S.bin|clickhouse-compressor -d|dd status=none bs=1 skip=62618 count=50|hexdump -C
00000000 05 31 32 32 38 38 05 31 32 32 38 39 05 31 32 32 |.12288.12289.122|
00000010 39 30 05 31 32 32 39 31 05 31 32 32 39 32 05 31 |90.12291.12292.1|
00000020 32 32 39 33 05 31 32 32 39 34 05 31 32 32 39 35 |2293.12294.12295|
00000030 05 31 |.1|
00000032
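
The same layout can be dumped without computing offsets by hand. A minimal sketch, assuming GNU od/stat and the 8-byte keys and 24-byte .mrk2 entries shown above:

# dump every primary.idx entry at once:
# -A d prints byte offsets in decimal, -t d8 prints signed 64-bit values
od -A d -t d8 primary.idx

# walk S.mrk2 one 24-byte mark at a time; each output line is
# (offset in compressed S.bin, offset in decompressed block, rows in granule)
size=$(stat -c %s S.mrk2)
for ((off = 0; off < size; off += 24)); do
    od -A n -t d8 -j "$off" -N 24 S.mrk2 | xargs
done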
Some diagrams (in Russian): https://raw.githubusercontent.com/clickhouse/clickhouse-presentations/master/meetup27/adaptive_index_granularity.pdf

Related

Bash command "Head" is not showing certain columns of my bed/csv file

I have a bed file named coverage.bed. When I execute head coverage.bed, this is the beginning of the output:
chr start end . . strand length CG CA CT CC TG AG GG
chr1 3000380 3000440 . . + 172 0 2 9 2
chr1 3000492 3000552 . . + 172 0 1 9 1
chr1 3000593 3000653 . . + 1055 0 4 7 4
However, when I view the file using gedit coverage.bed, I see that these are the correct first 3 lines:
chr start end . . strand length CG CA CT CC TG AG GG
chr1 3000380 3000440 . . + 172 0 2 9 1 3 5 2
chr1 3000492 3000552 . . + 172 0 1 9 2 8 1 1
chr1 3000593 3000653 . . + 1055 0 4 7 3 6 5 4
Why is this happening? A Python script produced this file; could something in the code have led to this error?
Edit: the output of sed -n 2p coverage.bed | hexdump -C is:
00000000 63 68 72 31 09 33 30 30 30 33 38 30 09 33 30 30 |chr1.3000380.300|
00000010 30 34 34 30 09 2e 09 2e 09 2b 09 31 37 32 09 30 |0440.....+.172.0|
00000020 09 32 09 39 09 31 09 33 09 35 09 32 0d 0a |.2.9.1.3.5.2..|
0000002e
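
One thing the hexdump does show: the line ends in 0d 0a (CRLF), so the Python script wrote Windows-style line endings, and stray carriage returns are a common reason terminal output looks truncated or overwritten. A quick sketch for spotting and stripping them (dos2unix would work too):

# make non-printing characters visible; CRLF lines end in ^M$
sed -n 2p coverage.bed | cat -A

# write a copy with the carriage returns removed
tr -d '\r' < coverage.bed > coverage.unix.bed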

Is there a way to understand what the Oracle Data Pump utility updates in the dmp file after the extract?

I do not want to wait for Oracle Data Pump (expdp) to finish writing to the dump file,
so I start reading data from the moment the file is created
and write that data to another file.
This worked OK: the file sizes are the same (the one Data Pump created and the one my monitoring script created).
But when I run cmp, it shows a difference of 27 bytes:
cmp -l ora.dmp monitor_10k_rows.dmp
3 263 154
4 201 131
5 174 173
6 103 75
48 64 70
58 0 340
64 0 1
65 0 104
66 0 110
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 15
20509 0 157
20510 0 230
20526 0 10
20532 0 15
20533 0 225
20534 0 150
913437 0 226
913438 0 37
913454 0 10
913460 0 1
913461 0 104
913462 0 100
ls -al ora.dmp
-rw-r--r-- 1 oracle oinstall 999424 Jun 20 11:35 ora.dmp
python -c 'print 999424-913462'
85962
od ora.dmp -j 913461 -N 1
3370065 000100
3370066
od monitor_10k_rows.dmp -j 913461 -N 1
3370065 000000
3370066
Even if I extract more data, the difference is still 27 bytes, but at different addresses/values:
cmp -l ora.dmp monitor_30k_rows.dmp
3 245 134
4 222 264
5 377 376
6 54 45
48 36 43
57 0 2
58 0 216
64 0 1
65 0 104
66 0 120
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 50
20509 0 126
20510 0 173
20526 0 10
20532 0 50
20533 0 174
20534 0 120
2674717 0 226
2674718 0 47
2674734 0 10
2674740 0 1
2674741 0 104
2674742 0 110
Some writes are the same.
Is there a way to know in advance the addresses of the bytes that will differ?
ls -al ora.dmp
-rw-r--r-- 1 bicadmin bic 2760704 Jun 20 11:09 ora.dmp
python -c 'print 2760704-2674742'
85962
How can I update my monitored copy after Data Pump updates the original at address 2674742, using Python for example?
Exactly the same thing happens if I use the COMPRESSION=DATA_ONLY option.
Update: I figured out how to sync the bytes that differ between the two files:
import os

def patch_file(fn, diff):
    # cmp -l output: 1-based address, octal value in file 1, octal value in file 2
    with open(fn, 'r+b') as f:
        for line in diff.split(os.linesep):
            if line:
                addr, to_octal, _ = line.strip().split()
                f.seek(int(addr) - 1)
                f.write(chr(int(to_octal, 8)))
diff="""
3 157 266
4 232 276
5 272 273
6 16 25
48 64 57
58 340 0
64 1 0
65 104 0
66 110 0
541 61 60
545 61 60
552 61 60
559 61 60
20508 15 0
20509 157 0
20510 230 0
20526 10 0
20532 15 0
20533 225 0
20534 150 0
913437 226 0
913438 37 0
913454 10 0
913460 1 0
913461 104 0
913462 100 0
"""
patch_file(f3,diff)
Then I wrote a fuller patch in Python:
import os, binascii

# f1 = path to the original dump, f3 = path to the monitored copy (defined elsewhere)
addr = [3, 4, 5, 6, 48, 58, 64, 65, 66, 541, 545, 552, 559,
        20508, 20509, 20510, 20526, 20532, 20533, 20534]
last_range = [85987, 85986, 85970, 85964, 85963, 85962]

def get_bytes(addr):
    out = []
    with open(f1, 'r+b') as f:
        for a in addr:
            f.seek(a - 1)
            data = f.read(1)
            hex = binascii.hexlify(data)
            binary = int(hex, 16)
            octa = oct(binary)
            out.append((a, octa))
    return out

def patch_file(fn, bytes_to_update):
    with open(fn, 'r+b') as f:
        for (a, to_octal) in bytes_to_update:
            print (a, to_octal)
            f.seek(int(a) - 1)
            f.write(chr(int(to_octal, 8)))

if 1:
    from_file = f1
    fsize = os.stat(from_file).st_size
    bytes_to_read = addr + [fsize - x for x in last_range]
    bytes_to_update = get_bytes(bytes_to_read)
    to_file = f3
    patch_file(to_file, bytes_to_update)
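
For a one-off fix, the same byte-level patch can also be done from the shell. A sketch with GNU dd, using one address/value pair from the cmp -l output above (cmp reports 1-based addresses and octal byte values):

# write the byte with octal value 104 at 1-based offset 913461;
# conv=notrunc patches in place without truncating the file
printf '\104' | dd of=monitor_10k_rows.dmp bs=1 seek=$((913461 - 1)) conv=notrunc status=none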
The reason I monitor the dmp file is that it cuts the backup time in half.

In bash, split a variable into an array with each element containing n values from the list

I'm issuing a query to MySQL and it's returning, say, 1,000 rows, but each iteration of the program could return a different number of rows. I need to break this result set up (without using a MySQL LIMIT) into chunks of 100 rows that I can then iterate through programmatically.
So
MySQLOutput='1 2 3 4 ... 10,000'
I need to turn that into an array that looks like
array[1]="1 2 3 ... 100"
array[2]="101 102 103 ... 200"
etc.
I have no clue how to accomplish this elegantly
Using Charles' data generation:
MySQLOutput=$(seq 1 10000 | tr '\n' ' ')
# the sed command will add a newline after every 100 words
# and the mapfile command will read the lines into an array
mapfile -t MySQLOutSplit < <(
sed -r 's/([^[:blank:]]+ ){100}/&\n/g; $s/\n$//' <<< "$MySQLOutput"
)
echo "${#MySQLOutSplit[#]}"
# 100
echo "${MySQLOutSplit[0]}"
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
echo "${MySQLOutSplit[99]}"
# 9901 9902 9903 9904 9905 9906 9907 9908 9909 9910 9911 9912 9913 9914 9915 9916 9917 9918 9919 9920 9921 9922 9923 9924 9925 9926 9927 9928 9929 9930 9931 9932 9933 9934 9935 9936 9937 9938 9939 9940 9941 9942 9943 9944 9945 9946 9947 9948 9949 9950 9951 9952 9953 9954 9955 9956 9957 9958 9959 9960 9961 9962 9963 9964 9965 9966 9967 9968 9969 9970 9971 9972 9973 9974 9975 9976 9977 9978 9979 9980 9981 9982 9983 9984 9985 9986 9987 9988 9989 9990 9991 9992 9993 9994 9995 9996 9997 9998 9999 10000
Something like this:
# generate content
MySQLOutput=$(seq 1 10000 | tr '\n' ' ') # seq is awful, don't use in real life
# split into a large array, each item stored individually
read -r -a MySQLoutArr <<<"$MySQLOutput"
# add each batch of 100 items into a new array entry
batchSize=100
MySQLoutSplit=( )
for ((i=0; i<${#MySQLoutArr[@]}; i+=batchSize)); do
MySQLoutSplit+=( "${MySQLoutArr[*]:i:batchSize}" )
done
To explain some of the finer points:
read -r -a foo reads contents into an array named foo, split on IFS, up to the next character specified by read -d (none given here, thus reading only a single line). If you wanted each line to be a new array entry, consider IFS=$'\n' read -r -d '' -a foo, which will read each line into an array, terminated at the first NUL in the input stream.
"${foo[*]:i:batchSize}" expands to a list of items in array foo, starting at index i, and taking the next batchSize items, concatenated into a single string with the first character in $IFS used as a separator.

Mifare 1K write block but cannot read value block

For the last three days I have been trying to understand data blocks and value blocks on the Mifare 1K.
For example, I successfully wrote data to block 1 with this APDU:
< FF D6 00 01 10 61 79 79 69 6C 64 69 7A 66 61 74 69 68 31 31 31
- Start Block 01
- Number of Bytes to Write: 16
- Data: ayyildizfatih111
> 90 00
- Write Binary Block Success
Then I can read as below APDU:
< FF B0 00 01 10
- Data Read at Start Block 01
- Number of Bytes Read: 16
> 61 79 79 69 6C 64 69 7A 66 61 74 69 68 31 31 31 90 00
- ASCII Mode: ayyildizfatih111
- Read Binary Block Success
But when I try to read it as a value block, I get this error:
< FF B1 00 01 04
- ACR122U Read Value Block
> 63 00
- Operation failed
So my question is: what is the difference? When I am writing data, should I use binary blocks or value blocks? Which one is better?
Reading the value block fails because your block 1 is not a value block. Binary data blocks and value blocks share the same memory, the difference is just how you format the contents of the block and how you set the permissions for the block.
In order to turn block 1 into a value block, you would set the blocks access bits to allow value block operations (decrement, transfer, restore, and (optional) increment). You would then write the block as a value block (with ACR122U V2.02: either using the Value Block Operation command or using a regular Update Binary Block command).
The format of a value block (when using binary data block operations) is:
     +----------+----------+----------+----+----+----+----+
Byte |   0..3   |   4..7   |   8..11  | 12 | 13 | 14 | 15 |
     +----------+----------+----------+----+----+----+----+
Data | xxxxxxxx | yyyyyyyy | xxxxxxxx | uu | vv | uu | vv |
     +----------+----------+----------+----+----+----+----+
Where xxxxxxxx is a 4-byte signed (2's complement) integer (LSB = byte 0), yyyyyyyy is the inverted value of xxxxxxxx, uu is an address byte (which can be used when implementing a backup mechanism), and vv is the inverted value of uu.
Whether you should use binary data blocks or the value block format depends on your application. If you want to store a 4-byte integer value and want to use value block operations, you may prefer the value block format. If you want to store other data, don't need the redundancy of the value block format, and only want to use binary read/write operations, you may prefer to use the block as a free-form binary data block.
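
As an illustration only (not tied to any particular reader API), here is a sketch that builds the 16-byte value-block layout above for a given value and address byte; the result could then be written with a regular Update Binary Block command:

# build a Mifare Classic value block: value (LSB first), inverted value,
# value again, then addr, ~addr, addr, ~addr
make_value_block() {
    local value=$1 addr=$2 v inv
    v=$(printf '%08X' $(( value & 0xFFFFFFFF )))
    inv=$(printf '%08X' $(( ~value & 0xFFFFFFFF )))
    lsb() { echo -n "${1:6:2} ${1:4:2} ${1:2:2} ${1:0:2}"; }  # reverse to LSB-first
    printf '%s %s %s %02X %02X %02X %02X\n' \
        "$(lsb "$v")" "$(lsb "$inv")" "$(lsb "$v")" \
        "$addr" $(( ~addr & 0xFF )) "$addr" $(( ~addr & 0xFF ))
}
make_value_block 1234 1
# -> D2 04 00 00 2D FB FF FF D2 04 00 00 01 FE 01 FE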

How to add a 0 digit to a single-symbol hex value where it is missing, bash

I have a file with the following content:
$ cat somefile
28 46 5d a2 26 7a 192 168 2 2
0 15 e c8 a8 a3 192 168 100 3
54 4 2b 8 c 26 192 168 20 3
As you can see, the values in the first six columns are in hex and the values in the last four columns are in decimal. I just want to prepend a 0 to every single-symbol hexadecimal value.
Thanks beforehand.
This one should work out for you:
while read -a line
do
    hex=(${line[@]:0:6})
    printf "%02x " ${hex[@]/#/0x}
    echo ${line[@]:6:4}
done < somefile
Example:
$ cat somefile
28 46 5d a2 26 7a 192 168 2 2
0 15 e c8 a8 a3 192 168 100 3
54 4 2b 8 c 26 192 168 20 3
$ while read -a line
> do
>     hex=(${line[@]:0:6})
>     printf "%02x " ${hex[@]/#/0x}
>     echo ${line[@]:6:4}
> done < somefile
28 46 5d a2 26 7a 192 168 2 2
00 15 0e c8 a8 a3 192 168 100 3
54 04 2b 08 0c 26 192 168 20 3
Here is a way with awk if that is an option:
awk '{for(i=1;i<=6;i++) if(length($i)<2) $i=0$i}1' file
Test:
$ cat file
28 46 5d a2 26 7a 192 168 2 2
0 15 e c8 a8 a3 192 168 100 3
54 4 2b 8 c 26 192 168 20 3
$ awk '{for(i=1;i<=6;i++) if(length($i)<2) $i=0$i}1' file
28 46 5d a2 26 7a 192 168 2 2
00 15 0e c8 a8 a3 192 168 100 3
54 04 2b 08 0c 26 192 168 20 3
Please try this too, if it helps (bash version 4.1.7(1)-release)
#!/bin/bash
while read line; do
    arr=($line)
    i=0
    for num in "${arr[@]}"; do
        if [ $i -lt 6 ]; then
            if [ ${#num} -eq 1 ]; then
                arr[i]='0'${arr[i]}
            fi
        fi
        i=$((i+1))
    done
    echo "${arr[*]}"
done < your_file
This might work for you (GNU sed):
sed 's/\b\S\s/0&/g' file
It finds a single non-space character followed by whitespace and prepends a 0 to it.
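
One caveat shows up when running it on the sample file: the pattern cannot tell hex fields from decimal ones, so a single-digit decimal value followed by a space (the 2 in the ninth column of the first line) gets padded as well:

$ sed 's/\b\S\s/0&/g' file
28 46 5d a2 26 7a 192 168 02 2
00 15 0e c8 a8 a3 192 168 100 3
54 04 2b 08 0c 26 192 168 20 3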
