Unix - Sorting in shell script

How do I sort a file based on field position?
For example, I need to sort the file given below based on the 4th, 5th and 8th positions. Please help.
I tried the following command, but it's not working :(
sort -d -k 3.42,44 -k 4.47,57 -k 5.59,70 -k 8.73,82
010835 03 0000000010604CAQZ 0912104072 QNZAW AZ ATC 1704698441
010835 03 0000000010604CZWX 7823775785 WDXSD GZ DDF 2804698441
010835 03 0000000010604CBEC 8737518498 DICDC CY HWT 0904698441
010835 03 0000000010604CERV 5648240160 FFVFV DZ UXE 8404698441
010835 03 0000000010604CTTV 2555338251 TTBGB FZ EZS 9504698441
010835 03 0000000010604CADB 1465045344 BINHH TZ QKZ 4604698441
010835 03 0000000010604CIFN 2374902637 NOMJU VZ XHU 6704698441
010835 03 0000000010604COGM 3281553523 JSLKI YZ CLK 5804698441
010835 03 0000000010604CPCL 4190899186 PQJLL QZ UPL 3004698441

Try this command:
sort -k4,4 -k5,5 -k8,8 input.txt
From the sort manual:
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is
one or more single-letter ordering options, which override global ordering options for that key. If
no key is given, use the entire line as the key.
In your command:
-k 3.42,44 means: start the key at the 42nd character of the 3rd field and end it at the 44th field.
Did you mean -k 3.42,3.44?
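To see the field.char,field.char form in action, here is a minimal sketch with made-up three-line data (the file name and contents are hypothetical, not from the question):

```shell
# Hypothetical three-line sample; the key is characters 2-3 of field 2.
printf 'x 1b9\nx 1a7\nx 1c1\n' > /tmp/keydemo.txt
# -b skips leading blanks when counting characters inside a field;
# without it, the blank before the field counts as character 1.
sort -b -k 2.2,2.3 /tmp/keydemo.txt
```

This prints the lines ordered by the "a7" / "b9" / "c1" substrings, i.e. `x 1a7` first.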

You could try:
sort -d -t $'\n' -k 1.42,1.44 -k 1.47,1.57 -k 1.59,1.70 -k 1.73,1.82 input.txt
You would get this :
010835 03 0000000010604CADB 1465045344 BINHH TZ QKZ 4604698441
010835 03 0000000010604CAQZ 0912104072 QNZAW AZ ATC 1704698441
010835 03 0000000010604CBEC 8737518498 DICDC CY HWT 0904698441
010835 03 0000000010604CERV 5648240160 FFVFV DZ UXE 8404698441
010835 03 0000000010604CIFN 2374902637 NOMJU VZ XHU 6704698441
010835 03 0000000010604COGM 3281553523 JSLKI YZ CLK 5804698441
010835 03 0000000010604CPCL 4190899186 PQJLL QZ UPL 3004698441
010835 03 0000000010604CTTV 2555338251 TTBGB FZ EZS 9504698441
010835 03 0000000010604CZWX 7823775785 WDXSD GZ DDF 2804698441
The idea is to use the $'\n' (newline) character as the field separator, so that every line is a single field.
Inspiration from http://www.computing.net/answers/unix/sort-file-by-position-/7735.html
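A quick way to convince yourself the trick works, using a made-up three-line sample (file name and data are hypothetical):

```shell
# Made-up sample: sort on characters 8-9 of each whole line.
printf 'aaa bb 20\naaa bb 05\naaa bb 13\n' > /tmp/lines.txt
# With -t $'\n' every line is a single field, so 1.8,1.9 addresses
# absolute character positions 8-9, regardless of spacing.
sort -t $'\n' -k 1.8,1.9 /tmp/lines.txt
```

The lines come out ordered by the two-digit suffix: 05, 13, 20.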

Related

awk or sed command for columns and rows selection from multiple files

Looking for a command for the following task:
I have three files, each with two columns, as seen below.
I would like to create file4 with four columns.
The output should resemble a merge-sorted version of file1, file2 and file3, such that the first column is sorted, the second column is the second column of file1, the third is the second column of file2, and the fourth is the second column of file3.
The entries in columns 2 to 4 should not be sorted, but should stay matched to the key value in the first column of the original files.
I tried an intersection approach on Linux, but it did not give the desired output.
Any help will be appreciated. Thanks in advance!!
$ cat -- file1
A1 B5
A10 B2
A3 B15
A15 B6
A2 B10
A6 B19
$ cat -- file2
A10 C4
A4 C8
A6 C5
A3 C10
A12 C14
A15 C18
$ cat -- file3
A3 D1
A22 D9
A20 D3
A10 D5
A6 D10
A21 D11
$ cat -- file4
col1 col2 col3 col4
A1 B5
A2 B10
A3 B15 C10 D1
A4 C8
A6 B19 C5 D10
A10 B2 C4 D5
A12 C14
A15 B6 C18
A20 D3
A21 D11
A22 D9
Awk + Bash version:
( echo "col1, col2, col3, col4" &&
awk 'ARGIND==1 { a[$1]=$2; allkeys[$1]=1 } ARGIND==2 { b[$1]=$2; allkeys[$1]=1 } ARGIND==3 { c[$1]=$2; allkeys[$1]=1 }
END{
for (k in allkeys) {
print k", "a[k]", "b[k]", "c[k]
}
}' file1 file2 file3 | sort -V -k1,1 ) | column -t -s ','
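Note that ARGIND is a gawk extension. A portable sketch of the same idea uses an FNR==1 file counter instead (the variable name fi and the use of mktemp to recreate the sample files are my own, not from the answer):

```shell
# Recreate the question's three input files in a scratch directory.
cd "$(mktemp -d)"
printf 'A1 B5\nA10 B2\nA3 B15\nA15 B6\nA2 B10\nA6 B19\n'  > file1
printf 'A10 C4\nA4 C8\nA6 C5\nA3 C10\nA12 C14\nA15 C18\n' > file2
printf 'A3 D1\nA22 D9\nA20 D3\nA10 D5\nA6 D10\nA21 D11\n' > file3
# fi increments each time a new input file starts (FNR resets to 1),
# which replaces gawk's ARGIND.
result=$(awk 'FNR==1 { fi++ }
  fi==1 { a[$1]=$2; allkeys[$1]=1 }
  fi==2 { b[$1]=$2; allkeys[$1]=1 }
  fi==3 { c[$1]=$2; allkeys[$1]=1 }
  END { for (k in allkeys) print k", "a[k]", "b[k]", "c[k] }' file1 file2 file3 |
  sort -V -k1,1)
printf '%s\n' "$result" | column -t -s ','
```

As in the answer above, sort -V (version sort) and column are GNU/util-linux conveniences.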
Pure Bash version:
declare -A a
while read -r key value; do a[$key]="${a[$key]:-}${a[$key]:+, }$value"; done < file1
while read -r key value; do a[$key]="${a[$key]:-, }${a[$key]:+, }$value"; done < file2
while read -r key value; do a[$key]="${a[$key]:-, , }${a[$key]:+, }$value"; done < file3
(echo "col1, col2, col3, col4" &&
for i in "${!a[@]}"; do
echo "$i, ${a[$i]}"
done | sort -V -k1,1) | column -t -s ','
Explanation for "${a[$key]:-, , }${a[$key]:+, }$value" please check Shell-Parameter-Expansion
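A quick illustration of the two expansion forms used above (variable names are arbitrary):

```shell
unset u          # u is unset
v=""             # v is set but empty
w="set"          # w is set and non-empty
echo "${u:-fallback}"   # unset -> use the fallback text
echo "${v:-fallback}"   # empty -> also fallback (the ":" forms treat empty like unset)
echo "${w:+suffix}"     # set and non-empty -> substitute the alternate text
echo "${u:+suffix}"     # unset -> substitute nothing
```

So in the loops above, `:-` seeds the right number of empty placeholder columns the first time a key is seen, and `:+` prepends a comma only when the key already has a value.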
Using GNU Awk:
gawk '{ a[$1] = substr($1, 2); b[$1, ARGIND] = $2 }
END {
PROCINFO["sorted_in"] = "@val_num_asc"
for (i in a) {
t = i
for (j = 1; j <= ARGIND; ++j)
t = t OFS b[i, j]
print t
}
}' file{1..3} | column -t
There is a simple tool called join that allows you to perform this operation:
#!/usr/bin/env bash
cut -d ' ' -f1 file{1,2,3} | sort -k1,1 -u > ftmp
for f in file1 file2 file3; do
mv -- ftmp file4
join -a1 -e "---" -o auto file4 <(sort -k1,1 "$f") > ftmp
done
sort -k1,1V ftmp > file4
cat file4
This outputs
A1 B5 --- ---
A2 B10 --- ---
A3 B15 C10 D1
A4 --- C8 ---
A6 B19 C5 D10
A10 B2 C4 D5
A12 --- C14 ---
A15 B6 C18 ---
A20 --- --- D3
A21 --- --- D11
A22 --- --- D9
I used --- to indicate an empty field. If you want to pretty print this, you have to re-parse it with awk or anything else.
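The flags do the heavy lifting here: -a1 also prints unpairable lines from the first file, -e '---' fills in the missing fields, and -o auto (a GNU join option) keeps the output column count fixed. A minimal sketch with made-up two-line inputs (file names are hypothetical):

```shell
printf 'A1 x\nA2 y\n' > /tmp/left.txt
printf 'A1 p\nA3 q\n' > /tmp/right.txt
# -a1: also print lines from the first file with no match in the second
# -e '---': fill missing join fields with ---
# -o auto: infer a fixed output format from the first line (GNU join)
join -a1 -e '---' -o auto /tmp/left.txt /tmp/right.txt
```

This prints `A1 x p` followed by `A2 y ---`; without -e and -o auto the unmatched A2 line would simply have fewer columns.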
This might work for you (GNU sed and sort):
s=''; for f in file{1,2,3}; do s="$s\t"; sed -E "s/\s+/$s/" "$f"; done |
sort -V |
sed -Ee '1i\col1\tcol2\tcol3\tcol4' -e ':a;N;s/^((\S+\t).*\S).*\n\2\t+/\1\t/;ta;P;D'
Replace spaces by tabs and insert the number of tabs between the key and value depending on which file is being processed.
Sort the output by key column order.
Coalesce each line with its key and print the result.

Converting string using bash

I want to convert the output of command:
dmidecode -s system-serial-number
which is a string looking like this:
VMware-56 4d ad 01 22 5a 73 c2-89 ce 3f d8 ba d6 e4 0c
to:
564dad01-225a-73c2-89ce-3fd8bad6e40c
I suspect I need to first extract all the letters and numbers after the "VMware-" part at the start, and then insert "-" at the known positions after character 10, 14, 18, 22.
To try the first extraction I have tried:
$ echo `dmidecode -s system-serial-number | grep -oE '(VMware-)?[a0-Z9]'`
VMware-5 6 4 d a d 0 1 2 2 5 a 7 3 c 2 8 9 c e 3 f d 8 b a d 6 e 4 0 c
However this isn't going the right way.
EDIT:
This gets me to a single long string, however it's not elegant:
$ echo `dmidecode -s system-serial-number | sed -s "s/VMware-//" | sed -s "s/-//" | sed -s "s/ //g"`
564dad01225a73c289ce3fd8bad6e40c
Like this:
dmidecode -s system-serial-number |
sed -E 's/VMware-//;
s/ +//g;
s/(.)/\1-/8;
s/(.)/\1-/13;
s/(.)/\1-/23'
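Since dmidecode needs root, the pipeline can be tried with echo and the sample string from the question instead (same sed program):

```shell
echo 'VMware-56 4d ad 01 22 5a 73 c2-89 ce 3f d8 ba d6 e4 0c' |
  sed -E 's/VMware-//;
          s/ +//g;
          s/(.)/\1-/8;
          s/(.)/\1-/13;
          s/(.)/\1-/23'
# -> 564dad01-225a-73c2-89ce-3fd8bad6e40c
```

The numeric flag on each s command (a GNU sed feature) replaces only the Nth match, which is how the dashes land after the 8th, 13th and 23rd characters of the squeezed string (the embedded dash from the original serial accounts for the uneven offsets).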
You can use Bash sub string extraction:
$ s="VMware-56 4d ad 01 22 5a 73 c2-89 ce 3f d8 ba d6 e4 0c"
$ s1=$(echo "${s:7}" | tr -d '[:space:]')
$ echo "${s1:0:8}-${s1:8:4}-${s1:12:9}-${s1:21}"
564dad01-225a-73c2-89ce-3fd8bad6e40c
Or, built-ins only (ie, no tr):
$ s1=${s:7}
$ s1="${s1// /}"
$ echo "${s1:0:8}-${s1:8:4}-${s1:12:9}-${s1:21}"

replace CR LF in text file using `sed` or `fart` (Find And Replace Text)

I have a 1.5 GB Windows text file with some lines ending in LF and most lines ending in CR+LF.
Can you please help with a sed script which
will replace all CR+LF with $|$
replace all LF with CR+LF
replace back all $|$ with CR+LF
I tried to do all the replacements with a text editor, but it was extremely slow (1 percent in half an hour). I've also tried to replace them with fart:
fart -c -B -b text.txt "\r\n" "$|$"
with following result
replacement 0 occurence(s) in 0 file(s)..
One with awk:
$ awk '{sub(/(^|[^\r])$/,"&\r")}1' file
Testing it (0x0a is LF, 0x0d is CR):
$ awk 'BEGIN{print "no\nyes\r\n\n\r"}' > foo
$ hexdump -C foo
00000000 6e 6f 0a 79 65 73 0d 0a 0a 0d 0a |no.yes.....|
0000000b
$ awk '{sub(/(^|[^\r])$/,"&\r")}1' foo > bar
$ hexdump -C bar
00000000 6e 6f 0d 0a 79 65 73 0d 0a 0d 0a 0d 0a |no..yes......|
0000000d
I would do this: first remove all \r at the end of the line, then explicitly add a \r to the end of the line.
sed -e 's/\r$//' -e 's/$/\r/' file
Here's a demo:
$ printf "1\r\n2\n3\n4\r\n5\n" > file
$ od -c file
0000000 1 \r \n 2 \n 3 \n 4 \r \n 5 \n
0000014
$ sed -i -e 's/\r$//' -e 's/$/\r/' file
$ od -c file
0000000 1 \r \n 2 \r \n 3 \r \n 4 \r \n 5 \r \n
0000017
This is GNU sed.
It's simpler just to install a utility like unix2dos, which does this automatically. With unix2dos, the proposed intermediate step of converting CR+LF to $|$ (and back) isn't necessary. Demo:
# first dump a file with both *DOS* and *Unix* style line endings:
hexdump -C <({ seq 2 | unix2dos ; seq 3 4; } )
# the same file, run through unix2dos
hexdump -C <({ seq 2 | unix2dos ; seq 3 4; } | unix2dos)
Output:
00000000 31 0d 0a 32 0d 0a 33 0a 34 0a |1..2..3.4.|
0000000a
00000000 31 0d 0a 32 0d 0a 33 0d 0a 34 0d 0a |1..2..3..4..|
0000000c
Or more elaborately, a before/after table, (see man hexdump for details on formatting):
hdf() { hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"' "$@" ; }
# Note: the `printf` stuff keeps `paste` from misaligning the output.
paste <(hdf <({ seq 2 | unix2dos ; seq 3 4; }) ; printf '\t\n\t\n' ; ) \
<(hdf <({ seq 2 | unix2dos ; seq 3 4; } | unix2dos ))
Output:
0# _1_ 0# _1_
1# _cr_ 1# _cr_
2# _lf_ 2# _lf_
3# _2_ 3# _2_
4# _cr_ 4# _cr_
5# _lf_ 5# _lf_
6# _3_ 6# _3_
7# _lf_ 7# _cr_
8# _4_ 8# _lf_
9# _lf_ 9# _4_
10# _cr_
11# _lf_

Inconsistencies when packing hex string

I am having some inconsistencies when using hexdump and xxd. When I run the following command:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| xxd
it returns the following results:
00000000: c2a4 2dc2 9dc3 bec2 8fc2 9351 5d0d 5f60 ..-........Q]._`
00000010: c28a 5760 44c3 8e4c 61c3 a61e ..W`D..La...
Note the "c2" characters. This also happens when I run xxd -p.
When I run the same command except with hexdump -C:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump -C
I get the same results (including the "c2" characters):
00000000 c2 a4 2d c2 9d c3 be c2 8f c2 93 51 5d 0d 5f 60 |..-........Q]._`|
00000010 c2 8a 57 60 44 c3 8e 4c 61 c3 a6 1e |..W`D..La...|
However, when I run hexdump with no arguments:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump
I get the following [correct] results:
0000000 a4c2 c22d c39d c2be c28f 5193 0d5d 605f
0000010 8ac2 6057 c344 4c8e c361 1ea6
For the purpose of this script, I'd rather use xxd as opposed to hexdump. Thoughts?
The problem that you observe is due to UTF-8 encoding and little-endianness.
First, note that when you try to print any Unicode character in AWK, like 0xA4 (CURRENCY SIGN), it actually produces two bytes of output, like the two bytes 0xC2 0xA4 that you see in your output:
$ echo 1 | awk 'BEGIN { printf("%c", 0xA4) }' | hexdump -C
Output:
00000000 c2 a4 |..|
00000002
This holds for any character bigger than 0x7F and it is due to UTF-8 encoding, which is probably the one set in your locale. (Note: some AWK implementations will have different behavior for the above code.)
Secondly, when you use hexdump without the -C argument, it displays each pair of bytes in swapped order due to the little-endianness of your machine: each pair of bytes is treated as a single 16-bit word, instead of each byte being shown separately as xxd and hexdump -C do. So the xxd output that you get is actually the correct byte-for-byte representation of the input.
Thirdly, if you want to produce the precise byte string that is encoded in the hexadecimal string that you are feeding to sed, you can use this Python solution:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" | sed 's/\(..\)/0x\1,/g' | python3 -c "import sys;[open('tmp','wb').write(bytearray(eval('[' + line + ']'))) for line in sys.stdin]" && cat tmp | xxd
Output:
00000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
00000010: 4c61 e61e La..
Why not use xxd with -r and -p?
echo a42d9dfe8f93515d0d5f608a576044ce4c61e61e | xxd -r -p | xxd
output
0000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
0000010: 4c61 e61e La..

Easiest way to strip newline character from input string in pasteboard

Hopefully this is fairly straightforward. To explain the use case: when I run the following command (OS X 10.6):
$ pwd | pbcopy
the pasteboard contains a newline character at the end. I'd like to get rid of it.
pwd | tr -d '\n' | pbcopy
printf $(pwd) | pbcopy
or
echo -n $(pwd) | pbcopy
Note that these should really be quoted in case there are whitespace characters in the directory name. For example:
echo -n "$(pwd)" | pbcopy
I wrote a utility called noeol to solve this problem. It pipes stdin to stdout, but leaves out the trailing newline if there is one. E.g.
pwd | noeol | pbcopy
…I aliased copy to noeol | pbcopy.
Check it out here: https://github.com/Sidnicious/noeol
I was having issues with the tr -d '\n' approach. On OS X I happened to have the coreutils package installed via brew install coreutils. This provides all the "normal" GNU utilities prefixed with a g in front of their typical names, so head would be ghead, for example.
Using this worked more safely IMO:
pwd | ghead -c -1 | pbcopy
You can use od to see what's happening with the output:
$ pwd | ghead -c -1 | /usr/bin/od -h
0000000 552f 6573 7372 732f 696d 676e 6c6f 6c65
0000020 696c
0000022
vs.
$ pwd | /usr/bin/od -h
0000000 552f 6573 7372 732f 696d 676e 6c6f 6c65
0000020 696c 000a
0000023
The difference?
The 0a is the hex code for a newline (the 00 in the final 000a word is just od padding the odd trailing byte). The ghead -c -1 merely "chomps" the last byte from the output before handing it off to | pbcopy; note that it drops that byte whether or not it is a newline.
$ man ascii | grep -E '\b00\b|\b0a\b'
00 nul 01 soh 02 stx 03 etx 04 eot 05 enq 06 ack 07 bel
08 bs 09 ht 0a nl 0b vt 0c np 0d cr 0e so 0f si
We can first delete the trailing newline if any, and then give it to pbcopy as follows:
your_command | perl -0 -pe 's/\n\Z//' | pbcopy
We can also create an alias of this:
alias pbc="perl -0 -pe 's/\n\Z//' | pbcopy"
Then the command would become:
pwd | pbc