Iterate through split string by another string - bash

I want to create a bash script to parse the data returned by this command:
openvpn --show-pkcs11-ids /usr/lib/libeTPkcs11.so
The typical output is :
The following objects are available for use.
Each object shown below may be used as parameter to
--pkcs11-id option please remember to use single quote mark.
Certificate
DN: XXX
Serial: XXXX
Serialized id: XXXX
Certificate
DN: XXXX
Serial: XXXX
Serialized id: XXXX
Certificate
DN: XXXXX
Serial: XXXX
Serialized id: XXXX
I want to get an array in bash containing 3 elements: the 3 "Certificate" blocks. I tried a lot of splitting methods, but all of them only echo the output rather than producing an actual array.
Any ideas ?
Thx !

This is one where it would be much simpler (and much, much faster) to use awk. awk provides arrays and is much more capable at processing input records than read. With awk you simply write rules to be applied to each line of input. In your case you just need to recognize whether the line begins with "DN:", "Serial:", or "Serialized". You can then store the associated value in a separate array, say arrays dn, serial, and serid. To accomplish this in awk you need nothing more than:
awk '
$1 == "Certificate" {n++}; # increment n
NF == 2 { # fill dn & serial array
$1 == "DN:" && dn[n]=$2
$1 == "Serial:" && serial[n]=$2
}
NF == 3 { # fill serid array
$1 == "Serialized" && serid[n]=$3
}
END { # output results
print "\nDN:\t\tSerial:\t\tSerialized id:"
for (i = 1; i <= n; i++) print dn[i], "\t\t", serial[i], "\t\t", serid[i]
}' file
Above, if the first field ($1) is "Certificate" you just increment a counter. If there are 2 fields in the line (NF == 2) then you check whether the line begins with "DN:" or "Serial:" and add the 2nd field to the proper array. If the line has 3 fields ("Serialized", "id:" and your value) you store the value in the serid array.
With all values stored, you can iterate over the arrays in the END rule, providing any output you need. Above it simply outputs the content in tabular form. You can just copy/middle-mouse-paste in the command line to test.
Example Use/Output
$ awk '
> $1 == "Certificate" {n++}; # increment n
> NF == 2 { # fill dn & serial array
> $1 == "DN:" && dn[n]=$2
> $1 == "Serial:" && serial[n]=$2
> }
> NF == 3 { # fill serid array
> $1 == "Serialized" && serid[n]=$3
> }
> END { # output results
> print "\nDN:\t\tSerial:\t\tSerialized id:"
> for (i = 1; i <= n; i++) print dn[i], "\t\t", serial[i], "\t\t", serid[i]
> }' file
DN: Serial: Serialized id:
XXX XXXX XXXX
XXXX XXXX XXXX
XXXXX XXXX XXXX
For large file processing, awk will be orders of magnitude faster than looping in a shell script. Let me know if this satisfies your needs or if you need additional help.
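If you still want the result in a bash array, as the question asks, one option is to have awk print one line per certificate and read those lines into an array. This is only a sketch: sample_output is a hypothetical stand-in for the openvpn command, cert_records and the "|" delimiter are names and choices made up for illustration, and it assumes the values contain no spaces or "|" characters, as in the sample.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for: openvpn --show-pkcs11-ids /usr/lib/libeTPkcs11.so
sample_output() {
    cat <<'EOF'
The following objects are available for use.
Each object shown below may be used as parameter to
--pkcs11-id option please remember to use single quote mark.

Certificate
       DN:             XXX
       Serial:         XXXX
       Serialized id:  XXXX

Certificate
       DN:             XXXX
       Serial:         XXXX
       Serialized id:  XXXX

Certificate
       DN:             XXXXX
       Serial:         XXXX
       Serialized id:  XXXX
EOF
}

# Print one "DN|Serial|Serialized id" line per Certificate block read from stdin.
cert_records() {
    awk '
        $1 == "Certificate" { n++ }
        $1 == "DN:"         { dn[n] = $2 }
        $1 == "Serial:"     { serial[n] = $2 }
        $1 == "Serialized"  { serid[n] = $3 }
        END { for (i = 1; i <= n; i++) print dn[i] "|" serial[i] "|" serid[i] }
    '
}

# Collect the lines into a bash array, one element per certificate.
certs=()
while IFS= read -r line; do
    certs+=("$line")
done < <(sample_output | cert_records)

echo "${#certs[@]}"    # prints 3
```

Each element can then be split again when needed, for example with IFS='|' read -r dn serial serid <<< "${certs[0]}".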
Edit Per-Comment
If you are dealing with a file that uses a mix of tabs and spaces as separators, you can make the separator explicit. awk's default field separator already splits on any run of spaces and tabs (and ignores leading whitespace), but you can also provide a regular expression for the separator. For instance, a sequence of one or more spaces or tabs can be specified as -F'[ \t]+'. The example below makes use of that separator. (Note: with an explicit regex separator, leading whitespace produces an empty first field, so the field numbers shift by one.)
awk -F'[ \t]+' '
$1 == "Certificate" {n++}; # increment n
NF == 3 { # fill dn & serial array
$2 == "DN:" && dn[n]=$3
$2 == "Serial:" && serial[n]=$3
}
NF == 4 { # fill serid array
$2 == "Serialized" && serid[n]=$4
}
END { # output results
print "\nDN:\t\tSerial:\t\tSerialized id:"
for (i = 1; i <= n; i++) print dn[i], "\t\t", serial[i], "\t\t", serid[i]
}' f
Example Use/Output
With your same data you would then have:
$ awk -F'[ \t]+' '
> $1 == "Certificate" {n++}; # increment n
> NF == 3 { # fill dn & serial array
> $2 == "DN:" && dn[n]=$3
> $2 == "Serial:" && serial[n]=$3
> }
> NF == 4 { # fill serid array
> $2 == "Serialized" && serid[n]=$4
> }
> END { # output results
> print "\nDN:\t\tSerial:\t\tSerialized id:"
> for (i = 1; i <= n; i++) print dn[i], "\t\t", serial[i], "\t\t", serid[i]
> }' f
DN: Serial: Serialized id:
XXX XXXX XXXX
XXXX XXXX XXXX
XXXXX XXXX XXXX
Not knowing what the space/tab makeup of your posted text actually is, this should handle either case.
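To see why the field numbers differ between the two versions, you can feed a single indented line to awk with and without the explicit separator. A small sketch (fields_default and fields_regex are hypothetical helper names; the line is one from the sample input):

```shell
# One indented line from the sample input.
line='       DN:             XXX'

# Default separator: leading whitespace is ignored, so "DN:" is field 1.
fields_default() { printf '%s\n' "$1" | awk '{ print NF, $1 }'; }

# Regex separator: the leading whitespace is a separator, so field 1 is empty
# and "DN:" becomes field 2.
fields_regex() { printf '%s\n' "$1" | awk -F'[ \t]+' '{ print NF, $2 }'; }

fields_default "$line"    # prints: 2 DN:
fields_regex "$line"      # prints: 3 DN:
```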
Further Update Posting Input Contents Taken From Question
The following is the input file f (or file) used with the examples above. It was taken from your question, but there is no guarantee the space/tab translation is the same given the copy/paste into the question. The last example above should handle it regardless. The only other caveat is a file with DOS line endings: if you feed one to awk, it won't work, because the carriage return at the end of each line becomes part of the last field (so "Certificate" no longer matches). You can check by running the utility file yourfilename and it will report if DOS CRLF line endings are present. You can then use dos2unix yourfilename to correct the problem and convert the file to Unix/POSIX line endings.
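If dos2unix is not installed, the same check and fix can be approximated with grep and tr. This is a sketch only; has_crlf and strip_cr are hypothetical helper names, and strip_cr removes every carriage return in the file, not just those at line ends.

```shell
# Succeeds if at least one line of file $1 ends in a carriage return.
has_crlf() { LC_ALL=C grep -q "$(printf '\r')\$" "$1"; }

# Write file $1 to stdout with all carriage returns removed.
strip_cr() { tr -d '\r' < "$1"; }

# Demo on a throwaway file with DOS line endings.
tmp=$(mktemp)
printf 'Certificate\r\n       DN:             XXX\r\n' > "$tmp"

if has_crlf "$tmp"; then
    echo "CRLF line endings found"
fi

# After stripping, the last field no longer carries a trailing CR.
strip_cr "$tmp" | awk '$1 == "DN:" { print "[" $2 "]" }'    # prints: [XXX]
rm -f "$tmp"
```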
Example Input File
$ cat f
The following objects are available for use.
Each object shown below may be used as parameter to
--pkcs11-id option please remember to use single quote mark.
Certificate
DN: XXX
Serial: XXXX
Serialized id: XXXX
Certificate
DN: XXXX
Serial: XXXX
Serialized id: XXXX
Certificate
DN: XXXXX
Serial: XXXX
Serialized id: XXXX
Hexdump of Contents
$ hexdump -Cv f
00000000 54 68 65 20 66 6f 6c 6c 6f 77 69 6e 67 20 6f 62 |The following ob|
00000010 6a 65 63 74 73 20 61 72 65 20 61 76 61 69 6c 61 |jects are availa|
00000020 62 6c 65 20 66 6f 72 20 75 73 65 2e 0a 45 61 63 |ble for use..Eac|
00000030 68 20 6f 62 6a 65 63 74 20 73 68 6f 77 6e 20 62 |h object shown b|
00000040 65 6c 6f 77 20 6d 61 79 20 62 65 20 75 73 65 64 |elow may be used|
00000050 20 61 73 20 70 61 72 61 6d 65 74 65 72 20 74 6f | as parameter to|
00000060 0a 2d 2d 70 6b 63 73 31 31 2d 69 64 20 6f 70 74 |.--pkcs11-id opt|
00000070 69 6f 6e 20 70 6c 65 61 73 65 20 72 65 6d 65 6d |ion please remem|
00000080 62 65 72 20 74 6f 20 75 73 65 20 73 69 6e 67 6c |ber to use singl|
00000090 65 20 71 75 6f 74 65 20 6d 61 72 6b 2e 0a 0a 43 |e quote mark...C|
000000a0 65 72 74 69 66 69 63 61 74 65 0a 20 20 20 20 20 |ertificate. |
000000b0 20 20 44 4e 3a 20 20 20 20 20 20 20 20 20 20 20 | DN: |
000000c0 20 20 58 58 58 0a 20 20 20 20 20 20 20 53 65 72 | XXX. Ser|
000000d0 69 61 6c 3a 20 20 20 20 20 20 20 20 20 58 58 58 |ial: XXX|
000000e0 58 0a 20 20 20 20 20 20 20 53 65 72 69 61 6c 69 |X. Seriali|
000000f0 7a 65 64 20 69 64 3a 20 20 58 58 58 58 0a 0a 43 |zed id: XXXX..C|
00000100 65 72 74 69 66 69 63 61 74 65 0a 20 20 20 20 20 |ertificate. |
00000110 20 20 44 4e 3a 20 20 20 20 20 20 20 20 20 20 20 | DN: |
00000120 20 20 58 58 58 58 0a 20 20 20 20 20 20 20 53 65 | XXXX. Se|
00000130 72 69 61 6c 3a 20 20 20 20 20 20 20 20 20 58 58 |rial: XX|
00000140 58 58 0a 20 20 20 20 20 20 20 53 65 72 69 61 6c |XX. Serial|
00000150 69 7a 65 64 20 69 64 3a 20 20 58 58 58 58 0a 0a |ized id: XXXX..|
00000160 43 65 72 74 69 66 69 63 61 74 65 0a 20 20 20 20 |Certificate. |
00000170 20 20 20 44 4e 3a 20 20 20 20 20 20 20 20 20 20 | DN: |
00000180 20 20 20 58 58 58 58 58 0a 20 20 20 20 20 20 20 | XXXXX. |
00000190 53 65 72 69 61 6c 3a 20 20 20 20 20 20 20 20 20 |Serial: |
000001a0 58 58 58 58 0a 20 20 20 20 20 20 20 53 65 72 69 |XXXX. Seri|
000001b0 61 6c 69 7a 65 64 20 69 64 3a 20 20 58 58 58 58 |alized id: XXXX|
000001c0 0a |.|
000001c1
Let me know the results of your file examination.

You can use AWK to do that. It is a tool specifically created for transforming table-like output.
openvpn --show-pkcs11-ids /usr/lib/libeTPkcs11.so | grep 'Certificate\|DN:\|Serial:\|Serialized id:' | awk -v RS="Certificate" '{print $2,$4,$7}'
Explanation:
grep 'Certificate\|DN:\|Serial:\|Serialized id:' - Choose only interesting lines of output
awk -v RS="Certificate" '{print $2,$4,$7}' - See below comment
Comment: AWK lets you change the record separator with the "-v RS=" parameter. By default it is a newline, so each line of the file is a record, but it can be changed to another string, e.g. "Certificate". (A multi-character RS works in gawk and mawk; POSIX leaves its behavior unspecified.)
The output is not an array, but every certificate is described on a separate line that you can pipe to another tool.
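As a sketch of that "pipe it further" step, the one-line-per-certificate output can be consumed with a read loop. sample_output is a hypothetical stand-in for the openvpn command and cert_lines is a made-up helper name; the NF >= 7 guard is an addition that skips the empty record before the first "Certificate", and the multi-character RS assumes gawk or mawk.

```shell
# Hypothetical stand-in for: openvpn --show-pkcs11-ids /usr/lib/libeTPkcs11.so
sample_output() {
    cat <<'EOF'
The following objects are available for use.
Each object shown below may be used as parameter to
--pkcs11-id option please remember to use single quote mark.

Certificate
       DN:             XXX
       Serial:         XXXX
       Serialized id:  XXXX

Certificate
       DN:             XXXX
       Serial:         XXXX
       Serialized id:  XXXX

Certificate
       DN:             XXXXX
       Serial:         XXXX
       Serialized id:  XXXX
EOF
}

# Keep only the interesting lines, then treat "Certificate" as the record separator.
cert_lines() {
    grep -E 'Certificate|DN:|Serial:|Serialized id:' |
        awk -v RS='Certificate' 'NF >= 7 { print $2, $4, $7 }'
}

# Consume one certificate per line.
sample_output | cert_lines | while read -r dn serial serid; do
    printf 'dn=%s serial=%s id=%s\n' "$dn" "$serial" "$serid"
done
```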

Related

Case doesn't work when examining tail output

I can't figure out why my case statement doesn't work when examining tail output.
tail -F -n1 /var/log/pihole.log |
while read input; do
echo "$input" | hexdump -C # just to physically compare the output
case $input in
cached|blacklisted|blocked)
echo "We have a match!";;
*)
echo "No match!"
esac
done
This always returns No match!, even if the strings are in the $input.
:~ $ ./pihole_test.sh
00000000 4a 61 6e 20 31 20 31 31 3a 35 35 3a 35 38 20 64 |Jan 1 11:55:58 d|
00000010 6e 73 6d 61 73 71 5b 36 39 36 5d 3a 20 65 78 61 |nsmasq[696]: exa|
00000020 63 74 6c 79 20 62 6c 61 63 6b 6c 69 73 74 65 64 |ctly blacklisted|
00000030 20 70 6c 61 79 2e 67 6f 6f 67 6c 65 2e 63 6f 6d | play.google.com|
00000040 20 69 73 20 30 2e 30 2e 30 2e 30 0a | is 0.0.0.0.|
0000004c
No match!
Replace
cached|blacklisted|blocked)
with
*cached*|*blacklisted*|*blocked*)
to match substrings.
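The reason is that a case pattern has to match the whole string, so the bare words only match a line that consists of exactly that word. A minimal sketch of the difference (classify_exact and classify_substring are hypothetical helper names; the log line is the one from the hexdump in the question):

```shell
# Original patterns: match only if the entire string equals one of the words.
classify_exact() {
    case $1 in
        cached|blacklisted|blocked) echo "We have a match!" ;;
        *) echo "No match!" ;;
    esac
}

# Patterns wrapped in *: match if the word appears anywhere in the string.
classify_substring() {
    case $1 in
        *cached*|*blacklisted*|*blocked*) echo "We have a match!" ;;
        *) echo "No match!" ;;
    esac
}

log='Jan 1 11:55:58 dnsmasq[696]: exactly blacklisted play.google.com is 0.0.0.0'
classify_exact "$log"        # prints: No match!
classify_substring "$log"    # prints: We have a match!
```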

How can I remove non-breaking spaces from a text file in bash?

I have a csv file with text and numbers.
If a number is bigger than 1000, it is formatted like this: 1 000,
so it has a space as a thousands separator, but it is not a regular space. I tried to remove it with sed, and that worked where there was a real space, but not in this format.
It is also not TAB, I removed all the TABs with "expand -t 1".
The following is a line that demonstrates the issue:
x17_Provident_GDN_REMARKETING_provident.hu_listák;Display_Hálózat;Szeged;2021-03-09;Kedd;Mobil;HUF;1 736;9;130.83;0.00
In the penultimate row, the value in column 8 (1 736) is the problem.
And running this: grep -E -m 1 -e '[;]1[^;]+736[;]' <yourfile.csv | hexdump -C
gives:
00000000 78 31 37 5f 50 72 6f 76 69 64 65 6e 74 5f 47 44 |x17_Provident_GD|
00000010 4e 5f 52 45 4d 41 52 4b 45 54 49 4e 47 5f 70 72 |N_REMARKETING_pr|
00000020 6f 76 69 64 65 6e 74 2e 68 75 5f 6c 69 73 74 c3 |ovident.hu_list.|
00000030 a1 6b 3b 44 69 73 70 6c 61 79 5f 48 c3 a1 6c c3 |.k;Display_H..l.|
00000040 b3 7a 61 74 3b 53 7a 65 67 65 64 3b 32 30 32 31 |.zat;Szeged;2021|
00000050 2d 30 33 2d 30 39 3b 4b 65 64 64 3b 4d 6f 62 69 |-03-09;Kedd;Mobi|
00000060 6c 3b 48 55 46 3b 31 c2 a0 37 33 36 3b 39 3b 31 |l;HUF;1..736;9;1|
00000070 33 30 2e 38 33 3b 30 2e 30 30 0a |30.83;0.00.|
0000007b
It's a 2 byte, UTF-8 encoded non breaking space - c2 a0.
You can use perl to safely remove it.
perl -pe 's/\xc2\xa0//g' dirty.csv > clean.csv
Once we knew it was a non-breaking space, I simply removed it with sed on macOS, typing the character between the slashes with Option+Space:
cat test4.csv | sed 's/ //g'
Similar to perl, you can use GNU sed with LC_ALL=C:
LC_ALL=C sed 's/\xc2\xa0//g'

What does 'BS' stand for in Sublime Text on macOS?

On macOS, I use the zsh terminal and ran the command 'man sort > sort-man.txt'.
When I open sort-man.txt with Sublime Text, I see many 'BS' markers.
What does 'BS' stand for in Sublime Text on macOS?
Could it be an encoding issue?
The man command outputs a “bold” character by printing the character, then printing a backspace character, then printing the character again. Thus:
:; man sort | hexdump -C | head
00000000 0a 53 4f 52 54 28 31 29 20 20 20 20 20 20 20 20 |.SORT(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 42 53 44 20 47 | BSD G|
00000020 65 6e 65 72 61 6c 20 43 6f 6d 6d 61 6e 64 73 20 |eneral Commands |
00000030 4d 61 6e 75 61 6c 20 20 20 20 20 20 20 20 20 20 |Manual |
00000040 20 20 20 20 20 20 20 20 53 4f 52 54 28 31 29 0a | SORT(1).|
00000050 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 20 |.N.NA.AM.ME.E. |
            ^  ^  ^
            |  |  +--- ASCII N
            |  +------ ASCII Backspace
            +--------- ASCII N
Way back in the days of physical terminals that printed on paper, this would have the effect of overstriking the character, making it appear bolder.
These days, your terminal emulator app interprets a sequence like this by changing the color or font of the character.
I guess Sublime Text shows the backspace character as BS.
Consulting the man man page, I find this under “TIPS”:
To get a plain text version of a man page, without backspaces and underscores, try
# man foo | col -b > foo.mantxt

How to add newline terminators to a .csv file that doesn't have any

I have a very large .csv file (2 GB) on which running file file.csv returns ASCII text, with very long lines, with no line terminators. I'm trying to split this large file into smaller files, but all the ways of doing this seem to rely on there being some standard kind of newline character, which my file doesn't have.
How can I add newline characters onto the end of the lines?
EDIT: I've tried using unix2dos to do this, but it doesn't do anything.
EDIT 2: Here is a hexdump of the first 128 bytes:
00000000 49 64 2c 55 73 65 6e 6e 61 6d 65 2c 44 61 74 65 |Id,Usenname,Date|
00000010 74 69 6d 65 50 6f 73 74 65 64 2c 54 61 67 55 73 |timePosted,TagUs|
00000020 65 64 34 32 2c 36 31 62 61 61 65 62 61 38 64 65 |ed42,61baaeba8de|
00000030 31 33 36 64 39 63 31 61 61 39 63 31 38 65 63 33 |136d9c1aa9c18ec3|
00000040 38 36 30 65 38 2c 32 30 30 34 2d 31 31 2d 30 34 |860e8,2004-11-04|
00000050 20 30 32 3a 32 35 3a 30 35 2e 33 37 33 37 39 38 | 02:25:05.373798|
00000060 2b 30 30 2c 65 63 6f 6c 69 34 32 2c 36 31 62 61 |+00,ecoli42,61ba|
00000070 61 65 62 61 38 64 65 31 33 36 64 39 63 31 61 61 |aeba8de136d9c1aa|
On line 3 of the dump, there should be a line break between ed and 42 and, similarly, on line 7 there should be a break between ecoli and 42. Does this output mean that the csv I've been given is actually just one long line?
Thanks

wcslen() works differently in Xcode and VC++

I found that wcslen() in VC++ 2010 returns the correct count of letters, while Xcode does not.
For example, the code below correctly returns 11 in VC++ 2010, but incorrectly returns 17 in Xcode 4.2.
const wchar_t *p = L"123abc가1나1다";
size_t plen = wcslen(p);
I guess Xcode app stores wchar_t string as UTF-8 in memory. This is another strange thing.
How can I get 11 just like VC++ in Xcode too?
I ran this program on a Mac Mini running MacOS X 10.7.2 (Xcode 4.2):
#include <stdio.h>
#include <wchar.h>
int main(void)
{
const wchar_t p[] = L"123abc가1나1다";
size_t plen = wcslen(p);
if (fwide(stdout, 1) <= 0)
{
fprintf(stderr, "Failed to make stdout wide-oriented\n");
return -1;
}
wprintf(L"String <<%ls>>\n", p);
putwc(L'\n', stdout);
wprintf(L"Length = %zu\n", plen);
for (size_t i = 0; i < sizeof(p)/sizeof(*p); i++)
wprintf(L"Character %zu = 0x%X\n", i, p[i]);
return 0;
}
When I do a hex dump of the source file, I see:
0x0000: 23 69 6E 63 6C 75 64 65 20 3C 73 74 64 69 6F 2E #include <stdio.
0x0010: 68 3E 0A 23 69 6E 63 6C 75 64 65 20 3C 77 63 68 h>.#include <wch
0x0020: 61 72 2E 68 3E 0A 0A 69 6E 74 20 6D 61 69 6E 28 ar.h>..int main(
0x0030: 76 6F 69 64 29 0A 7B 0A 20 20 20 20 63 6F 6E 73 void).{. cons
0x0040: 74 20 77 63 68 61 72 5F 74 20 70 5B 5D 20 3D 20 t wchar_t p[] =
0x0050: 4C 22 31 32 33 61 62 63 EA B0 80 31 EB 82 98 31 L"123abc...1...1
0x0060: EB 8B A4 22 3B 0A 20 20 20 20 73 69 7A 65 5F 74 ...";. size_t
0x0070: 20 70 6C 65 6E 20 3D 20 77 63 73 6C 65 6E 28 70 plen = wcslen(p
0x0080: 29 3B 0A 20 20 20 20 69 66 20 28 66 77 69 64 65 );. if (fwide
0x0090: 28 73 74 64 6F 75 74 2C 20 31 29 20 3C 3D 20 30 (stdout, 1) <= 0
0x00A0: 29 0A 20 20 20 20 7B 0A 20 20 20 20 20 20 20 20 ). {.
0x00B0: 66 70 72 69 6E 74 66 28 73 74 64 65 72 72 2C 20 fprintf(stderr,
0x00C0: 22 46 61 69 6C 65 64 20 74 6F 20 6D 61 6B 65 20 "Failed to make
0x00D0: 73 74 64 6F 75 74 20 77 69 64 65 2D 6F 72 69 65 stdout wide-orie
0x00E0: 6E 74 65 64 5C 6E 22 29 3B 0A 20 20 20 20 20 20 nted\n");.
0x00F0: 20 20 72 65 74 75 72 6E 20 2D 31 3B 0A 20 20 20 return -1;.
0x0100: 20 7D 0A 20 20 20 20 77 70 72 69 6E 74 66 28 4C }. wprintf(L
0x0110: 22 53 74 72 69 6E 67 20 3C 3C 25 6C 73 3E 3E 5C "String <<%ls>>\
0x0120: 6E 22 2C 20 70 29 3B 0A 20 20 20 20 70 75 74 77 n", p);. putw
0x0130: 63 28 4C 27 5C 6E 27 2C 20 73 74 64 6F 75 74 29 c(L'\n', stdout)
0x0140: 3B 0A 20 20 20 20 77 70 72 69 6E 74 66 28 4C 22 ;. wprintf(L"
0x0150: 4C 65 6E 67 74 68 20 3D 20 25 7A 75 5C 6E 22 2C Length = %zu\n",
0x0160: 20 70 6C 65 6E 29 3B 0A 20 20 20 20 66 6F 72 20 plen);. for
0x0170: 28 73 69 7A 65 5F 74 20 69 20 3D 20 30 3B 20 69 (size_t i = 0; i
0x0180: 20 3C 20 73 69 7A 65 6F 66 28 70 29 2F 73 69 7A < sizeof(p)/siz
0x0190: 65 6F 66 28 2A 70 29 3B 20 69 2B 2B 29 0A 20 20 eof(*p); i++).
0x01A0: 20 20 20 20 20 20 77 70 72 69 6E 74 66 28 4C 22 wprintf(L"
0x01B0: 43 68 61 72 61 63 74 65 72 20 25 7A 75 20 3D 20 Character %zu =
0x01C0: 30 78 25 58 5C 6E 22 2C 20 69 2C 20 70 5B 69 5D 0x%X\n", i, p[i]
0x01D0: 29 3B 0A 20 20 20 20 72 65 74 75 72 6E 20 30 3B );. return 0;
0x01E0: 0A 7D 0A .}.
0x01E3:
The output when compiled with GCC is:
String <<123abc
Length = 11
Character 0 = 0x31
Character 1 = 0x32
Character 2 = 0x33
Character 3 = 0x61
Character 4 = 0x62
Character 5 = 0x63
Character 6 = 0xAC00
Character 7 = 0x31
Character 8 = 0xB098
Character 9 = 0x31
Character 10 = 0xB2E4
Character 11 = 0x0
Note that the string is truncated at the zero byte - I think that is probably a bug in the system, but it seems a little unlikely that I'd manage to find one on my first attempt at using wprintf(), so it is more likely I'm doing something wrong.
You're right, in the multi-byte UTF-8 source code, the string occupies 17 bytes (8 one-byte ASCII characters, and 3 characters each encoded using 3 bytes). So, a raw strlen() on the source string would return 17.
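The 17 versus 11 arithmetic can be checked from the shell on the same bytes that appear in the source-file hex dump. This is a sketch; byte_count and char_count are hypothetical helper names, and char_count relies on the fact that UTF-8 continuation bytes fall in the range 0x80 to 0xBF (octal 200 to 277).

```shell
# The string 123abc + U+AC00 + 1 + U+B098 + 1 + U+B2E4 as raw UTF-8 bytes
# (EA B0 80, EB 82 98, EB 8B A4 written as octal escapes).
emit_string() { printf '123abc\352\260\2001\353\202\2301\353\213\244'; }

# Number of bytes on stdin.
byte_count() { wc -c | tr -d ' '; }

# Number of code points on stdin: count bytes that are not continuation bytes.
char_count() { LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '; }

emit_string | byte_count    # prints: 17
emit_string | char_count    # prints: 11
```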
GCC version is:
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Just for giggles, I tried clang, and I get a different result. Compiled using:
clang -o row row.c -Wall -std=c99
using:
Apple clang version 2.1 (tags/Apple/clang-163.7.1) (based on LLVM 3.0svn)
Target: x86_64-apple-darwin11.3.0
Thread model: posix
The output when compiled with clang is:
String <<123abc가1나1다>>
Length = 17
Character 0 = 0x31
Character 1 = 0x32
Character 2 = 0x33
Character 3 = 0x61
Character 4 = 0x62
Character 5 = 0x63
Character 6 = 0xEA
Character 7 = 0xB0
Character 8 = 0x80
Character 9 = 0x31
Character 10 = 0xEB
Character 11 = 0x82
Character 12 = 0x98
Character 13 = 0x31
Character 14 = 0xEB
Character 15 = 0x8B
Character 16 = 0xA4
Character 17 = 0x0
So, now the string appears correctly, but the length is given as 17 instead of 11. Superficially, you can take your choice of bugs - string looks OK (in a terminal - /Applications/Utilities/Terminal - acclimatized to UTF8) but length is wrong, or length is right but string does not appear correctly.
I note that sizeof(wchar_t) in both gcc and clang is 4.
The left hand does not understand what the right hand is doing. I think there's a case for claiming both are broken, in different ways.
