How to split by Unicode chars in shell

Using Java:
File file = new File("C:/Users/Administrator/Desktop/es.txt");
List<String> lines = FileUtils.readLines(file, "utf-8");
for (String line : lines) {
    String[] arr = line.split("\\u007C\\u001C");
    System.out.println(arr.length);
    System.out.println(Arrays.toString(arr));
}
How can I do it in shell (awk, tr, or sed)?
I've tried this, but it didn't work:
awk -F\u007c\u001c '{print $1}' es.txt
Thanks.

Obviously, U+007C and U+001C are plain old 7-bit ASCII characters, so splitting on those doesn't actually require any Unicode support (apart from possibly handling any ASCII-incompatible Unicode encoding in the files you are manipulating; but your question indicates that your data is in UTF-8, so that does not seem to be the case here. UTF-16 would require the splitting tool to be specifically aware of and compatible with the encoding).
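A minimal sketch of that point, assuming a shell with $'…' quoting (bash, ksh, zsh): both separator characters are single bytes, so a byte-oriented awk call can split on them directly; the pipe is wrapped in a bracket expression so the multi-character (regex) field separator treats it literally.
awk -F$'[|]\x1c' '{print $1}' es.txt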
Assuming your question can be paraphrased as "if I know the numeric Unicode code point I want to split on, how do I pass that to a tool which is capable of splitting on it", my recommendation would be Perl.
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' es.txt
using U+1F4A9 as the separator. (Perl's arrays are zero-based, so $F[0] corresponds to Awk's $1. The -a option requests field splitting to the array @F; normally, Perl does not explicitly split the input into fields.) If the hex code for the code point you want to use as the field separator is in a shell variable, use double quotes instead of single, obviously.
PIPE='007C'
FS='001C'
perl -CSD -aF"\N{U+$PIPE}\N{U+$FS}" -nle 'print $F[0]' es.txt
Alternatively, if the tool you want to use handles UTF-8 transparently, you can use the ANSI C quoting facility of Bash to specify the separator. Unicode support seems only to have been introduced in Bash 4.2 so e.g. Debian Squeeze (currently oldoldstable) does not have it.
awk -F$'\U0001f4a9' '{print $1}' es.txt # or $'\u007c' for 4-digit code points
However, because the quoting facility is a form of single quotes, you can't (easily) have the separator's code point value in a variable.
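A possible workaround sketch, assuming GNU awk: let awk itself build the field separator from a hex code point held in a shell variable (here the single character U+001C; cp is just an illustrative variable name).
cp=001C
gawk -v cp="$cp" 'BEGIN { FS = sprintf("%c", strtonum("0x" cp)) } { print $1 }' es.txt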

gawk 4.1.3
[root@test /tmp]$ more a
\u8BF7\u5C06\u60A8\u8981\u8F6C\u6362\u7684\u6C49\u6587\u8981\u8F6C\u5185\u5BB9\u7C98\u8D34\u5728\u8FD9\u91CC\u3002
[root@test /tmp]$ awk -F '.u8981..8F6C' '{print $1}' a
\u8BF7\u5C06\u60A8
[root@test /tmp]$ awk -F '.u8981..8F6C' '{print $2}' a
\u6362\u7684\u6C49\u6587
[root@test /tmp]$ awk -F '.u8981..8F6C' '{print $3}' a
\u5185\u5BB9\u7C98\u8D34\u5728\u8FD9\u91CC\u3002
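Note that the dots in the -F pattern are regex wildcards standing in for the literal backslashes in the file. If you prefer to match them explicitly, you can double the escaping instead (a sketch, assuming gawk, which processes escape sequences in the -F argument); it should print the same second field as above.
awk -F'\\\\u8981\\\\u8F6C' '{print $2}' a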

Pure bash:
As your question is tagged shell there is a pure bash way:
declare -a out=()
pnt=0
while IFS= read -d '' -rn1 char; do
    LANG=C LC_ALL=C printf -v val %d "'$char"    # numeric value of the character's first byte (C locale)
    (( val == 195 )) && out[pnt]+= &&            # 195 = 0xC3, the UTF-8 lead byte of é, è, à, ñ, ...
        printf -v out[pnt+1] "%s" "${char}" &&   # store the separator character in the next slot
        ((pnt+=2)) ||                            # and move past it
        printf -v out[pnt] "%s%s" "${out[pnt]}" "${char}"   # otherwise append to the current content slot
done <<<'Il est déjà très tard!'
With the submitted string containing UTF-8 chars and a trailing newline, this creates an array of 7 strings:
declare -p out
declare -a out=([0]="Il est d" [1]="é" [2]="j" [3]="à" [4]=" tr" [5]="è" [6]=$'s tard!\n')
or
cat -n <(printf -- "<%s>\n" "${out[@]@Q}")
1 <'Il est d'>
2 <'é'>
3 <'j'>
4 <'à'>
5 <' tr'>
6 <'è'>
7 <$'s tard!\n'>
The separators land at the odd indices and the content at the even indices.
As a function:
splitOnUnicod () {
    local -n out=$1
    out=()
    local -i pnt=0 cval
    local char
    while IFS= read -d '' -rn1 char; do
        LANG=C LC_ALL=C printf -v cval %d "'$char"
        ((cval==195)) && out[pnt]+= && printf -v out[++pnt] %s "$char" && pnt+=1 ||
            printf -v out[pnt] %s%s "${out[pnt]}" "$char"
    done
}
Then
splitOnUnicod myvar <<<"Généralités"
declare -p myvar
declare -a myvar=([0]="G" [1]="é" [2]="n" [3]="é" [4]="ralit" [5]="é" [6]=$'s\n')
splitOnUnicod myvar < <(printf "Iñès.")
declare -p myvar
declare -a myvar=([0]="I" [1]="ñ" [2]="" [3]="è" [4]="s.")
Here ñ and è are separators; they sit at the odd indices.
paste <(printf %s\\n "${!myvar[@]}") <(printf %s\\n "${myvar[@]}")
0 I
1 ñ
2
3 è
4 s.
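A small usage sketch of that convention (parts is just an illustrative array name): since content lands at the even indices and separators at the odd ones, stepping through the array two at a time visits only the content pieces.
splitOnUnicod parts <<<"Iñès."
for ((i = 0; i < ${#parts[@]}; i += 2)); do
    printf 'content[%d]: %s\n' "$i" "${parts[i]}"
done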

Related

How can I interpret a string that contains decimal escape sequences?

I'm trying to parse the "parsable" output of the avahi-browse command for use in a shell script. e.g.
for i in $(avahi-browse -afkpt | awk -F';' '{print $4}') ; do <do something with $i> ; done
The output looks like:
+;br.vlan150;IPv4;Sonos-7828CAC5D944\064Bedroom;_sonos._tcp;local
I am particularly interested in the value of the 4th field, which is a "service name".
With the -p|--parsable flag, avahi-browse escapes the "service name" values.
For example 7828CAC5D944\064Bedroom, where \064 is a zero-padded decimal representation of the ASCII character '@'.
I just want 7828CAC5D944@Bedroom so I can, for example, use it as an argument to another command.
I can't quite figure out how to do this inside the shell.
I tried using printf, but that only seems to interpret octal escape sequences. e.g.:
# \064 is interpreted as 4
$ printf '%b\n' '7828CAC5D944\064Bedroom'
7828CAC5D9444Bedroom
How can I parse these values, converting any of the decimal escape sequences to their corresponding ASCII characters?
Assumptions:
there's a reason the -p flag cannot be removed (will removing -p generate a @ instead of \064?)
the 4th field is to be further processed by stripping off all text up to and including a hyphen (-)
\064 is the only escaped decimal value we need to worry about (for now)
Since OP is already calling awk to process the raw data I propose we do the rest of the processing in the same awk call.
One awk idea:
awk -F';' '
{ n=split($4,arr,"-")            # split field #4 based on the hyphen delimiter
  gsub(/\\064/,"@",arr[n])       # perform the string replacement in the last arr[] entry
  print arr[n]                   # print the newly modified string
}'
# or as a one-liner:
awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}'
Simulating the avahi-browse call feeding into awk:
echo '+;br.vlan150;IPv4;Sonos-7828CAC5D944\064Bedroom;_sonos._tcp;local' |
awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}'
This generates:
7828CAC5D944@Bedroom
And for the main piece of code I'd probably opt for a while loop, especially if there's a chance the avahi-browse/awk process could generate data with white space:
while read -r i
do
    <do something with $i>
done < <(avahi-browse -afkpt | awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}')
Using perl to do the conversion:
$ perl -pe 's/\\(\d+)/chr $1/ge' <<<"7828CAC5D944\064Bedroom"
7828CAC5D944@Bedroom
As part of your larger script, completely replacing awk:
while read -r i; do
# do something with i
done < <(avahi-browse -afkpt | perl -F';' -lane 'print $F[3] =~ s/\\(\d+)/chr $1/ger')
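For completeness, a pure-bash sketch of the same decoding (decode_avahi is a made-up helper name): each \NNN decimal escape is turned into its character by converting the decimal value to octal and handing that to printf.
decode_avahi() {
    local s=$1 oct re='\\([0-9]{3})'
    while [[ $s =~ $re ]]; do
        printf -v oct '%03o' "$((10#${BASH_REMATCH[1]}))"    # decimal -> octal
        s=${s//"\\${BASH_REMATCH[1]}"/$(printf "\\$oct")}    # replace every occurrence of that escape
    done
    printf '%s\n' "$s"
}
decode_avahi 'Sonos-7828CAC5D944\064Bedroom'    # -> Sonos-7828CAC5D944@Bedroom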

Replace hex char by a different random one (awk possible?)

I have a MAC address and I need to replace just one hex char (at a very specific position) with a different random one (it must be different than the original). I have it done this way using xxd and it works:
#!/bin/bash
mac="00:00:00:00:00:00" #This is a PoC mac address obviously :)
different_mac_digit=$(xxd -p -u -l 100 < /dev/urandom | sed "s/${mac:10:1}//g" | head -c 1)
changed_mac=${mac::10}${different_mac_digit}${mac:11:6}
echo "${changed_mac}" #This echo stuff like 00:00:00:0F:00:00
The problem for my script is that xxd means another dependency... I want to avoid it (not all Linux distributions include it by default). I have another workaround using the hexdump command, but then I'm in the same situation... My script already has a mandatory awk dependency, so... can this be done using awk? I need an awk master here :) Thanks.
Something like this may work with seed value from $RANDOM:
mac="00:00:00:00:00:00"
awk -v seed=$RANDOM 'BEGIN{ FS=OFS=":"; srand(seed) } {
    s="0"
    while ((s = sprintf("%x", rand() * 16)) == substr($4, 2, 1));   # empty body: just redraw
    $4 = substr($4, 1, 1) s
} 1' <<< "$mac"
00:00:00:03:00:00
Inside the while loop we keep drawing until the hex digit is not equal to substr($4, 2, 1), which is the 2nd char of the 4th column.
You don't need xxd or hexdump. urandom also generates bytes that match the encodings of the digits and letters used to represent hexadecimal numbers, so you can just use
old="${mac:10:1}"
different_mac_digit=$(tr -dc 0-9A-F < /dev/urandom | tr -d "$old" | head -c1)
Of course, you can replace your whole script with an awk script too. The following GNU awk script will replace the 11th symbol of each line with a random hexadecimal symbol different from the old one. With <<< macaddress we can feed macaddress to its stdin without having to use echo or something like that.
awk 'BEGIN { srand(); pos=11 } {
    old=strtonum("0x" substr($0,pos,1))
    new=(old + 1 + int(rand()*15)) % 16
    print substr($0,1,pos-1) sprintf("%X",new) substr($0,pos+1)
}' <<< 00:00:00:00:00:00
The trick here is to add a random number between 1 and 15 (both inclusive) to the digit to be modified. If we end up with a number greater than 15 we wrap around using the modulo operator % (16 becomes 0, 17 becomes 1, and so on). That way the resulting digit is guaranteed to be different from the old one.
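A quick sanity check of that wrap-around argument (just a sketch): starting from F, an offset of 1..15 taken mod 16 can reach every digit except F itself.
for off in 1 8 15; do
    printf '0xF + %-2d mod 16 -> %X\n' "$off" $(( (15 + off) % 16 ))    # prints 0, 7 and E
done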
However, the same approach would be shorter if written completely in bash.
mac="00:00:00:00:00:00"
old="${mac:10:1}"
(( new=(16#$old + 1 + RANDOM % 15) % 16 ))
printf %s%X%s\\n "${mac::10}" "$new" "${mac:11}"
"One-liner" version:
mac=00:00:00:00:00:00
printf %s%X%s\\n "${mac::10}" "$(((16#${mac:10:1}+1+RANDOM%15)%16))" "${mac:11}"
bash has a printf builtin and a random number generator (if you trust it):
different_mac_digit() {
    new=$1
    while [[ $new = $1 ]]; do
        new=$( printf "%X" $(( RANDOM%16 )) )
    done
    echo $new
}
Invoke with the character to be replaced as argument.
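A usage sketch for the helper above, with the PoC address from the question:
mac="00:00:00:0F:00:00"
printf '%s\n' "${mac::10}$(different_mac_digit "${mac:10:1}")${mac:11}"    # e.g. 00:00:00:0A:00:00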
Another awk:
$ awk -v n=11 -v s=$RANDOM '      # set n to the character position you want to replace
BEGIN { FS=OFS="" }{              # with an empty FS each character is a field (gawk)
    srand(s)
    while((r=sprintf("%x",rand()*16))==$n);   # redraw until it differs from $n
    $n=r
}1' <<< $mac
Output:
00:00:00:07:00:00
or oneliner:
$ awk -v n=11 -v s=$RANDOM 'BEGIN{FS=OFS=""}{srand(s);while((r=sprintf("%x",rand()*16))==$n);$n=r}1' <<< $mac
$ mac="00:00:00:00:00:00"
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*15-1), substr(m,p+1)}'
00:00:00:01:00:00
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*15-1), substr(m,p+1)}'
00:00:00:0D:00:00
And to ensure you get a different digit than you started with:
$ awk -v mac="$mac" -v pos=11 'BEGIN {
srand()
new = old = toupper(substr(mac,pos,1))
while (new==old) {
new = sprintf("%X", int(rand()*15-1))
}
print substr(mac,1,pos-1) new substr(mac,pos+1)
}'
00:00:00:0D:00:00

Replace third column on all rows of a file in shell

I have a file that has about 60 columns of data. The file is also about 80 million records long. I need a bash command to replace the third column with '20190113'. How do we determine which is the third column? The file is delimited by the non-printable character '\001'.
So: replace the third field on all records of a file delimited by the special character '\001' with the value '20190113'.
awk can handle non-printing characters, including \001.
$ cat -v test.in
abc^Axyz^Afoo
def^Awvu^Abar
$ awk '{$3 = "20190113"}1' FS=$'\1' OFS=$'\1' test.in | cat -v
abc^Axyz^A20190113
def^Awvu^A20190113
$'…' is a construction supported by most shells that lets you use escape characters.
^A represents the \001 character; -v tells cat to show it that way instead of emitting the literal non-printing \001 byte.
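To reproduce the sample input above (a sketch): printf can emit the \001 delimiters directly via its octal escapes.
printf 'abc\001xyz\001foo\ndef\001wvu\001bar\n' > test.in
cat -v test.in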
Not as elegant as awk, but here is a method with sed.
a=$(printf "1\0012\0013\0014\0015")
# check
echo "$a" | hexdump -c
b=$(echo "$a" | sed -r 's/([^\x01]*\x01[^\x01]*\x01)[^\x01]*[^x01]/\120190113\x01/')
# check
echo "$b" | hexdump -c
You can use the hex format "\xdd" to specify the delimiters for awk.
Just set the Input and Output delimiters in the BEGIN section.
$ cat -v brian.txt
abc^Axyz^Afoo
def^Awvu^Abar
$ awk ' BEGIN{ FS=OFS="\x01"} { $3="20190113"; print } ' brian.txt
abcxyz20190113
defwvu20190113
$ awk ' BEGIN{ FS=OFS="\x01"} { $3="20190113"; print } ' brian.txt | cat -v
abc^Axyz^A20190113
def^Awvu^A20190113
$
You can also try it with Perl:
$ perl -F"\x01" -lane ' $F[2]="20190113"; print join("\x01",@F) ' brian.txt
abcxyz20190113
defwvu20190113
$ perl -F"\x01" -lane ' $F[2]="20190113"; print join("\x01",@F) ' brian.txt | cat -v
abc^Axyz^A20190113
def^Awvu^A20190113
$
This might work for you (GNU sed):
sed 's/[^[.\d1.]]*/20190113/3' file
This replaces the third occurrence of those characters that do not match \001 with the string 20190113 on every line throughout the file.

Read line by line from a text file and print how I want in shell scripting

I want to read the file below line by line in a shell script and print it in the format shown below.
Text file content:
zero#123456
one#123
two#12345678
I want to print this as:
zero#1-6
one#1-3
two#1-8
I tried the following:
file="readFile.txt"
while IFS= read -r line
do echo "$line"
done <printf '%s\n' "$file"
Create a script like below: my_print.sh
file="readFile.txt"
while IFS= read -r line
do
one=$(echo $line| awk -F'#' '{print $1}') ## This splits the line based on '#' and picks the 1st value. So, we get zero from 'zero#123456 '
len=$(echo $line| awk -F'#' '{print $2}'|wc -c) ## This takes the 2nd value which is 123456 and counts the number of characters
two=$(echo $line| awk -F'#' '{print $2}'| cut -c 1) ## This picks the 1st character from '123456' which is 1
three=$(echo $line| awk -F'#' '{print $2}'| cut -c $((len-1))) ## This picks the last character from '123456' which is 6
echo $one#$two-$three ## This is basically printing the output in the format you wanted 'zero#1-6'
done <"$file"
Run it like:
mayankp@mayank:~/$ sh my_print.sh > output.txt
mayankp@mayank:~/$ cat output.txt
zero#1-6
one#1-3
two#1-8
Let me know if this helps.
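A lighter-weight sketch that stays in one shell process (no awk/cut per line), trimming trailing whitespace in case the file carries some, as another answer below suggests:
while IFS='#' read -r name digits; do
    digits=${digits%"${digits##*[![:space:]]}"}    # strip trailing whitespace
    printf '%s#%s-%s\n' "$name" "${digits:0:1}" "${digits: -1}"
done < readFile.txt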
It's not shell scripting (missed that at first, sorry), but here is perl with a combined lookahead and lookbehind for a digit:
$ perl -pe 's/(?<=[0-9]).*(?=[0-9])/-/' file
zero#1-6
one#1-3
two#1-8
Explained some:
s//-/ replace with a -
(?<=[0-9]) positive lookbehind: preceded by a digit
(?=[0-9]) positive lookahead: followed by a digit
With sed:
sed -r 's/^(.+)#([0-9])[0-9]*([0-9])\s*$/\1#\2-\3/' readFile.txt
-r: use extended regular expressions (just to write some stuff without having to escape it with backslashes)
s/expr1/expr2/: substitute expr1 with expr2
expr1 is described by a regular expression; the relevant matching parts are caught by 3 capturing groups (the parenthesized ones).
expr2 retrieves the captured strings (\1, \2, \3) and inserts them in a formatted output (the one you wanted).
Regular-Expressions.info seems to be an interesting place to start with them. You can also check your own regexp with Regex101.com.
Update: Also you could do that with awk:
awk -F'#' '{
    gsub(/\s*/,"", $2)
    print $1 "#" substr($2, 1, 1) "-" substr($2, length($2), 1)
}' < test.txt
I added a gsub() call because your file seems to have trailing blank characters.

Bash index of first character not given

So basically something like expr index '0123 some string' '012345789' but reversed.
I want to find the index of the first character that is not one of the given characters...
I'd rather not use RegEx, if it is possible...
You can remove the allowed chars with tr and pick the first character of what is left:
left=$(tr -d "012345789" <<< "0123_some string"); echo ${left:0:1}
_
Once you have the char, finding the index follows the same pattern:
expr index "0123_some string" ${left:0:1}
5
Using gnu awk and FPAT you can do this:
str="0123 some string"
awk -v FPAT='[012345789]+' '{print length($1)}' <<< "$str"
4
awk -v FPAT='[02345789]+' '{print length($1)}' <<< "$str"
1
awk -v FPAT='[01345789]+' '{print length($1)}' <<< "$str"
2
awk -v FPAT='[0123 ]+' '{print length($1)}' <<< "$str"
5
I know this is Perl, but I have to say that I like it:
$ perl -pe '$i++while s/^\d//;$_=$i' <<< '0123 some string'
4
If you want a 1-based index you can use $., which is initialized to 1 when dealing with a single line:
$ perl -pe '$.++while s/^\d//;$_=$.' <<< '0123 some string'
5
I'm using \d because I assume that you by mistake left out the number 6 from the list 012345789
Index is currently pointing to the space:
0123 some string
    ^ this space
Even if shell globbing might look similar, it is not a regex.
It could be done in two steps: cut the string, count characters (length).
#!/bin/dash
a="$1" ### string to process
b='0-9' ### range of characters not desired.
c=${a%%[!$b]*} ### cut the string at the first (not) "$b".
echo "${#c}" ### Print the value of the position index (from 0).
It is written to work on many shells (including bash, of course).
Use as:
$ script.sh "0123_some string"
4
$ script.sh "012s3_some string"
3
