Count the number of a special character per line in a Unix shell (bash)

I have a delimited file that is separated by octal \036 or Hexadecimal value 1e.
I need to count the number of delimiters on each line using a bash shell script.
I was trying to use awk, not sure if this is the best way.
Sample Input (| is a representation of \036)
Example|Running|123|
Expected output:
3

awk -F'|' '{print NF-1}' file
Change | to whatever separator you like. If your file can have empty lines then you need to tweak it to:
awk -F'|' '{print (NF ? NF-1 : 0)}' file

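You can try this. gsub returns the number of substitutions it made, which is the number of delimiters on the line:
awk '{print gsub(/\|/,"")}' file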

Simply try
awk -F"|" '{print substr($3,length($3))}' OFS="|" Input_file
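Explanation: -F"|" sets the field separator to |. substr($3,length($3)) then prints the last character of the 3rd field, not the whole column. OFS (the output field separator) is set to |, although it has no effect here because print receives a single argument. Input_file is the file to read. Note that this prints part of a field; it does not count the delimiters.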

This will work as far as I know
echo "Example|Running|123|" | tr -cd '|' | wc -c
Output
3

This should work for you:
awk -F '\036' '{print NF-1}' file
3
-F '\036' sets input field delimiter as octal value 036
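A minimal sketch of the same idea, combined with the empty-line guard from the first answer. The helper name count_rs_delims and the temporary sample file are assumptions for illustration only; -v FS='\036' is used so that awk itself interprets the octal escape.

```bash
#!/usr/bin/env bash
# count_rs_delims is a hypothetical helper name, not part of the original answers.
count_rs_delims() {
  # FS is the single character with octal value 036 (hex 1e).
  # The NF guard prints 0 instead of -1 for empty lines.
  awk -v FS='\036' '{ print (NF ? NF - 1 : 0) }' "$1"
}

# Sample data: one record with three delimiters, followed by an empty line.
sample=$(mktemp)
printf 'Example\036Running\036123\036\n\n' > "$sample"
count_rs_delims "$sample"   # prints 3, then 0
rm -f "$sample"
```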

Awk may not be the best tool for this. GNU grep has a -o option that prints each match on a separate line. You can then count how many matches are printed for each input line, and that is the number of delimiters on that line. E.g. (where ^^ in the file is actually hex 1e)
$ cat -v i
a^^b^^c
d^^e^^f^^g
$ grep -n -o $'\x1e' i | uniq -c
2 1:
3 2:
If you remove the uniq -c you can see how it works: "1:" is printed twice because there are two matches on the first line. Trying it with some regular ASCII characters makes it clearer what the -o and -n options do.
If you want to print the line number followed by the delimiter count for that line, I'd do something like:
$ grep -n -o $'\x1e' i | tr -d ':' | uniq -c | awk '{print $2 " " $1}'
1 2
2 3
This assumes that every line in the file contains at least one delimiter. If that's not the case, here's another approach that's probably faster too:
$ tr -d -c $'\x1e\n' < i | awk '{print length}'
2
3
0
0
0
This uses tr to delete (-d) all characters that are not (-c) 1e or \n. It then pipes that stream of data to awk which just counts how many characters are left on each line. If you want the line number, add " | cat -n" to the end.
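The "delete everything that is not the delimiter, then measure what is left" idea can also be sketched in pure bash, without tr or awk. This is only an illustration: count_delims_bash is a hypothetical name, and the loop will be slower than tr on large files.

```bash
#!/usr/bin/env bash
# count_delims_bash is a hypothetical helper name.
count_delims_bash() {
  local delim=$'\x1e' line only
  # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    only=${line//[^$delim]/}    # delete every character that is not the delimiter
    printf '%s\n' "${#only}"    # what remains is one character per delimiter
  done < "$1"
}

# Sample data: two delimiters, an empty line, three delimiters.
sample=$(mktemp)
printf 'a\036b\036c\n\nd\036e\036f\036g\n' > "$sample"
count_delims_bash "$sample"   # prints 2, 0, 3
rm -f "$sample"
```

Lines without any delimiter print 0, so no line is skipped.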

Related

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get the list of users in my CSV file into a bash variable. The problem is that the number of users varies from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How can I remove all the trailing "," from my result? Sometimes there is just one, but sometimes it can be user1,,,,
Can I do it in a better way? Users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
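To capture that result in a bash variable, as the question asks, one option is to wrap a similar awk program in a function. This is a sketch under assumptions: users_for is a hypothetical name, the record key is passed as an argument, and fields are assumed not to contain embedded commas.

```bash
#!/usr/bin/env bash
# users_for KEY FILE: print fields 4..NF of the row whose 2nd field equals KEY.
# users_for is a hypothetical helper name.
users_for() {
  awk -F',' -v rec="$1" '
    { gsub(/"/, "") }                  # strip the double quotes first
    $2 == rec {                        # exact match on the 2nd field
      out = $4
      for (i = 5; i <= NF; i++) out = out "," $i
      print out                        # no trailing commas are produced
    }' "$2"
}

csv=$(mktemp)
cat > "$csv" <<'EOF'
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
EOF
list_of_users=$(users_for record2_data2 "$csv")
echo "$list_of_users"   # user1,user2,user3,user4
rm -f "$csv"
```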
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The first expression substitutes every " with the empty string, i.e. deletes the quotes. The second applies only to lines containing record2_data: it substitutes (s) everything up to and including the 3rd comma with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
gsub(/"|,$/,"",o);
print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.

Bash + sed/awk/cut to delete nth character

I am trying to delete the 6th, 7th and 8th characters of each line.
Below is the file content.
Actual output:
#cat test
18:40:12,172.16.70.217,UP
18:42:15,172.16.70.218,DOWN
Expecting below, after formatting.
#cat test
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
I also tried the commands below, with no luck:
#awk -F ":" '{print $1":"$2","$3}' test
18:40,12,172.16.70.217,UP
#sed 's/^\(.\{7\}\).\(.*\)/\1\2/' test { Here I can remove only one character }
18:40:1,172.16.70.217,UP
cut also failed:
#cut -d ":" -f1,2,3 test
18:40:12,172.16.70.217,UP
I need to delete the 6th, 7th and 8th characters of each line.
Any suggestions, please?
With GNU cut you can use the --complement switch to remove characters 6 to 8:
cut --complement -c6-8 file
Otherwise, you can just select the rest of the characters yourself:
cut -c1-5,9- file
i.e. characters 1 to 5, then 9 to the end of each line.
With awk you could use substrings:
awk '{ print substr($0, 1, 5) substr($0, 9) }' file
Or you could write a regular expression, but the result will be more complex.
For example, to remove the last three characters from the first comma-separated field:
awk -F, -v OFS=, '{ sub(/...$/, "", $1) } 1' file
Or, using sed with a capture group:
sed -E 's/(.{5}).{3}/\1/' file
Capture the first 5 characters and use them in the replacement, dropping the next 3.
it's a structured text, why count the chars if you can describe them?
$ awk '{sub(":..,",",")}1' file
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
remove the seconds.
The solutions below are generic and assume no knowledge of any format. They just delete character 6,7 and 8 of any line.
sed:
sed 's/.//8;s/.//7;s/.//6' <file> # from high to low
sed 's/.//6;s/.//6;s/.//6' <file> # from low to high (subtract 1)
sed 's/\(.....\).../\1/' <file>
sed 's/\(.\{5\}\).../\1/' <file>
s/BRE/replacement/n :: substitute nth occurrence of BRE with replacement
awk:
awk 'BEGIN{OFS=FS=""}{$6=$7=$8="";print $0}' <file>
awk -F "" '{OFS=$6=$7=$8="";print}' <file>
awk -F "" '{OFS=$6=$7=$8=""}1' <file>
These three commands are equivalent. With an empty field separator FS, GNU awk treats every character as a separate field (POSIX leaves this behaviour unspecified). We empty fields 6, 7 and 8 and print the line rebuilt with an empty output field separator OFS.
cut:
cut -c -5,9- <file>
cut --complement -c 6-8 <file>
Just for fun, perl, where you can assign to a substring
perl -pe 'substr($_,5,3)=""' file
With awk :
echo "18:40:12,172.16.70.217,UP" | awk '{ $0 = ( substr($0,1,5) substr($0,9) ) ; print $0}'
Regards!
If you are running on bash, you can use the string manipulation functionality of it instead of having to call awk, sed, cut or whatever binary:
while IFS= read -r STRING
do
echo "${STRING:0:5}${STRING:8}"
done < myfile.txt
${STRING:0:5} is the first five characters of the string. Offsets are zero-based, so ${STRING:8} starts at the 9th character and runs to the end of the line. This cuts out characters 6, 7 and 8.
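The substring approach generalizes to any character range. The sketch below assumes 1-based, inclusive positions as used in the question; delete_range is a hypothetical helper name. Bash substring offsets are zero-based, so the character at position N sits at offset N-1.

```bash
#!/usr/bin/env bash
# delete_range STRING FROM TO: drop characters FROM..TO (1-based, inclusive).
# delete_range is a hypothetical helper name.
delete_range() {
  local s=$1 from=$2 to=$3
  # ${s:0:from-1} keeps the first from-1 characters;
  # ${s:to} resumes at offset "to", i.e. the character after position TO.
  printf '%s\n' "${s:0:from-1}${s:to}"
}

delete_range '18:40:12,172.16.70.217,UP' 6 8   # 18:40,172.16.70.217,UP
```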

Greping asterisk through bash

I am validating a few columns in a pipe-delimited file. My second column defaults to '*'.
E.g. data of file to be validated:
abc|* |123
def|** |456
ghi|* |789
2nd record has 2 stars due to erroneous data.
I tried it as:
Value_to_match="*"
unmatch_count=$(cat <filename> | cut -d'|' -f2 | awk '{$1=$1};1' | grep -vw "$Value_to_match" | sort -n | uniq | wc -l)
echo "$unmatch_count"
This gives me count as 0 whereas I am expecting 1 (for **) as I have used -w with grep which is exact match and -v which is invert match.
How can I grep **?
The problem here is grep considering ** a regular expression. To prevent this, use -F to use fixed strings:
grep -F '**' file
However, you have an unnecessarily big set of piped operations, while awk alone can handle it quite well.
If you want to check lines containing ** in the second column, say:
$ awk -F"|" '$2 ~ /\*\*/' file
def|** |456
If you want to count how many of such lines you have, say:
$ awk -F"|" '$2 ~ /\*\*/ {sum++} END {print sum}' file
1
Note the usage of awk:
-F"|" to set the field separator to |.
$2 ~ /\*\*/ to say: hey, in every line check if the second field contains two asterisks (remember we sliced lines by |). We are escaping the * because it has a special meaning as a regular expression.
If you want to output those lines that have just one asterisk as second field, say:
$ awk -F"|" '$2 ~ /^\*[[:space:]]*$/' file
abc|* |123
ghi|* |789
Or check for those not matching this regex with !~:
$ awk -F"|" '$2 !~ /^\*[[:space:]]*$/' file
def|** |456
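For the count the question asks for, a pipeline closer to the original attempt is also possible. The sketch below is an assumption-laden illustration (count_unmatched is a hypothetical name): it trims the second column and uses grep -x, which requires the whole line to match, together with -F so that * is taken literally. With -w, a single * still matches inside **, which is why the original count came out as 0.

```bash
#!/usr/bin/env bash
# count_unmatched FILE: count rows whose 2nd column, once trimmed, is not exactly "*".
# count_unmatched is a hypothetical helper name.
count_unmatched() {
  cut -d'|' -f2 "$1" |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |   # trim leading/trailing blanks
    grep -vcxF -- '*'                              # -x whole line, -F literal, -v invert, -c count
}

data=$(mktemp)
printf '%s\n' 'abc|* |123' 'def|** |456' 'ghi|* |789' > "$data"
count_unmatched "$data"   # prints 1
rm -f "$data"
```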

How to truncate trailing space in xargs

I would like to use xargs to list the contents of some files based on the output of command A. The xargs replace-str seems to be adding a space to the end and causing the command to fail. Any suggestions? I know this can be worked around using a for loop, but I am curious to know how to do this using xargs.
lsscsi |awk -F\/ '/ATA/ {print $NF}' | xargs -L 1 -I % cat /sys/block/%/queue/scheduler
cat: /sys/block/sda /queue/scheduler: No such file or directory
The problem is not with xargs -I: it does not append a space to each argument, as you can verify:
$ echo 'sda' | xargs -I % echo '[%]'
[sda]
Incidentally, specifying -L 1 in addition to -I is pointless: -I implies line-by-line processing.
Therefore, it must be the output from the command that provides input to xargs that contains the trailing space.
You can adapt your awk command to fix that:
lsscsi |
awk -F/ '/ATA/ {sub(/ $/,"", $NF); print $NF}' |
xargs -I % cat '/sys/block/%/queue/scheduler'
sub(/ $/,"", $NF) replaces a trailing space in field $NF with the empty string, thereby effectively removing it.
Note how I've (single-)quoted cat's argument so as to make it work even with filenames with spaces.
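The fix can be tried without lsscsi or real block devices. In the sketch below the input line is made up to resemble lsscsi output with a trailing space, scheduler_paths is a hypothetical helper name, and echo stands in for cat so that nothing under /sys is read.

```bash
#!/usr/bin/env bash
# scheduler_paths: read lsscsi-like lines on stdin and print the scheduler path
# for each ATA device. Hypothetical helper; echo replaces cat for a dry run.
scheduler_paths() {
  awk -F/ '/ATA/ { sub(/[[:space:]]+$/, "", $NF); print $NF }' |
    xargs -I % echo '/sys/block/%/queue/scheduler'
}

# Made-up sample line; note the trailing space after "sda".
printf '%s\n' '[0:0:0:0] disk ATA SAMPLE /dev/sda ' | scheduler_paths
# prints /sys/block/sda/queue/scheduler
```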
lsscsi |awk -F\/ '/ATA/ {print $NF}'| awk '{print $NF}' | xargs -L 1 -I % cat /sys/block/%/queue/scheduler
The first awk splits on "/", so everything after the last slash is one field; here that field is "sda " including the trailing space. The second awk uses the default field separator, which splits on whitespace and ignores leading and trailing blanks, so its $NF (the last word on the line) is "sda" without the space. awk '{print $1}' would do the same, because the line holds a single word that is both first and last.

How can I get the index of a given occurrence of a character that is repeated several times in a text line using a shell (bash) script

I have a Text string like below
"/path/to/log/file/LOG_FILE.log.2013-10-02-15:2013-10-02 15:46:57.809 INFO - TTT005|Receive|0000293|N~0000284~YOS~TTT005~ ~000~YC~|YOS TYOS-YCUPDT1-H 20131002154657669284YCARR TTT005 Y0TD04 |1|0150520106050|001|051052020603|003|015030010101502702060510520101|000||000|| "
Here "|" is repeated several times within the string and I need to get the index of 4th occurrence of "|" character using shell-script (BASH) command. I tried to find a way using grep command's options.
Thanks.
Using awk you can do:
awk -F '|' '{print index($0, $5)-1}' file
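This prints the character position of the fourth pipe on each line. It relies on the text of the fifth field being non-empty and not occurring earlier in the line, since index returns the position of the first match.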
grep can print the byte-offset; when used with -o it prints the byte-offset of the matching part.
$ string="/path/to/log/file/LOG_FILE.log.2013-10-02-15:2013-10-02 15:46:57.809 INFO - TTT005|Receive|0000293|N~0000284~YOS~TTT005~ ~000~YC~|YOS TYOS-YCUPDT1-H 20131002154657669284YCARR TTT005 Y0TD04 |1|0150520106050|001|051052020603|003|015030010101502702060510520101|000||000||"
$ grep -ob "[^|]*" <<< "${string}" | sed '5!d' | cut -d: -f1
132
Alternatively, without using grep:
$ newstring=$(echo "${string}" | cut -d\| -f5-)
$ echo $(( ${#string} - ${#newstring} ))
132
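A pure-bash alternative that does not depend on the fields being non-empty is to walk the string and count matches. This is a sketch only: nth_index is a hypothetical name, and it returns a 1-based position, matching the results above.

```bash
#!/usr/bin/env bash
# nth_index STRING CHAR N: print the 1-based position of the N-th CHAR in STRING.
# Returns non-zero if there are fewer than N occurrences. Hypothetical helper.
nth_index() {
  local s=$1 ch=$2 n=$3 i count=0
  for (( i = 0; i < ${#s}; i++ )); do
    if [[ ${s:i:1} == "$ch" ]]; then    # quoted right-hand side: literal comparison
      count=$(( count + 1 ))
      if (( count == n )); then
        echo $(( i + 1 ))               # convert the zero-based offset to a position
        return 0
      fi
    fi
  done
  return 1
}

nth_index 'a|b|c|d|e' '|' 4   # 8
```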
