How to get word count of a part of a line - bash

The lines of the files are as something like this .
<some character> ||| each line. So far i can get the total number of lines and the text for each on its own line ||| <some text>
Now I want to count the no of words in between the |||.
What I intended to do is
awk -F '|||' '{print $2}' word_file | wc -l
but it throws blank in the awk part ,which suggests it is not taking ||| as I want (which is as a delimiter ),interestingly if i use $1 instead of $2 ,it prints the whole text
However if I use ||| (i.e a space before and after) it gives me some output but does not treat the sentence between the two delimeters as one field ,i.e it prints each instead of the whole sentence if I use the following
awk -F ' ||| ' '{print $2}' word_file
How do I achieve this using a bash command
FYI
awk version -GNU Awk 4.0.1

Awk's -F option, which sets FS, the input-field separator, expects a regular expression as its value.
Thus, for ||| to be interpreted as a literal, you must \-escape the | chars, which are metacharacters in a regex context.
Given that Awk also accepts \-based escape sequences in string literals, you must double the \ instances:
awk -F '\\|\\|\\|' ...
To properly count the words (defined as whitespace-separated tokens) in field 2, you can try this:
awk -F '\\|\\|\\|' 'BEGIN { orgFs=FS } { FS=" "; $0 = $2; print NF; FS=orgFS }' word_file
This splits each input line into fields by literal |||.
By temporarily setting FS to a single space - which is a magic value that tells Awk to split into fields by any nonempty run of whitespace - we can assign $2, the value of field 2, to $0, the whole input line, which causes the new value of $0 to be split into fields again.
At that point NF reflects the number of fields in what was originally the 2nd field - i.e., the number of words - and we can print that.
Restoring FS to its original value then prepares for parsing the next input line.

with gawk multi-char RS support, this might be easier
$ awk -v RS="\\\|\\\|\\\|" 'NR==2{print NF}' file
or if not sure how to escape the pipe, perhaps cleaner with
$ awk -v RS='[|]{3}' ...

Related

How do i get the value present in first double quotes?

I'm currently writing a bash script to get the first value among the many comma separated strings.
I have a file that looks like this -
name
things: "water bottle","40","new phone cover",10
place
I just need to return the value in first double quotes.
water bottle
The value in first double quotes can be one word/two words. That is, water bottle can be sometimes replaced with pen.
I tried -
awk '/:/ {print $2}'
But this just gives
water
I wanted to comma separate it, but there's colon(:) after things. So, I'm not sure how to separate it.
How do i get the value present in first double quotes?
EDIT:
SOLUTION:
I used the below code since I particularly wanted to use awk -
awk '/:/' test.txt | cut -d\" -f2
A solution using the cut utility could be
cut -d\" -f2 infile > outfile
Using gnu awk you could make use of a capture group, and use a negated character class to not cross the , as that is the field delimiter.
awk 'match($0, /^[^",:]*:[^",]*"([^"]*)"/, a) {print a[1]}' file
Output
water bottle
The pattern matches
^ Start of string
[^",:]*:Optionally match any value except " and , and :, then match :
[^",]* Optionally match any value except " and ,
"([^"]*)" Capture in group 1 the value between double quotes
If the value is always between double quotes, a short option to get the desired result could be setting the field separator to " and check if group 1 contains a colon, although technically you can also get water bottle if there is only a leading double quote and not closing one.
awk -F'"' '$1 ~ /:/ {print $2}' file
With your shown samples, please try following awk code.
awk '/^things:/ && match($0,/"[^"]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
Explanation: In awk program checking if line starts with things: AND using match function to match everything between 1st and 2nd " and printing them accordingly.
Solution 1: awk
You can use a single awk command:
awk -F\" 'index($1, ":"){print $2}' test.txt > outfile
See the online demo.
The -F\" sets the field separator to a " char, index($1, ":") condition makes sure Field 1 contains a : char (no regex needed) and then {print $2} prints the second field value.
Solution 2: awk + cut
You can use awk + cut:
awk '/:/' test.txt | cut -d\" -f2 > outfile
With awk '/:/' test.txt, you will extract line(s) containing : char, and then the piped cut -d\" -f2 command will split the string with " as a separator and return the second item. See the online demo.
Solution 3: sed
Alternatively, you can use sed:
sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' file > outfile
See the online demo:
#!/bin/bash
s='name
things: "water bottle","40","new phone cover",10
place'
sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' <<< "$s"
# => water bottle
The command means
-n - the option suppresses the default line output
^[^"]*"\([^"]*\)".* - a POSIX BRE regex pattern that matches
^ - start of string
[^"]* - zero or more chars other than "
" - a " char
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
".* - a " char and the rest of the string.
\1 replaces the match with Group 1 value
p - only prints the result of a successful substitution.

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, on their second column, contain a word from a comma separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, neither will "abcdef". It seems so simple yet after trying for manymany hours, I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here:
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the output of fields ignoring the escaped commas. Here those will be fields 2 or 3 like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since first gsub is performed on the whole record (i.e $0), awk is forced to recompute fields. But the second one is performed on only second field (i.e $2), so it will not affect other fields. See: Changing Fields.
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences by another character that won't appear in your text, split the text around the remaining commas then replace the chosen character by commas :
sed $'s/\\\,/\31/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\31' ','
In this case using the ASCII control char "Unit Separator" \31 which I'm pretty sure your input won't contain.
You can try it here.
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[#]}"
echo "list_of_sports : ${list_of_sports[#]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will be most probably slower then solution using awk.
Note that the principle of operation is the same as in other answers - substitute \, string by some other unique character and then use that character to iterate over the second and third field elemetns.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace quoted commas by newlines and then revert newlines to commas and commas to newlines. Remove all lines that do not contain a comma. Delete empty lines.
Using Perl. Change the \, to some control char say \x01 and then replace it again with ,
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(#F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
$col=$col FS $(col+1)
for (i=col+1; i<NF; i++) {
$i=$(i+1)
}
NF--
}
This might get used thusly:
{
for (col=1; col<=NF; col++) {
if ($col ~ /\\$/) {
joincol(col)
}
}
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gnusub to replace the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Decrease string length in pattern

I have a string like so:
text1;text2;text3;text4;text5;text6
and i need to decrease length of string after 3rd occurrence ; so in that case
text
, to for example 4 characters.
so far i did:
echo "$(cat pp.txt | awk -F ";" '{print $4}' | sed 's/^\(...\).*/\1/;q' )"
But the output is "text". What i need is:
Text1;text2;text3;text;text5;text6
Please help
You could do it all with awk like
awk -F\; 'BEGIN {OFS=FS} {$4=substr($4, 0, 4); print}' pp.txt
You can modify the fields in awk, which we do with $4=substr($4, 0, 4). That will take the substring of $4 from the 0th to the 4th characters, then store that back into the forth field. Then we just print the line with the updated value.
Also, set the output field separator to be the same as the one we specify on the command line so we don't change that when printing it.

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
or here's a solution using awk
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
awk uses -F (or FS for FieldSeparator) and you must use OFS for OutputFieldSeparator) .
sub(/srchtarget/, "replacmentstring", stringToFix) is uses a regular expression to look for 4 0s at the front of (^) the third field ($3).
The 1 is a shorthand for the print statement. A longhand version of the script would be
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
Its all related to awk's /pattern/{action} idiom.
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data |sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin():
print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awks substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just take 6 characters from 5 position in field 3 and set it back to field 3

Resources