Error calling system() within awk - bash

I'm trying to execute a system command to find out how many unique references a csv file has in its first seven characters as part of a larger awk script that processes the same csv file. There are duplicate entries and I don't want awk to parse the whole file twice so I'm avoiding NR. The gist of this part of the script is:
#!/bin/bash
awk '
{
#do some stuff, then when finished, count the number of unique references
productFile="BusinessObjects.csv";
systemCall = sprintf( "cat %s | cut -c 1-7 | sort | uniq | wc -l", $productFile );
productCount=`system( systemCall )`-1; #subtract 1 to remove column label row
}' < BusinessObjects.csv
And the interpreter doesn't like it:
awk: cmd. line:19: ^ syntax error ./awkscript.sh: line 38: syntax error near unexpected token '('
./awkscript.sh: line 38: systemCall = sprintf( "cat %s | cut -c 1-7 | sort | uniq | wc -l", $productFile );
If I hard-code the system command
productCount=`system( "cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l" )`-1;
I get:
./awkscript.sh: command substitution: line 39: syntax error near unexpected token '"cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l"'
./awkscript.sh: command substitution: line 39: 'system( "cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l" )'
Technically, I could do this outside of awk at the start of the shell script, store the result in a system variable, and then pass it to awk using -v, but it's not great for the readability of the awk script (it's a few hundred lines long). Do I have a space or quotes in the wrong place? I've tried fiddling, but I can't seem to present the call to system() in a way that the interpreter will accept. Finally, is there a more sensible way to do this?
Edit: the csv file is indeed semicolon-delimited, so it's best to cut using the delimiter rather than the number of chars (thanks!).
ProductRef;Data1;Data2;etc
1234567;etc;etc;etc
Edit 2:
I'm trying to parse a csv file whose first column is full of N unique product references, and create a series of associated HTML pages that include a "Page n of N" information field. It's (painfully obviously) the first time I've used awk, but it seemed like an appropriate tool for parsing csv files. I'm trying to hence count and return the number of unique references. At the shell
cut -d\; -f1 BusinessObjects.csv | sort | uniq | wc -l
works fine, but I can't get it working inside awk by doing
#!/bin/bash
if [ -n "$1" ]
then
productFile=$1
else
echo "Missing product file argument."
exit
fi
awk -v productFile=$productFile '
BEGIN {
FS=";";
productCount = 0;
("cut -d\"\;\" -f1 " productFile " | sort | uniq | wc -l") | getline productCount;
productCount -=1; #remove the column label row
}
{
print productCount;
}'
I get a syntax error on the cut code if I don't wrap the semicolon in \"\;\" and the script just hangs without printing anything when I do.

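As far as I know, awk has no backtick command substitution; that is shell syntax. Also, system() returns the command's exit status, not its output, so it can't be used to capture the count.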
productCount=`system( systemCall )`-1; #subtract 1 to remove column label row
You can read your output by not using system and running your command directly, and using getline instead:
systemCall | getline productCount
productCount -= 1
Or more completely
productFile = "BusinessObjects.csv"
systemCall = "cut -c 1-7 " productFile " | sort | uniq | wc -l"
systemCall | getline productCount
productCount -= 1
No need to use sprintf and include cat.
Assigning strings to variables is also optional. You can just have "xyz" | getline ....
sort | uniq can just be sort -u if supported.
Quoting may be necessary if filename has spaces or characters that may confuse the command.
getline may alter global variables differently from expected. See https://www.gnu.org/software/gawk/manual/html_node/Getline.html.
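A minimal sketch of the getline approach applied to the semicolon-delimited file from the question. The sample rows, the `seen` array and the "Page n of N" output line are illustrative assumptions, not the asker's real script. Two details are worth noting: the data file is passed to awk as an argument (the hang described in the question is consistent with awk waiting on standard input because no file was given), and `\047` is the octal escape for a single quote, which avoids fighting the shell's quoting inside the awk program.

```shell
#!/bin/bash
# Hypothetical sample data standing in for BusinessObjects.csv
productFile=$(mktemp)
printf '%s\n' 'ProductRef;Data1;Data2' '1234567;a;b' '1234567;c;d' '7654321;e;f' > "$productFile"

result=$(awk -F';' -v productFile="$productFile" '
BEGIN {
    # \047 is a single quote: it protects the semicolon and the file name from the shell
    cmd = "cut -d\047;\047 -f1 \047" productFile "\047 | sort -u | wc -l"
    cmd | getline productCount   # read the first line of the command output
    close(cmd)                   # release the pipe once the value has been read
    productCount -= 1            # remove the column label row
}
# Skip the header; act once per unique reference
FNR > 1 && !seen[$1]++ { printf "Page %d of %d\n", ++n, productCount }
' "$productFile")

printf '%s\n' "$result"
rm -f "$productFile"
```

The file is read twice here (once by cut, once by awk), which is the same trade-off as in the original getline suggestion.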

Could something like this be an option?
$ cat productCount.sh
#!/bin/bash
if [ -n "$1" ]
then
productCount=`cat $1 | cut -c 1-7 | sort | uniq | wc -l`
echo $productCount
else
echo "please supply a filename as parameter"
fi
$ ./productCount.sh BusinessObjects.csv
9
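If the count is computed in the shell like this, it can be handed to awk with -v, as the question already mentions. The following is only a sketch under assumptions: the sample rows are invented, the delimiter-based cut follows the question's edit, and `tail -n +2` is used to drop the header row instead of subtracting 1 afterwards.

```shell
#!/bin/bash
# Hypothetical sample data standing in for BusinessObjects.csv
productFile=$(mktemp)
printf '%s\n' 'ProductRef;Data1;Data2' '1234567;a;b' '1234567;c;d' '7654321;e;f' > "$productFile"

# Count unique first fields, skipping the header row
productCount=$(tail -n +2 "$productFile" | cut -d';' -f1 | sort -u | wc -l)

# Pass the shell value into awk; %d also tolerates the padding some wc versions add
summary=$(awk -v productCount="$productCount" \
    'BEGIN { printf "Total unique references: %d\n", productCount }')

echo "$summary"
rm -f "$productFile"
```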

Related

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or even shorter, as pointed out in this cracking comment by @cyrus
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses a bracket expression as the field separator, so awk splits on both | and , and the URL is simply $2. This is the more idiomatic awk.
Note: See the awk manual for more info.
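To make the `!a[$2]++` idiom easier to follow, here is a small sketch on shortened, made-up log lines (the paths and example.com URLs are assumptions, not the asker's data). The expression is true only the first time a given URL is seen, because the post-increment returns the old count, which is 0 on first sight.

```shell
#!/bin/bash
# Hypothetical, shortened log lines in the same "path|url,n,m" shape
log=$(mktemp)
cat > "$log" <<'EOF'
/path/a|https://example.com/one.ppt,18,17
/path/b|https://example.com/one.ppt,15,14
/path/c|https://example.com/two.pptx,17,16
EOF

unique=$(awk -F '[|,]' '
    !seen[$2]++ { count++ }   # runs only on the first occurrence of each URL
    END { print count + 0 }   # + 0 prints 0 instead of an empty line for empty input
' "$log")

echo "Unique URLs: $unique"
rm -f "$log"
```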
Parse with sed, and since file appears to be already sorted,
(with respect to URLs), just run uniq, and count it:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Is it possible to set variable in pipeline?

I have a big txt file which I want to process in a pipeline. At one point in that pipeline I also want to store the number of lines in the variable $nol. I just want to see the syntax for setting a variable inside a pipeline, something like:
cat ${!#} | tr ' ' '\n'| grep . ; $nol=wc -l | sort | uniq -c ...
The part after the second pipe is obviously wrong, but how can I do it in bash?
One solution is:
nol=$(cat ${!#} | tr ' ' '\n'| grep . | wc -l)
and then run the whole pipeline again from the start,
but I don't want to run the same thing twice, because I have more pipes than shown here.
I mustn't use awk or sed...
You can use tee to write a copy to a file which you read later:
tempfile="xyz"
tr ' ' '\n' < "${!#}" | grep '.' | tee "$tempfile" | sort | uniq -c ...
nol=$(wc -l < "$tempfile")
Or you can do it the other way around:
nol=$(tr ' ' '\n' < "${!#}" | grep '.' \
| tee >(sort | uniq -c ... > /dev/tty) | wc -l)
You can set a variable in a particular link of a pipeline, but that's not very useful since only that particular link will be affected by it.
I recommend simply using a temporary file.
set -e
trap 'rm -f "$tmpf"' EXIT
tmpf=`mktemp`
cat ${!#} | tr ' ' '\n'| grep . | sort > "$tmpf"
nol=$(wc -l < "$tmpf")
< "$tmpf" uniq -c ...
You can avoid the temporary file with tee and a named pipe, but it probably won't perform much better (it may even perform worse).
UPDATE:
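Took a minute, but this is the closest I got. Note that nol is assigned inside the process substitution, which runs in a subshell, so the value is not visible in the parent shell afterwards:
cat ${!#} | tr ' ' '\n'| tee >(nol=$(wc -l)) | sort | uniq -c ...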
PREVIOUS:
The only way I can think to do this is storing in variables and calling back. You would not execute the command more than one time. You would just store the output in variables along the way.
aCommand=($(cat ${!#} | tr ' ' '\n'));sLineCount=${#aCommand[@]};printf '%s\n' "${aCommand[@]}" | sort | uniq -c ...
aCommand will store the results of the first set of commands in an array
sLineCount will count the elements (lines) in the array
;... printf writes the array elements one per line, and the remaining commands continue from there.
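The same "store it once, reuse it" idea can be sketched with a plain string variable instead of an array, which avoids word-splitting surprises. The input file and its contents are assumptions for illustration, and `data` is a hypothetical variable name. One caveat: with completely empty input this reports 1 line, because printf still emits a newline.

```shell
#!/bin/bash
# Hypothetical input standing in for the file named by the last argument
input=$(mktemp)
printf '%s\n' 'to be or not' 'to be' > "$input"

# Run the first half of the pipeline once and keep its output
data=$(tr ' ' '\n' < "$input" | grep .)

# Count the lines; tr strips the padding some wc implementations add
nol=$(printf '%s\n' "$data" | wc -l | tr -d ' ')

# Continue the pipeline from the stored output
counts=$(printf '%s\n' "$data" | sort | uniq -c)

echo "lines: $nol"
printf '%s\n' "$counts"
rm -f "$input"
```

This holds the whole intermediate output in memory, so for a very large file the temporary-file variants are likely the safer choice.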
Looks to me like you're asking how to avoid stepping through your file twice, just to get both word and line count.
Bash lets you read variables, and wc can produce all the numbers you need at once.
NAME
wc -- word, line, character, and byte count
So to start...
read lines words chars < <( wc < "${!#}" )
This populates the three variables based on input generated from process substitution.
But your question includes another partial command line which I think you intend as:
nol=$( sort -u ${!#} | wc -l )
This is markedly different from the word count of your first command line, so you can't use a single wc instance to generate both. Instead, one option might be to put your functionality into a script that does both functions at once:
read words uniques < <(
awk '
{
words += NF
for (i=1; i<=NF; i++) { unique[$i] }
}
END {
print words,length(unique)
}
' ${!#}
)

Print a file in one single row ksh

I have the file DATA, and within it there is:
Name | Karlstrom|
Description | New_Server|
Type | UNIX OS|
Formula | y=kx+j |
Severity | Critical|
I need to know how to display the data like this:
Name| Karlstrom|Description| New_Server|Type UNIX OS|Formula| y=kx+j|Severity| Critical|
USING KORN SHELL | KSH
The requirements do not explain all cases, but the following code will handle your example input:
sed -e 's/ *|/|/' DATA | tr -d "\n"; echo
# Output:
Name| Karlstrom|Description| New_Server|Type| UNIX OS|Formula| y=kx+j |Severity| Critical|
I added an echo after the command, so that the command prompt will be on the next line.
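For comparison, the same transformation can be sketched in a single awk call, assuming awk is available alongside ksh. `sub` rewrites only the first run of spaces before a pipe on each line, which mirrors the sed expression above, and `printf` without a newline joins the lines. The temporary file is just a stand-in for DATA.

```shell
# Stand-in for the DATA file from the question
data=$(mktemp)
cat > "$data" <<'EOF'
Name | Karlstrom|
Description | New_Server|
Type | UNIX OS|
Formula | y=kx+j |
Severity | Critical|
EOF

# Strip the spaces before the first pipe, print without newlines, end with one newline
row=$(awk '{ sub(/ *\|/, "|"); printf "%s", $0 } END { print "" }' "$data")

echo "$row"
rm -f "$data"
```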
I don't think that shell matters here. Do you have cat, awk and sed utilities? You can do this, for example:
cat DATA | awk 'BEGIN {s=""} {s=s$0} END {print s}' | sed 's/ *|/|/g'

How do I pipe commands inside for loop in bash

I am writing a bash script to iterate through file lines with given value.
The command I am using to list the possible values is:
cat file.csv | cut -d';' -f2 | sort | uniq | head
When I use it in for loop like this it stops working:
for i in $( cat file.csv | cut -d';' -f2 | sort | uniq | head )
do
//do something else with these lines
done
How can I use piped commands in for loop?
You can use this awk command to get the sum of the 3rd column for each unique value of the 2nd column:
awk -F ';' '{sums[$2]+=$3} END{for (i in sums) print i ":", sums[i]}' file.csv
Input data:
asd;foo;0
asd;foo;2
asd;bar;1
asd;foo;4
Output:
foo: 6
bar: 1
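If a shell loop over the unique values is still wanted, one common pattern is to pipe the command into `while read` rather than expanding it inside `for`. This is only a sketch: the per-value action (counting matching rows) is an invented placeholder for "do something else with these lines". Also note that comments in bash start with `#`, not `//`. Because the loop sits at the end of a pipeline, variables set inside it are not visible afterwards in bash, so the sketch captures the loop's output instead.

```shell
#!/bin/bash
# Sample data taken from the answer above
csv=$(mktemp)
printf '%s\n' 'asd;foo;0' 'asd;foo;2' 'asd;bar;1' 'asd;foo;4' > "$csv"

report=$(
    cut -d';' -f2 "$csv" | sort -u | head |
    while IFS= read -r value; do
        # Placeholder action: count the rows whose second field equals this value
        n=$(awk -F';' -v v="$value" '$2 == v { c++ } END { print c + 0 }' "$csv")
        echo "$value: $n"
    done
)

printf '%s\n' "$report"
rm -f "$csv"
```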

Bash: creating a pipeline to list top 100 words

Ok, so I need to create a command that lists the 100 most frequent words in any given file, in a block of text.
What I have at the moment:
$ alias words='tr " " "\012" <hamlet.txt | sort -n | uniq -c | sort -r | head -n 10'
outputs
$ words
14 the
14 of
8 to
7 and
5 To
5 The
5 And
5 a
4 we
4 that
I need it to output in the following format:
the of to and To The And a we that
((On that note, how would I tell it to print the output in all caps?))
And I need to change it so that I can pipe 'words' to any file, so instead of having the file specified within the pipe, the initial input would name the file & the pipe would do the rest.
Okay, taking your points one by one, though not necessarily in order.
You can change words to use standard input just by removing the <hamlet.txt bit since tr will take its input from standard input by default. Then, if you want to process a specific file, use:
cat hamlet.txt | words
or:
words <hamlet.txt
You can remove the effects of capital letters by making the first part of the pipeline:
tr '[A-Z]' '[a-z]'
which will lower-case your input before doing anything else.
Lastly, if you take that entire pipeline (with the suggested modifications above) and then pass it through a few more commands:
| awk '{printf "%s ", $2}END{print ""}'
This prints the second field of each line (the word) followed by a space, then prints an empty string with a terminating newline at the end.
For example, the following script words.sh will give you what you need:
tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf "%s ", $2}END{print ""}'
(on one line: I've split it for readability) as per the following transcript:
pax> echo One Two two Three three three Four four four four | ./words.sh
four three two
You can achieve the same end with the following alias:
alias words="tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf \"%s \", \$2}END{print \"\"}'"
(again, one line) but, when things get this complex, I prefer a script, if only to avoid interminable escape characters :-)
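A shell function is another way to sidestep the escaping. The sketch below is one possible variant, not the script above: `top_words` is a hypothetical name, the word limit is taken as an optional argument defaulting to 100, and it prints in capitals via awk's `toupper` to cover the all-caps side question. It also sorts the counts numerically and breaks ties alphabetically, which is an assumption about the desired ordering.

```shell
#!/bin/bash
# top_words [N]: read text on standard input, print the N most frequent words (default 100)
top_words() {
    tr '[:upper:]' '[:lower:]' |            # fold case so "The" and "the" count together
        tr -s '[:space:]' '\n' |            # one word per line
        grep . |                            # drop any empty line
        sort | uniq -c |                    # count occurrences
        sort -k1,1nr -k2 |                  # highest count first, ties alphabetical
        head -n "${1:-100}" |
        awk '{ printf "%s%s", (NR > 1 ? " " : ""), toupper($2) } END { print "" }'
}

out=$(echo One Two two Three three three Four four four four | top_words 3)
echo "$out"
```

Usage would then be `top_words < hamlet.txt` or `cat hamlet.txt | top_words 10`.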
