Is it possible to set variable in pipeline? - bash

I have a big txt file which I want to edit in pipeline. But on same place in pipeline I want to set number of lines in variable $nol. I just want to see sintax how could I set variable in pipeline like:
cat ${!#} | tr ' ' '\n'| grep . ; $nol=wc -l | sort | uniq -c ...
That after second pipe is very wrong, but how can I do it in bash?
One of solutions is:
nol=$(cat ${!#} | tr ' ' '\n'| grep . | wc -l)
pipeline all from the start again
but I don't want to do script the same thing twice, bec I have more pipes then here.
I musn't use awk or sed...

You can use a tee and then write it to a file which you use later:
tempfile="xyz"
tr ' ' '\n' < "${!#}" | grep '.' | tee > "$tempfile" | sort | uniq -c ...
nol=$(wc -l "$tempfile")
Or you can use it the other way around:
nol=$(tr ' ' '\n' < "${!#}" | grep '.' \
| tee >(sort | uniq -c ... > /dev/tty) | wc -l

You can set a variable in a particular link of a pipeline, but that's not very useful since only that particular link will be affected by it.
I recommend simply using a temporary file.
set -e
trap 'rm -f "$tmpf"' EXIT
tmpf=`mktemp`
cat ${!#} | tr ' ' '\n'| grep . | sort > "$tmpf"
nol="$(wc "$tmpf")"
< "$tmpf" uniq -c ...
You can avoid the temporary file with tee and a named pipe, but it probably won't perform much better (it may even perform worse).

UPDATE:
Took a minute but I got it...
cat ${!#} | tr ' ' '\n'| tee >(nol=$(wc -l)) | sort | uniq -c ...
PREVIOUS:
The only way I can think to do this is storing in variables and calling back. You would not execute the command more than one time. You would just store the output in variables along the way.
aCommand=($(cat ${!#} | tr ' ' '\n'));sLineCount=$(echo ${#aCommand[#]});echo ${aCommand[#]} | sort | uniq -c ...
aCommand will store the results of the first set of commands in an array
sLineCount will count the elements (lines) in the array
;... echo the array elements and continue the commands from there.

Looks to me like you're asking how to avoid stepping through your file twice, just to get both word and line count.
Bash lets you read variables, and wc can produce all the numbers you need at once.
NAME
wc -- word, line, character, and byte count
So to start...
read words line chars < <( wc < ${!#} )
This populates the three variables based on input generated from process substitution.
But your question includes another partial command line which I think you intend as:
nol=$( sort -u ${!#} | wc -l )
This is markedly different from the word count of your first command line, so you can't use a single wc instance to generate both. Instead, one option might be to put your functionality into a script that does both functions at once:
read words uniques < <(
awk '
{
words += NF
for (i=1; i<=NF; i++) { unique[$i] }
}
END {
print words,length(unique)
}
' ${!#}
)

Related

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or even short as pointed out in this cracker comment by #cyrus
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses awk multiple field separator functionality with more idiomatic awk.
Note: See the [ awk manual ] for more info.
Parse with sed, and since file appears to be already sorted,
(with respect to URLs), just run uniq, and count it:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Find unique words

Suppose there is one file.txt in which below content text is written:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that can extract the first unique word before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
Try this :
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
Another easy way:
cut -d"/" -f1 file.txt|uniq > out.txt
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The first line grabs any string until a slash / and outputs it into newfile.txt.
The second line sorts the text, removing any duplicate strings you might have.

Bash: Formatting multi-line function results alongside each other

I have three functions that digest an access.log file on my server.
hitsbyip() {
cat $ACCESSLOG | awk '{ print $1 }' | uniq -c | sort -nk1 | uniq
}
hitsbyhour() {
cat $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
}
hitsbymin() {
hr=$1
grep "2015:${hr}" $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c
}
They all work fine when used on their own. All three output 2 small colums of data.
Now I am looking to create another function called report which will simply use printf and its formatting possibilities to print 3 columns of data with header, each of them the result of my three first functions. Something like that:
report() {
printf "%-30b %-30b %-30b\n" `hitsbyip` `hitsbyhour` `hitsbymin 10`
}
The thing is that the format is not what i want; it prints out the columns horizontaly instead of side by side.
Any help would be greatly appreciated.
Once you use paste to combine the output of the three commands into a single stream, then you can operate line-by-line to format those outputs.
while IFS=$'\t' read -r by_ip by_hour by_min; do
printf '%-30b %-30b %-30b\n' "$by_ip" "$by_hour" "$by_min"
done < <(paste <(hitsbyip) <(hitsbyhour) <(hitsbymin 10))
Elements to note:
<() syntax is process substitution; it generates a filename (typically of the form /dev/fd/## on platforms with such support) which will, when read, yield the output of the command given.
paste takes a series of filenames and puts the output of each alongside the others.
Setting IFS=$'\t' while reading ensures that we read content as tab-separated values (the format paste creates). See BashFAQ #1 for details on using read.
Putting quotes on the arguments to printf ensures that we pass each value assembled by read as a single value to printf, rather than letting them be subject to string-splitting and glob-expansion as unquoted values.

Error calling system() within awk

I'm trying to execute a system command to find out how many unique references a csv file has in its first seven characters as part of a larger awk script that processes the same csv file. There are duplicate entries and I don't want awk to parse the whole file twice so I'm avoiding NR. The gist of this part of the script is:
#!/bin/bash
awk '
{
#do some stuff, then when finished, count the number of unique references
productFile="BusinessObjects.csv";
systemCall = sprintf( "cat %s | cut -c 1-7 | sort | uniq | wc -l", $productFile );
productCount=`system( systemCall )`-1; #subtract 1 to remove column label row
}' < BusinessObjects.csv
And the interpreter doesn't like it:
awk: cmd. line:19: ^ syntax error ./awkscript.sh: line 38: syntax error near unexpected token '('
./awkscript.sh: line 38: systemCall = sprintf( "cat %s | cut -c 1-7 | sort | uniq | wc -l", $productFile );
If I hard-code the system command
productCount=`system( "cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l" )`-1;
I get:
./awkscript.sh: command substitution: line 39: syntax error near unexpected token '"cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l"'
./awkscript.sh: command substitution: line 39: 'system( "cat BusinessObjects.csv | cut -c 1-7 | sort | uniq | wc -l" )'
Technically, I could do this outside of awk at the start of the shell script, store the result in a system variable, and then pass it to awk using -v, but it's not great for the readability of the awk script (it's a few hundred lines long). Do I have a space or quotes in the wrong place? I've tried fiddling, but I can't seem to present the call to system() in a way that the interpreter will accept. Finally, is there a more sensible way to do this?
Edit: the csv file is indeed semicolon-delimited, so it's best to cut using the delimiter rather than the number of chars (thanks!).
ProductRef;Data1;Data2;etc
1234567;etc;etc;etc
Edit 2:
I'm trying to parse a csv file whose first column is full of N unique product references, and create a series of associated HTML pages that include a "Page n of N" information field. It's (painfully obviously) the first time I've used awk, but it seemed like an appropriate tool for parsing csv files. I'm trying to hence count and return the number of unique references. At the shell
cut -d\; -f1 BusinessObjects.csv | sort | uniq | wc -l
works fine, but I can't get it working inside awk by doing
#!/bin/bash
if [ -n "$1" ]
then
productFile=$1
else
echo "Missing product file argument."
exit
fi
awk -v productFile=$productFile '
BEGIN {
FS=";";
productCount = 0;
("cut -d\"\;\" -f1 " productFile " | sort | uniq | wc -l") | getline productCount;
productCount -=1; #remove the column label row
}
{
print productCount;
}'
I get a syntax error on the cut code if I don't wrap the semicolon in \"\;\" and the script just hangs without printing anything when I do.
I don't remember that you can use backticks in awk.
productCount=`system( systemCall )`-1; #subtract 1 to remove column label row
You can read your output by not using system and running your command directly, and using getline instead:
systemCall | getline productCount
productCount -= 1
Or more completely
productFile = "BusinessObjects.csv"
systemCall = "cut -c 1-7 " productFile " | sort | uniq | wc -l"
systemCall | getline productCount
productCount -= 1
No need to use sprintf and include cat.
Assigning strings to variables is also optional. You can just have "xyz" | getline ....
sort | uniq can just be sort -u if supported.
Quoting may be necessary if filename has spaces or characters that may confuse the command.
getline may alter global variables differently from expected. See https://www.gnu.org/software/gawk/manual/html_node/Getline.html.
Could something like this be an option?
$ cat productCount.sh
#!/bin/bash
if [ -n "$1" ]
then
productCount=`cat $1 | cut -c 1-7 | sort | uniq | wc -l`
echo $productCount
else
echo "please supply a filename as parameter"
fi
$ ./productCount.sh BusinessObjects.csv
9

Connecting Wget and Sed Commands in One Script?

I use 3 commands (wget/sed/and a tr/sort) that all work in command line to produce a most-common words list. I use commands sequentially, saving output from sed to use in the tr/sort command. Now I need to graduate to writing a script that combines these 3 commands. So, 1) wget downloads a file, that I put into 2) sed -e 's/<[^>]*>//g' wget-file.txt, and that output > goes to 3)
cat sed-output.txt | tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c |
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
I'm aware of the problem/debate about using regex to remove HTML tags, but these 3 commands are working for me for the moment. So thanks in helping pull this together.
Using awk.
wget -O- http://down.load/file| awk '{ gsub(/<[^>]*>/,"") # remove the content in label <>
$0=tolower($0) # convert all to lowercase
gsub(/[^a-z]]*/," ") # remove all non-letter chars and replaced by space
for (i=1;i<=NF;i++) a[$i]++ # save each word in array a, and sum it.
}END{for (i in a) print a[i],i|"sort -nr|head -100"}' # print the result, sort it, and get the top 100 records only
This command should do the job:
wget -O- http://down.load/file | sed -e 's/<[^>]*>//g' | \
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | \
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt

Resources