Is it possible to append a stdout to another stdout? - shell

I'm trying to run a command like:
gunzip -dc file.gz | tail +5c
So this will output the binary file contents minus the first 4 bytes to stdout, and it works. Now I need to append 3 extra bytes to the end of the stream, but only using stdout, never a file.
Imagine the file contains:
1234567890
With the current command, I get:
567890
But I need:
567890000
So... any idea?

Try this:
{ gunzip -dc file.gz | tail -c +5 | tr -d '\n'; echo 000; }

Ok, so based on the answers, the final solution was:
{ gzcat file.gz | tail -c +5; echo 000; }
I didn't need to, and actually shouldn't, use the tr -d '\n', as it will remove the newlines in the middle of the file.
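Note that echo appends a trailing newline, so this actually adds 4 bytes to the stream; if exactly the 3 bytes 000 are needed, printf is a drop-in replacement (same command, only the last step changed):
{ gzcat file.gz | tail -c +5; printf '000'; }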

May something like
$ echo "`gunzip -dc file.gz | tail +5c`BBB"
(where BBB are your three extra bytes) work for you?
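The same idea with the $(...) form of command substitution and the portable tail -c +5 spelling (keep in mind that command substitution strips trailing newlines, which can matter for binary data):
$ echo "$(gunzip -dc file.gz | tail -c +5)BBB"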

Related

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or, even shorter, as pointed out in this cracking comment by @cyrus:
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses awk's multiple-field-separator functionality and is more idiomatic awk.
Note: see the awk manual for more info.
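To see what that field separator does, here is the split on an illustrative line of the same shape (path and URL shortened):
$ echo 'path|https://example.com/a.ppt,18,17' | awk -F '[|,]' '{print $2}'
https://example.com/a.ppt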
Parse with sed, and since the file appears to be already sorted
(with respect to URLs), just run uniq and count:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Find unique words

Suppose there is a file.txt containing the following text:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that extracts the unique first words, i.e. the part of each line before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
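For example (outfile.txt is just a placeholder name):
$ cut -d '/' -f 1 file.txt | sort -u > outfile.txt     # write only to the file
$ cut -d '/' -f 1 file.txt | sort -u | tee outfile.txt  # write to the file and show it
ABC
EFG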
Try this:
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
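For instance (an illustrative sketch, not required for this question), the same in-program redirection can split the records into one output file per prefix:
awk -F'/' '{ print $0 > ($1 ".txt") }' file.txt   # writes ABC.txt and EFG.txt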
Another easy way:
cut -d"/" -f1 file.txt|uniq > out.txt
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
cut grabs everything in each line up to the first slash /, and sort -u sorts the result, removing any duplicate strings, before it is written to newfile.txt.

Is it possible to set variable in pipeline?

I have a big txt file which I want to edit in a pipeline. But at one place in the pipeline I want to store the number of lines in the variable $nol. I just want to see the syntax for setting a variable in a pipeline, something like:
cat ${!#} | tr ' ' '\n'| grep . ; $nol=wc -l | sort | uniq -c ...
The part after the second pipe is obviously wrong, but how can I do this in bash?
One solution is:
nol=$(cat ${!#} | tr ' ' '\n'| grep . | wc -l)
which pipes everything from the start again,
but I don't want the script to do the same work twice, because I have more pipes than shown here.
I mustn't use awk or sed...
You can use tee to write a copy to a file which you use later:
tempfile="xyz"
tr ' ' '\n' < "${!#}" | grep '.' | tee "$tempfile" | sort | uniq -c ...
nol=$(wc -l < "$tempfile")
Or you can use it the other way around:
nol=$(tr ' ' '\n' < "${!#}" | grep '.' \
| tee >(sort | uniq -c ... > /dev/tty) | wc -l)
You can set a variable in a particular link of a pipeline, but that's not very useful since only that particular link will be affected by it.
I recommend simply using a temporary file.
set -e
trap 'rm -f "$tmpf"' EXIT
tmpf=`mktemp`
cat ${!#} | tr ' ' '\n'| grep . | sort > "$tmpf"
nol="$(wc "$tmpf")"
< "$tmpf" uniq -c ...
You can avoid the temporary file with tee and a named pipe, but it probably won't perform much better (it may even perform worse).
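A minimal demonstration of the first point, assuming bash's default behaviour where every pipeline element runs in a subshell: the assignment survives only inside its own element.
printf 'one\ntwo\n' | { nol=$(wc -l); echo "inside: $nol"; }   # prints the count
echo "outside: ${nol:-unset}"                                  # prints: outside: unset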
UPDATE:
Took a minute but I got it...
cat ${!#} | tr ' ' '\n'| tee >(nol=$(wc -l)) | sort | uniq -c ...
PREVIOUS:
The only way I can think to do this is storing in variables and calling back. You would not execute the command more than one time. You would just store the output in variables along the way.
aCommand=($(cat ${!#} | tr ' ' '\n'));sLineCount=$(echo ${#aCommand[@]});echo ${aCommand[@]} | sort | uniq -c ...
aCommand will store the results of the first set of commands in an array
sLineCount will count the elements (lines) in the array
;... echo the array elements and continue the commands from there.
Looks to me like you're asking how to avoid stepping through your file twice, just to get both word and line count.
Bash lets you read variables, and wc can produce all the numbers you need at once.
NAME
wc -- word, line, character, and byte count
So to start...
read lines words chars < <( wc < ${!#} )
This populates the three variables based on input generated from process substitution.
But your question includes another partial command line which I think you intend as:
nol=$( sort -u ${!#} | wc -l )
This is markedly different from the word count of your first command line, so you can't use a single wc instance to generate both. Instead, one option might be to put your functionality into a script that does both functions at once:
read words uniques < <(
awk '
{
words += NF
for (i=1; i<=NF; i++) { unique[$i] }
}
END {
print words,length(unique)
}
' ${!#}
)
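A quick check with a hypothetical sample file (the length(unique) call on an array assumes gawk):
$ printf 'a b a\nc\n' > sample.txt
$ read words uniques < <( awk '{ words += NF; for (i=1; i<=NF; i++) unique[$i] } END { print words, length(unique) }' sample.txt )
$ echo "$words $uniques"
4 3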

Simple diff/patch script for sorted unique file

How could I write a simple diff or patch script for applying additions and deletions to a list of lines in a file?
This could be an original file (it is sorted and each line is unique):
a
b
d
A simple patch file could look like this (or something similarly simple):
+ c
+ e
- b
The resulting file should look like this (or in any other order, since sort can be applied afterwards):
a
c
d
e
The normal patch formats cannot be used, since they include context, which might differ in this case.
Bash alternatives that read input files only once:
To generate patch you can:
comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
Because comm delimits output from the two files with a tab, we can use that tab to detect whether a line should be added or removed.
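For the question's sample data (a.txt holding a, b, d and b.txt holding a, c, d, e) this gives:
$ printf 'a\nb\nd\n' > a.txt
$ printf 'a\nc\nd\ne\n' > b.txt
$ comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
- b
+ c
+ e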
To apply patch you can:
{ <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat)
tee splits the input, i.e. the patch file, into two streams. The first branch keeps the + lines and writes them to file descriptor 5; that descriptor is opened onto >(cat), so those lines simply end up on stdout. The second branch keeps the - lines and uses comm to filter them out of a.txt, printing the remainder. Because the output should be line buffered, it should work.
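Applied to the same sample files, with the patch above saved as patch.txt (piping through sort, since the two branches may interleave):
$ { <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat) | sort
a
c
d
e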
A shell solution using comm, awk, and grep to apply such a patch would be:
A=a.txt B=b.txt P=patch.txt; { grep '^-' $P | cut -c 3- | comm -23 $A - ; grep '^+' $P | cut -c 3- ; } | sort -u > $B
and to generate the patch file:
A=a.txt B=b.txt P=patch.txt; { comm -13 $A $B | awk '{print "+ " $0}' ; comm -23 $A $B | awk '{print "- " $0}' ; } > $P
Since nobody could give me an answer, I've created a small Python script that does exactly this job. https://github.com/white-gecko/simplepatch
To apply such a patch call it with (where outfile.txt is generated)
./simplepatch.py -m patch -i infile.txt -p patchfile.txt -o outfile.txt
To generate a patch/diff call it with (where patchfile.txt is generated)
./simplepatch.py -m diff -i infile.txt -o outfile.txt -p patchfile.txt

How to write shell script for finding number of pages in PDF?

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?
Without any extra package:
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
The ImageMagick library provides a tool called identify which, in conjunction with counting the lines of output, gets you what you are after. ImageMagick is an easy install on macOS with brew.
Here is a functional bash script that captures the count in a shell variable and dumps it back to the screen...
#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
#Identify gets info for each page; dump stderr to /dev/null
#count the lines of output
#trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"
And the output of running it...
$ ./countPages.sh aSampleFile.pdf
Processing aSampleFile.pdf
The number of pages is: 2
$
The pdftotext utility converts a PDF file to text format, inserting page breaks (form-feed characters, $'\f') between the pages:
NAME
pdftotext - Portable Document Format (PDF) to text converter.
SYNOPSIS
pdftotext [options] [PDF-file [text-file]]
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is
not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is
sent to stdout.
There are many combinations that solve your problem; choose one of them:
1) pdftotext + grep:
$ pdftotext file.pdf - | grep -c $'\f'
2) pdftotext + awk (v1):
$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'
3) pdftotext + awk (v2):
$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'
4) pdftotext + awk (v3):
$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
Hope it Helps!
Here is a version for the command line directly (based on pdfinfo):
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):
# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
How does this work? Well, if you specify a first page which is larger than the number of pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:
Wrong page range given: the first page (1000000) can not be after the last page (142).
So, I pipe that stderr msg to stdout with 2>&1, as explained here, then I pipe that to grep to match the (142). part with this regular expression (([0-9]*)\.$), then I pipe that to grep again with this regular expression ([0-9]*) to find just the number, which is 142 in this case. That's it!
Wrapper functions and speed testing
Here are a couple wrapper functions to test these:
# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
# num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
_pdf="$1"
_num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1)"
echo "$_num_pgs"
}
# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
# num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
_pdf="$1"
_password="$2"
if [ -n "$_password" ]; then
_password="-upw $_password"
fi
_num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*')"
echo "$_num_pgs"
}
Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.
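To reproduce that comparison, you can time the wrappers directly (mypdf.pdf is a placeholder name):
time getNumPgsInPdf  mypdf.pdf    # strings-based: ~0.200 sec on the 142-page test PDF
time getNumPgsInPdf2 mypdf.pdf    # pdftoppm-based: ~0.020 sec or less on the same PDF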
See also
These awesome answers by Ocaso Protal.
These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
mupdf/mutool solution:
mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
Just dug out an old script (in ksh) I found:
#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
# pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'
[[ "$#" != "1" ]] && {
printf "ERROR: No file specified\n"
exit 1
}
numpages=0
while read line; do
num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
(( num > numpages)) && numpages=$num
done < <(strings "$1" | grep "/Count")
print $numpages
If you're on macOS you can query pdf metadata like this:
mdls -name kMDItemNumberOfPages -raw file.pdf
as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal
Another mutool solution making better use of the options:
mutool show file.pdf Root/Pages/Count
I made a small improvement to Marius Hofert's tip, to sum the returned values.
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
Building on Marius Hofert's answer, this command uses a bash for loop to show the number of pages, display the filename, and ignore the case of the file extension.
for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
