Using terminal to find PDF size - terminal

This is a normal output using pdfinfo
Creator: Pages
Producer: Mac OS X 10.10.1 Quartz PDFContext
CreationDate: Tue Mar 3 01:26:34 2015
ModDate: Tue Mar 3 01:26:34 2015
Tagged: no
Form: none
Pages: 5
Encrypted: no
Page size: 612 x 792 pts (letter) (rotated 0 degrees)
File size: 242463 bytes
Optimized: no
PDF version: 1.3
So I know I can do something like this to grab the amount of pages:
pdfinfo document.pdf | grep Pages: | awk '{print $2}'
I am trying to get the page size to put something like 612 x 792.
At the moment I am trying things like grep "Page size:" but it's obviously not the right way. Could anyone point me in the right direction?

grep/sed work:
pdfinfo document.pdf | \
grep "Page size:" | \
sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]pts.*//'
This uses grep to reduce the output to just the line you are interested in, then sed to chop off the beginning and end of that line (for the example you showed).
In this example, there are two sed options (each is a script). Both change characters matching a given pattern to nothing, e.g.,
s/old/new/
but here new is an empty string.
The "^" character at the beginning is an "anchor", matching the beginning of the line. The "[^:]" uses "^" differently, matching any character except ":" (and the "*" says zero-or-more). So given "Page size:", the pattern "^[^:]*:" matches the whole thing. After the ":" on your line, there is some whitespace (which may be spaces or tabs). The POSIX character class "[:space:]" matches either, and is put inside brackets as you see: "[[:space:]]". Finally, the ".*" in the second option matches any character (".") zero or more times ("*").
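As a variation on the pipeline above, the grep step can be folded into sed itself by using -n together with the p flag, so that only the substituted line is printed. This is a minimal sketch, not the answerer's exact command; page_size is a hypothetical helper name, and the sample lines are copied from the question so that it runs without pdfinfo installed.

```shell
# page_size (hypothetical helper): read pdfinfo-style output on stdin
# and print only the dimensions, e.g. "612 x 792".
page_size() {
    # -n suppresses default printing; the trailing p prints only lines
    # where the substitution succeeded, i.e. the "Page size:" line.
    sed -n 's/^Page size:[[:space:]]*\(.*\)[[:space:]]pts.*/\1/p'
}

# Sample lines taken from the question's pdfinfo output.
printf '%s\n' \
    'Pages:          5' \
    'Page size:      612 x 792 pts (letter) (rotated 0 degrees)' |
    page_size
```

With a real file this would be called as pdfinfo document.pdf | page_size.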

Related

Remove a line from a file based on serial number?

I have this file here below here:
#12345 Saab 1998 Red
#54321 Volvo 1990 Grey
#45678 Citroen 2004 Yellow
If I want to remove a line by serial #54321 or #12345?
What should I do to remove a line by the serial number?
You can use grep for this:
grep -E -v "#12345|#54321" file.txt
which means the following:
grep -E : use extended regular expressions (different items, separated by "|")
-v : instead of showing the matching lines, show the ones which don't match.
You can use sed to remove a line that matches a given pattern (e.g. #54321) from a given file and update/modify it at the same time (-i arg) like this:
sed -i '/#54321/d' file.txt
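One caveat with both commands is that an unanchored pattern such as #54321 would also match a longer serial like #543210. Below is a minimal sketch of an anchored version, assuming the serial is always the first field and is followed by a space. The drop_serial helper and the temporary file are hypothetical and only there for illustration.

```shell
# drop_serial SERIAL FILE (hypothetical helper): print FILE without the
# line whose first field is exactly #SERIAL.
drop_serial() {
    # ^ pins the match to the start of the line; the trailing space
    # stops #54321 from also matching a longer serial such as #543210.
    grep -v "^#$1 " "$2"
}

# Sample data copied from the question.
cars=$(mktemp)
printf '%s\n' \
    '#12345 Saab 1998 Red' \
    '#54321 Volvo 1990 Grey' \
    '#45678 Citroen 2004 Yellow' > "$cars"

drop_serial 54321 "$cars"
rm -f "$cars"
```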

Running sed on a large (30G) one-line file returns an empty output

I'm trying to perform simple literal search/replace on a large (30G) one-line file, using sed.
I would expect this to take some time but, when I run it, it returns after a few seconds and, when I look at the generated file, it's zero length.
input file has 30G
$ ls -lha Full-Text-Tokenized-Single-Line.txt
-rw-rw-r-- 1 ubuntu ubuntu 30G Jun 9 19:51 Full-Text-Tokenized-Single-Line.txt
run the command:
$ sed 's/<unk>/ /g' Full-Text-Tokenized-Single-Line.txt > Full-Text-Tokenized-Single-Line-No-unks.txt
the output file has zero length!
$ ls -lha Full-Text-Tokenized-Single-Line-No-unks.txt
-rw-rw-r-- 1 ubuntu ubuntu 0 Jun 9 19:52 Full-Text-Tokenized-Single-Line-No-unks.txt
Things I've tried
running the very same example on a shorter file: works
using -e modifier: doesn't work
escaping "<" and ">": doesn't work
using a simple pattern line ('s/foo/bar/g') instead: doesn't work: zero-length file is returned.
EDIT (more information)
return code is 0
sed version is (GNU sed) 4.2.2
Just use awk, it's designed for handling records separated by arbitrary strings. With GNU awk for multi-char RS:
awk -v RS='<unk>' '{ORS=(RT?" ":"")}1' file
The above splits the input into records separated by <unk> so if enough <unk>s are present in the input then the individual records will be small enough to fit in memory. It then prints each record followed by a blank char so the overall impact to the data is that all <unk>s become blank chars.
If that direct approach doesn't work for you THEN it'd be time to start looking for alternative solutions.
With line-based editors like sed you can't expect this to work, since their unit of work (the record) is a line terminated by a line break.
One suggestion, if your file contains white space (so that the searched pattern does not get split), is to use
fold -s file_with_one_long_line |
sed 's/find/replace/g' |
tr -d '\n' > output
P.S. The default width of fold is 80. If you have words longer than 80 characters, you can add -w 1000 (or at least the length of the longest word) to prevent words from being split.
Officially, GNU sed has no line length limit:
http://www.linuxtopia.org/online_books/linux_tool_guides/the_sed_faq/sedfaq6_005.html
However, the page states that:
"no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.
I tried running sed on a single 7 GB file and could reproduce the same issue.
This page https://community.hpe.com/t5/Languages-and-Scripting/Sed-Maximum-Line-Length/td-p/5136721 suggests using perl instead:
perl -pe 's/start=//g;s/stop=//g;s/<unk>/ /g' file > output
If the tokens are space delimited (not all whitespace), and assuming you are only matching single words, then you could use perl with a space as the record separator:
perl -040 -pe 's/<unk>/ /' file
or GNU awk to match all whitespace:
awk -v RS='[[:space:]]' '{ORS=RT; sub(/<unk>/," ")} 1' file

grep command giving unexpected output when searching exact word in file in csh

I used the following script to search for every line of one file in another file and, if it is found, print the 2nd column of that line:
#!/bin/csh
set goldFile=$1
set regFile=$2
set noglob
foreach line ("`cat $goldFile`")
set searchString=`echo $line | awk '{print $1}'`
set id=`grep -w -F "$searchString" $regFile | awk '{print $2}'`
echo "$searchString" "and" "$id"
end
unset noglob
Gold file is as follows :
\$#%$%escaped.Integer%^^&[10]
\$#%$%escaped.Integer%^^&[10][0][0][31]
\$#%$%escaped.Integer%^^&[10][0][0][30]
\$#%$%escaped.Integer%^^&[10][0][0][29]
\$#%$%escaped.Integer%^^&[10][0][0][28]
\$#%$%escaped.Integer%^^&[10][0][0][27]
\$#%$%escaped.Integer%^^&[10][0][0][26]
and RegFile is as follows :
\$#%$%escaped.Integer%^^&[10] 1
\$#%$%escaped.Integer%^^&[10][0][0][31] 10
\$#%$%escaped.Integer%^^&[10][0][0][30] 11
\$#%$%escaped.Integer%^^&[10][0][0][29] 12
\$#%$%escaped.Integer%^^&[10][0][0][28] 13
\$#%$%escaped.Integer%^^&[10][0][0][27] 14
\$#%$%escaped.Integer%^^&[10][0][0][26] 15
Output is coming :
\$#%$%escaped.Integer%^^&[10] and 1 10 11 12 13 14 15
\$#%$%escaped.Integer%^^&[10][0][0][31] and 10
\$#%$%escaped.Integer%^^&[10][0][0][30] and 11
\$#%$%escaped.Integer%^^&[10][0][0][29] and 12
\$#%$%escaped.Integer%^^&[10][0][0][28] and 13
\$#%$%escaped.Integer%^^&[10][0][0][27] and 14
\$#%$%escaped.Integer%^^&[10][0][0][26] and 15
But expected Output is :
\$#%$%escaped.Integer%^^&[10] and 1
\$#%$%escaped.Integer%^^&[10][0][0][31] and 10
\$#%$%escaped.Integer%^^&[10][0][0][30] and 11
\$#%$%escaped.Integer%^^&[10][0][0][29] and 12
\$#%$%escaped.Integer%^^&[10][0][0][28] and 13
\$#%$%escaped.Integer%^^&[10][0][0][27] and 14
\$#%$%escaped.Integer%^^&[10][0][0][26] and 15
Please help me to figure out how to search exact word having some special character using grep.
csh and bash are completely different variants of shell. They're not even supposed to be compatible. Your problem, however, has more to do with how grep is used.
The -F flag makes grep treat your string as a fixed string rather than a regular expression, which is what you want here because the string is full of regex special characters such as [], (), ., *, ^, $, - and \.
The wrong result arises because the Gold file line \$#%$%escaped.Integer%^^&[10] is a substring of every line in the RegFile, so it matches all of them. The -w flag does not help: the character following the match is either a space or [, neither of which is a word character, so every hit counts as a whole word.
Normally an exact match can be enforced with the line anchors ^ and $ as part of the pattern, but that won't work in your case: because of the -F, --fixed-strings flag they would be treated as part of the search string.
So, assuming from the input file that there can be only one match in the RegFile for each line in the Gold file, you could stop the grep search after the first hit
using the -m1 flag, which according to the grep man page means:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines. If the input is standard input
from a regular file, and NUM matching lines are output, grep ensures that the
standard input is positioned to just after the last matching line before
exiting, regardless of the presence of trailing context lines.
So adding it like,
grep -w -F -m1 "$searchString" $regFile
should solve your problem.
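Note that -m1 gives the expected result here because the shortest key happens to come first in the RegFile. A different technique that does not depend on line order is to compare the whole first column for equality instead of searching for a substring. The sketch below is written in POSIX sh rather than csh; lookup_id is a hypothetical helper, and the sample data is a shortened copy of the RegFile from the question.

```shell
# lookup_id KEY FILE (hypothetical helper): print column 2 of every line
# in FILE whose column 1 is exactly KEY.
lookup_id() {
    # The key is passed through the environment because awk -v would
    # interpret the backslash in it as an escape sequence.
    # Appending "" to both sides forces a string comparison.
    key="$1" awk '($1 "") == (ENVIRON["key"] "") { print $2 }' "$2"
}

reg=$(mktemp)
printf '%s\n' \
    '\$#%$%escaped.Integer%^^&[10] 1' \
    '\$#%$%escaped.Integer%^^&[10][0][0][31] 10' \
    '\$#%$%escaped.Integer%^^&[10][0][0][30] 11' > "$reg"

lookup_id '\$#%$%escaped.Integer%^^&[10]' "$reg"
rm -f "$reg"
```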

grep one liner - extract two different lines from same file

I've a file containing many number of lines like following.
== domain 1 score: 280.5 bits; conditional E-value: 2.1e-87
TSEEETTCTTTGSG---BTTSSB-HHHHHHHHHHHHHHHHHHSSS---B-HHHHHHHSTTTSTGCGBB-HHHHHHHHHHHTEBEBTTTS---SSCSESECTTGCGSCEBEESEEEEEESSBHHHHHHHHHHHSSEEEEEECTSHHHHTEESSEESCTSCETSS-EEEEEEEEEEEETTEEEEEEE-SBTTTSTBTTEEEEESSSSSGGGTTSSEEEE CS
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
p+svD+r+k+ +vtpvk+qg+CGsCWafs+vgaleg+l+ kt +kl++lS q+lvdC + en+gC GG+m+naf+y++kn+gi++e+ ypY ++e ++C ++ + + ak++gy++++e +e+alk+a+a++gPvsvaidas ++fq+Y++Gvy++++c++++lnhavl+vGyg ++g+k Wi+kNsWg++wg+kGYi +ar+knn cgi++ a++p
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
I just want to extract the line that is preceded by the PF and the associated line after it which starts with digit.
Here in this case, line that starts with PF is 'PF00112.18' and line that starts with digit is '1AU0:A'. These ids will change for next domain, but PF is constant and its associated id starts with digit.
Here is what I've tried with grep, I hope there must be mistake in this oneliner. Any help will be greatly appreciated.
grep '^ PF \| \d' inFile.txt
Expected output:
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
You can use the following grep expression:
grep '^[[:space:]]\+PF\|^[[:space:]]\+[[:digit:]]' input.txt
The first pattern ^[[:space:]]\+PF searches for a line which contains one or more spaces at the start, followed by the term PF. The second pattern also searches for one or more spaces at the start of the line, but followed by a digit.
This can be simplified to:
grep '^[[:space:]]\+\(PF\|[[:digit:]]\)' input.txt
since both patterns start with one or more spaces at the start of the line.
Let me finally suggest using egrep (grep -E) instead of grep, because extended POSIX regexes save us some escaping:
egrep '^[[:space:]]+(PF|[[:digit:]])' input.txt
egrep "^[[:blank:]]*(PF|[0-9]).*$" tmp_file
[[:blank:]] matches a space or a tab. (Writing [ \t] does not do this in grep: inside a bracket expression, \t is a literal backslash and the letter t, not a tab.)
So ^[[:blank:]]* matches any leading white space at the start of the line; the asterisk means zero or more such characters.
(PF|[0-9]) then matches either PF or a digit. The beauty of egrep is that you can specify multiple alternatives inside parentheses, separated by a pipe.
.*$ matches everything from that point to the end of the line,
so (PF|[0-9]).*$ matches everything from PF or a digit to the end of the line. It will not work without accounting for the leading white space first.
So we get:
egrep "^[[:blank:]]*(PF|[0-9]).*$" tmp_file
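To check the alternation on something smaller than the real alignment, here is a minimal sketch using grep -E. The sample lines are shortened stand-ins for the ones in the question, and it is an assumption that the real file indents every line with leading spaces; pf_lines is a hypothetical helper name.

```shell
# pf_lines (hypothetical helper): keep only indented lines whose first
# non-blank text starts with PF or with a digit.
pf_lines() {
    grep -E '^[[:space:]]+(PF|[[:digit:]])'
}

# Shortened sample block; the sequences are truncated for readability.
printf '%s\n' \
    '  == domain 1  score: 280.5 bits' \
    '     PF00112.18   2 pesvDwrekkg 218' \
    '                    p+svD+r+k+' \
    '         1AU0:A   2 PDSVDYRKKG 213' |
    pf_lines
```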

Grep characters before and after match?

Using this:
grep -A1 -B1 "test_pattern" file
will produce one line before and after the matched pattern in the file. Is there a way to display not lines but a specified number of characters?
The lines in my file are pretty big so I am not interested in printing the entire line but rather only observe the match in context. Any suggestions on how to do this?
3 characters before and 4 characters after
$> echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}'
23_string_and
grep -E -o ".{0,5}test_pattern.{0,5}" test.txt
This will match up to 5 characters before and after your pattern. The -o switch tells grep to only show the match and -E to use an extended regular expression. Make sure to put the quotes around your expression, else it might be interpreted by the shell.
You could use
awk '/test_pattern/ {
match($0, /test_pattern/); print substr($0, RSTART - 10, RLENGTH + 20);
}' file
You mean, like this:
grep -o '.\{0,20\}test_pattern.\{0,20\}' file
?
That will print up to twenty characters on either side of test_pattern. The \{0,20\} notation is like *, but specifies zero to twenty repetitions instead of zero or more. The -o says to show only the match itself, rather than the entire line.
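The same interval notation can be wrapped so that the amount of context on each side is a parameter. This is a sketch only: context is a hypothetical helper, the pattern is interpreted as a basic regular expression, and it relies on the -o option that GNU and BSD grep provide.

```shell
# context PATTERN BEFORE AFTER (hypothetical helper): read stdin and print
# each match with up to BEFORE characters before it and AFTER after it.
context() {
    grep -o ".\{0,$2\}$1.\{0,$3\}"
}

# Same input as the -P example earlier in this thread.
echo "some123_string_and_another" | context string 3 4
# prints: 23_string_and
```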
I'll never easily remember these cryptic command modifiers so I took the top answer and turned it into a function in my ~/.bashrc file:
cgrep() {
    # For files whose lines are tens of thousands of characters long:
    # print 30 characters before and after the search pattern.
    if [ $# -eq 2 ] ; then
        # Format was 'cgrep "search string" /path/to/filename'
        grep -o -P ".{0,30}$1.{0,30}" "$2"
    else
        # Format was 'cat /path/to/filename | cgrep "search string"'
        grep -o -P ".{0,30}$1.{0,30}"
    fi
} # cgrep()
Here's what it looks like in action:
$ ll /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
-rw-r--r-- 1 rick rick 25780 Jul 3 19:05 /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
$ cat /tmp/rick/scp.Mf7UdS/Mf7UdS.Source | cgrep "Link to iconic"
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
$ cgrep "Link to iconic" /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
The file in question is one continuous 25K line and it is hopeless to find what you are looking for using regular grep.
Notice the two different ways you can call cgrep, which parallel the ways grep itself can be called.
There is a "niftier" way of creating the function where "$2" is only passed when set which would save 4 lines of code. I don't have it handy though. Something like ${parm2} $parm2. If I find it I'll revise the function and this answer.
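One possibility for the shorter form, offered as a guess rather than as the version the author had in mind, is the ${2:+"$2"} expansion, which produces the quoted file name only when a second argument was given. The sketch below uses the hypothetical name cgrep2 and swaps -P for -E, which is enough for the .{0,30} intervals.

```shell
# cgrep2 PATTERN [FILE] (hypothetical variant of cgrep): print up to 30
# characters of context on each side of PATTERN.
cgrep2() {
    # ${2:+"$2"} expands to the quoted file name if $2 is set and
    # non-empty, and to nothing otherwise, so grep then reads stdin.
    grep -o -E ".{0,30}$1.{0,30}" ${2:+"$2"}
}

# Called on a pipe here; cgrep2 "Link to iconic" /path/to/file also works.
printf '%s\n' '/mnt/e/bin/Link to iconic S -rwxrwxrwx' | cgrep2 "Link to iconic"
```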
With gawk, you can use the match function:
x="hey there how are you"
echo "$x" |awk --re-interval '{match($0,/(.{4})how(.{4})/,a);print a[1],a[2]}'
ere are
If you are OK with perl, here is a more flexible solution. The following prints three characters before the pattern, then the pattern itself, and then five characters after it:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(.{3})(there)(.{5})/'
ey there how
This can also be applied to words instead of just characters. The following prints one word before the matching string:
echo hey there how are you |perl -lne 'print $1 if /(\w+) there/'
hey
The following prints one word after the pattern:
echo hey there how are you |perl -lne 'print $2 if /(\w+) there (\w+)/'
how
The following prints one word before the pattern, then the matched word, and then one word after the pattern:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(\w+)( there )(\w+)/'
hey there how
If using ripgrep, this is how you would do it:
rg -o ".{0,5}test_pattern.{0,5}" test.txt
You can use a regexp grep to find the match, plus a second grep to highlight it:
echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}' | grep string
23_string_and
With ugrep you can specify -ABC context with option -o (--only-matching) to show the match with extra characters of context before and/or after the match, fitting the match plus the context within the specified -ABC width. For example:
ugrep -o -C30 pattern testfile.txt
gives:
1: ... long line with an example pattern to match. The line could...
2: ...nother example line with a pattern.
On a terminal the same output is shown with color highlighting. Multiple matches on a line are either shown with [+nnn more], or, with option -k (--column-number), each is shown individually with its context and column number.
The context width is the number of Unicode characters displayed (UTF-8/16/32), not just ASCII.
I personally do something similar to the posted answers. I often don't need a lot of context (if I needed more I might show whole lines with grep -C, but like you I usually don't want the lines before and after), so I find it quicker to just type one dot per character of context: tap the dot key a few times for a few characters, or hold it down for more.
For example:
echo zzzabczzzz | grep -o '.abc..'
This prints the abc pattern with one character before and two after (in regex language, a dot matches any character). Others used the dot too, but with curly braces to specify the repetition.
If I wanted to be strict about a range (between 0 and x characters) or an exact count of y characters, then I'd use the curly braces and -P, as others have done.
There is a setting that controls whether the dot matches a newline, but you can look into that if it's a concern or of interest.

Resources