How to get nth column with regexp delimiter [duplicate] - bash

Basically, I get a line from the ls -la command:
-rw-r--r-- 13 ondrejodchazel staff 442 Dec 10 16:23 some_file
and want to get the size of the file (442). I have tried the cut and sed commands, but was unsuccessful. Using just basic UNIX tools (cut, sed, awk...), how can I get a specific column from stdin, where the delimiter is the regexp / +/?

If you want to do it with cut you need to squeeze the space first (tr -s ' ') because cut doesn't support +. This should work:
ls -la | tr -s ' ' | cut -d' ' -f 5
It's a bit more work when doing it with sed (GNU sed):
ls -la | sed -r 's/([^ ]+ +){4}([^ ]+).*/\2/'
Slightly more finger punching if you use the grep alternative (GNU grep):
ls -la | grep -Eo '[^ ]+( +[^ ]+){4}' | grep -Eo '[^ ]+$'
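For comparison, awk accepts a regular expression as its field separator, so the / +/ delimiter from the question can be passed in directly. This is only a minimal sketch; the function name nth_col is made up for illustration, and it assumes lines do not start with spaces (a leading run of spaces would produce an empty first field).

```shell
# nth_col N: print field N of every line on stdin, treating each run of
# spaces as one delimiter (the regexp " +" is handed to awk as FS).
# nth_col is a hypothetical helper name, not a standard command.
nth_col() {
    awk -F ' +' -v n="$1" '{ print $n }'
}

# Sample line taken from the question; prints 442
printf '%s\n' '-rw-r--r-- 13 ondrejodchazel staff 442 Dec 10 16:23 some_file' | nth_col 5
```

Because the separator is a regexp, no tr -s step is needed before splitting.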

Parsing ls output is harder than you think. Use a dedicated tool such as stat instead.
size=$(stat -c '%s' some_file)
One way ls -la some_file | awk '{print $5}' could break is if numbers use space as a thousands separator (this is common in some European locales).
See also Why You Shouldn't Parse the Output of ls(1).

Pipe your output with:
awk '{print $5}'
Or, even better, use the stat command like this (on Mac):
stat -f "%z" yourFile
Or (on Linux)
stat -c "%s" yourFile
Either one outputs the size of the file in bytes.
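Since the stat flags differ between Linux and Mac, a script that has to run on both could try each form in turn. The following is a rough sketch under that assumption; file_size is a hypothetical helper name, and the wc -c fallback is an addition for systems where neither stat form works.

```shell
# file_size FILE: print the size of FILE in bytes.
# Tries GNU stat (Linux) first, then BSD stat (Mac), then falls back to wc -c.
# file_size is a hypothetical helper name used only for illustration.
file_size() {
    [ -e "$1" ] || return 1
    stat -c '%s' "$1" 2>/dev/null ||
        stat -f '%z' "$1" 2>/dev/null ||
        wc -c < "$1" | tr -d ' '    # tr strips the padding some wc versions add
}

# usage: size=$(file_size some_file)
```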

Related

How to grep only matching string from this result?

I am just simply trying to grab the commit ID, but not quite sure what I'm missing:
➜ ~ curl https://github.com/microsoft/vscode/releases -s | grep -oE 'microsoft/vscode/commit/(.*?)/hovercard'
microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard
The only thing I need back from this is ccbaa2d27e38e5afa3e5c21c1c7bef4657064247.
This works just fine on regex101.com and in ruby/python. What am I missing?
If supported, you can use grep -oP
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | grep -oP "microsoft/vscode/commit/\K.*?(?=/hovercard)"
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
Another option is to use sed with a capture group
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | sed -E 's/microsoft\/vscode\/commit\/([^\/]+)\/hovercard/\1/'
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
The point is that grep does not support extracting capturing group submatches. If you install pcregrep you could do that with
curl https://github.com/microsoft/vscode/releases -s | \
pcregrep -o1 'microsoft/vscode/commit/(.*?)/hovercard' | head -1
The | head -1 part is to fetch the first occurrence only.
I would suggest using awk here:
awk 'match($0,/microsoft\/vscode\/commit\/[^\/]*\/hovercard/){print substr($0,RSTART+24,RLENGTH-34);exit}'
The regex will match a line containing
microsoft\/vscode\/commit\/ - microsoft/vscode/commit/ fixed string
[^\/]* - zero or more chars other than /
\/hovercard - a /hovercard string.
substr($0,RSTART+24,RLENGTH-34) prints the part of the line that starts at index RSTART+24 (24 is the length of microsoft/vscode/commit/) and is RLENGTH-34 characters long (34 is the length of microsoft/vscode/commit/ plus the length of /hovercard).
The exit command will fetch you the first occurrence. Remove it if you need all occurrences.
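The hard-coded 24 and 34 can also be computed by awk itself, which may make the arithmetic easier to follow. This is a sketch only: first_commit_id is a made-up function name, and the prefix and suffix are passed in as awk variables rather than written into the regex.

```shell
# first_commit_id: read HTML on stdin and print the first commit ID found
# between the prefix and the suffix. Hypothetical helper for illustration.
first_commit_id() {
    awk -v pre='microsoft/vscode/commit/' -v suf='/hovercard' '
        match($0, pre "[^/]*" suf) {
            # start after the prefix; drop prefix and suffix from the length
            print substr($0, RSTART + length(pre), RLENGTH - length(pre) - length(suf))
            exit    # stop at the first occurrence
        }'
}

# usage: curl -s https://github.com/microsoft/vscode/releases | first_commit_id
```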
You can use sed:
curl -s https://github.com/microsoft/vscode/releases |
sed -En 's=.*microsoft/vscode/commit/([^/]+)/hovercard.*=\1=p' |
head -n 1
head -n 1 prints only the first match (there are 10).
grep -o will print (only) everything that matches, including microsoft/ etc.
Your task cannot be achieved with Mac's grep. grep -o prints all matching text (compared to the default behaviour of printing matching lines), including microsoft/ etc. A grep that implements Perl-compatible regexes (like GNU grep on Linux) could make use of lookahead/lookbehind (grep -Po '(?<=microsoft/vscode/commit/)[^/]+(?=/hovercard)'). But that's just not available in Mac's grep.
On macOS you don't have GNU utilities available by default. You can just pipe your output to a simple awk like this:
curl https://github.com/microsoft/vscode/releases -s |
grep -oE 'microsoft/vscode/commit/[^/]+/hovercard' |
awk -F/ '{print $(NF-1)}'
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
3a6960b964327f0e3882ce18fcebd07ed191b316
f4af3cbf5a99787542e2a30fe1fd37cd644cc31f
b3318bc0524af3d74034b8bb8a64df0ccf35549a
6cba118ac49a1b88332f312a8f67186f7f3c1643
c13f1abb110fc756f9b3a6f16670df9cd9d4cf63
ee8c7def80afc00dd6e593ef12f37756d8f504ea
7f6ab5485bbc008386c4386d08766667e155244e
83bd43bc519d15e50c4272c6cf5c1479df196a4d
e7d7e9a9348e6a8cc8c03f877d39cb72e5dfb1ff

Get first argument of wc -l myFile.txt [duplicate]

I'm counting the number of lines in a big file using
wc -l myFile.txt
Result is
110 myFile.txt
But I want only the number
110
How can I do that?
(I want the number of lines as an input argument in a bash script)
There are lots of ways to do this. Here are two:
wc -l myFile.txt | cut -f1 -d' '
wc -l < myFile.txt
Cut is an old Unix tool to
print selected parts of lines from each FILE to standard output.
You can use cat and pipe it to wc -l:
cat myFile.txt | wc -l
Or if you insist wc -l be the first command, you can use awk:
wc -l myFile.txt | awk '{print $1}'
You can try
wc -l file | awk '{print $1}'
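Since the asker wants the number as an argument in a bash script, the redirect form can be wrapped in a function and captured in a variable. A minimal sketch, assuming a helper named count_lines (not a standard command); the tr step is there because some wc implementations (for example on Mac) pad the number with leading spaces.

```shell
# count_lines FILE: print only the number of lines in FILE.
# Reading from stdin means wc has no file name to print.
# count_lines is a hypothetical helper name.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# usage: n=$(count_lines myFile.txt); echo "myFile.txt has $n lines"
```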

vimdiff files given in a text file

I have a text file files.txt with following entries
"/home/dilawar/a.txt","/home/dilawar/b.txt"
"/home/dilawar/aa.txt","/home/dilawar/bb.txt"
Now I wish to see the diff of files on line 1. I tried the following
head -n 1 files.txt | cut -d, -f 2,3 | sed "s/,/\t/g" | xargs -I files vimdiff files
It is not working. I replaced vimdiff with diff, it did not work either. However this works
head -n 1 files.txt | cut -d, -f 1 | xargs -I file vim file
How to pass file as an argument to diff as two separate file paths rather than a single string?
PS : To make matter worse, I have space in some of file paths.
First take the first line, then replace the quotes and the comma with spaces, and feed the result to vimdiff via command substitution.
vimdiff $(head -1 files.txt | tr '",' ' ')
The elegant method above will not work with names that contain a space. The dirty one below will.
awk -F, 'NR==1{print "vimdiff",$1,$2}' files.txt | bash
Try this and see if it helps:
sed '1{s/,/ /; s/^/diff /;q}' files.txt|sh
I also escaped the whitespace in filepath (first sed command)
head -n 1 files.txt | sed "s/ /\\\\ /g" | sed "s/[\",]/ /g" |xargs vimdiff
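Another way to cope with spaces in the paths, without generating a command line for a second shell, is to split the line with parameter expansion and pass the two paths as separately quoted arguments. This is a sketch under the assumption that every line has exactly the form "path1","path2"; run_on_pair is a made-up helper name.

```shell
# run_on_pair LIST N CMD [ARGS...]: run CMD on the two paths found on
# line N of LIST. Assumes each line looks like "path1","path2".
# run_on_pair is a hypothetical helper name.
run_on_pair() {
    list=$1
    lineno=$2
    shift 2
    line=$(sed -n "${lineno}p" "$list")
    first=${line%%\",\"*}     # everything before the "," separator
    first=${first#\"}         # drop the opening quote
    second=${line#*\",\"}     # everything after the "," separator
    second=${second%\"}       # drop the closing quote
    "$@" "$first" "$second"
}

# usage: run_on_pair files.txt 1 vimdiff
```

Because the paths are passed as "$first" and "$second", spaces inside them are preserved.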

Command to get nth line of STDOUT

Is there any bash command that will let you get the nth line of STDOUT?
That is to say, something that would take this
$ ls -l
-rw-r--r--# 1 root wheel my.txt
-rw-r--r--# 1 root wheel files.txt
-rw-r--r--# 1 root wheel here.txt
and do something like
$ ls -l | magic-command 2
-rw-r--r--# 1 root wheel files.txt
I realize this would be bad practice when writing scripts meant to be reused, BUT when working with the shell day to day it'd be useful to me to be able to filter my STDOUT in such a way.
I also realize this would be semi-trivial command to write (buffer STDOUT, return a specific line), but I want to know if there's some standard shell command to do this that would be available without me dropping a script into place.
Using sed, just for variety:
ls -l | sed -n 2p
The following alternative looks more efficient, since it stops reading the input once the required line is printed. However, it may generate a SIGPIPE in the feeding process, which may in turn produce an unwanted error message:
ls -l | sed -n -e '2{p;q}'
I've seen that often enough that I usually use the first (which is easier to type, anyway), though ls is not a command that complains when it gets SIGPIPE.
For a range of lines:
ls -l | sed -n 2,4p
For several ranges of lines:
ls -l | sed -n -e 2,4p -e 20,30p
ls -l | sed -n -e '2,4p;20,30p'
ls -l | head -2 | tail -1
Alternative to the nice head / tail way:
ls -al | awk 'NR==2'
or
ls -al | sed -n '2p'
From sed1line:
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
From awk1line:
# print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
For the sake of completeness ;-)
shorter code
find / | awk NR==3
shorter life
find / | awk 'NR==3 {print $0; exit}'
Try this sed version:
ls -l | sed '2 ! d'
It says "delete all the lines that aren't the second one".
You can use awk:
ls -l | awk 'NR==2'
Update
The above code will not get what we want because of an off-by-one error: the first line of the ls -l output is the total line. For that, the following revised code will work:
ls -l | awk 'NR==3'
Another poster suggested
ls -l | head -2 | tail -1
but if you pipe head into tail, it looks like everything up to line N is processed twice.
Piping tail into head
ls -l | tail -n +2 | head -n1
would be more efficient?
Is Perl easily available to you?
$ perl -n -e 'if ($. == 7) { print; exit(0); }'
Obviously substitute whatever number you want for 7.
Yes, the most efficient way (as already pointed out by Jonathan Leffler) is to use sed with print & quit:
set -o pipefail # cf. help set
time -p ls -l | sed -n -e '2{p;q;}' # only print the second line & quit (on Mac OS X)
echo "$?: ${PIPESTATUS[*]}" # cf. man bash | less -p 'PIPESTATUS'
Hmm
sed did not work in my case.
I propose:
For "odd" lines (1, 3, 5, 7, ...):
ls | awk '0 == (NR+1) % 2'
For "even" lines (2, 4, 6, 8, ...):
ls | awk '0 == NR % 2'
For more completeness..
ls -l | (for ((x=0;x<2;x++)) ; do read ; done ; head -n1)
Throw away lines until you get to the second, then print out the first line after that. So, it prints the 3rd line.
If it's just the second line..
ls -l | (read; head -n1)
Put as many 'read's as necessary.
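The magic-command from the question could be approximated with a small shell function built on the sed print-and-quit form. This is a sketch, not a standard command; the name nth_line and the argument check are assumptions added for illustration.

```shell
# nth_line N: print line N of stdin, then stop reading.
# nth_line is a hypothetical helper name.
nth_line() {
    case $1 in
        ''|*[!0-9]*) echo "usage: nth_line N" >&2; return 2 ;;
    esac
    # {p;q;} prints the addressed line and quits; the trailing ; keeps BSD sed happy
    sed -n "${1}{p;q;}"
}

# usage: ls -l | nth_line 2
```

As noted in the sed answer, quitting early can send SIGPIPE to the command feeding the pipe.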

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in each file and write a line to an output file that contains part of the corresponding input file's name? For example, if we were looking at the file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers, I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l *| grep -v " total"
gives
28 Yellow
You can reverse it if you want (using awk, if you don't have spaces in the file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file
'wc -l' for finding out the line count
And don't forget to redirect ('>' or '>>') your results to your output file.
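Those hints could be put together roughly as follows. This is a sketch only: count_logs is a made-up function name, the LOG_ prefix is taken from the example in the question, and printf is used in place of echo -n so the tab is written the same way in every shell.

```shell
# count_logs DIR: for each DIR/LOG_* file, print "<name without LOG_><TAB><line count>".
# count_logs is a hypothetical helper; the LOG_ prefix comes from the question's example.
count_logs() {
    for f in "$1"/LOG_*; do
        [ -f "$f" ] || continue              # skip when the glob matched nothing
        name=${f##*/}                        # strip the directory part
        name=${name#LOG_}                    # strip the LOG_ prefix
        count=$(wc -l < "$f" | tr -d ' ')    # line count without file name or padding
        printf '%s\t%s\n' "$name" "$count"
    done
}

# usage: count_logs . > output.txt
```

Reading each file through a redirect also avoids the total line that wc prints when given several files.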

Resources