Sorting output with awk, and formatting it

I'm trying to format the output of ls -la so it only contains files modified in December, output nicely. This is what I have so far and what the output currently looks like:
ls -la | awk {'print $6,$7,$8,$9,$10'} | grep "Dec" | sort -r | head -5
Dec 4 20:15 folder/
Dec 4 19:51 ./
Dec 4 17:42 Folder\ John/
Dec 4 16:19 Homework\ MAT\ 08/
Dec 4 16:05 Folder\ Smith/
etc..
How can I set up something like a regular expression to not include things like "./" and "../"?
Also, how can I omit the backslash "\" for folders that have spaces in their names? I'd also like to drop the slash at the end. Is this possible through a shell command, or would I have to use Perl to modify the text? I do want the date and time to remain as is. Any help would be greatly appreciated!
The box runs Linux and this is being done via SSH.
Edit:
Here's what I have so far (thanks to Mark and gbacon for this):
ls -laF | grep -vE ' ..?/?$' | awk '{ for (i=6; i<=NF; i++) printf("%s ", $i); printf("\n"); } ' | grep "Dec" | sort -r | head -5
I'm just having trouble with replacing "\ " with just a space " ". Other than that, thanks for all the help up to this point!

You can use find to do most of the work for you:
find -mindepth 1 -maxdepth 1 -printf "%Tb %Td %TH:%TM %f\n" | grep "^Dec" | sort -r
The parent directory (..) is not included by default. The -mindepth 1 gets rid of the current directory (.). You can remove the -maxdepth 1 to make it recursive, but you should change the %f to %p to include the path with the filename.
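For example, the recursive variant described above might look like this (a sketch, not tested against your tree):
find . -mindepth 1 -printf "%Tb %Td %TH:%TM %p\n" | grep "^Dec" | sort -r | head -5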
These are the fields in the -printf:
%Tb - short month name
%Td - day of the month
%TH:%TM - hours and minutes
%f - filename
In the grep I've added a match for the beginning of the line so it won't match a file named "Decimal" that was modified in November, for example.

Check and make sure your 'ls' command isn't aliased to something else. Typically, "raw" ls doesn't give you the / for directories, nor should it be escaping the spaces.
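A quick way to check (a sketch, assuming an interactive bash shell):
type ls         # shows whether ls is an alias, a function, or /bin/ls
command ls -la  # bypasses any alias and runs the real ls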
Clearly something is escaping the spaces for you; since awk tends to break fields up by whitespace, that's what the \ characters are for.
Spaces in file names seem designed specifically to frustrate easy script-and-pipe mashups like the one you're trying to write here.

You could filter the output of ls:
ls -la | grep -vE ' ..?/?$' | awk {'print $6,$7,$8,$9,$10'} | grep "Dec" | sort -r | head -5
If you're content to use Perl:
ls -la | perl -lane 's/\\ / /g;
print "@F[5..9]"
if $F[8] !~ m!^\.\.?/?$! &&
$F[5] eq "Dec"'

Here's one of your answers:
How can I set up something like a regular expression to not include things like "./" and "../",
Use ls -lA instead of ls -la.
Instead of printing out a fixed number of columns, you can print out everything from column 6 to the end of the line:
ls -lA | awk '{ for (i=6; i<=NF; i++) printf("%s ", $i); printf("\n"); } '
I don't get the spaces backslashed, so I don't know why you are getting that. To fix it you could add this:
| sed 's/\\//g'

What's with all the greps and seds???
ls -laF | awk '!/\.\.\/$/ && !/\.\/$/ &&/Dec/ { for (i=6; i<=NF; i++) printf("%s ", $i); printf("\n"); }'

Well, you can drop . and .. by adding grep -v "\." | grep -v "\.\."
Not sure about the rest.

It really irks me to see pipelines with awk and grep/sed. Awk is a very powerful line-processing tool.
ls -laF | awk '
/ \.\.?\/$/ {next}
/ Dec / {for (i=1; i<=5; i++) $i = ""; print}
' | sort -r | head -5
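If the backslash-escaped spaces from the question still show up, a gsub inside the same awk program would strip them without adding another pipe stage (a sketch, untested):
ls -laF | awk '
/ \.\.?\/$/ {next}
/ Dec / {gsub(/\\/, ""); for (i=1; i<=5; i++) $i = ""; print}
' | sort -r | head -5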

Related

bash find filenames with the same last four characters

I'd like to write a script that will execute a command that finds every set of files that have the same last four characters.
So for example, given a directory with these files,
$ ls -1
GH010119.MP4
GH010120.MP4
GH010126.MP4
GH010127.MP4
GH020119.MP4
GH020126.MP4
GH020127.MP4
GH030119.MP4
GH030126.MP4
I'd like my script to pick out these groups:
GH010119.MP4
GH020119.MP4
GH030119.MP4
GH010126.MP4
GH020126.MP4
GH030126.MP4
GH010127.MP4
GH020127.MP4
GH010120.MP4
My current solution is to pick out each group manually using find . -name "*0119*", so I'd also like to know whether the script I'd have to come up with would be overly complex in comparison....
With perl
perl -e 'for (glob("*")){$f{$1}.="$&\n" if /.*(.{4}).MP4/}print "$_\n" for (values %f)'
GH010126.MP4
GH020126.MP4
GH030126.MP4
GH010120.MP4
GH010119.MP4
GH020119.MP4
GH030119.MP4
GH010127.MP4
GH020127.MP4
I'm assuming the filenames without extension are all 8 characters and contain no newlines:
printf "%s\n" * |
sort -k1.5,1.8n |
awk '{key = substr($0,5,4)} NR==1{prev=key} prev != key {print ""} {print; prev=key}'
If the filename is not strictly 8 chars, then
for f in *; do
root=${f%%.*}
echo "${root: -4:4} $f"
done |
sort -k1,1n |
awk 'NR==1 {prev=$1} $1 != prev {print ""} {print $2; prev=$1}'
You can extract the groupings with something like
printf '%s\n' *.MP4 | sed 's/.*\(........\)$/\1/' | sort -u
With the extension .MP4 factored in, which is part of the file name no matter how you look at it, this extracts the last eight characters, and removes any duplicates.
Doing it in Awk might be a bit more efficient.
awk 'FNR == 1 { n = substr(FILENAME, length(FILENAME)-7);
if (seen[n]++ == 0) print n; nextfile }' *.MP4
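A minimal sketch that ties the suffix list back to the actual groups (assuming the GHxxxxxx.MP4 naming shown in the question):
for suffix in $(printf '%s\n' *.MP4 | sed 's/.*\(....\)\.MP4$/\1/' | sort -u); do
    ls *"$suffix".MP4   # one group per suffix
    echo                # blank line between groups
done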

How to remove all but the last 3 parts of FQDN?

I have a list of IP lookups and I wish to remove all but the last 3 parts, so:
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
would become
163data.com.cn
I have spent hours searching for clues, including parameter substitution, but the closest I got was:
$ string="98.254.237.114.broad.lyg.js.dynamic.163data.com.cn"
$ string1=${string%.*.*.*}
$ echo $string1
Which gives me the inverted answer of:
98.254.237.114.broad.lyg.js.dynamic
which is everything but the last 3 parts.
A script to do a list would be better than just the static example I have here.
Using CentOS 6, I don't mind if it by using sed, cut, awk, whatever.
Any help appreciated.
Thanks. Now that I have working answers, may I ask a follow-up: how would I then process the resulting list so that if the last part (after the last '.') is 3 characters, e.g. .com, .net, etc., only the last 2 parts are kept?
If this is against protocol, please advise how to post a follow-up question.
If parameter expansion inside another parameter expansion is supported, you can use this:
$ s='98.254.237.114.broad.lyg.js.dynamic.163data.com.cn'
$ # removing last three fields
$ echo "${s%.*.*.*}"
98.254.237.114.broad.lyg.js.dynamic
$ # pass output of ${s%.*.*.*} plus the extra . to be removed
$ echo "${s#${s%.*.*.*}.}"
163data.com.cn
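To run this over a whole list rather than one variable (a sketch, assuming one hostname per line in a file such as ip.txt):
while IFS= read -r s; do
    printf '%s\n' "${s#${s%.*.*.*}.}"
done < ip.txt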
You can also reverse the line, get the required fields, and then reverse it again. This makes it easier to change the field numbers.
$ echo "$s" | rev | cut -d. -f1-3 | rev
163data.com.cn
$ echo "$s" | rev | cut -d. -f1-4 | rev
dynamic.163data.com.cn
$ # and easy to use with file input
$ cat ip.txt
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
foo.bar.123.baz.xyz
a.b.c.d.e.f
$ rev ip.txt | cut -d. -f1-3 | rev
163data.com.cn
123.baz.xyz
d.e.f
echo $string | awk -F. '{ if (NF == 2) { print $0 } else { print $(NF-2)"."$(NF-1)"."$NF } }'
NF signifies the total number of fields separated by ".", so we want the last piece ($NF), the last but one ($(NF-1)) and the last but two ($(NF-2)).
$ echo $string | awk -F'.' '{printf "%s.%s.%s\n",$(NF-2),$(NF-1),$NF}'
163data.com.cn
Brief explanation:
Set the field separator to .
Print only the last 3 fields using $(NF-2), $(NF-1), and $NF.
And there's also another option you may try,
$ echo $string | awk -v FPAT='[^.]+.[^.]+.[^.]+$' '{print $NF}'
163data.com.cn
It sounds like this is what you need:
awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
e.g.
$ echo "$string" | awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
163data.com.cn
but with just 1 sample input/output it's just a guess.
wrt your followup question, this might be what you're asking for:
$ echo "$string" | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
163data.com.cn
$ echo 'www.google.com' | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
google.com
A version which uses just the shell and expr:
echo $(expr "$string" : '.*\.\(.*\..*\..*\)')
To use it with a file you can iterate with xargs:
File:
head list.dat
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
98.254.34.56.broad.kkk.76onepi.co.cn
98.254.237.114.polst.a65dal.com.cn
Iterating over the whole file:
cat list.dat | xargs -I^ -L1 expr "^" : '.*\.\(.*\..*\..*\)'
Note: it won't be very efficient at large scale, so you need to decide for yourself whether it is good enough for you.
Regexp explanation for '.*\.\(.*\..*\..*\)':
.*\.           - the rest of the string plus the final dot, which we are not interested in (outside the brackets)
\(.*\..*\..*\) - the brackets mark which part we extract; the \. are the dots that separate a specific number of words (here, three)
details:
http://tldp.org/LDP/abs/html/string-manipulation.html -> Substring Extraction
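Since bash is available, a variant that stays entirely in the shell (a sketch; it assumes the name has at least four dot-separated parts, as in the example) uses [[ =~ ]] and BASH_REMATCH instead of expr:
re='\.([^.]+\.[^.]+\.[^.]+)$'
if [[ $string =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"   # 163data.com.cn
fi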

Parsing functionality in shell script

If I am trying to look up which host bus is the hard drive attached to, I would use
ls -ld /sys/block/sd*/device
it returns
lrwxrwxrwx 1 root root 0 Oct 18 14:52 /sys/block/sda/device -> ../../../1:0:0:0
Now if I want to parse out that "1" at the end of the above string, what would be the quickest way?
Sorry I am very new to shell scripting, I can't make full use of this powerful scripting language.
Thanks!
Split on slashes, select the last field, split it on colons, and select the first result:
ls -ld /sys/block/sd*/device | awk -F'/' '{ split( $NF, arr, /:/ ); print arr[1] }'
It yields:
1
Try doing this:
$ ls -ld /sys/block/sd*/device | grep -oP '\d+(?=:\d+:\d:\d+)'
0
2
3
or
$ printf '%s\n' /sys/block/sd*/device |
xargs readlink -f |
grep -oP '\d+(?=:\d+:\d:\d+)'
And if you want only the first occurrence:
grep ...-m1 ...
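A plain-shell alternative with no awk or grep (a sketch, assuming the symlink targets look like the ../../../1:0:0:0 shown above):
for dev in /sys/block/sd*/device; do
    target=$(readlink "$dev")   # e.g. ../../../1:0:0:0
    target=${target##*/}        # -> 1:0:0:0
    echo "${target%%:*}"        # -> 1
done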

Extracting directory name from an absolute path using sed or awk

I want to split this line
/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh
to
/home/edwprod/abortive_visit/bin
using a sed or awk script. Could you help with this?
dirname
kent$ dirname "/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh"
/home/edwprod/abortive_visit/bin
sed
kent$ echo "/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh"|sed 's#/[^/]*$##'
/home/edwprod/abortive_visit/bin
grep
kent$ echo "/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh"|grep -oP '^/.*(?=/)'
/home/edwprod/abortive_visit/bin
awk
kent$ echo "/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh"|awk -F'/[^/]*$' '{print $1}'
/home/edwprod/abortive_visit/bin
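If you have a whole file of paths to process (say a hypothetical paths.txt, one path per line), the sed form above applies directly:
sed 's#/[^/]*$##' paths.txt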
Maybe the dirname command is what you're searching for?
dirname /home/edwprod/abortive_visit/bin/abortive_proc_call.ksh
Or if you want sed, see my solution:
echo /home/edwprod/abortive_visit/bin/abortive_proc_call.ksh | sed 's/\(.*\)\/.*/\1/'
dirname is now available on most platforms and Unix/Linux shells:
dirname /home/edwprod/abortive_visit/bin/abortive_proc_call.ksh
Using dirname is the simplest way, but it is not recommended for cross-platform scripting, for example in the latest version of the autoconf documentation: http://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/Limitations-of-Usual-Tools.html#Limitations-of-Usual-Tools
So here is my full-featured sed-based alternative to dirname:
str="/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh"
echo "$str" | sed -n -e '1p' | sed -e 's#//*#/#g' -e 's#\(.\)/$#\1#' -e 's#^[^/]*$#.#' -e 's#\(.\)/[^/]*$#\1#' -
Examples:
It works like dirname:
For path like /aa/bb/cc it will print /aa/bb
For path like /aa/bb it will print /aa
For path like /aa/bb/ it will print /aa too.
For path like /aa/ it will print /aa
For path like / it will print /
For path like aa it will print .
For path like aa/ it will print .
That is:
It works correctly with a trailing /
It works correctly with paths that contain only a base name, like aa and aa/
It works correctly with paths starting with /, and with the path / itself
It works correctly with any $str, whether or not it contains \n at the end (even with many \n)
It uses a cross-platform sed command
It changes all combinations of / (//, ///) to /
It can't work correctly with paths containing newlines or characters invalid in the current locale.
Note
Alternative for basename may be useful:
echo "$str" | awk -F"/" '{print $NF}' -
This awk code will work just the same as dirname, I guess. It's very simple and cheap to run. Good luck.
Code
$ foo=/app/java/jdk1.7.0_71/bin/java
$ echo "$foo" | awk -F "/*[^/]*/*$" '
{ print ($1 == "" ? (substr($0, 1, 1) == "/" ? "/" : ".") : $1); }'
Result
/app/java/jdk1.7.0_71/bin
Test
foo=/app/java/jdk1.7.0_71/bin/java -> /app/java/jdk1.7.0_71/bin
foo=/app/java/jdk1.7.0_71/bin/ -> /app/java/jdk1.7.0_71
foo=/app/java/jdk1.7.0_71/bin -> /app/java/jdk1.7.0_71
foo=/app/ -> /
foo=/app -> /
foo=fighters/ -> .
More
If that awk field delimiter isn't available to you, try it this way.
$ echo $foo | awk '{
dirname = gensub("/*[^/]*/*$", "", "", $0);
print (dirname == "" ? (substr($0, 1, 1) == "/" ? "/" : ".") : dirname);
}'
awk + for :
echo "/home/edwprod/abortive_visit/bin/abortive_proc_call.ksh" | awk 'BEGIN{res=""; FS="/";}{ for(i=2;i<=NF-1;i++) res=(res"/"$i);} END{print res}'
In addition to the answer of Kent, an alternative awk solution is:
awk 'BEGIN{FS=OFS="/"}{NF--}1'
which has the same shortcomings as the one presented by Kent. The following, somewhat longer awk corrects all the flaws:
awk 'BEGIN{FS=OFS="/"}{gsub("/+","/")}
{s=$0~/^\//;NF-=$NF?1:2;$0=$0?$0:(s?"/":".")};1' <file>
The following table shows the difference:
| path       | dirname | awk full | awk short |
|------------+---------+----------+-----------|
| .          | .       | .        |           |
| /          | /       | /        |           |
| foo        | .       | .        |           |
| foo/       | .       | .        | foo       |
| foo/bar    | foo     | foo      | foo       |
| foo/bar/   | foo     | foo      | foo/bar   |
| /foo       | /       | /        |           |
| /foo/      | /       | /        | /foo      |
| /foo/bar   | /foo    | /foo     | /foo      |
| /foo/bar/  | /foo    | /foo     | /foo/bar  |
| /foo///bar | /foo    | /foo     | /foo//    |
Note: dirname is the real way to go, unless you have to process masses of them stored in a file.
I'm always amazed at how clever some one-liners are. In my case, I went for readability, so here's my implementation. It's a vanilla Bourne shell script with a dependency on sed and grep that mimics dirname.
strip_trailing_slashes() {
printf "$1" | sed 's/[/]*$//'
}
# If empty arg, return .
# If nothing but slashes, return /
# If no slashes after stripping all trailing slashes, return .
# Otherwise, return everything up until last path component
dirname() {
if [ -z "$1" ]; then printf '.'; return; fi
if ! $(printf "$1" | grep -q '[^/]'); then printf '/'; return; fi
local s="$(strip_trailing_slashes $1)"
if ! $(printf "$s" | grep -q '/'); then printf '.'; fi
printf "$(strip_trailing_slashes $(printf $s | sed 's/[^/]*$//'))"
}
Here's a gist that includes 44 tests for the following cases -- and also tests against dirname for compatibility.
Expect '.' for ('""' '.' '.foo' './foo' 'foo' 'foo/' 'foo///' '..' '..foo' '..foo//' '../')
Expect '/' for ('/' '//' '///')
Expect 'a/b' for ('a/b/c' 'a/b/c/' 'a/b/c///')
Expect '/a/b' for ('/a/b/c' '/a/b/c/' '/a/b/c///')
Expect '//a/b' for ('//a/b/c' '//a/b/c//' '/a/b/c')
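After sourcing the functions above, a quick check of a few of the listed cases might look like this (expected results in the comments):
dirname ""            # .
dirname "foo///"      # .
dirname "///"         # /
dirname "/a/b/c///"   # /a/b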
A few points:
The one-liner sed version by @Роман Коптев is amazing (how can I be the only one who's upvoted it?).
The only two test failures are due to the one-liner version, where it differs from dirname. The one-liner version converts "//a/b/c" to "/a/b", which might not be what you want; dirname doesn't strip extra leading slashes and returns "//a/b". I chose to make my version compatible with dirname. (None of the versions strip extra slashes in the middle separating path components.)
Hats off to @Роман Коптев for the one-liner version!

Only get hash value using md5sum (without filename)

I use md5sum to generate a hash value for a file.
But I only need to receive the hash value, not the file name.
md5=`md5sum ${my_iso_file}`
echo ${md5}
Output:
3abb17b66815bc7946cefe727737d295 ./iso/somefile.iso
How can I 'strip' the file name and only retain the value?
A simple array assignment works... Note that the first element of a Bash array can be addressed by just the name without the [0] index, i.e., $md5 contains only the 32 hash characters of md5sum's output.
md5=($(md5sum file))
echo $md5
# 53c8fdfcbb60cf8e1a1ee90601cc8fe2
Using AWK:
md5=`md5sum ${my_iso_file} | awk '{ print $1 }'`
You can use cut to split the line on spaces and return only the first such field:
md5=$(md5sum "$my_iso_file" | cut -d ' ' -f 1)
On Mac OS X:
md5 -q file
md5="$(md5sum "${my_iso_file}")"
md5="${md5%% *}" # remove the first space and everything after it
echo "${md5}"
Another way is to do:
md5sum filename | cut -f 1 -d " "
cut will split the line at each space and return only the first field.
By leaning on head:
md5_for_file=`md5sum ${my_iso_file}|head -c 32`
One way:
set -- $(md5sum $file)
md5=$1
Another way:
md5=$(md5sum $file | while read sum file; do echo $sum; done)
Another way:
md5=$(set -- $(md5sum $file); echo $1)
(Do not try that with backticks unless you're very brave and very good with backslashes.)
The advantage of these solutions over other solutions is that they only invoke md5sum and the shell, rather than other programs such as awk or sed. Whether that actually matters is then a separate question; you'd probably be hard pressed to notice the difference.
If you need to print it and don't need a newline, you can use:
printf $(md5sum filename)
md5=$(md5sum < $file | tr -d ' -')
md5=`md5sum ${my_iso_file} | cut -b-32`
md5sum puts a backslash before the hash if there is a backslash in the file name. The first 32 characters or anything before the first space may not be a proper hash.
It will not happen when using standard input (file name will be just -), so pixelbeat's answer will work, but many others will require adding something like | tail -c 32.
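A hedged illustration of that corner case (assuming GNU coreutils md5sum; the hash shown is the well-known MD5 of an empty file):
touch 'odd\name'                    # a file name containing a backslash
md5sum 'odd\name'
# \d41d8cd98f00b204e9800998ecf8427e  odd\\name    <- leading backslash, escaped name
md5sum < 'odd\name' | head -c 32    # reading from stdin avoids the escaping
# d41d8cd98f00b204e9800998ecf8427e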
If you're concerned about screwy filenames:
md5sum < "${file_name}" | awk NF=1
f244e67ca3e71fff91cdf9b8bd3aa7a5
Other, messier ways to deal with this:
md5sum "${file_name}" | awk NF=NF OFS= FS=' .*$'
or
| awk '_{ exit }++_' RS=' '
f244e67ca3e71fff91cdf9b8bd3aa7a5
To do it entirely inside awk:
mawk 'BEGIN {
__ = ARGV[ --ARGC ]
_ = sprintf("%c",(_+=(_^=_<_)+_)^_+_*++_)
RS = FS
gsub(_,"&\\\\&",__)
( _=" md5sum < "((_)(__)_) ) | getline
print $(_*close(_)) }' "${file_name}"
f244e67ca3e71fff91cdf9b8bd3aa7a5
Well, I had the same problem today, but I was trying to get the file MD5 hash when running the find command.
I took the most-voted answer and wrapped it in a function called md5 to run in the find command. The mission for me was to calculate the hash for all files in a folder and output it as hash:filename.
md5() { md5sum $1 | awk '{ printf "%s",$1 }'; }
export -f md5
find -type f -exec bash -c 'md5 "$0"' {} \; -exec echo -n ':' \; -print
So I put together pieces from here and also from the question 'find -exec a shell function in Linux'.
For the sake of completeness, a way with sed using a regular expression and a capture group:
md5=$(md5sum "${my_iso_file}" | sed -r 's:\\*([^ ]*).*:\1:')
The regular expression captures everything up to the first space in a group. To get only the capture group out, you need to match (and replace) the whole line in sed.
(More about sed and capture groups here: How can I output only captured groups with sed?)
As delimiter in sed, I use colons because they are not valid in file paths and I don't have to escape the slashes in the filepath.
Another way:
md5=$(md5sum ${my_iso_file} | sed 's/ .*//')
md5=$(md5sum < index.html | head -c -4)
