GREP: exclude file extensions in specific directory - bash

My code takes the added, modified, deleted, renamed, and copied files from git status -s and compares them with a list of file paths read from a file.
git status -s |
grep -E "^M|^D|^A|^R|^C" |
awk '{if ($1~/M+/ || $1~/D+/ || $1~/A+/ || $1~/R+/ || $1~/C+/) print $2}' |
grep --file=$list_of_files --fixed-strings |
grep -r --exclude="*.jar" "SVCS/bus/projects/Resources/"
Prints out git status like M foo.txt
Does some "filtering" operations
More filtering operations
Takes the paths of the files to compare from a text file
Here I am trying to make the last step exclude .jar files from a specific directory.
How can I do the last step? Or do I need to add something to the 4th step?

The simple fix is to change the last line to
grep -v 'SVCS/bus/projects/Resources/.*\.jar$'
but that really is some horrible code you have there.
Keeping in mind that grep | awk and awk | grep are antipatterns, how about this refactoring?
git status -s |
grep -E "^M|^D|^A|^R|^C" |
awk '{if ($1~/M+/ || $1~/D+/ || $1~/A+/ || $1~/R+/ || $1~/C+/)
... Hang on, what's the point of that? The grep already made sure that $1 contains one or more of those letters. The + quantifier is completely redundant here.
print $2}'
Will break on files with whitespace in them. This is a very common error which is aggravating because a lot of the time, the programmer knew it would break, but just figured "can't happen here".
git status -s | awk 'NR==FNR { files[$0] = 1; next }
    /^[MDARC]/ { gsub(/^[MDARC]+ /, "");
        if ($0 ~ /SVCS\/bus\/projects\/Resources\/.*\.jar$/)
            next;
        if ($0 in files) print }' "$list_of_files" -
The NR==FNR thing is a common idiom to read the first file into an array, then fall through to the next input file. So we read $list_of_files into the keys of the associative array files; then if the file name we read from git status is present in the keys, we print it. The condition to skip .jar files in a particular path is then a simple addition to this Awk script.
This assumes $list_of_files really is a list of actual files, as suggested by the file name. Your code will look for a match anywhere in that file, so a partial file name would also match (for example, if the file contains path/to/ick, a file named somepath/to/icktys/mackerel would match, and thus be printed). If that is the intended functionality, the above script will require some rather drastic modifications.
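If substring matching really is what you want, one way to get it (an untested sketch, assuming each line of $list_of_files is a fixed string to look for anywhere in the path) is to store the patterns in an array and use index() instead of the hash lookup:
git status -s | awk 'NR==FNR { pat[++n] = $0; next }
    /^[MDARC]/ { gsub(/^[MDARC]+ /, "");
        if ($0 ~ /SVCS\/bus\/projects\/Resources\/.*\.jar$/) next;
        for (i = 1; i <= n; i++)
            if (index($0, pat[i])) { print; break } }' "$list_of_files" -
index() does a fixed-string substring search, which mirrors the behaviour of your original grep --fixed-strings.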

Related

Bash - how to copy latest files by filename to another folder?

Let's say I have these files in folder Test1
AAAA-12_21_2020.txt
AAAA-12_20_2020.txt
AAAA-12_19_2020.txt
BBB-12_21_2020.txt
BBB-12_20_2020.txt
BBB-12_19_2020.txt
I want to copy the latest of these files to folder Test2:
AAAA-12_21_2020.txt
BBB-12_21_2020.txt
This code would work:
ls -U "$1" | sort | cut -f 1 -d "-" | uniq | while read -r prefix; do
    ls "$1/$prefix"-* | sort -t '_' -k3,3V -k1,1V -k2,2V | tail -n 1
done
We first iterate over every prefix in the directory specified as the first argument; the prefixes are obtained by sorting the file list, cutting off everything from the first - onward, and removing duplicates. Then, for each prefix, we sort the matching filenames by the three fields separated by _ using the -k option of sort (primarily by the year in the third field, then the month, and lastly the day) and keep the last, i.e. most recent, one. We use version sort (V) so that sort ignores the surrounding text and compares the numbers as numbers rather than lexicographically.
I'm not sure whether this is the best way to do this, as I used only basic bash functions. Because of the date format and the fact that you have to differentiate prefixes, you have to parse the string fully, which is a job better suited for AWK or Perl.
Nonetheless, I would suggest using day-month-year or year-month-day format for machine-readable filenames.
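For instance, with hypothetical year-first names such as AAAA-2020_12_21.txt, plain lexicographic sorting already puts the newest file last, so the body of the loop above would shrink to something like:
ls "$1/$prefix"-* | sort | tail -n 1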
Using awk:
ls -1 Test1/ | awk -v src_dir="Test1" -v target_dir="Test2" -F '(-|_)' '{p=$4""$2""$3; if(!($1 in b) || b[$1] < p){b[$1]=p; a[$1]=$0}} END {for (i in a) {system ("mv "src_dir"/"a[i]" "target_dir"/")}}'

Reading groups of lines from a large text file

I am looking to pull certain groups of lines from large (~870,000,000 line) text files. For example in a 50 line file I might want lines 3-6, 18-27, and 39-45.
From browsing Stack Overflow, I have found that the bash command:
tail -n+NUMstart file |head -nNUMend
is the fastest way to get a single line or group of lines starting at NUMstart and going to NUMend. However when reading multiple groups of lines this seems inefficient. Normally the technique wouldn't matter so much, but with files this large it makes a huge difference.
Is there a better way to go about this than using the above command for each group of lines? I am assuming the answer will most likely be a bash command but am really open to any language/tool that will do the job best.
To show lines 3-6, 18-27 and 39-45 with sed:
sed -n "3,6p;18,27p;39,45p" file
It is also possible to feed sed from a file.
Content of file foobar:
3,6p
18,27p
39,45p
Usage:
sed -n -f foobar file
awk to the rescue!
awk -v lines='3-6,18-27,39-45' '
BEGIN { n = split(lines, a, ",")
        for (i = 1; i <= n; i++) {
            split(a[i], t, "-")
            rs[++c] = t[1]; re[c] = t[2]
        } }
{ for (i = s; i <= c; i++)
      if (NR >= rs[i] && NR <= re[i]) { print; next }
      else if (NR > re[i]) s++;
  if (s > c) exit }' file
This provides an early exit after the last printed line. There is no error checking; the ranges must be given in increasing order.
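For a quick sanity check you can run it against the line numbers themselves, saving the program above to a hypothetical file ranges.awk:
seq 50 > testfile
awk -v lines='3-6,18-27,39-45' -f ranges.awk testfile
This should print 3 through 6, 18 through 27 and 39 through 45, one number per line.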
The problem with tail -n XX file | head -n YY for different ranges is that you are running it several times, hence the inefficiency. Otherwise, benchmarks suggest that tail | head is the fastest approach for extracting a single range.
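For a single contiguous block that would look like, for example:
tail -n +18 file | head -n 10    # lines 18 through 27: 27 - 18 + 1 = 10 lines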
For this specific case, you may want to use awk:
awk '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2) || ...' file
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45)' file
That is, you group the ranges and let awk print the corresponding lines when they occur, looping through the file just once. It may also be useful to add a final NR==endX {exit} (endX being the closing line number of the last range) so that it stops processing once it has read the last interesting line.
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45); NR==45 {exit}' file

getting the last opened file

input file:
wtf.txt|/Users/jaro/documents/inc/face/|
lol.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
lol.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
omg.txt|/Users/jaro/documents/inc/linked/|
The input file is the list of opened files (each line represents one file being opened). I want to get the last opened file in a given directory,
e.g.: get the last opened file in dir /Users/jaro/documents/inc/face/
output:
wtf.txt
This fetches the last line in the file whose second field is the desired folder name, and prints the first field.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { f=$1 }
END { print f }' file
To test whether the most recent file is also an existing file, I would use the shell to reverse the order with tac and perform the logic: skip the files in the wrong path and the ones which don't exist, then print the first success and quit.
tac file |
while IFS='|' read -r basename path _; do
    case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
    test -e "$path$basename" || continue
    echo "$basename"
    break
done |
grep .
The final grep . is to produce an exit code which reflects whether or not the command was successful -- if it printed a file, it's okay; if none of the extracted files existed, return error.
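For example, if you put the pipeline above in a hypothetical script lastopened.sh, the exit code lets you branch on the result:
if file=$(./lastopened.sh); then
    printf 'last opened existing file: %s\n' "$file"
else
    echo 'no matching file found' >&2
fi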
Below is my original answer, based on a plausible but apparently incorrect interpretation of your question.
Here is a quick attempt at finding the file with the newest modification time from the list. I avoid parsing ls, preferring instead to use properly machine-parseable output from stat. Since your input file is line-oriented, I assume no file names contain newlines, which simplifies things quite a bit.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -f '%m %N' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'
The first sort is to remove any duplicates, to avoid running stat more times than necessary (premature optimization, perhaps). The stat prefixes each line with the file's modification time expressed as seconds since the epoch, which facilitates easy numerical sorting by age, and the final Awk script neatly combines head -n 1 | rev | cut -d / -f1 | rev, i.e. it extracts just the basename from the first line of output, then quits.
If there is any way to use a less wacky input format, that would be an improvement (probably of your life in general as well).
The output format from stat is not properly standardized, but your question is tagged both linux and osx, so I assume BSD (macOS) stat, which is what the -f '%m %N' syntax above targets. If portability is desired, maybe look at find (which however may be overkill and/or not much better standardized across diverse platforms) or write a small Perl or Python script instead. (Well, Ruby too, I suppose, but personally, I'd go with Perl.)
perl -F'\|' -lane '{ $t{$F[0]} = (stat($F[1].$F[0]))[9]
    if !defined $t{$F[0]} and $F[1] eq "/Users/jaro/documents/inc/face/" }
    END { print ((sort { $t{$a} <=> $t{$b} } keys %t)[-1]) }' file
atime – The atime (access time) is the time when the data of a file was last accessed. Displaying the contents of a file or executing a shell script will update a file’s atime, for example. You can view the atime with the ls -lu command
http://www.techtrunch.com/linux/ctime-mtime-atime-linux-timestamps
So in your case, this will do the trick:
ls -lu /Users/jaro/documents/inc/face/
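Note that -u on its own only changes which timestamp ls displays; to actually pick the most recently accessed file you would combine it with -t so the listing is sorted by access time, e.g. (a rough sketch):
ls -tu /Users/jaro/documents/inc/face/ | head -n 1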

Save changes to a file AWK/SED

I have a huge text file delimited with commas.
19429,(Starbucks),390 Provan Walk,Glasgow,G34 9DL,-4.136909,55.872982
The first field is a unique id. I want the user to enter the id and a value for one of the following 6 fields so that it can be replaced. I'm also asking them to enter a value from 2 to 7 to identify which field should be replaced.
Now I've done something like this. I am checking every line to find the id the user entered and then replacing the value.
awk -F ',' -v elem=$element -v id=$code -v value=$value '{if($1==id) {if(elem==2) { $2=value } etc }}' $path
Where $path = /root/clients.txt
Let's say the user enters "2" in order to replace the second field, and also enters "Whatever". Now I want "(Starbucks)" to be replaced with "Whatever". What I've done works fine but does not save the change to the file. I know that awk is not supposed to do so, but I don't know how to do it. I've searched a lot on Google but still no luck.
Can you tell me how I'm supposed to do this? I know that I can do it with sed but I don't know how.
Newer versions of GNU awk support inplace editing:
awk -i inplace -v elem="$element" -v id="$code" -v value="$value" '
BEGIN{ FS=OFS="," } $1==id{ $elem=value } 1
' "$path"
With other awks:
awk -v elem="$element" -v id="$code" -v value="$value" '
BEGIN{ FS=OFS="," } $1==id{ $elem=value } 1
' "$path" > /usr/tmp/tmp$$ &&
mv /usr/tmp/tmp$$ "$path"
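For example, with the sample record from the question in an assumed clients.txt and hypothetical user input element=2, code=19429, value=Whatever:
element=2 code=19429 value=Whatever
awk -v elem="$element" -v id="$code" -v value="$value" '
BEGIN{ FS=OFS="," } $1==id{ $elem=value } 1
' clients.txt
prints the updated record
19429,Whatever,390 Provan Walk,Glasgow,G34 9DL,-4.136909,55.872982
along with every other line unchanged.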
NOTES:
Always quote your shell variables unless you have an explicit reason not to and fully understand all of the implications and caveats.
If you're creating a tmp file, use "&&" before replacing your original with it so you don't zap your original file if the tmp file creation fails for any reason.
I fully support replacing Starbucks with Whatever in Glasgow - I'd like to think they wouldn't have let it open in the first place back in my day (1986 Glasgow Uni Comp Sci alum) :-).
awk is much easier than sed for processing specific variable fields, but it does not have in-place processing. Thus you might do the following:
#!/bin/bash
code=$1
element=$2
value=$3
echo "code is $code"
awk -F ',' -v elem="$element" -v id="$code" -v value="$value" 'BEGIN{OFS=",";} /^'"$code"',/{$elem=value}1' mydb > /tmp/mydb.txt
mv /tmp/mydb.txt ./mydb
This finds a match for a line starting with code followed by a comma (you could also use $1==id with the id variable passed via -v), then sets the elemth field to value; finally it prints the output, using the comma as the output field separator. If nothing matches, it just echoes the input line.
Everything is written to a temporary file, then overwrites the original.
Not very nice but it gets the job done.
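Assuming you save the script as, say, update_field.sh and make it executable, a call like
./update_field.sh 19429 2 Whatever
would rewrite the second field of record 19429 in mydb from (Starbucks) to Whatever.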

Shell: How to delete mercurial files using an automated script?

I want to make a tiny script that deletes ALL the files in my Symfony project that Mercurial reports as unwanted files.
For example:
hg status:
...
? web/images/lightwindow/._arrow-up.gif
? web/images/lightwindow/._black-70.png
? web/images/lightwindow/._black.png
? web/images/lightwindow/._nextlabel.gif
? web/images/lightwindow/._pattern_148-70.png
? web/images/lightwindow/._pattern_148.gif
? web/images/lightwindow/._prevlabel.gif
? web/js/._lightwindow.js
? web/sfPropel15Plugin
? web/sfProtoculousPlugin
I would like to delete all the files that are marked with the ?. ONLY THOSE. Not the ones modified -M-, and so on.
I'm trying to do a mini-script for that:
hg status | grep '^?*' | rm -f
I don't know if it is OK. Could you help me with one?
You're missing xargs, which takes the input and gives it to a command as parameters (right now you're actually sending them to rm's standard input, which isn't meaningful). Something like:
hg status | grep '^?' | cut -d' ' -f2 | xargs rm -f
Note: it won't work if your file names contain spaces. It'd still be possible but you need to be more clever.
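One way to be more clever is to let hg print just the unknown file names (no ? column left to strip) and remove them in a read loop, which keeps whitespace intact; a sketch, to be run from the repository root:
hg status -un | while IFS= read -r f; do
    rm -f -- "$f"
done
Here -u restricts the output to unknown files and -n suppresses the status letters. This still cannot cope with newlines in file names.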
Try this:
hg status | awk '/^\? /{gsub(/^\? /, "", $0); print;}' | while IFS= read -r line; do
    rm -f "$line"
done
The awk command matches every line starting with "? ", and executes the block '{gsub(/^\? /, "", $0);print;}'. The block does a substitution on $0 (the entire line matched), replacing the leading "? " with nothing, making $0 just the filename. The print then prints $0 (print with no arguments defaults to printing $0).
So the awk output is a list of filenames, one per line. This is fed into a read loop, which removes each one.
This will preserve whitespace in filenames, but it will break if any filenames contain newlines! Handling newlines gracefully is impossible with hg status as the input, since hg status prints newline-separated output.
