Remove files based on integer.name - bash

Ok what I am trying to do is very specific. I need some code that will remove files from a directory based on integer.name.
The files in the directory are listed like this
441.TERM (the # is actually a PID so it'll be random)
442.TERM
No matter what, I always want to keep the first .TERM file and remove any .TERM file after that, as no more than one should ever be created by my script, but it does happen sometimes due to some issues with the system I am scripting on. I only want it to affect my 000.TERM files; any other files it finds in the directory can stay. So if the directory contains any .TERM file with an integer higher than the first one found, remove the .TERM files with the higher integers.
PS. .TERM is not an extension just in case there is any confusion.

cd /your/path && find . -maxdepth 1 -name "*.TERM" | sort -t/ -k2 -n | tail -n +2 | xargs -r rm
Let's break it down:
cd /your/path && find . -maxdepth 1 -name "*.TERM" will output a list of all .TERM files in that directory, each in the form ./441.TERM. (-maxdepth is an extension, but both GNU and BSD find support it.)
You could also use ls *.TERM, but you may find the output unpredictable. (Example: your implementation may have -F on by default, which would cause every socket to end in a = in the list.)
sort sorts them by the second field (-k2) using a slash as the separator (-t/), so the sort key starts at the number rather than at the ./ prefix. -n guarantees a numeric sort (such that 5 comes before 06). The key has to begin with the digits: sorting the full path numerically would treat every line as 0.
tail -n +2 skips the first line and returns the rest.
xargs rm passes the remaining lines to rm as arguments, removing them. -r skips running rm if there's no input piped in, but is listed as a GNU extension.
The script as above is fairly robust for your needs. A very long list is not a problem in itself, because xargs splits its input across several rm invocations when it does not fit on one command line. It might get you into trouble, though, if any matching filenames somehow contain whitespace, quotes, or a newline.
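For comparison, here is a minimal sketch that avoids the pipeline entirely and works on the glob results directly. The function name prune_term_files is made up for illustration, and it assumes that "first" means the lowest integer and that only names of the form <digits>.TERM should ever be touched.

```shell
# Hypothetical helper: keep the lowest-numbered <digits>.TERM file in a
# directory and remove the other <digits>.TERM files. Everything else stays.
prune_term_files() {
    local dir="$1" f base num keep='' keep_num=''
    # Pass 1: find the lowest number among names of the form <digits>.TERM.
    for f in "$dir"/*.TERM; do
        [ -f "$f" ] || continue
        base=${f##*/}
        num=${base%.TERM}
        case $num in ''|*[!0-9]*) continue ;; esac
        if [ -z "$keep" ] || [ "$num" -lt "$keep_num" ]; then
            keep=$base
            keep_num=$num
        fi
    done
    [ -n "$keep" ] || return 0
    # Pass 2: remove every other <digits>.TERM file.
    for f in "$dir"/*.TERM; do
        base=${f##*/}
        num=${base%.TERM}
        case $num in ''|*[!0-9]*) continue ;; esac
        [ "$base" = "$keep" ] || rm -f -- "$f"
    done
}
```

Calling prune_term_files /your/path would then leave the lowest-numbered file in place. Because nothing is parsed from command output, unusual characters in other filenames cannot confuse it.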

Related

Passing a file as a parameter

I have a .sh script that creates a file.log. This file contains many git logs, searched over a range of dates that the user passed in earlier (just to explain what it holds).
Now I need to take this file.log, which comes from that external program, and use it with this code:
find ./* -type f -exec grep -l 'a1009206_vcr' {} \; > file.log
my question is how can I do this?
Recurse with Grep While Ignoring Missing/Unreadable Files
The BSD and GNU grep utilities have options that can save you the hassle of using find, xargs, et al. in many cases. This is one of those. For example:
grep -Flrs "a1009206_vcr" . > file.log
This uses the following flags:
-F, --fixed-strings
Interpret pattern as a set of fixed strings (i.e. force grep to behave as fgrep).
-l, --files-with-matches
Only the names of files containing selected lines are written to standard output. grep will only search a file until a match has been found, making searches potentially less expensive. Pathnames are listed once per file searched. If the standard input is searched, the string ``(standard input)'' is written.
-R, -r, --recursive
Recursively search subdirectories listed.
-s, --no-messages
Silent mode. Nonexistent and unreadable files are ignored (i.e. their error messages are suppressed).
Together, these flags recurse down through the present working directory (i.e. . or $PWD if you prefer) and write a list of filenames with matches to file.log. The -s flag keeps permissions errors or other cruft from cluttering your output. You can also close standard error with 2>&- if you're so inclined.
Caveat: Symlinks and Recursion
The above should work in most cases, but you may also need to add either -O or -p (both BSD grep options) if you're recursing and don't want to follow some or all of your symlinks. With GNU grep, -r follows only the symlinks named on the command line, while -R follows all of them. The man page has more specifics about grep's default behavior regarding symlinks, with and without recursing.
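As a small sketch of how the grep invocation above might be wrapped for reuse (the function name list_matching_files is hypothetical, not something from the question):

```shell
# Hypothetical wrapper around the recursive grep shown above.
# $1 = fixed string to look for, $2 = directory to search.
# -F fixed string, -l names only, -r recurse, -s suppress error messages.
list_matching_files() {
    grep -Flrs -- "$1" "$2"
}
```

For example, list_matching_files a1009206_vcr . > /tmp/file.log writes the list outside the tree being searched, so the output file is not itself one of the files grep reads.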

How to delete one set of files in a directory containing similarly named files?

A series of several hundred directories contains files in the following pattern:
Dir1:
-text_76.txt
-text_81.txt
-sim_76.py
-sim_81.py
Dir2:
-text_90.txt
-text_01.txt
-sim_90.py
-sim_01.py
Within each directory, the files beginning with text or sim are essentially duplicates of the other text or sim file, respectively. Each set of duplicate files has a unique numerical identifier. I only want one set per directory. So, in Dir1, I would like to delete everything in the set labeled either 81 OR 76, with no preference. Likewise, in Dir2, I would like to delete either the set labeled 90 OR 01. Each directory contains exactly two sets, and there is no way to predict the random numerical IDs used in each directory. How can I do this?
Assuming you always have 1 known file, say text_xx.txt then you could run this script in each sub-directory:
ls text_*.txt | { read first; rm *"${first:4:4}"*; };
This will list all files matching the wildcard pattern text_*.txt. Using read takes only the first matching result of the ls command. This will result in a $first shell variable containing one fully expanded match: text_xx.txt. After that, ${first:4:4} takes a substring of this fully expanded match to get the characters _xx. (four characters starting at offset 4), which relies on knowing the lengths of text_ and xx. Finally, rm *"${first:4:4}"* wraps that substring in wildcards and executes the result as a command: rm *_xx.*.
I chose to include _ and . around xx to be a bit conservative about what gets deleted.
If the length of xx is not known, things get a bit more complicated. A stricter command, which only acts on two-character identifiers, might be:
ls text_??.txt | { read first; rm *_"${first:5:2}".*; };
This should remove one "fileset" every time it is run in a given sub-directory. If there is only 1 fileset, it would still remove the fileset.
Edit: Simplified to remove unnecessary use of IFS command.
Edit: Attempt to expand on and clarify the explanation.
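A sketch of the same idea that does not parse ls and does not depend on the length of the identifier. The function name remove_one_set is invented for this example, and it assumes the files are named text_<id>.txt and <something>_<id>.<ext> as in the question. It also leaves a directory alone when only one set remains.

```shell
# Hypothetical helper: delete one of the two file sets in directory "$1".
remove_one_set() {
    local dir="$1" first id
    set -- "$dir"/text_*.txt
    # Fewer than two text_ files means at most one set is left: keep it.
    [ -e "${2:-}" ] || return 0
    first=${1##*/}      # e.g. text_76.txt
    id=${first#text_}   # 76.txt
    id=${id%.txt}       # 76
    # Remove every file of the set, e.g. text_76.txt and sim_76.py.
    rm -- "$dir"/*_"$id".*
}
```

Run over several hundred directories, this could be driven by a loop such as for d in */; do remove_one_set "$d"; done.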
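ls | grep -E '_81\.' | xargs -d '\n' rm
ls | grep -E '_90\.' | xargs -d '\n' rm
How it works (run the first command in Dir1 and the second in Dir2):
ls lists all files (one per line, since the output is piped).
grep -E keeps only the names that contain the chosen identifier between _ and ., so exactly one of the two sets is selected. A bracket expression such as [81|76] would not work here: it matches any single one of those characters rather than either number.
xargs -d '\n' rm treats each line as one argument and passes them to rm. (-d is a GNU xargs option.)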

Diff files in two folders ignoring the first line

I have two folders of files that I want to diff, except I want to ignore the first line in all the files. I tried
diff -Nr <(tail -n +1 folder1/) <(tail -n +1 folder2/)
but that clearly isn't the right way.
If the first lines that you want to ignore have a distinctive format that can be matched by a POSIX regular expression, then you can use diff's --ignore-matching-lines=... option to tell it to ignore those lines.
Failing that, the approach you want to take probably depends on your exact requirements. You say you "want to diff" the files, but it's not obvious exactly how faithfully your resulting output needs to match what you would get from diff -Nr if it supported that feature. (For example, do you need the line numbers in the diff to correctly identify the line numbers in the original files?)
The most precisely faithful approach would probably be as follows:
1. Copy each directory to a fresh location, using cp --recursive ....
2. Edit the first line of each file to prepend a magic string like IGNORE_THIS_LINE::, using something like find -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' '{}' ';'.
3. Use diff -Nr --ignore-matching-lines=^IGNORE_THIS_LINE:: ... to compare the results.
4. Pipe the output to sed s/IGNORE_THIS_LINE:://, so as to filter out any occurrences of IGNORE_THIS_LINE:: that still show up (due to being within a few lines of non-ignored differences).
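The steps above can be sketched as a single function. This is only an illustration: the name diff_ignoring_first_line is made up, it assumes GNU sed (for -i) and GNU diff, and the paths in its output refer to the temporary copies a/ and b/ rather than to the original folders.

```shell
# Hypothetical helper: recursive diff of two directories that ignores
# differences confined to the first line of each file.
diff_ignoring_first_line() {
    local a="$1" b="$2" work
    work=$(mktemp -d) || return 2
    # Step 1: work on copies so the originals are never modified.
    cp -R -- "$a" "$work/a" && cp -R -- "$b" "$work/b" ||
        { rm -rf -- "$work"; return 2; }
    # Step 2: tag the first line of every file with the magic string.
    find "$work" -type f -exec sed -i '1 s/^/IGNORE_THIS_LINE::/' {} +
    # Steps 3 and 4: diff while ignoring tagged lines, then strip the tag
    # from any tagged line that still appears in the output.
    ( cd "$work" && diff -Nr --ignore-matching-lines='^IGNORE_THIS_LINE::' a b ) |
        sed 's/IGNORE_THIS_LINE:://'
    rm -rf -- "$work"
}
```

It would be called as diff_ignoring_first_line folder1 folder2.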
Using process substitution is the correct way to create intermediate input file descriptors, but tail doesn't work on folders. Just iterate over all the files in the folder:
for f in folder1/*.txt; do
    tail -n +2 "$f" | diff - <(tail -n +2 "folder2/$(basename "$f")")
done
Note that I used +2 instead of +1: tail -n +N starts output at line N, and line numbering starts at 1, not 0, so +1 would print the whole file.

Iterate through list of filenames in order they were created in bash

Parsing the output of ls to iterate through a list of files is bad. So how should I go about iterating through a list of files in the order in which they were first created? I browsed several questions here on SO and they all seem to be parsing ls.
The embedded link suggests:
Things get more difficult if you wanted some specific sorting that
only ls can do, such as ordering by mtime. If you want the oldest or
newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ
99 instead. If you truly need a list of all the files in a directory
in order by mtime so that you can process them in sequence, switch to
perl, and have your perl program do its own directory opening and
sorting. Then do the processing in the perl program, or -- worst case
scenario -- have the perl program spit out the filenames with NUL
delimiters.
Even better, put the modification time in the filename, in YYYYMMDD
format, so that glob order is also mtime order. Then you don't need ls
or perl or anything. (The vast majority of cases where people want the
oldest or newest file in a directory can be solved just by doing
this.)
Does that mean there is no native way of doing it in bash? I don't have the liberty to modify the filename to include the time in them. I need to schedule a script in cron that would run every 5 minutes, generate an array containing all the files in a particular directory ordered by their creation time and perform some actions on the filenames and move them to another location.
The following worked but only because I don't have funny filenames. The files are created by a server so it will never have special characters, spaces, newlines etc.
files=( $(ls -1tr) )
I can write a perl script that would do what I need but I would appreciate if someone can suggest the right way to do it in bash. Portable option would be great but solution using latest GNU utilities will not be a problem either.
sorthelper=()
for file in *; do
    # We need something that can easily be sorted.
    # Here, we use "<date> <filename>".
    # Note that this works with any special characters in filenames.
    # Enable exactly one of the two stat lines below.
    # sorthelper+=("$(stat -n -f "%Sm %N" -t "%Y%m%d%H%M%S" -- "$file")") # Mac OS X only
    # or
    sorthelper+=("$(stat --printf "%Y %n" -- "$file")") # Linux only
done
sorted=()
while IFS= read -r -d $'\0' elem; do
    # this strips away everything up to the first space (<date>)
    sorted+=("${elem#* }")
done < <(printf '%s\0' "${sorthelper[@]}" | sort -z -n)
for file in "${sorted[@]}"; do
    # do your stuff...
    echo "$file"
done
Other than sort and stat, all commands are actual native Bash commands (builtins)*. If you really want, you can implement your own sort using Bash builtins only, but I see no way of getting rid of stat.
The important parts are read -d $'\0', printf '%s\0' and sort -z. All these commands are used with their null-delimiter options, which means that any filename can be processed safely. Also, the use of double-quotes in "$file" and "${anarray[@]}" is essential.
*Many people feel that the GNU tools are somehow part of Bash, but technically they're not. So, stat and sort are just as non-native as perl.
With all of the cautions and warnings against using ls to parse a directory notwithstanding, we have all found ourselves in this situation. If you do find yourself needing sorted directory input, then about the cleanest use of ls to feed your loop is ls -opts | while read -r name; do ... This will handle embedded spaces in filenames, etc., without requiring a reset of IFS, due to the nature of read itself. Example:
ls -1rt | while read -r fname; do # where '1' is ONE not little 'L'
So do look for cleaner solutions avoiding ls, but if push comes to shove, ls -opts can be used sparingly without the sky falling or dragons plucking your eyes out.
Let me add the disclaimer to keep everyone happy. If you like newlines inside your filenames, then do not use ls to populate a loop. If you do not have newlines inside your filenames, there are no other adverse side-effects.
Contra: TLDP Bash Howto Intro:
#!/bin/bash
for i in $( ls ); do
echo item: $i
done
It appears that SO users do not know what the use of contra means -- please look it up before downvoting.
You can try using the stat command piped to sort:
stat -c '%Y %n' * | sort -t ' ' -nk1 | cut -d ' ' -f2-
Update: To deal with filenames containing newlines, we can use the %N format in stat, and instead of cut we can use awk, like this:
LANG=C stat -c '%Y^A%N' *| sort -t '^A' -nk1| awk -F '^A' '{print substr($2,2,length($2)-2)}'
Use of LANG=C is needed to make sure stat uses single quotes only in quoting file names.
^A is the control-A character, typed by pressing Ctrl-V followed by Ctrl-A.
How about a solution with GNU find + sed + sort?
As long as there are no newlines in the file name, this should work:
find . -type f -printf '%T@ %p\n' | sort -k 1nr | sed 's/^[^ ]* //'
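If newlines in file names are a concern, the same pipeline can be made NUL-delimited. The following is a sketch only: files_by_mtime is a made-up name, it assumes GNU find, GNU sort and GNU sed (for -printf, -maxdepth and the -z options), and unlike the one-liner above it lists the oldest file first and does not descend into subdirectories.

```shell
# Hypothetical helper: print the regular files directly inside "$1",
# oldest modification time first, as a NUL-delimited list.
files_by_mtime() {
    # Each record is "<epoch seconds> <path>" terminated by NUL.
    find "$1" -maxdepth 1 -type f -printf '%T@ %p\0' |
        sort -z -n |
        sed -z 's/^[^ ]* //'    # drop the timestamp, keep the path
}
```

In bash the result can be consumed with while IFS= read -r -d '' f; do ...; done < <(files_by_mtime /some/dir), which keeps every name intact.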
It may be a little more work to ensure it is installed (it may already be, though), but using zsh instead of bash for this script makes a lot of sense. The filename globbing capabilities are much richer, while still using a sh-like language.
files=( *(oc) )
will create an array whose entries are all the file names in the current directory, but sorted by change time. (Use a capital O instead to reverse the sort order). This will include directories, but you can limit the match to regular files (similar to the -type f predicate to find):
files=( *(.oc) )
find is needed far less often in zsh scripts, because most of its uses are covered by the various glob flags and qualifiers available.
I've just found a way to do it with bash and ls (GNU).
Suppose you want to iterate through the filenames sorted by modification time (-t):
while read -r fname; do
fname=${fname:1:((${#fname}-2))} # remove the leading and trailing "
fname=${fname//\\\"/\"} # removed the \ before any embedded "
fname=$(echo -e "$fname") # interpret the escaped characters
file "$fname" # replace (YOU) `file` with anything
done < <(ls -At --quoting-style=c)
Explanation
Given some filenames with special characters, this is the ls output:
$ ls -A
filename with spaces .hidden_filename filename?with_a_tab filename?with_a_newline filename_"with_double_quotes"
$ ls -At --quoting-style=c
".hidden_filename" " filename with spaces " "filename_\"with_double_quotes\"" "filename\nwith_a_newline" "filename\twith_a_tab"
So each filename needs a little processing to recover the actual name. To recap:
${fname:1:((${#fname}-2))} # remove the leading and trailing "
# ".hidden_filename" -> .hidden_filename
${fname//\\\"/\"} # removed the \ before any embedded "
# filename_\"with_double_quotes\" -> filename_"with_double_quotes"
$(echo -e "$fname") # interpret the escaped characters
# filename\twith_a_tab -> filename with_a_tab
Example
$ ./script.sh
.hidden_filename: empty
filename with spaces : empty
filename_"with_double_quotes": empty
filename
with_a_newline: empty
filename with_a_tab: empty
As seen, file (or whichever command you want) handles each filename correctly.
Each file has three timestamps:
Access time: the file was opened and read. Also known as atime.
Modification time: the file was written to. Also known as mtime.
Inode modification time: the file's status was changed, such as the file had a new hard link created, or an existing one removed; or if the file's permissions were chmod-ed, or a few other things. Also known as ctime.
None of these represents the time the file was created; traditionally that information is not saved anywhere. (Some newer filesystems do record a separate birth time, but it is not portably available.) At file creation time, all three timestamps are initialized, and then each one gets updated appropriately, when the file is read, or written to, or when a file's permissions are chmoded, or a hard link created or destroyed.
So, you can't really list the files according to their file creation time, because the file creation time isn't saved anywhere. The closest match would be the inode modification time.
See the descriptions of the -t, -u, -c, and -r options in the ls(1) man page for more information on how to list files in atime, mtime, or ctime order.
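To look at the three timestamps for a single file, a short sketch using GNU stat (the function name show_times is hypothetical; BSD stat uses a different format syntax):

```shell
# Hypothetical helper: print the three timestamps of file "$1".
# GNU stat: %X = atime, %Y = mtime, %Z = ctime, in seconds since the epoch.
show_times() {
    stat -c 'atime=%X mtime=%Y ctime=%Z' -- "$1"
}
```

Running it before and after a chmod on the same file should show only the ctime value changing.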
Here's a way using stat with an associative array.
n=0
declare -A arr
for file in *; do
    # modified=$(stat -f "%m" "$file") # For use with BSD/OS X
    modified=$(stat -c "%Y" "$file") # For use with GNU/Linux
    # Ensure stat timestamp is unique: if this key is already taken,
    # append a counter so the earlier entry is not overwritten
    if [[ -n ${arr[$modified]+set} ]]; then
        modified=${modified}.$n
        ((n++))
    fi
    arr[$modified]="$file"
done
files=()
for index in $(IFS=$'\n'; echo "${!arr[*]}" | sort -n); do
    files+=("${arr[$index]}")
done
Since sort sorts lines, $(IFS=$'\n'; echo "${!arr[*]}" | sort -n) ensures the indices of the associative array get sorted by setting the field separator in the subshell to a newline.
The quoting at arr[$modified]="$file" and files+=("${arr[$index]}") ensures that file names containing awkward characters, such as a newline, are preserved.

Merging large number of files into one

I have around 30 K files. I want to merge them into one. I used CAT but I am getting this error.
cat *.n3 > merged.n3
-bash: /usr/bin/xargs: Argument list too long
How to increase the limit of using the "cat" command? Please help me if there is any iterative method to merge a large number of files.
Here's a safe way to do it, without the need for find:
printf '%s\0' *.n3 | xargs -0 cat > merged.txt
(I've also chosen merged.txt as the output file, as @MichaelDautermann soundly advises; rename to merged.n3 afterward).
Note: The reason this works is:
printf is a bash shell builtin, whose command line is not subject to the length limitation of command lines passed to external executables.
xargs is smart about partitioning the input arguments (passed via a pipe and thus also not subject to the command-line length limit) into multiple invocations so as to avoid the length limit; in other words: xargs makes as few calls as possible without running into the limit.
Using \0 as the delimiter paired with xargs' -0 option ensures that all filenames - even those with, e.g., embedded spaces or even newlines - are passed through as-is.
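Put together as a reusable function, this might look like the sketch below. The name merge_n3 and its two parameters are assumptions made for illustration, and the output path should not itself match *.n3 inside the input directory.

```shell
# Hypothetical helper: concatenate every *.n3 file in directory "$1"
# into the file "$2", in glob (sorted) order.
merge_n3() {
    local dir="$1" out="$2"
    (
        cd "$dir" || exit 1
        set -- *.n3
        # An unmatched glob stays literal; stop if nothing matched.
        [ -e "$1" ] || exit 0
        # printf is a builtin, so the long argument list is not a problem;
        # xargs then splits the names into as many cat calls as needed.
        printf '%s\0' "$@" | xargs -0 cat --
    ) > "$out"
}
```

The redirection is opened before the subshell changes directory, so a relative output path is resolved against the caller's working directory.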
The traditional way
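# Write to merged.txt so the *.n3 glob cannot pick up the output file itself;
# rename it to merged.n3 afterwards.
> merged.txt
for file in *.n3
do
    cat "$file" >> merged.txt
done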
Try using "find":
find . -name \*.n3 -exec cat {} > merged.txt \;
This "finds" all the files with the "n3" extension in your directory and then passes each result to the "cat" command.
And I set the output file name to be "merged.txt", which you can rename to "merged.n3" after you're done appending, since you likely do not want the new "merged.n3" file to be picked up as one of its own inputs.

Resources