Get the newest file based on timestamp - bash

I am new to shell scripting, so I need some help with how to go about this problem.
I have a directory, /incoming/external/data, which contains files in the following format:
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see, each filename includes a timestamp, i.e. [RANGE]_[YYYYMMDD].dat
What I need to do is find out which of these files has the newest date, using the timestamp in the filename rather than the system timestamp, store that filename in a variable, move that file to one directory, and move the rest to a different directory.

For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of the ls command. Using ls, awk, sort, and tail we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the AWK command. It allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscore (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" "), followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this step is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to split the filename (which just adds extra complexity, so don't do it), you can replace the space with an underscore again with sed:
sed 's/ /_/'
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.

This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[#]}" otherdir
It won't work if there are spaces in the filenames, although you could modify the IFS variable to address that.
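A sketch of a space-tolerant variant, assuming bash 4+, GNU sort (for -z), at least two matching files, and that the [RANGE] part contains no underscores; newdir and otherdir are the same placeholder directories as above:
files=()
while IFS= read -r -d '' f; do
    files+=( "$f" )
done < <(printf '%s\0' *_*.dat | sort -z -t _ -k 2,2)   # sort on the YYYYMMDD field

newest=${files[${#files[@]}-1]}            # last element = newest date
mv -- "$newest" newdir
unset "files[${#files[@]}-1]"              # drop the newest from the array
(( ${#files[@]} )) && mv -- "${files[@]}" otherdir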

Try:
$ ls -lr
This lists in reverse alphabetical order, so with this naming scheme the newest-dated file appears first. Hope it helps.

Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)

ls -1 AA* | sort -r | head -1

Due to the naming convention of the files, alphabetical order is the same as date order. In bash, '*' expands alphabetically (the manual notes that the results of pathname expansion are sorted), and ls certainly sorts alphabetically, so the file with the newest date will be the last one alphabetically.
Therefore, in bash
mv $(ls | tail -1) first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat
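The same idea works without ls at all, since pathname expansion is itself sorted (a sketch; AA_*.dat and the two target directories are the placeholders from above, and the matching .dat files are assumed to be present):
files=( AA_*.dat )                 # glob expands in sorted (alphabetical) order
newest=${files[${#files[@]}-1]}    # last match = newest date
mv -- "$newest" first-directory
mv -- AA_*.dat second-directory    # the glob re-expands without the file already moved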

My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What it actually does is rely on ls -tr to sort the output by modification time (oldest first), then use tail to get the last listed file name, i.e. the most recently modified file.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.

Related

How can I deduplicate filenames across directories?

I run the following gsutil command:
gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
sort -V |
grep -e "/*.whl"
I get:
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
Since some files in different folders have the same names, how can I retrieve unique filenames ignoring the path?
I would do it like this:
blabla_your_command | rev | sort -t'/' -u -k1,1 | rev
rev reverses the characters on each line. Then I sort uniquely (-u) using / as the separator, keying on the first field. After a line is reversed, its first field is the filename, so sorting with -u on it returns only unique filenames. Then the lines need to be reversed back.
The following command:
cat <<EOF |
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
EOF
rev | sort -t'/' -u -k1,1 | rev
outputs:
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
Please check the awk option given below; it prints the field after the last '/' delimiter. It worked for me.
example:
gsutil ls gs://mybucket/v1.0.0/folder1/1560930522 | awk -F/ '{print $(NF)}'
This prints all the file names under '1560930522'.
your_command|awk -F/ '!($NF in a){a[$NF]; print}'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
4 different ways of saying the same thing
nawk -F'^.+/' '++_[$NF]<NF'
gawk -F'/' '__[$NF]++<!_'
mawk -F/ '_^__[$NF]++'
mawk2 -F/ '!_[$NF]--'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
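All four hinge on the same idiom: key an associative array on the last /-separated field and print a line only the first time its key is seen. A more readable equivalent (a sketch in standard awk; your_command is the placeholder from above):
your_command | awk -F/ '!seen[$NF]++'   # print a line only when its basename has not been seen yet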
Here's a simple, straightforward solution:
$ your_gsutil_command | xargs -L 1 basename | sort -u
The easiest way to remove paths is with basename. Unfortunately it accepts only a single filename, which must be on the command line (not from stdin), so we need to take the following steps:
Create the list of files.
We do this with your_gsutil_command, but you can use any command that generates a list of files.
Send each one to basename to remove its path.
The xargs command does this for us by reading its stdin and invoking basename repeatedly, passing the data as command-line arguments. But xargs efficiently tries to reduce the number of invocations by passing multiple filenames on each command line, and that breaks basename. We prevent that with -L 1, limiting it to only one line (that is, one filename) at a time.
Remove duplicates.
The sort -u command does this.
Using your example data:
$ gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
xargs -L 1 basename | sort -u
file1-cp27-cp27mu-linux_x86_64.whl
file1-cp35-cp35m-linux_x86_64.whl
file1-cp36-cp36m-linux_x86_64.whl
file1-cp37-cp37m-linux_x86_64.whl
Caveat: Spaces break everything. 😡
So far we've assumed the filenames and folders do not contain spaces. Spaces break basename because it expects exactly one filename and would interpret spaces as separators between multiple filenames. We can get around this in two ways:
ls -Q: If you're deduplicating local filenames, you can use the (non-gsutil) ls command with the -Q flag to put the filenames in quotes, so basename will interpret spaces as part of the filenames rather than separators.
gsutil: The -Q flag is unfortunately not supported, so we'll need to escape the spaces manually:
$ your_gsutil_command | sed 's/ /\\ /g' | xargs -L 1 basename | sort -u
Here we use the sed command to escape each space by inserting a backslash before it (that is, we replace " " with "\ "). Note that we also need to escape the backslash in the sed command itself, which is why we use \\ and not just \.
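If GNU xargs is available, an alternative sketch is to make xargs treat each whole input line as a single argument, which sidesteps the escaping entirely (with -d, quotes and backslashes in the names are no longer special):
$ your_gsutil_command | xargs -d '\n' -n 1 basename | sort -u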

File Name comparison in Bash

I have two files, each containing a list of files. I need to check which files are missing from the list in the second file. The problem is that I do not have to match the full name, but only the last 19 characters of the file names.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are the same file.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} #save the last 19 characters of every line in file2.list
{if(!a[substr($0, start)]) print $0;} #if they were not seen in file2.list, print this line from file.list
' file2.list file.list
First you can use comm to match the exact file names and obtain a list of files that do not match. Then you can use agrep. I've never used it, but you might find it useful.
Or, as a last option, you can brute-force it and, for every line in the first file, search in the second:
#!/bin/bash
# Iterate through the first file
while read LINE; do
    # Find the section of the filename that has to match in the other file
    CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
    # Create a regex to match the filenames in the second file
    SEARCH_REGEX="^.*$CHECK_SECTION$"
    # Search...
    egrep "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file, plus a file extension that can vary from file to file but must also match:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2, and the intention is only to identify which files are missing from FILE2, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp-grouped sets of files exist in FILE1 but not FILE2; as I understand it, this is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev commands, and the cut commands can just be amended to refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" ${tmp1}
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
Or you could use the standard coreutils tools:
for i in $(cat file1 file2 | sort | uniq -u); do
grep -q "$i" f1.txt && \
echo "f2 missing '$i'" || \
echo "f1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
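The comm approach mentioned earlier can also be written directly over the 19-character keys (a sketch, reusing the FILE1/FILE2 variables from the answer above and assuming bash process substitution):
# keys (last 19 characters, reversed) present in FILE1 but absent from FILE2
comm -23 <(rev "$FILE1" | cut -c -19 | sort -u) \
         <(rev "$FILE2" | cut -c -19 | sort -u) | rev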

Getting last element of a path (different from #10124314 as basename falls over)

I need to process a couple of thousand PDF files sorted alphabetically by their filenames, ideally from bash. So, from my simple perspective, I need to walk a tree of files, stripping off the path as I go, and then do various grepping, sorting, etc.
Having seen an answer to a similar question I've tried doing a
tim@MERLIN:~/Documents/Scanned$ basename `find ./ -print`
but that gets messed up by some directory names which have spaces in them - e.g. there is one called General Letters which acts like a chicken-bone in the works and results in
basename: extra operand ‘Letters’
Try 'basename --help' for more information.
I can't see a way to get find to strip out the pathname and I would prefer to use find given its plethora of options to filter on age, size etc. Nor can I see any way to get basename to cope gracefully with spaces in this context.
I considered using cut but I can't work out how to get cut to give me the last field by doing something like cut -d/ <whatever> I'm sure there must be an easy way to do it: some sort of in-line sed or awk script?
I don't particularly want the buggeration of writing a perl/Python script to do it for me as I know I should be able to do it from the command line.
So any simple tips or suggestions?
Updated/Solved
Many thanks to Cyrus; the solution is:
tim@MERLIN:~/Documents/Scanned$ find . -name '*.pdf' -printf '%f\n' | sort
Try this:
find ./ -printf '%f\n'
%f: File's name with any leading directories removed (only the last element).
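Since the question also mentions wanting find's filtering options, %f composes with them directly; for instance (a sketch, where the age and size thresholds are made-up placeholders):
find . -name '*.pdf' -mtime -30 -size +100k -printf '%f\n' | sort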
Here is a working solution using awk:
find ./ | awk -F'/' '{ print $NF }';
It simply uses / as the delimiter and prints the last field of each line.
Or with grep:
find ./ | grep -oE "[^/]+$"
Through sed,
find ./ | sed 's/.*\/\(.*\)$/\1/g'
If you want to get a list of pathnames (recursively) but want to sort them by filename (not by pathname), you can use:
find . -printf '%f|%p\n' | sort -k 1 -t'|' | cut -d'|' -f2-
You need GNU find for this (standard on Linux; not the default find on OS X).
Without GNU find, you can do the above with:
find . -print | sed 's:\(.*\)/\(.*\)$:\2\|\1/\2:' | sort -k 1 -t'|' | cut -d'|' -f2-
(Assuming there is no \n in the filenames)
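If spaces (or even newlines) in names are a concern while processing the results, a NUL-delimited pipeline avoids the word-splitting that tripped up basename (a sketch, assuming GNU find and GNU sort):
find . -name '*.pdf' -printf '%f\0' | sort -z |
while IFS= read -r -d '' name; do
    printf '%s\n' "$name"    # replace with the real per-file processing
done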

Listing files in date order with spaces in filenames

I am starting with a file containing a list of hundreds of files (full paths) in a random order. I would like to list the details of the ten latest files in that list. This is my naive attempt:
$ ls -las -t `cat list-of-files.txt` | head -10
That works, so long as none of the files have spaces in them, but fails if they do, as those filenames are split at the spaces and treated as separate files. The file "hello world" gives me:
ls: hello: No such file or directory
ls: world: No such file or directory
I have tried quoting the files in the original list-of-files file, but the command substitution still splits the filenames at the spaces, treating the quotes as part of the filenames:
$ ls -las -t `awk '{print "\"" $0 "\""}' list-of-files.txt` | head -10
ls: "hello: No such file or directory
ls: world": No such file or directory
The only way I can think of doing this is to ls each file individually (using xargs perhaps) and create an intermediate file with the file listings, with the date in a sortable order as the first field of each line, and then sort that intermediate file. However, that feels a bit cumbersome and inefficient (hundreds of ls commands rather than one or two). But that may be the only way to do it?
Is there any way to pass "ls" a list of files to process, where those files could contain spaces - it seems like it should be simple, but I'm stumped.
Instead of "one or more blank characters", you can force bash to use another field separator:
OIFS=$IFS
IFS=$'\n'
ls -las -t $(cat list-of-files.txt) | head -10
IFS=$OIFS
However, I don't think this code would be more efficient than doing a loop; in addition, that won't work if the number of files in list-of-files.txt exceeds the max number of arguments.
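A small variation confines the IFS change to a subshell so it never leaks into the rest of the script (still subject to the same argument-count limit, and the unquoted expansion still performs globbing on the names):
( IFS=$'\n'; ls -las -t $(cat list-of-files.txt) | head -10 )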
Try this:
xargs -a list-of-files.txt ls -last | head -n 10
I'm not sure whether this will work, but did you try escaping the spaces with \, using sed or something? For example: sed "s/ /\\\\ /g" list-of-files.txt
This worked for me:
xargs -d\\n ls -last < list-of-files.txt | head -10

How to reverse lines of a text file?

I'm writing a small shell script that needs to reverse the lines of a text file. Is there a standard filter command to do this sort of thing?
My specific application is that I'm getting a list of Git commit identifiers, and I want to process them in reverse order:
git log --pretty=oneline work...master | grep -v DEBUG: | cut -d' ' -f1 | reverse
The best I've come up with is to implement reverse like this:
... | cat -b | sort -rn | cut -f2-
This uses cat to number every line, then sort to sort them in descending numeric order (which ends up reversing the whole file), then cut to remove the unneeded line number.
The above works for my application, but may fail in the general case because cat -b only numbers nonblank lines.
Is there a better, more general way to do this?
In GNU coreutils, there's tac(1)
There is a command for your purpose:
tail -r file.txt
Prints the lines of file.txt in reverse order!
The -r flag is non-standard and may not work on all systems; it does work, for example, on macOS.
Beware: the number of lines it can handle may be limited. It works in most cases, but when working with huge files, be careful and check.
Answer is not 42 but tac.
Edit: Slower and more memory-consuming, using sed:
sed 'x;1!H;$!d;x'
and even longer
perl -e'print reverse<>'
Similar to the sed example above, using perl - maybe more memorable (depending on how your brain is wired):
perl -e 'print reverse <>'
"cat -b only numbers nonblank lines"
If that's the only issue you want to avoid, then why not use "cat -n" to number all the lines?
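For example, a sketch of that variant:
... | cat -n | sort -rn | cut -f2-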
: "#(#)$Id: reverse.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Reverse the order of the lines in each file
awk ' { printf("%d:%s\n", NR, $0);}' $* |
sort -t: -k 1,1nr |
sed 's/^[0-9][0-9]*://'
Works like a charm for me...
In this case, just use --reverse:
$ git log --reverse --pretty=oneline work...master | grep -v DEBUG: | cut -d' ' -f1
rev <name of your text file.txt>
You can even do this:
echo <whatever you want to type> | rev
(Note that rev reverses the characters within each line, not the order of the lines; to reverse line order, use tac or tail -r as above.)
awk '{a[i++]=$0}END{for(;i-->0;)print a[i]}'
Faster than sed, and compatible with embedded devices like OpenWrt.
