Bash: how to pipe sort, find and grep

I'm trying to write a shell script that extracts a line containing a specific number from a file. The problem is that I need the files sorted, because I have to read the line from the last file with a certain name.
I have written this code, but it doesn't seem to work:
sort -n | find -name '*foo*' -exec grep -r -F '11111' {} \;
It is really important that the files are sorted, because I need to search in the last one. The file names are of the form "res_yyyy_mm_dd_foo" and they have to be ordered by yyyy, then by mm if the years are equal, and so on.

Sounds like the following would do it:
cd home/input_output.1/inp.1/old_res23403/
ls -1 | sort -r | xargs cat -- | grep '11111' | head -n1
ls -1 produces a list of filenames in the current directory, one per line.
sort -r sorts them in reverse alphabetical order, which (given that your names use a big-endian date format) puts the latest files first.
xargs cat -- concatenates the contents of all those files.
grep '11111' finds all lines containing 11111.
head -n1 limits results to the first such line.
In effect this gives you the first matching line of the files in reverse order, i.e. the last such line.
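If you only want to search the single newest file rather than concatenating everything, a minimal sketch (assuming the res_yyyy_mm_dd_foo naming convention from the question and no newlines in file names):

```shell
# Pick the lexically last matching file; big-endian dates sort correctly,
# so the lexically last name is also the newest.
last=$(printf '%s\n' res_*_foo | sort | tail -n 1)
grep -F '11111' "$last"
```

This greps only the newest file, so matches in older files are never considered.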

Related

filename group by a pattern and select only one from each group

I have the following files (as an example; 60,000+ actually) and all the log files follow this pattern:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008795-84866-201911261249.log
analyse-ABC008795-84867-201911261249.log
analyse-ABC008795-84868-201911261249.log
analyse-ABC008795-84869-201911261249.log
analyse-ABC008796-84870-201911261249.log
analyse-ABC008796-84871-201911261249.log
analyse-ABC008796-84872-201911261249.log
analyse-ABC008796-84873-201911261249.log
Only the numbers change between the log files. I want to take one file from each category, where the files are categorized by the ABC... number. So, as you can see, there are only two categories here:
analyse-ABC008795
analyse-ABC008796
So, what I want is one file (let's say the first) from each category. The output should look like this:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008796-84870-201911261249.log
This should be done in a Bash/Linux environment, so that after I get this, I can use grep to check whether my "searching string" is contained in those files:
ls -l | <what should I do to group and get one file from each category> | grep "searching string"
With bash and awk.
files=(*.log)
printf '%s\n' "${files[@]}" | awk -F- '!seen[$2]++'
Or use find instead of a bash array for a more portable approach.
find . -type f -name '*.log' | awk -F- '!seen[$2]++'
If your find has the -printf option and you don't want the leading ./ in the filenames, add this before the pipe |:
-printf '%f\n'
The !seen[$2]++ removes the second and subsequent occurrences of each key, without having to sort the input first. $2 is the second field of each line, split on the - separator given by -F-.
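To see the idiom in action on the sample names from the question:

```shell
# Keep only the first line for each ABC... field (field 2 when split on "-").
printf '%s\n' \
  analyse-ABC008795-84865-201911261249.log \
  analyse-ABC008795-84866-201911261249.log \
  analyse-ABC008796-84870-201911261249.log |
  awk -F- '!seen[$2]++'
```

The second ABC008795 file is dropped because its key has already been seen.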

How to sort files based on filename length with subdirectories?

I am trying to look at a directory named Forever, which has sub-directories Pure and Mineral that are filled with .csv files. I was able to see all the .csv files in the directory, but I am having a hard time sorting them according to the length of the filename.
As for current directory, I am at Forever. So I am looking at both sub-directories Pure and Mineral.
What I did was:
find -name ".*csv" | tr ' ' '_' | sort -n -r
This just sorts the file reverse alphabetically, which doesn't consider the length.(I had to truncate some name of the files as it had spaces between them.)
I think this answer is more helpful than the marked duplicate because it also accounts for sub-dirs (which the dupe didn't):
find . -name '*.csv' -exec bash -c 'printf "%d\t%s\n" "$(basename "$1" | wc -m)" "$1"' _ {} \; | sort -nr | cut -f2-
FWIW using fd -e csv -x ... was quite a bit faster for me (0.153s vs find's 2.084s)
even though basename removes the file ext, it doesn't matter since find ensures that all of them have it

How to grep files in date order

I can list the Python files in a directory from most recently updated to least recently updated with
ls -lt *.py
But how can I grep those files in that order?
I understand one should never try to parse the output of ls as that is a very dangerous thing to do.
You may use this pipeline to achieve this with gnu utilities:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep 'pattern'
This will handle filenames with special characters such as space, newline, glob etc.
find finds all *.py files in current directory and prints modification time (epoch value) + : + filename + NUL byte
sort command performs reverse numeric sort on first column that is timestamp
cut command removes 1st column (timestamp) from output
xargs -0 grep command searches pattern in each file
There is a very simple way if you want to get the filelist in chronologic order that hold the pattern:
grep -sil <searchpattern> <files-to-grep> | xargs ls -ltr
i.e. you grep e.g. "hello world" in *.txt; with -sil you make the grep case-insensitive (-i), suppress error messages (-s) and just list the matching files (-l). That list you then pass on to ls (via xargs), in long format (-l), sorted by modification time (-t), reversed so the oldest file comes first (-r).
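A reproducible toy run of that pipeline (file names invented; touch -d is GNU):

```shell
tmp=$(mktemp -d); cd "$tmp"          # hypothetical scratch directory
printf 'hello world\n' > a.txt
printf 'HELLO WORLD\n' > b.txt      # also matches, thanks to -i
touch -d '2020-01-01' a.txt          # make a.txt the older match (GNU touch)
grep -sil 'hello world' *.txt | xargs ls -tr
```

With -tr, the oldest matching file (a.txt) is listed first.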

Listing files in date order with spaces in filenames

I am starting with a file containing a list of hundreds of files (full paths) in a random order. I would like to list the details of the ten latest files in that list. This is my naive attempt:
$ ls -las -t `cat list-of-files.txt` | head -10
That works, so long as none of the filenames contain spaces, but fails if they do, as those names are split at the spaces and treated as separate files. The file "hello world" gives me:
ls: hello: No such file or directory
ls: world: No such file or directory
I have tried quoting the files in the original list-of-files file, but the command substitution still splits the filenames at the spaces, treating the quotes as part of the names:
$ ls -las -t `awk '{print "\"" $0 "\""}' list-of-files.txt` | head -10
ls: "hello: No such file or directory
ls: world": No such file or directory
The only way I can think of doing this, is to ls each file individually (using xargs perhaps) and create an intermediate file with the file listings and the date in a sortable order as the first field in each line, then sort that intermediate file. However, that feels a bit cumbersome and inefficient (hundreds of ls commands rather than one or two). But that may be the only way to do it?
Is there any way to pass "ls" a list of files to process, where those files could contain spaces - it seems like it should be simple, but I'm stumped.
Instead of "one or more blank characters", you can force bash to use another field separator:
OIFS=$IFS
IFS=$'\n'
ls -las -t $(cat list-of-files.txt) | head -10
IFS=$OIFS
However, I don't think this code would be more efficient than doing a loop; in addition, that won't work if the number of files in list-of-files.txt exceeds the max number of arguments.
Try this:
xargs -a list-of-files.txt ls -last | head -n 10
I'm not sure whether this will work, but did you try escaping spaces with \? Using sed or something. sed "s/ /\\\\ /g" list-of-files.txt, for example.
This worked for me:
xargs -d\\n ls -last < list-of-files.txt | head -10
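To see that approach cope with a space in a name (scratch directory and file names invented; touch -d and xargs -d are GNU):

```shell
tmp=$(mktemp -d); cd "$tmp"           # hypothetical scratch directory
printf 'old\n' > 'hello world'
printf 'new\n' > plain
touch -d '2020-01-01' 'hello world'   # make this file older (GNU touch)
printf '%s\n' 'hello world' plain > list-of-files.txt
xargs -d '\n' ls -last < list-of-files.txt | head -10
```

The newest file (plain) is listed first, and "hello world" survives as a single argument because only newlines separate the input.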

Get the newest file based on timestamp

I am new to shell scripting, so I need some help with this problem.
I have a directory called /incoming/external/data which contains files in the following format.
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see the filename of the file includes a timestamp. i.e. [RANGE]_[YYYYMMDD].dat
What I need to do is find out which of these files has the newest date, using the timestamp in the filename rather than the system timestamp, store that filename in a variable, move that file to another directory, and move the rest to a different directory.
For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of ls command. Using ls, awk, sort, and tail we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the awk command. It allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscore (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" "), followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to separate the fields (which just adds extra complexity, so I wouldn't recommend it), you can put the underscore back with sed:
sed 's/ /_/'
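To see the final one-liner in action on the sample names (in a scratch directory):

```shell
tmp=$(mktemp -d); cd "$tmp"    # scratch dir with the question's sample names
touch AA_20100806.dat AA_20100809.dat AA_20100812.dat
newest=$(ls | sort -n -t _ -k 2 | tail -1)
echo "$newest"                 # AA_20100812.dat
```

The numeric sort on field 2 (split on _) reads the leading digits of "20100812.dat", so the largest date sorts last.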
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.
This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[@]}" otherdir
It won't work if there are spaces in the filenames although you could modify the IFS variable to affect that.
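A minimal, array-free sketch of the same move logic (directory names are made up for the example):

```shell
tmp=$(mktemp -d); cd "$tmp"   # scratch dir; 'newdir'/'otherdir' are hypothetical
mkdir newdir otherdir
touch AA_20100806.dat AA_20100807.dat AA_20100812.dat
newest=$(ls AA_*.dat | sort -t _ -k 2,2 | tail -n 1)
mv "$newest" newdir           # newest by the date embedded in the name
mv AA_*.dat otherdir          # everything that's left
```

Moving the newest file first means the remaining glob match is exactly "the rest".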
Try:
$ ls -lr
Hope it helps.
Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)
ls -1 AA* |sort -r|tail -1
Due to the naming convention of the files, alphabetical order is the same as date order. I'm pretty sure that in bash '*' expands alphabetically (though I cannot find any evidence in the manual page); ls certainly sorts that way, so the file with the newest date would be the last one alphabetically.
Therefore, in bash
mv $(ls | tail -1) first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat
My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What it actually does is rely on ls to sort the output by modification time, then use tail to get the last listed file name.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.