remove files less than a certain size and extract filenames - bash

I am working remotely on a cluster and submit a few thousand jobs. Some jobs crash early. I need to move the output files of those jobs (smaller than 1 KB) to another folder and start them again. I guess find can move them with something like:
find . -size -1000c -exec mv {} ../crashed \;
but I also need to restart these crashed jobs. The output files sit in a bunch of subfolders of the output folder, and I need the folder name and the file name (without extension) separately.
I guess sed and/or awk can do this easily, but I am not sure how. By the way, I am working in the bash shell.
I am trying to use cut, which seems to be working:
for i in $( find . -size -1000c )
do
FOLDER=$(echo "${i%.*}" | cut -d'/' -f2)
FILENAME=$(echo "${i%.*}" | cut -d'/' -f3)
done
But wouldn't it be better using sed or awk? And how?

Sed is a stream editor and since you're not changing anything I wouldn't use it in this case. You could use awk instead of cut like this:
FOLDER=$(echo "${i%.*}" | awk -v FS="/" '{ print $2 }')
where -v FS="/" sets the awk variable FS (the field separator) to a slash, much like what you do with the -d option in cut, and print $2 tells awk to print only the second field.
Same goes for the other instruction you have there. In your case what you have to do is simple enough, so cut actually cuts it :D
I usually use awk for more complicated tasks, involving multiple files and/or mathematical computations.
Edit:
Note that I'm using gawk here (the GNU implementation of awk). I'm not sure you can pass a variable value with the -v option in other implementations; they'll have their own way to do it.
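For what it's worth, here is a sketch that ties the pieces together: it moves each small output file to ../crashed and pulls out the folder and file names along the way (resubmit_job is just a placeholder for whatever command restarts a job on your cluster):
while IFS= read -r -d '' i; do
    base="${i%.*}"                        # path without the extension
    FOLDER=$(echo "$base" | cut -d'/' -f2)
    FILENAME=$(echo "$base" | cut -d'/' -f3)
    mv "$i" ../crashed/                   # park the crashed output file
    resubmit_job "$FOLDER" "$FILENAME"    # placeholder: your resubmission command
done < <(find . -type f -size -1000c -print0)
Reading the names with find -print0 and read -d '' keeps the loop safe for names containing spaces, unlike for i in $(find ...).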

Related

bash script concatenates input arguments with pipe xargs arguments

I am trying to execute my script, but the $1 argument is concatenated with the arguments of the last pipe, resulting in the following:
killProcess(){
ps aux |grep $1 | tr -s " " " " | awk "{printf \"%s \",\$2}" | tr " " "\n" | xargs -l1 echo $1
}
$ killProcess node
node 18780
node 965856
node 18801
node 909028
node 19000
node 1407472
node 19028
node 583620
node 837
node 14804
node 841
node 14260
but I just want the list of PIDs, without the node argument, so I can kill them. This only happens when I run it from a script; on the command line it works fine for me because I don't pass any arguments to the script and nothing gets concatenated.
The immediate problem is that you don't want the $1 at the end. In that context, $1 expands to the first argument to the function ("node", in your example), which then gets passed to xargs and treated as part of the command it should execute. That is, the last part of the pipeline expands to:
xargs -l1 echo node
...so when xargs receives "18780" as input, it runs echo node 18780, which of course prints "node 18780".
Solution: remove the $1, making the command just xargs -l1 echo, so when xargs receives "18780" as input, it runs echo 18780, which prints just "18780".
That'll fix it, but there's also a huge amount of simplification that can be done here. Many elements of the pipe aren't doing anything useful, or are working at cross purposes with each other.
Start with the last command in the pipe, xargs. It's taking in PIDs, one per line, and printing them one per line. It's not really doing anything at all (that I can see anyway), so just leave it off. (Unless, of course, you actually want to use kill instead of echo -- in that case, leave it on.)
Now look at the next two commands from the end:
awk "{printf \"%s \",\$2}" | tr " " "\n"`
Here, awk is printing the PIDs with a space after each one, and then tr is turning the spaces into newlines. Why not just have awk print each one with a newline to begin with? You don't even need printf for this, you can just use print since it automatically adds a newline. It's also simpler to pass the script to awk in single-quotes, so you don't have to escape the double-quotes, dollar sign, and (maybe) backslash. So any of these would work:
awk "{printf \"%s\\n\",\$2}"
awk '{printf "%s\n",$2}'
awk '{print $2}'
Naturally, I recommend the last one.
Now, about the command before awk: tr -s " " " ". This "squeezes" runs of spaces into single spaces, but that's not needed since awk treats runs of spaces as (single) field delimiters. So, again, leave that command out.
At this point, we're down to the following pipeline:
ps aux | grep $1 | awk '{print $2}'
There are two more things I'd recommend here. First, you should (almost) always put double-quotes around shell variable and parameter references like $1. So use grep "$1" instead.
But don't do that, because awk is perfectly capable of searching; there's no need for both grep and awk. In fact, awk can be much more precise, searching only a specific field instead of the whole line. The downside is that it's a bit more complex to do, but knowing how to make awk do more complex things is useful. The best way to let awk work with a shell variable or parameter is to use its -v option to create an awk variable with the same value, and use that. You can then use the ~ operator to check for a regex match against the variable. Something like this:
awk -v proc="$1" '$11 ~ proc {print $2}'
Note: I'm assuming you want to search for $1 in the executable name, and that that's the 11th field of ps aux on your system. Searching that field only will keep it from matching in e.g. the username (killing all of a user's processes because their name contains some program name isn't polite). You might actually want to be even more specific, so that e.g. trying to kill node doesn't accidentally kill nodemon as well; that'll be a matter of using more specific search patterns.
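For instance (still assuming the command name is in field 11), you could anchor the match so that node matches node or /usr/bin/node but not nodemon:
awk -v proc="$1" '$11 ~ ("(^|/)" proc "$") {print $2}'
Here the regex is built by concatenating strings around the variable, so it only matches when the field ends with the given name, optionally preceded by a path.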
So, here's the final result:
killProcess(){
ps aux | awk -v proc="$1" '$11 ~ proc {print $2}'
}
To actually kill the processes, add back xargs -l1 kill at the end.
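Putting that together, a sketch of the kill version (kill accepts several PIDs at once, so a plain xargs kill works too; GNU xargs documents -L 1 as the current spelling, with -l1 as an older synonym):
killProcess(){
    ps aux | awk -v proc="$1" '$11 ~ proc {print $2}' | xargs -l1 kill
}
# usage: killProcess node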

How to get the nth recent file in the nth last modified subdirectory using pipes

I'm doing an exercise for an OS exam. It requires getting the 3rd most recent file of the 2nd last modified sub-directory inside the current directory. Then I have to print its lines in reverse order. I cannot use the tac command. The text suggests using (other than awk and sed): head, tail, wc.
I've succeeded in getting the filename of the requested file (but in too complex a way, I think). Now I have to print it in reverse. I think I can use this awk solution: https://stackoverflow.com/a/744093/11614625.
This is how I'm getting the filename:
ls -t | head | awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' | awk 'NR==2 {system("ls \"" $0 "\" | head")}' | awk 'NR==1'
How can I do better? And what if the 2nd directory or the 3rd file doesn't exist?
See https://mywiki.wooledge.org/ParsingLs. Also, awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' has the shell call awk to call system to call a shell to call test, which is clearly a worse approach than just having the shell call test in the first place, if you were going to do that at all. Finally, any solution that reads the whole file into memory (as any sed or naive awk solution would) will fail for large files, as they'll exceed available memory.
Unfortunately this is how to do what you want robustly:
dir="$(find . -mindepth 1 -maxdepth 1 -type d -printf '%T+\t%p\0' |
sort -rz |
awk -v RS='\0' 'NR==2{sub(/[^\t]+\t/,""); print; exit}')" &&
file="$(find "$dir" -mindepth 1 -maxdepth 1 -type f -printf '%T+\t%p\0' |
sort -rz |
awk -v RS='\0' 'NR==3{sub(/[^\t]+\t/,""); print; exit}')" &&
cat -n "$file" | sort -rn | cut -f2-
If any of the commands in either pipeline fails, the error message from the failing command is printed, none of the remaining commands execute, and the overall exit status is that failing command's exit status.
I used cat | sort | cut rather than awk or sed to print the file in reverse because awk (unless you write demand paging in it) or sed would have to read the whole file into memory at once, and so would fail for very large files. sort, by contrast, is designed to handle large files: it pages to temporary files as necessary and keeps only part of the file in memory at a time, so it's limited only by how much free disk space you have on your device.
The above requires GNU tools to produce/handle NUL line endings. If you don't have those, then change \0 to \n in the find commands, remove the z from the sort options, and remove -v RS='\0' from the awk commands, and be aware that the result will then only work if your directory and file names don't contain newlines.
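For reference, here's a sketch of that newline-delimited fallback (note that find -printf is itself a GNU extension, so this mainly helps on systems whose sort and awk lack NUL support):
dir="$(find . -mindepth 1 -maxdepth 1 -type d -printf '%T+\t%p\n' |
sort -r |
awk 'NR==2{sub(/[^\t]+\t/,""); print; exit}')" &&
file="$(find "$dir" -mindepth 1 -maxdepth 1 -type f -printf '%T+\t%p\n' |
sort -r |
awk 'NR==3{sub(/[^\t]+\t/,""); print; exit}')" &&
cat -n "$file" | sort -rn | cut -f2-
As before, this only behaves correctly when the directory and file names contain no newlines.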

sed bash substitution only if variable has a value

I'm trying to find a way, using variables and sed, to do a specific text substitution with a changing input file, but only if there is a value given to replace the existing string with. No value = do nothing (rather than remove the existing string).
Example
Substitute.csv contains 5 lines (line 3 is empty):
this-has-text
this-has-text
this-has-text
this-has-text
and file.txt has one sentence:
"When trying this I want to be sure that text-this-has is left alone."
If I run the following command in a shell script
Text='text-this-has'
Change=`sed -n '3p' substitute.csv`
grep -rl $Text /home/username/file.txt | xargs sed -i "s|$Text|$Change|"
I end up with
"When trying this I want to be sure that is left alone."
But I'd like it to remain as
"When trying this I want to be sure that text-this-has is left alone."
Any way to tell sed "If I give you nothing new, do nothing"?
I apologize for the overthinking, bad habit. Essentially, what I'd like to accomplish is: if line 3 of the csv file has a value, replace $Text with $Change in place. If the line is empty, leave $Text as $Text.
Text='text-this-has'
Change=$(sed -n '3p' substitute.csv)
if [[ -n $Change ]]; then
grep -rl "$Text" /home/username/file.txt | xargs sed -i "s|$Text|$Change|"
fi
Just keep it simple and use awk:
awk -v t="$Text" -v c="$Change" 'c!=""{sub(t,c)} {print}' file
If you need inplace editing just use GNU awk with -i inplace.
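For example, the one-liner above would become the following (gawk only; be aware that the inplace extension rewrites every file named on the command line, so don't combine it with the two-file variant shown next, or Substitute.csv would be rewritten as well):
awk -i inplace -v t="$Text" -v c="$Change" 'c!=""{sub(t,c)} {print}' file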
Given your clarified requirement, this is probably what you actually want:
awk -v t="$Text" 'NR==FNR{if (NR==3) c=$0; next} c!=""{sub(t,c)} {print}' Substitute.csv file.txt
Testing whether $Change has a value before launching into the grep and sed is undoubtedly the most efficient bash solution, although I'm a bit skeptical about the duplication of grep and sed; it saves a temporary file in the case of files which don't contain the target string, but at the cost of an extra scan up to the match in the case of files which do contain it.
If you're looking for typing efficiency, though, the following might be interesting:
find . -name '*.txt' -exec sed -i "s|$Text|${Change:-&}|" {} \;
This will recursively find all files whose names end with the extension .txt and execute the sed command on each one. ${Change:-&} means "the value of $Change if it is set and non-empty, otherwise an &"; & in the replacement of a sed s command means "the matched text", so s|foo|&| replaces each match of foo with itself. That's an expensive no-op, but if your time matters more than your CPU time, it might be worth it.
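To see what sed actually receives in each case, here is a quick illustration of that expansion with made-up values:
Text='text-this-has'
Change=''
echo "s|$Text|${Change:-&}|"    # prints: s|text-this-has|&|   (a no-op substitution)
Change='something-new'
echo "s|$Text|${Change:-&}|"    # prints: s|text-this-has|something-new|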

Manipulating a file - bash

I need some guidance manipulating a text file that is the result of a diff. I only want those results listed after the > delimiter (which are file names) and then I will add a path to the file name for further work.
I am not dealing with large files.
I am hoping to do it all in place.
Essentially I want to take something like this
96a97,98
> SCR-33333.sql
> SCR-33333-WEB.sql
and create an action like
cp /add/this/path/SCR-33333.sql /to/somewhere/else
Can anyone please give me a quick example I can run with?
Well, you could try this, bearing in mind that it'll only work if filenames do not contain spaces...
diff this that | awk '/^>/{print "/add/this/path/" $2}' | xargs -i cp {} /to/somewhere/else
(note: this is a one-liner command. ignore wrapping caused by web browser.)
grep ">" dummy.txt | cut -f 2 -d ' ' | xargs -I{} cp /add/this/path/{} somewhere
where 'dummy.txt' is your diff file.
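If the filenames might contain spaces, a while-read loop avoids the word splitting in both one-liners above (a sketch, assuming the same diff output and destination):
diff this that | sed -n 's/^> //p' | while IFS= read -r f; do
    cp "/add/this/path/$f" /to/somewhere/else
done
Here sed keeps only the lines starting with "> " and strips that prefix, so each remaining line is a complete filename.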

Get the newest file based on timestamp

I am new to shell scripting, so I need some help with how to go about this problem.
I have a directory which contains files in the following format. The files are in a directory called /incoming/external/data
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see, the filename includes a timestamp, i.e. [RANGE]_[YYYYMMDD].dat
What I need to do is find out which of these files has the newest date using the timestamp in the filename (not the system timestamp), store that filename in a variable, move it to one directory, and move the rest to a different directory.
For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of the ls command. Using ls, awk, sort, and tail, we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the AWK command. It allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscore (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" ")
, followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to split the filename (which just adds extra complexity, so don't do it), you can replace the space with an underscore again with sed:
sed 's/ /_/'
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.
This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[#]}" otherdir
It won't work if there are spaces in the filenames, although you could modify the IFS variable to handle that.
Try:
$ ls -lr
Hope it helps.
Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)
ls -1 AA* | sort -r | head -1
Due to the naming convention of the files, alphabetical order is the same as date order. I'm pretty sure that in bash '*' expands alphabetically (but cannot find any evidence in the manual page); ls certainly does, so the file with the newest date would be the last one alphabetically.
Therefore, in bash
mv "$(ls | tail -1)" first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat
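For instance, restricted to that pattern, the two moves could look like this (a sketch; the newest file has to be moved first so the second mv only sees the remaining ones):
mv "$(ls AA_*.dat | tail -1)" first-directory
mv AA_*.dat second-directory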
My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What it actually does is rely on ls to sort the output, then use tail to get the last listed file name.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.
