Bash/Shell - paths with spaces messing things up

I have a bash/shell function that is supposed to find files, then use awk to copy the first file it finds to another directory. Unfortunately, if the directory that contains the file has spaces in its name, the whole thing fails: the path gets truncated somewhere along the way. How do I fix it?
If file.txt is in /path/to/search/spaces are bad/ it fails.
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
cp: /path/to/search/spaces: No such file or directory
If file.txt is in /path/to/search/spacesarebad/ it works, but notice there are no spaces. :-/

Awk's default separator is white space. Simply change it to something else by doing:
awk -F"\t" ...
Your script should look like:
dir=/path/to/destination/; find /path/to/search -name file.txt | head -n 1 | awk -F"\t" -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
(Note the ; rather than | after the assignment: an assignment in a pipeline runs in a subshell and would leave $dir empty.) This works because find's output contains no tab characters, so the entire line, spaces included, ends up in $1. As pointed out in the comments, you don't really need all those steps; you could simply do (one-liner):
dir=/path/to/destination/ && path="$(find /path/to/search -name file.txt | head -n 1)" && cp "$path" "$dir"
Formatted code (that may look better, in this case ^^):
dir=/path/to/destination/
path="$(find /path/to/search -name file.txt | head -n 1)"
cp "$path" "$dir"
The "" are used to assign the entire content of the string to the variable, causing the separator IFS, which is a white space by default, not to be considered over the string.

If you think spaces are bad, wait till you get into trouble with newlines. Consider for example:
mkdir spaces\ are\ bad
touch spaces\ are\ bad/file.txt
mkdir newlines$'\n'are$'\n'even$'\n'worse
touch newlines$'\n'are$'\n'even$'\n'worse/file.txt
And:
find . -name file.txt
The head command assumes newline-delimited input. You can get around both the space and newline issues with GNU find and GNU grep (maybe others) by using \0 delimiters:
find . -name file.txt -print0 | grep -zm1 . | xargs -0 cp -t "$dir"
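If your coreutils are recent enough, GNU head can replace the grep step, since it also understands NUL delimiters via -z (a sketch assuming GNU head and GNU xargs/cp):
find . -name file.txt -print0 | head -z -n 1 | xargs -0 cp -t "$dir"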

You could try this.
awk '{print substr($0, index($0,$9))}'
For example, this is the output of the ls command:
-rw-r--r--. 1 root root 73834496 Dec 6 10:55 File with spaces 2
If you use a simple awk like this
# awk '{print $9}'
It returns only
# File
If used with the full command
# awk '{print substr($0, index($0,$9))}'
I get the whole output
File with spaces 2
Here,
substr(s, a, b): returns b characters from string s, starting at position a. The parameter b is optional; if omitted, the rest of the string is returned.
For example if the match is addr:192.168.1.133 and you use substr as follows
# awk '{print substr($2,6)}'
You get the IP, i.e. 192.168.1.133. Note that 6 is the position of the first digit, counting from the a in addr.
So in the full command, the first argument is $0 (the whole line) rather than $2, and index($0,$9) finds where field 9 begins, so substr prints everything from field 9 to the end of the line. You can change that to index($0,$8) and see that the output changes to
# 10:55 File with spaces 2
index(IN, FIND)
    This searches the string IN for the first occurrence of the string
    FIND, and returns the position in characters where that occurrence
    begins in the string IN.
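A quick way to try this on the sample line from above (the echo just simulates one line of ls -l output):
echo '-rw-r--r--. 1 root root 73834496 Dec 6 10:55 File with spaces 2' | awk '{print substr($0, index($0, $9))}'
File with spaces 2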
I hope it helps. Moreover, if you are assigning this value to a variable in a script, you need to enclose the variable in double quotes; otherwise you will get errors when doing other operations on the extracted file name.

Related

Counting Python files with bash and awk always returns zero

I want to get the number of Python files on my desktop and I have coded a small script for that. But the awk command does not work as I expected.
script
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
I know that there are other solutions for finding the number of Python files on a PC, but I just want to know what I am doing wrong here.
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
Your code counts files literally named *.py. You should deploy regex matching and use correct GNU AWK syntax; after fixing that, your code becomes
ls -l | awk '{ if($NF~/[.]py$/) print $NF; }' | wc -l
note [.], which denotes a literal ., and $, which denotes the end of the string.
Your code might be further improved, as there is no need to use if here; a pattern-action rule will do. That is
ls -l | awk '$NF~/[.]py$/{ print $NF; }' | wc -l
Moreover, you might easily implement the counting inside GNU AWK rather than deploying wc -l, as follows
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t}'
Here, t is increased by 1 for every matching line, and after all input is processed, that is, in END, it is printed. Observe that there is no need to declare the variable t in GNU AWK.
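One caveat worth knowing: if no line matches, t is never assigned, so print t emits an empty line rather than 0; printing t+0 forces a numeric zero:
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t+0}'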
Don't try to parse the output of ls, see https://mywiki.wooledge.org/ParsingLs.
Beyond that, your awk script is failing because $NF=="*.py" is doing a literal string comparison of the last string of non-spaces against *.py when you probably wanted a regexp comparison such as $NF~/\.py$/, and your print $NF would fail for any file names containing spaces.
If you really want to involve awk in this for some reason then, assuming the list of python files doesn't exceed ARG_MAX, it'd be:
awk 'BEGIN{print ARGC-1; exit}' *.py
but you could just do it in bash:
shopt -s nullglob
files=(*.py)
echo "${#files[#]}"
or if you want to have a pipe to wc -l for some reason and your files can't have newlines in their names then:
printf '%s\n' *.py | wc -l
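Note that printf applies its format once even with no arguments, so with nullglob and zero matching files this prints a single empty line and wc -l reports 1 instead of 0; the array approach above avoids that edge case.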
gfind . -maxdepth 1 -type f -name "*.py" -print0 |
{m,g}awk 'END { print NR }' RS='\0' FS='^$'
or
{m,g}awk 'END { print --NF }' RS='^$' FS='\0'
879
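Both variants count NUL-terminated records from the find output (879 being the result on the answerer's machine): the first makes each file name its own record (RS='\0') and prints the record count NR, while the second reads everything as a single record (RS='^$') split into fields on NULs, so the trailing NUL adds one empty field and --NF prints NF minus one.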

How to get the nth recent file in the nth last modified subdirectory using pipes

I'm doing an exercise for an OS exam. It requires getting the 3rd most recent file of the 2nd last modified sub-directory inside the current directory. Then I have to print its lines in reverse order. I cannot use the tac command. The text suggests using (other than awk and sed): head, tail, wc.
I've succeeded in getting the filename of the requested file (but in a too-complex way, I think). Now I have to print it in reverse. I think I can use this awk solution: https://stackoverflow.com/a/744093/11614625.
This is how I'm getting the filename:
ls -t | head | awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' | awk 'NR==2 {system("ls \"" $0 "\" | head")}' | awk 'NR==1'
How can I do better? And what if the 2nd directory or 3rd file doesn't exist?
See https://mywiki.wooledge.org/ParsingLs. Also, awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' is calling shell to call awk to call system to call shell to call test, which is clearly a worse approach than just having the shell call test in the first place if you were going to do that. And any solution that reads the whole file into memory (as any sed or naive awk solution would) will fail for large files, as they'll exceed available memory.
Unfortunately this is how to do what you want robustly:
dir="$(find . -mindepth 1 -maxdepth 1 -type d -printf '%T+\t%p\0' |
sort -rz |
awk -v RS='\0' 'NR==2{sub(/[^\t]+\t/,""); print; exit}')" &&
file="$(find "$dir" -mindepth 1 -maxdepth 1 -type f -printf '%T+\t%p\0' |
sort -z |
awk -v RS='\0' 'NR==3{sub(/[^\t]+\t/,""); print; exit}')" &&
cat -n "$file" | sort -rn | cut -f2-
If any of the commands in any of the pipes fail then the error message from the command that failed will be printed and then no other command will execute and the overall exit status will be the failure one from that failing command.
I used cat | sort | cut rather than awk or sed to print the file in reverse because awk (unless you write demand paging in it) or sed would have to read the whole file into memory at once, and so would fail for very large files. sort, on the other hand, is designed to handle large files by paging to tmp files as necessary and keeping only parts of the file in memory at a time, so it's limited only by how much free disk space you have on your device.
The above requires GNU tools to provide/handle NUL line-endings - if you don't have those then change \0 to \n in the find command, remove the z from sort options, and remove -v RS='\0' from the awk command and be aware that the result will only work if your directory or file names don't contain newlines.
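For just the reverse-printing step at the end, here is the idea in isolation (using a throwaway demo file):
printf 'a\nb\nc\n' > demo.txt
cat -n demo.txt | sort -rn | cut -f2-
cat -n prefixes each line with its line number, sort -rn orders by that number descending, and cut -f2- strips the numbers back off, printing c, b, a.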

How to extract codes using the grep command?

I have a file with the input lines below.
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the code values (123.NNN and 456.1 above) using the grep command.
I tried the command below and didn't get the proper result: I'm getting 2 extra unwanted characters in the output. Please suggest if there is any other way to achieve this through the grep command.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\....' |
grep -o '...\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1
You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
. matches any character, [^ ]* matches any sequence of characters until the first space
Output:
123.NNN
456.1
Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\/);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different from the one supported by POSIX grep; the backslashed \+ in grep's BRE dialect is a plain + in the dialect called ERE, which is [more or less] supported by Awk - and by grep -E. If you have grep -P you can use a third variant which has a convenient feature:
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.
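A minimal demonstration of \K on one of the sample values (assuming your grep supports -P):
echo 'local configuration/code/123.NNN and customer 113' | grep -oP '/code/\K[0-9]{3}\.[0-9A-Z]+'
123.NNN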
With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1
I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/
An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file
It's not really a job for grep alone. You could use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits a string, here $7 (the 7th field), placing the result in the array ar and using the string / as the delimiter.
Then it prints the 3rd element of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1
A single sed can do the filtering.
(The pattern can be further generalized, as suggested by others, if that is an option. But be careful not to oversimplify it so that it matches unexpected inputs.)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command,
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq
Simple substitutions on individual lines is the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

Applying awk pattern to all files with same name, outputting each to a new file

I'm trying to recursively find all files with the same name in a directory, apply an awk pattern to them, and then write an updated version of each file to the directory where that file lives.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but I'm not sure how to do that with awk.
Also, is there a way to remove "(cor*)", where the * represents a wildcard? I'm not sure how to do that while keeping the escape sequence for the parentheses.
Thanks!
To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
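(The word Awk after the quoted script becomes $0 of the inner sh -c shell, so the file name substituted for {} arrives as "$1"; and because -execdir runs the command from the matched file's own directory, the redirection to ./newFILENAME.txt lands next to the original file.)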
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory has whitespace or other shell-active characters in it, the results will be an unpleasant surprise.
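If you do want a shell loop instead of -execdir, the robust pattern is a NUL-delimited read (a sketch assuming bash and GNU find; newFILENAME.txt is the output name from the question):
while IFS= read -r -d '' f; do
    awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}' "$f" > "${f%/*}/newFILENAME.txt"
done < <(find . -name FILENAME.txt -print0)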
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4

How to do basename on second field in a file and replace it in line

I'm trying to get something like the basename of the second field in a file and replace it in the line:
$ myfile=/var/lib/jenkins/myjob/myfile
$ sha512sum "$myfile" | tee myfile-checksum
$ cat myfile-checksum
deb32b1c7122fc750a6742765e0e54a821 /var/lib/jenkins/myjob/myfile
Desired output:
deb32b1c7122fc750a6742765e0e54a821 myfile
So people can easily do sha512sum -c myfile-checksum with no manual edits.
With sed or awk, that is how far I made it for now :)
awk -F/ '{print $NF}' myfile-checksum
sed -i "s|${value}|$(basename $value)|" myfile-checksum
Thanks.
You can set the field separators to both spaces and slashes and print the first and last fields:
awk -F" |/" '{print $1, $NF}'
With your input:
$ awk -F" |/" '{print $1, $NF}' <<< "deb32b1c7122fc750a6742765e0e54a821 /var/lib/jenkins/myjob/myfile"
deb32b1c7122fc750a6742765e0e54a821 myfile
In case your filename contains spaces, save the hash and then remove everything up to the last slash, as indicated by Ed Morton:
$ awk '{hash=$1; gsub(/^.*\//,""); print hash, $0}' <<< "deb32b1c7122fc750a6742765e0e54a821 /var/lib/jenkins/myjob/myfile with spaces"
deb32b1c7122fc750a6742765e0e54a821 myfile with spaces
$ awk 'sub(".*/",$1" ")' <<< "deb32b1c7122fc750a6742765e0e54a821 /var/lib/jenkins/myjob/myfile"
deb32b1c7122fc750a6742765e0e54a821 myfile
This will work for any file name except one that contains newlines. If you have that case, let us know.
sha512sum will simply use the file name you've passed to it - unchanged.
If you pass
sha512sum /path/to/file
it will give you:
123456.. /path/to/file
But if you:
pushd /path/to
sha512sum file
popd
it will give you
123456.. file
If the filename is a variable you can use parameter expansion like this:
pushd "${file%/*}"
sha256sum "${file##*/}"
popd
or even
# cd will not change the PWD of the current shell since
# the command runs in a sub shell
(cd "${file%/*}"; sha256sum "${file##*/}")
Given that $file contains the filename, ${file%/*} expands to the path without the filename and ${file##*/} expands to the filename without the path.
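A quick illustration of the two expansions, using a hypothetical path:
file=/var/lib/jenkins/myjob/myfile
echo "${file%/*}"   # prints /var/lib/jenkins/myjob (path without the filename)
echo "${file##*/}"  # prints myfile (filename without the path)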
