How can I deduplicate filenames across directories?

How can I deduplicate filenames across directories? - bash

I run the following gsutil command:
gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
sort -V |
grep -e "/*.whl"
I get:
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
Since some files in different folders have the same names, how can I retrieve unique filenames ignoring the path?

I would do it like this:
blabla_your_command | rev | sort -t'/' -u -k1,1 | rev
rev reverses lines. Then I unique sort using / as a separator on the first field. After the line is reversed, the first field will be the filename, so sorting -u on it would return only unique filenames. Then the line needs to be reversed back.
The following command:
cat <<EOF |
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
EOF
rev | sort -t'/' -u -k1,1 | rev
outputs:
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl

Please check awk option given below, this will print the last occurrence of delimiter '/', it worked for me
example:
gsutil ls gs://mybucket/v1.0.0/folder1/1560930522 | awk -F/ '{print $(NF)}'
print all the file names under '1560930522'

your_command|awk -F/ '!($NF in a){a[$NF]; print}'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl

4 different ways of saying the same thing
nawk -F'^.+/' '++_[$NF]<NF'
gawk -F'/' '__[$NF]++<!_'
mawk -F/ '_^__[$NF]++'
mawk2 -F/ '!_[$NF]--'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl

Here's a simple, straightforward solution:
$ your_gsutil_command | xargs -L 1 basename | sort -u
The easiest way to remove paths is with basename. Unfortunately it accepts only a single filename, which must be on the command line (not from stdin), so we need to take the following steps:
Create the list of files.
We do this with your_gsutil_command, but you can use any command that generates a list of files.
Send each one to basename to remove its path.
The xargs command does this for us by reading its stdin and invoking basename repeatedly, passing the data as command-line arguments. But xargs efficiently tries to reduce the number of invocations by passing multiple filenames on each command line, and that breaks basename. We prevent that with -L 1, limiting it to only one line (that is, one filename) at a time.
Remove duplicates.
The sort -u command does this.
Using your example data:
$ gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
xargs -L 1 basename | sort -u
file1-cp27-cp27mu-linux_x86_64.whl
file1-cp35-cp35m-linux_x86_64.whl
file1-cp36-cp36m-linux_x86_64.whl
file1-cp37-cp37m-linux_x86_64.whl
Caveat: Spaces break everything. 😡
So far we've assumed the filenames and folders do not contain spaces. Spaces break basename because needs exactly one filename, and it would interpret spaces as separators between multiple filenames. We can get around this in two ways:
ls -Q: If you're deduplicating local filenames, you can use the (non-gsutil) ls command with the -Q flag to put the filenames in quotes, so basename will interpret spaces as part of the filenames rather than separators.
gsutil: The -Q flag is unfortunately not supported, so we'll need to escape the spaces manually:
$ your_gsutil_command | sed 's/ /\\ /g' | xargs -L 1 basename | sort -u
Here we use the sed command to escape each space by inserting a backslash before it. (That is, we replace with \ . Note that we also need to escape the backslash in the sed command, which is why we use \\ and not just \.)

Related

How to extract codes using the grep command?

I have a file with below input lines.
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the above highlighted text using the grep command.
I tried the below command and didn't get the proper result. Getting the extra 2 unwanted characters in the output. Please suggest if there is any other way to achieve this through grep command.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\\....' |
grep -o '...\\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1

You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
. matches any character, [^ ]* matches any sequence of characters until the first space
Output:
123.NNN
456.1

Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\/);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different than that supported by POSIX grep; so the backslashed \+ in grep's BRE dialect is plain + in the dialect called ERE which is [more or less] supported by Awk - and grep -E. If you have grep -P you can use a third variant which has a convenient feature;
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.

With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1

I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/

An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file

It's not possible just using grep. You should use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits on a string, here $7, the 7th field, placing the result in an array ar, and using the string / as delimiter.
Then prints the 3rd field of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1

Single sed can do the filtering.
(The pattern can be further generalized as suggested by others if that is an option. But be careful to not to over simplify so that it can match with unexpected inputs)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command,
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq

Simple substitutions on individual lines is the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

Curl and xargs in piped commands

I want to process an old database where password are plain text (comma separated ; passwd is the 5th field in the csv file where the database has been exported) to crypt them for further use by dokuwiki. Here is my bash command (grep and sed are there to extract the crypted passwd from curl output) :
cat users.csv | awk 'FS="," { print $4 }' | xargs -l bash -c 'curl -s --data-binary "pass1=$0&pass2=$0" "https://sprhost.com/tools/SMD5.php" -o - ' | xargs | grep -o '<tt.*tt>' | sed -e 's/tt//g' | sed -e 's/<[^>]*>//g'
I get the following comment from xargs
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
And only the first line of the file is processed, and nothing appends then.
Using the -0 option, and playing around with quotes, doesn't solve anything. Where am I wrong in the command line ? May be a more advanced language will be more adequate to do this.
Thank for help, LM

In general, if you have such a long pipe of commands, it is better to split them if things go wrong. Going through your pipe:
cat users.csv |
Nothing unexpected there.
awk 'FS="," { print $4 }' |
You probably wanted to do awk 'BEGIN {FS=","} { print $4 }'. Try the first two commands in the pipe and see if they produce the correct answer.
xargs -l bash -c 'curl -s --data-binary "pass1=$0&pass2=$0" "https://sprhost.com/tools/SMD5.php" -o - ' |
Nothing wrong there, although there might be better ways to do an MD5 hash.
xargs |
What is this xargs doing in the pipe? It should be removed.
grep -o '<tt.*tt>' |
Note that this will produce two lines:
<tt>$1$17ab075e$0VQMuM3cr5CtElvMxrPcE0</tt>
<tt><your_docuwiki_root>/conf/users.auth.php</tt>
which is probably not what you expected.
sed -e 's/tt//g' |
sed -e 's/<[^>]*>//g'
which will remove the html-tags, though
sed 's/<tt>//;s/<.tt>//'
will do the same.
So I'd say a wrong awk and an xargs too many.

xargs for each element in a list in the middle of a command, not just the end like -L does

What is the correct way to xargs to feed a list of strings as inputs to the middle of a command?
For example, say I want to move all files that come through a complicated series of "pipes" to the home directory. Something like
$ ... | ... | ... | awk '{print $2}' | xargs -L1 mv ~/
Tries to move the home directory to each input, rather than the desired order.
Someone previously asked a question about this, but the answers were not helpful:
Unix - "xargs" - output "in the middle" (not at the end!)
Is there a way to place the xargs input into a specific part of the command and not just at the end.

Using GNU Parallel it looks like this:
$ ... | ... | ... | awk '{print $2}' | parallel mv {} ~/
It will run one job per file. Faster is:
$ ... | ... | ... | awk '{print $2}' | parallel -X mv {} ~/
It will insert multiple names before running the job.

You didn't mention what xargs are you using.
If you use BSD xargs instead of GNU xargs, there is a powerful option -L that can do what you want:
-J replstr
If this option is specified, xargs will use the data read from
standard input to replace the first occurrence of replstr instead
of appending that data after all other arguments. This option
will not affect how many arguments will be read from input (-n),
or the size of the command(s) xargs will generate (-s). The op-
tion just moves where those arguments will be placed in the com-
mand(s) that are executed. The replstr must show up as a dis-
tinct argument to xargs. It will not be recognized if, for in-
stance, it is in the middle of a quoted string. Furthermore,
only the first occurrence of the replstr will be replaced. For
example, the following command will copy the list of files and
directories which start with an uppercase letter in the current
directory to destdir:
/bin/ls -1d [A-Z]* | xargs -J % cp -Rp % destdir
Examples
$ seq 3 | xargs -J# echo # fourth fifth
1 2 3 fourth fifth
seq 3 | xargs -J# ruby -e 'pp ARGV' # fourth fifth
["1", "2", "3", "fourth", "fifth"]
The Ruby code pp ARGV here would pretty print the command arguments it receives. I usually use this way to debug xargs scripts.

grep pipe with sed

This is my bash command
grep -rl "System.out.print" Project1/ |
xargs -I{} grep -H -n "System.out.print" {} |
cut -f-2 -d: |
sed "s/\(.*\):\(.*\)/filename is \1 and line number is \2/
What I'm trying to do here is,I'm trying to iterate through sub folders and check what files contains "System.out.print" (using grep)
using 2nd grep trying to get file names and line numbers
using sed command I display those to console.
from here I want to remove "System.out.print" with "XXXXX" how I can pipe sed command to this?
pls help me
thanxx

GNU sed has an option to change files in place:
find Project1/ -type f | xargs sed -i 's/System\.out\.print/XXXXX/g'
Btw, your script could be written as:
grep -rsn 'root' /etc/ |
awk -F: '{ print "filename is", $1, "and line number is", $2 }'

I'm just building on hop's answer, which I found to be more useful than find -exec. I had search_text dispersed all over my computer, in logs, config files and so on, but I didn't want to search (or especially change) anything in /dev, /sys, /proc, and so on. One note, read man xargs; it doesn't like file names with spaces.
grep -HriIl --exclude-dir=dev --exclude-dir=proc --exclude-dir=sys search_text / | xargs sed -i 's/search_text/replace_text/g'

Get the newest file based on timestamp

I am new to shell scripting so i need some help need how to go about with this problem.
I have a directory which contains files in the following format. The files are in a diretory called /incoming/external/data
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see the filename of the file includes a timestamp. i.e. [RANGE]_[YYYYMMDD].dat
What i need to do is find out which of these files has the newest date using the timestamp on the filename not the system timestamp and store the filename in a variable and move it to another directory and move the rest to a different directory.

For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of ls command. Using ls, awk, sort, and tail we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the AWK command. it allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscord (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" ")
, followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to separate the file (which is just adding extra complexity, so don't do it sheepish), you can replace the space with an underscore again with sed:
sed 's/ /_/'
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.

This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[#]}" otherdir
It won't work if there are spaces in the filenames although you could modify the IFS variable to affect that.

Try:
$ ls -lr
Hope it helps.

Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)

ls -1 AA* |sort -r|tail -1

Due to the naming convention of the files, alphabetical order is the same as date order. I'm pretty sure that in bash '*' expands out alphabetically (but can not find any evidence in the manual page), ls certainly does, so the file with the newest date, would be the last one alphabetically.
Therefore, in bash
mv $(ls | tail -1) first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat

My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What is actually does is to rely on ls to sort the output, then uses tail to get the last listed file name.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio