Append xargs argument number as prefix - shell

I want to analyze the most frequently occurring entries in (a column of) a logfile. To write the detailed results, I am creating new directories from the output of something along the lines of
cat logs | cut -d',' -f 6 | sort | uniq -c | sort -rn | head -10 | \
awk '{print $2}' | xargs mkdir -p
Is there a way to create the directories with the sequence number of the argument, as processed by xargs, as a prefix? E.g. if "oranges" is the most frequent entry (of the column), the directory created should be named "1.oranges", and so on.

A quick (and dirty?) solution could be to pipe your directory names through cat -n in their proper order and then remove the whitespace separating the line number from the directory name, before passing them to xargs.
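For instance, a minimal sketch of that approach, assuming GNU sed (which understands \t): the sed expression strips the leading spaces and turns the tab after the line number into a dot.
... | cat -n | sed 's/^ *//; s/\t/./' | xargs mkdir -p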
A better solution would be to modify your awk command:
... | awk '{ print NR "." $2 }' | xargs mkdir -p
The NR variable contains the record (i.e. line) number.
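As a quick illustration of NR, with made-up values standing in for your column entries:
$ printf 'oranges\napples\npears\n' | awk '{ print NR "." $1 }'
1.oranges
2.apples
3.pears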

Related

How to Write A Second Column in Bash in an Existing txt file

I need to extract the ID name of a parent directory and put that in a tab-delimited text file. Then I need to extract the names of the contents of that folder and put them in the same row as the ID name I first extracted. Essentially, Column 1 should list the parent directory name, Column 2 should list the name of the first file in that directory, Column 3 should be the name of the next file, and so on and so forth.
/path/to/folder/ID/
pwd | xargs echo | awk -F "/" '{print $n; exit}' >> Text.txt
where 'n' is the location of the desired parent folder (in this case, ID). This works fine, and writes something like "ID001" to my Text.txt file.
I try the same little hack again, using my pwd as my input to xargs, listing out the contents of that folder, and writing the names to my Text.txt file:
pwd | xargs echo | awk -F "/" '{print $7; exit}' >> Text.txt | pwd | xargs echo | xargs ls | xargs echo >> Text.txt
But instead of
ID001 file1 file2
I get
file1 file2
ID001
Which is mostly to be expected, given the commands. I am confused as to why my file names are being appended to the first row and not to the last row. The only related article I could find was one about writing a specific column to a CSV, but it wasn't quite what I was looking for.
This find plus awk pipeline MAY be what you're trying to do:
$ ls tmp
a b
$ find tmp -print | awk '{sub("^[^/]+/",""); printf "%s%s", sep, $0; sep="\t"} END{print ""}'
tmp a b
YMMV if your file names contain tabs or newlines of course.
You probably want to do that as a series of separate commands, for ease of understanding.
You can put the commands in a bash script.
Example scenario
$ pwd
/Users/pa357856/test/tmp/foo
$ ls
file1.txt file2.txt
commands -
$ parentDIR=`pwd | xargs echo | awk -F "/" '{print $6}'`
$ filesList=`ls`
$ echo "$parentDIR" "$filesList" >> test.txt
Result -
$ cat test.txt
foo file1.txt file2.txt
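If you'd rather not count the path fields by hand, a minimal sketch of the same idea using basename, writing the same tab-separated row to test.txt as in the scenario above (assumes file names contain no newlines or tabs):
parentDIR=$(basename "$PWD")    # last component of the working directory, e.g. "foo"
row="$parentDIR"
for f in *; do                  # loop over the directory contents
  row="$row"$'\t'"$f"           # tab-separate each file name
done
printf '%s\n' "$row" >> test.txt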

How can I deduplicate filenames across directories?

I run the following gsutil command:
gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
sort -V |
grep -e "/*.whl"
I get:
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
Since some files in different folders have the same names, how can I retrieve unique filenames ignoring the path?
I would do it like this:
blabla_your_command | rev | sort -t'/' -u -k1,1 | rev
rev reverses each line. Then I sort uniquely (-u), using / as the separator, on the first field. After a line is reversed, its first field is the (reversed) filename, so sorting with -u on it keeps only one line per unique filename. Finally the lines need to be reversed back.
The following command:
cat <<EOF |
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561595893/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561654308/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319372/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563319400/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563329633/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1563411368/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565916833/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1565921265/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1566258114/file1-cp27-cp27mu-linux_x86_64.whl
EOF
rev | sort -t'/' -u -k1,1 | rev
outputs:
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
Please check the awk option given below; it prints what follows the last '/' delimiter. It worked for me.
example:
gsutil ls gs://mybucket/v1.0.0/folder1/1560930522 | awk -F/ '{print $(NF)}'
prints all the file names under '1560930522'.
This one keeps only the first line seen for each base name (the last /-separated field):
your_command | awk -F/ '!($NF in a){a[$NF]; print}'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
4 different ways of saying the same thing
nawk -F'^.+/' '++_[$NF]<NF'
gawk -F'/' '__[$NF]++<!_'
mawk -F/ '_^__[$NF]++'
mawk2 -F/ '!_[$NF]--'
gs://mybucket/v1.0.0/folder1/1560924028/file1-cp27-cp27mu-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560926922/file1-cp36-cp36m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1560930522/file1-cp35-cp35m-linux_x86_64.whl
gs://mybucket/v1.0.0/folder1/1561568612/file1-cp37-cp37m-linux_x86_64.whl
Here's a simple, straightforward solution:
$ your_gsutil_command | xargs -L 1 basename | sort -u
The easiest way to remove paths is with basename. Unfortunately it accepts only a single filename, which must be on the command line (not from stdin), so we need to take the following steps:
Create the list of files.
We do this with your_gsutil_command, but you can use any command that generates a list of files.
Send each one to basename to remove its path.
The xargs command does this for us by reading its stdin and invoking basename repeatedly, passing the data as command-line arguments. But xargs efficiently tries to reduce the number of invocations by passing multiple filenames on each command line, and that breaks basename. We prevent that with -L 1, limiting it to only one line (that is, one filename) at a time.
Remove duplicates.
The sort -u command does this.
Using your example data:
$ gsutil ls -d gs://mybucket/v${version}/folder1/*/*.whl |
xargs -L 1 basename | sort -u
file1-cp27-cp27mu-linux_x86_64.whl
file1-cp35-cp35m-linux_x86_64.whl
file1-cp36-cp36m-linux_x86_64.whl
file1-cp37-cp37m-linux_x86_64.whl
Caveat: Spaces break everything. 😡
So far we've assumed the filenames and folders do not contain spaces. Spaces break this pipeline because xargs splits its input on whitespace, so a filename containing a space reaches basename as several separate arguments. We can get around this in two ways:
ls -Q: If you're deduplicating local filenames, you can use the (non-gsutil) ls command with the -Q flag to put the filenames in quotes, so xargs passes each quoted name to basename as a single argument, spaces and all.
gsutil: The -Q flag is unfortunately not supported, so we'll need to escape the spaces manually:
$ your_gsutil_command | sed 's/ /\\ /g' | xargs -L 1 basename | sort -u
Here we use the sed command to escape each space by inserting a backslash before it; that is, we replace each space with a backslash followed by a space. (Note that we also need to escape the backslash inside the sed command, which is why we use \\ and not just \.)
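If your xargs is GNU xargs, a sketch that sidesteps the escaping entirely is to split the input on newlines instead of on whitespace:
$ your_gsutil_command | xargs -d '\n' -n 1 basename | sort -u
Here -d '\n' makes newlines the only argument delimiter and -n 1 still passes one name per basename invocation; this is a GNU extension, so it may not be available on other xargs implementations.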

Scanning, group and count file extensions on Linux

Is there a way to scan a path and group and count the file extensions?
If I understand your question, you can use this command -
ls -ls | awk '{print $10}' | grep "\." | awk -F. '{print $2}' | sort | uniq -c
which counts the extensions in the current path.
How to count files
To count how many files there are for each extension in a path, you can use one of the answers to the "Count files in a directory by extension" question on another site[1], e.g.:
ls | awk -F . '{print $NF}' | sort | uniq -c | awk '{print $2,$1}'
How to list grouped by extension
To group the files by extension you can use simply the -X option of ls
ls -X
--sort=WORD
sort by WORD instead of name:
none -U, extension -X, size -S, time -t, version -v
Note: the concept of an extension is imported from DOS; under Unix there is only the file name, which may happen to contain more than one '.' character.
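If "scan a path" means recursing into subdirectories rather than listing only the current directory, a minimal sketch along the same lines (the path is a placeholder; assumes file names contain no newlines):
find /some/path -type f -name '*.*' | awk -F. '{print $NF}' | sort | uniq -c | sort -rn
This prints a count per extension, most frequent first; note that it treats hidden files such as .bashrc as having an extension.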

File Name comparison in Bash

I have two files containing lists of files. I need to check which files are missing from the list in the second file. The problem is that I do not have to match the full name, but only the last 19 characters of the file names.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} #save last 19 characters for every line in file2
{if(!a[substr($0, start)]) print $0;} #If that suffix is not present in file2's list, print the line.
' file2.list file.list
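For example, with two hypothetical list files (names made up for illustration), the command prints the entry from file.list whose last 19 characters have no counterpart in file2.list:
$ cat file2.list
MyFile12343220150510230000.xlsx
MyFile00000120150601120000.xlsx
$ cat file.list
MyFile99999620150510230000.xlsx
MyFile55555520150715090000.xlsx
$ awk '{start=length($0) - 18;} NR==FNR{a[substr($0, start)]++; next;} {if(!a[substr($0, start)]) print $0;}' file2.list file.list
MyFile55555520150715090000.xlsx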
First you can use comm to match the exact file names and obtain a list of files not matching. Then you can use agrep. I've never used it, but you might find it useful.
Or, as last option, you can do a brute force and for every line in the first file search into the second:
#!/bin/bash
# Iterate through the first file
while read -r LINE; do
# Find the section of the filename that has to match in the other file
CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
# Create a regex to match the filenames in the second file
SEARCH_REGEX="^.*$CHECK_SECTION$"
# Search...
egrep "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2 then if the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line and extracts the part you're interested in, saving it to a temporary file for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant; the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp-grouped sets of files exist in FILE1 but not FILE2; as I understand it, this is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev commands, and the cut commands can just be amended to refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" ${tmp1}
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
Or you could use your standard coreutil tools:
for i in $(cat file1 file2 | sort | uniq -u); do
    grep -q "$i" file1 && \
        echo "file2 missing '$i'" || \
        echo "file1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
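As an illustration of the substring extraction mentioned above, a bash parameter expansion that pulls out the last 19 characters (sample name taken from the question):
name="MyFile12343220150510230000.xlsx"
suffix="${name: -19}"    # the space before -19 matters; yields "20150510230000.xlsx"
echo "$suffix"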

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in each file and write to an output file a line containing part of the corresponding input file's name? So, for example, if we were looking at the file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids the case of filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l * | grep -v " total"
sends output like
28 LOG_Yellow
You can reverse the order if you want (with awk, provided you don't have spaces in the file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file name.
'wc -l' for finding out the line count.
And don't forget to redirect your results ('>' or '>>') to your output file. A sketch of such a loop follows.
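A minimal sketch of that loop, assuming the files are named LOG_* and the LOG_ prefix should be stripped in the output:
for f in LOG_*; do
  # ${f#LOG_} strips the prefix; wc -l < file prints only the count (GNU wc; other variants may pad it with spaces)
  printf '%s\t%s\n' "${f#LOG_}" "$(wc -l < "$f")" >> output.txt
done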

Resources