I don't want to print repeated lines based on columns 6 and 7 - sorting

I don't want to print repeated lines based on columns 6 and 7; sort -u does not seem to help.
Contents of /tmp/testing:
-rwxrwxr-x. 1 root root 52662693 Feb 27 13:11 /home/something/bin/proxy_exec
-rwxrwxr-x. 1 root root 27441394 Feb 27 13:12 /home/something/bin/keychain_exec
-rwxrwxr-x. 1 root root 45570820 Feb 27 13:11 /home/something/bin/wallnut_exec
-rwxrwxr-x. 1 root root 10942993 Feb 27 13:12 /home/something/bin/log_exec
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
When I try cat /tmp/testing | sort -u -k 6,6 -k 7,7 I get:
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
-rwxrwxr-x. 1 root root 52662693 Feb 27 13:11 /home/something/bin/proxy_exec
The desired output is below, as that is the only file that differs from the others in the month and date columns:
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec

To not print repeated lines based on columns 6 and 7 using awk, you could:
$ awk '
++seen[$6,$7]==1 {        # count instances of each (month, day) pair
    keep[$6,$7]=$0        # keep the first line seen for each pair
}
END {                     # in the end
    for(i in seen)
        if(seen[i]==1)    # the pairs seen only once
            print keep[i] # get printed
}' file                   # from a file, or pipe your ls output into awk
Output for the given input:
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
Notice: All standard warnings against parsing ls output still apply.
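If the original line order matters (the seen/keep loop above prints in whatever order awk walks the array), a two-pass sketch reads the input twice: the first pass counts each ($6,$7) pair, the second prints only the lines whose pair occurred once.

```shell
# pass 1 (NR==FNR is true only while reading the first copy): count (month, day) pairs
# pass 2: a line is printed when its pair was counted exactly once
awk 'NR==FNR { c[$6,$7]++; next } c[$6,$7]==1' /tmp/testing /tmp/testing
```

Reading the same file twice costs an extra pass but preserves input order and needs no END block.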

I tried with GNU sed (but this hard-codes the repeated date):
sed -E '/^\s*(\S+\s+){5}Feb\s+27/d' testing
I also tried with GNU awk (but this only compares each line against the first line's date, so it only works when every duplicate shares that date):
awk 'NR==1{a=$6$7;next} a!=$6$7{print}' testing

How to automate concatenating several series of files using bash extended globbing and negation patterns?

Thank you so much for any advice and feedback on this matter.
This is my situation:
I have a directory with several hundred files that all start with foo and end with .txt, but differ in between by a unique "Group#.#" identifier, like so:
foo.Group1.1.txt
foo.Group1.2.txt
foo.Group1.4.txt
foo.Group2.45.txt
.
.
.
foo.Group16.9.txt
The files begin with Group1 and end at Group16. They are simple one-column txt files; each file has several thousand lines, and each row is a number.
I want to do a series of concatenations with these files: first concatenate all files except those with "Group1", then all except "Group1" and "Group2", then all except "Group1", "Group2", and "Group3", and so on, until I am left with just the last group, "Group16".
In order to do this I use a bash extended globbing expression with negation syntax to concatenate all files except those with "Group1" as their ID.
I make a directory "jacks" and write the concatenated output into a txt file in this subdirectory:
cat !(*Group1.*) > jacks/jackknife1.freqs.txt
I can then continue using this command, but adding "Group2" and "Group3" for subsequent concatenations.
cat !(*Group1.*|*Group2.*) > jacks/jackknife2.freqs.txt
cat !(*Group1.*|*Group2.*|*Group3.*) > jacks/jackknife3.freqs.txt
Technically, this works, and 16 groups isn't too terrible to do manually.
But I am wondering if there is a way, perhaps using loops or bash scripting, to automate this process and speed it up?
I would appreciate any advice or leads on this question!
thank you very much,
daniela
Some tries around bash globbing
Try using echo before cat! (Note that the !(...) patterns below need extended globbing enabled: shopt -s extglob.)
touch foo.Group{1..3}.{1..5}.txt
ls -l
total 0
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group1.1.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group1.2.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group1.3.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group1.4.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group1.5.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group2.1.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group2.2.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group2.3.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group2.4.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group2.5.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group3.1.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group3.2.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group3.3.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group3.4.txt
-rw-r--r-- 1 user user 0 Oct 21 18:37 foo.Group3.5.txt
Then
echo !(*Group1.*)
foo.Group2.1.txt foo.Group2.2.txt foo.Group2.3.txt foo.Group2.4.txt foo.Group2.5.txt foo.Group3.1.txt foo.Group3.2.txt foo.Group3.3.txt foo.Group3.4.txt foo.Group3.5.txt
Ok, and
echo !(*Group[23].*)
foo.Group1.1.txt foo.Group1.2.txt foo.Group1.3.txt foo.Group1.4.txt foo.Group1.5.txt
Or
echo !(*Group*(1|3).*)
foo.Group2.1.txt foo.Group2.2.txt foo.Group2.3.txt foo.Group2.4.txt foo.Group2.5.txt
Or even
echo !(*Group*(1|*.3).*)
foo.Group2.1.txt foo.Group2.2.txt foo.Group2.4.txt foo.Group2.5.txt foo.Group3.1.txt foo.Group3.2.txt foo.Group3.4.txt foo.Group3.5.txt
and
echo !(*Group*(1|*.[2-4]).*)
foo.Group2.1.txt foo.Group2.5.txt foo.Group3.1.txt foo.Group3.5.txt
I will let you think about the last two samples! ;-)
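To automate the whole series as asked, a plain loop works too. This is a sketch that sidesteps extglob entirely by extracting the group number from each file name; the jacks/ directory and jackknifeN naming are taken from the question:

```shell
#!/usr/bin/env bash
mkdir -p jacks

for i in {1..15}; do
    files=()
    for f in foo.Group*.txt; do
        g=${f#foo.Group}                 # strip the leading "foo.Group"
        g=${g%%.*}                       # keep the digits before the first dot: the group number
        (( g > i )) && files+=("$f")     # jackknife$i keeps only groups above $i
    done
    (( ${#files[@]} > 0 )) && cat "${files[@]}" > "jacks/jackknife$i.freqs.txt"
done
```

jackknife1 then contains everything except Group1, jackknife2 everything except Group1 and Group2, and so on, matching the manual commands above.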

For loop with if statements isn't working as expected in bash

It only prints the "else" branch for everything, but I know for a fact that the files it's looking for exist. I've tried adapting some of the other answers, but I thought this should definitely work.
Does anyone know what's wrong with my syntax?
# Contents of script
for ID_SAMPLE in $(cut -f1 metadata.tsv | tail -n +2); do
    if [ -f ./output/${ID_SAMPLE} ]; then
        echo Skipping ${ID_SAMPLE}
    else
        echo Processing ${ID_SAMPLE}
    fi
done
Additional information
# Output directory
(base) -bash-4.1$ ls -lhS output/
total 170K
drwxr-xr-x 8 jespinoz tigr 185 Jan 3 16:16 ERR1701760
drwxr-xr-x 8 jespinoz tigr 185 Jan 17 18:03 ERR315863
drwxr-xr-x 8 jespinoz tigr 185 Jan 16 23:23 ERR599042
drwxr-xr-x 8 jespinoz tigr 185 Jan 17 00:10 ERR599072
drwxr-xr-x 8 jespinoz tigr 185 Jan 16 13:00 ERR599078
# Example of inputs
(base) -bash-4.1$ cut -f1 metadata.tsv | tail -n +2 | head -n 10
ERR1701760
ERR599078
ERR599079
ERR599070
ERR599071
ERR599072
ERR599073
ERR599074
ERR599075
ERR599076
# Output of script
(base) -bash-4.1$ bash test.sh | head -n 10
Processing ERR1701760
Processing ERR599078
Processing ERR599079
Processing ERR599070
Processing ERR599071
Processing ERR599072
Processing ERR599073
Processing ERR599074
Processing ERR599075
Processing ERR599076
# Checking a directory
(base) -bash-4.1$ ls -l ./output/ERR1701760
total 294
drwxr-xr-x 2 jespinoz tigr 386 Jan 15 21:00 checkpoints
drwxr-xr-x 2 jespinoz tigr 0 Jan 10 01:36 tmp
-f checks whether the name is a regular file, but all your names are directories. Use -d to check for that:
if [ -d "./output/$ID_SAMPLE" ]
then
If you want to check whether the name exists as any type, use -e.
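Putting that together, a corrected version of the script from the question would look like this (only -f changed to -d, plus quoting):

```shell
#!/usr/bin/env bash
# iterate over the sample IDs from column 1 of metadata.tsv, skipping the header
for ID_SAMPLE in $(cut -f1 metadata.tsv | tail -n +2); do
    if [ -d "./output/${ID_SAMPLE}" ]; then   # -d: the targets are directories
        echo "Skipping ${ID_SAMPLE}"
    else
        echo "Processing ${ID_SAMPLE}"
    fi
done
```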

How to get a filename list with ncftp?

So I tried
ncftpls -l
which gives me a list
-rw-r--r-- 1 100 ftpgroup 3817084 Jan 29 15:50 1548773401.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817089 Jan 29 15:51 1548773461.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817083 Jan 29 15:52 1548773521.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817085 Jan 29 15:53 1548773582.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817090 Jan 29 15:54 1548773642.tar.gz
But all I want is to check the timestamp (which is the name of the tar.gz).
How do I get only the timestamp list?
As requested: all I wanted to do is delete old backups, so awk was a good idea (at least it was effective), even if I didn't have the right params at first. My method for deleting old backups is probably not the best, but it works:
ncftpls *authParams* | awk '{ match($9, /^[0-9]+/, a); print a[0] }' | while read fileCreationDate; do
    VALIDITY_LIMIT="$(($(date +%s) - 600))"
    a=$VALIDITY_LIMIT
    b=$fileCreationDate
    if [ $b -lt $a ]; then
        deleteFtpFile $b
    fi
done
You can use awk to print only the file names (your timestamps); in the listing shown, the name is the ninth field:
ncftpls -l | awk '{ print $9 }'
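If you want the bare epoch timestamp without the .tar.gz suffix, awk can strip it in the same pass. A sketch against the 9-field listing shown above, where the file name is field 9:

```shell
# print the file-name column with the trailing .tar.gz removed
ncftpls -l | awk '{ sub(/\.tar\.gz$/, "", $9); print $9 }'
```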

Number of logins on Linux using Shell script and AWK

How can I get the number of logins for each day since the beginning of the wtmp file using AWK?
I thought about using an associative array, but I don't know how to implement it in AWK.
myscript.sh
#!/bin/bash
awk 'BEGIN{numberoflogins=0}
#code goes here'
The output of the last command:
[fnorbert#localhost Documents]$ last
fnorbert tty2 /dev/tty2 Mon Apr 24 13:25 still logged in
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 16:25 still running
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 13:42 still running
fnorbert tty2 /dev/tty2 Fri Apr 21 16:14 - 21:56 (05:42)
reboot system boot 4.8.6-300.fc25.x Fri Apr 21 19:13 - 21:56 (02:43)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:31 - 10:02 (01:30)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:30 - 10:02 (00:-27)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:14 - 08:26 (00:11)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:13 - 08:26 (-1:-47)
wtmp begins Mon Mar 6 09:39:43 2017
The shell script's output should be:
Apr 4: 4
Apr 21: 2
Apr 24: 3
, using an associative array if possible.
In awk, arrays can be indexed by strings or numbers, so you can use them as associative arrays.
However, what you're asking will be hard to do reliably with awk because the delimiter is whitespace: empty fields throw off the column numbering, and if you use FIELDWIDTHS you'll instead be thrown off by columns longer than their assigned width.
If all you're looking for is the number of logins per day, you might want to use a combination of sed and awk (and sort):
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
awk '{arr[$0]++} END { for (a in arr) print a": " arr[a]}' | \
sort -M
The sed -E uses extended regular expressions, and the pattern prints just the date of each line emitted by last (it matches on the day of week but prints only the month and day).
We could have used uniq -c to get the counts, but with awk we can use an associative array, as you hinted.
Finally, sort -M sorts on the abbreviated month names, like Apr 24, Mar 16, etc.
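For comparison, the uniq -c variant just mentioned replaces the awk stage; it prints each date prefixed by its count (note the count column comes first, so the output format differs slightly):

```shell
# same sed extraction, then count identical dates with uniq -c
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
sort | uniq -c
```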
Try the following awk script (it assumes all entries fall in the same month, i.e. the current one):
myscript.awk:
#!/bin/awk -f
{
    a[NR]=$0    # save each line into an array indexed by line number
}
END {
    for (i=NR-1; i>1; i--) {    # iterate lines in reverse order (skipping the first and last)
        if (match(a[i], /[A-Z][a-z]{2} ([A-Z][a-z]{2}) *([0-9]{1,2}) [0-9]{2}:[0-9]{2}/, b))
            m=b[1]              # save the month name (three-argument match() is GNU awk)
        c[b[2]]++               # accumulate the number of occurrences per day
    }
    for (i in c) print m,i": "c[i]
}
Usage:
last | awk -f myscript.awk
The output:
Apr 4: 4
Apr 21: 2
Apr 24: 3

awk: Group by and then Sort by sub strings of a string

Assuming we have following files:
-rw-r--r-- 1 user group 120 Aug 17 18:27 A.txt
-rw-r--r-- 1 user group 155 May 12 12:28 A.txt
-rw-r--r-- 1 user group 155 May 10 21:14 A.txt
-rw-rw-rw- 1 user group 700 Aug 15 17:05 B.txt
-rw-rw-rw- 1 user group 59 Aug 15 10:02 B.txt
-rw-r--r-- 1 user group 180 Aug 15 09:38 B.txt
-rw-r--r-- 1 user group 200 Jul 2 17:09 C.txt
-rw-r--r-- 1 user group 4059 Aug 9 13:58 D.txt
Considering only HH:MM in the timestamp (i.e. ignoring the date/day part), I want to sort this listing to pick the maximum and minimum timestamp for each file name.
So we want to group by the last column and get the min and max HH:MM.
Please assume that duplicate filenames are allowed in my input data.
In the awk code, I got stuck on the group-by and then sorting by HH first and then MM.
Output we are expecting is in format:
Filename | Min HHMM | Max HHMM
A.txt 12:28 21:14
C.txt 17:09 17:09
..
(or any other output format giving this details is good)
Can you please help? TIA
Try:
awk '{if ($8<min[$9] || !min[$9])min[$9]=$8; if ($8>max[$9])max[$9]=$8} END{for (f in min)print f,min[f],max[f]}' file | sort
Example
$ cat file
-rw-r--r-- 1 user group 120 Aug 17 18:27 A.txt
-rw-r--r-- 1 user group 155 May 12 12:28 A.txt
-rw-r--r-- 1 user group 155 May 10 21:14 A.txt
-rw-rw-rw- 1 user group 700 Aug 15 17:05 B.txt
-rw-rw-rw- 1 user group 59 Aug 15 10:02 B.txt
-rw-r--r-- 1 user group 180 Aug 15 09:38 B.txt
-rw-r--r-- 1 user group 200 Jul 2 17:09 C.txt
-rw-r--r-- 1 user group 4059 Aug 9 13:58 D.txt
$ awk '{if ($8<min[$9] || !min[$9])min[$9]=$8; if ($8>max[$9])max[$9]=$8} END{for (f in min)print f,min[f],max[f]}' file | sort
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
Warning
Your input looks like it was produced by ls. If that is so, be aware that the output of ls has a myriad of peculiarities and compatibility issues. The authors of ls recommend against parsing the output of ls.
How the code works
awk implicitly loops over every line of input. This code uses two associative arrays. min keeps track of the minimum time for each file name. max keeps track of the maximum.
if ($8<min[$9] || !min[$9])min[$9]=$8
This updates min if the time ($8) on the current line is less than the previously seen time for this filename ($9).
if ($8>max[$9])max[$9]=$8
This updates max if the time ($8) on the current line is greater than the previously seen time for this filename ($9).
END{for (f in min)print f,min[f],max[f]}
This prints out the results for each file name.
sort
This sorts the output into a cosmetically pleasing form.
A similar awk:
$ awk '{k=$9; v=$8}                 # set key (k) and value (v)
    !(k in min){min[k]=max[k]=v}    # initial value for min/max
    min[k]>v{min[k]=v}              # set min
    max[k]<v{max[k]=v}              # set max
    END{print "Filename | Min HHMM | Max HHMM";
        for(k in min) print k,min[k],max[k] | "sort"}' file
Filename | Min HHMM | Max HHMM
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
Note that printing the header and piping the data to sort inside awk keeps the header on the first line.
$ cat > test.awk
BEGIN {
    min["\x00""Filename"]="Min_HHMM"OFS"Max_HHMM"   # set header in min[], preceded by NUL
}                                                   # to place it on top when ordering (HACK)
!($9 in min)||min[$9]>$8 {  # if candidate smaller than current min
    min[$9]=$8              # set new min
}
max[$9]<$8 {
    max[$9]=$8              # set new max
}
END {
    PROCINFO["sorted_in"]="#ind_str_asc"    # set array scanning order for the for loop
    for(i in min)
        print i,min[i],max[i]
}
$ awk -f test.awk file
Filename Min_HHMM Max_HHMM
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
The BEGIN hack can be replaced by a static print at the beginning of the END block:
print "Filename"OFS"Min_HHMM"OFS"Max_HHMM";
