How to use awk and sed to count number of elements in a column - bash

There are some emails in my email account's inbox:
12:00 <harry#hotmail.com>
12:20 <harry#hotmail.com>
12:22 <jim#gmail.com>
12:30 <clare#bbc.org>
12:40 <harry#hotmail.com>
12:50 <jim#gmail.com>
12:55 <harry#hotmail.com>
I would like to use the command line (awk, sed, grep, etc.) to count the number of emails I received from different people (changing all the minutes to :00). How can I do it?
I prefer the result like:
Number of email time From
3 12:00 <jim#gmail.com>
4 12:00 <harry#hotmail.com>
1 12:00 <clare#bbc.org>
I appreciate your help!

Here is how to do it with awk, keying on the address in the second column:
awk '{a[$2]++} END {for (i in a) print a[i]"\t"i}' file
4 <harry#hotmail.com>
1 <clare#bbc.org>
2 <jim#gmail.com>
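If you also want the header and the time column forced to :00 as in the desired output, a sketch along the same lines (assuming the two-column time/address layout shown in the question):
awk '{sub(/:.*/, ":00", $1); a[$1 "\t" $2]++}
     END {print "Number of email\ttime\tFrom"
          for (i in a) print a[i] "\t" i}' file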

You may want to use uniq after sort:
$ cut -d' ' -f2 file | sort | uniq -c
1 <clare#bbc.org>
4 <harry#hotmail.com>
2 <jim#gmail.com>
You can also get the header using printf:
$ printf "Number of email\temail\n%s\n" "$(sort file | uniq -c)"
Number of email email
1 <clare#bbc.org>
4 <harry#hotmail.com>
2 <jim#gmail.com>
We have to sort the file first for uniq to work properly. From man uniq:
Filter adjacent matching lines from INPUT
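Without the sort, uniq only collapses adjacent duplicates, so the same address shows up several times; with the sample data above:
$ cut -d' ' -f2 file | uniq -c
2 <harry#hotmail.com>
1 <jim#gmail.com>
1 <clare#bbc.org>
1 <harry#hotmail.com>
1 <jim#gmail.com>
1 <harry#hotmail.com>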

Related

How can I count and display only the words that are repeated more than once using unix commands?

I am trying to count and display only the words that are repeated more than once in a file. The basic idea is:
You are given a file with names and characters like commas, colons, slashes, etc.
Use the cut command to display only the first names in the file (other commands are also allowed).
Count and then display only the names repeated more than once.
I got to the point of counting and displaying all the names. However, I haven't found a way to display and to count only those names repeated more than once.
Here is a section of the file:
user1:x:80:200:Mia,Spurs:/home/user1:/bin/bash
user2:x:80:200:Martha,Dalton:/home/user2:/bin/bash
user3:x:80:200:Lucy,Carlson:/home/user3:/bin/bash
user4:x:80:200:Carl,Bingo:/home/user4:/bin/bash
Here is what I have been able to do:
Daniel#Daniel-MacBook-Pro Files % cut -d ":" -f 5-5 file1 | cut -d "," -f 1-1 | sort -n | uniq -c
1 Mia
3 Martha
1 Lucy
1 Carl
1 Jessi
1 Joke
1 Jim
2 Race
1 Sem
1 Shirly
1 Susan
1 Tim
You can filter out the rows with count 1 using grep:
cut -d ":" -f 5 file1 | cut -d "," -f 1 | sort | uniq -c | grep -v '^ *1 '

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one extra character for each word in my file. The correct output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, even though it is not supposed to.
Is there any solution? Or is there something I'm doing wrong?
I'm gonna venture a guess: isn't your awk expecting U*X-style newlines (LF), while your dico.txt has Windows-style ones (CR+LF)? That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but with dico.txt converted to Windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8

Unix script for checking logs for last 10 days

I have a log table which is maintained for a single day, so the data in the table is only present for one day. However, the logs for it are present in a Unix directory.
My requirement is to check the logs for the last 10 days and find the count of records that got loaded.
In the log file the pattern is something like this (a Teradata FastLoad log):
**** 13:16:49 END LOADING COMPLETE
Total Records Read = 443303
Total Error Table 1 = 0 ---- Table has been dropped
Total Error Table 2 = 0 ---- Table has been dropped
Total Inserts Applied = 443303
Total Duplicate Rows = 0
I want the script to be parametrized (the parameter will be the stage table name), finding the records inserted into the table and the error tables for the last 10 days.
Is this possible? Can anyone help me build the Unix script for this?
There are many logs in the logs directory. What if I want to check only the ones below?
bash-3.2$ ls -ltr 2018041*S_EVT_ACT_FLD*
-rw-rw----+ 1 edwops abgrp 52610 Apr 10 17:37 20180410173658_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52576 Apr 11 18:12 20180411181205_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52646 Apr 13 18:04 20180413180422_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52539 Apr 14 16:16 20180414161603_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52538 Apr 15 14:15 20180415141523_S_EVT_ACT_FLD.log
-rw-rw----+ 1 edwops abgrp 52576 Apr 16 15:38 20180416153808_S_EVT_ACT_FLD.log
Thanks.
find . -ctime -10 -type f -print | xargs awk -F= '/Total Records Read/ {print $2}' | paste -sd+ | bc
find . -ctime -10 -type f -print gets the names of files 10 days old or younger in the current working directory. To run it on a different directory, replace . with that path.
awk -F= '/Total Records Read/ {print $2}' uses = as the field separator and prints the value after it on any line containing the key phrase
Total Records Read
paste -sd+ joins the numbers into one line with plus signs between them
bc evaluates that expression into a single total
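If only one stage table's logs should be counted (for example the S_EVT_ACT_FLD files listed in the question), the table name can go into find's -name test as a parameter; a rough sketch (the log directory path is a placeholder):
TABLE=S_EVT_ACT_FLD          # parameter: the stage table name
find /path/to/logs -ctime -10 -type f -name "*${TABLE}*" -print |
  xargs awk -F= '/Total Records Read/ {print $2}' | paste -sd+ | bc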
I could not use find because the system is Solaris and its find doesn't have the -maxdepth feature. I used case to create a FILTER2 and used it with
ls -l --time-style=long-iso FOLDER | grep -E $FILTER2
but I know it's not a good way.
LOCAL_DAY=`date "+%d"`
LOCAL_MONTH=`date "+%Y-%m"`
LASTTENDAYE_MONTH=`date --date='10 days ago' "+%Y-%m"`
case $LOCAL_DAY in
0*)
FILTER2="$LASTTENDAY_MONTH-[2-3][0-9]|$LOCAL_MONTH";;
1*)
FILTER2="$LOCAL_MONTH-0[0-9]|$LOCAL_MONTH-1[0-9]";;
2*)
FILTER2="$LOCAL_MONTH-1[0-9]|$LOCAL_MONTH-2[0-9]";;
esac
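If GNU date is available (as used in the snippet above) but find has no -maxdepth, another workaround is to create a reference file with touch -t and let find select the newer files, pruning subdirectories by hand; a rough sketch (the log directory and the S_EVT_ACT_FLD pattern are placeholders):
touch -t "$(date --date='10 days ago' '+%Y%m%d%H%M')" /tmp/ten_days_ago
find /path/to/logs/. \( -type d ! -name . -prune \) -o \
  -type f -newer /tmp/ten_days_ago -name '*S_EVT_ACT_FLD*' -print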

Csh - Fetching fields via awk inside xargs

I'm struggling to understand this behavior:
Script behavior: read a file (containing dates); print a list of files in a multi-level directory tree and get their sizes; print the file size only (future step: sum the overall file sizes).
Starting script:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | head"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
[and so on]
But when I try to filter on the first field via awk, I still get the whole line:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print $1}'"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
I've already approached it via divide-et-impera, and the following command works just fine:
du -d 2 "/folder/" | grep '2000-03' | awk '{print $1}'
1000
I'm afraid that I'm missing something very trivial, but I haven't found anything so far.
Any idea? Thanks!
Input: directory containing folders named YYYY-MM-random_data and a file containing strings:
ls -l
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-03-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-04-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-05-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablb
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablc
[...]
cat dates
2000-03
2000-04
2000-05
[...]
Expected output: the sum of the disk space occupied by all the files contained in the folders whose names include the strings in the dates file:
2000-03: 1000
2000-04: 2123
2000-05: 1222112
[...]
======
But in particular, I'm interested in why awk is not able to fetch column $1 as I asked it to.
Ok it seems I found the answer myself after a lot of research :D
I'll post it here, hoping that it will help somebody else out.
https://unix.stackexchange.com/questions/282503/right-syntax-for-awk-usage-in-combination-with-other-command-inside-xargs-sh-c
The trick was to escape the $ sign. Because the sh -c command string is double-quoted, the calling shell expands $1 before awk ever sees it, so awk effectively runs {print }, which prints the whole line; writing \$1 passes a literal $1 through to awk.
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print \$1}'"
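An alternative that sidesteps the nested quoting entirely, and already does the summing planned as a future step, is a plain loop (a sketch assuming a POSIX shell rather than csh, and the same /folder/ layout as above):
while read -r d; do
  # add up column 1 of the du lines whose path contains the date, print "date: total"
  du -d 2 /folder/ | awk -v d="$d" '$0 ~ d {s += $1} END {printf "%s: %d\n", d, s}'
done < dates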
Using GNU Parallel it looks like this:
parallel --tag "eval du -s folder/{}* | perl -ne '"'$s+=$_ ; END {print "$s\n"}'"'" :::: dates
--tag prepends the line with the date.
{} is replaced with the date.
eval du -s folder/{}* finds all the dirs starting with the date and gives the total du from those dirs.
perl -ne '$s+=$_ ; END {print "$s\n"}' sums up the output from du
Finally, there is a bit of quoting trickery to get it quoted correctly.

pick up files based on dates in ksh script

I have this list of files. Now I have to pick the latest file based on some conditions:
3679 Jul 21 23:59 belk_rpo_error_**po9324892**_07212014.log
0 Jul 22 23:59 belk_rpo_error_**po9324892**_07222014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
0 Jul 20 05:50 belk_rpo_error_**po9999992**_07202014.log
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
742 Jul 21 07:30 belk_rpo_error_**po9999991**_07212014.log
0 Jul 23 2014 belk_rpo_error_**po9999991**_07232014.log
For a PARTICULAR Order_No (marked with ** **):
If the latest file is 0 kB then we will discard it (and the rest of the files with the same Order_No as well).
If the latest file is non-zero then I will take it (only the latest one).
Then append the contents to a txt file.
My expected output would be:
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
I am at my wits' end here. I can't seem to figure out how to compare dates in Unix. Any help is much appreciated.
You can try something like:
touch test.txt
for var in ` find . ! -empty -exec ls -r {} \;`
do
cat $var>>test.txt
done
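A variant of the same idea that avoids word-splitting the filenames (a sketch; it simply appends every non-empty file under the current directory):
find . -type f ! -empty -exec cat {} + >> test.txt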
Untested:
Use stat to emit the date (epoch time), size and filename.
Use awk to filter out zero-length files and extract the order number.
Sort by order number and date.
Use awk to pick the last filename for each order number.
stat -c $'%Y\t%s\t%n' *.log |
  awk -F'\t' -v OFS='\t' '
    $2 > 0 {
      split($3, a, /_/)
      print a[4], $1, $3
    }' |
  sort -t $'\t' -k1,1 -k2,2n |
  awk -F'\t' '
    NR > 1 && $1 != prev_order {print filename}
    {filename = $3; prev_order = $1}
    END {print filename}
  '
The sort command might be wrong: in order to group by order number, you might need to sort first by file time and then by order number.
If I understand your question, the resulting files need to be concatenated and appended to a file. If the above pipeline is working OK, then pipe it into | xargs cat >> something.log
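If the rule really is "when the newest file for an order is empty, discard that order entirely", one way is to keep the zero-length files in the pipeline until the newest file per order is known, and only then test its size; a sketch under the same assumptions (GNU stat, bash's $'...' quoting, and the belk_rpo_error_<order>_<date>.log naming above):
stat -c $'%Y\t%s\t%n' *.log |
  awk -F'\t' '{split($3, a, /_/); print a[4] "\t" $1 "\t" $2 "\t" $3}' |
  sort -t $'\t' -k1,1 -k2,2n |
  awk -F'\t' '{size[$1] = $3; name[$1] = $4}   # the newest line per order wins
    END {for (o in name) if (size[o] > 0) print name[o]}'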
