Uniq a column and print out number of rows in that column - shell

I have a file with a header:
name, age, id, address
Smith, 18, 201392, 19 Rand Street, USA
Dan, 19, 029123, 23 Lambert Rd, Australia
Smith, 20, 192837, 61 Apple Rd, UK
Kyle, 25, 245123, 103 Orange Rd, UK
And I'd like to remove duplicate names, so the result will be:
Smith, 18, 201392, 19 Rand Street, USA
Dan, 19, 029123, 23 Lambert Rd, Australia
Kyle, 25, 245123, 103 Orange Rd, UK
# prints 3 for 3 unique rows at column name
I've tried sort -u -t, -k1,1 file and awk -F"," '!_[$1]++' file, but they don't work because I have commas in my address field.

Well, you changed the functionality since the original post, but this should get you the unique names in your file (assuming it's named data), unsorted:
#!/bin/bash
sed "1 d" data | awk -F"," '!_[$1]++ { print $1 }'
If you need to sort, append | sort to the command line above.
To count the unique names instead, append | wc -l.
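A runnable sketch of the whole pipeline; the heredoc recreates the question's sample input in a file named data:

```shell
# Recreate the question's sample input.
cat > data <<'EOF'
name, age, id, address
Smith, 18, 201392, 19 Rand Street, USA
Dan, 19, 029123, 23 Lambert Rd, Australia
Smith, 20, 192837, 61 Apple Rd, UK
Kyle, 25, 245123, 103 Orange Rd, UK
EOF

# Drop the header, then keep only the first row seen for each name.
sed "1 d" data | awk -F"," '!_[$1]++'

# The same pipeline with | wc -l counts the unique names.
sed "1 d" data | awk -F"," '!_[$1]++' | wc -l    # prints 3
```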

Adding text to a specific (row,column) in csv with sed

Hello there! I have about 2000 CSV files:
One master file called fileaa.csv
And 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv... (some numbers are missing)
I want to add a 3rd column to the 2 column master file:
| link | link2 |
1| somelink.com | somelink2.com |
like so
| link | link2 | description |
1| somelink.com | somelink2.com | some description |
where the description of line 1 comes from fileaa-1.csv, which is a single-cell csv with a paragraph of text.
Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.
Edit (incl. commands):
Things I tried that didn't work:
cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv
wouldn't work because of the missing numbers
awk '{print $0,NR}' fileaa.csv; \
find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \;
because sed can't read the contents of another file as the replacement text this way
Edit 1:
The exact contents of fileaa-1.csv are:
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
The exact input:
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
The exact desired output:
| link | link2 | description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
Edit 2:
The contents of fileaa.csv are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv that does not match a row in fileaa.csv.
Edit 3:
There are no | characters or linefeeds in the data.
To be honest, I am a complete beginner and I don't really know where to start on this one.
Any help will be appreciated ❤️
Assumptions:
the 'paragraph' from the fileaa-*.csv files is on a single line (ie, does not include any embedded linefeeds)
assuming the sample from OP's fileaa-1.csv is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds)
we can ignore anything on lines 2-N from the fileaa-*.csv files
we only append a field to a line in fileaa.csv if we find a matching file (ie, we don't worry about appending an empty field if the matching fileaa-*.csv file does not exist)
the final result (ie, contents of all files) will fit in memory
Adding some additional sample data:
$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"
==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas
==> fileaa.csv <==
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
NOTE: since there is no fileaa-4.csv we will not append anything to the last line (where 1st field = 4) in fileaa.csv
One awk idea:
master='fileaa'
awk '
FNR==NR { if (FNR==1)
              lines[0]=$0 " Description |"        # save header line
          else {
              split($0,a,"|")                     # get line number
              ndx=a[1]+0                          # remove spaces and leading zeros
              lines[ndx]=$0                       # save line
              max=ndx > max ? ndx : max           # keep track of the max line number
          }
          next
        }
        { split(FILENAME,a,/[-.]/)                # split filename on dual delimiters: hyphen and period
          ndx=a[2]+0                              # remove leading zeros
          lines[ndx]=lines[ndx] " " $0 " |"       # append current line to matching line from 1st file
          nextfile                                # skip the rest of the current file
        }
END     { for (i=0;i<=max;i++)
              print lines[i]
        }
' "${master}".csv "${master}"-*.csv
This generates:
| link | link2 | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
This might work.
Columns one and two are collected when FILENAME does not end in a number (the master file), and column three is collected when FILENAME ends in a number (a description file).
After all input files are processed, columns one, two, and three are printed.
./doit.awk fileaa*
|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)
#!/usr/local/bin/gawk -f
BEGIN { FS="|" }

# Master file: FILENAME does not end in a number; collect columns one and two.
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
    join_on[$1] = $1
    c1[$1] = $2
    c2[$1] = $3
    joins++
}

# Description files: FILENAME ends in a number; collect column three.
FILENAME ~ /[0-9]\.csv$/ {
    match(FILENAME, /-([0-9]+)\.csv/, join_int)   # capture-group match() requires gawk
    c3[join_int[1]] = $0
}

END {
    print "|link|link2|Description"
    for (j in join_on) {
        print j "|" c1[j] "|" c2[j] "|" c3[j]
    }
}

How can we sum values grouped by a column from a file using a shell script

I have a file with student Roll No, Name, Subject, Obtained Marks and Total Marks data:
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
How can I get a group-by sum (total obtained marks) for every student in shell? I want this output:
William 190
John 250
I have tried this:
cat student.txt | awk '{sum += $14}END{print sum" "$1}' | sort | uniq -c | sort -nr | head -n 10
This is not working like a group-by sum.
With one awk command:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file
Output
William 190
John 250
If you want to sort the output, you can pipe to sort, e.g. descending by numerical second field:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort -rnk2
or ascending by student name:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort
You need to use an associative array in awk.
Try
awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}'
a[$2]=a[$2]+$4 <-- create an associative array with $2 (the name) as index, accumulating the sum of $4
END <-- executed after all records have been processed
for (i in a) print i, a[i] <-- print each index and its value
Demo :
$awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}' temp.txt
William 190
John 250
$cat temp.txt
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
$

A UNIX Command to Find the Name of the Student who has the Second Highest Score

I am new to Unix programming. Could you please help me solve this question?
For example, If the input file has the below content
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
The output will be
ABC
I tried something like this
sort -k3,3 -rn -t" " | head -n2 | awk '{print $2}'
Using GNU awk (asorti is a gawk extension):
awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}'
Demo:
$cat file.txt
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
$awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}' file.txt
ABC
$
Explanation:
NR>1 <-- skip the header record
{arr[$3]=$2} <-- create an associative array with the score as index and the name as value
END <-- executed after all records have been read
n=asorti(arr,arr_sorted) <-- sort array arr on its indices (the scores) into arr_sorted; n is the number of elements
print arr[arr_sorted[n-1]] <-- arr_sorted[n-1] is the second-highest score; print the corresponding name
Note: asorti compares indices as strings by default, which works here because every score has two digits.
Your attempt is nearly right; it just needs to select the second line instead of printing both.
Try this:
sort -k3,3 -rn -t" " file.txt | head -n2 | tail -n1 | awk '{print $2}'
head -n2 keeps the two highest scores (the non-numeric header sorts to the bottom under -rn), and tail -n1 selects the second of them.
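A runnable sketch of a pipeline that selects the second-highest score (head -n2 | tail -n1 picks the second line), using the question's sample data:

```shell
# Recreate the question's sample input.
cat > file.txt <<'EOF'
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
EOF

# Skip the header, sort descending by score, keep the top two lines,
# then select the second one and print the name.
sed 1d file.txt | sort -k3,3 -rn | head -n2 | tail -n1 | awk '{print $2}'   # prints ABC
```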

shell script inserting "$" into a formatted column and adding new column

Hi guys, pardon my bad English. I managed to display my data neatly using the column program in the code below. But how do I add a "$" in the price column? Secondly, how do I add a new total column (Price * Sold) and display it with "$" too?
(echo "Title:Author:Price:Quantity:Sold" && cat BookDB.txt) | column -s: -t
Output:
Title Author Price Quantity Sold
The Godfather Mario Puzo 21.50 50 20
The Hobbit J.R.R Tolkien 40.50 50 10
Romeo and Juliet William Shakespeare 102.80 200 100
The Chronicles of Narnia C.S.Lewis 35.90 80 15
Lord of the Flies William Golding 29.80 125 25
Memories of a Geisha Arthur Golden 35.99 120 50
I guess you could do it with awk (line break added before && for readability). The prices are decimal, so print them with %.2f rather than %d, which would truncate them:
(echo "Title:Author:Price:Quantity:Sold:Calculated"
&& awk -F: '{printf ("%s:%s:$%.2f:%d:%d:$%.2f\n",$1,$2,$3,$4,$5,$3*$5)}' BookDB.txt) | column -s: -t
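A runnable sketch, assuming BookDB.txt is colon-delimited as the column -s: usage implies; %.2f keeps the decimal prices intact (a plain %d would truncate 21.50 to 21):

```shell
# Hypothetical colon-delimited sample matching the question's output.
cat > BookDB.txt <<'EOF'
The Godfather:Mario Puzo:21.50:50:20
The Hobbit:J.R.R Tolkien:40.50:50:10
EOF

# Prefix Price and the computed total (Price * Sold) with "$",
# then align everything with column.
(echo "Title:Author:Price:Quantity:Sold:Calculated" \
 && awk -F: '{printf ("%s:%s:$%.2f:%d:%d:$%.2f\n",$1,$2,$3,$4,$5,$3*$5)}' BookDB.txt) \
 | column -s: -t
```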

Combining tab delimited files based on a column value

Suppose I have two tab-delimited files that share a column. Both files have a header line that gives a label to each column. What's an easy way to take the union of the two tables, i.e. take the columns from A and B, but do so according to the value of column K?
for example, table A might be:
employee_id name
123 john
124 mary
and table B might be:
employee_id age
124 18
123 22
then the union based on column 1 of table A ("employee_id") should yield the table:
employee_id name age
123 john 22
124 mary 18
I'd like to do this using Unix utilities, like cut etc. How can this be done?
You can use the join utility, but your files need to be sorted on the join field first:
join file1 file2
See man join for more information.
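A sketch with the question's sample tables: strip the headers, sort on the join field, then join and re-emit a combined header.

```shell
# Recreate the question's sample tables.
cat > tableA.txt <<'EOF'
employee_id name
123 john
124 mary
EOF
cat > tableB.txt <<'EOF'
employee_id age
124 18
123 22
EOF

# join requires both inputs sorted on the join field (field 1 by default).
sed 1d tableA.txt | sort > a.sorted
sed 1d tableB.txt | sort > b.sorted
echo "employee_id name age"
join a.sorted b.sorted
```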
Here's a start; I'll leave you to format the headers as needed:
$ awk 'NR>1{a[$1]=a[$1]" "$2}END{for(i in a)print a[i],i}' tableA.txt tableB.txt
age employee_id
john 22 123
mary 18 124
Another way:
$ join <(sort tableA.txt) <(sort tableB.txt)
123 john 22
124 mary 18
employee_id name age
Experiment with the join options as needed (see the info page or man page).
Try:
paste file1 file2 > file3
Note that paste just glues lines together side by side, so this only works if both files already list the same keys in the same order.
