Adding text to a specific (row,column) in csv with sed - bash

Improved question for clarity:
Hello there, so I have about 2000 CSV files:
One master file called fileaa.csv,
And 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv, ... (some numbers are missing).
I want to add a 3rd column to the 2 column master file:
| link | link2 |
1| somelink.com | somelink2.com |
like so
| link | link2 | description |
1| somelink.com | somelink2.com | some description |
where the description of line 1 comes from fileaa-1.csv, which is a single-cell csv with a paragraph of text.
Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.
Edit (incl. commands):
Things I couldn't try:
cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv
wouldn't work because of the missing numbers
awk '{print $0,NR}' fileaa.csv; \
find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \;
because sed can't read external files inside the -exec sed command
Edit 1:
The exact contents of fileaa-1.csv are:
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
The exact input:
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
The exact desired output:
| link | link2 | description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
Edit 2:
The contents of fileaa.csv are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv that does not match a row in fileaa.csv.
Edit 3:
There are no | or linefeeds in the data.
To be honest I am a complete beginner and I don't really know where to start on this one.
Any help will be appreciated ❤️

Assumptions:
the 'paragraph' from the fileaa-*.csv files is on a single line (ie, does not include any embedded linefeeds)
assuming the sample from OP's fileaa-1.csv is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds)
we can ignore anything on lines 2-N from the fileaa-*.csv files
we only append a field to a line in fileaa.csv if we find a matching file (ie, we don't worry about appending an empty field if the matching fileaa-*.csv file does not exist)
the final result (ie, contents of all files) will fit in memory
Adding some additional sample data:
$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"
==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas
==> fileaa.csv <==
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
NOTE: since there is no fileaa-4.csv we will not append anything to the last line (where 1st field = 4) in fileaa.csv
One awk idea:
master='fileaa'
awk '
FNR==NR { if (FNR==1)
              lines[0]=$0 " Description |"       # save header line
          else {
              split($0,a,"|")                    # get line number
              ndx=a[1]+0                         # remove spaces and leading zeros
              lines[ndx]=$0                      # save line
              max=ndx > max ? ndx : max          # keep track of the max line number
          }
          next
        }
        { split(FILENAME,a,/[-.]/)               # split filename on dual delimiters: hyphen and period
          ndx=a[2]+0                             # remove leading zeros
          lines[ndx]=lines[ndx] " " $0 " |"      # append current line to matching line from 1st file
          nextfile                               # skip the rest of the current file
        }
END     { for (i=0;i<=max;i++)
              print lines[i]
        }
' "${master}".csv "${master}"-*.csv
This generates:
| link | link2 | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
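To run this across the 100 masters mentioned in the question, one option is to save the awk program to its own file and loop over the master names. A minimal sketch, assuming the program above is saved as merge.awk (a hypothetical name) and that every master matches file??.csv, i.e. has no hyphen in its name:
# hypothetical wrapper; merge.awk holds the awk program shown above
for f in file??.csv; do
    master="${f%.csv}"
    awk -f merge.awk "${master}.csv" "${master}"-*.csv > "${master}.merged.csv"
done
Each merged result lands in a new .merged.csv file, leaving the originals untouched.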

This might work.
Columns one and two are collected when FILENAME does not end in a number, and column three is collected when it does.
After all input files are processed, columns one, two, and three are printed.
./doit.awk fileaa*
|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)
#!/usr/local/bin/gawk -f
BEGIN { FS="|" }

# master file (name does not end in a number): collect columns one and two
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
    join_on[$1] = $1
    c1[$1] = $2
    c2[$1] = $3
    joins++
}

# description file (name ends in a number): collect column three,
# keyed on the number embedded in the filename
FILENAME ~ /[0-9]\.csv$/ {
    match(FILENAME, /-([0-9]+)\.csv/, join_int)   # gawk three-argument match()
    c3[join_int[1]] = $0
}

END {
    print "|link|link2|Description"
    for (j in join_on) {
        print j "|" c1[j] "|" c2[j] "|" c3[j]
    }
}
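One caveat: for (j in join_on) visits array indices in an unspecified order, so the joined rows may print shuffled. Since the script already needs gawk for the three-argument match(), the traversal order can be pinned with PROCINFO; a sketch of the adjusted END block:
END {
    print "|link|link2|Description"
    PROCINFO["sorted_in"] = "@ind_num_asc"    # gawk-only: visit indices in numeric order
    for (j in join_on) {
        print j "|" c1[j] "|" c2[j] "|" c3[j]
    }
}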

Related

Uniq a column and print out number of rows in that column

I have a file, with header
name, age, id, address
Smith, 18, 201392, 19 Rand Street, USA
Dan, 19, 029123, 23 Lambert Rd, Australia
Smith, 20, 192837, 61 Apple Rd, UK
Kyle, 25, 245123, 103 Orange Rd, UK
And I'd like to weed out duplicates by name, so the result will be:
Smith, 18, 201392, 19 Rand Street, USA
Dan, 19, 029123, 23 Lambert Rd, Australia
Kyle, 25, 245123, 103 Orange Rd, UK
# prints 3 for 3 unique rows at column name
I've tried sort -u -t, -k1,1 file and awk -F"," '!_[$1]++' file, but it doesn't work because I have commas in my address.
Well, you changed the functionality since the original question, but this should get you the unique names in your file (assuming it's named data), unsorted:
#!/bin/bash
sed "1 d" data | awk -F"," '!_[$1]++ { print $1 }'
If you need to sort, append | sort to the command line above.
And append | wc -l to the command line to count lines.
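Putting the pieces together, a sketch of the full count, assuming the file is named data as above:
sed "1 d" data | awk -F"," '!_[$1]++' | wc -l    # prints 3
The sed strips the header, the awk keeps only the first row seen for each name, and wc -l counts the survivors.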

AWK or SED Replace space between alphabets in a particular column [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 3 years ago.
Improve this question
I have an infile as below:
infile:
INM00042170 28.2500 74.9167 290.0 CHURU 2015 2019 2273
INM00042182 28.5833 77.2000 211.0 NEW DELHI/SAFDARJUNG 1930 2019 67874
INXUAE05462 28.6300 77.2000 216.0 NEW DELHI 1938 1942 2068
INXUAE05822 25.7700 87.5200 40.0 PURNEA 1933 1933 179
INXUAE05832 31.0800 77.1800 2130.0 SHIMLA 1926 1928 728
PKM00041640 31.5500 74.3333 214.0 LAHORE CITY 1960 2019 22915
I want to replace the space between two words by an underscore in column 5 (example: NEW DELHI becomes NEW_DELHI). I want output as below.
outfile:
INM00042170 28.2500 74.9167 290.0 CHURU 2015 2019 2273
INM00042182 28.5833 77.2000 211.0 NEW_DELHI/SAFDARJUNG 1930 2019 67874
INXUAE05462 28.6300 77.2000 216.0 NEW_DELHI 1938 1942 2068
INXUAE05822 25.7700 87.5200 40.0 PURNEA 1933 1933 179
INXUAE05832 31.0800 77.1800 2130.0 SHIMLA 1926 1928 728
PKM00041640 31.5500 74.3333 214.0 LAHORE_CITY 1960 2019 22915
Thank you
#!/bin/bash
# join fields 5 and 6 and drop results that end in numbers;
# this returns the list of new names (with underscore) for
# all cities that need to be replaced
declare -a NEW_NAMES=($(awk '{print $5 "_" $6}' infile | grep -vE "_[0-9]"))
# iterate over all new names
for NEW_NAME in "${NEW_NAMES[@]}"; do
    OLD_NAME=$(echo "$NEW_NAME" | tr '_' ' ')
    # replace in file
    sed -i "s/${OLD_NAME}/${NEW_NAME}/g" infile
done
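Rerunning sed -i once per city rescans the whole file each time; a single awk pass can do the same join of fields 5 and 6. A sketch, assuming (as in the sample) a city name contains at most one space, so affected lines carry exactly one extra field:
awk 'NF == 9 {                      # one extra field => the city name had a space
         $5 = $5 "_" $6             # join the two halves of the name
         for (i = 6; i < NF; i++)   # shift the remaining fields left
             $i = $(i + 1)
         NF--                       # drop the now-duplicated last field (GNU awk)
     }
     { print }' infile > outfile
Note this rebuilds each modified line with single-space separators, which matches the sample data.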

Print unique names of users logged on with finger

I'm trying to write a shell script that prints the full names of users logged on to a machine. The finger command gives me a list of users, but there are many duplicates. How can I loop through and print out only the unique ones?
Edit:
This is the format of what finger gives me:
xxxx XX of group XXX pts/59 1:00 Feb 13 16:38
xxxx XX of group XXX pts/71 1:11 Feb 13 16:27
xxxx XX of group XXX pts/105 1d Feb 12 15:22
xxxx YY of group YYY pts/102 2:19 Feb 13 14:13
xxxx ZZ of group ZZZ pts/42 2d Feb 7 12:11
I'm trying to extract the full name (i.e. whatever comes before 'of group' in column 2), so I would be using awk together with finger.
What you want is actually fairly difficult in a shell script, here is, for example, my full output of finger(1):
Login Name TTY Idle Login Time Office Phone
martin Martin Tournoij *v0 1d Wed 14:11
martin Martin Tournoij pts/2 22 Wed 15:37
martin Martin Tournoij pts/5 41 Thu 23:16
martin Martin Tournoij pts/7 31 Thu 23:24
martin Martin Tournoij pts/8 Thu 23:29
You want the full name, but this may contain 1 space (as per my example), or it may just be 'Teller' (no space), or it may be 'Captain James T. Kirk' (3 spaces). So you can't just use the space as delimiter. You could use the character position of 'TTY' in the header as an indicator, but that's not very elegant IMHO (especially with shell scripting).
My solution is therefore slightly different: we get only the username from finger(1), then we get the full name from /etc/passwd:
#!/bin/sh
prev=""
for u in $(finger | tail +2 | cut -w -f1 | sort); do
    [ "$u" = "$prev" ] && continue
    echo "$u $(grep "^$u:" /etc/passwd | cut -d: -f5)"
    prev="$u"
done
Which gives me both the username & login name:
martin Martin Tournoij
Obviously, you can also print just the real name (without the $u).
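On systems that provide getent (Linux, for instance; not stock macOS), the password lookup can be delegated to it, which anchors the match to the exact username and also covers NIS/LDAP accounts. A sketch along the same lines:
#!/bin/sh
# print each unique logged-in user followed by the GECOS full name
for u in $(finger | tail -n +2 | awk '{ print $1 }' | sort -u); do
    printf '%s %s\n' "$u" "$(getent passwd "$u" | cut -d: -f5)"
done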
The sort and uniq coreutils commands can be used to remove duplicates.
finger | sort -u
This will remove all duplicate lines, but you will still see similar lines due to how verbose the finger command is. If you just want a list of usernames, you can filter it out further to be very specific.
finger | cut -d ' ' -f1 | sort -u
Now, you can take this one step further, and remove the "header/label" line printed out by the finger command.
finger | cut -d ' ' -f1 | sort -u | grep -iv login
Hope this helps.
Other possible solution:
finger | tail -n +2 | awk '{ print $1 }' | sort | uniq
tail -n +2 to omit the first line.
awk '{ print $1 }' to extract the first column.
sort to prepare input for uniq.
uniq remove duplicates.
If you want to iterate use:
for user in $(finger | tail -n +2 | awk '{ print $1 }' | sort | uniq)
do
    echo "$user"
done
Could this be simpler?
No spaces or any other special characters to worry about!
finger -l | awk '/^Login/'
Edit: To remove the content after of group
finger -l | awk '/^Login/' | sed 's/of group.*//g'
Output:
Login: xx Name: XX
Login: yy Name: YY
Login: zz Name: ZZ

Combine text from two files, output to another [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 1 year ago.
I'm having a bit of a problem and I've been searching all day. This is my first Unix class, so don't be too harsh.
This may sound fairly simple, but I can't get it.
I have two text files
file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
I am trying to write a script using a looping structure that will combine both files and come out with the output below as a separate file
output:
Name On-Call Phone Start Time
Sally Monday 248.344.5576 8am
Roberto Tuesday 313.123.4567 2am
Alice Wednesday 616.556.4458 11pm
David Thursday 734.838.9801 10am
Frank Friday 634.296.1259 3pm
Mary Saturday 313.449.1390 2pm
Ted Sunday 248.496.2207 4pm
This is what I tried (I know it's horrible):
echo " Name On-Call Phone Start Time"
file="/home/xubuntu/date.txt"
file1="/home/xubuntu/name.txt"
while read name2 phone
do
while read name day time
do
echo "$name $day $phone $time"
done<"$file"
done<"$file1"
Any help would be appreciated.
First, sort the files using sort and then use this command:
paste file1 file2 | awk '{print $1,$4,$2,$5}'
This will bring you pretty close. After that you have to figure out how to format the time from the 24 hour format to the 12 hour format.
If you want to avoid using sort separately, you can bring in a little more complexity like this:
paste <(sort file1) <(sort file2) | awk '{print $1,$4,$2,$5}'
Finally, if you have not yet figured out how to print the time in 12 hour format, here is your full command:
paste <(sort file1) <(sort file2) | awk '{"date --date=\"" $5 ":00:00\" +%I%P" |& getline $5; print $1 " " $4 " " $2 " " $5 }'
You can use tabs (\t) in place of spaces as connectors to get a nicely formatted output.
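For example, a sketch of the same pipeline with tab-separated output:
paste <(sort file1) <(sort file2) | awk -v OFS='\t' '{ print $1, $4, $2, $5 }'
Setting OFS makes every comma in the print statement emit a tab instead of a space.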
In this case the join command will also work:
join -1 1 -2 1 <(sort file1) <(sort file2)
Description
-1 -> file1
1 -> first field of file1 (common field)
-2 -> file2
1 -> first field of file2 (common field)
$ cat file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
$ cat file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
Output:
Alice 616.556.4458 Wednesday 23
David 734.838.9801 Thursday 10
Frank 634.296.1259 Friday 15
Mary 313.449.1390 Saturday 14
Roberto 313.123.4567 Tuesday 2
Sally 248.344.5576 Monday 8
Ted 248.496.2207 Sunday 16
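To get from that join output to the layout the question asks for (name, day, phone, time), one more awk reorder plus a header will do; a sketch, using column -t to line the columns up (Start_Time is written with an underscore so it stays a single column):
{ echo "Name On-Call Phone Start_Time"
  join <(sort file1) <(sort file2) | awk '{ print $1, $3, $2, $4 }'
} | column -t
The start times are still bare 24-hour numbers here; converting them to am/pm is the same extra step described in the paste answer above.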

Bash and awk: converting a field from 12 hour to 24 hour clock time

I have a large txt file space delimited which I split into 18 smaller files (each with their own number of columns). This split is based on a delimiter i.e. whenever the timestamp hits midnight. So effectively, I'll end up with a 18 files in the form of (note, ignore the dashes and pipes, I've used them to improve the readability):
file1
time ----------- valueA - valueB
12:00:00 AM | 54.13 | 239.12
12:00:01 AM | 51.83 | 119.93
..
file18
time ---------- valueA - valueB - valueC - valueD
12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12
12:00:01 AM | 23.92 | 121.92 | 201.23 | 892.12
..
Once I split the file I then perform some processing on each of the files using AWK so in short there's 2 stages the 'split stage' and the 'processing stage'.
Unfortunately, the timestamp contained in the large txt file is in 1 of 2 formats. Either the desirable 24 hour format of "00:00:01" or the undesirable 12 hour format of "12:00:01 AM".
As a result, I'm trying to convert all formats to be 24 hours and I'm not sure how to do this. I'm also not sure whether to attempt this at the split stage using bash or at the process stage using AWK. I know that the following command converts 12 hour to 24 hour:
date --date="12:00:01 AM" +%T
However, I'm not sure how to incorporate this into my shell script where I'm using 'while read line' at the 'split stage', or whether I should do the time conversion in AWK (if possible?) at the 'processing stage'.
See the test below; is it helpful for you?
kent$ echo "12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12 "\
|awk -F'|' 'BEGIN{OFS="|"}{("date --date=\""$1"\" +%T") |getline $1;print }'
output
00:00:00| 54.92 | 239.12 | 231.23 | 882.12
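Spawning date once per input line gets slow on large files; the same conversion can be done entirely inside awk. A sketch that normalizes both of the question's formats (12-hour with AM/PM, or already 24-hour) in one pass:
awk -F'|' 'BEGIN { OFS="|" } {
    split($1, t, /[: ]+/)                 # t[1]=hh, t[2]=mm, t[3]=ss, t[4]=AM/PM if present
    h = t[1] + 0
    if (t[4] == "AM" && h == 12) h = 0    # 12:xx:xx AM is 00:xx:xx
    if (t[4] == "PM" && h  < 12) h += 12  # 1 PM through 11 PM add 12
    $1 = sprintf("%02d:%02d:%02d", h, t[2], t[3])
    print
}' file
Lines already in 24-hour form have no t[4], so neither branch fires and the timestamp is simply reprinted zero-padded.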
