This question already has answers here:
Inner join on two text files
(5 answers)
Closed 1 year ago.
I'm having a bit of a problem and I've been searching all day. This is my first Unix class, so don't be too harsh.
This may sound fairly simple, but I can't get it.
I have two text files
file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
I am trying to write a script using a looping structure that will combine both files and come out with the output below as a separate file
output:
Name On-Call Phone Start Time
Sally Monday 248.344.5576 8am
Roberto Tuesday 313.123.4567 2am
Alice Wednesday 616.556.4458 11pm
David Thursday 734.838.9801 10am
Frank Friday 634.296.1259 3pm
Mary Saturday 313.449.1390 2pm
Ted Sunday 248.496.2207 4pm
This is what I tried (I know it's horrible):
echo " Name On-Call Phone Start Time"
file="/home/xubuntu/date.txt"
file1="/home/xubuntu/name.txt"
while read name2 phone
do
while read name day time
do
echo "$name $day $phone $time"
done<"$file"
done<"$file1"
any help would be appreciated
First, sort the files using sort and then use this command:
paste file1 file2 | awk '{print $1,$4,$2,$5}'
This will bring you pretty close. After that you have to figure out how to format the time from the 24 hour format to the 12 hour format.
If you want to avoid using sort separately, you can bring in a little more complexity like this:
paste <(sort file1) <(sort file2) | awk '{print $1,$4,$2,$5}'
Finally, if you have not yet figured out how to print the time in 12 hour format, here is your full command:
paste <(sort file1) <(sort file2) | awk '{cmd = "date --date=\"" $5 ":00:00\" +%-I%P"; cmd | getline $5; close(cmd); print $1 " " $4 " " $2 " " $5 }'
You can use tabs (\t) in place of spaces as connectors to get a nicely formatted output.
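If spawning date once per line feels heavy, the 24-hour to 12-hour conversion can also be done with plain awk arithmetic. A minimal sketch, assuming the hour is the last field on each line and is an integer from 0 to 23 (the function name to12 is made up for illustration):

```shell
# to12: read lines from stdin and rewrite the last field (hour 0-23)
# as a 12-hour value with an am/pm suffix, e.g. 23 -> 11pm.
to12() {
    awk '{
        h = $NF + 0                        # hour in 24-hour form
        suffix = (h < 12) ? "am" : "pm"
        h12 = h % 12
        if (h12 == 0) h12 = 12             # 0 -> 12am, 12 -> 12pm
        $NF = h12 suffix
        print
    }'
}
```

It can be placed at the end of a pipeline that already prints name, day, phone and hour, for example: ... | awk '{print $1,$4,$2,$5}' | to12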
In this case the join command will also work:
join -1 1 -2 1 <(sort file1) <(sort file2)
Description
-1 -> file1
1 -> first field of file1 (common field)
-2 -> file2
1 -> first field of file2 (common field)
cat file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
cat file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
output
Alice 616.556.4458 Wednesday 23
David 734.838.9801 Thursday 10
Frank 634.296.1259 Friday 15
Mary 313.449.1390 Saturday 14
Roberto 313.123.4567 Tuesday 2
Sally 248.344.5576 Monday 8
Ted 248.496.2207 Sunday 16
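The join output above has the phone before the day. join's -o option can put the fields in the order the question asked for. A small sketch, assuming both files are keyed on the first field; the function name join_oncall and the temp files are only for illustration, and LC_ALL=C is set so sort and join agree on ordering:

```shell
# join_oncall FILE1 FILE2
# FILE1 lines: "Name Phone"; FILE2 lines: "Name Day Hour".
# Prints "Name Day Phone Hour", ordered by name.
join_oncall() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
    LC_ALL=C sort "$1" > "$tmp1"
    LC_ALL=C sort "$2" > "$tmp2"
    # -o picks output fields as FILE.FIELD
    LC_ALL=C join -o 1.1,2.2,1.2,2.3 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}
```

Usage: join_oncall file1 file2 > output.txt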
Improved question for clarity:
Hello there so I have about 2000 csv files
One master file called fileaa.csv
And 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv... (some numbers are missing)
I want to add a 3rd column to the 2 column master file:
| link | link2 |
1| somelink.com | somelink2.com |
like so
| link | link2 | description |
1| somelink.com | somelink2.com | some description |
where the description of line 1 comes from fileaa-1.csv, which is a single-cell csv with a paragraph of text.
Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.
Edit (incl. commands):
Things I couldn't try:
cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv
wouldn't work because of the missing numbers
awk '{print $0,NR}' fileaa.csv; \
find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \;
because sed can't read external files inside the -exec sed command
Edit 1:
The exact contents of fileaa-1.csv are:
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
The exact input:
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
The exact desired output:
| link | link2 | description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
Edit 2:
The contents of fileaa.csv are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv that does not match a row in fileaa.csv.
Edit 3:
There are no | or linefeeds in the data.
To be honest I am a complete beginner and I don't really know where to start on this one.
Any help will be appreciated ❤️
Assumptions:
the 'paragraph' from the fileaa-*.csv files is on a single line (ie, does not include any embedded linefeeds)
assuming the sample from OP's fileaa-1.csv is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds)
we can ignore anything on lines 2-N from the fileaa-*.csv files
we only append a field to a line in fileaa.csv if we find a matching file (ie, we don't worry about appending an empty field if the matching fileaa-*.csv file does not exist)
the final result (ie, contents of all files) will fit in memory
Adding some additional sample data:
$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"
==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas
==> fileaa.csv <==
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
NOTE: since there is no fileaa-4.csv we will not append anything to the last line (where 1st field = 4) in fileaa.csv
One awk idea:
master='fileaa'
awk '
FNR==NR { if (FNR==1)
             lines[0]=$0 " Description |"            # save header line
          else {
             split($0,a,"|")                          # get line number
             ndx=a[1]+0                               # remove spaces and leading zeros
             lines[ndx]=$0                            # save line
             max=ndx > max ? ndx : max                # keep track of the max line number
          }
          next
        }
        { split(FILENAME,a,/[-.]/)                    # split filename on dual delimiters: hyphen and period
          ndx=a[2]+0                                  # remove leading zeros
          lines[ndx]=lines[ndx] " " $0 " |"           # append current line to matching line from 1st file
          nextfile                                    # skip the rest of the current file
        }
END     { for (i=0;i<=max;i++)
              print lines[i]
        }
' "${master}".csv "${master}"-*.csv
This generates:
| link | link2 | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
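Since OP mentions about 100 masters, a variation that streams each master row by row and pulls the description with getline may be easier to loop over. This is only a sketch under the same assumptions as above (one-line descriptions, row number in the first |-delimited field); the function name append_descriptions is made up, and it expects to be run from the directory holding the files:

```shell
# append_descriptions MASTER
# MASTER is the file name without ".csv", e.g. "fileaa".
# Prints MASTER.csv with the first line of MASTER-<row>.csv appended
# to each row; rows without a matching file are printed unchanged.
append_descriptions() {
    awk -v master="$1" '
        NR == 1 { print $0 " description |"; next }    # header row
        {
            split($0, parts, "|")                      # parts[1] holds the row number
            f = master "-" (parts[1] + 0) ".csv"       # +0 strips spaces and leading zeros
            if ((getline desc < f) > 0)                # read only the first line, if the file exists
                $0 = $0 " " desc " |"
            close(f)
            print
        }
    ' "$1.csv"
}
```

Usage for a single master: append_descriptions fileaa > fileaa.new.csv. Wrapping that call in a for loop over the master names would cover the remaining masters; nothing is held in memory beyond the current row.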
This might work.
Based on the FILENAME ending in a number or not ending in a number, columns one and two are collected if the FILENAME does not end in a number and column three is collected if the FILENAME ends in a number.
After all input files are processed, columns one, two, and three are printed.
./doit.awk fileaa*
|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)
#!/usr/local/bin/gawk -f
BEGIN { FS="|" }
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
join_on[$1]=$1
c1[$1] = $2
c2[$1] = $3
joins++
}
FILENAME ~ /[0-9]\.csv$/ {
match(FILENAME , /-([0-9]+)\.csv/, join_int)
c3[join_int[1]] = $0
}
END {
print "|link|link2|Description"
for (j in join_on) {
print j "|" c1[j] "|" c2[j] "|" c3[j]
}
}
This question already has answers here:
Joining multiple fields in text files on Unix
(11 answers)
Closed 2 years ago.
I have two txt files with different lengths.
File 1:
Albania 20200305 0
Albania 20200306 0
Albania 20200307 0
Albania 20200308 0
Albania 20200309 3
Albania 20200310 7
Albania 20200311 4
Albania 20200312 2
File 2:
Europe Albania 20200309 2
Europe Albania 20200310 6
Europe Albania 20200311 10
Europe Albania 20200312 11
Europe Albania 20200313 23
Europe Albania 20200314 33
I would like to create a File3 that appends the 3rd column of File1 to the end of each File2 line whenever the 1st and 2nd columns of File1 match the 2nd and 3rd columns of File2. It should look like this:
File3:
Europe Albania 20200309 2 3
Europe Albania 20200310 6 7
Europe Albania 20200311 10 4
Europe Albania 20200312 11 2
I have tried
awk 'NR==FNR{A[$1,$2]=$3;next} (($2,$3) in A) {print $0, A[$1,$2]}' file1.txt file2.txt > file3.txt
but it is just printing File 2, it does not add the third column of File1.
Can you please help me with the problem.
Thanks in advance!
Your approach is correct, but when printing you need to use A[$2,$3]. You are using A[$1,$2], which does not exist in array A (the 1st and 2nd columns of file1 are compared against the 2nd and 3rd columns of file2), so only the current line of file2 ends up in file3.
awk 'NR==FNR{a[$1,$2]=$3;next} (($2,$3) in a) {print $0, a[$2,$3]}' file1 file2
Also see this link (thanks to James for providing it): Why we shouldn't use variables in capital letters
Here is my code
#!bin/bash
IFS=$'\r\n'
GLOBIGNORE='*'
command eval
'array=($(<'$1'))'
sorted=($(sort <<<"${array[*]}"))
for ((i = -1; i <= ${array[-25]}; i--)); do
echo "${array[i]}" | awk -F "/| " '{print $2}'
done
I keep getting an error that says "line 5: array=($(<)): command not found"
This is my problem.
As a whole my code should read in a file as a command line argument, sort the elements, then print out column 2 of the last 25 lines. I haven't been able to test this far so if there's a problem there too any help would be appreciated.
This is some of what the file contains:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
16227 nicole
15308 daniel
15163 babygirl
14726 monkey
14331 lovely
14103 jessica
13984 654321
13981 michael
13488 ashley
13456 qwerty
13272 111111
13134 iloveu
13028 000000
12714 michelle
11761 tigger
11489 sunshine
11289 chocolate
11112 password1
10836 soccer
10755 anthony
10731 friends
10560 butterfly
10547 purple
10508 angel
10167 jordan
9764 liverpool
9708 justin
9704 loveme
9610 fuckyou
9516 123123
9462 football
9310 secret
9153 andrea
9053 carlos
8976 jennifer
8960 joshua
8756 bubbles
8676 1234567890
8667 superman
8631 hannah
8537 amanda
8499 loveyou
8462 pretty
8404 basketball
8360 andrew
8310 angels
8285 tweety
8269 flower
8025 playboy
7901 hello
7866 elizabeth
7792 hottie
7766 tinkerbell
7735 charlie
7717 samantha
7654 barbie
7645 chelsea
7564 lovers
7536 teamo
7518 jasmine
7500 brandon
7419 666666
7333 shadow
7301 melissa
7241 eminem
7222 matthew
In Linux you can simply do a
sort -nbr file_to_sort | head -n 25 | awk '{print $2}'
read in a file as a command line argument, sort the elements, then
print out column 2 of the last 25 lines.
From that description of the problem, I suggest:
#! /bin/sh
sort -bn "$1" | tail -n 25 | awk '{print $2}'
As a rule, use the shell to operate on filenames, and never use the
shell to operate on data. Utilities like sort and awk are far
faster and more powerful than the shell when it comes to processing a
file.
I'm trying to take the 2 files below and create a script that sorts them into a new file with 3 columns, headers, and some additional info.
I know the command to combine files and sort (cat file1 file2 | sort > file3), but I don't know how to align the columns or add headings.
File 1
Dave 734.838.9800
Bob 313.123.4567
Carol 248.344.5576
Mary 313.449.1390
Ted 248-496-2204
Alice 616.556.4458
File 2
Bob Tuesday
Carol Monday
Ted Sunday
Alice Wednesday
Dave Thursday
Mary Saturday
Anticipated New File
Name On-Call Phone
Carol MONDAY 248.344.5576
Bob TUESDAY 313.123.4567
Alice WEDNESDAY 616.556.4458
Dave THURSDAY 734.838.9800
Mary SATURDAY 313.449.1390
Ted SUNDAY 248.496.2204
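One possible way to get there is a single awk pass over both files, with printf handling the headings and the column alignment. This is just a sketch: the function name oncall_table and the column widths are arbitrary, it assumes at most one person per day, and it rewrites hyphens in phone numbers as dots to match the anticipated output:

```shell
# oncall_table FILE1 FILE2
# FILE1 lines: "Name Phone"; FILE2 lines: "Name Day".
# Prints a header plus one aligned row per day, Monday first.
oncall_table() {
    awk '
        BEGIN {
            n = split("MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY SUNDAY", days, " ")
            printf "%-8s %-10s %s\n", "Name", "On-Call", "Phone"
        }
        NR == FNR {                         # first file: remember phone by name
            gsub(/-/, ".", $2)              # normalise 248-496-2204 to 248.496.2204
            phone[$1] = $2
            next
        }
        { who[toupper($2)] = $1 }           # second file: remember name by upper-cased day
        END {
            for (i = 1; i <= n; i++)        # walk the days in calendar order
                if (days[i] in who)
                    printf "%-8s %-10s %s\n", who[days[i]], days[i], phone[who[days[i]]]
        }
    ' "$1" "$2"
}
```

Usage: oncall_table file1 file2 > file3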
I'm just wondering how we can use awk to do exact matches.
For example:
$ cal 09 09 2009
September 2009
Su Mo Tu We Th Fr Sa
1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see the above command outputs the index number of all the lines that contain the string/number "9", is there a way to make awk output index number in only the 4th line of cal output above.??? may be an even more elegant solution?
I'm using awk to get the day name using the cal command. here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($o,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found then i get multiple results, whereas i want it to output only one result.
Thanks!
Limit the call to index() to only those lines which have your "day" surrounded by spaces:
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
February 1956
Su Mo Tu We Th Fr Sa
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
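One caveat with counting fields: in the first week of a month the line has fewer than seven fields (for example "1 2 3 4" in the February 1956 output), so the field number no longer lines up with the weekday. A sketch that reads fixed-width columns instead, assuming the classic cal layout shown here (Sunday first, three characters per day, no leading margin, no highlighting); the function name day_name_from_cal is made up:

```shell
# day_name_from_cal DAY
# Reads cal-style text on stdin and prints the weekday name for DAY.
day_name_from_cal() {
    awk -v day="$1" '
        BEGIN { split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", name, " ") }
        NR > 2 {                                    # skip the title and the weekday header
            for (i = 0; i < 7; i++)                 # each day cell is 2 characters plus a space
                if (substr($0, i * 3 + 1, 2) + 0 == day) {
                    print name[i + 1]
                    exit
                }
        }
    '
}
```

Usage: cal 02 1956 | day_name_from_cal 1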
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command like so:
$ day=9;month=9;year=2009;
$ dayOfWeek=$(date +%A -d "$year-$month-$day")
$ echo $dayOfWeek
Wednesday
you wrote
cal 09 09 2009
I'm not aware of a version of cal that accepts day of month as an input,
only
cal ${mon} (optional) ${year} (optional)
But, that doesn't affect your main issue.
you wrote
is there a way to make awk output index number in only the 4th line of cal output above.?
NR (Num Rec) is your friend
and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, then it is better to use the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why set bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can build much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, (date?).
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'