Convert multi-line file into TSV using awk - Windows

I'm using Windows 7 & gawk 3.1.3 (via UnxUtils).
I'd like to turn this input (Liverpool FC's fixtures):
Sunday, 27 November 2011
Barclays Premier League
Liverpool v Man City, 16:00
Tuesday, 29 November 2011
Carling Cup
Chelsea v Liverpool, QF, 19:45
...
into a tab-separated file, such as:
Sunday, 27 November 2011<tab>Barclays Premier League<tab>Liverpool v Man City, 16:00
Tuesday, 29 November 2011<tab>Carling Cup<tab>Chelsea v Liverpool, QF, 19:45
...
I've tried doing this with awk, but failed thus far. Identifying every first and second line is easy enough:
if (NR % 3 == 1 || NR % 3 == 2) print;
but despite many attempts (usually resulting in syntax errors) I can't work out how to strip the (Windows) line endings and concatenate those lines with every third line.
I'm now wondering if awk is actually the right tool for the job.
Thanks for any pointers.

awk '(NR % 3) > 0 {printf("%s\t",$0)}
     (NR % 3) == 0 {printf("%s\n",$0)}'
Should work. For every line where NR (the record number) modulo 3 is not 0, it prints the line followed by a tab character; otherwise it prints the line followed by a newline.
HTH
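Since the input comes from Windows, each line may also carry a trailing \r that would otherwise end up in the middle of the joined output. A minimal sketch that strips it first (assuming CRLF input; fixtures.txt is a placeholder filename):
awk '{ sub(/\r$/, "") }                  # drop the Windows carriage return, if any
     (NR % 3) > 0  { printf("%s\t", $0) }
     (NR % 3) == 0 { printf("%s\n", $0) }' fixtures.txt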

see the test below:
kent$ echo "Sunday, 27 November 2011
Barclays Premier League
Liverpool v Man City, 16:00
Tuesday, 29 November 2011
Carling Cup
Chelsea v Liverpool, QF, 19:45
"|awk '{printf $0"\t";if(!(NR%3))print""}'
output (the separators are real tabs, shown here as single spaces):
Sunday, 27 November 2011 Barclays Premier League Liverpool v Man City, 16:00
Tuesday, 29 November 2011 Carling Cup Chelsea v Liverpool, QF, 19:45

Related

Display row/column data for csv with max value in another column, same row (bash)

I'm trying to make a script that sorts column 2 for highest value, prints said value, and prints column 3 for every row matching this value. Here's an example of unsorted csv:
Argentina,4.6,2016,some data
Argentina,4.2,2018,some data
Argentina,4.6,1998,some data
Argentina,4.5,2001,some data
Desired output would be:
4.6
2016
1998
Here's what I've got so far, but I'm feeling unsure if I'm going about it correctly:
grep "$2*" "$1"> new.csv
sort -t, -k2,2nr new.csv > new2.csv
cut -f3 -d"," new2.csv
Wherein $2 is the name of the country in the first column and $1 is the filename. While it sorts the values in the 2nd column just fine, I'd like to show the years for only the rows with the max value in column 2. This route prints the years for all of the rows, and I understand why that's happening, but I'm not sure of the best course to get the intended result from there. What are some ways to go about this? Thanks in advance.
You could do something like this:
declare maxvalue_found=no
declare maxvalue=''
while read -r line; do
    IFS=',' read -r country value year data <<< "$line"
    if [[ "${maxvalue_found}" == no ]]; then
        echo "$value"
        maxvalue="${value}"
        maxvalue_found=yes
    fi
    if [[ "${value}" == "${maxvalue}" ]]; then
        echo "$year"
    fi
done < new2.csv
new2.csv is your sorted file: we simply read it line by line, then split each line on ',' (https://www.gnu.org/software/bash/manual/bash.html#Word-Splitting):
The first value should be the highest due to the sort.
Each subsequent value must be tested because you want only those that match.
The years are printed in the same order as in new2.csv.
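To produce new2.csv, the question's own grep and sort can be reused; a sketch (the anchored pattern "^$2," is my assumption about the intended country filter):
grep "^$2," "$1" | sort -t, -k2,2nr > new2.csv
# then run the while-read loop above against new2.csv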
Assumptions:
comma only shows up as a field delimiter (ie, comma is not part of any data)
no sorting requirement has been defined for the final result
One awk idea requiring 2 passes through the unsorted file:
awk -F, ' # set input field delimiter as comma
FNR==NR { max=($2>max ? $2 : max); next} # 1st pass of file (all rows): keep track of max value from field #2
FNR==1 { print max } # 2nd pass of file (1st row ): print max
$2==max { print $3 } # 2nd pass of file (all rows): if field #2 matches "max" then print field #3
' unsorted.csv unsorted.csv
This generates:
4.6
2016
1998
Another GNU awk idea that requires a single pass through the unsorted file:
awk -F, ' # set input field delimiter as comma
{ arr[$2][$3] # save fields #2 and #3 as indices in array "arr[]"
max = ( $2 > max ? $2 : max) # keep track of max value from field #2
}
END { print max # after file has been processed ... print max and then ...
for (i in arr[max]) # loop through indices of 2nd dimension where 1st dimension == max
print i # print 2nd dimension index (ie, field #3)
}
' unsorted.csv
This generates:
4.6
1998
2016
NOTES:
GNU awk required for arrays of arrays (ie, multidimensional arrays)
while field #3 appears to be sorted, this is not guaranteed unless we modify the code to explicitly sort the 2nd dimension of the array
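If sorted output is wanted, GNU awk can also order the array traversal itself; a sketch of the same single-pass idea using gawk's PROCINFO["sorted_in"]:
awk -F, '
  { arr[$2][$3]                               # save fields #2 and #3 as indices
    max = ( $2 > max ? $2 : max)              # keep track of max value from field #2
  }
  END { PROCINFO["sorted_in"] = "@ind_num_asc"    # traverse indices in ascending numeric order
        print max
        for (i in arr[max])
            print i
  }
' unsorted.csv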
How about a single pass in awk instead of multiple passes? I've generated a synthetic version of the file, randomizing some of the data, to create a 6.24 million row version of it:
INPUT
rows = 6243584. | UTF8 chars = 186289540. | bytes = 186289540.
CODE
The default value is initialized to a gigantically negative value of -2^512, or more elegantly, -4^4^4, to ensure it'll always take on the value on row 1. If you really want to play it safe, then make it something very close to negative infinity, e.g. -(3+4+1)^341, -16^255, -256^127, or -1024^102.
{m,g}awk '
BEGIN {
    _ = -(_^= __= _+= _^= FS = OFS = ",")^_^_
    ___ = split("", ____)
}
+_ <= +$__ {
    __ = $(NF = ++__)
    if ((+_) < +$--NF) {
        _ = $NF
        ___ = split("", ____)
    }
    ____[++___] = __
    __ = NF
}
END {
    print _
    for (__^= _<_; __<=___; __++) {
        print ____[__]
    }
}'
OUTPUT (column 3 printed exactly in input row order)
53.6 1834 1999 1866 1938 1886 1973 1968 1921 1984 1957 1891 1864 1992
1998 1853 1950 1985 1962 2018 1897 1979 2020 1954 1995 1980 1900 1997
1856 1975 1851 1853 1988 1897 1973 1875 1917 1861 1912 1912 1954 1871
1952 1877 2003 1886 1863 1899 1897 1853 2013 1956 1965 1854 1873 1915
1983 1961 1965 1979 1919 1970 1946 1843 1856 1954 1965 1831 1926 1964
1994 1969 1831 1945 1942 1971 1988 1879 1998 1986 1844 1846 1994 1894
2008 1851 1877 1979 1970 1852 1942 1889 1986 2013 1905 1932 2021 1944
1866 1892 1940 1989 1907 1982 2016 1966 1975 1831 1851 2003 1980 1963
1869 1983 1972 2013 1972 1948 1843 1928 1959 1911 1844 1920 1943 1864
1985 1978 1855 1986 1975 1880 2001 1914 1877 1900 1964 1995 1992 1968
1868 1974 2012 1827 1849 1849 1992 1942 1884 1876 2021 1866 1977 1857
1866 1937 1920 1983 1915 1887 1890 1852 1871 1972 1903 1944 1943 1957
1844 1932 1854 1890 1891 1866 1923 1924 1941 1845 1907 2019
(further rows truncated for readability)
A single-pass awk:
$ awk -F, '{
if($2>=m||m=="") {
b= ($2==m?b:$2) ORS $3 # b is the record buffer
m=$2 # m holds the maximum of $2 so far
}
}
END {
print b
}' file
Output:
4.6
2016
1998

getting the sum of the output in unix [duplicate]

This question already has answers here:
Summing values of a column using awk command
(2 answers)
Closed 5 years ago.
I am trying to get the sum of my output in the bash shell. One constraint is that I may only use awk for this.
This is the code I am using for getting the output:
awk '{print substr($7, 9, 4)}' emp.txt
This is the output I am getting: (output omitted)
7606
6498
7947
4044
1657
3872
4834
8463
9280
2789
9104
this is how I am trying to do the sum of the numbers: awk '(s = s + substr($7, 9, 4)) {print s}' emp.txt
The problem is that it is not giving me the right output (which should be 9942686) but instead giving me the running (cumulative) sum, as shown below.
(output omitted)
9890696
9898643
9902687
9904344
9908216
9913050
9921513
9930793
9933582
9942686
Am I using the code the wrong way? Or is there any other method of doing it with awk and I am doing it the wrong way?
Here is the sample file I am working on:
Brynlee Watkins F 55 Married 2016 778-555-6498 62861
Malcolm Curry M 24 Married 2016 604-555-7947 54647
Aylin Blake F 45 Married 2015 236-555-4044 80817
Mckinley Hodges F 50 Married 2015 604-555-1657 46316
Rylan Dorsey F 51 Married 2017 778-555-3872 77160
Taylor Clarke M 23 Married 2015 604-555-4834 46624
Vivaan Hooper M 26 Married 2016 778-555-8463 80010
Gibson Rowland M 42 Married 2017 236-555-9280 59874
Alyson Mahoney F 51 Single 2017 778-555-2789 71394
Catalina Frazier F 53 Married 2016 604-555-9104 79364
EDIT: I want to get the sum of the numbers that repeat in the output. Let's say the repeating numbers are 4826 and 0028, and each of them appears twice. I only want the sum of these numbers, with each repetition counted individually (so these count as 4 numbers). The desired output for these 4 numbers would therefore be 9708.
Will Duffy M 33 Single 2017 236-555-4826 47394
Nolan Reed M 27 Single 2015 604-555-0028 46622
Anya Horn F 54 Married 2017 236-555-4826 73270
Cynthia Davenport F 29 Married 2015 778-555-0028 59687
Oscar Medina M 43 Married 2016 778-555-7864 73688
Angelina Herrera F 37 Married 2017 604-555-7910 82061
Peyton Reyes F 35 Married 2017 236-555-8046 51920
END { print s }
Since you only need the total sum printed once, do it under the END pattern.
awk '{s = s + substr($7, 9, 4)} END {print s}' emp.txt
Could you please try the following awk and let me know if this helps you. It will always look for the digits after the last -:
awk -F' |-' '{sum+=$(NF-1)} END{print sum}' Input_file
EDIT:
awk -F' |-' '
{
    ++a[$(NF-1)]
    b[$(NF-1)] = b[$(NF-1)] ? b[$(NF-1)] + $(NF-1) : $(NF-1)
}
END{
    for(i in a){
        if(a[i]>1){
            print i, b[i]
        }
    }
}' Input_file
Output will be as follows:
4826 9652
0028 56
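If instead a single grand total across all repeated numbers is wanted (the 9708 from the edit), the same arrays can be summed in the END block; a sketch:
awk -F' |-' '
{
    ++a[$(NF-1)]
    b[$(NF-1)] += $(NF-1)
}
END{
    for(i in a){
        if(a[i]>1){
            total += b[i]
        }
    }
    print total
}' Input_file
With the sample from the edit this prints 9708 (4826+4826+28+28).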

Determining week number on a 4 week cycle in bash script

What I want is a cycle of 4 weeks in a bash script.
My question is: How do I know this week's number in the cycle.
week x monday : echo one
week x+1 monday : echo two
week x+2 monday : echo three
week x+3 monday : echo four
and again
week x+4 monday : echo one
and so on
What I have is the epoch:
(UTC), Thursday, 1 January 1970
and consequently:
(UTC), Monday, 5 January 1970 (I can set this to echo 1)
Any suggestions? Converting dates is no problem. Just a general idea is ok.
I think you are expecting to do something like this, with GNU date:
start_date=$(date -d "1970-01-05" '+%s') # Corresponding to 1
end_date=$(date -d "2017-01-02" '+%s') # Current week
Number of weeks between the dates
numberOfWeeks=$(( ( end_date - start_date )/(60*60*24*7) ))
printf "%s\n" "$numberOfWeeks"
2452
Now to determine which week this corresponds to, do
printf "The current week %s belongs to week %d" "$(date)" "$(((numberOfWeeks%4) + 1))"
The current week Mon, Jan 02, 2017 4:47:09 PM belongs to week 1
For further weeks down the line, say the 4th Monday of March 2017, use the same computation, recomputing numberOfWeeks with the new end date:
end_date=$(date -d "2017-03-27" '+%s')
printf "The week %s belongs to week %d" "$(date -d "2017-03-27")" "$(((numberOfWeeks%4) + 1))"
The week Mon, Mar 27, 2017 12:00:00 AM belongs to week 1
Another example, for the 3rd Monday of March 2017 (again recomputing numberOfWeeks):
end_date=$(date -d "2017-03-20" '+%s')
printf "The week %s belongs to week %d" "$(date -d "2017-03-20")" "$(((numberOfWeeks%4) + 1))"
The week Mon, Mar 20, 2017 12:00:00 AM belongs to week 4
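The same computation wrapped into a function, as a minimal sketch (GNU date assumed; the function name is my own invention):
week_in_cycle() {
    local start end weeks
    start=$(date -d "1970-01-05" '+%s')       # the Monday chosen above as week one
    end=$(date -d "${1:-now}" '+%s')
    weeks=$(( (end - start) / (60*60*24*7) ))
    echo $(( weeks % 4 + 1 ))
}
week_in_cycle "2017-03-20"    # prints 4, matching the example above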
You can format the date output for showing the week number:
function printweek {
    # force base 10: %V zero-pads the week number (e.g. "08"),
    # which bash arithmetic would otherwise reject as invalid octal
    weeknr=$((10#$(date '+%V' -d "+$1 weeks")))
    echo "$((weeknr%4))"
}
# Test
for week in 0 1 2 3 4 5 6 30 31 32 33; do
echo "Week offset from today ${week} => $(printweek ${week})"
done
This will work when you start counting over each year (beginning at week 1 again). If you want the count to continue across 1 January, the script becomes more difficult; see the solution from @Inian.
Another option might be to look at the output of the last run and add one, modulo 4, to the week number from that run.

Combine text from two files, output to another [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 1 year ago.
I'm having a bit of a problem and I've been searching all day. This is my first Unix class, so don't be too harsh.
This may sound fairly simple, but I can't get it.
I have two text files
file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
I am trying to write a script using a looping structure that will combine both files and produce the output below as a separate file.
output:
Name On-Call Phone Start Time
Sally Monday 248.344.5576 8am
Roberto Tuesday 313.123.4567 2am
Alice Wednesday 616.556.4458 11pm
David Thursday 734.838.9801 10am
Frank Friday 634.296.1259 3pm
Mary Saturday 313.449.1390 2pm
Ted Sunday 248.496.2207 4pm
This is what I tried (I know it's horrible):
echo " Name On-Call Phone Start Time"
file="/home/xubuntu/date.txt"
file1="/home/xubuntu/name.txt"
while read name2 phone
do
    while read name day time
    do
        echo "$name $day $phone $time"
    done < "$file"
done < "$file1"
any help would be appreciated
First, sort the files using sort and then use this command:
paste file1 file2 | awk '{print $1,$4,$2,$5}'
This will bring you pretty close. After that you have to figure out how to format the time from the 24 hour format to the 12 hour format.
If you want to avoid using sort separately, you can bring in a little more complexity like this:
paste <(sort file1) <(sort file2) | awk '{print $1,$4,$2,$5}'
Finally, if you have not yet figured out how to print the time in 12 hour format, here is your full command:
paste <(sort file1) <(sort file2) | awk '{"date --date=\"" $5 ":00:00\" +%I%P" |& getline $5; print $1 " " $4 " " $2 " " $5 }'
You can use tabs (\t) in place of spaces as connectors to get a nicely formatted output.
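If you just want aligned columns, piping through column is another option; a sketch:
paste <(sort file1) <(sort file2) | awk '{print $1,$4,$2,$5}' | column -t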
In this case the join command will also work:
join -1 1 -2 1 <(sort file1) <(sort file2)
Description
-1 1 -> join on field 1 of file1 (the common field)
-2 1 -> join on field 1 of file2 (the common field)
$ cat file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
$ cat file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
output
Alice 616.556.4458 Wednesday 23
David 734.838.9801 Thursday 10
Frank 634.296.1259 Friday 15
Mary 313.449.1390 Saturday 14
Roberto 313.123.4567 Tuesday 2
Sally 248.344.5576 Monday 8
Ted 248.496.2207 Sunday 16
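To get from this join output to the desired layout, reorder the fields and convert the hour; a sketch assuming GNU date for the am/pm conversion:
join <(sort file1) <(sort file2) |
while read -r name phone day hour; do
    printf '%s %s %s %s\n' "$name" "$day" "$phone" "$(date -d "$hour:00" '+%-I%P')"
done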

using awk to do exact match in a file

I'm just wondering how we can use awk to do exact matches.
For example:
$ cal 09 09 2009
   September 2009
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see, the above command outputs the index number for every line of the cal output that contains the string/number "9". Is there a way to make awk output the index number for only the 4th line of the cal output above? Or maybe there's an even more elegant solution?
I'm using awk to get the day name using the cal command. Here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($0,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found then I get multiple results, whereas I want it to output only one result.
Thanks!
Limit the call to index() to only those lines which have your "day" surrounded by spaces:
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
   February 1956
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
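One caveat (my note): mapping the field index straight to a weekday only works on complete lines. For a day in a partial first week (e.g. day=1 above, a Wednesday, but field 1 of its line), a column-based lookup like the question's own index() arithmetic is safer; a sketch:
$ day=1; cal 02 1956 | awk -v day=$day '
    BEGIN { split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array) }
    NR > 2 {
        col = index(" " $0 " ", " " day " ")      # whole-word match; col = starting column of the day
        if (col) print array[int((col + 2) / 3)]  # each weekday occupies 3 columns
    }'
Wednesday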
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command like so:
$ day=9;month=9;year=2009;
$ dayOfWeek=$(date +%A -d "$day/$month/$year")
$ echo $dayOfWeek
Wednesday
you wrote
cal 09 09 2009
I'm not aware of a version of cal that accepts day of month as an input,
only
cal ${mon} (optional) ${year} (optional)
But, that doesn't affect your main issue.
you wrote
is there a way to make awk output index number in only the 4th line of cal output above.?
NR (Num Rec) is your friend
and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, then it is better to use the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why set bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can get much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, (date?).
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'
