AWK or SED: replace the space between words in a particular column [closed] - bash

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 3 years ago.
I have an infile as below:
infile:
INM00042170 28.2500 74.9167 290.0 CHURU 2015 2019 2273
INM00042182 28.5833 77.2000 211.0 NEW DELHI/SAFDARJUNG 1930 2019 67874
INXUAE05462 28.6300 77.2000 216.0 NEW DELHI 1938 1942 2068
INXUAE05822 25.7700 87.5200 40.0 PURNEA 1933 1933 179
INXUAE05832 31.0800 77.1800 2130.0 SHIMLA 1926 1928 728
PKM00041640 31.5500 74.3333 214.0 LAHORE CITY 1960 2019 22915
I want to replace the space between two words with an underscore in column 5 (example: NEW DELHI becomes NEW_DELHI). I want the output below.
outfile:
INM00042170 28.2500 74.9167 290.0 CHURU 2015 2019 2273
INM00042182 28.5833 77.2000 211.0 NEW_DELHI/SAFDARJUNG 1930 2019 67874
INXUAE05462 28.6300 77.2000 216.0 NEW_DELHI 1938 1942 2068
INXUAE05822 25.7700 87.5200 40.0 PURNEA 1933 1933 179
INXUAE05832 31.0800 77.1800 2130.0 SHIMLA 1926 1928 728
PKM00041640 31.5500 74.3333 214.0 LAHORE_CITY 1960 2019 22915
Thank you

#!/bin/bash
# Join fields 5 and 6 and keep only the candidates where field 6 is not
# a number. This yields the list of new names (with underscore) for all
# cities that need replacing.
NEW_NAMES=($(awk '{print $5 "_" $6}' infile | grep -vE "_[0-9]"))
# iterate over all new names
for NEW_NAME in "${NEW_NAMES[@]}"; do
    OLD_NAME=$(echo "$NEW_NAME" | tr '_' ' ')
    # replace in file
    sed -i "s/${OLD_NAME}/${NEW_NAME}/g" infile
done
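The loop above rewrites the file once per multi-word name; the same result can also be had in a single awk pass. A sketch, assuming (as in the sample) that station names span at most two words and the field after the name always starts with a digit:

```shell
# Build a small copy of the input (layout as in the question).
cat > infile <<'EOF'
INM00042170 28.2500 74.9167 290.0 CHURU 2015 2019 2273
INM00042182 28.5833 77.2000 211.0 NEW DELHI/SAFDARJUNG 1930 2019 67874
PKM00041640 31.5500 74.3333 214.0 LAHORE CITY 1960 2019 22915
EOF

# If field 6 does not start with a digit, it is the second word of the
# station name: glue it onto field 5 and shift the remaining fields left.
awk '$6 !~ /^[0-9]/ { $5 = $5 "_" $6
                      for (i = 7; i <= NF; i++) $(i - 1) = $i
                      NF-- } 1' infile > outfile
cat outfile
```

Decrementing NF to drop the last field works in GNU awk and mawk; with other awks the portable route is to rebuild the line explicitly.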

Related

Adding text to a specific (row,column) in csv with sed

Improved question for clarity:
I have about 2000 CSV files:
one master file called fileaa.csv,
and 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv, ... (some numbers are missing).
I want to add a 3rd column to the 2 column master file:
| link | link2 |
1| somelink.com | somelink2.com |
like so
| link | link2 | description |
1| somelink.com | somelink2.com | some description |
where the description of line 1 comes from fileaa-1.csv, which is a single-cell csv with a paragraph of text.
Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.
Edit (incl. commands):
Things I couldn't try:
cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv
wouldn't work because of the missing numbers
awk '{print $0,NR}' fileaa.csv; \
find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \;
because sed can't read external files inside the -exec sed command
Edit 1:
The exact contents of fileaa-1.csv are:
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
The exact input:
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
The exact desired output:
| link | link2 | description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
Edit 2:
The contents of fileaa.csv are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv that does not match a row in fileaa.csv.
Edit 3:
There are no | characters or linefeeds in the data.
To be honest I am a complete beginner and I don't really know where to start on this one.
Any help will be appreciated ❤️
Assumptions:
the 'paragraph' from the fileaa-*.csv files is on a single line (ie, does not include any embedded linefeeds)
assuming the sample from OP's fileaa-1.csv is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds)
we can ignore anything on lines 2-N from the fileaa-*.csv files
we only append a field to a line in fileaa.csv if we find a matching file (ie, we don't worry about appending an empty field if the matching fileaa-*.csv file does not exist)
the final result (ie, contents of all files) will fit in memory
Adding some additional sample data:
$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"
==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas
==> fileaa.csv <==
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
NOTE: since there is no fileaa-4.csv we will not append anything to the last line (where 1st field = 4) in fileaa.csv
One awk idea:
master='fileaa'
awk '
FNR==NR { if (FNR==1)
              lines[0]=$0 " Description |"      # save header line
          else {
              split($0,a,"|")                   # get line number
              ndx=a[1]+0                        # remove spaces and leading zeros
              lines[ndx]=$0                     # save line
              max= ndx > max ? ndx : max        # keep track of the max line number
          }
          next
        }

        { split(FILENAME,a,/[-.]/)              # split filename on dual delimiters: hyphen and period
          ndx=a[2]+0                            # remove leading zeros
          lines[ndx]=lines[ndx] " " $0 " |"     # append current line to matching line from 1st file
          nextfile                              # skip the rest of the current file
        }

END     { for (i=0;i<=max;i++)
              print lines[i]
        }
' "${master}".csv "${master}"-*.csv
This generates:
| link | link2 | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
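A plain shell-loop sketch of the same join, for readers who prefer the logic spelled out step by step. It assumes, as above, one-line description files named fileaa-N.csv; the sample filenames and contents here are made up:

```shell
# Sample files (hypothetical contents, layout as in the question).
printf '%s\n' '| link | link2 |' \
  '1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |' \
  '2| https://example.com/a | https://example.com/b |' > fileaa.csv
printf '%s\n' '"a short paragraph"' > fileaa-1.csv   # no fileaa-2.csv on purpose

{
  IFS= read -r header < fileaa.csv
  printf '%s description |\n' "$header"              # extend the header row
  tail -n +2 fileaa.csv | while IFS='|' read -r num rest; do
    n=$(printf '%s' "$num" | tr -d ' ')              # strip spaces around the row number
    if [ -f "fileaa-$n.csv" ]; then
      IFS= read -r desc < "fileaa-$n.csv"            # first line = the description
      printf '%s|%s %s |\n' "$num" "$rest" "$desc"
    else
      printf '%s|%s\n' "$num" "$rest"                # no matching file: leave the row as-is
    fi
  done
} > fileaa-merged.csv
cat fileaa-merged.csv
```

The loop opens one file per row, so for 100 masters with ~2000 descriptions each the awk approach above will be considerably faster; this version is only meant to make the join visible.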
This might work.
Based on the FILENAME ending in a number or not ending in a number, columns one and two are collected if the FILENAME does not end in a number and column three is collected if the FILENAME ends in a number.
After all input files are processed, columns one, two, and three are printed.
./doit.awk fileaa*
|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)
#!/usr/local/bin/gawk -f
BEGIN { FS="|" }

# master file: FILENAME does not end in "<digit>.csv"
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
    join_on[$1] = $1
    c1[$1] = $2
    c2[$1] = $3
    joins++
}

# description files: FILENAME ends in "<digit>.csv"
FILENAME ~ /[0-9]\.csv$/ {
    match(FILENAME, /-([0-9]+)\.csv/, join_int)   # gawk 3-arg match
    c3[join_int[1]] = $0
}

END {
    print "|link|link2|Description"
    for (j in join_on) {
        print j "|" c1[j] "|" c2[j] "|" c3[j]
    }
}

Display row/column data for csv with max value in another column, same row (bash)

I'm trying to make a script that sorts column 2 for highest value, prints said value, and prints column 3 for every row matching this value. Here's an example of unsorted csv:
Argentina,4.6,2016,some data
Argentina,4.2,2018,some data
Argentina,4.6,1998,some data
Argentina,4.5,2001,some data
Desired output would be:
4.6
2016
1998
Here's what I've got so far, but I'm feeling unsure if I'm going about it correctly:
grep "$2*" "$1"> new.csv
sort -t, -k2,2nr new.csv > new2.csv
cut -f3 -d"," new2.csv
Wherein $2 is the name of the country in the first column and $1 is the filename. While it sorts the values in the 2nd column just fine, I'd like to show the years for only the rows with the max value in column 2. This route just prints the years for all of the rows, and I understand why that's happening, but I'm not sure of the best course to get the intended result from there. What are some ways to go about this? Thanks in advance.
You could do something like this:
declare maxvalue_found=no
declare maxvalue=''
while read -r line; do
    # split the line on ',' into its four fields
    IFS=',' read -r country value year data <<< "$line"
    if [[ "${maxvalue_found}" == no ]]; then
        echo "$value"
        maxvalue="${value}"
        maxvalue_found=yes
    fi
    if [[ "${value}" == "${maxvalue}" ]]; then
        echo "$year"
    fi
done < new2.csv
new2.csv is your sorted file: we simply read it line by line, then split each line on ',' (https://www.gnu.org/software/bash/manual/bash.html#Word-Splitting):
The first value should be the highest, thanks to the sort.
Each following value must be tested, because you only want the ones that match.
The years are printed in the same order as in new2.csv.
Assumptions:
comma only shows up as a field delimiter (ie, comma is not part of any data)
no sorting requirement has been defined for the final result
One awk idea requiring 2 passes through the unsorted file:
awk -F, ' # set input field delimiter as comma
FNR==NR { max=($2>max ? $2 : max); next} # 1st pass of file (all rows): keep track of max value from field #2
FNR==1 { print max } # 2nd pass of file (1st row ): print max
$2==max { print $3 } # 2nd pass of file (all rows): if field #2 matches "max" then print field #3
' unsorted.csv unsorted.csv
This generates:
4.6
2016
1998
Another GNU awk idea that requires a single pass through the unsorted file:
awk -F, ' # set input field delimiter as comma
{ arr[$2][$3] # save fields #2 and #3 as indices in array "arr[]"
max = ( $2 > max ? $2 : max) # keep track of max value from field #2
}
END { print max # after file has been processed ... print max and then ...
for (i in arr[max]) # loop through indices of 2nd dimension where 1st dimension == max
print i # print 2nd dimension index (ie, field #3)
}
' unsorted.csv
This generates:
4.6
1998
2016
NOTES:
GNU awk required for arrays of arrays (ie, multidimensional arrays)
while field #3 appears to be sorted, this is not guaranteed unless we modify the code to explicitly sort the 2nd dimension of the array
How about a single pass in awk instead of multi-pass? I've generated a synthetic version of the file, with some randomized data, to create a 6.24 million row version of it:
INPUT
out9: 177MiB 0:00:01 [ 105MiB/s] [ 105MiB/s] [ <=> ]
rows = 6243584. | UTF8 chars = 186289540. | bytes = 186289540.
CODE
The default value is initialized to the gigantically negative value -2^512 (or, more elegantly, -4^4^4) to ensure it always takes on the value from row 1. If you really want to play it safe, make it something very close to negative infinity, e.g. -(3+4+1)^341, -16^255, -256^127, or -1024^102.
{m,g}awk '
BEGIN {
    _= -(_^= __= _+= _^= FS= OFS= ",")^_^_
    ___= split("",____)
}
# Rule(s)
+_ <= +$__ {
    __= $(NF = ++__)
    if ((+_) < +$--NF) {
        _= $NF
        ___= split("",____)
    }
    ____[++___]=__
    __=NF
}
END {
    print _
    for (__^=_<_; __<=___; __++) {
        print ____[__]
    }
}'
OUTPUT (column 3 printed exactly in input row order)
53.6 1834 1999 1866 1938 1886 1973 1968 1921 1984 1957 1891 1864 1992
1998 1853 1950 1985 1962 2018 1897 1979 2020 1954 1995 1980 1900 1997
1856 1975 1851 1853 1988 1897 1973 1875 1917 1861 1912 1912 1954 1871
1952 1877 2003 1886 1863 1899 1897 1853 2013 1956 1965 1854 1873 1915
1983 1961 1965 1979 1919 1970 1946 1843 1856 1954 1965 1831 1926 1964
1994 1969 1831 1945 1942 1971 1988 1879 1998 1986 1844 1846 1994 1894
2008 1851 1877 1979 1970 1852 1942 1889 1986 2013 1905 1932 2021 1944
1866 1892 1940 1989 1907 1982 2016 1966 1975 1831 1851 2003 1980 1963
1869 1983 1972 2013 1972 1948 1843 1928 1959 1911 1844 1920 1943 1864
1985 1978 1855 1986 1975 1880 2001 1914 1877 1900 1964 1995 1992 1968
1868 1974 2012 1827 1849 1849 1992 1942 1884 1876 2021 1866 1977 1857
1866 1937 1920 1983 1915 1887 1890 1852 1871 1972 1903 1944 1943 1957
1844 1932 1854 1890 1891 1866 1923 1924 1941 1845 1907 2019
(further rows truncated for readability)
A single pass awk:
$ awk -F, '{
if($2>=m||m=="") {
b= ($2==m?b:$2) ORS $3 # b is the record buffer
m=$2 # m holds the maximum of $2 so far
}
}
END {
print b
}' file
Output:
4.6
2016
1998
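For completeness, the same output can be produced without awk at all, using cut/sort/grep, under the same assumption that commas never appear inside fields. A sketch:

```shell
# Sample data from the question.
printf '%s\n' 'Argentina,4.6,2016,some data' \
              'Argentina,4.2,2018,some data' \
              'Argentina,4.6,1998,some data' \
              'Argentina,4.5,2001,some data' > unsorted.csv

max=$(cut -d, -f2 unsorted.csv | sort -nr | head -n1)  # highest value in column 2
echo "$max"
# rows whose 2nd field equals the max, in input order; the dot in "$max" is
# a regex metacharacter, harmless here only because the data is simple
grep "^[^,]*,${max}," unsorted.csv | cut -d, -f3
```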

getting the sum of the out put in unix [duplicate]

This question already has answers here:
Summing values of a column using awk command
(2 answers)
Closed 5 years ago.
I am trying to get the sum of my output in the bash shell. One of the constraints is that I can only use awk for this.
This is the code I am using for getting the output:
awk '{print substr($7, 9, 4)}' emp.txt
This is the output I am getting: (output omitted)
7606
6498
7947
4044
1657
3872
4834
8463
9280
2789
9104
this is how I am trying to do the sum of the numbers: awk '(s = s + substr($7, 9, 4)) {print s}' emp.txt
The problem is that it is not giving me the right output (which should be 9942686); instead it prints the running total at every line (as shown below).
(output omitted)
9890696
9898643
9902687
9904344
9908216
9913050
9921513
9930793
9933582
9942686
Am I using the code the wrong way? Or is there another method of doing it with awk that I am missing?
Here is the sample file I am working on:
Brynlee Watkins F 55 Married 2016 778-555-6498 62861
Malcolm Curry M 24 Married 2016 604-555-7947 54647
Aylin Blake F 45 Married 2015 236-555-4044 80817
Mckinley Hodges F 50 Married 2015 604-555-1657 46316
Rylan Dorsey F 51 Married 2017 778-555-3872 77160
Taylor Clarke M 23 Married 2015 604-555-4834 46624
Vivaan Hooper M 26 Married 2016 778-555-8463 80010
Gibson Rowland M 42 Married 2017 236-555-9280 59874
Alyson Mahoney F 51 Single 2017 778-555-2789 71394
Catalina Frazier F 53 Married 2016 604-555-9104 79364
EDIT: I want to get the sum of the numbers that repeat in the output. Let's say the repeating numbers in the output are 4826 and 0028, and both of them repeat 2 times. I only want the sum of these numbers (each repetition counted individually, so these count as 4 numbers). The desired output for these 4 numbers would be 9708:
Will Duffy M 33 Single 2017 236-555-4826 47394
Nolan Reed M 27 Single 2015 604-555-0028 46622
Anya Horn F 54 Married 2017 236-555-4826 73270
Cynthia Davenport F 29 Married 2015 778-555-0028 59687
Oscar Medina M 43 Married 2016 778-555-7864 73688
Angelina Herrera F 37 Married 2017 604-555-7910 82061
Peyton Reyes F 35 Married 2017 236-555-8046 51920
Since you only need the total sum printed once, do it under the END pattern:
END { print s }
i.e.:
awk '{s = s + substr($7, 9, 4)} END {print s}' emp.txt
Could you please try the following awk and let me know if this helps you. It will always look for the digits after the last -:
awk -F' |-' '{sum+=$(NF-1)} END{print sum}' Input_file
EDIT:
awk -F' |-' '
{
++a[$(NF-1)];
b[$(NF-1)]=b[$(NF-1)]?b[$(NF-1)]+$(NF-1):$(NF-1)
}
END{
for(i in a){
if(a[i]>1){
print i,b[i]}
}}
' Input_file
Output will be as follows:
4826 9652
0028 56
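If, per the question's edit, a single grand total of the repeated numbers is wanted (9708 for the sample rows), a hedged variant of the same counting idea multiplies each repeated value by its count:

```shell
# Sample rows from the question's edit.
cat > Input_file <<'EOF'
Will Duffy M 33 Single 2017 236-555-4826 47394
Nolan Reed M 27 Single 2015 604-555-0028 46622
Anya Horn F 54 Married 2017 236-555-4826 73270
Cynthia Davenport F 29 Married 2015 778-555-0028 59687
Oscar Medina M 43 Married 2016 778-555-7864 73688
EOF

awk -F' |-' '
{ ++count[$(NF-1)] }                        # tally each 4-digit phone suffix
END { for (i in count)
          if (count[i] > 1)
              total += count[i] * (i + 0)   # i+0 strips leading zeros (0028 -> 28)
      print total
}' Input_file
```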

In many lines of the same values that I want to be counted [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
My main file is:
785
785
788
788
883
883
883
921
921
921
921
921
921
925
925
I want to count the same values and write the results in a new file (as follows):
785 2
788 2
883 3
921 6
925 2
Thank you for your help.
sort myFile.txt | uniq -c | awk '{ print $2 " " $1}' > myNewFile.txt
Edit: added sort and removed cat to take comments into account
And if you want only values which appear at least 4 times:
sort temp.txt | uniq -c | sort -n | egrep -v "^ *[0-3] " | awk '{ print $2 " " $1}'
Imagine your file is called t.
You can do it with:
sort -u t | while read -r line  # read each distinct element, in sorted order
do
    echo -n "$line "            # print the element
    grep -cx "$line" t          # count the lines that match it exactly
done
kent$ awk '{a[$0]++}END{for(x in a)print x, a[x]}' f
921 6
925 2
883 3
785 2
788 2
print only count >=4:
kent$ awk '{a[$0]++}END{for(x in a)if(a[x]>=4)print x, a[x]}' f
921 6
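One caveat with "for (x in a)": the traversal order is unspecified, which is why the counts above come out shuffled. If the output should follow the numeric order of the values, pipe through sort. A sketch:

```shell
# Recreate the sample file f from the question.
printf '%s\n' 785 785 788 788 883 883 883 921 921 921 921 921 921 925 925 > f

# same awk as above, with the output sorted numerically by value
awk '{a[$0]++} END{for (x in a) print x, a[x]}' f | sort -n
```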

using awk to do exact match in a file

I'm just wondering how we can use awk to do exact matches.
For example:
$ cal 09 09 2009
September 2009
Su Mo Tu We Th Fr Sa
1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see, the above command outputs the index number for every line that contains the string/number "9". Is there a way to make awk output the index number from only the 4th line of the cal output? Or maybe there's an even more elegant solution?
I'm using awk to get the day name from the cal command. Here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($o,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found then I get multiple results, whereas I want it to output only one result.
Thanks!
Limit the call to index() to only those lines which have your "day" surrounded by spaces:
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
February 1956
Su Mo Tu We Th Fr Sa
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command like so:
$ day=9;month=9;year=2009;
$ dayOfWeek=$(date +%A -d "$day/$month/$year")
$ echo $dayOfWeek
Wednesday
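One caveat with -d "$day/$month/$year": GNU date reads slash-separated dates as US-style month/day/year, which only happens to work here because day and month are both 9. The ISO 8601 form is unambiguous. A sketch:

```shell
day=9; month=9; year=2009
# ISO 8601 (YYYY-MM-DD) avoids the MM/DD vs DD/MM ambiguity of slashed dates
dayOfWeek=$(LC_ALL=C date +%A -d "$(printf '%04d-%02d-%02d' "$year" "$month" "$day")")
echo "$dayOfWeek"
```

which again prints Wednesday.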
you wrote
cal 09 09 2009
I'm not aware of a version of cal that accepts day of month as an input,
only
cal ${mon} (optional) ${year} (optional)
But, that doesn't affect your main issue.
you wrote
is there a way to make awk output index number in only the 4th line of cal output above.?
NR (Number of Records) is your friend,
and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, it is better to put them in the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why develop bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can get much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, (date?).
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'
