Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3

With input sorted on the first column (as in your example; otherwise just pipe it through sort first), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
Unlike the other solutions posted as of this edit, it has the advantage of not holding the whole dataset in memory. This comes at the cost of requiring sorted input (and it is the sorting that would need lots of memory if the input weren't already sorted).
Explanation:
the first line initializes the currentHost and currentApps variables from the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned on that line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the record for the previous host is printed, then the variables are reinitialized from the current line of input
the last line prints the record for the current host once the end of the input is reached
It can probably be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
See it in action!
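If your input is not already sorted on the first column, a sort invocation like this (assuming the data lives in a file named data, as above) produces suitable input for the awk command:
sort -t, -k1,1 data
Here -t, sets the field delimiter to a comma and -k1,1 sorts on the first field only, grouping the lines by host.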

$ awk '
BEGIN { FS=","; ORS="" }                            # handle all output line breaks ourselves
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" } # new host: start a new output line
{ print OFS $2; OFS=FS }                            # print the app: ";" right after the host, "," otherwise
END { print ors }                                   # terminate the final line
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1

Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read -r host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[#]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.

You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column and appends every element from the second column to that entry.
Once the whole file has been parsed, the content of the array is printed.
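One thing to be aware of: for (i in a) visits keys in an unspecified order, so the hosts may not come out sorted. If you need a stable order you can simply pipe the output through sort:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file | sort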

Related

Bash: Adding the contents of a column to a variable

So I have a file that contains some text, but some lines contain only "Overall>5" (the number after > could be any number from 0 to 5).
The code I wrote:
let sumOfReviews=0
while read line; do awk -F ">" '{if($1=="Overall") ((sumOfReviews+=$2))}'; done<$file
echo $sumOfReviews
I tried splitting these lines into 2 columns at ">" and I want to add the number in the second column (5 in this case) to a variable. The problem is that when I print out the value of the variable at the end, it shows 0. Any thoughts? Thanks!
If called from the console, the following seems to do the job:
awk -F'>' '/Overall>[0-9]+/ { total += $2 } END { print total }' exampleData.txt
If you want to capture the result in a variable inside bash, you have to enclose it in $( ... ):
#!/bin/bash
total="$(awk -F'>' '/Overall>[0-9]+/ { total += $2 } END { print total }' exampleData.txt)"
# do something with `total` here.
You cannot simply use awk as some sort of syntax inside bash; awk is a separate programming language, invoked as a completely separate process. You can, however, build bash parameters into awk's source code.
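For instance, here is a minimal sketch of passing a bash variable into awk with the -v option (the threshold variable min is made up for illustration):
#!/bin/bash
min=3   # hypothetical threshold, for illustration only
awk -F'>' -v min="$min" '/Overall>[0-9]+/ && $2 >= min { total += $2 } END { print total }' exampleData.txt
The -v option assigns the shell value to an awk variable before the program runs, which is safer than splicing the value between quotes.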
This can be done in a one-line awk script.
awk 'BEGIN { FS=">"; sumOfReviews=0 } /^Overall>[0-5]/ { sumOfReviews+=$2 } END { print sumOfReviews }' < file
Explanation from Manpage:
An AWK program consists of a sequence of pattern-action statements and optional function definitions.
pattern { action statements }
In this case we have used the BEGIN pattern to set the file separator to ">" and the sumOfReviews variable to 0.
We use the /^Overall>[0-5]/ regular expression pattern to match lines beginning with "Overall>" followed by a number 0-5 and if true add the $2 field to the sumOfReviews variable.
Finally we use the END pattern to output the final sumOfReviews value.
Example solution in a bash shell script:
#!/bin/bash
noAuthors=4 # set to 4 for example
sumOfReviews=$(awk 'BEGIN { FS=">"; sumOfReviews=0 } /^Overall>[0-5]/ { sumOfReviews+=$2 } END { print sumOfReviews }' < file)
echo $(($sumOfReviews/$noAuthors))
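One caveat: $(( ... )) performs integer division, so the average is truncated. If you want a fractional result, one sketch is to let awk do the division instead:
awk -v s="$sumOfReviews" -v n="$noAuthors" 'BEGIN { printf "%.2f\n", s / n }'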
awk and bash are two separate programs; they don't share variables. All you need is a single awk script:
awk -F '>' '$1 == "Overall" {reviews += $2}; END {print reviews}' "$file"

Find which row has fewer columns using awk

I have a file where 4 fields are expected in each row. If a row has fewer fields, I want to write that information, along with the row number, to a logfile.
Filed1line1| Filed2line1| Filed3line1| Filed4line1
Filed1line2| Filed2line2|
Filed1line3| Filed2line3| Filed3line3| Filed4line3
Something like: Row number 2 has 3 fields in file a.txt
Can we achieve this using awk?
Actually I am using the below code snippet. If the number of fields is not 4, I write the row to a bad file; that part works fine. But I am unable to write the NR value to the log.
awk -F'|' -v DGFNM="$IN_DIR$DGFNAME" -v DBFNM="$IN_DIR$DBFNAME" '
$1 == "DTL" {
if (NF == 4) {
print substr($0, 5) > DGFNM
} else {
print > DBFNM
print NR >> $logfile
}
}
' "$IN_DIR$IN_FILE"
Easy: NF is the number of fields in the record and NR is the record number.
Something like: awk '{ if (NF < 4) { print "Row " NR " has " NF " fields"; } }' - there are shorter ways, but I prefer longer code that is easier to read ;-)
See this question for some info on printing to different output files: is it possible to print different lines to different output files using awk
To answer your edited question: $logfile is inside the single quotes, so it is not expanded to your shell variable logfile, and it is not an awk variable either. Try print NR >> "some_file"; in the awk script, and then rename some_file to $logfile afterwards.
Another option would be to generate the awk file with the expanded $logfile already in place instead of trying to do it inline.
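For example, a minimal sketch of that fix using -v, assuming your shell variable logfile holds the log path:
awk -F'|' -v logfile="$logfile" '
$1 == "DTL" && NF != 4 { print NR >> logfile }
' "$IN_DIR$IN_FILE"
Inside the program, logfile is then an ordinary awk variable containing a filename, so no shell expansion is needed.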

setting the NR to 1 does not work (awk)

I have the following script in bash.
awk -F ":" '{if($1 ~ "^fall")
{ NR==1
{{printf "\t<course id=\"%s\">\n",$1} } } }' file1.txt > container.xml
So what I have is a small file. If ANY line starts with fall, then I want the first field of the VERY first line.
So I did that in the code and set NR==1. However, it does not do the job!
Try this:
awk -F: 'NR==1 {id=$1} $1~/^fall/ {printf "\t<course id=\"%s\">\n",id}' file1.txt > container.xml
Notes:
NR==1 {id=$1}
This saves the course ID from the first line
$1~/^fall/ {printf "\t<course id=\"%s\">\n",id}
If any line begins with fall, then the course ID is printed.
The above code illustrates that awk commands can be preceded by conditions. Thus, id=$1 is executed only if we are on the first line: NR==1. In this way, it is often unnecessary to have explicit if statements.
In awk, assignment is done with = while tests for equality are done with ==.
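For comparison, the same logic written with explicit if statements inside a single action block would be:
awk -F: '{
    if (NR == 1) id = $1
    if ($1 ~ /^fall/) printf "\t<course id=\"%s\">\n", id
}' file1.txt > container.xml
The pattern-based version above expresses the same thing more idiomatically.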
If this doesn't do what you want, then please add sample input and corresponding desired output to the question.
awk -F: 'NR==1{x=$1}/^fall/{printf "\t<course id=\"%s\">\n",x;exit}' file
Note:
if the file has any line beginning with fall, it prints the first field of the very first line in the desired format (an xml tag).
no matter how many lines start with fall, it outputs the xml tag only once (note the exit).
if no line in the file starts with fall, it outputs nothing.
#!/usr/bin/awk -f
BEGIN {
FS = ":"
}
NR==1 {
foo = $1
}
/^fall/ {
printf "\t<course id=\"%s\">\n", foo
}
Also note this from the awk man page, under BUGS:
The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.
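For example, assuming the script above is saved in a file named course.awk (a made-up name), the FS assignment could equally be given on the command line, making the BEGIN block unnecessary:
awk -f course.awk FS=':' file1.txt
A var=value argument is processed when awk reaches it, so the assignment takes effect before file1.txt is read.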

Replace special characters in variable in awk shell command

I am currently executing the following command:
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; print H > "/Directory/FILE_"$3"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$3"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
This takes the value from the third position in the CSV file and creates a CSV for each distinct $3 value. Works as desired.
The input file looks as follows:
Name, Amount, ID
"ABC", "100.00", "0000001"
"DEF", "50.00", "0000001"
"GHI", "25.00", "0000002"
Unfortunately I have no control over the $3 value in the source (CSV) sheet, but I would like to eliminate special (non-alphanumeric) characters from it. I tried the following to accomplish this but failed...
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; name=${$3//[^a-zA-Z_0-9]/}; print H > "/Directory/FILE_"$name"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$name"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
Suggestions? I'm hoping to do this in a single command but if anyone has a bash script answer that would work.
This is definitely not a job you should be using getline for, see http://awk.info/?tip/getline
It looks like you just want to reproduce the first line of your input file in every $3-named file. That'd be:
awk -F, '
NR==1 { hdr=$0; next }
$3 != prev { prev=name=$3; gsub(/[^[:alnum:]_]/,"",name); $0 = hdr "\n" $0 }
{ print > ("/Directory/FILE_" name "_DOWNLOAD.csv") }
' /Directory/FILE_ALL_DOWNLOAD.csv
Note that you must always parenthesize expressions on the right side of output redirection (>) as it's ambiguous otherwise and different awks will behave differently if you don't.
Feel free to put it all back onto one line if you prefer.
If you always expect the number to be in the last field of your CSV and you know that each field is wrapped in quotes, you could use this awk to extract the value 456 from the input you have provided in the comment:
echo " 123.", "Company Name" " 456." | awk -F'[^a-zA-Z0-9]+' 'NF { print $(NF-1) }'
This defines the field separator as any number of non-alphanumeric characters and retrieves the second-last field.
If this is sufficient to reliably retrieve the value, you could construct your filename like this:
file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"
and output to it as you're already doing.
bash variable expansions do not occur inside single quotes, and they cannot be performed on awk variables anyway.
That said, you don't need them to make this work: awk has string manipulation functions that can perform the same task. In this instance you likely want the gsub function.
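For instance, a tiny self-contained example of gsub stripping the non-alphanumeric characters (here from a literal matching your quoted ID values):
awk 'BEGIN { s = "\"0000001\""; gsub(/[^[:alnum:]]/, "", s); print s }'
This prints 0000001; in your command the same call would be applied to $3.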
Would this not work for what you asked?
awk -F, 'a=NR==1{x=$0;next}
!a{gsub(/[^[:alnum:]]/,"",$3);print x"\n"$0 >> ("/Directory/FILE_"$3"_DOWNLOAD.csv")}' file

Get next field/column with awk

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, e.g. 4834
2) return the following field (3244)
I'm quite new to awk, but realize this is a simple operation. I have created a bash script that asks the user for the input and attempts to return the following field.
But I can't seem to get around the scoping in awk. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i==input) {
print $(i+1)
}
}
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip, an application which reads from standard input and copies it to the clipboard. E.g.:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?
#!/bin/bash
read input
cat data.txt | awk '{
    for (i = 1; i <= NF; i++) {
        if ($i == '$input') {
            print $(i+1)
        }
    }
}'
Don't overthink it!
You can create an array in awk with the split command:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        print ary[i]
    }
    print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        if (ary[i] == input) {
            print ary[i+1]
        }
    }
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.
You pass a shell variable to awk using the -v option. It's cleaner than having to splice quotes.
awk -v input="$input" '{
    for (i = 1; i <= NF; i++) {
        if ($i == input) {
            print "Next value: " $(i+1)
        }
    }
}' data.txt
And lose the useless cat.
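Regarding the September 8th update about xclip: awk writes its result to standard output like any other command, so you can just pipe it on, e.g. (a sketch reusing the command above):
awk -v input="$input" '{ for (i = 1; i <= NF; i++) if ($i == input) print $(i+1) }' data.txt | xclip
There is no need to capture the value in a shell variable and echo it.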
Here is my solution: delete everything up to (and including) the search field, then the field you want to print out is field #1 ($1):
awk '/4834/ {sub(/^.* * 4834 /, ""); print $1}' data.txt
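A hedged generalization of the same idea, with the search value passed in as a variable (assuming the shell variable input holds it, and that the value contains no regex metacharacters):
awk -v val="$input" '$0 ~ val { sub("^.*" val " *", ""); print $1 }' data.txt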
