Get next field/column with awk - bash

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, eg 4834
2) return the following field (3244)
I'm quite new to awk, but I realize this is a simple operation. I have created a bash script that asks the user for the input and attempts to return the following field.
But I can't seem to get around scoping in awk. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i==input) {
print $(i+1)
}
}
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip - an application which reads from standard input and copies it to the clipboard. E.g.:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?

#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i=='$input') {
print $(i+1)
}
}
}'

Don't overthink it!
You can create an array in awk with the split command:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
size = split($0, ary)
for (i=1; i <= size; i++) {
print ary[i]
}
print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
size = split($0, ary)
for (i=1; i < size; i++) {
if (ary[i] == input) {
print ary[i+1]
}
}
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.

You pass a shell variable to awk using the -v option. It's cleaner/nicer than having to put quotes.
awk -v input="$input" '{
for(i=1;i<=NF;i++){
if ($i == input ){
print "Next value: " $(i+1)
}
}
}' data.txt
And lose the useless cat.
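A minimal sketch of the -v approach wrapped in a shell function, so that the result can be piped to another program. The function name next_field is an assumption made for illustration; it is not part of the question. The loop stops at NF-1 so that $(i+1) always refers to an existing field.

```shell
#!/bin/bash
# next_field VALUE [FILE...]: print the field that follows each match of VALUE.
# (next_field is a hypothetical helper name, not something from the question.)
next_field() {
    value=$1
    shift
    awk -v input="$value" '{
        # Stop at NF-1 so that $(i+1) is always an existing field.
        for (i = 1; i < NF; i++)
            if ($i == input)
                print $(i + 1)
    }' "$@"
}

# Example: feed the two sample rows on standard input.
printf '%s\n' '1234 4334 8677 3753 3453 4554' '4564 4834 3244 3656 2644 0474' |
    next_field 4834
```

Because awk's print writes to standard output, the result can be sent straight to xclip, for example with next_field "$input" data.txt | xclip. Note that awk compares number-like fields numerically, so 0474 would also match an input of 474.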

Here is my solution: delete everything up to (and including) the search field, then the field you want to print out is field #1 ($1):
awk '/4834/ {sub(/^.* * 4834 /, ""); print $1}' data.txt

Related

combining numbers from multiple text files using bash

I'm struggling to combine some data from the txt files generated in my Jenkins job.
Each file contains one line, which looks like this:
testsuite name="mytest" cars="201" users="0" bus="0" bike="0" time="116.103016"
What I have managed to do so far is extract the numbers from each txt file:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.txt
The results are:
cars="193" users="2" bus="0" bike="0"
cars="23" users="2" bus="10" bike="7"
cars="124" users="2" bus="5" bike="0"
cars="124" users="2" bus="0" bike="123"
Now I have an arbitrary number of files like this:
my-output1.txt
my-output2.txt
my-output7.txt
my-output*.txt
I would like a single command, like the one above, that sums the values across all of the files and prints a result like this:
cars=544 users=32 bus=12 bike=44
Is there a way to do that with a single command line?
Using awk
$ cat script.awk
BEGIN {
FS="[= ]"
} {
gsub(/"/,"")
for (i=1;i<NF;i++)
if ($i=="cars") cars+=$(i+1)
else if($i=="users") users+=$(i+1);
else if($i=="bus") bus+=$(i+1);
else if ($i=="bike")bike+=$(i+1)
} END {
print "cars="cars,"users="users,"bus="bus,"bike="bike
}
To run the script, you can use;
$ awk -f script.awk my-output*.txt
Or, as an ugly one-liner:
$ awk -F"[= ]" '{gsub(/"/,"");for (i=1;i<NF;i++) if ($i=="cars") cars+=$(i+1); else if($i=="users") users+=$(i+1); else if($i=="bus") bus+=$(i+1); else if ($i=="bike")bike+=$(i+1)}END{print"cars="cars,"users="users,"bus="bus,"bike="bike}' my-output*.txt
1st solution: With your shown samples, please try the following awk code, which uses the match function. Since awk can read multiple files in a single invocation and your files all end in .txt, you can pass *.txt directly to awk.
Written and tested in GNU awk, whose match function can store capturing groups in an array for later use in the program.
awk -v s1="\"" '
match($0,/[[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)"/,tempArr){
temp=""
for(i=2;i<=8;i+=2){
temp=tempArr[i-1]
values[i]+=tempArr[i]
indexes[i-1]=temp
}
}
END{
for(i in values){
val=(val?val OFS:"") (indexes[i-1]"=" s1 values[i] s1)
}
print val
}
' *.txt
Explanation:
The program starts by setting the variable s1 to " for later use.
The main block calls the match function.
The regex [[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)" (explained at the end of this post) creates 8 capturing groups.
When the regex matches, a for loop steps through the even indexes only, which hold the numeric values.
The values array, indexed by i, accumulates the matching tempArr entries, where tempArr is filled in by the match function.
The indexes array stores the corresponding attribute names.
The END block traverses the values array and prints each name from indexes together with its total from values.
Explanation of regex:
[[:space:]]+ ##Matches 1 or more whitespace characters.
(cars)="([^"]*)" ##Matches cars=" and everything up to the next ".
(users)="([^"]*)" ##Matches a space followed by users=" and everything up to the next ".
(bus)="([^"]*)" ##Matches a space followed by bus=" and everything up to the next ".
(bike)="([^"]*)" ##Matches a space followed by bike=" and everything up to the next ".
2nd solution: GNU awk only, using the RS and RT variables. This keeps the output in the same order in which the attributes appear in the input.
awk -v s1="\"" -v RS='[[:space:]][^=]*="[^"]*"' '
RT{
gsub(/^ +|"/,"",RT)
num=split(RT,arr,"=")
if(arr[1]!="time" && arr[1]!="name"){
if(!(arr[1] in values)){
indexes[++count]=arr[1]
}
values[arr[1]]+=arr[2]
}
}
END{
for(i=1;i<=count;i++){
val=(val?val OFS:"") (indexes[i]"=" s1 values[indexes[i]] s1)
}
print val
}
' *.txt
You may use this awk solution:
awk '{
for (i=1; i<=NF; ++i)
if (split($i, a, /=/) == 2) {
gsub(/"/, "", a[2])
sums[a[1]] +=a[2]
}
}
END {
for (i in sums) print i "=" sums[i]
}' file*
bus=15
cars=464
users=8
bike=130
I found a way to do it, although it is a bit long:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.xml | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | awk '{cars+=$1;users+=$2;bus+=$3;bike+=$4 }END{print "cars=" cars " users=" users " bus=" bus " bike=" bike}'
M. Nejat Aydin's answer was a good fit:
awk -F '[ "=]+' '/testsuite name=/{ cars+=$5; users+=$7; buses+=$9; bikes+=$11 } END{ print "cars="cars, "users="users, "buses="buses, "bikes="bikes }' my-output*.xml
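A variant of the same idea that looks attributes up by name rather than by position may be more robust if the attribute order ever changes. This is only a sketch: the function name sum_attrs is hypothetical, and it assumes every input line has the testsuite layout shown in the question.

```shell
#!/bin/bash
# sum_attrs [FILE...]: total the cars/users/bus/bike attributes over all lines.
# (sum_attrs is a hypothetical helper name used for illustration.)
sum_attrs() {
    awk -F '[ "=]+' '
        /testsuite name=/ {
            # With this separator, names and values alternate as fields,
            # so the value of an attribute is the field right after its name.
            for (i = 1; i < NF; i++)
                if ($i == "cars" || $i == "users" || $i == "bus" || $i == "bike")
                    sum[$i] += $(i + 1)
        }
        END {
            printf "cars=%d users=%d bus=%d bike=%d\n",
                sum["cars"], sum["users"], sum["bus"], sum["bike"]
        }
    ' "$@"
}
```

It would be called as sum_attrs my-output*.txt. The fixed printf in the END block keeps the output order stable, which a for (i in sum) loop does not guarantee.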

Appending result of function on another field into csv using shell script, awk

I have a csv file stored as a temporary variable in a shell script (*.sh).
Let's say the data looks like this:
Account,Symbol,Price
100,AAPL US,200
102,SPY US,500
I want to add a fourth column, "Type", which is the result of a shell function "foobar". Run from the command line or a shell script itself:
$ foobar "AAPL US"
"Stock"
$ foobar "SPY US"
"ETF"
How do I add this column to my csv, and populate it with calls to foobar which take the second column as an argument? To clarify, this is my ideal result post-script:
Account,Symbol,Price,Type
100,AAPL US,200,Common Stock
102,SPY US,500,ETF
I see many examples online involving such a column addition using awk, and populating the new column with fixed values, conditional values, mathematical derivations from other columns, etc. - but nothing that calls a function on another field and stores its output.
You may use this awk:
export -f foobar
awk 'BEGIN{FS=OFS=","} NR==1{print $0, "Type"; next} {
cmd = "foobar \"" $2 "\""; cmd | getline line; close(cmd);
print $0, line
}' file.csv
Account,Symbol,Price,Type
100,AAPL US,200,Common Stock
102,SPY US,500,ETF
@anubhava's answer is a good approach, so please don't change the accepted answer; I'm only posting this as an answer because it's too big, and too much in need of formatting, to fit in a comment.
FWIW I'd write his awk script as:
awk '
BEGIN { FS=OFS="," }
NR==1 { type = "Type" }
NR > 1 {
cmd = "foobar \047" $2 "\047"
type = ((cmd | getline line) > 0 ? line : "ERROR")
close(cmd)
}
{ print $0, type }
' file.csv
to:
better protect $2 from shell expansion,
protect against silently printing the previous value if/when cmd | getline fails, and
consolidate the print statements into 1 line so it's easy to change for all output lines if/when necessary.
awk to the rescue!
$ echo "Account,Symbol,Price
100,AAPL US,200
102,SPY US,500" |
awk -F, 'NR>1{cmd="foobar "$2; cmd | getline type} {print $0 FS (NR==1?"Type":type)}'
Not sure you need to quote the input to foobar
Another way not using awk:
paste -d, input.csv <({ read; printf "Type\n"; while IFS=, read -r _ s _; do foobar "$s"; done; } < input.csv)
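For comparison, the same result can be sketched as a plain shell loop, which calls the function directly and so does not depend on the shell that awk spawns being able to see an exported function. The foobar below is a hypothetical stand-in that simply returns the values from the desired output; the name add_type_column is likewise an assumption. The sketch also assumes exactly three columns with no quoted commas.

```shell
#!/bin/bash
# Hypothetical stand-in for the foobar function from the question.
foobar() {
    case $1 in
        'AAPL US') echo 'Common Stock' ;;
        'SPY US')  echo 'ETF' ;;
        *)         echo 'Unknown' ;;
    esac
}

# add_type_column: read a three-column CSV on stdin, append a Type column.
add_type_column() {
    # The first line is the header: copy it and add the new column name.
    IFS= read -r header
    printf '%s,Type\n' "$header"
    # Remaining lines: split on commas and call foobar on the second field.
    while IFS=, read -r account symbol price; do
        printf '%s,%s,%s,%s\n' "$account" "$symbol" "$price" "$(foobar "$symbol")"
    done
}
```

Usage would be add_type_column < input.csv. One process is started per row for the command substitution, so this is likely to be slow on very large files.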

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it through sort), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
Its advantage over the other solutions posted as of this edit is that it avoids holding all of the data in memory. This comes at the cost of needing the input to be sorted (and sorting is what would put lots of data in memory if the input wasn't sorted already).
Explanation:
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned in the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the information for the previous host is printed, then the variables are reinitialized to the values of the current line of input
the last line prints the information for the current host once the end of the input has been reached
It can probably be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
See it in action !
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column. It appends to that array entry all elements from the second column.
When the file is parsed, the content of the array is printed.

Replace special characters in variable in awk shell command

I am currently executing the following command:
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; print H > "/Directory/FILE_"$3"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$3"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
This takes the value from the third position in the CSV file and creates a CSV for each distinct $3 value. Works as desired.
The input file looks as follows:
Name, Amount, ID
"ABC", "100.00", "0000001"
"DEF", "50.00", "0000001"
"GHI", "25.00", "0000002"
Unfortunately I have no control over the value in the source (CSV) sheet, the $3 value, but I would like to eliminate special (non-alphanumeric) characters from it. I tried the following to accomplish this but failed...
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; name=${$3//[^a-zA-Z_0-9]/}; print H > "/Directory/FILE_"$name"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$name"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
Suggestions? I'm hoping to do this in a single command but if anyone has a bash script answer that would work.
This is definitely not a job you should be using getline for, see http://awk.info/?tip/getline
It looks like you just want to reproduce the first line of your input file in every $3-named file. That'd be:
awk -F, '
NR==1 { hdr=$0; next }
$3 != prev { prev=name=$3; gsub(/[^[:alnum:]_]/,"",name); $0 = hdr "\n" $0 }
{ print > ("/Directory/FILE_" name "_DOWNLOAD.csv") }
' /Directory/FILE_ALL_DOWNLOAD.csv
Note that you must always parenthesize expressions on the right side of output redirection (>) as it's ambiguous otherwise and different awks will behave differently if you don't.
Feel free to put it all back onto one line if you prefer.
If you always expect the number to be in the last field of your CSV and you know that each field is wrapped in quotes, you could use this awk to extract the value 456 from the input you have provided in the comment:
echo " 123.", "Company Name" " 456." | awk -F'[^a-zA-Z0-9]+' 'NF { print $(NF-1) }'
This defines the field separator as any number of non-alphanumeric characters and retrieves the second-last field.
If this is sufficient to reliably retrieve the value, you could construct your filename like this:
file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"
and output to it as you're already doing.
bash variable expansions do not occur in single quotes.
They also cannot be performed on awk variables.
That being said you don't need that to work.
awk has string manipulation functions that can perform the same tasks. In this instance you likely want the gsub function.
Would this not work for what you asked ?
awk -F, 'a=NR==1{x=$0;next}
!a{gsub(/[^[:alnum:]]/,"",$3);print x"\n"$0 >> "/Directory/FILE_"$3"_DOWNLOAD.csv"}' file
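Putting the gsub suggestion together, a sketch that sanitizes a copy of $3 (leaving the record itself untouched) and writes the header once per output file might look like this. The function name split_by_id and its two arguments are assumptions; the output file name pattern is the one from the question.

```shell
#!/bin/bash
# split_by_id INPUT.csv OUTDIR: write one FILE_<id>_DOWNLOAD.csv per distinct $3.
# (split_by_id is a hypothetical helper name used for illustration.)
split_by_id() {
    awk -F, -v dir="$2" '
        NR == 1 { hdr = $0; next }       # remember the header line
        {
            name = $3                    # work on a copy so $0 stays intact
            gsub(/[^[:alnum:]_]/, "", name)
            file = dir "/FILE_" name "_DOWNLOAD.csv"
            if (!(file in started)) {    # first row for this file: emit header
                print hdr > file
                started[file] = 1
            }
            print > file
        }
    ' "$1"
}
```

It would be run as split_by_id /Directory/FILE_ALL_DOWNLOAD.csv /Directory. Every output file stays open until awk exits, so an input with a very large number of distinct IDs could hit the open-file limit in some awk implementations.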

AWK split file by separator and count

I have a large 220 MB file. The file is divided into groups by horizontal rows of dashes ("---"). This is what I have so far:
cat test.list | awk -v ORS="" -v RS="-------------------------------------------------------------------------------" '{print $0;}'
How do I take this and print to a new file every 1000 matches?
Is there another way to do this? I looked at split and csplit, but the "----" rows do not occur predictably, so I have to match them and then split on a count of the matches.
I would like each output file to contain a group of 1000 matches.
To output the first 1000 records to outputfile0, the next to outputfile1, etc., just do:
awk 'NR%1000 == 1{ file = "outputfile" i++ } { print > file }' ORS= RS=------ test.list
(Note that I truncated the dashes in RS for simplicity.)
Unfortunately, using a value of RS that is more than a single character produces unspecified results, so the above cannot be the solution. Perhaps something like twalberg's solution is required:
awk '/^----$/ { if(!(c%1000)) count+=1; c+=1; next }
{print > ("outputfile"count)}' c=1 count=1
Not tested, but something along these lines might work:
awk 'BEGIN {fileno=1; matchcount=0}
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount=0; } }
{ print $0 > ("output_file_" fileno) }' < test.list
It might be cleaner to put all that in, say split.awk and use awk -f split.awk test.list instead...
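A parameterized sketch along the same lines, with the group count and output prefix passed in via -v. The function name split_groups and its arguments are assumptions, as is the choice to treat each separator line as the end of its group, so every output file finishes with a separator.

```shell
#!/bin/bash
# split_groups PER_FILE PREFIX [FILE...]: start a new output file after every
# PER_FILE separator lines (lines made only of dashes).
# (split_groups is a hypothetical helper name used for illustration.)
split_groups() {
    per_file=$1
    prefix=$2
    shift 2
    awk -v n="$per_file" -v prefix="$prefix" '
        BEGIN { fileno = 1 }
        # Every line, separators included, goes to the current file.
        { print > (prefix fileno) }
        # After the n-th separator, close this file and move to the next.
        /^-+$/ && (++seen % n) == 0 { close(prefix fileno); fileno++ }
    ' "$@"
}
```

For the question's case this would be split_groups 1000 outputfile test.list. Matching the separator as an ordinary line avoids relying on a multi-character RS, and closing each finished file keeps the number of open files at one.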

Resources