Bash script OSX : split CSV - bash

I have this string stored in a file :
ID1,A,B,,,,F
ID2,,,,D,E,F
ID3,,B,C,,,,
and I need to transform it like this:
ID1,A
ID1,B
ID1,F
ID2,D
ID2,E
...
I tried with a loop and IFS (like IFS=","; declare -a Array=($*)) without success.
Does anyone know how to do that?

Pretty straight-forward in Awk,
awk 'BEGIN{FS=OFS=","}{first=$1; for (i=2;i<=NF;i++) if (length($i)) print first,$i}' file
This sets both the input and output field separators to a comma, stores the first field in the variable first, and prints it together with each non-empty remaining field.
As suggested by user #PS. below you can also do,
awk -F, '{for(i=2;i<=NF;i++) if(length($i)) print $1 FS $i}' file
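To make the approach easy to try, here is a minimal sketch that wraps the same loop in a shell function and feeds it the sample rows from the question. The function name id_value_pairs is hypothetical and not part of the original answer.

```shell
# Hypothetical wrapper: reads CSV from stdin, or from any files given as arguments.
id_value_pairs() {
    awk 'BEGIN { FS = OFS = "," }
         {
             for (i = 2; i <= NF; i++)   # start at 2 to skip the ID in field 1
                 if (length($i))         # ignore empty fields
                     print $1, $i        # OFS puts the comma back between them
         }' "$@"
}

# Sample rows from the question
printf '%s\n' 'ID1,A,B,,,,F' 'ID2,,,,D,E,F' 'ID3,,B,C,,,,' | id_value_pairs
```

Because the function passes "$@" straight to awk, it can also be called as id_value_pairs file.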

awk -F, '{for (i=2; i<=NF; i++) if($i != "") print $1","$i}' File
ID1,A
ID1,B
ID1,F
ID2,D
ID2,E
ID2,F
ID3,B
ID3,C
With , as the field separator, for each line, loop from the 2nd field to the last field. If the current field is not empty, print the first field (IDx) and the current field, separated by a ,
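For comparison with the loop-and-IFS attempt mentioned in the question, the same logic can be sketched in plain shell. This is only an illustration under stated assumptions: the function name split_pairs and the variables id, rest and field are made up here, and the awk versions remain shorter and faster on large files.

```shell
# Hypothetical pure-shell version: reads CSV lines on stdin.
split_pairs() {
    while IFS= read -r line || [ -n "$line" ]; do
        id=${line%%,*}      # text before the first comma
        rest=${line#*,}     # everything after the first comma
        if [ "$rest" = "$line" ]; then
            continue        # no comma on this line, nothing to pair
        fi
        (
            IFS=,           # split on commas only, inside this subshell
            set -f          # disable globbing so a field such as * stays literal
            for field in $rest; do
                if [ -n "$field" ]; then
                    printf '%s,%s\n' "$id" "$field"
                fi
            done
        )
    done
}

printf '%s\n' 'ID1,A,B,,,,F' 'ID2,,,,D,E,F' 'ID3,,B,C,,,,' | split_pairs
```

Changing IFS inside a subshell keeps the setting from leaking into the rest of the script.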

Related

awk: select first column and value in column after matching word

I have a .csv where each row corresponds to a person (first column) and attributes with values that are available for that person. I want to extract the names and values of a particular attribute for persons where the attribute is available. The doc is structured as follows:
name,attribute1,value1,attribute2,value2,attribute3,value3
joe,height,5.2,weight,178,hair,
james,,,,,,
jesse,weight,165,height,5.3,hair,brown
jerome,hair,black,breakfast,donuts,height,6.8
I want a file that looks like this:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
Using this earlier post, I've tried a few different awk methods but am still having trouble getting both the first column and then whatever column has the desired value for the attribute (say height). For example the following returns everything.
awk -F "height," '{print $1 "," FS$2}' file.csv
I could grep only the rows with height in them, but I'd prefer to do everything in a single line if I can.
You may use this awk:
cat attrib.awk
BEGIN {
FS=OFS=","
print "name,attribute,value"
}
NR > 1 && match($0, k "[^,]+") {
print $1, substr($0, RSTART+1, RLENGTH-1)
}
# then run it as
awk -v k=',height,' -f attrib.awk file
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
# or this one
awk -v k=',weight,' -f attrib.awk file
name,attribute,value
joe,weight,178
jesse,weight,165
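To see what match() contributes in the script above, this small sketch prints RSTART and RLENGTH for a single row. The function name show_match and the variable key (standing in for k) are hypothetical.

```shell
# Hypothetical helper: show_match KEY, reading rows on stdin.
show_match() {
    awk -v key="$1" '
        match($0, key "[^,]+") {
            # RSTART  = 1-based position where the match begins (the leading comma)
            # RLENGTH = length of the matched text, e.g. ",height,5.2"
            print "RSTART=" RSTART, "RLENGTH=" RLENGTH
            print substr($0, RSTART + 1, RLENGTH - 1)   # drop the leading comma
        }'
}

echo 'joe,height,5.2,weight,178,hair,' | show_match ',height,'
# RSTART=4 RLENGTH=11
# height,5.2
```

Since [^,]+ needs at least one character, an attribute with an empty value (joe's hair) does not match and the row is skipped.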
With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS (the record separator) to the regex ^[^,]*,height,[^,]* and prints RT, the text that matched RS, for each record.
awk -v RS='^[^,]*,height,[^,]*' 'RT{print RT}' Input_file
I'd suggest a sed one-liner:
sed -n 's/^\([^,]*\).*\(,height,[^,]*\).*/\1\2/p' file.csv
One awk idea:
awk -v attr="height" '
BEGIN  { FS=OFS="," }
FNR==1 { print "name", "attribute", "value"; next }
       { for (i=2;i<=NF;i+=2)            # loop through even-numbered fields
             if ($i == attr) {           # if field value is an exact match to the "attr" variable then ...
                 print $1,$i,$(i+1)      # print current name, current field and next field to stdout
                 next                    # no need to check rest of current line; skip to next input line
             }
       }
' file.csv
NOTE: this assumes the input value (height in this example) will match exactly (including same capitalization) with a field in the file
This generates:
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
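If the exact-match assumption noted above is too strict, a case-insensitive variant can be sketched with tolower(). This is an illustration, not the answerer's code: the function name attr_lookup is hypothetical, and skipping attributes whose value is empty is an extra assumption based on the question's "where the attribute is available".

```shell
# Hypothetical helper: attr_lookup ATTRIBUTE [FILE...]
attr_lookup() {
    want=$1
    shift
    awk -v attr="$want" '
        BEGIN    { FS = OFS = ","; attr = tolower(attr) }
        FNR == 1 { print "name", "attribute", "value"; next }   # replace the header
        {
            for (i = 2; i < NF; i += 2)                  # attribute names sit in even fields
                if (tolower($i) == attr && $(i + 1) != "") {
                    print $1, $i, $(i + 1)               # name, attribute, value
                    next                                 # first hit is enough for this row
                }
        }' "$@"
}

# Usage (assumes file.csv has the layout shown in the question):
#   attr_lookup Height file.csv
```

With no file argument the function reads standard input, so it can also sit at the end of a pipeline.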
With a perl one-liner:
$ perl -lne '
print "name,attribute,value" if $.==1;
print "$1,$2" if /^(\w+).*(height,\d+\.\d+)/
' file
output
name,attribute,value
joe,height,5.2
jesse,height,5.3
jerome,height,6.8
awk accepts variable-value arguments following a -v flag before the script. Thus, the name of the required attribute can be passed into an awk script using the general pattern:
awk -v attr=attribute1 ' {} ' file.csv
Inside the script, the value of the passed variable is referenced by the variable name, in this case attr.
Your criteria are to print column 1, the first column containing the name, the column corresponding to the required header value, and the column immediately after that column (holding the matched values).
Thus, the following script lets you fish out the column headed "attribute1" and its next neighbour:
awk -v attr=attribute1 ' BEGIN {FS=","} NR==1{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' data.txt
result:
name,attribute1,value1
joe,height,5.2
james,,
jesse,weight,165
jerome,hair,black
another column (attribute 3):
awk -v attr=attribute3 ' BEGIN {FS=","} NR==1{for (i=1;i<=NF;i++) if($i == attr) col=i;} {print $1","$col","$(col+1)} ' awkNames.txt
result:
name,attribute3,value3
joe,hair,
james,,
jesse,hair,brown
jerome,height,6.8
Just change the value of the -v attr= argument for the required column.
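One detail worth keeping in mind with -v variables: a name written between slashes is not expanded, so /attr/ matches the literal text "attr", while $0 ~ attr uses the variable's value as the pattern. The sketch below illustrates the difference; the function name compare_matches is hypothetical.

```shell
# Hypothetical helper: compare_matches VALUE, reading lines on stdin.
compare_matches() {
    awk -v attr="$1" '
        /attr/    { print "literal /attr/ matched: " $0 }   # the four letters a-t-t-r
        $0 ~ attr { print "dynamic ~ attr matched: " $0 }   # the value passed with -v
    '
}

printf '%s\n' 'attribute1' 'height' 'weight' | compare_matches height
# literal /attr/ matched: attribute1
# dynamic ~ attr matched: height
```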

Separating onto a new line based on a delimiter

I have some rows in my file that look like this
ENSG00000003096:E4.2|E5.1
ENSG00000035115:E14.2|E15.1
ENSG00000140987:E5.2|ENSG00000140987:E6.1
ENSG00000154358:E46.1|E47.1
I would like to separate them onto a new line based on the delimiter "|" , such that it becomes
ENSG00000003096:E4.2
ENSG00000003096:E5.1
ENSG00000035115:E14.2
ENSG00000035115:E15.1
ENSG00000140987:E5.2
ENSG00000140987:E6.1
ENSG00000154358:E46.1
ENSG00000154358:E47.1
With input data as shown in your question, this seems to work with GNU awk:
awk -F: -v RS="[|]|\n" 'NF==1{print p FS $0;next}NF!=1{p=$1}1' file1
#Output
ENSG00000003096:E4.2
ENSG00000003096:E5.1
ENSG00000035115:E14.2
ENSG00000035115:E15.1
ENSG00000140987:E5.2
ENSG00000140987:E6.1
ENSG00000154358:E46.1
ENSG00000154358:E47.1
Logic:
| or \n are used as record separator RS
: is used as field separator FS
If a record has more than one field, keep the first field in a variable p
If a record has only one field, print the saved first field (variable p), a colon, and the record $0
You may mean something like
awk 'BEGIN{FS=":"}{ split($2, fields, "|"); print $1 ":" fields[1]; print $1 ":" fields[2]; }' my_file.txt
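The command above assumes exactly two pieces per line and that the second piece never carries its own ID, which the third sample row (ENSG00000140987:E5.2|ENSG00000140987:E6.1) does not satisfy. A minimal sketch that copes with both layouts and does not depend on a regex RS could look like this; the function name split_on_pipe is hypothetical.

```shell
# Hypothetical helper: reads rows from stdin, or from files given as arguments.
split_on_pipe() {
    awk -F'|' '{
        split($1, head, ":")            # head[1] is the ID before the first colon
        for (i = 1; i <= NF; i++)
            if (index($i, ":"))         # piece already carries its own ID
                print $i
            else                        # bare piece: borrow the ID of the first piece
                print head[1] ":" $i
    }' "$@"
}

printf '%s\n' 'ENSG00000003096:E4.2|E5.1' \
              'ENSG00000140987:E5.2|ENSG00000140987:E6.1' | split_on_pipe
```

The loop runs to NF, so rows with three or more pipe-separated pieces are handled the same way.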

Awk, Shell Scripting

I have a file which has the following form:
#id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed
111|Arkas|Sarkas|male|1995-09-11|2010-03-17T13:32:10.447+0000|192.248.2.123|Midori
Every field is separated with "|". I am writing a shell script and my goal is to remove the "-" from the fifth field (birthday), in order to make comparisons as if they were numbers.
For example, I want the fifth field to look like |19950911|
The only solution I have found so far, using sed, deletes every "-" on each line, which is not what I want.
I would be extremely grateful if you could show me a solution to my problem using awk.
If this is homework, writing the complete script would be a disservice. Some hints: the function you should be using is gsub in awk. The fifth field is $5, and you can set the field separator with -F'|' or in a BEGIN block with FS="|"
Also, the current line number is in the NR variable; to skip the first line, for example, you can add the condition NR>1
An awk one liner:
awk 'BEGIN { FS="|" } { gsub("-","",$5); print }' infile.txt
To keep "|" as the output separator, also set OFS to "|"; otherwise awk rebuilds each modified line with spaces between the fields:
... | awk 'BEGIN { FS="|"; OFS="|"} {gsub("-","",$5); print $0 }'
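Putting the hints together (gsub on $5, FS and OFS set to "|", NR>1 to leave the header alone), a complete command might look like the sketch below. The function name strip_birthday_dashes is hypothetical.

```shell
# Hypothetical helper: reads the pipe-separated file from stdin or from file arguments.
strip_birthday_dashes() {
    awk 'BEGIN  { FS = OFS = "|" }        # same separator for input and output
         NR > 1 { gsub(/-/, "", $5) }     # data rows only: 1995-09-11 becomes 19950911
                { print }' "$@"
}

printf '%s\n' \
  '#id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed' \
  '111|Arkas|Sarkas|male|1995-09-11|2010-03-17T13:32:10.447+0000|192.248.2.123|Midori' |
  strip_birthday_dashes
```

Only the fifth field is touched, so the dashes in the creationDate field survive.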

Replace special characters in variable in awk shell command

I am currently executing the following command:
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; print H > "/Directory/FILE_"$3"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$3"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
This takes the value from the third position in the CSV file and creates a CSV for each distinct $3 value. Works as desired.
The input file looks as follows:
Name, Amount, ID
"ABC", "100.00", "0000001"
"DEF", "50.00", "0000001"
"GHI", "25.00", "0000002"
Unfortunately I have no control over the value in the source (CSV) sheet, the $3 value, but I would like to eliminate special (non-alphanumeric) characters from it. I tried the following to accomplish this but failed...
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; name=${$3//[^a-zA-Z_0-9]/}; print H > "/Directory/FILE_"$name"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$name"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
Suggestions? I'm hoping to do this in a single command but if anyone has a bash script answer that would work.
This is definitely not a job you should be using getline for, see http://awk.info/?tip/getline
It looks like you just want to reproduce the first line of your input file in every $3-named file. That'd be:
awk -F, '
NR==1 { hdr=$0; next }
$3 != prev { prev=name=$3; gsub(/[^[:alnum:]_]/,"",name); $0 = hdr "\n" $0 }
{ print > ("/Directory/FILE_" name "_DOWNLOAD.csv") }
' /Directory/FILE_ALL_DOWNLOAD.csv
Note that you must always parenthesize expressions on the right side of output redirection (>) as it's ambiguous otherwise and different awks will behave differently if you don't.
Feel free to put it all back onto one line if you prefer.
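For experimenting without touching /Directory, the same idea can be sketched as a function that takes the input file and an output directory. This is an illustration under stated assumptions, not the answerer's code: the names split_by_id, outdir and seen are hypothetical, the bracket expression [^A-Za-z0-9_] is used as a stand-in for [^[:alnum:]_], and a seen array replaces the prev comparison so that IDs which reappear later in the file do not get a second header.

```shell
# Hypothetical helper: split_by_id INPUT.csv OUTPUT_DIR
split_by_id() {
    awk -F, -v outdir="$2" '
        NR == 1 { hdr = $0; next }               # remember the header line
        {
            name = $3
            gsub(/[^A-Za-z0-9_]/, "", name)      # clean a copy; $3 and $0 stay intact
            file = outdir "/FILE_" name "_DOWNLOAD.csv"
            if (!(file in seen)) {               # first row for this file: header first
                print hdr > file
                seen[file] = 1
            }
            print > file                         # the original, unmodified row
        }' "$1"
}

# Usage (paths are placeholders):
#   split_by_id /Directory/FILE_ALL_DOWNLOAD.csv /Directory
```

Because the redirection target is held in a single variable, no parentheses are needed around it.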
If you always expect the number to be in the last field of your CSV and you know that each field is wrapped in quotes, you could use this awk to extract the value 456 from the input you have provided in the comment:
echo " 123.", "Company Name" " 456." | awk -F'[^a-zA-Z0-9]+' 'NF { print $(NF-1) }'
This defines the field separator as any number of non-alphanumeric characters and retrieves the second-last field.
If this is sufficient to reliably retrieve the value, you could construct your filename like this:
file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"
and output to it as you're already doing.
bash variable expansions do not occur in single quotes.
They also cannot be performed on awk variables.
That being said you don't need that to work.
awk has string manipulation functions that can perform the same tasks. In this instance you likely want the gsub function.
Would this not work for what you asked ?
awk -F, 'NR==1{x=$0;next}
{n=$3;gsub(/[^[:alnum:]]/,"",n);f="/Directory/FILE_"n"_DOWNLOAD.csv"
if(!(f in seen)){seen[f]=1;print x >> f}
print >> f}' file

How can i compare the numeric values of the last two fields in a file?

I have a file that contains the following information
organic_apple;2;organic_apple_212_212
organic_tomato;3;organic_tomato_24_29
fruit_juice;5;fruit_juice_15_15
So i want a file that contains the output
organic_apple;2;organic_apple_212
organic_tomato;3;organic_tomato_24_29
fruit_juice;5;fruit_juice_15
Compare the last two fields: if they are the same, display the value once; if not, display both.
I'm writing in bash on Solaris.
Regardless of the number of underscores, compare the last two:
awk 'BEGIN{FS=OFS="_"}$NF==$(NF-1){--NF;$1=$1}1' test.in
Try this:
awk -F_ '{n = ($NF == $(NF-1)) ? NF-1 : NF; s = $1; for (i=2; i<=n; i++) s = s "_" $i; print s}' file.txt
This script removes the last field, together with its leading underscore, if it is equal to the one before last:
awk -F_ '$NF==$(NF-1){sub(/_[^_]*$/,"")}1' file
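A minimal sketch for trying the comparison on the sample data follows. The function name dedupe_last is hypothetical. It avoids assigning to NF, and the NF >= 2 guard keeps awk from evaluating $(NF-1) on empty lines. When both fields look like numbers awk compares them numerically, otherwise as strings.

```shell
# Hypothetical helper: reads lines from stdin or from files given as arguments.
dedupe_last() {
    awk -F_ '
        NF >= 2 && $NF == $(NF - 1) {
            sub(/_[^_]*$/, "")     # cut the final "_value" off the record
        }
        { print }' "$@"
}

printf '%s\n' 'organic_apple;2;organic_apple_212_212' \
              'organic_tomato;3;organic_tomato_24_29' \
              'fruit_juice;5;fruit_juice_15_15' | dedupe_last
```

On Solaris, /usr/bin/awk is the old awk; nawk or /usr/xpg4/bin/awk is the safer choice for scripts like this.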

Resources