Unix row to column format with string prefix and postfix - shell

I need to convert row string data to column format and prefix/postfix specific strings. The data string in the file has 4 fixed columns (separated by ";"), and each column is further divided into two sections (separated by ":").
E.g.
Source data file:
A100:T100;B100:T200;A200:T300;B200:T400
Output from file should be:
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
Meanwhile I am trying the following code:
String="A100:T100;B100:T200;A200:T300;B200:T400"
> File.txt
for deploy in $(echo $String | tr ";" "\n")
do
echo $deploy >> File.txt
done
cat File.txt | awk 'BEGIN { FS=":"; OFS=":" } NR==1{ print "TABa:BatchID="$1,$2 } NR==2{ print "TABb:BatchID="$1,$2 }'

printf handles this:
$ awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';' File.txt
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
How it works
-F:
This sets the field separator to a colon (:).
sub(/\n/,"")
This removes the newline character; only the final record contains one (from the end of the input line).
printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR
This does all the work. It uses the record number NR twice: %c prints NR+96 as a character (97 is ASCII a, so records 1-4 become a-d), and %i prints NR itself as the ProcId. Fields $1 and $2 supply the BatchID and TagId.
RS=';'
This tells awk to use a semicolon, ;, as the record separator.
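To see the %c trick in isolation (97 is the ASCII code for a):
$ awk 'BEGIN { for (n = 1; n <= 4; n++) printf "%c\n", n + 96 }'
a
b
c
d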

Related

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the fields while ignoring the escaped commas. Here those will be fields 2 or 3, like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. But the second one is performed only on the second field (i.e. $2), so it does not affect the other fields. See: Changing Fields.
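A toy demonstration of that recomputation (the sample line here is made up): once gsub modifies $0, NF and the fields are resplit on FS:
$ echo 'a,b\,c' | awk -F, '{ print NF; gsub(/\\,/,"-"); print NF, $2 }'
3
2 b-c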
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences by another character that won't appear in your text, split the text around the remaining commas then replace the chosen character by commas :
sed $'s/\\\,/\x1f/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\x1f' ','
In this case using the ASCII control char "Unit Separator" ($'\x1f'), which I'm pretty sure your input won't contain.
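If you want to double-check which byte you are actually inserting (ASCII US is hex 1f):
$ printf '\x1f' | od -An -tx1
 1f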
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[#]}"
echo "list_of_sports : ${list_of_sports[#]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than the solutions using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character and then use that character to iterate over the second and third field elements.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace escaped commas by newlines, then transpose commas and newlines: this restores the escaped commas and puts each original field on its own line of the pattern space. Blank out every line that does not contain a comma, rejoin by removing the newlines, and delete records that end up empty.
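The same script spread out, with each step commented (GNU sed accepts newlines and whole-line # comments inside a script):
sed -E '
  # escaped commas -> newlines (the backslash is dropped)
  s/\\,/\n/g
  # transpose: field commas become newlines, the newlines become commas again
  y/,\n/\n,/
  # blank out any sub-line without a comma (M = multiline mode)
  s/^[^,]*$//Mg
  # rejoin the pattern space and drop records that had no escaped commas
  s/\n//g
  /^$/d
' file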
Using Perl. Change the \, to some control char say \x01 and then replace it again with ,
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
    $col = $col FS $(col+1)        # merge field col with its right neighbour, re-inserting FS
    for (i = col+1; i < NF; i++) {
        $i = $(i+1)                # shift the remaining fields left by one
    }
    NF--                           # drop the now-duplicated last field
}
This might get used thusly:
{
    for (col = 1; col <= NF; col++) {
        while ($col ~ /\\$/) {     # a while, not an if: the merged field may itself end in a backslash
            joincol(col)
        }
    }
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSD awk and gawk. YMMV. May contain nuts.
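Putting it together (a sketch: it assumes the function and the block above are saved together in a file joincol.awk, followed by a printing rule such as { print $2, $3 }; the -F, matters, since joincol re-inserts FS between the joined parts):
$ awk -F, -f joincol.awk file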
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub() to replace the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Extract first 5 fields from semicolon-separated file

I have a semicolon-separated file with 10 fields on each line. I need to extract only the first 5 fields.
Input:
A.txt
1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;
Output file:
B.txt
1;abc ;xyz ;0.0000;3.0;
You can cut fields 1-5:
cut -d';' -f1-5 file
If the trailing ; is needed, you can append it with another tool, or use grep (assuming your grep supports the -P option):
kent$ grep -oP '^(.*?;){5}' file
1;abc ;xyz ;0.0000;3.0;
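Or append the trailing ; yourself after the cut, for example with sed:
cut -d';' -f1-5 file | sed 's/$/;/'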
In sed you can match the pattern [^;]*; (a field plus its delimiter) 5 times:
sed 's/\(\([^;]*;\)\{5\}\).*/\1/' A.txt
or, when your sed supports -r:
sed -r 's/(([^;]*;){5}).*/\1/' A.txt
cut -f-5 -d";" A.txt > B.txt
Where:
- -f selects the fields (-5 means from the start through field 5)
- -d provides the delimiter (here the semicolon)
Given that the input is field-based, using awk is another option:
awk 'BEGIN { FS=OFS=";"; ORS=OFS"\n" } { NF=5; print }' A.txt > B.txt
If you're using BSD/macOS, insert $1=$1; after NF=5; to make this work.
FS=OFS=";" sets both the input field separator, FS, and the output field separator, OFS, to a semicolon.
The input field separator is used to break each input record (line) into fields.
The output field separator is used to rebuild the record when individual fields are modified or the number of fields is modified.
ORS=OFS"\n" sets the output record separator to a semicolon followed by a newline, given that a trailing ; should be output.
Simply omit this statement if the trailing ; is undesired.
{ NF=5; print } truncates the input record to 5 fields, by setting NF, the number (count) of fields to 5 and then prints the modified record.
It is at this point that OFS comes into play: the first 5 fields are concatenated to form the output record, using OFS as the separator.
Note: BSD/macOS Awk doesn't modify the record just by setting NF; you must additionally modify a field explicitly for the changed field count to take effect: a dummy operation such as $1=$1 (assigning field 1 to itself) is sufficient.
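For reference, the BSD/macOS-safe variant with the dummy assignment in place:
awk 'BEGIN { FS=OFS=";"; ORS=OFS"\n" } { NF=5; $1=$1; print }' A.txt > B.txt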
awk '{print $1,$2,$3}' A.txt >B.txt
1;abc ;xyz ;0.0000;3.0;
Beware: this splits on whitespace, not on ;, so it only happens to produce the desired output because of where the spaces fall in this particular input. The field-based solutions above are more robust.

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one-line CSV containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors, using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are then missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
When trying to replace the " " with ",":
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
Here n=3, and the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
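Spelled out, that workaround looks like this (bash process substitution feeds the newline-free input):
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' <(tr -d '\n' < file.csv)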
Assuming your input file is named input:
echo '"id","lon","lat"'; awk '{ORS=NR%3?",":"\n"}1' RS=, input

replace a pipe delimited column using awk

I have a file with pipe-delimited columns, and I have to replace an entire column with some other value.
Example :
A|B|C
I want to replace the second column with "Z", like this:
A|Z|C
Replacing the second field can be done by setting up the input and output field separators and simply changing the second field before printing:
awk 'BEGIN {FS = OFS = "|"} {$2 = "Z"; print}' inputFileName
as per the following transcript:
pax$ printf 'A|B|C\nD|E|F\n' | awk 'BEGIN{FS=OFS="|"}{$2="Z";print}'
A|Z|C
D|Z|F
The command below provides the expected solution; note that OFS must be set as well, or awk rebuilds the record with spaces between the fields:
awk -F'|' -v OFS='|' '{$2="string";print}' file_name > new_file_name
$2 denotes the second field (the fields being delimited by |).

how to replace a special character from a column of file

I have a file which contains 6 columns, each field separated by a pipe; the second-to-last column contains an amount field.
140121059|01/01/201400:00:45|[1390]|[387]|17.64|10
140121060|01/01/201400:00:46|[1112]|[867]|26.46|10
140121062|01/01/201400:00:47|[182]|[13]|4,117.60|10
140121065|01/01/201400:00:48|[1088]|[385]|1,147.04|10
I want to remove the commas from the amount column, as I'm not able to perform operations on this column otherwise. The comma does not appear in every row. And I'm using bash.
Using awk:
awk -F '|' -v OFS='|' '{ gsub(/,/, "", $5) } 1' file
Output:
140121059|01/01/201400:00:45|[1390]|[387]|17.64|10
140121060|01/01/201400:00:46|[1112]|[867]|26.46|10
140121062|01/01/201400:00:47|[182]|[13]|4117.60|10
140121065|01/01/201400:00:48|[1088]|[385]|1147.04|10
-F '|' -v OFS='|' sets both the input and output field separators to the pipe, |. This effectively sets the column separator to |.
gsub(/,/, "", $5) removes all commas in the 5th column.
1 triggers the actual printing: a pattern that evaluates to true with no action defaults to printing the record.
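The trailing 1 seen in isolation:
$ printf 'x\ny\n' | awk '1'
x
y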
Using bash:
while IFS='|' read -ra LINE; do
    LINE[4]=${LINE[4]//,}               # strip commas from the 5th field (index 4)
    IFS='|' eval 'echo "${LINE[*]}"'    # rejoin with |: IFS sets the join character for ${LINE[*]}
done < file
