Converting CSV file to multiline text file - bash

I have a file which looks like the following:
C_DocType_ID,SOReference,DocumentNo,ProductValue,Quantity,LineDescription,C_Tax_ID,TaxAmt
1000000,1904093563U,1904093563U,5210-1,1,0,1000000,0
1000000,1904093563U,1904093563U,6511,2,0,1000000,0
1000000,1904093563U,1904093563U,5001,1,0,1000000,0
1000000,1904083291U,1904083291U,5310,4,0,1000000,0
1000000,1904083291U,1904083291U,5311,3,0,1000000,0
1000000,1904083291U,1904083291U,6101,6,0,1000000,0
1000000,1904083291U,1904083291U,6102,1,0,1000000,0
1000000,1904083291U,1904083291U,6106,6,0,1000000,0
I need to convert it to a text file which looks like this:
WOH~1.0~~1904093563Utest~~~ORD~~~~
WOL~~~5210-1~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6511~~~~~~~~2~~~~~~~~~~~~~~~~~~~~~
WOL~~~5001~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOH~1.0~~1904083291Utest~~~ORD~~~~~~
WOL~~~5310~~~~~~~~4~~~~~~~~~~~~~~~~~~~~~
WOL~~~5311~~~~~~~~3~~~~~~~~~~~~~~~~~~~~~
WOL~~~6101~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
WOL~~~6102~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6106~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
The output file has a header record and line-item records. The header record contains the SOReference and some hardcoded fields, and the line-item records contain the ProductValue and Quantity associated with that SOReference. The input file has 2 unique SOReferences, which is why the output file contains 2 header records and their associated line-item records.
Can this be done from the command line (awk/sed)? I have a series of files like this one which need to be converted to text.

With AWK, please try the following:
awk -F, '
FNR==1 { next }                  # skip the header line
{
    if ($2 != prevcol2) {        # SOReference changed: emit a header record
        nl = FNR<=2 ? "" : "\n"  # suppress the leading newline for the 1st group
        printf("%sWOH~1.0~~%stest~~~ORD~~~~\n", nl, $2)
    }
    printf("WOL~~~%s~~~~~~~~%s~~~~~~~~~~~~~~~~~~~~~\n", $4, $5)
    prevcol2 = $2
}' file.csv
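Since you mention a series of files, a small wrapper loop can apply the same program to each of them. A minimal sketch, assuming the inputs all end in .csv and that the awk body above (everything between the single quotes) has been saved as convert.awk, a hypothetical name:
# convert every CSV in the current directory, e.g. orders.csv -> orders.txt
for f in *.csv; do
    awk -F, -f convert.awk "$f" > "${f%.csv}.txt"
done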

Related

Need help splitting a file with blank lines (BASH)

I have a file containing several thousand lines. The file format is similar to this:
1
H
H 13.1641870 7.1039560 -5.9652740

3
O2H2
H 15.5567440 5.6184980 -4.5255100
H 15.8907030 4.2338600 -5.4917990
O 15.5020000 6.4310000 -7.0960000
O 13.7940000 5.5570000 -8.1620000

2
CH
H 13.0960830 7.7155820 -3.5224750
C 11.0480000 7.4400000 -5.5080000
.
.
.
.
What I want is to split the full file into several smaller files, putting into each file all the information between empty lines. The problem is that the blank lines do not follow a pattern: some blocks have 1 line and others have 10.
Could someone tell me how to separate the file using the blank lines as separators?
Using awk and the data in a file called mainfile
awk 'BEGIN { RS="\n\n+" } { print $0 > ("file" NR ".txt") }' mainfile
Set the record separator to two or more consecutive line feeds, i.e. a blank-line boundary (treating a multi-character RS as a regular expression is a gawk/mawk extension), and then print each record to a file dictated by the record number, i.e. file1.txt etc.
Would you please try the following:
awk -v RS="" '{f = "file" ++i ".txt"; print > f; close(f)}' input.txt
If the awk variable RS is set to the null string, then records are separated by blank lines.
It is recommended to close each file to avoid the "too many open files" error.
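As a quick check with the sample above (assuming the blocks are separated by blank lines, the first block lands in file1.txt):
$ awk -v RS="" '{f = "file" ++i ".txt"; print > f; close(f)}' input.txt
$ cat file1.txt
1
H
H 13.1641870 7.1039560 -5.9652740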

AWK post-processing of multi-column data

I am working with a set of txt files containing multi-column information on one line. Within my bash script I use the following AWK expression to take the filename of each txt file as well as the number from the 5th column, and save them in 2-column format in a results CSV file (piped to SED, which removes the path of the file and its extension from the final CSV file):
awk -F', *' '{if(FNR==2) printf("%s| %s \n", FILENAME, $5) }' ${tmp}/*.txt | sed 's|/Users/gleb/Desktop/scripts/clusterizator/tmp/||; s|\.txt||' >> ${home}/"${experiment}".csv
obtaining something (for 5 txt files) like this as CSV:
lig177_cl_5.2| -0.1400
lig331_cl_3.5| -8.0000
lig394_cl_1.9| -4.3600
lig420_cl_3.8| -5.5200
lig550_cl_2.0| -4.3200
How would it be possible to modify my AWK expression to exclude "_cl_x.x" from the name of each txt file, as well as add the name of the CSV as a comment on the first line of the resulting CSV file:
# results.CSV
lig177| -0.1400
lig331| -8.0000
lig394| -4.3600
lig420| -5.5200
lig550| -4.3200
Based on the rest of the pipe, I think you want to do something like this and get rid of the sed invocation:
awk -F', *' 'FNR==2 { f = FILENAME;
                      sub(/.*\//, "", f);  # strip everything up to the last "/"
                      sub(/_.*/, "", f);   # strip from the first "_" onward
                      printf("%s| %s\n", f, $5) }' "${tmp}"/*.txt >> "${home}/${experiment}.csv"
This will convert
/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt
to
lig177
The pattern replacement is generic:
/path/to/the/file/filename_otherstringshere...
will be reduced to just filename: the first sub() removes everything up to the last / char (because .* matches greedily), and the second removes everything from the first _ char onward.
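For instance, a quick check of the two substitutions against the path from the question:
$ echo "/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt" | awk '{ sub(/.*\//, ""); sub(/_.*/, ""); print }'
lig177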
For the comment line at the top of the output file, it's easier to write it before the awk call, since it's only one line (note the leading "# " to make it a comment, as in your desired output):
$ echo "# ${experiment}.csv" > "${home}/${experiment}.csv"
$ awk ... >> "${home}/${experiment}.csv"

How can I modify a CSV file to add a new header using a bash command?

Hi, currently I have a csv file. I want to add a new header field named budget, with the value True for all records. Here is my csv file.
id,address1,address2,address3,address4,addressprofile,administrator,averageclickthroughrate,contactnumber,contractid,country,createdby,createdon,currency,customercontactnumber,customerid,defaultlanguage,features,internal,inventories,lastupdated,lastupdatedby,logo,name,status,testmessagecontactlist,testmessagelimit,usedefaultclickthroughrate,zipcode
d4385ff7-247f-407a-97c6-366d8128c6c7,,,,,eb0137fc-b279-11e8-8753-570ce0b5ef9b,92059277-e2ad-4cf0-a941-0f0b52bf3421,40,,,,ab4e0287-6973-4eec-bd03-cf3669c535d0,2019-01-08 08:48:36.353+0000,,,,b04265e6-c114-470c-8bb0-d10879655ec9,[],True,"[bdf7fad0-b8cd-4a9a-9c9d-48261fd5e7c7, be25104b-90d1-4076-bb4b-44c756d06d20]",2019-04-05 09:38:15.322+0000,3363a3ad-f52a-4a8b-bc52-7a069bab31d9,,OTT,ACTIVE,ca6b6808-111c-49ac-90ac-44078e8e3db0,5,True,
This is the result I am expecting:
id,address1,address2,address3,address4,addressprofile,administrator,averageclickthroughrate,budget,contactnumber,contractid,country,createdby,createdon,currency,customercontactnumber,customerid,defaultlanguage,features,internal,inventories,lastupdated,lastupdatedby,logo,name,status,testmessagecontactlist,testmessagelimit,usedefaultclickthroughrate,zipcode
d4385ff7-247f-407a-97c6-366d8128c6c7,,,,,eb0137fc-b279-11e8-8753-570ce0b5ef9b,92059277-e2ad-4cf0-a941-0f0b52bf3421,40,,True,,,ab4e0287-6973-4eec-bd03-cf3669c535d0,2019-01-08 08:48:36.353+0000,,,,b04265e6-c114-470c-8bb0-d10879655ec9,[],True,"[bdf7fad0-b8cd-4a9a-9c9d-48261fd5e7c7, be25104b-90d1-4076-bb4b-44c756d06d20]",2019-04-05 09:38:15.322+0000,3363a3ad-f52a-4a8b-bc52-7a069bab31d9,,OTT,ACTIVE,ca6b6808-111c-49ac-90ac-44078e8e3db0,5,True,
How can I do this using shell scripting?
Thank you.
awk can easily do the job, similar to the approach in @kvantour's link:
awk 'BEGIN{FS = OFS = ","} {$8 = $8 FS (NR == 1 ? "budget" : "True")}1'
where FS is the field separator, OFS the output field separator, and NR the current row number; the trailing 1 prints each (modified) line.
Example: https://ideone.com/oRQqhi
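For instance, to write the result to a new file (input.csv and output.csv are illustrative names; the new column lands right after the 8th field, averageclickthroughrate):
awk 'BEGIN{FS = OFS = ","} {$8 = $8 FS (NR == 1 ? "budget" : "True")}1' input.csv > output.csv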

How to remove part of the middle of a line/string by matching two known patterns in front and behind variable text to be removed

How do I remove part of the middle of a line/string by matching two known patterns, one in front of the text to be removed and one behind it?
I have a Linux text file with thousands of one-line, comma-delimited records. Unfortunately, not all records are in the same format. Each line may have as many as four comma-delimited fields, of which only the first and last are constant; the two middle fields may or may not be present.
Examples of existing line (record) formats. Messy data, but the first field is always present, as is the last field, which starts with the word ADDED.
FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE
FNAME LNAME, ADDED TO DB DATE
FNAME LNAME, SOME COMMENT, ADDED TO DB DATE
FNAME LNAME, JOINED DATE, ADDED TO DB DATE
The objective is to keep field one including the comma, throw away everything following the first comma, keep the word "ADDED" and everything that follows it to the end of line, and insert a space between the first comma and the word ADDED.
For each line in the file, parse from the start of the line to the first comma (keep this).
Parse the rest of the line up to the space before the word "ADDED" and throw it away.
Keep everything from the space before the word "ADDED" to the end of line, and concatenate the first part and last part to form one record per line with two fields separated by a comma and a space.
(If a record is already in the desired format, change nothing.)
Final file to look like:
FNAME LNAME, ADDED TO DB DATE
or
Fred Flintstone, ADDED on January 1st 2015 By Barney Rubble
Thanks!
If you don't care about blank lines:
awk '{print $1,$NF}' FS=, OFS=, input
(Blank lines will be output as a single comma)
If you want to just skip blank lines, use:
awk 'NF>1{print $1,$NF}' FS=, OFS=, input
If you want to keep them:
awk '{printf( "%s%s\n", $1, NF>1 ? ","$NF : "")}' FS=, OFS=, input
Note that this will not ensure a single space after the comma, but will retain the spacing as in the final column of the original file. (that is, if there are 3 spaces after the final column in the original, you'll get 3 in the output). It's not clear to me from the description, but that seems like desirable behavior.
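For example, with the first sample format (the leading space in $NF supplies the space after the comma):
$ printf 'FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE\n' | awk '{print $1,$NF}' FS=, OFS=,
FNAME LNAME, ADDED TO DB DATE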
A Perl solution
perl -ne 'print join ", ", (split /,\s*/)[0,-1]' myfile
or
perl -pe 's/,.*(?=,)//' myfile
Both of those solutions work fine for me with the data you have given, but you may like to try
perl -pe 's/,.*(?=,\s*ADDED)//' myfile
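For example, against the longest sample format:
$ echo 'FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE' | perl -pe 's/,.*(?=,\s*ADDED)//'
FNAME LNAME, ADDED TO DB DATE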
You can use a backreference:
sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/' file
The group \(^[^,]*,\) captures the first field and its comma; the match extends through " ADDED", and the whole matched portion is replaced by the captured group plus " ADDED", leaving the rest of the line intact.
One more approach with awk could help here:
awk -F, '{val=$1; sub(/.*,/, ","); print val $0}' Input_file
Here I set the field separator to (,), save the first field in a variable named val, then substitute everything up to the last comma with a single (,) in the current line (the greedy .*, eats the middle fields), and finally print the value of val followed by the edited line.
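A quick check with one of the sample lines:
$ echo 'FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE' | awk -F, '{val=$1; sub(/.*,/, ","); print val $0}'
FNAME LNAME, ADDED TO DB DATE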
Using perl:
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, "<", "file.txt" or die "$!: couldn't open file\n";
while (<$fh>) {
    chomp;                                 # drop the trailing newline
    my @arr = split /,\s*/;                # split on comma plus any following spaces
    my $text = $arr[0] . ", " . $arr[-1];  # keep only the first and last fields
    print "$text\n";
}

Extract subset of a feed file with custom delimiter and create CSV file

I get a feed file in the format below.
employee_id||034100151730105|L|
employee_cd||03410015|L|
dept_id||1730105|L|
dept_name||abc|L|
employee_firstname||pqr|L|
employee_lastname||ppp|L|
|R||L|
employee_id||034100151730108|L|
employee_cd||03410032|L|
dept_id||4230105|L|
dept_name||fdfd|L|
employee_firstname||sasas|L|
employee_lastname||dfdf|L|
|R||L|
.....
Is there any easy unix script to extract a subset of fields and create a CSV like the one below?
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
.....
I would suggest an awk solution (assuming that the dept_name item always comes before the employee_firstname item):
awk -F'|' 'BEGIN{OFS=","; print "employee_cd,employee_firstname,dept_name";}
$1~/employee_cd|employee_firstname|dept_name/{ a[++c]=$3 }
END { for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] }' file
The output:
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
Solution details:
OFS="," - setting the output field separator
$1~/employee_cd|employee_firstname|dept_name/ - if the first column matches one of the needed item names
a[++c]=$3 - capturing the item value, indexed by consecutive position
for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] - printing the captured values three at a time, reordered as employee_cd, employee_firstname, dept_name
To save the output as a .csv file:
the above command > output.csv
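One portability note: length() on an array is a gawk extension; since the script already maintains the counter c, the END block can use it instead, with the same result:
END { for(i=1;i<c;i+=3) print a[i],a[i+2],a[i+1] }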
