advanced concatenation of lines based on the specific number of compared columns in csv - bash

This question builds on a previously solved problem.
I have .csv files of the following type (they aren't all sorted, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated. The other columns are not always populated, except the last one (category).
An empty space between "," delimiters means there is no data for that field on that particular line.
If nameX comes with addressY instead of addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though bash is a little slower on bigger files [hundreds of MB+]), that compares the first 4 columns (in this case) and, if they match, merges every category using the ";" delimiter, while keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a fallback would be to retain the data from the first line of each group of duplicates (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
does the .csv have to be sorted before using the script?
thank you again!

sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed); it does not require sorting your file beforehand.
It runs two nested loops after loading the full file into the hold buffer, adds markers for easier manipulation, and has a lot of fun with sed group substitution (it just about reaches the maximum number of groups available).
The outer loop (cat) appends one category at a time (one line after the other, which is needed for the inner loop over each field) and builds a big temporary sub-field structure: two groups of fields taken from the two concatenated lines, where fields 5 to 9 form one group.
The inner loop (fields) moves the sub-fields back to their original places.
Finally, the markers and the first newline are removed.
This assumes the data contains none of the characters ², ³ or ~, because they are used as markers (you can choose other markers and adapt the script accordingly).
Note:
For performance on a file of hundreds of MB, I guess awk will be a lot more efficient.
Sorting the data beforehand will probably help performance by reducing the amount of data to manipulate after each category loop.
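For comparison, here is a minimal awk sketch of the same merge, assuming every line has the same number of comma-separated fields and no field contains a quoted comma. The function name merge_categories is made up for illustration. It keys on the first 4 columns, keeps the first non-empty value seen for each optional column, joins the categories with ";", and prints records in the order they first appeared, so the input does not need to be sorted.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): merge rows that share
# the first 4 columns. Assumes plain CSV without quoted commas.
merge_categories() {
    awk -F, -v OFS=, '
    {
        key = $1 FS $2 FS $3 FS $4
        if (!(key in seen)) {              # first time this record shows up
            seen[key] = 1
            order[++n] = key               # remember first-seen order
            nf[key] = NF
            for (i = 5; i < NF; i++) mid[key, i] = $i
            cat[key] = $NF
            next
        }
        # Duplicate record: fill optional columns that are still empty
        for (i = 5; i < NF; i++)
            if (mid[key, i] == "" && $i != "") mid[key, i] = $i
        cat[key] = cat[key] ";" $NF        # append the category
    }
    END {
        for (j = 1; j <= n; j++) {
            key = order[j]
            line = key
            for (i = 5; i < nf[key]; i++) line = line OFS mid[key, i]
            print line OFS cat[key]
        }
    }' "$@"
}

# Usage: merge_categories YourFile > merged.csv
```

This holds one entry per distinct record in memory, so memory use grows with the number of unique name/address/town/zip combinations rather than with the raw file size.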

I found that this particular problem is processed faster through a database:
SQL - GROUP BY to combine/concat a column
db: MySQL through WAMP

Related

How to parse csv file into multiple csv based on row spacing

I'm trying to build an Airflow DAG and need to split 7 tables contained in one csv into seven separate csv files.
dataset1
header_a  header_b  header_c
One       Two       Three
One       Two       Three
<-Always two spaced rows between data sets
dataset N <-part of csv file giving details on data
header_d  header_e  header_f  header_g
One       Two       Three     Four
One       Two       Three     Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in awk searching for the double spaces?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
Always two spaced rows between data sets
If your CSV file contains blank lines, and your goal is to write out each chunk of records that is separated by those blank lines into individual files, then you could use awk with its record separator RS set to nothing, which then defaults to treating each "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.
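A slightly longer sketch of the same idea, wrapped in a function whose name (split_tables) and output prefix argument are made up for illustration. It closes each output file after writing, which may matter if the input holds more chunks than the system allows open files. It assumes the separator rows are truly empty; rows that contain only spaces or commas would not count as blank in paragraph mode.

```shell
#!/bin/sh
# Hypothetical helper: write each blank-line-separated chunk to its own file.
# $1 = input file, $2 = output prefix (e.g. "output_")
split_tables() {
    awk -v RS= -v prefix="$2" '{
        out = prefix NR ".csv"
        print $0 > out
        close(out)          # release the file handle before the next chunk
    }' "$1"
}

# Usage: split_tables input.csv output_
```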

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file in which groups of 6 lines share the same first-column value, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here.
Likewise, there are many duplicate first-column values (IDs) in the log file (the example above has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I am supposed to write a bash script for this exercise and would appreciate expert help. Hopefully there is a sed or awk solution that is more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
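The sort command sorts the file numerically on the last (4th) column. awk takes the sorted input and builds an array where the 1st field is the key and the value is the comma-joined list of last-column values for that key. Note that for (i in a) visits the keys in an unspecified order, and that this command does not print the header line.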
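If the header line and a stable ID order are wanted, a variant could look like the sketch below. It assumes, as in the sample, that each ID's six events occur chronologically in the same order as the header columns; the function name consolidate_log is made up for illustration, and the header text is copied from the question.

```shell
#!/bin/sh
# Hypothetical helper: one output line per ID, timestamps in chronological order.
consolidate_log() {
    sort -t, -k4,4n "$@" | awk -F, '
    BEGIN {
        print "ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd"
    }
    !($1 in a) { order[++n] = $1; a[$1] = $NF; next }   # first event of an ID
    { a[$1] = a[$1] FS $NF }                            # later events: append
    END { for (j = 1; j <= n; j++) print order[j] FS a[order[j]] }'
}

# Usage: consolidate_log file
```

IDs are printed in the order of their earliest timestamp, because the order array records each ID the first time it appears in the sorted stream.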

Split a Value in a Column with Right Function in SSIS

I urgently need your help. I have a column that holds the full name of a user, and I want to split it into first and last name.
The format of the full name is "World, hello", where the first name is hello and the last name is World.
I am using a Derived Column (SSIS) with the RIGHT function for the first name and the SUBSTRING function for the last name, but both results come out blank, and this is where even I am blank. :)
It's working for me. In general, you should provide more detail in your questions on places such as this to help others recreate and troubleshoot your issue. You did not specify whether we needed to address NULLs in this field nor do I know how you'd want to interpret it so there is room for improvement on this answer.
I started with a simple OLE DB Source and hard coded a query of "SELECT 'World, Hello' AS Name".
I created 2 Derived Column Tasks. The first one adds a column to the Data Flow called FirstCommaPosition. The formula I used is FINDSTRING(Name,",", 1). If Name is NULLable, then we will need to test for nullability prior to calling the FINDSTRING function. You'll then need to determine how you want to store the split data in the case of NULLs. I would assume both first and last should be NULLed, but I don't know that.
There are two reasons for doing this in separate steps. The first is performance. As counter-intuitive as it sounds, doing less in a derived column results in better performance because the SSIS engine can better parallelize the operations. The other is more simple - I will need to use this value to make the first and last name split so it will be easier and less maintenance to reference a column than to copy paste a formula.
The second Derived Column is going to actually perform the split.
My FirstNameUnicode column uses this formula: (FirstCommaPosition > 0) ? RTRIM(LTRIM(RIGHT(Name,LEN(Name) - FirstCommaPosition))) : "" That says "If we found a comma in the preceding step, then slice out everything after the comma's position to the end of the string and apply trim operations. If we didn't find a comma, then just return a blank string." RIGHT takes a character count rather than a starting position, so the count is the length of the string minus the comma's position. The default string type for expressions will be Unicode (DT_WSTR), so if that is not what you need, you will need to cast the result into the correct string codepage (DT_STR).
My LastNameUnicode column uses this formula: (FirstCommaPosition > 0) ? SUBSTRING(Name,1,FirstCommaPosition -1) : "" Similar logic as above, except now I use the SUBSTRING operation instead of RIGHT. Users of the 2012 release of SSIS and beyond, rejoice, for you can use the LEFT function instead of SUBSTRING. Also note that you will need to back off 1 position to remove the comma.

using awk to grab random lines and append to a new column?

So I have a document "1", which is one column. I have 3 files with one column each and I want to append a randomly selected line from each of those columns onto the document 1's line.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop but I didn't know how to cycle the indices so on line n of document 1, columns 2,3 and 4 would be appended on there. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.
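If the goal is instead one randomly chosen line from each of the three files on every line of document 1 (so three new columns rather than one), an awk sketch could look like this. The function name append_random is made up for illustration, the output is tab-separated, and all files are assumed to fit in memory. Lines are drawn independently, so the same line may be picked more than once.

```shell
#!/bin/sh
# Hypothetical helper: $1 = base file; each remaining file contributes one
# randomly chosen line per base line, appended as a tab-separated column.
append_random() {
    awk '
    BEGIN { srand() }                       # seed from the current time
    FNR == 1 { fidx++ }                     # a new input file starts
    fidx > 1 { lines[fidx, FNR] = $0; count[fidx] = FNR; next }
    { base[FNR] = $0; nbase = FNR }         # lines of the base file
    END {
        for (i = 1; i <= nbase; i++) {
            row = base[i]
            for (f = 2; f <= fidx; f++)
                row = row "\t" lines[f, int(rand() * count[f]) + 1]
            print row
        }
    }' "$@"
}

# Usage: append_random 1 2 3 4
```

Because srand() without an argument seeds from the clock, two runs within the same second may produce the same picks.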

List of names and their numbers needed to be sorted .TXT file

I have a list of names (never over 100 names) with a value for each of them, either 3 or 4 digits.
john2E=1023
mary2E=1045
fred2E=968
And so on... They're formatted exactly like that in the .txt file. I have Python and Excel, also willing to download whatever I need.
What I want to do is sort all the names according to their values in descending order, so the highest is on top. I've tried to use Excel by replacing the '2E=' with ',' so I have name,value and then importing the data so each is in a separate column, but I still couldn't sort them any other way than A to Z.
Help is much appreciated, I did take my time to look around before posting this.
Replace the "2E=" with a tab character so that the data is displayed in excel in two columns. Then sort on the value column.
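If a Unix-like shell is available, the file can also be sorted directly without Excel. This is a minimal sketch that assumes every line has exactly one "=" followed by the number; the function name sort_by_value is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split each line on "=" and sort by the second field,
# numerically, largest value first.
sort_by_value() {
    sort -t= -k2,2nr "$@"
}

# Usage: sort_by_value names.txt > sorted.txt
```

The lines themselves are left unchanged, so the "2E=" part stays in the output.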
