Parsing semicolon-separated key-value pairs into a CSV file - bash

I have a piece of data that is composed of semicolon-separated key-value pairs (around 50 pairs) on the same line. Not every pair is present on every line.
Below is a sample of the data:
A=0.1; BB=2; CD=hi there; XZV=what's up; ...
A=-2; CD=hello; XZV=no; ...
I want to get a CSV file of this data, where the key becomes the field (column) name and the value becomes the row value for that particular line. Missing pairs should be replaced by a default value or left blank.
In other words, I want my CSV to look like this:
A,BB,CD,XZV,....
0.1,2,"hi there","what's up",...
-2,0,"hello","no";...
The volume of my data is extremely large. What is the most efficient way to do this? A bash solution would be highly appreciated.
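One way to sketch this is a two-pass awk program (the same input file, here called data.txt as a hypothetical name, is given twice): pass 1 collects the union of keys in order of first appearance, pass 2 prints the header and one CSV row per input line. This is only a sketch under stated assumptions: pairs are separated by "; ", keys never contain "=", and missing keys are left blank rather than defaulted.
awk -F'; *' '
NR == FNR {                       # pass 1: collect the union of keys
    for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        k = substr($i, 1, eq - 1)
        if (eq && !(k in seen)) { seen[k]; keys[++nk] = k }
    }
    next
}
FNR == 1 {                        # pass 2: print the header once
    for (j = 1; j <= nk; j++) printf "%s%s", keys[j], (j < nk ? "," : "\n")
}
{
    split("", row)                # portable way to clear the array
    for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq) row[substr($i, 1, eq - 1)] = substr($i, eq + 1)
    }
    for (j = 1; j <= nk; j++) {   # CSV-quote values containing commas or quotes
        v = row[keys[j]]
        if (v ~ /[,"]/) { gsub(/"/, "\"\"", v); v = "\"" v "\"" }
        printf "%s%s", v, (j < nk ? "," : "\n")
    }
}' data.txt data.txt
Reading the file twice keeps memory use flat regardless of data volume; if the ~50 keys are known up front, pass 1 can be dropped entirely.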

Related

How to detect the data type of a column in a table in an ORACLE database (probably BLOB or CLOB)?

I have a table with a column of type VARCHAR2(2000 CHAR). This column used to contain rows of comma-separated numbers (ex: "3;3;780;1230;1;450.."). Now the situation has changed: some rows contain data in the old format, but some contain data like the following (ex: "BAAAABAAAAAgAAAAHAAAAAAAAAAAAAAAAQOUw6.."). Maybe it's a BLOB or CLOB. How can I check exactly? And how can I read it now? Sorry for my noob question :)
The bad news is you really can't. Your column is a VARCHAR2, so it's all character data. It seems like what you're really asking is "How do I tell whether this value is a separated string of numbers or a binary value encoded as a string?" So the best you can do is make an educated guess. There's not enough information here to give a very good answer, but you can try heuristics like the following (a rough shell sketch follows this list):
If the value is numeric characters with separators (you say commas but your example has semicolons), treat it as such.
But what if the column value is "123"? Is that a single number or a short binary value?
If there are any letters in the value, you know it's not a separated list of numbers, so treat it as binary. But not all encoded binary values will contain letters.
Try decoding it as binary; if that fails, maybe it's actually the separated list. This probably isn't a good option.
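To make the guesswork concrete, here is a rough classifier sketch in the shell; it assumes the column has been exported one value per line into values.txt (a hypothetical file name) and only encodes the heuristics above, so it will misjudge the ambiguous cases already mentioned:
num_re='^[0-9]+([;,.][0-9]+)*$'        # digits with ; , or . separators
b64_re='^[A-Za-z0-9+/]+=*$'            # base64-like character set
while IFS= read -r v; do
    if [[ $v =~ $num_re ]]; then
        printf 'separated numbers:  %s\n' "$v"
    elif [[ $v =~ $b64_re ]]; then
        printf 'encoded binary(?):  %s\n' "$v"
    else
        printf 'unclassified:       %s\n' "$v"
    fi
done < values.txt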

CSV with Weka: how to add a comma as a value, not a separator

I have a dataset, and I must run machine learning algorithms on it. But my dataset has some elements that contain a comma, and when I convert the CSV to ARFF these comma-containing values are not recognized.
Example:
a,b,c
asdasd'%sdas,1,5,4234
My elements are
asdasd'%sdas 1,5 4234
But I could not handle the value that has a comma inside it.
I tried these:
a,b,c
asdasd'%sdas,1\,5,4234
a,b,c
asdasd'%sdas,"1,5",4234
How can I pass a comma-containing value while using Weka? I also wonder how to pass an element as a string with special characters like "sdas&%',+". Is that possible, or is there something similar?
The following should work:
"asdasd'%sdas","1,5",4234
Strings that contain special characters can be passed in exactly the same way.
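In plain-CSV terms, the rule is: wrap a field in double quotes when it contains a comma or a quote, and double any quote inside it. A tiny illustration (test.csv is just a hypothetical scratch file to feed the converter):
cat > test.csv <<'EOF'
a,b,c
"asdasd'%sdas","1,5",4234
"he said ""hi""","2,5",99
EOF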

Advanced concatenation of lines based on a specific number of compared columns in CSV

This is a question based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the structure of the columns is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns of each record are always populated; the other columns are not always, except the last one, category.
An empty space between "," delimiters means there is no data in that field for the particular line or name.
If nameX does not go with addressX but with addressY, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk (maybe bash, but that solution is a little slower on bigger files [hundreds of MB+]) that will take the first 4 columns (in this case), compare them, and if they match, merge every category with a ";" delimiter, keeping the structure and the most possible data in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a fallback could be to retain the data from the first line of the duplicated records (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n '
# wrap every line in ² and ³ markers, then append it to the hold space
s/.*/²&³/;H
# on the last line, recall the whole file and start merging
$ { g
:cat
# merge a pair of lines whose first 4 fields match: join their categories
# with ";" and park both groups of fields 5-9 between ~ markers
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
# reconcile the parked sub-fields one at a time, keeping the value
# that is present in either line
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
# drop the exhausted ~ markers and look for the next matching pair
s/~~ ~~//g
b cat
:clean
# remove the leading newline and the ² ³ markers, then print
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed), and it works without sorting your file first.
Two chained loops after loading the full file into the buffer, adding markers for easier manipulation, and a lot of fun with sed group substitution (it just about reaches the maximum number of groups available).
The :cat loop merges categories (one line after the other, as needed for the next loop on each field) per line and builds a big temporary sub-field structure (two groups of fields from the two concatenated lines; fields 5 to 9 form one group).
The :fields loop moves each sub-field back to its original place.
Finally, the markers and the first newline are removed.
This assumes the characters ², ³ and ~ never occur in the data, because they are used as markers (you can choose other markers and adapt the script to them).
Note:
For performance on a hundred-MB file, I guess awk will be a lot more efficient (a sketch follows).
Sorting the data beforehand would certainly help performance by reducing the amount of data to manipulate after each category loop.
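For reference, here is what such an awk version might look like; it is only a sketch under the sample's assumptions (exactly 10 comma-separated columns, no quoted commas inside fields), and it keeps the first non-empty value seen for columns 5 to 9, matching the first expected output above:
awk -F',' -v OFS=',' '
{
    key = $1 OFS $2 OFS $3 OFS $4     # the first 4 columns identify a record
    if (!(key in cat)) { order[++n] = key; cat[key] = $10 }
    else               { cat[key] = cat[key] ";" $10 }
    for (i = 5; i <= 9; i++)          # keep the first non-empty value seen
        if (field[key, i] == "" && $i != "") field[key, i] = $i
}
END {
    for (r = 1; r <= n; r++) {
        key = order[r]
        line = key
        for (i = 5; i <= 9; i++) line = line OFS field[key, i]
        print line OFS cat[key]
    }
}' file.csv
Unsorted input is fine because records are accumulated in memory; for files of hundreds of MB, that memory use is the main cost.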
I found that this particular problem is faster being processed through a database:
SQL - GROUP BY to combine/concat a column
(db: MySQL through WAMP)

Retrieving unique results from a column in a text file with Hadoop MapReduce

I have the data set below. I want to get a unique list of the first column as the output: {9719,382 ..}. There are integers at the end of each line as well, so checking whether a token starts and ends with a number is not an option, and I couldn't think of a solution. Can you show me how to do it? I'd really appreciate it if you could show it in detail (what to do in the map step and what to do in the reduce step).
id - - [date] "URL"
In your mapper you should parse each line and write out the token that you are interested in from the beginning of the line (e.g. 9719) as the Key in a Key-Value pair (the Value is irrelevant in this case). Since the keys are sorted and grouped before being sent to the reducer, all you need to do in the reducer is output each distinct key once.
The WordCount example app that is packaged with Hadoop is very close to what you need.
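If you would rather not write Java, the same map/reduce shape can be sketched with Hadoop Streaming using plain shell commands (the input/output paths and the streaming-jar location below are placeholders to adapt): the mapper emits the first whitespace-separated token of each line as the key, and because keys reach the reducer sorted, uniq is enough to keep one copy of each.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /logs/access-log \
    -output /logs/unique-ids \
    -mapper "cut -d' ' -f1" \
    -reducer "uniq"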

List of names and their numbers in a .TXT file that needs to be sorted

I have a list of names (never over 100 names) with a value for each of them, either 3 or 4 digits.
john2E=1023
mary2E=1045
fred2E=968
And so on... They're formatted exactly like that in the .txt file. I have Python and Excel, and I'm willing to download whatever I need.
What I want to do is sort all the names according to their values in descending order, so the highest is on top. I've tried to use Excel, replacing the '2E=' with ',' so I could have name,value and then import the data with each in a separate column, but I still couldn't sort them any way other than A to Z.
Help is much appreciated; I did take my time to look around before posting this.
Replace the "2E=" with a tab character so that the data is displayed in Excel in two columns. Then sort on the value column.
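Since the format is so regular, the same sort is also a one-liner in the shell (assuming the list lives in names.txt, a hypothetical file name): split each line on "=" and sort numerically, descending, on the second field.
sort -t'=' -k2,2nr names.txt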
