How to compare 2 schema files so that I can add the missing columns to the other file and fill them with a default value? - bash

I have two files that contain tabular data. In file A I have 5 columns, and in file B I have 3 columns. I want to find the positions of the columns in file A and add the missing columns to file B. I also need to fill all the rows of the new columns in file B with a default value that I will hard-code.
I tried running a loop to get the lengths of the files and then fetch the data, but it's failing.
I'm using PuTTY and the vim editor, and I need to write a script to compare the files and add the columns.

Without having a concrete example of fileA and fileB it's difficult to figure out the right solution, but you could use diff and patch, something like this:
diff fileA fileB | patch -R fileB
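If the real requirement is just to append the missing columns to every row of file B and fill them with a hard-coded default, a small awk one-liner may be enough. This is a minimal sketch, assuming tab-separated files, that file A has exactly two columns file B lacks, that they belong at the end of each row, and that "NA" is the default (all of those are assumptions about your data):
# Append two default-filled columns to every row of fileB (assumed tab-separated)
awk 'BEGIN { FS = OFS = "\t"; def = "NA" } { print $0, def, def }' fileB > fileB_extended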

Related

Append multiple CSVs into one single file with Apache Nifi

I have a folder with CSV files that have the same first 3 columns and different last N columns, where N is at least 2 and up to 11.
The last N columns have numbers as headers, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi so that a new column gets created (even if I do not know the column name beforehand) whenever a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just appends the content of all my files together without minding the headers (all of the headers are always appended).
What you could do is write a script that combines the rows and columns, and call it with the ExecuteStreamCommand processor. That lets you write the custom script in whatever language you want.
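As an illustration of what such a script might look like (not a NiFi-specific API, just something you could call from ExecuteStreamCommand or run by hand), here is a hedged GNU awk sketch. It assumes the first three columns are fixed, the remaining headers are numeric, missing cells should become "null", the whole input fits in memory, and the file names are placeholders:
awk -F, -v OFS=, '
FNR == 1 {                                   # header line of each file
    if (NR == 1) fixedhdr = $1 OFS $2 OFS $3
    delete map
    for (i = 4; i <= NF; i++) {
        map[i] = $i                          # column index -> numeric header in this file
        if (!($i in have)) { have[$i] = 1; hdr[++nh] = $i }
    }
    next
}
{                                            # data line: remember fixed part and cells keyed by header
    fixed[++nr] = $1 OFS $2 OFS $3
    for (i = 4; i <= NF; i++) cell[nr, map[i]] = $i
}
END {
    n = asort(hdr, sorted, "@val_num_asc")   # GNU awk: sort the union of headers numerically
    line = fixedhdr
    for (j = 1; j <= n; j++) line = line OFS sorted[j]
    print line
    for (r = 1; r <= nr; r++) {
        line = fixed[r]
        for (j = 1; j <= n; j++)
            line = line OFS (((r, sorted[j]) in cell) ? cell[r, sorted[j]] : "null")
        print line
    }
}
' file1.csv file2.csv > combined.csv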

How to parse a csv file into multiple csv files based on row spacing

I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.
dataset1
header_a
header_b
header_c
One
Two
Three
One
Two
Three
<- Always two blank rows between data sets
dataset N <- part of the csv file giving details on the data
header_d
header_e
header_f
header_g
One
Two
Three
Four
One
Two
Three
Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in awk searching for the double blank lines?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
There are always two blank rows between data sets.
If your CSV file contains blank lines, and your goal is to write out each chunk of records that is separated by those blank lines into individual files, then you could use awk with its record separator RS set to nothing, which then defaults to treating each "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
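If you would rather name each output file after its own first line (so a block starting with dataset1 ends up in dataset1.csv), a small variation of the same idea should work. This is a hedged sketch, assuming the first comma- or space-delimited word of a block's first line makes a sensible file name:
awk -v RS= -F'\n' '{ split($1, a, /[ ,]/); print $0 > (a[1] ".csv") }' input.csv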
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.

Add index column to CSV file

I have a large comma-separated file (6 GB) and would like to add an index column to it. I'm looking at Unix-type solutions for efficiency. I'm using a Mac.
I have this:
V1 V2 V3
0.4625 0.9179 0.8384
0.9324 0.2486 0.1114
0.6691 0.7813 0.6705
0.1935 0.3303 0.4336
Would like to get this:
ID V1 V2 V3
1 0.4625 0.9179 0.8384
2 0.9324 0.2486 0.1114
3 0.6691 0.7813 0.6705
4 0.1935 0.3303 0.4336
This will probably work:
awk -F'\t' -v OFS='\t' '
NR == 1 {print "ID", $0; next}
{print (NR-1), $0}
' input.csv > output.csv
In awk, the NR variable is "the total number of input records seen so far", which in general means "the current line number". So the NR == 1 in the first line is how we match the first record and add the "ID" column header, and for the remaining lines we use NR-1 as the index.
The -F'\t' argument sets the input field separator, and -vOFS='\t' sets the output field separator.
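If the file really is comma-separated rather than tab-separated, the same approach should work with the separators swapped; a minimal sketch:
awk -F',' -v OFS=',' '
NR == 1 {print "ID", $0; next}
{print (NR-1), $0}
' input.csv > output.csv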
Since no technology is specified in the original post, I'd be happy here to keep it simple (all the fancy Vim/bash solutions are fine if you know what you're doing).
Open the CSV file in your favourite spreadsheet programme (I'm using LibreOffice, but Excel or a native Mac equivalent will do)
Insert a column to the left of column A
Enter a 1 into cell A2, the first cell under the headers
Double-click the fill handle at the bottom-right corner of that cell (the small square)
This last step will fill the index column with 1,2,3... etc.
You can then save the resulting spreadsheet as a CSV file again.
I assume you have a comma-delimited file.
Using vim, open the file. In normal mode, type
:%s/^/\=line('.').','/
:%s/^/\=line('.')/ adds the line number at the beginning of each line. Since you have a comma-delimited file (and are adding a column), you need a comma after the line number, hence the .','.
See this answer for a full explanation of :%s/^/\=line('.')/.
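If you want to leave the header line un-numbered and have the IDs start at 1, a hedged variant of the same idea is to restrict the range to line 2 onwards and subtract 1 (then type ID, at the start of the first line yourself):
:2,$s/^/\=(line('.')-1).','/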
Open the CSV file in your favorite spreadsheet program, such as Excel
Insert a column to the left of the first column
Type 1 into the first data cell of this column (A2, just under the header)
Type the formula =A2+1 into the cell below it
Double-click the fill handle at the bottom-right corner of that cell (the small square)

Advanced concatenation of lines based on the specific number of compared columns in CSV

This is a question based on the previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the structure of the columns is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always, except the last one - the category.
An empty space between "," delimiters means that there is no data in that column for the particular line/name.
If nameX doesn't contain addressX but addressY, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though that solution would be a little slower on bigger files [hundreds of MB+]), that will take the first 4 columns (in this case), compare them, and if they match, merge every category with a ";" delimiter while keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain the data from the first line of the duplicated data (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so --posix with GNU sed), and it works without sorting your file first.
It loads the full file into the buffer, adds markers for easier manipulation, and then runs two recursive loops of sed group substitutions (it just about reaches the maximum number of groups available):
A loop that merges the category of one line into the previous matching line (one line after the other, as needed for the next loop on each field), building a big temporary sub-field structure (two groups of fields from the two concatenated lines; fields 5 to 9 form one group).
A loop that ungroups the sub-fields back into their original places.
Finally, it removes the markers and the leading newline.
This assumes the characters ², ³ and ~ do not occur in the data, since they are used as markers (you can pick other markers and adapt the script accordingly).
Note:
For performance on a file of hundreds of MB, I guess awk will be a lot more efficient.
Sorting the data beforehand would certainly help performance by reducing the amount of data to manipulate after each category loop.
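For what it's worth, here is a hedged awk sketch of such an approach. It needs no prior sorting, preserves first-appearance order, and assumes the column count is constant, the category is always the last field, and the whole file fits in memory:
awk -F, -v OFS=, '
{
    key = $1 OFS $2 OFS $3 OFS $4                       # the first 4 columns identify a record
    if (!(key in cat)) order[++n] = key                 # remember first-seen order
    cat[key] = (key in cat) ? cat[key] ";" $NF : $NF    # merge categories with ";"
    for (i = 5; i < NF; i++)                            # keep any non-empty middle column seen so far
        if ($i != "") mid[key, i] = $i
    nf[key] = NF
}
END {
    for (j = 1; j <= n; j++) {
        key = order[j]
        line = key
        for (i = 5; i < nf[key]; i++) line = line OFS mid[key, i]
        print line OFS cat[key]
    }
}
' YourFile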
I found that this particular problem is faster when processed through a database...
SQL - GROUP BY to combine/concat a column
DB: MySQL through WAMP

Using awk to grab random lines and append to a new column?

So I have a document "1", which is one column. I have 3 files with one column each, and I want to append a randomly selected line from each of those files onto each line of document 1.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended there. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.
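If the lengths do differ, or if you want one independently chosen random line from each of the three files on every row, here is a hedged sketch using GNU shuf (-r samples with repetition, -n limits the count; the file names 1 to 4 are the ones from the question):
rows=$(wc -l < 1)
paste 1 <(shuf -rn "$rows" 2) <(shuf -rn "$rows" 3) <(shuf -rn "$rows" 4)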
