Separate a single file into many based upon a column value - bash

i have a file like this:
one vijay three
two vijay four
five chandu three
outputfile1
one vijay three
two vijay four
outputfile2
five chandu three
The file should be split based upon the value of the second column.
I can do this in shell scripting, but I suppose it is simpler to do in awk.
How do I do it in awk?

awk '{print $0>$2".txt"}' file
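This writes each line to a file named after its second column (vijay.txt and chandu.txt for the sample input). If column 2 can take many distinct values, some awks run out of simultaneously open output files; a hedged variant (not part of the original answer) is to group the input first and close each file once its group is finished. The -s (stable) flag is a GNU/BSD sort extension used here to keep the original line order within each group:
sort -s -k2,2 file | awk '
  $2 != prev { if (out != "") close(out); out = $2 ".txt"; prev = $2 }
  { print > out }'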

Related

How to parse csv file into multiple csv based on row spacing

I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.
dataset1
header_a
header_b
header_c
One
Two
Three
One
Two
Three
<- Always two blank rows between data sets
dataset N <- part of the csv file giving details on the data
header_d
header_e
header_f
header_g
One
Two
Three
Four
One
Two
Three
Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in awk searching for the double blank rows?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
Always two spaced rows between data sets
If your CSV file contains blank lines, and your goal is to write each chunk of records separated by those blank lines into an individual file, then you could use awk with its record separator RS set to the empty string, which puts it in "paragraph mode": each blank-line-separated chunk is treated as one record. Each record can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.
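If you would rather name each output file after the dataset it contains (as in the dataset1.csv / datasetn.csv example), a possible variation is sketched below; it assumes the first whitespace-separated word of each block (table1, tableN, ...) is a usable file name. In paragraph mode the newline also acts as a field separator, so that word is simply $1:
awk -vRS= '{print > ($1 ".csv")}' input.csv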

Compare 2 csv file using shell script and print the output in 3rd file

I am learning shell scripting and am using it to build a testing framework for my team, so I need your help with something.
Overview: I am extracting aggregated values from Hive through my queries using a shell script and storing the result in a separate file, let's say File1.csv.
Now I want to compare the above CSV file with another CSV file, File2.csv, using a shell script and print the result row-wise as PASS (if the records match) or FAIL (if they do not) into a third file, let's say output.txt.
Note: first we need to sort the records in File1.csv, then compare it with File2.csv, and finally store the PASS/FAIL result row-wise in output.txt.
Format of File1.csv
Postcode Location InnerLocation Value_% Volume_%
XYZ London InnerLondon 6.987 2.561
ABC NY High Street 3.564 0.671
DEF Florida Miami 8.129 3.178
Quick help will be appreciated. Thanks in Advance.
You have two sorted text files and want to see which lines are different. There is nothing in your question which would make the problem CSV specific.
A convenient tool for this type of task would be sdiff.
sdiff -s File[12].csv
The -s option ensures that you see only different lines, but have a look at the sdiff man page: Maybe you want also to add one of the options dealing with white space.
If you need to go into more detail and, for example, show not just which CSV lines differ but which field in a line is different, and if these are really general CSV files, you should use a CSV parser rather than a shell script. Parsing a CSV file from a shell script only works reliably if you know for sure that just a subset of the features allowed in CSV files is actually used.
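If you do want the row-wise PASS/FAIL file described in the question, a minimal bash sketch is shown below; it assumes both files have the same number of rows once sorted, and uses the names File1.csv, File2.csv and output.txt from the question plus two hypothetical intermediate files:
#!/bin/bash
# Sort both inputs so that corresponding records line up row by row.
sort File1.csv > File1.sorted
sort File2.csv > File2.sorted

# Read both sorted files in parallel and write PASS/FAIL per row.
while IFS= read -r line1 <&3 && IFS= read -r line2 <&4; do
    if [ "$line1" = "$line2" ]; then
        echo "PASS: $line1"
    else
        echo "FAIL: $line1 | $line2"
    fi
done 3< File1.sorted 4< File2.sorted > output.txt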

advanced concatenation of lines based on the specific number of compared columns in csv

This question builds on a previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always, except the last one, category.
An empty field between "," delimiters means there is no data for that column on that line.
If nameX appears with addressY instead of addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though that is a little slower on bigger files [hundreds of MB+]), that compares the first 4 columns (in this case) and, if they match, merges every category with the ";" delimiter while keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain only the data from the first line of the duplicated group (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed), and it does not require sorting your file first.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation and a lot of fun with sed group substitution (it just about reaches the maximum number of groups available).
The first loop appends the category (one line after the other, as needed by the next loop on each field) and builds a big temporary sub-field structure (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form one group).
The second loop ungroups the sub-fields back into their original places.
Finally, the markers and the leading newline are removed.
This assumes the characters ², ³ and ~ do not occur in the data, because they are used as markers (you can use other markers and adapt the script accordingly).
Note:
For performance on files of hundreds of MB, I guess awk will be a lot more efficient (see the sketch below).
Sorting the data beforehand will certainly help performance by reducing the amount of data to manipulate after each category loop.
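As a rough illustration of that note, below is a minimal awk sketch (not the sed answer above, and using a hypothetical input name input.csv) that keys on the first four fields, joins the categories with ";", and keeps the first non-empty value seen for each middle column. It assumes a fixed number of comma-separated fields per line with no embedded commas, and it does not require the input to be sorted:
awk -F, -v OFS=, '
{
    key = $1 OFS $2 OFS $3 OFS $4            # first four columns identify a record
    if (!(key in cat)) {                     # first time this record is seen
        order[++cnt] = key
        nf[key] = NF
        cat[key] = $NF                       # last column is the category
        for (i = 5; i < NF; i++) mid[key, i] = $i
    } else {
        cat[key] = cat[key] ";" $NF          # merge categories with ";"
        for (i = 5; i < NF; i++)             # keep the most data per column
            if (mid[key, i] == "" && $i != "") mid[key, i] = $i
    }
}
END {
    for (j = 1; j <= cnt; j++) {             # print records in order of first appearance
        k = order[j]
        line = k
        for (i = 5; i < nf[k]; i++) line = line OFS mid[k, i]
        print line OFS cat[k]
    }
}' input.csv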
I found that this particular problem is processed faster through a DB:
SQL - GROUP BY to combine/concat a column
DB: MySQL through WAMP

Bash script - Construct a single line out of many lines having duplicates in a single column

I have an instrumented log file in which each first-column value is duplicated across 6 lines, as below.
//SC001#1/1/1#1/1,get,ClientStart,1363178707755
//SC001#1/1/1#1/1,get,TalkToSocketStart,1363178707760
//SC001#1/1/1#1/1,get,DecodeRequest,1363178707765
//SC001#1/1/1#1/1,get-reply,EncodeReponse,1363178707767
//SC001#1/1/1#1/2,get,DecodeRequest,1363178708765
//SC001#1/1/1#1/2,get-reply,EncodeReponse,1363178708767
//SC001#1/1/1#1/2,get,TalkToSocketEnd,1363178708770
//SC001#1/1/1#1/2,get,ClientEnd,1363178708775
//SC001#1/1/1#1/1,get,TalkToSocketEnd,1363178707770
//SC001#1/1/1#1/1,get,ClientEnd,1363178707775
//SC001#1/1/1#1/2,get,ClientStart,1363178708755
//SC001#1/1/1#1/2,get,TalkToSocketStart,1363178708760
Note: , (comma) is the delimiter here
Likewise, there are many duplicated first-column values (IDs) in the log file (the example above has only two IDs: //SC001#1/1/1#1/1 and //SC001#1/1/1#1/2). I need to consolidate the log records into the format below.
ID,ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd
//SC001#1/1/1#1/1,1363178707755,1363178707760,1363178707765,1363178707767,1363178707770,1363178707775
//SC001#1/1/1#1/2,1363178708755,1363178708760,1363178708765,1363178708767,1363178708770,1363178708775
I suppose I need a bash script for this exercise and would appreciate expert support. I hope there may be a sed or awk solution that is more efficient.
Thanks much
One way:
sort -t, -k4n,4 file | awk -F, '{a[$1]=a[$1]?a[$1] FS $NF:$NF;}END{for(i in a){print i","a[i];}}'
The sort command sorts the file on the basis of the last (4th) column. awk then takes the sorted input and builds an array in which the 1st field is the key and the value is the combination of the last-column values.
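If the timestamps cannot be relied on to sort the events into the order shown in the desired header, a hedged alternative (not part of the answer above) is to key each timestamp by its event name and print the columns in a fixed order, header included:
awk -F, -v OFS=, '
BEGIN {
    # fixed column order taken from the desired header ("EncodeReponse" spelled as in the log)
    n = split("ClientStart,TalkToSocketStart,DecodeRequest,EncodeReponse,TalkToSocketEnd,ClientEnd", ev, ",")
    hdr = "ID"
    for (i = 1; i <= n; i++) hdr = hdr OFS ev[i]
    print hdr
}
{
    if (!($1 in seen)) { seen[$1] = 1; order[++cnt] = $1 }
    t[$1, $3] = $4                           # timestamp keyed by ID and event name
}
END {
    for (j = 1; j <= cnt; j++) {
        id = order[j]
        line = id
        for (i = 1; i <= n; i++) line = line OFS t[id, ev[i]]
        print line
    }
}' file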

using awk to grab random lines and append to a new column?

So I have a document "1", which is one column. I have 3 files with one column each and I want to append a randomly selected line from each of those files onto each line of document 1.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.
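If you literally want one randomly chosen line from each of the three files appended to every line of document 1 (rather than one shuffled line in a single extra column), a small bash sketch is below; it uses the file names 1, 2, 3 and 4 from the question, draws with replacement via shuf, and is slow on large files because each file is reread for every line:
#!/bin/bash
# For every line of file "1", append one random line drawn independently
# from each of the files "2", "3" and "4".
while IFS= read -r line; do
    printf '%s %s %s %s\n' \
        "$line" "$(shuf -n1 2)" "$(shuf -n1 3)" "$(shuf -n1 4)"
done < 1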
