How to transform a csv file having multiple delimiters using awk - shell

Below is some sample data. Please note this operation has to be done on files with millions of records, so I need an efficient method. Essentially we want to replace the 2nd column with the first two characters of the 4th column, concatenated with the 2nd column minus its first 3 '_'-delimited fields.
I have been trying with cut, reading the file line by line, which is very time consuming. I need something with awk, along the lines of:
awk -F, '{print $1","substr($4,1,2)"_"cut -f4-6 -d'_'($2)","$3","$4","$5","$6}'
Input Data:
234234234,123_33_3_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,123_11_2_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,123_33_3_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,123_33_3_11111_qewf_mkhsdf,01,09_68645,43234532,2
Output is required as:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2

You can use awk and printf for line reformatting:
awk -F"[,_]" '{
printf "%s,%s_%s_%s_%s,%s,%s_%s,%s,%s\n", $1,$9,$5,$6,$7,$8,$9,$10,$11,$12
}' file
you get:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2
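If the number of '_'-separated parts in the 2nd column is not always the same, a sketch that splits only on commas and rebuilds the 2nd field with split() could be used instead (assuming the 2nd column always has at least three leading parts to drop, and that file is your input file):
awk -F, 'BEGIN{OFS=","} {
    n = split($2, p, "_")                           # break the 2nd column on underscores
    rest = p[4]
    for (i = 5; i <= n; i++) rest = rest "_" p[i]   # re-join everything after the first 3 parts
    $2 = substr($4, 1, 2) "_" rest                  # prefix with the first 2 chars of the 4th column
    print
}' file
Assigning $2 makes awk rebuild the record using OFS, so the commas between the other columns are preserved.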

Related

Remove duplicate records from a csv file considering single column

I have a file with records of this type:
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
3,22DE17,BA,S6CD6728,24JA13,6A
4,12FE18,AA,S6FD7688,25DA15,7D
I want to remove duplicate records based on the 4th column (the one with values like "S6CD6728"), while skipping the first row, which is
",laac_repo,cntrylist,idlist,domlist,typelist"
I have tried
awk '{a[$4]++}!(a[$4]-1)' filename
And also tried
awk 'FNR > 1 {a[$4]++}!(a[$4]-1)' filename
The expected output is:
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
P.S. The file has more than 10 million records, so please suggest a solution with that in mind (a script would be much appreciated, rather than a single command).
What about this:
awk -F, 'FNR>1 && !seen[$4]++' filename
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
Or, if you want to keep the header line as in your expected output:
awk -F, '!seen[$4]++' filename
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
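Regarding the 10-million-record concern: this is a single pass over the file and the hash table only holds one entry per distinct 4th-column value, so it scales well. If you prefer a script over a one-liner, a minimal sketch (the script and file names here are just examples) could be:
#!/bin/bash
# dedup_by_col4.sh - keep the header row plus the first occurrence of each 4th-column value
# usage: ./dedup_by_col4.sh input.csv > deduped.csv
awk -F, 'NR == 1 || !seen[$4]++' "$1"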

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the words on each line, and the part !unique[$2]++ builds a hash table keyed on the value of $2. The count is incremented every time a value is seen in $2, so when the same value appears again, the negation ! makes the condition false and prevents the line from being printed a second time.
Defining a regex with the -F flag might not be supported in all awk versions. In a POSIX-compliant way, you could do:
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The above assumes you want to de-duplicate the file based on the word after the :, but to de-duplicate on the whole first column only, just do:
awk '!unique[$1]++' file
Since your input data is pretty simple, the command is going to be very easy:
sort file.txt | uniq -w7
This just sorts the file and removes duplicates by comparing only the first 7 characters. Here the first 7 characters are digits (plus the colon); if letters show up, add -i to the command for a case-insensitive comparison.
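If you would rather stay with sort, a sketch of your original command that restricts the uniqueness test to the first key only (assuming the columns really are separated by a single space) would be:
sort -t' ' -k1,1 -u input > output
With -u and -k1,1, sort compares only the chr:position key, so the two 1:10051 lines collapse into one; which of the duplicate lines survives is up to sort, and the positions are ordered lexically rather than numerically.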

BASH - Delete specific lines

I need to remove every line that has a value like SUPMA in the 4th column.
My data looks like this:
abc;def;ghi;SUPMA;klm
abc;def;ghi;SUPMA;klm
SUPMA;def;ghi;MA;klm
abc;def;ghi;SUPMA;klm
abc;def;ghi;SUP;klm
In this example, I want to keep the 3rd and 5th lines.
How can I do this in a bash script? Can I use AWK?
Thanks
awk -F";" '$4!="SUPMA"' yourfile.txt
Here awk splits each record on semicolons, then checks that the 4th field is not SUPMA. When that condition is true, awk prints the line by default.
awk -F\; '$4 !~/SUPMA/' file
SUPMA;def;ghi;MA;klm
abc;def;ghi;SUP;klm
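Note the difference between the two answers: $4!="SUPMA" is an exact string comparison, while $4 !~/SUPMA/ is a regular-expression match that also drops lines where SUPMA merely appears inside the 4th field. A quick way to see it (the value SUPMA2 here is made up for illustration):
printf 'a;b;c;SUPMA2;d\n' | awk -F';' '$4 != "SUPMA"'   # kept: not an exact match
printf 'a;b;c;SUPMA2;d\n' | awk -F';' '$4 !~ /SUPMA/'   # dropped: field contains SUPMA
For the sample data shown above, both versions produce the same result.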

I need to be able to print the largest record value from txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1, or first field) so I can increment it and add more records to the file. I can't seem to figure this one out.
First, I sort the file by the first field in descending order and then perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -r file|cut -d"," -f1|head -1)
You need spaces and quotes:
awk -F, 'NR==1{print $1}'
The command is awk; you need a space after it so bash parses your command line properly. Otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
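With the space added and the program quoted, the command substitution from your question becomes (assuming, as you described, the file has already been sorted on the first field in descending order):
largest_record=$(awk -F, 'NR==1{print $1}' student_records.txt)
echo "$largest_record"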
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is what you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first (comma-separated) field of the line.
A better way to accomplish this would be to use awk to find the largest record for you rather than sorting first; this gives you a solution that is linear in the number of records. You just want the max, so there is no need to do the extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
For this and other awk "one liners" look here.
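If the end goal is the next free record number, a small sketch building on that one-liner (the variable names are just illustrative):
max_record=$(awk -F, 'BEGIN {max = 0} $1 > max {max = $1} END {print max}' student_records.txt)
next_id=$((max_record + 1))   # increment the largest existing record number
echo "$next_id"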

Bash extract parts from string and create csv

I have a document with 1+ million of the following strings, and I would like to create some new structure by extracting some parts and writing a csv file. What's the quickest way to do this?
document/0006-291X(85)91157-X
I would like a file with, on each line, the original string and the extracted parts:
document/0006-291X(85)91157-X;0006-291X;85
You can try this awk one-liner:
awk -F "[/()]" -v OFS=';' '{print $0,$(NF-2),$(NF-1)}' your-file
It splits each line into fields, taking /, ( and ) as delimiters. Then it prints the whole line, the third-from-last field, and the second-from-last field. The option -v OFS=';' makes semicolons the output field separator.
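For a single string, the same extraction can also be sketched with plain bash parameter expansion, which makes it clear where the two parts come from (assuming the parts you want are always the text between the / and the first parenthesis, and the text inside the parentheses):
s='document/0006-291X(85)91157-X'
after_slash=${s#*/}            # -> 0006-291X(85)91157-X
part1=${after_slash%%\(*}      # -> 0006-291X  (everything before the first '(')
part2=${after_slash#*\(}       # strip up to and including the first '('
part2=${part2%%\)*}            # -> 85         (everything before the first ')')
echo "$s;$part1;$part2"        # document/0006-291X(85)91157-X;0006-291X;85
For a million-line file, though, the awk one-liner above will be far faster than looping over lines in bash.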
