Bash extract parts from string and create csv - bash

I have a document with more than a million strings like the following, and I would like to build some new structures by extracting parts of each string and writing them to a CSV file. What's the quickest way to do this?
document/0006-291X(85)91157-X
I would like a file in which each line holds the original string followed by the extracted parts:
document/0006-291X(85)91157-X;0006-291X;85

You can try this awk one-liner:
awk -F "[/()]" -v OFS=';' '{print $0,$(NF-2),$(NF-1)}' your-file
It splits each line into fields, using /, ( and ) as delimiters. It then prints the whole line followed by the third-to-last and second-to-last fields. The option -v OFS=';' sets the semicolon as the output field separator.
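As a quick check of the field arithmetic, here is a minimal sketch that wraps the same one-liner in a shell function and feeds it the sample string. The function name extract_parts and the NF >= 4 guard are my own additions (assumptions, not part of the answer above); the guard only skips lines that do not have the expected shape, because $(NF-2) is an error when a line has fewer than three fields.

```shell
#!/bin/sh
# extract_parts is a hypothetical helper name used for illustration only.
extract_parts() {
  # Split on "/", "(" and ")", then print the line, the third-to-last
  # field and the second-to-last field, joined by ";".
  # The NF >= 4 guard (an assumption) skips lines without the expected shape.
  awk -F '[/()]' -v OFS=';' 'NF >= 4 { print $0, $(NF-2), $(NF-1) }' "$@"
}

printf '%s\n' 'document/0006-291X(85)91157-X' | extract_parts
# prints: document/0006-291X(85)91157-X;0006-291X;85
```

For the real file you would call it as extract_parts your-file > out.csv, which still reads the input once.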

Related

Unable to remove last field CSV file

I have a CSV file containing data like the following, and I need to keep every field as is except the last one.
"one","two","this has comment section1"
"one","two","this has comment section2 and ( anything ) can come here ( ok!!!"
gawk 'BEGIN {FS=",";OFS=","}{sub(FS $NF, x)}1'
gives the error:
fatal: Unmatched ( or (:
I know that removing the '(' from the second line solves the problem, but I cannot remove anything from the comment section.
With any awk you could try:
awk 'BEGIN{FS=",";OFS=","}{$NF="";sub(/,$/,"")}1' Input_file
Or with GNU awk try:
awk 'BEGIN{FS=",";OFS=","}NF{--NF};1' Input_file
Since you mention that anything can appear in the comment, you might also have a line that looks like:
"one","two","comment with a , comma"
So simply using the comma character as a field separator is not reliable.
The following two posts are now very handy:
What's the most robust way to efficiently parse CSV using awk?
[U&L] How to delete the last column of a file in Linux (Note: this is only for GNU awk)
Since you work with GNU awk, you can use any of the following:
$ awk -v FPAT='[^,]*|"[^"]+"' -v OFS="," 'NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\"[^\"]+\"";OFS=","}NF{NF--}1'
$ awk 'BEGIN{FPAT="[^,]*|\042[^\042]+\042";OFS=","}NF{NF--}1'
Why your command fails: awk's sub(ere,repl,in) treats its first argument ere as an extended regular expression. Hence a parenthesis inside the field has a special meaning, and an unmatched one is an error. If you want to blank out a field whose position is known, you should not use sub, but just reassign the field:
$ awk 'BEGIN{FS=OFS=","}{$NF=""}1'
If you want to replace a literal string taken from a field, use index and substr instead (number is the field number, and "repl" must not itself contain s, otherwise the loop never ends):
s=$(number);while(i=index($0,s)){$0=substr($0,1,i-1) "repl" substr($0,i+length(s))}
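If GNU awk is not available, one way around the problem is to keep the regular expression fixed instead of building it from the data. The sketch below is only that, a sketch: it assumes the last field is always double-quoted and contains no embedded double quotes, and the function name strip_last_field is made up for illustration. Under those assumptions, parentheses and commas inside the comment are harmless because they are never interpreted as part of the pattern.

```shell
#!/bin/sh
# strip_last_field is a hypothetical helper name used for illustration only.
strip_last_field() {
  # Remove the final ,"..." at the end of each line.
  # Assumption: the last field is quoted and has no embedded double quotes.
  awk '{ sub(/,"[^"]*"$/, "") } 1' "$@"
}

printf '%s\n' \
  '"one","two","this has comment section1"' \
  '"one","two","this has comment section2 and ( anything ) can come here ( ok!!!"' |
  strip_last_field
# prints "one","two" for both lines
```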

Using awk to try to find a variable in a CSV line

I am trying to go through two files. First one line by line, while using awk to search for a line containing a string pulled from the first file.
while IFS=, read col1 col2 col3
do
echo $(awk -F, -v var="$col2" '$2==var || $2=="www."var {print $0}' searchFile.csv)
# do stuff with data from awk
done < origFile.csv
I am trying to find domain names in this file, and the awk currently is never returning matches. I have checked the files manually to make sure some that are not returning matches are in both, and they are.
I have tried using a nested loop, but bash would not open and read the second file. I also tried grep, but the files are too large and grep ran out of memory.
Sample input for searchFile.csv:
4915,google.com,oct
3532,domain.ca,nov
33451,yahoo.ca,nov
I have ensured there are no spaces in the data being input, and have verified that $col2 from origFile.csv matches data in the searchFile.csv
Does your data file contain spaces? If so, $2 will include them and an exact comparison may fail (because you are splitting with awk -F,). Try matching with ~ instead of ==, keeping in mind that ~ treats var as a regular expression.
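Independently of the whitespace question, the lookup can be done in a single awk pass over both files instead of starting awk once per line of origFile.csv. This is a minimal sketch, not a diagnosis of the original problem: the function name find_domains is hypothetical, and stripping spaces, tabs and carriage returns from the key is an assumption about what might be hiding in the data. Only the keys from the first file are held in memory.

```shell
#!/bin/sh
# find_domains is a hypothetical helper name used for illustration only.
# $1: file whose 2nd column holds the domains to look up (origFile.csv)
# $2: file to search (searchFile.csv)
find_domains() {
  awk -F, '
    # Copy column 2 and strip stray whitespace (an assumption about the data).
    { key = $2; gsub(/[ \t\r]/, "", key) }
    # First file: remember each domain, with and without the www. prefix.
    NR == FNR { want[key]; want["www." key]; next }
    # Second file: print lines whose cleaned column 2 was remembered.
    key in want
  ' "$1" "$2"
}
```

Called as find_domains origFile.csv searchFile.csv, it prints the matching lines of searchFile.csv unchanged, so the "do stuff" step can read them from a pipe.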

How to transform a csv file having multiple delimiters using awk

Below is some sample data. Please note that this operation has to run on files with millions of records, so I need an efficient method. Essentially, we want to replace the 2nd column with the first two characters of the 4th column followed by the 2nd column minus its first three '_'-delimited fields.
I have been trying cut while reading the file line by line, which is very time-consuming. I need something in awk along the lines of:
awk -F, '{print $1","substr($4,1,2)"_"cut -f4-6 -d'_'($2)","$3","$4","$5","$6}'
Input Data:
234234234,123_33_3_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,123_11_2_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,123_33_3_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,123_33_3_11111_qewf_mkhsdf,01,09_68645,43234532,2
Output is required as:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2
You can use awk with printf to reformat each line:
awk -F"[,_]" '{
printf "%s,%s_%s_%s_%s,%s,%s_%s,%s,%s\n", $1,$9,$5,$6,$7,$8,$9,$10,$11,$12
}' file
which gives:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2
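The printf version relies on the 2nd column always having exactly six '_'-separated parts. If that is not guaranteed, a variant that keeps the comma as the only field separator and splits the 2nd column separately may be safer. This is a sketch under that assumption; the function name rewrite_second_column is made up for illustration.

```shell
#!/bin/sh
# rewrite_second_column is a hypothetical helper name used for illustration only.
rewrite_second_column() {
  awk 'BEGIN { FS = OFS = "," }
  {
    # Split column 2 on "_" and keep everything from the 4th part onward.
    n = split($2, part, "_")
    rest = ""
    for (i = 4; i <= n; i++)
      rest = rest "_" part[i]
    # New column 2: first two characters of column 4, then the kept parts.
    $2 = substr($4, 1, 2) rest
    print
  }' "$@"
}

printf '%s\n' '234234234,123_33_3_11111_asdf_asadfas,01,06_1234,4325325432,2' |
  rewrite_second_column
# prints: 234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
```

It is still a single pass over the file, so it should scale in the same way as the printf version.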

cat from file.csv with grep data

I have data in file.csv:
(...)
0000046;0000046;04688;29;1;52.1683;20.5567
0000046;0000046;04688;2A;1;52.1818;20.5639
0000046;0000046;04688;3;1;52.1785;20.5629
0000046;0000046;04688;4;1;52.1815;20.5638
0000046;0000046;04688;5;;52.1779;20.5635
0000046;0000046;04688;6;1;52.1813;20.5636
0000046;0000046;04688;7;;52.1777;20.5634
0000046;0000046;04688;8;;52.1810;20.5635
0000046;0000046;04688;9;1;52.1775;20.5631
0000046;0000046;05027;2;;52.1908;20.5660
0000046;0000046;05027;4;1;52.1907;20.5649
0000046;0000046;05527;1;1;52.1824;20.5636
(...)
I need to extract lines where the third field matches a given value. I tried
cat file.csv |grep 05027
Unfortunately, this matches any line containing 05027 anywhere. How can I restrict to matching only on the third field?
First of all, you don't need cat for grep; you can just run grep pattern file.
awk is better suited to column-based input.
What you can try is:
awk -F';' '$3=="05027"' file
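If the value to match comes from a shell variable, it can be passed in with -v. One subtlety, sketched below: a -v value that looks like a number is compared numerically against $3, so 5027 would also match 05027. Appending an empty string forces a string comparison. The function name filter_third_field and the variable name want are hypothetical.

```shell
#!/bin/sh
# filter_third_field is a hypothetical helper name used for illustration only.
# $1: value the third field must equal; remaining arguments: input files
filter_third_field() {
  value=$1
  shift
  # (want "") turns the value into a string, so "05027" and "5027"
  # are not treated as the same number.
  awk -F';' -v want="$value" '$3 == (want "")' "$@"
}

printf '%s\n' \
  '0000046;0000046;04688;9;1;52.1775;20.5631' \
  '0000046;0000046;05027;2;;52.1908;20.5660' |
  filter_third_field 05027
# prints: 0000046;0000046;05027;2;;52.1908;20.5660
```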

I need to be able to print the largest record value from txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1 or first field) in order for me to increment this unique record and add more records to the file. I seem to not be able to figure this one out.
First, I sort the file by the first field in descending order and then, perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -t, -k1,1nr file | cut -d, -f1 | head -n 1)
You need spaces, and quotes
awk -F, 'NR==1{print $1}'
The command is awk, and you need a space after it so that bash parses your command line properly. Otherwise bash thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is the program awk runs. NR==1 is the pattern: NR is a built-in variable holding the current record number, so the pattern matches only the first record. {print $1} is the action taken when the pattern matches: it prints the first (comma-separated) field of the line.
A better way is to let awk find the largest record number itself rather than sorting first. That solution is linear in the number of records: you only want the maximum, so there is no need to sort the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
For this and other awk "one liners" look here.
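Since the goal in the question is to increment the largest record number inside a function, the max-finding one-liner can be wrapped as shown below. This is a sketch only: the function name next_record_id is hypothetical, and it assumes the first field is always a non-negative integer. An empty file yields 1.

```shell
#!/bin/sh
# next_record_id is a hypothetical helper name used for illustration only.
next_record_id() {
  # Track the largest first field numerically ($1 + 0 forces a number),
  # then print that maximum plus one.
  awk -F, '$1 + 0 > max { max = $1 + 0 } END { print max + 1 }' "$@"
}

printf '%s\n' \
  '12345,fName lName,Grade,email' \
  '64674,fName lName,Grade,email' \
  '86345,fName lName,Grade,email' |
  next_record_id
# prints: 86346
```

Inside a script the result can be captured with new_id=$(next_record_id student_records.txt).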
