Parsing CSV file from DB - bash

I have a DB dump as a comma-separated CSV file. The first line holds the headings/table names, the rest is data, and some rows are duplicated:
HOST_#_INFORMATION,HOST#,Primary Hostname,DNS Domain,IP_#_INFORMATION,Primary IP,DNS
,11,abc,example.com,,10.10.10.10,10.10.10.1
,12,bcd,example.com,,10.10.10.11,10.10.10.1
,13,cde,example.com,,10.10.10.12,10.10.10.1
,11,abc,example.com,,10.10.10.10,10.10.10.1
,13,cde,example.com,,10.10.10.12,10.10.10.1
I need to print only the unique rows of the columns between HOST_#_INFORMATION and IP_#_INFORMATION. The output I am looking for is
HOST#,Primary Hostname,DNS Domain
11,abc,example.com
12,bcd,example.com
13,cde,example.com
I tried awk with gsub, but it only prints the first line. How can I parse this CSV file? I am also open to a Perl solution. Thanks

[root@test /tmp]$ awk -F, -vOFS=, '{if(++a[$2,$3,$4]==1)print $2,$3,$4}' a
HOST#,Primary Hostname,DNS Domain
11,abc,example.com
12,bcd,example.com
13,cde,example.com

No need for awk or sed; use cut and sort instead:
cut -d, -f2-4 infile | sort -u
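Output (the header line goes through the same pipeline, so it is sorted in with the data rows):
11,abc,example.com
12,bcd,example.com
13,cde,example.com
HOST#,Primary Hostname,DNS Domain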
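A plain sort -u treats the header like any other row. As a minimal sketch, assuming the layout from the question (wanted columns are fields 2-4, no quoted commas inside fields), the header can be handled separately from the data. The function name unique_host_columns and the temporary sample file are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print fields 2-4, header first, then the unique data rows.
unique_host_columns() {
    # Header: first line only.
    head -n 1 "$1" | cut -d, -f2-4
    # Data: everything from line 2 on, de-duplicated.
    tail -n +2 "$1" | cut -d, -f2-4 | sort -u
}

# Sample input copied from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HOST_#_INFORMATION,HOST#,Primary Hostname,DNS Domain,IP_#_INFORMATION,Primary IP,DNS
,11,abc,example.com,,10.10.10.10,10.10.10.1
,12,bcd,example.com,,10.10.10.11,10.10.10.1
,13,cde,example.com,,10.10.10.12,10.10.10.1
,11,abc,example.com,,10.10.10.10,10.10.10.1
,13,cde,example.com,,10.10.10.12,10.10.10.1
EOF

unique_host_columns "$sample"
rm -f "$sample"
```

Unlike the awk array approach, this sorts the data rows instead of keeping their original order.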

Assuming your input format (the OP says "between" two fields but shows only one layout):
awk -F ',' 'NR == 1 {print "HOST#,Primary Hostname,DNS Domain"} NR > 1 {print $2 "," $3 "," $4}' YourFile

Assuming you will parse header separately from data, this is how to parse data and remove duplicates:
awk -F',' '{print $2","$3","$4}'|sort -u

In Perl you could use the Text::CSV module, which has a rich set of functions for dealing with CSV files.

Related

How to replace all occurrences of a symbol with awk

From the cmd (awk 'some expression') I got a result in the format
Key:(white_space)Value
Key:(white_space)Value
...
How to manipulate the result to be in the format:
Key=Value
I need this because I want to put the information into .properties file format which is key=value
In other words I need to replace : with = and remove the whitespace.
Is there a command in awk that can achieve this?
You ask for awk, while sed provides just as easy a solution. However, awk makes it trivial with sub as well:
awk '{ sub(/:[ \t]*/,"=") }1'
Example
$ echo "Key: Value" | awk '{ sub(/:[ \t]*/,"=") }1'
Key=Value
Another awk approach.
awk -F'[: ]' '{print $1 "=" $NF}' file.txt
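The field-splitting variant above prints only the last word ($NF), so a value that contains spaces or another colon would be cut short. As a minimal sketch, assuming keys never contain a colon, replacing only the first ":" and the blanks after it leaves the rest of the value alone. The function name to_properties and the sample keys are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: turn "Key: Value" lines on stdin into "Key=Value".
to_properties() {
    # Without the g flag, s/// replaces only the first match on each line,
    # so later colons and spaces inside the value are preserved.
    sed 's/:[[:space:]]*/=/'
}

# Illustrative input, not taken from the question.
printf 'name: demo app\nurl: http://localhost:8080\n' | to_properties
```

The result can be redirected straight into a .properties file.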

Need to prepend a string to a column and to add another column with it

I have a file with 2 lines
123|456|789
abc|123|891
I need output like the following. Basically, I want to prepend the string "xyz" to col 1 and add "xyz" as a new col 2
xyz-123|xyz|456|789
xyz-abc|xyz|123|891
This is what I used
awk 'BEGIN{FS=OFS="fs-"$1}{print value OFS $0}' /tmp/b.log
I get
xyz-123|456|789
xyz-abc|123|891
I tried
awk 'BEGIN{FS=OFS="fs-"$1}{print value OFS $0}' /tmp/b.log|awk -F" " '{$2="fs" $0;}1' OFS=" "
Besides awk's approach of updating fields ($1, $2...), we can also use substitution to do the job:
sed 's/^[^|]*/xyz-&|xyz/' file
If awk is a must:
awk '1+sub(/^[^|]*/, "xyz-&|xyz")' file
Both one-liners give expected output.
Could you please try the following.
awk 'BEGIN{FS=OFS="|"} {$1="xyz-"$1;$2="xyz" OFS $2} 1' Input_file
Or, as per @Corentin Limier's comment, try:
awk 'BEGIN{FS=OFS="|"} {$1="xyz-" $1 OFS "xyz"} 1' Input_file
Output will be as follows.
xyz-123|xyz|456|789
xyz-abc|xyz|123|891
I would use sed instead of awk as follows:
sed -e 's/^/xyz-/' -e 's/|/|xyz|/' Input_file
This prepends xyz- at beginning of each line and changes the first | into |xyz|
Another slight variation of sed:
sed 's/^/xyz-/;s/|/&xyz&/' file
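All the answers above hard-code "xyz". As a minimal sketch, assuming the string contains no backslashes (awk -v interprets escape sequences), it can be passed in as a variable instead. The function name add_prefix_column and the variable s are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: $1 is the string to add; input is read from stdin.
add_prefix_column() {
    # Assigning to $1 rebuilds the record with OFS, so the text appended
    # after the first field shows up as a new second column.
    awk -v s="$1" 'BEGIN { FS = OFS = "|" } { $1 = s "-" $1 OFS s } 1'
}

# Sample lines copied from the question.
printf '123|456|789\nabc|123|891\n' | add_prefix_column xyz
```

For a file, the same call reads: add_prefix_column xyz < /tmp/b.log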

Delete all lines after a specific date

I have a lot of *.csv files. I want to delete the content after a specific line: all lines dated after 20031231 should go.
How do I solve this problem with some lines of a shell script?
Test,20031231,000107,0.74843,0.74813
Test,20031231,000107,0.74838,0.74808
Test,20031231,000108,0.74841,0.74815
Test,20031231,000108,0.74835,0.74809
Test,20031231,000110,0.74842,0.74818
Test,20040101,000100,0.73342,0.744318
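Quick and dirty, without any other information about constraints. Note that this prints from line 1 through the first later line that matches 20031231, so it stops early when several lines carry that date: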
sed '1,/20031231/p;d' YourFile
If you want to use a shell script, the best is to use awk. This will do the trick:
awk 'BEGIN {FS=","} {if ($2 == "20031231") print $0}' input.csv > output.csv
This code will write to a different file only the lines that have 20031231.
This version ignores empty lines and rows dated after the cutoff.
awk file:
$ cat awk.awk
{
    if ($2 <= "20031231" && $0 != "") {
        print $0
    } else {
        next
    }
}
execution:
$ awk -F',' -f awk.awk input
Test,20031231,000107,0.74843,0.74813
Test,20031231,000107,0.74838,0.74808
Test,20031231,000108,0.74841,0.74815
Test,20031231,000108,0.74835,0.74809
Test,20031231,000110,0.74842,0.74818
one liner:
$ awk -F',' '{if($2<="20031231" && $0!=""){print $0}else{next}}' input
Test,20031231,000107,0.74843,0.74813
Test,20031231,000107,0.74838,0.74808
Test,20031231,000108,0.74841,0.74815
Test,20031231,000108,0.74835,0.74809
Test,20031231,000110,0.74842,0.74818
With Miller (http://johnkerl.org/miller/doc/), selecting the rows dated after the cutoff:
mlr --nidx --fs "," filter '$2>20031231' input
gives you the rows that would be deleted:
Test,20040101,000100,0.73342,0.744318
With awk please try:
awk -F, '$2<=20031231' input.csv
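The question mentions many *.csv files, while the answers above each handle one. As a minimal sketch, assuming the date is always the second comma-separated field in YYYYMMDD form, a loop can apply the same filter to each file. The function name trim_after_date is hypothetical, and the files are overwritten, so work on copies first:

```shell
#!/bin/sh
# Hypothetical helper: $1 = cutoff date (YYYYMMDD), remaining args = CSV files.
trim_after_date() {
    cutoff=$1
    shift
    for f in "$@"; do
        # Keep non-empty rows whose 2nd field is <= the cutoff. Write to a
        # temporary file first, and replace the original only if awk succeeded.
        awk -F, -v d="$cutoff" 'NF && $2 <= d' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Demo on a throwaway copy of two rows from the question.
dir=$(mktemp -d)
printf 'Test,20031231,000110,0.74842,0.74818\nTest,20040101,000100,0.73342,0.744318\n' > "$dir/sample.csv"
trim_after_date 20031231 "$dir"/*.csv
cat "$dir/sample.csv"
rm -r "$dir"
```

Because it filters on the field value rather than cutting at a line position, this also works when the rows are not sorted by date.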

Read variable from file with awk?

I'm new using awk and I found it very useful for extracting data from columns. For example in my file I had
Data: 1234 23434 31324
If I wanted the second column I used:
awk '/Data:/ {print $3}' file.txt
But next, I had some variables inside the file, let's say:
variable_1=1
variable_2=4
How can I extract only the value? And how can I extract the name of a variable when I know its value?
awk lets you specify the field delimiter with -F:
awk -F'=' '$1 == "variable_1" {print $2}' file
Prints:
1
You can do a lot of things with your file, what do you really want?
Get values:
source file.txt
echo "variable_1=${variable_1}"
echo "variable_2=${variable_2}"
Get the keys whose value is 2 (-n together with the p flag prints only the lines where the substitution happened):
sed -n '/=2$/ s/=.*//p' file.txt
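Sourcing the file executes whatever it contains, which may be more than you want for plain data. As a minimal sketch, assuming one name=value pair per line with no "=" inside values, both lookups from the question can be done with awk alone. The function names get_value and get_names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers; $2 is the file holding name=value lines.
get_value() {
    # Print the value stored under the name given in $1.
    awk -F= -v k="$1" '$1 == k { print $2 }' "$2"
}
get_names() {
    # Print every name whose value equals $1. Appending "" forces a string
    # comparison, so "04" does not match 4.
    awk -F= -v v="$1" '($2 "") == (v "") { print $1 }' "$2"
}

# Sample data copied from the question.
vars=$(mktemp)
printf 'variable_1=1\nvariable_2=4\n' > "$vars"
get_value variable_2 "$vars"
get_names 1 "$vars"
rm -f "$vars"
```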

Bash: Converting 4 columns of text interleaved lines (tab-delimited columns to FASTQ file)

I need to convert a 4-column file to 4 lines per entry. The file is tab-delimited.
The file is currently arranged in the following format, with each line representing one record/sequence (with millions of such lines):
#SRR1012345.1 NCAATATCGTGG #4=DDFFFHDHH HWI-ST823:136:C24YTACXX
#SRR1012346.1 GATTACAGATCT #4=DDFFFHDHH HWI-ST823:136:C22YTAGXX
I need to rearrange this such that the four columns are presented as 4 lines:
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX
What would be the best way to go about doing this, preferably with a bash one-liner? Thank you for your assistance!
You can use tr:
< file tr '\t' '\n' > newfile
awk makes the intent very clear here:
awk '{print $1; print $2; print $3; print $4}' file
$ awk -v OFS='\n' '{$1=$1}1' file
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX
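By default awk splits on any run of blanks, so a field that contains a space would be broken across lines, and tr would pass a malformed record through unnoticed. As a minimal sketch, assuming every valid record has exactly four tab-separated fields and that your awk accepts "/dev/stderr" as an output name (gawk, mawk and most others do), splitting on tabs only and checking the field count looks like this. The function name columns_to_lines is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads the named files, or stdin if none are given.
columns_to_lines() {
    awk -F'\t' -v OFS='\n' '
        # Report and skip records that do not have exactly four fields.
        NF != 4 { printf "line %d: expected 4 fields, got %d\n", NR, NF > "/dev/stderr"; next }
        # Touching $1 rebuilds the record with OFS, i.e. one field per line.
        { $1 = $1 } 1
    ' "$@"
}

# First sample record from the question.
printf '#SRR1012345.1\tNCAATATCGTGG\t#4=DDFFFHDHH\tHWI-ST823:136:C24YTACXX\n' | columns_to_lines
```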

Resources