Add a timestamp column to a CSV based on other columns (using bash)

I need to read a CSV file (list.csv) like this:
0;John Doe;2001;03;24
1;Jane Doe;1985;12;05
2;Mr. White;2018;06;01
3;Jake White;2017;11;20
...
and add a column (position doesn't matter) with a Unix timestamp based on the year/month/day in columns 3, 4 and 5, to get this:
0;John Doe;2001;03;24;985392000
1;Jane Doe;1985;12;05;502588800
2;Mr. White;2018;06;01;1527811200
3;Jake White;2017;11;20;1511136000
...
So I wrote this script.sh:
#!/bin/bash
while read -r line
do
printf '%s;' "$line"
date -d "$(awk -F\; '{print $3$4$5}' <<<"$line")" +%s
done
and I ran:
<list.csv ./script.sh
and it works, but it's very slow on large CSVs.
Is there a way to do it faster in a sed/awk command line?
I mean, can I (for instance) inject a bash command into a sed/awk line?
For example (I know this won't work, it's just an example):
awk -F\; '{print $1 ";" $2 ";" $3 ";" $4 ";" $5 ";" $(date -d $3$4$5 +%s)}'

GNU awk to the rescue!
$ gawk -F';' '{$0=$0 FS mktime($3" "$4" "$5" 00 00 00")}1' file
0;John Doe;2001;03;24;985410000
1;Jane Doe;1985;12;05;502606800
2;Mr. White;2018;06;01;1527825600
3;Jake White;2017;11;20;1511154000
I defaulted hour/min/sec to 00 00 00; adjust if you need something else. These values differ from yours because mktime uses the local time zone (here UTC-5), while your expected output corresponds to UTC midnights.
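If your gawk is 4.2 or newer, mktime accepts an optional utc-flag as a second argument (older gawks lack it), which reproduces the exact values from the question:
$ gawk -F';' '{$0=$0 FS mktime($3" "$4" "$5" 00 00 00", 1)}1' file
0;John Doe;2001;03;24;985392000
1;Jane Doe;1985;12;05;502588800
2;Mr. White;2018;06;01;1527811200
3;Jake White;2017;11;20;1511136000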

For other awks without builtin time functions:
awk -F';' '{
# build the shell command, e.g. "date -d 20010324 +%s"
cmd = "date -d " $3 $4 $5 " +%s"
# run it and read its single line of output
cmd | getline time
# close the pipe so the next record spawns a fresh date process
close(cmd)
$0 = $0 FS time
print
}' file
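Since spawning one date process per line is the slow part, GNU date can also batch-convert with -f, which reads one date per line, so the whole file needs a single date process. A sketch, assuming GNU date/cut/paste and bash process substitution:
paste -d';' list.csv <(cut -d';' -f3-5 list.csv | tr ';' '-' | date -u -f- +%s)
The cut/tr stage turns columns 3-5 into YYYY-MM-DD strings, date -u -f- converts them all to UTC epoch seconds in one pass, and paste glues the result on as a new column.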
or Perl:
perl -MTime::Piece -F';' -lane '
print join ";", @F, Time::Piece->strptime("@F[2..4]", "%Y %m %d")->epoch
' file
# or
perl -MTime::Local -F';' -lane '
print join ";", @F, timelocal(0, 0, 0, $F[4], $F[3]-1, $F[2]-1900)
' file

Related

using awk to print header name and a substring

I'm trying to use this code to print a header with the gene name and then pull a substring based on its location, but it doesn't work:
>output_file
cat input_file | while read row; do
echo $row > temp
geneName=`awk '{print $1}' tmp`
startPos=`awk '{print $2}' tmp`
endPOs=`awk '{print $3}' tmp`
for i in temp; do
echo ">${geneName}" >> genes_fasta ;
echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
done
done
input_file
nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548
fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............
and this is my wrong output file
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
output should look like this:
>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....
You can do this with a single call to awk, which will be orders of magnitude more efficient than looping in a shell script and calling awk several times per iteration. Since you have bash, you can use command substitution to read the contents of fasta into an awk variable, then simply output the heading followed by the substring running from the beginning through the ending position in your fasta file.
For example:
awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
or using getline within the BEGIN rule:
awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
Example Input Files
Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:
$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79
and the first 129 characters of your example fasta:
$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
Example Use/Output
$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg
Look things over and let me know if I understood your requirements, or if you have further questions about the solution.
If I'm understanding correctly, how about:
awk 'NR==FNR {fasta = fasta $0; next}
{
printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
}' fasta input_file > genes_fasta
It first reads the fasta file and stores the sequence in the variable fasta.
Then it reads input_file line by line and extracts the substring of fasta starting at $2 with length $3 - $2 + 1. (Note that the 3rd argument to the substr function is a length, not an end position.)
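A quick sanity check of that length semantics (my own illustration, not part of the original answer):
awk 'BEGIN { print substr("abcdef", 2, 3) }'
This prints "bcd": 3 characters starting at position 2.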
Hope this helps.
made it work!
this is the script for pulling substrings from a fasta file
cat genes_and_bounderies1 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos + 1)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
done
done

Gawk line removal, splitter is ":"

Is it possible to move certain columns from one .txt file into another .txt file?
I have a .txt that contains:
USERID:ORDER#:IP:PHONE:ADDRESS:POSTCODE
USERID:ORDER#:IP:PHONE:ADDRESS:POSTCODE
With gawk I want to extract ADDRESS & POSTCODE columns into another .txt, so for this given file the output should be:
ADDRESS1:POSTCODE1
ADDRESS2:POSTCODE2
etc.
This is a classic awk transform. You want -F: to specify that the input is delimited by ":" and then print a new ":" between the fields on output:
awk -F: '{ print $5 ":" $6 }' <input.txt >output.txt
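If awk is not a hard requirement, plain cut does the same job (a standard-coreutils alternative, not part of the original answers):
cut -d: -f5,6 <input.txt >output.txt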
Try this:
awk -F: '{printf "%s:%s ",$5,$6}' ex.txt
input is
USERID:ORDER#:IP:PHONE:ADDRESS1:POSTCODE1
USERID:ORDER#:IP:PHONE:ADDRESS2:POSTCODE2
output is (on one line if I understand correctly)
ADDRESS1:POSTCODE1 ADDRESS2:POSTCODE2
The only flaw is that it ends with a trailing space and no final newline, which can be fixed with the slightly more complex (but still readable):
awk -F: 'BEGIN {z=0;} {if (z==1) { printf " "; } ; z=1; printf "%s:%s",$5,$6} END{printf"\n"}' ex.txt
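The same fix reads a bit more idiomatically with a conditional separator (a suggested equivalent producing the same output):
awk -F: '{printf "%s%s:%s", (NR>1 ? " " : ""), $5, $6} END {print ""}' ex.txt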
Or, hard-coding the row numbers from the desired output (note this only covers the two sample lines):
awk -F: 'NR==1 {print $5"1:"$6"1"};NR==2 {print $5"2:"$6"2"}' file
ADDRESS1:POSTCODE1
ADDRESS2:POSTCODE2

How to convert date with awk

My file temp.txt
ID53,20150918,2015-09-19,,0,CENTER
ID54,20150911,2015-09-14,,0,CENTER
ID55,20150911,2015-09-14,,0,CENTER
I need to convert the 2nd field (yyyymmdd) to seconds since the epoch, replacing it in place.
I tried this, but only the first line was replaced:
awk -F"," '{ ("date -j -f ""%Y%m%d"" ""20150918"" ""+%s""") | getline $2; print }' OFS="," temp.txt
and tried to like this
awk -F"," '{system("date -j -f ""%Y%m%d"" "$2" ""+%s""") | getline $2; print }' temp.txt
the output is:
1442619474
sh: 0: command not found
ID53,20150918,2015-09-19,,0,CENTER
1442014674
ID54,20150911,2015-09-14,,0,CENTER
1442014674
ID55,20150911,2015-09-14,,0,CENTER
Using gsub also didn't work:
awk -F"," '{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" ""+%s""")",$2); print}' OFS="," temp.txt
awk: syntax error at source line 1
context is
{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" >>> ""+% <<< s""")",$2); print}
awk: illegal statement at source line 1
extra )
I need the output to be like this. How can I do it?
ID53,1442619376,2015-09-19,,0,CENTER
ID54,1442014576,2015-09-14,,0,CENTER
ID55,1442014576,2015-09-14,,0,CENTER
This GNU awk script should do it. If it is not yet installed on your Mac, I suggest installing MacPorts and then GNU awk. You can also install decent versions of bash, date and other important utilities, for which the defaults are really disappointing on OS X.
BEGIN { FS = ","; OFS = FS; }
{
y = substr($2, 1, 4);
m = substr($2, 5, 2);
d = substr($2, 7, 2);
$2 = mktime(y " " m " " d " 00 00 00");
print;
}
Put it in a file (e.g. txt2ts.awk) and process your file with:
$ awk -f txt2ts.awk data.txt
ID53,1442527200,2015-09-19,,0,CENTER
ID54,1441922400,2015-09-14,,0,CENTER
ID55,1441922400,2015-09-14,,0,CENTER
Note that we do not get exactly the same timestamps: mktime above forces the time of day to 00:00:00 in your local time zone, while BSD date fills unspecified fields from the current time, but that is another problem.
Explanations: substr(s, m, n) returns the n-character substring of s that starts at position m (counting from 1). mktime("YYYY MM DD HH MM SS") converts the date string into a timestamp (seconds since the epoch). FS and OFS are the input and output field separators, respectively. The commands between the curly braces of the BEGIN pattern are executed at the beginning only, while the others are executed on each line of the file.
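If you prefer not to keep a separate script file, the same logic fits on one line:
awk 'BEGIN{FS=OFS=","} {$2 = mktime(substr($2,1,4) " " substr($2,5,2) " " substr($2,7,2) " 00 00 00")} 1' data.txt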
You could use substr:
printf "%s-%s-%s\n", substr($6,1,4), substr($6,5,2), substr($6,7,2)
Assuming that the 6th field was 20150914, this would produce 2015-09-14. (Note that awk string positions start at 1, not 0.)
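For completeness, the getline idea from the question also works once the quoting is straightened out; a sketch assuming BSD/macOS date as in the question (the missing close() is why only the first line got replaced in the original attempt):
awk -F, -v OFS=, '{
cmd = "date -j -f %Y%m%d " $2 " +%s"
cmd | getline $2
close(cmd)
print
}' temp.txt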

bash awk: print 1st column and 3rd column with everything after

I am working on the following bash script:
# contents of dbfake file
1 100% file 1
2 99% file name 2
3 100% file name 3
#!/bin/bash
# cat out data
cat dbfake |
# select lines containing 100%
grep 100% |
# print the first and third columns
awk '{print $1, $3}' |
# echo out id and file name and log
xargs -rI % sh -c '{ echo %; echo "%" >> "fake.log"; }'
exit 0
This script works ok, but how do I print everything in column $3 and then all columns after?
You can use cut instead of awk in this case:
cut -f1,3- -d ' '
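In the context of the question's pipeline that would be:
$ grep 100% dbfake | cut -f1,3- -d ' '
1 file 1
3 file name 3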
awk '{ $2 = ""; print }' # remove col 2
If you don't mind a little whitespace:
awk '{ $2="" }1'
But you have a UUOC (useless use of cat) and an unneeded grep:
< dbfake awk '/100%/ { $2="" }1' | ...
If you'd like to trim that whitespace:
< dbfake awk '/100%/ { $2=""; sub(FS "+", FS) }1' | ...
For fun, here's another way using GNU sed:
< dbfake sed -r '/100%/s/^(\S+)\s+\S+(.*)/\1\2/' | ...
If you only want the file names (dropping the id column), all you need is:
awk 'sub(/.*100% /,"")' dbfake | tee "fake.log"
Others have responded in various ways, but I want to point out that using xargs to multiplex output is a rather bad idea.
Instead, why don't you:
awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >>"fake.log"}' dbfake
That's all. You don't need grep, you don't need multiple pipes, and you definitely don't need to fork a shell for every line you output.
You could do awk '...; print}' | tee fake.log, but there is not much point in forking tee if awk can handle it as well.
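Spelled out, that tee variant would look like this (use tee -a if the log should be appended rather than overwritten):
awk '$2=="100%" { sub("100%[[:space:]]*",""); print }' dbfake | tee fake.log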

set shell variable in awk and reuse

How can I pass a shell variable to awk, set it, use it in another awk in same line and print it?
I want to save $0 (all fields) into a variable first, parse $6 (ABC 123456M123000) to get '123000', do a range check on it, and if it passes, print all fields ($0).
part 1: I am trying to do:
line="hello"
java class .... | awk -F, -v '{line=$0}' | awk 'begin my range check code' | if(p>100) print $line }
part2:
$6="ABC 123456M123000" ( string that I will parse)
Once I store all fields into a variable, I can parse $6 using this:
awk 'BEGIN {FS=" "} { print $2; len=length($2); p=substr($2,8,len)+0 ; print len,p ; if(p>100) print $line }'
But my question is in part1: how to store $0 into a variable so that after my check is done, I can print them?
It's not clear why you need multiple invocations of awk. From your description, it looks like you are just trying to do:
... | awk -F, '{split( $6, f, "M" )} f[2] > min' min=100
or, if you can't split on 'M' but need to use substr (or some other method to extract the desired value):
... | awk -F, '{ split( $6, f, " " )} 0+substr( f[2], 8 ) > min' min=100
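Since the question is about passing a shell variable into awk: the trailing min=100 assignment above is one way; the more common idiom is -v (a sketch with a hypothetical shell variable threshold):
threshold=100
... | awk -F, -v min="$threshold" '{ split( $6, f, "M" )} f[2]+0 > min'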
With the shell:
java ... | while IFS= read -r line ; do
sixth=$(IFS=,; set -- $line; echo "$6")
val=${sixth:11}
(( $val > 100 )) && echo "$line"
done
Some bash-isms there.
