Bash Script to append/insert Timestamp into CSV - bash

I have a sample CSV file named RecordCountTest.csv that looks like this:
Date Time Shift Record
26-06-2016 15:23:01 Shift2 000002
23-06-2016 09:06:24 Shift1 000001
When a button wired to a GPIO pin is pressed, a bash command is run. I would like a simple AWK or bash script that I can call after that command to document that it happened and track when. Specifically, I want to insert a row at the top of the file (but under the header) containing the current date (DD-MM-YYYY), the current time (HH:MM:SS), the shift (a variable determined by the bash script; I will post that as a separate question in 90 minutes if I can't figure it out before then), and the previous record number incremented by one, then save the file back as RecordCountTest.CSV. I would prefer sed if possible, since that is what I am currently trying to learn, and any suggestions that help me understand the syntax are welcome, but I will accept anything that works in this bash script, including AWK.

awk to the rescue!
$ gawk -v OFS="\t" -v s="${shift}" 'NR==2{print strftime("%d-%m-%Y"),
strftime("%H:%M:%S"),
s,
sprintf("%06d",$4+1)}
{$1=$1}1' file > temp && mv temp file
If your awk doesn't support the strftime function, you can fall back to getting the date from bash:
$ awk -v OFS="\t" -v s="${shift}" \
      -v d="$(date +"%d-%m-%Y"$'\t'"%H:%M:%S")" \
      'NR==2{print d, s, sprintf("%06d",$4+1)}
       {$1=$1}1' file > temp && mv temp file
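A minimal wrapper that the button-press command could call might look like this (a sketch: the script name, the CSV location and how the shift value is passed in are my assumptions, not part of the question):
#!/bin/bash
# log_press.sh - insert a new record row under the header of RecordCountTest.csv
file=/home/pi/RecordCountTest.csv   # assumed location of the CSV
shift_name=${1:-Shift1}             # shift value passed in by the caller
gawk -v OFS="\t" -v s="$shift_name" '
  NR==2 { print strftime("%d-%m-%Y"), strftime("%H:%M:%S"), s, sprintf("%06d", $4+1) }
  { $1=$1 } 1
' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
The GPIO handler would then call it as, for example, ./log_press.sh Shift2.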

Related

Optimising my script which does lookups into a big compressed file

I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop.
Basically what it does is:
getting a read id and its position from a tsv
using that information to look the read up with awk in a file
printing the line and exporting it
My issues are:
1) the files are 60GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one now, and I'm not sure I'll have enough space)
2) it is slow to search through them anyway
My ideas to improve it:
0) as said, if possible I'll decompress the file
1) use GNU parallel with parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv, but I'm unsure it works as expected: it cuts the time per lookup from 36 min to 16, so just a factor of 2.5? (I have 16 cores)
2) split my list of ids to look up into several files and launch them in parallel (but this may be redundant with GNU parallel?)
3) sort the bam file by read name, and exit awk after having found 2 matches (there can't be more than 2)
Here is the rest of my bash script. I'm really open to ideas to improve it, but I'm not a superstar in programming, so maybe keeping it simple would help? :)
My bash script:
#!/bin/bash
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf $0 "\t%s\twh_genome",pos_hotspot}'| head -2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"
My tsv file has a format like:
READ_ABCDEF\t1200
Thank you a lot ++
TL;DR
Your new script will be:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
You are reading the entire file for each of the inputs. It is better to look for all of them at the same time: start by extracting the interesting reads and then, on that subset, apply the second transformation.
samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"
Here you are getting all the reads with samtools and searching for all the terms that appear in the file given to grep's -f parameter; that file (a process substitution here) contains the first column of the search input file. The output is a sam file with only the reads that are listed in the search input file.
Finally, use awk to add the extra information:
awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam"
Open the search input file with awk at the beginning and read its contents into an array (st_array)
Set the Output Field Separator to the tab character
Traverse the sam file and append the extra information from the pre-populated array.
I'm proposing this schema because I feel like grep is faster than awk for doing the search, but the same result can be obtained with awk alone:
samtools view -@ 2 "$bam" | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
In this case, you only need to add a conditional to identify the interesting reads and get rid of the grep.
In either case, you don't need to re-read the file more than once or to decompress it before working with it.
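Put together, a one-pass replacement for the original loop might look like this (a sketch: the paths come from the question, $1 is still the search tsv, and the output is appended to the same export file as before):
#!/bin/bash
# Load every read id and hotspot position from the search tsv ($1) once,
# then stream the bam a single time and annotate the matching reads.
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam |
awk -v st="$1" '
  BEGIN { OFS="\t"; while ((getline < st) > 0) { st_array[$1]=$2 } }
  $1 in st_array { print $0, st_array[$1], "wh_genome" }
' >> /data/bismark2/export_reads_mapped.tsv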

How to make awk command run faster on large data files

I used this awk command below to create a new UUID column in a table in my existing .dat files.
$ awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' $filename > ${filename}.pk
The problem is that my .dat files are pretty big (like 50-60 GB) and this awk command takes hours even on small data files (like 15MB).
Is there any way to increase the speed of this awk command?
I wonder if you might save time by not having awk open and close uuidgen every line.
$ function regen() { while true; do uuidgen; done; }
$ coproc regen
$ awk -v f="$filename" '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > "$filename".pk
This has awk reading your "real" filename from a variable, and the uuid from stdin, because the call to uuidgen is handled by a bash "coprocess". The funky bit around the getline is to tell awk to quit once it runs out of input from $filename. Also, note that awk is taking input from input redirection instead of reading the file directly. This is important; the file descriptor at /dev/fd/## is a bash thing, and awk can't open it.
This should theoretically save you time doing unnecessary system calls to open, run and close the uuidgen binary. On the other hand, the coprocess is doing almost the same thing anyway by running uuidgen in a loop. Perhaps you'll see some improvement in an SMP environment. I don't have a 50GB text file handy for benchmarking. I'd love to hear your results.
Note that coproc is a feature that was introduced with bash version 4. And use of /dev/fd/* requires that bash is compiled with file descriptor support. In my system, it also means I have to make sure fdescfs(5) is mounted.
I just noticed the following on my system (FreeBSD 11):
$ /bin/uuidgen -
usage: uuidgen [-1] [-n count] [-o filename]
If your uuidgen also has a -n option, then adding it to your regen() function with any count greater than 1 might be a useful optimization, to reduce the number of times the command needs to be reopened. For example:
$ function regen() { while true; do uuidgen -n 100; done; }
This would result in uuidgen being called only once every 100 lines of input, rather than for every line.
And if you're running Linux, depending on how you're set up, you may have an alternate source for UUIDs. Note:
$ awk -v f=/proc/sys/kernel/random/uuid '{getline u<f; close(f); print u,$0}' OFS="|" "$filename" > "$filename".pk
This doesn't require the bash coproc, it just has awk read a random uuid directly from a Linux kernel function that provides them. You're still closing the file handle for every line of input, but at least you don't have to exec the uuidgen binary.
YMMV. I don't know what OS you're running, so I don't know what's likely to work for you.
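If you do want to benchmark the variants yourself, a rough harness might look like this (a sketch: the test file name and size are made up; shrink the size if the one-process-per-line version is unbearably slow):
# build a throwaway test file
seq 100000 | awk '{print $0 "|some|data"}' > testdata.dat
# original approach: one uuidgen process per input line
time awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' testdata.dat > /dev/null
# coproc approach from above
function regen() { while true; do uuidgen; done; }
coproc regen
time awk -v f=testdata.dat '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > /dev/null
kill "$COPROC_PID"   # stop the uuid generator once awk has finished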
Your script is calling shell to call awk to call shell to call uuidgen. Awk is a tool for manipulating text; it's not a shell (an environment to call other tools from), so don't do that. Just call uuidgen from the shell:
$ cat file
foo .*
bar stuff
here
$ xargs -d $'\n' -n 1 printf '%s|%s\n' "$(uuidgen)" < file
5662f3bd-7818-4da8-9e3a-f5636b174e94|foo .*
5662f3bd-7818-4da8-9e3a-f5636b174e94|bar stuff
5662f3bd-7818-4da8-9e3a-f5636b174e94|here
I'm just guessing that the real problem here is that you're running a sub-process for each line. You could read your file explicitly line by line and read output from a batch-uuidgen line by line, and thus only have a single subprocess to handle at once. Unfortunately, uuidgen doesn't work that way.
Maybe another solution?
perl -MData::UUID -ple 'BEGIN{ $ug = Data::UUID->new } $_ = lc($ug->to_string($ug->create)) . " | " . $_' $filename > ${filename}.pk
Might this be faster?

Bash Script and Edit CSV

I'm trying to create a bash file to do the following tasks:
1- I have a file named "ORDER.CSV" and need to make a copy of it with the date/time appended to the file name - this I was able to get done
2- I need to edit a particular field in the new csv file created above: column DY, row 2. This I have not been able to do. I need to insert the date the bash script is run into this row, in the format DDMMYY
3- Then have the system upload it to SFTP. This I believe I have figured out, as shown below.
#!/usr/bin/env bash
I'm able to get this step done with the command below:
# Copies order.csv and appends file name date/time
#cp /mypath/SFTP/order.csv /mypath/SFTP/orders.`date +"%Y%m%d%H%M%S"`.csv
I need help echoing the new file name:
echo "new file name "
I need help editing the field under column DY, row 2. I need to insert the current date in this format: MMDDYYYY
awk -v r=2 -v DY=3 -v val=1001 -F, 'BEGIN{OFS=","}; NR != r; NR == r {$c = val; print}'
This should connect to SFTP, which it does without issues:
sshpass -p MyPassword sftp -o "Port 232323" myusername@mysftpserver.com
I need to pass the new file that was created and put it on the SFTP server:
put /incoming/neworder/NEWFILEName.csv
Thanks
I guess that's what you want to do...
echo -e "h1,h2,h3,h4\n1,2,3,4" |
awk -v r=2 -v c=3 -v v=$(date +"%d%m%y") 'BEGIN{FS=OFS=","} NR==r{$c=v}1'
h1,h2,h3,h4
1,2,120617,4
to find the column index from the column name (not tested)
... | awk -v r=2 -v h="DY" -v v=$(date +"%d%m%y") '
BEGIN {FS=OFS=","}
NR==1 {c=NF+1; for(i=1;i<=NF;i++) if($i==h) {c=i; break}}
NR==r {$c=v}1'
The risk of doing this is that the column name may not match; in that case, this will add the value as a new column.
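Pulling the pieces of the question together, the whole flow might look something like this (a sketch: the paths, the SFTP details and the remote directory are copied from the question and are untested assumptions):
#!/usr/bin/env bash
# copy the order file, appending a date/time stamp to the name
src=/mypath/SFTP/order.csv
new=/mypath/SFTP/orders.$(date +"%Y%m%d%H%M%S").csv
cp "$src" "$new"
echo "new file name: $new"
# put today's date (DDMMYY) into the DY column of row 2, locating the column by its header name
awk -v r=2 -v h="DY" -v v="$(date +"%d%m%y")" '
  BEGIN {FS=OFS=","}
  NR==1 {c=NF+1; for(i=1;i<=NF;i++) if($i==h) {c=i; break}}
  NR==r {$c=v} 1
' "$new" > "$new.tmp" && mv "$new.tmp" "$new"
# upload the new file over SFTP using a batch of commands on stdin
sshpass -p MyPassword sftp -o "Port 232323" myusername@mysftpserver.com <<EOF
put $new /incoming/neworder/
EOF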

Save changes to a file AWK/SED

I have a huge comma-delimited text file.
19429,(Starbucks),390 Provan Walk,Glasgow,G34 9DL,-4.136909,55.872982
The first field is a unique id. I want the user to enter the id and a new value for one of the following 6 fields, so that field can be replaced. I am also asking them to enter a value from 2-7 to identify which field should be replaced.
Now I've done something like this: I check every line to find the id the user entered and then replace the value.
awk -F ',' -v elem=$element -v id=$code -v value=$value '{if($1==id) {if(elem==2) { $2=value } etc }}' $path
Where $path = /root/clients.txt
Let's say the user enters "2" in order to replace the second field, and also enters "Whatever". I want "(Starbucks)" to be replaced with "Whatever". What I've done works fine but does not save the change to the file. I know that awk is not really meant to do this, but I don't know how else to do it. I've searched a lot on Google but still no luck.
Can you tell me how I'm supposed to do this? I know that I can do it with sed but I don't know how.
Newer versions of GNU awk support inplace editing:
awk -i inplace -v elem="$element" -v id="$code" -v value="$value" '
BEGIN{ FS=OFS="," } $1==id{ $elem=value } 1
' "$path"
With other awks:
awk -v elem="$element" -v id="$code" -v value="$value" '
BEGIN{ FS=OFS="," } $1==id{ $elem=value } 1
' "$path" > /usr/tmp/tmp$$ &&
mv /usr/tmp/tmp$$ "$path"
NOTES:
Always quote your shell variables unless you have an explicit reason not to and fully understand all of the implications and caveats.
If you're creating a tmp file, use "&&" before replacing your original with it so you don't zap your original file if the tmp file creation fails for any reason.
I fully support replacing Starbucks with Whatever in Glasgow - I'd like to think they wouldn't have let it open in the first place back in my day (1986 Glasgow Uni Comp Sci alum) :-).
awk is much easier than sed for processing specific variable fields, but it does not have in-place processing. Thus you might do the following:
#!/bin/bash
code=$1
element=$2
value=$3
echo "code is $code"
awk -F ',' -v elem=$element -v id=$code -v value=$value 'BEGIN{OFS=",";} /^'$code',/{$elem=value}1' mydb > /tmp/mydb.txt
mv /tmp/mydb.txt ./mydb
This finds a match for a line starting with code followed by a comma (you could also use ($1==code)), then sets the elemth field to value; finally it prints the output, using the comma as output field separator. If nothing matches, it just echoes the input line.
Everything is written to a temporary file, then overwrites the original.
Not very nice but it gets the job done.
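Since the question also asks about sed: sed -i can edit the file in place, but the field number has to be built into the regular expression, so it is only convenient for a fixed field. A sketch for the case where the user picked field 2 (GNU sed assumed, with $code, $value and $path as above):
# replace field 2 of the line whose first field is $code, editing $path in place
# (assumes $code and $value contain no regex or s/// special characters)
sed -i -E "s/^($code),[^,]*/\1,$value/" "$path"
For an arbitrary field number you would have to repeat a [^,]*, group (field minus 2) times when building the expression, which is exactly why the awk versions above are simpler.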

How to write the output to the same file using an awk command

awk '/^nameserver/ && !modif { printf("nameserver 127.0.0.1\n"); modif=1 } {print}' testfile.txt
It displays the output, but I want to write the output to the same file, in my example testfile.txt.
Not possible per se. You need a second temporary file because you can't read and overwrite the same file. Something like:
awk '(PROGRAM)' testfile.txt > testfile.tmp && mv testfile.tmp testfile.txt
The mktemp program is useful for generating unique temporary file names.
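For example, a sketch combining mktemp with the nameserver program from the question:
tmp=$(mktemp) &&
awk '/^nameserver/ && !modif { printf("nameserver 127.0.0.1\n"); modif=1 } {print}' testfile.txt > "$tmp" &&
mv "$tmp" testfile.txt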
There are some hacks for avoiding a temporary file, but they rely mostly on caching and read buffers and quickly get unstable for larger files.
Since GNU Awk 4.1.0, there is the "inplace" extension, so you can do:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep a backup copy of original files, try this:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3
This can be used to simulate the GNU sed -i feature.
See: Enabling In-Place File Editing
Despite the fact that using a temp file is correct, I don't like it because:
you have to be sure not to clobber another temp file (yes, you can use mktemp - it's a pretty useful tool)
you have to take care of deleting it (or moving it like thiton said) INCLUDING when your script crashes or stops before the end (so deleting temp files at the end of the script is not that wise)
it generates I/O on disk (OK, not that much, but we can make it lighter)
So my method for avoiding a temp file is simple:
my_output="$(awk '(PROGRAM)' source_file)"
echo "$my_output" > source_file
Note the use of double quotes both when grabbing the output from the awk command and when using echo (if you don't, you will lose the newlines).
Had to make an account when seeing 'awk' and 'not possible' in one sentence. Here is an awk-only solution without creating a temporary file:
awk '{a[b++]=$0} END {for(c=0;c<b;c++)print a[c]>ARGV[1]}' file
You can also use sponge from moreutils.
For example
awk '!a[$0]++' file|sponge file
removes duplicate lines and
awk '{$2=10*$2}1' file|sponge file
multiplies the second column by 10.
Try including a print statement in your awk script that redirects the output to a new file. Here total is a calculated value:
print $total, total >> "new_file"
This inline writing worked for me. Redirect the output from print back to the original file.
echo "1" > test.txt
awk '{$1++; print> "test.txt"}' test.txt
cat test.txt
#$> 2
