bash delete line condition

I couldn't find a solution to conditionally delete a line in a file using bash. The file contains years embedded in filename strings, and a line should be deleted only if its year is lower than a reference value.
The file looks like the following:
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_196001-196912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_197001-197912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_198001-198912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_199001-199912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_200001-200512.nc' 'MD5'
I want to get the year 1969 from line 1 and compare it to a reference (let's say 1980) and delete the whole line if the year is lower than the reference. This means in this case the code should remove the first two lines of the file.
I tried with sed and grep, but couldn't get it working.
Thanks in advance for any ideas.

You can use awk:
awk -F- '$4 > 198000 {print}' filename
This will output all the lines where the second date is later than 31/12/1979. This will not edit the file in place; you would have to save the output to another file and then move that in place of the original:
awk -F- '$4 > 198000 {print}' filename > tmp && mv tmp filename
Using sed (will edit in-place):
sed -i '/.*19[0-7][0-9]..\.nc/d' filename
This requires a little more thought, in that you will need to construct a regex to match any values which you don't want to be displayed.

Perhaps something like this:
awk -F- '{ if (substr($4,1,4) >= 1980) print }' input.txt
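If the reference year should come from a shell variable instead of being hard-coded, a variant along these lines should work (a sketch assuming the same file layout; the +0 forces a numeric comparison):
ref=1980
awk -F- -v ref="$ref" 'substr($4,1,4)+0 >= ref' input.txt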

Related

Adding part of filename as column to csv files, then concatenate

I have many csv files that look like this:
data/0.Raw/20190401_data.csv
(Only the date in the middle of the filename changes)
I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.
I wrote a bash script that adds the full path and filename as a column in each file, and then merges them into a master csv. However, I am having trouble getting rid of the path and the extension to only keep the date portion.
The bash script
#! /bin/bash
mkdir data/1.merged
for i in "data/0.Raw/"*.csv; do
awk -F, -v OFS=, 'NR==1{sub(/\_data.csv$/, "", FILENAME) } NR>1{ $1=FILENAME }1' "$i" |
column -t > "data/1.merged/"${i/"data/0.Raw/"/""}""
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files
rm data/1.merged/*.csv
mv data/1.merged/all_files data/1.merged/all_files.csv
using "sub" I was able to remove the "_data.csv" part, but as a result the column gets added as "data/0.Raw/20190401" - that is, I am having trouble removing both the part before the date as well as the part after the date.
I tried replacing sub with gensub to regex match everything except the 8 digits in the middle but that does not seem to work either.
Any ideas on how to solve this?
Thanks!
You can process and concatenate all the files with a single awk call:
awk '
    FNR == 1 {
        date = FILENAME
        gsub(/.*\/|_data\.csv$/, "", date)
        next
    }
    { print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv
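The question mentions trying gensub without success; in gawk (gensub is a gawk extension), the two lines that build date above could be replaced by a single capture-group call, roughly:
date = gensub(/^.*\/([0-9]+)_data\.csv$/, "\\1", 1, FILENAME)
This returns just the digits between the last / and _data.csv instead of editing a copy of FILENAME in place.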
However, I am having trouble getting rid of the path and the extension to only keep the date portion
Then take a look at the basename command:
basename NAME [SUFFIX]
    Print NAME with any leading directory components removed. If
    specified, also remove a trailing SUFFIX.
Example
basename 'data/0.Raw/20190401_data.csv' _data.csv
gives output
20190401
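For instance, the date extraction could be folded into the original loop roughly like this (a sketch reusing the question's directory layout and assuming data/1.merged already exists, as in the original script; the per-date output file name is illustrative, and the column -t step is left out):
for i in data/0.Raw/*_data.csv; do
    d=$(basename "$i" _data.csv)    # e.g. 20190401
    awk -F, -v OFS=, -v d="$d" 'NR>1{ $1=d } 1' "$i" > "data/1.merged/${d}.csv"
done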

AWK - Delete whole line when a piece inside that line matches a string

I have a db.sql full of lines, some of which contain the string _wc_session_
(26680, '_wc_session_expires_120f486fe21c9ae4ce247c04f3b009f9', '1445934089', 'no'),
(26682, '_wc_session_expires_73516b532380c28690a4437d20967e03', '1445934114', 'no'),
(26683, '_wc_session_1a71c566970b07ac2b48c5da4e0d43bf', 'a:21:{s:4:"cart";s:305:"a:1:{s:32:"7fe1f8abaad094e0b5cb1b01d712f708";a:9:{s:10:"product_id";i:459;s:12:"variation_id";s:0:"";s:9:"variation";a:0:{}s:8:"quantity";i:1;s:10:"line_total";d:6;s:8:"line_tax";i:0;s:13:"line_subtotal";i:6;s:17:"line_subtotal_tax";i:0;s:13:"line_tax_data";a:2:{s:5:"total";a:0:{}s:8:"subtotal";a:0:{}}}}";s:15:"applied_coupons";s:6:"a:0:{}";s:23:"coupon_discount_amounts";s:6:"a:0:{}";s:27:"coupon_discount_tax_amounts";s:6:"a:0:{}";s:21:"removed_cart_contents";s:6:"a:0:{}";s:19:"cart_contents_total";d:6;s:20:"cart_contents_weight";i:0;s:19:"cart_contents_count";i:1;s:5:"total";i:0;s:8:"subtotal";i:6;s:15:"subtotal_ex_tax";i:6;s:9:"tax_total";i:0;s:5:"taxes";s:6:"a:0:{}";s:14:"shipping_taxes";s:6:"a:0:{}";s:13:"discount_cart";i:0;s:17:"discount_cart_tax";i:0;s:14:"shipping_total";i:0;s:18:"shipping_tax_total";i:0;s:9:"fee_total";i:0;s:4:"fees";s:6:"a:0:{}";s:10:"wc_notices";s:205:"a:1:{s:7:"success";a:1:{i:0;s:166:"Ver carrito Se ha añadido "Incienso Gaudí Lavanda" con éxito a tu carrito.";}}";}', 'no'),
I'd like to use AWK to remove the whole line whenever _wc_session_ appears in it. I mean the whole line, like:
(26682, '_wc_session_expires_73516b532380c28690a4437d20967e03', '1445934114', 'no'),
So far I've found the right REGEX that selects the whole line when "_wc_session_" is found:
(^\(.*_wc_session_.*\)\,)
but when I try to run
awk '!(^\(.*_wc_session_.*\)\,)' db.sql > temp.sql
I get
awk: line 1: syntax error at or near ^
Am I missing something?
If you're set on awk:
awk '!/_wc_session/' db.sql
You may also use sed -i to write the output in place (i.e. back into the input file):
sed -i '/_wc_session/d' db.sql
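If you would rather keep a backup of the original dump, GNU sed accepts a suffix after -i (the .bak name is arbitrary):
sed -i.bak '/_wc_session_/d' db.sql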
Edit:
A more precise approach with awk would be to use the comma that is already in your file as the delimiter and only check column 2 for the respective pattern. This approach is useful in case the pattern could also appear in a different column and such a line should not be removed.
awk -F',' '$2 !~ "_wc_session" {print $0}' db.sql
With simple grep, the following may help and should do the trick.
grep -v "(26682, '_wc_session_expires_73516b532380c28690a4437d20967e03', '1445934114', 'no')" Input_file
EDIT: If you want to remove all lines that contain the string _wc_session_expires_, then the following may help.
grep -v "_wc_session_expires_" Input_file
The mistake was in the regex syntax: in awk the pattern has to be enclosed in slashes (otherwise the ^ is parsed as an operator). The right one is
'!/^\(.*_wc_session_.*\),/'

How to strip date in csv output using shell script?

I have a few csv extracts that I am trying to fix up the date on; they are as follows:
"Time Stamp","DBUID"
2016-11-25T08:28:33.000-8:00,"5tSSMImFjIkT0FpiO16LuA"
The first column is always the "Time Stamp", I would like to convert this so it only keeps the date "2016-11-25" and drops the "T08:28:33.000-8:00".
The end result would be..
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
There are plenty of files with different dates.
Is there a way to do this in ksh? Some kind of for each loop to loop through all the files and replace the long time-stamp and leave just the date?
Use sed:
$ sed '2,$s/T[^,]*//' file
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
How it works:
2,$         # Skip header (first line); removing this will make the
            # replacement on the first line as well.
s/T[^,]*//  # Replace everything between T (inclusive) and , (exclusive);
            # `[^,]*' matches everything but `,' zero or more times
Here's one solution using a standard AIX utility,
awk -F, -v OFS=, 'NR>1{sub(/T.*$/,"",$1)}1' file > file.cln && mv file.cln file
output
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
(but I no longer have access to an AIX environment, so this was only tested with my local awk).
NR>1 skips the header line, and the sub() is limited to only the first field (up to the first comma). The trailing 1 char is awk shorthand for {print $0}.
If your data layout changes and you get extra commas in your data, this may require fixing.
IHTH
Using sed:
sed -i "s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*,/\1-\2-\3,/" file.csv
Output:
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
-i  edit files in place
s   substitute
This is a perfect job for awk, but unlike the previous answer, I recommend using the substring function.
awk -F, -v OFS=, 'NR > 1{$1 = substr($1,1,10)} {print $0}' file.txt
Explanation
-F,: The -F flag sets the input field separator, in this case a comma
-v OFS=,: Sets the output field separator to a comma as well, so the rebuilt line keeps its commas once $1 is modified
NR > 1: Ignore the first row
$1: Refers to the first field
$1 = substr($1,1,10): Sets the first field to the first 10 characters of the field. In the example, this is the date portion
print $0: This will print the entire row
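The question also asks for a ksh loop over all the files; a minimal sketch using the sed from the first answer (assuming the extracts all match *.csv in the current directory):
for f in *.csv; do
    sed '2,$s/T[^,]*//' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
The temporary-file step avoids relying on sed -i, which some systems (AIX, for example) do not provide.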

Printing a sequence from a fasta file

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple, you have a line with the sequence name preceded by a '>' and then all the lines following until the next '>' are the sequence itself. For example:
>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG
The way I'm currently getting the sequence I need is to use grep with -A, so I'll do
grep -A 10 sequence_name filename.fa
and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.
It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?
Using the > as the record separator:
awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
Like this maybe:
awk '/>sequence1/{p++;print;next} /^>/{p=0} p' file
So, if the line starts with >sequence1, set a flag (p) to start printing, print this line and move to next. On subsequent lines, if the line starts with >, change p flag to stop printing. In general, print if the flag p is set.
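If the sequence name should not be hard-coded into the pattern, a variant passing it in with -v should also work (a sketch; note it compares the whole header line exactly):
awk -v name="sequence2" '$0 == ">" name {p=1; print; next} /^>/ {p=0} p' file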
Or, improving a little on your grep solution, use this to cut off the -A (after) context:
grep -A 999999 "sequence1" file | awk 'NR>1 && /^>/{exit} 1'
So, that prints up to 999999 lines after sequence1 and pipes them into awk. Awk then looks for a > at the start of any line after line 1, and exits if it finds one. Until then, the 1 causes awk to do its standard thing, which is print the current line.
Using sed only:
sed -n '/>sequence3/,/>/p' file | sed '${/>/d}'
$ perl -0076 -lane 'print join("\n",@F) if $F[0]=~/sequence2/' file
This question has excellent answers already. However, if you are dealing with FASTA records often, I would highly recommend Python's Biopython module. It has many options and makes life easier if you want to manipulate FASTA records. Here is how you can read and print the records:
from Bio import SeqIO
import textwrap

for seq_record in SeqIO.parse("input.fasta", "fasta"):
    print(f'>{seq_record.id}\n{seq_record.seq}')

# If you want to wrap the record into multiline FASTA format
# you can use the textwrap module
for seq_record in SeqIO.parse("input.fasta", "fasta"):
    dna_sequence = str(seq_record.seq)
    wrapped_dna_sequence = textwrap.fill(dna_sequence, width=8)
    print(f'>{seq_record.id}\n{wrapped_dna_sequence}')
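Since the original question is about printing one particular sequence, the same idea restricted to a single record looks roughly like this (the name "sequence2" is taken from the example above):
from Bio import SeqIO

for seq_record in SeqIO.parse("input.fasta", "fasta"):
    if seq_record.id == "sequence2":
        print(f'>{seq_record.id}\n{seq_record.seq}')
        break    # stop once the record has been found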

Remove a line from a csv file - bash, sed

I'm looking for a way to remove lines within multiple csv files, in bash using sed, awk or anything appropriate, where the line ends in 0.
So there are multiple csv files, their format is:
EXAMPLEfoo,60,6
EXAMPLEbar,30,10
EXAMPLElong,60,0
EXAMPLEcon,120,6
EXAMPLEdev,60,0
EXAMPLErandom,30,6
So the file will be amended to:
EXAMPLEfoo,60,6
EXAMPLEbar,30,10
EXAMPLEcon,120,6
EXAMPLErandom,30,6
A problem which I can see arising is distinguishing between double digits that end in zero and 0 itself.
So any ideas?
Using your file, something like this?
$ sed '/,0$/d' test.txt
EXAMPLEfoo,60,6
EXAMPLEbar,30,10
EXAMPLEcon,120,6
EXAMPLErandom,30,6
For this particular problem, sed is perfect, as the others have pointed out. However, awk is more flexible, i.e. you can filter on an arbitrary column:
awk -F, '$3!=0' test.csv
This will print the entire line if column 3 is not 0.
use sed to only remove lines ending with ",0":
sed '/,0$/d'
you can also use awk,
$ awk -F"," '$NF!=0' file
EXAMPLEfoo,60,6
EXAMPLEbar,30,10
EXAMPLEcon,120,6
EXAMPLErandom,30,6
This just says: check the last field for 0, and don't print the line if it's found.
sed '/,[ \t]*0$/d' file
I would tend to use sed, but there is an egrep (or: grep -E) solution too:
egrep -v ",0$" example.csv
