Delete lines in file that have a date older than x - bash

I can read an entire file into memory like so:
#!/bin/bash
filename='peptides.txt'
filelines=`cat $filename`
ten_days_ago="$(date)"
for line in $filelines ; do
date_of="$(echo "$line" | jq -r '.time')"
if [[ "$ten_days_ago" > "$date_of" ]]; then
# delete this line
fi
done
The problems are:
I may not want to read the whole file into memory
If I stream it line by line with bash, how can I store which line to delete from? I would delete lines 0 to x, where line x has a date equal to 10 days ago.
A binary search would be appropriate here - so maybe bash is not a good solution to this? I would need to find the number of lines in the file, divide by two and go to that line.

You can use binary search only if the file is sorted.
You do not need to read the whole file into memory; you can process it line by line:
while read line
do
....
done <$filename
And yes, I personally would not use shell scripting for this kind of problem, but that is of course a matter of taste.
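If the file is sorted by time, a single streaming pass is enough to find the first line to keep, and nothing has to be held in memory. The following is a minimal sketch, assuming (hypothetically) that each line starts with an ISO-8601 date such as 2024-01-05, so that plain string comparison orders dates correctly. The sample data, the file location and the cutoff value are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample data: one record per line, sorted by a leading ISO date.
workdir=$(mktemp -d)
file="$workdir/peptides.txt"
printf '%s\n' \
  '2024-01-01 alpha' \
  '2024-01-05 beta' \
  '2024-01-12 gamma' \
  '2024-01-20 delta' > "$file"

# Assumed fixed cutoff; with GNU date it could be: cutoff=$(date -d '10 days ago' +%F)
cutoff='2024-01-10'

# Stream the file once and stop at the first line that is new enough.
first_keep=$(awk -v cutoff="$cutoff" '$1 >= cutoff { print NR; exit }' "$file")

if [[ -n $first_keep ]]; then
  # Keep everything from that line on; write to a temp file, then replace.
  tail -n +"$first_keep" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
else
  # No line is new enough, so every line is older than the cutoff.
  : > "$file"
fi

cat "$file"
```

Only a line number is stored between the two steps, which is what the question asks for. A true binary search would need random access by byte offset and is probably not worth the effort in shell.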

You didn't show what the input file looks like, but judging by your jq call it's JSON data.
With that said, this is how I would do it:
today=$(date +%j)
tenDaysAgo=$(date --date="10 day ago" +%j)
#This is where you would create the data for peptides.txt
#20 spaces away there is a date stamp so it doesn't distract you
echo "Peptides stuff $today" >> peptides.txt
while read pepStuff; do
if [ "$pepStuff" == "$tenDaysAgo" ]; then
sed -i "/.*$pepStuff/d" peptides.txt
fi
done < <(awk '{print $3}' peptides.txt)

Related

Removing current line in bash during read

I'm using bash to read a file, and after doing an operation on a particular line, I need to delete that line from the input file.
Can you please suggest some way to do so using sed or any other way?
I've tried using the sed command like this:
#!/bin/sh
file=/Volumes/workplace/GeneratedRules.ion
while read line;do
printf "%s\n" "$line"
sed '1d' $file
done <$file
My aim in this program is to read one line and then delete it.
Input :-
AllIsWell
LetsHopeForBest
YouCanMakeIt
But the output I got is stranger than I expected.
output :-
AllIsWell
LetsHopeForBest
YouCanMakeIt
LetsHopeForBest
LetsHopeForBest
YouCanMakeIt
YouCanMakeIt
LetsHopeForBest
YouCanMakeIt
But I need the output to be:
AllIsWell
LetsHopeForBest
YouCanMakeIt
as I want to delete each line after reading it.
NOTE: I have simplified my problem here. The actual use case is:
I need to perform a bunch of operations on each line besides reading it. The input file is very long, and my operation sometimes fails partway through. So I want the lines I have already read to be deleted, so that if I start the process again it does not start from the beginning but from the point where it got stuck.
Please help.
You effectively said you want your process to be restartable. Depending upon how you define the successful completion of an iteration of your while loop, you should store a line number in a separate file, x, that indicates how many lines you have successfully processed. If the file doesn't exist, then assume you would start reading at line 1.
Otherwise, you would get the content of x into variable n and then you would start reading $file at line $n + 1.
How you start reading at a particular line depends on constraints we don't know yet.
One way you could do it is to use sed to put lines $n + 1 into a temporary file, remove $file and then move the temporary file to $file before your while loop begins.
There are other solutions but each one might not elegantly satisfy your constraints.
But you'll want to carefully consider what happens if some other process is modifying the content of $file and when it is changing the content. If it only changes the content before or after your bash script is running, then you're probably ok to continue down this path. Otherwise, you have a synchronization problem to solve.
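The progress-file idea can be sketched as follows. This is only an illustration under stated assumptions: the file names, the process_line function and the simulated failure are hypothetical, and the input file is assumed not to change between runs. Instead of rewriting the input, the sketch skips the lines already recorded as done by starting tail at line n + 1.

```shell
#!/bin/bash
# Hypothetical names: input file, progress file and process_line are illustrative.
workdir=$(mktemp -d)
file="$workdir/input.txt"
progress="$workdir/input.progress"
printf '%s\n' AllIsWell LetsHopeForBest YouCanMakeIt > "$file"

# Stand-in for the real work; fails on the line named in $fail_on to simulate a crash.
process_line() {
  [[ $1 != "$fail_on" ]] && printf 'processed %s\n' "$1"
}

run() {
  local n=0 line
  # Resume after the last line that was recorded as done.
  [[ -f $progress ]] && n=$(<"$progress")
  while IFS= read -r line; do
    process_line "$line" || return 1
    n=$((n + 1))
    printf '%s\n' "$n" > "$progress"   # record success only after the work is done
  done < <(tail -n +"$((n + 1))" "$file")
}

fail_on=LetsHopeForBest
run || echo "stopped after $(<"$progress") line(s)"   # first run fails on line 2
fail_on=
run && echo "finished at line $(<"$progress")"        # second run resumes at line 2
```

Because the input file is never modified while it is being read, the problems with editing a file mid-loop do not arise; the processed lines can be trimmed afterwards in one step if that is still wanted.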
As stated in comments, there are many issues with altering the file you are currently reading from. Don't do it.
You could just keep track of which lines you have dealt with in the first loop (with a counter) then use sed to delete those lines after your first loop has processed them.
A simple example:
cd /tmp
echo 'Line 1
Line 2
Line 3
Line 4
Line 5' >file
echo "file before:"
cat file
cnt=1
while IFS= read -r line || [[ -n $line ]]; do
printf "'%s' processed\n" "$line";
if [ "$cnt" -ge 3 ]; then
break
fi
let "cnt+=1"
done <file
sed -i '' "1,${cnt}d" file
echo "file after:"
cat file
Prints:
file before:
Line 1
Line 2
Line 3
Line 4
Line 5
'Line 1' processed
'Line 2' processed
'Line 3' processed
file after:
Line 4
Line 5
Another method is to use something like awk, ruby or perl to 'slurp' the file and then feed that slurped file line-by-line to your Bash while loop. The file can then be modified in your loop since the other process has already fully read and closed the file:
# Note: This is SLOWER and USES MORE MEMORY...
echo "file before:"
cat file
while IFS= read -r line; do
printf "'%s' processed\n" "$line";
sed -i '' "1d" file
done < <(awk -v cnt=3 'NR>cnt{next}
{arr[NR]=$0}
END { for(i=1;i<=cnt;i++) print arr[i] }' file)
echo "file after:"
cat file
# same output
Note:
Please make sure you brush up on how to use bash to read a stream line-by-line, for fewer surprises.
Read HERE and HERE for more.
sed needs the -i option to edit the file in place, and if you need a backup of the file, just add a suffix to the option. See man sed.
#!/bin/sh
file=/Volumes/workplace/GeneratedRules.ion
while read line;do
printf "%s\n" "$line"
sed -i '1d' $file
done <$file

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I want starts at the character position given by the value in the second column and covers the next 10 characters. I give an example below.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (just one very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr otherwise if you want to suppress informational logging).
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2.txt file1.txt
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
Thanks to @CharlesDuffy for improvements.
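For reference, the cut-based loop from the question can also be made to work. The sketch below is not taken from any of the answers; it assumes the sample files from the question and fixes three things in the original attempts: the end of the range is computed from the value that was actually read, the offset is shifted by one because cut counts characters from 1, and the output redirection is moved outside the loop so that each iteration no longer overwrites result.txt.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data copied from the question.
printf '%s\n' 'NAME1 10' 'NAME2 25' 'NAME3 48' 'NAME4 66' > file1.txt
printf '%s\n' 'GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC' > file2.txt

while read -r name index _; do
  # cut counts characters from 1, so the 0-based offset N starts at character N+1.
  cut -c"$((index + 1))-$((index + 10))" file2.txt
done < file1.txt > result.txt

cat result.txt
```

Like the head and tail version, this starts one process per range, so it is a fit for short lists rather than for large inputs.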

How to get line WITH tab character using tail and head

I have made a script to practice my Bash, only to realize that this script does not take tabulation into account, which is a problem since it is designed to find and replace a pattern in a Python script (which obviously needs tabulation to work).
Here is my code. Is there a simple way to get around this problem ?
pressure=1
nline=$(cat /myfile.py | wc -l) # find the line length of the file
echo $nline
for ((c=0;c<=${nline};c++))
do
res=$( tail -n $(($(($nline+1))-$c)) myfile.py | head -n 1 | awk 'gsub("="," ",$1){print $1}' | awk '{print$1}')
#echo $res
if [ $res == 'pressure_run' ]
then
echo "pressure_run='${pressure}'" >> myfile_mod.py
else
echo $( tail -n $(($nline-$c)) myfile.py | head -n 1) >> myfile_mod.py
fi
done
Basically, it finds the line that has pressure_run=something and replaces it by pressure_run=$pressure. The rest of the file should be untouched. But in this case, all tabulation is deleted.
If you want to just do the replacement as quickly as possible, sed is the way to go as pointed out in shellter's comment:
sed "s/\(pressure_run=\).*/\1$pressure/" myfile.py
For Bash training, as you say, you may want to loop manually over your file. A few remarks for your current version:
Is /myfile.py really in the root directory? Later, you don't refer to it at that location.
cat ... | wc -l is a useless use of cat and better written as wc -l < myfile.py.
Your for loop is executed one more time than you have lines.
To get the next line, you do "show me all lines, but counting from the back, don't show me c lines, and then show me the first line of these". There must be a simpler way, right?
To get the left-hand side of an assignment, you say "in the first space-separated field, replace = with a space, then show me the first space-separated field of the result". There must be a simpler way, right? This is, by the way, where you strip out the leading tabs (your first awk command does it).
To print the unchanged line, you do the same complicated thing as before.
A band-aid solution
A minimal change that would get you the result you want would be to modify the awk command: instead of
awk 'gsub("="," ",$1){print $1}' | awk '{print$1}'
you could use
awk -F '=' '{ print $1 }'
"Fields are separated by =; give me the first one". This preserves leading tabs.
The replacements have to be adjusted a little bit as well; you now want to match something that ends in pressure_run:
if [[ $res == *pressure_run ]]
I've used the more flexible [[ ]] instead of [ ] and added a * to pressure_run (which must not be quoted): "if $res ends in pressure_run, then..."
The replacement has to use $res, which has the proper amount of tabs:
echo "$res='${pressure}'" >> myfile_mod.py
Instead of appending each line each loop (and opening the file each time), you could just redirect output of your whole loop with done > myfile_mod.py.
Because the single quotes sit inside double quotes, they are printed literally and do not stop the expansion, so this writes a line such as pressure_run='1' (the value wrapped in quotes), as in your version. If you want the bare value of $pressure, remove the single quotes (the braces aren't needed here, but don't hurt):
echo "$res=$pressure" >> myfile_mod.py
This fixes your example, but it should be pointed out that enumerating lines and then getting one at a time with tail | head is a really bad idea. You traverse the file twice for every single line, and the approach is very error prone and hard to read. (Thanks to tripleee for suggesting to mention this more clearly.)
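To see where the tabs go, the default awk field splitting can be compared with -F '=' on a single indented line. This is a small sketch; the sample line and the variable names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical input line with two leading tabs.
line=$'\t\tpressure_run=5'

# Default field splitting: leading whitespace is discarded before $1 is set.
default_fs=$(printf '%s\n' "$line" | awk '{ print $1 }')

# Splitting on "=" only: the first field keeps its leading tabs.
equals_fs=$(printf '%s\n' "$line" | awk -F '=' '{ print $1 }')

# Brackets make the preserved whitespace visible.
printf '[%s]\n[%s]\n' "$default_fs" "$equals_fs"
```

The first bracketed value has no indentation left, while the second one still starts with the two tabs.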
A proper solution
This all being said, there are preferred ways of doing what you did. You essentially loop over a file, and if a line matches pressure_run=, you want to replace what's on the right-hand side with $pressure (or the value of that variable). Here is how I would do it:
#!/bin/bash
pressure=1
# Regular expression to match lines we want to change
re='^[[:space:]]*pressure_run='
# Read lines from myfile.py
while IFS= read -r line; do
# If the line matches the regular expression
if [[ $line =~ $re ]]; then
# Print what we matched (with whitespace!), then the value of $pressure
line="${BASH_REMATCH[0]}"$pressure
fi
# Print the (potentially modified) line
echo "$line"
# Read from myfile.py, write to myfile_mod.py
done < myfile.py > myfile_mod.py
For a test file that looks like
blah
test
pressure_run=no_tab
blah
something
	pressure_run=one_tab
		pressure_run=two_tabs
the result is
blah
test
pressure_run=1
blah
something
	pressure_run=1
		pressure_run=1
Recommended reading
How to read a file line-by-line (explains the IFS= and -r business, which is quite essential to preserve whitespace)
BashGuide

Delete lines in file over an hour old using timestamps bash

Having a bit of bother trying to get the following to work.
I have a file containing hostname:timestamp as below:
hostname1:1445072150
hostname2:1445076364
I am trying to create a bash script that will query this file (using a cron job) to check if the timestamp is over 1 hour old and if so, remove the line.
Below is what I have so far but it doesn't appear to be removing the line in the file.
#!/bin/bash
hosts=/tmp/hosts
current_timestamp=$(date +%s)
while read line; do
hostname=`echo $line | sed -e 's/:.*//g'`
timestamp=`echo $line | cut -d ":" -f 2`
diff=$(($current_timestamp-$timestamp))
if [ $diff -ge 3600 ]; then
echo "$hostname - Timestamp over an hour old. Deleting line."
sed -i '/$hostname/d' $hosts
fi
done <$hosts
I have managed to get the timestamp part working correctly in identifying hosts that are over an hour old but having trouble removing the time from the file.
I suspect it may be due to the while loop keeping the file open but not 100% sure how to work around it. Also tried making a copy of the file and editing that but still nothing.
ALTERNATIVELY: If there is a better way to get this to work and produce the same result, I am open to suggestions :)
Any help would be much appreciated.
Cheers
The problem in your script was just this line:
sed -i '/$hostname/d' $hosts
Variables inside single-quotes are not expanded to their values,
so the command looks for lines containing the literal text "$hostname" instead of the variable's value. If you replace the single-quotes with double-quotes,
the variable will get expanded to its value, which is what you need here:
sed -i "/$hostname/d" $hosts
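The effect of the quoting can be checked without touching the real file. The sketch below uses a temporary copy of the sample data from the question and plain sed (no -i), and counts how many lines survive with each quoting style; the variable names single and double are illustrative.

```shell
#!/bin/bash
workdir=$(mktemp -d)
hosts="$workdir/hosts"
printf '%s\n' 'hostname1:1445072150' 'hostname2:1445076364' > "$hosts"
hostname=hostname1

# Single quotes: sed receives the literal text /$hostname/d, which matches no line here.
single=$(sed '/$hostname/d' "$hosts" | wc -l)

# Double quotes: the shell expands the variable first, so sed receives /hostname1/d.
double=$(sed "/$hostname/d" "$hosts" | wc -l)

echo "lines left with single quotes: $((single))"
echo "lines left with double quotes: $((double))"
```

With single quotes both lines remain; with double quotes only the hostname2 line is left.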
There are improvements possible:
#!/bin/bash
hosts=/tmp/hosts
current_timestamp=$(date +%s)
while read line; do
set -- ${line/:/ }
hostname=$1
timestamp=$2
((diff = current_timestamp - timestamp))
if ((diff >= 3600)); then
echo "$hostname - Timestamp over an hour old. Deleting line."
sed -i "/^$hostname:/d" $hosts
fi
done <$hosts
The improvements:
More strict pattern in the sed command, to make it more robust and to avoid some potential errors
Simpler way to extract hostname part and timestamp part without any sub-shells
Simpler arithmetic operations by enclosing within ((...))
You ask for alternatives — use awk:
awk -F: -v ts="$(date +%s)" '$2 <= ts-3600 { next } { print }' "$hosts" > "$hosts.$$"
mv "$hosts.$$" "$hosts"
The ts=$(date +%s) sets the awk variable ts to the value from date. The script skips any lines where the value in the second column (after the first colon) is smaller than the threshold. You could do the subtraction once in a BEGIN block if you wanted to. Decide whether <= or < is correct for your purposes.
If you need to know which lines are deleted, you can add
printf "Deleting %s - timestamp %d older than %d\n", $1, $2, (ts-3600) >/dev/stderr;
before the next to print the information on standard error. If you must write that to standard output, then you need to arrange for retained lines to be written to a file with print > file as an alternative action after the filter condition (passing -v file="$hosts.$$" as another pair of arguments to awk). The tweaks that can be made are endless.
If the file is of any significant size, it will be quicker to copy the relevant subsection of the file once to a temporary file and then to the final file than to edit the file in place multiple times as in the original code. If the file is small enough, there isn't a problem.
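Put together, a single-pass version with the threshold computed once in a BEGIN block and the deletions reported on standard error might look like the sketch below. It uses a temporary copy of the sample data and a fixed, made-up value for the current time so that the result is reproducible; in real use ts would come from date +%s.

```shell
#!/bin/bash
workdir=$(mktemp -d)
hosts="$workdir/hosts"
printf '%s\n' 'hostname1:1445072150' 'hostname2:1445076364' > "$hosts"

# Assumed fixed "current time" for the example; normally: now=$(date +%s)
now=1445078000

awk -F: -v ts="$now" '
  BEGIN { cutoff = ts - 3600 }       # compute the threshold once
  $2 <= cutoff {                     # old entry: report on stderr and drop it
    printf "Deleting %s - timestamp %d older than %d\n", $1, $2, cutoff > "/dev/stderr"
    next
  }
  { print }                          # recent entry: keep it
' "$hosts" > "$hosts.$$" && mv "$hosts.$$" "$hosts"

cat "$hosts"
```

The file is read once and replaced once, however many lines are dropped.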

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files. I was asking myself what the best option would be (in terms of performance).
There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first open the whole file and then fetch row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
# files up to a hundred million lines (if you're on a slow machine, decrease!)
for (( j=1; j<=100000000; j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q}' file
.002
read line < file && echo $line
0
Results are from a file with 1,000,000 lines.
So the times for sed -n 1p will grow linearly with the length of the file, but the timing for the other variations will be constant (and negligible) as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a shell builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file again, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example. Or anyway, several days.
For example, linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So - pick one file, read it with a command. Now it is cached. Run the same test command several dozen times, this is sampling the effect of the command and child process creation, not your I/O hardware.
This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.
Test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeats 100 times, each time selecting the next line. So it selects line 10000000,10000001,10000002, ... and so on till 10000099
$ wc -l english
36374448 english
$ time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$ time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$ wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$ time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$ time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid passing by cat. I didn't measure it, but head should be faster on larger files as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them - unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
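The variants discussed in this thread can be wrapped in small functions to confirm that they return the same line before timing them. This is a sketch only: the function names are hypothetical and the test file is generated with seq purely for illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)
file="$workdir/numbers.txt"
seq 1 1000 > "$file"

# Print line $1 of file $2 with sed, quitting right after the match.
line_with_sed()  { sed -n "${1}{p;q;}" "$2"; }

# Print line $1 of file $2 with head and tail.
line_with_head() { head -n "$1" "$2" | tail -n 1; }

# Print line $1 of file $2 using only the read builtin.
line_with_read() {
  local n=$1 line i
  for (( i = 1; i <= n; i++ )); do
    IFS= read -r line || return 1   # fail if the file has fewer than n lines
  done < "$2"
  printf '%s\n' "$line"
}

line_with_sed 20 "$file"
line_with_head 20 "$file"
line_with_read 20 "$file"
```

One difference to keep in mind: if the file has fewer lines than requested, the head and tail pipeline prints the last line of the file, whereas the sed version prints nothing.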
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
echo "$LINE"
done < your_input.txt
is much, much faster than any other (Bash based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads, whereas the while loop just maintains a position cursor (based on IFS) so would only do 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line from each time) versus ~0-1 seconds (while...read based, reading through the file once)
The above example also shows how to set IFS in a better way to newline (with thanks to Peter from comments below), and this will hopefully fix some of the other issue seen when using while... read ... in Bash at times.
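When several specific lines are needed, the same single pass can collect them all instead of rescanning the file once per line. A minimal sketch, assuming the wanted line numbers are known in advance and listed in ascending order; the array names and the sample file are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
file="$workdir/input.txt"
printf 'line %s\n' 1 2 3 4 5 6 > "$file"

wanted=(2 5)   # hypothetical line numbers to extract, in ascending order
next=0         # index of the next wanted line number
n=0            # current line number
found=()

while IFS= read -r line; do
  n=$((n + 1))
  if (( n == wanted[next] )); then
    found+=("$line")
    next=$((next + 1))
    # Stop early once every wanted line has been collected.
    (( next == ${#wanted[@]} )) && break
  fi
done < "$file"

printf '%s\n' "${found[@]}"
```

The file is read at most once, and reading stops at the last wanted line rather than at the end of the file.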
For the sake of completeness you can also use the basic linux command cut:
cut -d $'\n' -f <linenumber> <filename>
