Getting one line in a huge file with bash

How can I get a particular line from a 3 GB text file? All the lines have:
the same length, and
are delimited by \n.
And I need to be able to get any line on demand. How can this be done? Only one line needs to be returned.

If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.
Let the block size be the length of each line (including the newline), then you can do:
$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null
The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.
That should be much more efficient than streaming all the preceding lines through a program only to throw them away, as dd will seek straight to the position you want in the file and read only one line of data.
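For example, with hypothetical numbers (80-byte lines, i.e. 79 characters plus the newline, and line 11723 wanted), that becomes:
# skip the 11722 lines before the one we want, then read exactly one block
dd if=filename bs=80 skip=11722 count=1 2>/dev/null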

head -10 file | tail -1 returns line 10; probably slow for a big file, though.
from here
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files

An awk alternative, where 3 is the line number.
awk 'NR == 3 {print; exit}' file.txt

If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:
head -n N filespec | tail -1
where N is the line number you want.
This isn't going to be the best-performing piece of code for a 3 GB file, unfortunately, but there are ways to make it better.
If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed length records.
So the file:
0000000000
0000000017
0000000092
0000001023
would give you a fast way to locate each line. Just multiply the desired line number minus one by the index record size and seek there in the index file.
Then use the value at that location to seek in the main file so you can read until the next newline character.
So for line 3, you would seek to 22 in the index file (the index record length is 11: 10 characters plus one more for the newline, so record 3 starts at offset (3 - 1) * 11 = 22). Reading the value there, 0000000092, would give you the offset to use into the main file.
Of course, that's not so useful if the file changes frequently although, if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
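As a rough bash sketch of that lookup (the names index.txt and main.txt are placeholders, and the index uses the 11-byte records described above):
line_no=3
# read the 11-byte offset record for the requested line from the index
offset=$(dd if=index.txt bs=11 skip=$((line_no - 1)) count=1 2>/dev/null)
# seek to that byte offset in the main file and read up to the next newline;
# the 10# prefix forces base 10 so leading zeros are not parsed as octal
tail -c +$((10#$offset + 1)) main.txt | head -n 1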
And, based on your update:
Update: If it matters, all the lines have the same length.
With that extra piece of information, you don't need the index - you can just seek immediately to the right location in the main file by multiplying the record length by the record number (assuming the values fit into your data types).
So something like the following (the pseudo-code turned into a working Python function):
def getline(fhandle, reclen, recnum):
    # seek to position reclen * recnum in the file (recnum is zero-based)
    fhandle.seek(reclen * recnum)
    # read one full record (reclen characters, newline included) and return it
    return fhandle.read(reclen)

Use q with sed to make the search stop after the line has been printed.
sed -n '11723{p;q}' filename
Python (minimal error checking):
#!/usr/bin/env python
import sys

# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash
# seeks the requested line in a file with a fixed line length

# Usage: ./lineseek.py LINE FILE
# Example: ./lineseek.py 11723 data.txt

EXIT_SUCCESS = 0
EXIT_NOT_FOUND = 1
EXIT_OPT_ERR = 2
EXIT_FILE_ERR = 3
EXIT_DATA_ERR = 4

# could use a try block here
seekline = int(sys.argv[1])
filename = sys.argv[2]

try:
    if filename == '-':
        # note: seeking only works on real files, not pipes
        handle = sys.stdin
    else:
        handle = open(filename, 'r')
except IOError:
    print("File Open Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# measure the line length (newline included) from the first line
try:
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# it would be really weird if this happened
if lineend != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

# seek straight to the start of the requested line and read it
handle.seek(linelen * (seekline - 1))
try:
    line = handle.readline()
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

if len(line) != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

# the line already ends in a newline, so don't add another
print(line, end='')
Argument validation should be a lot better and there is room for many other improvements.

A quick Perl one-liner would work well for this too (a bare number in Perl's .. flip-flop operator is compared against the current line number $., so replace YOURLINENUMBER with the line you want):
$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file

Script using sed and grep gives unintended output

I have a "source.fasta" file that contains information in the following format:
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG
>TRINITY_DN83_c0_g1_i1 len=371 path=[1:0-173 152:174-370] [-1, 1, 152, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTATCGAGTAAGTTTACAGTCGTAAATAAAGAATCTGCCTTGAACAAACCTTATTCCTTTTTATTCTAAAAGAGGCCTTTGCGTAGTAGTAACATAGTACAAATTGGTTTATTTAACGATTTATAAACGATATCCTTCCTACAGTCGGGTGAAAAGAAAGTATTCGAAATTAGAATGGTTCCTCATATTACACGTTGCTG
>TRINITY_DN83_c0_g1_i2 len=218 path=[1:0-173 741:174-217] [-1, 1, 741, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTACCGAGTAAGTTTCCAGTCGTAAATAAAGAATCTGCCAGATCGGA
>TRINITY_DN99_c0_g1_i1 len=326 path=[1:0-242 221:243-243 652:244-267 246:268-325] [-1, 1, 221, 652, 246, -2]
ATCGGTACTATCATGTCATATATCTAGAAATAATACCTACGAATGTTATAAGAATTTCATAACATGATATAACGATCATACATCATGGCCTTTCGAAGAAAATGGCGCATTTACGTTTAATAATTCCGCGAAAGTCAAGGCAAATACAGACCTAATGCGAAATTGAAAAGAAAATCCGAATATCAGAAACAGAACCCAGAACCAATATGCTCAGCAGTTGCTTTGTAGCCAATAAACTCAACTAGAAATTGCTTATCTTTTATGTAACGCCATAAAACGTAATACCGATAACAGACTAAGCACACATATGTAAATTACCTGCTAAA
>TRINITY_DN90_c0_g1_i1 len=1240 path=[1970:0-527 753:528-1239] [-1, 1970, 753, -2]
GTCGATACTAGACAACGAATAATTGTGTCTATTTTTAAAAATAATTCCTTTTGTAAGCAGATTTTTTTTTTCATGCATGTTTCGAGTAAATTGGATTACGCATTCCACGTAACATCGTAAATGTAACCACATTGTTGTAACATACGGTATTTTTTCTGACAACGGACTCGATTGTAAGCAACTTTGTAACATTATAATCCTATGAGTATGACATTCTTAATAATAGCAACAGGGATAAAAATAAAACTACATTGTTTCATTCAACTCGTAAGTGTTTATTTAAAATTATTATTAAACACTATTGTAATAAAGTTTATATTCCTTTGTCAGTGGTAGACACATAAACAGTTTTCGAGTTCACTGTCG
>TRINITY_DN84_c0_g1_i1 len=301 path=[1:0-220 358:221-300] [-1, 1, 358, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCACTTTTAAATCTTATAAACATGTGATCCGTCTGCTCATTTGGACGTTACTGCCCAAAGTTGGTACATGTTTCGTACTCACG
>TRINITY_DN84_c0_g1_i2 len=301 path=[1:0-220 199:221-300] [-1, 1, 199, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCAGCTAACCAAAGATAAGTGTCTTGGCTTGGTATCTACAGATCTCTTTTCGTAATTTCGTGAGTACGAAACATGTACCAACT
>TRINITY_DN72_c0_g1_i1 len=434 path=[412:0-247 847:248-271 661:272-433] [-1, 412, 847, 661, -2]
GTTAATTTAGTGGGAAGTATGTGTTAAAATTAGTAAATTAGGTGTTGGTGTGTTTTTAATATGAATCCGGAAGTGTTTTGTTAGGTTACAAGGGTACGGAATTGTAATAATAGAAATCGGTATCCTTGAGACCAATGTTATCGCATTCGATGCAAGAATAGATTGGGAAATAGTCCGGTTATCAATTACTTAAAGATTTCTATCTTGAAAACTATTTCTAATTGGTAAAAAAACTTATTTAGAATCACCCATAGTTGGAAGTTTAAGATTTGAGACATCTTAAATTTTTGGTAGGTAATTTTAAGATTCTATCGTAGTTAGTACCTTTCGTTCTTCTTATTTTATTTGTAAAATATATTACATTTAGTACGAGTATTGTATTTCCAATATTCAGTCTAATTAGAATTGCAAAATTACTGAACACTCAATCATAA
>TRINITY_DN75_c0_g1_i1 len=478 path=[456:0-477] [-1, 456, -2]
CGAGCACATCAGGCCAGGGTTCCCCAAGTGCTCGAGTTTCGTAACCAAACAACCATCTTCTGGTCCGACCACCAGTCACATGATCAGCTGTGGCGCTCAGTATACGAGCACAGATTGCAACAGCCACCAAATGAGAGAGGAAAGTCATCCACATTGCCATGAAATCTGCGAAAGAGCGTAAATTGCGAGTAGCATGACCGCAGGTACGGCGCAGTAGCTGGAGTTGGCAGCGGCTAGGGGTGCCAGGAGGAGTGCTCCAAGGGTCCATCGTGCTCCACATGCCTCCCCGCCGCTGAACGCGCTCAGAGCCTTGCTCATCTTGCTACGCTCGCTCCGTTCAGTCATCTTCGTGTCTCATCGTCGCAGCGCGTAGTATTTACG
There are close to 400,000 sequences in this file.
I have another file ids.txt in the following format:
>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
>TRINITY_DN15434_c5_g3_i1
>TRINITY_DN9323_c8_g3_i5
>TRINITY_DN11957_c1_g7_i1
>TRINITY_DN15373_c1_g1_i1
>TRINITY_DN22913_c0_g1_i1
>TRINITY_DN13029_c4_g5_i1
I have 100 sequence ids in this file. When I match these ids against the source file, I want output that gives, for each id, the id together with its entire sequence.
For example, for an id:
>TRINITY_DN80_c0_g1_i1
I want my output to be:
>TRINITY_DN80_c0_g1_i1
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG
I want all hundred sequences in this format.
I used this code:
while read p; do
echo ''$p >> out.fasta
grep -A 400000 -w $p source.fasta | sed -n -e '1,/>/ {/>/ !{'p''}} >> out.fasta
done < ids.txt
But my output is different: only the last id has a sequence, and the rest don't have any sequence associated:
>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
....
>TRINITY_DN10309_c6_g3_i1
>TRINITY_DN6990_c0_g1_i1
TTTTTTTTTTTTTGTGGAAAAACATTGATTTTATTGAATTGTAAACTTAAAATTAGATTGGCTGCACATCTTAGATTTTGTTGAAAGCAGCAATATCAACAGACTGGACGAAGTCTTCGAATTCCTGGATTTTTTCAGTCAAGAGATCAACAGACACTTTGTCGTCTTCAATGACACACATGATCTGCAGTTTGTTGATACCATATCCAACAGGTACAAGTTTGGAAGCTCCCCAGAGGAGACCATCCATTTCGATGGTGCGGACCTGGTTTTCCATTTCTTTCATGTCTGTTTCATCATCCCATGGCTTGACGTCAAGGATTATAGATGATTTAGCAATGAGAGCAGGTTTCTTCGATTTTTTGTCAGCATAAGCTTTCAGACGTTCTTCACGAATTCTGGCGGCCTCTGCATCCTCTTCCTCGTCGCCAGATCCGAATAGGTCGACGTCATCATCGTCGTCATCCTTAGCAGCGGGTGCAGGTGCTGTGGTGGTCTTTCCGCCAGCGGTCAGAGGGCTAGCTCCAGCCGCCCAGGATTTGCGCTCCTCGGCATTGTAGGAGGCAATCTGGTTGTACCACCGGAGAGCGTGGGGCAAGCTTGCGCTCGGGGCCTTGCCGACTTGTTGGAACACTTGGAAATCGGCTTGAGTTGGTGTGTAACCTGACACATAACTCTTATCAGCTAAGAAATTGTTAAGCTCATTAAGGCCTTGTGCGGTTTTAACGTCTCCTACTGCCATTTTTATTTAAAAAAGTAGTTTTTTTCGAGTAATAGCCACACGCCCCGGCACAATGTGAGCAAGAAGGAATGAAAAAGAAATCTGACATTGACATTGCCATGAAATTGACTTTCAAAGAACGAATGAATTGAACTAATTTGAACGG
I am only getting the desired output for the 100th id from my ids.txt. Could someone help me figure out where my script is wrong? I would like to get all 100 sequences when I run the script.
Thank you
I have added Google Drive links to the files I am working with:
ids.txt
Source.fasta
Repeatedly looping over a large file is inefficient; you really want to avoid running grep (or sed or awk) more than once if you can. Generally speaking, sed and Awk easily let you specify actions for individual lines in a file, and then you run the script over the file just once.
For this particular problem, the standard Awk idiom with NR==FNR comes in handy. This is a mechanism which lets you read a number of keys into memory (concretely, when NR==FNR it means that you are processing the first input file, because the overall input line number is equal to the line number within this file) and then check if they are present in subsequent input files.
Recall that Awk reads one line at a time and performs all the actions whose conditions match. The conditions are a simple boolean, and the actions are a set of Awk commands within a pair of braces.
awk 'NR == FNR { s[$0]; next }
     # If we fall through to here, we have finished processing the first file.
     # If we see a wedge and p is 1, reset it -- this is a new sequence.
     /^>/ && p { p = 0 }
     # If the prefix of this line is in s, we have found a sequence we want.
     ($1$2 in s) || ($1 in s) || ((substr($1, 1, 1) " " substr($1, 2)) in s) {
         if ($1 ~ /^>./) { print $1 } else { print $1 $2 }
         p = 1; next }
     # If p is true, we want to print this line.
     p' ids.txt source.fasta >out.fasta
So when we are reading ids.txt, the condition NR==FNR is true, and so we simply store each line in the array s. The next causes the rest of the Awk script to be skipped for this line.
On subsequent reads, when NR!=FNR, we use the variable p to control what to print. When we see a new sequence header, we first set p to 0 (in case it was 1 from a previous sequence), then check whether the header is in s, and if so, set p to 1. The last line simply prints the current line if p is not empty or zero. (An empty action is shorthand for the action { print }.)
The condition that checks whether $1 is in s is more complex than it strictly needs to be -- I put in some normalizations to make sure that a space between the > and the sequence identifier is tolerated, regardless of whether there was one in ids.txt or not. This can probably be simplified if your files are consistently formatted.
Only with GNU grep and sed:
grep -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'
See: man grep
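Note that with -A 1, GNU grep prints a -- separator line between groups of matches. If those get in the way, recent GNU grep versions have a flag to suppress them (a variant sketch, assuming your grep supports it):
grep --no-group-separator -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'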
$ awk 'NR==FNR{a[$1];next} $1 in a{c=2} c&&c--' ids.txt source.fasta
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG
The above was run using your posted source.fasta and this ids.txt:
$ cat ids.txt
>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN80_c0_g1_i1
First, group all ids into one expression separated by | like this:
cat ids.txt | tr '\n' '|' | awk '{print "\"" $0 "\""}'
Remove the last | symbol from the expression.
Now you can grep using the output you got from the previous command, like this:
grep -E ">TRINITY_DN14840_c10_g1_i1|>TRINITY_DN8506_c0_g1_i1|>TRINITY_DN12276_c0_g2_i1|>TRINITY_DN15434_c5_g3_i1|>TRINITY_DN9323_c8_g3_i5|>TRINITY_DN11957_c1_g7_i1|>TRINITY_DN15373_c1_g1_i1|>TRINITY_DN22913_c0_g1_i1|>TRINITY_DN13029_c4_g5_i1" source.fasta
This will print only the matching lines
Editing as per tripleee's comments: the following prints the output properly, assuming the ID and sequence are on different lines.
tr '\n' '|' <ids.txt | sed 's/|$//' | grep -A 1 -E -f - source.fasta
This might work for you (GNU sed):
sed 's#.*#/^&/{s/ .*//;N;p}#' idFile | sed -nf - fastafile
Convert the idFile into a sed script and run it against the fastaFile.
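For illustration, the first sed turns each id line into a sed command of this shape (shown here for the first id in the question's ids.txt):
/^>TRINITY_DN14840_c10_g1_i1/{s/ .*//;N;p}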
The best way to do this is using either Python or Perl. I was able to make a script for extracting the ids using Python, as follows.
#script to extract sequences from a source file based on ids in another file
#the source is a fasta file with a header and its sequence following on one line
#the ids file contains one id per line
#both the id and source file should contain the character '>' at the beginning that signifies an id
def main():
    #ask the user for the ids file
    file1 = input('ids file: ')
    #open the ids file
    ids_file = open(file1, 'r')
    #ask the user for the fasta file
    file2 = input('fasta file: ')
    #open the fasta file; the whole file is read into memory, so it must fit
    fasta_file = open(file2, 'r')
    #ask the user for the name of the output file
    file3 = input('enter the output filename: ')
    #open the output file for writing
    output_file = open(file3, 'w')
    #split the ids into a list
    ids_lines = ids_file.read().splitlines()
    #split the fasta file into a list; each id is followed by its sequence
    fasta_lines = fasta_file.read().splitlines()
    #initialize the loop counters
    i = 0
    j = 0
    #loop until either the ids or the fasta lines run out
    while j < len(fasta_lines) and i < len(ids_lines):
        #match ids from both files and copy the matching sequences
        if ids_lines[i] == fasta_lines[j]:
            #output statements, including newline characters
            output_file.write(fasta_lines[j])
            output_file.write('\n')
            output_file.write(fasta_lines[j+1])
            output_file.write('\n')
            #increment i so that we go on to the next id
            i = i + 1
            #reset j so we start all over for the new id
            j = 0
        else:
            #no match: skip this header and the sequence in the middle, which is j+1
            j = j + 2
    ids_file.close()
    fasta_file.close()
    output_file.close()

main()
The code is not perfect but works for any number of ids. I have tested it on my samples, one of which contained 5000 ids, and the program worked fine. If there are improvements to make to the code, please do so; I am a relative newbie at programming, so the code is a bit crude.

sorting numerically by first row

I have a file with almost 900 lines, created in Excel and saved as a tab-delimited .txt file. I'd like to sort the text file by the numbers given in the first column (they range between 0 and 2250). The other columns contain both numbers and letters of varying length, e.g.:
myfile.txt:
0251 abcd 1234,24 bcde
2240 efgh 2345,98 ikgpppm
0001 lkjsi 879,09 ikol
I've tried
sort -k1 -n myfile.txt > myfile_num.txt
but I just get an identical file with a new name. I'd like to get:
myfile_num.txt
0001 lkjsi 879,09 ikol
0251 abcd 1234,24 bcde
2240 efgh 2345,98 ikgpppm
What am I doing wrong? I'm guessing that it's quite simple, but I'd appreciate any help I can get! I only know a little bash scripting, so it'd be nice if the script were a very simple one-liner that I can understand :)
Thanks :)
Use this to convert old Mac OS carriage return to newline:
tr '\r' '\n' < myfile.txt | sort
As stated here, you can have problems with this (and in the other pseudo-follow-up duplicate question you asked, yes, you did):
tr '\r' '\n' < myfile.txt | sort -n
It works fine here on MSYS, but on some platforms you may have to add:
export LC_CTYPE=C
or tr will consider the file a text file, and will probably tag it as corrupt after reaching the maximum line length limit.
Obviously I could not test it, but I'm confident it will solve the problem given what I read on the linked answer.
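Putting the two pieces together (the output file name follows the question's example; set LC_CTYPE only if your platform needs it):
export LC_CTYPE=C
tr '\r' '\n' < myfile.txt | sort -n > myfile_num.txt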
A Python approach (Python 2 & 3 compatible), immune to all shell problems; it works well and is portable. I noticed that the input file has some 0x8C chars (exotic dots), which probably confuse the tr command.
That is handled properly below:
import csv, sys

# read the file as binary, as it is not really text
with open("Proteins.txt", "rb") as f:
    data = bytearray(f.read())

# replace exotic chars (such as 0x8C) by classical dots
for i, c in enumerate(data):
    if c > 0x7F:  # non-ascii: replace by dot
        data[i] = ord(".")

# convert to one ASCII string, then split using the old Mac separator
lines = "".join(map(chr, data)).split("\r")

# treat our lines as input for the CSV reader
cr = csv.reader(lines, delimiter='\t', quotechar='"')

# read all the lines into a list
rows = list(cr)

# perform the sort, numerically on the first column; int() copes fine with
# leading zeros, and non-numerical (or empty) rows sort to the top
rows = sorted(rows, key=lambda x: int(x[0]) if x and x[0].isdigit() else 0)

# write back the file as a nice, legal, ASCII tsv file
if sys.version_info < (3,):
    f = open("Proteins_sorted_2.txt", "wb")
else:
    f = open("Proteins_sorted_2.txt", "w", newline='')
cw = csv.writer(f, delimiter='\t', quotechar='"')
cw.writerows(rows)
f.close()

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've made as many attempts at using awk/sed and so on as I could think logical, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt.
Split that line into 2 words.
Use sed, for each line, to replace the 1st word with the 2nd word in all files in the myFiles/ directory.
Note: I've used bash parameter expansion to split the line (${line% *} is the first word, ${line#* } the second), assuming values.txt is a space-separated, two-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line; do
    # '-i' edits the files in place and 'g' replaces all occurrences of the pattern
    sed -i "s/${line% *}/${line#* }/g" myFiles/*
done < values.txt
You can do what you want with awk.
#! /usr/bin/awk -f

# snarf in the first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}

# apply the replacements to subsequent files
{
    for (old in subs) {
        # replace every occurrence of old on this line
        # (note: loops forever if a replacement contains its own search string)
        while ((start = index($0, old)) > 0) {
            len = length(old)
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
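A hypothetical invocation, assuming the script above is saved as replace.awk and made executable (awk writes to stdout, so redirect each result rather than expecting in-place edits):
./replace.awk values.txt myFiles/somefile.txt > somefile.new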
Option One:
create a python script
with open('filename', 'r') as infile, etc., read the values.txt file into a python dict with 'from' as key and 'to' as value; close the infile.
use os.listdir or glob (not shutil) to get the files in the wanted directory, iterate over them, and for each one either popen "sed 's/from/to/g'" or read all the lines and find/replace on each.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write, but has less exception handling; see the sketch below.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.
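A minimal sketch of Option Two, assuming values.txt holds one space-separated from/to pair per line; passing the pair through the environment sidesteps shell quoting issues, and \Q...\E makes Perl treat the pattern literally:
while read -r from to; do
    FROM=$from TO=$to perl -p -i -e 's/\Q$ENV{FROM}\E/$ENV{TO}/g' myFiles/*
done < values.txt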

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline+1;
        myline = myline+1;
    else
        myline = myline+1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk 'NR > 1 && $2 != prev { print line } { line = $0; prev = $2 }'
A brief intro to awk: an awk program consists of a set of condition { code } blocks. It operates line by line. When no condition is given, the block is executed for every line. A BEGIN condition is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on; the full line is in $0.
Here I compare the second field to the previous value and, if it does not match, print the whole previous line (the NR > 1 guard keeps the very first line from printing the empty, not-yet-set line variable). In all cases I store the current line into line and the second field into prev.
And if you really want it right, be careful with the float comparisons - use something like abs($2 - prev) < eps for equality (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure whether awk converts to numbers for equality testing; if not, you're safe with the string comparisons.
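A sketch of that idea, with abs defined by hand and 1e-9 as an assumed tolerance:
<data awk 'function abs(x) { return x < 0 ? -x : x }
           NR > 1 && abs($2 - prev) > 1e-9 { print line }
           { line = $0; prev = $2 }'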
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them on the next iteration with the current line's values. The && field check is useful to avoid a blank line at the beginning of the file, when $2 != field would match because the variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

Replace text with sed

A program creates HTML files from a database. There are headings and stuff in between the headings.
There are not a set amount of headings.
After each heading the program places the text:
$WHITE*("5")$
$WHITE*("20")$
$HRULE$
I need every occurrence of these 4 lines to be replaced with:
$WHITE*("20")$
$HRULE$
$WHITE*("10")$
I am not fussed what program is used :)
I have tried:
sed 's:\$WHITE\*(\"5\")\$\n\n\$WHITE\*(\"20\")\$\n\$HRULE\$:\$WHITE\*(\"20\")\$\
\$HRULE$\
\$WHITE*("10")$:g'
and various other permutations
If that's your input file, and this is the spec, you can do:
sed -n '3,$p;$a$WHITE*("10")$' INPUTFILE
But I assume that's not the case, so you might want to rephrase your question and/or give some more details.
More specific solution with sed:
sed '/^\$WHITE\*("5")\$$/,/^$/d;/\$HRULE\$/ a$WHITE*("10")$' INPUTFILE
(Searches for the $WHITE*("5")$ line and deletes it, up to and including the next empty line. Then searches for the next $HRULE$ line and appends a $WHITE*("10")$ line.)
awk solution:
awk '/\$WHITE\*\("5"\)\$/ { getline ; next }
     /\$WHITE\*\("20"\)\$/ {
         print ;
         getline ;
         if ($0 ~ /\$HRULE\$/) {
             print ;
             print "$WHITE*(\"10\")$" ;
         }
         else { print }
     }
     1' INPUTFILE
This reads the file and prints every line - that's why the 1 is there - except that if it finds the $WHITE*("5")$ pattern, it drops that line, reads the next line (the blank one), and drops it too. If it finds the $WHITE*("20")$ pattern, it prints it, reads the next line, and if that's $HRULE$, prints it along with the appended $WHITE*("10")$ line. Otherwise it just prints the line.
HTH
UPDATE #2
From the sed faq, section 4.23.3
If you need to match a static block of text (which may occur any number of times throughout a file), where the contents of the block are known in advance, then this script is easy to use
UPDATE #1
Python?
$ cat input
first line
second line
3rd line
$WHITE*("5")$
$WHITE*("20")$
$HRULE$
some more lines
yet another
$WHITE*("5")$
$WHITE*("20")$
$HRULE$
THE END
the script:
#!/usr/bin/env python
## Use these 3 lines for python version < 2.5
#fd=open('input')
#text=fd.read()
#fd.close()
## Use these 2 lines for python version >= 2.5
with open('input') as fd:
    text = fd.read()

old = """$WHITE*("5")$

$WHITE*("20")$
$HRULE$
"""

new = """$WHITE*("20")$
$HRULE$
$WHITE*("10")$
"""

# works in python 2 and 3 alike
print(text.replace(old, new))
output:
first line
second line
3rd line
$WHITE*("20")$
$HRULE$
$WHITE*("10")$
some more lines
yet another
$WHITE*("20")$
$HRULE$
$WHITE*("10")$
THE END
Try something like
sed -e '${p;};/$WHITE\*("5")\$/,/$HRULE\$/{H;/$HRULE\$/{g;s/$HRULE\$//;s/20/10/;s/5/20/;s/\n/&$HRULE$/2p;s/.*//p;x;d;};d;};' white.txt
Crude, but it should work.
This might work for you:
sed '/^\$WHITE\*(\"5\")\$/{N;N;N;s/.*\n\n\(\(\$WHITE\*(\"\)20\(\")\$\s*\)\n\$HRULE\$\s*$\)/\1\n\210\3/}' file
Explanation:
Match on the first string $WHITE*("5")$, read the next 3 lines, and match on the remainder. Use grouping and back references to formulate the output lines.
