shell script to search attribute and store value along with filename - shell

Looking for a shell script that searches for an attribute (a string) in all the files in the current directory and stores the attribute value along with the file name.
e.g. File1.txt
abc xyz = "pqr"
File2.txt
abc xyz = "klm"
Here File1 and File2 contain the desired string "abc xyz" and have the values "pqr" and "klm".
I want result something like this:
File1.txt:pqr
File2.txt:klm

Well, this depends on how you define a 'shell script'. Here are three one-line solutions:
Using grep/sed:
egrep -o 'abc xyz = ".*"' * | sed -e 's/abc xyz = "\(.*\)"/\1/'
Using awk:
awk '/abc xyz = ".*"/ { print FILENAME ":" gensub(/abc xyz = "(.*)"/, "\\1", 1) }' *
Using perl one-liner:
perl -ne 'if(s/abc xyz = "(.*)"/$ARGV:$1/) { print }' *
I personally would go with the last one.

Please don't use bash scripting for this.
There is much room for small improvements in the code,
but in 20 lines the damn thing does the job.
Note: the code assumes that "abc xyz" is at the beginning of the line.
#!/usr/bin/python
import os
import re

MYDIR = '/dir/you/want/to/search'

def search_file(fn):
    # Yield the quoted value from every line that starts with 'abc xyz = "..."'.
    myregex = re.compile(r'abc xyz = \"([a-z]+)\"')
    f = open(fn, 'r')
    for line in f:
        m = myregex.match(line)
        if m:
            yield m.group(1)

for filename in os.listdir(MYDIR):
    if os.path.isfile(os.path.join(MYDIR, filename)):
        matches = search_file(os.path.join(MYDIR, filename))
        for match in matches:
            print filename + ':' + match,
Thanks to David Beazley, A.M. Kuchling, and Mark Pilgrim for sharing their vast knowledge.
I couldn't have done something like this without you guys leading the way.

Related

Script to find and print common IDs between two files working but it's not optimal

I have code that works and does what I want, but it is extremely slow: it takes one or two days depending on the size of the input files. I know there are alternatives that can be almost instant, and that my code is slow because it runs grep once for every ID. I wrote another script in Python that works as intended and is almost instant, but it does not print everything I need.
What I need is the common IDs between two files, and I want it to print the whole line. My Python script does not do that, while the bash one does, but it is far too slow.
This is my code in bash:
awk '{print $2}' file1.bim > sites.txt
for snp in `cat sites.txt`
do
grep -w $snp file2.bim >> file1_2_shared.txt
done
This is my code in python:
#!/usr/bin/env python3
import sys

argv1 = sys.argv[1]  # argv1 is the first .bim file
argv2 = sys.argv[2]  # argv2 is the second .bim file
argv3 = sys.argv[3]  # argv3 is the output .txt file name

def printcommonSNPs(inputbim1, inputbim2, outputtxt):
    bim1 = open(inputbim1, "r")
    bim2 = open(inputbim2, "r")
    output = open(outputtxt, "w")

    snps1 = []
    line1 = bim1.readline()
    line1 = line1.split()
    snps1.append(line1[1])
    for line1 in bim1:
        line1 = line1.split()
        snps1.append(line1[1])
    bim1.close()

    snps2 = []
    line2 = bim2.readline()
    line2 = line2.split()
    snps2.append(line2[1])
    for line2 in bim2:
        line2 = line2.split()
        snps2.append(line2[1])
    bim2.close()

    common = list(set(snps1).intersection(snps2))
    for SNP in common:
        print(SNP, file=output)

printcommonSNPs(argv1, argv2, argv3)
My .bim input files are made this way:
1 1:891021 0 891021 G A
1 1:903426 0 903426 T C
1 1:949654 0 949654 A G
I would appreciate suggestions on what I could do to make it quick in bash (I suspect I could use an awk script; I tried awk 'FNR==NR {map[$2]=$2; next} {print $2, map[$2]}' file1.bim file2.bim > Roma_sets_shared_sites.txt, but it simply prints every line, so it is not doing what I need), or on how to tell python3 to print the whole line.
It looks as if the problem can be solved like this:
grep -w -f <(awk '{ print $2 }' file1.bim) file2.bim
The identifiers (field $2) from file1.bim are to be treated as patterns to grep for in file2.bim. GNU grep takes a -f file argument which gives a list of patterns, one per line. We use <() process substitution in place of a file. It looks as if the -w option individually applies to the -f patterns.
This won't produce the same output as your shell script if there are duplicate IDs in file1.bim: if the same pattern occurs more than once, that's the same as one instance. And of course the order is different. Grepping the entire second file for one identifier, then the next and the next, produces the matches in a different order. If that order has to be reproduced, it will take extra work.
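For the python3 part of the question, the whole matching lines can be printed by collecting the IDs of the first file into a set and then filtering the second file against it. This is only a minimal sketch along the lines of the original script; the function and file names are mine, and field 2 is taken as the ID, matching the .bim layout shown above:

#!/usr/bin/env python3
import sys

# Minimal sketch: print every whole line of the second .bim file whose second
# field (the ID) also appears as the second field of the first .bim file.
def print_common_lines(bim1_path, bim2_path):
    with open(bim1_path) as bim1:
        ids = {line.split()[1] for line in bim1 if line.strip()}
    with open(bim2_path) as bim2:
        for line in bim2:
            fields = line.split()
            if len(fields) > 1 and fields[1] in ids:
                print(line, end="")

if __name__ == "__main__":
    print_common_lines(sys.argv[1], sys.argv[2])

Run it as, for example, python3 common_lines.py file1.bim file2.bim > file1_2_shared.txt (common_lines.py being whatever you name the sketch). Like the grep -f version, the output follows the order of file2.bim rather than the order of the original loop.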

Bash: How to decode and encode specific URL characters? [duplicate]

Is it possible to change multiple patterns to different values in the same command?
Let's say I have
A B C D ABC
and I want to change every A to 1, every B to 2, and every C to 3,
so the output will be
1 2 3 D 123
Since I have 3 patterns to change, I would like to avoid substituting them separately.
I thought there would be something like
sed -r s/'(A|B|C)'/(1|2|3)/
but of course this just replaces A, B, or C with the literal text (1|2|3).
I should mention that my real patterns are more complicated than that...
Thank you!
Easy in sed:
sed 's/WORD1/NEW_WORD1/g;s/WORD2/NEW_WORD2/g;s/WORD3/NEW_WORD3/g'
You can separate multiple commands on the same line by a ;
Update
Probably this was too easy. NeronLeVelu pointed out that the above command can lead to unwanted results because the second substitution might even touch results of the first substitution (and so on).
If you care about this you can avoid this side effect with the t command. The t command branches to the end of the script, but only if a substitution did happen:
sed 's/WORD1/NEW_WORD1/g;t;s/WORD2/NEW_WORD2/g;t;s/WORD3/NEW_WORD3/g'
Easy in Perl:
perl -pe '%h = (A => 1, B => 2, C => 3); s/(A|B|C)/$h{$1}/g'
If you use more complex patterns, put the more specific ones before the more general ones in the alternative list. Sorting by length might be enough:
perl -pe 'BEGIN { %h = (A => 1, AA => 2, AAA => 3);
$re = join "|", sort { length $b <=> length $a } keys %h; }
s/($re)/$h{$1}/g'
To add word or line boundaries, just change the pattern to
/\b($re)\b/   # word boundaries
# or
/^($re)$/     # whole lines
This will work if your "words" don't contain RE metachars (. * ? etc.):
$ cat file
there is the problem when the foo is closed
$ cat tst.awk
BEGIN {
    split("the a foo bar",tmp)
    for (i=1; i in tmp; i+=2) {
        old = (i>1 ? old "|" : "\\<(") tmp[i]
        map[tmp[i]] = tmp[i+1]
    }
    old = old ")\\>"
}
{
    head = ""
    tail = $0
    while ( match(tail,old) ) {
        head = head substr(tail,1,RSTART-1) map[substr(tail,RSTART,RLENGTH)]
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}
$ awk -f tst.awk file
there is a problem when a bar is closed
The above obviously maps "the" to "a" and "foo" to "bar" and uses GNU awk for word boundaries.
If your "words" do contain RE metachars etc. then you need a string-based solution using index() instead of an RE based one using match() (note that sed ONLY supports REs, not strings).
Replace with a callback function in JavaScript, similar to the Perl solution by choroba:
var i = 'abcd'
var r = {ab: "cd", cd: "ab"}
var o = i.replace(/ab|cd/g, (...args) => r[args[0]])
o == 'cdab'
This can be optimized with capture groups like /(ab)|(cd)/g and checking args[i] for undefined values.
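A rough Python analogue of that capture-group variant, for anyone not working in JavaScript: match.lastindex reports which alternative matched, which plays the same role as checking args[i] for undefined.

import re

replacements = ["cd", "ab"]        # replacement for group 1, group 2
pattern = re.compile(r"(ab)|(cd)")

def pick(match):
    # lastindex is the number of the capture group that matched the text.
    return replacements[match.lastindex - 1]

print(pattern.sub(pick, "abcd"))   # -> cdab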

combining multiple grep searches and making my script more efficient

I have a file called Type1.txt, that looks like this:
$ cat Type1.txt
ID.580.G3C0
TTTTTTTTTTT
ID.580.G3C8
ATTATATC-AAA
ID.580.GXC16
ATTATTTC-ACG-TTTTTCCTA
ID.694.G9C3
ATTATATC-ACG-AAATCCTA
ID.694.G9C3
etc...
I want to write a bash script to count the instances of each ID and export it into another file that provides a summary, something like this:
ID.580 = 3
ID.694 = 1
etc...
So far the script is messy and unusable.
For the above I have the following:
#!/bin/bash
for Count in `grep -c "ID.580" Type1.txt`; do
    echo $Count=ID.580
done > Result.txt  # counts only that single ID
I have over a thousand different ID.XXX values, which makes this code unusable since it is not practical to add each individual ID.XXX to the search. Thank you for the help!
Shell
The code below uses standard UNIX utilities and does not assume that the second part of the ID is exactly 3 characters: it will find ID.1.123123123 and ID.1234.123123 and correctly keep only the first two dot-delimited parts.
grep '^ID\.[0-9]' Type1.txt | cut -d . -f 1-2 | sort \
| uniq -c | awk '{ print $2" = "$1 }'
grep filters only lines beginning with ID. followed by at least one digit,
cut uses . as the field delimiter and outputs only fields 1 and 2, thus removing
everything after and including the second . on the line,
sort sorts the lines for uniq to work,
uniq prints each distinct line from its input prefixed with a count,
awk reverses these fields and prints them separated with =.
If the first part of the ID can contain letters too, change the [0-9] at the end of the regular expression to [0-9A-Za-z], for example.
The pipeline outputs
ID.580 = 3
ID.694 = 2
Python
As Python is popular among biologists, you might want to hone your Python skills instead:
from collections import Counter

counter = Counter()
with open('Type1.txt') as f:
    for line in f:
        if line.startswith('ID.'):
            top_id = '.'.join(line.split('.', 2)[:2])
            counter[top_id] += 1

for top_id, count in sorted(counter.items()):
    print("%s = %d" % (top_id, count))
The results are exactly identical.
grep '^ID.[0-9][0-9][0-9]' input_file | cut -c1-6 | sort | uniq -c
works?
TL;DR
Given your particular corpus and grouping strategy, there's more than one way to get the results you need. Here are two alternative solutions, one in awk, and one in Ruby.
GNU awk
One way is to use GNU awk to perform the following steps:
match just the ID lines
split matching input lines into fields
select and print the fields you need
sort the lines in the filtered result
count the adjacent duplicates
perform any specialized formatting on the result
For example:
$ awk '/^ID/ {split($0, a, "."); print a[1] "." a[2]}' /tmp/foo |
sort | uniq --count | awk '{print $2 " = " $1}'
ID.580 = 3
ID.694 = 2
With the corpus you provided in your question, this takes an average of 8 ms on my system. A larger corpus will take longer, of course, but unless you have a really huge data set this should be fast enough for most purposes.
Ruby
Ruby offers what I consider a more elegant solution, but is in fact slower. The idea here is to store the relevant portion of your IDs as hash keys, and increment a counter each time you encounter a given ID. For example, consider this Ruby one-liner:
$ ruby -ne 'BEGIN { id = Hash.new(0) }
id[$&] += 1 if /\AID\.\d+/
END { id.each_pair do |k,v| puts "#{k} = #{v}" end }' /tmp/foo
ID.580 = 3
ID.694 = 2
This solution takes around 45 ms to process the same corpus, so I wouldn't recommend it over the awk pipeline just for transforming output. The main advantage to doing it this way is that you have an actual data structure (e.g. a Hash object) that you could manipulate in a more full-featured program.
Here is an awk one-liner:
$ awk -F. '$1=="ID"{a[$2,$3]++}END{for (i in a) {split(i,ind,SUBSEP); r[ind[1]]++}for (i in r) print "ID."i" = "r[i]}' file
ID.694 = 1
ID.580 = 3
And here is a pure bash solution:
#!/bin/bash
while IFS=. read -r pre id code rest
do
    [[ $pre == ID ]] || continue
    [[ ${a[$id]} =~ \."$code"\. ]] || {
        a[$id]="${a[$id]}.$code."
        ((count[$id]++))
    }
done < file

for i in "${!count[@]}"
do
    echo "ID.$i = ${count[$i]}"
done
$ ./script.sh
ID.580 = 3
ID.694 = 1
awk might work too...
awk '/ID.580/{x++}END{print x}' test.txt
You can put this in a for loop:
for i in ID.580 ID.694
do
    awk '/'$i'/{x++}END{print x}' test.txt
done

Shell script to Preprocess Fortran RECORDs to TYPEs?

I'm converting some old F77 code to compile under gfortran. I have a bunch of RECORDS used in the following manner:
RecoRD /TEST/ this
this.field = 1
this.otherfield.sumthin = 2
func = func(%val(ThIs.field,foo.bar,this.other.field))
I am trying to convert these all to TYPEs as such:
TYPE(TEST) this
this%field = 1
this%otherfield%sumthin = 2
func = func(%val(ThIs%field,foo.bar,this%other%field))
I'm just ok with sed and I can process the files to replace the RECORD declarations with TYPE declarations, but is there a way to write a preprocessing type of script using linux tools to convert the this.field notation to this%field notation? I believe I would need something that can recognize the declared record name and target it specifically to avoid borking other variables on accident. Also, any idea how I can deal with included files? I feel like that could get pretty messy but if anyone has done something similar it would be good to include in a solution.
Edit:
I have Python 2.4 available to me.
You could use Python for that. The following script reads the text from stdin, applies the replacement you asked for, and writes the result to stdout:
import re
import sys

txt = sys.stdin.read()
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.MULTILINE)
for name in list(set(names)):
    pattern = re.compile(r"\b%s\.(.*)\b" % name, re.MULTILINE)
    txt = pattern.sub(r"%s%%\1" % name, txt)
sys.stdout.write(txt)
EDIT: As for Python 2.4: yes, format() should be replaced with the % operator. As for structures with subfields, that can easily be handled by passing a function to the sub() call, as below. I also made the matching case-insensitive:
import re
import sys

def replace(match):
    # Turn every "." in the matched reference into "%".
    return match.group(0).replace(".", "%")

txt = sys.stdin.read()
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.MULTILINE)
for name in names:
    pattern = re.compile(r"\b%s(\.\w+)+\b" % name, re.MULTILINE | re.IGNORECASE)
    txt = pattern.sub(replace, txt)
sys.stdout.write(txt)
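The question also asks how to deal with included files. Since Python 2.4 is available, one option is a small recursive expansion pass run before the rewrite above. This is only a hypothetical sketch: the include pattern below is an assumption and should be adjusted to whatever INCLUDE form your codebase actually uses.

import re
import sys

# Assumed include syntax:  include 'common.inc'  or  INCLUDE "common.inc"
INCLUDE_RE = re.compile(r"^\s*include\s+['\"](.+?)['\"]\s*$", re.IGNORECASE)

def expand(fname, out):
    f = open(fname, 'r')
    for line in f:
        m = INCLUDE_RE.match(line)
        if m:
            expand(m.group(1), out)   # recursively inline the included file
        else:
            out.write(line)
    f.close()

expand(sys.argv[1], sys.stdout)

Piping the result of this pass into the script above keeps the include handling and the RECORD/TYPE rewrite as two separate steps.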
With GNU awk:
$ cat tst.awk
/RECORD/ { $0 = gensub(/[^/]+[/]([^/]+)[/]/,"TYPE(\\1)",""); name = tolower($NF) }
{
    while ( match(tolower($0), "\\<" name "[.][[:alnum:]_.]+") ) {
        $0 = substr($0,1,RSTART-1) \
             gensub(/[.]/,"%","g",substr($0,RSTART,RLENGTH)) \
             substr($0,RSTART+RLENGTH)
    }
}
{ print }
$ cat file
RECORD /TEST/ tHiS
this.field = 1
THIS.otherfield.sumthin = 2
func = func(%val(ThIs.field,foo.bar,this.other.field))
$ awk -f tst.awk file
TYPE(TEST) tHiS
this%field = 1
THIS%otherfield%sumthin = 2
func = func(%val(ThIs%field,foo.bar,this%other%field))
Note that I modified your input to show what would happen with multiple occurrences of this.field on one line and mixed in with other "." references (foo.bar). I also added some mixed-case occurrences of "this" to show how that works.
In response to the question below about how to handle included files, here's one way:
This script will not only expand all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2].
awk 'function read(file) {
         while ( (getline < file) > 0 ) {
             if ($1 == "include") {
                 read($2)
             } else {
                 print > ARGV[2]
             }
         }
         close(file)
     }
     BEGIN {
         read(ARGV[1])
         ARGV[1] = ""
         close(ARGV[2])
     }1' a.txt tmp
The result of running the above given these 3 files in the current directory:
a.txt            b.txt            c.txt
-----            -----            -----
1                3                5
2                4                6
include b.txt    include c.txt
9                7
10               8
would be to print the numbers 1 through 10 and save them in a file named "tmp".
So for this application you could replace the number "1" at the end of the above script with the contents of the first script posted above and it'd work on the tmp file that now includes the contents of the expanded files.

Grep search strings with line breaks

How can I use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how do I handle line breaks that fall in the middle of the search string? Is there a switch in grep that can do this, or some other command?
Input files:
File a.txt:
blah blah ... export to
excel ...
blah blah..
File b.txt:
blah blah ... export to excel ...
blah blah..
Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?
If the former, you can use tr to convert newlines to spaces:
tr '\n' ' ' | grep 'export to excel'
If the latter you can do the same thing, but you may want to use the -o flag to only print the actual match. You'll then want to adjust your regex to include any extra context you want.
I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match across a newline in the middle either.
I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.
If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would not be too hard in any of Python, AWK, Perl, or Ruby.
Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.
This is written assuming that /usr/bin/python is Python 2.x. You can trivially change the script to work under Python 3.x if desired.
#!/usr/bin/python

import re
import sys

s_pat = "export\s+to\s+excel"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print "%s:%d: %s" % (fname, i+1, line.strip())
            i_last = i
        else:
            # construct extended line that included previous
            # note newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended so want both lines printed
                # did we print prev line?
                if not i_last == (i - 1):
                    # no so print it now
                    print "%s:%d: %s" % (fname, i, prev_line.strip())
                # print cur line with special marker
                print "--> %s:%d: %s" % (fname, i+1, line.strip())
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError  # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
                     "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
EDIT: added comments.
I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.
It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:
#!/usr/bin/python

import re
import sys

# This pattern not compiled with re.MULTILINE on purpose.
# We *want* the \s pattern to match a newline here so it can
# match across multiple lines.
# Note the match group that gathers text around ete pattern uses a character
# class that matches anything but "\n", to grab text around ete.
s_pat = "([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        text = open(fname, "rt").read()
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    for s_match in re.findall(pat, text):
        print s_match

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError  # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
                     "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
grep -A1 "export to" filename | grep -B1 "excel"
I have tested this a little and it seems to work:
sed -n '$b; /export to excel/{p; b}; N; /export to\nexcel/{p; b}; D' filename
You can allow for some extra white space at the end and beginning of the lines like this:
sed -n '$b; /export to excel/{p; b}; N; /export to\s*\n\s*excel/{p; b}; D' filename
Use gawk: set the record separator to "excel", then check for "export to".
gawk -vRS="excel" '/export.*to/{print "found export to excel at record: "NR}' file
or
gawk '/export.*to.*excel/{print}
      /export to/ && !/excel/{
          s=$0
          getline line
          if (line~/excel/){
              printf "%s\n%s\n",s,line
          }
      }' file
