In bash extract select properties from a standard property file to a single delimited line? - bash

In bash:
1) For a given groupname of interest, and
2) a list of keys of interest, for which we want a table of values, for this groupname,
3) read in a set of files, like those in /usr/share/applications (see simplified example below),
4) and produce a delimited table, with one line per file, and one field for each given key.
EXAMPLE
inputs
We want only the values of the Name and Exec keys, from only [Desktop Entry] groups, and from one or more files, like these:
[Desktop Entry]
Name=Root
Comment=Opens
Exec=e2
and a second file:
[Desktop Entry]
Comment=Close
Name=Root2
output
Two lines, one per input file, each in a delimited <Name>,<Exec> format, ready for import into a database:
Root,e2
Root2,
Each input file is:
One or more blocks of lines, each introduced by a [some-groupname] header.
Below each [.*] header are one or more standard, unsorted key=value pairs.
Not every block contains the same set of keys.
[Forgive me if I am asking for a solution to an old problem, but I can't seem to find a good, quick bash way, to do this. Yes, I could code it up with some while and read loops, etc... but surely it's been done before.]
Similar to this Q but more general answer wanted.

If awk is your option, would you please try the following:
awk -v RS="[" -v FS="\n" '{            # split the file into records on "[" and the record into fields on "\n"
    name = ""; exec = ""               # reset variables
    if ($1 == "Desktop Entry]") {      # if the groupname matches
        for (i = 2; i <= NF; i++) {    # loop over the fields (lines) of "key=value" pairs
            if (sub(/^Name=/, "", $i)) name = $i        # the field (line) starts with "Name="
            else if (sub(/^Exec=/, "", $i)) exec = $i   # the field (line) starts with "Exec="
        }
        print name "," exec
    }
}' file
You can feed multiple files as file1 file2 file3, dir/file* or whatever.
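For instance, if the program between the single quotes is saved as extract.awk and the two sample files above are saved as (hypothetical names) app1.desktop and app2.desktop, the run and its output would be:
$ awk -v RS="[" -v FS="\n" -f extract.awk app1.desktop app2.desktop
Root,e2
Root2,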

Related

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've attempted as many awk/sed approaches as I could think of, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line into 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory (see the expanded commands below)
Note: I've used bash parameter expansion to split the line (${line% *} etc.), assuming values.txt is a space-separated, 2-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line; do
    sed -i "s/${line% *}/${line#* }/g" myFiles/*   # '-i' edits files in place and 'g' replaces all occurrences of the pattern
done < values.txt
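For the sample values.txt above, the loop effectively runs:
sed -i "s/Hello/Goodbye/g" myFiles/*
sed -i "s/Happy/Sad/g" myFiles/*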
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}
# apply replacements to subsequent files
{
    for (old in subs) {
        # repeatedly replace the next occurrence of "old"
        # (assumes a replacement never contains its own search string)
        while (start = index($0, old)) {
            len = length(old)
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
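For example, assuming the script above is saved as replace.awk (an illustrative name), it prints the modified text to stdout, so you might run it once per file:
chmod +x replace.awk
for f in myFiles/*; do
    ./replace.awk values.txt "$f" > "$f.new"
done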
Option One:
create a python script
with open('filename', 'r') as infile, etc., read the values.txt file into a python dict with 'from' as the key and 'to' as the value; close the infile.
use os.listdir (or pathlib) to list the directory wanted and iterate over the files; for each, either popen "sed 's/from/to/g'" or read the file in, iterating over all the lines and doing the find/replace on each.
Option Two:
bash script
read in a from/to pair and invoke perl on every file in the directory:
while read -r from to; do
    perl -p -i -e "s/$from/$to/g" dirname/*.txt   # assumes plain words; regex metacharacters in $from would need escaping
done < values.txt
The second is probably easier to write but offers less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.

Doing a math operation across multiple files with shell scripting

I have multiple files, let's say
fname1 contains:
red=5
green=10
yellow=2
fname2 contains:
red=10
green=2
yellow=2
fname3 contains:
red=1
green=7
yellow=4
I want to write a script that reads from these files, sums the numbers for each colour,
and redirects the sums into a new file.
New file contains:
red=16
green=19
yellow=8
awk is your friend:
awk 'BEGIN{ FS="=" }
     { color[$1] += $2 }
     END{
         for (var in color)
             printf "%s=%s\n", var, color[var]
     }' fname1 fname2 fname3 > result
should do it.
Demystifying the above
Anything included inside '' is the awk program.
Stuff inside BEGIN{} will be executed only once, i.e., at the beginning.
FS is an awk built-in variable which stands for field separator.
Setting FS to = means awk will use = to delimit the fields/columns.
By default awk considers each line as a record.
In that case you have two fields, denoted by $1 and $2, in each record, with = as the delimiter.
{color[$1]+=$2} creates (if it does not already exist) an associative array with the color name as the key, and += adds the value of field 2 to that array element. Remember, associative array elements are treated as zero when first created.
This is repeated for the three files fname1, fname2, fname3 fed into awk.
Anything inside END{} will be executed only at the end, i.e., just before exit.
for(var in color) is the style of for loop used to walk through an associative array.
Here var will be a key and color[var] is the corresponding value.
printf "%s=%s\n",var,color[var] is self-explanatory.
Note
If all the filenames start with fname you can even put fname* instead of fname1 fname2 fname3
This assumes that there are no blank lines in any file
Because your source files are valid shell code, you can just source them (if they are from a trusted source) and accumulate the sums using Shell Arithmetic.
#!/bin/bash
sum_red=0
sum_green=0
sum_yellow=0
for file in "$@"; do
    . "${file}"
    let sum_red+=red
    let sum_green+=green
    let sum_yellow+=yellow
done
echo "red=$sum_red
green=$sum_green
yellow=$sum_yellow"
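A possible invocation, assuming the script above is saved as sum_colors.sh (an illustrative name):
chmod +x sum_colors.sh
./sum_colors.sh fname1 fname2 fname3 > result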

Slow bash script to execute sed expression on each line of an input file

I have a simple bash script as follows
#!/bin/bash
# This script reads a file of row identifiers separated by new lines
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage: filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt > filtered.fa
while read SEQID; do
    sed -n -e "/$SEQID/,/>/ p" "$1" | head -n -1
done
A fasta file has the following format:
>HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
edit changed header names to better show their actual structure in the files
The script I made above does filter the file correctly, but it is very slow. My input file has ~20,000,000 lines containing ~4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like python or perl)? Any ideas why the script above is taking hours to complete?
You're scanning the large file 80k times. I'll suggest a different approach with a different tool: awk. Load the selection list into a hashmap (an awk array) and, while scanning the large file, print any sequence whose header matches.
For example
$ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
$1 in a' headers fasta
The -F"\n" flag sets the field separator in the input file to a newline, and -v RS=">" sets the record separator to ">". With the sample headers and fasta files described below, the command prints:
Sequence ID 1
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
Sequence ID 4
GGGTACAGACCTACAT
CAACTAGGGGACCAAT
the headers file contains
$ cat headers
1
4
and the fasta file includes some more records in the same format.
If your headers file already includes the "Sequence ID" prefix, adjust the code accordingly. I didn't test this on large files, but it should be dramatically faster than your code as long as you don't have memory restrictions that prevent holding an 80K-entry array. In that case, splitting the headers into multiple sections and combining the results should be trivial (see the sketch below).
To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:
awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' headers fasta > out
The ORS and OFS flags set the output field and record separators, in this case to be the same as the input fasta file.
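If memory ever did become a constraint, a minimal sketch of the chunking idea mentioned above might look like this (the chunk size and file names are illustrative):
# split the headers into 10000-line chunks, run the same filter once per chunk
# against the full fasta file, and concatenate the results
split -l 10000 headers headers.part.
for part in headers.part.*; do
    awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" \
        'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' "$part" fasta
done > out
rm headers.part.*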
You should take advantage of the fact (which you haven't explicitly stated, but I assume) that the huge fasta file contains the sequences in order (sorted by ID).
I'm also assuming the headers file is sorted by ID. If it isn't, make it so - sorting 80k integers is not costly.
When both are sorted, it boils down to a single simultaneous linear scan through both files. And since it runs in constant memory, it can work with input of any size, unlike the other awk example. I give an example in python since I'm not comfortable with manual iteration in awk.
import sys

fneedles = open(sys.argv[1])
fhaystack = open(sys.argv[2])

def get_next_id():
    while True:
        line = next(fhaystack)
        if line.startswith(">Sequence ID "):
            return int(line[len(">Sequence ID "):])

def get_next_needle():
    return int(next(fneedles))

try:
    i = get_next_id()
    j = get_next_needle()
    while True:
        if i == j:
            print(i)
        while i <= j:
            i = get_next_id()
        while i > j:
            j = get_next_needle()
except StopIteration:
    pass
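One way to run it, assuming the code above is saved as filter_sorted.py (an illustrative name); note that the needles file comes first and that the script prints only the matching ID numbers:
python3 filter_sorted.py seq_ids.txt fasta_to_filter.fa > matched_ids.txt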
Sure, it's a bit verbose, but it finds 80k of the 4M sequences (339M of input) in about 10 seconds on my old machine. (It could also be rewritten in awk, which would probably be much faster.) I created the fasta file this way:
for i in range(4000000):
    print(">Sequence ID {}".format(i))
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
And the headers ("needles") this way:
import random
ids = list(range(4000000))
random.shuffle(ids)
ids = ids[:80000]
ids.sort()
for i in ids:
print(i)
It's slow because you are reading the same file several times, when you could have sed read it once and process all the patterns. So you need to generate a sed script with a statement for each ID, using />/b to replace your head -n -1.
while read ID; do
    printf '/%s/,/>/ { />/b; p }\n' "$ID";
done | sed -n -f - data.fa
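For illustration, if the loop read the IDs ID1 and ID2 (placeholders), the generated sed program fed to sed -n -f - would be:
/ID1/,/>/ { />/b; p }
/ID2/,/>/ { />/b; p }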

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the entries of parse.txt in the order specified in refer.txt.
An example of output.txt would be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
I have tried the following code:
sort -nru refer.txt parse.txt
but no luck.
Please assist me. TIA
You can do that using gnu-awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/             # Use / as the field separator
-v RS=',|\n'    # Use comma or newline as the record separator
NR == FNR {     # While processing the first file, parse.txt
    a[$1] = (a[$1]) ? a[$1] "," $0 : $0   # create an array with the 1st field as key and, as value, all the
                                          # records for that key (julie, remo, rob etc.), comma-joined
}
{               # While processing the second file, refer.txt
    s = (s) ? s "," a[$1] : a[$1]         # aggregate all values by reading each key from the 2nd file
}
END { print s } # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt

# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[@]}"; do
    key=${value%%/*}
    if [[ ${kv[$key]} ]]; then
        kv[$key]+=",$value"   # already exists, comma-separate
    else
        kv[$key]="$value"
    fi
done

# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[@]}"; do
    out+=( "${kv[$value]}" )
done

# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
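A quick guard one might add at the top of the script (a sketch):
# associative arrays (declare -A) need bash 4.0 or newer
if (( BASH_VERSINFO[0] < 4 )); then
    echo "this script needs bash 4.x or newer" >&2
    exit 1
fi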
tr , "\n" < refer.txt | cat -n > person_id.txt   # 'cat -n' not POSIX; use sed and paste instead if needed
cat person_id.txt | while read person_id person_key
do
    printf '%s\n' "$person_id" > "$person_key"
done
tr , "\n" < parse.txt | sed 's/^\([^\/]*\)\(\/.*\)$/\1 \1\2/' > person_data.txt
cat person_data.txt | while read foreign_key person_data
do
    person_id="$(<"$foreign_key")"
    printf '%s %s\n' "$person_id" "$person_data" >> merge.txt
done
sort merge.txt > output.txt
A text book data processing approach, a person id table, a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
This actually adds a step to the otherwise simple, though not very efficient, task of grepping parse.txt with each value in refer.txt; which of the two is more efficient, I'm not sure.
NB: The above code is very unlikely to work out of the box.
NBB: On reflection, probably a better way of doing this would be to use the file system to create a random access file of parse.txt (essentially an index), and to then consider refer.txt as a batch file, submitting it as a job as such, printing out from the parse.txt random access file the data for each of the names read in from refer.txt in turn:
# 1) index data file on required field (assumes the ./person_data/ directory exists)
cat person_data.txt | while read data
do
    key="$(printf '%s\n' "$data" | sed 's/^\([^\/]*\).*/\1/')"   # alt. `cut -d'/' -f1` ??
    printf '%s\n' "$data" >> ./person_data/"$key"
done
# 2) run batch job
cat refer_data.txt | while read key
do
    cat ./person_data/"$key"
done
However, having said that, using egrep is probably just as rigorous a solution, at least for small datasets, and I would most certainly use that approach given the specific question posed. (Or maybe not! The above could well prove faster as well as being more robust.)
Command
while read line; do
    grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, newlines are translated to commas using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that your records are separated by newlines instead of commas.
while read will read every name from refer.txt (i.e. julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed at the start of the string and not in the middle (thanks to @CharlesDuffy's comment), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.
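For example, the process substitution over refer.txt expands to one name per line:
$ tr , "\n" < refer.txt
julie
remo
rob
whitney
james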

Adding file information to an AWK comparison

I'm using awk to perform a file comparison against a file listing in found.txt
while read line; do
    awk 'FNR==NR{a[$1]++;next}$1 in a' $line compare.txt >> $CHECKFILE
done < found.txt
found.txt contains full path information to a number of files that may contain the data. While I am able to determine that data exists in both files and output that data to $CHECKFILE, I wanted to be able to put the line from found.txt (the filename) where the line was found.
In other words I end up with something like:
File " /xxxx/yyy/zzz/data.txt "contains the following lines in found.txt $line
just not sure how to get the /xxxx/yyy/zzz/data.txt information into the stream.
Appended for clarification:
The file found.txt contains the full path information to several files on the system
/path/to/data/directory1/file.txt
/path/to/data/directory2/file2.txt
/path/to/data/directory3/file3.txt
each of the files has a list of parameters that need to be checked for existence before appending additional information to them later in the script.
so for example, file.txt contains the following fields
parameter1 = true
parameter2 = false
...
parameter35 = true
the compare.txt file contains a number of parameters as well.
So if parameter35 (or any other parameter) shows up in one of the three files, I get its output dropped to the Checkfile.
Both of the scripts (yours and the one I posted) will give me that output, but I would also like to echo the line that is being read at that point in the loop. It sounds like I would just be able to somehow pipe it in, but my awk expertise is limited.
It's not really clear what you want but try this (no shell loop required):
awk '
ARGIND==1 { ARGV[ARGC] = $0; ARGC++; next }
ARGIND==2 { keys[$1]; next }
$1 in keys { print FILENAME, $1 }
' found.txt compare.txt > "$CHECKFILE"
ARGIND is gawk-specific; if you don't have it, add FNR==1{ARGIND++}, as shown in the sketch below.
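For a non-gawk awk, that would look like the following sketch, where ARGIND is just an ordinary variable the script maintains itself (don't add the extra line under gawk, where ARGIND is built in):
awk '
FNR==1 { ARGIND++ }                          # maintain our own file counter
ARGIND==1 { ARGV[ARGC] = $0; ARGC++; next }
ARGIND==2 { keys[$1]; next }
$1 in keys { print FILENAME, $1 }
' found.txt compare.txt > "$CHECKFILE"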
Pass the name into awk inside a variable like this:
awk -v file="$line" '{... print "File: " file }'
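Combined with the original loop, that might look like this sketch (the "File:" output format is illustrative):
while read -r line; do
    awk -v file="$line" 'FNR==NR{a[$1]++; next} $1 in a {print "File: " file ": " $1}' "$line" compare.txt >> "$CHECKFILE"
done < found.txt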
