Extracting a value from the same file in multiple directories - bash

Directory names: F1, F2, F3, …, F120
Inside each directory is a file with a common name, 'xyz.txt'.
Each xyz.txt holds a single value.
Example:
F1
Xyz.txt
3.345e-2
F2
Xyz.txt
2.345e-2
F3
Xyz.txt
1.345e-2
--
F120
Xyz.txt
0.345e-2
I want to extract these values and collect them in a single file, say 'new.txt', as a single column:
New.txt
3.345e-2
2.345e-2
1.345e-2
---
0.345e-2
Any help please? Thank you so much.

If your files all look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general pattern in which any digit can vary. The regular expression says that the whole line must consist of: any digit [0-9], a dot character [.], three digits [0-9]{3}, the letter 'e', a minus sign, and a final digit [0-9].
If your data is more regular you can also try a simpler pattern:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can vary.
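If some of the 120 directories might lack the file, a plain loop makes that visible (a minimal sketch; the missing-file warning is my own addition):
for i in {1..120}; do
  f="F$i/xyz.txt"
  if [ -f "$f" ]; then
    cat "$f"               # one value per file, appended in numeric order
  else
    echo "missing: $f" >&2 # report absent files on stderr, not in new.txt
  fi
done > new.txt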

If your files might contain more than just that one line, but the line you want can be unambiguously matched with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to tell it separately not to print the file name. The -x option is a convenience that simplifies the regex by anchoring the match to the whole line.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If the wildcard matches some files that should be excluded, try a more specific wildcard, or several wildcards which together match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
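Also note that F* expands in lexicographic order (F1, F10, F100, F11, …), so if the order of the output lines matters, brace expansion keeps the numeric order:
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F{1..120}/Xyz.txt > new.txt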

This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
This processes the files in parallel but, thanks to -k, prints the results in input order.
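To collect the output in a file, redirect as in the other answers:
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120} > new.txt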

If the files contain just one line, and you want the whole thing, you can use bash range expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines, and you need to actually extract the value, use grep -o (-o is not POSIX, but your grep probably has it).
grep -o '[0-9]\.345e-2' /path/to/F{1..120}/Xyz.txt > output.txt
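If the other digits vary from file to file as well, a more general pattern (a sketch, assuming the values always look like the samples above) would be:
grep -oE '[0-9]+\.[0-9]+e-[0-9]+' /path/to/F{1..120}/Xyz.txt > output.txt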

Related

Extracting all but a certain sequence of characters in Bash

In bash I need to extract a certain sequence of letters and numbers from a filename. In the example below I need to extract just the S??E?? section of the filenames. This must work with both upper/lowercase.
my.show.s01e02.h264.aac.subs.mkv
great.s03e12.h264.Dolby.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
Expected output would be:
s01e02
s03e12
S05E11
I've been trying to do this with SED but can't get it to work. This is what I have tried, without success:
sed 's/.*s[0-9][0-9]e[0-9][0-9].*//'
Many thanks for any help.
With sed we can match the desired string in a capture group, and use the I suffix for case-insensitive matching, to accomplish the desired result.
For the sake of this answer I'm assuming the filenames are in a file:
$ cat fnames
my.show.s01e02.h264.aac.subs.mkv
great.s03e12.h264.Dolby.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
One sed solution:
$ sed -E 's/.*\.(s[0-9][0-9]e[0-9][0-9])\..*/\1/I' fnames
s01e02
s03e12
S05E11
Where:
-E - enable extended regex support
\.(s[0-9][0-9]e[0-9][0-9])\. - match s??e?? with a pair of literal periods as bookends; the s??e?? (wrapped in parens) will be stored in capture group #1
\1 - print out capture group #1
/I - use case-insensitive matching
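Note that the I flag is a GNU sed extension; on BSD/macOS sed you can get the same effect by spelling out both cases with bracket expressions (same idea, just portable):
$ sed -E 's/.*\.([sS][0-9][0-9][eE][0-9][0-9])\..*/\1/' fnames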
I think your pattern is OK. With grep -o you get only the matched part of the string instead of the whole matching line. So
grep -Eio 'S[0-9]{2}E[0-9]{2}'
solves your problem (-E is needed so that {2} is treated as a repetition count rather than literally). Compared to your pattern, only digits are allowed in the numeric positions. Maybe you can put it in an if, so that lines without a match show that something is wrong with the filename.
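A minimal sketch of that if idea (the file list and the message are my own assumptions):
for f in *.mkv *.mp4; do
  if ! grep -Eio 'S[0-9]{2}E[0-9]{2}' <<< "$f"; then
    echo "no S??E?? tag in: $f" >&2
  fi
done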
Suppose you have those file names:
$ ls -1
great.s03e12.h264.Dolby.mkv
my.show.s01e02.h264.aac.subs.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
You can extract the substring this way:
$ printf "%s\n" * | sed -E 's/^.*([sS][0-9][0-9][eE][0-9][0-9]).*/\1/'
Or with grep:
$ printf "%s\n" *.m* | grep -o '[sS][0-9][0-9][eE][0-9][0-9]'
Either prints:
s03e12
s01e02
S05E11
You could use that same sed or grep on a file (with filenames in it) as well.
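For example, assuming the names are in a file called fnames as above:
$ grep -o '[sS][0-9][0-9][eE][0-9][0-9]' fnames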

Can i give sed an array for its path?

I'm new to the world of macOS and Unix, but I have to work with it.
My question is: Am I able to give the sed-command an array, which contains paths, so that sed uses the contained variable as a path?
I am trying to manipulate the Dock of a user that I have identified beforehand.
My code to this point is this:
#!/bin/bash
...
...
for change in '${plistToModify[@]}'
do
sed 's#<_CFURLSTRING>file:///Applications/name_old.app</_CFURLSTRING>#<_CFURLSTRING>file:///Applications/name_new.app</_CFURLSTRING>#' '${plistToModify[$change]}'
done
killall Dock
(I switched the name of the app, fyi)
I tried double-quoting the array at the beginning of the for-loop and at the end of the sed command, but nothing worked.
The array contains an unknown number of paths, which look like this:
/Users/theUser/Library/Preferences/com.apple.dock.plist
Is this possible in the first place or am I missing something?
Thanks in advance.
Yes, sed can handle multiple files; whether they come from an array or not does not matter.
Your Specific Case
In your specific example you can write
sed 'yourSedCommand' "${plistToModify[@]}"
General Case
sed 'cmd' file1 file2 file3 is the same as a=(file1 file2 file3); sed 'cmd' "${a[@]}". Bash expands "${a[@]}" before sed even runs. There is no way for sed to tell the difference.
sed itself handles multiple files as if they were one big stream, therefore sed 'cmd' file1 file2 file3 is the same as cat file1 file2 file3 | sed 'cmd'.
Only if you use the -i flag to edit files in place will sed treat multiple files individually.
Wrong type of quotes. Parameter expansion does not occur within single quotes. You have to write "${plistToModify[$change]}".
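Putting both fixes together, the loop might look like this (a sketch; sed without -i only prints to stdout, and BSD/macOS sed needs -i '' to edit in place without a backup file):
for plist in "${plistToModify[@]}"; do
  sed -i '' 's#<_CFURLSTRING>file:///Applications/name_old.app</_CFURLSTRING>#<_CFURLSTRING>file:///Applications/name_new.app</_CFURLSTRING>#' "$plist"
done
killall Dock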

Find words from file a in file b and output the missing word matches from file a

I have two files that I am trying to run find/grep/fgrep on. I have been trying several different commands to get the following results:
File A
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
File B
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
About the files: date=20170802; most have this date format, though some differ.
filea is my control file: I want to search fileb for the whole words hostnamea through hostnamef and print on the terminal the names from filea that have no match in fileb, to be used in a shell script.
For this example I made it so hostnamee is not within fileb. I want to run an fgrep/grep/awk - whatever works for this - and output only the missing hostnamee from filea.
I can get the following to work, but it does not do quite what I need, and if I swap it around I get nothing.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o
hostnamea
hostnameb
hostnamec
hostnamed
HOSTNAMEF
Cool - I get the matches in fileb. But what if I try to reverse it?
host#host:/netops/backups/scripts$ fgrep -f fileb filea -i -w -o
host#host:/netops/backups/scripts$
I have tried several different commands but cannot seem to get it right. I am using -i to ignore case, -w to match whole words, and -o to print only the matched part.
I have found some sort of workaround but was hoping there was a more elegant way of doing this with a single command either awk,egrep,fgrep or other.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o > test
user#host:/netops/backups/scripts$ diff filea test -i
5d4
< hostnamee
You can
look for "only-matches", i.e. -o, of a in b
use the result as patterns to look for in a, i.e. -f-
only list what does not match, i.e. -v
Code:
grep -of a.txt b.txt | grep -f- -v a.txt
Output:
hostnamee
hostnamef
Case-insensitive code:
grep -oif a.txt b.txt | grep -f- -vi a.txt
Output:
hostnamee
Edit:
Responding to the interesting input by Ed Morton, I have made the sample input somewhat "nastier" to test robustness against substring matches and regex-active characters (e.g. "."):
a.txt:
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
ostname
lilihostnamec
hos.namea
b.txt:
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
lalahostnamef
hostnameab
stnam
This makes things more interesting.
I provide this case insensitive solution:
grep -Fwoif a.txt b.txt | grep -f- -Fviw a.txt
additional -F, meaning "no regex tricks"
additional -w, meaning "whole word matching"
I find the output quite satisfying, assuming that the following change of the "requirements" is accepted:
Hostnames in "a" only match parts of "b" if the adjoining characters are not word characters; _ (and other word characters) are always considered part of the hostname.
(Note the additional output line hostnamed, which is now no longer found in "b", because there it is preceded by an _.)
To match occurrences of valid hostnames which are preceded or followed by other word characters, the list in "a" would have to name those variations explicitly. E.g. "_hostnamed" would have to be listed in order to not have "hostnamed" in the output.
(With a little luck this might even be acceptable for the OP; in that case this extended solution, robust against "EdMortonish traps", is the recommended one. Ed, please consider this a compliment on your interesting input; it is not meant negatively in any way.)
Output for "nasty" a and b:
hostnamed
hostnamee
ostname
lilihostnamec
hos.namea
I am not sure whether the changed handling of _ still matches OPs goal (if not, within OPs scope the first case insensitive solution is satisfying).
_ is one of the "word characters" that whole-word matching with -w respects. More detailed regex control at some point gets beyond grep; as Ed Morton has mentioned, awk or perl (or sed, for the kind of masochistic brain exercise I enjoy) is then appropriate.
Tested with GNU grep 2.5.4 on Windows.
The files a.txt and b.txt have your content; I made sure, however, that they have UNIX line endings. That is important (at least for a, possibly not for b).
$ cat tst.awk
NR==FNR {                      # first file (fileB): collect the known hostnames
gsub(/^[^_]+_|-[^-]+$/,"")     # strip a leading "..._" prefix and the trailing "-date" suffix
hostnames[tolower($0)]
next
}
!(tolower($0) in hostnames)    # second file (fileA): print names not seen in fileB
$ awk -f tst.awk fileB fileA
hostnamee
$ awk -f tst.awk b.txt a.txt
hostnamee
ostname
lilihostnamec
hos.namea
The only assumption in the above is that your host names don't contain underscores and anything after the last - on the line is a date. If that's not the case and there's a better definition of what the optional hostname prefix and suffix strings in fileB can be then just tweak the gsub() to use an appropriate regexp.

How to print all lines of a file that do not contain a *partial* pattern

We know grep -v pattern file prints lines that do not contain pattern.
My file to search is a table:
Sample File, Sample Name, Panel, Marker, Allele 1, Allele 2, GQ,
M090972.s-206_B01.fsa, M090972-206, Sample ID-1, SNPchr1, C, T,0.9933,
I want to weed out the lines that contain "M090972-206" and some more patterns like that.
My search patterns come from a directory of text files:
$ ls 20170227_snap_genotypes_1_VCF
M070370-208_S1.genome.vcf M170276-201_S20.genome.vcf
M170308-201_S5.genome.vcf
Only the part of these filenames up to the first "_" is in my table (or the first "." if I remove the ".s" in the example). It is not a constant number of characters. I could remove the characters after the first "." but could not find a way in the sed and awk documentation.
Alternatively I tried using agrep 3.441 with the "-f" option for reading the patterns from a temporary file made with
$ ls "directory" > temp.txt
$ ./agrep -v -f temp.txt $infile >> $outfile
But with -f agrep finds no matches (or, with -v, matches everything).
What am I missing? Is there a better way, perhaps with sed or awk?
If you are deriving your patterns from the name of files (up to the first _) that exist in 20170227_snap_genotypes_1_VCF directory, then you could do this:
# run from the parent of 20170227_snap_genotypes_1_VCF directory
grep -vf <(cd 20170227_snap_genotypes_1_VCF; ls | cut -f1 -d_) file
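Since grep interprets the patterns as regexes by default, adding -F to match them as fixed strings is a little more robust (a minor variation, under the assumption that the name prefixes never need regex features):
grep -vFf <(cd 20170227_snap_genotypes_1_VCF; ls | cut -f1 -d_) file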

Removing duplicate entries from files on the basis of substring postfixes

Let's say that I have the following text in a file:
foo.bar.baz
bar.baz
123.foo.bar.baz
pqr.abc.def
xyz.abc.def
abc.def.ghi.jkl
def.ghi.jkl
How would I remove duplicates from the file, on the basis of postfixes? The expected output without duplicates would be:
bar.baz
pqr.abc.def
xyz.abc.def
def.ghi.jkl
(Consider foo.bar.baz and bar.baz: the latter is a postfix of the former, so only bar.baz remains. However, neither of pqr.abc.def and xyz.abc.def is a postfix of the other, so both remain.)
Try this:
#!/bin/bash
INPUT_FILE="$1"
in="$(cat "$INPUT_FILE")"
out="$in"
for line in $in; do
    # drop every entry that ends in ".$line", i.e. that has $line as a proper postfix
    out=$(echo "$out" | grep -v "\.$line\$")
done
echo "$out"
You need to save it to a script (e.g. bashor.sh), make it executable (chmod +x bashor.sh) and call it with your input file as the first argument:
./bashor.sh path/to/input.txt
Use sed to escape each line for regular expressions, prefix it with . and suffix it with $, and pipe the result as a pattern list into GNU grep (-f - doesn't work with BSD grep, e.g. on a Mac).
sed 's/[^-A-Za-z0-9_]/\\&/g; s/^/./; s/$/$/' test.txt |grep -vf - test.txt
I just used the regular-expression escaping from another answer and didn't think much about whether it is reasonable. At first sight it seems fine, though it escapes too much; that is probably not an issue.
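For illustration, the sed stage turns the first three input lines into these patterns (this is just the result of applying the three substitutions by hand):
.foo\.bar\.baz$
.bar\.baz$
.123\.foo\.bar\.baz$
grep -v -f - then drops every input line that matches any of these patterns. The leading . demands one extra character, so bar.baz does not match its own pattern .bar\.baz$, while foo.bar.baz and 123.foo.bar.baz do match it and are removed.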
