Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this
File 1
(((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;
File 12
((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;
I want to read the content of each file and classify the files according to their patterns (meaning the file content). The numbers here represent branch lengths and will not be the same in any of the files, so I would like to classify the files based only on the letters A to H. For instance, all the files in which the letters A to H are arranged in the same order should be sorted into the same folder. For example:
For the pattern in File 1, ignoring the numbers (branch lengths), the pattern will be something like this:
(((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;
And all the files that contain this pattern will go into a folder.
File 1
File 5
File 6
File 10
....
I know to sort contents based on a particular pattern using:
grep -l -Z pattern files | xargs -0 mv -t target-directory --
But I am not sure how to do it in this case, as I do not have prior knowledge of the patterns.

You can get the content patterns and sort on them:
$ for f in file{1..2}; do
    printf '%s\t' "$f"
    tr -d '[ 0-9.]' < "$f"
  done | sort -k2
file1 (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
file2 ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;
Files with the same pattern will end up on consecutive lines. This assumes you have one record per file.
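If you then want to actually move the files into per-pattern folders, the stripped pattern itself can serve as the grouping key. Here is a minimal sketch (assuming one tree per file, filenames matching file*, and that md5sum is available to turn each pattern into a short, filesystem-safe folder name):
for f in file*; do
    pat=$(tr -d '[ 0-9.]' < "$f")                        # strip branch lengths, keep the topology
    dir=$(printf '%s' "$pat" | md5sum | cut -d' ' -f1)   # short, safe folder name per pattern
    mkdir -p "$dir"
    mv -- "$f" "$dir/"
done
Files sharing a topology end up in the same folder; the folder names are hashes rather than readable patterns, so you may want to drop a copy of the pattern into each folder as a key.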

Related

count all lines containing a string in all files in a directory and see the output for each file separately

I split a file into 15 new files, each having 400 lines of text. I am trying to search each of the 15 new files for a specific string pattern but would like to know how many instances of the pattern are in each new file.
I know I can search all files in the directory using grep -r "pattern" | wc -l, which gives me the overall number of lines my pattern appears on, but I'd like to know the count for each file; I haven't been able to figure out how to do that yet.
this is what I'd like it to look like:
File1 100
File2 89
File3 90
Because of how I split the files, they're all titled "FILEAA", "FILEAB", "FILEAC", etc.
Just:
grep -cr "pattern" .
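This prints one name:count pair per line (e.g. ./FILEAA:100) for every file under the current directory. If you prefer the space-separated layout shown above, one way to post-process the output is a small awk step like this (a sketch, assuming the filenames contain no colons):
grep -cr "pattern" . | awk -F: '{sub(/^\.\//, "", $1); print $1, $2}'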

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder; however, since I have many dozens of samples, I would like to write a bash script to accomplish this across all my samples and rename the concatenated files to the name of their parent folder. I am still learning and am a bit lost, so I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
    find all the *_suffix.fq.gz files
    extract the list of folders that contain them
    for each folder:
        merge the contained fq.gz files into a new FOLDER_suffix.fq.gz in the parent
p=/ParentFolder    # adjust: the top-level folder that holds the sample directories
for suffix in 1 2 ; do
    # Find all dirs containing *_$suffix.fq.gz files.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store the result in the parent, named after the sample folder.
        (cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes there are no special characters (e.g. spaces) in the folder names.
Smaller merged files can be produced if the merge step decompresses the original data and re-compresses it (e.g. gzcat *_${suffix}.fq.gz | gzip > ../${d##*/}_${suffix}.fq.gz) instead of simply concatenating the compressed streams.
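If the folder names might contain spaces after all, a slightly more defensive sketch is to loop over the sample directories directly, reusing the /ParentFolder layout from the question:
for d in /ParentFolder/*/ ; do
    sample=$(basename "$d")
    for suffix in 1 2 ; do
        # concatenate all forward (or reverse) reads of this sample, named after its folder
        cat "$d"*_"$suffix".fq.gz > /ParentFolder/"$sample"_"$suffix".fq.gz
    done
done
The outputs land next to the sample folders rather than inside them, so re-running the loop does not pick up its own results.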

Concatenating CSV files in bash preserving the header only once

Imagine I have a directory containing many subdirectories each containing some number of CSV files with the same structure (same number of columns and all containing the same header).
I am aware that I can run from the parent folder something like
find ./ -name '*.csv' -exec cat {} \; > ~/Desktop/result.csv
And this will work fine, except for the fact that the header is repeated each time (once for each file).
I'm also aware that I can do something like sed 1d <filename> or tail -n +<N+1> <filename> to skip the first line of a file.
But in my case, it seems a bit more specialised. I want to preserve the header once for the first file and then skip the header for every file after that.
Is anyone aware of a way to achieve this using standard Unix tools (like find, head, tail, sed, awk etc.) and bash?
For example input files
/folder1
/file1.csv
/file2.csv
/folder2
/file1.csv
Where each file has the header A,B,C and one data row 1,2,3.
The desired output would be:
A,B,C
1,2,3
1,2,3
1,2,3
Marked As Duplicate
I feel this is different to other questions like this and this specifically because those solutions reference file1 and file2 in the solution. My question asks about a directory structure with an arbitrary number of files where I would not want to type out each file one by one.
You may use this find + xargs + awk:
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1'
The NR==1 || FNR>1 condition is true for the very first line of the combined input (the header of the first file) and for every non-first line of each file, so the header is printed exactly once.
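To write the combined output to a single file, as in the original find attempt, just redirect it (reusing the output path from the question):
find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1' > ~/Desktop/result.csv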
$ {
> cat real-daily-wages-in-pounds-engla.tsv;
> tail -n+2 real-daily-wages-in-pounds-engla.tsv;
> } | cat
You can pipe the output of multiple commands through cat. tail -n+2 selects all lines from a file, except the first.
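The same grouping idea extends to the directory structure in the question; here is a small sketch (assuming every CSV shares an identical header, so any one file can supply it):
{
    first=$(find . -name '*.csv' | head -n 1)
    head -n 1 "$first"                              # header once, taken from one file
    find . -name '*.csv' -exec tail -n +2 {} \;     # data rows only, from every file
} > ~/Desktop/result.csv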

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with the gzip package) which essentially runs the same command internally as mentioned in the other answers.
However, it reads better in a script and is easier to type by hand.
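Put together for this question, zgrep can feed the filtering and splitting directly; a sketch assuming GNU grep (for -P) and GNU split (for -d and --additional-suffix):
zgrep -P 'my regex' path/to/test/file.gz | grep -vP 'other regex' | split -d -l 1000000 --additional-suffix=.txt - part
The pieces will then be named part00.txt, part01.txt, and so on.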
I'm afraid it's impossible; quoting from the gzip man page:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit: if the .gz contains only one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000) {count=0; global+=1}}'
split is also a good choice, but you will have to rename the files afterwards.
Your solution is almost good. The problem is that you need to tell gzip what to do: to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will get a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX argument and numeric suffixes to get better-named output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like file00, file01, file02, etc.
If you also want to filter out lines matching another Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

Bash script to recursively traverse directories, compare and sync files

I'm trying to write a bash shell script to sync content on two different paths.
The algorithm I'm striving for consists of the following steps
1. given two full (as opposed to relative) paths,
2. recursively compare files (whose filenames may optionally have a basename and suffix) in the corresponding directories of both paths,
3. if a corresponding directory or file is not present on one side, copy each missing file (from the path that has it) to the other corresponding folder.
I've figured out steps 1 and 2, which are:
OLD_IFS=$IFS
# Split the diff output on newlines only, so paths containing spaces stay intact
IFS=$'\n'
for old_file in $(diff -rq old/ new/ | grep "^Files.*differ$" | sed 's/^Files \(.*\) and .* differ$/\1/')
do
    mv "$old_file" "$old_file.old"
done
IFS=$OLD_IFS
Thanks.
I have implemented a similar algorithm in Java, which essentially boils down to this:
Retrieve a listing of directories A and B, e.g. A.lst and B.lst
Create the intersection of both listings (e.g. cat A.lst B.lst | sort | uniq -d). This is the list of files you need to actually compare; you will also have to descend to any directories recursively.
You may want to have a look at the conditional expressions supported by your shell (e.g. for bash) or by the test command. I would also suggest using cmp instead of diff.
Note: you need to consider what the proper action should be when you have a directory on one side and a file on the other with the same name.
Find the files that are only present in A (e.g. cat A.lst B.lst B.lst | sort | uniq -u) and copy them recursively (cp -a) to B.
Similarly, find the files that are only present in B and copy them recursively to A.
EDIT:
I forgot to mention a significant optimization: if you sort the file lists A.lst and B.lst beforehand, you can use comm instead of cat ... | sort | uniq ... to perform the set operations:
Intersection: comm -12 A.sorted.lst B.sorted.lst
Files that exist only in A: comm -23 A.sorted.lst B.sorted.lst
Files that exist only in B: comm -13 A.sorted.lst B.sorted.lst
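A minimal sketch of the listing, intersection, and copy steps above, using placeholder directories dirA/ and dirB/ and relative listings so the two trees line up:
# sorted listings of relative paths (files only)
(cd dirA && find . -type f | sort) > A.sorted.lst
(cd dirB && find . -type f | sort) > B.sorted.lst

# compare the files that exist on both sides
comm -12 A.sorted.lst B.sorted.lst | while IFS= read -r f; do
    cmp -s "dirA/$f" "dirB/$f" || echo "differs: $f"
done

# copy the files that exist only in A over to B, keeping the directory structure
comm -23 A.sorted.lst B.sorted.lst | while IFS= read -r f; do
    mkdir -p "dirB/$(dirname "$f")"
    cp -a "dirA/$f" "dirB/$f"
done
The mirror-image comm -13 pass (copying B-only files into A) follows the same shape.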
There exists a ready-made solution (shell script), based on find (also using the same idea as yours), to synchronize two directories: https://github.com/Fitus/Zaloha.sh.
Documentation is here: https://github.com/Fitus/Zaloha.sh/blob/master/DOCUMENTATION.md.
Cheers
