Remove lines from a file that do not have a dot extension in bash - bash

I have a file that contains lines like the ones below:
/folder/share/folder1
/folder/share/folder1/file.gz
/folder/share/folder2/11072012
/folder/share/folder2/11072012/file1.rar
I am trying to remove these lines:
/folder/share/folder1
/folder/share/folder2/11072012
to get the following final result:
/folder/share/folder2/11072012/file1.rar
/folder/share/folder1/file.gz
In other words, I am trying to keep only the paths to files, not to directories.

This
awk -F/ '$NF~/\./{print}'
splits input records on the character "/" using the command-line switch -F,
examines the last field of the input record, $NF (where NF is the number of fields in the record), to see if it DOES contain the character "." (the ~ operator), and,
if it matches, outputs the record.
Example
$ echo -e '/folder/share/folder.2/11072012
/folder/share/folder2/11072012/file1.rar' | mawk -F/ '$NF~/\./{print}'
/folder/share/folder2/11072012/file1.rar
$
NB: my microscript looks at . ONLY in the filename part of the full path.
Edit: in my first post I had reversed the logic, printing dotless files instead of dotted ones.

You could use the find command to get only the file list:
find <directory> -type f
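For example, assuming the sample paths above exist on disk under /folder/share, this prints only the regular files:
$ find /folder/share -type f
/folder/share/folder1/file.gz
/folder/share/folder2/11072012/file1.rar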

With awk:
awk -F/ '$NF ~ /\./{print}' File
Set / as delimiter, check if last field ($NF) has . in it, if yes, print the line.

Text only result
sed -n 'H
$ {g
:cycle
s/\(\(\n\).*\)\(\(\2.*\)\{0,1\}\)\1/\3\1/g
t cycle
s/^\n//p
}' YourFile
Based on the file and folder names, assuming that:
lines that appear as the prefix of another line are folders, and unique lines are files (the result could be double-checked with an OS file-existence test)
lines are sorted (at least each folder comes before the files inside it)
this is the POSIX version, so use --posix on GNU sed
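As a cross-check, here is a minimal awk sketch of the same idea (drop any line that is a prefix of the line that follows it), assuming the input is sorted as described and the folder lines carry no trailing slash:
sort YourFile | awk 'NR > 1 && index($0, prev "/") != 1 { print prev } { prev = $0 } END { print prev }'
On the sample input this keeps only /folder/share/folder1/file.gz and /folder/share/folder2/11072012/file1.rar.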

Related

Adding part of filename as column to csv files, then concatenate

I have many csv files that look like this:
data/0.Raw/20190401_data.csv
(Only the date in the middle of the filename changes)
I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.
I wrote a bash script that adds the full path and filename as a column in each file, and then merges them into a master csv. However, I am having trouble getting rid of the path and the extension to keep only the date portion.
The bash script
#!/bin/bash
mkdir data/1.merged
for i in "data/0.Raw/"*.csv; do
    awk -F, -v OFS=, 'NR==1{sub(/\_data.csv$/, "", FILENAME) } NR>1{ $1=FILENAME }1' "$i" |
        column -t > "data/1.merged/"${i/"data/0.Raw/"/""}""
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files
rm data/1.merged/*.csv
mv data/1.merged/all_files data/1.merged/all_files.csv
using "sub" I was able to remove the "_data.csv" part, but as a result the column gets added as "data/0.Raw/20190401" - that is, I am having trouble removing both the part before the date as well as the part after the date.
I tried replacing sub with gensub to regex match everything except the 8 digits in the middle but that does not seem to work either.
Any ideas on how to solve this?
Thanks!
You can process and concatenate all the files with a single awk call:
awk '
    FNR == 1 {
        date = FILENAME
        gsub(/.*\/|_data\.csv$/, "", date)
        next
    }
    { print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv
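A quick sanity check of the date extraction on one of the sample names:
$ awk 'BEGIN { d = "data/0.Raw/20190401_data.csv"; gsub(/.*\/|_data\.csv$/, "", d); print d }'
20190401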
However, I am having trouble getting rid of the path and the extension
to only keep the date portion
Then take a look at the basename command:
basename NAME [SUFFIX]
Print NAME with any leading directory components removed. If
specified, also remove a trailing SUFFIX.
Example
basename 'data/0.Raw/20190401_data.csv' _data.csv
gives output
20190401
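Putting the two together, a minimal sketch of the whole merge (assuming every file matches the <date>_data.csv pattern and each has one header line to skip):
for f in data/0.Raw/*_data.csv; do
    date=$(basename "$f" _data.csv)
    # prepend the extracted date to every data row, skipping the header
    awk -v d="$date" 'FNR > 1 { print d "," $0 }' "$f"
done > all_files.csv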

How to add an empty line at the end of these commands?

I have many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^#/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output file is correctly a fasta file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what is on that line, I found that a file's header had been appended to the end of the previous file's sequence, like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
That way, when I merge the files, they are stacked correctly instead of running into the end of the previous file's sequence.
So how can I add a blank line to the end of the file within the
scripts that convert fastq to fasta?
I would use GNU sed, replacing
cat $1 >> final.fasta
with
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n); this action is taken before the default action of printing. If you prefer GNU AWK, then you might get the same behavior the following way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you do not provide enough information. I assume the line above sits somewhere inside a loop and that $1 is always the name of a file existing in the current working directory.
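Under that assumption, the surrounding loop might look like this minimal sketch (the *.fa glob is hypothetical):
for f in *.fa; do
    # sed appends a blank line after the last line of each file
    sed '$a\\n' "$f"
done > final.fasta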
If the only thing you need is an extra blank line, and the input files are within 1.5 GB in size, then just directly do:
awk NF=NF RS='^$' FS='\n' OFS='\n'
It should work for mawk 1/2, gawk, and nawk, and maybe others as well. The reason this works, despite appearing not to do anything special, is that the extra \n comes from ORS.
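Applied to the merge step, a hedged usage under the same assumption about $1:
awk NF=NF RS='^$' FS='\n' OFS='\n' "$1" >> final.fasta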

Prepending part of a filename to a .csv file using bash/sed

I have a couple of files in a directory that are named like this;
1_38OE983729JKHKJV.csv
an integer followed by an ID (the integer and the ID are both unique).
I need to prepend this ID to every line of the file for each file in the folder to prepare the files for import to a database (and discard the integer part of the filename). The contents of the file look something like this:
BW;20015;11,45;0,49;41;174856;4103399
BA;25340;11,41;0,55;40;222161;4599779
BB;800;7,58;0,33;42;10559;239887
HE;6301;9,11;0,39;40;69191;1614302
.
.
.
Total;112613;9,33;0,43;40;1207387;25897426
The end result should look something like this:
38OE983729JKHKJV;BW;20015;11,45;0,49;41;174856;4103399
38OE983729JKHKJV;BA;25340;11,41;0,55;40;222161;4599779
38OE983729JKHKJV;BB;800;7,58;0,33;42;10559;239887
38OE983729JKHKJV;HE;6301;9,11;0,39;40;69191;1614302
.
.
.
38OE983729JKHKJV;Total;112613;9,33;0,43;40;1207387;25897426
Thanks for the help!
EDIT: Fixed spelling and vocabulary for clarity
Loop over the files with for, use parameter expansion to extract the id.
#!/bin/bash
for csv in *.csv ; do
    prefix=${csv%_*}
    id=${csv#*_}
    id=${id%.csv}
    sed -i~ "s/^/$id;/" "$csv"
done
If the ID can contain underscores, you might need to be more careful with the expansion.
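For instance, with one of the sample names the expansions yield:
$ csv=1_38OE983729JKHKJV.csv
$ id=${csv#*_}; id=${id%.csv}
$ echo "$id"
38OE983729JKHKJV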
With awk tool:
for f in *csv; do awk '{ fn=FILENAME; i=index(fn,"_"); $0=substr(fn,i+1,length(fn)-i-4)";"$0 }1' "$f" > tmp && mv tmp "$f"; done
fn=FILENAME - the filename; substr() extracts what lies between the first "_" and the ".csv" suffix
Try the following too, in a single awk, which will also take care of the number of files opened during this operation, so that we avoid the "too many open files" error.
awk 'FNR==1{close(val);val=FILENAME;split(FILENAME,a,"_");sub(/\..*/,"",a[2])} {print a[2]";"$0}' *.csv
With GNU awk for inplace editing and gensub() all you need is:
awk -i inplace '{print gensub(/.*_(.*)\..*/,"\\1;",1,FILENAME) $0}' *.csv
No shell loops or anything else necessary, just that command.
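To preview what gensub() produces for one of the sample names (gensub() is a gawk extension):
$ gawk 'BEGIN { print gensub(/.*_(.*)\..*/, "\\1;", 1, "1_38OE983729JKHKJV.csv") }'
38OE983729JKHKJV;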

Mac Terminal Bash awk change multiple file names to $NF output

I have been working on this script to retrieve files from all the folders in my directory and trying to change their names to my desired output.
Before filename:
Folder\actors\character\hair\haircurly1.dds
After filename:
haircurly1.dds
I am working with over 12,000 textures with different names that I extracted from an archive. My extractor included the path to the folder where it extracted the files in each file name. For example, a file that should have been named haircurly1.dds was named Folder\actors\character\hair\haircurly1.dds during extraction.
cd ~/Desktop/MainFolder/Folder
find . -name '*\\*.dds' | awk -F\\ '{ print; print $NF; }'
This code retrieves every texture file containing backslashes. (I have already renamed some of the files with other commands, but I want one that will rename all of the files at once rather than writing a specific command for every folder for 12,000+ texture files.)
I use print; and it sends me the file path:
./Folder\actors\character\hair\haircurly1.dds
I use print $NF; and it sends me the text after the awk separator:
\
haircurly1.dds
I would like every file name that this script runs through to be changed to the $NF output of the awk command. Anyone know how I can make my script change the file names to their $NF output?
Thank you
Your question isn't clear but it SOUNDS like all you want to do is:
for file in *\\*; do
    mv -- "$file" "${file##*\\}"
done
If that's not all you want then edit your question to clarify your requirements.
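To preview the renames before committing to them, a variant that only echoes the generated commands:
for file in *\\*; do
    echo mv -- "$file" "${file##*\\}"
done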
Have your awk command format and print a "mv" command, and pipe the result to bash. The extra single-quoting ensures bash treats backslash as a normal char.
find . -name '*\\*.dds' | awk -F\\ '{print "mv '\''" $0 "'\'' " $NF}' | bash -x
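Dropping the final | bash -x previews the generated commands without executing them; for the sample file the pipeline would print:
mv './Folder\actors\character\hair\haircurly1.dds' haircurly1.dds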
hth

How to print all lines of a file that do not contain a *partial* pattern

We know grep -v pattern file prints lines that do not contain pattern.
My file to search is a table:
Sample File, Sample Name, Panel, Marker, Allele 1, Allele 2, GQ,
M090972.s-206_B01.fsa, M090972-206, Sample ID-1, SNPchr1, C, T,0.9933,
I want to weed out the lines that contain "M090972-206" and some more patterns like that.
My search patterns come from a directory of text files:
$ ls 20170227_snap_genotypes_1_VCF
M070370-208_S1.genome.vcf M170276-201_S20.genome.vcf
M170308-201_S5.genome.vcf
Only the part of these filenames up to the first "_" is in my table (or the first "." if I remove the ".s" in the example). It is not a constant number of characters. I could remove the characters after the first "." but could not find a way in the sed and awk documentation.
Alternatively I tried using agrep 3.441 with the "-f" option for reading the patterns from a temporary file made with
$ ls "directory" > temp.txt
$ ./agrep -v -f temp.txt $infile >> $outfile
But agrep -f does not find any match (or everything with -v).
What am I missing? Is there a better way, perhaps with sed or awk?
If you are deriving your patterns from the name of files (up to the first _) that exist in 20170227_snap_genotypes_1_VCF directory, then you could do this:
# run from the parent of 20170227_snap_genotypes_1_VCF directory
grep -vf <(cd 20170227_snap_genotypes_1_VCF; ls | cut -f1 -d_) file
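Here cut -f1 -d_ keeps everything before the first underscore, so for the sample directory listing the process substitution feeds grep these patterns:
$ cd 20170227_snap_genotypes_1_VCF && ls | cut -f1 -d_
M070370-208
M170276-201
M170308-201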
