Concatenate awk-output, string, and text file - bash

I have the following two tab-separated files in my current directory.
a.tsv
do not use this line
but this one
and that too
b.tsv
three fields here
not here
For each tsv file there is an associated txt file in the same directory, with the same filename but different suffix.
a.txt
This is the a-specific text.
b.txt
Text associated to b.
For each pair of files I want to create a new file with the same base name but the suffix _new.txt. Each new file should contain all lines from the respective tsv file that have exactly 3 fields, then the string \n####\n, and then the whole content of the respective txt file. Thus, the following output files should be created.
Desired output
a_new.txt
but this one
and that too
####
This is the a-specific text.
b_new.txt
three fields here
####
Text associated to b.
Working, but bad solution
for file in ./*.tsv
do awk -F'\t' 'NF==3' $file > ${file//.tsv/_3_fields.tsv}
done
for file in ./*_3_fields.tsv
do cat $file <(printf "\n####\n") ${file//_3_fields.tsv/.txt} > ${file//_3_fields.tsv/_new.txt}
done
Non-working code
I'd like to get the result with one script, and avoid creating the intermediate file with the suffix _3_fields.tsv.
I tried command substitution as follows:
for file in ./*.tsv
do cat <<< $(awk -F'\t' 'NF==3' $file) <(printf "\n####\n") ${file//.tsv/.txt} > ${file//.tsv/_new.txt}
done
But this doesn't write the awk-processed part into the new files.
Yet, the command substitution seems to work if I only write the awk-processed part into the new file like follows:
for file in ./*.tsv; do cat <<< $(awk -F'\t' 'NF==3' $file) > ${file//.tsv/_new.txt}; done
I'd be interested in why the second-to-last snippet doesn't work as expected, and what a good solution for this task would look like.

Maybe you wanted to redirect the output of a sequence of commands grouped with braces:
for file in ./*.tsv
do
{
    awk -F'\t' 'NF==3' "$file"
    printf "\n####\n"
    cat "${file//.tsv/.txt}"
} > "${file//.tsv/_new.txt}"
done
Note that the space after the opening brace, and the semicolon or newline before the closing brace, are important.
It also seems you are confusing command substitution $() with process substitution <() or >(). Note too that <<< feeds a string to standard input, whereas < redirects a file to standard input.
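The reason the non-working loop produces empty awk output is that cat only reads its standard input when it is given no file operands (or an explicit -); once the process substitution and the txt file are passed as arguments, the <<< here-string is simply ignored. A minimal sketch that keeps the one-liner shape, using process substitution for the awk output as well (untested, same file layout as above):
for file in ./*.tsv
do cat <(awk -F'\t' 'NF==3' "$file") <(printf "\n####\n") "${file//.tsv/.txt}" > "${file//.tsv/_new.txt}"
done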

Related

Need command or script to rename a list of files in linux using a pattern match

I have downloaded some 90 fasta files from NCBI for bacterial genomes. The downloaded files have default names given by NCBI. I need to change it to my desired file names. Thus I have created two .txt files:
file1.txt - containing the default file names provided by NCBI.
file2.txt - containing the names that should replace the defaults.
Both files are made in order, so that the 1st entry of file1.txt corresponds to the 1st entry of file2.txt.
All the downloaded files are in one folder,
and I need a script which reads file1.txt, matches each name with a file in the folder, and renames it to the corresponding name in file2.txt.
I am not a bioinformatician and am new to this field. I look forward to your help. Can this process be made simpler?
This can be done with a very small awk one-liner. For convenience, let's first combine your file1 and file2 to make processing easier. This can be done with paste file1.txt file2.txt > names.txt.
names.txt will be a text file with the old names in the first column and the new names in the second. Awk lets us conveniently run through a file line-by-line (or record-by-record in its terminology) and access each column/field.
Assuming you are in the directory with all these files, as well as names.txt, you can simply run awk '{system("mv " $1 " " $2)}' names.txt to transform them all. This will run through all the lines in names.txt, take the filename given in the first column, and move it to the name given in the second column. The system command allows you to access more basic file system operations through the shell, like moving mv, copying cp, or removing rm files.
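For illustration only (the file names below are made up), if names.txt contained
GCF_000005845.2_genomic.fna    e_coli_K12.fna
GCF_000009045.1_genomic.fna    b_subtilis_168.fna
then the awk one-liner would effectively run
mv GCF_000005845.2_genomic.fna e_coli_K12.fna
mv GCF_000009045.1_genomic.fna b_subtilis_168.fna
Note that this simple form assumes the file names contain no spaces or shell metacharacters, since they are pasted directly into a shell command line.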
Use paste
and xargs like so:
paste file1.txt file2.txt | xargs --verbose -n2 mv
The command is using paste to write lines from 2 files side by side, separated by TABs, to STDOUT. The STDOUT is read by xargs using a pipe (|). Option --verbose prints the command, and option -n2 specifies the max number of arguments for xargs to be 2, so that the resulting commands that are executed are something like mv old_file new_file.
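If you want to preview the renames without executing them (my suggestion, not part of the original answer), put echo in front of mv so that xargs only prints the commands:
paste file1.txt file2.txt | xargs -n2 echo mv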
Alternatively, use the Perl one-liners below.
Print the commands to rename the files, without executing the commands ("dry run"):
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd;'
Print the commands to rename the files, then actually execute them:
paste file1.txt file2.txt | perl -lane '$cmd = "mv $F[0] $F[1]"; print $cmd; system $cmd;'
The command is using paste to write lines from 2 files side by side, separated by TABs, to STDOUT. The STDOUT is read by the Perl one-liner using a pipe (|) to pass it to Perl one-liner's STDIN.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
$F[0], $F[1] : first and second elements of the array @F into which the line is split. They are the old and new file names, respectively.
system executes the command $cmd, which actually moves the files.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
One output file per distinct value in column 7: the rows in each new file are grouped based on their shared value in column 7 of the original file, and each file is named after that shared value. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
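Since the point of the array was to run another script on each generated file, the loop body can just call it; process_one.sh is a hypothetical name for that follow-up script, and nullglob only keeps the loop from iterating over a literal *.csv if nothing was produced:
shopt -s nullglob             # skip the loop entirely if no .csv files were created
for i in *.csv
do
    ./process_one.sh "$i"     # hypothetical follow-up script
done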
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
    header=$0
    next
}
!($7 in files) {                              # first time we see this sport
    files[$7] = sprintf("sport-%s.csv", $7)   # build the output file name
    print header > files[$7]                  # write the header once per new file
}
{
    print > files[$7]                         # append the current row to its sport's file
}
END {
    printf("declare -a sportlist=( ")
    for (sport in files) {
        printf("\"%s\" ", sport)              # space-separated, quoted sport names
    }
    printf(")\n")
}
The idea here is that we use the sport names as keys of the array files[], and build the corresponding filename as each value. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing a header line whenever we meet a sport that has no recorded filename yet; every non-header row is then printed to the file that belongs to its sport.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can run this awk script inside command substitution and eval the result, so the declare command is effectively interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
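A sketch of that mktemp variant (same placeholder paths as above):
tmpfile=$(mktemp) || exit 1
/path/to/awkscript inputfile.csv > "$tmpfile"
. "$tmpfile"                  # sources the "declare -a sportlist=( ... )" line
rm -f "$tmpfile"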
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
if the files might have newlines in them, too, then things get tricky...
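Assuming the awk program prints one generated file name per line as described, a usage sketch for the rest of the workflow could look like this; -t strips the trailing newline from each element, and some_other_script is a placeholder for whatever you want to run per file:
readarray -t filesArray < <( awk ... )    # awk prints one file name per line
for f in "${filesArray[@]}"
do
    some_other_script "$f"                # placeholder per-file processing
done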
if your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values, you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
this then can be piped to your other process, here for example to wc
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
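For reference, a rough GNU awk sketch of the FPAT route mentioned above (the field pattern comes from the gawk manual; this is only an illustration under the assumption of a simple quoted-CSV layout, and it also prints each generated file name so it composes with the array-capture ideas above):
gawk '
    BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }       # field pattern from the gawk manual
    FNR==1 { hdr = $0; next }                     # remember the header
    { out = $7 ".csv" }
    !seen[out]++ { print hdr > out; print out }   # new output file: write header, echo its name
    { print > out }                               # write the data row
' input_file.csv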

Extract line from text file based on leading characters of each line

I have a very large data dump that I need to manipulate. Basically, I receive a text file that has data from multiple tables in it. The first two characters of each line tell me which table the line is from. I need to read each of these lines and extract them into a text file, appending each line to the text file for its table. Each table should have its own text file.
For example, lets say the data file looks like this...
HDxxxxxxxxxxxxx
HDyyyyyyyyyyyyy
ENxxxxxxxxxxxxx
ENyyyyyyyyyyyyy
HSyyyyyyyyyyyyy
What I would need is the first two lines to be in a text file named HD_out.txt, the 3rd and 4th lines in one named EN_out.txt, and the last one in a file named HS_out.txt.
Does anyone know how could this be done with either a simple batch file or UNIX shell script?
Use awk to split file based on first 2 characters:
gawk -v FIELDWIDTHS='2 99999' '{print $2 > $1"_out.txt"}' input.txt
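If gawk isn't available, a portable sketch using plain awk and substr() should behave the same way (assuming every line has at least the 2-character prefix):
awk '{ print substr($0,3) > (substr($0,1,2) "_out.txt") }' input.txt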
Using bash:
while read -r line; do
    echo "${line:2}" >> "${line:0:2}_out.txt"
done < inputFile
${var:startposition:length} is bash substring (parameter) expansion for extracting sub-strings. This causes your inputfile to be split based on its first two chars. If you want to include the table prefix, just use echo "$line" >> "${line:0:2}_out.txt" instead of what is shown above.
Demo:
$ ls
file
$ cat file
HDxxxxxxxxxxxxx
HDyyyyyyyyyyyyy
ENxxxxxxxxxxxxx
ENyyyyyyyyyyyyy
HSyyyyyyyyyyyyy
$ while read -r line; do echo "${line:2}" >> "${line:0:2}_out.txt"; done < file
$ ls
EN_out.txt file HD_out.txt HS_out.txt
$ head *.txt
==> EN_out.txt <==
xxxxxxxxxxxxx
yyyyyyyyyyyyy
==> HD_out.txt <==
xxxxxxxxxxxxx
yyyyyyyyyyyyy
==> HS_out.txt <==
yyyyyyyyyyyyy

Cut and paste a line with an exact match using sed

I have a text file (~8 GB). Let's call this file A. File A has about 100,000 lines, each with 19 words and integers separated by a space. I need to cut several lines from file A and paste them into a new file (file B). The lines should be deleted from file A. The lines to be cut from file A should have an exact matching string.
I then need to repeat this several times, removing lines from file A with a different matching string every time. Each time, file A is getting smaller.
I can do this using "sed" but using two commands, like this:
# Finding lines in file A with matching string and copying those lines to file B
sed -ne '/\<matchingString\>/ p' fileA > fileB
# Again finding the lines in file A with matching string and deleting those lines,
# writing a tmp file to hold the lines that were not deleted.
sed '/\<matchingString\>/ d' fileA > tmp
# Replacing file A with the tmp file.
mv tmp fileA
Here is an example of files A and B. I want to extract all lines containing hg15
File A:
ID pos frac xp mf ...
23 43210 0.1 2 hg15...
...
...
File B:
23 43210 0.1 2 hg15...
I'm fairly new to writing shell scripts and using all the Unix tools, but I feel I should be able to do this more elegantly and faster. Can anyone please guide me toward improving this script? I don't specifically need to use "sed". I have been searching the web and Stack Overflow without finding a solution to this exact problem. I'm using RedHat and bash.
Thanks.
This might work for you (GNU sed):
sed 's|.*|/\\<&\\>/{w fileB\nd}|' matchingString_file | sed -i.bak -f - fileA
This makes a sed script from the matching strings that writes the matching lines to fileB and deletes them from fileA.
N.B. a backup of fileA is made too.
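To make that concrete: if matchingString_file contained just the word hg15 from the question, the first sed would generate the following two-line script for the second sed to execute:
/\<hg15\>/{w fileB
d}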
To make a different file for each exact word match use:
sed 's|.*|/\\<&\\>/{w "&.txt"\nd}|' matchingString_file | sed -i.bak -f - fileA
I'd use grep for this, but apart from that small improvement, this is probably already about as fast as it gets, even though it means applying the regexp to each line twice:
grep '<matchingString>' A > B
grep -v '<matchingString>' A > tmp
mv tmp A
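With the question's example string (hg15) and GNU grep's word-boundary syntax mirroring the \< \> of the original sed commands, that would look roughly like:
grep '\<hg15\>' A > B
grep -v '\<hg15\>' A > tmp
mv tmp A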
The next approach would be to read the file line by line, check the line, and write it depending on the check either to B or to tmp. (And mv tmp A again in the end.) But there is no standard Unix tool which does this (AFAIK), and doing it in shell will probably reduce performance massively:
while IFS='' read -r line
do
    if expr "$line" : '.*<matchingString>' >/dev/null   # expr anchors at the start, hence the leading .*
    then
        echo "$line"          # matching lines go to B (stdout)
    else
        echo "$line" 1>&3     # non-matching lines go to tmp (fd 3)
    fi
done < A > B 3> tmp
You could try to do this using Python (or similar scripting languages):
import os
import re

with open('B', 'w') as b:
    with open('tmp', 'w') as tmp:
        with open('A') as a:
            for line in a:
                if re.search(r'<matchingString>', line):  # search: the string may appear anywhere in the line
                    b.write(line)
                else:
                    tmp.write(line)
os.rename('tmp', 'A')
But this is a little out of scope here (not shell anymore).
Hope this will help you...
# Find the lines in fileA with the matching string and copy those lines to fileB
sed -ne '/\<matchingString\>/ p' fileA >> fileB
# Again find the lines in fileA with the matching string and delete those lines,
# writing a tmp file to hold the lines that were not deleted
sed '/\<matchingString\>/ d' fileA > tmp
# Once you are done grepping and copy-pasting, replace fileA with the tmp file
mv tmp fileA
PS: I'm appending to fileB since you repeat this with a different matching string each time.

Edit files in Bash

I have a few files that contain IP addresses. I'm creating a script and have to figure out how to create a new user file with an IP address that is based off the file created before it. If the last file contains an IP of A.B.C.D the new file needs to be A.B.C.(D+4).
I think I need to use the 'sed' and 'awk' commands, but haven't been able to get anything working. How would I go about writing this part of the script?
Here's something to get you started: suppose there is a file called input that looks like this:
Input: contents of input
127.0.0.1
127.0.0.2
127.0.0.3
127.0.0.200
You can do this on the cmdline:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input > output
Explanation on what awk is doing here:
awk '...' - invoke awk, a tool used primarily for line-by-line manipulation of files, the stuff enclosed by single quotes are instructions to awk.
BEGIN{FS=OFS="."} - tell awk to use . as the delimiter for both input and output. FS stands for "Field Separator" and OFS for "Output Field Separator".
{$4=$4+4; print} - $4 means the 4th field. Since . is the delimiter, D corresponds to the 4th field, and we add the integer value 4 to it. The print here is just shorthand for printing the entire (modified) line.
input - the input file, named as an argument to awk; saves a cat
> output - redirect the output to a file so you can inspect them for any issues before making the user files based on it.
Output: contents of output
127.0.0.5
127.0.0.6
127.0.0.7
127.0.0.204
And then you can read output one line at a time to create new user files as needed, maybe with another script along the lines of:
while read -r line
do
    echo "this is a user file" > "$line"
done < output
(and adjust it to your needs)
Finally, as long as you understand what's going on in the above, you can skip the output file altogether and just do this all in a one-liner:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input | while read line; do echo "hello world" > "$line"; done
