Making a new file using another as a template in bash

I have two files such as:
File_1
c1,c2,c3,c4
File_2
c1,c3,c2,c4
DA,CA,DD,CD
Thus, I want to make a File_3 using File_1 as a model, in bash:
File_3
c1,c2,c3,c4
DA,DD,CA,CD
In this example, File_1 is a model of the correct arrangement of the columns, and File_2 has the columns and their respective information, but in the wrong arrangement. File_3 uses File_1 as a template and puts the information from File_2 into the correct arrangement.
In the example I gave just 4 columns, but my real file has 402 columns.
So doing an
awk -F"," '{print $1","$3","$2","$4}' File_2
or something like this will not work, because I don't know the positions of the File_1 items in File_2 (for example, the c1 column in File_2 could be in the sixth, the second, or the last column position).
I hope you can help me using bash (if possible), and I would like a small explanation of the script, because I'm a newbie and I don't know the commands very well.
Thanks in advance.

You can make a header index mapping like this:
File_1 => File_2
------    ------
1      => 1
2      => 3
3      => 2
4      => 4
awk -F, '
FNR==NR{                  # 1st file (File_1): the template header
  for(i=1;i<=NF;i++)
    a[$i]=i               # a[name] = column position in File_1
  print                   # print the template header
  nextfile
}
FNR==1{                   # header line of File_2
  for(j=1;j<=NF;j++)
    b[a[$j]]=j            # b[File_1 position] = File_2 position
  next
}
{                         # data lines of File_2: print fields in template order
  for(k=1;k<=NF;k++)
    printf("%s%s", $(b[k]), k==NF?"\n":",")
}
' File_{1,2}
Note: this command only works if File_{1,2} contain no empty lines! A tweak for that case is sketched below.
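If empty lines can occur, one way to cope is to skip blank lines up front and key the header rule on a flag instead of FNR==1 (a sketch, only checked against the example data):
awk -F, '
NF==0   { next }      # ignore empty lines in either file
FNR==NR { for(i=1;i<=NF;i++) a[$i]=i; print; nextfile }
!hdr    { for(j=1;j<=NF;j++) b[a[$j]]=j; hdr=1; next }
        { for(k=1;k<=NF;k++) printf("%s%s", $(b[k]), k==NF?"\n":",") }
' File_{1,2}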

If you are free to change the format of file 2 from:
File_2
c1,c3,c2,c4
DA,CA,DD,CD
to:
s/c1/DA/g
s/c3/CA/g
s/c2/DD/g
s/c4/CD/g
you can use sed:
sed -f File_2 File_1 > File_3
Otherwise, you can generate that sed program from arrays:
key=($(head -n1 File_2 | tr "," " "))
val=($(tail -n1 File_2 | tr "," " "))
len=${#key[*]}
for i in $(seq 0 $((len-1))); do echo s/${key[$i]}/${val[$i]}/g; done > subst.sed
sed -f subst.sed File_1 > File_3
The generated sed program is the same as the one above. If a substituted value matches the key of a later command, you may get unexpected results. If you only want to match whole words, you have to change the sed commands a bit, as sketched below.
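With GNU sed, you can anchor each key on word boundaries so that only whole words are replaced (a sketch; \b is a GNU sed extension):
for i in $(seq 0 $((len-1))); do echo "s/\b${key[$i]}\b/${val[$i]}/g"; done > subst.sed
sed -f subst.sed File_1 > File_3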

Related

Using bash comm command on columns but returning the entire line

I have two files, each with two columns and sorted only by the second column, such as:
File 1:
176 AAATC
6 CCGTG
80 TTTCG
File 2:
20 AAATC
77 CTTTT
50 TTTTT
I would like to use the comm command with options -13 and -23 to get two different files reporting the lines unique to each input file, with the corresponding count numbers, but comparing only the second columns (i.e., the strings). What I tried so far was something like:
comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)
But that only outputs the strings, without the numbers:
CCGTG
TTTCG
While what I want would be:
6 CCGTG
80 TTTCG
Any suggestion?
Thanks!
You can use join instead of comm:
join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2
It will output the matching lines, too, but you can remove them with
| grep -v '[ACTG] [ACTG]'
Explanation:
-1 2 use the second column in file 1 for joining;
-2 2 similarly, use the second column in file 2;
-a 1 show also non-matching lines from file 1 - these are the ones you want in the end;
-o specifies the output format; here we want columns 1 and 2 from file 1 and column 2 from file 2 (this choice is somewhat arbitrary: you could use column 1 of file 2 as well, but then the grep filter would be different: | grep -v '[ACTG] [0-9]').
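Putting both runs together (a sketch; join needs input sorted on the join field, which the question says is the case, and the output file names are my own):
join -1 2 -2 2 -a 1 -o 1.1,1.2,2.2 File1 File2 | grep -v '[ACTG] [ACTG]' > only_in_file1
join -1 2 -2 2 -a 2 -o 2.1,2.2,1.2 File1 File2 | grep -v '[ACTG] [ACTG]' > only_in_file2
The first line corresponds to comm -23, the second to comm -13.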
comm is not the right tool for this job, and while join will work, you need to run it twice and then filter the results with some other command (e.g., grep).
One awk idea that requires a single pass through each input file:
awk 'BEGIN { FS=OFS="\t" }
FNR==NR  { f1[$2]=$1; next }       # save 1st file entries
$2 in f1 { delete f1[$2]; next }   # 2nd file: if $2 is in f1[] then delete the f1[] entry and skip this line, else ...
         { f2[$2]=$1 }             # save 2nd file entries
END      {                         # at this point:
                                   # f1[] contains rows where field #2 only exists in the 1st file
                                   # f2[] contains rows where field #2 only exists in the 2nd file
           PROCINFO["sorted_in"]="@ind_str_asc"
           for (i in f1) print f1[i],i > "file-23"
           for (i in f2) print f2[i],i > "file-13"
         }
' file1 file2
NOTE: the PROCINFO["sorted_in"] line requires GNU awk. Without that line we cannot guarantee the order of writes to the final output files, and the OP would then need either more (awk) code to maintain the ordering or another OS-level utility (e.g., sort) to sort the final files.
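Without GNU awk, one option is to drop the PROCINFO line and sort the output files afterwards (a sketch, assuming ordering by the string column is what's wanted):
sort -k2,2 -o file-23 file-23
sort -k2,2 -o file-13 file-13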
The awk script above generates:
$ cat file-23
6 CCGTG
80 TTTCG
$ cat file-13
77 CTTTT
50 TTTTT

cat multiple files into one using same amount of rows as file B from A B C

This is a strange question; I have been looking around and wasn't able to find anything matching what I wish to do.
What I'm trying to do is:
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all files into one file, matching the number of lines of file B.
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, and in file C I have to duplicate 1 line, so I can match the same number of lines as file B.
I was thinking of first doing a count to see how many lines each file has:
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then, if a count is greater than file B's, remove lines; else, add lines.
But I got lost, as I'm not sure how to continue with this; I have never seen anyone try to do this.
Can anyone point me to an idea?
The output should be as per the attached picture.
Thanks.
Could you please try the following. I have used # as the separator; you could change it as per your need.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
  FS=OFS="#"
}
FNR<=file2_lines{        # only keep as many rows as file2 has lines
  $1=$1?$1:prev_first    # 1st column empty (file1 ran out): reuse the previous value
  $3=$3?$3:prev_third    # 3rd column empty (file3 ran out): reuse the previous value
  print
  prev_first=$1
  prev_third=$3
}'
Example of running the above code. Let's say the following are the input files:
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run the above code in the form of a script, the following will be the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
You can get the first n lines of a file with the head command (or sed), and you can generate new lines with echo. Wrapping that in a small shell function:
#!/bin/bash
# print file $1 with exactly $2 lines: truncate if too long, pad with blank lines if too short
fix_numlines() {
  local filename=$1
  local wantlines=$2
  local havelines=$(grep -c . "${filename}")   # count the non-empty lines
  head -n "${wantlines}" "${filename}"
  if [ $havelines -lt $wantlines ]; then
    for i in $(seq $((wantlines-havelines))); do echo; done
  fi
}
lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
If you want columnated output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 '   # the n parameter is the file index number ...
BEGIN {                           # ... which defines the record count
  srand(seed)                     # a random record is printed when there are not enough records
}
{
  a[ARGIND][c[ARGIND]=FNR]=$0     # hash all data into a first
}
END {
  for(r=1;r<=c[n];r++)            # loop records
    for(f=1;f<=ARGIND;f++)        # and files for the output below
      printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c                          # -v n=2 means the second file, ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1 with c[f].
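That is, the last printf would become (repeating the last record of a short file instead of picking a random one):
printf "%s%s",((r in a[f])?a[f][r]:a[f][c[f]]),(f==ARGIND?ORS:OFS)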
$ gawk '                  # remember, GNU awk only
NR==FNR {                 # count the first given file's records
  bnr=FNR
  next
}
{
  print                   # output records of a b c
  if(FNR==bnr)            # ... up to bnr records
    nextfile              # and skip to the next file
}
ENDFILE {                 # when you reach the end of a file
  if(bnr>FNR)             # but it had fewer than bnr records
    for(i=FNR;i<bnr;i++)  # loop some
      print               # and duplicate the last record of the file
}' b a b c                # first the file to count, then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
To set all files to the length of FileB and show them side by side, use:
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
As observed by the user umläute, the side-by-side output makes things even easier. However, they used empty lines to pad out short files. The following solution repeats the last line to make short files longer.
stretch() {
cat "$1"
yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk, where we read each file only a single time:
awk -v n=2 '
BEGIN {
  while (1) {
    for (i=1; i<ARGC; ++i) {
      if (b[i]=(getline tmp < ARGV[i])) a[i]=tmp   # a[i] keeps the last line when file i runs out
    }
    if (b[n]) for (i=1; i<ARGC; ++i) print a[i] > (ARGV[i] ".new")
    else break
  }
}' f1 f2 f3 f4 f5 f6
This works in the following way:
the lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files in the standard way (reading records and fields sequentially); instead, we use the BEGIN block, where we read the files in parallel.
We run an infinite loop while(1) that we break out of when the lead file has no more input.
Per cycle, we read a new line from each file using getline. If file i has a new line, we store it in a[i] and store the outcome of getline in b[i]. If file i has reached its end, a[i] keeps its last line.
We check the outcome of the lead file with b[n]. If we still read a line, we print all the stored lines to the files f1.new, f2.new, ...; otherwise, we break out of the infinite loop.
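Applied to the question's three files (my file names; -v n=2 makes the second file, fileB, the lead), the same program would be invoked as:
awk -v n=2 '
BEGIN {
  while (1) {
    for (i=1; i<ARGC; ++i)
      if (b[i]=(getline tmp < ARGV[i])) a[i]=tmp
    if (b[n]) for (i=1; i<ARGC; ++i) print a[i] > (ARGV[i] ".new")
    else break
  }
}' fileA fileB fileC
This writes fileA.new, fileB.new and fileC.new, each exactly as long as fileB.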

Reading two columns in two files and outputting them to another file

I recently posted this question - paste -d " " command outputting a return separated file
However, I am concerned there is formatting in the text files that is causing it to fail. For this reason I am attempting to do it with awk.
I am not very experienced with awk but currently I have the following:
awk {print $1} file1 | {print $1} file2 > file 3
Is this the kind of syntax I should be using? It gives an error saying "missing }". Each file contains a single column of numbers, and both files have the same number of rows.
Judging from your old post, it seems you could have Control-M (carriage return) characters in your files. To remove them, either use the dos2unix utility or use the following command(s).
1st: to remove the junk characters everywhere:
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
2nd: to remove them only at the end of lines:
awk '{sub(/\r$/,"")} 1' Input_file > temp_file && mv temp_file Input_file
I believe once you remove the junk characters, your paste command should work properly too. Run the following after you fix the Control-M characters in your input file(s):
paste -d " " Input_file1 Input_file2 > Output_file
Or, to simply concatenate the 2 files (assuming your input files have a single column, or that you want full lines in the output):
cat Input_file1 Input_file2 > output_file
awk to the rescue:
awk 'FNR==NR{a[FNR]=$1;next}{print a[FNR],$1}' a.txt b.txt > output.txt
a.txt:
1
2
3
4
5
b.txt:
A
B
C
D
E
output.txt:
1 A
2 B
3 C
4 D
5 E

Merge multiple files into a single row file with a delimiter

UPDATED QUESTION:
I have been working on a bash script that merges multiple text files containing numerical values into a single-row text file, with a delimiter between each file's values.
Example:
File1.txt has the following contents:
168321099
File2.txt has:
151304
151555
File3.txt has:
16980925
File4.txt has:
154292
149092
Now I want an output.txt file like below:
, 168321099 151304 151555 16980925 , 154292 149092
Basically, each file's values separated by spaces, all in a single row, with a comma as the 1st and 6th field of the output row.
I tried:
cat * > out.txt
but the output is not coming out as expected.
I am not sure if I understood your question correctly, but I interpreted it as follows:
The set of files file1,...,filen contains a set of words which you want printed on one single line.
Each word is space-separated.
In addition to the string of words, you want the first field to be a , and between words 4 and 5 you want another ,.
The cat+tr+awk solution:
$ cat <file1> ... <filen> | tr '\n' ' ' | awk '{$1=", "$1; $4=$4" ,"; print}'
The awk solution:
$ awk 'NR==1||NR==4{printf s",";s=" "}{printf " "$1}' <file1> ... <filen>
If tr is available on your system, you can do the following: cat * | tr "\n" " " > out.txt
tr "\n" " " translates all line breaks to spaces
If the number of lines per file is constant, then the easiest way is tr, as Littlefinix suggested, with a couple of process substitutions ("anonymous files") to supply the commas, and an echo at the end to add an explicit newline to the output line:
cat <(echo ",") File1.txt File2.txt File3.txt <(echo ",") File4.txt | tr "\n" " " > out.txt; echo >> out.txt
out.txt is exactly what you specified:
, 168321099 151304 151555 16980925 , 154292 149092
If the number of lines per input file might vary (e.g., File2.txt has 3 or 4 lines, etc.), then placing the commas always in the 1st and 6th fields will be more involved, and you'd probably need a script, not a one-liner; a sketch follows.
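A sketch of such a script (my assumption: the commas must land in output fields 1 and 6, i.e., the second comma always follows the first four values, however they are spread across the files):
cat File1.txt File2.txt File3.txt File4.txt |
awk '{ vals[++n]=$1 }
END {
  printf ","                # the comma is output field 1
  for (i=1; i<=n; i++) {
    printf " %s", vals[i]
    if (i==4) printf " ,"   # ... and output field 6
  }
  print ""
}'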
The following single awk could help you with the same:
awk 'FNR==1{count++;} {printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0)} END{print ""}' *.txt
Adding a non-one-liner form of the solution too:
awk '
FNR==1 { count++ }
{ printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0) }
END { print "" }
' *.txt

Need an awk script or any other way to do this on unix

I have a small file with around 50 lines and 2 fields, like below:
file1
-----
12345 8373
65236 7376
82738 2872
..
..
..
I have around 100 files which are comma-separated, as below:
file2
-----
1,3,4,4,12345,,,23,3,,,2,8373,1,1
Each file has many lines similar to the above line.
I want to extract, from all these 100 files, the lines whose
5th field is equal to the 1st field in the first file and whose
13th field is equal to the 2nd field in the first file.
I want to search all 100 files using that single file.
I came up with the below for the case of a single comma-separated file. I am not even sure whether it is correct!
But I have multiple comma-separated files.
awk -F"\t|," 'FNR==NR{a[$1$2]++;next}($5$13 in a)' file1 file2
Can anyone help me, please?
EDIT:
The above command works fine for a single file.
Here is another approach using an array, avoiding multiple work files:
#!/bin/awk -f
FILENAME == "file1" {   # the small key file
  keys[$1] = ""
  keys[$2] = ""
  next
}
{                       # every other (comma-separated) file
  split($0, fields, ",")
  if (fields[5] in keys && fields[13] in keys) print "*:", $0
}
I am using split because the field separators in the two files are different. You could swap it around if necessary. You should call the script thus:
runit.awk file1 file2
An alternative is to open the first file explicitly and read it with getline in a BEGIN block, as sketched below.
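A sketch of that alternative (the key file's name, file1, is hard-coded here, and getline is the mechanism awk actually provides for reading a file explicitly):
#!/bin/awk -f
BEGIN {
  while ((getline line < "file1") > 0) {   # read the key file up front
    split(line, f, " ")
    keys[f[1]] = ""
    keys[f[2]] = ""
  }
  close("file1")
}
{
  split($0, fields, ",")
  if (fields[5] in keys && fields[13] in keys) print "*:", $0
}
It would then be called with only the data files as arguments: runit.awk file2 file3 ...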
Here is a simple approach. Extract each line from the small file, split it into fields and then use awk to print lines from the other files which match those fields:
while read line
do
  f1=$(echo $line | awk '{print $1}')
  f2=$(echo $line | awk '{print $2}')
  awk -v f1="$f1" -v f2="$f2" -F, '$5==f1 && $13==f2' file*
done < small_file
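Note that this re-reads all 100 files once per line of the small file. A single-pass alternative in the spirit of the question's own attempt (a sketch; the (x,y) in array form keeps the two key fields paired, avoiding the ambiguity of concatenating them as $1$2):
awk 'FNR==NR { seen[$1,$2]; next }   # 1st (whitespace-separated) file: remember each field pair
($5,$13) in seen                     # remaining files: print lines whose 5th and 13th fields match a pair
' file1 FS=',' file2 file3 ...
The FS=',' between the file names switches the field separator to a comma before the comma-separated files are read.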
