Reading two columns in two files and outputting them to another file - bash

I recently posted this question - paste -d " " command outputting a return separated file
However, I am concerned that formatting in the text files is causing it to fail. For this reason I am attempting to do it with awk.
I am not very experienced with awk but currently I have the following:
awk {print $1} file1 | {print $1} file2 > file 3
Is this the kind of syntax I should be using? It gives an error saying missing }. Each file contains a single column of numbers, and both files have the same number of rows.

Judging by your old post, it seems you could have control-M (carriage-return) characters in your files. To remove them, either use the dos2unix utility or use the following command(s).
1st: To remove the carriage returns everywhere in the file:
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
2nd: To remove them only at the ends of lines:
awk '{sub(/\r$/,"")} 1' Input_file > temp_file && mv temp_file Input_file
I believe once you remove the stray characters your paste command should work properly too. Run the following after you fix the control-M characters in your Input_file(s):
paste -d " " Input_file1 Input_file2 > Output_file
OR, to simply concatenate the 2 files (assuming your Input_files have a single column, or you want full lines in the output):
cat Input_file1 Input_file2 > output_file
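If you would rather not touch the files on disk, the carriage-return stripping and the paste can be combined in one step with process substitution (a bash-specific sketch; the demo input here is illustrative):

```shell
# Demo input: simulate two files with CRLF ("control M") line endings.
printf '1\r\n2\r\n3\r\n' > Input_file1
printf 'A\r\nB\r\nC\r\n' > Input_file2

# Strip the carriage returns on the fly and paste the cleaned streams,
# leaving the files on disk untouched.
paste -d ' ' <(tr -d '\r' < Input_file1) <(tr -d '\r' < Input_file2) > Output_file
```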

awk to the rescue:
awk 'FNR==NR{a[FNR]=$1;next}{print a[FNR],$1}' a.txt b.txt > output.txt
a.txt:
1
2
3
4
5
b.txt:
A
B
C
D
E
output.txt:
1 A
2 B
3 C
4 D
5 E
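The one-liner above assumes both files have the same number of rows. If they might differ, a variant of the same idea (a sketch; the n and m bookkeeping variables are additions, not part of the original answer) also prints any leftover lines from the first file:

```shell
# Demo data matching the example above, with a.txt two lines longer.
printf '1\n2\n3\n4\n5\n' > a.txt
printf 'A\nB\nC\n' > b.txt

# Cache a.txt by line number; while reading b.txt, print both columns.
# n remembers a.txt's length, m how far b.txt reached, so any leftover
# a.txt lines still come out at the end.
awk 'FNR==NR { a[FNR]=$1; n=FNR; next }
     { print a[FNR], $1; m=FNR }
     END { for (i=m+1; i<=n; i++) print a[i] }' a.txt b.txt > output.txt
```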

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing 2 files. The files use the field separator ":", and only the part of each line before it should be compared.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before the : should be compared.
Is it possible to complete this task in 1 command?
I tried to complete the task this way, but the field separator does not work in that situation:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: awk holds these arrays in memory, which can be a problem with very big files. In real life, with big files, better use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
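One caveat with comm -3: lines unique to the second file come out indented with a tab, which then fail to match as grep patterns. Since only the keys absent from file2 matter here, comm -23 (suppress the file2-only and common columns) is a cleaner fit; a sketch, assuming bash and that the keys contain no regex metacharacters:

```shell
# Demo data from the question.
printf 'apple:tasty\napple:red\norange:nice\nkiwi:awesome\nkiwi:expensive\nbanana:big\ngrape:green\norange:oval\nbanana:long\n' > file1
printf 'orange:nice\nbanana:long\n' > file2

# Keys present in file1 but not file2, anchored so a key only matches
# at the start of a line, then the corresponding file1 lines.
comm -23 <(cut -d: -f1 file1 | sort -u) <(cut -d: -f1 file2 | sort -u) |
  sed 's/.*/^&:/' |
  grep -f - file1 > outputfile
```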

Merge multiple files into a single-row file with a delimiter

UPDATED QUESTION:
I have been working on a bash script that will merge multiple text files with numerical values into a single-row text file, using a delimiter between each file's values while merging.
Example:
File1.txt has the following contents:
168321099
File2.txt has:
151304
151555
File3.txt has:
16980925
File4.txt has:
154292
149092
Now I want an output.txt file like below:
, 168321099 151304 151555 16980925 , 154292 149092
Basically, each file's values are space-delimited on a single row, with a comma as the first field and another comma as the sixth field of the output row.
I tried:
cat * > out.txt, but the result is not as expected.
I am not sure I understood your question correctly, but I interpreted it as follows:
The set of files file1, ..., filen contains a set of words which you want printed on one single line.
Each word is space-separated.
In addition to the string of words, you want the first character to be a , and between words 4 and 5 you want a ,.
The cat+tr+awk solution:
$ cat <file1> ... <filen> | tr '\n' ' ' | awk '{$1=", "$1; $4=$4" ,"; print}'
The awk solution:
$ awk 'NR==1||NR==5{printf s",";s=" "}{printf " "$1} END{print ""}' <file1> ... <filen>
(The comma must be emitted just before word 5 is printed, hence NR==5, not NR==4; the END block adds the final newline.)
If tr is available on your system you can do the following cat * | tr "\n" " " > out.txt
tr "\n" " " translates all line breaks to spaces
If the number of lines per file is constant, then the easiest way is tr, as @Littlefinix suggested, with a couple of anonymous files to supply the commas, and an echo at the end to add an explicit newline to the output line:
cat <(echo ",") File1.txt File2.txt File3.txt <(echo ",") File4.txt | tr "\n" " " > out.txt; echo >> out.txt
out.txt is exactly what you specified:
, 168321099 151304 151555 16980925 , 154292 149092
If the number of lines per input file might vary (e.g., File2.txt has 3 or 4 lines, etc.), then placing the commas always in the 1st and 6th field will be more involved, and you'd probably need a script and not a one-liner.
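If the grouping is fixed by file (comma, files 1 to 3, comma, file 4) rather than by field position, a short plain-shell sketch also works even when the line counts vary; paste -s joins each stream into one line with no trailing blank:

```shell
# Demo files from the question.
printf '168321099\n' > File1.txt
printf '151304\n151555\n' > File2.txt
printf '16980925\n' > File3.txt
printf '154292\n149092\n' > File4.txt

# paste -sd' ' serializes its input into one space-separated line.
printf ', %s , %s\n' \
  "$(cat File1.txt File2.txt File3.txt | paste -sd' ' -)" \
  "$(paste -sd' ' File4.txt)" > out.txt
```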
The following single awk could also help here:
awk 'FNR==1{count++;} {printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0)} END{print ""}' *.txt
Here is a non-one-liner form of the same solution:
awk '
FNR==1 { count++ }
{ printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0) }
END { print "" }
' *.txt

bash: using 2 variables from same file and sed

I have 2 files:
file1.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079
rs111285978:45000103:A:AT 45000103
rs190363568:45000168:C:T 45000168
file2.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069
rs111285978:45000103:A:AT rs111285978
rs190363568:45000168:C:T rs190363568
Using file2.txt, I want to replace the names (column 1 of file1.txt, which is also column 1 of file2.txt) with the corresponding entry in column 2 of file2.txt. The output file would then be:
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
I have tried inputting the columns of file2.txt, but without success:
while read -r a b
do
cat file1.txt | sed s'/$a/$b/'
done < file2.txt
I am quite new to bash. Also, I am not sure how to write the output of my command to a file. Any help would be deeply appreciated.
In your case, using awk or perl would be easier, if you are willing to accept an answer without sed:
awk '(NR==FNR){out[$1]=$2;next}{out[$1]=out[$1]" "$2}END{for (i in out){print out[i]} }' file2.txt file1.txt > output.txt
output.txt :
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
Note: this assumes all symbols in column 1 are unique and that they are all present in both files. Also, for (i in out) iterates in an unspecified order, so the output lines may not come out in the input order.
explanation:
(NR==FNR){out[$1]=$2;next} : while you are parsing the first file, create a map with the name from the first column as key
{out[$1]=out[$1]" "$2} : append the value from the second column
END{for (i in out){print out[i]} } : print all the values in the map
Apparently $2 of file2 is part of $1 of file1, so you could use awk and redefine FS:
$ awk -F"[: ]" '{print $1,$NF}' file1
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
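Since the question specifically mentioned sed: the replacement name is just the part of file1's first column before the first colon, so file2.txt is not even needed. A sketch, assuming the long name never contains a space:

```shell
# Demo file1 from the question.
printf 'rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079\nrs111285978:45000103:A:AT 45000103\nrs190363568:45000168:C:T 45000168\n' > file1.txt

# Delete everything from the first ":" through the end of that field
# (the run of non-space characters), leaving "rsID position".
sed 's/:[^ ]*//' file1.txt > output.txt
```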

Printing numerous specific lines from file using awk or sed command loop

I've got this big txt file with ID names. It has 2500 lines, one column. Let's call it file.txt
H3430
H3467
H9805
Also, I've got another file, index.txt, which has 390 numbers:
1
4
9
13
15
Those numbers are the number of lines (of IDs) I have to extract from file.txt. I need to generate another file, newfile.txt let's call it, with only the 390 IDs that are in the specific lines that index.txt demands (the first ID of the list, the fourth, the ninth, and so on).
So, I tried to do the following loop, but it didn't work.
num=$'index.txt'
for i in num
do
awk 'NR==i' "file.txt" > newfile.txt
done
I'm a noob regarding these matters, so I need some help, whether it is fixing my loop or a new solution you suggest. Thank you :)
Let's create an example file that simulates your 2500-line file with seq:
$ seq 2500 > /tmp/2500
And put the example line numbers you want to print into a file called 390:
$ echo "1
4
9
13
15" > /tmp/390
You can print the selected lines of the file 2500 by reading the line numbers into an array and printing each line whose number is in that array:
$ awk 'NR==FNR{ a[$1]++; next} a[FNR]' /tmp/390 /tmp/2500
You can also use a sed command file:
$ sed 's/$/p/' /tmp/390 > /tmp/sed_cmd
$ sed -n -f /tmp/sed_cmd /tmp/2500
With GNU sed, you can do sed 's/$/p/' /tmp/390 | sed -n -f - /tmp/2500 but that does not work on OS X :-(
You can do this tho:
$ sed -n -f <(sed 's/$/p/' /tmp/390) /tmp/2500
You can read the index.txt file into a map and then compare it with the line number of file.txt, redirecting the output to another file.
awk 'NR==FNR{line[$1]; next}(FNR in line){print $1}' index.txt file.txt > newfile.txt
When you work with two files, using FNR is necessary, as it resets to 1 when a new file starts (whereas NR keeps incrementing across files).
As Ed Morton suggests in the comments, the command can then be refined further by removing {print $1}, since awk prints the whole line by default when the condition is true:
awk 'NR==FNR{line[$1]; next} FNR in line' index.txt file.txt > newfile.txt
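A quick self-contained check of that command (the demo data just stands in for the real 2500-line ID file):

```shell
# H1..H10 stand in for the real IDs; pick lines 1, 4 and 9.
seq 10 | sed 's/^/H/' > file.txt
printf '1\n4\n9\n' > index.txt

awk 'NR==FNR{line[$1]; next} FNR in line' index.txt file.txt > newfile.txt
```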
If index.txt is sorted, we could walk file.txt in order.
That reduces the number of actions to the very minimum (faster script):
awk '
BEGIN {
    indexfile = "index.txt"
    if ((getline ind < indexfile) <= 0) {
        printf("Empty %s; exiting\n", indexfile)
        exit
    }
}
{
    if (FNR < ind) next
    if (FNR == ind) printf("%s %s\n", ind, $0)
    if ((getline ind < indexfile) <= 0) exit
}' file.txt
If the file is not actually sorted, get it quickly sorted numerically with sort (-o lets sort write back to the same file safely):
sort -n index.txt -o index.txt

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
I need a way to use cat, grep or awk to produce the following output:
test1
My2
Hi3
How can I achieve this in a single command? Something like:
cat file.txt | grep ... | awk ...
Note that it is always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question mentions cat and grep. Both of those programs are unnecessary to achieve your stated aim. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
If your input file always alternates between the two kinds of lines and you only want one of them, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the even lines, this becomes awk 'NR % 2 == 0' file.txt
Here is an answer using awk alone (the leading cat is unnecessary, and NR can replace the manual line counter):
awk '{ if (NR % 2) printf "%s", $0; else printf "%s\n", $0 }' file.txt
