Joining specific parts of text from two files into a third? - bash

My question is again about Linux shell programming. This time I have two text files, with about 17,000 lines in each.
In the first file the lines have this form:
[*] 11004, e01c5dee8efb188af91fb989a1039a12, isabelleann86#yahoo.com
Each line in the second file has this form:
e01c5dee8efb188af91fb989a1039a12:nathan09
Now I want to create a third file from these two, with lines of the form:
isabelleann86#yahoo.com:nathan09
But note: the hash e01c5dee8efb188af91fb989a1039a12 must correspond in both files; I do not want to end up pairing, say, email_1 with password_3421. I want the email from file one and the password from file two, taken from the lines that share the same hash value.
I know it is probably possible with a grep/awk combination, but I just do not know how to put it together.

Here's one way using awk with multiple delimiters:
awk -F "[ ,:]+" 'FNR==NR { a[$3]=$4; next } $1 in a { print a[$1], $2 }' OFS=":" file1 file2 > file3
Result (contents of file3):
isabelleann86#yahoo.com:nathan09

Using awk
awk 'NR==FNR{a[$(NF-1)]=$NF;next}
$1"," in a {print a[$1","] FS $NF}' file1 FS=: file2
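If both files can be reduced to "hash value" pairs and sorted, join offers an awk-light alternative (a sketch; the sample data comes from the question, and the intermediate names f1.sorted/f2.sorted are arbitrary):

```shell
# Sample inputs from the question
printf '%s\n' '[*] 11004, e01c5dee8efb188af91fb989a1039a12, isabelleann86#yahoo.com' > file1
printf '%s\n' 'e01c5dee8efb188af91fb989a1039a12:nathan09' > file2

# Reduce each file to "hash value" pairs, sorted on the hash
awk -F'[ ,:]+' '{print $3, $4}' file1 | sort -k1,1 > f1.sorted
awk -F':'      '{print $1, $2}' file2 | sort -k1,1 > f2.sorted

# join matches lines whose first fields (the hash) are equal; print email:password
join f1.sorted f2.sorted | awk '{print $2 ":" $3}' > file3
```

join silently drops hashes that appear in only one file, which is exactly the pairing behaviour the question asks for.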

Related

Converting from tsv to fasta

I have a bunch of TSV files in my folder, and for every one of them I would like to get a fasta file where the header after the '>' sign is the name of the file.
My TSV files have 5 columns and no header. So for an
input file called "A.coseq.table_headless.tsv":
HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
the output file, called "A.fasta", should be:
>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
I want to run the script over all the files at once in bash, and I have this script, which does not work because of the curly-brace shell variable inside the awk print statement:
for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta
done
Any idea how to correct this code?
Thank you very much
If the basename is the part up to the first ".", you can get rid of the loop as well:
awk '{split(FILENAME,base,".");
print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
The other solutions posted so far have a few issues:
- not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,
- calculating the output file name every time a line is read, rather than once when the input file is opened, is inefficient, and
- using an unparenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).
This will work robustly and efficiently in all awks:
awk '
FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
{ print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv
Another awk solution:
awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1);
printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv
pfx contains the first part of the filename (up to the first .).
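For reference, the original loop from the question can also be repaired directly: the shell variable never expands inside the single-quoted awk program, so pass it in with awk's -v option instead (a sketch that keeps the loop; the sample input is recreated from the question):

```shell
# Sample input file from the question
printf '%s\n' 'HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC' \
    > A.coseq.table_headless.tsv

for sample in *coseq.table_headless.tsv; do
    base1=$(basename "$sample" .coseq.table_headless.tsv)
    # -v makes the shell variable visible inside awk, so no quote-juggling is needed
    awk -v base="$base1" '{print ">" base "_" $3 "\n" $5}' "$sample" > "$base1.fasta"
done
```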

Fastest way to extract a column and then find its uniq items in a large delimited file

Hoping for help. I have a 3 million line file, data.txt, delimited with "|", e.g.:
"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"
"5"|"IMPEX ESSEN VERTRIEB VON WERKZEUGEN GMBH"|"DE"|"0"
I need to extract the 3rd column ("DE") and then limit it to its unique values. Here is what I've come up with (gawk and gsort because I'm running macOS and only have the "--parallel" option via GNU sort):
gawk -F "|" '{print $3}' data.txt \
| gsort --parallel=4 -u > countries.uniq
This works, but it isn't very fast. I have similar tasks coming up with some even larger (11M record) files, so I'm wondering if anyone can point out a faster way.
I hope to stay in shell, rather than say, Python, because some of the related processing is much easier done in shell.
Many thanks!
awk is tailor-made for such tasks. Here is a minimal awk program that could do the trick for you:
awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' logFile
The logic: as awk processes each line, it records the value of $3 only if it has not been seen before. The above prints the first line seen for each distinct $3, followed by the distinct $3 values themselves.
If you want the unique lines only, you can drop the END clause:
awk -F"|" '!($3 in arr){print} {arr[$3]++}' logFile > uniqueLinesOnly
If you want only the unique values from the file, remove the inner print:
awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' logFile > uniqueEntriesOnly
You will see how fast it is on an 11M-record file. You can write the output to a new file using the redirect operator.
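If only the distinct $3 values are needed and their input order is acceptable, a common single-pass awk idiom avoids both the END block and the external sort (a sketch over the sample rows from the question):

```shell
# Sample input from the question
printf '%s\n' '"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"' \
              '"5"|"IMPEX ESSEN VERTRIEB VON WERKZEUGEN GMBH"|"DE"|"0"' > data.txt

# seen[$3]++ is zero (false) only the first time a value appears,
# so each distinct third column prints exactly once, in first-seen order
awk -F'|' '!seen[$3]++ {print $3}' data.txt > countries.uniq
```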

compare 2 columns from 2 different csv files

My intention is to compare a particular column of 2 different csv files and get the data from the second file that is not in the first file. For example:
First File
"siddhartha",1
"mukherjee",2
Second file
"siddhartha",1
"mukherjee",2
"unique",3
Expected output
"unique",3
The command below works properly when the text in the first column is short, so in the above example it works:
awk -F',' 'FNR==NR{a[$1];next};!($1 in a);' file1.csv file2.csv > file3.csv
But if the text in the 1st column is quite large (for example 10,000 chars), it does not work; it cuts the text at a certain point.
Any solution for this?
Did you try the comm command?
Something like this: comm -23 file2.csv file1.csv
Please read about it in man comm. Note that both files should be sorted first;
you can use sort to do that.
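Spelled out, the comm route might look like this (a sketch over the sample files from the question; comm compares whole lines, which is fine here because a differing first column implies a differing line):

```shell
# Sample inputs from the question
printf '%s\n' '"siddhartha",1' '"mukherjee",2' > file1.csv
printf '%s\n' '"siddhartha",1' '"mukherjee",2' '"unique",3' > file2.csv

# comm requires sorted input
sort file1.csv > file1.sorted
sort file2.csv > file2.sorted

# -2 suppresses lines found only in file1.sorted, -3 suppresses common lines,
# leaving the lines that appear only in file2.sorted
comm -23 file2.sorted file1.sorted > file3.csv
```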
Maybe the below awk
awk 'BEGIN{FS=","};FNR==NR{a[$1];next};!($1 in a)' file1 file2
"unique",3
or
awk -F',' 'FNR==NR{a[$1];next};!($1 in a)' file1 file2
"unique",3
Set the field separator to comma and read each $1 value into a key

How to process many csv files into many separate dat files

I have several thousand csv files I wish to reformat. They all have a standard filename with incremental integer, eg. file_1.csv, file_2.csv, file_3.csv, and they all have the same format:
CH1
s,Volts
-1e-06,-0.0028,
-9.998e-07,-0.0032,
-9.99e-07,-0.0036,
This continues for 10,002 lines. I want to remove the header, and I want to separate the two columns into separate files. I have the following code, which produces the results I want for a single input file:
tail -10000 file_1.csv |
awk -F, '{print $1 > "s.dat"; print $2 > "Volts.dat"}'
However, I want something that will produce the equivalent files for each csv file (say, replacing s.dat with s_$i.dat or similar), but I'm not sure how to go about this, or how to take each csv file in turn rather than explicitly naming file_1.csv.
awk to the rescue!
awk -F, 'FNR>2{print $1 > ("s_" FILENAME ".dat");
print $2 > ("Volts_" FILENAME ".dat")}' file*
or reading the filename from the data files
$ awk -F, 'FNR==2{s="_"FILENAME".dat";h1=$1s;h2=$2s}
FNR>2{print $1 > h1; print $2 > h2}' file*
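One detail worth noting: FILENAME still contains the .csv extension, so the commands above create names like s_file_1.csv.dat. A variant that strips the extension once per file, with the redirection targets parenthesized for portability across awks (a sketch over a reconstructed sample file):

```shell
# Reconstructed sample input (two header lines plus two data rows)
printf '%s\n' 'CH1' 's,Volts' '-1e-06,-0.0028,' '-9.998e-07,-0.0032,' > file_1.csv

# FNR==1: compute the output basename once per input file
awk -F, 'FNR==1 {base=FILENAME; sub(/\.csv$/, "", base)}
         FNR>2  {print $1 > ("s_" base ".dat")
                 print $2 > ("Volts_" base ".dat")}' file_1.csv
```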

Bash script compare values from 2 files and print output values from one file

I have two files like this:
File1
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
File2
cl_id=1B3O7M6C8T4O1b559i2g930m0_1165d
cl_id=1X3J7M6J0W5S9535180h90302_101p5
cl_id=1G3D7X6V6A7R81356e3g527m9_101nl
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
cl_id=1Q3Y7Q7J0M3E62953e5g3g5k0_117p6
I want to find the cl_id values that exist in file1 but not in file2, and print the first field from those file1 lines (the IP address).
It should be like this:
114.4.21.198
114.4.21.205
114.4.21.205
114.4.21.213
114.4.23.70
114.4.21.201
114.4.21.211
120.172.168.36
I have tried awk, grep, diff, and comm, but nothing comes close. Please tell me the correct command to do this.
thanks
One proper way to do that is this:
grep -vFf file2 file1 | sed 's|,cl_id.*$||'
I do not see how you get your output, though. Where does 120.172.168.36 come from?
Here is one awk solution that compares on the cl_id field:
awk -F, 'NR==FNR {a[$0]++;next} !a[$2] {print $1}' file2 file1
114.4.21.205
Feed both files into AWK or perl with field separator=",". If there are two fields, add the fields to a dictionary/map/two arrays/whatever ("file1Lines"). If there is just one field (this is file 2), add it to a set/list/array/whatever ("file2Lines"). After reading all input:
Loop over the file1Lines. For each element, check whether the key part is present in file2Lines. If not, print the value part.
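The description above can be sketched in awk: both files are fed in together, one-field lines (file2) fill the lookup set, and the file1 pairs are checked at the end (a sketch; the sample data is trimmed from the question and missing_ips.txt is an arbitrary output name):

```shell
# Trimmed sample inputs from the question
printf '%s\n' '114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps' \
              '114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf' > file1
printf '%s\n' 'cl_id=1J3W7P7H0S3L6g85900g736h6_101ps' > file2

# NF==1: a file2 line, remember its cl_id.  Otherwise: a file1 "ip,cl_id" pair.
# In END, print the IP of every file1 pair whose cl_id was never seen in file2.
awk -F, 'NF==1 {seen[$1]; next}
         {ip[NR]=$1; id[NR]=$2}
         END {for (n in ip) if (!(id[n] in seen)) print ip[n]}' file1 file2 > missing_ips.txt
```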
This seems like what you want to do and should work efficiently:
grep -vFf file2.txt file1.txt | cut -f1 -d,
First the grep takes the lines from file2.txt to use as patterns and, thanks to -v, keeps only the lines of file1.txt that match none of them. The -F treats the patterns as literal strings rather than regular expressions, though it doesn't really matter with your sample.
Finally the cut takes the first column from the output, using , as the column delimiter, resulting in a list of IP addresses.
The output is not exactly the same as your sample, but the sample didn't make sense anyway, as it contains text that was not in any of the input files. Not sure if this is what you wanted or something more.
