I have been trying to come up with a nice way in Bash to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend.
awk 'NR==FNR{a[$1];next} $1 in a{print $1;exit}' file2 file1
7e474d46
Note: See the [ previous version ] of this answer too, which assumed that the values are listed in a single file as two columns. This one was written after you clarified in [ this ] comment that the values come from two files.
A few points are not clear, though: what should happen if a value in list A occurs two or more times? (In your own example, 7e474d46 occurs twice.) Assuming you need the line numbers of every entry in list A that is also present in list B, the following will help.
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
OR (non-one-liner form of the above solution)
awk '{
col1[$1]=col1[$1]?col1[$1]","FNR:FNR;
col2[$2];
}
END{
for(i in col1){
if(i in col2){
print col1[i],i
}
}
}
' Input_file
The above code produces the following output:
3,5 7e474d46
6 1124dae9
This creates an array col1 indexed by the first field and an array col2 indexed by the second field ($2). The value stored in col1 is the current line number (FNR), appended with a comma to any line numbers already stored for that index. In the END section, awk traverses col1 and checks whether each index of col1 is also an index of col2; if so, it prints col1's value (the line numbers) and the index.
If you have GNU grep, you can try this:
grep -m 1 -f B A
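A possible hardening of the grep approach, as a sketch only: by default `-f B` treats each line of B as a regular expression and also matches substrings, so adding `-F` (fixed strings) and `-x` (whole-line match) restricts the result to exact entries. The function name `first_common` is purely illustrative, and `-m` is not specified by POSIX, so this assumes a grep that supports it (GNU grep does).

```shell
# first_common LIST_A LIST_B
# Print the first line of LIST_A that also appears, as a whole line, in LIST_B.
first_common() {
    # -m 1    : stop after the first match
    # -F      : treat patterns as fixed strings, not regular expressions
    # -x      : a pattern must match the entire line
    # -f "$2" : read the patterns from LIST_B
    grep -m 1 -F -x -f "$2" "$1"
}
```

It would be called as `first_common A B`. Because grep stops reading A at the first hit, the cost depends on how early the first common entry appears.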
I would like to split a Matrix Market file into two parts of different sizes. The sizes should correspond to the number of rows of the matrix represented by the hash in the Matrix Market format.
There are some examples:
http://math.nist.gov/MatrixMarket/formats.html
For a usual matrix file format of, e.g., 100 rows it is quite easy:
head -n 70 matrix1.mtx > matrix170.mtx
tail -n 30 matrix1.mtx > matrix130.mtx
where matrix170.mtx contains the first 70 lines of matrix1.mtx, and so on.
Thank you.
awk to the rescue!
You can use this script to split the matrix file unevenly:
awk -v splits='70 30' 'BEGIN{n=split(splits,s);i=1;limit=s[i]}
NR==1{split(FILENAME,f,".")}
# past the current chunk: move to the next split size and extend the limit
NR>limit{limit+=s[++i]}
i>n{exit}
{print > (f[1] s[i] "." f[2])}' matrix.mtx
This will generate two files, matrix70.mtx and matrix30.mtx, whose names are derived from the input file name and the split values.
Suppose I have two files
$A
a b
1 5
2 6
3 7
4 8
$B
a b
1 5
2 6
5 6
My question is: in a shell or terminal, how can I count how many values of B's first column (1, 2, 5) appear in A's first column (1, 2, 3, 4)? Here the answer is 2 (1 and 2).
The following awk solution counts column1 entries of file2 in file1:
awk 'FNR==1{next}NR==FNR{a[$1]=$2;next}$1 in a{count++}END{print count+0}' file1 file2
2
Skip the first line of both files using FNR==1{next}. You can remove this if your actual data files have no header line (a b).
Read the entire first file into an array using NR==FNR{a[$1]=$2;next}. I am assigning column 2 here in case you wish to extend the solution to match both columns. You can also use a[$1]++ if you are not interested in column 2 at all; it won't hurt either way.
If the value of column 1 from the second file is in our array, increment a count variable.
In the END block, print the count variable (count+0 prints 0 rather than an empty line when nothing matches).
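The answer mentions extending the solution to match both columns. One way that might look, as a sketch rather than the answerer's own code, is to key the array on the pair of columns instead of on column 1 alone. The function name `count_pairs` and the array name `seen` are illustrative.

```shell
# count_pairs FILE1 FILE2
# Count rows of FILE2 whose (column 1, column 2) pair also occurs in FILE1.
count_pairs() {
    awk '
        FNR == 1 { next }                 # skip the header line of each file
        NR == FNR { seen[$1, $2]; next }  # first file: remember each pair
        ($1, $2) in seen { count++ }      # second file: was this pair seen?
        END { print count + 0 }           # + 0 prints 0 instead of an empty line
    ' "$1" "$2"
}
```

With the sample files from the question, `count_pairs A B` counts the rows `1 5` and `2 6`. Using a composite key means rows in the first file that share a column 1 value but differ in column 2 do not overwrite each other.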
Say I have a text file items.txt
Another Purple Item:1:3:01APR13
Another Green Item:1:8:02APR13
Another Yellow Item:1:3:01APR13
Another Orange Item:5:3:04APR13
Here the 2nd column is the price and the 3rd is the quantity. How could I loop through this in bash so that each unique date gets a total of price * quantity?
Try this awk one-liner:
awk -F: '{v[$NF]+=$2*$3}END{for(x in v)print x, v[x]}' file
result:
01APR13 6
04APR13 15
02APR13 8
EDIT: sorting
As I commented, there are two approaches to sorting the output by date; I'll just take the simpler one ^_^:
kent$ awk -F: '{ v[$NF]+=$2*$3}END{for(x in v){"date -d\""x"\" +%F"|getline d;print d,x,v[x]}}' file|sort|awk '$0=$2" "$3'
01APR13 6
02APR13 8
04APR13 15
Take a look at bash's associative arrays.
You can create a map of dates to values, then go over the lines, calculate price*quantity and add it to the value currently in the map or insert it if non-existent.
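A minimal sketch of that idea, assuming bash 4 or later (associative arrays are not available in older versions) and integer prices and quantities, since bash arithmetic has no floating point. The function name `totals_by_date` is illustrative.

```shell
#!/usr/bin/env bash
# Requires bash 4 or later for associative arrays.

# totals_by_date FILE
# Print "date total" for each distinct date, where total is the sum of
# price * quantity over all lines with that date. Output order is unspecified.
totals_by_date() {
    local -A totals=()            # map: date -> running total
    local name price qty date
    # Split each line on ":" into its four fields.
    while IFS=: read -r name price qty date; do
        [ -n "$date" ] || continue   # skip blank or malformed lines
        # Start from 0 the first time a date is seen, then accumulate.
        totals[$date]=$(( ${totals[$date]:-0} + price * qty ))
    done < "$1"
    for date in "${!totals[@]}"; do
        printf '%s %s\n' "$date" "${totals[$date]}"
    done
}
```

It would be called as `totals_by_date items.txt`. The iteration order of an associative array is not defined, so pipe the result through `sort` if a stable order matters.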
I am trying to resolve locations in lat and long in one file to a couple of named fields in another file.
I have one file that is like this..
f1--f2--f3--------f4-------- f5---
R 20175155 41273951N078593973W 18012
R 20175156 41274168N078593975W 18000
R 20175157 41274387N078593976W 17999
R 20175158 41274603N078593977W 18024
R 20175159 41274823N078593978W 18087
Each character is in a specific place so I need to define fields based on characters.
f1 char 18-21; f2 char 22 - 25; f3 char 26-35; f4 char 36-45; f5 char 62-66.
I have another much larger csv file that has fields 11, 12, and 13 to correspond to f3, f4, f5.
awk -F',' '{print $11, $12, $13}'
41.46703821 -078.98476926 519.21
41.46763555 -078.98477791 524.13
41.46824123 -078.98479015 526.67
41.46884129 -078.98480615 528.66
41.46943371 -078.98478482 530.50
I need to find the closest match to file 1 field 1 && 2 in file 2 field 11 && 12;
When the closest match is found I need to insert field 1, 2, 3, 4, 5 from file 1 into file 2 field 16, 17, 18, 19, 20.
As you can see the format is slightly different. File 1 breaks down like this..
File 1
f3-------f4--------
DDMMSSdd DDDMMSSdd
41273951N078593973W
File 2
f11-------- f12---------
DD dddddddd DDD dddddddd
41.46703821 -078.98476926
N means f3 is a positive number, W means f4 is a negative number.
I changed file 1 with sed, ridiculous one liner that works great.. (better way???)
cat $file1 |sed 's/.\{17\}//' |sed 's/\(.\{4\}\)\(.\{4\}\)\(.\{9\}\)\(.\)\(.\{9\}\)\(.\)\(.\{16\}\)\(.\{5\}\)/\1,\2,\3,\4,\5,\6,\8/'|sed 's/\(.\{10\}\)\(.\{3\}\)\(.\{2\}\)\(.\{2\}\)\(.\{2\}\)\(.\{3\}\)\(.\{3\}\)\(.\{2\}\)\(.*\)/\1\2,\3,\4.\5\6\7,\8\9/'|sed 's/\(.\{31\}\)\(.\{2\}\)\(.*\)/\1,\2.\3/'
2017,5155, 41,27,39.51,N,078,59,39.73,W,18012
2017,5156, 41,27,41.68,N,078,59,39.75,W,18000
2017,5157, 41,27,43.87,N,078,59,39.76,W,17999
2017,5158, 41,27,46.03,N,078,59,39.77,W,18024
2017,5159, 41,27,48.23,N,078,59,39.78,W,18087
Now I have to convert the formats.. (RESOLVED this (see below)--problem -- The numbers are rounded off too far. I need to have at least six decimal places.)
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 2) printf ($i","); else if (i == 3&&$6 == "S") printf("-"$3+($4/60)+($5/3600)","); else if (i == 3&&$6 == "N") printf($3+($4/60)+($5/3600)","); else if (i == 7&&$10 == "W") printf("-"$7+($8/60)+($9/3600)","); else if (i == 7&&$10 == "E") printf($7+($8/60)+($9/3600)","); if (i == 11) printf ($i"\n")}}'
2017,5155,41.461,-78.9944,18012
2017,5156,41.4616,-78.9944,18000
2017,5157,41.4622,-78.9944,17999
2017,5158,41.4628,-78.9944,18024
2017,5159,41.4634,-78.9944,18087
That's where I'm at.
RESOLVED THIS
*I need to get the number format to have at least 6 decimal places from this formula.*
printf($3+($4/60)+($5/3600))
Added "%.8f"
printf("%.8f", $3+($4/60)+($5/3600))
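As a possible alternative to the chain of sed commands, the fixed-width record could be parsed and converted in a single awk pass using substr. This is only a sketch: the column positions are inferred from the field description and the sed command above (latitude in columns 26-34 with a leading blank, hemisphere letter in 35, longitude in 36-44, hemisphere letter in 45), and the function name `dms_to_decimal` is illustrative.

```shell
# dms_to_decimal FILE
# Read fixed-width records and print f1,f2,lat,lon,f5 with latitude and
# longitude as signed decimal degrees (8 decimal places).
dms_to_decimal() {
    awk '{
        # latitude: a blank, then DDMMSSdd; hemisphere letter in column 35
        lat  = substr($0, 27, 2) + substr($0, 29, 2) / 60
        lat += (substr($0, 31, 2) "." substr($0, 33, 2)) / 3600
        if (substr($0, 35, 1) == "S") lat = -lat
        # longitude: DDDMMSSdd; hemisphere letter in column 45
        lon  = substr($0, 36, 3) + substr($0, 39, 2) / 60
        lon += (substr($0, 41, 2) "." substr($0, 43, 2)) / 3600
        if (substr($0, 45, 1) == "W") lon = -lon
        printf "%s,%s,%.8f,%.8f,%s\n", substr($0, 18, 4), substr($0, 22, 4), lat, lon, substr($0, 62, 5)
    }' "$1"
}
```

Negating the numeric value for S and W avoids building the sign by string concatenation, and the "%.8f" format keeps the precision that was lost in the earlier output.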
Next issue will be to match the fields file 1 f3 and f4 to the closest match in file 2 f11 and f12.
Any ideas?
Then I will need to calculate the distance between the fields.
In Excel the formula would be like this..
=ATAN2(COS(lat1)*SIN(lat2)-SIN(lat1)*COS(lat2)*COS(lon2-lon1), SIN(lon2-lon1)*COS(lat2))
What could I use for that calculation?
*UPDATE---
I am looking at a short distance for the matching locations. I was thinking about applying something simple like Pythagoras’ theorem for the nearest match. Maybe even use less decimal places. It's got to be many times faster.
maybe something like this..*
x = (lon2-lon1) * Math.cos((lat1+lat2)/2);
y = (lat2-lat1);
d = Math.sqrt(x*x + y*y) * R;
Then I could do the heavy calculations required for greater accuracy after the final file is updated.
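The approximation above could be written in awk roughly as follows. This is a sketch under a few assumptions: the inputs are decimal degrees and are converted to radians first, R is taken as 6371 km (a commonly used mean Earth radius), and the function name `approx_km` is illustrative.

```shell
# approx_km LAT1 LON1 LAT2 LON2
# Equirectangular approximation of the distance in kilometres between two
# points given in decimal degrees. Intended for short distances only.
approx_km() {
    awk -v lat1="$1" -v lon1="$2" -v lat2="$3" -v lon2="$4" 'BEGIN {
        pi  = atan2(0, -1)      # awk has no built-in pi constant
        rad = pi / 180          # degrees to radians
        R   = 6371              # assumed mean Earth radius in km
        x = (lon2 - lon1) * rad * cos((lat1 + lat2) / 2 * rad)
        y = (lat2 - lat1) * rad
        printf "%.6f\n", sqrt(x * x + y * y) * R
    }'
}
```

For example, `approx_km 41.46703821 -078.98476926 41.46763555 -078.98477791` prints the approximate distance between the first two sample points of file 2. If the value is only used to rank candidates, the multiplication by R and even the square root could be dropped without changing the ordering.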
Thanks
You can't do the distance calculation after you perform the closest match: closest is defined by comparison of the distance values. Awk can evaluate the formula that you want (looks like great-circle distance?). Take a look at this chapter to see what you need.
The big problem is finding the nearest match. Write an awk script that takes a single line of file 1 and outputs the lines in file 2 with an extra column. That column is the calculation of the distance between the pair of points according to your distance formula. If you sort that file numerically (sort -n) then your closest match is at the top. Then you need a script that loops over each line in file 1, calls your awk script, uses head -n1 to pull out the closest match and then output it in the format that you want.
This is all possible in bash and awk, but it would be a much simpler script in Python. Depends on which you prefer.
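The approach described above could be sketched in bash and awk along these lines. Several things here are assumptions, not part of the original posts: file 1 is taken to be the already converted CSV (f1,f2,lat,lon,f5), the distance is the short-range approximation proposed in the question with R = 6371 km, the file 1 fields are simply appended to the matched line rather than placed in fields 16 to 20, and the function names `nearest_line` and `match_all` are illustrative.

```shell
# nearest_line LAT LON FILE2
# Print the line of FILE2 (CSV, latitude in field 11, longitude in field 12)
# closest to the given point, prefixed with the approximate distance in km.
nearest_line() {
    awk -F',' -v lat1="$1" -v lon1="$2" '
        BEGIN { rad = atan2(0, -1) / 180; R = 6371 }   # R: assumed Earth radius in km
        {
            x = ($12 - lon1) * rad * cos((lat1 + $11) / 2 * rad)
            y = ($11 - lat1) * rad
            # emit the distance as an extra first column
            printf "%.6f,%s\n", sqrt(x * x + y * y) * R, $0
        }
    ' "$3" | sort -t',' -k1,1n | head -n 1
}

# match_all FILE1 FILE2
# For each row of FILE1 (f1,f2,lat,lon,f5), print the nearest FILE2 line
# followed by the FILE1 fields.
match_all() {
    local f1 f2 lat lon f5
    while IFS=',' read -r f1 f2 lat lon f5; do
        printf '%s,%s,%s,%s,%s,%s\n' \
            "$(nearest_line "$lat" "$lon" "$2")" "$f1" "$f2" "$lat" "$lon" "$f5"
    done < "$1"
}
```

This rereads and sorts file 2 once per line of file 1, so it follows the description literally rather than efficiently; for large inputs, tracking the minimum inside a single awk pass would likely be much faster.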