Subtract fields from duplicate lines - bash

I have a file with two columns. The first column is a string, the second is a positive number. If the first field (string) has no duplicate in the file (so the first field is unique within the file), I want to copy that line to (let's say) result.txt. If the first field does have a duplicate in the file, I want to subtract the second fields (numbers) of those duplicated lines and save that result in result.txt as well. A first field will appear at most twice, never more. So the output file will contain every line whose first field is unique, plus, for each duplicated first field, one line with that name and the difference of the two numbers. The file is not sorted. Here is an example:
INPUT FILE:
hello 7
something 8
hey 9
hello 8
something 12
nathanforyou 23
OUTPUT FILE that I need (result.txt):
hello 1
something 4
hey 9
nathanforyou 23
I can't have negative numbers in the output file, so I have to subtract the smaller number from the bigger one. What have I tried so far? All kinds of sort (I figured out how to find the non-duplicated lines and put them in a separate file, but got stuck on subtracting the duplicates), arrays in awk (I saved all lines in an array and looped over it with a "for" clause; the problem is that I don't know how to get the second field out of an array element that holds a whole line), etc. By the way, the real problem is more complicated than described here (I have four fields, the first two are the ones that repeat, and so on), but in the end it comes down to this.

$ cat tst.awk
{ val[$1,++cnt[$1]] = $2 }           # remember each value, keyed by name and occurrence number
END {
    for (name in cnt) {
        if ( cnt[name] == 1 ) {      # name seen once: print the line as-is
            print name, val[name,1]
        }
        else {                       # name seen twice: print the absolute difference
            val1 = val[name,1]
            val2 = val[name,2]
            print name, (val1 > val2 ? val1 - val2 : val2 - val1)
        }
    }
}
$ awk -f tst.awk file
hey 9
hello 1
nathanforyou 23
something 4
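The asker mentions that the real data has four fields, with the first two being the ones that repeat. A minimal sketch of how the same idea extends, assuming the first two fields together form the key and the third field holds the number to subtract (the remaining field handling is left out), saved as, say, tst4.awk:

$ cat tst4.awk
{ key = $1 OFS $2; val[key,++cnt[key]] = $3 }   # assumption: fields 1+2 are the key, field 3 the number
END {
    for (key in cnt) {
        if (cnt[key] == 1) {
            print key, val[key,1]
        }
        else {
            val1 = val[key,1]
            val2 = val[key,2]
            print key, (val1 > val2 ? val1 - val2 : val2 - val1)
        }
    }
}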

Related

Comparing two files and printing 2nd column if value of column 1 (file1) is less than the value of column 1 in file 2

I got 2 files.
File 1:
abc 40
cde 50
efg 100
File 2:
cde 35
efg 100
abc 45
The output should be
cde "value is below normal"
If the value of column 2 of file 2 is less than the value of column 2 of file 1, it should print column 1 and the text "value is below normal".
I'm trying to do this with awk, but when I use the if condition I get a syntax error.
This is pretty straightforward:
#!/usr/bin/awk -f
NR == FNR {              # first file: remember its values
    first[$1] = $2
}
NR != FNR {              # subsequent file(s): remember their values
    second[$1] = $2
}
END {
    for (i in second) {
        if (second[i] < first[i])
            print i, "\"value is below normal\""
    }
}
The first pattern (NR == FNR) is an AWK trick that matches only while the first file is being read, and its rule adds all encountered values to the first array. The second pattern is the same, except that it matches only on the second (and 3rd, 4th, etc., though that doesn't matter here) file and adds values to the second array. After both files are processed, the last rule goes through all remembered values, compares them and prints the result as you asked.
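As a usage note, assuming the script above is saved as compare.awk (a placeholder name) and the two inputs are in files named file1 and file2, it can be run like this:

$ chmod +x compare.awk
$ ./compare.awk file1 file2
cde "value is below normal"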

bash: identifying the first value list that also exists in another list

I have been trying to come up with a nice way in bash to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend.
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
Note: check the previous version of this answer too, which assumed that the values are listed in a single file as two columns. This one was written after you clarified in a comment that the values are fed in as two files.
A few points are not clear, though, such as what should happen if a value in list A appears 2 times or more (in your given example, 7e474d46 itself comes 2 times). Assuming you need all the line numbers of list A whose values are present in list B, the following will help with that.
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
OR(NON-one liner form of above solution)
awk '{
    col1[$1] = col1[$1] ? col1[$1] "," FNR : FNR   # collect line numbers per column-1 value
    col2[$2]                                       # record which values appear in column 2
}
END {
    for (i in col1) {
        if (i in col2) {
            print col1[i], i
        }
    }
}
' Input_file
The above code will produce the following output.
3,5 7e474d46
6 1124dae9
This creates an array col1 whose index is the first field and an array col2 whose index is $2. col1's value is the current line number, concatenated onto any line numbers already stored for that index. In the END section the script traverses the col1 array and checks whether each index of col1 is also present in col2; if it is, it prints col1's value (the line numbers) and its index.
If you have GNU grep, you can try this:
grep -m 1 -f B A
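In that command, -f B reads each line of B as a pattern and -m 1 stops after the first matching line of A, which gives the first entry of A that also appears in B. Since the entries are fixed strings, adding -F (fixed-string matching) and -x (whole-line matching) can guard against accidental substring matches:

grep -m 1 -Fxf B A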

Awk script - loop through row values of two columns in a csv file [duplicate]

This question already has an answer here:
Shell script - loop through values in multiple columns of a csv file
I am working with a huge CSV file (filetest.csv) that contains two columns. In column 1, I want to read the current row and compare it with the value of the previous row. If it is greater or equal, I keep comparing; as soon as the value of the current row is smaller than that of the previous row, I jump to the second column and take the value of the current row there. Then I divide the larger (previous) value from column 1 by that column-2 value. Let me clarify with the table below: the first "smaller" value in column 1 is 327 (because 327 is smaller than the previous value 340), so we take 500 (the corresponding cell in column 2) and divide 340 by 500, which gives 0.68. My bash script should exit right after printing the value to the console.
338,800
338,550
339,670
340,600
327,500
301,430
299,350
284,339
284,338
283,335
283,330
283,310
282,310
282,300
282,300
283,290
In the following script, I tried doing the division on the two columns of the same row, and it works fine:
awk -F, '$1<p && $2!=0{
    val=$1/$2
    if(val>=0.85 && val<=0.9)
    {
        print "value is:" $1/p
        print "A"
    }
    else if(val==0.8)
    {
        print "B"
    }
    else if(val>=0.5 && val <=0.7)
    {
        print "C"
    }
    else if(val==0.5)
    {
        print "E"
    }
    else
    {
        print "D"
    }
    exit
}
{
    p=$1
}' filetest.csv
But how can we loop through the values in the two columns and apply the control statements to two different rows of the two columns, as I mentioned earlier?
From the first description:
awk -F, '$1<prev{print prev/$2;exit}{prev=$1}' <input.txt
At the end of each line, the 1st column is stored in prev.
Then, when the value of the 1st column is less than prev, the ratio is printed and the script exits.
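Running that one-liner against the question's filetest.csv (assuming the sample rows above are its contents) prints the expected ratio and stops at the first drop:

$ awk -F, '$1<prev{print prev/$2;exit}{prev=$1}' filetest.csv
0.68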

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies the first 4 numbers/letters of the line as corresponding to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., there are two 1435* stations). I need a way to search through my text files and identify whether there are any duplicates of the partial strings that represent the stations, so that I can delete one or both of the duplicates. If a bash script could output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
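That bit more work might look like the sketch below, which collects each line under its 4-character prefix and, at the end, prints the line numbers and contents for every prefix seen more than once:

awk '{
    key = substr($0, 1, 4)
    cnt[key]++
    lines[key] = (cnt[key] > 1 ? lines[key] ORS : "") FNR ": " $0
}
END {
    for (key in cnt)
        if (cnt[key] > 1)
            print lines[key]
}' inputfile.txt

For the example file above this prints:
3: 1435IPU1...
6: 1435IPD2...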
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++               # put prefixes to array and count them
}
END {                                 # in the end
    for (i in a) {                    # go thru all indexes
        if(a[i]>1) print i": "a[i]    # and print out the duplicate prefixes and their counts
    }
}
Slightly roundabout, but this should work:
cut -c 1-4 file.txt | sort -u > list
for i in $(cat list); do
    echo -n "$i "
    grep -c "^$i" file.txt   # This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
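Building on that loop, one way (just a sketch, using the same file name) to keep only the stations that occur more than once is to filter the loop's output through awk:

for i in $(cut -c 1-4 file.txt | sort -u); do
    echo -n "$i "
    grep -c "^$i" file.txt
done | awk '$2 > 1'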
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name, 'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line, treats the initial 4 characters of the line as the device name, and builds a dictionary named device whose keys are the device names and whose values are the line numbers where that device name is found.
The following would be the output:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

Calculating averages over ranges of patterns

I am very new to this kind of work, so please bear with me :) I am trying to calculate means over ranges of patterns. E.g. I have two files which are tab delimited:
The file coverage.txt contains two columns. The first column indicates the position and the second the value assigned to that position. There are ca. 4*10^6 positions.
coverage.txt
1 10
2 30
3 5
4 10
The second file, "patterns.txt", contains three columns: 1. the name of the pattern, 2. the starting position of the pattern, and 3. the end position of the pattern. The pattern ranges do not overlap. There are ca. 3000 patterns.
patterns.txt
rpoB 1 2
gyrA 3 4
Now I want to calculate the mean of the values assigned to the positions of the different patterns and write the output to a new file containing the first column of patterns.txt as an identifier.
output.txt
rpoB 20
gyrA 7.5
I think this can be accomplished using awk but I do not know where to start. Your help would be greatly appreciated!
With four million positions, it might be time to reach for a more substantial programming language than shell/awk, but you can do it in a single pass with something like this:
awk '{
    if (FILENAME ~ "patterns.txt") {
        min[$1] = $2
        max[$1] = $3
    } else {
        for (pat in min) {
            if ($1 >= min[pat] && $1 <= max[pat]) {
                total[pat] += $2
                count[pat] += 1
            }
        }
    }
}
END {
    for (pat in total) {
        print pat, total[pat]/count[pat]
    }
}' patterns.txt coverage.txt
This omits any patterns that don't have any data in the coverage file; you can change the loop in the END to loop over everything in the patterns file instead and just output 0s for the ones that didn't show up.
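A sketch of that variation, reusing the arrays from the script above: the END block loops over the patterns (via min) and prints 0 for any pattern that never appeared in the coverage file.

END {
    for (pat in min) {
        if (count[pat] > 0)
            print pat, total[pat]/count[pat]
        else
            print pat, 0
    }
}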
