SPLIT file by Script (bash, cpp) - numbers in columns - bash

I have files with several columns of floating-point numbers. I need to split these files according to the value in one of the columns (it can be set as the first one). That is, given a line
a b c
in my file, if the value c fulfils 0.05<=c<=0.1, then create a file named c and copy into it every whole line that fulfils the c-condition.
Is this possible? I can do a little with bash and awk, and a bit with C++.
I have searched for solutions. I could of course sort the data and read only the first number of each line, but beyond that
I don't know.
Please help.
Thank you
Jane

As you mentioned awk, the basic rule in awk is 'match a line (either by default or with a regexp, condition or line number)' AND 'do something because you found a match'.
awk uses values like $1, $2, $3 to indicate which column in the current line of data it is looking at. $0 refers to the whole line. So ...
awk '
BEGIN {
    afile = "afile.txt"
    bfile = "bfile.txt"
    cfile = "cfile.txt"
}
{
    # test c value between .05 and .1
    if ($3 >= 0.05 && $3 <= 0.1) print $0 > cfile
}' inputData
Note that I am testing the value of the third column (c in your example). You can use $2 to test b column, etc.
If you aren't familiar with the sort of condition test I have included ($3 >= 0.05 && $3 <= 0.1), you'll have some learning ahead of you.
Questions in the form of 1. I have this input, 2. I want this output. 3. (but) I'm getting this output, 4. with this code .... {code here} .... have a much better chance of getting a reasonable response in a reasonable amount of time ;-)
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.

If I understand your requirements correctly:
awk '{print > $3}' file ...
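A minimal sketch of the range test combined with a chosen output file, assuming whitespace-separated columns. The function name split_by_range and its arguments are hypothetical, picked only for illustration; they are not from the question or the answers above.

```shell
# Hypothetical helper: copy every line whose column COL lies in [LO, HI]
# into OUT. All names here are illustrative.
split_by_range() {
    # usage: split_by_range INPUT COL LO HI OUT
    awk -v col="$2" -v lo="$3" -v hi="$4" -v out="$5" '
        # adding 0 forces a numeric comparison of the chosen field
        ($col + 0) >= lo && ($col + 0) <= hi { print > out }
    ' "$1"
}
```

For example, split_by_range inputData 3 0.05 0.1 c would write the matching lines to a file named c. If no line matches, the output file is not created.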

Related

Awk printing out smallest and highest number, in a time format

I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see the final column is a timing, I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped, I've tried:
awk '{print sort -n < $NF}' timings.txt
However that didn't even seem to sort anything, I just received an output of:
1
0
1
0
...
Repeating over and over, it went on for longer but I didn't want a massive line of it when you get the point after the first couple iterations.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227
After the question was clarified: if the time field always has the same number of digits in the same places, e.g. h:mm:ss.sss, the solution can be drastically simplified. We no longer need to convert the time to seconds to compare it; a simple string (lexicographical) comparison is enough:
$ awk 'NR==1 {m=M=$3} {$3<m&&m=$3; $3>M&&M=$3} END {printf("min: %s\nmax: %s",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler string-only based comparison for ordering values (determining min/max). We can do that since we know all timings will conform to the same format, and if a < b (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, then by using the lexicographical comparison alone, we can't order them, e.g. "12:00:00" < "3:00:00".)
So, on the first value read (first record, NR==1), we set the initial min/max value to the timing read (in the 3rd field). For each record we test whether the current value is smaller than the current min, and if it is, we set the new min. Similarly for the max. We use short-circuiting instead of if to make the expressions shorter ($3<m && m=$3 is equivalent to if ($3<m) m=$3). In the END block we simply print the result.
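For comparison, here is the same logic written with explicit if statements. This is only a sketch: the function name min_max_str is made up for illustration, and it assumes every timing in column 3 has the same fixed-width layout.

```shell
# Hypothetical helper: min/max of column 3 by plain string comparison.
# Only valid when every value has the same width and layout (h:mm:ss.sss).
min_max_str() {
    awk '
        NR == 1 { min = max = $3 }
        {
            # appending "" forces a string (not numeric) comparison
            if (($3 "") < (min "")) min = $3   # same effect as: $3<m && m=$3
            if (($3 "") > (max "")) max = $3
        }
        END { printf "min: %s\nmax: %s\n", min, max }
    ' "$1"
}
```

Calling min_max_str timings.txt on the four sample lines from the question should report the same two timings as the one-liner.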
Here's a general awk solution that accepts time strings with variable number of digits for hours/minutes/seconds per record:
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:22:35.227
max: 10:22:35.228
Or, in a more readable form:
#!/usr/bin/awk -f
{
    split($3, t, ":")
    s = t[3] + 60 * (t[2] + 60 * t[1])
    if (s < min || NR == 1) {
        min = s
        min_t = $3
    }
    if (s > max || NR == 1) {
        max = s
        max_t = $3
    }
}
END {
    print "min:", min_t
    print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field to seconds which we can later simply compare as numbers. As we iterate, we track the current min val and max val, printing them in the END. Initial values for min and max are taken from the first line (NR==1).
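The conversion step can be tried on its own. The helper below is a sketch with a made-up name (to_seconds); it applies the same formula as the script above to a single time string.

```shell
# Hypothetical helper: convert one h:mm:ss(.fraction) string to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, ":")    # p[1]=hours, p[2]=minutes, p[3]=seconds
        printf "%.3f\n", p[3] + 60 * (p[2] + 60 * p[1])
    }'
}
```

to_seconds 1:08:22.921 prints 4102.921. It also shows why the numeric form is needed for variable-width values: to_seconds 12:00:00 gives 43200.000 and to_seconds 3:00:00 gives 10800.000, whereas the strings themselves compare the other way round.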
Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227
You don't want to run sort inside of awk (even with the proper syntax).
Try this:
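sed 1d timings.txt | sort -k3,3 | sed -n '1p; $p'
where
the first sed removes the header line (drop it if your file has no header)
sort orders the lines by the 3rd column as a string, which works because the timings share a fixed-width format (a numeric sort with -n would read only the digits before the first colon)
the second sed prints the first and last lines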

bash: identifying the first value list that also exists in another list

I have been trying to come up with a nice way in Bash to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend.
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
Note: see also the [ previous version ] of this answer, which assumed the values were listed in a single file as two columns. This one was written after you clarified in [ this ] comment that the values come from two files.
A few points are not clear, though: what if a value in list A appears two or more times? (In your own example, 7e474d46 appears twice.) Assuming you need all the line numbers in list A whose values are also present in list B, the following will help.
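The NR==FNR idiom in the one-liner may be easier to follow when spread out. The sketch below is a commented variant, not the answerer's exact code; the function name first_common is hypothetical.

```shell
# Hypothetical helper: print the first entry of A that also occurs in B.
first_common() {
    # usage: first_common B_FILE A_FILE   (the small list goes first)
    awk '
        NR == FNR { seen[$1]; next }   # first file only: remember each entry
        $1 in seen { print $1; exit }  # second file: stop at the first hit
    ' "$1" "$2"
}
```

NR counts lines across all files while FNR restarts for each file, so NR == FNR holds only while the first file is being read. One caveat: if the first file is empty, the test is also true for the second file, so this sketch assumes B is non-empty.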
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
Or, as a multi-line form of the solution above:
awk '{
    col1[$1]=col1[$1]?col1[$1]","FNR:FNR;
    col2[$2];
}
END{
    for(i in col1){
        if(i in col2){
            print col1[i],i
        }
    }
}
' Input_file
The above code produces the following output.
3,5 7e474d46
6 1124dae9
This creates an array col1 indexed by the first field and an array col2 indexed by the second field. The value stored in col1 is the line number (FNR) on which that first-field value appears, with further line numbers appended, comma-separated, each time the value repeats. In the END section, awk traverses col1 and checks whether each index of col1 is also an index of col2; if so, it prints col1's value and the index.
If you have GNU grep, you can try this:
grep -m 1 -f B A

Find lines that have partial matches

I have a text file that contains a large number of lines. Each line is one long string with no spaces, but it contains several pieces of information. The program knows how to pick out the important information in each line. It treats the first 4 characters (numbers/letters) of the line as the identifier of a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
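As one possible version of that extra work, the sketch below prints every line whose 4-character prefix occurs more than once, together with its line number, which is what the question asks for. The function name list_duplicates is invented for this example, and the approach (reading the file twice) is an assumption rather than the answerer's method.

```shell
# Hypothetical helper: report every line whose first 4 characters occur
# more than once, prefixed by its line number.
list_duplicates() {
    # The file is read twice: pass 1 counts prefixes, pass 2 prints repeats.
    awk '
        NR == FNR { count[substr($0, 1, 4)]++; next }
        count[substr($0, 1, 4)] > 1 { print FNR ": " $0 }
    ' "$1" "$1"
}
```

On the sample file from the question, list_duplicates inputfile.txt would list lines 3 and 6, the two 1435 entries.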
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++              # put prefixes to array and count them
}
END {                                # in the end
    for (i in a) {                   # go thru all indexes
        if(a[i]>1) print i": "a[i]   # and print out the duplicate prefixes and their counts
    }
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
The script reads each line and treats its first 4 characters as the device name. It builds a dictionary, device, whose keys are device names and whose values are the line numbers where each name occurs.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
This might help you out!

finding and sorting datapoints in bash and awk

First of all, let me clarify that I am unfortunately still quite inexperienced in programming, so I really need some help.
What I have:
I have a data file containing 3 columns: $1=(Energy1), $2=(Energy2), $3=(intensity of their frequency in combination).
If I plot these data e.g. in gnuplot by doing spl "datafile.dat" u 1:2:3 I obtain a surface plot with my 2D-spectrum.
What I want:
Now I would like to select only those data points for which ($1-$2)=5.7, thus obtaining a line spectrum along a diagonal, with all possible combinations of $1 and $2 yielding this value.
The new data file should then contain the $1 value and the intensity (stored in $3) from each selected line, i.e. each line whose $1 and $2 yield 5.7.
I have tried to do this in bash using awk, but so far I have failed. Please help me! Thank you very much in advance.
Maybe I don't understand all the issues, or maybe you are having a floating-equal problem as others have noted, but why doesn't a simple filter through the data work?:
awk -v s=5.7 -v e=.01 '{d=$1-$2-s}d<e&&d>-e{print $1,$3}'
Tack on a sort if you want/need:
| sort -n
Or, is it possible that your data is too sparse, and you're looking for some value interpolation solution?
You do not need awk for this, gnuplot can do it.
admissible(x,y,value,epsilon)=(abs(x-y-value)<epsilon)
plot 'datafile.dat' using (admissible($1,$2,5.7,1e-5)?$1:1/0):3 with points
Function admissible is tested for each line of data file, if it returns true then the point ($1,$3) is plotted, else the x-coordinate is set to undefined (1/0) and thus the point is not plotted. The only shortcoming is that you cannot use the lines style with this, since lines will be interrupted by non-admissible datapoints.
If you want to compare every $1 against every $2, you need to take 2 passes through the file, once to collect all the $1,$3 pairs, the next to do all the comparisons:
awk -v diff=5.7 '
NR == FNR {
# this is the first trip through
val[$1] = $3
next
}
{
for (v1 in val) {
if ( (v1 - $2) == diff ) {
print v1, val[v1]
}
}
}
' file file # yes, give the same filename twice.
To address @Baruchel's comment about floating point precision, try this:
awk -v diff=5.7 -v epsilon=0.0001 '
NR == FNR {val[$1] = $3; next}
{
for (v1 in val) {
delta = v1 - $2 - diff
if (-epsilon <= delta && delta <= epsilon)
print v1, val[v1]
}
}
' file file
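A small demonstration of why the tolerance matters, using the classic 0.1 + 0.2 case rather than the asker's data. The function name float_demo and the tolerance 1e-9 are arbitrary choices for this sketch.

```shell
# Exact equality fails for values with no exact binary representation.
float_demo() {
    awk 'BEGIN {
        exact = (0.1 + 0.2 == 0.3)               # 0: the sum is not exactly 0.3
        d = 0.1 + 0.2 - 0.3
        close_enough = (d < 1e-9 && d > -1e-9)   # 1: within the tolerance
        print exact, close_enough
    }'
}
```

Running float_demo prints 0 1: the exact test fails while the epsilon test succeeds, which is the same effect that can make (v1 - $2) == diff miss rows that look like matches.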

Addition if identical value over line

I have a CSV like:
1015,5
1015,4
1035,17
1035,11
1009,1
1009,4
1026,9
1004,5
1004,5
1009,1
I am looking for a way to sum the second numbers wherever the first numbers match, to obtain:
1015,9
1035,28
1009,6
1026,9
1004,10
Try this :
awk 'BEGIN{FS=OFS=","}{a[$1]+=$2}END{for(i in a){print i,a[i]}}' file
This is the awk snippet that every shell coder should know off the top of their head.
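One thing to be aware of: for (i in a) visits keys in an unspecified order, so the output may not follow the order shown in the question. The sketch below is a variant, with the hypothetical name sum_by_key, that keeps keys in first-seen order.

```shell
# Hypothetical helper: sum column 2 per key in column 1 of a CSV,
# printing keys in the order they first appear.
sum_by_key() {
    awk 'BEGIN { FS = OFS = "," }
        !($1 in sum) { order[++n] = $1 }   # remember each new key once
        { sum[$1] += $2 }
        END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
    ' "$1"
}
```

With the sample CSV from the question, sum_by_key file prints the five totals in the same order as the desired output.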

Resources