I need to calculate the total size of deleted files per user from the lsof output in bash.
However, a few rows have a blank third column, which breaks the summing.
For example, in the attached image I need to show the total size per user for deleted files; because of the blank cells in the third column, the field count shifts and the resulting values are wrong.
I tried a few options to replace the blank cells with some dummy text, but that did not work well, so I need a suggestion to solve this problem, and also a command to show the resulting size in a human-readable format.
I tried to sum the output by user with the following command:
lsof|grep -i deleted| awk '{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
The above command did not give the right results, so I tried the command below to replace blank cells with dummy text:
lsof|grep -i deleted| awk '!$3{$3="NA"}{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
That did not work either, so I tried using an if condition:
lsof|grep -i deleted| awk '{if($3 =="") $3="NA"; a[$5] +=$8} END{for (i in a) print i, a[$i]}'
Assuming you are interested in the file owner, size and name, here is a Python script (test.py) which can extract them:
import re
import sys

user_last_column = 0
for line in sys.stdin:
    if user_last_column:
        if re.search(r'\(deleted\)$', line):
            print("%s %s %s" % (re.sub(r'.* ', '', line[:user_last_column]),
                                re.sub(r'.* ', '', line[:size_last_column]),
                                line[name_first_column:][:-11]))
    else:  # Process the first row, which is the header
        user_last_column = line.find('USER') + 4
        size_last_column = line.find('SIZE/OFF') + 8
        name_first_column = line.find('NAME')
Call it with:
lsof | python test.py | sort -u # [sort -u] to remove duplicates
or
python test.py < /tmp/sample
Explanation
The main thing is to find positions (in characters) of the three pieces of info.
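To then get per-user totals in a human-readable form, the script's output can be piped into a short awk sum (a sketch, assuming the owner/size/name output shown above; it divides by 1024² to show MiB):
lsof | python test.py | sort -u |
awk '{ sum[$1] += $2 } END { for (u in sum) printf "%s %.1f MiB\n", u, sum[u] / 1024 / 1024 }'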
Related
I have a file, say xyz.dat, which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (call them m.dat and o.dat) from the original xyz.dat.
M.dat contains columns 2|4|6, like below, after running some logic on them -
b11|d11|f11
b22|d22|f22
b33|d33|f33
O.dat contains all the columns except 2|4|6, unchanged, like below -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge the M and O files to recreate xyz.dat in its original format:
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note that the column positions can change for another file. I will be given the column positions (2, 4 and 6 in the example above), so I need either a generic command I can run in a loop to merge the new M and O files, or one command to which I can pass the column positions and which will copy the columns from the M.dat file and paste them into the O.dat file.
I tried paste, sed and cut but was not able to put together a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting language (Python, awk, Perl or even bash). Tools like paste, sed and cut do not have enough flexibility for such tasks (join may come close, but requires extra work).
Consider the following awk-based script:
awk -F'|' -v OFS='|' '
{
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n], $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
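With the m.dat and o.dat shown in the question, this should reproduce the original rows (a quick sketch of the expected output, not run here):
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3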
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns whose data should be taken from the second file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If the file is marked executable (chmod +x mergeCols) and is in the current directory
./mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns will keep the data from a.dat.
Implementation, using awk (create a file named mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS=OFS="|"
}

NR==1 {
    # Set the column map
    nc=split(COLS, c, ",")
    for (i=1 ; i<=nc ; i++ ) {
        cmap[c[i]] = i
    }
}

{
    # Read one line from the merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)

    # Merge columns using the pre-set 'cmap'
    k=0
    for (i=1 ; i<=NF+nc ; i++ ) {
        # Pick up a column: mapped columns come from 'a', the rest from the current line
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i<NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
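Applied to the m.dat and o.dat from the question (a sketch; it assumes the script above is saved as mergeCols and that o.dat is the main input file):
awk -f mergeCols COLS=2,4,6 M=m.dat o.dat
This should reproduce the original xyz.dat layout, e.g. a1|b11|c1|d11|e1|f11|g1 for the first row.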
So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, each line contains several pieces of information. The program knows how to differentiate the important information in each line. The program recognises that the first 4 numbers/letters of the line correspond to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., there are two 1435* stations). I need a way to search through my text files and identify any duplicates of the partial strings that represent the stations within the file, so that I can delete one or both of the duplicates. If I could have a bash script output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
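For the eliminating part, a common awk idiom keeps only the first line seen for each 4-character prefix (a sketch; the output file name is just an example):
awk '!seen[substr($0, 1, 4)]++' inputfile.txt > deduplicated.txt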
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++              # put prefixes into an array and count them
}
END {                                # in the end
    for (i in a) {                   # go through all indexes
        if(a[i]>1) print i": "a[i]   # and print out the duplicate prefixes and their counts
    }
}
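Since the question also asks for the line numbers of the duplicated lines, a two-pass variant can report them (a sketch; it reads the file twice):
awk 'NR==FNR { a[substr($0,1,4)]++; next } a[substr($0,1,4)] > 1 { print FNR ": " $0 }' test.in test.in
With the example file this prints lines 3 and 6, the two 1435 stations.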
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`; do
    echo -n "$i "
    grep -c ^"$i" file.txt   # this tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
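If only the repeated stations matter, the same loop can be piped through a small awk filter (a sketch):
for i in `cat list`; do echo -n "$i "; grep -c ^"$i" file.txt; done | awk '$2 > 1'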
Use the following Python script (written in Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line, treats the first 4 characters of each line as the device name, and builds a dictionary device whose keys are the device names and whose values are the line numbers where that device name is found.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!
I want to make a plot from a csv file:
02/15/2016 09:32:58,LG.04,4747.0
02/15/2016 09:33:08,LG.03,2899.0
02/15/2016 09:33:18,LG.01,5894.0
02/15/2016 09:33:28,LG.04,6043.0
I am using column 1, which is the date; the 3rd column holds the value that I want to compare.
This gives me only one plot:
reset
date = system("date +%m-%d-%Y")
set datafile separator ","
set timefmt '%m/%d/%Y %H:%M:%S'
set xdata time
set format x "%m/%d/%Y\n%H:%M:%S"
#
plot '/home/'.system("echo $USER").'/Desktop/test.csv' u 1:3:2 w lp
pause 200
I am wondering how to plot several lines, grouped by the second column, and how to set the title of each line from that CSV value.
To do this you will need to use an outside program to filter and reorganize the data. I'll demonstrate this using python3.
We need two python programs. The first, getnames.py, will get us the list of unique values in column 2:
data = open("test.csv","r").readlines()
names = [x.split(",")[1] for x in data]
print(" ".join(sorted(set(names))))
The second, filternames.py, will get us the lines in the data file corresponding to each unique value in column 2:
from sys import argv
nme = argv[1]
data = open("test.csv","r").readlines()
for x in data:
    if x.split(",")[1] == nme:
        print(x.strip())
Then, in gnuplot, we can call into these programs to process the data.
set datafile separator ","
set timefmt '%m/%d/%Y %H:%M:%S'
set xdata time
set format x "%m/%d/%Y\n%H:%M:%S"
names = system("getnames.py")
plot for [n in names] sprintf("< filternames.py %s",n) u 1:3 with linespoints t n
The first system call will get a string containing space separated unique values for this second column ("LG.01 LG.03 LG.04").
The plot command runs over each one of these values, and calls the filtering program to return only the lines corresponding to that value. The output of the filtering program is read directly by using the redirection operator.
Here, I moved the key to the left to keep the data off from it with set key left.
We can do the same thing using standard linux commands, if available. Instead of using the getnames.py program, we can do
names = system("awk -F, '{print $2}' test.csv | sort | uniq | tr '\n' ' '")
using awk to get the second column values, uniq to get only the unique values (which requires the values to be sorted with sort), and tr to replace newlines with spaces (returning the values as one space separated list).
Instead of using filternames.py, we can do
plot for [n in names] sprintf("< awk -F, '($2=="%s"){print $0}' test.csv",n) u 1:3 with linespoints t n
using awk to get only the lines with the desired second column value.
First of all, let me clarify that I am unfortunately still quite inexperienced in programming, so I really need some help.
What I have:
I have a data file containing 3 columns: $1=(Energy1), $2=(Energy2), $3=(intensity of their frequency in combination).
If I plot these data e.g. in gnuplot by doing spl "datafile.dat" u 1:2:3 I obtain a surface plot with my 2D-spectrum.
What I want:
Now, I would like to select only those data points for which ($1-$2) gives a specific value, 5.7, thus obtaining a line spectrum along a diagonal, built from all combinations of $1 and $2 yielding this value.
The new data file should then contain the $1 value and the corresponding intensity (stored in $3) for every row whose $1 and $2 values yield 5.7.
I have tried to do this in bash using awk, but so far I have failed. Please help me! Thank you very much in advance.
Maybe I don't understand all the issues, or maybe you are having a floating-point equality problem as others have noted, but why doesn't a simple filter through the data work?
awk -v s=5.7 -v e=.01 '{ d = $1 - $2 - s } d < e && d > -e { print $1, $3 }'
Tack on a sort if you want/need:
| sort -n
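Put together (a sketch: datafile.dat is the file name from the question, and the output file name is only an example):
awk -v s=5.7 -v e=.01 '{ d = $1 - $2 - s } d < e && d > -e { print $1, $3 }' datafile.dat | sort -n > line_spectrum.dat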
Or, is it possible that your data is too sparse, and you're looking for some value interpolation solution?
You do not need awk for this, gnuplot can do it.
admissible(x,y,value,epsilon)=(abs(x-y-value)<epsilon)
plot 'datafile.dat' using (admissible($1,$2,5.7,1e-5)?$1:1/0):3 with points
The function admissible is evaluated for each line of the data file; if it returns true, the point ($1,$3) is plotted, otherwise the x-coordinate is set to undefined (1/0) and the point is not plotted. The only shortcoming is that you cannot use the lines style with this, since the lines will be interrupted at non-admissible data points.
If you want to compare every $1 against every $2, you need to take 2 passes through the file, once to collect all the $1,$3 pairs, the next to do all the comparisons:
awk -v diff=5.7 '
NR == FNR {
    # this is the first trip through
    val[$1] = $3
    next
}
{
    for (v1 in val) {
        if ( (v1 - $2) == diff ) {
            print v1, val[v1]
        }
    }
}
' file file    # yes, give the same filename twice.
To address #Baruchel's comment about floating point precision, try this:
awk -v diff=5.7 -v epsilon=0.0001 '
NR == FNR { val[$1] = $3; next }
{
    for (v1 in val) {
        delta = v1 - $2 - diff
        if (-epsilon <= delta && delta <= epsilon)
            print v1, val[v1]
    }
}
' file file
I have files with some columns filled with numbers (floats). I need to split these files according to the value in one of the columns (which can be set; say the first one). This means: when a row
a b c
in my file has a value c that fulfils 0.05<=c<=0.1, then create a file named c and copy into it the whole rows that fulfil the c-condition...
Is this possible? I can do something small with bash and awk, and also a bit with C++.
I have searched for some solutions, but so far I can only sort the data and read the first number of each line..
I don't know.
Please, very please.
Thank you
Jane
As you mentioned awk, the basic rule in awk is 'match a line (either by default or with a regexp, condition or line number)' AND 'do something because you found a match'.
awk uses values like $1, $2, $3 to indicate which column in the current line of data it is looking at. $0 refers to the whole line. So ...
awk '
BEGIN{
    afile="afile.txt"
    bfile="bfile.txt"
    cfile="cfile.txt"
}
{
    # test c value between .05 and .1
    if ($3 >= 0.05 && $3 <= 0.1) print $0 > cfile
}' inputData
Note that I am testing the value of the third column (c in your example). You can use $2 to test the b column, etc.
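If the output file should instead be named after the value of c itself, as the question suggests, a minimal sketch (the .txt suffix is only an assumption):
awk '($3 >= 0.05 && $3 <= 0.1) { print $0 > ($3 ".txt") }' inputData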
If you don't know about the sort of condition test I have included, $3 >= 0.05 && $3 <= 0.1, you'll have some learning ahead of you.
Questions in the form of 1. I have this input, 2. I want this output. 3. (but) I'm getting this output, 4. with this code .... {code here} .... have a much better chance of getting a reasonable response in a reasonable amount of time ;-)
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
If I understand your requirements correctly:
awk '{print > $3}' file ...