Gnuplot - Use a 3rd non-integer parameter in a plot - bash

I want to make a plot from a csv file:
02/15/2016 09:32:58,LG.04,4747.0
02/15/2016 09:33:08,LG.03,2899.0
02/15/2016 09:33:18,LG.01,5894.0
02/15/2016 09:33:28,LG.04,6043.0
Column 1 is the date, and the 3rd column holds the values I want to compare.
The following gives me only one plot:
reset
date = system("date +%m-%d-%Y")
set datafile separator ","
set timefmt '%m/%d/%Y %H:%M:%S'
set xdata time
set format x "%m/%d/%Y\n%H:%M:%S"
#
plot '/home/'.system("echo $USER").'/Desktop/test.csv' u 1:3:2 w lp
pause 200
I am wondering how to get several lines, one per distinct value in the second column, and how to set the title of each line from that CSV value.

To do this you will need to use an outside program to filter and reorganize the data. I'll demonstrate this using python3.
We need two python programs. The first, getnames.py, will get us the list of unique values in column 2:
data = open("test.csv","r").readlines()
names = [x.split(",")[1] for x in data]
print(" ".join(sorted(set(names))))
The second, filternames.py, will get us the lines in the data file corresponding to each unique value in column 2:
from sys import argv
nme = argv[1]
data = open("test.csv","r").readlines()
for x in data:
    if x.split(",")[1] == nme:
        print(x.strip())
Then, in gnuplot, we can call these programs to process the data (assuming the scripts are marked executable and are on the PATH).
set datafile separator ","
set timefmt '%m/%d/%Y %H:%M:%S'
set xdata time
set format x "%m/%d/%Y\n%H:%M:%S"
names = system("getnames.py")
plot for [n in names] sprintf("< filternames.py %s",n) u 1:3 with linespoints t n
The first system call returns a string containing the space-separated unique values of the second column ("LG.01 LG.03 LG.04").
The plot command iterates over these values and calls the filtering program to return only the lines corresponding to each value. The output of the filtering program is read directly via the redirection operator.
Here I moved the key to the left with set key left, to keep it clear of the data.
We can do the same thing using standard Linux commands, if available. Instead of the getnames.py program, we can use
names = system("awk -F, '{print $2}' test.csv | sort | uniq | tr '\n' ' '")
using awk to get the second column values, sort to order them, uniq to keep only the unique values (it requires sorted input), and tr to replace newlines with spaces (returning the values as one space-separated list).
Instead of using filternames.py, we can do
plot for [n in names] sprintf("< awk -F, '($2=="%s"){print $0}' test.csv",n) u 1:3 with linespoints t n
using awk to get only the lines with the desired second column value.

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file, say xyz.dat, which has data like below:
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
m.dat contains columns 2|4|6, like below, after running some logic on them:
b11|d11|f11
b22|d22|f22
b33|d33|f33
o.dat contains all the columns except 2|4|6, like below, without any change:
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge the M and O files to recreate the original xyz.dat format:
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note that the column positions can change for other files. I will be given the positions (2, 4, 6 in the example above), so I need either a generic command to run in a loop to merge the new M and O files, or a single command that takes the column positions and copies those columns from m.dat into o.dat.
I tried paste, sed, and cut but was not able to build a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting language (Python, Awk, Perl, or even bash). Tools like paste, sed, and cut do not have enough flexibility for this task (join comes close, but requires extra work).
Consider the following awk-based script:
awk -F'|' -vOFS='|' '
{
    # Read the corresponding line from o.dat into s, split it into array a
    getline s < "o.dat"
    n = split(s, a)
    # Print output. Add a[n], or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, the goal is: given two input files and a list of columns whose data should come from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns take their data from a.dat.
Implementation, using awk (create a file mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS=OFS="|"
}
NR==1 {
    # Set the column map
    nc=split(COLS, c, ",")
    for (i=1 ; i<=nc ; i++ ) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)
    # Merge columns using pre-set 'cmap'
    k=0
    for (i=1 ; i<=NF+nc ; i++ ) {
        # Pick up a column
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i<NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
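For comparison, here is a rough Python sketch of the same interleaving idea; the hard-coded column set and the file names m.dat/o.dat are illustrative assumptions, not part of the original answer:
#!/usr/bin/python
# Interleave columns: positions listed in cols come from m.dat, the rest from o.dat.
cols = {2, 4, 6}                      # 1-based output positions filled from the merge file

with open('m.dat') as m, open('o.dat') as o:
    for mline, oline in zip(m, o):
        mfields = mline.rstrip('\n').split('|')
        ofields = oline.rstrip('\n').split('|')
        out, mi, oi = [], 0, 0
        for i in range(1, len(mfields) + len(ofields) + 1):
            if i in cols:
                out.append(mfields[mi])   # take this position from the merge file
                mi += 1
            else:
                out.append(ofields[oi])   # otherwise take the next o.dat field
                oi += 1
        print('|'.join(out))
With the sample m.dat and o.dat above, this prints the reconstructed xyz.dat lines.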

Awk: accommodating blank cell values

I need to calculate the total size of deleted files per user from the lsof command in bash.
A few rows have a blank third column, which throws off the field numbering when summing values.
For example, in the attached image I need to show the total size per user type for deleted files; because of the blank cells in the third column, the column count comes out wrong and the resulting values are incorrect too.
I tried a few options to replace the blank cells with dummy text, but that did not work well, so I need a suggestion to solve this problem, and also a command to show the resulting size in a human-readable format.
I tried to sum the output by user type with the following command:
lsof|grep -i deleted| awk '{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
The above command did not give the right results, so I tried the command below to replace blank cells with dummy text:
lsof|grep -i deleted| awk '!$3{$3="NA"}{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
That did not work either, so I tried using an if condition:
lsof|grep -i deleted| awk '{if($3 =="") $3="NA"; a[$5] +=$8} END{for (i in a) print i, a[$i]}'
Assuming you are interested in the file owner/size/name, here is a python script (test.py) which can extract them:
import re
import sys

user_last_column = 0
for line in sys.stdin:
    if user_last_column:
        if re.search(r'\(deleted\)$', line):
            print("%s %s %s" % (re.sub(r'.* ', '', line[:user_last_column]),
                                re.sub(r'.* ', '', line[:size_last_column]),
                                line[name_first_column:][:-11]))
    else:  # Process first row which is header
        user_last_column = line.find('USER') + 4
        size_last_column = line.find('SIZE/OFF') + 8
        name_first_column = line.find('NAME')
Call it with:
lsof | python test.py | sort -u # [sort -u] to remove duplicates
or
python test.py < /tmp/sample
Explanation
The main thing is to find the positions (in characters) of the three pieces of information, which the script reads from the header row.
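If you also want the per-user totals in a human-readable format, as the question asks, a minimal follow-up sketch could aggregate the output of test.py; the script name sum_by_user.py and the "owner size name" line format it expects are assumptions based on the script above:
#!/usr/bin/python
# sum_by_user.py: sum the sizes printed by test.py, grouped by owner.
import sys
from collections import defaultdict

def human(n):
    # Render a byte count with a human-readable unit.
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return "%.1f%s" % (n, unit)
        n /= 1024.0
    return "%.1fPB" % n

totals = defaultdict(int)
for line in sys.stdin:
    parts = line.split()
    if len(parts) >= 2 and parts[1].isdigit():
        totals[parts[0]] += int(parts[1])   # owner -> accumulated bytes

for user in sorted(totals):
    print("%s %s" % (user, human(totals[user])))
It would be chained as:
lsof | python test.py | sort -u | python sum_by_user.py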

Print a column as row and print remaining columns under that row accordingly using shell

I have raw data in this structure:
Date|Number|Sum
2016-05-23|128|213
2016-05-23|121|254
2016-05-25|143|213
2016-05-23|111|56
2016-05-13|121|213
2016-05-29|111|251
2016-05-07|111|23
2016-05-07|143|25
2016-05-17|111|13
I want a header like this:
number(2nd column)|2016-05-01|2016-05-02|....................2016-05-31
and, under it, print the sum (3rd column) of each particular number on each date, using a shell script.
I have tried using awk, but not successfully.
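No answer was recorded for this question, but as a rough illustration, a minimal Python sketch of one way to build such a pivot might look like this (the input file name data.txt is assumed, and only dates present in the data become columns; padding out the full month is a small extension):
#!/usr/bin/python
# Pivot: one row per Number, one column per Date, cells hold the summed Sum.
import csv
from collections import defaultdict

sums = defaultdict(int)          # (number, date) -> accumulated sum
dates = set()
numbers = set()

with open('data.txt') as f:      # 'data.txt' is an assumed file name
    for row in csv.DictReader(f, delimiter='|'):
        sums[(row['Number'], row['Date'])] += int(row['Sum'])
        dates.add(row['Date'])
        numbers.add(row['Number'])

cols = sorted(dates)
print('|'.join(['Number'] + cols))
for num in sorted(numbers):
    print('|'.join([num] + [str(sums.get((num, d), 0)) for d in cols]))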

Using awk or sed to print column of CSV file enclosed in double quotes

I'm working on a csv file like the one below: comma delimited, each cell enclosed in double quotes, but some cells contain double quotes and/or commas inside the double-quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I need to remove some unneeded columns, merge the last few columns (with </br> between them instead of ","), and move the second column to the end. Anything within the cells should stay the same, with double quotes and commas as in the original file. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6, and 7.
Below is the code that I tried to use, but it parses double quotes and/or commas (and hence cell and row boundaries) differently from what I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove the beginning and ending double quote of a cell.
In the output file that I'm getting right now, if the previous field contains a double quote, it is treated as the beginning of a new cell, so the following values are often pushed up a column.
Other code that I have used treats every comma as the beginning of a cell, so that won't work either.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. Thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start at 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
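As a quick illustration of those indexing and slicing rules (a standalone snippet, not part of the answer's script):
row = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(row[4:7])                   # ['e', 'f', 'g'] -- indexes 4, 5, 6; the end index 7 is exclusive
row[4] = "</br>".join(row[4:7])   # merge fields 5, 6, 7 into field 5
del row[5:7]                      # drop the now-redundant fields 6 and 7
print(row)                        # ['a', 'b', 'c', 'd', 'e</br>f</br>g']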
The script above produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv

with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
This might be an oversimplification of the problem, but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that is probably why I had to wrap the commas in the awk script in quotation marks.

Gnuplot: subtract varying offset from multiple data files

After my last question was answered as well (thanks @Christoph!), I have another one.
I have multiple data files I want to plot, but for each file I want to subtract the first value of column 2 so that every data file starts at 0.
I have the following code:
file = 'file_1 file_2 file_3 ... filen'
intime(COL) = strptime("%H:%M:%S",strcol(COL))
do for [i=1:words(file)] {
    stats word(file,i) using (intime(2)):3 nooutput
    timemin(i) = STATS_min_x
}
plot for [i=1:words(file)] word(file,i) u (intime(2)-timemin(i)):3 notitle
The problem is that timemin(i) only returns the value from the last file.
Does anybody know how I can save all the different values for file_1 to file_n?
Thanks for your help!
You can construct a string which contains all the computed minima:
file = 'file_1 file_2 file_3 ... filen'
timemin = ''
intime(COL) = strptime("%H:%M:%S",strcol(COL))
do for [i=1:words(file)] {
    stats word(file,i) using (intime(2)):3 nooutput
    timemin = sprintf("%s %e", timemin, STATS_min_x)
}
plot for [i=1:words(file)] word(file,i) u (intime(2)-word(timemin, i)):3 notitle
Since you want to subtract the very first value, you could also use another method without stats:
file = 'file_1 file_2 file_3 ... filen'
intime(COL) = strptime("%H:%M:%S",strcol(COL))
ofs = 0
plot for [f in file] f using (ofs = ($0 == 0 ? intime(2) : ofs), intime(2) - ofs):3 notitle
This sets the variable ofs to the first value of column 2 in each data file ($0 contains the row number). And note that you can iterate over a word list with in.
