Sometimes data needs to be sorted. Unfortunately, gnuplot (as far as I know) does not offer a built-in way to do this. Of course, you can use external tools such as awk, Perl, or Python. However, for maximum platform independence, to avoid installing additional programs and the related complications, and also out of curiosity, I wanted to find out whether gnuplot can nevertheless sort on its own.
I would be grateful for comments on improvements and limitations.
Does anybody have ideas how to sort alphanumerical data with gnuplot only?
### Sorting with gnuplot
reset session
# generate some random example data
N = 10
set samples N
RandomNo(n) = sprintf("%.02f",rand(0)*n)
set table $Data
plot '+' u (RandomNo(10)):(RandomNo(10)):(RandomNo(10)) w table
unset table
print $Data
# Settings for sorting
ColNo = 2 # ColumnNo for sorting
stats $Data nooutput # get the number of rows if data is from file
RowCount = STATS_records # with the example data above, of course RowCount=N
# create the sortkey and put it into an array
array SortKey[RowCount]
set table $Dummy
plot $Data u (SortKey[$0+1] = sprintf("%.06f%02d",column(ColNo),$0+1)) w table
unset table
# print $Dummy
# get lines as whole into array
set datafile separator "\n"
array DataSeq[RowCount]
set table $Dummy2
plot $Data u (SortKey[$0+1]):(DataSeq[$0+1] = stringcolumn(1)) with table
unset table
print $Dummy2
set datafile separator whitespace
# do the actual sorting with 'smooth unique'
set table $Dummy3
plot $Dummy2 u 1:0 smooth unique
unset table
# print $Dummy3
# extract the sorted sortkeys
set table $Dummy4
plot $Dummy3 u (SortKey[$0+1]=$2) with table
unset table
# print $Dummy4
# create the table with sorted lines
set table $DataSorted
plot $Data u (DataSeq[SortKey[$0+1]+1]) with table
unset table
print $DataSorted
### end of code
The first datablock is the unsorted data,
the second datablock is the intermediate result with the sort keys,
and the third datablock is the data sorted by the second column.
Output:
5.24 6.68 3.09
1.64 1.27 9.82
6.44 9.23 7.03
8.14 8.87 3.82
4.27 5.98 0.93
7.96 3.64 6.15
6.21 6.28 6.17
1.52 3.17 3.58
4.24 2.16 8.99
8.73 6.54 1.13
6.68000001 5.24 6.68 3.09
1.27000002 1.64 1.27 9.82
9.23000003 6.44 9.23 7.03
8.87000004 8.14 8.87 3.82
5.98000005 4.27 5.98 0.93
3.64000006 7.96 3.64 6.15
6.28000007 6.21 6.28 6.17
3.17000008 1.52 3.17 3.58
2.16000009 4.24 2.16 8.99
6.54000010 8.73 6.54 1.13
1.64 1.27 9.82
4.24 2.16 8.99
1.52 3.17 3.58
7.96 3.64 6.15
4.27 5.98 0.93
6.21 6.28 6.17
8.73 6.54 1.13
5.24 6.68 3.09
8.14 8.87 3.82
6.44 9.23 7.03
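The sort-key idea used above can also be restated outside gnuplot. The following Python sketch is only an illustration under stated assumptions, not the gnuplot code itself: it builds the same kind of key (the value with six decimals followed by a two-digit row number, so that equal values still give distinct keys), sorts the keys, and then uses the attached row index to fetch the original lines in sorted order. The names rows, col_no, make_key and sorted_rows are made up for this example, and the data is just the first four rows of the output above.

```python
# Illustrative restatement of the sort-key trick; names are hypothetical.
rows = [
    "5.24 6.68 3.09",
    "1.64 1.27 9.82",
    "6.44 9.23 7.03",
    "8.14 8.87 3.82",
]
col_no = 2  # 1-based column to sort by, like ColNo in the gnuplot script


def make_key(line, row_number):
    """Build a numeric key: value with 6 decimals plus a 2-digit row number."""
    value = float(line.split()[col_no - 1])
    return float("%.6f%02d" % (value, row_number))


# Pair each key with its 0-based row index (the role of column 0 in gnuplot).
keyed = [(make_key(line, i + 1), i) for i, line in enumerate(rows)]
keyed.sort()

# Use the row indices, now in key order, to pick the whole lines.
sorted_rows = [rows[i] for _, i in keyed]
for line in sorted_rows:
    print(line)
```

The row number in the key only serves to keep the keys distinct; the sort order is still determined by the column value.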
Out of curiosity, I wanted to know whether an alphanumerical sort could be implemented with gnuplot code only.
This avoids the need for external tools and ensures maximum platform compatibility.
I have not yet heard of an external tool that could assist gnuplot and that works under Windows, Linux, and macOS alike.
I am happy to take comments and suggestions about bugs, simplifications, improvements, performance comparisons, and limits.
For alphanumerical sort, the first stage is alphanumerical string comparison, which to my knowledge does not exist in gnuplot directly. So, the first part Compare.plt is about comparison of strings.
### compare function for strings
# Compare.plt
# function cmp(a,b,cs) returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
reset session
ASCII = ' !"' . "#$%&'()*+,-./0123456789:;<=>?@".\
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\`".\
"abcdefghijklmnopqrstuvwxyz{|}~"
ord(c) = strstrt(ASCII,c)>0 ? strstrt(ASCII,c)+31 : 0
# comparing char: case-sensitive
cmpcharcs(c1,c2) = sgn(ord(c1)-ord(c2))
# comparing char: case-insensitive
cmpcharci(c1,c2) = sgn(( cmpcharci_o1=ord(c1), ((cmpcharci_o1>96) && (cmpcharci_o1<123)) ?\
cmpcharci_o1-32 : cmpcharci_o1) - \
( cmpcharci_o2=ord(c2), ((cmpcharci_o2>96) && (cmpcharci_o2<123)) ?\
cmpcharci_o2-32 : cmpcharci_o2) )
# function cmp returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
cmp(a,b,cs) = ((cmp_r=0, cmp_flag=0, cmp_maxlen=strlen(a)>strlen(b) ? strlen(a) : strlen(b)),\
(sum[cmp_i=1:cmp_maxlen] \
((cmp_flag==0 && (cmp_c1 = substr(a,cmp_i,cmp_i), cmp_c2 = substr(b,cmp_i,cmp_i), \
(cmp_r = (cs==0 ? cmpcharci(cmp_c1,cmp_c2) : cmpcharcs(cmp_c1,cmp_c2) ) )!=0 ? \
(cmp_flag=1, cmp_r) : 0)), 1 )), cmp_r)
cmpsymb(a,b,cs) = (cmpsymb_r = cmp(a,b,cs))<0 ? "<" : cmpsymb_r>0 ? ">" : "="
### end of code
Example:
### example compare strings
load "Compare.plt"
a="Alligator"
b="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Zebra"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
### end of code
Result:
-1: Alligator < Tiger
0: Tiger = Tiger
1: Zebra > Tiger
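For readers who find the nested gnuplot expressions hard to follow, here is a rough Python sketch of the same idea: compare two strings character by character and stop at the first difference. It is not a translation of Compare.plt. The function name cmp_strings is made up, the case-insensitive mode simply upper-cases both strings (which matches the "subtract 32" step only for ASCII letters), and treating the shorter string as smaller when one string is a prefix of the other is an assumption of this sketch.

```python
def cmp_strings(a, b, cs):
    """Return -1 if a<b, 0 if a==b, +1 if a>b.

    cs=0: case-insensitive, cs=1: case-sensitive.
    """
    if cs == 0:
        a, b = a.upper(), b.upper()
    for c1, c2 in zip(a, b):
        if c1 != c2:
            # The first differing character decides the result.
            return -1 if ord(c1) < ord(c2) else 1
    # All shared positions are equal: assume the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


for a in ("Alligator", "Tiger", "Zebra"):
    print("%2d: %s vs Tiger" % (cmp_strings(a, "Tiger", 0), a))
```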
The second part makes use of the comparison for sorting.
### alpha-numerical sort with gnuplot
reset session
load "Compare.plt"
$Data <<EOD
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
EOD
stats $Data u 0 nooutput
RowCount = STATS_records
ColSort = 3
array Key[RowCount]
array Index[RowCount]
set table $Dummy
plot $Data u (Key[$0+1]=stringcolumn(ColSort),Index[$0+1]=$0+1) w table
unset table
# Bubblesort
do for [n=RowCount:2:-1] {
do for [i=1:n-1] {
if ( cmp(Key[i],Key[i+1],0) > 0) {
tmp=Key[i]; Key[i]=Key[i+1]; Key[i+1]=tmp
tmp2=Index[i]; Index[i]=Index[i+1]; Index[i+1]=tmp2
}
}
}
set datafile separator "\n"
set table $Dummy # and reuse Key-array
plot $Data u (Key[$0+1]=stringcolumn(1)) with table
unset table
set datafile separator whitespace
set table $DataSorted
plot $Data u (Key[Index[$0+1]]) with table
unset table
print $DataSorted
set grid xtics,ytics
plot [-0.5:RowCount-0.5][0:1.1] $DataSorted u 0:2:xtic(3) w lp lt 7 lc rgb "red"
### end of code
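The core of the script is a bubble sort that swaps the keys and, in parallel, an index array, so that the original lines can be looked up afterwards. A minimal Python sketch of that bookkeeping, assuming a simple lower-case comparison in place of cmp() and using made-up names (rows, keys, index, sorted_rows), might look like this:

```python
rows = [
    "1 0.123 Orange",
    "2 0.456 Apple",
    "3 0.789 Peach",
    "4 0.987 Pineapple",
    "5 0.654 Banana",
    "6 0.321 Raspberry",
    "7 0.111 Lemon",
]
col_sort = 3  # 1-based column to sort by, like ColSort above

keys = [row.split()[col_sort - 1] for row in rows]
index = list(range(len(rows)))  # index[i] = original position of keys[i]

# Bubble sort: swap the keys and the index entries together.
for end in range(len(keys) - 1, 0, -1):
    for i in range(end):
        if keys[i].lower() > keys[i + 1].lower():
            keys[i], keys[i + 1] = keys[i + 1], keys[i]
            index[i], index[i + 1] = index[i + 1], index[i]

# The index array now lists the original rows in sorted order.
sorted_rows = [rows[i] for i in index]
for row in sorted_rows:
    print(row)
```

Bubble sort needs on the order of N² comparisons, so this approach is only reasonable for fairly small tables.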
Input:
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
Output:
2 0.456 Apple
5 0.654 Banana
7 0.111 Lemon
1 0.123 Orange
3 0.789 Peach
4 0.987 Pineapple
6 0.321 Raspberry
and the output graph:
I have a large .csv file from which I need to extract information and add it as new columns. My csv looks something like this:
file_name,#,Date,Time,Temp (°C) ,Intensity
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,
I want to create two new columns that contain data from the "file_name" column. I want to extract the one or two digits after the text "trap", and I want to extract the c or the u, and create new columns with this data. The data should look something like this after processing:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?
Using sed you can do this:
sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
gawk approach:
awk -F, 'NR==1{ print $0,"can_und,trap_no" }
NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file
The output:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
NR==1{ print $0,"can_und,trap_no" } - print the header line
match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following trap word and the next following suffix letter
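The same capture-group logic can be sketched in Python, in case gawk's three-argument match() is not available. This is only an illustration: the variable names are made up, the sample is the header plus three rows from the question, and, like the awk command, it relies on each data row already ending in a comma, so the letter is appended without a separator.

```python
import re

lines = [
    "file_name,#,Date,Time,Temp (°C) ,Intensity",
    "trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,",
    "trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,",
    "trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,",
]

# Group 1: the digits after "trap"; group 2: the following letter.
pattern = re.compile(r"^trap([0-9]+)([a-z])")

result = [lines[0] + ",can_und,trap_no"]  # extend the header line
for line in lines[1:]:
    first_field = line.split(",")[0]
    m = pattern.match(first_field)
    if m is None:
        result.append(line)  # leave rows without a trap name unchanged
        continue
    # The row already ends with ",", so append the letter directly.
    result.append(line + m.group(2) + "," + m.group(1))

print("\n".join(result))
```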
With sed, this would look like:
sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file
Use sed -i ... to edit the file in place.
Using the pandas CSV reader in Python, because Python is awesome for numerical analysis:
First, I had to modify the header row so that the number of columns is consistent, by appending 3 commas:
file_name,#,Date,Time,Temp (°C) ,Intensity,,,
There is probably a way to tell pandas to ignore the column differences, but I am still a noob.
Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char' which contain the parsed elements of the filenames:
import pandas

def main():
    df = pandas.read_csv("file.csv")
    df['cu_int'] = 0     # Add the new columns to the data frame.
    df['cu_char'] = ' '
    for index, df_row in df.iterrows():
        file_name = df_row['file_name'].strip()
        trap_string = file_name.split("_")[0]      # Get the file_name string prior to the underscore
        numeric_offset_beg = len("trap")           # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1  # Leave off the 'c' or 'u' char.
        numeric_value = trap_string[numeric_offset_beg:numeric_offset_end]
        cu_value = trap_string[-1]
        # Assign to this row only; df['cu_int'] = ... would overwrite the whole column.
        df.loc[index, 'cu_int'] = int(numeric_value)
        df.loc[index, 'cu_char'] = cu_value
    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)

if __name__ == "__main__":
    main()
The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):
$ python read_csv.py
file_name # Date Time Temp (°C) Intensity Unnamed: 6 Unnamed: 7 Unnamed: 8 cu_int cu_char
0 trap12u_10733862_150809.txt 1 05/28/15 06:00:00.0 20.424 215.3 NaN NaN NaN 12 u
1 trap12u_10733862_150809.txt 2 05/28/15 07:00:00.0 21.091 1.0 130.2 NaN NaN 12 u
2 trap12u_10733862_150809.txt 3 05/28/15 08:00:00.0 26.195 3.0 100.0 NaN NaN 12 u
3 trap11u_10733862_150809.txt 4 05/28/15 09:00:00.0 25.222 3.0 444.5 NaN NaN 11 u
4 trap11u_10733862_150809.txt 5 05/28/15 10:00:00.0 26.195 3.0 100.0 NaN NaN 11 u
5 trap11u_10733862_150809.txt 6 05/28/15 11:00:00.0 25.902 2.0 927.8 NaN NaN 11 u
6 trap11u_10733862_150809.txt 7 05/28/15 12:00:00.0 25.708 2.0 325.0 NaN NaN 11 u
7 trap12c_10733862_150809.txt 8 05/28/15 13:00:00.0 26.292 3.0 100.0 NaN NaN 12 c
8 trap12c_10733862_150809.txt 9 05/28/15 14:00:00.0 26.390 2.0 66.7 NaN NaN 12 c
9 trap12c_10733862_150809.txt 10 05/28/15 15:00:00.0 26.097 1.0 463.9 NaN NaN 12 c
The gist of my question is this:
How can I display Unicode characters in Matlab's GUI (OS X) so that they are properly rendered?
Details:
I have a table of strings stored in a file, and some of these strings contain UTF-8-encoded Unicode characters. I have tried many different ways (too many to list here) to display the contents of this file in the MATLAB GUI, without success. For example:
>> fid = fopen('/Users/kj/mytable.txt', 'r', 'n', 'UTF-8');
>> [x, x, x, enc] = fopen(fid); enc
enc =
UTF-8
>> tbl = textscan(fid, '%s', 35, 'delimiter', ',');
>> tbl{1}{1}
ans =
ÎÎÎÎÎΠΣΦΩαβγδεζηθικλμνξÏÏÏÏÏÏÏÏÏÏ
>>
As it happens, if I paste the string directly into the MATLAB GUI, the pasted string is displayed properly. This shows that the GUI is not fundamentally incapable of displaying these characters, but once MATLAB has read the string from the file, it no longer displays it correctly. For example:
>> pasted = 'ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω'
pasted =
>>
Thanks!
I present below my findings after doing some digging... Consider these test files:
a.txt
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
b.txt
தமிழ்
First, we read files:
%# open file in binary mode, and read a list of bytes
fid = fopen('a.txt', 'rb');
b = fread(fid, '*uint8')'; %'# read bytes
fclose(fid);
%# decode as unicode string
str = native2unicode(b,'UTF-8');
If you try to print the string, you get a bunch of nonsense:
>> str
str =
Nonetheless, str does hold the correct string. We can check the Unicode code point of each character; as you can see, they lie outside the ASCII range (the last two are the non-printable CR-LF line ending):
>> double(str)
ans =
Columns 1 through 13
915 916 920 923 926 928 931 934 937 945 946 947 948
Columns 14 through 26
949 950 951 952 953 954 955 956 957 958 960 961 962
Columns 27 through 35
963 964 965 966 967 968 969 13 10
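To make the distinction between "the bytes are decoded correctly" and "the display is garbled" concrete, here is a small Python sketch (not MATLAB). It rebuilds the Greek test string from the code points listed above, encodes it as UTF-8, and decodes the bytes once as UTF-8 and once as ISO-8859-1. That a single-byte charset such as ISO-8859-1 is what produces the garbled display is an assumption of this sketch; the variable names are made up.

```python
# Code points taken from the double(str) listing, without the trailing CR LF.
code_points = (
    [915, 916, 920, 923, 926, 928, 931, 934, 937]
    + list(range(945, 959))   # 945 .. 958
    + list(range(960, 970))   # 960 .. 969
)
text = "".join(chr(cp) for cp in code_points)

raw = text.encode("utf-8")        # the bytes as stored in the file
decoded = raw.decode("utf-8")     # correct decoding, like native2unicode(b,'UTF-8')
garbled = raw.decode("latin-1")   # every byte misread as one ISO-8859-1 character

print(len(text), len(raw))               # each Greek letter takes two bytes in UTF-8
print([ord(ch) for ch in decoded[:3]])   # code points of the first three characters
print(repr(garbled[:4]))                 # the first two letters, misread as four characters
```

Decoding with the wrong charset does not lose information here: the bytes are intact, only their interpretation differs.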
Unfortunately, MATLAB seems unable to display this Unicode string in a GUI on its own. For example, all these fail:
figure
text(0.1, 0.5, str, 'FontName','Arial Unicode MS')
title(str)
xlabel(str)
One trick I found is to use the embedded Java capability:
%# Java Swing
label = javax.swing.JLabel();
label.setFont( java.awt.Font('Arial Unicode MS',java.awt.Font.PLAIN, 30) );
label.setText(str);
f = javax.swing.JFrame('frame');
f.getContentPane().add(label);
f.pack();
f.setVisible(true);
As I was preparing to write the above, I found an alternative solution. We can use the DefaultCharacterSet undocumented feature and set the charset to UTF-8 (on my machine, it is ISO-8859-1 by default):
feature('DefaultCharacterSet','UTF-8');
Now with a proper font (you can change the font used in the Command Window from Preferences > Font), we can print the string in the prompt (note that DISP is still incapable of printing Unicode):
>> str
str =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>> disp(str)
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπÏςστυφχψω
And to display it in a GUI, UICONTROL should work (under the hood, I think it is really a Java Swing component):
uicontrol('Style','text', 'String',str, ...
'Units','normalized', 'Position',[0 0 1 1], ...
'FontName','Arial Unicode MS', 'FontSize',30)
Unfortunately, TEXT, TITLE, XLABEL, etc. are still showing garbage.
As a side note: It is difficult to work with m-file sources containing Unicode characters in the MATLAB editor. I was using Notepad++, with files encoded as UTF-8 without BOM.