How to sort numbers in a file (highest first) - sorting

I have a file containing many kinds of information, and I want to sort multi-line entries by the numbers they contain.
There are 7 lines below (the middle line is blank):
GRELUP.C.3a.or:ndiff_c_fail_a_same_well = SELECT -inside GRELUP.C.3a.or:ndiff_c_fail_a GRELUP.C.3a.or:_EPTMPL312066 -not
generate layer GRELUP.C.3a.or:ndiff_c_fail_a_same_well, TYP = P, HPN = 0, FPN = 0, HEN = 0, FEN = 0
Time: cpu=0.00/8818.64 real=0.30/1875.23 Memory: 160.81/245.20/245.20
GRELUP.C.3a.or:ndiff_c_fail_a = SELECT -inside GRELUP.C.3a.or:ndiff_c_eg GRELUP.C.3a.or:well_cont_a_sized_a -not
generate layer GRELUP.C.3a.or:ndiff_c_fail_a, TYP = P, HPN = 0, FPN = 0, HEN = 0, FEN = 0
Time: cpu=0.00/8818.64 real=1.10/1875.23 Memory: 180.84/252.29/252.29
The lines I want returned are below:
GRELUP.C.3a.or:ndiff_c_fail_a real=1.10/1875.23
GRELUP.C.3a.or:ndiff_c_fail_a_same_well real=0.30/1875.23
In other words, the entry with the higher number after "real=" should be sorted first, and each output line is the name that appears after "generate layer" in the line above, followed by the "real=" field.

I suggest doing this in several stages, as it is so much easier to take things in limited steps:
Split the data into records.
Convert each record into a reduced form that just has the information you want.
Sort now that you can easily determine what to sort by.
(Optional, depending on how you do step 2) Extract the information to print.
If your data is small enough to fit into memory, the first step can be done with:
proc splitIntoRecords {data} {
    # U+001E is the official ASCII record separator; it's not used much!
    regsub -all {\n{2,}} $data \u001e data
    return [split $data \u001e]
}
I'm not quite so sure about the conversion step; this might work (on a single record; I'll lift it to the collection with lmap later):
proc convertRecord {record} {
    # We extract the parts we want to print and the part we want to sort by.
    # The sort key is the first number after "real=" (the per-record time);
    # the number after the slash is the cumulative total, which is the same
    # for every record and so useless as a sort key.
    regexp {(^\S+).*(real=([^\s/]+)/\S+)} $record -> name time val
    return [list "$name $time" $val]
}
Once that's done, we can lsort -real -decreasing with a -index specified to select the collation key (the $val we extracted above), and printing is now trivial:
set records [lmap r [splitIntoRecords $data] {convertRecord $r}]
foreach r [lsort -real -decreasing -index 1 $records] {
    puts [lindex $r 0]
}
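To tie the stages together, here is a minimal end-to-end sketch; the filename input.txt is a stand-in for wherever your log actually lives:
# Slurp the whole log into memory (step 1 assumes it fits)
set f [open "input.txt" r]
set data [read $f]
close $f

set records [lmap r [splitIntoRecords $data] {convertRecord $r}]
foreach r [lsort -real -decreasing -index 1 $records] {
    puts [lindex $r 0]
}
For the two sample records above, this prints:
GRELUP.C.3a.or:ndiff_c_fail_a real=1.10/1875.23
GRELUP.C.3a.or:ndiff_c_fail_a_same_well real=0.30/1875.23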

Related

How to avoid inserting a line into the file if the line is already present?

How should the check be made so that there are no duplicate lines in the file?
open( FILE, ">newfile");
for( $a = 1; $a < 20; $a = $a + 1 ) {
    my $random_number = 1 + int rand(10);
    # check to avoid inserting the line if the line is already present in the file
    print FILE "Random number is $random_number \n";
}
close(FILE);
!$seen{$_}++ is a common idiom for identifying duplicates.
use feature 'say';

my %seen;
for (1..19) {
    my $random_number = 1 + int rand(10);
    say "Random number is $random_number" if !$seen{$random_number}++;
}
But that doesn't guarantee that you will get all numbers from 1 to 10 in random order. If that's what you are trying to achieve, the following is a far better solution:
use feature 'say';
use List::Util qw( shuffle );

say "Random number is $_" for shuffle 1..10;
It seems like what you are asking is how to randomize the order of the numbers 1 to 20. I.e. no duplicates, random order. That can be easily done with a Schwartzian transform. For example:
perl -le'print for map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, rand()] } 1..20'
6
7
16
14
5
20
3
13
19
17
4
8
15
10
9
11
18
1
2
12
In this case, reading from the end backwards: we create a list of the numbers 1 .. 20 and feed it into a map that turns each number into an array ref containing that number and a random number. We feed that list of array refs to sort, sorting numerically on the second element of each array ref (the random number), which produces a random order. Another map then transforms each array ref back into a plain number, and finally we print the list using a for loop.
So in your case, the code would look something like:
print "Random number is: $_\n" for # print each number
map { $_>[0] } # restore to a number
sort { $a->[1] <=> $b->[1] } # sort the list on the random number
map { [ $_, rand() ] } # create array ref with random number as index
1 .. 20; # create list of numbers to randomize order of
Then you can use the program like below to redirect output to a file:
$ perl numbers.pl > newfile.txt
Enter each line into a hash as well, which makes it easy and efficient to check for it later.
use warnings;
use strict;
use feature 'say';

my $filename = shift or die "Usage: $0 filename\n";
open my $fh, '>', $filename or die "Can't open $filename: $!";

my %existing_lines;

for my $i (1..19)
{
    my $random_number = 1 + int rand(10);

    # Check to avoid inserting the line if it is already in the file
    if (not exists $existing_lines{$random_number}) {
        say $fh "Random number is $random_number";
        $existing_lines{$random_number} = 1;
    }
}
close $fh;
This assumes that the intent in the question is to not repeat that number (symbolizing content to be stored without repetition).
But if it is indeed the whole line (sentence) to be avoided, where the random number is used merely to make each line different, then use the whole line as the key:
for my $i (1..19)
{
    my $random_number = 1 + int rand(10);
    my $line = "Random number is $random_number";

    # Check to avoid inserting the line if it is already in the file
    if (not exists $existing_lines{$line}) {
        say $fh $line;
        $existing_lines{$line} = 1;
    }
}
Notes and literature
Lexical filehandles (my $fh) are much better than globs (FILE), and the three-argument open is better. See the guide perlopentut and the reference open.
Always check the open call (the or die ... above). It can and does fail -- quietly. In that check, always print the error it failed with, $!.
The C-style for loop is very rarely needed, while the usual foreach (with synonym for) is much nicer to use; see it in perlsyn. The .. is the range operator.
Always declare variables with my, and enforce that with the strict pragma; always use warnings.
If the filehandle refers to a pipe-open (not the case here), always check its close.
See perlintro for a general overview and for hashes; for more about Perl's data types see perldata. Keep in mind for later the notion of complex data structures, perldsc.
Because you cannot generate 20 distinct numbers in the range [1, 10].

How to subtract or add time series data of a CombiTimeTable in Modelica?

I have a text file that is used in a CombiTimeTable. The text file looks as follows:
#1
double tab1(5,2) # comment line
0 0
1 1
2 4
3 9
4 16
The first column is time and the second one is my data. My goal is to add each datum to the previous one, starting from the second row.
model example
  Modelica.Blocks.Sources.CombiTimeTable Tsink(fileName = "C:Tin.txt", tableName = "tab1", tableOnFile = true, timeScale = 60) annotation(
    Placement(visible = true, transformation(origin = {-70, 30}, extent = {{-10, -10}, {10, 10}}, rotation = 0)));
equation
end example;
Tsink.y[1] is column 2 of the table, but I do not know how to access it or how to implement an operation on it. Thanks for your help.
You can't use the blocks of the ModelicaStandardTables here, which are only meant for interpolation and hence do not expose the sample points to the Modelica model. However, you can use the Modelica library ExternData to easily read the array from a CSV file and do the required operations on the read data. For example,
model Example "Example model to read array and operate on it"
  parameter ExternData.CSVFile dataSource(
    fileName="C:/Tin.csv") "Data source"
    annotation(Placement(transformation(extent={{-60,60},{-40,80}})));
  parameter Integer n = 5 "Number of rows (must be known)";
  parameter Real a[n,2] = dataSource.getRealArray2D(n, 2) "Array from CSV file";
  parameter Real y[n - 1] = {a[i,2] + a[i + 1,2] for i in 1:n - 1} "Vector";
  annotation(uses(ExternData(version="2.6.1")));
end Example;
where Tin.csv is a CSV file with comma as delimiter
0,0
1,1
2,4
3,9
4,16
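For the sample Tin.csv above, the array comprehension yields y = {0 + 1, 1 + 4, 4 + 9, 9 + 16} = {1, 5, 13, 25}, i.e. each datum added to the previous one starting from the second row.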

word2vec recommendation system KeyError: "word '21883' not in vocabulary"

The code works absolutely fine for a data set containing 500000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a KeyError: word "***" not in vocabulary. It doesn't happen for every data point, but it does for most of them. The data set is in Excel format. [1]: https://i.stack.imgur.com/YCBiQ.png
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix this problem!
from gensim.models import Word2Vec
from tqdm import tqdm

purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)

model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples = model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)

# extract all vectors
X = model[model.wv.vocab]
X.shape

products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

def similar_products(v, n = 6):
    ms = model.similar_by_vector(v, topn= n+1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms

similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear enough times (default min_count=5).
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors & instead just drag the quality of surrounding words' vectors down - so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.
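For example, a minimal sketch of the look-before-you-leap option, sticking to the gensim 3.x API the question's code already uses (model.wv.vocab; in gensim 4+ the vocabulary moved to model.wv.key_to_index):
token = '21883'  # the StockCode that triggered the KeyError
if token in model.wv.vocab:
    print(similar_products(model[token]))
else:
    print("token %r not in vocabulary; it appeared fewer than min_count times" % token)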

Perl: look up data in hash table faster

I use code like this to find data values for my calculations:
sub get_data {
    $x = 0 if($_[1] eq "A"); # get column number by name
    $data{'A'} = [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
    return $data{$_[0]}[$x];
}
Data is stored like this in the Perl file. I plan on no more than 100 columns. Then to get a value I call:
get_data(column, row);
Now I realize that this is a terribly slow way to look up data in a table. How can I do it faster? SQL?
Looking at your github code, the main problem you have is that your
big hash of arrays is initialized every time the function is called.
Your current code:
my @atom;
# {'name'} = radius, depth, solvation_parameter, volume, covalent_radius, hydrophobic, H_acceptor, MW
$atom{'C'} = [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
$atom{'A'} = [2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, ''];
$atom{'N'} = [1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14];
$atom{'O'} = [1.60000, 0.20000, -0.00251, 17.15730, 0.73, 0, 1, 16];
...
Time taken for your test case on the slow netbook I'm typing this on: 6m24.400s.
The most important thing to do is to move this out of the function, so it's
initialized only once, when the module is loaded.
Time taken after this simple change: 1m20.714s.
But since I'm making suggestions, you could write it more legibly:
my %atom = (
    C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
    A => [ 2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, '' ],
    ...
);
Note that %atom is a hash in both cases, so your code doesn't do what you
were imagining: it declares a lexically-scoped array @atom, which is unused, then proceeds to fill up an unrelated global variable %atom. (Also do you really want an empty string for MW of A? And what kind of atom is A anyway?)
Secondly, your name-to-array-index mapping is also slow. Current code:
#take correct value from data table
$x = 0 if($_[1] eq "radius");
$x = 1 if($_[1] eq "depth");
$x = 2 if($_[1] eq "solvation_parameter");
$x = 3 if($_[1] eq "volume");
$x = 4 if($_[1] eq "covalent_radius");
$x = 5 if($_[1] eq "hydrophobic");
$x = 6 if($_[1] eq "H_acceptor");
$x = 7 if($_[1] eq "MW");
This is much better done as a hash (again, initialized outside the function):
my %index = (
    radius => 0,
    depth => 1,
    solvation_parameter => 2,
    volume => 3,
    covalent_radius => 4,
    hydrophobic => 5,
    H_acceptor => 6,
    MW => 7,
);
Or you could be snazzy if you wanted:
my %index = map { [qw[radius depth solvation_parameter volume
                      covalent_radius hydrophobic H_acceptor MW
                  ]]->[$_] => $_ } 0..7;
Either way, the code inside the function is then simply:
$x = $index{$_[1]};
Time now: 1m13.449s.
Another approach is just to define your field numbers as constants.
Constants are capitalized by convention:
use constant RADIUS=>0, DEPTH=>1, ...;
Then the code in the function is
$x = $_[1];
and you then need to call the function using the constants instead of strings:
get_atom_parameter('C', RADIUS);
I haven't tried this.
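Still, a minimal sketch of what it might look like (equally untested; it assumes the %atom hash initialized at module level as above):
use constant { RADIUS => 0, DEPTH => 1, SOLVATION_PARAMETER => 2, VOLUME => 3,
               COVALENT_RADIUS => 4, HYDROPHOBIC => 5, H_ACCEPTOR => 6, MW => 7 };

sub get_atom_parameter {
    # $_[0] is the atom name; $_[1] is already a numeric field index,
    # so there is no string-to-index lookup left in the function at all
    return $atom{$_[0]}[$_[1]];
}

my $r = get_atom_parameter('C', RADIUS);    # 2.00000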
But stepping back a bit and looking at how you are using this function:
while($ligand_atom[$x]{'atom_type'}[0]) {
    print STDERR $ligand_atom[$x]{'atom_type'}[0];
    $y = 0;
    while($protein_atom[$y]) {
        $d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
            - get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius')
            - get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
        $y++;
    }
    $x++;
    print STDERR ".";
}
Each time through the loop you are calling get_atom_parameter twice to
retrieve the radius.
But for the inner loop, one atom is constant throughout. So hoist the call
to get_atom_parameter out of the inner loop, and you've almost halved the
number of calls:
while($ligand_atom[$x]{'atom_type'}[0]) {
    print STDERR $ligand_atom[$x]{'atom_type'}[0];
    $y = 0;
    my $lig_radius = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
    while($protein_atom[$y]) {
        $d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
            - $lig_radius
            - get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
        $y++;
    }
    $x++;
    print STDERR ".";
}
But there's more. In your test case the ligand has 35 atoms and the
protein 4128 atoms. This means that your initial code made
4128*35*2 = 288960 calls to get_atom_parameter, and while now it's
only 4128*35 + 35 = 144515 calls, it's easy to just make some arrays with
the radii so that it's only 4128 + 35 = 4163 calls:
my $protein_size = $#protein_atom;
my $ligand_size;
{
    my $x = 0;
    $x++ while($ligand_atom[$x]{'atom_type'}[0]);
    $ligand_size = $x - 1;
}
#print STDERR "protein_size = $protein_size, ligand_size = $ligand_size\n";

my @protein_radius;
for my $y (0..$protein_size) {
    $protein_radius[$y] = get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
}

my @lig_radius;
for my $x (0..$ligand_size) {
    $lig_radius[$x] = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
}
for my $x (0..$ligand_size) {
    print STDERR $ligand_atom[$x]{'atom_type'}[0];
    my $lig_radius = $lig_radius[$x];
    for my $y (0..$protein_size) {
        $d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
            - $lig_radius
            - $protein_radius[$y];
    }
    print STDERR ".";
}
And finally, the call to distance_sqared [sic]:
# distance between atoms
sub distance_sqared {
    my $dxs = ($_[0]{'x'} - $_[1]{'x'})**2;
    my $dys = ($_[0]{'y'} - $_[1]{'y'})**2;
    my $dzs = ($_[0]{'z'} - $_[1]{'z'})**2;
    return $dxs + $dys + $dzs;
}
This function can usefully be replaced with the following, which uses
multiplication instead of **.
sub distance_sqared {
    my $dxs = ($_[0]{'x'} - $_[1]{'x'});
    my $dys = ($_[0]{'y'} - $_[1]{'y'});
    my $dzs = ($_[0]{'z'} - $_[1]{'z'});
    return $dxs*$dxs + $dys*$dys + $dzs*$dzs;
}
Time after all these modifications: 0m53.639s.
More about **: elsewhere you declare
use constant e_math => 2.71828;
and use it thus:
$Gauss1 += e_math ** (-(($d[$x][$y]*2)**2));
The built-in function exp() calculates this for you (in fact, ** is commonly
implemented as x**y = exp(log(x)*y), so each time you are doing this you are
performing an unnecessary logarithm the result of which is just slightly less
than 1 as your constant is only accurate to 6 d.p.). This change would alter
the output very slightly. And again, **2 should be replaced by multiplication.
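Concretely, assuming the question's variables $Gauss1 and $d, the update might become:
# exp() computes e**x directly with a full-precision e,
# and the inner **2 is replaced by an explicit multiplication
my $t = 2 * $d[$x][$y];
$Gauss1 += exp(-($t * $t));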
Anyway, this answer is probably long enough for now, and calculation of d[]
is no longer the bottleneck it was.
Summary: hoist constant values out of loops and functions! Calculating the
same thing repeatedly is no fun at all.
Using any kind of database for this would not help your performance in the
slightest. One thing that might help you though is Inline::C. Perl is
not really built for this kind of intensive computation, and Inline::C
would allow you to easily move performance-critical bits into C while
keeping your existing I/O in Perl.
I would be willing to take a shot at a partial C port. How stable
is this code, and how fast do you want it to be? :)
Putting this in a DB will make it MUCH easier to maintain, scale, expand, etc. Using a DB can also save you a lot of RAM -- it fetches and stores in RAM only the desired result instead of storing ALL values.
With regard to speed, it depends. With a text file you take a long time to read all the values into RAM, but once it is loaded, retrieving the values is super fast, faster than querying a DB.
So it depends on how your program is written and what it is for. Do you read all the values ONCE and then run 1000 queries? The text-file way is probably faster. Do you read all the values every time you make a query (to make sure you have the latest value set)? Then the DB would be faster. Do you run one query per day? Use a DB. And so on.

Print array length for each element of an array

Given a string array of variable length, print the lengths of each element in the array.
For example, given:
string[] ex = {"abc", "adf", "df", "ergd", "adfdfd"};
The output should be:
2 3 4 6
One possibility I'm considering is to use a linked list to save each string length, sorting while inserting, and finally display the results.
Any other suggestions for efficient solutions to this problem?
Whenever you want to maintain a collection of distinct things (ie: filter out duplicates), you probably want a set.
There are many different data structures for storing sets. Some of these, like search trees, will also "sort" the values for you. You could try using one of the many forms of binary search trees.
What you are doing now (or the given answer) is called insertion sort. It basically compares the length of the string-to-insert with those of the already-inserted strings. Then, when printing, the length of the string-to-print (at the current pointer) is compared with the lengths of the strings before and after it; if it has the same length, it is not printed!
Another approach is bubble sort: it compares two strings at a time, swaps them if needed, then moves to the next string...
The printing is the most important part in your program, regardless of what sorting algorithm you use, it doesn't matter.
Here's an algorithm for bubble sort and printing process, it's VB so just convert it...
Dim YourString(4) As String
YourString(0) = "12345" 'Will not be printed
YourString(1) = "12345" 'Will not be printed
YourString(2) = "123"   'Will be printed
YourString(3) = "1234"  'Will be printed

Dim RoundLimit As Integer = YourString.Length - 2

'Outer loop for how many times we will sort the whole array...
For CycleCounter = 0 To RoundLimit
    Dim CompareCounter As Integer
    'Inner loop to compare strings...
    For CompareCounter = 0 To RoundLimit - CycleCounter - 1
        'Compare lengths... If the first is greater, swap! Note: this is ascending
        If YourString(CompareCounter).Length > YourString(CompareCounter + 1).Length Then
            'Swapping process...
            Dim TempString = YourString(CompareCounter)
            YourString(CompareCounter) = YourString(CompareCounter + 1)
            YourString(CompareCounter + 1) = TempString
        End If
    Next
Next

'Cycles = Array length - 2, so we have 2 cycles here
'First Cycle!!!
'"12345","12345","123","1234" Compare 1: index 0 and 1 no changes
'"12345","123","12345","1234" Compare 2: index 1 and 2 changed
'"12345","123","1234","12345" Compare 3: index 2 and 3 changed
'Second Cycle!!!
'"123","12345","1234","12345" Compare 1: index 0 and 1 changed
'"123","1234","12345","12345" Compare 2: index 1 and 2 changed
'"123","1234","12345","12345" Compare 3: index 2 and 3 no changes
'No more cycles!

'Now print it! Or use a message box...
Dim CompareLimit As Integer = YourString.Length - 2
For CycleCounter = 0 To CompareLimit
    'If length is equal to the next string or the preceding string, do not print...
    If ((CycleCounter - 1) <> -1) Then 'Check if index exists
        If YourString(CycleCounter).Length = YourString(CycleCounter - 1).Length Then
            Continue For 'The length is not unique, exit compare, go to next iteration...
        End If
    End If
    If ((CycleCounter + 1) <> YourString.Length - 1) Then 'Check if index exists
        If YourString(CycleCounter).Length = YourString(CycleCounter + 1).Length Then
            Continue For 'The length is not unique, exit compare, go to next iteration...
        End If
    End If
    'All tests passed, the length is unique, show a dialog!
    MsgBox(YourString(CycleCounter))
Next
The question as stated doesn't say anything about sorting or removing duplicates from the results. It is only the given output that implies the sorting and duplicate removal. It doesn't say anything about optimisation for speed or space or writing for maintainability.
So there really isn't enough information for a "best" solution.
If you want a solution that will work in most languages you probably should stick with an array. Put the lengths in a new array, sort it, then print in a loop that remembers that last value to skip duplicates. I wouldn't want to use a language that couldn't cope with that.
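A sketch of that array-based approach -- shown here in C# to match the question's declaration, though any language with arrays would do (ex is the question's array):
using System;

class PrintDistinctLengths
{
    static void Main()
    {
        string[] ex = { "abc", "adf", "df", "ergd", "adfdfd" };

        // put the lengths in a new array, then sort it
        int[] lengths = new int[ex.Length];
        for (int i = 0; i < ex.Length; i++)
            lengths[i] = ex[i].Length;
        Array.Sort(lengths);

        int last = -1;              // remembers the last printed value
        foreach (int len in lengths)
        {
            if (len != last)        // skip duplicates
                Console.Write(len + " ");
            last = len;
        }
        Console.WriteLine();        // prints: 2 3 4 6
    }
}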
If a language is specified you might be able to take advantage of set or associative array data structures to handle the duplicates and/or sorting automatically. E.g., in Java you could pick a collection class that automatically ignores duplicates and sorts, and you could structure your code such that a one-line change to use a different class would let you keep duplicates, or not sort. If you are using C# you could probably write the whole thing as a one-line LINQ statement, as sketched below...
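For instance, a hedged sketch of such a LINQ one-liner, given the question's ex array:
using System;
using System.Linq;

class PrintDistinctLengthsLinq
{
    static void Main()
    {
        string[] ex = { "abc", "adf", "df", "ergd", "adfdfd" };
        // distinct lengths, ascending: prints "2 3 4 6"
        Console.WriteLine(string.Join(" ",
            ex.Select(s => s.Length).Distinct().OrderBy(n => n)));
    }
}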
Here is a C++ solution:
#include <set>
#include <vector>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string strarr[] = {"abc", "adf", "df", "ergd", "adfsgf"};
    vector< string > vstr(strarr, strarr + 5);
    set< size_t > s;

    for (size_t i = 0; i < vstr.size(); i++)
    {
        s.insert( vstr[i].size() );
    }
    for (set<size_t>::iterator ii = s.begin(); ii != s.end(); ii++)
        cout << *ii << " ";
    cout << endl;

    return 0;
}
Output:
$ g++ -o set-str set-str.cpp
$ ./set-str
2 3 4 6
A set is used because (quoting from here):
Sets are a kind of associative container that stores unique elements,
and in which the elements themselves are the keys.
Associative containers are containers especially designed to be
efficient accessing its elements by their key (unlike sequence
containers, which are more efficient accessing elements by their
relative or absolute position).
Internally, the elements in a set are always sorted from lower to
higher following a specific strict weak ordering criterion set on
container construction.
Sets are typically implemented as binary search trees.
And for details on vector see here and here for string.
Depending on the language, the easiest way might be to iterate through the array using a for loop
for (i = 0; i < array.length; i++) {
    print array[i].length;
}
Do you need to print them in order?
