Erlang upper case and lower case sort - sorting

My question is about comparing upper and lower case: how can I do that in my sort function? Any idea?
Ex: Input file: " I am Happy! "
Output file:
Happy!
I
am
That's what happens with my program, but I would like to have:
am
I
Happy
My code:
-module(wp).
-compile([export_all]).

% Open the file, sort its words and write them to another file
sortFile(File1, File2) ->
    {ok, File_Read} = file:read_file(File1),
    % make a list of the words
    Liste = string:tokens(binary_to_list(File_Read), "\n "),
    % insertion-sort the list
    Sort_List = isort(Liste),
    ISort = string:join(Sort_List, "\n"),
    % write the result to the output file
    {ok, File_Write} = file:open(File2, write),
    file:write(File_Write, ISort),
    file:close(File_Write).
isort([]) -> [];
isort([X|XS]) -> insert(X, isort(XS)).

insert(Elem, []) -> [Elem];
insert(Elem, [X|XS]) when Elem =< X -> [Elem, X | XS];
insert(Elem, [X|XS]) -> [X | insert(Elem, XS)].

how about something like this:
qsort1([]) -> [];
qsort1([H|T]) ->
    qsort1([X || X <- T, string:to_lower(X) < string:to_lower(H)])
        ++ [H]
        ++ qsort1([X || X <- T, string:to_lower(X) >= string:to_lower(H)]).
7> qsort1(["I", "am","Happy"]).
["am","Happy","I"]
I believe that "happy" sorts less than "i"
8> "happy" < "i".
true
which is why my sorted order is a little different from your original post.

Since sorting performs at least N*log2(N) comparisons, there is no need to do N*log2(N) case transformations; N are enough if you transform once up front. (Almost all Perl developers know this trick as the Schwartzian transform.)
{ok, Bin} = file:read_file(?INPUT_FILE),
Toks = string:tokens(binary_to_list(Bin),"\n "),
Result = [[X,$\n] || {_,X} <- lists:sort([{string:to_lower(X), X} || X<-Toks])],
file:write_file(?OUTPUT_FILE, Result).
BTW, lists:sort/1 is a merge sort with guaranteed N*log2(N) behaviour and is pretty efficient, in contrast to the concise but less efficient quicksort implementation. Worse, quicksort has an N^2 worst case.

Now, depending on whether you are on Windows or Unix/Linux, the lines in the file will end with different characters. Let's go with Windows, where it's normally \r\n. Assuming the input file is not too big, we can read it into a binary all at once. The stream of data we get must then be split into lines, and each line into words (on spaces). If the input file is very big and cannot fit in memory, you will have to read it line by line, in which case you might need an in-memory buffer to hold all the words ready for sorting; that would require an ETS table or Memcached (an option I won't illustrate here). Let's write the code:
-module(sick_sort).
-compile(export_all).

-define(INPUT_FILE, "C:/SICK_SORT/input.txt").
-define(OUTPUT_FILE_PATH, "C:/SICK_SORT/").
-define(OUTPUT_FILENAME, "output.txt").

start() ->
    case file:read_file(?INPUT_FILE) of
        {ok, Binary} ->
            %% input file read; split on CRLF and spaces to get the words
            Words = string:tokens(binary_to_list(Binary), "\r\n "),
            SortedText = lists:flatten([Word ++ "\r\n" || Word <- lists:sort(Words)]),
            EndFile = filename:join(?OUTPUT_FILE_PATH, ?OUTPUT_FILENAME),
            file:write_file(EndFile, SortedText),
            ok;
        Error -> {error, Error}
    end.
That should work. Change the macros in the source file to suit your settings and then just run sick_sort:start().

You have to compare lower-cased values in your sort function:
(nitrogen@127.0.0.1)25> F = fun(X,Y) -> string:to_lower(X) < string:to_lower(Y) end.
#Fun<erl_eval.12.111823515>
(nitrogen@127.0.0.1)26> lists:sort(F, ["I","am","Happy"]).
["am","Happy","I"]
(nitrogen@127.0.0.1)27>
EDIT:
In your code, what orders the list are the operators > and < (if you want to keep duplicated strings, one of them should include =, otherwise you are effectively doing a usort). If you want to use a different comparison, you can define it as a named or anonymous function and then use it in the quicksort:
mycompare(X, Y) ->
    string:to_lower(X) < string:to_lower(Y).

quicksort([]) -> [];
quicksort([X|XS]) ->
    quicksort([Y || Y <- XS, mycompare(Y, X)]) ++ [X] ++
        quicksort([Y || Y <- XS, not mycompare(Y, X)]).

Related

Summation Elixir

I am trying to recreate this equation in Elixir:
For now I am working on an easy example and I have something like this:
Enum.each(1..2, fn x -> :math.pow(1 + 1/1, -x) end)
However, Enum.each returns :ok, so I can't feed the result into Enum.sum() afterwards.
I will be grateful for help.
While the answer by @sabiwara is perfectly correct, one had better either use Stream.map/2 to avoid building an intermediate list that might be huge, or reduce directly to the result with Enum.reduce/3.
# ⇓ initial value
Enum.reduce(1..2, 0, &:math.pow(1 + 1/1, -&1) + &2)
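For completeness, the Stream.map/2 route mentioned above could look like this (a small sketch, not from the original answer):
1..2
|> Stream.map(&:math.pow(1 + 1/1, -&1))
|> Enum.sum()
Stream.map/2 defers the work, so Enum.sum/1 consumes the elements one at a time without materializing an intermediate list.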
Enum.each/2 is for side effects, but does not return a transformed list.
You are looking for Enum.map/2.
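For instance, with the toy expression from the question (a sketch):
1..2
|> Enum.map(fn x -> :math.pow(1 + 1/1, -x) end)
|> Enum.sum()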
Alternatively, you could use a for comprehension:
for x <- 1..2, do: :math.pow(1 + 1/1, -x)

SciTE sort selection tool: numbers with leading spaces are not sorted as expected

The SciTE editor comes with an embedded Lua scripting engine that has access to the text buffer of the editor. This makes it possible to extend SciTE's functionality with tools programmed in Lua and started from the Tools menu. One such tool, available from here:
http://lua-users.org/wiki/SciteSortSelection
is a tool for sorting selected lines into alphabetical order.
What was/is annoying for me is that it doesn't sort lines containing numbers in their numerical order, but like this:
1
111
2
222
3
333
where I would rather expect:
1
2
3
111
222
333
Google and Co. are not much help here, as to my knowledge no solution to this problem is available online yet. It is also not that easy to find and fully understand the Lua documentation for table.sort(). So the question for a knowledgeable Lua programmer is: what is the best way to patch the existing Lua script so that numbers (and also lines of text with leading spaces) are sorted as expected, and so that the code runs fast enough that even sorting huge selections (50 MByte and more) won't take much time?
Your expectation is wrong. You said the algorithm is supposed to sort the texts alphabetically, and that is exactly what it does.
For Lua, "11" is smaller than "2".
I think you would agree that "aa" should come before "b", which is pretty much the same thing.
If you want to change how texts are sorted you have to provide your own function.
The Lua reference manual says:
table.sort (list [, comp])
Sorts list elements in a given order, in-place, from list[1] to
list[#list]. If comp is given, then it must be a function that
receives two list elements and returns true when the first element
must come before the second in the final order (so that, after the
sort, i < j implies not comp(list[j],list[i])). If comp is not given,
then the standard Lua operator < is used instead.
Note that the comp function must define a strict partial order over
the elements in the list; that is, it must be asymmetric and
transitive. Otherwise, no valid sort may be possible.
The sort algorithm is not stable: elements considered equal by the
given order may have their relative positions changed by the sort.
So you are free to implement your own comp function to change the sorting.
By default, table.sort(list) sorts the list in ascending order.
To make it sort in descending order you call:
table.sort(list, function(a,b) return a > b end)
If you want to treat numbers differently you can do something like this:
t = {"111", "11", "3", "2", "a", "b"}
local function myCompare(a,b)
local a_number = tonumber(a)
local b_number = tonumber(b)
if a_number and b_number then
return a_number < b_number
end
end
table.sort(t, myCompare)
for i,v in ipairs(t) do
print(v)
end
which would give you the output
2
3
11
111
a
b
Of course this is just a quick and simple example. A nicer implementation is up to you.
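For the record, a slightly fuller comparator could look like this (my sketch, not part of the answer above): numbers compare numerically and come before everything else, and the remaining strings fall back to plain string comparison, so the order is well defined even for mixed content.
local function fullCompare(a, b)
  local na, nb = tonumber(a), tonumber(b)
  if na and nb then return na < nb end  -- both numeric: compare numerically
  if na then return true end            -- numbers sort before non-numbers
  if nb then return false end
  return a < b                          -- plain string comparison otherwise
end
-- usage: table.sort(t, fullCompare)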
Below is what I finally came up with myself. It is surely a quick and dirty solution which further slows down the already slow process of sorting huge buffers of text lines (compared to jEdit [Plugins]->[Text Tools]->[Sort Lines] or to the bash command line 'sort -g'), but it is at least there for use and works as expected. For the sake of completeness, here is the entire section of code currently present in my Lua startup script for SciTE:
-- =============================================================================
-- Sort Selected Lines (available in MENU -> Tools):
-- -----------------------------------------------------------
-- Specify in .SciTEUser.properties:
-- command.name.2.*=# Sort Selected Lines '
-- command.subsystem.2.*=3
-- command.mode.2.*=savebefore:no
-- command.2.*=SortSelectedLines
-- # command.shortcut.2.*=Ctrl+2 # Ctrl+2 is DEFAULT for command.2.*
function lines(str)
  local t = {}
  local i, lstr = 1, #str
  while i <= lstr do
    local x, y = string.find(str, "\r?\n", i)
    if x then t[#t + 1] = string.sub(str, i, x - 1)
    else break
    end
    i = y + 1
  end
  if i <= lstr then t[#t + 1] = string.sub(str, i) end
  return t
end
-- It was annoying for me that using table.sort(buffer) in Lua
-- didn't sort numbers with leading spaces in their numerical order.
-- Using the following comparison function helps to avoid that problem:
function compare(a,b)
  return a:gsub(" ", "0") < b:gsub(" ", "0")
  -- If 'compare' is not used ( table.sort(buf) ),
  -- Lua implicitly sorts with (see the Lua tutorial):
  -- return a < b
  -- so changing the return statement above back to that
  -- would be enough to restore sorting to how it was before.
end
function SortSelectedLines()
  local sel = editor:GetSelText()
  if #sel == 0 then return end
  local eol = string.match(sel, "\n$")
  local buf = lines(sel)
  table.sort(buf, compare)
  --table.foreach(buf, print) -- used for debugging
  local out = table.concat(buf, "\n")
  if eol then out = out.."\n" end
  editor:ReplaceSel(out)
end
-- ---------
-- :Sort Selected Lines
-- -----------------------------------------------------------------------------

efficiently reading a large file into a Map

I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid
text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x

--parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|> ATL.char '\t'
    return (x,y)

--line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of
        True  -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
  where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do
    x <- ATL.many1 line
    return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly grows too inefficient. I'm not quite clear on where the problem is, and soon realized that even trivial tasks like viewing the last character of the file were taking too long when I tried, e.g. with
foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben
Note that the file you want to load has 6 million lines and
the text you are interested in storing comprises approx. 120 MB.
Lower Bounds
To establish some lower bounds, I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how long it took for this Perl program to read that file:
my %H;
while (<>) {
    chomp;
    my ($a, $b) = split("\t", $_, 2);
    $H{$a} = $b;
}
This took approx. 17 secs., so I would expect any Haskell program to take about that amount of time.
If this start-up time is unacceptable, consider the following options:
1. Work in ghci and use the "live reloading" technique to save the map using the Foreign.Store package, so that it persists through ghci code reloads. That way you only have to load the map data once as you iterate your code.
2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB).
3. Access the data through a client-server store.
4. Reduce the number of key-value pairs you store (do you need all 6 million?).
Option 1 is discussed in this blog post by Chris Done:
Reload Running Code in GHCI
Options 2 and 3 will require you to work in the IO monad; a small sqlite-based sketch of option 2 follows below.
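To make option 2 concrete, here is a minimal sketch using the sqlite-simple package (my own illustration, not part of the original timing experiment; the database file, table and values are made up):
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Database.SQLite.Simple

main :: IO ()
main = do
  conn <- open "etym.db"
  execute_ conn "CREATE TABLE IF NOT EXISTS etym (word TEXT PRIMARY KEY, origin TEXT)"
  -- load step (run once): insert each (word, origin) pair parsed from the tsv file
  execute conn "INSERT OR REPLACE INTO etym (word, origin) VALUES (?,?)"
          ("someWord" :: T.Text, "someOrigin" :: T.Text)
  -- later lookups hit the database instead of re-reading the 120 MB file
  rows <- query conn "SELECT origin FROM etym WHERE word = ?" (Only ("someWord" :: T.Text))
  mapM_ (\(Only o) -> putStrLn (T.unpack o)) (rows :: [Only T.Text])
  close conn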
Parsing
First of all, check the type of your tsv function:
tsv :: Data.Attoparsec.Internal.Types.Parser
DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look
right.
Secondly, as @chi suggested, I doubt that this use of attoparsec is lazy. In particular, it has to verify that the entire parse succeeds, so I can't see how it can avoid creating all of the parsed lines before returning.
To truly parse the input lazily, take the following approach:
toPair :: DTL.Text -> (Key, Value)
toPair input = ...

main = do
    all_lines <- fmap DTL.lines $ DTLIO.getContents
    let m = M.fromList $ map toPair all_lines
    print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it
on a line-by-line basis instead of on the entire input.
ByteString vs. Text
In my experience working with ByteStrings is much faster than working with Text.
This version of toPair for ByteStrings is about 4 times faster than the corresponding
version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

toPair :: L.ByteString -> (L.ByteString, L.ByteString)
toPair bs =
    case AL.maybeResult (AL.parse parseLine bs) of
      Nothing    -> error "bad line"
      -- the parser yields strict ByteStrings; convert back to lazy to match the signature
      Just (a,b) -> (L.fromStrict a, L.fromStrict b)
  where
    parseLine = do
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      a <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      rel <- A.takeWhile (/= '\t')
      A.skipWhile (== '\t')
      A.skipWhile (/= ' ')
      A.skipWhile (== ' ')
      c <- A.takeWhile (const True)
      if rel == "rel:etymological_origin_of"
        then return (c,a)
        else return (a,c)
Or, just use plain ByteString functions:
fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
    case fields bs of
      (x:y:z:_) -> let a = snipSpace x
                       c = snipSpace z
                   in if y == "rel:etymological_origin_of"
                        then (c,a)
                        else (a,c)
      _         -> error "bad line"
Most of the time spent loading the map is in parsing the lines.
For ByteStrings this is about 14 sec. to load all 6 million lines
vs. 50 secs. for Text.
To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.
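To sketch what that looks like with parseWith (my example, not from the answer above; the parser assumes a simplified key<TAB>value line format, so adapt it to the real etymwn.tsv layout):
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Attoparsec.ByteString.Char8 as A
import System.IO (Handle, IOMode(..), withFile)

-- hypothetical parser for one "key<TAB>value\n" line
pairLine :: A.Parser (B.ByteString, B.ByteString)
pairLine = do
  k <- A.takeWhile (/= '\t')
  _ <- A.char '\t'
  v <- A.takeWhile (/= '\n')
  _ <- A.char '\n'
  return (k, v)

-- pull chunks from the handle on demand; an empty chunk signals end of input
readPairs :: Handle -> IO [(B.ByteString, B.ByteString)]
readPairs h = go B.empty
  where
    refill = B.hGetSome h 4096
    go leftover = do
      r <- A.parseWith refill pairLine leftover
      case r of
        A.Done rest pair -> fmap (pair :) (go rest)
        _                -> return []   -- Fail: assume we reached end of input

main :: IO ()
main = withFile "etymwn.tsv" ReadMode $ \h -> do
  pairs <- readPairs h
  print (length pairs)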

Reading the last n lines from a huge text file

I've tried something like this
file_in <- file("myfile.log","r")
x <- readLines(file_in, n=-100)
but I'm still waiting...
Any help would be greatly appreciated
I'd use scan for this, in case you know how many lines the log has :
scan("foo.txt",sep="\n",what="char(0)",skip=100)
If you have no clue how many you need to skip, you have no choice but to move towards either
reading in everything and taking the last n lines (in case that's feasible),
using scan("foo.txt",sep="\n",what=list(NULL)) to figure out how many records there are, or
using some algorithm to go through the file, keeping only the last n lines every time
The last option could look like :
ReadLastLines <- function(x, n, ...){
  con <- file(x)
  open(con)
  out <- scan(con, n, what = "char(0)", sep = "\n", quiet = TRUE, ...)
  while (TRUE) {
    tmp <- scan(con, 1, what = "char(0)", sep = "\n", quiet = TRUE)
    if (length(tmp) == 0) { close(con); break }
    out <- c(out[-1], tmp)
  }
  out
}
allowing :
ReadLastLines("foo.txt",100)
or
ReadLastLines("foo.txt",100,skip=1e+7)
in case you know you have more than 10 million lines. This can save on the reading time when you start having extremely big logs.
EDIT: In fact, I'd not even use R for this, given the size of your file. On Unix, you can use the tail command. There is a Windows version of that as well, somewhere in a toolkit, though I haven't tried it out yet.
You could do this with read.table by specifying the skip parameter. If your lines are not to be parsed into variables, specify the separator to be '\n' as @Joris Meys pointed out below, and also set as.is=TRUE to get character vectors instead of factors.
Small example (skipping the first 2000 lines):
df <- read.table('foo.txt', sep='\n', as.is=TRUE, skip=2000)
As @JorisMeys already mentioned, the Unix command tail would be the easiest way to solve this problem. However, I want to propose a seek-based R solution that starts reading the file from its end:
tailfile <- function(file, n) {
  bufferSize <- 1024L
  size <- file.info(file)$size
  if (size < bufferSize) {
    bufferSize <- size
  }
  pos <- size - bufferSize
  text <- character()
  k <- 0L
  f <- file(file, "rb")
  on.exit(close(f))
  while (TRUE) {
    seek(f, where = pos)
    chars <- readChar(f, nchars = bufferSize)
    k <- k + length(gregexpr(pattern = "\\n", text = chars)[[1L]])
    text <- paste0(chars, text)  # prepend: chunks are read from the end of the file backwards
    if (k > n || pos == 0L) {
      break
    }
    pos <- max(pos - bufferSize, 0L)
  }
  tail(strsplit(text, "\\n")[[1L]], n)
}
tailfile(file, n=100)
You can read the last n lines with the following method.
Step 1 - Open your file as you wish:
df <- read.csv("hw1_data.csv")
Step 2 - Now use the tail function to read the last n lines:
tail(df, 2)
Some folks have said it already, but if you have a large log, it is most efficient to only read in what you need instead of reading it all into memory, then subsetting what you need.
For this, we use R's system() to run the Linux tail command.
Read the last 10 lines of the log:
system("tail path/to/my_file.log")
Read the last 2 lines of the log:
system("tail -n 2 path/to/my_file.log")
Read the last 2 lines of the log and capture the output in a character vector:
last_2_lines <- system("tail -n 2 path/to/my_file.log", intern = TRUE)
For seeing the last few lines:
tail(file_in,100)

Populating a list in Scala with random doubles is taking forever

I am new to Scala and am trying to get a list of random double values:
The thing is, when I run the code below, it takes way too long compared to its Java counterpart. Any idea why, or any suggestion for a more efficient approach?
def random: Double = java.lang.Math.random()
var f = List(0.0)
for (i <- 1 to 200000)
  f = f ::: List(random * 100)
f = f.tail
You can also achieve it like this:
List.fill(200000)(math.random)
the same goes for e.g. Array ...
Array.fill(200000)(math.random)
etc ...
You could construct an infinite stream of random doubles:
def randomList(): Stream[Double] = Stream.cons(math.random, randomList)
val f = randomList().take(200000)
This will leverage lazy evaluation so you won't calculate a value until you actually need it. Even evaluating all 200,000 will be fast though. As an added bonus, f no longer needs to be a var.
Another possibility is:
val it = Iterator.continually(math.random)
it.take(200000).toList
Stream also has a continually method if you prefer.
First of all, it is not taking longer than the Java version because there is no real Java counterpart: Java does not have an immutable list. If it did, performance would be about the same.
Second, it's taking a lot of time because appending to a list has linear cost, so the whole loop is quadratic.
Instead of appending, prepend, which has constant cost.
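A minimal sketch of that approach (added for illustration, not from the original answer): prepend inside the loop, then reverse once at the end.
def random: Double = java.lang.Math.random()

var f: List[Double] = Nil
for (_ <- 1 to 200000)
  f = (random * 100) :: f   // prepend: constant cost per element
f = f.reverse               // one O(n) pass restores insertion order
The single reverse keeps the overall work linear instead of quadratic.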
If you're using mutable state anyway, you should use a mutable collection like ListBuffer, which you can append to with += (and which would then be the real counterpart to the Java code).
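A sketch of that ListBuffer variant (my illustration, not part of the original answer):
import scala.collection.mutable.ListBuffer

val buf = ListBuffer.empty[Double]
for (_ <- 1 to 200000)
  buf += math.random * 100   // += appends in constant time
val f = buf.toList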
But why don't you use a for comprehension?
val f = for (_ <- 1 to 200000) yield (math.random * 100)
By the way: var f = List(0.0) ... f = f.tail can be replaced by var f: List[Double] = Nil in your example (no extra performance, but more beauty ;)).
Yet more options! Tail recursion:
def randlist(n: Int, part: List[Double] = Nil): List[Double] = {
  if (n <= 0) part
  else randlist(n - 1, 100 * random :: part)
}
or mapped ranges:
(1 to 200000).map(_ => 100*random).toList
Looks like you want to use Vector instead of List. List has O(1) prepend, Vector has O(1) append. Since you are appending, but using concatenation, it'll be faster to use Vector:
def random: Double = java.lang.Math.random()
var f: Vector[Double] = Vector()
for (i <- 1 to 200000)
  f = f :+ (random * 100)
Got it?
