Cleanup data on one long line in Mathematica - wolfram-mathematica
I have a file of approximately 12 MB with the following structure:
[["1",-154],["2",-100],["3",-28],["4",-66],["5",-222],["6",-309],["7",-196],["8",-50],["9",-53],["10",-209],["11",-355],["12",-350],["13",-269],["14",-264],["15",-392],["16",-513],["17",-515],["18",-434],["19",-418],["20",-505],["21",-592],["22",-559],["23",-422],["24",-384],["25",-539],["26",-716],["27",-713],["28",-593],["29",-534],["30",-647],["31",-813],["32",-857],["33",-711],["34",-582],["35",-594],["36",-700],["37",-721],["38",-600],["39",-487],["40",-490],["41",-589],["42",-630],["43",-502],["44",-365],["45",-340],["46",-403],["47",-420],["48",-291],["49",-136],["50",-98],["51",-218],["52",-285],["53",-198],["54",-52],["55",-58],["56",-213],["57",-334],["58",-301],["59",-195],["60",-195],["61",-324],["62",-470],["63",-465],["64",-378],["65",-381],["66",-546],["67",-734],["68",-767],["69",-695],["70",-683],["71",-804],["72",-991],["73",-1050],["74",-937],["75",-850],["76",-912],["77",-1041],["78",-1065],["79",-972],["80",-931],["81",-1030],["82",-1186],["83",-1233],["84",-1113],["85",-992],["86",-1051],["87",-1206],["88",-1299],["89",-1218],["90",-1112],["91",-1150],["92",-1287],["93",-1345],["94",-1239],["95",-1140],["96",-1147],["97",-1276],["98",-1363],["99",-1312],["100",-1206],["101",-1184],["102",-1297],["103",-1378],["104",-1297],["105",-1141],["106",-1113],["107",-1219],["108",-1325],["109",-1284],["110",-1147],["111",-1103],["112",-1179],["113",-1300],["114",-1262],["115",-1141],
I'd like, using Mathematica, to clean it up removing all the symbols and numbers between quotes just leaving them separated by a space in the following format:
-154 -100 -28 -66 -222 -309 -196 etc…
How could I do this? I am fairly new to Mathematica, and the tutorials on "How to clean an HTML file" or "How to clean a ZIP file" didn't clarify my question very much.
You can try importing it as a string, replacing [ with { and ] with }, evaluating the result with ToExpression, and then stripping out the first element of each pair with Last@Transpose.
data = Import["your_data.txt"];
Last@Transpose@
ToExpression[StringReplace[data, {"[" -> "{", "]" -> "}"}]]
Of course, there are much nicer ways of doing this. Slater's idea works well too. You'll find there are many ways to do this sort of thing in Mathematica.
Mathematica does have regex support, as well as a general string manipulation package. Something along the lines of:
string = "[[\"1\",-154],[\"2\",-100],[\"3\",-28],[\"4\",-66]]";
strings = StringSplit[string, "],["];
StringReplace[strings, RegularExpression["^\\[*\"[0-9]+\",|\\]+$"] -> ""]
You might need to play around with that a little, but that's the idea.
Here is another method that avoids ToExpression (which could theoretically run code you did not intend to):
Import["data.txt", "Text"];
StringSplit[%, {"[[", "],[", "]]", ","}][[2 ;; ;; 2]];
StringJoin[Riffle[%, " "]]
Export["out.dat", %, "Text"]
Related
ruby placeholders NOT populating in external .txt file
Here is what I currently have. The only problem is that the external file loads without the placeholder text updated: the placeholder text just says '[NOUN]' instead of the actual noun the user entered at the earlier prompt.
Update: cleaned up with @tadman's suggestions. It is, however, still not passing user input to the placeholder text in the external .txt file.
puts "\n\nOnce upon a time on the internet... \n\n"
puts "Name 1 website:"
print "1. "
loc=gets
puts "\n\Write 5 adjectives: "
print "1. "
adj1=gets
print "\n2. "
adj2=gets
print "\n3. "
adj3=gets
print "\n4. "
adj4=gets
print "\n5. "
adj5=gets
puts "\n\Write 2 nouns: "
print "1. "
noun1=gets
print "\n2. "
noun2=gets
puts "\n\nWrite 1 verb: "
print "1. "
verb=gets
puts "\n\nWrite 1 adverb: "
print "1. "
ptverb=gets
string_story = File.read("dynamicstory.txt")
puts string_story
Currently the output is (i.e. placeholders not populated):
\n\nOnce upon a time on the internet...\n\n One dreary evening while browsing the #{loc} website online, I stumbled accross a #{adj1} Frog creature named Earl. This frog would sit perturbed for hours at a time at the corner of my screen like Malware. One day, the frog appeared with a #{adj2} companion named Waldo that sat on the other corner of my screen. He had a #{adj3} set of ears with sharp #{noun1} inside. As the internet frogs began conversing and becoming close friends in hopes of #{noun2}, they eventually created a generic start-up together. They knew their start-up was #{adj4} but didn't seem to care and pushed through anyway. They would #{verb} on the beach with each other in the evenings after operating with shady ethics by day. They could only dream of a shiny and #{adj5} future full of gold. But then they eventually #{ptverb} and moved to Canada.\n\n The End\n\n\n
It's important to note that the Ruby string interpolation syntax is only valid within actual Ruby code, and it does not apply in external files. Those are just plain strings. If you want to do rough interpolation on those you'll need to restructure your program in order to make it easy to do. The last thing you want is to have to eval that string.
When writing code, always think about breaking up your program into methods or functions that have a specific purpose and can be used in a variety of situations. Ruby generally encourages code reuse and the "DRY principle", or "Don't Repeat Yourself". For example, your input method boils down to this generic method:
def input(thing, count = 1)
  puts "Name %d %s:" % [ count, thing ]
  count.times.map do |i|
    print '%d. ' % (i + 1)
    gets.chomp
  end
end
That gets input for an arbitrary thing with an arbitrary count. I'm using sprintf-style formatters here with % but you're free to use regular interpolation if that's how you like it. I just find it leads to a less cluttered string, especially when interpolating complicated chunks of code.
Next you need to organize that data into a proper container so you can access it programmatically. Using a bunch of unrelated variables is problematic. Using a Hash here makes it easy:
puts "\n\nOnce upon a time on the internet... \n\n"
words = { }
words[:website] = input('website')
words[:adjective] = input('adjectives', 5)
words[:noun] = input('nouns', 2)
words[:verb] = input('verb')
words[:adverb] = input('adverb')
Notice how you can now alter the order of these things by re-ordering the lines of code, and you can change how many of something you ask for by adjusting a single number.
The next thing to fix is your interpolation problem. Instead of using Ruby notation #{...}, which is hard to evaluate, go with something simple. In this case tags like %verb1 and %noun2 are used:
def interpolate(string, values)
  string.gsub(/\%(website|adjective|noun|verb|adverb)(\d+)/) do
    values.dig($1.to_sym, $2.to_i - 1)
  end
end
That looks a bit ugly, but the regular expression is used to identify those tags, and $1 and $2 pull out the two parts, word and number, separately, based on the capturing done in the regular expression. This might look a bit advanced, but if you take the time to understand this method you can very quickly solve fairly complicated problems with little fuss. It's something you'll use in a lot of situations when parsing or rewriting strings.
Here's a quick way to test it:
string_story = File.read("dynamicstory.txt")
puts interpolate(string_story, words)
Where the content of your file looks like:
One dreary evening while browsing the %website1 website online, I stumbled accross a %adjective1 Frog creature named Earl.
You could also adjust your interpolate method to pick random words.
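As a minimal sketch of the same idea that runs without a file or keyboard input, the snippet below applies the %word-number substitution to an in-memory string. The template text, the word list, and the names TEMPLATE, TAG and fill_template are made up for illustration, and the fallback that leaves an unknown tag untouched is an addition rather than part of the answer above.

```ruby
# Hypothetical template standing in for the contents of dynamicstory.txt.
TEMPLATE = "Browsing the %website1 site, I met a %adjective1 frog with %adjective2 %noun1."

# A word class followed by a 1-based index, e.g. %adjective2.
TAG = /%(website|adjective|noun|verb|adverb)(\d+)/

def fill_template(template, words)
  template.gsub(TAG) do |tag|
    # $1 is the word class, $2 the 1-based position within that class.
    # Assumption: when no word was collected for a tag, keep the tag as-is.
    words.dig($1.to_sym, $2.to_i - 1) || tag
  end
end

# Hypothetical answers, shaped like the hash built by the input method.
words = {
  website:   ["example.org"],
  adjective: ["grumpy", "shiny"],
  noun:      ["teeth"]
}

puts fill_template(TEMPLATE, words)
# prints: Browsing the example.org site, I met a grumpy frog with shiny teeth.
```

Keeping unknown tags visible in the output makes a missing answer easy to spot, whereas returning nil from the block would silently replace the tag with an empty string.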
In Pylint, how do I disable "Exactly one space after comma" for multidimensional array indices?
I like having Pylint check that commas are generally followed by spaces, except in one case: multidimensional indices. For example, I get the following warning from Pylint:
C: 31, 0: Exactly one space required after comma
num_features = len(X_train[0,:])
                            ^ (bad-whitespace)
Is there a way to get rid of the warnings requiring spaces after commas for the case of multidimensional arrays, but keep the space-checking logic the same for all other comma uses? Thanks!
I am sure you figured this out by now, but for anyone who, like me, happened upon this looking for an answer: use # pylint: disable=C0326 on the line that is guilty of this. For instance:
num_features = len(X_train[0,:]) #pylint: disable=C0326
This applies to multiple kinds of space errors. See the Pylint wiki.
You'll almost certainly want to disable this via the .pylintrc file for larger situations. For example, say I have:
x111 = thing.abc(asdf)
x112_b = thing1.abc(asdf)
x112_b224 = thing.abc(asdf)
x112_f = thing1.abc(asdf)
... lots more
Now, presume I want to line the assignments up visually:
x111      = thing.abc(asdf)
x112_b    = thing1.abc(asdf)
... lots more
so I add the following line to .pylintrc:
disable=C0326,C0115,C0116
(Note that only the first one, C0326, matters here. I'm leaving the two docstring ones there so you can see that you just append the codes of the messages you want to ignore.)
Undo wrong sorting after missing leading zeros
Here you have a little riddle for anyone who wants to spend some time on it: I have around 200 files that got badly sorted and renamed due to the lack of leading zeros. I have to undo this sorting and assign the original values again, so that I have an order like this:
current file name -> original
TimePoint1 -> TimePoint1
TimePoint2 -> TimePoint10
TimePoint3 -> TimePoint100
TimePoint4 -> TimePoint101
TimePoint5 -> TimePoint102
...
TimePoint250 -> TimePoint250
I will work on an answer, but I didn't want to miss any of the elegant solutions you might provide. Thanks and have fun!
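What you have here is simply alphabetical order, so the easiest way to undo the sorting is to generate that same order again and use it as a lookup table. I am not sure which language you want to use, so here is an example in Ruby:
a = (1..250).map(&:to_s).sort # note that I am sorting the *string* representations
# "restored" is an arbitrary target directory; renaming in place could overwrite files not yet processed
Dir.mkdir("restored") unless Dir.exist?("restored")
(0...250).each do |i|
  # the file currently called TimePoint(i + 1) was originally TimePoint(a[i])
  File.rename("TimePoint#{i + 1}", File.join("restored", "TimePoint#{a[i]}"))
end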
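Before renaming anything, it may help to print the mapping and compare it against the table in the question. The sketch below only builds the current-name to original-name lookup and touches no files; the method name original_names and its parameters are hypothetical, and it assumes the files were numbered 1 through count with no gaps.

```ruby
# Build a hash from the current (wrongly assigned) name to the original name.
# Assumption: the originals were numbered 1..count without gaps.
def original_names(count, prefix = "TimePoint")
  # Sorting the numbers as strings reproduces the order that caused the mix-up.
  sorted = (1..count).map(&:to_s).sort
  sorted.each_with_index.map { |orig, i| ["#{prefix}#{i + 1}", "#{prefix}#{orig}"] }.to_h
end

mapping = original_names(250)

# Show the first few pairs for a visual check against the question's table.
mapping.first(5).each { |current, original| puts "#{current} -> #{original}" }
```

Because every original number appears exactly once in the sorted list, the mapping is a one-to-one correspondence, which is why the renaming can be undone at all.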
Formatting usage messages
If you take a look at the Combinatorica package in Mathematica 8, in (mathematicapath)/AddOns/LegacyPackages/DiscreteMath/Combinatorica.m, you will find the definitions of functions. What I'm interested to know is how Mathematica knows how to format the usage messages. Something tells me that I'm not looking at the right file. In any case, let's try the following:
Cofactor::usage = "Cofactor[m, {i, j}] calculates the (i, j)th cofactor of matrix m."
This is line 682 in the file mentioned above. Now if we run it in a Mathematica notebook and use ?Cofactor, we will see the exact same message. But if we load the package, the message is formatted. Here is a screenshot:
Notice how the m, i and j inside the function changed and a double arrow was added to the message. I think the arrow was added to the message because there exists documentation for it. Can someone explain this behavior?
EDIT: This is a screenshot of my notebook file that autosaves to an m file. As you can see, the L and M are in italic Times New Roman. Now I will load the package and see the usage. So far so good. Now let's look at the Documentation Center. I will look for the function LineDistance. As you can see, it shows a weird message. In this case we only want to display the message without any styles. I still can't figure out how the Combinatorica package does this. I followed this to make the index so that the Documentation Center can display the summary. The summary is essentially the usage display. Let me know if I need to be more specific.
OK, here's the explanation. Digging in the Combinatorica source reveals this:
(* get formatted Combinatorica messages, except for special cases *)
If[FileType[ToFileName[{System`Private`$MessagesDir,$Language},"Usage.m"]]===File,
  Select[FindList[ToFileName[{System`Private`$MessagesDir,$Language},"Usage.m"],"Combinatorica`"],
    StringMatchQ[#,StartOfString~~"Combinatorica`*"]&&
    !StringMatchQ[#,"Combinatorica`"~~("EdgeColor"|"Path"|"Thin"|"Thick"|"Star"|"RandomInteger")~~__]&]//ToExpression;
]
It is loading messages from ToFileName[{System`Private`$MessagesDir,$Language},"Usage.m"], which on my machine is SystemFiles\Kernel\TextResources\English\Usage.m. This is why all usage messages are created conditionally in Combinatorica.m (only if they don't exist yet). If you look in Usage.m you'll see it has all the ugly boxes stuff that @ragfield mentioned.
I guess the simplest way to have formatted messages is to edit them in the front end in a notebook, and create an auto-save package. This way you can use all the front end's formatting tools, and won't need to deal with boxes.
I will answer how the link in the message is generated. Tracing message printing shows a call to the undocumented Documentation`CreateMessageLink function, which returns the URL of the corresponding documentation page if this page exists:
Trace[Information[Sin], Documentation`CreateMessageLink]
In[32]:= Documentation`CreateMessageLink["System", "Sin", "argx", "English"]
Out[32]= "paclet:ref/message/General/argx"
In some cases we can also see calls to Internal`MessageButtonHandler, which in turn calls Documentation`CreateMessageLink:
Trace[Message[Sin::argx, 1, 1], Internal`MessageButtonHandler | Documentation`CreateMessageLink, TraceInternal -> True]
The way to embed style information in a String expression is to use linear syntax. For a box expression such as:
StyleBox["foo", FontSlant->Italic]
You can embed this inside of a String by adding \* to the front of it and escaping any special characters such as quotes:
"blah \*StyleBox[\"foo\", FontSlant->Italic] blah"
This should work for any box expression, no matter how complicated:
"blah \*RowBox[{SubsuperscriptBox[\"\[Integral]\",\"0\",\"1\"],RowBox[{FractionBox[\"1\",RowBox[{\"x\",\"+\",\"1\"}]],RowBox[{\"\[DifferentialD]\",\"x\"}]}]}] blah"
I am currently working on rewriting your ApplicationMaker for newer Mathematica versions with added functionality, and came to the exact same question. My answer is simple: Mathematica doesn't allow formatted summaries for your symbols (or even built-in symbols), so we have to unformat the usage strings for the summaries. The usage string itself can still have formatting, but one needs a function that removes all the formatting boxes from a string. I have a solution that uses UndocumentedTestFEParserPacket, as described by John Fultz in this question. This funnily named tool parses a string input into the real, unchanged Mathematica box form. This is my example code:
str0 = Sum::usage
str1 = StringJoin[ToString[StringReplace[#, "\\\"" -> "\""]]& /@
  (Riffle[MathLink`CallFrontEnd[
      FrontEnd`UndocumentedTestFEParserPacket[str0, True]]〚1〛
    //. RowBox[{seq___}] :> seq /. BoxData -> List, " "]
   /. SubscriptBox[a_, b_] :> a<>"_"<>b
   /. Except[List, _Symbol][args__] :> Sequence@@Riffle[{args}, " "])];
str2 = Fold[StringReplace, str1,
  {((WhitespaceCharacter...)~~br:("["|"("|"=") ~~ (WhitespaceCharacter ...)) :> br,
   ((WhitespaceCharacter ...) ~~ br:("]"|"}"|","|".")) :> br,
   (br:("{") ~~ (WhitespaceCharacter ...)) :> br,
   ". " ~~ Except[EndOfString] -> ". \n"}]
And this is what the output looks like (the first output is the fancy formatted str0, the second the simple flat str2).
Code explanation:
str0 is the formatted usage string with all the StyleBoxes and other formatting boxes.
str1: UndocumentedTestFEParserPacket[str0, True] gives boxes and strips off all StyleBoxes, because the second argument is True. The first replacement removes all RowBoxes. The outer BoxData is changed to a List of strings. Whitespace is inserted between these strings by the Riffle. SubscriptBox gets special treatment. The last rule replaces every remaining formatting box, such as UnderoverscriptBox, by adding whitespace between its arguments and returning the arguments as a flat Sequence. ToString[StringReplace[#, "\\\"" -> "\""]]& /@ was added to cover more cases, such as StringReplace::usage. These cases involve quoted strings with styles inside the usage string, where "args" has to be given as strings.
str2: In this block of code I only remove unwanted WhitespaceCharacter matches from the string str1 and add line breaks "\n" after each ".", because they got lost during the parsing. There are three different cases where whitespace can be removed: (1) on both the left and right side of a character like "[", (2) on the left side only, and (3) on the right side only.
Summary
Instead of summary -> mySymbol::usage, use summary -> unformatString[mySymbol::usage], with unformatString being an appropriate function that performs the unformatting described above. Alternatively, you can define another usage message manually, like
f::usage = "fancy string with formatting";
f::usage2 = "flat string without formatting";
and then use summary -> mySymbol::usage2
Fastest way to skip lines while parsing files in Ruby?
I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case. I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file. (Obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example.) Since I know it won't be in the first half, is there a quick way of disregarding that information? Currently I'm using something along the lines of:
while buffer = file_in.gets and file_in.lineno <600
  next unless file_in.lineno > 500
  if buffer.chomp!.include? some_string
    do_func_whatever
  end
end
It works, but I just can't help but think it could work better. I'm very new to Ruby and am interested in learning new ways of doing things in it.
file.each_line.drop(500).take(100) # will get you lines 501-600
(Older Ruby versions also accepted file.lines here; IO#lines was removed in Ruby 3.0.)
Generally, you can't avoid reading the file from the start up to the line you are interested in, as each line can be of a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach what you are looking for. Pretty much like your own example. You can just make it more Rubyish.
PS. The Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this could be avoided:
file.each_line.select.with_index{|l,i| (500..599) === i} # the index is zero-based, so this is lines 501-600
PS2: Doh, the above code, while not making a huge array, iterates through the whole file, even the lines after line 600. :( Here's a third version:
enum = file.each_line
500.times{enum.next} # skip 500
enum.take(100) # take the next 100
or, if you prefer FP:
file.each_line.tap{|enum| 500.times{enum.next}}.take(100)
Anyway, the good point of this monologue is that you can learn multiple ways to iterate a file. ;)
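Another way to get the same effect is a lazy enumerator, which neither builds an array of the skipped lines nor reads past the last wanted line. This is only a sketch: the helper name line_range is hypothetical, and a StringIO with generated content stands in for a real file so the snippet runs on its own.

```ruby
require "stringio"

# Return lines first..last (1-based, inclusive) as an array.
# lazy.drop discards the leading lines one at a time, and first(n)
# stops the iteration as soon as n lines have been collected.
def line_range(io, first, last)
  io.each_line.lazy.drop(first - 1).first(last - first + 1)
end

# Stand-in for a real file: 1000 generated lines.
io = StringIO.new((1..1000).map { |n| "line #{n}\n" }.join)

selected = line_range(io, 501, 600)
puts selected.size   # number of lines collected
puts selected.first  # first collected line
```

With a real file the call would look like File.open(path) { |f| line_range(f, 501, 600) }, where path is whatever file you are parsing. The lines before the range still have to be read, for the reason given above, but they are dropped as they go by.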
I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes. See IO#seek, or see IO#open for information on the offset argument.
Sounds like rio might be of help here. It provides you with a lines() method.
You can use IO#readlines, which returns an array with all the lines:
IO.readlines(file_in)[500..600].each do |line|
  #line is each line in the file (including the last \n)
  #stuff
end
or
f = File.new(file_in)
f.readlines[500..600].each do |line|
  #line is each line in the file (including the last \n)
  #stuff
end