OrientDB ETL with CSV, no headers and multiple join fields

I am trying to load some CSV files into OrientDB. They have been extracted from a MySQL database holding the Unified Medical Language System (NIH UMLS) data. The two files contain vertices:
"C0484850" "A18164418" "Troponin T.cardiac [Mass/volume] in Venous blood" "Y" "Clinical Attribute"
"C0484850" "A18241423" "Troponin T.cardiac:MCnc:Pt:BldV:Qn" "Y" "Clinical Attribute"
"C0484850" "A18861342" "Troponin T.cardiac:Mass Concentration:Point in time:Blood venous:Quantitative" "Y" "Clinical Attribute"
"C0484851" "A18280127" "Troponin T.cardiac [Mass/volume] in Serum or Plasma" "Y" "Clinical Attribute"
"C0484851" "A18357585" "Troponin T.cardiac:MCnc:Pt:Ser/Plas:Qn" "Y" "Clinical Attribute"
"C0484851" "A18816754" "Troponin T.cardiac:Mass Concentration:Point in time:Serum/Plasma:Quantitative" "Y" "Clinical Attribute"
and relationships:
"C0484850" "A18164418" "has_common_name" "C0484850" "A18241423"
"C0484850" "A18241423" "class_of" "C0201682" "A18205079"
"C0484850" "A18241423" "component_of" "C3538889" "A18284809"
"C0484850" "A18241423" "property_of" "C0560150" "A18367132"
"C0484850" "A18241423" "scale_of" "C1442116" "A18405933"
"C0484850" "A18241423" "system_of" "C1442207" "A18136032"
"C0484850" "A18241423" "time_aspect_of" "C1442880" "A18406936"
"C0484850" "A18241423" "fragments_for_synonyms_of" "C2603360" "A18401194"
I'm finding the OrientDB documentation for extractors and for CSV rather lacking.
For the "row" extractor, there is only one example with no full documentation. I do not have row headers, so how to I use the "row" extractor to name the fields in the vertices (cui, aui, description, pref, syn) ? I'm guessing there is a syntax like id:row2 but I can't find it.
The edges are joined using via the 2 and 5th fields of the vertices, which are unnamed. Also, the edge property is unnamed.
For silly reason, I can't pull directly from MySQL right now, but if there are better examples than the official site I would be interested in seeing them.

Use the csv extractor (see http://orientdb.com/docs/2.2.x/Extractor.html):
set "columnsOnFirstLine" to false
set "columns" to the explicit list of columns, in the order they appear in the CSV file

Filter overlapping time periods using Elixir

I have two lists of tuples that contain start and end times. All the start and end times are Time structs.
I want to filter out the time_slots that overlap with any booked_slots:
time_slots = [
{~T[09:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[21:00:00]},
{~T[17:00:00], ~T[21:00:00]},
{~T[09:00:00], ~T[13:00:00]},
{~T[09:00:00], ~T[21:00:00]}
]
booked_slots = [{~T[14:00:00], ~T[21:00:00]}]
Which should leave us with:
[
{~T[09:00:00], ~T[13:00:00]}
]
I have tried the following method (as adapted from the answer to a previous related question: Compare two lists to find date and time overlaps in Elixir):
Enum.filter(time_slots, fn {time_start, time_end} ->
  Enum.any?(booked_slots, fn {booking_start, booking_end} ->
    if Time.compare(booking_start, time_start) == :lt do
      Time.compare(booking_end, time_start) == :gt
    else
      Time.compare(booking_start, time_end) == :lt
    end
  end)
end)
However this returns:
[
{~T[09:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[21:00:00]},
{~T[17:00:00], ~T[21:00:00]},
{~T[09:00:00], ~T[21:00:00]}
]
We also need to factor in times that may be equal, but do not overlap. For example:
time_slots = [
{~T[09:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[21:00:00]},
{~T[17:00:00], ~T[21:00:00]},
{~T[09:00:00], ~T[21:00:00]},
{~T[09:00:00], ~T[13:00:00]}
]
booked_slots = [{~T[17:00:00], ~T[21:00:00]}]
Should return…
[
{~T[09:00:00], ~T[17:00:00]},
{~T[13:00:00], ~T[17:00:00]},
{~T[09:00:00], ~T[13:00:00]}
]
You need to filter out all the time slots having either start or end times inside the booked slot.
Enum.reject(time_slots, fn {ts, te} ->
  Enum.any?(booked_slots, fn {bs, be} ->
    (Time.compare(ts, bs) != :lt and Time.compare(ts, be) == :lt) or
      (Time.compare(te, bs) == :gt and Time.compare(te, be) != :gt) or
      (Time.compare(ts, bs) == :lt and Time.compare(te, be) == :gt)
  end)
end)
The first condition checks for slots whose start time falls inside the booked slot, the second for those whose end time falls inside it, and the third for those that contain the whole booked slot. For well-formed slots, the three together are equivalent to the standard interval-overlap test: a slot and a booking overlap exactly when Time.compare(ts, be) == :lt and Time.compare(te, bs) == :gt.

Ruby splitting a record into multiple records based on contents of a field

Record layout contains two fields:
Requisition
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet used in the LIMS (Lab Info Management System) is this:
subj.get_value('Tests').join(', ')
What I need to be able to do in the Ruby code snippet is create a new record off each comma-separated value in the second field.
NOTE:
the number of values in the 'Test Names' field varies from 1 to 20...or more.
There can be hundreds of Requisition records
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"
If your data reliably matches the string you've shown in your example, here's a method:
data = subj.get_value('Tests').join(', ') # assuming this gives your string obj.
def split_data(data)
  arr = data.gsub('"', '').split(',')
  # arr[0] is the requisition id; pair it with each remaining test name
  arr[1..-1].map { |l| "#{arr[0]},\"#{l.strip}\"" }
end
puts split_data(data)
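If the records arrive as raw CSV lines, Ruby's standard csv library can handle the quoting instead of stripping quotes by hand; a sketch (explode_record is an illustrative name, and it assumes the one-line layout shown above):

```ruby
require 'csv'

# Turn one 'requisition,"comma-separated test names"' line into
# one requisition/test pair per output line.
def explode_record(line)
  requisition, tests = CSV.parse_line(line)
  tests.split(',').map { |t| "#{requisition},\"#{t.strip}\"" }
end

puts explode_record('R00000001,"4 Calprotectin, 1 Luminex xTAG, 12 Shigella"')
```

CSV.parse_line keeps the quoted second field intact, so the only manual split needed is on the commas inside it.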

Evaluating code blocks in Rebol3

I'm trying to improve the Sliding Tile Puzzle example by making the starting positions random.
There's a better way to do this--"It is considered bad practice to convert values to strings and join them together to pass to do for evaluation."--but the approach I took was to try to generate Rebol3 source, and then evaluate it. I have it generating correctly, I think:
random/seed now
arr: random collect [ repeat tilenum 9 [ keep tilenum ] ]
hgroup-data: copy {}
repeat pos 9 [
    curtile: (pick arr pos)
    append hgroup-data either curtile = 9
        [ reduce "x: box tilesize gameback " ]
        [ rejoin [ { p "} curtile {" } ] ]
    if all [(pos // 3) = 0 pos != 9] [ append hgroup-data " return^/" ]
]
print hgroup-data
...outputs something like:
p "4" x: box tilesize gameback p "5" return
p "3" p "7" p "1" return
p "2" p "8" p "6"
...which if I then copy and paste into this part, works correctly:
view/options [
hgroup [
PASTE-HERE
]
] [bg-color: gameback]
However, if I try to do it dynamically:
view/options [
hgroup [
hgroup-data
]
] [bg-color: gameback]
...(also print hgroup-data, do hgroup-data, and load hgroup-data), I get this error:
** GUI ERROR: Cannot parse the GUI dialect at: hgroup-data
...(or at: print hgroup-data, etc., depending on which variation I tried.)
If I try load [ hgroup-data ] I get:
** Script error: extend-face does not allow none! for its face argument
** Where: either if forever -apply- apply init-layout make-layout actor all foreach do-actor unless -apply- apply all build-face -apply- apply init-layout make-layout actor all foreach do-actor if build-face -apply- apply init-layout make-layout actor all foreach do-actor unless make-face -apply- apply case view do either either either -apply-
** Near: either all [
word? act: dial/1
block? body: get dial...
However, if I use the syntax hgroup do [ hgroup-data ], the program runs, but there are no buttons: it appears to be somehow over-evaluated, so that the return values of the functions p and box and so on are put straight into the hgroup as code.
Surely I'm missing an easy syntax error here. What is it?
First, I would say it's better to construct a block directly, instead of constructing a string and converting it to a block. But if you really want to do that, this should do the trick:
view/options compose/only [
hgroup (load hgroup-data)
] [bg-color: gameback]

Creating a nested dictionary from a flat list with Python

I have a list of files in this form:
base/images/graphs/one.png
base/images/tikz/two.png
base/refs/images/three.png
base/one.txt
base/chapters/two.txt
I would like to convert them to a nested dictionary of this sort:
{ "name": "base" , "contents":
[{"name": "images" , "contents":
[{"name": "graphs", "contents":[{"name":"one.png"}] },
{"name":"tikz", "contents":[{"name":"two.png"}]}
]
},
{"name": "refs", "contents":
[{"name":"images", "contents": [{"name":"three.png"}]}]
},
{"name":"one.txt", },
{"name": "chapters", "contents":[{"name":"two.txt"}]
]
}
Trouble is, with my attempted solution, given input like "images/datasetone/grapha.png", "images/datasetone/graphb.png", each file ends up in a different dictionary named "datasetone". I'd like both to be in the same parent dictionary, since they are in the same directory. How do I create this nested structure without duplicating parent dictionaries when there is more than one file in a common path?
Here is what I came up with, which fails:
def path_to_tree(params):
    start = {}
    for item in params:
        parts = item.split('/')
        depth = len(parts)
        if depth > 1:
            if "contents" in start.keys():
                start["contents"].append(create_base_dir(parts[0], parts[1:]))
            else:
                start["contents"] = [create_base_dir(parts[0], parts[1:])]
        else:
            if "contents" in start.keys():
                start["contents"].append(create_leaf(parts[0]))
            else:
                start["contents"] = [create_leaf(parts[0])]
    return start

def create_base_dir(base, parts):
    l = {}
    if len(parts) >= 1:
        l["name"] = base
        l["contents"] = [create_base_dir(parts[0], parts[1:])]
    elif len(parts) == 0:
        l = create_leaf(base)
    return l

def create_leaf(base):
    l = {}
    l["name"] = base
    return l
b=["base/images/graphs/one.png","base/images/graphs/oneb.png","base/images/tikz/two.png","base/refs/images/three.png","base/one.txt","base/chapters/two.txt"]
d =path_to_tree(b)
from pprint import pprint
pprint(d)
In this example you can see we end up with as many dictionaries named "base" as there are files in the list, but only one is necessary, the subdirectories should be listed in the "contents" array.
This does not assume that all paths start with the same prefix, so the top level needs to be a list:
from pprint import pprint

def addBits2Tree(bits, tree):
    if len(bits) == 1:
        tree.append({'name': bits[0]})
    else:
        for t in tree:
            if t['name'] == bits[0]:
                addBits2Tree(bits[1:], t['contents'])
                return
        newTree = []
        addBits2Tree(bits[1:], newTree)
        t = {'name': bits[0], 'contents': newTree}
        tree.append(t)

def addPath2Tree(path, tree):
    bits = path.split("/")
    addBits2Tree(bits, tree)

tree = []
for p in b:
    print(p)
    addPath2Tree(p, tree)

pprint(tree)
Which produces the following for your example path list:
[{'contents': [{'contents': [{'contents': [{'name': 'one.png'},
{'name': 'oneb.png'}],
'name': 'graphs'},
{'contents': [{'name': 'two.png'}],
'name': 'tikz'}],
'name': 'images'},
{'contents': [{'contents': [{'name': 'three.png'}],
'name': 'images'}],
'name': 'refs'},
{'name': 'one.txt'},
{'contents': [{'name': 'two.txt'}], 'name': 'chapters'}],
'name': 'base'}]
Omitting the redundant name tags, you can go on with:
import json

result = {}
records = ["base/images/graphs/one.png", "base/images/tikz/two.png",
           "base/refs/images/three.png", "base/one.txt", "base/chapters/two.txt"]

recordsSplit = map(lambda x: x.split("/"), records)
for record in recordsSplit:
    here = result
    for item in record[:-1]:
        if item not in here:
            here[item] = {}
        here = here[item]
    if "###content###" not in here:
        here["###content###"] = []
    here["###content###"].append(record[-1])

print(json.dumps(result, indent=4))
The # characters are used for uniqueness (there could be a folder named content somewhere in the hierarchy). Just run it and see the result.
EDIT : Fixed a few typos, added the output.
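For completeness, the duplicate-parent problem can also be avoided by first merging all paths into one nested dict (shared prefixes collapse into a single node) and only then converting that dict into the asker's {"name", "contents"} shape; a Python 3 sketch (paths_to_tree is an illustrative name, not from the answers above):

```python
from pprint import pprint

def paths_to_tree(paths):
    # Phase 1: merge every path into one nested dict keyed by component,
    # so "base/images/..." paths all share a single "base" -> "images" node.
    root = {}
    for path in paths:
        node = root
        for part in path.split('/'):
            node = node.setdefault(part, {})

    # Phase 2: convert the nested dict into the requested
    # {"name": ..., "contents": [...]} shape; files (empty dicts) get no "contents".
    def to_list(node):
        return [
            {"name": name, **({"contents": to_list(children)} if children else {})}
            for name, children in node.items()
        ]

    return to_list(root)

pprint(paths_to_tree(["base/images/graphs/one.png",
                      "base/images/graphs/oneb.png",
                      "base/one.txt"]))
```

Because the merge happens before any output dicts are built, each directory appears exactly once no matter how many files it contains.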

How to store a collection of floats in postgres?

I'm building a script that saves tweets to a postgres database using ruby/pg/ActiveRecord/TweetStream (gem).
This script works fine..
...
TweetStream::Client.new.track(SEARCH_TERMS) do |t|
  puts "#{t.text}"
  attributes = {
    tweetid: t[:id],
    text: t.text,
    in_reply_to_user_id: t.in_reply_to_user_id,
    in_reply_to_status_id: t.in_reply_to_status_id,
    received_at: t.created_at,
    user_statuses_count: t.user.statuses_count,
    user_followers_count: t.user.followers_count,
    user_profile_image_url: t.user.profile_image_url,
    user_screen_name: t.user.screen_name,
    user_timezone: t.user.time_zone,
    user_location: t.user.location,
    user_lang: t.lang,
    user_id_str: t.user.id,
    user_name: t.user.name,
    user_url: t.user.url,
    user_created_at: t.user.created_at,
    user_geo_enabled: t.user.geo_enabled,
  }
  if StoreTweet.create(attributes)
    puts "saved"
  else
    puts "error"
  end
end
until I add
user_geo_enabled: t.user.geo_enabled,
coordinates: t.coordinates.coordinates,
}
I also tried:
t.coordinates
t.coordinates[:coordinates]
t[:coordinates] # this allows me to save, but is always blank, even when geo_enabled is 'true'
Twitter dev center (https://dev.twitter.com/docs/platform-objects/tweets) tells me that 'coordinates' is a collection of floats like:
"coordinates":
{
"coordinates":
[
-75.14310264,
40.05701649
],
"type":"Point"
}
For the moment I use a 'text' field. Which type should I give the field in order to store both values together?
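Not from the thread, just a sketch of two common options: Postgres can hold the pair natively instead of as text. A float8[] array column keeps the raw [longitude, latitude] numbers, and the built-in geometric point type is designed for exactly this kind of pair. Assuming the table behind StoreTweet is store_tweets:

```sql
-- Option 1: a native array of double precision floats
ALTER TABLE store_tweets ADD COLUMN coordinates float8[];

-- Option 2: the built-in geometric point type
-- ALTER TABLE store_tweets ADD COLUMN coordinates point;
```

With the array column, recent ActiveRecord versions should map it to and from a Ruby array, so a value like t.coordinates.coordinates could be assigned directly when it is present.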
