I want to squash two requests:
a = r.table('A').run(conn)
b = r.table('B').run(conn)
in a single one. Something like:
out = some_reql({
'a': r.table('A'),
'b': r.table('B')
}).run(conn)
out['a']
out['b']
If you want to get both back in a single query, you can do it with union, like so:
r.union(r.table("A"), r.table("B"))
This will give you back a single stream object that's a concatenation of the two streams. However, you won't be able to tell where one stream ends and the next begins. There's currently no way to return two separate stream objects in the same query, so if you want to use them as separate streams, you need to run two separate queries. Is there a reason that doesn't work for you?
If your streams are big, this won't have a big impact on performance, because evaluating each stream already requires multiple requests anyway. However, if they're small, you can just coerce them to arrays like so:
{"a" : r.table("A").coerce_to("ARRAY"),
"b" : r.table("B").coerce_to("ARRAY")}
Only do this if your streams will fit in memory.
My processing has a "condense" step before needing further processing:
Source: Raw event/analytics logs of various users.
Transform: Insert each row into a hash according to UserID.
Destination / Output: An in-memory hash like:
{
"user1" => [event, event,...],
"user2" => [event, event,...]
}
Now, I've got no need to store these user groups anywhere; I'd just like to carry on processing them. Is there a common pattern with Kiba for using an intermediate destination? E.g.
users = Hash.new { |h, k| h[k] = [] }

# First pass
source EventSource                              # 10,000 rows of single events
transform { |row| insert_into_user_hash(row) }
destination UserDestination, users: users

# Second pass
source UserSource, users: users                 # 100 rows of grouped events, created in the previous step
transform { |row| analyse_user(row) }
I'm digging around the code and it appears that all transforms in a file are applied to the source, so I was wondering how other people have approached this, if at all. I could save to an intermediate store and run another ETL script, but was hoping for a cleaner way - we're planning lots of these "condense" steps.
To directly answer your question: you cannot define 2 pipelines inside the same Kiba file. You can have multiple sources or destinations, but the rows will all go through each transform, and through each destination too.
That said, you have quite a few options before resorting to splitting into two pipelines, depending on your specific use case.
I'm going to email you to ask a few more detailed questions in private, in order to properly reply here later.
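For reference, one option in that direction is a buffering ("aggregating") class transform that swallows rows in process and emits the condensed rows from close. This relies on Kiba's StreamingRunner (opt-in in Kiba 2, the default in Kiba 3), so treat the sketch below as a rough illustration under that assumption; GroupByUser is a made-up name, row[:user_id] assumes your events carry a user id under that key, and EventSource / analyse_user come from the question.

# Hypothetical condensing transform: buffers events per user, then yields
# one aggregated row per user when the pipeline is closing.
class GroupByUser
  def initialize
    @users = Hash.new { |h, k| h[k] = [] }
  end

  def process(row)
    @users[row[:user_id]] << row   # buffer the event
    nil                            # emit nothing downstream yet
  end

  def close
    # With the StreamingRunner, close may yield any number of rows, and each
    # yielded row continues through the remaining transforms and destinations.
    @users.each { |user_id, events| yield(user_id: user_id, events: events) }
    nil
  end
end

Used in the Kiba DSL, the "second pass" then becomes just another transform after the condense step:

source EventSource                      # 10,000 rows of single events
transform GroupByUser                   # condense step
transform { |row| analyse_user(row) }   # runs once per user row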
I have a lot of JavaScript objects like:
var obj1 = {"key1" : value1, "key2" : value2, ...}
var obj2 = {"key3" : value3, "key4" : value4, ...}
and so on...
Here are the two approaches I am considering:
1) Store each object as a Redis hash, i.e. a one-to-one mapping between objects and hashes.
2) Have one Redis hash (bucketing can be done for better performance) and store each object as a stringified value under one key of that hash, i.e. one key-value pair per object. Parse the string whenever the object is needed.
1) -> Takes more space than 2) but has better performance than 2)
2) -> Takes less space than 1) but has worse performance than 1)
Is there a way to determine which approach would be better in the long run?
Update: This data is used on the client side (AngularJS), so all parsing of stringified JSON is done in the frontend.
This would probably be best decided by working out which method minimises the number of steps required to extract the required data from Redis.
Case 1: Lots of nested objects
If your objects have a lot of nesting, i.e. objects within objects, like this:
obj = {key1: {key2: value1, key3: {key4: value2}}}
You should probably stringify them and store them as strings, because Redis does not allow nesting of data structures: you can't store a hash within another hash.
Storing the name of hash2 as a value inside hash1, then querying hash2 after fetching hash1, and so on, is unnecessarily complex and requires a lot of queries. In this case, all you have to do is get the entire string from Redis and JSON.parse it; then you can pull whatever data you want out of the resulting object.
Case 2: No nested objects.
On the other hand, if there is no nesting and you still store the objects as strings, you have to JSON.parse() every time you read the data from Redis, and parsing JSON is blocking and CPU-intensive (see: Node.js: does JSON.parse block the event loop?).
The Redis documentation also says that hashes are encoded in a very small space, so you should try to represent your data using hashes whenever possible: http://redis.io/topics/memory-optimization
So, in this case, you could probably go ahead and store each object as an individual hash, since querying a particular value will be a lot easier.
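To make the two cases concrete, here is a rough sketch of both storage styles. The question is about JavaScript on the client, but the trade-off is client-agnostic, so this uses the Ruby redis gem; the key names (obj:1, obj:2) and field names are invented for illustration.

require "redis"
require "json"

redis = Redis.new

# Case 2 style: a flat object stored as a native Redis hash,
# so individual fields can be read or updated without any parsing.
flat = { "key1" => "value1", "key2" => "value2" }
redis.mapped_hmset("obj:1", flat)
redis.hget("obj:1", "key1")                              # => "value1"

# Case 1 style: a nested object stored as one stringified JSON value,
# fetched and parsed in a single round trip.
nested = { "key1" => { "key2" => "value1", "key3" => { "key4" => "value2" } } }
redis.set("obj:2", JSON.generate(nested))
JSON.parse(redis.get("obj:2"))["key1"]["key3"]["key4"]   # => "value2"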
---------Update---------
Even if the JSON parsing is done on the client, try not to add extra computation needlessly :)
Nested objects, however, are easier to store and query as a string; otherwise you would have to query more than one hash. In that case, storing the stringified object might just be better for performance.
Redis stores small hashes very efficiently, so much so that storing multiple small hashes can be more memory-efficient than one big hash.
The thresholds that determine which internal encoding is used can be found in redis.conf:
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
That is, a hash must have at most 512 fields, and each field value at most 64 bytes, for the compact encoding to be used.
So you can now decide based on how nested your objects are, the number of hash fields below which Redis stays memory-efficient, and the size of the values you assign to your keys.
Do go through http://redis.io/topics/memory-optimization
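If you want to check which encoding you actually end up with, you can ask Redis directly; a small sketch, again with the Ruby redis gem and a throwaway key:

require "redis"

redis = Redis.new
redis.mapped_hmset("enc:test", "f1" => "v1", "f2" => "v2")

# Hashes below the configured thresholds use the compact encoding
# ("zipmap"/"ziplist" on older Redis versions, "listpack" on newer ones).
puts redis.object("encoding", "enc:test")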
I need logic for the scenario below, which needs to be implemented using Pig scripts. Can anyone please provide some ideas on how to do this?
The input contains a column groupName with some values like others and unknown. Such values need to be replaced by the groupName of the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
Output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown should be replaced with 125's sale0001; 130,others should be replaced with 129's sale0002; and 132,unknown, 133,unknown and 134,others should all be replaced with 131's casc0004.
--Edit--
I tried the LEAD function in Pig, but it only compares a fixed number of rows at a time, so it cannot solve this completely.
Here is another approach that works, but I'm looking for a more optimized one:
Cogroup the data set with itself (say Dataset and Dataset_self).
- Filter on Dataset.id = Dataset_self.id or Dataset_self.groupName = 'others' or Dataset_self.groupName = 'unknown'
- Generate idDiff as (Dataset_self.id - Dataset.id), and CASE when id = id then (id, group) else (id_self, group)
- Foreach (group id) {
    ordered = order by id, diff, group;
    limited = ordered limit 1;
    generate limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially because your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to reach back into the previous split, which is most likely on a different node. Even if you came up with a MapReduce program to do this, in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, since the LEAD and LAG window functions are readily available and could be used to do exactly what you need.
I have a nicely structured (human-made) JSON file that I would like to programmatically add and update values in.
The issue is that the current structure of the JSON file is very easy to read for me and my colleagues, and we would like it to stay in the same (or very similar) indentation, line spacing and key order, etc.
Is there a way to do this with Ruby?
Ruby's JSON supports pretty_generate, which is a "pretty" generator, but in no way will it attempt to remember how you've structured a particular JSON data file, nor should it.
foo = {'a' => 1, 'b' => %w[2 3]}
puts JSON.generate(foo)
{"a":1,"b":["2","3"]}
puts JSON.pretty_generate(foo)
{
"a": 1,
"b": [
"2",
"3"
]
}
JSON is a data serialization format, and, along with YAML and XML, it's designed to move data accurately. Doing that while maintaining an arbitrary line spacing, or leading white-space adds no value to a serializer.
Remember, adding "pretty" to the output increases the size of the data being moved, without improving the quality:
puts JSON.generate(foo).size
21
puts JSON.pretty_generate(foo).size
43
Making just that little hash "pretty" doubled the size, which, over time, reduces throughput to browsers or across networks between servers. I'd recommend only bothering with the "pretty" output when initially debugging your code, then abandoning it once you're happy with the data movement, in favor of speed and efficiency. The data will be the same.
If you're worried about being able to modify some of the data, write a simple reader and/or JSON generator that works from a standard Ruby data object, then let JSON serialize it, and write the output to a file.
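As a rough sketch of that last suggestion (the file name and keys here are invented): read the file into plain Ruby objects, modify them, and write the result back with pretty_generate. The order of existing keys survives the round trip, but any hand-tuned indentation or blank lines will be normalized to pretty_generate's own style.

require "json"

path = "catalog.json"                  # hypothetical file

data = JSON.parse(File.read(path))     # plain Hash/Array structures
data["version"] = "1.2.0"              # update an existing value
(data["tags"] ||= []) << "reviewed"    # add a new value

File.write(path, JSON.pretty_generate(data) + "\n")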
I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for how to conveniently process the file record by record with constant memory usage. Below is my main loop:
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read
    case dblp.name
    when 'article', 'inproceedings', 'book'
      pub = pubFactory.create(dblp.expand)
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0
      dblp.next
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end
The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?
def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end

pub.pages = first(node, 'pages') # node contains the expanded node from dblp.expand
When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:
Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up/go down etc.
I think push parsers are nice to use if you only want to retrieve some fields, but they are generally messy to use for complex data extraction and are often implemented with case ... when ... constructs.
Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads in the entire XML document and can consume a large amount of memory. The event-based approach uses very little additional memory, but doesn't do anything unless you write your own handler logic.
The event-based model is employed by the SAX-style parser, and derivative implementations.
Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html
REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
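To make the push-parser option concrete, here is a minimal sketch using REXML's StreamListener; the ArticleListener class is made up, and in real code you would hand the buffered record to something like the question's PubFactory instead of printing it:

require "rexml/document"
require "rexml/streamlistener"

# Collects the character data of each <article> element without building a tree.
class ArticleListener
  include REXML::StreamListener

  def initialize
    @inside_article = false
    @buffer = ""
  end

  def tag_start(name, _attrs)
    if name == "article"
      @inside_article = true
      @buffer = ""
    end
  end

  def text(data)
    @buffer << data if @inside_article
  end

  def tag_end(name)
    if name == "article"
      @inside_article = false
      puts @buffer.strip    # hand the record off to your own processing here
    end
  end
end

File.open("dblp.xml") do |io|
  REXML::Document.parse_stream(io, ArticleListener.new)
end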
I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like:
my_node = dblp.expand
# do what you have to do with my_node
dblp.next
my_node.remove!
Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node to the Reader, and you have to call Node#remove! to free it.
Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
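Applied to the loop from the question, the hack might look roughly like this (same names as in the question; the point is only the expand / next / remove! ordering):

while dblp.read
  case dblp.name
  when 'article', 'inproceedings', 'book'
    node = dblp.expand          # expanded subtree for this record
    pub  = pubFactory.create(node)
    puts pub
    dblp.next                   # move the reader past the subtree...
    node.remove!                # ...then free the expanded node
  when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
    dblp.next
  end
end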