recursive nested loops - ruby

Example scenario (note that this can be as deep or as shallow as the website requires):
The spider scans the first page for links and stores them as array1.
The spider follows the first link and is now on the second page. It finds links there and stores them as array2.
The spider follows the first link on the second page and is now on the third page.
It finds links there and stores them as array3.
Please note that this is a generic scenario. I want to highlight the need for many loops within loops.
rootArray[array1,array2,array3....]
How can I write recursive nested loops? array2 holds the children of each VALUE of array1 (assume the structure is very uniform: each VALUE of array1 has similar links in array2). array3 holds the children of each value of array2, and so on.

module Scratch
  def self.recur(arr, depth, &fn)
    arr.each do |a|
      a.is_a?(Array) ? recur(a, depth+1, &fn) : fn.call(a, depth)
    end
  end
  arr = [[1, 2, 3], 4, 5, [6, 7, [8, 9]]]
  recur(arr, 0) { |x,d| puts "#{d}: #{x}" }
end

You'll want to store these results in a tree, not a collection of arrays. Page1 would have child nodes for each link. Each of those has child nodes for its links, etc. An alternate approach would be to just store all of the links in one array, recursing through the site to find the links in question. Do you really need them in a structure analogous to that of the site?
You'll also want to check for duplicate links when adding any new link to the list/tree/whatever that you've already got. Otherwise, loops like page_1 -> page_2 -> page_1... will break your app.
What's your real goal here? Page crawlers aren't exactly new technology.

It all depends on what you are trying to do.
If you are harvesting links then a hash or set will work well. An array can be used too but can lead to some gotchas.
If you need to show the structure of the site you'll want a tree or arrays of arrays along with some way of flagging which urls you've visited.
In any case you need to avoid redundant links to keep from getting into a loop. It's also real common to put some sort of limitation on how deep you'll descend and whether you'll remember and/or follow links outside of the site.
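The two safeguards mentioned above (skipping links you've already seen, and capping the depth) can be sketched roughly as follows. This is only an illustration, written in Python for brevity; the same shape ports directly to Ruby. The SITE dictionary is a hypothetical stand-in for actually fetching a page and extracting its links, and the names crawl and max_depth are made up for this example.

```python
# Hypothetical site: each page maps to the links found on it.
SITE = {
    "page_1": ["page_2", "page_3"],
    "page_2": ["page_1", "page_4"],  # links back to page_1 (a cycle)
    "page_3": ["page_4"],
    "page_4": ["page_5"],
    "page_5": [],
}

def crawl(page, max_depth, visited=None, depth=0):
    """Return a nested {link: {child_link: ...}} tree, visiting each page once."""
    if visited is None:
        visited = set()
    visited.add(page)
    tree = {}
    if depth >= max_depth:  # depth limit: stop descending
        return tree
    for link in SITE.get(page, []):
        if link in visited:  # duplicate check: cycles cannot loop forever
            continue
        tree[link] = crawl(link, max_depth, visited, depth + 1)
    return tree

tree = crawl("page_1", max_depth=2)
print(tree)
```

The shared visited set is what keeps page_1 -> page_2 -> page_1 from recursing endlessly; the tree only records the first place each page was reached.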

Gweg, I just answered this on your other post.
How do I create nested FOR loops with varying depths, for a varying number of arrays?

Related

Python3 Make tie-breaking lambda sort more pythonic?

As an exercise in python lambdas (just so I can learn how to use them more properly) I gave myself an assignment to sort some strings based on something other than their natural string order.
I scraped apache for version number strings and then came up with a lambda to sort them based on numbers I extracted with regexes. It works, but I think it could be better; I just don't know how to improve it so it's more robust.
from lxml import html
import requests
import re
# Send GET request to page and parse it into a list of html links
jmeter_archive_url='https://archive.apache.org/dist/jmeter/binaries/'
jmeter_archive_get=requests.get(url=jmeter_archive_url)
page_tree=html.fromstring(jmeter_archive_get.text)
list_of_links=page_tree.xpath('//a[@href]/text()')
# Filter out all the non-md5s. There are a lot of links, and ultimately
# it's more data than needed for this exercise
jmeter_md5_list=list(filter(lambda x: x.endswith('.tgz.md5'), list_of_links))
# Here's where the 'magic' happens. We use two different regexes to rip the first
# and then the second number out of the string and turn them into integers. We
# then return them in the order we grabbed them, allowing us to tie break.
jmeter_md5_list.sort(key=lambda val: (int(re.search(r'(\d+)\.\d+', val).group(1)), int(re.search(r'\d+\.(\d+)', val).group(1))))
print(jmeter_md5_list)
This does have the desired effect. The output is:
['jakarta-jmeter-2.5.1.tgz.md5', 'apache-jmeter-2.6.tgz.md5', 'apache-jmeter-2.7.tgz.md5', 'apache-jmeter-2.8.tgz.md5', 'apache-jmeter-2.9.tgz.md5', 'apache-jmeter-2.10.tgz.md5', 'apache-jmeter-2.11.tgz.md5', 'apache-jmeter-2.12.tgz.md5', 'apache-jmeter-2.13.tgz.md5']
So we can see that the strings are sorted into an order that makes sense. Lowest version first and highest version last. Immediate problems that I see with my solution are two-fold.
First, we have to create two different regexes to get the numbers we want instead of just capturing groups 1 and 2. Mainly because I know there are no multiline lambdas, I don't know how to reuse a single regex object instead of creating a second.
Secondly, this only works as long as the version numbers are two numbers separated by a single period. The first element is 2.5.1, which is sorted into the correct place but the current method wouldn't know how to tie break for 2.5.2, or 2.5.3, or for any string with an arbitrary number of version points.
So it works, but there's got to be a better way to do it. How can I improve this?
This is not a full answer, but it will get you far along the road to one.
The return value of the key function can be a tuple, and tuples sort naturally. You want the output from the key function to be:
((2, 5, 1), 'jakarta-jmeter')
((2, 6), 'apache-jmeter')
etc.
Do note that this is a poor use case for a lambda regardless.
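To see why tuple keys help, here is a small sketch that sorts a few hand-written keys of the shape shown above. The sample values are made up for illustration (only the 2.5.1 and 2.6 keys come from the answer); the point is simply how Python compares tuples element by element.

```python
# Hand-written keys in the ((version numbers...), name) shape.
keys = [
    ((2, 10), "apache-jmeter"),
    ((2, 5, 1), "jakarta-jmeter"),
    ((2, 6), "apache-jmeter"),
    ((2, 5, 1), "apache-jmeter"),
]

# Tuples compare element by element, so 6 < 10 numerically and
# (2, 5, 1) sorts before (2, 6) because 5 < 6 at the second position.
ordered = sorted(keys)
for key in ordered:
    print(key)

# Plain strings compare character by character, which is why a
# natural string sort puts 2.10 before 2.6.
as_strings = sorted(["2.6", "2.10"])
print(as_strings)
```

When the version tuples tie, the name breaks the tie, which is why the name is the second element of the key.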
Originally, I came up with this:
jmeter_md5_list.sort(key=lambda val: list(map(int, re.compile(r'(\d+(?!$))').findall(val))))
However, based on Ignacio Vazquez-Abrams's answer, I made the following changes.
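def sortable_key_from_string(value):
    # Every run of digits except one that ends the string (the "5" in ".md5")
    version_tuple = tuple(map(int, re.compile(r'(\d+(?!$))').findall(value)))
    # Leading non-digit prefix, e.g. 'apache-jmeter-'
    match = re.match(r'^(\D+)', value)
    version_name = ''
    if match:
        version_name = match.group(1)
    return (version_tuple, version_name)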
and this:
jmeter_md5_list.sort(key=sortable_key_from_string)

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine, but I am looking at a data set of 13000 reviews, and looping through this code for each review is going to take forever (if my computer doesn't run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to Python, so I was hoping someone more experienced could help with condensing it or perhaps point me in the right direction towards a library that contains the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This should convert what was an O(n) traversal through dictionary for every bigram into a hash lookup, which is O(1) on average.
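As a rough, self-contained sketch of that refactoring: the sample bigrams below are invented, and the names vocabulary, index_of and feature_vector are hypothetical, not from the question. The mapping from bigram to position is built once and then reused for every review.

```python
# Hypothetical sample data standing in for the real inputs.
vocabulary = [("good", "service"), ("fast", "delivery"), ("poor", "quality")]
new_scored = [(("fast", "delivery"), 2), (("never", "again"), 1)]

# Build the bigram -> position mapping once, outside the per-review loop.
index_of = {bigram: position for position, bigram in enumerate(vocabulary)}

def feature_vector(scored, index_of):
    """Return a list as long as the vocabulary, filled with bigram counts."""
    features = [0] * len(index_of)
    for bigram, count in scored:
        position = index_of.get(bigram)  # None if the bigram is unknown
        if position is not None:
            features[position] = int(count)
    return features

features = feature_vector(new_scored, index_of)
print(features)
```

Using dict.get instead of try/except is just a stylistic choice here; unknown bigrams are silently skipped either way.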
Hope this helps.

Nokogiri Node Set

I am trying to use Nokogiri to scrape a web page. Right now, I am able to set a variable links to the following on a web page:
links = page.css('.item_inner')
and links is a:
Nokogiri::XML::NodeSet
Then I iterate through this NodeSet(links):
links.each{|link| puts link.css('.details a')}
in order to get some more information. But the class of the result is now:
Fixnum
and it prints a list of what looks like these (I'm not sure exactly what is being returned):
<a se:clickable:target="true" href="/nyc/sale/1056207-coop-150-sullivan-street-soho-new-york?featured=1">150 Sullivan Street #34</a>
Now I know that there are key/value pairs within this but I am unable to access them at this point. How can I access say the href here and the actual name?
Once you have a single link as a node, its href is link['href'] and so forth, and the link text ("150 Sullivan Street") is its content.
NOTE: A css search always yields what is effectively an array of found nodes (actually a NodeSet). If you are quite sure that there is only one of something to be found by your search, you can skip past that by using at_css instead, thus yielding a single node.

Parsing one large array into several sub-arrays

I have a list of adjectives (found here), that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list = list.gsub(/\n/, " ")
The next step is to break the string up by category..
words = list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into the "categories" array. Now pull all the elements from the main array, and pop them into the sub-array, until you meet another tilde. Then repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
cat_name, *words = category_string.split(" ")
words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides a very useful method sample, so you can just do this
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input etc
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create a sub-array for each word starting with ~, containing that word and all the words up to the next matching word.
If this file format is your original and you have freedom to change it, then I recommend you save the data as yaml or json format and read it when needed. There are libraries to do this. That is all. No worry about the mess. Don't spend time reinventing the wheel.

Sorting by counting the intersection of two lists in MongoDB

We have a post-analysis requirement: for a specific post, we need to return a list of the posts most closely related to it, ranked by the number of tags they have in common. For example:
postA = {"author":"abc",
"title":"blah blah",
"tags":["japan","japanese style","england"],
}
There may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
So basically, postB gets a count of 2 and postC a count of 1 when compared with the tags in postA. postD gets 0 and will not be included in the result.
My understanding for now is that I should use map/reduce to produce the result. I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way, like a custom sorting function, to work it out? I'm currently using pymongo, as I'm a Python developer.
You should create an index on tags:
db.posts.create_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({'_id': {'$ne': postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by intersection in Python:
key = lambda post: len(set(post['tags']) & set(postA['tags']))
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application. A plain find query can't sort and limit by the size of the intersection; on MongoDB 2.6 or later, the aggregation pipeline's $setIntersection and $size operators can compute it inside Mongo itself.
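The client-side ranking step can be tried without a database. In the sketch below, an in-memory list stands in for the documents returned by find; the _id values and the helper name rank_by_shared_tags are invented for illustration, while the tags are the ones from the question.

```python
# Sample documents; the list stands in for the result of db.posts.find(...).
post_a = {"_id": 1, "tags": ["japan", "japanese style", "england"]}
candidates = [
    {"_id": 2, "tags": ["japan", "england"]},  # postB
    {"_id": 3, "tags": ["japan"]},             # postC
    {"_id": 4, "tags": ["joke"]},              # postD
]

def rank_by_shared_tags(target, posts):
    """Return (shared_tag_count, post) pairs, highest count first, zeros dropped."""
    target_tags = set(target["tags"])
    scored = [(len(target_tags & set(post["tags"])), post) for post in posts]
    related = [(score, post) for score, post in scored if score > 0]
    related.sort(key=lambda pair: pair[0], reverse=True)
    return related

ranked = rank_by_shared_tags(post_a, candidates)
for score, post in ranked:
    print(score, post["_id"])
```

Converting the tag lists to sets makes each intersection cheap and also avoids counting a tag twice if a post happens to list it more than once.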

Resources