Count and print images from URL

This is my first time using Spark/Scala and I am lost.
I am supposed to write a program that takes in a URL and outputs the number of images and the names of the image files.
I was able to get the image count. I am doing this all in the command prompt, which makes it quite difficult to go back and edit my def without retyping the whole thing. Is there a better alternative? It took me quite a while just to get Spark/Scala working (I would have liked to use PySpark but was unable to get them to communicate).
scala> def URLcount(url : String) : String = {
| var html = scala.io.Source.fromURL(url).mkString
| var list = html.split("\n").filter(_ != "")
| val rdds = sc.parallelize(list)
| val count = rdds.filter(_.contains("img")).count()
| return("There are " + count + " images at the " + url + " site.")
| }
URLcount: (url: String)String
scala> URLcount("https://www.yahoo.com/")
res14: String = There are 9 images at the https://www.yahoo.com/ site.
So I'm assuming that after I parallelize the list I should be able to apply a filter and create a list of all the strings that contain "img src".
How would I create such a list and then print it line by line to display the image URLs?
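Since PySpark comes up above, here is a minimal PySpark sketch of that filter-and-print step, for illustration only (the URL and variable names are examples, not the thread's code):

from pyspark import SparkContext
import urllib.request

sc = SparkContext(appName="img-lines")  # the pyspark shell already provides sc

html = urllib.request.urlopen("https://www.yahoo.com/").read().decode("utf-8")
lines = [line for line in html.split("\n") if line != ""]

rdd = sc.parallelize(lines)
img_lines = rdd.filter(lambda line: "img src" in line)

# collect() brings the matching lines back to the driver so they can be printed
for line in img_lines.collect():
    print(line)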

I'm not sure Spark is a great solution for parsing HTML. Spark was created for big data (though it is general purpose), and I did not find any easy way to parse HTML through Spark (though I easily found ways for both XML and JSON). Note also that in this case you may end up printing very long strings, because HTML pages are often compressed onto a single line. Anyway, for this page your program will print lines like this:
<p>So I'm assuming after I parallelize the list I should be about to apply a filter and create a list of all the strings that contain "img src"
I can advise you to use Jsoup:
import org.jsoup.Jsoup
import scala.collection.JavaConverters._
val yahoo = Jsoup.connect("https://www.yahoo.com").get
val images = yahoo.select("img[src]")
images.asScala.foreach(img => println(img.attr("abs:src"))) // abs:src resolves relative URLs
You can use Spark for other purposes.
P.S. I found 39 image tags with a src attribute on https://www.yahoo.com. It is very easy to get wrong results if you don't use a good HTML parser.
Another way: prepare your data first, and then use Spark.

Related

XPath union parameter from parent

Here is XPath one:
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link
and XPath two:
Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Property[@Name="FileName"]/@Value
Both bring back the right result!
I would like to avoid the complete chaining [one | two], as that brought back only a list of alternating results (an XPath union returns nodes in document order, which is presumably why the two attribute sets interleave).
I tried
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link | */Property[@Name="FileName"]/@Value
but that brings back only the latter one.
So how would I correctly bring back two child-node attributes from a found parent?
For anyone interested: I didn't find a pure XPath solution. However, this Python code did work for me:
import xml.etree.ElementTree as ET

tree = ET.parse(file_xml)
root = tree.getroot()
blobs = root.findall("*/Attributes[1]/BlobContent")
for blob in blobs:
    try:
        filename = blob.find('Property[@Name="FileName"]').attrib["Value"]
        exportname = blob.find('Reference[@Type="RELATIVEFILE"]').attrib["Link"]
        print(filename + "," + exportname)
    except AttributeError:
        # no FileName Property (find() returned None)
        pass
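For context, here is a minimal self-contained run of the same extraction; the sample XML is hypothetical, reconstructed from the XPath expressions above, so the real files may nest differently:

import xml.etree.ElementTree as ET

sample = """
<Document>
  <Attributes>
    <BlobContent>
      <Property Name="FileName" Value="report.pdf"/>
      <Reference Type="RELATIVEFILE" Link="files/report.pdf"/>
    </BlobContent>
  </Attributes>
</Document>
"""

root = ET.fromstring(sample)
for blob in root.findall("Attributes/BlobContent"):
    prop = blob.find("Property[@Name='FileName']")
    ref = blob.find("Reference[@Type='RELATIVEFILE']")
    if prop is not None and ref is not None:
        print(prop.attrib["Value"] + "," + ref.attrib["Link"])
# prints: report.pdf,files/report.pdf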

Is there a way to read fixed-length files using csv.reader() in Python 2.x

I have a fixed length file like:
0001ABC,DEF1234
The file definition is:
id[1:4]
name[5:11]
phone[12:15]
I need to load this data into a table. I tried to use the csv module and defined the fixed lengths of each field. It works fine except for the name field.
For the NAME field, only the value up to ABC is getting loaded. The reason is that since I am using the csv module, it treats 0001ABC, as a value and only parses up to that point.
I tried using escapechar=',' while reading the file, but it removes the ',' from the data. I also tried quoting=csv.QUOTE_ALL, but that didn't work either.
with open("xyz.csv") as csvfile:
readCSV = csv.reader(csvfile)
writeCSV = open("sample_csv", 'w');
output = csv.writer(writeCSV, dialect='excel', lineterminator="\n")
for row in readCSV:
print(row) # to debug #
data= str(row[0])
print(data) # to debug #
id = data[0:4]
name = data[5:11]
phone = data[12:15]
output.writerow([id,name,phone])
writeCSV.close()
Output of the print commands:
row: ['0001ABC','DEF1234']
data: 0001ABC
Ideally, I expect to see the entire set 0001ABC,DEF1234 in the variable: data.
I can then use the parsing as mentioned in the code to break it into different fields.
Can you please let me know where I am going wrong?
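A minimal sketch of one way around this, assuming the 1-based, inclusive field layout given above: skip csv.reader on the input side and slice each raw line by position, so the comma is never treated as a delimiter (the writer side can stay csv.writer):

import csv

with open("xyz.csv") as infile, open("sample_csv", "w") as outfile:
    output = csv.writer(outfile, dialect="excel", lineterminator="\n")
    for line in infile:
        data = line.rstrip("\n")
        # 1-based spec id[1:4], name[5:11], phone[12:15] -> 0-based slices
        rec_id = data[0:4]    # "0001"
        name = data[4:11]     # "ABC,DEF"
        phone = data[11:15]   # "1234"
        output.writerow([rec_id, name, phone])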

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead, I thought about an approach where the user specifies, via XPath, the elements where the content is stored, and this is saved as a sort of recipe for the webpage.
Example: the user wants to scrape a table structure, extracting the rows as hashes (column-name => cell-content).
I was thinking about writing a ruby function for extraction of this generic table information once:
# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html-file as a string
# xpath_table - xpath specifying the html table which holds the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath(xpath_headers)
  row_headers = row_headers.map do |column|
    column.inner_text
  end
  row_contents = Array.new
  # double quotes so #{xpath_table} is actually interpolated
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td')
    cells = cells.map do |cell|
      cell.inner_text
    end
    row_content_hash = Hash.new
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    row_contents << row_content_hash
  end
  return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'/>
The function basic_table is referenced here, so that when parsing the website-recipe-file I would know that basic_table can be used to extract the content of the table referenced by the XPath.
This way the user can specify simple recipe scripts and only has to dive into writing actual code if they need a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
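To make the recipe idea concrete, here is a rough dispatch sketch in Python for illustration (the Ruby version would be analogous); all names here are hypothetical:

import xml.etree.ElementTree as ET

def basic_table(html, xpath_table):
    # stand-in for the table-extraction routine above
    return []

# registry: recipe element name -> extraction function
RECIPES = {"basic_table": basic_table}

def run_recipe(recipe_xml, html):
    results = []
    for step in ET.fromstring(recipe_xml):
        handler = RECIPES[step.tag]  # look up the function by element name
        results.append(handler(html, step.attrib["xpath"]))
    return results

recipe = """<recipe>
  <basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'/>
</recipe>"""
print(run_recipe(recipe, "<html></html>"))

Adding a new extraction strategy then means writing and registering one function, while the recipe files stay declarative.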
I was thinking that someone might be able to tell me how they would approach this. Rules/rule engines come to mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Sorting record in Adobe Air

Ok probably barking up the wrong tree with this one but some guidance would be nice!
I've currently got an app that exports data to a text file:
stream.open(file, FileMode.APPEND);
stream.writeUTFBytes(data1 + data2);
stream.close();
and then uses the following to import that data:
var textloader:URLLoader = URLLoader(event.target);
MyTextFile_txt.text = textloader.data;
Now, is there any way of sorting this information (for example, putting it in order of the data2 records)? I know sorting from a text file is probably a little difficult. Would there be a better way of exporting the file instead? Or, when importing the file, can I get it to import into a specific text box?
Dunno just throwing some ideas out.
Although not essential, you could use stream.readUTFBytes instead of URLLoader.
Regarding sorting the data: you can add all the loaded data into an array and then use sort() on it.
e.g.
var someArray:Array = [];
for (var i:int = 0; i < loadedData.xmlNodeName.length; i++) {
    someArray.push(loadedData.xmlNodeName[i]);
}
someArray.sort();
http://help.adobe.com/en_US/ActionScript/3.0_ProgrammingAS3/WS5b3ccc516d4fbf351e63e3d118a9b90204-7fa4.html
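Since the question asks for ordering by the data2 field specifically, here is the keyed-sort idea as a small Python illustration (the record layout with a "|" separator is hypothetical); in ActionScript 3 the analogous tool is Array.sortOn() on an array of objects:

# each exported line is assumed to hold data1 and data2, e.g. "data1|data2"
lines = ["beta|2", "alpha|9", "gamma|1"]
records = [line.split("|") for line in lines]
records.sort(key=lambda rec: rec[1])  # order by the data2 field
for data1, data2 in records:
    print(data1, data2)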

Read image IPTC data

I'm having some trouble reading out the IPTC data of some images. The reason I want to do this is that my client already has all the keywords in the IPTC data and doesn't want to re-enter them on the site.
So I created this simple script to read them out:
$size = getimagesize($image, $info);
if (isset($info['APP13'])) {
    $iptc = iptcparse($info['APP13']);
    print '<pre>';
    var_dump($iptc['2#025']);
    print '</pre>';
}
This works perfectly in most cases, but it's having trouble with some images.
Notice: Undefined index: 2#025
While I can clearly see the keywords in photoshop.
Are there any decent small libraries that could read the keywords in every image? Or am I doing something wrong here?
I've seen a lot of weird IPTC problems. It could be that you have two APP13 segments: I've noticed that, for some reason, some JPEGs have multiple IPTC blocks, possibly from using several photo-editing programs or from manual file manipulation.
It could be that PHP is trying to read an empty APP13, or even embedded "thumbnail metadata".
It could also be a problem with segment lengths: APP13 or 8BIM segments have length marker bytes that might hold wrong values.
Try a hex editor and check the file "manually".
I have found that IPTC is almost always embedded as XML using the XMP format, and is often not in the APP13 slot. You can sometimes get the IPTC info by using iptcparse($info['APP1']), but the most reliable way to get it without a third-party library is to simply search through the image file for the relevant XML string (I got this from another answer, but I haven't been able to find it, otherwise I would link!):
The XML for the keywords always has the form "<dc:subject>...<rdf:Seq><rdf:li>Keyword 1</rdf:li><rdf:li>Keyword 2</rdf:li>...<rdf:li>Keyword N</rdf:li></rdf:Seq>...</dc:subject>"
So you can just get the file as a string using file_get_contents(get_attached_file($attachment_id)), use strpos() to find each opening (<rdf:li>) and closing (</rdf:li>) XML tag, and grab the keyword between them using substr().
The following snippet works for all jpegs I have tested it on. It will fill the array $keys with IPTC tags taken from an image on wordpress with id $attachment_id:
$content = file_get_contents(get_attached_file($attachment_id));
// Look for XMP data: the "dc:subject" tag is where keywords are stored
$xmp_data_start = strpos($content, '<dc:subject>');
// Only proceed if the dc:subject tag was found
// (check before adding the offset, so a strpos() false is not masked)
if ($xmp_data_start !== FALSE) {
    $xmp_data_start += 12; // skip past "<dc:subject>"
    $xmp_data_end = strpos($content, '</dc:subject>');
    $xmp_data_length = $xmp_data_end - $xmp_data_start;
    $xmp_data = substr($content, $xmp_data_start, $xmp_data_length);
    // Look for the "rdf:Seq" tag, where individual keywords are listed
    $key_data_start = strpos($xmp_data, '<rdf:Seq>');
    // Only proceed if the rdf:Seq tag was found
    if ($key_data_start !== FALSE) {
        $key_data_start += 9; // skip past "<rdf:Seq>"
        $key_data_end = strpos($xmp_data, '</rdf:Seq>');
        $key_data_length = $key_data_end - $key_data_start;
        $key_data = substr($xmp_data, $key_data_start, $key_data_length);
        // $ctr tracks the position of each <rdf:li> tag, starting with the first
        $ctr = strpos($key_data, '<rdf:li>');
        // Initialize an empty array to store keywords
        $keys = Array();
        // Store each keyword, then search for the next keyword tag
        while ($ctr !== FALSE && $ctr < $key_data_length) {
            // Skip past the tag to get to the keyword itself
            $key_begin = $ctr + 8;
            // The keyword ends where the closing tag begins
            $key_end = strpos($key_data, '</rdf:li>', $key_begin);
            // Make sure the keyword has a closing tag
            if ($key_end === FALSE) break;
            // Make sure the keyword is not too long (not sure what WP can handle)
            $key_length = $key_end - $key_begin;
            $key_length = (100 < $key_length ? 100 : $key_length);
            // Add the keyword to the keyword array
            array_push($keys, substr($key_data, $key_begin, $key_length));
            // Find the next keyword's opening tag
            $ctr = strpos($key_data, '<rdf:li>', $key_end);
        }
    }
}
I have this implemented in a plugin to put IPTC keywords into WP's "Description" field, which you can find here.
ExifTool is very robust if you can shell out to it (from PHP, it looks like?).
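As an illustration of the shell-out approach, here is a minimal Python sketch (the same idea works from PHP via shell_exec); it assumes exiftool is installed and on the PATH, and photo.jpg is a placeholder filename:

import json
import subprocess

# -j asks exiftool for JSON output; -Keywords selects the IPTC keywords tag
result = subprocess.run(
    ["exiftool", "-j", "-Keywords", "photo.jpg"],
    capture_output=True, text=True, check=True,
)
metadata = json.loads(result.stdout)
print(metadata[0].get("Keywords", []))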