How to extract a list of URLs from a specific domain? - firefox

I'm using Firefox 53 with Scrapbook X, and I want to save a lot of pages using its Save Multiple URLs feature, but before I do that, I want to extract a specific list of URLs without having to do so manually.
The site I want to extract data from is www.address-data.co.uk - namely this page.
What I want is to extract only the URLs of the sub-pages within that page (all the sub-pages for the EH postcodes), but not the privacy policy or contact-us page.
Is there a way to do this online, or any tool for Mac OS X that can find all related URLs before I copy them into Scrapbook's Save Multiple URLs (where I save them in a subfolder of Scrapbook)?

I assume that EH45 is typical of those you want to extract from the page you mentioned. Like its siblings it's of the form https://address-data.co.uk/postcode-district-EH<postcode number>.
This means that you can make a complete list of the URLs if you have a list of the numbers, or of the postcodes.
My main difficulty in answering is that I don't know what tools (especially programming tools) you might have at your disposal. I will assume only that you have, or can obtain, access to an editor that can record macros or edit columns. On Windows I would use Emerald (formerly Crimson).
Then copy the contents of the table in the EH page (not the table headings) and remove everything except the first column. Finally, prepend every item in the column with 'https://address-data.co.uk/postcode-district-'.
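If you do have a scripting language available, a few lines of Python would do the same job. This is just a sketch, assuming the postcodes are in a plain text file with one postcode district per line; the file names are only illustrative:

# Build the list of address-data.co.uk district URLs from a list of postcodes.
# "postcodes.txt" and "urls.txt" are illustrative file names.
PREFIX = "https://address-data.co.uk/postcode-district-"

with open("postcodes.txt") as src, open("urls.txt", "w") as out:
    for line in src:
        postcode = line.strip()
        if postcode:  # skip blank lines
            out.write(PREFIX + postcode + "\n")

The resulting urls.txt can then be pasted straight into Scrapbook's Save Multiple URLs dialog.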
PS: This might also be a good question to put on SuperUser.

Related

Generate EDGAR FTP File Path List

I'm brand new to programming (though I'm willing to learn), so apologies in advance for my very basic question.
The SEC makes all of its filings available via FTP, and eventually I would like to download a subset of these files in bulk. However, before creating such a script, I need to generate a list of the locations of these files, which follow this format:
/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
51143 = the company ID, and I already accessed the list of company IDs I need via FTP
000005114313000007/0000051143-13-000007 = the report ID, aka "accession number"
I'm struggling with how to figure this out as the documentation is fairly light. If I already have the 000005114313000007/0000051143-13-000007 (what the SEC calls the "accession number") then it's pretty straightforward. But I'm looking for ~45k entries and would obviously need to generate these automatically for a given CIK ID (which I already have).
Is there an automated way to achieve this?
Welcome to SO.
I'm currently scraping the same site, so I'll explain what I've done so far. What I am assuming is that you'll have the CIK numbers of the companies you're looking to scrape. If you search the company's CIK, you'll get a list of all of the files that are available for the company in question. Let's use Apple as an example (since they have a TON of files):
Link to Apple's Filings
From here you can set a search filter. The document you linked was a 10-Q, so let's use that. If you filter on 10-Q, you'll have a list of all of the 10-Q documents. You'll notice that the URL changes slightly to accommodate the filter.
You can use Python and its web scraping libraries to take that URL and scrape all of the URLs of the documents in the table on that page. For each of these links you can scrape whatever links or information you want off the page. I personally use BeautifulSoup4, but lxml is another choice for web scraping, should you choose Python as your programming language. I would recommend using Python, as it's fairly easy to learn the basics and some intermediate programming constructs.
Past that, the project is yours. Good luck! I've posted some links below to get you started. I'm only allowed to post two links since I'm new to the site, so I'll give you the Beautiful Soup link:
Beautiful Soup Home Page
If you choose to use Python and are new to the language, check out the codecademy python course, and don't forget to check out lxml, as some people prefer it over BeautifulSoup (some people also use both in conjunction, so it's all a matter of personal preference).
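For a rough idea of what the approach above looks like in code, here is a minimal sketch (not a drop-in solution) assuming Python with the requests and BeautifulSoup libraries. It uses EDGAR's company-browse URL and simply collects links to filing index pages; the link-matching heuristic is an assumption, and real use would need paging, error handling, a proper User-Agent, and respect for the SEC's rate limits.

# Sketch: list the 10-Q filing index pages for one CIK via EDGAR's company
# browse page. The link-matching heuristic may need adjusting.
import requests
from bs4 import BeautifulSoup

BROWSE = "https://www.sec.gov/cgi-bin/browse-edgar"

def filing_index_urls(cik, form_type="10-Q"):
    params = {"action": "getcompany", "CIK": cik, "type": form_type, "count": 40}
    headers = {"User-Agent": "your-name your@email.example"}  # the SEC asks for one
    resp = requests.get(BROWSE, params=params, headers=headers)
    soup = BeautifulSoup(resp.text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        if a["href"].endswith("-index.htm"):  # filing index pages end like this
            links.append("https://www.sec.gov" + a["href"])
    return links

print(filing_index_urls("0000320193"))  # Apple's CIK, as in the example above

From each index page you can then scrape the accession number and the individual document links in the same way.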

Magento Translation

I have Magento 1.8.1.0. Recently I installed the Russian language pack, but the result wasn't good enough, because some phrases on the frontend remained in English.
I know there's a handy way to translate Magento using csv files.
The question is: where can I find the proper csv file? Does the installed theme affect translation somehow? I know I'm asking newbie questions; I've read several posts, but I still haven't worked out how to translate Magento.
Many thanks in advance.
Hope you are doing well!
As I understand your question, you want to translate your website's frontend into Russian when the user selects Russian as the language.
For this you need to work with the translate.csv files, which are available in your theme package.
Example: app/design/frontend/default/SecuareWeb/locale/de_DE
In the locale folder you will find a folder for the Russian language; open it and you will find the file to which you need to add the required translation text.
How to add translation text to the translate.csv file is shown below.
Example:
"This is the demo of translation in Russian","Это демо-трансляции на русском языке"
One thing I would like to add: make sure the text in your frontend .phtml files is wrapped in $this->__("Example");. Only text wrapped like this will be translated; anything else will be left as it is.
Hope this is useful to you! Looking forward to your comments.
There are different ways to achieve translation in Magento, so you will find multiple directories containing static csv files and also a database table.
All the modes share the same structure: key/value. For example: "String to translate","String translated".
Inline translation (database table: core_translate):
Following best practices in Magento, you should use inline translation (aka database-saved translation) only in rare cases. It is harder to maintain and can be buggy. It has first precedence, so any translation you do via inline translation will override the other modes.
Theme-level translation (file in app/design/frontend/your_package/your_theme/locale/ru_RU/translate.csv):
You can place any string to be translated in the translate.csv. It has second precedence.
Locale translation (file in app/locale/ru_RU/Module_Name.csv):
This is the suggested way to do translation, as it keeps translations separated by module and is easier to maintain. For example: Mage_Catalog.csv etc.
Each module in Magento can specify its own csv file containing translations, and sometimes different modules try to translate the same string, so if your translation does not work, check across the multiple files with a quick editor search. This mode is overridden by the two modes above.
Note:
Magento loads all the csv files, builds up a giant tree, and caches it. So before scratching your head because a string is not translated as you wished on the frontend:
1. Clean the cache.
2. Check for any identical key string that comes after your translated string. For example: in the same csv, line 100 will override line 1 if the key strings are the same.
3. Check for any identical key string in a mode with higher precedence. For example: inline translation will override any csv-based translated string.
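If it helps to picture the precedence, here is a small illustration (in Python, not Magento's actual code) of how the three sources can be thought of as dictionaries merged in order, with later sources overriding earlier ones for the same key string; the file paths are just examples from above.

# Illustration only: merge translation sources in Magento's precedence order.
# Later updates override earlier ones when the key string is the same.
import csv
import os

def load_csv(path):
    # Read a key/value translation csv; missing files are simply skipped here.
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}

translations = {}
# 1) module/locale csv files, e.g. app/locale/ru_RU/Mage_Catalog.csv (lowest precedence)
translations.update(load_csv("app/locale/ru_RU/Mage_Catalog.csv"))
# 2) theme-level translate.csv (overrides the module csv)
translations.update(load_csv("app/design/frontend/your_package/your_theme/locale/ru_RU/translate.csv"))
# 3) inline translations from the core_translate table (highest precedence)
translations.update({"String to translate": "String translated inline"})

print(translations.get("String to translate"))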
It may be easier for you to go to the admin backend System -> Configuration -> Developer and switch "Translate Inline" "Enabled for Frontend" to "Yes".
Then, refresh the frontend and you can change the translations directly in your web browser.
The translation is saved in the database table core_translate, in case you want to do it in a test environment and copy the translation over to production later.
Take care that without client restrictions (System -> Configuration -> Developer) everyone will see the translation options.
By the way, you may need to clear the cache and refresh the webpage in order to see your changes.

Randomization in Qualtrics using Photos or Graphics and Loop and Merge

I am creating a survey in Qualtrics with many photos, say 1000. I want each survey participant to answer, say, 6 questions per photo. Each participant will see 5 photos that are randomly assigned.
Before looking into things, I assumed that there would be a way to upload the 1000 photos, create one block in Qualtrics (with the 6 questions), and then simply randomize the photo that occurs and have this repeated 5 times.
But it seems like this is either not possible or not obvious. I called Qualtrics and they said that I would manually need to create 1000 blocks (each block would be exactly the same with the exception of the title and the photo). I would then need to go into the Survey Flow and use the Randomizer there and manually add all 1000 blocks and have it randomly present 5 of the elements.
I really hope that there is a better way. This will take a ton of time if I have to do it this way.
If not, is there any way to automate anything?
Creating new blocks and automatically populating the photos: I know Python and could possibly write a script to generate blocks, but the photo names are changed from their original names into some complicated code that Qualtrics generates.
Loading the photos into Qualtrics all at once (it currently requires one to load photos one at a time).
It turns out that there is a much better and faster way to do this than the 1000-blocks fix.
There is a bunch of stuff going on to accomplish it, but it is possible.
First, one needs to put the photos into Qualtrics through the Graphics Library. The best way to do this is to simply drag and drop the photos into the desired location. Luckily one does not have to do this one-by-one. Make sure that they are in the order you want.
Second, create a block with a "question" where you want the random photo to appear. This block should also have all 6 questions.
Third, create a column in a spreadsheet (in, e.g., Excel) of the URLs corresponding to the photos. These should be in order. One way to do this is mentioned at the bottom.
Fourth, go to the Loop and Merge option for this block. Copy and paste the column of URLs into, say, Field 1. Luckily this option exists, so one does not have to do this one-by-one either. A side note: if one changes the numbers in the gray boxes to the left of the rows, this changes what appears in the results, but there is no apparent way to change these more than one at a time.
Then you should be all set.
Finally, a little bit about how to get the URLs of the photos. Once again, make sure the photos in the library are in the order you want. Then you can use web scraping to scrape the image names, which can then be put into the proper URL. I used Python's Selenium and BeautifulSoup to accomplish this. Here is what I did, using a Mac. The code at least gives you the idea:
from bs4 import BeautifulSoup
import codecs
import os
import re
from selenium import webdriver

chromedriver = "File path to /chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# In the Chrome browser that has appeared, manually navigate to the photos
# library page, then run the rest:
thumbs = driver.find_elements_by_css_selector(".thumbframe")
out = codecs.open('outputURLs.txt', 'w', encoding='utf-8')
urls = {}
for i in range(len(thumbs)):
    h = thumbs[i].get_attribute("innerHTML")
    soup = BeautifulSoup(h, "html.parser")
    # Grab the img tag carrying Qualtrics' generated attributes.
    t = soup.find_all("img", attrs={"p4": re.compile('.*')})
    urls[i] = t[0]['p1']
    # "*Qualtrics Path*" is a placeholder; see below for how to find the real prefix.
    out.write('<img src="*Qualtrics Path*/Graphic.php?IM=' + urls[i] + '" />' + '\n')
out.close()
One can find the proper first part to stick in "Qualtrics Path" by, e.g., going to the Qualtrics Survey Editor, inserting a photo using Rich HTML Editing (or something similar), clicking on View Source, and then looking at the file path pattern to use. It may begin with something like https://qualtrics.com/...
Then copy the results into a spreadsheet program and you should be ready to copy and paste.

How do I take each line of a text file and insert them into a web form? Specifically, for testing domain name availability

I wrote a Ruby script that appended "data" to the beginning of every word of the English dictionary, and then filtered out various strings using different parameters, and now I want to use a site like namecheap or gandi.net in order to take each of these strings and insert them into the domain name availability checker in order to determine which ones are available.
It is my understanding that this will involve making a POST HTTP request of some kind, as well as grabbing the element in question, but I don't really know what I should read about in order to do this kind of thing.
I imagine that after a few requests I will be limited, but as a learning exercise I am still curious as to how I would go about doing this.
I inspected the element (on namecheap) to see what the tag looked like, to find any uniquely identifiable class/id names that I could use to grab that specific part of the source, and found that inside a fieldset tag, there was a line of HTML that I can't seem to paste here, so here is a picture:
Thanks in advance for any guidance in helping me learn about web scripting!
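To give a feel for the request-and-parse loop described in the question, here is a minimal Python sketch using requests and BeautifulSoup. The endpoint URL, form field, and CSS class are hypothetical placeholders (every registrar's form and markup differ, and most throttle or block automated checks), so treat it as a learning outline rather than a working checker.

# Hypothetical sketch: check each candidate name from a text file against a
# registrar-style availability form. The URL, parameter, and CSS class below
# are placeholders, not a real API.
import time
import requests
from bs4 import BeautifulSoup

CHECK_URL = "https://registrar.example/domain-check"  # hypothetical endpoint

with open("candidates.txt") as f:
    names = [line.strip() for line in f if line.strip()]

for name in names:
    domain = name + ".com"
    resp = requests.post(CHECK_URL, data={"domain": domain})  # hypothetical field name
    soup = BeautifulSoup(resp.text, "html.parser")
    # A real page would need to be inspected (as described above) to find the
    # element that signals availability; this class name is a placeholder.
    available = soup.find(class_="domain-available") is not None
    print(domain, "available" if available else "taken")
    time.sleep(2)  # be polite; real services rate-limit rapid requests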

mod_rewrite and redundant / old urls, some SEO best practices needed

Having a look at how Google perceives our site at the moment and coming up short...
Basically, we use a bog-standard structure of URL rewriting to make our URLs look SEO-friendly.
For instance, a product URL takes the shape of any string_([0-9]).html and so forth. Of course, this allows us to link to whatever we want before the product ID... which we have done. In the past, a product page was Product_Name_79.html and then became Brand_Name_Product_Name_79.html. Apache does not really care, and ID 79 gets passed on in either case. However, Google now has 2 versions of this product cached under different URLs - and that's not a good thing, as it continues to arrive at the first URL and spider it.
The same thing applies to our rewrite rules for brands and categories, some of which have been dropped and some of which have been modified.
There are over 11k URLs in site:domain, whereas our sitemap has only some 5.8k. How would you prevent spiders from fetching older versions of URLs that you no longer link to (considering it's not a manual process and such URLs can often be very dynamic)?
E.g., Mens_Merrell_Trail_Running_Shoes__50-100__10____024/ is a dynamic URL for the Merrell brand, narrowed down to items in trail running shoes that cost between 50 and 100, in size 10, with gender set to men's.
If we decide to nofollow any size and price filter URLs, that still leaves Google able to access them through its old cache...
What is the best practice for disallowing a particular type of URL? As the combinations above are nearly infinite, I cannot produce a list, and it certainly cannot be backdated against whatever brands and categories Google may hold for us historically.
Shall we add noindex when such filters are applied? Shall we disallow them in robots.txt? Do nothing in the hope that Google stops returning them?
To put it into perspective, we have 2,600 product page URLs that are now redundant / disabled. What would you do with them? Redirect to the homepage or a brand page, return a 404, or do nothing?
Thanks for any advice.
I think you're looking for rel="canonical"; Google should start ignoring your links if they're really not linked to anymore. You can check any incoming links with a tool like this: http://www.seomoz.org/linkscape.
Also, if your old URLs match (or don't match) a consistent pattern, you could set up a 301 redirect in Apache either for pages matching the old pattern or for pages not matching the new pattern...
Hope this helps!
Just be sure to set up redirects for any URL you change. Also, I don't recommend using rel=nofollow since it indicates to Google that your site is not trustworthy.
