Exporting and importing images in MediaWiki

How do I export and import images from and into a MediaWiki?

Terminal solutions
A MediaWiki administrator can perform maintenance tasks at the server's terminal using the Maintenance scripts framework. Recent MediaWiki versions ship all the standard scripts used in the tasks described below, but older versions may have bugs or lack some of the modern scripts: check the version number with grep wgVersion includes/DefaultSettings.php.
Note: all of the scripts cited below also have a --help option, for instance php maintenance/importImages.php --help.
Original image folder
Users upload files through the Special:Upload page; administrators can configure the allowed file types through an extension whitelist. Once uploaded, files are stored in a folder on the file system, and thumbnails in a dedicated thumb directory.
MediaWiki's images folder can be zipped with the command zip -r ~/Mediafiles.zip images, but the resulting zip is not very useful:
- it contains many spurious files: "deleted" and "old" files (not the current ones) with filenames like 20160627184943!MyFig.png, and thumbnails like MyFig.png/120px-MyFig.jpg.
- for data-interchange or long-term preservation purposes it is unsuitable: the hashed images/?/??/* folder layout is not the usual "all image files in one folder" structure.
Images export/import
For "Exporting and Importing" all current images in one folder at MediaWiki server's terminal, there are a step-by-step single procedure.
Step-1: generate the image dump using dumpUploads.php (with the --local or --shared option when needed for preservation), which creates a text list of all image filenames in use.
mkdir /tmp/workingBackupMediaFiles
php maintenance/dumpUploads.php \
| sed 's~mwstore://local-backend/local-public~./images~' \
| xargs cp -t /tmp/workingBackupMediaFiles
zip -r ~/Mediafiles.zip /tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
The commands result in a standard zip file of your image backup folder, Mediafiles.zip, in your home directory (~/).
NOTE: if you are not worried about the ugly folder structure, a more direct way is
php maintenance/dumpUploads.php \
| sed 's~mwstore://local-backend/local-public~./images~' \
| zip ~/Mediafiles.zip -@
Depending on your MediaWiki version, the --base=./ option of dumpUploads.php will work fine, and you can then remove the sed command from the pipe.
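In that case the pipe might reduce to something like this (a sketch; whether the right value is --base=./ or --base=./images depends on where you run it relative to the images directory):
php maintenance/dumpUploads.php --base=./images \
| zip ~/Mediafiles.zip -@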
Step-2: need a backup? Installing a copy of the images? Either way, you only need Mediafiles.zip and a MediaWiki installation with no contents. If the wiki already has contents, check for filename conflicts (!). Another potential problem is the configuration of file formats and permissions, which must be the same or broader in the new wiki; see Manual:Configuring file uploads.
Step-3: restore the dump into the new wiki with the maintenance tools. Supposing that you used Step-1 to export and preserve the images in a zip file:
unzip ~/Mediafiles.zip -d /tmp/workingBackupMediaFiles
php maintenance/importImages.php /tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
php maintenance/update.php
php maintenance/rebuildall.php
That is all. Check by navigating to your new wiki's Special:NewFiles.
Full export or preservation
To export ALL images and ALL articles of your old MediaWiki, for full backup or content preservation, add some procedures at each step:
Step-1: ... see Step-1 above ... and, to generate the text-content dump from the old wiki:
php maintenance/dumpBackup.php --full | gzip > ~/dumpContent.xml.gz
Note: instead of --full you can use the --current option.
Step-2: ... you need dumpContent.xml.gz and Mediafiles.zip from the old wiki. Suppose both files are in your ~ folder.
Step-3: run in your new wiki:
unzip ~/Mediafiles.zip -d /tmp/workingBackupMediaFiles
gunzip -c ~/dumpContent.xml.gz \
| php maintenance/importDump.php --no-updates \
--image-base-path=/tmp/workingBackupMediaFiles
rm -r /tmp/workingBackupMediaFiles
php maintenance/update.php
php maintenance/rebuildall.php
That is all. Check also Special:AllPages of the new wiki.

There is no automatic way to export images the way you export pages; you have to right-click on them and choose "save image". To get the history of the image page, use the Special:Export page.
To import images, use the Special:Upload page on your wiki. If you have lots of them, you can use the importImages.php maintenance script. Note: you generally have to be in the sysop group to upload images.
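For example, a typical invocation might look like this (a sketch; the path is a placeholder, and --user and --comment are standard options of importImages.php):
php maintenance/importImages.php --user=Admin --comment='Batch import' /path/to/images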

- Export ALL:
You can get all pages and all images from a MediaWiki site using the [API], even if you are not the owner of the site (provided, of course, that the owner hasn't disabled this function):
Step 1: use the API to get all page titles and all image URLs. You can write some code to do this automatically.
Step 2: use [Special:Export] to export all pages with the titles you got, and use wget to fetch all the image links you collected (like this: wget -i img-list.txt).
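As a concrete sketch of Step 1 for the images (the wiki URL is a placeholder and jq is assumed to be installed; a real script would also follow the aicontinue continuation value to page through more than 500 results):
curl -s 'https://example.org/w/api.php?action=query&list=allimages&aiprop=url&ailimit=500&format=json' \
| jq -r '.query.allimages[].url' > img-list.txt
wget -i img-list.txt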
- Import ALL:
Step 1: Import pages using [Special:Import]
Step 2: Import images using [Manual:ImportImages.php].

There are a few mass upload tools available.
Commonist - www.djini.de/software/commonist/
Commonplace - commons.wikimedia.org/wiki/Commons:Tools/Commonplace (used to be available, but was deprecated as of Jan. 13, 2010)
Both run on the desktop and can be configured to upload to your local wiki (they are configured for Wikipedia and Wikimedia Commons by default). If you are afraid to edit the content of a .jar file, I suggest you start with Commonplace.
Another useful extension exists for MediaWiki itself.
MultiUpload - http://www.mediawiki.org/wiki/Extension:MultiUpload
This extension allows you to drop images in a folder and load them all at once. It supports annotations for each file if necessary and cleans up the folder once it is done. On the downside, it requires opening a shared folder on the server side.

Hope this helps a bit: http://www.mediawiki.org/wiki/Manual:ImportImages.php

As a committer of MediaWiki-Japi I'd like to point out:
For the use case of pushing pages, including images, from one wiki to another, MediaWiki-Japi now has a command-line mode; see
Issue 49 - Enable commandline interface with page transfer option
Otherwise you can use the MediaWiki API with the language of your choice and use functions like those you find in PushPages.java, e.g.
download
upload

Related

Downloading files as a single .zip on a Windows server

A client has a download area where users can download or browse single files. The files are divided into folders (documents, catalogues, newsletters and so on), and their extensions vary: they can be .pdf, .ai or plain .jpeg. He asked me if I can provide a link to download every item in a specific folder as one big compressed file. Problem is, I'm on a Windows server, so I'm a bit clueless whether there's a way. I can edit the pages of this area, so I can include jQuery and scripts with a little freedom. Any hint?
The archiver built into Windows is tar, and you need to build a tarball (historically, all related files in one Tape ARchive).
I have a file server which is mapped as S:\ (the server itself does not have a tar command, and tar cannot use a URL, but it can use a mapped device).
For any folder's contents (including subfolders) it is easy to remotely save all current files in a zip with a single command (multiple root locations need a loop or a list).
It builds the tape archive as a Windows .zip using the -a (auto-format) switch, but you need to consider the desired level of nesting by collecting all contents at the desired root location.
tar -a [other options] -cf file.zip [folders/files]
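For example, to zip one folder from the mapped share (a sketch; the S:\ paths and folder name are assumptions for this setup):
tar -acf S:\site\all.zip -C S:\site catalogues
The -C switch changes to S:\site first, so the archive contains the catalogues folder at its root.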
Points to watch out for
ensure there is not an older archive already at the target path
it may report errors/warnings during the run; however, it should complete without failing.
Once you have the zip file you can post it as a web asset, such as
<a href="\\server\folder\all.zip" download="all.zip">Get All</a>
for other notes see https://stackoverflow.com/a/68728992/10802527

pandoc to make each directory a chapter

I have a lot of Markdown files in various directories, each with the same format (# title, then ## sub-title).
Can I make the --toc respect the folder layout, so that the folder itself is the name of the chapter and each Markdown file is content of that chapter?
So far pandoc totally ignores my folder names; it works the same as putting all the Markdown files in a single folder.
My approach is to create index files in each folder with a first-level heading and to downgrade the headings in all other files by one level.
I use Git, and by default I keep the default structure, with first-level headings in the files; but when I want to generate an ebook using pandoc, I modify the files via an automated Linux shell script. Afterwards, I revert the changed files via Git.
Here's the script:
find ./docs/*/ -name "*.md" ! -name "*index.md" -exec perl -pi -e "s/^(#)+\s/#$&/g" {} \;
The ./docs/*/ part means I'm looking only for files inside subfolders of the docs directory, like docs/foo/file1.md or docs/bar/file2.md.
I'm also interested only in *.md files, excluding *index.md files.
In the index.md files (which I usually name 00-index.md so they appear first), I put a first-level heading #, and because those files are excluded by the find portion of the script, their headings aren't downgraded.
Next, there's a Perl search-and-replace command with the regular expression s/^(#)+\s/#$&/g, which looks for all lines starting with one or more # and adds another # to them ($& is the whole matched text).
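A quick way to see the effect of that substitution (a sketch using echo):
echo '## sub-title' | perl -pe 's/^(#)+\s/#$&/g'
The output is ### sub-title: the matched ## (plus its space) gets one more # prefixed.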
In the end, I'm running pandoc with --toc-depth=2 so the table of contents contains only first- and second-level headings.
pandoc ./docs/**/*.md --verbose --fail-if-warnings --toc-depth=2 --table-of-contents -o ./ebook.epub
To revert all changes made to files, I restore changes in the Git repo.
git restore .

Programming a Filter/Backend to 'Print to PDF' with CUPS from any Mac OS X application

Okay so here is what I want to do. I want to add a print option that prints whatever the user's document is to a PDF and adds some headers before sending it off to a device.
I guess my questions are: how do I add a virtual "printer" driver for the user that will launch the application I've been developing that will make the PDF (or make the PDF and launch my application with references to the newly generated PDF)? How do I interface with CUPS to generate the PDF? I'm not sure I'm being clear, so let me know if more information would be helpful.
I've worked through this printing-with-CUPS tutorial and seem to get everything set up okay, but the file never seems to appear in the appropriate temporary location. And if anyone is looking for a user-end PDF printer, this cups-pdf-for-mac-os-x is one that works through the installer; however, I have the same issue of no file appearing in the indicated directory when I download the source and follow the instructions in the readme. If anyone can get either of these to work on a Mac through the terminal, please let me know step by step how you did it.
The way to go is this:
Set up a print queue with any driver you like, but I recommend using a PostScript driver/PPD. (A PostScript PPD is one which does not contain any *cupsFilter: ... line.)
Initially, use the (educational) CUPS backend named 2dir. It can be copied from this website: KDE Printing Developer Tools Wiki. When copying, make sure you get the line endings right (Unix-like).
Command line to set up the initial queue:
lpadmin \
-p pdfqueue \
-v 2dir:/tmp/pdfqueue \
-E \
-P /path/to/postscript-printer.ppd
The 2dir backend will now write all output to the directory /tmp/pdfqueue/, using a unique name for each job. For now, each result should be a PostScript file (with none of the modifications you want yet).
Locate the PPD used by this queue in /etc/cups/ppd/ (its name should be pdfqueue.ppd).
Add the following line (ideally near the top of the PPD):
*cupsFilter: "application/pdf 0 -"
(Make sure *cupsFilter: starts at the very beginning of the line.) This line tells cupsd to auto-set-up a filtering chain that produces PDF and then calls the last filter, named '-', before it sends the file via a backend to the printer. That '-' filter is a special one: it does nothing; it is a pass-through filter.
Re-start the CUPS scheduler:
sudo launchctl unload /System/Library/LaunchDaemons/org.cups.cupsd.plist
sudo launchctl load /System/Library/LaunchDaemons/org.cups.cupsd.plist
From now on your pdfqueue will cause each job printed to it to end up as PDF in /tmp/pdfqueue/*.pdf.
Study the 2dir backend script. It's simple Bash, and reasonably well commented.
Modify 2dir in a way that applies your desired modifications to the PDF before saving the result in /tmp/pdfqueue/*.pdf...
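For orientation, the core of such a backend might look like this minimal sketch (this is not the actual 2dir script; the 2dir: URI and /tmp/pdfqueue path follow the lpadmin example above, and the modification step is a placeholder):
#!/bin/bash
# CUPS invokes a backend as: backend job-id user title copies options [file]
# Called with no arguments, a backend must advertise itself (shown by lpinfo -v):
if [ $# -eq 0 ]; then
  echo 'file 2dir:/tmp/pdfqueue "Unknown" "Save print jobs to a directory"'
  exit 0
fi
outdir="${DEVICE_URI#2dir:}"             # cupsd exports DEVICE_URI, e.g. 2dir:/tmp/pdfqueue
mkdir -p "$outdir"
outfile="$outdir/job-$1-$(date +%s).pdf" # unique name per job-id
if [ $# -ge 6 ]; then
  cat "$6" > "$outfile"                  # job data handed over as a file
else
  cat > "$outfile"                       # job data arrives on stdin
fi
# ... apply your PDF modifications to "$outfile" here ...
exit 0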
Update: Looks like I forgot 2 quotes in my originally prescribed *cupsFilter: ... line above. Sorry!
I really wish I could accept two answers, because I don't think I could have done this without all of @Kurt Pfeifle's help with Mac specifics and with understanding printer drivers and file locations. But here's what I did:
Download the source code from codepoet cups-pdf-for-mac-os-x. (For non-Macs, you can look at http://www.cups-pdf.de/.) The readme is very detailed, and if you read all of the instructions carefully it will work; however, I had a little trouble getting all the pieces, so I will outline exactly what I did in the hope of saving someone else some trouble. For this, the directory with the source code is called "cups-pdfdownloaddir".
Compile cups-pdf.c contained in the src folder as the readme specifies:
gcc -O9 -s -lcups -o cups-pdf cups-pdf.c
There may be a warning: ld: warning: option -s is obsolete and being ignored, but this posed no issue for me. Copy the binary into /usr/libexec/cups/backend. You will likely have to use the sudo command, which will prompt you for your password. For example:
sudo cp /cups-pdfdownloaddir/src/cups-pdf /usr/libexec/cups/backend
Also, don't forget to change the permissions on this file: it needs root-only permissions (700), which can be set with the following after moving cups-pdf into the backend directory:
sudo chmod 700 /usr/libexec/cups/backend/cups-pdf
Edit the file contained in /cups-pdfdownloaddir/extra/cups-pdf.conf. Under the "PDF Conversion Settings" header, find the line under the GhostScript setting that reads #GhostScript /usr/bin/gs. I did not uncomment it, in case I needed it later, but simply added beneath it the line GhostScript /usr/bin/pstopdf. (There should be no preceding # on any of these added lines.)
Find the line under GSCall that reads #GSCall %s -q -dCompatibilityLevel=%s -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -sOutputFile="%s" -dAutoRotatePages=/PageByPage -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -dPDFSETTINGS=/prepress -c .setpdfwrite -f %s. Again without uncommenting it, beneath it I added the line GSCall %s %s -o %s %s
Find the line under PDFVer that reads #PDFVer 1.4 and change it to just PDFVer, with no spaces or following characters.
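Putting those three edits together, the active (uncommented) lines added to cups-pdf.conf read:
GhostScript /usr/bin/pstopdf
GSCall %s %s -o %s %s
PDFVer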
Now save and exit editing before copying this file to /etc/cups with the following command
sudo cp cups-pdfdownloaddir/extra/cups-pdf.conf /etc/cups
Be careful when editing in a text editor, because newlines in UNIX and Mac environments are different and can potentially ruin scripts. You can always use a perl command to remove them, but I'm paranoid and prefer not to deal with it in the first place.
You should now be able to open a program (e.g. Word, Excel, ...) and select File >> Print and find an available printer called CUPS-PDF. Print to this printer, and you should find your pdfs in /var/spool/cups-pdf/yourusername/ by default.
*Also, I figured this might be helpful because it helped me: if something gets screwed up while following these directions and you need to start over or get rid of it, to remove the driver you need to (1) remove the cups-pdf backend from /usr/libexec/cups/backend, (2) remove cups-pdf.conf from /etc/cups/, and (3) go into System Preferences >> Print & Fax and delete the CUPS-PDF printer.
This is how I successfully set up a PDF backend/filter for myself; however, there are more details and other information on customization in the readme file. Hope this helps someone else!

Copy/publish images linked from the html files to another server and update the HTML files referencing them

I am publishing content from a Drupal CMS to static HTML pages on another domain, hosted on a second server. Building the HTML files was simple (using PHP/MySQL to write the files).
I have a list of images referenced in my HTML, all of which exist below the /userfiles/ directory.
cat *.html | grep -oE "[^'\"]+userfiles[/.*]*/[^'\"]" | sort | uniq
Which produces a list of files
http://my.server.com/userfiles/Another%20User1.jpg
http://my.server.com/userfiles/image/image%201.jpg
...
My next step is to copy these images across to the second server and translate the tags in the html files.
I understand that sed is probably the tool I would need. E.g.:
sed 's/[^"]\+userfiles[\/image]\?\/\([^"]\+\)/\/images\/\1/g'
Should change http://my.server.com/userfiles/Another%20User1.jpg to /images/Another%20User1.jpg, but I cannot work out exactly how I would use the script. I.e., can I use it to update the files in place, or do I need to juggle temporary files, etc.? And how can I ensure that the files are moved to the correct location on the second server?
It's possible to use sed to change the files in place using the -i option.
For your use, it's up to you whether it's easier/better to create new files with the changes from the old ones and then copy them to the second domain using scp (or something similar), or to copy the files first and then modify them once they're on the remote server (less management of new filenames that way).
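A minimal sketch of the first approach (the remote host and web-root paths are placeholders, img-list.txt is the list saved from the grep step above, and the sed pattern is a simplified version of the one in the question that keeps everything after userfiles):
# rewrite absolute userfiles URLs to the new /images root, keeping .bak backups
sed -i.bak 's~http://my.server.com/userfiles~/images~g' *.html
# fetch the referenced images locally, then push images and rewritten HTML
wget -i img-list.txt -P images/
scp -r images/ user@second.example.com:/var/www/images/
scp *.html user@second.example.com:/var/www/html/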

Site Performance and Download

I want to find an automated way to download an entire website page (not the entire site, just a single page) and all elements on the page, then sum the size of these files.
When I say files, I would like to know the total size of the HTML, CSS, images, local and remote JS files, and any CSS background images. Basically, the entire page weight for a given page.
I thought about using cURL but was not sure how to make it grab remote and local JS files as well as images referenced in the CSS files.
Try wget:
- make it download all required files with the -p or --page-requisites option
- download scripts and images local to the site and not further than 2 hops away (this should get local images and code) with -l 2, i.e. --level=2
- and change the code files to link to your local files instead of their original paths with -k, i.e. --convert-links:
wget -p -l 2 -k http://full_url/to/page.html
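To get the total page weight the question asks for, sum everything wget saved with du (a sketch; by default wget puts the files under a directory named after the host):
du -sh full_url/   # combined size of the page and all its requisites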
