Powershell and Adobe OCR [closed] - windows

We have many PDF files. They are all unlocked and contain text, pictures, etc. Every time, we have to open the file in Adobe and do the OCR manually. I was thinking maybe there is a better way to do this with PowerShell. We have over 1000 files to process, and more are coming. Thank you for your answer.
Peggy

After looking into it a bit more, I discovered a command-line tool that you can use in tandem with PowerShell. It's called tesseract. For Windows and Linux, download the prebuilt binaries. For macOS, you need to use MacPorts or Homebrew.
You'll want to do something like this:
# Using Get-ChildItem's -Include parameter to filter file types
# requires the target path to end in an asterisk. Using just an
# asterisk as the path makes it target the current directory.
foreach ($pdf in (Get-ChildItem * -Include *.pdf))
{
    # An array isn't needed, it's just good for arranging arguments
    tesseract @(
        #INPUT:
            $pdf
        #OUTPUT:
            "$($pdf.Directory)\{OCR} $($pdf.Name)"
        #LANGUAGE:
            '-l','eng'
    )
    # The directory is included in the output path so that you can
    # change Get-ChildItem's target without adjusting the argument
}
Or, without the fluff:
foreach ($pdf in (Get-ChildItem * -Include *.pdf))
{
tesseract $pdf "$($pdf.Directory)\{OCR} $($pdf.Name)" -l eng
}
Granted, I haven't actually tested tesseract out, but I did read other Q&A pages to derive the appropriate command. Let me know if there are any issues.
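One caveat worth checking before running this on 1000 files: as far as I know, tesseract reads image formats (PNG, TIFF, JPEG and so on) rather than PDF, so each PDF usually has to be rasterized first. Below is a minimal sketch in POSIX shell (for example Git Bash or WSL on Windows), assuming Ghostscript is installed as gs (on Windows the console binary is typically named gswin64c). The run and ocr_pdf helpers and the .ocr-tmp.tif temporary name are hypothetical, made up for illustration. With DRY_RUN=1 the commands are only printed, not executed.

```shell
# Print a command, then run it unless DRY_RUN=1 (hypothetical helper for safe trial runs).
run() {
    printf '%s\n' "$*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# OCR one PDF: rasterize it to a multi-page TIFF with Ghostscript, then let
# tesseract write a searchable PDF. tesseract appends ".pdf" to the output
# base name when the "pdf" config is given.
ocr_pdf() {   # usage: ocr_pdf <file.pdf>
    dir=$(dirname "$1")
    base=$(basename "$1" .pdf)
    tif="$dir/$base.ocr-tmp.tif"   # hypothetical temporary file name
    run gs -q -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r300 "-sOutputFile=$tif" "$1" &&
        run tesseract "$tif" "$dir/{OCR} $base" -l eng pdf &&
        run rm -f "$tif"
}
```

To preview what would run for every PDF in the current directory, set DRY_RUN=1 and call ocr_pdf in a for loop over *.pdf.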

Your question is a bit unclear. There is a way to OCR images using PowerShell, such as using this function, and you can convert PDFs to images using this function (it does require ImageMagick, which is available here; there are portable options if you don't want to install anything). This would effectively allow you to search PDF files that haven't been OCR'd.
However, in terms of directly editing the PDF files with PowerShell to make them into OCR'd PDFs, while PowerShell functionality might help you automate the process, you would first need to find a program that can do that sort of thing from the command line. The PDFs would also have to all be unlocked so that editing them would even be possible (though there are ways to circumvent PDF locks to unlock them).
Unfortunately, I don't really know of any programs that can do that. Maybe it's possible with some advanced Ghostscript parameters, but I haven't looked into it. It is certainly not going to be easy!

Related

Is there a best practice to documenting a Command Line Interface? [closed]

I have designed a few programs that have a CLI and want to document them as standard as possible. Are there any agreements out there as to the best way to do this?
An example:
Let's say the Program is "sayHello" and it takes in a few parameters: name and message. So a standard call would look like this:
> sayHello "Bob" "You look great"
Okay, so my command usage would look something like this:
sayHello [name] [message]
That may already be a mistake if brackets have a specific meaning in usage commands. But let's go a step farther and say "message" is optional:
sayHello [name] [message (optional)]
And then just one more wrinkle, what if there is a default we want to denote:
sayHello [name] [message (optional: default 'you look good')]
I realize this usage statement looks a little obtuse at this point. I'm really asking if there are somewhat agreed-upon standards on how to write these. I have a sneaking suspicion that the parentheses and brackets all have specific meanings.
While I am unaware of any official standard, there are some efforts to provide conventions-by-framework. Docopt is one such framework, and may suit your needs here. In their own words:
docopt helps you:
define interface for your command-line app, and
automatically generate parser for it.
There are implementations for many programming languages, including shell.
You might want to look at the manuals for common Unix commands (e.g. man grep) or the help documentation for Windows commands (e.g. find /?) and use them as a general guide. If you picked either of those patterns (or used some elements common to both), you'd at least surprise the fewest people.
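As a rough illustration of the Unix-style convention, where square brackets mark optional arguments and angle brackets mark placeholders the caller must fill in, here is a minimal shell sketch of the sayHello example. The say_hello and usage function names and the output format are hypothetical, chosen only for this example.

```shell
# Usage text in the common man-page style:
#   <angle brackets> = required placeholder, [square brackets] = optional.
usage() {
    cat <<'EOF'
Usage: sayHello <name> [<message>]

Arguments:
  <name>      Person to greet (required)
  <message>   Text to say (optional; default: "you look good")
EOF
}

say_hello() {
    # Wrong argument count: print usage to stderr and fail with status 2.
    if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
        usage >&2
        return 2
    fi
    # ${2-...} falls back to the documented default only when <message> is absent.
    printf '%s: %s\n' "$1" "${2-you look good}"
}
```

Stating the default in the argument description rather than inside the brackets keeps the synopsis line short and easy to scan.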
Apache commons also has some classes in the commons-cli package that will print usage information for your particular set of command-line options.
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;

Options options = new Options();
// Required option with an argument: -f / --file <FILE>
options.addOption(OptionBuilder.withLongOpt("file")
        .withDescription("The file to be processed")
        .hasArg()
        .withArgName("FILE")
        .isRequired()
        .create('f'));
options.addOption(OptionBuilder.withLongOpt("version")
        .withDescription("Print the version of the application")
        .create('v'));
options.addOption(OptionBuilder.withLongOpt("help").create('h'));
String header = "Do something useful with an input file\n\n";
String footer = "\nPlease report issues at http://example.com/issues";
HelpFormatter formatter = new HelpFormatter();
// The final argument (true) asks the formatter to generate the usage line
formatter.printHelp("myapp", header, options, footer, true);
Using the above will generate help output that looks like:
usage: myapp -f [-h] [-v]
Do something useful with an input file
-f,--file <FILE> The file to be processed
-h,--help
-v,--version Print the version of the application
Please report issues at http://example.com/issues

Storing binary Ruby Marshalled objects in Git. Use filters to convert to text (JSON or YAML) and back? [closed]

I want to version control binary files that contain the data to run our project. The format used by the program is Marshalled Ruby objects; there are no options to change this in the program, it is Windows-only, and it's closed source. Lovely, right?
Here is some good news though. Most of the classes are well documented and for the most part are close to just being Structs, but some have custom Marshalling methods. I also plan to build tools for diffing and merging these files, but figuring out how to put them into the repo is more important.
So, would using filters to clean binary files into text (JSON or YAML) for storage in Git and smudge them back out to binary for the working directory be a wise idea or just a waste of time?
Rough implementation of both filters, using YAML, and untested with Git:
require 'yaml'
puts Marshal.load($stdin.binmode.read).to_yaml # Clean: binary in, YAML out
$stdout.binmode.write Marshal.dump(YAML.load($stdin.read)) # Smudge: YAML in, binary out
Edit: Thought I should note that there are deflated Ruby scripts stored in one of these files. A clean project has about 133 KB of Zlib deflated script in it, about 800 KB when inflated.
I wouldn't get too caught up in the guideline of not storing binary files in Git.
The real challenge comes, as you suggested, in diffing and merging these files. If you store them as text, you likely don't need to do anything special here. YAML and JSON are both relatively easy to diff and merge manually.
If it is convenient, store text. This will let anybody diff the files using whatever tools they already have available.
On the other hand, if you are already planning to write your own diff and merge tools (which can be hooked into Git) you shouldn't have too much trouble storing the original binary files.
Storing binary files and using your custom diff / merge tools will require users to have those tools available for diffing and merging.
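If you do go the text route, note that in Git's terminology the clean filter runs when a file is staged (working tree to repository) and the smudge filter runs on checkout (repository to working tree). A minimal sketch of the wiring, in shell, follows. The *.dat pattern, the filter name marshal, and the script names marshal2yaml.rb and yaml2marshal.rb are all hypothetical placeholders, and the scripts are assumed to be reachable from wherever Git runs them.

```shell
# Register a clean/smudge filter pair for one repository.
# All names here (*.dat, "marshal", the two .rb scripts) are hypothetical.
setup_marshal_filter() {   # usage: setup_marshal_filter <repo-dir>
    # Route matching files through the filter named "marshal".
    printf '%s\n' '*.dat filter=marshal' >> "$1/.gitattributes"
    # clean: runs on "git add"; reads the binary on stdin, writes YAML to stdout.
    git -C "$1" config filter.marshal.clean 'ruby marshal2yaml.rb'
    # smudge: runs on checkout; reads YAML on stdin, writes the binary to stdout.
    git -C "$1" config filter.marshal.smudge 'ruby yaml2marshal.rb'
}
```

The filter definitions live in each clone's Git configuration rather than in the repository itself, so every collaborator has to run a setup step like this once.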

Compare 2 directories in windows [closed]

I need to compare 2 folders "A" and "B" and get the list of files and folders newly added or modified.
I tried using WinMerge, but it does not compare the files inside subfolders (so I have to point to each subfolder manually and compare them one by one).
Is there any way to achieve this?
The following PowerShell code compares the file listings of two folders. It will detect renamed or newly created files and folders, but it will not detect modified data or different timestamps:
$dir1 = Get-ChildItem -Recurse -path C:\dir1
$dir2 = Get-ChildItem -Recurse -path C:\dir2
Compare-Object -ReferenceObject $dir1 -DifferenceObject $dir2
Source: MS Devblog - Dr. Scripto
3rd party edit
How to run a PowerShell script explains how to run the code above in a script.
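The same listing comparison can be sketched in POSIX shell (for example under Git Bash or WSL, which is an assumption about your setup). Like the PowerShell version, it only compares names, so it will not notice modified content. The compare_listings function name is hypothetical.

```shell
# Compare the recursive listings of two directories by relative path only.
compare_listings() {   # usage: compare_listings <dirA> <dirB>
    a_list=$(mktemp)
    b_list=$(mktemp)
    # Relative paths, sorted bytewise so that comm sees consistently ordered input.
    (cd "$1" && find . | LC_ALL=C sort) > "$a_list"
    (cd "$2" && find . | LC_ALL=C sort) > "$b_list"
    # comm -23 keeps lines unique to the first file, comm -13 those unique to the second.
    LC_ALL=C comm -23 "$a_list" "$b_list" | sed 's/^/only in A: /'
    LC_ALL=C comm -13 "$a_list" "$b_list" | sed 's/^/only in B: /'
    rm -f "$a_list" "$b_list"
}
```

Because the paths are taken relative to each root, entries in nested folders line up without pointing at each subfolder by hand.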
For Windows you can use this solution.
Here's the right way to do it, without the external downloads. It looks like a lot at first, but once you've done it, it's very easy.
It works in all Windows versions from 7 back to 95.
For our example assume that you're comparing two directories named 'A' and 'B'.
1. Run cmd.exe to get a command prompt. (In Windows 7, the powershell won't work for this, FYI.) Then do it again, so that you have two of them open next to each other.
2. In each window go to the directories that you want to compare. (Using 'cd' commands. If you're not comfortable with this, then you should probably go with the external utilities, unless you want to learn command prompt stuff.)
3. Type 'dir /b > A.txt' into one window and 'dir /b > B.txt' into the other. You'll now have two text files that list the contents of each directory. The /b flag means bare, which strips the directory listing down to file names only.
4. Move B.txt into the same folder as A.txt.
5. Type 'fc A.txt B.txt'. The command 'fc' means file compare. This will spit out a list of the differences between the two files, with an extra line of text above and below each difference, so you know where they are. For more options on how the output is formatted, type 'fc /?' at the prompt. You can also pipe the differences into another file by using something like 'fc A.txt B.txt > differences.txt'.
Have fun.
This is not necessarily better than other options already mentioned but it might better fit certain use-cases. In my case, I wanted to see what was different before copying those differences from one directory to the other. This method is great for that since the /L option means to only log what would happen.
robocopy C:\dir1 C:\dir2 /MIR /FP /NDL /NP /L
You can further refine the output format with other flags, or change the logic used to compare, etc. Refer to robocopy docs for all the options.
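For a similar "show me what changed, but copy nothing" preview in POSIX shell, here is a minimal sketch. It is not a robocopy equivalent: it is a different technique that compares file content byte by byte with cmp, and it only reports files that are new or different in the second directory. The changed_files function name is hypothetical.

```shell
# List files under <new-dir> that are missing from, or differ in content from, <ref-dir>.
changed_files() {   # usage: changed_files <ref-dir> <new-dir>
    ref=$1
    new=$2
    # Walk the new tree using relative paths so both sides can be addressed the same way.
    (cd "$new" && find . -type f | LC_ALL=C sort) |
    while IFS= read -r f; do
        if [ ! -e "$ref/$f" ]; then
            printf 'added %s\n' "${f#./}"
        elif ! cmp -s "$ref/$f" "$new/$f"; then
            printf 'modified %s\n' "${f#./}"
        fi
    done
}
```

Reading every file is slower than comparing sizes and timestamps, but it catches edits that leave both of those unchanged.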
We have been using Beyond Compare for years and it's quite useful. You can see which files are identical, which files are in folder "A" only and which files are in folder "B" only, and files that are different (for those files you can see what specific modifications have been made).
Some years ago, I made a command line utility, CrcCheckCopy, to help me verify the integrity of large data copies. It reads the source folder and produces a list of the CRCs of all the files. And then, using this list, it can verify the other folder.
I also use it to verify the same folder after some years, to make sure nothing was accidentally deleted or modified.
I offer it for free here in case people who arrive at this question want to try it.
FreeFileSync did the job for me.

Great tools to find and replace in files? [closed]

I'm switching from a Windows PHP-specific editor to VIM, on the philosophy of "use one editor for everything and learn it really well."
However, one feature I liked in my PHP editor was its "find and replace" capability. I could approach things two ways:
Just find. Search all files in a project for a string, see all the occurrences listed, and click to dive into that file at that line.
Blindly replace all occurrences of "foo" with "bar".
And of course I could use the GUI to say what types of files, whether to look in subfolders, whether it was case sensitive, etc.
I'm trying to approximate this ability now, and trying to piece it together with bash is pretty tedious. Doable, but tedious.
Does anybody know any great tools for things like this, for Linux and/or Windows? (I would really prefer a GUI if possible.) Or failing that, a bash script that does the job well? (If it would list file names and line numbers and show code snippets, that would be great.)
Try sed. For example:
sed -i -e 's/foo/bar/g' myfile.txt
Vim has multi-file search built in using the command :vimgrep (or :grep to use an external grep program - this is the only option prior to Vim 7).
:vimgrep will search through files for a regex and load a list of matches into a buffer - you can then either navigate the list of results visually in the buffer or with the :cnext and :cprev commands. It also supports searching through directory trees with the ** wildcard. e.g.
:vimgrep "^Foo.*Bar" **/*.txt
to search for lines starting with Foo and containing Bar in any .txt file under the current directory.
:vimgrep uses the 'quickfix' buffer to store its results. There is also :lvimgrep which uses a local buffer that is specific to the window you are using.
Vim does not support multi-file replace out of the box, but there are plugins that will do that too on vim.org.
I don't get why you can't do this with VIM.
Just Find
/Foo
Highlights all instances of Foo in the file and you can do what you want.
Blindly Replace
:% s/Foo/Bar/g
Obviously this is just the tip of the iceberg. You have lots of flexibility of the scope of your search and full regex support for your term. It might not work exactly like your former editor, but I think your original 'use one editor' idea is a valid one.
Notepad++ allows me to search and replace in an entire folder (and subfolders), with regex support.
You can use perl in command prompt to replace text in files.
perl -p -i".backup" -e "s/foo/bar/g" test.txt
Since you are looking for a GUI tool, I generally use the following 2 tools. Both of them have great functionality including wildcard matching, regex, filetype filter etc. Both of them display useful information about the hits in files, like filename and line.
Visual Studio: fast yet powerful. I use it when the number of files is huge (say, tens of thousands...)
PSPad: lightweight. A good feature of find/replace in PSPad is that it organizes hits in different files in a tree hierarchy, which is very clear.
There are a number of tools that you can use to make things easier. Firstly, to search all the files in the project from vim you can use :grep like so:
:grep 'Function1' myproject/
This essentially runs a grep and lets you quickly jump from/to locations where it has been found.
Ctags is a tool that finds declarations in your code and then allows vim to jump to these declarations. To do this, run ctags and then place your cursor over a function call and then use Ctrl-]. Here is a link with some more ctags information:
http://www.davedevelopment.co.uk/2006/03/13/vim-ctags-and-php-5/
I don't know if it is an option for you, but if you load all your files into vim with
vim *.php
then you can
:set hidden
:argdo %s/foo/bar/g => will execute the substitute command in all opened buffers
:wall => will write all opened buffers
Or instead of loading all your files into vim try :help vimgrep and a combination of :help argdo and :help argadd
For Windows, I think that grepWin is hard to beat -- a GUI to a powerful and flexible grep tool for Windows. It searches, and replaces, knows about regular expressions, that sort of stuff.
Look into sed ... a powerful command line tool that should accomplish most of what you're looking for ... it supports regex, so your find/replace is quite easy.
(man sed)
Notepad++ has support for syntax highlighting in many languages and supports find and replace across all open files with regex and basic \n \r \t support.
The command grep -rn "search terms" * will search for the specified terms in all files (including those in sub-directories) and will return matching lines including file name and line number. Armed with this info, it is easy to jump to a particular file/line in VIM.
As was mentioned before, sed is extremely powerful for doing find-and-replace.
You can run both of these tools from inside VIM as well.
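Putting grep and sed together, the two use cases from the question ("just find" and "blindly replace") can be sketched as a pair of shell functions. The names find_in_tree and replace_in_tree are hypothetical. The sketch assumes a sed that supports in-place editing with a backup suffix (-i.bak, which both GNU and BSD sed accept), file names without newlines, and patterns that contain no slash, since the slash is used as the s/// delimiter.

```shell
# "Just find": print file:line:text for every match in files matching a glob.
# The extra /dev/null argument makes grep print the file name even for a single file.
find_in_tree() {   # usage: find_in_tree <dir> <glob> <regex>
    find "$1" -type f -name "$2" -exec grep -n -- "$3" /dev/null {} +
}

# "Blindly replace": rewrite every matching file in place, keeping a .bak copy.
replace_in_tree() {   # usage: replace_in_tree <dir> <glob> <old-regex> <new-text>
    find "$1" -type f -name "$2" -exec grep -l -- "$3" {} + |
    while IFS= read -r file; do
        sed -i.bak -e "s/$3/$4/g" "$file"
    done
}
```

For example, find_in_tree . '*.php' foo lists the hits, and replace_in_tree . '*.php' foo bar then rewrites them, leaving the originals next to each file with a .bak suffix.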
Some developers I currently work with swear by Textpad. It has a UI and also supports using regex's -- everything you're looking for and more.
A very useful search tool is ack. (Ubuntu refers to it as "ack-grep" in the repositories and man pages.)
The short version of what it does is a combination of find and grep that's more powerful and intelligent than that pair.

Can Mac OS X's Spotlight be configured to ignore certain file types? [closed]

I've got bunches of auxiliary files that are generated by code and LaTeX documents that I dearly wish would not be suggested by Spotlight as potential search candidates. I'm not looking for example.log, I'm looking for example.tex!
So can Spotlight be configured to ignore, say, all .log files?
(I know, I know; I should just use QuickSilver instead…)
@diciu That's an interesting answer. The problem in my case is this:
Figure out which importer handles your type of file
I'm not sure if my type of file is handled by any single importer? Since they've all got weird extensions (.aux, .glo, .out, whatever) I think it's improbable that there's an importer that's trying to index them. But because they're plain text they're being picked up as generic files. (Admittedly, I don't know much about Spotlight's indexing, so I might be completely wrong on this.)
@diciu again: TextImporterDontImportList sounds very promising; I'll head off and see if anything comes of it.
Like you say, it does seem like the whole UTI system doesn't really allow not searching for something.
@Raynet Making the files invisible is a good idea actually, albeit relatively tedious for me to set up in the general sense. If worst comes to worst, I might give that a shot (but probably after exhausting other options such as QuickSilver). (Oh, and SetFile requires the Developer Tools, but I'm guessing everyone here has them installed anyway :) )
@Will - these things that define types are called uniform type identifiers.
The problem is they are a combination of extensions (like .txt) and generic types (i.e. public.plain-text matches a txt file without the txt extension based purely on content) so it's not as simple as looking for an extension.
RichText.mdimporter is probably the importer that imports your text file.
This should be easily verified by running mdimport in debug mode on one of the files you don't want indexed:
cristi:~ diciu$ echo "All work and no play makes Jack a dull boy" > ~/input.txt
cristi:~ diciu$ mdimport -d 4 -n ~/input.txt 2>&1 | grep Imported
kMD2008-09-03 12:05:06.342 mdimport[1230:10b] Imported '/Users/diciu/input.txt' of type 'public.plain-text' with plugIn /System/Library/Spotlight/RichText.mdimporter.
The type that matches in my example is public.plain-text.
I've no idea how you actually write an extension-based exception for a UTI (like public.plain-text except anything ending in .log).
Later edit: I've also looked through the RichText mdimporter binary and found a promising string but I can't figure out if it's actually being used (as a preference name or whatever):
cristi:FoodBrowser diciu$ strings /System/Library/Spotlight/RichText.mdimporter/Contents/MacOS/RichText |grep Text
TextImporterDontImportList
Not sure how to do it on a file type level, but you can do it on a folder level:
Source: http://lists.apple.com/archives/spotlight-dev/2008/Jul/msg00007.html
Make spotlight ignore a folder
If you absolutely can't rename the folder because other software depends on it another technique is to go ahead and rename the directory to end in ".noindex", but then create a symlink in the same location pointing to the real location using the original name.
Most software is happy to use the symlink with the original name, but Spotlight ignores symlinks and will note the "real" name ends in *.noindex and will ignore that location.
Perhaps something like:
mv OriginalName OriginalName.noindex
ln -s OriginalName.noindex OriginalName
ls -l
lrwxr-xr-x 1 andy admin 24 Jan 9 2008 OriginalName -> OriginalName.noindex
drwxr-xr-x 11 andy admin 374 Jul 11 07:03 Original.noindex
Here's how it might work.
Note: this is not a very good solution, as a system update will overwrite any changes you make.
Get a list of all importers
cristi:~ diciu$ mdimport -L
2008-09-03 10:42:27.144 mdimport[727:10b] Paths: id(501) (
"/System/Library/Spotlight/Audio.mdimporter",
"/System/Library/Spotlight/Chat.mdimporter",
"/Developer/Applications/Xcode.app/Contents/Library/Spotlight/SourceCode.mdimporter",
Figure out which importer handles your type of file (example for the Audio importer):
cristi:~ diciu$ cat /System/Library/Spotlight/Audio.mdimporter/Contents/Info.plist
[..]
<key>CFBundleTypeRole</key>
<string>MDImporter</string>
<key>LSItemContentTypes</key>
<array>
<string>public.mp3</string>
<string>public.aifc-audio</string>
<string>public.aiff-audio</string>
Alter the importer's plist to delete the type you want to ignore.
Reimport the importer's types so the system picks up the change:
mdimport -r /System/Library/Spotlight/Chat.mdimporter
The only option is probably to have them not indexed by Spotlight, since for some reason you cannot do negative searches. You can search for files with a specific file extension, but you cannot exclude files that have one.
You could try making those files invisible to Finder; Spotlight won't index invisible files. The command for setting the kIsInvisible flag on files is:
SetFile -a V [filename(s)]
