I have about 300 photos whose file names are strings of letters and numbers. I'm planning to data merge these into an InDesign document and would like to use the numbers only, as these IDs relate to customers and their quotes.
I need to remove every letter, space, or symbol from the file names, leaving only the ID number. Is this possible?
Here's an example of the file names:
148132durrnt-photojosh.jpg, 173722dumellphotojosh.jpg, 173816mxwell.jpg, 176764very.jpg, 176876pyumo.jpg, 178054plnt.jpg, engll170774pijosh.jpg, entley166282pijosh.jpg, hodgkinson169226pijosh.jpg
So there's a mixture: some have the number at the start, some at the end, some in the middle, and a couple don't actually have a number at all, so those should be ignored.
I don't know of a way to do this other than manually, and wanted to try to save some time.
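For reference, this is the sort of thing a short script can do in one pass. A rough sketch in Python, assuming the images all sit in one folder (the folder name is a placeholder) and that whatever digits appear in a name are the quote ID; names with no digits are left alone:

import os
import re

folder = "photos"  # placeholder: the folder holding the ~300 images

for name in os.listdir(folder):
    stem, ext = os.path.splitext(name)
    digits = "".join(re.findall(r"\d+", stem))
    if not digits:
        continue  # no ID number in this name, so skip it
    new_name = digits + ext
    os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
    print(name, "->", new_name)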
I am at the end of my rope. I have been doing internet searches for tutorials and help on this simple task and after nearly a hundred different links visited, I am no closer to my answer and I'm about to have a breakdown.
All I need is to write a simple pseudocode function that takes an input file containing a list of book titles, each with its price directly next to it, and when a book title matches the user's input, it spits out the price.
Basically, all I need to know is how to get the second part of the line in the file, the price, assuming it is separated from the title by a space or comma. But no matter what search terms I feed Google, no one else on the planet seems to have ever had this requirement, and I'm going insane.
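For what it's worth, the core of it is just "split each line once, from the right, and compare the first piece". A minimal Python sketch, assuming each line is a title followed by a comma- or space-separated price (the file name and sample title are placeholders):

def find_price(path, wanted_title):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split once from the right, so spaces inside the title survive
            sep = "," if "," in line else " "
            title, _, price = line.rpartition(sep)
            if title.strip().lower() == wanted_title.strip().lower():
                return price.strip()
    return None  # title not found

print(find_price("books.txt", "The Hobbit"))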
I have two (large) files. The first one is about 200k lines, the second one about 30 million lines.
I want to check whether each line of the first one is in the second one, using Perl.
Is it faster to compare each line of the first directly against each line of the second, or is it better to store them both in arrays and then work with the arrays?
You have File A and File B. You want to check if lines in File A appear in File B.
If you have enough memory to hold the contents of File B in a hash using one entry per line, that's the simplest. Go ahead.
However, if you do not, I recommend you put both files in tables in an SQL database. SQLite might be enough to start. Then, your problem is reduced to a simple JOIN. If line length is an issue, use a fast hash such as xxHash. If implemented correctly, the 64-bit version is blazing fast on a 64-bit machine, especially if you enabled optimizations in your Perl. Store two columns, hash and the actual line. If hashes match, check if the lines match. Make sure to index on the hash column.
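As a rough illustration of that layout (a sketch using Python's sqlite3 module rather than Perl, with zlib.crc32 standing in for xxHash; the file names are placeholders):

import sqlite3
import zlib

def line_hash(line):
    # stand-in for xxHash: any fast hash works as a first-pass filter
    return zlib.crc32(line.encode("utf-8"))

def load(con, table, path):
    with open(path) as f:
        con.executemany("INSERT INTO %s VALUES (?, ?)" % table,
                        ((line_hash(l.rstrip("\n")), l.rstrip("\n")) for l in f))

con = sqlite3.connect("lines.db")
con.execute("CREATE TABLE a (h INTEGER, line TEXT)")
con.execute("CREATE TABLE b (h INTEGER, line TEXT)")
load(con, "a", "file_a.txt")
load(con, "b", "file_b.txt")
con.execute("CREATE INDEX b_h ON b (h)")  # index on the hash column

# hashes match first; the actual lines are compared to rule out collisions
for (line,) in con.execute(
        "SELECT a.line FROM a JOIN b ON a.h = b.h AND a.line = b.line"):
    print(line)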
You say:
In fact, my files are like this: File A is "name number" (per line) and File B is "name date location number" (per line). I have to check whether File B contains lines matching the data of File A (ignoring date and location, for example), so it's not an exact match ...
In that case, you are set. You do not even have to worry about the hash stuff (which I am leaving here for reference). Put the interesting bits of data on which you need to match into separate columns in an SQLite database. Write a join. ... Profit.
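Something along these lines, as a sketch (again Python's sqlite3 purely for illustration; it assumes the fields are whitespace-separated with no embedded spaces, and the file names are placeholders):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (name TEXT, number TEXT)")
con.execute("CREATE TABLE b (name TEXT, date TEXT, location TEXT, number TEXT)")

with open("file_a.txt") as f:   # "name number" per line
    con.executemany("INSERT INTO a VALUES (?, ?)", (l.split() for l in f))
with open("file_b.txt") as f:   # "name date location number" per line
    con.executemany("INSERT INTO b VALUES (?, ?, ?, ?)", (l.split() for l in f))

con.execute("CREATE INDEX b_key ON b (name, number)")

# lines of file A that appear in file B, ignoring B's date and location
for name, number in con.execute(
        "SELECT a.name, a.number FROM a JOIN b "
        "ON a.name = b.name AND a.number = b.number"):
    print(name, number)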
Alternatively, you could use BerkeleyDB, which gives you the conceptual simplicity of an in-memory hash while storing the table on disk. If you have multiple attributes to match on, though, this will not scale well.
Store the first file's lines in a hash, then iterate through the second file without storing it in memory.
It might be counterintuitive to store the first file and iterate the second file as opposed to vice-versa, but it allows you to avoid creating a 30 million element hash.
use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

# Hash the smaller first file: line contents => line number.
open my $fh1, '<', $path_1 or die "Cannot open $path_1: $!";
my %f1;
$f1{$_} = $. while (<$fh1>);

# Stream the larger second file and look each line up in the hash.
open my $fh2, '<', $path_2 or die "Cannot open $path_2: $!";
while (<$fh2>) {
    if (my $f1_line = $f1{$_}) {
        say "file 1 line $f1_line appears in file 2 line $.";
    }
}
Note that without further processing, the duplicated lines will be reported in the order they appear in the second file, not the first.
Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary.
I am trying to standardize file names in a directory that have some similarities, but are not always consistent. They are, however, standard enough.
Examples of file names (where the date is Month/Day/Year):
Weekly sales report 022213 LV.xls
Weekly sales report 091908 LV-F.xls
Weekly sales 072508.xls
Weekly U S sales V1.0 061308.xls
Weekly U.S. Sales Jan0606.xls
My current solution has been an effective but ugly find-and-replace for every possible string combination: x.gsub!(/^Weekly|sales|report|U S|U.S.|\s/,'')
However, I would assume there is a way to look at the file name string and grab the chunk that holds all of the date information: the chunk bounded by whitespace on the left and ending in at least 4 digits. Is there a straightforward way to accomplish this?
Your requirement as stated would suggest the following:
date_portion = x.match(/\s(\S*\d{4,8})/)[1]
That's: match one whitespace char, then capture zero-or-more non-whitespace, followed by 4 to 8 digits; return the captured text.
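To sanity-check it against the sample names, here is the same pattern run in Python (the behaviour should be identical to Ruby's for this expression):

import re

samples = [
    "Weekly sales report 022213 LV.xls",
    "Weekly U S sales V1.0 061308.xls",
    "Weekly U.S. Sales Jan0606.xls",
]
for name in samples:
    m = re.search(r"\s(\S*\d{4,8})", name)
    print(name, "->", m.group(1) if m else None)
# prints 022213, 061308 and Jan0606 respectively

In the Ruby version, match returns nil when a name has no date chunk at all, so guard the [1] lookup accordingly.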
You can structure data in various ways: for example, comma-separated or tab-separated.
But you can also structure data by position, so that, for example, the first 20 characters hold a phone number, the following 2 characters hold someone's age, and so on.
What would you call such a file in general?
If you had id[3]name[5]phone[6]
001Liz 882833
002Paul 892733
003John 927477
then this is a fixed-format (also called fixed-width) file.
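For reference, reading such a file is just a matter of slicing each line at fixed offsets. A small Python sketch for the id[3]name[5]phone[6] layout above (the data file name is a placeholder):

# column widths for the layout above: id is 3 chars, name 5, phone 6
FIELDS = [("id", 3), ("name", 5), ("phone", 6)]

def parse_fixed(line):
    record, pos = {}, 0
    for field, width in FIELDS:
        record[field] = line[pos:pos + width].strip()
        pos += width
    return record

with open("customers.dat") as f:
    for line in f:
        print(parse_fixed(line.rstrip("\n")))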
I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can you use to deal with this? I am using a trie to store the names.
The only solutions I could come up with were to say that two-word names cannot be used (either run them together or put a dash in the middle), or to create a list of these kinds of names and have a separate loop check the input. The other, and possibly best, idea I have is to redesign it slightly and have categories for first and last names, plus an extra name category. I was wondering whether there is a better solution out there.
Edit: I realized I wasn't very clear about what I was asking. My problem isn't adding two-word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names, so you can search by either. So if someone types in the first name and then a space, how would I tell whether they are typing the rest of the first name or have started typing the last name?
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
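To make the "just include the space in the trie" idea concrete, here is a minimal sketch (in Python rather than JavaScript, purely for illustration): the space is stored like any other character, so typing a first name followed by a space simply keeps walking the same branch. The branch-pruning heuristic described above could be layered on top of this.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_name = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name):
        node = self.root
        for ch in name.lower():  # spaces are inserted like any other character
            node = node.children.setdefault(ch, TrieNode())
        node.is_name = True

    def complete(self, prefix):
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        # collect every stored name below this node
        results, stack = [], [(node, prefix.lower())]
        while stack:
            node, text = stack.pop()
            if node.is_name:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return results

trie = Trie()
for name in ["Mary Ann", "Mary", "Marcus"]:  # sample names, one of them two words
    trie.insert(name)
print(trie.complete("mary "))  # -> ['mary ann']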
If you are using a multiline editor, I guess the best choice for autocomplete items would be single words. So the first name, middle name, and last name would each have to be parsed and added as a lookup item.
For a (one-line) textbox, you can include whitespace (and the firstname + space + middlename + space + lastname pattern) in the search criteria.