sed optimization (large file modification based on smaller dataset) - algorithm

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large), with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1,500,000 lines, each of them e.g. 800 chars long. Each line is unique and contains exactly one identity number, and each identity number is unique.
The modifier file is e.g. 1800 lines long; each line contains an identity number plus an amount and a date that should be modified in the data file.
I just transformed (with Vim regexes) the modifier file into a sed script, but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 characters)id_number(some 300 characters)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know that even on a very fast server, if I do a
sed -i.bak -f modifier.sed data.file
it's very slow, because it has to check every pattern against every line.
Isn't there a better way?
Note: I'm not a programmer and never learned (in school) about algorithms.
I can use awk, sed, and an outdated version of perl on the server.

My suggested approaches (in order of desirability) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, perhaps any existing data will never move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, the difference being that instead of the index pointing to a record number, it now points to the absolute position in the file.
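For illustration only, here is a rough sketch of that fixed-record-length idea with an ID-to-offset index, written in Python for brevity (the same approach works in Perl): read the big file once to record where each ID lives, then seek straight to each record and overwrite only the bytes that change. The record length, field positions, file names, and tab-separated modifier format are all assumptions, and in-place patching only works if the replacement is exactly the same length as the field it overwrites.
RECORD_LEN = 801                    # assumed: 800 chars per record plus a newline
ID_OFFSET, ID_LEN = 500, 10         # assumed position and width of the identity number
FIELD_OFFSET, FIELD_LEN = 650, 20   # assumed position and width of the amount/date field

# Build the index once: ID -> absolute byte offset of the record in the big file.
index = {}
with open("data.file", "rb") as f:
    pos = 0
    for record in iter(lambda: f.read(RECORD_LEN), b""):
        index[record[ID_OFFSET:ID_OFFSET + ID_LEN]] = pos
        pos += RECORD_LEN

# Patch records in place: seek to the record, overwrite only the changed field.
with open("modifier.file", "rb") as mods, open("data.file", "r+b") as data:
    for line in mods:
        rec_id, new_field = line.rstrip(b"\n").split(b"\t", 1)   # assumed: "<id>\t<new field>"
        if rec_id in index and len(new_field) == FIELD_LEN:
            data.seek(index[rec_id] + FIELD_OFFSET)
            data.write(new_field)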

I suggest a program written in Perl (as I am not a sed/awk guru and I don't know exactly what they are capable of).
Your "algorithm" is simple: first of all, you need to build a hashmap that gives you the new data string to apply for each ID. This is achieved by reading the modifier file, of course.
Once this hashmap is populated, you can walk through each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru either, but I think the program is quite simple. If you need help writing it, just ask :-)

With perl you should use substr to get id_number, especially if id_number has constant width.
my $id_number=substr($str, 500, id_number_length);
After that, if $id_number is one you need to modify, use substr to replace the remaining text.
substr($str, -300,300, $new_text);
Perl's regular expressions are very fast, but not in this case.

My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude on this sort of task. Trust me, I have a lot of practical experience with this. You will not even have finished importing the data into the database by the time the Perl script is done.
When you write 1,500,000 lines of 800 chars, that is about 1.2 GB. With a very slow disk (30 MB/s) you will read it in about 40 seconds; with a better one, 50 MB/s -> 24 s, 100 MB/s -> 12 s, and so on. But Perl hash lookup (like a DB join) speed on a 2 GHz CPU is above 5M lookups/s. It means that your CPU-bound work will take seconds and your IO-bound work tens of seconds. If it is really 10 GB the numbers change, but the proportion stays the same.
You have not specified whether the modification changes the record size (i.e. whether it can be done in place), so we will not assume it and will work as a filter. You also have not specified the format of your "modifier file" or what sort of modification you need. Assume it is tab-separated, something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin and write to stdout, and the script can be something like this:
my $modifier_filename = 'modifier_file.txt';

open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
    chomp;
    my ($id, $position, $amount, $data) = split /\t/;
    $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# build the matching regexp (quotemeta protects regexp metacharacters in the ids)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;    # compile regexp

while (<>) {
    next unless m/$id_regexp/;
    next unless $modifications{$1};
    my ($position, $amount, $data) = @{ $modifications{$1} };
    substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
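Since the script works as a filter, you would run it with shell redirection, for example (the script name modify.pl is just illustrative): perl modify.pl < data.file > data.file.new, writing the result to a new file rather than back onto the input.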
On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup ids, and 1.2 GB of data. For 10 GB it should not take more than 5 minutes. Is that reasonably quick for you?
If you start to think you are not IO bound (for example if you use some NAS) but CPU bound, you can sacrifice some readability and change it to this:
my $mod;
while (<>) {
    next unless m/$id_regexp/;
    $mod = $modifications{$1};
    next unless $mod;
    substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }

You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications as suggested by yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:
1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
2. Adjust the record accordingly if they match
3. Write it out
4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
5. Goto 1.
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
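For illustration, here is a minimal sketch of that merge in Python (the same logic ports directly to Perl). It assumes both files have already been sorted by ID, that the modifier file looks like "<id><tab><new data>", and that the ID and the field to replace sit at fixed offsets in each data record; those offsets and the layout are assumptions, not the poster's actual format.
ID_OFFSET, ID_LEN = 500, 10   # assumed position and width of the ID in a data record
FIELD_OFFSET = 650            # assumed position of the amount/date field

def merge_update(data_path, mods_path, out_path):
    # Single pass over two ID-sorted files, following the numbered steps above.
    with open(data_path) as data, open(mods_path) as mods, open(out_path, "w") as out:
        mod = mods.readline()
        for record in data:
            rec_id = record[ID_OFFSET:ID_OFFSET + ID_LEN]
            # Step 4: discard modifier lines whose ID sorts before this record's ID.
            while mod and mod.split("\t", 1)[0] < rec_id:
                mod = mods.readline()
            # Steps 1-3: if the IDs match, splice in the new data, then write the record out.
            if mod and mod.split("\t", 1)[0] == rec_id:
                new_data = mod.rstrip("\n").split("\t", 1)[1]
                record = record[:FIELD_OFFSET] + new_data + record[FIELD_OFFSET + len(new_data):]
            out.write(record)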


Related

Most efficient data structure for a nested loop?

I am iterating through each line in the first file (3000 lines total) to find its corresponding label in the second file, line by line (which is ~2 million lines; 47 MB).
Currently, I have a nested loop structure with the outer loop grabbing a line (converting into a list) and the inner loop iterating through the 2 million lines (line by line):
for row in read_FIMO:  # read_FIMO is the first file; 3000 lines long
    with open("chr8labels.txt") as label:  # 2 million lines long
        for line in csv.reader(label, delimiter="\t"):  # list
            for i in range(int(row[3]), int(row[4])):
                if i in range((int(line[1])-50), int(line[1])):  # compare the ranges in each list
                    line1 = str(line)
                    row1 = str(row)
                    outF.append(row1+"\t"+line1)
- I realize this is horribly inefficient, but I need to find all instances where the first range overlaps with the ranges in the other file
- Is reading each file line by line the fastest way? If not, what would be the best data structure for the entire file?
- Should the lines be in a different data structure other than lists?
THANK YOU if you have any feedback!
Aside: the purpose is to label a range of numbers if the numbers are found in the ranges of the other file (long story; maybe not relevant?)
Your goal seems to be to find whether one range (row[3] to row[4]) overlaps with another (line[1]-50 to line[1]). For this, it is sufficient to check that either line[1]-50 or row[3] lies inside the other range. This eliminates the innermost nested loop.
Also, take the 2-million-line file and sort it once, and then use the sorted list inside your 3000-line loop to do a binary search, cutting the algorithm from O(n·m) to O(n log m).
(My Python is far from perfect, but this should get you going in the right direction.)
with open("...") as label:
reader = csv.reader(label, delimiter="\t")
lines = list(reader)
lines.sort(key=lambda line: int(line[1]))
for row in read_FIMO:
# Find the nearest, lesser value in lines smaller than row[3]
line = binary_search(lines, int(row[3]))
# If what you're after is multiple matches, then
# instead of getting a single line, get the smallest and
# largest indexes whose ranges overlap (which you can do with a
# binary search for row[3] and row[4]+50)
# e.g.,
# smallestIndex = binary_search(lines, int(row[3]))
# largestIndex = binary_search(lines, int(row[4])+50)
# for index in range(smallestIndex, largestIndex+1):
lower1 = int(line[1] - 50)
lower2 = int(row[3])
upper1 = int(line[1])
upper2 = int(row[4])
if (lower1 > lower2 and lower1 < upper1) or (lower2 > lower1 and lower2 < upper2):
line1=str(line)
row1=str(row)
outF.append(row1+"\t"+line1)
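The binary_search helper above is left undefined; a minimal sketch of it using Python's standard bisect module could look like the code below. The return convention (the nearest line whose key is less than or equal to the value) is an assumption chosen to match the single-match call above; the multi-match variant sketched in the comments would return indexes instead.
import bisect

def binary_search(lines, value):
    # lines must already be sorted by int(line[1]).
    # Building the key list on every call is wasteful; in practice you would
    # build it once, right after sorting, and reuse it.
    keys = [int(line[1]) for line in lines]
    i = bisect.bisect_right(keys, value)
    return lines[max(i - 1, 0)]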
I think an efficient way would be:
Go through the destination file first and record all the labels
in a dictionary hash: [LabelName] -> [Line number]
Go through each line of the source, look the label up in the
dictionary, and print it (or print something else if not found)
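A minimal sketch of that two-pass idea (the file names, the tab delimiter, and which column holds the label are assumptions for illustration):
import csv

# Pass 1: record each label and the line number where it appears in the destination file.
label_lines = {}
with open("chr8labels.txt") as dest:
    for lineno, fields in enumerate(csv.reader(dest, delimiter="\t")):
        label_lines[fields[0]] = lineno

# Pass 2: look each source label up in the dictionary.
with open("source.txt") as src:
    for fields in csv.reader(src, delimiter="\t"):
        label = fields[0]
        if label in label_lines:
            print(label, label_lines[label])
        else:
            print(label, "not found")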
Notes / Tips
I think the above would be O(n)
You could also go through the source file first to record the labels, and then go through the destination file. This would create a smaller dictionary (and use less memory), but the output may not be in the order you would like. (Smaller data sets often have better cache hit ratios, which is why I bring this up.)
Also, only because you mentioned most efficient, and if you want to get crazy, I would try skipping the line-by-line processing and just going through the input as if it were one big string: search for a newline character + whitespace + a label name. Whether this tip helps depends on what the input data looks like, though. If csv.reader already parses by line, then this would not be good. Also, you may need a lower-level language with more control to do this. (Caution: 90% chance this tip just leads you down a rabbit hole that goes nowhere.)

How can I save the first parts of a line in a list/array and then sort them depending on the second part?

I have a school project that gives me several lines of strings in a text file like this:
team1-team2:2-1
team3-team1:2-2
etc
It wants me to determine which team won (or drew) and then make a league table with them, awarding points for wins/draws.
This is my first time using bash. What I did was save the team1/team2 names in variables and then do the same for the goals. How should I make the table? I managed to make my script create a new file that stores all the team names (checking for duplicates), but I don't know how to continue. Should I make an array for each team and save their results in it? And then how do I implement the rankings, for example
team1 3p
team2 1p
etc.
I'm not asking for actual code, just a guide as to how I should implement it. Is making a new file the right move? Should I try making a new array with the teams instead? Or something else?
The problem can be divided into 3 parts:
Read the input data into memory in a format that can be manipulated easily.
Manipulate the data in memory
Output the results in the desired format.
When reading the data into memory, you might decide to read all the data in one go before manipulating it. Or you might decide to read the input data one line at a time and manipulate each line as it is read. When using shell scripting languages, like bash, the second option usually results in simpler code.
The most important decision to make here is how you want to structure the data in memory. You normally want to avoid duplication of data, and you usually want a data structure that is easy to transform into your desired output. In this case, the most logical data structure is an associative array, using the team name as the key.
Assuming that you have to use bash, here is a framework for you to build upon:
#!/bin/bash

declare -A results

while IFS=':-' read team1 team2 score1 score2; do
    if [ ${score1} -gt ${score2} ]; then
        ((results[${team1}]+=2))
    elif [ ...next test... ]; then
        ...
    else
        ...
    fi
done < scores.txt
# Now you have an associative array containing the points for each team.
# You can either output it as it stands, or sort it by piping through the
# 'sort' command.
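# For example (illustrative):
#   for team in "${!results[@]}"; do echo "$team ${results[$team]}"; done | sort -rnk2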
for key in "${!results[@]}"; do
    echo ...
done
I would use awk for this
AWK is an interpreted programming language (AWK stands for Aho, Weinberger, Kernighan) designed for text processing and typically used as a data extraction and reporting tool. AWK is used largely on Unix systems.
Using pure bash scripting is often messy for this kind of job.
Let me show you how easy it can be using awk
Input file : scores.txt
team1-team2:2-1
team3-team1:2-2
Code :
awk -F'[:-]' '                       # set field delimiters to ":" or "-"
{
    if ($3 > $4)      { teams[$1] += 3 }                  # first team gets 3 points
    else if ($3 < $4) { teams[$2] += 3 }                  # second team gets 3 points
    else              { teams[$1] += 1; teams[$2] += 1 }  # both teams get 1 point
}
END {                                # after scanning the input file
    for (team in teams) {
        print(team OFS teams[team])  # print total points per team
    }
}' scores.txt | sort -rnk 2 > ranking.txt  # sort by number of points
Output (ranking.txt):
team1 4
team3 1

Increment Serial Number using EXIF

I am using ExifTool to change the camera body serial number to a unique serial number for each image in a group of images numbering several hundred. The camera body serial number is being used as a second place to put the serial number, in addition to the IPTC field where the image's serial number lives, as it takes a little more effort to remove.
The serial number is in the format ###-###-####-#### where the last four digits are the number to increment. The first three groups of digits do not change for each batch I run. I only need to increment that last group of digits.
EXAMPLE
If I have 100 images in my first batch, they would be numbered:
811-010-5469-0001, 811-010-5469-0002, 811-010-5469-0003 ... 811-010-5469-0100
I can successfully drag a group of images onto my ExifTool Shortcut that has the values
exiftool(-SerialNumber='001-001-0001-0001')
and it will change the EXIF SerialNumber tag on the images, but I have not been successful in figuring out what to add to this to have it increment for each image.
I have tried variations on the below without success:
exiftool(-SerialNumber+=001-001-0001-0001)
exiftool(-SerialNumber+='001-001-0001-0001')
I realize most likely ExifTool is seeing these as numbers being subtracted in the first line and seeing the second line as a string. I have also tried:
exiftool(-SerialNumber+='1')
exiftool(-SerialNumber+=1)
just to see if I can even get it to increment with a basic, single digit number. This also has not worked.
Maybe this cannot be incremented this way and I need to use ExifTool from the command line. If so, I am learning the command line/PowerShell (Windows), but am still weak in this area and would appreciate some pointers to get started there if this is the route I need to take. I am not afraid to use the command line, I would just need a bit more hand-holding than normal for a starting point. I am also learning Linux and could do this project from there, but again, I would just need a bit more hand-holding to get it done.
I do program in PHP, JavaScript and other languages, so code is not foreign to me; I just lack experience writing it for the command line.
If further clarification is needed, please let me know in the comments.
Your help and guidance is appreciated!
You'll probably have to go to the command line rather than rely upon drag and drop, as this command relies upon ExifTool's advanced formatting feature.
Exiftool "-SerialNumber<001-001-0001-${filesequence;$_=sprintf('%04d', $_+1 )}" <FILE/DIR>
If you want to be more general purpose and to use the original serial number in the file, you could use
Exiftool "-SerialNumber<${SerialNumber}-${filesequence;$_=sprintf('%04d', $_+1 )}" <FILE/DIR>
This will just add the file count to the end of the current serial number in the image, though if you have images from multiple cameras in the same directory, that could get messy.
As for using the command line, you just need to rename the ExifTool executable to remove the commands in the parentheses and then either move it to someplace in the command line's path or use the full path to ExifTool.
As for clarification on your previous attempts, the += option is used with numbers and with lists. The SerialNumber tag is usually a string, though that could depend upon where it's being written to.
If I understand your question correctly, something like this should work:
1..100 | % {
    $sn = '811-010-5469-{0:D4}' -f $_
    # apply $sn
}
or like this (if you iterate over files):
$i = 1
Get-ChildItem 'C:\some\folder' -File | % {
    $sn = '811-010-5469-{0:D4}' -f $i
    # update EXIF data of current file with $sn
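    # One possible (hypothetical) way: shell out to ExifTool with its -TAG=VALUE
    # syntax, assuming exiftool is on the PATH, e.g.
    #   & exiftool "-SerialNumber=$sn" $_.FullName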
    $i++
}

Logic to compare rows in pig

I need logic for the below scenario, which needs to be implemented using Pig scripts. Can anyone please help by providing some ideas on how to do this?
The input contains a column groupName with some values like others and unknown. These values need to be replaced with the value from the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, unknown in record 126 is to be replaced with sale0001 from record 125; others in 130 needs to be replaced with sale0002 from 129; and unknown/others in 132, 133 and 134 are to be replaced with casc0004 from 131.
--Edit--
I tried the LEAD function in Pig, but it only compares n rows at a time, which cannot solve this completely.
Here is another approach that works, but I am looking for a more optimized one:
- Cogroup the data set with itself (like Dataset and Dataset_self)
- Filter Dataset.id = Dataset_self.id or Dataset_self.groupname = 'others' or Dataset_self.groupname = 'unknown'
- Generate IdDiff like (Dataset_self.id - Dataset.id), CASE when id = id then (id, group) else (id_self, group)
- Foreach (group id) {
      ordered = order by id, diff, group;
      limited = ordered limit 1;
      generate limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD and LAG functions readily available which could be used to do exactly what you need.

Most efficient way to parse a file in Lua

I'm trying to figure out the most efficient way to parse data from a file using Lua. For example, let's say I have a file (example.txt) with something like this in it:
0, Data
74, Instance
4294967295, User
255, Time
If I only want the numbers before the ",", I can think of a few ways to get the information. I'd start out by opening the file with f = io.open("example.txt") and then use a for loop to parse each line of f. This leads to the heart of my question: what is the most efficient way to do this?
In the for loop I could use any of these methods to get the # before the comma:
line:find(regex)
line:gmatch(regex)
line:match(regex)
or Lua's split function
Has anyone run speed tests for these or other methods that they could point to as the fastest way to parse? Bonus points if you can speak to speeds for parsing small vs. large files.
You probably want to use line:match("%d+").
line:find would work as well but returns more than you want.
line:gmatch is not what you need because it is meant to match several items in a string, not just one, and is meant to be used in a loop.
As for speed, you'll have to make your own measurements. Start with the simple code below:
for line in io.lines("example.txt") do
    local x = line:match("%d+")
    if x ~= nil then print(x) end
end
