Search and replace on huge text files

I need a text processing tool that can perform search and replace operations PER LINE on HUGE TEXT FILES (>0.5 GB). It can be either Windows or Linux based. (I don't know if there is anything like a StreamReader/StreamWriter in Linux, but I have a feeling that it would be the ideal solution. The editors I have tried so far load the whole file into memory.)
Bonus question: a tool that can MERGE two huge texts on a per line basis, separated with e.g. tabs

Sounds like you want sed. For example,
sed 's/foo/bar/' < big-input-file > big-output-file
should replace the first occurrence of foo by bar in each line of big-input-file, writing the results to big-output-file.
Bonus answer: I just learned about paste, which seems to be exactly what you want for your bonus question.
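If sed is not at hand, the same streaming idea can be sketched in Python. This is only a minimal sketch, not a replacement for sed: the function name replace_per_line and its parameters are made up for illustration, and it assumes the pattern is a Python regular expression (whose syntax differs from sed's in places). The file is read one line at a time, so memory use stays small regardless of file size.

```python
import re


def replace_per_line(src_path, dst_path, pattern, replacement, first_only=True):
    """Stream src_path to dst_path, applying a regex substitution to each line.

    Hypothetical helper for illustration; first_only=True mimics sed's
    's/foo/bar/' (first match per line), False mimics 's/foo/bar/g'.
    """
    regex = re.compile(pattern)
    count = 1 if first_only else 0  # in re.sub, count=0 means "replace all"
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8", errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as dst:
        for line in src:  # the file object yields one line at a time
            dst.write(regex.sub(replacement, line, count=count))
```

Calling replace_per_line("big-input-file", "big-output-file", "foo", "bar") would then behave roughly like the sed command above.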

'sed' is built into Linux/Unix, and is available for Windows. I believe that it only loads a buffer at a time (not the whole file) -- you might try that.
What would you be trying to do with the merge -- interleaved in some way, rather than just concatenating?
Add: interleave.pl
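use strict;
use warnings;
# Interleave two files: one line from A, then the matching line from B.
open my $in_a, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open my $in_b, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while ( my $line_a = <$in_a> ) {
    print $line_a;
    my $line_b = <$in_b>;
    print $line_b if defined $line_b;    # B may be shorter than A
}
close $in_a;
close $in_b;
run: perl interleave.pl fileA fileB > mergedFile
Note that this is a very bare-bones utility. It stops at the end of fileA, so it expects that the files have the same number of lines; any extra lines in fileB are dropped.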
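For the tab-separated merge the bonus question asks about (rather than interleaving), a streaming version could look like the sketch below. The name merge_lines and its parameters are assumptions for illustration. When one file runs out first, the missing side is treated as an empty field, which is roughly how paste handles it.

```python
from itertools import zip_longest


def merge_lines(path_a, path_b, out_path, sep="\t"):
    """Join line N of path_a and line N of path_b with sep, one pair at a time.

    Hypothetical helper; neither input is loaded into memory as a whole.
    """
    with open(path_a, encoding="utf-8") as a, \
         open(path_b, encoding="utf-8") as b, \
         open(out_path, "w", encoding="utf-8") as out:
        # fillvalue="" supplies an empty field once the shorter file ends.
        for line_a, line_b in zip_longest(a, b, fillvalue=""):
            out.write(line_a.rstrip("\n") + sep + line_b.rstrip("\n") + "\n")
```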

I would use Perl for this. It is easy to read a file line by line, it has great search/replace available using regular expressions, and it will enable you to merge, since you can make your Perl script aware of both files.

Related

Split line from the end (cross-platform)

I have text that needs to be split; namely, put space after two characters from the end of the line. From "4.20GB" you need to get "4.20 GB". I know it can be done with sed, awk, etc., but I am looking for a light and more cross-platform method (for Linux/Unix/BSD).
Is it possible to do it with bash and its functions? For some reason, I thought printf could do it, but a quick check didn't yield anything positive.
You are looking for "more cross-platform method than sed" and then asking "Is it possible to do it with bash and its functions?"
It's safe bet to say that sed is installed (or "easily installable") on more computer architectures than bash so using sed should be more "cross-platform" than using bash.
If I understand you correctly, for each line that ends with some digits followed by GB, you need to add a space before GB. I wouldn't use the word "split", which suggests you want to split one line into two lines.
Try:
sed -i 's/GB$/ GB/' [filenames ...]
I think that sed is more "cross-platform" than bash, because wherever you have bash, you will easily have sed, as @fuxoft says in his answer.
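If the requirement really is "insert a space two characters from the end of the line", whatever those characters are, the operation is plain string slicing. A minimal Python sketch follows; the function name space_before_suffix and the suffix_len parameter are invented for illustration, and Python is just one option that is commonly available across Linux/Unix/BSD systems.

```python
def space_before_suffix(line, suffix_len=2):
    """Insert one space before the last suffix_len characters of a line.

    Hypothetical helper; a trailing newline, if present, is preserved.
    Lines too short to have anything before the suffix are returned unchanged.
    """
    body = line.rstrip("\n")
    if len(body) <= suffix_len:
        return line
    newline = line[len(body):]  # whatever rstrip removed
    return body[:-suffix_len] + " " + body[-suffix_len:] + newline
```

With this, space_before_suffix("4.20GB") gives "4.20 GB".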

Finding a newline in the csv file

I know there are a lot of questions about this (latest one here.), but almost all of them are how to join those broken lines into one from a csv file or remove them. I don't want to remove, but I just want to display/find that line (or probably the line number?)
Example data:
22224,across,some,text,0,,,4 etc
33448,more,text,1,,3,,,4 etc
abcde,text,number,444444,0,1,,,, etc
358890,more
,text,here,44,,,, etc
abcdefg,textds3,numberss,413,0,,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
After more searching, I know I shouldn't use bash to accomplish this, but rather should use Perl. I tried (from various websites; I don't know Perl), but apparently I don't have the Text::CSV package and I don't have permission to install it.
As I said, I have no idea how to even start looking for this, so I don't have any script. This is not a Windows file, it is very much a Unix file, so we can ignore the CR problem.
Desired output:
358890,more
,text,here,44,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
or
Line 4: 358890,more
,text,here,44,,,, etc
Line 7: 985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
Much appreciated.
You can use Perl to count the number of fields (commas), and append the next line until it reaches the correct number:
perl -ne 'if(tr/,/,/<28){$line=$.;while(tr/,/,/<28){$_.=<>}print "Line $line: $_\n"}' file
I do love Perl but I don't think it is the best tool for this job.
If you want a report of all lines that DO NOT have exactly the correct number of commas/delimiters, you could use the unix language awk.
For example, this command:
/usr/bin/awk -F , 'NF != 8' < csv_file.txt
will print all lines that DO NOT have exactly 7 commas (that is, 8 fields). The comma is specified as the field separator with -F, and the number of fields in the current line is available as NF.
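To get output closer to the "Line 4: ..." form the question asks for, the two ideas above (count the fields, pull in following lines until the count is reached) can be combined. The sketch below is in Python and rests on two assumptions: every complete record has the same number of fields, and no field contains a quoted comma. The function name find_broken_records is invented for illustration.

```python
def find_broken_records(lines, expected_fields):
    """Yield (line_number, text) for records split across physical lines.

    Hypothetical helper. line_number is the 1-based number of the first
    physical line of the record; text keeps the embedded line breaks.
    Naive comma counting: quoted commas are NOT handled.
    """
    lines = iter(lines)
    number = 0
    for line in lines:
        number += 1
        start = number
        parts = [line.rstrip("\n")]
        fields = parts[0].count(",") + 1
        while fields < expected_fields:
            nxt = next(lines, None)
            if nxt is None:  # ran out of input mid-record
                break
            number += 1
            nxt = nxt.rstrip("\n")
            parts.append(nxt)
            # Joining two lines merges one field, so only the commas add up.
            fields += nxt.count(",")
        if len(parts) > 1:
            yield start, "\n".join(parts)


def report(path, expected_fields):
    """Print each broken record prefixed with its starting line number."""
    with open(path, encoding="utf-8") as fh:
        for number, text in find_broken_records(fh, expected_fields):
            print(f"Line {number}: {text}")
```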

One-line program to delete files with few header lines

This is the next part of my earlier question perl one-liner to keep only desired lines. Here I have many *.fa files in a folder.
Suppose for three files: 1.fa, 2.fa, 3.fa
The contents of them are as follows:
1.fa
>djhnk_9
abfgdddcfdafaf
ygdugidg
>kjvk.80
jdsfkdbfdkfadf
>jnck_q2
fdgsdfjghsjhsfddf
>7ytiu98
ihdlfwdfjdlfl]ol
2.fa
>cj76
dkjfhkdjcfhdjk
>67q32
nscvsdkvklsflplsad
>kbvbk
cbjfdikjbfadkjfbka
3.fa
>1290.5
mnzmnvjbsdjb
The lines that start with a > are the headers and the rest are the feature lines.
I want to delete those files that have 3 or fewer header lines. Here, file 2.fa and file 3.fa should be deleted.
As I am working on a Windows system, preferably I use a one-line Perl script like:
for %%F in ("*.fa") do perl ...
Is there a one-line program for that?
Use a program. "One-liners" are inscrutable, non-portable, and very hard to debug.
This does as you ask. I hope it's clear that I have commented out the unlink call for testing purposes: it would be a pain to regenerate the *.fa files each time.
You will probably want to change '[0-9].fa' to just '*.fa'. I had other files in my own directory that I didn't want to be considered.
use strict;
use warnings 'all';
while ( my $file = glob '[0-9].fa' ) {
open my $fh, '<', $file or die qq{Cannot open "$file": $!};
my $headers = grep /^>/, <$fh>;
#unlink $file if $headers <= 3;
print qq{deleting "$file"\n} if $headers <= 3;
}
output
deleting "2.fa"
deleting "3.fa"
Next time, please try to write some code by yourself to solve the problem, and only after come ask for help. You will learn more if you do that, and we won't feel like you're just asking us to write your code.
The problem is very simple though, so here's a solution.
Note that this solution should be considered as a quick fix. Borodin suggested cleaner, easier to understand and more portable way to do this here.
I would suggest doing this with perl like this :
perl -nE "$count{$ARGV}++ if /^>/; END { unlink grep { $count{$_} <= 3 } keys %count }" *.fa
(For the record, I'm using double quotes " as the delimiter of the string since you are on Windows; if anyone wishes to use this on a Unix system, just change the double quotes " to single quotes '.)
Explanations:
-n surrounds the code with while(<>){...}, which reads the files one by one, line by line.
With $count{$ARGV}++ if /^>/ we count the number of headers in each file: $ARGV holds the name of the file being read, and /^>/ is true only if the line starts with >, i.e. it's a header line.
Finally (the END { .. } part), we delete (with the function unlink) the files that have 3 or fewer headers: keys %count gives all the file names that were counted, and grep { $count{$_} <= 3 } retains only those with 3 or fewer header lines, which are then passed to unlink.
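For comparison, here is the same count-then-delete logic as a short Python sketch with a dry-run switch, so the list of files can be inspected before anything is removed. The function names and the dry_run parameter are made up for illustration. In this sketch, a file with no header line at all counts as having zero headers and is therefore included.

```python
import glob
import os


def count_headers(path):
    """Count the lines in path that start with '>'."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if line.startswith(">"))


def delete_sparse_files(pattern="*.fa", max_headers=3, dry_run=True):
    """Delete files matching pattern that have max_headers or fewer headers.

    Hypothetical helper. With dry_run=True nothing is removed; the
    candidates are only printed. Returns the sorted list of candidates.
    """
    victims = sorted(p for p in glob.glob(pattern)
                     if count_headers(p) <= max_headers)
    for path in victims:
        if dry_run:
            print(f'would delete "{path}"')
        else:
            os.remove(path)
    return victims
```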

Append function output to top of created file

I have a script running on Linux which is simply multiple function calls whose output goes into a file. I have a function that produces an overview of important information, which I would like to append to the top of the file for easy viewing.
The problem is that I cannot simply call this overview function first because it is dependent on previous functions.
Is there an easier way to do this without creating a temp file? This is a fairly large file and that would take pretty long.
If you're using something like a Perl script, it should be possible to first leave some space at the top for the overview, then write all your data.
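After this, reopen the file for read/write access without truncating it (mode '+<' in Perl; a truncating mode such as 'w+' would discard the data you just wrote), move the filehandle position back to the start of the reserved space using the seek() or sysseek() functions, then write the overview data there. The overview has to fit within the space you reserved.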
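The reserve-then-seek idea is not specific to Perl. A minimal sketch in Python, assuming the overview is known to fit in a fixed number of bytes (the 256-byte default, the function names and the space padding are all assumptions for illustration):

```python
def write_body(path, body_lines, reserved=256):
    """Create the file with a blank placeholder line, then the real data."""
    with open(path, "wb") as fh:
        # Placeholder: reserved - 1 spaces plus a newline = reserved bytes.
        fh.write(b" " * (reserved - 1) + b"\n")
        for line in body_lines:
            fh.write(line.encode("utf-8"))


def fill_overview(path, overview, reserved=256):
    """Overwrite the placeholder in place; the rest of the file is untouched."""
    data = overview.encode("utf-8")
    if len(data) > reserved - 1:
        raise ValueError("overview does not fit in the reserved space")
    # "r+b" opens for reading and writing WITHOUT truncating the file.
    with open(path, "r+b") as fh:
        fh.seek(0)
        fh.write(data.ljust(reserved - 1) + b"\n")  # pad with spaces
```

The trade-off is that the overview is followed by padding spaces, and an overview larger than the reserved block cannot be written this way.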
Help on Perl functions can be obtained here: http://perldoc.perl.org/perlfunc.html
First of all, perhaps you meant prepend, not append.
In linux, suppose you want:
file1 -> BottomContent
file2 -> TopContent
You could use:
$ cat file2 file1 > finalfile && rm file[12]
@SandeepY has one reasonable solution.
Any time you modify a file in Unix, you're using a system that is opening your original file and a new file, so there's (almost always) a temporary file involved, whether you can see it or not.
That being said, another solution, as you specified that a function is providing some output, is to use a command group to "marshal" your output into one stream, and redirect that into your file.
mv mainFile mainFile.tmp
{
myFunc
cat mainFile.tmp
} > mainFile && /bin/rm mainFile.tmp
As you seem to need this regularly, it should be easy to turn this into a function, replacing mainFile with "$1".
IHTH

Replace last line of XML file

Looking for help creating a script that will replace the last line of an XML file with a tag. I have a few hundred files so I'm looking for something that will process them in a loop. I've managed to rename the files sequentially like this:
posts1.xml
posts2.xml
posts3.xml
etc...
to make it easier to loop through. But I have no idea how to write a script to do this. I'm open to using either Linux or Windows (but I would guess that Linux is better for this kind of task).
So if you want to append a line to every file:
sed -i '$a<YOUR_SHINY_NEW_TAG>' *xml
To replace the last line:
sed -i '$s/.*/<YOUR_SHINY_NEW_TAG>/' *xml
But do note, sed is not the ideal tool to modify xml.
XMLStarlet is a command-line toolkit for performing XML parsing and manipulations. Note that as an XML-aware toolkit, it'll respect XML structure, character encoding and entity substitution.
Check out the ed command to see how to modify documents. You can wrap this in a standard bash loop.
e.g. in a doc consisting of a chain of <elem>s, you can add a following <added>5</added>:
mkdir new
for x in *.xml; do
xmlstarlet ed -a "//elem[count(//elem)]" -t elem -n added -v 5 "$x" > "new/$x"
done
Linux way using sed:
To edit the last line of the file in place, you can use sed:
sed -i '$s_pattern_replacement_' filename
To change the whole line to "replacement" use $s_.*_replacement_. Be sure to escape any _'s in replacement with a \.
To loop over files, just use for:
for f in /path/posts*.xml; do sed -i '$s_.*_replacement_' "$f"; done
This, however, is a dirty way because it is not aware of the XML structure: XML does not care where the newlines fall, so the last line is not guaranteed to hold what you think it does. You have to be sure the last line of each file contains exactly what you expect it to.
It makes little to no difference whether you're on Linux, Windows or macOS.
The question is: what language do you want to use?
The following is an example in C# (not optimized, but read it as pseudocode):
string rootDirectory = @"c:\myfiles";
var files = Directory.GetFiles(rootDirectory, "*.xml");
foreach (var file in files)
{
var lines = File.ReadAllLines(file);
lines[lines.Length - 1] = "whatever you want here";
File.WriteAllLines(file, lines);
}
You can compile this and run it on Windows, Linux, etc.
Or you could do the same in Python.
Of course, this method does not actually parse the XML,
but you just wanted to replace the last line, right?
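A Python version of the same approach might look like the sketch below. Like the C# code, it reads each file fully and does not parse the XML. The function names, the glob pattern and the "</posts>" tag in the usage note are placeholders for illustration, not anything taken from the question.

```python
import glob


def replace_last_line(path, new_line):
    """Replace the last line of the file at path with new_line.

    Hypothetical helper; returns False (and leaves the file alone) if
    the file is empty.
    """
    with open(path, encoding="utf-8") as fh:
        lines = fh.readlines()
    if not lines:
        return False
    lines[-1] = new_line + "\n"
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(lines)
    return True


def replace_last_line_in_all(pattern, new_line):
    """Apply replace_last_line to every file matching pattern."""
    return [p for p in sorted(glob.glob(pattern))
            if replace_last_line(p, new_line)]
```

For example, replace_last_line_in_all("posts*.xml", "</posts>") would rewrite the last line of every matching file in the current directory; try it on copies first.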

Resources