How can I extract Stackoverflow post titles in Perl? - bash

I want to write a Bash script (using sed, grep, awk, etc.) to extract the question titles from https://stackoverflow.com/?tab=month.
For example:
Which is faster: while(1) or while(2)?
Replacing a 32-bit loop count variable with 64-bit introduces crazy performance deviations

Here's a small Mojo::UserAgent program that fetches the page, finds the right A tags with a selector, and extracts the text of those tags:
use v5.10;
use open qw(:std :utf8);
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $tx = $ua->get( 'https://stackoverflow.com/?tab=month' );
if ( my $err = $tx->error ) {
    die "Request failed: $err->{message}";
}
say $tx->res->dom('a.question-hyperlink')->map( 'text' )->join( "\n" );
The ojo module, which also comes with Mojolicious, has one-liner shortcuts for the command line:
perl -Mojo -E 'say g(shift)->dom("a.question-hyperlink")->map("text")->join("\n")' 'stackoverflow.com/?tab=month'
As the comments noted, instead of scraping the HTML, there's an XML version at https://stackoverflow.com/feeds/month. You could grab that and select things with XPath.
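Since the question asked for grep and sed, here is a rough sketch of pulling titles out of that feed with those tools. It assumes the feed is Atom-style XML in which each <title>...</title> element sits on a single line and the first <title> belongs to the feed itself; feed_titles is a hypothetical helper name. This is pattern matching, not real XML parsing, so entities such as &amp; are left undecoded and a proper XPath tool remains the more robust choice.

```shell
# Hypothetical helper: print the text of each <title> element read from stdin.
# Assumption: every <title>...</title> pair is on one line (typical Atom output).
feed_titles() {
    grep -o '<title[^>]*>[^<]*</title>' |   # one match per output line
        sed -e 's/<[^>]*>//g' |             # strip the opening and closing tags
        tail -n +2                          # drop the feed's own title (assumed first)
}
```

It could be fed from the network with something like: curl -s 'https://stackoverflow.com/feeds/month' | feed_titles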

Related

Sed or Perl: One file with regex instructions, one instruction per line, executed on another file

I'm setting up a regex learning environment purely in bash/tmux with a pane for the file containing a regex, a pane for a text-file-for-processing, and a pane for the bash shell. I'm at the start of "The Bastards Book of Ruby"-regex chapter.
The 'Bastards Book' shows an example of a negative-lookahead regex (perfect, let's learn), where perl is recommended over sed. As I'm going for a CLI approach, the Bash command is: $ perl -p file_with_regex.pl test.txt
(This prints the lines from test.txt with the intended substitutions)
Question: How would I add a second regex (on a new line) of the regex.pl file, and have perl execute both the first and (next) this second instruction for processing the text file?
# regex.pl
s/^(?!Mr)/Ms./g
s/Ms./Mrs./g
(Adding the second regex results in "Execution of regex.pl aborted due to compilation errors.")
The overall aim here is to progress in Ruby, while testing Regular Expressions as concisely as possible. Picking up a bare minimum of sed/perl while doing so would be a plus, as a proper dive into perl would take time from Ruby (and when it's time for the perl dive, I'll have had some time with the basics). The more I look at this the more it seems necessary to just do it in Ruby, if there isn't a perl switch that would enable a command-line-with-files approach.
The basic answer is that you need a semicolon after each line.
Paraphrased from perlrun, -p reads all lines of input, runs the commands you specified, and then prints out the value in $_ (the implicit variable you're running your substitute commands on in this script).
So, removing the magic, -p transformed your code into:
LINE:
while (<>) {
    # regex.pl
    s/^(?!Mr)/Ms./g
    s/Ms./Mrs./g
} continue {
    print or die "-p destination: $!\n";
}
Perl requires a semicolon between statements (only the semicolon after the last statement in a block is optional), hence the error.
I personally would recommend writing the whole script above into the file instead of using -p because it is far less magical, but you're welcome to do it either way.
If you were going to write the whole script, I would recommend something more like the following:
use strict;
use warnings;
while ( my $line = <ARGV> ) {
    $line =~ s/^(?!Mr)/Ms./g;
    print "After first subst: $line";
    $line =~ s/Ms./Mrs./g;
    print "After second subst: $line";
}
use strict and use warnings are the boilerplate you want at the top of any perl script (to catch typos and other common mistakes) and explicitly calling the variable $line gives you a better understanding of how the script is working ($_ is very magical for beginners and the source of many errors IMO, but great when you know what's what).
If you're wondering about <> vs. <ARGV>, they are the same thing and mean "Read through all the lines of the files provided as command-line arguments to this script, or standard input if no files are provided".
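For comparison with the "sed or Perl" part of the question: sed reads a rules file with one instruction per line via -f, and needs no semicolons. The sketch below is an assumption-laden translation, not the asker's setup: sed has no lookahead, so the "does not start with Mr" condition is expressed with a negated address (/^Mr/!) instead, and the file names and the apply_rules helper are made up for illustration.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# rules.sed: one sed instruction per line, applied in order to every input line.
cat > "$workdir/rules.sed" <<'EOF'
/^Mr/!s/^/Ms./
s/Ms\./Mrs./g
EOF

# Hypothetical helper: apply_rules RULES_FILE TEXT_FILE
apply_rules() {
    sed -f "$1" "$2"
}
```

Running apply_rules "$workdir/rules.sed" test.txt prints the transformed lines, much as perl -p does once the semicolons are in place.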

modifying shell stdout in real time

Ok so bear with me as I am not a professional, this is a proof of concept project to learn more about my shell, programming and just basic bash scripting.
So WHAT I WANT TO DO is: whenever anything is printed out in my terminal, be it the result of a command or an error message from the shell I want to apply some "filters" to what is being displayed so for example if I input "ls -a" in the terminal I would like to get the list of folders that the command returns but apply a TIME DELAY to the characters so that it seems like the list is being typed in real time.
More SPECIFICALLY I'd like for the script to take every alphanumerical character in STDOUT and spend a specific amount of time (say 100 milliseconds) iterating through random characters (these can be accessed randomly from a list) before finally stopping at the original value of the character.
WHAT I KNOW:
not much, I am new to programming in general so also the bash language but I can read some code and browsing through I found this http://brettterpstra.com/2012/09/15/matrixish-a-bash-script-with-no-practical-application/ script that plays with tput. This shows me the visual effect I'd like to accomplish can be accomplished...now to make it happen orderly and individually for each character printed to STDOUT...that is what I can't figure out.
WHAT I THINK:
in my mind I know I could take the STDOUT and pipe it to a file in which through any language (let's say python!) I can do all kinds of string manipulation and then return the output to STDOUT but I'd like for the characters to be manipulated in realtime so if for example the code was
import random
import sys

cool_chars = "£ ア イ ウ エ オ カ キ ク ケ コ サ シ ス".split()
stdout = sys.stdin.read()  # placeholder: whatever module works to grab STDOUT from the shell as a string
for word in stdout.split(" "):
    for letter in word:
        n = 0
        while n < 10:
            # print the following iteration in real time in the shell, but how????
            print(random.choice(cool_chars))
            n += 1
        # finally stop at the correct character
        print(letter)
Anyway, I've read a little about curses and ncurses and how you can create new windows with whatever specified parameters, I wonder if it'd be just a matter of creating a terminal with the specified parameters with the curses libraries and then making a link so that each new terminal instance opens my modified curses shell or if I can just do a bash shell script or if it'd be easiest to use something like python. I know all of the above can be options but I'm looking for the simplest, not necessarily most resource efficient answer.
Any help, comments, pointers etc is appreciated.
This does not answer your question fully, but it does print any input as if it were being typed in real time:
perl -MTime::HiRes -F -ane '$|=1;$old=""; foreach $char(@F){Time::HiRes::sleep(0.1); print "\r${old}${char}"; $old.=$char}' /etc/hosts
Instead of a file, STDIN can be used:
echo -e "abc\ndef\nghi" | perl -MTime::HiRes -F -ane '$|=1;$old=""; foreach $char(@F){Time::HiRes::sleep(0.1); print "\r${old}${char}"; $old.=$char}'
We can make it shorter using the shell's sleep:
perl -F -ane '$|=1;$old=""; foreach $char(@F){`sleep 0.1`; print "\r${old}${char}"; $old.=$char}'
EDIT:
The script below should fully solve your problem:
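#!/usr/bin/perl
use strict;
use utf8;
binmode(STDOUT, ":utf8");

chomp(my $cols = `tput cols`);   # terminal width, without the trailing newline
$| = 1;                          # unbuffered output so each character shows up at once
my $cursor = "";

# Blank out the current terminal line and return to column 0.
sub reset_line {
    print "\r" . " " x $cols . "\r";
}

# Choose a random glyph to display after the text typed so far.
sub pick_cursor {
    my @c = split //, "£アイウエオカキクケコサシス";
    $cursor = $c[ int(rand(@c)) ];   # rand(@c) keeps the index inside the array
}

while (<>) {
    my $line = "";
    foreach my $char (split //) {
        `sleep 0.1`;
        reset_line();
        pick_cursor();
        if ( $char =~ /\s/ ) {       # whitespace (including newline): no cursor
            print "${line}${char}";
        } else {
            print "${line}${char}${cursor}";
        }
        $line .= $char;
    }
}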
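The question also wondered whether a plain shell script could do this. As a minimal sketch of only the typing delay (no random-character effect), the function below echoes its standard input one character at a time. The name typewriter and its delay argument are hypothetical, and fractional delays assume a sleep that accepts them, which POSIX does not guarantee.

```shell
# Hypothetical helper: typewriter [DELAY_SECONDS]
# Copies stdin to stdout one character at a time, pausing after each character.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
typewriter() (
    delay=${1:-0.1}
    while IFS= read -r line || [ -n "$line" ]; do
        rest=$line
        while [ -n "$rest" ]; do
            tail=${rest#?}                  # everything after the first character
            printf '%s' "${rest%"$tail"}"   # the first character itself
            rest=$tail
            sleep "$delay"
        done
        printf '\n'                         # end the line that was just typed out
    done
)
```

For example: ls -a | typewriter 0.05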

search&replace on huge txt files

I need a text processing tool that can perform search and replace operations PER LINE on HUGE TEXT FILES (>0.5 GB). It can be either Windows or Linux based. (I don't know if there is anything like a streamreader/writer in Linux, but I have a feeling that it would be the ideal solution. The editors I have tried so far load the whole file into memory.)
Bonus question: a tool that can MERGE two huge texts on a per line basis, separated with e.g. tabs
Sounds like you want sed. For example,
sed 's/foo/bar/' < big-input-file > big-output-file
should replace the first occurrence of foo by bar in each line of big-input-file, writing the results to big-output-file.
Bonus answer: I just learned about paste, which seems to be exactly what you want for your bonus question.
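A short sketch of the paste approach for the bonus question, with merge_lines as a hypothetical wrapper name. paste reads both files line by line and joins line N of the first to line N of the second with a tab, so neither file has to fit in memory.

```shell
# Hypothetical helper: merge_lines FILE_A FILE_B
# Prints "lineA<TAB>lineB" for each pair of corresponding lines.
merge_lines() {
    paste "$1" "$2"
}
```

Typical use would be: merge_lines fileA fileB > mergedFile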
'sed' is built into Linux/Unix, and is available for Windows. I believe that it only loads a buffer at a time (not the whole file) -- you might try that.
What would you be trying to do with the merge -- interleaved in some way, rather than just concatenating?
Add: interleave.pl
use strict;
use warnings;
my $B;
open INA, $ARGV[0];
open INB, $ARGV[1];
while (<INA>) {
print $_;
$B = <INB>;
print $B;
}
close INA;
close INB;
run: perl interleave.pl fileA fileB > mergedFile
Note that this is a very bare-bones utility. It does not check if the files exist, and it expects that the files have the same number of lines.
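If a shell tool is preferred, the same interleaving can probably be had from paste by using a newline as the delimiter. This is a sketch with a hypothetical function name; unlike the Perl script, paste emits an empty line in place of the shorter file's missing lines when the line counts differ.

```shell
# Hypothetical helper: interleave FILE_A FILE_B
# Prints line 1 of A, line 1 of B, line 2 of A, line 2 of B, and so on.
interleave() {
    paste -d '\n' "$1" "$2"
}
```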
I would use perl for this. It makes it easy to read a file line by line, has great search/replace support through regular expressions, and lets you merge as well, since a perl script can read both files.

In bash, remove punctuation between pattern matches?

I am struggling with a conversion of a data file to csv when there is punctuation in the title field.
I have a bash script that obtains the file and processes it, and it almost works. What gets me is when there are commas in a free text title field, which then create extra fields.
I have tried some sed examples to replace between patterns but I have not gotten any of them to work. What I want to do is work between two patterns and replace commas with either nothing or perhaps a semicolon.
Taking this string:
name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,
Replacing with this:
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,
I should probably use "title:" and ",current_" to denote the start and end of the block where I want to make the change to avoid situations like this:
name:A100040,title:Re-title current periodicals, recent books,current_balance:50000,
So far I have not gotten the substitution to match. In this case I am using !! to make the change obvious:
teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
echo $teststring |sed '/title:/,/current_/s/,/!!/g'
name:A100040!!title:Oatmeal is better with raisins!! dates!! and sugar!!current_balance:50000!!
Any help appreciated.
This is one way which could undoubtedly be refined:
perl -ple 'm/(.*?)(title:.*?)(,current_balance:.*)/; $save = $part = $2; $part =~ s/,/!!/g; s/\Q$save\E/$part/'
First, using sed or awk to parse CSV is almost always the wrong thing to do, because they do not allow field delimiters to be quoted. That said, it seems like a better approach would be to quote the fields so that your output would be:
name:"A100040",title:"Oatmeal ... , dates, and sugar",current_balance:50000
Using sed you can try: (this is fragile)
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
If you insist on trying to parse the csv with "standard" tools and you consider perl to be standard, you could try:
perl -pe '1 while s/,([^,:]*),/ $1,/g'
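For the sed attempt in the question, a range address like /title:/,/current_/ selects whole lines, not a span within one line, which is why every comma was replaced. One possible sed-only sketch is a loop that deletes a single comma between "title:" and ",current_" per pass until none is left. It assumes each record is on one line and that ",current_" really marks the end of the title; strip_title_commas is a hypothetical name.

```shell
# Hypothetical helper: remove commas that fall between "title:" and ",current_".
# Each pass deletes the first such comma; "t again" repeats while a substitution
# was made, so the loop ends once only the comma before current_ remains.
strip_title_commas() {
    sed -e ':again' \
        -e 's/\(title:[^,]*\),\(.*,current_\)/\1\2/' \
        -e 't again'
}
```

Usage would look like: echo "$teststring" | strip_title_commas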

Why can't my Perl script find the file when I run it from Windows?

I have a Perl script which was built on a Linux platform using Perl 5.8. However, I am now trying to run the script from a Windows command prompt with the same Perl version.
I am using the command perl rgex.pl, but it gives me a whole chunk of errors for things that look to me like they are already handled in the script itself. The weird thing is that I am able to run another Perl script, consisting of simple functions such as print and input, without problems.
The Code:
#!/usr/bin/perl
use warnings;
use strict;
use Term::ANSIColor;
my $file = "C:\Documents and Settings\Desktop\logfiles.log";
open LOG, $file or die "The file $file has the error of:\n => $!";
my @lines = <LOG>;
close (LOG);
my $varchar = 0;
foreach my $line ( @lines ) {
    if ( $line =~ m/PLLog/ )
    {
        print("\n\n\n");
        my $coloredText = colored($varchar, 'bold underline red');
        print colored ("POS :: $coloredText\n\n", 'bold underline red');
        $varchar ++;
    }
    print( $line );
}
When I run on the windows command prompt it gives me errors such as:
Unrecognized escape \D passed through at rgex.pl line 7.
=> No such file or directory at rgex.pl line 8.
Please give some advice on the code. Thanks.
A \ in a double-quoted Perl string marks the beginning of an escape sequence, such as \n for newline or \t for tab. Since you want \ to be treated literally, you need to escape it as \\:
my $file = "C:\\Documents and Settings\\Desktop\\logfiles.log";
Since you are not interpolating any variables in the string it's better to use single quotes:
my $file = 'C:\Documents and Settings\Desktop\logfiles.log';
(Inside single quotes, \ is not special unless the next character is a backslash or single quote.)
These error messages are pretty clear. They tell you exactly which lines the problems are on (unlike some error messages, which tell you the line where Perl first thought "Hey, wait a minute!").
When you run into these sorts of problems, reduce the program to just the problematic lines and start working on them. Start with the first errors first, since they often cascade to the errors that you see later.
When you want to check the value that you get, print it to ensure it is what you think it is:
my $file = "C:\\D....";
print "file is [$file]\n";
This would have shown you very quickly that there was a problem with $file, and once you know where the problem is, you're most of the way to solving it.
This is just basic debugging technique.
Also, you're missing quite a bit of the basics, so going through a good Perl tutorial will help you immensely. There are several listed in perlfaq2 or perlbook. Many of the problems that you're having are things that Learning Perl deals with in the first couple of chapters.

Resources