remove latin-1 character from large text file in bash - utf-8

I have some large dataset plain text files (wikipedia articles) and I have to remove latin-1 characters like here:
kemer } şehir kır toplam }}
use specific terminology . for example , it is often more appropriate for people or things from ethiopia ( a country in africa ) to be described as ethiopian , not carelessly ( with the risk of stereotyping ) as african .
bat avg .
label ਕਾਲਜ
ਅਡੋਲਫ ਹਿਟਲਰ ਨੇ ਦੇਸ਼ ਵਿਚ ਕਮਿਊਨਿਸਟ ਪਾਰਟੀ ਬਣਾਉਣ ਦੀ ਇਜਾਜ਼ਤ ਦੇਣ ਤੋਂ ਨਾਂਹ ਕਰ ਦਿਤੀ।
alt }
if not extra_units then
utc_offset +
ਕਬਜਾ ( )
demographics _title regional
I want to get only the following:
ਕਾਲਜ
ਅਡੋਲਫ ਹਿਟਲਰ ਨੇ ਦੇਸ਼ ਵਿਚ ਕਮਿਊਨਿਸਟ ਪਾਰਟੀ ਬਣਾਉਣ ਦੀ ਇਜਾਜ਼ਤ ਦੇਣ ਤੋਂ ਨਾਂਹ ਕਰ ਦਿਤੀ।
ਕਬਜਾ
and eventually trim the whitespace-only lines, which is trivial.
The approach I used is the following:
<?php
$in = fopen('php://stdin','rb');
while($line = stream_get_line($in, 64000)) {
foreach(str_split($line) as $char) {
$ordChar = ord($char);
if($ordChar > 127 || $ordChar <= 31) {
echo $char;
}
}
}
used like cat wiki.hi.txt | php -d memory_limit=1024M escape_latin.php > wiki.hi.esc.txt
This approach works, but throughput gets worse as the file grows, as I can see by running watch du -h filename on the output file. I'm surprised, because I'm working on a local disk and I'm reading the input as a stream with stream_get_line.
I have tried the same approach in Python, but I get much the same performance with files of ~1GB.
see here for more details.
[UPDATE]
I'm reporting here some results from alternative approaches proposed
Using the regex approach, that seems to produce pretty much the same output file:
A ~50MB file
$ time tr -d "[:alnum:][:punct:]" < wiki.as.txt > wiki.as.test.txt
real 0m2.990s
user 0m2.818s
sys 0m0.088s
A ~100MB file
$ time tr -d "[:alnum:][:punct:]" < wiki.gu.txt > wiki.gu.test.txt
real 0m7.322s
user 0m6.772s
sys 0m0.282s
A ~600MB file
$ time tr -d "[:alnum:][:punct:]" < wiki.ta.txt > wiki.ta.test.txt
real 0m35.973s
user 0m33.498s
sys 0m1.254s
A ~1000MB (1GB) file
$ time tr -d "[:alnum:][:punct:]" < wiki.ja.1.txt > wiki.ja.1.test.txt
real 1m5.409s
user 1m0.669s
sys 0m2.068s

Try a regex.
If you're running it from a CLI, try something like
tr -d "[:alnum:][:punct:]" < wiki.hi.txt > wiki.hi.esc.txt
If you prefer to do the same in php -
<?php
$in = fopen('php://stdin','rb');
while($line = stream_get_line($in, 64000)) {
// POSIX classes such as [:alnum:] only work inside a bracket expression
echo preg_replace('/[[:alnum:][:punct:]]/', '', $line);
}
But please check these to make sure they are doing what you want - esp. the php, since I'm working without a test setup here. It's likely to have syntax issues and/or worse. With luck someone will edit it or offer a better solution, or at least comment and point out whatever I may have done wrong.
Hope it helps.
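For reference, the same deletion can be written with explicit byte ranges. This is only a minimal sketch, assuming the input is UTF-8 and that the characters to drop are the printable ASCII ones; the name strip_ascii is made up for illustration. In the C locale tr works on single bytes, and every byte of a multi-byte UTF-8 sequence is 0x80 or above, so deleting the range \041-\176 should leave the non-ASCII text untouched. The sed stage drops the lines that end up whitespace-only.

```bash
#!/bin/bash
# strip_ascii: hypothetical helper, reads stdin and writes stdout.
# 1) delete every printable ASCII byte except space (octal 041-176)
# 2) delete the lines that contain nothing but whitespace afterwards
strip_ascii() {
    LC_ALL=C tr -d '\041-\176' | LC_ALL=C sed '/^[[:space:]]*$/d'
}
```

It would be used the same way as the commands above, for example strip_ascii < wiki.hi.txt > wiki.hi.esc.txt.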


bash for script and input parameter

Can anyone help me fix my script? It does not work. There are three scripts:
1) pb.sh runs the delphicpp_release software on 1brs.ab.sh and writes the output to 1brs.ab.out.
2) 1brs.ab.sh holds the input parameters: a.sh (another script, for the protein structure), charmm.siz and charmm.crg (files for atom size and charge), and the remaining parameters needed to run delphicpp_release.
3) a.sh is meant to read several protein structures, which are all in the same directory.
my script_1 = pb.sh:
./delphicpp_release 1brs.ab.sh >1brs.ab.out
echo PB-Energy-AB = $(grep -oP '(?<=Energy> Corrected:).*' 1brs.ab.out) >>PB-energy.dat
cat PB-energy.dat
script_2 = 1brs.ab.sh:
in(pdb,file="a.sh")
in(siz,file="charmm.siz")
in(crg,file="charmm.crg")
perfil=70
scale=2.0
indi=4
exdi=80.0
prbrad=1.4
salt=0.15
bndcon=2
maxc=0.0001
linit=800
energy(s)
script_3 = a.sh:
for i in $(seq 90000 20 90040); do
$i.pdb
done
As we don't know what the software is, something like this (the heredoc delimiter is left unquoted so that $i is expanded):
for ((i=90000;i<=100000;i+=20)); do
./software << DATA_END > "1brs.$i.a.out"
scale=2.0
in(pdb,file="../$i.ab.pdb")
in(siz,file="charmm.siz")
in(crg,file="charmm.crg")
indi=z
exdi=x
prbrad=y
DATA_END
echo Energy-A = $(grep -oP '(?<=Energy>:).*' 1brs.$i.a.out) >>PB-energy.dat
done
A more POSIX-compliant version (the (( )) construct is a bash extension):
i=90000
while [ "$i" -le 100000 ]; do
...
i=$((i+20))
done
EDIT: Without heredoc
{
echo 'scale=2.0'
echo 'in(pdb,file="../'"$i"'.ab.pdb")'
echo 'in(siz,file="charmm.siz")'
echo 'in(crg,file="charmm.crg")'
echo 'indi=z'
echo 'exdi=x'
echo 'prbrad=y'
} > $i.ab.sh
./software <$i.ab.sh >$i.ab.out
But as the question was changed, I'm not sure I still understand it.
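One way to read the requirement is "one parameter file per structure". The sketch below follows that reading: write_params and run_all are hypothetical names, the parameter values are copied from 1brs.ab.sh in the question, and the way delphicpp_release is invoked is assumed from pb.sh rather than from its documentation.

```bash
#!/bin/bash
# write_params PDB_FILE: print a parameter file for one structure on stdout.
# The parameter values are the ones listed in 1brs.ab.sh.
write_params() {
    cat <<EOF
in(pdb,file="$1")
in(siz,file="charmm.siz")
in(crg,file="charmm.crg")
perfil=70
scale=2.0
indi=4
exdi=80.0
prbrad=1.4
salt=0.15
bndcon=2
maxc=0.0001
linit=800
energy(s)
EOF
}

# run_all: hypothetical driver, defined but not called here. It assumes
# ./delphicpp_release takes the parameter file as its only argument.
run_all() {
    local i
    for i in $(seq 90000 20 90040); do
        write_params "$i.pdb" > "$i.prm"
        ./delphicpp_release "$i.prm" > "$i.out"
    done
}
```

Generating the file from a function keeps the fixed parameters in one place, so only the structure name changes from run to run.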

Merge two json in bash (no jq)

I have two JSON files:
env.json
{
"environment":"INT"
}
roles.json
{
"run_list":[
"recipe[splunk-dj]",
"recipe[tideway]",
"recipe[AlertsSearch::newrelic]",
"recipe[AlertsSearch]"
]
}
The expected output should be something like this:
{
"environment":"INT",
"run_list":[
"recipe[splunk-dj]",
"recipe[tideway]",
"recipe[AlertsSearch::newrelic]",
"recipe[AlertsSearch]"
]
}
I need to merge these two JSON files (and others like them) into a single JSON file using only the built-in commands available in bash.
I only have sed, cat, echo, tail, and wc at my disposal.
Tell whoever put the constraint "bash only" on the project that bash is not sufficient for processing JSON, and get jq.
$ jq --slurp 'add' env.json roles.json
I couldn't use jq either: the client's web host jails the user on the command line with a limited set of binaries, as most discount/reseller web hosting companies do. Luckily they usually have PHP available, so you can use a one-liner like this, which is roughly what I would put in an install/setup bash script:
php -r '$json1 = "./env.json";$json2 = "./roles.json";$data = array_merge(json_decode(file_get_contents($json1), true),json_decode(file_get_contents($json2),true));echo json_encode($data, JSON_PRETTY_PRINT);'
For clarity, php -r accepts line feeds as well, so this also works:
php -r '
$json1 = "./env.json";
$json2 = "./roles.json";
$data = array_merge(json_decode(file_get_contents($json1), true), json_decode(file_get_contents($json2), true));
echo json_encode($data, JSON_PRETTY_PRINT);'
Output
{
"environment": "INT",
"run_list": [
"recipe[splunk-dj]",
"recipe[tideway]",
"recipe[AlertsSearch::newrelic]",
"recipe[AlertsSearch]"
]
}
A little bit hacky, but hopefully will do.
env_lines=`wc -l < $1`
env_output=`head -n $(($env_lines - 1)) $1`
roles_lines=`wc -l < $2`
roles_output=`tail -n $(($roles_lines - 1)) $2`
echo "$env_output" "," "$roles_output"
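The same idea can be sketched with sed and echo alone, which stays inside the tool list from the question. This is a text-level hack and not a JSON parser: it assumes both files are non-empty objects, that the opening brace sits alone on the first line, and that the closing brace sits alone on the last line. The function name merge_json is made up for illustration.

```bash
#!/bin/bash
# merge_json FILE1 FILE2: hypothetical helper, prints the merged object.
merge_json() {
    sed '$d' "$1"   # FILE1 without its last line (the closing brace)
    echo ','        # separator between the two member lists
    sed '1d' "$2"   # FILE2 without its first line (the opening brace)
}
```

Called as merge_json env.json roles.json, it prints the comma on a line of its own, which is still valid JSON. Duplicate keys are not detected.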

Conditional use of functions?

I created a bash script that parses ASCII files into a comma delimited output. It's worked great. Now, a new file layout for these files is being gradually introduced.
My script has now two parsing functions (one per layout) that I want to call depending on a specific marker that is present in the ASCII file header. The script is structured thusly:
#!/bin/bash
function parseNewfile() {...parse stuff...return stuff...}
function parseOldfile() {...parse stuff...return stuff...}
#loop thru ASCII files array
i=0
while [ $i -lt $len ]; do
#check if file contains marker for new layout
grep CSVHeaderBox output_$i.ASC
#calls parsing function based on exit code
if [ $? -eq 0 ]
then
CXD=`parseNewfile`
else
CXD=`parseOldfile`
fi
echo ${array[$i]}| awk -v cxd=`echo $CXD` ....
let i++
done>>${outdir}/outfile.csv
...
The script does not err out. It always calls the original function "parseOldfile" and ignores the new one. Even when I specifically feed my script with several files with the new layout.
What I am trying to do seems very trivial. What am I missing here?
EDIT: Samples of old and new file layouts.
1) OLD File Layout
F779250B
=====BOX INFORMATION=====
Model = R15-100
Man Date = 07/17/2002
BIST Version = 3.77
SW Version = 0x122D
SW Name = v1b1645
HW Version = 1.1
Receiver ID = 00089787556
=====DISK INFORMATION=====
....
2) NEW File Layout
F779250B
=====BOX INFORMATION=====
Model = HR22-100
Man Date = 07/17/2008
BIST Version = 7.55
SW Version = 0x066D
SW Name = v18m1fgu
HW Version = 2.3
Receiver ID = 028910170936
CSVHeaderBox:Platform,ManufactureDate,BISTVersion,SWVersion,SWName,HWRevision,RID
CSVValuesBox:HR22-100,20080717,7.55,0x66D,v18m1fgu,2.3,028910170936
=====DISK INFORMATION=====
....
This may not solve your problem, but a potential performance boost: instead of
grep CSVHeaderBox output_$i.ASC
#calls parsing function based on exit code
if [ $? -eq 0 ]
use
if grep -q CSVHeaderBox output_$i.ASC
grep -q will exit successfully on the first match, so it doesn't have to scan the whole file. Plus you don't have to bother with the $? var.
Don't do this:
awk -v cxd=`echo $CXD`
Do this:
awk -v cxd="$CXD"
I'm not sure if this solves the OP's requirement.
What's the need for awk if your function knows how to parse the file?
#!/bin/bash
function f1() {
echo "f1() says $@"
}
function f2() {
echo "f2() says $@"
}
FUN="f1"
${FUN} "foo"
FUN="f2"
${FUN} "bar"
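The two suggestions above (testing grep -q directly and calling a function through a variable) can be combined. The sketch below is only an illustration: the parser bodies are hypothetical stand-ins, since the real ones are elided in the question, and choose_parser and parse_file are made-up names.

```bash
#!/bin/bash
# Hypothetical stand-ins for the real parsers; each takes the file to parse.
parseNewfile() { sed -n 's/^CSVValuesBox://p' "$1"; }
parseOldfile() { sed -n 's/^Model = //p' "$1"; }

# choose_parser FILE: print the name of the parser that fits the layout.
choose_parser() {
    if grep -q 'CSVHeaderBox' "$1"; then
        echo parseNewfile
    else
        echo parseOldfile
    fi
}

# parse_file FILE: run whichever parser fits the file.
parse_file() {
    local parser
    parser=$(choose_parser "$1")
    "$parser" "$1"
}
```

In the loop from the question this would read CXD=$(parse_file "output_$i.ASC").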
I am a bit embarrassed to write this, but I solved my "problem".
After gedit (I am on Ubuntu) errored out several dozen times about "Trailing spaces", I copied and pasted my code into a new file and re-ran my script.
It worked.
I have no explanation why.
Thanks to everyone for taking the time.

C shell doing arithmetic operations with large numbers

First of all: sorry for using c shell, blame my company not me. I hate the damn thing as much as most of you do now (at first I was like, hey this ain't so bad).
I am trying to subtract large numbers obtained from time stamps. Here is what I am trying:
set curTime = `date +%s%N`
#... some stuff
@curTime = `date +%s%N` - $curTime #get the diff
echo "time taken: $curTime"
However I guess the numbers are too big - before I tried with just seconds and it worked fine. Here's what I see in the log:
@curMilli = 1349996279792995000 - 1349996279170458000
@curMilli: Command not found.
As I said I do the exact same thing with date +%s and it's fine, so I'm assuming it's something about the largeness of the numbers.
How can I do this? Thanks a lot.
The article http://en.wikipedia.org/wiki/Bc_programming_language has a short section "Using bc in shell scripts". A test:
set curTime = `/bin/date +%s%N`
/bin/sleep 2
set prevTime = $curTime
set curTime = `/bin/date +%s%N`
set diff = `echo "$curTime - $prevTime;" | /usr/bin/bc`
echo $diff
prints something like the following (the digits after the leading 20 vary from run to run):
2016204108
P.s: I wish I could vote you up twice for "I hate the damn thing as much as most of you do now (at first I was like, hey this ain't so bad)."
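As a worked check with the two timestamps from the question, here is the same bc subtraction as a small function. It is written in POSIX sh and not in csh, and the name ns_diff is made up for illustration; from csh the pipeline goes inside backticks exactly as in the test above. It assumes bc is installed.

```bash
#!/bin/sh
# ns_diff START END: print END - START.
# bc works with arbitrary precision, so 19-digit nanosecond
# timestamps are not a problem.
ns_diff() {
    echo "$2 - $1" | bc
}

ns_diff 1349996279170458000 1349996279792995000   # prints 622537000
```

The result is in nanoseconds, so the two timestamps in the question are about 0.62 seconds apart.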

bash time output processing

I know that time will send timing statistics output to stderr. But somehow I couldn't capture it either in a bash script or into a file via redirection:
time $cmd 1>/dev/null 2>file
$output=`cat file`
Or
$output=`time $cmd 1>/dev/null`
I'm only interested in the timing, not the direct output of the command. I've read some posts over here but still had no luck finding a viable solution. Any suggestions?
Thanks!
Try:
(time $cmd) 1>/dev/null 2>file
so that (time $cmd) is executed in a subshell environment and you can then redirect its output.
(Alternatively, use GNU time, /usr/bin/time, rather than the bash builtin.) (Thanks @Michael Krelin)
(Or invoke it as \time.) (Thanks @Sorpigal; if I ever knew this I'd entirely forgotten.)
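To get the timing into a variable and not a file, the same grouping trick works inside a command substitution. The sketch below is bash-specific (it relies on the time keyword and the TIMEFORMAT variable), and the function name time_real is made up for illustration.

```bash
#!/bin/bash
# time_real CMD [ARGS...]: run CMD, discard its output, and print only
# the elapsed wall-clock time in seconds on stdout.
time_real() {
    local TIMEFORMAT=%R   # %R = elapsed real time in seconds
    # The inner redirections silence CMD itself; the outer 2>&1 applies
    # to the brace group, which is where the time keyword reports.
    { time "$@" >/dev/null 2>&1; } 2>&1
}
```

A call such as output=$(time_real $cmd) then leaves just the number of seconds in output.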
How about using GNU time's -o and maybe -a command-line options:
-o FILE, --output=FILE
Do not send the results to stderr, but overwrite the specified file.
-a, --append
(Used together with -o.) Do not overwrite but append.
I had a similar issue where I wanted to bench optimizations. The idea was to run the program several times then output statistics on run durations.
I used the following command lines:
1st run: (time ./myprog)2>times.log
Next runs: (time ./myprog)2>>times.log
Note that my (bash?) built-in time outputs statistics in the form:
real 0m2.548s
user 0m7.341s
sys 0m0.007s
Then I ran the following Perl script to retrieve statistics:
#!/usr/bin/perl -w
open FH, './times.log' or die "ERROR: ", $!;
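my $useracc1 = 0; # sum of user times
my $useracc2 = 0; # sum of squared user times
my $usercpt = 0; # number of runs found
my $usermean = 0;
my $userdev = 0;
my $uservar = 0;
my $temp = 0;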
while(<FH>)
{
if("$_" =~ /user/)
{
if("$_" =~ /(\d+)m(\d{1,2})\.(\d{3})s/)
{
$usercpt++;
$temp = $1*60 + $2 + $3*0.001;
$useracc1 += $temp;
$useracc2 += $temp**2;
}
}
}
if($usercpt != 0)
{
$usermean = $useracc1 / $usercpt;
$userdev = sqrt($useracc2 / $usercpt - $usermean**2);
$usermean = int($usermean*1000)/1000;
$userdev = int($userdev*1000)/1000;
}
else
{
$usermean = "---";
$userdev = "---";
}
print "User: ", $usercpt, " runs, avg. ", $usermean, "s, std.dev. ", $userdev,"s\n";
Of course, the regular expressions may require adjustments depending on your time output format. The script can also be easily extended to include real and system statistics.
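For comparison, the same statistics can be computed with awk from the shell. This is a minimal sketch, assuming the log has the bash built-in format shown above (lines such as user 0m7.341s); the function name user_stats is made up for illustration.

```bash
#!/bin/bash
# user_stats LOGFILE: print run count, mean and standard deviation of
# the "user" times found in LOGFILE.
user_stats() {
    awk '
        $1 == "user" {
            split($2, t, /[ms]/)      # "0m7.341s" -> t[1]="0", t[2]="7.341"
            v = t[1] * 60 + t[2]      # minutes and seconds to seconds
            n++; sum += v; sumsq += v * v
        }
        END {
            if (n == 0) { print "User: 0 runs"; exit }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0      # guard against rounding noise
            printf "User: %d runs, avg. %.3fs, std.dev. %.3fs\n", n, mean, sqrt(var)
        }
    ' "$1"
}
```

Changing the "user" match to "real" or "sys" gives the other two statistics.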
