Replace each } with a }\n in a huge (12GB) file which consists of 1 line? - bash

I have a log file (from a customer). 18 Gigs. All contents of the file are in 1 line.
I want to read the file in logstash. But I get problems because of Memory. The file is read line by line but unfortunately it is all on 1 line.
I tried to split the file into lines so that logstash can process it (the file has a simple json format, no nested objects). I wanted to have each json object on one line, splitting at } by replacing it with }\n:
sed -i 's/}/}\n/g' NonPROD.log.backup
But sed is killed - I assume also because of memory. How can I resolve this? Can I let sed manipulate the file using other chunks of data than lines? I know by default sed reads line by line.

The following uses only functionality built into the shell:
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can run thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar, and there you are -- without needing to have double your original input file's space available on disk, as other answers thus far given require.

With GNU awk for RT:
$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi
with other awks:
$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi
I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:
awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m
and here's what I got:
1) awk (both awk scripts above had similar results):
time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m
Got expected output, timing =
real 0m0.608s
user 0m0.561s
sys 0m0.045s
2) shell loop:
$ cat tst.sh
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
$ time ./tst.sh < file1m
Got expected output, timing =
real 1m52.152s
user 1m18.233s
sys 0m32.604s
3) tr+sed:
$ time tr '}' '\n' < file1m | sed 's/$/}/'
Did not produce the expected output (Added an undesirable } at the end of the file), timing =
real 0m0.577s
user 0m0.468s
sys 0m0.078s
With a tweak to remove that final undesirable }:
$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'
real 0m0.718s
user 0m0.670s
sys 0m0.108s
4) fold+sed+tr:
$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'
Got expected output, timing =
real 0m0.811s
user 0m1.137s
sys 0m0.076s
5) split+sed+cat:
$ cat tst2.sh
mkdir tmp$$
pwd="$(pwd)"
cd "tmp$$"
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd"
rmdir tmp$$
$ time ./tst2.sh file1m
Got expected output, timing =
real 0m0.983s
user 0m0.685s
sys 0m0.167s

You can run it through tr, then put the end bracket back on at the end of each line:
$ cat NonPROD.log.backup | tr '}' '\n' | sed 's/$/}/' > tmp$$
$ wc -l NonPROD.log.backup tmp$$
0 NonPROD.log.backup
43 tmp10528
43 total
(My test file only had 43 brackets.)
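To check that nothing was lost, one option is to count the } characters in the original and compare that number with the line count of the result. The two should be equal or differ by one, depending on what follows the last }. A minimal sketch, where count_braces is a hypothetical helper name and not part of the answer above:

```shell
# count_braces FILE: print how many } characters FILE contains.
# (count_braces is a made-up name used only for this illustration.)
count_braces() {
    # tr -cd deletes every character except }, wc -c counts what is left;
    # the arithmetic expansion strips the padding some wc versions add
    n=$(tr -cd '}' < "$1" | wc -c)
    echo $(( n ))
}
```

Running count_braces NonPROD.log.backup next to wc -l tmp$$ gives the two numbers to compare.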

You could:
Split the file to say 1M chunks using split -b 1m file.log
Process all the files sed 's/}/}\n/g' x*
... and redirect the output of sed to combine them back to a single piece
The drawback of this is the doubled storage space.

another alternative with fold
$ fold -w 1000 long_line_file | sed 's/}/}\n\n/g' | tr -s '\n'

Related

Bash: Working with CSV file to build a loop and save the result

Using Bash, I'm wanting to get a list of email addresses from a CSV file to do a recursive grep search on it for a bunch of directories looking for a match in specific metadata XML files, and then also tallying up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
address#somewhere.com,21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
something#that.com,17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1
For example, if we put address#somewhere.com in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost how to do the looping and also the writing of the result back to the CSV file...
So for instance, with another#here.com if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
#!/bin/bash
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<identifier>SomeTitleNameGoesHere</identifier>
<mediatype>audio</mediatype>
<collection>opensource_movies</collection>
<description>example <br /></description>
<subject>testing</subject>
<title>Some Title Name Goes Here</title>
<uploader>another#here.com</uploader>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator]email#address.com[/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing some minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
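As a rough illustration of those two ingredients (assigning to a field, and reading a command's output with getline), here is a small sketch. The two-line CSV and the echo 7 command are stand-ins invented for the example, and demo_getline is a hypothetical name:

```shell
# demo_getline: show "cmd | getline var" together with field assignment.
# The CSV and the command are placeholders, not the real data or the real grep.
demo_getline() {
    printf 'name,count\nfoo,0\n' |
    awk -F, -v OFS=, 'NR > 1 {
        cmd = "echo 7"      # stand-in for the real grep pipeline
        cmd | getline n     # read the first output line of cmd into n
        close(cmd)          # close it so the same command can be run again
        $2 = n              # assigning to a field rebuilds the record using OFS
    } 1'
}
demo_getline
```

The header line is left alone because of the NR > 1 condition, and the trailing 1 prints every record, modified or not.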
The following script sets each tally to the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
cmd | getline c;
close(cmd);
$6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w option to your grep command.
-F searches literal strings; without it, searching for a.b#c would give false positives for things like axb#c and a-b#c.
-w matches only whole words; without it, searching for b#c would give a false positive for ab#c. This isn't 100% safe, as a-b#c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
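The effect of the two options can be seen on a tiny made-up file. This is only a sketch; grep_demo and the four sample lines are assumptions for illustration, not taken from the real metadata:

```shell
# grep_demo: count matching lines with and without -F / -w
# on four sample lines (hypothetical data).
grep_demo() {
    tmp=$(mktemp) || return 1
    printf '%s\n' 'a.b#c' 'axb#c' 'ab#c' 'b#c' > "$tmp"
    grep -c   'a.b#c' "$tmp"   # regex: the dot also matches x, so a.b#c and axb#c count
    grep -cF  'a.b#c' "$tmp"   # literal: only the line a.b#c counts
    grep -cF  'b#c'   "$tmp"   # substring: all four lines contain b#c
    grep -cFw 'b#c'   "$tmp"   # whole word: only a.b#c and b#c count
    rm -f "$tmp"
}
grep_demo
```

The last count still includes a.b#c because the dot in front of b is not a word character, which is the remaining false positive mentioned above.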
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
NR == FNR {
# store the filenames for each email
if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
next
}
FNR > 1 {$6 = length(tally[$1])}  # tally is the 6th column of accounts.csv
1
' - accounts.csv
Here is a solution using a single awk command to achieve this. This solution should perform much better than the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. Also, it does not invoke any external command or spawn a subshell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
tmp = s
gsub(/[&+.]/, "\\\\&", tmp)
return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
# save escaped email address in array em skipping header row
if (FNR > 1)
em[esc($1)] = 0
# save each row in rec array
rec[++n] = $0
next
}
# this block will execute for each XML file
{
# loop each email and save count of matched email in array em
# PS: gsub returns the number of substitutions
for (i in em)
em[i] += gsub(i, "&")
}
END {
# print header row
print rec[1]
# from 2nd row onwards split row into columns using comma
for (i=2; i<=n; ++i) {
split(rec[i], a, FS)
# 6th column is the count of occurrence from array em
print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
}
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and replaces the data in accounts.new.csv for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=${file_old/csv/new.csv}
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
# Skip first line
if [[ $x -gt 1 ]]; then
# Read data into variables
IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
# Reset tally
tally=$cnt
# Change line number $x in new file
sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
-i ${file_new}
fi
((x++))
done < ${file_old}
The input and output:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
2 address#somewhere.com
1 something#that.com
$ cat accounts.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,-1,blah
something#that.com,bar2,foo3,-1,blah
another#here.com,bar4,foo5,-1,blah
# output
$ ./test.sh
$ cat accounts.new.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,2,blah
something#that.com,bar2,foo3,1,blah
another#here.com,bar4,foo5,0,blah

How to search for a matching string in a file bottom-to-top without using tac? [closed]

I need to grep through a file, starting at the bottom of the file until I get to the first date that appears "2021-04-04", and then return that date. I don't want to start from the top and work my way down to the first line as there's thousands of lines in each file.
Example file contents:
random text on first line
random text on second line
2021-01-01
random text on fourth line
2021-02-03
random text on sixth line
2021-03-03
2021-04-04
Random text on ninth line
tac isn't available on MacOS so I can't use it.
"thousands of lines" are nothing, they'll be processed in the blink of an eye. Once you get into 10s of millions of lines THEN you could start thinking about a performance improvement if it became necessary.
All you need is:
awk '/[0-9]{4}(-[0-9]{2}){2}/{line=$0} END{if (line!="") print line}' file
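If only the date itself is wanted rather than the whole line, a variant along these lines may do. It is a sketch: last_date is a made-up name, and the digits are spelled out instead of using {4} on the assumption that some older awks lack interval support:

```shell
# last_date [FILE...]: print the last YYYY-MM-DD date found in the input.
# (last_date is a hypothetical helper name.)
last_date() {
    awk '
        # match() sets RSTART and RLENGTH to the position and length of the match
        match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) {
            date = substr($0, RSTART, RLENGTH)   # keep overwriting, so the last one wins
        }
        END { if (date != "") print date }
    ' "$@"
}
printf 'random text\n2021-01-01\n2021-04-04\nRandom text on ninth line\n' | last_date
```

Like the one-liner above, this reads the file from the top and simply remembers the most recent match.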
Here's the 3rd-run timing comparison for finding the last line containing 2 or more consecutive 5s in a 100000 line file generated by seq 100000 > file100k, i.e. where the target string is just 45 lines from the end of the input file, with and without tac:
$ time awk '/5{2}/{line=$0} END{if (line!="") print line}' file100k
99955
real 0m0.056s
user 0m0.031s
sys 0m0.000s
$ time tac file100k | awk '/5{2}/{print; exit}'
99955
real 0m0.056s
user 0m0.015s
sys 0m0.030s
As you can see, both ran in a fraction of a second and using tac did nothing to improve the speed of execution. Switching to tac+grep doesn't make it any faster either, it still just takes 1/20th of a second:
$ time tac file100k | grep -m1 '5\{2\}'
99955
real 0m0.057s
user 0m0.015s
sys 0m0.015s
In case you ever do need it in future, though, here's how to implement an efficient tac if you don't have it:
$ mytac() { cat -n "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
$ seq 5 | mytac
5
4
3
2
1
The above mytac() function just adds line numbers to the input, sorts those in reverse and then removes them again. If your cat doesn't have -n to add line numbers then you can use nl if you have it or awk -v OFS='\t' '{print NR, $0}' will always work.
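For completeness, the awk-numbered variant mentioned above might look like this. It is only a sketch, and mytac_awk is a name chosen here for illustration:

```shell
# mytac_awk [FILE...]: print lines in reverse order without tac or cat -n.
# (mytac_awk is a hypothetical name; it reads stdin when no file is given.)
mytac_awk() {
    # prefix each line with its number and a tab, sort numerically in
    # reverse on that first field, then cut the number off again
    awk -v OFS='\t' '{print NR, $0}' "$@" | sort -k1,1rn | cut -f2-
}
printf '1\n2\n3\n' | mytac_awk
```

cut uses the tab as its default delimiter, so tabs inside the original lines survive because -f2- keeps everything after the first one.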
Use tac:
#!/bin/bash
function process_file_backwords(){
tac $1 | while IFS= read line; do
# Grep for xxxx-xx-xx number matching
first_date=$(echo $line | grep '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | awk -F '"' '{ print $2}')
# Check if the variable is not empty, if yes break the loop
[ ! -z $first_date ] && echo $first_date && break
done
}
echo $(process_file_backwords $1)
Note: Make sure the file ends with a newline so that tac will not concatenate the last two lines.
Note: Remove the awk part if the file contains strings without ".
On MacOS
You can use tail -r which will do the same thing as tac but you may have to supply the number of lines you want tail to output from your file. Something like this should work:
tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
-r tells tail to output its last line first
-n takes a numeric argument telling how many lines tail should output
wc -l outputs the line count and filename of a given file
cut -d ' ' splits the above on the space character and -f 1 takes the first "field" which will be our line count
$ cat myfile.txt
foo
this is a date 2021-04-03
bar
this is another date 2021-04-04 for example
$ tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
grep options:
The -m 1 option will quit after the first result.
The -o option will return only the string matching the pattern (i.e. your date).
The -P option uses the perl regex engine, which is really down to preference, but I personally prefer the regex syntax (seems to use fewer backslashes \).
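Where a grep without -P is all that is available, the same pattern can be written as an extended regular expression with -E. A small sketch under that assumption (first_date is a made-up name, and the input is expected to arrive already reversed):

```shell
# first_date: print the date from the first input line that contains one.
# (first_date is a hypothetical helper; feed it the reversed file.)
first_date() {
    # -m 1 stops after the first matching line, -o prints only the match,
    # -E enables {n} repetition without backslashes
    grep -m 1 -o -E '[0-9]{4}-[0-9]{2}-[0-9]{2}'
}
printf 'this is another date 2021-04-04 for example\nbar\nthis is a date 2021-04-03\n' | first_date
```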
On Linux
You can use tac (cat in reverse) and pipe that into your grep. e.g.:
$ tac myfile.txt
this is another date 2021-04-04 for example
bar
this is a date 2021-04-03
foo
$ tac myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
You can use perl to reverse the lines and grep for the 1st match too.
perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'

Grep - Getting the character position in the line of each occurrence

According to the manual, the option -b can give the byte offset of a given occurrence, but it seems to start from the beginning of the parsed content.
I need to retrieve the position of each matching content returned by grep. I used this line, but it's quite ugly:
grep '<REGEXP>' | while read -r line ; do echo $line | grep -bo '<REGEXP>' ; done
How to get it done in a more elegant way, with a more efficient use of GNU utils?
Example:
$ echo "abcdefg abcdefg" > test.txt
$ grep 'efg' | while read -r line ; do echo $line | grep -bo 'efg' ; done < test.txt
4:efg
12:efg
(Indeed, this command line doesn't output the line number, but it's not difficult to add it.)
With any awk (GNU or otherwise) in any shell on any UNIX box:
$ awk -v re='efg' -v OFS=':' '{
end = 0
while( match(substr($0,end+1),re) ) {
print NR, end+=RSTART, substr($0,end,RLENGTH)
end+=RLENGTH-1
}
}' test.txt
1:5:efg
1:13:efg
All strings, fields, array indices in awk start at 1, not zero, hence the output not looking like yours since to awk your input string is:
123456789012345
abcdefg abcdefg
rather than:
012345678901234
abcdefg abcdefg
Feel free to change the code above to end+=RSTART-1, substr($0,end+1,RLENGTH) and end+=RLENGTH if you prefer 0-indexed strings.
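A 0-indexed variant could be sketched as follows. It is written with separate variables so that it does not rely on the order in which awk evaluates the arguments of print; match_offsets is a hypothetical name, and the regexp is assumed never to match the empty string:

```shell
# match_offsets REGEXP: read lines on stdin and print line:offset:match
# for every occurrence, with offsets counted from 0 as grep -b does.
# (match_offsets is a made-up name for this illustration.)
match_offsets() {
    awk -v re="$1" -v OFS=':' '{
        pos = 0                                  # characters already consumed
        while (match(substr($0, pos + 1), re)) {
            start = pos + RSTART - 1             # 0-based offset within the line
            print NR, start, substr($0, start + 1, RLENGTH)
            pos = start + RLENGTH                # continue right after this match
        }
    }'
}
printf 'abcdefg abcdefg\n' | match_offsets 'efg'
```

On the sample line this reports offsets 4 and 12, the same numbers the grep -bo loop in the question printed.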
Perl is not a GNU util, but can solve your problem nicely:
perl -nle 'print "$.:$-[0]" while /efg/g'

How to add all values in a certain column?

I want to add all the 3rd fields from each line and produce the result.
Below is the way I solved the problem
sum=0
grep '2016Feb' input.txt|awk -F\- '{print $3}'|while read LINE; do
sum = $(expr $sum + $LINE)
done
echo $sum
Is there a better way of solving the problem than my code? Possibly a command that solves the problem at the command line itself?
For a file like:
$ cat input.txt
Feb2016-2016-110
Feb2016-2016-20
Feb2016-2016-220
Feb2016-2016-140
Feb2016-2016-100
The output is: 590.
Just set the field separator to the dash and sum the third column:
$ awk -F- '{sum+=$3} END{print sum+0}' file
590
Printing sum+0 rather than sum makes awk output 0 in case there are no matching lines.
Since it looks like you are just counting those lines that contain the text "Feb2016", you can also add a filter:
awk -F- '/Feb2016/{sum+=$3} END{print sum+0}' file
# ^^^^^^^^^
# just on lines containing the string "Feb2016"
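If a loop in the shell is preferred over awk, the same sum can be sketched as below. The loop runs inside the function rather than at the end of a pipeline, so the total is still available after the loop ends. sum_third is a hypothetical name, and unlike the awk filter it only looks for Feb2016 in the first field:

```shell
# sum_third: read dash-separated lines on stdin and add up the third field
# of every line whose first field contains Feb2016.
# (sum_third is a made-up helper name.)
sum_third() {
    sum=0
    while IFS=- read -r first second third; do
        case $first in
            *Feb2016*) sum=$(( sum + ${third:-0} )) ;;   # missing field counts as 0
        esac
    done
    printf '%s\n' "$sum"
}
printf 'Feb2016-2016-110\nFeb2016-2016-20\nFeb2016-2016-220\nFeb2016-2016-140\nFeb2016-2016-100\n' | sum_third
```

It would be called as sum_third < input.txt.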
$ cat data
Feb2016-2016-110
Feb2016-2016-20
Feb2016-2016-220
Feb2016-2016-140
Feb2016-2016-100
$ cut -d - -f 3 data | paste -s -d '+' | bc
590
$

sed move text in .txt to next line

I am trying to parse out a text file that looks like the following:
EMPIRE,STATE,BLDG,CO,494202320000008,336,5,AVE,ENT,NEW,YORK,NY,10003,N,3/1/2012,TensionCode,VariableICAP,PFJICAP,Residential,%LBMPZone,L,9,146.0,,,10715.0956,,,--,,0,,,J,TripNumber,ServiceClass,PreviousAccountNumber,MinMonthlyDemand,TODCode,Profile,Tax,Muni,41,39,00000000000000,9952,54,Y,Non-Taxable,--,FromDate,ToDate,Use,Demand,BillAmt,12/29/2011,1/31/2012,4122520,6,936.00,$293,237.54
what I would like to see is the data stacked
- EMPIRE STATE BLDG CO
- 494202320000008
- 336 5 AVE ENT
- NEW YORK NY
and so on. If anything, after each comma I would want the text following to go to a new txt line. Ultimately, for the last part of the line, from FromDate onward, I would like to have it in a txt file like
- From Date ToDate use Demand BillAmt
- 12/29/2011 1/31/2012 4122520 6,936.00 $293,237.54.
I am using cygwin on a windows XP machine. Thank you in advance for any assistance.
For getting the last line into a separate file:
echo -e "From Date\tToDate\tuse\tDemand\tBillAmt" > lastlinefile.txt
cat originalfile.txt | sed 's/,FromDate/~Fromdate/' | awk -v FS="~" '{print $2}' | sed 's/Fromdate,ToDate,Use,Demand,BillAmt,//' | sed 's/,/\t/g' >> lastlinefile.txt
Note that the thousands separators in 6,936.00 and $293,237.54 are commas too, so those amounts also get split at the comma.
For the rest:
cat originalfile.txt | sed 's/,FromDate.*//' | sed 's/,/\n/g' | sed 's/$/\n\n/' > nocommas.txt
Your mileage may vary as far as the first '\n' is concerned in the second command. If it doesn't work properly, replace it with a space (assuming your data doesn't have spaces).
Or, if you like, a shell script to operate on a file and split it:
#!/bin/bash
if [ -z "$1" ]
then echo "Usage: $0 filename.txt"; exit; fi
echo -e "From Date\tToDate\tuse\tDemand\tBillAmt" > "$1_lastline.txt"
cat "$1" | sed 's/,FromDate/~Fromdate/' | awk -v FS="~" '{print $2}' | sed 's/Fromdate,ToDate,Use,Demand,BillAmt,//' | sed 's/,/\t/g' >> "$1_lastline.txt"
cat "$1" | sed 's/,FromDate.*//' | sed 's/,/\n/g' | sed 's/$/\n\n/' > "$1_fixed.txt"
Just paste it into a file and run it. It's been years since I used Cygwin... you may have to chmod +x file it first.
I'm providing you two answers depending on how you wanted the file. The previous answer split it into two files, this one keeps it all in one file in the format:
EMPIRE
STATE
BLDG
CO
494202320000008
336
5
AVE
ENT
NEW
YORK
NY
From Date ToDate use Demand BillAmt
12/29/2011 1/31/2012 4122520 6,936.00 $293,237.54.
That's the best I can do with the delimiters you have in place. If you'd left it as something like "EMPIRE STATE BUILDING CO,494202320000008,336 5 AVE ENT,NEW YORK,NY" it'd be a lot easier.
#!/bin/bash
if [ -z "$1" ]
then echo "Usage: $0 filename.txt"; exit; fi
cat "$1" | sed 's/,FromDate/~Fromdate/' | awk -v FS="~" '{gsub(",","\n",$1);print $1;print "FromDate\tToDate\tuse\tDemand\tBillAmt";gsub("Fromdate,ToDate,Use,Demand,BillAmt,","",$2);gsub(",","\t",$2);print $2}' >> "$1_fixed.txt"
again, just paste it into a file and run it from Cygwin: ./filename.sh
