Linux sed command that generates a new file on every regex match - bash

I have the following Linux command which I am using to extract data from one very large log file.
sed -n "/<trade>/,/<\/trade>/p" Large.log > output.xml
However, the output is generated in a single file output.xml. My intention is to create a new file every time the "/<trade>/,/<\/trade>/p" is matched. Every new file will be named after the <id> tag which is inside the <trade> </trade> tags.
Something likes this...
sed -n "/<trade>/,/<\/trade>/p" Large.log > "/<id>/,/<\/id>/p".xml
However, that, of course, does not work and I am not sure how to apply a regex as a naming rule.
P.S At this point, I am also not sure if I should use sed or maybe I should try achieving this with awk

Related

Adding quotes to variating characters in bash

I am trying to use the sed function in order to add double quotes for anything in between a matched pattern and a comma to break of the pattern. At the moment I am extracting the following data from cloudflare and I am trying to modify it to line protocol;
count=24043,clientIP=x.x.x.x,clientRequestPath=/abc/abs/abc.php
count=3935,clientIP=y.y.y.y,clientRequestPath=/abc/abc/abc/abc.html
count=3698,clientIP=z.z.z.z,clientRequestPath=/abc/abc/abc/abc.html
I have already converted to this format from JSON output with a bunch of sed functions to modify it, however, I am unable to get to the bottom of it to put the data for clientIP and clientRequestPath in inverted commas.
My expected output has to be;
count=24043,clientIP="x.x.x.x",clientRequestPath="/abc/abs/abc.php"
count=3935,clientIP="y.y.y.y",clientRequestPath="/abc/abc/abc/abc.html"
count=3698,clientIP="z.z.z.z",clientRequestPath="/abc/abc/abc/abc.html"
This data will be imported into InfluxDB, count will be a float whilst clientIP and clientRequestPath will be strings, hence why I need them to be in inverted commas as at the moment I am getting errors since they arent as they should be.
Is anyone available to provided to adequate 'sed' function to do is?
This might work for you (GNU sed):
sed -E 's/=([^0-9][^,]*)/="\1"/g' file
Enclose any string following a = does not begin with a integer upto a , in double quotes, globally.
here is a solution using a SED script to allow for multiple operations on a source file.
assuming your source data is in a file "from.dat"
create a sed script to run multiple commands
cat script.sed
s/clientIP=/clientIP=\"/
s/,clientRequestPath/\",clientRequestPath/
execute multiple-command sed script on data file redirecting the output file "to.dat"
sed -f script.sed from.dat > to.dat
cat to.dat (only showing one line)
count=24043,clientIP="x.x.x.x",clientRequestPath=/abc/abs/abc.php

grep -w -f is not returning all matches from list

I am trying to use a list that looks like this:
List file:
1mAF
2mAF
4mAF
7mAF
9mAF
10mAF
11mAF
13mAF
18mAF
27mAF
33mAF
36mAF
37mAF
38mAF
39mAF
40mAF
41mAF
45mAF
46mAF
47mAF
49mAF
57mAF
58mAF
60mAF
61mAF
62mAF
63mAF
64mAF
67mAF
82mAF
86mAF
87mAF
95mAF
96mAF
to grab out lines that contain a word-level match in a tab delimited file that looks like this:
File_of_interest:
11mAF-NODE_111-g7687-JEFF-tig00000037_arrow-g7396-AFLA_058530 11mAF cluster63
17mAF-NODE_343-g9350 17mAF cluster07
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
22mAF-NODE_36-g3735 22mAF cluster28
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
47mAF-NODE_30-g3242-JEFF-tig00000037_arrow-g7396-AFLA_058530 47mAF cluster21
67mAF-NODE_54-g4372 67mAF cluster06
69mAF-NODE_27-g2754 69mAF cluster39
71mAF-NODE_44-g4178 71mAF cluster25
73mAF-NODE_47-g4895 73mAF cluster57
78mAF-NODE_4-g688 78mAF cluster53
but when I do grep -w -f list file_of_interest these are the only ones I get:
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
and this misses a bunch of the values that are in the original list for example note that "67mAF" is in the list and in the file but it isn't returned.
I have tried removing everything after "mAF" in the list and trying again -- no change. I have rewritten the list in a completely new file to no avail. Oddly, I get more of them if I "sort" the list into a new file and then do the grep, but I still don't get all of them. I have also removed all invisible characters using sed (sed $'s/[^[:print:]\t]//g'). no change.
I am on OSX and both files were created on OSX, but normally grep -f -w works in the fashion i'm describing above.
I am completely flummoxed. Is I thought grep -w -f would look for all word-level matches of items in the file in the target file... am I wrong?
Thanks!
My guess is at least one of these files originates from a Windows machine and has CRLF line endings. file(1) might be used to tell you. If that is the case do:
fromdos FILE
or, alternatively:
dos2unix FILE

use sed with for loop to delete lines from text file

I am essentially trying to use sed to remove a few lines within a text document. To clean it up. But I'm not getting it right at all. Missing something and I have no idea what...
#!/bin/bash
items[0]='X-Received:'
items[1]='Path:'
items[2]='NNTP-Posting-Date:'
items[3]='Organization:'
items[4]='MIME-Version:'
items[5]='References:'
items[6]='In-Reply-To:'
items[7]='Message-ID:'
items[8]='Lines:'
items[9]='X-Trace:'
items[10]='X-Complaints-To:'
items[11]='X-DMCA-Complaints-To:'
items[12]='X-Abuse-and-DMCA-Info:'
items[13]='X-Postfilter:'
items[14]='Bytes:'
items[15]='X-Original-Bytes:'
items[16]='Content-Type:'
items[17]='Content-Transfer-Encoding:'
items[18]='Xref:'
for f in "${items[#]}"; do
sed '/${f}/d' "$1"
done
What I am thinking, incorrectly it seems, is that I can setup a for loop to check each item in the array that I want removed from the text file. But it's simply not working. Any idea. Sure this is basic and simple and yet I can't figure it out.
Thanks,
Marek
Much better to create a single sed script, rather than generate 19 small scripts in sequence.
Fortunately, generating a script by joining the array elements is moderately easy in Bash:
regex=$(printf '\|%s' "${items[#]}")
regex=${regex#'\|'}
sed "/^$regex/d" "$1"
(Notice also the addition of ^ to the final regex -- I assume you only want to match at beginning of line.)
Properly, you should not delete any lines from the message body, so the script should leave anything after the first empty line alone:
sed "1,/^\$/!b;/$regex/d" "$1"
Add -i if you want in-place editing of the target file.

Grep -f and only return the first match

I'm working with a large CSV that follows a basic process.
Backup the working original
Generate a skeleton CSV
Read from another CSV, format the contents, and then append it to the skeleton
Append the data from the backup to the new one.
The issue I'm running into is that when I read in the contents from the backup, I'm using grep -Ev -f with a file containing regexes to exclude undesired data from the backup to be included in the next revision. This currently presents a problem because grep appears to evaluate each regex in the file against every line from STDIN which will cause duplicates. The simple solution would be to simply pipe it through sort | uniq and call it a day but that will screw with the formatting of the csv currently in use. I can elaborate if needed but the short of it is I run a script to bulk process IP addresses but there is also manual editing of the file by other people and with the current form of the script the final output will be all of the automated content with manual entries being at the bottom of the file.
So, is there anyway without some ugly looping of grep to tell it to stop evaluating a line after a pattern is matched? Using -m 1 will stop grep after the first match in the whole stream where I need it stop after each new line.
For the task you want to accomplish. It would be best in my opinion to use AWK. You can find an excellent tutorial for AWK at : http://www.grymoire.com/Unix/Awk.html. You basically need to change the input field separator for awk with
awk -f',' foo.awk bar.dat
As far as the problem with sorting is concerned follow this : http://www.linuxquestions.org/questions/linux-general-1/how-to-use-awk-to-sort-243177/

Replace last line of XML file

Looking for help creating a script that will replace the last line of an XML file with a tag. I have a few hundred files so I'm looking for something that will process them in a loop. I've managed to rename the files sequentially like this:
posts1.xml
posts2.xml
posts3.xml
etc...
to make it easier to loop through. But I have no idea how to write a script to do this. I'm open to using either Linux or Windows (but i would guess that Linux is better for this kind of task).
So if you want to append a line to every file:
sed -i '$a<YOUR_SHINY_NEW_TAG>' *xml
To replace the last line:
sed -i '$s/.*/<YOUR_SHINY_NEW_TAG>/' *xml
But do note, sed is not the ideal tool to modify xml.
XMLStarlet is a command-line toolkit for performing XML parsing and manipulations. Note that as an XML-aware toolkit, it'll respect XML structure, character encoding and entity substitution.
Check out the ed command to see how to modify documents. You can wrap this in a standard bash loop.
e.g. in a doc consisting of a chain of <elem>s, you can add a following <added>5</added>:
mkdir new
for x in *.xml; do
xmlstarlet ed -a "//elem[count(//elem)]" -t elem -n added -v 5 $x > new/$x
done
Linux way using sed:
To edit the last line of the file in place, you can use sed:
sed -i '$s_pattern_replacement_' filename
To change the whole line to "replacement" use $s_.*_replacement_. Be sure to escape any _'s in replacement with a \.
To loop over files, just use for:
for f in /path/posts*.xml; do sed -i '$s_.*_replacement_' $f; done
This, however, is a dirty way as it's not aware of the XML structure, whereas the XML structure is not affected by newlines. You have to be sure the last line of the files contains exactly what you expect it to.
It makes little to no difference whether you're on Linux, Windows or MacOS
The question is what language do you want to use?
The following is an example in c# (not optimized, but read it as speudocode):
string rootDirectory = #"c:\myfiles";
var files = Directory.GetFiles(rootDirectory, "*.xml");
foreach (var file in files)
{
var lines = File.ReadAllLines(file);
lines[lines.Length - 1] = "whatever you want here";
File.WriteAllLines(file, lines);
}
You can compile this and run it on Windows, Linux, etc..
Or you could do the same in Python.
Of course this method does not actually parse the XML,
but you just wanted to replace the last line right?

Resources