I'm working on a bash script at the moment which extracts data from a text file called carslist.txt, with each car (and its corresponding characteristics) on a separate line. I've been able to extract and save data from the text file when it meets a single condition (example below), but I can't figure out how to do it for two conditions.
Single condition example:
grep 'Vauxhall' $CARFILE > output/Vauxhall_Cars.txt
output:
Vauxhall:Vectra:1999:White:2
Vauxhall:Corsa:1999:White:5
Vauxhall:Cavalier:1995:White:2
Vauxhall:Nova:1994:Black:8
From the examples above, how would I extract data if I wanted the conditions Vauxhall and White to be met before extracting them?
The grep example above requires Vauxhall to be matched before pulling and saving the data, but I have no idea how to do it for two conditions. I've tried piping the command as Vauxhall | White, but after that I was out of ideas.
Thanks in advance.
I would recommend using awk, like this:
awk -F: '$1=="Vauxhall" && $4=="White"' input.file
As I'm using : as the field separator, I simply need to check the values of fields 1 and 4.
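If you'd rather stay close to your original grep attempt, chaining two greps also works, although it matches the words anywhere on the line rather than in specific fields (the output filename here is just an example):
grep 'Vauxhall' "$CARFILE" | grep 'White' > output/Vauxhall_White_Cars.txt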
I have two huge comma-delimited files.
The 1st file has 280 million lines and the following columns
first name, last name, city, state, ID, email*, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
The 2nd file has 420 million lines and the following columns
first name, last name, city, state, email
John,Smith,LA,CA,johnsmith#hotmail.com
Bob,Marble,SF,CA,bobmarble#gmail.com
Tai,Nguyen,SD,CA,tainguyen#gmail.com
* a lot of these fields are empty
I want to merge all the lines from both files whose first 4 columns match, then fill in the missing emails in the first file with the emails from the second file; if the email in the first file is not blank, don't change it. The process should be case insensitive. If there are multiple instances with the same 4 values, just ignore those instances and work on unique instances only.
The result should have the following columns and look like this
first name, last name, city, state, ID, email, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,bobmarble#gmail.com,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
Lines should only be printed when all 4 columns match, not 1 or 2 or 3. My boss insists on using a Bash shell script for this and I am a newbie in Bash, so please help me with a clear explanation.
I have done my reading and understand that awk requires storing information in memory; however, I can split the big files into small files and use awk in that case. I copied some code online and changed it to my needs, but whenever it fills in a blank email, it also changes the field delimiter from comma to space. I want to stop that but don't know how. Please help me solve this problem. All advice and answers are highly appreciated.
awk -F "," 'NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt
The awk approach you showed is not suited for files that big, since it stores parts of the files in memory. With the same approach you would need to store one of the following:
280 million entries of the form first name, last name, city, state → ID, phone
420 million entries of the form first name, last name, city, state → email
Assume we go with the first option and each entry takes up only 50 bytes of memory. To store all 280 million entries we need 280M·50B = 14'000 MB = 14 GB. This is the absolute minimum of memory you need to run the awk command. In reality it would be even more due to implementation details of associative arrays.
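As an aside, the comma-to-space change you observed has a simple cause that is unrelated to the size problem: when awk assigns to a field (here, filling in $6), it rebuilds the record using its output field separator, which defaults to a space. Setting OFS to a comma as well keeps the delimiters intact, e.g.:
awk -F "," 'BEGIN{OFS=","} NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt
That fixes the formatting, but not the memory footprint.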
What you can do instead
Use the classical approach to the problem:
sort both files
join the files by their first four columns*
cut the desired columns from the joined result**
* needs some pre- and post-processing as join can only join one column.
** Since we have to re-arrange the email column, cut is not sufficient. We can use awk instead.
#! /bin/bash

prefixWithKey() {
    sed -E 's/([^,]*,){4}/\L&\E\t&/' "$1"
}

sortByKeyInPlace() {
    sort -t $'\t' -k1,1 -o "$1" "$1"
}

joinByKey() {
    join -t $'\t' "$1" "$2"
}

cutColumns() {
    awk 'BEGIN{FS="\t|,\t*"; OFS=","} {print $5,$6,$7,$8,$9,$16,$11}'
}

file1="your 1st input file.csv"
file2="your 2nd input file.csv"

for i in "$file1" "$file2"; do
    prefixWithKey "$i" > "$i.tmp"
    sortByKeyInPlace "$i.tmp"
done

joinByKey "$file1.tmp" "$file2.tmp" | cutColumns > result.csv
rm "$file1.tmp" "$file2.tmp"
This script assumes that the input files have no headers and contain no tabs. We always take the email field from the 2nd file, no matter whether the email field of the 1st file was defined or not.
I barely tested this script because you didn't provide any example input. If you encounter some errors and share a short input leading to that error I would be happy to fix the script (if it needs fixing).
In theory the script could be written without temporary files. I intentionally used temporary files because of the input size. Programs like sort may run faster on files.
This script could be sped up, for instance by:
Executing both calls to prefixWithKey in parallel.
Adding LC_ALL=C in front of commands like sort.
Adding options to sort, for instance -S 70%.
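For example, the preprocessing loop could be rewritten roughly like this (untested; it pipes prefixWithKey straight into a locale-independent sort instead of sorting in place, and -S 70% is only an example buffer size):
for i in "$file1" "$file2"; do
    prefixWithKey "$i" | LC_ALL=C sort -t $'\t' -k1,1 -S 70% -o "$i.tmp" &
done
wait
If both sorts really do run at the same time, you would of course lower -S so that the two buffers together still fit in memory.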
Further Alternatives
For files that big it could be faster to load them into a database and process them there. There is even the tool q for doing things like this in a single command, but in my experience it's very slow.
Would someone help me form a Bash script to keep only the unique lines, based solely on identifying duplicate values in a single field (the first field)?
If I have data like this:
123456,23423,Smith,John,Jacob,Main St.,,Houston,78003
654321,54524,Smith,Jenny,,Main St.,,Houston,78003
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
123456,324324,Bryant,Kobe,,Special St.,,New York,2311
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
654321,43234,Smith,Jimbo,,Main St.,,Houston,78003
And I'd like to keep only the lines whose first field does not appear on any other line
(the result would be a file with the contents below, based on the above sample):
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
What would the bash/awk approach be? Thanks in advance.
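One way to approach this (a sketch only, with data.csv standing in for the real file): have awk read the file twice, counting first-field occurrences on the first pass and printing, on the second pass, only the lines whose first field occurred exactly once.
awk -F, 'NR==FNR{count[$1]++; next} count[$1]==1' data.csv data.csv
On the sample above this keeps exactly the Gonzales, Willis and Moore lines.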
Due to poor past naming practices, I'm left with a list of names that is proving to be a challenge to work with. The bottom line is that I want the most current name (by date) to be placed in a variable. All the names are listed (unsorted) in a file called bar.txt.
In this case I can't rename, and there's no way to get the actual dates of the images; these names are all I have to go on. The names can follow one of several patterns:
foo
YYYYMMDD-foo
YYYYMMDD##-foo
foo can be anything from a single character to a long string of letters/numbers/symbols. I am interested only in the names matching the second pattern, YYYYMMDD-foo, as those are from after we started tagging consistently.
I would like to end up with a variable containing the name with the most recent date that follows the pattern YYYYMMDD-foo.
I know about sort -k1 -n < bar.txt, but I'm not sure how to isolate the second pattern's results to extract what I need.
How do I sort the file to ignore anything but the second pattern, and return the most current date?
Sample
Given that bar.txt looks like this:
test
2017120901-develop-BUILD-31
20170326-TEST-1.2.0
20170406-BUILD-40-1.2.0-test
2010818_001
I would want to extract 20170406-BUILD-40-1.2.0-test
Since your requirement involves 1) getting only the names of a certain format and 2) sorting them to take only the latest one, I am using Awk and GNU sort together to achieve it:
awk -F'-' 'length($1) == 8' file | sort -nrk1 | head -1
20170406-BUILD-40-1.2.0-test
The solution works by keeping only those lines whose first -delimited field has exactly 8 characters, matching the YYYYMMDD alignment. Once those are filtered, a reverse numeric sort is applied on the first field and the first (most recent) line is taken with head.
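Since you wanted the result in a variable, the same pipeline can be wrapped in a command substitution (assuming the file is bar.txt; the variable name is just an example):
latest=$(awk -F'-' 'length($1) == 8' bar.txt | sort -nrk1 | head -1)
echo "$latest"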
I need help understanding a weird problem with sed, bash and a while loop.
MY data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records from File1 to File2 under the respective account holder. But if I need to remove records that no longer exist, it does not work reliably, since it wipes out records from other accounts, i.e. it does not delete only between a matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
//Read File in to F//
cat File 2 | while read F
do
//extract fields from F into variables
_vmname="$(echo $F |grep 'cname'| sed 's/<cname="//g' |sed 's/.\{2\}$//g')"
_account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' |sed 's/.\{2\}$//g')"
// I then compare my File1 and look for stale records that are still in File2
if grep "$_vmname" File1 ;then
continue
else
// if not matched, delete between the respective accountname
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables for _vmname and _account inside a loop means the files get read numerous times; any better way to do this is appreciated.
I do not think the sed statement that matches these two patterns and then deletes between them works the way I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
and please no awk
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out you have so far 3 grep and 4 sed invocations per line in the file, to process another file N times, once per line. That's O(mn) using the slowest method on the planet to read the file (a while loop). And it doesn't work.
I may have a logic problem with my thought chain.
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
Process the CSV and produce the XML to be inserted.
In awk, first read the new input XML into an array keyed by cname, then process the target XML once. For every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly). A rough sketch of this step follows the list.
I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass with step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a list of to-be-deleted. Then, similar to step #2, read in that list, and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, from what lines.
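A very rough sketch of step #2, assuming each <cname="..."> tag sits on its own line, that exact line can serve as the lookup key, and that new.xml holds the pre-built replacement blocks (both file names are placeholders):
awk '
    NR == FNR {                                # first file: pre-built blocks
        if (/<cname=/) { key = $0; collect = 1 }
        if (collect)    block[key] = block[key] $0 "\n"
        if (/<\/cname>/) collect = 0
        next
    }
    /<cname=/ && ($0 in block) {               # target file: matched record
        printf "%s", block[$0]                 # emit the replacement block
        skip = 1; next
    }
    skip { if (/<\/cname>/) skip = 0; next }   # drop the old block
    { print }                                  # everything else passes through
' new.xml File2 > File2.new
Matching on the exact <cname=...> string is fragile, of course, which is one more reason a real XML parser is the better tool here.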
Any pointers
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning sign, telling you to use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straightjacket. Better tools, in this case, are the only right answer.
I have a text file similar to
"3"|"0001"
"1"|"0003"
"1"|"0001"
"2"|"0001"
"1"|"0002"
i.e. a pipe-delimited text file containing quoted strings.
What I need to do is:
First, extract the first line which contains each value in the first column, producing
"3"|"0001"
"1"|"0003"
"2"|"0001"
Then, sort by the values in the first column, producing
"1"|"0003"
"2"|"0001"
"3"|"0001"
Performing the sort is easy - sort -k 1,1 -t \| - but I'm stuck on extracting the first line in the file which contains each value in the first column. I thought of using uniq, but it doesn't do what I want, and its "column-handling" abilities are limited to ignoring the first 'x' columns of space-or-tab delimited text.
Using the Posix shell (/usr/bin/sh) under HP-UX.
I'm kind of drawing a blank here. Any suggestions welcomed.
You can do:
awk -F'|' '!a[$1]++' file|sort...
The awk part will remove the duplicated lines, leaving only the first occurrence of each first-column value.
I don't have an HP-UX box, so I cannot do a real test, but I think it should work.
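Putting it together with the sort from the question, the whole pipeline would look something like this (with file standing in for your input file); on the sample data it produces:
awk -F'|' '!a[$1]++' file | sort -t \| -k 1,1
"1"|"0003"
"2"|"0001"
"3"|"0001"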