Removing lines from text list that don't match expression - sorting

I've made a script that lists movies that I've processed from DVDs for Kodi into a text file. In the example below I'd like to remove any entry that doesn't contain the year (YYYY). What is the best way to do this?
300 (2006).mkv
42nd Street (1933).mkv
47 Ronin (2013).mkv
A1_t00.mkv
A1_t01.mkv
A1_t01.mkv
I'm new to bash scripting and have explored a couple of sites covering awk and sed.

awk '/[0-9][0-9][0-9][0-9]/ {print}' list.txt > filteredlist.txt
This keeps only the lines that contain a 4-digit number (the year) and drops all the extras. Note that any run of four digits matches, not just a year in parentheses.
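A slightly stricter variant, as a sketch: every wanted entry in the sample list has the year wrapped in parentheses, so the pattern can require them. The function name filter_years is made up for illustration, and the assumption that years always appear as (YYYY) comes only from the sample.

```shell
# filter_years (hypothetical helper): read names on stdin and keep only
# those containing a parenthesised 4-digit year such as "(2006)".
filter_years() {
    grep -E '\([0-9]{4}\)'
}

# Usage: filter_years < list.txt > filteredlist.txt
```

With this pattern a name such as A1_t0000.mkv is dropped, whereas the plain four-digit test would keep it.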

Related

How to delete a series of positions within a file based on a list of numbers with Bash

I'm pretty new to Bash scripting and I have a problem to solve. I have a file that looks like this:
>atac
ATTGGCAATTAAATTCTTTT
>lipa
ATTACCAAGTAAATTCTTTT
.
.
.
where all even lines have the same length but can contain different characters. In each even line I need to remove a series of positions listed in a .txt file. The .txt file contains only a list of numbers, one per line, that correspond to the positions to be removed, and looks like this:
3
5
8
10
11
In the expected output the even lines must still all have the same length, with the positions listed in the .txt file deleted from each of them.
Any suggestions?
If the "position" in the txt file always refers to an index in the original string, this awk one-liner will help you:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x=""}7' your.txt FS="" OFS="" file
>atac
ATGCATAATTCTTTT
>lipa
ATACAGAATTCTTTT
To verify the result, mark the deleted characters with "-" instead of removing them:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x="-"}7' your.txt FS="" OFS="" file
>atac
AT-G-CA-T--AATTCTTTT
>lipa
AT-A-CA-G--AATTCTTTT
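Splitting a line into single-character fields with an empty FS is an extension that common awk implementations provide but POSIX leaves unspecified. As a hedged alternative, the sketch below rebuilds each even line character by character with substr, which any POSIX awk should accept. The function name delete_positions and the file names in the usage line are placeholders, not part of the original answer.

```shell
# delete_positions (hypothetical helper)
# usage: delete_positions POSITIONS.txt SEQUENCES
delete_positions() {
    awk '
        NR == FNR { drop[$0 + 0]; next }    # first file: remember each position
        FNR % 2 == 0 {                      # even lines of the second file
            out = ""
            n = length($0)
            for (i = 1; i <= n; i++)        # keep every character not listed
                if (!(i in drop)) out = out substr($0, i, 1)
            $0 = out
        }
        { print }                           # odd lines pass through unchanged
    ' "$1" "$2"
}

# Usage: delete_positions your.txt file
```

Like the one-liner, this relies on NR==FNR to detect the first file, so it assumes the positions file is not empty.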

Replace a part of a file by a part of another file

I have two files containing a lot of floating-point numbers. I would like to replace one of the numbers in File1 with a number from File2, locating both by line and position within the line (and not by their values).
There are a lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) by the second floating number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to bash scripting and have only used the "sed" command until now to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$3} NR>FNR {if (FNR==2) $1=val; print}' file2 file1
Explanation: read file2 first, and store the third field of the 5th record in variable val (the first part: NR==5 {val=$3}). The label AND01 is field 1, so the second floating-point number on that line is field 3. Then read file1 and print every line, but replace the first field of the second record with the value stored in val. FNR is the record number within the current file and NR is the total number of records read across all files so far, so NR>FNR is true only while reading file1.
In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions will get executed. $1..$NF are variables holding field values, and each line (record) is split into fields on the field separator (FS variable, or -F'..' option), which defaults to runs of whitespace.
The result (output):
14 4
0.65728944 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
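Since the question says the numbers must be identified by position, the line and field numbers can be passed in rather than hard-coded. The following is only a sketch under that assumption: replace_field and its argument order are invented here, and the values are handed to awk with -v. In the sample, the label AND01 is field 1 of line 5 in File2, so the second floating-point number is field 3.

```shell
# replace_field (hypothetical helper)
# usage: replace_field SRC SRC_LINE SRC_FIELD DST DST_LINE DST_FIELD
replace_field() {
    awk -v sl="$2" -v sf="$3" -v dl="$5" -v df="$6" '
        NR == FNR { if (FNR == sl) val = $sf; next }   # first file: remember the value
        FNR == dl { $df = val }                        # second file: overwrite the target field
        { print }                                      # print every line of the second file
    ' "$1" "$4"
}

# Usage: replace_field file2 5 3 file1 2 1
```

Assigning to a field makes awk rebuild that line with single spaces between fields, which matches the sample but would collapse wider column padding if the real files use it.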

parse CSV, Group all rows containing string at 5th field, export each group of rows to file with filename <group>_someconstant.csv

Need this in bash.
In a linux directory, I will have a CSV file. Arbitrarily, this file will have 6 rows.
Main_Export.csv
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
I need to parse this file's 5th field (first four chars only) and take each row with 8100 (for example) and put those rows in a new file. Same with all other groups that exist, across the entire file.
Each new file can only contain the rows for its group (one file with the rows for 8100, one file for the rows with 3100, etc.)
Each filename needs to have that group# prepended to it.
The first four characters could be any numeric value, so I can't check these against a list - there are like 50 groups, and maintenance can't be done on this if a group # changes.
When parsing the fifth field, I only care about the first four characters
So we'd start with: Main_Export.csv and end up with four files:
Main_Export_$date.csv (unchanged)
8100_filenameconstant_$date.csv
3100_filenameconstant_$date.csv
5400_filenameconstant_$date.csv
I'm not sure of the rules of the site, and whether I have to try this for myself first before posting. I'll come back once I have an idea, but I'm at a total loss. Reading up on awk right now.
If I have understood your problem correctly, this is very easy.
You can just:
$ awk -F, '{fifth=substr($5, 1, 4) ; print > (fifth "_mysuffix.csv")}' file.csv
or just:
$ awk -F, '{print > (substr($5, 1, 4) "_mysuffix.csv")}' file.csv
And you will get several files like:
$ cat 3100_mysuffix.csv
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
or...
$ cat 5400_mysuffix.csv
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
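To get closer to the file names asked for in the question, the date stamp and output directory can be passed in with -v. This is a rough sketch, not a drop-in script: split_by_group is a made-up name, "filenameconstant" is copied from the question, and the date format in the usage line is an assumption. Each output file is closed after every write, which avoids depending on how many files a particular awk can keep open when there are around 50 groups.

```shell
# split_by_group (hypothetical helper)
# usage: split_by_group INPUT.csv OUTDIR STAMP
split_by_group() {
    awk -F, -v dir="$2" -v stamp="$3" '
        {
            out = dir "/" substr($5, 1, 4) "_filenameconstant_" stamp ".csv"
            if (!(out in seen)) {    # first row of this group in this run:
                printf "" > out      # create or truncate its file
                close(out)
                seen[out] = 1
            }
            print >> out             # append the unmodified row
            close(out)               # keep at most one output file open
        }
    ' "$1"
}

# Usage (date format is an assumption):
#   stamp=$(date +%Y%m%d)
#   split_by_group Main_Export.csv . "$stamp"
#   cp Main_Export.csv "Main_Export_$stamp.csv"
```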

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column in original.csv and final.csv, respectively:
original.csv    final.csv
2212026837      2212026837
2256 41688 6    2256416886
2076113566      2076113566
2009 84517 7    2009845177
2067950476      2067950476
2057 90531 5    2057 90531 5
2085271676      2085271676
2095183426      2095183426
2347366235      2347366235
2200160434      2200160434
2229359595      2229359595
2045373466      2045373466
2053849895      2053849895
2300 81552 3    2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
If that solves your problem, great! You got lucky, move on. :)
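Before changing the regex, it can help to look at what the stubborn rows actually contain. A minimal sketch, assuming the file is pipe-delimited as in the question: print the 26th field and dump it with od -c, which shows tabs as \t and non-ASCII bytes (for example those of a non-breaking space) as octal codes. The helper name show_field_bytes is made up for illustration.

```shell
# show_field_bytes (hypothetical helper)
# usage: show_field_bytes FILE FIELD_NUMBER
show_field_bytes() {
    # Print the chosen pipe-delimited field of every row, then dump it
    # character by character so invisible characters become visible.
    awk -F'|' -v n="$2" '{ print $n }' "$1" | od -c
}

# Usage: show_field_bytes original.csv 26
```

If the dump shows only ordinary spaces in the affected rows, the quoted-delimiter explanation becomes the more likely one.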
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
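import csv, fileinput, re, sys

# Write through csv.writer so fields containing the delimiter stay quoted.
writer = csv.writer(sys.stdout, delimiter='|', lineterminator='\n')
for row in csv.reader(fileinput.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    writer.writerow(row)
Save the above in a file like colfixer.py and run it with python3 colfixer.py original.csv >final.csv.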
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
    n = split($26, a, /[[:space:]]+/)
    $26 = a[1]
    for (i = 2; i <= n; i++)
        $26 = $26 "" a[i]
}1' original.csv > final.csv

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5-7 digits long). I have to run some queries on our databases based on those numbers, and don't want to go through the hundreds of entries weeding out the numbers one by one. What Bash commands can be used to parse out the number from each line (it's the only number in each line) and consolidate the results into one line with all the numbers, comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the get variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around. There are entries where, when saved as CSV, certain rows appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
With several other entries that look similar to any of the above rows. COMMENT can be replaced with a single string of (upper-case) characters no longer than 3 characters in length per COMMENT
Assuming the variable is always on its own, and last on the line, how about just taking whatever is on the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here the digits immediately following DocumentId= or fDocumentId= are captured, and the rest of the line is discarded. Works for the data you've presented so far, at least.
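Simpler than this, as long as every line ends with the number (note that the output is space-separated, not comma-separated) :)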
cat file.csv | cut -d "=" -f 2 | xargs
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
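For comparison, the sed approach above can be tightened so that lines without an ID (such as the "...." filler rows) are skipped and no trailing comma is produced. This is a sketch under the assumption that every wanted number directly follows "DocumentId="; the function name extract_ids is hypothetical.

```shell
# extract_ids (hypothetical helper): read the list on stdin and print the
# numbers on one comma-separated line.
extract_ids() {
    # -n plus the p flag prints only lines where the substitution matched,
    # so rows without "DocumentId=<digits>" are dropped.
    sed -n 's/.*DocumentId=\([0-9][0-9]*\).*/\1/p' | paste -sd, -
}

# Usage: extract_ids < file.csv
```

The pattern is written as a basic regular expression, so it does not need the GNU-specific -r flag.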

Resources