Change Data Capture in delimited files - shell

There are two tab-delimited files (file1, file2) with the same number and structure of records but different values in some columns.
Every day we get another file (newfile) with the same number and structure of records but with some changed column values.
The task is to compare newfile with the two files (file1, file2) and update the changed records in each of them, keeping unchanged records intact.
Before applying changes:
file1
11 aaaa
22 bbbb
33 cccc
file2
11 bbbb
22 aaaa
33 cccc
newfile
11 aaaa
22 eeee
33 ffff
After applying changes:
file1
11 aaaa
22 eeee
33 ffff
file2
11 aaaa
22 eeee
33 ffff
What would be the easiest and most efficient solution? Unix shell scripting? The files are huge, containing millions of records; can a shell script be an efficient solution in this case?
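For the record-level update itself, a single awk pass per file is one plausible route. A hedged sketch, assuming the first tab-separated column is a unique key (file names are taken from the example above; the .tmp name is a placeholder):
# Load newfile into memory keyed on column 1, then rewrite file1,
# replacing only the rows whose content actually changed.
awk -F'\t' '
    NR == FNR                    { new[$1] = $0; next }     # first pass: remember new rows
    ($1 in new) && $0 != new[$1] { print new[$1]; next }    # changed row: emit the new version
                                 { print }                  # unchanged row: keep it intact
' newfile file1 > file1.tmp && mv file1.tmp file1
# repeat the same command with file2 in place of file1
This holds newfile in an in-memory array (one entry per record) and streams each target file only once, which is usually workable even at a few million records.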

Every day we get another file (newfile) with the same number and structure of records but with some changed column values.
This sounds to me like a perfect case for git. With git you can commit the current file as it is.
Then, as you get new "versions" of the file, you can simply replace the old version with the new one and commit again. The best part is that each time you commit, git records the changes from version to version, giving you access to the entire history of the file.
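A minimal sketch of that workflow (the directory path and commit messages are placeholders):
cd /data/feeds                    # wherever file1 and file2 live
git init
git add file1 file2
git commit -m "initial versions"

# each day, after merging newfile into file1 and file2:
git add file1 file2
git commit -m "daily update"

# see exactly what changed since the previous day
git diff HEAD~1 -- file1 file2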

Related

SoftQuad DESC or font file binary

I read this question but it didn't help me. I am solving a challenge where I have two files: the first one is a .png that gives me the upper half of an image; the second file is identified as "SoftQuad DESC or font file binary". I am sure this file should somehow be converted into a .png to complete the image. I googled and got a hint about magic bytes, but I am unable to match the bytes.
These are the first two rows of the output of the xxd command:
00000000: aaaa a6bb 67bb bf18 dd94 15e6 252c 0a2f ....g.......%,./
00000010: fe14 d943 e8b5 6ad5 2264 1632 646e debc ...C..j."d.2dn..
These are the last two rows of the output of the xxd command:
00001c10: 7a05 7f4c 3600 0000 0049 454e 44ae 4260 z..L6....IEND.B`
00001c20: 82
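For reference on the magic-byte hint: a PNG begins with the signature 89 50 4E 47 0D 0A 1A 0A and ends with an IEND chunk, which is exactly what the last rows above show. One way to inspect both ends of the mystery file (the file name here is a placeholder):
xxd -l 8 mystery.bin            # first 8 bytes: would show the PNG signature if present
tail -c 16 mystery.bin | xxd    # last 16 bytes: the IEND chunk appears here in a PNG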

Reorder lines near the beginning of a huge text file (>20G)

I am a vim user and can use some basic awk and bash commands. I have a text (VCF) file larger than 20 GB. What I want is to move line #69 to just below line #66:
$ less huge.vcf
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
69 ##contig=<ID=MT,length=16299>
...
What I want is:
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=MT,length=16299>
68 ##contig=<ID=X,length=171031299>
69 ##contig=<ID=Y,length=91744698>
...
I tried to open and edit it in vim (with the LargeFile plugin installed), but it still doesn't work very well.
The easy approach is to copy the section you want to edit out of your file, modify it in-place, then copy it back in.
# extract the first hundred lines
head -n 100 huge.txt >start.txt
# modify that extracted subset
vim start.txt
# copy that section back into the beginning of larger file
dd if=start.txt of=huge.txt conv=notrunc
Note that this only works if your edits don't change the size of the section being modified. That is to say -- make sure that start.txt has the exact same size in bytes after being modified that it had before.
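A quick way to guard against that (the .orig copy is just a scratch name):
cp start.txt start.txt.orig                      # keep an untouched copy
vim start.txt                                    # edit the extracted section
if [ "$(wc -c < start.txt)" -eq "$(wc -c < start.txt.orig)" ]; then
    dd if=start.txt of=huge.txt conv=notrunc     # safe: same byte count
else
    echo "size changed; writing it back would corrupt huge.txt" >&2
fi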
Here's an awk version:
$ awk 'NR>=3 && NR<=4{b=b (b==""?"":ORS) $0;next}1;NR==5 {print b}' file
...
66 ##contig=<ID=9,length=124595110>
69 ##contig=<ID=MT,length=16299>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
...
You need to change the line numbers in the code, though: 3 -> 67, 4 -> 68 and 5 -> 69, and redirect the output to a new file. If you'd like it to edit the file in place, use -i inplace with GNU awk.
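For this particular file, that becomes (the output file name is a placeholder):
awk 'NR>=67 && NR<=68{b=b (b==""?"":ORS) $0;next}1;NR==69{print b}' huge.vcf > reordered.vcf
# or, with GNU awk, edit the file in place:
gawk -i inplace 'NR>=67 && NR<=68{b=b (b==""?"":ORS) $0;next}1;NR==69{print b}' huge.vcf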

Delete all consecutive lines with sed, but not an isolated one

I have a log file which looks like the following text:
...
5 files analysed in 98 ms
7 files analysed in 654 ms
error1: ....
error2: ....
error3: ....
21 files analysed in 345 ms
3 files analysed in 78 ms
6 files analysed in 55 ms
...
I would like to use sed or awk to remove all consecutive lines containing the pattern "files analysed in", except the one directly above the useful information.
7 files analysed in 654 ms
error1: ....
error2: ....
error3: ....
I tried some tricks from this post, but nothing works the way I would like. The number of error lines is not always the same.
How could I proceed?
grep -v "files analysed in" -B 1
Select everything that doesn't contain the pattern, but print one line of context before each match.
With awk:
$ awk '/pattern/{p=$0} !/pattern/{print p; print}' file
foo3 pattern foo4
some useful information
You can also exit after the first match.
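Applied to the pattern from the question, a slight variant that also clears the buffer, so only one summary line is printed above a run of error lines (the log file name is a placeholder):
awk '/files analysed in/ { p = $0; next }           # remember the latest summary line
     { if (p != "") print p; p = ""; print }        # print it once, then the error lines
' app.log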

Update Oracle database with content of text file

I would like to update a field in an Oracle database with the content of a standard txt file.
The file is generated every 10 minutes by an external program over which I have no control.
I would like to create a job in Oracle or a SQL*Plus batch script that would pick up the content of the file and update a specific record in the Oracle database.
For example, My_Table would contain this:
ID Description FileContent
-- ----------- ---------------------------------------------------------
00 test1.txt This is content of test.txt
01 test2.txt Content of files may
Contain several lines
blank lines
pretty much everything (but must be limited to 2000char)
02 test3.txt not loaded yet
My file "test3.txt" changes often, but I do not know when; it would look like this:
File generated at 3:33 on august 19, 2016
Result :
1 Banana
2 Apple
3 Pineapple
END OF FILE
I would like the full content of the file to be loaded into its corresponding record in the Oracle database.
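One hedged sketch of such a batch, driving SQL*Plus from a shell script. The file path, the ORA_CONN connection string, and the quoting scheme are assumptions, and it only works while the file stays under the 2000-character limit and contains no ]' sequence, no line consisting of a lone slash, and no shell-special characters such as $ or backticks:
#!/bin/sh
FILE=/data/incoming/test3.txt          # placeholder path to the generated file
CONTENT=$(cat "$FILE")                 # must fit in the 2000-char column

sqlplus -s "$ORA_CONN" <<EOF           # ORA_CONN: user/password@service (placeholder)
SET DEFINE OFF
SET SQLBLANKLINES ON
UPDATE my_table
   SET filecontent = q'[${CONTENT}]'
 WHERE description = 'test3.txt';
COMMIT;
EXIT
EOF
The script could then be scheduled with cron (or an Oracle Scheduler external job) to run shortly after each 10-minute generation cycle.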

How to handle multi line fixed length file with BeanIO

I'm very new to BeanIO; it solves most of my problems, but I'm unable to figure out how to handle this one:
I have a multiline fixed width file in the following format:
BBB001 000 000000
BBB555 001 George
BBB555 002 London
BBB555 003 UK
BBB555 999 000000
BBB555 001 Jean
BBB555 002 Paris
BBB555 003 France
BBB555 004 Europe
BBB555 999 000000
BBB999 000 000000
Basically there is a header and a footer, which I can easily read because they are well defined. However, a single record actually spans multiple lines, and the end of the record is the line that has 999 in the middle (there is no other information on that line). I was wondering what my mapping XML should be, or what classes I need to override, so I can properly read this type of format.
I would suggest using the lineContinuationCharacter property, as described in the BeanIO documentation. It probably has to be configured as a carriage return and line feed.
Try something like this:
<beanio xmlns="http://www.beanio.org/2012/03"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.beanio.org/2012/03 http://www.beanio.org/2012/03/mapping.xsd">
  <stream name="s1" format="fixedlength" lineContinuationCharacter="">
    <!-- record layout... -->
  </stream>
</beanio>
Note that I haven't tested this, but according to the documentation this should work.
