Using fetchmail for one-time email extraction from Gmail

I'm trying to use fetchmail in terminal to extract e-mails from my gmail account.
I configured my ~/.fetchmailrc with:
poll imap.gmail.com protocol POP3
user "someuser#gmail.com" is oren here
password 'verysecretpassword'
(Of course with real username+password).
Then I tried to naively extract emails with: $ fetchmail.
Sadly nothing happened, and all I got was:
fetchmail: 6.3.26 querying imap.gmail.com (protocol POP3) at Mon 03 Feb 2020 14:34:46 IST: poll started
Trying to connect to <ADDRESS> ... connection failed.

Using Fetchmail
It looks like the config is set to poll an IMAP server but then specifies the POP3 protocol. Try something like this for the ~/.fetchmailrc file:
set postmaster "local_user"
set daemon 600
poll pop.gmail.com with proto POP3
user 'gmail_user_name' there with password 'app_password' is local_user here options ssl fetchlimit 400
where:
local_user is some local account where undeliverable mail should go (the "last ditch effort" before failing permanently).
gmail_user_name is everything to the left of the @ in the email address.
app_password is a specially generated password that is restricted to the gmail application (go here: https://myaccount.google.com/ and click Security, then app passwords and generate a new app password)
What to do at this point will depend on your local setup. Fetchmail will... fetch mail (clearly).... and then deliver it to the local machine's delivery system. If you have sendmail (a pretty safe bet), this might work:
$ fetchmail -d0 -avNk -m "/usr/sbin/sendmail -i -f %F -- %T" pop.gmail.com
Mail should start flowing in. Messages can be read with the mail command, or you can get the raw content from /var/mail/[username]. This might not get everything in one shot; it very likely won't if the address has accumulated even a small amount of history. Let it finish and check that it worked as expected. If everything looks good, then it's time to start fetchmail as a daemon process and let it download the entire mailbox. First, configure fetchmail with appropriate polling interval and batch size settings[1].
Confirm that the polling interval is configured in ~/.fetchmailrc by the line set daemon 600 (i.e., a 10-minute polling interval).
Confirm that the polling option fetchlimit 400 is set in the ~/.fetchmailrc under the options section in the poll pop.gmail.com stanza. This is the maximum number of messages to fetch per poll.
Start fetchmail using the same command as above, but omit the -d0 switch.
Fetchmail should start as a true daemon and continue to periodically download batches of messages until the whole mailbox is downloaded. You will need to remember to kill the daemon process if you don't want it to continue syncing until the next reboot.
Using Google Takeout
You can do this super easily using Google Takeout. Log in, click the "deselect" option at the top of the list, then scroll down to Mail and check just that. You can choose to get the data in a .zip or .tgz file. They will send you an email when the archive is ready for download. It is packaged as an mbox file, but that is pretty straightforward to convert to other formats.
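For instance, here is a minimal Python sketch that converts a Takeout mbox into a Maildir using the standard-library mailbox module; the file and directory names are just placeholders:

import mailbox

# Hypothetical paths: point these at wherever the Takeout export ended up.
src = mailbox.mbox("takeout-mail.mbox")
dst = mailbox.Maildir("converted-maildir", create=True)

for msg in src:
    dst.add(msg)                               # copy each message over
    print(msg.get("Subject", "(no subject)"))  # quick sanity check as it runs

dst.flush()
src.close()

From a Maildir (or straight from the mbox object) you can then feed the messages into whatever tool you actually want to use.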
This is probably the easiest way to accomplish a one-time export, and I think they have an option to set up a recurring export too. It probably doesn't offer as much control as using the developer API directly, but it's a lot less hassle.
[1]: I believe Google has some rate limiting in place, so I am adding some steps to accommodate those limits. These are conservative values, since I don't know exactly what the limits are (or even for sure if they exist). If you know more, or care to research it, adjust these values to whatever you think is best.

Related

How to download 300k log lines from my application?

I am running a job on my Heroku app that generates about 300k lines of log within 5 minutes. I need to extract all of them into a file. How can I do this?
The Heroku UI only shows logs in real time, from the moment it was opened, and only keeps 10k lines.
I attached a LogDNA add-on as a drain, but their export is also limited to 10k lines. To even have the option of exporting, I need to apply a search filter (I typed 2020 because all the lines start with a date, but still...). I can scroll through all the logs to see them, but as I scroll up the bottom gets truncated, so I can't even copy-paste them myself.
I then attached Sumo Logic as a drain, which is better because the export limit is 100k. However, I still need to filter the logs into 30s to 60s intervals and download them separately. Also, it exports to a CSV file in reverse order (newest first, not what I want), so I still have to work on the file after it's downloaded.
Is there no option to get actual raw log files in full?
Is there no option to get actual raw log files in full?
There are no actual raw log files.
Heroku's architecture requires that logging be distributed. By default, its Logplex service aggregates log output from all services into a single stream and makes it available via heroku logs. However,
Logplex is designed for collating and routing log messages, not for storage. It retains the most recent 1,500 lines of your consolidated logs, which expire after 1 week.
For longer persistence you need something else. In addition to commercial logging services like those you mentioned, you have several options:
Log to a database instead of files. Something like Apache Cassandra might be a good fit.
Send your logs to a logging server via Syslog (my preference):
Syslog drains allow you to forward your Heroku logs to an external Syslog server for long-term archiving.
Send your logs to a custom logging process via HTTPS.
Log drains also support messaging via HTTPS. This makes it easy to write your own log-processing logic and run it on a web service (such as another Heroku app).
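To make the HTTPS option concrete, here is a rough Python sketch of a drain endpoint that simply appends whatever the drain POSTs to a local file. The port and file name are arbitrary, and a real deployment would still need TLS and some authentication in front of it:

from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_FILE = "drained.log"        # arbitrary destination file

class DrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The drain POSTs batches of syslog-formatted lines in the request body.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        with open(LOG_FILE, "ab") as f:
            f.write(body + b"\n")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), DrainHandler).serve_forever()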
Speaking solely from the Sumo Logic point of view, since that’s the only one I’m familiar with here, you could do this with its Search Job API: https://help.sumologic.com/APIs/Search-Job-API/About-the-Search-Job-API
The Search Job API lets you kick off a search, poll it for status, and then when complete, page through the results (up to 1M records, I believe) and do whatever you want with them, such as dumping them into a CSV file.
But this is only available to trial and Enterprise accounts.
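For what it's worth, here is a rough Python sketch of how that flow might look, using the third-party requests library. The endpoint paths, field names, and job-state strings are from memory, so verify them against the linked docs; the credentials, query, and time range are placeholders.

import csv
import time
import requests  # assumes the requests library is installed

# Placeholders; the API host also varies by Sumo Logic deployment region.
API = "https://api.sumologic.com/api/v1"
AUTH = ("ACCESS_ID", "ACCESS_KEY")

# 1. Kick off a search job.
job = requests.post(
    f"{API}/search/jobs",
    auth=AUTH,
    json={
        "query": "_sourceCategory=heroku 2020",
        "from": "2020-02-01T00:00:00",
        "to": "2020-02-02T00:00:00",
        "timeZone": "UTC",
    },
).json()
job_id = job["id"]

# 2. Poll until the job has gathered all results.
while True:
    status = requests.get(f"{API}/search/jobs/{job_id}", auth=AUTH).json()
    if status["state"] == "DONE GATHERING RESULTS":
        break
    time.sleep(5)

# 3. Page through the messages and dump them to a CSV file.
with open("heroku.csv", "w", newline="") as out:
    writer = csv.writer(out)
    offset, page = 0, 10000
    while offset < status["messageCount"]:
        batch = requests.get(
            f"{API}/search/jobs/{job_id}/messages",
            params={"offset": offset, "limit": page},
            auth=AUTH,
        ).json()["messages"]
        for m in batch:
            writer.writerow([m["map"].get("_messagetime"), m["map"].get("_raw")])
        offset += page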
I just looked at Heroku's docs and it does not look like they have a native way to retrieve more than 1,500 lines, and you do have to forward those logs via syslog to a separate server/service.
I think your best solution is going to depend, however, on your use case, such as why specifically you need these logs in a CSV.

How to only read a few lines from a remote file?

Before downloading a file, I need to set up the way it (typically a .csv, but not always) will be parsed.
I don't want to download the whole file especially if the "headers" do not match what is expected.
Is there a way to only download up to a certain number of bytes and then gracefully kill the connection?
There's no explicit support for this in the FTP protocol.
There's an expired draft for a RANG command that would allow this:
https://datatracker.ietf.org/doc/html/draft-bryan-ftp-range-08
But that is obviously supported only by newer FTP servers.
Though there's nothing that prevents you from initiating a normal (full) download and forcefully breaking it as soon as you get the amount of data you need.
All you need to do is close the data transfer connection. This is basically what all FTP clients do when an end user decides to abort the transfer.
This approach might result in a few error messages in the FTP server log.
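A rough Python sketch of that approach with the standard-library ftplib, where the host, credentials, and file name are placeholders: open the data connection yourself, read only as much as you need, then drop it.

import ftplib

HOST, USER, PASSWORD = "ftp.example.com", "user", "password"   # placeholders
REMOTE_FILE = "data.csv"                                       # placeholder
WANTED = 4096              # how many bytes of the file we actually need

ftp = ftplib.FTP(HOST)
ftp.login(USER, PASSWORD)

# Open the data connection ourselves so we can drop it whenever we like.
conn = ftp.transfercmd("RETR " + REMOTE_FILE)
chunks, received = [], 0
while received < WANTED:
    data = conn.recv(4096)
    if not data:           # file was shorter than WANTED bytes
        break
    chunks.append(data)
    received += len(data)

conn.close()               # forcefully break the transfer
try:
    ftp.abort()            # tell the server we gave up; it may still log an error
except ftplib.all_errors:
    pass
ftp.close()

header = b"".join(chunks)[:WANTED]
print(header.decode("utf-8", errors="replace"))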
If you can use the SFTP protocol instead, then it's easy: SFTP supports this natively.
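With SFTP the same thing is a clean, supported read rather than an aborted transfer. For example, a small sketch using the third-party paramiko library (host, credentials, and file name are again placeholders):

import paramiko   # third-party library, assumed installed

HOST, USER, PASSWORD = "sftp.example.com", "user", "password"  # placeholders
REMOTE_FILE = "data.csv"                                       # placeholder
WANTED = 4096

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

# SFTP reads are offset/length based, so reading only the first WANTED
# bytes is an ordinary operation.
remote = sftp.open(REMOTE_FILE, "rb")
header = remote.read(WANTED)
remote.close()

sftp.close()
transport.close()
print(header.decode("utf-8", errors="replace"))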

Client-server synchronization

I found a nice answer by S.Lott about what I've been searching for:
Client-server synchronization pattern / algorithm?
But my question now is: what if the client's clock is wrong?
Here's my problem:
Let's say the client's time is 1 hour behind the server's. The client changes a file, so its last write time is now 1 hour behind the server's. When the user starts the program that synchronizes the file, the server looks at the changed file and says: "Oh, that file you have there is 1 hour older than mine, so let's replace it." But that's wrong, because the user's file is actually newer, so it should be uploaded to the server.
I need a system that checks if the file is newer on server or on the client, and that doesn't work if the time is wrong or different.
Any ideas?
By the way, I am trying to write a cloud program.
If you're resolving conflicts manually (which would make sense for most applications), this can probably be done better with versioning rather than timestamps. When a client modifies a file, set a flag. When synchronizing, check the flag and versions.
If the client flag is set and the client and server versions are the same, send the client file to the server.
If the client flag is not set and the server version is newer, send the server file to the client.
If the client flag is set and the server version is newer, a conflict occurred and should be resolved.
The versions are per-file and should be sent along with the files.
Reset all client flags after synchronization.
This 'flag' can just be a check whether the last modified time on the file is different from the time that file was received from the server (we can store this time separately right after getting the file from the server).
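Putting the rules above into a minimal Python sketch (the names and the version scheme are made up for illustration):

def decide(client_changed, client_version, server_version):
    """Decide which way a file should move during synchronization."""
    if client_changed and client_version == server_version:
        return "upload"      # client edited the version it got from the server
    if not client_changed and server_version > client_version:
        return "download"    # server moved on, client did not touch the file
    if client_changed and server_version > client_version:
        return "conflict"    # both sides changed; let the user resolve it
    return "nothing"         # already in sync

# Example: the client edited version 3 and the server is still at version 3.
print(decide(client_changed=True, client_version=3, server_version=3))  # upload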
Alternatively, you could sync the time.
Here's one possible solution:
When receiving files from the server, first get the current time from the server, then offset the timestamp of each file received on the client side by the difference between the server and client time. When sending files to the server, you can do something similar by applying the offset in the other direction.
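For illustration, a small Python sketch of that offset idea (the function names are hypothetical); the offset is measured once per sync session and applied to every timestamp that crosses the clock boundary:

import os
import time

def clock_offset(server_now):
    """Seconds the server clock is ahead of this client's clock."""
    return server_now - time.time()

def stamp_received_file(path, server_mtime, offset):
    """Translate the server's mtime into the client's local clock."""
    local_mtime = server_mtime - offset
    os.utime(path, (local_mtime, local_mtime))

def mtime_for_upload(path, offset):
    """Translate a local file's mtime into the server's clock."""
    return os.path.getmtime(path) + offset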
But this seems more complex than necessary.

IIS 7 request duration monitoring

I am curious whether there is a way of monitoring the request duration time on an IIS server. I have come up with a solution myself, but it's really resource intensive, and that is why I'm asking the question, just to gather more opinions.
My plan is to extract the duration of each request and send it to Graphite, so as to have a real-time overview of the performance of the web server. The idea I've come up with is to use PowerShell with its WebAdministration module. If you run get-item IIS:\AppPools\DefaultAppPool | Get-WebRequest, for example, you get all the requests on that app pool with a lot of info, including the timing info.
The thing is that I would need a script which runs every 100 ms to get all requests, and that is kinda wasteful. Is there a way to tell IIS to put the request duration (in milliseconds) in the logs? Because then it would be much easier to get the information I need.
I don't know if there is such a feature in IIS, but I've done the same thing (sending IIS page times to Graphite) by using a reverse proxy between the internet and the IIS server, such as nginx.
The proxy module in nginx allows you to log, for each request, the time the backend took to produce the page.
Also, having a proxy like nginx in front of IIS can be very helpful if you have to deal with visitors on slow connections: nginx will buffer the reply from the backend, drop the backend connection, and wait until the visitor gets all the content. Highly recommended.
If you go this route, you should use Logster (also from the Etsy guys) or Logstash to parse the nginx logs at whatever interval you want (likely every minute).
It seems there is a feature that logs requests based on a regex, and it's called the Advanced Logging module. You can specify, from a number of fields, what you want to get logged, and it's W3C compliant. In my case time-taken was one of the fields that can be specified, and that was what I was looking for. After that I wrote a script in PowerShell which parses the logs, gets the information I need, constructs a metric, and sends it to statsd, which in turn sends it to Graphite.
The method I chose for the log parsing was the following: in the script I used the Get-Content cmdlet from PowerShell to gather all the logs into one file (yes, IIS breaks the logs into multiple files, and I'm guessing the number of log files depends on the number of worker processes, but I'm not sure). That was the first iteration. In a second iteration I gather all the logs into another file, diff the first file against the latter, and only the difference gets processed.
I chose this method because I thought it would be better to keep the regex processing to a minimum. The next step is erasing the first file of accumulated logs, moving the second one into the place of the erased one, and running the script again, so there is always a basis for comparison. Also, the log rollover is at one hour, after which the logs are erased.
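The script above is PowerShell, but for illustration here is the same idea as a Python sketch: find the time-taken column from the W3C #Fields header and ship each value to statsd as a timing metric. The log file name, statsd address, and metric name are placeholders.

import socket

LOG_FILE = "u_ex200203.log"           # hypothetical IIS W3C log file
STATSD = ("127.0.0.1", 8125)          # hypothetical statsd address
METRIC = "iis.request.time_taken"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fields = []

with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header names the columns, e.g. "... sc-status time-taken".
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and "time-taken" in fields:
            values = line.split()
            time_taken = values[fields.index("time-taken")]  # milliseconds
            # statsd timing metric format: "<name>:<value>|ms"
            sock.sendto(f"{METRIC}:{time_taken}|ms".encode(), STATSD)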

Google Reader Like Web Application (SmartGWT) (GWT)

I need to write a web application like Google Reader (using SmartGWT).
Instead of RSS feeds I will show log files which update in real time. I think I can start a timer and ask the server every minute whether there are any new logs. Is this the right way to do this?
Do I have to use WebSockets? Do they work in all modern browsers?
I think I can start a timer and ask the server every minute whether there are any new logs. Is this the right way to do this?
Without using server push, this is the way to go. You typically want to query the server with the timestamp of the last received log entry. This way you can send only the diff since the last pull.
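As a language-neutral illustration of that pull scheme (sketched in Python here, with made-up names), the client remembers the newest timestamp it has seen and the server returns only entries newer than that:

import time
from datetime import datetime

LOG_ENTRIES = []   # hypothetical server-side store of (timestamp, line) tuples

def entries_since(last_seen):
    """Server side: return only the entries newer than the client's last one."""
    return [(ts, line) for ts, line in LOG_ENTRIES if ts > last_seen]

def poll_forever(interval_seconds=60):
    """Client side: remember the newest timestamp seen and fetch only the diff."""
    last_seen = datetime.min
    while True:
        for ts, line in entries_since(last_seen):   # in GWT this would be an RPC
            print(ts, line)
            last_seen = max(last_seen, ts)
        time.sleep(interval_seconds)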
See here for some more information on GWT and push (which is actually pull). Or check out stream-hub (and the pimped stock watcher example) if you wanna go for server push.
