ansible-mailserver-debian/journal-postfix-doc/20191127_pyugat_talk.html

378 lines
34 KiB
HTML

<h1>journal-postfix - A log parser for Postfix</h1>
<p>Experiences from applying Python to the domain of bad old email.</p>
<h2>Email ✉</h2>
<ul>
<li>old technology (starting in the 70ies)</li>
<li><a href="https://en.wikipedia.org/wiki/Store_and_forward">store-and-forward</a>: sent != delivered to recipient</li>
<li>non-delivery reasons:
<ul>
<li>recipient over quota</li>
<li>inexistent destination</li>
<li>malware</li>
<li>spam</li>
<li>server problem</li>
<li>...</li>
</ul></li>
<li>permanent / non-permanent failure (<a href="https://www.iana.org/assignments/smtp-enhanced-status-codes/smtp-enhanced-status-codes.xhtml">DSN ~ 5.X.Y / 4.X.Y</a>)</li>
<li>non-delivery modes
<ul>
<li>immediate reject on SMTP level</li>
<li>delayed <a href="https://en.wikipedia.org/wiki/Bounce_message">bounce messages</a> by <a href="https://upload.wikimedia.org/wikipedia/commons/a/a2/Bounce-DSN-MTA-names.png">reporting MTA</a> - queueing (e.g., ~5d) before delivery failure notification</li>
<li>discarding</li>
</ul></li>
<li>read receipts</li>
<li><a href="https://en.wikipedia.org/wiki/Email_tracking">Wikipedia: email tracking</a></li>
</ul>
<h2><a href="https://en.wikipedia.org/wiki/SMTP">SMTP</a></h2>
<p><a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol#SMTP_transport_example">SMTP session example</a>: envelope sender, envelope recipient may differ from From:, To:</p>
<p>Lists of error codes:</p>
<ul>
<li><a href="https://www.inmotionhosting.com/support/email/email-troubleshooting/smtp-and-esmtp-error-code-list">SMTP and ESMTP</a></li>
<li><a href="https://serversmtp.com/smtp-error/">SMTP</a></li>
<li><a href="https://info.webtoolhub.com/kb-a15-smtp-status-codes-smtp-error-codes-smtp-reply-codes.aspx">SMTP</a></li>
</ul>
<p>Example of an error within a bounced email (Subject: Mail delivery failed: returning message to sender)</p>
<pre><code>SMTP error from remote server for TEXT command, host: smtpin.rzone.de (81.169.145.97) reason: 550 5.7.1 Refused by local policy. No SPAM please!
</code></pre>
<ul>
<li>email users are continually asking for the fate of their emails (or those of their correspondents which should have arrived)</li>
</ul>
<h2><a href="http://www.postfix.org">Postfix</a></h2>
<ul>
<li>popular <a href="https://en.wikipedia.org/wiki/Message_transfer_agent">MTA</a></li>
<li>written in C</li>
<li>logging to files / journald</li>
<li>example log messages for a (non-)delivery + stats</li>
</ul>
<pre><code>Nov 27 16:19:22 mail postfix/smtpd[18995]: connect from unknown[80.82.79.244]
Nov 27 16:19:22 mail postfix/smtpd[18995]: NOQUEUE: reject: RCPT from unknown[80.82.79.244]: 454 4.7.1 &lt;spameri@tiscali.it&gt;: Relay access denied; from=&lt;spameri@tiscali.it&gt; to=&lt;spameri@tiscali.it&gt; proto=ESMTP helo=&lt;WIN-G7CPHCGK247&gt;
Nov 27 16:19:22 mail postfix/smtpd[18995]: disconnect from unknown[80.82.79.244] ehlo=1 mail=1 rcpt=0/1 rset=1 quit=1 commands=4/5
Nov 27 16:22:43 mail postfix/anvil[18997]: statistics: max connection rate 1/60s for (smtp:80.82.79.244) at Nov 27 16:19:22
Nov 27 16:22:43 mail postfix/anvil[18997]: statistics: max connection count 1 for (smtp:80.82.79.244) at Nov 27 16:19:22
Nov 27 16:22:43 mail postfix/anvil[18997]: statistics: max cache size 1 at Nov 27 16:19:22
Nov 27 16:22:48 mail postfix/smtpd[18999]: connect from mail.cosmopool.net[2a01:4f8:160:20c1::10:107]
Nov 27 16:22:49 mail postfix/smtpd[18999]: 47NQzY13DbzNWNQG: client=mail.cosmopool.net[2a01:4f8:160:20c1::10:107]
Nov 27 16:22:49 mail postfix/cleanup[19003]: 47NQzY13DbzNWNQG: info: header Subject: Re: test from mail.cosmopool.net[2a01:4f8:160:20c1::10:107]; from=&lt;ibu@cosmopool.net&gt; to=&lt;ibu@multiname.org&gt; proto=ESMTP helo=&lt;mail.cosmopool.net&gt;
Nov 27 16:22:49 mail postfix/cleanup[19003]: 47NQzY13DbzNWNQG: message-id=&lt;d5154432-b984-d65a-30b3-38bde7e37af8@cosmopool.net&gt;
Nov 27 16:22:49 mail postfix/qmgr[29349]: 47NQzY13DbzNWNQG: from=&lt;ibu@cosmopool.net&gt;, size=1365, nrcpt=2 (queue active)
Nov 27 16:22:49 mail postfix/smtpd[18999]: disconnect from mail.cosmopool.net[2a01:4f8:160:20c1::10:107] ehlo=1 mail=1 rcpt=2 data=1 quit=1 commands=6
Nov 27 16:22:50 mail postfix/lmtp[19005]: 47NQzY13DbzNWNQG: to=&lt;ibu2@multiname.org&gt;, relay=mail.multiname.org[private/dovecot-lmtp], delay=1.2, delays=0.56/0.01/0.01/0.63, dsn=2.0.0, status=sent (250 2.0.0 &lt;ibu2@multiname.org&gt; nV9iJ9mi3l0+SgAAZU03Dg Saved)
Nov 27 16:22:50 mail postfix/lmtp[19005]: 47NQzY13DbzNWNQG: to=&lt;ibu@multiname.org&gt;, relay=mail.multiname.org[private/dovecot-lmtp], delay=1.2, delays=0.56/0.01/0.01/0.63, dsn=2.0.0, status=sent (250 2.0.0 &lt;ibu@multiname.org&gt; nV9iJ9mi3l0+SgAAZU03Dg:2 Saved)
Nov 27 16:22:50 mail postfix/qmgr[29349]: 47NQzY13DbzNWNQG: removed
</code></pre>
<ul>
<li><a href="http://www.postfix.org/OVERVIEW.html">involved postfix components</a>
<ul>
<li>smtpd (port 25: smtp, port 587: submission)</li>
<li>cleanup</li>
<li>smtp/lmtp</li>
</ul></li>
<li>missing log parser</li>
</ul>
<h2>Idea</h2>
<ul>
<li>follow log stream and write summarized delivery information to a database</li>
<li>goal: spot delivery problems, collect delivery stats</li>
<li>a GUI could then display the current delivery status to users</li>
</ul>
<h2>Why Python?</h2>
<ul>
<li>simple and fun language, clear and concise</li>
<li>well suited for text processing</li>
<li>libs available for systemd, PostgreSQL</li>
<li>huge standard library (used here: datetime, re, yaml, argparse, select)</li>
<li>speed sufficient?</li>
</ul>
<h2>Development iterations</h2>
<ul>
<li>hmm, easy task, might take a few days</li>
<li>PoC: reading and polling from journal works as expected</li>
<li>used postfix logfiles in syslog format and wrote regexps matching them iteratively</li>
<li>separated parsing messages from extracting delivery information</li>
<li>created a delivery table</li>
<li>hmm, this is very slow, takes hours to process log messages from a few days (from a server with not much traffic)</li>
<li>introduced polling timeout and SQL transactions handling several messages at once</li>
<li>... much faster</li>
<li>looks fine, but wait... did I catch all syntax variants of Postfix log messages?</li>
<li>looked into Postfix sources and almost got lost</li>
<li>weeks of hard work identifying relevant log output directives</li>
<li>completely rewrote parser to deal with the rich log msg syntax, e.g.:<br> <code>def _strip_pattern(msg, res, pattern_name, pos='l', target_names=None)</code></li>
<li>oh, there are even more Postfix components... limit to certain Postfix configurations, in particular virtual mailboxes and not local ones</li>
<li>mails may have multiple recipients... split delivery table into delivery_from and delivery_to</li>
<li>decide which delivery information is relevant</li>
<li>cleanup and polish (config mgmt, logging)</li>
<li>write ansible role</li>
</ul>
<h2>Structure</h2>
<svg viewBox="0 0 1216 400" xmlns="http://www.w3.org/2000/svg" xmlns:inkspace="http://www.inkscape.org/namespaces/inkscape" xmlns:xlink="http://www.w3.org/1999/xlink">
<defs id="defs_block">
<filter height="1.504" id="filter_blur" inkspace:collect="always" width="1.1575" x="-0.07875" y="-0.252">
<feGaussianBlur id="feGaussianBlur3780" inkspace:collect="always" stdDeviation="4.2" />
</filter>
</defs>
<title>blockdiag</title>
<desc>blockdiag {
default_fontsize = 20;
node_height = 80;
journal_since -&gt; run_loop;
journal_follow -&gt; run_loop;
logfile -&gt; run_loop;
run_loop -&gt; parse -&gt; extract_delivery -&gt; store;
store -&gt; delivery_from;
store -&gt; delivery_to;
store -&gt; noqueue;
group { label="input iterables"; journal_since; journal_follow; logfile; };
group { label="output tables"; delivery_from; delivery_to; noqueue; };
}
</desc>
<rect fill="rgb(243,152,0)" height="340" style="filter:url(#filter_blur)" width="144" x="56" y="30" />
<rect fill="rgb(243,152,0)" height="340" style="filter:url(#filter_blur)" width="144" x="1016" y="30" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="259" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="67" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="67" y="166" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="67" y="286" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="451" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="643" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="835" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="1027" y="46" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="1027" y="166" />
<rect fill="rgb(0,0,0)" height="80" stroke="rgb(0,0,0)" style="filter:url(#filter_blur);opacity:0.7;fill-opacity:1" width="128" x="1027" y="286" />
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="256" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="87" x="320.5" y="90">run_loop</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="64" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="120" x="128.0" y="79">journal_sin</text>
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="21" x="128.5" y="101">ce</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="64" y="160" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="120" x="128.0" y="199">journal_fol</text>
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="32" x="128.0" y="221">low</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="64" y="280" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="76" x="128.0" y="330">logfile</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="448" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="54" x="512.0" y="90">parse</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="640" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="120" x="704.0" y="79">extract_del</text>
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="54" x="704.0" y="101">ivery</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="832" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="54" x="896.0" y="90">store</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="1024" y="40" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="120" x="1088.0" y="79">delivery_fr</text>
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="21" x="1088.5" y="101">om</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="1024" y="160" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="120" x="1088.0" y="210">delivery_to</text>
<rect fill="rgb(255,255,255)" height="80" stroke="rgb(0,0,0)" width="128" x="1024" y="280" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="20" font-style="normal" font-weight="normal" text-anchor="middle" textLength="76" x="1088.0" y="330">noqueue</text>
<path d="M 384 80 L 440 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="447,80 440,76 440,84 447,80" stroke="rgb(0,0,0)" />
<path d="M 576 80 L 632 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="639,80 632,76 632,84 639,80" stroke="rgb(0,0,0)" />
<path d="M 768 80 L 824 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="831,80 824,76 824,84 831,80" stroke="rgb(0,0,0)" />
<path d="M 960 80 L 1016 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="1023,80 1016,76 1016,84 1023,80" stroke="rgb(0,0,0)" />
<path d="M 960 80 L 992 80" fill="none" stroke="rgb(0,0,0)" />
<path d="M 992 80 L 992 200" fill="none" stroke="rgb(0,0,0)" />
<path d="M 992 200 L 1016 200" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="1023,200 1016,196 1016,204 1023,200" stroke="rgb(0,0,0)" />
<path d="M 960 80 L 992 80" fill="none" stroke="rgb(0,0,0)" />
<path d="M 992 80 L 992 320" fill="none" stroke="rgb(0,0,0)" />
<path d="M 992 320 L 1016 320" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="1023,320 1016,316 1016,324 1023,320" stroke="rgb(0,0,0)" />
<path d="M 192 80 L 248 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="255,80 248,76 248,84 255,80" stroke="rgb(0,0,0)" />
<path d="M 192 200 L 240 200" fill="none" stroke="rgb(0,0,0)" />
<path d="M 240 200 L 240 80" fill="none" stroke="rgb(0,0,0)" />
<path d="M 240 80 L 248 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="255,80 248,76 248,84 255,80" stroke="rgb(0,0,0)" />
<path d="M 192 320 L 240 320" fill="none" stroke="rgb(0,0,0)" />
<path d="M 240 320 L 240 80" fill="none" stroke="rgb(0,0,0)" />
<path d="M 240 80 L 248 80" fill="none" stroke="rgb(0,0,0)" />
<polygon fill="rgb(0,0,0)" points="255,80 248,76 248,84 255,80" stroke="rgb(0,0,0)" />
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="16" font-style="normal" font-weight="normal" text-anchor="middle" textLength="122" x="128.0" y="38">input iter ...</text>
<text fill="rgb(0,0,0)" font-family="sans-serif" font-size="16" font-style="normal" font-weight="normal" text-anchor="middle" textLength="113" x="1088.5" y="38">output tables</text>
</svg>
<h2>Iterables</h2>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb3-1" title="1"><span class="kw">def</span> iter_journal_messages_since(timestamp: Union[<span class="bu">int</span>, <span class="bu">float</span>]):</a>
<a class="sourceLine" id="cb3-2" title="2"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-3" title="3"><span class="co"> Yield False and message details from the journal since *timestamp*.</span></a>
<a class="sourceLine" id="cb3-4" title="4"></a>
<a class="sourceLine" id="cb3-5" title="5"><span class="co"> This is the loading phase (loading messages that already existed</span></a>
<a class="sourceLine" id="cb3-6" title="6"><span class="co"> when we start).</span></a>
<a class="sourceLine" id="cb3-7" title="7"></a>
<a class="sourceLine" id="cb3-8" title="8"><span class="co"> Argument *timestamp* is a UNIX timestamp.</span></a>
<a class="sourceLine" id="cb3-9" title="9"></a>
<a class="sourceLine" id="cb3-10" title="10"><span class="co"> Only journal entries for systemd unit UNITNAME with loglevel</span></a>
<a class="sourceLine" id="cb3-11" title="11"><span class="co"> INFO and above are retrieved.</span></a>
<a class="sourceLine" id="cb3-12" title="12"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-13" title="13"> ...</a>
<a class="sourceLine" id="cb3-14" title="14"></a>
<a class="sourceLine" id="cb3-15" title="15"><span class="kw">def</span> iter_journal_messages_follow(timestamp: Union[<span class="bu">int</span>, <span class="bu">float</span>]):</a>
<a class="sourceLine" id="cb3-16" title="16"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-17" title="17"><span class="co"> Yield commit and message details from the journal through polling.</span></a>
<a class="sourceLine" id="cb3-18" title="18"></a>
<a class="sourceLine" id="cb3-19" title="19"><span class="co"> This is the polling phase (after we have read pre-existing messages</span></a>
<a class="sourceLine" id="cb3-20" title="20"><span class="co"> in the loading phase).</span></a>
<a class="sourceLine" id="cb3-21" title="21"></a>
<a class="sourceLine" id="cb3-22" title="22"><span class="co"> Argument *timestamp* is a UNIX timestamp.</span></a>
<a class="sourceLine" id="cb3-23" title="23"></a>
<a class="sourceLine" id="cb3-24" title="24"><span class="co"> Only journal entries for systemd unit UNITNAME with loglevel</span></a>
<a class="sourceLine" id="cb3-25" title="25"><span class="co"> INFO and above are retrieved.</span></a>
<a class="sourceLine" id="cb3-26" title="26"></a>
<a class="sourceLine" id="cb3-27" title="27"><span class="co"> *commit* (bool) tells whether it is time to store the delivery</span></a>
<a class="sourceLine" id="cb3-28" title="28"><span class="co"> information obtained from the messages yielded by us.</span></a>
<a class="sourceLine" id="cb3-29" title="29"><span class="co"> It is set to True if max_delay_before_commit has elapsed.</span></a>
<a class="sourceLine" id="cb3-30" title="30"><span class="co"> After this delay delivery information will be written; to be exact:</span></a>
<a class="sourceLine" id="cb3-31" title="31"><span class="co"> the delay may increase by up to one journal_poll_interval.</span></a>
<a class="sourceLine" id="cb3-32" title="32"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-33" title="33"> ...</a>
<a class="sourceLine" id="cb3-34" title="34"></a>
<a class="sourceLine" id="cb3-35" title="35"><span class="kw">def</span> iter_logfile_messages(filepath: <span class="bu">str</span>, year: <span class="bu">int</span>,</a>
<a class="sourceLine" id="cb3-36" title="36"> commit_after_lines<span class="op">=</span>max_messages_per_commit):</a>
<a class="sourceLine" id="cb3-37" title="37"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-38" title="38"><span class="co"> Yield messages and a commit flag from a logfile.</span></a>
<a class="sourceLine" id="cb3-39" title="39"></a>
<a class="sourceLine" id="cb3-40" title="40"><span class="co"> Loop through all lines of the file with given *filepath* and</span></a>
<a class="sourceLine" id="cb3-41" title="41"><span class="co"> extract the time and log message. If the log message starts</span></a>
<a class="sourceLine" id="cb3-42" title="42"><span class="co"> with &#39;postfix/&#39;, then extract the syslog_identifier, pid and</span></a>
<a class="sourceLine" id="cb3-43" title="43"><span class="co"> message text.</span></a>
<a class="sourceLine" id="cb3-44" title="44"></a>
<a class="sourceLine" id="cb3-45" title="45"><span class="co"> Since syslog lines do not contain the year, the *year* to which</span></a>
<a class="sourceLine" id="cb3-46" title="46"><span class="co"> the first log line belongs must be given.</span></a>
<a class="sourceLine" id="cb3-47" title="47"></a>
<a class="sourceLine" id="cb3-48" title="48"><span class="co"> Return a commit flag and a dict with these keys:</span></a>
<a class="sourceLine" id="cb3-49" title="49"><span class="co"> &#39;t&#39;: timestamp</span></a>
<a class="sourceLine" id="cb3-50" title="50"><span class="co"> &#39;message&#39;: message text</span></a>
<a class="sourceLine" id="cb3-51" title="51"><span class="co"> &#39;identifier&#39;: syslog identifier (e.g., &#39;postfix/smtpd&#39;)</span></a>
<a class="sourceLine" id="cb3-52" title="52"><span class="co"> &#39;pid&#39;: process id</span></a>
<a class="sourceLine" id="cb3-53" title="53"></a>
<a class="sourceLine" id="cb3-54" title="54"><span class="co"> The commit flag will be set to True for every</span></a>
<a class="sourceLine" id="cb3-55" title="55"><span class="co"> (commit_after_lines)-th filtered message and serves</span></a>
<a class="sourceLine" id="cb3-56" title="56"><span class="co"> as a signal to the caller to commit this chunk of data</span></a>
<a class="sourceLine" id="cb3-57" title="57"><span class="co"> to the database.</span></a>
<a class="sourceLine" id="cb3-58" title="58"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb3-59" title="59"> ...</a></code></pre></div>
<h2>Running loops</h2>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb4-1" title="1"><span class="kw">def</span> run(dsn, verp_marker<span class="op">=</span><span class="va">False</span>, filepath<span class="op">=</span><span class="va">None</span>, year<span class="op">=</span><span class="va">None</span>, debug<span class="op">=</span>[]):</a>
<a class="sourceLine" id="cb4-2" title="2"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb4-3" title="3"><span class="co"> Determine loop(s) and run them within a database context.</span></a>
<a class="sourceLine" id="cb4-4" title="4"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb4-5" title="5"> init(verp_marker<span class="op">=</span>verp_marker)</a>
<a class="sourceLine" id="cb4-6" title="6"> <span class="cf">with</span> psycopg2.<span class="ex">connect</span>(dsn) <span class="im">as</span> conn:</a>
<a class="sourceLine" id="cb4-7" title="7"> <span class="cf">with</span> conn.cursor(cursor_factory<span class="op">=</span>psycopg2.extras.RealDictCursor) <span class="im">as</span> curs:</a>
<a class="sourceLine" id="cb4-8" title="8"> <span class="cf">if</span> filepath:</a>
<a class="sourceLine" id="cb4-9" title="9"> run_loop(iter_logfile_messages(filepath, year), curs, debug<span class="op">=</span>debug)</a>
<a class="sourceLine" id="cb4-10" title="10"> <span class="cf">else</span>:</a>
<a class="sourceLine" id="cb4-11" title="11"> begin_timestamp <span class="op">=</span> get_latest_timestamp(curs)</a>
<a class="sourceLine" id="cb4-12" title="12"> run_loop(iter_journal_messages_since(begin_timestamp), curs, debug<span class="op">=</span>debug)</a>
<a class="sourceLine" id="cb4-13" title="13"> begin_timestamp <span class="op">=</span> get_latest_timestamp(curs)</a>
<a class="sourceLine" id="cb4-14" title="14"> run_loop(iter_journal_messages_follow(begin_timestamp), curs, debug<span class="op">=</span>debug)</a>
<a class="sourceLine" id="cb4-15" title="15"></a>
<a class="sourceLine" id="cb4-16" title="16"><span class="kw">def</span> run_loop(iterable, curs, debug<span class="op">=</span>[]):</a>
<a class="sourceLine" id="cb4-17" title="17"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb4-18" title="18"><span class="co"> Loop over log messages obtained from *iterable*.</span></a>
<a class="sourceLine" id="cb4-19" title="19"></a>
<a class="sourceLine" id="cb4-20" title="20"><span class="co"> Parse the message, extract delivery information from it and store</span></a>
<a class="sourceLine" id="cb4-21" title="21"><span class="co"> that delivery information.</span></a>
<a class="sourceLine" id="cb4-22" title="22"></a>
<a class="sourceLine" id="cb4-23" title="23"><span class="co"> For performance reasons delivery items are collected in a cache</span></a>
<a class="sourceLine" id="cb4-24" title="24"><span class="co"> before writing them (i.e., committing a database transaction).</span></a>
<a class="sourceLine" id="cb4-25" title="25"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb4-26" title="26"> cache <span class="op">=</span> []</a>
<a class="sourceLine" id="cb4-27" title="27"> msg_count <span class="op">=</span> max_messages_per_commit</a>
<a class="sourceLine" id="cb4-28" title="28"> <span class="cf">for</span> commit, msg_details <span class="kw">in</span> iterable:</a>
<a class="sourceLine" id="cb4-29" title="29"> ...</a></code></pre></div>
<h2>Parsing</h2>
<p>Parse what you can. (But only msg_info in Postfix, and only relevant components.)</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb5-1" title="1"><span class="kw">def</span> parse(msg_details, debug<span class="op">=</span><span class="va">False</span>):</a>
<a class="sourceLine" id="cb5-2" title="2"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb5-3" title="3"><span class="co"> Parse a log message returning a dict.</span></a>
<a class="sourceLine" id="cb5-4" title="4"></a>
<a class="sourceLine" id="cb5-5" title="5"><span class="co"> *msg_details* is assumed to be a dict with these keys:</span></a>
<a class="sourceLine" id="cb5-6" title="6"></a>
<a class="sourceLine" id="cb5-7" title="7"><span class="co"> * &#39;identifier&#39; (syslog identifier),</span></a>
<a class="sourceLine" id="cb5-8" title="8"><span class="co"> * &#39;pid&#39; (process id),</span></a>
<a class="sourceLine" id="cb5-9" title="9"><span class="co"> * &#39;message&#39; (message text)</span></a>
<a class="sourceLine" id="cb5-10" title="10"></a>
<a class="sourceLine" id="cb5-11" title="11"><span class="co"> The syslog identifier and process id are copied to the resulting dict.</span></a>
<a class="sourceLine" id="cb5-12" title="12"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb5-13" title="13"> ...</a>
<a class="sourceLine" id="cb5-14" title="14"></a>
<a class="sourceLine" id="cb5-15" title="15"><span class="kw">def</span> _parse_branch(comp, msg, res):</a>
<a class="sourceLine" id="cb5-16" title="16"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb5-17" title="17"><span class="co"> Parse a log message string *msg*, adding results to dict *res*.</span></a>
<a class="sourceLine" id="cb5-18" title="18"></a>
<a class="sourceLine" id="cb5-19" title="19"><span class="co"> Depending on the component *comp* we branch to functions</span></a>
<a class="sourceLine" id="cb5-20" title="20"><span class="co"> named _parse_{comp}.</span></a>
<a class="sourceLine" id="cb5-21" title="21"></a>
<a class="sourceLine" id="cb5-22" title="22"><span class="co"> Add parsing results to dict *res*. Always add key &#39;action&#39;.</span></a>
<a class="sourceLine" id="cb5-23" title="23"><span class="co"> Try to parse every syntactical element.</span></a>
<a class="sourceLine" id="cb5-24" title="24"><span class="co"> Note: We parse what we can. Assessment of parsing results relevant</span></a>
<a class="sourceLine" id="cb5-25" title="25"><span class="co"> for delivery is done in :func:`extract_delivery`.</span></a>
<a class="sourceLine" id="cb5-26" title="26"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb5-27" title="27"> ...</a></code></pre></div>
<h2>Extracting</h2>
<p>Extract what is relevant.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb6-1" title="1"><span class="kw">def</span> extract_delivery(msg_details, parsed):</a>
<a class="sourceLine" id="cb6-2" title="2"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb6-3" title="3"><span class="co"> Compute delivery information from parsing results.</span></a>
<a class="sourceLine" id="cb6-4" title="4"></a>
<a class="sourceLine" id="cb6-5" title="5"><span class="co"> Basically this means that we map the parsed fields to</span></a>
<a class="sourceLine" id="cb6-6" title="6"><span class="co"> a type (&#39;from&#39; or &#39;to&#39;) and to the database</span></a>
<a class="sourceLine" id="cb6-7" title="7"><span class="co"> fields for table &#39;delivery_from&#39; or &#39;delivery_to&#39;.</span></a>
<a class="sourceLine" id="cb6-8" title="8"></a>
<a class="sourceLine" id="cb6-9" title="9"><span class="co"> We branch to functions _extract_{comp} where comp is the</span></a>
<a class="sourceLine" id="cb6-10" title="10"><span class="co"> name of a Postfix component.</span></a>
<a class="sourceLine" id="cb6-11" title="11"></a>
<a class="sourceLine" id="cb6-12" title="12"><span class="co"> Return a list of error strings and a dict with the</span></a>
<a class="sourceLine" id="cb6-13" title="13"><span class="co"> extracted information. Keys with None values are removed</span></a>
<a class="sourceLine" id="cb6-14" title="14"><span class="co"> from the dict.</span></a>
<a class="sourceLine" id="cb6-15" title="15"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb6-16" title="16"> ...</a></code></pre></div>
<h2>Regular expressions</h2>
<ul>
<li><p>see sources</p></li>
<li><p><a href="https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression">Stackoverflow: How to validate an email address</a> <a href="https://i.stack.imgur.com/YI6KR.png">FSM</a></p></li>
</ul>
<h3>BTW: <a href="https://docs.python.org/3/library/email.utils.html#email.utils.parseaddr">email.utils.parseaddr</a></h3>
<div class="sourceCode" id="cb7"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb7-1" title="1"><span class="op">&gt;&gt;&gt;</span> <span class="im">from</span> email.utils <span class="im">import</span> parseaddr</a>
<a class="sourceLine" id="cb7-2" title="2"><span class="op">&gt;&gt;&gt;</span> parseaddr(<span class="st">&#39;Ghost &lt;&quot;hello@nowhere&quot;@pyug.at&gt;&#39;</span>)</a>
<a class="sourceLine" id="cb7-3" title="3">(<span class="st">&#39;Ghost&#39;</span>, <span class="st">&#39;&quot;hello@nowhere&quot;@pyug.at&#39;</span>)</a>
<a class="sourceLine" id="cb7-4" title="4"><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(parseaddr(<span class="st">&#39;&quot;more</span><span class="ch">\&quot;</span><span class="st">fun</span><span class="ch">\&quot;\\</span><span class="st">&quot;hello</span><span class="ch">\\</span><span class="st">&quot;@nowhere&quot;@pyug.at&#39;</span>)[<span class="dv">1</span>])</a>
<a class="sourceLine" id="cb7-5" title="5"><span class="co">&quot;more&quot;</span>fun<span class="st">&quot;</span><span class="ch">\&quot;</span><span class="st">hello</span><span class="ch">\&quot;</span><span class="st">@nowhere&quot;</span><span class="op">@</span>pyug.at</a>
<a class="sourceLine" id="cb7-6" title="6"><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(parseaddr(<span class="st">&#39;&quot;&quot;@pyug.at&#39;</span>)[<span class="dv">1</span>])</a>
<a class="sourceLine" id="cb7-7" title="7"><span class="co">&quot;&quot;</span><span class="op">@</span>pyug.at</a></code></pre></div>
<h2>Storing</h2>
<div class="sourceCode" id="cb8"><pre class="sourceCode python"><code class="sourceCode python"><a class="sourceLine" id="cb8-1" title="1"><span class="kw">def</span> store_deliveries(cursor, cache, debug<span class="op">=</span>[]):</a>
<a class="sourceLine" id="cb8-2" title="2"> <span class="co">&quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb8-3" title="3"><span class="co"> Store cached delivery information into the database.</span></a>
<a class="sourceLine" id="cb8-4" title="4"></a>
<a class="sourceLine" id="cb8-5" title="5"><span class="co"> Find queue_ids in *cache* and group delivery items by</span></a>
<a class="sourceLine" id="cb8-6" title="6"><span class="co"> them, but separately for delivery types &#39;from&#39; and &#39;to&#39;.</span></a>
<a class="sourceLine" id="cb8-7" title="7"><span class="co"> In addition, collect delivery items with queue_id is None.</span></a>
<a class="sourceLine" id="cb8-8" title="8"></a>
<a class="sourceLine" id="cb8-9" title="9"><span class="co"> After grouping we merge all items withing a group into a</span></a>
<a class="sourceLine" id="cb8-10" title="10"><span class="co"> single item. So we can combine several SQL queries into </span></a>
<a class="sourceLine" id="cb8-11" title="11"><span class="co"> a single one, which improves performance significantly.</span></a>
<a class="sourceLine" id="cb8-12" title="12"></a>
<a class="sourceLine" id="cb8-13" title="13"><span class="co"> Then store the merged items and the deliveries with</span></a>
<a class="sourceLine" id="cb8-14" title="14"><span class="co"> queue_id is None.</span></a>
<a class="sourceLine" id="cb8-15" title="15"><span class="co"> &quot;&quot;&quot;</span></a>
<a class="sourceLine" id="cb8-16" title="16"> ...</a></code></pre></div>
<p>Database schema: 3 tables:</p>
<ul>
<li>delivery_from: smtpd, milters, qmgr</li>
<li>delivery_to: smtp, virtual, bounce, error</li>
<li>noqueue: rejected by smtpd before even getting a queue_id</li>
</ul>
<p>Table noqueue contains all the spam; for this we only use SQL INSERT, no ON CONFLICT ... UPDATE; it's faster.</p>
<h2>Demo</h2>
<pre><code>...
</code></pre>
<h2>Questions / Suggestions</h2>
<ul>
<li>Could you enhance speed by using prepared statements?</li>
<li>Will old data be deleted (as required by GDPR)?</li>
</ul>
<p>Both were implemented after the talk.</p>