Welcome, Greg
“Greg Wilson”:http://www.third-bit.com/~gvwilson/ now has a “weblog”:http://www.third-bit.com/~gvwilson/blog/.
Late on Thursday afternoon I strolled off to Maxtor’s warranty website, entered the particulars of the failed drive, and requested that they send me a replacement right away. I did it this way so that they would ship the new drive in an “official” drive box, which I can then reuse to return the failed one.
On Friday, the RMA tracking website claimed that my new drive would ship within the next two business days. “Pretty good”, thought I, as I was expecting a longer turnaround.
Well, the new drive arrived this morning. I’m impressed…
At about 8:30 AM on Thursday October 30th, the hard drive failed. It was reporting itself on the bus, but with a garbage manufacturer string and a size of 0; not terribly useful.
Fortunately, I had a backup; I use rsync to keep a copy of the server filesystems on a machine at home. As luck would have it, I had restarted the backup by hand on Wednesday morning, so we only lost about 24 hours’ worth of stuff (the Thursday morning backup was still in its early stages when the disk crashed).
Unfortunately, I lost all of my MySQL databases. Most of them are replaceable, but it’s still annoying. It seems that @mysqlhotcopy@ “knows” that the list of tables fetched from the database is unquoted, so it does the quoting itself. Unfortunately, a newer version of MySQL (or @DBD::mysql@) appears to quote the table names already. Net result:
bq. DBD::mysql::db do failed: You have an error in your SQL syntax near ‘` READ, `apachelogs`.“access_blog_org“ READ, `apachelogs`.“access_fywss_com“’ at line 1 at /usr/bin/mysqlhotcopy line 438
So, no MySQL backups… *sigh*.
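For the record, here’s my reading of what goes wrong (a reconstruction from the error message, not the literal SQL that @mysqlhotcopy@ builds):

bc. -- what mysqlhotcopy means to generate:
LOCK TABLES `apachelogs`.`access_blog_org` READ, `apachelogs`.`access_fywss_com` READ;
-- what it generates when the table names arrive pre-quoted
-- (the doubled backticks are a syntax error):
LOCK TABLES `apachelogs`.``access_blog_org`` READ, `apachelogs`.``access_fywss_com`` READ;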
Anyway, 40Gb drives were on sale at ICCT, so I ran out last night and picked one up, restored the backup onto it, and swapped it into the server this morning. The old drive is about 15 months old and is still under warranty; Maxtor is sending me a replacement even as we speak, so I’ll end up with a spare drive at the end of this whole mess.
Moral: keep good backups :-)
*Update*: There’s some evidence that the last complete backup was actually on the 26th of October. I’ll have to investigate, because I know it’s been _running_ every day since then…
So I patched my Movable Type and set up the cron job to use “mt-publish-on”:http://www.mplode.com/tima/archives/000324.html. It runs, and switches the state on my articles from “Future” to “Publish”, but never actually publishes them (the content pages remain unchanged). I haven’t done much debugging yet, other than to figure out that mt-publish-on really does call the same Movable Type functions that are used internally to publish new pages… *sigh*.
If you can read this, it means two things: first, that my Movable Type setup can cut ’n’ paste UTF-8 strings, and second, that your reader works with my UTF-8 blog.
# Ħäřáŀđ Ķòċĥ (eight-bit, upper code page)
# ÐаÑаλδ Κόςн (multi-byte)
Inspired by “Ned Batchelder”:http://www.nedbatchelder.com/blog/200310.html#e20031017T081251.
So another comment spammer took down my server this weekend. It seems that it takes Movable Type over 8 seconds to rebuild the blog.org pages after a comment has been posted (the category pages are large, and get re-written to update the comment count in the summary). If a spammer tries POSTing to several comments pages at the same time, or does so over a relatively short period of time, I get a whole bunch of mt-comments.cgi scripts running simultaneously.
At 8Mb (of working memory) each, it doesn’t take long for them to max out the memory on my wimpy 128Mb box, at which point paging starts, slowing everything down and making the problem worse. As more HTTP requests show up, and cron scripts run, the box starts thrashing (i.e. spending all of its resources moving pages in and out instead of accomplishing useful work). I couldn’t even SSH into the box; the SSH negotiation was timing out after a few minutes.
Usually I have to ask my host to physically reset the server, but this time it was a long weekend. Fortunately I had a remote shell lying around. But it took two _days_ to run su, type my password, and kill off the offending httpd and mt-comments.cgi processes. In the meantime, many other important daemons had been killed due to out-of-memory, and the box was completely ignoring web requests and e-mail sessions; in short, the machine was a mess.
MT-Blacklist is due out today, and I intend to install it, but it won’t help this problem; by the time mt-comments.cgi is being exec()ed, it’s already too late.
So instead I wrote a simple locking wrapper for mt-comments.cgi. It’s in C, so it’s tiny (working memory is 306Kb instead of 8Mb; still way too large, but much better). It grabs a lock file before running mt-comments.cgi, so that only one instance is running at a given time. I’m hoping this will prevent the box from falling off the ‘net the _next_ time a comment spammer shows up.
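For the curious, the guts of the wrapper look something like this (a minimal sketch; the lock file location and the renamed CGI path are placeholders for whatever your setup uses):

bc. #include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>
#define LOCKFILE "/var/tmp/mt-comments.lock"
#define REAL_CGI "/path/to/mt-comments.cgi.real"
int main(int argc, char *argv[])
{
    /* Open (or create) the lock file, then block until we hold an
       exclusive lock; this serializes comment postings. */
    int fd = open(LOCKFILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || flock(fd, LOCK_EX) < 0) {
        perror("mt-comments wrapper");
        return 1;
    }
    /* Replace this process with the real CGI; the kernel releases
       the lock when that process finally exits. */
    execv(REAL_CGI, argv);
    perror("execv");
    return 1;
}

Because the lock is held across the @execv()@, it isn’t released until the real mt-comments.cgi exits, so comments are processed strictly one at a time.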
I’ve also dropped the value of MaxClients in my Apache config, to prevent too many simultaneous Apache processes from starting up (since this will also eat the virtual memory system for breakfast).
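That change is a one-liner in httpd.conf; the value below just illustrates the idea, and is not a tuned recommendation:

bc. # Cap the number of simultaneous Apache children, so that a burst
# of requests can't push a 128Mb box into swap.
MaxClients 20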
I wish there was a better way to do load shedding in this context, but I can’t think of one off-hand…
Sorry to hear about the comment spammer… I was wondering what happened to the box though.
About MT-Blacklist: it appears to work before the mt.cgi is executed, in that it checks material in the posting before the posting occurs. It also halts the posting if there are hyperlinks to dubious-sounding Web pages.
I was wondering what happened to you. You weren’t here when I checked in Monday.
I dread the same thing happening to whiterose. You’ll have to keep us posted on how well your spam-combatting measures are working.
“Russell Beattie”:http://www.russellbeattie.com/notebook/ blogged about an “XML + XSL Trick”:http://www.russellbeattie.com/notebook/1004309.html to convert RSS feeds into something that renders readably in modern browsers.
I decided to “try it out”:http://blog.cfrq.net/chk/index.rdf, and discovered in the process just how picky XML is; I had to change my SmartyPants installation to output UTF-8 sequences instead of “HTML Entities”:http://www.htmlhelp.com/reference/html40/entities/. I stopped there; although I intended to see if I could modify the “rss2html.xsl”:http://blog.cfrq.net/chk/rss2html.xsl script to work with RSS 2.0, I never found the time…
I’ve updated my “SSL Help Pages”:http://www.cfrq.net/ssl/ to include a description of the steps required to get Outlook Express and Eudora to work with the SSL-protected mailservers here.
The Eudora configuration is a little ugly; there’s a buglet in Eudora’s TLS implementation that means you have to force Eudora to use SSL 3.0, and doing so requires editing the eudora.ini file :-). Also, Eudora can’t configure certificate trust settings until _after_ you attempt-but-fail an SSL negotiation, so you have to stand on your head a bit.
Anyway, the new page is “Eudora SSL Help”:http://www.cfrq.net/ssl/eudora.html. Some people might find it useful for connecting to other SSL-enabled mailservers using Eudora.
(After I wrote the page I found another similar page, with similar screen shots, over at “Oxford’s FMRIB”:http://www.fmrib.ox.ac.uk/computing/docs/mailclients/eudora.html. Honest, I wrote mine -first- +before I read theirs+ :-)
A small network reconfiguration at our host site knocked the server off the air on Saturday afternoon. For many different (entirely reasonable) reasons, we couldn’t restore service until Monday night.
We’re back now, though.
I saw “Russell’s post”:http://www.russellbeattie.com/notebook/1004309.html that discussed adding an XSL stylesheet to the site RSS feed, so that people who click on it get a pretty display instead of the ugly raw XML.
In the process of copying this to my blog, I re-discovered that “SmartyPants”:http://daringfireball.net/projects/smartypants/ was spitting out “HTML entities”:http://www.htmlhelp.com/reference/html40/entities/ in decimal, which look ugly in XML. (HTML entities in their text form don’t look any better). My blog has been XHTML and UTF-8 since I started reading “dive into mark”:http://www.diveintomark.org, so I modified my copy of SmartyPants to spit out UTF-8 sequences instead of HTML entities. In the process, I had to:
* turn off HTML entity processing by Movable Type, by setting @NoHTMLEntities 1@ in @mt.cfg@; otherwise HTML::Entities was converting my UTF-8 sequences right back into HTML entities…
* explicitly set my Movable Type charset to UTF-8 (set @PublishCharset@ in @mt.cfg@). This gets used in the Movable Type edit pages, so I can now paste non-ASCII characters into entries and have them come out the other end as UTF-8, instead of as raw ISO 8859-1 bytes (which aren’t valid UTF-8 sequences). The resulting @mt.cfg@ lines are sketched below.
* get rid of a couple of leftover @”charset=iso-8859-1″@ tags in my templates.
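For reference, the relevant @mt.cfg@ entries end up looking something like this (a sketch; check the Movable Type documentation for the exact spellings your version expects):

bc. # Stop MT from converting high-bit characters into HTML entities.
NoHTMLEntities 1
# Edit and publish everything as UTF-8.
PublishCharset utf-8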
Only the “RSS 1.0 feed”:/chk/index.rdf works so far; I don’t have a corresponding XSL stylesheet for RSS 2.0…
Interesting log entries:
61.181.5.155 - - [31/Aug/2003:04:33:37 -0400] "GET /chk/archives/20030205152634.html"
61.181.5.155 - - [31/Aug/2003:04:35:23 -0400] "POST /chk/cgi/mt-comments.cgi"
61.181.5.155 - - [31/Aug/2003:04:45:06 -0400] "POST /chk/cgi/mt-comments.cgi"
61.181.5.155 - - [31/Aug/2003:04:45:10 -0400] "GET /chk/archives/000203.html"
It took almost two minutes for some loser to type the latest pr0n-based comments spam into my comments form and post it, then s/he spent 9 minutes previewing the comment before saving it? Yeesh.
I would have been much less surprised if it had all happened in seconds; given the initial Google search for “blog 2003 august Name: Email Address: URL: Comments:”, it would be trivially easy to automate posting a comment via Movable Type’s CGI interface. I suppose I should be happy that it isn’t automated; I’d then have to write the corresponding automated delete tool… *sigh*.
It probably doesn’t matter, but <PLONK> regardless.
I’ve knocked off 9K, or about 1/3rd, of the size of an average blog entry. I reduced my blogroll somewhat (mainly, I ditched all the A-list blogs that _everyone_ lists :-), I took out the wordcount stuff, and I dumped all the silly icons. It seems to be making a difference, although it’s a little early to tell.
I don’t know why I care, really; “roomie’s blog”:http://blog.org/ is pushing 50Mb/day with his 100Kb category archives :-) Still, a little efficiency here and there never hurts…
In the last week my blog traffic has gone from a sedate 4Mb/day to an average of 16Mb per day, with a peak last Thursday of 32Mb. A quick perusal of the logs shows that “this entry”:http://blog.cfrq.net/chk/archives/2003/08/17/blackout-satellite-images/ is extremely popular; it’s on the first page of Google for several obvious search terms. (My weblog is on the first page for several other interesting search terms too; I wonder what I’m doing to drive this?)
Anyway, while I was investigating, I looked at the per-page byte counts, using my nifty “MySQL-based Apache logs”:http://jeremy.zawodny.com/blog/archives/000407.html, and I discovered that an empty weblog posting here is around 26Kb in size! This is due to the sidebar (mainly the blogroll and the total word count), and then the HTML overhead. By the time you add in all of those little buttons, the byte count doubles.
Yikes! It’s a good thing that HTML compresses well, and that most browsers speak gzip; I have “mod_gzip”:http://webcompression.org/ installed, so the actual bytes_sent value is around 8K for most entries (not including the buttons). I think it may be time to do some trimming.
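For anyone curious, the core of a mod_gzip setup is only a few lines of Apache config (a sketch from memory; see the mod_gzip documentation for the full set of directives):

bc. # Turn on mod_gzip and compress textual responses.
mod_gzip_on Yes
mod_gzip_item_include mime ^text/.*
mod_gzip_item_include file \.html$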
There are some easy tricks that’ll get me 25% or so: remove the (superfluous) buttons (the humour is getting old anyway), change or remove the total wordcount stuff, and trim the blogroll.
One gross hack would be to use frames, evil though they are; then most visitors would download the sidebar only when it changed, instead of for every entry. I don’t think that’ll really help much, since most of my traffic seems to be Google driven. Naughty googlers they are, too; the top traffic generators (after “blackout satellite pictures”) are still:
* “Female Nudity”:http://blog.cfrq.net/chk/archives/2003/01/06/female-nudity/
* “Trends in Playboy Models”:http://blog.cfrq.net/chk/archives/2002/12/20/trends-in-playboy-models/
Strangely, they’ve switched places from “the last time I looked”:http://blog.cfrq.net/chk/archives/2003/02/20/proving-once-again-that-its-all-porn-and-money/.
Anyway, I’m not too worried at this point. Even 16Mb/day isn’t much; we’re doing 8Mb/day of SMTP traffic in each direction…
This is one of the reasons I blocked Google from my site. While it did drive traffic sometimes, it also herded a lot of trolls to my doorstep.
We have a “new roommate”:http://blog.org/. Hopefully he won’t play the stereo too loud…
Anything you’d particularly like to hear when I crank up the speakers and put them against the wall? ;-)
In case you didn’t notice…
… and there is no reason you should – this weblog has left Reid’s machine and is now hosted by…
I have 19 entries that are still in “Draft” status, and an electricity rant brewing in the back of my head (and my pocket; I took notes). So over the next week or two I’m going to resurrect some of the old stuff. I just need a couple of material components…
I have 32 draft postings dating back to March. Must… blog… more…
A big “Thank you!” to my generous host, who not only cleanly shut down my server during the blackout (it was on the UPS) but also brought it back up this morning before she went home for the day…
That, I would take it, would be Michelle, n’est-ce pas?
Perverse Access Memory: Blackholing Spam
bq. We’ve started using realtime blackhole lists (RBLs) to stop some of the spam that whiterose mail users are seeing.
My SPAM escalation:
* some homegrown procmail filters. That lasted a couple of years, but was too hard to keep up-to-date.
* “junkfilter”:http://junkfilter.zer0.org/ was quite good, but I still had problems keeping it up-to-date.
* I now use “spambayes”:http://www.spambayes.org/. It is available as a procmail filter, as a POP/IMAP proxy, and as an Outlook plugin. I’m now using this everywhere. It uses statistical techniques to learn the characteristics of your incoming mail, and filters accordingly. It is surprisingly accurate :-)
* In parallel with spambayes, I switched cfrq.net from sendmail to postfix, which has a whole bunch of useful anti-spam technology built into the SMTP listener; the theory is that it is better to reject SPAM during the SMTP session than it is to deal with it later.
* Sadly, I had to disable some of postfix’s filters, because it was trapping too much legitimate e-mail :-) Many of my correspondents work for companies that can’t seem to configure their DNS or their SMTP servers properly, and educating / whitelisting them was taking too much time :-)
* I have now given up, and started testing some (conservative) RBLs, with reasonably good results; they’re now installed full-time (a sketch of the postfix configuration follows this list). I’m currently evaluating bl.spamcop.net, but I’m getting too many false positives (for me), because it tends to pick up the MTAs of large companies and ISPs.
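For concreteness, the RBL checks amount to a couple of lines in postfix’s @main.cf@ (a sketch assuming postfix 2.x syntax; the zones shown are just the ones under discussion):

bc. # Reject mail from SMTP clients listed in these DNS blackhole lists.
smtpd_client_restrictions =
    reject_rbl_client sbl.spamhaus.org,
    reject_rbl_client bl.spamcop.net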
Every time I’ve tried SpamAssassin I’ve had trouble with it; but naturally, your mileage may vary :-)
We’re in a war, and the spammers are as smart as we are. I’m already seeing SPAM specifically designed to foil statistical filtering. As with all escalations, the solution seems to be to make my host less attractive to spammers than someone else’s…
I like the RBLs from http://blackholes.us/; cn-kr.blackholes.us in particular stops a lot of our spam.
In fact, I need to add more of them. I’m hoping they’ll consolidate more of their lists. asia.blackholes.us would be nice…
I also use
sbl.spamhaus.org – a conservative list recommended by a very knowledgeable regular poster to the Stalker SIMS mailing list
dialups.relays.osirusoft.com – known dialup open relays
socks.relays.osirusoft.com – known open SOCKS proxies
To date we haven’t had any good mail caught, but we may. We’ll see.
Also using the new version of Eudora (beta 6) which has a beysean Junk Filter function. I’m really starting to see a lot less of the spam I get sent.
Bayesian would be the other (i.e. correct) spelling. blackholes.us lists several anti-spam products that can use the RBLs…
<laughter> never post before coffee.
I’m going to move on to test spamhaus next; I’ve heard good things about them. I tried relays.osirusoft.com once, but got way too many false positives; I’ve never tried their individual subdomains.
I’m currently using (recommended by the postfix-users list):
(155) proxies.relays.monkeys.com – socks/HTTP proxies
(91) list.dsbl.org – single-stage relays, proxies, and formmail sources
The number is the number of messages rejected by the RBL in the last week. I used to use relays.ordb.org, because they are described as “extremely conservative”, but I didn’t get any rejects from them other than a mailhost that I was forced to whitelist.
I’ve just tested bl.spamcop.net for a week. It rejected 28 messages that no one else did, but it lists all of the servers for groups.msn.com, generating false positives.
I forgot to add: I’ve been using a (postfix specific) log scanner to tell my users when email to them has been rejected by an RBL; I’ve had a couple of false positives reported that way :-)
I second the recommendation for Eudora. It’s getting some false positives (mostly via my yahoo account, which classifies some solicited commercial email/mailing lists as spam) but has been excellent about learning from its mistakes.
Can anybody help me in configuring Exchange 2003 to check blocklists using sbl.spamhaus.org?
Kindly reply at the earliest and oblige by doing the needful.
If possible PLEASE mail at vrbhatt@hotmail.com
Thank you.
Several webloggers who use “cornerhost”:http://www.cornerhost.com/ found themselves “suddenly relocated”:http://inessential.com/?comments=1&postid=2473 over the weekend (“for the record”:http://cornerhost.blogspot.com/, it wasn’t really cornerhost’s fault).
cornerhost’s policy is that users are responsible for their own backups. Naturally, some people found out that this was true :-)
As part of the ensuing chaos, someone pointed out Mike Rubel’s article “Easy Automated Snapshot-Style Backups with Linux and Rsync”:http://www.mikerubel.org/computers/rsync_snapshots/.
I’ve been a sysadmin for over 15 years. I’m a *big* fan of backups. (I’m _very_ unhappy that my tape drive is broken right now. :-) I’ve been doing nightly full backups of my servers using rsync for a long time, but the technique Mike uses for incrementals never occurred to me (blush). A minor change to a couple of scripts was all it took to give me a week’s worth of snapshots on the backup hosts. Fabulous!
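The core of the trick, roughly as I’ve adapted it (the paths, hostname, and retention count are placeholders; see Mike’s article for the careful version):

bc. #!/bin/sh
# Rotating snapshots a la Mike Rubel: snapshot.0 is the current
# mirror; older snapshots share unchanged files via hard links,
# so each extra day costs almost no disk.
BACKUP=/backup/server
rm -rf $BACKUP/snapshot.6
for i in 5 4 3 2 1; do
    [ -d $BACKUP/snapshot.$i ] && mv $BACKUP/snapshot.$i $BACKUP/snapshot.$((i+1))
done
[ -d $BACKUP/snapshot.0 ] && cp -al $BACKUP/snapshot.0 $BACKUP/snapshot.1
# rsync writes changed files to new inodes, so the hard links in
# snapshot.1 keep pointing at yesterday's contents.
rsync -a --delete server:/home/ $BACKUP/snapshot.0/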
Thanks, “Mike”:http://www.mikerubel.org/!
“Greg”:http://www.third-bit.com/ has a few new students starting this summer. Time to update the default user profiles and “create new account” software to make this easier, since I do it so seldom and keep forgetting all of the steps. I’m thinking of using either LDAP or MySQL for authentication, or finding an /etc/passwd-based auth module for Apache and Samba; any of these would let me use the same passwords everywhere on the system.
I recently converted from “uw-imap”:http://www.washington.edu/imap/ to “courier-imap”:http://www.inter7.com/courierimap/INSTALL.html. Courier uses maildir instead of mbox format. Webmail is now much faster, since IMAPD does not lock and parse the entire mail spool for every web-click! OTOH, this means no more mail(1) or pine; aw, shucks.
I’ve been cleaning up the “main CFRQ page”:http://www.cfrq.net/ and the top-level stylesheet a bit. Not really sure why, or where I’m going with that-which-loosely-qualifies-as-a-design. I also finally got the default “VirtualHost”:http://httpd.apache.org/docs/mod/core.html#virtualhost (“persephone.cfrq.net”:http://persephone.cfrq.net/) working again, so I think I’m going to move all of the local stuff (and that installed by “RedHat”:http://www.redhat.com/) back to that page, and then replicate it to “hermione”:http://hermione.cfrq.net/.
Are there any imap daemons that use MySQL for mail? And, if there are, are there any command-line clients that can parse/use these mailboxes?
Hm, one reason I like mbox format is that I can burn it to CD and read it with software 20 years later without a problem. Maildir is what (ex)mh uses, right? Where a mailbox is a dir and messages are files? Sort of like seeing your mailbox explode. :-)
That would be fine with something like ReiserFS, which is optimized for small files, but I think it would kill ext2 if you have huge mailboxes (which I do; several over 1000 messages).
I guess for archival, if I had some SQL thing, I could have a script that spit everything out in mbox or something..
I’m sure there are software packages out there that store e-mail in a MySQL database; I haven’t researched that specifically. Google is your friend :-)
My archived mail is currently 507Mb (wow!), in 47450 files (with a couple of control files in each folder, I have slightly fewer actual messages). In practice, I don’t have any trouble with using MH format; EXMH as a GUI hides that detail, and the MH/NMH command line tools are very easy to use (they were designed that way, after all).
Maildir format is a little weird; the filenames are long and somewhat unintelligible, so using standard command line tools is more challenging. I’ve ended up with a few perl/python scripts to make life easier.
Mutt speaks maildir format directly, as does courier IMAP; between the two, it’s easy to manipulate my mailboxes. Also (as you mentioned) I keep my primary mail on my laptop in MH format; the maildir stuff is a) for other cfrq.net users that use IMAP and/or webmail, and b) for when I’m travelling and using webmail instead of my laptop.
“persephone.cfrq.net”:http://www.cfrq.net/ is also “herne.third-bit.com”:http://www.third-bit.com/, hosting a couple of “UofT”:http://www.utoronto.ca/ “Computer Science”:http://www.cs.toronto.edu/DCS/index.html “project courses”:http://www.artsandscience.utoronto.ca/ofr/calendar/crs_CSC.htm#CSC494H1. The students are in the final crunch of developing a servlet-based application.
The students have set things up so that they are sandboxed from each other (a good idea). Unfortunately, this means that tomcat is loading separate copies of all of the support classes, one for each student.
The net result is that tomcat wants twice as much memory as is available on the box, causing aggressive paging activity. Expect both the webserver and e-mail to be a little slow for the next couple of weeks, until they’re done…
It works in Safari.