The Blog of Harald

code snippets

It took _way_ too much effort to get those blobs of code to display properly. Even inside PRE and CODE tags, WordPress was mangling stuff. I downloaded “David House’s grabcode plugin”:http://xmouse.ithium.net/archives/2004/07/19/implementing-a-code-snippet-system, but then I had to play with my CSS to get the results to look acceptable (and they’re still wrong in the RSS feeds, but I give up for now).

It’s done now, and the _next_ time will be easy :-)

posted at 9:56 am on Thursday, September 30, 2004 in Programming, Site News | Comments Off

WordPress and HTTP Conditional GET

I was looking at my logs, because I was seeing a lot of traffic on my XML feeds. Clichéd, perhaps, after the Microsoft debacle :-). I was looking because my RSS2 feed had risen to the top of the daily traffic statistics.

I did notice that many aggregators still aren’t using Conditional GET, and that some others don’t support gzip. But that wasn’t it… Instead, a full 25% of hits resulted in HTTP Code 304 (not modified), and yet were _also_ transferring the full RSS feed (sometimes compressed, sometimes not). Digging in the code, I found (in the Conditional GET logic in wp-blog-header.php):

I moved the exit to the right place, and now all my Conditional GET clients are downloading 0 bytes :-). Here’s the fixed version:
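The original listings are not reproduced here. As a rough illustration of the idea only (a Python sketch, not the wp-blog-header.php code; the `respond` helper and its arguments are made up for this example), a handler that honours Conditional GET has to stop before it writes the body once it has decided on 304:

```python
from email.utils import formatdate, parsedate_to_datetime


def respond(request_headers, last_modified_ts, body):
    """Return (status, headers, body) for a feed request.

    Hypothetical helper for illustration; this is not WordPress code.
    """
    headers = {"Last-Modified": formatdate(last_modified_ts, usegmt=True)}

    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            since_ts = parsedate_to_datetime(since).timestamp()
        except (TypeError, ValueError):
            since_ts = None  # unparseable date: fall through to a full response
        if since_ts is not None and int(last_modified_ts) <= since_ts:
            # Exit here: a 304 response must not carry the feed.
            return 304, headers, b""

    # No usable validator, or the feed changed: send everything.
    return 200, headers, body
```

The point is only where the early return sits: if it comes after the body has been queued, the client gets a 304 status and the full feed anyway.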

posted at 8:58 am on Thursday, September 30, 2004 in Programming, Site News | Comments (4)

Reid says:

2004-09-30 at 12:11

I use “wget” to download my RSS feeds onto tnir to produce my news.tnir.org page. Now wget has an option, “-N, --timestamping”, that tells it “don’t re-retrieve files unless newer than local”. However, I also use another option, “--output-document=FILE”, because I don’t want 200 files named “index.html”. When you use that option, it turns off the time-stamping feature. Bleaugh.

I downloaded the code to see if I could fix it, but it turns out that the code is structured in such a way that redirecting the output is a major fork in the code path. *sigh* Maybe one day I’ll get around to it..
Reid says:

2004-09-30 at 16:33

Whoa, some of my HTML leaked through there. The comments info says bold tags are allowed, and I certainly did not type any “del” tags..
Harald says:

2004-09-30 at 18:44

Apparently Textile is processing comments, too; text surrounded by “-” is marked with del tags, and text surrounded by + is marked with ins tags.

Seems to be haphazard though, esp. since it appears that textile is adding tags, and then the HTML sanitizer is stripping (some of) them out…
Traffic Statistics Webmaster says:

2004-10-04 at 05:19

Hi Harald,

just stumbled upon your blog and found a nasty bug there:
You have a self-signed SSL cert for some parts of your blog; if a user doesn’t accept the self-signed cert, your blog layout breaks completely.

Greetings from Berlin/Germany

dumb crawlers

I wonder who is teaching people that it is acceptable to download the entire content of a website, as fast as possible, and then come back and do it again the next day (and the next)…

larbin has been added to the list of blocked robots. Buh-bye!

posted at 9:19 am on Tuesday, September 28, 2004 in Site News | Comments Off

Internet Explorer image caching revisited

A few days ago “I complained about Internet Explorer”:http://blog.cfrq.net/chk/archives/2004/08/21/argh-msie-and-bandwidth/

Google searching leads me to believe that MSIE doesn’t send If-Modified-Since: headers for images (and possibly other files, like CSS); instead, it expects to see an Expires: header in the HTTP response (It will also apparently listen to Cache-Control: headers). The beauty of standards is that there are so many to choose from…

More Googling led me to the following configuration directives for Apache:

   # These directives require mod_expires to be loaded
   ExpiresActive On
   ExpiresByType image/gif "access plus 1 week"
   ExpiresByType image/jpeg "access plus 1 week"
   ExpiresByType image/png "access plus 1 week"

(It’s possible that image/* will work; I haven’t tried it).
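To make the effect of “access plus 1 week” concrete, here is a small Python sketch of the headers such a rule should produce. It is an illustration, not Apache’s implementation; the `cache_headers` helper is invented for the example.

```python
from email.utils import formatdate

ONE_WEEK = 7 * 24 * 60 * 60  # seconds


def cache_headers(access_ts, lifetime=ONE_WEEK):
    """Build expiry headers relative to the time of the request.

    Hypothetical helper: mirrors the idea of 'access plus 1 week',
    i.e. the expiry time is counted from when the client fetched the file.
    """
    return {
        "Expires": formatdate(access_ts + lifetime, usegmt=True),
        "Cache-Control": f"max-age={lifetime}",
    }
```

A browser that honours either header can reuse its cached copy of the image for a week without asking the server again.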

I hope this helps someone else; I hope it helps me remember next time :-)

posted at 10:50 am on Thursday, August 26, 2004 in Site News | Comments Off

Argh; MSIE and bandwidth

It appears that if you set the Cache settings in IE to “Automatically” or “Every visit to the page”, then every time you visit a page at blog.cfrq.net IE fetches all page objects (page, CSS, favicon, embedded images). For some of them, it is sending the If-Modified-Since: header (I see 304 responses for the blog CSS, for example), but it does not seem to be sending If-Modified-Since: for the banner JPEGs. This means that MSIE visitors download the banners several times in a row as they browse the site. This not only wastes my bandwidth, but it also interferes with their experience, since they have to wait for the banner to download on every page visit.

I’ve noticed IE doing this before on the client side with image-intensive applications (like MovableType :-), but I hadn’t investigated until recently, when a small increase in visitors to my blog site _doubled_ the bandwidth used…

Is this a known IE bug? Is there anything I can do on the server side to work around it? The investigation continues…

posted at 8:44 am on Saturday, August 21, 2004 in Rants, Site News | Comments (2)

Reid says:

2004-08-23 at 15:13

You could conditionally use a low-res substitute for IE users..
Harald says:

2004-08-25 at 09:31

An excellent suggestion, and trivial to implement. Since WordPress already shoves a bunch of rewrite rules into a .htaccess file, it was easy to add another one to conditionally rewrite the .jpg URLs for MSIE users. I’ve compressed the JPEGs to about 20% of their original size. The quality suffers, but less than I expected it would…

spam source

Ok, so it turns out that all (well, 125 of 126 :-) of the spam I’m getting these days is coming through my pobox.com address. The greylisting is working fine, in other words :-)

It’s been great having a portable email address, but now that I pay real money for my own domains, maybe it is time to switch over. I can do more accurate spam filtering on my personal server than they can on their shared servers. Unfortunately, the massive spam volumes floating around these days are forcing us to these drastic measures. I’m beginning to believe the pessimists; e-mail is dying…

posted at 10:57 pm on Sunday, August 08, 2004 in General, Security, Site News | Comments (1)

The Blog of Harald says:

2004-08-17 at 09:11

Re: oops
So after looking at “the mail I accidentally misfiled”:http://blog.cfrq.net/chk/archives/2004/08/16/oops/ there were, in fact, about 150 spam (almost 50%).

pobox.com has completely revamped their spam filtering service since I last looked; I can n…

greylist results revisited

So maybe I spoke too soon; in “greylist results”:http://blog.cfrq.net/chk/archives/2004/07/14/greylist-results/ I said that my spam volume had gone way down. Well, it has come back up again. I’ll have to write scripts to prove it, but I have a theory.

Machines owned by spammers are being used relatively infrequently, maybe to reduce the chances of getting detected and blacklisted? So the first time a spam host shows up, it gets greylisted. But if they show up again a day or a week later, they get past the greylist filter, because they’re now in the cache (but haven’t been expired yet).

Maybe a fix would be to put two cache timeouts in; the first would be for machines that have not yet successfully delivered a message (i.e. by retrying the original delivery), and would be relatively short, probably less than a day. The second would be the existing long timeout for machines that have already passed the first test.

That would eliminate spam machines that only show up infrequently. I don’t know whether it is worth the effort, though.
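The two-timeout idea can be sketched in a few lines of Python. This is not postgrey code; the class name, the cache key, and all three durations (a 5-minute minimum delay, a 12-hour probation window, a 35-day trusted window) are assumptions chosen for illustration.

```python
class TwoStageGreylist:
    """Greylist with a short probation timeout and a long trusted timeout."""

    def __init__(self, min_delay=300, probation=12 * 3600, trusted=35 * 24 * 3600):
        self.min_delay = min_delay    # earliest acceptable retry, in seconds
        self.probation = probation    # short timeout: has not passed yet
        self.trusted = trusted        # long timeout: has already passed once
        self.pending = {}             # key -> time of first attempt
        self.passed = {}              # key -> time of last accepted delivery

    def check(self, key, now):
        """Return True to accept, False to answer with a temporary reject."""
        last_ok = self.passed.get(key)
        if last_ok is not None and now - last_ok <= self.trusted:
            self.passed[key] = now
            return True
        self.passed.pop(key, None)

        first = self.pending.get(key)
        if first is None or now - first > self.probation:
            # Unknown host, or it waited too long to retry: start over.
            self.pending[key] = now
            return False
        if now - first < self.min_delay:
            return False
        # Retried inside the probation window: promote to the long timeout.
        del self.pending[key]
        self.passed[key] = now
        return True
```

With these numbers a host that retries after ten minutes is accepted and stays trusted, while one that shows up once a day or once a week is greylisted every time.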

On the plus side, greylisting _is_ still keeping out the virus traffic…

posted at 12:18 pm on Sunday, August 08, 2004 in Security, Site News | Comments Off

Comments

At the request of “a loyal reader”:http://rae.tnir.org/ I’ve reintroduced comments on the main page and in the RSS and Atom feeds. Enjoy.

posted at 1:03 pm on Wednesday, July 28, 2004 in Site News | Comments (1)

Reid says:

2004-08-02 at 23:17

Thanks Harald! Btw, my web site is just http://rae.tnir.org these days. :-)

greylist results

It’s been a week since “I installed postgrey”:http://blog.cfrq.net/chk/archives/2004/07/06/greylist/.

Wow!

My spam volume has dropped back to manageable levels; 10-20 per day (maximum). Even better, I’m no longer getting 10s of those encrypted ZIP file viruses every day; greylisting stops them all dead (at least so far :-).

I suppose the spammers will eventually figure it out, and start running mail queues, but (in theory) it should be easier to pick those up via DNS block lists…

posted at 2:36 pm on Wednesday, July 14, 2004 in Site News | Comments Off

greylist

Spam volumes have been rising continually around here. I started my foray into automated spam filtering a couple of years back; at the time, I was receiving about 100 per _quarter_. Now I’m getting almost 100 per _day_.

I needed an excuse to upgrade my “postfix”:http://www.postfix.org/ install to the new “2.1 release”:ftp://ftp.utoronto.ca/mirror/packages/postfix/index.html, so I decided to install “postgrey”:http://isg.ee.ethz.ch/tools/postgrey/, a “greylisting”:http://projects.puremagic.com/greylisting/ daemon. So far I’m using it after all of my other spamtraps, but it seems to be working reasonably well. I’ll be watching the logs for a while to make sure…

In a nutshell, greylisting relies on the fact that spammers use dump-and-run tactics, while legitimate email gets queued at the sender. So, when a new, previously unknown client connects, the mailserver sends a “temporary deny”. If that connection is a spammer, they’ll probably not return; the reject means the spam was refused. If the sender was legitimate, it will retry, and our server will allow the retry through.
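In code, the decision is tiny. The sketch below is Python rather than anything postgrey actually runs; keying on the (client address, sender, recipient) triplet, the 5-minute delay, and the reply strings are assumptions for the example.

```python
def greylist(seen, triplet, now, min_delay=300):
    """Return an SMTP-style reply for one delivery attempt.

    `seen` maps a (client_ip, sender, recipient) triplet to the time of the
    first attempt. Hypothetical helper, not part of postfix or postgrey.
    """
    first = seen.setdefault(triplet, now)
    if now - first < min_delay:
        # First contact, or a retry that came too quickly: temporary deny.
        return "450 try again later"
    # The sender queued the message and came back: let it through.
    return "250 ok"
```

A dump-and-run spammer only ever sees the 450; a real mail server retries from its queue and gets the 250.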

Pretty cool, if you ask me :-)

posted at 9:40 pm on Tuesday, July 06, 2004 in Site News | Comments Off

Simple Page Editors

I’m currently using “Whisper”:http://www.whisper.cx/ for the static content around here, but I tripped over “EditThisPage”:http://editthispagephp.sourceforge.net/home/index.php the other day, and it looks useful also.

More and more of these things are cropping up, probably as a backlash against how complicated (and fragmented!) the Wiki space is getting…

posted at 2:11 pm on Friday, June 04, 2004 in Links, Site News | Comments Off

comment spam

Some idiot script kiddy wiped out our bandwidth again today. He could have an automated tool, or he could be doing it manually. He’s trying to post comment spam to blog.org, but he’s fetching pages over and over again (presumably to see if his comments are getting published or not).

The problem is that David’s pages are large (and getting larger all the time); an average of 200KB each. So this spammer has single-handedly downloaded at least 70MB of data today!

It’s one thing to try to abuse my server to get a site ranked higher in Google. It’s another thing entirely to waste _my_ bandwidth in the process!

64.57.64.0/19, 66.154.0.0/18, and 66.154.64.0/19 just made it into the blackhole list…
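For anyone wondering which addresses those three blocks cover, Python’s standard `ipaddress` module can answer that. This only checks membership; it says nothing about how the blackhole itself is implemented on the server, and `is_blackholed` is a name made up for the example.

```python
import ipaddress

# The three networks named above.
BLACKHOLE = [
    ipaddress.ip_network(net)
    for net in ("64.57.64.0/19", "66.154.0.0/18", "66.154.64.0/19")
]


def is_blackholed(addr):
    """Return True if `addr` falls inside any blackholed network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLACKHOLE)
```

For example, `is_blackholed("66.154.70.1")` is true, while `is_blackholed("66.154.96.0")` falls just outside the last block.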

posted at 1:18 pm on Thursday, June 03, 2004 in Security, Site News | Comments (4)

David Brake says:

2004-06-03 at 18:29

I was kept busy removing the comment spam this created on the other end today as well (unfortunately, the script kiddies are starting to randomise their IP addresses and choose from long lists of URLs so IP address or URL blocking is less effective). Makes me think the only long-term solution to comment spam may be one of these “type in the numbers from an image” plug-ins. Though apparently determined spammers are actually doing it by hand! AARGH!
joy says:

2004-06-03 at 23:23

What about comment moderation in WP?
Harald says:

2004-06-04 at 07:31

I’m using WP, and (as you can see) comment moderation is working.

David’s still using MovableType, and his weblog is quite popular…
Chris Sampson says:

2004-06-25 at 18:08

I would recommend you set up some type of image number system so bots can’t spam!

First WP Problem

WordPress apparently doesn’t let me (easily) put fake HTML tags in my posts, even if I use HTML entities like &lt; — it seems there’s a double decode going on.

(To get that to appear I had to type & amp ; amp ; lt ; (without the spaces).)
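The layering is easy to see with Python’s `html.unescape`, used here purely as a stand-in for whatever decoding passes happen along the way; where exactly WordPress or its plugins decode is not shown by this sketch.

```python
from html import unescape

typed = "&amp;amp;lt;"     # what had to be typed into the post

once = unescape(typed)     # first pass strips one layer:  "&amp;lt;"
twice = unescape(once)     # second pass strips another:   "&lt;"
thrice = unescape(twice)   # a third pass would finally yield "<"

print(once, twice, thrice)
```

Every extra decoding pass between the editor and the reader costs one more `&amp;` in front.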

I like to do things like <grin> in my posts…

*Update:* it looks like the problem is the fixEntities() function in the textile2 plugin…

posted at 10:27 pm on Wednesday, June 02, 2004 in Site News | Comments (2)

Reid says:

2004-08-02 at 23:23

So did you fix the plugin and send the changes upstream?

Can you tell I am browsing your site-related entries to see how the whole WordPress thing went? :-)
Harald says:

2004-08-03 at 12:33

No, my WordPress Fu isn’t good enough yet; I’m still working around the problem.

Iñtërnâtiônàlizætiøn

I ran this test a long time ago with Movable Type (and had to make a whole bunch of changes to get it to work properly). I thought I’d try it again with WordPress…

How does my weblog perform using Unicode? See also: “Survival guide to i18n”:http://intertwingly.net/stories/2004/04/14/i18n.html. Some tests:

bq. これは日本語のテキストです。読めますか
Let’s see how Unicode and weblogs does with Japanese :) これは日本語のテキストです。読めますか？…

And check…

(via “Anne van Kesteren”:http://annevankesteren.nl/archives/2004/05/unicode via “Russell Beattie Notebook”:http://www.russellbeattie.com/notebook/1007860.html#1007929)

posted at 5:21 pm on Tuesday, June 01, 2004 in Site News | Comments (3)

Harald says:

2004-06-01 at 17:27

How about comments?

Στο κι όταν διοίκηση μπορούσε. Ώρα πω κάνε διοικητικό δημιουργική, ανά βγήκε ζητήσεις τα, μάτσο περίπου ποσοστό πω και. Ένα τα πακέτο πρώτοι, μια πηγαίου μεταφραστής δε, να κλπ επεξεργασία επιχειρηματίες. Θα για’ ερωτήσεις δοκιμάσεις. Αν άτομο διαδίκτυο διαπιστώνεις όλη.
Reid says:

2004-08-02 at 23:22

Looks good. I notice, btw, that the comment was converted into HTML numbered entities instead of staying unicode. Or is that the way it is supposed to work?

Of course, the final result will depend on the user’s web browser being able to display the unicode text correctly.
Harald says:

2004-08-03 at 12:35

*sigh*; I hadn’t noticed that. No, that’s _not_ how it is supposed to work; time to investigate a little, I guess…

Traffic Analysis

We actually didn’t get that much traffic last night from the slashdot crowd, other than one Australian tool who kept fetching image files over and over again with various random query arguments. 1734 fetches of one image; 1144 fetches of a slightly larger one. It might have been a browser bug, but somehow I doubt it. It was single-handedly responsible for about 250MB of traffic in a few minutes; fortunately that was at 1AM, so I don’t think anyone would have noticed. Into the black hole…

Meanwhile, some other jerk in Japan has been downloading over and over again from blog.org, resulting in almost a gigabyte of traffic in the last two days!!! He downloaded the same (large) pages (200 or so times each), sometimes minutes apart; unbelievable! Also into the black hole…

By comparison, total combined traffic from slashdot.org _and_ all traffic for the referenced paper is only about 180MB in the last week. Even without the redirects in place, we would only have transferred between 170MB and 480MB of additional data (depending on the number of clients that support gzip compression).

I hate computers :-)

posted at 10:05 am on Thursday, May 27, 2004 in Site News | Comments Off

slashdotted!

Several years ago (back when this machine was still a 486, actually), I put a global Apache rewrite rule on the server to deny access to anyone who came here from slashdot. This was to avoid the so-called “slashdot effect”:http://en.wikipedia.org/wiki/Slashdot_effect.

Well, the rule has finally been triggered, thanks to “Extensible Programming for the 21st Century”:http://developers.slashdot.org/article.pl?sid=04/05/26/2231214 (a link to one of Greg’s articles).

Apparently denying the page was somewhat confusing, so I changed the rule to redirect to “this page”:http://www.cfrq.net/slashdot.html instead.
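The logic of such a rule is simple enough to sketch in Python. This is not the Apache rewrite rule itself; the `handle` function and the choice of a 302 status are assumptions for illustration, and only the target URL comes from this post.

```python
from urllib.parse import urlsplit

REDIRECT_TARGET = "http://www.cfrq.net/slashdot.html"


def handle(referer):
    """Return (status, location) for a request with the given Referer header.

    Hypothetical helper: requests referred by slashdot.org (or any of its
    subdomains) are redirected; everything else is served normally.
    """
    host = (urlsplit(referer).hostname or "") if referer else ""
    if host == "slashdot.org" or host.endswith(".slashdot.org"):
        return 302, REDIRECT_TARGET
    return 200, None
```

Note that the check depends entirely on the Referer header, so a browser that omits it sails straight through.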

The server is holding up remarkably well under the load (much better than it did with comment spammers before I rate-limited the mt-comments.cgi scripts). Still, there is a lot of dynamic content here, and I don’t think Michelle wants the bandwidth headaches, so the rule stays.

posted at 9:19 pm on Wednesday, May 26, 2004 in Site News | Comments (4)

joy says:

2004-05-26 at 21:33

You slashdot denying heathen! :-P
Mark says:

2004-05-27 at 21:08

I read the article at http://www.third-bit.com/~gvwilson/xmlprog.html
(my browser shows it was the link I visited) as referred by /. shortly after it was posted and did not notice any lag, /. effect or redirection.

I thought it was hosted at UofT or HP.

If it was hosted on your box then, good job! What sort of net connection do you have?

Cheers
Harald says:

2004-05-27 at 22:18

This site is currently redirecting automatically to the pyre.third-bit.com mirror (which is at UofT), but only for slashdot referers, so you might have read it here instead.

As for our network connection, we are trying to use as little as possible of our generous host’s 3Mb/768Kb business-class DSL…
Mark says:

2004-05-28 at 20:03

I was referred from slashdot, but due to a bug/feature of Galeon (as shipped with RH9.0) the referer field is not set when you open a new tab on a link.

So http://www.third-bit.com served the page (your server survived). :)

Here’s a test where pone.html was loaded by typing it into the address
bar, ptwo.html was loaded via opening a new tab on a link from pone,
and pthree was opened by clicking on a link from ptwo.

Note referer is “-” except when it is set by the direct click loading of pthree.html.

127.0.0.1 - - [28/May/2004:19:57:53 -0400] "GET /pone.html HTTP/1.1" 200 342 "-" "Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131"
127.0.0.1 - - [28/May/2004:19:57:57 -0400] "GET /ptwo.html HTTP/1.1" 200 346 "-" "Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131"
127.0.0.1 - - [28/May/2004:19:58:01 -0400] "GET /pthree.html HTTP/1.1" 200 342 "http://127.0.0.1/ptwo.html" "Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20030131"

Cheers.

Movies update

I copied my “movie list”:http://blog.cfrq.net/chk/static/movies.html from the “blog entry”:http://blog.cfrq.net/chk/archives/000659.html over onto my static content pages. I’ll try to keep the “new movie list”:http://blog.cfrq.net/chk/static/movies.html up-to-date :-)

posted at 4:48 pm on Monday, April 19, 2004 in Site News | Comments (1)

Jeff K says:

2004-04-23 at 07:04

Hey, where’s “Fahrenheit 9/11”? [n.b. I’m not a big fan of Moore’s agenda, but he collects a lot of important facts while pursuing it].

Bandwidth

It’s amazing what happens when you add a banner graphic to the weblog. The banner is larger than the main index page, and certainly larger than the individual pages; bandwidth has spiked quite a bit since I added it.

I did some checking, and noticed a surprisingly large number of clients fetching the .jpg over and over again, instead of pulling it out of local cache; what’s up with that?

posted at 8:54 am on Wednesday, March 10, 2004 in Site News | Comments Off

Content Management Systems

In “Comes in Two Sizes”:http://www.third-bit.com/~gvwilson/blog/archives/000029.html Greg Wilson comments on Wiki technology.

One of the frustrations I find with wiki software (and, actually, open source software in general) is the proliferation of almost identical versions of a tool. There are too many wiki implementations out there, and each one seems to have one or two good features, but is also missing one or two important features.

TWiki is too large and feature-rich (and has a weird hybrid version of Wiki syntax), but it has real authentication (unlike most of the others) and has a working XML-RPC interface (useful for integrating with, say, Movable Type :-), so that’s why it is installed.

On the other hand, I’m using MoinMoin at work, because I don’t need authentication there, and it is easier to set up and use.

Anyway, what I’m really looking for for things like “the rolemaster pages”:http://www.cfrq.net/~rolemaster/ is a Content Management System that makes it easy for the casual user to create linked documents. This means, for example, that it should use Textile or Wiki syntax for editing. But the most important feature for me would be to extend the power of WikiWords to arbitrary phrases or keywords. For example, if I create a document with the title “Greg Wilson”, I’d like any instance of the string “Greg Wilson” to be replaced with a hyperlink. This makes it trivial to create content; just create the page, rebuild, and every reference to the topic will be magically linked, without using WikiWords. (The problem with WikiWords is that you have to remember to use them, and they don’t always fit comfortably. My Rolemaster characters don’t have last names, for example, so I’d have to name Alex as AlEx, or CharacterAlex, or something similar to get standard wikis to work).
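The title-to-link pass described above could look roughly like this in Python. It is a sketch under assumptions, not a feature of any of the tools mentioned: `autolink`, the `pages` mapping, and the URLs in the usage note are all invented for the example.

```python
import re


def autolink(text, pages):
    """Replace every known page title in `text` with a hyperlink.

    `pages` maps a page title to its URL. Longer titles are tried first so
    that a multi-word title wins over a shorter title it contains.
    """
    if not pages:
        return text
    titles = sorted(pages, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(title) for title in titles) + r")\b"
    )

    def link(match):
        title = match.group(0)
        return f'<a href="{pages[title]}">{title}</a>'

    return pattern.sub(link, text)
```

With `pages = {"Greg Wilson": "/pages/greg-wilson.html", "Alex": "/pages/alex.html"}`, the sentence “Alex met Greg Wilson.” comes back with both names linked, and a word like “Alexander” is left alone because the match has to end on a word boundary.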

WordPress comes close; there’s a plugin for keyword processing that could probably be extended to dynamically generate the list of keywords from the database. If only I had some time to play :-)

posted at 10:31 pm on Friday, February 27, 2004 in Site News | Comments (1)

Reid says:

2004-03-22 at 09:02

Hm, can’t tell if all of the posting was by you or by Greg Wilson, and I’m too lazy to click the link, so there you go.

If you have a word that isn’t InterCaps, you can (with TWiki anyway) use a syntax like [[wiki][text]] which will use the ‘wiki’ part as the link href, and display it as ‘text’. So to be bizarre about it, you *could* use [[Alex][Alex]]. :-)

Willow Quotes

That was a quote from Willow Rosenberg, btw; it’s evil Willow from the “third season”:http://www.tvtome.com/tvtome/servlet/EpisodeGuideSummary/showid-10/season-3/ episode “The Wish”:http://www.tvtome.com/tvtome/servlet/GuidePageServlet/showid-10/epid-43/ . It goes with “This is the part that’s less fun. When there isn’t any screaming.” :-)

There’s a Willow fan site titled “Bored Now”:http://www.borednow.envy.nu/, and I found sound clips at “Willow Sounds”:http://dogwood.phpwebhosting.com/~tvshrine/willow.htm.

posted at 9:35 am on Thursday, February 12, 2004 in Site News | Comments Off