When Cluster Computing Can Slow You Down, and How To Optimize It
I have long been fascinated with the concept of cluster (or distributed) computing. Projects like BOINC and Folding At Home that take advantage of millions of people’s idle computer cycles are brilliant. I have several computers in my house that I use for various purposes, but they are rarely ever all being used at the same time. I always wanted to try to run some sort of clustered application across them but never found a great reason until recently: making DVD backups by transcoding them into xvid files.
There is a great program for ripping DVDs in linux called dvd::rip. It is written in perl and is very easy to use if you are comfortable with a linux/console environment. One of its great features is the ability to transcode ripped DVDs by creating a computer cluster out of your networked computers. Before I get into the details, let me step back and give a very brief overview of the basic way a computing cluster works.
The Ultimate Twitter Search Widget
I have released a Twitter Search Widget and supporting API for use on any website. You can find it here:
The Ultimate Twitter Search Widget
These widgets display tweets based on the search.twitter.com API and can be customized with great flexibility. Go check it out!
Initial Google Chrome Memory Usage Comparison
Before you flame me, I know that Google Chrome is early early beta, but I find this kinda silly. Bragging about your superior memory management means it should have superior memory management. After installing Chrome and firing up my 4 most frequently loaded pages in 4 new tabs, I notice that things started to get clunky. I open up Firefox 2 and open the same 4 pages in 4 new tabs and compare. In Chrome you can open ‘about:memory’ to see memory usage statistics (which is pretty awesome). Anyway, this is what I see:
Chrome is using almost twice what Firefox is using. If you’re interested, here are the 4 sites:
- Gmail
- Google Finance
- CNN Money
Now, I know that RAM is there to be used, and maybe I’m missing some point of what Chrome is trying to do at the moment (I know that the comic book said that it tries to amortize the memory cost of a tab up front at its creation), but yikes… this is a tab tad concerning…
Hurricane Gustav Twitter Tracker
At the suggestion of waynesutton, I have created another twitter tracking site. This time it is for Hurricane Gustav related tweets. There are a few filters available to narrow down the results on each page.
Hurricane Gustav Twitter Tracker
UPDATE:
I also created a widget you can install on your site/blog at http://jazzychad.com/twitter/gustav/widget.php
In the first 24 hours, over 32,000 widgets were served.
You can see it in action at these sites (probably until Gustav blows over):
Sarah Palin Little Known Facts
Today John McCain announced Alaska governor Sarah Palin as his candidate for Vice President. No sooner had that happened than @MichaelTurk inadvertently started the newest internet viral (micro)meme on twitter: “Little Known Facts: Sarah Palin”. His unassuming tweet, “Little known fact: Sarah Palin used to wrestle kodiak bears in Alaskan bare knuckles fight clubs.” set off an explosion of other twitter users coming up with their own Chuck Norris-style little known facts about Palin. MichaelTurk is keeping a log of some of these little known facts over at his site http://www.palinfacts.com/
During the next few hours, over 1,000 little known facts were tweeted (perhaps not all unique), but the point is how quickly this caught on like wildfire.
Since you can only see up to the most recent 100 tweets on search.twitter.com, I decided to write a tweet collecting engine (similar to the one used for FlixPulse.com) to keep a catalog of these little known facts. The results are shown at http://jazzychad.com/twitter/palinfacts/ In just under 2 hours, I have collected over 200 little known fact tweets. Amazing.
During the first few hours, all of the tweets were positive in Chuck Norris fashion. Then all of a sudden I started to see some negative ones creep in. It seemed that those who are unhappy with McCain’s choice, and some Obama supporters, figured out that two can play at this game. Still, the overwhelming majority are positive in nature.
I see that even as I write this post, Turk from palinfacts.com has added a link to my database site on his blog. Thanks, Turk!
Who knows how long this will keep up, but I will keep logging the tweets as they come in!
Barcodes Are Fun
I have always been fascinated by barcodes and the various methods to encode human readable information into a machine readable format. A long time ago I found a great little program for my TI-89 calculator that would draw a user-entered UPC code onto the graph screen. I ported it to work in Windows using Visual Basic to gain an understanding of how barcodes are created. I even went so far as to use my CueCat to scan barcodes on products and have my program automatically reproduce the barcode on screen.
Later on when I was learning to use the gd library for image creation within php scripts, I decided to start by porting the barcode generator once again. I never got it quite right, but I had learned enough to move on with whatever project I was working on. Recently I stumbled upon that old barcode gd script and decided to fix it so it worked correctly.
At first the barcode creator only generated UPC-A barcodes (the big 12-digit codes you scan at the check-out line). After getting that to work successfully, I decided to implement UPC-E barcode generation as well. UPC-E codes are usually found on soda cans or snacks and only have 6 digits so they take up less physical space on the product.
UPC-A and UPC-E Barcode Generator
If you should ever need to create barcodes on the fly, I have created a sort of API for them:
To create a UPC-A barcode, use the following pattern:
http://jazzychad.com/barcode/upca-<insert-12-digit-code-here>.png
Example:
http://jazzychad.com/barcode/upca-012000809965.png
will produce

To create a smaller version, simply add “small” after the 12-digit code, like so:
http://jazzychad.com/barcode/upca-012000809965small.png
will produce

If you would prefer gif format rather than png, just use .gif instead.
UPC-E creation is very similar:
http://jazzychad.com/barcode/upce-<insert-6-digit-code-here>.png
Notice “upce” instead of “upca”.
Example:
http://jazzychad.com/barcode/upce-120850.png
will produce

Again, to create a smaller version, just add “small” after the 6-digit code:
http://jazzychad.com/barcode/upce-120850small.png
will produce

.png and .gif work the same way as well.
Note: UPC-A creation will still work even if the check-digit (the last digit after the barcode) is incorrect. I think I will change the script to indicate an invalid UPC-A code by turning the check-digit red.
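For reference, here is a minimal Python sketch of how a UPC-A check digit can be verified. This is not the script behind the API above; the function names `upca_check_digit` and `is_valid_upca` are made up for illustration. It uses the standard UPC-A rule: triple the sum of the digits in odd positions, add the digits in even positions, and pick the check digit that brings the total up to a multiple of 10.

```python
ASCII_DIGITS = set("0123456789")


def upca_check_digit(first_eleven: str) -> int:
    """Return the check digit for the first 11 digits of a UPC-A code."""
    if len(first_eleven) != 11 or not set(first_eleven) <= ASCII_DIGITS:
        raise ValueError("expected exactly 11 digits")
    digits = [int(ch) for ch in first_eleven]
    # Positions are counted from 1, so odd positions are indexes 0, 2, 4, ...
    odd_sum = sum(digits[0::2])
    even_sum = sum(digits[1::2])
    total = 3 * odd_sum + even_sum
    # The check digit tops the total up to the next multiple of 10.
    return (10 - total % 10) % 10


def is_valid_upca(code: str) -> bool:
    """Return True if a 12-digit UPC-A code carries the right check digit."""
    if len(code) != 12 or not set(code) <= ASCII_DIGITS:
        return False
    return upca_check_digit(code[:11]) == int(code[11])
```

With the example code from above, `is_valid_upca("012000809965")` is true, because the first eleven digits give a total of 85 and the check digit 5 brings it to 90.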
I wanted to test scanning these barcodes after printing them out, but my CueCat seems to have died. Maybe I will take one to the store next time I want to buy a Mountain Dew 12-pack.
Resources:
Wikipedia UPC Article
Barcode Calculator Tool
How To Block the CUIL Spider Bot
Some people have landed on my previous rant about CUIL after searching Google for “block cuil bot”. I realize that article does not answer that question (unless you click the link to read madstatter’s robots.txt file). So, here is how you block CUIL’s insane spiders.
CUIL’s spider is called “Twiceler”. Why? I have no idea. To block it from your site, add the following lines to your site’s robots.txt file:
User-agent: twiceler
Disallow: /
That’s it!
For more information about robots.txt files, there is some great information at robotstxt.org.
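If you want to sanity-check the rule before deploying it, Python's standard `urllib.robotparser` module can parse the two lines above and report who is allowed in. This is only a rough sketch: the helper name `is_allowed` is made up, and it shows how a well-behaved parser reads the file, not how Twiceler itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above.
ROBOTS_TXT = """\
User-agent: twiceler
Disallow: /
"""


def is_allowed(user_agent: str, path: str = "/") -> bool:
    """Return True if the given user agent may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Here `is_allowed("Twiceler")` comes back false, while any agent without a matching entry, such as `is_allowed("Googlebot")`, is still allowed.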
Beware WordPress SQL Attack!
If you run a WordPress blog on your server (or are an admin for one), you need to read this.
I was watching my webserver log scroll by (yes, I do that a lot), and I witnessed yet another attempted SQL attack. This time it wasn’t trying to inject anything onto my server like last time. This time it was trying to get the admin password hash stored in the database. Yikes!
Here is what the request looked like:
GET /stuff/index.php?cat=999 UNION SELECT null,CONCAT(666,CHAR(58),user_pass,CHAR(58),666,CHAR(58)),null,null,null FROM wp_users where id=1
There were several variations of this request that came in very quick succession. Can you see what it’s doing?
Basically it is trying to dump your admin password hash to the screen. If successful it would look something like:
666:<password-hash-here>:666:
The evil bot/script/whatever would just look for the “666:” surrounding the hash and read it out. Then it would probably look up the hash in a rainbow table. If you have a weak password it would be completely compromised.
Luckily, since I had modified my webserver’s .htaccess file after enduring the last SQL attack, this attack got a very nice “HTTP 403: SCREW YOU!” response from my server! I am re-displaying the section of the .htaccess file for your edification (I also added an entry for “outfile” which should never be used in an http request).
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

RewriteCond %{QUERY_STRING} union [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} select [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} jatest [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} http [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]

RewriteCond %{QUERY_STRING} outfile [NC]
RewriteRule .* /------------http----------- [F,NC]
RewriteRule http: /---------http----------- [F,NC]
</IfModule>
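To make the effect of the RewriteCond lines concrete, here is a rough Python sketch of the same check: a case-insensitive substring match of the raw query string against the five keywords. It is only an illustration of the matching logic, not a replacement for the server rules, and the names `BLOCKED_KEYWORDS` and `is_blocked` are made up.

```python
import re

# The five keywords tested by the RewriteCond lines above.
BLOCKED_KEYWORDS = ("union", "select", "jatest", "http", "outfile")

# One case-insensitive pattern, mirroring the [NC] flag.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(word) for word in BLOCKED_KEYWORDS),
    re.IGNORECASE,
)


def is_blocked(query_string: str) -> bool:
    """Return True if the raw query string contains a blocked keyword."""
    return _BLOCKED_RE.search(query_string) is not None
```

The attack request above is caught because its query string contains both “UNION” and “SELECT”. Keep in mind that plain substring matching is blunt: a harmless query such as `q=Selection` is rejected too.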
This should also be a good reminder to always use strong passwords.
How CUIL Lost Me as a Customer Long Before They Launched
Several months ago I noticed a new spider crawling madstatter.com (my baseball statistics site) called “Twiceler”. The link in the user-agent for Twiceler led to this page. Apparently it was for a new unlaunched search engine called “cuil”. Obviously this Twiceler bot was just crawling sites to gather pages before the launch. Great, I was happy to be indexed by a new engine.
Then things went wrong…. very wrong.
For some reason Twiceler was trying to index links such as:
http://www.madstatter.com/06/07/07/07/07/06/06/…/scoring.php
…which of course is a ridiculous address. Twiceler was creating hundreds (if not thousands) of these types of bogus addresses and trying to index them. They all returned 404, of course. I thought surely that this would stop soon enough because no spider could be this goofy.
It continued for days. At this point I decided to look up the contact for Twiceler and send off an informative email to tell them what was happening. I really was just trying to help.
Hi Jim,
I see your Twiceler robot crawling my site (www.madstatter.com) which is all fine and dandy, except that it is creating mal-formed addresses which all return 404 errors.
For example, it is trying to find this page:
http://www.madstatter.com/06/07/07/07/07/06/…/07/07/scoring.php
…which doesn’t exist of course. Perhaps there is something goofy with the logic that parses each page’s links and creates new addresses to crawl? For example, I use a lot of “../” and “./” references in my href links which tend to throw off some url-parsing robots (that is not intentional, it is just the way it is).
I am afraid at this rate your bot may be creating an infinite number of mal-formed addresses to crawl. I do not wish to block your bot, but I don’t want a whole ton of garbage addresses in my log files either.
Thank you for looking into this.
-Chad
The next day I received this rather cold response:
Dear Chad,
Twiceler is the crawler that we are developing for our new search
engine. It is important to us that it obey robots.txt, and that it not
crawl sites that do not wish to be crawled. If you wish I will be
glad to add your site to our list of sites to exclude.

Like all startups, we hope to launch sooner rather later, but exactly
when that will be, I don’t know. Watch our web site (www.cuill.com) for
the announcement.

Recently we have seen a number of crawlers masquerading as Twiceler,
so please check that the IP address of the crawler in question is one of
ours. You can see our IP addresses at http://cuill.com/twiceler/robot.html

You may wish to add a robots.txt file to your site (I notice you don’t
have one). That is the standard mechanism for controlling robot access and
behavior. You can read about it at
http://www.robotstxt.org/wc/exclusion-admin.html
and there a simple generator of the file here
http://www.mcanerin.com/EN/search-engine/robots-txt.asp

Incorrectly formed URLs are usually the result of links we have
picked up from earlier crawls - usually from some other unrelated site
that has a stale or mangled link to yours. We have no way of knowing
their validity until we try to access them.

I apologize for any inconvenience this has caused you and please feel
free to contact me if you have any further questions.

Sincerely,
James Akers
Operations Engineer
Cuill, Inc.
I wasn’t sure if this was a form-response email or what. It did not seem to address the issue I had raised. Furthermore, it basically said “if you’re not happy with our spider crawling your site, just block it.”
In my confusion, I sent back another email:
Hi James,
Thanks for the reply.
I do have a robots.txt file, it is just blank. I do not wish to block any crawlers, so there are no entries in the file.
The IP that is crawling is 38.99.44.103, which is in the list of valid IPs you provide.
Perhaps a solution that will suit both of us is if you could clear out the “pending addresses” in your database for madstatter.com? That way Twiceler can still crawl the site, but it will just have to start from scratch creating the address links over again.
Is that a fair/doable compromise?
Thanks,
-Chad
I received no replies after sending this email. None.
After another week of Twiceler spamming my server (and logs) with insane requests, I decided I had had enough. I finally added the first (and only) entry to madstatter.com/robots.txt to block Twiceler entirely.
I’m sorry, but if you are trying to create “the Google killer” search engine, then you shouldn’t brush off (or outright ignore) people who are voluntarily trying to help you fix your broken spider.
It’s no wonder that Cuil is claiming that they have indexed three times as many pages as Google. They have probably crawled millions of non-existent addresses!
I feel somewhat vindicated in my anger toward Cuil by seeing the reception of their official launch be so overtly negative. I honestly don’t think Google has anything to worry about here.
FlixPulse.com - How it Works
The following is from another of the newly added pages at FlixPulse.com. Again, I thought it would be an interesting post, so here you go:
So, you wanna get all technical, do ya? Fine…
FlixPulse has two basic parts: the front-end (website) where data about each movie is displayed, and the back-end where all the magic happens. I’ll assume you know how a website works, so let’s skip to the fun part.
At the core of FlixPulse is the Tweet Classification Engine (TCE). It is basically a script that runs indefinitely on one of my linux servers. The TCE does 4 basic steps.
1. Poll the Twitter Search (formerly Summize) API for the latest movie tweets from the public timeline (aka The Twitterverse).
2. Classify the tweet as Good, Bad, Indifferent, Noise, or Unfiled.
3. Insert the tweet (and associated information) into the database.
4. Repeat.
Step 1 is pretty easy, but can be tricky. I search for the movie titles and common mis-spellings or other phrases. For example, for “The Dark Knight”, I search for “Dark Knight”, “Dark Night”, and “new batman movie”.
Step 3 is also pretty easy, but Step 2 is the interesting bit.
Some people have asked whether classification is based on keywords, that is, if I classify tweets as good if they contain words like “awesome”, “great”, etc. The answer is no. I did consider this method early on, but quickly discovered that there are far too many ways to express positive, negative, and indifferent opinions to find them all using keywords. This method would also be wrong a lot of the time. Phrases such as “that movie was not good at all” would register as a Good review based upon the keyword “good” when it is, in fact, a Bad review. So, how do you get a computer to master the complexities of linguistic expression? Well, you can’t really, but you can come close… with Bayesian Filters.
Bayesian Filters are the main workhorse for spam filtering in email. They require a large body of input data (or “knowledge”) to be able to work. I was not able to implement such a system until I had a large number of classified tweets already in the database. The Mod Squad had done a great job of classifying a huge number of tweets in a short period of time. Using this data, I was able to construct some surprisingly accurate Bayesian filters to handle the onslaught of Dark Knight tweets during opening weekend.
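Here is a basic explanation of how Bayesian spam filtering works, and then I will use that as a way to explain the way TCE Step 2 works.
$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam}) \, P(\text{spam})}{P(\text{words})}$$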
Bayesian filters are all about probability. They basically ask, “What is the probability that this email is spam, taking into account all the previous spam messages I have seen?” The above formula boils down to this: the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
Now, in FlixPulse’s case, replace the word “Spam” with “Good” and replace the word “email” with “tweet” and you can see how using Bayesian filters maps nicely to the problem of having a computer classify tweets on its own. Rinse and repeat for “Bad”, “Indifferent”, and “Noise” filters. Why have a “Noise” filter? Tweets such as “going to watch Iron Man with some friends” do not count as a review, and do not need to be marked as Good, Bad, or Indifferent. Now I have a set of four Bayesian filters that can determine the probability that any incoming tweet fits into one of those four categories.
Each filter is constructed using statistical distributions of individual words and phrases from each category of tweets.
As each tweet is received by the TCE, it is given a score by each filter. If exactly one score registers very high and all the others score low, the tweet is automatically marked with that category. If more than one score registers high, it is marked as “Unfiled” and sent off for human moderation.
Every so often, the statistical distributions are automatically recalculated and the filters are re-created. In this way, the filters are self-learning in that they use all of the most recently classified tweets to add to their “knowledge base” for the next set of incoming tweets.
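To make the scoring and re-training steps concrete, here is a minimal Python sketch of four per-category Bayesian filters. It is not the actual TCE code. The tokenizer, the add-one smoothing, the 0.8/0.2 thresholds, and the names `TweetClassifier`, `train`, `score`, and `classify` are all assumptions made for illustration. Sending a tweet to “Unfiled” when no filter scores high is also an assumption, since only the “more than one high score” case is described above.

```python
import math
import re
from collections import Counter

CATEGORIES = ("Good", "Bad", "Indifferent", "Noise")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TweetClassifier:
    def __init__(self, high=0.8, low=0.2):
        self.high = high  # hypothetical "very high" threshold
        self.low = low    # hypothetical "low" threshold
        self.train([])

    def train(self, labeled_tweets):
        """Rebuild all word statistics from (text, category) pairs."""
        self.word_counts = {c: Counter() for c in CATEGORIES}
        self.tweet_counts = Counter()
        for text, category in labeled_tweets:
            self.tweet_counts[category] += 1
            self.word_counts[category].update(tokenize(text))
        self.all_counts = sum(self.word_counts.values(), Counter())
        self.total_words = sum(self.all_counts.values())
        self.total_tweets = sum(self.tweet_counts.values())

    def score(self, text, category):
        """Estimate P(category | words) with a two-class naive Bayes filter."""
        n_in = self.tweet_counts[category]
        n_out = self.total_tweets - n_in
        # Prior odds of the category, with add-one smoothing.
        log_odds = math.log((n_in + 1) / (n_out + 1))
        in_counts = self.word_counts[category]
        in_total = sum(in_counts.values())
        out_total = self.total_words - in_total
        vocab = len(self.all_counts) + 1
        for word in tokenize(text):
            p_in = (in_counts[word] + 1) / (in_total + vocab)
            p_out = (self.all_counts[word] - in_counts[word] + 1) / (out_total + vocab)
            log_odds += math.log(p_in / p_out)
        # Clamp so math.exp cannot overflow, then convert odds to a probability.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text):
        """Return a category, or "Unfiled" when the filters do not agree."""
        scores = {c: self.score(text, c) for c in CATEGORIES}
        high = [c for c, s in scores.items() if s >= self.high]
        others_low = all(s <= self.low for c, s in scores.items() if c not in high)
        if len(high) == 1 and others_low:
            return high[0]
        return "Unfiled"
```

Calling `train` again with the full, updated set of moderated tweets is the sketch's version of the periodic re-creation of the filters. Each category gets its own “this category versus everything else” score, which is why more than one filter can score high for the same tweet.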
So there you have it. Pretty simple, right?