April 08, 2005

Yahoo! Modification Date

Posted at April 8, 2005 10:51 PM

A question was raised by Kalena Jordan of Search Engine College the other day on Jill's Forum regarding how to figure out when Yahoo! last spidered a site or last updated their index.

This is one bit of information Yahoo! hasn't included in their SERP display for a long time now, though no one could ever figure out why. It never seemed to be sensitive information so not including it as a norm never made any sense to me.

Oddly, I'd just started playing a bit with Yahoo! Developer stuff a few days earlier because I plan to create a few little API applications for fun and hopefully to help folks out. In doing this I happened to notice that the API version of Yahoo! searches does in fact include a "ModificationDate" field that gets returned.

Technically, this ModificationDate isn't intended to show the last time Slurp came by for a visit. Instead it shows the time and date (in Unix timestamp format) the actual file was last updated or modified. In other words, the time and date supplied by the server the page resides on. Or at least that's the way I read it in the documentation.
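A Unix timestamp is just a count of seconds since January 1, 1970 (UTC), so turning one into something readable takes only a few lines. Here is a minimal Python sketch of that conversion. It is not part of the mod_date script (which is PHP), and the function name and sample value are made up for illustration.

```python
from datetime import datetime, timezone


def format_modification_date(raw_value: str) -> str:
    """Turn a Unix timestamp string into a readable UTC date."""
    seconds = int(raw_value.strip())
    # Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# Illustrative value only, not taken from a real search result.
print(format_modification_date("1113000660"))  # 2005-04-08 22:51:00 UTC
```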

In practice though, this info is still pretty useful.

To begin with, sites that use server side scripting (e.g. PHP, ASP, etc.) are always going to send the current date/timestamp even if the files do not contain any scripting. That's because the server still has to parse through the file before delivering it. So these files are always going to show as being brand new even if they've not technically been modified in months or years.

If you have file types that will report the true date of upload (e.g. plain HTML files) it's a bit more problematic if my understanding of ModificationDate is correct. Still, if you find that your server reports an old date you could always re-upload exactly the same file to your server, thus updating the date/timestamp the server has stored, and check the tool in a few days to make sure Slurp is coming by to grab the file(s).
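To see what date your own server is handing out for a plain HTML file, one option is to look at the HTTP Last-Modified header it returns. That this header is the date the API reflects is an assumption here, not something the documentation confirms. The Python sketch below only shows how to read such a header value and work out how old it is. The header string, the "now" value, and the function name are all invented for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def days_since_modified(last_modified_header: str, now: datetime) -> int:
    """Whole days between a Last-Modified header value and `now`."""
    modified = parsedate_to_datetime(last_modified_header)
    return (now - modified).days


# Made-up header value, in the usual HTTP date format.
header = "Fri, 08 Apr 2005 22:51:00 GMT"
now = datetime(2005, 4, 15, 22, 51, tzinfo=timezone.utc)
print(days_since_modified(header, now))  # 7
```

If the number is large, re-uploading the file should bring it back to zero on the server side.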

Anyway, I'm rambling. Back to the main point of this entry.

Since Kalena asked, and since I was feeling inquisitive, I spent a few minutes --literally less time than it's taking me to write this post-- to tweak their example API code and register an application with them that will show this ModificationDate information. Simply because it is interesting and maybe even valuable if you ever run into a situation where your Yahoo! rankings go wonky and/or you need a quick way to figure out if Slurp has been by lately.

Using a tool like this is much easier than poring over log files, that's for sure! Plus even if you see the spider hitting a certain page in your logs you still couldn't be sure that Yahoo! updated it in their index unless the content of the page had changed pretty significantly.

I'm going to release this little tool with a GPL (GNU) license so that others can slap it up on their site if they want to. Because of the way Yahoo's API works --they use REST, not SOAP-- your server will need to support DOMXML via PHP. That means your PHP installation will need to be at least version 4.3.x and be compiled with DOMXML support.
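With REST, the results come back as a plain XML document that the script has to parse, which is why an XML parser is needed at all. As a rough idea of what that parsing step involves, here is a Python sketch using the standard library. It is not the mod_date code. The sample XML is a simplified, hypothetical response shape: only the ModificationDate field name comes from the API as described above, the other element names are assumptions, and a real response may carry namespaces and more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified response. Only "ModificationDate" is a field
# name mentioned in this post; the rest is assumed for illustration.
SAMPLE_RESPONSE = """<ResultSet>
  <Result>
    <Url>http://www.example.com/</Url>
    <ModificationDate>1113000660</ModificationDate>
  </Result>
  <Result>
    <Url>http://www.example.com/about.html</Url>
    <ModificationDate>1104537600</ModificationDate>
  </Result>
</ResultSet>"""


def extract_modification_dates(xml_text):
    """Return (url, unix_timestamp) pairs found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for result in root.iter("Result"):
        url = result.findtext("Url", default="")
        stamp = result.findtext("ModificationDate")
        if stamp is not None:  # skip results that lack the field
            pairs.append((url, int(stamp)))
    return pairs


for url, stamp in extract_modification_dates(SAMPLE_RESPONSE):
    print(url, stamp)
```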

You should be able to tell if your server supports this by creating a phpinfo() file. Or simply unzip and upload the mod_date file and try to run it. If DOMXML is not supported you'll get an error right away telling you that one of several built-in functions is not available.

Or if you would prefer to simply use the version on my server, the live mod_date script can be found here. Yes it's free to use. You'll probably notice that I pared down the information it displays a bit, so it doesn't show things like the title of a page or even a snippet. That seemed to clutter things up so I simply removed them from the display portion of the equation.

Part of the reason I'm making the code available for others to upload to their site is because there is some confusion in the way Yahoo! has their application license worded.

Unlike Google and their API license, Yahoo! has you actually register an Application ID with them. That's how they identify the scripts. Then instead of the 1,000 queries Google allows via API, Yahoo! allows 5,000 queries per day, per IP address. They don't state clearly if it's the IP number of the server that is recorded against the rate limit, or if it's the IP number of the person using the application.

I could see it being either way, so I'm going to try to get some clarification on that point. If it's 5,000 queries per user it's no big deal. It would be very difficult for one person to send through that many queries in a day. On the other hand if it's 5,000 queries per server, that's a whole other story.

If it is keyed to the server IP as I suspect, it would be better IMHO to set the limit as per day, per application, per IP. I can envision myself and others creating all kinds of little applications. The problem as I see it is that while each of these applications would not even begin to approach the daily query limit, when you group together the number of queries from a couple of dozen applications on a decently active site/server, the combined total could easily top the daily limit.

Especially when you factor in the idea that many of us use Shared Hosting, where there could be hundreds of sites on a single IP number.
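The arithmetic is easy to sketch. The Python snippet below compares the two ways of counting for an invented day of traffic: two dozen small applications on one shared IP, each sending 250 queries. The application names, the IP address, and the per-application query counts are all hypothetical. Only the 5,000 figure comes from the limit discussed above, and how Yahoo! actually counts is still the open question.

```python
from collections import Counter

DAILY_LIMIT = 5000  # the per-day figure discussed above


def over_limit(usage, key_fn):
    """Return the quota keys whose combined queries exceed DAILY_LIMIT."""
    totals = Counter()
    for (app_id, ip), queries in usage.items():
        totals[key_fn(app_id, ip)] += queries
    return {key for key, total in totals.items() if total > DAILY_LIMIT}


# Hypothetical day: 24 small apps on one shared IP, 250 queries each.
usage = {(f"app{n}", "192.0.2.10"): 250 for n in range(24)}

# Counting per IP lumps everything together: 24 * 250 = 6,000 queries.
per_ip = over_limit(usage, lambda app_id, ip: ip)

# Counting per application, per IP keeps each app at 250.
per_app_per_ip = over_limit(usage, lambda app_id, ip: (app_id, ip))

print(per_ip)          # {'192.0.2.10'}
print(per_app_per_ip)  # set()
```

Counted per IP, the server blows through the limit even though no single application comes close. Counted per application, per IP, nothing is over.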

My way of limiting queries would certainly encourage API developers to create more, and more useful, applications. So hopefully Yahoo! will consider such a change to their license if they indeed count queries against the server IP number.

The reason I'm releasing the code is that even if the version on my site runs out of queries for a given day, the same script being run from a different server could have another 5,000 queries left to go.

So there it is. Enjoy the little tool and look for more to be coming in the very near future. Also, please let me know if you upload it to your own site or if you improve on the concept. I'll certainly link to your version from here somewhere to give people options.
