jazzace.ca

Anthony’s Mac Labs Blog

Checking Web Pages with AutoPkg

Posted 2018 June 17

As I detailed in my previous post, I noticed (by chance) that an Apple Support Article that was important to system administrators had changed significantly. A number of presentations at the MacAD.UK conference in February (including mine) quoted this Article. Apparently, it had changed 18 days before I discovered the change[1]. The timing of this update wasn’t aligned with any particular OS or security update, so it’s no wonder it took a while to find. Still, it seemed like there should be a way to automatically check on an Article like that one so that it doesn’t sit there unnoticed by the community.

What tools do I use as a Mac Administrator to check for updates? Well, there’s Apple’s Software Update (soon to be moved from the Mac App Store back to System Preferences). Apple also has an RSS Feed for Developers that mentions OS releases but there’s nothing for Support Articles. There’s always the Mac Admins Slack and Twitter, of course, but that is not at all systematic. So my mind went to the tool that I use to check on software updates not covered by Apple’s mechanisms: AutoPkg. Could the URLDownloader processor also be used to check for updates to web pages? The answer, I found out, was “Yes, but….” Let me take you through my process. (This is also partially meant as a tutorial for people learning to write AutoPkg recipes, so apologies to readers fluent in this area for the depth of detail.) If you’re OK with spoilers, you can follow along with the finished recipe.

Building the Recipe

The goal was for the recipe user to specify the ID for the Article (e.g., HT208020 for the article I mentioned off the top) and have AutoPkg do the rest. This is similar to the approach Nate Felton used when developing his AppleSupportDownloadInfoProvider custom processor, except that the page itself was what we wanted to download, not a link within the page (hence, no need for a custom processor). AutoPkg would then retain a copy of the Article so that, when the next update came, the user could compare the changes because they had both files in hand.

So the first test was to see if the URLDownloader processor would actually save the page/item if its URL wasn’t pointing to an installer or a disk image or an archive of some sort. Happily, the answer was yes. The URL format for Apple Support Articles is such that the kind of file being saved is not specified by an extension, so the file that URLDownloader cached did not have an extension on it either. No matter; a quick examination determined that it was HTML. When I viewed the cached file in a web browser, it wasn’t nearly as pretty as the normal Apple version[2], but the content was clearly there. As is my normal process in developing recipes, I then ran it again and the recipe correctly determined that there had been no changes since the last run. Success! This could work!

Next was the task of archiving a copy in some way. I noticed that there was a “Published Date” at the bottom of each Article. I’ve used URLTextSearcher before to find specific links on a page, so I used it to find that date, which would always follow the words “Published Date:”. Strangely, the date in the page source wasn’t in the format displayed on the page (JavaScript modified it for output), but it was completely consistent. The HTML block that I would need to locate looked like this:

<div itemProp="dateModified" class="mod-date"><span>Published Date:</span>
<span itemProp="dateModified">Thu May 10 15:53:48 GMT 2018</span></div>

I simplified that search such that I took just enough of the markup before the date to make certain I found the right string. Using trial and error, er, the power of regular expressions, I captured that date, but I also used a trick I picked up from someone else’s recipe (I think it was in the main AutoPkg recipes repo) to assign different parts of the string to different variables. So my re_pattern looked like this:

itemProp="dateModified"&gt;(?P&lt;version&gt;[^:]*):(?P&lt;mm&gt;..):.. ... (?P&lt;yr&gt;....)

That would assign all of the date string up to the first colon (e.g., Thu May 10 15) to version, the minutes of the timestamp to mm, and the four-digit year to yr.[3] I then used those variables to build an output filename more to my liking. (I could have just captured the whole date+timestamp and used it as is, but I wanted the year first and I wanted it to say UTC not GMT, being the pedant that I am.)
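To see those captures outside AutoPkg, here is a minimal Python sketch that un-escapes the pattern above and runs it against the sample HTML block. The `archive_name` helper and its filename layout are hypothetical, chosen only to illustrate "year first, UTC not GMT"; they are not necessarily what the recipe produces.

```python
import html
import re

# The pattern exactly as it appears in the recipe's XML, entities included.
RE_PATTERN_XML = (
    'itemProp="dateModified"&gt;(?P&lt;version&gt;[^:]*):'
    '(?P&lt;mm&gt;..):.. ... (?P&lt;yr&gt;....)'
)
# Un-escaping &lt; and &gt; yields the regular expression the XML parser hands over.
RE_PATTERN = html.unescape(RE_PATTERN_XML)

# The sample block from the Article's page source.
SAMPLE = (
    '<div itemProp="dateModified" class="mod-date"><span>Published Date:</span>\n'
    '<span itemProp="dateModified">Thu May 10 15:53:48 GMT 2018</span></div>'
)


def parse_published_date(page_source):
    """Return the named captures (version, mm, yr) as a dict."""
    match = re.search(RE_PATTERN, page_source)
    if match is None:
        raise ValueError("Published Date not found in page source")
    return match.groupdict()


def archive_name(article, lang, parts):
    """Hypothetical filename layout: year first, and UTC instead of GMT."""
    return f"{article}_{lang}_{parts['yr']} {parts['version']}{parts['mm']} UTC.html"


if __name__ == "__main__":
    parts = parse_published_date(SAMPLE)
    print(parts)
    print(archive_name("HT208020", "en-us", parts))
```

Note that the first `itemProp="dateModified"` in the sample is followed by a space rather than `>`, so the pattern only matches at the second one, which is the span that holds the date.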

A subtle detail here: Normally, I would provide URLTextSearcher a URL for a web page. I could certainly have had AutoPkg download this page redundantly to perform the search for the date+timestamp, but I decided to see if using a local reference (file:///) would work. It did. In addition to eliminating a redundant download, this method accounts for the rare instance when the page might change in the split second between the two calls to fetch the page.
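The same idea can be tried outside AutoPkg. This small Python sketch turns a local path into a file:/// URL and reads it through ordinary URL-fetching code; the function name is made up for the illustration, and it says nothing about how URLTextSearcher itself fetches URLs.

```python
import urllib.request
from pathlib import Path


def read_via_file_url(path):
    """Read a local file through a file:// URL, as URL-based code would."""
    # as_uri() needs an absolute path and percent-encodes spaces and the like.
    url = Path(path).resolve().as_uri()
    with urllib.request.urlopen(url) as response:
        return url, response.read().decode("utf-8")
```

Because the search then runs on the very bytes that were downloaded, the date it finds always belongs to the copy being archived.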

I used the Copier processor to copy the downloaded file to a directory that the recipe user specified in the Input dict, renaming the file with the Article Number, formatted date+time information, and an .html extension. Since I was trying to avoid using a custom processor (particularly since I can’t code in Python… yet), I just documented in the recipe description that the path should exist (since the recipe could fail otherwise).
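For readers who do write Python, the copy-and-rename step amounts to something like the sketch below. This is not the Copier processor's code; `archive_copy` is a hypothetical helper, and raising an error for a missing directory is my assumption about the sensible behaviour, mirroring the "path should exist" caveat.

```python
import shutil
from pathlib import Path


def archive_copy(source, dest_dir, new_name):
    """Copy source into dest_dir under new_name; dest_dir must already exist."""
    dest_dir = Path(dest_dir)
    if not dest_dir.is_dir():
        # Assumption: fail loudly instead of creating the directory.
        raise FileNotFoundError(f"Archive directory does not exist: {dest_dir}")
    target = dest_dir / new_name
    # copyfile leaves the cached download in place and overwrites any
    # existing file that has the same name.
    shutil.copyfile(source, target)
    return target
```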

Enhancements

I now had a working recipe as far as I could tell. I chose to add two more features:

  1. Users must specify the language and country using the LANG input variable.
    Rather than assume that users would want the article that gets served to their country (which is what you get when you omit that information from the URL), I made users specify it. My bet is that en-us will always be the first article updated, so I wanted users to be able to specify that. Also, if they wanted to track more than one language-country combination, they could override it more than once and track both versions this way. Because of this, I added the language-country code to the filename of the page we saved. I made en-us the default, which will meet most users’ needs.
  2. In order to avoid redundant copying of the file when it hasn’t changed, I added the StopProcessingIf processor.
    This is, admittedly, a minor efficiency in this case, since the file is so small, but I do this in all my .ds recipes and decided to be consistent.

Because there was no precedent for this, I did question whether I should invent a new recipe type or even whether I should split the recipe such that the copying occurred in a separate child recipe. I determined that most people who would use the recipe would want to archive a copy of the article and that there was no post-processing of the file (save for the filename), so I kept it all as a .download recipe. For those who didn’t want to retain an archived copy, I carefully placed an EndOfCheckPhase processor just before copying the file so that running the recipe using autopkg run --check would avoid the copy entirely.
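Putting the pieces together, the overall shape of such a recipe can be sketched by building the plist in Python. This is an illustration under assumptions, not the published recipe: the identifier, every Input key except LANG, the destination filename, and the exact order of StopProcessingIf and EndOfCheckPhase are placeholders or guesses on my part. The processor names and argument names are the standard AutoPkg ones.

```python
import plistlib

RE_PATTERN = (
    'itemProp="dateModified">(?P<version>[^:]*):'
    '(?P<mm>..):.. ... (?P<yr>....)'
)


def build_recipe():
    """Return a dict shaped like a .download recipe (illustrative values only)."""
    return {
        "Description": "Archives an Apple Support Article. ARCHIVE_DIR must exist.",
        "Identifier": "com.example.download.AppleSupportArticle",  # placeholder
        "Input": {
            "NAME": "AppleSupportArticle",
            "ARTICLE": "HT208020",                 # hypothetical key name
            "LANG": "en-us",
            "ARCHIVE_DIR": "/path/to/archive",     # hypothetical key name
        },
        "Process": [
            {"Processor": "URLDownloader",
             "Arguments": {"url": "https://support.apple.com/%LANG%/%ARTICLE%",
                           "filename": "%ARTICLE%"}},
            # Search the cached copy instead of downloading the page again.
            {"Processor": "URLTextSearcher",
             "Arguments": {"url": "file://%pathname%",
                           "re_pattern": RE_PATTERN}},
            # Skip the copy when the download did not change.
            {"Processor": "StopProcessingIf",
             "Arguments": {"predicate": "download_changed == False"}},
            # `autopkg run --check` stops here, before the copy.
            {"Processor": "EndOfCheckPhase"},
            {"Processor": "Copier",
             "Arguments": {
                 "source_path": "%pathname%",
                 "destination_path":
                     "%ARCHIVE_DIR%/%ARTICLE%_%LANG%_%yr% %version%%mm% UTC.html",
                 "overwrite": True}},
        ],
    }


if __name__ == "__main__":
    # plistlib escapes < and > in the pattern as &lt; and &gt; on output.
    print(plistlib.dumps(build_recipe()).decode("utf-8"))
```

Tracking a second language-country combination would then be a matter of a second override that changes only LANG.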

Yes, but…

I received very positive feedback on Twitter after I announced the existence of this recipe. So it was all the more painful to have to walk this back a bit a few days later. You see, a scheduled run of my recipe reported a change in HT208020, even though the Published Date remained the same. I confirmed that the content you would see in a browser had not changed. So what did? Items in the page header: things like the page’s “Helpful” rating, the expiry date for the cookie that Apple leaves behind, and any updates Apple may have made to their JavaScript code in the interim. Sigh.

The most robust solution would be to compare only the content inside the body, perhaps even completely stripped of markup. That would require a custom processor that would (1) use URLDownloader or similar code to grab the current version, (2) if it reported that there was a new version, perform a comparison of the body contents with a past version that was stored somewhere in the cache, and (3) report back via the download_changed variable whether the Article’s content had truly changed or not. This is not a level of complexity that interests me right now. If someone else wishes to write that processor and maintain it, I’ll happily update my recipe to use your processor, or you could just steal, er, learn from my recipe and write your own to work with your processor; I’m OK either way.
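For anyone tempted to take that on, the comparison in step (2) might start from something like this sketch, which uses only the Python standard library. It is not an AutoPkg processor: the class and function names are invented for the example, and ignoring script and style blocks inside the body is my assumption about what counts as "content".

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text inside <body>, ignoring <script> and <style> blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_fingerprint(page_source):
    """Hash the body text with markup removed and whitespace normalized."""
    parser = BodyTextExtractor()
    parser.feed(page_source)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def content_changed(old_source, new_source):
    """True only when the visible body text differs between two versions."""
    return body_fingerprint(old_source) != body_fingerprint(new_source)
```

A real processor would still need to keep the previous version (or just its fingerprint) somewhere in the cache and set download_changed from the result, as described in steps (1) and (3).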

The good news is that the recipe as written will simply overwrite the Article, since the date+timestamp for Published Date hasn’t changed. So if you are being notified when the recipe downloads a “new” copy (e.g., using AutoPkgr), you can just check your directory of archived articles; if there isn’t an additional file with a different date in the filename in that directory, the content wasn’t updated. At this point, I’m willing to live with that limitation.

A More General Recipe

Along with this blog post, I’m also releasing a new recipe, webpages.download, that is generalized for any web page. Based on what you’ve read above, you’ll understand why it will work best with static web pages where the header or injected scripts won’t change. The site you’re visiting now is one where it will work well, as it is developed using Jekyll (a static site generator) on GitHub Pages. The default Input values should work as a faux-RSS feed for my blog posts, but you are of course welcome to override them with pages you want to track. I did some text massaging of the date in that one using a shared processor, so you’ll need to add Elliot Jordan’s homebysix-recipes to your RecipesRepo in addition to mine (jazzace-recipes) in order to run it.

Upcoming Presentation

If you want to learn more about AutoPkg, especially if you have dabbled with AutoPkgr but haven’t delved into writing recipes or using AutoPkg at the command line, I’ll be giving a presentation at the 2018 Mac Admins Conference at Penn State on just that topic. This is the conference where I found my professional peer group; I cannot recommend it highly enough. Please approach me and say, “Hi,” if you are there.

Note: The AppleSupportArticle.download recipe has been updated (re: date format) and a second recipe has been added that you may prefer. I wrote an additional blog post about those changes on 2018-09-19.


[1] In developing the AppleSupportArticle AutoPkg recipe, I noticed that the Published Date for the non-English language versions of the article was 15 days later. Armin Briegel surmised to me that all the versions were updated publicly on the same date (May 25) and I tend to agree. Taking three days for someone to find the change makes more sense than 18.

[2] It is common to reference relative paths in the markup on a web site. Since the downloaded page (the equivalent of manually doing a Save As… > Page Source) does not have the same resources locally in the same relative locations, such as CSS (stylesheets) and images, those will not be loaded, affecting appearance.

[3] For those of you who don’t encounter XML/HTML entities very often, &lt; stands for a less than symbol and &gt; for greater than; we can’t use the proper character because they have a special meaning in XML, so this is how you “escape” those characters.