April 10, 2014

Finding Duplicates in Large Archives

Martin Sullivan @ 7:45 pm

The Universal Jobmatch Archive is, as its name suggests, a database of postings from Universal Jobmatch, the UK Government’s employment services web-site. The database aims to capture all the postings made on this site and has grown very large over the year or so that it’s been running. Recently, it was decided to explore the occurrence of duplicate entries in a large free-text description field, notionally used to describe the ‘vacancy’ in question.

[Picture: Movie poster - The Double]

Oh no! Another duplicate

It has been found, by casual inspection, that postings tend to be duplicated across company and geography as posters, largely employment agencies, seek to draw traffic to their own web-sites and capture valuable personal information such as Curricula Vitae (CVs), telephone numbers and e-mail addresses by doing so. This practice seems to be tolerated by the Government Department charged with providing the service, the Department for Work and Pensions (DWP), and the primary contractor for Universal Jobmatch, Monster Worldwide, is profligate in posting them in its own right. Such postings frequently start on some other, unconnected web-site and are then copied, anonymised and re-posted on Universal Jobmatch. The practice, carried out in bulk, dilutes the genuine vacancy data so that, even for users who are search-engine ninjas, the real vacancies become hard to spot.
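As a rough illustration of how exact or near-exact copies in a free-text field might be grouped, here is a minimal Perl sketch. It is not the Archive's actual method: the normalisation rules, the sub names and the sample postings are all assumptions made up for this example. It simply lower-cases each description, squashes punctuation and white space, and groups postings by the digest of what is left.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Reduce a description to a canonical form so trivially different copies match.
# (Hypothetical rules: lower-case, keep only ASCII letters and digits.)
sub normalise {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z0-9]+/ /g;    # collapse punctuation and white space
    $text =~ s/^ | $//g;          # trim the ends
    return $text;
}

# Group posting ids by the digest of their normalised description.
# Returns only the groups that hold more than one id.
sub find_duplicates {
    my (%description_for) = @_;
    my %ids_for;
    for my $id (sort keys %description_for) {
        my $digest = md5_hex(normalise($description_for{$id}));
        push @{ $ids_for{$digest} }, $id;
    }
    return grep { @$_ > 1 } values %ids_for;
}

# Invented sample data, for illustration only.
my %postings = (
    101 => 'Warehouse Operative required, immediate start.',
    102 => 'WAREHOUSE OPERATIVE required -- immediate start!',
    103 => 'Experienced chef wanted for busy kitchen.',
);

for my $group (find_duplicates(%postings)) {
    print join(', ', @$group), "\n";    # prints "101, 102"
}
```

A digest only catches copies that are identical after normalisation. Postings that have been reworded or partly anonymised would need a fuzzier comparison.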


February 22, 2014

Installation Problems Caused by Disk Space

Martin Sullivan @ 5:51 pm

Ubuntu Linux updates have been failing with “No space left on device”, although there seems to be an abundance of disk space. The actual problem was a lack of inodes, and the solution was to archive a large number of Linux header files to free up a workable set of them. To understand the rest of this posting you’ll need to know some UNIX System Administration.
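Every file, directory and symbolic link uses up an inode, so a file system can be "full" while `df` still shows free blocks; `df -i` shows the inode figures instead. To find out where the inodes have gone, something like the following Perl sketch can help. It is only an illustration: the sub name is invented, and the default of /usr/src is an assumption about where the header files live.

```perl
use strict;
use warnings;
use File::Find;

# Count file-system entries beneath each immediate subdirectory of $root.
# Each entry uses an inode (hard links are over-counted, so treat the
# figures as approximate). Returns a hash reference: subdirectory => count.
sub inode_usage {
    my ($root) = @_;
    opendir(my $dh, $root) or die "Cannot open $root: $!";
    my @subdirs = grep { !/^\.\.?$/ && -d "$root/$_" && !-l "$root/$_" }
                  readdir $dh;
    closedir $dh;

    my %count;
    for my $dir (@subdirs) {
        # The callback also fires for the subdirectory itself.
        find(sub { $count{$dir}++ }, "$root/$dir");
    }
    return \%count;
}

# Assumed default location; pass another directory as the first argument.
my $root = @ARGV ? $ARGV[0] : '/usr/src';
if (-d $root) {
    my $usage = inode_usage($root);
    for my $dir (sort { $usage->{$b} <=> $usage->{$a} } keys %$usage) {
        printf "%8d %s\n", $usage->{$dir}, $dir;
    }
}
```

The largest counts point at the directories worth archiving or removing.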


January 14, 2014

Using Cursors in Perl DBI

Martin Sullivan @ 1:02 pm

This is an example of the use of cursors with Perl and DBI. Basic knowledge of Perl and SQL is assumed.

Every now and again a large data-set, retrieved from a database, has to be handled in Perl. Tim Bunce’s Perl DBI is the foremost Perl add-on for accessing a Relational Database, and drivers exist for a number of Database Managers (DBMs), both commercial and Open Source. It is used extensively, together with PostgreSQL, in the Universal Jobmatch Archive and with the occasional, and as yet unpublished, bit of Biology.
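By way of a taster, here is a minimal sketch of the general pattern with DBD::Pg: declare a server-side cursor and fetch from it in batches, so that the whole result set never has to sit in memory at once. This is not necessarily the code used in the Archive. The sub name, cursor name, connection details and the table in the usage comment are all hypothetical.

```perl
use strict;
use warnings;
use DBI;

# Walk a large result set in batches through a server-side cursor.
# $dbh should have AutoCommit switched off, because a plain PostgreSQL
# cursor only exists inside a transaction.
sub for_each_row {
    my ($dbh, $select_sql, $batch_size, $callback) = @_;
    $batch_size = int($batch_size) || 1000;

    $dbh->do("DECLARE archive_csr CURSOR FOR $select_sql");
    while (1) {
        my $sth = $dbh->prepare("FETCH $batch_size FROM archive_csr");
        $sth->execute;
        my $rows = $sth->fetchall_arrayref;
        last unless @$rows;                 # cursor exhausted
        $callback->(@$_) for @$rows;        # one call per row
    }
    $dbh->do('CLOSE archive_csr');
    return;
}

# Usage (hypothetical database, credentials and table):
#
#   my $dbh = DBI->connect('dbi:Pg:dbname=archive', 'user', 'secret',
#                          { RaiseError => 1, AutoCommit => 0 });
#   for_each_row($dbh, 'SELECT id, description FROM vacancy', 1000,
#                sub { my ($id, $description) = @_; print "$id\n" });
#   $dbh->commit;
#   $dbh->disconnect;
```

The batch size is a trade-off: bigger batches mean fewer round trips to the server, smaller ones mean less memory in the Perl process.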


September 8, 2013

An Interactive Map

Martin Sullivan @ 1:01 pm

In days of yore some work was done on the Jobcentre Plus Mirror web-site. It allowed a summary to be ‘floated’ before a link to a fuller description. The user could roll over stuff and be quickly apprised, while keeping the underlying table quite compact.

[Picture: Movie poster -- Treasure Island]

You’ll need a map, preferably with dynamic elements

So pleasing was this that a Technical Note was written on the underlying HTML-injection technique. It was subsequently discovered, of course, that this approach had occurred to others and no unique invention could be claimed. Since the death of the Jobcentre Plus Mirror Database, this behaviour is now observable in the Examplar.

This work has been returned to in recent days. A Thespian Friend has a work-in-progress that involves a Script. It has a number of locations where amusing events happen and the well-written comedic characters exchange cracking dialogue. Thought has been given, then, to a map of the central locations, which are both imaginary and a bit dystopian. The users would roll over the flagged bits of map and up would pop a morsel that would lead them to a world of not completely satiated interest. Well, there is a Script/Movie to sell, so it wouldn’t be encyclopedic.

First some work would need to be done on the underlying Javascript/HTML/CSS, and for this purpose a rather po-faced ‘straight’ map was completed of Sheppey, the island that’s just off the coast and part of Kent. The content for this is all from public sources: the map itself is a modified version of one you can find on Google Maps, and the data for the three locations involved has all been shamelessly ‘borrowed’ from Wikipedia. The usual caveat applies, therefore. If anybody is unhappy about this content, drop ZOIS a line and something else will be used (that involves more originality, one would hope). Sheppey was chosen, by the way, because most of the examples on this site tend to involve Cumbria, and we were all getting a bit bored with that. If you were actually looking for something about Sheppey and happened on this site, then sorry for the distraction.

This is all the subject of ongoing experimentation, but a Technical Note is in the offing, probably as an adjunct to the existing Quick and Dirty Ajax, wherein many of the HTML-injection techniques are already described. Until then all the code is available on-line, obviously. The Author is open to feedback and criticism (but please be gentle).

Thank you for getting this far. Hopefully it was entertaining, if not informative, and it makes a change from all that Universal Jobmatch ranting.

July 16, 2013

Farewell to WRU

Martin Sullivan @ 3:46 pm

[Picture: Movie poster -- The Telegraph Trail]

Probably used an encoding predating ITA2

The last Telegraph has been sent, in India, and thus the last large Telex network has been closed. This important cultural event has been noted in a number of places. Several articles examine the somewhat terse nature of Telegraph/Telex, which was expensive and metered. This led to the use of clunky acronyms and sparse humour. The comparison with SMS is inevitably made, favourably, and some pleasure is derived from the foreshortened wit of yore; Cary Grant makes an appearance.

Although it has moved on to more modern codes, it also notes the last substantive use of the International Telegraph Alphabet Number 2 (ITA2) character set, but this was less well appreciated. (more…)

May 2, 2013

Resurrection of Old Computers with New Ubuntu

Martin Sullivan @ 10:48 am

There are several old, elderly, past-their-expiry-date computers in the ZOIS estate. You sort of accumulate these things and after years of loyal service you are loath to discard them. They’re usually more than adequate to do things like writing Blog entries and they’re possessed of good keyboards and reasonable screens and sound. They may not be able to do Youtubery but that, in your old dumb correspondent’s largely discounted opinion, is no great loss.

[Picture: Difference Engine Number 1]

A computer definitely too old

And some e-mail to an old friend will explain more …

Well I’m just about to finish here. I’ve spent most of the day fiddling with an ancient machine (called Orange, another from the McDonald’s skip). It’s a wheezing mess with Ubuntu 10.04 imposed upon it, but impossible with 12.04. Not enough memory for all the pretty animated widget tripe.

I’ve therefore put it up under the server and installed LXDE as its Window Manager. It’s usable again.

There was something called Lubuntu, a supposed Light-weight Ubuntu, but I didn’t have enough memory for the graphical install, and the server solution seemed the easiest way.

I may write this up and put it on the Blog. All appropriately pointless, and therefore apt.

Inspired by a post on a popular Tech News site, I have. (more…)

April 2, 2013

MSWord Stuff in Descriptions on UJM

Martin Sullivan @ 12:53 pm

Universal Jobmatch (UJM) has a large ‘description’ field which is unstructured. Normally plain-text is found there, or more-or-less plain-text that is easily massaged into a form that can be scanned for key-words, stored in the database and so forth. In other posts it was noted that both script and style elements were creeping into this field and the scraper now removes them.

Lately the scraper has had to deal with save-as-HTML Microsoft Word documents, which were confusing it mightily. As the reader will recall, one simply has to type “Hello World” into Word and then save it as HTML to get a vast document stuffed full of what should be inconsequential HTML-guff with “Hello World” somewhere in the middle of it. By and large this works, but embedding it into another HTML page can have consequences. These can usually be seen as missing text, bizarre formatting or odd font choices on the Universal Jobmatch web-site itself. The scraper was finding it tough digging the valuable text out of all this wibble too. The main culprits are ‘actionable’ HTML comments. These are comments which can contain instructions to some scripting element or modify behaviour and formatting in a particular browser. They can be quite large and contain all manner of confusing embedded HTML, script, style and what-have-you.

The UJM scraper now takes care of all of this, so hopefully there’ll be fewer baffling browser instructions like “false false false EN-GB X-NONE X-NONE MicrosoftInternetExplorer4” in the ‘description’ fields of posts.

Another consequence of this is the inadvertent use of Win-1252 characters from these Word Documents in what has been declared a UTF-8 HTML page. Some of the resulting bytes are not valid UTF-8, the chosen character set for all of this work, causing bafflement to the database and aborted updates. It necessitates the use of the Encode module and code like this (in Perl):
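For the curious, the general idea can be sketched in a few lines of Perl. This is not the scraper's actual code: the sub name and the sample input are made up for illustration, and regular expressions are a blunt tool for HTML, so treat it as a sketch under those assumptions. To anything that isn't Internet Explorer, Word's conditional comments are just ordinary comments, so they go out in one pass, followed by script and style elements and then whatever tags remain.

```perl
use strict;
use warnings;

# Strip Word-flavoured HTML down to its text (a rough, regex-based sketch).
sub strip_word_markup {
    my ($html) = @_;
    # Comments first, including conditional ones such as
    # <!--[if gte mso 9]> ... <![endif]-->
    $html =~ s/<!--.*?-->//gs;
    # Whole script and style elements, contents and all.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;
    # Any remaining tags become a space so words don't run together.
    $html =~ s/<[^>]+>/ /g;
    # Tidy the white space.
    $html =~ s/\s+/ /g;
    $html =~ s/^ | $//g;
    return $html;
}

# Invented sample input, for illustration only.
my $sample = '<p>Hello <!--[if gte mso 9]><xml><w:WordDocument>false false EN-GB'
           . '</w:WordDocument></xml><![endif]--><style>p{color:red}</style>World</p>';

print strip_word_markup($sample), "\n";    # prints "Hello World"
```

A proper HTML parser would be more robust than this against malformed or nested markup.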

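use Encode qw(decode encode);   # belongs at the top of the file
# Decode leniently: bytes that are not valid UTF-8 become U+FFFD.
my $chars = decode('UTF-8', $stuff, Encode::FB_DEFAULT);
return encode('UTF-8', $chars, Encode::FB_CROAK);   # valid UTF-8 octets out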

The solution has appeared on Stack Overflow as well as in a number of other places on the Internet.

The usual exhortations on the feedback front are made.

March 9, 2013

Another Web-site, Another Numpty

Martin Sullivan @ 1:13 pm

Another week and another web-site. This time it’s for the production company that will be used to back the work-in-progress that has the working title ‘True’. The web-site, trueproductionsltd.co.uk, is rather thin on content, with the author of the Play on which this is based staring balefully over a couple of holding pages.

Of interest are fonts, which caught me out and demonstrated my inner numpty, and the drop-down menus. (more…)

February 9, 2013

Browser to Squid, Privoxy, Tor and the World

Martin Sullivan @ 4:50 pm

Now it serves me to keep a discreet eye on some web-sites. It also serves me to circumvent the odd control that some other web-sites have felt justified in placing upon me, foolish though this sounds. I am the JCPM-guy, as was, and some seem irrationally fearful of that.

[Picture: Movie poster - The Three Stooges in Spookes]

Keeping them at bay

On these occasions I feel a little anonymity doesn’t go amiss. Stepping into the shadows involves The Onion Router, Tor. Without being too technical, the Tor router system bounces your TCP/IP traffic all over the world, encrypting it and re-encrypting it before it bafflingly connects to the eventual desired machine. It thus ensures that when the connection is made with that desired machine, the machine has no idea who made the connection.

In using this valuable resource, we’d only want to deploy it on one or two web-sites. The need, then, is to construct a proxy system that allows most traffic to proceed normally and only those web-sites to be connected via Tor. It gets slightly more complicated, for Tor presents its interface as SOCKS and a Proxy that understands this is required. (more…)

February 4, 2013

A Web-site for Some Thespian Friends

Martin Sullivan @ 12:29 pm

I’m at a bit of a loose-end, these days, with my other good works forbidden to me. I’ve therefore decided to help both the odd Biologist who’s struggling with their ‘R’ or Bioperl, and to pick up where I left off on the open-source Distributed Transaction thing that I’ve been doing on-and-off, mostly off, for most of this century. But first a little web-site development for some Thespian friends of mine.

[Picture: Mother and Daughter]


Emma Rydal and Toby Gaffney, who’ve been involved in the performing arts for some time and have assailed the pinnacles of popular serial drama, are in the process of putting together an independently produced Movie based on Emma’s play ‘True’. ‘True’, the play, has seen a number of theatrical outings and has been well received. They’ve also gone on to produce a short film based on the last act, called ‘One Last Walk’, and that went well, so they’re going to do the whole thing.

They’ve organised themselves into a company, called ‘La’al Marra Productions’ and this, these days, demands a web-site. I’ve stuck together a small brochure site, based on stuff that I’ve already done for the Cockermouth Film Club and on previously constructed art-work.

There are one or two innovations; well, they’re new to me. I thought I’d document them here, even though there are other, better explanations elsewhere on the Internet. As ever, you can always look at the source code of the pages and the style-sheet. (more…)

