Failing In So Many Ways


Liang Nuren – Failing In So Many Ways

Dates and Regular Expressions

Suppose you need to parse dates out of a file, and for some reason you can’t use strptime.  Maybe there can be newlines in the middle of your date format, or maybe there’s some other noise or oddity which makes life difficult.  Maybe you expect your date format to look something like YYYY-MM-DD, but you don’t know what the separators are, or whether all the components will even be on the same line.

Example line:

2011:
12-   25 |" some other "|" text that "| might be formatted a bit
|" weird "|and,maybe,even,"embeds
 fields in a",line|

Now I know, you’re going to say that things like that could never ever happen.  And you’re wrong: I’ve personally seen production code from another company output something eerily similar to the example above.  But at any rate, your problem is that you need to parse that 2011-12-25 out, and a regular expression looks mighty tempting.  Maybe you’ll whip up some kind of fancy line parser that understands the gibberish above, but at some point you’re going to have to do some surgery to extract that malformed date expression.  Maybe you’ll do something like this (Perl): qr/(\d{4})[\s:\-\/]+(\d{2})[\s:\-\/]+(\d{2})/.  Now your date is “$1-$2-$3”, provided it actually matched anything (and yes, I tested it).
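For anyone who wants to poke at it outside Perl, here is a minimal sketch of the same extraction in Python. The pattern is a straight translation of the one above; the names DATE_RE, RECORD and extract_date_string are made up for illustration and are not from any real codebase.

```python
import re

# Python translation of the Perl pattern from the post: four digits, two
# digits, two digits, each pair separated by one or more of whitespace,
# ':', '-' or '/'. Note that \s also matches newlines.
DATE_RE = re.compile(r"(\d{4})[\s:\-/]+(\d{2})[\s:\-/]+(\d{2})")

# The messy example record from the post, embedded newlines included.
RECORD = (
    '2011:\n'
    '12-   25 |" some other "|" text that "| might be formatted a bit\n'
    '|" weird "|and,maybe,even,"embeds\n'
    ' fields in a",line|\n'
)


def extract_date_string(text):
    """Return 'YYYY-MM-DD' for the first date-like run in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # Equivalent of "$1-$2-$3" in the Perl version.
    return "-".join(match.groups())


print(extract_date_string(RECORD))  # 2011-12-25
```

Notice that the pattern only pulls out three runs of digits. It has no opinion on whether they form a real date, and that is deliberate.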

But really, we’re only interested in things between April and June, and between 2009 and 2021, so we’ll just put a bit of date filtering into the regular expression: qr/(20[012]\d)[\s:\-\/]+(0[456])[\s:\-\/]+([0123][0-9])/.  That is already too loose (20[012]\d accepts every year from 2000 through 2029), and now we need to make sure that it doesn’t accept April 39th, so we’ll need to modify the regular expression just a bit more.

NO.

You already have the date string you need with this regular expression, qr/(\d{4})[\s:\-\/]+(\d{2})[\s:\-\/]+(\d{2})/, and it’s already hard enough to maintain.  Feed the parsed date into a date library and do your validation and range checking there.  Save yourself… save the world!  And most of all: save me from having to pick up the pieces when the target data of that horrific regular expression changes.
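As a rough sketch of what “do it in the date library” might look like, here is a Python version using the standard datetime module. The helper names and the bounds constants are hypothetical, picked only to mirror the April-to-June, 2009-to-2021 example; adjust them to whatever your data actually needs.

```python
import re
from datetime import date

# Same dumb extraction pattern as before: it finds digits, nothing more.
DATE_RE = re.compile(r"(\d{4})[\s:\-/]+(\d{2})[\s:\-/]+(\d{2})")

# Hypothetical bounds taken from the example in the post.
FIRST_YEAR, LAST_YEAR = 2009, 2021
FIRST_MONTH, LAST_MONTH = 4, 6


def parse_date(text):
    """Return a datetime.date for the first date-like run, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible dates such as April 39th or April 31st.
        return date(year, month, day)
    except ValueError:
        return None


def in_range(d):
    """Range check done with plain comparisons instead of regex tricks."""
    return (FIRST_YEAR <= d.year <= LAST_YEAR
            and FIRST_MONTH <= d.month <= LAST_MONTH)


print(parse_date("2011:\n04-   30"))  # 2011-04-30
print(parse_date("2011-04-39"))       # None
```

When the range requirements change, you edit two constants instead of re-deriving a character class.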


Filed under: Software Development

2 Responses

  1. Mara Rinn says:

    +1 REs are for data extraction & matching. Validation should be done elsewhere!

    Though I have seen form validation done in ECMAScript in the browser *as well as* on the server (in Perl/Python/whatever), which is perfectly fine: the ECMAScript stuff can be more responsive as well as doing funky things like showing and hiding form elements when you select (for example) “Shipping address is the same as Postal address”.

    Though I do have a bit of a giggle at your expense about the use of qr/…/, when the purpose of qr is to escape from leaning toothpick syndrome 😉

    qr#(\d{4})[\s:\-/]+(\d{2})[\s:\-/]+(\d{2})#

    It’s only a marginal improvement, but the principle applies universally.

    And I’d go one further:

    qr#(\d{4})[^[:alnum:]]+(\d{2})[^[:alnum:]]+(\d{2})\b#

    Allow any non-alphanumeric character(s) as the separator, ensure the string ends on a word break.

    • Liang Nuren says:

      Haha, I was wondering if someone would comment on qr/. It’s kinda a habit to use that particular one, and the same concept applies for q/, qq/, qr/, and really all of the operators like that.

      As we discussed on Twitter, :alnum: is dangerous because it can swallow your column delimiter. 😉

      As to client side validation on the web – yeah I think that’s a pretty fantastic idea for the most part. Immediate feedback is so very important… but at the same time I think you shouldn’t pour tons of time into developing the perfect (unreadable) regex. Hell – set up an AJAX response and send it server side for final validation if you have to. 🙂

      Anyway. We don’t really disagree and I’m really glad someone besides coworkers/ex coworkers reads my technical blog posts!
