“Variable length lookbehind not implemented” but it isn't variable length

Stephen 05/15/2018. 4 answers, 1.682 views
regex perl

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.

use strict;
use warnings;

my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';

if ($text =~ m/$regex/){
    print "true\n";
}
else {
    print "false\n";
}

This gives the error "Variable length lookbehind not implemented in regex."

I am hoping you can help with several issues:

  1. I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
  2. I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i), the error actually goes away. How will Perl interpret this part of the regex? I would have thought the first two characters, "(?", meant "an optional literal parenthesis", except that the parenthesis isn't escaped, and in that case I would expect a different syntax error because the closing parenthesis would then be unmatched.
  3. This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?

4 Answers


anubhava 05/16/2018.

I have reduced your problem to this:

my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");

The /i (case-insensitive) modifier is what triggers the error. Under /i, certain character pairs such as "ss" or "st" also match a single-character typographic ligature, which makes the lookbehind variable-length. For instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06, LATIN SMALL LIGATURE ST).

However, if we remove the /i (case-insensitive) modifier, it works, because typographic ligatures are then not matched.
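To make the length difference concrete, here is a minimal sketch (the variable names are illustrative, not from the question) that matches /August/i against both spellings and reports the length of each string. The ligature is written as the escape \x{FB06} so the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# Plain ASCII spelling: 6 characters.
my $plain = "AUGUST";
# Same word ending in U+FB06 (LATIN SMALL LIGATURE ST): 5 characters.
my $ligature = "augu\x{FB06}";

my %report;
for my $s ($plain, $ligature) {
    # The anchors force the pattern to account for the whole string.
    $report{ length $s } = ($s =~ /\AAugust\z/i) ? "match" : "no match";
}

printf "%d characters: %s\n", $_, $report{$_} for sort keys %report;
```

Both strings match the same six-letter pattern, which is why the regex compiler cannot treat a lookbehind containing "st" under /i as fixed-width.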

Solution: Use the /aa modifier, i.e.:

/(?<!st)A/iaa

Or in your regex:

my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");

From perlre:

To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.

See a closely related discussion here


choroba 05/15/2018.

That's because st can be a ligature. The same happens to fi and ff:

#!/usr/bin/perl
use warnings;
use strict;

use utf8;

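my $fi = "\x{FB01}";    # LATIN SMALL LIGATURE FI (ﬁ), a single character
print $fi =~ /fi/i;     # prints 1: the ligature matches the two-letter pattern

So imagine something like ﬁ|fi (the one-character ligature U+FB01, or the two letters f and i) where, indeed, the lengths of the alternatives aren't the same.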


Adam Katz 07/10/2018.

st could be represented by a 1-character stylistic ligature, ﬅ (U+FB05) or ﬆ (U+FB06), so its length could be 2 or 1.

Quickly finding perl's full list of 2→1-character ligatures using a bash command:

$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
    perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done

ff fi fl ss st

These respectively represent the ﬀ (U+FB00), ﬁ (U+FB01), ﬂ (U+FB02), ß (U+00DF), and ﬅ/ﬆ (U+FB05/U+FB06) ligatures.
(ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)

Perl also supports the remaining stylistic ligatures, ﬃ (U+FB03) and ﬄ (U+FB04) for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ﬀ and ﬁ/ﬂ separately.
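As a cross-check that does not depend on compile errors, the pairs can also be looked up from the case-folding side. This is only a sketch: it assumes that fc() (Perl 5.16+) reflects the same full case folding that /i relies on, and the hash name %folds_to is made up for the example. It scans the Basic Multilingual Plane for single characters whose fold is exactly two ASCII letters.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);    # fc() requires Perl 5.16 or later

# Map each two-letter ASCII fold to the code points that produce it.
my %folds_to;
for my $cp (0x80 .. 0xFFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip UTF-16 surrogates
    my $folded = fc(chr $cp);
    push @{ $folds_to{$folded} }, sprintf('U+%04X', $cp)
        if $folded =~ /\A[a-z]{2}\z/;
}

my $pairs = join ' ', sort keys %folds_to;
print "$pairs\n";
print "$_ <= @{ $folds_to{$_} }\n" for sort keys %folds_to;
```

With current Unicode data this should list the same five pairs as the shell loop (ff fi fl ss st); ffi and ffl are skipped here only because the filter asks for exactly two letters.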

Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ (U+0133) for ij or an obsolete Spanish ligature for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.

Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.

Perl 5.14 introduced ligature support, so earlier versions don't have this problem.

Workarounds

Workarounds for /(?<!August)x/i (only the first will properly avoid August):

  • /(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
  • /(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
  • /(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
  • /(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
  • /(?<!Augus[t])x/i (breaks ligature seeking ¹)
  • /(?<!Augus.)x/i (slightly different, matches more)
  • /(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)

These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.

The first variation (the only comprehensive one) handles the variable width with two lookbehinds: the first covers the six-character version (no ligatures, as noted in the first quote below), and the second covers the ligatures by using a lookahead (which has zero width!) for st (including its ligatures) and then accounting for the ligature's single-character width with a dot (.).
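A small sketch of how that first variation behaves, using a few test strings invented for this example (\x{FB06} is the st ligature). It only exercises the pattern from the list above and is not an exhaustive test.

```perl
use strict;
use warnings;

# First workaround from the list above: two fixed-width lookbehinds.
my $re = qr/(?<!Augus[t])(?<!Augu(?=st).)x/i;

# Hypothetical inputs: each is a word immediately followed by "x".
my @cases = (
    [ "Augustx",       "six-letter spelling" ],
    [ "AUGUSTx",       "upper case"          ],
    [ "augu\x{FB06}x", "ligature spelling"   ],
    [ "Marchx",        "unrelated word"      ],
);

my %matched;
for my $case (@cases) {
    my ($string, $label) = @$case;
    $matched{$label} = ($string =~ $re) ? 1 : 0;
    printf "%-20s %s\n", $label, $matched{$label} ? "x matched" : "x rejected";
}
```

The six-letter spellings are caught by the first lookbehind and the five-character ligature spelling by the second, so only the unrelated word lets x match.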

Two segments of the perlre man page:

¹ Case-insensitive modifier /i & ligatures

There are a number of Unicode characters that match a sequence of multiple characters under /i. For example, "LATIN SMALL LIGATURE FI" should match the sequence fi. Perl is not currently able to do this when the multiple characters are in the pattern and are split between groupings, or when one or more are quantified. Thus

"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

² ASCII-safe modifier /aa (perl 5.14+)

To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}), specify the a twice, for example /aai or /aia. (The first occurrence of a restricts the \d, etc., and the second occurrence adds the /i restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.

To summarize, this modifier provides protection for applications that don't wish to be exposed to all of Unicode. Specifying it twice gives added protection.
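A short sketch of the difference /aa makes, using the two characters mentioned in this answer (written as escapes; the hash keys are just labels for the example):

```perl
use strict;
use warnings;

my $ligature = "\x{FB06}";    # LATIN SMALL LIGATURE ST
my $kelvin   = "\x{212A}";    # KELVIN SIGN

# 1 if the non-ASCII character matches the ASCII pattern, else 0.
my %result = (
    'st /i'  => ($ligature =~ /\Ast\z/i)   ? 1 : 0,
    'st /aa' => ($ligature =~ /\Ast\z/iaa) ? 1 : 0,
    'k /i'   => ($kelvin   =~ /\Ak\z/i)    ? 1 : 0,
    'k /aa'  => ($kelvin   =~ /\Ak\z/iaa)  ? 1 : 0,
);

printf "%-7s %s\n", $_, $result{$_} ? "matches" : "does not match"
    for sort keys %result;
```

Under plain /i both non-ASCII characters match their ASCII counterparts; with /iaa neither does, so "st" in the pattern can only ever consume two characters.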


Hegel F. 05/15/2018.

Put (?i) after lookbehind:

(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)

or

(?<!(Mon|Fri|Sun)day |August )(?i:abcd)

To me it seems to be a bug.
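One caveat, sketched below with made-up test strings: with (?i) moved after the lookbehind, the lookbehind itself becomes case-sensitive, so this is not a drop-in replacement for the original pattern if lower-case day names should also be excluded.

```perl
use strict;
use warnings;

# (?i) now starts after the lookbehind, so the lookbehind is
# matched case-sensitively while abcd stays case-insensitive.
my $re = qr/(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)/;

my %matched;
for my $text ("Monday abcd", "monday abcd", "Tuesday ABCD") {
    $matched{$text} = ($text =~ $re) ? 1 : 0;
    printf "%-16s %s\n", "\"$text\"", $matched{$text} ? "true" : "false";
}
```

"Monday abcd" is still excluded, but "monday abcd" now matches because "monday " differs in case from the lookbehind alternatives.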

