Searching, modifying and encoding text in the .NET framework
Lesson 1: Forming Regular Expressions
Console app to process regular expressions:
namespace TestRegExp
{
    using System;
    using System.Text.RegularExpressions;
    class Class1
    {
        // args[0] is the pattern, args[1] is the input to test.
        static void Main(string[] args)
        {
            if (Regex.IsMatch(args[1], args[0]))
            {
                Console.WriteLine("Input matches RegEx");
            }
        }
    }
}
Running TestRegExp ^\d{5}$ 12345
produces Input matches RegEx
When validating input always begin with ^ and end with $. This ensures input matches the regular expression and does not merely contain it.
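The difference between matching and merely containing can be sketched as follows. This is a minimal illustration only; the class and method names (AnchorDemo, IsFiveDigits, ContainsFiveDigits) are hypothetical and not part of the original notes.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Hypothetical helper: true only when the whole input is five digits.
    public static bool IsFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"^\d{5}$");
    }

    // Same pattern without anchors: true when five digits appear anywhere.
    public static bool ContainsFiveDigits(string input)
    {
        return Regex.IsMatch(input, @"\d{5}");
    }

    static void Main()
    {
        Console.WriteLine(IsFiveDigits("12345"));             // True
        Console.WriteLine(IsFiveDigits("abc12345xyz"));       // False
        Console.WriteLine(ContainsFiveDigits("abc12345xyz")); // True
    }
}
```

The unanchored version accepts input that only contains a valid value, which is why it should not be used for validation.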
Regular expressions difficult to create unless familiar with format.
Reading regular expressions is confusing - always add comments to explain their action to other developers.
Matching Simple Text
The regular expression "abc" will match the strings "abc", "abcde", "yzabc" because each of the strings contains the regular expression.
Matching Text In Specific Locations
Character | Description |
---|---|
^ | Match must begin at first character of string. With the Multiline option ^ also matches the beginning of any line. |
$ | Match must end at last character of string or last character before a final \n. With the Multiline option $ also matches the end of any line. |
\A | Match must begin at first character of string. Ignores multiple lines. |
\Z | Match must end at either last character of string or last character before \n. Ignores multiple lines. |
\z | Match must end at last character of string. Ignores multiple lines. |
\G | Match must occur at point where previous match ended. When used with Match.NextMatch ensures that matches are contiguous. |
\b | Match must occur at boundary between \w (alphanumeric) and \W (non alphanumeric) characters. The match must occur on word boundaries which are first or last characters in words separated by non alphanumeric characters. |
\B | Match must not occur on \b boundary. |
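A short sketch of two of the anchors from the table, ^ and \b. The helper names (CountLineStarts, HasWholeWord) and the sample strings are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PositionDemo
{
    // Counts words that start a line; the result depends on the Multiline option.
    public static int CountLineStarts(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+", options).Count;
    }

    // True when 'word' occurs as a whole word, using \b on both sides.
    public static bool HasWholeWord(string text, string word)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(word) + @"\b");
    }

    static void Main()
    {
        string text = "one\ntwo\nthree";
        Console.WriteLine(CountLineStarts(text, RegexOptions.None));      // 1
        Console.WriteLine(CountLineStarts(text, RegexOptions.Multiline)); // 3
        Console.WriteLine(HasWholeWord("the theory", "the"));             // True
        Console.WriteLine(HasWholeWord("a theory", "the"));               // False
    }
}
```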
Many regular expression elements begin with a backslash (\). In C# prefix the string with @ (a verbatim string literal) so backslashes are not treated as escape characters.
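As a small illustration, the two literals below hold the same pattern; the class name VerbatimDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimDemo
{
    // Regular string literal: every backslash must be doubled.
    public const string Escaped = "^\\d{5}$";

    // Verbatim string literal: backslashes are passed through unchanged.
    public const string Verbatim = @"^\d{5}$";

    static void Main()
    {
        Console.WriteLine(Escaped == Verbatim);              // True
        Console.WriteLine(Regex.IsMatch("12345", Verbatim)); // True
    }
}
```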
Matching Special Characters
Character | Description |
---|---|
\a | Match bell (0x07) |
\b | Word boundary except within [] character class where it means backspace. In replacement patterns it means backspace. |
\t | Tab (0x09) |
\r | Carriage return (0x0D) |
\v | Vertical tab (0x0B) |
\f | Form feed (0x0C) |
\n | New line (0x0A) |
\e | Escape (0x1B) |
\040 | Match ASCII as octal (up to 3 digits), e.g. \040 represents a space. |
\x20 | Matches ASCII using hexadecimal - exactly two digits |
\cC | Matches ASCII using control characters - e.g. \cC is control C |
\u0020 | Matches Unicode character using hexadecimal notation (4 digits) |
\ | When followed by character not recognised as escaped character, matches that character. \* represents *, whilst \\ represents a single backslash. |
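A few of the escapes in the table can be checked with a sketch like the one below. The helper MatchesSpace and the class name are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // True when 'pattern' matches exactly one space character.
    public static bool MatchesSpace(string pattern)
    {
        return Regex.IsMatch(" ", "^" + pattern + "$");
    }

    static void Main()
    {
        Console.WriteLine(MatchesSpace(@"\040"));          // True (octal)
        Console.WriteLine(MatchesSpace(@"\x20"));          // True (hexadecimal)
        Console.WriteLine(MatchesSpace(@"\u0020"));        // True (Unicode)
        Console.WriteLine(Regex.IsMatch("2*3", @"2\*3"));  // True, \* is a literal *
        Console.WriteLine(Regex.IsMatch(@"a\b", @"a\\b")); // True, \\ is a literal backslash
    }
}
```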
Matching Text Using Wildcards
Character | Description |
---|---|
* | Matches preceding character or sub expression zero or more times, e.g. "zo*" matches "z" and "zoo" - the * is equivalent to {0,} |
+ | Matches preceding character or sub expression one or more times, e.g. "zo+" matches "zo" and "zoo" but not "z" - the + is equivalent to {1,} |
? | Matches preceding character or sub expression zero or one times, e.g. "do(es)?" matches the "do" in "do" or "does" - the ? is equivalent to {0,1} |
{n} | n is non-negative integer. Matches n times, e.g. "o{2}" does not match "bob" but does match "food". |
{n,} | n is non-negative integer. Matches at least n times, e.g. "o{2,}" does not match "bob" but does match "fooood" |
{n,m} | m and n are non-negative integers, where n <= m. Matches at least n and at most m times. Note, no space permissible between comma and numbers |
? | If immediately follows other quantifier (*, +, ?, {n}, {n,}, {n,m}) then matching pattern is non-greedy. Non-greedy pattern matches as little of search string as possible, e.g. "oooo" with "o+?" matches a single "o", whilst "o+" matches "oooo" |
. | Match any single character except "\n". To match any character including "\n" use pattern such as "[\s\S]" |
x|y | Matches either x or y. "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food". |
[xyz] | Matches any of the enclosed characters, e.g. "[abc]" matches the "a" in "plain" |
[a-z] | Matches range of characters, e.g. [a-z] matches any lower case character. |
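The quantifier rows above can be tried out with a sketch like this one. FirstMatch is a hypothetical helper that returns the text of the first match, or an empty string when there is none.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Returns the text of the first match (empty string when nothing matches).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        Console.WriteLine(FirstMatch("oooo", "o+"));       // oooo (greedy)
        Console.WriteLine(FirstMatch("oooo", "o+?"));      // o    (non-greedy)
        Console.WriteLine(FirstMatch("z", "zo*"));         // z
        Console.WriteLine(Regex.IsMatch("bob", "o{2}"));   // False
        Console.WriteLine(Regex.IsMatch("food", "o{2}"));  // True
        Console.WriteLine(Regex.IsMatch("\n", "."));       // False
        Console.WriteLine(Regex.IsMatch("\n", @"[\s\S]")); // True
    }
}
```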
Special characters also available for common character ranges:
"\d" matches any digit, equivalent to "[0-9]" for ASCII input.
"\D" matches any non-numeric character.
"\s" matches any white space character, whilst "\S" matches any non-white space character.
"\w" matches any word character (including "_"), equivalent to "[A-Za-z0-9_]" for ASCII input.
"\W" matches any non word character, equivalent to "[^A-Za-z0-9_]" for ASCII input.
To match group of characters surround by (), e.g. "foo(loo){1,3}hoo" matches "fooloohoo" or "fooloolooloohoo" but not "foohoo" or "foololohoo".
You can name these groups to refer to the matched data later. To name a group use the format (?<name>pattern), e.g. foo(?<mid>loo|roo)hoo would match fooloohoo. Later can reference group mid to retrieve loo. Same expression matching fooroohoo would have mid containing roo.
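Reading a named group back might look like the sketch below. The helper name MidOf is an assumption; the pattern is the one from the notes.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Returns the text captured by the "mid" group, or null when there is no match.
    public static string MidOf(string input)
    {
        Match m = Regex.Match(input, @"foo(?<mid>loo|roo)hoo");
        return m.Success ? m.Groups["mid"].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(MidOf("fooloohoo"));      // loo
        Console.WriteLine(MidOf("fooroohoo"));      // roo
        Console.WriteLine(MidOf("foohoo") == null); // True, no match
    }
}
```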
Match using back references
Back referencing using named groups allows searching for other instances of characters that match wildcard. Can be thought of as instruction to match same string again.
The expression (?<char>\w)\k<char> searches for adjacent paired characters, i.e. whenever a single character is the same as the preceding one. The \w matches any single word character and saves it under "char". The \k<char> causes the engine to match the current character against that stored under "char".
To find repeating whole words replace \w with \w+ that will match 1 or more characters. Precede it by a \s to match a space. Gives the expression (?<char>\s\w+)\k<char>. Currently this will also match " the the" in a string such as "in the theory" - to restrict to whole words verify that the repeat match is on a word boundary with a \b, giving (?<char>\s\w+)\k<char>\b
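The effect of the trailing \b can be seen in a sketch like this. The helper name HasRepeat and the sample sentences are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceDemo
{
    // Repeated-word pattern from the notes, without and with the trailing \b.
    public const string Loose = @"(?<char>\s\w+)\k<char>";
    public const string WholeWord = @"(?<char>\s\w+)\k<char>\b";

    public static bool HasRepeat(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }

    static void Main()
    {
        Console.WriteLine(HasRepeat("Paris in the the spring", WholeWord)); // True
        Console.WriteLine(HasRepeat("in the theory", Loose));               // True (false positive)
        Console.WriteLine(HasRepeat("in the theory", WholeWord));           // False
        Console.WriteLine(Regex.Match("letter", @"(?<char>\w)\k<char>").Value); // tt
    }
}
```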
Back references refer to the most recent definition of a group, i.e. the most recent capture, e.g. (?<1>a)(?<1>\1b)* matches aababb with the pattern (a)(ab)(abb)
If group has not captured any substring then reference to that group is undefined and never matches, e.g. \1() never matches anything, but ()\1 matches an empty string.
Backreference parameters
\number - Backreference, e.g. (\w)\1 finds doubled word characters
\k<name> - Named backreference, e.g. (?<char>\w)\k<char> finds doubled word characters. Can use single quotes in place of < and >, e.g. \k'char'
Regular Expression Options
All regular expression options are turned off by default.
Options to specify matching behaviour can be specified in options parameter to Regex(pattern, options).
Alternatively specify in-line (i.e. within the pattern). When using in-line a minus character before an option (or set of options) turns it off, e.g. (?ix-ms) turns on IgnoreCase and IgnorePatternWhitespace and turns off Multiline and Singleline.
RegexOptions | Inline Character | Description |
---|---|---|
None | | No options set |
IgnoreCase | i | Case insensitive matching |
Multiline | m | Changes meaning of ^ and $ so they perform matching at beginning and end of any line, not just of whole string. |
ExplicitCapture | n | Only valid captures are explicitly named or numbered groups of form (?<name>...). Allows unnamed parentheses to act as non-capturing groups |
Compiled | | Regular expression will be compiled to an assembly, yields faster execution at expense of start-up time. |
Singleline | s | Changes meaning of . so that it matches every character (instead of every character except \n) |
IgnorePatternWhitespace | x | Unescaped white space is excluded from pattern and comments following # are enabled. White space is never eliminated from within a character class. |
RightToLeft | | Changes search direction only. Does not reverse substring that is searched for. Lookahead and lookbehind do not change; lookahead looks to right, lookbehind to the left |
ECMAScript | | Enables ECMAScript compliant behaviour. Can only use in conjunction with IgnoreCase, Multiline and Compiled flags, any others cause an exception. |
CultureInvariant | | Ignore cultural differences in languages. |
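Both ways of setting options can be sketched as below. The names OptionsDemo, MatchHello and CommentedZip are hypothetical, as are the sample inputs.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Options passed as a parameter.
    public static bool MatchHello(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "hello", options);
    }

    // Pattern meant for IgnorePatternWhitespace: white space is dropped
    // and everything after # on a line is a comment.
    public const string CommentedZip = @"
        ^ \d{5}   # exactly five digits
        $         # and nothing else
    ";

    static void Main()
    {
        Console.WriteLine(MatchHello("HELLO", RegexOptions.None));        // False
        Console.WriteLine(MatchHello("HELLO", RegexOptions.IgnoreCase));  // True
        Console.WriteLine(Regex.IsMatch("HELLO", "(?i)hello"));           // True, in-line option
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                  // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True
        Console.WriteLine(Regex.IsMatch("12345", CommentedZip,
            RegexOptions.IgnorePatternWhitespace));                       // True
    }
}
```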
Extract Matched Data
Can extract info from string using regular expression.
- Create regular expressions
- Pass into the static Regex.Match method (or the Match method of a Regex instance)
- Retrieve data from Match class returned
For example to access all href values in a string use the following code:
Regex r;
Match m;
// Group 1 captures the href value, with or without surrounding quotes.
r = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase|RegexOptions.Compiled);
for (m = r.Match(inputString); m.Success; m = m.NextMatch())
{
    Console.WriteLine("Found href " + m.Groups[1] + " at " + m.Groups[1].Index);
}
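A self-contained variant of the same loop is sketched below. The method name FindHrefs and the sample HTML are assumptions; the pattern is the one above written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefDemo
{
    // Collects every href value found in the input.
    public static List<string> FindHrefs(string input)
    {
        Regex r = new Regex(@"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))",
            RegexOptions.IgnoreCase);
        List<string> found = new List<string>();
        for (Match m = r.Match(input); m.Success; m = m.NextMatch())
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }

    static void Main()
    {
        string html = "<a href=\"one.html\">1</a> <a HREF = \"two.html\">2</a>";
        foreach (string href in FindHrefs(html))
        {
            Console.WriteLine(href); // one.html, then two.html
        }
    }
}
```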
Can reformat extracted sub strings. For example to extract protocol and port from a URL use the following code, which for the string "http://www.contoso.com:8080/letters" would return "http:8080"
String Extension(String url)
{
    // The trailing / forces the lazy host part to extend up to the optional port.
    Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/", RegexOptions.Compiled);
    return r.Match(url).Result("${proto}${port}");
}
Replace sub strings
Regular expressions can perform complex replacements, for example to replace dates in form mm/dd/yy with those in dd-mm-yy format use:
Regex.Replace(input, "\\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b", "${day}-${month}-${year}");
The example above uses named back references within replacement pattern. The replacement param ${day} inserts the sub string captured by the group (?<day>...).
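Wrapped in a method, the date replacement might be used as in this sketch. The method name MdyToDmy and the sample text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class DateReplaceDemo
{
    // Rewrites mm/dd/yy dates as dd-mm-yy using named groups.
    public static string MdyToDmy(string input)
    {
        return Regex.Replace(input,
            @"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b",
            "${day}-${month}-${year}");
    }

    static void Main()
    {
        Console.WriteLine(MdyToDmy("Due 12/31/99")); // Due 31-12-99
    }
}
```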
To clean input strings from all non-alphanumeric characters (except . @ and -) use:
Regex.Replace(input, @"[^\w\.@-]", "");
Only character escapes and substitutions are recognised in replacement pattern, e.g. the pattern a*${txt}b inserts the string a* followed by the substring matched by the txt capturing group (if any) followed by the string "b". The * character is not recognised as a meta character within a replacement pattern. Similarly, $ substitutions are not recognised within a matching pattern; within regular expressions $ denotes the end of the string.
Character | Description |
---|---|
$number | Substitutes last substring matched by specified group number |
${name} | Substitutes last substring matched by (?<name>) group |
$$ | Substitutes single "$" literal |
$& | Substitutes copy of entire match |
$` | Substitutes all text before match |
$' | Substitutes all text after match |
$+ | Substitutes last group captured |
$_ | Substitutes entire input string |
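The substitutions in the table can be compared side by side with a sketch like this. The helper name Sub and the input "abc" are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubstitutionDemo
{
    // Replaces the "b" in "abc" using the given replacement pattern.
    public static string Sub(string replacement)
    {
        return Regex.Replace("abc", "b", replacement);
    }

    static void Main()
    {
        Console.WriteLine(Sub("[$&]")); // a[b]c  (entire match)
        Console.WriteLine(Sub("$$"));   // a$c    (literal $)
        Console.WriteLine(Sub("$`"));   // aac    (text before match)
        Console.WriteLine(Sub("$'"));   // acc    (text after match)
        Console.WriteLine(Sub("$_"));   // aabcc  (entire input)
        Console.WriteLine(Regex.Replace("ab", "(a)(b)", "$2$1"));        // ba
        Console.WriteLine(Regex.Replace("x", "(?<txt>x)", "a*${txt}b")); // a*xb
    }
}
```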
Constraining String Input
Regular expression efficient way to validate user input, e.g. application expects five digit input - use expression to check five characters between 0 and 9.
Other areas, such as names more complicated. Can ask users to not use non-alpha numeric characters in their names (e.g. - and ' should be avoided). Many do not like this. Other approaches can leave you open to malicious input, e.g. "1' DROP TABLE PRODUCTS --"
Consider performing as much filtering as possible, then clean input of potentially malicious content. Most validations should be pessimistic and only allow explicitly approved characters. This could be problematic when processing names, so perhaps adopt optimistic approach and only cause error on specifically denied characters (e.g. !, @, #, %, ^, *, (), <, >)
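The pessimistic and optimistic approaches might be sketched as follows. The allowed character set, the 40 character limit and the method names are assumptions, not rules from the notes; the denied list is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

static class NameValidationDemo
{
    // Pessimistic: accept only explicitly approved characters
    // (letters, space, full stop, apostrophe, hyphen; an assumed set).
    public static bool IsApprovedName(string name)
    {
        return Regex.IsMatch(name, @"^[A-Za-z .'-]{1,40}$");
    }

    // Optimistic: reject only the specifically denied characters.
    public static bool HasDeniedCharacter(string name)
    {
        return Regex.IsMatch(name, @"[!@#%^*()<>]");
    }

    static void Main()
    {
        Console.WriteLine(IsApprovedName("Anne-Marie O'Neil"));     // True
        Console.WriteLine(IsApprovedName("Bob<script>"));           // False
        Console.WriteLine(HasDeniedCharacter("Anne-Marie O'Neil")); // False
        Console.WriteLine(HasDeniedCharacter("Bob<script>"));       // True
    }
}
```

Note that the apostrophe passes both checks here, so input that reaches a database still needs to be handled safely by the code that builds the query.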
Lesson 2: Encoding and Decoding
ASCII not first encoding type, but is foundation for existing encoding types. Characters assigned to 7 bit bytes (0 to 127). Sufficient for English communications, but does not support non-English alphabets. To support other languages use was made of values 128 through 255, but different languages assigned different characters to the same value. To try and solve problem ANSI defined standard code pages that had standard ASCII values through 0 to 127 and language specific values from 128 through 255.
If text appears as boxes or question marks then there is an encoding problem. When creating web pages, email, etc. the text must be tagged with the encoding type, e.g. email will contain following header:
Content-Type: text/plain; charset=ISO-8859-1
ISO-8859-1 corresponds to code page 28591 "Western European (ISO)". ASCII based encoding types are being replaced by Unicode. Unicode is a massive code page with tens of thousands of characters supporting most languages and scripts. Unicode does not specify encoding type; there are several standards for encoding Unicode. .NET uses UTF-16 to represent characters, in some cases using UTF-8 internally. System.Text namespace provides classes to encode / decode characters. Following encodings supported:
- UTF-32 represents each Unicode character as a 32-bit value.
- UTF-16 represents each Unicode character as one or two 16-bit values.
- UTF-8 represents each Unicode character as 8, 16, 24 or 32 bits. Values 0 through 127 use 8 bits and match ASCII values. 128 through 2047 use 16-bit encoding to support Latin, Greek, Cyrillic, Hebrew and Arabic. Values 2048 through 65535 use 24-bit encoding for Chinese, Japanese, Korean and other languages requiring a large number of values. Values above 65535 use 32-bit encoding.
- ASCII represents values from 0 through 127. Inadequate for internationalised applications.
- ANSI/ISO support is provided for wide range of ANSI/ISO encodings.
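The size differences between the Unicode encodings can be observed with Encoding.GetByteCount, as in this sketch. The class name and sample characters are chosen for illustration; the characters are written as escapes so the example does not depend on the source file's own encoding.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Number of bytes UTF-8 needs for the given text.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    static void Main()
    {
        Console.WriteLine(Utf8Bytes("A"));          // 1 (ASCII range)
        Console.WriteLine(Utf8Bytes("\u00E9"));     // 2 (Latin small e with acute)
        Console.WriteLine(Utf8Bytes("\u4E2D"));     // 3 (a Chinese character)
        Console.WriteLine(Utf8Bytes("\U0001F600")); // 4 (above 65535)
        Console.WriteLine(Encoding.Unicode.GetByteCount("A")); // 2 (UTF-16)
        Console.WriteLine(Encoding.UTF32.GetByteCount("A"));   // 4 (UTF-32)
    }
}
```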
Using the Encoding Class
Use System.Text.Encoding.GetEncoding method to obtain encoding object for specified encoding. Use Encoding.GetBytes to convert Unicode string to byte representation in specified encoding.
Encoding e = Encoding.GetEncoding("Korean");
byte[] encoded = e.GetBytes("Hello, world!");
// Print each encoded byte with its position.
for (int i = 0; i < encoded.Length; i++)
    Console.WriteLine("Byte {0}: {1}", i, encoded[i]);
In this example the translated bytes in the Korean code page exactly match the original ASCII bytes.
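The same observation can be checked for other ASCII-compatible encodings. The sketch below uses Encoding.ASCII and Encoding.UTF8 instead of the Korean code page, on the assumption that they are available on every platform; the helper name SameBytes is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

static class GetBytesDemo
{
    // True when both encodings produce identical bytes for the text.
    public static bool SameBytes(string text, Encoding a, Encoding b)
    {
        return a.GetBytes(text).SequenceEqual(b.GetBytes(text));
    }

    static void Main()
    {
        // Pure ASCII text encodes identically.
        Console.WriteLine(SameBytes("Hello, world!", Encoding.ASCII, Encoding.UTF8)); // True
        // A non-ASCII character does not.
        Console.WriteLine(SameBytes("caf\u00E9", Encoding.ASCII, Encoding.UTF8));     // False
        // ASCII cannot represent U+00E9, so it decodes back as a question mark.
        byte[] ascii = Encoding.ASCII.GetBytes("caf\u00E9");
        Console.WriteLine(Encoding.ASCII.GetString(ascii));                           // caf?
    }
}
```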
Supported Code Pages
Calling Encoding.GetEncodings provides array of EncodingInfo objects, each one representing an encoding supported by the system.
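Listing the supported encodings might look like the sketch below. The output depends on the platform and runtime, so no expected output is shown; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class ListEncodingsDemo
{
    // Number of encodings reported by the runtime.
    public static int Count()
    {
        return Encoding.GetEncodings().Length;
    }

    static void Main()
    {
        foreach (EncodingInfo ei in Encoding.GetEncodings())
        {
            // Code page number, name and human readable display name.
            Console.WriteLine("{0}: {1}, {2}", ei.CodePage, ei.Name, ei.DisplayName);
        }
    }
}
```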
Specify Encoding Type when writing file
Use overloaded StreamWriter constructor, e.g.
StreamWriter sw = new StreamWriter("utf32.txt", false, Encoding.UTF32);
If not sure what encoding type to use then accept the StreamWriter default (which is UTF-8 without a byte order mark).
Specify Encoding Type when reading file
Typically don't need to specify type when reading file. .NET will automatically decode most common encoding types. Can specify using overloaded stream constructor, e.g.
StreamReader sr = new StreamReader("file.txt", Encoding.UTF7);
Unlike most encoding types, UTF-7 encoding requires explicit declaration when reading file.
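Writing with an explicit encoding and reading back without one can be sketched as a round trip through a temporary file. The method name RoundTrip is hypothetical, and the sketch assumes the writer emits a byte order mark that the reader can use to detect the encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class FileEncodingDemo
{
    // Writes text with the given encoding, then reads it back
    // letting StreamReader detect the encoding by itself.
    public static string RoundTrip(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            using (StreamWriter sw = new StreamWriter(path, false, encoding))
            {
                sw.Write(text);
            }
            using (StreamReader sr = new StreamReader(path))
            {
                return sr.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip("Hello, world!", Encoding.UTF32));
        Console.WriteLine(RoundTrip("Hello, world!", Encoding.Unicode));
    }
}
```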