Podcast 294: [Public Sector Special Series #4] – Using AI to make Content Available for Students at Imperial College of London

This post was originally published on this site

How do you train the next generation of Digital leaders? How do you provide them with a modern educational experience? Can you do it without technical expertise? Hear how Ruth Black (Teaching Fellow at the Digital Academy) applied Amazon Transcribe to make this real.

Additional Resources

About the AWS Podcast

The AWS Podcast is a cloud platform podcast for developers, dev ops, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives, and interviews.

Rate us on iTunes and send your suggestions, show ideas, and comments to awspodcast@amazon.com. We want to hear from you!

Subscribe with one of the following:

 

NCCIC Awareness Briefing on Chinese Malicious Cyber Activity

This post was originally published on this site

Original release date: January 30, 2019

The Cybersecurity and Infrastructure Security Agency (CISA) will conduct a series of virtual awareness briefings on Chinese malicious cyber activity targeting managed service providers (MSPs). Briefings will be held from 1–2 p.m. ET on the dates listed below:

CISA encourages MSPs and their customers to register for the briefing by clicking on one of the dates listed above. The briefing will provide a background on the identified cyber activity and mitigation techniques.   


This product is provided subject to this Notification and this Privacy & Use policy.

MS-ISAC Releases Advisory on DNS Flag Day

This post was originally published on this site

Original release date: January 30, 2019

The Multi-State Information Sharing & Analysis Center (MS-ISAC) has released an alert on Domain Name System (DNS) Flag Day, which is Friday, February 1, 2019. On DNS Flag Day, DNS software and service providers will roll out updates to remove workarounds that allow users to bypass the Extension Mechanisms Protocol for DNS (EDNS). While the updates will improve DNS operations, some domains served by DNS servers operating out-of-date software may become unavailable.

The National Cybersecurity and Communications Integration Center (NCCIC), part of the Cybersecurity and Infrastructure Security Agency (CISA), encourages users and administrators to review MS-ISAC’s Cyber Alert: DNS Flag Day for more information and the DNS Flag Day website to determine whether a domain name will be affected.


This product is provided subject to this Notification and this Privacy & Use policy.

Mozilla Releases Security Update for Thunderbird

This post was originally published on this site

Original release date: January 30, 2019

Mozilla has released a security update to address vulnerabilities in Thunderbird. An attacker could exploit one of these vulnerabilities to take control of an affected system.

The National Cybersecurity and Communications Integration Center (NCCIC), part of the Cybersecurity and Infrastructure Security Agency (CISA), encourages users and administrators to review the Mozilla Security Advisory for Thunderbird 60.5 and apply the necessary update.


This product is provided subject to this Notification and this Privacy & Use policy.

Never Try to Fix the Problem When Hacked

This post was originally published on this site

Eventually, we all get hacked. The bad guys are very persistent and we can all make a mistake. If you suspect you have been hacked never try to fix the situation, instead report it right away. If you try to fix the situation, such as paying an online ransom or deleting the infected files, not only could you stil be hacked but you are most likely causing far more harm than good.

Google Releases Security Updates for Chrome

This post was originally published on this site

Original release date: January 29, 2019

Google has released Chrome version 72.0.3626.81 for Windows, Mac, and Linux. This version addresses multiple vulnerabilities that an attacker could exploit to take control of an affected system.  

The National Cybersecurity and Communications Integration Center (NCCIC), part of the Cybersecurity and Infrastructure Security Agency (CISA), encourages users and administrators to review the Chrome Releases page and apply the necessary updates.


This product is provided subject to this Notification and this Privacy & Use policy.

Mozilla Releases Security Updates for Firefox

This post was originally published on this site

Original release date: January 29, 2019

Mozilla has released security updates to address vulnerabilities in Firefox and Firefox ESR. An attacker could exploit some of these vulnerabilities to take control of an affected system.

The National Cybersecurity and Communications Integration Center (NCCIC), part of the Cybersecurity and Infrastructure Security Agency (CISA), encourages users and administrators to review the Mozilla Security Advisories for Firefox 65 and Firefox ESR 60.5 and apply the necessary updates.


This product is provided subject to this Notification and this Privacy & Use policy.

Podcast 293: Diving into Data with Amazon Athena

This post was originally published on this site

Do you have lots of data to analyze? Is writing SQL a skill you have? Would you like to analyze massive amounts of data at low cost without capacity planning? In this episode, Simon shares how Amazon Athena can give you options you may not have considered before.

Additional Resources

About the AWS Podcast

The AWS Podcast is a cloud platform podcast for developers, dev ops, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives and interviews. Whether you’re building machine learning and AI models, open source projects, or hybrid cloud solutions, the AWS Podcast has something for you. Subscribe with one of the following:

Like the Podcast?

Rate us on iTunes and send your suggestions, show ideas, and comments to awspodcast@amazon.com. We want to hear from you!

Parsing Text with PowerShell (3/3)

This post was originally published on this site

This is the third and final post in a three-part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • the -split operator
    • the -match operator
    • the switch statement
    • the Regex class
  • Part 3:
    • a real world, complete and slightly bigger, example of a switch-based parser
      • General structure of a switch-based parser
      • The real world example

In the previous posts, we looked at the different operators what are available to us in PowerShell.

When analyzing crashes at DICE, I noticed that some of the C++ runtime binaries where missing debug symbols. They should be available for download from Microsoft’s public symbol server, and most versions were there. However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols.
In some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.

So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!

This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.

The example will show how a switch-based parser would look when the input data isn’t as tidy as it normally is in examples, but messy – as the real world data often is.

General Structure of a switch Based Parser

Depending on the structure of our input, the code must be organized in slightly different ways.

Input may have a record start that differs by indentation or some distinct token like

Foo                    <- Record start - No whitespace at the beginning of the line
    Prop1=Staffan      <- Properties for the record - starts with whitespace
    Prop3 =ValueN
Bar
    Prop1=Steve
    Prop2=ValueBar2

If the data to be parsed has an explicit start record, it is a bit easier than if it doesn’t have one.
We create a new data object when we get a record start, after writing any previously created object to the pipeline.
At the end, we need to check if we have parsed a record that hasn’t been written to the pipeline.

The general structure of a such a switch-based parser can be as follows:

$inputData = @"
Foo
    Prop1=Value1
    Prop3=Value3
Bar
    Prop1=ValueBar1
    Prop2=ValueBar2
"@ -split 'r?n'   # This regex is useful to split at line endings, with or without carriage return

class SomeDataClass {
    $ID
    $Name
    $Property2
    $Property3
}

# map to project input property names to the properties on our data class
$propertyNameMap = @{
    Prop1 = "Name"
    Prop2 = "Property2"
    Prop3 = "Property3"
}

$currentObject = $null
switch -regex ($inputData) {

    '^(S.*)' {
        # record start pattern, in this case line that doesn't start with a whitespace.
        if ($null -ne $currentObject) {
            $currentObject                   # output to pipeline if we have a previous data object
        }
        $currentObject = [SomeDataClass] @{  # create new object for this record
            Id = $matches.1                  # with Id like Foo or Bar
        }
        continue
    }

    # set the properties on the data object
    '^s+([^=]+)=(.*)' {
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($propName = $null) {
            $propName = $name
        }
        # assign the parsed value to the projected property name
        $currentObject.$propName = $value
        continue
    }
}

if ($currentObject) {
    # Handle the last object if any
    $currentObject # output to pipeline
}

ID  Name      Property2 Property3
--  ----      --------- ---------
Foo Value1              Value3
Bar ValueBar1 ValueBar2

Alternatively, we may have input where the records are separated by a blank line, but without any obvious record start.

commitId=1234                         <- In this case, a commitId is first in a record
description=Update readme.md
                                      <- the blank line separates records
user=Staffan                          <- For this record, a user property comes first
commitId=1235
description=Fix bug.md

In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of if it’s dirty or not.
If we get to the end with a dirty object, we must output it.

$inputData = @"

commit=1234
desc=Update readme.md

user=Staffan
commit=1235
desc=Bug fix

"@ -split "r?n"

class SomeDataClass {
    [int] $CommitId
    [string] $Description
    [string] $User
}

# map to project input property names to the properties on our data class
# we only need to provide the ones that are different. 'User' works fine as it is.
$propertyNameMap = @{
    commit = "CommitId"
    desc   = "Description"
}

$currentObject = [SomeDataClass]::new()
$objectDirty = $false
switch -regex ($inputData) {
    # set the properties on the data object
    '^([^=]+)=(.*)$' {
        # parse a name/value
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the projected property
        $currentObject.$propName = $value
        $objectDirty = $true
        continue
    }

    '^s*$' {
        # separator pattern, in this case any blank line
        if ($objectDirty) {
            $currentObject                           # output to pipeline
            $currentObject = [SomeDataClass]::new()  # create new object
            $objectDirty = $false                    # and mark it as not dirty
        }
    }
    default {
        Write-Warning "Unexpected input: '$_'"
    }
}

if ($objectDirty) {
    # Handle the last object if any
    $currentObject # output to pipeline
}

CommitId Description      User
-------- -----------      ----
    1234 Update readme.md
    1235 Bug fix          Staffan

The Real World Example

I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. The format of the output from the debugger is the same.
The following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:

# we need to muck around with the console output encoding to handle the trademark chars
# imagine no encodings
# it's easy if you try
# no code pages below us
# above us only sky
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")

$proc = Start-Process notepad -passthru
Start-Sleep -seconds 1
$cdbOutput = cdb -y 'srv*c:symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.ProcessID

The output of the command above is here for those who want to follow along but who aren’t running windows or don’t have cdb.exe installed.

The (abbreviated) output looks like this:

Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*c:symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
ModLoad: 00007ff6`e9da0000 00007ff6`e9de3000   C:Windowssystem32notepad.exe
...
ModLoad: 00007ffe`97d80000 00007ffe`97db1000   C:WINDOWSSYSTEM32ntmarta.dll
(98bc.40a0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`9cd53050 cc              int     3
0:007> cdb: Reading initial command '.reload -f;lmv;q'
Reloading current modules
.....................................................
start             end                 module name
00007ff6`e9da0000 00007ff6`e9de3000   notepad    (pdb symbols)          c:symbolsnotepad.pdb2352C62CDF448257FDBDDA4081A8F9081notepad.pdb
    Loaded symbol image file: C:Windowssystem32notepad.exe
    Image path: C:Windowssystem32notepad.exe
    Image name: notepad.exe
    Image was built with /Brepro flag.
    Timestamp:        329A7791 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         0004D15F
    ImageSize:        00043000
    File version:     10.0.17763.1
    Product version:  10.0.17763.1
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft??? Windows??? Operating System
    InternalName:     Notepad
    OriginalFilename: NOTEPAD.EXE
    ProductVersion:   10.0.17763.1
    FileVersion:      10.0.17763.1 (WinBuild.160101.0800)
    FileDescription:  Notepad
    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.
...
00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb
    Loaded symbol image file: C:WINDOWSSYSTEM32ntdll.dll
    Image path: C:WINDOWSSYSTEM32ntdll.dll
    Image name: ntdll.dll
    Image was built with /Brepro flag.
    Timestamp:        E8B54827 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         001F20D1
    ImageSize:        001ED000
    File version:     10.0.17763.194
    Product version:  10.0.17763.194
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        2.0 Dll
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft??? Windows??? Operating System
    InternalName:     ntdll.dll
    OriginalFilename: ntdll.dll
    ProductVersion:   10.0.17763.194
    FileVersion:      10.0.17763.194 (WinBuild.160101.0800)
    FileDescription:  NT Layer DLL
    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.
quit:

The output starts with info that I’m not interested in here. I only want to get the detailed information about the loaded modules. It is not until the line

start             end                 module name

that I care about the output.

Also, at the end there is a line that we need to be aware of:

quit:

that is not part of the module output.

To skip the parts of the debugger output that we don’t care about, we have a boolean flag initially set to true.
If that flag is set, we check if the current line, $_, is the module header in which case we flip the flag.

$inPreamble = $true
    switch -regex ($cdbOutput) {

        { $inPreamble -and $_ -eq "start             end                 module name" } { $inPreamble = $false; continue }

I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved on a file. Or came from a dump, or a live process. It doesn’t matter, since the parser is decoupled from the data retrieval.

After the sample, there is a breakdown of the more complicated regular expressions used, so don’t despair if you don’t understand them at first.
Regular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.

# define an class to store the data
class ExecutableModule {
    [string]   $Name
    [string]   $Start
    [string]   $End
    [string]   $SymbolStatus
    [string]   $PdbPath
    [bool]     $Reproducible
    [string]   $ImagePath
    [string]   $ImageName
    [DateTime] $TimeStamp
    [uint32]   $FileHash
    [uint32]   $CheckSum
    [uint32]   $ImageSize
    [version]  $FileVersion
    [version]  $ProductVersion
    [string]   $FileFlags
    [string]   $FileOS
    [string]   $FileType
    [string]   $FileDate
    [string[]] $Translations
    [string]   $CompanyName
    [string]   $ProductName
    [string]   $InternalName
    [string]   $OriginalFilename
    [string]   $ProductVersionStr
    [string]   $FileVersionStr
    [string]   $FileDescription
    [string]   $LegalCopyright
    [string]   $LegalTrademarks
    [string]   $LoadedImageFile
    [string]   $PrivateBuild
    [string]   $Comments
}

<#
.SYNOPSIS Runs a debugger on a program to dump its loaded modules
#>
function Get-ExecutableModuleRawData {
    param ([string] $Program)
    $consoleEncoding = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
    try {
        $proc = Start-Process $program -PassThru
        Start-Sleep -Seconds 1  # sleep for a while so modules are loaded
        cdb -y srv*c:symbols*http://msdl.microsoft.com/download/symbols -c ".reload -f;lmv;q" -p $proc.Id
        $proc.Close()
    }
    finally {
        [Console]::OutputEncoding = $consoleEncoding
    }
}

<#
.SYNOPSIS Converts verbose module data from windows debuggers into ExecutableModule objects.
#>
function ConvertTo-ExecutableModule {
    [OutputType([ExecutableModule])]
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $ModuleRawData
    )
    begin {
        $currentObject = $null
        $preamble = $true
        $propertyNameMap = @{
            'File flags'      = 'FileFlags'
            'File OS'         = 'FileOS'
            'File type'       = 'FileType'
            'File date'       = 'FileDate'
            'File version'    = 'FileVersion'
            'Product version' = 'ProductVersion'
            'Image path'      = 'ImagePath'
            'Image name'      = 'ImageName'
            'FileVersion'     = 'FileVersionStr'
            'ProductVersion'  = 'ProductVersionStr'
        }
    }
    process {
        switch -regex ($ModuleRawData) {

            # skip lines until we get to our sentinel line
            { $preamble -and $_ -eq "start             end                 module name" } { $preamble = $false; continue }

            #00007ff6`e9da0000 00007ff6`e9de3000   notepad    (deferred)
            #00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb
            '^([0-9a-f`]{17})s([0-9a-f`]{17})s+(S+)s+(([^)]+))s*(.+)?' {
                # see breakdown of the expression later in the post
                # on record start, output the currentObject, if any is set
                if ($null -ne $currentObject) {
                    $currentObject
                }
                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
                # create an instance of the object that we are adding info from the current record into.
                $currentObject = [ExecutableModule] @{
                    Start        = $start
                    End          = $end
                    Name         = $module
                    SymbolStatus = $pdbKind
                    PdbPath      = $pdbPath
                }
                continue
            }
            '^s+Image was built with /Brepro flag.' {
                $currentObject.Reproducible = $true
                continue
            }
            '^s+Timestamp:s+[^(]+((?<timestamp>.{8}))' {
                # see breakdown of the regular  expression later in the post
                # Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)
                $intValue = [Convert]::ToInt32($matches.timestamp, 16)
                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)
                continue
            }
            '^s+TimeStamp:s+(?<value>.{8}) (This' {
                # Timestamp:        E78937AC (This is a reproducible build file hash, not a timestamp)
                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)
                continue
            }
            '^s+Loaded symbol image file: (?<imageFile>[^)]+)' {
                $currentObject.LoadedImageFile = $matches.imageFile
                continue
            }
            '^s+Checksum:s+(?<checksum>S+)' {
                $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)
                continue
            }
            '^s+Translations:s+(?<value>S+)' {
                $currentObject.Translations = $matches.value.Split(".")
                continue
            }
            '^s+ImageSize:s+(?<imageSize>.{8})' {
                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)
                continue
            }
            '^s{4}(?<name>[^:]+):s+(?<value>.+)' {
                # see breakdown of the regular expression later in the post
                # This part is any 'name: value' pattern
                $name, $value = $matches['name', 'value']

                # project the property name
                $propName = $propertyNameMap[$name]
                $propName = if ($null -eq $propName) { $name } else { $propName }

                # note the dynamic property name in the assignment
                # this will fail if the property doesn't have a member with the specified name
                $currentObject.$propName = $value
                continue
            }
            'quit:' {
                # ignore and exit
                break
            }
            default {
                # When writing the parser, it can be useful to include a line like the one below to see the cases that are not handled by the parser
                # Write-Warning "missing case for '$_'. Unexpected output format from cdb.exe"

                continue # skip lines that doesn't match the patterns we are interested in, like the start/end/modulename header and the quit: output
            }
        }
    }
    end {
        # this is needed to output the last object
        if ($null -ne $currentObject) {
            $currentObject
        }
    }
}


Get-ExecutableModuleRawData Notepad |
    ConvertTo-ExecutableModule |
    Sort-Object ProductVersion, Name
    Format-Table -Property Name, FileVersion, Product_Version, FileDescription

Name               FileVersionStr                             ProductVersion FileDescription
----               --------------                             -------------- ---------------
PROPSYS            7.0.17763.1 (WinBuild.160101.0800)         7.0.17763.1    Microsoft Property System
ADVAPI32           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Advanced Windows 32 Base API
bcrypt             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Cryptographic Primitives Library
...
uxtheme            10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Microsoft UxTheme Library
win32u             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Win32u
WINSPOOL           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Spooler Driver
KERNELBASE         10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows NT BASE API Client DLL
wintypes           10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows Base Types DLL
SHELL32            10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Windows Shell Common Dll
...
windows_storage    10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Microsoft WinRT Storage API
CoreMessaging      10.0.17763.194                             10.0.17763.194 Microsoft CoreMessaging Dll
gdi32full          10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 GDI Client DLL
ntdll              10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 NT Layer DLL
RMCLIENT           10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Resource Manager Client
RPCRT4             10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Remote Procedure Call Runtime
combase            10.0.17763.253 (WinBuild.160101.0800)      10.0.17763.253 Microsoft COM for Windows
COMCTL32           6.10 (WinBuild.160101.0800)                10.0.17763.253 User Experience Controls Library
urlmon             11.00.17763.168 (WinBuild.160101.0800)     11.0.17763.168 OLE32 Extensions for Win32
iertutil           11.00.17763.253 (WinBuild.160101.0800)     11.0.17763.253 Run time utility for Internet Explorer

Regex pattern breakdown

Here is a breakdown of the more complicated patterns, using the ignore pattern whitespace modifier x:

([0-9a-f`]{17})s([0-9a-f`]{17})s+(S+)s+(([^)]+))s*(.+)?

# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
s                  # a space
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
s+                 # skip any number of spaces
(S+)               # capture until we get a space - this would match the 'ntdll' part
s+                 # skip one or more spaces
(                  # start parenthesis
    ([^)])         # capture anything but end parenthesis
)                  # end parenthesis
s*                 # skip zero or more spaces
(.+)?               # optionally capture any symbol file path

Breakdown of the name-value pattern:

^s+(?<name>[^:]+):s+(?<value>.+)

# example input:  File flags:       0 (Mask 3F)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
s+                 # require one or more spaces
(?<name>[^:]+)      # capture anything that is not a `:` into the named group "name"
:                   # require a comma
s+                 # require one or more spaces
(?<value>.+)        # capture everything until the end into the name group "value"

Breakdown of the timestamp pattern:

^s{4}Timestamp:s+[^(]+((?<timestamp>.{8}))

#example input:     Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
s+                 # require one or more spaces
Timestamp:          # The literal text 'Timestamp:'
s+                 # require one or more spaces
[^(]+              # one or more of anything but a open parenthesis
(                  # a literal '('
(?<timestamp>.{8})  # 8 characters of anything, captured into the group 'timestamp'
)                  # a literal ')'

Gotchas – the Regex Cache

Something that can happen if you are writing a more complicated parser is the following:
The parser works well. You have 15 regular expressions in your switch statement and then you get some input you haven’t seen before, so you add a 16th regex.
All of a sudden, the performance of your parser tanks. WTF?

The .net regex implementation has a cache of recently used regexs. You can check the size of it like this:

PS> [regex]::CacheSize
15

# bump it
[regex]::CacheSize = 20

And now your parser is fast(er) again.

Bonus tip

I frequently use PowerShell to write (generate) my code:

Get-ExecutableModuleRawData pwsh |
    Select-String '^s+([^:]+):' |       # this pattern matches the module detail fields
    Foreach-Object {$_.matches.groups[1].value} |
    Select-Object -Unique |
    Foreach-Object -Begin   { "class ExecutableModuleData {" }`
                   -Process { "    [string] $" + ($_ -replace "s.", {[char]::ToUpperInvariant($_.Groups[0].Value[1])}) }`
                   -End     { "}" }

The output is

class ExecutableModuleData {
    [string] $LoadedSymbolImageFile
    [string] $ImagePath
    [string] $ImageName
    [string] $Timestamp
    [string] $CheckSum
    [string] $ImageSize
    [string] $FileVersion
    [string] $ProductVersion
    [string] $FileFlags
    [string] $FileOS
    [string] $FileType
    [string] $FileDate
    [string] $Translations
    [string] $CompanyName
    [string] $ProductName
    [string] $InternalName
    [string] $OriginalFilename
    [string] $ProductVersion
    [string] $FileVersion
    [string] $FileDescription
    [string] $LegalCopyright
    [string] $Comments
    [string] $LegalTrademarks
    [string] $PrivateBuild
}

It is not complete – I don’t have the fields from the record start, some types are incorrect and when run against some other executables a few other fields may appear.
But it is a very good starting point. And way more fun than typing it 🙂

Note that this example is using a new feature of the -replace operator – to use a ScriptBlock to determine what to replace with – that was added in PowerShell Core 6.1.

Bonus tip #2

A regular expression construct that I often find useful is non-greedy matching.
The example below shows the effect of the ? modifier, that can be used after * (zero or more) and + (one or more)

# greedy matching - match to the last occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+)>') { $matches }

Name                           Value
----                           -----
1                              Tag&gt;Text&lt;/Tag
0                              &lt;Tag&gt;Text&lt;/Tag&gt;

# non-greedy matching - match to the first occurrence of the the following character (>)
if("<Tag>Text</Tag>" -match '<(.+?)>') { $matches }
Name                           Value
----                           -----
1                              Tag
0                              &lt;Tag&gt;

See Regex Repeat for more info on how to control pattern repetition.

Summary

In this post, we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline.
We have also looked at a few slightly more complicated regular expressions in some detail.

As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions.
My personal experience has been that the time I’ve invested in understanding the regex language was well invested.

Hopefully, this gives you a good start with the parsing tasks you have at hand.

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.

Staffan Gustafsson, @StaffanGson, github

Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.

The post Parsing Text with PowerShell (3/3) appeared first on Powershell.

Parsing Text with PowerShell (3/3)

This post was originally published on this site

This is the third and final post in a three-part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • the -split operator
    • the -match operator
    • the switch statement
    • the Regex class
  • Part 3:
    • a real world, complete and slightly bigger, example of a switch-based parser
      • General structure of a switch-based parser
      • The real world example

In the previous posts, we looked at the different operators what are available to us in PowerShell.

When analyzing crashes at DICE, I noticed that some of the C++ runtime binaries where missing debug symbols. They should be available for download from Microsoft’s public symbol server, and most versions were there. However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols.
In some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.

So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!

This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.

The example will show how a switch-based parser would look when the input data isn’t as tidy as it normally is in examples, but messy – as the real world data often is.

General Structure of a switch Based Parser

Depending on the structure of our input, the code must be organized in slightly different ways.

Input may have a record start that differs by indentation or some distinct token like

Foo                    <- Record start - No whitespace at the beginning of the line
    Prop1=Staffan      <- Properties for the record - starts with whitespace
    Prop3 =ValueN
Bar
    Prop1=Steve
    Prop2=ValueBar2

If the data to be parsed has an explicit start record, it is a bit easier than if it doesn’t have one.
We create a new data object when we get a record start, after writing any previously created object to the pipeline.
At the end, we need to check if we have parsed a record that hasn’t been written to the pipeline.

The general structure of a such a switch-based parser can be as follows:

$inputData = @"
Foo
    Prop1=Value1
    Prop3=Value3
Bar
    Prop1=ValueBar1
    Prop2=ValueBar2
"@ -split 'r?n'   # This regex is useful to split at line endings, with or without carriage return

class SomeDataClass {
    $ID
    $Name
    $Property2
    $Property3
}

# map to project input property names to the properties on our data class
$propertyNameMap = @{
    Prop1 = "Name"
    Prop2 = "Property2"
    Prop3 = "Property3"
}

$currentObject = $null
switch -regex ($inputData) {

    '^(S.*)' {
        # record start pattern, in this case line that doesn't start with a whitespace.
        if ($null -ne $currentObject) {
            $currentObject                   # output to pipeline if we have a previous data object
        }
        $currentObject = [SomeDataClass] @{  # create new object for this record
            Id = $matches.1                  # with Id like Foo or Bar
        }
        continue
    }

    # set the properties on the data object
    '^s+([^=]+)=(.*)' {
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($propName = $null) {
            $propName = $name
        }
        # assign the parsed value to the projected property name
        $currentObject.$propName = $value
        continue
    }
}

if ($currentObject) {
    # Handle the last object if any
    $currentObject # output to pipeline
}
ID  Name      Property2 Property3
--  ----      --------- ---------
Foo Value1              Value3
Bar ValueBar1 ValueBar2

Alternatively, we may have input where the records are separated by a blank line, but without any obvious record start.

commitId=1234                         <- In this case, a commitId is first in a record
description=Update readme.md
                                      <- the blank line separates records
user=Staffan                          <- For this record, a user property comes first
commitId=1235
description=Fix bug.md

In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of if it’s dirty or not.
If we get to the end with a dirty object, we must output it.

$inputData = @"

commit=1234
desc=Update readme.md

user=Staffan
commit=1235
desc=Bug fix

"@ -split "r?n"

class SomeDataClass {
    [int] $CommitId
    [string] $Description
    [string] $User
}

# map to project input property names to the properties on our data class
# we only need to provide the ones that are different. 'User' works fine as it is.
$propertyNameMap = @{
    commit = "CommitId"
    desc   = "Description"
}

$currentObject = [SomeDataClass]::new()
$objectDirty = $false
switch -regex ($inputData) {
    # set the properties on the data object
    '^([^=]+)=(.*)$' {
        # parse a name/value
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the projected property
        $currentObject.$propName = $value
        $objectDirty = $true
        continue
    }

    '^s*$' {
        # separator pattern, in this case any blank line
        if ($objectDirty) {
            $currentObject                           # output to pipeline
            $currentObject = [SomeDataClass]::new()  # create new object
            $objectDirty = $false                    # and mark it as not dirty
        }
    }
    default {
        Write-Warning "Unexpected input: '$_'"
    }
}

if ($objectDirty) {
    # Handle the last object if any
    $currentObject # output to pipeline
}
CommitId Description      User
-------- -----------      ----
    1234 Update readme.md
    1235 Bug fix          Staffan

The Real World Example

I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. The format of the output from the debugger is the same.
The following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:

# we need to muck around with the console output encoding to handle the trademark chars
# imagine no encodings
# it's easy if you try
# no code pages below us
# above us only sky
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")

$proc = Start-Process notepad -passthru
Start-Sleep -seconds 1
$cdbOutput = cdb -y 'srv*c:symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.ProcessID

The output of the command above is attached to the blog post for those who want to follow along but who aren’t running windows or don’t have cdb.exe installed.

The (abbreviated) output looks like this:

Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*c:symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
ModLoad: 00007ff6`e9da0000 00007ff6`e9de3000   C:Windowssystem32notepad.exe
...
ModLoad: 00007ffe`97d80000 00007ffe`97db1000   C:WINDOWSSYSTEM32ntmarta.dll
(98bc.40a0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`9cd53050 cc              int     3
0:007> cdb: Reading initial command '.reload -f;lmv;q'
Reloading current modules
.....................................................
start             end                 module name
00007ff6`e9da0000 00007ff6`e9de3000   notepad    (pdb symbols)          c:symbolsnotepad.pdb2352C62CDF448257FDBDDA4081A8F9081notepad.pdb
    Loaded symbol image file: C:Windowssystem32notepad.exe
    Image path: C:Windowssystem32notepad.exe
    Image name: notepad.exe
    Image was built with /Brepro flag.
    Timestamp:        329A7791 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         0004D15F
    ImageSize:        00043000
    File version:     10.0.17763.1
    Product version:  10.0.17763.1
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft??? Windows??? Operating System
    InternalName:     Notepad
    OriginalFilename: NOTEPAD.EXE
    ProductVersion:   10.0.17763.1
    FileVersion:      10.0.17763.1 (WinBuild.160101.0800)
    FileDescription:  Notepad
    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.
...
00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb
    Loaded symbol image file: C:WINDOWSSYSTEM32ntdll.dll
    Image path: C:WINDOWSSYSTEM32ntdll.dll
    Image name: ntdll.dll
    Image was built with /Brepro flag.
    Timestamp:        E8B54827 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         001F20D1
    ImageSize:        001ED000
    File version:     10.0.17763.194
    Product version:  10.0.17763.194
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        2.0 Dll
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft??? Windows??? Operating System
    InternalName:     ntdll.dll
    OriginalFilename: ntdll.dll
    ProductVersion:   10.0.17763.194
    FileVersion:      10.0.17763.194 (WinBuild.160101.0800)
    FileDescription:  NT Layer DLL
    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.
quit:

The output starts with info that I’m not interested in here. I only want to get the detailed information about the loaded modules. It is not until the line

start             end                 module name

that I care about the output.

Also, at the end there is a line that we need to be aware of:

quit:

that is not part of the module output.

To skip the parts of the debugger output that we don’t care about, we have a boolean flag initially set to true.
If that flag is set, we check if the current line, $_, is the module header in which case we flip the flag.

    $inPreamble = $true
    switch -regex ($cdbOutput) {

        { $inPreamble -and $_ -eq "start             end                 module name" } { $inPreamble = $false; continue }

I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved on a file. Or came from a dump, or a live process. It doesn’t matter, since the parser is decoupled from the data retrieval.

After the sample, there is a breakdown of the more complicated regular expressions used, so don’t despair if you don’t understand them at first.
Regular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.

# define an class to store the data
class ExecutableModule {
    [string]   $Name
    [string]   $Start
    [string]   $End
    [string]   $SymbolStatus
    [string]   $PdbPath
    [bool]     $Reproducible
    [string]   $ImagePath
    [string]   $ImageName
    [DateTime] $TimeStamp
    [uint32]   $FileHash
    [uint32]   $CheckSum
    [uint32]   $ImageSize
    [version]  $FileVersion
    [version]  $ProductVersion
    [string]   $FileFlags
    [string]   $FileOS
    [string]   $FileType
    [string]   $FileDate
    [string[]] $Translations
    [string]   $CompanyName
    [string]   $ProductName
    [string]   $InternalName
    [string]   $OriginalFilename
    [string]   $ProductVersionStr
    [string]   $FileVersionStr
    [string]   $FileDescription
    [string]   $LegalCopyright
    [string]   $LegalTrademarks
    [string]   $LoadedImageFile
    [string]   $PrivateBuild
    [string]   $Comments
}

<#
.SYNOPSIS Runs a debugger on a program to dump its loaded modules
#>
function Get-ExecutableModuleRawData {
    param ([string] $Program)
    $consoleEncoding = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
    try {
        $proc = Start-Process $program -PassThru
        Start-Sleep -Seconds 1  # sleep for a while so modules are loaded
        cdb -y srv*c:symbols*http://msdl.microsoft.com/download/symbols -c ".reload -f;lmv;q" -p $proc.Id
        $proc.Close()
    }
    finally {
        [Console]::OutputEncoding = $consoleEncoding
    }
}

<#
.SYNOPSIS Converts verbose module data from windows debuggers into ExecutableModule objects.
#>
function ConvertTo-ExecutableModule {
    [OutputType([ExecutableModule])]
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $ModuleRawData
    )
    begin {
        $currentObject = $null
        $preamble = $true
        $propertyNameMap = @{
            'File flags'      = 'FileFlags'
            'File OS'         = 'FileOS'
            'File type'       = 'FileType'
            'File date'       = 'FileDate'
            'File version'    = 'FileVersion'
            'Product version' = 'ProductVersion'
            'Image path'      = 'ImagePath'
            'Image name'      = 'ImageName'
            'FileVersion'     = 'FileVersionStr'
            'ProductVersion'  = 'ProductVersionStr'
        }
    }
    process {
        switch -regex ($ModuleRawData) {

            # skip lines until we get to our sentinel line
            { $preamble -and $_ -eq "start             end                 module name" } { $preamble = $false; continue }

            #00007ff6`e9da0000 00007ff6`e9de3000   notepad    (deferred)
            #00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb
            '^([0-9a-f`]{17})s([0-9a-f`]{17})s+(S+)s+(([^)]+))s*(.+)?' {
                # see breakdown of the expression later in the post
                # on record start, output the currentObject, if any is set
                if ($null -ne $currentObject) {
                    $currentObject
                }
                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
                # create an instance of the object that we are adding info from the current record into.
                $currentObject = [ExecutableModule] @{
                    Start        = $start
                    End          = $end
                    Name         = $module
                    SymbolStatus = $pdbKind
                    PdbPath      = $pdbPath
                }
                continue
            }
            '^s+Image was built with /Brepro flag.' {
                $currentObject.Reproducible = $true
                continue
            }
            '^s+Timestamp:s+[^(]+((?<timestamp>.{8}))' {
                # see breakdown of the regular  expression later in the post
                # Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)
                $intValue = [Convert]::ToInt32($matches.timestamp, 16)
                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)
                continue
            }
            '^s+TimeStamp:s+(?<value>.{8}) (This' {
                # Timestamp:        E78937AC (This is a reproducible build file hash, not a timestamp)
                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)
                continue
            }
            '^s+Loaded symbol image file: (?<imageFile>[^)]+)' {
                $currentObject.LoadedImageFile = $matches.imageFile
                continue
            }
            '^s+Checksum:s+(?<checksum>S+)' {
                $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)
                continue
            }
            '^s+Translations:s+(?<value>S+)' {
                $currentObject.Translations = $matches.value.Split(".")
                continue
            }
            '^s+ImageSize:s+(?<imageSize>.{8})' {
                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)
                continue
            }
            '^s{4}(?<name>[^:]+):s+(?<value>.+)' {
                # see breakdown of the regular expression later in the post
                # This part is any 'name: value' pattern
                $name, $value = $matches['name', 'value']

                # project the property name
                $propName = $propertyNameMap[$name]
                $propName = if ($null -eq $propName) { $name } else { $propName }

                # note the dynamic property name in the assignment
                # this will fail if the property doesn't have a member with the specified name
                $currentObject.$propName = $value
                continue
            }
            'quit:' {
                # ignore and exit
                break
            }
            default {
                # When writing the parser, it can be useful to include a line like the one below to see the cases that are not handled by the parser
                # Write-Warning "missing case for '$_'. Unexpected output format from cdb.exe"

                continue # skip lines that doesn't match the patterns we are interested in, like the start/end/modulename header and the quit: output
            }
        }
    }
    end {
        # this is needed to output the last object
        if ($null -ne $currentObject) {
            $currentObject
        }
    }
}


Get-ExecutableModuleRawData Notepad |
    ConvertTo-ExecutableModule |
    Sort-Object ProductVersion, Name
    Format-Table -Property Name, FileVersion, Product_Version, FileDescription
Name               FileVersionStr                             ProductVersion FileDescription
----               --------------                             -------------- ---------------
PROPSYS            7.0.17763.1 (WinBuild.160101.0800)         7.0.17763.1    Microsoft Property System
ADVAPI32           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Advanced Windows 32 Base API
bcrypt             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Cryptographic Primitives Library
...
uxtheme            10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Microsoft UxTheme Library
win32u             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Win32u
WINSPOOL           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Spooler Driver
KERNELBASE         10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows NT BASE API Client DLL
wintypes           10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows Base Types DLL
SHELL32            10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Windows Shell Common Dll
...
windows_storage    10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Microsoft WinRT Storage API
CoreMessaging      10.0.17763.194                             10.0.17763.194 Microsoft CoreMessaging Dll
gdi32full          10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 GDI Client DLL
ntdll              10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 NT Layer DLL
RMCLIENT           10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Resource Manager Client
RPCRT4             10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Remote Procedure Call Runtime
combase            10.0.17763.253 (WinBuild.160101.0800)      10.0.17763.253 Microsoft COM for Windows
COMCTL32           6.10 (WinBuild.160101.0800)                10.0.17763.253 User Experience Controls Library
urlmon             11.00.17763.168 (WinBuild.160101.0800)     11.0.17763.168 OLE32 Extensions for Win32
iertutil           11.00.17763.253 (WinBuild.160101.0800)     11.0.17763.253 Run time utility for Internet Explorer

Regex pattern breakdown

Here is a breakdown of the more complicated patterns, using the ignore pattern whitespace modifier x:

([0-9a-f`]{17})s([0-9a-f`]{17})s+(S+)s+(([^)]+))s*(.+)?

# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
s                  # a space
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
s+                 # skip any number of spaces
(S+)               # capture until we get a space - this would match the 'ntdll' part
s+                 # skip one or more spaces
(                  # start parenthesis
    ([^)])         # capture anything but end parenthesis
)                  # end parenthesis
s*                 # skip zero or more spaces
(.+)?               # optionally capture any symbol file path

Breakdown of the name-value pattern:

^s+(?<name>[^:]+):s+(?<value>.+)

# example input:  File flags:       0 (Mask 3F)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
s+                 # require one or more spaces
(?<name>[^:]+)      # capture anything that is not a `:` into the named group "name"
:                   # require a comma
s+                 # require one or more spaces
(?<value>.+)        # capture everything until the end into the name group "value"

Breakdown of the timestamp pattern:

^s{4}Timestamp:s+[^(]+((?<timestamp>.{8}))

#example input:     Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
s+                 # require one or more spaces
Timestamp:          # The literal text 'Timestamp:'
s+                 # require one or more spaces
[^(]+              # one or more of anything but a open parenthesis
(                  # a literal '('
(?<timestamp>.{8})  # 8 characters of anything, captured into the group 'timestamp'
)                  # a literal ')'

Gotchas – the Regex Cache

Something that can happen if you are writing a more complicated parser is the following:
The parser works well. You have 15 regular expressions in your switch statement and then you get some input you haven’t seen before, so you add a 16th regex.
All of a sudden, the performance of your parser tanks. WTF?

The .net regex implementation has a cache of recently used regexs. You can check the size of it like this:

PS> [regex]::CacheSize
15

# bump it
[regex]::CacheSize = 20

And now your parser is fast(er) again.

Bonus tip

I frequently use PowerShell to write (generate) my code:

Get-ExecutableModuleRawData pwsh |
    Select-String '^s+([^:]+):' |       # this pattern matches the module detail fields
    Foreach-Object {$_.matches.groups[1].value} |
    Select-Object -Unique |
    Foreach-Object -Begin   { "class ExecutableModuleData {" }`
                   -Process { "    [string] $" + ($_ -replace "s.", {[char]::ToUpperInvariant($_.Groups[0].Value[1])}) }`
                   -End     { "}" }

The output is

class ExecutableModuleData {
    [string] $LoadedSymbolImageFile
    [string] $ImagePath
    [string] $ImageName
    [string] $Timestamp
    [string] $CheckSum
    [string] $ImageSize
    [string] $FileVersion
    [string] $ProductVersion
    [string] $FileFlags
    [string] $FileOS
    [string] $FileType
    [string] $FileDate
    [string] $Translations
    [string] $CompanyName
    [string] $ProductName
    [string] $InternalName
    [string] $OriginalFilename
    [string] $ProductVersion
    [string] $FileVersion
    [string] $FileDescription
    [string] $LegalCopyright
    [string] $Comments
    [string] $LegalTrademarks
    [string] $PrivateBuild
}

It is not complete – I don’t have the fields from the record start, some types are incorrect and when run against some other executables a few other fields may appear.
But it is a very good starting point. And way more fun than typing it 🙂

Note that this example is using a new feature of the -replace operator – to use a ScriptBlock to determine what to replace with – that was added in PowerShell Core 6.1.

Bonus tip #2

A regular expression construct that I often find useful is non-greedy matching.
The example below shows the effect of the ? modifier, that can be used after * (zero or more) and + (one or more)

# greedy matching - match to the last occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+)>') { $matches }
Name                           Value
----                           -----
1                              Tag>Text</Tag
0                              <Tag>Text</Tag>
# non-greedy matching - match to the first occurrence of the the following character (>)
if("<Tag>Text</Tag>" -match '<(.+?)>') { $matches }
Name                           Value
----                           -----
1                              Tag
0                              <Tag>

See Regex Repeat for more info on how to control pattern repetition.

Summary

In this post, we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline.
We have also looked at a few slightly more complicated regular expressions in some detail.

As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions.
My personal experience has been that the time I’ve invested in understanding the regex language was well invested.

Hopefully, this gives you a good start with the parsing tasks you have at hand.

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.

Staffan Gustafsson, @StaffanGson, github

Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.