Podcast 294: [Public Sector Special Series #4] – Using AI to make Content Available for Students at Imperial College of London

This post was originally published on this site

How do you train the next generation of Digital leaders? How do you provide them with a modern educational experience? Can you do it without technical expertise? Hear how Ruth Black (Teaching Fellow at the Digital Academy) applied Amazon Transcribe to make this real.

Additional Resources

About the AWS Podcast

The AWS Podcast is a cloud platform podcast for developers, DevOps, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives, and interviews.

Rate us on iTunes and send your suggestions, show ideas, and comments to awspodcast@amazon.com. We want to hear from you!

Subscribe with one of the following:

 

Never Try to Fix the Problem When Hacked

This post was originally published on this site

Eventually, we all get hacked. The bad guys are very persistent, and we can all make a mistake. If you suspect you have been hacked, never try to fix the situation yourself; report it right away. If you try to fix the situation, such as by paying an online ransom or deleting the infected files, not only could you still be hacked, but you are most likely causing far more harm than good.

Podcast 293: Diving into Data with Amazon Athena

This post was originally published on this site

Do you have lots of data to analyze? Is writing SQL a skill you have? Would you like to analyze massive amounts of data at low cost without capacity planning? In this episode, Simon shares how Amazon Athena can give you options you may not have considered before.

Additional Resources

About the AWS Podcast

The AWS Podcast is a cloud platform podcast for developers, DevOps, and cloud professionals seeking the latest news and trends in storage, security, infrastructure, serverless, and more. Join Simon Elisha and Jeff Barr for regular updates, deep dives, and interviews. Whether you’re building machine learning and AI models, open source projects, or hybrid cloud solutions, the AWS Podcast has something for you. Subscribe with one of the following:

Like the Podcast?

Rate us on iTunes and send your suggestions, show ideas, and comments to awspodcast@amazon.com. We want to hear from you!

Parsing Text with PowerShell (3/3)

This post was originally published on this site

This is the third and final post in a three-part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • the -split operator
    • the -match operator
    • the switch statement
    • the Regex class
  • Part 3:
    • a real world, complete and slightly bigger, example of a switch-based parser
      • General structure of a switch-based parser
      • The real world example

In the previous posts, we looked at the different operators that are available to us in PowerShell.

When analyzing crashes at DICE, I noticed that some of the C++ runtime binaries were missing debug symbols. They should be available for download from Microsoft’s public symbol server, and most versions were there. However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols.
In some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.

So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!

This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.

The example will show how a switch-based parser would look when the input data isn’t as tidy as it normally is in examples, but messy – as the real world data often is.

General Structure of a switch Based Parser

Depending on the structure of our input, the code must be organized in slightly different ways.

Input may have a record start that differs by indentation or some distinct token like

Foo                    <- Record start - No whitespace at the beginning of the line
    Prop1=Staffan      <- Properties for the record - starts with whitespace
    Prop3 =ValueN
Bar
    Prop1=Steve
    Prop2=ValueBar2

If the data to be parsed has an explicit start record, it is a bit easier than if it doesn’t have one.
We create a new data object when we get a record start, after writing any previously created object to the pipeline.
At the end, we need to check if we have parsed a record that hasn’t been written to the pipeline.

The general structure of such a switch-based parser can be as follows:

$inputData = @"
Foo
    Prop1=Value1
    Prop3=Value3
Bar
    Prop1=ValueBar1
    Prop2=ValueBar2
"@ -split 'r?n'   # This regex is useful to split at line endings, with or without carriage return

class SomeDataClass {
    $ID
    $Name
    $Property2
    $Property3
}

# map to project input property names to the properties on our data class
$propertyNameMap = @{
    Prop1 = "Name"
    Prop2 = "Property2"
    Prop3 = "Property3"
}

$currentObject = $null
switch -regex ($inputData) {

    '^(\S.*)' {
        # record start pattern, in this case a line that doesn't start with whitespace.
        if ($null -ne $currentObject) {
            $currentObject                   # output to pipeline if we have a previous data object
        }
        $currentObject = [SomeDataClass] @{  # create new object for this record
            Id = $matches.1                  # with Id like Foo or Bar
        }
        continue
    }

    # set the properties on the data object
    '^\s+([^=]+)=(.*)' {
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the parsed value to the projected property name
        $currentObject.$propName = $value
        continue
    }
}

if ($currentObject) {
    # Handle the last object if any
    $currentObject # output to pipeline
}

ID  Name      Property2 Property3
--  ----      --------- ---------
Foo Value1              Value3
Bar ValueBar1 ValueBar2

Alternatively, we may have input where the records are separated by a blank line, but without any obvious record start.

commitId=1234                         <- In this case, a commitId is first in a record
description=Update readme.md
                                      <- the blank line separates records
user=Staffan                          <- For this record, a user property comes first
commitId=1235
description=Fix bug.md

In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of whether it’s dirty or not.
If we get to the end with a dirty object, we must output it.

$inputData = @"

commit=1234
desc=Update readme.md

user=Staffan
commit=1235
desc=Bug fix

"@ -split "r?n"

class SomeDataClass {
    [int] $CommitId
    [string] $Description
    [string] $User
}

# map to project input property names to the properties on our data class
# we only need to provide the ones that are different. 'User' works fine as it is.
$propertyNameMap = @{
    commit = "CommitId"
    desc   = "Description"
}

$currentObject = [SomeDataClass]::new()
$objectDirty = $false
switch -regex ($inputData) {
    # set the properties on the data object
    '^([^=]+)=(.*)$' {
        # parse a name/value
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the projected property
        $currentObject.$propName = $value
        $objectDirty = $true
        continue
    }

    '^\s*$' {
        # separator pattern, in this case any blank line
        if ($objectDirty) {
            $currentObject                           # output to pipeline
            $currentObject = [SomeDataClass]::new()  # create new object
            $objectDirty = $false                    # and mark it as not dirty
        }
    }
    default {
        Write-Warning "Unexpected input: '$_'"
    }
}

if ($objectDirty) {
    # Handle the last object if any
    $currentObject # output to pipeline
}

CommitId Description      User
-------- -----------      ----
    1234 Update readme.md
    1235 Bug fix          Staffan

The Real World Example

I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. The format of the output from the debugger is the same.
The following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:

# we need to muck around with the console output encoding to handle the trademark chars
# imagine no encodings
# it's easy if you try
# no code pages below us
# above us only sky
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")

$proc = Start-Process notepad -passthru
Start-Sleep -seconds 1
$cdbOutput = cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id

The output of the command above is here for those who want to follow along but who aren’t running Windows or don’t have cdb.exe installed.

The (abbreviated) output looks like this:

Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
ModLoad: 00007ff6`e9da0000 00007ff6`e9de3000   C:\Windows\system32\notepad.exe
...
ModLoad: 00007ffe`97d80000 00007ffe`97db1000   C:\WINDOWS\SYSTEM32\ntmarta.dll
(98bc.40a0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`9cd53050 cc              int     3
0:007> cdb: Reading initial command '.reload -f;lmv;q'
Reloading current modules
.....................................................
start             end                 module name
00007ff6`e9da0000 00007ff6`e9de3000   notepad    (pdb symbols)          c:\symbols\notepad.pdb\2352C62CDF448257FDBDDA4081A8F9081\notepad.pdb
    Loaded symbol image file: C:\Windows\system32\notepad.exe
    Image path: C:\Windows\system32\notepad.exe
    Image name: notepad.exe
    Image was built with /Brepro flag.
    Timestamp:        329A7791 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         0004D15F
    ImageSize:        00043000
    File version:     10.0.17763.1
    Product version:  10.0.17763.1
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        1.0 App
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft® Windows® Operating System
    InternalName:     Notepad
    OriginalFilename: NOTEPAD.EXE
    ProductVersion:   10.0.17763.1
    FileVersion:      10.0.17763.1 (WinBuild.160101.0800)
    FileDescription:  Notepad
    LegalCopyright:   © Microsoft Corporation. All rights reserved.
...
00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb
    Loaded symbol image file: C:\WINDOWS\SYSTEM32\ntdll.dll
    Image path: C:\WINDOWS\SYSTEM32\ntdll.dll
    Image name: ntdll.dll
    Image was built with /Brepro flag.
    Timestamp:        E8B54827 (This is a reproducible build file hash, not a timestamp)
    CheckSum:         001F20D1
    ImageSize:        001ED000
    File version:     10.0.17763.194
    Product version:  10.0.17763.194
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        2.0 Dll
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft® Windows® Operating System
    InternalName:     ntdll.dll
    OriginalFilename: ntdll.dll
    ProductVersion:   10.0.17763.194
    FileVersion:      10.0.17763.194 (WinBuild.160101.0800)
    FileDescription:  NT Layer DLL
    LegalCopyright:   © Microsoft Corporation. All rights reserved.
quit:

The output starts with info that I’m not interested in here. I only want to get the detailed information about the loaded modules. It is not until the line

start             end                 module name

that I care about the output.

Also, at the end there is a line that we need to be aware of:

quit:

that is not part of the module output.

To skip the parts of the debugger output that we don’t care about, we keep a boolean flag that is initially set to true.
While that flag is set, we check whether the current line, $_, is the module header, in which case we flip the flag.

$inPreamble = $true
switch -regex ($cdbOutput) {

    { $inPreamble -and $_ -eq "start             end                 module name" } { $inPreamble = $false; continue }

I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved on a file. Or came from a dump, or a live process. It doesn’t matter, since the parser is decoupled from the data retrieval.
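
For example, here is a minimal sketch of that decoupling, assuming the raw debugger output was saved to a (hypothetical) text file earlier and using the ConvertTo-ExecutableModule function defined further down:

# capture once: Get-ExecutableModuleRawData Notepad | Set-Content .\notepad-modules.txt
Get-Content .\notepad-modules.txt |
    ConvertTo-ExecutableModule |
    Group-Object ProductVersion |        # e.g. group the parsed modules by product version
    Sort-Object Count -Descending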

After the sample, there is a breakdown of the more complicated regular expressions used, so don’t despair if you don’t understand them at first.
Regular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.

# define a class to store the data
class ExecutableModule {
    [string]   $Name
    [string]   $Start
    [string]   $End
    [string]   $SymbolStatus
    [string]   $PdbPath
    [bool]     $Reproducible
    [string]   $ImagePath
    [string]   $ImageName
    [DateTime] $TimeStamp
    [uint32]   $FileHash
    [uint32]   $CheckSum
    [uint32]   $ImageSize
    [version]  $FileVersion
    [version]  $ProductVersion
    [string]   $FileFlags
    [string]   $FileOS
    [string]   $FileType
    [string]   $FileDate
    [string[]] $Translations
    [string]   $CompanyName
    [string]   $ProductName
    [string]   $InternalName
    [string]   $OriginalFilename
    [string]   $ProductVersionStr
    [string]   $FileVersionStr
    [string]   $FileDescription
    [string]   $LegalCopyright
    [string]   $LegalTrademarks
    [string]   $LoadedImageFile
    [string]   $PrivateBuild
    [string]   $Comments
}

<#
.SYNOPSIS Runs a debugger on a program to dump its loaded modules
#>
function Get-ExecutableModuleRawData {
    param ([string] $Program)
    $consoleEncoding = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
    try {
        $proc = Start-Process $program -PassThru
        Start-Sleep -Seconds 1  # sleep for a while so modules are loaded
        cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id
        $proc.Close()
    }
    finally {
        [Console]::OutputEncoding = $consoleEncoding
    }
}

<#
.SYNOPSIS Converts verbose module data from windows debuggers into ExecutableModule objects.
#>
function ConvertTo-ExecutableModule {
    [OutputType([ExecutableModule])]
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $ModuleRawData
    )
    begin {
        $currentObject = $null
        $preamble = $true
        $propertyNameMap = @{
            'File flags'      = 'FileFlags'
            'File OS'         = 'FileOS'
            'File type'       = 'FileType'
            'File date'       = 'FileDate'
            'File version'    = 'FileVersion'
            'Product version' = 'ProductVersion'
            'Image path'      = 'ImagePath'
            'Image name'      = 'ImageName'
            'FileVersion'     = 'FileVersionStr'
            'ProductVersion'  = 'ProductVersionStr'
        }
    }
    process {
        switch -regex ($ModuleRawData) {

            # skip lines until we get to our sentinel line
            { $preamble -and $_ -eq "start             end                 module name" } { $preamble = $false; continue }

            #00007ff6`e9da0000 00007ff6`e9de3000   notepad    (deferred)
            #00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:symbolsntdll.pdbB8AD79538F2730FD9BACE36C9F9316A01ntdll.pdb
            '^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^)]+)\)\s*(.+)?' {
                # see breakdown of the expression later in the post
                # on record start, output the currentObject, if any is set
                if ($null -ne $currentObject) {
                    $currentObject
                }
                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
                # create an instance of the object that we are adding info from the current record into.
                $currentObject = [ExecutableModule] @{
                    Start        = $start
                    End          = $end
                    Name         = $module
                    SymbolStatus = $pdbKind
                    PdbPath      = $pdbPath
                }
                continue
            }
            '^\s+Image was built with /Brepro flag.' {
                $currentObject.Reproducible = $true
                continue
            }
            '^\s+Timestamp:\s+[^(]+\((?<timestamp>.{8})\)' {
                # see breakdown of the regular  expression later in the post
                # Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)
                $intValue = [Convert]::ToInt32($matches.timestamp, 16)
                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)
                continue
            }
            '^\s+Timestamp:\s+(?<value>.{8}) \(This' {
                # Timestamp:        E78937AC (This is a reproducible build file hash, not a timestamp)
                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)
                continue
            }
            '^\s+Loaded symbol image file: (?<imageFile>[^)]+)' {
                $currentObject.LoadedImageFile = $matches.imageFile
                continue
            }
            '^\s+Checksum:\s+(?<checksum>\S+)' {
                $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)
                continue
            }
            '^\s+Translations:\s+(?<value>\S+)' {
                $currentObject.Translations = $matches.value.Split(".")
                continue
            }
            '^\s+ImageSize:\s+(?<imageSize>.{8})' {
                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)
                continue
            }
            '^\s{4}(?<name>[^:]+):\s+(?<value>.+)' {
                # see breakdown of the regular expression later in the post
                # This part is any 'name: value' pattern
                $name, $value = $matches['name', 'value']

                # project the property name
                $propName = $propertyNameMap[$name]
                $propName = if ($null -eq $propName) { $name } else { $propName }

                # note the dynamic property name in the assignment
                # this will fail if the class doesn't have a property with the specified name
                $currentObject.$propName = $value
                continue
            }
            'quit:' {
                # ignore and exit
                break
            }
            default {
                # When writing the parser, it can be useful to include a line like the one below to see the cases that are not handled by the parser
                # Write-Warning "missing case for '$_'. Unexpected output format from cdb.exe"

                continue # skip lines that don't match the patterns we are interested in, like the start/end/module name header and the quit: output
            }
        }
    }
    end {
        # this is needed to output the last object
        if ($null -ne $currentObject) {
            $currentObject
        }
    }
}


Get-ExecutableModuleRawData Notepad |
    ConvertTo-ExecutableModule |
    Sort-Object ProductVersion, Name |
    Format-Table -Property Name, FileVersionStr, ProductVersion, FileDescription

Name               FileVersionStr                             ProductVersion FileDescription
----               --------------                             -------------- ---------------
PROPSYS            7.0.17763.1 (WinBuild.160101.0800)         7.0.17763.1    Microsoft Property System
ADVAPI32           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Advanced Windows 32 Base API
bcrypt             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Cryptographic Primitives Library
...
uxtheme            10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Microsoft UxTheme Library
win32u             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Win32u
WINSPOOL           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Spooler Driver
KERNELBASE         10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows NT BASE API Client DLL
wintypes           10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows Base Types DLL
SHELL32            10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Windows Shell Common Dll
...
windows_storage    10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Microsoft WinRT Storage API
CoreMessaging      10.0.17763.194                             10.0.17763.194 Microsoft CoreMessaging Dll
gdi32full          10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 GDI Client DLL
ntdll              10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 NT Layer DLL
RMCLIENT           10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Resource Manager Client
RPCRT4             10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Remote Procedure Call Runtime
combase            10.0.17763.253 (WinBuild.160101.0800)      10.0.17763.253 Microsoft COM for Windows
COMCTL32           6.10 (WinBuild.160101.0800)                10.0.17763.253 User Experience Controls Library
urlmon             11.00.17763.168 (WinBuild.160101.0800)     11.0.17763.168 OLE32 Extensions for Win32
iertutil           11.00.17763.253 (WinBuild.160101.0800)     11.0.17763.253 Run time utility for Internet Explorer

Regex pattern breakdown

Here is a breakdown of the more complicated patterns, using the ignore pattern whitespace modifier x:

^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^)]+)\)\s*(.+)?

# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
\s                  # a space
([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
\s+                 # skip any number of spaces
(\S+)               # capture until we get a space - this would match the 'ntdll' part
\s+                 # skip one or more spaces
\(                  # a literal start parenthesis
([^)]+)             # capture anything but an end parenthesis
\)                  # a literal end parenthesis
\s*                 # skip zero or more spaces
(.+)?               # optionally capture the symbol file path

Breakdown of the name-value pattern:

^\s+(?<name>[^:]+):\s+(?<value>.+)

# example input:  File flags:       0 (Mask 3F)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
\s+                 # require one or more spaces
(?<name>[^:]+)      # capture anything that is not a `:` into the named group "name"
:                   # require a colon
\s+                 # require one or more spaces
(?<value>.+)        # capture everything until the end into the named group "value"

Breakdown of the timestamp pattern:

^\s+Timestamp:\s+[^(]+\((?<timestamp>.{8})\)

#example input:     Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)

(?x)                # ignore pattern whitespace
^                   # the beginning of the line
\s+                 # require one or more spaces
Timestamp:          # The literal text 'Timestamp:'
\s+                 # require one or more spaces
[^(]+               # one or more of anything but an open parenthesis
\(                  # a literal '('
(?<timestamp>.{8})  # 8 characters of anything, captured into the group 'timestamp'
\)                  # a literal ')'
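
Because of the (?x) modifier, these commented breakdowns are themselves valid patterns. As a small sketch (not from the original post; the sample line and truncated pdb path are my own), the module-line breakdown can be pasted into a here-string and used directly:

$modulePattern = @'
(?x)                 # ignore pattern whitespace, so comments are allowed inside the pattern
^([0-9a-f`]{17})     # start address
\s([0-9a-f`]{17})    # end address
\s+(\S+)             # module name
\s+\(([^)]+)\)       # symbol status, e.g. 'pdb symbols'
\s*(.+)?             # optional pdb path
'@

$line = '00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\symbols\ntdll.pdb'
if ($line -match $modulePattern) { $matches[3] }   # -> ntdll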

Gotchas – the Regex Cache

Something that can happen if you are writing a more complicated parser is the following:
The parser works well. You have 15 regular expressions in your switch statement and then you get some input you haven’t seen before, so you add a 16th regex.
All of a sudden, the performance of your parser tanks. WTF?

The .NET regex implementation has a cache of recently used regexes. You can check the size of it like this:

PS> [regex]::CacheSize
15

# bump it
[regex]::CacheSize = 20

And now your parser is fast(er) again.

Bonus tip

I frequently use PowerShell to write (generate) my code:

Get-ExecutableModuleRawData pwsh |
    Select-String '^\s+([^:]+):' |       # this pattern matches the module detail fields
    Foreach-Object {$_.matches.groups[1].value} |
    Select-Object -Unique |
    Foreach-Object -Begin   { "class ExecutableModuleData {" }`
                   -Process { "    [string] $" + ($_ -replace "\s.", {[char]::ToUpperInvariant($_.Groups[0].Value[1])}) }`
                   -End     { "}" }

The output is

class ExecutableModuleData {
    [string] $LoadedSymbolImageFile
    [string] $ImagePath
    [string] $ImageName
    [string] $Timestamp
    [string] $CheckSum
    [string] $ImageSize
    [string] $FileVersion
    [string] $ProductVersion
    [string] $FileFlags
    [string] $FileOS
    [string] $FileType
    [string] $FileDate
    [string] $Translations
    [string] $CompanyName
    [string] $ProductName
    [string] $InternalName
    [string] $OriginalFilename
    [string] $ProductVersion
    [string] $FileVersion
    [string] $FileDescription
    [string] $LegalCopyright
    [string] $Comments
    [string] $LegalTrademarks
    [string] $PrivateBuild
}

It is not complete – I don’t have the fields from the record start, some types are incorrect and when run against some other executables a few other fields may appear.
But it is a very good starting point. And way more fun than typing it 🙂

Note that this example is using a new feature of the -replace operator – to use a ScriptBlock to determine what to replace with – that was added in PowerShell Core 6.1.
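
As a small isolated sketch of that ScriptBlock overload, this upper-cases the character that follows each whitespace character, which is what turns the 'File version' field name into the FileVersion property name above:

# the ScriptBlock receives a [System.Text.RegularExpressions.Match] object as $_
'File version' -replace '\s.', { [char]::ToUpperInvariant($_.Groups[0].Value[1]) }   # -> 'FileVersion'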

Bonus tip #2

A regular expression construct that I often find useful is non-greedy matching.
The example below shows the effect of the ? modifier, which can be used after * (zero or more) and + (one or more).

# greedy matching - match to the last occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+)>') { $matches }

Name                           Value
----                           -----
1                              Tag>Text</Tag
0                              <Tag>Text</Tag>

# non-greedy matching - match to the first occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+?)>') { $matches }
Name                           Value
----                           -----
1                              Tag
0                              <Tag>

See Regex Repeat for more info on how to control pattern repetition.
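
As a quick sketch of the bounded repetition quantifiers mentioned there, and of how ? makes them lazy as well:

if ('abc123456' -match '\d{3}')    { $matches[0] }   # exactly three digits       -> 123
if ('abc123456' -match '\d{2,4}')  { $matches[0] }   # two to four digits, greedy -> 1234
if ('abc123456' -match '\d{2,4}?') { $matches[0] }   # two to four digits, lazy   -> 12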

Summary

In this post, we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline.
We have also looked at a few slightly more complicated regular expressions in some detail.

As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions.
My personal experience has been that the time I’ve invested in understanding the regex language was well invested.

Hopefully, this gives you a good start with the parsing tasks you have at hand.

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.

Staffan Gustafsson, @StaffanGson, github

Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.

AA19-024A: DNS Infrastructure Hijacking Campaign

This post was originally published on this site

Original release date: January 24, 2019 | Last revised: February 13, 2019

Summary

The National Cybersecurity and Communications Integration Center (NCCIC), part of the Cybersecurity and Infrastructure Security Agency (CISA), is aware of a global Domain Name System (DNS) infrastructure hijacking campaign. Using compromised credentials, an attacker can modify the location to which an organization’s domain name resources resolve. This enables the attacker to redirect user traffic to attacker-controlled infrastructure and obtain valid encryption certificates for an organization’s domain names, enabling man-in-the-middle attacks.

See the following links for downloadable copies of open-source indicators of compromise (IOCs) from the sources listed in the References section below:

Note: these files were last updated February 13, 2019, to remove the following three non-malicious IP addresses:

  • 107.161.23.204
  • 192.161.187.200
  • 209.141.38.71

Technical Details

Using the following techniques, attackers have redirected and intercepted web and mail traffic, and could do so for other networked services.

  1. The attacker begins by compromising the user credentials of an account that can make changes to DNS records, or by obtaining those credentials through alternate means.
  2. Next, the attacker alters DNS records, like Address (A), Mail Exchanger (MX), or Name Server (NS) records, replacing the legitimate address of a service with an address the attacker controls. This enables them to direct user traffic to their own infrastructure for manipulation or inspection before passing it on to the legitimate service, should they choose. This creates a risk that persists beyond the period of traffic redirection.
  3. Because the attacker can set DNS record values, they can also obtain valid encryption certificates for an organization’s domain names. This allows the redirected traffic to be decrypted, exposing any user-submitted data. Since the certificate is valid for the domain, end users receive no error warnings.

Mitigations

NCCIC recommends the following best practices to help safeguard networks against this threat:

  • Update the passwords for all accounts that can change organizations’ DNS records.
  • Implement multifactor authentication on domain registrar accounts, or on other systems used to modify DNS records.
  • Audit public DNS records to verify they are resolving to the intended location (a small PowerShell sketch follows this list).
  • Search for encryption certificates related to domains and revoke any fraudulently requested certificates.
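
As a small illustration of the auditing step above, here is a minimal PowerShell sketch (it assumes the Resolve-DnsName cmdlet from the Windows DnsClient module; the domain name and expected address are placeholders for your own zone data):

# hypothetical known-good mapping of names to expected A-record addresses
$expected = @{ 'www.example.org' = '203.0.113.10' }
foreach ($entry in $expected.GetEnumerator()) {
    $answer = Resolve-DnsName -Name $entry.Key -Type A -ErrorAction SilentlyContinue
    if ($answer.IP4Address -notcontains $entry.Value) {
        Write-Warning "$($entry.Key) resolves to '$($answer.IP4Address -join ', ')' - expected $($entry.Value)"
    }
}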

References

Revisions

  • January 24, 2019: Initial version
  • February 6, 2019: Updated IOCs, added Crowdstrike blog
  • February 13, 2019: Updated IOCs

This product is provided subject to this Notification and this Privacy & Use policy.

New – TLS Termination for Network Load Balancers

This post was originally published on this site

When you access a web site using the HTTPS protocol, a whole lot of interesting work (formally known as an SSL/TLS handshake) happens to create and maintain a secure communication channel. Your client (browser) and the web server work together to negotiate a mutually agreeable cipher, exchange keys, and set up a session key. Once established, both ends of the conversation use the session key to encrypt and decrypt all further traffic. Because the session key is unique to the conversation between the client and the server, a third party cannot decrypt the traffic or interfere with the conversation.

New TLS Termination
Today we are simplifying the process of building secure web applications by giving you the ability to make use of TLS (Transport Layer Security) connections that terminate at a Network Load Balancer (you can think of TLS as providing the “S” in HTTPS). This will free your backend servers from the compute-intensive work of encrypting and decrypting all of your traffic, while also giving you a host of other features and benefits:

Source IP Preservation – The source IP address and port is presented to your backend servers, even when TLS is terminated at the NLB. This is, as my colleague Colm says, “insane magic!”

Simplified Management – Using TLS at scale means that you need to take responsibility for distributing your server certificate to each backend server. This creates extra management work (sometimes involving a fleet of proxy servers), and also increases your attack surface due to the presence of multiple copies of the certificate. Today’s launch removes all of that complexity and gives you a central management point for your certificates. If you are using AWS Certificate Manager (ACM), your certificates will be stored securely, expired & rotated regularly, and updated automatically, all with no action on your part.

Zero-day Patching – The TLS protocol is complex and the implementations are updated from time to time in response to emerging threats. Terminating your connections at the NLB protects your backend servers and allows us to update your NLB in response to these threats. We make use of s2n, our security-focused, formally-verified implementation of the TLS/SSL protocols.

Improved Compliance – You can use built-in security policies to specify the cipher suites and protocol versions that are acceptable to your application. This will help you in your PCI and FedRAMP compliance effort, and will also allow you to achieve a perfect TLS score.

Classic Upgrade – If you are currently using a Classic Load Balancer for TLS termination, switching to a Network Load Balancer will allow you to scale more quickly in response to an increased load. You will also be able to make use of a static IP address for your NLB and to log the source IP address for requests.

Access Logs – You now have the ability to enable access logs for your Network Load Balancers and to direct them to the S3 bucket of your choice. The log entries include detailed information about the TLS protocol version, cipher suite, connection time, handshake time, and more.

Using TLS Termination
You can create a Network Load Balancer and make use of TLS termination in minutes! You can use the API (CreateLoadBalancer), CLI (create-load-balancer), the EC2 Console, or an AWS CloudFormation template. I’ll use the Console, and click Load Balancers to get started. Then I click Create in the Network Load Balancer area:

I enter a name (MyLB2) and choose TLS (Secure TCP) as the Load Balancer Protocol:

Then I choose one or more Availability Zones, and optionally choose an Elastic IP address for each one. I can also choose to tag my NLB. When I am all set, I click Next: Configure Security Settings to proceed:

On the next page, I can choose an existing certificate or upload a new one. I already have one for www.jeff-barr.com, so I’ll choose it. I also choose a security policy (more on that in a minute):

There are currently seven security policies to choose from. Each policy allows for the use of certain TLS versions and ciphers:

The describe-ssl-policies command can be used to learn more about the policies:
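
For example, the following fetches the details of a single policy by name (a minimal sketch, assuming the AWS CLI is installed and configured; the output lists the protocols and ciphers the policy allows):

aws elbv2 describe-ssl-policies --names ELBSecurityPolicy-TLS-1-2-2017-01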

After choosing the certificate and the policy, I click Next: Configure Routing. I can choose the communication protocol (TCP or TLS) that will be used between my NLB and my targets. If I choose TLS, communication is encrypted; this allows you to make use of complete end-to-end encryption in transit:

The remainder of the setup process proceeds as usual, and I can start using my Network Load Balancer right away.
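
For reference, a roughly equivalent CLI flow might look like the following sketch (the subnet ID is a placeholder, and $vpcId, $nlbArn, $certArn, and $targetGroupArn are assumed to hold your own resource identifiers):

aws elbv2 create-load-balancer --name MyLB2 --type network --subnets subnet-0123456789abcdef0
aws elbv2 create-target-group --name MyTargets --protocol TCP --port 443 --vpc-id $vpcId
aws elbv2 create-listener --load-balancer-arn $nlbArn --protocol TLS --port 443 --certificates CertificateArn=$certArn --ssl-policy ELBSecurityPolicy-TLS-1-2-2017-01 --default-actions Type=forward,TargetGroupArn=$targetGroupArn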

Available Now
TLS Termination is available now and you can start using it today in the US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), and South America (São Paulo) Regions.

Jeff;

 

Parsing Text with PowerShell (2/3)

This post was originally published on this site

This is the second post in a three-part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • the -split operator
    • the -match operator
    • the switch statement
    • the Regex class
  • Part 3:
    • a real world, complete and slightly bigger, example of a switch-based parser

The -split operator

The -split operator splits one or more strings into substrings.

The first example is a name-value pattern, which is a common parsing task. Note the usage of the Max-substrings parameter to the -split operator.
We want to ensure that it doesn’t matter if the value contains the character to split on.

$text = "Description=The '=' character is used for assigning values to a variable"
$name, $value = $text -split "=", 2

@"
Name  =  $name
Value =  $value
"@

Name  =  Description
Value =  The '=' character is used for assigning values to a variable

When the line to parse contains fields separated by a well-known separator that is never part of the field values, we can use the -split operator in combination with multiple assignment to get the fields into variables.

$name, $location, $occupation = "Spider Man,New York,Super Hero" -split ','

If only the location is of interest, the unwanted items can be assigned to $null.

$null, $location, $null = "Spider Man,New York,Super Hero" -split ','

$location

New York

If there are many fields, assigning to null doesn’t scale well. Indexing can be used instead, to get the fields of interest.

$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,11,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]

$name
$location
$age

Staffan
Stockholm
11

It is almost always a good idea to create an object that gives context to the different parts.

$inputText = "x,Steve,x,x,x,x,x,x,x,x,x,x,Seattle,x,x,x,x,x,x,x,x,22,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]
[PSCustomObject] @{
    Name = $name
    Location = $location
    Age = [int] $age
}

Name  Location Age
----  -------- ---
Steve Seattle   22

Instead of creating a PSCustomObject, we can create a class. It’s a bit more to type, but we can get more help from the engine, for example with tab completion.

The example below also shows an example of type conversion, where the default string to number conversion doesn’t work.
The age field is handled by PowerShell’s built-in type conversion. It is of type [int], and PowerShell will handle the conversion from string to int,
but in some cases we need to help out a bit. The ShoeSize field is also an [int], but the data is hexadecimal,
and without the hex specifier (‘0x’), this conversion fails for some values, and provides incorrect results for the others.

class PowerSheller {
    [string] $Name
    [string] $Location
    [int] $Age
    [int] $ShoeSize
}

$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,33,x,11d,x,x"
$name, $location, $age, $shoeSize = ($inputText -split ',')[1,12,21,23]
[PowerSheller] @{
    Name = $name
    Location = $location
    Age = $age
    # ShoeSize is expressed in hex, with no '0x' because reasons :)
    # And yes, it's in millimeters.
    ShoeSize = [Convert]::ToInt32($shoeSize, 16)
}

Name    Location  Age ShoeSize
----    --------  --- --------
Staffan Stockholm  33      285

The split operator’s first argument is actually a regex (by default, can be changed with options).
I use this on long command lines in log files (like those given to compilers) where there can be hundreds of options specified. That makes it hard to see whether a certain option is specified or not, but when each option is split onto its own line, it becomes trivial.
The pattern below uses a positive lookahead assertion.
It can be very useful to make patterns match only in a given context, like if they are, or are not, preceded or followed by another pattern.

$cmdline = "cl.exe /D Bar=1 /I SomePath /D Foo  /O2 /I SomeOtherPath /Debug a1.cpp a3.cpp a2.cpp"

$cmdline -split "\s+(?=[-/])"

cl.exe
/D Bar=1
/I SomePath
/D Foo
/O2
/I SomeOtherPath
/Debug a1.cpp a3.cpp a2.cpp

Breaking down the regex, by rewriting it with the x option:

(?x)      # ignore whitespace in the pattern, and enable comments after '#'
\s+      # one or more spaces
(?=[-/])  # only match the previous spaces if they are followed by any of '-' or '/'.
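
As a side note on the "can be changed with options" remark above: to treat the delimiter literally instead of as a regex, an options string can be passed as a third operand (Max-substrings must then also be given; 0 means no limit). A minimal sketch:

# 'SimpleMatch' turns off regex interpretation of the delimiter
'192.168.1.1' -split '.', 0, 'SimpleMatch'

192
168
1
1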

Splitting with a scriptblock

The -split operator also comes in another form, where you can pass it a scriptblock instead of a regular expression.
This allows for more complicated logic, that can be hard or impossible to express as a regular expression.

The scriptblock accepts two parameters, the text to split and the current index. $_ is bound to the character at the current index.

function SplitWhitespaceInMiddleOfText {
    param(
        [string]$Text,
        [int] $Index
    )
    if ($Index -lt 10 -or $Index -gt 40){
        return $false
    }
    $_ -match '\s'
}

$inputText = "Some text that only needs splitting in the middle of the text"
$inputText -split $function:SplitWhitespaceInMiddleOfText

Some text that
only
needs
splitting
in
the middle of the text

The $function:SplitWhitespaceInMiddleOfText syntax is a way to get at the content of the function (the scriptblock that implements it), just as $env:UserName gets the content of an item in the env: drive.
It provides a way to document and/or reuse the scriptblock.

The -match operator

The -match operator works in conjunction with the $matches automatic variable. Each time a -match or a -notmatch succeeds, the $matches variable is populated so that each capture group gets its own entry. If the capture group is named, the key will be the name of the group, otherwise it will be the index.

As an example:

if ('a b c' -match '(\w) (?<named>\w) (\w)'){
    $matches
}

Name                           Value
----                           -----
named                          b
2                              c
1                              a
0                              a b c

Notice that the indices only increase on groups without names, i.e. the indices of later groups change when an earlier group is named.

Armed with the regex knowledge from the earlier post, we can write the following:

PS> "    10,Some text" -match '^\s+(\d+),(.+)'

True

PS> $matches
Name                           Value
----                           -----
2                              Some text
1                              10
0                              10,Some text

or with named groups

PS> "    10,Some text" -match '^\s+(?<num>\d+),(?<text>.+)'

True

PS> $matches
Name                           Value
----                           -----
num                            10
text                           Some text
0                              10,Some text

The important thing here is to put parentheses around the parts of the pattern that we want to extract. That is what creates the capture groups that allow us to reference those parts of the matching text, either by name or by index.

Combining this into a function makes it easy to use:

function ParseMyString($text){
    if ($text -match '^\s+(\d+),(.+)') {
        [PSCustomObject] @{
            Number = [int] $matches[1]
            Text    = $matches[2]
        }
    }
    else {
        Write-Warning "ParseMyString: Input `$text` doesn't match pattern"
    }
}

ParseMyString "    10,Some text"

 

Number  Text
------- ----
     10 Some text

Notice the type conversion when assigning the Number property. As long as the number is in range of an integer, this will always succeed, since we have made a successful match in the if statement above. ([long] or [bigint] could be used. In this case I provide the input, and I have promised myself to stick to a range that fits in a 32-bit integer.)
Now we will be able to sort or do numerical operations on the Number property, and it will behave like we want it to – as a number, not as a string.

The switch statement

Now we’re at the big guns 🙂

The switch statement in PowerShell has been given special functionality for parsing text.
It has two flags that are useful for parsing text and text files: -regex and -file.

When specifying -regex, the match clauses that are strings are treated as regular expressions. The switch statement also sets the $matches automatic variable.

When specifying -file, PowerShell treats the input as the name of a file to read input from, rather than as a value statement.
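
A minimal sketch of the -file flag (the log file name and its contents are hypothetical):

# reads .\build.log line by line; each line is matched against the clauses below
switch -regex -file .\build.log {
    '^\s*ERROR\s*:\s*(?<msg>.+)' { "error: $($matches.msg)" }
}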

Note the use of a ScriptBlock instead of a string as the match clause to determine if we should skip preamble lines.

class ParsedOutput {
    [int] $Number
    [string] $Text

    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}

$inputData =
    "Preamble line",
    "LastLineOfPreamble",
    "    10,Some Text",
    "    Some other text,20"

$inPreamble = $true
switch -regex ($inputData) {

    {$inPreamble -and $_ -eq 'LastLineOfPreamble'} { $inPreamble = $false; continue }

    "^s+(?<num>d+),(?<text>.+)" {  # this matches the first line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }

    "^s+(?<text>[^,]+),(?<num>d+)" { # this matches the second line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }
}

 

Number  Text
------ ----
    10 Some Text
    20 Some other text

The pattern [^,]+ in the text group in the code above is useful. It means match anything that is not a comma (,). We are using the any-of construct [], and within those brackets, ^ changes meaning from the beginning of the line to anything but.

That is useful when we are matching delimited fields. A requirement is that the delimiter cannot be part of the set of allowed field values.

The regex class

regex is a type accelerator for System.Text.RegularExpressions.Regex. It can be useful when porting code from C#, and sometimes when we want more control in situations where we have many matches of a capture group. It also allows us to pre-create the regular expressions, which can matter in performance-sensitive scenarios, and to specify a timeout.
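
For example, a pre-created, case-insensitive pattern with a one-second match timeout could look like this (the pattern itself is just an illustration):

# a match that exceeds the timeout throws RegexMatchTimeoutException instead of running indefinitely
$rx = [regex]::new('^\s*(\d+),(.+)', [System.Text.RegularExpressions.RegexOptions]::IgnoreCase, [timespan]::FromSeconds(1))
$rx.IsMatch('   10,Some Text')

True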

One instance where the regex class is needed is when you have multiple captures of a group.

Consider the following:

Text     Pattern
a,b,c,   (\w,)+

If the match operator is used, $matches will contain

Name                           Value
----                           -----
1                              c,
0                              a,b,c,

The pattern matched three times, for 'a,', 'b,' and 'c,'. However, only the last match is preserved in the $matches dictionary.
The following, on the other hand, gives us access to all the captures of the group:

[regex]::match('a,b,c,', '(\w,)+').Groups[1].Captures
Index Length Value
----- ------ -----
    0      2 a,
    2      2 b,
    4      2 c,

Below is an example that uses the members of the Regex class to parse input data:

class ParsedOutput {
    [int] $Number
    [string] $Text

    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}

$inputData =
    "    10,Some Text",
    "    Some other text,20"  # this text will not match

[regex] $pattern = "^\s+(\d+),(.+)"

foreach($d in $inputData){
    $match = $pattern.Match($d)
    if ($match.Success){
        $number, $text = $match.Groups[1,2].Value
        [ParsedOutput] @{
            Number = $number
            Text = $text
        }
    }
    else {
        Write-Warning "regex: '$d' did not match pattern '$pattern'"
    }
}

 

WARNING: regex: '    Some other text,20' did not match pattern '^\s+(\d+),(.+)'
Number Text
------ ----
    10 Some Text

It may surprise you that the warning appears before the output. PowerShell has a quite complex formatting system at the end of the pipeline, which treats pipeline output differently from other streams. Among other things, it buffers output at the beginning of a pipeline to calculate sensible column widths. This works well in practice, but sometimes results in strange reordering of output across streams.

Summary

In this post we have looked at how the -split operator can be used to split a string in parts, how the -match operator can be used to extract different patterns from some text, and how the powerful switch statement can be used to match against multiple patterns.

We ended by looking at the regex class, which in some cases provides a bit more control, but at the expense of ease of use. This concludes the second part of this series. Next time, we will look at a complete, real-world example of a switch-based parser.

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.

Staffan Gustafsson, @StaffanGson, powercode@github

Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.

The PowerShell-Docs repositories have been moved

This post was originally published on this site

The PowerShell-Docs repositories have been moved from the PowerShell organization to the MicrosoftDocs organization in GitHub.

The tools we use to build the documentation are designed to work in the MicrosoftDocs org. Moving the repository lets us build the foundation for future improvements in our documentation experience.

Impact of the move

During the move there may be some downtime. The affected repositories will be inaccessible during the move process, and the documentation processes will be paused. After the move, we need to test access permissions and automation scripts. Once these tasks are complete, access and operations will return to normal.

GitHub automatically redirects requests for the old repository URL to the new location. For more information about transferring repositories in GitHub, see About repository transfers. If the transferred repository has any forks, those forks remain associated with the repository after the transfer is complete.
  • All Git information about commits, including contributions, are preserved.
  • All of the issues and pull requests remain intact when transferring a repository.
  • All links to the previous repository location are automatically redirected to the new location.

When you use git clone, git fetch, or git push on a transferred repository, these commands will redirect to the new repository location or URL.

However, to avoid confusion, we strongly recommend updating any existing local clones to point to
the new repository URL. For more information, see Changing a remote’s URL.

The following example shows how to change the “upstream” remote to point to the new location:

[Wed 06:08PM] [staging =]
PS C:\Git\PS-Docs\PowerShell-Docs> git remote -v
origin  https://github.com/sdwheeler/PowerShell-Docs.git (fetch)
origin  https://github.com/sdwheeler/PowerShell-Docs.git (push)
upstream        https://github.com/PowerShell/PowerShell-Docs.git (fetch)
upstream        https://github.com/PowerShell/PowerShell-Docs.git (push)

[Wed 06:09PM] [staging =]
PS C:\Git\PS-Docs\PowerShell-Docs> git remote set-url upstream https://github.com/MicrosoftDocs/PowerShell-Docs.git

[Wed 06:10PM] [staging =]
PS C:\Git\PS-Docs\PowerShell-Docs> git remote -v
origin  https://github.com/sdwheeler/PowerShell-Docs.git (fetch)
origin  https://github.com/sdwheeler/PowerShell-Docs.git (push)
upstream        https://github.com/MicrosoftDocs/PowerShell-Docs.git (fetch)
upstream        https://github.com/MicrosoftDocs/PowerShell-Docs.git (push)

Which repositories were moved?

 

The following repositories were transferred:

  • PowerShell/PowerShell-Docs
  • PowerShell/powerShell-Docs.cs-cz
  • PowerShell/powerShell-Docs.de-de
  • PowerShell/powerShell-Docs.es-es
  • PowerShell/powerShell-Docs.fr-fr
  • PowerShell/powerShell-Docs.hu-hu
  • PowerShell/powerShell-Docs.it-it
  • PowerShell/powerShell-Docs.ja-jp
  • PowerShell/powerShell-Docs.ko-kr
  • PowerShell/powerShell-Docs.nl-nl
  • PowerShell/powerShell-Docs.pl-pl
  • PowerShell/powerShell-Docs.pt-br
  • PowerShell/powerShell-Docs.pt-pt
  • PowerShell/powerShell-Docs.ru-ru
  • PowerShell/powerShell-Docs.sv-se
  • PowerShell/powerShell-Docs.tr-tr
  • PowerShell/powerShell-Docs.zh-cn
  • PowerShell/powerShell-Docs.zh-tw

Call to action

If you have a fork that you cloned, change your remote configuration to point to the new upstream URL.
Help us make the documentation better.
  • Submit issues when you find a problem in the docs.
  • Suggest fixes to documentation by submitting changes through the PR process.
Sean Wheeler
Senior Content Developer for PowerShell
https://github.com/sdwheeler
