Friday, December 08, 2006 - 01:06
[#]
Powershell, String Similarity and well, etc.
Sup 'peeps!
I realised that I haven't blogged in a little while and it
was time to give you fine people the low down on some of the totally cool shiznat
I have been playing with recently. You got it; when I'm not doing paid-for
project work, reading books on Workflow Foundation or Design Patterns
I'm busy applying cool technology so I've actually got something to talk
about on our office Friday technical shindig in the office.
The current hot topics of exploration in our office are WPF,
WPF/E, Microsoft Expression Blend, WF, C# 3.0/Orcas tech-stack and Powershell
(I must make people sick of Powershell with all my incessant rambling about it).
I'm always cautious not to bore you, my precious few (if any
readers); with stuff that other people have already blogged about. So
that rules out WPF/E!
Today I would like to discuss a little PowerShell scenario
that worked pretty well for me.
Time Keeping (The Story)
Everyone at Dot Net Solutions logs their time. Unfortunately
the GDI+/WinForm app we have for this is super-laborious and I like to log time
in my Outlook calendar (and all of my meetings are there anyway). It's also
super-crucial that the data is in the Time Keeping database because thats
where we do a bunch of analysis like profitability per staff member and even
invoicing management.
The Solution
My first problem was that the GDI+ app had no API
(translate: automation model). Times like this you have divine confirmation how
important the Model View Controller (MVC) design pattern is. The bottom line is
that in the situation where I would like to make a web page or powershell
script for TimeSheet stuff - I would have to hit the database directly which
would bypass any critical validation or business rules.
I started to work on an API using C#/.NET. The great thing
about PowerShell is that you can start calling into your .NET code at any time.
Think of it as the VS immediate window on steroids.
#load the assemblies in
gci *.dll | % { [Reflection.Assembly]::LoadFrom($_.FullName) }
#call my .NET function
[TimeSheetAPI.TimeSheetTools]::GetProjects()
Loading Assemblies (note the LoadFrom instead
of LoadFile, the latter will search elsewhere than the current directory):

Calling .NET Static Functions:

Getting members:

Query:

Logging time from Powershell!
/// <summary>
/// Insert an entry into the TimeSpent
table
/// </summary>
/// <param name="salary">annual salary in
i.e. 40000</param>
/// <param name="description"></param>
/// <param name="employeename"></param>
/// <param name="chargable">is it chargable</param>
/// <param name="startdate">from</param>
/// <param name="end">to</param>
/// <param name="hours">how many hours</param>
/// <param name="projectname">name of project
(case insensitive) will create if no match found</param>
/// <param name="clientname">name of client
(case insensitive) will create if no match found</param>
/// <param name="clientname">commit to db?</param>
/// <returns></returns>
public static TimeSpent[] InsertTimeInformation(
int salary, string description,
string
employeename, bool chargable,
DateTime
startdate, DateTime enddate,
decimal hours,
string projectname,
string
clientname, bool commit)
With this function cooked up I hit the ground running. At
this point I was still exporting data from outlook and massaging the
data before importing. This diagram shows the use of import-csv data.csv | % { }.

Lets automate the whole thing
We can get the data out of Outlook really easily:
$o = New-Object -com
Outlook.Application;
$ns = $o.GetNamespace("MAPI");
$ns.GetDefaultFolder(9).Items | where {
$_.Body -inotmatch "ADDED
TO TIMESHEET" -and $_.Subject -imatch "^TS:"
} | % {
#from here, we can get all of our calendar
items!
write-host $_.Subject
[...]
}
Houston, we got a problem
The main reason I would have a staging area for the data
before committing it to the database was because if I didn't type the project
name in exactly right (or couldn't remember how it was worded in the database)
it would create a duplicate. What I really needed was a fuzzy look-up so I
could type in roughly what it was and get the match with some degree of certainty.
I decided to do this in a really simple way. I downloaded C#
implementations of Levenshtein distance and Soundex. I then wrote
a similarity class that would break up search terms into words and run a Levenshtein
distance on every combination of word and full phrase (of the original terms
and the soundex representations). I have implemented a point scoring system
where matches of Levenshtein distance 0 are 45, and matches of distance 1 are 10
etc. Turns out it works phenomenally well!
Similarity is always something that has fascinated me (text,
images & video). We could easily step this up with better feature
detection i.e.
- Word Shape
- Lexical Analysis
- Phonemes
- N-Grams
- (and many more)
But luckily in small domain environments it's easy to tweak
your algorithm with heuristics/special cases (not really needed for this case
though). Another point is adding in more complicated vertical-domain feature
detection would require better aggregation of the n-dimensional space created:
- Clustering
- Artificial Neural Networks
- Support Vector Machines
- Bayesian Networks
- (etc)
public static int GetSimilarityScore(string
input1, string input2, bool
debug)
{
if( String.IsNullOrEmpty(input1) || String.IsNullOrEmpty(input2) ) return 0;
///we run LD processing based on soundex keys of words/entire
phrase
int[]
SoundexScoring = ProcessIndexesLD(GetWords(input1, true),
GetWords(input2, true), debug);
///we run LD processing based on words/entire phrase alone
int[]
WordScoring = ProcessIndexesLD(GetWords(input1, false),
GetWords(input2, false), debug);
List<int> scoreing = new
List<int>();
scoreing.AddRange(SoundexScoring);
scoreing.AddRange(WordScoring);
///tally up relevant scores
int level1 =
scoreing.Count<int>( i=>i==0 );
int level2 =
scoreing.Count<int>( i=>i==1 );
int level3 =
scoreing.Count<int>( i=>i==2 );
int level4 =
scoreing.Count<int>( i=>i==3 );
int level5 =
scoreing.Count<int>( i=>i==4 );
int boost=0;
///if we have an exact match, we want to show an extremely
high score
///i.e. the weightings become irrelevant and we dont want to
risk other things
///having higher score due to many words @ level1/2 etc
if( input1 ==
input2 ) boost= 1000;
else if( input1.ToLower() == input2.ToLower() ) boost=
500;
return (
level1*45+level2*10+ level3*5+ level4*3+ level5*2+boost );
}
private static int[] ProcessIndexesLD(string[]
keys, string[] inputkeys, bool debug)
{
List<string> dedupe = new
List<string>();
List<int> scoreing = new
List<int>();
for (int x = 0; x < keys.Length; x++)
{
string
key = keys[x];
for (int y = 0; y < inputkeys.Length; y++)
{
string
inputkey = inputkeys[y];
string
combination1 = String.Format("{0}_{1}", x, y);
string
combination2 = String.Format("{0}_{1}", y, x);
if
(!dedupe.Contains(combination1))
{
scoreing.Add(LevenshteinDistance.LD(key, inputkey));
}
dedupe.Add(combination1);
dedupe.Add(combination2);
}
}
return
scoreing.ToArray();
}
Let's have some fun from PowerShell:
function fp {
$what=$args[0]
tgp | select Name,
@{Name="Score";Expression={[TimeSheetAPI.SimilarityTools]::GetSimilarityScore(
$_.Name, $what, 0 )}} | sort Score -des| where{$_.code -ne 0} | select
-fir 30 | ft -auto;
}



I could show you some other examples, but damn. It's almost bullet-proof!
With this in place we can now automatically pull stuff from
items in the Outlook calendar. If we don't have a decent match, we can simply
assume that a new project should be created.
Final Powershell Script:

function SyncTimeSheet{
$o = New-Object -com Outlook.Application;
$ns = $o.GetNamespace("MAPI");
$projregex = [Regex] "\[(?<name>.+?)\]"
$clientregex = [Regex] "\{(?<name>.+?)\}"
$ns.GetDefaultFolder(9).Items | where { $_.Body
-inotmatch "ADDED TO TIMESHEET"
-and $_.Subject -imatch "^TS:" }
| % {
if(
$_.AllDayEvent ) {
$hours=8;
}
else {
$hours = (
[Convert]::ToDateTime($_.End) - [Convert]::ToDateTime($_.Start) ).Hours
}
$chargable = 1;
if(
[Regex]::IsMatch( $_.Subject, "\[c0\]"
) ) {
$chargable = 0;
}
#perform fuzzy match on project
if(
$projregex.IsMatch( $_.Subject ) ) {
$projectname=$projregex.Match(
$_.Subject ).Groups["name"].Value
tfpraw( $projectname ) | where{
$_.Score -gt 100 } | select -fir 1 | % { $projectname=$_.Name }
write-host $_.Subject +"_"+ $_.Start +"-->MATCH
FOUND: "+ $projectname
}
$client = "Internal";
#fuzzy match the client
if(
$clientregex.IsMatch( $_.Subject ) ) {
$client=$clientregex.Match(
$_.Subject ).Groups["name"].value;
tfcraw( $client ) | where{
$_.Score -gt 100 } | select -fir 1 | % { $client=$_.Name }
}
#the timesheet app want to see time
00:00:00, god knows :)
$start = [Convert]::ToDateTime(
[Convert]::ToDateTime($_.Start).ToShortDateString() )
$end = [Convert]::ToDateTime(
[Convert]::ToDateTime($_.End).ToShortDateString() )
[TimeSheetAPI.TimeSheetTools]::InsertTimeInformation(
100,
$_.Subject +" "+ $_.Description,
$_.Organizer,
$chargable,
$start,
$end,
$hours,
$projectname,
$client,
1) | ft -auto
#///ensure it
doesnt get hit again
$_.Body = $_.Body + " [ADDED TO TIMESHEET]"
}
}

What would be nice
When you write "proper" command-lets you can do
far more clever and expressive things i.e. implementing whatif and confirm
extensions (as well as pipeline input although that is possible from straight
functions).
Imagine some of the expressive possibilities:
Get-TimeSpent -Today | Log-Time
-whatif
Get-Client | where{ $_.Name imatch
"Microsoft" } | Delete-Client -confirm
We have a sophisticated (home-grown) database proxy
code-generation system in the company. You point it at the database and it
gives you C#.NET. Now I'm thinking about it giving me powershell commandlets
too.
Powershell has much in common with C#3.0/Language Query
because of it's functional rather than imperative approach to
programming.
Think of it as a half way house between SQL and C#2.0.
Get-Customers | where{ $_.Orders
gt 300 } | % { Delete-Customer $_ }
Tags: powershell similarity levenshtein soundex csharp powergadgets outlook automation search