Tasty Technology! By Tim Scarfe.

Friday, December 08, 2006 - 01:06 [#]

Powershell, String Similarity and well, etc.


Sup 'peeps!


I realised that I haven't blogged in a little while and it was time to give you fine people the low down on some of the totally cool shiznat I have been playing with recently. You got it; when I'm not doing paid-for project work, reading books on Workflow Foundation or Design Patterns I'm busy applying cool technology so I've actually got something to talk about on our office Friday technical shindig in the office.


The current hot topics of exploration in our office are WPF, WPF/E, Microsoft Expression Blend, WF, C# 3.0/Orcas tech-stack and Powershell (I must make people sick of Powershell with all my incessant rambling about it).


I'm always cautious not to bore you, my precious few (if any readers); with stuff that other people have already blogged about. So that rules out WPF/E!


Today I would like to discuss a little PowerShell scenario that worked pretty well for me.


Time Keeping (The Story)


Everyone at Dot Net Solutions logs their time. Unfortunately the GDI+/WinForm app we have for this is super-laborious and I like to log time in my Outlook calendar (and all of my meetings are there anyway). It's also super-crucial that the data is in the Time Keeping database because thats where we do a bunch of analysis like profitability per staff member and even invoicing management.


The Solution


My first problem was that the GDI+ app had no API (translate: automation model). Times like this you have divine confirmation how important the Model View Controller (MVC) design pattern is. The bottom line is that in the situation where I would like to make a web page or powershell script for TimeSheet stuff - I would have to hit the database directly which would bypass any critical validation or business rules.


I started to work on an API using C#/.NET. The great thing about PowerShell is that you can start calling into your .NET code at any time. Think of it as the VS immediate window on steroids.


#load the assemblies in

gci *.dll | % { [Reflection.Assembly]::LoadFrom($_.FullName) }

#call my .NET function

[TimeSheetAPI.TimeSheetTools]::GetProjects()

Loading Assemblies (note the LoadFrom instead of LoadFile, the latter will search elsewhere than the current directory):



Calling .NET Static Functions:



Getting members:



Query:


Logging time from Powershell!


/// <summary>

/// Insert an entry into the TimeSpent table

/// </summary>

/// <param name="salary">annual salary in i.e. 40000</param>

/// <param name="description"></param>

/// <param name="employeename"></param>

/// <param name="chargable">is it chargable</param>

/// <param name="startdate">from</param>

/// <param name="end">to</param>

/// <param name="hours">how many hours</param>

/// <param name="projectname">name of project (case insensitive) will create if no match found</param>

/// <param name="clientname">name of client (case insensitive) will create if no match found</param>

/// <param name="clientname">commit to db?</param>

/// <returns></returns>

public static TimeSpent[] InsertTimeInformation(

int salary, string description,

string employeename, bool chargable,

DateTime startdate, DateTime enddate,

decimal hours, string projectname,

string clientname, bool commit)


With this function cooked up I hit the ground running. At this point I was still exporting data from outlook and massaging the data before importing. This diagram shows the use of import-csv data.csv | % { }.



Lets automate the whole thing


We can get the data out of Outlook really easily:


$o = New-Object -com Outlook.Application;

$ns = $o.GetNamespace("MAPI");

$ns.GetDefaultFolder(9).Items | where {

$_.Body -inotmatch "ADDED TO TIMESHEET" -and $_.Subject -imatch "^TS:" } | % {

#from here, we can get all of our calendar items!

write-host $_.Subject

[...]

}


Houston, we got a problem


The main reason I would have a staging area for the data before committing it to the database was because if I didn't type the project name in exactly right (or couldn't remember how it was worded in the database) it would create a duplicate. What I really needed was a fuzzy look-up so I could type in roughly what it was and get the match with some degree of certainty.


I decided to do this in a really simple way. I downloaded C# implementations of Levenshtein distance and Soundex. I then wrote a similarity class that would break up search terms into words and run a Levenshtein distance on every combination of word and full phrase (of the original terms and the soundex representations). I have implemented a point scoring system where matches of Levenshtein distance 0 are 45, and matches of distance 1 are 10 etc. Turns out it works phenomenally well!


Similarity is always something that has fascinated me (text, images & video). We could easily step this up with better feature detection i.e.


  • Word Shape
  • Lexical Analysis
  • Phonemes
  • N-Grams
  • (and many more)

But luckily in small domain environments it's easy to tweak your algorithm with heuristics/special cases (not really needed for this case though). Another point is adding in more complicated vertical-domain feature detection would require better aggregation of the n-dimensional space created:


  • Clustering
  • Artificial Neural Networks
  • Support Vector Machines
  • Bayesian Networks
  • (etc)

 

public static int GetSimilarityScore(string input1, string input2, bool debug)

{

if( String.IsNullOrEmpty(input1) || String.IsNullOrEmpty(input2) ) return 0;

 

///we run LD processing based on soundex keys of words/entire phrase

int[] SoundexScoring = ProcessIndexesLD(GetWords(input1, true), GetWords(input2, true), debug);

///we run LD processing based on words/entire phrase alone

int[] WordScoring = ProcessIndexesLD(GetWords(input1, false), GetWords(input2, false), debug);

List<int> scoreing = new List<int>();

scoreing.AddRange(SoundexScoring);

scoreing.AddRange(WordScoring);

///tally up relevant scores

int level1 = scoreing.Count<int>( i=>i==0 );

int level2 = scoreing.Count<int>( i=>i==1 );

int level3 = scoreing.Count<int>( i=>i==2 );

int level4 = scoreing.Count<int>( i=>i==3 );

int level5 = scoreing.Count<int>( i=>i==4 );

int boost=0;

///if we have an exact match, we want to show an extremely high score

///i.e. the weightings become irrelevant and we dont want to risk other things

///having higher score due to many words @ level1/2 etc

if( input1 == input2 ) boost= 1000;

else if( input1.ToLower() == input2.ToLower() ) boost= 500;

return ( level1*45+level2*10+ level3*5+ level4*3+ level5*2+boost );

}

 

private static int[] ProcessIndexesLD(string[] keys, string[] inputkeys, bool debug)

{

List<string> dedupe = new List<string>();

List<int> scoreing = new List<int>();

 

for (int x = 0; x < keys.Length; x++)

{

string key = keys[x];

 

for (int y = 0; y < inputkeys.Length; y++)

{

string inputkey = inputkeys[y];

 

string combination1 = String.Format("{0}_{1}", x, y);

string combination2 = String.Format("{0}_{1}", y, x);

 

if (!dedupe.Contains(combination1))

{

scoreing.Add(LevenshteinDistance.LD(key, inputkey));

}

 

dedupe.Add(combination1);

dedupe.Add(combination2);

}

}

return scoreing.ToArray();

}


Let's have some fun from PowerShell:


function fp {

 

$what=$args[0]

 

tgp | select Name, @{Name="Score";Expression={[TimeSheetAPI.SimilarityTools]::GetSimilarityScore( $_.Name, $what, 0 )}} | sort Score -des| where{$_.code -ne 0} | select -fir 30 | ft -auto;

}





I could show you some other examples, but damn. It's almost bullet-proof!


With this in place we can now automatically pull stuff from items in the Outlook calendar. If we don't have a decent match, we can simply assume that a new project should be created.


Final Powershell Script:



function SyncTimeSheet{

 

$o = New-Object -com Outlook.Application;

$ns = $o.GetNamespace("MAPI");

 

$projregex = [Regex] "\[(?<name>.+?)\]"

$clientregex = [Regex] "\{(?<name>.+?)\}"

 

$ns.GetDefaultFolder(9).Items | where { $_.Body -inotmatch "ADDED TO TIMESHEET" -and $_.Subject -imatch "^TS:" } | % {

 

if( $_.AllDayEvent ) {

$hours=8;

}

else {

$hours = ( [Convert]::ToDateTime($_.End) - [Convert]::ToDateTime($_.Start) ).Hours

}

$chargable = 1;

 

if( [Regex]::IsMatch( $_.Subject, "\[c0\]" ) ) {

 

$chargable = 0;

}

 

#perform fuzzy match on project

if( $projregex.IsMatch( $_.Subject ) ) {

 

$projectname=$projregex.Match( $_.Subject ).Groups["name"].Value

tfpraw( $projectname ) | where{ $_.Score -gt 100 } | select -fir 1 | % { $projectname=$_.Name }

 

write-host $_.Subject +"_"+ $_.Start +"-->MATCH FOUND: "+ $projectname

}

 

$client = "Internal";

 

#fuzzy match the client

if( $clientregex.IsMatch( $_.Subject ) ) {

 

$client=$clientregex.Match( $_.Subject ).Groups["name"].value;

tfcraw( $client ) | where{ $_.Score -gt 100 } | select -fir 1 | % { $client=$_.Name }

}

 

#the timesheet app want to see time 00:00:00, god knows :)

$start = [Convert]::ToDateTime( [Convert]::ToDateTime($_.Start).ToShortDateString() )

$end = [Convert]::ToDateTime( [Convert]::ToDateTime($_.End).ToShortDateString() )

 

 

[TimeSheetAPI.TimeSheetTools]::InsertTimeInformation(

100,

$_.Subject +" "+ $_.Description,

$_.Organizer,

$chargable,

$start,

$end,

$hours,

$projectname,

$client,

1) | ft -auto

 

#///ensure it doesnt get hit again

$_.Body = $_.Body + " [ADDED TO TIMESHEET]"

}

}

 

 

What would be nice


When you write "proper" command-lets you can do far more clever and expressive things i.e. implementing whatif and confirm extensions (as well as pipeline input although that is possible from straight functions).


Imagine some of the expressive possibilities:


Get-TimeSpent -Today | Log-Time -whatif

Get-Client | where{ $_.Name imatch "Microsoft" } | Delete-Client -confirm


We have a sophisticated (home-grown) database proxy code-generation system in the company. You point it at the database and it gives you C#.NET. Now I'm thinking about it giving me powershell commandlets too.


Powershell has much in common with C#3.0/Language Query because of it's functional rather than imperative approach to programming.


Think of it as a half way house between SQL and C#2.0.


Get-Customers | where{ $_.Orders gt 300 } | % { Delete-Customer $_ }

Tags: powershell similarity levenshtein soundex csharp powergadgets outlook automation search


Copyright Tim Scarfe © 1999-2006. All rights reserved.
Dot Net Solutions