Friday 17 June 2011

Managing .NET RESX Duplication – Part 2: Identifying Resource Duplication

Update 20/06/2011: Fixed a bug in the C# snippet that shows how to process the list of resource files

This is the second post in a three-part series in which I look at a possible solution for managing duplication of resources between your .NET RESX files.  In part 1 I covered the overall solution requirements and the design that fits our environment.  In this second part I’m going to delve straight into the custom MSBuild task that identifies the resource duplication across the resource files used in our code base.  As mentioned, the solution consists of two distinct phases.  Phase 1 identifies the resource duplicates across the .RESX files.  Phase 2 attempts to minimize the duplication in these files.  We decided to implement both phases as custom MSBuild tasks that run as part of every solution compilation to satisfy our Continuously Execute requirement.

Data Structures

Before looking at the custom task, let’s consider the algorithm and data structures that we will use to track the resource duplication.  The MSBuild task will receive as input a list of .RESX files that it will scan for duplication.  Whilst scanning through the files it will build up a dictionary containing all the resource entries for a single key. This can be visualized as illustrated below:

[image: dictionary mapping each resource key to its resource entries]
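As a rough sketch of that structure, the snippet below groups entries per resource key using plain strings.  The ResourceIndex class and its members are hypothetical stand-ins for illustration only; the real task stores ResXDataNode instances rather than string values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the dictionary described above:
// resource key -> list of (file, value) pairs found for that key.
public class ResourceIndex
{
    private readonly Dictionary<string, List<KeyValuePair<string, string>>> _entries =
        new Dictionary<string, List<KeyValuePair<string, string>>>();

    // Record that 'file' contains 'key' with the given 'value'.
    public void Add(string key, string file, string value)
    {
        List<KeyValuePair<string, string>> list;
        if (!_entries.TryGetValue(key, out list))
        {
            list = new List<KeyValuePair<string, string>>();
            _entries.Add(key, list);
        }
        list.Add(new KeyValuePair<string, string>(file, value));
    }

    // Keys with more than one recorded entry are the duplicate candidates.
    public IEnumerable<string> DuplicateKeys()
    {
        return _entries.Where(e => e.Value.Count > 1).Select(e => e.Key);
    }
}
```

Adding the key "OK" from two different files and "Cancel" from one file would leave "OK" as the only duplicate candidate.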

For every entry found, we need to track the File as well as the ResXDataNode found and for this we use a ResXDuplicateEntry class:

    public class ResXDuplicateEntry
    {
        #region Properties

        public string File { get; set; }
        public ResXDataNode Node { get; set; }

        #endregion

        #region Methods

        public static string GetLanguageNeutralFile(string file, string culture)
        {
            return culture == null ? file : file.Replace("." + culture + ".resx", ".resx");
        }

        public static string GetLanguageSpecificFile(string file, string culture)
        {
            return culture == null ? file : file.Replace(".resx", "." + culture + ".resx");
        }

        #endregion
    }
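To make the file name mapping concrete, here is a small self-contained sketch that repeats the two string replacements in a hypothetical ResxFileNameDemo class (the class and method names are mine, not part of the task library).

```csharp
using System;

// Hypothetical demo class that repeats the two string replacements
// used by ResXDuplicateEntry so they can be tried in isolation.
public static class ResxFileNameDemo
{
    // "Strings.resx" + "pt-BR" -> "Strings.pt-BR.resx"; a null culture returns the file unchanged.
    public static string ToLanguageSpecific(string file, string culture)
    {
        return culture == null ? file : file.Replace(".resx", "." + culture + ".resx");
    }

    // "Strings.pt-BR.resx" + "pt-BR" -> "Strings.resx"; a null culture returns the file unchanged.
    public static string ToLanguageNeutral(string file, string culture)
    {
        return culture == null ? file : file.Replace("." + culture + ".resx", ".resx");
    }
}
```

Note that String.Replace substitutes every occurrence of the search text, so a path that contains ".resx" in a directory name as well would be changed in both places.  For the flat Resources folder used here that does not matter.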

A single resource key can contain multiple resource entries.   We use a ResXDuplicate container to track all the ResXDuplicateEntry instances for a single resource key across the different files:

    public class ResXDuplicate
    {
        private static readonly ResXDuplicateComparer comparer = new ResXDuplicateComparer();

        public ResXDuplicate(TaskLoggingHelper log, string culture)
        {
            Log = log;
            Culture = culture;
            Entries = new List<ResXDuplicateEntry>();
        }

        #region Properties

        public int Count
        {
            get { return Entries.Count; }
        }

        public string Culture { get; private set; }

        public string Key
        {
            get { return Node.Name; }
        }

        public ResXDataNode Node
        {
            get { return Entries[0].Node; }
        }

        private List<ResXDuplicateEntry> Entries { get; set; }

        #endregion

        #region Methods

        public void AddEntry(ResXDataNode node, string file)
        {
            Entries.Add(new ResXDuplicateEntry {Node = node, File = file});
        }

        #endregion
    }

Once we have scanned through all the resource files and built up our dictionary of resource keys with resource entries, we identify the list of duplicates by filtering the dictionary down to all keys that have more than one ResXDuplicateEntry instance.  For these duplicates, we then compare the individual entries for the same key across the different files to each other.  For this we created a few extension methods on the ResXDataNode class to make reading and setting the values a bit easier.

    public static class ResXExtensions
    {
        public static string GetValue(this ResXDataNode node)
        {
            return node.GetValue((ITypeResolutionService)null).ToString().Trim();
        }

        public static ResXDataNode Tag(this ResXDataNode node, string tagToken)
        {
            ResXDataNode taggedNode = new ResXDataNode(node.Name, tagToken);
            return taggedNode;
        }

        public static bool IsTagged(this ResXDataNode node, string tagToken)
        {
            return GetValue(node).StartsWith(tagToken);
        }
    }

When comparing the different values for the same key, we can get one of two outcomes:

  1. All the entries match: if all the entries match, we can safely report all of them as duplicates
  2. Some entries differ: if only some entries match, we first find the value that is duplicated the most amongst the entries, and use that to report the number of duplicates identified
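The second outcome can be sketched with a LINQ grouping over plain string values.  This is only an illustration under my own assumptions: MostUsedValueDemo and its methods are hypothetical names, and the real task works on ResXDataNode entries rather than strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing how the most duplicated value can be found.
public static class MostUsedValueDemo
{
    // Returns the value that occurs most often, or null for an empty input.
    // On a tie, the value that was encountered first wins.
    public static string FindMostUsedValue(IEnumerable<string> values)
    {
        return values
            .GroupBy(v => v)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .FirstOrDefault();
    }

    // Number of entries that share the most used value.
    public static int CountMostUsed(IEnumerable<string> values)
    {
        List<string> list = values.ToList();
        string mostUsed = FindMostUsedValue(list);
        return list.Count(v => v == mostUsed);
    }
}
```

For the values "Save", "Save" and "Store", the most used value is "Save" and two entries would be reported as duplicates.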

We use a ResXStatsSummary class to report the statistics of the resources scan.

    public struct ResXStatsSummary
    {
        #region Properties

        public int Duplicates { get; set; }

        public int NotDuplicates
        {
            get { return Scanned - Duplicates; }
        }

        public int Scanned { get; set; }
        public int DuplicatesTagged { get; set; }

        public int TotalResourcesToTranslate
        {
            get { return (Duplicates - DuplicatesTagged) + NotDuplicates; }
        }

        #endregion

        #region Methods

        public override string ToString()
        {
            return String.Format(CultureInfo.InvariantCulture,
                @"Scanned: {1} resource(s){0}Unique: {3} resource(s){0}Duplicates: {4} resource(s){0}Duplicates tagged: {5} resource(s){0}Total to translate: {2} resource(s)",
                Environment.NewLine, Scanned, TotalResourcesToTranslate, NotDuplicates, Duplicates, DuplicatesTagged);
        }

        #endregion
    }
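To see how the derived figures fit together, the sketch below repeats the arithmetic of TotalResourcesToTranslate in a hypothetical StatsDemo helper.  The numbers in the usage note are made up for illustration and are not from our code base.

```csharp
using System;

// Hypothetical helper that mirrors the arithmetic of ResXStatsSummary.
public static class StatsDemo
{
    // Untagged duplicates still need translating, as do all non-duplicates.
    public static int TotalToTranslate(int scanned, int duplicates, int duplicatesTagged)
    {
        int notDuplicates = scanned - duplicates;
        return (duplicates - duplicatesTagged) + notDuplicates;
    }
}
```

With 100 scanned resources, of which 30 are duplicates and 20 of those are already tagged, 70 resources are not duplicates and (30 - 20) + 70 = 80 resources remain to translate.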

 

Scanning Resource Files

Now that we’ve covered the basics of the scanning algorithm and the data structures used for storing the results, let’s look at the actual code that scans the resource files to build the dictionary of ResXDuplicate entries.  All the MSBuild tasks that we create inherit from a ResXTaskBase class that contains the shared scanning logic.  Here is the definition of the class along with its member variables.  Notice that the list of files is received as an input parameter for the MSBuild task on line 19.  Also notice the dictionary of ResXDuplicate entries on line 10 that will contain the result of scanning all the resource files.

     1  public abstract class ResXTaskBase : Task
     2  {
     3      public const string TagToken = "[## SHARED ##]";
     4
     5      protected IEnumerable<ResXDuplicate> _distinct;
     6      protected IEnumerable<ResXDuplicate> _duplicates;
     7      protected IEnumerable<ResXDuplicate> _notDistinct;
     8
     9      // Dictionary keyed on Resource Key that contains all the files containing an entry for the same key
    10      protected Dictionary<string, ResXDuplicate> _processed;
    11      protected Dictionary<string, List<ResXDataNode>> _distinctFileDuplicates = new Dictionary<string, List<ResXDataNode>>();
    12      protected Dictionary<string, List<ResXDataNode>> _mostUsedFileDuplicates = new Dictionary<string, List<ResXDataNode>>();
    13
    14      protected ResXStatsSummary _stats;
    15
    16      #region Properties
    17
    18      [Required]
    19      public ITaskItem[] SourceFiles { get; set; }
    20
    21      #endregion

Next up is the scanning algorithm.  As we want to use the same scanning algorithm to scan all the files specific to a culture, we make sure that we can specify what culture to take into account as a parameter.  By default the culture will be null which implies that all the culture neutral resource files will be scanned:

    protected Dictionary<string, ResXDuplicate> ParseResxFiles(string culture = null)

We start the scanning algorithm by building up the list of resource files to scan, taking the culture parameter into account.  If we are, for example, scanning the Brazilian Portuguese resources, we make sure that we add the pt-BR culture onto the file name being processed:

    Dictionary<string, ResXDuplicate> processed = new Dictionary<string, ResXDuplicate>();

    // Build a list of files to process
    List<string> resxFiles = new List<string>();
    foreach (var sourceFile in SourceFiles)
    {
        string fileName = ResXDuplicateEntry.GetLanguageSpecificFile(sourceFile.ItemSpec, culture);
        resxFiles.Add(fileName);
    }

Once we have the actual list of file names to process, we can now simply run through the list of files, reading all the entries to populate our dictionary with the respective ResXDuplicate entries:

    foreach (var resxFile in resxFiles)
    {
        if (!File.Exists(resxFile))
            continue;

        using (ResXResourceReader reader = new ResXResourceReader(resxFile))
        {
            reader.UseResXDataNodes = true;
            Log.LogMessage(MessageImportance.Low, @"Scanning {0}...", Path.GetFileName(resxFile));
            foreach (DictionaryEntry entry in reader)
            {
                string key = entry.Key.ToString();

                ResXDataNode node = (ResXDataNode) entry.Value;
                ResXDuplicate duplicate;
                if (!processed.TryGetValue(key, out duplicate))
                {
                    duplicate = new ResXDuplicate(Log, culture);
                    processed.Add(key, duplicate);
                }
                duplicate.AddEntry(node, resxFile);
            }
        }
    }
    return processed;

With the dictionary now fully populated with all resource entries, we can proceed to filter out the duplicate entries.  As mentioned, when doing the filtering we identify for every resource key whether all the entries contain the same value, i.e. whether the key is distinct or not.  For this we use the IsDistinct method created on the ResXDuplicate class:

    public bool IsDistinct()
    {
        List<ResXDuplicateEntry> notAlreadyShared = Entries.Where(x => !x.Node.IsTagged(ResXTaskBase.TagToken)).ToList();
        return notAlreadyShared.Count == 0 || notAlreadyShared.Distinct(comparer).Count() == 1;
    }
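The ResXDuplicateComparer used by IsDistinct is not listed in this post.  As a rough idea of what a value-based comparer could look like, here is a sketch over a simplified entry type; SimpleEntry, SimpleEntryValueComparer and DistinctDemo are hypothetical names of mine, and the ordinal string comparison is an assumption rather than the behaviour of the real comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified entry: the real ResXDuplicateEntry holds a ResXDataNode.
public class SimpleEntry
{
    public string File { get; set; }
    public string Value { get; set; }
}

// Assumed shape of a comparer that treats entries with the same value as equal.
public class SimpleEntryValueComparer : IEqualityComparer<SimpleEntry>
{
    public bool Equals(SimpleEntry x, SimpleEntry y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(x.Value, y.Value, StringComparison.Ordinal);
    }

    public int GetHashCode(SimpleEntry obj)
    {
        return (obj == null || obj.Value == null) ? 0 : obj.Value.GetHashCode();
    }
}

public static class DistinctDemo
{
    // True when all entries collapse to a single value under the comparer.
    public static bool AllValuesMatch(IEnumerable<SimpleEntry> entries)
    {
        return entries.Distinct(new SimpleEntryValueComparer()).Count() == 1;
    }
}
```

Unlike IsDistinct, this simplified AllValuesMatch returns false for an empty list, because it has no notion of entries that were already tagged as shared.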

We first ignore all nodes that have been tagged as shared (we’ll cover this in the next post) and then use our ResXDuplicateComparer and a LINQ query to check whether all entries in the list match.  We store the results of our filtering in two separate lists of ResXDuplicate entries:

    public override bool Execute()
    {
        // Parse the language neutral resource files
        _processed = ParseResxFiles(null);

        // Identify which entries contain duplicates (i.e. > 1 entry for the same key)
        _duplicates = _processed.Values.Where(x => x.Count > 1);

        // Identify which duplicates contain the same value across all files
        _distinct = _duplicates.Where(x => x.IsDistinct());

        // Identify which duplicates contain different values across all files
        _notDistinct = _duplicates.Where(x => !x.IsDistinct());

Once we have the lists, we can set up the stats for the scan as follows:

        // Setup the stats
        _stats = new ResXStatsSummary
        {
            Scanned = _processed.Values.Sum(x => x.Count),
            Duplicates = _duplicates.Sum(x => x.Count),
            DuplicatesTagged = _duplicates.Sum(x => x.EntriesTaggedCount())
        };

Lastly, we finish off the scan by creating two separate dictionaries of all resource nodes keyed per file from the existing _distinct and _notDistinct filtered lists.  This is useful for reporting which files contained duplicates and will be used extensively in the next blog post where we cover how to actually eliminate the resource duplication.

        // Setup Dictionaries keyed per file for the respective duplicate resources
        _distinctFileDuplicates = BuildResxFileList(_distinct, x => x.ToDictionary());
        _mostUsedFileDuplicates = BuildResxFileList(_notDistinct, x => x.FindMostUsedEntries().ToDictionary(y => y.File, y => y.Node));

Here is the BuildResxFileList routine:

    protected Dictionary<string, List<ResXDataNode>> BuildResxFileList(IEnumerable<ResXDuplicate> duplicates, Func<ResXDuplicate, Dictionary<string, ResXDataNode>> findEntriesFilter)
    {
        Dictionary<string, List<ResXDataNode>> resourceFiles = new Dictionary<string, List<ResXDataNode>>();
        foreach (var duplicate in duplicates)
        {
            Dictionary<string, ResXDataNode> entries = findEntriesFilter(duplicate);
            foreach (var entry in entries)
            {
                if (!resourceFiles.ContainsKey(entry.Key))
                    resourceFiles.Add(entry.Key, new List<ResXDataNode>());

                resourceFiles[entry.Key].Add(entry.Value);
            }
        }
        return resourceFiles;
    }

With all of this infrastructure in place in the ResXTaskBase, creating the ResXFindDuplicates task is straightforward as illustrated below.

     1  public class ResXFindDuplicates : ResXTaskBase
     2  {
     3      public enum LogOutput
     4      {
     5          Text = 0,
     6          Xml = 1,
     7          Csv = 2
     8      }
     9
    10      #region Properties
    11
    12      [Output]
    13      public ITaskItem[] FilesWithDuplicates { get; set; }
    14
    15      [Output]
    16      public ITaskItem[] FilesWithNoDuplicates { get; set; }
    17
    18      [Required]
    19      public string LogFile { get; set; }
    20
    21      [Required]
    22      public int LogType { get; set; }
    23
    24      #endregion
    25
    26      #region Methods
    27
    28      public override bool Execute()
    29      {
    30          Log.LogMessage(MessageImportance.Normal, @"Finding duplicate resources within {0} resx files...", SourceFiles.Length);
    31
    32          // Parse the language neutral resources
    33          bool result = base.Execute();
    34
    35          Log.LogMessage(MessageImportance.High, _stats.ToString());
    36
    37          // Create the output parameters containing the files affected/not affected
    38          List<ITaskItem> duplicates = new List<ITaskItem>();
    39          List<ITaskItem> noDuplicates = new List<ITaskItem>();
    40          foreach (ITaskItem sourceResxFile in SourceFiles)
    41          {
    42              if (_distinctFileDuplicates.ContainsKey(sourceResxFile.ItemSpec))
    43                  duplicates.Add(sourceResxFile);
    44              else
    45                  noDuplicates.Add(sourceResxFile);
    46          }
    47          FilesWithDuplicates = duplicates.ToArray();
    48          FilesWithNoDuplicates = noDuplicates.ToArray();
    49
    50          // Create the Log file
    51          if (Enum.IsDefined(typeof (LogOutput), LogType))
    52          {
    53              Log.LogMessage(MessageImportance.Normal, @"Writing {0} log to {1}...", Enum.GetName(typeof (LogOutput), LogType), LogFile);
    54
    55              if ((LogOutput) LogType == LogOutput.Text)
    56                  WriteTextLog(_stats, _duplicates);
    57              else if ((LogOutput) LogType == LogOutput.Xml)
    58                  WriteXmlLog(_stats, _duplicates);
    59              else if ((LogOutput) LogType == LogOutput.Csv)
    60                  WriteCsvLog(_duplicates);
    61          }
    62          else
    63          {
    64              Log.LogError("Invalid LogType specified");
    65              result = false;
    66          }
    67
    68          return result;
    69      }
    70
    71      private void WriteCsvLog(IEnumerable<ResXDuplicate> duplicates)
    72      {
    73          StringBuilder sb = new StringBuilder();
    74          sb.AppendLine("Shared,File,Name,Value");
    75          foreach (var entry in duplicates.OrderByDescending(x => x.Count))
    76          {
    77              sb.Append(entry.ToCsvText()); // Append rather than AppendFormat: resource values may contain braces such as {0}
    78          }
    79
    80          WriteLogFile(sb, LogFile);
    81      }
    82
    83      ...

We receive the LogFile and LogType as input parameters to the MSBuild task and we have separate FilesWithDuplicates and FilesWithNoDuplicates output parameters that will contain the names of the files with and without duplicates.  After calling into the base class (line 33) to scan the files and set up all the data structures, we simply set up the lists of output files (lines 40-48) and create the desired log output (lines 51-66).

Invoking the MSBuild Task

Now that we’ve covered the MSBuild task, let’s see how we would invoke it when compiling our projects within Visual Studio.  As every .csproj in Visual Studio is an MSBuild file, invoking the task is actually quite easy.  We will hook into the BeforeBuild target of the project containing our resource files to execute our ResXFindDuplicates task.  We therefore edit the specific .csproj and start by adding the following to identify the set of resource files to scan:

     1  <PropertyGroup>
     2    <ResxDuplicatesFile>Resources\SharedResources.resx</ResxDuplicatesFile>
     3    <ResxFindDuplicatesLog>Resources\SharedResources.csv</ResxFindDuplicatesLog>
     4    <NEW_LINE>%0D%0A</NEW_LINE>
     5    <TAB>%09</TAB>
     6  </PropertyGroup>
     7  <ItemGroup>
     8    <ResxInputs Include="Resources\*.resx" Exclude="Resources\*.*.resx">
     9      <Visible>false</Visible>
    10    </ResxInputs>
    11    <ResxOutputs Include="$(ResxFindDuplicatesLog)">
    12      <Visible>false</Visible>
    13    </ResxOutputs>
    14  </ItemGroup>

Notice that the ResxInputs ItemGroup (lines 8-10) is set to match only the culture neutral resource files by excluding any culture specific files.  Also notice that we set the visibility of these item groups to false to prevent them from showing up in the Visual Studio solution explorer.  We can now invoke our custom task by hooking it into the BeforeBuild target.  Every .csproj has a BeforeBuild and an AfterBuild target that Visual Studio will invoke before and after every compilation of the project.

     1  <Import Project="..\..\Tools\MSBuild\Pragma.MSBuild.Tasks\Pragma.MSBuild.Tasks.Targets" />
     2  <Target Name="BeforeBuild" Inputs="@(ResxInputs)" Outputs="@(ResxOutputs)">
     3    <!-- Find the resx files containing duplicates -->
     4    <ResXFindDuplicates SourceFiles="@(ResxInputs)" LogType="2" LogFile="$(ResxFindDuplicatesLog)">
     5      <Output TaskParameter="FilesWithDuplicates" ItemName="ResxFilesWithDuplicates" />
     6      <Output TaskParameter="FilesWithNoDuplicates" ItemName="ResxFilesWithNoDuplicates" />
     7    </ResXFindDuplicates>
     8    <Message Text="Resx files with duplicates:$(NEW_LINE)$(TAB)@(ResxFilesWithDuplicates->'%(RecursiveDir)%(FileName)%(Extension)', '$(NEW_LINE)$(TAB)')" Importance="normal" />
     9    <Message Text="Resx files with no duplicates:$(NEW_LINE)$(TAB)@(ResxFilesWithNoDuplicates->'%(RecursiveDir)%(FileName)%(Extension)', '$(NEW_LINE)$(TAB)')" Importance="normal" />
    10  </Target>

After importing our compiled MSBuild tasks, we simply invoke the task as shown on lines 4-7 by specifying what kind of log file we want (2 = CSV) and where the log should be created.  We can investigate and mine the log output to further assist us with our translation efforts.  Finally, here is the output that we will now receive in Visual Studio every time we compile the project containing our resource files:

[image: CompileOutput]

Conclusion

Now that we know the amount of resource duplication we have across our resource files, we need a mechanism to eliminate as much of the duplication as possible.  This we’ll cover in the next post of the series. Till then…
