Using LINQ to XML in polyhierarchical tree structures:
A programmatic approach
Jonathan Sexton, Data & Services Developer, UK Data Archive
17 April 2014
V1.0
UK Data Archive
• based at the University of Essex since 1967
• curator of the UK’s largest collection of digital data in the social sciences
• currently holds nearly 6,000 data collections for research and teaching, both quantitative and qualitative
• certified to ISO 27001, the international information security standard
• makes these available via the new UK Data Service
Website: www.data-archive.ac.uk
UK Data Service
• the UK Data Service indexes all data collections in the Archive
– all catalogued at thematic level
– many indexed at variable level
• also harvests metadata from other sources
• all are available for download via the Discover search-and-browse catalogue: discover.ukdataservice.ac.uk
UK Data Service
• led by experts at the University of Essex along with colleagues at Manchester, Leeds, Southampton, Edinburgh and UCL
• also provides access to UK Census data (1971 to 2011)
• source of guidance, training and support for data users in the UK and around the world
• currently serves approx. 24,000 registered users
• newly funded to coordinate the Administrative Data Research Network, part of the UK’s Big Data strategy
Websites: ukdataservice.ac.uk, census.ukdataservice.ac.uk
Lesson Objectives
To discuss the method used to navigate a polyhierarchical tree structure, in XML, using C# and LINQ to XML, in order to provide the calling application with the data it requires, given an XML object to navigate in and a valid identifier within it.
Scenario
• An application allows a user to add a ‘suggestion’ to a tree-view-based structure, in this particular case a thesaurus for the social sciences (ELSST).
• The tree view is polyhierarchical, in that identical terms may appear in several places in the structure.
• The data for the tree view is stored in an XML file, held in the user’s session.
• The application and its supporting architecture are written in .NET v4.5 (C#).
The XML (a snippet of the data)
[Figure: XML snippet with callouts labelling the Top/Broader Term, the Identifier, the Attributes for Term and a Narrower Term]
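The XML on this slide is not reproduced in the text, so the following is only a sketch of a document with the shape the later queries rely on: nested ‘element’ nodes, each with a ‘data’ label and an ‘attr’ block holding a ‘class’ identifier and optional ‘BTs’/‘BT’ entries (with ‘BTLex’ and ‘BTID’ attributes). The ‘root’ wrapper, the SampleThesaurus class, and the short identifiers are hypothetical; the term labels echo the search example used in this talk.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SampleThesaurus
{
    // Hypothetical data: only the element and attribute names are taken
    // from the queries shown in the implementation slides.
    public const string Xml = @"<root>
  <element>
    <data>ABILITY</data>
    <attr><class>class-A</class></attr>
    <element>
      <data>COMMUNICATION SKILLS</data>
      <attr>
        <class>class-CS</class>
        <BTs><BT BTLex=""ABILITY"" BTID=""A"" /></BTs>
      </attr>
    </element>
  </element>
  <element>
    <data>COMMUNICATION</data>
    <attr><class>class-C</class></attr>
    <element>
      <data>COMMUNICATION SKILLS</data>
      <attr>
        <class>class-CS</class>
        <BTs><BT BTLex=""COMMUNICATION"" BTID=""C"" /></BTs>
      </attr>
    </element>
  </element>
</root>";

    public static XElement Load()
    {
        return XElement.Parse(Xml);
    }

    // Counts how many times a term identifier occurs in the tree; a value
    // above one is what makes the structure polyhierarchical.
    public static int CountOccurrences(XElement tree, string cid)
    {
        return tree.Descendants("element")
                   .Count(e => (string)e.Element("attr").Element("class") == cid);
    }
}
```

Calling SampleThesaurus.CountOccurrences(SampleThesaurus.Load(), "class-CS") on this sample counts the term once under each of its two parents.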
The tree view in action in the application
A user has performed a search on ‘communication skills’ as a Preferred Term in the thesaurus, and the results are as shown. We can clearly see two instances in the hierarchy, under different ‘Top Terms’, ‘Ability’ and ‘Communication’.
Requirement
• When the user is making a ‘suggestion’ they may wish to alter the position in the hierarchy at which the term in scope appears.
• To do this, three dropdown lists are presented to them: valid ‘broader’, ‘narrower’ and ‘related’ terms, each pertaining to the term in scope.
• Given the full XML data and the identifier of the term in scope, calculate the content of these lists.
Design
• We need a method that, given the full XML (as a System.Xml.Linq.XElement object) and the identifier (as a string), will populate three lists (representing valid ‘broader’, ‘narrower’ and ‘related’ terms) and return them as an IEnumerable<List<string>> (a sequence of three lists of strings).
• We will need to use the System.Xml.Linq library to enable us to easily navigate through the passed XElement object.
• We will need to implement our method in an existing UK Data Archive DLL, namely our UKDA.Utility.Library component, in the DataAccess/XmlFileReader class.
Implementation (1 of 12)
• OK, our new method is going to be:
    public static IEnumerable<List<string>> GetRelevantTermsLists(
        XElement xElementOriginal,
        string cid)
    {
• And let’s get something to work with:
        const string Delimiter = "|";
        const string ClassString = "class-";

        // Work on a deep copy so the caller's tree is never modified.
        var xElement = new XElement(xElementOriginal);

        // Every occurrence of the term in scope (it may appear in several places).
        var el = xElement.DescendantsAndSelf("element").Where(
            nm => string.Equals(nm.Element("attr").Element("class").Value,
                                cid,
                                StringComparison.CurrentCultureIgnoreCase));
        var btList = new List<string>();
        var ntList = new List<string>();
        var rtList = new List<string>();
        var xElements = el as IList<XElement> ?? el.ToList();
NOTE: XElement is a reference type, so the method receives a reference to the caller’s object rather than a copy of it, and any change made through that reference would be visible to the caller. We therefore need to make a copy before we do any manipulation on it. The identifier (‘cid’) is OK as it is: strings are immutable, so it cannot be changed through the parameter.
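As a small illustration of why the copy matters, the sketch below uses the XElement copy constructor and then edits the copy in the same way the exclusions step later does. The CopyDemo class and its sample element are hypothetical and exist only for this demonstration.

```csharp
using System;
using System.Xml.Linq;

public static class CopyDemo
{
    // Returns true when editing the copy leaves the original untouched.
    public static bool CopyIsIndependent()
    {
        var original = XElement.Parse("<element><data>ABILITY</data></element>");

        // The copy constructor produces a deep copy of the element and its children.
        var copy = new XElement(original);

        // An in-place edit on the copy only.
        copy.Element("data").Value += "|class-A";

        return original.Element("data").Value == "ABILITY"
            && copy.Element("data").Value == "ABILITY|class-A";
    }
}
```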
Implementation (2 of 12)
• So now let’s put some meat on the bones: firstly we should get the ‘broader’, ‘narrower’ and ‘related’ terms in our xElements object and store them in our declared lists…
        foreach (var e in xElements)
        {
            var bts =
                e.DescendantsAndSelf("element")
                    .First()
                    .Elements("attr")
                    .Elements("BTs")
                    .Elements("BT");
            btList.AddRange(
                bts.Select(bt => bt.Attribute("BTLex").Value + Delimiter + ClassString +
                                 bt.Attribute("BTID").Value));
            // The ‘narrower’ and ‘related’ terms are collected here in the same way.
        }
• And so on for the ‘narrower’ and ‘related’ terms too.
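The ‘broader’ terms query can be tried in isolation with a sketch like the one below. It assumes the term node carries an ‘attr’ block with ‘BTs’/‘BT’ children, as the query above implies. The BroaderTermsDemo class is hypothetical. Because DescendantsAndSelf("element").First() on a matched ‘element’ node is that node itself, the sketch starts from the node directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BroaderTermsDemo
{
    // Builds "LABEL|class-ID" strings from the BT entries of one term node.
    public static List<string> GetBroaderTerms(XElement term)
    {
        const string Delimiter = "|";
        const string ClassString = "class-";

        return term.Elements("attr")
                   .Elements("BTs")
                   .Elements("BT")
                   .Select(bt => (string)bt.Attribute("BTLex") + Delimiter +
                                 ClassString + (string)bt.Attribute("BTID"))
                   .ToList();
    }
}
```

A term with no ‘BTs’ block simply yields an empty list, since each Elements call on an empty sequence returns an empty sequence.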
Implementation (3 of 12)
• So far so good, all pretty easy stuff, but now let’s crank up the gameplay a little.
• We’re now going to have to ‘walk’ up our XElement object, looking for parent nodes to add to an exclusions list (because we’re not going to want these to be returned from the method).
• We’re going to need two, and only two, iterations. The first iteration will get us all the instances of ‘self’, whereas the second will bring back all the ancestors: calling Ancestors() on the set of matches returns every ancestor up to the root in one go, so no further passes are needed.
• The following code snippet will hopefully shed some light on what we’re trying to achieve here.
Implementation (4 of 12)
        var exclusionsList = new List<string>();
        // Only TWO iterations are required.
        for (var i = 0; i < 2; i++)
        {
            var pt = el.Elements("data");
            var ptId = el.Elements("attr").Elements("class");
            var elements = pt as IList<XElement> ?? pt.ToList();
            var ptArray = pt as XElement[] ?? elements.ToArray();
            var id = ptId as IList<XElement> ?? ptId.ToList();
            var ptIdArray = ptId as XElement[] ?? id.ToArray();
            // Add our data into our ‘ptArray’ variable in the form of:
            // "DATA ANALYSIS|class-E72A058B-8A32-E311-93C1-000BDB5CC6D5".
            for (var j = 0; j < ptArray.Length; j++)
            {
                if (!ptArray[j].Value.Contains(Delimiter)) { ptArray[j].Value += Delimiter + ptIdArray[j].Value; }
            }
            var xElements1 = pt as IList<XElement> ?? ptArray.ToList();
            // Break out of the outer loop if we encounter an empty list.
            if (!xElements1.Any()) { break; }
            // Loop around and only put in items to exclude if they don’t already exist.
            foreach (var p in xElements1.Where(p => !exclusionsList.Contains(p.Value)))
            {
                exclusionsList.Add(p.Value);
            }
            el = el.Ancestors();
        }
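To see why two passes are enough, here is a reduced sketch of the same walk that only collects labels, without the in-place edits. The AncestorWalkDemo class and its method name are hypothetical. The sketch assumes every ‘element’ node has an ‘attr’/‘class’ child, as the queries above do.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AncestorWalkDemo
{
    // Pass 0 visits the matching nodes themselves; pass 1 visits all of their
    // ancestors, because Ancestors() walks every parent chain up to the root.
    public static List<string> SelfAndAncestorLabels(XElement tree, string cid)
    {
        IEnumerable<XElement> current = tree.Descendants("element")
            .Where(e => (string)e.Element("attr").Element("class") == cid);

        var labels = new List<string>();
        for (var i = 0; i < 2; i++)
        {
            foreach (var label in current.Elements("data").Select(d => d.Value))
            {
                if (!labels.Contains(label)) { labels.Add(label); }
            }
            current = current.Ancestors();
        }
        return labels;
    }
}
```

Ancestors that have no direct ‘data’ child, such as a wrapping root node, contribute nothing to the result.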
Implementation (5 of 12)
• Still with us? OK, so now we should consider getting the ‘child’ nodes for removal, thus:
        var elementsToRemove =
            from a in xElement.DescendantsAndSelf("element").Where(
                s => string.Equals(
                    s.Element("attr").Element("class").Value,
                    cid,
                    StringComparison.CurrentCultureIgnoreCase))
            select a;
Implementation (6 of 12)
• Let’s now add this data into our exclusions list, thus:
        foreach (var element in elementsToRemove)
        {
            var v = from child in element.Descendants("element")
                    select
                        new
                        {
                            DataToExclude =
                                child.Element("data").Value + Delimiter +
                                child.Element("attr").Element("class").Value
                        };
            exclusionsList.AddRange(v.Select(x => x.DataToExclude));
        }
Implementation (7 of 12)
• We now need to jump back to our original XElement, the one we were originally passed, thus:
        var allEles = xElementOriginal.DescendantsAndSelf("element");
• And using this object get all the remaining nodes into a big list of strings, thus:
        var bigList = (from ele in allEles
                       select ele.Element("data").Value + Delimiter + ele.Element("attr").Element("class").Value
                       into itemToAdd
                       let badFlag = exclusionsList.Any(badEle => badEle == itemToAdd)
                       where !badFlag
                       select itemToAdd).ToList();
Implementation (8 of 12)
• Let’s now get rid of any duplicates, thus:
        var distinctBigList = bigList.Distinct().ToList();
        btList = btList.Distinct().ToList();
        ntList = ntList.Distinct().ToList();
        rtList = rtList.Distinct().ToList();
• Pretty simple. We’re now ready to populate our lists that we’ll be returning…
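The filter-then-deduplicate steps of the last two slides can also be expressed with a set lookup. The following is an alternative sketch, not the code used in the method above. The BigListDemo class and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigListDemo
{
    // Keeps every key that is not excluded, with duplicates removed.
    public static List<string> RemainingKeys(IEnumerable<string> allKeys,
                                             IEnumerable<string> exclusions)
    {
        // A HashSet gives constant-time membership checks, instead of
        // scanning the exclusions list once per candidate.
        var excluded = new HashSet<string>(exclusions);

        return allKeys.Where(key => !excluded.Contains(key))
                      .Distinct()
                      .ToList();
    }
}
```

For a thesaurus of a few thousand terms either form should be adequate; the set-based version mainly helps if the exclusions list grows large.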
Implementation (9 of 12)
• We will add the items to our lists, for ‘broader’, ‘narrower’ and ‘related’ terms, thus:
        // ‘broader’ terms.
        foreach (var item in distinctBigList.Where(item => !btList.Contains(item)))
        {
            btList.Add(item);
        }
        // ‘narrower’ terms.
        foreach (var item in distinctBigList.Where(item => !ntList.Contains(item)))
        {
            ntList.Add(item);
        }
        // ‘related’ terms.
        foreach (var item in distinctBigList.Where(item => !rtList.Contains(item)))
        {
            rtList.Add(item);
        }
Implementation (10 of 12)
• Nearly there. We must now remove the item that was clicked on by the user in the application (we won’t be needing this), thus:
        // ‘broader’ terms, there’ll only be one so once we’ve removed it, exit the loop.
        foreach (var bt in btList.Where(bt => bt.Contains(cid)))
        {
            btList.Remove(bt);
            break;
        }
        // ‘narrower’ terms, there’ll only be one so once we’ve removed it, exit the loop.
        foreach (var nt in ntList.Where(nt => nt.Contains(cid)))
        {
            ntList.Remove(nt);
            break;
        }
        // ‘related’ terms, there’ll only be one so once we’ve removed it, exit the loop.
        foreach (var rt in rtList.Where(rt => rt.Contains(cid)))
        {
            rtList.Remove(rt);
            break;
        }
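The loops above are safe only because they break immediately after the removal; continuing to enumerate a List<T> after changing it throws an InvalidOperationException. One alternative sketch that avoids modifying the list during enumeration is shown below. The RemoveSelfDemo class and its method name are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class RemoveSelfDemo
{
    // Removes the first item that contains the identifier, if there is one.
    // Returns true when an item was removed.
    public static bool RemoveFirstContaining(List<string> items, string cid)
    {
        var index = items.FindIndex(item => item.Contains(cid));
        if (index < 0) { return false; }

        items.RemoveAt(index);
        return true;
    }
}
```

The same helper could be called once for each of the three lists.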
Implementation (11 of 12)
• Finally, populate a container object with our nice new lists and return it.
        var containerList = new List<List<string>>();
        containerList.Add(btList);
        containerList.Add(ntList);
        containerList.Add(rtList);
        return containerList;
    }
    // End of the method, yay!
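The method returns the lists in the order ‘broader’, ‘narrower’, ‘related’, and each item has the form "LABEL|class-ID". How the calling application consumes these is not shown in the slides. The sketch below assumes a caller wants the label and identifier separately, for example as the text and value of a dropdown entry. The TermItemDemo class is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TermItemDemo
{
    // Splits "LABEL|class-ID" at the first delimiter into (label, identifier).
    // An item with no delimiter is returned with an empty identifier.
    public static KeyValuePair<string, string> Split(string item)
    {
        var parts = item.Split(new[] { '|' }, 2);
        var id = parts.Length > 1 ? parts[1] : string.Empty;
        return new KeyValuePair<string, string>(parts[0], id);
    }
}
```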
Implementation (12 of 12)
• And of course, to make all of the code in the previous slides actually work, we’re going to need to include the following .NET namespaces (as using directives) in the file our method sits in:
• System
• System.Collections.Generic
• System.Globalization
• System.IO *
• System.Linq
• System.Text *
• System.Xml
• System.Xml.Linq
* Used for the WriteOutList method, used for testing (next slide).
Testing (1 of 2)
• It was found to be quite useful to employ a simple private method that could be called to write the content of a list to a text file, for checking and comparison purposes.
    private static void WriteOutList(string newFileLoc, IEnumerable<string> dataList, string type = null, string cid = null)
    {
        // The using block flushes and closes the writer, so no explicit Close() is needed.
        using (var writeText = new StreamWriter(newFileLoc))
        {
            var sb = new StringBuilder();
            if (!string.IsNullOrEmpty(type))
            {
                sb.Append("Valid ");
                sb.Append(type);
                sb.Append("'s for class id:'");
                sb.Append(cid);
                sb.Append("'.");
                writeText.WriteLine(sb.ToString());
                writeText.WriteLine("----------------------------------------------------------------------");
            }
            foreach (var line in dataList) { writeText.WriteLine(line); }
        }
    }
Testing (2 of 2)
• Why the optional parameters (‘type’ and ‘cid’) then?
• For ‘broader’, ‘narrower’ and ‘related’ terms, which are associated with an identifier, simply call the method thus:
    WriteOutList([file path], [data as a list of strings], ["BT", "NT" or "RT"], [identifier]);
• For data in the ‘big list’, simply call the method thus:
    WriteOutList([file path], [data as a list of strings]);
• This saves on the need for overloading, so one size fits all.
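The optional-parameter idea can be exercised without touching the file system by writing to any TextWriter. The following is a sketch under that assumption, not the helper used at the Archive. The ListWriterDemo class and its WriteList and Render methods are hypothetical, and the separator line is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ListWriterDemo
{
    // Writes an optional header (only when 'type' is supplied), then one item per line.
    public static void WriteList(TextWriter writer, IEnumerable<string> dataList,
                                 string type = null, string cid = null)
    {
        if (!string.IsNullOrEmpty(type))
        {
            writer.WriteLine("Valid " + type + "'s for class id:'" + cid + "'.");
        }
        foreach (var line in dataList) { writer.WriteLine(line); }
    }

    // Convenience wrapper that captures the output in a string.
    public static string Render(IEnumerable<string> dataList,
                                string type = null, string cid = null)
    {
        using (var writer = new StringWriter())
        {
            WriteList(writer, dataList, type, cid);
            return writer.ToString();
        }
    }
}
```

Calling Render(list) produces just the items, while Render(list, "BT", cid) adds the header line first, mirroring the two call forms above.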
Lesson Summary
We have discussed the method used to navigate a polyhierarchical tree structure, in XML, using C# and LINQ to XML, in order to provide the calling application with the data it requires, given an XML object to navigate in and a valid identifier within it.
Questions?
Find Out More
Data Archive – data-archive.ac.uk
UK Data Service – ukdataservice.ac.uk
Contact Information

Jonathan Sexton
UK Data Archive
Wivenhoe Park, University of Essex
Colchester CO4 3SQ
E-mail: [email protected]