Tax Agents Profiling by their Clients Tax Return

PAPER TO BE PRESENTED TO TERADATA CONFERENCE
TAX AGENT PROFILING
Distance from Centroid and Decision Tree Techniques
Fuchun Luan1, Warwick Graco1, and Mark Norrie2
1 Australian Taxation Office; 2 NCR, Canberra, Australia
Introduction
In this paper we report some preliminary results from modelling tax agent behaviour
using a distance from centroid (DFC) method and decision tree (DT) models. Tax
agents are of interest because they act as intermediaries in the Federal Taxation
System of Australia. They are responsible for assisting clients to submit tax returns
for personal tax and to prepare business activity statements for business taxes,
including the goods and services tax (GST).
Tax agents, by virtue of their position, can exert considerable influence on client
behaviour and on whether clients comply with taxation regulations. Unfortunately, some
agents abuse their positions to defraud the tax system. One way they do this is by
inflating the business deductions of their clients. The Australian Taxation Office
(ATO) is responsible for identifying high-risk tax agents who are engaging in
unacceptable practice. The methods used in the research were aimed at identifying
high-risk agents.
Subjects:
The subjects were selected as follows:
• 14,913 agents were selected for income year 2002. These were active agents
who practised throughout the year.
• 49 known cases of high-risk agents were nominated by ATO field staff and were
used as the high-risk group in the research.
Data:
The data used was either in-situ or was extracted from the ATO data warehouse
(DWH) relational database tables for income year 2002. The research mainly
focused on examining the characteristics of tax agents via their aggregated clients’ tax
return data. Data on 256 features or variables was used in the research. The features
included descriptive and summary statistics of tax agent practice such as total
number of clients serviced and average deductions claimed for rental property.
Feature Extraction
The 256 features were far too many for the score modelling that was carried
out. It is very difficult to develop effective models when the data has a high number of
dimensions or features. So steps were taken to identify features which discriminated
the high risk tax agent group from other agents in the remaining population. A
comparison was made between the mean values of the features for the high risk
group with those of the remaining agents. It was found that up to 16 features
distinguished between the two groups (see Figures 1 and 2). These discriminating
features cannot be listed for confidentiality reasons. However, they covered issues
_________________________________________________________
81910352
28/07/2017
such as high-risk tax agents inflating claims for work-related expenses and deductions
for rental properties compared to other agents.
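The mean comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used at the ATO: the feature names, the toy values and the `min_ratio` screening threshold are all hypothetical, and the paper does not say what rule was used to decide that a difference in means was large enough.

```python
from statistics import mean

def discriminating_features(high_risk, others, min_ratio=2.0):
    """Return features whose high-risk mean differs from the mean of the
    remaining agents by at least a factor of `min_ratio` in either direction.

    `high_risk` and `others` are lists of dicts mapping feature name -> value.
    The ratio test is an assumed screening rule, chosen for illustration.
    """
    selected = []
    for name in high_risk[0]:
        hr_mean = mean(row[name] for row in high_risk)
        other_mean = mean(row[name] for row in others)
        if hr_mean == 0 or other_mean == 0:
            continue  # ratio undefined; skip rather than guess
        ratio = abs(hr_mean / other_mean)
        if ratio >= min_ratio or ratio <= 1.0 / min_ratio:
            selected.append((name, hr_mean, other_mean))
    return selected

# Toy data with hypothetical feature names (not the confidential ATO features)
high_risk = [{"avg_wre": 3000.0, "n_clients": 400.0},
             {"avg_wre": 3400.0, "n_clients": 380.0}]
others = [{"avg_wre": 1200.0, "n_clients": 410.0},
          {"avg_wre": 1000.0, "n_clients": 390.0}]

picked = discriminating_features(high_risk, others)
print([name for name, _, _ in picked])
```

In this toy data only `avg_wre` passes the screen, because its high-risk mean is nearly three times the mean of the other agents, while `n_clients` is almost identical in both groups.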
Profiling and Modelling
There were two score modelling techniques used to identify high risk agents. They
were the DFC method and DT models. They were selected because the research had
to be done using methods that were readily available, that suited the data and that
were able to give early results and trends. DFC rank ordered all tax agents based on
the distance their profiles were from the centroid of the profiles of the group of high
risk agents (see Figure 3). The discriminatory features used to determine the distance
scores were weighted based on the degree they maximised the pick-up rate of the
high-risk agents in the 500 highest ranked profiles. This was to ensure that the top
group of high risk agents were clearly seen in the data because they were the group of
most interest to the ATO.
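The pick-up rate that guided the weighting can be expressed as a small function: given a ranked list of agents, count how many known high-risk agents fall within the top N. The sketch below is an assumption about how such a measure might be computed; the function name, the agent identifiers and the toy ranking are hypothetical.

```python
def pickup_rate(ranked_ids, known_high_risk, top_n=500):
    """Return (hits, rate): how many known high-risk agents appear in the
    first `top_n` entries of `ranked_ids`, and that count as a fraction of
    all known high-risk agents."""
    top = set(ranked_ids[:top_n])
    hits = sum(1 for agent in known_high_risk if agent in top)
    return hits, hits / len(known_high_risk)

# Toy ranking with hypothetical agent identifiers, highest risk first
ranked_ids = ["a7", "a2", "a9", "a4", "a1", "a5"]
known_high_risk = ["a2", "a4", "a5"]

hits, rate = pickup_rate(ranked_ids, known_high_risk, top_n=4)
print(hits, round(rate, 3))
```

A weight search could evaluate this function for each candidate set of weights and keep the set with the highest rate, although the paper does not describe the search that was actually used.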
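The scoring formulae used in the DFC calculations included:
Score model 1: (close neighbours of high-risk profile, see Figure 1)
$$S_j = \sum_{i=1}^{n} \left\{ W_i \, \frac{F_{ij} - \bar{F}_i}{\bar{F}_i} \right\}^{2}$$
Score model 2: (beyond high-risk profile, see Figure 1)
$$S_j = \sum_{i=1}^{n} \left\{ F_i W_i \, \frac{F_{ij} - \bar{F}_i}{\bar{F}_i} \right\}^{1}$$
where $i$ is the i-th selected feature (column), $j$ is the tax agent (row), $F_{ij}$ is the
value of feature $i$ for agent $j$, $\bar{F}_i$ is the mean of feature $i$ for the high-risk
group, $W_i$ is the weight and $F_i$ is the sign flag.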
In Score Model 1 the closer the tax agent profiles were to the mean profile of the high
risk group for the weighted discriminatory features, the lower their scores. The lower
the score, the higher the risk the tax agent was practising in a manner that was
unacceptable. All 14,913 profiles were scored and ranked in this manner.
Score Model 2 was aimed at identifying agents who had profiles which were worse
than the centroid of the high risk group. The higher the score the higher the risk the
tax agent was abusing the tax system.
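The two score models can be sketched in Python as follows. This is a minimal illustration under assumptions, not the production code: it reads each score as a sum over features of the weighted relative deviation from the high-risk mean (squared in Score Model 1, signed in Score Model 2), it assumes the sign flag takes the values +1 or -1, and it assumes no centroid value is zero. The centroid, weights and agent profiles are toy values.

```python
def score_model_1(profile, centroid, weights):
    """Sum of squared weighted relative deviations from the centroid.
    Lower scores mean the profile is closer to the high-risk mean profile."""
    return sum(
        (w * (x - c) / c) ** 2
        for x, c, w in zip(profile, centroid, weights)
    )

def score_model_2(profile, centroid, weights, signs):
    """Signed weighted relative deviation from the centroid.
    Higher scores mean the profile lies beyond the centroid in the
    direction given by each sign flag (assumed to be +1 or -1)."""
    return sum(
        s * w * (x - c) / c
        for x, c, w, s in zip(profile, centroid, weights, signs)
    )

# Toy values: two features, three hypothetical agents
centroid = [2000.0, 500.0]   # mean profile of the high-risk group
weights = [1.0, 0.5]
signs = [1, 1]
agents = {"A": [2000.0, 500.0], "B": [1000.0, 250.0], "C": [3000.0, 750.0]}

# Model 1 ranks ascending (closest to the centroid first)
rank_1 = sorted(agents, key=lambda a: score_model_1(agents[a], centroid, weights))
# Model 2 ranks descending (furthest beyond the centroid first)
rank_2 = sorted(
    agents,
    key=lambda a: score_model_2(agents[a], centroid, weights, signs),
    reverse=True,
)
print(rank_1, rank_2)
```

In this toy example agents B and C receive the same Score Model 1 value because squaring discards the direction of the deviation. Score Model 2 separates them and places C, which exceeds the centroid on both features, at the top.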
The DT models selected features based on their ability to discriminate between
high-risk agents and the remaining cases. The most discriminatory feature was at the
top and the least discriminatory feature at the bottom of each tree. The decision tree
models were applied to the feature scores of all 14,913 agents to identify those who
were high risks. The Teradata Warehouse Miner Decision Tree algorithm was used
for this purpose. An example of a decision tree is given below.
The number of high-risk cases was very small (49) compared to the much larger
group of unknown agents (14,864). The training set was therefore very unbalanced or
asymmetrical. This imbalance would bias the decision trees towards identifying
low-risk cases, whereas we wanted the trees to pick up high-risk cases. To overcome
this problem, three sets of 50 cases each were randomly drawn from the remaining
population. Three sets were used to ensure that, with such small numbers, we did not
have serious problems with sampling bias in the selection of cases. Each of the three
sets was combined with the 49 known high-risk cases to form three training sets,
which in turn enabled three decision tree models to be built. The entire data set of
14,913 cases was then scored by these three models. The three scored data sets were
then compared to see which cases all three models identified as high risk.
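The sampling and consensus scheme described above can be sketched as follows. The research used the Teradata Warehouse Miner Decision Tree algorithm. To keep the example self-contained, this sketch swaps in a one-split decision stump, which is a much simpler stand-in and not the method used in the research. The function names, the seed and the synthetic data are all assumptions made for illustration.

```python
import random

def fit_stump(rows, labels):
    """Fit a one-split decision stump (a depth-1 tree) by minimising
    training errors. Stand-in for a full decision tree learner."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for high_side in (True, False):
                errors = sum(
                    ((r[f] >= threshold) == high_side) != bool(y)
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_side)
    _, f, threshold, high_side = best
    return lambda r: (r[f] >= threshold) == high_side

def consensus_high_risk(known, remaining, n_models=3, sample_size=50, seed=0):
    """Train one model per balanced training set and return the indices
    (into known + remaining) that every model flags as high risk."""
    rng = random.Random(seed)
    everyone = known + remaining
    flagged_sets = []
    for _ in range(n_models):
        sample = rng.sample(remaining, sample_size)  # presumed low-risk cases
        labels = [1] * len(known) + [0] * len(sample)
        model = fit_stump(known + sample, labels)
        flagged_sets.append({i for i, r in enumerate(everyone) if model(r)})
    return set.intersection(*flagged_sets)

# Synthetic data: 49 known high-risk profiles, 490 ordinary profiles and
# 10 unlabelled profiles that resemble the high-risk group.
known = [[3000.0 + i, float(i % 2)] for i in range(49)]
remaining = ([[1000.0 + i, float(i % 2)] for i in range(490)]
             + [[3500.0 + i, float(i % 2)] for i in range(10)])

consensus = consensus_high_risk(known, remaining)
print(len(consensus))
```

On this synthetic data each stump splits on the first feature, so the consensus contains the 49 known cases plus the 10 unlabelled profiles that resemble them. Real data would not separate this cleanly, which is why the agreement between several models is useful as a check on sampling bias.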
Results:
The trends here were:
• The top 500 agents selected using the DFC method included 40 of the 49 high-risk
agents. This gave an 82% pick-up rate.
• 202 tax agents were identified by the three DT models and the DFC method.
Discussion and recommendations
The results of these detection activities were sent to ATO field staff for review and
possible regulatory action. We are still waiting for feedback on whether this targeting
exercise hit the right targets in terms of true positives. Other notable trends included:
• Only a small number of features (in our case 16) out of a possible 256 were
found to discriminate between the 49 high-risk tax agents and the remaining
population of 14,864 tax agents.
• The discriminatory feature scores of the 49 high-risk tax agents formed a tight
cluster, that is, one with relatively low variance (see Figure 3).
• The mean values of the discriminatory features for the high-risk cluster differed
from those of the general population of tax agents by more than a factor of two.
Another issue that warrants comment is that only two scoring methods were
used in this research. Others can be applied to the data, including the
nearest neighbour (NN) method and ensemble approaches.
• The NN method is similar to the DFC method except that it assigns high-risk
classifications to a nominated number of nearest neighbours of each high-risk
case. For example, the three closest neighbours of each of the 49 high-risk tax
agents could have been found using the NN method.
• Ensemble approaches use multiple models to generate classifications.
• One approach is to generate a large number of randomly selected small trees
and then use a voting method to resolve the classification of each tax agent
profile. The majority classification is given to the profile. This approach is
advocated by Professor Leo Breiman from the University of California, Berkeley.
• Another approach is to fit a small decision tree to the classified profiles and
then fit a second tree to those cases not correctly classified by the first tree.
This process is repeated until all cases in the high-risk group are correctly
classified. This approach is called ‘multiple additive regression trees’ by
Professor Jerome Friedman from Stanford University.
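The NN idea in the first bullet can be sketched as follows. This is a minimal illustration only: the choice of unscaled Euclidean distance, the function name and the toy profiles are assumptions, and in practice the features would probably need scaling or weighting before distances are compared.

```python
import math

def nearest_neighbour_flags(high_risk, candidates, k=3):
    """Return the indices of candidates that are among the k nearest
    (Euclidean) neighbours of at least one high-risk profile."""
    flagged = set()
    for hr in high_risk:
        by_distance = sorted(
            range(len(candidates)),
            key=lambda i: math.dist(hr, candidates[i]),
        )
        flagged.update(by_distance[:k])
    return flagged

# Toy profiles with two hypothetical features
high_risk = [[3000.0, 900.0]]
candidates = [
    [2900.0, 880.0],
    [1000.0, 200.0],
    [3100.0, 950.0],
    [3050.0, 910.0],
    [500.0, 100.0],
]

flagged = nearest_neighbour_flags(high_risk, candidates, k=3)
print(sorted(flagged))
```

With 49 high-risk cases and k = 3, this approach would flag at most 147 candidates, and fewer where the same candidate is a near neighbour of several high-risk cases.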
Future research will examine these alternative approaches. The number of known
high-risk cases also needs to increase so that a more diverse group of high-risk
profiles is used in the research. This will give better results when it comes to
identifying high-risk agents.
[Chart: Tax Agent Profile Benchmarks (IY2002). Series: Pop_AVG and High-Risk_AVG. Axis label: Features. Value scale: -$2,000 to $3,500.]
Figure 1: Features were identified as highly distinguishable between the High-Risk
agent group and the general population. The feature weights were optimised.
[Chart: Tax Agent Profile Benchmarks with Distinguishing Features (IY2002). Series: Pop_AVG, High-Risk_AVG, xxxxxxxx-1 and xxxxxxxx-2000. Axis label: Features. Value scale: -$2,000 to $3,500.]
Figure 2: Features were identified as highly distinguishable between the High-Risk
group and the general population.
[Plot: Among the top 500, known high-risk (G2), high-risk by DT (G1). Axes: AVG_Mtr_Ded and AVG_Oth_Ded, each scaled from 0 to 9000. Legend: group 1, 2, 3.]
Figure 3: The square represents the locations of the 49 high-risk agents. The size of the
entire population is 14,913.