Reengineering French structural business statistics - an
extended use of administrative data
Sébastien Chami
INSEE, business statistics directorate
18 Bd Adolphe Pinard
75675 PARIS CEDEX 14, FRANCE
[email protected]
1 Introduction
INSEE has upgraded its sub-process for handling annual corporate tax forms in order to collect less data through questionnaire surveys.
This has required an extensive redesign of the production process in order to shorten the sub-process dealing with administrative data, to produce intermediate results and to take advantage of the direct follow-up of businesses by statisticians on corporate tax data. Some changes were also made to the previous organisation to ensure consistency with the overall production process for structural statistics.
The first part of this paper presents a general description of the administrative tax data. The second
part delineates how the statistical register is taken into account in the tax data editing process and
the third part details the main principles of this new tax data editing process itself.
2 The administrative tax data
2.1 A data structure related to the management by the administration
As with any administrative source, the way the administration manages tax data imposes many constraints on our statistics production process.
2.1.1 Schedule
First, the delivery schedule is set by the administration. Firms have four months after the end of their fiscal year to file their tax return. The tax administration then gathers and captures the data before sending them to us. Tax forms sent by mail are processed entirely by optical character recognition. This work takes about six months.
However, the development of filing via the Internet, a mandatory procedure for the biggest companies, has reduced the processing time for these companies to about a month and a half. Thus, the administration is able to send tax data in several deliveries: the first one in mid-June N+1, then a second one in September, a third one at the end of October and a final one in late January N+2.
The first delivery in June, although limited in number of firms (about one fifth of the scope), already represents about three quarters of the turnover, because larger companies are required to file via the Internet and are therefore part of that delivery. Thus, the data available at this date are sufficient to establish relevant early results in the summer.
The September delivery is of limited importance, but improves our coverage to establish the early
SBS results. The late October delivery is the most complete in terms of units. It gives us an almost
complete coverage of the tax scope for our annual statistics production process.
The final delivery, in late January N+2, consists of data that the tax administration processed late for various reasons, and also includes very early returns for the following year. Delayed data have a relatively small weight, but arrive too late for the production of our annual statistics for year N. Nevertheless, we integrate them into our information system to consolidate year N when producing the statistics for year N+1. The early returns are also integrated for the next year.
Table 1: schedule of tax data deliveries by the administration

Delivery           Delivery date      Nb of units    %      Turnover       %
                                      (thousands)           (billions €)
Complementary N-1  late Jan. N+1              230    10%           260     7%
Advanced           mid June N+1               420    19%         2 550    72%
Intermediate       mid Sept. N+1              110     5%           350    10%
Definitive         late October N+1         1 460    65%           360    10%
Complementary      late Jan. N+2               20     2%            30     1%
Total                                       2 240   100%         3 550   100%
2.1.2 A decentralized administrative organization
The administration organization has very direct consequences on data transmitted to us. There are
two distinct processes depending on whether the tax form is completed on paper and filed by mail
or is uploaded via the Internet.
Uploaded returns are managed in a single data center. However, the collection of paper returns is
very decentralized. The gathering is done in each of the 100 French departments. Once the returns
of the department are consolidated, departmental centers forward them to four national data centers
(including the one which already manages uploaded returns) that will produce files that are
provided to us.
Each delivery therefore consists of four files, one per center. Sometimes, although more rarely in recent years, a center is late sending its file and we do not receive it at the same time as the other three. Our old production process had not foreseen this eventuality and remained blocked until the files of all four centers had been received. To avoid this drawback, we designed the data integration in our production system to be independent for each source. Thus the entire system downstream of this integration can operate in the absence of one or more expected files, although, of course, the results produced are then less robust.
2.1.3 Stock files instead of flow files
The files sent to us by the tax authorities are stock files: the September delivery includes the units already delivered in June, and the same holds for the October delivery. Only the January delivery is an exception.
But we are only interested in the new units (the flow) of each delivery. The units already sent, as we shall see later, may have been analyzed and modified by clerks, and we do not want to lose this work. Therefore, our production process has a preliminary step that removes the data already transmitted and keeps only the new data of each delivery.
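The preliminary step can be pictured with a minimal Python sketch. It assumes that each delivery is a mapping from a unit identifier to its return; the function name `extract_flow` and the data layout are hypothetical and are not taken from the actual production system.

```python
def extract_flow(delivery, already_received):
    """Keep only the units of a stock delivery that were not transmitted before.

    delivery: dict mapping a unit identifier to its tax return (hypothetical layout).
    already_received: set of identifiers seen in earlier deliveries; updated in place.
    """
    flow = {uid: ret for uid, ret in delivery.items() if uid not in already_received}
    already_received.update(flow)
    return flow


seen = set()
june = {"A": {"turnover": 100}, "B": {"turnover": 50}}
# The September stock file repeats A and B and adds the new unit C.
september = {"A": {"turnover": 999}, "B": {"turnover": 50}, "C": {"turnover": 20}}

june_flow = extract_flow(june, seen)
september_flow = extract_flow(september, seen)
print(sorted(september_flow))  # ['C']
```

Unit A is ignored in September even though its amounts differ, so any correction made by a clerk on the June version is preserved.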
2.2 Accounting data
2.2.1 The diversity of types of profits subject to taxation
The legal rules governing the taxation of companies are complex.
The filing requirement and the type of return to be filed depend on the activity, the size and the legal status of the company.
The activity of the company determines the type of profit that is taxable:
- agricultural profits: for the agricultural sector. This sector is covered by specific statistical procedures that do not use tax data; it is beyond the scope of Resane, even though the administration gives us the files for this type of profit.
- non-commercial profits: for the professions (doctors, lawyers)
- industrial and business profits: for other companies
The size of the firm, measured by turnover, determines the tax system:
- the normal system: for larger companies
- the simplified system: for small businesses
- the extremely simplified system: for very small businesses (micro-businesses)
Table 2: types of profits and tax systems

Profit                   System      Nb of units    %      Turnover       %
                                     (thousands)           (billions €)
Industrial and business  Normal              660    26%         3 330    94%
Industrial and business  Simplified        1 090    43%           150     4%
Industrial and business  Micro               210     8%             3    ~0%
Non commercial           Normal              490    19%            60     2%
Non commercial           Micro               120     5%             4    ~0%
Total                                      2 570   100%         3 550   100%
It should be noted that the micro-system requires only a declaration of annual turnover. The administration therefore provides no tax data for these units, and we only have an aggregate turnover for these 330 000 units.
2.2.2 A highly prescriptive accounting
The information that companies must report to the administration to determine their profit taxes is taken from their accounts. The framework in which such accounts must be kept is defined for all businesses: it is the French general accounting standard. This standard is extremely prescriptive. It precisely defines the nomenclature of accounts and the detailed items to which the various accounting transactions must be posted.
It is also very prescriptive about the accounting methods that can be used to assess the various economic characteristics that affect the activity of the company (depreciation or provisioning methods, for example).
The consequence of this prescriptive accounting is consistent information. With few exceptions, a given accounting concept represents a single characteristic, and there is one and only one accounting method to determine its value. Thanks to this homogeneity, the accounting information can be aggregated into statistics that are consistent and comparable over time or between classes of businesses.
2.2.3 A detailed set of data
As mentioned above, the accounting is very detailed. The tax forms differ by system and type of profit, but all rely on this accounting at a more or less detailed level: the larger the company, the more detailed the information requested. The tax forms for normal industrial and business profits contain nearly 600 accounting characteristics.
With all kinds of profits and systems included, nearly 1,000 different characteristics are available; some are common and others are specific to a profit or a system.
Many characteristics primarily have a fiscal purpose and are of no interest for our economic statistics. We therefore restricted our needs to a selection of 250 target characteristics.
These target characteristics are present in most tax forms whatever the profit or system, either directly or by simple summation of more detailed characteristics. Only non-commercial profits, for which the form is rather short, are an exception, and we had to implement an estimation procedure to complete them.
2.2.4 Data are strongly constrained by mathematical relationships
Accounting data are presented with a large redundancy of information. The same amount may be present in several accounts, for example.
Most of the 250 target characteristics are connected by nearly a hundred arithmetic relations (such as X = X1 + X2 + ... + Xn or Z = X - Y). Taken together, these relations provide a concise and complete overview of the quality of the tax form.
If all the relations are satisfied, the probability that there is an error in the return is very low. Conversely, if one relation is not satisfied, the difference between its two sides gives a measure of the error affecting the return.
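As an illustration, the check of a form against its arithmetic relations could be sketched as follows in Python. The representation of a relation as a total plus a list of signed components, and the choice of the largest absolute gap as the overall error of the form, are assumptions made for this sketch; the text does not specify how gaps are combined.

```python
def relation_gap(form, total, components):
    """Return total - sum(w * component) for one arithmetic relation.

    components: list of (weight, name) pairs with weight in {-1, 1}.
    """
    expected = sum(w * form[name] for w, name in components)
    return form[total] - expected


def form_error(form, relations):
    """Largest absolute gap over all relations (assumed summary); 0 means consistent."""
    return max(abs(relation_gap(form, total, comps)) for total, comps in relations)


relations = [
    ("X", [(1, "X1"), (1, "X2")]),  # X = X1 + X2
    ("Z", [(1, "X"), (-1, "Y")]),   # Z = X - Y
]
form = {"X1": 40, "X2": 60, "X": 100, "Y": 30, "Z": 75}
print(form_error(form, relations))  # 5
```

Here the first relation holds and the second one is off by 5, which is the measure of the error attached to this hypothetical form.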
Overall, data quality, as measured by any inconsistencies on all relations that the return must
respect, is very satisfactory for uploaded data, but is lower for forms sent by mail and captured by
optical recognition.
Table 3: error rates by mode of data acquisition

Mode      Error                           Nb of units    %      Turnover       %
                                          (thousands)           (billions €)
Internet  No error                              1 410    90%         2 890    92%
Internet  Lower than 15 k€                        130     8%           160     5%
Internet  Equal to or greater than 15 k€           30     2%            90     3%
Internet  Total                                 1 570   100%         3 140   100%
Optical   No error                                500    75%           330    80%
Optical   Lower than 15 k€                         90    14%            60    15%
Optical   Equal to or greater than 15 k€           70    11%            20     5%
Optical   Total                                   670   100%           410   100%
Total     No error                              1 910    86%         3 210    91%
Total     Lower than 15 k€                        220    10%           220     6%
Total     Equal to or greater than 15 k€          110     5%           120     3%
Total     Total                                 2 240   100%         3 550   100%
Only a few characteristics, fewer than twenty in total, are not involved in any relation and cannot be controlled this way. We use likelihood controls instead (for example, relative to their value in the previous year and to quantiles of the evolution in their sector) to judge their plausibility.
2.2.5 Flexible rules for accounting periods
The many constraints described above ensure the consistency and quality of the data, but one aspect leaves the company a fairly wide latitude of choice and has important implications for the development of our statistics: the accounting period on which the tax return is based.
There are two main constraints governing the choice of this period:
- The company must close at least one accounting period during each calendar year.
- There must be continuity between consecutive accounting periods: neither overlap nor gap.
Companies can choose a period that is not 12 months long: for instance, January 1st to September 30th for one period, then October 1st of the same year to December 31st of the next year for the following period. Or they can choose a period that does not coincide with the calendar year: from April 1st to the next March 31st every year.
But the statistics we produce are based on the calendar year. Thus, a restatement has been implemented to move from the accounting period to the calendar year: for a given calendar year of study, we select the accounting period that has the most months in common with it.
Examples:
- a period from April 1st 2008 to March 31st 2009 is assigned to the calendar year 2008 (9 months out of 12 are in 2008)
- a period from October 1st 2008 to September 30th 2009 is assigned to the calendar year 2009 (9 months out of 12 are in 2009)
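The two examples above can be reproduced with a short Python sketch. It assumes that periods run over whole months, and it resolves ties (for instance a period from July to June) in favour of the later year; that tie rule is an arbitrary choice for this sketch, since the text does not state one.

```python
from collections import Counter
from datetime import date


def assign_calendar_year(start, end):
    """Assign an accounting period to the calendar year holding most of its months.

    Assumes the period covers whole months. Ties go to the later year
    (an assumption of this sketch, not a documented rule).
    """
    months = Counter()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months[year] += 1
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return max(months, key=lambda y: (months[y], y))


print(assign_calendar_year(date(2008, 4, 1), date(2009, 3, 31)))   # 2008
print(assign_calendar_year(date(2008, 10, 1), date(2009, 9, 30)))  # 2009
```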
We therefore assume that the accounting periods running ahead of the calendar year offset those running behind it. Studies have shown that attempting to correct these shifts creates more errors than it solves, and that simply assigning each period to one calendar year is an easier and more efficient rule.
We then estimate on a 12-month basis the forms of surviving companies whose periods differ from 12 months. Only births and deaths are kept on their original period.
To give an idea of the magnitude of this phenomenon: about 65% of tax returns have an accounting period that coincides with the calendar year, 30% are 12 months long but shifted from the calendar year, and 5% differ from 12 months (including 3% of births and deaths).
2.2.6 Multiple tax returns
The constraints on the duration of the accounting period described above ensure there is at least one
form each year. But they do not prevent the opposite risk : having several returns for one year. And
this occurs fairly frequently.
The main reasons that companies send multiple tax forms are as follows:
- an important event, such as a merger, led the company to close an accounting period in the middle of the calendar year. The company then sends two tax forms on consecutive periods (January 1st to June 23rd and June 24th to December 31st, for instance). Our editing process treats these cases by "pasting" the two forms: we retain the opening amounts of the 1st period, the closing amounts of the 2nd period, and the sum over the two periods of all amounts that are flows.
- the company made an error in its 1st return and sends a 2nd one to correct it. The administration uses a specific code that enables us to detect these cases. Our process then deletes the 1st return and keeps the 2nd one.
- in some cases companies may declare part of their profit as business profit and another part as non-commercial profit. Our process then consolidates the two returns by summing the amounts of each characteristic.
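The "pasting" of two consecutive forms described in the first case can be sketched as follows. The sketch assumes that both returns carry the same characteristics and that the opening and closing characteristics are known by name; all names are hypothetical.

```python
def paste_returns(first, second, opening, closing):
    """Merge two returns on consecutive periods into a single return.

    opening / closing: sets of characteristic names holding start-of-period and
    end-of-period amounts; every other characteristic is treated as a flow.
    """
    merged = {}
    for name in first:
        if name in opening:
            merged[name] = first[name]                 # opening amount of the 1st period
        elif name in closing:
            merged[name] = second[name]                # closing amount of the 2nd period
        else:
            merged[name] = first[name] + second[name]  # flows are summed
    return merged


first = {"opening_stock": 10, "closing_stock": 14, "turnover": 200}
second = {"opening_stock": 14, "closing_stock": 9, "turnover": 300}
merged = paste_returns(first, second, opening={"opening_stock"}, closing={"closing_stock"})
print(merged)  # {'opening_stock': 10, 'closing_stock': 9, 'turnover': 500}
```

The consolidation of a business return with a non-commercial one (third case) corresponds to the last branch alone: every characteristic is summed.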
3 Linkage with the statistical register (Ocsane)
3.1 The identification of returns
The list of units within the economic scope is given by the statistical register, Ocsane. These units
are identified by the id-number, called Siren number, used in the French inter-administrative
register (Sirene). But the tax administration has its own register with its own identifying number,
called IFRP.
The objective of identification is to match the returns identified by IFRP with the units of our statistical register identified by the Siren number. Fortunately, the tax administration integrated the Siren number into its register several years ago as a simple attribute and has made strong efforts to improve its accuracy ever since. Nowadays, our automatic process is able to match 98% of the returns with units of the register.
We split the remaining 2% according to the amounts contained in the form: returns with high amounts are manually reviewed by clerks and returns with low amounts are discarded. Indeed, the tax administration sends us elements of name and address, which often enable us to find a match for unidentified returns. Thus, almost 1,000 returns are manually matched after the automatic pairing failed.
3.2 The specific cases
Some cases have forced us to design a specific production process in parallel with the tax data processing.
3.2.1 Profiled Companies
Administrative tax data refer to legal units, but some large groups have a legal organization that
makes the accounting data of their legal units irrelevant for our statistics. Bilateral agreements have
been concluded between INSEE and these major groups to provide us with accounting data on
consolidated outlines that are relevant for statistics.
The resulting data are loaded by a specific process, and the tax returns of the legal units within these outlines are dropped to avoid double counting. Although they are very few (4 groups in 2009), their weight in the economy is significant: about 4% of the value added of France.
3.2.2 Off tax companies
Part of the business economic scope is not covered by tax data:
- cooperatives: the tax on profits is paid by the members, so cooperatives are tax-exempt and do not file tax forms. Their accounts are collected via an additional questionnaire in the ESA survey.
- semi-public companies: these units are on the border between the public and business sectors. Examples include the water authorities of municipalities and the National Office of Forests. They pay no taxes but are nonetheless part of the business scope. Their accounts are collected by the administration, but by a different department from the tax one. This department prepares a data file, which is integrated into our information system by a specific process.
Overall there are about 10,000 such units. They do not represent a large part of the whole economic scope (less than 1% of turnover), but they are concentrated in a few particular sectors, where they carry significant weight.
3.3 Imputed data
Administrative data are theoretically exhaustive but still have a few gaps. For example, returns of companies under tax audit are not transmitted to INSEE. Moreover, some returns are sent in the late January N+2 delivery, which is too late for our own needs as our annual campaign stops in December N+1. An imputation procedure is implemented to obtain complete coverage of the economic scope defined by the register and to satisfy the constraint of one return (collected or imputed) for each unit of the register.
As presented above, micro-businesses do not file a tax return but make a simple statement of their turnover. However, the tax administration gives us a list of expected micro-businesses and the total turnover they represent. A mass imputation is made for them (12% in number, 0.1% in turnover).
Beyond these micro-businesses, about 7% of the units of the register have no accounting data and are also imputed, for about 3% of the total turnover.
Table 4: scale of imputed data

                                Nb of units    %      Turnover       %
                                (thousands)           (billions €)
Collected data                        2 240    81%         3 550    97%
Imputed data (micro excluded)           210     7%           130     3%
Micro-businesses imputed data           330    12%             7    ~0%
Total                                 2 770   100%         3 680   100%
There are three methods used to impute data.
- For non micro-businesses: if a non-imputed return is available from the year before, the process uses this return and inflates it by the median evolution of turnover in the company's sector to create the return for the current year:
  $X_N = X_{N-1} \cdot \operatorname{median}_{\text{sector}}\left(\dfrac{\text{turnover}_N}{\text{turnover}_{N-1}}\right)$
- If no non-imputed return is available, the return for the current year is imputed as the average return of its sector and size class:
  $X_N = \operatorname{average}_{\text{sector} \times \text{size class}}(X_N)$
- Micro-businesses are imputed in a similar way to the second method, but with a specific structure of accounts that reproduces the aggregate turnover provided by the tax administration for these companies.
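The first two imputation methods can be sketched in Python as follows. The sketch assumes that the sector's turnover evolution is available as a list of N / N-1 ratios for units with returns in both years, and that the collected returns of the same sector and size class are available as a list of dictionaries; the function names are hypothetical.

```python
from statistics import mean, median


def impute_from_previous(previous_return, sector_growth_ratios):
    """Method 1: scale last year's return by the sector's median turnover evolution."""
    factor = median(sector_growth_ratios)
    return {name: value * factor for name, value in previous_return.items()}


def impute_from_class(class_returns):
    """Method 2: average return over collected units of the same sector and size class."""
    names = class_returns[0].keys()
    return {name: mean(r[name] for r in class_returns) for name in names}


previous = {"turnover": 200.0, "purchases": 80.0}
print(impute_from_previous(previous, [1.0, 1.5, 2.0]))
# {'turnover': 300.0, 'purchases': 120.0}

peers = [{"turnover": 100.0, "purchases": 40.0}, {"turnover": 300.0, "purchases": 60.0}]
print(impute_from_class(peers))
# {'turnover': 200.0, 'purchases': 50.0}
```

Because the same factor is applied to every characteristic in the first method, the arithmetic relations satisfied by the previous return remain satisfied by the imputed one.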
4 Main principles of the administrative data editing process
The statistical editing of tax data is divided into two main steps:
- an automatic micro-editing process
- a manual review by a team of clerks of the most problematic cases. This manual review is driven by selective controls performed on the data.
4.1 Micro-data editing process
4.1.1 Micro-controls
Micro-controls measure errors at the level of the individual business:
- micro-consistency controls based, as seen above, on arithmetic constraints of the form $X = \sum_{i=1}^{n} X_i$. They reveal definite errors.
- micro-likelihood controls, which reveal probable errors: they are based on the comparison of ratios ($x_i / y_i$) calculated for the firm with predefined bounds derived from quantiles of the sector. If the ratio is outside the bounds, the control is in error.
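A micro-likelihood control of this kind might look like the following Python sketch. The text does not say which quantiles are used; the 5th and 95th percentiles below are purely an assumption for illustration, as are the function names.

```python
from statistics import quantiles


def likelihood_bounds(sector_ratios, n=20):
    """Lower and upper bounds taken as the first and last of the n-quantile cut points.

    With n=20 these are the 5th and 95th percentiles (an assumed choice).
    """
    cuts = quantiles(sector_ratios, n=n)
    return cuts[0], cuts[-1]


def likelihood_error(x, y, bounds):
    """True when the firm's ratio x / y falls outside the sector bounds."""
    lower, upper = bounds
    return not (lower <= x / y <= upper)


# Hypothetical sector distribution of the ratio, from 0.10 to 0.50.
sector_ratios = [k / 100 for k in range(10, 51)]
bounds = likelihood_bounds(sector_ratios)
print(likelihood_error(30, 100, bounds))  # False: 0.30 is within the bounds
print(likelihood_error(90, 100, bounds))  # True: 0.90 is outside the bounds
```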
These micro-controls are called temporal controls when they involve characteristics of both the reference year and the year before, and contemporary controls when they involve only characteristics of the reference year.
4.1.2 The legacy of the past
Micro-controls measure individual errors and are therefore the ones that lead to adjustments. For reasons of priority in the development of our new process, these adjustments are all taken from the former editing process; they use only the micro-consistency controls, and therefore only contemporary adjustments are implemented. Other types of micro-controls were added in the new process (temporal micro-consistency controls, and temporal and contemporary likelihood controls), but they do not trigger automatic adjustments. They are used only in the manual review, to diagnose problems and make manual adjustments after the selective-control phase explained later.
Data editing programs are executed on the data to remove any contemporary inconsistency. If the theoretical relationship $X = \sum_{i=1}^{n} w_i X_i$ (where $w_i \in \{-1, 0, 1\}$) is not satisfied, there are two choices for data editing:
- editing 1 - shaping the breakdown: $\forall i \in [1, n],\; X_i = \dfrac{X_i}{\sum_{j=1}^{n} w_j X_j} \cdot X$
- editing 2 - recalculation of the total: $X = \sum_{i=1}^{n} w_i X_i$
The choice between the two types of data editing is made so as to minimize the impact on other relations in the return: if characteristic X is involved in another relationship that is consistent, the process shapes the breakdown; otherwise the correction is made by recalculating X.
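The choice between the two editing rules can be sketched in Python as follows. The flag telling whether the total is involved in another consistent relation is passed in by the caller, and the fallback to recalculation when the components sum to zero is an assumption of this sketch; neither detail comes from the actual programs.

```python
def edit_relation(form, total, components, total_used_in_coherent_relation):
    """Restore total == sum(w * component) on a copy of the form.

    components: list of (weight, name) pairs, weight in {-1, 1}.
    total_used_in_coherent_relation: True if the total appears in another
    relation that already holds, in which case the total is kept fixed.
    """
    edited = dict(form)
    computed = sum(w * form[name] for w, name in components)
    if computed == form[total]:
        return edited  # relation already satisfied, nothing to edit
    if total_used_in_coherent_relation and computed != 0:
        # Editing 1: rescale the breakdown so that it adds up to the total.
        scale = form[total] / computed
        for _, name in components:
            edited[name] = form[name] * scale
    else:
        # Editing 2: recompute the total from its components
        # (also used when rescaling is impossible, an assumed fallback).
        edited[total] = computed
    return edited


form = {"X": 150, "X1": 40, "X2": 60}
parts = [(1, "X1"), (1, "X2")]
print(edit_relation(form, "X", parts, True))   # {'X': 150, 'X1': 60.0, 'X2': 90.0}
print(edit_relation(form, "X", parts, False))  # {'X': 100, 'X1': 40, 'X2': 60}
```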
To keep the whole process consistent, data editing starts with the final characteristic of the income statement, the accounting profit, and carries on with the relations involving more detailed characteristics of the income statement, then characteristics of the balance sheet and finally characteristics of the schedules (tangible assets, depreciation, etc.).
4.2 Selective editing process
The second step of the tax-data editing process consists of a selective editing process, which
constitutes the cornerstone of data editing in the new system.
It rests on two kinds of methods : on the one hand, “drop-out” methods, using scores measuring the
impact of each unit on a given ratio, and on the other hand “diff” methods, using scores based on
the difference between the value of a characteristic before and after micro-editing and measuring
the impact of this micro-editing on aggregates.
The drop-out method is applied to micro-edited data and concerns non-imputed units only. The diff method is applied to all units, imputed or not.
4.2.1 Drop-out controls
Local “drop-out” scores form the heart of the selective editing process. This kind of score, which
relies only on micro-edited data, measures the contribution of a given unit to different ratios. For a
given ratio, the objective is thus to determine which units have influence on this ratio, in order to
give priority to such units for being manually reviewed in a detailed way.
As the objective of the system is the validation of aggregates both in level and evolution, two local “drop-out” scores are calculated for each characteristic of interest and each level of aggregation:
- temporal drop-out control: $TDOC_i(X) = \dfrac{X_n - x_n^i}{X_{n-1} - x_{n-1}^i} - \dfrac{X_n}{X_{n-1}}$, where $X_n$ and $X_{n-1}$ are the aggregates of characteristic X in years n and n-1 and $x^i$ is the value of this characteristic for firm i.
- contemporary drop-out control: $CDOC_i(X) = \dfrac{X_n - x_n^i}{Y_n - y_n^i} - \dfrac{X_n}{Y_n}$, where $X_n$ and $Y_n$ are the aggregates of characteristics X and Y and $x^i$, $y^i$ are the values of these characteristics for firm i.
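The two drop-out scores can be illustrated with a small Python sketch, assuming each score is the ratio computed without unit i minus the same ratio computed on all units. Argument names are hypothetical.

```python
def temporal_drop_out(x_i_n, x_i_prev, X_n, X_prev):
    """TDOC: evolution of the aggregate without unit i, minus its evolution with it."""
    return (X_n - x_i_n) / (X_prev - x_i_prev) - X_n / X_prev


def contemporary_drop_out(x_i, y_i, X_n, Y_n):
    """CDOC: ratio X/Y without unit i, minus the ratio with it."""
    return (X_n - x_i) / (Y_n - y_i) - X_n / Y_n


# A unit that stays at 500 while the aggregate grows from 1000 to 1500:
# without it, the aggregate would have doubled instead of growing by 50%.
print(temporal_drop_out(500, 500, 1500, 1000))      # 0.5

# A unit with x=250, y=500 in aggregates X=750, Y=1000:
# removing it moves the ratio from 0.75 to 1.0.
print(contemporary_drop_out(250, 500, 750, 1000))   # 0.25
```

A score close to zero means that the unit has little influence on the ratio; large absolute values point to units worth reviewing.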
4.2.2 Diff controls
Therefore, the influence of each firm on interest aggregates is checked, both in level and evolution,
by the drop-out controls. However, this control mechanism raises a problem for units that have been
imputed or modified during micro-edits. Indeed, the imputation procedure is mostly based on
median, mean or ratio imputation by class. Consequently, such units will have an average behaviour
with regard to imputed characteristics, which mechanically leads to small “drop-out” scores, even if
important units are concerned. There is thus a risk of under-control concerning imputed characteristics. In order to make up for this risk, a local “diff” score, comparing raw and micro-edited values, measures the weight of imputation and micro-editing in a given aggregate:
$DiffC_i(X) = \dfrac{x^i_{n,\text{raw}} - x^i_{n,\text{imputed or micro-edited}}}{X_n}$
where $X_n$ is the aggregate of the characteristic of interest X, $x^i_{n,\text{raw}}$ the value of the characteristic for unit i before micro-editing (by convention 0 if it is imputed) and $x^i_{n,\text{imputed or micro-edited}}$ the value after. Such a score makes it possible to identify units for which the lack of reliable data is too detrimental to the quality of aggregates.
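A minimal Python sketch of the diff score follows. Representing the raw value of an imputed unit as `None` is a choice made for this sketch; it is then counted as 0, following the convention stated above.

```python
def diff_score(raw_value, edited_value, aggregate):
    """DiffC: share of the aggregate moved by imputation or micro-editing.

    raw_value is None for an imputed unit and is then counted as 0.
    """
    raw = 0 if raw_value is None else raw_value
    return (raw - edited_value) / aggregate


print(diff_score(None, 50, 1000))  # -0.05: an imputed unit weighing 5% of the aggregate
print(diff_score(120, 100, 1000))  # 0.02: micro-editing removed 2% of the aggregate
```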
4.2.3 Global priority indicator
For a given characteristic and a given level of validation, the joint use of the two local “drop-out” scores and the local “diff” score makes it possible to organize controls into a hierarchy.
However, since units (i.e. tax forms) need to be treated on a “unit by unit” basis, the results of the local scores are synthesized into a global priority indicator, according to a three-step procedure:
- for each characteristic and each local score, two thresholds, a “high” one and a “medium” one, divide the whole set of units into three groups: very influential, moderately influential and non-influential units;
- then, the status of each characteristic is defined as the “maximum status” over the different local scores relating to this characteristic. So, the status $S(X_i)$ of a given characteristic $X_i$ is set to I if the unit is very influential for at least one local score, to S if the unit is only moderately influential for at least one local score, and to O otherwise;
- lastly, the global priority indicator is defined as:
  $GPI = \dfrac{A \sum_{\text{var } i} K_i \, \mathbb{1}_{S(X_i) = I} + \sum_{\text{var } i} K_i \, \mathbb{1}_{S(X_i) = S}}{(1 + A) \sum_{\text{var } i} K_i}$
  where A represents the importance attached to the “very influential” status compared with the “moderately influential” status, and $K_i$ represents the importance of each characteristic.
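The three-step procedure can be sketched in Python as follows. Two points are assumptions of this sketch rather than facts from the text: local scores are compared with their thresholds in absolute value, and the indicator is normalised by (1 + A) times the sum of the weights. All names are hypothetical.

```python
RANK = {"O": 0, "S": 1, "I": 2}


def local_status(score, medium, high):
    """Step 1: classify one local score against its two thresholds."""
    if abs(score) >= high:
        return "I"
    if abs(score) >= medium:
        return "S"
    return "O"


def characteristic_status(local_scores):
    """Step 2: 'maximum status' over the local scores of one characteristic.

    local_scores: non-empty list of (score, medium_threshold, high_threshold).
    """
    return max((local_status(*item) for item in local_scores), key=RANK.get)


def global_priority(statuses, weights, A):
    """Step 3: weighted share of influential characteristics (assumed normalisation)."""
    very = sum(weights[c] for c, s in statuses.items() if s == "I")
    moderate = sum(weights[c] for c, s in statuses.items() if s == "S")
    return (A * very + moderate) / ((1 + A) * sum(weights.values()))


statuses = {"turnover": "I", "value_added": "S", "investment": "O"}
weights = {"turnover": 2, "value_added": 1, "investment": 1}
print(global_priority(statuses, weights, A=3))  # 0.4375
```

In this example the numerator is 3 × 2 + 1 = 7 and the denominator is 4 × 4 = 16.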
Finally, the whole set of units is divided into four groups according to the value of their global priority indicator. The group of units with the highest priority is checked manually first, then the second group and lastly the third one, according to available time and resources. This mechanism makes it possible to manage the workload during the campaign, and thus to respect practical constraints while ensuring a good level of quality for the statistics.
4.3 The recall by clerks
The selective editing described above identifies the units that need to be manually reviewed and, within their form, the characteristics that need manual validation (i.e. characteristics with status I or S). This review consists mainly in calling the companies back and asking them to confirm or correct the values of the selected characteristics.
4.4 The global control : returns with selective controls without micro-control.
Tax data are characterized by a large number of returns that are influential according to the selective controls but show no micro-control error. The data in these returns are generally valid and a recall would lead to no adjustment. In these particular cases, the review does not aim to confirm the values of the selected characteristics one by one but to validate the return as a whole, by checking with the company that no important event, such as a restructuring, has occurred that could lead to unexpected behaviour of our aggregate statistics.