Reengineering French structural business statistics - an extended use of administrative data

Sébastien Chami
INSEE, business statistics directorate
18 Bd Adolphe Pinard, 75675 PARIS CEDEX 14, FRANCE

1 Introduction

INSEE has upgraded its sub-process for handling annual corporate tax forms in order to collect less data through questionnaire surveys. This required an extensive redesign of the production process so as to shorten the sub-process dealing with administrative data, to produce intermediate results and to allow statisticians to follow up directly with businesses about their corporate tax data. Some changes were also made to the previous organisation to ensure consistency with the global production process for structural statistics.

The first part of this paper presents a general description of the administrative tax data. The second part describes how the statistical register is taken into account in the tax data editing process, and the third part details the main principles of the new tax data editing process itself.

2 The administrative tax data

2.1 A data structure related to the management by the administration

As with any administrative source, the way the administration manages tax data imposes many constraints on our statistics production process.

2.1.1 Schedule

First, the delivery schedule is set by the administration. Firms have 4 months after the end of their fiscal year to file their tax return. The tax administration then groups and captures the data before sending them to us. Tax forms sent by mail are processed entirely by optical character recognition. This work takes about six months. However, the development of internet filing, a mandatory procedure for the biggest companies, has reduced the processing time for these companies to about a month and a half.
Thus, the administration is able to send tax data in several deliveries: the first in mid-June N+1, a second in September, a third at the end of October and a final, residual one in late January N+2.

The first delivery in June, although limited in number of firms (about one fifth of the scope), already represents three quarters of the turnover, because larger companies are required to file by internet and are therefore part of that delivery. The data available at this date are thus sufficient to establish relevant early results in the summer.

The September delivery is of limited importance, but improves our coverage for establishing the early SBS results.

The late October delivery is the most complete in terms of units. It gives us almost complete coverage of the tax scope for our annual statistics production process.

The final delivery in late January N+2 is composed of data that the tax administration processed late for various reasons; it also includes very early returns for the following year. Delayed data carry relatively little weight, but arrive too late for the production of our annual statistics for N. Nevertheless, we integrate them into our information system to consolidate year N when producing the statistics for year N+1. The early returns are also integrated for the next year.

Table 1: schedule of tax data deliveries by the administration

Delivery             Delivery date       Nb of units    %      Turnover        %
                                         (thousands)           (billions €)
Complementary N-1    late Jan. N+1            230       10%         260        7%
Advanced             mid June N+1             420       19%       2 550       72%
Intermediate         mid Sept. N+1            110        5%         350       10%
Definitive           late October N+1       1 460       65%         360       10%
Complementary        late Jan. N+2             20        2%          30        1%
Total                                       2 240      100%       3 550      100%

2.1.2 A decentralized administrative organization

The organization of the administration has very direct consequences on the data transmitted to us.
There are two distinct processes, depending on whether the tax form is completed on paper and filed by mail or uploaded via the internet. Uploaded returns are managed in a single data center. The collection of paper returns, however, is very decentralized. Returns are gathered in each of the 100 French departments. Once the returns of a department are consolidated, the departmental center forwards them to one of four national data centers (including the one that already manages uploaded returns), which produce the files provided to us.

As a consequence of this organization, each delivery is made up of 4 files, one for each center. Sometimes, although this has become rarer in recent years, a center is late sending us its file and we do not receive it at the same time as those of the other three centers. Our old production process had not foreseen this eventuality and remained blocked until the files of all 4 centers had been received. To avoid this drawback, we designed the data integration in our production system to be independent for each source. The entire system downstream of this integration can thus operate in the absence of one or more expected files, although the results produced are, of course, weaker.

2.1.3 Stock files instead of flow files

The files sent to us by the tax authorities are stock files: the September delivery includes the units already sent in June, and the same holds for the October delivery. Only the January delivery is an exception. But we are only interested in the new units (the flow) of each delivery. The units already sent, as we shall see later, may have been analyzed and modified by clerks, and we do not want to lose this work. Our production process therefore has a preliminary step that removes the data already transmitted and keeps only the new data of each delivery.
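The preliminary step described above can be sketched as follows. This is a minimal illustration rather than the production implementation: the function name `extract_flow`, the dictionary layout and the use of a return identifier as key are all assumptions made for the example.

```python
def extract_flow(stock_delivery, already_received):
    """Return the units of a stock delivery that were not in earlier deliveries.

    stock_delivery: dict mapping a return identifier to its record
                    (hypothetical layout, for illustration only).
    already_received: set of identifiers integrated from previous deliveries;
                      it is updated in place.
    """
    flow = {uid: rec for uid, rec in stock_delivery.items()
            if uid not in already_received}
    # Remember the new identifiers so that the next delivery skips them too.
    already_received.update(flow)
    return flow


# June delivery: everything is new.
seen = set()
june = {"A": {"turnover": 100}, "B": {"turnover": 250}}
june_flow = extract_flow(june, seen)

# The September stock file repeats A and B and adds C. Only C is kept,
# so a manual correction made to A or B since June is not overwritten.
september = {"A": {"turnover": 100}, "B": {"turnover": 250}, "C": {"turnover": 40}}
september_flow = extract_flow(september, seen)
```

Filtering on identifiers rather than comparing record contents is a deliberate choice in this sketch: a unit that was edited by a clerk after the first delivery would otherwise look "changed" and be reloaded.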
2.2 Accounting data

2.2.1 The diversity of types of profits subject to taxation

The legal rules governing the taxation of companies are complex. The obligation to file and the type of return to be filed depend on the activity, the size and the legal status of the company.

The activity of the company determines the type of taxable profit:
- agricultural profits: for the agricultural sector. This sector is covered by specific statistical procedures that do not use tax data; it is outside the scope of Resane, even though the administration gives us the files for this type of profit.
- non-commercial profits: for the professions (doctors, lawyers)
- industrial and business profits: for other companies

The size of the firm, according to turnover criteria, determines the tax system:
- the normal system: for larger companies
- the simplified system: for small businesses
- the extremely simplified system: for very small businesses (micro-businesses)

Table 2: types of profits and tax systems

Profit                     System        Nb of units    %      Turnover        %
                                         (thousands)           (billions €)
Industrial and business    Normal             660       26%       3 330       94%
Industrial and business    Simplified       1 090       43%         150        4%
Industrial and business    Micro              210        8%           3       ~0%
Non commercial             Normal             490       19%          60        2%
Non commercial             Micro              120        5%           4       ~0%
Total                                       2 570      100%       3 550      100%

It should be noted that the micro-system requires only a declaration of annual turnover. The administration therefore provides no tax data for these units, and we only have an aggregate turnover for these 330 000 units.

2.2.2 A highly prescriptive accounting

The information that companies must report to the administration to determine their profit taxes is taken from their accounts. The framework in which these accounts must be kept is defined for all businesses: it is the French general accounting standard. This accounting standard is extremely prescriptive.
It precisely defines the nomenclature and the detailed items under which the various accounting transactions must be recorded. It is also very prescriptive about the accounting methods that can be used to assess the various economic characteristics that affect the activity of the company (depreciation or provisioning methods, for example).

The consequence of this prescriptive accounting is consistent information. With few exceptions, a given accounting concept represents a single characteristic, and there is one and only one accounting method to determine the value of this characteristic. Thanks to this homogeneity, the accounting information can be aggregated into statistics that are consistent and comparable over time and between classes of businesses.

2.2.3 A detailed set of data

As mentioned above, the accounting is very detailed. The tax forms differ by system and type of profit, but all rely on this accounting at a more or less detailed level: the larger the company, the more detailed the information requested. The tax forms for industrial and business profits under the normal system have nearly 600 accounting characteristics. With all kinds of profits and systems included, nearly 1,000 different characteristics are available; some are common and others are specific to a type of profit or a system.

Many characteristics have a primarily fiscal purpose and are of no interest for our economic statistics. We therefore restricted our needs to a selection of 250 target characteristics. These target characteristics are present in most tax forms whatever the profit or system, either directly or by simple summation of more detailed characteristics. Only non-commercial profits, whose form is rather short, are an exception, and we had to implement an estimation procedure to complete them.
2.2.4 Data are strongly constrained by mathematical relationships

Accounting data are presented with a large redundancy of information. The same amount may be present in multiple accounts, for example. Most of the 250 target characteristics are connected by nearly a hundred arithmetic relations (such as $X = X_1 + X_2 + \dots + X_n$ or $Z = X - Y$).

Taken together, these relations provide a concise and complete overview of the quality of the tax form. If all the relations are satisfied, the probability that there is an error in the return is very low. Conversely, if a relation is not respected, the difference between its two sides gives a measure of the error affecting the return.

Overall, data quality, as measured by the inconsistencies over all the relations that the return must respect, is very satisfactory for uploaded data, but lower for forms sent by mail and captured by optical recognition.

Table 3: error rates by mode of data acquisition

Mode        Error                            Nb of units    %      Turnover        %
                                             (thousands)           (billions €)
Internet    No error                            1 410       90%       2 890       92%
Internet    Lower than 15 k-€                     130        8%         160        5%
Internet    Equal or greater than 15 k-€           30        2%          90        3%
Internet    Total                               1 570      100%       3 140      100%
Optical     No error                              500       75%         330       80%
Optical     Lower than 15 k-€                      90       14%          60       15%
Optical     Equal or greater than 15 k-€           70       11%          20        5%
Optical     Total                                 670      100%         410      100%
Total       No error                            1 910       86%       3 210       91%
Total       Lower than 15 k-€                     220       10%         220        6%
Total       Equal or greater than 15 k-€          110        5%         120        3%
Total       Total                               2 240      100%       3 550      100%

Only a few characteristics, fewer than twenty in total, are not involved in any relation and cannot be controlled this way. To judge their relevance we use likelihood controls instead (for example, relative to their value the previous year and to the quantiles of the evolution in their sector).
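The way arithmetic relations yield an error measure can be sketched as follows. This is only an illustration under stated assumptions: the characteristic names, the two relations and the rule "classify a return by its largest absolute gap" are hypothetical; only the 15 k-€ threshold comes from Table 3.

```python
THRESHOLD_EUR = 15_000  # class boundary used in Table 3


def relation_gap(form, total, components):
    """Signed gap between a reported total and the signed sum of its components.

    components: list of (name, weight) pairs, with weight +1 or -1.
    """
    expected = sum(weight * form[name] for name, weight in components)
    return form[total] - expected


def error_class(form, relations):
    """Classify a return by its largest absolute gap over all relations.

    Using the largest gap is an assumption of this sketch.
    """
    worst = max(abs(relation_gap(form, total, comps)) for total, comps in relations)
    if worst == 0:
        return "no error"
    if worst < THRESHOLD_EUR:
        return "lower than 15 k-EUR"
    return "equal or greater than 15 k-EUR"


# Hypothetical characteristics and relations, for illustration only.
relations = [
    # X = X1 + X2
    ("operating_income", [("sales", 1), ("other_income", 1)]),
    # Z = X - Y
    ("operating_result", [("operating_income", 1), ("operating_costs", -1)]),
]
form = {
    "sales": 900_000,
    "other_income": 100_000,
    "operating_income": 1_000_000,
    "operating_costs": 700_000,
    "operating_result": 320_000,  # should be 300 000: gap of 20 000
}
gaps = [relation_gap(form, total, comps) for total, comps in relations]
status = error_class(form, relations)
```

Here the first relation holds and the second is off by 20 000 euros, so the return falls in the highest error class.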
2.2.5 Flexible rules for accounting periods

The many constraints described above ensure the consistency and quality of the data, but one aspect leaves the company fairly wide latitude and has important implications for the production of our statistics: the accounting period on which the tax return is based. Two main constraints govern the choice of this period:
- the company must close at least one accounting period during each calendar year;
- consecutive accounting periods must be continuous: neither overlap nor gap.

Companies can choose a period different from 12 months: for instance, January 1st to September 30th for one period, and October 1st of the same year to December 31st of the next year for the following period. Or they can choose a period that does not coincide with the calendar year: from April 1st to the following March 31st every year.

But the statistics we produce are based on the calendar year. A restatement has therefore been implemented to move from the accounting period to the calendar year: each accounting period is assigned to the calendar year with which it has the most months in common. Examples:
- a period from April 1st 2008 to March 31st 2009 is assigned to calendar year 2008 (9 months out of 12 are in 2008);
- a period from October 1st 2008 to September 30th 2009 is assigned to calendar year 2009 (9 months out of 12 are in 2009).

We therefore assume that the accounting periods running ahead of the calendar year offset those running behind it. Studies have shown that attempting to correct these shifts creates more error than it solves, and that simply assigning each period to one calendar year rather than another is an easier and more efficient rule.

We then adjust to a 12-month basis the forms of continuing companies whose periods differ from 12 months. Only births and deaths are kept on their original period.
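The assignment rule can be sketched in a few lines. This is an illustration under two assumptions that are not in the text: the overlap is measured in days as a proxy for months, and a tie (for example a July-to-June period) is resolved in favour of the earlier year. The function names are hypothetical.

```python
from datetime import date


def overlap_days(start, end, year):
    """Number of days of the period [start, end] that fall in the given calendar year."""
    lo = max(start, date(year, 1, 1))
    hi = min(end, date(year, 12, 31))
    return max(0, (hi - lo).days + 1)


def assign_calendar_year(start, end):
    """Calendar year sharing the most days with the accounting period.

    Ties go to the earlier year (assumption of this sketch).
    """
    years = range(start.year, end.year + 1)
    return max(years, key=lambda y: (overlap_days(start, end, y), -y))


# The two examples from the text.
year_a = assign_calendar_year(date(2008, 4, 1), date(2009, 3, 31))
year_b = assign_calendar_year(date(2008, 10, 1), date(2009, 9, 30))
```

With these inputs `year_a` is 2008 and `year_b` is 2009, matching the two examples above.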
To give an idea of the magnitude of this phenomenon, about 65% of tax returns have an accounting period that coincides with the calendar year, 30% are 12 months long but shifted from the calendar year, and 5% differ from 12 months (including 3% of births and deaths).

2.2.6 Multiple tax returns

The constraints on the duration of the accounting period described above ensure that there is at least one form each year. But they do not prevent the opposite risk: having several returns for one year. This occurs fairly frequently. The main reasons why companies send multiple tax forms are as follows:
- an important event, such as a merger, led the company to close an accounting period in the middle of the calendar year. The company then sends two tax forms for consecutive periods (January 1st to June 23rd and June 24th to December 31st, for instance). Our editing process treats these cases by "pasting" the two forms: we retain the opening amounts of the first period and the closing amounts of the second period, and we sum over the two periods all amounts that are flows.
- the company made an error in its first return and sends a second one to correct it. The administration uses a specific code that enables us to detect these cases. Our process then deletes the first return and keeps the second one.
- in some cases companies may declare part of their profit as business profit and another part as non-commercial profit. Our process then consolidates the two returns by summing the amounts of each characteristic.

3 Linkage with the statistical register (Ocsane)

3.1 The identification of returns

The list of units within the economic scope is given by the statistical register, Ocsane. These units are identified by the id-number, called the Siren number, used in the French inter-administrative register (Sirene).
But the tax administration has its own register with its own identifying number, called IFRP. The objective of identification is to match the returns identified by IFRP with the units of our statistical register identified by the Siren number. Fortunately, the tax administration integrated the Siren number into its register several years ago as a simple attribute and has made a strong effort to improve its accuracy ever since. Nowadays, our automatic process is able to match 98% of the returns with units of the register. We split the remaining 2% according to the amounts contained in the form: returns with high amounts are manually reviewed by clerks and returns with low amounts are discarded. Indeed, the tax administration sends us elements of name and address for unidentified returns, which often enable us to find a match. Thus, almost 1,000 returns are manually matched after the automatic pairing has failed.

3.2 The specific cases

Some cases have forced us to design specific production processes in parallel with the tax data processing.

3.2.1 Profiled companies

Administrative tax data refer to legal units, but some large groups have a legal organization that makes the accounting data of their legal units irrelevant for our statistics. Bilateral agreements have been concluded between INSEE and these major groups to provide us with accounting data on consolidated outlines that are relevant for statistics. The resulting data are uploaded by a specific process, and the tax returns of the legal units within these outlines are dropped to avoid double counting. Although these groups are very few (4 in 2009), their weight in the economy is significant: about 4% of the value added of France.

3.2.2 Off-tax companies

A portion of the business economic scope is not covered by tax data:
- cooperatives: tax on the profits is paid by the members, so cooperatives themselves are tax-free and do not file tax forms.
Their accounts are collected via an additional questionnaire in the ESA survey.
- semi-public companies: these units are on the border between the public and business sectors. Examples include the water authorities of municipalities and the National Office of Forests. They pay no taxes but are nonetheless part of the business scope. Their accounts are collected by the administration, but in a different department from the tax department. A data file is prepared by this department and is integrated into our information system by a specific process.

Overall there are about 10,000 such units. They do not represent a large part of the whole economic scope (less than 1% of turnover), but they are concentrated in a few particular sectors where they hold a significant place.

3.3 Imputed data

Administrative data are theoretically exhaustive but still have a few gaps. For example, returns of companies under tax audit are not transmitted to INSEE. Moreover, some returns are sent in the late January N+2 delivery, which is too late for our needs as our annual campaign stops in December N+1.

An imputation procedure is implemented to obtain complete coverage of the economic scope defined by the register and to satisfy the constraint: one return (collected or imputed) for each unit of the register.

As presented above, micro-businesses do not file a tax return but make a simple statement of their turnover. However, the tax administration gives us a list of expected micro-businesses and the total turnover they represent. A massive imputation is made for them (12% in number, 0.1% in turnover).

Beyond these micro-businesses, about 7% of the units of the register have no accounting data and are also imputed, for about 3% of the total turnover.
Table 4: scale of imputed data

                                   Nb of units    %      Turnover        %
                                   (thousands)           (billions €)
Collected data                        2 240       81%       3 550       97%
Imputed data (micro excluded)           210        7%         130        3%
Micro-businesses imputed data           330       12%           7       ~0%
Total                                 2 770      100%       3 680      100%

Three methods are used to impute data.

For non micro-businesses:
- if a non-imputed return is available from the year before, the process takes this return and inflates it by the median evolution of turnover in the company's sector to create the return for the current year: $X_N = X_{N-1} \cdot \mathrm{median}_{\text{sector}}\left(\dfrac{\text{turnover}_N}{\text{turnover}_{N-1}}\right)$.
- if no non-imputed return is available, the return for the current year is imputed as the average return of its sector and size class: $X_N = \mathrm{average}_{\text{sector} \times \text{size class}}(X_N)$.

Micro-businesses are imputed in a similar way to the second method, but with a specific structure of accounts which ensures that the aggregate turnover provided by the tax administration for these companies is recovered.

4 Main principles of the administrative data editing process

The statistical editing of tax data is divided into two main steps:
- an automatic micro-editing process;
- a manual review, by a team of clerks, of the most problematic cases. This manual review is driven by selective controls performed on the data.

4.1 Micro-data editing process

4.1.1 Micro-controls

Micro-controls measure errors at the level of the individual business:
- micro-consistency controls are based, as already seen above, on mathematical constraints such as $X = \sum_{i=1}^{n} X_i$. They point out definite errors.
- micro-likelihood controls point out probable errors: they compare ratios ($x_i / y_i$) calculated for the firm with predefined bounds calculated from quantiles of the sector. If the ratio is outside the bounds, the control is in error.
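A micro-likelihood control of the kind just described can be sketched as follows. The text only says that the bounds are calculated from sector quantiles; taking the first and last deciles is an assumption of this sketch, as are the function names and the sample ratios.

```python
import statistics


def sector_bounds(sector_ratios, n=10):
    """Bounds taken as the first and last of the n-quantile cut points of the sector.

    With n=10 these are the first and ninth deciles (assumption of this sketch).
    """
    cuts = statistics.quantiles(sector_ratios, n=n)
    return cuts[0], cuts[-1]


def likelihood_control_in_error(x, y, bounds):
    """True when the firm's ratio x / y lies outside the sector bounds.

    Assumes y is non-zero.
    """
    low, high = bounds
    ratio = x / y
    return ratio < low or ratio > high


# Hypothetical ratios observed in one sector, from 0.20 to 0.40.
sector_ratios = [round(0.20 + 0.01 * k, 2) for k in range(21)]
bounds = sector_bounds(sector_ratios)

plausible = likelihood_control_in_error(30.0, 100.0, bounds)   # ratio 0.30
suspicious = likelihood_control_in_error(95.0, 100.0, bounds)  # ratio 0.95
```

A control in error here only signals a probable error: unlike a broken arithmetic relation, an unusual ratio may be perfectly genuine.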
These micro-controls are called temporal controls when they involve characteristics of both the reference year and the year before, and contemporary controls when they involve only characteristics of the reference year.

4.1.2 The legacy of the past

Micro-controls measure individual errors, and they are therefore the ones that lead to adjustments. For reasons of priority in the development of our new process, these adjustments are all taken from the former editing process. They use only the micro-consistency controls, and therefore only contemporary adjustments are implemented.

Other types of micro-controls were added in the new process: temporal micro-consistency controls, and temporal and contemporary likelihood controls. They do not trigger automatic adjustments. They are used only during the manual review, to diagnose problems and make manual adjustments after the selective-controls phase explained later.

Data editing programs are executed on the data to remove any contemporary inconsistency. If the theoretical relationship $X = \sum_{i=1}^{n} w_i X_i$ (where $w_i \in \{-1, 0, 1\}$) is not satisfied, there are two choices for data editing:
- editing 1, shaping the breakdown: $\forall i \in \{1, \dots, n\},\ X_i \leftarrow X_i \cdot \dfrac{X}{\sum_{j=1}^{n} w_j X_j}$
- editing 2, recalculation of the total: $X \leftarrow \sum_{i=1}^{n} w_i X_i$

The choice between the two types of data editing is made by minimizing the impact on the other relations in the return: if characteristic $X$ is involved in another relationship that is coherent, the process shapes the breakdown; otherwise the correction is done by recalculating $X$.

To keep the whole process coherent, data editing starts with the final characteristic of the income statement, the accounting profit, and carries on with the relations involving more detailed characteristics of the income statement, then characteristics of the balance sheet, and finally characteristics of the schedules (tangible assets, depreciation, etc.).
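The two editing types and the rule for choosing between them can be sketched as follows. This is a simplified illustration, not the production code: the function names are hypothetical, and the flag `total_is_anchored` stands in for the check "the total is involved in another relationship that is coherent", which is assumed to be computed elsewhere.

```python
def signed_sum(components, weights):
    """Sum of w_i * X_i, with each weight in {-1, 0, 1}."""
    return sum(w * x for x, w in zip(components, weights))


def shape_breakdown(total, components, weights):
    """Editing 1: rescale every component so that the signed sum equals the total."""
    current = signed_sum(components, weights)
    if current == 0:
        raise ValueError("cannot rescale a breakdown whose signed sum is zero")
    factor = total / current
    return [x * factor for x in components]


def recalculate_total(components, weights):
    """Editing 2: keep the components and recompute the total."""
    return signed_sum(components, weights)


def edit_relation(total, components, weights, total_is_anchored):
    """Apply editing 1 if the total is anchored by another coherent relation,
    editing 2 otherwise. Returns the edited (total, components) pair.
    """
    if total_is_anchored:
        return total, shape_breakdown(total, components, weights)
    return recalculate_total(components, weights), list(components)


# Reported total 1000, but the components add up to 900.
kept_total, scaled = edit_relation(1000, [600, 300], [1, 1], total_is_anchored=True)
new_total, kept = edit_relation(1000, [600, 300], [1, 1], total_is_anchored=False)
```

In the first call the total is trusted and the components are scaled by 1000/900; in the second the components are trusted and the total becomes 900. Either way the relation holds afterwards.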
4.2 Selective editing process

The second step of the tax-data editing process consists of a selective editing process, which constitutes the cornerstone of data editing in the new system. It rests on two kinds of methods: on the one hand, “drop-out” methods, using scores that measure the impact of each unit on a given ratio, and on the other hand “diff” methods, using scores based on the difference between the value of a characteristic before and after micro-editing, which measure the impact of this micro-editing on aggregates. The drop-out method is applied to micro-edited data and concerns non-imputed units only. The diff method is applied to all units, imputed or non-imputed.

4.2.1 Drop-out controls

Local “drop-out” scores form the heart of the selective editing process. This kind of score, which relies only on micro-edited data, measures the contribution of a given unit to different ratios. For a given ratio, the objective is thus to determine which units influence this ratio, in order to give such units priority for detailed manual review.

As the objective of the system is the validation of aggregates both in level and evolution, two local “drop-out” scores are calculated for each characteristic of interest and each level of aggregation:
- temporal drop-out control: $TDOC_i(X) = \dfrac{X_n - x_i^n}{X_{n-1} - x_i^{n-1}} - \dfrac{X_n}{X_{n-1}}$, where $X_n$ and $X_{n-1}$ are the aggregates of characteristic $X$ in years $n$ and $n-1$, and $x_i^n$ and $x_i^{n-1}$ the values of this characteristic for firm $i$ in those years.
- contemporary drop-out control: $CDOC_i(X) = \dfrac{X_n - x_i^n}{Y_n - y_i^n} - \dfrac{X_n}{Y_n}$, where $X_n$ and $Y_n$ are the aggregates of characteristics $X$ and $Y$, and $x_i^n$ and $y_i^n$ the values of these characteristics for firm $i$.

4.2.2 Diff controls

The drop-out controls thus check the influence of each firm on the aggregates of interest, both in level and evolution. However, this control mechanism raises a problem for units that have been imputed or modified during micro-editing.
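The problem can be made concrete with a small sketch of the drop-out scores of section 4.2.1. Each score is read here as the ratio of aggregates recomputed without the unit, minus the same ratio with the unit. The function names and all figures are hypothetical.

```python
def cdoc(x_i, y_i, X, Y):
    """Contemporary drop-out score: X/Y recomputed without unit i, minus X/Y."""
    return (X - x_i) / (Y - y_i) - X / Y


def tdoc(x_now, x_prev, X_now, X_prev):
    """Temporal drop-out score: evolution of the aggregate without unit i,
    minus the evolution of the full aggregate."""
    return (X_now - x_now) / (X_prev - x_prev) - X_now / X_prev


# Hypothetical aggregates with ratio X / Y = 0.5.
X, Y = 500.0, 1000.0

# A large unit (40% of the aggregate) whose own ratio equals the aggregate ratio.
typical = cdoc(x_i=200.0, y_i=400.0, X=X, Y=Y)

# A smaller unit (10% of Y) whose ratio, 0.9, departs from the aggregate ratio.
atypical = cdoc(x_i=90.0, y_i=100.0, X=X, Y=Y)
```

The large unit scores exactly zero because removing a unit whose ratio equals the aggregate ratio leaves that ratio unchanged, whatever the size of the unit. A unit imputed from class averages tends to be in this situation, which is why drop-out scores alone would leave it unchecked.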
Indeed, the imputation procedure is mostly based on median, mean or ratio imputation by class. Consequently, such units will show average behaviour with regard to imputed characteristics, which mechanically leads to small “drop-out” scores, even when important units are concerned. There is thus a risk of under-control for imputed characteristics.

To make up for this risk, a local “diff” score, comparing raw and micro-edited values, measures the weight of imputation and micro-editing in a given aggregate:

$DiffC_i(X) = \dfrac{x_i^{n,\text{raw}} - x_i^{n,\text{imputed or micro-edited}}}{X_n}$

where $X_n$ is the aggregate of the characteristic of interest $X$, $x_i^{n,\text{raw}}$ the value of the characteristic for unit $i$ before micro-editing (by convention 0 if it is imputed) and $x_i^{n,\text{imputed or micro-edited}}$ the value after. Such a score makes it possible to identify units for which the lack of reliable data is too detrimental to the quality of aggregates.

4.2.3 Global priority indicator

For a given characteristic and a given level of validation, the joint use of two local “drop-out” scores and one local “diff” score makes it possible to organize controls into a hierarchy. However, since units (i.e. tax forms) need to be treated on a “unit by unit” basis, the results of the local scores are synthesized into a global priority indicator, according to a three-step procedure:
- for each characteristic and each local score, two thresholds, a “high” one and a “medium” one, divide the whole set of units into three groups: very influential, moderately influential and non-influential units;
- then, the status of each characteristic is defined as the “maximum status” of the different local scores relating to this characteristic.
So, the status $S(X_i)$ of a given characteristic $X_i$ is defined as I if the unit is very influential for at least one local score, as S if the unit is only moderately influential for at least one local score, and as O otherwise;
- lastly, the global priority indicator is defined as:

$GPI = \dfrac{A \sum_{i \in var} K_i \, \mathbb{1}_{S(X_i) = I} + \sum_{i \in var} K_i \, \mathbb{1}_{S(X_i) = S}}{(1 + A) \sum_{i \in var} K_i}$

where $A$ represents the importance attached to the “very influential” status compared with the “moderately influential” status, and $K_i$ represents the importance of each characteristic.

Finally, the whole set of units is divided into four groups, according to the value of their global priority indicator. The group of units with the highest priority is checked manually first, then the second group and lastly the third one, according to available time and means. This mechanism makes it possible to manage the amount of work during the campaign, and thus to respect practical constraints while ensuring a good level of quality for the statistics.

4.3 The recall by clerks

The selective editing described above identifies the units that need to be manually reviewed and, within their form, the characteristics that need manual validation (i.e. characteristics with status I or S). This review consists mainly in calling companies back and asking them to confirm or correct the value of the selected characteristics.

4.4 The global control: returns flagged by selective controls without micro-control errors

Tax data are characterized by a large number of influential returns, as measured by selective controls, that show no micro-control error. The data in these returns are generally valid and a recall would lead to no adjustment.
In these particular cases, the review does not aim to confirm the values of the selected characteristics one by one, but to validate the return as a whole by verifying with the company that no important event, such as a restructuring, has occurred that could lead to unexpected behaviour of our aggregate statistics.