PARALLEL ALGORITHM FOR CALCULATING GENERAL EQUILIBRIUM IN MULTIREGION ECONOMIC GROWTH MODELS

: We develop and analyze a parallel algorithm for computing a solution in a multiregion dynamic general equilibrium model. The algorithm is based on an iterative method of the Gauss –Seidel type and exploits a special block structure of the model. Calculation of prices and input-output ratios in production for diﬀerent time steps is carried out in parallel. We implement the parallel algorithm using the OpenMP interface for systems with shared memory. The eﬃciency of the algorithm is studied with the numbers of cores varying in the full range from one to the number of time steps of the model.


Introduction
Dynamic computable general equilibrium (CGE) models are widely used for estimating the effects of demographic and technological changes on energy use and carbon dioxide (CO 2 ) emissions.The equilibrium is described in the framework of the Arrow -Debreu theory, which leads to a systems of nonlinear equations.Usually large-scale nonlinear systems are solved by one of the "generalpurpose" Krylov subspace solvers, which can deal effectively with sparse matrices (see, e.g., [1]).
In our paper [2], we presented a parallel algorithm based on an iterative method of the Gauss -Seidel type [3].We exploited the special block structure of the nonlinear system of equations in dynamic CGE models.We implemented the algorithm using parallel programming environments for the one-region version of the Population-Environmental-Technology (PET) model [4,5].The numerical results showed that the speed of our algorithm is comparable to the one of Krylov methods solvers such as NITSOL [6].
In this paper we extend the algorithm to models with international trade and apply it to the multiregion PET model [7].We implement the parallel algorithm using the OpenMP interface for systems with shared memory.To demonstrate the effectiveness of the parallel algorithm we use the PET model calibrated to reproduce major outcomes for the socioeconomic scenarios from the Shared Socioeconomic Pathways (SSP) database (see, e.g.[8]).The calibration of the PET model to the SSPs is described in the supplementary material to [8].
The paper is organized as follows.In Sect. 2 we present a description of the multiregion PET model.In particular, we explain in detail how the intermediate goods demand is calculated in the presence of the international trade.In Sect. 3 we present the numerical method for calculating the equilibrium and explain the parallel algorithm.In Sect. 4 we discuss the calculation results.

Structure of the CGE model
In this section we describe the multiregion PET model (for description of the one-region PET model, see, e.g.[2,4,5]).
The PET model is a forward-looking CGE model with tree types of agents: consumers, producers, and government.Consumers maximize their lifetime utility function taking prices as given (Subsec.2.1).Producers maximize profits supported by the prices as described in Subsec.2.2.Government redistributes capital through taxes and transfers (for details see, e.g.[5]).International trade is described by the Armington model as described in Subsec.2.3.Prices are determined by the markets clearing conditions for production factors, intermediate and final goods (Subsec.2.4).The first-order optimality conditions for the agents and supply-equals-demand conditions for markets form a system of nonlinear equations.A solution to this system of equation is called the general equilibrium.

Consumers side
In each of the N R regions the utility function of the representative household is given by the discounted lifetime consumption , is the substitution parameter) and µ jt is the preference coefficient (for details of calculating µ jt see, e.g., [5]).The capital dynamics is where k t is capital (k 0 > 0), x t is investment, δ ∈ (0, 1) is the capital depreciation coefficient, 1 + ν t = n t+1 /n t is the population growth coefficient (ν t is the growth rate).The budget constraint is where p jt is the price of the jth consumer good, q t is the prices of investments, ω t is the wage rate, r t is the rental rate of capital, g t is the government transfers, l t is the labor supply, θ t and φ t are the tax rates on capital and labor incomes, respectively.Here the quantities c jt , k t , x t , l t and g t are given in per capita terms.Taking prices as given, the representative household maximizes the utility, where the consumption composite and price index are The transversality conditions lim t− →∞ λ t k t = 0, where λ t is the Lagrange multiplier, guarantees that the optimal trajectory (c t , k t , x t ) exists and is unique (see, e.g., [9]).

Producers side
Firms are aggregated into sectors that produce final goods (N C consumer goods and one "investment good") and intermediate goods (N E energy goods and the rest, which we call materials).The total number of production sector is Production level of the good X is defined by the constant elasticity of substitution (CES) function where K is capital, L is labor, Ē is energy composite and M is materials (unlike small letters that indicate the per capita values, capital letters denote the totals).Here G I , I = K, L, Ē, M , are the productivity factors and the coefficient γ X normalizes the production shares α I to unity.Both productivity factors and production shares can be sector-and time-dependent.(Current version of the PET model [8] also has land as a production factor but, for simplicity, we do not consider it here.)At each time moment, the producer of the good X maximizes profit, or equivalently, minimizes costs given the level of production (2.4).Here P I is the corresponding price and τ M is the tax on the use of materials (for brevity, we omit the time index).
The minimal cost for problem (2.4) and (2.5) is given by P X X, where The cost minimizing input-output ratios A I X = I/X for I = K, L, Ē are given by and for I = M the ratio is given by Since the PET model is primarily intended for energy economics analysis, it is detailed in the energy sector, where E i , i = 1, N E are different energy types.Solving the cost-minimization problem given the level of production (2.6), we derive the price of the energy composite, and the input-output ratios where τ E i , i = 1, . . ., N E , are the taxes on the use of energy.

Intermediate goods demand
Production has a nested structure.Therefore, calculation of the intermediate goods demand requires a recursive procedure.We derive the necessary formulae first for the one-region model and then for the multiregion case.

One-region case
To explain the main idea, we first consider the market in which intermediate goods are aggregated into one good, which we call materials M .In this case, according to the nested production structure shown in Fig. 1, demand for materials is given by where the first term corresponds to the portion of materials used in production of the final good X(K, L, M ), the second term corresponds to the portions of materials used in production of materials M (K, L, M ) one level down, etc. Calculating the sum of the geometric series, we obtain The latter means that demand for materials is equal to the amount of materials needed to produce the final good and amount needed to produce the materials themselves.
Next, we consider the production (2.4) with two intermediate goods, energy and materials.In this case, the aggregate demand for materials is given by This formula describes the sum over layer of the nested production structure (Fig. 2).Each expression in square brackets corresponds to a particular layer.Rearrangement of the terms in square brackets gives Similarly, for energy we obtain Defining y = (A M X X, A E X X) T and we write expressions (2.10) and (2.11) as a matrix series: , where I is the unity 2 × 2-matrix.Summing the series, we have (2.12) Equation (2.12) can be written in the form where Note that equation (2.13) is the same as equation (2.9) we obtained with one intermediate good.It is the dimensionality of this equation and form of the vectors Z and Y and matrix A that change when we change the number of intermediate goods.

Multiregion case
In this subsection we obtain the intermediate goods demand in the multiregion economy with trade.
International trade is described by the Armington model (see, e.g.[10]).It is based on the assumption that the same goods produced in different regions are not perfect substitutes but can be aggregated according a certain rule (usually a CES function).The Armington model enables the representation of markets in which domestically produced goods keep a share of domestic markets even though their price is higher than the price in other regions, and in which different exporters co-exist even if they have different prices.Same as in the previous subsection, first we consider the market with only one intermediate good (Fig. 3).Then M M 1 , . . ., M N R aggregates materials M 1 , . . ., M N R from N R regions (Fig. 4).Similarly to the problem (2.5) and (2.4), we consider

M M
where P 1 , . . ., P N R are the export prices.Then the minimum of the cost function is equal to P M M , where The cost minimizing input-output ratios are given by Similarly to relation (2.8), we have , we write where we set  In the case of production (2.4) with two intermediate goods, energy and materials (Fig. 5), the vector Z has the form and the components of equation (2.14) will have the block structure . Matrices A X and A will consist of the input-output ratios for materials and energy, In the PET model the energy composite is the aggregate (2.6) of N E energy types.In this case, Z will be a vector of dimensions (N E + 1)N R (N E energy types plus materials per region).Matrix elements b E ij and A E i E i will be diagonal matrices and A X i will have N E + 1 elements.

Market equilibrium
Aggregate supply for capital K AS and labor L AS are determined by the sums over all regions of n t k t and n t l t , respectively.Aggregate demand for capital and labor are where GP is government purchases and A K GP and A L GP are the government sector input-output ratios of capital and labor, respectively.
An equilibrium is defined by the markets clearing conditions.That means aggregate demand is equal to aggregate supply (in each region and each time t) for the factors of production and final goods, Here X AD is equal to the sums of n t c t (or n t x t ) over all regions, and X AS is the production output.
For the government sector, we require that revenues are equal to expenditures, The set of the optimality conditions for consumers and producers and markets clearing conditions form a system of nonlinear equations that need to be solved.This system of equations depends on consumer quantities, i.e. capital, investment, consumption and government transfers, on the one hand and production costs (prices) and input-output ratios on the other.

Parallel algorithm
Since all other quantities can be obtained explicitly if we know capital K and prices P , the system of equations describing the general equilibrium can be written as The block structure of the system and parallel algorithm for solving such systems were described in detail in our paper [2].Here we briefly recall the main ideas before describing the implementation of the parallel algorithm.The Fair -Taylor method [3] works as follows.Let K s be the sth iterate of capital.To obtain the next iterate of prices P s+1 it is necessary to solve the system with respect to P .To obtain the next iterate of capital K s+2 it is necessary to solve the system with respect to K, and so on.The part of the algorithm that calculates the next iterate of capital (3.2) is implemented as the outer loop.The part that calculates the next iterate of prices (3.1) is implemented as the inner loop.Blocks of the system (3.1) that correspond to different time-periods can be calculated in parallel.To improve the convergence, solution of each block is broken down into two nested loops: the NewtonA-loop for factor prices (P K and P L in N R regions) and the NewtonB-loop for all other prices (goods prices in each region and export prices).The NewtonA-loop has a smaller dimensions, therefore we can use the classical Newton method with backtracking as a solver.For the NewtonB-loop we use a more advanced Krylov subspace method NITSOL (see, e.g.[6,11]), because it has much larger dimensions and it is called more often to calculate the Jacobian for the NewtonA-loop.
The algorithm is described in Fig. 6.The input data of the algorithm is the initial approximations of capital K 0 and prices P 0 and the output is the equilibrium capital K and prices P .The general parameters are the tolerance tol and number of iterations numIt.Parameter T is the time horizon of the model.
There are two types of arrays for storing and processing the economic data: dyn and stor.The first group of arrays corresponds to data at the current time and is used by the inner loop (Fig. 6, lines 6-9), the second is used for storing data over the iterations of the algorithm (outer loop).The variable it is the iteration index and diff is the target error for the outer loop.The lines 8 and 11 in Fig. 6 correspond to the implementation of economic equations and line 13 computes the error using current iterates of capital and prices.
In the OpenMP version, the time steps of the inner loop are performed in the parallel region.All dyn and stor arrays are shared.The arrays with parameters are distributed using copyin clause (Fig. 6, line 4).

Results and discussion
For calculations we use the PET model with N R = 9 regions and time horizon T = 105 years.The total number of production sectors is N X = 10 in each region.As inputs the PET model uses national production and household survey data at the baseyear and long-term population and technical change projections over the whole time period.We use three sets of input data that correspond to socioeconomic scenarios from the Shared Socioeconomic Pathways (SSP) database (for the implementation of SSPs in the PET model, see [8]).We use two supercomputer systems for the model runs.The first one is the Lomonosov supercomputer [12].We use two types of nodes at the Lomonosov: regular node with 12 cores (Intel Xeon X5670 2.93 GHz, 1 Gb/core) and a node with 128 cores (16 Gb/core) with shared memory, the Symmetric MultiProcessing (SMP) node.The second system is the Yellowstone supercomputer [13].At the Yellowstone we use regular node with 16 cores (Intel Xeon E5-2670 2.6 GHz, 4 Gb/core), up to 32 cores with hyperthreading.The model is implemented using Fortran.For the algorithm implementation we use BLAS [14], LAPACK [15] and Fortran implementation of NITSOL [6].For compiling the libraries and our code we use the Intel Fortran Compiler 15 with optimization flag -O3 and standard make-file techniques for building the project.
To study strong scalability of the parallel algorithm we need to increase the computing power while keeping the total problem size constant.This is achieved by running the model with the same initial approximations K 0 and P 0 and same set of numerical parameters (for each SSP) with increasing number of threads.The results show that the speedup of the parallel algorithm grows almost linearly at both supercomputers as the number of threads grows from 1 to about 12-16 (Fig. 7).Overall, we obtain the speedup of about 10 times for a regular node.With further increase of the number of nodes the speedup slows down and saturates (Table 1).Once the number of nodes becomes greater than the time horizon of the model each thread solves one year-block of the inner loop an no more speed up is possible with this algorithm.From Table 1 we see that the maximum speedup is about 22 times but using 64 nodes we already get very close to it.
Fig. 8 shows that, for the number of threads from about one to ten, there is a visible monotone decreases in timing of the outer loop as the algorithm converges (especially after the 50th iteration).This effect can be explained if we look at the timings of different year-blocks of the inner loop (Fig. 9).As the number of iterations increase the algorithm stops computing the Jacobian in the NewtonA-loop using the one from the previous iteration.The number of these "fast" year-blocks of the inner loop is increasing from iteration to iteration.For the 100th iteration of the outer loop the calculation times of more than 60 first year-blocks are close to zero.For the 200th the "fast" year-blocks span almost the whole time horizon T = 105.From Fig. 8 we also see that the timings of the outer loops uniformly decrease with the number of threads increasing.But as the number of threads increases above ten, the timings of the outer loops level out.The reason is that the timing of the inner loop in the parallel algorithm cannot get smaller than the timing of the slowest year-block.From Fig. 9 we see that the number of "fast" year-blocks is increasing as the algorithm converges but there are always some "slow" year-blocks close to the end period T .

Aknowledgments
We are grateful to R. Loft and M. Weitzel for useful discussions of the results.

Figure 1 .
Figure 1.Nested production structure with one intermediate good.

Figure 2 .
Figure 2. Nested production structure with energy and materials.

Figure 3 .
Figure 3. Nested production structure with one intermediate good for the multiregion case.

Figure 5 .
Figure 5. Nested production structure with energy and materials in the multiregion case.

Figure 7 .
Figure 7. Speedup of the model runs for different SSPs.

Figure 8 .
Figure 8. Timing of the outer loop iterations for the SSP3 obtained at the Lomonosov supercomputer.

Figure 9 .
Figure 9. Timing of year-blocks in the inner loop for the SSP3 obtained at the Lomonosov supercomputer.

Table 1 .
Speedup of the model runs for the SSP3 at the SMP node of the Lomonosov supercomputer.