Query Evaluation Optimization in a Distributed Database using Data Reorganization

A distributed database is stored on many sites and usually uses the relational data model. It is composed by global tables. A global table of a distributed database can be horizontally fragmented. The fragments can be replicated on the sites of the distributed database. In order to execute a distributed query, data is transferred between sites. In the paper L. Tambulea, A. S. Darabant, V. Varga: "Query Evaluation Optimization in a Distributed Database using Data Reorganization" we propose two methods (mathematical models) to minimize data transfer cost based on information about database storage (the fragments of the database, the dimension of fragments, the sites where the fragments are stored) and query execution (sites where the queries are initiated, for every query the fragments used, an estimation of the queries results obtained by the evaluation of relational algebra operators). These information are generated by Oracle procedures and are stored in tables. Some views were defined in order to obtain information about fragments, sites and queries.

If the fragments which are necessary to execute a relational operator are not stored in the site where the execution of the operator is initiated, the fragments have to be transferred, which implies a data transfer cost. These data transfer costs needed for query executions are determined by some views, which are based on information about fragments, sites and queries.

The proposed methods solve the next problems:

  1. Create new replicas for fragments, called Induced Fragment Replication
  2. Redistribution of existing fragments, called Fragment Re-allocation

For each method a linear programming problem is genarated. These mathematical programming problems are generated by procedures from Oracle package.

In the following there are presented the resources used, the phases of processing and the results obtained from experimental tests of the proposed methods.

Resources used

Commands for creating tables
Commands for creating temporary tables
Definition of views
Definition of views for determining the transfer cost of the data
Oracle package to generate rows for the tables and linear programming problems

Phases of processing

Phases of processing for obtaining experimental results

Results obtained for the Induced Fragment Replication problem

Nr.Generating dataAvailable spaceGenerated filesData transfer final cost
1 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(50, 20, 10000, 100)

Initial data transfer cost: 103 MBytes
update sites set dimlib=dims; The mathematical programming problem (817 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 78 MB
(75.65% from initial cost)
update sites set dimlib=trunc(dims*1.5); The mathematical programming problem (817 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 66 MB
(64.04% from initial cost)
2 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(100, 20, 100000, 1000)

Initial data transfer cost: 9612MByte
update sites set dimlib=dims; The mathematical programming problem (1538 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 6700MB
(69.70% from initial cost)
update sites set dimlib=trunc(dims*1.5); The mathematical programming problem (1538 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 5269MB
(54.81% from initial cost)
3 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(200, 20, 100000, 10000)

Initial data transfer cost: 103397MBytes
update sites set dimlib=dims; The mathematical programming problem (3179 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 74157MB
(71.72% from initial cost)
update sites set dimlib=trunc(dims*1.5); The mathematical programming problem (3179 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 60133MB
(58.16% from initial cost)
4 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(300, 20, 100000, 10000)

Initial data transfer cost: 100608MBytes
update sites set dimlib=dims; The mathematical programming problem (4703 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 70179MB
(69.75% from initial cost)
update sites set dimlib=trunc(dims*1.5); The mathematical programming problem (4703 variables)
The solution of the problem
Insert commands generated by the problem
Cost: 55845MB
(55.51% from initial cost)

Results obtained for the Fragment Re-allocation problem

Nr.Generating dataAvailable spaceGenerated filesData transfer final cost
1 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(50, 20, 10000, 100)

Initial data transfer cost: 102MBytes
update sites set dimlib=dims; The mathematical programming problem (1000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 100MB
(97.72% from initial cost)
update sites set dimlib=2*dims; The mathematical programming problem (1000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 76MB
(74.26% from initial cost)
update sites set dimlib=trunc(dims*2.5); The mathematical programming problem (1000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 64MB
(62.62% from initial cost)
2 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(100, 20, 10000, 1000)

Initial data transfer cost: 9612MBytes
update sites set dimlib=dims; The mathematical programming problem (2000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 9480MB
(98.62% from initial cost)
update sites set dimlib=2*dims; The mathematical programming problem (2000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 6621MB
(68.88% from initial cost)
update sites set dimlib=trunc(dims*2.5); The mathematical programming problem (2000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 5222MB
(54.33% from initial cost)
3 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(200, 20, 100000, 10000)

Initial data transfer cost: 103397MBytes
update sites set dimlib=dims; The mathematical programming problem (4000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 101206MB
(97.88% from initial cost)
update sites set dimlib=2*dims; The mathematical programming problem (4000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 73096MB
(70.69% from initial cost)
update sites set dimlib=trunc(dims*2.5); The mathematical programming problem (4000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 59434MB
(57.48% from initial cost)
4 Commands for generating data
(fragm.nr., nodes nr., queries nr., max.reps)=(300, 20, 100000, 10000)

Initial data transfer cost: 100608MBytes
update sites set dimlib=dims; The mathematical programming problem (6000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 97658MB
(97.07% from initial cost)
update sites set dimlib=2*dims; The mathematical programming problem (6000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 68857MB
(68.44% from initial cost)
update sites set dimlib=trunc(dims*2.5); The mathematical programming problem (6000 variables)
The solution of the problem
Insert commands generated by the problem 2
Cost: 54882MB
(54.55% from initial cost)