数据库系统作业代写 Query Optimization代写

CSCD43 DATABASE SYSTEM TECHNOLOGY

数据库系统作业代写 In this assignment, you will carry out a number of experiments involving the optimization of queries using the PostgreSQL query optimizer.

Due Friday April 9 Weight: 15% of your ftnal grade

Assignment 3

Query Optimization

1 Outline 数据库系统作业代写

In this assignment, you will carry out a number of experiments involving the optimization of queries using the PostgreSQL query optimizer. The primary focus will be on statistics collected and used by the PostgreSQL query optimizer.

2 Setup 数据库系统作业代写

Download the files csv andtwo.csv from

http://www.cs.toronto.edu/~koudas/courses/cscd43/one.csv http://www.cs.toronto.edu/~koudas/courses/cscd43/two.csv

The tables corresponding to these files can be created by the following SQL statements.

CREATE TABLE One(

id integer, b integer, c integer, d integer);

CREATE TABLE Two(

id integer, a integer);

Load the csv files into the database using the copy command of psql. Create indices on each of the attributes in the database, using the CreateIndex command.
Runthe vacuum analyze command to update the statistics on the tables.数据库系统作业代写
Run a few sample queries to get an idea of the ranges and data distributions on each of the attributes.

3 Evaluation 数据库系统作业代写

3.1 System Catalogs

Each time you run vacuum analyze on PostgreSQL , the system catalogs are updated. The extent of statistics collection is controlled by a variable default_statistics_target that can take values between 0and 1000, with the default value of 10. This variable can be changed using the set command of psql. Eachtime you change the value of this variable, you should run vacuum analyze to update the statistics.

PostgreSQL collects a wide variety of statistics about each attribute. The most important ones are:

the most common values (MCVs) and their frequencies,and
aset of bucket boundaries for an equi-width 数据库系统作业代写

This statistical information can be easily accessed by simple SQL queries on the PostgreSQL system catalogs. The system catalogs of PostgreSQL are well documented and the statistical information can be easily found with a casual websearch.数据库系统作业代写

Answer the following questions.

Usethe PostgreSQL catalog to find the number of distinct values of each of the attributes in table One. Write the SQL query you used to find out this information.
Use the catalog to find out the space used by the relation Two on the disk, in number of bytes. Write downthe SQL query used to find this information.

3.2 Equality Queries

Consider a set (with size at least two) of equality queries of the form

SELECT * FROM One WHERE d = value;

Set value to a high-frequency value of the attribute d. Which query plan is selected by PostgreSQL toexecute this query? (This information can be found by using the explain command).数据库系统作业代写
Setvalue to a low-frequency value of the attribute d. Which query plan is selected by PostgreSQL to execute this query now?
If the two query plans (corresponding to the two cases) are different, try to find the cross-over point interms of frequency, e., the point
- abovewhich one plan is used, and
- below which the other plan is Why does such a cross-over point exist?

3.3 Single Column Range Queries

Generate 10 queries of the form

SELECT COUNT(*) FROM One WHERE c > value1 AND c < value2;

where (obviously) value1 < value2.数据库系统作业代写

Using the default statistics, compare the estimated and actual selectivities of the predicates for each of the 10
Collect more statistics by increasing the default_statistics_target value, and running vacuum analyze
Plot the average estimation error over the set of queries, as a function of the default_statistics_target value (between 10 and500).数据库系统作业代写

The estimation error for a query is computed as follows:

|EstimatedCardinality − ActualCardinality|

ActualCardinality

The average estimation error for a set of queries is the average of the estimation error over all the queries in the set.数据库系统作业代写

For the given attribute (c), what do you think is a good value for the default_statistics_target, with respect to the average estimation error? (i.e., the value where there should be a point beyond which you get diminishing returns in accuracy). Justify your choice using the graph created above.

3.4 MultiColumn Range Queries

Before starting this section, reset default_statistics_target to its default value (10) and run vacuum analyze. Consider a set (of size 10) of queries of the form

SELECT COUNT(*) FROM One

WHERE c > value1 AND c < value2 AND d > value3 AND d < value4;

In general, you should find that the optimizer estimates are well off the actual estimates. This is true for queries that return many values, and for queries that return very few or no values.数据库系统作业代写

Increase the default_statistics_target value as you did in Section 3.3. Plot the average estimation errorfor the queries in a graph
Do the estimates come significantly closer to the actual values? Why or why not? What is the wrongassumption that the optimizer is making? Use information from the graph and/or the Explaincommand to explain why you believe the optimizer is making the assumption you think it is making.

3.5 Simple Joins

As before, reset default_statistics_target to its default value (10) and run vacuum analyze. Consider an id-based join of the form

SELECT COUNT(*) FROM One, Two

WHERE One.id = Two.id AND

<some selectivity conditions on One> AND

<some selectivity conditions on Two>;

Answer the following questions: 数据库系统作业代写

Disablehash-joins and sort-merge joins using the set The optimizer should now be using a nested-loop join (maybe with an index). Which relation is set as the inner and which one is used as an outer (i.e., the smaller or larger relation)? Try to come up with a query that flips the inner and outer relations when more statistics are available for it. Explain what new information led to this change.数据库系统作业代写

2.Answerthe above question when a hash-join is used (instead of a nested-loop one).

3.When does PostgreSQL prefer which join operator (i.e., nested-loop, hash-join, or sort-merge join)? Generate 100 join queries with random single attribute range selectivity conditions. By “ran- dom” we mean uniformly distributed across the space of possible selectivities that a range condition could For each query, run the explain command to extract the optimizers’ choice of join. Make ascatter plot of these queries in 2 dimensions, as described below:

TheX-axis represents the estimated selectivity of the predicate on the inner
TheY-axis represents the estimated selectivity of the predicate on the outer
For each such (x,y) point, mark the position on the graph with a symbol corresponding to which join algorithm is used (nested-loop, hash, or merge-join). Can you see any interesting trends? When is which algorithm used? Note: This question will take forever to complete, if you attempt it by hand. A fairly simple script will make it much easier!数据库系统作业代写

4.Increase the default_statistics_target to a very high number and update the statistics. Run the explain command on the 100 queries, as above. How many of the queries had their plans changed (i.e.,join operator or inner/outer order)? Give an example of a query which had its operator Explain why.

5.Modify the tables to add a primary key constraint on the column id for both the tables. This can be done using the Alter Table command (Look up the documentation to see how). Repeat the experimentyou did in Question Did the choice of plans change for many queries? Now add an additional constraint on Table Two adding a foreign key constraint on Two.id referencing columnOne.id. Again repeat the experiment and see whether this additional information changed any choicesmade by the optimizer. In both cases provide a scatter plot displaying the choice of the optimizer.数据库系统作业代写

4 Submission instructions

Submit a report (pdf/ps) only, containing your answers to all the questions above. Make sure your names/student numbers are on that report.

Good luck!

其他代写：program代写 cs作业代写 app代写 Programming代写 homework代写考试助攻 finance代写代写CS finance代写 java代写代写CS作业 course代写 data代写

合作平台：essay代写论文代写写手招聘英国留学生代写