Plan creation with PDQ

In order to run PDQ's planner you have to create a configuration folder with the three following files: schema.xml, query.xml and case.properties. The first two files describe the input schema and the input query, while the third one specifies planning and logging properties. You can find examples of configurations folders here and inside the pdq-regression project, under directory "test".
The syntax to run the PDQ planner is
java -jar pdq-benchmark.jar planner -i configuration/folder/path [logging options] [planning options/cost options]

Logging options

-w the best plan found is saved under the input configuration folder.
-W all the plans that are built during planning are saved under the input configuration folder.

Planning options

The planning options are prefixed by –D and specify the properties of the planner that will eventually search the space of proofs/plans. Some of the most important options are listed below:
-Dplanner_type=[LINEAR_GENERIC, LINEAR_OPTIMIZED, LINEAR_KCHASE, DAG_GENERIC, DAG_OPTIMIZED]

LINEAR_GENERIC: Linear planning algorithm that searches the space of plans/proofs exhaustively and performs reasoning every planning step. This algorithm, as well as, the LINEAR_OPTIMIZED algorithm is described in
Michael Benedikt, Balder ten Cate, and Efthymia Tsamoura. "Generating low-cost plans from proofs", In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '14). NY, USA, pp. 200-211, 2014.
LINEAR_OPTIMIZED: Linear planning algorithm that prunes the search space without compromising optimality. It employs the "dominance", "success dominance", "equivalence" and "path post-pruning" optimisations. Similar to LINEAR_GENERIC it performs reasoning every planning step. You can turn off/on path post-pruning using the option -Dpost_pruning_type=REMOVE_ACCESSES. By default, "path post-pruning" is turned on.
LINEAR_KCHASE: Linear planning algorithm that does reasoning every k planning steps. It employs the "dominance" and "success dominance" optimisations also applied in the LINEAR_OPTIMIZED explorer but not the "path post-pruning" optimisation.
DAG_GENERIC: Bushy planning algorithm that is based on the dynamic programming paradigm and prunes down the search space employing the "dominance" and "success dominance" optimisations. This algorithm, as well as, the DAG_OPTIMIZED algorithm is detailed described in
Michael Benedikt, Julien Leblay, and Efthymia Tsamoura. "Querying with access patterns and integrity constraints", Proceedings of the VLDB Endowment 8,6 pp. 690-701, 2015.
DAG_OPTIMIZED: Bushy planning algorithm that is based on the dynamic programming paradigm and prunes down the search space employing "dominance" and "success dominance" optimisations. The difference with the DAG_GENERIC algorithms is that it performs reasoning in parallel.

When using the DAG* explorers you can specify the "dominance" (-Ddominance_type) and "success dominance" (-Dsuccess_dominance_type) types to OPEN or CLOSED. CLOSED domination requires both input configurations/plans that are compared to be closed (i.e., have no input parameters). OPEN does not impose this requirement but it is not guaranteed to return the optimal bushy plan. You can further restrict the type and shape of configurations/plans that the planner visits. This is done through the -Dvalidator_type and -Dfilter_type options.
-Dvalidator_type=[DEFAULT_VALIDATOR,APPLYRULE_VALIDATOR, DEPTH_VALIDATOR, RIGHT_DEPTH_VALIDATOR, APPLYRULE_DEPTH_VALIDATOR, LINEAR_VALIDATOR]

DEFAULT_VALIDATOR: No shape or type restriction.
APPLYRULE_VALIDATOR: Requires at least one of the input configurations to be an ApplyRule one.
DEPTH_VALIDATOR: Restricts the depth of the plans visited.
RIGHT_DEPTH_VALIDATOR: Restricts the depth of the RHS plans used.
APPLYRULE_DEPTH_VALIDATOR: Combination of APPLYRULE_VALIDATOR and DEPTH_VALIDATOR.
LINEAR_VALIDATOR: Restricts the shape of plans to left-deep ones.

-Dfilter_type=[FACT_DOMINATED_FILTER, NUMERICALLY_DOMINATED_FILTER]

FACT_DOMINATED_FILTER: Removes the fact dominated configurations after each exploration step.
NUMERICALLY_FACT_DOMINATED_FILTER: Removes the numerically fact dominated configurations after each exploration step.

-Dcontrol_flow=[BOTTOM_UP, TOP_DOWN]

BOTTOM_UP: Pull control flow allowing tuples to move bottom-up only during execution.
TOP_DOWN: Pull control flow allowing tuples to move top-down and bottom-up during plan execution.

-Dreasoning_type=[BLOCKING_CHASE/ RESTRICTED_CHASE/ KTERMINATION_CHASE/ BOUNDED_CHASE]

BLOCKING_CHASE: Chase algorithm with blocking. It is used in cases where the chase algorithm does not terminate.
RESTRICTED_CHASE: Restricted chase algorithm as defined in
Andrea Cali, Georg Gottlob, and Michael Kifer. "Taming the infinite chase: Query answering under expressive relational constraints", In KR, pp.70-80, 2008.
KTERMINATION_CHASE: Restricted chase algorithm, where the number of rule firing rounds is bounded by a constant K.

We use a database to detect homomorphisms during chasing. By default, we use the Apache Derby database which is embedded within the pdq-benchmark.jar. Users can also use the MySQL database for this purpose; however, in this case, they have to install MySQL in advance and specify the following parameters in the case.properties file

-Dconnection_url=jdbc:mysql://localhost/
-Ddatabase_name=[name of the chase database]
-Ddatabase_user=[username]
-Ddatabase_password=[password]

The complete set of planning options can be found in the PlannerParameters.java file under the pdq-planner project.

Cost options

The cost options are also prefixed by –D and specify the cost function that will be used to evaluate the cost of each plan found. Some of the most important options are listed below.
-Dcost_type=[SIMPLE_CONSTANT, SIMPLE_RANDOM, SIMPLE_GIVEN, SIMPLE_COUNT, BLACKBOX, BLACKBOX_DB, INVERSE_LENGTH, SIMPLE_ERSPI]

SIMPLE_CONSTANT: Estimates the cost as the sum of the cost of all accesses in a plan, where access costs are provided externally.
SIMPLE_RANDOM: Estimates the cost as the sum of the cost of all accesses in a plan, where access costs are assigned randomly.
SIMPLE_GIVEN: Estimates the cost as the sum of the cost of all accesses in a plan, where access costs are measured automatically from the underlying datasources.
SIMPLE_COUNT: Estimates the cost as the sum of all accesses in the plan.
BLACKBOX: Estimates the cost using the function described in "Database Management System" by Ramakrishnan and Gehrke.
INVERSE_LENGTH: Estimates the cost as the number of atoms in a plan.
SIMPLE_ERSPI: Estimates the cost as the sum of the estimated result size per invocation associated to each access method used in a plan. When using the SIMPLE_ERSPI cost function you must provide a file containing the relation/column cardinality statistics. Examples using this cost function can be found here and under pdq-regression/test/dag/bio/SIMPLE_ERSPI. In the other cost functions, the metadata can be embedded in the schema.xml files.
BLACKBOX_DB: When the data reside on PostgreSQL users do also have the option to estimate the cost by translating the query to SQL and asking its cost to PostgreSQL database. When using this option, users should also specify the following parameters

-Dblack_box_connection_url=jdbc:postgresql://localhost/
-Dblack_box_database_name=[name of the database where data reside]
-Dblack_box_database_user=[username]
-Dblack_box_database_password=[password]

here