RandForGen

Overview

The Random Forest Generator (RandForGen) is a tool that creates a random forest for CNF formula classification. Such random forests are used in SAT solvers to determine the type of CNF formula that was handed to the solver, so that the solver can adapt its behavior to solve this formula with optimal performance.

Today, the most common application of classifiers (in general) and random forests (in particular) in the domain of SAT solving is to provide classification capabilities in portfolio SAT solvers. Such a portfolio SAT solver is basically a bunch of solvers. Each of these solvers has different strengths and weaknesses regarding the type of CNF formula it can solve. The classifier determines the type of CNF formula and then hands it over to one of the solvers in the portfolio. The hope is that this solver is then best suited to solve this formula, which should result in a comparatively short solving time.

You might ask yourself: What is a random forest? Well, the term Random Forest was introduced by Tin Kam Ho at Bell Labs, and later Leo Breiman proposed a more sophisticated variant. I suggest reading the Wikipedia article regarding Random Forests as a starter. I will assume that you know what a random forest is for the remainder of this text.

The following article gives a brief overview of RandForGen and explains how it can be used to create random forests for CNF classification. The intended readers of this article are scientists and people who work in SAT solving. In case you want to try RandForGen yourself, you should read the getting started page.

You might also want to take a look at the known issues and bugs page.

RandForGen is basically a collection of Bash-shell scripts (so it only runs on Linux computers), and it is based on the Waffles binaries that are actively maintained by Mike Gashler and Eric Moyer. Check waffles.sourceforge.net for more information on Waffles.

Waffles provides the waffles_learn binary that implements a lot of (supervised) learning algorithms. Among them is the algorithm to learn a random forest. In case you are interested in learning algorithms or classification in general, you should take a look at Waffles.

How RandForGen works

RandForGen works in several steps.

  • 1. Check the preconditions
    The very first thing that RandForGen does is to check your system in order to ensure that it meets the necessary preconditions. In particular, it checks if the Dimetheus binary is available to do the CNF attribute computation. It also ensures that the Linux command line tools awk, bc, cut, and waffles_learn are available. Finally, it also checks that the CNF formulas for classification are available.
  • 2. Create a C header file declaring all the CNF classes
    RandForGen will take a closer look at the CNF path (the default is the directory ./cnfs) and inspect what CNF classes you have in the ./cnfs/application, ./cnfs/crafted, and ./cnfs/random directories. Here, we understand a CNF class merely as a directory containing a bunch of CNF formulas that belong to this specific class.
    For example, all files in
    ./cnfs/random/k3-small-easy/*.cnf
    will be understood as a single class within the random category. In this case, the name suggests that the k3-small-easy class contains a bunch of small and easy to solve uniform random 3-CNF formulas.
    The C header file will then contain the following define that gives an ID and a name to this class.
    #define CNFCLASSES_RES_RAN_K3SMALLEASY 1001
  • 3. Compute the attributes
    After all the available CNF classes have been declared, it is time to compute the attribute vectors for all the CNF files. Such an attribute vector is basically just a very long string of float values that are separated by spaces. The default binary that does the computation is the Dimetheus SAT solver. The resulting attribute vectors are stored temporarily in a file in the ./tmp/ssvs path. Each class has its own file. The file name of the temporary file is basically the class name. In the example given above, you will find all the attribute vectors in ./tmp/ssvs/RAN_K3SMALLEASY.ssv.
    RandForGen might skip a CNF class if it detects that the corresponding .ssv file is already available and contains the same number of vectors as there are .cnf files in the class.
    Finally, RandForGen will combine all the .ssv files into the file ./tmp/attributes.ssv.
  • 4. Translate the attributes into an ARFF file
    The resulting attributes.ssv file will now be translated into a special format called ARFF. This is the input format that the Waffles learning algorithm needs. ARFF files are nothing special. They contain a bit of meta information on the attribute names and their domain (currently, every attribute is treated as a continuous real value). Furthermore, they contain all the attribute vectors as comma separated lists. The last value in such a list is the CNF class that the formula corresponding to this attribute vector belongs to.
  • 5. Retain the most salient attributes
    Waffles has the ability to detect the most salient (most important) attributes for classification. I have currently disabled this, but in theory, we could simplify the ARFF file such that it only contains the subset of attributes that are most helpful for classification. It seems that the attributes that Dimetheus computes are all helpful :D
  • 6. Find optimal parameters to learn the random forest
    In principle, there are two parameters that heavily influence the learning process that creates a random forest. First, we must somehow decide how many trees the random forest must contain (called the tree-number). Second, whenever a new tree is learned, we must somehow decide which attributes the tree should check in the tree-node that is currently created. Waffles needs a sample parameter for this. This parameter (called the sample-number) tells it to pick sample-number many attributes at random, and then select from these attributes the one that best splits the remaining data (for pros: the information gain is computed for all the selected attributes and the one with the highest gain is selected for the tree-node).
    The approach we use here is cross-validation. Basically, we call Waffles and tell it to cross-validate on the given ARFF file, telling it how many trees and how many samples it should use. We test this for a broad range of tree numbers and sample numbers. For each parameter setting, Waffles will learn a random forest and report the estimated accuracy. The higher this is, the better. RandForGen will retain the parameter setting for which the accuracy was highest. As soon as the accuracy is above 95%, it will stop evaluating.
  • 7. Learn the random forest
    Given the optimal parameter settings (and possibly the retained most salient attributes), Waffles will learn the random forest and output it in a special format. This format is great for Waffles, but useless for a SAT solver. However, the data on the random forest is now available. In order to use the data, we tell Waffles to translate it into a human readable form. The random forest will then be available as a list of trees where each tree is basically represented as if-then-else statements.
  • 8. Create a C source file containing the random forest
    RandForGen will use the human readable form of the random forest to create a C source file that basically contains the random forest and any methods necessary to ask the trees for their votes. The idea is that every tree looks at a certain number of attributes and in the end decides that the formula must belong to a certain class. The trees are stored as serialized and compressed strings, each representing a tree in in-order format. There is a generic method that can read through such a string and basically traverse the tree until it determines the tree's vote. The idea is to simply ask all the trees for their vote, and the CNF class with the most votes wins. The CNF formula is then believed to be part of this CNF class.
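To make the serialized-tree idea concrete, here is a minimal C sketch of such a generic traversal method. The encoding is invented for illustration (a pre-order token stream with `n <attr> <threshold>` for inner nodes and `l <classId>` for leaves); RandForGen's actual compressed in-order format differs, so treat this purely as a sketch of the technique.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical serialized format (pre-order, space separated):
   "n <attr> <threshold>"  inner node, followed by left then right subtree
   "l <classId>"           leaf voting for classId                        */

static void tree_skip(const char **s);
static uint32_t tree_vote(const char **s, const float *a);

/* Advance the cursor past one whole subtree without evaluating it. */
static void tree_skip(const char **s) {
    int n = 0; char kind;
    sscanf(*s, " %c%n", &kind, &n); *s += n;
    if (kind == 'n') {
        int attr; float thr;
        sscanf(*s, " %d %f%n", &attr, &thr, &n); *s += n;
        tree_skip(s); tree_skip(s);
    } else {
        unsigned cls;
        sscanf(*s, " %u%n", &cls, &n); *s += n;
    }
}

/* Walk one tree: descend the branch selected by the attribute vector a,
   skip the other branch, and return the class ID found at the leaf. */
static uint32_t tree_vote(const char **s, const float *a) {
    int n = 0; char kind;
    sscanf(*s, " %c%n", &kind, &n); *s += n;
    if (kind == 'l') {
        unsigned cls;
        sscanf(*s, " %u%n", &cls, &n); *s += n;
        return (uint32_t)cls;
    }
    int attr; float thr;
    sscanf(*s, " %d %f%n", &attr, &thr, &n); *s += n;
    if (a[attr] <= thr) {                 /* descend left, skip right */
        uint32_t v = tree_vote(s, a);
        tree_skip(s);
        return v;
    }
    tree_skip(s);                         /* skip left, descend right */
    return tree_vote(s, a);
}

/* Convenience wrapper: evaluate one serialized tree for one formula. */
uint32_t eval_tree(const char *tree, const float *a) {
    const char *cur = tree;
    return tree_vote(&cur, a);
}
```

Because the cursor always ends up just past the tree it walked, the same routine can process a whole forest stored as one concatenated string, asking each tree in turn for its vote.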

What RandForGen provides

RandForGen will provide you with two result files after it has finished the computations.

  • ./res/cnfClasses.h
    This file contains all the CNF class names, basically declaring what types of CNF formulas the random forest can classify.
  • ./res/randomforest_generic.c
    This file contains the random forest and the methods necessary to traverse the trees for providing a vote.

It is interesting to note that the resulting files can be easily integrated into any C project. You must include the cnfClasses.h file somewhere in your code, and you must compile and link the randomforest_generic.c file into your project.

Finally, you need to implement and provide a method called classify_intern_voteForClass(uint32_t) in your program. This method basically just receives the class ID that a tree in the forest votes for. I leave the implementation to you, as I do not know how you want to store votes in your program.

You call randomforest_intern_generic_voteAll() in your program to start the classification process. This method will call the method you implemented in order to signal a vote for a specific CNF class. In the end, every tree has had the chance to vote exactly once and all the votes have been delivered.
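A minimal sketch of the solver side could look as follows. The two names randomforest_intern_generic_voteAll() and classify_intern_voteForClass() come from RandForGen as described above; the vote array, the CNFCLASSES_MAX bound, and the classify_pickWinner() helper are hypothetical names chosen for this example.

```c
#include <stdint.h>

#define CNFCLASSES_MAX 4096u  /* hypothetical upper bound on class IDs */

static uint32_t voteCount[CNFCLASSES_MAX];

/* Called once per tree by randomforest_intern_generic_voteAll():
   each call signals one tree's vote for one CNF class ID. */
void classify_intern_voteForClass(uint32_t classId) {
    if (classId < CNFCLASSES_MAX) ++voteCount[classId];
}

/* Hypothetical helper: after all trees have voted, the class with
   the most votes wins; the formula is believed to belong to it. */
uint32_t classify_pickWinner(void) {
    uint32_t best = 0;
    for (uint32_t c = 1; c < CNFCLASSES_MAX; ++c)
        if (voteCount[c] > voteCount[best]) best = c;
    return best;
}
```

In your solver you would call randomforest_intern_generic_voteAll(), which triggers one classify_intern_voteForClass() call per tree, and then read off the winning class ID.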

Using RandForGen

In case you are interested in using RandForGen you should check the getting started page. It contains a step by step walk-through for setting it up and states various preconditions that must be fulfilled in order to run RandForGen.

AdaptGen

As part of the RandForGen toolbox, AdaptGen will provide a parameter adapter that can be used in conjunction with the CNF classification. In other words, as soon as you know what sort of formula you are working on, you can adapt all sorts of parameters (provided you know an optimal setting), which will then be enforced to speed up the solving process.

AdaptGen needs parameter files. Each CNF class has its own parameter file. The syntax of these files is very simple. Assume you have the CNF class
./cnfclasses/random/k3n100000r4200.
Furthermore, assume that you know the optimal SLS parameter setting for the Dimetheus solver when solving these formulas in the SLS guide. Let these parameters be as follows.

prepStrategy PREP_STRATEGY_RAND
inpStrategy INP_STRATEGY_NULL
searchStrategy SEARCH_STRATEGY_SLS
slsPickAndFlipVarRule 1
slsProbsatCB 2.06
slsProbsatCBShift 0.9

Now all you need to do is create the parameter file
./cnfclasses/random/k3n100000r4200/parameters.par containing the following content.

GUIDE_SLS
prepStrategy PREP_STRATEGY_RAND
inpStrategy INP_STRATEGY_NULL
searchStrategy SEARCH_STRATEGY_SLS
slsPickAndFlipVarRule 1
slsProbsatCB 2.06
slsProbsatCBShift 0.9
DONE

AdaptGen will extract this information and create a parameter setting for the solver. You can provide multiple guide settings in a parameters file. Parameter adaptation is a complex task, and several situations can occur that must be resolved. For example:

  • I have no parameters.par file for one of my classes. What will happen?
    Nothing. In case no optimal parameters are known for a specific class, no adaptation is performed.
  • I have a parameters.par file, but some guide is not present in it. When I run Dimetheus in that guide, what will happen?
    Nothing. In case you enforce a guide when running Dimetheus with the guide parameter, and there is no setting for this guide, then no adaptation is performed.
  • I have multiple guide settings in a parameters file, but do not call Dimetheus with the guide parameter. What setting will it adapt to?
    Running Dimetheus without the guide parameter means that it runs in the default guide called AUTOADAPT. In this guide, the solver will try to adapt as best as possible. It will, in case multiple guide settings are available in the parameters.par file, pick the first guide it encounters. The first guide in a parameters file is assumed to be the best one to solve a formula from the given CNF class.
  • I have a parameters file and a specific guide is present in it. I run Dimetheus in that guide, but provide one of the parameters on the command line that is also present in the file and should be adapted. What will happen? Does Dimetheus pick the command line value or the value from the guide?
    Dimetheus will use what was provided on the command line. Any parameter found in the parameters file has an adaptation precondition. You can check all the preconditions by running Dimetheus with option -s. You will see lines starting with
    c PARAMS: PARAM-ADAPT-PRECONDITION-PARAMS:
    The first thing following that prefix is a precondition. Everything else in that line denotes a parameter that can only be adapted when the precondition is true. So when you start Dimetheus and provide a parameter from that class, the precondition will be set to false and none of the parameters in that class can be adapted anymore.

As we have seen above, the generic random forest source-code provided by RandForGen can be easily integrated into other C projects. What about the result of AdaptGen? Can I use Dimetheus and AdaptGen to create an adapter that I can use in my own solver? That is a tricky question. In principle, yes.

The problem is that your solver needs the GUIDANCE phase and the guide parameter. Just introduce "param_guide" in your sources and set it to the same value all the time; that already does the trick for the guide. GUIDANCE must then check all the parameters provided on the command line to figure out the preconditions for adaptation. The simplest approach is to provide a precondition for every parameter and always set it to true (e.g., introduce "param_adaptBla" for parameter "param_bla" in your source code, which is set via "-bla" on the command line).

Your solver needs to output what guides can be adapted (in this case the only one you have), and all the parameters with their respective preconditions. In order to see how Dimetheus does it, run the solver with -s and look at the output. Then check the sources (the file is ./dimetheus/src/guidance/guidance.c) to see how I compute the preconditions.
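The precondition pattern described above might be sketched in C as follows. All names here (param_cb, param_adaptCb, parseArgs, adapt_cb) are made up for the sketch; only the idea mirrors Dimetheus: a parameter seen on the command line flips its precondition to false, so the adapter leaves it alone.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parameter with its adaptation precondition.
   param_adaptCb starts true; providing -cb on the command line
   sets it to false, so the adapter must not override the user. */
static double param_cb      = 2.3;
static bool   param_adaptCb = true;

/* Parse command line arguments; any explicitly provided parameter
   disables its own adaptation precondition. */
static void parseArgs(int argc, char **argv) {
    for (int i = 1; i < argc - 1; ++i) {
        if (strcmp(argv[i], "-cb") == 0) {
            param_cb = atof(argv[++i]);
            param_adaptCb = false;   /* user override blocks adaptation */
        }
    }
}

/* Called during a GUIDANCE-like phase with the value taken from
   the parameters.par file of the classified CNF class. */
static void adapt_cb(double value) {
    if (param_adaptCb) param_cb = value;
}
```

With this pattern, values from a parameters.par file are applied only when the user did not already fix the parameter on the command line, which is exactly the behavior described for Dimetheus above.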

If you really really want to use RandForGen in conjunction with AdaptGen to extend your solver, it is probably best if you write me an email. Of course, you can also write me in case you have questions or encounter any problems.

(C) 2014 Oliver Gableske
www.gableske.net