
Tetrad Overview
What is Tetrad?
Tetrad is a program for creating, simulating data from, estimating, testing, predicting with, and searching for causal/statistical models.
The aim of the program is to provide sophisticated methods in a friendly interface requiring very little
statistical sophistication of the user and no programming knowledge. It is not intended to replace flexible
statistical programming systems such as Matlab, Splus, or R. Tetrad is freeware that performs many of the
functions of commercial programs such as Netica, Hugin, LISREL, and EQS, as well as many
discovery functions these commercial programs do not perform.
Tetrad is unique in the suite of principled search ("exploration," "discovery") algorithms it provides--for
example its ability to search when there may be unobserved confounders of measured variables, to search
for models of latent structure, and to search for linear feedback models--and in the ability to calculate
predictions of the effects of interventions or experiments based on a model. All of its search procedures
are "pointwise consistent"--they are guaranteed to converge almost certainly to correct information about
the true structure in the large sample limit, provided that structure and the sample data satisfy various
commonly made (but not always true!) assumptions.
Tetrad is limited to models of categorical data (which can also be used for ordinal data), to linear
models ("structural equation models") with a Normal probability distribution, and to a very limited class of
time series models. The Tetrad programs describe causal models in three distinct parts or stages: a picture,
representing a directed graph specifying hypothetical causal relations among the variables; a specification
of the family of probability distributions and kinds of parameters associated with the graphical model; and
a specification of the numerical values of those parameters.
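To make these three stages concrete, here is a minimal sketch in Python; the variable names and parameter values are hypothetical, purely for illustration:

```python
# Stage 1: a directed graph of hypothetical causal relations (var -> parents).
graph = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}

# Stage 2: a parametric family for that graph (here, linear Gaussian),
# naming the parameters without giving them values.
parameters = ["b(X1->X2)", "b(X1->X3)", "b(X2->X3)",
              "var(e1)", "var(e2)", "var(e3)"]

# Stage 3: numerical values for those parameters.
values = {"b(X1->X2)": 0.8, "b(X1->X3)": -0.3, "b(X2->X3)": 1.2,
          "var(e1)": 1.0, "var(e2)": 1.0, "var(e3)": 1.0}
```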
The program and its search algorithms have been developed over several years with support from the
National Aeronautics and Space Administration and the Office of Naval Research. Joseph Ramsey has
implemented most of the program, with substantial
assistance from Frank Wimberly. Executable and source code for all versions of Tetrad IV, and this
manual, are copyright 2004 by Clark Glymour, Richard Scheines, Peter Spirtes, and Joseph Ramsey.
The program may be freely downloaded and used without permission of copyright holders, who reserve
the right to alter the program at any time without notification.
The Tetrad suite of programs permits the user to do any of the following:
1. Generate a graphical statistical/causal model of any of the following kinds:
1. Models for categorical data (Bayes networks);
2. Models for continuous data with variables having a Gaussian (Normal) joint probability
distribution;
3. Models for a limited class of time series representing genetic regulatory networks.
2. Estimate parameters of models of the following kinds:
1. Models for categorical data in which all variables are recorded in the data (no "latent" variables);
2. Models for continuous data with or without latent variables;
3. Test the fit of models of any of the kinds listed in 2. above.
4. Simulate data from a model of any of the kinds listed in 1. above.
5. Update models of categorical data; i.e., compute the probability of any variable in the model
conditional on any set of values for other variables in the model.
6. Predict the probability of a variable in a model (without latent variables) from interventions that fix
or randomize values for any set of other variables in the model.
7. Search for models:
1. Of categorical data with or without latent variables;
2. Of continuous, Gaussian data with or without latent variables.
8. Compare graphical features of two models.
9. Find alternative models statistically equivalent to any given model without latent variables.
10. Select variables within a dataset for classifying values of cases of another variable in the dataset.
11. Classify new (or old) cases using the variables selected in 10. above.
12. Assess the accuracy of classification.
Manual
Why Doesn’t Tetrad...?
For Further Help
References
Tetrad Manual
Tetrad is organized as a main workspace in which one or more sessions can be constructed or edited. Each
session can contain a number of boxes, which can themselves contain modules (statistical models, data
sets, search algorithms, etc.). There are also several functions that appear in more than one box or are used
to manage sessions in general.
For information on the main workspace, see Main Workspace Explained.
For information on each box, see Each Box Explained. You will find there explanations of the modules
that can be placed into each box.
For information on common tasks, see Some Common Tasks. These items are linked to from module
explanations, but they are all made available here in case you want to peruse them.
There are also some definitions of terms. To look at these, see Some Definitions.
The Main Workspace Explained
The main workspace in Tetrad consists of a workbench for building sessions, a toolbar for selecting types
of boxes to add to the session, and a menu bar for performing operations like loading and saving sessions,
and so on.
Tetrad 4 works with drag and drop objects, dialog boxes, drop down menus, and clicking. Occasionally
you will need to type something brief--numerals and names.
Most Tetrad operations are performed with a single left click or a double left click. Right clicks are used
for additional information or less common options.
The main workspace looks like this:
For an explanation of each part of this workspace, follow the link:
For an explanation of the title bar ("Tetrad version"), see Tetrad Versioning.
For an explanation of the menubar and the various functions it makes available, see Tetrad Menubar.
For an explanation of the left-hand toolbar ("Graph," "Parametric Model," etc.), see Tetrad Toolbar.
For an explanation of the main workspace area for a session (with the sample boxes "Graph1,"
"PM1," etc.), please see How To Build a Session..
Tetrad Versioning
Beginning with version 4.3.2-1, saved sessions (".tet" files) are intended to be backwards compatible. That
is, sessions saved out using one version are intended to be loadable using versions of Tetrad with equal or
higher version numbers. There may be a few problems; hopefully these will get ironed out quickly.
You might want to know the exact version of Tetrad you are running for two reasons:
1. New features are regularly added to Tetrad, and bugs in old features are removed. You may want
to use a version of Tetrad with a specific version of some algorithm, or you may simply want to sync
your version with a version that someone else is using, to avoid any discrepancies.
2. You may experience some vestigial problems with loading, which we can fix.
If you do experience a problem loading a session that you’ve saved, please let us know by emailing the
session itself (the ".tet" file) to Joe Ramsey at [email protected]. We will attempt to fix the
problem and either post a new file for that version or else let you know which later version you can load
the session in. We sincerely appreciate your help on this.
To find out which exact version of Tetrad you are using, you may use one of three methods.
Method 1
In most operating systems, the version number is displayed in the title bar above the application. In the
example below, it’s "4.3.2-3". The "4" in this case is the major version, the first "3" the minor version, the
"2" the minor subversion, and the "3" the incremental release number. This appears in the title bar like
this:
Method 2
You can also find the version number by selecting "Session Version" from the main File menu:
A dialog will appear telling you the version number and date of the last time the current session was saved,
and the current version of Tetrad:
Each saved ".tet" file is stamped with a version and a date in such a way that, even if the file itself cannot
be loaded, at least this meta-information can be loaded. So if you have a file that won’t load, you can still
see the version it was saved under and the date it was saved. This allows you to go to the Tetrad website
and launch the version of Tetrad that was used to save out the file. You can then load the file under that
version.
Method 3
The current version number is also displayed in the "About Tetrad" menu item in the "Help" menu.
Tetrad Menubar
The main menu bar in Tetrad lets you manage Tetrad sessions, lets you perform some editing operations
on the current session, and gives you access to the help functionality (this), among other things. It looks
like this:
The function of each item in each menu is described below. If you were looking for help for popup menus
for boxes, see Popup Menus.
The File Menu
Here’s what each item does in the File menu.
New Session - creates a new, empty Tetrad session. Your previous session is still available--just go to
the Session menu--see below.
Open Session - prompts you for a saved Tetrad session file (with suffix ".tet") and opens it.
Close Session - closes your main workbench and leaves you in Tetrad or in a stored session in
memory. If there are no sessions, you get a blank Tetrad screen with no workbench. You can’t do
anything in the program until another session is opened or a new one is created.
Save Session - saves all of your work, data, the whole thing, to a file (suffix ".tet") so that you can
call it back up later. You are asked to give the session a name if you haven’t done so already.
Save Session As - Does the same thing as Save Session, but always asks you for a file name.
Session Version - Gives version information for the current session. See Tetrad Versioning.
Save Screenshot - Saves a screenshot (PNG format) of the entire Tetrad Session, in case you don’t
want to fool around with Photoshop or GIMP.
Save Session Graph Image - Saves an image of just the session graph in the white workspace area with
the boxes and arrows in it. Leaves out the menubar and toolbar. This is useful if your session flowchart is
larger than your screen.
Exit - Gets you out of Tetrad the polite way.
The Edit Menu
Here’s what each item does in the Edit Menu.
Cut - Cuts out any selected boxes from the workbench (together with any edges between them) and
allows you to paste them. For advice on how to select groups of boxes, see Selecting Groups of
Nodes.
Copy - Same as Cut, but leaves the original nodes in the session.
Paste - Pastes the cut or copied boxes slightly down and to the right of the original ones, either in the
current session or in some other session. Multiple pastes are supported; if you paste multiple times,
new copies appear down and to the right of the originally selected boxes.
The Session Menu
Tetrad can keep multiple sessions open at once, but only one workbench is visible at a time. "Sessions"
lists all of your open sessions and lets you switch to the workbench of whichever session you want. Your
sessions are automatically given a name, e.g., "Untitled1.tet" unless you have saved them with a name.
The Template Menu
In using Tetrad you will put together a sequence of boxes connected by flowchart arrows. (See How to
Build a Session.) Some sequences are so commonly used that Tetrad will insert the entire sequence for
you--boxes and arrows--in the workbench all at once. For details, see Using Templates.
The Help Menu
The items listed do the following.
Tetrad Manual - That’s this. You already know about it.
About Tetrad [version-number] - Shows information about the project in general.
Warranty - Warranty information displayed as per requirements of the GNU General Public License.
License - License information displayed as per requirements of the GNU General Public License.
Popup Menus
If you right click on any box in the session workbench (a Graph box, a Data box, etc.), a popup menu will
be displayed with a number of options. The options are as follows.
1. Create Model.
2. Edit Model.
3. Destroy Model.
4. (Re)create Descendant Models.
5. Rename Box.
6. Clone Box.
7. Delete Box.
8. Set Repetitions for Simulation.
9. Run Simulation.
10. Info for this Box.
Create Model.
"Create" does the same thing as double clicking on the Graph box the first time--if you have not yet
created a graph. Otherwise it does nothing.
Edit Model.
"Edit" does the same thing as double clicking on the Graph box after you have already created a graph--it
opens a graph editing window.
Destroy Model.
Removes the model that the box contains and lets you create another one from scratch. Any models
downstream will be destroyed as well, since they depend on the model being destroyed.
(Re)create Descendant Models.
Allows you to quickly create (or recreate if they already exist) a model and all of the models downstream
of it. This is helpful if you’re just playing around with random models but can be frustrating if you
accidentally overwrite models you’ve spent a long time creating. Therefore, a warning is displayed before
any changes are made to make sure you really want to make the changes.
Rename Box.
This lets you rename the session box you right-clicked on. This is useful if you have a number of similar
session boxes on the workbench and would like to keep track of which is which.
Clone Box.
Clones the box you’ve right-clicked on so you can edit the copy separately.
Delete Box.
Deletes the box you’ve right-clicked on, destroying its model and removing it from the session altogether.
Cannot be undone.
Set Repetitions for Simulation.
Sets the number of times a node is "executed" (i.e., destroyed and randomly recreated) in simulation. Each
time a node is destroyed and recreated, all of the children of the node are executed as well.
Run Simulation.
Executes (i.e. destroys and recreates) the node right-clicked on. This starts a cascade of nodes being
destroyed and recreated downstream. This can be useful if you’d like to know how well a particular search
performs on graphs with 10 nodes in them, with 20 edges selected randomly, for instance.
Info for this Box.
Lets you choose which of the models for the box being right-clicked on you’d like help for, and which of
the possible parent combinations of that box you’re interested in.
Three things are important:
1. The "Unoriented" and "Half-Oriented" buttons are for creating theoretical graphical objects of various
kinds. Do not use these buttons for ordinary modeling. Currently, none of the statistical procedures in the
program work for graphs with such edges.
2. If you create a model with a latent variable and later use the model to generate data, values for the
latent variables will not be shown.
3. If you have introduced any boxes that depend on a Graph box, changing the graph will alter the contents
of all boxes downstream in the flowchart from the Graph box. Often it is easier to simply create a new
Graph box in the same main workspace window--there is no limit to how many graph or other boxes you
can have at the same time.
The menu at the top of the Graph window has two options, "File" and "Edit". "Edit" currently does
nothing. "File" gives three options: You can save a graph in a file--but there is no point because we have
not yet implemented a facility to paste the saved graph into a new window. You can introduce the
ALARM network, a fairly complex graph standardly used as a test for search algorithms, and you can
save any graph you create as an image file that can be introduced into text documents, e.g., into Microsoft
Word.
Selecting Groups of Nodes
If you would like to move or copy a whole group of nodes, first draw a "rubberband" around them; then
either drag one of the nodes to move the group, or select the Copy function from the Edit menu to copy
the group.
To draw a rubberband around a group of nodes, first click in the white background to the upper left of the
group of nodes, then drag the mouse down to the lower right of the group of nodes. The rubberband will
be shown as a dotted box around the group, and all of the nodes in the group will be highlighted. It will
look like this:
Once selected, this group of nodes may, for example, be dragged to another location by clicking on X3
and dragging it.
Using Templates
In using Tetrad you will put together a sequence of boxes connected by flowchart arrows. (See How to
Build a Session.) Some sequences are so commonly used that Tetrad will insert the entire sequence for
you--boxes and arrows--in the workbench all at once.
Templates are added to the active session using the Templates menu in the main workspace. The
Templates menu looks like this:
An image of each template along with a short description of it follows.
Search from Loaded Data
This template can be used if you simply want to load in a data set and do a search on it. The data set can
be either continuous or discrete; the options for search algorithms will depend on which type of data set
you load.
Estimate from Loaded Data (Bayes)
This template is useful if you want to estimate a Bayes instantiated model (Bayes IM) from a given data
set. A Bayes estimation requires a data set and a Bayes parameterized model (Bayes PM) as input. There
are two difficulties in getting such an estimation to work:
1. All of the measured variables in the Bayes PM must occur in the data set. The maximum likelihood
(ML) Bayes estimator and the Dirichlet estimator both require that all of the variables in the Bayes
PM be measured, although the Structural EM search allows for latent variables.
2. For each variable V in the Bayes PM with categories C1, ..., Cn for some n > 0, the variable by
the same name in the data set must have the same categories.
These conditions can be difficult to ensure when building a Bayes PM from scratch. Adding the edge from
Data1 to Graph1 in the template creates an edgeless graph in Graph1 that can then be used to construct a
specific DAG to use to build a Bayes PM. Adding the edge from Data1 to PM1 ensures that the categories
for each relevant variable in the data set are used when building the Bayes PM. The two arrows out of
Data1 together make it easier to ensure that the Bayes estimation will work.
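As a concrete illustration of the two conditions above, here is a small Python sketch (the variable and category names are hypothetical) that checks whether a data set satisfies them for a given Bayes PM:

```python
# A hedged sketch of the two preconditions for Bayes estimation:
# every PM variable must appear in the data, with identical categories.
def check_estimation_inputs(pm_categories, data_categories):
    """Both arguments map a variable name to its list of category names."""
    for var, cats in pm_categories.items():
        if var not in data_categories:
            raise ValueError(f"Variable {var} is missing from the data set")
        if list(cats) != list(data_categories[var]):
            raise ValueError(f"Categories for {var} differ: "
                             f"{cats} vs {data_categories[var]}")

check_estimation_inputs(
    {"X1": ["Low", "Medium", "High"]},
    {"X1": ["Low", "Medium", "High"], "X2": ["Yes", "No"]})
```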
Estimate from Loaded Data (SEM)
Like the Bayes version of Estimate from Loaded Data, in order to estimate a SEM IM, a continuous data
set and a SEM PM are required that have the same variables. In this case, however, the variables are
always continuous, and continuous variables always have the same range (the real numbers), so there is no
need to add the edge from Data1 to PM1.
Simulate Data
This is a very useful template for simulating continuous or discrete data sets. Continuous data sets can be
simulated by constructing a SEM Graph (or DAG), using that to construct a SEM PM, then a SEM IM,
and then finally a data set. Discrete data sets can be simulated by constructing a DAG, using that to
construct a Bayes PM, then a Bayes IM, and finally a data set. For information on any one of these steps,
see the help files for the corresponding box or module.
Search from Simulated Data
This template can be used to try out search algorithms on simulated data. Data can be simulated as with
the Simulate Data template, and then an appropriate search procedure can be run on this data. Search
procedure options differ depending on the type of data simulated.
Search from Simulated Data with Edge Comparisons
This template adds to the Search from Simulated Data a Compare node, which counts the number of extra
edges and missing edges in the Search graph vis-a-vis the reference graph in Graph1. This is useful if you
want to get a sense of how well a given search procedure performs on data with particular characteristics.
Estimate from Simulated Data
This template can be used to estimate data with respect to the parametric model that generated it. It is
useful if you would like to see how well an estimator does on data with particular characteristics,
simulated from an instantiated model with particular characteristics, when you know the parametric model
used to generate it.
Estimate using Results of Search (Bayes)
This template shows how to hook up boxes to estimate data using a model that was generated by a search
algorithm on that same data. Usually, the graph coming out of Search1 is an equivalence class graph such
as a Pattern or a PAG, and some work might be required to turn this into a DAG or SEM Graph in Graph1
that can be used to build an appropriate parametric model in PM1. The edge from Data1 to PM1 is added
in the discrete case to ensure that the variables in PM1 use the same categories as the variables in Data1.
Estimate using Results of Search (SEM)
This template shows how to hook up boxes to estimate data using a model that was generated by a search
algorithm on that same data. Usually, the graph coming out of Search1 is an equivalence class graph such
as a Pattern or a PAG, and some work might be required to turn this into a DAG or SEM Graph in Graph1
that can be used to build an appropriate parametric model in PM1.
Update Bayes IM
This template can be used to do updating operations on a Bayes instantiated model that you’ve built in
IM1.
Tetrad Toolbar
The main toolbar allows you to select box types to place in the main workspace (the white area). It also
allows you to select tools for selecting and moving boxes and for drawing arrows between them. Each
button in the toolbar is explained below.
Select and Move Button
When the movement button is highlighted, the objects in the workspace can be moved around by clicking
over each object and dragging it elsewhere in the workspace.
Once you have created a box, its contents can be opened by double clicking on it. The contents may be
another workbench for creating an object, or may be the object itself once it has been created:
The button at the bottom left of the toolbar column--the one with a red and a green arrow--permits you to
make a flow chart connecting boxes you have placed in the workspace.
Flow Chart Button
To make a flow chart, simply click on the flow chart tool button, and then click on the box you want at the
tail of a flowchart arrow and drag the arrow to the box you want at the head of the flow chart arrow. You
can do this repeatedly without having to click on the flow chart tool button in between. Only one
flowchart arrow can connect any two boxes, but a box can have any number of flowchart arrows out of it.
The flow chart you create provides the input to each Tetrad operation. Some boxes require no input (e.g.,
the Graph box), some require one input (e.g., the PM box requires a Graph box as input), and some boxes
require several inputs (e.g., the Estimate box requires a Data box and a PM box). Not all connections are
allowed, and if you attempt to connect two boxes that cannot be related (e.g., two Graph boxes), the
flowchart tool will simply refuse to make the connecting arrow.
If you put the cursor over a box and let it rest for a moment, a "tip" appears that describes the inputs
required for the operations in that box.
The Tool Buttons
Each tool button when clicked allows the creation of a corresponding box inside the workspace. Various
operations can be carried out by opening a box, provided it has appropriate inputs. The results of the
operations are contained in, and remain accessible inside of, the box in which they are created. Running an
operation or program inside a box never creates a new box. We will describe each of the other tool buttons
and how to use them for a variety of tasks. Clicking in this file on the tool buttons illustrated below will
provide much more information about each of their functions and operation.
Graph
Creates an instance of a graph. Options are:
Regular Graph - A set of variables over which a set of edges has been defined, where the edges can
be of any of the four standard Tetrad edge types.
Lag Graph - A set of variables, each at a series of time lags. Directed edges may extend from
previous time lags into the current time step. The time series graph is interpreted as a repeating
update graph.
Parametric Model
Creates a PM box in which a parametric model can be created. A parametric model specifies the family of
probability functions connecting cause and effect, but does NOT specify values for its parameters. For
example, if you open the PM box a dialog box will come up giving simple alternatives. One alternative,
for example, is "Bayes net." If you choose that, the graph you have specified as input to the PM box will
be parametrized as a categorical model in which the parameters are the (unspecified) conditional
probabilities of values of each variable on the values of its parent variables in the graph. If you specify
"SEM," the graph will be parametrized as a linear Gaussian model, with variances and linear coefficients.
The values for the parameters of the selected parametric model are NOT in the PM. They must be specified in
an IM box, which must have a flowchart arrow from a PM box directed into it.
Instantiated Model
Creates an IM box, which can be used to create an instantiated model. An instantiated model specifies
particular numerical values for the parameters of a parametric model. Initial parameter values are
chosen randomly and can be edited in a window created from the IM box.
Data
Creates a Data box which can be used to create a data set for an IM and allows the importation of data
files from outside the program. Only the last data set generated is stored for any one Data box.
Manipulated Data
Takes a Data box with data as input and creates a new data file, with missing values marked or
interpolated (for discrete variables), and with multiple copies of user-selected cases.
Estimator Button
Creates an Estimator box. Given a PM and Data, the procedures in the statistical estimator allow
estimation of the parameters--that is, creation of an instantiated model, based on the Data input to the
Estimator box. Estimators include maximum likelihood and Dirichlet types. There are also procedures
for handling missing values.
Updater Button
Creates an Update box. The Update box requires input from an IM box that is a Bayes net--i.e., is for
discrete variables. It will compute the conditional probability of any variable in the Bayes net given values
for any other variables in the model. It will also compute such probabilities conditional on an intervention
that fixes or randomizes other variables.
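What the updater computes can be illustrated with a brute-force sketch in Python; the net X-->Y and its probabilities below are hypothetical, and Tetrad's own updaters use more efficient algorithms:

```python
# A hedged sketch of updating: P(X | Y = y) by summing the joint and
# normalizing, for a hypothetical two-variable net X -> Y.
p_x = {"0": 0.3, "1": 0.7}                    # P(X)
p_y_given_x = {"0": {"0": 0.9, "1": 0.1},     # P(Y | X)
               "1": {"0": 0.2, "1": 0.8}}

def joint(x, y):
    return p_x[x] * p_y_given_x[x][y]

evidence_y = "1"
norm = sum(joint(x, evidence_y) for x in p_x)
posterior = {x: joint(x, evidence_y) / norm for x in p_x}
print(posterior)   # {'0': 0.0508..., '1': 0.9491...}
```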
Classify Button
Classify creates a Classifier box, which requires input from Data and from an IM box. It is used to classify
new cases with the Bayes net in the IM box. The variables in the IM box must match some of the
variables in the Data. The user specifies a target variable in the IM and the classifier uses the Bayes net
structure of the IM to predict the values of the target in the data set. Statistics on classification accuracy
are provided (as ROC curves and confusion matrices).
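As a rough illustration (not Tetrad's own output format), the accuracy statistics can be thought of like this; the target values below are hypothetical:

```python
from collections import Counter

# Hypothetical true and predicted values of a target variable.
true_vals = ["Yes", "No", "Yes", "Yes", "No"]
pred_vals = ["Yes", "Yes", "Yes", "No", "No"]

confusion = Counter(zip(true_vals, pred_vals))  # (true, predicted) -> count
accuracy = sum(t == p for t, p in zip(true_vals, pred_vals)) / len(true_vals)
print(confusion)   # Counter({('Yes', 'Yes'): 2, ...})
print(accuracy)    # 0.6
```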
Search Button
Creates a Search box. The Search box requires Data as input. The user can choose from among a variety
of search algorithms, consistent under different assumptions, and can specify background knowledge to
be used in the search.
Regression Button
Compare Button
Creates a Compare box. The Compare box requires input from a Search box result and input from a Graph
box, or input from two Graph boxes. It compares the edges in the structure from the Search box with the
structure in the Graph box (or the edges in the second graph with those in the first graph) and
returns counts of how well the Search graph (or the second graph) agrees with the Graph box structure (or
with the first graph).
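The counts the Compare box reports can be illustrated with a small sketch; the two edge sets below are hypothetical, and this version compares adjacencies only, ignoring orientation:

```python
# A hedged sketch of the edge counts reported by a graph comparison,
# treating each graph as a set of directed edges (hypothetical graphs).
true_edges = {("X1", "X2"), ("X2", "X3"), ("X1", "X3")}
found_edges = {("X1", "X2"), ("X3", "X2"), ("X1", "X4")}

def adjacencies(edges):
    return {frozenset(e) for e in edges}   # drop orientation

extra = adjacencies(found_edges) - adjacencies(true_edges)    # added
missing = adjacencies(true_edges) - adjacencies(found_edges)  # missed
print(len(extra), len(missing))   # 1 1
```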
How to Build a Session
Sessions in Tetrad are constructed by placing boxes into the white session workspace, connecting them up
with arrows in legal ways that represent their dependencies, and constructing modules in each box using
modules in parent boxes.
The session window allows you to create boxes in which all of the Tetrad objects are created and stored
and all of the Tetrad statistical operations are applied and their results are stored. The contents of each box
can be viewed by clicking on the box. To create a box of any kind, for example a Graph box, simply left
click on the corresponding tool on the left of the workbench, move the mouse over the workspace, and
click again. As soon as you do that, the button for the movement tool at the top of the tool column lights
up.
Here, step by step, is an example of how to use the tools in Tetrad to build a random Bayes net model and
simulate data from it. (We show how to build a random Bayes net, only because it makes the example
shorter; this is not necessarily something you would need to do ordinarily. For detailed explanations of
how each module works, see the help files for those modules.) First, we place a Graph box into the
workspace, then a PM box, then an IM box, then a Data box. In each case, we do this by first clicking in
the toolbar on the left for the type of box we want and then clicking in the workspace area where we want
the box to appear. The result after this step is as follows:
Next, we draw flowchart edges from the Graph box to the PM box, from the PM box to the IM box, and
from the IM box to the Data box. To start this process, we first click the flowchart tool in the toolbar,
which looks like this:
Then we hold the mouse down over the Graph box and drag to the PM box, then release the mouse. Same
for PM to IM and IM to Data. (Notice that only legal edges will be drawn; if an edge is not legal, it simply
will not be drawn.) The result after this step looks like this:
At this point, we have four boxes in the workspace, with dependencies between them specified, but there
are no modules in them. To put a module in the Graph box, double click it, select "Directed Acyclic
Graph" from the dropdown, select "A random DAG from the dialog that appears (accepting the defaults),
and click "OK." The result will look something like this:
Click "Save." The workspace will now show the Graph box in a different color, indicating that it now
contains a module. (The other three are still empty.) To place a module into the PM box, now that the
parent it depends on has a module in it, double click the PM box. Select "Bayes Parametric Model"
from the dropdown. Click "OK." Select "Automatically Assigned" from the dialog, accepting the defaults.
Click "OK." The result looks something like this:
When you click "Save," now two boxes are filled in. To fill in the IM box, double click it, select "Bayes
Instantiated Model," select "Randomly, overwriting previous values," and click "OK." The result looks
something like this (with perhaps a different graph and different conditional probabilities showing):
Clicking "Save," you see now three boxes are filled in. Now double click the "Data" box, accepting the
defaults, and click "OK." You now have a data set with 1000 cases, simulated from the Bayes net that was
randomly generated in the last three steps. It looks something like this:
If you click "Save," you see now that all four boxes are filled in with modules. Notice that along the way,
most of the boxes could have stored a variety of different modules. You made a choice as to which
modules to put in which boxes. Having made those choices, your choices downstream were constrained.
The final workspace looks like this:
If you right-click on any of these boxes, you will get a popup menu with a variety of actions you can take.
See Popup Menus for more details. Also, it is important to understand the implications of some boxes
being dependent on others. When you destroy the module in a box, modules downstream will be destroyed
also. See Flowchart Dependencies for more details.
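Conceptually, the simulation step performed by the Data box in this example amounts to ancestral sampling: each variable is drawn from its conditional distribution given its already-drawn parents. Here is a hedged Python sketch using a hypothetical two-variable net (Tetrad's own simulation is internal to the Data box):

```python
import random

# Hypothetical Bayes IM for X1 -> X2.
p_x1 = {"a": 0.4, "b": 0.6}                    # P(X1)
p_x2_given_x1 = {"a": {"0": 0.7, "1": 0.3},    # P(X2 | X1)
                 "b": {"0": 0.1, "1": 0.9}}

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

cases = []
for _ in range(1000):                          # 1000 cases, as in the example
    x1 = draw(p_x1)                            # draw X1 first
    x2 = draw(p_x2_given_x1[x1])               # then X2 given its parent
    cases.append((x1, x2))
```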
Each Box Explained
Sessions in Tetrad are built up by placing boxes on the main workspace area, connecting them up with
arrows, and building modules in each box that depend on parent modules that have already been built.
(See How to Build a Session for an example. For a discussion of how arrows create dependencies between
boxes, see Flowchart Dependencies.)
Boxes are things that look like this:
Each box can contain one of a specific list of modules, and which modules are available depends on what
modules are in the parent boxes for that box. Here we discuss each box in particular, list the types of
modules it can contain, and provide links to explanations for those modules. If you’d prefer to go directly
to explanations of modules, see Each Module Explained.
Graph Box
Parametric Model (PM) Box
Instantiated Model (IM) Box
Data Box
Manipulated Data Box
Estimate Box
Update Box
Search Box
Regression Box
Classify Box
Compare Box
Destroying Contents of Boxes
When you create a flowchart by placing boxes on the session workbench and drawing edges between
them, you set up dependencies between the models that are in the boxes. For instance, if you place a
Graph box and a PM box on the workbench, with an edge from the Graph box to the PM box, whatever
model you put in the PM box will be dependent on the model you put in the Graph box. Let’s say you put
a DAG in the Graph box and a Bayes PM in the PM box, as follows:
Then if you change the DAG by adding a node, the Bayes PM is no longer valid, since it doesn’t have the
same variables as the graph. Tetrad in such a case will offer you the choice either to have the Bayes PM
automatically updated to reflect the new variable or to have the edge between the Graph box and PM box
removed.
If you select "Execute," the Bayes PM will be replaced by a new Bayes PM that adds the new variable and
copies over as much of the information from the old Bayes PM as possible.
It is possible that information may be lost in this process. For example, if you add an IM box to the above
session and place a Bayes IM in this box, and you then change the number of categories for some variable,
some of the conditional probability tables for the Bayes IM will inevitably lose information.
Note that all models in boxes downstream will be revised when you click "Execute." This is because of
dependencies created by arrows between boxes downstream.
Each Module Explained
The interface for Tetrad lets you put boxes of different types into the main workspace and connect them
up with arrows. Inside each of these boxes a specific module can be built, which depends on whatever
modules are in the parent boxes. In this section, we discuss each module in particular, describe the types of
parents it can take, describe how to navigate the dialogs for constructing and editing those modules, and
refer to books or articles describing background theory for modules whose background theory requires
more explanation.
Inside the Graph Box
A Graph box in the main workspace looks like this:
When you double left click on the Graph box, a dialog box opens:
If you choose "Directed Acyclic Graph," you will only be able to construct directed acyclic graphs. That
is, you will only be allowed to construct directed (-->) edges, and you will not be permitted to construct
cycles in your graph (X-->...-->X for some variable X in your graph). See Directed Acyclic Graphs.
If you choose "(General) Graph," you will be permitted to construct edges between variables with
endpoints of the following three types: segment (-), arrow (->), and circle (-o). See General Graphs.
If you choose "Time Series Graph," you will be permitted to construct directed graphs over time-lagged
variables for a given set of variables X1,..., Xn. See Time Series Graphs.
Choosing a graph type
Here is some general advice for picking graph types.
If you are constructing a Bayes model, choose "Directed Acyclic Graph." This will ensure that you
use only directed edges and don’t create cycles.
If you are constructing a SEM model, choose "SEM Graph." This will ensure that you construct a
graph with only directed edges (-->, showing causal relationships) and bidirected edges (<->,
showing correlated errors) and that cycles in the graph will be permitted.
If you want to construct or edit other types of graphs used in Tetrad, choose "General Graph." This
will allow you to construct directed graphs, patterns, PAGs, POIPGs, MAGs, and so on. See Tetrad
Graph Types for more details. In most cases, you do not need to construct these types of graphs
yourself, but they are output by Tetrad search algorithms, and you may need to edit them to turn them
into DAGs. In that case, a "General Graph" editor will be displayed to help you edit these graphs.
See also SEM Graph.
SEM Graph
This is a specialized type of graph used for specifying the graphical structure of structural equation models
(SEMs). The causal structure of the graph is indicated using directed edges (-->), and correlated errors are
indicated using bidirected edges (<->). Cycles are permitted.
Structurally, between any two variables in the graph X and Y, up to three different edges may be added to
the graph: X-->Y, X<--Y, and X<->Y.
To construct a SEM graph, place a Graph box on the workbench (see Graphs), double click the Graph box,
choose "General Graph" from the menu, and click "OK." You will see the following dialog.
If you select "Created manually" and click "OK," a blank graph editor window is opened. If you select "A
random DAG," you will need to fill in parameters to generate a random DAG. See Generating Random
DAGs for more information. This DAG will be treated like a general graph, so that if you edit it you will
be able to add edges to it that aren’t directed edges (-->) and you will be able to construct cycles.
You may at this point add variables and edges to the graph, or remove them if they’re already there. To
add a measured variable to the graph, click "Add Variable" and then click in the white workbench area
where you want the measured variable to be located. To add a latent variable to the graph, click "Add
Latent" and then click in the white workbench area where you want the latent variable to be located. The
name of an added variable will be the first name in the sequence X1, X2, ..., that’s not already in the
graph. These names may be changed; see Editing Node Properties for details.
To remove a variable from the graph, click on the variable you want to delete to select it and then press the
delete key. If you remove a node from the graph, all of the edges attached to it will be removed as well.
To add an edge to the graph, click the type of edge you want to add, click and hold the mouse button down
over the variable you want the edge to be from, and then drag the mouse to the variable you want the edge
to be to. There are four types of edges you may add: directed (-->), undirected (---), unoriented (o-o), and
bidirected (<->). Cycles are permitted.
To remove an edge from the graph, click on the edge you want to remove to select it and then press the
delete key.
If all you want to do is turn edges that aren’t directed into directed edges or change the directions of
directed edges, there is a shortcut way to do this. Simply click on the endpoint of the edge you want the
arrow to be on, and the edge will change direction for you. Other edge orientation shortcuts are also
available. See Edge Orientation Shortcuts.
A sample graph might look like this:
The interpretation is that X5 causes X1, X1 causes X3, X2 causes X3, there is a feedback loop between X3
and X4, and the error terms for X5 and X1 are correlated. Each variable in a SEM Graph is associated
implicitly with an error term (see Structural Equation Models). To show the error terms for the endogenous
terms, select "Show Error Terms" from the Tools menu.
Notice that any bidirected edges are adjusted so that they attach only to exogenous variables when error
terms are shown. To hide the error terms, select "Hide Error Terms" from the Tools menu.
Note that using SEM graphs to construct SEMs is merely a convenience. Any SEM model you can build,
you can build from a SEM graph, and all SEM graphs can be used to construct SEM models. However,
DAGs can also be used to construct SEM models, and general graphs (see General Graphs) can be used to construct SEM
models, provided only directed and bidirected edges are used.
Once you have made a graph, you may rearrange the nodes by clicking and dragging; the edges will
follow along so that the structure of the graph remains the same. If you would like to move a whole
section of nodes to another location, draw a "rubberband" around them and click on any one to move
them. See Selecting Groups of Nodes for details. If you would like to change the name of a variable, or
change whether the variable is latent or measured, double click the variable, edit its properties, and click
"OK." See Editing Node Properties for details.
When you click "Save," the graph editor window will close, and the contents will be saved in memory. If
you click "Cancel," the changes you made while editing will be disgarded, and the state of the graph
before editing will remain unchanged. You may change the graph you made at any time by reopening the
Graph box and adding or removing variables or edges or rearranging variables.
If you right click on the Graph box, a popup menu will appear with several options. See Popup Menus for
more detail.
Important Points
1. If you create a model with a latent variable and later use the model to generate data, values for the
latent variables will not be shown. See Measured Vs. Latent Variables.
2. If you have introduced any boxes that depend on a Graph box, changing the graph will alter the
contents of all boxes downstream in the flowchart from the Graph box. Often it is easier to simply
create a new Graph box in the same main workspace window--there is no limit to how many graph or
other boxes you can have at the same time. See Flowchart Dependencies.
Menus
1. The menu at the top of the Graph window has two options, "File" and "Edit". "Edit" currently does
nothing. "File" gives three options: You can save a graph in a file-- but there is no point because we
have not yet implemented a facility to paste the saved graph into a new window. You can introduce
the ALARM network, a fairly complex graph standardly used as a test for search algorithms, and you
can save any graph you create as an image file that can be introduced into text documents, e.g., into
Microsoft Word.
Possible Parents for "SEM Graph"
A "SEM Graph" model can be self-standing, as described above. However, it can also be made a child of a
number of other models of a variety of different box types. Usually what this does is to make a copy in the
"SEM Graph" box of a preexisting graph in the parent model, which is often a convenient thing to be able
to do. The following models all have graphs that, if they happen to be interpretable as SEM graphs, can be
copied into a "SEM Graph" model:
1. All graph models.
2. All search models.
3. All parametric models.
4. All instantiated models.
5. All updater models.
In certain special cases, making a "SEM Graph" model a child of another model has a specialized
behavior. If you make a "SEM Graph" model a child of a Data box model, the effect is to create a graph
with all of the variables in the data set but no edges, as illustrated below.
Inside the Parametric Model Box
A Parametric Model box in the main workspace looks like this:
In the standard setup the PM Box requires a directed arrow from a Graph Box. The type of graph one uses
depends on the type of parametric model one wishes to construct. See Bayes Parametric Model or SEM
Parametric Model for details.
Bayes Parametric Model
Description of Model
Bayes Parametric Model (Bayes PM) takes a DAG and adds to it two bits of information:
1. For each named node in the graph, the number of categories for the variable by that name.
2. For each variable, with a given number of categories, the list of category names for that variable.
Given the graph and the additional information in (1) and (2), a Bayes net can be formally specified; it is
determined what all the parameters of the Bayes net are, although no values for parameters are yet known.
To specify a Bayes net up to parameter values, a Bayes Instantiated Model must be constructed, based on
a Bayes PM. For details on the parameters of a Bayes IM, see Bayes Instantiated Model.
It is assumed in the current version of Tetrad that all discrete variables are nominal--that is, that the order
of their categories is not important. See Defining Discrete Variables for more details.
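Schematically, a Bayes PM can be thought of as a DAG plus a category list per variable; in this hypothetical Python sketch, no parameter values are present yet:

```python
# A minimal sketch of the two pieces of information a Bayes PM adds
# to a DAG (variables and categories are hypothetical).
dag = {"X1": ["X2", "X5"], "X2": [], "X5": []}   # var -> parents

bayes_pm = {                                     # var -> category names
    "X1": ["Low", "Medium", "High"],             # 3 categories
    "X2": ["0", "1"],
    "X5": ["0", "1"],
}
# Together, dag and bayes_pm determine what the parameters are (one
# conditional probability table per variable), but give them no values.
```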
How to Construct a Bayes PM
For example, say you put the following boxes on the session, connected as follows:
For example, say you start with this DAG. (It need not be, specifically, in a Directed Acyclic Graph box;
all that matters is that it contain only directed edges with no cycles.)
If you click "Save" and double click the PM1 box, you are given a choice of which model type you would
like to construct. Choose "Bayes Parametric Model."
Once you click OK, the following dialog appears:
In this dialog, you can click on a variable and edit its number of categories and category names. For instance,
we can change the number of categories for X1 to 3 and set its categories to <Low, Medium, High>.
When you’re finished editing categories for variables, click "Save."
Potential Parents for Bayes Parametric Model
The Bayes PM can take any graph as parent that contains a DAG--that is, a graph that contains only
directed edges (-->) with no cycles (i.e. there is no X such that X-->...-->X in the graph). The simplest
option is to construct Directed Acyclic Graph in the Graph box. (See Directed Acyclic Graph for more
details.) If the parent is not a DAG, an error message will be displayed when the Bayes PM is constructed.
SEM Parametric Model
Description of the Model
A SEM Parametric Model (SEM PM) is a structural equation model (SEM) up to a specification of what the
parameters of the model are, without giving values for those parameters.
The implementation of structural equation models in Tetrad essentially follows Bollen (???). A structural
equation model is a set of linear equations expressing each variable as a linear sum of its parents plus an
exogenous error term--e.g.,
X1 = a1 * X2 + a2 * X3 + e1,
X2 = a3 * X3 + e2,
and so on, where the distribution of each error term has a specified variance, and correlations among error
terms are specified.
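For illustration, the two example equations above can be simulated in causal order; the coefficient and variance values below are hypothetical:

```python
import numpy as np

# A hedged sketch of the example equations, simulated in causal order.
rng = np.random.default_rng(0)
n = 1000
a1, a2, a3 = 0.8, -0.5, 1.2          # hypothetical linear coefficients

e1 = rng.normal(0, 1.0, n)           # error terms with specified variances
e2 = rng.normal(0, 1.0, n)
X3 = rng.normal(0, 1.0, n)           # exogenous variable
X2 = a3 * X3 + e2
X1 = a1 * X2 + a2 * X3 + e1
```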
The graph for such a system consists of one node for each variable, one node for each error term (which
may be hidden, or at least the error terms for exogenous variables may be hidden), a directed edge from
each variable on the right hand side of each such equation to the variable on the left hand side of the
equation, and bidirected edges between each pair of variables whose error terms are correlated. (If the
error term for a variable is being shown, the bidirected edge attaches to the error term instead of the
variable itself.) Cyclical dependencies among variables are permitted. See SEM Graph for details.
The parameters in this model consist of:
1. Each linear coefficient in the structural equations for the model (e.g., a1, a2, and a3, above), plus
2. The variances of each error term in the model (e.g., var(e1), var(e2), above), plus
3. The covariances of each pair of error terms that is specified to be correlated.
The number of parameters, therefore, is equal to the number of edges in the graph of the model with error
terms hidden (directed plus bidirected) plus the number of variables in the model. (When error terms are
shown, extra directed edges are added to the graph from error terms to their variables; these do not add
parameters to the model.)
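As a worked instance of this count, consider a hypothetical graph with 5 directed edges, 2 bidirected edges, and 5 variables (this matches the editor example shown later, with coefficients B1-B5, variances T1-T5, and covariances T6-T7):

```python
# Parameter count = directed edges + bidirected edges + variables.
n_directed, n_bidirected, n_variables = 5, 2, 5
n_parameters = n_directed + n_bidirected + n_variables
print(n_parameters)   # 12: 5 coefficients, 5 error variances, 2 covariances
```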
The SEM Parametric Model specifies only this list of parameters and allows this list to be edited. To give
specific values for each parameter in the model, one should use the SEM Instantiated Model.
How to Construct a SEM PM
For example, say you put the following boxes on the session, connected as follows:
Say you start by creating a SEM Graph in the Graph box. (See SEM Graph for details.) To make it
interesting, we create a SEM Graph that uses a couple of bidirected edges and has a cycle.
If you click "Save" and double click the PM1 box, you are given a choice of which model type you would
like to construct. Choose "SEM Parametric Model."
Once you click OK, the following dialog appears:
In this dialog, error terms for endogenous variables are shown explicitly, and all of the parameters are
labeled. Parameters B1, B2, B3, B4, and B5 (shown in black) are linear coefficients in the underlying
structural equation model; parameters T1, T2, T3, T4, and T5 (shown in blue) are error variance terms;
parameters T6 and T7 (shown in red) are error covariance terms.
In the dialog, you can double click on any parameter and change its name. For instance, you can double
click on the variance term T3, above, and change its name to "var_x3". Also--a fact which becomes
important in SEM estimation--one can set here whether this parameter should be held fixed for estimation
and control its starting value for estimation. (In SEM estimation, parameters are in general initialized
randomly and then adjusted by an optimization algorithm to optimize, e.g., the maximum likelihood
function for the model. See SEM Estimator for details. You can control here how these values are
initialized for this process.)
Clicking OK, you see that the name of the parameter has been changed.
It is important to notice what you cannot do in this editor. You cannot change the list of variables or the
names of variables, and you cannot add or remove edges to the graph. To do these things, simply edit the
graph that was used to construct the SEM PM model.
Potential Parents for a SEM PM
The SEM PM must be constructed using an object that has a graph in it of a type that can be used to
construct a structural equation model. The obvious choice is a SEM Graph, since with this graph, you can
add bidirected edges and cycles. You can, however, construct a SEM PM using a Directed Acyclic Graph,
if you don’t care that the graph cannot contain bidirected edges or cycles, or a General Graph, if you don’t
mind making sure on your own that the graph contains only directed and bidirected edges.
Potential Children for a SEM PM
There are two natural children for a SEM PM.
1. SEM Instantiated Model (SEM IM). The SEM PM is in a sense an incomplete SEM model, since it
doesn’t specify values for its parameters. To specify these values, make a SEM IM a child of the
SEM PM.
2. SEM Estimator. A SEM Estimator takes a SEM PM and a continuous data set and generates a fully
estimated SEM IM.
Inside the Instantiated Model Box
An Instantiated Model box in the main workspace looks like this:
In the standard setup, an IM box takes as parent a PM box, but instantiated models can be generated in other
ways as well, such as from estimators or updaters. See Bayes Instantiated Model, Dirichlet Bayes
Instantiated Model, or SEM Instantiated Model for details.
Bayes Instantiated Model
Description of the Model
A Bayes Instantiated Model (Bayes IM) extends a Bayes Parametric Model, specifying values for all of
the parameters in the Bayes net. The parameters for a Bayes net (in the form that they’re used in Tetrad)
are conditional probabilities stored in conditional probability tables, one for each variable in the Bayes
net. A variable X has a (possibly empty) list of parents P1, ..., Pn--i.e., variables Pi such that Pi-->X in the
Bayes PM. The variable itself and each of its parents has a list of categories. A conditional probability
table for X is a specification of the probability P(X=x’ | P1=p1’, ..., Pn=pn’) for each category x’ of X and
each combination of categories <p1’, ..., pn’> for parents P1, ..., Pn of X. For any particular combination
of parent values <p1’, ..., pn’>, the sum of the conditional probabilities P(X=xj | P1=p1’,...,Pn=pn’) for all
categories xj of X is equal to 1.0.
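For concreteness, here is what such a table looks like as an array; the variable, parents, and probability values are hypothetical:

```python
import numpy as np

# A minimal sketch of a conditional probability table for a variable X
# with 3 categories and two binary parents P1, P2 (hypothetical values).
# One row per combination of parent categories; each row sums to 1.0.
cpt = np.array([
    # X=Low  X=Medium  X=High      P1, P2
    [0.2,    0.5,      0.3],     # 0,  0
    [0.1,    0.1,      0.8],     # 0,  1
    [0.4,    0.4,      0.2],     # 1,  0
    [0.6,    0.3,      0.1],     # 1,  1
])
assert np.allclose(cpt.sum(axis=1), 1.0)
```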
How to Construct a Bayes IM
To construct a Bayes IM, first construct a DAG, then a Bayes PM, and add an IM box to the workspace,
with an arrow from the Bayes PM to the IM.
Fill in the Graph box and the PM box, as explained in Bayes Parametric Model. For instance, you
might end up with a graph that looks like this (the categories for X1 are shown).
Now, double click the IM box. You get a choice of models; choose Bayes Instantiated Model:
When you click OK, you are offered a choice. You may either initialize the parameters of your Bayes net
manually (i.e., fill them in one by one, by hand), or fill them in randomly.
We choose "Manually." We now get a dialog that looks like the following:
X1 here has two parents, X2 and X5. Each combination of parent values for X2 and X5 is listed as a row
in the conditional probability table for X1. Each category for X1 is listed as a column in the conditional
probability table. We can now fill in these probability values however we like, provided we choose
non-negative real numbers that sum to 1.0 in each row. The interface helps out a little by filling in table
cells whose values are implied. If you fill in the 0.2000 and 0.5000 in the table below, the table will fill in
the 0.3000 for you. Also, if you simply want to fill in table cells randomly, right click on any nonselected
table cell. You get a popup menu like the one below.
If you select "Randomize this row," the row is filled in with random values. For example:
Similarly for the other popup menu functions shown.
Once all of the table cells have been filled in, the Bayes IM is ready to be used as input to other boxes.
You may, for example, simulate data using the Bayes IM, or you may perform updating operations on it.
See Simulating Data (Bayes) for more information on how to simulate data and Update Box for more
information on updating.
Potential Parents for a Bayes IM
A Bayes IM can be constructed as indicated above (as a child of a Bayes PM). Bayes IM’s are also,
however, output by other processes. In particular, a Bayes IM can be made a child of the following:
1. ML Bayes Estimator. Bayes estimators take Bayes PM’s and discrete data sets and produce new
Bayes IM’s.
2. Dirichlet Estimator. Dirichlet estimators also take Bayes PM’s and discrete data sets (with possibly
Dirichlet priors in the form of Dirichlet Bayes IM’s) and output Dirichlet Bayes IM’s. They may
alternatively output Bayes IM’s.
3. Any Bayes updater. All of the Bayes updaters (Row Summing Updater, CPT Invariant Updater, and
Approximate Updater) output new, updated Bayes IM’s.
Potential Children for Bayes Instantiated Model
1. Data Box--This is the way to simulate data from a discrete Bayes model. See Simulating Data
(Bayes) for more details.
2. Any graph--Directed Acyclic Graph, SEM Graph, General Graph--simply copies the graph from the
Bayes IM into a new graph box.
3. Bayes PM--copies the Bayes PM from the Bayes IM into a new PM box.
4. Bayes IM--copies the Bayes IM itself into a new IM box.
5. Any of the updating algorithms--Row Summing Exact Updater, CPT Invariant Updater, Approximate
Updater.
6. ML Bayes Estimator (together with a discrete Data Set)--estimates the associated Bayes PM,
producing a new Bayes IM.
Dirichlet Bayes Instantiated Model
Description of the Model
A Dirichlet Bayes Instantiated Model (Dirichlet Bayes IM) is an alternative to a Bayes Instantiated Model
that represents the distribution over each row in each conditional probability table as a Dirichlet
distribution. A Dirichlet distribution is the probability distribution of the parameters of a multinomial
distribution--that is, the probability distribution of the parameters of a list of "cells" whose probabilities
sum to 1.0. Each row in a conditional probability table satisfies this criterion, so we build an alternative
Bayes net representation row by row out of distributions defined in this way.
A specific Dirichlet model for a list of cells is given by specifying a Dirichlet parameter for each cell. The
maximum likelihood probability for a cell is then the Dirichlet parameter for that cell divided by the sum
of the Dirichlet parameters for all cells in the list. In the simple case, these Dirichlet parameters will just
be cell counts, considered as real numbers. We will usually, therefore, refer to Dirichlet parameters as
pseudocounts. In the more general case, these pseudocounts can be any positive real numbers.
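To make this arithmetic concrete, here is a minimal sketch in Python; the pseudocount values are made up
for illustration and are not anything produced by Tetrad:

# One row of a Dirichlet table: a pseudocount for each cell.
pseudocounts = [2.0, 6.0, 2.0]

# The maximum likelihood probability of each cell is its pseudocount
# divided by the total pseudocount for the row.
total = sum(pseudocounts)                      # 10.0
ml_probs = [a / total for a in pseudocounts]   # [0.2, 0.6, 0.2]

# A pseudocount can be recovered from a probability and the total.
recovered = [p * total for p in ml_probs]      # [2.0, 6.0, 2.0]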
A Dirichlet Bayes Instantiated Model is constructed using a Bayes Parametric Model, just like a Bayes
Instantiated Model. The main differences are:
Instead of conditional probability tables, the Dirichlet Bayes Instantiated Model contains tables of
pseudocounts, as described above.
Dirichlet Bayes Instantiated Models are used as inputs to a Dirichlet Estimator, rather than as inputs
to an ML Bayes Estimator.
How to Construct a Dirichlet Bayes IM
To construct a Dirichlet Bayes IM, first construct a DAG, then a Bayes PM, and then add an IM box to the
workspace, with an arrow from the Bayes PM to the IM.
Fill in the Graph box and the PM box, as explained in Bayes Parameterized Model. For instance, you
might end up with a graph that looks like this (the categories for X1 are shown).
Now, double click the IM box. You get a choice of models; choose Dirichlet Bayes Instantiated Model:
When you click OK, you are offered a choice. You may either initialize the parameters of your Dirichlet
Bayes net manually (i.e., fill them in one by one, by hand) or fill them in randomly.
We choose "Manually." We now get a dialog that looks like the following:
There are two tabs in the dialog that comes up next, "Probabilities" and "Pseudocounts." Let us consider
"Pseudocounts" first. Pseudocounts are displayed in tables, one for each variable, with the same structure
as conditional probability tables in Bayes IM's. Each pseudocount is a positive real number; in this case
they are all initialized to 1.0. The sum of the pseudocounts in each row is shown in the rightmost column.
Turning now to the "Probabilities" tab, we have a table in the form of a conditional probability table that
displays maximum likelihood probabilities for each cell of each Dirichlet distribution (row) in the model.
These probabilities are calculated by dividing each pseudocount value in the previous display by the sum
of pseudocounts in that row. In order not to lose information, the total count for each row is displayed in
the "Probabilities" tab as well. To recover pseudocounts, simply multiply the probability of a cell by the
"total count" in the rightmost column.
A Dirichlet Bayes IM places an initial (prior) Dirichlet probability distribution over the conditional
probabilities of each value of each variable given the values of its parent variables: a probability
distribution over conditional probability distributions. Such Dirichlet distributions can be specified by
pseudocounts, essentially a kind of fictional database. A Dirichlet Bayes IM may be set up manually (all
values set by hand) or set up automatically as a symmetric prior in which the pseudocounts for all cells are
set to a single specified value. Such Dirichlet distributions are called "symmetric" because the distribution
function with such a choice of pseudocounts is symmetric with respect to permutation of the cells. (If all
pseudocounts are set to 1.0, the distribution function is completely flat and therefore uninformative. If all
pseudocounts are set to 0.5, the resulting distribution is known as a Jeffreys prior and has connections to
information theory.)
SEM Instantiated Model
Description of the Model
For a description of the sort of structural equation model (SEM) that is implemented in Tetrad, see SEM
Parameterized Model. The parameters of the structural equation model are:
1. The linear coefficients of the model,
2. The variances of each error term in the model, and
3. The covariances of each pair of error terms that is specified to be correlated in the model.
The purpose of the SEM IM is to allow values for these parameters to be specified. The SEM IM also
implements the functions (such as the maximum likelihood function of the SEM) that are optimized by the
SEM Estimator and calculates statistics for the SEM.
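To make the roles of these parameters concrete, here is a minimal Python sketch of a two-variable linear
SEM; the coefficient and variance values are hypothetical, not anything Tetrad produces:

import random

# Structural equations for the model X1 --> X2:
#   X1 = e1,   X2 = b * X1 + e2,   with e1 and e2 Normal and uncorrelated.
b = 0.8         # linear coefficient for the edge X1 --> X2
var_e1 = 1.0    # variance of the error term of X1
var_e2 = 0.5    # variance of the error term of X2

# One simulated case from the instantiated model:
x1 = random.gauss(0.0, var_e1 ** 0.5)
x2 = b * x1 + random.gauss(0.0, var_e2 ** 0.5)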
How to Construct a SEM IM
To construct a SEM IM, first construct a SEM Graph, then a SEM PM, as explained in SEM Parametric
Model, and then add an IM box to the workspace, with an arrow from the SEM PM to the IM:
For instance, you might end up with a SEM PM that looks like this.
When you double click the IM box now, you get a SEM IM model that’s been filled in with randomly
chosen values. (Notice you’re not given a choice of models; this is because there is only one IM model in
Tetrad that can serve as a child of a SEM PM model.)
Notice that in the SEM IM model, parameter values appear where parameter names appeared in the SEM
PM. For instance, the linear coefficient for the edge X6-->X5 is labeled as "B7" in the SEM PM, but the
actual real value for the parameter, "-0.7867," is shown in the SEM IM. These real values may be edited in
two ways. The first way is to click on the numbers themselves. If you click on the "-0.7867," above, a
small box appears that lets you edit the value of this parameter.
You may, for instance, type ".25" and hit return; your new value for the parameter is recorded.
The other way to view and edit parameter values is using the Tabular Editor. If you click on the Tabular
Editor tab, you get this display.
Notice that the parameter value for the edge coefficient of the edge X6-->X5 is "0.2500." We can edit this
value again from this view by clicking on the box containing the value "0.2500" and changing it to, say,
"-.5."
Whichever view we edit values in, the other view will reflect the updated values.
Notice that in the tabular view there are some columns to show statistics for each parameter. These
columns are used by the SEM Estimator to show how robust the estimation of each parameter is. These
are ordinary statistics calculated for SEM estimations; SE stands for "standard error," T for "t statistic,"
and P for "p value."
If you click on the Implied Matrices tab, an implied matrix of some type that you choose will be displayed.
You may display either the implied covariance matrix of the model for all variables (shown), or the
implied covariance matrix of the model for the measured variables only, or one of the corresponding
implied correlation matrices.
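For the standard linear SEM parameterization, the implied covariance matrix can be computed directly
from the coefficient matrix and the error covariance matrix. The following Python sketch shows the
calculation for a hypothetical two-variable acyclic model; it illustrates the formula, not Tetrad's code:

import numpy as np

# Coefficient matrix B: B[i, j] is the coefficient of Xj in the equation for Xi.
B = np.array([[0.0, 0.0],
              [0.8, 0.0]])        # X1 --> X2 with coefficient 0.8
Psi = np.diag([1.0, 0.5])         # error variances; no correlated errors

# Implied covariance matrix: Sigma = (I - B)^-1 Psi (I - B)^-T
I = np.eye(2)
Sigma = np.linalg.inv(I - B) @ Psi @ np.linalg.inv(I - B).T

# Implied correlation matrix: rescale by the standard deviations.
d = np.sqrt(np.diag(Sigma))
Corr = Sigma / np.outer(d, d)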
If you click on the Model Statistics tab after having done a SEM estimation, you will be shown some
goodness of fit statistics for the model as a whole. See SEM Estimator for more details.
Potential Parents for SEM Instantiated Model
A SEM IM may be made the child of the following modules:
1. SEM Parametric Model. The normal way to construct a SEM IM from scratch, as described above, is
to first make a SEM PM and then make a SEM IM that's a child of the SEM PM.
2. SEM Estimator. A SEM Estimator estimates a SEM IM from a given SEM PM and continuous data
set.
Potential Children for SEM Instantiated Model
A SEM IM may be made a parent of the following objects:
1. Data Box. This is the way to simulate data from a SEM model. See Simulating Data (SEM) for more
details.
2. SEM Graph. Copies the contents of the SEM IM graph into a new Graph box.
3. SEM PM. Makes a copy of the SEM PM used to create this SEM IM.
4. SEM IM. Makes a copy of the parent SEM IM.
5. SEM Estimator (with a continuous Data Set). See SEM Estimator.
Notably, there is currently no updater that takes a SEM IM as input, although there should be.
Inside the Data Box
A Data Box in the main workspace looks like this:
if it’s data that’s being simulated, or this:
if it’s data that was loaded in from an external file.
Most of the time that you interact with the data box, you will be interacting with a Data Set List, which is
a list of data sets, one of which (the one you see) is designated as "selected." For more information, see
Data Set Lists.
Data Set List
A data set list stores a list of one or more data sets, possibly of different types. One of the data sets is
designated as "active," in the sense that (a) it's the one you see when you double click the Data Box, and
(b) it's the one that's used downstream by, e.g., search and estimation algorithms. The types of data sets
that can currently be stored in a data set list are:
1. Tabular Data Set,
2. Covariance Matrix, and
3. Correlation Matrix.
Each of these types of data sets has a distinctive appearance when being edited, as shown below.
For information on how to load data files, see the help file for Data Loader.
Tabular Data Sets
Tabular data sets are rectangular data sets with data for a (possibly mixed) list of continuous and discrete
variables. (For detailed information on tabular data sets, see Tabular Data Sets. For information on
continuous and discrete variables, see Continuous and Discrete Variables.) A tabular data set containing
all discrete data (that is, a discrete data set) looks like this:
A tabular data set containing all continuous data (that is, a continuous data set) looks like this:
These data sets can be edited directly. For information on how to edit tabular data sets in the Data Editor,
see Editing Tabular Data Set.
Covariance Matrices
A covariance matrix in Tetrad is a symmetric, positive definite matrix M with dimension equal to the
number of variables in the data set, associated with a sample size. If the list of variables is <X1, X2, X3,
X4, X5>, then var(Xi) = m(i, i) and cov(Xi, Xj) = m(i, j). The sample size may be any number greater than
zero. Here is what a covariance matrix looks like in the data editor:
Only the lower triangle is shown, since the matrix is symmetric. For information on how to edit a
covariance matrix in the Data Editor, see Editing Covariance/Correlation Matrices.
Correlation Matrices
A correlation matrix in Tetrad behaves just like a covariance matrix (it is one!), except that it is labelled as
"Correlation Matrix," has a diagonal consisting entirely of 1.0's, and shows correlations instead of
covariances. That is, if the list of variables is <X1, X2, X3, X4, X5>, as in the above example, then a
symmetric, positive definite matrix M is shown such that m(i, j) is the correlation of Xi and Xj and in
particular m(i, i) = 1.0 for all i. The sample size may be any number greater than zero. Here is what a
correlation matrix looks like in the data editor (notice that only the lower triangle is shown since it is
symmetric):
Notice that these last images show tabs that let you switch back and forth between three different data
sets. All three are stored in the Data Box; if you want to, say, search over a different one, simply click the
tab for that data set, and your next search will be over that one instead.
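The conventions above (symmetry, positive definiteness, and the relation between a covariance matrix
and its correlation matrix) can be illustrated with a short Python sketch; the numbers are made up:

import numpy as np

# A 3 x 3 covariance matrix, entered as its lower triangle.
M = np.array([[2.0, 0.0, 0.0],
              [0.4, 1.0, 0.0],
              [0.3, 0.2, 1.5]])
M = M + M.T - np.diag(np.diag(M))   # fill in the symmetric upper triangle

np.linalg.cholesky(M)               # raises LinAlgError if M is not positive definite

# Corresponding correlation matrix: r(i, j) = m(i, j) / sqrt(m(i, i) * m(j, j)).
d = np.sqrt(np.diag(M))
R = M / np.outer(d, d)              # the diagonal of R is all 1.0's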
Missing Data
Missing data for all data types is represented using asterisks ("*"). See Handling Missing Data for details.
Creating Data Sets
You can make a data set in a number of different ways:
1. You may create a data set from scratch, by typing it into the data editor. See Creating Data from Scratch
for details.
2. You may load a data set from a file. See Loading Data for details.
3. You may simulate data from a model. See Simulating Data for details.
4. You may manipulate one data set to create another. See Manipulating Data for details.
Knowledge
Every data set may be associated with background knowledge. The reason for this is that one often wants
to run more than one search from the same data set, using the same knowledge, and associating knowledge
with a data set is an easy way to accomplish that. To see how to set up background knowledge, see Editing
Knowledge.
To use knowledge associated with a data set in a search, simply (a) construct the data set, (b) associate the
knowledge, (c) add a search box to the main workspace, (d) draw an edge from the data box to the search
box, and (e) execute the search.
Inside the Manipulated Data Box
A Manipulated Data box in the main workspace looks like this:
The Manipulated Data box is used to alter a data file. Inside the box you may:
Interpolate a missing value (an asterisk) wherever values are missing in the data file.
Insert the modal value of a discrete variable wherever values of that variable are missing in the data
file. (Better interpolations are coming.)
Produce a data set with multiplied cases. Each case is multiplied by the number in the
leftmost column of the data file input to the Manipulated Data box.
Once it has been run, the Manipulated Data box works just like any other Data Box.
Note one programming oddity: if you wish to discretize a continuous variable, you must do so inside the
Data Box that holds the data file, not inside the Manipulated Data box.
Inside the Estimate Box
An Estimate box in the main workspace looks like this:
The Estimate program takes a parametrized model (a PM) and a data set for the variables in that model,
and returns an Instantiated Model (an IM). It will also take data and an (ML) IM as input. Once a model is
estimated, the contents of the Estimate box can be transferred to an empty IM box and then used to
generate data, to classify, or to update (in the last two cases, only if the model is a Bayes net, not a SEM).
If a Maximum Likelihood Bayes Net and data are directly connected to Estimate, the estimation
procedures will ignore all cases in the data set with missing values for any variables. Missing data values
can be interpolated by connecting the data to a Manipulate Data box, and connecting that box to the
Estimator box.
There are several varieties of estimation, depending on the graphical input (the PM or IM):
1. If the input PM or IM is for a SEM, the Estimate program immediately produces a full information
maximum likelihood estimate of the parameters, provided the model in the PM or IM is identifiable.
Latent variables are allowed. The procedure also gives model statistics, including the implied covariance
and correlation matrices, and the chi square likelihood ratio statistic and its p value for the model.
2. If the input is a PM for a Bayes net, the Estimate program produces a maximum likelihood estimate of
the model parameters, provided the model has no latent variables.
3. If the input is a Maximum Likelihood Instantiated Bayes Net (an IM), the Estimate program produces a
maximum likelihood estimate of the model parameters.
4. If the input is a Dirichlet Instantiated Bayes Model, the Dirichlet Bayes estimator estimates a posterior
Dirichlet Bayes instantiated model given a prior Dirichlet Bayes instantiated model and a discrete data
set. The data set must contain all of the same variables as the prior instantiated model. Latent variables are
not allowed.
The Dirichlet estimation algorithm is simple. First, a new (blank) posterior Dirichlet Bayes IM is created.
Then, for each cell in the posterior, the value (a) from the corresponding cell in the prior is retrieved, and
the number of cases in the data satisfying the condition of that cell (n) is counted. The value of the cell in
the posterior is set to a + n. Estimated conditional probabilities and the total pseudocount in each row are
then calculated from these cell values.
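A minimal Python sketch of this update for a single row of a table; the prior values and counts are
hypothetical:

# Prior pseudocounts for one row (a symmetric prior with value 1.0 per cell).
prior = [1.0, 1.0, 1.0]

# Number of data cases falling in each cell of that row (n).
counts = [7, 2, 1]

# Posterior pseudocount for each cell is a + n.
posterior = [a + n for a, n in zip(prior, counts)]   # [8.0, 3.0, 2.0]

# Estimated conditional probabilities and the row's total pseudocount.
total = sum(posterior)                               # 13.0
probs = [c / total for c in posterior]               # [8/13, 3/13, 2/13]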
As a shortcut, it is possible in the interface to use a Bayes PM and a discrete data set as parents to the
Dirichlet Bayes Estimator. If you do this, a symmetric Dirichlet Bayes IM will be generated in the
background and used as the prior for the Dirichlet estimation algorithm. The symmetric pseudocount that
should be used here may be specified at time of construction.
In its present implementation, Bayes nets with latent variables cannot be estimated.
Types of estimators:
ML Bayes Estimator
SEM Estimator
Dirichlet Estimator
ML Bayes Estimator
Data and an ML Instantiated Bayes Model can be input to Estimate. Estimate will erase the parameter
values (the conditional probability of each variable value given values of the parent variables) in the IM
model and replace them with maximum likelihood estimates of the parameter values from the data.
SEM Estimator
Data and a SEM PM or SEM IM can be input to Estimate. As described in item 1 above, Estimate
produces a full information maximum likelihood estimate of the parameters, provided the model is
identifiable, along with model statistics.
Dirichlet Estimator
When Data together with a Dirichlet Instantiated Bayes Model are input to Estimate, new
parameters--new probabilities for each variable conditional on its parents' values--are calculated by
Bayes' Theorem. Unlike an estimate for an ML instantiated model, which simply ignores the previous
parameter values, with the Dirichlet Instantiated Bayes Model the Estimate function combines the new
data with the prior probabilities in the model to produce its parameter estimates.
Inside the Update Box
An Update box in the main workspace looks like this:
The functions inside the Update Box enable you to use an Instantiated Bayes Net model to compute the
conditional probability of any variable in the model from values you specify for any other variables in the
model.
Tetrad has three programs for updating: (1) Approximate Updater, (2) Row Summing Exact Updater, and
(3) CPT Invariant Updater.
1. Approximate Updater
Calculates updated marginals for a Bayes net by simulating data and calculating likelihood ratios. The
method is as follows. For P(A | E) (where E is the evidence), enough sample points are simulated from the
underlying Bayes IM so that 1000 satisfy the condition E, keeping track of the number n of those that also
satisfy condition A. Then the maximum likelihood estimate of P(A | E) is calculated as n / 1000.
The approximate updater runs quite quickly, even for large numbers of variables, so long as the number of
variables in the evidence E is small. The more variables there are in E, the more sample points need to be
generated to achieve 1000 sample points satisfying E.
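The following Python sketch shows the idea; the case sampler and the two predicates are assumed to be
supplied, and the details of Tetrad's own implementation may differ:

def approximate_update(simulate_case, satisfies_a, satisfies_e, target=1000):
    """Estimate P(A | E) by simulating cases until `target` of them satisfy E."""
    n_e = 0   # cases satisfying E so far
    n_a = 0   # of those, cases also satisfying A
    while n_e < target:
        case = simulate_case()          # one case drawn from the Bayes IM
        if satisfies_e(case):
            n_e += 1
            if satisfies_a(case):
                n_a += 1
    return n_a / target                 # maximum likelihood estimate of P(A | E)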
2. Row Summing Exact Updater
Calculates updated marginals P(A | E) for a Bayes net (where E is the evidence) by summing probabilities
for rows in the joint probability table that satisfy condition E, summing probabilities for rows in the joint
probability table that satisfy condition A & E, and dividing the second sum by the first. A row in the joint
probability table in this sense is a combination of values for the variables of the Bayes net mapped to the
probability of that combination of values occurring in a sample. This probability is calculated for each row
from the conditional probability tables of the Bayes net using the standard factorization of the Bayes net.
The row summing updater can be extremely expensive if the number of variables in the Bayes net is large
and the number of variables in the evidence E is small. However, the row summing updater can be
extremely useful (and fast) if almost all of the variables in the Bayes net are in the evidence. For instance,
if all but one variable (say, X) is in the evidence, then the number of rows in the joint probability table that
have to be examined in order to calculate marginals for X is just the number of categories of X.
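A Python sketch of the computation; the function that supplies joint probabilities (in practice obtained
from the Bayes net's factorization) and the two predicates are assumed:

from itertools import product

def row_sum_update(categories, prob_of_row, satisfies_a, satisfies_e):
    """Compute P(A | E) exactly by summing rows of the joint probability table.

    categories: one list of possible values per variable.
    prob_of_row: maps a tuple of values (a "row") to its joint probability.
    """
    p_e = 0.0    # total probability of rows satisfying E
    p_ae = 0.0   # total probability of rows satisfying A & E
    for row in product(*categories):    # every combination of values
        p = prob_of_row(row)
        if satisfies_e(row):
            p_e += p
            if satisfies_a(row):
                p_ae += p
    return p_ae / p_e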
3. CPT Invariant Updater
Calculates updated marginals P(A | E) for a Bayes net (where E is the evidence) by breaking the problem
down into two parts: first, calculating an "updated Bayes net" (in a sense to be
defined), and second, calculating marginals recursively from this updated Bayes net. Probabilities for a
Bayes net are specified in terms of conditional probability tables for its variables. These are tables of the
probability for each category of a variable conditional on each combination of parent values of that
variable, P(V = v’ | P1 = p1’ & ...& Pn = pn’). Define an "updated Bayes net" as the Bayes net in which
each of these probabilities has been replaced by P(V = v’ | P1 = p1’ &... & Pn = pn’ & E). (These
replacement values will not be defined if the conjunction P1 = p1’ &... & Pn = pn’ & E is impossible.) It is
straightforward to show that marginals for such an updated Bayes net just are the updated marginals for
the original Bayes net.
In updating a Bayes net, in the sense defined above, only the conditional probability tables for ancestors
of the evidence variables are altered. This suggests an algorithm for updating a Bayes net given evidence
E. For each variable that's an ancestor of an evidence variable, use the row summing method to calculate
updated conditional probabilities in that variable's conditional probability table. Otherwise, just keep the
conditional probabilities from the original Bayes net. This is the algorithm that is implemented.
For calculating single-variable marginals from a Bayes net, Bayes' Theorem is used recursively. For
example, if X-->Y, where X and Y both have categories {0, 1}, P(Y = 0) = P(Y = 0 | X = 0) P(X = 0) +
P(Y = 0 | X = 1) P(X = 1). Since all of the probabilities on the right side of this equation are stored in the
conditional probability tables for the Bayes net, this value can be calculated directly. For longer chains,
more recursion would have to be done to calculate marginals for intervening variables. These intervening
marginals can, however, once calculated, be stored for later use, and they are.
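For the two-variable example just given, the calculation looks like this in Python; the CPT values are
hypothetical:

# X --> Y, both with categories {0, 1}.
p_x = {0: 0.3, 1: 0.7}                  # P(X)
p_y_given_x = {0: {0: 0.9, 1: 0.1},     # P(Y | X = 0)
               1: {0: 0.2, 1: 0.8}}     # P(Y | X = 1)

# P(Y = 0) = P(Y = 0 | X = 0) P(X = 0) + P(Y = 0 | X = 1) P(X = 1)
p_y0 = sum(p_y_given_x[x][0] * p_x[x] for x in (0, 1))   # 0.27 + 0.14 = 0.41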
The algorithm slows down (as do most updating algorithms) when dealing with graphs where the parents
of a node are d-connected (see Spirtes, et al., 2000 for the exact definition). For instance, in this graph:
X-->Y
X-->Z
Y-->W
Z-->W
W-->R
calculating marginals for R requires much more extensive calculation than for this graph:
X1-->Y
X2-->Z
Y-->W
Z-->W
W-->R.
In this case, in order to calculate marginals for W, one needs to know probabilities of W given particular
combinations of parent values of W, which are given in the conditional probability tables of the Bayes net,
and one also needs to know the probabilities of the various combinations of parent values occurring. For
example, say that Y, Z have categories {0, 1} in the first graph, above, and one wants to know the
probability P(W = 0). One can calculate this probability as P(W = 0 | Y = 0, Z = 0) P(Y = 0, Z = 0) + P(W
= 0 | Y = 0, Z = 1) P(Y = 0, Z = 1) + P(W = 0 | Y = 1, Z = 0) P(Y = 1, Z = 0) + P(W = 0 | Y = 1, Z = 1)
P(Y = 1, Z = 1). The problem with d-connected parents is calculating, e.g., P(Y = 0, Z = 0). The CPT
invariant updater calculates this probability in a standard way, as P(Y = 0) P(Z = 0 | Y = 0). This requires a
recursive application of the marginal calculating procedure and is expensive. However, the problem of
d-connected parents of variables is a standard problem (even if not always phrased that way) for updating
procedures.
In general, the CPT invariant updater is quite fast, but it can be slowed down for two reasons: (a) the
subgraph of the Bayes net restricted to ancestors of the evidence variables is complicated, forcing more
updated conditional probabilities to be calculated, and (b) there are a lot of variables in the Bayes net
whose parents are moderately or strongly d-connected.
Types of updaters:
Row Summing Updater
CPT Invariant Updater
Approximate Updater
Inside the Search Box
The search box in the main workspace area looks like this:
if the search is being conducted over a data set, or this:
if the search is being conducted directly over a graph (as a source of true conditional independence facts,
to see how the algorithm behaves ideally).
Tetrad has a variety of search algorithms to assist in searching for causal explanations of a body of data.
It should be noted that the Tetrad search procedures are exponential in the worst case (when all pairs of
variables are dependent conditional on every other set of variables.) The search procedures may take a
good bit of time, and there is no guarantee beforehand as to how long that will be.
These search algorithms are different from those conventionally used in statistics.
1. There are several search algorithms, differing in the assumptions they make.
2. Many of the search algorithms allow the user to specify background information that will be used in
the search procedure. In many cases the search results will be uninformative unless such background
assumptions are explicitly made. This design not only provides for more flexibility, it also
encourages the user to be conscious of the additional assumptions imposed in deriving a model from
data.
3. Even with background assumptions, data often do not determine a unique best or robust explanation.
The search algorithms take in data and return information about a collection of alternative causal
graphs that can explain features of the data. They do not usually return a unique graph, although they
sometimes will if sufficient prior knowledge is specified. In contrast, if one searches for a regression
model of the influences of a set of variables on a selected variable, a regression model will certainly
be found (provided there are sufficient data points), specifying which variables influence the target
and which do not.
4. The algorithms are in some respects cautious. Search algorithms such as FCI and PC, described
below, will often say, correctly, that it cannot be determined whether or not a particular variable
influences another.
5. The algorithms are not just useful guesses. Under explicit assumptions (which often hold at best only
approximately), the algorithms are "pointwise consistent"--they converge almost surely to the correct
answer. The conditions for this sort of consistency of the search procedures are described in the
references. Conventional model search algorithms--stepwise regression, for example--have such
guarantees only under very strong prior assumptions about the causal structure.
6. The output of the search algorithms provides a variety of indirect information about how much the
conclusions of the algorithm can be trusted. They can, for example, be run repeatedly for a variety of
specifications of alpha values in their statistical tests, to gain insight about robustness. For search
algorithms such as PC, CCD, GES and MimBuild, described below, if the search algorithm returns
"spaghetti"--a highly connected graph--that indicates the search cannot determine whether all of the
connected variables may be influenced by a common unmeasured cause. If the PC algorithm returns an
edge with two arrowheads, that indicates a latent variable may be acting; if searches other than CCD
return graphs with cycles, that indicates the assumptions of the search algorithm are violated.
7. Some of the search procedures are robust against common difficulties in sampling designs--they give
correct, but reduced, information in such cases. For example, the FCI algorithm allows that there may be
unobserved latent common causes of measured variables--or not--and that the sample may have been
formed by a process in which the values of measured variables influence whether or not a unit is included
in the sample (sample selection bias). The CCD algorithm allows that the correct causal structure may be
"non-recursive"--essentially a cyclic graphical model, a folded-up time series.
8. The output of the algorithms is not an estimated model with parameter values, but a description of a
class of causal graphs that can explain statistical features of the data considered by the search procedures.
That information can be converted by hand into particular graphical models in the form of directed graphs,
which can then be estimated by the program and tested.
The search procedures available are named:
PC - Searches for Bayes net or SEM models when it is assumed there is no latent (unrecorded)
variable that contributes to the association of two or more measured variables.
CPC - Variant of PC that improves arrow orientation accuracy.
PCD - Variant of PC that can be applied to deterministic data.
FCI --which performs a search similar to PC but allowing that there may be latent variables.
CCD--for searching for non-recursive SEM models (models of feedback systems using cyclic graphs)
without latent variables
GES -- Scoring search for Bayes net or SEM models when it is assumed there is no latent
(unrecorded) variable that contributes to the association of two or more measured variables.
MBF -- Searches for the Markov blanket DAGs for a given target T over a list of variables
<v1,...,vn,T>.
CEF - Variant of MBF that searches for the causal environment of a target T (i.e., parents and children of
T).
Structural EM
MimBuild--for searching for latent structure from the output of Build Pure Clusters or Purify Clusters
BPC--for searching for sets of variables that share a single latent common cause
Purify Clusters--given a measurement model, searches for a submodel in which every measured variable
is influenced by one and only one latent variable
Inputs to the Search Box
There are two possible inputs for a search algorithm: a data set or a graph. If a graph is input, the program
computes the independence and conditional independence relations implied by the graph and allows you
to conduct any search that uses only such constraints--the PC, FCI, and CCD algorithms.
Why would you apply a Search procedure to a model you already know? For a very important reason:
The Search procedures will find the graphical representation of alternative models to your model that
imply the same constraints.
The more usual use of the search algorithms requires a data set as input. Here is an example.
Select the Search button.
Click in the workbench to create a Search icon.
Use the Flow Charter button to connect the Data icon to the Search icon.
Double-click the Search icon to choose a search procedure.
Selecting a Search procedure
Tetrad offers the following choices of search algorithms. For more details about the assumptions and
parameters needed for each algorithm, click on the respective links.
There are two main classes of algorithms. The first is designed for general graphs, with or without the
possibility of hidden common causes:
PC algorithm: this method assumes that there are no hidden common causes between observed
variables in the input (i.e., variables from the data set, or observed variables in the input graph) and
that the graphical structure sought has no cycles.
FCI algorithm: this method does not assume that there are no hidden common causes between
observed variables in the input (i.e., variables from the data set, or observed variables in the input
graph); it does assume that the graphical structure sought has no cycles.
CCD algorithm: this method assumes there are no hidden common causes; it allows cycles; it is only
correct for discrete variables under restrictive assumptions.
GES algorithm: same assumptions as the PC algorithm, except that this one performs search by
scoring a graph by its asymptotic posterior probability.
The second class concerns algorithms to search for latent variable structural equation models from data
and background knowledge.
MIM Build algorithm: learns the causal relationships among latent variables, when the true
(unknown) data generation process is believed to be a pure measurement/structural model.
Build Pure Clusters algorithm: a complement to MIM Build and Purify, this algorithm learns the
causal relationships from latent variables to observed variables, when the true (unknown) data
generation process is believed to contain a pure measurement/structural submodel--i.e., a model in
which each measured variable is influenced by one and only one latent variable.
Purify algorithm: given a measurement model, this method searches for a submodel in which
every measured variable is influenced by one and only one latent variable.
Select the desired algorithm that meets your assumptions from the Search list. An initial dialog box
showing the search parameters you can set is displayed. The following figure illustrates the one that is
displayed when PC Algorithm is selected.
After the proper parameters are set, if the user checks the box "Execute searches automatically", the
automated search procedure will start when the OK button is clicked. The respective Help button can be
used to get instructions about that specific algorithm. The next window displays the result of the
procedure, and can also be used to fire new searches. The following figure illustrates an output for the PC
algorithm.
Inserting background knowledge
Besides the assumptions underlying each algorithm, another source of constraints that can be used by the
search procedures to narrow down the search and return a more informative output is making use of
background knowledge provided by the user. To see how to specify background knowledge for a search
algorithm, see Editing Knowledge.
Assumptions
A search procedure is pointwise consistent if, as the sample size increases without bound, the output of the
algorithm converges with probability 1 to true information about the data generating structure. For all of
the Tetrad search algorithms, available proofs of pointwise consistency assume at least the following:
1. The sample is i.i.d.--the probability of any unit in a population being sampled is the same as any other,
and the joint probability distribution of the variables is the same for all units.
2. The joint probability distribution is locally Markov. In acyclic cases, this is equivalent to a simpler
"global" Markov condition: that a variable is independent of all variables that are not its effects
conditional on the direct causes of the variable in the causal graph (its "parents"). In cyclic cases, the local
Markov condition has a related but more technical definition. (See Spirtes, et al., 2000).
3. All of the independence and conditional independence relations in the joint probability distribution are
consequences of the local Markov condition for the true causal graph.
In addition, various specific search algorithms impose other assumptions. Of course, the search algorithms
may give correct information when these assumptions do not strictly hold, and in some cases will do so
when they are grossly violated--the PC algorithm, for example, will sometimes correctly identify the
presence of unrecorded common causes of recorded variables.
Types of Searches:
PC
CPC
PCD
FCI
CFCI
CCD
GES
MBF
MIMBuild
BPC
Purify
Search Algorithms: PC
The PC algorithm is designed to search for causal explanations of observational or mixed observational
and experimental data in which it may be assumed that the true causal hypothesis is acyclic and there is no
hidden common cause between any two variables in the dataset. (It is also assumed that no relationship
between variables in the data is deterministic--see PCD).
The algorithm operates by asking a conditional independence oracle to make judgements about the
independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional
independence tests are available for datasets that consist either entirely of continuous variables or entirely
of discrete variables; hence, datasets of these types can be used as input to the algorithm. As a way of
getting one’s head around how the algorithm should behave in the ideal, when independence tests always
give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical
d-separation will be substituted for an actual independence test.
In the case where a continuous dataset is used as input, the available conditional independence tests
assume that the direct causal influence of any variable on any other is linear and that the distribution of
each variable is Normal.
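One standard test of this kind is the Fisher z test of vanishing partial correlation. The manual does not
specify the exact test used here, so the following Python sketch is illustrative only:

import math
import numpy as np
from scipy import stats

def judged_independent(data, i, j, cond, alpha=0.05):
    """Test X_i _||_ X_j | {X_k : k in cond} on an (n, p) array of continuous data."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha   # True: treat as independent at level alpha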
Some of the above assumptions are not testable using observational data. They should come from prior
knowledge or partial experiments.
Pseudocode for the version of PC implemented in Tetrad IV is given below. As shown in the pseudocode,
the algorithm can be broken into two phases: an adjacency phase and an orientation phase. In the
adjacency phase, a complete undirected graph over the variables is initially constructed and then edges
X---Y are removed if some set S among either the adjacents of X or the adjacents of Y can be found (of a
certain size, or "depth") such that I(X, Y | S). Once the adjacency structure over V has been well estimated
by this procedure, an orientation phase is begun. The first step of the orientation phase is to examine
unshielded triples and consider whether to orient them as colliders. An unshielded triple is a triple <X, Y,
Z> where X is adjacent to Y, Y is adjacent to Z, but X is not adjacent to Z. Since X is not adjacent to Z,
the edge X---Z must have been removed during the adjacency search by conditioning on some set Sxz;
<X, Y, Z> is oriented as a collider X-->Y<--Z just in case Y is not in this Sxz. Once all such unshielded
triples have been oriented as colliders by this rule that can be, a series of orientation rules is applied (in
this case, the complete orientation rule set from Meek 1995) to orient any edges whose orientations are
implied by previous orientations. The log of particular decisions the algorithm makes, as described above,
when searching on an actual dataset is available through the Logging menu in the interface.
Entering PC parameters
Consider the following "true" causal hypothesis (a DAG):
When the PC algorithm is chosen from the Search dropdown, a window appears in which one may enter an
alpha value and edit knowledge. The alpha value is the significance level of the statistical test used as a
conditional independence oracle for the algorithm. The default value is 0.05, although it is useful to
experiment with different alpha levels to test the sensitivity of the analysis to this parameter. (Typical
values for experimenting are 0.01, 0.05, and 0.10.)
PC is sensitive to background knowledge--that is, sensitive to specifications that certain edges are either
required in the model or forbidden to be in the model. To edit this information, click the edit button for
background knowledge and enter the information in that interface.
When parameters are set to their desired values, click "Execute" to run the algorithm. The output will be a
pattern like the following:
Interpreting the output
There are basically two types of edges that can appear in PC output:
a directed edge:
In this case, the PC algorithm deduced that A is a direct cause of B, i.e., the causal effect goes from A
to B and is not mediated by any of the other observed variables.
an undirected edge:
In this case, the PC algorithm cannot tell if A causes B or if B causes A.
The absence of an edge between any pair of nodes means they are independent, or that the causal effect of
one node on the other is mediated by other observed variables.
Sometimes a doubly directed edge appears in a PC search output. Such edges are the result of a
partial failure of the PC search. They may appear due to failure of assumptions (e.g., relationships are
non-linear, the population graph is cyclic, etc.) or because the sample is not large enough and some
statistical decisions are inconsistent. In a situation like that, the user may introduce prior knowledge to
constrain the direction such an edge may assume, collect more data, or use a different algorithm.
Knowledge of the domain will be essential.
Finally, a triplet of nodes may assume the following pattern:
In other words, in such patterns, A and B are connected by an undirected edge, A and C are connected by
an undirected edge, and B and C are not connected by an edge. By the PC search assumptions, this means
that B and C cannot both be causes of A. The three possible scenarios are:
A is a common cause of B and C
B is a direct cause of A, and A is a direct cause of C
C is a direct cause of A, and A is a direct cause of B
In our example, some edges were compelled to be directed: X2 and X3 are causes of X4, and X4 is a cause
of X5. However, we cannot tell much about the triplet (X1, X2, X3), but we know that X2 and X3 cannot
both be causes of X1.
Pseudocode for PC
The following is pseudocode representing the way PC is implemented in Tetrad.
Step A:
Form the complete undirected graph G over v1,...,vn.
Step B (Fast Adjacency Search):
For each depth d = 0, 1, ...:
For each variable x:
"next_y":
For each node y adjacent to x:
Let adjX = adj(x) - {y}
Let adjY = adj(y) - {x}
For each subset Sx of adjX up to size d:
If x _||_ y | Sx, remove x---y from G.
Continue "next_y."
For each subset Sy of adjY up to size d:
if x _||_ y | Sy, remove x---y from G.
Continue "next_y."
Step C:
Orient colliders in G, as follows:
For each node x:
For each pair of nodes y, z adjacent to x:
If y and z are not adjacent:
If ~(y _||_ z | x):
Orient y-->x<--z as a collider.
Step D:
Apply orientation rules until no more orientations are possible.
Rules to use: away from collider, away from cycle, kite1, kite2.
(These are Meek’s rules R1, R2, R3, and R4.)
Away from collider:
For each node a:
For each b, c in adj(a):
If b-->a---c:
Orient b-->a-->c.
Else if c-->a---b:
Orient c-->a-->b.
Away from cycle:
For each node a:
For b, c in adj(a):
If a-->b-->c and a---c:
Orient a-->c.
Else if c-->b-->a and c---a:
Orient c-->a.
Kite 1:
For each node a:
For all nodes b, c, d in adj(a) such that a---b, a---c,
a---d, and !(c---d):
If c-->b and d-->b:
Orient a-->b.
Kite 2:
For each node a:
For all nodes b, c, d in adj(a) such that a---b, a---d,
b is not adjacent to d, and either a---c, a-->c, or c-->a,
If b-->c and c-->d:
Orient a-->d.
Else if d-->c and c-->b:
Orient a-->b.
References:
Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.
Chris Meek (1995), "Causal inference and causal explanation with background knowledge."
Search Algorithms: FCI
The FCI algorithm is designed to search for causal explanations of observational or mixed observational
and experimental data in which it may be assumed that the true causal graph is acyclic, but there may be
unrecorded (hidden, latent) common causes of variables in the data set, or in which there may be sample
selection bias. Sample selection bias occurs when the values of two or more recorded variables influence
the probability that a unit is sampled. (It is also assumed that no relationship between variables in the data
is deterministic--see PCD.)
The algorithm operates by asking a conditional independence oracle to make judgements about the
independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional
independence tests are available for datasets that consist either entirely of continuous variables or entirely
of discrete variables; hence, datasets of these types can be used as input to the algorithm. As a way of
getting one’s head around how the algorithm should behave in the ideal, when independence tests always
give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical
d-separation will be substituted for an actual independence test.
In the case where a continuous dataset is used as input, the available conditional independence tests
assume that the direct causal influence of any variable on any other is linear and that the distribution of
each variable is Normal.
Some of the above assumptions are not testable using observational data. They should come from prior
knowledge or partial experiments.
FCI is operated by the user exactly as is PC. The differences are in the interpretation of the output. The
output of FCI is a partial ancestral graph (PAG). It gives partial information about which variables are or
are not direct or indirect causes and effects of other variables.
An edge between two variables in the output, however the ends of the edge are marked, indicates that
there is a causal pathway--a direct cause in one direction or the other or a common cause--connecting the
two variables, that does not contain any other observed variable. It does not necessarily mean that in the
true causal graph, the connected variables have a direct causal connection. An edge of any kind between
two measured variables implies that the variables are not independent conditional on any set of measured
variables.
If there is an edge from X to Y that is unmarked--a tail of an arrow--then X is a cause of Y. X may not,
however, be a direct cause of Y.
If there is an edge from X to Y that has an arrowhead directed into Y, then Y is not a cause--not an
ancestor--of X.
If there is an edge with two arrowheads connecting X and Y, then there is an unrecorded common cause of
X and Y.
If an edge end is marked with an "o" the algorithm cannot determine whether there should or should not
be an arrowhead at that edge end.
Here is pseudocode for the implementation of the FCI algorithm used in Tetrad:
Given: Independence test I over variables v1,...,vn.
Step A:
Form new empty PAG G with variables from I. Fully connect G using
unoriented (o-o) edges.
Step B:
Run a Fast Adjacency Search on G using I.
Step C:
Orient colliders in G, as follows:
For all nodes B:
For each pair of nodes A,C adjacent to B:
If A and C are not adjacent:
If A and C are d-connected conditional on B:
Orient A-->B<--C as a collider.
Step D:
Form a Sepset matrix using a possible d-sep search.
Then reorient all edges as unoriented.
Step CI C:
Orient unshielded triples, as follows:
For all nodes B:
For each pair of nodes A,C adjacent to B:
If A and C are not adjacent:
If A and C are d-connected conditional on B:
Orient A-->B<--C as a collider.
Else:
Do nothing (effectively marking A---B---C as a noncollider)
Step CI D:
Apply orientation rules until no more orientations are possible.
Rules to use: double triangle, discriminating paths, away from collider, away
from ancestor, away from cycle.
Definitions of Orientation Rules:
Double triangle rule:
If D*-oB, A*->B<-*C, and A*-*D*-*C is a noncollider, then orient D*->B.
For all nodes B:
possible A: nodes into B with arrow
possible C: nodes into B with arrow
possible D: nodes into B with circle
For all possible D:
For all possible A:
For all possible C:
If A != C and required conditions hold:
Orient D*->B.
Discriminating paths rule:
The triangles that must be oriented this way (won't be done by another rule) all look like the one below,
where the dots are a collider path from L to A with each node on the path (except L) a parent of C.
[Diagram: B lies above the path L....A --> C, with edges from A and C meeting at B; the edge mark "x" at
B is either an arrowhead or a circle.]
For all nodes B
possible A: nodes out from B with arrow and into B with arrow or circle.
possible C: nodes out from B with arrow and into B with circle.
For all possible A:
For all possible C:
If A is a parent of C:
Find a collider path back from A.
If path found and if path endpoint is d-sep from C conditional on B:
Set C<--B.
Else:
Set A<->B and B<->C.
Away from collider rule:
If A*->Bo-oC and not A*-*C, then orient A*->B-->C. (Orient either circle
if present, don’t need both.)
Away from ancestor rule:
If A*-oC and either A-->B*->C or A*->B-->C, then orient A*->C.
Away from cycle rule:
If Ao->C and A-->B-->C, then orient A-->C.
Note: Zhang (2006) supplies an orientation rule set for FCI that is both arrow-complete and
tail-complete; this is not currently implemented.
Search Algorithms: CCD
The CCD algorithm (due to Thomas Richardson) is operated in the same way as the PC algorithm. The
output is interpreted subject to the same restrictions as the PC algorithm, with the exception that CCD will
return PAGs that can only be consistently interpreted as cyclic graphs.
The algorithm is pointwise consistent for linear systems with Normal distributions, no latent variables and
no correlated errors. It is similarly consistent for categorical variables with a Multinomial distribution if
the true causal graph and the distribution satisfy the local Markov condition--which is quite restrictive for
cyclic systems with categorical variables.
Search Algorithms: GES
GES (Greedy Equivalence Search) is a Bayesian algorithm that searches over Markov equivalence classes,
represented by patterns, for a data set D over a set of variables V.
A pattern is an acyclic graph whose edges are either directed (-->) or undirected (---); it
represents an equivalence class of DAGs, as follows: each directed edge in the pattern is so directed in
every DAG in the equivalence class, and for each undirected edge X---Y in the pattern, the equivalence
class contains a DAG with that edge directed as X<--Y and a DAG with that edge directed as X-->Y. To
put it differently, a pattern represents the set of edges that can be determined by the search, with as many
of these edges oriented as the available information allows.
It is assumed (as with PC) that the true causal graph is acyclic and that no common hidden causes exist
between pairs of variables in the graph. GES can be run on datasets that are either entirely continuous or
entirely discrete (but not directly on graphs using d-separation). In the continuous case, it is assumed that
the direct causal influence of any variable into any other is linear, with the distribution of each variable
being Normal. Under these assumptions, the algorithm is pointwise consistent.
GES searches over patterns by scoring the patterns themselves. There is a forward sweep in the algorithm
and a backward sweep. In the forward sweep, at each step, GES tries to find the edge which, once added,
increases the score the most over not adding any edge at all. (After adding each such edge, the pattern is
rebuilt by orienting any edge as --- that does not participate in a collider and then applying Meek’s PC
orientation rules to add any implied orientations.) Once the algorithm gets to the point where there is no
edge that can be added that would increase the score, the backward sweep begins. In the backward sweep,
GES tries at each step to find the one edge it can remove that will increase the score of the resulting
pattern the most over the previous pattern. Once there is no longer any edge whose removal increases the
score, the algorithm stops.
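The control flow of the two sweeps can be pictured with the following rough Java sketch. This is not
Tetrad's source code: it treats the graph as a bare adjacency set, and it hides the scoring and the
pattern-rebuilding steps behind an assumed Scorer interface, whereas the real algorithm scores local
changes to an equivalence class.

    public final class GesSweepSketch {
        // Assumed scoring hook: returns a model-selection score (e.g., BIC)
        // for the graph; higher is better.
        interface Scorer { double score(boolean[][] adj); }

        static boolean[][] search(int n, Scorer scorer) {
            boolean[][] adj = new boolean[n][n];   // start with the empty graph
            double best = scorer.score(adj);

            // Forward sweep: keep adding the single edge that improves the
            // score the most, until no addition improves it.
            for (boolean improved = true; improved; ) {
                improved = false;
                int bi = -1, bj = -1; double bestGain = 0;
                for (int i = 0; i < n; i++)
                    for (int j = i + 1; j < n; j++) {
                        if (adj[i][j]) continue;
                        adj[i][j] = adj[j][i] = true;          // try the edge
                        double gain = scorer.score(adj) - best;
                        adj[i][j] = adj[j][i] = false;         // undo
                        if (gain > bestGain) { bestGain = gain; bi = i; bj = j; }
                    }
                if (bi >= 0) { adj[bi][bj] = adj[bj][bi] = true; best += bestGain; improved = true; }
            }

            // Backward sweep: keep removing the single edge whose removal
            // improves the score the most, until no removal improves it.
            for (boolean improved = true; improved; ) {
                improved = false;
                int bi = -1, bj = -1; double bestGain = 0;
                for (int i = 0; i < n; i++)
                    for (int j = i + 1; j < n; j++) {
                        if (!adj[i][j]) continue;
                        adj[i][j] = adj[j][i] = false;         // try removing
                        double gain = scorer.score(adj) - best;
                        adj[i][j] = adj[j][i] = true;          // undo
                        if (gain > bestGain) { bestGain = gain; bi = i; bj = j; }
                    }
                if (bi >= 0) { adj[bi][bj] = adj[bj][bi] = false; best += bestGain; improved = true; }
            }
            return adj;
        }
    }

After each accepted change, the real algorithm also rebuilds the pattern, as described above, before
trying the next change.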
There are some differences in assumptions and expected behavior between this algorithm and the PC
algorithm. When, contrary to assumptions, there is actually a latent common cause of two measured
variables the PC algorithm will sometimes discover that fact; GES will not.
Information about how precisely GES makes decisions about adding or removing edges can be found in
the logs, which can be accessed using the Logging menu.
Entering GES parameters
Consider the following example:
When the GES algorithm is chosen from the Search Object combo box, the following window appears:
The parameters that are used by the GES algorithm can be specified in this window. The parameters are as
follows:
view background knowledge: this button gives access to a background knowledge editor that is
analogous to the one used in most search algorithms.
Execute the search.
Interpreting the output
The GES algorithm returns a partially oriented graph where the nodes represent the variables given as
input. In our example, the outcome should be as follows if the sample is representative of the population:
There are basically two types of edges that can appear in GES output:
a directed edge:
In this case, the GES algorithm deduced that A is a direct cause of B; i.e., the causal effect goes from
A to B and is not mediated by any of the other observed variables.
an undirected edge:
In this case, the GES algorithm cannot tell whether A causes B or B causes A.
The absence of an edge between a pair of nodes means either that they are independent or that the causal
effect of one node on the other is mediated by other observed variables. Unlike the PC algorithm, GES
cannot produce accidental double-directed edges. This does not mean that GES is immune to the sample
variation that caused the unexpected behavior of the PC search. It is a good idea to run both searches and
compare the results.
Finally, a triplet of nodes may assume the following pattern:
In other words, in such patterns, A and B are connected by an undirected edge, A and C are connected by
an undirected edge, and B and C are not connected by an edge. By the search assumptions, this means
that B and C cannot both be causes of A. The three possible scenarios are:
A is a common cause of B and C
B is a direct cause of A, and A is a direct cause of C
C is a direct cause of A, and A is a direct cause of B
In our example, some edges were compelled to be directed: X2 and X3 are causes of X4, and X4 is a cause
of X5. However, we cannot tell much about the triplet (X1, X2, X3), but we know that X2 and X3 cannot
both be causes of X1.
References:
Chickering (2002). Optimal structure identification with greedy search. Journal of Machine Learning
Research.
Search Algorithms: MBF
The MBF search (Markov blanket fan search) is designed to search for Markov blanket DAGs of target
variables in datasets, under the assumptions of the PC algorithm--i.e., that the true causal graph over the
variables in the dataset does not contain any cycles, that there are no hidden common causes between
variables in the dataset, and that no relationship between variables in the dataset is deterministic. The
Markov blanket of a variable t is the smallest set S of variables out of a set of variables V such that t _||_ y
| S for every y in V \ (S U {t}). The Markov blanket DAG of t is the restriction of the entire causal graph
over V to the variables in {t} U S.
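Graphically, under these assumptions, the Markov blanket of t consists of t's parents, t's children, and
the other parents of t's children. Here is a small Java sketch of that graphical characterization (our own
illustration, not Tetrad's source code):

    import java.util.*;

    public final class MarkovBlanketSketch {
        /** For a DAG given as a parent matrix (parent[i][j] true iff i --> j),
         *  the Markov blanket of t is its parents, its children, and the
         *  other parents of its children. */
        static Set<Integer> of(boolean[][] parent, int t) {
            int n = parent.length;
            Set<Integer> mb = new HashSet<>();
            for (int i = 0; i < n; i++) {
                if (parent[i][t]) mb.add(i);            // parents of t
                if (parent[t][i]) {                     // children of t ...
                    mb.add(i);
                    for (int j = 0; j < n; j++)         // ... and their parents
                        if (j != t && parent[j][i]) mb.add(j);
                }
            }
            return mb;
        }
    }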
MBF operates by asking a conditional independence oracle to make judgements about the independence of
pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional independence tests are
available for datasets that consist either entirely of continuous variables or entirely of discrete variables;
hence, datasets of these types can be used as input to the algorithm. (As a way of getting one’s head
around how the algorithm should behave in the ideal, when independence tests always give correct
answers, one may also use a DAG as an input to the algorithm, in which case graphical d-separation will
be substituted for an actual independence test.)
In the case where a continuous dataset is used as input, the available conditional independence tests
assume that the direct causal influence of any variable on any other is linear and that the distribution of
each variable is Normal.
Some of the above assumptions are not testable using observational data. They should come from prior
knowledge or partial experiments.
Like PC, MBF returns a pattern, in this case containing:
(a) The target T, the true parents and children of T, and the true parents of the children of T.
(b) All edges among T, true parents of T, true children of T, and true parents of children of T. Some of
these edges may not be oriented as -->.
(c) Possibly some extra nodes and edges to account for the possibility that if some edges T---v were
actually oriented as T-->v, these nodes and adjacencies would be required in the MB DAG of T.
(d) No nodes, adjacencies, or --> edges that do not belong in some MB DAG consistent with the
independence facts supplied by the independence test.
There may also be some bidirected <-> edges in the output graph G_out if the independence information
supplied by the test is inconsistent. These <-> edges may either be left in the final graph or oriented as if
they were directed edges.
Search Algorithms: MIMBuild
Introduction
MIM Build stands for Multiple Indicator Model Build. It is one of the three algorithms in Tetrad designed
to build pure measurement/structural models (the others are the Build Pure Clusters algorithm and
the Purify algorithm).
MIM Build should be used to learn causal relationships among latent variables when the measurement
model is given in advance but the structural model is unknown.
The MIM Build algorithm also assumes that the underlying (unknown) data-generating process is linear.
If the user strongly suspects that the latents or indicators may be non-linearly related, MIM Build
should not be used. We also assume that the latents have no other hidden common causes.
All observed variables are assumed to be continuous, and therefore the current implementation of the
algorithm accepts only continuous data sets as input. For general information about model building
algorithms, consult the Search Algorithms page.
Entering MIM Build parameters
Create a new Search object as described in the Search Algorithms page, but in order to follow this tutorial,
use the following graph to generate a simulated continuous data set:
When the MIM Build algorithm is chosen from the Search box, a window appears for specifying search
parameters.
The parameters that are used by MIM Build can be specified in this window. The parameters are as
follows:
alpha value: if you choose the PC search in the combo box "Choice of algorithm", MIM Build uses
statistical hypothesis tests in order to generate models automatically. The alpha value parameter
represents the level by which such tests are used to accept or reject constraints that compose the final
output. The default value is 0.05, but the user may want to experiment with different alpha values in
order to test the sensitivity of her data within this algorithm.
number of clusters: MIM Build needs a pure measurement model specified in advance. The
measurement model is defined by a set of clusters of variables, where each cluster represents a set of
pure indicators of a single latent. In this box, the user specifies how many latents there are in the
measurement model based on prior knowledge. In our example, let's use three clusters.
edit cluster assignments: once the number of latents is specified, the user should now determine
which variables in the data set should be clustered together. When this button is clicked, the
following dialog box appears:
In this example, we want to enter the measurement model that we know, by assumption, to be the correct
one. In other words, variables X1, X2 and X3 should be clustered together, since they are pure
indicators of the same latent. Variables X4, X5 and X6 form another cluster, and the same holds for X7, X8
and X9. To perform cluster assignment, click the respective combo box and choose the cluster that shows
up in the list. For example, click the X4 combo box and choose Cluster 1. Do the same for X5 and X6.
For variables X7, X8 and X9, choose Cluster 2. The final outcome should be as follows:
algorithm: MIM Build is actually a family of algorithms for the problem of learning structural
models. Currently, we offer two alternatives, both corresponding to the case where we have no latent
variables: the GES and PC search algorithms. The PC version can be slower and less robust than
GES, but might be useful to indicate if the assumption of no extra hidden common causes among the
latents holds (the appearance of double directed edges is an indication of that possibility).
view background knowledge: this button gives access to a background knowledge editor that is
analogous to the one used in most search algorithms, but with one difference: instead of entering
background knowledge about observed variables (in the MIM Build case, all background knowledge about
observed variables boils down to the specification of a measurement model), the user here enters prior
knowledge about causal relations among latent variables. Latents are denoted by the label _Lx, where x is
the number of the respective cluster. In our example, the latent parent of X7, X8 and X9 is referred to as
_L2. Note: use of background knowledge is not implemented for GES yet.
Execute the search as explained in the Search Algorithms page.
Interpreting the output
MIM Build returns a pattern over latent variables that is completely analogous to the one produced by a
PC Search, or GES Search. The same interpretation used in such algorithms can be applied to MIM
Build output.
Search Algorithms: BPC
Introduction
Build Pure Clusters (BPC) is one of the three algorithms in Tetrad designed to build pure
measurement/structural models (the others are the MIM Build algorithm and the Purify algorithm).
The goal of Build Pure Clusters is to build a pure measurement model using observed variables from a
data set. Observed variables are clustered into disjoint groups, each group representing indicators of a
single hidden variable. Variables in one group are not indicators of the hidden variables associated with
the other groups. Also, some variables given as input will not be used because they do not fit into a pure
measurement model along with the chosen ones.
The Build Pure Clusters algorithm assumes that the population can be described as a
measurement/structural model where observed variables are linear indicators of the unknown latents.
Notice that linearity among latents is not necessary (although it will be necessary for the MIM Build
algorithm) and latents do not need to be continuous. It is also assumed that the unknown population graph
contains a pure subgraph where each latent has at least three indicators. This assumption is not testable is
should be evaluated by the plausibility of the final model.
The current implementation of the algorithm accepts only continuous data sets as input. For general
information about model building algorithms, consult the Search Algorithms page.
Entering Build Pure Clusters parameters
For example, consider a model with this true graph:
If data is generated using this model and a search is constructed from the data, selecting BPC, the
following parameters will be requested:
alpha value: Build Pure Clusters uses statistical hypothesis tests in order to generate models
automatically. The alpha value parameter represents the level by which such tests are used to accept
or reject constraints that compose the final output. The default value is 0.05, but the user may want to
experiment with different alpha values in order to test the sensitivity of her data within this algorithm.
number of iterations: Build Pure Clusters uses a randomized procedure in order to generate a model,
since in general there are different pure measurement submodels of a given general
measurement/structural model. This option allows the user to specify a number of runs of the
algorithm; the outputs of the individual runs are combined into a single model. This
usually provides models that are more robust against statistical fluctuations and slight deviations
from the assumptions.
statistical test: as stated before, automated model building is done by testing statistical hypotheses.
Build Pure Clusters provides two basic statistical tests that can be used (a sketch of the tetrad
constraints these tests evaluate follows this list). Wishart's Tetrad test assumes that
the given variables follow a multivariate normal distribution. Bollen's Tetrad test does not make this
assumption. However, it needs to compute a matrix of fourth moments, which can be time
consuming. It is also less robust against sampling variability when compared to Wishart's test if the
data actually follow a multivariate normal distribution.
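Both of these tests evaluate "tetrad" constraints--vanishing differences of products of covariances. For
any four variables there are three such differences, and a single common cause of all four implies that
each difference is zero in the population. The following Java sketch (ours, for illustration) computes the
quantities the tests assess; the tests proper compare these differences to their estimated sampling
variability:

    public final class TetradSketch {
        /** Returns the three tetrad differences among variables {a,b,c,d} of
         *  a covariance matrix s. A single common cause of all four implies
         *  that each difference is zero in the population. */
        static double[] differences(double[][] s, int a, int b, int c, int d) {
            double t1 = s[a][b] * s[c][d] - s[a][c] * s[b][d];
            double t2 = s[a][b] * s[c][d] - s[a][d] * s[b][c];
            double t3 = s[a][c] * s[b][d] - s[a][d] * s[b][c];
            return new double[] { t1, t2, t3 };
        }
    }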
Interpreting the Output.
Upon executing the search, BPC returns a pure measurement model. Because of the internal randomization,
outputs may vary from run to run, but one should not expect large differences (and this can be actually
used to evaluate if the assumptions are reasonable for a given set of input variables). In our example, the
outcome should be as follows if the sample is representative of the population:
Edges with circles at the endpoints are added only to distinguish latent variables from the indicators. BPC
does not make any claims about the causal relationships among latent variables (this is the role of the
MIM Build algorithm). The labels given to the latent variables are arbitrary. As part of the analysis, a
domain expert should evaluate if such latents have indeed a physical or abstract meaning, or if they should
be discarded as meaningless. Such reification is domain dependent.
Note: If the output is not arranged helpfully, use the Fruchterman-Reingold layout in the Layout menu to
arrange it more readably.
Search Algorithms: Purify
Introduction
Entering Purify parameters
Interpreting the output
Introduction
Purify is one of the three algorithms in Tetrad designed to build pure measurement/structural models
(the others are the MIM Build algorithm and the Build Pure Clusters algorithm).
Purify should be used to select indicators of a given measurement model such that the selected indicators
form a pure measurement model. In other words, the user specifies a set of clusters of indicators, where
each cluster contains indicators of an assumed latent variable. The task of Purify is to discard any
indicator that is impure, i.e., that may have other common causes with other indicators, or that is a direct
cause of other indicators.
The Purify algorithm assumes that the population can be described as a measurement/structural model
where observed variables are linear indicators of the unknown latents, and that the given measurement
model is correct, but perhaps impure. Notice that linearity among latents is not necessary (although it will
be necessary for the MIM Build algorithm) and latents do not need to be continuous.
All variables are assumed to be continuous, and therefore the current implementation of the algorithm
accepts only continuous data sets as input. For general information about model building algorithms,
consult the Search Algorithms page.
Entering Purify parameters
Create a new Search object as described in the Search Algorithms page, but in order to follow this
tutorial, use the following graph to generate a simulated continuous data set:
Notice that, in this example, X4, X5 and X7 are in impure relations. Notice also that X4 is not an impurity
anymore when X7 is removed, but X5 and X7 cannot be made pure, since they are indicators of two
latents.
When the Purify algorithm is chosen from the Search Object combo box, the following window appears:
The parameters that are used by Purify can be specified in this window. The parameters are as follows:
alpha value: Purify uses statistical hypothesis tests in order to generate models automatically. The
alpha value parameter represents the level by which such tests are used to accept or reject constraints
that compose the final output. The default value is 0.05, but the user may want to experiment with
different alpha values in order to test the sensitivity of her data within this algorithm.
number of clusters: Purify needs a measurement model specified in advance. The measurement
model is defined by a set of clusters of variables, where each cluster represents a set of pure
indicators of a single latent. In this box, the user specifies how many latents there are in the
measurement model based on prior knowledge. In our example, assuming we know the true
measurement model, let’s use three clusters.
edit cluster assignments: this is identical to the cluster editor of the MIM Build algorithm. Consult
its documentation for details. In our example, we should create the following clustering:
statistical test: as stated before, automated model building is done by testing statistical hypotheses.
Purify provides two basic statistical tests that can be used. Wishart's Tetrad test assumes that the given
variables follow a multivariate normal distribution. Bollen's Tetrad test does not make this assumption.
However, it needs to compute a matrix of fourth moments, which can be time consuming. It is also
less robust against sampling variability when compared to Wishart's test if the data actually follow a
multivariate normal distribution.
default mode: there are basically two different strategies used by Purify. In the Impure by default
mode, the algorithm does not assume that the user believes the measurement model is pure, and
therefore will try to find constraints that guarantee that an indicator is pure with respect to other
indicators. If it fails to find a condition by which indicator A is pure with respect to indicator B, then
A will be marked as impure with respect to B. In the Pure by default mode, the algorithm assumes
that the given measurement model is pure. It will try to find constraints that guarantee that an
indicator is impure with respect to other indicators. If it fails to find a condition by which indicator A
is impure with respect to indicator B, then A will be marked as pure with respect to B.
Execute the search as explained in the Search Algorithms page.
Interpreting the output
Although a given measurement model may have many different pure submodels, the Purify algorithm has
a deterministic output: it will basically throw away indicators that violate constraints, following an order
determined by the number of constraints that are violated by each indicator. It returns a pure measurement
model. In our example, the outcome should be as follows if the sample is representative of the population:
Edges with circles at the endpoints are added only to distinguish latent variables from the indicators.
Purify does not make any claims about the causal relationships among latent variables (this is the role of
the MIM Build algorithm). The labels given to the latent variables are arbitrary.
Sometimes some latents will not have any indicator. As an important sidenote, if some cluster has only
two variables, Purify cannot find any condition by which the two indicators in this cluster can be
considered pure. If the Impure by default method is chosen, such indicators will always be removed.
Editing Knowledge
Background knowledge (or "knowledge" for short) is a set of specifiable constraints used in a variety of
searches; it can be associated either with search objects or, for convenience, with data objects.
Search procedures can use background knowledge provided by the user to narrow down the search and
return more informative output. There are three main types of background knowledge that can be used by
Tetrad:
forbidden edges: a given pair of variables that cannot be directly connected, either in a particular
direction or in any direction, independently of what the data say.
required edges: a given pair of variables that has to be connected in one direction or another,
independently of what the data say.
temporal tiers: information about which variables precede others in a temporal order, used to decide
the direction of a causal arrow where the algorithm cannot infer it.
Tools for manipulating knowledge are located in the Knowledge menu of components that are associated
with background knowledge. Consider a PC search over variables X1, X2, X3, X4, and X5. The search
editor will be initially blank, like so, and will have a Knowledge menu with several tools in it:
If you select "Edit Knowledge," you will se an editor that looks like this:
]
There are three tabs--the "Tiers" tab (showing), the "Edges" tab, and the "Text" tab. The "Tiers" tab lets
you specify temporal tiers; you simply drag and drop the variables into the tiers you want, increasing or
decreasing the number of tiers as needed. If V1 is in tier m and V2 is in tier n, where m < n, then the edge
V2-->V1 will be forbidden. Let's say you drag X1 and X2 to Tier 1, drag X3 and X4 to Tier 2, and drag
X5 to Tier 3:
To see the specific edges that are forbidden by this specification of tiers, click on "Edges":
The edges shown in gray are forbidden. You may add required edges to this view by clicking "Add
Required," clicking on the "from" node for the edge you want to add, and dragging to the "to" node. Here
is the same knowledge with two required edges added (shown in green):
Finally, you may view the edited knowledge in a format consistent with Tetrad 3 by clicking on Text:
If you click "Save," this knowledge will be saved and used in the next search.
If you select "Save Knowledge" from the Knowledge menu, you will be able to save knowledge out to a
file in the form shown in the "Text" tab, above. If you select "Load Knowledge," you will be able to load
knowledge from a file in the form shown in the "Text" tab, above.
The remaining items in the Knowledge menu are used to help move knowledge around from component to
component. If you select "Copy Knowledge," the current knowledge will be copied to the system
clipboard. If you select "Paste Knowledge," the knowledge stored on the clipboard will be copied into the
current box.
How Knowledge is Used
Knowledge is used when searches are done to forbid or require edges. Forbidden edges are not permitted
to appear in the final graph; required edges must appear in the final graph. How this is accomplished
varies from algorithm to algorithm; to see how it’s done in a specific algorithm, see the manual page for
that algorithm.
Temporal tiers provide a mechanism to forbid edges systematically in layered groups. Edges from any
later tier to any earlier tier are forbidden. This provides a convenient way to give knowledge about
temporal ordering to a search algorithm. The above knowledge would be sensible to provide if, for
example, we knew that X1 and X2 preceded X3 and X4, and X3 and X4 preceded X5. By simply placing
the variables in these tiers, all of the necessary forbidden edges are generated automatically. Variables in
the box marked "Not in tier" do not carry any temporal constraint with respect to any other variable.
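Since the rule is mechanical, it is easy to state as code. In this illustrative Java sketch (ours, not
Tetrad's source code), the map contains only the variables that were placed in tiers; variables left out of
the map carry no constraint:

    import java.util.*;

    public final class TierKnowledgeSketch {
        /** Given tier assignments (tier.get(v) is the tier number of v),
         *  lists the forbidden edges: every edge from a variable in a later
         *  tier to a variable in an earlier tier. */
        static List<String> forbiddenEdges(Map<String, Integer> tier) {
            List<String> forbidden = new ArrayList<>();
            for (String from : tier.keySet())
                for (String to : tier.keySet())
                    if (tier.get(from) > tier.get(to))
                        forbidden.add(from + " --> " + to);
            return forbidden;
        }
    }

For the tiers above, the sketch would list X3-->X1, X3-->X2, X4-->X1, X4-->X2, and the edges from X5
back to the variables in the two earlier tiers as forbidden.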
Search Algorithms: CPC
CPC (Conservative PC) is a variant of the PC algorithm designed to improve arrowpoint orientation
accuracy. The types of input data permitted and the assumptions for CPC are the same as for PC. That is,
input data may be used that is either entirely discrete or entirely continuous. A DAG may be used as input,
in which case graphical d-separation will be used in place of conditional independence for purposes of the
search. It is assumed that the true causal DAG over which the search is being done does not contain
cycles, that there are no hidden common causes between any pair of observed variables, and that with
continuous variables the direct causal effect of any variable into any other is linear, with the distribution of
each variable being Normal.
To know how to interpret the output of CPC, it helps to know how CPC differs from PC. In the collider
orientation step, instead of orienting an unshielded triple <X, Y, Z> as a collider just in case (as in PC) Y
is not in the set Sxz that shielded off X from Z and hence led to the removal of X---Z during the adjacency
phase of the search, <X, Y, Z> is first labeled as either a collider, a noncollider, or an unfaithful triple,
depending, respectively, on whether Y is in none of the sets S among adjacents of X or adjacents of Z
such that X _||_ Z | S, or all of those sets, or some but not all of those sets. If <X, Y, Z> is labeled as a
collider then it is oriented as X-->Y<--Z; if it is labeled as a noncollider or unfaithful, these facts are noted
for later use. (It is intended for unfaithful triples to be underlined in some way, but this is not implemented
in the interface currently. However, by inspecting the logs, the classification of unshielded triples into
colliders, noncolliders, and unfaithful triples may be obtained.)
Once all colliders have been marked and all other unshielded triples sorted properly, the Meek orientation
rules are then applied as usual, with the exception that where these orientation rules require noncolliders,
the fact that a triple is a noncollider is checked against the previously compiled classification of
unshielded triples.
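Once the separating sets have been collected, the classification step itself is simple, as in this
illustrative Java sketch (ours, not Tetrad's source code). Here sepsets is assumed to hold every set S,
among adjacents of X or adjacents of Z, such that X _||_ Z | S; it is nonempty whenever X and Z are
nonadjacent:

    import java.util.*;

    public final class TripleClassifierSketch {
        enum Kind { COLLIDER, NONCOLLIDER, UNFAITHFUL }

        /** Classifies the unshielded triple <X, Y, Z> by Y's membership
         *  pattern across the separating sets found for X and Z. */
        static Kind classify(List<Set<Integer>> sepsets, int y) {
            int containing = 0;
            for (Set<Integer> s : sepsets) if (s.contains(y)) containing++;
            if (containing == 0) return Kind.COLLIDER;               // Y in none
            if (containing == sepsets.size()) return Kind.NONCOLLIDER; // Y in all
            return Kind.UNFAITHFUL;                                  // Y in some
        }
    }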
Whereas the output graph of PC is a pattern (allowing for bidirected edges in some cases where
independence information conflicts with the assumptions of PC), the output graph of CPC is an e-pattern
(with the same allowance). The e-pattern represents an equivalence class of DAGs that have the same
adjacencies as the e-pattern, with every edge A-->B in the e-pattern also in the DAG and every unshielded
collider in the DAG either an unshielded collider in the e-pattern or marked as unfaithful in the e-pattern.
(Currently, the interface in Tetrad does not show which triples are marked as unfaithful, but the logs,
accessed through the Logging menu, provide this information.)
References:
Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.
Meek (1995). Causal inference and causal explanation with background knowledge.
Ramsey, Zhang, and Spirtes (2006). Adjacency-Faithfulness and Conservative Causal Inference.
Uncertainty in Artificial Intelligence, forthcoming.
Search Algorithms: PCD
PCD is a modification of the PC algorithm that allows it to search over datasets in which some of the
variables stand in deterministic relationships. A deterministic relationship between variables is one that is
purely functional, with no random variation involved. For instance, if X = 2Y, then X and Y in the graph
X<--Y-->Z stand in a deterministic relationship. The PC algorithm does not deal well with examples of
this sort. For instance, the PC algorithm might do a statistical test to determine whether X _||_ Z | Y, in
order to see whether the edge X---Z should be removed during the adjacency phase. But since X and Y
stand in a deterministic relationship, it is always the case that X _||_ Z | Y, since this is informationally
equivalent to asking whether X _||_ Z | X, which is always true. But establishing that X _||_ Z | X is never
a good reason for removing the edge between X and Z; if such a reason were permitted, the adjacency
phase of the PC search would always return an empty graph! In the face of deterministic relationships, the
PC algorithm needs to be made aware of when effective conditioning on an endpoint of an edge happens,
and adjustments need to be made to prevent edges from being eliminated for such reasons.
To correct the problem, PCD checks, before performing an independence test of the form X _||_ Z | S,
whether S determines either X or Z, and if it does, refuses to do the independence test. This corrects the
problem for the adjacency search. There is an additional problem in the step where colliders are oriented.
In this case PCD, when considering whether to orient an unshielded triple <X, Y, Z> as a collider, based
on a consideration of the set Sxz that was used to remove the edge X---Z from the graph during the
adjacency search, first asks whether Sxz determines X or Z. If it does not, then <X, Y, Z> is oriented as a
collider X-->Y<--Z if Y is not contained in Sxz.
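The determinism check can be pictured as follows. This illustrative Java sketch (ours, not Tetrad's
source code) tests exact functional determination in a data matrix; a statistical implementation would
instead test whether the variance of the column, conditional on the proposed conditioning set, is
indistinguishable from zero:

    import java.util.*;

    public final class DeterminismSketch {
        /** Returns true if the columns in cond functionally determine column
         *  x in the data: whenever two rows agree on all of cond, they agree
         *  on x. Missing values and sampling noise are ignored here. */
        static boolean determines(double[][] data, int[] cond, int x) {
            Map<List<Double>, Double> seen = new HashMap<>();
            for (double[] row : data) {
                List<Double> key = new ArrayList<>();
                for (int c : cond) key.add(row[c]);
                Double prev = seen.putIfAbsent(key, row[x]);
                if (prev != null && prev.doubleValue() != row[x]) return false;
            }
            return true;
        }
    }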
Otherwise, the assumptions and types of data permitted for PCD are identical to that of PC.
Search Algorithms: CFCI
CFCI is an experimental algorithm that modifies FCI in the same way that CPC modifies PC.
Inside the Regression Box
The regression box in the main workspace area looks like this:
Tetrad currently offers linear regression as an option for continuous data sets. Possibly other types of
regression (e.g., logistic regression) will be added in the future.
Inside the Classify Box
A Classify box in the main workspace looks like this:
The operations in the Classify box permit you to use an instantiated model to estimate values for a variable
from a data set--the variable estimated must be in the Instantiated Model, but it need not be in the data set.
A Classify box requires input from a Data box containing data and input from an Instantiated Model in an
IM box. (Remember that you can copy an estimated model in an Estimate box into an IM box simply by
putting a flowchart arrow from the estimate box to a new IM box.)
Here are some things to note:
The Instantiated Model can contain variables that are not in the Data
The Data can contain variables that are not in the Instantiated Model
The variable values must all be categorical--a data file with decimal numbers will not be accepted by
Classify.
The IM must be a Bayes net--either Maximum Likelihood or Dirichlet.
If the target variable to be classified has multiple values, the Classifier will assign the target variable
its most probable value for each case.
If the target variable to be classified has two values, you can specify the cut-off probability for
classification.
The Classifier box will show the graph of the IM used for classification.
The original data can be viewed by clicking the "Test Data" tab.
Tabs above the graph give choices for how the Classifier will work. You can choose:
The variable in the IM that is the target--to be classified (you can also choose this variable by clicking
on it in the graph).
If the target is binary, with only two possible values, you can choose the cut-off value below which
the variable will be classified one way, and above which, the other.
When you hit the "Classify" button at the top of the graph window inside the Classify box, the Classifier
does its work and gives you some viewing choices.
According to which tab you then click, you can see:
The original data
The original data plus, in the first column, for each case, the value of the target variable that the
classifier predicts.
If the target variable is included in the data, the tabs at the top of the Classifier window after the
classification program has run will also give you:
A Receiver Operating Characteristic, or ROC curve for short. There is a distinct ROC curve for each
value of the target variable, showing the ratio of true positives to false positives as a function of the
cutoff value of probabilities for positive classification. ROC curves are traditionally used only with
binary variables, but the program allows them for multiple-valued variables.
The Area Under the ROC curve, or AUC
A Confusion Matrix, showing, for each value of the target variable, the number of cases having that
true value that were predicted (by the classifier) to have each of the possible values of the target variable.
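The confusion matrix computation is simple to state exactly. Here is an illustrative Java sketch (ours,
not Tetrad's source code), with categories coded as indices 0..k-1:

    public final class ConfusionSketch {
        /** Builds a confusion matrix: cell [i][j] counts the cases whose
         *  true category is i and whose predicted category is j. */
        static int[][] matrix(int[] truth, int[] predicted, int k) {
            int[][] m = new int[k][k];
            for (int c = 0; c < truth.length; c++)
                m[truth[c]][predicted[c]]++;
            return m;
        }
    }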
You do not need to leave the Classify Box, or destroy its contents, to view the ROC curves or Confusion
matrices for alternative values of the target variable. You do need to do so (or create a new Classify Box)
if you want to classify a different target.
Inside the Compare Box
A Compare box in the main workspace looks like this:
The use of the Compare Box is very simple and best understood with an example. Suppose we have
generated data from an IM with a graph.
and run a PC search with the result:
Then the Compare box returns the agreement and differences between the edges of the two graphs (not
their orientations).
Common Tasks
There are several tasks in Tetrad that are common to one or more different dialogs, or that need to be
performed on sessions in general. We list these tasks here, with explanations. These explanations are
linked to from other files in an attempt to say things once where possible.
Tasks:
Copying, Cutting, Pasting
Defining Discrete Variables
Destroying Models
Discretizing Data
Editing Knowledge
Editing Node Properties
Exiting TETRAD
Generating Random DAGs
Handling Missing Data
Opening Sessions
Saving Screenshots
Saving Sessions
Selecting Groups of Nodes
Using Popup Menus
Using Templates
Copying, Cutting, Pasting
Nodes and subgraphs of nodes, either in the main session workspace or in any graph display, can be
copied, cut, and pasted, either into the same workspace or graph or into some other workspace or graph.
To copy a section of a flowchart graph, first select the nodes you wish to copy. The edges connecting those nodes
will be copied as well.
Now select "Copy" from the Edit menu, or type Control-C. If you now select "Paste" from the Edit menu,
or type Control-V, a copy of the selected nodes and edges will be pasted into the workspace slightly down
and to the right of the original.
These selected nodes can now be dragged to the desired location.
If you select "Cut" (or type Control-X) instead of "Copy" (Control-C), the original nodes will be deleted,
but you will still have the option of pasting copies of them into the workspace.
You may also paste multiple copies of the originally selected nodes into the workspace. These are pasted
down and to the right in each case--e.g.:
All of these procedures work for nodes in graphs as well.
Defining Discrete Variables
Discrete variables in Tetrad may be described as follows.
1. They are assumed to be nominal--that is, the order of categories doesn’t matter for searches and
estimations.
2. When trying to decide whether two variables with the same name are equal, their categories are identified by
name.
3. When sending data to algorithms, categories are identified by index only.
Some comments. For point (1), it is clearly a simplification to assume that all discrete variables are
nominal, and it clearly in some cases leads to a loss of information, since if you knew the categories for
some variable carried ordinal information you might be able to use tests of conditional independence that
took advantage of this information. For reasons of speed and flexibility, we’ve stayed with the nominal
independence tests.
For point (2), the problem is that a variable "X" can be defined in two different boxes--say, two different
Bayes Parametric Model boxes or a Bayes Parametric Model box and a Data box. It's possible that the
two variables have the same number of categories (in fact, when doing estimations, this is desirable) but
that in the one case the categories are <High, Medium, Low> while in the other case the categories are
<Low, Medium, High>. In this case, the mapping of categories should be High-->High,
Medium-->Medium, Low-->Low and not High-->Low, Medium-->Medium, Low-->High. That is, the
categories should be identified by name.
However, as regards point (3), it is extremely inefficient, especially in Java, to force algorithms over
discrete variables to deal with names of categories; algorithms need to deal with indices of categories. So
when sending a column of data with variable X, with categories <High, Medium, Low> to an estimator,
the estimator only knows that there are three categories for X, at indices 0, 1, and 2, respectively. It
doesn’t know about the names of the categories.
Points (2) and (3) are reconciled in Tetrad using a "bulletin board" system. The first time a list of
categories is encountered, it is posted on a "bulletin board." After that, if that same list of categories is
encountered again, but in some permuted order, the version from the "bulletin board" is retrieved and used
instead. So any particular list of categories can only appear in one order in Tetrad. (This does not imply
that the variables are ordinal; algorithms still interpret these variables as nominal, in that they employ
statistical tests that do not take advantage of ordinality.)
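A minimal Java sketch of the "bulletin board" idea (ours, for illustration): the first ordering seen for a
given set of category names becomes the canonical one, and later permutations of the same set are mapped
back to it.

    import java.util.*;

    public final class CategoryBoardSketch {
        // Maps a set of category names to the first ordering in which that
        // set was encountered; later permutations are replaced by it.
        private final Map<Set<String>, List<String>> board = new HashMap<>();

        List<String> canonical(List<String> categories) {
            return board.computeIfAbsent(new HashSet<>(categories),
                                         k -> new ArrayList<>(categories));
        }
    }

With this sketch, canonical([Low, High]) returns [Low, High], and a later call with [High, Low] also
returns [Low, High].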
You can see the effects of the "bulletin board" system, for example, in the following situations:
If you’ve specified a Bayes Parametric Model and then read data in from a file for the same variables,
the order of the categories for the data will be the same as order of categories in your Bayes PM.
Estimations, taking Bayes PM’s and discrete data sets as parents will work smoothly.
If you create a Bayes PM with variable X, with categories <Low, High>, and later create another
Bayes PM with variable X with categories <High, Low>, the order of categories in the second case
will be adjusted to <Low, High>.
Discretizing Data
Data can be discretized column by column in Tetrad by selecting "Discretize Selected Columns..." from
the Tools menu of the data editor, which you can launch by double clicking on a Data box.
Both continuous and discrete data can be discretized. Continuous data is discretized by selecting the
number of categories one wants the data to have, giving the categories names, and selecting cut points. For
categories C1, C2, and C3, cut points c1 and c2 will be needed. Real values in the column < c1 will be
mapped to C1; real values in [c1, c2) will be mapped to C2, and real values >= c2 will be mapped to C3.
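That mapping can be stated in a few lines of Java (an illustrative sketch, not Tetrad's source code):

    public final class DiscretizerSketch {
        /** Maps a continuous value to a category index given ascending cut
         *  points: values below cuts[0] go to category 0, values in
         *  [cuts[i-1], cuts[i]) go to category i, and values at or above the
         *  last cut go to the last category. */
        static int category(double value, double[] cuts) {
            int i = 0;
            while (i < cuts.length && value >= cuts[i]) i++;
            return i;
        }
    }

With cuts {c1, c2}, category(v, cuts) returns 0, 1, or 2 according as v < c1, c1 <= v < c2, or v >= c2,
matching the mapping just described.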
Discrete columns are discretized, by contrast, by simply mapping old categories to new ones, by name.
Consider this data set, simulated from a SEM instantiated model. There are five variables: X1, X2, X3, X4
and X5. Three of the columns (X3, X4 and X5) are selected, and the "Discretize Selected Columns..." item
is shown:
After selecting the "Discretize Selected Columns..." item, the following dialog appears:
The "Next" and "Previous" buttons at the bottoms allow one to navigate through the selected columns. For
each columns, one must select the number of discretized categories, the names for those categories, and
the cut points for those categories. To be helpful, the minimum and maximum value for the column are
displayed, default category names in the sequence "0", "1", ... are chosen, and cut points are chosen that
evenly divide up the range [Min, Max]. At the bottom of the dialog is a checkbox labeled "Copy
unselected columns into new data set." If you check this, the new data set created by the discretizer will
contain all of the variables of the old one, with discretized columns changed. Let’s leave this unchecked
for now. If you accept all of the defaults, with the checkbox unchecked, a new data set is created
comprised of discretized versions of X3, X4, and X5, and this new data set is added as a new tab to the
Data Editor:
Since this tab is selected, it becomes immediately available to searches, estimations, etc. To see how
discretization of discrete columns works, we can further discretize X5 in this data set by selecting it and
choosing "Discretize Selected Columns..." from the Tools menu again. The following dialog appears:
We can then specify the category name each category in the column should be mapped to, this time
copying over unselected columns:
If you now click "Discretize," a new data set will be added to the Data Editor in a new tab:
Editing Node Properties
Nodes have two properties: a name, and whether or not they’re latent. To edit these properties for a
particular node, double click that node, set the properties the way you want, and click "OK."
Consider the following example.
Here we have a graph using default node names X1, X2, and X3. We would like this to read "Heat, a latent
variable, causes volume and pressure," so we need to change the name "X3" to "Heat" and make it latent.
Double clicking X3, the following dialog comes up:
Clicking "OK," the result looks like this:
Exiting Tetrad
There are two ways to exit the program. One is to choose "Exit" from the File menu. (See Tetrad
Menubar.) The other is to click the red "X" in the top title bar for the application. This looks different
depending on what operating system you're in. In Windows XP, it looks like this:
When you exit the program, Tetrad asks you, for each session you have open, whether you'd like to save
that session and, if so, where you would like to save it. Saved sessions are given the ".tet" suffix and are
loadable in later versions of Tetrad.
Generating Random DAGs
In various contexts, you will be given the opportunity to generate a random DAG. The algorithm for
generating random DAGs in Tetrad is a Markov chain algorithm due to Melançon, Dutour, and Philippe,
which selects a random DAG uniformly from the set of DAGs that satisfy a given list of parameters,
which you may tweak. (The core add/remove step of the chain is sketched after the parameter list below.)
For instance, here is the dialog that comes up when you construct a new DAG in a Graph box:
If you select "A random DAG" in this dialog, you will need to specify seven parameters (or accept the
defaults). The parameters are as follows:
1. Number of nodes - that is, the total number of nodes (measured plus latent) in the graph.
2. Number of latent nodes - the number of nodes that should be made latent.
3. Maximum number of edges - The number of edges in the randomly generated graph will not exceed
this. In most cases, this will be the number of edges in the graph, but if other parameters are too
constraining, the number of edges may be less than this.
4. Maximum indegree - The number of edges into any given node will not exceed this.
5. Maximum outdegree - The number of edges out of any given node will not exceed this.
6. Maximum degree - The total number of edges connected to any given node will not exceed this.
7. Connected - "Yes" or "No," depending on whether the entire DAG should be connected or not. A
graph is connected if for every pair of nodes X, Y in the graph, there is an undirected path from X to
Y.
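As promised above, here is a rough Java sketch of the core add/remove step of the chain (ours, for
illustration). It ignores the latent, degree, and connectedness parameters, which the real generator also
enforces; run long enough, the chain's state approximates a uniform draw from the DAGs it can reach.

    import java.util.*;

    public final class RandomDagSketch {
        /** Repeatedly pick an ordered pair (i, j); remove the edge i --> j if
         *  present, otherwise add it provided the result stays acyclic. */
        static boolean[][] sample(int n, int steps, Random rnd) {
            boolean[][] edge = new boolean[n][n];
            for (int s = 0; s < steps; s++) {
                int i = rnd.nextInt(n), j = rnd.nextInt(n);
                if (i == j) continue;
                if (edge[i][j]) edge[i][j] = false;
                else {
                    edge[i][j] = true;
                    if (pathExists(edge, j, i)) edge[i][j] = false; // would create a cycle
                }
            }
            return edge;
        }

        // True if a directed path from "from" to "to" exists (depth-first search).
        private static boolean pathExists(boolean[][] edge, int from, int to) {
            Deque<Integer> stack = new ArrayDeque<>(List.of(from));
            boolean[] seen = new boolean[edge.length];
            while (!stack.isEmpty()) {
                int v = stack.pop();
                if (v == to) return true;
                if (seen[v]) continue;
                seen[v] = true;
                for (int w = 0; w < edge.length; w++)
                    if (edge[v][w]) stack.push(w);
            }
            return false;
        }
    }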
For the combination of parameters in the above dialog, the following random DAG might be generated:
Handling Missing Data
Missing data is represented in both the tabular and covariance data editors by an asterisk ("*"). There are
three ways to create missing data values in a data set:
In the Data Editor, replace the relevant entries with asterisks, by hand.
When loading data, load data that contains missing values.
From the Tools menu in the Data Editor, select "Inject Missing Values Randomly."
The first way explicitly declares that a particular datum is missing. The second way reflects missing data
as represented in a data file. The third way automatically adds missing data at random in a data file,
usually as a test of how an algorithm performs under conditions of missing data.
Usually it is a good idea to remove or impute missing data before handing it over to an algorithm. With
tabular data, one has the option of removing cases containing missing values from a data file, by selecting
Tools-->Missing Values-->Remove Cases with Missing Values. Selecting instead
Tools-->Missing Values-->Replace Missing Values with Column Mode will replace missing values with
the mode of the column. This will work for either continuous or discrete data. Similarly for
Tools-->Missing Values-->Replace Missing Values with Column Mean, although this operates only on
continuous columns. Also, Tools-->Missing Values-->Replace Missing Values with Extra Category works
on discrete variables only and adds an extra category to each indicating where the missing values were.
Tools-->Missing Values-->Replace Missing Values with Regression Predictions imputes missing values
using a regression model, as follows. First, in a copy of the data set, missing values are imputed using
column means. Then, for each missing value, the column of that missing value is regressed onto the
remaining columns in the data set, and the predicted value for that case, computed using values from the
data set copy, replaces the missing value.
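The first pass is just column-mean imputation on a copy of the data, as in this illustrative Java sketch
(ours, not Tetrad's source code); the second pass would regress each originally missing entry's column on
the remaining columns of the copy and substitute the predictions. Missing values are assumed here to be
coded as Double.NaN.

    public final class ImputationSketch {
        /** Returns a copy of the data with each NaN replaced by the mean of
         *  the observed values in its column. */
        static double[][] meanImpute(double[][] data) {
            int rows = data.length, cols = data[0].length;
            double[][] copy = new double[rows][];
            for (int r = 0; r < rows; r++) copy[r] = data[r].clone();
            for (int c = 0; c < cols; c++) {
                double sum = 0; int count = 0;
                for (double[] row : copy)
                    if (!Double.isNaN(row[c])) { sum += row[c]; count++; }
                double mean = count > 0 ? sum / count : 0;
                for (double[] row : copy)
                    if (Double.isNaN(row[c])) row[c] = mean;
            }
            return copy;
        }
    }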
If an algorithm is handed data with missing values and does not have a means to deal with them sensibly,
a message is posted asking the user to remove or impute the missing data first. Some algorithm
configurations do have sensible means to deal with missing data, in which case the data is used. For
instance, any algorithm taking a Chi Square or G Square independence test as oracle can sensibly ignore
rows of variables being compared that contain missing values. For fine control over handling of missing
data, it is best, however, to impute missing data first.
Algorithms that can handle missing data include:
Any search algorithm using Chi Square or G Square as a conditional independence oracle.
ML Bayes Updater
Dirichlet Bayes Updater
EM Bayes Updater
Any other algorithm that takes data as an input will display an error message if the data contains missing
values.
Opening Sessions
Sessions that have been previously saved out as ".tet" files can be loaded back using the "Open Session..."
menu item in the main File menu.
You will be asked to locate the file on your hard drive. Once you do, click "Open."
Note that sessions saved using versions published since March 2005 are stamped with the version number
of Tetrad they were saved with and the date they were saved. (See Tetrad Versioning.) If you need a
specific version of the program for some reason, all versions since March 2005 can be launched using
Java Web Start from the Tetrad download site, and versions previous to March 2005 are stored in the
archive directory inside the Tetrad download directory on that site and can be launched as executable jars.
Saving Screenshots
Most editors in Tetrad, along with the main workspace area have a menu item in their File menu called
"Save Screenshot..." Editors that display graphs also have a second menu item called "Save Graph
Image..." These save PNG images of the dialog being displayed and the graph being displayed,
respectively.
To show how these menu items function, consider the editor for Directed Acyclic Graphs. If you select
Save Screenshot from the File menu of this editor, you will see a Save dialog asking you for a filename. If
you type a filename (or accept the default) and click "Save," a PNG image will be saved for you that looks
something like this:
Notice that part of the graph is obscured. To save the entire image of just the graph, select "Save Graph
Image." If you do this, you will get an image something like this:
Saving and Loading Sessions.
Sessions can be saved by using either the "Save Session" or the "Save Session As..." menu item in the
main File menu:
If you use "Save Session As...," you will be asked to supply a name for the session and to locate the
directory you want to save it in. The name must end with ".tet"; this will be added to the end of your name
if you don’t type it.
If you use "Save Session," you will only be asked for a name and directory for you sesssion if you have
not already supplied one. Otherwise, the session will be saved in the same place and the same name as the
last time you saved it.
If you exit Tetrad, you will be given the option of saving your sessions. See Exiting Tetrad.
Definitions
Although a decent understanding of causal search theory requires more than a few comments in a manual,
it is helpful to define at least some of the terms used in this program, or, where it seems more appropriate,
to make references to the literature where the terms are defined and discussed in appropriate detail. We
will try to define enough terms to make it clear at least how to use the program.
Terms:
Measurement/Structural Model
Measured and Latent Variables
Tetrad Graph Types
Definition: Measurement/structural model
A specific graphical model where all observed variables are indicators of some latent variable (i.e., each
observed node has at least one latent parent), and no observed variable is a cause of any latent
variable (but observed variables may be causes of other observed variables). The model describing which latents are
causes of which observed variables, and how the observed variables are related, is called the
measurement model. The model describing how latents are causally related is called the structural
model.
Also, we say that a measurement/structural model is pure if every observed variable has only one latent
parent, and no pair of observed variables is connected by a path that does not include a latent variable of
the model. The following figure illustrates a pure measurement/structural model.
Measured Vs. Latent Variables
Variables in Tetrad are of two types: measured and latent. Measured variables (often called "observed"
variables) are variables for which data have been measured. Latent variables are variables for which data
have not been measured but which you believe might be required to explain the causal relationships
between measured variables adequately.
We represent measured variables in graphs using rectangular boxes around their variable names and latent
variables using oval shapes around their variable names. In the example below, Grind, CoffeeTaste, and
Temperature are measured variables, while Freshness is a latent variable.
We would expect data sets for these variables to contain columns only for Grind, Temperature, and
CoffeeTaste, although causal models for such data might include extra latent variables such as Freshness,
in this example.
A latent variable that is a parent of two or more measured variables is referred to as a latent common cause
(or unmeasured common cause), as for example CoffeeType in the picture below:
Tetrad Graph Types
The theory of causal search that the Tetrad program implements uses graphs of a variety of different types,
some simple and some fairly sophisticated. A brief description of each of the main types is given below.
For more details, consult the Bibliography, especially Spirtes, Glymour and Scheines (2002), Causation,
Prediction, and Search.
Directed Graphs.
A directed graph is a set of variables V together with a set of directed edges Vi-->Vj for Vi, Vj in V, Vi
not equal to Vj. A directed graph may contain cycles--that is, paths of the form X-->...-->X, for some X in
V.
Directed Acyclic Graphs (DAG).
A directed acyclic graph is a directed graph that does not contain cycles. This type of graph can be used to
construct a Bayes or SEM parametric model. A Bayes parametric model requires a DAG.
SEM Graph.
A SEM graph is a directed graph over a set of variables V which has been embellished by additional
variables E representing error terms for endogenous variables in V, edges from each e in E to its
corresponding variable in V, and a set of bidirected edges over this embellished graph. SEM graphs are
used to represent the causal structure of SEM models.
Pattern.
From Causation, Prediction and Search (2002), p. 61: "A pattern Pi is a mixed graph with directed and
undirected edges. A graph G is in the set of graphs represented by Pi if and only if:
(i) G has the same adjacency relations as Pi;
(ii) if the edge between A and B is oriented A-->B in Pi, then it is oriented A-->B in G;
(iii) if Y is an unshielded collider on the path <X, Y, Z> in G then Y is an unshielded collider on <X, Y,
Z> in Pi."
Y is an unshielded collider on path <X, Y, Z> iff X-->Y<--Z and X and Z are not adjacent.
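Stated as code, with our own illustrative representation (edge[i][j] meaning i-->j, adjacent[i][j]
meaning an edge of any kind between i and j):

    public final class ColliderCheckSketch {
        /** Y is an unshielded collider on <X, Y, Z> iff X --> Y <-- Z and
         *  X and Z are not adjacent. */
        static boolean isUnshieldedCollider(boolean[][] edge, boolean[][] adjacent,
                                            int x, int y, int z) {
            return edge[x][y] && edge[z][y] && !adjacent[x][z];
        }
    }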
Patterns are theoretically output by the PC and GES algorithms. Sometimes the PC algorithm includes in
its output bidirected edges. When this happens, it is because there is conflicting independence test
information. This usually means that the assumption of causal sufficiency has been violated and that FCI
should be run as well for comparison.
POIPG.
POIPG’s (pronounced "poip-G") were output by earlier versions of FCI; due to theoretical advances since
then, FCI now outputs PAGs (see below).
Partial Ancestral Graph (PAG).
A PAG (partial ancestral graph) is a graph whose edge endpoints may be arrowheads, tails, or circles
("o"). It represents the class of causal structures--possibly including latent common causes--that imply
the same conditional independence relations among the observed variables. A circle endpoint indicates
that the class contains members with an arrowhead at that position and members with a tail.
PAGs are output by FCI and CCD.
Mixed Ancestral Graphs (MAG).
Definition: a MAG is a graph containing directed (-->) and bidirected (<->) edges, with no directed
cycles and no almost directed cycles, in which every pair of non-adjacent variables can be separated by
some set of the remaining variables. MAGs encode the independence structure that remains over the
observed variables when the latent variables of a DAG are marginalized out; a PAG represents a class of
MAGs.
Why Doesn’t Tetrad...?
Provide descriptive statistics?
- Because those statistics can easily be obtained in Excel or Matlab
Transform variables, e.g., by taking logs
- Because this can be done in common commercial packages
Do linear regression analysis?
- Same answer.
Use logistic regression or log-linear models?
- These regression procedures could be valuable in estimating parameters in Bayes nets, but they require a
sound search procedure for selecting interaction terms, and we haven’t solved that yet.
Do factor analysis?
- Probably it should. For many problems, however, the latent variable search procedure in Tetrad is
preferable.
Deal with non-Normal distributions for continuous variables?
- Relevant statistics are available only for the Normal, Multinomial, and Conditional Gaussian
distribution families; the last of these should eventually be included.
Provide all of the models consistent with search output, instead of "patterns," "PAGs," etc.?
- The models corresponding to a pattern could and perhaps should be provided, but there are infinitely
many models consistent with a PAG, and your computer is finite.
Provide Bayesian search procedures when there may be latent variables?
- No consistent, computationally feasible, general algorithms are known.
Provide search procedures for cyclic (non-recursive) models with latent
variables?
- No consistent search procedures are known.
Provide search procedures for time series?
- Bayes net search procedures can in principle be used for time series, but no practical, consistent, general
search procedures are known. The search algorithms in Tetrad can be used to search for "simultaneous"
causal relations after regression on lags.
Provide search procedures to find a common model or models for two or more separate data sets?
- If the variables in one data set are a subset of those in the other data set, a common model can be sought
with the present version of Tetrad. If the data sets have entirely distinct variable sets, no principled search
procedure for a common model is possible. If the data sets contain some common and some distinct
variables, a sometimes useful principled search is possible, but adequate algorithms have not yet been
developed.
Further Help
These help files are currently incomplete. If you have a question that they do not answer, or wish to report
a bug, please send email to [email protected]. Someone working on the Tetrad project will
answer.
References
The following books and articles are referred to in manual pages.
Bollen.
Spirtes, Glymour, and Scheines (2002). Causation, Prediction, and Search.