StatesAndGenericsProposal/Discussion


Nicolas:

"We identify three types of state variables, chemical modification (e.g. phosphorylation), localization (e.g. cytosolic) and complex member (e.g. member of mdm2-p53 complex). Each combination of the variables, called a variable set, defines a separate state. Interactions involving identical physical entities in a specific state (all with the same variable set) reference the same state instance."

One very important type of state variable is missing in my opinion: the conformational "state", defined by the relative position of atoms. For instance, a channel can be open or closed, and an enzyme active or inhibited (I am not talking about inhibition by another molecule, but inhibition due to a different conformation: see the conformations of CaMKII). Another example is the protonation of proteins, which affects their function. For instance, some enzymes are active in the lysosomes but not in the cytosol (that is just an example; cellular location is not an answer). A CRUCIAL difference is that we need a mechanism to represent NON-BINARY variables, e.g. active, inactive and desensitized, or a channel with several conductance states.

"Current biological representation system are inconsistent in the way they represent states. In one extreme states are totally omitted, and only reference physical entities are used, often supported with free text to explain the actual mechanism involving states. In the other extreme, e.g. simulation models, physical entity representation is totally dismissed and each molecule/state is treated as a separate physical entity."

"e.g. MOST simulation models ..."

Cf. StochSim (http://www.pdn.cam.ac.uk/groups/comp-cell/StochSim.html), which uses variable sets and generics.

In the table of correspondence, you can add that in SBML Level 2 Version 2, the "Reference Data" is the class "SpeciesType".

"Define three new properties, MODIFIED-AT, NOT-MODIFIED-AT, REGARDLESS-OF for specifying sequence features in the applicable physical entities (dna, rna and protein). These are a subset of the corresponding reference entity's possible variable set."

That implies that the only state variables are those which modify the sequence. As I said previously, one can imagine other cases, such as conformation and protonation. As a consequence, although I think the idea is nice, I would not use this terminology, and I think it should be attached to the pool rather than the sequence.

More generally, the existence of a non-binary state variable allows the usage of wildcards:

state-variable
  name=Thr34-phosphorylation
  state={on|off|?}
state-variable
  name=Thr34-glycosylation
  state={on|off|?} 

where '?' means "any" or "does not matter"

 Alan:
 There is a basic problem with the notion of does not matter and that is 
 that it is often ambiguous, context dependent, or wrong.
 What does not matter in one interaction can matter in another. In this 
 sense, "does not matter", if anything, is a property of the 
 molecule-in-the-context-of a pathway, a reaction, or a model (all of 
 these, in different cases)
   Nicolas:
   Absolutely! And this is why it has to be attached to the so-called
   pool. But to state that something does not matter already contains
   more information than not saying anything.
   Therefore, if I have a physicalEntity X with two state variables:
    catalytic-state: {on|off}
    Thr306P: {on|off|?}
   The control of an interaction by X[catalytic-state{on}]
   and
   the control of an interaction by X[catalytic-state{on};Thr306P{?}]
   are semantically different. In the former case, I do not know if the
   phosphorylation of Thr306 affects the control, whereas in the latter
   case I know that the phosphorylation of Thr306 does NOT affect the
   control.
     Alan:
     In the case that you present, where one is making a strong statement 
     (presumably experiments w/wo Thr306P have been done) I agree.
     And if this is what the state proposal suggests is the meaning of ? 
     then it should be documented (the defining experimental evidence, in 
     particular).
     But the proposal makes stronger statements, like having to enumerate 
     the possible state variables in the reference data, information which 
     is often incomplete. How are we to take this list of possible states? 
     As some of them? If so this is easily derivable by a query on the 
     database, represents redundant information, and is therefore prone to 
     error accumulation without careful maintenance. As all of them? Then 
     this is problematic from a data integration point of view. What if two 
     databases' referenceData are talking about the same thing but one knows 
     more phosphorylation sites than the other? Of course neither may know 
     the complete set. And so on.
     I would also point out that your distinction makes use of the so-called 
     open-world assumption (that if you don't state something, you don't 
     know it), something that the DX proposals reject. That OWL operates 
     under the open world assumption is one of the great benefits of its use 
     for representing biological knowledge. So by choosing to disregard it, 
     the DX proposals lose important expressivity, and generate semantically 
     incorrect OWL.
     Again, insofar as the DX proposal addresses short term needs of a 
     subset of the BioPAX community, it is up to them to make these choices. 
     But once this is out of the way we need to be working on solutions to 
     these issues.
 And to know that something really does not matter you need to do the 
 experiment with and without the molecule in the state that you claim 
 doesn't matter. Many times this is not done. When it is, it 
 is often worth recording the two experiments. More often only a 
 particular set of variables is recorded and the model is constructed of 
 these. And in many cases these models don't work because some variable 
 that was not known was not recorded and mattered.
 Then there is the blurring of what exactly we are talking about. Are we 
 talking about partial descriptions of things in the world, or are we 
 talking about computational models of pathways, which, depending on the 
 methodology can make different assumptions about what does and doesn't 
 matter, and about level of representation, for example continuous 
 versus discrete.
 If we are talking about things in the world, then we need to realize 
 that state is really a shortcut for representing an approximation of 
 something more complicated, and we should provide provisions for 
 linking the more elaborate knowledge to the shortcut as we acquire it, 
 or if some particular database chooses to represent an area in more 
 detail (compare, e.g. "phosphorylation", with the descriptions you find 
 in the RESID database).
 If we are talking about some specific model, then we should probably 
 say so, so that when we attempt to integrate knowledge from different 
 sources we don't have the wrong sorts of information mixing together 
 because of differing assumptions of the models.
 None of this may be a problem for the short term needs of DX (hard to 
 see why it wouldn't be, but hey, I don't write their requirements), but 
 resolving these issues is important as we move forward, and 
 particularly for those of us who need to be able to meaningfully 
 combine information from multiple sources.
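
Editorial illustration of the "some of them" versus "all of them" question raised above, as a minimal Python sketch. The function names and the two example site lists are hypothetical and not part of the proposal; the site labels are borrowed from the examples in this discussion. Under an open-world reading, each list is partial and a union is safe. Under a closed-world reading, each list claims completeness, so any difference is a contradiction.

```python
def merge_open_world(*site_lists):
    """Open-world reading: each list holds SOME known sites, so a union is safe."""
    merged = set()
    for sites in site_lists:
        merged |= set(sites)
    return merged


def conflicts_closed_world(sites_a, sites_b):
    """Closed-world reading: each list claims ALL sites, so differences conflict."""
    return set(sites_a) ^ set(sites_b)


# Hypothetical reference data for the same protein in two databases.
db_a_sites = {"Thr34", "Thr306"}
db_b_sites = {"Thr34"}

merged = merge_open_world(db_a_sites, db_b_sites)
conflicts = conflicts_closed_world(db_a_sites, db_b_sites)
```

Here the open-world merge simply accumulates both sites, while the closed-world comparison flags Thr306 as a disagreement that someone would have to resolve.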

OR

state-variable
  name=Thr34
  state={naked|phosphorylated|glycosylated|?}

state-variable
  name=open
  state={on|off|?}
state-variable
  name=desensitized
  state={on|off|?}

where '?' means "any" or "does not matter"

OR

state-variable
  name=channel-state
  state={open|closed|desensitized|?}
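
Editorial illustration: the alternatives above could be sketched in Python roughly as follows. This is only a sketch under assumptions. `StateVariable`, `make_pattern`, `matches` and `status` are hypothetical names, not BioPAX classes or properties. The point is that a multi-valued variable with a '?' wildcard lets one distinguish a variable that is explicitly irrelevant from one that is simply not mentioned.

```python
ANY = "?"  # wildcard: the value is stated not to matter


class StateVariable:
    """A named variable with a finite set of allowed values (hypothetical helper)."""

    def __init__(self, name, values):
        self.name = name
        self.values = frozenset(values)

    def check(self, value):
        # Accept the wildcard or any declared value; reject everything else.
        if value != ANY and value not in self.values:
            raise ValueError(f"{value!r} is not a valid value for {self.name}")
        return value


def make_pattern(variables, constraints):
    """Validate constraints against the declared variables and return them."""
    by_name = {v.name: v for v in variables}
    return {name: by_name[name].check(value) for name, value in constraints.items()}


def matches(pattern, concrete):
    """True if a concrete state satisfies every non-wildcard constraint."""
    return all(want == ANY or concrete.get(name) == want
               for name, want in pattern.items())


def status(pattern, name):
    """Report whether a variable is required, explicitly irrelevant, or unspecified."""
    if name not in pattern:
        return "unspecified"
    return "irrelevant" if pattern[name] == ANY else "required"


VARIABLES = [
    StateVariable("channel-state", ["open", "closed", "desensitized"]),
    StateVariable("Thr34", ["naked", "phosphorylated", "glycosylated"]),
]

with_wildcard = make_pattern(VARIABLES, {"channel-state": "open", "Thr34": ANY})
without_mention = make_pattern(VARIABLES, {"channel-state": "open"})
```

Both patterns match the same concrete states, but `status(with_wildcard, "Thr34")` reports the variable as irrelevant while `status(without_mention, "Thr34")` reports it as unspecified, which is the distinction argued for in the thread above.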

SUMMARY:

1) We should extend the states to modifications that do not imply chemical modification (Yes, I know a protonation can be seen as a chemical modification. You see what I mean).

2) There may be room for non-binary state variables.

Alan:

The second major issue with the proposal, from my point of view, is the notion of reference data. The issue with it is that it effectively says: there is something the same about this physicalEntity and the reference data, but it is up to you to figure out what. In order to support generics, physical entities can point to reference data that resolves to groups of physical entities, each of which may have conflicting information, such as sequence. As far as I can tell, there is no theory or mechanism for deciding which of these properties is valid or specific to the physicalEntity in your reaction and what the status of the rest is.

This makes more sense in SBML which is explicitly about making computational models of a certain kind. In BioPAX, which is intended to describe information at a different level, and for a wider variety of uses, in particular integration (as opposed to simple accumulation), it is a problem.

Once again, perhaps not an issue for DX, but of serious concern to others.