M.Sc projects
taverna project
reading
basics
XML
two major uses of XML
data-centric
describing structured information that is generally meant for machine consumption
document-centric
attaching metadata to information intended for human consumption
XML instances
XML docs have an optional 'prolog' to:
identify the doc as XML
include comments about the doc
include metadata about document content
Processing instructions
directives to the application that will process the document, enclosed in <?PITarget ...?> tags
the important part there is the ?, not the name 'PITarget', which can be anything
common one:
the prolog is followed by the root element
elements
can have 3 different content types
element-only
consists of nested elements
mixed
any combination of nested elements and text
empty content
just a start tag and an end tag, or the self-closing <tag/> shorthand
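The three content models in one toy doc (my example), checked with Python's standard-library ElementTree:

```python
# Element content models: element-only, mixed, and empty.
import xml.etree.ElementTree as ET

doc = """
<record>
  <authors><author>A</author><author>B</author></authors>  <!-- element-only -->
  <note>See <ref>fig 1</ref> for details</note>            <!-- mixed -->
  <published/>                                             <!-- empty, shorthand form -->
</record>
"""
root = ET.fromstring(doc)
print(root.find("authors/author").text)  # "A": nested element content
print(root.find("note").text)            # "See ": text before the first child in mixed content
print(root.find("published").text)       # None: no content at all
```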
namespaces
you often compose XML docs, and if they have similarly named tags you get name collisions
namespaces let you define a prefix for your tags to make them unique
eg xmlns:my_prefix="http://namespace_url/"
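A quick sketch of the collision case (the namespace URIs here are made up); ElementTree expands each prefix into {uri}localname, so the two name tags stay distinct:

```python
# Two vocabularies both define a <name> tag; namespaces keep them apart.
import xml.etree.ElementTree as ET

doc = """
<root xmlns:person="http://example.org/person"
      xmlns:company="http://example.org/company">
  <person:name>Alice</person:name>
  <company:name>Acme</company:name>
</root>
"""
root = ET.fromstring(doc)
# Lookups use the expanded {uri}localname form, not the prefix.
print(root.find("{http://example.org/person}name").text)   # Alice
print(root.find("{http://example.org/company}name").text)  # Acme
```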
DTDs
separate metalanguage for describing the set of valid documents
suck
xml schemas
xml schema data usually in xsd namespace
xsi is the schema-instance namespace, with attributes for use inside instance documents
e.g. you can use xsi:schemaLocation to point to a URL that has the schema
schema spec has all sorts of builtin types you can use to annotate data elements with
and you can define your own by deriving from a base type and restricting it along several 'facets'
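For instance, a type derived from xsd:string and restricted along the maxLength and pattern facets might look like this (hypothetical fragment; the type name is mine):

```xml
<xs:simpleType name="geneSymbol">
  <xs:restriction base="xs:string">
    <xs:maxLength value="16"/>
    <xs:pattern value="[A-Za-z0-9-]+"/>
  </xs:restriction>
</xs:simpleType>
```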
parsing
SAX
push-parsing
pull parsing
one-shot parsing
read whole doc into DOM
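A minimal contrast between the streaming and one-shot styles, using Python's standard library (xml.sax for push parsing, ElementTree for reading the whole doc):

```python
# Push parsing: xml.sax streams the document and calls our handler per event.
import io
import xml.sax
import xml.etree.ElementTree as ET

DOC = "<svc><op>run</op><op>status</op></svc>"

class OpCounter(xml.sax.ContentHandler):
    def __init__(self):
        self.count = 0
    def startElement(self, name, attrs):
        if name == "op":
            self.count += 1

handler = OpCounter()
xml.sax.parse(io.StringIO(DOC), handler)
print(handler.count)  # 2, without ever holding the whole tree in memory

# One-shot parsing: read the whole document into an in-memory tree (DOM-style).
root = ET.fromstring(DOC)
print(len(root.findall("op")))  # 2, but the full tree is resident
```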
SOAP
is an XML-based messaging protocol
XML protocols have had a first and second generation
first generation are pure XML
second generation also use XML namespaces and XML schemas
basically like a ProtocolBuffer except raw XML, used for communicating between services
Problem with 1st gen is the same as with ProtocolBuffers -- whenever you want to add a new datatype to your packets, you have to revise the standard and everyone has to update their parsers
second gen protocols try to build extensibility into the protocol itself
also, they try to include semantic information along with the data so you can understand the context in which the data is being used
SOAP has a bunch of optional headers and then a mandatory body, both enclosed in a root 'envelope' element
this means you can recognize SOAP docs really quickly, but you need a separate validation procedure for the embedded XML (i.e. the SOAP doc by itself won't validate; you need to build custom SOAP validation into the web server)
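A sketch of that shape: optional headers, mandatory body, everything under the envelope root (the header/body payload here is made up):

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

msg = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <txn id="42"/>
  </soap:Header>
  <soap:Body>
    <getStatus jobId="abc123"/>
  </soap:Body>
</soap:Envelope>
"""
root = ET.fromstring(msg)
# Recognizing a SOAP doc is cheap: check the root element's expanded name...
print(root.tag == "{%s}Envelope" % SOAP_ENV)  # True
# ...but the payload inside Body is arbitrary embedded XML that a generic
# envelope check knows nothing about; it needs its own validation step.
body = root.find("{%s}Body" % SOAP_ENV)
print(body[0].tag)  # getStatus
```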
WSDL
portType
describes interfaces of operations supported by services
the 'what'
binding
details of how elements in abstract interface are converted into a concrete representation
the 'how'
port
how a binding is deployed at a particular network endpoint (where the service is, basically)
service
bag of operations that compose one service
see salcentral.com or xmethods.net for examples
notes
taverna
service registry
it's in default services section of /Users/Debo/Library/Application Support/Taverna/conf/mygrid.properties
However, this file isn't used by default. The default lookup goes to a properties file in /Users/Debo/Library/Application Support/Taverna/repository/uk/org/mygrid/taverna/taverna-core/1.5.1/taverna-core-1.5.1.jar!/mygrid.properties, as identified in the user-editable properties file
the default differs from the user-exposed one in a couple of ways: it comments out some services that have died, and adds a couple of others
feasibility experiments
retrieving the service specs for all services
looked up service registry locations in service registry spec (mygrid.properties)
soaplab
can't use straight wget because of robots.txt (stupid)
wget the services file that has html with links to wsdl embedded
used 'grep http' on the services file to grep out the html that holds these wsdl links
wrote perl one-off to extract the actual wsdl links from the html (linkgrabber.pl)
dumped wsdl links to file
used the dumped wsdl links as list of target urls for my python 'wget_simple' to retrieve all soaplab wsdl files
put wsdl files in data/msc/services/soaplab
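The fetch step, roughly as I'd sketch it ('wget_simple' itself and the filename scheme are the hypothetical parts; this just reads one WSDL URL per line and saves each response):

```python
# wget_simple-style fetcher: download every URL listed in a file, one per line.
import os
import urllib.request

def filename_for(url):
    """Derive a safe local filename from a WSDL URL."""
    tail = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    return tail.replace("?", "_").replace("&", "_")

def fetch_all(url_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            data = urllib.request.urlopen(url).read()
            with open(os.path.join(out_dir, filename_for(url)), "wb") as out:
                out.write(data)

if __name__ == "__main__":
    fetch_all("wsdl_links.txt", "data/msc/services/soaplab")
```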
taverna defaults
there was a short list of them, so I just hand-copied the links and individually wget them
biomoby
having problems: it's a script so I can't view source, and wget is failing; will fix
ok, so that script implements the biomoby central registry API. You have to interact with it programmatically
The Central class of the biomoby java api permits interactions with this script, I'm going to try to build and run that library right now
had to download and build jmoby to get that library, now i'm going to write a little app to do the downloading
found a better way: Building jmoby gives you a command line client to manipulate the central moby registry
what i did to fetch the wsdl for each biomoby service:
i used the commandline tool to fetch all service names (run-cmdline.sh -ls)
I grepped the actual names out of all the cruft (grep -v '
copied the names file into data/msc/services/biomoby
wrote python script in 'one-offs' called biomoby-wsdl-fetcher that iterates through each name, calls cmdline -wsdl, and writes to a file
put the output of this script (the wsdl docs) into data/msc/services/biomoby
I will probably need more than just the wsdl to do this if the inputs/outputs are semantically annotated
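The biomoby-wsdl-fetcher step, sketched (the run-cmdline.sh path and output layout are mine; exact client arguments may differ per jMoby build):

```python
# For each biomoby service name, ask the jMoby command-line client for its
# WSDL and save it to a file named after the service.
import os
import subprocess

def fetch_wsdl(name, cmdline="./run-cmdline.sh"):
    # assumes the client prints the WSDL for a service to stdout
    result = subprocess.run([cmdline, "-wsdl", name],
                            capture_output=True, text=True, check=True)
    return result.stdout

def fetch_all(names_file, out_dir, cmdline="./run-cmdline.sh"):
    os.makedirs(out_dir, exist_ok=True)
    with open(names_file) as f:
        for name in (line.strip() for line in f):
            if not name:
                continue
            with open(os.path.join(out_dir, name + ".wsdl"), "w") as out:
                out.write(fetch_wsdl(name, cmdline))
```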
biomart
i'm getting a server not found but taverna seems to find the services just fine!
seqhound?
profiling the service inputs and outputs
I wrote a script to parse all the biomoby, taverna_default, and soaplab wsdl files
dev/code/python/one-offs/wsdl_inputs_profiler.py
the script uses SOAPpy to parse the WSDL files
for each method in each WSDL, it retrieves the inputs and makes a schema:type key, which it uses to count
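The counting itself is simple; a minimal sketch of the tally (SOAPpy's WSDL object model aside, this is the namespace|type keying):

```python
# Tally WSDL input parameters as "namespace|typename" keys.
from collections import Counter

def profile(params):
    """params: iterable of (namespace, typename) pairs pulled from WSDLs."""
    return Counter("%s|%s" % (ns, ty) for ns, ty in params)

inputs = [
    ("http://www.w3.org/2001/XMLSchema", "string"),
    ("http://www.w3.org/2001/XMLSchema", "string"),
    ("http://xml.apache.org/xml-soap", "Map"),
]
for key, n in profile(inputs).most_common():
    print(key, n)
# http://www.w3.org/2001/XMLSchema|string 2
# http://xml.apache.org/xml-soap|Map 1
```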
results:
inputs
processing errors
35 documents out of 1664 could not be parsed because of XML parsing and WSDL interpretation errors
I haven't looked into why this is the case yet
vast numbers of strings
http://schemas.xmlsoap.org/soap/encoding/|string 4068
http://www.w3.org/2001/XMLSchema|string 1246
quite a few maps
http://xml.apache.org/xml-soap|Map 956
a smattering of other types
http://www.w3.org/2001/XMLSchema|int 27
SOAP/KEGG|ArrayOfstring 19
http://www.w3.org/2001/XMLSchema|float 6
SOAP/KEGG|ArrayOfint 2
then 1386 "unique" inputs
but each of these is an emboss soaplab service where they've used the url of the service as the schema, and all the types are of the form:
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|waitFor 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|runAndWaitFor 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getStatus 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getResults 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|createAndRun 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|ArrayOf_soapenc_string 1
outputs
http://schemas.xmlsoap.org/soap/encoding/|string 1676
http://xml.apache.org/xml-soap|Map 1195
http://www.w3.org/2001/XMLSchema|string 1170
http://www.w3.org/2001/XMLSchema|long 956
SOAP/KEGG|ArrayOfstring 36
SOAP/KEGG|ArrayOfDefinition 8
urn:BINDSOAP|SearchResultBean 4
SOAP/KEGG|ArrayOfSSDBRelation 4
http://www.ebi.ac.uk/soaplab/emboss4/services/GowlabFactory|ArrayOf_soapenc_string 3
http://www.ebi.ac.uk/soaplab/emboss4/services/AnalysisFactory|ArrayOf_soapenc_string 3
http://www.ebi.ac.uk/collab/mygrid/service1/goviz/GoViz.jws|ArrayOf_xsd_string 3
SOAP/KEGG|ArrayOfStructureAlignment 3
then a bunch of unique soaplab outputs as per inputs
Note that analysis on soaplab services is moot: they all take arbitrary maps as input and return lists of strings as output
discussion
obviously ontologies to classify these inputs and outputs would go a long way towards improving our predictive value here
I haven't profiled the input or output names, but a lot of them seem to be content-poor (e.g. "data", "sequence")
I don't know how ontologies are used in the current Taverna/Moby builds: I'm going to sift code and figure that out now.
directions
profile actual workflows and see how many different services are used and when
I don't know how taverna represents these services internally. It could be that they have richer type information than what I'm retrieving from the WSDLs directly, but this seems unlikely, because the annotations exposed to the user are at least as vague as what I'm seeing in the source
figure out how types are annotated
profiling workflow inputs and outputs
tried to write a script to analyze all scufl workflows
3 kinds of network services being accessed
arbitrary wsdl
This was easy, I just fetch the wsdl from the specified url in the SCUFL and examine the inputs and outputs
soaplab
points at a soaplab url, but if you retrieve it from the web it's just bogus html content
i've already retrieved the actual wsdl files, but there seem to be more specified in the workflows than I have on disk
it's also confusing to retrieve inputs and outputs because of the weird way that soaplab messages are structured -- they're always the same message, "startRun" etc.
anyways this is moot, since any startup operation takes an arbitrary map
biomoby
Haven't started on this yet, but I feel like it will be complicated since you just get the operation name and a pointer to the registry -- perhaps I can look it up locally.
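For the arbitrary-wsdl case above, the extraction step might be sketched like this (the element names processor / arbitrarywsdl / wsdl are my recollection of the SCUFL schema and should be checked against a real workflow file):

```python
# Pull WSDL URLs out of a SCUFL doc by matching on local element names,
# so the scufl namespace prefix doesn't matter.
import xml.etree.ElementTree as ET

def wsdl_urls(scufl_xml):
    root = ET.fromstring(scufl_xml)
    urls = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "wsdl" and elem.text:
            urls.append(elem.text.strip())
    return urls

sample = """
<scufl><processor name="blast">
  <arbitrarywsdl><wsdl>http://example.org/blast?wsdl</wsdl></arbitrarywsdl>
</processor></scufl>
"""
print(wsdl_urls(sample))  # ['http://example.org/blast?wsdl']
```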
meetings
first
ben's project
assume hierarchy, evaluate those
for each class, each instance will demonstrate a pattern
given a set of instances, can we find a pattern for the class
chi defined two metrics
coverage
features that could be associated with genes (co-occurrence of MeSH keywords)
too simplistic
we have no control
proposal
use an ontology that defines phosphatases, because we have an owl file that has restrictions on classes
build classifiers to do it -- ben
todo
get ontology from ben
web service composition (mark's connotea)
taverna
myexperiment.org
carole goble
and the work they're already doing (mark will send papers)
[13] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. Pocock, P. Li, and T. Oinn, "Taverna: a tool for building and running workflows of services," Nucleic Acids Research, 2006.
workflows
they're annotated into ontologies
one approach to pattern finding would be to traverse the ontology upwards and see where common nodes appear
31-Jan-07
brief talk about notes organization
rachel does alignment, experimental data
rachel will send jiang's stuff on experimental analysis
22-May-07
i vent my freakout on Rachel about wanting to switch projects
She suggests that we should be thinking about this problem as culling paths instead of statistically inferring what people will pick next
as a first task, should collect:
all possible service inputs
all possible service outputs
all inputs into workflow instances we have right now
all outputs of workflow instances we have right now
Calculations:
How many inputs are there in total
how many outputs are there in total
what percentage of these are actually used
some other stats that would tell us the order of the number of paths we could possibly have
29-May-07
todo
figure out how biomart and seqhound services are retrieved
profile collections of workflows to see how many services are actually used and how often
write XML parser that does the following
parse each scufl doc
for each processor tag:
if it's a wsdl tag, get the wsdl url from the spec
make a WSDL service stub from the wsdl doc
introspect out the inputs and outputs into a map as before
this might take forever to run...
determine if there are any type annotations on the WSDL specs for services that let us actually know what the types mean, as per BioMoby's claim
there seem to be a lot of technologies used to attach meaning to various aspects of the workflows and processors -- FETA, MOBY, RDF, OWL, etc. I want to be able to delineate what each technology contributes to the behaviour of Taverna, so that I can filter out which ones will be of use in path reduction
Basically I need to know exactly what type information is available beyond WSDL for each type of service so I know what services it is possible to restrict on
also knowing the performance characteristics of retrieving this type information is important
mark is swamped but i can skype, and maybe meet with ben?
write script to convert mindmaps to wiki
google scholar evaluation project
limit between 1950-2005
look up 100-1000
pubmed
inputs: search strings and limits
output: list of pmIDs that were returned by search
second part: input is gold standard PMids and found ones, compute precision, recall, etc.
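The second-part computation, sketched as straight set arithmetic over PMIDs (the example IDs are made up):

```python
# Precision/recall over PubMed IDs: gold standard vs. what the search returned.
def precision_recall(gold, found):
    gold, found = set(gold), set(found)
    tp = len(gold & found)  # true positives: found IDs that are in the gold standard
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(["111", "222", "333", "444"], ["222", "333", "999"])
print(p, r)  # precision ~0.667 (2 of 3 found are correct), recall 0.5 (2 of 4 gold found)
```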
gs
inputs: html doc
intermediate records of stuff that needs to be spewed out
careful: GS sometimes duplicates references, don't give it two checkmarks when this happens!
output: same initial
due 31st