M.Sc projects
taverna project
reading
basics
XML
two major uses of XML
data-centric
describing structured information that is generally meant for machine consumption
document-centric
attaching metadata to information intended for human consumption
XML instances
XML docs have an optional 'prolog' to:
identify the doc as XML
include comments about the doc
include metadata about document content
Processing instructions
directives to the application that will process the document, enclosed in <?PITarget ...?> tags
the important part there is the ?, not the name 'PITarget', which can be anything
common one:
the prolog is followed by the root element
elements
can have 3 different content types
element-only
consists of nested elements
mixed
any combination of nested elements and text
empty content
just a start tag and an end tag, or the self-closing <tag/> shorthand
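The three content models in one toy doc (my example), checked with Python's standard-library ElementTree:

```python
# Element content models: element-only, mixed, and empty.
import xml.etree.ElementTree as ET

doc = """
<record>
  <authors><author>A</author><author>B</author></authors>  <!-- element-only -->
  <note>See <ref>fig 1</ref> for details</note>            <!-- mixed -->
  <published/>                                             <!-- empty, shorthand form -->
</record>
"""
root = ET.fromstring(doc)
print(root.find("authors/author").text)  # "A": nested element content
print(root.find("note").text)            # "See ": text before the first child in mixed content
print(root.find("published").text)       # None: no content at all
```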
namespaces
you often compose XML docs, and if they have similarly named tags you get name collisions
namespaces let you define a prefix for your tags to make them unique
eg xmlns:my_prefix="http://namespace_url/"
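A quick sketch of the collision case (the namespace URIs here are made up); ElementTree expands each prefix into {uri}localname, so the two name tags stay distinct:

```python
# Two vocabularies both define a <name> tag; namespaces keep them apart.
import xml.etree.ElementTree as ET

doc = """
<root xmlns:person="http://example.org/person"
      xmlns:company="http://example.org/company">
  <person:name>Alice</person:name>
  <company:name>Acme</company:name>
</root>
"""
root = ET.fromstring(doc)
# Lookups use the expanded {uri}localname form, not the prefix.
print(root.find("{http://example.org/person}name").text)   # Alice
print(root.find("{http://example.org/company}name").text)  # Acme
```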
DTDs
separate metalanguage for describing the set of valid documents
suck
xml schemas
xml schema data usually in xsd namespace
xsi is the schema-instance namespace, with attributes for use inside instance documents
e.g. you can use xsi:schemaLocation to point to a URL that has the schema
schema spec has all sorts of builtin types you can use to annotate data elements with
and you can define your own by deriving from a base type and restricting it along several 'facets'
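For instance, a type derived from xsd:string and restricted along the maxLength and pattern facets might look like this (hypothetical fragment; the type name is mine):

```xml
<xs:simpleType name="geneSymbol">
  <xs:restriction base="xs:string">
    <xs:maxLength value="16"/>
    <xs:pattern value="[A-Za-z0-9-]+"/>
  </xs:restriction>
</xs:simpleType>
```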
parsing
SAX
push-parsing
pull parsing
one-shot parsing
read whole doc into DOM
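A minimal contrast between the streaming and one-shot styles, using Python's standard library (xml.sax for push parsing, ElementTree for reading the whole doc):

```python
# Push parsing: xml.sax streams the document and calls our handler per event.
import io
import xml.sax
import xml.etree.ElementTree as ET

DOC = "<svc><op>run</op><op>status</op></svc>"

class OpCounter(xml.sax.ContentHandler):
    def __init__(self):
        self.count = 0
    def startElement(self, name, attrs):
        if name == "op":
            self.count += 1

handler = OpCounter()
xml.sax.parse(io.StringIO(DOC), handler)
print(handler.count)  # 2, without ever holding the whole tree in memory

# One-shot parsing: read the whole document into an in-memory tree (DOM-style).
root = ET.fromstring(DOC)
print(len(root.findall("op")))  # 2, but the full tree is resident
```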
SOAP
is an XML-based messaging protocol
XML protocols have had a first and second generation
first generation are pure XML
second generation also use XML namespaces and XML schemas
basically like a ProtocolBuffer except raw XML, used for communicating between services
Problem with 1st gen is the same as with ProtocolBuffers -- whenever you want to add a new datatype to your packets, you have to revise the standard and everyone has to update their parsers
second gen protocols try to build extensibility into the protocol itself
also, they try to include semantic information along with the data so you can understand the context in which the data is being used
SOAP has a bunch of optional headers and then a mandatory body, both enclosed in a root 'envelope' element
this means you can recognize SOAP docs really quickly, but you need a separate validation procedure for the embedded XML (i.e. the SOAP doc by itself won't validate; you need to build custom SOAP validation into the web server)
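A sketch of that shape: optional headers, mandatory body, everything under the envelope root (the header/body payload here is made up):

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

msg = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <txn id="42"/>
  </soap:Header>
  <soap:Body>
    <getStatus jobId="abc123"/>
  </soap:Body>
</soap:Envelope>
"""
root = ET.fromstring(msg)
# Recognizing a SOAP doc is cheap: check the root element's expanded name...
print(root.tag == "{%s}Envelope" % SOAP_ENV)  # True
# ...but the payload inside Body is arbitrary embedded XML that a generic
# envelope check knows nothing about; it needs its own validation step.
body = root.find("{%s}Body" % SOAP_ENV)
print(body[0].tag)  # getStatus
```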
WSDL
portType
describes interfaces of operations supported by services
the 'what'
binding
details of how elements in abstract interface are converted into a concrete representation
the 'how'
port
how a binding is deployed at a particular network endpoint (where the service is, basically)
service
bag of operations that compose one service
see salcentral.com or xmethods.net for examples
notes
taverna
service registry
it's in default services section of /Users/Debo/Library/Application Support/Taverna/conf/mygrid.properties
However, this file isn't used by default. The default lookup goes to a properties file in /Users/Debo/Library/Application Support/Taverna/repository/uk/org/mygrid/taverna/taverna-core/1.5.1/taverna-core-1.5.1.jar!/mygrid.properties, as identified in the user-editable properties file
the default differs from the user-exposed one in a couple of ways: it comments out some services that have died, and adds a couple of others
feasibility experiments
retrieving the service specs for all services
looked up service registry locations in service registry spec (mygrid.properties)
soaplab
can't use straight wget because of robots.txt (stupid)
wget the services file that has html with links to wsdl embedded
used 'grep http' on the services file to grep out the html that holds these wsdl links
wrote perl one-off to extract the actual wsdl links from the html (linkgrabber.pl)
dumped wsdl links to file
used the dumped wsdl links as list of target urls for my python 'wget_simple' to retrieve all soaplab wsdl files
put wsdl files in data/msc/services/soaplab
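The fetch step, roughly as I'd sketch it ('wget_simple' itself and the filename scheme are the hypothetical parts; this just reads one WSDL URL per line and saves each response):

```python
# wget_simple-style fetcher: download every URL listed in a file, one per line.
import os
import urllib.request

def filename_for(url):
    """Derive a safe local filename from a WSDL URL."""
    tail = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    return tail.replace("?", "_").replace("&", "_")

def fetch_all(url_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            data = urllib.request.urlopen(url).read()
            with open(os.path.join(out_dir, filename_for(url)), "wb") as out:
                out.write(data)

if __name__ == "__main__":
    fetch_all("wsdl_links.txt", "data/msc/services/soaplab")
```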
taverna defaults
there was a short list of them, so I just hand-copied the links and individually wget them
biomoby
having problems: it's a script so I can't view source, and wget is failing; will fix
ok, so that script implements the biomoby central registry API. You have to interact with it programmatically
The Central class of the biomoby java api permits interactions with this script, I'm going to try to build and run that library right now
had to download and build jmoby to get that library, now i'm going to write a little app to do the downloading
found a better way: Building jmoby gives you a command line client to manipulate the central moby registry
what i did to fetch the wsdl for each biomoby service:
i used the commandline tool to fetch all service names (run-cmdline.sh -ls)
I grepped the actual names out of all the cruft (grep -v '
copied the names file into data/msc/services/biomoby
wrote python script in 'one-offs' called biomoby-wsdl-fetcher that iterates through each name, calls cmdline -wsdl, and writes to a file
put the output of this script (the wsdl docs) into data/msc/services/biomoby
I will probably need more than just the wsdl to do this if the inputs/outputs are semantically annotated
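The biomoby-wsdl-fetcher step, sketched (the run-cmdline.sh path and output layout are mine; exact client arguments may differ per jMoby build):

```python
# For each biomoby service name, ask the jMoby command-line client for its
# WSDL and save it to a file named after the service.
import os
import subprocess

def fetch_wsdl(name, cmdline="./run-cmdline.sh"):
    # assumes the client prints the WSDL for a service to stdout
    result = subprocess.run([cmdline, "-wsdl", name],
                            capture_output=True, text=True, check=True)
    return result.stdout

def fetch_all(names_file, out_dir, cmdline="./run-cmdline.sh"):
    os.makedirs(out_dir, exist_ok=True)
    with open(names_file) as f:
        for name in (line.strip() for line in f):
            if not name:
                continue
            with open(os.path.join(out_dir, name + ".wsdl"), "w") as out:
                out.write(fetch_wsdl(name, cmdline))
```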
biomart
i'm getting a server not found but taverna seems to find the services just fine!
seqhound?
profiling the service inputs and outputs
I wrote a script to parse all the biomoby, taverna_default, and soaplab wsdl files
dev/code/python/one-offs/wsdl_inputs_profiler.py
the script uses SOAPpy to parse the WSDL files
for each method in each WSDL, it retrieves the inputs and makes a schema:type key, which it uses to count
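The counting itself is simple; a minimal sketch of the tally (SOAPpy's WSDL object model aside, this is the namespace|type keying):

```python
# Tally WSDL input parameters as "namespace|typename" keys.
from collections import Counter

def profile(params):
    """params: iterable of (namespace, typename) pairs pulled from WSDLs."""
    return Counter("%s|%s" % (ns, ty) for ns, ty in params)

inputs = [
    ("http://www.w3.org/2001/XMLSchema", "string"),
    ("http://www.w3.org/2001/XMLSchema", "string"),
    ("http://xml.apache.org/xml-soap", "Map"),
]
for key, n in profile(inputs).most_common():
    print(key, n)
# http://www.w3.org/2001/XMLSchema|string 2
# http://xml.apache.org/xml-soap|Map 1
```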
results:
inputs
processing errors
35 documents out of 1664 could not be parsed because of XML parsing and WSDL interpretation errors
I haven't looked into why this is the case yet
vast numbers of strings
http://schemas.xmlsoap.org/soap/encoding/|string 4068
http://www.w3.org/2001/XMLSchema|string 1246
quite a few maps
http://xml.apache.org/xml-soap|Map 956
a smattering of other types
http://www.w3.org/2001/XMLSchema|int 27
SOAP/KEGG|ArrayOfstring 19
http://www.w3.org/2001/XMLSchema|float 6
SOAP/KEGG|ArrayOfint 2
then 1386 "unique" inputs
but each of these is an emboss soaplab service where they've used the url of the service as the schema, and all the types are of the form:
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|waitFor 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|runAndWaitFor 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getStatus 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getResults 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|createAndRun 1
http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|ArrayOf_soapenc_string 1
outputs
http://schemas.xmlsoap.org/soap/encoding/|string 1676
http://xml.apache.org/xml-soap|Map 1195
http://www.w3.org/2001/XMLSchema|string 1170
http://www.w3.org/2001/XMLSchema|long 956
SOAP/KEGG|ArrayOfstring 36
SOAP/KEGG|ArrayOfDefinition 8
urn:BINDSOAP|SearchResultBean 4
SOAP/KEGG|ArrayOfSSDBRelation 4
http://www.ebi.ac.uk/soaplab/emboss4/services/GowlabFactory|ArrayOf_soapenc_string 3
http://www.ebi.ac.uk/soaplab/emboss4/services/AnalysisFactory|ArrayOf_soapenc_string 3
http://www.ebi.ac.uk/collab/mygrid/service1/goviz/GoViz.jws|ArrayOf_xsd_string 3
SOAP/KEGG|ArrayOfStructureAlignment 3
then a bunch of unique soaplab outputs as per inputs
Note that analysis on soaplab services is moot: they all take arbitrary maps as input and return lists of strings as output
discussion
obviously ontologies to classify these inputs and outputs would go a long way towards improving our predictive value here
I haven't profiled the input or output names, but a lot of them seem to be content-poor (e.g. "data", "sequence")
I don't know how ontologies are used in the current Taverna/Moby builds: I'm going to sift code and figure that out now.
directions
profile actual workflows and see how many different services are used and when
I don't know how taverna represents these services internally. It could be that they have richer type information than what I'm retrieving from the WSDLs directly, but this seems unlikely, because the annotations exposed to the user are at least as vague as what I'm seeing in the source
figure out how types are annotated
profiling workflow inputs and outputs
tried to write a script to analyze all scufl workflows
3 kinds of network services being accessed
arbitrary wsdl
This was easy, I just fetch the wsdl from the specified url in the SCUFL and examine the inputs and outputs
soaplab
points at a soaplab url, but if you retrieve it from the web it's just bogus html content
i've already retrieved the actual wsdl files, but there seem to be more specified in the workflows than I have on disk
it's also confusing to retrieve inputs and outputs because of the weird way that soaplab messages are structured -- they're always the same message, "startRun" etc.
anyways this is moot, since any startup operation takes an arbitrary map
biomoby
Haven't started on this yet, but I feel like it will be complicated since you just get the operation name and a pointer to the registry -- perhaps I can look it up locally.
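For the arbitrary-wsdl case above, the extraction step might be sketched like this (the element names processor / arbitrarywsdl / wsdl are my recollection of the SCUFL schema and should be checked against a real workflow file):

```python
# Pull WSDL URLs out of a SCUFL doc by matching on local element names,
# so the scufl namespace prefix doesn't matter.
import xml.etree.ElementTree as ET

def wsdl_urls(scufl_xml):
    root = ET.fromstring(scufl_xml)
    urls = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "wsdl" and elem.text:
            urls.append(elem.text.strip())
    return urls

sample = """
<scufl><processor name="blast">
  <arbitrarywsdl><wsdl>http://example.org/blast?wsdl</wsdl></arbitrarywsdl>
</processor></scufl>
"""
print(wsdl_urls(sample))  # ['http://example.org/blast?wsdl']
```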
meetings
first
ben's project
assume hierarchy, evaluate those
for each class, each instance will demonstrate a pattern
given a set of instances, can we find a pattern for the class
chi defined two metrics
coverage
features that could be associated with genes (co-occurrence of MeSH keywords)
too simplistic
we have no control
proposal
use an ontology that defines phosphatases, because we have an owl file that has restrictions on classes
build classifiers to do it -- ben
todo
get ontology from ben
web service composition (mark's connotea)
taverna
myexperiment.org
carole goble
and the work they're already doing (mark will send papers)
[13] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. Pocock, P. Li, and T. Oinn, "Taverna: a tool for building and running workflows of services," Nucleic Acids Research, 2006.
workflows
they're annotated into ontologies
one approach to pattern finding would be to traverse the ontology upwards and see where common nodes appear
31-Jan-07
brief talk about notes organization
rachel does alignment, experimental data
rachel will send jiang's stuff on experimental analysis
22-May-07
i vent my freakout on Rachel about wanting to switch projects
She suggests that we should be thinking about this problem as culling paths instead of statistically inferring what people will pick next
as a first task, should collect:
all possible service inputs
all possible service outputs
all inputs into workflow instances we have right now
all outputs of workflow instances we have right now
Calculations:
How many inputs are there in total
how many outputs are there in total
what percentage of these are actually used
some other stats that would tell us the order of the number of paths we could possibly have
29-May-07
todo
figure out how biomart and seqhound services are retrieved
profile collections of workflows to see how many services are actually used and how often
write XML parser that does the following
parse each scufl doc
for each processor tag:
if it's a wsdl tag, get the wsdl url from the spec
make a WSDL service stub from the wsdl doc
introspect out the inputs and outputs into a map as before
this might take forever to run...
determine if there are any type annotations on the WSDL specs for services that let us actually know what the types mean, as per BioMoby's claim
there seem to be a lot of technologies used to attach meaning to various aspects of the workflows and processors -- FETA, MOBY, RDF, OWL, etc. I want to be able to delineate what each technology contributes to the behaviour of Taverna, so that I can filter out which ones will be of use in path reduction
Basically I need to know exactly what type information is available beyond WSDL for each type of service so I know what services it is possible to restrict on
also knowing the performance characteristics of retrieving this type information is important
mark is swamped but i can skype, and maybe meet with ben?
write script to convert mindmaps to wiki
google scholar evaluation project
limit between 1950-2005
look up 100-1000
pubmed
inputs: search strings and limits
output: list of pmIDs that were returned by search
second part: input is gold standard PMids and found ones, compute precision, recall, etc.
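The second-part computation, sketched as straight set arithmetic over PMIDs (the example IDs are made up):

```python
# Precision/recall over PubMed IDs: gold standard vs. what the search returned.
def precision_recall(gold, found):
    gold, found = set(gold), set(found)
    tp = len(gold & found)  # true positives: found IDs that are in the gold standard
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(["111", "222", "333", "444"], ["222", "333", "999"])
print(p, r)  # precision ~0.667 (2 of 3 found are correct), recall 0.5 (2 of 4 gold found)
```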
gs
inputs: html doc
intermediate records of stuff that needs to be spewed out
careful: GS sometimes duplicates references, don't give it two checkmarks when this happens!
output: same initial
due 31st