  • M.Sc projects
    • taverna project
      • reading
        • basics
          • XML
            • two major uses of XML
              • data-centric
                • describing structured information that is generally meant for machine consumption
              • document-centric
                • attaching metadata to information intended for human consumption
            • XML instances
              • XML docs have an optional 'prologue' to:
                • identify the doc as XML
                • include comments about the doc
                • include metadata about document content
              • Processing instructions
                • directives to the application that will process the document, enclosed in <?PITarget ... ?> tags
                • the important part there is the ?, not the name 'PITarget', which can be anything
                • common one:
              • prologue is followed by root element
              • elements
                • can have 3 different content types
                  • element-only
                    • consists of nested elements
                  • mixed
                    • any combination of nested elements and text
                  • empty content
                    • just a start tag and an end tag, or <tag/> as shorthand
              • namespaces
                • you often compose XML docs, and if they have similarly named tags you get name collisions
                • namespaces let you define a prefix for your tags to make them unique
                • eg xmlns:my_prefix="http://namespace_url/"
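A minimal sketch of the collision problem described above, using Python's standard xml.etree.ElementTree. The example.org namespace URIs and the element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a <title> tag; the prefixes keep them apart.
# The example.org URIs are placeholders, not real schemas.
DOC = """
<library xmlns:book="http://example.org/book"
         xmlns:person="http://example.org/person">
  <book:title>XML Basics</book:title>
  <person:title>Dr.</person:title>
</library>
"""

root = ET.fromstring(DOC.strip())
ns = {"book": "http://example.org/book", "person": "http://example.org/person"}

book_title = root.find("book:title", ns).text
person_title = root.find("person:title", ns).text

# ElementTree stores each tag in expanded form: {namespace-uri}localname
expanded_tags = [child.tag for child in root]
```

The prefix itself is only shorthand: what makes the two title tags distinct is the namespace URI each prefix is bound to.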
              • DTDs
                • separate metalanguage for describing valid documents in a set
                • suck
              • xml schemas
                • xml schema data usually in xsd namespace
                • xsi is a namespace that has resources for instance operations
                • e.g. you can use xsi:schemaLocation to point to a URL that has the schema
                • schema spec has all sorts of builtin types you can use to annotate data elements with
                • and you can define your own by deriving from a base type and restricting along several 'facets'
              • parsing
                • SAX
                  • push-parsing
                • pull parsing
                • one-shot parsing
                  • read whole doc into DOM
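The three parsing styles listed above can be compared side by side with the Python standard library. This is only a sketch on a made-up document: the element and attribute names are illustrative, and ElementTree's iterparse stands in for a pull parser.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = "<services><service name='a'/><service name='b'/></services>"

# 1. Push parsing (SAX): the parser drives and calls our handler per event.
class ServiceHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == "service":
            self.names.append(attrs["name"])

handler = ServiceHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

# 2. Pull parsing: our code drives and asks for the next event when ready.
pull_names = [
    elem.get("name")
    for _event, elem in ET.iterparse(io.BytesIO(DOC.encode("utf-8")), events=("start",))
    if elem.tag == "service"
]

# 3. One-shot parsing: read the whole document into a DOM tree, then query it.
dom = minidom.parseString(DOC)
dom_names = [node.getAttribute("name") for node in dom.getElementsByTagName("service")]
```

All three produce the same list of names; the difference is who controls the loop and how much of the document is held in memory at once.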
          • SOAP
            • is a messaging protocol for XML
            • XML protocols have had a first and second generation
              • first generation are pure XML
              • second generation also use XML namespaces and XML schema
            • basically like a ProtocolBuffer except raw XML, used for communicating between services
            • Problem with 1st gen is same problem as protocolbuffer -- whenever you want to add a new datatype to your packets, you have to revise the standards and everyone has to update their parsers
            • second gen protocols try to build extensibility into the protocol itself
            • also, they try to include semantic information along with the data so you can understand the context in which the data is being used
            • SOAP has a bunch of optional headers and then a mandatory body, both enclosed in a root 'envelope' element
            • this means you can recognize SOAP docs really quickly, but you need a separate validation procedure for embedded XML (i.e. the SOAP doc by itself won't validate, you need to build custom SOAP validation into the web server.)
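A rough sketch of the envelope layout described above: optional headers, a mandatory body, both inside a root Envelope element. It assumes the SOAP 1.1 envelope namespace. The header and payload elements (trace, getStatus, jobId) and their example.org namespaces are invented for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

# Hypothetical message: the header and payload contents are made up.
MESSAGE = f"""
<soapenv:Envelope xmlns:soapenv="{SOAP_ENV}">
  <soapenv:Header>
    <trace xmlns="http://example.org/trace">abc123</trace>
  </soapenv:Header>
  <soapenv:Body>
    <getStatus xmlns="http://example.org/jobs"><jobId>42</jobId></getStatus>
  </soapenv:Body>
</soapenv:Envelope>
"""

def split_soap(text):
    """Return (header_children, body_children); raise ValueError if not SOAP."""
    root = ET.fromstring(text.strip())
    # Recognizing a SOAP doc only requires looking at the root element.
    if root.tag != f"{{{SOAP_ENV}}}Envelope":
        raise ValueError("root element is not a SOAP Envelope")
    body = root.find(f"{{{SOAP_ENV}}}Body")
    if body is None:
        raise ValueError("SOAP Body is mandatory")
    header = root.find(f"{{{SOAP_ENV}}}Header")  # headers are optional
    headers = list(header) if header is not None else []
    return headers, list(body)

headers, payload = split_soap(MESSAGE)
payload_tags = [el.tag for el in payload]
```

Note that this only checks the envelope structure. Validating the embedded payload against its own schema is the separate step the notes mention.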
          • WSDL
            • portType
              • describes interfaces of operations supported by services
              • the 'what'
            • binding
              • details of how elements in abstract interface are converted into a concrete representation
              • the 'how'
            • port
              • how a binding is deployed at a particular network endpoint (where the service is, basically)
            • service
              • bag of operations that compose one service
            • see salcentral.com or xmethods.net for examples
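To make the what/how/where split concrete, here is a sketch that pulls the portType, binding and port out of a stripped-down WSDL 1.1 document. The Blast service, its names and the example.org URLs are invented, and the messages and operation details a complete WSDL would carry are left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

# Hypothetical, deliberately incomplete WSDL: just enough to show the pieces.
WSDL = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:tns="http://example.org/blast"
             targetNamespace="http://example.org/blast">
  <portType name="BlastPortType">
    <operation name="runBlast"/>
    <operation name="getResult"/>
  </portType>
  <binding name="BlastSoapBinding" type="tns:BlastPortType">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="BlastService">
    <port name="BlastPort" binding="tns:BlastSoapBinding">
      <soap:address location="http://example.org/services/blast"/>
    </port>
  </service>
</definitions>
"""

root = ET.fromstring(WSDL.strip())

# portType: the 'what' (abstract operations)
operations = {
    pt.get("name"): [op.get("name") for op in pt.findall("wsdl:operation", NS)]
    for pt in root.findall("wsdl:portType", NS)
}
# binding: the 'how' (which portType is given a concrete representation)
bindings = {b.get("name"): b.get("type") for b in root.findall("wsdl:binding", NS)}
# port: the 'where' (network endpoint for a binding)
endpoints = {
    p.get("name"): p.find("soap:address", NS).get("location")
    for p in root.findall("wsdl:service/wsdl:port", NS)
}
```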
      • notes
        • taverna
          • service registry
            • it's in default services section of /Users/Debo/Library/Application Support/Taverna/conf/mygrid.properties
            • However, this file isn't used by default. The default lookup goes to a properties file in /Users/Debo/Library/Application%20Support/Taverna/repository/uk/org/mygrid/taverna/taverna-core/1.5.1/taverna-core-1.5.1.jar!/mygrid.properties, as identified in the user-editable properties file
            • the default differs from the user-exposed one in a couple of ways: It comments out some services that have died, and adds a couple of other ones.
        • feasibility experiments
          • retrieving the service specs for all services
            • looked up service registry locations in service registry spec (mygrid.properties)
            • soaplab
              • can't use straight wget because of robots.txt (stupid)
              • wget the services file that has html with links to wsdl embedded
              • used 'grep http' on the services file to grep out the html that holds these wsdl links
              • wrote perl one-off to extract the actual wsdl links from the html (linkgrabber.pl)
              • dumped wsdl links to file
              • used the dumped wsdl links as list of target urls for my python 'wget_simple' to retrieve all soaplab wsdl files
              • put wsdl files in data/msc/services/soaplab
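The grep and linkgrabber.pl steps above could be collapsed into one small Python script along these lines. This is a sketch, not the original Perl: it assumes the WSDL links are plain anchor tags whose href ends in ?wsdl or .wsdl, and the sample HTML and example.org URLs are made up.

```python
from html.parser import HTMLParser

class WsdlLinkGrabber(HTMLParser):
    """Collect href values of <a> tags that look like WSDL links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # ASSUMPTION: WSDL links end in "?wsdl" or ".wsdl"
        if href.lower().endswith(("?wsdl", ".wsdl")):
            self.links.append(href)

def extract_wsdl_links(html):
    grabber = WsdlLinkGrabber()
    grabber.feed(html)
    return grabber.links

# Hypothetical services page standing in for the real soaplab listing.
SAMPLE = """
<html><body>
<a href="http://example.org/soaplab/services/edit.seqret?wsdl">seqret</a>
<a href="http://example.org/soaplab/index.html">home</a>
<a href="http://example.org/soaplab/services/display.showdb?wsdl">showdb</a>
</body></html>
"""
links = extract_wsdl_links(SAMPLE)
```

The resulting list can be written one URL per line to a file and used as the target list for the download step.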
            • taverna defaults
              • there was a short list of them, so I just hand-copied the links and wget'd each one individually
            • biomoby
              • having problems, it's a script so i can't view source and wget is being a bitch, will fix
              • ok, so that script implements the biomoby central registry API. You have to interact with it programmatically
              • The Central class of the biomoby java api permits interactions with this script, I'm going to try to build and run that library right now
              • had to download and build jmoby to get that library, now i'm going to write a little app to do the downloading
              • found a better way: Building jmoby gives you a command line client to manipulate the central moby registry
              • what i did to fetch the wsdl for each biomoby service:
                • i used the commandline tool to fetch all service names (run-cmdline.sh -ls)
                • I grepped out the actual names from all the cruft (grep -v '
                  
                • copied the names file into data/msc/services/biomoby
                • wrote python script in 'one-offs' called biomoby-wsdl-fetcher that iterates through each name and calls cmdline -wsdl  and writes to a file
                • put the output of this script (the wsdl docs) into data/msc/services/biomoby
                • I will probably need more than just the wsdl to do this if the inputs/outputs are semantically annotated
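A sketch of what a fetcher like biomoby-wsdl-fetcher might look like. The notes do not record the exact argument form of the -wsdl call, so the default command below (run-cmdline.sh -wsdl followed by the service name, with the WSDL printed on stdout) is an assumption. The demo swaps in a fake runner so nothing depends on jMoby being installed.

```python
import subprocess
import tempfile
from pathlib import Path

def fetch_wsdls(names, out_dir, run=None):
    """Write one .wsdl file per service name; return the paths written."""
    if run is None:
        # ASSUMPTION: the jMoby command-line client prints the WSDL to stdout
        # when called as "run-cmdline.sh -wsdl <name>"; adjust to the real form.
        def run(name):
            result = subprocess.run(
                ["./run-cmdline.sh", "-wsdl", name],
                capture_output=True, text=True, check=True,
            )
            return result.stdout
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank lines left over in the names file
        path = out_dir / f"{name}.wsdl"
        path.write_text(run(name), encoding="utf-8")
        written.append(path)
    return written

# Demo with a fake runner and hypothetical service names.
with tempfile.TemporaryDirectory() as tmp:
    paths = fetch_wsdls(
        ["getFoo", "", "getBar\n"], tmp,
        run=lambda n: f"<definitions name='{n}'/>",
    )
    written_names = [p.name for p in paths]
    first_content = paths[0].read_text(encoding="utf-8")
```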
            • biomart
              • i'm getting a server not found but taverna seems to find the services just fine!
            • seqhound?
          • profiling the service inputs and outputs
            • I wrote a script to parse all the biomoby, taverna_default, and soaplab wsdl files
              • dev/code/python/one-offs/wsdl_inputs_profiler.py
              • the script uses SOAPpy to parse the WSDL files
              • for each method in each WSDL, it retrieves the inputs and makes a schema:type key, which it uses to count
              • results:
                • inputs
                  • processing errors
                    • 35 documents out of 1664 could not be parsed because of XML parsing and WSDL interpretation errors
                    • I haven't looked into why this is the case yet
                  • vast numbers of strings
                    • http://schemas.xmlsoap.org/soap/encoding/|string   4068 
                      http://www.w3.org/2001/XMLSchema|string   1246 
                  • quite a few maps
                    • http://xml.apache.org/xml-soap|Map   956
                  • a smattering of other types
                    • http://www.w3.org/2001/XMLSchema|int   27 
                      SOAP/KEGG|ArrayOfstring   19 
                      http://www.w3.org/2001/XMLSchema|float   6 
                      SOAP/KEGG|ArrayOfint   2
                  • then 1386 "unique" inputs
                    • but each of these is an emboss soaplab service where they've used the url of the service as the schema, and all the types are of
                      • http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|waitFor   1 
                        http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|runAndWaitFor   1 
                        http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getStatus   1 
                        http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|getResults   1 
                        http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|createAndRun   1 
                        http://www.ebi.ac.uk/soaplab/emboss4/services/acd.acdpretty|ArrayOf_soapenc_string   1
                • outputs
                  • http://schemas.xmlsoap.org/soap/encoding/|string   1676 
                    http://xml.apache.org/xml-soap|Map   1195 
                    http://www.w3.org/2001/XMLSchema|string   1170 
                    http://www.w3.org/2001/XMLSchema|long   956 
                    SOAP/KEGG|ArrayOfstring   36 
                    SOAP/KEGG|ArrayOfDefinition   8 
                    urn:BINDSOAP|SearchResultBean   4 
                    SOAP/KEGG|ArrayOfSSDBRelation   4 
                    http://www.ebi.ac.uk/soaplab/emboss4/services/GowlabFactory|ArrayOf_soapenc_string   3 
                    http://www.ebi.ac.uk/soaplab/emboss4/services/AnalysisFactory|ArrayOf_soapenc_string   3 
                    http://www.ebi.ac.uk/collab/mygrid/service1/goviz/GoViz.jws|ArrayOf_xsd_string   3 
                    SOAP/KEGG|ArrayOfStructureAlignment   3 
                  • then a bunch of unique soaplab outputs as per inputs
                • Note that analysis on soaplab services is moot: They all take arbitrary maps as input and return lists of strings as output
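The schema:type counting can be sketched with the standard library alone. This is not the SOAPpy-based wsdl_inputs_profiler.py: it is a simplified stand-in that counts every typed message part in a WSDL without separating inputs from outputs, and it assumes each namespace prefix is bound only once per document. The sample WSDL is made up.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

def part_type_keys(wsdl_text):
    """Yield 'namespace|type' for every typed message part in one WSDL document."""
    prefixes = {}
    root = None
    stream = io.BytesIO(wsdl_text.strip().encode("utf-8"))
    for event, item in ET.iterparse(stream, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = item
            prefixes[prefix] = uri  # ASSUMPTION: prefixes are never rebound
        elif root is None:
            root = item  # first "start" event is the root element
    for part in root.iter(f"{{{WSDL_NS}}}part"):
        qname = part.get("type")
        if qname is None:
            continue
        prefix, _, local = qname.rpartition(":")
        # Fall back to the raw prefix if it was never declared.
        yield f"{prefixes.get(prefix, prefix)}|{local}"

# Hypothetical WSDL fragment with three typed message parts.
SAMPLE = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:apachesoap="http://xml.apache.org/xml-soap">
  <message name="runRequest">
    <part name="inputs" type="apachesoap:Map"/>
  </message>
  <message name="getStatusRequest">
    <part name="jobId" type="xsd:string"/>
  </message>
  <message name="getStatusResponse">
    <part name="status" type="xsd:string"/>
  </message>
</definitions>
"""

counts = Counter()
for doc in [SAMPLE]:  # in practice: the text of each retrieved WSDL file
    counts.update(part_type_keys(doc))
```

Wrapping the per-file call in a try/except for parse errors would give the count of unparseable documents as a by-product.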
              • discussion
                • obviously ontologies to classify these inputs and outputs would go a long way towards improving our predictive value here
                • I haven't profiled the input or output names, but a lot of them seem to be content-poor (e.g. "data", "sequence")
                • I don't know how ontologies are used in the current Taverna/Moby builds: I'm going to sift code and figure that out now.
              • directions
                • profile actual workflows and see how many different services are used and when
                • I don't know how taverna represents these services internally. it could be that they have richer type information than what I'm retrieving from the WSDLs directly, but this seems unlikely because the annotations exposed to the user are at least as vague as what I'm seeing in the source
                • figure out how types are annotated
          • profiling workflow inputs and outputs
            • tried to write a script to analyze all scufl workflows
              • 3 kinds of network services being accessed
                • arbitrary wsdl
                  • This was easy, I just fetch the wsdl from the specified url in the SCUFL and examine the inputs and outputs
                • soaplab
                  • points at a soaplab url, but if you retrieve it from the web it's just bogus html content
                  • i've already retrieved the actual wsdl files, but there seem to be more being specified than i have on me
                  • it's also confusing to retrieve inputs and outputs because of the weird way that soaplab messages are structured -- they're always the same messages ("startRun" etc.)
                  • anyways this is moot, since any startup operation takes an arbitrary map
                • biomoby
                  • Haven't started on this yet, but I feel like it will be complicated since you just get the operation name and a pointer to the registry -- perhaps I can look it up locally.
      • meetings
        • first
          • ben's project
            • assume hierarchy, evaluate those
            • for each class, each instance will demonstrate a pattern
            • given a set of instances, can we find a pattern for the class
            • chi defined two metrics
              • coverage
                • features that could be associated with genes (co-occurrence of MeSH keywords)
                • too simplistic
                • we have no control
              • proposal
                • use ontology that defines phosphatases because we have an OWL file that has restrictions on classes
                • build classifiers to do it -- ben
          • todo
            • get ontology from ben
            • web service composition (mark's connotea)
            • taverna
            • myexperiment.org
            • carol goble
              • and the work they're already doing (mark will send papers)
              • [13]   D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. Pocock, P. Li, and T. Oinn, "Taverna: a tool for building and running workflows of services," Nucleic Acids Research, 2006.
              • N. Zhang, L. Yao, A. Nenadic, J. Chin, C. Goble, A. Rector, D. Chadwick, S. Otenko, and Q. Shi, "Achieving Fine-grained Access Control in Virtual Organisations."
          • workflows
            • they're annotated into ontologies
            • one approach to pattern finding would be to traverse the ontology upwards and see where the common nodes are
        • 31-Jan-07
          • brief talk about notes organization
          • rachel does alignment, experimental data
          • rachel will send jiang's stuff on experimental analysis
        • 22-May-07
          • i vent my freakout on Rachel about wanting to switch projects
          • She suggests that we should be thinking about this problem as culling paths instead of statistically inferring what people will pick next
          • as a first task, should collect:
            • all possible service inputs
            • all possible service outputs
            • all inputs into workflow instances we have right now
            • all outputs of workflow instances we have right now
          • Calculations:
            • How many inputs are there in total
            • how many outputs are there in total
            • what percentage of these are actually used
            • some other stats that would tell us what the order of the number of paths we could possibly have is
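The first three calculations reduce to set arithmetic once the inputs (or outputs) are collected as type keys. A minimal sketch, with made-up keys standing in for the real schema|type values:

```python
def usage_stats(available, used):
    """Totals and percentage of the available types that actually appear in workflows."""
    available, used = set(available), set(used)
    in_use = available & used
    percent = 100.0 * len(in_use) / len(available) if available else 0.0
    return {"total": len(available), "used": len(in_use), "percent_used": percent}

# Hypothetical data: all service input types vs. inputs seen in workflow instances.
service_inputs = {"xsd|string", "xsd|int", "apache|Map", "KEGG|ArrayOfstring"}
workflow_inputs = ["xsd|string", "xsd|string", "apache|Map"]

stats = usage_stats(service_inputs, workflow_inputs)
```

The same function applies unchanged to outputs.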
        • 29-May-07
      • todo
        • figure out how biomart and seqhound services are retrieved
        • profile collections of workflows to see how many services are actually used and how often
          • write XML parser that does the following
            • parse each scufl doc
                  for each processor tag
                      if it's a wsdl tag from the spec
                          get the wsdl url
                          make a WSDL service stub from the wsdl doc
                          introspect out the inputs and outputs into a map as before
              
              this might take forever to run...
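The first half of the plan above (find the WSDL processors and their URLs) might look like the sketch below. The SCUFL namespace and the arbitrarywsdl, wsdl and operation element names are written from memory of Taverna 1.x files and should be treated as assumptions to check against a real workflow. The sample document and its example.org URL are invented. Building a service stub and introspecting its inputs and outputs is left out.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace and element names as used by Taverna 1.x SCUFL files.
SCUFL_NS = {"s": "http://org.embl.ebi.escience/xscufl/0.1alpha"}

def wsdl_processors(scufl_text):
    """Return (processor name, wsdl url, operation) for each WSDL processor."""
    root = ET.fromstring(scufl_text.strip())
    found = []
    for proc in root.findall("s:processor", SCUFL_NS):
        wsdl = proc.find("s:arbitrarywsdl", SCUFL_NS)
        if wsdl is None:
            continue  # not a plain WSDL processor (soaplab, biomoby, ...)
        url = wsdl.findtext("s:wsdl", default="", namespaces=SCUFL_NS).strip()
        operation = wsdl.findtext("s:operation", default="", namespaces=SCUFL_NS).strip()
        found.append((proc.get("name"), url, operation))
    return found

# Hypothetical workflow with one WSDL processor and one soaplab processor.
SAMPLE = """
<s:scufl xmlns:s="http://org.embl.ebi.escience/xscufl/0.1alpha" version="0.2">
  <s:processor name="lookup">
    <s:arbitrarywsdl>
      <s:wsdl>http://example.org/services/lookup?wsdl</s:wsdl>
      <s:operation>getEntry</s:operation>
    </s:arbitrarywsdl>
  </s:processor>
  <s:processor name="seqret">
    <s:soaplabwsdl>http://example.org/soaplab/services/edit.seqret</s:soaplabwsdl>
  </s:processor>
</s:scufl>
"""
processors = wsdl_processors(SAMPLE)
```

Caching each fetched WSDL by URL would help with the running-time worry, since many workflows point at the same services.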
              
        • determine if there are any type annotations on the WSDL specs for services that let us actually know what the types mean, as per BioMoby's claim
        • there seem to be a lot of technologies used to attach meaning to various aspects of the workflows and processors -- FETA, MOBY, RDF, OWL, etc. I want to be able to delineate what each technology contributes to the behaviour of Taverna, so that I can filter out which ones will be of use in path reduction
        • Basically I need to know exactly what type information is available beyond WSDL for each type of service so I know what services it is possible to restrict on
        • also knowing the performance characteristics of retrieving this type information is important
        • mark is swamped but i can skype, and maybe meet with ben?
        • write script to convert mindmaps to wiki
    • google scholar evaluation project
      • limit between 1950-2005
      • look up 100-1000
      • pubmed
        • inputs: search strings and limits
        • output: list of pmIDs that were returned by search
        • second part: input is gold standard PMids and found ones, compute precision, recall, etc.
      • gs
        • inputs: html doc
        • intermediate records of stuff that needs to be spewed out
        • careful: GS sometimes duplicates references, don't give it two checkmarks when this happens!
        • output: same initial
      • due 31st
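The scoring step for either search engine could be sketched as follows. Treating the results as a set means a reference that Google Scholar lists twice is only credited once. F1 is included as one possible reading of the "etc." and is an assumption, as are the sample IDs.

```python
def evaluate(gold_pmids, found_pmids):
    """Precision, recall and F1 of a result list against a gold standard."""
    gold = set(gold_pmids)
    found = set(found_pmids)  # a set collapses duplicated references
    hits = gold & found
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical IDs; "222" is returned twice but counted once.
gold = ["111", "222", "333", "444"]
found = ["111", "222", "222", "999"]
scores = evaluate(gold, found)
```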
Topic revision: r1 - 2007-06-06 - MichaelDiBernardo
 