
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org






Slide 0

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
May 14, 2014, CBIIT
Slides: slideshare.net/andrewsu
Citizen Science!


Slide 1

Few genes are well annotated…
Data: NCBI, February 2013


Slide 2

… because the literature is sparsely curated?


Slide 4

311,696 articles (1.5% of PubMed) have been cited by GO annotations


Slide 5

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.


Slide 6

The Long Tail is a prolific source of content
News: Newspapers | Blogs
Video: TV/Hollywood | YouTube
Product reviews: Consumer reports | Amazon reviews
Food reviews: Food critics | Yelp
Talent judging: Olympics | American Idol


Slide 7

Wikipedia is reasonably accurate


Slide 8

Wikipedia has breadth and depth
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008


Slide 9

We can harness the Long Tail of scientists to directly participate in the gene annotation process.


Slide 10

From crowdsourcing to structured data


Slide 11

Filtering, extracting, and summarizing PubMed
Documents
Concepts
Review article


Slide 12

Filtering, extracting, and summarizing PubMed
Documents
Concepts


Slide 13

Wiki success depends on a positive feedback loop
[Figure: cycle linking gene wiki page utility, number of users, and number of contributors; values shown: 100, 1, 200, 2]


Slide 14

10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008


Slide 15

Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011


Slide 16

Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011


Slide 17

A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002


Slide 18

Making the Gene Wiki more computable
Structured annotations
Free text


Slide 19

Filling the gaps in gene annotation
6,319 novel GO annotations
2,147 novel DO annotations


Slide 20

Gene Wiki content improves enrichment analysis
Workflow labels: GO term, gene list, concept recognition, PubMed abstracts, enrichment analysis
Example: axon guidance (GO:0007411), 264 genes
Linked genes through PubMed: 811 articles, P = 1.55E-20


Slide 21

Gene Wiki content improves enrichment analysis
Workflow labels: GO term, gene list, concept recognition, PubMed abstracts, enrichment analysis
Example: muscle contraction (GO:0006936), 87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22E-09
Article counts shown: 251 articles, 87 articles


Slide 22

Gene Wiki content improves enrichment analysis
[Figure: scatter plot of p-value (PubMed only) against p-value (PubMed + GW), with regions labeled "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction is highlighted]
Good BM et al., BMC Genomics, 2011
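
The slides report enrichment P-values but do not say which statistical test produced them. As a rough illustration only, the sketch below uses a one-sided hypergeometric test, a common choice for this kind of analysis. The function name, its parameters, and all counts in the usage lines are assumptions made for the example, not values from the talk.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, annotated: int, selected: int, overlap: int) -> float:
    """One-sided hypergeometric tail P(X >= overlap).

    universe:  total number of genes considered
    annotated: genes in the universe that carry the GO term
    selected:  genes linked to the term through the literature
    overlap:   genes that are both annotated and selected
    """
    if not (0 <= annotated <= universe and 0 <= selected <= universe):
        raise ValueError("annotated and selected must lie between 0 and universe")
    lowest = max(overlap, 0)
    highest = min(annotated, selected)
    total = comb(universe, selected)
    # Sum the probability mass of every overlap at least as large as the observed one.
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(lowest, highest + 1)
    )
    return tail / total


# Toy numbers (hypothetical): adding linked genes turns a null result into a small P-value.
p_without_links = hypergeom_enrichment_p(universe=1000, annotated=50, selected=20, overlap=0)
p_with_links = hypergeom_enrichment_p(universe=1000, annotated=50, selected=20, overlap=8)
```

With no overlap the tail covers every outcome, so the P-value is 1.0, which mirrors the "P = 1.0" case on the muscle contraction slide. Any additional linked genes can only lower it.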


Slide 23

Making the Gene Wiki more computable
Structured annotations
Free text
Analyses


Slide 24

Making the Gene Wiki more computable
Structured annotations
Free text
Databases


Slide 25

Expansion through outreach and incentives


Slide 26

Cardiovascular Gene Wiki Portal
CAMK2D -- CaM kinase II subunit delta
CSRP3 -- Cysteine and glycine-rich protein 3
GJA1 -- Gap junction alpha-1 protein / Connexin-43
MAPK14 -- Mitogen-activated protein kinase 14 / p38-α
MYL7 -- Myosin regulatory light chain 2, atrial isoform
MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform
PECAM1 -- Platelet endothelial cell adhesion molecule / CD31
RYR2 -- Ryanodine receptor 2
ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2
TNNI3 -- Troponin I, cardiac muscle
TNNT2 -- Troponin T, cardiac muscle
Peipei Ping, UCLA


Slide 27

The Long Tail of scientists is a valuable source of information on gene function


Slide 28

From crowdsourcing to structured data


Slide 29

Gene databases are numerous and overlapping
… and hundreds more …


Slide 30

Why is there so much redundancy?
[Figure labels: Users, Requests, Resources, Time, Community development]
BioGPS emphasizes community extensibility


Slide 31

Why do developers define the gene report view?
BioGPS emphasizes user customizability


Slide 32

Community extensibility and user customizability


Slide 33

Utility: A simple and universal plugin interface


Slide 38

Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins


Slide 39

Users: BioGPS has critical mass
[Figure: daily pageviews]


Slide 40

Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains


Slide 41

Gene Annotation Query as a Service
http://mygene.info
High performance: 3M hits/month
Highly scalable: 13k species, 16M genes
Weekly data updates
JSON output
REST interface
Python/R/JS libraries
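
To make the "REST interface" and "JSON output" points concrete, here is a minimal sketch of a gene query using only the Python standard library. The slide gives only the site address. The `/v3/query` path and the `q`, `fields`, `species` and `size` parameters reflect the public API as commonly documented at the time of writing, not anything shown in the talk. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: current versioned API root; the slide only lists http://mygene.info.
BASE_URL = "https://mygene.info/v3"


def build_query_url(query, fields=("symbol", "name", "entrezgene"), species="human", size=5):
    """Build the URL for a gene query such as 'symbol:CDK9'."""
    params = {"q": query, "fields": ",".join(fields), "species": species, "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


def parse_hits(payload):
    """Extract (symbol, Entrez Gene ID) pairs from a JSON query response."""
    data = json.loads(payload)
    return [(hit.get("symbol"), hit.get("entrezgene")) for hit in data.get("hits", [])]


def query_genes(query, **kwargs):
    """Run the query over HTTP (needs network access) and return the parsed hits."""
    with urlopen(build_query_url(query, **kwargs), timeout=10) as response:
        return parse_hits(response.read().decode("utf-8"))
```

Calling `query_genes("symbol:CDK9")` would issue a single GET request. Keeping URL construction and response parsing in separate functions lets both be checked without touching the network.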


Slide 42

The Long Tail of bioinformaticians can collaboratively build a gene portal.


Slide 43

From crowdsourcing to structured data


Slide 44

The biomedical literature is growing fast


Slide 45

Information Extraction
Find mentions of high level concepts in text
Map mentions to specific terms in ontologies
Identify relationships between concepts
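
The first two steps listed above (finding mentions and mapping them to ontology terms) can be sketched with a simple dictionary lookup. This toy is far simpler than a real concept recognizer. The lexicon, the `EX:` identifiers, and the function names are made up for the example and do not correspond to any real ontology. The third step, relationship extraction, is not covered.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int    # character offset where the mention begins
    end: int      # character offset just past the mention
    text: str     # surface text as it appears in the document
    term_id: str  # ontology term the mention maps to


# Hypothetical mini-lexicon: lower-cased surface form -> made-up term ID.
LEXICON = {
    "huntington disease": "EX:0001",
    "hd": "EX:0001",
    "breast cancer": "EX:0002",
}


def find_mentions(text, lexicon=LEXICON):
    """Find lexicon entries in text and map each one to its term ID."""
    mentions = []
    # Try longer surface forms first so a long span wins over a nested short one.
    for surface in sorted(lexicon, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            overlaps = any(match.start() < m.end and m.start < match.end() for m in mentions)
            if not overlaps:
                mentions.append(Mention(match.start(), match.end(), match.group(0), lexicon[surface]))
    return sorted(mentions)


example = "...are associated with Huntington disease ( HD )... HD patients received..."
found = find_mentions(example)
```

Both the full name and the abbreviation resolve to the same term ID, which is the point of the mapping step: different surface forms collapse onto one concept.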


Slide 46

Disease mentions in PubMed abstracts
NCBI Disease corpus
793 PubMed abstracts (100 development, 593 training, 100 test)
12 expert annotators (2 annotate each abstract)
6,900 “disease” mentions
Dogan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.


Slide 47

Four types of disease mentions
Specific Disease: “Diastrophic dysplasia”
Disease Class: “Cancers”
Composite Mention: “prostatic , skin , and lung cancer”
Modifier: ..the “familial breast cancer” gene , BRCA2..
Dogan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.


Slide 48

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?


Slide 49

The Turk
http://en.wikipedia.org/wiki/The_Turk


Slide 51

Amazon Mechanical Turk (AMT)
For each task, specify:
  a qualification test
  how many workers per task
  how much we will pay per task
Manages:
  parallel execution of jobs
  worker access to tasks via qualification tests
  payments
  task advertising
1. Create tasks
2. Execute
3. Aggregate


Slide 52

Instructions to workers
Highlight all diseases and disease abbreviations
  “...are associated with Huntington disease ( HD )... HD patients received...”
  “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
Highlight the longest span of text specific to a disease
  “... contains the insulin-dependent diabetes mellitus locus …”
Highlight disease conjunctions as single, long spans
  “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
Highlight symptoms (physical results of having a disease)
  “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”


Slide 53

Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions


Slide 54

Qualification test results


Slide 55

Simple annotation interface
Click to see instructions
Highlight disease mentions


Slide 56

Experimental design
Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
$0.06 per Human Intelligence Task (HIT)
HIT = annotate one abstract from PubMed
5 workers annotate each abstract


Slide 57

Aggregation function based on simple voting
1 or more votes (K=1)
K=2
K=3
K=4
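
A minimal sketch of vote-based aggregation: a highlighted span is kept if at least K workers marked it. The slide does not say how spans from different workers are matched. This example assumes exact agreement on character offsets, and the function name and data layout are illustrative.

```python
from collections import Counter

# A span is a (start, end) pair of character offsets within one abstract.
Span = tuple[int, int]


def aggregate_by_votes(worker_annotations: list[set[Span]], k: int) -> set[Span]:
    """Keep every span highlighted by at least k workers.

    Each worker contributes a set, so one worker can cast at most one vote per span.
    Assumption: spans count as the same mention only if their offsets match exactly.
    """
    votes = Counter(span for spans in worker_annotations for span in spans)
    return {span for span, count in votes.items() if count >= k}


# Three hypothetical workers annotating the same abstract.
workers = [
    {(0, 18), (40, 42)},
    {(0, 18)},
    {(0, 18), (40, 42), (60, 75)},
]
lenient = aggregate_by_votes(workers, k=1)  # anything any worker highlighted
strict = aggregate_by_votes(workers, k=3)   # only unanimous spans
```

Raising K trades recall for precision: K=1 keeps every highlighted span, while higher thresholds keep only spans that several workers agree on.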


Slide 58

Comparison to gold standard
593 documents
7 days
17 workers
$192.90
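
As a quick check on the cost figure, the worker payments implied by the experimental design ($0.06 per HIT, 5 workers per abstract, 593 abstracts) can be worked out directly. This assumes every abstract was completed by exactly five workers at the listed reward. The slides do not say what accounts for the remaining amount.

```python
from decimal import Decimal

abstracts = 593                    # from the experimental design slide
workers_per_abstract = 5           # from the experimental design slide
reward_per_hit = Decimal("0.06")   # dollars per HIT, from the experimental design slide
reported_total = Decimal("192.90") # dollars, from the results slide

assignments = abstracts * workers_per_abstract  # 2,965 completed HITs under the assumption above
base_payments = assignments * reward_per_hit    # 177.90 dollars in worker rewards
difference = reported_total - base_payments     # 15.00 dollars not explained by the slides
```

Using `Decimal` rather than floats keeps the dollar amounts exact.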


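Slide 59

Comparison to gold standard
[Figure: maximum F-measure against the gold standard for different vote thresholds k and numbers of workers N (N = 3, 6, 9, 12, 15, 18). Max F values shown: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]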
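
For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes all three for a set of aggregated spans against gold-standard spans. The slides do not state the matching criterion. Exact span matching is assumed here, and the function name is illustrative.

```python
def precision_recall_f(predicted: set, gold: set) -> tuple[float, float, float]:
    """Return (precision, recall, F) for predicted spans against gold spans.

    Assumption: a predicted span is correct only if it matches a gold span exactly.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical spans: 2 of 3 predictions are correct, 2 of 4 gold spans are found.
predicted_spans = {(0, 5), (10, 15), (20, 25)}
gold_spans = {(0, 5), (10, 15), (30, 35), (40, 45)}
precision, recall, f_measure = precision_recall_f(predicted_spans, gold_spans)
```

In this toy case precision is 2/3 and recall is 1/2, giving F = 4/7. Sweeping the vote threshold and recomputing F at each setting is one way to obtain a "Max F" value of the kind the slide reports.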


Slide 63

Comparisons to text-mining algorithms


Slide 64

Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76


Slide 65

Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76
Average level of agreement between expert annotators (stage 2): F = 0.87


Slide 66

In aggregate, our worker ensemble is faster and cheaper than a single expert annotator, and as accurate, for disease concept recognition.


Slide 67

Information Extraction
Find mentions of high level concepts in text
Map mentions to specific terms in ontologies
Identify relationships between concepts


Slide 68

Annotating the relationships
“This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
Subject: GENE
Predicate: therapeutic target
Object: DISEASE
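
The subject/predicate/object annotation above can be represented as a small typed record. This is only an illustrative data structure. The class and field names are assumptions, and choosing "hematologic malignancies" as the object is one reading of the quoted passage, which does not single out one disease mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str          # mention as it appears in the passage
    concept_type: str  # e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept

    def as_triple(self) -> tuple[str, str, str]:
        """Flatten the relation to (subject, predicate, object) text."""
        return (self.subject.text, self.predicate, self.obj.text)


# One possible annotation of the quoted passage (the object choice is an assumption).
relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    obj=Concept("hematologic malignancies", "DISEASE"),
)
triple = relation.as_triple()
```

Frozen dataclasses are hashable, so identical relations from different annotators can be counted or stored in sets.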


Slide 69

Citizen Science at Mark2Cure.org


Slide 70

The Long Tail of citizen scientists can collaboratively annotate biomedical text.


Slide 71

Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizarro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Lynn Schriml, U Maryland
Paul Pavlidis, U British Columbia
Peipei Ping, UCLA
Many Wikipedia editors
WP:MCB Project
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/


Slide 72

Related AMT work
[1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions. F = 0.88 compared to the gold standard.
[2] Burger et al. used microtask workers to identify relationships between genes and mutations.
[3] Aroyo & Welty used workers to identify relations between concepts in medical text.
References
[1] Zhai H. et al. (2013) "Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing." J Med Internet Res.
[2] Burger, John, et al. (2014) "Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing." Mitre technical report.
[3] Aroyo, Lora, and Chris Welty. (2013) "Harnessing disagreement in crowdsourcing a relation extraction gold standard." Tech. Rep. RC25371 (WAT1304-058), IBM Research.

