Taming Big Data!

Slide 0

Taming Big Data!
Ian Foster, Argonne National Laboratory and University of Chicago
foster@anl.gov, ianfoster.org

Slide 1

Discovery is an iterative process
Pose question; publish results
Janet Rowley, 1972

Slide 2

Discovery in the big data era: resource-intensive, expensive, slow
Pose question; publish results

Slide 3

Three big data challenges
Channel massive flows
Automate management
Build discovery engines

Slide 4

Three big data challenges
Channel massive flows
Automate management
Build discovery engines

Slide 5

Channel massive data flows
Data must move to be useful. We may optimize, but we can never entirely eliminate distance.
Sources: experimental facilities, sensors, computations
Sinks: analysis computers, display systems
Stores: impedance matchers and time shifters
Pipes: I/O systems and networks that connect the other elements
“We must think of data as a flowing river over time, not a static snapshot. Make copies, share, and do magic” (S. Madhavan)

Slide 6

Transfer is challenging at many levels
Speed and reliability: GridFTP protocol, Globus implementation
Scheduling and modeling: SEAL and STEAL algorithms, RAMSES project

Slide 7

File transfer is an end-to-end problem
Figure: source data store, wide area network, destination data store

Slide 8

File transfer is an end-to-end problem
Figure: source and destination data transfer nodes, each with application, OS, FS stack, HBA/HCA, TCP, IP, NIC, LAN switch, and router, connected by a wide area network; storage array; Lustre file system (OSS, OST, MDS, MDT)
Plus diverse environments, diverse workloads, and contention

Slide 9

GridFTP protocol and implementations: fast, reliable, secure third-party data transfer
Extends the legacy FTP protocol to enhance performance, reliability, and security
Globus GridFTP provides a widely used open source implementation
Modular, pluggable architecture (different protocols, I/O interfaces)
Many optimizations, e.g., concurrency, parallelism, pipelining

Slide 10

85 Gbps sustained disk-to-disk over a 100 Gbps network, Ottawa to New Orleans
Raj Kettimuthu and team, Argonne, Nov 2014

Slide 11

Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks

Slide 12

One Advanced Photon Source data node: 125 destinations

Slide 13

Same node (1 Gbps link)

Slide 14

Slide 15


Slide 16

Transfer scheduling and optimization
Science data traffic is extremely bursty
User experience can be improved by scheduling to minimize slowdown
Traffic can be categorized as interactive or batch
Increased concurrency tends to increase aggregate throughput, to a point
Figures: concurrency over 24 hours (Kettimuthu et al., 2015); throughput vs. concurrency and parallelism (Kettimuthu et al., 2014)

Slide 17

A load-aware, adaptive algorithm: (1) Data-driven model of throughput
Collect many <s, d, cs, cd, v, a> records, e.g., <EP1, EP3, 3, 3, 20GB, 29sec>
Estimate throughput(s, d, cs, cd, v)
Adjust with an estimate of external load
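A minimal sketch of how such a data-driven model might look. It reads the tuple fields as source, destination, concurrency at source, concurrency at destination, volume, and elapsed time; that reading is an assumption based on the example record, since the slide does not define them. The class name `ThroughputModel`, the nearest-concurrency lookup, and the `external_load` fraction are all hypothetical, and the sketch ignores the volume argument when estimating, for simplicity.

```python
from collections import defaultdict
from statistics import mean


class ThroughputModel:
    """Toy data-driven throughput model built from past transfer records."""

    def __init__(self):
        # (s, d, cs, cd) -> list of observed throughputs in GB/s
        self._samples = defaultdict(list)

    def record(self, s, d, cs, cd, volume_gb, seconds):
        """Store one observation <s, d, cs, cd, v, a> as volume / time."""
        self._samples[(s, d, cs, cd)].append(volume_gb / seconds)

    def estimate(self, s, d, cs, cd, external_load=0.0):
        """Estimate GB/s for an endpoint pair, or None if the pair is unseen.

        external_load is a hypothetical fraction in [0, 1] of capacity
        consumed by traffic that the model does not control.
        """
        candidates = [k for k in self._samples if k[0] == s and k[1] == d]
        if not candidates:
            return None
        # Use the observed concurrency setting closest to the requested one.
        nearest = min(
            candidates,
            key=lambda k: (abs(k[2] - cs) + abs(k[3] - cd), k[2], k[3]),
        )
        base = mean(self._samples[nearest])
        load = min(max(external_load, 0.0), 1.0)
        return base * (1.0 - load)


model = ThroughputModel()
model.record("EP1", "EP3", 3, 3, 20, 29)  # the example record from the slide
baseline = model.estimate("EP1", "EP3", 3, 3)
loaded = model.estimate("EP1", "EP3", 3, 3, external_load=0.5)
```

A production model would presumably interpolate across concurrency levels and account for transfer volume; the lookup here only illustrates the collect, estimate, and adjust steps.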

Slide 18

A load-aware, adaptive algorithm: (2) Concurrency-constrained scheduling
Should a new transfer be scheduled?
When scheduling a transfer, with what concurrency?
When should an active transfer be preempted?
When should the concurrency of an active transfer change?
Define transfer priority
Schedule transfers if neither source nor destination is saturated, using the model to decide concurrency
If source or destination is saturated, interrupt active transfer(s) to service waiting requests, if doing so can reduce overall average slowdown
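The two scheduling rules on this slide can be sketched as a single decision function. Everything named here is hypothetical: the `Endpoint` class, the fixed `max_concurrency` threshold used to define saturation, and the `slowdown` formula (wait plus run time over run time, a common definition that the slide does not spell out). The two average-slowdown arguments are assumed to come from the throughput model.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    max_concurrency: int  # assumed saturation threshold
    active: int = 0       # concurrency currently in use

    @property
    def saturated(self):
        return self.active >= self.max_concurrency

    @property
    def free(self):
        return max(self.max_concurrency - self.active, 0)


def slowdown(wait_s, run_s):
    """Assumed definition: time in system relative to pure run time."""
    return (wait_s + run_s) / run_s


def decide(src, dst, preferred_cc, avg_slowdown_now, avg_slowdown_if_preempt):
    """Return (action, concurrency) for a waiting transfer.

    preferred_cc is the concurrency the throughput model would pick;
    the two slowdown values are model-based predictions.
    """
    if not src.saturated and not dst.saturated:
        # Schedule now, limited by the free capacity at both ends.
        return ("schedule", min(preferred_cc, src.free, dst.free))
    if avg_slowdown_if_preempt < avg_slowdown_now:
        # Interrupting an active transfer is predicted to help overall.
        return ("preempt", 0)
    return ("wait", 0)


src = Endpoint("A", max_concurrency=4, active=1)
dst = Endpoint("B", max_concurrency=4, active=3)
full = Endpoint("C", max_concurrency=2, active=2)
first = decide(src, dst, 3, 2.0, 1.5)
second = decide(full, dst, 3, 2.0, 1.5)
third = decide(full, dst, 3, 2.0, 2.5)
```

The sketch leaves out the fourth question on the slide, changing the concurrency of a transfer that is already running.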

Slide 19

Slide 20


Slide 21

Robust analytic models for science at extreme scales
Gagan Agarwal (1)*, Prasanna Balaprakash (2), Ian Foster (2)*, Raj Kettimuthu (2), Sven Leyffer (2), Vitali Morozov (2), Todd Munson (2), Nagi Rao (3)*, Saday Sadayappan (1), Brad Settlemyer (3), Brian Tierney (4)*, Don Towsley (5)*, Venkat Vishwanath (2), Yao Zhang (2)
(1) Ohio State University; (2) Argonne National Laboratory; (3) Oak Ridge National Laboratory; (4) ESnet; (5) UMass Amherst; (*) Co-PIs
Advanced Scientific Computing Research. Program manager: Rich Carlson

Slide 22

How to create more accurate, useful, and portable models of distributed systems?
Simple analytical model: $T = \alpha + \beta l$ [startup cost + sustained bandwidth]
Experiment + regression to estimate $\alpha$, $\beta$
First-principles modeling to better capture details of system and application components
Data-driven modeling to learn unknown details of system and application components
Model composition
Model, data comparison
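As an illustration of the "experiment + regression" step, here is a small sketch that fits the two coefficients of the simple linear model by ordinary least squares. It assumes that l is the amount of data transferred and that the two coefficients are a startup cost and a per-unit transfer time (the inverse of sustained bandwidth). The function name and the synthetic measurements are made up for the example.

```python
def fit_startup_and_rate(sizes, times):
    """Least-squares fit of times = startup + per_unit * sizes.

    Returns (startup, per_unit). Requires at least two distinct sizes.
    """
    n = len(sizes)
    mean_l = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in sizes)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(sizes, times))
    per_unit = sxy / sxx
    startup = mean_t - per_unit * mean_l
    return startup, per_unit


# Synthetic, noise-free measurements: 2 s startup, 0.5 s per GB.
sizes_gb = [1.0, 2.0, 4.0, 8.0]
times_s = [2.0 + 0.5 * l for l in sizes_gb]
startup, per_unit = fit_startup_and_rate(sizes_gb, times_s)
predicted_16gb = startup + per_unit * 16.0
```

With real measurements the residuals show how much the simple model misses, which is the motivation the slide gives for first-principles and data-driven refinements.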

Slide 23

Differential regression for combining data from different sources
Example of use: predict performance on a connection length $L$ not realizable on physical infrastructure, e.g., IB-RDMA or HTCP throughput on a 900-mile connection
Make multiple measurements of performance on path lengths $d$:
$M_S(d)$: OPNET simulation
$M_E(d)$: ANUE-emulated path
$M_U(d_i)$: real network (USN)
Compute measurement regressions on $d$: $\hat{M}_A(\cdot)$, $A \in \{S, E, U\}$
Compute differential regressions: $\Delta\hat{M}_{A,B}(\cdot) = \hat{M}_A(\cdot) - \hat{M}_B(\cdot)$, $A, B \in \{S, E, U\}$
Apply differential regression to obtain estimates, $C \in \{S, E\}$: $\tilde{M}_U(d) = M_C(d) - \Delta\hat{M}_{C,U}(d)$, where $M_C(d)$ is the simulated/emulated measurement and $\Delta\hat{M}_{C,U}(d)$ is the point regression estimate
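The procedure on this slide can be sketched numerically as follows. The sketch assumes straight-line regressions on distance, which the slide does not specify, and uses invented throughput numbers. It fits one regression to simulated measurements and one to real-network measurements, takes their difference at the target distance, and subtracts that difference from a simulated measurement at a distance the real network cannot provide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(line, d):
    intercept, slope = line
    return intercept + slope * d


def differential_estimate(sim_points, real_points, d_target, sim_at_target):
    """Estimate the real-network value at d_target from a simulated one.

    sim_points / real_points: lists of (distance, measurement) pairs.
    sim_at_target: simulated measurement taken at d_target.
    """
    reg_sim = fit_line(*zip(*sim_points))
    reg_real = fit_line(*zip(*real_points))
    # Differential regression: simulated minus real, evaluated at d_target.
    diff = predict(reg_sim, d_target) - predict(reg_real, d_target)
    return sim_at_target - diff


# Invented data: throughput (Gbps) vs. path length (miles).
distances = [100, 200, 300, 400]
sim = [(d, 10.0 - 0.004 * d) for d in distances]
real = [(d, 9.0 - 0.005 * d) for d in distances]
estimate_900 = differential_estimate(sim, real, 900, 10.0 - 0.004 * 900)
```

The same function applies unchanged when the emulated path stands in for the simulation.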

Slide 24

End-to-end profile composition
Source LAN profile: configuration for host and edge devices
WAN profile: configuration for WAN devices
Destination LAN profile: configuration for host and edge devices
Profiles are combined by composition operations

Slide 25

Three big data challenges
Channel massive flows
Automate management
Build discovery engines

Slide 26

It should be trivial to collect, move, sync, share, analyze, annotate, publish, search, back up, and archive BIG DATA … but in reality it’s often very challenging
Figure: registry, staging store, ingest store, analysis store, community store, archive, mirror

Slide 27

One researcher’s perspective on data management challenges

Slide 28

Slide 29

TripIt exemplifies process automation
Steps over time: book flights, book hotel, record flights, suggest hotel, record hotel, get weather, prepare maps, share info, monitor prices, monitor flight
Actors: me, other services

Slide 30

How the “business cloud” works
Infrastructure services: computing, storage, networking
Elastic capacity
Multiple availability zones

Slide 31

Process automation for science
Steps over time: run experiment, collect data, move data, check data, annotate data, share data, find similar data, link to literature, analyze data, publish data
Automate and outsource: the Discovery cloud

Slide 32

Globus research data management services (www.globus.org)
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting
Figure: next-gen genome sequencer, telescope, simulation; staging, ingest, analysis, community repository, archive, mirror

Slide 33

Reliable, secure, high-performance file transfer and synchronization
“Fire-and-forget” transfers
Automatic fault recovery
Seamless security integration
Powerful GUI and APIs
Figure: data source, data destination

Slide 34

Simple, secure sharing off existing storage systems
Easily share large data with any user or group
No cloud storage required
Figure: data source

Slide 35

Extreme ease of use
InCommon, OAuth, OpenID, X.509, …
Credential management
Group definition and management
Transfer management and optimization
Reliability via transfer retries
Web interface, REST API, command line
One-click “Globus Connect Personal” install
5-minute Globus Connect Server install

Slide 36

Slide 37


Slide 38

High-speed transfers to/from AWS cloud, via Globus transfer service
Between UChicago and AWS S3 (US region): sustained 2 Gbps
2 GridFTP servers, GPFS file system at UChicago
Multi-part upload via 16 concurrent HTTP connections
Within AWS (same region): sustained 5 Gbps
go#s3

Slide 39

Globus transfer and sharing; identity and group management; data discovery and publication
25,000 users; 75 PB and 3B files transferred; 8,000 endpoints
Figure: Globus endpoints

Slide 40

Globus under the covers
Identity, group, profile management services
…
Sharing service
Transfer service
Globus Toolkit
Globus Connect

Slide 41

Globus under the covers
Identity, group, profile management services
Sharing service
Transfer service
Publication and discovery
Globus Toolkit
Globus Connect

Slide 42


Slide 43

Globus Platform-as-a-Service
Identity, group, profile management services
Sharing service
Transfer service
Publication and discovery
Globus APIs
Globus Toolkit
Globus Connect

Slide 44

The Globus Galaxies platform: science as a service
Ematter (materials science), FACE-IT

Slide 45

Three big data challenges
Channel massive flows
Automate management
Build discovery engines

Slide 46

Discovery engines: Integrate simulation, experiment, and informatics

Slide 47

A discovery engine for metagenomics
metagenomics.anl.gov

Slide 48


Slide 49

DOE Systems Biology Knowledge Base (KBase)
Source: Rick Stevens

Slide 50

Slide 51

A discovery engine for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Figure: sample; experimental scattering; material composition (La 60%, Sr 40%); simulated structure; simulated scattering
Detect errors (secs to mins)
Select experiments (mins to hours)
Simulations driven by experiments (mins to days)
Knowledge base: past experiments; simulations; literature; expert knowledge
Contribute to knowledge base
Knowledge-driven decision making
Evolutionary optimization

Slide 52

Immediate assessment of alignment quality in near-field high-energy diffraction microscopy
Figure: before and after
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer

Slide 53

Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and methods create opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against 'all' relevant data
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data

Slide 54

Big Data to Knowledge: bd2k.org

Slide 55

Three big data challenges
Channel massive flows: new protocols and management algorithms
Automate management: the Discovery Cloud
Build discovery engines: MG-RAST, KBase, materials

Slide 56

My work is supported by:

Slide 57

Thank you!
foster@anl.gov, ianfoster.org